
image

  • Finite Markov Decision Processes
  • Introduction to Reinforcement Learning
  • Q-learning, Deep Q-Networks
  • Policy Learning


Finite Markov Decision Processes


Markov Decision Process (MDP)

  • Set of states S
  • Set of actions A
  • State transition probabilities $p(s' \mid s, a)$: the probability distribution over next states when action $a$ is taken in state $s$
  • Discount factor $\gamma \in [0, 1]$
  • Reward function $R: S \times A \to \mathbb{R}$
  • For simplicity, assume discrete rewards
  • Finite MDP if both S and A are finite (a minimal code sketch of this structure follows the list)
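
As a rough illustration (not from the original slides), a finite MDP can be stored as a few dictionaries keyed by (state, action) pairs; the class and method names below are just one possible choice:

```python
from dataclasses import dataclass


@dataclass
class FiniteMDP:
    """A finite MDP <S, A, P, R> with discount factor gamma."""
    states: set         # finite set of states S
    actions: set        # finite set of actions A
    transitions: dict   # P[(s, a)] -> {s': probability}, i.e. p(s' | s, a)
    rewards: dict       # R[(s, a)] -> immediate reward for action a in state s
    gamma: float = 0.9  # discount factor in [0, 1]

    def next_state_dist(self, s, a):
        """Distribution over successor states, p(. | s, a)."""
        return self.transitions[(s, a)]

    def reward(self, s, a):
        """Immediate reward R(s, a); 0 for unspecified pairs."""
        return self.rewards.get((s, a), 0.0)
```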


Example: What SEQUENCE of actions should our agent take?

  • Each action costs -1/25
  • The agent can take actions N, E, S, W
  • It faces uncertainty (stochastic transitions) in every state


image


MDP Tuple: <S, A, P, R>

  • S: State of the agent on the grid, e.g., (4,3)
    • Note that cells are denoted by (x, y)
  • A: Actions of the agent, i.e., N, E, S, W
  • P: Transition function
    • Table P(s' | s, a): probability of reaching s' after taking action a in state s
    • E.g., P((4,3) | (3,3), N) = 0.1
    • E.g., P((3,2) | (3,3), N) = 0.8 (both entries appear in the code sketch after this list)
    • (Robot movement, uncertainty of another agent’s actions,…)
  • R: Reward (more comments on the reward function later)
    • R((3,3), N) = -1/25
    • R((4,1)) = +1
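
As a concrete (and partly assumed) encoding of the grid-world numbers above: the 0.8/0.1 entries and the -1/25 step cost come from the slide, while the third successor (2,3) is my assumption so the probabilities sum to 1.

```python
# Partial transition table: P[(s, a)] = {s': probability}.
P = {
    ((3, 3), "N"): {(3, 2): 0.8, (4, 3): 0.1, (2, 3): 0.1},  # (2,3) assumed
}

# Partial reward table: taking N in (3,3) costs 1/25.
# The +1 at (4,1) is listed on the slide as a state reward; how it is keyed
# depends on the reward convention you adopt.
R = {
    ((3, 3), "N"): -1 / 25,
}

# Sanity check: the successor probabilities form a distribution.
assert abs(sum(P[((3, 3), "N")].values()) - 1.0) < 1e-9
```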


Markov Assumption

  • Markov Assumption: Transition probabilities (and rewards) from any given state depend only on that state, not on the previous history
  • Where you end up after an action depends only on the current state
    • Named after the Russian mathematician A. A. Markov (1856-1922)
    • (He did not, however, come up with Markov decision processes)
    • E.g., transitions out of state (1,2) do not depend on earlier states such as (1,1)


Non-Optimal vs. Optimal Policy

image


  • Choose Red policy or Yellow policy?
  • Choose Red policy or Blue policy?

  • Which is optimal (if any)?
    • Value iteration: one popular algorithm for computing the optimal policy (a sketch follows below)
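
A minimal sketch of value iteration over the `FiniteMDP` structure sketched earlier (helper names are mine, not from the slides). It repeatedly applies the Bellman optimality backup $V(s) \leftarrow \max_a \big[R(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V(s')\big]$ until the values stop changing, then reads off a greedy policy; it assumes every (state, action) pair has a transition entry and ignores terminal-state handling.

```python
def value_iteration(mdp, tol=1e-6):
    """Return (V, policy) for a FiniteMDP via value iteration."""
    V = {s: 0.0 for s in mdp.states}

    def q(s, a):
        # One-step lookahead: R(s, a) + gamma * E[V(s')].
        return mdp.reward(s, a) + mdp.gamma * sum(
            p * V[s2] for s2, p in mdp.next_state_dist(s, a).items()
        )

    while True:
        delta = 0.0
        for s in mdp.states:
            best = max(q(s, a) for a in mdp.actions)  # Bellman optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    # Greedy policy with respect to the converged value function.
    policy = {s: max(mdp.actions, key=lambda a: q(s, a)) for s in mdp.states}
    return V, policy
```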


Introduction to Reinforcement Learning

image


What is Reinforcement Learning?

  • Learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment
  • The goal is defined by a reward signal, which must be maximized
  • The agent must be able to (partially or fully) sense the environment state and take actions that influence it
  • The state is typically described by a feature vector (a generic interaction-loop sketch follows this list)
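
The interaction described above is usually written as a simple loop: observe the state, choose an action, receive a reward and the next state, repeat. The `env`/`agent` interface below is hypothetical, standing in for whatever environment and learning agent you actually use.

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of agent-environment interaction; returns the total reward.

    Hypothetical interface: env.reset() -> state, env.step(a) -> (state, reward, done),
    agent.act(state) -> action, agent.observe(...) receives the learning signal.
    """
    state = env.reset()          # initial state, e.g. a feature vector
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                    # agent picks an action
        next_state, reward, done = env.step(action)  # environment responds
        agent.observe(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```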


Atari game (Breakout)

  • Objective: complete the game with the highest score
  • State: raw pixel inputs of the game state
  • Action: game controls, e.g. left, right
  • Reward: score increase/decrease at each time step
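
For the Breakout example, the same loop can be run against a real emulator. A sketch using the Gymnasium Atari environments (this assumes `gymnasium` with its Atari extras is installed; a random policy stands in for a learned one, just to show the state/action/reward flow):

```python
import gymnasium as gym

# Raw-pixel observations, discrete controls (NOOP/FIRE/RIGHT/LEFT),
# reward = change in game score at each step.
env = gym.make("ALE/Breakout-v5")

obs, info = env.reset()
total_score = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_score += reward
    done = terminated or truncated

env.close()
print("episode score:", total_score)
```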

