[Artificial Intelligence] Reinforcement Learning
- Finite Markov Decision Processes
- Introduction to Reinforcement Learning
- Q-learning, Deep Q-Networks
- Policy Learning
Finite Markov Decision Processes
Markov Decision Process (MDP)
- Set of states S
- Set of actions A
- State transition probabilities $p(s' \mid s, a)$: the probability distribution over the state space given that we take action $a$ in state $s$
- Discount factor $\gamma \in [0, 1]$
- Reward function $R: S \times A \rightarrow \mathbb{R}$
- For simplicity, assume discrete rewards
- Finite MDP if both S and A are finite
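As a rough illustration (not from the slides), the five components above can be bundled into one small Python structure; the field names and dictionary layouts below are assumptions chosen to match the definitions:

```python
# Minimal sketch of a finite MDP container; the dictionary layouts are assumptions.
from dataclasses import dataclass

@dataclass
class FiniteMDP:
    S: list        # finite set of states
    A: list        # finite set of actions
    P: dict        # P[(s, a)] -> {s': probability of landing in s'}
    R: dict        # R[(s, a)] -> real-valued immediate reward
    gamma: float   # discount factor in [0, 1]
```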
Example: What SEQUENCE of actions should our agent take?
- Each action costs -1/25
- Agent can take action N, E, S, W
- Faces uncertainty in every state
MDP Tuple: <S, A, P, R>
- S: State of the agent on the grid, e.g., (4,3)
- Note that a cell is denoted by (x, y)
- A: Actions of the agent, i.e., N, E, S, W
- P: Transition function
- Table $P(s' \mid s, a)$: probability of $s'$ given we take action $a$ in state $s$
- E.g., $P((4,3) \mid (3,3), N) = 0.1$
- E.g., $P((3,2) \mid (3,3), N) = 0.8$
- (Uncertainty arises from robot movement, another agent's actions, …)
- R: Reward (more comments on the reward function later)
- R((3,3), N) = -1/25
- R((4,1)) = +1
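As a small sketch, the example transition and reward entries above could be stored as dictionaries keyed by (state, action); cells are (x, y) tuples, and the second 0.1 slip outcome is an assumption, not something given on the slide:

```python
# Example entries from the grid world above, stored as (state, action)-keyed dicts.
P = {
    ((3, 3), "N"): {(3, 2): 0.8,   # intended outcome, prob 0.8
                    (4, 3): 0.1,   # sideways slip, prob 0.1
                    (2, 3): 0.1},  # assumed second slip outcome, prob 0.1
}
R = {
    ((3, 3), "N"): -1/25,   # every move costs 1/25
    (4, 1): +1,             # reward for reaching cell (4, 1)
}
print(P[((3, 3), "N")][(4, 3)])   # 0.1, i.e. P((4,3) | (3,3), N)
```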
Markov Assumption
- Markov Assumption: Transition probabilities (and rewards) from any given state depend only on the state and not on previous history
- Where you end up after an action depends only on the current state
- Named after the Russian mathematician A. A. Markov (1856-1922)
- (He did not come up with Markov decision processes, however)
- E.g., transitions from state (1,2) do not depend on whether the prior state was (1,1) or (1,2)
Non-Optimal vs. Optimal Policy
- Choose the Red policy or the Yellow policy?
- Choose the Red policy or the Blue policy?
- Which is optimal (if any)?
- Value iteration: one popular algorithm for determining the optimal policy
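Below is a hedged sketch of value iteration for a finite MDP. It applies the standard Bellman optimality backup and reuses the (assumed) FiniteMDP layout sketched earlier, so it is an illustration rather than the exact code used in the course:

```python
# Value iteration sketch: repeatedly apply the Bellman optimality backup,
# then read off a greedy policy from the converged values.
def value_iteration(mdp, theta=1e-8):
    V = {s: 0.0 for s in mdp.S}
    while True:
        delta = 0.0
        for s in mdp.S:
            # Q(s, a) = R(s, a) + gamma * sum_{s'} P(s' | s, a) * V(s')
            q = [mdp.R[(s, a)] + mdp.gamma *
                 sum(p * V[s2] for s2, p in mdp.P[(s, a)].items())
                 for a in mdp.A]
            best = max(q)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:          # stop once the values stop changing
            break
    # Extract a greedy policy from the converged value function
    policy = {s: max(mdp.A, key=lambda a: mdp.R[(s, a)] + mdp.gamma *
                     sum(p * V[s2] for s2, p in mdp.P[(s, a)].items()))
              for s in mdp.S}
    return V, policy
```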
Introduction to Reinforcement Learning
What is Reinforcement Learning?
- Learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment
- The goal is defined by a reward signal, which the agent must maximize
- The agent must be able to (partially or fully) sense the environment state and take actions that influence it
- The state is typically described by a feature vector
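This interaction is often written as a simple loop. The sketch below assumes a Gym-style environment interface (reset/step) and a hypothetical `policy` function; treat it as an illustration of the agent-environment cycle rather than any specific library's API:

```python
# One episode of agent-environment interaction (Gym-style interface assumed).
def run_episode(env, policy):
    state = env.reset()                      # agent senses the initial state
    total_reward, done = 0.0, False
    while not done:
        action = policy(state)               # agent chooses an action
        state, reward, done, info = env.step(action)  # environment responds
        total_reward += reward               # accumulate the reward signal
    return total_reward                      # quantity the agent tries to maximize
```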
Atari game (Breakout)
- Objective: complete the game with the highest score
- State: raw pixel inputs of the game state
- Action: game controls, e.g. left, right
- Reward: score increase/decrease at each time step
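As a rough sketch of what these pieces look like in code, the snippet below uses the classic Gym API with the Breakout Atari environment; the environment id, the installed Atari dependencies, and the 4-tuple step return are assumptions about the setup, not something stated in the notes:

```python
# Inspect Breakout's state, action, and reward signals (classic Gym API assumed).
import gym

env = gym.make("Breakout-v4")          # assumed environment id
obs = env.reset()                      # state: raw pixel frame, e.g. (210, 160, 3)
print(obs.shape, env.action_space)     # actions: a small discrete set of controls

obs, reward, done, info = env.step(env.action_space.sample())
print(reward)                          # reward: score change at this time step
```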