
image

  • Finite Markov Decision Processes
  • Introduction to Reinforcement Learning
  • Q-learning, Deep Q-Networks
  • Policy Learning


Finite Markov Decision Processes


Markov Decision Process (MDP)

  • Set of states S
  • Set of actions A
  • State transition probabilities $p(s' \mid s, a)$: the probability distribution over next states when action $a$ is taken in state $s$
  • Discount factor $\gamma \in [0, 1]$
  • Reward function $R: S \times A \to \mathbb{R}$
  • For simplicity, assume discrete rewards
  • Finite MDP if both S and A are finite (a minimal code sketch of this structure follows the list)
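
As a rough illustration (not from the original slides), a finite MDP can be stored as a few dictionaries keyed by (state, action) pairs; the class and method names below are just one possible choice:

```python
from dataclasses import dataclass


@dataclass
class FiniteMDP:
    """A finite MDP <S, A, P, R> with discount factor gamma."""
    states: set         # finite set of states S
    actions: set        # finite set of actions A
    transitions: dict   # P[(s, a)] -> {s': probability}, i.e. p(s' | s, a)
    rewards: dict       # R[(s, a)] -> immediate reward for action a in state s
    gamma: float = 0.9  # discount factor in [0, 1]

    def next_state_dist(self, s, a):
        """Distribution over successor states, p(. | s, a)."""
        return self.transitions[(s, a)]

    def reward(self, s, a):
        """Immediate reward R(s, a); 0 for unspecified pairs."""
        return self.rewards.get((s, a), 0.0)
```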


Example: What SEQUENCE of actions should our agent take?

  • Each action costs -1/25
  • The agent can take actions N, E, S, W
  • It faces uncertainty (stochastic transitions) in every state


image


MDP Tuple: <S, A, P, R>

  • S: State of the agent on the grid, e.g., (4,3)
    • Note that cells are denoted by (x, y)
  • A: Actions of the agent, i.e., N, E, S, W
  • P: Transition function
    • Table P(s' | s, a): probability of reaching s' after taking action a in state s
    • E.g., P((4,3) | (3,3), N) = 0.1
    • E.g., P((3,2) | (3,3), N) = 0.8 (both entries appear in the code sketch after this list)
    • (Robot movement, uncertainty of another agent’s actions,…)
  • R: Reward (more comments on the reward function later)
    • R((3,3), N) = -1/25
    • R((4,1)) = +1
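
As a concrete (and partly assumed) encoding of the grid-world numbers above: the 0.8/0.1 entries and the -1/25 step cost come from the slide, while the third successor (2,3) is my assumption so the probabilities sum to 1.

```python
# Partial transition table: P[(s, a)] = {s': probability}.
P = {
    ((3, 3), "N"): {(3, 2): 0.8, (4, 3): 0.1, (2, 3): 0.1},  # (2,3) assumed
}

# Partial reward table: taking N in (3,3) costs 1/25.
# The +1 at (4,1) is listed on the slide as a state reward; how it is keyed
# depends on the reward convention you adopt.
R = {
    ((3, 3), "N"): -1 / 25,
}

# Sanity check: the successor probabilities form a distribution.
assert abs(sum(P[((3, 3), "N")].values()) - 1.0) < 1e-9
```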


Markov Assumption

  • Markov Assumption: Transition probabilities (and rewards) from any given state depend only on that state, not on the previous history
  • Where you end up after an action depends only on the current state
    • Named after the Russian mathematician A. A. Markov (1856-1922)
    • (He did not, however, come up with Markov decision processes)
    • E.g., transitions out of state (1,2) do not depend on earlier states such as (1,1)


Non-Optimal vs. Optimal Policy

image


  • Choose Red policy or Yellow policy?
  • Choose Red policy or Blue policy?

  • Which is optimal (if any)?
    • Value iteration: one popular algorithm for computing the optimal policy (a sketch follows below)
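
A minimal sketch of value iteration over the `FiniteMDP` structure sketched earlier (helper names are mine, not from the slides). It repeatedly applies the Bellman optimality backup $V(s) \leftarrow \max_a \big[R(s,a) + \gamma \sum_{s'} p(s' \mid s,a)\, V(s')\big]$ until the values stop changing, then reads off a greedy policy; it assumes every (state, action) pair has a transition entry and ignores terminal-state handling.

```python
def value_iteration(mdp, tol=1e-6):
    """Return (V, policy) for a FiniteMDP via value iteration."""
    V = {s: 0.0 for s in mdp.states}

    def q(s, a):
        # One-step lookahead: R(s, a) + gamma * E[V(s')].
        return mdp.reward(s, a) + mdp.gamma * sum(
            p * V[s2] for s2, p in mdp.next_state_dist(s, a).items()
        )

    while True:
        delta = 0.0
        for s in mdp.states:
            best = max(q(s, a) for a in mdp.actions)  # Bellman optimality backup
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break

    # Greedy policy with respect to the converged value function.
    policy = {s: max(mdp.actions, key=lambda a: q(s, a)) for s in mdp.states}
    return V, policy
```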


Introduction to Reinforcement Learning

image


What is Reinforcement Learning?

  • Learning from interaction with an environment to achieve some long-term goal that is related to the state of the environment
  • The goal is defined by a reward signal, which must be maximized
  • The agent must be able to (partially or fully) sense the environment state and take actions that influence it
  • The state is typically described by a feature vector (a generic interaction-loop sketch follows this list)
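
The interaction described above is usually written as a simple loop: observe the state, choose an action, receive a reward and the next state, repeat. The `env`/`agent` interface below is hypothetical, standing in for whatever environment and learning agent you actually use.

```python
def run_episode(env, agent, max_steps=1000):
    """One episode of agent-environment interaction; returns the total reward.

    Hypothetical interface: env.reset() -> state, env.step(a) -> (state, reward, done),
    agent.act(state) -> action, agent.observe(...) receives the learning signal.
    """
    state = env.reset()          # initial state, e.g. a feature vector
    total_reward = 0.0
    for _ in range(max_steps):
        action = agent.act(state)                    # agent picks an action
        next_state, reward, done = env.step(action)  # environment responds
        agent.observe(state, action, reward, next_state, done)
        total_reward += reward
        state = next_state
        if done:
            break
    return total_reward
```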


Atari game (Breakout)

  • Objective: complete the game with the highest score
  • State: raw pixel inputs of the game state
  • Action: game controls, e.g. left, right
  • Reward: score increase/decrease at each time step
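
For the Breakout example, the same loop can be run against a real emulator. A sketch using the Gymnasium Atari environments (this assumes `gymnasium` with its Atari extras is installed; a random policy stands in for a learned one, just to show the state/action/reward flow):

```python
import gymnasium as gym

# Raw-pixel observations, discrete controls (NOOP/FIRE/RIGHT/LEFT),
# reward = change in game score at each step.
env = gym.make("ALE/Breakout-v5")

obs, info = env.reset()
total_score = 0.0
done = False
while not done:
    action = env.action_space.sample()  # placeholder for a learned policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_score += reward
    done = terminated or truncated

env.close()
print("episode score:", total_score)
```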

