Lectures 18-20: Reinforcement Learning & MDPs
Watch Lecture 18 | Watch Lecture 19 | Watch Lecture 20
From Andrew Ng's CS229 Lectures 18-20 (Autumn 2018)
"What I hope you learn today is how to apply reinforcement learning, even to continuous state or infinite state MDPs." - Andrew Ng
Reinforcement Learning Overview
Different from Supervised Learning:
No explicit labels (no "correct answer" given)
Learn from rewards and punishments
Agent interacts with environment
Goal: Maximize cumulative reward
Markov Decision Process (MDP)
Components:
S: Set of states
A: Set of actions
P_sa: Transition probabilities P(s′|s,a)
γ: Discount factor (0 ≤ γ < 1)
R: Reward function
Example from lecture:
11-state grid world
Actions: North, South, East, West
Some states give rewards/penalties
Find optimal policy
Key Concepts
Value Function V^π(s):
"V^π was the value function for a policy π which is the expected payoff if you execute that policy starting from a state s"
Optimal Value Function V*(s):
Best possible expected return from state s
V*(s) = max_π V^π(s)
Optimal Policy π*:
π*(s) = argmax_a [R(s,a) + γ Σ_{s'} P_sa(s') V*(s')]
Or equivalently:
π*(s) = argmax_a [R(s,a) + γ E_{s'~P_sa}[V*(s')]]
Value Iteration Algorithm
Bellman's Equations:
V*(s) = max_a [R(s,a) + γ Σ_{s'} P_sa(s') V*(s')]
Algorithm:
Initialize V(s) = 0 for all states
Repeat until convergence:
V(s) := max_a [R(s,a) + γ Σ_{s'} P_sa(s') V(s')]
Extract policy: π(s) = argmax_a [...]
Lecture 18: Continuous State MDPs
The Challenge:
"Everything we've done so far was built on the MDP having a finite set of states"
Self-Driving Car Example:
State space (continuous):
Position: (x, y) - latitude/longitude
Orientation: θ - angle relative to North
Velocities: (ẋ, ẏ) - velocity in x and y directions
Angular velocity: θ̇ - rate of turning
"It's up to you to decide what is the state-space you want to use to model this car… and if you are building a car to race in a racetrack, maybe it is important to model what is the temperature of the engine"
Approaches for Continuous States:
1. Discretization
Divide continuous space into grid
Each grid cell is a discrete state
Apply value iteration
Problem: Curse of dimensionality
2. Model-Based RL
Learn model of environment
Use model for planning
3. Fitted Value Iteration (main algorithm)
Function approximation for V*
Use supervised learning to fit V*
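The fitted value iteration idea can be sketched on a toy problem: sample states, compute Bellman backup targets by averaging over simulated next states, and fit the value function with a supervised (least-squares) regression step. Everything below (the 1-D dynamics, features, and reward) is made up for illustration, not taken from the lecture.

```python
import numpy as np

# Fitted value iteration sketch on a toy 1-D continuous-state problem.
# Assumed setup: state s in [0, 1], actions a in {-1, +1},
# s' = clip(s + 0.1a + noise), reward highest near s = 1,
# and V(s) ~ theta . phi(s) with polynomial features.
rng = np.random.default_rng(0)
gamma = 0.9

def phi(s):
    return np.array([1.0, s, s ** 2])

def sample_next(s, a, n=10):
    return np.clip(s + 0.1 * a + 0.02 * rng.standard_normal(n), 0.0, 1.0)

def reward(s):
    return -(1.0 - s) ** 2  # best near the "goal" s = 1

theta = np.zeros(3)
states = rng.uniform(0.0, 1.0, size=100)  # sampled states stand in for a grid

for _ in range(30):
    # Bellman backup targets: y(s) = max_a [R(s) + gamma * E_{s'}[V(s')]]
    y = np.empty_like(states)
    for i, s in enumerate(states):
        q_values = []
        for a in (-1.0, 1.0):
            v_next = np.mean([phi(sn) @ theta for sn in sample_next(s, a)])
            q_values.append(reward(s) + gamma * v_next)
        y[i] = max(q_values)
    # Supervised learning step: least-squares fit of theta . phi(s) to the targets
    Phi = np.array([phi(s) for s in states])
    theta, *_ = np.linalg.lstsq(Phi, y, rcond=None)

# Fitted values should be higher near the goal than far from it
print(phi(0.9) @ theta, phi(0.1) @ theta)
```

The "fit" step is what makes this supervised learning: the Bellman backups produce regression targets, and any function approximator could replace the least-squares fit.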
Lecture 19: Continuous State RL Algorithms
Topics:
Value function approximation
Linear function approximators
Neural network approximators
Q-learning for continuous states
Deep Q-Networks (DQN)
Lecture 20: Policy Search & REINFORCE
Policy Search:
Directly search in policy space
Don't need to compute V* explicitly
Gradient ascent on expected reward
REINFORCE Algorithm:
Policy gradient method
Monte Carlo approach
Works with continuous actions
Applications:
Robotics (walking, manipulation)
Game playing (Go, Chess, Atari)
Autonomous vehicles
Resource management
Key Algorithms Summary

| Algorithm | Type | When to Use |
|---|---|---|
| Value Iteration | Value-based | Discrete, known model |
| Policy Iteration | Value-based | Discrete, known model |
| Q-Learning | Value-based | Model-free, discrete |
| DQN | Value-based | Model-free, continuous state |
| REINFORCE | Policy-based | Continuous action space |
| Actor-Critic | Hybrid | General purpose |
Advanced RL Theory: Deep Dive into Key Algorithms
1. DQN (Deep Q-Network): Mathematical Foundation
The Q-Function:
In Q-learning, we learn the action-value function:
$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a, \pi\right]$$
Bellman Optimality Equation:
$$Q^*(s, a) = \mathbb{E}_{s'}\left[r + \gamma \max_{a'} Q^*(s', a')\right]$$
DQN Innovation: Use a neural network to approximate \(Q^*(s, a)\):
$$Q(s, a; \theta) \approx Q^*(s, a)$$
where \(\theta\) are network parameters.
Loss Function:
$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$
Key innovations:
Experience Replay: Store transitions \((s, a, r, s')\) in replay buffer \(D\)
Break temporal correlations
Reuse experiences (sample efficiency)
Stabilize training
Target Network: Use separate network with frozen weights \(\theta^-\)
Update every \(C\) steps: \(\theta^- \leftarrow \theta\)
Reduces moving target problem
Stabilizes learning
Gradient:
$$\nabla_\theta L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right) \nabla_\theta Q(s, a; \theta)\right]$$
DQN Algorithm:
Initialize replay buffer D
Initialize Q-network with random weights θ
Initialize target network θ⁻ = θ
For episode = 1 to M:
    Initialize state s
    For t = 1 to T:
        With probability ε: select random action a
        Otherwise: a = argmax_a' Q(s, a'; θ)
        Execute action a, observe reward r, next state s'
        Store (s, a, r, s') in D
        Sample minibatch {(s_j, a_j, r_j, s'_j)} from D
        Set target: y_j = r_j + γ max_a' Q(s'_j, a'; θ⁻)
        Update: θ ← θ - α ∇_θ (y_j - Q(s_j, a_j; θ))²
        Every C steps: θ⁻ ← θ
        s ← s'
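The replay buffer in the loop above can be as simple as a bounded deque with uniform sampling. A minimal sketch (real DQN implementations add tensor batching, prioritization, and so on):

```python
import random
from collections import deque

# Minimal experience-replay buffer, as used in the DQN loop above.
class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions fall off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling breaks the temporal correlation between
        # consecutive environment steps
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

buf = ReplayBuffer(capacity=100)
for t in range(150):
    buf.push(t, t % 4, -0.1, t + 1, False)

print(len(buf))  # maxlen caps the buffer at 100 transitions
```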
2. Policy Gradient Methods: REINFORCE
Problem with Q-learning: Cannot handle continuous action spaces easily.
Solution: Directly parameterize the policy \(\pi_\theta(a|s)\).
Objective: Maximize expected return
$$J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right]$$
where \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) is a trajectory.
Policy Gradient Theorem:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \, R_t\right]$$
where \(R_t = \sum_{k=t}^T \gamma^{k-t} r_k\) is the return from time \(t\).
Intuition:
If action led to good outcome (\(R_t\) high) → increase probability
If action led to bad outcome (\(R_t\) low) → decrease probability
Weight by gradient of log probability
REINFORCE Algorithm:
Initialize policy parameters θ
For episode = 1 to M:
    Generate trajectory τ = (s_0, a_0, r_0, ..., s_T, a_T, r_T) using π_θ
    For t = 0 to T:
        Compute return: R_t = Σ_{k=t}^T γ^(k-t) r_k
        Update: θ ← θ + α ∇_θ log π_θ(a_t|s_t) · R_t
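A minimal runnable instance of this update, on a 2-armed bandit with a softmax policy (a stateless toy chosen for brevity; the arm means and step size are made up):

```python
import numpy as np

# REINFORCE sketch on a 2-armed bandit: softmax policy pi(a) ~ exp(theta_a).
# Arm 1 pays more on average, so its probability should rise under
# theta <- theta + alpha * grad(log pi(a)) * R.
rng = np.random.default_rng(0)
theta = np.zeros(2)
alpha = 0.1
true_means = np.array([0.2, 0.8])  # assumed expected reward of each arm

def policy(theta):
    e = np.exp(theta - theta.max())  # stabilized softmax
    return e / e.sum()

for episode in range(2000):
    p = policy(theta)
    a = rng.choice(2, p=p)
    R = rng.normal(true_means[a], 0.1)  # one-step "trajectory" return
    # grad of log softmax at the chosen arm: one_hot(a) - p
    grad_log = -p
    grad_log[a] += 1.0
    theta += alpha * grad_log * R

print(policy(theta))  # probability mass should concentrate on arm 1
```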
Variance Reduction: Baseline
High variance is the main problem. Subtract a baseline \(b(s_t)\):
$$\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \, (R_t - b(s_t))\right]$$
Common baseline: the value function \(V^\pi(s_t)\).
This gives us the advantage function:
$$A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)$$
3. Actor-Critic Methods
Combine best of both worlds:
Actor: Policy \(\pi_\theta(a|s)\)
Critic: Value function \(V_\phi(s)\) or \(Q_\phi(s,a)\)
Advantage Actor-Critic (A2C):
Actor update (policy gradient):
$$\theta \leftarrow \theta + \alpha_\theta \nabla_\theta \log \pi_\theta(a_t|s_t) \, \hat{A}(s_t, a_t)$$
Critic update (TD learning):
$$\phi \leftarrow \phi - \alpha_\phi \nabla_\phi \left(r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)\right)^2$$
Advantage estimate:
$$\hat{A}(s_t, a_t) = r_t + \gamma V_\phi(s_{t+1}) - V_\phi(s_t)$$
This is the TD error \(\delta\)!
4. Monte Carlo Tree Search (MCTS)
Used in: AlphaGo, AlphaZero, MuZero
Core Idea: Build a search tree by simulation
Four Phases:
Selection: Start from root, select actions using UCB1:
$$a^* = \arg\max_a \left[\bar{Q}(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}\right]$$
First term: Exploitation (choose high-value actions)
Second term: Exploration (choose less-visited actions)
\(c\): Exploration constant (typically \(\sqrt{2}\))
Expansion: Add new child node to tree
Simulation: Play out to terminal state (random or guided)
Backpropagation: Update values along path
MCTS Algorithm:
function MCTS(root_state, n_simulations):
Initialize tree with root node
For i = 1 to n_simulations:
# 1. Selection
node = root
while node.is_fully_expanded() and not node.is_terminal():
node = node.select_child_UCB1()
# 2. Expansion
if not node.is_terminal():
node = node.expand() # Add new child
# 3. Simulation (rollout)
reward = node.simulate_to_end()
# 4. Backpropagation
while node is not None:
node.visits += 1
node.value += reward
node = node.parent
# Return best action from root
return root.best_child().action
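The `select_child_UCB1` step can be sketched as a plain scoring function (node statistics are passed as numbers here, rather than through the tree-node class the pseudocode assumes):

```python
import math

# UCB1 score used in the Selection phase of MCTS.
def ucb1(child_value, child_visits, parent_visits, c=math.sqrt(2)):
    if child_visits == 0:
        return float('inf')  # unvisited children are always tried first
    exploit = child_value / child_visits  # mean reward through this child
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore

# Illustrative numbers: child B has a lower mean value but far fewer visits,
# so its exploration bonus can make it the selected child.
parent_visits = 100
score_a = ucb1(child_value=60.0, child_visits=80, parent_visits=parent_visits)
score_b = ucb1(child_value=3.0, child_visits=5, parent_visits=parent_visits)
print(score_a, score_b)
```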
UCB1 Derivation:
Comes from Hoeffding's inequality:
$$P\left(\bar{X}_n \geq \mu + u\right) \leq e^{-2nu^2}$$
Setting the bound probability to \(t^{-4}\) and solving for \(u\) gives the exploration bonus \(\sqrt{\frac{2 \ln t}{n}}\).
AlphaGo Enhancement:
Combines MCTS with neural networks:
Policy network: Guides selection (prior \(P(s, a)\))
Value network: Evaluates positions (replaces rollout)
5. Advanced Topics
Proximal Policy Optimization (PPO):
Problem: Policy gradients can take too-large steps → performance collapse
Solution: Constrain policy updates with a clipped objective:
$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta) \hat{A}_t, \; \text{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]$$
where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the probability ratio.
Trust Region Policy Optimization (TRPO):
Constrained optimization:
$$\max_\theta \; \mathbb{E}_t\left[r_t(\theta) \hat{A}_t\right] \quad \text{s.t.} \quad \mathbb{E}_t\left[\text{KL}\left(\pi_{\theta_{\text{old}}}(\cdot|s_t) \,\|\, \pi_\theta(\cdot|s_t)\right)\right] \leq \delta$$
Deep Deterministic Policy Gradient (DDPG):
For continuous actions:
Deterministic policy: \(a = \mu_\theta(s)\)
Critic: \(Q_\phi(s, a)\)
Algorithm Comparison:

| Algorithm | Value/Policy | On/Off-Policy | Action Space | Sample Efficiency |
|---|---|---|---|---|
| DQN | Value-based | Off-policy | Discrete | High |
| REINFORCE | Policy-based | On-policy | Any | Low |
| A2C/A3C | Actor-Critic | On-policy | Any | Medium |
| PPO | Actor-Critic | On-policy | Any | Medium |
| TRPO | Actor-Critic | On-policy | Any | Medium |
| DDPG | Actor-Critic | Off-policy | Continuous | High |
| SAC | Actor-Critic | Off-policy | Continuous | High |
When to use:
Discrete actions + sample efficiency: DQN
Continuous actions + stability: PPO, SAC
Maximum performance (sample-rich): AlphaZero (MCTS + NN)
Simple baseline: REINFORCE with baseline
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Rectangle
from collections import defaultdict
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
print("Libraries loaded!")
14.1 GridWorld Environment
What and Why
A GridWorld is the canonical toy environment for reinforcement learning, where an agent navigates a discrete grid to reach a goal while avoiding obstacles. Despite its simplicity, GridWorld captures the essential RL structure: states (grid positions), actions (up/down/left/right), transitions (possibly stochastic), and rewards (positive at the goal, negative at obstacles, small step penalty elsewhere).
How It Works
The environment is formalized as a Markov Decision Process (MDP) defined by the tuple \((S, A, P, R, \gamma)\), where \(P(s'|s,a)\) gives transition probabilities and \(R(s,a,s')\) gives rewards. The agentβs objective is to find a policy \(\pi(a|s)\) that maximizes the expected cumulative discounted reward \(\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]\).
Connection to ML
GridWorld environments are the standard testbed for all RL algorithms, from tabular methods (value iteration, Q-learning) to deep RL (DQN, policy gradient). The same MDP framework scales to robotics navigation, game playing, and resource allocation problems.
class GridWorld:
def __init__(self, size=5):
self.size = size
self.goal = (size-1, size-1)
self.obstacles = [(1, 1), (2, 2), (3, 1)]
self.reset()
# Actions: up, down, left, right
self.actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
self.action_names = ['Up', 'Down', 'Left', 'Right']
def reset(self):
self.state = (0, 0)
return self.state
def step(self, action_idx):
"""
Take action, return (next_state, reward, done)
"""
action = self.actions[action_idx]
next_state = (self.state[0] + action[0], self.state[1] + action[1])
# Check boundaries
if (next_state[0] < 0 or next_state[0] >= self.size or
next_state[1] < 0 or next_state[1] >= self.size):
next_state = self.state # Stay in place
reward = -1
# Check obstacles
elif next_state in self.obstacles:
next_state = self.state # Stay in place
reward = -1
# Check goal
elif next_state == self.goal:
reward = 10
else:
reward = -0.1
self.state = next_state
done = (next_state == self.goal)
return next_state, reward, done
def visualize(self, values=None, policy=None):
fig, ax = plt.subplots(figsize=(8, 8))
# Draw grid
for i in range(self.size):
for j in range(self.size):
# Color based on value
if values is not None:
value = values.get((i, j), 0)
color = plt.cm.RdYlGn((value + 1) / 11) # Normalize
else:
color = 'white'
rect = Rectangle((j, self.size-1-i), 1, 1,
facecolor=color, edgecolor='black', linewidth=2)
ax.add_patch(rect)
# Mark obstacles
if (i, j) in self.obstacles:
ax.text(j+0.5, self.size-1-i+0.5, 'X',
ha='center', va='center', fontsize=20, fontweight='bold')
# Mark goal
elif (i, j) == self.goal:
ax.text(j+0.5, self.size-1-i+0.5, 'G',
ha='center', va='center', fontsize=20,
fontweight='bold', color='green')
# Show value
elif values is not None:
ax.text(j+0.5, self.size-1-i+0.5, f'{value:.1f}',
ha='center', va='center', fontsize=10)
# Show policy arrow
if policy is not None and (i, j) not in self.obstacles and (i, j) != self.goal:
action_idx = policy.get((i, j), 0)
arrows = ['↑', '↓', '←', '→']
ax.text(j+0.5, self.size-1-i+0.2, arrows[action_idx],
ha='center', va='center', fontsize=16, color='blue')
ax.set_xlim(0, self.size)
ax.set_ylim(0, self.size)
ax.set_aspect('equal')
ax.axis('off')
plt.tight_layout()
return fig
# Create and visualize environment
env = GridWorld(size=5)
fig = env.visualize()
plt.title('GridWorld Environment', fontsize=14, fontweight='bold', pad=20)
plt.show()
print("Environment created!")
print(f"Grid size: {env.size}x{env.size}")
print(f"Start: (0, 0)")
print(f"Goal: {env.goal}")
print(f"Obstacles: {env.obstacles}")
14.2 Value Iteration
Bellman Equations: Complete Mathematical Treatment
Bellman Expectation Equations
For a given policy \(\pi\):
Value function:
$$V^\pi(s) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right]$$
Expanding:
$$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
Action-value function:
$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s, a_0 = a\right]$$
Or:
$$Q^\pi(s, a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^\pi(s')\right]$$
Relationship:
$$V^\pi(s) = \sum_a \pi(a|s) \, Q^\pi(s, a)$$
Bellman Optimality Equations
Optimal value function:
$$V^*(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V^*(s')\right]$$
Optimal action-value function:
$$Q^*(s, a) = \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma \max_{a'} Q^*(s', a')\right]$$
These are systems of nonlinear equations!
Proof: Value Iteration Converges
Theorem: Value iteration converges to \(V^*\).
Proof sketch:
Define the Bellman operator \(\mathcal{T}\):
$$(\mathcal{T}V)(s) = \max_a \sum_{s'} P(s'|s,a)\left[R(s,a,s') + \gamma V(s')\right]$$
Value iteration: \(V_{k+1} = \mathcal{T}V_k\)
Key property: \(\mathcal{T}\) is a contraction mapping with factor \(\gamma\):
$$\|\mathcal{T}V - \mathcal{T}U\|_\infty \leq \gamma \|V - U\|_\infty$$
Proof of contraction: for each state, \(|\max_a f(a) - \max_a g(a)| \leq \max_a |f(a) - g(a)|\), and for each action the two backups differ only by \(\gamma \sum_{s'} P(s'|s,a)\left(V(s') - U(s')\right)\), which is bounded in magnitude by \(\gamma \|V - U\|_\infty\).
By Banach fixed-point theorem:
Unique fixed point \(V^*\)
\(V_k \to V^*\) exponentially fast
Convergence rate:
$$\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty$$
Exponential convergence in \(\gamma\)!
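The contraction property can be checked numerically on a small random MDP (three states and two actions with made-up transition probabilities, not the lecture's grid world):

```python
import numpy as np

# Numerical check that the Bellman operator contracts with factor gamma.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
# P[s, a] is a distribution over next states; R[s, a] is a bounded reward
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def bellman(V):
    # (TV)(s) = max_a [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]
    return np.max(R + gamma * P @ V, axis=1)

# Iterate long enough to reach the fixed point V* (up to machine precision)
V_star = np.zeros(n_states)
for _ in range(2000):
    V_star = bellman(V_star)

# From an arbitrary start, the sup-norm error shrinks by at least gamma
# each time T is applied
V = rng.uniform(-5.0, 5.0, size=n_states)
errors = []
for _ in range(10):
    errors.append(np.max(np.abs(V - V_star)))
    V = bellman(V)
ratios = [errors[i + 1] / errors[i] for i in range(len(errors) - 1)]
print(max(ratios))  # every ratio should be at most gamma
```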
Policy Iteration: Guaranteed Improvement
Algorithm:
Policy Evaluation: Compute \(V^\pi\)
Policy Improvement: \(\pi' = \text{greedy}(V^\pi)\)
Repeat until convergence
Theorem (Policy Improvement):
If \(\pi'(s) = \arg\max_a Q^\pi(s, a)\), then \(V^{\pi'}(s) \geq V^\pi(s)\) for all \(s\).
Proof:
$$V^\pi(s) = Q^\pi(s, \pi(s)) \leq \max_a Q^\pi(s, a) = Q^\pi(s, \pi'(s)) \leq \cdots \leq V^{\pi'}(s)$$
where repeatedly applying the same inequality at each subsequent state, following \(\pi'\), yields the final bound.
Convergence: Finite MDPs → finitely many policies → must converge to \(\pi^*\)
Policy iteration vs value iteration:

| Aspect | Value Iteration | Policy Iteration |
|---|---|---|
| Update | One sweep | Full evaluation |
| Per iteration | Fast | Slow |
| Convergence | Many iterations | Few iterations |
| When better | Large state space | Small state space |
Temporal Difference (TD) Learning Theory
TD(0) update:
$$V(s_t) \leftarrow V(s_t) + \alpha\left[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\right]$$
TD error:
$$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$$
Theorem: TD(0) converges to \(V^\pi\) (under conditions):
Robbins-Monro: \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)
Every state visited infinitely often
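The TD(0) update in runnable form (a two-state example with arbitrary \(\alpha\), \(\gamma\), and reward):

```python
# TD(0) update on a tabular value function stored as a dict.
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]; returns the TD error."""
    delta = r + gamma * V[s_next] - V[s]
    V[s] = V[s] + alpha * delta
    return delta

V = {'A': 0.0, 'B': 1.0}
delta = td0_update(V, 'A', r=0.5, s_next='B')
# delta = 0.5 + 0.9 * 1.0 - 0.0 = 1.4, so V(A) moves to 0.0 + 0.1 * 1.4 = 0.14
print(delta, V['A'])
```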
Bias-Variance Trade-off:

| Method | Update Target | Bias | Variance |
|---|---|---|---|
| Monte Carlo | \(G_t = \sum_k \gamma^k r_{t+k}\) | 0 | High |
| TD(0) | \(r_{t+1} + \gamma V(s_{t+1})\) | Yes | Low |
| TD(n) | \(\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})\) | Medium | Medium |
TD(\(\lambda\)): Unified View
Combines all n-step returns:
$$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$$
where:
\(\lambda = 0\): TD(0)
\(\lambda = 1\): Monte Carlo
\(\lambda \in (0,1)\): Trade-off
Exploration vs Exploitation
\(\epsilon\)-greedy:
$$a_t = \begin{cases} \arg\max_a Q(s_t, a) & \text{with probability } 1-\epsilon \\ \text{random action} & \text{with probability } \epsilon \end{cases}$$
Boltzmann (softmax) exploration:
$$\pi(a|s) = \frac{e^{Q(s,a)/\tau}}{\sum_{a'} e^{Q(s,a')/\tau}}$$
\(\tau \to 0\): Greedy
\(\tau \to \infty\): Uniform
\(\tau\): Temperature parameter
UCB (Upper Confidence Bound):
$$a_t = \arg\max_a \left[Q(a) + c\sqrt{\frac{\ln t}{N(a)}}\right]$$
Optimism in face of uncertainty!
Regret bounds:
For a \(K\)-armed bandit with UCB:
$$\mathbb{E}[\text{Regret}(T)] = O\left(\sqrt{KT \log T}\right)$$
Comparison:

| Method | Regret | Pros | Cons |
|---|---|---|---|
| Random | \(O(T)\) | Simple | Poor |
| \(\epsilon\)-greedy | \(O(T)\) | Simple | Linear regret |
| \(\epsilon\)-decreasing | \(O(\log T)\) | Better | Tuning needed |
| UCB | \(O(\sqrt{T \log T})\) | Theoretical guarantee | Stationary only |
| Thompson | \(O(\sqrt{T})\) | Bayesian, great empirically | Complex |
def value_iteration(env, gamma=0.9, theta=0.001, max_iterations=1000):
"""
Value Iteration algorithm
"""
# Initialize value function
V = defaultdict(float)
for iteration in range(max_iterations):
delta = 0
# For each state
for i in range(env.size):
for j in range(env.size):
state = (i, j)
# Skip terminal and obstacle states
if state == env.goal or state in env.obstacles:
continue
v = V[state]
# Compute value for each action
action_values = []
for action_idx in range(len(env.actions)):
env.state = state
next_state, reward, done = env.step(action_idx)
action_value = reward + gamma * V[next_state]
action_values.append(action_value)
# Bellman update: take max over actions
V[state] = max(action_values)
delta = max(delta, abs(v - V[state]))
if delta < theta:
print(f"Value iteration converged in {iteration+1} iterations")
break
# Extract policy
policy = {}
for i in range(env.size):
for j in range(env.size):
state = (i, j)
if state == env.goal or state in env.obstacles:
continue
action_values = []
for action_idx in range(len(env.actions)):
env.state = state
next_state, reward, done = env.step(action_idx)
action_value = reward + gamma * V[next_state]
action_values.append(action_value)
policy[state] = np.argmax(action_values)
return V, policy
# Run value iteration
V_star, policy_star = value_iteration(env, gamma=0.9)
# Visualize results
fig = env.visualize(values=V_star, policy=policy_star)
plt.title('Value Iteration Results\n(Values and Optimal Policy)',
fontsize=14, fontweight='bold', pad=20)
plt.show()
print("\nOptimal Value Function:")
for i in range(env.size):
for j in range(env.size):
print(f"{V_star[(i,j)]:6.2f}", end=" ")
print()
14.3 Q-Learning
Advanced Q-Learning Variants and Improvements
Double Q-Learning: Addressing Overestimation Bias
Problem with standard Q-learning:
$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
The \(\max\) operator causes positive bias:
$$\mathbb{E}\left[\max_{a'} \hat{Q}(s', a')\right] \geq \max_{a'} \mathbb{E}\left[\hat{Q}(s', a')\right]$$
Example: Two actions, true values both 0, estimates \(\sim \mathcal{N}(0, 1)\): the expected maximum of the two estimates is \(1/\sqrt{\pi} \approx 0.56\), not 0.
This accumulates over iterations!
Double Q-Learning Solution:
Maintain two Q-functions \(Q_A\) and \(Q_B\).
Update \(Q_A\) (with probability 0.5):
$$Q_A(s, a) \leftarrow Q_A(s, a) + \alpha\left[r + \gamma Q_B\left(s', \arg\max_{a'} Q_A(s', a')\right) - Q_A(s, a)\right]$$
Update \(Q_B\) (with probability 0.5): symmetric, with the roles of \(Q_A\) and \(Q_B\) swapped.
Key insight:
Use \(Q_A\) to select the action
Use \(Q_B\) to evaluate the action
Decorrelates selection and evaluation!
Final policy:
$$\pi(s) = \arg\max_a \left[Q_A(s, a) + Q_B(s, a)\right]$$
Double DQN:
Use the online network \(\theta\) to select, the target network \(\theta^-\) to evaluate:
$$y = r + \gamma \, Q\left(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\right)$$
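A tabular sketch of the double update (the Q-tables and transition below are made-up numbers, meant only to show select-with-A, evaluate-with-B):

```python
import numpy as np

# Double Q-learning update for tabular Q-functions (states/actions as indices).
def double_q_update(QA, QB, s, a, r, s_next, alpha=0.5, gamma=0.9):
    # Select with QA, evaluate with QB: decorrelates selection and evaluation
    a_star = np.argmax(QA[s_next])
    target = r + gamma * QB[s_next, a_star]
    QA[s, a] += alpha * (target - QA[s, a])

QA = np.zeros((2, 2))
QB = np.zeros((2, 2))
QA[1] = [1.0, 0.0]  # QA prefers action 0 in state 1...
QB[1] = [0.0, 2.0]  # ...but QB values that action at 0
double_q_update(QA, QB, s=0, a=0, r=1.0, s_next=1)
# a* = argmax QA[1] = 0, evaluated by QB[1, 0] = 0, so target = 1.0
print(QA[0, 0])  # 0.0 + 0.5 * (1.0 - 0.0) = 0.5
```

Standard Q-learning would have used QB's optimistic 2.0 here; evaluating the *selected* action with the other table is what suppresses the overestimation.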
Dueling DQN: Separating Value and Advantage
Decomposition:
$$Q(s, a) = V(s) + A(s, a)$$
where:
\(V(s)\): Value of being in state \(s\)
\(A(s, a)\): Advantage of action \(a\) over others
Problem: Not unique! Can add a constant to \(V\) and subtract it from \(A\).
Solution: Force an identifiability constraint:
$$Q(s, a) = V(s) + \left(A(s, a) - \max_{a'} A(s, a')\right)$$
Or (better empirically):
$$Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|\mathcal{A}|}\sum_{a'} A(s, a')\right)$$
Architecture:
Input state s
        ↓
CNN/FC layers (shared)
        ↓
   ┌────┴────┐
   ↓         ↓
 V(s)   A(s,a₁), A(s,a₂), ...
   └────┬────┘
        ↓
Q(s,a) = V + (A - mean(A))
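The final aggregation step is a one-liner (made-up head outputs standing in for the network):

```python
import numpy as np

# Dueling aggregation: combine the V head and A head into Q-values.
def dueling_q(v, advantages):
    a = np.asarray(advantages, dtype=float)
    return v + (a - a.mean())  # mean-subtraction enforces identifiability

q = dueling_q(v=2.0, advantages=[1.0, -1.0, 0.0])
print(q)  # mean(A) = 0, so Q = 2 + [1, -1, 0] = [3, 1, 2]
```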
Advantages:
✅ Learns value function explicitly (useful when actions don't matter much)
✅ Better gradient flow
✅ Improved performance on Atari
Prioritized Experience ReplayΒΆ
Standard experience replay: Uniform sampling from buffer
Problem: Not all experiences equally important!
Solution: Sample transitions with probability proportional to TD error magnitude:
$$p_i = |\delta_i| + \epsilon$$
where \(\delta_i = r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\)
Prioritization with importance sampling:
Sampling probability:
$$P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}$$
\(\alpha = 0\): Uniform (standard)
\(\alpha = 1\): Full prioritization
Typical: \(\alpha = 0.6\)
Importance sampling weights:
$$w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta \Big/ \max_j w_j$$
Corrects bias from non-uniform sampling
\(\beta\): Annealed from \(\beta_0\) to 1
Typical: \(\beta_0 = 0.4\), anneal to 1
Loss:
$$L = \sum_i w_i \, \delta_i^2$$
Update priorities after the gradient step:
$$p_i \leftarrow |\delta_i| + \epsilon$$
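Proportional prioritization and the importance weights can be sketched directly in numpy (a four-transition buffer with made-up TD errors):

```python
import numpy as np

# Proportional prioritized sampling sketch.
rng = np.random.default_rng(0)
td_errors = np.array([0.1, 2.0, 0.5, 1.0])  # illustrative |delta| values
alpha, beta, eps = 0.6, 0.4, 1e-6

priorities = (np.abs(td_errors) + eps) ** alpha
probs = priorities / priorities.sum()        # P(i) proportional to p_i^alpha
idx = rng.choice(len(td_errors), size=2, p=probs)  # biased minibatch draw

N = len(td_errors)
weights = (N * probs[idx]) ** (-beta)        # importance weights correct the bias
weights /= weights.max()                     # normalize for stability
print(probs.round(3), weights)
```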
Results:
49 out of 57 Atari games: improvement
Especially good for sparse rewards
Rainbow DQN: Combining Improvements
Combines 6 extensions:
Double DQN: Reduce overestimation
Dueling: Separate V and A
Prioritized replay: Important transitions
Multi-step returns: \(n\)-step TD targets
Distributional RL: Learn return distribution
Noisy Nets: Exploration via parameter noise
Multi-step Learning:
Instead of the 1-step TD target:
$$y_t = r_t + \gamma \max_a Q(s_{t+1}, a)$$
Use the \(n\)-step return:
$$y_t = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_a Q(s_{t+n}, a)$$
Bias-variance trade-off:
Small \(n\): Low variance, high bias (bootstraps more)
Large \(n\): High variance, low bias (like Monte Carlo)
Typical: \(n = 3\)
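The \(n\)-step target can be computed as below (illustrative numbers; \(\gamma = 0.5\) chosen for easy arithmetic):

```python
# n-step return: discounted sum of the first n rewards plus a bootstrap term.
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """rewards: the n rewards r_t..r_{t+n-1}; bootstrap_value: max_a Q(s_{t+n}, a)."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g + (gamma ** len(rewards)) * bootstrap_value

g = n_step_return([1.0, 1.0, 1.0], bootstrap_value=4.0, gamma=0.5)
print(g)  # 1 + 0.5 + 0.25 + 0.125 * 4 = 2.25
```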
Distributional RL (C51):
Instead of \(Q(s, a) = \mathbb{E}[Z(s, a)]\), learn the full return distribution \(Z\).
Approximate with \(N\) atoms:
$$Z(s, a) \approx \sum_{i=1}^{N} p_i(s, a) \, \delta_{z_i}, \qquad Q(s, a) = \sum_{i=1}^{N} p_i(s, a) \, z_i$$
where \(z_i \in [V_{\min}, V_{\max}]\) are support points.
Loss: Cross-entropy between distributions
Why better?
Captures uncertainty
Richer learning signal
Better performance
Noisy Nets:
Add noise to the network parameters:
$$y = \left(\mu^w + \sigma^w \odot \epsilon^w\right) x + \left(\mu^b + \sigma^b \odot \epsilon^b\right)$$
where:
\(\mu^w, \mu^b\): Learned mean
\(\sigma^w, \sigma^b\): Learned std
\(\epsilon^w, \epsilon^b \sim \mathcal{N}(0, 1)\): Noise
Advantages over \(\epsilon\)-greedy:
State-dependent exploration
Automatic annealing (learned \(\sigma\) decreases)
No hyperparameter tuning
Rainbow Performance:
Median human-normalized score on Atari:
DQN: 100%
Rainbow: 227%
Ablation study shows all components help!
Practical Implementation Tips
Hyperparameters (DQN on Atari):

| Parameter | Value | Purpose |
|---|---|---|
| Replay buffer | 1M | Store transitions |
| Batch size | 32 | Mini-batch |
| Learning rate | 0.00025 | Adam optimizer |
| \(\gamma\) | 0.99 | Discount factor |
| \(\epsilon\) start | 1.0 | Initial exploration |
| \(\epsilon\) end | 0.1 | Final exploration |
| \(\epsilon\) decay | 1M steps | Anneal schedule |
| Target update | 10K steps | Freeze target |
| Frame stack | 4 | Temporal context |
| Frame skip | 4 | Action repeat |
Network Architecture:
Input: 84×84×4 (stacked frames)
  ↓
Conv1: 32 filters, 8×8, stride 4, ReLU
  ↓
Conv2: 64 filters, 4×4, stride 2, ReLU
  ↓
Conv3: 64 filters, 3×3, stride 1, ReLU
  ↓
Flatten → FC: 512 units, ReLU
  ↓
Output: |A| Q-values (linear)
Training Tricks:
✅ Gradient clipping: Clip \(\delta\) to \([-1, 1]\) or use Huber loss
✅ Reward clipping: Clip rewards to \([-1, 1]\) for stability
✅ Frame preprocessing: Grayscale, resize, normalize
✅ Action repeat: Skip frames (4×) for efficiency
✅ Burn-in: Fill replay buffer before training
✅ Evaluation: Separate evaluation episodes with \(\epsilon = 0.05\)
Common Pitfalls:
❌ Instability: Try a lower learning rate, smaller network
❌ Slow learning: Check exploration, increase buffer size
❌ Catastrophic forgetting: Increase replay buffer
❌ Overestimation: Use Double DQN
❌ Sparse rewards: Try curiosity-driven exploration or reward shaping
class QLearningAgent:
def __init__(self, n_actions, learning_rate=0.1, gamma=0.9, epsilon=0.1):
self.Q = defaultdict(lambda: np.zeros(n_actions))
self.lr = learning_rate
self.gamma = gamma
self.epsilon = epsilon
self.n_actions = n_actions
def get_action(self, state, training=True):
"""Epsilon-greedy action selection"""
if training and np.random.random() < self.epsilon:
return np.random.randint(self.n_actions)
else:
return np.argmax(self.Q[state])
def update(self, state, action, reward, next_state, done):
"""Q-learning update"""
current_q = self.Q[state][action]
if done:
target_q = reward
else:
target_q = reward + self.gamma * np.max(self.Q[next_state])
# Q-learning update
self.Q[state][action] = current_q + self.lr * (target_q - current_q)
# Train Q-learning agent
agent = QLearningAgent(n_actions=4, learning_rate=0.1, gamma=0.9, epsilon=0.1)
n_episodes = 1000
episode_rewards = []
episode_lengths = []
for episode in range(n_episodes):
state = env.reset()
total_reward = 0
steps = 0
for step in range(100): # Max 100 steps per episode
action = agent.get_action(state, training=True)
next_state, reward, done = env.step(action)
agent.update(state, action, reward, next_state, done)
state = next_state
total_reward += reward
steps += 1
if done:
break
episode_rewards.append(total_reward)
episode_lengths.append(steps)
if (episode + 1) % 200 == 0:
avg_reward = np.mean(episode_rewards[-100:])
avg_length = np.mean(episode_lengths[-100:])
print(f"Episode {episode+1}: Avg Reward={avg_reward:.2f}, Avg Steps={avg_length:.1f}")
# Plot learning curves
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Smooth rewards
window = 50
smoothed_rewards = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
smoothed_lengths = np.convolve(episode_lengths, np.ones(window)/window, mode='valid')
axes[0].plot(smoothed_rewards, linewidth=2)
axes[0].set_xlabel('Episode', fontsize=12)
axes[0].set_ylabel('Total Reward', fontsize=12)
axes[0].set_title('Q-Learning: Reward per Episode', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)
axes[1].plot(smoothed_lengths, linewidth=2, color='orange')
axes[1].set_xlabel('Episode', fontsize=12)
axes[1].set_ylabel('Steps to Goal', fontsize=12)
axes[1].set_title('Q-Learning: Episode Length', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
14.4 Learned Q-Values and Policy
What and Why
After Q-learning converges, the learned Q-values \(Q(s,a)\) encode the expected cumulative reward for taking action \(a\) in state \(s\) and following the optimal policy thereafter. Visualizing Q-values as a heatmap reveals how the agent values different state-action pairs, while the greedy policy \(\pi^*(s) = \arg\max_a Q(s,a)\) shows the optimal action at each state.
How It Works
The code extracts the maximum Q-value at each grid position and displays it as a color-coded heatmap, with warmer colors indicating higher-value states (closer to the goal via safe paths). Arrows overlay each cell showing the greedy action direction. States near the goal have high values that decay with distance, reflecting the discount factor \(\gamma\).
Connection to ML
Interpreting learned value functions is critical for debugging and trusting RL agents in practice. The same visualization principles apply to deep Q-networks (DQN) where value functions are approximated by neural networks, and to actor-critic methods where both value and policy are learned.
# Extract learned values and policy
learned_values = {}
learned_policy = {}
for i in range(env.size):
for j in range(env.size):
state = (i, j)
if state not in env.obstacles and state != env.goal:
learned_values[state] = np.max(agent.Q[state])
learned_policy[state] = np.argmax(agent.Q[state])
# Visualize
fig = env.visualize(values=learned_values, policy=learned_policy)
plt.title('Q-Learning Results\n(Learned Values and Policy)',
fontsize=14, fontweight='bold', pad=20)
plt.show()
# Test learned policy
print("\nTesting Learned Policy:")
state = env.reset()
path = [state]
for step in range(20):
action = agent.get_action(state, training=False)
next_state, reward, done = env.step(action)
path.append(next_state)
print(f"Step {step+1}: {state} → {env.action_names[action]} → {next_state} (reward={reward:.1f})")
state = next_state
if done:
print(f"\nReached goal in {step+1} steps!")
break
Key Takeaways
1. Markov Decision Process (MDP)
Framework for sequential decision making
Components: States, Actions, Transitions, Rewards
Markov property: Future depends only on current state
Goal: Find policy π that maximizes cumulative reward
2. Value Functions
State value: V(s) = Expected cumulative reward from state s
Action value: Q(s,a) = Expected reward from state s, action a
Bellman equation: Recursive relationship
Optimal policy: Choose action with max Q-value
3. Value Iteration
Model-based: Requires knowledge of transitions P(s′|s,a)
Iteratively update V(s) using the Bellman equation
Converges to optimal value function V*
Extract policy: π(s) = argmax_a Q(s,a)
4. Q-Learning
Model-free: Learns from experience (no transition model)
Off-policy: Learns optimal Q* while following ε-greedy
Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]
Exploration vs exploitation: ε-greedy balances both
5. Key Differences

| Aspect | Value Iteration | Q-Learning |
|---|---|---|
| Model | Required | Not required |
| Updates | All states | Visited states |
| Convergence | Guaranteed | Guaranteed (with conditions) |
| Speed | Fast with small state space | Slow, needs many episodes |
6. Exploration Strategies
ε-greedy: Random action with probability ε
Softmax: Sample from Boltzmann distribution
Upper Confidence Bound: Bonus for rarely visited states
Exploration essential for finding optimal policy
7. Extensions
SARSA: On-policy TD learning
Double Q-Learning: Reduces overestimation bias
DQN: Deep neural network for Q-function
Policy Gradient: Directly optimize policy
Actor-Critic: Combine value and policy methods
A3C, PPO, SAC: Modern RL algorithms
8. Practical Tips
Start with simple environments (GridWorld)
Tune learning rate α and discount γ
Balance exploration (ε) carefully
Use experience replay for stability
Normalize rewards if possible
References
"Reinforcement Learning: An Introduction" - Sutton & Barto (2nd Edition)
"Playing Atari with Deep Reinforcement Learning" - Mnih et al. (2013)
CS229 Lecture Notes on Reinforcement Learning
David Silver's RL Course (UCL)
Congratulations! You've completed all 14 CS229 lectures!