Lectures 18-20: Reinforcement Learning & MDPs¶

📹 Watch Lecture 18 | Watch Lecture 19 | Watch Lecture 20

From Andrew Ng's CS229 Lectures 18-20 (Autumn 2018)¶

"What I hope you learn today is how to apply reinforcement learning, even to continuous state or infinite state MDPs." - Andrew Ng

Reinforcement Learning Overview¶

Different from Supervised Learning:

  • No explicit labels (no "correct answer" given)

  • Learn from rewards and punishments

  • Agent interacts with environment

  • Goal: Maximize cumulative reward

Markov Decision Process (MDP)¶

Components:

  • S: Set of states

  • A: Set of actions

  • P_sa: Transition probabilities P(s'|s,a)

  • γ: Discount factor (0 ≤ γ < 1)

  • R: Reward function

Example from lecture:

  • 11-state grid world

  • Actions: North, South, East, West

  • Some states give rewards/penalties

  • Find optimal policy

Key Concepts¶

Value Function V^π(s):

"V^π was the value function for a policy π, which is the expected payoff if you execute that policy starting from a state s"

Optimal Value Function V*(s):

  • Best possible expected return from state s

  • V*(s) = max_π V^π(s)

Optimal Policy π*:

π*(s) = argmax_a [R(s,a) + γ Σ P_sa(s') V*(s')]

Or equivalently:

π*(s) = argmax_a [R(s,a) + γ E_{s'~P_sa} [V*(s')]]

Value Iteration Algorithm¶

Bellman's Equation:

V*(s) = max_a [R(s,a) + γ Σ P_sa(s') V*(s')]

Algorithm:

  1. Initialize V(s) = 0 for all states

  2. Repeat until convergence:

    V(s) := max_a [R(s,a) + γ Σ P_sa(s') V(s')]
    
  3. Extract policy: π(s) = argmax_a […]

Lecture 18: Continuous State MDPs¶

The Challenge:

"Everything we've done so far was built on the MDP having a finite set of states"

Self-Driving Car Example:

State space (continuous):

  • Position: (x, y) - latitude/longitude

  • Orientation: θ - angle relative to North

  • Velocities: (ẋ, ẏ) - velocity in x and y directions

  • Angular velocity: θ̇ - rate of turning

"It's up to you to decide what is the state-space you want to use to model this car… and if you are building a car to race in a racetrack, maybe it is important to model what is the temperature of the engine"

Approaches for Continuous States:

1. Discretization

  • Divide continuous space into grid

  • Each grid cell is a discrete state

  • Apply value iteration

  • Problem: Curse of dimensionality

2. Model-Based RL

  • Learn model of environment

  • Use model for planning

3. Fitted Value Iteration (main algorithm)

  • Function approximation for V*

  • Use supervised learning to fit V*
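Approach 1 (discretization) is easy to make concrete. The sketch below maps a continuous 2-D state to a grid-cell index with NumPy; the bin edges and dimensions are illustrative choices, not from the lecture:

```python
import numpy as np

# Discretize a continuous 2-D state (position x, velocity v) into grid cells.
# Bin edges here are illustrative choices, not from the lecture.
x_bins = np.linspace(-1.0, 1.0, 11)   # 10 cells for position
v_bins = np.linspace(-2.0, 2.0, 11)   # 10 cells for velocity

def discretize(x, v):
    """Map a continuous (x, v) state to a pair of grid-cell indices."""
    i = np.clip(np.digitize(x, x_bins) - 1, 0, len(x_bins) - 2)
    j = np.clip(np.digitize(v, v_bins) - 1, 0, len(v_bins) - 2)
    return int(i), int(j)

print(discretize(0.05, -1.9))   # → (5, 0)
```

With d state dimensions and k cells per dimension the resulting table has k^d entries, which is exactly the curse of dimensionality noted above.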

Lecture 19: Continuous State RL Algorithms¶

Topics:

  • Value function approximation

  • Linear function approximators

  • Neural network approximators

  • Q-learning for continuous states

  • Deep Q-Networks (DQN)

Lecture 20: Policy Search & REINFORCE¶

Policy Search:

  • Directly search in policy space

  • Don't need to compute V* explicitly

  • Gradient ascent on expected reward

REINFORCE Algorithm:

  • Policy gradient method

  • Monte Carlo approach

  • Works with continuous actions

Applications:

  • Robotics (walking, manipulation)

  • Game playing (Go, Chess, Atari)

  • Autonomous vehicles

  • Resource management

Key Algorithms Summary¶

| Algorithm        | Type         | When to Use                  |
|------------------|--------------|------------------------------|
| Value Iteration  | Value-based  | Discrete, known model        |
| Policy Iteration | Value-based  | Discrete, known model        |
| Q-Learning       | Value-based  | Model-free, discrete         |
| DQN              | Value-based  | Model-free, continuous state |
| REINFORCE        | Policy-based | Continuous action space      |
| Actor-Critic     | Hybrid       | General purpose              |

Advanced RL Theory: Deep Dive into Key Algorithms¶

1. DQN (Deep Q-Network): Mathematical Foundation¶

The Q-Function:

In Q-learning, we learn the action-value function:

\[Q^*(s, a) = \mathbb{E}\left[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots \mid s_t = s, a_t = a\right]\]

Bellman Optimality Equation:

\[Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]\]

DQN Innovation: Use neural network to approximate \(Q^*(s, a)\)

\[Q(s, a; \theta) \approx Q^*(s, a)\]

where \(\theta\) are network parameters.

Loss Function:

\[L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]\]

Key innovations:

  1. Experience Replay: Store transitions \((s, a, r, s')\) in replay buffer \(D\)

    • Break temporal correlations

    • Reuse experiences (sample efficiency)

    • Stabilize training

  2. Target Network: Use separate network with frozen weights \(\theta^-\)

    • Update every \(C\) steps: \(\theta^- \leftarrow \theta\)

    • Reduces moving target problem

    • Stabilizes learning

  3. Gradient:

\[\nabla_\theta L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right) \nabla_\theta Q(s, a; \theta)\right]\]

DQN Algorithm:

Initialize replay buffer D
Initialize Q-network with random weights θ
Initialize target network θ⁻ = θ

For episode = 1 to M:
    Initialize state s
    For t = 1 to T:
        With probability ε: select random action a
        Otherwise: a = argmax_a' Q(s, a'; θ)
        
        Execute action a, observe reward r, next state s'
        Store (s, a, r, s') in D
        
        Sample minibatch {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D
        
        Set target: yⱼ = rⱼ + γ max_a' Q(s'ⱼ, a'; θ⁻)
        
        Update: θ ← θ - α ∇_θ (yⱼ - Q(sⱼ, aⱼ; θ))²
        
        Every C steps: θ⁻ ← θ
        s ← s'
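To make the target computation concrete, here is a NumPy sketch that stands in Q-tables for the online and target networks (so θ and θ⁻ are just the table entries); the minibatch numbers are invented, and the terminal-state case (y = r when done) is made explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 2, 0.99, 0.1

# Stand-ins for the online and target "networks": plain Q-tables.
theta = rng.normal(size=(n_states, n_actions))   # online Q(s, a; theta)
theta_minus = theta.copy()                       # frozen target Q(s, a; theta-)

def td_targets(batch, Q_target):
    """y_j = r_j + gamma * max_a' Q_target(s'_j, a'), or just r_j if terminal."""
    s, a, r, s2, done = batch
    return r + gamma * (1 - done) * Q_target[s2].max(axis=1)

# A minibatch of transitions (s, a, r, s', done); the second one is terminal.
batch = (np.array([0, 1]), np.array([1, 0]),
         np.array([1.0, 0.0]), np.array([2, 3]), np.array([0, 1]))
y = td_targets(batch, theta_minus)

# Semi-gradient step on the sampled (s, a) entries only.
s, a = batch[0], batch[1]
theta[s, a] += alpha * (y - theta[s, a])
# Every C steps one would then copy: theta_minus = theta.copy()
```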

2. Policy Gradient Methods: REINFORCE¶

Problem with Q-learning: Cannot handle continuous action spaces easily.

Solution: Directly parameterize policy \(\pi_\theta(a|s)\)

Objective: Maximize expected return

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right]\]

where \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) is a trajectory.

Policy Gradient Theorem:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\right]\]

where \(R_t = \sum_{k=t}^T \gamma^{k-t} r_k\) is return from time \(t\).

Intuition:

  • If action led to a good outcome (\(R_t\) high) → increase its probability

  • If action led to a bad outcome (\(R_t\) low) → decrease its probability

  • Weight by gradient of log probability

REINFORCE Algorithm:

Initialize policy parameters θ

For episode = 1 to M:
    Generate trajectory τ = (s₀, a₀, r₀, ..., s_T, a_T, r_T) using π_θ
    
    For t = 0 to T:
        Compute return: R_t = Σ_{k=t}^T γ^(k-t) r_k
        
        Update: θ ← θ + α ∇_θ log π_θ(a_t|s_t) · R_t
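The update can be run end to end on a deliberately tiny problem. This sketch uses a one-state, two-action task (effectively a bandit, invented for illustration) with a softmax policy, for which ∇_θ log π_θ(a) = onehot(a) − π_θ:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One-state, two-action problem: action 1 pays +1, action 0 pays 0.
theta = np.zeros(2)   # policy logits
alpha = 0.5

for episode in range(200):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    R = 1.0 if a == 1 else 0.0        # the entire return of this 1-step episode
    grad_log_pi = -p                  # grad of log softmax: onehot(a) - p
    grad_log_pi[a] += 1.0
    theta += alpha * grad_log_pi * R  # theta <- theta + alpha * grad * R_t

# The policy should now strongly prefer the rewarding action.
```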

Variance Reduction: Baseline

High variance is main problem. Add baseline \(b(s_t)\):

\[\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (R_t - b(s_t))\right]\]

Common baseline: Value function \(V^\pi(s_t)\)

This gives us the advantage function:

\[A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)\]

3. Actor-Critic Methods¶

Combine best of both worlds:

  • Actor: Policy \(\pi_\theta(a|s)\)

  • Critic: Value function \(V_\phi(s)\) or \(Q_\phi(s,a)\)

Advantage Actor-Critic (A2C):

Actor update (policy gradient):

\[\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a|s) \cdot A(s, a)\]

Critic update (TD learning):

\[\phi \leftarrow \phi + \beta (r + \gamma V_\phi(s') - V_\phi(s)) \nabla_\phi V_\phi(s)\]

Advantage estimate:

\[A(s, a) = r + \gamma V_\phi(s') - V_\phi(s)\]

This is the TD error \(\delta\)!
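One full actor-critic update is short enough to spell out. The sketch below uses a tabular critic and a softmax actor on made-up numbers; note that the returned δ serves as both the critic's TD error and the actor's advantage estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative 2-state, 2-action setup (numbers are not from the lecture).
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax policy logits
V = np.zeros(n_states)                    # critic: state values
alpha, beta, gamma = 0.1, 0.5, 0.9

def a2c_step(s, a, r, s2):
    """Critic moves toward the TD target; actor is pushed by the TD error."""
    delta = r + gamma * V[s2] - V[s]          # TD error = advantage estimate
    V[s] += beta * delta                      # critic update
    grad_log_pi = -softmax(theta[s])          # grad log pi = onehot(a) - pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha * grad_log_pi * delta   # actor update
    return delta

delta = a2c_step(s=0, a=1, r=1.0, s2=1)   # one transition of experience
```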

4. Monte Carlo Tree Search (MCTS)¶

Used in: AlphaGo, AlphaZero, MuZero

Core Idea: Build search tree by simulation

Four Phases:

  1. Selection: Start from root, select actions using UCB1

\[\text{UCB1} = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}\]

  • First term: Exploitation (choose high-value actions)

  • Second term: Exploration (choose less-visited actions)

  • \(c\): Exploration constant (typically \(\sqrt{2}\))

  2. Expansion: Add new child node to tree

  3. Simulation: Play out to terminal state (random or guided)

  4. Backpropagation: Update values along path

\[Q(s, a) \leftarrow \frac{N(s, a) \cdot Q(s, a) + r}{N(s, a) + 1}\]
\[N(s, a) \leftarrow N(s, a) + 1\]

Here \(Q(s, a)\) is maintained as the running mean of simulation returns, which is what the exploitation term of UCB1 uses.
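The selection rule can be sketched directly, with Q(s, a) stored as a mean value as in the incremental backup above; giving unvisited children an infinite score so that each is tried at least once is a common convention, assumed here:

```python
import numpy as np

def ucb1(Q, N, c=np.sqrt(2)):
    """UCB1 scores for the children of one node.

    Q[a]: mean value of child a, N[a]: its visit count, N.sum(): parent visits.
    Unvisited children score +inf so each gets expanded at least once.
    """
    total = N.sum()
    with np.errstate(divide='ignore'):
        bonus = c * np.sqrt(np.log(max(total, 1)) / N)
    return np.where(N == 0, np.inf, Q + bonus)

Q = np.array([0.5, 0.6, 0.0])
N = np.array([10.0, 10.0, 0.0])
best = int(np.argmax(ucb1(Q, N)))   # the unvisited child (index 2) wins first
```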

MCTS Algorithm:

function MCTS(root_state, n_simulations):
    Initialize tree with root node
    
    For i = 1 to n_simulations:
        # 1. Selection
        node = root
        while node.is_fully_expanded() and not node.is_terminal():
            node = node.select_child_UCB1()
        
        # 2. Expansion
        if not node.is_terminal():
            node = node.expand()  # Add new child
        
        # 3. Simulation (rollout)
        reward = node.simulate_to_end()
        
        # 4. Backpropagation
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    
    # Return best action from root
    return root.best_child().action

UCB1 Derivation:

Comes from Hoeffding's inequality:

\[P\left(\left|\mu - \hat{\mu}\right| > \epsilon\right) \leq 2e^{-2n\epsilon^2}\]

Setting bound probability to \(t^{-4}\) and solving gives:

\[\text{UCB}(a) = \hat{\mu}_a + \sqrt{\frac{2 \ln t}{n_a}}\]

AlphaGo Enhancement:

Combines MCTS with neural networks:

  • Policy network: Guides selection (prior \(P(s, a)\))

  • Value network: Evaluates positions (replaces rollout)

\[\text{UCB}_{\text{AlphaGo}} = Q(s, a) + c \cdot P(s, a) \cdot \frac{\sqrt{N(s)}}{1 + N(s, a)}\]

5. Advanced Topics¶

Proximal Policy Optimization (PPO):

Problem: Policy gradient updates can take overly large steps → performance collapse

Solution: Constrain policy updates

\[L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\]

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the probability ratio.
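The clipped surrogate is easy to compute in NumPy. In this sketch the ratios and advantages are made-up values, chosen to show that the clip only binds in the direction that would otherwise increase the objective:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """L^CLIP = mean over samples of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv).mean()

# Four samples: ratio far above / below 1, with positive / negative advantage.
ratio = np.array([1.5, 1.5, 0.5, 0.5])
adv   = np.array([1.0, -1.0, 1.0, -1.0])
obj = ppo_clip_objective(ratio, adv)
# Per-sample terms: 1.2 (clipped), -1.5 (unclipped), 0.5 (unclipped), -0.8 (clipped)
```

Notice that when the advantage is negative the unclipped (more pessimistic) term is kept, so clipping never hides a bad update.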

Trust Region Policy Optimization (TRPO):

Constrained optimization:

\[\max_\theta \mathbb{E}[A_{\theta_{\text{old}}}(s, a)]\]
\[\text{subject to } \mathbb{E}[\text{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)] \leq \delta\]

Deep Deterministic Policy Gradient (DDPG):

For continuous actions:

  • Deterministic policy: \(a = \mu_\theta(s)\)

  • Critic: \(Q_\phi(s, a)\)

Algorithm Comparison:

| Algorithm | Value/Policy | On/Off-Policy | Action Space | Sample Efficiency |
|-----------|--------------|---------------|--------------|-------------------|
| DQN       | Value-based  | Off-policy    | Discrete     | High              |
| REINFORCE | Policy-based | On-policy     | Any          | Low               |
| A2C/A3C   | Actor-Critic | On-policy     | Any          | Medium            |
| PPO       | Actor-Critic | On-policy     | Any          | Medium            |
| TRPO      | Actor-Critic | On-policy     | Any          | Medium            |
| DDPG      | Actor-Critic | Off-policy    | Continuous   | High              |
| SAC       | Actor-Critic | Off-policy    | Continuous   | High              |

When to use:

  • Discrete actions + sample efficiency: DQN

  • Continuous actions + stability: PPO, SAC

  • Maximum performance (sample-rich): AlphaZero (MCTS + NN)

  • Simple baseline: REINFORCE with baseline

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Rectangle
from collections import defaultdict

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

14.1 GridWorld Environment¶

What and Why¶

A GridWorld is the canonical toy environment for reinforcement learning, where an agent navigates a discrete grid to reach a goal while avoiding obstacles. Despite its simplicity, GridWorld captures the essential RL structure: states (grid positions), actions (up/down/left/right), transitions (possibly stochastic), and rewards (positive at the goal, negative at obstacles, small step penalty elsewhere).

How It Works¶

The environment is formalized as a Markov Decision Process (MDP) defined by the tuple \((S, A, P, R, \gamma)\), where \(P(s'|s,a)\) gives transition probabilities and \(R(s,a,s')\) gives rewards. The agent's objective is to find a policy \(\pi(a|s)\) that maximizes the expected cumulative discounted reward \(\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]\).

Connection to ML¶

GridWorld environments are the standard testbed for all RL algorithms, from tabular methods (value iteration, Q-learning) to deep RL (DQN, policy gradient). The same MDP framework scales to robotics navigation, game playing, and resource allocation problems.

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.goal = (size-1, size-1)
        self.obstacles = [(1, 1), (2, 2), (3, 1)]
        self.reset()
        
        # Actions: up, down, left, right
        self.actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.action_names = ['Up', 'Down', 'Left', 'Right']
    
    def reset(self):
        self.state = (0, 0)
        return self.state
    
    def step(self, action_idx):
        """
        Take action, return (next_state, reward, done)
        """
        action = self.actions[action_idx]
        next_state = (self.state[0] + action[0], self.state[1] + action[1])
        
        # Check boundaries
        if (next_state[0] < 0 or next_state[0] >= self.size or
            next_state[1] < 0 or next_state[1] >= self.size):
            next_state = self.state  # Stay in place
            reward = -1
        # Check obstacles
        elif next_state in self.obstacles:
            next_state = self.state  # Stay in place
            reward = -1
        # Check goal
        elif next_state == self.goal:
            reward = 10
        else:
            reward = -0.1
        
        self.state = next_state
        done = (next_state == self.goal)
        
        return next_state, reward, done
    
    def visualize(self, values=None, policy=None):
        fig, ax = plt.subplots(figsize=(8, 8))
        
        # Draw grid
        for i in range(self.size):
            for j in range(self.size):
                # Color based on value
                if values is not None:
                    value = values.get((i, j), 0)
                    color = plt.cm.RdYlGn((value + 1) / 11)  # Normalize
                else:
                    color = 'white'
                
                rect = Rectangle((j, self.size-1-i), 1, 1, 
                                facecolor=color, edgecolor='black', linewidth=2)
                ax.add_patch(rect)
                
                # Mark obstacles
                if (i, j) in self.obstacles:
                    ax.text(j+0.5, self.size-1-i+0.5, 'X', 
                           ha='center', va='center', fontsize=20, fontweight='bold')
                # Mark goal
                elif (i, j) == self.goal:
                    ax.text(j+0.5, self.size-1-i+0.5, 'G', 
                           ha='center', va='center', fontsize=20, 
                           fontweight='bold', color='green')
                # Show value
                elif values is not None:
                    ax.text(j+0.5, self.size-1-i+0.5, f'{value:.1f}', 
                           ha='center', va='center', fontsize=10)
                
                # Show policy arrow
                if policy is not None and (i, j) not in self.obstacles and (i, j) != self.goal:
                    action_idx = policy.get((i, j), 0)
                    arrows = ['↑', '↓', '←', '→']
                    ax.text(j+0.5, self.size-1-i+0.2, arrows[action_idx], 
                           ha='center', va='center', fontsize=16, color='blue')
        
        ax.set_xlim(0, self.size)
        ax.set_ylim(0, self.size)
        ax.set_aspect('equal')
        ax.axis('off')
        plt.tight_layout()
        return fig

# Create and visualize environment
env = GridWorld(size=5)
fig = env.visualize()
plt.title('GridWorld Environment', fontsize=14, fontweight='bold', pad=20)
plt.show()

print("Environment created!")
print(f"Grid size: {env.size}x{env.size}")
print(f"Start: (0, 0)")
print(f"Goal: {env.goal}")
print(f"Obstacles: {env.obstacles}")

14.2 Value Iteration¶

Bellman Equations: Complete Mathematical Treatment¶

Bellman Expectation Equations¶

For a given policy \(\pi\):

Value function:

\[V^\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s\right]\]

Expanding:

\[V^\pi(s) = \sum_a \pi(a|s) \left[\mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V^\pi(s')\right]\]

Action-value function:

\[Q^\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V^\pi(s')\]

Or:

\[Q^\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \sum_{a'} \pi(a'|s') Q^\pi(s', a')\]

Relationship:

\[V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a)\]
\[Q^\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V^\pi(s')\]

Bellman Optimality Equations¶

Optimal value function:

\[V^*(s) = \max_\pi V^\pi(s) = \max_a Q^*(s, a)\]
\[V^*(s) = \max_a \left[\mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V^*(s')\right]\]

Optimal action-value function:

\[Q^*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} Q^*(s', a')\]

These are systems of nonlinear equations!

Proof: Value Iteration Converges¶

Theorem: Value iteration converges to \(V^*\).

Proof sketch:

Define Bellman operator \(\mathcal{T}\):

\[(\mathcal{T}V)(s) = \max_a \left[\mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V(s')\right]\]

Value iteration: \(V_{k+1} = \mathcal{T}V_k\)

Key property: \(\mathcal{T}\) is a contraction mapping with factor \(\gamma\):

\[\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty\]

Proof of contraction:

\[\begin{split}\begin{align} |(\mathcal{T}V_1)(s) - (\mathcal{T}V_2)(s)| &= \left|\max_a \mathbb{E}[\gamma V_1(s')] - \max_a \mathbb{E}[\gamma V_2(s')]\right| \\ &\leq \max_a \left|\mathbb{E}[\gamma V_1(s')] - \mathbb{E}[\gamma V_2(s')]\right| \\ &\leq \gamma \max_a \mathbb{E}[|V_1(s') - V_2(s')|] \\ &\leq \gamma \|V_1 - V_2\|_\infty \end{align}\end{split}\]

By Banach fixed-point theorem:

  • Unique fixed point \(V^*\)

  • \(V_k \to V^*\) exponentially fast

Convergence rate:

\[\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\]

Convergence is exponential in \(k\): the error shrinks by a factor \(\gamma\) every iteration!
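The \(\gamma^k\) bound can be checked numerically on a small synthetic MDP with known dynamics; everything below (sizes, random rewards, and random transitions) is invented purely for the sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over s'
R = rng.normal(size=(nS, nA))                   # random rewards R(s, a)

def bellman(V):
    """(T V)(s) = max_a [R(s,a) + gamma * E_{s'~P_sa} V(s')]."""
    return (R + gamma * P @ V).max(axis=1)

# Find the fixed point V* by iterating T to (numerical) convergence.
V_star = np.zeros(nS)
for _ in range(1000):
    V_star = bellman(V_star)

# Track the sup-norm error of value iteration started from V_0 = 0.
V, errs = np.zeros(nS), []
for k in range(20):
    errs.append(np.abs(V - V_star).max())
    V = bellman(V)
# Contraction guarantees errs[k+1] <= gamma * errs[k] at every step.
```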

Policy Iteration: Guaranteed Improvement¶

Algorithm:

  1. Policy Evaluation: Compute \(V^\pi\)

  2. Policy Improvement: \(\pi' = \text{greedy}(V^\pi)\)

  3. Repeat until convergence

Theorem (Policy Improvement):

If \(\pi'(s) = \arg\max_a Q^\pi(s, a)\), then:

\[V^{\pi'}(s) \geq V^\pi(s) \quad \forall s\]

Proof:

\[\begin{split}\begin{align} V^\pi(s) &\leq Q^\pi(s, \pi'(s)) && \text{(greedy improvement)} \\ &= \mathbb{E}[r + \gamma V^\pi(s') \mid s, \pi'(s)] \\ &\leq \mathbb{E}[r + \gamma Q^\pi(s', \pi'(s')) \mid s, \pi'(s)] && \text{(apply again)} \\ &= \mathbb{E}[r + \gamma r' + \gamma^2 V^\pi(s'') \mid \ldots] \\ &\leq \ldots \\ &= V^{\pi'}(s) \end{align}\end{split}\]

Convergence: Finite MDPs → finitely many policies → must converge to \(\pi^*\)

Policy iteration vs Value iteration:

| Aspect        | Value Iteration   | Policy Iteration  |
|---------------|-------------------|-------------------|
| Update        | One sweep         | Full evaluation   |
| Per iteration | Fast              | Slow              |
| Convergence   | Many iterations   | Few iterations    |
| When better   | Large state space | Small state space |

Temporal Difference (TD) Learning Theory¶

TD(0) update:

\[V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]\]

TD error:

\[\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\]

Theorem: TD(0) converges to \(V^\pi\) (under conditions):

  • Robbins-Monro: \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)

  • Every state visited infinitely often
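Under those conditions TD(0) really does home in on \(V^\pi\); here is a sketch on the classic 5-state random-walk chain, whose true values are known in closed form (the chain is a standard textbook example, not from the lecture):

```python
import numpy as np

# TD(0) policy evaluation on the 5-state random-walk chain: states 0..4,
# episodes start at 2, step left/right with equal probability, reward +1 for
# exiting right and 0 otherwise. True values: V(s) = (s + 1) / 6.
rng = np.random.default_rng(0)
n_states, alpha, gamma = 5, 0.05, 1.0
V = np.full(n_states, 0.5)

for episode in range(5000):
    s = 2
    while True:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        if s2 < 0:                               # exit left: r = 0, terminal
            V[s] += alpha * (0.0 - V[s])
            break
        if s2 >= n_states:                       # exit right: r = +1, terminal
            V[s] += alpha * (1.0 - V[s])
            break
        V[s] += alpha * (gamma * V[s2] - V[s])   # TD(0) update, r = 0 inside
        s = s2

true_V = np.arange(1, n_states + 1) / (n_states + 1)
```

With a small constant step size the estimates settle into a noise ball around the true values rather than converging exactly, which is why the Robbins-Monro conditions require a decaying \(\alpha_t\).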

Bias-Variance Trade-off:

| Method      | Update Target | Bias | Variance |
|-------------|---------------|------|----------|
| Monte Carlo | \(G_t = \sum_k \gamma^k r_{t+k}\) | None | High |
| TD(0)       | \(r_{t+1} + \gamma V(s_{t+1})\) | Yes | Low |
| n-step TD   | \(\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})\) | Medium | Medium |

TD(\(\lambda\)): Unified View

Combines all n-step returns:

\[V(s_t) \leftarrow V(s_t) + \alpha [G_t^\lambda - V(s_t)]\]

where:

\[G_t^\lambda = (1-\lambda) \sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}\]
  • \(\lambda = 0\): TD(0)

  • \(\lambda = 1\): Monte Carlo

  • \(\lambda \in (0,1)\): Trade-off

Exploration vs Exploitation¶

\(\epsilon\)-greedy:

\[\begin{split}\pi(a|s) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{|A|} & \text{otherwise} \end{cases}\end{split}\]

Boltzmann (softmax) exploration:

\[\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)}\]
  • \(\tau \to 0\): Greedy

  • \(\tau \to \infty\): Uniform

  • \(\tau\): Temperature parameter

UCB (Upper Confidence Bound):

\[a_t = \arg\max_a \left[Q(a) + c\sqrt{\frac{\ln t}{N(a)}}\right]\]

Optimism in the face of uncertainty!
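A minimal UCB bandit loop might look like this; the arm means, horizon, and exploration constant are illustrative choices:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a [Q(a) + c*sqrt(ln t / N(a))], trying each arm once first."""
    if np.any(N == 0):
        return int(np.argmin(N))
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

# UCB on a 3-armed Gaussian bandit (arm means invented for illustration).
rng = np.random.default_rng(0)
means = np.array([0.1, 0.9, 0.4])
K, T = 3, 2000
Q, N = np.zeros(K), np.zeros(K)

for t in range(1, T + 1):
    a = ucb_action(Q, N, t)
    r = rng.normal(means[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # incremental sample-mean update
# The best arm (index 1) should end up with most of the pulls.
```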

Regret bounds:

For \(K\)-armed bandit with UCB:

\[\mathbb{E}[\text{Regret}_T] = O(\sqrt{KT \ln T})\]

Comparison:

| Method | Regret | Pros | Cons |
|--------|--------|------|------|
| Random | \(O(T)\) | Simple | Poor |
| \(\epsilon\)-greedy | \(O(T)\) | Simple | Linear regret |
| \(\epsilon\)-decreasing | \(O(\log T)\) | Better | Tuning needed |
| UCB | \(O(\sqrt{T \log T})\) | Theoretical guarantee | Stationary only |
| Thompson | \(O(\sqrt{T})\) | Bayesian, great empirically | Complex |

def value_iteration(env, gamma=0.9, theta=0.001, max_iterations=1000):
    """
    Value Iteration algorithm
    """
    # Initialize value function
    V = defaultdict(float)
    
    for iteration in range(max_iterations):
        delta = 0
        
        # For each state
        for i in range(env.size):
            for j in range(env.size):
                state = (i, j)
                
                # Skip terminal and obstacle states
                if state == env.goal or state in env.obstacles:
                    continue
                
                v = V[state]
                
                # Compute value for each action
                action_values = []
                for action_idx in range(len(env.actions)):
                    env.state = state
                    next_state, reward, done = env.step(action_idx)
                    action_value = reward + gamma * V[next_state]
                    action_values.append(action_value)
                
                # Bellman update: take max over actions
                V[state] = max(action_values)
                delta = max(delta, abs(v - V[state]))
        
        if delta < theta:
            print(f"Value iteration converged in {iteration+1} iterations")
            break
    
    # Extract policy
    policy = {}
    for i in range(env.size):
        for j in range(env.size):
            state = (i, j)
            if state == env.goal or state in env.obstacles:
                continue
            
            action_values = []
            for action_idx in range(len(env.actions)):
                env.state = state
                next_state, reward, done = env.step(action_idx)
                action_value = reward + gamma * V[next_state]
                action_values.append(action_value)
            
            policy[state] = np.argmax(action_values)
    
    return V, policy

# Run value iteration
V_star, policy_star = value_iteration(env, gamma=0.9)

# Visualize results
fig = env.visualize(values=V_star, policy=policy_star)
plt.title('Value Iteration Results\n(Values and Optimal Policy)', 
         fontsize=14, fontweight='bold', pad=20)
plt.show()

print("\nOptimal Value Function:")
for i in range(env.size):
    for j in range(env.size):
        print(f"{V_star[(i,j)]:6.2f}", end=" ")
    print()

14.3 Q-Learning¶

Advanced Q-Learning Variants and Improvements¶

Double Q-Learning: Addressing Overestimation Bias¶

Problem with standard Q-learning:

\[Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')\]

The \(\max\) operator causes positive bias:

\[\mathbb{E}[\max(X_1, X_2, \ldots)] \geq \max(\mathbb{E}[X_1], \mathbb{E}[X_2], \ldots)\]

Example: Two actions, true values both 0, estimates \(N(0, 1)\):

\[\mathbb{E}[\max(Q_1, Q_2)] \approx 0.56 > 0\]

This accumulates over iterations!

Double Q-Learning Solution:

Maintain two Q-functions \(Q_A\) and \(Q_B\).

Update \(Q_A\): (with probability 0.5)

\[Q_A(s, a) \leftarrow r + \gamma Q_B(s', \arg\max_{a'} Q_A(s', a'))\]

Update \(Q_B\): (with probability 0.5)

\[Q_B(s, a) \leftarrow r + \gamma Q_A(s', \arg\max_{a'} Q_B(s', a'))\]

Key insight:

  • Use \(Q_A\) to select action

  • Use \(Q_B\) to evaluate action

  • Decorrelates selection and evaluation!

Final policy:

\[\pi(s) = \arg\max_a [Q_A(s, a) + Q_B(s, a)]\]

Double DQN:

\[y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)\]

Use online network \(\theta\) to select, target network \(\theta^-\) to evaluate.
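A tabular sketch of one Double Q-learning step under the two-table scheme above (table sizes, the transition, and the step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
Q_A = np.zeros((n_states, n_actions))
Q_B = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

def double_q_update(s, a, r, s2, done):
    """Flip a coin: select the greedy action with one table, evaluate with the other."""
    if rng.random() < 0.5:
        Q_sel, Q_eval = Q_A, Q_B
    else:
        Q_sel, Q_eval = Q_B, Q_A
    a_star = np.argmax(Q_sel[s2])                            # selection
    target = r if done else r + gamma * Q_eval[s2, a_star]   # evaluation
    Q_sel[s, a] += alpha * (target - Q_sel[s, a])

def policy(s):
    """Act greedily with respect to the sum of both tables."""
    return int(np.argmax(Q_A[s] + Q_B[s]))

double_q_update(s=0, a=1, r=1.0, s2=2, done=False)
```

Whichever table the coin flip picks, only that table is updated, so the two estimators stay (approximately) independent.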

Dueling DQN: Separating Value and Advantage¶

Decomposition:

\[Q(s, a) = V(s) + A(s, a)\]

where:

  • \(V(s)\): Value of being in state \(s\)

  • \(A(s, a)\): Advantage of action \(a\) over others

Problem: Not unique! Can add constant to \(V\), subtract from \(A\).

Solution: Force identifiability constraint:

\[Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha)\right)\]

Or (better empirically):

\[Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|A|}\sum_{a'} A(s, a')\right)\]
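The mean-subtraction aggregation is a one-liner; the sketch below applies it to made-up value and advantage streams:

```python
import numpy as np

def dueling_q(V, A):
    """Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')).

    V: shape (batch,), A: shape (batch, n_actions). The subtraction pins the
    mean advantage to zero, making the V/A decomposition identifiable.
    """
    return V[:, None] + (A - A.mean(axis=1, keepdims=True))

V = np.array([2.0])
A = np.array([[1.0, 3.0, -1.0]])   # mean advantage = 1.0
Q = dueling_q(V, A)                # [[2.0, 4.0, 0.0]]
```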

Architecture:

Input state s
    ↓
  CNN/FC layers (shared)
    ↓
   ├─────────┬─────────┐
   ↓         ↓         ↓
 V(s)    A(s,a₁)   A(s,a₂) ...
   ↓         ↓         ↓
   └─────────┴─────────┘
            ↓
     Q(s,a) = V + (A - mean(A))

Advantages:

✅ Learns value function explicitly (useful when actions don't matter much)

✅ Better gradient flow

✅ Improved performance on Atari

Prioritized Experience Replay¶

Standard experience replay: Uniform sampling from buffer

Problem: Not all experiences equally important!

Solution: Sample based on TD error magnitude

\[p_i = |\delta_i| + \epsilon\]

where \(\delta_i = r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\)

Prioritization with importance sampling:

Sampling probability:

\[P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}\]
  • \(\alpha = 0\): Uniform (standard)

  • \(\alpha = 1\): Full prioritization

  • Typical: \(\alpha = 0.6\)

Importance sampling weights:

\[w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta\]
  • Corrects bias from non-uniform sampling

  • \(\beta\): Annealed from \(\beta_0\) to 1

  • Typical: \(\beta_0 = 0.4\), anneal to 1

Loss:

\[L = \frac{1}{N}\sum_i w_i \delta_i^2\]

Update priorities after gradient step:

\[p_i \leftarrow |\delta_i| + \epsilon\]
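Proportional prioritized sampling with the importance-sampling correction can be sketched as follows; normalizing the weights by their maximum is a common implementation trick (an assumption here), and the TD errors are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample indices with P(i) proportional to (|delta_i| + eps)^alpha."""
    p = np.abs(td_errors) + eps            # priorities p_i
    P = p**alpha / (p**alpha).sum()        # sampling distribution P(i)
    idx = rng.choice(len(P), size=batch_size, p=P)
    w = (1.0 / (len(P) * P[idx]))**beta    # importance-sampling weights
    return idx, w / w.max()                # max-normalize for stability

td_errors = np.array([0.01, 2.0, 0.5, 0.0])
idx, w = sample_prioritized(td_errors, batch_size=32)
# The high-error transition (index 1) should dominate the minibatch.
```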

Results:

  • 49 out of 57 Atari games: improvement

  • Especially good for sparse rewards

Rainbow DQN: Combining Improvements¶

Combines 6 extensions:

  1. Double DQN: Reduce overestimation

  2. Dueling: Separate V and A

  3. Prioritized replay: Important transitions

  4. Multi-step returns: \(n\)-step TD targets

  5. Distributional RL: Learn return distribution

  6. Noisy Nets: Exploration via parameter noise

Multi-step Learning:

Instead of 1-step TD:

\[y = r_t + \gamma \max_{a'} Q(s_{t+1}, a')\]

Use \(n\)-step return:

\[y = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s_{t+n}, a')\]

Bias-variance trade-off:

  • Small \(n\): Low variance, high bias (bootstraps more)

  • Large \(n\): High variance, low bias (like Monte Carlo)

  • Typical: \(n = 3\)
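The n-step target is a short computation; the rewards, bootstrap value, and γ below are made up to show the arithmetic:

```python
import numpy as np

def n_step_target(rewards, q_boot, gamma=0.99):
    """y = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n * max_a' Q(s_{t+n}, a').

    rewards: the n observed rewards; q_boot: max_a' Q(s_{t+n}, a').
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma**n * q_boot)

y = n_step_target(rewards=[1.0, 0.0, 2.0], q_boot=5.0, gamma=0.5)
# y = 1 + 0.5*0 + 0.25*2 + 0.125*5 = 2.125
```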

Distributional RL (C51):

Instead of \(Q(s, a) = \mathbb{E}[Z]\), learn full distribution \(Z\).

Approximate with \(N\) atoms:

\[Z(s, a) = \sum_{i=1}^N p_i(s, a) \delta_{z_i}\]

where \(z_i \in [V_{\min}, V_{\max}]\) are support points.

Loss: Cross-entropy between distributions

Why better?

  • Captures uncertainty

  • Richer learning signal

  • Better performance

Noisy Nets:

Add noise to network parameters:

\[y = (\mu^w + \sigma^w \odot \epsilon^w) x + \mu^b + \sigma^b \odot \epsilon^b\]

where:

  • \(\mu^w, \mu^b\): Learned mean

  • \(\sigma^w, \sigma^b\): Learned std

  • \(\epsilon^w, \epsilon^b \sim \mathcal{N}(0, 1)\): Noise

Advantages over \(\epsilon\)-greedy:

  • State-dependent exploration

  • Automatic annealing (learned \(\sigma\) decreases)

  • No hyperparameter tuning

Rainbow Performance:

Median human-normalized score on Atari:

  • DQN: 100%

  • Rainbow: 227%

Ablation study shows all components help!

Practical Implementation Tips¶

Hyperparameters (DQN on Atari):

| Parameter | Value | Purpose |
|-----------|-------|---------|
| Replay buffer | 1M | Store transitions |
| Batch size | 32 | Mini-batch |
| Learning rate | 0.00025 | Adam optimizer |
| \(\gamma\) | 0.99 | Discount factor |
| \(\epsilon\) start | 1.0 | Initial exploration |
| \(\epsilon\) end | 0.1 | Final exploration |
| \(\epsilon\) decay | 1M steps | Anneal schedule |
| Target update | 10K steps | Freeze target |
| Frame stack | 4 | Temporal context |
| Frame skip | 4 | Action repeat |

Network Architecture:

Input: 84×84×4 (stacked frames)
    ↓
Conv1: 32 filters, 8×8, stride 4, ReLU
    ↓
Conv2: 64 filters, 4×4, stride 2, ReLU
    ↓
Conv3: 64 filters, 3×3, stride 1, ReLU
    ↓
Flatten → FC: 512 units, ReLU
    ↓
Output: |A| Q-values (linear)

Training Tricks:

✅ Gradient clipping: Clip \(\delta\) to \([-1, 1]\) or use Huber loss

✅ Reward clipping: Clip rewards to \([-1, 1]\) for stability

✅ Frame preprocessing: Grayscale, resize, normalize

✅ Action repeat: Skip frames (4×) for efficiency

✅ Burn-in: Fill replay buffer before training

✅ Evaluation: Separate evaluation episodes with \(\epsilon = 0.05\)

Common Pitfalls:

❌ Instability: Try lower learning rate, smaller network

❌ Slow learning: Check exploration, increase buffer size

❌ Catastrophic forgetting: Increase replay buffer

❌ Overestimation: Use Double DQN

❌ Sparse rewards: Try curiosity-driven exploration or reward shaping

class QLearningAgent:
    def __init__(self, n_actions, learning_rate=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(lambda: np.zeros(n_actions))
        self.lr = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions
    
    def get_action(self, state, training=True):
        """Epsilon-greedy action selection"""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q[state])
    
    def update(self, state, action, reward, next_state, done):
        """Q-learning update"""
        current_q = self.Q[state][action]
        
        if done:
            target_q = reward
        else:
            target_q = reward + self.gamma * np.max(self.Q[next_state])
        
        # Q-learning update
        self.Q[state][action] = current_q + self.lr * (target_q - current_q)

# Train Q-learning agent
agent = QLearningAgent(n_actions=4, learning_rate=0.1, gamma=0.9, epsilon=0.1)

n_episodes = 1000
episode_rewards = []
episode_lengths = []

for episode in range(n_episodes):
    state = env.reset()
    total_reward = 0
    steps = 0
    
    for step in range(100):  # Max 100 steps per episode
        action = agent.get_action(state, training=True)
        next_state, reward, done = env.step(action)
        
        agent.update(state, action, reward, next_state, done)
        
        state = next_state
        total_reward += reward
        steps += 1
        
        if done:
            break
    
    episode_rewards.append(total_reward)
    episode_lengths.append(steps)
    
    if (episode + 1) % 200 == 0:
        avg_reward = np.mean(episode_rewards[-100:])
        avg_length = np.mean(episode_lengths[-100:])
        print(f"Episode {episode+1}: Avg Reward={avg_reward:.2f}, Avg Steps={avg_length:.1f}")

# Plot learning curves
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Smooth rewards
window = 50
smoothed_rewards = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
smoothed_lengths = np.convolve(episode_lengths, np.ones(window)/window, mode='valid')

axes[0].plot(smoothed_rewards, linewidth=2)
axes[0].set_xlabel('Episode', fontsize=12)
axes[0].set_ylabel('Total Reward', fontsize=12)
axes[0].set_title('Q-Learning: Reward per Episode', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

axes[1].plot(smoothed_lengths, linewidth=2, color='orange')
axes[1].set_xlabel('Episode', fontsize=12)
axes[1].set_ylabel('Steps to Goal', fontsize=12)
axes[1].set_title('Q-Learning: Episode Length', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

14.4 Learned Q-Values and PolicyΒΆ

What and WhyΒΆ

After Q-learning converges, the learned Q-values \(Q(s,a)\) encode the expected cumulative reward for taking action \(a\) in state \(s\) and following the optimal policy thereafter. Visualizing Q-values as a heatmap reveals how the agent values different state-action pairs, while the greedy policy \(\pi^*(s) = \arg\max_a Q(s,a)\) shows the optimal action at each state.

How It WorksΒΆ

The code extracts the maximum Q-value at each grid position and displays it as a color-coded heatmap, with warmer colors indicating higher-value states (closer to the goal via safe paths). Arrows overlay each cell showing the greedy action direction. States near the goal have high values that decay with distance, reflecting the discount factor \(\gamma\).

Connection to MLΒΆ

Interpreting learned value functions is critical for debugging and trusting RL agents in practice. The same visualization principles apply to deep Q-networks (DQN) where value functions are approximated by neural networks, and to actor-critic methods where both value and policy are learned.

# Extract learned values and policy
learned_values = {}
learned_policy = {}

for i in range(env.size):
    for j in range(env.size):
        state = (i, j)
        if state not in env.obstacles and state != env.goal:
            learned_values[state] = np.max(agent.Q[state])
            learned_policy[state] = np.argmax(agent.Q[state])

# Visualize
fig = env.visualize(values=learned_values, policy=learned_policy)
plt.title('Q-Learning Results\n(Learned Values and Policy)', 
         fontsize=14, fontweight='bold', pad=20)
plt.show()

# Test learned policy
print("\nTesting Learned Policy:")
state = env.reset()
path = [state]

for step in range(20):
    action = agent.get_action(state, training=False)
    next_state, reward, done = env.step(action)
    path.append(next_state)
    
        print(f"Step {step+1}: {state} → {env.action_names[action]} → {next_state} (reward={reward:.1f})")
    
    state = next_state
    if done:
        print(f"\nReached goal in {step+1} steps!")
        break

Key TakeawaysΒΆ

1. Markov Decision Process (MDP)ΒΆ

  • Framework for sequential decision making

  • Components: States, Actions, Transitions, Rewards

  • Markov property: the future depends only on the current state

  • Goal: find a policy π that maximizes cumulative reward

2. Value FunctionsΒΆ

  • State value: V(s) = expected cumulative reward starting from state s

  • Action value: Q(s,a) = expected cumulative reward from taking action a in state s

  • Bellman equation: recursive relationship between a state's value and its successors' values

  • Optimal policy: choose the action with the maximum Q-value

3. Value IterationΒΆ

  • Model-based: requires knowledge of the transition probabilities P(s'|s,a)

  • Iteratively update V(s) using the Bellman equation

  • Converges to the optimal value function V*

  • Extract the policy: π(s) = argmax_a Q(s,a)
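The steps above can be sketched on a toy MDP. The transition and reward arrays below are illustrative assumptions (a 3-state chain with an absorbing goal state), not the lecture's grid world:

```python
import numpy as np

# Toy 3-state, 2-action MDP (numbers are illustrative assumptions).
# P[a, s, s'] = P(s' | s, a); state 2 is absorbing and pays reward 1.
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],  # action 0
    [[0.2, 0.8, 0.0], [0.2, 0.0, 0.8], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])  # R[s, a]
gamma = 0.9

V = np.zeros(3)
for _ in range(1000):
    # Bellman backup: Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop when the sweep barely changes V
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy extraction: pi(s) = argmax_a Q(s, a)
print(V, policy)
```

Because the backup is a γ-contraction, the loop converges regardless of the initialization; states closer to the absorbing goal end up with higher values.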

4. Q-LearningΒΆ

  • Model-free: learns from experience (no transition model needed)

  • Off-policy: learns the optimal Q* while following an ε-greedy behavior policy

  • Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]

  • Exploration vs. exploitation: ε-greedy balances both

5. Key DifferencesΒΆ

| Aspect      | Value Iteration             | Q-Learning                   |
|-------------|-----------------------------|------------------------------|
| Model       | Required                    | Not required                 |
| Updates     | All states                  | Visited states               |
| Convergence | Guaranteed                  | Guaranteed (with conditions) |
| Speed       | Fast with small state space | Slow; needs many episodes    |

6. Exploration StrategiesΒΆ

  • ε-greedy: take a random action with probability ε, the greedy action otherwise

  • Softmax: sample actions from a Boltzmann distribution over Q-values

  • Upper Confidence Bound (UCB): add a bonus for rarely visited state-action pairs

  • Exploration is essential for finding the optimal policy
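The softmax (Boltzmann) strategy can be sketched as follows; the temperature `tau` is a standard knob (an assumption here, not from the lecture) that interpolates between near-uniform exploration (high `tau`) and near-greedy selection (low `tau`):

```python
import numpy as np

def softmax_action(q_values, tau=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q(s,a)/tau)."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()          # subtract max so exp() cannot overflow
    probs = np.exp(prefs)
    probs /= probs.sum()          # normalize to a proper distribution
    return rng.choice(len(probs), p=probs)
```

Unlike ε-greedy, which explores all non-greedy actions uniformly, softmax explores in proportion to how promising each action currently looks.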

7. ExtensionsΒΆ

  • SARSA: on-policy TD learning

  • Double Q-Learning: reduces overestimation bias

  • DQN: deep neural network approximates the Q-function

  • Policy Gradient: directly optimizes the policy

  • Actor-Critic: combines value and policy methods

  • A3C, PPO, SAC: modern deep RL algorithms
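SARSA differs from Q-learning only in the bootstrap term: it uses the action a' the behavior policy actually took next, rather than max_a' Q(s',a'). A minimal sketch (the dict-of-arrays Q-table layout is an assumption mirroring the agent above):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.9):
    """On-policy TD update: Q(s,a) += alpha * [r + gamma*Q(s',a') - Q(s,a)].

    Q-learning would bootstrap on max(Q[s_next]); SARSA bootstraps on
    Q[s_next][a_next], the action the (e.g. epsilon-greedy) policy chose.
    """
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Because SARSA evaluates the policy it actually follows (exploration included), it tends to learn more conservative paths than Q-learning near dangerous states.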

8. Practical TipsΒΆ

  • Start with simple environments (GridWorld)

  • Tune the learning rate α and the discount factor γ

  • Balance exploration (ε) carefully, e.g. by decaying it over training

  • Use experience replay for stability

  • Normalize rewards when possible
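The experience replay mentioned above can be sketched as a fixed-capacity buffer (the class name and default capacity are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions.

    Sampling uniformly from past experience breaks the correlation
    between consecutive transitions, which stabilizes updates.
    """
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

In a DQN-style loop the agent would `push` each transition, then train on a random `sample` mini-batch instead of the most recent step.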

ReferencesΒΆ

  1. Sutton, R. S., & Barto, A. G. "Reinforcement Learning: An Introduction" (2nd edition)

  2. Mnih et al. "Playing Atari with Deep Reinforcement Learning" (2013)

  3. CS229 Lecture Notes on Reinforcement Learning

  4. David Silver's RL Course (UCL)

Congratulations! You've completed all 14 CS229 lectures! 🎉