Lectures 18-20: Reinforcement Learning & MDPs¶

📹 Watch Lecture 18 | Watch Lecture 19 | Watch Lecture 20

From Andrew Ng's CS229 Lectures 18-20 (Autumn 2018)¶

"What I hope you learn today is how to apply reinforcement learning, even to continuous state or infinite state MDPs." - Andrew Ng

Reinforcement Learning Overview¶

Different from Supervised Learning:

  • No explicit labels (no "correct answer" given)

  • Learn from rewards and punishments

  • Agent interacts with environment

  • Goal: Maximize cumulative reward

Markov Decision Process (MDP)¶

Components:

  • S: Set of states

  • A: Set of actions

  • P_sa: Transition probabilities P(s'|s,a)

  • γ: Discount factor (0 ≤ γ < 1)

  • R: Reward function

Example from lecture:

  • 11-state grid world

  • Actions: North, South, East, West

  • Some states give rewards/penalties

  • Find optimal policy

Key Concepts¶

Value Function V^π(s):

"V^π was the value function for a policy π, which is the expected payoff if you execute that policy starting from a state s"

Optimal Value Function V*(s):

  • Best possible expected return from state s

  • V*(s) = max_π V^π(s)

Optimal Policy π*:

π*(s) = argmax_a [R(s,a) + γ Σ P_sa(s') V*(s')]

Or equivalently:

π*(s) = argmax_a [R(s,a) + γ E_{s'~P_sa} [V*(s')]]

Value Iteration Algorithm¶

Bellman's Equation:

V*(s) = max_a [R(s,a) + γ Σ P_sa(s') V*(s')]

Algorithm:

  1. Initialize V(s) = 0 for all states

  2. Repeat until convergence:

    V(s) := max_a [R(s,a) + γ Σ P_sa(s') V(s')]
    
  3. Extract policy: π(s) = argmax_a […]

Lecture 18: Continuous State MDPs¶

The Challenge:

"Everything we've done so far was built on the MDP having a finite set of states"

Self-Driving Car Example:

State space (continuous):

  • Position: (x, y) - latitude/longitude

  • Orientation: θ - angle relative to North

  • Velocities: (ẋ, ẏ) - velocity in x and y directions

  • Angular velocity: θ̇ - rate of turning

"It's up to you to decide what is the state-space you want to use to model this car… and if you are building a car to race in a racetrack, maybe it is important to model what is the temperature of the engine"

Approaches for Continuous States:

1. Discretization

  • Divide continuous space into grid

  • Each grid cell is a discrete state

  • Apply value iteration

  • Problem: Curse of dimensionality

2. Model-Based RL

  • Learn model of environment

  • Use model for planning

3. Fitted Value Iteration (main algorithm)

  • Function approximation for V*

  • Use supervised learning to fit V*
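Approach 1 (discretization) is easy to make concrete. The sketch below maps a continuous 2-D state to a grid-cell index with NumPy; the bin edges and dimensions are illustrative choices, not from the lecture:

```python
import numpy as np

# Discretize a continuous 2-D state (position x, velocity v) into grid cells.
# Bin edges here are illustrative choices, not from the lecture.
x_bins = np.linspace(-1.0, 1.0, 11)   # 10 cells for position
v_bins = np.linspace(-2.0, 2.0, 11)   # 10 cells for velocity

def discretize(x, v):
    """Map a continuous (x, v) state to a pair of grid-cell indices."""
    i = np.clip(np.digitize(x, x_bins) - 1, 0, len(x_bins) - 2)
    j = np.clip(np.digitize(v, v_bins) - 1, 0, len(v_bins) - 2)
    return int(i), int(j)

print(discretize(0.05, -1.9))   # → (5, 0)
```

With d state dimensions and k cells per dimension the resulting table has k^d entries, which is exactly the curse of dimensionality noted above.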

Lecture 19: Continuous State RL Algorithms¶

Topics:

  • Value function approximation

  • Linear function approximators

  • Neural network approximators

  • Q-learning for continuous states

  • Deep Q-Networks (DQN)

Lecture 20: Policy Search & REINFORCE¶

Policy Search:

  • Directly search in policy space

  • Don't need to compute V* explicitly

  • Gradient ascent on expected reward

REINFORCE Algorithm:

  • Policy gradient method

  • Monte Carlo approach

  • Works with continuous actions

Applications:

  • Robotics (walking, manipulation)

  • Game playing (Go, Chess, Atari)

  • Autonomous vehicles

  • Resource management

Key Algorithms Summary¶

| Algorithm        | Type         | When to Use                  |
|------------------|--------------|------------------------------|
| Value Iteration  | Value-based  | Discrete, known model        |
| Policy Iteration | Value-based  | Discrete, known model        |
| Q-Learning       | Value-based  | Model-free, discrete         |
| DQN              | Value-based  | Model-free, continuous state |
| REINFORCE        | Policy-based | Continuous action space      |
| Actor-Critic     | Hybrid       | General purpose              |

Advanced RL Theory: Deep Dive into Key Algorithms¶

1. DQN (Deep Q-Network): Mathematical Foundation¶

The Q-Function:

In Q-learning, we learn the action-value function:

\[Q^*(s, a) = \mathbb{E}\left[R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \ldots \mid s_t = s, a_t = a\right]\]

Bellman Optimality Equation:

\[Q^*(s, a) = \mathbb{E}_{s' \sim P}\left[r + \gamma \max_{a'} Q^*(s', a') \mid s, a\right]\]

DQN Innovation: Use neural network to approximate \(Q^*(s, a)\)

\[Q(s, a; \theta) \approx Q^*(s, a)\]

where \(\theta\) are network parameters.

Loss Function:

\[L(\theta) = \mathbb{E}_{(s,a,r,s') \sim D}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]\]

Key innovations:

  1. Experience Replay: Store transitions \((s, a, r, s')\) in replay buffer \(D\)

    • Break temporal correlations

    • Reuse experiences (sample efficiency)

    • Stabilize training

  2. Target Network: Use separate network with frozen weights \(\theta^-\)

    • Update every \(C\) steps: \(\theta^- \leftarrow \theta\)

    • Reduces moving target problem

    • Stabilizes learning

  3. Gradient:

\[\nabla_\theta L(\theta) = \mathbb{E}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right) \nabla_\theta Q(s, a; \theta)\right]\]

DQN Algorithm:

Initialize replay buffer D
Initialize Q-network with random weights θ
Initialize target network θ⁻ = θ

For episode = 1 to M:
    Initialize state s
    For t = 1 to T:
        With probability ε: select random action a
        Otherwise: a = argmax_a' Q(s, a'; θ)
        
        Execute action a, observe reward r, next state s'
        Store (s, a, r, s') in D
        
        Sample minibatch {(sⱼ, aⱼ, rⱼ, s'ⱼ)} from D
        
        Set target: yⱼ = rⱼ + γ max_a' Q(s'ⱼ, a'; θ⁻)
        
        Update: θ ← θ - α ∇_θ (yⱼ - Q(sⱼ, aⱼ; θ))²
        
        Every C steps: θ⁻ ← θ
        s ← s'
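To make the target computation concrete, here is a NumPy sketch that stands in Q-tables for the online and target networks (so θ and θ⁻ are just the table entries); the minibatch numbers are invented, and the terminal-state case (y = r when done) is made explicit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha = 4, 2, 0.99, 0.1

# Stand-ins for the online and target "networks": plain Q-tables.
theta = rng.normal(size=(n_states, n_actions))   # online Q(s, a; theta)
theta_minus = theta.copy()                       # frozen target Q(s, a; theta-)

def td_targets(batch, Q_target):
    """y_j = r_j + gamma * max_a' Q_target(s'_j, a'), or just r_j if terminal."""
    s, a, r, s2, done = batch
    return r + gamma * (1 - done) * Q_target[s2].max(axis=1)

# A minibatch of transitions (s, a, r, s', done); the second one is terminal.
batch = (np.array([0, 1]), np.array([1, 0]),
         np.array([1.0, 0.0]), np.array([2, 3]), np.array([0, 1]))
y = td_targets(batch, theta_minus)

# Semi-gradient step on the sampled (s, a) entries only.
s, a = batch[0], batch[1]
theta[s, a] += alpha * (y - theta[s, a])
# Every C steps one would then copy: theta_minus = theta.copy()
```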

2. Policy Gradient Methods: REINFORCE¶

Problem with Q-learning: Cannot handle continuous action spaces easily.

Solution: Directly parameterize policy \(\pi_\theta(a|s)\)

Objective: Maximize expected return

\[J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \gamma^t r_t\right]\]

where \(\tau = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) is a trajectory.

Policy Gradient Theorem:

\[\nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot R_t\right]\]

where \(R_t = \sum_{k=t}^T \gamma^{k-t} r_k\) is return from time \(t\).

Intuition:

  • If action led to a good outcome (\(R_t\) high) → increase its probability

  • If action led to a bad outcome (\(R_t\) low) → decrease its probability

  • Weight by gradient of log probability

REINFORCE Algorithm:

Initialize policy parameters θ

For episode = 1 to M:
    Generate trajectory τ = (s₀, a₀, r₀, ..., s_T, a_T, r_T) using π_θ
    
    For t = 0 to T:
        Compute return: R_t = Σ_{k=t}^T γ^(k-t) r_k
        
        Update: θ ← θ + α ∇_θ log π_θ(a_t|s_t) · R_t
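The update can be run end to end on a deliberately tiny problem. This sketch uses a one-state, two-action task (effectively a bandit, invented for illustration) with a softmax policy, for which ∇_θ log π_θ(a) = onehot(a) − π_θ:

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One-state, two-action problem: action 1 pays +1, action 0 pays 0.
theta = np.zeros(2)   # policy logits
alpha = 0.5

for episode in range(200):
    p = softmax(theta)
    a = rng.choice(2, p=p)
    R = 1.0 if a == 1 else 0.0        # the entire return of this 1-step episode
    grad_log_pi = -p                  # grad of log softmax: onehot(a) - p
    grad_log_pi[a] += 1.0
    theta += alpha * grad_log_pi * R  # theta <- theta + alpha * grad * R_t

# The policy should now strongly prefer the rewarding action.
```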

Variance Reduction: Baseline

High variance is main problem. Add baseline \(b(s_t)\):

\[\nabla_\theta J(\theta) = \mathbb{E}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot (R_t - b(s_t))\right]\]

Common baseline: Value function \(V^\pi(s_t)\)

This gives us the advantage function:

\[A^\pi(s_t, a_t) = Q^\pi(s_t, a_t) - V^\pi(s_t)\]

3. Actor-Critic Methods¶

Combine best of both worlds:

  • Actor: Policy \(\pi_\theta(a|s)\)

  • Critic: Value function \(V_\phi(s)\) or \(Q_\phi(s,a)\)

Advantage Actor-Critic (A2C):

Actor update (policy gradient):

\[\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a|s) \cdot A(s, a)\]

Critic update (TD learning):

\[\phi \leftarrow \phi + \beta (r + \gamma V_\phi(s') - V_\phi(s)) \nabla_\phi V_\phi(s)\]

Advantage estimate:

\[A(s, a) = r + \gamma V_\phi(s') - V_\phi(s)\]

This is the TD error \(\delta\)!
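One full actor-critic update is short enough to spell out. The sketch below uses a tabular critic and a softmax actor on made-up numbers; note that the returned δ serves as both the critic's TD error and the actor's advantage estimate:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Illustrative 2-state, 2-action setup (numbers are not from the lecture).
n_states, n_actions = 2, 2
theta = np.zeros((n_states, n_actions))   # actor: softmax policy logits
V = np.zeros(n_states)                    # critic: state values
alpha, beta, gamma = 0.1, 0.5, 0.9

def a2c_step(s, a, r, s2):
    """Critic moves toward the TD target; actor is pushed by the TD error."""
    delta = r + gamma * V[s2] - V[s]          # TD error = advantage estimate
    V[s] += beta * delta                      # critic update
    grad_log_pi = -softmax(theta[s])          # grad log pi = onehot(a) - pi
    grad_log_pi[a] += 1.0
    theta[s] += alpha * grad_log_pi * delta   # actor update
    return delta

delta = a2c_step(s=0, a=1, r=1.0, s2=1)   # one transition of experience
```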

4. Monte Carlo Tree Search (MCTS)¶

Used in: AlphaGo, AlphaZero, MuZero

Core Idea: Build search tree by simulation

Four Phases:

  1. Selection: Start from root, select actions using UCB1

\[\text{UCB1} = Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}}\]

  • First term: Exploitation (choose high-value actions)

  • Second term: Exploration (choose less-visited actions)

  • \(c\): Exploration constant (typically \(\sqrt{2}\))

  2. Expansion: Add new child node to tree

  3. Simulation: Play out to terminal state (random or guided)

  4. Backpropagation: Update values along path

\[Q(s, a) \leftarrow \frac{N(s, a) \cdot Q(s, a) + r}{N(s, a) + 1}\]
\[N(s, a) \leftarrow N(s, a) + 1\]

Here \(Q(s, a)\) is maintained as the running mean of simulation returns, which is what the exploitation term of UCB1 uses.
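The selection rule can be sketched directly, with Q(s, a) stored as a mean value as in the incremental backup above; giving unvisited children an infinite score so that each is tried at least once is a common convention, assumed here:

```python
import numpy as np

def ucb1(Q, N, c=np.sqrt(2)):
    """UCB1 scores for the children of one node.

    Q[a]: mean value of child a, N[a]: its visit count, N.sum(): parent visits.
    Unvisited children score +inf so each gets expanded at least once.
    """
    total = N.sum()
    with np.errstate(divide='ignore'):
        bonus = c * np.sqrt(np.log(max(total, 1)) / N)
    return np.where(N == 0, np.inf, Q + bonus)

Q = np.array([0.5, 0.6, 0.0])
N = np.array([10.0, 10.0, 0.0])
best = int(np.argmax(ucb1(Q, N)))   # the unvisited child (index 2) wins first
```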

MCTS Algorithm:

function MCTS(root_state, n_simulations):
    Initialize tree with root node
    
    For i = 1 to n_simulations:
        # 1. Selection
        node = root
        while node.is_fully_expanded() and not node.is_terminal():
            node = node.select_child_UCB1()
        
        # 2. Expansion
        if not node.is_terminal():
            node = node.expand()  # Add new child
        
        # 3. Simulation (rollout)
        reward = node.simulate_to_end()
        
        # 4. Backpropagation
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    
    # Return best action from root
    return root.best_child().action

UCB1 Derivation:

Comes from Hoeffding's inequality:

\[P\left(\left|\mu - \hat{\mu}\right| > \epsilon\right) \leq 2e^{-2n\epsilon^2}\]

Setting bound probability to \(t^{-4}\) and solving gives:

\[\text{UCB}(a) = \hat{\mu}_a + \sqrt{\frac{2 \ln t}{n_a}}\]

AlphaGo Enhancement:

Combines MCTS with neural networks:

  • Policy network: Guides selection (prior \(P(s, a)\))

  • Value network: Evaluates positions (replaces rollout)

\[\text{UCB}_{\text{AlphaGo}} = Q(s, a) + c \cdot P(s, a) \cdot \frac{\sqrt{N(s)}}{1 + N(s, a)}\]

5. Advanced Topics¶

Proximal Policy Optimization (PPO):

Problem: Policy gradient updates can take overly large steps → performance collapse

Solution: Constrain policy updates

\[L^{\text{CLIP}}(\theta) = \mathbb{E}\left[\min\left(r_t(\theta) A_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) A_t\right)\right]\]

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}\) is the probability ratio.
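The clipped surrogate is easy to compute in NumPy. In this sketch the ratios and advantages are made-up values, chosen to show that the clip only binds in the direction that would otherwise increase the objective:

```python
import numpy as np

def ppo_clip_objective(ratio, adv, eps=0.2):
    """L^CLIP = mean over samples of min(r*A, clip(r, 1-eps, 1+eps)*A)."""
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.minimum(ratio * adv, clipped * adv).mean()

# Four samples: ratio far above / below 1, with positive / negative advantage.
ratio = np.array([1.5, 1.5, 0.5, 0.5])
adv   = np.array([1.0, -1.0, 1.0, -1.0])
obj = ppo_clip_objective(ratio, adv)
# Per-sample terms: 1.2 (clipped), -1.5 (unclipped), 0.5 (unclipped), -0.8 (clipped)
```

Notice that when the advantage is negative the unclipped (more pessimistic) term is kept, so clipping never hides a bad update.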

Trust Region Policy Optimization (TRPO):

Constrained optimization:

\[\max_\theta \mathbb{E}[A_{\theta_{\text{old}}}(s, a)]\]
\[\text{subject to } \mathbb{E}[\text{KL}(\pi_{\theta_{\text{old}}} \| \pi_\theta)] \leq \delta\]

Deep Deterministic Policy Gradient (DDPG):

For continuous actions:

  • Deterministic policy: \(a = \mu_\theta(s)\)

  • Critic: \(Q_\phi(s, a)\)

Algorithm Comparison:

| Algorithm | Value/Policy | On/Off-Policy | Action Space | Sample Efficiency |
|-----------|--------------|---------------|--------------|-------------------|
| DQN       | Value-based  | Off-policy    | Discrete     | High              |
| REINFORCE | Policy-based | On-policy     | Any          | Low               |
| A2C/A3C   | Actor-Critic | On-policy     | Any          | Medium            |
| PPO       | Actor-Critic | On-policy     | Any          | Medium            |
| TRPO      | Actor-Critic | On-policy     | Any          | Medium            |
| DDPG      | Actor-Critic | Off-policy    | Continuous   | High              |
| SAC       | Actor-Critic | Off-policy    | Continuous   | High              |

When to use:

  • Discrete actions + sample efficiency: DQN

  • Continuous actions + stability: PPO, SAC

  • Maximum performance (sample-rich): AlphaZero (MCTS + NN)

  • Simple baseline: REINFORCE with baseline

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.patches import Rectangle
from collections import defaultdict

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

14.1 GridWorld Environment¶

What and Why¶

A GridWorld is the canonical toy environment for reinforcement learning, where an agent navigates a discrete grid to reach a goal while avoiding obstacles. Despite its simplicity, GridWorld captures the essential RL structure: states (grid positions), actions (up/down/left/right), transitions (possibly stochastic), and rewards (positive at the goal, negative at obstacles, small step penalty elsewhere).

How It Works¶

The environment is formalized as a Markov Decision Process (MDP) defined by the tuple \((S, A, P, R, \gamma)\), where \(P(s'|s,a)\) gives transition probabilities and \(R(s,a,s')\) gives rewards. The agent's objective is to find a policy \(\pi(a|s)\) that maximizes the expected cumulative discounted reward \(\mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_t\right]\).

Connection to ML¶

GridWorld environments are the standard testbed for all RL algorithms, from tabular methods (value iteration, Q-learning) to deep RL (DQN, policy gradient). The same MDP framework scales to robotics navigation, game playing, and resource allocation problems.

class GridWorld:
    def __init__(self, size=5):
        self.size = size
        self.goal = (size-1, size-1)
        self.obstacles = [(1, 1), (2, 2), (3, 1)]
        self.reset()
        
        # Actions: up, down, left, right
        self.actions = [(-1, 0), (1, 0), (0, -1), (0, 1)]
        self.action_names = ['Up', 'Down', 'Left', 'Right']
    
    def reset(self):
        self.state = (0, 0)
        return self.state
    
    def step(self, action_idx):
        """
        Take action, return (next_state, reward, done)
        """
        action = self.actions[action_idx]
        next_state = (self.state[0] + action[0], self.state[1] + action[1])
        
        # Check boundaries
        if (next_state[0] < 0 or next_state[0] >= self.size or
            next_state[1] < 0 or next_state[1] >= self.size):
            next_state = self.state  # Stay in place
            reward = -1
        # Check obstacles
        elif next_state in self.obstacles:
            next_state = self.state  # Stay in place
            reward = -1
        # Check goal
        elif next_state == self.goal:
            reward = 10
        else:
            reward = -0.1
        
        self.state = next_state
        done = (next_state == self.goal)
        
        return next_state, reward, done
    
    def visualize(self, values=None, policy=None):
        fig, ax = plt.subplots(figsize=(8, 8))
        
        # Draw grid
        for i in range(self.size):
            for j in range(self.size):
                # Color based on value
                if values is not None:
                    value = values.get((i, j), 0)
                    color = plt.cm.RdYlGn((value + 1) / 11)  # Normalize
                else:
                    color = 'white'
                
                rect = Rectangle((j, self.size-1-i), 1, 1, 
                                facecolor=color, edgecolor='black', linewidth=2)
                ax.add_patch(rect)
                
                # Mark obstacles
                if (i, j) in self.obstacles:
                    ax.text(j+0.5, self.size-1-i+0.5, 'X', 
                           ha='center', va='center', fontsize=20, fontweight='bold')
                # Mark goal
                elif (i, j) == self.goal:
                    ax.text(j+0.5, self.size-1-i+0.5, 'G', 
                           ha='center', va='center', fontsize=20, 
                           fontweight='bold', color='green')
                # Show value
                elif values is not None:
                    ax.text(j+0.5, self.size-1-i+0.5, f'{value:.1f}', 
                           ha='center', va='center', fontsize=10)
                
                # Show policy arrow
                if policy is not None and (i, j) not in self.obstacles and (i, j) != self.goal:
                    action_idx = policy.get((i, j), 0)
                    arrows = ['↑', '↓', '←', '→']
                    ax.text(j+0.5, self.size-1-i+0.2, arrows[action_idx], 
                           ha='center', va='center', fontsize=16, color='blue')
        
        ax.set_xlim(0, self.size)
        ax.set_ylim(0, self.size)
        ax.set_aspect('equal')
        ax.axis('off')
        plt.tight_layout()
        return fig

# Create and visualize environment
env = GridWorld(size=5)
fig = env.visualize()
plt.title('GridWorld Environment', fontsize=14, fontweight='bold', pad=20)
plt.show()

print("Environment created!")
print(f"Grid size: {env.size}x{env.size}")
print(f"Start: (0, 0)")
print(f"Goal: {env.goal}")
print(f"Obstacles: {env.obstacles}")

14.2 Value Iteration¶

Bellman Equations: Complete Mathematical Treatment¶

Bellman Expectation Equations¶

For a given policy \(\pi\):

Value function:

\[V^\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s\right]\]

Expanding:

\[V^\pi(s) = \sum_a \pi(a|s) \left[\mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V^\pi(s')\right]\]

Action-value function:

\[Q^\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V^\pi(s')\]

Or:

\[Q^\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \sum_{a'} \pi(a'|s') Q^\pi(s', a')\]

Relationship:

\[V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s, a)\]
\[Q^\pi(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V^\pi(s')\]

Bellman Optimality Equations¶

Optimal value function:

\[V^*(s) = \max_\pi V^\pi(s) = \max_a Q^*(s, a)\]
\[V^*(s) = \max_a \left[\mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V^*(s')\right]\]

Optimal action-value function:

\[Q^*(s, a) = \mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a \max_{a'} Q^*(s', a')\]

These are systems of nonlinear equations!

Proof: Value Iteration Converges¶

Theorem: Value iteration converges to \(V^*\).

Proof sketch:

Define Bellman operator \(\mathcal{T}\):

\[(\mathcal{T}V)(s) = \max_a \left[\mathcal{R}_s^a + \gamma \sum_{s'} \mathcal{P}_{ss'}^a V(s')\right]\]

Value iteration: \(V_{k+1} = \mathcal{T}V_k\)

Key property: \(\mathcal{T}\) is a contraction mapping with factor \(\gamma\):

\[\|\mathcal{T}V_1 - \mathcal{T}V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty\]

Proof of contraction:

\[\begin{split}\begin{align} |(\mathcal{T}V_1)(s) - (\mathcal{T}V_2)(s)| &= \left|\max_a \mathbb{E}[\gamma V_1(s')] - \max_a \mathbb{E}[\gamma V_2(s')]\right| \\ &\leq \max_a \left|\mathbb{E}[\gamma V_1(s')] - \mathbb{E}[\gamma V_2(s')]\right| \\ &\leq \gamma \max_a \mathbb{E}[|V_1(s') - V_2(s')|] \\ &\leq \gamma \|V_1 - V_2\|_\infty \end{align}\end{split}\]

By Banach fixed-point theorem:

  • Unique fixed point \(V^*\)

  • \(V_k \to V^*\) exponentially fast

Convergence rate:

\[\|V_k - V^*\|_\infty \leq \gamma^k \|V_0 - V^*\|_\infty\]

Convergence is exponential in \(k\): the error shrinks by a factor \(\gamma\) every iteration!
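The \(\gamma^k\) bound can be checked numerically on a small synthetic MDP with known dynamics; everything below (sizes, random rewards, and random transitions) is invented purely for the sanity check:

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P[s, a] is a distribution over s'
R = rng.normal(size=(nS, nA))                   # random rewards R(s, a)

def bellman(V):
    """(T V)(s) = max_a [R(s,a) + gamma * E_{s'~P_sa} V(s')]."""
    return (R + gamma * P @ V).max(axis=1)

# Find the fixed point V* by iterating T to (numerical) convergence.
V_star = np.zeros(nS)
for _ in range(1000):
    V_star = bellman(V_star)

# Track the sup-norm error of value iteration started from V_0 = 0.
V, errs = np.zeros(nS), []
for k in range(20):
    errs.append(np.abs(V - V_star).max())
    V = bellman(V)
# Contraction guarantees errs[k+1] <= gamma * errs[k] at every step.
```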

Policy Iteration: Guaranteed Improvement¶

Algorithm:

  1. Policy Evaluation: Compute \(V^\pi\)

  2. Policy Improvement: \(\pi' = \text{greedy}(V^\pi)\)

  3. Repeat until convergence

Theorem (Policy Improvement):

If \(\pi'(s) = \arg\max_a Q^\pi(s, a)\), then:

\[V^{\pi'}(s) \geq V^\pi(s) \quad \forall s\]

Proof:

\[\begin{split}\begin{align} V^\pi(s) &\leq Q^\pi(s, \pi'(s)) && \text{(greedy improvement)} \\ &= \mathbb{E}[r + \gamma V^\pi(s') \mid s, \pi'(s)] \\ &\leq \mathbb{E}[r + \gamma Q^\pi(s', \pi'(s')) \mid s, \pi'(s)] && \text{(apply again)} \\ &= \mathbb{E}[r + \gamma r' + \gamma^2 V^\pi(s'') \mid \ldots] \\ &\leq \ldots \\ &= V^{\pi'}(s) \end{align}\end{split}\]

Convergence: Finite MDPs → finitely many policies → must converge to \(\pi^*\)

Policy iteration vs Value iteration:

| Aspect        | Value Iteration   | Policy Iteration  |
|---------------|-------------------|-------------------|
| Update        | One sweep         | Full evaluation   |
| Per iteration | Fast              | Slow              |
| Convergence   | Many iterations   | Few iterations    |
| When better   | Large state space | Small state space |

Temporal Difference (TD) Learning Theory¶

TD(0) update:

\[V(s_t) \leftarrow V(s_t) + \alpha [r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]\]

TD error:

\[\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)\]

Theorem: TD(0) converges to \(V^\pi\) (under conditions):

  • Robbins-Monro: \(\sum_t \alpha_t = \infty\), \(\sum_t \alpha_t^2 < \infty\)

  • Every state visited infinitely often
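Under those conditions TD(0) really does home in on \(V^\pi\); here is a sketch on the classic 5-state random-walk chain, whose true values are known in closed form (the chain is a standard textbook example, not from the lecture):

```python
import numpy as np

# TD(0) policy evaluation on the 5-state random-walk chain: states 0..4,
# episodes start at 2, step left/right with equal probability, reward +1 for
# exiting right and 0 otherwise. True values: V(s) = (s + 1) / 6.
rng = np.random.default_rng(0)
n_states, alpha, gamma = 5, 0.05, 1.0
V = np.full(n_states, 0.5)

for episode in range(5000):
    s = 2
    while True:
        s2 = s + (1 if rng.random() < 0.5 else -1)
        if s2 < 0:                               # exit left: r = 0, terminal
            V[s] += alpha * (0.0 - V[s])
            break
        if s2 >= n_states:                       # exit right: r = +1, terminal
            V[s] += alpha * (1.0 - V[s])
            break
        V[s] += alpha * (gamma * V[s2] - V[s])   # TD(0) update, r = 0 inside
        s = s2

true_V = np.arange(1, n_states + 1) / (n_states + 1)
```

With a small constant step size the estimates settle into a noise ball around the true values rather than converging exactly, which is why the Robbins-Monro conditions require a decaying \(\alpha_t\).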

Bias-Variance Trade-off:

| Method      | Update Target | Bias | Variance |
|-------------|---------------|------|----------|
| Monte Carlo | \(G_t = \sum_k \gamma^k r_{t+k}\) | None | High |
| TD(0)       | \(r_{t+1} + \gamma V(s_{t+1})\) | Yes | Low |
| n-step TD   | \(\sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n V(s_{t+n})\) | Medium | Medium |

TD(\(\lambda\)): Unified View

Combines all n-step returns:

\[V(s_t) \leftarrow V(s_t) + \alpha [G_t^\lambda - V(s_t)]\]

where:

\[G_t^\lambda = (1-\lambda) \sum_{n=1}^\infty \lambda^{n-1} G_t^{(n)}\]
  • \(\lambda = 0\): TD(0)

  • \(\lambda = 1\): Monte Carlo

  • \(\lambda \in (0,1)\): Trade-off

Exploration vs Exploitation¶

\(\epsilon\)-greedy:

\[\begin{split}\pi(a|s) = \begin{cases} 1 - \epsilon + \frac{\epsilon}{|A|} & \text{if } a = \arg\max_{a'} Q(s, a') \\ \frac{\epsilon}{|A|} & \text{otherwise} \end{cases}\end{split}\]

Boltzmann (softmax) exploration:

\[\pi(a|s) = \frac{\exp(Q(s,a)/\tau)}{\sum_{a'} \exp(Q(s,a')/\tau)}\]
  • \(\tau \to 0\): Greedy

  • \(\tau \to \infty\): Uniform

  • \(\tau\): Temperature parameter

UCB (Upper Confidence Bound):

\[a_t = \arg\max_a \left[Q(a) + c\sqrt{\frac{\ln t}{N(a)}}\right]\]

Optimism in the face of uncertainty!
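A minimal UCB bandit loop might look like this; the arm means, horizon, and exploration constant are illustrative choices:

```python
import numpy as np

def ucb_action(Q, N, t, c=2.0):
    """argmax_a [Q(a) + c*sqrt(ln t / N(a))], trying each arm once first."""
    if np.any(N == 0):
        return int(np.argmin(N))
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / N)))

# UCB on a 3-armed Gaussian bandit (arm means invented for illustration).
rng = np.random.default_rng(0)
means = np.array([0.1, 0.9, 0.4])
K, T = 3, 2000
Q, N = np.zeros(K), np.zeros(K)

for t in range(1, T + 1):
    a = ucb_action(Q, N, t)
    r = rng.normal(means[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]   # incremental sample-mean update
# The best arm (index 1) should end up with most of the pulls.
```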

Regret bounds:

For \(K\)-armed bandit with UCB:

\[\mathbb{E}[\text{Regret}_T] = O(\sqrt{KT \ln T})\]

Comparison:

| Method | Regret | Pros | Cons |
|--------|--------|------|------|
| Random | \(O(T)\) | Simple | Poor |
| \(\epsilon\)-greedy | \(O(T)\) | Simple | Linear regret |
| \(\epsilon\)-decreasing | \(O(\log T)\) | Better | Tuning needed |
| UCB | \(O(\sqrt{T \log T})\) | Theoretical guarantee | Stationary only |
| Thompson | \(O(\sqrt{T})\) | Bayesian, great empirically | Complex |

def value_iteration(env, gamma=0.9, theta=0.001, max_iterations=1000):
    """
    Value Iteration algorithm
    """
    # Initialize value function
    V = defaultdict(float)
    
    for iteration in range(max_iterations):
        delta = 0
        
        # For each state
        for i in range(env.size):
            for j in range(env.size):
                state = (i, j)
                
                # Skip terminal and obstacle states
                if state == env.goal or state in env.obstacles:
                    continue
                
                v = V[state]
                
                # Compute value for each action
                action_values = []
                for action_idx in range(len(env.actions)):
                    env.state = state
                    next_state, reward, done = env.step(action_idx)
                    action_value = reward + gamma * V[next_state]
                    action_values.append(action_value)
                
                # Bellman update: take max over actions
                V[state] = max(action_values)
                delta = max(delta, abs(v - V[state]))
        
        if delta < theta:
            print(f"Value iteration converged in {iteration+1} iterations")
            break
    
    # Extract policy
    policy = {}
    for i in range(env.size):
        for j in range(env.size):
            state = (i, j)
            if state == env.goal or state in env.obstacles:
                continue
            
            action_values = []
            for action_idx in range(len(env.actions)):
                env.state = state
                next_state, reward, done = env.step(action_idx)
                action_value = reward + gamma * V[next_state]
                action_values.append(action_value)
            
            policy[state] = np.argmax(action_values)
    
    return V, policy

# Run value iteration
V_star, policy_star = value_iteration(env, gamma=0.9)

# Visualize results
fig = env.visualize(values=V_star, policy=policy_star)
plt.title('Value Iteration Results\n(Values and Optimal Policy)', 
         fontsize=14, fontweight='bold', pad=20)
plt.show()

print("\nOptimal Value Function:")
for i in range(env.size):
    for j in range(env.size):
        print(f"{V_star[(i,j)]:6.2f}", end=" ")
    print()

14.3 Q-Learning¶

Advanced Q-Learning Variants and Improvements¶

Double Q-Learning: Addressing Overestimation Bias¶

Problem with standard Q-learning:

\[Q(s, a) \leftarrow r + \gamma \max_{a'} Q(s', a')\]

The \(\max\) operator causes positive bias:

\[\mathbb{E}[\max(X_1, X_2, \ldots)] \geq \max(\mathbb{E}[X_1], \mathbb{E}[X_2], \ldots)\]

Example: Two actions, true values both 0, estimates \(N(0, 1)\):

\[\mathbb{E}[\max(Q_1, Q_2)] \approx 0.56 > 0\]

This accumulates over iterations!

Double Q-Learning Solution:

Maintain two Q-functions \(Q_A\) and \(Q_B\).

Update \(Q_A\): (with probability 0.5)

\[Q_A(s, a) \leftarrow r + \gamma Q_B(s', \arg\max_{a'} Q_A(s', a'))\]

Update \(Q_B\): (with probability 0.5)

\[Q_B(s, a) \leftarrow r + \gamma Q_A(s', \arg\max_{a'} Q_B(s', a'))\]

Key insight:

  • Use \(Q_A\) to select action

  • Use \(Q_B\) to evaluate action

  • Decorrelates selection and evaluation!

Final policy:

\[\pi(s) = \arg\max_a [Q_A(s, a) + Q_B(s, a)]\]

Double DQN:

\[y = r + \gamma Q(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-)\]

Use online network \(\theta\) to select, target network \(\theta^-\) to evaluate.
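A tabular sketch of one Double Q-learning step under the two-table scheme above (table sizes, the transition, and the step size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 3, 2
Q_A = np.zeros((n_states, n_actions))
Q_B = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9

def double_q_update(s, a, r, s2, done):
    """Flip a coin: select the greedy action with one table, evaluate with the other."""
    if rng.random() < 0.5:
        Q_sel, Q_eval = Q_A, Q_B
    else:
        Q_sel, Q_eval = Q_B, Q_A
    a_star = np.argmax(Q_sel[s2])                            # selection
    target = r if done else r + gamma * Q_eval[s2, a_star]   # evaluation
    Q_sel[s, a] += alpha * (target - Q_sel[s, a])

def policy(s):
    """Act greedily with respect to the sum of both tables."""
    return int(np.argmax(Q_A[s] + Q_B[s]))

double_q_update(s=0, a=1, r=1.0, s2=2, done=False)
```

Whichever table the coin flip picks, only that table is updated, so the two estimators stay (approximately) independent.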

Dueling DQN: Separating Value and Advantage¶

Decomposition:

\[Q(s, a) = V(s) + A(s, a)\]

where:

  • \(V(s)\): Value of being in state \(s\)

  • \(A(s, a)\): Advantage of action \(a\) over others

Problem: Not unique! Can add constant to \(V\), subtract from \(A\).

Solution: Force identifiability constraint:

\[Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \left(A(s, a; \theta, \alpha) - \max_{a'} A(s, a'; \theta, \alpha)\right)\]

Or (better empirically):

\[Q(s, a) = V(s) + \left(A(s, a) - \frac{1}{|A|}\sum_{a'} A(s, a')\right)\]
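The mean-subtraction aggregation is a one-liner; the sketch below applies it to made-up value and advantage streams:

```python
import numpy as np

def dueling_q(V, A):
    """Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a')).

    V: shape (batch,), A: shape (batch, n_actions). The subtraction pins the
    mean advantage to zero, making the V/A decomposition identifiable.
    """
    return V[:, None] + (A - A.mean(axis=1, keepdims=True))

V = np.array([2.0])
A = np.array([[1.0, 3.0, -1.0]])   # mean advantage = 1.0
Q = dueling_q(V, A)                # [[2.0, 4.0, 0.0]]
```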

Architecture:

Input state s
    ↓
  CNN/FC layers (shared)
    ↓
   ├─────────┬─────────┐
   ↓         ↓         ↓
 V(s)    A(s,a₁)   A(s,a₂) ...
   ↓         ↓         ↓
   └─────────┴─────────┘
            ↓
     Q(s,a) = V + (A - mean(A))

Advantages:

✅ Learns value function explicitly (useful when actions don't matter much)

✅ Better gradient flow

✅ Improved performance on Atari

Prioritized Experience Replay¶

Standard experience replay: Uniform sampling from buffer

Problem: Not all experiences equally important!

Solution: Sample based on TD error magnitude

\[p_i = |\delta_i| + \epsilon\]

where \(\delta_i = r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\)

Prioritization with importance sampling:

Sampling probability:

\[P(i) = \frac{p_i^\alpha}{\sum_k p_k^\alpha}\]
  • \(\alpha = 0\): Uniform (standard)

  • \(\alpha = 1\): Full prioritization

  • Typical: \(\alpha = 0.6\)

Importance sampling weights:

\[w_i = \left(\frac{1}{N} \cdot \frac{1}{P(i)}\right)^\beta\]
  • Corrects bias from non-uniform sampling

  • \(\beta\): Annealed from \(\beta_0\) to 1

  • Typical: \(\beta_0 = 0.4\), anneal to 1

Loss:

\[L = \frac{1}{N}\sum_i w_i \delta_i^2\]

Update priorities after gradient step:

\[p_i \leftarrow |\delta_i| + \epsilon\]
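Proportional prioritized sampling with the importance-sampling correction can be sketched as follows; normalizing the weights by their maximum is a common implementation trick (an assumption here), and the TD errors are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_prioritized(td_errors, batch_size, alpha=0.6, beta=0.4, eps=1e-6):
    """Sample indices with P(i) proportional to (|delta_i| + eps)^alpha."""
    p = np.abs(td_errors) + eps            # priorities p_i
    P = p**alpha / (p**alpha).sum()        # sampling distribution P(i)
    idx = rng.choice(len(P), size=batch_size, p=P)
    w = (1.0 / (len(P) * P[idx]))**beta    # importance-sampling weights
    return idx, w / w.max()                # max-normalize for stability

td_errors = np.array([0.01, 2.0, 0.5, 0.0])
idx, w = sample_prioritized(td_errors, batch_size=32)
# The high-error transition (index 1) should dominate the minibatch.
```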

Results:

  • 49 out of 57 Atari games: improvement

  • Especially good for sparse rewards

Rainbow DQN: Combining Improvements¶

Combines 6 extensions:

  1. Double DQN: Reduce overestimation

  2. Dueling: Separate V and A

  3. Prioritized replay: Important transitions

  4. Multi-step returns: \(n\)-step TD targets

  5. Distributional RL: Learn return distribution

  6. Noisy Nets: Exploration via parameter noise

Multi-step Learning:

Instead of 1-step TD:

\[y = r_t + \gamma \max_{a'} Q(s_{t+1}, a')\]

Use \(n\)-step return:

\[y = \sum_{k=0}^{n-1} \gamma^k r_{t+k} + \gamma^n \max_{a'} Q(s_{t+n}, a')\]

Bias-variance trade-off:

  • Small \(n\): Low variance, high bias (bootstraps more)

  • Large \(n\): High variance, low bias (like Monte Carlo)

  • Typical: \(n = 3\)
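The n-step target is a short computation; the rewards, bootstrap value, and γ below are made up to show the arithmetic:

```python
import numpy as np

def n_step_target(rewards, q_boot, gamma=0.99):
    """y = sum_{k=0}^{n-1} gamma^k r_{t+k} + gamma^n * max_a' Q(s_{t+n}, a').

    rewards: the n observed rewards; q_boot: max_a' Q(s_{t+n}, a').
    """
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    return float(np.dot(discounts, rewards) + gamma**n * q_boot)

y = n_step_target(rewards=[1.0, 0.0, 2.0], q_boot=5.0, gamma=0.5)
# y = 1 + 0.5*0 + 0.25*2 + 0.125*5 = 2.125
```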

Distributional RL (C51):

Instead of \(Q(s, a) = \mathbb{E}[Z]\), learn full distribution \(Z\).

Approximate with \(N\) atoms:

\[Z(s, a) = \sum_{i=1}^N p_i(s, a) \delta_{z_i}\]

where \(z_i \in [V_{\min}, V_{\max}]\) are support points.

Loss: Cross-entropy between distributions

Why better?

  • Captures uncertainty

  • Richer learning signal

  • Better performance

Noisy Nets:

Add noise to network parameters:

\[y = (\mu^w + \sigma^w \odot \epsilon^w) x + \mu^b + \sigma^b \odot \epsilon^b\]

where:

  • \(\mu^w, \mu^b\): Learned mean

  • \(\sigma^w, \sigma^b\): Learned std

  • \(\epsilon^w, \epsilon^b \sim \mathcal{N}(0, 1)\): Noise

Advantages over \(\epsilon\)-greedy:

  • State-dependent exploration

  • Automatic annealing (learned \(\sigma\) decreases)

  • No hyperparameter tuning

Rainbow Performance:

Median human-normalized score on Atari:

  • DQN: 100%

  • Rainbow: 227%

Ablation study shows all components help!

Practical Implementation Tips¶

Hyperparameters (DQN on Atari):

| Parameter | Value | Purpose |
|-----------|-------|---------|
| Replay buffer | 1M | Store transitions |
| Batch size | 32 | Mini-batch |
| Learning rate | 0.00025 | Adam optimizer |
| \(\gamma\) | 0.99 | Discount factor |
| \(\epsilon\) start | 1.0 | Initial exploration |
| \(\epsilon\) end | 0.1 | Final exploration |
| \(\epsilon\) decay | 1M steps | Anneal schedule |
| Target update | 10K steps | Freeze target |
| Frame stack | 4 | Temporal context |
| Frame skip | 4 | Action repeat |

Network Architecture:

Input: 84×84×4 (stacked frames)
    ↓
Conv1: 32 filters, 8×8, stride 4, ReLU
    ↓
Conv2: 64 filters, 4×4, stride 2, ReLU
    ↓
Conv3: 64 filters, 3×3, stride 1, ReLU
    ↓
Flatten → FC: 512 units, ReLU
    ↓
Output: |A| Q-values (linear)

Training Tricks:

✅ Gradient clipping: Clip \(\delta\) to \([-1, 1]\) or use Huber loss

✅ Reward clipping: Clip rewards to \([-1, 1]\) for stability

✅ Frame preprocessing: Grayscale, resize, normalize

✅ Action repeat: Skip frames (4×) for efficiency

✅ Burn-in: Fill replay buffer before training

✅ Evaluation: Separate evaluation episodes with \(\epsilon = 0.05\)

Common Pitfalls:

❌ Instability: Try lower learning rate, smaller network

❌ Slow learning: Check exploration, increase buffer size

❌ Catastrophic forgetting: Increase replay buffer

❌ Overestimation: Use Double DQN

❌ Sparse rewards: Try curiosity-driven exploration or reward shaping

class QLearningAgent:
    def __init__(self, n_actions, learning_rate=0.1, gamma=0.9, epsilon=0.1):
        self.Q = defaultdict(lambda: np.zeros(n_actions))
        self.lr = learning_rate
        self.gamma = gamma
        self.epsilon = epsilon
        self.n_actions = n_actions
    
    def get_action(self, state, training=True):
        """Epsilon-greedy action selection"""
        if training and np.random.random() < self.epsilon:
            return np.random.randint(self.n_actions)
        else:
            return np.argmax(self.Q[state])
    
    def update(self, state, action, reward, next_state, done):
        """Q-learning update"""
        current_q = self.Q[state][action]
        
        if done:
            target_q = reward
        else:
            target_q = reward + self.gamma * np.max(self.Q[next_state])
        
        # Q-learning update
        self.Q[state][action] = current_q + self.lr * (target_q - current_q)

# Train Q-learning agent
agent = QLearningAgent(n_actions=4, learning_rate=0.1, gamma=0.9, epsilon=0.1)

n_episodes = 1000
episode_rewards = []
episode_lengths = []

for episode in range(n_episodes):
    state = env.reset()
    total_reward = 0
    steps = 0
    
    for step in range(100):  # Max 100 steps per episode
        action = agent.get_action(state, training=True)
        next_state, reward, done = env.step(action)
        
        agent.update(state, action, reward, next_state, done)
        
        state = next_state
        total_reward += reward
        steps += 1
        
        if done:
            break
    
    episode_rewards.append(total_reward)
    episode_lengths.append(steps)
    
    if (episode + 1) % 200 == 0:
        avg_reward = np.mean(episode_rewards[-100:])
        avg_length = np.mean(episode_lengths[-100:])
        print(f"Episode {episode+1}: Avg Reward={avg_reward:.2f}, Avg Steps={avg_length:.1f}")

# Plot learning curves
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Smooth rewards
window = 50
smoothed_rewards = np.convolve(episode_rewards, np.ones(window)/window, mode='valid')
smoothed_lengths = np.convolve(episode_lengths, np.ones(window)/window, mode='valid')

axes[0].plot(smoothed_rewards, linewidth=2)
axes[0].set_xlabel('Episode', fontsize=12)
axes[0].set_ylabel('Total Reward', fontsize=12)
axes[0].set_title('Q-Learning: Reward per Episode', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)

axes[1].plot(smoothed_lengths, linewidth=2, color='orange')
axes[1].set_xlabel('Episode', fontsize=12)
axes[1].set_ylabel('Steps to Goal', fontsize=12)
axes[1].set_title('Q-Learning: Episode Length', fontsize=13, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

14.4 Learned Q-Values and PolicyΒΆ

What and WhyΒΆ

After Q-learning converges, the learned Q-values \(Q(s,a)\) encode the expected cumulative reward for taking action \(a\) in state \(s\) and following the optimal policy thereafter. Visualizing Q-values as a heatmap reveals how the agent values different state-action pairs, while the greedy policy \(\pi^*(s) = \arg\max_a Q(s,a)\) shows the optimal action at each state.

How It WorksΒΆ

The code extracts the maximum Q-value at each grid position and displays it as a color-coded heatmap, with warmer colors indicating higher-value states (closer to the goal via safe paths). Arrows overlay each cell showing the greedy action direction. States near the goal have high values that decay with distance, reflecting the discount factor \(\gamma\).

Connection to MLΒΆ

Interpreting learned value functions is critical for debugging and trusting RL agents in practice. The same visualization principles apply to deep Q-networks (DQN) where value functions are approximated by neural networks, and to actor-critic methods where both value and policy are learned.

# Extract learned values and policy
learned_values = {}
learned_policy = {}

for i in range(env.size):
    for j in range(env.size):
        state = (i, j)
        if state not in env.obstacles and state != env.goal:
            learned_values[state] = np.max(agent.Q[state])
            learned_policy[state] = np.argmax(agent.Q[state])

# Visualize
fig = env.visualize(values=learned_values, policy=learned_policy)
plt.title('Q-Learning Results\n(Learned Values and Policy)', 
         fontsize=14, fontweight='bold', pad=20)
plt.show()

# Test learned policy
print("\nTesting Learned Policy:")
state = env.reset()
path = [state]

for step in range(20):
    action = agent.get_action(state, training=False)
    next_state, reward, done = env.step(action)
    path.append(next_state)
    
        print(f"Step {step+1}: {state} → {env.action_names[action]} → {next_state} (reward={reward:.1f})")
    
    state = next_state
    if done:
        print(f"\nReached goal in {step+1} steps!")
        break

Key TakeawaysΒΆ

1. Markov Decision Process (MDP)ΒΆ

  • Framework for sequential decision making

  • Components: States, Actions, Transitions, Rewards

  • Markov property: the future depends only on the current state

  • Goal: find a policy π that maximizes cumulative reward

2. Value FunctionsΒΆ

  • State value: V(s) = expected cumulative reward starting from state s

  • Action value: Q(s,a) = expected cumulative reward from taking action a in state s

  • Bellman equation: recursive relationship between a state's value and its successors' values

  • Optimal policy: choose the action with the maximum Q-value

3. Value IterationΒΆ

  • Model-based: requires knowledge of the transition probabilities P(s'|s,a)

  • Iteratively update V(s) using the Bellman equation

  • Converges to the optimal value function V*

  • Extract the policy: π(s) = argmax_a Q(s,a)
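The steps above can be sketched on a toy MDP. The transition and reward arrays below are illustrative assumptions (a 3-state chain with an absorbing goal state), not the lecture's grid world:

```python
import numpy as np

# Toy 3-state, 2-action MDP (numbers are illustrative assumptions).
# P[a, s, s'] = P(s' | s, a); state 2 is absorbing and pays reward 1.
P = np.array([
    [[0.8, 0.2, 0.0], [0.0, 0.8, 0.2], [0.0, 0.0, 1.0]],  # action 0
    [[0.2, 0.8, 0.0], [0.2, 0.0, 0.8], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 1.0]])  # R[s, a]
gamma = 0.9

V = np.zeros(3)
for _ in range(1000):
    # Bellman backup: Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s,a) V(s')
    Q = R + gamma * np.einsum('ast,t->sa', P, V)
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:  # stop when the sweep barely changes V
        break
    V = V_new

policy = Q.argmax(axis=1)  # greedy extraction: pi(s) = argmax_a Q(s, a)
print(V, policy)
```

Because the backup is a γ-contraction, the loop converges regardless of the initialization; states closer to the absorbing goal end up with higher values.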

4. Q-LearningΒΆ

  • Model-free: learns from experience (no transition model needed)

  • Off-policy: learns the optimal Q* while following an ε-greedy behavior policy

  • Update rule: Q(s,a) ← Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)]

  • Exploration vs. exploitation: ε-greedy balances both

5. Key DifferencesΒΆ

| Aspect      | Value Iteration             | Q-Learning                   |
|-------------|-----------------------------|------------------------------|
| Model       | Required                    | Not required                 |
| Updates     | All states                  | Visited states               |
| Convergence | Guaranteed                  | Guaranteed (with conditions) |
| Speed       | Fast with small state space | Slow; needs many episodes    |

6. Exploration StrategiesΒΆ

  • ε-greedy: take a random action with probability ε, the greedy action otherwise

  • Softmax: sample actions from a Boltzmann distribution over Q-values

  • Upper Confidence Bound (UCB): add a bonus for rarely visited state-action pairs

  • Exploration is essential for finding the optimal policy
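The softmax (Boltzmann) strategy can be sketched as follows; the temperature `tau` is a standard knob (an assumption here, not from the lecture) that interpolates between near-uniform exploration (high `tau`) and near-greedy selection (low `tau`):

```python
import numpy as np

def softmax_action(q_values, tau=1.0, rng=None):
    """Sample an action with probability proportional to exp(Q(s,a)/tau)."""
    rng = rng or np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / tau
    prefs -= prefs.max()          # subtract max so exp() cannot overflow
    probs = np.exp(prefs)
    probs /= probs.sum()          # normalize to a proper distribution
    return rng.choice(len(probs), p=probs)
```

Unlike ε-greedy, which explores all non-greedy actions uniformly, softmax explores in proportion to how promising each action currently looks.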

7. ExtensionsΒΆ

  • SARSA: on-policy TD learning

  • Double Q-Learning: reduces overestimation bias

  • DQN: deep neural network approximates the Q-function

  • Policy Gradient: directly optimizes the policy

  • Actor-Critic: combines value and policy methods

  • A3C, PPO, SAC: modern deep RL algorithms
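SARSA differs from Q-learning only in the bootstrap term: it uses the action a' the behavior policy actually took next, rather than max_a' Q(s',a'). A minimal sketch (the dict-of-arrays Q-table layout is an assumption mirroring the agent above):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, done, alpha=0.1, gamma=0.9):
    """On-policy TD update: Q(s,a) += alpha * [r + gamma*Q(s',a') - Q(s,a)].

    Q-learning would bootstrap on max(Q[s_next]); SARSA bootstraps on
    Q[s_next][a_next], the action the (e.g. epsilon-greedy) policy chose.
    """
    target = r if done else r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])
```

Because SARSA evaluates the policy it actually follows (exploration included), it tends to learn more conservative paths than Q-learning near dangerous states.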

8. Practical TipsΒΆ

  • Start with simple environments (GridWorld)

  • Tune the learning rate α and the discount factor γ

  • Balance exploration (ε) carefully, e.g. by decaying it over training

  • Use experience replay for stability

  • Normalize rewards when possible
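The experience replay mentioned above can be sketched as a fixed-capacity buffer (the class name and default capacity are illustrative assumptions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of (s, a, r, s', done) transitions.

    Sampling uniformly from past experience breaks the correlation
    between consecutive transitions, which stabilizes updates.
    """
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)
```

In a DQN-style loop the agent would `push` each transition, then train on a random `sample` mini-batch instead of the most recent step.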

ReferencesΒΆ

  1. Sutton, R. S., & Barto, A. G. "Reinforcement Learning: An Introduction" (2nd edition)

  2. Mnih et al. "Playing Atari with Deep Reinforcement Learning" (2013)

  3. CS229 Lecture Notes on Reinforcement Learning

  4. David Silver's RL Course (UCL)

Congratulations! You've completed all 14 CS229 lectures! 🎉