02: Value-Based Methods (Q-Learning)¶

"The best way to predict the future is to create it." – Peter Drucker

Welcome to the exciting world of Q-Learning! This notebook introduces value-based reinforcement learning methods, where we learn to estimate the value of state-action pairs directly.

🎯 Learning Objectives¶

By the end of this notebook, you'll understand:

  • The difference between model-based and model-free learning

  • How Q-learning works and why it's powerful

  • Temporal Difference (TD) learning principles

  • Exploration vs exploitation trade-offs

  • Practical implementation of Q-learning

🧠 Model-Based vs Model-Free Learning¶

Model-Based Methods¶

  • Learn a model of the environment: P(s'|s,a) and R(s,a)

  • Use planning to compute optimal policy

  • Examples: Value Iteration, Policy Iteration

  • Pros: Sample efficient, can plan ahead

  • Cons: Need accurate model, computation intensive

Model-Free Methods¶

  • Learn directly from experience without modeling environment

  • No need to know transition probabilities or rewards

  • Examples: Q-Learning, SARSA, Monte Carlo

  • Pros: Simpler, works with any environment

  • Cons: Usually need more samples

🎮 Q-Learning Algorithm¶

Q-Learning is a model-free, off-policy reinforcement learning algorithm that learns the optimal action-value function Q*(s,a).

Key Idea¶

Learn Q(s,a) directly by updating estimates based on observed rewards and estimated future rewards.

Q-Learning Update Rule¶

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]

Where:

  • α: Learning rate (0 < α ≤ 1)

  • γ: Discount factor (0 ≤ γ ≤ 1)

  • R_{t+1}: Immediate reward

  • max_a Q(S_{t+1}, a): Best estimate of future rewards
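The update is easy to verify by hand. A minimal sketch of a single update, with all numbers made up purely for illustration:

```python
# One Q-learning update computed by hand (all numbers are made up for illustration)
alpha, gamma = 0.1, 0.9

q_sa = 2.0         # current estimate Q(S_t, A_t)
reward = 1.0       # observed reward R_{t+1}
max_next_q = 5.0   # max_a Q(S_{t+1}, a), the best estimate of future rewards

td_target = reward + gamma * max_next_q  # 1.0 + 0.9 * 5.0 = 5.5
td_error = td_target - q_sa              # 5.5 - 2.0 = 3.5
q_sa = q_sa + alpha * td_error           # 2.0 + 0.1 * 3.5 = 2.35
```

Because α = 0.1, the estimate moves only a tenth of the way toward the new target, which keeps learning stable under noisy rewards.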

🔄 Temporal Difference Learning¶

Q-Learning is a Temporal Difference (TD) learning method:

  • TD Learning: Update estimates based on other estimates

  • Bootstrapping: Use existing value estimates to construct the update target

  • Online Learning: Learn from incomplete episodes

TD vs Monte Carlo¶

  • MC: Wait until end of episode to update values

  • TD: Update values immediately after each step

  • TD: Can learn online, before episode ends

  • TD: Usually converges faster
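The contrast is easiest to see side by side. A minimal sketch for a state-value update, where the state names and all numbers are made up for illustration:

```python
# Same starting estimate, two different update targets (toy numbers).
alpha, gamma = 0.1, 0.9

V = {'s': 0.0, 's_next': 1.5}   # current state-value estimates (made up)
r1, r2, r3 = 1.0, 0.0, 2.0      # rewards observed after leaving s, until the episode ends

# Monte Carlo: must wait for the full discounted return G before updating V(s)
G = r1 + gamma * r2 + gamma**2 * r3           # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
v_mc = V['s'] + alpha * (G - V['s'])          # 0.1 * 2.62 = 0.262

# TD(0): updates immediately after one step, bootstrapping from V(s_next)
td_target = r1 + gamma * V['s_next']          # 1.0 + 0.9 * 1.5 = 2.35
v_td = V['s'] + alpha * (td_target - V['s'])  # 0.1 * 2.35 = 0.235
```

The MC target uses only observed rewards (unbiased, high variance); the TD target substitutes the current estimate V(s_next) for everything beyond the first step (biased, low variance, available immediately).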

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import random
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]

Q-Learning Implementation¶

With the theoretical foundation in place, we now translate the Q-learning update rule into working code. The QLearningAgent class below maintains a Q-table – a dictionary mapping each (state, action) pair to an estimated value – and updates it after every environment step using the temporal difference update:

\[Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr]\]

The term in brackets is the TD error: the difference between the bootstrap target \(r + \gamma \max_{a'} Q(s', a')\) and the current estimate. Multiplying by the learning rate \(\alpha\) controls how aggressively we incorporate new information. Action selection follows an epsilon-greedy strategy, choosing a random action with probability \(\epsilon\) and the greedy action otherwise. This is the same exploration mechanism used in bandit problems, scaled up to sequential decision-making.

class QLearningAgent:
    """Q-Learning agent for grid world"""
    
    def __init__(self, env, alpha: float = 0.1, gamma: float = 0.9, epsilon: float = 0.1):
        self.env = env
        self.alpha = alpha  # Learning rate
        self.gamma = gamma  # Discount factor
        self.epsilon = epsilon  # Exploration rate
        
        # Initialize Q-table
        self.Q = defaultdict(lambda: defaultdict(float))
        
        # Track learning progress
        self.episode_rewards = []
        self.episode_lengths = []
    
    def get_action(self, state: Tuple[int, int], explore: bool = True) -> str:
        """Choose action using epsilon-greedy policy"""
        if explore and random.random() < self.epsilon:
            # Explore: random action
            return random.choice(self.env.actions)
        else:
            # Exploit: best action
            q_values = [self.Q[state][action] for action in self.env.actions]
            max_q = max(q_values)
            # Break ties randomly
            best_actions = [action for action, q in zip(self.env.actions, q_values) if q == max_q]
            return random.choice(best_actions)
    
    def update_q(self, state: Tuple[int, int], action: str, reward: float, next_state: Tuple[int, int]):
        """Update Q-value using Q-learning update rule"""
        # Current Q-value
        current_q = self.Q[state][action]
        
        # Maximum Q-value for next state
        if self.env.is_terminal(next_state):
            max_next_q = 0.0
        else:
            max_next_q = max([self.Q[next_state][a] for a in self.env.actions])
        
        # Q-learning update
        target = reward + self.gamma * max_next_q
        self.Q[state][action] = current_q + self.alpha * (target - current_q)
    
    def train_episode(self) -> Tuple[float, int]:
        """Run one training episode"""
        state = self.env.start
        total_reward = 0.0
        steps = 0
        
        while not self.env.is_terminal(state) and steps < 100:  # Prevent infinite loops
            # Choose action
            action = self.get_action(state, explore=True)
            
            # Take action
            next_state = self.env.get_next_state(state, action)
            reward = self.env.get_reward(state, action, next_state)
            
            # Update Q-value
            self.update_q(state, action, reward, next_state)
            
            # Update state and counters
            state = next_state
            total_reward += reward
            steps += 1
        
        return total_reward, steps
    
    def train(self, num_episodes: int = 1000):
        """Train the agent for multiple episodes"""
        for episode in range(num_episodes):
            reward, length = self.train_episode()
            self.episode_rewards.append(reward)
            self.episode_lengths.append(length)
            
            # Decay epsilon (reduce exploration over time)
            self.epsilon = max(0.01, self.epsilon * 0.995)
    
    def get_policy(self) -> Dict[Tuple[int, int], str]:
        """Extract greedy policy from Q-table"""
        policy = {}
        for state in self.env.states:
            if not self.env.is_terminal(state):
                q_values = [self.Q[state][action] for action in self.env.actions]
                max_q = max(q_values)
                best_actions = [action for action, q in zip(self.env.actions, q_values) if q == max_q]
                policy[state] = random.choice(best_actions)  # Break ties randomly
        return policy
    
    def get_q_values(self) -> Dict[Tuple[int, int], List[float]]:
        """Get Q-values for visualization"""
        q_values = {}
        for state in self.env.states:
            q_values[state] = [self.Q[state][action] for action in self.env.actions]
        return q_values

Training the Q-Learning Agent¶

Training proceeds by running the agent through many episodes of the grid world. In each episode the agent starts at the initial state, selects actions via epsilon-greedy, observes rewards, and updates Q-values until it reaches a terminal state or hits the step limit. Over time, epsilon decays so the agent shifts from exploration (gathering information about unknown state-action pairs) toward exploitation (leveraging what it has already learned). Monitoring the cumulative reward per episode and a smoothed moving average reveals the characteristic RL learning curve: noisy early performance that gradually converges as the Q-table sharpens.

# Create environment and agent
env = GridWorld()
agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)

# Train the agent
print("Training Q-Learning agent...")
agent.train(num_episodes=1000)

# Plot learning progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Episode rewards
ax1.plot(agent.episode_rewards, alpha=0.7)
ax1.set_xlabel('Episode')
ax1.set_ylabel('Total Reward')
ax1.set_title('Learning Progress: Episode Rewards')
ax1.grid(True, alpha=0.3)

# Moving average of rewards
window_size = 50
moving_avg = np.convolve(agent.episode_rewards, np.ones(window_size)/window_size, mode='valid')
ax2.plot(moving_avg, color='red', linewidth=2)
ax2.set_xlabel('Episode')
ax2.set_ylabel('Average Reward (50 episodes)')
ax2.set_title('Smoothed Learning Progress')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final average reward (last 100 episodes): {np.mean(agent.episode_rewards[-100:]):.2f}")
print(f"Final average episode length: {np.mean(agent.episode_lengths[-100:]):.2f}")

Visualizing Learned Q-Values¶

A heatmap of Q-values for each action provides insight into how the agent perceives the environment. For every grid cell, the Q-value of an action reflects the expected discounted return if the agent takes that action and then follows the optimal policy. Cells near the goal should show high Q-values for actions that move toward it, while cells near the obstacle should show low Q-values for actions that lead into danger. Comparing the four heatmaps (one per action) reveals the agent's internal model of which movements are valuable in which locations – a key debugging tool in applied RL.

def visualize_q_values(agent: QLearningAgent):
    """Visualize Q-values for each action as heatmaps"""
    env = agent.env  # use the agent's environment rather than a global
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 15))
    actions = ['up', 'down', 'left', 'right']
    action_symbols = ['↑', '↓', '←', '→']
    
    for idx, (action, symbol) in enumerate(zip(actions, action_symbols)):
        ax = axes[idx // 2, idx % 2]
        
        # Create Q-value grid for this action
        q_grid = np.zeros((env.grid_size, env.grid_size))
        for i in range(env.grid_size):
            for j in range(env.grid_size):
                state = (i, j)
                q_grid[i, j] = agent.Q[state][action]
        
        # Plot heatmap
        im = ax.imshow(q_grid, cmap='viridis', alpha=0.8)
        plt.colorbar(im, ax=ax, shrink=0.8)
        
        # Add grid lines
        for i in range(env.grid_size + 1):
            ax.axhline(i - 0.5, color='black', linewidth=1)
            ax.axvline(i - 0.5, color='black', linewidth=1)
        
        # Add special state labels
        ax.text(env.start[1], env.start[0], 'START', ha='center', va='center', 
               fontsize=12, fontweight='bold', color='white')
        ax.text(env.goal[1], env.goal[0], 'GOAL', ha='center', va='center', 
               fontsize=12, fontweight='bold', color='white')
        ax.text(env.obstacle[1], env.obstacle[0], 'OBSTACLE', ha='center', va='center', 
               fontsize=8, fontweight='bold', color='red')
        
        ax.set_title(f'Q-Values for Action: {symbol} ({action})')
        ax.set_xticks(range(env.grid_size))
        ax.set_yticks(range(env.grid_size))
        ax.invert_yaxis()  # plt.gca() would target the colorbar axes here
    
    plt.tight_layout()
    plt.show()

# Visualize Q-values
visualize_q_values(agent)

Extracting the Optimal Policy¶

Once Q-values have converged, the optimal policy is recovered by selecting the action with the highest Q-value in each state: \(\pi^*(s) = \arg\max_a Q^*(s, a)\). The resulting policy is deterministic and greedy – no more exploration needed at deployment time. Visualizing these greedy actions as directional arrows on the grid confirms whether training succeeded: arrows should form a coherent path from every reachable cell toward the goal while steering around the obstacle. In real-world RL systems, this same extract-and-deploy pattern is used when moving from training to production.

def visualize_policy(agent: QLearningAgent):
    """Visualize the learned policy"""
    env = agent.env  # use the agent's environment rather than a global
    policy = agent.get_policy()
    
    fig, ax = plt.subplots(figsize=(8, 8))
    
    # Create empty grid
    grid = np.zeros((env.grid_size, env.grid_size))
    ax.imshow(grid, cmap='Blues', alpha=0.1)
    
    # Add grid lines
    for i in range(env.grid_size + 1):
        ax.axhline(i - 0.5, color='black', linewidth=2)
        ax.axvline(i - 0.5, color='black', linewidth=2)
    
    # Add policy arrows and labels
    action_arrows = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
    
    for i in range(env.grid_size):
        for j in range(env.grid_size):
            state = (i, j)
            
            if state == env.start:
                ax.text(j, i, 'START\n' + action_arrows.get(policy.get(state, ''), ''), 
                       ha='center', va='center', fontsize=10, fontweight='bold')
            elif state == env.goal:
                ax.text(j, i, 'GOAL', ha='center', va='center', fontsize=12, fontweight='bold')
            elif state == env.obstacle:
                ax.text(j, i, 'OBSTACLE', ha='center', va='center', fontsize=8, fontweight='bold', color='red')
            elif not env.is_terminal(state):
                action = policy.get(state, '')
                arrow = action_arrows.get(action, '')
                ax.text(j, i, arrow, ha='center', va='center', fontsize=20, fontweight='bold')
    
    ax.set_xlim(-0.5, env.grid_size - 0.5)
    ax.set_ylim(-0.5, env.grid_size - 0.5)
    ax.set_xticks(range(env.grid_size))
    ax.set_yticks(range(env.grid_size))
    ax.set_title('Learned Optimal Policy', fontsize=16)
    ax.invert_yaxis()
    plt.show()

# Visualize the learned policy
visualize_policy(agent)

๐Ÿ” Exploration vs Exploitationยถ

One of the key challenges in RL is balancing exploration (trying new things) and exploitation (using what we know).

ε-Greedy Strategy¶

  • With probability ε: explore (random action)

  • With probability 1-ε: exploit (best action)

  • ε usually decays over time

Other Strategies¶

  • Softmax/Boltzmann Exploration: Choose actions probabilistically based on Q-values

  • Upper Confidence Bound (UCB): Prefer actions with high uncertainty

  • Thompson Sampling: Sample from posterior distribution
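Softmax exploration is straightforward to sketch. The helper below (`softmax_action` is a hypothetical function, not part of the agent above) samples an action index with probability proportional to exp(Q / temperature):

```python
import numpy as np

def softmax_action(q_values, temperature=1.0, rng=None):
    """Boltzmann exploration: sample an action index with probability
    proportional to exp(Q / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                        # stabilize the exponentials
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

# Low temperature -> near-greedy; high temperature -> near-uniform.
action = softmax_action([1.0, 2.0, 0.5], temperature=0.5)
```

Unlike ε-greedy, which treats all non-greedy actions identically, softmax allocates more exploration to actions whose Q-values are close to the best one.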

def compare_exploration_strategies():
    """Compare different exploration strategies"""
    
    strategies = [
        {'name': 'ε=0.1 (fixed)', 'epsilon': 0.1, 'decay': False},
        {'name': 'ε=0.5 (fixed)', 'epsilon': 0.5, 'decay': False},
        {'name': 'ε=1.0→0.01 (decay)', 'epsilon': 1.0, 'decay': True}
    ]
    
    results = {}
    
    for strategy in strategies:
        print(f"Training with {strategy['name']}...")
        
        # Create agent
        agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=strategy['epsilon'])
        
        # Train episode by episode so epsilon decays only for the decaying strategy
        # (agent.train() always decays epsilon, which would ignore the 'decay' flag)
        for _ in range(500):
            reward, length = agent.train_episode()
            agent.episode_rewards.append(reward)
            agent.episode_lengths.append(length)
            if strategy['decay']:
                agent.epsilon = max(0.01, agent.epsilon * 0.995)
        
        # Store results
        results[strategy['name']] = {
            'rewards': agent.episode_rewards,
            'final_avg': np.mean(agent.episode_rewards[-100:])
        }
    
    # Plot comparison
    plt.figure(figsize=(12, 6))
    for name, data in results.items():
        plt.plot(data['rewards'], label=f"{name} (final: {data['final_avg']:.1f})", alpha=0.7)
    
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Comparison of Exploration Strategies')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Compare exploration strategies
compare_exploration_strategies()

🎮 SARSA vs Q-Learning¶

Q-Learning (Off-Policy)¶

  • Learns optimal policy regardless of behavior policy

  • Uses max over next actions: Q(s,a) ← Q(s,a) + α[R + γ max_a' Q(s',a') - Q(s,a)]

  • Learns the greedy target, so it can favor risky optimal paths even while exploring

SARSA (On-Policy)¶

  • Learns the value of the policy actually being followed

  • Uses actual next action: Q(s,a) ← Q(s,a) + α[R + γ Q(s',a') - Q(s,a)]

  • Safer, more conservative under exploration

When to Use Which?¶

  • Q-Learning: When you want optimal policy, can afford exploration

  • SARSA: When learning online with real consequences

class SARSAAgent(QLearningAgent):
    """SARSA agent (on-policy learning)"""
    
    def update_q(self, state: Tuple[int, int], action: str, reward: float, 
                 next_state: Tuple[int, int], next_action: Optional[str] = None):
        """Update Q-value using SARSA update rule"""
        # Current Q-value
        current_q = self.Q[state][action]
        
        # Q-value for next state-action pair
        if self.env.is_terminal(next_state):
            next_q = 0.0
        else:
            if next_action is None:
                next_action = self.get_action(next_state, explore=True)  # sample from the behavior policy (on-policy)
            next_q = self.Q[next_state][next_action]
        
        # SARSA update
        target = reward + self.gamma * next_q
        self.Q[state][action] = current_q + self.alpha * (target - current_q)
    
    def train_episode(self) -> Tuple[float, int]:
        """Run one SARSA training episode"""
        state = self.env.start
        action = self.get_action(state, explore=True)  # Choose first action
        total_reward = 0.0
        steps = 0
        
        while not self.env.is_terminal(state) and steps < 100:
            # Take action
            next_state = self.env.get_next_state(state, action)
            reward = self.env.get_reward(state, action, next_state)
            
            # Choose next action
            next_action = self.get_action(next_state, explore=True)
            
            # Update Q-value (SARSA)
            self.update_q(state, action, reward, next_state, next_action)
            
            # Update state and action
            state = next_state
            action = next_action
            
            total_reward += reward
            steps += 1
        
        return total_reward, steps

# Compare Q-Learning and SARSA
print("Comparing Q-Learning vs SARSA...")

q_agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)
q_agent.train(num_episodes=500)

sarsa_agent = SARSAAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)
sarsa_agent.train(num_episodes=500)

# Plot comparison
plt.figure(figsize=(12, 6))
plt.plot(q_agent.episode_rewards, label=f'Q-Learning (final: {np.mean(q_agent.episode_rewards[-100:]):.1f})', alpha=0.7)
plt.plot(sarsa_agent.episode_rewards, label=f'SARSA (final: {np.mean(sarsa_agent.episode_rewards[-100:]):.1f})', alpha=0.7)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Q-Learning vs SARSA Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

🧠 Key Takeaways¶

  1. Q-Learning is model-free: Learns directly from experience without knowing environment dynamics

  2. Temporal Difference learning: Updates estimates using other estimates, enabling online learning

  3. Off-policy learning: Learns optimal policy while following exploratory policy

  4. Exploration vs exploitation: ε-greedy balances trying new actions vs using known good actions

  5. Convergence: Tabular Q-learning converges to the optimal Q-values provided every state-action pair is visited infinitely often and the learning rate decays appropriately

🚀 What's Next?¶

Now that you understand Q-learning, you're ready for:

  • Deep Q-Networks (DQN): Using neural networks for Q-value approximation

  • Policy Gradient Methods: Learning policies directly

  • Actor-Critic Methods: Combining value and policy learning

  • Multi-Agent Reinforcement Learning: Multiple agents learning together

📚 Further Reading¶

๐Ÿ‹๏ธ Exercisesยถ

  1. Implement SARSA from scratch and compare with Q-learning

  2. Experiment with different exploration strategies (softmax, UCB)

  3. Add function approximation using a simple neural network

  4. Solve a different environment (e.g., Taxi-v3 from Gymnasium)

💡 Discussion Questions¶

  • Why is Q-learning considered "off-policy"?

  • How does the exploration rate affect learning?

  • What are the advantages of TD learning over Monte Carlo methods?

  • How might you handle continuous state spaces with Q-learning?