02: Value-Based Methods (Q-Learning)¶

"The best way to predict the future is to create it." – Peter Drucker

Welcome to the exciting world of Q-Learning! This notebook introduces value-based reinforcement learning methods, where we learn to estimate the value of state-action pairs directly.

🎯 Learning Objectives¶

By the end of this notebook, you'll understand:

  • The difference between model-based and model-free learning

  • How Q-learning works and why it's powerful

  • Temporal Difference (TD) learning principles

  • Exploration vs exploitation trade-offs

  • Practical implementation of Q-learning

🧠 Model-Based vs Model-Free Learning¶

Model-Based Methods¶

  • Learn a model of the environment: P(s'|s,a) and R(s,a)

  • Use planning to compute optimal policy

  • Examples: Value Iteration, Policy Iteration

  • Pros: Sample efficient, can plan ahead

  • Cons: Need accurate model, computation intensive

Model-Free Methods¶

  • Learn directly from experience without modeling environment

  • No need to know transition probabilities or rewards

  • Examples: Q-Learning, SARSA, Monte Carlo

  • Pros: Simpler, works with any environment

  • Cons: Usually need more samples

🎮 Q-Learning Algorithm¶

Q-Learning is a model-free, off-policy reinforcement learning algorithm that learns the optimal action-value function Q*(s,a).

Key Idea¶

Learn Q(s,a) directly by updating estimates based on observed rewards and estimated future rewards.

Q-Learning Update Rule¶

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]

Where:

  • α: Learning rate (0 < α ≤ 1)

  • γ: Discount factor (0 ≤ γ ≤ 1)

  • R_{t+1}: Immediate reward

  • max_a Q(S_{t+1}, a): Best estimate of future rewards
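The update is easy to verify by hand. A minimal sketch of a single update, with all numbers made up purely for illustration:

```python
# One Q-learning update computed by hand (all numbers are made up for illustration)
alpha, gamma = 0.1, 0.9

q_sa = 2.0         # current estimate Q(S_t, A_t)
reward = 1.0       # observed reward R_{t+1}
max_next_q = 5.0   # max_a Q(S_{t+1}, a), the best estimate of future rewards

td_target = reward + gamma * max_next_q  # 1.0 + 0.9 * 5.0 = 5.5
td_error = td_target - q_sa              # 5.5 - 2.0 = 3.5
q_sa = q_sa + alpha * td_error           # 2.0 + 0.1 * 3.5 = 2.35
```

Because α = 0.1, the estimate moves only a tenth of the way toward the new target, which keeps learning stable under noisy rewards.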

🔄 Temporal Difference Learning¶

Q-Learning is a Temporal Difference (TD) learning method:

  • TD Learning: Update estimates based on other estimates

  • Bootstrapping: Use existing value estimates to construct the update target

  • Online Learning: Learn from incomplete episodes

TD vs Monte Carlo¶

  • MC: Wait until end of episode to update values

  • TD: Update values immediately after each step

  • TD: Can learn online, before episode ends

  • TD: Usually converges faster
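The contrast is easiest to see side by side. A minimal sketch for a state-value update, where the state names and all numbers are made up for illustration:

```python
# Same starting estimate, two different update targets (toy numbers).
alpha, gamma = 0.1, 0.9

V = {'s': 0.0, 's_next': 1.5}   # current state-value estimates (made up)
r1, r2, r3 = 1.0, 0.0, 2.0      # rewards observed after leaving s, until the episode ends

# Monte Carlo: must wait for the full discounted return G before updating V(s)
G = r1 + gamma * r2 + gamma**2 * r3           # 1.0 + 0.0 + 0.81 * 2.0 = 2.62
v_mc = V['s'] + alpha * (G - V['s'])          # 0.1 * 2.62 = 0.262

# TD(0): updates immediately after one step, bootstrapping from V(s_next)
td_target = r1 + gamma * V['s_next']          # 1.0 + 0.9 * 1.5 = 2.35
v_td = V['s'] + alpha * (td_target - V['s'])  # 0.1 * 2.35 = 0.235
```

The MC target uses only observed rewards (unbiased, high variance); the TD target substitutes the current estimate V(s_next) for everything beyond the first step (biased, low variance, available immediately).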

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import random
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]

Q-Learning Implementation¶

With the theoretical foundation in place, we now translate the Q-learning update rule into working code. The QLearningAgent class below maintains a Q-table – a dictionary mapping each (state, action) pair to an estimated value – and updates it after every environment step using the temporal difference update:

\[Q(s, a) \leftarrow Q(s, a) + \alpha \bigl[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \bigr]\]

The term in brackets is the TD error: the difference between the bootstrap target \(r + \gamma \max_{a'} Q(s', a')\) and the current estimate. Multiplying by the learning rate \(\alpha\) controls how aggressively we incorporate new information. Action selection follows an epsilon-greedy strategy, choosing a random action with probability \(\epsilon\) and the greedy action otherwise. This is the same exploration mechanism used in bandit problems, scaled up to sequential decision-making.

class QLearningAgent:
    """Q-Learning agent for grid world"""
    
    def __init__(self, env, alpha: float = 0.1, gamma: float = 0.9, epsilon: float = 0.1):
        self.env = env
        self.alpha = alpha  # Learning rate
        self.gamma = gamma  # Discount factor
        self.epsilon = epsilon  # Exploration rate
        
        # Initialize Q-table
        self.Q = defaultdict(lambda: defaultdict(float))
        
        # Track learning progress
        self.episode_rewards = []
        self.episode_lengths = []
    
    def get_action(self, state: Tuple[int, int], explore: bool = True) -> str:
        """Choose action using epsilon-greedy policy"""
        if explore and random.random() < self.epsilon:
            # Explore: random action
            return random.choice(self.env.actions)
        else:
            # Exploit: best action
            q_values = [self.Q[state][action] for action in self.env.actions]
            max_q = max(q_values)
            # Break ties randomly
            best_actions = [action for action, q in zip(self.env.actions, q_values) if q == max_q]
            return random.choice(best_actions)
    
    def update_q(self, state: Tuple[int, int], action: str, reward: float, next_state: Tuple[int, int]):
        """Update Q-value using Q-learning update rule"""
        # Current Q-value
        current_q = self.Q[state][action]
        
        # Maximum Q-value for next state
        if self.env.is_terminal(next_state):
            max_next_q = 0.0
        else:
            max_next_q = max([self.Q[next_state][a] for a in self.env.actions])
        
        # Q-learning update
        target = reward + self.gamma * max_next_q
        self.Q[state][action] = current_q + self.alpha * (target - current_q)
    
    def train_episode(self) -> Tuple[float, int]:
        """Run one training episode"""
        state = self.env.start
        total_reward = 0.0
        steps = 0
        
        while not self.env.is_terminal(state) and steps < 100:  # Prevent infinite loops
            # Choose action
            action = self.get_action(state, explore=True)
            
            # Take action
            next_state = self.env.get_next_state(state, action)
            reward = self.env.get_reward(state, action, next_state)
            
            # Update Q-value
            self.update_q(state, action, reward, next_state)
            
            # Update state and counters
            state = next_state
            total_reward += reward
            steps += 1
        
        return total_reward, steps
    
    def train(self, num_episodes: int = 1000):
        """Train the agent for multiple episodes"""
        for episode in range(num_episodes):
            reward, length = self.train_episode()
            self.episode_rewards.append(reward)
            self.episode_lengths.append(length)
            
            # Decay epsilon (reduce exploration over time)
            self.epsilon = max(0.01, self.epsilon * 0.995)
    
    def get_policy(self) -> Dict[Tuple[int, int], str]:
        """Extract greedy policy from Q-table"""
        policy = {}
        for state in self.env.states:
            if not self.env.is_terminal(state):
                q_values = [self.Q[state][action] for action in self.env.actions]
                max_q = max(q_values)
                best_actions = [action for action, q in zip(self.env.actions, q_values) if q == max_q]
                policy[state] = random.choice(best_actions)  # Break ties randomly
        return policy
    
    def get_q_values(self) -> Dict[Tuple[int, int], List[float]]:
        """Get Q-values for visualization"""
        q_values = {}
        for state in self.env.states:
            q_values[state] = [self.Q[state][action] for action in self.env.actions]
        return q_values

Training the Q-Learning Agent¶

Training proceeds by running the agent through many episodes of the grid world. In each episode the agent starts at the initial state, selects actions via epsilon-greedy, observes rewards, and updates Q-values until it reaches a terminal state or hits the step limit. Over time, epsilon decays so the agent shifts from exploration (gathering information about unknown state-action pairs) toward exploitation (leveraging what it has already learned). Monitoring the cumulative reward per episode and a smoothed moving average reveals the characteristic RL learning curve: noisy early performance that gradually converges as the Q-table sharpens.

# Create environment and agent
env = GridWorld()
agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)

# Train the agent
print("Training Q-Learning agent...")
agent.train(num_episodes=1000)

# Plot learning progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))

# Episode rewards
ax1.plot(agent.episode_rewards, alpha=0.7)
ax1.set_xlabel('Episode')
ax1.set_ylabel('Total Reward')
ax1.set_title('Learning Progress: Episode Rewards')
ax1.grid(True, alpha=0.3)

# Moving average of rewards
window_size = 50
moving_avg = np.convolve(agent.episode_rewards, np.ones(window_size)/window_size, mode='valid')
ax2.plot(moving_avg, color='red', linewidth=2)
ax2.set_xlabel('Episode')
ax2.set_ylabel('Average Reward (50 episodes)')
ax2.set_title('Smoothed Learning Progress')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Final average reward (last 100 episodes): {np.mean(agent.episode_rewards[-100:]):.2f}")
print(f"Final average episode length: {np.mean(agent.episode_lengths[-100:]):.2f}")

Visualizing Learned Q-Values¶

A heatmap of Q-values for each action provides insight into how the agent perceives the environment. For every grid cell, the Q-value of an action reflects the expected discounted return if the agent takes that action and then follows the optimal policy. Cells near the goal should show high Q-values for actions that move toward it, while cells near the obstacle should show low Q-values for actions that lead into danger. Comparing the four heatmaps (one per action) reveals the agent's internal model of which movements are valuable in which locations – a key debugging tool in applied RL.

def visualize_q_values(agent: QLearningAgent):
    """Visualize Q-values for each action as heatmaps"""
    env = agent.env  # use the agent's environment rather than a global
    
    fig, axes = plt.subplots(2, 2, figsize=(15, 15))
    actions = ['up', 'down', 'left', 'right']
    action_symbols = ['↑', '↓', '←', '→']
    
    for idx, (action, symbol) in enumerate(zip(actions, action_symbols)):
        ax = axes[idx // 2, idx % 2]
        
        # Create Q-value grid for this action
        q_grid = np.zeros((env.grid_size, env.grid_size))
        for i in range(env.grid_size):
            for j in range(env.grid_size):
                state = (i, j)
                q_grid[i, j] = agent.Q[state][action]
        
        # Plot heatmap
        im = ax.imshow(q_grid, cmap='viridis', alpha=0.8)
        plt.colorbar(im, ax=ax, shrink=0.8)
        
        # Add grid lines
        for i in range(env.grid_size + 1):
            ax.axhline(i - 0.5, color='black', linewidth=1)
            ax.axvline(i - 0.5, color='black', linewidth=1)
        
        # Add special state labels
        ax.text(env.start[1], env.start[0], 'START', ha='center', va='center', 
               fontsize=12, fontweight='bold', color='white')
        ax.text(env.goal[1], env.goal[0], 'GOAL', ha='center', va='center', 
               fontsize=12, fontweight='bold', color='white')
        ax.text(env.obstacle[1], env.obstacle[0], 'OBSTACLE', ha='center', va='center', 
               fontsize=8, fontweight='bold', color='red')
        
        ax.set_title(f'Q-Values for Action: {symbol} ({action})')
        ax.set_xticks(range(env.grid_size))
        ax.set_yticks(range(env.grid_size))
        ax.invert_yaxis()  # plt.gca() would target the colorbar axes here
    
    plt.tight_layout()
    plt.show()

# Visualize Q-values
visualize_q_values(agent)

Extracting the Optimal Policy¶

Once Q-values have converged, the optimal policy is recovered by selecting the action with the highest Q-value in each state: \(\pi^*(s) = \arg\max_a Q^*(s, a)\). The resulting policy is deterministic and greedy – no more exploration needed at deployment time. Visualizing these greedy actions as directional arrows on the grid confirms whether training succeeded: arrows should form a coherent path from every reachable cell toward the goal while steering around the obstacle. In real-world RL systems, this same extract-and-deploy pattern is used when moving from training to production.

def visualize_policy(agent: QLearningAgent):
    """Visualize the learned policy"""
    env = agent.env  # use the agent's environment rather than a global
    policy = agent.get_policy()
    
    fig, ax = plt.subplots(figsize=(8, 8))
    
    # Create empty grid
    grid = np.zeros((env.grid_size, env.grid_size))
    ax.imshow(grid, cmap='Blues', alpha=0.1)
    
    # Add grid lines
    for i in range(env.grid_size + 1):
        ax.axhline(i - 0.5, color='black', linewidth=2)
        ax.axvline(i - 0.5, color='black', linewidth=2)
    
    # Add policy arrows and labels
    action_arrows = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
    
    for i in range(env.grid_size):
        for j in range(env.grid_size):
            state = (i, j)
            
            if state == env.start:
                ax.text(j, i, 'START\n' + action_arrows.get(policy.get(state, ''), ''), 
                       ha='center', va='center', fontsize=10, fontweight='bold')
            elif state == env.goal:
                ax.text(j, i, 'GOAL', ha='center', va='center', fontsize=12, fontweight='bold')
            elif state == env.obstacle:
                ax.text(j, i, 'OBSTACLE', ha='center', va='center', fontsize=8, fontweight='bold', color='red')
            elif not env.is_terminal(state):
                action = policy.get(state, '')
                arrow = action_arrows.get(action, '')
                ax.text(j, i, arrow, ha='center', va='center', fontsize=20, fontweight='bold')
    
    ax.set_xlim(-0.5, env.grid_size - 0.5)
    ax.set_ylim(-0.5, env.grid_size - 0.5)
    ax.set_xticks(range(env.grid_size))
    ax.set_yticks(range(env.grid_size))
    ax.set_title('Learned Optimal Policy', fontsize=16)
    ax.invert_yaxis()
    plt.show()

# Visualize the learned policy
visualize_policy(agent)

๐Ÿ” Exploration vs Exploitationยถ

One of the key challenges in RL is balancing exploration (trying new things) and exploitation (using what we know).

ε-Greedy Strategy¶

  • With probability ε: explore (random action)

  • With probability 1-ε: exploit (best action)

  • ε usually decays over time

Other Strategies¶

  • Softmax/Boltzmann Exploration: Choose actions probabilistically based on Q-values

  • Upper Confidence Bound (UCB): Prefer actions with high uncertainty

  • Thompson Sampling: Sample from posterior distribution
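Softmax exploration is straightforward to sketch. The helper below (`softmax_action` is a hypothetical function, not part of the agent above) samples an action index with probability proportional to exp(Q / temperature):

```python
import numpy as np

def softmax_action(q_values, temperature=1.0, rng=None):
    """Boltzmann exploration: sample an action index with probability
    proportional to exp(Q / temperature)."""
    if rng is None:
        rng = np.random.default_rng()
    prefs = np.asarray(q_values, dtype=float) / temperature
    prefs -= prefs.max()                        # stabilize the exponentials
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return rng.choice(len(probs), p=probs)

# Low temperature -> near-greedy; high temperature -> near-uniform.
action = softmax_action([1.0, 2.0, 0.5], temperature=0.5)
```

Unlike ε-greedy, which treats all non-greedy actions identically, softmax allocates more exploration to actions whose Q-values are close to the best one.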

def compare_exploration_strategies():
    """Compare different exploration strategies"""
    
    strategies = [
        {'name': 'ε=0.1 (fixed)', 'epsilon': 0.1, 'decay': False},
        {'name': 'ε=0.5 (fixed)', 'epsilon': 0.5, 'decay': False},
        {'name': 'ε=1.0→0.01 (decay)', 'epsilon': 1.0, 'decay': True}
    ]
    
    results = {}
    
    for strategy in strategies:
        print(f"Training with {strategy['name']}...")
        
        # Create agent
        agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=strategy['epsilon'])
        
        # Train episode by episode so epsilon decays only for the decaying strategy
        # (agent.train() always decays epsilon, which would ignore the 'decay' flag)
        for _ in range(500):
            reward, length = agent.train_episode()
            agent.episode_rewards.append(reward)
            agent.episode_lengths.append(length)
            if strategy['decay']:
                agent.epsilon = max(0.01, agent.epsilon * 0.995)
        
        # Store results
        results[strategy['name']] = {
            'rewards': agent.episode_rewards,
            'final_avg': np.mean(agent.episode_rewards[-100:])
        }
    
    # Plot comparison
    plt.figure(figsize=(12, 6))
    for name, data in results.items():
        plt.plot(data['rewards'], label=f"{name} (final: {data['final_avg']:.1f})", alpha=0.7)
    
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Comparison of Exploration Strategies')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()

# Compare exploration strategies
compare_exploration_strategies()

🎮 SARSA vs Q-Learning¶

Q-Learning (Off-Policy)¶

  • Learns optimal policy regardless of behavior policy

  • Uses max over next actions: Q(s,a) ← Q(s,a) + α[R + γ max_a' Q(s',a') - Q(s,a)]

  • Learns the greedy target, so it can favor risky optimal paths even while exploring

SARSA (On-Policy)¶

  • Learns the value of the policy actually being followed

  • Uses actual next action: Q(s,a) ← Q(s,a) + α[R + γ Q(s',a') - Q(s,a)]

  • Safer, more conservative under exploration

When to Use Which?¶

  • Q-Learning: When you want optimal policy, can afford exploration

  • SARSA: When learning online with real consequences

class SARSAAgent(QLearningAgent):
    """SARSA agent (on-policy learning)"""
    
    def update_q(self, state: Tuple[int, int], action: str, reward: float, 
                 next_state: Tuple[int, int], next_action: Optional[str] = None):
        """Update Q-value using SARSA update rule"""
        # Current Q-value
        current_q = self.Q[state][action]
        
        # Q-value for next state-action pair
        if self.env.is_terminal(next_state):
            next_q = 0.0
        else:
            if next_action is None:
                next_action = self.get_action(next_state, explore=True)  # sample from the behavior policy (on-policy)
            next_q = self.Q[next_state][next_action]
        
        # SARSA update
        target = reward + self.gamma * next_q
        self.Q[state][action] = current_q + self.alpha * (target - current_q)
    
    def train_episode(self) -> Tuple[float, int]:
        """Run one SARSA training episode"""
        state = self.env.start
        action = self.get_action(state, explore=True)  # Choose first action
        total_reward = 0.0
        steps = 0
        
        while not self.env.is_terminal(state) and steps < 100:
            # Take action
            next_state = self.env.get_next_state(state, action)
            reward = self.env.get_reward(state, action, next_state)
            
            # Choose next action
            next_action = self.get_action(next_state, explore=True)
            
            # Update Q-value (SARSA)
            self.update_q(state, action, reward, next_state, next_action)
            
            # Update state and action
            state = next_state
            action = next_action
            
            total_reward += reward
            steps += 1
        
        return total_reward, steps

# Compare Q-Learning and SARSA
print("Comparing Q-Learning vs SARSA...")

q_agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)
q_agent.train(num_episodes=500)

sarsa_agent = SARSAAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)
sarsa_agent.train(num_episodes=500)

# Plot comparison
plt.figure(figsize=(12, 6))
plt.plot(q_agent.episode_rewards, label=f'Q-Learning (final: {np.mean(q_agent.episode_rewards[-100:]):.1f})', alpha=0.7)
plt.plot(sarsa_agent.episode_rewards, label=f'SARSA (final: {np.mean(sarsa_agent.episode_rewards[-100:]):.1f})', alpha=0.7)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Q-Learning vs SARSA Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

🧠 Key Takeaways¶

  1. Q-Learning is model-free: Learns directly from experience without knowing environment dynamics

  2. Temporal Difference learning: Updates estimates using other estimates, enabling online learning

  3. Off-policy learning: Learns optimal policy while following exploratory policy

  4. Exploration vs exploitation: ε-greedy balances trying new actions vs using known good actions

  5. Convergence: Tabular Q-learning converges to the optimal Q-values provided every state-action pair is visited infinitely often and the learning rate decays appropriately

🚀 What's Next?¶

Now that you understand Q-learning, you're ready for:

  • Deep Q-Networks (DQN): Using neural networks for Q-value approximation

  • Policy Gradient Methods: Learning policies directly

  • Actor-Critic Methods: Combining value and policy learning

  • Multi-Agent Reinforcement Learning: Multiple agents learning together

📚 Further Reading¶

๐Ÿ‹๏ธ Exercisesยถ

  1. Implement SARSA from scratch and compare with Q-learning

  2. Experiment with different exploration strategies (softmax, UCB)

  3. Add function approximation using a simple neural network

  4. Solve a different environment (e.g., Taxi-v3 from Gymnasium)

💡 Discussion Questions¶

  • Why is Q-learning considered "off-policy"?

  • How does the exploration rate affect learning?

  • What are the advantages of TD learning over Monte Carlo methods?

  • How might you handle continuous state spaces with Q-learning?