02: Value-Based Methods (Q-Learning)
"The best way to predict the future is to create it." - Peter Drucker
Welcome to the exciting world of Q-Learning! This notebook introduces value-based reinforcement learning methods, where we learn to estimate the value of state-action pairs directly.
🎯 Learning Objectives
By the end of this notebook, you'll understand:
The difference between model-based and model-free learning
How Q-learning works and why it's powerful
Temporal Difference (TD) learning principles
Exploration vs exploitation trade-offs
Practical implementation of Q-learning
🧠 Model-Based vs Model-Free Learning
Model-Based Methods
Learn a model of the environment: P(s'|s,a) and R(s,a)
Use planning to compute optimal policy
Examples: Value Iteration, Policy Iteration
Pros: Sample efficient, can plan ahead
Cons: Need accurate model, computation intensive
Model-Free Methods
Learn directly from experience without modeling environment
No need to know transition probabilities or rewards
Examples: Q-Learning, SARSA, Monte Carlo
Pros: Simpler, works with any environment
Cons: Usually need more samples
🎮 Q-Learning Algorithm
Q-Learning is a model-free, off-policy reinforcement learning algorithm that learns the optimal action-value function Q*(s,a).
Key Idea
Learn Q(s,a) directly by updating estimates based on observed rewards and estimated future rewards.
Q-Learning Update Rule
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) - Q(S_t, A_t)]
Where:
α: Learning rate (0 < α ≤ 1)
γ: Discount factor (0 ≤ γ ≤ 1)
R_{t+1}: Immediate reward
max_a Q(S_{t+1}, a): Best estimate of future rewards
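To make the update rule concrete, here is one update worked through numerically; all values are illustrative, not taken from a real run:

```python
# One Q-learning update, step by step (all numbers are made up for illustration)
alpha, gamma = 0.1, 0.9

q_current = 0.5           # Q(S_t, A_t): current estimate
reward = 1.0              # R_{t+1}: immediate reward observed
q_next = [0.2, 0.8, 0.4]  # Q(S_{t+1}, a) for each available action a

td_target = reward + gamma * max(q_next)  # 1.0 + 0.9 * 0.8 = 1.72
td_error = td_target - q_current          # 1.72 - 0.5 = 1.22
q_updated = q_current + alpha * td_error  # 0.5 + 0.1 * 1.22 = 0.622

print(round(q_updated, 3))  # 0.622
```

Note how the small learning rate moves the estimate only a fraction of the way toward the target.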
Temporal Difference Learning
Q-Learning is a Temporal Difference (TD) learning method:
TD Learning: Update estimates based on other estimates
Bootstrapping: Use current estimates to improve current estimates
Online Learning: Learn from incomplete episodes
TD vs Monte Carlo
MC: Wait until end of episode to update values
TD: Update values immediately after each step
TD: Can learn online, before episode ends
TD: Usually converges faster
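The contrast shows up on a classic diagnostic: a five-state random walk whose true state values are 1/6 through 5/6. The sketch below (a standard textbook toy problem, not this notebook's grid world) estimates state values with constant-step-size every-visit MC and with TD(0):

```python
import random

# Five-state random walk: start in the middle, step left/right uniformly,
# terminate off the left end (reward 0) or the right end (reward 1).
N_STATES, ALPHA, N_EPISODES = 5, 0.05, 3000

def run(method):
    V = [0.5] * N_STATES  # initial value estimates
    for _ in range(N_EPISODES):
        s, trajectory = N_STATES // 2, []
        while True:
            s_next = s + random.choice([-1, 1])
            done = s_next < 0 or s_next >= N_STATES
            r = 1.0 if s_next >= N_STATES else 0.0
            if method == "td":
                # TD(0): update immediately, bootstrapping from V[s_next]
                target = r if done else V[s_next]
                V[s] += ALPHA * (target - V[s])
            else:
                trajectory.append((s, r))
            s = s_next
            if done:
                break
        if method == "mc":
            # MC: wait for the episode to end, then update toward the return
            G = 0.0
            for s_t, r_t in reversed(trajectory):
                G = r_t + G  # undiscounted return from s_t
                V[s_t] += ALPHA * (G - V[s_t])
    return V

random.seed(0)
V_td = run("td")
V_mc = run("mc")
print("TD(0):", [round(v, 2) for v in V_td])
print("MC:   ", [round(v, 2) for v in V_mc])
# Both should end up close to the true values [0.17, 0.33, 0.5, 0.67, 0.83]
```

The key structural difference is where the update happens: inside the step loop for TD, after the episode for MC.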
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from collections import defaultdict
import random
from typing import Dict, List, Tuple, Optional
import warnings
warnings.filterwarnings('ignore')
# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [12, 8]
Q-Learning Implementation
With the theoretical foundation in place, we now translate the Q-learning update rule into working code. The QLearningAgent class below maintains a Q-table, a dictionary mapping each (state, action) pair to an estimated value, and updates it after every environment step using the temporal difference update:
\[ Q(s, a) \leftarrow Q(s, a) + \alpha \left[ r + \gamma \max_{a'} Q(s', a') - Q(s, a) \right] \]
The term in brackets is the TD error: the difference between the bootstrap target \(r + \gamma \max_{a'} Q(s', a')\) and the current estimate. Multiplying by the learning rate \(\alpha\) controls how aggressively we incorporate new information. Action selection follows an epsilon-greedy strategy, choosing a random action with probability \(\epsilon\) and the greedy action otherwise. This is the same exploration mechanism used in bandit problems, scaled up to sequential decision-making.
class QLearningAgent:
    """Q-Learning agent for grid world"""

    def __init__(self, env, alpha: float = 0.1, gamma: float = 0.9, epsilon: float = 0.1):
        self.env = env
        self.alpha = alpha      # Learning rate
        self.gamma = gamma      # Discount factor
        self.epsilon = epsilon  # Exploration rate
        # Initialize Q-table
        self.Q = defaultdict(lambda: defaultdict(float))
        # Track learning progress
        self.episode_rewards = []
        self.episode_lengths = []

    def get_action(self, state: Tuple[int, int], explore: bool = True) -> str:
        """Choose action using epsilon-greedy policy"""
        if explore and random.random() < self.epsilon:
            # Explore: random action
            return random.choice(self.env.actions)
        else:
            # Exploit: best action
            q_values = [self.Q[state][action] for action in self.env.actions]
            max_q = max(q_values)
            # Break ties randomly
            best_actions = [action for action, q in zip(self.env.actions, q_values) if q == max_q]
            return random.choice(best_actions)

    def update_q(self, state: Tuple[int, int], action: str, reward: float, next_state: Tuple[int, int]):
        """Update Q-value using Q-learning update rule"""
        # Current Q-value
        current_q = self.Q[state][action]
        # Maximum Q-value for next state
        if self.env.is_terminal(next_state):
            max_next_q = 0.0
        else:
            max_next_q = max([self.Q[next_state][a] for a in self.env.actions])
        # Q-learning update
        target = reward + self.gamma * max_next_q
        self.Q[state][action] = current_q + self.alpha * (target - current_q)

    def train_episode(self) -> Tuple[float, int]:
        """Run one training episode"""
        state = self.env.start
        total_reward = 0.0
        steps = 0
        while not self.env.is_terminal(state) and steps < 100:  # Prevent infinite loops
            # Choose action
            action = self.get_action(state, explore=True)
            # Take action
            next_state = self.env.get_next_state(state, action)
            reward = self.env.get_reward(state, action, next_state)
            # Update Q-value
            self.update_q(state, action, reward, next_state)
            # Update state and counters
            state = next_state
            total_reward += reward
            steps += 1
        return total_reward, steps

    def train(self, num_episodes: int = 1000):
        """Train the agent for multiple episodes"""
        for episode in range(num_episodes):
            reward, length = self.train_episode()
            self.episode_rewards.append(reward)
            self.episode_lengths.append(length)
            # Decay epsilon (reduce exploration over time)
            self.epsilon = max(0.01, self.epsilon * 0.995)

    def get_policy(self) -> Dict[Tuple[int, int], str]:
        """Extract greedy policy from Q-table"""
        policy = {}
        for state in self.env.states:
            if not self.env.is_terminal(state):
                q_values = [self.Q[state][action] for action in self.env.actions]
                max_q = max(q_values)
                best_actions = [action for action, q in zip(self.env.actions, q_values) if q == max_q]
                policy[state] = random.choice(best_actions)  # Break ties randomly
        return policy

    def get_q_values(self) -> Dict[Tuple[int, int], List[float]]:
        """Get Q-values for visualization"""
        q_values = {}
        for state in self.env.states:
            q_values[state] = [self.Q[state][action] for action in self.env.actions]
        return q_values
Training the Q-Learning Agent
Training proceeds by running the agent through many episodes of the grid world. In each episode the agent starts at the initial state, selects actions via epsilon-greedy, observes rewards, and updates Q-values until it reaches a terminal state or hits the step limit. Over time, epsilon decays so the agent shifts from exploration (gathering information about unknown state-action pairs) toward exploitation (leveraging what it has already learned). Monitoring the cumulative reward per episode and a smoothed moving average reveals the characteristic RL learning curve: noisy early performance that gradually converges as the Q-table sharpens.
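The code below assumes a GridWorld environment defined earlier in this series; it is not reproduced in this notebook. If you are running this section standalone, a minimal stand-in exposing the interface the agent uses (actions, start, goal, obstacle, grid_size, states, is_terminal, get_next_state, get_reward) might look like this. The layout and reward values here are assumptions, not the original definition:

```python
class GridWorld:
    """Minimal stand-in grid world (assumed interface, illustrative rewards)."""

    def __init__(self, grid_size: int = 4):
        self.grid_size = grid_size
        self.actions = ['up', 'down', 'left', 'right']
        self.start = (0, 0)
        self.goal = (grid_size - 1, grid_size - 1)
        self.obstacle = (1, 1)
        self.states = [(i, j) for i in range(grid_size) for j in range(grid_size)]

    def is_terminal(self, state):
        # Both the goal and the obstacle end the episode
        return state in (self.goal, self.obstacle)

    def get_next_state(self, state, action):
        moves = {'up': (-1, 0), 'down': (1, 0), 'left': (0, -1), 'right': (0, 1)}
        di, dj = moves[action]
        ni, nj = state[0] + di, state[1] + dj
        # Moves off the grid leave the agent where it is
        if 0 <= ni < self.grid_size and 0 <= nj < self.grid_size:
            return (ni, nj)
        return state

    def get_reward(self, state, action, next_state):
        if next_state == self.goal:
            return 10.0
        if next_state == self.obstacle:
            return -10.0
        return -0.1  # small step cost encourages short paths
```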
# Create environment and agent
env = GridWorld()
agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)
# Train the agent
print("Training Q-Learning agent...")
agent.train(num_episodes=1000)
# Plot learning progress
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 5))
# Episode rewards
ax1.plot(agent.episode_rewards, alpha=0.7)
ax1.set_xlabel('Episode')
ax1.set_ylabel('Total Reward')
ax1.set_title('Learning Progress: Episode Rewards')
ax1.grid(True, alpha=0.3)
# Moving average of rewards
window_size = 50
moving_avg = np.convolve(agent.episode_rewards, np.ones(window_size)/window_size, mode='valid')
ax2.plot(moving_avg, color='red', linewidth=2)
ax2.set_xlabel('Episode')
ax2.set_ylabel('Average Reward (50 episodes)')
ax2.set_title('Smoothed Learning Progress')
ax2.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Final average reward (last 100 episodes): {np.mean(agent.episode_rewards[-100:]):.2f}")
print(f"Final average episode length: {np.mean(agent.episode_lengths[-100:]):.2f}")
Visualizing Learned Q-Values
A heatmap of Q-values for each action provides insight into how the agent perceives the environment. For every grid cell, the Q-value of an action reflects the expected discounted return if the agent takes that action and then follows the optimal policy. Cells near the goal should show high Q-values for actions that move toward it, while cells near the obstacle should show low Q-values for actions that lead into danger. Comparing the four heatmaps (one per action) reveals the agent's learned estimate of which movements are valuable in which locations, a key debugging tool in applied RL.
def visualize_q_values(agent: QLearningAgent):
    """Visualize Q-values per action as heatmaps"""
    env = agent.env
    fig, axes = plt.subplots(2, 2, figsize=(15, 15))
    actions = ['up', 'down', 'left', 'right']
    action_symbols = ['↑', '↓', '←', '→']
    for idx, (action, symbol) in enumerate(zip(actions, action_symbols)):
        ax = axes[idx // 2, idx % 2]
        # Create Q-value grid for this action
        q_grid = np.zeros((env.grid_size, env.grid_size))
        for i in range(env.grid_size):
            for j in range(env.grid_size):
                state = (i, j)
                q_grid[i, j] = agent.Q[state][action]
        # Plot heatmap (imshow puts row 0 at the top, matching the grid layout)
        im = ax.imshow(q_grid, cmap='viridis', alpha=0.8)
        plt.colorbar(im, ax=ax, shrink=0.8)
        # Add grid lines
        for i in range(env.grid_size + 1):
            ax.axhline(i - 0.5, color='black', linewidth=1)
            ax.axvline(i - 0.5, color='black', linewidth=1)
        # Add special state labels
        ax.text(env.start[1], env.start[0], 'START', ha='center', va='center',
                fontsize=12, fontweight='bold', color='white')
        ax.text(env.goal[1], env.goal[0], 'GOAL', ha='center', va='center',
                fontsize=12, fontweight='bold', color='white')
        ax.text(env.obstacle[1], env.obstacle[0], 'OBSTACLE', ha='center', va='center',
                fontsize=8, fontweight='bold', color='red')
        ax.set_title(f'Q-Values for Action: {symbol} ({action})')
        ax.set_xticks(range(env.grid_size))
        ax.set_yticks(range(env.grid_size))
    plt.tight_layout()
    plt.show()
# Visualize Q-values
visualize_q_values(agent)
Extracting the Optimal Policy
Once Q-values have converged, the optimal policy is recovered by selecting the action with the highest Q-value in each state: \(\pi^*(s) = \arg\max_a Q^*(s, a)\). The resulting policy is deterministic and greedy: no exploration is needed at deployment time. Visualizing these greedy actions as directional arrows on the grid confirms whether training succeeded: arrows should form a coherent path from every reachable cell toward the goal while steering around the obstacle. In real-world RL systems, this same extract-and-deploy pattern is used when moving from training to production.
def visualize_policy(agent: QLearningAgent):
    """Visualize the learned policy"""
    env = agent.env
    policy = agent.get_policy()
    fig, ax = plt.subplots(figsize=(8, 8))
    # Create empty grid
    grid = np.zeros((env.grid_size, env.grid_size))
    ax.imshow(grid, cmap='Blues', alpha=0.1)
    # Add grid lines
    for i in range(env.grid_size + 1):
        ax.axhline(i - 0.5, color='black', linewidth=2)
        ax.axvline(i - 0.5, color='black', linewidth=2)
    # Add policy arrows and labels
    action_arrows = {'up': '↑', 'down': '↓', 'left': '←', 'right': '→'}
    for i in range(env.grid_size):
        for j in range(env.grid_size):
            state = (i, j)
            if state == env.start:
                ax.text(j, i, 'START\n' + action_arrows.get(policy.get(state, ''), ''),
                        ha='center', va='center', fontsize=10, fontweight='bold')
            elif state == env.goal:
                ax.text(j, i, 'GOAL', ha='center', va='center', fontsize=12, fontweight='bold')
            elif state == env.obstacle:
                ax.text(j, i, 'OBSTACLE', ha='center', va='center', fontsize=8, fontweight='bold', color='red')
            elif not env.is_terminal(state):
                action = policy.get(state, '')
                arrow = action_arrows.get(action, '')
                ax.text(j, i, arrow, ha='center', va='center', fontsize=20, fontweight='bold')
    ax.set_xlim(-0.5, env.grid_size - 0.5)
    ax.set_ylim(-0.5, env.grid_size - 0.5)
    ax.set_xticks(range(env.grid_size))
    ax.set_yticks(range(env.grid_size))
    ax.set_title('Learned Optimal Policy', fontsize=16)
    ax.invert_yaxis()  # Put row 0 at the top to match the grid convention
    plt.show()
# Visualize the learned policy
visualize_policy(agent)
Exploration vs Exploitation
One of the key challenges in RL is balancing exploration (trying new things) and exploitation (using what we know).
ε-Greedy Strategy
With probability ε: explore (random action)
With probability 1-ε: exploit (best action)
ε usually decays over time
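The decay bullet corresponds to the multiplicative schedule with a floor that QLearningAgent.train uses (eps ← max(0.01, 0.995 · eps)); run in isolation, it behaves like this:

```python
# Multiplicative epsilon decay with a floor of 0.01
eps = 1.0
history = []
for episode in range(1000):
    history.append(eps)
    eps = max(0.01, eps * 0.995)

print(round(history[0], 3), round(history[100], 3), round(history[999], 3))
# Starts at 1.0, decays to roughly 0.606 after 100 episodes, and is pinned
# at the 0.01 floor by around episode 920.
```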
Other Strategies
Softmax/Boltzmann Exploration: Choose actions probabilistically based on Q-values
Upper Confidence Bound (UCB): Prefer actions with high uncertainty
Thompson Sampling: Sample from posterior distribution
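As a sketch of the first alternative, softmax (Boltzmann) exploration samples each action with probability proportional to exp(Q/τ), where the temperature τ interpolates between greedy (τ → 0) and uniform (τ → ∞) behavior; the function and variable names below are illustrative:

```python
import math
import random

def softmax_action(q_values, temperature=1.0):
    """Sample an action index with probability proportional to exp(Q / T)."""
    # Subtract the max before exponentiating for numerical stability
    m = max(q_values)
    prefs = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(prefs)
    probs = [p / total for p in prefs]
    return random.choices(range(len(q_values)), weights=probs, k=1)[0]

# A low temperature makes selection nearly greedy
random.seed(0)
counts = [0, 0, 0]
for _ in range(1000):
    counts[softmax_action([0.2, 0.8, 0.4], temperature=0.1)] += 1
print(counts)  # the middle (highest-Q) action dominates
```

Unlike ε-greedy, softmax explores promising actions more often than clearly bad ones, rather than spreading exploration uniformly.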
def compare_exploration_strategies():
    """Compare different exploration strategies"""
    strategies = [
        {'name': 'ε=0.1 (fixed)', 'epsilon': 0.1, 'decay': False},
        {'name': 'ε=0.5 (fixed)', 'epsilon': 0.5, 'decay': False},
        {'name': 'ε=1.0→0.01 (decay)', 'epsilon': 1.0, 'decay': True}
    ]
    results = {}
    for strategy in strategies:
        print(f"Training with {strategy['name']}...")
        # Create agent
        agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=strategy['epsilon'])
        # Train manually: agent.train() always decays epsilon, so run episodes
        # here and only decay when the strategy calls for it
        for _ in range(500):
            reward, length = agent.train_episode()
            agent.episode_rewards.append(reward)
            agent.episode_lengths.append(length)
            if strategy['decay']:
                agent.epsilon = max(0.01, agent.epsilon * 0.995)
        # Store results
        results[strategy['name']] = {
            'rewards': agent.episode_rewards,
            'final_avg': np.mean(agent.episode_rewards[-100:])
        }
    # Plot comparison
    plt.figure(figsize=(12, 6))
    for name, data in results.items():
        plt.plot(data['rewards'], label=f"{name} (final: {data['final_avg']:.1f})", alpha=0.7)
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Comparison of Exploration Strategies')
    plt.legend()
    plt.grid(True, alpha=0.3)
    plt.show()
# Compare exploration strategies
compare_exploration_strategies()
🎮 SARSA vs Q-Learning
Q-Learning (Off-Policy)
Learns optimal policy regardless of behavior policy
Uses max over next actions: Q(s,a) ← Q(s,a) + α[R + γ max_a' Q(s',a') - Q(s,a)]
More aggressive exploration
SARSA (On-Policy)
Learns the value of the policy being followed
Uses actual next action: Q(s,a) ← Q(s,a) + α[R + γ Q(s',a') - Q(s,a)]
Safer, more conservative
When to Use Which?
Q-Learning: When you want optimal policy, can afford exploration
SARSA: When learning online with real consequences
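The difference between the two update targets can be computed on a single hypothetical transition. With the illustrative numbers below, an exploratory move to a low-value action drags the SARSA target down, while the Q-learning target ignores it:

```python
gamma, reward = 0.9, -1.0

# Q-values at the next state (illustrative), and the action epsilon-greedy
# actually sampled there (a non-greedy, exploratory choice)
q_next = {'up': 0.5, 'down': 2.0, 'left': 0.1, 'right': 0.3}
sampled_next_action = 'left'

q_learning_target = reward + gamma * max(q_next.values())    # -1 + 0.9*2.0 = 0.8
sarsa_target = reward + gamma * q_next[sampled_next_action]  # -1 + 0.9*0.1 = -0.91

print(round(q_learning_target, 2), round(sarsa_target, 2))  # 0.8 -0.91
```

Because SARSA's target reflects the exploration the agent actually performs, it learns to avoid states where exploratory mistakes are costly, which is why it tends to be the safer choice online.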
class SARSAAgent(QLearningAgent):
    """SARSA agent (on-policy learning)"""

    def update_q(self, state: Tuple[int, int], action: str, reward: float,
                 next_state: Tuple[int, int], next_action: Optional[str] = None):
        """Update Q-value using SARSA update rule"""
        # Current Q-value
        current_q = self.Q[state][action]
        # Q-value for next state-action pair
        if self.env.is_terminal(next_state):
            next_q = 0.0
        else:
            if next_action is None:
                next_action = self.get_action(next_state, explore=False)  # Greedy fallback
            next_q = self.Q[next_state][next_action]
        # SARSA update
        target = reward + self.gamma * next_q
        self.Q[state][action] = current_q + self.alpha * (target - current_q)

    def train_episode(self) -> Tuple[float, int]:
        """Run one SARSA training episode"""
        state = self.env.start
        action = self.get_action(state, explore=True)  # Choose first action
        total_reward = 0.0
        steps = 0
        while not self.env.is_terminal(state) and steps < 100:
            # Take action
            next_state = self.env.get_next_state(state, action)
            reward = self.env.get_reward(state, action, next_state)
            # Choose next action
            next_action = self.get_action(next_state, explore=True)
            # Update Q-value (SARSA)
            self.update_q(state, action, reward, next_state, next_action)
            # Update state and action
            state = next_state
            action = next_action
            total_reward += reward
            steps += 1
        return total_reward, steps
# Compare Q-Learning and SARSA
print("Comparing Q-Learning vs SARSA...")
q_agent = QLearningAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)
q_agent.train(num_episodes=500)
sarsa_agent = SARSAAgent(env, alpha=0.1, gamma=0.9, epsilon=0.1)
sarsa_agent.train(num_episodes=500)
# Plot comparison
plt.figure(figsize=(12, 6))
plt.plot(q_agent.episode_rewards, label=f'Q-Learning (final: {np.mean(q_agent.episode_rewards[-100:]):.1f})', alpha=0.7)
plt.plot(sarsa_agent.episode_rewards, label=f'SARSA (final: {np.mean(sarsa_agent.episode_rewards[-100:]):.1f})', alpha=0.7)
plt.xlabel('Episode')
plt.ylabel('Total Reward')
plt.title('Q-Learning vs SARSA Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
🧠 Key Takeaways
Q-Learning is model-free: Learns directly from experience without knowing environment dynamics
Temporal Difference learning: Updates estimates using other estimates, enabling online learning
Off-policy learning: Learns optimal policy while following exploratory policy
Exploration vs exploitation: ε-greedy balances trying new actions vs using known good actions
Convergence: Q-learning converges to the optimal Q-values provided every state-action pair is visited infinitely often and the learning rate decays according to the usual stochastic-approximation conditions
What's Next?
Now that you understand Q-learning, you're ready for:
Deep Q-Networks (DQN): Using neural networks for Q-value approximation
Policy Gradient Methods: Learning policies directly
Actor-Critic Methods: Combining value and policy learning
Multi-Agent Reinforcement Learning: Multiple agents learning together
Further Reading
Watkins & Dayan (1992), "Q-learning", Machine Learning 8
Sutton & Barto, Reinforcement Learning: An Introduction, Chapter 6 (Temporal-Difference Learning)
Exercises
Implement SARSA from scratch and compare with Q-learning
Experiment with different exploration strategies (softmax, UCB)
Add function approximation using a simple neural network
Solve a different environment (e.g., Taxi-v3 from Gymnasium)
💡 Discussion Questions
Why is Q-learning considered "off-policy"?
How does the exploration rate affect learning?
What are the advantages of TD learning over Monte Carlo methods?
How might you handle continuous state spaces with Q-learning?