01: Markov Decision Processes (MDPs)

“All models are wrong, but some are useful.” – George Box

Welcome to the fascinating world of Reinforcement Learning! In this notebook, we’ll explore the mathematical foundation that makes RL possible: Markov Decision Processes (MDPs).

🎯 Learning Objectives

By the end of this notebook, you’ll understand:

  • What makes a problem suitable for reinforcement learning

  • The core components of Markov Decision Processes

  • How to formalize real-world problems as MDPs

  • The fundamental equations that govern optimal behavior

📚 What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of machine learning where:

  • An agent learns through interaction with an environment

  • The agent receives feedback in the form of rewards or punishments

  • The goal is to learn a policy that maximizes cumulative rewards

Unlike supervised learning (which requires labeled examples) or unsupervised learning (which finds patterns in data), RL learns through trial and error.

🎮 The RL Framework

Let’s break down the key components:

1. Agent

  • The learner and decision-maker

  • Examples: Robot, game player, stock trader, recommendation system

2. Environment

  • Everything the agent interacts with

  • Examples: Game world, physical space, financial market, user preferences

3. Actions

  • Choices available to the agent

  • Examples: Move left/right, buy/sell stock, recommend product

4. States

  • Complete description of the current situation

  • Examples: Game board position, stock prices, user browsing history

5. Rewards

  • Feedback signal from the environment

  • Examples: Points scored, profit made, user satisfaction

🔄 The RL Loop

The interaction between agent and environment follows this cycle:

1. Agent observes current state (S_t)
2. Agent chooses action (A_t) based on policy
3. Environment responds with reward (R_{t+1}) and new state (S_{t+1})
4. Agent learns from experience and updates policy
5. Repeat...

This creates a trajectory or episode of experience.
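The loop above can be sketched with a toy two-state environment. Everything here, including the `ToyEnv` class, is made up purely for illustration:

```python
import random

class ToyEnv:
    """Minimal environment: two states; the chosen action becomes the next state."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        next_state = action                        # action 0 or 1 selects the next state
        reward = 1.0 if next_state == 1 else 0.0   # state 1 pays a reward
        self.state = next_state
        return next_state, reward

random.seed(0)
env = ToyEnv()
state = env.state
trajectory = []                                    # list of (S_t, A_t, R_{t+1}, S_{t+1})
for t in range(5):
    action = random.choice([0, 1])                 # a uniformly random policy
    next_state, reward = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state                             # observe the new state and repeat
print(trajectory)
```

Each tuple records one pass around the loop; a learning agent would use exactly these tuples of experience to update its policy.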

🎯 Markov Decision Processes

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

Formal Definition

An MDP is defined by the tuple (S, A, P, R, γ):

  • S: Set of states (finite or infinite)

  • A: Set of actions (finite or infinite)

  • P(s'|s,a): Probability of transitioning from state s to s' when taking action a

  • R(s,a) or R(s,a,s'): Reward function

  • γ: Discount factor (0 ≤ γ ≤ 1)

📊 The Markov Property

The Markov property states that:

“The future is independent of the past given the present.”

Mathematically:

P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, ..., S_0, A_0) = P(S_{t+1} | S_t, A_t)

This means the current state contains all relevant information about the past for predicting the future.
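One way to see the Markov property concretely is in how a finite MDP's transition model can be represented: a table indexed only by the current state and action, with no history anywhere. The state and action names and the probabilities below are made up for illustration:

```python
# Transition model P[s][a]: a distribution over next states,
# indexed by current state and action only -- no history appears in the keys.
P = {
    'sunny': {'stay': {'sunny': 0.9, 'rainy': 0.1},
              'move': {'sunny': 0.4, 'rainy': 0.6}},
    'rainy': {'stay': {'sunny': 0.2, 'rainy': 0.8},
              'move': {'sunny': 0.7, 'rainy': 0.3}},
}

# Sanity check: every (state, action) row is a valid probability distribution
for s, actions in P.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-12
```

If the dynamics depended on earlier states as well, no table of this shape could represent them, which is exactly what the Markov property rules out.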

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [10, 6]

๐Ÿ† Example: Grid Worldยถ

Letโ€™s create a simple grid world MDP to illustrate these concepts:

+---+---+---+---+
| S |   |   | G |
+---+---+---+---+
|   | X |   |   |
+---+---+---+---+
|   |   |   |   |
+---+---+---+---+
|   |   |   |   |
+---+---+---+---+
  • S: Start position

  • G: Goal (terminal state)

  • X: Obstacle

  • Empty cells: Normal positions

Rewards:

  • Goal: +10

  • Obstacle: -5

  • Each step: -1

Actions: Up, Down, Left, Right

class GridWorld:
    """Simple 4x4 grid world MDP"""
    
    def __init__(self):
        self.grid_size = 4
        self.states = [(i, j) for i in range(self.grid_size) for j in range(self.grid_size)]
        self.actions = ['up', 'down', 'left', 'right']
        
        # Define special positions
        self.start = (0, 0)
        self.goal = (0, 3)
        self.obstacle = (1, 1)
        
        # Terminal states
        self.terminals = [self.goal, self.obstacle]
        
    def get_next_state(self, state: Tuple[int, int], action: str) -> Tuple[int, int]:
        """Get next state given current state and action"""
        i, j = state
        
        if action == 'up':
            next_i, next_j = max(0, i-1), j
        elif action == 'down':
            next_i, next_j = min(self.grid_size-1, i+1), j
        elif action == 'left':
            next_i, next_j = i, max(0, j-1)
        elif action == 'right':
            next_i, next_j = i, min(self.grid_size-1, j+1)
        else:
            next_i, next_j = i, j
            
        return (next_i, next_j)
    
    def get_reward(self, state: Tuple[int, int], action: str, next_state: Tuple[int, int]) -> float:
        """Get reward for state-action-next_state transition"""
        if next_state == self.goal:
            return 10.0
        elif next_state == self.obstacle:
            return -5.0
        else:
            return -1.0  # Step cost
    
    def is_terminal(self, state: Tuple[int, int]) -> bool:
        """Check if state is terminal"""
        return state in self.terminals
    
    def visualize_grid(self, values=None):
        """Visualize the grid world"""
        grid = np.zeros((self.grid_size, self.grid_size))
        
        # Mark special positions
        grid[self.start] = 1  # Start
        grid[self.goal] = 2   # Goal
        grid[self.obstacle] = -1  # Obstacle
        
        fig, ax = plt.subplots(figsize=(8, 8))
        
        # Create color map
        cmap = plt.cm.RdYlGn
        if values is not None:
            # Reshape values to grid (row-major, matching env.states ordering)
            value_grid = np.array(values).reshape(self.grid_size, self.grid_size)
            im = ax.imshow(value_grid, cmap=cmap, alpha=0.7)
            plt.colorbar(im, ax=ax, label='State Value')
        else:
            # Without values, show the layout itself so the plot is not empty
            ax.imshow(grid, cmap=cmap, alpha=0.7)
        
        # Draw grid
        for i in range(self.grid_size + 1):
            ax.axhline(i - 0.5, color='black', linewidth=2)
            ax.axvline(i - 0.5, color='black', linewidth=2)
        
        # Add labels
        for i in range(self.grid_size):
            for j in range(self.grid_size):
                if (i, j) == self.start:
                    ax.text(j, i, 'START', ha='center', va='center', fontsize=12, fontweight='bold')
                elif (i, j) == self.goal:
                    ax.text(j, i, 'GOAL\n+10', ha='center', va='center', fontsize=12, fontweight='bold')
                elif (i, j) == self.obstacle:
                    ax.text(j, i, 'OBSTACLE\n-5', ha='center', va='center', fontsize=12, fontweight='bold')
                elif values is not None:
                    ax.text(j, i, f'{values[i*self.grid_size + j]:.1f}', 
                           ha='center', va='center', fontsize=10)
        
        ax.set_xlim(-0.5, self.grid_size - 0.5)
        ax.set_ylim(-0.5, self.grid_size - 0.5)
        ax.set_xticks(range(self.grid_size))
        ax.set_yticks(range(self.grid_size))
        ax.set_title('Grid World MDP', fontsize=16)
        ax.invert_yaxis()
        plt.show()

# Create and visualize the grid world
env = GridWorld()
env.visualize_grid()

💰 Value Functions

The value function tells us how good it is to be in a particular state. There are two main types:

State-Value Function V(s)

Expected return starting from state s, following policy π:

V^π(s) = E[∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, π]

Action-Value Function Q(s,a)

Expected return starting from state s, taking action a, then following policy π:

Q^π(s,a) = E[∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a, π]
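Unrolling the sum on a concrete reward sequence makes the discounting visible. The rewards below are made up: three step costs of -1 followed by a goal reward of +10:

```python
gamma = 0.9
rewards = [-1, -1, -1, 10]       # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

# G_t = sum over k of gamma^k * R_{t+k+1}
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)                         # -1 - 0.9 - 0.81 + 0.729*10, i.e. about 4.58
```

Note how the +10 contributes only 7.29 from four steps away; with a smaller γ the goal would be discounted even more heavily relative to the immediate step costs.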

🔔 Bellman Equations

The Bellman equations express value functions recursively:

Bellman Equation for V^π

V^π(s) = ∑_a π(a|s) ∑_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]

Bellman Equation for Q^π

Q^π(s,a) = ∑_{s'} P(s'|s,a) [R(s,a,s') + γ ∑_{a'} π(a'|s') Q^π(s',a')]

Bellman Optimality Equation

V*(s) = max_a ∑_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Q*(s,a) = ∑_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q*(s',a')]

def value_iteration(env: GridWorld, gamma: float = 0.9, theta: float = 1e-6, max_iterations: int = 1000):
    """Solve MDP using value iteration"""
    
    # Initialize value function
    V = {state: 0.0 for state in env.states}
    
    for iteration in range(max_iterations):
        delta = 0
        
        for state in env.states:
            if env.is_terminal(state):
                continue  # Terminal states have value 0
            
            v_old = V[state]
            
            # Compute value for each action
            action_values = []
            for action in env.actions:
                next_state = env.get_next_state(state, action)
                reward = env.get_reward(state, action, next_state)
                action_values.append(reward + gamma * V[next_state])
            
            # Take maximum over actions
            V[state] = max(action_values)
            delta = max(delta, abs(v_old - V[state]))
        
        if delta < theta:
            break
    
    return V

# Solve the grid world using value iteration
optimal_values = value_iteration(env)

# Extract values in order for visualization
values_list = [optimal_values[state] for state in env.states]

# Visualize optimal values
env.visualize_grid(values_list)

📈 Policy Evaluation

Given a policy π, we can compute its value function using policy evaluation:

For each state s: V^π(s) ← ∑_a π(a|s) ∑_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]

This is repeated until convergence.

def policy_evaluation(env: GridWorld, policy: Dict[Tuple[int, int], str], 
                     gamma: float = 0.9, theta: float = 1e-6) -> Dict[Tuple[int, int], float]:
    """Evaluate a policy by computing its value function"""
    
    V = {state: 0.0 for state in env.states}
    
    while True:
        delta = 0
        
        for state in env.states:
            if env.is_terminal(state):
                continue
            
            v_old = V[state]
            
            # Get action from policy
            action = policy.get(state, 'right')  # Default action
            next_state = env.get_next_state(state, action)
            reward = env.get_reward(state, action, next_state)
            
            V[state] = reward + gamma * V[next_state]
            delta = max(delta, abs(v_old - V[state]))
        
        if delta < theta:
            break
    
    return V

# Define a simple policy: always go right
simple_policy = {}
for state in env.states:
    if not env.is_terminal(state):
        simple_policy[state] = 'right'

# Evaluate the policy
policy_values = policy_evaluation(env, simple_policy)
policy_values_list = [policy_values[state] for state in env.states]

print("Simple Policy (Always Right) Values:")
env.visualize_grid(policy_values_list)

🎯 Optimal Policies

An optimal policy π* satisfies:

  • V^π*(s) ≥ V^π(s) for all policies π and states s

  • Q^π*(s,a) ≥ Q^π(s,a) for all policies π, states s, and actions a

For any finite MDP, there exists at least one optimal policy that is deterministic.
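That deterministic optimal policy can be read off from the optimal values by acting greedily: in each state, pick the action with the highest one-step lookahead value. Here is a self-contained sketch on a tiny two-state, two-action MDP with deterministic transitions (all numbers invented for illustration):

```python
gamma = 0.9
# Deterministic dynamics: taking action a always lands in state a
next_state = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
reward     = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

# Value iteration to near-convergence (contraction factor gamma per sweep)
V = {0: 0.0, 1: 0.0}
for _ in range(500):
    V = {s: max(reward[s][a] + gamma * V[next_state[s][a]] for a in (0, 1))
         for s in (0, 1)}

# Deterministic greedy policy: argmax over the one-step lookahead Q*(s,a)
policy = {s: max((0, 1), key=lambda a: reward[s][a] + gamma * V[next_state[s][a]])
          for s in (0, 1)}
print(policy)   # {0: 1, 1: 1}: head to state 1 and stay there
```

Staying in state 1 earns 2 forever, so V*(1) = 2/(1-γ) = 20 and V*(0) = 1 + γ·20 = 19; the greedy argmax picks action 1 in both states, confirming that a deterministic policy suffices here.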

🧠 Key Takeaways

  1. MDPs provide a mathematical framework for modeling decision-making problems

  2. The Markov property ensures we only need the current state to make decisions

  3. Value functions quantify how good different states and actions are

  4. Bellman equations provide recursive relationships for computing optimal behavior

  5. Value iteration can find optimal policies by iteratively improving value estimates

🚀 What’s Next?

Now that you understand MDPs, you’re ready to learn:

  • Value Iteration & Policy Iteration (exact methods)

  • Monte Carlo Methods (model-free learning)

  • Temporal Difference Learning (Q-learning, SARSA)

๐Ÿ‹๏ธ Exercisesยถ

  1. Modify the grid world to have different rewards or obstacles

  2. Implement policy iteration from scratch

  3. Compare different discount factors (γ) and their effects

  4. Add stochastic transitions (actions don’t always succeed)

💡 Discussion Questions

  • Why is the Markov property important for RL?

  • How does the discount factor γ affect learning?

  • What are the limitations of the MDP framework?

  • How might you model a real-world problem as an MDP?