01: Markov Decision Processes (MDPs)

“All models are wrong, but some are useful.” – George Box

Welcome to the fascinating world of Reinforcement Learning! In this notebook, we’ll explore the mathematical foundation that makes RL possible: Markov Decision Processes (MDPs).

🎯 Learning Objectives

By the end of this notebook, you’ll understand:

  • What makes a problem suitable for reinforcement learning

  • The core components of Markov Decision Processes

  • How to formalize real-world problems as MDPs

  • The fundamental equations that govern optimal behavior

📚 What is Reinforcement Learning?

Reinforcement Learning (RL) is a type of machine learning where:

  • An agent learns through interaction with an environment

  • The agent receives feedback in the form of rewards or punishments

  • The goal is to learn a policy that maximizes cumulative rewards

Unlike supervised learning (which requires labeled examples) or unsupervised learning (which finds patterns in data), RL learns through trial and error.

🎮 The RL Framework

Let’s break down the key components:

1. Agent

  • The learner and decision-maker

  • Examples: Robot, game player, stock trader, recommendation system

2. Environment

  • Everything the agent interacts with

  • Examples: Game world, physical space, financial market, user preferences

3. Actions

  • Choices available to the agent

  • Examples: Move left/right, buy/sell stock, recommend product

4. States

  • Complete description of the current situation

  • Examples: Game board position, stock prices, user browsing history

5. Rewards

  • Feedback signal from the environment

  • Examples: Points scored, profit made, user satisfaction

🔄 The RL Loop

The interaction between agent and environment follows this cycle:

1. Agent observes current state (S_t)
2. Agent chooses action (A_t) based on policy
3. Environment responds with reward (R_{t+1}) and new state (S_{t+1})
4. Agent learns from experience and updates policy
5. Repeat...

This creates a trajectory or episode of experience.
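The loop above can be sketched with a toy two-state environment. Everything here, including the `ToyEnv` class, is made up purely for illustration:

```python
import random

class ToyEnv:
    """Minimal environment: two states; the chosen action becomes the next state."""
    def __init__(self):
        self.state = 0
    def step(self, action):
        next_state = action                        # action 0 or 1 selects the next state
        reward = 1.0 if next_state == 1 else 0.0   # state 1 pays a reward
        self.state = next_state
        return next_state, reward

random.seed(0)
env = ToyEnv()
state = env.state
trajectory = []                                    # list of (S_t, A_t, R_{t+1}, S_{t+1})
for t in range(5):
    action = random.choice([0, 1])                 # a uniformly random policy
    next_state, reward = env.step(action)
    trajectory.append((state, action, reward, next_state))
    state = next_state                             # observe the new state and repeat
print(trajectory)
```

Each tuple records one pass around the loop; a learning agent would use exactly these tuples of experience to update its policy.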

🎯 Markov Decision Processes

A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.

Formal Definition

An MDP is defined by the tuple (S, A, P, R, γ):

  • S: Set of states (finite or infinite)

  • A: Set of actions (finite or infinite)

  • P(s'|s,a): Probability of transitioning from state s to s' when taking action a

  • R(s,a) or R(s,a,s'): Reward function

  • γ: Discount factor (0 ≤ γ ≤ 1)

📊 The Markov Property

The Markov property states that:

“The future is independent of the past given the present.”

Mathematically:

P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, ..., S_0, A_0) = P(S_{t+1} | S_t, A_t)

This means the current state contains all relevant information about the past for predicting the future.
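One way to see the Markov property concretely is in how a finite MDP's transition model can be represented: a table indexed only by the current state and action, with no history anywhere. The state and action names and the probabilities below are made up for illustration:

```python
# Transition model P[s][a]: a distribution over next states,
# indexed by current state and action only -- no history appears in the keys.
P = {
    'sunny': {'stay': {'sunny': 0.9, 'rainy': 0.1},
              'move': {'sunny': 0.4, 'rainy': 0.6}},
    'rainy': {'stay': {'sunny': 0.2, 'rainy': 0.8},
              'move': {'sunny': 0.7, 'rainy': 0.3}},
}

# Sanity check: every (state, action) row is a valid probability distribution
for s, actions in P.items():
    for a, dist in actions.items():
        assert abs(sum(dist.values()) - 1.0) < 1e-12
```

If the dynamics depended on earlier states as well, no table of this shape could represent them, which is exactly what the Markov property rules out.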

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')

# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [10, 6]

๐Ÿ† Example: Grid Worldยถ

Letโ€™s create a simple grid world MDP to illustrate these concepts:

+---+---+---+---+
| S |   |   | G |
+---+---+---+---+
|   | X |   |   |
+---+---+---+---+
|   |   |   |   |
+---+---+---+---+
|   |   |   |   |
+---+---+---+---+
  • S: Start position

  • G: Goal (terminal state)

  • X: Obstacle

  • Empty cells: Normal positions

Rewards:

  • Goal: +10

  • Obstacle: -5

  • Each step: -1

Actions: Up, Down, Left, Right

class GridWorld:
    """Simple 4x4 grid world MDP"""
    
    def __init__(self):
        self.grid_size = 4
        self.states = [(i, j) for i in range(self.grid_size) for j in range(self.grid_size)]
        self.actions = ['up', 'down', 'left', 'right']
        
        # Define special positions
        self.start = (0, 0)
        self.goal = (0, 3)
        self.obstacle = (1, 1)
        
        # Terminal states
        self.terminals = [self.goal, self.obstacle]
        
    def get_next_state(self, state: Tuple[int, int], action: str) -> Tuple[int, int]:
        """Get next state given current state and action"""
        i, j = state
        
        if action == 'up':
            next_i, next_j = max(0, i-1), j
        elif action == 'down':
            next_i, next_j = min(self.grid_size-1, i+1), j
        elif action == 'left':
            next_i, next_j = i, max(0, j-1)
        elif action == 'right':
            next_i, next_j = i, min(self.grid_size-1, j+1)
        else:
            next_i, next_j = i, j
            
        return (next_i, next_j)
    
    def get_reward(self, state: Tuple[int, int], action: str, next_state: Tuple[int, int]) -> float:
        """Get reward for state-action-next_state transition"""
        if next_state == self.goal:
            return 10.0
        elif next_state == self.obstacle:
            return -5.0
        else:
            return -1.0  # Step cost
    
    def is_terminal(self, state: Tuple[int, int]) -> bool:
        """Check if state is terminal"""
        return state in self.terminals
    
    def visualize_grid(self, values=None):
        """Visualize the grid world"""
        grid = np.zeros((self.grid_size, self.grid_size))
        
        # Mark special positions
        grid[self.start] = 1  # Start
        grid[self.goal] = 2   # Goal
        grid[self.obstacle] = -1  # Obstacle
        
        fig, ax = plt.subplots(figsize=(8, 8))
        
        # Create color map
        cmap = plt.cm.RdYlGn
        if values is not None:
            # Reshape values to grid (row-major, matching env.states ordering)
            value_grid = np.array(values).reshape(self.grid_size, self.grid_size)
            im = ax.imshow(value_grid, cmap=cmap, alpha=0.7)
            plt.colorbar(im, ax=ax, label='State Value')
        else:
            # Without values, show the layout itself so the plot is not empty
            ax.imshow(grid, cmap=cmap, alpha=0.7)
        
        # Draw grid
        for i in range(self.grid_size + 1):
            ax.axhline(i - 0.5, color='black', linewidth=2)
            ax.axvline(i - 0.5, color='black', linewidth=2)
        
        # Add labels
        for i in range(self.grid_size):
            for j in range(self.grid_size):
                if (i, j) == self.start:
                    ax.text(j, i, 'START', ha='center', va='center', fontsize=12, fontweight='bold')
                elif (i, j) == self.goal:
                    ax.text(j, i, 'GOAL\n+10', ha='center', va='center', fontsize=12, fontweight='bold')
                elif (i, j) == self.obstacle:
                    ax.text(j, i, 'OBSTACLE\n-5', ha='center', va='center', fontsize=12, fontweight='bold')
                elif values is not None:
                    ax.text(j, i, f'{values[i*self.grid_size + j]:.1f}', 
                           ha='center', va='center', fontsize=10)
        
        ax.set_xlim(-0.5, self.grid_size - 0.5)
        ax.set_ylim(-0.5, self.grid_size - 0.5)
        ax.set_xticks(range(self.grid_size))
        ax.set_yticks(range(self.grid_size))
        ax.set_title('Grid World MDP', fontsize=16)
        ax.invert_yaxis()
        plt.show()

# Create and visualize the grid world
env = GridWorld()
env.visualize_grid()

💰 Value Functions

The value function tells us how good it is to be in a particular state. There are two main types:

State-Value Function V(s)

Expected return starting from state s, following policy π:

V^π(s) = E[∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, π]

Action-Value Function Q(s,a)

Expected return starting from state s, taking action a, then following policy π:

Q^π(s,a) = E[∑_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a, π]
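Unrolling the sum on a concrete reward sequence makes the discounting visible. The rewards below are made up: three step costs of -1 followed by a goal reward of +10:

```python
gamma = 0.9
rewards = [-1, -1, -1, 10]       # R_{t+1}, R_{t+2}, R_{t+3}, R_{t+4}

# G_t = sum over k of gamma^k * R_{t+k+1}
G = sum(gamma**k * r for k, r in enumerate(rewards))
print(G)                         # -1 - 0.9 - 0.81 + 0.729*10, i.e. about 4.58
```

Note how the +10 contributes only 7.29 from four steps away; with a smaller γ the goal would be discounted even more heavily relative to the immediate step costs.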

🔔 Bellman Equations

The Bellman equations express value functions recursively:

Bellman Equation for V^π

V^π(s) = ∑_a π(a|s) ∑_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]

Bellman Equation for Q^π

Q^π(s,a) = ∑_{s'} P(s'|s,a) [R(s,a,s') + γ ∑_{a'} π(a'|s') Q^π(s',a')]

Bellman Optimality Equation

V*(s) = max_a ∑_{s'} P(s'|s,a) [R(s,a,s') + γ V*(s')]

Q*(s,a) = ∑_{s'} P(s'|s,a) [R(s,a,s') + γ max_{a'} Q*(s',a')]

def value_iteration(env: GridWorld, gamma: float = 0.9, theta: float = 1e-6, max_iterations: int = 1000):
    """Solve MDP using value iteration"""
    
    # Initialize value function
    V = {state: 0.0 for state in env.states}
    
    for iteration in range(max_iterations):
        delta = 0
        
        for state in env.states:
            if env.is_terminal(state):
                continue  # Terminal states have value 0
            
            v_old = V[state]
            
            # Compute value for each action
            action_values = []
            for action in env.actions:
                next_state = env.get_next_state(state, action)
                reward = env.get_reward(state, action, next_state)
                action_values.append(reward + gamma * V[next_state])
            
            # Take maximum over actions
            V[state] = max(action_values)
            delta = max(delta, abs(v_old - V[state]))
        
        if delta < theta:
            break
    
    return V

# Solve the grid world using value iteration
optimal_values = value_iteration(env)

# Extract values in order for visualization
values_list = [optimal_values[state] for state in env.states]

# Visualize optimal values
env.visualize_grid(values_list)

📈 Policy Evaluation

Given a policy π, we can compute its value function using policy evaluation:

For each state s: V^π(s) ← ∑_a π(a|s) ∑_{s'} P(s'|s,a) [R(s,a,s') + γ V^π(s')]

This is repeated until convergence.

def policy_evaluation(env: GridWorld, policy: Dict[Tuple[int, int], str], 
                     gamma: float = 0.9, theta: float = 1e-6) -> Dict[Tuple[int, int], float]:
    """Evaluate a policy by computing its value function"""
    
    V = {state: 0.0 for state in env.states}
    
    while True:
        delta = 0
        
        for state in env.states:
            if env.is_terminal(state):
                continue
            
            v_old = V[state]
            
            # Get action from policy
            action = policy.get(state, 'right')  # Default action
            next_state = env.get_next_state(state, action)
            reward = env.get_reward(state, action, next_state)
            
            V[state] = reward + gamma * V[next_state]
            delta = max(delta, abs(v_old - V[state]))
        
        if delta < theta:
            break
    
    return V

# Define a simple policy: always go right
simple_policy = {}
for state in env.states:
    if not env.is_terminal(state):
        simple_policy[state] = 'right'

# Evaluate the policy
policy_values = policy_evaluation(env, simple_policy)
policy_values_list = [policy_values[state] for state in env.states]

print("Simple Policy (Always Right) Values:")
env.visualize_grid(policy_values_list)

🎯 Optimal Policies

An optimal policy π* satisfies:

  • V^π*(s) ≥ V^π(s) for all policies π and states s

  • Q^π*(s,a) ≥ Q^π(s,a) for all policies π, states s, and actions a

For any finite MDP, there exists at least one optimal policy that is deterministic.
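That deterministic optimal policy can be read off from the optimal values by acting greedily: in each state, pick the action with the highest one-step lookahead value. Here is a self-contained sketch on a tiny two-state, two-action MDP with deterministic transitions (all numbers invented for illustration):

```python
gamma = 0.9
# Deterministic dynamics: taking action a always lands in state a
next_state = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}
reward     = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}

# Value iteration to near-convergence (contraction factor gamma per sweep)
V = {0: 0.0, 1: 0.0}
for _ in range(500):
    V = {s: max(reward[s][a] + gamma * V[next_state[s][a]] for a in (0, 1))
         for s in (0, 1)}

# Deterministic greedy policy: argmax over the one-step lookahead Q*(s,a)
policy = {s: max((0, 1), key=lambda a: reward[s][a] + gamma * V[next_state[s][a]])
          for s in (0, 1)}
print(policy)   # {0: 1, 1: 1}: head to state 1 and stay there
```

Staying in state 1 earns 2 forever, so V*(1) = 2/(1-γ) = 20 and V*(0) = 1 + γ·20 = 19; the greedy argmax picks action 1 in both states, confirming that a deterministic policy suffices here.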

🧠 Key Takeaways

  1. MDPs provide a mathematical framework for modeling decision-making problems

  2. The Markov property ensures we only need the current state to make decisions

  3. Value functions quantify how good different states and actions are

  4. Bellman equations provide recursive relationships for computing optimal behavior

  5. Value iteration can find optimal policies by iteratively improving value estimates

🚀 What’s Next?

Now that you understand MDPs, you’re ready to learn:

  • Value Iteration & Policy Iteration (exact methods)

  • Monte Carlo Methods (model-free learning)

  • Temporal Difference Learning (Q-learning, SARSA)

๐Ÿ‹๏ธ Exercisesยถ

  1. Modify the grid world to have different rewards or obstacles

  2. Implement policy iteration from scratch

  3. Compare different discount factors (γ) and their effects

  4. Add stochastic transitions (actions don’t always succeed)

💡 Discussion Questions

  • Why is the Markov property important for RL?

  • How does the discount factor γ affect learning?

  • What are the limitations of the MDP framework?

  • How might you model a real-world problem as an MDP?