01: Markov Decision Processes (MDPs)
"All models are wrong, but some are useful." - George Box
Welcome to the fascinating world of Reinforcement Learning! In this notebook, we'll explore the mathematical foundation that makes RL possible: Markov Decision Processes (MDPs).
Learning Objectives
By the end of this notebook, you'll understand:
What makes a problem suitable for reinforcement learning
The core components of Markov Decision Processes
How to formalize real-world problems as MDPs
The fundamental equations that govern optimal behavior
What is Reinforcement Learning?
Reinforcement Learning (RL) is a type of machine learning where:
An agent learns through interaction with an environment
The agent receives feedback in the form of rewards or punishments
The goal is to learn a policy that maximizes cumulative rewards
Unlike supervised learning (which requires labeled examples) or unsupervised learning (which finds patterns in data), RL learns through trial and error.
The RL Framework
Let's break down the key components:
1. Agent
The learner and decision-maker
Examples: Robot, game player, stock trader, recommendation system
2. Environment
Everything the agent interacts with
Examples: Game world, physical space, financial market, user preferences
3. Actions
Choices available to the agent
Examples: Move left/right, buy/sell stock, recommend product
4. States
Complete description of the current situation
Examples: Game board position, stock prices, user browsing history
5. Rewards
Feedback signal from the environment
Examples: Points scored, profit made, user satisfaction
The RL Loop
The interaction between agent and environment follows this cycle:
1. Agent observes current state (S_t)
2. Agent chooses action (A_t) based on policy
3. Environment responds with reward (R_{t+1}) and new state (S_{t+1})
4. Agent learns from experience and updates policy
5. Repeat...
This creates a trajectory or episode of experience.
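The steps above can be sketched in a few lines of Python. The environment and policy here are toy stand-ins (a hypothetical 1-D walk that terminates at position 3), not the grid world built later in this notebook:

```python
import random

def step(state, action):
    """Environment: map (S_t, A_t) to (S_{t+1}, R_{t+1}, done)."""
    next_state = state + action  # action is -1 or +1
    reward = 10.0 if next_state == 3 else -1.0
    return next_state, reward, next_state == 3

def random_policy(state):
    """Agent: choose an action (here, ignoring the state)."""
    return random.choice([-1, +1])

state, total_reward, done = 0, 0.0, False
for _ in range(1000):  # cap episode length
    if done:
        break
    action = random_policy(state)              # 2. choose action A_t
    state, reward, done = step(state, action)  # 3. observe R_{t+1}, S_{t+1}
    total_reward += reward                     # 4. a learner would update here
print(f"episode return: {total_reward:.1f}")
```

Swapping `random_policy` for a learned policy, and adding an update rule at step 4, is exactly what the algorithms later in this series do.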
Markov Decision Processes
A Markov Decision Process (MDP) is a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
Formal Definition
An MDP is defined by the tuple (S, A, P, R, γ):
S: Set of states (finite or infinite)
A: Set of actions (finite or infinite)
P(s'|s,a): Probability of transitioning from state s to s' when taking action a
R(s,a) or R(s,a,s'): Reward function
γ: Discount factor (0 ≤ γ ≤ 1)
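To make the tuple concrete, here is a sketch of a small invented two-state "machine maintenance" MDP encoded as plain dictionaries (all state and action names are assumptions for illustration):

```python
# Hypothetical two-state, two-action MDP written out as the tuple (S, A, P, R, gamma).
S = ["healthy", "broken"]
A = ["use", "repair"]

# P[(s, a)] maps next state s' -> probability P(s'|s, a)
P = {
    ("healthy", "use"):    {"healthy": 0.9, "broken": 0.1},
    ("healthy", "repair"): {"healthy": 1.0},
    ("broken",  "use"):    {"broken": 1.0},
    ("broken",  "repair"): {"healthy": 0.8, "broken": 0.2},
}

# R[(s, a)] gives the expected reward for taking action a in state s
R = {
    ("healthy", "use"): 1.0, ("healthy", "repair"): -1.0,
    ("broken", "use"): -2.0, ("broken", "repair"): -1.0,
}

gamma = 0.9  # discount factor

# Sanity check: each transition distribution sums to 1
for dist in P.values():
    assert abs(sum(dist.values()) - 1.0) < 1e-9
```

Any finite MDP can be tabulated this way; the grid world below is just a larger example with deterministic transitions.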
The Markov Property
The Markov property states that:
"The future is independent of the past given the present."
Mathematically:
P(S_{t+1} | S_t, A_t, S_{t-1}, A_{t-1}, ..., S_0, A_0) = P(S_{t+1} | S_t, A_t)
This means the current state contains all relevant information about the past for predicting the future.
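This can be checked empirically on a toy two-state Markov chain (an invented example, using only the standard library): the estimated distribution of S_{t+1} given S_t = 1 should come out the same whether S_{t-1} was 0 or 1.

```python
import random

random.seed(0)

# P[s] = [P(S_{t+1}=0 | S_t=s), P(S_{t+1}=1 | S_t=s)] for states {0, 1}
P = {0: [0.7, 0.3], 1: [0.4, 0.6]}

def next_state(s):
    return 1 if random.random() < P[s][1] else 0

# Simulate a long trajectory
traj = [0]
for _ in range(200_000):
    traj.append(next_state(traj[-1]))

# Estimate P(S_{t+1}=1 | S_t=1, S_{t-1}=0) vs P(S_{t+1}=1 | S_t=1, S_{t-1}=1)
from_01 = [traj[t + 1] for t in range(1, len(traj) - 1) if traj[t - 1] == 0 and traj[t] == 1]
from_11 = [traj[t + 1] for t in range(1, len(traj) - 1) if traj[t - 1] == 1 and traj[t] == 1]
print(sum(from_01) / len(from_01), sum(from_11) / len(from_11))  # both near 0.6
```

Both conditional estimates converge to P(S_{t+1}=1 | S_t=1) = 0.6: knowing S_{t-1} adds nothing once S_t is known.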
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from typing import Dict, List, Tuple
import warnings
warnings.filterwarnings('ignore')
# Set up plotting style
plt.style.use('default')
sns.set_palette("husl")
plt.rcParams['figure.figsize'] = [10, 6]
Example: Grid World
Let's create a simple grid world MDP to illustrate these concepts:
+---+---+---+---+
| S | | | G |
+---+---+---+---+
| | X | | |
+---+---+---+---+
| | | | |
+---+---+---+---+
| | | | |
+---+---+---+---+
S: Start position
G: Goal (terminal state)
X: Obstacle
Empty cells: Normal positions
Rewards:
Goal: +10
Obstacle: -5
Each step: -1
Actions: Up, Down, Left, Right
class GridWorld:
    """Simple 4x4 grid world MDP"""

    def __init__(self):
        self.grid_size = 4
        self.states = [(i, j) for i in range(self.grid_size) for j in range(self.grid_size)]
        self.actions = ['up', 'down', 'left', 'right']
        # Define special positions
        self.start = (0, 0)
        self.goal = (0, 3)
        self.obstacle = (1, 1)
        # Terminal states
        self.terminals = [self.goal, self.obstacle]

    def get_next_state(self, state: Tuple[int, int], action: str) -> Tuple[int, int]:
        """Get next state given current state and action"""
        i, j = state
        if action == 'up':
            next_i, next_j = max(0, i - 1), j
        elif action == 'down':
            next_i, next_j = min(self.grid_size - 1, i + 1), j
        elif action == 'left':
            next_i, next_j = i, max(0, j - 1)
        elif action == 'right':
            next_i, next_j = i, min(self.grid_size - 1, j + 1)
        else:
            next_i, next_j = i, j
        return (next_i, next_j)

    def get_reward(self, state: Tuple[int, int], action: str, next_state: Tuple[int, int]) -> float:
        """Get reward for state-action-next_state transition"""
        if next_state == self.goal:
            return 10.0
        elif next_state == self.obstacle:
            return -5.0
        else:
            return -1.0  # Step cost

    def is_terminal(self, state: Tuple[int, int]) -> bool:
        """Check if state is terminal"""
        return state in self.terminals

    def visualize_grid(self, values=None):
        """Visualize the grid world"""
        grid = np.zeros((self.grid_size, self.grid_size))
        # Mark special positions
        grid[self.start] = 1      # Start
        grid[self.goal] = 2       # Goal
        grid[self.obstacle] = -1  # Obstacle

        fig, ax = plt.subplots(figsize=(8, 8))
        # Create color map
        cmap = plt.cm.RdYlGn
        if values is not None:
            # Reshape values to grid
            value_grid = np.array(values).reshape(self.grid_size, self.grid_size)
            im = ax.imshow(value_grid, cmap=cmap, alpha=0.7)
            plt.colorbar(im, ax=ax, label='State Value')

        # Draw grid lines
        for i in range(self.grid_size + 1):
            ax.axhline(i - 0.5, color='black', linewidth=2)
            ax.axvline(i - 0.5, color='black', linewidth=2)

        # Add labels
        for i in range(self.grid_size):
            for j in range(self.grid_size):
                if (i, j) == self.start:
                    ax.text(j, i, 'START', ha='center', va='center', fontsize=12, fontweight='bold')
                elif (i, j) == self.goal:
                    ax.text(j, i, 'GOAL\n+10', ha='center', va='center', fontsize=12, fontweight='bold')
                elif (i, j) == self.obstacle:
                    ax.text(j, i, 'OBSTACLE\n-5', ha='center', va='center', fontsize=12, fontweight='bold')
                elif values is not None:
                    ax.text(j, i, f'{values[i * self.grid_size + j]:.1f}',
                            ha='center', va='center', fontsize=10)

        ax.set_xlim(-0.5, self.grid_size - 0.5)
        ax.set_ylim(-0.5, self.grid_size - 0.5)
        ax.set_xticks(range(self.grid_size))
        ax.set_yticks(range(self.grid_size))
        ax.set_title('Grid World MDP', fontsize=16)
        plt.gca().invert_yaxis()
        plt.show()
# Create and visualize the grid world
env = GridWorld()
env.visualize_grid()
Value Functions
The value function tells us how good it is to be in a particular state. There are two main types:
State-Value Function V(s)
Expected return starting from state s, following policy π:
V^π(s) = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]
Action-Value Function Q(s,a)
Expected return starting from state s, taking action a, then following policy π:
Q^π(s,a) = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]
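The expectation above is over random trajectories, but for a single observed reward sequence the discounted return is just a weighted sum. A minimal sketch:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Three steps at -1 each, then +10 at the goal, with gamma = 0.9:
# G = -1 - 0.9 - 0.81 + 0.729 * 10
G = discounted_return([-1, -1, -1, 10], gamma=0.9)
print(G)  # approx. 4.58
```

Note how γ trades off immediate against delayed reward: with γ = 0 only the first -1 would count, while with γ = 1 the return would be the plain sum, 7.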
Bellman Equations
The Bellman equations express value functions recursively:
Bellman Equation for V^π
V^π(s) = Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
Bellman Equation for Q^π
Q^π(s,a) = Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ Σ_{a'} π(a'|s') Q^π(s',a') ]
Bellman Optimality Equations
V*(s) = max_a Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V*(s') ]
Q*(s,a) = Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ max_{a'} Q*(s',a') ]
def value_iteration(env: GridWorld, gamma: float = 0.9, theta: float = 1e-6, max_iterations: int = 1000):
    """Solve MDP using value iteration"""
    # Initialize value function
    V = {state: 0.0 for state in env.states}

    for iteration in range(max_iterations):
        delta = 0
        for state in env.states:
            if env.is_terminal(state):
                continue  # Terminal states keep value 0
            v_old = V[state]
            # Compute value for each action
            action_values = []
            for action in env.actions:
                next_state = env.get_next_state(state, action)
                reward = env.get_reward(state, action, next_state)
                action_values.append(reward + gamma * V[next_state])
            # Take maximum over actions
            V[state] = max(action_values)
            delta = max(delta, abs(v_old - V[state]))
        if delta < theta:
            break
    return V
# Solve the grid world using value iteration
optimal_values = value_iteration(env)
# Extract values in order for visualization
values_list = [optimal_values[state] for state in env.states]
# Visualize optimal values
env.visualize_grid(values_list)
Policy Evaluation
Given a policy π, we can compute its value function using policy evaluation:
For each state s: V^π(s) ← Σ_a π(a|s) Σ_{s'} P(s'|s,a) [ R(s,a,s') + γ V^π(s') ]
This update is repeated until convergence.
def policy_evaluation(env: GridWorld, policy: Dict[Tuple[int, int], str],
                      gamma: float = 0.9, theta: float = 1e-6) -> Dict[Tuple[int, int], float]:
    """Evaluate a policy by computing its value function"""
    V = {state: 0.0 for state in env.states}
    while True:
        delta = 0
        for state in env.states:
            if env.is_terminal(state):
                continue
            v_old = V[state]
            # Get action from policy
            action = policy.get(state, 'right')  # Default action
            next_state = env.get_next_state(state, action)
            reward = env.get_reward(state, action, next_state)
            V[state] = reward + gamma * V[next_state]
            delta = max(delta, abs(v_old - V[state]))
        if delta < theta:
            break
    return V
# Define a simple policy: always go right
simple_policy = {}
for state in env.states:
    if not env.is_terminal(state):
        simple_policy[state] = 'right'
# Evaluate the policy
policy_values = policy_evaluation(env, simple_policy)
policy_values_list = [policy_values[state] for state in env.states]
print("Simple Policy (Always Right) Values:")
env.visualize_grid(policy_values_list)
Optimal Policies
An optimal policy π* satisfies:
V^π*(s) ≥ V^π(s) for all policies π and states s
Q^π*(s,a) ≥ Q^π(s,a) for all policies π, states s, and actions a
For any finite MDP, there exists at least one optimal policy that is deterministic.
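Given optimal values V*, such a deterministic policy can be read off greedily: in each state, pick the action maximizing the one-step lookahead R(s,a,s') + γ V*(s'). A sketch for deterministic transitions, on an invented 3-state chain (the `next_state`/`reward` callables and hand-computed values are assumptions mirroring the grid world's interface):

```python
def greedy_policy(states, actions, next_state, reward, V, gamma=0.9,
                  is_terminal=lambda s: False):
    """Extract pi(s) = argmax_a [R(s,a,s') + gamma * V(s')] (deterministic dynamics)."""
    policy = {}
    for s in states:
        if is_terminal(s):
            continue
        policy[s] = max(actions,
                        key=lambda a: reward(s, a, next_state(s, a)) + gamma * V[next_state(s, a)])
    return policy

# Tiny 3-state chain: states 0-1-2, state 2 is the goal (+10 on entry, -1 otherwise).
states = [0, 1, 2]
actions = ["left", "right"]
ns = lambda s, a: min(2, s + 1) if a == "right" else max(0, s - 1)
rw = lambda s, a, s2: 10.0 if s2 == 2 else -1.0
V = {0: 8.0, 1: 10.0, 2: 0.0}  # hand-computed optimal values for gamma = 0.9

pi = greedy_policy(states, actions, ns, rw, V, gamma=0.9, is_terminal=lambda s: s == 2)
print(pi)  # {0: 'right', 1: 'right'}
```

The same extraction applied to `optimal_values` from the grid world above would yield an optimal grid-world policy.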
Key Takeaways
MDPs provide a mathematical framework for modeling decision-making problems
The Markov property ensures we only need the current state to make decisions
Value functions quantify how good different states and actions are
Bellman equations provide recursive relationships for computing optimal behavior
Value iteration can find optimal policies by iteratively improving value estimates
What's Next?
Now that you understand MDPs, you're ready to learn:
Value Iteration & Policy Iteration (exact methods)
Monte Carlo Methods (model-free learning)
Temporal Difference Learning (Q-learning, SARSA)
Exercises
Modify the grid world to have different rewards or obstacles
Implement policy iteration from scratch
Compare different discount factors (ฮณ) and their effects
Add stochastic transitions (actions don't always succeed)
Discussion Questions
Why is the Markov property important for RL?
How does the discount factor γ affect learning?
What are the limitations of the MDP framework?
How might you model a real-world problem as an MDP?