Markov Models & Hidden Markov Models (HMMs)
Source: The Math Behind Artificial Intelligence, Chapter 6, Section: "What Are Markov Models?"
Overview
Markov models are the foundation of sequential reasoning in AI, from speech recognition to financial modeling to modern LLM token prediction.
This notebook covers:
The Markov Property – "memoryless" sequences
Discrete Markov Chains – state transitions
Transition matrices – the math
Hidden Markov Models (HMMs) – when states are latent
HMM algorithms: Forward, Viterbi, Baum-Welch
Full implementation with hmmlearn
Applications in AI
1. The Markov Property
A process has the Markov property if the future depends only on the present state, not on the history:

$$P(X_{t+1} \mid X_t, X_{t-1}, \ldots, X_1) = P(X_{t+1} \mid X_t)$$

This is the "memoryless" (no-memory) property.
Real-world examples:
Weather: tomorrow's weather depends mostly on today's
Chess: next move depends on current board position, not full game history
Language models (1st order): next word depends on current word (bigram models)
Reinforcement learning: next state depends on current state + action
Where Markov fails: stock prices, long-range language dependencies (addressed by Transformers with attention)
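The bigram language model mentioned above is itself a first-order Markov chain over words. A minimal sketch, using a made-up toy corpus (the sentence and the resulting probabilities are illustrative only, not from the book):

```python
from collections import defaultdict

# Toy corpus for a bigram (first-order Markov) model over words
corpus = "the cat sat on the mat the cat ran".split()

# Count word -> next-word transitions
counts = defaultdict(lambda: defaultdict(int))
for w, nxt in zip(corpus, corpus[1:]):
    counts[w][nxt] += 1

# Normalize each row into P(next word | current word)
bigram = {w: {nxt: c / sum(nxts.values()) for nxt, c in nxts.items()}
          for w, nxts in counts.items()}

# After "the", the corpus contains "cat" twice and "mat" once
print(bigram['the'])
```

Because the chain conditions only on the current word, it can never capture agreement across a long sentence, which is exactly the limitation noted above.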
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import FancyArrowPatch
# --- Example 1: Weather Markov Chain ---
# States: Sunny (0), Cloudy (1), Rainy (2)
states = ['Sunny', 'Cloudy', 'Rainy']
# Transition matrix T[i][j] = P(tomorrow=j | today=i)
T = np.array([
    [0.7, 0.2, 0.1],  # Sunny → [Sunny, Cloudy, Rainy]
    [0.3, 0.4, 0.3],  # Cloudy → [Sunny, Cloudy, Rainy]
    [0.2, 0.3, 0.5],  # Rainy → [Sunny, Cloudy, Rainy]
])
print("Weather Markov Chain – Transition Matrix")
print()
print(f"{'':>8}", end='')
for s in states:
print(f"{s:>8}", end='')
print()
for i, row_state in enumerate(states):
print(f"{row_state:>8}", end='')
for p in T[i]:
print(f"{p:>8.1f}", end='')
print(f" (sums to {T[i].sum():.1f})")
# --- Simulate a Markov Chain ---
def simulate_markov_chain(T, initial_state, n_steps, seed=42):
rng = np.random.default_rng(seed)
state = initial_state
path = [state]
for _ in range(n_steps):
state = rng.choice(len(T), p=T[state])
path.append(state)
return path
path = simulate_markov_chain(T, initial_state=0, n_steps=30)
state_names = [states[s] for s in path]
colors = {'Sunny': 'gold', 'Cloudy': 'lightgray', 'Rainy': 'steelblue'}
fig, ax = plt.subplots(figsize=(14, 3))
for i, (s, name) in enumerate(zip(path, state_names)):
ax.bar(i, 1, color=colors[name], edgecolor='white', linewidth=0.5)
if i % 5 == 0:
ax.text(i, 0.5, name[:1], ha='center', va='center', fontsize=9, fontweight='bold')
legend_patches = [mpatches.Patch(color=c, label=s) for s, c in colors.items()]
ax.legend(handles=legend_patches, loc='upper right')
ax.set_xlim(-0.5, len(path)-0.5)
ax.set_yticks([])
ax.set_xlabel('Day')
ax.set_title('Simulated Weather Sequence (30 days from Sunny)', fontsize=13)
plt.tight_layout()
plt.show()
# --- Stationary Distribution: long-run behavior ---
# As t → ∞, the distribution converges to π where π = π T
# Method 1: Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eig(T.T)
# Find the eigenvector for eigenvalue = 1
stationary_idx = np.argmin(np.abs(eigenvalues - 1.0))
stationary_dist = np.real(eigenvectors[:, stationary_idx])
stationary_dist = stationary_dist / stationary_dist.sum()
# Method 2: Power iteration (T^n for large n)
pi = np.array([1/3, 1/3, 1/3]) # start uniform
for _ in range(1000):
pi = pi @ T
print("Stationary Distribution (long-run probabilities):")
for i, s in enumerate(states):
print(f" P({s:6s}) = {stationary_dist[i]:.4f} (power iteration: {pi[i]:.4f})")
print("\nInterpretation: in the long run, these are the fractions of days of each weather type.")
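A consequence of the Markov property worth making explicit: the n-step transition probabilities are just powers of T (the Chapman-Kolmogorov equations), and for large n every row of T^n approaches the stationary distribution computed above. A short sketch, reusing the weather matrix:

```python
import numpy as np

states = ['Sunny', 'Cloudy', 'Rainy']
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

# Chapman-Kolmogorov: P(state after n steps | today) = row of T^n
T2 = np.linalg.matrix_power(T, 2)
print("P(weather in 2 days | today = Sunny):")
for s, p in zip(states, T2[0]):
    print(f"  {s}: {p:.2f}")

# For large n, every row of T^n approaches the stationary distribution
T50 = np.linalg.matrix_power(T, 50)
print("\nRows of T^50 (all ≈ stationary distribution):")
print(np.round(T50, 4))
```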
2. Hidden Markov Models (HMMs)
In a Hidden Markov Model, the true states are not directly observable; we only see noisy observations.
Hidden states:  Z₁ → Z₂ → Z₃ → ... → Zₜ
                 ↓    ↓    ↓         ↓
Observations:   X₁   X₂   X₃        Xₜ
An HMM is defined by the parameters λ = (A, B, π), together with the number of hidden states N:

| Symbol | Name | Description |
|---|---|---|
| N | Number of states | Hidden states Z ∈ {1,…,N} |
| π | Initial distribution | P(Z₁ = i) |
| A | Transition matrix | A[i,j] = P(Z_{t+1}=j \| Z_t=i) |
| B | Emission/observation dist. | B[i] = P(X_t \| Z_t=i) |
Applications: Speech recognition, gene sequence analysis, financial regime detection, POS tagging, gesture recognition
# --- Step-by-step HMM construction (from book, Chapter 6) ---
# Scenario: Economy has hidden states (Bull/Bear market)
# Observations: daily stock returns (high/medium/low)
# Step 1: Define states
hidden_states = ['Bull Market', 'Bear Market']
observations = ['High Return', 'Medium Return', 'Low Return']
n_states = len(hidden_states)
n_obs = len(observations)
# Step 2: Define parameters
pi = np.array([0.6, 0.4]) # Initial: 60% bull, 40% bear
# Step 3: Transition matrix (states persist with high probability)
A = np.array([
[0.7, 0.3], # Bull β [Bull, Bear]
[0.4, 0.6], # Bear β [Bull, Bear]
])
# Step 4: Emission matrix B[state][observation]
B = np.array([
[0.6, 0.3, 0.1], # Bull: high=60%, medium=30%, low=10%
[0.1, 0.3, 0.6], # Bear: high=10%, medium=30%, low=60%
])
print("HMM Parameters – Market Regime Model")
print("\nInitial distribution π:")
for i, s in enumerate(hidden_states):
    print(f"  P(Z₁={s}) = {pi[i]}")
print("\nTransition matrix A:")
print(f" {'':15}", end='')
for s in hidden_states:
print(f"{s:>15}", end='')
print()
for i, s in enumerate(hidden_states):
print(f" {s:15}", end='')
for p in A[i]:
print(f"{p:>15.1f}", end='')
print()
print("\nEmission matrix B:")
print(f" {'':15}", end='')
for o in observations:
print(f"{o:>15}", end='')
print()
for i, s in enumerate(hidden_states):
print(f" {s:15}", end='')
for p in B[i]:
print(f"{p:>15.1f}", end='')
print()
# --- Forward Algorithm: P(observations | model) ---
# Given a sequence of observations, what is the probability it was generated by this HMM?
def forward_algorithm(obs_seq, pi, A, B):
"""
    Computes P(O | λ) = probability of the observation sequence given the HMM.
    α[t][i] = P(O₁...Oₜ, Zₜ=i | λ)
"""
T = len(obs_seq)
N = len(pi)
alpha = np.zeros((T, N))
    # Initialization: α[0][i] = π[i] * B[i][O₁]
alpha[0] = pi * B[:, obs_seq[0]]
    # Induction: α[t][j] = B[j][Oₜ] * Σᵢ (α[t-1][i] * A[i][j])
for t in range(1, T):
for j in range(N):
alpha[t, j] = B[j, obs_seq[t]] * np.sum(alpha[t-1] * A[:, j])
    # Total probability = Σᵢ α[T][i]
prob = np.sum(alpha[-1])
return alpha, prob
# Observation sequence: High, High, Low (encoded as 0, 0, 2)
obs_seq = [0, 0, 2] # High, High, Low
alpha, prob = forward_algorithm(obs_seq, pi, A, B)
obs_names = [observations[o] for o in obs_seq]
print(f"Observation sequence: {' → '.join(obs_names)}")
print("\nForward probabilities α[t][state]:")
print(f"{'Time':>6} | {'Bull Market':>12} | {'Bear Market':>12}")
print("-" * 40)
for t, obs in enumerate(obs_names):
print(f"t={t} ({obs[:4]}): {alpha[t,0]:>12.6f} | {alpha[t,1]:>12.6f}")
print(f"\nP(observations | model) = {prob:.6f}")
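As a sanity check on the dynamic programming, P(O | λ) can also be computed by brute-force enumeration over all Nᵀ hidden paths, which is feasible only for tiny sequences. A sketch, with the market-model parameters repeated so the cell is self-contained:

```python
from itertools import product

import numpy as np

# Market-model parameters (repeated so the cell is self-contained)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
obs_seq = [0, 0, 2]  # High, High, Low

# Sum the joint probability of (hidden path, observations) over every path
total = 0.0
for path in product(range(len(pi)), repeat=len(obs_seq)):
    p = pi[path[0]] * B[path[0], obs_seq[0]]
    for t in range(1, len(obs_seq)):
        p *= A[path[t-1], path[t]] * B[path[t], obs_seq[t]]
    total += p

# Should agree with the forward algorithm's P(O | λ)
print(f"Brute-force P(O | λ) = {total:.6f}")
```

The forward algorithm collapses this exponential sum into O(N²T) work by reusing the per-state prefix probabilities α.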
# --- Viterbi Algorithm: Most likely hidden state sequence ---
# Given observations, what sequence of hidden states is most likely?
def viterbi(obs_seq, pi, A, B):
"""
Finds the most likely hidden state sequence.
    δ[t][i] = max probability of any path ending in state i at time t
"""
T = len(obs_seq)
N = len(pi)
delta = np.zeros((T, N))
psi = np.zeros((T, N), dtype=int) # backpointer
# Initialization
delta[0] = pi * B[:, obs_seq[0]]
# Recursion
for t in range(1, T):
for j in range(N):
trans_probs = delta[t-1] * A[:, j]
psi[t, j] = np.argmax(trans_probs)
delta[t, j] = B[j, obs_seq[t]] * np.max(trans_probs)
# Backtrack
path = np.zeros(T, dtype=int)
path[-1] = np.argmax(delta[-1])
for t in range(T-2, -1, -1):
path[t] = psi[t+1, path[t+1]]
return path, np.max(delta[-1])
# Test on longer sequence
obs_seq_long = [0, 0, 1, 0, 2, 2, 2, 1, 0, 0] # High,High,Med,High,Low,Low,Low,Med,High,High
path, max_prob = viterbi(obs_seq_long, pi, A, B)
print("Viterbi – Most Likely Hidden State Sequence")
print()
print(f"{'Day':>5} | {'Observed':>15} | {'Hidden State':>15}")
print("-" * 42)
for t, (o, z) in enumerate(zip(obs_seq_long, path)):
print(f" {t+1:>3} | {observations[o]:>15} | {hidden_states[z]:>15}")
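Viterbi returns the single jointly most likely path; a related quantity is the per-step posterior P(Zₜ = i | O), obtained by combining the forward pass with a backward pass (the same machinery Baum-Welch uses in its E-step). A sketch, not from the book, with the model parameters repeated for self-containment:

```python
import numpy as np

# Market-model parameters (repeated so the cell is self-contained)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 0, 1, 0, 2, 2, 2, 1, 0, 0]
T_len, N = len(obs), len(pi)

# Forward pass: alpha[t, i] = P(O_1..O_t, Z_t = i)
alpha = np.zeros((T_len, N))
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T_len):
    alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)

# Backward pass: beta[t, i] = P(O_{t+1}..O_T | Z_t = i)
beta = np.ones((T_len, N))
for t in range(T_len - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])

# Posterior (gamma): P(Z_t = i | O) ∝ alpha[t, i] * beta[t, i]
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)

for t in range(T_len):
    print(f"t={t}: P(Bull | O) = {gamma[t, 0]:.3f}")
```

Unlike the hard Viterbi labels, these posteriors express the model's per-day uncertainty, and they need not agree with the Viterbi path at every step.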
3. HMM with hmmlearn – Gaussian HMM
Continuous observations with Gaussian emission distributions
The discrete HMM we built above assumed a finite set of possible observations (High, Medium, Low). In practice, many signals are continuous-valued: sensor readings, stock returns, acoustic features in speech. A Gaussian HMM handles this by replacing the discrete emission matrix \(B\) with a set of Gaussian distributions, one per hidden state:

$$P(X_t \mid Z_t = i) = \mathcal{N}(X_t;\, \mu_i, \Sigma_i)$$
Each hidden state \(i\) emits observations from a Gaussian with its own mean \(\mu_i\) and covariance \(\Sigma_i\). The hmmlearn library implements the Baum-Welch algorithm (an instance of Expectation-Maximization) to learn all HMM parameters (transition matrix \(A\), emission means and covariances, and initial distribution \(\pi\)) from unlabeled observation sequences. The E-step computes the expected hidden state assignments given current parameters, and the M-step re-estimates parameters to maximize the expected log-likelihood. This unsupervised learning of sequential structure is what made HMMs the dominant approach in speech recognition for decades.
# --- Full Gaussian HMM implementation (from book, Chapter 6) ---
try:
from hmmlearn import hmm
HMM_AVAILABLE = True
except ImportError:
HMM_AVAILABLE = False
print("hmmlearn not installed. Run: pip install hmmlearn")
if HMM_AVAILABLE:
np.random.seed(42)
# Step 1: Define model (from book)
n_components = 2 # 2 hidden states
model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")
# Step 2: Set parameters (from book)
model.startprob_ = np.array([0.6, 0.4])
model.transmat_ = np.array([[0.7, 0.3],
[0.4, 0.6]])
model.means_ = np.array([[0.0], [3.0]]) # State 0: mean=0, State 1: mean=3
model.covars_ = np.array([[0.5], [0.5]])
# Step 3: Generate synthetic data (from book)
X, Z_true = model.sample(100) # 100 observations
# Step 4: Fit a new HMM to learn parameters from data (from book)
new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)
new_model.fit(X)
# Step 5: Predict hidden states (from book)
hidden_states_pred = new_model.predict(X)
    print("Gaussian HMM – Learned Parameters vs True Parameters")
print("=" * 55)
print("\nTransition Matrix (learned):")
print(np.round(new_model.transmat_, 3))
print("Transition Matrix (true):")
print(model.transmat_)
print("\nMeans (learned):", np.round(new_model.means_.flatten(), 3))
print("Means (true): ", model.means_.flatten())
print("\nCovariances (learned):", np.round(new_model.covars_.flatten(), 3))
print("Covariances (true): ", model.covars_.flatten())
if HMM_AVAILABLE:
# Visualize: true hidden states vs observations vs predicted
fig, axes = plt.subplots(3, 1, figsize=(14, 8), sharex=True)
t = np.arange(len(X))
# Observations
axes[0].plot(t, X, 'b-', linewidth=0.8, alpha=0.7)
axes[0].set_ylabel('Observed value', fontsize=11)
axes[0].set_title('Gaussian HMM: Observations and Hidden States', fontsize=13, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# True hidden states
axes[1].step(t, Z_true, 'g-', linewidth=1.5, where='post')
axes[1].set_ylabel('True state', fontsize=11)
axes[1].set_yticks([0, 1])
axes[1].set_yticklabels(['State 0\n(mean=0)', 'State 1\n(mean=3)'])
axes[1].grid(True, alpha=0.3)
# Predicted hidden states
axes[2].step(t, hidden_states_pred, 'r-', linewidth=1.5, where='post')
axes[2].set_ylabel('Predicted state', fontsize=11)
axes[2].set_yticks([0, 1])
axes[2].set_yticklabels(['State 0', 'State 1'])
axes[2].set_xlabel('Time step', fontsize=11)
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Accuracy
    # States may be flipped (label permutation), so try both labelings
acc_direct = np.mean(hidden_states_pred == Z_true)
acc_flipped = np.mean((1 - hidden_states_pred) == Z_true)
accuracy = max(acc_direct, acc_flipped)
print(f"State recovery accuracy: {accuracy:.1%}")
4. HMM Algorithms Summary
Three fundamental problems and their solutions
Every application of HMMs reduces to one of three canonical problems, each solved by an elegant dynamic programming algorithm:
| Algorithm | Problem | Method | Complexity |
|---|---|---|---|
| Forward | \(P(O \mid \lambda)\): likelihood of observations | Dynamic programming (sum-product) | \(O(N^2 T)\) |
| Viterbi | \(Z^* = \arg\max P(Z \mid O, \lambda)\): decode most likely states | Viterbi DP (max-product) | \(O(N^2 T)\) |
| Baum-Welch | \(\lambda^* = \arg\max P(O \mid \lambda)\): learn parameters | EM algorithm (forward-backward) | \(O(N^2 T)\) per iteration |
where \(N\) = number of hidden states and \(T\) = sequence length. The Forward and Viterbi algorithms differ in a single operation: Forward uses summation (marginalizing over all paths), while Viterbi uses maximization (finding the single best path). Both run in \(O(N^2 T)\) instead of the brute-force \(O(N^T)\) by exploiting the Markov property to decompose the computation into stages.
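The single-operation difference between Forward and Viterbi can be made concrete by writing one recursion with a pluggable reduction: `np.sum` yields the Forward likelihood, `np.max` the best-path probability. A sketch using the market-model parameters from earlier (backpointers omitted for brevity):

```python
import numpy as np

# Market-model parameters from earlier in the notebook
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 0, 2]  # High, High, Low

def hmm_dp(obs, pi, A, B, reduce_op):
    """Shared DP skeleton: reduce_op=np.sum -> Forward, np.max -> Viterbi."""
    v = pi * B[:, obs[0]]
    for o in obs[1:]:
        # v[:, None] * A has entry [i, j] = v[i] * A[i, j]
        v = B[:, o] * reduce_op(v[:, None] * A, axis=0)
    return reduce_op(v)

print(f"Forward (sum-product): P(O | λ)       = {hmm_dp(obs, pi, A, B, np.sum):.6f}")
print(f"Viterbi (max-product): best-path prob = {hmm_dp(obs, pi, A, B, np.max):.6f}")
```

Swapping the reduction is the entire difference: summing marginalizes over all paths, maximizing tracks only the single best one.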
5. Connection to Modern AI
From HMMs to attention-based sequence models
HMMs dominated sequential modeling for decades in speech recognition (phonemes as hidden states, acoustic features as observations), NLP (part-of-speech tagging, named entity recognition), and bioinformatics (gene finding, protein structure prediction). They were gradually displaced by RNNs and LSTMs (2014-2018), which could learn longer-range dependencies, and then by Transformers (2017-present), which eliminated the sequential bottleneck entirely.
The mathematical DNA of HMMs lives on in modern architectures. The attention mechanism in Transformers can be viewed as a differentiable, soft version of the Viterbi algorithm: instead of hard-assigning each position to a single βmost relevantβ source (as Viterbi does), attention computes a soft weighted average over all positions. The forward algorithmβs sum-product computation is analogous to the softmax-weighted value aggregation in self-attention. Understanding HMMs provides the conceptual foundation for grasping why attention works and what assumptions it relaxes compared to classical sequential models.
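As a toy illustration of that hard-vs-soft distinction (not actual Transformer code; the scores and values below are made up): a Viterbi-style decoder commits to the argmax position, while attention blends all positions with softmax weights:

```python
import numpy as np

# Hypothetical relevance scores of 4 source positions for one query,
# and the value stored at each position (numbers are made up)
scores = np.array([2.0, 0.5, 1.0, -1.0])
values = np.array([10.0, 20.0, 30.0, 40.0])

# Hard (Viterbi-like): commit to the single highest-scoring position
hard = values[np.argmax(scores)]

# Soft (attention-like): softmax weights, a differentiable blend of all positions
w = np.exp(scores - scores.max())
w /= w.sum()
soft = w @ values

print(f"hard selection: {hard}")      # only position 0's value survives
print(f"soft attention: {soft:.2f}")  # weighted mix dominated by position 0
```

The soft version is differentiable in the scores, which is what lets Transformers learn their "alignments" end-to-end by gradient descent.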
Exercises
For the weather Markov chain, compute the probability of the sequence: Sunny → Rainy → Sunny → Cloudy. (Use the transition matrix manually.)
What is the stationary distribution of a 2-state chain with A = [[0.9, 0.1], [0.2, 0.8]]? Verify using power iteration.
In the Viterbi algorithm, why do we take the max rather than the sum (unlike the forward algorithm)?
Modify the Gaussian HMM code to use 3 hidden states instead of 2. How does accuracy change?
How does an HMM relate to a Kalman filter? (Research question: both are state-space models with different observation noise assumptions.)