Markov Models & Hidden Markov Models (HMMs)¶

Source: The Math Behind Artificial Intelligence — Chapter 6, Section: "What Are Markov Models?"

Overview¶

Markov models are the foundation of sequential reasoning in AI — from speech recognition to financial modeling to modern LLM token prediction.

This notebook covers:

  1. The Markov Property — "memoryless" sequences

  2. Discrete Markov Chains — state transitions

  3. Transition matrices — the math

  4. Hidden Markov Models (HMMs) — when states are latent

  5. HMM algorithms: Forward, Viterbi, Baum-Welch

  6. Full implementation with hmmlearn

  7. Applications in AI

1. The Markov Property¶

A process has the Markov property if the future depends only on the present state, not on history:

\[P(X_{t+1} = s \mid X_t, X_{t-1}, \ldots, X_1) = P(X_{t+1} = s \mid X_t)\]

This is the "memoryless" or "no memory" property.

Real-world examples:

  • Weather: tomorrow’s weather depends mostly on today’s

  • Chess: next move depends on current board position, not full game history

  • Language models (1st order): next word depends on current word (bigram models)

  • Reinforcement learning: next state depends on current state + action

Where the Markov assumption fails: stock prices with long-memory effects, long-range language dependencies (addressed by Transformers with attention)
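The memoryless property can be checked empirically. A minimal sketch (self-contained, using the same weather transition matrix defined below): simulate a long chain and compare the conditional frequency of tomorrow's state given today's state alone versus given today's state plus yesterday's. If the chain is Markov, the extra conditioning changes nothing.

```python
import numpy as np

# Weather chain: Sunny (0), Cloudy (1), Rainy (2)
T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])

rng = np.random.default_rng(0)
n = 200_000
chain = [0]
for _ in range(n):
    chain.append(rng.choice(3, p=T[chain[-1]]))
chain = np.array(chain)

# P(next=Sunny | current=Cloudy), ignoring earlier history
mask1 = chain[:-1] == 1
p_one_step = np.mean(chain[1:][mask1] == 0)

# P(next=Sunny | current=Cloudy, previous=Rainy): extra history conditioning
mask2 = (chain[1:-1] == 1) & (chain[:-2] == 2)
p_two_step = np.mean(chain[2:][mask2] == 0)

print(f"P(Sunny | Cloudy)        ≈ {p_one_step:.3f}")
print(f"P(Sunny | Cloudy, Rainy) ≈ {p_two_step:.3f}")
```

Both estimates should be close to T[1][0] = 0.3: conditioning on yesterday's weather adds no information once today's is known.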

import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from matplotlib.patches import FancyArrowPatch

# --- Example 1: Weather Markov Chain ---
# States: Sunny (0), Cloudy (1), Rainy (2)

states = ['Sunny', 'Cloudy', 'Rainy']

# Transition matrix T[i][j] = P(tomorrow=j | today=i)
T = np.array([
    [0.7, 0.2, 0.1],  # Sunny  → [Sunny, Cloudy, Rainy]
    [0.3, 0.4, 0.3],  # Cloudy → [Sunny, Cloudy, Rainy]
    [0.2, 0.3, 0.5],  # Rainy  → [Sunny, Cloudy, Rainy]
])

print("Weather Markov Chain — Transition Matrix")
print()
print(f"{'':>8}", end='')
for s in states:
    print(f"{s:>8}", end='')
print()
for i, row_state in enumerate(states):
    print(f"{row_state:>8}", end='')
    for p in T[i]:
        print(f"{p:>8.1f}", end='')
    print(f"  (sums to {T[i].sum():.1f})")
# --- Simulate a Markov Chain ---

def simulate_markov_chain(T, initial_state, n_steps, seed=42):
    rng = np.random.default_rng(seed)
    state = initial_state
    path = [state]
    for _ in range(n_steps):
        state = rng.choice(len(T), p=T[state])
        path.append(state)
    return path

path = simulate_markov_chain(T, initial_state=0, n_steps=30)
state_names = [states[s] for s in path]

colors = {'Sunny': 'gold', 'Cloudy': 'lightgray', 'Rainy': 'steelblue'}
fig, ax = plt.subplots(figsize=(14, 3))
for i, (s, name) in enumerate(zip(path, state_names)):
    ax.bar(i, 1, color=colors[name], edgecolor='white', linewidth=0.5)
    if i % 5 == 0:
        ax.text(i, 0.5, name[:1], ha='center', va='center', fontsize=9, fontweight='bold')

legend_patches = [mpatches.Patch(color=c, label=s) for s, c in colors.items()]
ax.legend(handles=legend_patches, loc='upper right')
ax.set_xlim(-0.5, len(path)-0.5)
ax.set_yticks([])
ax.set_xlabel('Day')
ax.set_title('Simulated Weather Sequence (30 days from Sunny)', fontsize=13)
plt.tight_layout()
plt.show()
# --- Stationary Distribution: long-run behavior ---
# As t → ∞, the distribution converges to π where π = π T

# Method 1: Eigenvalue decomposition
eigenvalues, eigenvectors = np.linalg.eig(T.T)
# Find the eigenvector for eigenvalue = 1
stationary_idx = np.argmin(np.abs(eigenvalues - 1.0))
stationary_dist = np.real(eigenvectors[:, stationary_idx])
stationary_dist = stationary_dist / stationary_dist.sum()

# Method 2: Power iteration (T^n for large n)
pi = np.array([1/3, 1/3, 1/3])  # start uniform
for _ in range(1000):
    pi = pi @ T

print("Stationary Distribution (long-run probabilities):")
for i, s in enumerate(states):
    print(f"  P({s:6s}) = {stationary_dist[i]:.4f}  (power iteration: {pi[i]:.4f})")
print("\nInterpretation: in the long run, these are the fractions of days spent in each weather type.")
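A third route to the stationary distribution, as a sketch: solve the linear system π T = π directly, replacing one redundant equation with the normalization constraint Σπ = 1.

```python
import numpy as np

T = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.4, 0.3],
              [0.2, 0.3, 0.5]])
n = T.shape[0]

# π T = π  ⇔  (Tᵀ − I) πᵀ = 0; stack the normalization row Σπ = 1 on top
M = np.vstack([T.T - np.eye(n), np.ones(n)])
b = np.zeros(n + 1)
b[-1] = 1.0
pi, *_ = np.linalg.lstsq(M, b, rcond=None)  # least-squares solve of the stacked system
print(np.round(pi, 4))  # matches the eigenvector and power-iteration results
```

For this chain the exact answer is (21/46, 13/46, 12/46), which the eigenvalue and power-iteration methods above should reproduce.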

2. Hidden Markov Models (HMMs)¶

In a Hidden Markov Model, the true states are not directly observable — we only see noisy observations.

Hidden states:     Z₁ → Z₂ → Z₃ → ... → Zₙ
                   ↓    ↓    ↓           ↓
Observations:      X₁   X₂   X₃         Xₙ

An HMM is defined by the parameter set λ = (A, B, π), together with the number of hidden states N:

| Symbol | Name | Description |
|--------|------|-------------|
| N | Number of states | Hidden states Z ∈ {1, …, N} |
| π | Initial distribution | P(Z₁ = i) |
| A | Transition matrix | A[i,j] = P(Z_{t+1} = j ∣ Z_t = i) |
| B | Emission/observation dist. | B[i] = P(X_t ∣ Z_t = i) |
Applications: Speech recognition, gene sequence analysis, financial regime detection, POS tagging, gesture recognition

# --- Step-by-step HMM construction (from book, Chapter 6) ---
# Scenario: Economy has hidden states (Bull/Bear market)
# Observations: daily stock returns (high/medium/low)

# Step 1: Define states
hidden_states = ['Bull Market', 'Bear Market']
observations = ['High Return', 'Medium Return', 'Low Return']
n_states = len(hidden_states)
n_obs = len(observations)

# Step 2: Define parameters
pi = np.array([0.6, 0.4])  # Initial: 60% bull, 40% bear

# Step 3: Transition matrix (states persist with high probability)
A = np.array([
    [0.7, 0.3],  # Bull → [Bull, Bear]
    [0.4, 0.6],  # Bear → [Bull, Bear]
])

# Step 4: Emission matrix B[state][observation]
B = np.array([
    [0.6, 0.3, 0.1],  # Bull: high=60%, medium=30%, low=10%
    [0.1, 0.3, 0.6],  # Bear: high=10%, medium=30%, low=60%
])

print("HMM Parameters — Market Regime Model")
print("\nInitial distribution π:")
for i, s in enumerate(hidden_states):
    print(f"  P(Z₁={s}) = {pi[i]}")

print("\nTransition matrix A:")
print(f"  {'':15}", end='')
for s in hidden_states:
    print(f"{s:>15}", end='')
print()
for i, s in enumerate(hidden_states):
    print(f"  {s:15}", end='')
    for p in A[i]:
        print(f"{p:>15.1f}", end='')
    print()

print("\nEmission matrix B:")
print(f"  {'':15}", end='')
for o in observations:
    print(f"{o:>15}", end='')
print()
for i, s in enumerate(hidden_states):
    print(f"  {s:15}", end='')
    for p in B[i]:
        print(f"{p:>15.1f}", end='')
    print()
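Before running inference, it helps to see the generative story these parameters define: draw a hidden state from π, emit an observation from that state's row of B, transition via A, repeat. A minimal self-contained sketch (redefining the same π, A, B):

```python
import numpy as np

# Same discrete HMM parameters as above (Bull/Bear market model)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
hidden_states = ['Bull', 'Bear']
observations = ['High', 'Med', 'Low']

rng = np.random.default_rng(0)
z = rng.choice(2, p=pi)                    # draw initial hidden state from π
zs, xs = [], []
for _ in range(10):
    xs.append(rng.choice(3, p=B[z]))       # emit observation from current state's row of B
    zs.append(z)
    z = rng.choice(2, p=A[z])              # transition to next hidden state via A
print('hidden:  ', ' '.join(hidden_states[i] for i in zs))
print('observed:', ' '.join(observations[i] for i in xs))
```

The inference algorithms below invert exactly this process: given only the observed row, recover what can be known about the hidden row.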
# --- Forward Algorithm: P(observations | model) ---
# Given a sequence of observations, what is the probability it was generated by this HMM?

def forward_algorithm(obs_seq, pi, A, B):
    """
    Computes P(O | λ) = probability of observation sequence given HMM.
    α[t][i] = P(O₁...Oₜ, Zₜ=i | λ)
    """
    T = len(obs_seq)
    N = len(pi)
    alpha = np.zeros((T, N))

    # Initialization: α[0][i] = π[i] * B[i][O₁]
    alpha[0] = pi * B[:, obs_seq[0]]

    # Induction: α[t][j] = B[j][Oₜ] * Σᵢ(α[t-1][i] * A[i][j])
    for t in range(1, T):
        for j in range(N):
            alpha[t, j] = B[j, obs_seq[t]] * np.sum(alpha[t-1] * A[:, j])

    # Total probability = Σᵢ α[T][i]
    prob = np.sum(alpha[-1])
    return alpha, prob


# Observation sequence: High, High, Low (encoded as 0, 0, 2)
obs_seq = [0, 0, 2]   # High, High, Low
alpha, prob = forward_algorithm(obs_seq, pi, A, B)

obs_names = [observations[o] for o in obs_seq]
print(f"Observation sequence: {' → '.join(obs_names)}")
print(f"\nForward probabilities α[t][state]:")
print(f"{'Time':>6} | {'Bull Market':>12} | {'Bear Market':>12}")
print("-" * 40)
for t, obs in enumerate(obs_names):
    print(f"t={t} ({obs[:4]}):  {alpha[t,0]:>12.6f} | {alpha[t,1]:>12.6f}")
print(f"\nP(observations | model) = {prob:.6f}")
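As a sanity check on the dynamic program, the same probability can be computed by brute force, summing the joint probability P(path, observations) over all N^T hidden paths. A self-contained sketch (redefining the toy parameters):

```python
from itertools import product
import numpy as np

# Same toy HMM parameters as above (Bull/Bear, High/Med/Low)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
obs_seq = [0, 0, 2]   # High, High, Low

# Enumerate every hidden path and accumulate its joint probability with the observations
total = 0.0
for path in product(range(2), repeat=len(obs_seq)):
    p = pi[path[0]] * B[path[0], obs_seq[0]]
    for t in range(1, len(obs_seq)):
        p *= A[path[t-1], path[t]] * B[path[t], obs_seq[t]]
    total += p
print(f"Brute-force P(O) = {total:.6f}")   # ≈ 0.045480, matching the forward algorithm
```

The forward algorithm reaches this same number in O(N²T) steps instead of O(N^T) path enumerations.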
# --- Viterbi Algorithm: Most likely hidden state sequence ---
# Given observations, what sequence of hidden states is most likely?

def viterbi(obs_seq, pi, A, B):
    """
    Finds the most likely hidden state sequence.
    δ[t][i] = max probability of any path ending in state i at time t
    """
    T = len(obs_seq)
    N = len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)   # backpointer
    
    # Initialization
    delta[0] = pi * B[:, obs_seq[0]]
    
    # Recursion
    for t in range(1, T):
        for j in range(N):
            trans_probs = delta[t-1] * A[:, j]
            psi[t, j] = np.argmax(trans_probs)
            delta[t, j] = B[j, obs_seq[t]] * np.max(trans_probs)
    
    # Backtrack
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(delta[-1])
    for t in range(T-2, -1, -1):
        path[t] = psi[t+1, path[t+1]]
    
    return path, np.max(delta[-1])


# Test on longer sequence
obs_seq_long = [0, 0, 1, 0, 2, 2, 2, 1, 0, 0]   # High,High,Med,High,Low,Low,Low,Med,High,High
path, max_prob = viterbi(obs_seq_long, pi, A, B)

print("Viterbi — Most Likely Hidden State Sequence")
print()
print(f"{'Day':>5} | {'Observed':>15} | {'Hidden State':>15}")
print("-" * 42)
for t, (o, z) in enumerate(zip(obs_seq_long, path)):
    print(f"  {t+1:>3} | {observations[o]:>15} | {hidden_states[z]:>15}")
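Viterbi returns the single best path, but it can also be useful to know the marginal posterior P(Zₜ = i | all observations) at each step. That is computed by the forward-backward recursions (the same machinery Baum-Welch uses in its E-step). A self-contained sketch on the same observation sequence:

```python
import numpy as np

# Same toy HMM as above
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 0, 1, 0, 2, 2, 2, 1, 0, 0]

T_len, N = len(obs), len(pi)
alpha = np.zeros((T_len, N))
beta = np.zeros((T_len, N))

# Forward pass: α[t][j] = B[j][Oₜ] * Σᵢ α[t-1][i] A[i][j]
alpha[0] = pi * B[:, obs[0]]
for t in range(1, T_len):
    alpha[t] = B[:, obs[t]] * (alpha[t-1] @ A)

# Backward pass: β[t][i] = Σⱼ A[i][j] B[j][Oₜ₊₁] β[t+1][j]
beta[-1] = 1.0
for t in range(T_len - 2, -1, -1):
    beta[t] = A @ (B[:, obs[t+1]] * beta[t+1])

# Posterior marginals: γ[t][i] ∝ α[t][i] β[t][i]
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)
for t in range(T_len):
    print(f"t={t+1:2d}  P(Bull)={gamma[t,0]:.3f}  P(Bear)={gamma[t,1]:.3f}")
```

Unlike the hard Viterbi labels, these soft posteriors quantify how confident the model is at each day; the two can disagree at individual time steps.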

3. HMM with hmmlearn – Gaussian HMM¶

Continuous observations with Gaussian emission distributions¶

The discrete HMM we built above assumed a finite set of possible observations (High, Medium, Low). In practice, many signals are continuous-valued – sensor readings, stock returns, acoustic features in speech. A Gaussian HMM handles this by replacing the discrete emission matrix \(B\) with a set of Gaussian distributions, one per hidden state:

\[P(X_t \mid Z_t = i) = \mathcal{N}(X_t; \mu_i, \Sigma_i)\]

Each hidden state \(i\) emits observations from a Gaussian with its own mean \(\mu_i\) and covariance \(\Sigma_i\). The hmmlearn library implements the Baum-Welch algorithm (an instance of Expectation-Maximization) to learn all HMM parameters – transition matrix \(A\), emission means and covariances, and initial distribution \(\pi\) – from unlabeled observation sequences. The E-step computes the expected hidden state assignments given current parameters, and the M-step re-estimates parameters to maximize the expected log-likelihood. This unsupervised learning of sequential structure is what made HMMs the dominant approach in speech recognition for decades.
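The Gaussian emission density above can be evaluated directly. A minimal sketch for the 1-D case, using the illustrative state means (0 and 3) and variance (0.5) from the demo below:

```python
import numpy as np

def gaussian_emission(x, mu, var):
    """P(X_t = x | Z_t = i) for a 1-D Gaussian emission N(mu_i, var_i)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

means = np.array([0.0, 3.0])   # per-state emission means (as in the demo below)
var = 0.5                      # shared emission variance
x = 1.0                        # an observed value

likelihoods = gaussian_emission(x, means, var)  # one likelihood per hidden state
print(likelihoods)  # state 0 (mean 0) is the more likely emitter of x = 1.0
```

These per-state likelihoods play exactly the role of the column B[:, Oₜ] in the discrete forward and Viterbi recursions above.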

# --- Full Gaussian HMM implementation (from book, Chapter 6) ---
try:
    from hmmlearn import hmm
    HMM_AVAILABLE = True
except ImportError:
    HMM_AVAILABLE = False
    print("hmmlearn not installed. Run: pip install hmmlearn")

if HMM_AVAILABLE:
    np.random.seed(42)

    # Step 1: Define model (from book)
    n_components = 2   # 2 hidden states
    model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag")

    # Step 2: Set parameters (from book)
    model.startprob_ = np.array([0.6, 0.4])
    model.transmat_  = np.array([[0.7, 0.3],
                                 [0.4, 0.6]])
    model.means_  = np.array([[0.0], [3.0]])    # State 0: mean=0, State 1: mean=3
    model.covars_ = np.array([[0.5], [0.5]])

    # Step 3: Generate synthetic data (from book)
    X, Z_true = model.sample(100)   # 100 observations

    # Step 4: Fit a new HMM to learn parameters from data (from book)
    new_model = hmm.GaussianHMM(n_components=n_components, covariance_type="diag", n_iter=100)
    new_model.fit(X)

    # Step 5: Predict hidden states (from book)
    hidden_states_pred = new_model.predict(X)

    print("Gaussian HMM — Learned Parameters vs True Parameters")
    print("=" * 55)
    print("\nTransition Matrix (learned):")
    print(np.round(new_model.transmat_, 3))
    print("Transition Matrix (true):")
    print(model.transmat_)
    print("\nMeans (learned):", np.round(new_model.means_.flatten(), 3))
    print("Means (true):   ", model.means_.flatten())
    print("\nCovariances (learned):", np.round(new_model.covars_.flatten(), 3))
    print("Covariances (true):   ", model.covars_.flatten())
if HMM_AVAILABLE:
    # Visualize: true hidden states vs observations vs predicted
    fig, axes = plt.subplots(3, 1, figsize=(14, 8), sharex=True)

    t = np.arange(len(X))

    # Observations
    axes[0].plot(t, X, 'b-', linewidth=0.8, alpha=0.7)
    axes[0].set_ylabel('Observed value', fontsize=11)
    axes[0].set_title('Gaussian HMM: Observations and Hidden States', fontsize=13, fontweight='bold')
    axes[0].grid(True, alpha=0.3)

    # True hidden states
    axes[1].step(t, Z_true, 'g-', linewidth=1.5, where='post')
    axes[1].set_ylabel('True state', fontsize=11)
    axes[1].set_yticks([0, 1])
    axes[1].set_yticklabels(['State 0\n(mean=0)', 'State 1\n(mean=3)'])
    axes[1].grid(True, alpha=0.3)

    # Predicted hidden states
    axes[2].step(t, hidden_states_pred, 'r-', linewidth=1.5, where='post')
    axes[2].set_ylabel('Predicted state', fontsize=11)
    axes[2].set_yticks([0, 1])
    axes[2].set_yticklabels(['State 0', 'State 1'])
    axes[2].set_xlabel('Time step', fontsize=11)
    axes[2].grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Accuracy
    # States may be flipped (label permutation) — try both
    acc_direct = np.mean(hidden_states_pred == Z_true)
    acc_flipped = np.mean((1 - hidden_states_pred) == Z_true)
    accuracy = max(acc_direct, acc_flipped)
    print(f"State recovery accuracy: {accuracy:.1%}")

4. HMM Algorithms Summary¶

Three fundamental problems and their solutions¶

Every application of HMMs reduces to one of three canonical problems, each solved by an elegant dynamic programming algorithm:

| Algorithm | Problem | Method | Complexity |
|-----------|---------|--------|------------|
| Forward | \(P(O \mid \lambda)\): likelihood of observations | Dynamic programming (sum-product) | \(O(N^2 T)\) |
| Viterbi | \(Z^* = \arg\max P(Z \mid O, \lambda)\): decode most likely states | Viterbi DP (max-product) | \(O(N^2 T)\) |
| Baum-Welch | \(\lambda^* = \arg\max P(O \mid \lambda)\): learn parameters | EM algorithm (forward-backward) | \(O(N^2 T)\) per iteration |
where \(N\) = number of hidden states and \(T\) = sequence length. The Forward and Viterbi algorithms differ in a single operation: Forward uses summation (marginalizing over all paths), while Viterbi uses maximization (finding the single best path). Both run in \(O(N^2 T)\) instead of the brute-force \(O(N^T)\) by exploiting the Markov property to decompose the computation into stages.
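The single-operation difference between Forward and Viterbi can be made literal in code. A sketch with a reduce-operation parameter: passing np.sum gives the Forward likelihood, passing np.max gives the Viterbi best-path probability.

```python
import numpy as np

def hmm_dp(obs, pi, A, B, reduce_op):
    """Generic HMM dynamic program: reduce_op=np.sum → Forward, np.max → Viterbi."""
    table = pi * B[:, obs[0]]
    for t in range(1, len(obs)):
        # For each destination state j, combine the paths arriving from every state i
        table = np.array([B[j, obs[t]] * reduce_op(table * A[:, j])
                          for j in range(len(pi))])
    return reduce_op(table)

# Toy Bull/Bear model from Section 2
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.6, 0.3, 0.1], [0.1, 0.3, 0.6]])
obs = [0, 0, 2]   # High, High, Low

print(f"Forward  P(O|λ)         = {hmm_dp(obs, pi, A, B, np.sum):.6f}")
print(f"Viterbi  best-path prob = {hmm_dp(obs, pi, A, B, np.max):.6f}")
```

The best single path's probability is necessarily no larger than the total over all paths, since the max is one term of the sum.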

5. Connection to Modern AI¶

From HMMs to attention-based sequence models¶

HMMs dominated sequential modeling for decades in speech recognition (phonemes as hidden states, acoustic features as observations), NLP (part-of-speech tagging, named entity recognition), and bioinformatics (gene finding, protein structure prediction). They were gradually replaced by RNNs and LSTMs (2014-2018) which could learn longer-range dependencies, and then by Transformers (2017-present) which eliminated the sequential bottleneck entirely.

The mathematical DNA of HMMs lives on in modern architectures. The attention mechanism in Transformers can be viewed as a differentiable, soft version of the Viterbi algorithm: instead of hard-assigning each position to a single "most relevant" source (as Viterbi does), attention computes a soft weighted average over all positions. The forward algorithm's sum-product computation is analogous to the softmax-weighted value aggregation in self-attention. Understanding HMMs provides the conceptual foundation for grasping why attention works and what assumptions it relaxes compared to classical sequential models.

Exercises¶

  1. For the weather Markov chain, compute the probability of the sequence: Sunny → Rainy → Sunny → Cloudy. (Use the transition matrix manually.)

  2. What is the stationary distribution of a 2-state chain with A = [[0.9, 0.1], [0.2, 0.8]]? Verify using power iteration.

  3. In the Viterbi algorithm, why do we take the max rather than the sum (unlike the forward algorithm)?

  4. Modify the Gaussian HMM code to use 3 hidden states instead of 2. How does accuracy change?

  5. How does an HMM relate to a Kalman filter? (Research question — both are state-space models; an HMM has discrete hidden states, while a Kalman filter assumes continuous hidden states with linear-Gaussian dynamics and observations.)