Lecture 14: Expectation Maximization & Clustering

📹 Watch Lecture

From Andrew Ng’s CS229 Lecture 14 (Autumn 2018)

“What I’d like to do today is start our foray into unsupervised learning.” - Andrew Ng

Unsupervised Learning

Supervised Learning:

  • Given x and y (labels)

  • Learn mapping from x → y

  • Examples: Regression, Classification

Unsupervised Learning:

  • Given only x (no labels)

  • Training set: {x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾}

  • Find something interesting about the data

First Unsupervised Algorithm: Clustering

“The first unsupervised learning algorithm we’ll talk about is clustering in which given a dataset like this, hopefully, we can have an algorithm that can figure out that this dataset has two separate clusters.”

Market Segmentation Use Case

“One of the most common uses of clustering is market segmentation. If you have a website selling things online, you have a huge database of many different users and run clustering to decide what are the different market segments.”

Example segments:

  • Age range and gender

  • Education level

  • Geographic location (East Coast vs West Coast)

  • Behavioral patterns

K-Means Clustering Algorithm

Most commonly used clustering algorithm

Algorithm Steps:

Step 1: Initialize

  • Pick K points as cluster centroids (initial guess)

  • Denoted by crosses in visualization

Step 2: Iterate

2a. Assignment Step:

“Go through each of your training examples and for each of them you color them either red or blue depending on which is the closer cluster centroid”

  • Assign each point to nearest centroid

2b. Update Step:

“Look at all the blue dots and compute the average, right? Just find the mean of all the blue dots, and move the blue cluster centroid there. And similarly, look at all the red dots and find the mean”

  • Move each centroid to the mean of its assigned points

  • “This is just a standard arithmetic average”

Convergence:

“If you keep running the algorithm, nothing changes. So the algorithm has converged.”

  • When assignments stop changing

  • When centroids stop moving
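The two iterated steps can be sketched on a toy 1-D dataset (a minimal illustration of a single iteration, assuming nothing beyond NumPy; the full algorithm is built in section 8.1 below):

```python
import numpy as np

# Toy data: six 1-D points and K = 2 initial centroid guesses
X = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
centroids = np.array([2.0, 5.0])

# Step 2a (assignment): each point joins its nearest centroid
labels = np.argmin(np.abs(X[:, None] - centroids[None, :]), axis=1)
print(labels)  # [0 0 0 1 1 1]

# Step 2b (update): each centroid moves to the mean of its points
centroids = np.array([X[labels == k].mean() for k in range(2)])
print(centroids)  # [ 2. 11.]
```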

Coming Up

  1. K-Means mathematical formulation

  2. EM Algorithm (Expectation Maximization)

    • Generalizes K-Means

    • Handles soft assignments

    • Mixture of Gaussians model

  3. Choosing K (number of clusters)

  4. Alternative clustering methods

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.stats import multivariate_normal
import warnings
warnings.filterwarnings('ignore')

plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print("Libraries loaded!")

8.1 K-Means from Scratch

What: Implementing the K-Means clustering algorithm step by step

We build K-Means from scratch following the two-step iterative procedure: (1) the assignment step, which assigns each data point to its nearest centroid by computing Euclidean distances, and (2) the update step, which moves each centroid to the mean of its assigned points. The algorithm minimizes the objective \(J = \sum_{k=1}^{K}\sum_{x \in C_k} \|x - \mu_k\|^2\) (called inertia), which monotonically decreases at each iteration until convergence.

Why: The most widely used clustering algorithm

K-Means is the default first choice for clustering due to its simplicity, speed (\(O(nKd)\) per iteration for \(n\) points in \(d\) dimensions), and scalability to large datasets. Understanding its mechanics reveals both its strengths (fast convergence on spherical clusters) and limitations (sensitivity to initialization, assumption of equal-variance spherical clusters, need to specify \(K\) in advance). Every variant – K-Means++, mini-batch K-Means, bisecting K-Means – builds on this core algorithm.
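The K-Means++ seeding mentioned above can be sketched as distance-weighted sampling (`kmeanspp_init` is an illustrative helper, not the scikit-learn implementation):

```python
import numpy as np

def kmeanspp_init(X, k, rng=None):
    """K-Means++ seeding (illustrative): the first centroid is chosen uniformly
    at random; each later one is drawn with probability proportional to its
    squared distance from the nearest centroid chosen so far."""
    rng = np.random.default_rng(0) if rng is None else rng
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid
        d2 = np.min(((X[:, None] - np.array(centroids))**2).sum(axis=2), axis=1)
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X_demo = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0], [20.0, 0.0]])
print(kmeanspp_init(X_demo, 3))  # three well-spread seeds
```

Because already-chosen seeds have zero distance to themselves, they can never be re-picked, and far-away regions are strongly favored.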

class KMeansCustom:
    def __init__(self, n_clusters=3, max_iters=100, random_state=None):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels = None
        self.inertia_history = []
    
    def fit(self, X):
        if self.random_state is not None:  # 'is not None' so seed 0 is honored
            np.random.seed(self.random_state)
        
        # Initialize centroids randomly
        idx = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[idx]
        
        for iteration in range(self.max_iters):
            # Assignment step
            distances = np.sqrt(((X[:, np.newaxis] - self.centroids)**2).sum(axis=2))
            self.labels = np.argmin(distances, axis=1)
            
            # Update step
            new_centroids = np.array([X[self.labels == k].mean(axis=0) 
                                     for k in range(self.n_clusters)])
            
            # Compute inertia
            inertia = sum([np.sum((X[self.labels == k] - new_centroids[k])**2) 
                          for k in range(self.n_clusters)])
            self.inertia_history.append(inertia)
            
            # Check convergence
            if np.allclose(self.centroids, new_centroids):
                break
            
            self.centroids = new_centroids
        
        return self
    
    def predict(self, X):
        distances = np.sqrt(((X[:, np.newaxis] - self.centroids)**2).sum(axis=2))
        return np.argmin(distances, axis=1)

# Test
X_blobs, y_true = make_blobs(n_samples=300, centers=4, random_state=42)

kmeans = KMeansCustom(n_clusters=4, random_state=42)
kmeans.fit(X_blobs)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Clustering result
axes[0].scatter(X_blobs[:, 0], X_blobs[:, 1], c=kmeans.labels, cmap='viridis', s=50, alpha=0.6)
axes[0].scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1], 
               c='red', marker='X', s=200, edgecolors='black', linewidths=2)
axes[0].set_title('K-Means Clustering', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# Inertia convergence
axes[1].plot(kmeans.inertia_history, linewidth=2)
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Inertia', fontsize=12)
axes[1].set_title('K-Means Convergence', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"Converged in {len(kmeans.inertia_history)} iterations")

8.2 Elbow Method for Optimal K

What: Choosing the number of clusters using inertia and silhouette scores

Since K-Means requires specifying \(K\) in advance, we need a principled way to select it. The elbow method plots inertia vs. \(K\) and looks for a “bend” where adding more clusters yields diminishing returns. The silhouette score provides a complementary view: for each point, it measures how similar it is to its own cluster versus the nearest neighboring cluster, ranging from \(-1\) (wrong cluster) to \(+1\) (well-clustered).
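The per-point silhouette described above is \(s = (b - a) / \max(a, b)\), where \(a\) is the mean distance to the point's own cluster and \(b\) is the mean distance to the nearest other cluster. A minimal sketch (`silhouette_point` is an illustrative helper; sklearn's silhouette_score averages this over all points):

```python
import numpy as np

def silhouette_point(i, X, labels):
    """Silhouette s(i) = (b - a) / max(a, b) for a single point."""
    d = np.linalg.norm(X - X[i], axis=1)
    same = (labels == labels[i]) & (np.arange(len(X)) != i)
    a = d[same].mean()                     # mean intra-cluster distance
    b = min(d[labels == k].mean()          # mean distance to nearest other cluster
            for k in set(labels) if k != labels[i])
    return (b - a) / max(a, b)

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
print(round(silhouette_point(0, X, labels), 3))  # well-clustered: close to +1
```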

Why: Avoiding arbitrary cluster counts in practice

In real applications – customer segmentation, document clustering, image quantization – the true number of clusters is unknown. The elbow method and silhouette analysis are the two standard heuristics practitioners use. Neither is foolproof (the “elbow” can be ambiguous, and silhouette scores favor convex clusters), but together they provide a reasonable starting point. For more rigorous selection, information-theoretic criteria like BIC (used by Gaussian Mixture Models) offer a probabilistic alternative.
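The BIC-based alternative can be sketched with scikit-learn's GaussianMixture, whose bic method returns the criterion (lower is better):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit a GMM for each candidate K and keep its BIC score (lower is better)
bics = {k: GaussianMixture(n_components=k, random_state=42).fit(X).bic(X)
        for k in range(1, 8)}
best_k = min(bics, key=bics.get)
print(best_k)  # on well-separated blobs this typically recovers K = 4
```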

# Elbow method
K_range = range(1, 11)
inertias = []
silhouettes = []

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_blobs)
    inertias.append(km.inertia_)
    # Silhouette is undefined for a single cluster; NaN is skipped when plotting
    silhouettes.append(silhouette_score(X_blobs, km.labels_) if k > 1 else np.nan)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

axes[0].plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].axvline(x=4, color='red', linestyle='--', label='Optimal K=4')
axes[0].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[0].set_ylabel('Inertia', fontsize=12)
axes[0].set_title('Elbow Method', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)

axes[1].plot(K_range, silhouettes, 'go-', linewidth=2, markersize=8)
axes[1].axvline(x=4, color='red', linestyle='--', label='Optimal K=4')
axes[1].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[1].set_ylabel('Silhouette Score', fontsize=12)
axes[1].set_title('Silhouette Analysis', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

8.3 Gaussian Mixture Models (GMM)

What: Soft clustering with probabilistic assignments

Unlike K-Means, which assigns each point to exactly one cluster (hard assignment), GMMs model the data as a mixture of \(K\) Gaussian distributions and assign each point a probability of belonging to each cluster (soft assignment). The model is: \(p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)\), where \(\pi_k\) are mixing weights, and each component has its own mean \(\mu_k\) and covariance \(\Sigma_k\). Parameters are learned via the Expectation-Maximization (EM) algorithm.
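One EM iteration under this model can be sketched as follows (`em_step` is an illustrative helper; a production implementation would compute responsibilities in log space for numerical stability):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pis, mus, Sigmas):
    """One EM iteration for a Gaussian mixture.
    E-step: responsibilities r[i, k] = P(z = k | x_i).
    M-step: re-estimate pi_k, mu_k, Sigma_k from the responsibilities."""
    K = len(pis)
    # E-step: posterior of each component for each point
    r = np.column_stack([pis[k] * multivariate_normal.pdf(X, mus[k], Sigmas[k])
                         for k in range(K)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted maximum-likelihood updates
    Nk = r.sum(axis=0)
    pis = Nk / len(X)
    mus = (r.T @ X) / Nk[:, None]
    Sigmas = [(r[:, k, None] * (X - mus[k])).T @ (X - mus[k]) / Nk[k]
              for k in range(K)]
    return pis, mus, Sigmas

# Demo on two well-separated synthetic clusters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
pis, mus = np.array([0.5, 0.5]), np.array([[0.0, 0.0], [4.0, 4.0]])
Sigmas = [np.eye(2), np.eye(2)]
for _ in range(20):
    pis, mus, Sigmas = em_step(X, pis, mus, Sigmas)
print(np.round(mus, 1))  # component means close to (0, 0) and (5, 5)
```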

Why: A principled probabilistic framework for clustering

GMMs address several limitations of K-Means: they can model elliptical clusters (through full covariance matrices), they provide uncertainty estimates (the posterior probability \(P(z=k|x)\) tells you how confident the assignment is), and they offer principled model selection via the Bayesian Information Criterion (BIC). In many real-world settings – anomaly detection, density estimation, speech recognition – knowing the confidence of a cluster assignment is as important as the assignment itself.

# Generate data from mixture
X_gmm, y_gmm = make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 1.5, 0.5], random_state=42)

# Fit GMM
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X_gmm)
labels_gmm = gmm.predict(X_gmm)
probs = gmm.predict_proba(X_gmm)

# Compare with K-Means
kmeans_gmm = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_km = kmeans_gmm.fit_predict(X_gmm)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# K-Means (hard assignment)
axes[0].scatter(X_gmm[:, 0], X_gmm[:, 1], c=labels_km, cmap='viridis', s=50, alpha=0.6)
axes[0].set_title('K-Means (Hard Assignment)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# GMM (soft assignment - show uncertainty)
certainty = np.max(probs, axis=1)  # max posterior probability per point
axes[1].scatter(X_gmm[:, 0], X_gmm[:, 1], c=labels_gmm, cmap='viridis', 
               s=50, alpha=0.6, edgecolors='black', linewidths=0.5)
# Highlight uncertain points: no component has a dominant posterior
uncertain_idx = certainty < 0.9
axes[1].scatter(X_gmm[uncertain_idx, 0], X_gmm[uncertain_idx, 1], 
               s=200, facecolors='none', edgecolors='red', linewidths=2)
axes[1].set_title('GMM (Soft Assignment)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nGMM Parameters:")
print(f"Means:\n{gmm.means_}")
print(f"\nCovariances shape: {gmm.covariances_.shape}")
print(f"Weights: {gmm.weights_}")

8.4 Hierarchical Clustering

What: Building a tree of nested clusters without specifying K

Agglomerative (bottom-up) hierarchical clustering starts with each point as its own cluster, then repeatedly merges the two closest clusters until only one remains. The result is a dendrogram – a tree diagram showing the merge history – which can be “cut” at any height to produce a flat clustering with any number of groups. The distance between clusters is defined by a linkage criterion: single (min distance), complete (max distance), average, or Ward’s method (minimize variance increase).
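Cutting the dendrogram at a chosen height to obtain a flat clustering is done with scipy's fcluster; a minimal sketch on a toy dataset:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(42)
# Two tight groups far apart
X = np.vstack([rng.normal(0, 0.1, (5, 2)), rng.normal(10, 0.1, (5, 2))])

Z = linkage(X, method='ward')                    # merge history, (n-1) rows
labels = fcluster(Z, t=5, criterion='distance')  # cut where merges exceed distance 5
print(labels)  # two flat clusters (fcluster labels start at 1)
```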

Why: Exploratory analysis when the number of clusters is unknown

Hierarchical clustering is uniquely valuable for exploring the structure of data because the dendrogram reveals clusters at multiple scales simultaneously. Unlike K-Means, there is no need to choose \(K\) upfront – you can inspect the dendrogram and cut it where the merge distance jumps. Ward’s linkage, which minimizes the total within-cluster variance at each merge, tends to produce the most balanced and interpretable trees. The main drawback is \(O(n^2)\) memory and \(O(n^3)\) time, limiting it to small-to-medium datasets.

# Generate data
X_hier, _ = make_blobs(n_samples=50, centers=3, random_state=42)

# Compute linkage
linkage_matrix = linkage(X_hier, method='ward')

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Dendrogram
dendrogram(linkage_matrix, ax=axes[0], color_threshold=10)
axes[0].set_title('Hierarchical Clustering Dendrogram', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Sample Index', fontsize=12)
axes[0].set_ylabel('Distance', fontsize=12)
axes[0].axhline(y=10, c='red', linestyle='--', label='Cut at distance=10')
axes[0].legend(fontsize=11)

# Clustering result
agg = AgglomerativeClustering(n_clusters=3)
labels_hier = agg.fit_predict(X_hier)
axes[1].scatter(X_hier[:, 0], X_hier[:, 1], c=labels_hier, cmap='viridis', s=100, edgecolors='black')
axes[1].set_title('Agglomerative Clustering Result', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

8.5 DBSCAN (Density-Based)

What: Discovering clusters of arbitrary shape based on point density

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as contiguous regions of high density separated by regions of low density. A point is a core point if it has at least min_samples neighbors within radius eps; points reachable from core points belong to the same cluster, and unreachable points are labeled as noise. Unlike K-Means and GMM, DBSCAN does not assume any particular cluster shape.

Why: Handling the shapes that K-Means cannot

K-Means assumes spherical clusters with roughly equal variance – it fails dramatically on non-convex shapes like interlocking crescents or rings. DBSCAN discovers clusters of any shape, automatically determines the number of clusters, and identifies outliers as noise points. This makes it ideal for spatial data (geographic clustering), anomaly detection, and any domain where cluster boundaries are irregular. The tradeoff is sensitivity to the eps and min_samples parameters, which must be tuned for each dataset’s density scale.
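A common heuristic for tuning eps is the k-distance plot: sort every point's distance to its min_samples-th nearest neighbor and look for the knee. A minimal sketch using scikit-learn's NearestNeighbors:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
dists, _ = nn.kneighbors(X)
# dists[:, -1]: distance to the min_samples-th closest point
# (the point itself is included, matching sklearn DBSCAN's count)
k_dist = np.sort(dists[:, -1])

plt.plot(k_dist, linewidth=2)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to neighbor #{min_samples}')
plt.title('k-distance Plot: Pick eps Near the Knee')
plt.grid(True, alpha=0.3)
plt.show()
```

Points in dense regions form the flat left portion of the curve; the sharp rise marks noise, and an eps near the bend separates the two.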

# Generate non-convex data
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)

# Compare algorithms
kmeans_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_km_moons = kmeans_moons.fit_predict(X_moons)

dbscan = DBSCAN(eps=0.2, min_samples=5)
labels_dbscan = dbscan.fit_predict(X_moons)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# K-Means fails on non-convex
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_km_moons, cmap='viridis', s=50)
axes[0].set_title('K-Means (Fails on Non-Convex)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)

# DBSCAN succeeds
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_dbscan, cmap='viridis', s=50)
# Highlight noise points
noise = labels_dbscan == -1
axes[1].scatter(X_moons[noise, 0], X_moons[noise, 1], c='red', marker='x', s=100, label='Noise')
axes[1].set_title('DBSCAN (Handles Non-Convex)', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nDBSCAN found {len(set(labels_dbscan)) - (1 if -1 in labels_dbscan else 0)} clusters")
print(f"Noise points: {np.sum(noise)}")

Key Takeaways

1. K-Means

  • Simple, fast, scalable

  • Assumes spherical clusters

  • Sensitive to initialization (use k-means++)

  • Need to choose K (elbow method, silhouette score)

2. GMM

  • Probabilistic (soft assignment)

  • Handles elliptical clusters

  • Uses EM algorithm

  • More flexible than K-Means

3. Hierarchical

  • No need to specify K upfront

  • Visualize via dendrogram

  • O(n²) complexity (slow for large data)

  • Linkage methods: single, complete, average, Ward

4. DBSCAN

  • Density-based (finds arbitrary shapes)

  • Handles noise/outliers

  • Automatic cluster detection

  • Parameters: eps (neighborhood radius), min_samples

5. When to Use What

  • K-Means: Large data, spherical clusters, known K

  • GMM: Need probabilities, elliptical clusters

  • Hierarchical: Small data, unknown K, need dendrogram

  • DBSCAN: Arbitrary shapes, noisy data, unknown K

Practice Exercises

  1. Implement GMM from scratch using EM

  2. Apply K-Means to image compression

  3. Compare all methods on real dataset

  4. Tune DBSCAN parameters

  5. Customer segmentation project

References

  1. CS229 Lecture Notes on Clustering

  2. “Pattern Recognition and Machine Learning” - Bishop

  3. “The Elements of Statistical Learning” - Hastie et al.

Next: Lecture 9: Dimensionality Reduction