Lecture 14: Expectation Maximization & Clustering¶
From Andrew Ng’s CS229 Lecture 14 (Autumn 2018)¶
“What I’d like to do today is start our foray into unsupervised learning.” - Andrew Ng
Unsupervised Learning¶
Supervised Learning:
Given x and y (labels)
Learn mapping from x → y
Examples: Regression, Classification
Unsupervised Learning:
Given only x (no labels)
Training set: {x⁽¹⁾, x⁽²⁾, …, x⁽ᵐ⁾}
Find something interesting about the data
First Unsupervised Algorithm: Clustering¶
“The first unsupervised learning algorithm we’ll talk about is clustering in which given a dataset like this, hopefully, we can have an algorithm that can figure out that this dataset has two separate clusters.”
Market Segmentation Use Case¶
“One of the most common uses of clustering is market segmentation. If you have a website selling things online, you have a huge database of many different users and run clustering to decide what are the different market segments.”
Example segments:
Age range and gender
Education level
Geographic location (East Coast vs West Coast)
Behavioral patterns
K-Means Clustering Algorithm¶
Most commonly used clustering algorithm
Algorithm Steps:
Step 1: Initialize
Pick K points as cluster centroids (initial guess)
Denoted by crosses in visualization
Step 2: Iterate
2a. Assignment Step:
“Go through each of your training examples and for each of them you color them either red or blue depending on which is the closer cluster centroid”
Assign each point to nearest centroid
2b. Update Step:
“Look at all the blue dots and compute the average, right? Just find the mean of all the blue dots, and move the blue cluster centroid there. And similarly, look at all the red dots and find the mean”
Move each centroid to the mean of its assigned points
“This is just a standard arithmetic average”
Convergence:
“If you keep running the algorithm, nothing changes. So the algorithm has converged.”
When assignments stop changing
When centroids stop moving
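The two steps above can be sketched in a few lines of NumPy (a minimal illustration on made-up 2-D points, not a full implementation):

```python
import numpy as np

# Toy data: four 2-D points and K=2 hypothetical initial centroids
X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
centroids = np.array([[0.0, 0.5], [5.0, 5.5]])

# Assignment step: each point goes to its nearest centroid
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
labels = np.argmin(distances, axis=1)  # -> [0, 0, 1, 1]

# Update step: each centroid moves to the mean of its assigned points
centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
print(labels, centroids)
```

Repeating these two steps until nothing changes is the whole algorithm.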
Coming Up¶
K-Means mathematical formulation
EM Algorithm (Expectation Maximization)
Generalizes K-Means
Handles soft assignments
Mixture of Gaussians model
Choosing K (number of clusters)
Alternative clustering methods
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_blobs, make_moons
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, davies_bouldin_score
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.stats import multivariate_normal
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
print("Libraries loaded!")
8.1 K-Means from Scratch¶
What: Implementing the K-Means clustering algorithm step by step¶
We build K-Means from scratch following the two-step iterative procedure: (1) the assignment step, which assigns each data point to its nearest centroid by computing Euclidean distances, and (2) the update step, which moves each centroid to the mean of its assigned points. The algorithm minimizes the objective \(J = \sum_{k=1}^{K}\sum_{x \in C_k} \|x - \mu_k\|^2\) (called inertia), which monotonically decreases at each iteration until convergence.
Why: The most widely used clustering algorithm¶
K-Means is the default first choice for clustering due to its simplicity, speed (\(O(mnK)\) per iteration for \(m\) examples, \(n\) features, and \(K\) clusters), and scalability to large datasets. Understanding its mechanics reveals both its strengths (fast convergence on spherical clusters) and limitations (sensitivity to initialization, assumption of equal-variance spherical clusters, need to specify \(K\) in advance). Every variant – K-Means++, mini-batch K-Means, bisecting K-Means – builds on this core algorithm.
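K-Means++, mentioned above, replaces purely random initialization by spreading the seeds apart: the first centroid is picked uniformly, and each later one with probability proportional to its squared distance from the nearest seed chosen so far. A minimal sketch (simplified relative to sklearn's variant, which also samples several candidates per step):

```python
import numpy as np

def kmeans_pp_init(X, k, rng):
    """Sketch of K-Means++ seeding: spread initial centroids apart."""
    centroids = [X[rng.integers(len(X))]]  # first centroid: uniform at random
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen centroid
        d2 = np.min(((X[:, None] - np.array(centroids))**2).sum(axis=2), axis=1)
        probs = d2 / d2.sum()              # far-away points are more likely
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
init = kmeans_pp_init(X, 2, rng)
print(init)  # two seeds, very likely one from each blob
```

Seeding this way makes the final clustering far less dependent on the random draw.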
class KMeansCustom:
    def __init__(self, n_clusters=3, max_iters=100, random_state=None):
        self.n_clusters = n_clusters
        self.max_iters = max_iters
        self.random_state = random_state
        self.centroids = None
        self.labels = None
        self.inertia_history = []

    def fit(self, X):
        if self.random_state is not None:
            np.random.seed(self.random_state)
        # Initialize centroids by picking random data points
        idx = np.random.choice(len(X), self.n_clusters, replace=False)
        self.centroids = X[idx]
        for iteration in range(self.max_iters):
            # Assignment step: each point goes to its nearest centroid
            distances = np.sqrt(((X[:, np.newaxis] - self.centroids)**2).sum(axis=2))
            self.labels = np.argmin(distances, axis=1)
            # Update step: move each centroid to the mean of its points
            new_centroids = np.array([X[self.labels == k].mean(axis=0)
                                      for k in range(self.n_clusters)])
            # Inertia: total within-cluster sum of squared distances
            inertia = sum(np.sum((X[self.labels == k] - new_centroids[k])**2)
                          for k in range(self.n_clusters))
            self.inertia_history.append(inertia)
            # Converged when the centroids stop moving
            if np.allclose(self.centroids, new_centroids):
                break
            self.centroids = new_centroids
        return self

    def predict(self, X):
        distances = np.sqrt(((X[:, np.newaxis] - self.centroids)**2).sum(axis=2))
        return np.argmin(distances, axis=1)
# Test
X_blobs, y_true = make_blobs(n_samples=300, centers=4, random_state=42)
kmeans = KMeansCustom(n_clusters=4, random_state=42)
kmeans.fit(X_blobs)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Clustering result
axes[0].scatter(X_blobs[:, 0], X_blobs[:, 1], c=kmeans.labels, cmap='viridis', s=50, alpha=0.6)
axes[0].scatter(kmeans.centroids[:, 0], kmeans.centroids[:, 1],
                c='red', marker='X', s=200, edgecolors='black', linewidths=2)
axes[0].set_title('K-Means Clustering', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# Inertia convergence
axes[1].plot(kmeans.inertia_history, linewidth=2)
axes[1].set_xlabel('Iteration', fontsize=12)
axes[1].set_ylabel('Inertia', fontsize=12)
axes[1].set_title('K-Means Convergence', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"Converged in {len(kmeans.inertia_history)} iterations")
8.2 Elbow Method for Optimal K¶
What: Choosing the number of clusters using inertia and silhouette scores¶
Since K-Means requires specifying \(K\) in advance, we need a principled way to select it. The elbow method plots inertia vs. \(K\) and looks for a “bend” where adding more clusters yields diminishing returns. The silhouette score provides a complementary view: for each point, it measures how similar it is to its own cluster versus the nearest neighboring cluster, ranging from \(-1\) (wrong cluster) to \(+1\) (well-clustered).
Why: Avoiding arbitrary cluster counts in practice¶
In real applications – customer segmentation, document clustering, image quantization – the true number of clusters is unknown. The elbow method and silhouette analysis are the two standard heuristics practitioners use. Neither is foolproof (the “elbow” can be ambiguous, and silhouette scores favor convex clusters), but together they provide a reasonable starting point. For more rigorous selection, information-theoretic criteria like BIC (used by Gaussian Mixture Models) offer a probabilistic alternative.
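The silhouette score described above can be sanity-checked against its definition: for each point, \(a\) is the mean distance to its own cluster, \(b\) the mean distance to the nearest other cluster, and \(s = (b-a)/\max(a,b)\). A sketch on a tiny made-up dataset:

```python
import numpy as np
from sklearn.metrics import silhouette_score
from scipy.spatial.distance import cdist

# Tiny toy dataset: two well-separated pairs of points
X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
labels = np.array([0, 0, 1, 1])

D = cdist(X, X)
scores = []
for i in range(len(X)):
    same = (labels == labels[i])
    same[i] = False                              # exclude the point itself
    a = D[i, same].mean()                        # cohesion: own cluster
    b = min(D[i, labels == c].mean()             # separation: nearest other
            for c in set(labels) - {labels[i]})
    scores.append((b - a) / max(a, b))           # silhouette of point i

manual = np.mean(scores)
print(manual, silhouette_score(X, labels))       # the two values agree
```

On this dataset each point is much closer to its cluster-mate than to the other pair, so the score is close to +1.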
# Elbow method
K_range = range(1, 11)
inertias = []
silhouettes = []
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_blobs)
    inertias.append(km.inertia_)
    if k > 1:
        silhouettes.append(silhouette_score(X_blobs, km.labels_))
    else:
        silhouettes.append(0)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
axes[0].plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
axes[0].axvline(x=4, color='red', linestyle='--', label='Optimal K=4')
axes[0].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[0].set_ylabel('Inertia', fontsize=12)
axes[0].set_title('Elbow Method', fontsize=14, fontweight='bold')
axes[0].legend(fontsize=11)
axes[0].grid(True, alpha=0.3)
axes[1].plot(K_range, silhouettes, 'go-', linewidth=2, markersize=8)
axes[1].axvline(x=4, color='red', linestyle='--', label='Optimal K=4')
axes[1].set_xlabel('Number of Clusters (K)', fontsize=12)
axes[1].set_ylabel('Silhouette Score', fontsize=12)
axes[1].set_title('Silhouette Analysis', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
8.3 Gaussian Mixture Models (GMM)¶
What: Soft clustering with probabilistic assignments¶
Unlike K-Means, which assigns each point to exactly one cluster (hard assignment), GMMs model the data as a mixture of \(K\) Gaussian distributions and assign each point a probability of belonging to each cluster (soft assignment). The model is: \(p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k)\), where \(\pi_k\) are mixing weights, and each component has its own mean \(\mu_k\) and covariance \(\Sigma_k\). Parameters are learned via the Expectation-Maximization (EM) algorithm.
Why: A principled probabilistic framework for clustering¶
GMMs address several limitations of K-Means: they can model elliptical clusters (through full covariance matrices), they provide uncertainty estimates (the posterior probability \(P(z=k|x)\) tells you how confident the assignment is), and they offer principled model selection via the Bayesian Information Criterion (BIC). In many real-world settings – anomaly detection, density estimation, speech recognition – knowing the confidence of a cluster assignment is as important as the assignment itself.
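The EM updates that GaussianMixture runs under the hood can be sketched directly: the E-step computes each point's responsibilities \(P(z=k \mid x)\), and the M-step re-estimates \(\pi_k\), \(\mu_k\), \(\Sigma_k\) from them. A simplified sketch on synthetic two-blob data (not sklearn's implementation; no regularization or log-space tricks):

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    """One EM iteration for a K-component Gaussian mixture (sketch)."""
    K = len(pi)
    # E-step: responsibilities r[i, k] proportional to pi_k * N(x_i | mu_k, Sigma_k)
    r = np.column_stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                         for k in range(K)])
    r /= r.sum(axis=1, keepdims=True)
    # M-step: weighted re-estimates of all parameters
    Nk = r.sum(axis=0)
    pi = Nk / len(X)
    mu = (r.T @ X) / Nk[:, None]
    Sigma = [(r[:, k, None] * (X - mu[k])).T @ (X - mu[k]) / Nk[k]
             for k in range(K)]
    return pi, mu, Sigma

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
pi = np.array([0.5, 0.5])
mu = np.array([[0.0, 0.0], [4.0, 4.0]])
Sigma = [np.eye(2), np.eye(2)]
for _ in range(20):
    pi, mu, Sigma = em_step(X, pi, mu, Sigma)
print(mu)  # means should approach (0, 0) and (5, 5)
```

With K-Means, the E-step's soft responsibilities would collapse to hard 0/1 assignments and the covariances would be fixed; that is the sense in which EM for GMMs generalizes it.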
# Generate data from mixture
X_gmm, y_gmm = make_blobs(n_samples=400, centers=3, cluster_std=[1.0, 1.5, 0.5], random_state=42)
# Fit GMM
gmm = GaussianMixture(n_components=3, random_state=42)
gmm.fit(X_gmm)
labels_gmm = gmm.predict(X_gmm)
probs = gmm.predict_proba(X_gmm)
# Compare with K-Means
kmeans_gmm = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_km = kmeans_gmm.fit_predict(X_gmm)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# K-Means (hard assignment)
axes[0].scatter(X_gmm[:, 0], X_gmm[:, 1], c=labels_km, cmap='viridis', s=50, alpha=0.6)
axes[0].set_title('K-Means (Hard Assignment)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# GMM (soft assignment - show uncertainty)
max_prob = np.max(probs, axis=1)  # Lower = more uncertain
axes[1].scatter(X_gmm[:, 0], X_gmm[:, 1], c=labels_gmm, cmap='viridis',
                s=50, alpha=0.6, edgecolors='black', linewidths=0.5)
# Highlight uncertain points (max posterior probability below 0.9)
uncertain_idx = max_prob < 0.9
axes[1].scatter(X_gmm[uncertain_idx, 0], X_gmm[uncertain_idx, 1],
                s=200, facecolors='none', edgecolors='red', linewidths=2)
axes[1].set_title('GMM (Soft Assignment)', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nGMM Parameters:")
print(f"Means:\n{gmm.means_}")
print(f"\nCovariances shape: {gmm.covariances_.shape}")
print(f"Weights: {gmm.weights_}")
8.4 Hierarchical Clustering¶
What: Building a tree of nested clusters without specifying K¶
Agglomerative (bottom-up) hierarchical clustering starts with each point as its own cluster, then repeatedly merges the two closest clusters until only one remains. The result is a dendrogram – a tree diagram showing the merge history – which can be “cut” at any height to produce a flat clustering with any number of groups. The distance between clusters is defined by a linkage criterion: single (min distance), complete (max distance), average, or Ward’s method (minimize variance increase).
Why: Exploratory analysis when the number of clusters is unknown¶
Hierarchical clustering is uniquely valuable for exploring the structure of data because the dendrogram reveals clusters at multiple scales simultaneously. Unlike K-Means, there is no need to choose \(K\) upfront – you can inspect the dendrogram and cut it where the merge distance jumps. Ward’s linkage, which minimizes the total within-cluster variance at each merge, tends to produce the most balanced and interpretable trees. The main drawback is \(O(n^2)\) memory and \(O(n^3)\) time, limiting it to small-to-medium datasets.
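Cutting the dendrogram at a chosen height, as described above, is what scipy's fcluster does. A short sketch showing both a distance cut and a fixed-cluster-count cut (the threshold 10 here is an illustrative choice):

```python
import numpy as np
from sklearn.datasets import make_blobs
from scipy.cluster.hierarchy import linkage, fcluster

X, _ = make_blobs(n_samples=50, centers=3, random_state=42)
Z = linkage(X, method='ward')  # merge history: one row per merge

# Cut at a fixed height: every merge above distance 10 is undone
by_distance = fcluster(Z, t=10, criterion='distance')
# Or ask directly for a fixed number of flat clusters
by_count = fcluster(Z, t=3, criterion='maxclust')
print(np.unique(by_distance), np.unique(by_count))
```

The same linkage matrix supports any number of cuts, which is exactly why no upfront choice of \(K\) is needed.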
# Generate data
X_hier, _ = make_blobs(n_samples=50, centers=3, random_state=42)
# Compute linkage
linkage_matrix = linkage(X_hier, method='ward')
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Dendrogram
dendrogram(linkage_matrix, ax=axes[0], color_threshold=10)
axes[0].set_title('Hierarchical Clustering Dendrogram', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Sample Index', fontsize=12)
axes[0].set_ylabel('Distance', fontsize=12)
axes[0].axhline(y=10, c='red', linestyle='--', label='Cut at distance=10')
axes[0].legend(fontsize=11)
# Clustering result
agg = AgglomerativeClustering(n_clusters=3)
labels_hier = agg.fit_predict(X_hier)
axes[1].scatter(X_hier[:, 0], X_hier[:, 1], c=labels_hier, cmap='viridis', s=100, edgecolors='black')
axes[1].set_title('Agglomerative Clustering Result', fontsize=14, fontweight='bold')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
8.5 DBSCAN (Density-Based)¶
What: Discovering clusters of arbitrary shape based on point density¶
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) defines clusters as contiguous regions of high density separated by regions of low density. A point is a core point if it has at least min_samples neighbors within radius eps; points reachable from core points belong to the same cluster, and unreachable points are labeled as noise. Unlike K-Means and GMM, DBSCAN does not assume any particular cluster shape.
Why: Handling the shapes that K-Means cannot¶
K-Means assumes spherical clusters with roughly equal variance – it fails dramatically on non-convex shapes like interlocking crescents or rings. DBSCAN discovers clusters of any shape, automatically determines the number of clusters, and identifies outliers as noise points. This makes it ideal for spatial data (geographic clustering), anomaly detection, and any domain where cluster boundaries are irregular. The tradeoff is sensitivity to the eps and min_samples parameters, which must be tuned for each dataset’s density scale.
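The core-point definition above can be verified with a plain radius query (a sketch; eps=0.2 and min_samples=5 match the values used in this section, and sklearn counts the point itself among its neighbors):

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import DBSCAN

X, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
eps, min_samples = 0.2, 5

# A core point has at least min_samples neighbors within radius eps
# (the point itself is included, matching sklearn's convention)
nn = NearestNeighbors(radius=eps).fit(X)
neighbor_counts = np.array([len(idx) for idx in nn.radius_neighbors(X)[1]])
core_mask = neighbor_counts >= min_samples

# Compare against the core points DBSCAN itself reports
db = DBSCAN(eps=eps, min_samples=min_samples).fit(X)
sk_core = np.zeros(len(X), dtype=bool)
sk_core[db.core_sample_indices_] = True
print(core_mask.sum(), sk_core.sum())  # the two counts should match
```

Everything DBSCAN does follows from this check: clusters grow outward from core points, and points reachable from none of them become noise.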
# Generate non-convex data
X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
# Compare algorithms
kmeans_moons = KMeans(n_clusters=2, random_state=42, n_init=10)
labels_km_moons = kmeans_moons.fit_predict(X_moons)
dbscan = DBSCAN(eps=0.2, min_samples=5)
labels_dbscan = dbscan.fit_predict(X_moons)
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# K-Means fails on non-convex
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_km_moons, cmap='viridis', s=50)
axes[0].set_title('K-Means (Fails on Non-Convex)', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# DBSCAN succeeds
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=labels_dbscan, cmap='viridis', s=50)
# Highlight noise points
noise = labels_dbscan == -1
axes[1].scatter(X_moons[noise, 0], X_moons[noise, 1], c='red', marker='x', s=100, label='Noise')
axes[1].set_title('DBSCAN (Handles Non-Convex)', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nDBSCAN found {len(set(labels_dbscan)) - (1 if -1 in labels_dbscan else 0)} clusters")
print(f"Noise points: {np.sum(noise)}")
Key Takeaways¶
1. K-Means¶
Simple, fast, scalable
Assumes spherical clusters
Sensitive to initialization (use k-means++)
Need to choose K (elbow method, silhouette score)
2. GMM¶
Probabilistic (soft assignment)
Handles elliptical clusters
Uses EM algorithm
More flexible than K-Means
3. Hierarchical¶
No need to specify K upfront
Visualize via dendrogram
O(n²) memory, up to O(n³) time (slow for large data)
Linkage methods: single, complete, average, Ward
4. DBSCAN¶
Density-based (finds arbitrary shapes)
Handles noise/outliers
Automatic cluster detection
Parameters: eps (neighborhood radius), min_samples
5. When to Use What¶
K-Means: Large data, spherical clusters, known K
GMM: Need probabilities, elliptical clusters
Hierarchical: Small data, unknown K, need dendrogram
DBSCAN: Arbitrary shapes, noisy data, unknown K
Practice Exercises¶
Implement GMM from scratch using EM
Apply K-Means to image compression
Compare all methods on real dataset
Tune DBSCAN parameters
Customer segmentation project
References¶
CS229 Lecture Notes on Clustering
“Pattern Recognition and Machine Learning” - Bishop
“The Elements of Statistical Learning” - Hastie et al.