Chapter 12: Unsupervised Learning
Overview
Unsupervised Learning: Find patterns in data without labels
Key Differences from Supervised Learning
| Aspect | Supervised | Unsupervised |
|---|---|---|
| Labels | Yes (Y available) | No (only X) |
| Goal | Predict Y | Find structure |
| Evaluation | Clear metrics | Subjective |
| Examples | Classification, Regression | Clustering, Dim Reduction |
Main Tasks
1. Dimensionality Reduction
Goal: Reduce features while preserving information
Methods: PCA, t-SNE, UMAP
Use: Visualization, preprocessing, compression
2. Clustering
Goal: Group similar observations
Methods: K-Means, Hierarchical, DBSCAN
Use: Customer segmentation, gene expression, image compression
Challenges
❌ No ground truth - hard to validate
❌ Subjective interpretation
❌ Many hyperparameters
❌ Scalability issues
❌ Sensitive to outliers
Applications
✅ Exploratory data analysis
✅ Customer segmentation
✅ Anomaly detection
✅ Feature engineering
✅ Compression
12.1 Principal Component Analysis (PCA)
Goal
Find linear combinations of features that capture maximum variance
Mathematical Formulation
First Principal Component (PC1):
$$Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p$$
where \(\sum_{j=1}^p \phi_{j1}^2 = 1\) (the loading vector is normalized)
Objective: Maximize variance of \(Z_1\)
Subsequent PCs:
PC2: Maximum variance, orthogonal to PC1
PC3: Maximum variance, orthogonal to PC1 and PC2
…
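To make the formulation concrete, here is a minimal NumPy sketch (not part of the chapter's own code; variable names are illustrative) showing that the loading vectors \(\phi_m\) are the eigenvectors of the sample covariance matrix, ordered by eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
A = np.array([[2.0, 0.5, 0.0],
              [0.0, 1.0, 0.3],
              [0.0, 0.0, 0.5]])
X = rng.normal(size=(200, 3)) @ A          # toy data with correlated features

Xc = X - X.mean(axis=0)                    # PCA works on centered data
cov = np.cov(Xc, rowvar=False)             # p x p sample covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigh: symmetric matrices, ascending eigenvalues
order = np.argsort(eigvals)[::-1]          # reorder so the largest variance comes first
phi = eigvecs[:, order]                    # columns are the loading vectors phi_1, phi_2, ...
Z = Xc @ phi                               # scores Z_m = sum_j phi_jm * X_j

print("Variance of each PC:", Z.var(axis=0, ddof=1))
print("Sorted eigenvalues: ", eigvals[order])   # the two should match
```

In practice scikit-learn's `PCA` does this (via the SVD) for you, as in the Iris example below.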
Key Properties
✅ Uncorrelated components
✅ Ordered by variance explained
✅ Total variance preserved
✅ Rotation of original space
Proportion of Variance Explained (PVE)
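For centered features, the standard (ISLR-style) definition of the PVE of the m-th component is

$$\text{PVE}_m = \frac{\sum_{i=1}^{n} z_{im}^2}{\sum_{j=1}^{p}\sum_{i=1}^{n} x_{ij}^2} = \frac{\operatorname{Var}(Z_m)}{\text{total variance}}$$

The cumulative PVE of the first M components is \(\sum_{m=1}^{M} \text{PVE}_m\); in scikit-learn these ratios are available as `pca.explained_variance_ratio_`.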
When to Use PCA
High-dimensional data (p large)
Features are correlated
Need visualization (reduce to 2-3 dims)
Multicollinearity issues
Noise reduction
Important: Scale Features!
PCA is sensitive to feature scale → always standardize first
# PCA Example: Iris Dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names
print("📊 Iris Dataset:")
print(f"   Shape: {X_iris.shape}")
print(f"   Features: {feature_names}")

# Standardize (PCA is scale-sensitive)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)

# Fit PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Variance explained
print("\n📊 Variance Explained by Each PC:")
for i, var in enumerate(pca.explained_variance_ratio_, 1):
    print(f"   PC{i}: {var:.4f} ({var*100:.2f}%)")
print(f"\n   Cumulative variance (PC1+PC2): {pca.explained_variance_ratio_[:2].sum()*100:.2f}%")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Scree plot
axes[0].bar(range(1, len(pca.explained_variance_ratio_) + 1),
            pca.explained_variance_ratio_, alpha=0.7, label='Individual')
axes[0].plot(range(1, len(pca.explained_variance_ratio_) + 1),
             np.cumsum(pca.explained_variance_ratio_), 'ro-', linewidth=2, label='Cumulative')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Variance Explained')
axes[0].set_title('Scree Plot')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# PC1 vs PC2
scatter = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_iris,
                          cmap='viridis', s=100, alpha=0.7, edgecolors='k')
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
axes[1].set_title('First Two Principal Components')
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[1], label='Species')

# Loadings (component weights)
loadings = pca.components_[:2].T
axes[2].bar(range(len(feature_names)), loadings[:, 0], alpha=0.7, label='PC1')
axes[2].bar(range(len(feature_names)), loadings[:, 1], alpha=0.7, label='PC2')
axes[2].set_xticks(range(len(feature_names)))
axes[2].set_xticklabels([f.split()[0] for f in feature_names], rotation=45)
axes[2].set_ylabel('Loading')
axes[2].set_title('Feature Loadings on PC1 & PC2')
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   • PC1 captures overall flower size (petal measurements and sepal length load together)")
print("   • PC2 contrasts sepal vs petal dimensions")
print("   • 2 PCs capture 95%+ of variance → good low-dim representation")
12.2 K-Means Clustering
Goal
Partition data into K distinct clusters
Algorithm
Minimize the total within-cluster variation:
$$\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k)$$
where \(W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} ||x_i - x_{i'}||^2\)
K-Means Steps:
Initialize: Randomly assign K centroids
Assignment: Assign each point to nearest centroid
Update: Recalculate centroids as cluster means
Repeat steps 2-3 until convergence
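The loop below is a bare-bones NumPy sketch of steps 2-3 (for illustration only; the scikit-learn `KMeans` used in the rest of the chapter adds smarter initialization and handles edge cases):

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal K-Means: alternate assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # init from random data points
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid (squared Euclidean)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its assigned points
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):   # no movement -> converged
            break
        centroids = new_centroids                   # (empty clusters not handled in this sketch)
    return labels, centroids
```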
Key Properties
✅ Fast and simple
✅ Scales well to large datasets
✅ Guaranteed to converge
❌ Local optimum (random initialization matters)
❌ Must specify K
❌ Sensitive to outliers
❌ Assumes spherical clusters
Choosing K: Elbow Method
Plot the within-cluster sum of squares against K and look for the "elbow"
Choosing K: Silhouette Score
\(a(i)\) = average distance from point \(i\) to the other points in its own cluster
\(b(i)\) = average distance from point \(i\) to the points in the nearest other cluster
\(s(i) = \dfrac{b(i) - a(i)}{\max(a(i),\, b(i))}\), averaged over all points
Score ranges from -1 (worst) to +1 (best)
# K-Means Clustering
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Generate synthetic data
X_blobs, y_true = make_blobs(n_samples=300, centers=4,
                             cluster_std=0.6, random_state=42)

# Try different K values
K_range = range(2, 9)
inertias = []
silhouettes = []
for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_blobs)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_blobs, kmeans.labels_))

# Visualize K selection
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow plot
axes[0].plot(K_range, inertias, 'o-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('Within-Cluster Sum of Squares')
axes[0].set_title('Elbow Method')
axes[0].grid(True, alpha=0.3)

# Silhouette plot
axes[1].plot(K_range, silhouettes, 's-', linewidth=2, markersize=8, color='orange')
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis')
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Fit with optimal K
optimal_k = 4
kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
y_pred = kmeans_final.fit_predict(X_blobs)

# Visualize clusters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# True labels
ax1.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis',
            s=50, alpha=0.7, edgecolors='k')
ax1.set_title('True Clusters')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')

# K-Means clusters
ax2.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_pred, cmap='viridis',
            s=50, alpha=0.7, edgecolors='k')
ax2.scatter(kmeans_final.cluster_centers_[:, 0],
            kmeans_final.cluster_centers_[:, 1],
            c='red', marker='X', s=300, edgecolors='black', linewidths=2,
            label='Centroids')
ax2.set_title(f'K-Means (K={optimal_k})\nSilhouette: {silhouettes[optimal_k-2]:.3f}')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.legend()
plt.tight_layout()
plt.show()

print(f"\n📊 K-Means Results (K={optimal_k}):")
print(f"   Inertia: {kmeans_final.inertia_:.2f}")
print(f"   Silhouette Score: {silhouette_score(X_blobs, y_pred):.4f}")
print(f"   Cluster sizes: {np.bincount(y_pred)}")
12.3 Hierarchical Clustering
Two Approaches
1. Agglomerative (Bottom-Up):
Start: Each point is a cluster
Iteratively merge closest clusters
End: All points in one cluster
2. Divisive (Top-Down):
Start: All points in one cluster
Iteratively split clusters
End: Each point is a cluster
Linkage Methods
How do we measure the distance between two clusters?
Complete Linkage:
$$d(C_1, C_2) = \max_{i \in C_1,\, j \in C_2} d(i, j)$$
Maximum distance between any two points
Tends to produce compact clusters
Single Linkage:
$$d(C_1, C_2) = \min_{i \in C_1,\, j \in C_2} d(i, j)$$
Minimum distance between any two points
Can produce elongated, chain-like clusters
Average Linkage:
$$d(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(i, j)$$
Average of all pairwise distances
Good compromise
Ward's Method:
Merges the pair of clusters whose fusion gives the smallest increase in within-cluster variance
Often produces balanced clusters
Dendrogram
Tree diagram showing merging history
Height: Distance at which clusters merge
Cut: Horizontal line determines number of clusters
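Programmatically, the "cut" corresponds to SciPy's `fcluster`; a small sketch on synthetic data (the blob parameters are illustrative, not the chapter's):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=150, centers=3, random_state=0)
Z = linkage(X_demo, method='ward')                    # merge history shown by the dendrogram
labels_cut = fcluster(Z, t=3, criterion='maxclust')   # cut so that at most 3 clusters remain
# Or cut at a fixed height on the distance axis:
# labels_cut = fcluster(Z, t=10.0, criterion='distance')
print("Cluster sizes:", np.bincount(labels_cut)[1:])  # fcluster labels start at 1
```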
Advantages
✅ Don't need to specify K in advance
✅ Dendrogram is interpretable
✅ Works with any distance metric
✅ Deterministic (no random initialization)
Disadvantages
❌ Computationally expensive: O(n²) to O(n³)
❌ Sensitive to outliers
❌ Can't undo merges (greedy)
# Hierarchical Clustering
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

# Use a smaller sample for speed
np.random.seed(42)
indices = np.random.choice(len(X_blobs), 100, replace=False)
X_sample = X_blobs[indices]
y_sample = y_true[indices]

# Compute and plot a dendrogram for each linkage method
linkages = ['complete', 'average', 'single', 'ward']
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()
for idx, method in enumerate(linkages):
    Z = linkage(X_sample, method=method)
    dendrogram(Z, ax=axes[idx], no_labels=True, color_threshold=None)
    axes[idx].set_title(f'{method.capitalize()} Linkage')
    axes[idx].set_xlabel('Sample Index')
    axes[idx].set_ylabel('Distance')
plt.tight_layout()
plt.show()

# Compare the clusters produced by each linkage method
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()
for idx, method in enumerate(linkages):
    agg = AgglomerativeClustering(n_clusters=4, linkage=method)
    labels = agg.fit_predict(X_sample)
    axes[idx].scatter(X_sample[:, 0], X_sample[:, 1], c=labels,
                      cmap='viridis', s=100, alpha=0.7, edgecolors='k')
    axes[idx].set_title(f'{method.capitalize()} Linkage (K=4)\n'
                        f'Silhouette: {silhouette_score(X_sample, labels):.3f}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')
plt.tight_layout()
plt.show()

print("\n💡 Linkage Comparison:")
print("   • Complete: compact, spherical clusters")
print("   • Average: good balance")
print("   • Single: can create elongated chains")
print("   • Ward: minimizes variance, often best")
12.4 Real Example: Handwritten Digits
Task
Use PCA to visualize high-dimensional digit images
Cluster digits without using labels
Evaluate if clusters match true digits
# Load digits dataset
from sklearn.datasets import load_digits

digits = load_digits()
X_digits = digits.data
y_digits = digits.target
print("📊 Digits Dataset:")
print(f"   Shape: {X_digits.shape}")
print("   64 features (8x8 pixels)")
print("   10 classes (digits 0-9)")

# PCA for visualization
pca_digits = PCA(n_components=2)
X_pca_2d = pca_digits.fit_transform(X_digits)
print(f"\n   Variance explained by 2 PCs: {pca_digits.explained_variance_ratio_.sum()*100:.2f}%")

# Visualize in 2D
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# True labels
scatter1 = ax1.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y_digits,
                       cmap='tab10', s=20, alpha=0.6)
ax1.set_xlabel(f'PC1 ({pca_digits.explained_variance_ratio_[0]*100:.1f}%)')
ax1.set_ylabel(f'PC2 ({pca_digits.explained_variance_ratio_[1]*100:.1f}%)')
ax1.set_title('Digits in 2D (True Labels)')
plt.colorbar(scatter1, ax=ax1, label='Digit')

# K-Means clustering (on the full 64-dimensional data)
kmeans_digits = KMeans(n_clusters=10, random_state=42, n_init=10)
clusters = kmeans_digits.fit_predict(X_digits)
scatter2 = ax2.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=clusters,
                       cmap='tab10', s=20, alpha=0.6)
ax2.set_xlabel(f'PC1 ({pca_digits.explained_variance_ratio_[0]*100:.1f}%)')
ax2.set_ylabel(f'PC2 ({pca_digits.explained_variance_ratio_[1]*100:.1f}%)')
ax2.set_title(f'K-Means Clusters (K=10)\nSilhouette: {silhouette_score(X_digits, clusters):.3f}')
plt.colorbar(scatter2, ax=ax2, label='Cluster')
plt.tight_layout()
plt.show()

# Show a sample digit from each cluster
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.ravel()
for i in range(10):
    cluster_members = np.where(clusters == i)[0]
    sample_idx = cluster_members[0]
    axes[i].imshow(X_digits[sample_idx].reshape(8, 8), cmap='gray')
    axes[i].set_title(f'Cluster {i}\n(Digit {y_digits[sample_idx]})')
    axes[i].axis('off')
plt.suptitle('Sample from Each Cluster', fontsize=14)
plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • Some digits cluster well (e.g., 0, 1)")
print("   • Others overlap (e.g., 4 and 9, 3 and 8)")
print("   • Unsupervised clustering found this structure without labels!")
Key Takeaways
PCA Best Practices
✅ Always standardize features first
✅ Check variance explained (scree plot)
✅ Use for visualization (reduce to 2-3 dims)
✅ Preprocessing for ML (reduce multicollinearity; see the pipeline sketch below)
❌ Don't use for categorical data
❌ Be careful with interpretation (PCs are linear combinations of the original features)
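A hedged sketch of the preprocessing use case above, chaining standardization, PCA, and a clusterer in one scikit-learn `Pipeline` (the 95% variance threshold and K=10 are illustrative choices, not recommendations from the chapter):

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits

X, _ = load_digits(return_X_y=True)
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=0.95)),                  # keep enough PCs for ~95% of variance
    ('kmeans', KMeans(n_clusters=10, n_init=10, random_state=42)),
])
labels = pipe.fit_predict(X)                          # scale -> project -> cluster in one call
print("PCs kept:", pipe.named_steps['pca'].n_components_)
```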
Clustering Best Practices
K-Means:
Use for large datasets
Try multiple random initializations
Scale features
Use elbow + silhouette to choose K
Hierarchical:
Use for small datasets (n < 10,000)
Dendrogram for visualization
Try different linkage methods
Ward often works well
Validation Metrics
Silhouette Score: range [-1, 1]
> 0.5: Good
0.25-0.5: Weak
< 0.25: Poor
Davies-Bouldin Index: Lower is better (see the short example below)
Ratio of within-cluster to between-cluster distances
Inertia (K-Means): Within-cluster sum of squares
Lower is better
Use for the elbow method
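A short sketch computing the two cluster-quality scores side by side (on synthetic blobs, since the exact data does not matter here):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X_val, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_val)

print(f"Silhouette score: {silhouette_score(X_val, labels):.3f}")       # higher is better
print(f"Davies-Bouldin:   {davies_bouldin_score(X_val, labels):.3f}")   # lower is better
```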
Common Pitfalls
❌ Not scaling features
❌ Assuming clusters exist
❌ Over-interpreting results
❌ Ignoring outliers
❌ Using the wrong distance metric
❌ Not validating clusters
Method Selection
| Task | Method | Why |
|---|---|---|
| Visualization | PCA | Reduce to 2-3 dims |
| Feature reduction | PCA | Remove correlations |
| Large dataset clustering | K-Means | Scalable |
| Unknown K | Hierarchical | Dendrogram helps |
| Non-spherical clusters | DBSCAN | Density-based |
| Interpretability needed | Hierarchical | Visual tree |
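DBSCAN is named in the table above but not demonstrated elsewhere in the chapter; a minimal sketch on two-moons data (the `eps` and `min_samples` values are illustrative and data-dependent):

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=300, noise=0.05, random_state=42)
db = DBSCAN(eps=0.2, min_samples=5).fit(X_moons)                      # density parameters must be tuned
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)    # label -1 marks noise points
print("Clusters found:", n_clusters, "| noise points:", int((db.labels_ == -1).sum()))
```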
Next Chapter
Chapter 13: Multiple Testing
Family-Wise Error Rate (FWER)
False Discovery Rate (FDR)
Bonferroni Correction
Benjamini-Hochberg Procedure
Practice Exercises
Exercise 1: PCA Components
Using iris dataset:
Compute PCA without scaling
Compute PCA with scaling
Compare results (variance explained, loadings)
Explain why scaling matters
Exercise 2: Optimal K Selection
Generate 3-5 true clusters:
Apply K-Means for K=2 to 10
Plot elbow curve
Plot silhouette scores
Use gap statistic
Compare all methods
Exercise 3: Linkage Comparison
Create elongated clusters:
Apply hierarchical with all linkage types
Visualize dendrograms
Cut at same height
Compare resulting clusters
Which linkage works best?
Exercise 4: PCA + Clustering
High-dimensional dataset:
Apply K-Means directly
Apply PCA then K-Means
Compare results (silhouette, speed)
Visualize both approaches
Recommend best pipeline
Exercise 5: Real Data Analysis
Find your own dataset:
Perform EDA
Apply PCA for visualization
Try 2-3 clustering methods
Evaluate and compare
Interpret clusters (what do they represent?)
Visualize results
Exercise 6: Image Compression
Load grayscale image:
Apply PCA with different n_components
Reconstruct image from components
Compare: original vs compressed
Plot: compression ratio vs quality
Find sweet spot