Chapter 12: Unsupervised Learning

Overview

Unsupervised Learning: Find patterns in data without labels

Key Differences from Supervised Learning

| Aspect     | Supervised                 | Unsupervised              |
|------------|----------------------------|---------------------------|
| Labels     | Yes (Y available)          | No (only X)               |
| Goal       | Predict Y                  | Find structure            |
| Evaluation | Clear metrics              | Subjective                |
| Examples   | Classification, Regression | Clustering, Dim Reduction |

Main Tasks

1. Dimensionality Reduction

  • Goal: Reduce features while preserving information

  • Methods: PCA, t-SNE, UMAP

  • Use: Visualization, preprocessing, compression

2. Clustering

  • Goal: Group similar observations

  • Methods: K-Means, Hierarchical, DBSCAN

  • Use: Customer segmentation, gene expression, image compression

Challenges

❌ No ground truth - Hard to validate
❌ Subjective interpretation
❌ Many hyperparameters
❌ Scalability issues
❌ Sensitive to outliers

Applications

✅ Exploratory data analysis
✅ Customer segmentation
✅ Anomaly detection
✅ Feature engineering
✅ Compression

12.1 Principal Component Analysis (PCA)

Goal

Find linear combinations of the features that capture maximum variance

Mathematical Formulation

First Principal Component (PC1):

\[Z_1 = \phi_{11}X_1 + \phi_{21}X_2 + \cdots + \phi_{p1}X_p\]

where \(\sum_{j=1}^p \phi_{j1}^2 = 1\) (normalized)

Objective: Maximize variance of \(Z_1\)

Subsequent PCs:

  • PC2: Maximum variance, orthogonal to PC1

  • PC3: Maximum variance, orthogonal to PC1 and PC2

  • …

Key Properties

✅ Uncorrelated components
✅ Ordered by variance explained
✅ Total variance preserved
✅ Rotation of original space

Proportion of Variance Explained (PVE)

\[PVE_k = \frac{\text{Var}(Z_k)}{\sum_{j=1}^p \text{Var}(X_j)}\]
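
To make the PVE formula concrete, here is a minimal sketch that computes the PVE by hand from the eigenvalues of the sample covariance matrix and checks it against sklearn's explained_variance_ratio_. It uses stand-in random data (names like X_demo are purely illustrative) so the cell runs on its own.

# PVE by hand via the covariance eigendecomposition (stand-in data, for illustration only)
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))  # correlated stand-in data

C = np.cov(X_demo, rowvar=False)               # p x p sample covariance matrix
eigvals = np.linalg.eigvalsh(C)[::-1]          # variances of Z_1..Z_p, largest first
                                               # (eigenvectors of C are the loadings phi_k)
pve = eigvals / eigvals.sum()                  # PVE_k = Var(Z_k) / total variance

print("PVE by hand:", np.round(pve, 4))
print("sklearn:    ", np.round(PCA().fit(X_demo).explained_variance_ratio_, 4))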

When to Use PCA

  • High-dimensional data (p large)

  • Features are correlated

  • Need visualization (reduce to 2-3 dims)

  • Multicollinearity issues

  • Noise reduction

Important: Scale Features!

PCA is sensitive to feature scale → always standardize
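
To see why, here is a small hedged sketch on made-up data (not one of this chapter's datasets) where one feature has a scale roughly 100x larger than the others; without standardization that feature swallows nearly all of PC1.

# Sketch: effect of feature scale on PCA (synthetic data, for illustration only)
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_scale_demo = np.column_stack([
    rng.normal(0, 1, 500),      # feature on a small scale
    rng.normal(0, 100, 500),    # independent feature on a ~100x larger scale
    rng.normal(0, 1, 500),
])

pve_raw = PCA().fit(X_scale_demo).explained_variance_ratio_
pve_std = PCA().fit(StandardScaler().fit_transform(X_scale_demo)).explained_variance_ratio_

print("Unscaled PVE:", np.round(pve_raw, 3))   # PC1 is essentially the large-scale feature
print("Scaled PVE:  ", np.round(pve_std, 3))   # variance spread roughly evenly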

# PCA Example: Iris Dataset
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = load_iris()
X_iris = iris.data
y_iris = iris.target
feature_names = iris.feature_names

print("πŸ“Š Iris Dataset:")
print(f"   Shape: {X_iris.shape}")
print(f"   Features: {feature_names}")

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)

# Fit PCA
pca = PCA()
X_pca = pca.fit_transform(X_scaled)

# Variance explained
print(f"\nπŸ“ˆ Variance Explained by Each PC:")
for i, var in enumerate(pca.explained_variance_ratio_, 1):
    print(f"   PC{i}: {var:.4f} ({var*100:.2f}%)")

print(f"\n   Cumulative variance (PC1+PC2): {pca.explained_variance_ratio_[:2].sum()*100:.2f}%")

# Visualize
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

# Scree plot
axes[0].bar(range(1, len(pca.explained_variance_ratio_)+1), 
           pca.explained_variance_ratio_, alpha=0.7, label='Individual')
axes[0].plot(range(1, len(pca.explained_variance_ratio_)+1),
            np.cumsum(pca.explained_variance_ratio_), 'ro-', linewidth=2,
            label='Cumulative')
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Variance Explained')
axes[0].set_title('Scree Plot')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# PC1 vs PC2
scatter = axes[1].scatter(X_pca[:, 0], X_pca[:, 1], c=y_iris, 
                         cmap='viridis', s=100, alpha=0.7, edgecolors='k')
axes[1].set_xlabel(f'PC1 ({pca.explained_variance_ratio_[0]*100:.1f}%)')
axes[1].set_ylabel(f'PC2 ({pca.explained_variance_ratio_[1]*100:.1f}%)')
axes[1].set_title('First Two Principal Components')
axes[1].grid(True, alpha=0.3)
plt.colorbar(scatter, ax=axes[1], label='Species')

# Loadings (component weights)
loadings = pca.components_[:2].T
axes[2].bar(range(len(feature_names)), loadings[:, 0], alpha=0.7, label='PC1')
axes[2].bar(range(len(feature_names)), loadings[:, 1], alpha=0.7, label='PC2')
axes[2].set_xticks(range(len(feature_names)))
axes[2].set_xticklabels([f.split()[0] for f in feature_names], rotation=45)
axes[2].set_ylabel('Loading')
axes[2].set_title('Feature Loadings on PC1 & PC2')
axes[2].legend()
axes[2].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

print("\nπŸ’‘ Interpretation:")
print("   β€’ PC1 captures overall flower size (all features positive)")
print("   β€’ PC2 contrasts sepal vs petal dimensions")
print("   β€’ 2 PCs capture 95%+ of variance β†’ good low-dim representation")

12.2 K-Means Clustering

Goal

Partition data into K distinct clusters

Algorithm

Minimize the total within-cluster variation:

\[\min_{C_1,\ldots,C_K} \sum_{k=1}^K W(C_k)\]

where \(W(C_k) = \frac{1}{|C_k|} \sum_{i,i' \in C_k} ||x_i - x_{i'}||^2\)

K-Means Steps:

  1. Initialize: Randomly assign K centroids

  2. Assignment: Assign each point to nearest centroid

  3. Update: Recalculate centroids as cluster means

  4. Repeat steps 2-3 until convergence
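
These four steps map almost line-for-line onto NumPy; below is a minimal from-scratch sketch (a single random initialization, no k-means++, no empty-cluster handling), not the scikit-learn implementation used in the code that follows.

# Minimal K-Means sketch following the four steps above (plain NumPy, single random init)
import numpy as np

def kmeans_sketch(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize: pick K data points as the starting centroids
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(n_iter):
        # 2. Assignment: each point joins its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Update: recompute each centroid as the mean of its cluster
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        # 4. Repeat 2-3 until the centroids stop moving (convergence)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids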

Key Properties

✅ Fast and simple
✅ Scales well to large datasets
✅ Guaranteed to converge
❌ Local optimum (random initialization matters)
❌ Must specify K
❌ Sensitive to outliers
❌ Assumes spherical clusters

Choosing K: Elbow Method

Plot within-cluster SS vs K, look for the "elbow"

Choosing K: Silhouette Score

\[s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}\]
  • \(a(i)\) = avg distance to points in same cluster

  • \(b(i)\) = avg distance to points in nearest other cluster

  • Score: -1 (worst) to +1 (best)
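
To connect the formula to code, the sketch below computes s(i) for a single point by hand and compares it with sklearn's silhouette_samples; the small blob dataset and the variable names are assumptions made just for this illustration.

# Sketch: silhouette s(i) for one point by hand vs. sklearn (synthetic data)
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

X_sil, _ = make_blobs(n_samples=60, centers=3, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_sil)

i = 0
d = np.linalg.norm(X_sil - X_sil[i], axis=1)      # distances from point i to every point
same = (labels == labels[i])
same[i] = False                                   # a(i) excludes the point itself
a_i = d[same].mean()                              # avg distance within its own cluster
b_i = min(d[labels == k].mean()                   # avg distance to the nearest other cluster
          for k in set(labels) if k != labels[i])
s_i = (b_i - a_i) / max(a_i, b_i)

print(f"by hand: {s_i:.4f}   sklearn: {silhouette_samples(X_sil, labels)[i]:.4f}")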

# K-Means Clustering
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Generate synthetic data
X_blobs, y_true = make_blobs(n_samples=300, centers=4, 
                             cluster_std=0.6, random_state=42)

# Try different K values
K_range = range(2, 9)
inertias = []
silhouettes = []

for k in K_range:
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans.fit(X_blobs)
    inertias.append(kmeans.inertia_)
    silhouettes.append(silhouette_score(X_blobs, kmeans.labels_))

# Visualize K selection
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Elbow plot
axes[0].plot(K_range, inertias, 'o-', linewidth=2, markersize=8)
axes[0].set_xlabel('Number of Clusters (K)')
axes[0].set_ylabel('Within-Cluster Sum of Squares')
axes[0].set_title('Elbow Method')
axes[0].grid(True, alpha=0.3)

# Silhouette plot
axes[1].plot(K_range, silhouettes, 's-', linewidth=2, markersize=8, color='orange')
axes[1].set_xlabel('Number of Clusters (K)')
axes[1].set_ylabel('Silhouette Score')
axes[1].set_title('Silhouette Analysis')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Fit with optimal K
optimal_k = 4
kmeans_final = KMeans(n_clusters=optimal_k, random_state=42, n_init=10)
y_pred = kmeans_final.fit_predict(X_blobs)

# Visualize clusters
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# True labels
ax1.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis', 
           s=50, alpha=0.7, edgecolors='k')
ax1.set_title('True Clusters')
ax1.set_xlabel('Feature 1')
ax1.set_ylabel('Feature 2')

# K-Means clusters
ax2.scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_pred, cmap='viridis',
           s=50, alpha=0.7, edgecolors='k')
ax2.scatter(kmeans_final.cluster_centers_[:, 0], 
           kmeans_final.cluster_centers_[:, 1],
           c='red', marker='X', s=300, edgecolors='black', linewidths=2,
           label='Centroids')
ax2.set_title(f'K-Means (K={optimal_k})\nSilhouette: {silhouettes[optimal_k-2]:.3f}')
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.legend()

plt.tight_layout()
plt.show()

print(f"\nπŸ“Š K-Means Results (K={optimal_k}):")
print(f"   Inertia: {kmeans_final.inertia_:.2f}")
print(f"   Silhouette Score: {silhouette_score(X_blobs, y_pred):.4f}")
print(f"   Cluster sizes: {np.bincount(y_pred)}")

12.3 Hierarchical Clustering

Two Approaches

1. Agglomerative (Bottom-Up):

  • Start: Each point is a cluster

  • Iteratively merge closest clusters

  • End: All points in one cluster

2. Divisive (Top-Down):

  • Start: All points in one cluster

  • Iteratively split clusters

  • End: Each point is a cluster

Linkage Methods

How to measure distance between clusters?

Complete Linkage:

\[d(C_1, C_2) = \max_{i \in C_1, j \in C_2} d(i, j)\]

  • Maximum distance between any two points

  • Tends to produce compact clusters

Single Linkage:

\[d(C_1, C_2) = \min_{i \in C_1, j \in C_2} d(i, j)\]

  • Minimum distance

  • Can produce elongated clusters

Average Linkage:

\[d(C_1, C_2) = \frac{1}{|C_1||C_2|} \sum_{i \in C_1} \sum_{j \in C_2} d(i, j)\]

  • Average of all pairwise distances

  • Good compromise

Ward's Method:

  • Minimize within-cluster variance

  • Often produces balanced clusters
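
These definitions are easy to verify on a toy example; the sketch below uses scipy's cdist on two made-up clusters of two points each to show how the complete, single, and average linkage distances differ for the same pair of clusters.

# Sketch: complete / single / average linkage between two tiny clusters (made-up points)
import numpy as np
from scipy.spatial.distance import cdist

C1 = np.array([[0.0, 0.0], [1.0, 0.0]])   # cluster 1
C2 = np.array([[4.0, 0.0], [6.0, 0.0]])   # cluster 2

D = cdist(C1, C2)                          # all pairwise distances between the two clusters
print("complete:", D.max())                # max pairwise distance -> 6.0
print("single:  ", D.min())                # min pairwise distance -> 3.0
print("average: ", D.mean())               # mean of all pairs     -> 4.5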

Dendrogram

Tree diagram showing merging history

  • Height: Distance at which clusters merge

  • Cut: Horizontal line determines number of clusters
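
In code, the cut is done with scipy's fcluster, either at a chosen height or for a chosen number of clusters; a short sketch on assumed synthetic blobs (not the sample used in the cells below) follows.

# Sketch: cut a dendrogram with scipy's fcluster (synthetic data, for illustration only)
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

X_cut, _ = make_blobs(n_samples=50, centers=3, random_state=0)
Z = linkage(X_cut, method='ward')                        # the dendrogram's merge history

labels_by_k = fcluster(Z, t=3, criterion='maxclust')     # cut so that 3 clusters remain
labels_by_h = fcluster(Z, t=10.0, criterion='distance')  # cut at height (distance) 10.0

print("cluster sizes (K=3):", np.bincount(labels_by_k)[1:])   # fcluster labels start at 1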

Advantages

✅ Don't need to specify K
✅ Dendrogram is interpretable
✅ Any distance metric
✅ Deterministic (no random initialization)

Disadvantages

❌ Computationally expensive: O(n²) or O(n³)
❌ Sensitive to outliers
❌ Can't undo merges

# Hierarchical Clustering
from scipy.cluster.hierarchy import linkage, dendrogram
from sklearn.cluster import AgglomerativeClustering

# Use a smaller sample for speed
np.random.seed(42)
indices = np.random.choice(len(X_blobs), 100, replace=False)
X_sample = X_blobs[indices]
y_sample = y_true[indices]

# Compute linkage
linkages = ['complete', 'average', 'single', 'ward']
fig, axes = plt.subplots(2, 2, figsize=(16, 12))
axes = axes.ravel()

for idx, method in enumerate(linkages):
    Z = linkage(X_sample, method=method)
    dendrogram(Z, ax=axes[idx], no_labels=True, color_threshold=None)
    axes[idx].set_title(f'{method.capitalize()} Linkage')
    axes[idx].set_xlabel('Sample Index')
    axes[idx].set_ylabel('Distance')

plt.tight_layout()
plt.show()

# Compare linkage methods
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, method in enumerate(linkages):
    agg = AgglomerativeClustering(n_clusters=4, linkage=method)
    labels = agg.fit_predict(X_sample)
    
    axes[idx].scatter(X_sample[:, 0], X_sample[:, 1], c=labels,
                     cmap='viridis', s=100, alpha=0.7, edgecolors='k')
    axes[idx].set_title(f'{method.capitalize()} Linkage (K=4)\n'
                       f'Silhouette: {silhouette_score(X_sample, labels):.3f}')
    axes[idx].set_xlabel('Feature 1')
    axes[idx].set_ylabel('Feature 2')

plt.tight_layout()
plt.show()

print("\nπŸ’‘ Linkage Comparison:")
print("   β€’ Complete: Compact, spherical clusters")
print("   β€’ Average: Good balance")
print("   β€’ Single: Can create elongated chains")
print("   β€’ Ward: Minimizes variance, often best")

12.4 Real Example: Handwritten Digits

Task

  1. Use PCA to visualize high-dimensional digit images

  2. Cluster digits without using labels

  3. Evaluate if clusters match true digits

# Load digits dataset
from sklearn.datasets import load_digits

digits = load_digits()
X_digits = digits.data
y_digits = digits.target

print("πŸ“Š Digits Dataset:")
print(f"   Shape: {X_digits.shape}")
print(f"   64 features (8x8 pixels)")
print(f"   10 classes (digits 0-9)")

# PCA for visualization
pca_digits = PCA(n_components=2)
X_pca_2d = pca_digits.fit_transform(X_digits)

print(f"\n   Variance explained by 2 PCs: {pca_digits.explained_variance_ratio_.sum()*100:.2f}%")

# Visualize in 2D
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# True labels
scatter1 = ax1.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y_digits,
                      cmap='tab10', s=20, alpha=0.6)
ax1.set_xlabel(f'PC1 ({pca_digits.explained_variance_ratio_[0]*100:.1f}%)')
ax1.set_ylabel(f'PC2 ({pca_digits.explained_variance_ratio_[1]*100:.1f}%)')
ax1.set_title('Digits in 2D (True Labels)')
plt.colorbar(scatter1, ax=ax1, label='Digit')

# K-Means clustering
kmeans_digits = KMeans(n_clusters=10, random_state=42, n_init=10)
clusters = kmeans_digits.fit_predict(X_digits)

scatter2 = ax2.scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=clusters,
                      cmap='tab10', s=20, alpha=0.6)
ax2.set_xlabel(f'PC1 ({pca_digits.explained_variance_ratio_[0]*100:.1f}%)')
ax2.set_ylabel(f'PC2 ({pca_digits.explained_variance_ratio_[1]*100:.1f}%)')
ax2.set_title(f'K-Means Clusters (K=10)\nSilhouette: {silhouette_score(X_digits, clusters):.3f}')
plt.colorbar(scatter2, ax=ax2, label='Cluster')

plt.tight_layout()
plt.show()

# Show sample digits from each cluster
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.ravel()

for i in range(10):
    cluster_members = np.where(clusters == i)[0]
    sample_idx = cluster_members[0]
    axes[i].imshow(X_digits[sample_idx].reshape(8, 8), cmap='gray')
    axes[i].set_title(f'Cluster {i}\n(Digit {y_digits[sample_idx]})')
    axes[i].axis('off')

plt.suptitle('Sample from Each Cluster', fontsize=14)
plt.tight_layout()
plt.show()

print("\nπŸ’‘ Observations:")
print("   β€’ Some digits cluster well (e.g., 0, 1)")
print("   β€’ Others overlap (e.g., 4 and 9, 3 and 8)")
print("   β€’ Unsupervised clustering without labels!")

Key Takeaways

PCA Best Practices

✅ Always standardize features first
✅ Check variance explained (scree plot)
✅ Use for visualization (reduce to 2-3 dims)
✅ Preprocessing for ML (reduce multicollinearity)
❌ Don't use for categorical data
❌ Careful with interpretation (PCs are combinations of features)
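
For the preprocessing use case, PCA is typically chained with scaling and a downstream model; below is a minimal sketch using a scikit-learn Pipeline, where the classifier choice and n_components=0.95 (keep enough PCs for ~95% of the variance) are illustrative assumptions rather than recommendations from this chapter.

# Sketch: scale -> PCA -> classifier in one pipeline (illustrative choices)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),            # keep enough components for ~95% of the variance
    LogisticRegression(max_iter=2000),
)
print(f"CV accuracy with PCA preprocessing: {cross_val_score(pipe, X, y, cv=5).mean():.3f}")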

Clustering Best Practices

K-Means:

  • Use for large datasets

  • Try multiple random initializations

  • Scale features

  • Use elbow + silhouette to choose K

Hierarchical:

  • Use for small datasets (n < 10,000)

  • Dendrogram for visualization

  • Try different linkage methods

  • Ward often works well

Validation Metrics

Silhouette Score: range [-1, 1]

  • > 0.5: Good

  • 0.25-0.5: Weak

  • < 0.25: Poor

Davies-Bouldin Index: Lower is better

  • Ratio of within-cluster to between-cluster distances

Inertia (K-Means): Within-cluster sum of squares

  • Lower is better

  • Use for elbow method
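
Both metrics are available in scikit-learn; the sketch below (on assumed synthetic blobs) compares silhouette and Davies-Bouldin across a few values of K, remembering that higher is better for the former and lower for the latter.

# Sketch: silhouette (higher = better) vs Davies-Bouldin (lower = better) across K
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, davies_bouldin_score

X_val, _ = make_blobs(n_samples=300, centers=4, cluster_std=0.6, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_val)
    print(f"K={k}: silhouette={silhouette_score(X_val, labels):.3f}, "
          f"Davies-Bouldin={davies_bouldin_score(X_val, labels):.3f}")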

Common Pitfalls

❌ Not scaling features
❌ Assuming clusters exist
❌ Over-interpreting results
❌ Ignoring outliers
❌ Using wrong distance metric
❌ Not validating clusters

Method Selection

| Task                     | Method       | Why                 |
|--------------------------|--------------|---------------------|
| Visualization            | PCA          | Reduce to 2-3 dims  |
| Feature reduction        | PCA          | Remove correlations |
| Large dataset clustering | K-Means      | Scalable            |
| Unknown K                | Hierarchical | Dendrogram helps    |
| Non-spherical clusters   | DBSCAN       | Density-based       |
| Interpretability needed  | Hierarchical | Visual tree         |

Next Chapter

Chapter 13: Multiple Testing

  • Family-Wise Error Rate (FWER)

  • False Discovery Rate (FDR)

  • Bonferroni Correction

  • Benjamini-Hochberg Procedure

Practice Exercises

Exercise 1: PCA Components

Using iris dataset:

  1. Compute PCA without scaling

  2. Compute PCA with scaling

  3. Compare results (variance explained, loadings)

  4. Explain why scaling matters

Exercise 2: Optimal K Selection

Generate 3-5 true clusters:

  1. Apply K-Means for K=2 to 10

  2. Plot elbow curve

  3. Plot silhouette scores

  4. Use gap statistic

  5. Compare all methods

Exercise 3: Linkage Comparison

Create elongated clusters:

  1. Apply hierarchical with all linkage types

  2. Visualize dendrograms

  3. Cut at same height

  4. Compare resulting clusters

  5. Which linkage works best?

Exercise 4: PCA + Clustering

High-dimensional dataset:

  1. Apply K-Means directly

  2. Apply PCA then K-Means

  3. Compare results (silhouette, speed)

  4. Visualize both approaches

  5. Recommend best pipeline

Exercise 5: Real Data Analysis

Find your own dataset:

  1. Perform EDA

  2. Apply PCA for visualization

  3. Try 2-3 clustering methods

  4. Evaluate and compare

  5. Interpret clusters (what do they represent?)

  6. Visualize results

Exercise 6: Image Compression

Load grayscale image:

  1. Apply PCA with different n_components

  2. Reconstruct image from components

  3. Compare: original vs compressed

  4. Plot: compression ratio vs quality

  5. Find sweet spot