Lectures 15-17: EM, Factor Analysis, PCA & ICA¶
📹 Watch Lecture 15 | Watch Lecture 16 | Watch Lecture 17
From Andrew Ng’s CS229 Lectures 15-17 (Autumn 2018)¶
Lecture 15: EM Algorithm & Factor Analysis¶
“What we’ll see today is additional elaborations on the EM, on the expectation maximization algorithm.” - Andrew Ng
Topics:
Quick recap of EM algorithm
Monitoring EM convergence
E and M steps for mixture of Gaussians
Factor Analysis model (main focus)
Properties of Gaussian distributions
Why Factor Analysis?
“This model is useful for data that can be very high-dimensional even when you have very few training examples.”
Learning Goals:
Useful algorithm in its own right
EM derivation practice (one of the trickier examples)
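As a quick illustration of the high-dimensional, few-samples regime the lecture highlights, here is a minimal sketch using scikit-learn's FactorAnalysis; the synthetic data, dimensions, and variable names are illustrative assumptions, not from the lecture:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
# Few samples (m=20), many dimensions (d=50)
z = rng.normal(size=(20, 3))                     # latent factors z ~ N(0, I)
Lam = rng.normal(size=(50, 3))                   # factor loading matrix Lambda
X = z @ Lam.T + 0.1 * rng.normal(size=(20, 50))  # x = Lambda z + noise

fa = FactorAnalysis(n_components=3).fit(X)
print(fa.components_.shape)  # one loading vector per latent factor
```

Even with m < d, the model fits, because factor analysis constrains the covariance to \(\Lambda\Lambda^T + \Psi\) with diagonal \(\Psi\) rather than estimating a full covariance matrix.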
EM Algorithm Recap¶
E-Step:
Constructs lower bound of log likelihood
Makes lower bound tight at current θ
Different choices of Q give different bounds
M-Step:
Maximizes the lower bound
Updates parameters θ
Mixture of Gaussians:
P(xⁱ, zⁱ) = P(xⁱ|zⁱ) P(zⁱ)
z ~ Multinomial(φ)
x|z=j ~ N(μⱼ, Σⱼ)
Key difference from GDA:
“The key difference is that in this density estimation problem, z is not observed or z is a latent random variable. Which is why we have all this machinery of EM.”
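The E and M steps for a mixture of Gaussians can be sketched directly in NumPy; the toy data and initialization below are illustrative assumptions, not from the lecture:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
# Toy data: two well-separated Gaussian clusters in 2-D
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])
k, (m, d) = 2, X.shape

# Initialize phi, mu, Sigma
phi = np.full(k, 1 / k)
mu = X[rng.choice(m, k, replace=False)]
Sigma = np.stack([np.eye(d)] * k)

for _ in range(20):
    # E-step: responsibilities w[i, j] = P(z=j | x_i; theta)
    pdf = np.stack([multivariate_normal.pdf(X, mu[j], Sigma[j])
                    for j in range(k)], axis=1)
    w = phi * pdf
    w /= w.sum(axis=1, keepdims=True)
    # M-step: maximize the lower bound w.r.t. phi, mu, Sigma
    Nj = w.sum(axis=0)
    phi = Nj / m
    mu = (w.T @ X) / Nj[:, None]
    for j in range(k):
        Xc = X - mu[j]
        Sigma[j] = (w[:, j, None] * Xc).T @ Xc / Nj[j]

print(np.sort(mu[:, 0]))  # component means recover the two cluster centers
```

Note that because z is latent, the M-step uses soft responsibilities w where GDA would use observed labels.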
Lecture 16: Principal Component Analysis (PCA)¶
Purpose:
Dimensionality reduction
Find principal directions of variation
Project high-dimensional data to lower dimensions
Applications:
Data visualization (reduce to 2D/3D)
Noise reduction
Feature extraction
Compression
Preprocessing for other algorithms
Key Concepts:
Maximize variance in projected space
Minimize reconstruction error
Eigenvalue decomposition
SVD (Singular Value Decomposition)
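The eigendecomposition and SVD views give the same principal components; a small sketch on synthetic correlated data (the data itself is an illustrative assumption) checks the equivalence:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))  # correlated features
Xc = X - X.mean(axis=0)

# View 1: eigendecomposition of the covariance matrix
evals, evecs = np.linalg.eigh(np.cov(Xc.T))
evals = evals[::-1]  # eigh returns ascending order; flip to descending

# View 2: SVD of the centered data (no covariance matrix ever formed)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
svd_evals = s**2 / (len(X) - 1)  # singular values -> variances

print(np.allclose(evals, svd_evals))  # True
```

The SVD route is preferred numerically (it avoids squaring the condition number), which is why scikit-learn's PCA uses it internally.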
When to use:
Have many correlated features
Need to visualize high-dimensional data
Want to reduce computational cost
Remove noise from data
Lecture 17: Independent Component Analysis (ICA)¶
Cocktail Party Problem:
Multiple microphones recording mixed signals
Want to separate individual speakers
Blind source separation
Key Difference from PCA:
PCA: Finds uncorrelated components (orthogonal)
ICA: Finds statistically independent components
Applications:
Audio source separation
Medical signal processing (EEG, fMRI)
Image processing
Financial data analysis
Mathematical Foundation:
Non-Gaussian distributions
Maximizing non-Gaussianity
Information theory concepts
Common Thread: Latent Variable Models¶
All three topics use latent (hidden) variables:
EM/Factor Analysis: Latent class assignments or factors
PCA: Latent low-dimensional representation
ICA: Latent independent sources
Practical Workflow¶
1. Data Exploration
Visualize with PCA
Check for clusters
2. Unsupervised Learning
EM/K-Means for clustering
Factor Analysis for high-dim, few samples
3. Feature Engineering
PCA for dimensionality reduction
ICA for source separation
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_digits, load_iris, fetch_olivetti_faces
from sklearn.decomposition import PCA, FastICA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)
print("Libraries loaded!")
9.1 PCA from Scratch¶
What: Implementing Principal Component Analysis via eigendecomposition¶
We build PCA from first principles: center the data by subtracting the mean, compute the covariance matrix \(\Sigma = \frac{1}{m}X^T X\), then find its eigenvectors (the principal components) and eigenvalues (the variance explained along each direction). The top \(k\) eigenvectors define the low-dimensional subspace that captures the most variance in the data. Projection is simply \(Z = X_{\text{centered}} \cdot W_k\), where \(W_k\) contains the top \(k\) eigenvectors.
Why: The most fundamental dimensionality reduction technique¶
PCA appears everywhere in ML: as a preprocessing step to decorrelate features before feeding them to classifiers, as a visualization tool for projecting high-dimensional data to 2D/3D, as a compression method for images and signals, and as the mathematical foundation for techniques like factor analysis and probabilistic PCA. Understanding that PCA finds the directions of maximum variance (equivalently, minimizes reconstruction error \(\|X - X W_k W_k^T\|^2\)) connects it to the broader framework of linear algebra and optimization that underlies all of machine learning.
class PCACustom:
    def __init__(self, n_components=2):
        self.n_components = n_components
        self.components = None
        self.mean = None
        self.explained_variance = None

    def fit(self, X):
        # Center data
        self.mean = np.mean(X, axis=0)
        X_centered = X - self.mean
        # Covariance matrix
        cov = np.cov(X_centered.T)
        # Eigendecomposition; eigh exploits symmetry of the covariance
        # matrix and guarantees real eigenvalues (eig may return complex)
        eigenvalues, eigenvectors = np.linalg.eigh(cov)
        # Sort by eigenvalue, descending
        idx = eigenvalues.argsort()[::-1]
        eigenvalues = eigenvalues[idx]
        eigenvectors = eigenvectors[:, idx]
        # Store top-k components and their variances
        self.components = eigenvectors[:, :self.n_components]
        self.explained_variance = eigenvalues[:self.n_components]
        return self

    def transform(self, X):
        X_centered = X - self.mean
        return X_centered @ self.components

    def inverse_transform(self, X_transformed):
        return X_transformed @ self.components.T + self.mean
# Test on Iris
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
# Apply PCA
pca = PCACustom(n_components=2)
X_pca = pca.fit(X_scaled).transform(X_scaled)
# Visualize
plt.figure(figsize=(10, 7))
scatter = plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y_iris, cmap='viridis',
                      s=100, edgecolors='black', alpha=0.7)
# Express each eigenvalue as a fraction of the *total* variance,
# not just the variance of the two retained components
total_var = X_scaled.var(axis=0, ddof=1).sum()
plt.xlabel(f'PC1 ({pca.explained_variance[0]/total_var:.1%} variance)', fontsize=12)
plt.ylabel(f'PC2 ({pca.explained_variance[1]/total_var:.1%} variance)', fontsize=12)
plt.title('PCA on Iris Dataset', fontsize=14, fontweight='bold')
plt.colorbar(scatter, label='Species')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
9.2 Explained Variance¶
What: Quantifying how much information each principal component captures¶
Each eigenvalue \(\lambda_k\) of the covariance matrix represents the variance of the data along the \(k\)-th principal component. The explained variance ratio \(\lambda_k / \sum_j \lambda_j\) tells us what fraction of the total variability is captured by that component. The cumulative explained variance shows how many components are needed to retain a desired fraction (commonly 90% or 95%) of the total information.
Why: Principled dimensionality selection¶
The explained variance plot (scree plot) is the standard tool for deciding how many components to keep. For the digits dataset with 64 pixel features, we typically find that only 20-30 components capture 95% of the variance – a compression ratio of 2-3x with minimal information loss. In practice, the choice of threshold depends on the downstream task: visualization requires just 2-3 components, while classification may need enough components to retain 95-99% of variance. The sharp dropoff in the scree plot also reveals the intrinsic dimensionality of the data – the number of meaningful degrees of freedom, as opposed to the ambient dimension.
# Load digits
digits = load_digits()
X_digits = digits.data
y_digits = digits.target
# Scale
X_digits_scaled = StandardScaler().fit_transform(X_digits)
# Full PCA
pca_full = PCA()
pca_full.fit(X_digits_scaled)
# Plot explained variance
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
# Individual variance
axes[0].bar(range(len(pca_full.explained_variance_ratio_)),
pca_full.explained_variance_ratio_, alpha=0.7)
axes[0].set_xlabel('Principal Component', fontsize=12)
axes[0].set_ylabel('Explained Variance Ratio', fontsize=12)
axes[0].set_title('Variance per Component', fontsize=14, fontweight='bold')
axes[0].grid(True, alpha=0.3)
# Cumulative variance
cumsum = np.cumsum(pca_full.explained_variance_ratio_)
axes[1].plot(cumsum, linewidth=2, marker='o')
axes[1].axhline(y=0.95, color='r', linestyle='--', label='95% variance')
axes[1].axhline(y=0.90, color='orange', linestyle='--', label='90% variance')
n_95 = np.argmax(cumsum >= 0.95) + 1
axes[1].axvline(x=n_95, color='r', linestyle=':', alpha=0.5)
axes[1].set_xlabel('Number of Components', fontsize=12)
axes[1].set_ylabel('Cumulative Explained Variance', fontsize=12)
axes[1].set_title(f'Need {n_95} components for 95% variance', fontsize=14, fontweight='bold')
axes[1].legend(fontsize=11)
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print(f"\nOriginal dimensions: {X_digits.shape[1]}")
print(f"Components for 95% variance: {n_95}")
print(f"Dimensionality retained: {n_95/X_digits.shape[1]:.1%} "
      f"({X_digits.shape[1]/n_95:.1f}x compression)")
9.3 Image Reconstruction with PCA¶
What: Visualizing information loss at different compression levels¶
By projecting digit images (8x8 = 64 pixels) down to \(k\) principal components and then reconstructing them via \(\hat{x} = Z W_k^T + \mu\), we can visually assess how much detail is preserved at each level of compression. With just 10 components (84% reduction), the digit is clearly recognizable; with 1 component, only the overall brightness is captured.
Why: Building intuition for the bias-variance tradeoff in representation learning¶
Image reconstruction provides a vivid, visual demonstration of the information-compression tradeoff at the heart of dimensionality reduction. Too few components means losing discriminative features (high bias in the representation); too many components means retaining noise and wasting computation. This same tradeoff appears in autoencoders, variational autoencoders, and any latent-variable model – PCA is simply the linear, closed-form version of the same idea.
# Sample digit
sample_idx = 0
original = X_digits[sample_idx].reshape(8, 8)
# Reconstruct with different numbers of components
n_components_list = [1, 5, 10, 20, 40, 64]
fig, axes = plt.subplots(2, 4, figsize=(14, 7))
axes = axes.ravel()
# Original
axes[0].imshow(original, cmap='gray')
axes[0].set_title('Original', fontsize=11, fontweight='bold')
axes[0].axis('off')
for idx, n_comp in enumerate(n_components_list):
    pca_temp = PCA(n_components=n_comp)
    X_reduced = pca_temp.fit_transform(X_digits_scaled)
    X_reconstructed = pca_temp.inverse_transform(X_reduced)
    # Note: reconstructions live in standardized pixel space; imshow's
    # per-image normalization makes them visually comparable to the original
    reconstructed = X_reconstructed[sample_idx].reshape(8, 8)
    axes[idx + 1].imshow(reconstructed, cmap='gray')
    var_explained = pca_temp.explained_variance_ratio_.sum()
    axes[idx + 1].set_title(f'{n_comp} PCs ({var_explained:.1%})', fontsize=11)
    axes[idx + 1].axis('off')
# Hide last subplot
axes[7].axis('off')
plt.suptitle('PCA Reconstruction Quality', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
9.4 t-SNE for Visualization¶
What: Non-linear dimensionality reduction optimized for 2D visualization¶
t-SNE (t-distributed Stochastic Neighbor Embedding) converts pairwise similarities in high-dimensional space to a probability distribution, then finds a low-dimensional embedding that preserves those similarities as faithfully as possible. Unlike PCA, which preserves global variance (linear projection), t-SNE focuses on preserving local neighborhood structure – points that are close in high dimensions stay close in the embedding. The perplexity parameter controls the effective neighborhood size.
Why: The gold standard for visualizing high-dimensional data¶
PCA projects data linearly, so it often fails to reveal cluster structure when classes are separated by non-linear manifolds. t-SNE, by contrast, routinely produces strikingly clear cluster separations – which is why it has become the standard visualization tool in genomics, NLP (word embeddings), and computer vision. The key caveat is that t-SNE is a visualization-only tool: distances and densities in the 2D plot are not directly interpretable, and the algorithm is non-deterministic and slow (\(O(n^2)\), though Barnes-Hut approximation reduces this to \(O(n \log n)\)).
# Reduce to 30D first with PCA (for speed)
pca_pre = PCA(n_components=30)
X_pca_30 = pca_pre.fit_transform(X_digits_scaled)
# Apply t-SNE
print("Running t-SNE (this may take a minute)...")
tsne = TSNE(n_components=2, random_state=42, perplexity=30)
X_tsne = tsne.fit_transform(X_pca_30)
# Compare PCA and t-SNE
pca_2d = PCA(n_components=2)
X_pca_2d = pca_2d.fit_transform(X_digits_scaled)
fig, axes = plt.subplots(1, 2, figsize=(16, 7))
# PCA
scatter1 = axes[0].scatter(X_pca_2d[:, 0], X_pca_2d[:, 1],
c=y_digits, cmap='tab10', s=20, alpha=0.6)
axes[0].set_title('PCA (2D)', fontsize=14, fontweight='bold')
axes[0].set_xlabel('PC1', fontsize=12)
axes[0].set_ylabel('PC2', fontsize=12)
axes[0].grid(True, alpha=0.3)
# t-SNE
scatter2 = axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1],
c=y_digits, cmap='tab10', s=20, alpha=0.6)
axes[1].set_title('t-SNE (2D)', fontsize=14, fontweight='bold')
axes[1].set_xlabel('t-SNE 1', fontsize=12)
axes[1].set_ylabel('t-SNE 2', fontsize=12)
axes[1].grid(True, alpha=0.3)
# Shared colorbar
cbar = plt.colorbar(scatter2, ax=axes, label='Digit', ticks=range(10))
plt.tight_layout()
plt.show()
print("\nt-SNE better preserves local structure and separates clusters!")
9.5 Independent Component Analysis (ICA)¶
What: Recovering original source signals from observed mixtures¶
ICA solves the blind source separation problem: given observations \(X = SA^T\) where \(S\) contains unknown independent source signals and \(A\) is an unknown mixing matrix, recover \(S\) without knowing either. The classic example is the “cocktail party problem” – separating individual speakers from recordings made by multiple microphones. FastICA from scikit-learn estimates the unmixing matrix by maximizing the non-Gaussianity of the recovered components, exploiting the Central Limit Theorem (mixtures are more Gaussian than their sources).
Why: Fundamentally different from PCA¶
While PCA finds uncorrelated components (second-order statistics only), ICA finds statistically independent components (using higher-order statistics). This distinction matters when the underlying sources are non-Gaussian, which is the case in most real-world signals – audio, EEG, financial time series. ICA is widely used in neuroscience (separating brain signals from artifacts in EEG/fMRI), telecommunications (separating multiplexed signals), and feature learning (discovering independent factors of variation in data).
# Generate mixed signals
time = np.linspace(0, 8, 2000)
s1 = np.sin(2 * time) # Sinusoid
s2 = np.sign(np.sin(3 * time)) # Square wave
s3 = np.random.laplace(0, 1, len(time))  # Laplace noise (non-Gaussian source)
S = np.c_[s1, s2, s3]
S += 0.2 * np.random.normal(size=S.shape)
S /= S.std(axis=0)
# Mix signals
A = np.array([[1, 1, 1], [0.5, 2, 1.0], [1.5, 1.0, 2.0]])
X = S @ A.T # Mixed signals
# Apply ICA
ica = FastICA(n_components=3, random_state=42)
S_recovered = ica.fit_transform(X)
# Plot
fig, axes = plt.subplots(3, 3, figsize=(16, 10))
# Original sources
for i in range(3):
    axes[0, i].plot(time, S[:, i], linewidth=0.5)
    axes[0, i].set_title(f'Source {i+1}', fontsize=11)
    axes[0, i].grid(True, alpha=0.3)
# Mixed signals
for i in range(3):
    axes[1, i].plot(time, X[:, i], linewidth=0.5, color='orange')
    axes[1, i].set_title(f'Mixed {i+1}', fontsize=11)
    axes[1, i].grid(True, alpha=0.3)
# Recovered sources
for i in range(3):
    axes[2, i].plot(time, S_recovered[:, i], linewidth=0.5, color='green')
    axes[2, i].set_title(f'Recovered {i+1}', fontsize=11)
    axes[2, i].set_xlabel('Time', fontsize=10)
    axes[2, i].grid(True, alpha=0.3)
axes[0, 0].set_ylabel('Original', fontsize=12)
axes[1, 0].set_ylabel('Mixed', fontsize=12)
axes[2, 0].set_ylabel('ICA Recovered', fontsize=12)
plt.suptitle('Independent Component Analysis (Blind Source Separation)',
fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()
Key Takeaways¶
1. PCA¶
Finds directions of maximum variance
Linear transformation
Captures only second-order statistics (variance/covariance), so it is implicitly suited to Gaussian-like data
Use for: compression, noise reduction, visualization
2. Choosing Components¶
Scree plot (elbow method)
Cumulative variance (e.g., 95%)
Cross-validation on downstream task
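The cumulative-variance rule can be written as a short helper, mirroring the digits example above (the 95% threshold is a common convention, not a hard rule):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = StandardScaler().fit_transform(load_digits().data)
cumvar = np.cumsum(PCA().fit(X).explained_variance_ratio_)

# Smallest k whose cumulative explained variance reaches the threshold
k95 = int(np.searchsorted(cumvar, 0.95)) + 1
print(f"{k95} components retain 95% of the variance")
```

`searchsorted` works here because cumulative variance is monotonically non-decreasing; it is equivalent to the `np.argmax(cumsum >= 0.95) + 1` idiom used earlier.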
3. t-SNE¶
Non-linear, preserves local structure
Great for visualization
Not for reconstruction
Slow on large datasets
4. ICA¶
Finds statistically independent components
Blind source separation
Assumes non-Gaussian sources
Use for: signal processing, feature extraction
5. When to Use¶
PCA: General compression, preprocessing
t-SNE: Final visualization only
ICA: Separate mixed signals
Autoencoders: Non-linear, learnable
Practice Exercises¶
Implement kernel PCA
Apply PCA to face recognition
Compare PCA vs ICA on images
Build autoencoder for MNIST
Tune t-SNE perplexity parameter
References¶
CS229 Lecture Notes on PCA and ICA
“Visualizing Data using t-SNE” - van der Maaten & Hinton (2008)
“The Elements of Statistical Learning” - Hastie et al.