Advanced Linear Algebra
Eigendecomposition, SVD, PCA, matrix factorization, and low-rank approximations.
Linear algebra is arguably the single most important mathematical subject for machine learning. Every data point is a vector, every dataset is a matrix, every linear transformation is a matrix multiplication, and the core operations of training neural networks (forward passes, gradient computations, weight updates) are all linear algebra under the hood. This notebook goes beyond basic matrix operations to cover the spectral theory of matrices: how to decompose them into their fundamental components, and why those decompositions are so powerful.
Eigendecomposition reveals the intrinsic directions along which a transformation acts, SVD generalizes this to any matrix and provides optimal low-rank approximations, and PCA leverages SVD to find the directions of maximum variance in data. These techniques appear everywhere in modern ML: from dimensionality reduction and data compression to recommender systems and the weight initialization strategies used in deep learning.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from mpl_toolkits.mplot3d import Axes3D
from sklearn.datasets import load_iris, fetch_olivetti_faces
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
np.random.seed(42)
1. Eigendecomposition
Finding the natural axes of a linear transformation
An eigenvector \(v\) of a square matrix \(A\) is a special direction that the transformation \(A\) merely scales without rotating:
\(A v = \lambda v\)
The scalar \(\lambda\) is the eigenvalue: it tells you by how much the eigenvector is stretched (or flipped, if negative). Geometrically, while most vectors change both direction and magnitude under \(A\), eigenvectors keep pointing the same way.
Matrix form: if \(A\) has \(n\) linearly independent eigenvectors, we can write
\(A = Q \Lambda Q^{-1}\)
where \(Q\) is the matrix whose columns are the eigenvectors and \(\Lambda\) is a diagonal matrix of eigenvalues. This decomposition is tremendously useful: raising \(A\) to a power becomes \(A^k = Q \Lambda^k Q^{-1}\) (just exponentiate the diagonal), and the eigenvalues reveal stability properties: a linear system is stable if all \(|\lambda_i| < 1\). In ML, the eigenvalues of the Hessian matrix tell us about the curvature of the loss surface: large eigenvalues indicate steep directions, small ones indicate flat directions, and negative ones indicate saddle points.
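The matrix-power identity can be checked directly. A minimal sketch, assuming \(A\) is diagonalizable (true here, since this symmetric \(A\) has distinct eigenvalues):

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 3.0]])
eigenvalues, Q = np.linalg.eig(A)

k = 5
# A^k = Q Lambda^k Q^{-1}: exponentiate only the diagonal, then change basis back
A_k = Q @ np.diag(eigenvalues**k) @ np.linalg.inv(Q)

print(np.allclose(A_k, np.linalg.matrix_power(A, k)))  # True
```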
# Create a symmetric matrix (guaranteed real eigenvalues)
A = np.array([[3, 1], [1, 3]])
# Compute eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(A)
print("Matrix A:")
print(A)
print(f"\nEigenvalues: {eigenvalues}")
print(f"\nEigenvectors:")
print(eigenvectors)
# Verify: A*v = lambda*v
v1 = eigenvectors[:, 0]
lambda1 = eigenvalues[0]
print(f"\nVerification for first eigenvalue:")
print(f"A @ v1 = {A @ v1}")
print(f"lambda1 * v1 = {lambda1 * v1}")
print(f"Equal? {np.allclose(A @ v1, lambda1 * v1)}")
# Visualize eigenvectors and transformation
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Before transformation
axes[0].arrow(0, 0, eigenvectors[0, 0], eigenvectors[1, 0],
              head_width=0.1, head_length=0.1, fc='red', ec='red', linewidth=2)
axes[0].arrow(0, 0, eigenvectors[0, 1], eigenvectors[1, 1],
              head_width=0.1, head_length=0.1, fc='blue', ec='blue', linewidth=2)
axes[0].set_xlim(-1, 1)
axes[0].set_ylim(-1, 1)
axes[0].set_aspect('equal')
axes[0].grid(True, alpha=0.3)
axes[0].set_title('Eigenvectors (before transformation)', fontsize=13)
axes[0].axhline(y=0, color='k', linewidth=0.5)
axes[0].axvline(x=0, color='k', linewidth=0.5)
# After transformation
v1_transformed = A @ eigenvectors[:, 0]
v2_transformed = A @ eigenvectors[:, 1]
axes[1].arrow(0, 0, v1_transformed[0], v1_transformed[1],
              head_width=0.2, head_length=0.2, fc='red', ec='red', linewidth=2,
              label=f'λ₁={eigenvalues[0]:.2f}')
axes[1].arrow(0, 0, v2_transformed[0], v2_transformed[1],
              head_width=0.2, head_length=0.2, fc='blue', ec='blue', linewidth=2,
              label=f'λ₂={eigenvalues[1]:.2f}')
axes[1].set_xlim(-5, 5)
axes[1].set_ylim(-5, 5)
axes[1].set_aspect('equal')
axes[1].grid(True, alpha=0.3)
axes[1].set_title('After A transformation (scaled along eigenvectors)', fontsize=13)
axes[1].axhline(y=0, color='k', linewidth=0.5)
axes[1].axvline(x=0, color='k', linewidth=0.5)
axes[1].legend(fontsize=11)
plt.tight_layout()
plt.show()
print("Eigenvectors maintain their direction, only scaled by eigenvalue!")
2. Singular Value Decomposition (SVD)
The most important matrix factorization in ML!
For any matrix \(A\) (m × n): \(A = U \Sigma V^T\)
Where:
U (m × m) = left singular vectors (orthonormal)
Σ (m × n) = diagonal matrix of singular values (sorted descending)
V (n × n) = right singular vectors (orthonormal)
Key properties:
Works for any matrix (not just square)
Singular values = square roots of eigenvalues of \(A^T A\)
Provides optimal low-rank approximation
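The second property can be verified numerically. A quick sketch on a random non-square matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))

s = np.linalg.svd(A, compute_uv=False)        # singular values, descending
eigvals = np.linalg.eigvalsh(A.T @ A)[::-1]   # eigenvalues of A^T A, descending
eigvals = np.clip(eigvals, 0, None)           # guard against tiny negative round-off

print(np.allclose(s, np.sqrt(eigvals)))  # True
```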
# Example matrix
A = np.array([
[1, 2, 3],
[4, 5, 6],
[7, 8, 9],
[10, 11, 12]
])
# Compute SVD
U, s, Vt = np.linalg.svd(A, full_matrices=True)
print(f"Original matrix A shape: {A.shape}")
print(f"\nU shape (left singular vectors): {U.shape}")
print(f"s shape (singular values): {s.shape}")
print(f"V^T shape (right singular vectors): {Vt.shape}")
print(f"\nSingular values: {s}")
# Reconstruct A
Sigma = np.zeros((A.shape[0], A.shape[1]))
np.fill_diagonal(Sigma, s)
A_reconstructed = U @ Sigma @ Vt
print(f"\nReconstruction error: {np.linalg.norm(A - A_reconstructed)}")
print("\nOriginal A:")
print(A)
print("\nReconstructed A:")
print(A_reconstructed)
3. Low-Rank Approximation with SVD
Key insight: Keep only top k singular values
Eckart-Young theorem: This is the optimal rank-k approximation!
Applications:
Image compression
Noise reduction
Recommender systems
Latent semantic analysis
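The Eckart-Young guarantee can be checked numerically: the Frobenius error of the rank-k truncation equals the norm of the discarded singular values. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((30, 20))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# ||A - A_k||_F = sqrt(sum of squared discarded singular values)
err = np.linalg.norm(A - A_k, 'fro')
print(np.isclose(err, np.sqrt(np.sum(s[k:]**2))))  # True
```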
# Low-rank approximation example
def low_rank_approx(A, k):
    """Compute rank-k approximation using SVD"""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Keep only top k components
    U_k = U[:, :k]
    s_k = s[:k]
    Vt_k = Vt[:k, :]
    A_k = U_k @ np.diag(s_k) @ Vt_k
    return A_k
# Create a more interesting matrix
np.random.seed(42)
X = np.random.randn(50, 50)
# Compute SVD
U, s, Vt = np.linalg.svd(X, full_matrices=False)
# Visualize singular value spectrum
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(s, 'o-', linewidth=2, markersize=5)
plt.xlabel('Index', fontsize=12)
plt.ylabel('Singular Value', fontsize=12)
plt.title('Singular Value Spectrum', fontsize=14)
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
cumsum = np.cumsum(s**2) / np.sum(s**2)
plt.plot(cumsum, linewidth=2)
plt.axhline(y=0.95, color='r', linestyle='--', label='95% variance')
plt.xlabel('Number of Components', fontsize=12)
plt.ylabel('Cumulative Explained Variance', fontsize=12)
plt.title('Variance Explained by Components', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
k_95 = np.argmax(cumsum >= 0.95) + 1
print(f"Components needed for 95% variance: {k_95} out of {len(s)}")
4. Principal Component Analysis (PCA)
Dimensionality reduction by finding directions of maximum variance
PCA answers a fundamental question: given high-dimensional data, what are the most informative directions to project onto? The answer is the directions along which the data varies the most, the principal components.
Algorithm:
Center the data: \(X_{\text{centered}} = X - \bar{X}\)
Compute the covariance matrix: \(C = \frac{1}{n-1}X_{\text{centered}}^T X_{\text{centered}}\)
Find the eigenvectors of \(C\): these are the principal components, ranked by their eigenvalues (variance explained)
Project data onto the top \(k\) components: \(Z = X \cdot V_k\)
Connection to SVD: Rather than forming the covariance matrix explicitly (which is expensive and numerically unstable for large feature counts), we can compute PCA directly from the SVD of the centered data matrix. If \(X = U \Sigma V^T\), then the columns of \(V\) are the principal components and \(\sigma_i^2 / (n-1)\) gives the variance along each direction. This SVD-based approach is what sklearn.decomposition.PCA uses internally, and it is both faster and more numerically stable than eigendecomposing the covariance matrix.
def pca_from_scratch(X, n_components=2):
    """PCA using eigendecomposition of the covariance matrix"""
    # Center the data
    X_centered = X - np.mean(X, axis=0)
    # Covariance matrix
    cov_matrix = np.cov(X_centered.T)
    # Eigendecomposition (eigh: covariance is symmetric, so eigenvalues are real)
    eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
    # Sort by eigenvalue (descending)
    idx = np.argsort(eigenvalues)[::-1]
    eigenvalues = eigenvalues[idx]
    eigenvectors = eigenvectors[:, idx]
    # Select top k eigenvectors
    principal_components = eigenvectors[:, :n_components]
    # Project data
    X_projected = X_centered @ principal_components
    # Explained variance ratio
    explained_var_ratio = eigenvalues / np.sum(eigenvalues)
    return X_projected, principal_components, explained_var_ratio
def pca_via_svd(X, n_components=2):
    """PCA using SVD (more numerically stable)"""
    # Center the data
    X_centered = X - np.mean(X, axis=0)
    # SVD
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Principal components are columns of V (rows of Vt)
    principal_components = Vt.T[:, :n_components]
    # Project data
    X_projected = X_centered @ principal_components
    # Explained variance
    explained_var = (s**2) / (X.shape[0] - 1)
    explained_var_ratio = explained_var / np.sum(explained_var)
    return X_projected, principal_components, explained_var_ratio
# Test on Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target
print(f"Original data shape: {X_iris.shape}")
# Apply both methods
X_pca_eig, pcs_eig, var_ratio_eig = pca_from_scratch(X_iris, n_components=2)
X_pca_svd, pcs_svd, var_ratio_svd = pca_via_svd(X_iris, n_components=2)
print(f"\nProjected data shape: {X_pca_svd.shape}")
print(f"\nExplained variance ratio (SVD): {var_ratio_svd[:2]}")
print(f"Total variance explained: {np.sum(var_ratio_svd[:2]):.2%}")
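As a cross-check, the SVD route should agree with sklearn.decomposition.PCA (which, as noted above, is SVD-based internally) up to the sign of each component, which is arbitrary. A self-contained sketch:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

# Recompute the SVD-based projection, then compare against sklearn
X = load_iris().data
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_svd = Xc @ Vt.T[:, :2]

pca = PCA(n_components=2)
X_sk = pca.fit_transform(X)

# Components are defined only up to sign, so compare absolute values
print(np.allclose(np.abs(X_sk), np.abs(X_svd)))
print(np.allclose(pca.explained_variance_ratio_, (s**2 / np.sum(s**2))[:2]))
```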
# Visualize PCA projection
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
# Original 3D data (first 3 features): swap the first axes for a 3D projection
axes[0].remove()
ax = fig.add_subplot(121, projection='3d')
scatter = ax.scatter(X_iris[:, 0], X_iris[:, 1], X_iris[:, 2],
c=y_iris, cmap='viridis', s=50)
ax.set_xlabel(iris.feature_names[0], fontsize=10)
ax.set_ylabel(iris.feature_names[1], fontsize=10)
ax.set_zlabel(iris.feature_names[2], fontsize=10)
ax.set_title('Original 3D Data', fontsize=13)
plt.colorbar(scatter, ax=ax, label='Species')
# PCA 2D projection
axes[1].scatter(X_pca_svd[:, 0], X_pca_svd[:, 1],
c=y_iris, cmap='viridis', s=50)
axes[1].set_xlabel(f'PC1 ({var_ratio_svd[0]:.1%} var)', fontsize=12)
axes[1].set_ylabel(f'PC2 ({var_ratio_svd[1]:.1%} var)', fontsize=12)
axes[1].set_title('PCA 2D Projection', fontsize=13)
axes[1].grid(True, alpha=0.3)
# Add arrows for the first two principal components
for i in range(2):
    axes[1].arrow(0, 0, pcs_svd[0, i]*3, pcs_svd[1, i]*3,
                  head_width=0.3, head_length=0.5, fc='red', ec='red',
                  alpha=0.6, linewidth=2)
plt.tight_layout()
plt.show()
print("PCA finds directions of maximum variance!")
print("Classes remain well-separated in 2D projection")
5. Image Compression with SVD
Practical demonstration of low-rank approximation
Image compression via SVD is one of the most intuitive demonstrations of why low-rank approximations are powerful. A grayscale image is just a matrix of pixel intensities, and its SVD decomposes it into a sum of rank-1 "basis images," each capturing progressively finer detail. Keeping only the top \(k\) singular values discards the high-frequency noise and fine texture while preserving the dominant structure.
The compression ratio for a rank-\(k\) approximation of an \(m \times n\) image is:
\(\frac{k(m + n + 1)}{mn}\)
You need to store \(k\) left singular vectors (\(m\) values each), \(k\) right singular vectors (\(n\) values each), and \(k\) singular values, compared to the full \(m \times n\) pixel grid. For a \(64 \times 64\) face image, keeping just 10 components stores \(10 \times (64 + 64 + 1) = 1290\) values instead of \(4096\): a 3.2x compression while retaining recognizable facial features. This same principle drives dimensionality reduction in high-dimensional ML datasets, where most of the meaningful signal lives in a low-dimensional subspace.
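The storage arithmetic for the 64 × 64 example can be spelled out directly:

```python
# Rank-k storage cost vs. the full pixel grid for a 64x64 image
m = n = 64
k = 10
stored = k * (m + n + 1)   # k left vectors + k right vectors + k singular values
full = m * n
print(stored, full, round(full / stored, 2))  # 1290 4096 3.18
```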
# Load face images
faces = fetch_olivetti_faces()
face = faces.images[0] # Single face (64x64)
print(f"Original image shape: {face.shape}")
# Compute SVD
U, s, Vt = np.linalg.svd(face, full_matrices=False)
# Reconstruct with different ranks
ranks = [5, 10, 20, 40, 64]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()
for idx, k in enumerate(ranks):
    # Low-rank approximation
    face_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    # Compression ratio
    original_size = face.shape[0] * face.shape[1]
    compressed_size = k * (face.shape[0] + face.shape[1] + 1)
    compression = compressed_size / original_size
    # Plot
    axes[idx].imshow(face_k, cmap='gray')
    axes[idx].set_title(f'Rank {k} ({compression:.1%} of original)', fontsize=11)
    axes[idx].axis('off')
# Original
axes[5].imshow(face, cmap='gray')
axes[5].set_title('Original (100%)', fontsize=11)
axes[5].axis('off')
plt.tight_layout()
plt.show()
print("Low-rank approximation provides excellent compression!")
# Visualize singular values for face image
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
plt.plot(s, 'o-', linewidth=2, markersize=4)
plt.xlabel('Singular Value Index', fontsize=12)
plt.ylabel('Magnitude', fontsize=12)
plt.title('Singular Values of Face Image', fontsize=14)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.subplot(1, 2, 2)
cumsum = np.cumsum(s**2) / np.sum(s**2)
plt.plot(cumsum, linewidth=2)
plt.axhline(y=0.90, color='r', linestyle='--', label='90%')
plt.axhline(y=0.95, color='orange', linestyle='--', label='95%')
plt.axhline(y=0.99, color='green', linestyle='--', label='99%')
plt.xlabel('Number of Components', fontsize=12)
plt.ylabel('Cumulative Energy', fontsize=12)
plt.title('Energy Preserved by Components', fontsize=14)
plt.legend(fontsize=11)
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
k_90 = np.argmax(cumsum >= 0.90) + 1
k_95 = np.argmax(cumsum >= 0.95) + 1
print(f"Components for 90% energy: {k_90}")
print(f"Components for 95% energy: {k_95}")
print(f"Compression at 95%: {k_95 * (64 + 64 + 1) / (64*64):.1%} of original")
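The same truncation idea also covers the noise-reduction application listed earlier. A minimal synthetic sketch (assumed setup: a rank-5 signal plus Gaussian noise; truncating small singular values recovers most of the structure):

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((100, 5))
signal = B @ B.T                                    # rank-5 "clean" matrix
noisy = signal + 0.5 * rng.standard_normal((100, 100))

U, s, Vt = np.linalg.svd(noisy, full_matrices=False)
denoised = U[:, :5] @ np.diag(s[:5]) @ Vt[:5, :]

# Truncation should land closer to the clean signal than the noisy input
print(np.linalg.norm(denoised - signal) < np.linalg.norm(noisy - signal))  # True
```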
6. Matrix Factorization for Recommender Systems
Problem: User-item rating matrix (mostly empty)
Solution: Factor it into low-rank matrices: \(R \approx U V^T\)
Where:
R = ratings matrix (users × items)
U = user features (users × k)
V = item features (items × k)
k = number of latent factors
# Simple matrix factorization
def matrix_factorization(R, k, steps=5000, alpha=0.002, beta=0.02):
    """
    Factor R into U and V using stochastic gradient descent
    R: rating matrix (users × items)
    k: number of latent factors
    alpha: learning rate
    beta: regularization parameter
    """
    m, n = R.shape
    # Initialize U and V randomly
    U = np.random.rand(m, k)
    V = np.random.rand(n, k)
    # Get indices of non-zero entries (known ratings)
    rows, cols = np.nonzero(R)
    errors = []
    for step in range(steps):
        # Compute error on known ratings
        error = 0
        for i, j in zip(rows, cols):
            r_ij = R[i, j]
            pred_ij = U[i, :] @ V[j, :]
            e_ij = r_ij - pred_ij
            # Update U and V
            U[i, :] += alpha * (2 * e_ij * V[j, :] - beta * U[i, :])
            V[j, :] += alpha * (2 * e_ij * U[i, :] - beta * V[j, :])
            error += e_ij**2
        # Add regularization
        error += beta/2 * (np.sum(U**2) + np.sum(V**2))
        errors.append(error)
        if step % 1000 == 0:
            print(f"Step {step}, Error: {error:.4f}")
    return U, V, errors
# Create sample ratings matrix (5 users, 6 items)
R = np.array([
[5, 3, 0, 1, 0, 0],
[4, 0, 0, 1, 0, 0],
[1, 1, 0, 5, 0, 0],
[1, 0, 0, 4, 0, 0],
[0, 1, 5, 4, 0, 0],
])
print("Original ratings matrix (0 = unknown):")
print(R)
# Factorize
k = 2 # latent factors
U, V, errors = matrix_factorization(R, k)
# Predict ratings
R_pred = U @ V.T
print("\nPredicted ratings:")
print(np.round(R_pred, 2))
# Visualize learning
plt.figure(figsize=(10, 6))
plt.plot(errors, linewidth=2)
plt.xlabel('Iteration', fontsize=12)
plt.ylabel('Error', fontsize=12)
plt.title('Matrix Factorization Training', fontsize=14)
plt.grid(True, alpha=0.3)
plt.yscale('log')
plt.show()
# Compare original vs predicted for known ratings
mask = R > 0
known_original = R[mask]
known_predicted = R_pred[mask]
print(f"\nMSE on known ratings: {np.mean((known_original - known_predicted)**2):.4f}")
print(f"MAE on known ratings: {np.mean(np.abs(known_original - known_predicted)):.4f}")
Summary
Eigendecomposition: Finds special directions (eigenvectors) and scaling (eigenvalues)
SVD: The most versatile matrix factorization; works for any matrix
Low-Rank Approximation: Optimal compression using the top singular values
PCA: Dimensionality reduction via SVD or eigendecomposition
Image Compression: A practical application of low-rank approximation
Matrix Factorization: Recommender systems and collaborative filtering
Key Insights:
SVD is everywhere in ML:
PCA (dimensionality reduction)
LSA (text analysis)
Recommender systems
Image compression
Noise reduction
Eigendecomposition vs SVD:
Eigen: Only for square matrices
SVD: Works for any matrix
SVD is numerically more stable
Dimensionality reduction:
Reduces computational cost
Removes noise
Visualization
Feature extraction
Applications in Modern ML:
Computer Vision: PCA for face recognition (Eigenfaces)
NLP: LSA for topic modeling, word embeddings
Recommender Systems: Netflix Prize, Amazon recommendations
Anomaly Detection: Low-rank plus sparse decomposition
Deep Learning: Weight initialization, compression
Next Steps:
Study randomized SVD for large-scale problems
Explore Non-negative Matrix Factorization (NMF)
Learn about kernel PCA for non-linear dimensionality reduction
Understand t-SNE and UMAP as alternatives to PCA