Linear Algebra Fundamentals¶
Linear algebra is the mathematical backbone of machine learning. Nearly every ML algorithm — from simple linear regression to deep neural networks — relies on vector and matrix operations under the hood. When a neural network makes a prediction, it’s performing a sequence of matrix multiplications. When PCA reduces dimensionality, it’s finding eigenvectors. When embeddings measure similarity, they’re computing dot products.
This notebook covers the essential linear algebra concepts you need for ML: vectors (how data is represented), matrices (how data is transformed), eigendecomposition (how we find the most important directions in data), and practical applications like solving linear regression with the Normal Equation.
Prerequisites: Basic Python and NumPy familiarity.
# Import required libraries
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import seaborn as sns
# Set style for better visualizations
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette("husl")
1. Vectors: The Building Blocks¶
A vector is an ordered array of numbers. In ML, data points are often represented as vectors.
Example: A house with 3 features: [square_feet, bedrooms, age] = [2000, 3, 10]
# Creating vectors
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])
print("Vector v1:", v1)
print("Vector v2:", v2)
print("Shape of v1:", v1.shape)
print("Dimension:", v1.ndim)
Vector Operations: Addition, Scaling, and the Dot Product¶
Vector operations are the atoms of machine learning computation. Vector addition combines features or updates model weights during gradient descent. Scalar multiplication scales vectors — this is what happens when you apply a learning rate. The dot product (np.dot) is perhaps the most important single operation in ML: it measures the alignment between two vectors, computing \(\mathbf{v}_1 \cdot \mathbf{v}_2 = \sum_i v_{1i} \cdot v_{2i}\).
The dot product powers cosine similarity (used in embeddings and search), computes neuron activations in neural networks (\(z = \mathbf{w} \cdot \mathbf{x} + b\)), and forms the basis of attention mechanisms in transformers. In NumPy, np.dot(v1, v2) and v1 @ v2 are equivalent for 1D vectors — the @ operator is the modern, preferred syntax.
# Vector addition
v_add = v1 + v2
print("v1 + v2 =", v_add)
# Scalar multiplication
v_scalar = 3 * v1
print("3 * v1 =", v_scalar)
# Dot product (inner product)
dot_product = np.dot(v1, v2)
print("v1 · v2 =", dot_product)
# Alternative dot product syntax
print("v1 · v2 (alternative):", v1 @ v2)
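As a tiny illustration of the neuron-activation formula mentioned above (\(z = \mathbf{w} \cdot \mathbf{x} + b\)), here is a sketch with made-up weights, input, and bias — the values are purely illustrative, not from any trained model:

```python
import numpy as np

# Hypothetical weights, input, and bias for a single neuron (illustrative values)
w = np.array([0.5, -0.2, 0.1])   # "learned" weights
x = np.array([1.0, 2.0, 3.0])    # input features
b = 0.5                          # bias term

z = w @ x + b  # pre-activation: dot product of weights and input, plus bias
print("z =", z)  # 0.5 - 0.4 + 0.3 + 0.5 ≈ 0.9
```

A full layer does the same thing for many neurons at once, which is exactly where matrix multiplication (covered below) comes in.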
Vector Magnitude (Norm) and Normalization¶
The L2 norm (Euclidean norm) measures the “length” of a vector: \(\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \ldots + v_n^2}\). In ML, norms serve several critical purposes: they measure distance between data points (k-NN, clustering), they regularize model weights (L2 regularization penalizes \(\|\mathbf{w}\|^2\)), and they enable normalization.
Normalizing a vector (dividing by its magnitude) creates a unit vector — a vector with length 1 that preserves direction. This is essential for cosine similarity, where we compare directions regardless of magnitude. When you compute cosine similarity between two embeddings, you’re really computing the dot product of their normalized versions: \(\cos(\theta) = \frac{\mathbf{a} \cdot \mathbf{b}}{\|\mathbf{a}\| \|\mathbf{b}\|}\). Many embedding models return pre-normalized vectors for this reason.
# Calculate magnitude (L2 norm)
magnitude_v1 = np.linalg.norm(v1)
print("Magnitude of v1:", magnitude_v1)
# Unit vector (normalized)
v1_normalized = v1 / magnitude_v1
print("Normalized v1:", v1_normalized)
print("Magnitude of normalized v1:", np.linalg.norm(v1_normalized))
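Putting the norm and the dot product together gives cosine similarity. A minimal sketch with made-up vectors — note that b points in the same direction as a but is twice as long, and c is chosen to be orthogonal to a:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])   # same direction as a, twice the magnitude
c = np.array([-3.0, 0.0, 1.0])  # orthogonal to a (a · c = -3 + 0 + 3 = 0)

def cosine_similarity(u, v):
    """Dot product of the two vectors divided by the product of their norms."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_similarity(a, b))  # 1.0 — direction matters, magnitude does not
print(cosine_similarity(a, c))  # 0.0 — orthogonal directions
```

This is why normalizing embeddings first lets you replace cosine similarity with a plain dot product.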
Visualizing Vectors in 2D¶
Visualizing vector addition helps build geometric intuition. When you add two vectors, the result follows the parallelogram rule: place vector b at the tip of vector a, and the sum a + b goes from the origin to the new tip. This is exactly what happens during gradient descent — the parameter vector gets updated by adding (a scaled version of) the gradient vector, moving it toward a better solution. The plot below shows this graphically using plt.quiver, which draws an arrow from a start point along given direction components.
# 2D vectors for visualization
a = np.array([3, 2])
b = np.array([1, 4])
c = a + b
# Plot vectors
plt.figure(figsize=(8, 8))
origin = [0, 0]
# Plot individual vectors
plt.quiver(*origin, *a, angles='xy', scale_units='xy', scale=1, color='r', width=0.01, label='Vector a')
plt.quiver(*origin, *b, angles='xy', scale_units='xy', scale=1, color='b', width=0.01, label='Vector b')
plt.quiver(*origin, *c, angles='xy', scale_units='xy', scale=1, color='g', width=0.01, label='a + b')
# Show vector addition graphically
plt.quiver(*a, *b, angles='xy', scale_units='xy', scale=1, color='b', width=0.005, alpha=0.5, linestyle='--')
plt.xlim(-1, 5)
plt.ylim(-1, 7)
plt.grid(True)
plt.legend()
plt.title('Vector Addition in 2D')
plt.xlabel('x')
plt.ylabel('y')
plt.axhline(y=0, color='k', linewidth=0.5)
plt.axvline(x=0, color='k', linewidth=0.5)
plt.show()
2. Matrices: Collections of Vectors¶
A matrix is a rectangular grid of numbers arranged in rows and columns. In ML, matrices are everywhere: a dataset is a matrix where each row is a sample and each column is a feature (shape (n_samples, n_features)). A neural network layer stores its learned parameters as a weight matrix. An image is a matrix of pixel intensities (or a 3D tensor for color channels).
Understanding matrix shapes is critical for debugging. When you see X.shape = (100, 3), that means 100 data points with 3 features each. The code below creates two matrices with compatible shapes for multiplication — A is (2, 3) and B is (3, 2), so A @ B will produce a (2, 2) result. Shape mismatches (matmul: Input operand 1 has a mismatch...) are the most common linear algebra error in ML code.
# Creating matrices
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = np.array([[7, 8],
              [9, 10],
              [11, 12]])
print("Matrix A:")
print(A)
print("Shape:", A.shape) # (rows, columns)
print("\nMatrix B:")
print(B)
print("Shape:", B.shape)
Matrix Operations: Element-wise vs Matrix Multiplication¶
There are two fundamentally different ways to combine matrices. Element-wise operations (addition, Hadamard product C * D) operate on corresponding elements independently — these appear in activation functions, masking, and attention score computation. Scalar multiplication (2 * C) scales every element uniformly, similar to applying a learning rate to a gradient matrix.
The critical distinction to internalize: C * D (element-wise) is NOT the same as C @ D (matrix multiplication). In NumPy, * always means element-wise, while @ means matrix multiplication. This is a common source of bugs when implementing neural networks from scratch.
# Element-wise operations
C = np.array([[1, 2],
              [3, 4]])
D = np.array([[5, 6],
              [7, 8]])
print("Matrix C:")
print(C)
print("\nMatrix D:")
print(D)
# Element-wise addition
print("\nC + D:")
print(C + D)
# Element-wise multiplication (Hadamard product)
print("\nC ⊙ D (element-wise):")
print(C * D)
# Scalar multiplication
print("\n2 * C:")
print(2 * C)
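To make the `*` vs `@` distinction from the text concrete, here is the same pair of matrices under both operators:

```python
import numpy as np

C = np.array([[1, 2],
              [3, 4]])
D = np.array([[5, 6],
              [7, 8]])

elementwise = C * D   # [[1*5, 2*6], [3*7, 4*8]] — corresponding elements only
matmul = C @ D        # rows of C dotted with columns of D

print(elementwise)  # [[ 5 12], [21 32]]
print(matmul)       # [[19 22], [43 50]]
```

If a from-scratch neural network trains strangely slowly or not at all, checking every `*` that should be an `@` is a good first debugging step.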
Matrix Multiplication¶
Matrix multiplication is the single most important operation in deep learning. Every neural network forward pass is a chain of matrix multiplications: the input vector gets multiplied by weight matrices layer after layer, with activations applied in between. For matrices A (m×n) and B (n×p), the result C (m×p) is computed as:
\(C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}\)
Each element of the result is a dot product of a row of A with a column of B. The inner dimensions must match (n), and the result takes the outer dimensions (m×p). In NumPy, use the @ operator: A @ B. np.dot and np.matmul produce the same result for 2D arrays, but @ (shorthand for matmul) is the cleaner, preferred syntax in modern code. Note: GPU-accelerated matrix multiplication (via cuBLAS, etc.) is why deep learning became practical — these operations parallelize extremely well.
# Matrix multiplication
result = A @ B # or np.matmul(A, B) or np.dot(A, B)
print("A @ B:")
print(result)
print("Shape:", result.shape)
# Show the calculation for one element
print("\nCalculation for element [0,0]:")
print(f"A[0,:] = {A[0,:]}")
print(f"B[:,0] = {B[:,0]}")
print(f"Dot product = {np.dot(A[0,:], B[:,0])}")
Visualizing Matrix Multiplication¶
Matrix multiplication can feel abstract when working with formulas alone. The heatmap visualization below makes it concrete: each cell in the result matrix is the dot product of one row from A and one column from B. This “row-times-column” pattern is exactly what happens in a neural network’s forward pass — each output neuron computes a dot product of inputs with its weight row. Building visual intuition for shape compatibility ((m×n) @ (n×p) → (m×p)) helps you debug dimension mismatch errors, which are among the most common bugs when building models from scratch.
# Visualize matrix multiplication
fig, axes = plt.subplots(1, 4, figsize=(16, 3))
# Matrix A
axes[0].imshow(A, cmap='Blues', aspect='auto')
axes[0].set_title(f'Matrix A\n{A.shape}')
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        axes[0].text(j, i, str(A[i, j]), ha='center', va='center')
# Matrix B
axes[1].imshow(B, cmap='Greens', aspect='auto')
axes[1].set_title(f'Matrix B\n{B.shape}')
for i in range(B.shape[0]):
    for j in range(B.shape[1]):
        axes[1].text(j, i, str(B[i, j]), ha='center', va='center')
# Multiplication symbol
axes[2].text(0.5, 0.5, '×', fontsize=40, ha='center', va='center')
axes[2].axis('off')
# Result
axes[3].imshow(result, cmap='Reds', aspect='auto')
axes[3].set_title(f'Result A @ B\n{result.shape}')
for i in range(result.shape[0]):
    for j in range(result.shape[1]):
        axes[3].text(j, i, str(result[i, j]), ha='center', va='center')
plt.tight_layout()
plt.show()
3. Matrix Transpose¶
The transpose operation flips a matrix over its diagonal, turning rows into columns: \((A^T)_{ij} = A_{ji}\). A matrix of shape (m, n) becomes (n, m) after transposing. In NumPy, A.T is the standard shorthand.
Transpose appears constantly in ML formulas. The Normal Equation uses \(X^T X\) to compute the Gram matrix (a matrix of all pairwise dot products between features). Backpropagation transposes weight matrices to propagate gradients backward through layers. When computing covariance matrices for PCA, you compute \(\frac{1}{n} X^T X\) on centered data. Understanding transpose is also key to reading ML papers — many formulas assume column vectors, so you’ll see \(\mathbf{w}^T \mathbf{x}\) to mean “dot product of w and x.”
# Transpose
print("Original A:")
print(A)
print("Shape:", A.shape)
A_transpose = A.T
print("\nTranspose A^T:")
print(A_transpose)
print("Shape:", A_transpose.shape)
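The Gram matrix \(X^T X\) mentioned above is easy to compute and inspect directly. A sketch with a tiny made-up data matrix (3 samples, 2 features):

```python
import numpy as np

# Tiny illustrative data matrix: 3 samples (rows), 2 features (columns)
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

gram = X.T @ X  # (2, 3) @ (3, 2) -> (2, 2): all pairwise feature dot products
print(gram.shape)                 # (2, 2)
print(np.allclose(gram, gram.T))  # True — X^T X is always symmetric
```

The symmetry of \(X^T X\) is what makes it so well-behaved in the Normal Equation and in covariance computations: entry (i, j) and entry (j, i) are the same dot product.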
4. Identity Matrix and Inverse¶
Identity Matrix (I): like the number 1 for matrices
- Square matrix with 1s on the diagonal, 0s elsewhere
- \(A \cdot I = I \cdot A = A\)

Inverse Matrix (\(A^{-1}\)):
- \(A \cdot A^{-1} = I\)
- Only exists for square, non-singular matrices
# Identity matrix
I = np.eye(3)
print("Identity matrix (3x3):")
print(I)
# Create a square matrix
M = np.array([[4, 7],
              [2, 6]])
print("\nMatrix M:")
print(M)
# Calculate inverse
M_inv = np.linalg.inv(M)
print("\nInverse M^(-1):")
print(M_inv)
# Verify: M @ M^(-1) = I
verification = M @ M_inv
print("\nM @ M^(-1) (should be I):")
print(np.round(verification, 10)) # Round to handle floating point errors
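In practice, you rarely form the inverse explicitly just to solve a linear system — np.linalg.solve is faster and numerically more stable. A quick sketch using the same M and an arbitrary right-hand side:

```python
import numpy as np

M = np.array([[4.0, 7.0],
              [2.0, 6.0]])
b = np.array([1.0, 2.0])  # arbitrary right-hand side for illustration

# Two ways to solve M x = b
x_via_inverse = np.linalg.inv(M) @ b  # works, but computes the full inverse
x_via_solve = np.linalg.solve(M, b)   # preferred: factorizes M, no explicit inverse

print(np.allclose(x_via_inverse, x_via_solve))  # True — same solution
print(np.allclose(M @ x_via_solve, b))          # True — x really solves the system
```

The same advice applies later in the Normal Equation section: treat `inv` as a pedagogical tool, and reach for `solve` or `lstsq` in production code.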
5. Determinant¶
The determinant is a scalar value that captures fundamental properties of a square matrix. Geometrically, it represents the factor by which the matrix scales area (in 2D) or volume (in higher dimensions). If det(A) = 0, the matrix is singular — it squishes space into a lower dimension, meaning it has no inverse and the system of equations it represents has no unique solution.
In ML, determinants appear in multivariate Gaussian distributions (the normalization constant involves \(|\Sigma|\), the determinant of the covariance matrix), in computing matrix inverses, and in understanding whether a linear system is well-conditioned. np.linalg.det(M) computes the determinant. A near-zero determinant warns you that a matrix is close to singular, which can cause numerical instability in operations like the Normal Equation.
# Determinant
det_M = np.linalg.det(M)
print("Determinant of M:", det_M)
# Singular matrix (determinant = 0)
singular = np.array([[1, 2],
                     [2, 4]])  # Second row is 2x first row
print("\nSingular matrix:")
print(singular)
print("Determinant:", np.linalg.det(singular))
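Near-singularity is often easier to diagnose with the condition number than with the determinant alone. A sketch using np.linalg.cond, with a made-up nearly-singular matrix whose second row is almost a multiple of the first:

```python
import numpy as np

well_conditioned = np.array([[4.0, 7.0],
                             [2.0, 6.0]])
nearly_singular = np.array([[1.0, 2.0],
                            [2.0, 4.0001]])  # second row is almost 2x the first

# The condition number blows up as a matrix approaches singularity
print(np.linalg.cond(well_conditioned))  # modest value — numerically safe
print(np.linalg.cond(nearly_singular))   # huge value — expect instability
```

Rule of thumb: with a condition number around \(10^k\), you can lose roughly k digits of precision when solving a linear system.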
6. Eigenvalues and Eigenvectors¶
Eigendecomposition reveals the intrinsic structure of a matrix. For a square matrix A, an eigenvector \(\mathbf{v}\) is a non-zero vector that, when multiplied by the matrix, only gets scaled — not rotated:
\(A\mathbf{v} = \lambda \mathbf{v}\)
The scalar \(\lambda\) is the corresponding eigenvalue, telling you how much the eigenvector gets stretched or compressed. If \(\lambda > 1\), the direction is amplified; if \(0 < \lambda < 1\), it’s shrunk; if \(\lambda < 0\), the direction is flipped.
This is foundational for PCA (Principal Component Analysis): the eigenvectors of a dataset’s covariance matrix are its principal components — the directions of maximum spread. Eigenvalues also determine the stability of dynamical systems, the convergence rate of iterative algorithms, and the condition number of matrices (which tells you if numerical computations will be stable). np.linalg.eig(S) returns both eigenvalues and eigenvectors in one call.
# Create a symmetric matrix (easier to interpret)
S = np.array([[3, 1],
              [1, 3]])
print("Matrix S:")
print(S)
# Calculate eigenvalues and eigenvectors
eigenvalues, eigenvectors = np.linalg.eig(S)
print("\nEigenvalues:")
print(eigenvalues)
print("\nEigenvectors (as columns):")
print(eigenvectors)
Visualizing Eigenvectors¶
The visualization below makes the eigenvalue equation \(A\mathbf{v} = \lambda\mathbf{v}\) tangible. The left plot shows the eigenvectors of matrix S — these are the “special directions” that the matrix only stretches, never rotates. The right plot demonstrates this: applying the matrix to an eigenvector produces a vector in the exact same direction, just scaled by \(\lambda\). In PCA, the eigenvectors of the covariance matrix point in the directions of maximum variance in your data, and the eigenvalues tell you how much variance each direction captures — this is how PCA decides which dimensions to keep and which to discard.
# Visualize eigenvectors
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 6))
# Plot 1: Eigenvectors
origin = [0, 0]
for i in range(len(eigenvalues)):
    ax1.quiver(*origin, *eigenvectors[:, i], angles='xy', scale_units='xy', scale=1,
               width=0.01, label=f'Eigenvector {i+1} (λ={eigenvalues[i]:.2f})')
ax1.set_xlim(-1, 1)
ax1.set_ylim(-1, 1)
ax1.grid(True)
ax1.legend()
ax1.set_title('Eigenvectors of Matrix S')
ax1.set_xlabel('x')
ax1.set_ylabel('y')
ax1.axhline(y=0, color='k', linewidth=0.5)
ax1.axvline(x=0, color='k', linewidth=0.5)
# Plot 2: Show transformation effect
# Original vector (eigenvector 1)
v = eigenvectors[:, 0]
v_transformed = S @ v
ax2.quiver(*origin, *v, angles='xy', scale_units='xy', scale=1,
           color='blue', width=0.015, label='Original eigenvector')
ax2.quiver(*origin, *v_transformed, angles='xy', scale_units='xy', scale=1,
           color='red', width=0.015, label=f'After S (scaled by λ={eigenvalues[0]:.2f})')
ax2.set_xlim(-1, 5)
ax2.set_ylim(-1, 5)
ax2.grid(True)
ax2.legend()
ax2.set_title('Eigenvector Transformation: S·v = λ·v')
ax2.set_xlabel('x')
ax2.set_ylabel('y')
ax2.axhline(y=0, color='k', linewidth=0.5)
ax2.axvline(x=0, color='k', linewidth=0.5)
plt.tight_layout()
plt.show()
# Verify: S @ v = λ * v
print("\nVerification for eigenvector 1:")
print(f"S @ v = {S @ eigenvectors[:, 0]}")
print(f"λ * v = {eigenvalues[0] * eigenvectors[:, 0]}")
7. ML Application: Linear Regression via the Normal Equation¶
Linear regression is the “hello world” of machine learning, and it has an elegant closed-form solution using linear algebra. Given a dataset where \(y = X\beta + \epsilon\) (target equals features times coefficients plus noise), the Normal Equation finds the optimal coefficients in one step:
\(\hat{\beta} = (X^T X)^{-1} X^T y\)
This formula minimizes the sum of squared errors analytically — no gradient descent needed. It works by projecting the target vector \(y\) onto the column space of \(X\). The code below generates synthetic data with known coefficients [2, 3, -1.5], then recovers them using the Normal Equation. In practice, np.linalg.lstsq is preferred over computing the inverse directly (it’s numerically more stable), but understanding the formula builds intuition for how regularization works — Ridge regression simply adds \(\lambda I\) to get \(\hat{\beta} = (X^T X + \lambda I)^{-1} X^T y\).
# Generate sample data
np.random.seed(42)
n_samples = 100
# Features: [intercept, feature1, feature2]
X = np.column_stack([np.ones(n_samples),          # Intercept term
                     np.random.randn(n_samples),  # Feature 1
                     np.random.randn(n_samples)]) # Feature 2
# True coefficients
true_beta = np.array([2, 3, -1.5])
# Generate target with noise
y = X @ true_beta + np.random.randn(n_samples) * 0.5
print("Data shape:")
print(f"X: {X.shape}")
print(f"y: {y.shape}")
# Solve using Normal Equation: β = (X^T X)^(-1) X^T y
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y
print("True coefficients:")
print(true_beta)
print("\nEstimated coefficients:")
print(beta_hat)
print("\nDifference:")
print(true_beta - beta_hat)
# Make predictions
y_pred = X @ beta_hat
# Plot results
plt.figure(figsize=(10, 6))
plt.scatter(y, y_pred, alpha=0.5)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'r--', lw=2, label='Perfect predictions')
plt.xlabel('Actual values')
plt.ylabel('Predicted values')
plt.title('Linear Regression: Actual vs Predicted')
plt.legend()
plt.grid(True)
plt.show()
# Calculate R-squared
ss_res = np.sum((y - y_pred) ** 2)
ss_tot = np.sum((y - np.mean(y)) ** 2)
r_squared = 1 - (ss_res / ss_tot)
print(f"\nR-squared: {r_squared:.4f}")
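As the text notes, np.linalg.lstsq is the numerically stabler way to solve the same least-squares problem. A sketch that regenerates the same synthetic data and compares both solvers:

```python
import numpy as np

# Same synthetic setup as above: y = X @ true_beta + noise
np.random.seed(42)
n_samples = 100
X = np.column_stack([np.ones(n_samples),
                     np.random.randn(n_samples),
                     np.random.randn(n_samples)])
true_beta = np.array([2, 3, -1.5])
y = X @ true_beta + np.random.randn(n_samples) * 0.5

# lstsq minimizes ||X b - y||^2 without ever forming (X^T X)^(-1)
beta_lstsq, residuals, rank, singular_values = np.linalg.lstsq(X, y, rcond=None)
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

print(beta_lstsq)
print(np.allclose(beta_lstsq, beta_normal))  # True — same answer, computed more stably
```

On well-conditioned data like this the two agree to machine precision; the difference shows up when \(X^T X\) is nearly singular (e.g., highly correlated features).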
8. ML Application: Data Transformation¶
Matrices are not just containers for data — they are transformation operators. Multiplying a data matrix by a transformation matrix applies geometric operations: rotation, scaling, shearing, or any combination. A rotation matrix preserves distances and angles (it’s orthogonal: \(R^T R = I\)). A scaling matrix stretches or compresses along each axis independently.
These transformations are central to ML: data augmentation in computer vision applies random rotations and scales to training images. Convolutional neural network features can be understood as learned transformations. PCA transforms data into a new coordinate system aligned with the directions of maximum variance. The code below shows how a unit circle transforms under rotation, scaling, and their composition — note that order matters: scaling then rotating gives a different result than rotating then scaling, because matrix multiplication is not commutative.
# Create sample 2D data points
theta = np.linspace(0, 2*np.pi, 50)
original_data = np.array([np.cos(theta), np.sin(theta)]) # Circle
# Rotation matrix (45 degrees)
angle = np.pi / 4
rotation_matrix = np.array([[np.cos(angle), -np.sin(angle)],
                            [np.sin(angle),  np.cos(angle)]])
# Scaling matrix
scaling_matrix = np.array([[2, 0],
                           [0, 0.5]])
# Apply transformations
rotated_data = rotation_matrix @ original_data
scaled_data = scaling_matrix @ original_data
combined_data = scaling_matrix @ rotation_matrix @ original_data
# Visualize
fig, axes = plt.subplots(2, 2, figsize=(12, 12))
# Original
axes[0, 0].plot(original_data[0], original_data[1], 'o-', markersize=3)
axes[0, 0].set_title('Original Data (Circle)')
axes[0, 0].set_aspect('equal')
axes[0, 0].grid(True)
# Rotated
axes[0, 1].plot(rotated_data[0], rotated_data[1], 'o-', markersize=3, color='orange')
axes[0, 1].set_title('Rotated (45°)')
axes[0, 1].set_aspect('equal')
axes[0, 1].grid(True)
# Scaled
axes[1, 0].plot(scaled_data[0], scaled_data[1], 'o-', markersize=3, color='green')
axes[1, 0].set_title('Scaled (2x in x, 0.5x in y)')
axes[1, 0].set_aspect('equal')
axes[1, 0].grid(True)
# Combined
axes[1, 1].plot(combined_data[0], combined_data[1], 'o-', markersize=3, color='red')
axes[1, 1].set_title('Rotated, then Scaled')
axes[1, 1].set_aspect('equal')
axes[1, 1].grid(True)
plt.tight_layout()
plt.show()
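The order-matters point can be checked numerically. A sketch with the same rotation and scaling matrices, plus the orthogonality property of rotations mentioned above:

```python
import numpy as np

angle = np.pi / 4
R = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])
S = np.array([[2.0, 0.0],
              [0.0, 0.5]])

# Matrix multiplication is not commutative: scaling-then-rotating differs
# from rotating-then-scaling
print(np.allclose(S @ R, R @ S))  # False

# A rotation matrix is orthogonal: R^T R = I, so R^T exactly undoes R
print(np.allclose(R.T @ R, np.eye(2)))  # True
```

Orthogonality is also why rotations preserve norms and dot products — a property exploited by many ML algorithms that rely on distance-preserving transforms.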
9. Practice Exercises¶
These exercises reinforce the core concepts covered in this notebook. Exercise 1 asks you to compute the angle between two vectors using the cosine similarity formula — this is the same computation that powers semantic search in embedding-based systems. Exercise 2 introduces projection matrices, which are the mathematical foundation of PCA (projecting data onto lower-dimensional subspaces). Exercise 3 challenges you to implement matrix multiplication from scratch, which solidifies understanding of the row-times-column dot product pattern that drives every neural network forward pass.
Try solving each one before looking at the hints in the comments.
# Exercise 1: Calculate the angle between two vectors
# Hint: cos(θ) = (v1 · v2) / (||v1|| * ||v2||)
vec1 = np.array([1, 0])
vec2 = np.array([1, 1])
# Your code here:
# angle = ?
# Exercise 2: Create a projection matrix that projects vectors onto the x-axis
# Result should be: [x, y] -> [x, 0]
# Your code here:
# projection_matrix = ?
# Exercise 3: Implement matrix multiplication from scratch (without using @)
def matrix_multiply(A, B):
    """
    Multiply two matrices A (m×n) and B (n×p) to get C (m×p).
    """
    # Your code here
    pass
# Test your function
test_A = np.array([[1, 2], [3, 4]])
test_B = np.array([[5, 6], [7, 8]])
# result = matrix_multiply(test_A, test_B)
# print(result)
# print(test_A @ test_B) # Should match
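One possible set of solutions — peek only after trying the exercises yourself. This is a sketch, not the only correct approach:

```python
import numpy as np

# Exercise 1: angle via cos(θ) = (v1 · v2) / (||v1|| * ||v2||)
vec1 = np.array([1, 0])
vec2 = np.array([1, 1])
cos_theta = (vec1 @ vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
angle = np.degrees(np.arccos(cos_theta))
print(angle)  # ≈ 45.0

# Exercise 2: projection onto the x-axis maps [x, y] -> [x, 0]
projection_matrix = np.array([[1, 0],
                              [0, 0]])
print(projection_matrix @ np.array([3, 7]))  # [3 0]

# Exercise 3: matrix multiplication from scratch (row-times-column pattern)
def matrix_multiply(A, B):
    m, n = A.shape
    n2, p = B.shape
    assert n == n2, "inner dimensions must match"
    C = np.zeros((m, p))
    for i in range(m):
        for j in range(p):
            # Each result element is a dot product of row i of A and column j of B
            C[i, j] = sum(A[i, k] * B[k, j] for k in range(n))
    return C

test_A = np.array([[1, 2], [3, 4]])
test_B = np.array([[5, 6], [7, 8]])
print(np.allclose(matrix_multiply(test_A, test_B), test_A @ test_B))  # True
```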
Summary¶
You’ve learned:
✅ Vectors: Ordered arrays representing data points
✅ Vector operations: Addition, scalar multiplication, dot product, norm
✅ Matrices: 2D arrays for transformations and data
✅ Matrix multiplication: Core operation in neural networks
✅ Transpose: Flipping rows and columns
✅ Identity and Inverse: Fundamental matrix properties
✅ Determinant: Scaling factor and singularity indicator
✅ Eigenvalues/Eigenvectors: Basis for PCA and understanding transformations
✅ ML Applications: Linear regression, data transformation
Next Steps:¶
Practice with real datasets
Explore Principal Component Analysis (PCA)
Study neural network weight matrices
Learn about Singular Value Decomposition (SVD)