Lecture 3: Locally Weighted Regression (LWR)
🎥 Watch Lecture
From Andrew Ng's CS229 Lecture 3 (Autumn 2018)
"If you have curved data… it's quite difficult to find features. Is it √x, log(x), x³? What is the set of features that lets you do this? Locally weighted regression sidesteps all those problems." - Andrew Ng
Parametric vs Non-Parametric Learning
Parametric Algorithm (e.g., Linear Regression):
- Fit a fixed set of parameters (θ) to the data
- After fitting, discard the training data
- Prediction is fast: h(x) = θᵀx
Non-Parametric Algorithm (e.g., LWR):
- Keep the entire training set around
- Fit different parameters for each query point
- Slower prediction but more flexible
The LWR Algorithm
Key Idea: Weight training examples by proximity to the query point.
Weight Function (Gaussian kernel):
w⁽ⁱ⁾ = exp(-(x⁽ⁱ⁾ - x)² / (2τ²))
For each prediction:
1. Compute weights for all training points
2. Solve weighted least squares: θ = (XᵀWX)⁻¹XᵀWy, where W = diag(w⁽¹⁾, …, w⁽ᵐ⁾)
3. Return the prediction θᵀx
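The three-step recipe above can be sketched in a few lines of NumPy (a minimal illustration; the helper name and toy data here are mine, not from the lecture):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Predict y at x_query by solving weighted least squares locally."""
    # 1. Gaussian weights: nearby training points count more
    w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    # 2. Weighted normal equations: theta = (X^T W X)^{-1} X^T W y
    Xb = np.c_[np.ones(len(X)), X]   # add intercept column
    W = np.diag(w)
    theta = np.linalg.solve(Xb.T @ W @ Xb, Xb.T @ W @ y)
    # 3. Predict with the locally fitted theta
    return np.r_[1.0, x_query] @ theta

# Toy 1-D example: a local linear fit tracks sin(x) near the query
X = np.linspace(0, 5, 30).reshape(-1, 1)
y = np.sin(X).ravel()
print(lwr_predict(X, y, np.array([2.0])))  # near sin(2), up to smoothing bias
```

Note that `theta` is recomputed from scratch for every query point; nothing is cached between predictions, which is exactly why the method is non-parametric.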
Bandwidth Parameter τ (tau)
Controls locality:
- Small τ (e.g., 0.1): Very local, wiggly fit
- Large τ (e.g., 2.0): More global, smooth fit
- Medium τ (e.g., 0.5): Balanced
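A quick way to see what τ does is to evaluate the weight a training point receives at a few fixed distances from the query (a small sketch; the distances are arbitrary):

```python
import numpy as np

# Weight w = exp(-d^2 / (2 tau^2)) at distance d, for the bandwidths above
distances = np.array([0.0, 0.5, 1.0, 2.0])
for tau in (0.1, 0.5, 2.0):
    w = np.exp(-distances**2 / (2 * tau**2))
    print(f"tau={tau}:", np.round(w, 4))
```

With τ = 0.1 the weight is essentially zero beyond a distance of a few tenths, so only immediate neighbors influence the fit; with τ = 2.0 even points two units away still carry weight above 0.6.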
When to Use LWR
✅ Use LWR when:
- Features are low-dimensional (n ≤ 5)
- Patterns are non-linear
- You don't want to do feature engineering
- You can afford O(n³) per prediction
❌ Don't use LWR when:
- Dimensions are high (curse of dimensionality)
- You need real-time predictions
- Training data is limited
- Features are already well-chosen
Comparison

| Algorithm | Parameters | Training Data | Prediction Speed | Flexibility |
|---|---|---|---|---|
| Linear Regression | Fixed θ | Discarded | Fast O(n) | Low |
| Polynomial Regression | Fixed θ | Discarded | Fast O(n) | Medium |
| LWR | New θ per query | Kept | Slow O(n³) | High |
From the Lecture
"This is a non-parametric learning algorithm… there's no longer a fixed, finite number of parameters that you use to represent your hypothesis."
On feature engineering:
"Rather than having to spend a long time to decide, do I want features of the form x, x², √x, log(x), you can use locally weighted regression and it will do a good job fitting a variety of different data."
The trade-off:
"The disadvantage is that you need to keep the entire training set around… For every new query point, you need to fit a new set of parameters."
import numpy as np
import matplotlib.pyplot as plt

# Locally Weighted Regression Implementation
class LocallyWeightedRegression:
    """
    Locally Weighted Regression (LWR)

    From CS229 Lecture 3: Non-parametric learning algorithm that fits
    a different linear model at each query point, weighted by distance.
    """

    def __init__(self, tau=0.5):
        """
        Parameters
        ----------
        tau : float
            Bandwidth parameter. Controls how quickly weights fall off
            with distance.
            - Small tau: more local fitting (wiggly)
            - Large tau: more global fitting (smooth)
        """
        self.tau = tau
        self.X_train = None
        self.y_train = None

    def fit(self, X, y):
        """Store training data (non-parametric algorithm)"""
        self.X_train = X
        self.y_train = y
        return self

    def _compute_weights(self, x_query):
        """
        Compute Gaussian weights for each training point:
        w^(i) = exp(-(x^(i) - x)^2 / (2*tau^2))
        """
        # Squared distances from each training point to the query
        distances = np.sum((self.X_train - x_query)**2, axis=1)
        # Gaussian weight function
        weights = np.exp(-distances / (2 * self.tau**2))
        return weights

    def _weighted_fit(self, x_query):
        """
        Fit weighted linear regression at the query point.
        Minimize: J(theta) = (1/2) sum_i w^(i) (h_theta(x^(i)) - y^(i))^2
        Solution: theta = (X^T W X)^(-1) X^T W y
        """
        # Get weights for this query point
        weights = self._compute_weights(x_query)
        # Diagonal weight matrix
        W = np.diag(weights)
        # Add bias term
        X_b = np.c_[np.ones((len(self.X_train), 1)), self.X_train]
        # Weighted least squares solution: theta = (X^T W X)^(-1) X^T W y
        theta = np.linalg.solve(X_b.T @ W @ X_b, X_b.T @ W @ self.y_train)
        return theta

    def predict(self, X):
        """Make predictions for multiple points (1-D features)"""
        predictions = []
        for x in X:
            # Fit a local model at this point
            theta = self._weighted_fit(x.reshape(1, -1))
            # Predict with the locally fitted theta
            x_b = np.array([1, x[0]])
            predictions.append(x_b @ theta)
        return np.array(predictions)

print("✅ Locally Weighted Regression class implemented")
print("\nKey features:")
print("  • Non-parametric: Stores all training data")
print("  • Local fitting: Different θ for each query point")
print("  • Gaussian weights: Nearby points have higher influence")
print("  • Bandwidth τ: Controls locality of fit")
Example 1: Non-linear Data (From Lecture 3)
What: Fitting curved data without choosing features
When the relationship between \(x\) and \(y\) is non-linear, standard linear regression fails because \(h_\theta(x) = \theta^T x\) can only produce straight lines. The classic workaround is polynomial feature engineering (\(x^2\), \(\sqrt{x}\), \(\log x\), etc.), but choosing the right basis functions is tedious and error-prone. Locally weighted regression sidesteps this entirely by fitting a different linear model at each query point, weighted by a Gaussian kernel \(w^{(i)} = \exp\!\bigl(-\|x^{(i)} - x\|^2 / 2\tau^2\bigr)\).
Why the bandwidth parameter matters
The bandwidth \(\tau\) controls the bias-variance tradeoff of LWR. A small \(\tau\) gives high weight only to the nearest neighbors, producing a wiggly fit that tracks noise (high variance, low bias). A large \(\tau\) spreads weight broadly, effectively averaging over more data and producing a smoother fit (low variance, higher bias). Comparing multiple \(\tau\) values side-by-side reveals this tradeoff visually and motivates the need for cross-validation to select \(\tau\) in practice.
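The cross-validation idea can be made concrete with leave-one-out error over a grid of τ values (a self-contained, illustrative sketch; the helper `lwr_point`, the synthetic data, and the τ grid are mine):

```python
import numpy as np

def lwr_point(X, y, x_q, tau):
    """One LWR prediction (1-D features) via weighted least squares."""
    w = np.exp(-(X - x_q) ** 2 / (2 * tau ** 2))
    Xb = np.c_[np.ones_like(X), X]
    theta = np.linalg.solve(Xb.T @ (w[:, None] * Xb), Xb.T @ (w * y))
    return np.array([1.0, x_q]) @ theta

rng = np.random.default_rng(0)
X = np.linspace(0, 10, 60)
y = np.sin(X) + 0.3 * rng.standard_normal(60)

def loo_mse(tau):
    """Leave-one-out CV: predict each point from all the others."""
    errs = []
    for i in range(len(X)):
        mask = np.arange(len(X)) != i
        pred = lwr_point(X[mask], y[mask], X[i], tau)
        errs.append((pred - y[i]) ** 2)
    return np.mean(errs)

taus = [0.1, 0.3, 0.5, 1.0, 2.0]
scores = {t: loo_mse(t) for t in taus}
best = min(scores, key=scores.get)
print(scores, "-> best tau:", best)
```

A very small τ near-interpolates the noise and a very large τ oversmooths the sine, so the leave-one-out error is minimized at an intermediate bandwidth.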
# Generate non-linear data similar to lecture example
print("📊 Example: Fitting Non-linear Curved Data")
print("=" * 70)

np.random.seed(42)

# Generate data with non-linear pattern (like lecture diagram)
X_train = np.linspace(0, 10, 50)
y_train = np.sin(X_train) + 0.5 * X_train + np.random.randn(50) * 0.3

# For visualization
X_test = np.linspace(0, 10, 200).reshape(-1, 1)

# Compare different bandwidths
tau_values = [0.1, 0.5, 1.0, 2.0]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, tau in enumerate(tau_values):
    # Fit LWR with this bandwidth
    lwr = LocallyWeightedRegression(tau=tau)
    lwr.fit(X_train.reshape(-1, 1), y_train)

    # Make predictions
    y_pred = lwr.predict(X_test)

    # Also fit standard linear regression for comparison
    X_b = np.c_[np.ones((len(X_train), 1)), X_train]
    theta_lr = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_train
    y_lr = np.c_[np.ones((len(X_test), 1)), X_test] @ theta_lr

    # Plot
    axes[idx].scatter(X_train, y_train, alpha=0.6, s=50,
                      c='steelblue', edgecolors='black', linewidth=0.5,
                      label='Training data')
    axes[idx].plot(X_test, y_pred, 'r-', linewidth=2.5,
                   label=f'LWR (τ={tau})')
    axes[idx].plot(X_test, y_lr, 'g--', linewidth=1.5, alpha=0.6,
                   label='Linear Regression')
    axes[idx].set_xlabel('x', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].set_title(f'Bandwidth τ = {tau}', fontsize=12, fontweight='bold')
    axes[idx].legend(fontsize=9)
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Locally Weighted Regression: Effect of Bandwidth Parameter τ',
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("  τ = 0.1 : Very local (wiggly) - follows noise, may overfit")
print("  τ = 0.5 : Good balance - captures curve without overfitting")
print("  τ = 1.0 : Smoother - averages over a larger neighborhood")
print("  τ = 2.0 : Very smooth - approaches linear regression")
print("\n  Linear Regression (green dashed): Cannot capture the curvature!")
print("\n  LWR automatically handles non-linearity without feature engineering")
Example 2: Visualizing Weight Functions
What: Seeing how LWR assigns importance to training points
For a given query point \(x\), the Gaussian weight function \(w^{(i)} = \exp\!\bigl(-\|x^{(i)} - x\|^2 / 2\tau^2\bigr)\) assigns a weight between 0 and 1 to every training example. Points close to \(x\) receive weight near 1 (strong influence on the local fit), while distant points receive weight near 0 (essentially ignored). By plotting these weights for several query points, we can see that LWR effectively performs a different regression at each location, using only the local neighborhood.
Why this matters
Understanding the weight function builds intuition for kernel methods more broadly: the same Gaussian kernel appears in support vector machines (RBF kernel), Gaussian processes, and kernel density estimation. In all these methods, the kernel defines a notion of similarity between points, and the bandwidth parameter controls the scale of that similarity.
# Visualize weight functions for different query points
print("📊 Weight Function Visualization")
print("=" * 70)

tau = 0.8
lwr = LocallyWeightedRegression(tau=tau)
lwr.fit(X_train.reshape(-1, 1), y_train)

# Query points to visualize
query_points = [2.5, 5.0, 7.5]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for idx, x_query in enumerate(query_points):
    # Compute weights for this query point
    weights = lwr._compute_weights(np.array([[x_query]]))

    # Top row: Data with weights
    axes[0, idx].scatter(X_train, y_train, c=weights, s=100*weights + 10,
                         cmap='RdYlBu_r', edgecolors='black', linewidth=0.5,
                         vmin=0, vmax=1, alpha=0.8)
    axes[0, idx].axvline(x_query, color='red', linestyle='--', linewidth=2,
                         label=f'Query: x={x_query}')
    axes[0, idx].set_xlabel('x', fontsize=10)
    axes[0, idx].set_ylabel('y', fontsize=10)
    axes[0, idx].set_title(f'Weights at x={x_query}', fontsize=11, fontweight='bold')
    axes[0, idx].legend(fontsize=9)
    axes[0, idx].grid(True, alpha=0.3)

    # Bottom row: Weight function
    axes[1, idx].bar(X_train, weights, width=0.15, color='coral',
                     edgecolor='black', linewidth=0.5, alpha=0.7)
    axes[1, idx].axvline(x_query, color='red', linestyle='--', linewidth=2)
    axes[1, idx].set_xlabel('x', fontsize=10)
    axes[1, idx].set_ylabel('Weight w(i)', fontsize=10)
    axes[1, idx].set_title(f'w(i) = exp(-(x(i)-{x_query})²/(2τ²))', fontsize=10)
    axes[1, idx].grid(True, alpha=0.3, axis='y')
    axes[1, idx].set_ylim([0, 1.1])

plt.suptitle(f'LWR Weight Function (τ={tau}): Different Query Points',
             fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print("\n💡 Key Insights:")
print("  • Point size ∝ weight (larger = higher influence)")
print("  • Color intensity ∝ weight (red = high, blue = low)")
print("  • Weights decay exponentially with distance from the query point")
print(f"  • At query x={query_points[1]}, nearby points have weights close to 1")
print("  • Distant points have weights close to 0")
print("\n  'Each prediction uses a different weighted combination of training data'")
Comparison: Linear Regression vs LWR vs Polynomial Regression
What: Head-to-head evaluation of three regression approaches
We fit the same non-linear dataset with three methods: ordinary linear regression (\(h_\theta(x) = \theta_0 + \theta_1 x\)), polynomial regression (degree 5), and locally weighted regression (\(\tau = 0.6\)). Linear regression underfits because a straight line cannot capture curvature. Polynomial regression can fit curves but requires choosing the degree and may extrapolate poorly. LWR adapts locally without any feature engineering.
Why: The feature engineering problem
In practice, deciding whether to use \(x^2\), \(\sqrt{x}\), \(\log x\), or some other transformation is one of the most time-consuming parts of building a parametric model. LWR eliminates this step entirely by performing local linear fits weighted by proximity. The tradeoff is computational cost: LWR requires \(O(n^3)\) per prediction (solving a weighted least squares problem each time) and must retain the full training set, making it a non-parametric method. This comparison illustrates when the convenience of LWR outweighs its computational overhead, and when a well-chosen parametric model is more practical.
# Generate data with complex non-linear pattern
print("📊 Comparison: Linear vs Polynomial vs LWR")
print("=" * 70)

np.random.seed(123)
X_comp = np.linspace(0, 10, 60)
y_comp = 2*np.sin(X_comp) + 0.3*X_comp**1.5 - 3 + np.random.randn(60) * 0.4

X_test_comp = np.linspace(0, 10, 300).reshape(-1, 1)

# 1. Linear Regression
X_b = np.c_[np.ones((len(X_comp), 1)), X_comp]
theta_lin = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_comp
y_linear = np.c_[np.ones((len(X_test_comp), 1)), X_test_comp] @ theta_lin

# 2. Polynomial Regression (degree 5)
from numpy.polynomial import Polynomial
poly = Polynomial.fit(X_comp, y_comp, 5)
y_poly = poly(X_test_comp.flatten())

# 3. Locally Weighted Regression
lwr_comp = LocallyWeightedRegression(tau=0.6)
lwr_comp.fit(X_comp.reshape(-1, 1), y_comp)
y_lwr = lwr_comp.predict(X_test_comp)

# Visualize comparison
plt.figure(figsize=(14, 6))
plt.scatter(X_comp, y_comp, alpha=0.6, s=70, c='steelblue',
            edgecolors='black', linewidth=0.8, label='Training data', zorder=5)
plt.plot(X_test_comp, y_linear, 'g--', linewidth=2.5, alpha=0.7,
         label='Linear Regression (underfit)')
plt.plot(X_test_comp, y_poly, 'orange', linewidth=2.5, alpha=0.7,
         label='Polynomial (degree 5)')
plt.plot(X_test_comp, y_lwr, 'r-', linewidth=3,
         label='LWR (τ=0.6) - no feature engineering!', zorder=4)
plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Model Comparison: LWR Automatically Handles Non-linearity',
          fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate MSE for each method
from sklearn.metrics import mean_squared_error

# Predictions on training data
y_pred_lin = X_b @ theta_lin
y_pred_poly = poly(X_comp)
y_pred_lwr = lwr_comp.predict(X_comp.reshape(-1, 1))

mse_lin = mean_squared_error(y_comp, y_pred_lin)
mse_poly = mean_squared_error(y_comp, y_pred_poly)
mse_lwr = mean_squared_error(y_comp, y_pred_lwr)

print("\n📈 Training MSE:")
print(f"  Linear Regression:  {mse_lin:.4f} (underfit - too simple)")
print(f"  Polynomial (deg 5): {mse_poly:.4f}")
print(f"  LWR (τ=0.6):        {mse_lwr:.4f} (lowest training error)")

print("\n💡 Key Advantages of LWR:")
print("  ✅ No feature engineering needed (no x², √x, log(x), etc.)")
print("  ✅ Automatically adapts to local patterns")
print("  ✅ One parameter (τ) vs choosing polynomial degree")
print("  ❌ Must keep training data (non-parametric)")
print("  ❌ Slower prediction (O(n³) per query)")
When to Use LWR
✅ Good for LWR:
- Low-dimensional data (typically n ≤ 5 features)
- Non-linear patterns without an obvious functional form
- Small to medium datasets (< 10,000 points)
- Offline prediction where speed is not critical
- Avoiding feature engineering effort
❌ Not good for LWR:
- High-dimensional data (curse of dimensionality)
- Real-time prediction requirements
- Large datasets (memory and computation issues)
- When a parametric model works well
Alternative Approaches:

| Method | When to Use |
|---|---|
| Linear Regression | Linear patterns, need interpretability |
| Polynomial Features | Know the non-linear form |
| LWR | Unknown non-linearity, low dimensions |
| Neural Networks | Complex patterns, high dimensions, large data |
| Kernel Methods (SVM) | Non-linear, medium data, want generalization |
From Lecture: "Later this quarter you'll hear about feature selection algorithms for automatically deciding what features to use."
Mathematical Derivation (Optional)
Weighted Least Squares
Objective: minimize the weighted sum of squared errors
$$J(\theta) = \frac{1}{2} \sum_{i=1}^m w^{(i)} \left(\theta^T x^{(i)} - y^{(i)}\right)^2$$
Matrix form:
$$J(\theta) = \frac{1}{2} (X\theta - y)^T W (X\theta - y)$$
where \(W = \text{diag}(w^{(1)}, \ldots, w^{(m)})\).
Taking the derivative and setting it to zero:
$$\frac{\partial J}{\partial \theta} = X^T W (X\theta - y) = 0$$
Solution:
$$\theta^* = (X^T W X)^{-1} X^T W y$$
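The closed form can be sanity-checked numerically: scaling both X and y by √W turns weighted least squares into ordinary least squares, so `np.linalg.lstsq` on the rescaled problem should recover the same θ (a small verification sketch with synthetic data of my choosing):

```python
import numpy as np

rng = np.random.default_rng(1)
m = 40
X = np.c_[np.ones(m), rng.uniform(0, 5, m)]    # design matrix with intercept
y = X @ np.array([1.0, 2.0]) + rng.standard_normal(m)
w = rng.uniform(0.1, 1.0, m)                   # arbitrary positive weights
W = np.diag(w)

# Closed form: theta = (X^T W X)^{-1} X^T W y
theta_closed = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# Equivalent OLS on sqrt(W)-rescaled data
sqrt_w = np.sqrt(w)
theta_ols, *_ = np.linalg.lstsq(sqrt_w[:, None] * X, sqrt_w * y, rcond=None)

print(np.allclose(theta_closed, theta_ols))  # True
```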
Weight Function
Gaussian kernel (most common):
$$w^{(i)} = \exp\left(-\frac{\|x^{(i)} - x\|^2}{2\tau^2}\right)$$
Properties:
- \(w^{(i)} \in (0, 1]\)
- \(w^{(i)} = 1\) when \(x^{(i)} = x\)
- \(w^{(i)} \rightarrow 0\) as \(\|x^{(i)} - x\| \rightarrow \infty\)
- \(\tau\) controls the decay rate
Other kernel options (written without their usual normalizing constants):
- Epanechnikov: \(w^{(i)} = \max(0,\, 1 - \|x^{(i)} - x\|^2)\)
- Tri-cube: \(w^{(i)} = (1 - \|x^{(i)} - x\|^3)^3\) if \(\|x^{(i)} - x\| < 1\), else 0
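The three kernels can be compared on a normalized distance u: the Gaussian has unbounded support, while Epanechnikov and tri-cube cut off exactly at u = 1 (an illustrative sketch matching the unnormalized forms above):

```python
import numpy as np

def gaussian(u, tau=1.0):
    return np.exp(-u**2 / (2 * tau**2))

def epanechnikov(u):
    return np.maximum(0.0, 1 - u**2)

def tricube(u):
    return np.where(np.abs(u) < 1, (1 - np.abs(u)**3) ** 3, 0.0)

u = np.array([0.0, 0.5, 1.0, 1.5])
for name, k in [("gaussian", gaussian),
                ("epanechnikov", epanechnikov),
                ("tricube", tricube)]:
    print(f"{name:>12}:", np.round(k(u), 4))
```

Compact-support kernels make every point beyond the cutoff contribute exactly zero, which allows the weighted least squares solve to skip distant points entirely; the Gaussian never reaches zero, so every training point formally participates in every fit.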
print("✅ Locally Weighted Regression notebook complete!")
print("\n📚 Key Takeaways from CS229 Lecture 3:")
print("  1. LWR is a non-parametric algorithm (keeps training data)")
print("  2. Fits a different model at each query point")
print("  3. Weights are a Gaussian function of distance")
print("  4. Bandwidth τ controls locality")
print("  5. Avoids feature engineering for non-linear patterns")
print("  6. Works well in low dimensions")
print("\n🎯 From Andrew Ng:")
print("  'Locally weighted regression sidesteps the problem of having to")
print("   choose features by fitting a local model at each query point.'")