Lecture 3: Locally Weighted Regression (LWR)ΒΆ

πŸ“Ή Watch Lecture

From Andrew Ng’s CS229 Lecture 3 (Autumn 2018)ΒΆ

β€œIf you have curved data… it’s quite difficult to find features. Is it √x, log(x), xΒ³? What is the set of features that lets you do this? Locally weighted regression sidesteps all those problems.” - Andrew Ng

Parametric vs Non-Parametric LearningΒΆ

Parametric Algorithm (e.g., Linear Regression):

  • Fit fixed set of parameters (ΞΈ) to data

  • After fitting, discard training data

  • Prediction is fast: h(x) = ΞΈα΅€x

Non-Parametric Algorithm (e.g., LWR):

  • Keep entire training set around

  • Different parameters for each query point

  • Slower prediction but more flexible

The LWR AlgorithmΒΆ

Key Idea: Weight training examples by proximity to query point

Weight Function (Gaussian kernel):

w⁽ⁱ⁾ = exp(-(x⁽ⁱ⁾ - x)Β² / (2τ²))

For each prediction:

  1. Compute weights for all training points

  2. Solve weighted least squares:

    ΞΈ = (Xα΅€WX)⁻¹Xα΅€Wy
    

    where W = diag(w⁽¹⁾, …, w⁽ᡐ⁾)

  3. Return prediction: ΞΈα΅€x
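The three steps can be sketched in a few lines of NumPy (a minimal 1-D sketch; `lwr_predict` and its arguments are illustrative names, not code from the lecture):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Minimal LWR sketch: one weighted least-squares fit per query point.

    X : (m,) array of 1-D inputs, y : (m,) targets, x_query : scalar.
    """
    # Step 1: Gaussian weights by proximity to the query point
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))

    # Step 2: weighted least squares  theta = (X^T W X)^{-1} X^T W y
    Xb = np.column_stack([np.ones_like(X), X])   # prepend intercept column
    A = Xb.T @ (w[:, None] * Xb)                 # X^T W X, without forming diag(w)
    b = Xb.T @ (w * y)
    theta = np.linalg.solve(A, b)

    # Step 3: predict at the query point
    return float(np.array([1.0, x_query]) @ theta)

# On exactly linear data every weighted fit recovers the same line,
# so the prediction matches 2x + 1 regardless of tau.
X = np.linspace(0.0, 5.0, 20)
y = 2.0 * X + 1.0
print(round(lwr_predict(X, y, 2.5), 4))  # close to 2*2.5 + 1 = 6.0
```

On curved data, by contrast, each query point gets its own local line, which is what produces the non-linear overall fit.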

Bandwidth Parameter Ο„ (tau)ΒΆ

Controls locality:

  • Small Ο„ (e.g., 0.1): Very local, wiggly fit

  • Large Ο„ (e.g., 2.0): More global, smooth fit

  • Medium Ο„ (e.g., 0.5): Balanced

When to Use LWRΒΆ

βœ“ Use LWR when:

  • Low-dimensional features (n ≀ 5)

  • Non-linear patterns

  • Don’t want to do feature engineering

  • Can afford a weighted least-squares solve per prediction (roughly O(mn² + n³) for m training examples, n features)

βœ— Don’t use LWR when:

  • High dimensions (curse of dimensionality)

  • Need real-time predictions

  • Limited training data

  • Features are already well-chosen

ComparisonΒΆ

| Algorithm | Parameters | Training Data | Prediction Speed | Flexibility |
|---|---|---|---|---|
| Linear Regression | Fixed θ | Discarded | Fast: O(n) | Low |
| Polynomial Regression | Fixed θ | Discarded | Fast: O(n) | Medium |
| LWR | New θ per query | Kept | Slow: O(mn² + n³) per query | High |

From the LectureΒΆ

β€œThis is a non-parametric learning algorithm… there’s no longer a fixed, finite number of parameters that you use to represent your hypothesis.”

On feature engineering:

β€œRather than having to spend a long time to decide, do I want features of the form x, xΒ², √x, log(x), you can use locally weighted regression and it will do a good job fitting a variety of different data.”

The trade-off:

β€œThe disadvantage is that you need to keep the entire training set around… For every new query point, you need to fit a new set of parameters.”

import numpy as np
import matplotlib.pyplot as plt

# Locally Weighted Regression Implementation
class LocallyWeightedRegression:
    """
    Locally Weighted Regression (LWR)
    
    From CS229 Lecture 3: Non-parametric learning algorithm that fits
    a different linear model at each query point, weighted by distance.
    """
    
    def __init__(self, tau=0.5):
        """
        Parameters:
        -----------
        tau : float
            Bandwidth parameter. Controls how quickly weights fall off with distance.
            - Small tau: more local fitting (wiggly)
            - Large tau: more global fitting (smooth)
        """
        self.tau = tau
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """Store training data (non-parametric algorithm)"""
        self.X_train = X
        self.y_train = y
        return self
    
    def _compute_weights(self, x_query):
        """
        Compute Gaussian weights for each training point
        
        w^(i) = exp(-(x^(i) - x)^2 / (2*tau^2))
        """
        # Compute squared distances
        distances = np.sum((self.X_train - x_query)**2, axis=1)
        
        # Gaussian weight function
        weights = np.exp(-distances / (2 * self.tau**2))
        
        return weights
    
    def _weighted_fit(self, x_query):
        """
        Fit weighted linear regression at query point
        
        Minimize: J(ΞΈ) = (1/2) Ξ£ w^(i) (h_ΞΈ(x^(i)) - y^(i))^2
        Solution: ΞΈ = (X^T W X)^(-1) X^T W y
        """
        # Get weights for this query point
        weights = self._compute_weights(x_query)
        
        # Add bias (intercept) term
        X_b = np.c_[np.ones((len(self.X_train), 1)), self.X_train]
        
        # Weighted least squares solution: ΞΈ = (X^T W X)^(-1) X^T W y
        # Apply the weights via broadcasting (avoids forming the m×m diag(w)
        # matrix) and use solve() instead of an explicit inverse for stability
        theta = np.linalg.solve(X_b.T @ (weights[:, None] * X_b),
                                X_b.T @ (weights * self.y_train))
        
        return theta
    
    def predict(self, X):
        """Make predictions for multiple points"""
        predictions = []
        
        for x in X:
            # Fit local model at this point
            theta = self._weighted_fit(x.reshape(1, -1))
            
            # Make prediction (prepend intercept term; theta is a 1-D vector,
            # so x_b @ theta is already a scalar)
            x_b = np.concatenate(([1.0], np.ravel(x)))
            predictions.append(float(x_b @ theta))
        
        return np.array(predictions)

print("βœ… Locally Weighted Regression class implemented")
print("\nKey features:")
print("  β€’ Non-parametric: Stores all training data")
print("  β€’ Local fitting: Different ΞΈ for each query point")
print("  β€’ Gaussian weights: Nearby points have higher influence")
print("  β€’ Bandwidth Ο„: Controls locality of fit")

Example 1: Non-linear Data (From Lecture 3)ΒΆ

What: Fitting curved data without choosing featuresΒΆ

When the relationship between \(x\) and \(y\) is non-linear, standard linear regression fails because \(h_\theta(x) = \theta^T x\) can only produce straight lines. The classic workaround is polynomial feature engineering (\(x^2\), \(\sqrt{x}\), \(\log x\), etc.), but choosing the right basis functions is tedious and error-prone. Locally weighted regression sidesteps this entirely by fitting a different linear model at each query point, weighted by a Gaussian kernel \(w^{(i)} = \exp\!\bigl(-\|x^{(i)} - x\|^2 / 2\tau^2\bigr)\).

Why the bandwidth parameter mattersΒΆ

The bandwidth \(\tau\) controls the bias-variance tradeoff of LWR. A small \(\tau\) gives high weight only to the nearest neighbors, producing a wiggly fit that tracks noise (high variance, low bias). A large \(\tau\) spreads weight broadly, effectively averaging over more data and producing a smoother fit (low variance, higher bias). Comparing multiple \(\tau\) values side-by-side reveals this tradeoff visually and motivates the need for cross-validation to select \(\tau\) in practice.
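As a concrete way to pick τ, here is a leave-one-out cross-validation sketch (self-contained; `lwr_predict`, `loocv_error`, and the candidate grid are illustrative, not from the lecture):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """One weighted least-squares fit at x_query (1-D sketch)."""
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    Xb = np.column_stack([np.ones_like(X), X])
    theta = np.linalg.solve(Xb.T @ (w[:, None] * Xb), Xb.T @ (w * y))
    return float(np.array([1.0, x_query]) @ theta)

def loocv_error(X, y, tau):
    """Leave-one-out CV: predict each point from the remaining m-1 points."""
    idx = np.arange(len(X))
    sq_errs = [(lwr_predict(X[idx != i], y[idx != i], X[i], tau) - y[i]) ** 2
               for i in idx]
    return float(np.mean(sq_errs))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 40)
y = np.sin(X) + rng.normal(scale=0.2, size=X.size)

taus = [0.3, 0.6, 1.0, 3.0]   # illustrative candidate bandwidths
scores = {tau: loocv_error(X, y, tau) for tau in taus}
best = min(scores, key=scores.get)
print({tau: round(s, 4) for tau, s in scores.items()})
print("selected tau:", best)
```

The oversmoothed τ = 3.0 fit should score clearly worse than the mid-range bandwidths, matching the bias-variance picture above.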

# Generate non-linear data similar to lecture example
print("πŸ“Š Example: Fitting Non-linear Curved Data")
print("="*70)

np.random.seed(42)

# Generate data with non-linear pattern (like lecture diagram)
X_train = np.linspace(0, 10, 50)
y_train = np.sin(X_train) + 0.5 * X_train + np.random.randn(50) * 0.3

# For visualization
X_test = np.linspace(0, 10, 200).reshape(-1, 1)

# Compare different bandwidths
tau_values = [0.1, 0.5, 1.0, 2.0]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, tau in enumerate(tau_values):
    # Fit LWR with this bandwidth
    lwr = LocallyWeightedRegression(tau=tau)
    lwr.fit(X_train.reshape(-1, 1), y_train)
    
    # Make predictions
    y_pred = lwr.predict(X_test)
    
    # Also fit standard linear regression for comparison
    X_b = np.c_[np.ones((len(X_train), 1)), X_train]
    theta_lr = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_train
    y_lr = np.c_[np.ones((len(X_test), 1)), X_test] @ theta_lr
    
    # Plot
    axes[idx].scatter(X_train, y_train, alpha=0.6, s=50, 
                     c='steelblue', edgecolors='black', linewidth=0.5,
                     label='Training data')
    axes[idx].plot(X_test, y_pred, 'r-', linewidth=2.5, 
                  label=f'LWR (Ο„={tau})')
    axes[idx].plot(X_test, y_lr, 'g--', linewidth=1.5, alpha=0.6,
                  label='Linear Regression')
    
    axes[idx].set_xlabel('x', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].set_title(f'Bandwidth Ο„ = {tau}', fontsize=12, fontweight='bold')
    axes[idx].legend(fontsize=9)
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Locally Weighted Regression: Effect of Bandwidth Parameter Ο„', 
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\nπŸ’‘ Observations:")
print("   Ο„ = 0.1  : Very local (wiggly) - follows noise, may overfit")
print("   Ο„ = 0.5  : Good balance - captures curve without overfitting")  
print("   Ο„ = 1.0  : More smooth - averages over larger neighborhood")
print("   Ο„ = 2.0  : Very smooth - approaches linear regression")
print("\n   Linear Regression (green dashed): Cannot capture the curvature!")
print("\n   LWR automatically handles non-linearity without feature engineering")

Example 2: Visualizing Weight FunctionsΒΆ

What: Seeing how LWR assigns importance to training pointsΒΆ

For a given query point \(x\), the Gaussian weight function \(w^{(i)} = \exp\!\bigl(-\|x^{(i)} - x\|^2 / 2\tau^2\bigr)\) assigns a weight between 0 and 1 to every training example. Points close to \(x\) receive weight near 1 (strong influence on the local fit), while distant points receive weight near 0 (essentially ignored). By plotting these weights for several query points, we can see that LWR effectively performs a different regression at each location, using only the local neighborhood.
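To make the decay concrete, a small sketch computing the weights for a handful of training points around a query at x = 5.0 (the τ = 0.8 and the point locations are illustrative):

```python
import numpy as np

# Weights for training points at increasing distance from the query point,
# following w = exp(-(x_i - x)² / (2τ²)) as defined above
tau, x_query = 0.8, 5.0
x_train = np.array([5.0, 5.5, 6.0, 7.0, 9.0])
w = np.exp(-(x_train - x_query) ** 2 / (2 * tau ** 2))
for xi, wi in zip(x_train, w):
    print(f"x = {xi}: w = {wi:.4f}")
```

The point at the query itself gets weight exactly 1, a point two units away is already down near 0.04, and a point four units away contributes essentially nothing.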

Why this mattersΒΆ

Understanding the weight function builds intuition for kernel methods more broadly – the same Gaussian kernel appears in support vector machines (RBF kernel), Gaussian processes, and kernel density estimation. In all these methods, the kernel defines a notion of similarity between points, and the bandwidth parameter controls the scale of that similarity.

# Visualize weight functions for different query points
print("πŸ“Š Weight Function Visualization")
print("="*70)

tau = 0.8
lwr = LocallyWeightedRegression(tau=tau)
lwr.fit(X_train.reshape(-1, 1), y_train)

# Query points to visualize
query_points = [2.5, 5.0, 7.5]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for idx, x_query in enumerate(query_points):
    # Compute weights for this query point
    weights = lwr._compute_weights(np.array([[x_query]]))
    
    # Top row: Data with weights
    axes[0, idx].scatter(X_train, y_train, c=weights, s=100*weights + 10,
                        cmap='RdYlBu_r', edgecolors='black', linewidth=0.5,
                        vmin=0, vmax=1, alpha=0.8)
    axes[0, idx].axvline(x_query, color='red', linestyle='--', linewidth=2,
                        label=f'Query: x={x_query}')
    axes[0, idx].set_xlabel('x', fontsize=10)
    axes[0, idx].set_ylabel('y', fontsize=10)
    axes[0, idx].set_title(f'Weights at x={x_query}', fontsize=11, fontweight='bold')
    axes[0, idx].legend(fontsize=9)
    axes[0, idx].grid(True, alpha=0.3)
    
    # Bottom row: Weight function
    axes[1, idx].bar(X_train, weights, width=0.15, color='coral', 
                    edgecolor='black', linewidth=0.5, alpha=0.7)
    axes[1, idx].axvline(x_query, color='red', linestyle='--', linewidth=2)
    axes[1, idx].set_xlabel('x', fontsize=10)
    axes[1, idx].set_ylabel('Weight w(i)', fontsize=10)
    axes[1, idx].set_title(f'w(i) = exp(-(x(i)-{x_query})Β²/(2τ²))', fontsize=10)
    axes[1, idx].grid(True, alpha=0.3, axis='y')
    axes[1, idx].set_ylim([0, 1.1])

plt.suptitle(f'LWR Weight Function (Ο„={tau}): Different Query Points', 
             fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print(f"\nπŸ’‘ Key Insights:")
print(f"   β€’ Point size ∝ weight (larger = higher influence)")
print(f"   β€’ Color intensity ∝ weight (red = high, blue = low)")
print(f"   β€’ Weights decay exponentially with distance from query point")
print(f"   β€’ At query x={query_points[1]}, nearby points have weights close to 1")
print(f"   β€’ Distant points have weights close to 0")
print(f"\n   'Each prediction uses a different weighted combination of training data'")

Comparison: Linear Regression vs LWR vs Polynomial RegressionΒΆ

What: Head-to-head evaluation of three regression approachesΒΆ

We fit the same non-linear dataset with three methods: ordinary linear regression (\(h_\theta(x) = \theta_0 + \theta_1 x\)), polynomial regression (degree 5), and locally weighted regression (\(\tau = 0.6\)). Linear regression underfits because a straight line cannot capture curvature. Polynomial regression can fit curves but requires choosing the degree and may extrapolate poorly. LWR adapts locally without any feature engineering.

Why: The feature engineering problemΒΆ

In practice, deciding whether to use \(x^2\), \(\sqrt{x}\), \(\log x\), or some other transformation is one of the most time-consuming parts of building a parametric model. LWR eliminates this step entirely by performing local linear fits weighted by proximity. The tradeoff is computational cost: LWR requires roughly \(O(mn^2 + n^3)\) per prediction for \(m\) training examples and \(n\) features (a weighted least-squares solve each time) and must retain the full training set, making it a non-parametric method. This comparison illustrates when the convenience of LWR outweighs its computational overhead, and when a well-chosen parametric model is more practical.

# Generate data with complex non-linear pattern
print("πŸ“Š Comparison: Linear vs Polynomial vs LWR")
print("="*70)

np.random.seed(123)
X_comp = np.linspace(0, 10, 60)
y_comp = 2*np.sin(X_comp) + 0.3*X_comp**1.5 - 3 + np.random.randn(60) * 0.4

X_test_comp = np.linspace(0, 10, 300).reshape(-1, 1)

# 1. Linear Regression
X_b = np.c_[np.ones((len(X_comp), 1)), X_comp]
theta_lin = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_comp
y_linear = np.c_[np.ones((len(X_test_comp), 1)), X_test_comp] @ theta_lin

# 2. Polynomial Regression (degree 5)
from numpy.polynomial import Polynomial
poly = Polynomial.fit(X_comp, y_comp, 5)
y_poly = poly(X_test_comp.flatten())

# 3. Locally Weighted Regression
lwr_comp = LocallyWeightedRegression(tau=0.6)
lwr_comp.fit(X_comp.reshape(-1, 1), y_comp)
y_lwr = lwr_comp.predict(X_test_comp)

# Visualize comparison
plt.figure(figsize=(14, 6))

plt.scatter(X_comp, y_comp, alpha=0.6, s=70, c='steelblue', 
           edgecolors='black', linewidth=0.8, label='Training data', zorder=5)

plt.plot(X_test_comp, y_linear, 'g--', linewidth=2.5, alpha=0.7,
        label='Linear Regression (underfit)')
plt.plot(X_test_comp, y_poly, 'orange', linewidth=2.5, alpha=0.7,
        label='Polynomial (degree 5)')
plt.plot(X_test_comp, y_lwr, 'r-', linewidth=3,
        label='LWR (Ο„=0.6) - no feature engineering!', zorder=4)

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Model Comparison: LWR Automatically Handles Non-linearity', 
         fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate MSE for each method
from sklearn.metrics import mean_squared_error

# Predictions on training data
y_pred_lin = X_b @ theta_lin
y_pred_poly = poly(X_comp)
y_pred_lwr = lwr_comp.predict(X_comp.reshape(-1, 1))

mse_lin = mean_squared_error(y_comp, y_pred_lin)
mse_poly = mean_squared_error(y_comp, y_pred_poly)
mse_lwr = mean_squared_error(y_comp, y_pred_lwr)

print(f"\nπŸ“Š Training MSE:")
print(f"   Linear Regression:     {mse_lin:.4f}  (underfit - too simple)")
print(f"   Polynomial (deg 5):    {mse_poly:.4f}")
print(f"   LWR (Ο„=0.6):          {mse_lwr:.4f}  (best fit)")

print(f"\nπŸ’‘ Key Advantages of LWR:")
print(f"   βœ“ No feature engineering needed (no xΒ², √x, log(x), etc.)")
print(f"   βœ“ Automatically adapts to local patterns")
print(f"   βœ“ One parameter (Ο„) vs choosing polynomial degree")
print(f"   βœ— Must keep training data (non-parametric)")
print(f"   βœ— Slower prediction: O(mnΒ² + nΒ³) per query (m examples, n features)")

When to Use LWRΒΆ

βœ… Good for LWR:ΒΆ

  • Low dimensional data (n ≀ 5 features typically)

  • Non-linear patterns without obvious functional form

  • Small to medium datasets (< 10,000 points)

  • Offline prediction where speed is not critical

  • Avoiding feature engineering effort

❌ Not good for LWR:¢

  • High dimensional data (curse of dimensionality)

  • Real-time prediction requirements

  • Large datasets (memory and computation issues)

  • When parametric model works well

Alternative Approaches:ΒΆ

| Method | When to Use |
|---|---|
| Linear Regression | Linear patterns, need interpretability |
| Polynomial Features | Know the non-linear form |
| LWR | Unknown non-linearity, low dimensions |
| Neural Networks | Complex patterns, high dimensions, large data |
| Kernel Methods (SVM) | Non-linear, medium data, want generalization |

From Lecture: β€œLater this quarter you’ll hear about feature selection algorithms for automatically deciding what features to use.”

Mathematical Derivation (Optional)ΒΆ

Weighted Least SquaresΒΆ

Objective: Minimize the weighted sum of squared errors

\[J(\theta) = \frac{1}{2} \sum_{i=1}^m w^{(i)} \left(\theta^T x^{(i)} - y^{(i)}\right)^2\]

Matrix Form:

\[J(\theta) = \frac{1}{2} (X\theta - y)^T W (X\theta - y)\]

where \(W = \text{diag}(w^{(1)}, \ldots, w^{(m)})\)

Taking the derivative and setting it to zero:

\[\frac{\partial J}{\partial \theta} = X^T W (X\theta - y) = 0\]

Solution:

\[\theta^* = (X^T W X)^{-1} X^T W y\]
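As a numerical sanity check of this closed form (a small random instance; all names are illustrative):

```python
import numpy as np

# At theta* = (X^T W X)^{-1} X^T W y, the gradient X^T W (X theta - y)
# should vanish, confirming theta* minimizes the weighted objective.
rng = np.random.default_rng(1)
m, n = 30, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
w = rng.uniform(0.1, 1.0, size=m)
W = np.diag(w)

theta_star = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
grad = X.T @ W @ (X @ theta_star - y)
print("max |gradient| at theta*:", float(np.max(np.abs(grad))))
```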

Weight FunctionΒΆ

Gaussian kernel (most common): $\(w^{(i)} = \exp\left(-\frac{||x^{(i)} - x||^2}{2\tau^2}\right)\)$

Properties:

  • \(w^{(i)} \in (0, 1]\)

  • \(w^{(i)} = 1\) when \(x^{(i)} = x\)

  • \(w^{(i)} \rightarrow 0\) as \(||x^{(i)} - x|| \rightarrow \infty\)

  • \(\tau\) controls decay rate

Other kernel options (with normalized distance \(u = \|x^{(i)} - x\| / \tau\)):

  • Epanechnikov: \(w^{(i)} = \max(0,\, 1 - u^2)\)

  • Tri-cube: \(w^{(i)} = (1 - |u|^3)^3\) if \(|u| < 1\), and \(0\) otherwise
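A sketch of these kernels side by side, using the normalized distance \(u = \|x^{(i)} - x\|/\tau\) (the τ-scaling is an assumption made so the kernels are comparable; the compact support of the last two is the key difference from the Gaussian):

```python
import numpy as np

def gaussian(u):
    # Gaussian kernel: positive everywhere, never exactly zero
    return np.exp(-u ** 2 / 2)

def epanechnikov(u):
    # Compact support: exactly zero for |u| >= 1
    return np.maximum(0.0, 1.0 - u ** 2)

def tricube(u):
    # Compact support with smoother decay near the boundary
    return np.where(np.abs(u) < 1, (1 - np.abs(u) ** 3) ** 3, 0.0)

u = np.array([0.0, 0.5, 1.0, 2.0])
print("gaussian    :", np.round(gaussian(u), 4))
print("epanechnikov:", np.round(epanechnikov(u), 4))
print("tricube     :", np.round(tricube(u), 4))
```

The compactly supported kernels make each local fit depend on only a finite neighborhood, which can speed up prediction when combined with a spatial index.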

print("βœ… Locally Weighted Regression notebook complete!")
print("\nπŸ“š Key Takeaways from CS229 Lecture 3:")
print("   1. LWR is a non-parametric algorithm (keeps training data)")
print("   2. Fits different model at each query point")
print("   3. Weights are Gaussian function of distance")
print("   4. Bandwidth Ο„ controls locality")
print("   5. Avoids feature engineering for non-linear patterns")
print("   6. Works well in low dimensions")
print("\n🎯 From Andrew Ng:")
print("   'Locally weighted regression sidesteps the problem of having to")
print("    choose features by fitting a local model at each query point.'")