Lecture 3: Locally Weighted Regression (LWR)ΒΆ

πŸ“Ή Watch Lecture

From Andrew Ng’s CS229 Lecture 3 (Autumn 2018)ΒΆ

β€œIf you have curved data… it’s quite difficult to find features. Is it √x, log(x), xΒ³? What is the set of features that lets you do this? Locally weighted regression sidesteps all those problems.” - Andrew Ng

Parametric vs Non-Parametric LearningΒΆ

Parametric Algorithm (e.g., Linear Regression):

  • Fit fixed set of parameters (ΞΈ) to data

  • After fitting, discard training data

  • Prediction is fast: h(x) = ΞΈα΅€x

Non-Parametric Algorithm (e.g., LWR):

  • Keep entire training set around

  • Different parameters for each query point

  • Slower prediction but more flexible

The LWR AlgorithmΒΆ

Key Idea: Weight training examples by proximity to query point

Weight Function (Gaussian kernel):

w⁽ⁱ⁾ = exp(-(x⁽ⁱ⁾ - x)Β² / (2τ²))

For each prediction:

  1. Compute weights for all training points

  2. Solve weighted least squares:

    ΞΈ = (Xα΅€WX)⁻¹Xα΅€Wy
    

    where W = diag(w⁽¹⁾, …, w⁽ᡐ⁾)

  3. Return prediction: ΞΈα΅€x
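The three steps can be sketched in a few lines of NumPy (a minimal 1-D sketch; `lwr_predict` and its arguments are illustrative names, not code from the lecture):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau=0.5):
    """Minimal LWR sketch: one weighted least-squares fit per query point.

    X : (m,) array of 1-D inputs, y : (m,) targets, x_query : scalar.
    """
    # Step 1: Gaussian weights by proximity to the query point
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))

    # Step 2: weighted least squares  theta = (X^T W X)^{-1} X^T W y
    Xb = np.column_stack([np.ones_like(X), X])   # prepend intercept column
    A = Xb.T @ (w[:, None] * Xb)                 # X^T W X, without forming diag(w)
    b = Xb.T @ (w * y)
    theta = np.linalg.solve(A, b)

    # Step 3: predict at the query point
    return float(np.array([1.0, x_query]) @ theta)

# On exactly linear data every weighted fit recovers the same line,
# so the prediction matches 2x + 1 regardless of tau.
X = np.linspace(0.0, 5.0, 20)
y = 2.0 * X + 1.0
print(round(lwr_predict(X, y, 2.5), 4))  # close to 2*2.5 + 1 = 6.0
```

On curved data, by contrast, each query point gets its own local line, which is what produces the non-linear overall fit.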

Bandwidth Parameter Ο„ (tau)ΒΆ

Controls locality:

  • Small Ο„ (e.g., 0.1): Very local, wiggly fit

  • Large Ο„ (e.g., 2.0): More global, smooth fit

  • Medium Ο„ (e.g., 0.5): Balanced

When to Use LWRΒΆ

βœ“ Use LWR when:

  • Low-dimensional features (n ≀ 5)

  • Non-linear patterns

  • Don’t want to do feature engineering

  • Can afford a weighted least-squares solve per prediction (roughly O(mn² + n³) for m training examples, n features)

βœ— Don’t use LWR when:

  • High dimensions (curse of dimensionality)

  • Need real-time predictions

  • Limited training data

  • Features are already well-chosen

ComparisonΒΆ

| Algorithm | Parameters | Training Data | Prediction Speed | Flexibility |
|---|---|---|---|---|
| Linear Regression | Fixed θ | Discarded | Fast: O(n) | Low |
| Polynomial Regression | Fixed θ | Discarded | Fast: O(n) | Medium |
| LWR | New θ per query | Kept | Slow: O(mn² + n³) per query | High |

From the LectureΒΆ

β€œThis is a non-parametric learning algorithm… there’s no longer a fixed, finite number of parameters that you use to represent your hypothesis.”

On feature engineering:

β€œRather than having to spend a long time to decide, do I want features of the form x, xΒ², √x, log(x), you can use locally weighted regression and it will do a good job fitting a variety of different data.”

The trade-off:

β€œThe disadvantage is that you need to keep the entire training set around… For every new query point, you need to fit a new set of parameters.”

import numpy as np
import matplotlib.pyplot as plt

# Locally Weighted Regression Implementation
class LocallyWeightedRegression:
    """
    Locally Weighted Regression (LWR)
    
    From CS229 Lecture 3: Non-parametric learning algorithm that fits
    a different linear model at each query point, weighted by distance.
    """
    
    def __init__(self, tau=0.5):
        """
        Parameters:
        -----------
        tau : float
            Bandwidth parameter. Controls how quickly weights fall off with distance.
            - Small tau: more local fitting (wiggly)
            - Large tau: more global fitting (smooth)
        """
        self.tau = tau
        self.X_train = None
        self.y_train = None
    
    def fit(self, X, y):
        """Store training data (non-parametric algorithm)"""
        self.X_train = X
        self.y_train = y
        return self
    
    def _compute_weights(self, x_query):
        """
        Compute Gaussian weights for each training point
        
        w^(i) = exp(-(x^(i) - x)^2 / (2*tau^2))
        """
        # Compute squared distances
        distances = np.sum((self.X_train - x_query)**2, axis=1)
        
        # Gaussian weight function
        weights = np.exp(-distances / (2 * self.tau**2))
        
        return weights
    
    def _weighted_fit(self, x_query):
        """
        Fit weighted linear regression at query point
        
        Minimize: J(ΞΈ) = (1/2) Ξ£ w^(i) (h_ΞΈ(x^(i)) - y^(i))^2
        Solution: ΞΈ = (X^T W X)^(-1) X^T W y
        """
        # Get weights for this query point
        weights = self._compute_weights(x_query)
        
        # Add bias (intercept) term
        X_b = np.c_[np.ones((len(self.X_train), 1)), self.X_train]
        
        # Weighted least squares solution: ΞΈ = (X^T W X)^(-1) X^T W y
        # Apply the weights via broadcasting (avoids forming the m×m diag(w)
        # matrix) and use solve() instead of an explicit inverse for stability
        theta = np.linalg.solve(X_b.T @ (weights[:, None] * X_b),
                                X_b.T @ (weights * self.y_train))
        
        return theta
    
    def predict(self, X):
        """Make predictions for multiple points"""
        predictions = []
        
        for x in X:
            # Fit local model at this point
            theta = self._weighted_fit(x.reshape(1, -1))
            
            # Make prediction (prepend intercept term; theta is a 1-D vector,
            # so x_b @ theta is already a scalar)
            x_b = np.concatenate(([1.0], np.ravel(x)))
            predictions.append(float(x_b @ theta))
        
        return np.array(predictions)

print("βœ… Locally Weighted Regression class implemented")
print("\nKey features:")
print("  β€’ Non-parametric: Stores all training data")
print("  β€’ Local fitting: Different ΞΈ for each query point")
print("  β€’ Gaussian weights: Nearby points have higher influence")
print("  β€’ Bandwidth Ο„: Controls locality of fit")

Example 1: Non-linear Data (From Lecture 3)ΒΆ

What: Fitting curved data without choosing featuresΒΆ

When the relationship between \(x\) and \(y\) is non-linear, standard linear regression fails because \(h_\theta(x) = \theta^T x\) can only produce straight lines. The classic workaround is polynomial feature engineering (\(x^2\), \(\sqrt{x}\), \(\log x\), etc.), but choosing the right basis functions is tedious and error-prone. Locally weighted regression sidesteps this entirely by fitting a different linear model at each query point, weighted by a Gaussian kernel \(w^{(i)} = \exp\!\bigl(-\|x^{(i)} - x\|^2 / 2\tau^2\bigr)\).

Why the bandwidth parameter mattersΒΆ

The bandwidth \(\tau\) controls the bias-variance tradeoff of LWR. A small \(\tau\) gives high weight only to the nearest neighbors, producing a wiggly fit that tracks noise (high variance, low bias). A large \(\tau\) spreads weight broadly, effectively averaging over more data and producing a smoother fit (low variance, higher bias). Comparing multiple \(\tau\) values side-by-side reveals this tradeoff visually and motivates the need for cross-validation to select \(\tau\) in practice.
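As a concrete way to pick τ, here is a leave-one-out cross-validation sketch (self-contained; `lwr_predict`, `loocv_error`, and the candidate grid are illustrative, not from the lecture):

```python
import numpy as np

def lwr_predict(X, y, x_query, tau):
    """One weighted least-squares fit at x_query (1-D sketch)."""
    w = np.exp(-(X - x_query) ** 2 / (2 * tau ** 2))
    Xb = np.column_stack([np.ones_like(X), X])
    theta = np.linalg.solve(Xb.T @ (w[:, None] * Xb), Xb.T @ (w * y))
    return float(np.array([1.0, x_query]) @ theta)

def loocv_error(X, y, tau):
    """Leave-one-out CV: predict each point from the remaining m-1 points."""
    idx = np.arange(len(X))
    sq_errs = [(lwr_predict(X[idx != i], y[idx != i], X[i], tau) - y[i]) ** 2
               for i in idx]
    return float(np.mean(sq_errs))

rng = np.random.default_rng(0)
X = np.linspace(0.0, 10.0, 40)
y = np.sin(X) + rng.normal(scale=0.2, size=X.size)

taus = [0.3, 0.6, 1.0, 3.0]   # illustrative candidate bandwidths
scores = {tau: loocv_error(X, y, tau) for tau in taus}
best = min(scores, key=scores.get)
print({tau: round(s, 4) for tau, s in scores.items()})
print("selected tau:", best)
```

The oversmoothed τ = 3.0 fit should score clearly worse than the mid-range bandwidths, matching the bias-variance picture above.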

# Generate non-linear data similar to lecture example
print("πŸ“Š Example: Fitting Non-linear Curved Data")
print("="*70)

np.random.seed(42)

# Generate data with non-linear pattern (like lecture diagram)
X_train = np.linspace(0, 10, 50)
y_train = np.sin(X_train) + 0.5 * X_train + np.random.randn(50) * 0.3

# For visualization
X_test = np.linspace(0, 10, 200).reshape(-1, 1)

# Compare different bandwidths
tau_values = [0.1, 0.5, 1.0, 2.0]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.flatten()

for idx, tau in enumerate(tau_values):
    # Fit LWR with this bandwidth
    lwr = LocallyWeightedRegression(tau=tau)
    lwr.fit(X_train.reshape(-1, 1), y_train)
    
    # Make predictions
    y_pred = lwr.predict(X_test)
    
    # Also fit standard linear regression for comparison
    X_b = np.c_[np.ones((len(X_train), 1)), X_train]
    theta_lr = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_train
    y_lr = np.c_[np.ones((len(X_test), 1)), X_test] @ theta_lr
    
    # Plot
    axes[idx].scatter(X_train, y_train, alpha=0.6, s=50, 
                     c='steelblue', edgecolors='black', linewidth=0.5,
                     label='Training data')
    axes[idx].plot(X_test, y_pred, 'r-', linewidth=2.5, 
                  label=f'LWR (Ο„={tau})')
    axes[idx].plot(X_test, y_lr, 'g--', linewidth=1.5, alpha=0.6,
                  label='Linear Regression')
    
    axes[idx].set_xlabel('x', fontsize=11)
    axes[idx].set_ylabel('y', fontsize=11)
    axes[idx].set_title(f'Bandwidth Ο„ = {tau}', fontsize=12, fontweight='bold')
    axes[idx].legend(fontsize=9)
    axes[idx].grid(True, alpha=0.3)

plt.suptitle('Locally Weighted Regression: Effect of Bandwidth Parameter Ο„', 
             fontsize=14, fontweight='bold', y=1.00)
plt.tight_layout()
plt.show()

print("\nπŸ’‘ Observations:")
print("   Ο„ = 0.1  : Very local (wiggly) - follows noise, may overfit")
print("   Ο„ = 0.5  : Good balance - captures curve without overfitting")  
print("   Ο„ = 1.0  : More smooth - averages over larger neighborhood")
print("   Ο„ = 2.0  : Very smooth - approaches linear regression")
print("\n   Linear Regression (green dashed): Cannot capture the curvature!")
print("\n   LWR automatically handles non-linearity without feature engineering")

Example 2: Visualizing Weight FunctionsΒΆ

What: Seeing how LWR assigns importance to training pointsΒΆ

For a given query point \(x\), the Gaussian weight function \(w^{(i)} = \exp\!\bigl(-\|x^{(i)} - x\|^2 / 2\tau^2\bigr)\) assigns a weight between 0 and 1 to every training example. Points close to \(x\) receive weight near 1 (strong influence on the local fit), while distant points receive weight near 0 (essentially ignored). By plotting these weights for several query points, we can see that LWR effectively performs a different regression at each location, using only the local neighborhood.
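To make the decay concrete, a small sketch computing the weights for a handful of training points around a query at x = 5.0 (the τ = 0.8 and the point locations are illustrative):

```python
import numpy as np

# Weights for training points at increasing distance from the query point,
# following w = exp(-(x_i - x)² / (2τ²)) as defined above
tau, x_query = 0.8, 5.0
x_train = np.array([5.0, 5.5, 6.0, 7.0, 9.0])
w = np.exp(-(x_train - x_query) ** 2 / (2 * tau ** 2))
for xi, wi in zip(x_train, w):
    print(f"x = {xi}: w = {wi:.4f}")
```

The point at the query itself gets weight exactly 1, a point two units away is already down near 0.04, and a point four units away contributes essentially nothing.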

Why this mattersΒΆ

Understanding the weight function builds intuition for kernel methods more broadly – the same Gaussian kernel appears in support vector machines (RBF kernel), Gaussian processes, and kernel density estimation. In all these methods, the kernel defines a notion of similarity between points, and the bandwidth parameter controls the scale of that similarity.

# Visualize weight functions for different query points
print("πŸ“Š Weight Function Visualization")
print("="*70)

tau = 0.8
lwr = LocallyWeightedRegression(tau=tau)
lwr.fit(X_train.reshape(-1, 1), y_train)

# Query points to visualize
query_points = [2.5, 5.0, 7.5]

fig, axes = plt.subplots(2, 3, figsize=(15, 8))

for idx, x_query in enumerate(query_points):
    # Compute weights for this query point
    weights = lwr._compute_weights(np.array([[x_query]]))
    
    # Top row: Data with weights
    axes[0, idx].scatter(X_train, y_train, c=weights, s=100*weights + 10,
                        cmap='RdYlBu_r', edgecolors='black', linewidth=0.5,
                        vmin=0, vmax=1, alpha=0.8)
    axes[0, idx].axvline(x_query, color='red', linestyle='--', linewidth=2,
                        label=f'Query: x={x_query}')
    axes[0, idx].set_xlabel('x', fontsize=10)
    axes[0, idx].set_ylabel('y', fontsize=10)
    axes[0, idx].set_title(f'Weights at x={x_query}', fontsize=11, fontweight='bold')
    axes[0, idx].legend(fontsize=9)
    axes[0, idx].grid(True, alpha=0.3)
    
    # Bottom row: Weight function
    axes[1, idx].bar(X_train, weights, width=0.15, color='coral', 
                    edgecolor='black', linewidth=0.5, alpha=0.7)
    axes[1, idx].axvline(x_query, color='red', linestyle='--', linewidth=2)
    axes[1, idx].set_xlabel('x', fontsize=10)
    axes[1, idx].set_ylabel('Weight w(i)', fontsize=10)
    axes[1, idx].set_title(f'w(i) = exp(-(x(i)-{x_query})Β²/(2τ²))', fontsize=10)
    axes[1, idx].grid(True, alpha=0.3, axis='y')
    axes[1, idx].set_ylim([0, 1.1])

plt.suptitle(f'LWR Weight Function (Ο„={tau}): Different Query Points', 
             fontsize=14, fontweight='bold', y=0.995)
plt.tight_layout()
plt.show()

print(f"\nπŸ’‘ Key Insights:")
print(f"   β€’ Point size ∝ weight (larger = higher influence)")
print(f"   β€’ Color intensity ∝ weight (red = high, blue = low)")
print(f"   β€’ Weights decay exponentially with distance from query point")
print(f"   β€’ At query x={query_points[1]}, nearby points have weights close to 1")
print(f"   β€’ Distant points have weights close to 0")
print(f"\n   'Each prediction uses a different weighted combination of training data'")

Comparison: Linear Regression vs LWR vs Polynomial RegressionΒΆ

What: Head-to-head evaluation of three regression approachesΒΆ

We fit the same non-linear dataset with three methods: ordinary linear regression (\(h_\theta(x) = \theta_0 + \theta_1 x\)), polynomial regression (degree 5), and locally weighted regression (\(\tau = 0.6\)). Linear regression underfits because a straight line cannot capture curvature. Polynomial regression can fit curves but requires choosing the degree and may extrapolate poorly. LWR adapts locally without any feature engineering.

Why: The feature engineering problemΒΆ

In practice, deciding whether to use \(x^2\), \(\sqrt{x}\), \(\log x\), or some other transformation is one of the most time-consuming parts of building a parametric model. LWR eliminates this step entirely by performing local linear fits weighted by proximity. The tradeoff is computational cost: LWR requires roughly \(O(mn^2 + n^3)\) per prediction for \(m\) training examples and \(n\) features (a weighted least-squares solve each time) and must retain the full training set, making it a non-parametric method. This comparison illustrates when the convenience of LWR outweighs its computational overhead, and when a well-chosen parametric model is more practical.

# Generate data with complex non-linear pattern
print("πŸ“Š Comparison: Linear vs Polynomial vs LWR")
print("="*70)

np.random.seed(123)
X_comp = np.linspace(0, 10, 60)
y_comp = 2*np.sin(X_comp) + 0.3*X_comp**1.5 - 3 + np.random.randn(60) * 0.4

X_test_comp = np.linspace(0, 10, 300).reshape(-1, 1)

# 1. Linear Regression
X_b = np.c_[np.ones((len(X_comp), 1)), X_comp]
theta_lin = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y_comp
y_linear = np.c_[np.ones((len(X_test_comp), 1)), X_test_comp] @ theta_lin

# 2. Polynomial Regression (degree 5)
from numpy.polynomial import Polynomial
poly = Polynomial.fit(X_comp, y_comp, 5)
y_poly = poly(X_test_comp.flatten())

# 3. Locally Weighted Regression
lwr_comp = LocallyWeightedRegression(tau=0.6)
lwr_comp.fit(X_comp.reshape(-1, 1), y_comp)
y_lwr = lwr_comp.predict(X_test_comp)

# Visualize comparison
plt.figure(figsize=(14, 6))

plt.scatter(X_comp, y_comp, alpha=0.6, s=70, c='steelblue', 
           edgecolors='black', linewidth=0.8, label='Training data', zorder=5)

plt.plot(X_test_comp, y_linear, 'g--', linewidth=2.5, alpha=0.7,
        label='Linear Regression (underfit)')
plt.plot(X_test_comp, y_poly, 'orange', linewidth=2.5, alpha=0.7,
        label='Polynomial (degree 5)')
plt.plot(X_test_comp, y_lwr, 'r-', linewidth=3,
        label='LWR (Ο„=0.6) - no feature engineering!', zorder=4)

plt.xlabel('x', fontsize=12)
plt.ylabel('y', fontsize=12)
plt.title('Model Comparison: LWR Automatically Handles Non-linearity', 
         fontsize=14, fontweight='bold')
plt.legend(fontsize=11, loc='upper left')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

# Calculate MSE for each method
from sklearn.metrics import mean_squared_error

# Predictions on training data
y_pred_lin = X_b @ theta_lin
y_pred_poly = poly(X_comp)
y_pred_lwr = lwr_comp.predict(X_comp.reshape(-1, 1))

mse_lin = mean_squared_error(y_comp, y_pred_lin)
mse_poly = mean_squared_error(y_comp, y_pred_poly)
mse_lwr = mean_squared_error(y_comp, y_pred_lwr)

print(f"\nπŸ“Š Training MSE:")
print(f"   Linear Regression:     {mse_lin:.4f}  (underfit - too simple)")
print(f"   Polynomial (deg 5):    {mse_poly:.4f}")
print(f"   LWR (Ο„=0.6):          {mse_lwr:.4f}  (best fit)")

print(f"\nπŸ’‘ Key Advantages of LWR:")
print(f"   βœ“ No feature engineering needed (no xΒ², √x, log(x), etc.)")
print(f"   βœ“ Automatically adapts to local patterns")
print(f"   βœ“ One parameter (Ο„) vs choosing polynomial degree")
print(f"   βœ— Must keep training data (non-parametric)")
print(f"   βœ— Slower prediction: O(mnΒ² + nΒ³) per query (m examples, n features)")

When to Use LWRΒΆ

βœ… Good for LWR:ΒΆ

  • Low dimensional data (n ≀ 5 features typically)

  • Non-linear patterns without obvious functional form

  • Small to medium datasets (< 10,000 points)

  • Offline prediction where speed is not critical

  • Avoiding feature engineering effort

❌ Not good for LWR:¢

  • High dimensional data (curse of dimensionality)

  • Real-time prediction requirements

  • Large datasets (memory and computation issues)

  • When parametric model works well

Alternative Approaches:ΒΆ

| Method | When to Use |
|---|---|
| Linear Regression | Linear patterns, need interpretability |
| Polynomial Features | Know the non-linear form |
| LWR | Unknown non-linearity, low dimensions |
| Neural Networks | Complex patterns, high dimensions, large data |
| Kernel Methods (SVM) | Non-linear, medium data, want generalization |

From Lecture: β€œLater this quarter you’ll hear about feature selection algorithms for automatically deciding what features to use.”

Mathematical Derivation (Optional)ΒΆ

Weighted Least SquaresΒΆ

Objective: Minimize the weighted sum of squared errors

\[J(\theta) = \frac{1}{2} \sum_{i=1}^m w^{(i)} \left(\theta^T x^{(i)} - y^{(i)}\right)^2\]

Matrix Form:

\[J(\theta) = \frac{1}{2} (X\theta - y)^T W (X\theta - y)\]

where \(W = \text{diag}(w^{(1)}, \ldots, w^{(m)})\)

Taking the derivative and setting it to zero:

\[\frac{\partial J}{\partial \theta} = X^T W (X\theta - y) = 0\]

Solution:

\[\theta^* = (X^T W X)^{-1} X^T W y\]
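As a numerical sanity check of this closed form (a small random instance; all names are illustrative):

```python
import numpy as np

# At theta* = (X^T W X)^{-1} X^T W y, the gradient X^T W (X theta - y)
# should vanish, confirming theta* minimizes the weighted objective.
rng = np.random.default_rng(1)
m, n = 30, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
w = rng.uniform(0.1, 1.0, size=m)
W = np.diag(w)

theta_star = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
grad = X.T @ W @ (X @ theta_star - y)
print("max |gradient| at theta*:", float(np.max(np.abs(grad))))
```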

Weight FunctionΒΆ

Gaussian kernel (most common): $\(w^{(i)} = \exp\left(-\frac{||x^{(i)} - x||^2}{2\tau^2}\right)\)$

Properties:

  • \(w^{(i)} \in (0, 1]\)

  • \(w^{(i)} = 1\) when \(x^{(i)} = x\)

  • \(w^{(i)} \rightarrow 0\) as \(||x^{(i)} - x|| \rightarrow \infty\)

  • \(\tau\) controls decay rate

Other kernel options (with normalized distance \(u = \|x^{(i)} - x\| / \tau\)):

  • Epanechnikov: \(w^{(i)} = \max(0,\, 1 - u^2)\)

  • Tri-cube: \(w^{(i)} = (1 - |u|^3)^3\) if \(|u| < 1\), and \(0\) otherwise
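A sketch of these kernels side by side, using the normalized distance \(u = \|x^{(i)} - x\|/\tau\) (the τ-scaling is an assumption made so the kernels are comparable; the compact support of the last two is the key difference from the Gaussian):

```python
import numpy as np

def gaussian(u):
    # Gaussian kernel: positive everywhere, never exactly zero
    return np.exp(-u ** 2 / 2)

def epanechnikov(u):
    # Compact support: exactly zero for |u| >= 1
    return np.maximum(0.0, 1.0 - u ** 2)

def tricube(u):
    # Compact support with smoother decay near the boundary
    return np.where(np.abs(u) < 1, (1 - np.abs(u) ** 3) ** 3, 0.0)

u = np.array([0.0, 0.5, 1.0, 2.0])
print("gaussian    :", np.round(gaussian(u), 4))
print("epanechnikov:", np.round(epanechnikov(u), 4))
print("tricube     :", np.round(tricube(u), 4))
```

The compactly supported kernels make each local fit depend on only a finite neighborhood, which can speed up prediction when combined with a spatial index.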

print("βœ… Locally Weighted Regression notebook complete!")
print("\nπŸ“š Key Takeaways from CS229 Lecture 3:")
print("   1. LWR is a non-parametric algorithm (keeps training data)")
print("   2. Fits different model at each query point")
print("   3. Weights are Gaussian function of distance")
print("   4. Bandwidth Ο„ controls locality")
print("   5. Avoids feature engineering for non-linear patterns")
print("   6. Works well in low dimensions")
print("\n🎯 From Andrew Ng:")
print("   'Locally weighted regression sidesteps the problem of having to")
print("    choose features by fitting a local model at each query point.'")