Chapter 7: Moving Beyond Linearity

Overview

Linear regression assumes: \(Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \varepsilon\)

Problem: Real relationships are often non-linear!

Approaches to Non-Linearity

1. Polynomial Regression

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + ... + \beta_d X^d + \varepsilon\]
  • Simple extension of linear regression

  • Can overfit with high degree

2. Step Functions

Split range into bins, fit constant in each: \(Y = \beta_0 + \beta_1 C_1(X) + \beta_2 C_2(X) + ... + \beta_K C_K(X) + \varepsilon\)

  • Piecewise constant

  • Simple but creates discontinuities

3. Regression Splines

Piecewise polynomials with smooth connections at knots

  • More flexible than polynomials

  • Smooth transitions

  • Local control

4. Smoothing Splines

Minimize: \(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt\)

  • Automatic knot selection

  • Smoothness controlled by λ

5. Local Regression (LOESS)

Fit weighted regression in local neighborhood

  • Very flexible

  • No global structure needed

6. Generalized Additive Models (GAMs)

\[Y = \beta_0 + f_1(X_1) + f_2(X_2) + ... + f_p(X_p) + \varepsilon\]
  • Non-linear function for each predictor

  • Interpretable

  • Flexible

Generate Non-Linear Data

What: Creating a synthetic dataset with a known non-linear relationship

We generate data from the function \(y = \sin(x) + 0.1x^2 - 0.5x + \varepsilon\), which combines oscillation (the sine term) with a quadratic trend and additive Gaussian noise. Having a known ground truth lets us directly compare how well each method (polynomials, step functions, splines, GAMs) recovers the true underlying relationship.

Why: The fundamental motivation for moving beyond linearity

Standard linear regression can only fit straight lines (or hyperplanes). When the true relationship is curved, sinusoidal, or otherwise non-linear, a linear model will systematically under-predict in some regions and over-predict in others; this is the hallmark of model misspecification and shows up clearly in residual plots. The methods in this chapter all address this limitation while retaining the interpretability advantages of linear models.

# Generate non-linear data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
n = 100
X = np.linspace(0, 10, n)

# True non-linear function: sin + polynomial
y_true = np.sin(X) + 0.1 * X**2 - 0.5 * X
y = y_true + np.random.randn(n) * 0.5  # Add noise

# Reshape for sklearn
X_train = X.reshape(-1, 1)
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

# Visualize data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Observed data')
plt.plot(X, y_true, 'r-', linewidth=2, label='True function')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Non-Linear Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Data generated: n = {n} points")
print("True function: sin(X) + 0.1X² - 0.5X + noise")

7.1 Polynomial Regression

Idea

Extend linear model with polynomial terms: \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + ... + \beta_d X^d + \varepsilon\)

Still linear in parameters β → Use standard linear regression!

Choosing Degree d

  • d = 1: Linear (underfitting)

  • d = 2-4: Often appropriate

  • d > 10: Usually overfitting

  • Use cross-validation to select d

# Try different polynomial degrees
degrees = [1, 2, 3, 5, 10, 15]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, degree in enumerate(degrees):
    # Create polynomial features and fit
    poly_model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression()
    )
    poly_model.fit(X_train, y)
    y_pred = poly_model.predict(X_plot)
    
    # Calculate RΒ²
    r2 = poly_model.score(X_train, y)
    
    # Plot
    axes[idx].scatter(X, y, alpha=0.5, label='Data')
    axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'Degree {degree}')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Polynomial Degree {degree}\nRΒ² = {r2:.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)
    
    if degree >= 10:
        axes[idx].set_ylim(-5, 15)  # Limit y-axis for high degrees

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • Degree 1 (linear): Underfits, can't capture non-linearity")
print("   • Degrees 2-5: Reasonable fits, capture overall trend")
print("   • Degree 10+: Overfits, wild oscillations at boundaries")
print("   • Higher R² doesn't always mean better predictions!")
# Use cross-validation to select degree
degrees_test = range(1, 16)
cv_scores = []

for degree in degrees_test:
    poly_model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression()
    )
    scores = cross_val_score(poly_model, X_train, y, cv=5, 
                            scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())  # Convert to MSE

best_degree = degrees_test[np.argmin(cv_scores)]

# Plot CV curve
plt.figure(figsize=(10, 6))
plt.plot(degrees_test, cv_scores, 'o-', linewidth=2, markersize=8)
plt.axvline(best_degree, color='r', linestyle='--', 
           label=f'Best degree = {best_degree}')
plt.xlabel('Polynomial Degree')
plt.ylabel('Cross-Validation MSE')
plt.title('Selecting Polynomial Degree via Cross-Validation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"\n🎯 Best polynomial degree: {best_degree}")
print(f"CV MSE: {min(cv_scores):.3f}")

7.2 Step Functions

Idea

Cut continuous variable into K bins, fit constant in each bin:

\[Y = \beta_0 + \beta_1 I(c_1 \leq X < c_2) + ... + \beta_K I(c_K \leq X < c_{K+1}) + \varepsilon\]

Pros

  • ✅ Simple interpretation

  • ✅ Easy to implement

  • ✅ Can approximate any non-linearity given enough bins

Cons

  • ❌ Creates discontinuities

  • ❌ Loses information within bins

  • ❌ Sensitive to bin placement

# Step functions with different number of bins
n_bins_list = [2, 4, 8, 16]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, n_bins in enumerate(n_bins_list):
    # Create bins
    bins = np.linspace(X.min(), X.max(), n_bins + 1)
    X_binned = np.digitize(X, bins[1:-1])
    
    # One-hot encode bins
    X_encoded = pd.get_dummies(X_binned, drop_first=True).values
    
    # Fit model
    lr = LinearRegression()
    lr.fit(X_encoded, y)
    
    # Predict for plot
    X_plot_binned = np.digitize(X_plot.ravel(), bins[1:-1])
    X_plot_encoded = pd.get_dummies(X_plot_binned, drop_first=True).reindex(
        columns=range(1, n_bins), fill_value=0).values
    y_pred = lr.predict(X_plot_encoded)
    
    # Plot
    axes[idx].scatter(X, y, alpha=0.5, label='Data')
    axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'{n_bins} bins')
    
    # Show bin boundaries
    for b in bins[1:-1]:
        axes[idx].axvline(b, color='gray', linestyle=':', alpha=0.5)
    
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Step Function: {n_bins} bins')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • 2 bins: Too simple, underfits")
print("   • 4-8 bins: Captures trend but with discontinuities")
print("   • 16 bins: More flexible but noisier")
print("   • All step functions create jumps (not smooth)")

7.3 Regression Splines

The Problem with Polynomials and Steps

  • Polynomials: Global effect, poor boundary behavior

  • Steps: Discontinuous, not smooth

Solution: Splines

Piecewise polynomials that are continuous and smooth at knots

B-Splines (Basis Splines)

Build spline as linear combination of B-spline basis functions: \(f(X) = \sum_{j=1}^{K+d} \beta_j B_j(X)\)

where:

  • K = number of knots

  • d = degree of polynomial pieces

  • \(B_j\) = B-spline basis functions

Properties

  • ✅ Piecewise polynomial

  • ✅ Continuous and smooth at knots

  • ✅ Local control (changing one region doesn't affect others much)

  • ✅ Numerically stable

Common Choices

  • Cubic splines (d=3): Most popular, continuous second derivative

  • Natural cubic splines: Linear beyond boundary knots (better boundary behavior)

  • B-splines: Numerically stable basis

# Compare different spline approaches
from sklearn.preprocessing import SplineTransformer

# Different spline configurations
configs = [
    {'n_knots': 4, 'degree': 3, 'name': '4 knots, degree 3'},
    {'n_knots': 6, 'degree': 3, 'name': '6 knots, degree 3'},
    {'n_knots': 10, 'degree': 3, 'name': '10 knots, degree 3'},
    {'n_knots': 6, 'degree': 2, 'name': '6 knots, degree 2'},
]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, config in enumerate(configs):
    # Create and fit spline model
    spline_model = make_pipeline(
        SplineTransformer(n_knots=config['n_knots'], 
                         degree=config['degree'],
                         include_bias=True),
        LinearRegression()
    )
    spline_model.fit(X_train, y)
    y_pred = spline_model.predict(X_plot)
    
    # Calculate RΒ²
    r2 = spline_model.score(X_train, y)
    
    # Plot
    axes[idx].scatter(X, y, alpha=0.5, label='Data')
    axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label='Spline')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'{config["name"]}\nRΒ² = {r2:.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • Splines are smooth (no discontinuities)")
print("   • More knots → more flexibility")
print("   • Degree 3 (cubic) most common")
print("   • Better boundary behavior than polynomials")
print("   • Local changes don't affect entire curve")

7.4 Comparison of Methods

What: Side-by-side evaluation of linear, polynomial, step, and spline regression

We fit all four methods to the same non-linear dataset and compare their fits visually and in terms of \(R^2\). Linear regression provides the simplest baseline, polynomial regression adds global curvature, step functions capture local levels with discontinuities, and regression splines combine local flexibility with smoothness.

Why: Choosing the right level of flexibility for your problem

Each method occupies a different point on the flexibility-interpretability spectrum. Linear models are the most interpretable but cannot capture non-linearity. Step functions are easy to interpret (average response in each bin) but create artificial jumps. Polynomial regression is smooth but suffers from global effects (changing one region affects the entire curve) and poor boundary behavior. Splines are typically the best default choice for single-predictor non-linearity because they provide local flexibility, smoothness at knots, and stable boundary behavior, all while remaining a linear model in the spline basis coefficients \(\beta\).

# Compare all methods
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Linear
linear = LinearRegression()
linear.fit(X_train, y)
y_pred_linear = linear.predict(X_plot)
axes[0, 0].scatter(X, y, alpha=0.5)
axes[0, 0].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[0, 0].plot(X_plot, y_pred_linear, 'b-', linewidth=2, label='Linear')
axes[0, 0].set_title(f'Linear Regression\nR² = {linear.score(X_train, y):.3f}')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Polynomial (degree 5)
poly = make_pipeline(PolynomialFeatures(5, include_bias=False), LinearRegression())
poly.fit(X_train, y)
y_pred_poly = poly.predict(X_plot)
axes[0, 1].scatter(X, y, alpha=0.5)
axes[0, 1].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[0, 1].plot(X_plot, y_pred_poly, 'g-', linewidth=2, label='Polynomial (d=5)')
axes[0, 1].set_title(f'Polynomial Regression\nR² = {poly.score(X_train, y):.3f}')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Step function (8 bins)
bins = np.linspace(X.min(), X.max(), 9)
X_binned = np.digitize(X, bins[1:-1])
X_encoded = pd.get_dummies(X_binned, drop_first=True).values
step_lr = LinearRegression()
step_lr.fit(X_encoded, y)
X_plot_binned = np.digitize(X_plot.ravel(), bins[1:-1])
X_plot_encoded = pd.get_dummies(X_plot_binned, drop_first=True).reindex(
    columns=range(1, 8), fill_value=0).values
y_pred_step = step_lr.predict(X_plot_encoded)
axes[1, 0].scatter(X, y, alpha=0.5)
axes[1, 0].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[1, 0].plot(X_plot, y_pred_step, 'orange', linewidth=2, label='Step (8 bins)')
for b in bins[1:-1]:
    axes[1, 0].axvline(b, color='gray', linestyle=':', alpha=0.5)
axes[1, 0].set_title('Step Function')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 4. Spline (6 knots, degree 3)
spline = make_pipeline(SplineTransformer(n_knots=6, degree=3), LinearRegression())
spline.fit(X_train, y)
y_pred_spline = spline.predict(X_plot)
axes[1, 1].scatter(X, y, alpha=0.5)
axes[1, 1].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[1, 1].plot(X_plot, y_pred_spline, 'purple', linewidth=2, label='Spline')
axes[1, 1].set_title(f'Regression Spline\nR² = {spline.score(X_train, y):.3f}')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print comparison summary (training R² only; use a held-out test set for real comparisons)
print("\n📊 Comparison Summary:\n")
print(f"{'Method':<20} {'Train R²':<12} {'Flexibility':<15} {'Smoothness'}")
print("="*65)
print(f"{'Linear':<20} {linear.score(X_train, y):>10.3f}   {'Low':<15} Smooth")
print(f"{'Polynomial (d=5)':<20} {poly.score(X_train, y):>10.3f}   {'Medium':<15} Smooth")
print(f"{'Step (8 bins)':<20} {step_lr.score(X_encoded, y):>10.3f}   {'Medium':<15} Discontinuous")
print(f"{'Spline':<20} {spline.score(X_train, y):>10.3f}   {'High':<15} Smooth ✅")

print("\n💡 Key Takeaways:")
print("   • Linear: Simple but can't capture non-linearity")
print("   • Polynomial: Flexible but global effects, poor boundaries")
print("   • Step: Flexible but discontinuous")
print("   • Spline: Flexible, smooth, local control ✅ Often best choice")

7.5 Generalized Additive Models (GAMs)

Extension to Multiple Predictors

Standard Linear Model: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \varepsilon\)

GAM: \(Y = \beta_0 + f_1(X_1) + f_2(X_2) + ... + f_p(X_p) + \varepsilon\)

where each \(f_i\) can be:

  • Spline

  • Polynomial

  • Step function

  • Or stay linear!

Advantages

  • ✅ Interpretable: Each feature's effect can be visualized

  • ✅ Flexible: Non-linear relationships

  • ✅ Additive: Easy to understand individual effects

  • ✅ Automatic: Can use same smoothing for all features

Disadvantages

  • ❌ Additive assumption: Misses interactions (can add them manually)

  • ❌ Computationally intensive: For large datasets

When to Use

  • Need interpretability

  • Expect non-linear relationships

  • Want to visualize effects

  • Additive structure reasonable

# GAM Example: 2 predictors
np.random.seed(42)
n = 200

# Generate 2D data
X1 = np.random.uniform(0, 10, n)
X2 = np.random.uniform(0, 10, n)

# True GAM: y = sin(X1) + 0.1*X2^2 + noise
y_gam = np.sin(X1) + 0.1 * X2**2 + np.random.randn(n) * 0.5

X_gam = np.column_stack([X1, X2])

# Fit GAM using splines for each feature
from sklearn.compose import ColumnTransformer

gam_model = make_pipeline(
    ColumnTransformer([
        ('spline1', SplineTransformer(n_knots=5, degree=3), [0]),
        ('spline2', SplineTransformer(n_knots=5, degree=3), [1])
    ]),
    LinearRegression()
)

gam_model.fit(X_gam, y_gam)

# Compare with linear model
linear_model = LinearRegression()
linear_model.fit(X_gam, y_gam)

print("\n📊 GAM vs Linear Model:\n")
print(f"Linear R²: {linear_model.score(X_gam, y_gam):.4f}")
print(f"GAM R²:    {gam_model.score(X_gam, y_gam):.4f}")
print(f"\nImprovement: {(gam_model.score(X_gam, y_gam) - linear_model.score(X_gam, y_gam)):.4f}")

# Visualize individual effects (partial dependence)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Effect of X1 (holding X2 at median)
x1_range = np.linspace(0, 10, 100)
x2_median = np.median(X2)
X_plot_1 = np.column_stack([x1_range, np.full(100, x2_median)])
y_pred_1 = gam_model.predict(X_plot_1)

axes[0].scatter(X1, y_gam, alpha=0.3)
axes[0].plot(x1_range, y_pred_1, 'r-', linewidth=3, label='GAM')
axes[0].plot(x1_range, np.sin(x1_range) + 0.1 * x2_median**2, 'g--', 
            linewidth=2, label='True effect')
axes[0].set_xlabel('X₁')
axes[0].set_ylabel('Y')
axes[0].set_title(f'Effect of X₁ (X₂ = {x2_median:.2f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Effect of X2 (holding X1 at median)
x2_range = np.linspace(0, 10, 100)
x1_median = np.median(X1)
X_plot_2 = np.column_stack([np.full(100, x1_median), x2_range])
y_pred_2 = gam_model.predict(X_plot_2)

axes[1].scatter(X2, y_gam, alpha=0.3)
axes[1].plot(x2_range, y_pred_2, 'r-', linewidth=3, label='GAM')
axes[1].plot(x2_range, np.sin(x1_median) + 0.1 * x2_range**2, 'g--', 
            linewidth=2, label='True effect')
axes[1].set_xlabel('X₂')
axes[1].set_ylabel('Y')
axes[1].set_title(f'Effect of X₂ (X₁ = {x1_median:.2f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 GAM Advantages:")
print("   • Can visualize effect of each predictor separately")
print("   • More flexible than linear regression")
print("   • Maintains interpretability")
print("   • Automatically determines non-linear shape")

Key Takeaways

Method Comparison

| Method | Flexibility | Smoothness | Interpretation | When to Use |
|---|---|---|---|---|
| Linear | Low | Smooth | Easy | Baseline, linear relationships |
| Polynomial | Medium-High | Smooth | Medium | Simple non-linearity, few features |
| Step Functions | Medium | Discontinuous | Easy | Categorical-like effects |
| Regression Splines | High | Smooth ✅ | Medium | Complex non-linearity, 1-2 features |
| Smoothing Splines | High | Very smooth | Medium | Automatic smoothing |
| Local Regression | Very High | Smooth | Hard | Exploratory, small datasets |
| GAMs | High | Smooth | Good ✅ | Multiple features, need interpretation |

Decision Guide

Start Simple:

  1. Try linear regression first

  2. If underfitting, move to polynomials (degree 2-4)

  3. If still underfitting, use splines or GAMs

For Single Predictor:

  • Polynomial (d=2-4) if simple curve

  • Regression splines if complex or local features

  • Smoothing splines if want automatic tuning

For Multiple Predictors:

  • Polynomial features if few variables and interactions matter

  • GAMs if want interpretability and additive structure ✅

  • Other methods (Random Forests, Neural Nets) if just want prediction

Practical Tips

  1. Always use cross-validation to select complexity (degree, knots, λ)

  2. Start simple - don't use splines if a polynomial works

  3. Visualize - plot fitted functions to understand behavior

  4. Check boundaries - extrapolation often fails

  5. Consider domain knowledge - does the shape make sense?

Common Pitfalls

  • ❌ Too high polynomial degree → wild oscillations

  • ❌ Too many knots → overfitting

  • ❌ Extrapolation → predictions outside data range unreliable

  • ❌ Ignoring interactions → GAMs assume additivity

  • ❌ Not standardizing → features on different scales can distort regularized fits

Best Practices

✅ Use cross-validation for all tuning parameters
✅ Plot residuals to check fit quality
✅ Compare models on a test set, not the training set
✅ Regularize if needed (Ridge/Lasso with splines)
✅ Document assumptions, especially for GAMs

Next Chapter

Chapter 8: Tree-Based Methods

  • Decision Trees

  • Bagging

  • Random Forests

  • Boosting

  • BART

Practice Exercises

Exercise 1: Polynomial Selection

Generate data: \(y = \sin(2x) + \epsilon\), \(x \in [0, \pi]\)

  1. Fit polynomials degree 1-15

  2. Plot fits and CV scores

  3. Identify best degree

  4. Explain overfitting behavior

Exercise 2: Step Function Sensitivity

  1. Create step functions with different bin placements

  2. Compare performance

  3. Show sensitivity to bin boundaries

  4. Suggest optimal number of bins

Exercise 3: Spline Flexibility

  1. Fit splines with 3, 6, 10, 20 knots

  2. Calculate train/test MSE

  3. Identify optimal knot count

  4. Compare with polynomial of equivalent df

Exercise 4: GAM with Interactions

Generate: \(y = X_1 \times X_2 + \sin(X_1) + \epsilon\)

  1. Fit additive GAM

  2. Fit GAM with interaction term

  3. Compare RΒ²

  4. Visualize interaction effect

Exercise 5: Real Data Application

Use the California Housing dataset (or a similar dataset; note that Boston Housing has been removed from scikit-learn):

  1. Fit linear model

  2. Fit GAM allowing non-linearity

  3. Compare performance

  4. Interpret non-linear effects

  5. Which features benefit from non-linearity?