Chapter 7: Moving Beyond Linearity

Overview

Linear regression assumes: \(Y = \beta_0 + \beta_1 X_1 + ... + \beta_p X_p + \varepsilon\)

Problem: Real relationships are often non-linear!

Approaches to Non-Linearity

1. Polynomial Regression

\[Y = \beta_0 + \beta_1 X + \beta_2 X^2 + ... + \beta_d X^d + \varepsilon\]
  • Simple extension of linear regression

  • Can overfit with high degree

2. Step Functions

Split range into bins, fit constant in each: \(Y = \beta_0 + \beta_1 C_1(X) + \beta_2 C_2(X) + ... + \beta_K C_K(X) + \varepsilon\)

  • Piecewise constant

  • Simple but creates discontinuities

3. Regression Splines

Piecewise polynomials with smooth connections at knots

  • More flexible than polynomials

  • Smooth transitions

  • Local control

4. Smoothing Splines

Minimize: \(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt\)

  • Automatic knot selection

  • Smoothness controlled by λ

5. Local Regression (LOESS)

Fit weighted regression in local neighborhood

  • Very flexible

  • No global structure needed

6. Generalized Additive Models (GAMs)

\[Y = \beta_0 + f_1(X_1) + f_2(X_2) + ... + f_p(X_p) + \varepsilon\]
  • Non-linear function for each predictor

  • Interpretable

  • Flexible

Generate Non-Linear Data

What: Creating a synthetic dataset with a known non-linear relationship

We generate data from the function \(y = \sin(x) + 0.1x^2 - 0.5x + \varepsilon\), which combines oscillation (the sine term) with a quadratic trend and additive Gaussian noise. Having a known ground truth lets us directly compare how well each method (polynomials, step functions, splines, GAMs) recovers the true underlying relationship.

Why: The fundamental motivation for moving beyond linearity

Standard linear regression can only fit straight lines (or hyperplanes). When the true relationship is curved, sinusoidal, or otherwise non-linear, a linear model will systematically under-predict in some regions and over-predict in others; this is the hallmark of model misspecification and shows up clearly in residual plots. The methods in this chapter all address this limitation while retaining the interpretability advantages of linear models.

# Generate non-linear data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
n = 100
X = np.linspace(0, 10, n)

# True non-linear function: sin + polynomial
y_true = np.sin(X) + 0.1 * X**2 - 0.5 * X
y = y_true + np.random.randn(n) * 0.5  # Add noise

# Reshape for sklearn
X_train = X.reshape(-1, 1)
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)

# Visualize data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Observed data')
plt.plot(X, y_true, 'r-', linewidth=2, label='True function')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Non-Linear Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"Data generated: n = {n} points")
print("True function: sin(X) + 0.1X² - 0.5X + noise")

7.1 Polynomial Regression

Idea

Extend linear model with polynomial terms: \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + ... + \beta_d X^d + \varepsilon\)

Still linear in parameters β → Use standard linear regression!

Choosing Degree d

  • d = 1: Linear (underfitting)

  • d = 2-4: Often appropriate

  • d > 10: Usually overfitting

  • Use cross-validation to select d

# Try different polynomial degrees
degrees = [1, 2, 3, 5, 10, 15]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, degree in enumerate(degrees):
    # Create polynomial features and fit
    poly_model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression()
    )
    poly_model.fit(X_train, y)
    y_pred = poly_model.predict(X_plot)
    
    # Calculate RΒ²
    r2 = poly_model.score(X_train, y)
    
    # Plot
    axes[idx].scatter(X, y, alpha=0.5, label='Data')
    axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'Degree {degree}')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Polynomial Degree {degree}\nRΒ² = {r2:.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)
    
    if degree >= 10:
        axes[idx].set_ylim(-5, 15)  # Limit y-axis for high degrees

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • Degree 1 (linear): Underfits, can't capture non-linearity")
print("   • Degrees 2-5: Reasonable fits, capture overall trend")
print("   • Degree 10+: Overfits, wild oscillations at boundaries")
print("   • Higher R² doesn't always mean better predictions!")
# Use cross-validation to select degree
degrees_test = range(1, 16)
cv_scores = []

for degree in degrees_test:
    poly_model = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        LinearRegression()
    )
    scores = cross_val_score(poly_model, X_train, y, cv=5, 
                            scoring='neg_mean_squared_error')
    cv_scores.append(-scores.mean())  # Convert to MSE

best_degree = degrees_test[np.argmin(cv_scores)]

# Plot CV curve
plt.figure(figsize=(10, 6))
plt.plot(degrees_test, cv_scores, 'o-', linewidth=2, markersize=8)
plt.axvline(best_degree, color='r', linestyle='--', 
           label=f'Best degree = {best_degree}')
plt.xlabel('Polynomial Degree')
plt.ylabel('Cross-Validation MSE')
plt.title('Selecting Polynomial Degree via Cross-Validation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print(f"\n🎯 Best polynomial degree: {best_degree}")
print(f"CV MSE: {min(cv_scores):.3f}")

7.2 Step Functions

Idea

Cut continuous variable into K bins, fit constant in each bin:

\[Y = \beta_0 + \beta_1 I(c_1 \leq X < c_2) + ... + \beta_K I(c_K \leq X < c_{K+1}) + \varepsilon\]

Pros

  • ✅ Simple interpretation

  • ✅ Easy to implement

  • ✅ Can approximate any non-linearity given enough bins

Cons

  • ❌ Creates discontinuities

  • ❌ Loses information within bins

  • ❌ Sensitive to bin placement

# Step functions with different number of bins
n_bins_list = [2, 4, 8, 16]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, n_bins in enumerate(n_bins_list):
    # Create bins
    bins = np.linspace(X.min(), X.max(), n_bins + 1)
    X_binned = np.digitize(X, bins[1:-1])
    
    # One-hot encode bins
    X_encoded = pd.get_dummies(X_binned, drop_first=True).values
    
    # Fit model
    lr = LinearRegression()
    lr.fit(X_encoded, y)
    
    # Predict for plot
    X_plot_binned = np.digitize(X_plot.ravel(), bins[1:-1])
    X_plot_encoded = pd.get_dummies(X_plot_binned, drop_first=True).reindex(
        columns=range(1, n_bins), fill_value=0).values
    y_pred = lr.predict(X_plot_encoded)
    
    # Plot
    axes[idx].scatter(X, y, alpha=0.5, label='Data')
    axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'{n_bins} bins')
    
    # Show bin boundaries
    for b in bins[1:-1]:
        axes[idx].axvline(b, color='gray', linestyle=':', alpha=0.5)
    
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Step Function: {n_bins} bins')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • 2 bins: Too simple, underfits")
print("   • 4-8 bins: Captures trend but with discontinuities")
print("   • 16 bins: More flexible but noisier")
print("   • All step functions create jumps (not smooth)")

7.3 Regression Splines

The Problem with Polynomials and Steps

  • Polynomials: Global effect, poor boundary behavior

  • Steps: Discontinuous, not smooth

Solution: Splines

Piecewise polynomials that are continuous and smooth at knots

B-Splines (Basis Splines)

Build spline as linear combination of B-spline basis functions: \(f(X) = \sum_{j=1}^{K+d} \beta_j B_j(X)\)

where:

  • K = number of knots

  • d = degree of polynomial pieces

  • \(B_j\) = B-spline basis functions

Properties

  • ✅ Piecewise polynomial

  • ✅ Continuous and smooth at knots

  • ✅ Local control (changing one region doesn't affect others much)

  • ✅ Numerically stable

Common Choices

  • Cubic splines (d=3): Most popular, continuous second derivative

  • Natural cubic splines: Linear beyond boundary knots (better boundary behavior)

  • B-splines: Numerically stable basis

# Compare different spline approaches
from sklearn.preprocessing import SplineTransformer

# Different spline configurations
configs = [
    {'n_knots': 4, 'degree': 3, 'name': '4 knots, degree 3'},
    {'n_knots': 6, 'degree': 3, 'name': '6 knots, degree 3'},
    {'n_knots': 10, 'degree': 3, 'name': '10 knots, degree 3'},
    {'n_knots': 6, 'degree': 2, 'name': '6 knots, degree 2'},
]

fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()

for idx, config in enumerate(configs):
    # Create and fit spline model
    spline_model = make_pipeline(
        SplineTransformer(n_knots=config['n_knots'], 
                         degree=config['degree'],
                         include_bias=True),
        LinearRegression()
    )
    spline_model.fit(X_train, y)
    y_pred = spline_model.predict(X_plot)
    
    # Calculate RΒ²
    r2 = spline_model.score(X_train, y)
    
    # Plot
    axes[idx].scatter(X, y, alpha=0.5, label='Data')
    axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
    axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label='Spline')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'{config["name"]}\nRΒ² = {r2:.3f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Observations:")
print("   • Splines are smooth (no discontinuities)")
print("   • More knots → more flexibility")
print("   • Degree 3 (cubic) most common")
print("   • Better boundary behavior than polynomials")
print("   • Local changes don't affect entire curve")

7.4 Comparison of Methods

What: Side-by-side evaluation of linear, polynomial, step, and spline regression

We fit all four methods to the same non-linear dataset and compare their fits visually and in terms of \(R^2\). Linear regression provides the simplest baseline, polynomial regression adds global curvature, step functions capture local levels with discontinuities, and regression splines combine local flexibility with smoothness.

Why: Choosing the right level of flexibility for your problem

Each method occupies a different point on the flexibility-interpretability spectrum. Linear models are the most interpretable but cannot capture non-linearity. Step functions are easy to interpret (average response in each bin) but create artificial jumps. Polynomial regression is smooth but suffers from global effects (changing one region affects the entire curve) and poor boundary behavior. Splines are typically the best default choice for single-predictor non-linearity because they provide local flexibility, smoothness at knots, and stable boundary behavior, all while remaining a linear model in the spline basis coefficients \(\beta\).

# Compare all methods
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Linear
linear = LinearRegression()
linear.fit(X_train, y)
y_pred_linear = linear.predict(X_plot)
axes[0, 0].scatter(X, y, alpha=0.5)
axes[0, 0].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[0, 0].plot(X_plot, y_pred_linear, 'b-', linewidth=2, label='Linear')
axes[0, 0].set_title(f'Linear Regression\nR² = {linear.score(X_train, y):.3f}')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)

# 2. Polynomial (degree 5)
poly = make_pipeline(PolynomialFeatures(5, include_bias=False), LinearRegression())
poly.fit(X_train, y)
y_pred_poly = poly.predict(X_plot)
axes[0, 1].scatter(X, y, alpha=0.5)
axes[0, 1].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[0, 1].plot(X_plot, y_pred_poly, 'g-', linewidth=2, label='Polynomial (d=5)')
axes[0, 1].set_title(f'Polynomial Regression\nR² = {poly.score(X_train, y):.3f}')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)

# 3. Step function (8 bins)
bins = np.linspace(X.min(), X.max(), 9)
X_binned = np.digitize(X, bins[1:-1])
X_encoded = pd.get_dummies(X_binned, drop_first=True).values
step_lr = LinearRegression()
step_lr.fit(X_encoded, y)
X_plot_binned = np.digitize(X_plot.ravel(), bins[1:-1])
X_plot_encoded = pd.get_dummies(X_plot_binned, drop_first=True).reindex(
    columns=range(1, 8), fill_value=0).values
y_pred_step = step_lr.predict(X_plot_encoded)
axes[1, 0].scatter(X, y, alpha=0.5)
axes[1, 0].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[1, 0].plot(X_plot, y_pred_step, 'orange', linewidth=2, label='Step (8 bins)')
for b in bins[1:-1]:
    axes[1, 0].axvline(b, color='gray', linestyle=':', alpha=0.5)
axes[1, 0].set_title('Step Function')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)

# 4. Spline (6 knots, degree 3)
spline = make_pipeline(SplineTransformer(n_knots=6, degree=3), LinearRegression())
spline.fit(X_train, y)
y_pred_spline = spline.predict(X_plot)
axes[1, 1].scatter(X, y, alpha=0.5)
axes[1, 1].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[1, 1].plot(X_plot, y_pred_spline, 'purple', linewidth=2, label='Spline')
axes[1, 1].set_title(f'Regression Spline\nR² = {spline.score(X_train, y):.3f}')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Print comparison summary (training R² only; use a held-out test set for real comparisons)
print("\n📊 Comparison Summary:\n")
print(f"{'Method':<20} {'Train R²':<12} {'Flexibility':<15} {'Smoothness'}")
print("="*65)
print(f"{'Linear':<20} {linear.score(X_train, y):>10.3f}   {'Low':<15} Smooth")
print(f"{'Polynomial (d=5)':<20} {poly.score(X_train, y):>10.3f}   {'Medium':<15} Smooth")
print(f"{'Step (8 bins)':<20} {step_lr.score(X_encoded, y):>10.3f}   {'Medium':<15} Discontinuous")
print(f"{'Spline':<20} {spline.score(X_train, y):>10.3f}   {'High':<15} Smooth ✅")

print("\n💡 Key Takeaways:")
print("   • Linear: Simple but can't capture non-linearity")
print("   • Polynomial: Flexible but global effects, poor boundaries")
print("   • Step: Flexible but discontinuous")
print("   • Spline: Flexible, smooth, local control ✅ Often best choice")

7.5 Generalized Additive Models (GAMs)

Extension to Multiple Predictors

Standard Linear Model: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p + \varepsilon\)

GAM: \(Y = \beta_0 + f_1(X_1) + f_2(X_2) + ... + f_p(X_p) + \varepsilon\)

where each \(f_i\) can be:

  • Spline

  • Polynomial

  • Step function

  • Or stay linear!

Advantages

  • ✅ Interpretable: Each feature's effect can be visualized

  • ✅ Flexible: Non-linear relationships

  • ✅ Additive: Easy to understand individual effects

  • ✅ Automatic: Can use same smoothing for all features

Disadvantages

  • ❌ Additive assumption: Misses interactions (can add them manually)

  • ❌ Computationally intensive: For large datasets

When to Use

  • Need interpretability

  • Expect non-linear relationships

  • Want to visualize effects

  • Additive structure reasonable

# GAM Example: 2 predictors
np.random.seed(42)
n = 200

# Generate 2D data
X1 = np.random.uniform(0, 10, n)
X2 = np.random.uniform(0, 10, n)

# True GAM: y = sin(X1) + 0.1*X2^2 + noise
y_gam = np.sin(X1) + 0.1 * X2**2 + np.random.randn(n) * 0.5

X_gam = np.column_stack([X1, X2])

# Fit GAM using splines for each feature
from sklearn.compose import ColumnTransformer

gam_model = make_pipeline(
    ColumnTransformer([
        ('spline1', SplineTransformer(n_knots=5, degree=3), [0]),
        ('spline2', SplineTransformer(n_knots=5, degree=3), [1])
    ]),
    LinearRegression()
)

gam_model.fit(X_gam, y_gam)

# Compare with linear model
linear_model = LinearRegression()
linear_model.fit(X_gam, y_gam)

print("\n📊 GAM vs Linear Model:\n")
print(f"Linear R²: {linear_model.score(X_gam, y_gam):.4f}")
print(f"GAM R²:    {gam_model.score(X_gam, y_gam):.4f}")
print(f"\nImprovement: {(gam_model.score(X_gam, y_gam) - linear_model.score(X_gam, y_gam)):.4f}")

# Visualize individual effects (partial dependence)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Effect of X1 (holding X2 at median)
x1_range = np.linspace(0, 10, 100)
x2_median = np.median(X2)
X_plot_1 = np.column_stack([x1_range, np.full(100, x2_median)])
y_pred_1 = gam_model.predict(X_plot_1)

axes[0].scatter(X1, y_gam, alpha=0.3)
axes[0].plot(x1_range, y_pred_1, 'r-', linewidth=3, label='GAM')
axes[0].plot(x1_range, np.sin(x1_range) + 0.1 * x2_median**2, 'g--', 
            linewidth=2, label='True effect')
axes[0].set_xlabel('X₁')
axes[0].set_ylabel('Y')
axes[0].set_title(f'Effect of X₁ (X₂ = {x2_median:.2f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Effect of X2 (holding X1 at median)
x2_range = np.linspace(0, 10, 100)
x1_median = np.median(X1)
X_plot_2 = np.column_stack([np.full(100, x1_median), x2_range])
y_pred_2 = gam_model.predict(X_plot_2)

axes[1].scatter(X2, y_gam, alpha=0.3)
axes[1].plot(x2_range, y_pred_2, 'r-', linewidth=3, label='GAM')
axes[1].plot(x2_range, np.sin(x1_median) + 0.1 * x2_range**2, 'g--', 
            linewidth=2, label='True effect')
axes[1].set_xlabel('X₂')
axes[1].set_ylabel('Y')
axes[1].set_title(f'Effect of X₂ (X₁ = {x1_median:.2f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 GAM Advantages:")
print("   • Can visualize effect of each predictor separately")
print("   • More flexible than linear regression")
print("   • Maintains interpretability")
print("   • Automatically determines non-linear shape")

Key Takeaways

Method Comparison

| Method | Flexibility | Smoothness | Interpretation | When to Use |
|---|---|---|---|---|
| Linear | Low | Smooth | Easy | Baseline, linear relationships |
| Polynomial | Medium-High | Smooth | Medium | Simple non-linearity, few features |
| Step Functions | Medium | Discontinuous | Easy | Categorical-like effects |
| Regression Splines | High | Smooth ✅ | Medium | Complex non-linearity, 1-2 features |
| Smoothing Splines | High | Very smooth | Medium | Automatic smoothing |
| Local Regression | Very High | Smooth | Hard | Exploratory, small datasets |
| GAMs | High | Smooth | Good ✅ | Multiple features, need interpretation |

Decision Guide

Start Simple:

  1. Try linear regression first

  2. If underfitting, move to polynomials (degree 2-4)

  3. If still underfitting, use splines or GAMs

For Single Predictor:

  • Polynomial (d=2-4) if simple curve

  • Regression splines if complex or local features

  • Smoothing splines if want automatic tuning

For Multiple Predictors:

  • Polynomial features if few variables and interactions matter

  • GAMs if want interpretability and additive structure ✅

  • Other methods (Random Forests, Neural Nets) if just want prediction

Practical Tips

  1. Always use cross-validation to select complexity (degree, knots, λ)

  2. Start simple - don't use splines if a polynomial works

  3. Visualize - plot fitted functions to understand behavior

  4. Check boundaries - extrapolation often fails

  5. Consider domain knowledge - does the shape make sense?

Common Pitfalls

  • ❌ Too high polynomial degree → wild oscillations

  • ❌ Too many knots → overfitting

  • ❌ Extrapolation → predictions outside data range unreliable

  • ❌ Ignoring interactions → GAMs assume additivity

  • ❌ Not standardizing → features on different scales can distort regularized fits

Best Practices

✅ Use cross-validation for all tuning parameters
✅ Plot residuals to check fit quality
✅ Compare models on a test set, not the training set
✅ Regularize if needed (Ridge/Lasso with splines)
✅ Document assumptions, especially for GAMs

Next Chapter

Chapter 8: Tree-Based Methods

  • Decision Trees

  • Bagging

  • Random Forests

  • Boosting

  • BART

Practice Exercises

Exercise 1: Polynomial Selection

Generate data: \(y = \sin(2x) + \epsilon\), \(x \in [0, \pi]\)

  1. Fit polynomials degree 1-15

  2. Plot fits and CV scores

  3. Identify best degree

  4. Explain overfitting behavior

Exercise 2: Step Function Sensitivity

  1. Create step functions with different bin placements

  2. Compare performance

  3. Show sensitivity to bin boundaries

  4. Suggest optimal number of bins

Exercise 3: Spline Flexibility

  1. Fit splines with 3, 6, 10, 20 knots

  2. Calculate train/test MSE

  3. Identify optimal knot count

  4. Compare with polynomial of equivalent df

Exercise 4: GAM with Interactions

Generate: \(y = X_1 \times X_2 + \sin(X_1) + \epsilon\)

  1. Fit additive GAM

  2. Fit GAM with interaction term

  3. Compare RΒ²

  4. Visualize interaction effect

Exercise 5: Real Data Application

Use the California Housing dataset (or a similar dataset; note that Boston Housing has been removed from scikit-learn):

  1. Fit linear model

  2. Fit GAM allowing non-linearity

  3. Compare performance

  4. Interpret non-linear effects

  5. Which features benefit from non-linearity?