Chapter 7: Moving Beyond Linearity
Overview
Linear regression assumes: \(Y = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p + \varepsilon\)
Problem: Real relationships are often non-linear!
Approaches to Non-Linearity
1. Polynomial Regression
Simple extension of linear regression
Can overfit with high degree
2. Step Functions
Split range into bins, fit constant in each: \(Y = \beta_0 + \beta_1 C_1(X) + \beta_2 C_2(X) + \dots + \beta_K C_K(X) + \varepsilon\)
Piecewise constant
Simple but creates discontinuities
3. Regression Splines
Piecewise polynomials with smooth connections at knots
More flexible than polynomials
Smooth transitions
Local control
4. Smoothing Splines
Minimize: \(\sum_{i=1}^n (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt\)
Automatic knot selection
Smoothness controlled by λ
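The criterion above can be fit directly with SciPy; a minimal sketch (assuming SciPy ≥ 1.10, where `make_smoothing_spline` takes the roughness-penalty weight as `lam`):

```python
import numpy as np
from scipy.interpolate import make_smoothing_spline

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

# Larger lam puts more weight on the roughness penalty -> smoother fit
wiggly = make_smoothing_spline(x, y, lam=1e-4)
smooth = make_smoothing_spline(x, y, lam=10.0)

grid = np.linspace(0, 10, 500)

def roughness(spl):
    # Mean squared second derivative: a discrete proxy for the penalty integral
    return float(np.mean(spl.derivative(2)(grid) ** 2))

print(f"roughness at lam=1e-4: {roughness(wiggly):.4f}")
print(f"roughness at lam=10:   {roughness(smooth):.4f}")
```

Increasing `lam` always weakly decreases the penalty term of the fitted function, which is why it works as a single smoothness dial.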
5. Local Regression (LOESS)
Fit weighted regression in local neighborhood
Very flexible
No global structure needed
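The neighborhood idea is easy to demonstrate by hand. This is a simplified sketch, not the standard LOESS recipe: it uses Gaussian kernel weights in place of LOESS's tricube weights over a nearest-neighbour span, but the core step is the same weighted least-squares fit centred at each query point.

```python
import numpy as np

def local_linear(x0, x, y, bandwidth=0.8):
    """Predict at x0 from a weighted linear fit centred there."""
    # Gaussian kernel weights: nearby points dominate the fit
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)
    # Weighted least squares for intercept + slope
    A = np.column_stack([np.ones_like(x), x])
    beta = np.linalg.solve(A.T @ (w[:, None] * A), A.T @ (w * y))
    return beta[0] + beta[1] * x0

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, 120))
y = np.sin(x) + rng.normal(scale=0.2, size=x.size)

grid = np.linspace(0.5, 9.5, 50)
fit = np.array([local_linear(g, x, y) for g in grid])
print(f"mean abs error vs sin(x): {np.mean(np.abs(fit - np.sin(grid))):.3f}")
```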
6. Generalized Additive Models (GAMs)
Non-linear function for each predictor
Interpretable
Flexible
Generate Non-Linear Data
What: Creating a synthetic dataset with a known non-linear relationship
We generate data from the function \(y = \sin(x) + 0.1x^2 - 0.5x + \varepsilon\), which combines oscillation (the sine term) with a quadratic trend and additive Gaussian noise. Having a known ground truth lets us directly compare how well each method (polynomials, step functions, splines, GAMs) recovers the true underlying relationship.
Why: The fundamental motivation for moving beyond linearity
Standard linear regression can only fit straight lines (or hyperplanes). When the true relationship is curved, sinusoidal, or otherwise non-linear, a linear model will systematically under-predict in some regions and over-predict in others; this is the hallmark of model misspecification, and it shows up clearly in residual plots. The methods in this chapter all address this limitation while retaining the interpretability advantages of linear models.
# Generate non-linear data
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
n = 100
X = np.linspace(0, 10, n)
# True non-linear function: sin + polynomial
y_true = np.sin(X) + 0.1 * X**2 - 0.5 * X
y = y_true + np.random.randn(n) * 0.5 # Add noise
# Reshape for sklearn
X_train = X.reshape(-1, 1)
X_plot = np.linspace(0, 10, 300).reshape(-1, 1)
# Visualize data
plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.5, label='Observed data')
plt.plot(X, y_true, 'r-', linewidth=2, label='True function')
plt.xlabel('X')
plt.ylabel('Y')
plt.title('Non-Linear Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"Data generated: n = {n} points")
print("True function: sin(X) + 0.1X² - 0.5X + noise")
7.1 Polynomial Regression
Idea
Extend linear model with polynomial terms: \(Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \dots + \beta_d X^d + \varepsilon\)
Still linear in the parameters β → use standard linear regression!
Choosing Degree d
d = 1: Linear (underfitting)
d = 2-4: Often appropriate
d > 10: Usually overfitting
Use cross-validation to select d
# Try different polynomial degrees
degrees = [1, 2, 3, 5, 10, 15]
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()
for idx, degree in enumerate(degrees):
# Create polynomial features and fit
poly_model = make_pipeline(
PolynomialFeatures(degree=degree, include_bias=False),
LinearRegression()
)
poly_model.fit(X_train, y)
y_pred = poly_model.predict(X_plot)
# Calculate R²
r2 = poly_model.score(X_train, y)
# Plot
axes[idx].scatter(X, y, alpha=0.5, label='Data')
axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'Degree {degree}')
axes[idx].set_xlabel('X')
axes[idx].set_ylabel('Y')
axes[idx].set_title(f'Polynomial Degree {degree}\nR² = {r2:.3f}')
axes[idx].legend()
axes[idx].grid(True, alpha=0.3)
if degree >= 10:
axes[idx].set_ylim(-5, 15) # Limit y-axis for high degrees
plt.tight_layout()
plt.show()
print("\n💡 Observations:")
print("  • Degree 1 (linear): Underfits, can't capture non-linearity")
print("  • Degrees 2-5: Reasonable fits, capture overall trend")
print("  • Degree 10+: Overfits, wild oscillations at boundaries")
print("  • Higher R² doesn't always mean better predictions!")
# Use cross-validation to select degree
degrees_test = range(1, 16)
cv_scores = []
for degree in degrees_test:
poly_model = make_pipeline(
PolynomialFeatures(degree=degree, include_bias=False),
LinearRegression()
)
scores = cross_val_score(poly_model, X_train, y, cv=5,
scoring='neg_mean_squared_error')
cv_scores.append(-scores.mean()) # Convert to MSE
best_degree = degrees_test[np.argmin(cv_scores)]
# Plot CV curve
plt.figure(figsize=(10, 6))
plt.plot(degrees_test, cv_scores, 'o-', linewidth=2, markersize=8)
plt.axvline(best_degree, color='r', linestyle='--',
label=f'Best degree = {best_degree}')
plt.xlabel('Polynomial Degree')
plt.ylabel('Cross-Validation MSE')
plt.title('Selecting Polynomial Degree via Cross-Validation')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
print(f"\n🎯 Best polynomial degree: {best_degree}")
print(f"CV MSE: {min(cv_scores):.3f}")
7.2 Step Functions
Idea
Cut the continuous variable into K bins and fit a constant in each bin.
Pros
✓ Simple interpretation
✓ Easy to implement
✓ Can handle any non-linearity
Cons
✗ Creates discontinuities
✗ Loses information within bins
✗ Sensitive to bin placement
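The bin-placement sensitivity is easy to check with scikit-learn's `KBinsDiscretizer`, which supports equal-width (`uniform`) and equal-count (`quantile`) boundaries; a small sketch on synthetic data (the exact R² values depend on the random draw):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 10, 150)).reshape(-1, 1)
y = np.sin(X.ravel()) + rng.normal(scale=0.3, size=150)

scores = {}
for strategy in ("uniform", "quantile"):
    # Same number of bins, different boundary placement
    model = make_pipeline(
        KBinsDiscretizer(n_bins=8, encode="onehot-dense", strategy=strategy),
        LinearRegression(),
    )
    model.fit(X, y)
    scores[strategy] = model.score(X, y)
    print(f"{strategy:<9} train R² = {scores[strategy]:.3f}")
```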
# Step functions with different number of bins
n_bins_list = [2, 4, 8, 16]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()
for idx, n_bins in enumerate(n_bins_list):
# Create bins
bins = np.linspace(X.min(), X.max(), n_bins + 1)
X_binned = np.digitize(X, bins[1:-1])
# One-hot encode bins
X_encoded = pd.get_dummies(X_binned, drop_first=True).values
# Fit model
lr = LinearRegression()
lr.fit(X_encoded, y)
# Predict for plot
X_plot_binned = np.digitize(X_plot.ravel(), bins[1:-1])
X_plot_encoded = pd.get_dummies(X_plot_binned, drop_first=True).reindex(
columns=range(1, n_bins), fill_value=0).values
y_pred = lr.predict(X_plot_encoded)
# Plot
axes[idx].scatter(X, y, alpha=0.5, label='Data')
axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label=f'{n_bins} bins')
# Show bin boundaries
for b in bins[1:-1]:
axes[idx].axvline(b, color='gray', linestyle=':', alpha=0.5)
axes[idx].set_xlabel('X')
axes[idx].set_ylabel('Y')
axes[idx].set_title(f'Step Function: {n_bins} bins')
axes[idx].legend()
axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\n💡 Observations:")
print("  • 2 bins: Too simple, underfits")
print("  • 4-8 bins: Captures trend but with discontinuities")
print("  • 16 bins: More flexible but noisier")
print("  • All step functions create jumps (not smooth)")
7.3 Regression Splines
The Problem with Polynomials and Steps
Polynomials: Global effect, poor boundary behavior
Steps: Discontinuous, not smooth
Solution: Splines
Piecewise polynomials that are continuous and smooth at knots
B-Splines (Basis Splines)
Build spline as linear combination of B-spline basis functions: \(f(X) = \sum_{j=1}^{K+d} \beta_j B_j(X)\)
where:
K = number of knots
d = degree of polynomial pieces
\(B_j\) = B-spline basis functions
Properties
✓ Piecewise polynomial
✓ Continuous and smooth at knots
✓ Local control (changing one region doesn't affect others much)
✓ Numerically stable
Common Choices
Cubic splines (d=3): Most popular, smooth second derivative
Natural cubic splines: Linear beyond boundary knots (better behavior)
B-splines: Numerically stable basis
# Compare different spline approaches
from sklearn.preprocessing import SplineTransformer
# Different spline configurations
configs = [
{'n_knots': 4, 'degree': 3, 'name': '4 knots, degree 3'},
{'n_knots': 6, 'degree': 3, 'name': '6 knots, degree 3'},
{'n_knots': 10, 'degree': 3, 'name': '10 knots, degree 3'},
{'n_knots': 6, 'degree': 2, 'name': '6 knots, degree 2'},
]
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
axes = axes.ravel()
for idx, config in enumerate(configs):
# Create and fit spline model
spline_model = make_pipeline(
SplineTransformer(n_knots=config['n_knots'],
degree=config['degree'],
include_bias=True),
LinearRegression()
)
spline_model.fit(X_train, y)
y_pred = spline_model.predict(X_plot)
# Calculate R²
r2 = spline_model.score(X_train, y)
# Plot
axes[idx].scatter(X, y, alpha=0.5, label='Data')
axes[idx].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[idx].plot(X_plot, y_pred, 'g-', linewidth=2, label='Spline')
axes[idx].set_xlabel('X')
axes[idx].set_ylabel('Y')
axes[idx].set_title(f'{config["name"]}\nR² = {r2:.3f}')
axes[idx].legend()
axes[idx].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\n💡 Observations:")
print("  • Splines are smooth (no discontinuities)")
print("  • More knots → more flexibility")
print("  • Degree 3 (cubic) is most common")
print("  • Better boundary behavior than polynomials")
print("  • Local changes don't affect the entire curve")
7.4 Comparison of Methods
What: Side-by-side evaluation of linear, polynomial, step, and spline regression
We fit all four methods to the same non-linear dataset and compare their fits visually and in terms of \(R^2\). Linear regression provides the simplest baseline, polynomial regression adds global curvature, step functions capture local levels with discontinuities, and regression splines combine local flexibility with smoothness.
Why: Choosing the right level of flexibility for your problem
Each method occupies a different point on the flexibility-interpretability spectrum. Linear models are the most interpretable but cannot capture non-linearity. Step functions are easy to interpret (average response in each bin) but create artificial jumps. Polynomial regression is smooth but suffers from global effects (changing one region affects the entire curve) and poor boundary behavior. Splines are typically the best default choice for single-predictor non-linearity because they provide local flexibility, smoothness at knots, and stable boundary behavior, all while remaining a linear model in the spline basis coefficients \(\beta\).
# Compare all methods
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 1. Linear
linear = LinearRegression()
linear.fit(X_train, y)
y_pred_linear = linear.predict(X_plot)
axes[0, 0].scatter(X, y, alpha=0.5)
axes[0, 0].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[0, 0].plot(X_plot, y_pred_linear, 'b-', linewidth=2, label='Linear')
axes[0, 0].set_title(f'Linear Regression\nR² = {linear.score(X_train, y):.3f}')
axes[0, 0].legend()
axes[0, 0].grid(True, alpha=0.3)
# 2. Polynomial (degree 5)
poly = make_pipeline(PolynomialFeatures(5, include_bias=False), LinearRegression())
poly.fit(X_train, y)
y_pred_poly = poly.predict(X_plot)
axes[0, 1].scatter(X, y, alpha=0.5)
axes[0, 1].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[0, 1].plot(X_plot, y_pred_poly, 'g-', linewidth=2, label='Polynomial (d=5)')
axes[0, 1].set_title(f'Polynomial Regression\nR² = {poly.score(X_train, y):.3f}')
axes[0, 1].legend()
axes[0, 1].grid(True, alpha=0.3)
# 3. Step function (8 bins)
bins = np.linspace(X.min(), X.max(), 9)
X_binned = np.digitize(X, bins[1:-1])
X_encoded = pd.get_dummies(X_binned, drop_first=True).values
step_lr = LinearRegression()
step_lr.fit(X_encoded, y)
X_plot_binned = np.digitize(X_plot.ravel(), bins[1:-1])
X_plot_encoded = pd.get_dummies(X_plot_binned, drop_first=True).reindex(
columns=range(1, 8), fill_value=0).values
y_pred_step = step_lr.predict(X_plot_encoded)
axes[1, 0].scatter(X, y, alpha=0.5)
axes[1, 0].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[1, 0].plot(X_plot, y_pred_step, 'orange', linewidth=2, label='Step (8 bins)')
for b in bins[1:-1]:
axes[1, 0].axvline(b, color='gray', linestyle=':', alpha=0.5)
axes[1, 0].set_title('Step Function')
axes[1, 0].legend()
axes[1, 0].grid(True, alpha=0.3)
# 4. Spline (6 knots, degree 3)
spline = make_pipeline(SplineTransformer(n_knots=6, degree=3), LinearRegression())
spline.fit(X_train, y)
y_pred_spline = spline.predict(X_plot)
axes[1, 1].scatter(X, y, alpha=0.5)
axes[1, 1].plot(X, y_true, 'r--', alpha=0.5, linewidth=2, label='True')
axes[1, 1].plot(X_plot, y_pred_spline, 'purple', linewidth=2, label='Spline')
axes[1, 1].set_title(f'Regression Spline\nR² = {spline.score(X_train, y):.3f}')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
# Calculate test errors
print("\n📊 Comparison Summary:\n")
print(f"{'Method':<20} {'Train R²':<12} {'Flexibility':<15} {'Smoothness'}")
print("=" * 65)
print(f"{'Linear':<20} {linear.score(X_train, y):>10.3f} {'Low':<15} Smooth")
print(f"{'Polynomial (d=5)':<20} {poly.score(X_train, y):>10.3f} {'Medium':<15} Smooth")
print(f"{'Step (8 bins)':<20} {'N/A':>10} {'Medium':<15} Discontinuous")
print(f"{'Spline':<20} {spline.score(X_train, y):>10.3f} {'High':<15} Smooth ✓")
print("\n💡 Key Takeaways:")
print("  • Linear: Simple but can't capture non-linearity")
print("  • Polynomial: Flexible but global effects, poor boundaries")
print("  • Step: Flexible but discontinuous")
print("  • Spline: Flexible, smooth, local control ✓ Often the best choice")
7.5 Generalized Additive Models (GAMs)
Extension to Multiple Predictors
Standard Linear Model: \(Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon\)
GAM: \(Y = \beta_0 + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p) + \varepsilon\)
where each \(f_i\) can be:
Spline
Polynomial
Step function
Or stay linear!
Advantages
✓ Interpretable: Each feature's effect can be visualized
✓ Flexible: Non-linear relationships
✓ Additive: Easy to understand individual effects
✓ Automatic: Can use same smoothing for all features
Disadvantages
✗ Additive assumption: Misses interactions (can add them manually)
✗ Computationally intensive: For large datasets
When to Use
Need interpretability
Expect non-linear relationships
Want to visualize effects
Additive structure reasonable
# GAM Example: 2 predictors
np.random.seed(42)
n = 200
# Generate 2D data
X1 = np.random.uniform(0, 10, n)
X2 = np.random.uniform(0, 10, n)
# True GAM: y = sin(X1) + 0.1*X2^2 + noise
y_gam = np.sin(X1) + 0.1 * X2**2 + np.random.randn(n) * 0.5
X_gam = np.column_stack([X1, X2])
# Fit GAM using splines for each feature
from sklearn.compose import ColumnTransformer
gam_model = make_pipeline(
ColumnTransformer([
('spline1', SplineTransformer(n_knots=5, degree=3), [0]),
('spline2', SplineTransformer(n_knots=5, degree=3), [1])
]),
LinearRegression()
)
gam_model.fit(X_gam, y_gam)
# Compare with linear model
linear_model = LinearRegression()
linear_model.fit(X_gam, y_gam)
print("\n📊 GAM vs Linear Model:\n")
print(f"Linear R²: {linear_model.score(X_gam, y_gam):.4f}")
print(f"GAM R²: {gam_model.score(X_gam, y_gam):.4f}")
print(f"\nImprovement: {(gam_model.score(X_gam, y_gam) - linear_model.score(X_gam, y_gam)):.4f}")
# Visualize individual effects (partial dependence)
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Effect of X1 (holding X2 at median)
x1_range = np.linspace(0, 10, 100)
x2_median = np.median(X2)
X_plot_1 = np.column_stack([x1_range, np.full(100, x2_median)])
y_pred_1 = gam_model.predict(X_plot_1)
axes[0].scatter(X1, y_gam, alpha=0.3)
axes[0].plot(x1_range, y_pred_1, 'r-', linewidth=3, label='GAM')
axes[0].plot(x1_range, np.sin(x1_range) + 0.1 * x2_median**2, 'g--',
linewidth=2, label='True effect')
axes[0].set_xlabel('X₁')
axes[0].set_ylabel('Y')
axes[0].set_title(f'Effect of X₁ (X₂ = {x2_median:.2f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Effect of X2 (holding X1 at median)
x2_range = np.linspace(0, 10, 100)
x1_median = np.median(X1)
X_plot_2 = np.column_stack([np.full(100, x1_median), x2_range])
y_pred_2 = gam_model.predict(X_plot_2)
axes[1].scatter(X2, y_gam, alpha=0.3)
axes[1].plot(x2_range, y_pred_2, 'r-', linewidth=3, label='GAM')
axes[1].plot(x2_range, np.sin(x1_median) + 0.1 * x2_range**2, 'g--',
linewidth=2, label='True effect')
axes[1].set_xlabel('X₂')
axes[1].set_ylabel('Y')
axes[1].set_title(f'Effect of X₂ (X₁ = {x1_median:.2f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("\n💡 GAM Advantages:")
print("  • Can visualize effect of each predictor separately")
print("  • More flexible than linear regression")
print("  • Maintains interpretability")
print("  • Automatically determines non-linear shape")
Key Takeaways
Method Comparison

| Method | Flexibility | Smoothness | Interpretation | When to Use |
|---|---|---|---|---|
| Linear | Low | Smooth | Easy | Baseline, linear relationships |
| Polynomial | Medium-High | Smooth | Medium | Simple non-linearity, few features |
| Step Functions | Medium | Discontinuous | Easy | Categorical-like effects |
| Regression Splines | High | Smooth ✓ | Medium | Complex non-linearity, 1-2 features |
| Smoothing Splines | High | Very smooth | Medium | Automatic smoothing |
| Local Regression | Very High | Smooth | Hard | Exploratory, small datasets |
| GAMs | High | Smooth | Good ✓ | Multiple features, need interpretation |
Decision Guide
Start Simple:
Try linear regression first
If underfitting, move to polynomials (degree 2-4)
If still underfitting, use splines or GAMs
For Single Predictor:
Polynomial (d=2-4) if simple curve
Regression splines if complex or local features
Smoothing splines if want automatic tuning
For Multiple Predictors:
Polynomial features if few variables and interactions matter
GAMs if want interpretability and additive structure ✓
Other methods (Random Forests, Neural Nets) if just want prediction
Practical Tips
Always use cross-validation to select complexity (degree, knots, λ)
Start simple - Don't use splines if a polynomial works
Visualize - Plot fitted functions to understand behavior
Check boundaries - Extrapolation often fails
Consider domain knowledge - Does shape make sense?
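The extrapolation warning is worth seeing once. A small sketch: fit a degree-10 polynomial on \(x \in [0, 10]\), then predict at x = 15, well beyond the training range; the out-of-range error is typically far larger than the in-range error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 100).reshape(-1, 1)
y = np.sin(x.ravel()) + rng.normal(scale=0.3, size=100)

# High-degree polynomial: fine inside the data range, wild outside it
model = make_pipeline(PolynomialFeatures(10, include_bias=False), LinearRegression())
model.fit(x, y)

inside = model.predict(np.array([[5.0]]))[0]
outside = model.predict(np.array([[15.0]]))[0]
print(f"error at x=5 (interpolation):  {abs(inside - np.sin(5.0)):.2f}")
print(f"error at x=15 (extrapolation): {abs(outside - np.sin(15.0)):.2f}")
```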
Common Pitfalls
✗ Too high polynomial degree → Wild oscillations
✗ Too many knots → Overfitting
✗ Extrapolation → Predictions outside data range unreliable
✗ Ignoring interactions → GAMs assume additivity
✗ Not standardizing → Features on different scales
Best Practices
✅ Use cross-validation for all tuning parameters
✅ Plot residuals to check fit quality
✅ Compare models on the test set, not the training set
✅ Regularize if needed (Ridge/Lasso with splines)
✅ Document assumptions, especially for GAMs
Next Chapter
Chapter 8: Tree-Based Methods
Decision Trees
Bagging
Random Forests
Boosting
BART
Practice Exercises
Exercise 1: Polynomial Selection
Generate data: \(y = \sin(2x) + \epsilon\), \(x \in [0, \pi]\)
Fit polynomials degree 1-15
Plot fits and CV scores
Identify best degree
Explain overfitting behavior
Exercise 2: Step Function Sensitivity
Create step functions with different bin placements
Compare performance
Show sensitivity to bin boundaries
Suggest optimal number of bins
Exercise 3: Spline Flexibility
Fit splines with 3, 6, 10, 20 knots
Calculate train/test MSE
Identify optimal knot count
Compare with polynomial of equivalent df
Exercise 4: GAM with Interactions
Generate: \(y = X_1 \times X_2 + \sin(X_1) + \epsilon\)
Fit additive GAM
Fit GAM with interaction term
Compare R²
Visualize interaction effect
Exercise 5: Real Data Application
Use Boston Housing or similar dataset:
Fit linear model
Fit GAM allowing non-linearity
Compare performance
Interpret non-linear effects
Which features benefit from non-linearity?