Chapter 3: Linear Regression¶
Overview¶
Linear regression is a simple yet powerful approach for predicting a quantitative response Y based on one or more predictor variables X.
Key Topics:
Simple Linear Regression (one predictor)
Multiple Linear Regression (multiple predictors)
Assessing model accuracy (RSE, R², F-statistic)
Qualitative predictors (dummy variables)
Interaction terms
Non-linear transformations
Potential problems and diagnostics
The Linear Model¶
Simple Linear Regression¶
Y ≈ β₀ + β₁X
Multiple Linear Regression¶
Y ≈ β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ
Goal: Estimate coefficients β₀, β₁, …, βₚ to minimize prediction error
# Imports used throughout this chapter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

# Generate synthetic advertising data
np.random.seed(42)
n = 200
# Generate data: Sales = 5 + 0.05*TV_budget + noise
TV_budget = np.random.uniform(0, 300, n)
true_beta0 = 5
true_beta1 = 0.05
noise = np.random.normal(0, 1.5, n)
sales = true_beta0 + true_beta1 * TV_budget + noise
# Create DataFrame
data = pd.DataFrame({'TV': TV_budget, 'Sales': sales})
print("📊 Simulated Advertising Data")
print(f"True relationship: Sales = {true_beta0} + {true_beta1}*TV + ε")
print(f"\nData shape: {data.shape}")
print(f"\nFirst few rows:")
print(data.head())
print(f"\nDescriptive statistics:")
print(data.describe())
# Fit simple linear regression using Ordinary Least Squares (OLS)
X = data[['TV']].values
y = data['Sales'].values
# Fit model
model = LinearRegression()
model.fit(X, y)
beta0_hat = model.intercept_
beta1_hat = model.coef_[0]
print("📈 Simple Linear Regression Results")
print(f"\nEstimated coefficients:")
print(f"β₀ (Intercept): {beta0_hat:.4f} (True: {true_beta0})")
print(f"β₁ (Slope): {beta1_hat:.6f} (True: {true_beta1})")
# Make predictions
y_pred = model.predict(X)
# Calculate metrics
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
rse = np.sqrt(np.sum((y - y_pred)**2) / (len(y) - 2)) # Residual Standard Error
print(f"\nModel Performance:")
print(f"MSE: {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"RSE: {rse:.4f}")
print(f"R²: {r2:.4f}")
# Visualization
plt.figure(figsize=(14, 5))
# Plot 1: Scatter with regression line
plt.subplot(1, 2, 1)
plt.scatter(X, y, alpha=0.5, label='Observed data')
plt.plot(X, y_pred, 'r-', linewidth=2, label='Fitted line')
plt.plot(TV_budget, true_beta0 + true_beta1 * TV_budget, 'g--',
linewidth=2, alpha=0.7, label='True relationship')
plt.xlabel('TV Advertising Budget ($1000s)')
plt.ylabel('Sales (1000 units)')
plt.title('Simple Linear Regression: Sales vs TV Budget')
plt.legend()
plt.grid(True, alpha=0.3)
# Plot 2: Residuals
plt.subplot(1, 2, 2)
residuals = y - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
3.1 Statistical Inference for Coefficients¶
Hypothesis Testing¶
Test whether there’s a relationship between X and Y:
H₀: β₁ = 0 (no relationship)
H₁: β₁ ≠ 0 (relationship exists)
t-statistic¶
t = (β̂₁ - 0) / SE(β̂₁)
Under H₀, t follows a t-distribution with n-2 degrees of freedom
Standard Errors¶
SE(β̂₀): Standard error of intercept
SE(β̂₁): Standard error of slope
Smaller SE → more confident in estimate
Confidence Intervals¶
95% CI for β₁:
β̂₁ ± t* × SE(β̂₁), where t* is the 97.5th percentile of the t-distribution with n−2 degrees of freedom (≈ 1.96 for large n)
# Calculate confidence intervals manually
from scipy import stats as scipy_stats
n = len(X)
p = 1 # number of predictors (excluding intercept)
# Calculate standard errors
y_pred = model.predict(X)
residuals = y - y_pred
rse = np.sqrt(np.sum(residuals**2) / (n - p - 1))
# Standard error of slope
X_centered = X - X.mean()
se_beta1 = rse / np.sqrt(np.sum(X_centered**2))
# Standard error of intercept
se_beta0 = rse * np.sqrt(1/n + X.mean()**2 / np.sum(X_centered**2))
# t-statistics
t_beta0 = beta0_hat / se_beta0
t_beta1 = beta1_hat / se_beta1
# p-values (two-tailed test)
p_value_beta0 = 2 * (1 - scipy_stats.t.cdf(abs(t_beta0), df=n-p-1))
p_value_beta1 = 2 * (1 - scipy_stats.t.cdf(abs(t_beta1), df=n-p-1))
# 95% confidence intervals
t_critical = scipy_stats.t.ppf(0.975, df=n-p-1)
ci_beta0 = (beta0_hat - t_critical * se_beta0, beta0_hat + t_critical * se_beta0)
ci_beta1 = (beta1_hat - t_critical * se_beta1, beta1_hat + t_critical * se_beta1)
print("📊 Statistical Inference for Coefficients\n")
print(f"{'Coefficient':<15} {'Estimate':<10} {'Std Error':<12} {'t-stat':<10} {'p-value':<12} {'95% CI'}")
print("="*80)
print(f"{'β₀ (Intercept)':<15} {beta0_hat:>8.4f} {se_beta0:>12.4f} {t_beta0:>10.3f} {p_value_beta0:>12.6f} {ci_beta0}")
print(f"{'β₁ (TV)':<15} {beta1_hat:>8.6f} {se_beta1:>12.6f} {t_beta1:>10.3f} {p_value_beta1:>12.6f} {ci_beta1}")
print(f"\n💡 Interpretation:")
print(f" • β₁ = {beta1_hat:.6f}: For each $1000 increase in TV budget, sales increase by {beta1_hat*1000:.2f} units")
print(f" • p-value < 0.001: Strong evidence of relationship (reject H₀)")
print(f" • 95% CI for β₁: [{ci_beta1[0]:.6f}, {ci_beta1[1]:.6f}]")
print(f" • True β₁ = {true_beta1} is within the confidence interval ✅")
3.2 Assessing Model Accuracy¶
Residual Standard Error (RSE)¶
RSE = √[RSS / (n - p - 1)]
Estimate of standard deviation of ε
Average amount response deviates from true regression line
Units: same as Y
R² Statistic (Coefficient of Determination)¶
R² = 1 - (RSS / TSS) = 1 - [Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²]
Proportion of variance explained by the model
Range: 0 to 1
Higher is better
R² ≈ 1: Model explains most variance
R² ≈ 0: Model explains little variance
Note: In simple linear regression, R² = cor(X,Y)²
# Comprehensive model assessment
y_mean = y.mean()
# Sum of squares
TSS = np.sum((y - y_mean)**2) # Total Sum of Squares
RSS = np.sum((y - y_pred)**2) # Residual Sum of Squares
ESS = np.sum((y_pred - y_mean)**2) # Explained Sum of Squares
# Metrics
RSE = np.sqrt(RSS / (n - 2))
R_squared = 1 - (RSS / TSS)
correlation = np.corrcoef(X.flatten(), y)[0, 1]
print("📊 Model Accuracy Metrics\n")
print(f"Total Sum of Squares (TSS): {TSS:>10.2f}")
print(f"Residual Sum of Squares (RSS): {RSS:>10.2f}")
print(f"Explained Sum of Squares (ESS): {ESS:>10.2f}")
print(f"\nTSS = RSS + ESS? {TSS:>10.2f} ≈ {RSS + ESS:>10.2f} ✅")
print(f"\n{'Metric':<30} {'Value':<15} {'Interpretation'}")
print("="*70)
print(f"{'Residual Standard Error':<30} {RSE:>10.4f} Average deviation from line")
print(f"{'R² (R-squared)':<30} {R_squared:>10.4f} {R_squared*100:.1f}% variance explained")
print(f"{'Correlation(X, Y)':<30} {correlation:>10.4f} Linear relationship strength")
print(f"{'R² = cor(X,Y)²?':<30} {R_squared:>10.4f} ≈ {correlation**2:>6.4f} ✅")
# Visualize variance decomposition
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
# Total variance
axes[0].scatter(X, y, alpha=0.5)
axes[0].axhline(y=y_mean, color='r', linestyle='--', linewidth=2, label='Mean')
for i in range(0, n, 10):
    axes[0].plot([X[i], X[i]], [y[i], y_mean], 'k-', alpha=0.2, linewidth=0.5)
axes[0].set_xlabel('TV')
axes[0].set_ylabel('Sales')
axes[0].set_title(f'Total Variance (TSS = {TSS:.2f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)
# Explained variance
axes[1].scatter(X, y, alpha=0.3, label='Data')
axes[1].plot(X, y_pred, 'r-', linewidth=2, label='Fitted line')
axes[1].axhline(y=y_mean, color='g', linestyle='--', linewidth=2, label='Mean')
for i in range(0, n, 10):
    axes[1].plot([X[i], X[i]], [y_pred[i], y_mean], 'orange', alpha=0.4, linewidth=1.5)
axes[1].set_xlabel('TV')
axes[1].set_ylabel('Sales')
axes[1].set_title(f'Explained Variance (ESS = {ESS:.2f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)
# Residual variance
axes[2].scatter(X, y, alpha=0.3, label='Data')
axes[2].plot(X, y_pred, 'r-', linewidth=2, label='Fitted line')
for i in range(0, n, 10):
    axes[2].plot([X[i], X[i]], [y[i], y_pred[i]], 'b-', alpha=0.2, linewidth=0.5)
axes[2].set_xlabel('TV')
axes[2].set_ylabel('Sales')
axes[2].set_title(f'Residual Variance (RSS = {RSS:.2f})')
axes[2].legend()
axes[2].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
3.3 Multiple Linear Regression¶
The Model¶
Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε
Matrix Form¶
Y = Xβ + ε
where:
Y = (n × 1) response vector
X = (n × (p+1)) design matrix
β = ((p+1) × 1) coefficient vector
ε = (n × 1) error vector
OLS Solution¶
β̂ = (XᵀX)⁻¹Xᵀy
F-Statistic¶
Test whether at least one predictor is useful:
H₀: β₁ = β₂ = … = βₚ = 0
H₁: At least one βⱼ ≠ 0
F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]
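The closed-form OLS solution above can be verified directly with NumPy (a minimal sketch on simulated data; `rng`, `X_demo`, and `beta_demo` are illustrative names, and in practice `np.linalg.lstsq` is the numerically safer route than an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3

# Design matrix with an intercept column of ones: shape (n, p+1)
X_demo = np.column_stack([np.ones(n), rng.uniform(0, 10, (n, p))])
beta_demo = np.array([5.0, 0.05, 0.1, 0.01])
y_demo = X_demo @ beta_demo + rng.normal(0, 0.1, n)

# Normal equations: beta_hat = (X^T X)^{-1} X^T y
beta_hat = np.linalg.inv(X_demo.T @ X_demo) @ X_demo.T @ y_demo

# lstsq solves the same least-squares problem without forming the inverse
beta_lstsq, *_ = np.linalg.lstsq(X_demo, y_demo, rcond=None)

print(np.round(beta_hat, 3))                  # close to the true beta_demo
print(np.allclose(beta_hat, beta_lstsq))      # both routes agree
```

With little noise and n ≫ p, both estimates recover the true coefficients to a few decimal places.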
# Generate multiple regression data
# Sales = 5 + 0.05*TV + 0.1*Radio + 0.01*Newspaper + noise
np.random.seed(42)
n = 200
TV = np.random.uniform(0, 300, n)
Radio = np.random.uniform(0, 50, n)
Newspaper = np.random.uniform(0, 100, n)
true_beta = np.array([5, 0.05, 0.1, 0.01])
noise = np.random.normal(0, 1.2, n)
Sales = (true_beta[0] +
true_beta[1] * TV +
true_beta[2] * Radio +
true_beta[3] * Newspaper +
noise)
# Create DataFrame
df = pd.DataFrame({
'TV': TV,
'Radio': Radio,
'Newspaper': Newspaper,
'Sales': Sales
})
print("📊 Multiple Regression Dataset")
print(f"\nTrue model: Sales = {true_beta[0]} + {true_beta[1]}*TV + {true_beta[2]}*Radio + {true_beta[3]}*Newspaper + ε")
print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())
# Correlation matrix
print(f"\n📈 Correlation Matrix:")
print(df.corr().round(3))
# Visualize pairwise relationships
sns.pairplot(df, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pairwise Relationships', y=1.01)
plt.show()
# Fit multiple regression model
X_multi = df[['TV', 'Radio', 'Newspaper']].values
y_multi = df['Sales'].values
# Fit model
model_multi = LinearRegression()
model_multi.fit(X_multi, y_multi)
# Coefficients
beta_hat = np.concatenate([[model_multi.intercept_], model_multi.coef_])
print("📈 Multiple Linear Regression Results\n")
print(f"{'Variable':<12} {'True β':<12} {'Estimated β̂':<15} {'Difference'}")
print("="*60)
print(f"{'Intercept':<12} {true_beta[0]:>10.4f} {beta_hat[0]:>12.4f} {abs(beta_hat[0]-true_beta[0]):>10.4f}")
for i, var in enumerate(['TV', 'Radio', 'Newspaper']):
    print(f"{var:<12} {true_beta[i+1]:>10.4f} {beta_hat[i+1]:>12.4f} {abs(beta_hat[i+1]-true_beta[i+1]):>10.4f}")
# Predictions
y_pred_multi = model_multi.predict(X_multi)
# Model metrics
rss_multi = np.sum((y_multi - y_pred_multi)**2)
tss_multi = np.sum((y_multi - y_multi.mean())**2)
r2_multi = 1 - (rss_multi / tss_multi)
rse_multi = np.sqrt(rss_multi / (n - len(beta_hat)))
print(f"\n📊 Model Performance:")
print(f"R²: {r2_multi:.4f} ({r2_multi*100:.1f}% variance explained)")
print(f"RSE: {rse_multi:.4f}")
# F-statistic
p = len(beta_hat) - 1 # number of predictors
F_stat = ((tss_multi - rss_multi) / p) / (rss_multi / (n - p - 1))
from scipy.stats import f as f_dist
p_value_F = 1 - f_dist.cdf(F_stat, p, n - p - 1)
print(f"\n📊 F-Statistic:")
print(f"F = {F_stat:.2f}, p-value = {p_value_F:.3g}")
print(f"Strong evidence that at least one predictor is useful ✅")
# Calculate individual predictor significance using statsmodels for detailed output
import statsmodels.api as sm
# Add constant for intercept
X_with_const = sm.add_constant(X_multi)
# Fit model
model_sm = sm.OLS(y_multi, X_with_const).fit()
print("📊 Detailed Regression Summary\n")
print(model_sm.summary())
# Custom visualization of coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Coefficient values
axes[0].bar(['Intercept', 'TV', 'Radio', 'Newspaper'], beta_hat,
color=['gray', 'blue', 'green', 'orange'], alpha=0.7)
axes[0].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Estimated Coefficients')
axes[0].grid(True, alpha=0.3, axis='y')
# Coefficient comparison with true values
x_pos = np.arange(len(beta_hat))
width = 0.35
axes[1].bar(x_pos - width/2, true_beta, width, label='True β', alpha=0.7)
axes[1].bar(x_pos + width/2, beta_hat, width, label='Estimated β̂', alpha=0.7)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(['β₀', 'β₁ (TV)', 'β₂ (Radio)', 'β₃ (News)'])
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('True vs Estimated Coefficients')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')
plt.tight_layout()
plt.show()
3.4 Model Diagnostics¶
Key Assumptions of Linear Regression¶
Linearity: Relationship between X and Y is linear
Independence: Observations are independent
Homoscedasticity: Constant variance of errors
Normality: Errors are normally distributed
Diagnostic Plots¶
Residuals vs Fitted: Check linearity and homoscedasticity
Q-Q Plot: Check normality of residuals
Scale-Location: Check homoscedasticity
Residuals vs Leverage: Identify influential points
# Comprehensive diagnostic plots
residuals_multi = y_multi - y_pred_multi
standardized_residuals = residuals_multi / np.std(residuals_multi)
fig, axes = plt.subplots(2, 2, figsize=(14, 10))
# 1. Residuals vs Fitted
axes[0, 0].scatter(y_pred_multi, residuals_multi, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted\n(Should show random pattern around 0)')
axes[0, 0].grid(True, alpha=0.3)
# Add a smoothed trend line (a Savitzky-Golay filter as a simple stand-in for LOWESS)
from scipy.signal import savgol_filter
sorted_idx = np.argsort(y_pred_multi)
smoothed = savgol_filter(residuals_multi[sorted_idx], window_length=51, polyorder=3)
axes[0, 0].plot(y_pred_multi[sorted_idx], smoothed, 'b-', linewidth=2, alpha=0.7)
# 2. Q-Q Plot (Normal Probability Plot)
scipy_stats.probplot(residuals_multi, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot\n(Points should follow diagonal line)')
axes[0, 1].grid(True, alpha=0.3)
# 3. Scale-Location (Spread-Location)
sqrt_abs_std_resid = np.sqrt(np.abs(standardized_residuals))
axes[1, 0].scatter(y_pred_multi, sqrt_abs_std_resid, alpha=0.5)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location Plot\n(Check homoscedasticity)')
axes[1, 0].grid(True, alpha=0.3)
# Add trend line
sorted_idx2 = np.argsort(y_pred_multi)
smoothed2 = savgol_filter(sqrt_abs_std_resid[sorted_idx2], window_length=51, polyorder=3)
axes[1, 0].plot(y_pred_multi[sorted_idx2], smoothed2, 'r-', linewidth=2, alpha=0.7)
# 4. Histogram of residuals
axes[1, 1].hist(residuals_multi, bins=30, density=True, alpha=0.7, edgecolor='black')
mu, sigma = residuals_multi.mean(), residuals_multi.std()
x = np.linspace(residuals_multi.min(), residuals_multi.max(), 100)
axes[1, 1].plot(x, scipy_stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2,
label=f'Normal(μ={mu:.2f}, σ={sigma:.2f})')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('Distribution of Residuals\n(Should be approximately normal)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
print("✅ Diagnostic Interpretation:")
print(" 1. Residuals vs Fitted: Random scatter ✅ (no pattern = good)")
print(" 2. Q-Q Plot: Points follow line ✅ (normality assumption met)")
print(" 3. Scale-Location: Horizontal band ✅ (constant variance)")
print(" 4. Histogram: Bell-shaped ✅ (normally distributed residuals)")
3.5 Qualitative Predictors (Categorical Variables)¶
Dummy Variables¶
For a categorical variable with k levels, create k-1 dummy variables
Example: Region (East, West, South)
X_West = 1 if West, 0 otherwise
X_South = 1 if South, 0 otherwise
(East is the baseline/reference level)
Interpretation¶
Y = β₀ + β₁X_West + β₂X_South
β₀: Average Y for baseline (East)
β₁: Difference between West and East
β₂: Difference between South and East
# Generate data with categorical predictor
np.random.seed(42)
n_cat = 300
# Region: East, West, South
regions = np.random.choice(['East', 'West', 'South'], n_cat)
budget = np.random.uniform(50, 250, n_cat)
# Different base sales for each region
base_sales = {'East': 10, 'West': 15, 'South': 12}
sales_cat = np.array([base_sales[r] + 0.05*budget[i] + np.random.normal(0, 1)
for i, r in enumerate(regions)])
df_cat = pd.DataFrame({
'Region': regions,
'Budget': budget,
'Sales': sales_cat
})
print("📊 Categorical Predictor Example")
print(f"\nDataset shape: {df_cat.shape}")
print(f"\nFirst few rows:")
print(df_cat.head(10))
# Create dummy variables
df_cat_encoded = pd.get_dummies(df_cat, columns=['Region'], drop_first=True)
print(f"\n📋 After creating dummy variables:")
print(df_cat_encoded.head())
# Fit model with categorical predictor
X_cat = df_cat_encoded[['Budget', 'Region_South', 'Region_West']].values
y_cat = df_cat_encoded['Sales'].values
model_cat = LinearRegression()
model_cat.fit(X_cat, y_cat)
print(f"\n📈 Regression with Categorical Predictor\n")
print(f"{'Variable':<15} {'Coefficient':<15} {'Interpretation'}")
print("="*70)
print(f"{'Intercept':<15} {model_cat.intercept_:>12.4f} Baseline: East region + Budget=0")
print(f"{'Budget':<15} {model_cat.coef_[0]:>12.4f} Effect of $1 budget increase")
print(f"{'Region_South':<15} {model_cat.coef_[1]:>12.4f} South vs East difference")
print(f"{'Region_West':<15} {model_cat.coef_[2]:>12.4f} West vs East difference")
# Visualize
plt.figure(figsize=(12, 5))
plt.subplot(1, 2, 1)
for region in ['East', 'West', 'South']:
    mask = df_cat['Region'] == region
    plt.scatter(df_cat[mask]['Budget'], df_cat[mask]['Sales'],
                alpha=0.5, label=region, s=50)
plt.xlabel('Budget')
plt.ylabel('Sales')
plt.title('Sales vs Budget by Region')
plt.legend()
plt.grid(True, alpha=0.3)
plt.subplot(1, 2, 2)
# Box plot
df_cat.boxplot(column='Sales', by='Region', ax=plt.gca())
plt.suptitle('')
plt.title('Sales Distribution by Region')
plt.ylabel('Sales')
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
3.6 Interaction Terms¶
The Concept¶
Standard additive model assumes effect of X₁ on Y is independent of X₂
With interaction:
Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁ × X₂) + ε
Interpretation¶
β₁: Effect of X₁ when X₂ = 0
β₂: Effect of X₂ when X₁ = 0
β₃: How effect of X₁ changes with X₂ (synergy)
Example: TV × Radio interaction
Advertising on TV and Radio together may be more effective than sum of individual effects
# Generate data with interaction effect
np.random.seed(42)
n_int = 200
TV_int = np.random.uniform(0, 300, n_int)
Radio_int = np.random.uniform(0, 50, n_int)
# Model WITH interaction: Sales = 5 + 0.02*TV + 0.05*Radio + 0.001*(TV*Radio)
sales_int = (5 +
0.02 * TV_int +
0.05 * Radio_int +
0.001 * (TV_int * Radio_int) + # Interaction!
np.random.normal(0, 0.8, n_int))
df_int = pd.DataFrame({
'TV': TV_int,
'Radio': Radio_int,
'Sales': sales_int
})
# Model 1: Without interaction
X_no_int = df_int[['TV', 'Radio']].values
model_no_int = LinearRegression().fit(X_no_int, sales_int)
r2_no_int = model_no_int.score(X_no_int, sales_int)
# Model 2: With interaction
df_int['TV_Radio'] = df_int['TV'] * df_int['Radio']
X_with_int = df_int[['TV', 'Radio', 'TV_Radio']].values
model_with_int = LinearRegression().fit(X_with_int, sales_int)
r2_with_int = model_with_int.score(X_with_int, sales_int)
print("📊 Model Comparison: With vs Without Interaction\n")
print(f"{'Model':<25} {'R²':<10} {'Improvement'}")
print("="*50)
print(f"{'Without interaction':<25} {r2_no_int:>6.4f}")
print(f"{'With interaction':<25} {r2_with_int:>6.4f} +{(r2_with_int-r2_no_int)*100:.2f}%")
print(f"\n📈 Interaction Model Coefficients:")
print(f"Intercept: {model_with_int.intercept_:.4f}")
print(f"TV: {model_with_int.coef_[0]:.6f}")
print(f"Radio: {model_with_int.coef_[1]:.6f}")
print(f"TV × Radio: {model_with_int.coef_[2]:.6f} ← Interaction term")
# Visualize interaction effect
fig = plt.figure(figsize=(15, 5))
# 3D scatter
from mpl_toolkits.mplot3d import Axes3D
ax1 = fig.add_subplot(131, projection='3d')
ax1.scatter(df_int['TV'], df_int['Radio'], sales_int, alpha=0.5)
ax1.set_xlabel('TV Budget')
ax1.set_ylabel('Radio Budget')
ax1.set_zlabel('Sales')
ax1.set_title('Sales vs TV & Radio (3D View)')
# Effect of TV at different Radio levels
ax2 = fig.add_subplot(132)
radio_levels = [0, 25, 50]
tv_range = np.linspace(0, 300, 100)
for radio in radio_levels:
    sales_pred = (model_with_int.intercept_ +
                  model_with_int.coef_[0] * tv_range +
                  model_with_int.coef_[1] * radio +
                  model_with_int.coef_[2] * tv_range * radio)
    ax2.plot(tv_range, sales_pred, linewidth=2, label=f'Radio = ${radio}k')
ax2.set_xlabel('TV Budget')
ax2.set_ylabel('Predicted Sales')
ax2.set_title('Effect of TV at Different Radio Levels\n(Interaction causes different slopes)')
ax2.legend()
ax2.grid(True, alpha=0.3)
# Interaction visualization
ax3 = fig.add_subplot(133)
TV_grid, Radio_grid = np.meshgrid(np.linspace(0, 300, 20), np.linspace(0, 50, 20))
interaction_effect = model_with_int.coef_[2] * TV_grid * Radio_grid
contour = ax3.contourf(TV_grid, Radio_grid, interaction_effect, levels=15, cmap='viridis')
ax3.set_xlabel('TV Budget')
ax3.set_ylabel('Radio Budget')
ax3.set_title('Interaction Effect Magnitude\n(Synergy between TV and Radio)')
plt.colorbar(contour, ax=ax3, label='Interaction Contribution to Sales')
plt.tight_layout()
plt.show()
print("\n💡 Interpretation:")
print(" • Interaction coefficient is positive → synergy effect")
print(" • Effect of TV increases as Radio budget increases")
print(" • Combined advertising is more effective than sum of parts")
3.7 Non-Linear Transformations¶
Polynomial Regression¶
Model non-linear relationships while staying in linear regression framework:
Y = β₀ + β₁X + β₂X² + β₃X³ + ... + ε
Still “linear” because it’s linear in the coefficients β
When to use:¶
Scatter plot shows curved relationship
Residual plot shows pattern (not random)
Domain knowledge suggests non-linearity
Caution:¶
High-degree polynomials can overfit
Extrapolation beyond data range is dangerous
# Generate non-linear data
np.random.seed(42)
X_poly = np.linspace(-3, 3, 100)
y_poly = 2 + 3*X_poly - 2*X_poly**2 + 0.5*X_poly**3 + np.random.normal(0, 2, 100)
X_poly_2d = X_poly.reshape(-1, 1)
# Fit polynomials of different degrees
degrees = [1, 2, 3, 4, 8]
models_poly = {}
fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()
for idx, degree in enumerate(degrees):
    poly_features = PolynomialFeatures(degree=degree)
    X_poly_transformed = poly_features.fit_transform(X_poly_2d)
    model = LinearRegression()
    model.fit(X_poly_transformed, y_poly)
    models_poly[degree] = (poly_features, model)
    # Predictions
    X_plot = np.linspace(-3, 3, 200).reshape(-1, 1)
    X_plot_transformed = poly_features.transform(X_plot)
    y_plot = model.predict(X_plot_transformed)
    # Metrics
    y_pred = model.predict(X_poly_transformed)
    r2 = r2_score(y_poly, y_pred)
    rmse = np.sqrt(mean_squared_error(y_poly, y_pred))
    # Plot
    axes[idx].scatter(X_poly, y_poly, alpha=0.5, s=30, label='Data')
    axes[idx].plot(X_plot, y_plot, 'r-', linewidth=2, label=f'Degree {degree} fit')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Polynomial Degree {degree}\nR² = {r2:.4f}, RMSE = {rmse:.2f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_ylim(-20, 30)
# Summary plot
axes[5].axis('off')
summary_text = "💡 Key Observations:\n\n"
summary_text += "Degree 1 (Linear): Underfits\n"
summary_text += "Degree 2-3: Good balance\n"
summary_text += "Degree 8: Overfits (wiggly)\n\n"
summary_text += "True function is degree 3\n"
summary_text += "Best model: Degree 3 ✅"
axes[5].text(0.1, 0.5, summary_text, fontsize=12, verticalalignment='center',
bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))
plt.tight_layout()
plt.show()
# Compare models with cross-validation
from sklearn.model_selection import cross_val_score, KFold
print("\n📊 Model Comparison with Cross-Validation\n")
print(f"{'Degree':<10} {'Train R²':<12} {'CV R² (mean)':<15} {'CV R² (std)'}")
print("="*55)
for degree in degrees:
    poly_features = PolynomialFeatures(degree=degree)
    X_trans = poly_features.fit_transform(X_poly_2d)
    model = LinearRegression()
    # Train R²
    model.fit(X_trans, y_poly)
    train_r2 = model.score(X_trans, y_poly)
    # Cross-validation R² (shuffle the folds: X_poly is sorted, so contiguous folds would force extrapolation)
    cv = KFold(n_splits=5, shuffle=True, random_state=42)
    cv_scores = cross_val_score(model, X_trans, y_poly, cv=cv, scoring='r2')
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    print(f"{degree:<10} {train_r2:>10.4f} {cv_mean:>12.4f} {cv_std:>10.4f}")
print("\n💡 Notice:")
print(" • Training R² always increases with degree (can overfit)")
print(" • CV R² peaks around degree 3, then decreases (overfitting)")
print(" • Use CV to select model complexity")
3.8 Potential Problems¶
1. Non-linearity¶
Symptom: Pattern in residual plot
Solution: Transform predictors (log, sqrt, polynomial)
2. Correlation of Error Terms¶
Symptom: Adjacent residuals similar (time series)
Solution: Time series models, account for correlation structure
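Serial correlation in the residuals can be checked with the Durbin-Watson statistic (values near 2 suggest no autocorrelation, values well below 2 suggest positive autocorrelation). A sketch with simulated errors; the AR(1) process and its 0.8 carry-over coefficient are illustrative choices:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(3)
n = 300

# Independent errors vs AR(1) errors (each error carries over 80% of the previous one)
eps_iid = rng.normal(0, 1, n)
eps_ar = np.zeros(n)
for t in range(1, n):
    eps_ar[t] = 0.8 * eps_ar[t - 1] + rng.normal(0, 1)

print(durbin_watson(eps_iid))  # near 2: no autocorrelation
print(durbin_watson(eps_ar))   # well below 2: positive autocorrelation
```

For a fitted regression, the same statistic is applied to the model's residuals; `statsmodels` also reports it in the `OLS` summary shown earlier.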
3. Non-constant Variance (Heteroscedasticity)¶
Symptom: Funnel shape in residual plot
Solution: Transform response (log Y), weighted least squares
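The funnel pattern and the weighted-least-squares fix can be illustrated on simulated data (a sketch that assumes the error variance is known up to a factor of x²; in practice the variance structure must be estimated):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(1, 10, n)
# Error standard deviation grows with x -> funnel-shaped residual plot
y = 5 + 2 * x + rng.normal(0, 0.5 * x)

X = x.reshape(-1, 1)
ols = LinearRegression().fit(X, y)
res = y - ols.predict(X)

# Residual spread in the lower vs upper half of the x range
lo, hi = x < x.mean(), x >= x.mean()
print(res[lo].std(), res[hi].std())  # spread roughly doubles: heteroscedastic

# Weighted least squares: weight each observation by 1/variance (here 1/x^2)
wls = LinearRegression().fit(X, y, sample_weight=1 / x**2)
# Standardized residuals (residual / error sd) should have constant spread
std_res = (y - wls.predict(X)) / (0.5 * x)
print(std_res[lo].std(), std_res[hi].std())  # both close to 1
```

Downweighting the noisy high-x points gives the low-x points more influence, which is what ordinary least squares fails to do here.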
4. Outliers¶
Symptom: Unusual values with large residuals
Solution: Investigate, possibly remove or transform
5. High Leverage Points¶
Symptom: Unusual predictor values
Measure: Leverage statistic hᵢᵢ
Solution: Examine influence, consider removal
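The leverage statistic comes from the hat matrix H = X(XᵀX)⁻¹Xᵀ; its diagonal entries hᵢᵢ measure how unusual each observation's predictor values are. A minimal sketch, where the point added at x = 30 is an artificial high-leverage case:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.append(rng.uniform(0, 10, 50), 30.0)  # last point lies far outside the x range
X = np.column_stack([np.ones(len(x)), x])    # design matrix with intercept

# Hat matrix H = X (X^T X)^{-1} X^T; leverage is its diagonal
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print(leverage.mean())   # average leverage is always (p + 1) / n = 2/51
print(leverage[-1])      # far exceeds the average -> high-leverage point
```

A common rule of thumb flags points whose leverage greatly exceeds the average (p + 1)/n.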
6. Collinearity¶
Symptom: High correlations between predictors, unstable coefficients
Measure: Variance Inflation Factor (VIF > 5 or 10 problematic)
Solution: Remove predictors, Ridge regression, PCA
# Demonstrate collinearity problem
np.random.seed(42)
n_col = 100
# Create correlated predictors
X1_col = np.random.normal(0, 1, n_col)
X2_col = X1_col + np.random.normal(0, 0.1, n_col) # Highly correlated with X1
X3_col = np.random.normal(0, 1, n_col) # Independent
y_col = 2 + 3*X1_col + 4*X3_col + np.random.normal(0, 0.5, n_col)
df_col = pd.DataFrame({
'X1': X1_col,
'X2': X2_col,
'X3': X3_col,
'Y': y_col
})
print("📊 Collinearity Example\n")
print("Correlation Matrix:")
print(df_col.corr().round(3))
print("\nNotice: X1 and X2 are highly correlated (0.995)!")
# Calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor
import statsmodels.api as sm
# VIF is conventionally computed on a design matrix that includes the intercept
X_vif = sm.add_constant(df_col[['X1', 'X2', 'X3']].values)
vif_data = pd.DataFrame()
vif_data["Variable"] = ['X1', 'X2', 'X3']
vif_data["VIF"] = [variance_inflation_factor(X_vif, i + 1) for i in range(3)]
print("\n📊 Variance Inflation Factors (VIF):")
print(vif_data)
print("\nInterpretation:")
print(" VIF = 1: No correlation with other predictors")
print(" VIF = 5-10: Moderate collinearity")
print(" VIF > 10: High collinearity (problematic) ⚠️")
# Fit models with and without collinear predictor
model_with_col = LinearRegression().fit(df_col[['X1', 'X2', 'X3']], y_col)
model_no_col = LinearRegression().fit(df_col[['X1', 'X3']], y_col)
print(f"\n📈 Model Coefficients Comparison:")
print(f"\nWith X2 (collinear):")
print(f" X1: {model_with_col.coef_[0]:.4f} (True: 3.0)")
print(f" X2: {model_with_col.coef_[1]:.4f} (Should be 0)")
print(f" X3: {model_with_col.coef_[2]:.4f} (True: 4.0)")
print(f"\nWithout X2:")
print(f" X1: {model_no_col.coef_[0]:.4f} (True: 3.0) ✅")
print(f" X3: {model_no_col.coef_[1]:.4f} (True: 4.0) ✅")
print("\n💡 Effect of Collinearity:")
print(" • Coefficients become unstable and hard to interpret")
print(" • Standard errors increase")
print(" • Model predictions still OK, but inference is problematic")
print(" • Solution: Remove one of the correlated predictors")
Key Takeaways¶
1. Simple Linear Regression¶
Model: Y = β₀ + β₁X + ε
Estimated via OLS (minimize RSS)
Assess with R², RSE, F-statistic, t-tests
2. Multiple Linear Regression¶
Model: Y = β₀ + Σβⱼ Xⱼ + ε
Matrix form: β̂ = (XᵀX)⁻¹Xᵀy
F-test for overall significance
Individual t-tests for each predictor
3. Model Assessment¶
R² = 1 - RSS/TSS (proportion of variance explained)
RSE = √[RSS/(n-p-1)] (typical error magnitude)
F-statistic (test if any predictors useful)
4. Extensions¶
Qualitative predictors: Use dummy variables
Interactions: Model synergies between predictors
Polynomial terms: Capture non-linear relationships
5. Assumptions & Diagnostics¶
Linearity (residual plot)
Independence (autocorrelation)
Homoscedasticity (constant variance)
Normality (Q-Q plot)
6. Common Problems¶
Non-linearity → Transform predictors
Heteroscedasticity → Transform response
Outliers → Investigate and handle
Collinearity → VIF > 10 problematic, remove predictors
7. Best Practices¶
Always plot the data first
Check residual plots
Use cross-validation for model selection
Beware of extrapolation
Interpret coefficients carefully with interactions
Next Chapter¶
Chapter 4: Classification
Logistic Regression
Linear Discriminant Analysis
Quadratic Discriminant Analysis
K-Nearest Neighbors