Chapter 3: Linear Regression

Overview

Linear regression is a simple yet powerful approach for predicting a quantitative response Y based on one or more predictor variables X.

Key Topics:

  • Simple Linear Regression (one predictor)

  • Multiple Linear Regression (multiple predictors)

  • Assessing model accuracy (RSE, R², F-statistic)

  • Qualitative predictors (dummy variables)

  • Interaction terms

  • Non-linear transformations

  • Potential problems and diagnostics

The Linear Model

Simple Linear Regression

Y = β₀ + β₁X + ε

Multiple Linear Regression

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε

Goal: Estimate the coefficients β₀, β₁, …, βₚ that minimize the residual sum of squares (RSS)

# Imports used throughout this chapter
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

# Generate synthetic advertising data
np.random.seed(42)
n = 200

# Generate data: Sales = 5 + 0.05*TV_budget + noise
TV_budget = np.random.uniform(0, 300, n)
true_beta0 = 5
true_beta1 = 0.05
noise = np.random.normal(0, 1.5, n)
sales = true_beta0 + true_beta1 * TV_budget + noise

# Create DataFrame
data = pd.DataFrame({'TV': TV_budget, 'Sales': sales})

print("📊 Simulated Advertising Data")
print(f"True relationship: Sales = {true_beta0} + {true_beta1}*TV + ε")
print(f"\nData shape: {data.shape}")
print(f"\nFirst few rows:")
print(data.head())
print(f"\nDescriptive statistics:")
print(data.describe())
# Fit simple linear regression using Ordinary Least Squares (OLS)

X = data[['TV']].values
y = data['Sales'].values

# Fit model
model = LinearRegression()
model.fit(X, y)

beta0_hat = model.intercept_
beta1_hat = model.coef_[0]

print("📈 Simple Linear Regression Results")
print(f"\nEstimated coefficients:")
print(f"β₀ (Intercept): {beta0_hat:.4f}  (True: {true_beta0})")
print(f"β₁ (Slope):     {beta1_hat:.6f}  (True: {true_beta1})")

# Make predictions
y_pred = model.predict(X)

# Calculate metrics
mse = mean_squared_error(y, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y, y_pred)
rse = np.sqrt(np.sum((y - y_pred)**2) / (len(y) - 2))  # Residual Standard Error

print(f"\nModel Performance:")
print(f"MSE:  {mse:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"RSE:  {rse:.4f}")
print(f"R²:   {r2:.4f}")

# Visualization
plt.figure(figsize=(14, 5))

# Plot 1: Scatter with regression line
plt.subplot(1, 2, 1)
plt.scatter(X, y, alpha=0.5, label='Observed data')
plt.plot(X, y_pred, 'r-', linewidth=2, label='Fitted line')
plt.plot(TV_budget, true_beta0 + true_beta1 * TV_budget, 'g--', 
         linewidth=2, alpha=0.7, label='True relationship')
plt.xlabel('TV Advertising Budget ($1000s)')
plt.ylabel('Sales (1000 units)')
plt.title('Simple Linear Regression: Sales vs TV Budget')
plt.legend()
plt.grid(True, alpha=0.3)

# Plot 2: Residuals
plt.subplot(1, 2, 2)
residuals = y - y_pred
plt.scatter(y_pred, residuals, alpha=0.5)
plt.axhline(y=0, color='r', linestyle='--', linewidth=2)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

3.1 Statistical Inference for Coefficients

Hypothesis Testing

Test whether there’s a relationship between X and Y:

  • H₀: β₁ = 0 (no relationship)

  • H₁: β₁ ≠ 0 (relationship exists)

t-statistic

t = (β̂₁ - 0) / SE(β̂₁)

Under H₀, t follows a t-distribution with n-2 degrees of freedom

Standard Errors

  • SE(β̂₀): Standard error of intercept

  • SE(β̂₁): Standard error of slope

  • Smaller SE → more confident in estimate

Confidence Intervals

95% CI for β₁:

β̂₁ ± 1.96 × SE(β̂₁)  (for large n; the exact multiplier is the 97.5th percentile of the tₙ₋₂ distribution)
# Calculate confidence intervals manually
from scipy import stats as scipy_stats

n = len(X)
p = 1  # number of predictors (excluding intercept)

# Calculate standard errors
y_pred = model.predict(X)
residuals = y - y_pred
rse = np.sqrt(np.sum(residuals**2) / (n - p - 1))

# Standard error of slope
X_centered = X - X.mean()
se_beta1 = rse / np.sqrt(np.sum(X_centered**2))

# Standard error of intercept  
se_beta0 = rse * np.sqrt(1/n + X.mean()**2 / np.sum(X_centered**2))

# t-statistics
t_beta0 = beta0_hat / se_beta0
t_beta1 = beta1_hat / se_beta1

# p-values (two-tailed test)
p_value_beta0 = 2 * (1 - scipy_stats.t.cdf(abs(t_beta0), df=n-p-1))
p_value_beta1 = 2 * (1 - scipy_stats.t.cdf(abs(t_beta1), df=n-p-1))

# 95% confidence intervals
t_critical = scipy_stats.t.ppf(0.975, df=n-p-1)
ci_beta0 = (beta0_hat - t_critical * se_beta0, beta0_hat + t_critical * se_beta0)
ci_beta1 = (beta1_hat - t_critical * se_beta1, beta1_hat + t_critical * se_beta1)

print("📊 Statistical Inference for Coefficients\n")
print(f"{'Coefficient':<12} {'Estimate':<10} {'Std Error':<12} {'t-stat':<10} {'p-value':<12} {'95% CI'}")
print("="*80)
print(f"{'β₀ (Intercept)':<12} {beta0_hat:>8.4f} {se_beta0:>12.4f} {t_beta0:>10.3f} {p_value_beta0:>12.6f}   {ci_beta0}")
print(f"{'β₁ (TV)':<12} {beta1_hat:>8.6f} {se_beta1:>12.6f} {t_beta1:>10.3f} {p_value_beta1:>12.6f}   {ci_beta1}")

print(f"\n💡 Interpretation:")
print(f"   • β₁ = {beta1_hat:.6f}: For each $1000 increase in TV budget, sales increase by {beta1_hat*1000:.2f} units")
print(f"   • p-value < 0.001: Strong evidence of relationship (reject H₀)")
print(f"   • 95% CI for β₁: [{ci_beta1[0]:.6f}, {ci_beta1[1]:.6f}]")
print(f"   • True β₁ = {true_beta1} is within the confidence interval ✅")

3.2 Assessing Model Accuracy

Residual Standard Error (RSE)

RSE = √[RSS / (n - p - 1)]
  • Estimate of standard deviation of ε

  • Average amount response deviates from true regression line

  • Units: same as Y

R² Statistic (Coefficient of Determination)

R² = 1 - (RSS / TSS) = 1 - [Σ(yᵢ - ŷᵢ)² / Σ(yᵢ - ȳ)²]
  • Proportion of variance explained by the model

  • Range: 0 to 1

  • Higher is better

  • R² ≈ 1: Model explains most variance

  • R² ≈ 0: Model explains little variance

Note: In simple linear regression, R² = cor(X,Y)²

# Comprehensive model assessment
y_mean = y.mean()

# Sum of squares
TSS = np.sum((y - y_mean)**2)  # Total Sum of Squares
RSS = np.sum((y - y_pred)**2)   # Residual Sum of Squares
ESS = np.sum((y_pred - y_mean)**2)  # Explained Sum of Squares

# Metrics
RSE = np.sqrt(RSS / (n - 2))
R_squared = 1 - (RSS / TSS)
correlation = np.corrcoef(X.flatten(), y)[0, 1]

print("📊 Model Accuracy Metrics\n")
print(f"Total Sum of Squares (TSS):      {TSS:>10.2f}")
print(f"Residual Sum of Squares (RSS):   {RSS:>10.2f}")
print(f"Explained Sum of Squares (ESS):  {ESS:>10.2f}")
print(f"\nTSS = RSS + ESS?  {TSS:.2f} = {RSS + ESS:.2f} ✅")

print(f"\n{'Metric':<30} {'Value':<15} {'Interpretation'}")
print("="*70)
print(f"{'Residual Standard Error':<30} {RSE:>10.4f}     Average deviation from line")
print(f"{'R² (R-squared)':<30} {R_squared:>10.4f}     {R_squared*100:.1f}% variance explained")
print(f"{'Correlation(X, Y)':<30} {correlation:>10.4f}     Linear relationship strength")
print(f"{'R² = cor(X,Y)²?':<30} {R_squared:>10.4f} = {correlation**2:.4f} ✅")

# Visualize variance decomposition
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Total variance
axes[0].scatter(X, y, alpha=0.5)
axes[0].axhline(y=y_mean, color='r', linestyle='--', linewidth=2, label='Mean')
for i in range(0, n, 10):
    axes[0].plot([X[i], X[i]], [y[i], y_mean], 'k-', alpha=0.2, linewidth=0.5)
axes[0].set_xlabel('TV')
axes[0].set_ylabel('Sales')
axes[0].set_title(f'Total Variance (TSS = {TSS:.2f})')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Explained variance
axes[1].scatter(X, y, alpha=0.3, label='Data')
axes[1].plot(X, y_pred, 'r-', linewidth=2, label='Fitted line')
axes[1].axhline(y=y_mean, color='g', linestyle='--', linewidth=2, label='Mean')
for i in range(0, n, 10):
    axes[1].plot([X[i], X[i]], [y_pred[i], y_mean], 'orange', alpha=0.4, linewidth=1.5)
axes[1].set_xlabel('TV')
axes[1].set_ylabel('Sales')
axes[1].set_title(f'Explained Variance (ESS = {ESS:.2f})')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

# Residual variance
axes[2].scatter(X, y, alpha=0.3, label='Data')
axes[2].plot(X, y_pred, 'r-', linewidth=2, label='Fitted line')
for i in range(0, n, 10):
    axes[2].plot([X[i], X[i]], [y[i], y_pred[i]], 'b-', alpha=0.2, linewidth=0.5)
axes[2].set_xlabel('TV')
axes[2].set_ylabel('Sales')
axes[2].set_title(f'Residual Variance (RSS = {RSS:.2f})')
axes[2].legend()
axes[2].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

3.3 Multiple Linear Regression

The Model

Y = β₀ + β₁X₁ + β₂X₂ + ... + βₚXₚ + ε

Matrix Form

Y = Xβ + ε

where:
Y = (n × 1) response vector
X = (n × (p+1)) design matrix
β = ((p+1) × 1) coefficient vector
ε = (n × 1) error vector

OLS Solution

β̂ = (XᵀX)⁻¹Xᵀy
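The closed-form solution can be checked numerically; a minimal NumPy sketch on simulated data (the variable names and coefficients here are illustrative), using np.linalg.lstsq rather than an explicit matrix inverse for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
X_raw = rng.uniform(0, 10, size=(n, 2))                # two predictors
beta_true = np.array([1.0, 2.0, -0.5])                 # intercept, β₁, β₂
y = beta_true[0] + X_raw @ beta_true[1:] + rng.normal(0, 0.1, n)

# Design matrix: prepend a column of ones for the intercept
X_design = np.column_stack([np.ones(n), X_raw])

# OLS solution β̂ = (XᵀX)⁻¹Xᵀy; lstsq solves the same least-squares problem
beta_hat, *_ = np.linalg.lstsq(X_design, y, rcond=None)

# Explicit normal equations give the same answer
beta_normal = np.linalg.solve(X_design.T @ X_design, X_design.T @ y)
print(beta_hat.round(3))   # close to [1.0, 2.0, -0.5]
```

Solving the normal equations directly works for small, well-conditioned problems, but lstsq (or a QR decomposition) is preferred in practice.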

F-Statistic

Test whether at least one predictor is useful:

  • H₀: β₁ = β₂ = … = βₚ = 0

  • H₁: At least one βⱼ ≠ 0

F = [(TSS - RSS)/p] / [RSS/(n - p - 1)]
# Generate multiple regression data
# Sales = 5 + 0.05*TV + 0.1*Radio + 0.01*Newspaper + noise
np.random.seed(42)
n = 200

TV = np.random.uniform(0, 300, n)
Radio = np.random.uniform(0, 50, n)
Newspaper = np.random.uniform(0, 100, n)

true_beta = np.array([5, 0.05, 0.1, 0.01])
noise = np.random.normal(0, 1.2, n)

Sales = (true_beta[0] + 
         true_beta[1] * TV + 
         true_beta[2] * Radio + 
         true_beta[3] * Newspaper + 
         noise)

# Create DataFrame
df = pd.DataFrame({
    'TV': TV,
    'Radio': Radio,
    'Newspaper': Newspaper,
    'Sales': Sales
})

print("📊 Multiple Regression Dataset")
print(f"\nTrue model: Sales = {true_beta[0]} + {true_beta[1]}*TV + {true_beta[2]}*Radio + {true_beta[3]}*Newspaper + ε")
print(f"\nDataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())

# Correlation matrix
print(f"\n📈 Correlation Matrix:")
print(df.corr().round(3))

# Visualize pairwise relationships
sns.pairplot(df, diag_kind='kde', plot_kws={'alpha': 0.5})
plt.suptitle('Pairwise Relationships', y=1.01)
plt.show()
# Fit multiple regression model
X_multi = df[['TV', 'Radio', 'Newspaper']].values
y_multi = df['Sales'].values

# Fit model
model_multi = LinearRegression()
model_multi.fit(X_multi, y_multi)

# Coefficients
beta_hat = np.concatenate([[model_multi.intercept_], model_multi.coef_])

print("📈 Multiple Linear Regression Results\n")
print(f"{'Variable':<12} {'True β':<12} {'Estimated β̂':<15} {'Difference'}")
print("="*60)
print(f"{'Intercept':<12} {true_beta[0]:>10.4f}   {beta_hat[0]:>12.4f}   {abs(beta_hat[0]-true_beta[0]):>10.4f}")
for i, var in enumerate(['TV', 'Radio', 'Newspaper']):
    print(f"{var:<12} {true_beta[i+1]:>10.4f}   {beta_hat[i+1]:>12.4f}   {abs(beta_hat[i+1]-true_beta[i+1]):>10.4f}")

# Predictions
y_pred_multi = model_multi.predict(X_multi)

# Model metrics
rss_multi = np.sum((y_multi - y_pred_multi)**2)
tss_multi = np.sum((y_multi - y_multi.mean())**2)
r2_multi = 1 - (rss_multi / tss_multi)
rse_multi = np.sqrt(rss_multi / (n - len(beta_hat)))

print(f"\n📊 Model Performance:")
print(f"R²:  {r2_multi:.4f} ({r2_multi*100:.1f}% variance explained)")
print(f"RSE: {rse_multi:.4f}")

# F-statistic
p = len(beta_hat) - 1  # number of predictors
F_stat = ((tss_multi - rss_multi) / p) / (rss_multi / (n - p - 1))
from scipy.stats import f as f_dist
p_value_F = 1 - f_dist.cdf(F_stat, p, n - p - 1)

print(f"\n📊 F-Statistic:")
print(f"F = {F_stat:.2f}")
print(f"p-value = {p_value_F:.2e} (strong evidence that at least one predictor is useful) ✅")
# Calculate individual predictor significance using statsmodels for detailed output
import statsmodels.api as sm

# Add constant for intercept
X_with_const = sm.add_constant(X_multi)

# Fit model
model_sm = sm.OLS(y_multi, X_with_const).fit()

print("📊 Detailed Regression Summary\n")
print(model_sm.summary())

# Custom visualization of coefficients
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Coefficient values
axes[0].bar(['Intercept', 'TV', 'Radio', 'Newspaper'], beta_hat, 
           color=['gray', 'blue', 'green', 'orange'], alpha=0.7)
axes[0].axhline(y=0, color='k', linestyle='-', linewidth=0.5)
axes[0].set_ylabel('Coefficient Value')
axes[0].set_title('Estimated Coefficients')
axes[0].grid(True, alpha=0.3, axis='y')

# Coefficient comparison with true values
x_pos = np.arange(len(beta_hat))
width = 0.35
axes[1].bar(x_pos - width/2, true_beta, width, label='True β', alpha=0.7)
axes[1].bar(x_pos + width/2, beta_hat, width, label='Estimated β̂', alpha=0.7)
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(['β₀', 'β₁ (TV)', 'β₂ (Radio)', 'β₃ (News)'])
axes[1].set_ylabel('Coefficient Value')
axes[1].set_title('True vs Estimated Coefficients')
axes[1].legend()
axes[1].grid(True, alpha=0.3, axis='y')

plt.tight_layout()
plt.show()

3.4 Model Diagnostics

Key Assumptions of Linear Regression

  1. Linearity: Relationship between X and Y is linear

  2. Independence: Observations are independent

  3. Homoscedasticity: Constant variance of errors

  4. Normality: Errors are normally distributed

Diagnostic Plots

  1. Residuals vs Fitted: Check linearity and homoscedasticity

  2. Q-Q Plot: Check normality of residuals

  3. Scale-Location: Check homoscedasticity

  4. Residuals vs Leverage: Identify influential points

# Comprehensive diagnostic plots
residuals_multi = y_multi - y_pred_multi
standardized_residuals = residuals_multi / np.std(residuals_multi)

fig, axes = plt.subplots(2, 2, figsize=(14, 10))

# 1. Residuals vs Fitted
axes[0, 0].scatter(y_pred_multi, residuals_multi, alpha=0.5)
axes[0, 0].axhline(y=0, color='r', linestyle='--', linewidth=2)
axes[0, 0].set_xlabel('Fitted Values')
axes[0, 0].set_ylabel('Residuals')
axes[0, 0].set_title('Residuals vs Fitted\n(Should show random pattern around 0)')
axes[0, 0].grid(True, alpha=0.3)

# Add a smoothed trend line (Savitzky–Golay filter)
from scipy.signal import savgol_filter
sorted_idx = np.argsort(y_pred_multi)
smoothed = savgol_filter(residuals_multi[sorted_idx], window_length=51, polyorder=3)
axes[0, 0].plot(y_pred_multi[sorted_idx], smoothed, 'b-', linewidth=2, alpha=0.7)

# 2. Q-Q Plot (Normal Probability Plot)
scipy_stats.probplot(residuals_multi, dist="norm", plot=axes[0, 1])
axes[0, 1].set_title('Normal Q-Q Plot\n(Points should follow diagonal line)')
axes[0, 1].grid(True, alpha=0.3)

# 3. Scale-Location (Spread-Location)
sqrt_abs_std_resid = np.sqrt(np.abs(standardized_residuals))
axes[1, 0].scatter(y_pred_multi, sqrt_abs_std_resid, alpha=0.5)
axes[1, 0].set_xlabel('Fitted Values')
axes[1, 0].set_ylabel('√|Standardized Residuals|')
axes[1, 0].set_title('Scale-Location Plot\n(Check homoscedasticity)')
axes[1, 0].grid(True, alpha=0.3)

# Add trend line
sorted_idx2 = np.argsort(y_pred_multi)
smoothed2 = savgol_filter(sqrt_abs_std_resid[sorted_idx2], window_length=51, polyorder=3)
axes[1, 0].plot(y_pred_multi[sorted_idx2], smoothed2, 'r-', linewidth=2, alpha=0.7)

# 4. Histogram of residuals
axes[1, 1].hist(residuals_multi, bins=30, density=True, alpha=0.7, edgecolor='black')
mu, sigma = residuals_multi.mean(), residuals_multi.std()
x = np.linspace(residuals_multi.min(), residuals_multi.max(), 100)
axes[1, 1].plot(x, scipy_stats.norm.pdf(x, mu, sigma), 'r-', linewidth=2, 
               label=f'Normal(μ={mu:.2f}, σ={sigma:.2f})')
axes[1, 1].set_xlabel('Residuals')
axes[1, 1].set_ylabel('Density')
axes[1, 1].set_title('Distribution of Residuals\n(Should be approximately normal)')
axes[1, 1].legend()
axes[1, 1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("✅ Diagnostic Interpretation:")
print("   1. Residuals vs Fitted: Random scatter ✅ (no pattern = good)")
print("   2. Q-Q Plot: Points follow line ✅ (normality assumption met)")
print("   3. Scale-Location: Horizontal band ✅ (constant variance)")
print("   4. Histogram: Bell-shaped ✅ (normally distributed residuals)")

3.5 Qualitative Predictors (Categorical Variables)

Dummy Variables

For a categorical variable with k levels, create k-1 dummy variables

Example: Region (East, West, South)

X_West = 1 if West, 0 otherwise
X_South = 1 if South, 0 otherwise
(East is the baseline/reference level)

Interpretation

Y = β₀ + β₁X_West + β₂X_South + ε
  • β₀: Average Y for baseline (East)

  • β₁: Difference between West and East

  • β₂: Difference between South and East

# Generate data with categorical predictor
np.random.seed(42)
n_cat = 300

# Region: East, West, South
regions = np.random.choice(['East', 'West', 'South'], n_cat)
budget = np.random.uniform(50, 250, n_cat)

# Different base sales for each region
base_sales = {'East': 10, 'West': 15, 'South': 12}
sales_cat = np.array([base_sales[r] + 0.05*budget[i] + np.random.normal(0, 1) 
                      for i, r in enumerate(regions)])

df_cat = pd.DataFrame({
    'Region': regions,
    'Budget': budget,
    'Sales': sales_cat
})

print("📊 Categorical Predictor Example")
print(f"\nDataset shape: {df_cat.shape}")
print(f"\nFirst few rows:")
print(df_cat.head(10))

# Create dummy variables
df_cat_encoded = pd.get_dummies(df_cat, columns=['Region'], drop_first=True)
print(f"\n📋 After creating dummy variables:")
print(df_cat_encoded.head())

# Fit model with categorical predictor
X_cat = df_cat_encoded[['Budget', 'Region_South', 'Region_West']].values
y_cat = df_cat_encoded['Sales'].values

model_cat = LinearRegression()
model_cat.fit(X_cat, y_cat)

print(f"\n📈 Regression with Categorical Predictor\n")
print(f"{'Variable':<15} {'Coefficient':<15} {'Interpretation'}")
print("="*70)
print(f"{'Intercept':<15} {model_cat.intercept_:>12.4f}   Baseline: East region + Budget=0")
print(f"{'Budget':<15} {model_cat.coef_[0]:>12.4f}   Effect of $1 budget increase")
print(f"{'Region_South':<15} {model_cat.coef_[1]:>12.4f}   South vs East difference")
print(f"{'Region_West':<15} {model_cat.coef_[2]:>12.4f}   West vs East difference")

# Visualize
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1)
for region in ['East', 'West', 'South']:
    mask = df_cat['Region'] == region
    plt.scatter(df_cat[mask]['Budget'], df_cat[mask]['Sales'], 
               alpha=0.5, label=region, s=50)
plt.xlabel('Budget')
plt.ylabel('Sales')
plt.title('Sales vs Budget by Region')
plt.legend()
plt.grid(True, alpha=0.3)

plt.subplot(1, 2, 2)
# Box plot
df_cat.boxplot(column='Sales', by='Region', ax=plt.gca())
plt.suptitle('')
plt.title('Sales Distribution by Region')
plt.ylabel('Sales')
plt.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

3.6 Interaction Terms

The Concept

The standard additive model assumes the effect of X₁ on Y does not depend on the value of X₂.

With interaction:

Y = β₀ + β₁X₁ + β₂X₂ + β₃(X₁ × X₂) + ε

Interpretation

  • β₁: Effect of X₁ when X₂ = 0

  • β₂: Effect of X₂ when X₁ = 0

  • β₃: How effect of X₁ changes with X₂ (synergy)

Example: TV × Radio interaction

  • Advertising on TV and Radio together may be more effective than sum of individual effects

# Generate data with interaction effect
np.random.seed(42)
n_int = 200

TV_int = np.random.uniform(0, 300, n_int)
Radio_int = np.random.uniform(0, 50, n_int)

# Model WITH interaction: Sales = 5 + 0.02*TV + 0.05*Radio + 0.001*(TV*Radio)
sales_int = (5 + 
             0.02 * TV_int + 
             0.05 * Radio_int + 
             0.001 * (TV_int * Radio_int) +  # Interaction!
             np.random.normal(0, 0.8, n_int))

df_int = pd.DataFrame({
    'TV': TV_int,
    'Radio': Radio_int,
    'Sales': sales_int
})

# Model 1: Without interaction
X_no_int = df_int[['TV', 'Radio']].values
model_no_int = LinearRegression().fit(X_no_int, sales_int)
r2_no_int = model_no_int.score(X_no_int, sales_int)

# Model 2: With interaction
df_int['TV_Radio'] = df_int['TV'] * df_int['Radio']
X_with_int = df_int[['TV', 'Radio', 'TV_Radio']].values
model_with_int = LinearRegression().fit(X_with_int, sales_int)
r2_with_int = model_with_int.score(X_with_int, sales_int)

print("📊 Model Comparison: With vs Without Interaction\n")
print(f"{'Model':<25} {'R²':<10} {'Improvement'}")
print("="*50)
print(f"{'Without interaction':<25} {r2_no_int:>6.4f}")
print(f"{'With interaction':<25} {r2_with_int:>6.4f}   +{(r2_with_int-r2_no_int)*100:.2f}%")

print(f"\n📈 Interaction Model Coefficients:")
print(f"Intercept:     {model_with_int.intercept_:.4f}")
print(f"TV:            {model_with_int.coef_[0]:.6f}")
print(f"Radio:         {model_with_int.coef_[1]:.6f}")
print(f"TV × Radio:    {model_with_int.coef_[2]:.6f}  ← Interaction term")

# Visualize interaction effect
fig = plt.figure(figsize=(15, 5))

# 3D scatter
from mpl_toolkits.mplot3d import Axes3D
ax1 = fig.add_subplot(131, projection='3d')
ax1.scatter(df_int['TV'], df_int['Radio'], sales_int, alpha=0.5)
ax1.set_xlabel('TV Budget')
ax1.set_ylabel('Radio Budget')
ax1.set_zlabel('Sales')
ax1.set_title('Sales vs TV & Radio (3D View)')

# Effect of TV at different Radio levels
ax2 = fig.add_subplot(132)
radio_levels = [0, 25, 50]
tv_range = np.linspace(0, 300, 100)
for radio in radio_levels:
    sales_pred = (model_with_int.intercept_ + 
                  model_with_int.coef_[0] * tv_range +
                  model_with_int.coef_[1] * radio +
                  model_with_int.coef_[2] * tv_range * radio)
    ax2.plot(tv_range, sales_pred, linewidth=2, label=f'Radio = ${radio}k')
ax2.set_xlabel('TV Budget')
ax2.set_ylabel('Predicted Sales')
ax2.set_title('Effect of TV at Different Radio Levels\n(Interaction causes different slopes)')
ax2.legend()
ax2.grid(True, alpha=0.3)

# Interaction visualization
ax3 = fig.add_subplot(133)
TV_grid, Radio_grid = np.meshgrid(np.linspace(0, 300, 20), np.linspace(0, 50, 20))
interaction_effect = model_with_int.coef_[2] * TV_grid * Radio_grid
contour = ax3.contourf(TV_grid, Radio_grid, interaction_effect, levels=15, cmap='viridis')
ax3.set_xlabel('TV Budget')
ax3.set_ylabel('Radio Budget')
ax3.set_title('Interaction Effect Magnitude\n(Synergy between TV and Radio)')
plt.colorbar(contour, ax=ax3, label='Interaction Contribution to Sales')

plt.tight_layout()
plt.show()

print("\n💡 Interpretation:")
print("   • Interaction coefficient is positive → synergy effect")
print("   • Effect of TV increases as Radio budget increases")
print("   • Combined advertising is more effective than sum of parts")

3.7 Non-Linear Transformations

Polynomial Regression

Model non-linear relationships while staying in linear regression framework:

Y = β₀ + β₁X + β₂X² + β₃X³ + ... + ε

Still “linear” because it’s linear in the coefficients β

When to use:

  • Scatter plot shows curved relationship

  • Residual plot shows pattern (not random)

  • Domain knowledge suggests non-linearity

Caution:

  • High-degree polynomials can overfit

  • Extrapolation beyond data range is dangerous

# Generate non-linear data
np.random.seed(42)
X_poly = np.linspace(-3, 3, 100)
y_poly = 2 + 3*X_poly - 2*X_poly**2 + 0.5*X_poly**3 + np.random.normal(0, 2, 100)

X_poly_2d = X_poly.reshape(-1, 1)

# Fit polynomials of different degrees
degrees = [1, 2, 3, 4, 8]
models_poly = {}

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
axes = axes.ravel()

for idx, degree in enumerate(degrees):
    poly_features = PolynomialFeatures(degree=degree)
    X_poly_transformed = poly_features.fit_transform(X_poly_2d)
    
    model = LinearRegression()
    model.fit(X_poly_transformed, y_poly)
    models_poly[degree] = (poly_features, model)
    
    # Predictions
    X_plot = np.linspace(-3, 3, 200).reshape(-1, 1)
    X_plot_transformed = poly_features.transform(X_plot)
    y_plot = model.predict(X_plot_transformed)
    
    # Metrics
    y_pred = model.predict(X_poly_transformed)
    r2 = r2_score(y_poly, y_pred)
    rmse = np.sqrt(mean_squared_error(y_poly, y_pred))
    
    # Plot
    axes[idx].scatter(X_poly, y_poly, alpha=0.5, s=30, label='Data')
    axes[idx].plot(X_plot, y_plot, 'r-', linewidth=2, label=f'Degree {degree} fit')
    axes[idx].set_xlabel('X')
    axes[idx].set_ylabel('Y')
    axes[idx].set_title(f'Polynomial Degree {degree}\nR² = {r2:.4f}, RMSE = {rmse:.2f}')
    axes[idx].legend()
    axes[idx].grid(True, alpha=0.3)
    axes[idx].set_ylim(-20, 30)

# Summary plot
axes[5].axis('off')
summary_text = "💡 Key Observations:\n\n"
summary_text += "Degree 1 (Linear): Underfits\n"
summary_text += "Degree 2-3: Good balance\n"
summary_text += "Degree 8: Overfits (wiggly)\n\n"
summary_text += "True function is degree 3\n"
summary_text += "Best model: Degree 3 ✅"
axes[5].text(0.1, 0.5, summary_text, fontsize=12, verticalalignment='center',
            bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.8))

plt.tight_layout()
plt.show()

# Compare models with cross-validation
from sklearn.model_selection import cross_val_score

print("\n📊 Model Comparison with Cross-Validation\n")
print(f"{'Degree':<10} {'Train R²':<12} {'CV R² (mean)':<15} {'CV R² (std)'}")
print("="*55)

for degree in degrees:
    poly_features = PolynomialFeatures(degree=degree)
    X_trans = poly_features.fit_transform(X_poly_2d)
    model = LinearRegression()
    
    # Train R²
    model.fit(X_trans, y_poly)
    train_r2 = model.score(X_trans, y_poly)
    
    # Cross-validation R²
    cv_scores = cross_val_score(model, X_trans, y_poly, cv=5, scoring='r2')
    cv_mean = cv_scores.mean()
    cv_std = cv_scores.std()
    
    print(f"{degree:<10} {train_r2:>10.4f}   {cv_mean:>12.4f}      {cv_std:>10.4f}")

print("\n💡 Notice:")
print("   • Training R² always increases with degree (can overfit)")
print("   • CV R² peaks around degree 3, then decreases (overfitting)")
print("   • Use CV to select model complexity")

3.8 Potential Problems

1. Non-linearity

Symptom: Pattern in residual plot
Solution: Transform predictors (log, sqrt, polynomial)

2. Correlation of Error Terms

Symptom: Adjacent residuals similar (time series)
Solution: Time series models, account for correlation structure
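One quick screen for first-order autocorrelation is the Durbin–Watson statistic (≈ 2 means no autocorrelation; values near 0 indicate positive autocorrelation). A sketch on simulated residual sequences:

```python
import numpy as np
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(1)

# Independent residuals: DW should land near 2
indep = rng.normal(0, 1, 500)

# AR(1) residuals: positive autocorrelation pushes DW toward 0
ar1 = np.zeros(500)
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal(0, 1)

print(f"Independent residuals: DW = {durbin_watson(indep):.2f}")  # near 2
print(f"AR(1) residuals:       DW = {durbin_watson(ar1):.2f}")    # well below 2
```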

3. Non-constant Variance (Heteroscedasticity)

Symptom: Funnel shape in residual plot
Solution: Transform response (log Y), weighted least squares
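A log transform of the response is often the first fix to try. A minimal sketch with simulated multiplicative-noise data (where the log transform is exactly right), comparing residual spread on the raw and log scales:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 300).reshape(-1, 1)

# Multiplicative noise: error scale grows with the mean -> funnel-shaped residuals
y_het = np.exp(1 + 0.3 * x.ravel() + rng.normal(0, 0.3, 300))

# Residuals from a raw-scale fit vs. a log-scale fit
resid_raw = y_het - LinearRegression().fit(x, y_het).predict(x)
resid_log = np.log(y_het) - LinearRegression().fit(x, np.log(y_het)).predict(x)

# Residual spread in the lower vs. upper half of the x range
lo, hi = x.ravel() < 5.5, x.ravel() >= 5.5
print(f"raw scale: sd(lo) = {resid_raw[lo].std():.2f}, sd(hi) = {resid_raw[hi].std():.2f}")
print(f"log scale: sd(lo) = {resid_log[lo].std():.2f}, sd(hi) = {resid_log[hi].std():.2f}")
```

On the raw scale the spread grows with x; on the log scale it is roughly constant.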

4. Outliers

Symptom: Unusual values with large residuals
Solution: Investigate, possibly remove or transform
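A quick numerical screen: standardize the residuals and flag |rᵢ| > 3 (a common rule of thumb; the leverage-adjusted studentized version is more rigorous). A minimal sketch with one injected outlier:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(4)
n = 100
x = rng.uniform(0, 10, n).reshape(-1, 1)
y = 2 + 0.5 * x.ravel() + rng.normal(0, 1, n)
y[10] += 8.0                                   # inject one large outlier

resid = y - LinearRegression().fit(x, y).predict(x)

# Internally standardized residuals; |r| > 3 is a common flag
std_resid = resid / resid.std()
flagged = np.where(np.abs(std_resid) > 3)[0]
print(f"Flagged points: {flagged}")            # index 10 should be flagged
```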

5. High Leverage Points

Symptom: Unusual predictor values
Measure: Leverage statistic hᵢᵢ
Solution: Examine influence, consider removal
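The leverage statistics hᵢᵢ are the diagonal of the hat matrix H = X(XᵀX)⁻¹Xᵀ; a minimal sketch flagging high-leverage points with the usual 2(p+1)/n rule of thumb (data simulated for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50
x = rng.normal(0, 1, n)
x[0] = 8.0                                     # one unusual predictor value

X = np.column_stack([np.ones(n), x])           # design matrix with intercept
H = X @ np.linalg.inv(X.T @ X) @ X.T           # hat matrix
leverage = np.diag(H)                          # h_ii

threshold = 2 * (1 + 1) / n                    # rule of thumb: 2(p+1)/n with p = 1
flagged = np.where(leverage > threshold)[0]
print(f"High-leverage points: {flagged}")      # index 0 should be flagged
print(f"Sum of leverages = {leverage.sum():.1f} (always equals p + 1)")
```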

6. Collinearity

Symptom: High correlations between predictors, unstable coefficients
Measure: Variance Inflation Factor (VIF > 5 or 10 problematic)
Solution: Remove predictors, Ridge regression, PCA

# Demonstrate collinearity problem
np.random.seed(42)
n_col = 100

# Create correlated predictors
X1_col = np.random.normal(0, 1, n_col)
X2_col = X1_col + np.random.normal(0, 0.1, n_col)  # Highly correlated with X1
X3_col = np.random.normal(0, 1, n_col)  # Independent

y_col = 2 + 3*X1_col + 4*X3_col + np.random.normal(0, 0.5, n_col)

df_col = pd.DataFrame({
    'X1': X1_col,
    'X2': X2_col,
    'X3': X3_col,
    'Y': y_col
})

print("📊 Collinearity Example\n")
print("Correlation Matrix:")
print(df_col.corr().round(3))
print("\nNotice: X1 and X2 are highly correlated (0.995)!")

# Calculate VIF
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Include a constant so each VIF is computed from a regression with an intercept
X_vif = sm.add_constant(df_col[['X1', 'X2', 'X3']]).values
vif_data = pd.DataFrame()
vif_data["Variable"] = ['X1', 'X2', 'X3']
vif_data["VIF"] = [variance_inflation_factor(X_vif, i + 1) for i in range(3)]

print("\n📊 Variance Inflation Factors (VIF):")
print(vif_data)
print("\nInterpretation:")
print("  VIF = 1: No correlation with other predictors")
print("  VIF = 5-10: Moderate collinearity")
print("  VIF > 10: High collinearity (problematic) ⚠️")

# Fit models with and without collinear predictor
model_with_col = LinearRegression().fit(df_col[['X1', 'X2', 'X3']], y_col)
model_no_col = LinearRegression().fit(df_col[['X1', 'X3']], y_col)

print(f"\n📈 Model Coefficients Comparison:")
print(f"\nWith X2 (collinear):")
print(f"  X1: {model_with_col.coef_[0]:.4f}  (True: 3.0)")
print(f"  X2: {model_with_col.coef_[1]:.4f}  (Should be 0)")
print(f"  X3: {model_with_col.coef_[2]:.4f}  (True: 4.0)")

print(f"\nWithout X2:")
print(f"  X1: {model_no_col.coef_[0]:.4f}  (True: 3.0) ✅")
print(f"  X3: {model_no_col.coef_[1]:.4f}  (True: 4.0) ✅")

print("\n💡 Effect of Collinearity:")
print("   • Coefficients become unstable and hard to interpret")
print("   • Standard errors increase")
print("   • Model predictions still OK, but inference is problematic")
print("   • Solution: Remove one of the correlated predictors")

Key Takeaways

1. Simple Linear Regression

  • Model: Y = β₀ + β₁X + ε

  • Estimated via OLS (minimize RSS)

  • Assess with R², RSE, F-statistic, t-tests

2. Multiple Linear Regression

  • Model: Y = β₀ + Σβⱼ Xⱼ + ε

  • Matrix form: β̂ = (XᵀX)⁻¹Xᵀy

  • F-test for overall significance

  • Individual t-tests for each predictor

3. Model Assessment

R² = 1 - RSS/TSS (proportion of variance explained)
RSE = √[RSS/(n-p-1)] (typical error magnitude)
F-statistic (test if any predictors useful)

4. Extensions

  • Qualitative predictors: Use dummy variables

  • Interactions: Model synergies between predictors

  • Polynomial terms: Capture non-linear relationships

5. Assumptions & Diagnostics

  1. Linearity (residual plot)

  2. Independence (autocorrelation)

  3. Homoscedasticity (constant variance)

  4. Normality (Q-Q plot)

6. Common Problems

  • Non-linearity → Transform predictors

  • Heteroscedasticity → Transform response

  • Outliers → Investigate and handle

  • Collinearity → VIF > 10 problematic, remove predictors

7. Best Practices

  • Always plot the data first

  • Check residual plots

  • Use cross-validation for model selection

  • Beware of extrapolation

  • Interpret coefficients carefully with interactions

Next Chapter

Chapter 4: Classification

  • Logistic Regression

  • Linear Discriminant Analysis

  • Quadratic Discriminant Analysis

  • K-Nearest Neighbors