# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split, learning_curve
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import mean_squared_error, accuracy_score
import warnings
warnings.filterwarnings('ignore')

# Reproducibility: fix NumPy's global RNG seed so every run draws the same data
np.random.seed(42)

# Plot appearance: matplotlib's default style with seaborn's husl palette
plt.style.use('default')
sns.set_palette('husl')

print("βœ… Libraries imported successfully!")

Learning Curves: Diagnosing Bias vs. Variance

Learning curves plot model performance (training and validation scores) as a function of training set size, providing the single most informative diagnostic for model debugging. High variance (overfitting) manifests as a large gap between training score (near 1.0) and validation score – the model memorizes training examples but fails to generalize. High bias (underfitting) shows both curves converging to a low value, meaning even with unlimited data, the model lacks the capacity to capture the underlying pattern.

The critical diagnostic question is whether more data will help. If the validation curve is still rising at the right edge, collecting more training data will improve performance. If both curves have plateaued with a large gap, regularization or a simpler architecture is needed. If both plateaued at a low level, you need a more expressive model or better features. scikit-learn’s learning_curve() function automates this analysis with cross-validated scoring at each training set size.

def plot_learning_curves(estimator, X, y, title="Learning Curves",
                         cv=5, scoring='accuracy'):
    """
    Plot cross-validated learning curves and print a bias/variance diagnosis.

    Parameters
    ----------
    estimator : sklearn estimator
        Unfitted model implementing fit/predict.
    X, y : array-like
        Feature matrix and target vector.
    title : str, optional
        Figure title.
    cv : int, optional
        Number of cross-validation folds (default 5, the original
        hard-coded value).
    scoring : str, optional
        Scoring metric name passed to learning_curve (default 'accuracy',
        the original hard-coded value).

    Returns
    -------
    dict
        Keys 'train_sizes', 'train_mean', 'val_mean', 'gap' (final
        train-minus-validation score) for programmatic inspection.
        Existing callers that ignore the return value are unaffected.
    """
    train_sizes, train_scores, val_scores = learning_curve(
        estimator, X, y, cv=cv,
        train_sizes=np.linspace(0.1, 1.0, 10),
        scoring=scoring,
        n_jobs=-1
    )

    # Mean +/- std across CV folds at each training-set size
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    val_mean = np.mean(val_scores, axis=1)
    val_std = np.std(val_scores, axis=1)

    fig, ax = plt.subplots(figsize=(10, 6))

    # Training score with a +/- 1 std band
    ax.plot(train_sizes, train_mean, 'o-', color='blue', label='Training score')
    ax.fill_between(train_sizes, train_mean - train_std, train_mean + train_std,
                    alpha=0.2, color='blue')

    # Validation score with a +/- 1 std band
    ax.plot(train_sizes, val_mean, 's-', color='green', label='Validation score')
    ax.fill_between(train_sizes, val_mean - val_std, val_mean + val_std,
                    alpha=0.2, color='green')

    ax.set_xlabel('Training Set Size', fontsize=12)
    # Label the y-axis after the metric actually plotted ('Accuracy' by default)
    ax.set_ylabel(scoring.capitalize(), fontsize=12)
    ax.set_title(title, fontsize=14, fontweight='bold')
    ax.legend(loc='best', fontsize=11)
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()

    # Heuristic diagnosis from the right edge of the curves: a large
    # train/validation gap signals variance; a low validation score with a
    # small gap signals bias.
    final_gap = train_mean[-1] - val_mean[-1]
    print(f"\nDiagnosis:")
    print(f"  Final training score: {train_mean[-1]:.3f}")
    print(f"  Final validation score: {val_mean[-1]:.3f}")
    print(f"  Gap: {final_gap:.3f}")

    if final_gap > 0.1:
        print("  ⚠️ High variance (overfitting) - Try regularization or more data")
    elif val_mean[-1] < 0.7:
        print("  ⚠️ High bias (underfitting) - Try more complex model")
    else:
        print("  βœ… Good fit!")

    return {'train_sizes': train_sizes, 'train_mean': train_mean,
            'val_mean': val_mean, 'gap': final_gap}

# Build the synthetic binary-classification dataset used by every example
# below; fixed seed keeps runs reproducible.
n_samples, n_features = 1000, 20
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_informative=15,
    n_redundant=5,
    random_state=42,
)

print("Dataset created: 1000 samples, 20 features")
# Example 1: Well-fitted model — bounded depth keeps variance in check.
# (RandomForestClassifier is already imported at the top of the file;
# the redundant re-import was removed.)
model_good = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42)
plot_learning_curves(model_good, X, y, "Well-Fitted Model")

# Example 2: Overfitted model — unbounded depth lets each tree grow until
# leaves memorize individual training points.
model_overfit = RandomForestClassifier(n_estimators=100, max_depth=None,
                                       min_samples_split=2, random_state=42)
plot_learning_curves(model_overfit, X, y, "Overfitted Model (High Variance)")

# Example 3: Underfitted model — a linear classifier on a transformed space
from sklearn.linear_model import LogisticRegression

# Make problem harder (non-linear)
X_nonlinear = X ** 2
model_underfit = LogisticRegression(max_iter=1000)
plot_learning_curves(model_underfit, X_nonlinear, y, "Underfitted Model (High Bias)")

Overfitting vs. Underfitting: Visual Diagnosis with Polynomial Regression

Polynomial regression provides a clean visual demonstration of the bias-variance tradeoff. A degree-1 polynomial (linear fit) has only two parameters and cannot capture nonlinear structure – this is underfitting, where \(\text{Bias}^2\) dominates the error. A degree-15 polynomial has 16 parameters for 80 training points, enough capacity to pass through nearly every point – this is overfitting, where \(\text{Variance}\) dominates as the model fits noise. The decomposition \(\text{MSE} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Noise}\) formalizes this: reducing bias (more complex model) increases variance, and vice versa.

The practical diagnostic: compare \(R^2_{\text{train}}\) and \(R^2_{\text{test}}\). When both are low, increase model complexity. When train is high but test is low (or negative), reduce complexity or add regularization. The β€œsweet spot” (degree 3 below) shows similar train and test scores, indicating the model captures the true signal without fitting noise.

# Synthetic 1-D regression target: linear trend plus a sinusoid plus Gaussian
# noise. Seeded so the noise draw and the split are reproducible.
np.random.seed(42)
X_reg = np.linspace(0, 10, 100).reshape(-1, 1)
signal = 2 * X_reg.ravel() + 3 * np.sin(X_reg.ravel())
noise = np.random.randn(100) * 0.5
y_reg = signal + noise

# Hold out 20% of the points for testing
X_train, X_test, y_train, y_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
# Compare polynomial models of increasing complexity on the same data.
# (The unused `models = []` list from the original was removed.)
degrees = [1, 3, 15]  # Polynomial degrees

# Panel label and title color keyed by degree, replacing the if/elif chain
DIAGNOSIS = {
    1: ("UNDERFITTING", 'orange'),
    3: ("GOOD FIT", 'green'),
    15: ("OVERFITTING", 'red'),
}

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

for ax, degree in zip(axes, degrees):
    # Expand X into polynomial features of the given degree
    poly = PolynomialFeatures(degree=degree)
    X_train_poly = poly.fit_transform(X_train)
    X_test_poly = poly.transform(X_test)

    # Ordinary least squares on the expanded features
    model = LinearRegression()
    model.fit(X_train_poly, y_train)

    # Dense grid for a smooth prediction curve
    X_plot = np.linspace(0, 10, 300).reshape(-1, 1)
    y_plot = model.predict(poly.transform(X_plot))

    # R^2 on train vs test reveals under/overfitting
    train_score = model.score(X_train_poly, y_train)
    test_score = model.score(X_test_poly, y_test)

    ax.scatter(X_train, y_train, alpha=0.6, s=30, label='Train data')
    ax.scatter(X_test, y_test, alpha=0.6, s=30, color='red', label='Test data')
    ax.plot(X_plot, y_plot, 'g-', linewidth=2, label='Prediction')

    status, color = DIAGNOSIS[degree]
    ax.set_title(f'{status}\nDegree={degree}\nTrain RΒ²={train_score:.3f}, Test RΒ²={test_score:.3f}',
                 fontsize=11, fontweight='bold', color=color)
    ax.set_xlabel('X')
    ax.set_ylabel('y')
    ax.legend(loc='best', fontsize=9)
    ax.grid(True, alpha=0.3)
    # Clamp the y-range so the degree-15 oscillations don't dominate the scale
    ax.set_ylim([-5, 25])

plt.tight_layout()
plt.show()

print("\n🎯 Key Observations:")
print("  Degree 1: Low train & test scores β†’ Underfitting")
print("  Degree 3: Similar train & test scores β†’ Good fit")
print("  Degree 15: High train, low test β†’ Overfitting")

Regularization Strategies: Constraining Model Complexity

Regularization adds a penalty term to the loss function that discourages large parameter values, effectively limiting model complexity without reducing the number of features. Ridge (L2) regularization adds \(\alpha \sum w_j^2\) to the loss, shrinking all coefficients toward zero proportionally – it never sets coefficients exactly to zero, making it ideal when all features are potentially relevant. Lasso (L1) regularization adds \(\alpha \sum |w_j|\), which produces sparse solutions by driving irrelevant coefficients exactly to zero, effectively performing feature selection.

The regularization strength \(\alpha\) controls the bias-variance tradeoff: small \(\alpha\) permits complex, high-variance fits; large \(\alpha\) forces simpler, high-bias fits. The optimal \(\alpha\) is found via cross-validation, typically searching a logarithmic grid (0.001, 0.01, 0.1, 1, 10, 100). Elastic Net combines both penalties as \(\alpha[\rho \sum |w_j| + (1-\rho) \sum w_j^2]\), offering a middle ground that handles correlated features better than pure Lasso. Below, we compare Ridge and Lasso on degree-10 polynomial features to see how each handles the 11 coefficients (bias term plus powers \(x^1\) through \(x^{10}\)) created from a single input feature.

# Compare L1 (Lasso) vs L2 (Ridge) regularization.
# (Lasso and Ridge are already imported at the top of the file; the
# redundant re-import was removed.)

# Create high-degree polynomial features (prone to overfitting)
poly = PolynomialFeatures(degree=10)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)

print(f"Feature expansion: {X_train.shape[1]} β†’ {X_train_poly.shape[1]} features")

# Sweep regularization strength on a logarithmic grid
alphas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]

results = {'alpha': [], 'method': [], 'train_score': [], 'test_score': []}

for alpha in alphas:
    # Fit both penalties at this strength; Ridge first then Lasso,
    # preserving the original row order of the results table.
    for method, model in [('Ridge (L2)', Ridge(alpha=alpha)),
                          ('Lasso (L1)', Lasso(alpha=alpha, max_iter=10000))]:
        model.fit(X_train_poly, y_train)
        results['alpha'].append(alpha)
        results['method'].append(method)
        results['train_score'].append(model.score(X_train_poly, y_train))
        results['test_score'].append(model.score(X_test_poly, y_test))

results_df = pd.DataFrame(results)
print("\nRegularization Results:")
print(results_df)
# Plot train/test R^2 against regularization strength, one panel per penalty.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# (method label in results_df, panel title) — Ridge panel first, then Lasso
panel_specs = [
    ('Ridge (L2)', 'Ridge Regression (L2)'),
    ('Lasso (L1)', 'Lasso Regression (L1)'),
]

for ax, (method, panel_title) in zip(axes, panel_specs):
    subset = results_df[results_df['method'] == method]
    ax.plot(subset['alpha'], subset['train_score'], 'o-',
            label='Train', color='blue', linewidth=2)
    ax.plot(subset['alpha'], subset['test_score'], 's-',
            label='Test', color='green', linewidth=2)
    # Alphas span five orders of magnitude, so use a log x-axis
    ax.set_xscale('log')
    ax.set_xlabel('Regularization Strength (Ξ±)', fontsize=12)
    ax.set_ylabel('RΒ² Score', fontsize=12)
    ax.set_title(panel_title, fontsize=13, fontweight='bold')
    ax.legend()
    ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nπŸ’‘ Observations:")
print("  β€’ Low Ξ±: Overfitting (high train, low test)")
print("  β€’ Optimal Ξ±: Balanced train/test scores")
print("  β€’ High Ξ±: Underfitting (both scores decrease)")

Convergence Issues: When Gradient Descent Gets Lost

Convergence failure is one of the most common and frustrating ML debugging problems. Stochastic Gradient Descent (SGD) updates parameters as \(\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\), where the learning rate \(\eta\) is the critical hyperparameter. Too small a learning rate causes painfully slow convergence (thousands of iterations to reach a mediocre solution). Too large a rate causes the loss to oscillate or diverge entirely, as the updates overshoot the minimum.

Feature scaling is the most common fix for convergence problems. When features have vastly different scales (e.g., age in [0, 100] vs. income in [0, 1000000]), the loss surface becomes an elongated ellipsoid, and gradient descent zigzags across the narrow dimensions instead of heading toward the minimum. StandardScaler normalizes each feature to zero mean and unit variance, transforming the loss surface into something closer to a sphere where gradient descent converges quickly. The code below demonstrates that scaling alone can reduce required iterations from 100+ (with convergence warnings) to fewer than 20.

from sklearn.linear_model import SGDClassifier

# Generate classification data for the convergence demo
X_cls, y_cls = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train_cls, X_test_cls, y_train_cls, y_test_cls = train_test_split(
    X_cls, y_cls, test_size=0.2, random_state=42
)

# Problem 1: Unscaled features cause slow convergence
print("Training without scaling...")
model_unscaled = SGDClassifier(max_iter=100, random_state=42, verbose=0)
model_unscaled.fit(X_train_cls, y_train_cls)

# n_iter_ == max_iter means SGD stopped because it exhausted its iteration
# budget, not because tolerance was reached. The original printed nothing
# at all when the unscaled model happened to converge; report both cases.
if model_unscaled.n_iter_ == 100:
    print("⚠️ Model did not converge in 100 iterations")
else:
    print(f"Converged in {model_unscaled.n_iter_} iterations")
print(f"   Final accuracy: {model_unscaled.score(X_test_cls, y_test_cls):.3f}")

# Solution: Scale features to zero mean / unit variance
print("\nTraining with scaling...")
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_cls)
X_test_scaled = scaler.transform(X_test_cls)

model_scaled = SGDClassifier(max_iter=100, random_state=42, verbose=0)
model_scaled.fit(X_train_scaled, y_train_cls)

print(f"βœ… Converged in {model_scaled.n_iter_} iterations")
print(f"   Final accuracy: {model_scaled.score(X_test_scaled, y_test_cls):.3f}")
# Visualize convergence with different constant learning rates.
eta_values = [0.001, 0.01, 0.1, 1.0]
# Iteration budgets at which training accuracy is sampled
checkpoints = list(range(1, 101, 5))

fig, ax = plt.subplots(figsize=(10, 6))

for eta in eta_values:
    clf = SGDClassifier(
        max_iter=1000,
        learning_rate='constant',
        eta0=eta,
        random_state=42,
        verbose=0
    )

    # Refit from scratch with an increasing iteration budget and record the
    # training accuracy reached at each checkpoint.
    progress = []
    for budget in checkpoints:
        clf.max_iter = budget
        clf.fit(X_train_scaled, y_train_cls)
        progress.append(clf.score(X_train_scaled, y_train_cls))

    ax.plot(checkpoints, progress, label=f'LR={eta}', linewidth=2)

ax.set_xlabel('Iterations', fontsize=12)
ax.set_ylabel('Training Accuracy', fontsize=12)
ax.set_title('Convergence with Different Learning Rates', fontsize=14, fontweight='bold')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print("\n🎯 Learning Rate Guidelines:")
print("  β€’ Too low (0.001): Slow convergence")
print("  β€’ Too high (1.0): Divergence or oscillation")
print("  β€’ Optimal (0.01-0.1): Fast, stable convergence")

Model Complexity Trade-off: Validation Curves

Validation curves plot model performance against a single hyperparameter while keeping all other settings fixed, directly visualizing the bias-variance tradeoff. For Random Forest’s max_depth, shallow trees (depth 1-3) have high bias – they are simple decision stumps that cannot capture complex interactions. Deep trees (depth 15+) have high variance – they partition the training set into tiny leaf nodes that memorize noise. The optimal depth sits at the peak of the validation curve, where the model captures the true decision boundary without overfitting.

Unlike learning curves (which vary data size), validation curves vary model complexity and answer: β€œIs my model too simple or too complex?” scikit-learn’s validation_curve() automates this with cross-validated scoring across a parameter grid. In practice, plot validation curves for the 2-3 most impactful hyperparameters (tree depth, regularization strength, number of estimators) and select values near each validation peak. This is more interpretable than blind grid search because you can see the shape of the performance landscape and understand why certain parameter values work.

# Validation curve: sweep max_depth while every other setting stays fixed.
from sklearn.model_selection import validation_curve

depth_grid = range(1, 21)

tr_scores, cv_scores = validation_curve(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_cls, y_cls,
    param_name="max_depth",
    param_range=depth_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=-1
)

# Collapse the per-fold scores into mean +/- std per depth
tr_mu, tr_sd = tr_scores.mean(axis=1), tr_scores.std(axis=1)
cv_mu, cv_sd = cv_scores.mean(axis=1), cv_scores.std(axis=1)

fig, ax = plt.subplots(figsize=(10, 6))

ax.plot(depth_grid, tr_mu, 'o-', color='blue', label='Training score', linewidth=2)
ax.fill_between(depth_grid, tr_mu - tr_sd, tr_mu + tr_sd,
                alpha=0.2, color='blue')

ax.plot(depth_grid, cv_mu, 's-', color='green', label='Validation score', linewidth=2)
ax.fill_between(depth_grid, cv_mu - cv_sd, cv_mu + cv_sd,
                alpha=0.2, color='green')

# Mark the depth with the best cross-validated score
best = int(np.argmax(cv_mu))
best_depth = depth_grid[best]
ax.axvline(x=best_depth, color='red', linestyle='--',
           label=f'Optimal depth={best_depth}')

ax.set_xlabel('Max Depth', fontsize=12)
ax.set_ylabel('Accuracy', fontsize=12)
ax.set_title('Validation Curve: Model Complexity Trade-off', fontsize=14, fontweight='bold')
ax.legend(loc='best')
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"\n🎯 Optimal max_depth: {best_depth}")
print(f"   Validation accuracy: {cv_mu[best]:.3f}")
print(f"   Training accuracy: {tr_mu[best]:.3f}")
print(f"   Gap: {tr_mu[best] - cv_mu[best]:.3f}")

🎯 Key Takeaways

  1. Learning Curves

    • Large gap β†’ Overfitting (high variance)

    • Both scores low β†’ Underfitting (high bias)

    • Converging scores β†’ Good fit

  2. Overfitting Solutions

    • Add regularization (L1/L2)

    • Reduce model complexity

    • Collect more training data

    • Use dropout (neural networks)

    • Early stopping

  3. Underfitting Solutions

    • Increase model complexity

    • Add more features

    • Reduce regularization

    • Train longer

  4. Convergence Issues

    • Always scale features

    • Tune learning rate

    • Increase max iterations

    • Check for numerical instability

  5. Model Selection

    • Use validation curves

    • Balance bias-variance trade-off

    • Cross-validate hyperparameters

πŸ“ Debugging ChecklistΒΆ

When performance is poor:

  • βœ… Plot learning curves

  • βœ… Check train vs test scores

  • βœ… Try different model complexities

  • βœ… Apply regularization

  • βœ… Scale features

  • βœ… Monitor convergence

  • βœ… Validate hyperparameters

Continue to Notebook 5 for error analysis! πŸš€