Chapter 10: Deep Learning

Overview

Deep Learning = Neural networks with multiple hidden layers

Key Components

  1. Neurons: Basic computational units

  2. Layers: Input → Hidden (multiple) → Output

  3. Activation Functions: Non-linear transformations

  4. Backpropagation: Learning via gradient descent

  5. Architectures: Feedforward, CNN, RNN, Transformer

Neural Network Formula

Single Hidden Layer: \(f(X) = \beta_0 + \sum_{k=1}^K \beta_k h_k(X)\)

where each hidden unit is \(h_k(X) = g(w_{k0} + \sum_{j=1}^p w_{kj} X_j)\)

Components:

  • \(g()\) = activation function (ReLU, sigmoid, tanh)

  • \(w_{kj}\) = weights (learned parameters)

  • \(K\) = number of hidden units
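
A minimal NumPy sketch of this formula for a single observation, with ReLU standing in for \(g\); the weights here are random placeholders rather than learned values:

import numpy as np

rng = np.random.default_rng(0)
p, K = 4, 3                        # p input features, K hidden units
x = rng.normal(size=p)             # one observation X
W = rng.normal(size=(K, p + 1))    # row k holds (w_k0, w_k1, ..., w_kp)
beta = rng.normal(size=K + 1)      # (beta_0, beta_1, ..., beta_K)

def g(z):                          # activation function (here: ReLU)
    return np.maximum(0, z)

# h_k(X) = g(w_k0 + sum_j w_kj * X_j)
h = np.array([g(W[k, 0] + W[k, 1:] @ x) for k in range(K)])

# f(X) = beta_0 + sum_k beta_k * h_k(X)
f_x = beta[0] + beta[1:] @ h
print(f"f(X) = {f_x:.3f}")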

Common Activation Functions

ReLU (Rectified Linear Unit): \(g(z) = \max(0, z)\)

  • Most popular in deep networks

  • Avoids vanishing gradients

  • Computationally efficient

Sigmoid: \(g(z) = \frac{1}{1 + e^{-z}}\)

  • Output: (0, 1)

  • Used in binary classification output

  • Can suffer vanishing gradients

Tanh: \(g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)

  • Output: (-1, 1)

  • Zero-centered (better than sigmoid)

Softmax (output layer): \(g(z_k) = \frac{e^{z_k}}{\sum_j e^{z_j}}\)

  • Multi-class classification

  • Outputs probabilities (sum to 1)
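
For intuition, the four activations in NumPy, evaluated on a small vector (a sketch, not how a framework implements them internally):

import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu    = np.maximum(0, z)
sigmoid = 1 / (1 + np.exp(-z))
tanh    = np.tanh(z)

# softmax over the whole vector; subtracting max(z) avoids overflow in exp
exp_z   = np.exp(z - z.max())
softmax = exp_z / exp_z.sum()

print(relu)
print(sigmoid)
print(tanh)
print(softmax, softmax.sum())   # probabilities summing to 1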

Advantages

✅ Automatic feature learning
✅ Handles complex patterns
✅ State-of-the-art on images, text, audio
✅ Flexible architectures
✅ Transfer learning possible

Challenges

❌ Requires large datasets
❌ Computationally expensive
❌ Many hyperparameters
❌ Black box (hard to interpret)
❌ Can overfit easily

10.1 Single Layer Neural Network

Architecture

Input → Hidden Layer → Output

Forward Pass

  1. Hidden layer: \(h = g(W_1 X + b_1)\)

  2. Output: \(\hat{y} = W_2 h + b_2\)
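
The same forward pass written with matrices (row-vector convention: one observation per row), again with random placeholder weights and a ReLU hidden layer:

import numpy as np

rng = np.random.default_rng(1)
n, p, K = 5, 2, 10                    # 5 observations, 2 features, 10 hidden units

X  = rng.normal(size=(n, p))
W1 = rng.normal(size=(p, K)); b1 = np.zeros(K)    # input -> hidden
W2 = rng.normal(size=(K, 1)); b2 = np.zeros(1)    # hidden -> output

h     = np.maximum(0, X @ W1 + b1)    # hidden layer h = g(W1 X + b1), shape (n, K)
y_hat = h @ W2 + b2                   # output y_hat = W2 h + b2, shape (n, 1)
print(h.shape, y_hat.shape)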

Parameters

  • hidden_layer_sizes: Number of neurons (e.g., 10, 50, 100)

  • activation: ReLU, tanh, logistic

  • alpha: L2 regularization (0.0001 default)

  • learning_rate_init: Step size (0.001 default)

  • max_iter: Training epochs
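
These names map one-to-one onto scikit-learn's MLPClassifier constructor, for example:

from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(hidden_layer_sizes=(50,),   # one hidden layer with 50 neurons
                    activation='relu',          # 'relu', 'tanh', or 'logistic'
                    alpha=1e-4,                 # L2 regularization strength
                    learning_rate_init=1e-3,    # initial step size
                    max_iter=500,               # maximum training iterations
                    random_state=42)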

# Single Layer Neural Network
# Chapter-wide imports (needed by the code cells in this chapter)
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_circles, fetch_california_housing, load_digits
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier, MLPRegressor
from sklearn.metrics import accuracy_score, r2_score, mean_squared_error

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test different hidden layer sizes
hidden_sizes = [5, 10, 25, 50]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, size in enumerate(hidden_sizes):
    mlp = MLPClassifier(hidden_layer_sizes=(size,), activation='relu',
                        max_iter=1000, random_state=42)
    mlp.fit(X_train_scaled, y_train)
    
    # Decision boundary
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = mlp.predict(scaler.transform(np.c_[xx.ravel(), yy.ravel()]))
    Z = Z.reshape(xx.shape)
    
    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='RdYlBu', 
                     edgecolors='k', s=50, alpha=0.7)
    axes[idx].set_title(f'Hidden Units: {size}\n'
                       f'Train: {mlp.score(X_train_scaled, y_train):.3f}, '
                       f'Test: {mlp.score(X_test_scaled, y_test):.3f}')
    axes[idx].set_xlabel('X₁')
    axes[idx].set_ylabel('X₂')

plt.tight_layout()
plt.show()

print("\nπŸ’‘ More hidden units β†’ more flexible decision boundary")
print("   But too many β†’ overfitting risk!")

10.2 Deep Neural Networks

Multiple Hidden Layers

Input → Hidden₁ → Hidden₂ → … → Hidden_L → Output

Why Deep?

  • Hierarchical features: Each layer learns increasingly abstract representations

  • Better performance: Deeper often better than wider

  • Parameter efficiency: Can represent complex functions with fewer parameters

Typical Architectures

  • Shallow: (100,) or (50, 25)

  • Medium: (100, 50, 25)

  • Deep: (128, 64, 32, 16)

Training Challenges

  • Vanishing gradients: Gradients shrink layer by layer, especially with sigmoid/tanh activations

  • Exploding gradients: Gradients grow uncontrollably, often from poor initialization or too-high learning rates

  • Overfitting: Many parameters

Solutions

  • ReLU activation: Avoids vanishing gradients

  • Batch normalization: Stabilizes training

  • Dropout: Regularization technique

  • Early stopping: Stop when validation error increases
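
Of these remedies, ReLU, L2 weight decay (alpha), and early stopping are available directly in MLPClassifier; dropout and batch normalization require a deep-learning framework (see Section 10.3). A minimal sketch of the sklearn-side knobs:

from sklearn.neural_network import MLPClassifier

deep_mlp = MLPClassifier(hidden_layer_sizes=(128, 64, 32, 16),
                         activation='relu',        # mitigates vanishing gradients
                         alpha=1e-3,               # L2 penalty against overfitting
                         early_stopping=True,      # stop when validation score stops improving
                         validation_fraction=0.1,
                         random_state=42)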

# Compare Shallow vs Deep Networks
architectures = {
    'Shallow (50)': (50,),
    'Medium (50, 25)': (50, 25),
    'Deep (50, 25, 10)': (50, 25, 10),
    'Very Deep (100, 50, 25, 10)': (100, 50, 25, 10)
}

results = {}
for name, arch in architectures.items():
    mlp = MLPClassifier(hidden_layer_sizes=arch, activation='relu',
                        max_iter=1000, random_state=42, early_stopping=True)
    mlp.fit(X_train_scaled, y_train)
    
    train_acc = mlp.score(X_train_scaled, y_train)
    test_acc = mlp.score(X_test_scaled, y_test)
    n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)
    
    results[name] = {
        'train': train_acc,
        'test': test_acc,
        'params': n_params,
        'layers': len(arch)
    }

# Visualize
df_results = pd.DataFrame(results).T
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
df_results[['train', 'test']].plot(kind='barh', ax=ax1)
ax1.set_xlabel('Accuracy')
ax1.set_title('Architecture Comparison')
ax1.legend(['Train', 'Test'])
ax1.grid(True, alpha=0.3, axis='x')

# Parameters vs Performance
ax2.scatter(df_results['params'], df_results['test'], s=200, alpha=0.6)
for idx, name in enumerate(df_results.index):
    ax2.annotate(name, (df_results.iloc[idx]['params'], df_results.iloc[idx]['test']),
                fontsize=8, ha='right')
ax2.set_xlabel('Number of Parameters')
ax2.set_ylabel('Test Accuracy')
ax2.set_title('Parameters vs Performance')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nπŸ“Š Results:\n")
print(df_results.to_string())
print("\nπŸ’‘ Deeper not always better! Balance complexity with generalization.")

10.3 Regularization in Neural Networks

L2 Regularization (Weight Decay)

Add penalty to loss: \(L = L_{\text{data}} + \alpha \sum_{i,j} w_{ij}^2\)

  • alpha: Regularization strength (0.0001-0.01)

  • Prevents large weights
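
For intuition, the penalty term can be computed from a fitted network's weight matrices. This is the conceptual quantity from the equation above (sklearn applies its own internal scaling, so it is not the exact value added to its loss); it assumes `mlp` is a fitted MLPClassifier, e.g. one from the previous cell:

import numpy as np

# conceptual L2 penalty: alpha * sum of squared weights (biases/intercepts are not penalized)
l2_penalty = mlp.alpha * sum(np.sum(W ** 2) for W in mlp.coefs_)
print(f"alpha = {mlp.alpha}, penalty term ≈ {l2_penalty:.4f}")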

Early Stopping

  • Monitor validation error

  • Stop when it starts increasing

  • early_stopping=True, validation_fraction=0.1

Dropout (not in sklearn MLPClassifier)

  • Randomly drop neurons during training

  • Prevents co-adaptation

  • Typical rate: 0.2-0.5

Batch Normalization (not in sklearn)

  • Normalize layer inputs

  • Stabilizes training

  • Allows higher learning rates
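
Neither dropout nor batch normalization is exposed by sklearn's MLP. A minimal PyTorch sketch (assuming PyTorch is installed) of where they sit in a feedforward stack; the layer sizes are hypothetical, chosen to match a 64-feature input and 10-class output like the digits data later in this chapter:

import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),
    nn.BatchNorm1d(128),   # normalize the layer's inputs -> more stable training
    nn.ReLU(),
    nn.Dropout(p=0.3),     # randomly zero 30% of activations during training
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 10),     # logits; softmax is folded into nn.CrossEntropyLoss
)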

# Effect of L2 Regularization (alpha)
alphas = [0.0001, 0.001, 0.01, 0.1]

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

train_scores = []
test_scores = []

for alpha in alphas:
    mlp = MLPClassifier(hidden_layer_sizes=(50, 25), activation='relu',
                        alpha=alpha, max_iter=1000, random_state=42)
    mlp.fit(X_train_scaled, y_train)
    train_scores.append(mlp.score(X_train_scaled, y_train))
    test_scores.append(mlp.score(X_test_scaled, y_test))

axes[0].plot(alphas, train_scores, 'o-', label='Train', linewidth=2)
axes[0].plot(alphas, test_scores, 's-', label='Test', linewidth=2)
axes[0].set_xlabel('Alpha (L2 Penalty)')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Effect of Regularization')
axes[0].set_xscale('log')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Learning curves with early stopping
mlp = MLPClassifier(hidden_layer_sizes=(50, 25), activation='relu',
                    max_iter=1000, early_stopping=True, validation_fraction=0.1,
                    random_state=42)
mlp.fit(X_train_scaled, y_train)

axes[1].plot(mlp.loss_curve_, label='Training Loss', linewidth=2)
axes[1].plot(mlp.validation_scores_, label='Validation Accuracy', linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss / Accuracy')
axes[1].set_title(f'Learning Curves\nStopped at iteration {mlp.n_iter_}')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nπŸ’‘ Regularization:")
print(f"   β€’ Optimal alpha balances train/test performance")
print(f"   β€’ Early stopping prevented overfitting (stopped at {mlp.n_iter_} iterations)")

10.4 Neural Networks for Regression

Differences from Classification

  • Output layer: No activation (linear)

  • Loss function: MSE instead of cross-entropy

  • Evaluation: R², MSE instead of accuracy

Use MLPRegressor

MLPRegressor takes the same constructor parameters as MLPClassifier; the loss is squared error and the output activation is the identity.

# Neural Network Regression
housing = fetch_california_housing()
X_house, y_house = housing.data, housing.target

X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_house, y_house, test_size=0.3, random_state=42)

scaler_h = StandardScaler()
X_train_h_scaled = scaler_h.fit_transform(X_train_h)
X_test_h_scaled = scaler_h.transform(X_test_h)

# Train neural network
mlp_reg = MLPRegressor(hidden_layer_sizes=(100, 50, 25),
                       activation='relu',
                       max_iter=500,
                       early_stopping=True,
                       random_state=42)
mlp_reg.fit(X_train_h_scaled, y_train_h)

# Predictions
y_pred_train = mlp_reg.predict(X_train_h_scaled)
y_pred_test = mlp_reg.predict(X_test_h_scaled)

# Metrics
train_r2 = r2_score(y_train_h, y_pred_train)
test_r2 = r2_score(y_test_h, y_pred_test)
train_rmse = np.sqrt(mean_squared_error(y_train_h, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test_h, y_pred_test))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predicted vs Actual
axes[0].scatter(y_test_h, y_pred_test, alpha=0.3)
axes[0].plot([y_test_h.min(), y_test_h.max()], [y_test_h.min(), y_test_h.max()],
            'r--', lw=2, label='Perfect prediction')
axes[0].set_xlabel('Actual')
axes[0].set_ylabel('Predicted')
axes[0].set_title(f'Neural Network Regression\nTest R² = {test_r2:.3f}, RMSE = {test_rmse:.3f}')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Learning curve
axes[1].plot(mlp_reg.loss_curve_, linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss (MSE)')
axes[1].set_title(f'Training Loss\nStopped at iteration {mlp_reg.n_iter_}')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nπŸ“Š Regression Results:")
print(f"   Train RΒ²: {train_r2:.4f}, RMSE: {train_rmse:.4f}")
print(f"   Test RΒ²:  {test_r2:.4f}, RMSE: {test_rmse:.4f}")
print(f"   Iterations: {mlp_reg.n_iter_}")

10.5 Image Classification Example

Digits Dataset (scikit-learn)

  • 8×8 pixel grayscale images

  • 10 classes (digits 0-9)

  • A small MNIST-style benchmark (load_digits), not the full 28×28 MNIST

Network Design

  • Input: 64 features (8×8 pixels)

  • Hidden layers: Learn digit features

  • Output: 10 classes (softmax)

# Digits classification (scikit-learn load_digits, 8×8 images)
digits = load_digits()
X_digits, y_digits = digits.data, digits.target

X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_digits, y_digits, test_size=0.3, random_state=42)

# Scale
scaler_d = StandardScaler()
X_train_d_scaled = scaler_d.fit_transform(X_train_d)
X_test_d_scaled = scaler_d.transform(X_test_d)

# Train network
mlp_digits = MLPClassifier(hidden_layer_sizes=(128, 64, 32),
                          activation='relu',
                          max_iter=500,
                          early_stopping=True,
                          random_state=42)
mlp_digits.fit(X_train_d_scaled, y_train_d)

y_pred_d = mlp_digits.predict(X_test_d_scaled)
accuracy = accuracy_score(y_test_d, y_pred_d)

# Visualize
fig = plt.figure(figsize=(14, 8))
gs = fig.add_gridspec(3, 4)

# Show sample predictions
for i in range(12):
    ax = fig.add_subplot(gs[i // 4, i % 4])
    ax.imshow(X_test_d[i].reshape(8, 8), cmap='gray')
    color = 'green' if y_pred_d[i] == y_test_d[i] else 'red'
    ax.set_title(f'True: {y_test_d[i]}, Pred: {y_pred_d[i]}', color=color)
    ax.axis('off')

plt.suptitle(f'Digits Classification (Test Accuracy: {accuracy:.3f})', fontsize=14)
plt.tight_layout()
plt.show()

# Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test_d, y_pred_d)

plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Digits')
plt.show()

print(f"\nπŸ“Š MNIST Results:")
print(f"   Test Accuracy: {accuracy:.4f}")
print(f"   Network: {mlp_digits.hidden_layer_sizes}")
print(f"   Parameters: {sum(w.size for w in mlp_digits.coefs_):,}")

Key Takeaways

When to Use Neural Networks

Good For:
✅ Large datasets (n > 1000)
✅ Complex non-linear patterns
✅ Images, text, audio
✅ Many features
✅ When accuracy is paramount

Not Ideal For:
❌ Small datasets (n < 500)
❌ Need interpretability
❌ Linear relationships
❌ Tabular data (try RF/XGBoost first)
❌ Limited computational resources

Hyperparameter Guidelines

Architecture:

  • Start simple: (50,) or (100, 50)

  • Go deeper if needed: (128, 64, 32)

  • More data → can use more layers/units

Activation:

  • Use ReLU (default, works well)

  • Try tanh if ReLU doesn't work

  • Output: softmax (classification), linear (regression)

Regularization (alpha):

  • Start: 0.0001 (default)

  • If overfitting: increase to 0.001, 0.01

  • If underfitting: decrease to 0.00001

Learning Rate:

  • Default: 0.001 (usually good)

  • If not converging: decrease to 0.0001

  • If too slow: increase to 0.01

Training:

  • Use early_stopping=True

  • Set max_iter=500 or 1000

  • Monitor loss curve

Best Practices

1. Always Scale Features

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

2. Use Early Stopping

MLPClassifier(early_stopping=True, validation_fraction=0.1)

3. Start Simple, Then Increase Complexity

# Start
MLPClassifier(hidden_layer_sizes=(50,))
# Then try
MLPClassifier(hidden_layer_sizes=(100, 50))
# Then
MLPClassifier(hidden_layer_sizes=(128, 64, 32))

4. Monitor Training

mlp.fit(X_train, y_train)
plt.plot(mlp.loss_curve_)

Comparison: NN vs Other Methods

Dataset Type     | Best Choice    | Alternative
-----------------|----------------|----------------
Tabular (small)  | Random Forest  | XGBoost
Tabular (large)  | XGBoost        | Neural Network
Images           | CNN            | ResNet, ViT
Text             | Transformer    | LSTM, GRU
Time Series      | LSTM/GRU       | ARIMA, XGBoost

Common Pitfalls

❌ Not scaling features → Poor convergence
❌ Too complex architecture → Overfitting
❌ No regularization → Overfitting
❌ Too few iterations → Underfitting
❌ Wrong activation → Poor performance
❌ Not using validation → Can't detect overfitting

sklearn MLPClassifier Limitations

For production deep learning, consider:

  • PyTorch: Flexible, research-friendly

  • TensorFlow/Keras: Production-ready

  • JAX: High-performance

sklearn is good for:

  • Quick prototypes

  • Simple feedforward networks

  • Integration with sklearn pipelines

Next Chapter

Chapter 11: Survival Analysis

  • Survival and Censoring

  • Kaplan-Meier Estimator

  • Cox Proportional Hazards Model

  • Extensions

Practice Exercises

Exercise 2: Activation Functions

Using the circles dataset:

  1. Train networks with: relu, tanh, logistic

  2. Compare convergence speed (iterations to 95% accuracy)

  3. Visualize decision boundaries for each

  4. Explain differences

Exercise 3: Regularization Study

  1. Generate data with noise

  2. Test alpha: [0, 0.0001, 0.001, 0.01, 0.1]

  3. Plot learning curves for each

  4. Find optimal regularization

  5. Visualize weight distributions

Exercise 4: Learning Rate Impact

  1. Use digits dataset

  2. Test learning_rate_init: [0.0001, 0.001, 0.01, 0.1]

  3. Plot loss curves

  4. Identify: too slow, just right, unstable

Exercise 5: Regression Challenge

Create synthetic data: y = sin(x₁) + x₂² + x₃ + noise

  1. Train neural network

  2. Compare with linear regression, random forest

  3. Analyze where NN excels

  4. Visualize predictions vs actual

Exercise 6: Real Dataset

Use your choice of dataset:

  1. Implement full pipeline (scaling, train, validate)

  2. Grid search hyperparameters

  3. Plot learning curves

  4. Compare with 2 other algorithms

  5. Provide recommendation with justification