Chapter 10: Deep Learning
Overview
Deep learning = neural networks with multiple hidden layers.
Key Components
Neurons: Basic computational units
Layers: Input → Hidden (multiple) → Output
Activation Functions: Non-linear transformations
Backpropagation: Learning via gradient descent
Architectures: Feedforward, CNN, RNN, Transformer
Neural Network Formula
Single Hidden Layer: \(f(X) = \beta_0 + \sum_{k=1}^K \beta_k h_k(X)\)
where each hidden unit is \(h_k(X) = g(w_{k0} + \sum_{j=1}^p w_{kj} X_j)\)
Components:
\(g(\cdot)\) = activation function (ReLU, sigmoid, tanh)
\(w_{kj}\) = weights (learned parameters)
\(K\) = number of hidden units
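To make the formula concrete, here is a minimal NumPy sketch of the forward pass for a single hidden layer; the weights are random placeholders, not trained values.

# Forward pass for a single-hidden-layer network (illustrative sketch)
import numpy as np

rng = np.random.default_rng(0)
p, K = 4, 3                      # p input features, K hidden units
X = rng.normal(size=p)           # one example

W = rng.normal(size=(K, p))      # hidden-layer weights w_kj
w0 = rng.normal(size=K)          # hidden-layer biases w_k0
beta = rng.normal(size=K)        # output weights beta_k
beta0 = rng.normal()             # output bias beta_0

def relu(z):
    return np.maximum(0, z)

h = relu(W @ X + w0)             # h_k(X) = g(w_k0 + sum_j w_kj X_j)
f = beta0 + beta @ h             # f(X) = beta_0 + sum_k beta_k h_k(X)
print(f)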
Common Activation Functions
ReLU (Rectified Linear Unit): \(g(z) = \max(0, z)\)
Most popular in deep networks
Avoids vanishing gradients
Computationally efficient
Sigmoid: \(g(z) = \frac{1}{1 + e^{-z}}\)
Output: (0, 1)
Used in binary classification output
Can suffer vanishing gradients
Tanh: \(g(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\)
Output: (-1, 1)
Zero-centered (better than sigmoid)
Softmax (output layer): \(g(z_k) = \frac{e^{z_k}}{\sum_j e^{z_j}}\)
Multi-class classification
Outputs probabilities (sum to 1)
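A small NumPy sketch of these four functions (the test inputs below are arbitrary illustrative values):

# Activation functions in NumPy (illustrative sketch)
import numpy as np

def relu(z):
    return np.maximum(0, z)

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def tanh(z):
    return np.tanh(z)

def softmax(z):
    e = np.exp(z - np.max(z))    # subtract max for numerical stability
    return e / e.sum()

z = np.linspace(-3, 3, 7)
print(relu(z), sigmoid(z), tanh(z), sep='\n')
print(softmax(np.array([1.0, 2.0, 3.0])))   # probabilities summing to 1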
Advantages
✅ Automatic feature learning
✅ Handles complex patterns
✅ State-of-the-art on images, text, audio
✅ Flexible architectures
✅ Transfer learning possible
Challenges
❌ Requires large datasets
❌ Computationally expensive
❌ Many hyperparameters
❌ Black box (hard to interpret)
❌ Can overfit easily
10.1 Single Layer Neural Network
Architecture
Input → Hidden Layer → Output
Forward Pass
Hidden layer: \(h = g(W_1 X + b_1)\)
Output: \(\hat{y} = W_2 h + b_2\)
Parameters
hidden_layer_sizes: Number of neurons (e.g., 10, 50, 100)
activation: ReLU, tanh, logistic
alpha: L2 regularization strength (default 0.0001)
learning_rate_init: Step size (default 0.001)
max_iter: Maximum number of training iterations (epochs)
# Single Layer Neural Network
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

X, y = make_circles(n_samples=500, noise=0.1, factor=0.5, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Scale features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Test different hidden layer sizes
hidden_sizes = [5, 10, 25, 50]
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
axes = axes.ravel()

for idx, size in enumerate(hidden_sizes):
    mlp = MLPClassifier(hidden_layer_sizes=(size,), activation='relu',
                        max_iter=1000, random_state=42)
    mlp.fit(X_train_scaled, y_train)

    # Decision boundary on a grid (grid points must be scaled before predicting)
    h = 0.02
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = mlp.predict(scaler.transform(np.c_[xx.ravel(), yy.ravel()]))
    Z = Z.reshape(xx.shape)

    axes[idx].contourf(xx, yy, Z, alpha=0.3, cmap='RdYlBu')
    axes[idx].scatter(X_train[:, 0], X_train[:, 1], c=y_train, cmap='RdYlBu',
                      edgecolors='k', s=50, alpha=0.7)
    axes[idx].set_title(f'Hidden Units: {size}\n'
                        f'Train: {mlp.score(X_train_scaled, y_train):.3f}, '
                        f'Test: {mlp.score(X_test_scaled, y_test):.3f}')
    axes[idx].set_xlabel('X₁')
    axes[idx].set_ylabel('X₂')

plt.tight_layout()
plt.show()

print("\n💡 More hidden units → more flexible decision boundary")
print("   But too many → overfitting risk!")
10.2 Deep Neural Networks
Why Deep?
Hierarchical features: Each layer learns increasingly abstract representations
Better performance: Deeper networks often outperform wider ones
Parameter efficiency: Can represent complex functions with fewer parameters
Typical Architectures
Shallow: (100,) or (50, 25)
Medium: (100, 50, 25)
Deep: (128, 64, 32, 16)
Training Challenges
Vanishing gradients: Deep networks with sigmoid/tanh activations (see the sketch after this list)
Exploding gradients: Poor weight initialization
Overfitting: Many parameters
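A quick numerical sketch of why sigmoid gradients vanish: the sigmoid's derivative is at most 0.25, so backpropagating through many sigmoid layers multiplies many small factors together. The layer count and random pre-activations below are arbitrary illustrative choices.

# Why gradients vanish with sigmoid activations (illustrative sketch)
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1 - s)                             # maximum value is 0.25 at z = 0

rng = np.random.default_rng(0)
pre_activations = rng.normal(size=20)              # one unit per layer, 20 layers
layer_grads = sigmoid_grad(pre_activations)        # local gradient at each layer

# Chain rule: the gradient reaching the first layer is a product of local gradients
print("per-layer gradients (first 5):", np.round(layer_grads[:5], 3))
print("product across 20 layers:", np.prod(layer_grads))       # vanishingly small

relu_grads = (pre_activations > 0).astype(float)   # ReLU gradient is 0 or 1
print("ReLU product along active path:", np.prod(relu_grads[relu_grads > 0]))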
Solutions
ReLU activation: Avoids vanishing gradients
Batch normalization: Stabilizes training
Dropout: Regularization technique
Early stopping: Stop when validation error increases
# Compare Shallow vs Deep Networks
import pandas as pd

architectures = {
    'Shallow (50)': (50,),
    'Medium (50, 25)': (50, 25),
    'Deep (50, 25, 10)': (50, 25, 10),
    'Very Deep (100, 50, 25, 10)': (100, 50, 25, 10)
}

results = {}
for name, arch in architectures.items():
    mlp = MLPClassifier(hidden_layer_sizes=arch, activation='relu',
                        max_iter=1000, random_state=42, early_stopping=True)
    mlp.fit(X_train_scaled, y_train)

    train_acc = mlp.score(X_train_scaled, y_train)
    test_acc = mlp.score(X_test_scaled, y_test)
    n_params = sum(w.size for w in mlp.coefs_) + sum(b.size for b in mlp.intercepts_)

    results[name] = {
        'train': train_acc,
        'test': test_acc,
        'params': n_params,
        'layers': len(arch)
    }

# Visualize
df_results = pd.DataFrame(results).T
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Accuracy comparison
df_results[['train', 'test']].plot(kind='barh', ax=ax1)
ax1.set_xlabel('Accuracy')
ax1.set_title('Architecture Comparison')
ax1.legend(['Train', 'Test'])
ax1.grid(True, alpha=0.3, axis='x')

# Parameters vs Performance
ax2.scatter(df_results['params'], df_results['test'], s=200, alpha=0.6)
for idx, name in enumerate(df_results.index):
    ax2.annotate(name, (df_results.iloc[idx]['params'], df_results.iloc[idx]['test']),
                 fontsize=8, ha='right')
ax2.set_xlabel('Number of Parameters')
ax2.set_ylabel('Test Accuracy')
ax2.set_title('Parameters vs Performance')
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Results:\n")
print(df_results.to_string())
print("\n💡 Deeper is not always better! Balance complexity with generalization.")
10.3 Regularization in Neural Networks
L2 Regularization (Weight Decay)
Add a penalty to the loss: \(L = L_{\text{data}} + \alpha \sum_{i,j} w_{ij}^2\)
alpha: Regularization strength (typically 0.0001-0.01)
Prevents large weights
Early Stopping
Monitor validation error
Stop when it starts increasing
early_stopping=True, validation_fraction=0.1
Dropout (not in sklearn's MLPClassifier; see the sketch after this list)
Randomly drop neurons during training
Prevents co-adaptation
Typical rate: 0.2-0.5
Batch Normalization (not in sklearn)
Normalize layer inputs
Stabilizes training
Allows higher learning rates
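sklearn's MLP estimators do not expose dropout or batch normalization. As a hedged illustration of how these layers look in PyTorch (one of the frameworks mentioned later in this chapter), here is a minimal sketch; the layer sizes and the 0.3 dropout rate are arbitrary choices, not recommendations.

# Dropout and batch normalization in PyTorch (illustrative sketch, not sklearn)
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(64, 128),      # input -> first hidden layer
    nn.BatchNorm1d(128),     # normalize layer inputs to stabilize training
    nn.ReLU(),
    nn.Dropout(p=0.3),       # randomly zero 30% of activations during training
    nn.Linear(128, 64),
    nn.BatchNorm1d(64),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(64, 10),       # 10-class output (softmax is applied inside the loss)
)
print(model)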
# Effect of L2 Regularization (alpha)
alphas = [0.0001, 0.001, 0.01, 0.1]
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

train_scores = []
test_scores = []
for alpha in alphas:
    mlp = MLPClassifier(hidden_layer_sizes=(50, 25), activation='relu',
                        alpha=alpha, max_iter=1000, random_state=42)
    mlp.fit(X_train_scaled, y_train)
    train_scores.append(mlp.score(X_train_scaled, y_train))
    test_scores.append(mlp.score(X_test_scaled, y_test))

axes[0].plot(alphas, train_scores, 'o-', label='Train', linewidth=2)
axes[0].plot(alphas, test_scores, 's-', label='Test', linewidth=2)
axes[0].set_xlabel('Alpha (L2 Penalty)')
axes[0].set_ylabel('Accuracy')
axes[0].set_title('Effect of Regularization')
axes[0].set_xscale('log')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Learning curves with early stopping
mlp = MLPClassifier(hidden_layer_sizes=(50, 25), activation='relu',
                    max_iter=1000, early_stopping=True, validation_fraction=0.1,
                    random_state=42)
mlp.fit(X_train_scaled, y_train)

axes[1].plot(mlp.loss_curve_, label='Training Loss', linewidth=2)
axes[1].plot(mlp.validation_scores_, label='Validation Accuracy', linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss / Accuracy')
axes[1].set_title(f'Learning Curves\nStopped at iteration {mlp.n_iter_}')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n💡 Regularization:")
print("  • Optimal alpha balances train/test performance")
print(f"  • Early stopping prevented overfitting (stopped at {mlp.n_iter_} iterations)")
10.4 Neural Networks for Regression
Differences from Classification
Output layer: No activation (linear)
Loss function: MSE instead of cross-entropy
Evaluation: R², MSE instead of accuracy
Use MLPRegressor
Same parameters as MLPClassifier
# Neural Network Regression
from sklearn.datasets import fetch_california_housing
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_squared_error

housing = fetch_california_housing()
X_house, y_house = housing.data, housing.target
X_train_h, X_test_h, y_train_h, y_test_h = train_test_split(
    X_house, y_house, test_size=0.3, random_state=42)

scaler_h = StandardScaler()
X_train_h_scaled = scaler_h.fit_transform(X_train_h)
X_test_h_scaled = scaler_h.transform(X_test_h)

# Train neural network
mlp_reg = MLPRegressor(hidden_layer_sizes=(100, 50, 25),
                       activation='relu',
                       max_iter=500,
                       early_stopping=True,
                       random_state=42)
mlp_reg.fit(X_train_h_scaled, y_train_h)

# Predictions
y_pred_train = mlp_reg.predict(X_train_h_scaled)
y_pred_test = mlp_reg.predict(X_test_h_scaled)

# Metrics
train_r2 = r2_score(y_train_h, y_pred_train)
test_r2 = r2_score(y_test_h, y_pred_test)
train_rmse = np.sqrt(mean_squared_error(y_train_h, y_pred_train))
test_rmse = np.sqrt(mean_squared_error(y_test_h, y_pred_test))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Predicted vs Actual
axes[0].scatter(y_test_h, y_pred_test, alpha=0.3)
axes[0].plot([y_test_h.min(), y_test_h.max()], [y_test_h.min(), y_test_h.max()],
             'r--', lw=2, label='Perfect prediction')
axes[0].set_xlabel('Actual')
axes[0].set_ylabel('Predicted')
axes[0].set_title(f'Neural Network Regression\nTest R² = {test_r2:.3f}, RMSE = {test_rmse:.3f}')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

# Learning curve
axes[1].plot(mlp_reg.loss_curve_, linewidth=2)
axes[1].set_xlabel('Iteration')
axes[1].set_ylabel('Loss (MSE)')
axes[1].set_title(f'Training Loss\nStopped at iteration {mlp_reg.n_iter_}')
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Regression Results:")
print(f"  Train R²: {train_r2:.4f}, RMSE: {train_rmse:.4f}")
print(f"  Test R²:  {test_r2:.4f}, RMSE: {test_rmse:.4f}")
print(f"  Iterations: {mlp_reg.n_iter_}")
10.5 Image Classification Example
Digits Dataset (scikit-learn's 8×8 handwritten digits, a small MNIST-style benchmark)
8×8 pixel grayscale images
10 classes (digits 0-9)
Classic benchmark
Network Design
Input: 64 features (8×8 pixels)
Hidden layers: Learn digit features
Output: 10 classes (softmax)
# Digits Classification
from sklearn.datasets import load_digits
from sklearn.metrics import accuracy_score
import seaborn as sns

digits = load_digits()
X_digits, y_digits = digits.data, digits.target
X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(
    X_digits, y_digits, test_size=0.3, random_state=42)

# Scale
scaler_d = StandardScaler()
X_train_d_scaled = scaler_d.fit_transform(X_train_d)
X_test_d_scaled = scaler_d.transform(X_test_d)

# Train network
mlp_digits = MLPClassifier(hidden_layer_sizes=(128, 64, 32),
                           activation='relu',
                           max_iter=500,
                           early_stopping=True,
                           random_state=42)
mlp_digits.fit(X_train_d_scaled, y_train_d)

y_pred_d = mlp_digits.predict(X_test_d_scaled)
accuracy = accuracy_score(y_test_d, y_pred_d)

# Visualize sample predictions
fig = plt.figure(figsize=(14, 8))
gs = fig.add_gridspec(3, 4)
for i in range(12):
    ax = fig.add_subplot(gs[i // 4, i % 4])
    ax.imshow(X_test_d[i].reshape(8, 8), cmap='gray')
    color = 'green' if y_pred_d[i] == y_test_d[i] else 'red'
    ax.set_title(f'True: {y_test_d[i]}, Pred: {y_pred_d[i]}', color=color)
    ax.axis('off')
plt.suptitle(f'Digits Classification (Test Accuracy: {accuracy:.3f})', fontsize=14)
plt.tight_layout()
plt.show()

# Confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test_d, y_pred_d)
plt.figure(figsize=(10, 8))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues')
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix - Digits')
plt.show()

print("\n📊 Digits Results:")
print(f"  Test Accuracy: {accuracy:.4f}")
print(f"  Network: {mlp_digits.hidden_layer_sizes}")
print(f"  Parameters: {sum(w.size for w in mlp_digits.coefs_):,}")
Key Takeaways
When to Use Neural Networks
Good For:
✅ Large datasets (n > 1000)
✅ Complex non-linear patterns
✅ Images, text, audio
✅ Many features
✅ When accuracy is paramount
Not Ideal For:
❌ Small datasets (n < 500)
❌ Need interpretability
❌ Linear relationships
❌ Tabular data (try RF/XGBoost first)
❌ Limited computational resources
Hyperparameter Guidelines
Architecture:
Start simple: (50,) or (100, 50)
Go deeper if needed: (128, 64, 32)
More data → can use more layers/units
Activation:
Use ReLU (default, works well)
Try tanh if ReLU doesn't work
Output: softmax (classification), linear (regression)
Regularization (alpha):
Start: 0.0001 (default)
If overfitting: increase to 0.001, 0.01
If underfitting: decrease to 0.00001
Learning Rate:
Default: 0.001 (usually good)
If not converging: decrease to 0.0001
If too slow: increase to 0.01
Training:
Use early_stopping=True
Set max_iter=500 or 1000
Monitor the loss curve
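To tune these settings systematically, here is a minimal grid-search sketch over architecture, alpha, and learning rate; the grid values are illustrative choices, and X_train_scaled / y_train are assumed to exist from the earlier examples.

# Hedged sketch: grid search over MLP hyperparameters (illustrative values)
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

param_grid = {
    'hidden_layer_sizes': [(50,), (100, 50), (128, 64, 32)],
    'alpha': [0.0001, 0.001, 0.01],
    'learning_rate_init': [0.0001, 0.001, 0.01],
}

grid = GridSearchCV(
    MLPClassifier(max_iter=1000, early_stopping=True, random_state=42),
    param_grid, cv=3, n_jobs=-1)
grid.fit(X_train_scaled, y_train)

print("Best parameters:", grid.best_params_)
print("Best CV accuracy:", round(grid.best_score_, 3))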
Best Practices
1. Always Scale Features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
2. Use Early Stopping
MLPClassifier(early_stopping=True, validation_fraction=0.1)
3. Start Simple, Then Increase Complexity
# Start
MLPClassifier(hidden_layer_sizes=(50,))
# Then try
MLPClassifier(hidden_layer_sizes=(100, 50))
# Then
MLPClassifier(hidden_layer_sizes=(128, 64, 32))
4. Monitor Training
mlp.fit(X_train, y_train)
plt.plot(mlp.loss_curve_)
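5. Consider an sklearn Pipeline
Since sklearn MLPs integrate with pipelines (see below), a minimal sketch bundles scaling and the network so the scaler is fit only on training folds; it reuses the circles data X, y from earlier, and the layer sizes are an illustrative choice.

# Hedged sketch: scaling + MLP in a single sklearn Pipeline
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

pipe = make_pipeline(
    StandardScaler(),                       # scaling is refit inside each CV fold
    MLPClassifier(hidden_layer_sizes=(100, 50),
                  early_stopping=True, max_iter=1000,
                  random_state=42)
)
scores = cross_val_score(pipe, X, y, cv=5)  # X, y: circles data from the first example
print("CV accuracy:", round(scores.mean(), 3))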
Comparison: NN vs Other Methods
| Dataset Type | Best Choice | Alternative |
|---|---|---|
| Tabular (small) | Random Forest | XGBoost |
| Tabular (large) | XGBoost | Neural Network |
| Images | CNN | ResNet, ViT |
| Text | Transformer | LSTM, GRU |
| Time Series | LSTM/GRU | ARIMA, XGBoost |
Common Pitfalls
❌ Not scaling features → poor convergence (see the sketch below)
❌ Too complex an architecture → overfitting
❌ No regularization → overfitting
❌ Too few iterations → underfitting
❌ Wrong activation → poor performance
❌ Not using validation → can't detect overfitting
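As a quick check of the first pitfall, this hedged sketch compares an MLP trained on raw versus standardized features; the synthetic data is an illustrative choice, and the exact accuracy gap will vary.

# Hedged sketch: effect of feature scaling on MLP convergence
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

Xc, yc = make_classification(n_samples=1000, n_features=20, random_state=0)
Xc[:, 0] *= 1000                      # put one feature on a much larger scale
Xc_tr, Xc_te, yc_tr, yc_te = train_test_split(Xc, yc, random_state=0)

raw = MLPClassifier(max_iter=500, random_state=0).fit(Xc_tr, yc_tr)

sc = StandardScaler().fit(Xc_tr)
scaled = MLPClassifier(max_iter=500, random_state=0).fit(sc.transform(Xc_tr), yc_tr)

print("Unscaled test accuracy:", round(raw.score(Xc_te, yc_te), 3))
print("Scaled test accuracy:  ", round(scaled.score(sc.transform(Xc_te), yc_te), 3))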
sklearn MLPClassifier Limitations
For production deep learning, consider:
PyTorch: Flexible, research-friendly
TensorFlow/Keras: Production-ready
JAX: High-performance
sklearn is good for:
Quick prototypes
Simple feedforward networks
Integration with sklearn pipelines
Next Chapter
Chapter 11: Survival Analysis
Survival and Censoring
Kaplan-Meier Estimator
Cox Proportional Hazards Model
Extensions
Practice Exercises
Exercise 1: Architecture Search
Generate classification data with make_classification
Test architectures: (10,), (50,), (100,), (50, 25), (100, 50, 25)
Plot: depth vs accuracy, width vs accuracy
Find sweet spot for this dataset
Exercise 2: Activation Functions
Using the circles dataset:
Train networks with: relu, tanh, logistic
Compare convergence speed (iterations to 95% accuracy)
Visualize decision boundaries for each
Explain differences
Exercise 3: Regularization Study
Generate data with noise
Test alpha: [0, 0.0001, 0.001, 0.01, 0.1]
Plot learning curves for each
Find optimal regularization
Visualize weight distributions
Exercise 4: Learning Rate Impact
Use digits dataset
Test learning_rate_init: [0.0001, 0.001, 0.01, 0.1]
Plot loss curves
Identify: too slow, just right, unstable
Exercise 5: Regression Challenge
Create synthetic data: y = sin(x₁) + x₂² + x₃ + noise
Train neural network
Compare with linear regression, random forest
Analyze where NN excels
Visualize predictions vs actual
Exercise 6: Real Dataset
Use your choice of dataset:
Implement full pipeline (scaling, train, validate)
Grid search hyperparameters
Plot learning curves
Compare with 2 other algorithms
Provide recommendation with justification