import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset
from torchvision import datasets, transforms

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons

# Report the runtime environment before anything else.
print(f"PyTorch version: {torch.__version__}")
cuda_ok = torch.cuda.is_available()
print(f"CUDA available: {cuda_ok}")
if cuda_ok:
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")

# Fix both RNGs so every run of the notebook produces identical results.
SEED = 42
torch.manual_seed(SEED)
np.random.seed(SEED)

print("\n✅ Imports successful!")

Creating TensorsΒΆ

Tensors are the fundamental data structure in PyTorch – multi-dimensional arrays that generalize scalars, vectors, and matrices to arbitrary dimensions. Unlike NumPy arrays, PyTorch tensors can live on a GPU for hardware-accelerated computation and can track the operations applied to them so that gradients are computed automatically during backpropagation. You can create tensors from Python lists, from NumPy arrays, or using factory functions like torch.randn (random normal) and torch.zeros. Every tensor has a shape (its dimensions) and a dtype (its numeric type, e.g., float32). In practice, all neural-network inputs, weights, and outputs are tensors.

# --- Creating tensors ---

# Directly from a Python list; float literals give dtype float32.
x = torch.tensor([1.0, 2.0, 3.0])
print(f"1D tensor: {x}")
print(f"Shape: {x.shape}, Dtype: {x.dtype}")

# From NumPy: from_numpy() wraps the array's memory; .float() converts
# the int64 array to float32 (producing a new tensor).
np_array = np.array([[1, 2], [3, 4]])
x = torch.from_numpy(np_array).float()
print(f"\nFrom NumPy:\n{x}")

# Factory function: samples from a standard normal distribution.
x_rand = torch.randn(3, 4)  # Normal distribution
print(f"\nRandom (3x4):\n{x_rand}")

# Constant-filled tensors.
x_zeros = torch.zeros(2, 3)
x_ones = torch.ones(2, 3)
print(f"\nZeros:\n{x_zeros}")
print(f"\nOnes:\n{x_ones}")

# *_like factories copy the shape/dtype/device of another tensor.
x_like = torch.ones_like(x_rand)
# Fix: the label promised the tensor but only its shape was printed.
print(f"\nOnes like x_rand (shape {x_like.shape}):\n{x_like}")

Tensor OperationsΒΆ

Neural networks are fundamentally sequences of tensor operations. Element-wise operations (addition, multiplication) apply independently to each entry, while matrix multiplication (@ or torch.matmul) combines rows and columns following linear algebra rules – this is the core operation inside every nn.Linear layer. Reshaping with view or reshape rearranges elements without copying data, which is essential when converting image grids into flat vectors or reorganizing batch dimensions. Aggregation functions like mean, sum, and argmax reduce tensors along specified dimensions and are used constantly in loss computation and metric calculation.

# Two 2x2 matrices used throughout the demos below.
mat_a = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
mat_b = torch.tensor([[5.0, 6.0], [7.0, 8.0]])

print("Matrix A:")
print(mat_a)
print("\nMatrix B:")
print(mat_b)

# Element-wise arithmetic applies independently to each entry.
print(f"\nA + B:\n{mat_a + mat_b}")
print(f"\nA * B (element-wise):\n{mat_a * mat_b}")

# True matrix multiplication (rows x columns); @ is shorthand for matmul.
print(f"\nA @ B (matrix multiply):\n{torch.matmul(mat_a, mat_b)}")
print(f"\nAlso: a @ b:\n{mat_a @ mat_b}")

# Transpose swaps the two dimensions.
print(f"\nA.T (transpose):\n{mat_a.T}")

# Reshaping rearranges the same 12 elements without copying data.
x = torch.arange(12)
print(f"\nOriginal: {x}")
print(f"Reshaped (3x4):\n{x.reshape(3, 4)}")
print(f"Reshaped (2x6):\n{x.reshape(2, 6)}")

# Reductions collapse a tensor to a single value.
x = torch.tensor([1.0, 2.0, 3.0, 4.0])
print(f"\nMean: {x.mean()}")
print(f"Sum: {x.sum()}")
print(f"Max: {x.max()}")
print(f"Argmax: {x.argmax()}")

2. Automatic Differentiation (Autograd)ΒΆ

PyTorch automatically computes gradients! No need to implement backpropagation manually.

# A scalar leaf tensor that autograd will track.
x = torch.tensor([2.0], requires_grad=True)
print(f"x = {x}")
print(f"requires_grad = {x.requires_grad}")

# Building y records the computation graph for y = x^2 + 3x + 5.
poly = lambda t: t ** 2 + 3 * t + 5
y = poly(x)
print(f"\ny = x^2 + 3x + 5 = {y}")

# backward() walks the recorded graph and fills x.grad with dy/dx.
y.backward()  # Compute dy/dx

print(f"\nGradient dy/dx = {x.grad}")
print(f"Expected: 2x + 3 = 2(2) + 3 = 7 ✓")
# A slightly larger graph: a dot product plus bias, then squared.
x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -0.3, 0.7], requires_grad=True)
b = torch.tensor([0.1], requires_grad=True)

# Forward pass: y = w·x + b, followed by loss = y^2.
y = torch.dot(w, x) + b
loss = y ** 2

for name, value in (("x", x), ("w", w), ("b", b)):
    print(f"{name} = {value}")
print(f"\ny = w·x + b = {y.item():.4f}")
print(f"loss = y^2 = {loss.item():.4f}")

# One backward call populates .grad on every leaf in the graph.
loss.backward()

print(f"\nGradients:")
print(f"  dloss/dw = {w.grad}")
print(f"  dloss/db = {b.grad}")
print(f"  dloss/dx = {x.grad}")

Gradient Accumulation and ZeroingΒΆ

By default PyTorch accumulates gradients: each call to backward() adds to the existing .grad attribute rather than replacing it. This behavior is intentional – it allows gradient accumulation across mini-batches when GPU memory is limited. However, in a standard training loop you must call optimizer.zero_grad() (or manually zero each parameter’s .grad) before computing new gradients, otherwise the gradients from previous iterations will contaminate the current update. Forgetting this step is one of the most common PyTorch bugs.

# Demonstration 1: backward() ADDS into .grad on every call.
x = torch.tensor([2.0], requires_grad=True)

for step in range(1, 4):
    y = x ** 2
    y.backward()
    # dy/dx = 4 every time, so .grad grows: 4, 8, 12.
    print(f"Iteration {step}: x.grad = {x.grad}")

print("\n⚠️  Gradients accumulated! Always zero them in training loops.")

# Demonstration 2: reset the gradient before each backward pass.
x = torch.tensor([2.0], requires_grad=True)

for step in range(1, 4):
    if x.grad is not None:
        x.grad.zero_()  # in-place reset so backward() starts fresh

    y = x ** 2
    y.backward()
    print(f"Iteration {step}: x.grad = {x.grad}")

print("\n✅ Gradients zeroed properly!")

3. Building Neural Networks with nn.ModuleΒΆ

The PyTorch Way to Define ModelsΒΆ

nn.Module is the base class for all neural network components in PyTorch. By subclassing it you declare your layers in __init__ and describe the forward computation in forward. PyTorch then automatically provides parameter management (.parameters()), device transfer (.to(device)), mode switching (.train() / .eval()), and serialization (.state_dict()). Every layer you assign as an attribute – such as nn.Linear, nn.Conv2d, or nn.LSTM – is registered as a sub-module, and its weights become part of the model’s parameter set. This modular design lets you compose complex architectures from simple, reusable building blocks.

class SimpleNet(nn.Module):
    """Small fully connected binary classifier.

    Architecture: 2 → 8 → 8 → 1, ReLU on the hidden layers and a
    sigmoid on the output so predictions land in (0, 1).
    """

    def __init__(self):
        super().__init__()
        # Layer attributes keep the fc1/fc2/fc3 names so the
        # state_dict layout is unchanged.
        self.fc1 = nn.Linear(2, 8)
        self.fc2 = nn.Linear(8, 8)
        self.fc3 = nn.Linear(8, 1)

    def forward(self, x):
        """Map a (batch, 2) input to (batch, 1) probabilities."""
        hidden = F.relu(self.fc1(x))
        hidden = F.relu(self.fc2(hidden))
        return torch.sigmoid(self.fc3(hidden))

# Instantiate and inspect the network.
model = SimpleNet()
print(model)

# Parameter bookkeeping: numel() counts the entries of each
# weight/bias tensor registered on the module.
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(
    p.numel() for p in model.parameters() if p.requires_grad
)

print(f"\nTotal parameters: {total_params}")
print(f"Trainable parameters: {trainable_params}")

# Smoke-test the forward pass on a random mini-batch.
x_test = torch.randn(5, 2)  # Batch of 5 samples, 2 features each
output = model(x_test)
print(f"\nInput shape: {x_test.shape}")
print(f"Output shape: {output.shape}")
print(f"Output:\n{output}")

Alternative: Sequential APIΒΆ

When your model is a simple chain of layers with no branching or skip connections, nn.Sequential provides a concise shorthand. You pass the layers in order and PyTorch chains them together automatically – no need to write a forward method. Under the hood nn.Sequential is itself an nn.Module, so it still supports .parameters(), .to(device), and all other module features. Use it for quick prototyping; switch to the full nn.Module subclass when you need custom logic like residual connections or multiple outputs.

# The same 2 → 8 → 8 → 1 network, declared as a flat layer pipeline.
layers = [
    nn.Linear(2, 8), nn.ReLU(),
    nn.Linear(8, 8), nn.ReLU(),
    nn.Linear(8, 1), nn.Sigmoid(),
]
model_seq = nn.Sequential(*layers)

print("Sequential model:")
print(model_seq)

# Same forward-pass smoke test as before.
output_seq = model_seq(x_test)
print(f"\nOutput shape: {output_seq.shape}")

4. Training a Neural Network – The Complete LoopΒΆ

From Data to PredictionsΒΆ

Training in PyTorch follows a consistent four-step rhythm inside each mini-batch: (1) forward pass – feed a batch through the model to get predictions, (2) loss computation – compare predictions to ground truth using a criterion like nn.BCELoss, (3) backward pass – call loss.backward() to compute gradients for every parameter, and (4) optimizer step – call optimizer.step() to update weights. We wrap the data in a DataLoader that handles batching and shuffling. This pattern is the backbone of virtually every PyTorch training script, from simple classifiers to billion-parameter language models.

# Generate data
# Two interleaving half-moons: a small 2-D dataset that is not linearly
# separable, so the hidden layers actually matter.
X, y = make_moons(n_samples=400, noise=0.15, random_state=42)

# Convert to PyTorch tensors
# BCELoss needs float targets shaped (N, 1) to match the model's output.
X_tensor = torch.FloatTensor(X)
y_tensor = torch.FloatTensor(y).reshape(-1, 1)

# Create dataset and dataloader
# TensorDataset pairs each sample with its label; DataLoader handles
# batching and reshuffles every epoch because shuffle=True.
dataset = TensorDataset(X_tensor, y_tensor)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

print(f"Dataset size: {len(dataset)}")
print(f"Batch size: 32")
print(f"Number of batches: {len(train_loader)}")

# Visualize data
# Boolean masks (y==0 / y==1) select the rows of each class.
plt.figure(figsize=(8, 6))
plt.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', alpha=0.6, edgecolors='k')
plt.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', alpha=0.6, edgecolors='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Binary Classification Data')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

Complete Training LoopΒΆ

Below is the canonical PyTorch training loop. Notice the four key lines inside the inner loop: model(batch_X) runs the forward pass, criterion(outputs, batch_y) computes the loss, loss.backward() populates .grad on every parameter, and optimizer.step() applies the Adam update rule. We track loss and accuracy per epoch to monitor convergence. If the training loss decreases smoothly while accuracy climbs toward 100%, the optimizer is successfully navigating the loss landscape toward a good minimum.

# Fresh model for the moons task.
model = SimpleNet()

# BCELoss expects probabilities in (0, 1) — SimpleNet's sigmoid output
# satisfies that.
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss

# Adam maintains per-parameter adaptive learning rates.
optimizer = optim.Adam(model.parameters(), lr=0.01)

epochs = 100
losses = []       # mean loss per epoch
accuracies = []   # accuracy per epoch

print("Training...\n")

for epoch in range(1, epochs + 1):
    running_loss = 0.0
    n_correct = 0
    n_seen = 0

    for batch_X, batch_y in train_loader:
        # Forward: predictions and loss for this mini-batch.
        outputs = model(batch_X)
        loss = criterion(outputs, batch_y)

        # Backward: clear stale grads, backprop, apply the update.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate epoch metrics; threshold at 0.5 for class labels.
        running_loss += loss.item()
        predictions = (outputs > 0.5).float()
        n_correct += (predictions == batch_y).sum().item()
        n_seen += batch_y.size(0)

    avg_loss = running_loss / len(train_loader)
    accuracy = n_correct / n_seen
    losses.append(avg_loss)
    accuracies.append(accuracy)

    if epoch % 20 == 0:
        print(f"Epoch [{epoch}/{epochs}] - Loss: {avg_loss:.4f}, Accuracy: {accuracy*100:.2f}%")

print("\n✅ Training complete!")
# Plot training curves
# Side-by-side loss and accuracy curves from the run above.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Loss
# A smooth downward curve indicates stable optimization.
ax1.plot(losses, 'b-', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss')
ax1.grid(True, alpha=0.3)

# Accuracy
# Fixed [0, 1] axis makes runs comparable across experiments.
ax2.plot(accuracies, 'g-', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy')
ax2.set_title('Training Accuracy')
ax2.set_ylim([0, 1])
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

5. Modern OptimizersΒΆ

Beyond Vanilla Gradient DescentΒΆ

Stochastic Gradient Descent (SGD) updates weights proportionally to the gradient, but it can oscillate in narrow valleys and converge slowly. Modern optimizers address this with two key ideas: momentum (accumulate a running average of past gradients to smooth updates) and adaptive learning rates (scale the step size per-parameter based on historical gradient magnitudes). Adam (Adaptive Moment Estimation) combines both, maintaining running estimates of the first moment (mean) and second moment (variance) of each gradient. It is the default choice for most deep learning tasks because it converges quickly with minimal hyperparameter tuning – typically just the learning rate.

def compare_optimizers(X_tensor, y_tensor, epochs=50):
    """Train identical fresh models with several optimizers.

    Returns a dict mapping optimizer name -> list of per-epoch mean
    losses, so convergence speed can be compared side by side.
    """
    configs = [
        ('SGD', lambda params: optim.SGD(params, lr=0.1)),
        ('SGD + Momentum', lambda params: optim.SGD(params, lr=0.1, momentum=0.9)),
        ('RMSprop', lambda params: optim.RMSprop(params, lr=0.01)),
        ('Adam', lambda params: optim.Adam(params, lr=0.01)),
    ]

    results = {}
    criterion = nn.BCELoss()  # stateless, safe to share across runs

    for name, make_optimizer in configs:
        # Fresh weights per optimizer for a fair comparison.
        net = SimpleNet()
        optimizer = make_optimizer(net.parameters())

        loader = DataLoader(
            TensorDataset(X_tensor, y_tensor), batch_size=32, shuffle=True
        )

        history = []
        for _ in range(epochs):
            total = 0.0
            for batch_X, batch_y in loader:
                preds = net(batch_X)
                loss = criterion(preds, batch_y)

                optimizer.zero_grad()
                loss.backward()
                optimizer.step()

                total += loss.item()

            history.append(total / len(loader))

        results[name] = history
        print(f"{name}: Final loss = {history[-1]:.4f}")

    return results

# Compare
print("Comparing optimizers...\n")
results = compare_optimizers(X_tensor, y_tensor, epochs=50)

# Plot comparison
plt.figure(figsize=(12, 6))
for name, losses in results.items():
    plt.plot(losses, label=name, linewidth=2)

plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.title('Optimizer Comparison')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()

print("\nπŸ“Š Key Insights:")
print("  β€’ Adam: Usually fastest convergence, good default choice")
print("  β€’ SGD + Momentum: Better than vanilla SGD")
print("  β€’ RMSprop: Good for RNNs, adaptive learning rate")
print("  β€’ Vanilla SGD: Slower but sometimes better generalization")

6. Real Dataset – MNIST Digit ClassificationΒΆ

Your First Image Classification TaskΒΆ

MNIST is the β€œHello World” of computer vision: 60,000 training images and 10,000 test images of handwritten digits (0-9), each 28x28 grayscale pixels. Despite its simplicity, it exercises every concept we have covered – data loading with DataLoader, model definition with nn.Module, loss computation with nn.CrossEntropyLoss, and optimization with Adam. We apply transforms.Normalize to center pixel values around zero, which helps gradient-based training converge faster. A well-tuned fully connected network can reach over 98% accuracy on MNIST; convolutional networks push past 99%.

# Download MNIST dataset
# ToTensor scales pixels to [0, 1]; Normalize then standardizes them
# with the dataset-wide mean/std so inputs are roughly zero-centered.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,))  # MNIST mean and std
])

# download=True fetches the files on the first run; later runs reuse
# the cached copy under ./data.
train_dataset = datasets.MNIST(
    root='./data',
    train=True,
    download=True,
    transform=transform
)

test_dataset = datasets.MNIST(
    root='./data',
    train=False,
    download=True,
    transform=transform
)

# Shuffle only the training set; evaluation order doesn't matter, and a
# large test batch speeds up inference.
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=1000, shuffle=False)

print(f"Training samples: {len(train_dataset)}")
print(f"Test samples: {len(test_dataset)}")
print(f"Batch size: 64")
print(f"Number of classes: 10 (digits 0-9)")
# Visualize some samples
# Each dataset item is a (1, 28, 28) tensor; squeeze() drops the
# channel dimension so imshow gets a 2-D array.
fig, axes = plt.subplots(2, 5, figsize=(12, 5))
axes = axes.ravel()

for i in range(10):
    image, label = train_dataset[i]
    axes[i].imshow(image.squeeze(), cmap='gray')
    axes[i].set_title(f'Label: {label}')
    axes[i].axis('off')

plt.tight_layout()
plt.show()

Build MNIST ClassifierΒΆ

The MNISTNet model first flattens each 28x28 image into a 784-dimensional vector, then passes it through two hidden layers (128 and 64 neurons) with ReLU activations. Dropout (nn.Dropout(0.2)) randomly zeroes 20% of neurons during training, which acts as a regularizer by preventing the network from relying too heavily on any single neuron. The output layer produces 10 raw logits (one per digit class); we will pair these with nn.CrossEntropyLoss, which internally applies softmax and computes the negative log-likelihood – the standard loss for multi-class classification.

class MNISTNet(nn.Module):
    """Fully connected MNIST classifier.

    Flattens each 28x28 image to 784 features, applies two ReLU hidden
    layers (128 and 64 units) with dropout, and emits 10 raw logits —
    pair with nn.CrossEntropyLoss, which applies softmax internally.
    """

    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.fc3 = nn.Linear(64, 10)
        # Dropout zeroes 20% of activations during training only.
        self.dropout = nn.Dropout(0.2)

    def forward(self, x):
        """Map (batch, 1, 28, 28) images to (batch, 10) logits."""
        flat = x.view(-1, 28 * 28)
        h = self.dropout(F.relu(self.fc1(flat)))
        h = self.dropout(F.relu(self.fc2(h)))
        return self.fc3(h)  # logits, no softmax here

# Build the classifier and report its size.
model = MNISTNet()
print(model)

# numel() gives the element count of each weight/bias tensor.
param_sizes = (p.numel() for p in model.parameters())
total_params = sum(param_sizes)
print(f"\nTotal parameters: {total_params:,}")

Training and Evaluation FunctionsΒΆ

Separating training and evaluation into reusable functions is a best practice that keeps the main loop clean. The train_epoch function calls model.train() to enable dropout and batch normalization statistics, then iterates over batches. The evaluate function calls model.eval() and wraps inference in torch.no_grad() to disable gradient tracking, which saves memory and speeds up computation. Monitoring both training and test metrics per epoch lets you detect overfitting (training accuracy much higher than test accuracy) and decide when to stop training.

def train_epoch(model, loader, criterion, optimizer, device):
    """Run one optimization pass over `loader`.

    Returns (mean batch loss, accuracy in percent). Switches the model
    to train mode so dropout/batch-norm behave correctly.
    """
    model.train()
    running_loss = 0.0
    hits = 0
    seen = 0

    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)

        # Forward pass and loss.
        outputs = model(images)
        loss = criterion(outputs, labels)

        # Standard update: clear grads, backprop, step.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accuracy bookkeeping: argmax over the class dimension.
        running_loss += loss.item()
        predicted = outputs.argmax(dim=1)
        seen += labels.size(0)
        hits += (predicted == labels).sum().item()

    return running_loss / len(loader), 100. * hits / seen

def evaluate(model, loader, criterion, device):
    """Compute (mean batch loss, accuracy %) without updating weights.

    Switches to eval mode (disables dropout) and turns off gradient
    tracking to save memory and time during inference.
    """
    model.eval()
    running_loss = 0.0
    hits = 0
    seen = 0

    with torch.no_grad():  # inference only — no graph needed
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)

            outputs = model(images)
            running_loss += criterion(outputs, labels).item()

            predicted = outputs.argmax(dim=1)
            seen += labels.size(0)
            hits += (predicted == labels).sum().item()

    return running_loss / len(loader), 100. * hits / seen

# Setup
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Using device: {device}")

model = MNISTNet().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

# Train
epochs = 10
train_losses = []
train_accs = []
test_losses = []
test_accs = []

print("\nTraining MNIST classifier...\n")

for epoch in range(epochs):
    train_loss, train_acc = train_epoch(model, train_loader, criterion, optimizer, device)
    test_loss, test_acc = evaluate(model, test_loader, criterion, device)
    
    train_losses.append(train_loss)
    train_accs.append(train_acc)
    test_losses.append(test_loss)
    test_accs.append(test_acc)
    
    print(f"Epoch {epoch+1}/{epochs}:")
    print(f"  Train Loss: {train_loss:.4f}, Train Acc: {train_acc:.2f}%")
    print(f"  Test Loss:  {test_loss:.4f}, Test Acc:  {test_acc:.2f}%")

print("\nβœ… Training complete!")
# Plot results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Loss
ax1.plot(train_losses, 'b-', label='Train Loss', linewidth=2)
ax1.plot(test_losses, 'r-', label='Test Loss', linewidth=2)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training and Test Loss')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Accuracy
ax2.plot(train_accs, 'b-', label='Train Accuracy', linewidth=2)
ax2.plot(test_accs, 'r-', label='Test Accuracy', linewidth=2)
ax2.set_xlabel('Epoch')
ax2.set_ylabel('Accuracy (%)')
ax2.set_title('Training and Test Accuracy')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\nπŸ“Š Final Performance:")
print(f"  Test Accuracy: {test_accs[-1]:.2f}%")
print(f"  Test Loss: {test_losses[-1]:.4f}")

7. Saving and Loading ModelsΒΆ

Model Persistence for Deployment and ResumptionΒΆ

After investing time and compute in training, you want to save the result. torch.save(model.state_dict(), path) serializes only the learned parameters (weights and biases) as a dictionary, keeping file sizes small and decoupled from code changes. To reload, you instantiate the same model class, then call model.load_state_dict(torch.load(path)). This pattern is used everywhere: checkpointing during long training runs, deploying models to production servers, and sharing pre-trained weights with the community (as HuggingFace does for BERT and GPT).

# Persist only the learned parameters (state_dict), not the class code.
torch.save(model.state_dict(), 'mnist_model.pth')
print("✅ Model saved to 'mnist_model.pth'")

# Load: instantiate the same architecture, then restore the weights.
# map_location=device makes the checkpoint portable — weights saved on
# a GPU machine would otherwise fail to load on a CPU-only one.
loaded_model = MNISTNet()
loaded_model.load_state_dict(torch.load('mnist_model.pth', map_location=device))
loaded_model.to(device)
loaded_model.eval()  # inference mode: disable dropout
print("✅ Model loaded successfully")

# Sanity check: the restored model should reproduce the trained accuracy.
test_loss, test_acc = evaluate(loaded_model, test_loader, criterion, device)
print(f"\nLoaded model accuracy: {test_acc:.2f}%")

SummaryΒΆ

βœ… What You LearnedΒΆ

  1. PyTorch Tensors: Creating and manipulating tensors

  2. Autograd: Automatic differentiation for computing gradients

  3. nn.Module: Building neural networks the PyTorch way

  4. Training Loop: Forward pass, loss, backward pass, optimizer step

  5. Optimizers: SGD, Adam, RMSprop

  6. Real Data: Training on MNIST dataset

  7. Model Persistence: Saving and loading models

πŸ”‘ Key PatternsΒΆ

Model Definition:

class MyModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.layer = nn.Linear(in_features, out_features)
    
    def forward(self, x):
        return self.layer(x)

Training Loop:

for epoch in range(epochs):
    for batch_x, batch_y in dataloader:
        # Forward
        outputs = model(batch_x)
        loss = criterion(outputs, batch_y)
        
        # Backward
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

🎯 What’s Next?ΒΆ

Next notebook: 04_attention_mechanism.ipynb

You’ll learn:

  • What is attention and why it’s revolutionary

  • Scaled dot-product attention

  • Multi-head attention

  • Self-attention vs cross-attention

  • Applications in NLP and vision

πŸ“š Additional ResourcesΒΆ

Fantastic progress! You now know how to build and train neural networks with PyTorch - the foundation for modern deep learning! πŸš€