# Install required packages
!pip install torch torchvision numpy matplotlib scikit-learn transformers datasets tqdm

Verify Installation

Before diving into neural networks, we need to confirm that all core libraries are available and properly configured. PyTorch is the deep learning framework we will use throughout this module for building, training, and evaluating models. NumPy provides the numerical foundation, Matplotlib handles visualization, and HuggingFace Transformers gives us access to state-of-the-art pre-trained models like BERT and GPT. The cell below also checks for CUDA (GPU) availability – GPU acceleration is not required for the exercises, but it dramatically speeds up training for larger models.

import torch
import numpy as np
import matplotlib.pyplot as plt
from transformers import __version__ as transformers_version

print("✅ All packages installed successfully!\n")
print(f"PyTorch version: {torch.__version__}")
print(f"NumPy version: {np.__version__}")
print(f"Transformers version: {transformers_version}")
print(f"\nCUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
else:
    print("Running on CPU (this is fine for learning!)")

🎯 What You’ll Build

By the end of this module, you’ll have built:

Notebook 1: Simple Neural Network

# Binary classifier from scratch
class NeuralNetwork:
    def forward(self, X):
        # Your code!
        pass
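One possible shape for that forward pass, as a hedged NumPy sketch (the class name `TinyNetwork`, the layer sizes, and the single hidden layer are illustrative assumptions; the notebook may structure it differently):

```python
import numpy as np

class TinyNetwork:
    """Minimal binary classifier: one ReLU hidden layer, sigmoid output."""
    def __init__(self, n_in=2, n_hidden=4, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(size=(n_in, n_hidden))
        self.b1 = np.zeros(n_hidden)
        self.W2 = rng.normal(size=(n_hidden, 1))
        self.b2 = np.zeros(1)

    def forward(self, X):
        h = np.maximum(0, X @ self.W1 + self.b1)  # ReLU hidden layer
        z = h @ self.W2 + self.b2                 # linear output layer
        return 1 / (1 + np.exp(-z))               # sigmoid -> probability

net = TinyNetwork()
probs = net.forward(np.random.default_rng(1).normal(size=(5, 2)))
print(probs.shape)  # one probability per input row
```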

Notebook 2: Backpropagation Engine

# Automatic differentiation
loss.backward()  # Compute all gradients!
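To see what that one line does, here is a tiny sketch using PyTorch autograd (not the from-scratch engine the notebook builds, just the same idea): for f(w) = (3w - 6)², the chain rule gives df/dw = 2·(3w - 6)·3, and `backward()` computes exactly that.

```python
import torch

w = torch.tensor(2.5, requires_grad=True)
loss = (w * 3 - 6) ** 2   # f(w) = (3w - 6)^2
loss.backward()           # autograd applies the chain rule
print(w.grad)             # analytic gradient: 2 * (7.5 - 6) * 3 = 9.0
```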

Notebook 3: PyTorch Models

# Modern deep learning
model = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 10)
)

Notebook 4: Attention Mechanism

# The innovation that changed AI
attention_scores = softmax(Q @ K.T / sqrt(d_k))
output = attention_scores @ V
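Those two pseudocode lines map directly onto runnable PyTorch; a minimal sketch with made-up sizes (`seq_len` and `d_k` are illustrative, not from the notebook):

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_k = 4, 8
Q = torch.randn(seq_len, d_k)
K = torch.randn(seq_len, d_k)
V = torch.randn(seq_len, d_k)

scores = Q @ K.T / math.sqrt(d_k)              # scaled dot products
attention_scores = F.softmax(scores, dim=-1)   # each row sums to 1
output = attention_scores @ V                  # weighted mix of values

print(attention_scores.sum(dim=-1))  # rows sum to ~1.0
print(output.shape)                  # same shape as V
```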

Notebook 5: Transformer

# The architecture behind GPT, BERT, etc.
transformer = nn.Transformer(
    d_model=512,
    nhead=8,
    num_encoder_layers=6
)
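A quick usage check with deliberately tiny sizes (these dimensions are illustrative, not the ones above): `nn.Transformer` defaults to sequence-first inputs and returns an output shaped like the target sequence.

```python
import torch
import torch.nn as nn

tiny = nn.Transformer(d_model=16, nhead=2,
                      num_encoder_layers=1, num_decoder_layers=1,
                      dim_feedforward=32)
src = torch.randn(5, 3, 16)  # (source_len, batch, d_model)
tgt = torch.randn(4, 3, 16)  # (target_len, batch, d_model)
out = tiny(src, tgt)
print(out.shape)             # target-shaped: (4, 3, 16)
```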

🧪 Quick Neural Network Demo

Let’s see a neural network in action!

import torch
import torch.nn as nn
import numpy as np
import matplotlib.pyplot as plt

# Generate data: classify points above/below a curve
np.random.seed(42)
torch.manual_seed(42)

n_points = 200
X = np.random.randn(n_points, 2)
y = (X[:, 1] > X[:, 0]**2).astype(float)  # Quadratic boundary

# Convert to PyTorch tensors
X_tensor = torch.FloatTensor(X)
y_tensor = torch.FloatTensor(y).reshape(-1, 1)

# Define a simple neural network
model = nn.Sequential(
    nn.Linear(2, 8),      # Input layer: 2 features → 8 neurons
    nn.ReLU(),            # Activation function
    nn.Linear(8, 8),      # Hidden layer: 8 → 8
    nn.ReLU(),
    nn.Linear(8, 1),      # Output layer: 8 → 1
    nn.Sigmoid()          # Convert to probability
)

# Training setup
criterion = nn.BCELoss()  # Binary cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

# Train the network
losses = []
for epoch in range(1000):
    # Forward pass
    predictions = model(X_tensor)
    loss = criterion(predictions, y_tensor)
    
    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    
    losses.append(loss.item())
    
    if (epoch + 1) % 200 == 0:
        print(f"Epoch {epoch+1}/1000, Loss: {loss.item():.4f}")

print("\n✅ Training complete!")
# Visualize results
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))

# Plot 1: Training loss
ax1.plot(losses)
ax1.set_xlabel('Epoch')
ax1.set_ylabel('Loss')
ax1.set_title('Training Loss Over Time')
ax1.grid(True, alpha=0.3)

# Plot 2: Decision boundary
# Create mesh for decision boundary
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                     np.linspace(y_min, y_max, 100))

# Predict on mesh
mesh_data = torch.FloatTensor(np.c_[xx.ravel(), yy.ravel()])
with torch.no_grad():
    Z = model(mesh_data).numpy().reshape(xx.shape)

# Plot decision boundary
ax2.contourf(xx, yy, Z, levels=20, cmap='RdYlBu', alpha=0.6)
ax2.scatter(X[y==0, 0], X[y==0, 1], c='blue', label='Class 0', edgecolors='k', s=50)
ax2.scatter(X[y==1, 0], X[y==1, 1], c='red', label='Class 1', edgecolors='k', s=50)
ax2.set_xlabel('Feature 1')
ax2.set_ylabel('Feature 2')
ax2.set_title('Neural Network Decision Boundary')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Calculate accuracy
with torch.no_grad():
    predictions = (model(X_tensor) > 0.5).float()
    accuracy = (predictions == y_tensor).float().mean()
    print(f"\n✅ Accuracy: {accuracy.item()*100:.2f}%")

🎉 What Just Happened?

In just a few lines, you:

  1. Created a neural network with 2 hidden layers

  2. Trained it to learn a complex decision boundary

  3. Achieved high accuracy on classification

The network learned the quadratic boundary automatically from data!

📖 Reading Material

Theory Documents (Read First)

  1. intro.md - Neural network fundamentals

    • Neurons, layers, activation functions

    • Forward propagation

    • Loss functions

  2. attention_explained.md - The attention mechanism

    • Why attention matters

    • Self-attention explained

    • Multi-head attention

  3. transformer_architecture.md - Transformer deep dive

    • Encoder-decoder architecture

    • Position embeddings

    • How GPT and BERT differ

🎓 Prerequisites Review

Make sure you’re comfortable with these concepts from earlier phases:

# Mathematics (Phase 0.5)
import numpy as np

# 1. Matrix multiplication (Linear Algebra)
A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])
C = A @ B  # Neural networks are matrix operations!
print("Matrix multiplication:\n", C)

# 2. Derivatives (Calculus)
# f(x) = x^2, derivative f'(x) = 2x
x = 3
derivative = 2 * x  # Backpropagation uses chain rule!
print(f"\nDerivative of x^2 at x=3: {derivative}")

# 3. Probability (Statistics)
# Softmax converts scores to probabilities
scores = np.array([2.0, 1.0, 0.1])
exp_scores = np.exp(scores)
probabilities = exp_scores / exp_scores.sum()
print(f"\nSoftmax probabilities: {probabilities}")
print(f"Sum to 1.0: {probabilities.sum()}")
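The derivative claim above is easy to sanity-check numerically: a centered finite difference of f(x) = x² at x = 3 should closely match the analytic derivative 2x = 6. (This is also how gradient implementations are commonly verified before trusting backpropagation.)

```python
def f(x):
    return x ** 2

x, h = 3.0, 1e-5
numerical = (f(x + h) - f(x - h)) / (2 * h)  # centered finite difference
analytic = 2 * x                             # d/dx of x^2
print(numerical, analytic)                   # both approximately 6.0
```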

🚦 Next Steps

Ready to Start?

Recommended path:

  1. ✅ Watch 3Blue1Brown videos 1-2

  2. ✅ Read intro.md sections 1-3

  3. ✅ Complete 01_neural_network_basics.ipynb

  4. ✅ Watch 3Blue1Brown videos 3-4

  5. ✅ Complete 02_backpropagation_explained.ipynb

  6. ✅ Complete 03_pytorch_fundamentals.ipynb

  7. ✅ Read attention_explained.md

  8. ✅ Complete 04_attention_mechanism.ipynb

  9. ✅ Read transformer_architecture.md

  10. ✅ Complete 05_transformer_architecture.ipynb

Study Tips

  • 📺 Watch videos first - Visual intuition is crucial

  • 📝 Take notes - Write down key equations

  • 💻 Code along - Don’t just read, type the code

  • 🔄 Experiment - Change hyperparameters, see what happens

  • 🤔 Ask why - Understand the purpose of each component

📊 Progress Tracker

Track your learning:

| Milestone                        | Status |
|----------------------------------|--------|
| Watched 3Blue1Brown videos       | ☐      |
| Read intro.md                    | ☐      |
| Completed Notebook 1             | ☐      |
| Completed Notebook 2             | ☐      |
| Completed Notebook 3             | ☐      |
| Read attention_explained.md      | ☐      |
| Completed Notebook 4             | ☐      |
| Read transformer_architecture.md | ☐      |
| Completed Notebook 5             | ☐      |
| Built a custom model             | ☐      |

🎯 Learning Goals

By the end of this module, you should be able to:

  • ✅ Explain how neural networks learn from data

  • ✅ Implement forward and backward propagation

  • ✅ Build and train neural networks with PyTorch

  • ✅ Understand the attention mechanism

  • ✅ Explain transformer architecture

  • ✅ Fine-tune pre-trained models

  • ✅ Connect this to earlier phases (tokenization → embeddings → transformers)

🔗 Helpful Resources

Videos

Interactive

Articles

Documentation

🚀 Let’s Begin!

Start with: 01_neural_network_basics.ipynb

You’re about to understand the technology behind ChatGPT, DALL-E, and all modern AI systems. Exciting times ahead! 🎉