Chapter 1: But What is a Neural Network?
From Pixels to Predictions: The Architecture of Learning
A neural network is a mathematical function, built from simple, repeated operations, that learns to map inputs to outputs by adjusting thousands or millions of internal parameters. The classic introductory problem is MNIST digit recognition: given a 28x28 grayscale image of a handwritten digit (784 pixel values, each between 0 and 1), the network must output which digit (0-9) it represents.
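As a concrete sketch (using a synthetic array in place of a real MNIST sample), the input preprocessing amounts to scaling the 28x28 pixel grid into [0, 1] and flattening it into the 784-dimensional input vector:

```python
import numpy as np

# Synthetic stand-in for one MNIST image: 28x28 grayscale values in [0, 255]
rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(28, 28))

# Normalize to [0, 1] and flatten to the 784-dimensional input vector
x = (image / 255.0).reshape(-1)

print(x.shape)  # (784,)
```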
What makes neural networks remarkable is that nobody programs the rules for recognizing digits. Instead, the network discovers its own internal representations through exposure to thousands of labeled examples. Early layers might learn to detect edges and curves, middle layers compose these into loops and strokes, and the final layer assembles them into digit identities. This hierarchical feature learning is what separates neural networks from traditional hand-engineered classifiers, and it scales to problems far more complex than digit recognition, from medical imaging to natural language understanding.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (16, 10)
np.random.seed(42)
Layers, Neurons, and Activations: The Building Blocks
A neural network is organized into layers of neurons. For MNIST, a simple architecture might use 784 input neurons (one per pixel), two hidden layers of 16 neurons each, and 10 output neurons (one per digit class). Each neuron holds an activation value between 0 and 1, representing how "on" or "off" that feature detector is.
The connections between layers are where the learning happens. Each connection carries a weight \(w\), and each neuron has a bias \(b\). A neuron computes a weighted sum of all incoming activations, adds its bias, then applies a nonlinear activation function: \(a^{(l)}_j = \sigma\left(\sum_i w^{(l)}_{ji} a^{(l-1)}_i + b^{(l)}_j\right)\).
In matrix notation for an entire layer: \(\vec{a}^{(l)} = \sigma(W^{(l)} \vec{a}^{(l-1)} + \vec{b}^{(l)})\). The weight matrix \(W\) encodes which patterns from the previous layer each neuron cares about, while biases set the threshold for activation. The total number of learnable parameters (weights + biases) determines the network's capacity: for this architecture, 13,002 parameters (12,960 weights and 42 biases) must be tuned during training.
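The layer equation above can be sketched as a minimal NumPy forward pass for the 784 → 16 → 16 → 10 architecture. The small-random-normal initialization here is an illustrative choice, not a prescribed scheme:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)
sizes = [784, 16, 16, 10]

# One (W, b) pair per layer transition; small random weights, zero biases
weights = [rng.normal(0, 0.1, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

def forward(a):
    # a^(l) = sigma(W^(l) a^(l-1) + b^(l)), applied layer by layer
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

a0 = rng.random(784)   # stand-in pixel activations in [0, 1)
out = forward(a0)
print(out.shape)       # (10,)

n_params = sum(W.size + b.size for W, b in zip(weights, biases))
print(n_params)        # 13002
```

Until the weights are trained, the 10 output activations are meaningless; the point is only the shape of the computation.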
# Summarize the architecture and count learnable parameters
sizes = [784, 16, 16, 10]
n_params = sum(n_in * n_out + n_out for n_in, n_out in zip(sizes[:-1], sizes[1:]))
print("Network: 784 → 16 → 16 → 10")
print(f"Total parameters: {n_params:,}")  # 13,002
The Activation Function: Introducing Nonlinearity
Without activation functions, stacking layers would be pointless: a sequence of linear transformations is just another linear transformation (\(W_2 W_1 \vec{x} = W_{combined} \vec{x}\)). The sigmoid function \(\sigma(x) = \frac{1}{1 + e^{-x}}\) was historically the first widely used activation function. It smoothly squashes any real number into the range \((0, 1)\), which can be interpreted as a probability or a "firing rate."
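The collapse of stacked linear layers is easy to verify numerically: applying two random weight matrices in sequence gives the same result as applying their product once (a standalone sketch, not tied to any trained model):

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 784))
W2 = rng.normal(size=(10, 16))
x = rng.normal(size=784)

# Two linear layers applied in sequence...
two_layers = W2 @ (W1 @ x)
# ...equal a single linear layer with the combined matrix W2 W1
combined = (W2 @ W1) @ x

print(np.allclose(two_layers, combined))  # True
```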
For large positive inputs, \(\sigma(x) \approx 1\) (neuron is "on"); for large negative inputs, \(\sigma(x) \approx 0\) (neuron is "off"); and near zero, the transition is smooth and differentiable. This differentiability is critical for training via gradient descent. Modern networks often use ReLU (\(\max(0, x)\)) instead of sigmoid because it avoids the "vanishing gradient" problem, where sigmoid's flat tails cause gradients to shrink to near-zero in deep networks, making learning extremely slow.
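The saturation effect can be checked directly. Sigmoid's derivative is \(\sigma'(x) = \sigma(x)(1 - \sigma(x))\), which peaks at 0.25 and collapses toward zero in the tails, while ReLU's gradient stays at 1 for any positive input (a small numeric sketch):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    # sigma'(x) = sigma(x) * (1 - sigma(x)); maximum value is 0.25 at x = 0
    s = sigmoid(x)
    return s * (1.0 - s)

def relu_grad(x):
    # ReLU's gradient is 1 for positive inputs, 0 otherwise
    return 1.0 if x > 0 else 0.0

for x in [0.0, 2.0, 10.0]:
    print(f"x={x:5.1f}  sigmoid'={sigmoid_grad(x):.6f}  relu'={relu_grad(x):.0f}")
```

At x = 10 the sigmoid gradient is already on the order of 1e-5: a deep chain of such factors is what makes sigmoid networks slow to train.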
Looking Ahead: From Structure to Learning
Understanding the architecture is only half the story. A randomly initialized neural network produces garbage outputs; it needs to learn by adjusting its weights and biases to minimize prediction errors. The next chapter introduces gradient descent, the optimization algorithm that makes this possible. The key mathematical challenge: with roughly 13,000 parameters, how do you figure out which direction to nudge each one? The answer involves calculus (derivatives), linear algebra (matrix operations), and a clever algorithm called backpropagation that computes all necessary gradients in a single backward pass through the network.