Before a transformer, any neural network. Before any neural network, four things: a way to combine inputs (the affine map), a way to break linearity (the activation), a way to measure wrongness (the loss), and a way to reduce it (gradient descent). Every architecture in the chapters that follow — attention blocks, FFNs, MoE routers, state-space models — composes these four things at greater scale.

We build them in numpy from scratch so that PyTorch’s autograd later reads as shorthand, not magic. Math first, code second, intuition throughout. A reader who already has these primitives in their bones can skim to Chapter 2. A reader who hasn’t computed a backward pass by hand recently should slow down here — what we build is what every transformer training loop is doing underneath the autograd.

Affine + activation

The only way information moves from one layer of a neural network to the next is an affine map: multiply by a matrix, add a vector.

y=Wx+b,WRm×n,  bRmy = Wx + b, \quad W \in \R^{m \times n}, \; b \in \R^m

That single operation does all the work of mixing inputs. WW is what the network learns; bb is a constant offset that lets the map shift away from passing through the origin. With nn inputs and mm outputs, there are mn+mmn + m parameters in this layer, all of them trainable.

A stack of affine maps alone would buy you nothing. Two affine maps composed are still an affine map:

W2(W1x+b1)+b2=(W2W1)x+(W2b1+b2)=Wx+b.W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2) = W' x + b'.

No matter how many layers you stack, the composition collapses to one matrix and one bias. The network would have no more expressive power than a single layer of regression. Depth would buy nothing.

The fix is to break the linearity between affine maps with a pointwise nonlinearity. The dominant choice for hidden layers in modern networks is the rectified linear unit:

ReLU(x)=max(0,x).\text{ReLU}(x) = \max(0, x).

ReLU does three useful things at once. It is cheap (a single comparison per element). It does not saturate for positive inputs (gradients of magnitude 1 flow through unchanged). And it induces sparsity — about half of the activations at a randomly initialized layer are exactly zero, which acts as an implicit regularizer.

There is a small zoo of alternatives. Sigmoid and tanh squash inputs into bounded ranges; they are historical defaults that still appear in attention gates and recurrent cells, but they saturate badly and make deep networks hard to train. GELU and SiLU are smooth ReLU-likes that show up in modern transformer FFN layers — Chapter 5 picks them up when we look at the SwiGLU variant Llama uses. For the implementations in this chapter, ReLU is what every hidden layer uses.

The bias term deserves a sentence of its own. Without it, the affine map y=Wxy = Wx forces the output to pass through the origin: x=0y=0x = 0 \Rightarrow y = 0. For centered data this is benign; for almost everything else it isn’t. Image pixels are in [0,1][0, 1], not symmetric around zero. Token embeddings learn their own offsets but rely on biases downstream to recenter them. The conventional initialization for bb is zero — it contributes nothing to the first forward pass, and gradient descent fills it in.

A note on terminology: “layer” in this chapter means one affine map plus its activation. Some references count them separately, so a “ten-layer MLP” might mean five affine-plus-activation pairs or ten alternating ones. We will be explicit when it matters; in this chapter “two-layer MLP” means two affine maps (with a ReLU between).

A single affine-plus-activation pair, by itself, is a one-layer network. Real architectures stack many of them. To train such a stack we need a way to ask, after one forward pass, how wrong was this? — and a way to translate the answer into an update to every parameter. Those are the next two sections.

Measuring wrongness — loss functions

The forward pass produces a prediction; the loss is a scalar that measures how far that prediction is from the truth. Lower means better. Training reduces to minimize this scalar over parameters. Two losses cover the vast majority of cases.

For regression — when the target yy is a real number — mean squared error is the obvious choice:

LMSE=1Bb=1B(yby^b)2.L_\text{MSE} = \frac{1}{B} \sum_{b=1}^{B} (y_b - \hat{y}_b)^2.

Differentiable everywhere, convex in the prediction, and the gradient with respect to y^\hat{y} is just 2B(y^y)\frac{2}{B}(\hat{y} - y). MSE is what you would invent on a whiteboard if you had not seen losses before.

For classification — when the target is one of KK classes — MSE is the wrong choice. The model produces a distribution over classes, typically as a softmax over logits. MSE between distributions ignores the simplex structure and saturates: once the predicted probability is near zero or one, the squared-error gradient vanishes even when the prediction is confidently wrong. Cross-entropy fixes both problems.

LCE=iyilogpi,p=softmax(z).L_\text{CE} = -\sum_{i} y_i \log p_i, \quad p = \softmax(z).

Here yy is the one-hot encoding of the true class and pp is the predicted probability vector. Because yy is one-hot, the sum picks out a single term: L=logpiL = -\log p_{i^*} for the true class ii^*. The loss is small when the model assigns high probability to the right class and grows without bound as that probability approaches zero.

This is the loss every LLM training loop is minimizing. Next-token prediction is classification over the vocabulary; cross-entropy is the loss; the logit vector at each position is what the transformer produces. A model that has memorized its training corpus assigns probability 1 to the correct next token and the cross-entropy is zero. A model that has learned nothing assigns the uniform distribution and the cross-entropy is logK\log K, where KK is the vocabulary size — about 11.5 nats for a 100K-token vocabulary. The training loss for an LLM lives between those extremes, and reductions in cross-entropy translate directly to perplexity (exp(L)\exp(L)) and to held-out generalization.

The runnable block below shows the numerically stable softmax (max-subtraction inside the exponent) and cross-entropy on a small logit vector. Different target classes produce wildly different losses — confident-and-right is near zero, confident-and-wrong is large.

There is a clean payoff hiding inside cross-entropy that we will earn in section 4: the gradient of LCEL_\text{CE} with respect to the pre-softmax logits is the unreasonably tidy pyp - y. No quotient rule, no chain through the softmax. It is the cleanest gradient in this chapter, and frameworks fuse softmax with cross-entropy precisely to take advantage of it. Take that as a teaser.

The loss tells us how wrong. The next step is how to fix it.

Walking downhill — gradient descent

Imagine standing on a hillside in fog. You cannot see the bottom of the valley, but at your feet you can feel which way is downhill. You take a step in that direction. You feel again. You step again. Eventually you reach a minimum — or you do not, and that failure mode tells you something about the terrain or your stride.

That is gradient descent. The fog is the model’s parameter space. The slope at your feet is the gradient of the loss. The step is the update rule.

One parameter at a time

For a loss that depends on a single parameter θ\theta, the gradient L/θ\partial L / \partial \theta is a single number that points in the direction of increasing loss. To decrease loss, step in the opposite direction:

θθαLθ.\theta \leftarrow \theta - \alpha \, \frac{\partial L}{\partial \theta}.

The scalar α\alpha is the learning rate — the size of the step. Take the toy loss L(θ)=(θ3)2L(\theta) = (\theta - 3)^2, with the gradient L/θ=2(θ3)\partial L / \partial \theta = 2(\theta - 3). Starting from θ=0\theta = 0 with α=0.1\alpha = 0.1, every step pulls θ\theta toward 3. The learning rate is small enough that the steps shrink as θ\theta approaches the minimum.

Many parameters at once

A real network has millions to billions of parameters. The gradient becomes a vector: one partial derivative per parameter. We collect the parameters into θRp\theta \in \R^p and write the multivariable update as

θt+1=θtαθL(θt).\theta_{t+1} = \theta_t - \alpha \, \nabla_\theta L(\theta_t).
(1.gd)

This is the same rule, applied componentwise. Each parameter updates using its own partial derivative, independently of the others. The gradient θL\nabla_\theta L points in the direction of steepest ascent in the loss surface; the negative gradient is the direction of steepest descent. The vector update (1.gd) is what every optimizer in section 7 will modify, never replace.

One conceptual hazard worth flagging: “direction of steepest descent” is locally true, not globally true. The gradient at θt\theta_t is the best direction at that point. After one step, the local picture changes — the next gradient points somewhere else. Gradient descent is a sequence of local approximations, not a search for the global minimum, and on the non-convex loss surfaces of neural networks there is no guarantee that the sequence converges to anything you would call “optimal.” Empirically, on the surfaces that come out of training deep networks with stochastic mini-batches, the local descent finds basins that generalize well. The theory of why it works at this scale is incomplete; the empirics are settled.

Two practical complications appear immediately. First, evaluating L\nabla L on the full training set every step is too slow — modern datasets have billions of examples. Instead, we estimate the gradient on a mini-batch of a few dozen to a few thousand examples sampled uniformly at random. The estimate is noisier than the true gradient, but it is enormously cheaper, and the noise is mostly harmless in expectation — the random sampling means the estimator is unbiased, and the noise averages out across steps. The optimizer name stochastic gradient descent — SGD — refers to this noisy estimator. Batch size becomes a hyperparameter that trades compute per step against the variance of each step.

Second, we need to be able to actually compute L\nabla L for a stack of affine maps and activations. The chain rule is what does it; backpropagation is what makes it efficient. That is the next section, and it is the longest one in this chapter.

Backpropagation — the chain rule, applied carefully

A deep network is a function composition. If the network has nn layers, the forward pass is

y^=fn(fn1(f1(x;θ1););θn),\hat{y} = f_n(f_{n-1}(\cdots f_1(x; \theta_1); \cdots); \theta_n),

and the loss LL is some scalar function of y^\hat{y} and the target. To run gradient descent, we need L/θk\partial L / \partial \theta_k for every layer kk. A naive approach would compute each of those derivatives independently, recomputing all the intermediate quantities from scratch. The cost scales quadratically with depth, which is unacceptable for a network of any size.

Backpropagation is the observation that the chain rule lets us reuse work. Compute gradients once, in reverse layer order, and the cost scales linearly with depth — the same as the forward pass.

The chain rule, scalar form

Compose two scalar functions, L=f(g(θ))L = f(g(\theta)). The chain rule says

Lθ=Lggθ.\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial g} \cdot \frac{\partial g}{\partial \theta}.

That is all. The derivative of a composition is the product of derivatives. For three functions composed, the chain rule becomes a product of three; for a hundred functions composed, a product of a hundred. Every backward pass through a neural network is an unrolled application of this rule.

A toy example clarifies. Let g(θ)=θ2g(\theta) = \theta^2 and f(u)=euf(u) = e^u, so L=eθ2L = e^{\theta^2}. The chain rule gives L/θ=eθ22θ\partial L / \partial \theta = e^{\theta^2} \cdot 2\theta. We could have differentiated directly, but the chain-rule decomposition is what generalizes: each “layer” reports its local derivative — ff knows f/g\partial f / \partial g, gg knows g/θ\partial g / \partial \theta — and the global gradient is their product.

For a neural network, “layer” generalizes to any operation: matrix multiply, addition, ReLU, softmax, square. Each operation has a hand-coded local gradient rule. Backpropagation is the systematic application of those local rules in reverse, with the chain rule combining them.

The computational graph

For a network with branches, products, and sums, the linear chain rule above is not enough. The structure to work with is a directed acyclic graph in which nodes are tensor values and edges represent operations. The forward pass walks the graph in topological order, evaluating each operation to produce its output. The backward pass walks the graph in reverse topological order, asking each operation a single question: given the gradient of the loss with respect to my output, what is the gradient with respect to each of my inputs?

Think of the forward pass as executing a recipe — take ingredients, apply steps, produce a dish. The backward pass reads the recipe in reverse: at each step, attribute how much each ingredient contributed to the final taste. The chain rule is what propagates that attribution back through the recipe. Nothing is symbolic. The graph stores numerical operations on specific values, and each operation has a hand-coded rule for how its gradient passes through.

This is reverse-mode automatic differentiation, and it is what every modern framework’s .backward() does. Section 8 implements a minimal version in about thirty lines of Python. The data structure is the graph; the algorithm is a topological sort followed by a reverse traversal.

There is a counterpart called forward-mode autodiff that walks the graph in the same direction as the forward pass, propagating tangent vectors instead of gradients. Forward mode is cheaper when the output is high-dimensional and the input is low-dimensional — exactly the opposite of the neural-network case. For neural networks, the loss is one scalar and there are millions of parameters, so reverse mode is the right asymptotic choice. Backpropagation is reverse mode applied to a particular kind of graph (a feedforward computation with a scalar output); it is not a different algorithm.

The cross-entropy gradient (and why it’s clean)

The cleanness deserves a derivation. The setup: logits zRKz \in \R^K, probabilities p=softmax(z)p = \softmax(z), one-hot target yy, loss L=iyilogpiL = -\sum_i y_i \log p_i. The goal is L/zj\partial L / \partial z_j.

Chain rule first:

Lzj=iyilogpizj=iyipipizj.\frac{\partial L}{\partial z_j} = -\sum_i y_i \frac{\partial \log p_i}{\partial z_j} = -\sum_i \frac{y_i}{p_i} \frac{\partial p_i}{\partial z_j}.

The softmax derivative is the only nontrivial step. With pi=ezi/kezkp_i = e^{z_i} / \sum_k e^{z_k}, applying the quotient rule term by term gives

pizj=pi([i=j]pj),\frac{\partial p_i}{\partial z_j} = p_i \bigl([i = j] - p_j\bigr),

where [i=j][i = j] is 1 if i=ji = j and 0 otherwise. Substituting back, the pip_i in numerator and denominator cancel:

Lzj=iyi([i=j]pj)=yj+pjiyi.\frac{\partial L}{\partial z_j} = -\sum_i y_i \bigl([i = j] - p_j\bigr) = -y_j + p_j \sum_i y_i.

The target is one-hot, so iyi=1\sum_i y_i = 1. What survives is the boxed result that every classification loop assumes:

Lzj=pjyj.\boxed{\frac{\partial L}{\partial z_j} = p_j - y_j.}
(1.ce-grad)

Read it geometrically. The gradient at logit jj is the prediction error: the model said pjp_j, the truth was yjy_j, and the gradient drives the logit in the direction that closes the gap. If the model is right, the gradient is zero; if the model is confidently wrong, the gradient is close to one in magnitude. Nothing in the rest of this chapter has a cleaner form.

Matrix form for a 2-layer MLP

Now the same algebra, applied to a real network. Take a two-layer MLP: input xx of shape B×dinB \times d_\text{in} (batch of BB examples), one hidden layer of width dhd_h with ReLU, output dimension KK for cross-entropy classification. The forward pass:

z1=xW1+b1(B×dh),a1=ReLU(z1),z2=a1W2+b2(B×K),p=softmax(z2,axis=1),L=1Bbkybklogpbk.\begin{aligned} z_1 &= x W_1 + b_1 \quad (B \times d_h), \\ a_1 &= \text{ReLU}(z_1), \\ z_2 &= a_1 W_2 + b_2 \quad (B \times K), \\ p &= \softmax(z_2, \text{axis}{=}{-}1), \\ L &= -\tfrac{1}{B} \textstyle\sum_b \sum_k y_{bk} \log p_{bk}. \end{aligned}

Define δ2=L/z2\delta_2 = \partial L / \partial z_2. By (1.ce-grad) , dividing by BB to fold in the batch mean,

δ2=1B(py)(B×K).\delta_2 = \tfrac{1}{B}(p - y) \quad (B \times K).

The remaining gradients fall out of repeated chain-rule application, each line consistent with the shape annotations on the right:

LW2=a1δ2(dh×K),Lb2=b(δ2)b(K,),La1=δ2W2(B×dh),δ1=Lz1=La11[z1>0](B×dh),LW1=xδ1(din×dh),Lb1=b(δ1)b(dh,).\begin{aligned} \frac{\partial L}{\partial W_2} &= a_1^\top \delta_2 \quad (d_h \times K), \\ \frac{\partial L}{\partial b_2} &= \textstyle\sum_b (\delta_2)_b \quad (K,), \\ \frac{\partial L}{\partial a_1} &= \delta_2 W_2^\top \quad (B \times d_h), \\ \delta_1 = \frac{\partial L}{\partial z_1} &= \frac{\partial L}{\partial a_1} \odot \mathbb{1}[z_1 > 0] \quad (B \times d_h), \\ \frac{\partial L}{\partial W_1} &= x^\top \delta_1 \quad (d_\text{in} \times d_h), \\ \frac{\partial L}{\partial b_1} &= \textstyle\sum_b (\delta_1)_b \quad (d_h,). \end{aligned}

Two patterns are worth memorizing. The gradient with respect to a weight matrix is always the outer product of the incoming activation and the outgoing delta: L/Wk=ak1δk\partial L / \partial W_k = a_{k-1}^\top \delta_k. The gradient with respect to a bias is always the sum of the corresponding delta over the batch. The shapes force the order: L/W2\partial L / \partial W_2 has shape (dh×K)(d_h \times K), so it must be (dh×B)(B×K)=a1δ2(d_h \times B)(B \times K) = a_1^\top \delta_2, not the other way around. When in doubt, check the dimensions.

The ReLU step deserves a glance. The gradient through ReLU is one wherever the input was positive and zero elsewhere — equivalent to multiplying L/a1\partial L / \partial a_1 elementwise by the binary mask 1[z1>0]\mathbb{1}[z_1 > 0]. At z1=0z_1 = 0 exactly, the function is not differentiable; in practice we pick f(0)=0f'(0) = 0 and move on. The probability of z1z_1 landing exactly on zero in float arithmetic is essentially nil, and even when it happens the next batch perturbs the activation. This is one of those theoretical worries that does not cash out empirically.

What the matrix form makes obvious is the linear cost of backprop. Each gradient computation is a single matrix product or a single broadcast — the same order of compute as the corresponding forward operation. Total cost: roughly twice the forward pass, regardless of network depth. That is why backprop scales; that is why every framework is built around it.

The widget below animates exactly this trace — forward pass, then backward pass, with shapes labeled at every step. Hover any operation to see the gradient flowing through it.

Backprop through a 2-layer MLP

Interactive
idle
inputhidden (W₁, b₁ → ReLU)output (W₂, b₂ → softmax)loss1.20x[0]-0.80x[1]a1[0]a1[1]a1[2]a1[3]p[0]← targetp[1]p[2]L
Hover any node or edge for the math at that point. Click Play to animate the forward (cyan) and backward (amber) pass.
Stage 0 / 10

Animated trace of the forward and backward pass through a small MLP. Switch between input presets to see how different inputs flow through the network. Hover any node or edge for the math at that point.

The math is in place. Time to run it.

Putting it together — an MLP in numpy

The full MLP is a hundred lines of numpy. It implements exactly the equations above: forward pass that caches intermediates for the backward pass, backward pass that traces the matrix-form gradients, and a training loop that calls them in sequence on mini-batches.

Forward pass

The forward pass is a straight pipeline: affine, ReLU, affine, return logits. The MLP constructor allocates four parameter arrays — two weight matrices and two biases — and initializes them with He variance for ReLU. The forward method threads the batch through the pipeline and caches the intermediates we need for backward: the input xx, the pre-activation z1z_1, and the hidden activation a1a_1. Without the cache, the backward pass would have to recompute the forward pass to recover those tensors. With the cache, backward becomes a sequence of matrix products on tensors we already have.

The forward method does not apply softmax. Softmax is folded into the loss for numerical stability, and the backward pass needs raw logits to use the clean gradient L/z2=(py)/B\partial L / \partial z_2 = (p - y) / B from section 4. Mixing softmax into the forward path here would force us to compute a less stable gradient through the softmax separately.

Backward pass

The backward pass implements the matrix-form derivation directly. Reading top to bottom, each line corresponds to one gradient equation from section 4. We start at the loss and walk backward: compute pp from the cached logits, build the one-hot target, form δ2=(py)/B\delta_2 = (p - y) / B, then chain δ2\delta_2 backward through the second affine, through ReLU, through the first affine. The cleanest check that the implementation is right is dimensional: every output gradient has the same shape as the parameter it corresponds to. If dW1 is not (din×dh)(d_\text{in} \times d_h), something is transposed wrong.

A subtler check is the gradient through the activation. The line dz1 = da1 * relu_grad(self.z1) is the elementwise multiplication of an upstream gradient with the ReLU derivative — exactly the δ1=L/a11[z1>0]\delta_1 = \partial L / \partial a_1 \odot \mathbb{1}[z_1 > 0] from the derivation. The mask is computed from z1z_1, not a1a_1 — the gradient of ReLU is one wherever the input was positive, regardless of what the output looks like.

Training on a toy task

The toy task is to classify 2D points by quadrant — four classes, two input dimensions, sixteen hidden units. We sample a thousand points from a unit Gaussian, label them by which quadrant they fall in (encoded 0–3), and train on mini-batches of size 64 for five hundred steps. The expected behavior: loss drops from log41.39\approx \log 4 \approx 1.39 at random initialization (the cross-entropy of a uniform distribution over four classes) toward zero, and accuracy on the mini-batch passes 95% within the first few hundred steps. If the loss were stuck at 1.39 after a hundred steps, something would be wrong — most likely a sign error in the gradient or a transposed matrix product.

The reduction convention matters: we divide δ2\delta_2 by the batch size BB, which means the reported loss and the gradient magnitude are both means over the batch. Doubling the batch size leaves the effective per-step gradient roughly unchanged, which is what you want — the learning rate stays stable across batch sizes. Some implementations sum instead of average and pay the price elsewhere, often by scaling the learning rate by 1/B1/B to compensate. Pick one convention and stay with it; mixing them is a reliable source of mysterious bugs.

What this script does not do is anything fancy. There is no momentum, no learning rate schedule, no gradient clipping, no validation split, no early stopping. The numbers in the loss are noisy because the gradient is estimated on small mini-batches. The accuracy is computed on the same batch we just trained on, which overstates how well the model generalizes. All of those omissions are deliberate: we want the minimum viable training loop visible in one screen of code. Production training stacks add the missing pieces one at a time, and Chapter 8 walks through them when the scale demands it.

The widget below visualizes how training curves differ between SGD, SGD with momentum, and Adam on this same setup. Run the cell above to convince yourself the model trains; the widget makes the comparison visual.

Training curves — SGD vs Momentum vs Adam

Interactive
Running 1500 training steps in JavaScript…

Three optimizers, same MLP, same data. Scrub the time slider to watch the decision boundary form; observe the loss curves to see why Adam reaches lower training loss faster.

What we glossed over: initialization and optimizers. Both matter exponentially as networks get deep. The next two sections cover them.

Initialization — why it matters more than you think

The MLP above used the line np.sqrt(2.0 / d_in) for the weight standard deviation without comment. That choice is He initialization, and it is not arbitrary.

Consider a deep network of randomly initialized linear layers. If the weight variance is too small, the activations at each layer shrink toward zero — by the time you reach the output, the signal is gone and gradients vanish. If the weight variance is too large, the activations explode — gradients overflow and training diverges within a step or two. Either failure mode happens before the optimizer has a chance to do anything useful.

The variance-preservation argument fixes the scale. Assume the input activations have mean zero and variance σ2\sigma^2. For the output of a linear layer y=Wxy = Wx to have the same variance, the weight variance must be σ2/nin\sigma^2 / n_\text{in}, where ninn_\text{in} is the input dimension. This is the Glorot (or Xavier) initialization — designed for symmetric activations like tanh and sigmoid:

WN ⁣(0,2nin+nout)(Glorot).W \sim \mathcal{N}\!\left(0, \, \frac{2}{n_\text{in} + n_\text{out}}\right) \quad \text{(Glorot)}.

The symmetric average of ninn_\text{in} and noutn_\text{out} keeps both the forward and backward signals well-scaled.

ReLU breaks the assumption. Because ReLU zeros out roughly half its inputs, the activation variance halves at every layer under Glorot init. He, Zhang, Ren, and Sun’s 2015 paper added the missing factor of 2:

WN ⁣(0,2nin)(He).W \sim \mathcal{N}\!\left(0, \, \frac{2}{n_\text{in}}\right) \quad \text{(He)}.

This is what every ReLU-family network uses today, including transformers (which use GELU or SiLU — both ReLU-family enough that He init still applies). There is a long literature on alternatives (orthogonal, Fixup, LSUV) but for the architectures in this tutorial, He is the answer 99% of the time.

The MLP in section 5 was shallow enough that the He factor of 2 is not strictly necessary. At a depth of two it still trains under Glorot, or even under unit-variance init. It is the discipline of using the right thing at small scale that prevents a hundred-hour debugging session at large scale.

Better optimizers — momentum, Adam, AdamW

SGD takes each mini-batch gradient at face value. That gradient is a noisy estimate of the true gradient over the full dataset — useful in expectation, but noisy in any given step. SGD’s noise tolerance is mostly empirical: on convex problems with well-conditioned loss surfaces, it converges fine. On the loss surfaces of deep networks, plain SGD is fragile, slow, and sensitive to the learning rate. Two ideas — momentum and adaptive learning rates — collapse most of the variance.

Momentum

The momentum update accumulates an exponentially weighted average of gradients into a velocity term, and steps in the velocity direction rather than the instantaneous gradient direction:

vt=μvt1+gt,θt=θt1αvt.v_t = \mu v_{t-1} + g_t, \quad \theta_t = \theta_{t-1} - \alpha v_t.

The decay coefficient μ[0,1)\mu \in [0, 1) controls how much past gradients matter; a typical value is μ=0.9\mu = 0.9, meaning each gradient contributes to roughly ten steps of velocity before decaying out. The physical picture: a ball rolling down a noisy hillside. Pure SGD is a sticky ball that follows every micro-bump. With momentum, the ball builds velocity in consistent directions and rolls through small perturbations rather than reacting to them. The result, in practice, is a faster effective step in directions where the gradient is stable and a damped step in directions where it is noisy.

Adam

Adam (Kingma & Ba 2014) goes further. It maintains two exponentially weighted moments per parameter — the mean of past gradients mtm_t (like momentum) and the mean of squared past gradients vtv_t (the uncentered variance). The update divides by the square root of the second moment, which scales the step inversely with the historical gradient magnitude. Parameters with consistently large gradients receive smaller steps; parameters whose gradients have been small or rare receive larger steps. The effect is a per-parameter learning rate, computed for free from the gradient history.

mt=β1mt1+(1β1)gt,vt=β2vt1+(1β2)gt2,m^t=mt1β1t,v^t=vt1β2t,θt=θt1αm^tv^t+ϵ.\begin{aligned} m_t &= \beta_1 m_{t-1} + (1 - \beta_1) g_t, \\ v_t &= \beta_2 v_{t-1} + (1 - \beta_2) g_t^2, \\ \hat{m}_t &= \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}, \\ \theta_t &= \theta_{t-1} - \alpha \, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}. \end{aligned}

Defaults: β1=0.9\beta_1 = 0.9, β2=0.999\beta_2 = 0.999, ϵ=108\epsilon = 10^{-8}, α=103\alpha = 10^{-3}. The bias-correction factors 1/(1βt)1 / (1 - \beta^t) matter early — without them, the first 100{\sim}100 steps run at an effectively much smaller learning rate because mtm_t and vtv_t start at zero and only accumulate slowly. With β1=0.9\beta_1 = 0.9, the uncorrected m1m_1 equals 0.1g10.1 g_1, which understates the true running mean by a factor of ten. Bias correction divides through by 1β1=0.11 - \beta_1 = 0.1, restoring the right scale. After a few hundred steps, 1β1t11 - \beta_1^t \approx 1 and the correction drops out; the early-training correctness is the whole point.

Adam is the default for transformer training. Its initial convergence is fast, it tolerates a wider range of learning rates than SGD, and its per-parameter scaling handles the wildly different gradient magnitudes that appear in deep networks. The cost is two extra tensors per parameter — first and second moments — which doubles optimizer memory relative to SGD. At LLM scale this matters, but it is the price every modern training stack pays.

AdamW

AdamW (Loshchilov & Hutter 2017) is one line of code different from Adam but the difference matters. The original Adam paper suggested L2 regularization via the loss: add λ2θ2\tfrac{\lambda}{2} \|\theta\|^2, which contributes λθ\lambda \theta to the gradient. That works fine in SGD. In Adam, the per-parameter scaling distorts it — parameters with large adaptive scaling get less effective decay than parameters with small scaling, which is backwards from what L2 regularization is supposed to do.

AdamW decouples weight decay from the gradient. The Adam update runs on the raw gradient (no L2 term added in); decay is applied separately, after the adaptive step, as θθαλθ\theta \leftarrow \theta - \alpha \lambda \theta. The effect is the principled weight decay one would have wanted from L2 in SGD, restored. Modern transformers — Llama, GPT, Mistral, Gemma — all use AdamW with λ\lambda somewhere between 0.01 and 0.1.

The runnable block below puts an Adam implementation on a different problem — a simple elongated quadratic, f(x,y)=x2+10y2f(x, y) = x^2 + 10 y^2 — to make the per-parameter scaling visible. Plain SGD on this loss takes large xx-steps and small yy-steps. Adam normalizes them.

Autograd — the computational graph in production

The MLP in section 5 hard-codes its backward pass. For each new model — a different architecture, a different operation, a different shape — we would have to derive backward by hand and write it out. That does not scale. Attention has ten primitive operations between input and output, FFN has six more, and a transformer has dozens of blocks. The arithmetic is doable but not survivable.

Autograd automates the derivation. As the forward pass runs, the autograd engine records a computational graph: every operation becomes a node, and its inputs become its children. When .backward() is called on the loss, the engine traverses the graph in reverse topological order and applies each operation’s hand-coded local gradient rule. The chain rule does the composition.

Each node stores four things: its data (the forward value), its grad (the accumulated gradient, initially zero), a closure called _backward that knows how to propagate the gradient through this operation’s local rule, and a set of parent nodes it depends on. Karpathy’s micrograd repository is the canonical reference, in about 150 lines of Python. The block below implements a minimal version — addition, multiplication, ReLU, and the topological-sort backward pass — in under fifty lines.

The structure is the entire idea. Every framework — PyTorch, JAX, TensorFlow — implements this pattern with more primitives, GPU kernels, broadcasting, and a great deal of engineering, but the conceptual core is what is on the screen. PyTorch’s eager mode builds this graph dynamically on every forward pass, then walks it on .backward(). JAX traces the graph once and JIT-compiles it. The data structure and the algorithm are identical.

From here on, this tutorial uses PyTorch wherever the production form matters. The numpy-from-scratch tour ends here. What we built in this chapter is exactly what PyTorch’s autograd is doing under the hood — same chain rule, same graph traversal, same matrix-form gradients. The framework is a fast, GPU-aware implementation of the structure above, not a different idea.

Two extensions are worth knowing about before moving on. First, the engine above tracks scalar values; production frameworks track tensor values, and the local gradient rules generalize to tensor operations (matrix multiplication, broadcasting, reductions). The graph structure is identical; the per-node math is heavier. Second, building the graph dynamically on every forward pass — what PyTorch calls “eager mode” — has a cost: graph construction is Python-level and not free at scale. JAX’s response is to trace the forward pass once and JIT-compile the resulting graph, paying the construction cost once and reusing the compiled artifact. The tradeoff is flexibility versus throughput; modern stacks combine both.

Walk the computational graph

Interactive
idleStage 0 / 8
Start: leaves a, b, c have values. Forward will compute d, e, L. Backward then walks gradients in reverse.
a2.00g: 0.00b-3.00g: 0.00c10.00g: 0.00d = a · bg: 0.00e = d + cg: 0.00L = relu(e)g: 0.00
Hover any node to see its _backward closure. Play to animate the forward (cyan) and backward (amber) pass.

A six-node DAG showing scalar autograd. Step through forward to compute values; step through backward to watch gradients flow. Hover any node to see its _backward closure as Python code.

Exercises

The exercises below build on the chapter. Each is a self-contained problem with a starting template. Hints are collapsed by default — try the problem first, peek only if you stall.

Exercise 1 (easy) — Verify softmax numerical stability

Implement the naive softmax (np.exp(z) / np.exp(z).sum()) and the stable softmax (max-subtraction). For each, evaluate on z = [1000, 999, 998] and report what happens. The lesson is that the mathematically equivalent forms are not numerically equivalent in finite-precision arithmetic.

Hint

The naive version overflows because np.exp(1000) returns inf, and inf / inf is nan. The stable version subtracts the max first, so all exponents are non-positive and np.exp returns a value in (0, 1]. The resulting probabilities should be approximately [0.665, 0.244, 0.090].

Exercise 2 (medium) — Extend the MLP to 3 hidden layers

The MLP in section 5 has one hidden layer. Extend it to three hidden layers (input → h1 → h2 → h3 → output) and train it on the same quadrant task. Does it converge faster, slower, or similarly to the 1-hidden-layer version? The toy task is benign enough that the answer is interesting either way.

Hint

You need three weight matrices and three bias vectors plus the output projection. The forward pass has three (affine + ReLU) blocks before the final logits. The backward pass has three corresponding gradient computations — each layer’s dz depends on the next layer’s weights and gradient, by the chain rule. Use He init (std = sqrt(2 / fan_in)) for each layer.

Exercise 3 (medium) — Implement AdamW from scratch

Take the Adam class from section 7 and modify it to implement AdamW (Loshchilov & Hutter 2017). The key difference: weight decay is applied to the parameter directly (p -= lr * weight_decay * p) AFTER the standard Adam update, NOT added to the gradient. Compare AdamW (with weight_decay=0.01) to plain Adam on the elongated quadratic from section 7.

Hint

In Adam.step, after the line p -= lr * m_hat / (np.sqrt(v_hat) + eps), add a second update: p -= lr * weight_decay * p. This decoupled decay means the magnitude of the decay isn’t affected by Adam’s per-parameter rescaling — which is exactly the property AdamW restores.

Exercise 4 (hard) — Build a tiny autograd engine

Implement a minimal Value class in the micrograd style, supporting +, *, relu, and backward(). Use it to compute a tiny scalar expression and verify the gradients by hand. The point isn’t speed — it’s seeing that autograd is just careful graph traversal applied to the chain rule.

Hint

Each Value object stores: data, grad, _backward (a closure), _prev (set of parent Values), and _op (a string label). Operations build the graph as they execute — each new Value records its parents. backward() does a topological sort then walks in reverse, calling each node’s _backward(). The section 8 code block in the chapter shows the exact pattern.

If you finish Exercise 4 in working form, you have essentially recreated Karpathy’s micrograd — the smallest serious autograd engine that ships. PyTorch, JAX, and TensorFlow are doing the same thing, just at scale.

What we built

Affine maps, activations, cross-entropy and MSE losses, gradient descent, backpropagation by hand on a 2-layer MLP, He initialization, SGD with and without momentum, Adam, AdamW, and a minimal autograd engine. Every transformer training loop in the chapters that follow is doing this same set of operations, scaled up: the affine maps become attention projections and feed-forward layers; the activations become GELU and SiLU; the loss stays cross-entropy because next-token prediction is classification over the vocabulary; backprop stays the same chain-rule traversal; AdamW stays the optimizer; He stays the initialization.

What changes is the architecture. Chapters 4 through 6 build attention, the multi-head transformer block, and positional encoding. Before any of that, though, we need a way to turn words into vectors that a network can consume. That is Chapter 2.