Chapter · LLM Tutorial

Chapter 1 ended with a neural network that could classify 2D points by quadrant. Coordinates in, class label out. Both ends continuous. Both ends were always going to play nicely with backprop.

Language is different. A word is not a coordinate. Cat is a token — an entry in a discrete vocabulary — and there is no obvious answer to “what is cat − dog?” The chain rule has nothing to say about discrete IDs.

This chapter is about the bridge: how we turn discrete tokens into continuous vectors that the network from Chapter 1 (or the transformer built in Chapters 4 and 5) can consume. The bridge is one matrix and one array-indexing operation. It also happens to be where a large share of the parameters in a small language model live.

Why embeddings

A neural network is a continuous machine. Every operation in it — affine map, activation, normalization, attention — moves real-valued vectors around. Backprop assumes those vectors are differentiable functions of upstream tensors, which in turn requires that the inputs to the first layer be real numbers we can take derivatives with respect to.

Tokens are not real numbers. The string the cat sat becomes, after tokenization, a sequence of integer IDs — something like [12, 5847, 2391]. Those integers are labels, not magnitudes. Feeding them into an MLP directly would be a category error: the network would happily compute $5847 - 12 = 5835$ , treating the difference between cat and the as if it were a meaningful scalar quantity. It is not. The IDs are arbitrary; renumbering the vocabulary changes every input value without changing the underlying language.

\text{token IDs} \in \{0, 1, \dots, |V| - 1\}^T \quad \longrightarrow \quad \text{embedded sequence} \in \R^{T \times d}

The naive bridge is one-hot encoding: turn each ID into a vector of length $|V|$ with a single 1 in the slot for that token and zeros everywhere else. The output is continuous in the trivial sense that it is a vector of floats. The chain rule applies. The category error is gone.

But $|V|$ is enormous. Modern tokenizers have vocabularies in the 30,000–200,000 range. Multiplying a one-hot vector by a weight matrix wastes nearly all of the arithmetic on multiplying by zero — every position except one contributes nothing. At a vocabulary of 50,000 and a hidden dimension of 768, the first layer would do 50,000 × 768 ≈ 38 million multiplications to produce a vector that depends on a single integer.

Embeddings are the actual bridge. Map each token ID to a dense vector in a much smaller space — $d \sim 100$ for word2vec, $d = 768$ for GPT-2 small, $d = 12{,}288$ for GPT-3. Different tokens map to different vectors; the vectors live in $\R^d$ and can be added, subtracted, and dotted into the next layer’s weight matrix without ceremony. Section 3 makes the construction precise.

The next two sections build the bridge: the representation, then the operation.

One-hot versus distributed representations

One-hot encoding is the simplest way to turn a categorical token into a vector. Define it formally: for token ID $i \in \{0, \dots, |V|-1\}$ , the one-hot vector $\mathbf{e}_i \in \{0, 1\}^{|V|}$ has

(\mathbf{e}_i)_j = [i = j],

where $[\cdot]$ is the Iverson bracket — 1 if the condition holds, 0 otherwise. The result is a vector with magnitude 1, a single nonzero entry, and a clean property: any two different tokens are orthogonal.

The orthogonality is exactly the problem. Under one-hot encoding, cat and dog are as similar to each other as cat and helicopter: their dot product is zero in both cases. Nothing about the representation says cat and dog belong together. The cosine similarity between any two distinct one-hots is exactly zero. A linear classifier built on top of these vectors has no shared structure to exploit; whatever it learns about cat it must learn again from scratch for every other token.
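A quick numerical check of the orthogonality claim, as a minimal sketch. The three-word vocabulary and the `cosine` helper are illustrative assumptions, not part of the chapter's code.

```python
import numpy as np

vocab = ["cat", "dog", "helicopter"]   # hypothetical toy vocabulary
one_hot = np.eye(len(vocab))           # row i is the one-hot vector e_i

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat, dog, helicopter = one_hot         # unpack the three rows
sim_cat_dog = cosine(cat, dog)
sim_cat_helicopter = cosine(cat, helicopter)
print(sim_cat_dog, sim_cat_helicopter)  # 0.0 0.0
```

Both similarities are zero, so the representation carries no notion of "closer" or "farther" between distinct tokens.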

One way to feel the failure mode: think of one-hots as stamps. Every word gets a unique stamp, and the stamps carry no information beyond “this is the word with that stamp.” A library where every book is labeled by a unique serial number and nothing else gives you no way to find related books. Stamps are how you identify tokens; they are not how you describe them.

A distributed representation describes a token instead of identifying it. Every word gets a $d$ -dimensional vector of real values, where $d$ is much smaller than $|V|$ (often 1/50 to 1/1000 the size). The vector is not a stamp; it is a fingerprint in semantic space. Cat might end up encoding something like (small, furry, indoor, mammal, feline); dog might encode (medium, furry, indoor, mammal, canine). Most of the features are shared. The differences are meaningful. A linear classifier downstream can pick up on shared structure — anything that follows the “small furry indoor mammal” direction generalizes from cats to ferrets without seeing ferrets in training.

The dimensions of a distributed embedding are not interpretable individually. Coordinate 0 is not furriness and coordinate 1 is not plurality. Useful “directions” in embedding space are linear combinations of many raw dimensions — the gender direction in word2vec, for example, is some specific unit vector in $\R^{300}$ that you can recover by averaging differences across (king, queen), (uncle, aunt), and so on. Probing for a direction works; reading off a single column does not. Section 5 returns to this.

The mathematical link between one-hot and distributed is short and worth seeing. If $E \in \R^{|V| \times d}$ is the embedding matrix (one row per token), then multiplying the one-hot vector $\mathbf{e}_i$ on the left by $E^\top$ picks out the $i$ -th column of $E^\top$ — which is the $i$ -th row of $E$ :

E^\top \mathbf{e}_i = E[i, :] \in \R^d \quad \text{(one-hot lookup = row indexing)}

So “look up the embedding for token $i$” and “multiply $E^\top$ by the one-hot vector $\mathbf{e}_i$” are the same operation. The identity is useful pedagogically: it lets us treat an embedding lookup as a special case of a linear map (an affine map with no bias), which is what makes the gradient derivation in section 3 fall out cleanly.
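The identity can be checked directly. The sketch below uses toy sizes and a random matrix, both chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 6, 4                      # toy sizes, chosen for illustration
E = rng.normal(size=(vocab_size, d))      # embedding matrix, one row per token

i = 3
e_i = np.zeros(vocab_size)
e_i[i] = 1.0                              # one-hot vector for token i

via_matmul = E.T @ e_i                    # |V| * d multiply-adds, mostly by zero
via_index = E[i]                          # a single row read
same = np.allclose(via_matmul, via_index)
print(same)                               # True
```

The two results agree, but the indexing version never touches the other rows, which is why real implementations index rather than multiply.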

So embeddings are dense vectors. Where do they live, and how is the lookup wired into the rest of the network?

The embedding layer as a lookup table

The embedding matrix $E \in \R^{|V| \times d}$ is a learnable parameter — exactly the same kind of object as the weight matrices in Chapter 1’s MLP. Row $i$ of $E$ is the embedding of token $i$ . The forward pass is array indexing.

Forward: array indexing

Given a batch of token IDs token_ids of shape (B,) or (B, T), the embedded output is E[token_ids], with shape (B, d) or (B, T, d). That is it. There is no nonlinearity, no bias, no normalization at this step — the embedding “layer” is just a learned mapping from integers to rows of a matrix.

Initialization is small Gaussian, conventionally $\mathcal{N}(0, 0.02^2)$ . Note what this is not: it is not He initialization, and it is not Xavier. Those schemes are calibrated to preserve activation variance through a stack of affine-plus-activation layers; an embedding lookup has neither an affine combination of many inputs nor an activation function. The right prior is just “small enough that the initial signal doesn’t dominate downstream layers’ weights.” GPT-2 uses 0.02; Llama uses 0.02. The number is empirical, and 0.02 is the field-standard answer until you have a specific reason to deviate.
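As a minimal sketch of the forward pass and the initialization described above (the vocabulary size, dimension, and token IDs are made-up toy values):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 50, 8                                # toy sizes, assumed
E = rng.normal(0.0, 0.02, size=(vocab_size, d))      # N(0, 0.02^2) init

ids_1d = np.array([3, 17, 3])                        # shape (B,)
ids_2d = np.array([[3, 17, 42], [0, 3, 3]])          # shape (B, T)

out_1d = E[ids_1d]                                   # shape (B, d)
out_2d = E[ids_2d]                                   # shape (B, T, d)
print(out_1d.shape, out_2d.shape)                    # (3, 8) (2, 3, 8)
```

Every occurrence of token 3 returns the same row, whatever its position in the batch.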

Backward: sparse gradients

The gradient through the embedding lookup is where the one-hot equivalence pays off. Suppose the forward pass embedded token $i$ to produce $e = E[i, :]$ , and the loss $L$ has gradient $\partial L / \partial e$ flowing back from downstream. The chain rule gives the gradient with respect to each row of $E$ :

\frac{\partial L}{\partial E[j, :]} = \begin{cases} \dfrac{\partial L}{\partial e} & \text{if } j = i \\ \mathbf{0} & \text{otherwise} \end{cases}

Only the row that was looked up receives a non-zero gradient. Under plain SGD, every other row of $E$ stays exactly where it was: the embedding for cat never moves unless a batch contains cat.

For a batch of $B$ tokens, the gradient is the same row-by-row scatter, accumulated wherever multiple positions share a token ID. In numpy, the canonical idiom is np.add.at(grad_E, token_ids, grad_out) — an unbuffered scatter that handles repeated indices correctly. Without np.add.at, repeated indices would silently overwrite each other instead of accumulating, and the gradient would be wrong.

The runnable block below implements forward and backward for a tiny embedding layer and shows the gradient sparsity directly: of five vocabulary rows, only the two whose IDs appeared in the batch end up non-zero.
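A minimal sketch of such a layer follows. It is a reconstruction under assumptions, not the chapter's original block: the class body, the toy sizes, and the all-ones upstream gradient are illustrative.

```python
import numpy as np

class EmbeddingLayer:
    """Lookup-table embedding with a manual backward pass (illustrative)."""

    def __init__(self, vocab_size, d, seed=0):
        rng = np.random.default_rng(seed)
        self.E = rng.normal(0.0, 0.02, size=(vocab_size, d))
        self.grad_E = np.zeros_like(self.E)

    def forward(self, token_ids):
        self.token_ids = np.asarray(token_ids)
        return self.E[self.token_ids]

    def backward(self, grad_out):
        # Scatter-add upstream gradients into the rows that were looked up.
        self.grad_E[:] = 0.0
        flat_ids = self.token_ids.reshape(-1)
        flat_grad = grad_out.reshape(-1, self.E.shape[1])
        np.add.at(self.grad_E, flat_ids, flat_grad)
        return self.grad_E

layer = EmbeddingLayer(vocab_size=5, d=3)
token_ids = np.array([1, 3, 1])            # token 1 appears twice
out = layer.forward(token_ids)
grad_out = np.ones_like(out)               # pretend dL/d(out) is all ones
grad_E = layer.backward(grad_out)

nonzero_rows = np.flatnonzero(np.abs(grad_E).sum(axis=1))
print(nonzero_rows)                        # [1 3]
print(grad_E[1])                           # [2. 2. 2.]  (two occurrences summed)

# The buggy alternative: buffered fancy-index assignment drops the repeat.
buggy = np.zeros_like(layer.E)
buggy[token_ids] += grad_out
print(buggy[1])                            # [1. 1. 1.]
```

Row 1 accumulates both occurrences under `np.add.at`, while the plain `+=` version records only one of them.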

At scale, gradient sparsity has a concrete cost consequence. A batch of 1024 tokens from a 50,000-token vocabulary touches at most 1024 distinct rows of $E$, roughly 2%. Naively applying Adam to the full embedding matrix means updating moment estimates for the other 98% of rows that received zero gradient. That is wasted work, and it also changes those rows: a row's first-moment estimate stays non-zero for many steps after its last real gradient, so the row keeps moving, and AdamW's decoupled weight decay shrinks every row toward zero on every step whether or not it was used. PyTorch's nn.Embedding(..., sparse=True) produces a sparse gradient for the weight matrix, and optimizers that support sparse gradients (torch.optim.SparseAdam, for example) then update only the rows that appeared in the batch.
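To see why a dense Adam update is not a no-op on a row with zero gradient, here is a scalar sketch of the textbook Adam rule (no weight decay). The function name and the two-step scenario are assumptions made for illustration.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One textbook Adam update for a single scalar weight (no weight decay)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)           # bias-corrected second moment
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

w0, m, v = 0.5, 0.0, 0.0
w1, m, v = adam_step(w0, g=1.0, m=m, v=v, t=1)   # token appeared: real gradient
w2, m, v = adam_step(w1, g=0.0, m=m, v=v, t=2)   # token absent: zero gradient
moved_without_gradient = w1 - w2
print(moved_without_gradient > 0)                # True
```

The second step moves the weight even though its gradient is zero, because the first-moment estimate still remembers the earlier gradient. A sparse update skips the row entirely instead.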

The embedding table is just a matrix of parameters. The question is how it gets its values. The historical answer is word2vec; the modern answer is end-to-end. The next two sections walk the historical answer because it is the cleanest pedagogical setup; section 6 corrects the misconception that word2vec is how modern LLMs work.

Word2vec — skip-gram with negative sampling

Before 2013, learned word representations existed — distributional models going back to LSA and HAL, neural language models like Bengio et al. 2003 — but they were expensive and unstable at the scale of web text. Mikolov, Chen, Corrado, and Dean (2013, arxiv.org/abs/1301.3781) changed this with a simple, scalable, self-supervised setup: word2vec.

The setup is shamelessly simple. For each token in a corpus, predict the tokens around it. No human labels. The corpus is the supervision. Two architectural variants appeared in the same paper — Skip-gram (given a word, predict each context) and CBOW (given a context, predict the word) — but Skip-gram with negative sampling, introduced a few months later in Mikolov, Sutskever, Chen, Corrado, and Dean (2013, arxiv.org/abs/1310.4546), is the variant that took over and that we walk through here.

The self-supervised objective

For each word $w$ in the corpus and each context word $c$ within a window of $\pm 5$ (or so) positions, treat $(w, c)$ as a positive training pair. The model has two embedding matrices: $U \in \R^{|V| \times d}$ for words acting as the “center” and $V \in \R^{|V| \times d}$ for words acting as the “context.” After training, $U$ is typically the embedding used downstream; $V$ is discarded.

The model’s prediction for “given center word $w$ , what is the probability of context word $c$ ?” is a categorical distribution over the vocabulary:

P(c \mid w) = \frac{\exp(v_c^\top u_w)}{\sum_{c' \in V} \exp(v_{c'}^\top u_w)}.

The objective is maximum likelihood: maximize the log probability of the true context word at every position in the corpus. Standard, clean, and entirely impractical.

Why the full softmax is intractable

The denominator sums $\exp(v_{c'}^\top u_w)$ over every word in the vocabulary. For a vocabulary of 50,000 tokens, that is 50,000 dot products and 50,000 exponentials per training pair, and there are billions of training pairs in a real corpus. Most of the work is computing probabilities for words that are nowhere near the true context. The arithmetic is dominated by negative evidence, and there turns out to be a much cheaper way to learn from it.
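The cost is easy to see in code. This sketch evaluates the full softmax for a single center word; the matrices are random stand-ins and the dimension is kept small so the example runs quickly.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d = 50_000, 32                        # d kept small for the demo
U = rng.normal(0.0, 0.02, size=(vocab_size, d))   # center-word embeddings
V = rng.normal(0.0, 0.02, size=(vocab_size, d))   # context-word embeddings

def context_distribution(w):
    """P(c | w) for every c: one dot product and one exp per vocabulary entry."""
    scores = V @ U[w]            # |V| dot products
    scores -= scores.max()       # stabilise the exponentials
    probs = np.exp(scores)       # |V| exponentials
    return probs / probs.sum()

p = context_distribution(w=123)
print(p.shape)                   # (50000,)
```

Every training pair would pay for all 50,000 entries of `p`, even though the loss only reads one of them.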

This is the same problem the output softmax has in any large-vocabulary language model. Two families of solutions exist: hierarchical softmax, which arranges the vocabulary as a binary tree and turns the $|V|$ -way classification into $\log_2 |V|$ binary classifications along a path; and noise-contrastive methods, which replace the multinomial classification with a binary classifier that distinguishes real pairs from fake ones. Skip-gram with negative sampling — SGNS — is the noise-contrastive route.

Negative sampling

Replace the $|V|$ -way softmax with $k + 1$ binary classifications. For each positive pair $(w, c)$ from the corpus, draw $k$ negative context words $c'_1, \dots, c'_k$ from a noise distribution $P_n$ . Train the model to score the positive pair high and the negative pairs low.

The per-pair loss:

\mathcal{L}_{\text{SGNS}} = -\log \sigma(v_c^\top u_w) - \sum_{i=1}^{k} \log \sigma\!\left(-v_{c'_i}^\top u_w\right),

(2.sgns)

where $\sigma(x) = 1 / (1 + e^{-x})$ is the standard sigmoid.
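The loss (2.sgns) and its gradient with respect to the center vector can be written in a few lines. The sketch below is illustrative: the function names are assumptions, and the finite-difference comparison is only a sanity check on the algebra.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(u_w, v_c, v_neg):
    """Per-pair SGNS loss. Shapes: u_w (d,), v_c (d,), v_neg (k, d)."""
    pos = -np.log(sigmoid(v_c @ u_w))
    neg = -np.log(sigmoid(-(v_neg @ u_w))).sum()
    return pos + neg

def sgns_grad_u(u_w, v_c, v_neg):
    """Analytic gradient of the loss with respect to the center vector u_w."""
    g_pos = (sigmoid(v_c @ u_w) - 1.0) * v_c
    g_neg = (sigmoid(v_neg @ u_w)[:, None] * v_neg).sum(axis=0)
    return g_pos + g_neg

rng = np.random.default_rng(0)
d, k = 4, 5
u_w, v_c, v_neg = rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))

# Central finite differences, one coordinate of u_w at a time.
eps = 1e-6
basis = np.eye(d)
numeric = np.array([
    (sgns_loss(u_w + eps * basis[j], v_c, v_neg)
     - sgns_loss(u_w - eps * basis[j], v_c, v_neg)) / (2 * eps)
    for j in range(d)
])
max_err = np.abs(numeric - sgns_grad_u(u_w, v_c, v_neg)).max()
print(max_err < 1e-6)   # True
```

The positive term pulls $u_w$ toward $v_c$ with weight $1 - \sigma(v_c^\top u_w)$; each negative pushes it away with weight $\sigma(v_{c'_i}^\top u_w)$.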

The noise distribution from which negatives are drawn is not uniform over the vocabulary. Mikolov et al. empirically chose

P_n(c) \propto p(c)^{3/4},

where $p(c)$ is the unigram probability of context word $c$ . The 3/4 exponent is interpolating: uniform sampling ( $p(c)^0$ ) wastes negatives on words too rare to teach the model anything; pure unigram sampling ( $p(c)^1$ ) over-samples the, of, and, drowning useful signal in stopword noise. The 0.75 power was a hyperparameter sweep that stuck.
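A small sketch of the interpolation, using made-up counts for a four-word vocabulary:

```python
import numpy as np

counts = np.array([1000, 100, 10, 1], dtype=float)   # hypothetical unigram counts
unigram = counts / counts.sum()

def noise_distribution(p, power=0.75):
    """Raise unigram probabilities to a power and renormalise."""
    weights = p ** power
    return weights / weights.sum()

p_n = noise_distribution(unigram)                 # the 3/4-power distribution
uniform = noise_distribution(unigram, power=0.0)  # power 0 recovers uniform

# Drawing k = 5 negatives from the noise distribution.
rng = np.random.default_rng(0)
negatives = rng.choice(len(p_n), size=5, p=p_n)
print(np.round(unigram, 4))
print(np.round(p_n, 4))
```

Relative to the raw unigram distribution, the most frequent word loses probability mass and the rarest word gains some, which is the flattening the exponent is meant to produce.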

Hyperparameter sketch: Mikolov et al. report that $k$ = 2 to 5 negatives per positive is enough on large corpora, while $k$ = 5 to 20 is useful on small ones (more negatives compensate for fewer training pairs). Context window 5 to 10 to either side. Embedding dimension 100 to 300, small by modern standards. Learning rate around 0.025 with linear decay to near-zero by end of training.

A working implementation

The runnable block below trains SGNS on a 22-word toy corpus for 100 epochs. Pyodide handles the whole loop in about a second. After training, the cosine similarities between word pairs should reflect their distributional similarity in the corpus: cat and dog should be more similar than cat and rug, because cat and dog appear in overlapping contexts (sat on, the) while rug appears with a narrower set.
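A compact sketch of such a training loop is given here. It is a reconstruction under assumptions rather than the chapter's original block: the toy corpus, the window size, the learning rate, and the embedding dimension are all illustrative choices, and negatives that happen to collide with the true context word are not filtered out, for brevity.

```python
import numpy as np

# Hypothetical toy corpus; "." is treated as an ordinary token.
corpus = ("the cat sat on the mat . the dog sat on the rug . "
          "the cat chased the dog . the dog chased the cat .").split()
vocab = sorted(set(corpus))
word_to_id = {w: i for i, w in enumerate(vocab)}
ids = np.array([word_to_id[w] for w in corpus])

rng = np.random.default_rng(0)
n_vocab, d, k, window, lr, epochs = len(vocab), 16, 5, 2, 0.05, 100
U = rng.normal(0.0, 0.02, size=(n_vocab, d))   # center embeddings
V = rng.normal(0.0, 0.02, size=(n_vocab, d))   # context embeddings

# Noise distribution: unigram counts raised to the 3/4 power.
counts = np.bincount(ids, minlength=n_vocab).astype(float)
p_noise = counts ** 0.75
p_noise /= p_noise.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# All (center, context) pairs within the window.
pairs = [(ids[t], ids[t + o])
         for t in range(len(ids))
         for o in range(-window, window + 1)
         if o != 0 and 0 <= t + o < len(ids)]

labels = np.zeros(k + 1)
labels[0] = 1.0                                   # first target is the positive
losses = []
for epoch in range(epochs):
    total = 0.0
    for idx in rng.permutation(len(pairs)):
        w, c = pairs[idx]
        neg = rng.choice(n_vocab, size=k, p=p_noise)
        targets = np.concatenate(([c], neg))      # 1 positive + k negatives
        u = U[w].copy()
        probs = sigmoid(V[targets] @ u)
        total += -np.log(probs[0]) - np.log(1.0 - probs[1:]).sum()
        err = probs - labels                      # gradient w.r.t. each score
        U[w] -= lr * (err @ V[targets])
        np.add.at(V, targets, -lr * err[:, None] * u)  # repeated targets accumulate
    losses.append(total / len(pairs))

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def sim(w1, w2):
    return cosine(U[word_to_id[w1]], U[word_to_id[w2]])

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
print("cat/dog:", round(sim("cat", "dog"), 3), "cat/rug:", round(sim("cat", "rug"), 3))
```

On a corpus this small the similarities depend noticeably on the seed and hyperparameters, so treat the printed numbers as a qualitative check rather than a fixed result.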

The widget below scales the same idea to a slightly larger toy and animates training. Each click advances a step of SGD; positive pairs visibly pull together and negative pairs push apart in a 2D projection of the embedding space.

Word2vec was a 2013 paper, and most of what it produces — vectors trained on a single self-supervised objective and used downstream as fixed features — has been replaced by the embedding tables of end-to-end-trained LLMs. What did not go away is the geometric structure the paper revealed, and the most striking consequence of that structure is the topic of the next section.

The geometry of learned representations

The word2vec papers also produced what became the iconic result of the embedding era: vectors trained on web text exhibit linear relationships between semantically analogous words. The famous example:

\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}.

Compute the difference between king and man. Add it to woman. The nearest neighbor of the resulting vector in the embedding space is queen. The same trick works for capitals (Paris − France + Italy ≈ Rome), tenses (walked − walk + swim ≈ swam), and a sprawling list of other analogous pairs. The first time this was demonstrated on 6-billion-token-trained Skip-gram, it was the kind of empirical result that made the academic community take the geometric structure of word vectors seriously.

The result is geometric, not symbolic. There is no “gender concept” stored at coordinate 42 of the embedding. What the analogies reveal is that the “male-to-female” direction in $\R^{300}$ — some specific unit vector that you would have to recover with a probe — is approximately constant across analogous pairs. The vector from king to queen and the vector from uncle to aunt point in nearly the same direction with nearly the same magnitude. Add that direction to brother and you land near sister.

The why gets a clean partial answer from Levy and Goldberg (2014, NeurIPS). They showed that the SGNS objective (2.sgns) is implicitly factorizing a matrix — specifically the shifted pointwise mutual information matrix:

v_c^\top u_w \approx \text{PMI}(c, w) - \log k, \quad \text{where} \quad \text{PMI}(c, w) = \log \frac{p(c, w)}{p(c) \cdot p(w)}.

PMI measures how much more often two words co-occur than chance predicts. It has additive structure for compositional pairs — at least in the loose sense that $\text{PMI}(c, \text{queen})$ behaves approximately like $\text{PMI}(c, \text{female}) + \text{PMI}(c, \text{monarch})$ for contexts $c$ that pick up either feature. Because SGNS is factorizing PMI, the additive structure of PMI translates to linear structure in the embedding space, and the linear structure manifests as analogies.
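For concreteness, the shifted PMI matrix can be computed from a co-occurrence table in a few lines. The 2×2 count table below is made up, and the sketch only builds the target matrix; it does not run the factorization.

```python
import numpy as np

# Hypothetical co-occurrence counts: rows are words w, columns are contexts c.
counts = np.array([[8.0, 2.0],
                   [1.0, 9.0]])
total = counts.sum()
p_wc = counts / total                       # joint p(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)       # marginal p(w)
p_c = p_wc.sum(axis=0, keepdims=True)       # marginal p(c)

pmi = np.log(p_wc / (p_w * p_c))            # PMI(c, w), entry by entry
k = 5                                       # number of negatives per positive
shifted_pmi = pmi - np.log(k)               # the target that v_c . u_w approximates
print(np.round(pmi, 3))
```

Pairs that co-occur more often than their marginals predict get positive PMI (the diagonal here); pairs that co-occur less often get negative PMI.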

The argument is hand-wavy in exactly the way that the empirical evidence is robust. PMI is not perfectly additive. The factorization is approximate. The convergence is partial. The fact that linear analogies work as well as they do empirically is somewhat surprising; the PMI-factorization view explains why they work at all without quantifying why they work so well. The chapter does not reproduce the full proof — Levy and Goldberg’s paper is short and worth reading directly.

The widget below shows a 2D projection of pre-computed word2vec embeddings colored by semantic category, with analogy overlays drawn on top. The clusters (animals, colors, countries) are clearly visible; the parallel-arrows pattern for analogies is visible too, with the usual caveat that 2D projection compresses information and the parallelism in the projection underestimates the structure in the full 300-dimensional space.

Word2vec is the historical anchor for this material. It is not, however, how the embeddings inside a modern LLM are trained. The next section is the corrective.

Embeddings in modern LLMs

Word2vec trains on a single self-supervised objective — predict context from word — and produces a fixed embedding table that gets reused downstream. The whole point of the paper was to decouple representation learning from task learning. Modern LLMs do the opposite. The embedding table is just another learnable parameter of the model; it updates via backprop along with attention weights, FFN weights, layer norms, and everything else, on whatever objective the model is being trained for (typically next-token prediction).

End-to-end training

There is no “pretrain the embeddings, then freeze them” phase in any production LLM. The embedding matrix $E$ is allocated at initialization (small Gaussian, the same $\mathcal{N}(0, 0.02^2)$ as section 3), wired into the input of the model, and updated by AdamW on the same loss that updates every other parameter. The gradient through the lookup is exactly the sparse scatter from section 3: rows whose token IDs appear in the batch receive gradient, and the remaining rows receive none from the input side.

Mechanically, the section-3 EmbeddingLayer is essentially what every modern LLM does. The difference is what the rest of the network does with the embedded sequence. Word2vec wires it into a single dot product with a context-word embedding. A transformer wires it into a stack of attention blocks and FFNs, each refining the representation. The lookup is the same; the downstream task is different; the resulting embeddings reflect that.

What the embeddings learn

The embedding table that emerges from end-to-end training does not optimize for the geometric properties word2vec embeddings have. The training signal is “predict the next token given the previous tokens,” not “predict context given word.” The resulting space encodes whatever directions help next-token prediction, which substantially overlaps with what word2vec encoded but is not identical. Probing studies still find meaningful linear structure — gender, country, sentiment directions all appear in GPT-2’s input embeddings — but the classic king − man + woman demo does not reliably reproduce on them. The geometry is task-shaped.

Karpathy’s makemore series (github.com/karpathy/makemore) is the cleanest pedagogical reference for embedding-as-byproduct. Across the series, a character-level embedding table is trained end-to-end with progressively more elaborate downstream networks — bigram MLP, larger MLP, transformer. The embedding lookup is the same in all of them; what changes is what the downstream network does. The block below is the bigram-MLP version: an embedding table plus a single output projection, trained on next-character prediction. The embedding emerges as whatever vectors make that one downstream operation work.
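A sketch in the spirit of that bigram model is shown here. It is not taken from the makemore repository: the training text, the sizes, the learning rate, and the manual numpy gradients are all assumptions made for illustration.

```python
import numpy as np

text = "hello world hello there"          # hypothetical toy training text
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}
ids = np.array([stoi[ch] for ch in text])
x, y = ids[:-1], ids[1:]                   # predict each next character

rng = np.random.default_rng(0)
n_vocab, d, lr, steps = len(chars), 8, 0.5, 300
E = rng.normal(0.0, 0.02, size=(n_vocab, d))   # embedding table
W = rng.normal(0.0, 0.02, size=(d, n_vocab))   # single output projection

rows = np.arange(len(y))
losses = []
for step in range(steps):
    h = E[x]                               # (N, d) embedding lookup
    logits = h @ W                         # (N, |V|)
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    losses.append(-np.log(probs[rows, y]).mean())

    d_logits = probs.copy()
    d_logits[rows, y] -= 1.0               # softmax cross-entropy gradient
    d_logits /= len(y)
    grad_W = h.T @ d_logits
    grad_h = d_logits @ W.T
    grad_E = np.zeros_like(E)
    np.add.at(grad_E, x, grad_h)           # sparse scatter into looked-up rows
    E -= lr * grad_E
    W -= lr * grad_W

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Nothing in the loop asks the rows of `E` to be meaningful on their own; they end up wherever makes `E[x] @ W` a good next-character predictor.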

Layer-by-layer contextualization

There is a second, subtler misconception about modern embeddings: that each token has a single fixed vector throughout the network. This is true at the input — every occurrence of bank maps to the same row of $E$ — and false at every layer above. Attention’s whole job is to mix in surrounding context, and at every transformer layer the representation of bank at position $t$ is some learned combination of representations at every other position. After layer 1, the bank in river bank and the bank in savings bank already differ. By layer 12, they are substantially different vectors pointing in substantially different directions.

This is the difference between static embeddings (the row of $E$ at the input) and contextualized representations (the activations at intermediate layers). Word2vec and GloVe produce static embeddings — one vector per word, regardless of context. A transformer produces a sequence of contextualized representations, with the static input embedding as the starting point of a much longer pipeline.

Chapter 5 picks this up properly when it walks through the transformer block. The takeaway for this chapter is narrower: when we say “the embedding table in a modern LLM,” we mean the input layer’s lookup — $E$ in section 3, fixed per token ID. The “deeper-layer activations” are a different thing entirely, even though casual usage sometimes calls them “embeddings” too.

The implications for interpretability follow. Input embeddings in word2vec are relatively direct objects to probe — they were trained for it. Input embeddings in modern LLMs are less directly interpretable because they are tuned for whatever direction helps the contextualization layers above. Probing still works; the absolute geometric cleanliness of word2vec analogies does not.

Tying input and output embeddings

A language model has two big matrices that sit between the vocabulary and the hidden representation: the input embedding $E \in \R^{|V| \times d}$ and the output projection $W_{\text{out}} \in \R^{d \times |V|}$ . The input maps token IDs to vectors; the output maps the final hidden state back to scores over the vocabulary. Both have shape proportional to $|V| \times d$ . Both, fundamentally, are answering the same question: what is the relationship between this token and this point in $\R^d$ ?

Press and Wolf (2017, arxiv.org/abs/1608.05859) made the obvious move: tie them. Set $W_{\text{out}} = E^\top$ , so the same matrix serves as input embedding and output projection. The input embedding for token $i$ is $E[i, :]$ ; the output logit for token $i$ given a hidden state $h$ is $h \cdot E[i, :]$ . One matrix instead of two.
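In code, tying means the same array is read by rows on the way in and multiplied (transposed) on the way out. A minimal sketch with toy sizes and a random stand-in for the final hidden states:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, d, T = 10, 4, 3                    # toy sizes, assumed
E = rng.normal(0.0, 0.02, size=(vocab_size, d))

token_ids = np.array([2, 7, 2])
x = E[token_ids]                               # input side: row lookup, (T, d)

h = rng.normal(size=(T, d))                    # stand-in for final hidden states
logits = h @ E.T                               # output side: W_out = E^T, (T, |V|)
print(x.shape, logits.shape)                   # (3, 4) (3, 10)
```

The logit for token $i$ at position $t$ is the dot product of `h[t]` with row `E[i]`, so a token's score is high exactly when the hidden state points along that token's input embedding.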

The savings are concrete. For GPT-2 small ($|V| = 50{,}257$, $d = 768$), one of the two $|V| \times d$ matrices is 38.6 million parameters, roughly 30% of the model's 125 million total. Tying eliminates that block of parameters outright. Llama 7B, with $|V| = 32{,}000$ and $d = 4096$, would save 131 million parameters, closer to 2% of total. The relative savings shrink as models grow.
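The arithmetic behind these figures, as a short sketch. The total parameter counts used for the percentages are the rounded nominal sizes (125 million and 7 billion), not exact counts.

```python
def embedding_params(vocab_size, d):
    """Parameters in one |V| x d matrix."""
    return vocab_size * d

gpt2_small = embedding_params(50_257, 768)     # 38,597,376
llama_7b = embedding_params(32_000, 4_096)     # 131,072,000

# Nominal model sizes, rounded as in the text.
gpt2_fraction = gpt2_small / 125e6
llama_fraction = llama_7b / 7e9
print(f"{gpt2_small:,} ({gpt2_fraction:.0%})")
print(f"{llama_7b:,} ({llama_fraction:.1%})")
```

The embedding block grows linearly in $d$, while the transformer blocks grow roughly with $d^2$ per layer, which is why the fraction falls as models get wider and deeper.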

The output bias is a separate matter and is not tied. Some implementations add a learnable bias $b \in \R^{|V|}$ to the output projection, so that logits at position $t$ are $h_t \cdot E^\top + b$, with the bias applied per vocabulary entry. BERT's masked-language-model head does this; many decoder-only LLMs, GPT-2 and Llama among them, omit the bias entirely. Where it is present, the bias absorbs the unigram statistics of the corpus: frequent tokens (the, of, and) get a positive bias because they are a priori probable; rare tokens get a negative bias. Without the bias, the model has to bake unigram statistics into the directions of the hidden state, spending representational capacity on something a single per-token scalar can handle.

Tying is common in open-source LLMs, especially smaller ones, but it is not universal. GPT-2 ties. Gemma ties. The original Llama and Mistral 7B releases do not: they keep separate input and output matrices. Some larger closed models (GPT-3 and later, by various reports) also do not tie. At larger scale the parameter budget is less binding, and the extra flexibility of independent input and output spaces may be worth the cost. A reasonable default is “tie until you have a reason not to.”

Exercises

The exercises below build on the chapter. Each is a self-contained problem followed by a hint; try the problem before reading the hint.

Exercise 1 (easy) — Verify category separation by cosine similarity

Train the toy skip-gram model from section 4 on the same toy corpus. Then compute cosine similarities for several pairs: within-category (e.g., cat ↔ dog), across-category (e.g., cat ↔ rug), and to connector words (e.g., cat ↔ sat). Report which pairs are most/least similar.

Hint

After training, the U matrix has the word embeddings. Compute cos(a, b) = (a @ b) / (|a| * |b|). Same-category words should have higher cosine than cross-category. Connector words (sat, on, the) should be moderately similar to many words — they co-occur with everything.

Exercise 2 (medium) — Intruder detection

Given a list of 4 words, identify the one that doesn’t belong using only embedding similarities. For example, in [cat, dog, fish, car], the intruder is car. The basic approach: compute the average pairwise similarity within the list, then check which word has the lowest average similarity to the others.

Hint

For each candidate word $w$ in the list, compute its average cosine similarity to all other words in the list. The word with the LOWEST average similarity is the intruder. (This works because the other three words mutually reinforce each other’s “category-ness.”)

Exercise 3 (medium) — Implement CBOW

CBOW (Continuous Bag of Words) is word2vec’s other variant. Instead of predicting context from a center word, CBOW predicts the center word from the average of its context word embeddings. Implement CBOW with negative sampling. Compare the resulting embeddings to skip-gram on the same toy corpus.

Hint

For each (center, context_set) pair: average the context-word embeddings into a single vector $\bar{u}$ . Then run the SGNS-style update with $\bar{u}$ in place of a single word embedding, predicting the center word as positive and random negatives. Gradient w.r.t. $\bar{u}$ distributes equally back to each context word.

Exercise 4 (hard) — Linear analogy search

Given three words $a$, $b$, $c$, find the word $d$ such that $\vec{d} - \vec{c} \approx \vec{b} - \vec{a}$, i.e., $d$ completes the analogy “$a$ is to $b$ as $c$ is to ?”. Implement this search using your trained embeddings.

Hint

Compute the target vector $t = b - a + c$ . Then search the vocabulary for the word whose embedding is closest to $t$ (excluding $a$ , $b$ , $c$ themselves — they often score highest by accident). Use cosine similarity. With a tiny corpus, the results may be noisy; with real word2vec on Wikipedia, the king/queen result emerges.

If you finish Exercise 4 with a working implementation and want to see it succeed on real data, download pre-trained GloVe vectors from nlp.stanford.edu/projects/glove, load them with numpy, and run your analogy search. The glove.6B vectors were trained on a 6-billion-token corpus and cover a 400,000-word vocabulary. With real embeddings, king - man + woman returns queen as the nearest neighbor once the three query words are excluded, which is the empirical result that put word vectors on the map.

From token to context

The chapter started with a category error: tokens are discrete, neural networks are continuous, and the chain rule has nothing to say about integer IDs. The fix turned out to be one matrix and one array-indexing operation. The matrix is learnable; it is initialized small, trained end-to-end with whatever model it sits underneath, and produces representations with useful geometric properties — sometimes linear-analogy-friendly, sometimes more task-shaped, depending on what objective shaped them.

What the chapter’s embedding is, fundamentally, is static. Every occurrence of cat maps to the same row of $E$ at the input. That is exactly the right behavior for a lookup table, and exactly the wrong behavior for understanding language. The meaning of cat in the cat purred is not the same as its meaning in cat notes.txt, where it names a Unix command, and any model that treats them as identical is going to make the same mistakes on both.

The fix is what the rest of Part II builds. A transformer takes the sequence of static embeddings produced by Chapter 2’s lookup and runs it through a stack of attention blocks, each of which takes the previous layer’s representations and refines them by mixing in information from surrounding positions. By the time the network has run twelve layers (or seventy, or two hundred), the representation at position $t$ has absorbed signal from every other position in the sequence. It is no longer “the embedding of token $t$ ”; it is “what token $t$ means in this context, given everything around it.”

That is the transformer’s structural job, and Chapter 4 introduces the operation — attention — that does it. Chapter 5 builds the full block. Chapter 6 adds the missing piece: how the model knows which position a token is at, since attention is permutation-equivariant without help.

First, though, Chapter 3 covers the other end of the bridge. The chapter has been quietly assuming that “token IDs” are a fixed input — that some upstream process produces integers between 0 and $|V| - 1$ and hands them to the embedding lookup. What that process is, and how it shapes everything downstream (vocabulary size, the geometry of $E$ , the model’s behavior on numbers and code and multilingual text), is its own subject. Tokenization is where token IDs come from, and the choice of tokenizer determines the rows of $E$ before training has done anything at all.
