> **Note:** The canonical experience is the interactive HTML page: [Inside Word2Vec](https://www.quantml.org/topics/word2vec/internals). This file is a text mirror for search engines and AI tools.

# Inside Word2Vec: Model Internals, Forward Pass & Weight Explorer

Step through Word2Vec internals: forward pass, backprop, weight matrices, and similarity heatmaps with real trained model data.

---

## Section 1 — Model Blueprint

**Subtitle:** The entire Skip-gram model in 13 lines of PyTorch. Hover any highlighted line to see what it does.

> **This is a tiny teaching model, not a production one.**
> We deliberately trained on a **96-word vocabulary** with only **545 hand-crafted sentences** and **1,536 total parameters**. Real Word2Vec models use millions of words and billions of training tokens. The model is small enough that you can see every weight and understand every decision, yet large enough that real semantic structure emerges: king/queen cluster together, 3 of 5 test analogies resolve correctly, and neighborhoods make sense.

### Parameter Census

| Layer | Type | Shape | Parameters |
|---|---|---|---|
| Embedding (W) | Embedding | 96 × 8 | 768 |
| Output projection (W′) | Linear | 8 × 96 | 768 |
| **Total** | | | **1,536** |
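
For orientation, here is a minimal PyTorch sketch with exactly these shapes. The class and variable names are illustrative, not the exact code from the walkthrough (that lives in the Code Walkthrough linked below):

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    """Tiny Skip-gram: embedding lookup followed by a linear projection."""
    def __init__(self, vocab_size=96, embed_dim=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)             # W: 96 x 8
        self.output = nn.Linear(embed_dim, vocab_size, bias=False)   # W': 8 -> 96

    def forward(self, center_idx):
        v = self.embed(center_idx)   # look up the center word's row of W
        return self.output(v)        # raw scores (logits) over the vocabulary

model = SkipGram()
print(sum(p.numel() for p in model.parameters()))  # 1536
```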

For reference, this model's 1,536 parameters on a log scale:

- **This model** — 1,536
- GPT-2 — 117M
- Word2Vec (Google News) — 3B
- GPT-3 — 175B
- GPT-4 — ~1T+

### Training Configuration

| Parameter | Value | Description |
|---|---|---|
| `learning_rate` | 0.1 | How big each gradient step is |
| `window_size` | 2 | Look N words left and right for context |
| `embed_dim` | 8 | Size of each word vector |
| `num_steps` | 15,000 | Total training iterations |
| `min_count` | 1 | Minimum word frequency to include |
| `dense_steps` | 100 | Record every step for first N steps |
| `record_interval` | 50 | Then record every Nth step |
| `pca_method` | incremental | How 2D projection is computed |

### Corpus Overview

| Stat | Value |
|---|---|
| Sentences | 545 |
| Vocabulary | 96 |
| Training Pairs | 1,751 |
| Dimensions | 8 |

Top 20 words by frequency and sample sentences from the curated corpus are shown interactively on the page.

---

## Section 2 — Forward & Backprop Walkthrough

**Subtitle:** Step through a single training step (from input to weight update) at your own pace. Pick any step to see the actual numbers the model computes. Pause, inspect, compare early steps vs late steps.

Select from curated training steps: 1, 10, 50, 100, 500, 1,000, 2,000, 5,000, 10,000, 15,000. Use ← → arrow keys to navigate the 7 stages.

Each step shows the active training pair: a center word → context word.

---

### Stage 1: Pick a center word

The model picks a center word. It must predict which words appear nearby. The true context word is the training target.

The input is a one-hot vector: all zeros except a single 1 at position `#N`, the center word's index in the alphabetically sorted vocabulary.

```
x = one_hot("center_word") ∈ ℝ⁹⁶
```

Only one slot is "on"; this tells the model which word to look up.

---

### Stage 2: Look up the embedding

Instead of multiplying the one-hot vector by W, the model just grabs the corresponding row from the embedding matrix. This 8-number vector **IS** what the model currently thinks the center word means.

```
v_center = W["center_word", :] ∈ ℝ⁸
```

At step 0 these values are random. By later steps they carry semantic structure.
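
In code, the lookup is just a row index, which is exactly equivalent to multiplying the one-hot vector by W but much cheaper. A sketch with illustrative shapes (the random `W` stands in for the trained matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(96, 8))   # embedding matrix: one 8-dim row per word

center_idx = 42                # illustrative index of the center word
v_center = W[center_idx]       # row lookup == one_hot @ W, but O(1)
```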

---

### Stage 3: Score every vocabulary word

The model multiplies the hidden vector by W′ to get a raw score for every word. High score means "I think this word appears near the center word."

```
z = W′ · v_center ∈ ℝ⁹⁶
```

Shows top 12 raw logit scores with the true context word highlighted. The remaining 84 words are also scored but not displayed.
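
A sketch of the scoring step. Here W′ is stored as a 96 × 8 array so that each word owns one 8-dim context row, which makes the matrix-vector product match the formula above (shapes and values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W_prime = rng.normal(size=(96, 8))   # output matrix: one 8-dim context row per word
v_center = rng.normal(size=8)        # hidden vector from the previous stage

z = W_prime @ v_center               # one raw score (logit) per word, shape (96,)
```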

---

### Stage 4: Softmax → Probabilities

Softmax converts the 96 raw scores into probabilities that sum to 1. Big scores get amplified; small scores get squashed toward zero.

```
P(w | center) = exp(z_w) / Σⱼ exp(z_j)
```

The model assigns a probability percentage to the true context word and displays its rank out of 96:

- Rank ≤ 5: "Good, it's in the top 5!"
- Rank ≤ 20: "Getting closer, but not top-ranked yet."
- Rank > 20: "Still far from the top. The model has more to learn."
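
The standard numerically stable softmax, as a self-contained sketch (the stand-in logits are illustrative):

```python
import numpy as np

def softmax(z):
    z = z - z.max()        # shift by the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

z = np.random.default_rng(0).normal(size=96)  # stand-in logits
p = softmax(z)
assert abs(p.sum() - 1.0) < 1e-9              # probabilities sum to 1
```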

---

### Stage 5: Compute Loss

The loss measures surprise. If the model was confident about the context word, loss is low. If it was surprised, loss is high.

```
L = −log P("context" | "center")
```

Reference: −log(1/96) = 4.56 is pure chance. Loss > 4 means the model is nearly guessing.
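
You can verify the chance baseline directly:

```python
import numpy as np

print(-np.log(1 / 96))   # 4.564... : loss when the model guesses uniformly
print(-np.log(0.9))      # 0.105... : loss when it is 90% confident and right
```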

---

### Stage 6: Backprop — Gradients

The gradient tells the model how to adjust the center word's embedding to make the context word more likely next time. It splits into two intuitive forces:

**Attractive Force (green):**
"How should I move *center*'s embedding to get *closer* to *context*?"
This is the context word's output embedding (u vector from W′). It points toward where the context word lives — the direction we *want* the center to move.

**Repulsive Force (red):**
"How should I move *center*'s embedding to get *further* from words I wrongly predicted?"
This is a probability-weighted average of all 96 context vectors (W′). Words the model was most confident about contribute the most push.

```
∂L/∂v_center = Σ_w P(w)·u_w  [repel]  −  u_context  [attract]
```

The final gradient = repel − attract, dimension by dimension. Since the update *subtracts* this gradient, the net effect is:
- center's embedding moves **toward** context (attract wins)
- center's embedding moves **away from** words it wrongly predicted
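
A sketch of both forces in numpy; the uniform `p` and the indices are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
W_prime = rng.normal(size=(96, 8))   # context rows u_w
p = np.full(96, 1 / 96)              # model's predicted distribution (uniform here)
context_idx = 7                      # illustrative index of the true context word

repel = p @ W_prime                  # Σ_w P(w)·u_w : probability-weighted push-away
attract = W_prime[context_idx]       # u_context : direction toward the true neighbor
grad_v_center = repel - attract      # ∂L/∂v_center
```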

---

### Stage 7: Apply the update

The gradient is multiplied by the learning rate (0.1) and subtracted from the current embedding. One tiny step toward a better model.

```
v_center_new = v_center_old − 0.1 · ∇L
```

Displays three bar charts: current embedding → gradient → updated embedding.

> That's it: one complete training step. The embedding of the center word just moved slightly in 8-dimensional space. Repeat this 15,000 times with different word pairs, and the entire vocabulary self-organizes into meaningful clusters.
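
Collapsed into code, the seven stages fit in a few lines. A minimal numpy sketch of one full training step; the names, shapes, and toy initialization are illustrative, not the exact script from the walkthrough:

```python
import numpy as np

def train_step(W, W_prime, center_idx, context_idx, lr=0.1):
    v = W[center_idx].copy()                  # stages 1-2: embedding lookup
    z = W_prime @ v                           # stage 3: raw scores
    p = np.exp(z - z.max()); p /= p.sum()     # stage 4: softmax
    loss = -np.log(p[context_idx])            # stage 5: cross-entropy loss
    d = p.copy(); d[context_idx] -= 1.0       # stage 6: p - one_hot(context)
    W[center_idx] -= lr * (W_prime.T @ d)     # stage 7: move the center embedding
    W_prime -= lr * np.outer(d, v)            # the context rows move too
    return loss

rng = np.random.default_rng(0)
W, Wp = rng.normal(size=(96, 8)) * 0.1, rng.normal(size=(96, 8)) * 0.1
print(train_step(W, Wp, 3, 7))                # near ln(96) ≈ 4.56 at random init
```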

---

## Section 3 — Training Journey

**Subtitle:** Our tiny model trained for 15,000 steps on 545 sentences. That's enough for real patterns to emerge: neighbors become meaningful, analogies start working, and embeddings converge. The same dynamics happen in full-scale Word2Vec, just with ~100 billion training words (Google News corpus) instead of our 1,751 training pairs.

### Neighbor Evolution

Pick a word and see how its nearest neighbors change during training. At step 0, neighbors are random. By step 5,000+, semantically related words dominate.
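
One plausible way to compute these neighbor lists, using cosine similarity over the rows of W (an illustrative sketch, not the page's exact code):

```python
import numpy as np

def nearest_neighbors(W, idx, k=5):
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-length rows
    sims = Wn @ Wn[idx]                                # cosine similarity to word idx
    order = np.argsort(-sims)                          # most similar first
    return [i for i in order if i != idx][:k]          # top-k, excluding the word itself
```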

### Analogy Scorecard

Five analogy tests evaluated throughout training. Watch the model go from 0/5 to 3/5 correct as embeddings learn semantic structure.
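
The classic evaluation uses vector arithmetic: for `a : b :: c : ?`, search for the word nearest to `b − a + c` (e.g. king − man + woman). An illustrative sketch:

```python
import numpy as np

def analogy(W, vocab, a, b, c):
    """Solve a : b :: c : ? by vector arithmetic over the embedding rows."""
    ix = {w: i for i, w in enumerate(vocab)}
    target = W[ix[b]] - W[ix[a]] + W[ix[c]]
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    sims = Wn @ (target / np.linalg.norm(target))
    for i in np.argsort(-sims):                # best match, skipping the inputs
        if vocab[i] not in (a, b, c):
            return vocab[i]
```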

### Word Convergence

How quickly does each word's embedding converge to its final position? Common words converge fast; rare words take much longer.
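
One simple way to quantify this (an assumption on our part; the page may use a different metric) is the distance from a word's embedding at each recorded snapshot to its final position:

```python
import numpy as np

def convergence_curve(snapshots, word_idx):
    """Distance from a word's embedding at each snapshot to its final position."""
    final = snapshots[-1][word_idx]
    return [np.linalg.norm(W[word_idx] - final) for W in snapshots]
```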

### Weight Matrix Animation

The raw embedding matrix W (96 words × 8 dimensions) at 10 snapshots during training. Watch it evolve from random noise to structured patterns where semantic groups share similar weight profiles.

---

## Section 4 — Weight Explorer

**Subtitle:** With only 96 words and 8 dimensions, we can visualize *every single number* the model learned. In a real Word2Vec model (3 million words × 300 dimensions) this would be impossible; that's why we built this toy version.

### Similarity Heatmap

Cosine similarity between all 96 word embeddings, sorted by semantic group. The bright blue blocks along the diagonal show that words in the same category have learned similar representations.
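
The whole heatmap is one normalized matrix product; a minimal sketch:

```python
import numpy as np

def cosine_similarity_matrix(W):
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # normalize each row
    return Wn @ Wn.T                                   # (96, 96) pairwise cosine matrix
```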

### Two Faces of Every Word

The model has **two separate vectors** for every word, not one. Here's why:

**W — "When I'm the main word"**
When "king" is the center word and the model asks "what words appear near king?", it uses `W[king]`. Think of it as the word describing *itself*.

**W′ — "When I'm somebody's neighbor"**
When "king" is a context word and the model checks "is king likely to appear near queen?", it uses `W′[king]`. Think of it as others describing *what it's like to be near king*.

Because these two roles get different gradient updates during training, the two vectors for the same word can end up quite different. The page charts how strongly each word's two vectors agree: a **tall bar** means the word describes itself the same way others describe it; a **short or negative bar** means the two perspectives learned very different things.
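
A natural metric for those bars is the cosine between a word's two vectors; whether the page uses cosine or a raw dot product is an assumption here, but the idea is the same. An illustrative sketch:

```python
import numpy as np

def role_agreement(W, W_prime, idx):
    """Cosine between a word's center vector (W) and its context vector (W')."""
    v, u = W[idx], W_prime[idx]
    return v @ u / (np.linalg.norm(v) * np.linalg.norm(u))
```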

In practice, only **W** is kept as the final word embedding; W′ is discarded after training.

---

## Related sections

- [Word2Vec Story](https://www.quantml.org/topics/word2vec) — Conceptual walkthrough from one-hot to embeddings
- [Code Walkthrough](https://www.quantml.org/topics/word2vec/code) — The PyTorch training script that generated these artifacts
- [Quiz](https://www.quantml.org/topics/word2vec/quiz) — Test your understanding
