Interactive Visualizations

Word2Vec

Inside the Model

Explore weight matrices, similarity heatmaps, and the forward-backward pass step by step. See exactly what the neural network learns at every training step.

Inside Word2Vec explores the skip-gram model with real trained weights. You can step through the forward pass, inspect the 96×8 embedding (W) and context weight (W′) matrices, and watch similarity heatmaps evolve across 15,000 training steps. Key diagnostics include the W vs. W′ comparison, the cosine similarity matrix, and per-word convergence curves.

Model Blueprint

The entire Skip-gram model in 13 lines of PyTorch. Hover any highlighted line to see what it does.

This is a tiny teaching model, not a production one.

We deliberately trained on a 96-word vocabulary with only 545 hand-crafted sentences and 1,536 total parameters. Real Word2Vec models use vocabularies of millions of words and billions of training tokens. The model is small enough that you can see every weight and understand every decision, yet large enough that real semantic structure emerges: king/queen cluster together, analogies start working (3 of 5 test analogies resolve correctly), and neighborhoods make sense.
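The whole model really is just two weight matrices and a softmax. Here is a minimal NumPy sketch of the same architecture (the random initial weights are stand-ins, not the page's trained values); note that the two 96×8 matrices account for exactly the 1,536 parameters mentioned above.

```python
import numpy as np

V, D = 96, 8                                   # vocabulary size and embedding dimension

class SkipGram:
    """NumPy sketch of the tiny skip-gram model described above."""
    def __init__(self, rng):
        self.W  = rng.normal(0, 0.1, (V, D))   # embedding matrix (kept after training)
        self.Wp = rng.normal(0, 0.1, (D, V))   # context weight matrix (discarded)

    def forward(self, center_idx):
        h = self.W[center_idx]                 # (D,) hidden layer = embedding lookup
        logits = h @ self.Wp                   # (V,) one score per vocabulary word
        e = np.exp(logits - logits.max())
        return e / e.sum()                     # softmax: P(context word | center word)

model = SkipGram(np.random.default_rng(0))
n_params = model.W.size + model.Wp.size        # 96*8 + 8*96 = 1,536
```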

The Code

Parameter Census

Training Config

Corpus Overview

Forward & Backprop Walkthrough

Step through a single training step (from input to weight update) at your own pace. Pick any step to see the actual numbers the model computes. Pause, inspect, compare early steps vs late steps.

Training Step:
This step trains: river → predict → boy

Use arrow keys to navigate stages

Step 1: Pick a center word

The model picks "river" as the center word. It must predict which words appear nearby. The true context word is "boy".

The input is a one-hot vector, all zeros except position #66 where “river” sits in the alphabetical vocabulary:
Only one slot is “on”; this tells the model which word to look up. Hover over the bars to see which word is at each position.
\mathbf{x} = \text{one\_hot}(\text{"river"}) \in \mathbb{R}^{96}
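The full loop from this one-hot input to the weight update can be sketched in NumPy. The indices, learning rate, and random weights below are illustrative stand-ins, not the page's trained values. One detail worth seeing in code: multiplying the one-hot vector by W just selects the row `W[center]`, so the "matrix multiply" is implemented as a direct table lookup.

```python
import numpy as np

V, D, lr = 96, 8, 0.05
rng = np.random.default_rng(4)
W  = rng.normal(0, 0.1, (V, D))      # embedding matrix
Wp = rng.normal(0, 0.1, (D, V))      # context weight matrix

center, context = 66, 12             # stand-in indices (e.g. "river" -> "boy")

# --- forward pass ---
h = W[center]                        # one_hot(center) @ W == W[center]: a row lookup
logits = h @ Wp                      # (V,) one score per word
probs = np.exp(logits - logits.max())
probs /= probs.sum()                 # softmax
loss = -np.log(probs[context])       # cross-entropy on the true context word

# --- backward pass (softmax + cross-entropy gradient) ---
dlogits = probs.copy()
dlogits[context] -= 1.0              # dL/dlogits = probs - one_hot(context)
dWp = np.outer(h, dlogits)           # (D, V) gradient for the context matrix
dh  = Wp @ dlogits                   # (D,) gradient flowing back to the embedding

# --- SGD update ---
Wp -= lr * dWp
W[center] -= lr * dh                 # only the center word's row changes

# Sanity check: the same example should now get a slightly lower loss.
h2 = W[center] @ Wp
p2 = np.exp(h2 - h2.max()); p2 /= p2.sum()
new_loss = -np.log(p2[context])
```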

Training Journey

Our tiny model trained for 15,000 steps on 545 sentences. That’s enough for real patterns to emerge: neighbors become meaningful, analogies start working, and embeddings converge. The same dynamics happen in full-scale Word2Vec, just with the ~100-billion-word Google News corpus instead of our 1,751 words.

Neighbor Evolution

Pick a word and see how its nearest neighbors change during training. At step 0, neighbors are random. By step 5,000+, semantically related words dominate.

Analogy Scorecard

Five analogy tests evaluated throughout training. Watch the model go from 0/5 to 3/5 correct as embeddings learn semantic structure.
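Each analogy test of the form "man is to king as woman is to ?" is scored by vector arithmetic plus a nearest-neighbor search that excludes the three query words. Here is a sketch with a hypothetical five-word vocabulary; the hand-built 2-D vectors are illustrative, not the model's 8-D embeddings.

```python
import numpy as np

def solve_analogy(W, vocab, a, b, c, k=1):
    """Top-k answers to 'a is to b as c is to ?' via W[b] - W[a] + W[c]."""
    idx = {w: i for i, w in enumerate(vocab)}
    target = W[idx[b]] - W[idx[a]] + W[idx[c]]
    # Cosine similarity of the target vector against every embedding.
    sims = (W @ target) / (np.linalg.norm(W, axis=1) * np.linalg.norm(target) + 1e-12)
    sims[[idx[a], idx[b], idx[c]]] = -np.inf   # exclude the query words themselves
    return [vocab[i] for i in np.argsort(sims)[::-1][:k]]

# Toy embeddings where dim 0 ~ "royal" and dim 1 ~ "male".
vocab = ["king", "man", "woman", "queen", "boy"]
W = np.array([[1.0, 1.0],    # king  = royal + male
              [0.0, 1.0],    # man   = male
              [0.0, 0.0],    # woman = neither
              [1.0, 0.0],    # queen = royal
              [0.1, 0.9]])   # boy   ~ man
answer = solve_analogy(W, vocab, "man", "king", "woman")   # -> ["queen"]
```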

Word Convergence

How quickly does each word’s embedding converge to its final position? Common words converge fast; rare words take much longer.

Weight Matrix Animation

The raw embedding matrix W (96 words × 8 dimensions) at 10 snapshots during training. Watch it evolve from random noise to structured patterns where semantic groups share similar weight profiles.

Weight Explorer

With only 96 words and 8 dimensions, we can visualize every single number the model learned. In a real Word2Vec model (3 million words × 300 dimensions) this would be impossible; that’s why we built this toy version.

Similarity Heatmap

Cosine similarity between all 96 word embeddings, sorted by semantic group. The bright blue blocks along the diagonal show that words in the same category have learned similar representations.
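Under the hood, the whole heatmap is one matrix product: normalize every embedding row to unit length, and the product of the normalized matrix with its own transpose gives all 96×96 pairwise cosine similarities at once. A sketch with random stand-in weights (the trained matrix would show the block structure described above):

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=(96, 8))           # stand-in for the trained embedding matrix

# Row-normalize, then one matmul yields every pairwise cosine similarity.
W_norm = W / np.linalg.norm(W, axis=1, keepdims=True)
sim = W_norm @ W_norm.T                # (96, 96) heatmap values in [-1, 1]
```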

Two Faces of Every Word

The model has two separate vectors for every word, not one. Here’s why:

W: “When I’m the main word”

When “king” is the center word and the model asks “what words appear near king?”, it uses W[king]. Think of it as the word describing itself.

W′: “When I’m somebody’s neighbor”

When “king” is a context word and the model checks “is king likely to appear near queen?”, it uses W′[king]. Think of it as others describing what it’s like to be near king.

Because these two roles get different gradient updates during training, the two vectors for the same word can end up quite different. The chart below measures this: a tall bar means the word describes itself the same way others describe it; a short or negative bar means the two perspectives learned very different things. In practice, only W is kept as the final word embedding; W′ is discarded after training.
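The bar heights described above are per-word cosine similarities between corresponding rows of W and W′. A NumPy sketch with random stand-in matrices (real trained weights would push semantically stable words toward +1):

```python
import numpy as np

rng = np.random.default_rng(3)
V, D = 96, 8
W  = rng.normal(size=(V, D))       # "when I'm the main word"
Wp = rng.normal(size=(V, D))       # "when I'm somebody's neighbor" (row layout)

# Per-word cosine between the two roles: the height of each bar.
num = np.sum(W * Wp, axis=1)
den = np.linalg.norm(W, axis=1) * np.linalg.norm(Wp, axis=1)
agreement = num / den              # (96,) values in [-1, 1]; tall = roles agree
```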