Word2Vec
Inside the Model
Explore weight matrices, similarity heatmaps, and the forward-backward pass step by step. See exactly what the neural network learns at every training step.
Model Blueprint
The entire Skip-gram model in 13 lines of PyTorch. Hover any highlighted line to see what it does.
This is a tiny teaching model — not a production one.
We deliberately trained on a 96-word vocabulary with only 545 hand-crafted sentences and 1,536 total parameters. Real Word2Vec models use millions of words and billions of training tokens. The model is small enough that you can see every weight and understand every decision, yet large enough that real semantic structure emerges — king/queen cluster together, analogies work, and neighborhoods make sense.
The Code
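The listing on this page is the model itself; as a rough sketch of what a Skip-gram model with these exact dimensions (96 words, 8 dimensions, 1,536 parameters) looks like in PyTorch — class and variable names here are illustrative, not the site's actual code:

```python
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size=96, embed_dim=8):
        super().__init__()
        # W: one row per word, used when the word is the center word
        self.W = nn.Embedding(vocab_size, embed_dim)
        # W': used to score every word as a possible context word
        self.W_out = nn.Linear(embed_dim, vocab_size, bias=False)

    def forward(self, center_ids):
        h = self.W(center_ids)      # (batch, 8): look up the center word
        return self.W_out(h)        # (batch, 96): one score per vocab word

model = SkipGram()
n_params = sum(p.numel() for p in model.parameters())  # 96*8 + 8*96 = 1,536
logits = model(torch.tensor([3]))  # a hypothetical center-word id
```

Training then pairs these logits with a cross-entropy loss against the observed context word.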
Parameter Census
Training Config
Corpus Overview
Forward & Backprop Walkthrough
Walk through a single training step — from input to weight update — at your own pace. Pick any step to see the actual numbers the model computes. Pause, inspect, and compare early steps with late ones.
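The same computation the walkthrough animates can be written out by hand. This NumPy sketch (with random stand-in weights and hypothetical word ids) does one full step: look up the center word, score every vocab word, softmax, cross-entropy loss, and the gradient updates:

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 96, 8                          # vocab size, embedding dim
W  = rng.normal(0, 0.1, (V, D))       # center embeddings
Wp = rng.normal(0, 0.1, (V, D))       # context embeddings (W')

center, context = 3, 17               # hypothetical word ids
lr = 0.05

# Forward pass: lookup, scores, softmax, cross-entropy loss
h = W[center]                         # (D,)
logits = Wp @ h                       # (V,): one score per vocab word
p = np.exp(logits - logits.max()); p /= p.sum()
loss = -np.log(p[context])

# Backward pass: d(loss)/d(logits) = p - one_hot(context)
d_logits = p.copy(); d_logits[context] -= 1.0
d_Wp = np.outer(d_logits, h)          # gradient for every context vector
d_h  = Wp.T @ d_logits                # gradient for the one center vector

# SGD update: only W[center] and the rows of W' move
W[center] -= lr * d_h
Wp        -= lr * d_Wp
```

After the update, the model assigns the observed context word a slightly higher probability — that nudge, repeated thousands of times, is all training is.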
Training Journey
Our tiny model trained for 15,000 steps on 545 sentences. That’s enough for real patterns to emerge: neighbors become meaningful, analogies start working, and embeddings converge. The same dynamics happen in full-scale Word2Vec — just with 100 billion words instead of 1,751.
Neighbor Evolution
Pick a word and see how its nearest neighbors change during training. At step 0, neighbors are random. By step 5,000+, semantically related words dominate.
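"Nearest neighbors" here means highest cosine similarity between embedding rows. A minimal sketch of that lookup (the function name is ours, not the site's):

```python
import numpy as np

def nearest_neighbors(W, word_id, k=5):
    """Return ids of the k words most cosine-similar to word_id's embedding."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize each row
    sims = Wn @ Wn[word_id]                            # cosine similarity to every word
    sims[word_id] = -np.inf                            # a word is not its own neighbor
    return np.argsort(sims)[::-1][:k]
```

At step 0 these similarities are random noise; after training, the top of the list is dominated by words from the same semantic group.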
Analogy Scorecard
Five analogy tests evaluated throughout training. Watch the model go from 0/5 to getting most analogies correct as embeddings learn semantic structure.
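Each analogy test is plain vector arithmetic: "man is to woman as king is to ?" becomes W[woman] − W[man] + W[king], then find the nearest word. A sketch, assuming the standard convention of excluding the three query words from the answer:

```python
import numpy as np

def analogy(W, vocab, a, b, c):
    """Solve a : b :: c : ? by finding the word closest to W[b] - W[a] + W[c]."""
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
    target = Wn[vocab[b]] - Wn[vocab[a]] + Wn[vocab[c]]
    sims = Wn @ (target / np.linalg.norm(target))
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf      # the query words themselves don't count
    return int(np.argmax(sims))
```

A test passes when the top-ranked word matches the expected answer; the scorecard simply re-runs these five checks at every snapshot.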
Word Convergence
How quickly does each word’s embedding converge to its final position? Common words converge fast; rare words take much longer.
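One simple way to measure this — and an assumption about the metric, not necessarily the one the chart uses — is cosine similarity between a word's embedding at each checkpoint and its final embedding:

```python
import numpy as np

def convergence_curve(checkpoints, word_id):
    """Cosine similarity of a word's embedding at each checkpoint to its final one."""
    final = checkpoints[-1][word_id]
    final = final / np.linalg.norm(final)
    curve = []
    for W in checkpoints:
        v = W[word_id]
        curve.append(float(v @ final / np.linalg.norm(v)))
    return curve
```

A fast-converging word's curve shoots toward 1.0 early; a rare word's curve climbs slowly and may still be rising at the last checkpoint.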
Weight Matrix Animation
The raw embedding matrix W (96 words × 8 dimensions) at 10 snapshots during training. Watch it evolve from random noise to structured patterns where semantic groups share similar weight profiles.
Weight Explorer
With only 96 words and 8 dimensions, we can visualize every single number the model learned. In a real Word2Vec model (3 million words × 300 dimensions) this would be impossible — that’s why we built this toy version.
Similarity Heatmap
Cosine similarity between all 96 word embeddings, sorted by semantic group. The bright blue blocks along the diagonal show that words in the same category have learned similar representations.
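The heatmap is just one matrix product on normalized rows; the diagonal blocks appear once the rows are reordered by semantic group. A sketch with random stand-in weights:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(96, 8))         # stand-in for the learned embedding matrix

Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
S = Wn @ Wn.T                        # (96, 96): S[i, j] = cosine similarity of words i, j
```

S is symmetric with 1.0 on the diagonal; in the trained model, entries within a semantic group are large and entries across groups are near zero.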
Two Faces of Every Word
The model has two separate vectors for every word — not one. Here’s why:
W — “When I’m the main word”
When “king” is the center word and the model asks “what words appear near king?”, it uses W[king]. Think of it as the word describing itself.
W′ — “When I’m somebody’s neighbor”
When “king” is a context word and the model checks “is king likely to appear near queen?”, it uses W′[king]. Think of it as others describing what it’s like to be near king.
Because these two roles get different gradient updates during training, the two vectors for the same word can end up quite different. The chart below measures this: a tall bar means the word describes itself the same way others describe it; a short or negative bar means the two perspectives learned very different things.
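The bars can be computed as a per-word cosine between the two matrices. A sketch (the function name is ours):

```python
import numpy as np

def role_agreement(W, W_out):
    """Per-word cosine between its center vector (W) and its context vector (W')."""
    Wn  = W     / np.linalg.norm(W,     axis=1, keepdims=True)
    Wpn = W_out / np.linalg.norm(W_out, axis=1, keepdims=True)
    return np.sum(Wn * Wpn, axis=1)   # one value per word, in [-1, 1]
```

A value near 1 means the two perspectives agree; a value near 0 or below means they learned different things about the same word.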