
Word2Vec

Learning Word Meanings from Context

An interactive deep-dive into how neural networks learn that words appearing in similar contexts have similar meanings. Every equation has a visual twin. Every concept is something you can see, touch, and play with.

~35 min read | Prerequisites: basic Python, comfort with lists of numbers | Published Jan 2025
Word2Vec mission control — embedding clusters, softmax equation, weight matrix heatmap, and probability distribution visualization on a dark cinematic background

TL;DR — What is Word2Vec?

Word2Vec is a neural network that learns to represent words as dense vectors (called embeddings) by predicting which words appear near each other in text. The core idea: words in similar contexts get similar vectors. The Skip-gram variant takes a center word, computes dot products with all vocabulary words, applies softmax to get probabilities, and adjusts the vectors via gradient descent. After thousands of training steps, semantically related words like “king” and “queen” end up close together in vector space — and you can even do arithmetic like king − man + woman ≈ queen.

01

Why Can’t Computers Understand Words?

Computers only understand numbers. The simplest representation — one-hot vectors — treats every word as equally different from every other word.

How does a computer see the word “king”? The simplest representation is a one-hot vector: a list of length 96 with a single 1 at the word’s position and zeros everywhere else.

king = [0, 0, …, 1, …, 0, 0]   (the single 1 sits at position 37 of 96)

So one-hot vectors are a dead end. Every word is equally far from every other word. “King” is no closer to “queen” than to “refrigerator.”
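This dead end is easy to verify in code. A minimal sketch (the word positions are invented for illustration):

```python
import numpy as np

VOCAB_SIZE = 96  # matches the article's 96-word vocabulary

def one_hot(index, size=VOCAB_SIZE):
    """Return a one-hot vector: all zeros except a single 1."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vocabulary positions for three words
king, queen, fridge = one_hot(37), one_hot(12), one_hot(80)

# Every pair of distinct one-hot vectors is orthogonal:
print(cosine(king, queen))   # 0.0
print(cosine(king, fridge))  # 0.0
```

No matter which two words you pick, the similarity is exactly zero — the representation carries no information about meaning.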

We need a better representation — one where similar words live close together. But where would that information come from? We don’t want to hand-code word similarities for every language.

It turns out the answer has been hiding in plain sight — in the text itself.

Diagram comparing one-hot encoding (isolated, equidistant dots) versus embedding space (clustered dots where similar words like king and queen are close together)
02

Words Are Defined by Their Friends

Words that appear in similar contexts tend to have similar meanings. This is the distributional hypothesis — the foundation of Word2Vec.

“You shall know a word by the company it keeps.” — J.R. Firth, 1957

Before building any neural network, we need the key insight: words that appear in the same contexts tend to mean similar things. Let’s discover this ourselves.

🎯 Try it — which words fit?

the wise ??? ruled the great kingdom
a small ??? slept peacefully by the warm fireplace
the bright ??? rose slowly over the calm river
🔍 Context fingerprints

Every word has a “fingerprint” — the set of words that appear near it. If two words share the same fingerprint, they mean similar things. Compare pairs below.

Take king and queen. Their fingerprints are nearly identical: both appear next to wise, ruled, great, old, wore, and golden.

king:  {wise, ruled, great, old, wore, golden}
queen: {wise, ruled, great, old, wore, golden}

Overlap: 100%. king and queen appear in almost identical contexts — they mean similar things!

⚙️ The mechanism — sliding window

How does a computer build these fingerprints? It slides a window across each sentence. At every position, the center word is paired with each context word inside the window. Step through to see it in action.

For example, at step 1 of 6 the window sits at the start of this sentence:

the wise king ruled the great kingdom from the throne

The center word is wise; stopwords like “the” are skipped. This position produces the training pairs (wise, king) and (wise, ruled).

Each pair says: “wise appeared near king — make their vectors similar.”
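The sliding-window mechanism fits in a few lines. A sketch, where the window size and stopword list are assumptions rather than the article's exact settings:

```python
# Hypothetical stopword list for illustration
STOPWORDS = {"the", "a", "from", "by", "over"}

def training_pairs(sentence, window=2):
    """Yield (center, context) pairs from a sliding window, skipping stopwords."""
    words = [w for w in sentence.lower().split() if w not in STOPWORDS]
    pairs = []
    for i, center in enumerate(words):
        # Pair the center word with each neighbor inside the window
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

pairs = training_pairs("the wise king ruled the great kingdom")
# Includes pairs like ('wise', 'king') and ('king', 'ruled')
```

Every sentence in the corpus is run through this same loop, and the resulting pairs become the model's training examples.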

🌐 The big picture — co-occurrence map

We just saw how the sliding window extracts word pairs from one sentence at a time. Now let’s zoom out: what happens when we collect every pair from every sentence and draw the result?

How to read this graph
Each dot = a word
Color shows its semantic group (gold = royalty, green = animals, etc.)
Each line = “appeared nearby”
A line between two words means they were inside the same context window at least once.
Thicker line = more often
A thick line means these words co-occurred in many sentences. Thin = rare co-occurrence.

This visualization uses 16 carefully chosen sentences to make patterns obvious. The real training corpus has 479 sentences, and Google’s original Word2Vec trained on 6 billion words. The same principles apply at every scale — just with richer, noisier patterns.

What to notice
1.

Words form clusters. king, queen, crown, and throne clump together because they frequently appear in the same sentences. So do cat and dog, or sun and moon.

2.

Clusters = meaning. Words that cluster together share similar meanings. The graph discovered semantic groups — royalty, animals, nature, people — purely from word proximity, with no dictionary or human labels.

3.

This is what Word2Vec learns. The neural network’s job is to compress this co-occurrence structure into small, dense vectors — so that words in the same cluster end up at nearby positions in vector space.

You just saw that co-occurrence patterns exist. Words that share contexts share meaning. A natural question: why not just use the co-occurrence counts directly?

You could build a giant table counting how often each pair of words appears together. People tried this (count-based methods like LSA, which factorize the table with SVD). It works! But the table is enormous (vocabulary-size × vocabulary-size), mostly zeros, and it doesn’t generalize well to unseen word combinations.
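Here is what that giant table looks like in miniature: a toy count over three invented sentences, using a whole-sentence window for simplicity.

```python
from collections import Counter
from itertools import combinations

# Tiny invented corpus for illustration
corpus = [
    "wise king ruled great kingdom",
    "wise queen ruled great kingdom",
    "small cat slept warm fireplace",
]

counts = Counter()
for sentence in corpus:
    words = sentence.split()
    # Count every pair of words that share a sentence
    for a, b in combinations(words, 2):
        counts[(a, b)] += 1
        counts[(b, a)] += 1

print(counts[("king", "ruled")])   # 1
print(counts[("queen", "ruled")])  # 1
print(counts[("cat", "ruled")])    # 0
```

Even here, most entries are zero — and with a real 100,000-word vocabulary the full table would have 10 billion cells.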

Diagram showing a sparse co-occurrence matrix being compressed into dense 8-dimensional word vectors, where king and queen have similar bar patterns

Word2Vec’s insight: instead of counting co-occurrences, train a neural network to predict context from words. The network is forced to compress word meaning into small, dense vectors as a side effect of learning to predict. The next chapter builds this machine piece by piece.

03

Building the Machine

The Skip-gram model predicts context words from a center word. Here’s how it works, piece by piece.

Dot Product = Similarity

Before we build the model, we need one key tool: a way to measure how similar two vectors are. The model will use this tool constantly — at every training step, for every word pair.

That tool is the dot product. When two vectors point in the same direction, their dot product is large. When they point apart, it’s small (or negative). Drag the vectors below to build intuition.

\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\,|\mathbf{b}|\cos\theta
Example state: at an angle of 51.0°, the dot product is 0.510 and the cosine similarity is 0.629.

Drag the vectors. When they point in the same direction, the dot product is large and positive. Perpendicular = 0. Opposite = negative. The dashed line shows the projection of b onto a.

Connection to Word2Vec: vectors a and b represent word embeddings. When the model predicts that “queen” is likely context for “king,” it means their dot product is high — their vectors point in similar directions.
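Both measures are one-liners. A sketch with hypothetical 3-dimensional embeddings, chosen so that “king” and “queen” point in similar directions:

```python
import math

def dot(a, b):
    """Dot product of two plain Python lists."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two vectors (length-normalized dot product)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Invented embeddings for illustration — real ones are learned
king  = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
river = [-0.5, 0.1, 0.9]

print(round(cosine(king, queen), 2))  # high, near 1
print(round(cosine(king, river), 2))  # negative — pointing apart
```

The model computes exactly this kind of score at every training step, for every word pair.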

From Scores to Probabilities: Softmax

The dot product gives us raw scores, but we need probabilities — numbers between 0 and 1 that sum to 1. Softmax does this conversion.

Imagine the center word is “king.” The model computes a dot-product score for every word in the vocabulary. Softmax turns these scores into a probability distribution: “How likely is each word to appear nearby?”

Given center word “king,” how likely is each word to appear nearby?
P(w_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
word      raw z    exp(z)    probability
queen      2.5     12.18        59.3%
throne     1.8      6.05        29.4%
cat        0.3      1.35         6.6%
river     -0.5      0.61         3.0%
apple     -1.0      0.37         1.8%

sum(P) = 1.0000 (always 1.0)

Try making one score much larger than the others. Softmax amplifies differences: the highest score gets a disproportionately large share of the probability.
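Softmax itself is a few lines of code. This sketch reproduces the probabilities above from the raw scores:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(z) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores for queen, throne, cat, river, apple
probs = softmax([2.5, 1.8, 0.3, -0.5, -1.0])
print([round(p, 3) for p in probs])  # [0.593, 0.294, 0.066, 0.03, 0.018]
print(round(sum(probs), 4))          # 1.0
```

Note how queen's score of 2.5 is only 0.7 above throne's 1.8, yet its probability is twice as large — the exponential stretches small gaps in score into large gaps in probability.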

The Skip-gram Architecture

Input: one-hot (96) → W (96 × 8) → Hidden: v_c (8) → W′ (8 × 96) → Softmax: P(·|c)

Step 1 of 5: given a center word like king, we represent it as a one-hot vector (95 zeros with a single 1).

Pause. The architecture is fully built — input, embedding, dot products, softmax, loss. The next three sections explain how and why this machine actually learns.
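The whole forward pass is two matrix operations plus a softmax. A sketch with randomly initialized weights (the real model trains them; the shapes match the article's 96-word vocabulary and 8-dimensional embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 96, 8                         # vocabulary size, embedding dimension
W = rng.normal(0, 0.1, (V, D))       # input embeddings: one row per word
W_out = rng.normal(0, 0.1, (D, V))   # output weight matrix

def forward(center_index):
    """Skip-gram forward pass: one-hot -> embedding -> scores -> probabilities."""
    v_c = W[center_index]        # embedding lookup (equivalent to one-hot @ W)
    z = v_c @ W_out              # dot-product score for every vocabulary word
    exp_z = np.exp(z - z.max())  # numerically stable softmax
    return exp_z / exp_z.sum()

probs = forward(37)  # hypothetical index of "king"
print(probs.shape)   # (96,)
```

Multiplying a one-hot vector by W just selects one row, which is why "embedding lookup" and "matrix multiply by a one-hot" are the same operation.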

Why Does This Actually Work?

Here’s the insight that makes it all click:

If “king” and “queen” both appear near the words “ruled,” “kingdom,” and “throne” — then the network must learn to predict those same context words from both “king” and “queen.”

To predict the same outputs, the network is forced to give them similar input vectors — because similar inputs are the only way to produce similar outputs through the same weight matrix.

Similar prediction targets → similar weights → similar embeddings.

King and queen both predict the same context words (ruled, kingdom, throne), forcing the network to assign them similar vectors — cosine similarity 0.91

This is why context prediction automatically discovers meaning. The network never sees a dictionary. It never reads a definition. It just tries to predict neighboring words — and meaning emerges as a side effect of getting better at that task.

Think of it this way: if you had to fill in the blank for “The wise ___ ruled the kingdom,” both “king” and “queen” work. The model learns this by making their vectors similar — exactly what you discovered in Chapter 2 with the fill-the-blank game.

The Gradient: Attraction vs Repulsion

The model knows what to learn (predict context words) and how to score (dot product + softmax). But how does it actually improve? Through gradients — the precise recipe for nudging each vector in the right direction.
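The attraction/repulsion recipe can be written directly. A sketch of the gradient for the center vector, with all values randomly initialized for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 96, 8
U = rng.normal(0, 0.1, (V, D))   # output vectors u_w, one per word
v_c = rng.normal(0, 0.1, D)      # center word's embedding

z = U @ v_c                          # score every word
probs = np.exp(z) / np.exp(z).sum()  # softmax

true_context = 12  # hypothetical index of the observed context word

# Gradient of the loss w.r.t. v_c:
#   attraction toward the true context word's vector,
#   repulsion from every word's vector, weighted by predicted probability
grad_v = -U[true_context] + probs @ U

# Gradient descent: nudge v_c against the gradient
lr = 0.05
v_c -= lr * grad_v
```

The `-U[true_context]` term pulls the center vector toward the word that actually appeared; the `probs @ U` term pushes it away from whatever the model currently predicts.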

The gradient does the heavy lifting. The next section is a practical optimization — important for real-world scale, but you can skip it without losing the main thread.

The Shortcut: Negative Sampling

There’s a computational problem we haven’t addressed. At every training step, softmax computes exp(z) for all 96 words. With 96 words, that’s fine. But real vocabularies have 100,000+ words. Computing softmax over all of them at every step is impossibly slow.

Full softmax scores all 100,000 words (slow) vs negative sampling scores only the true context word plus a few random negatives (fast, 1000x speedup)

Mikolov’s solution: negative sampling. Instead of comparing against every word, pick the true context word plus a handful of random “negative” words. Train the model to tell the real context apart from the noise.

Full Softmax

Score every word in the vocabulary. With 100K words, this is impossibly slow.

Negative Sampling

Score only 1 real + 5 random words. Thousands of times faster.
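A sketch of the negative-sampling objective. Note that this toy version samples negatives uniformly, whereas the original samples from a smoothed unigram distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
V, D = 96, 8
W = rng.normal(0, 0.1, (V, D))   # center-word embeddings
U = rng.normal(0, 0.1, (V, D))   # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, true_context, k=5):
    """Score 1 real pair high and k random 'noise' pairs low.

    Only k + 1 dot products instead of one per vocabulary word.
    """
    v_c = W[center]
    negatives = rng.integers(0, V, size=k)  # uniform sampling (a simplification)
    loss = -np.log(sigmoid(U[true_context] @ v_c))        # pull real pair together
    loss += -np.log(sigmoid(-(U[negatives] @ v_c))).sum() # push noise pairs apart
    return loss

print(neg_sampling_loss(center=37, true_context=12))
```

With k = 5, each step touches 6 word vectors instead of 100,000 — the source of the speedup described above.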

The machine is built. It takes a center word, looks up its embedding, scores every vocabulary word, and adjusts vectors to pull true context words closer while pushing irrelevant words away.

Now let’s watch it work. We pre-trained this model on a 479-sentence curated corpus with a 96-word vocabulary. The words start at random positions. Over 15,000 training steps, watch them self-organize into meaningful clusters.

What to watch for:

  • Gold dots (royalty) clustering together by step ~2,000
  • The loss curve dropping sharply in the first 2,000 steps
  • Animals (green) separating from people (blue) by step ~5,000

This is a toy corpus — real Word2Vec (Google’s original) trained on 6 billion words from Google News. But the same dynamics apply: context predicts meaning, and clusters emerge.

04

Watch It Learn

96 words start at random positions and self-organize into semantic clusters over 15,000 training steps. Drag the timeline. Pause and inspect any step.


The dots aren’t random anymore. Words that share meaning share neighborhoods. But the embedding space learned something deeper than just clustering…

The geometric relationships between vectors encode meaning. Directions in the space correspond to concepts like gender, royalty, or animal-ness. The most famous demonstration: vector arithmetic captures analogies.

05

The Payoff

The trained embeddings encode semantic relationships as geometric properties. The most famous: vector arithmetic captures analogies.
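With trained embeddings in hand, an analogy is just arithmetic plus a nearest-neighbor search. The 4-dimensional vectors below are invented so the arithmetic lands exactly; real embeddings only approximate this:

```python
import numpy as np

# Hypothetical embeddings where dimension 2 loosely encodes "femaleness"
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.8, 0.9, 0.2]),
    "man":   np.array([0.1, 0.2, 0.1, 0.7]),
    "woman": np.array([0.1, 0.2, 0.9, 0.7]),
}

def nearest(vec, exclude):
    """Word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Excluding the three input words is standard practice — otherwise the nearest neighbor of king − man + woman is usually king itself.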


Why This Matters

Vector arithmetic is a fun party trick, but the real power of word embeddings is in downstream tasks — the things you build on top of embeddings:

🔍Semantic Search

Match queries to documents by meaning, not just keywords. "cheap flights" finds "affordable airfare."

💡Recommendations

Airbnb embeds listings the same way Word2Vec embeds words — listings booked in similar sessions get similar vectors.

😊Sentiment Analysis

Feed word vectors into a classifier to detect positive/negative reviews without hand-crafting features.

🌍Machine Translation

Word vectors in different languages share geometric structure — "roi" in French maps near "king" in English.

Explore Any Word

Each word is represented as 8 numbers. Individual dimensions aren’t interpretable — there’s no “royalty dimension” or “animal dimension.” The meaning is distributed across all 8 numbers working together. What matters is the pattern: similar words have similar patterns.

06

What You Just Learned

Words that appear in similar contexts
get similar embeddings.

This single principle powers Word2Vec and much of modern NLP.

The Complete Pipeline

1
\mathbf{v}_c = W[c, :]
Embedding lookup
Look up the center word's dense vector
2
z_w = \mathbf{u}_w^\top \mathbf{v}_c
Scoring
Dot product measures alignment
3
P(w|c) = \frac{\exp(z_w)}{\sum_{w'} \exp(z_{w'})}
Softmax
Scores become probabilities
4
\mathcal{L} = -\log P(w_o \mid w_c)
Loss
How surprised by the truth
5
\nabla_{\mathbf{v}_c} \mathcal{L} = -\mathbf{u}_o + \sum_w P(w|c)\,\mathbf{u}_w
Gradient
Attract to truth, repel from predictions
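The five steps above condense into a single training-step function. A sketch with illustrative shapes and learning rate, each pipeline stage mapped to one line:

```python
import numpy as np

rng = np.random.default_rng(3)
V, D, lr = 96, 8, 0.05
W = rng.normal(0, 0.1, (V, D))   # center embeddings (rows are v_c)
U = rng.normal(0, 0.1, (V, D))   # output vectors (rows are u_w)

def train_step(c, o):
    """One Skip-gram step: center word index c, observed context word index o."""
    v_c = W[c]                              # 1. embedding lookup
    z = U @ v_c                             # 2. scoring (dot products)
    p = np.exp(z - z.max()); p /= p.sum()   # 3. softmax
    loss = -np.log(p[o])                    # 4. cross-entropy loss
    grad_v = -U[o] + p @ U                  # 5. gradient: attract + repel
    W[c] -= lr * grad_v                     # gradient descent update
    return loss

before = train_step(37, 12)
after = train_step(37, 12)
print(after < before)  # the update lowered the loss on this pair
```

Repeating this step for every (center, context) pair in the corpus, over many epochs, is the entire training algorithm (real implementations also update U and use negative sampling).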

Where This Leads

The core idea — using dot products to measure similarity between learned vector representations — is the foundation of modern deep learning.

Transformers use the same dot-product scoring mechanism, but dynamically: instead of fixed embeddings, they compute queries and keys for each input. The attention score \text{Attention}(Q, K) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) is a direct descendant of Word2Vec’s z_w = \mathbf{u}_w^\top \mathbf{v}_c.

Every embedding you see in GPT, BERT, or any modern language model descends from the ideas you just explored.

The Evolution of Word Embeddings

Word2Vec was just the beginning. Here’s how word representations evolved — and where the field is heading.

2013

Word2Vec

You are here

Static embeddings via skip-gram / CBOW. One vector per word, regardless of context.

2014

GloVe

Combines global co-occurrence statistics with local context windows. Often better on analogy tasks.

2017

FastText

Extends Word2Vec with subword n-grams. Handles rare and unseen words by composing character pieces.

2018

ELMo

First contextual embeddings. Uses deep bidirectional LSTMs — the same word gets different vectors in different sentences.

2018+

BERT / GPT

Transformer-based contextual embeddings. The attention mechanism is a direct descendant of Word2Vec's dot-product scoring.


The Other Variant: CBOW

We focused on Skip-gram (predict context from center). Word2Vec actually has a second variant: CBOW (Continuous Bag of Words), which does the reverse — it predicts the center word from its surrounding context.

Skip-gram
Center → predict context
Better on small datasets, rare words
CBOW
Context → predict center
Faster to train, better on frequent words

What Word2Vec Can’t Do

Word2Vec was groundbreaking, but it has real limitations. Understanding them is key to understanding why later models (ELMo, BERT, GPT) were created.

Polysemy

"Bank" gets ONE vector, whether it means a river bank or a financial institution. Word2Vec can't distinguish different meanings of the same word.

Out-of-Vocabulary

If a word wasn't in the training data, it has no vector. Misspellings, new slang, and rare words are invisible.

Word Order Ignored

"Dog bites man" and "man bites dog" produce the exact same training pairs. Word2Vec has no concept of word order or grammar.

Static Vectors

Each word gets a fixed vector forever. "Apple" the fruit and "Apple" the company share a single embedding.

The word 'bank' stuck between river-related words and finance-related words because Word2Vec assigns only one vector per word, unable to capture multiple meanings

These limitations are exactly what the evolution below — from GloVe to BERT/GPT — was designed to solve.

Test Your Understanding

Five quick questions to check what stuck. Click an answer to reveal the explanation.

1.

If "doctor" and "nurse" appear in similar contexts, what does Word2Vec learn?

2.

Why does Word2Vec use negative sampling in practice?

3.

What is the main limitation of Word2Vec compared to BERT?

4.

In the gradient, the attractive force pulls the center word toward ___

5.

"King" and "queen" end up with similar embeddings because ___

Resources & Further Reading

A curated collection of the best resources for going deeper — from visual intros to original research papers. Organized by type. Level tags help you pick what’s right for you.


Frequently Asked Questions

Quick answers to the most common questions about Word2Vec, skip-gram, embeddings, and how they connect to modern NLP.

What is Word2Vec?

Word2Vec is a neural network model that learns dense vector representations (embeddings) for words from large text corpora. It works by training a shallow neural network to predict context words from a center word (Skip-gram) or a center word from context words (CBOW). Words appearing in similar contexts end up with similar vector representations.

How does the Skip-gram model work?

The Skip-gram model takes a center word, looks up its embedding vector, computes dot products with all context word vectors, applies softmax to get probabilities, and trains using cross-entropy loss. The gradient pushes the center word's vector closer to its actual context words (attraction) and away from non-context words (repulsion). Over thousands of training steps, this produces meaningful word embeddings.

What is the distributional hypothesis?

The distributional hypothesis states that words appearing in similar contexts tend to have similar meanings. For example, "king" and "queen" appear near words like "throne," "crown," and "ruled" — so they must be semantically related. This principle, attributed to linguist J.R. Firth (1957), is the foundation of Word2Vec and all modern word embedding methods.

What is a one-hot vector?

A one-hot vector is the simplest way to represent a word as numbers. For a vocabulary of V words, each word becomes a vector of length V with all zeros except a single 1 at the word's index. The problem: all one-hot vectors are orthogonal (cosine similarity = 0), so "king" and "queen" are just as different as "king" and "refrigerator." Word2Vec solves this by learning dense vectors where similar words have similar representations.

Why does king − man + woman ≈ queen?

This famous example shows that Word2Vec embeddings encode semantic relationships as geometric directions. The vector from "man" to "king" captures the concept of "royalty." Adding that same direction to "woman" lands near "queen." This vector arithmetic works because the training process encodes consistent semantic patterns into the geometry of the embedding space.

What is a sliding context window?

A sliding context window is a fixed-size frame that moves across a sentence word by word. At each position, the center word is paired with its neighboring words within the window. These (center, context) pairs become training examples for the Word2Vec neural network. A larger window captures more topical/semantic relationships; a smaller window captures syntactic relationships.

What is cosine similarity?

Cosine similarity measures how similar two word vectors are by computing the cosine of the angle between them. A value of 1 means identical direction (very similar words), 0 means orthogonal (unrelated), and −1 means opposite. In Word2Vec, words like "king" and "queen" have high cosine similarity (~0.8) while "king" and "cat" have low similarity (~0.1).

How is Word2Vec different from BERT?

Word2Vec produces static embeddings — each word gets one fixed vector regardless of context. "Bank" has the same vector whether it means a financial institution or a river bank. BERT produces contextual embeddings — the same word gets different vectors depending on its surrounding context. BERT uses the Transformer architecture with self-attention, while Word2Vec uses a simple two-layer neural network.

What is negative sampling?

Negative sampling is an optimization that makes Word2Vec training practical for large vocabularies. Instead of computing softmax over all 100,000+ words at each step (very slow), negative sampling only updates the true context word (positive) and a small random sample of non-context words (negatives, typically 5–20). This provides a 1000×+ speedup while producing similar quality embeddings.

What is Word2Vec used for?

Word2Vec embeddings are used as input features for downstream NLP tasks: semantic search (matching queries to documents by meaning), sentiment analysis (classifying text as positive/negative), machine translation (mapping words across languages), recommendation systems (Airbnb uses embedding techniques inspired by Word2Vec), and as initialization for larger deep learning models.