
Word2Vec

Learning Word Meanings from Context

An interactive deep-dive into how neural networks learn that words appearing in similar contexts have similar meanings. Every equation has a visual twin. Every concept is something you can see, touch, and play with.

~35 min read | Prerequisites: basic Python, comfort with lists of numbers | Published Jan 2025
Word2Vec mission control — embedding clusters, softmax equation, weight matrix heatmap, and probability distribution visualization on a dark cinematic background

TL;DR — What is Word2Vec?

Word2Vec is a neural network that learns to represent words as dense vectors (called embeddings) by predicting which words appear near each other in text. The core idea: words in similar contexts get similar vectors. The Skip-gram variant takes a center word, computes dot products with all vocabulary words, applies softmax to get probabilities, and adjusts the vectors via gradient descent. After thousands of training steps, semantically related words like “king” and “queen” end up close together in vector space — and you can even do arithmetic like king − man + woman ≈ queen.

01

Why Can’t Computers Understand Words?

Computers only understand numbers. The simplest representation — one-hot vectors — treats every word as equally different from every other word.

How does a computer see the word “king”? The simplest representation is a one-hot vector: a list of length 96 with a single 1 at the word’s position and zeros everywhere else.

king = [0, 0, …, 1, …, 0, 0]   (the single 1 sits at position 37 of 96)

So one-hot vectors are a dead end. Every word is equally far from every other word. “King” is no closer to “queen” than to “refrigerator.”
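This dead end is easy to verify in code. A minimal sketch (the word positions are invented for illustration):

```python
import numpy as np

VOCAB_SIZE = 96  # matches the article's 96-word vocabulary

def one_hot(index, size=VOCAB_SIZE):
    """Return a one-hot vector: all zeros except a single 1."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# Hypothetical vocabulary positions for three words
king, queen, fridge = one_hot(37), one_hot(12), one_hot(80)

# Every pair of distinct one-hot vectors is orthogonal:
print(cosine(king, queen))   # 0.0
print(cosine(king, fridge))  # 0.0
```

No matter which two words you pick, the similarity is exactly zero — the representation carries no information about meaning.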

We need a better representation — one where similar words live close together. But where would that information come from? We don’t want to hand-code word similarities for every language.

It turns out the answer has been hiding in plain sight — in the text itself.

Diagram comparing one-hot encoding (isolated, equidistant dots) versus embedding space (clustered dots where similar words like king and queen are close together)
02

Words Are Defined by Their Friends

Words that appear in similar contexts tend to have similar meanings. This is the distributional hypothesis — the foundation of Word2Vec.

“You shall know a word by the company it keeps.” — J.R. Firth, 1957

Before building any neural network, we need the key insight: words that appear in the same contexts tend to mean similar things. Let’s discover this ourselves.

🎯 Try it — which words fit?

the wise ??? ruled the great kingdom
a small ??? slept peacefully by the warm fireplace
the bright ??? rose slowly over the calm river
🔍 Context fingerprints

Every word has a “fingerprint” — the set of words that appear near it. If two words share the same fingerprint, they mean similar things. Compare pairs below.

Take king and queen. Their fingerprints are nearly identical: both appear next to wise, ruled, great, old, wore, and golden.

king:  {wise, ruled, great, old, wore, golden}
queen: {wise, ruled, great, old, wore, golden}

Overlap: 100%. king and queen appear in almost identical contexts — they mean similar things!

⚙️ The mechanism — sliding window

How does a computer build these fingerprints? It slides a window across each sentence. At every position, the center word is paired with each context word inside the window. Step through to see it in action.

For example, at step 1 of 6 the window sits at the start of this sentence:

the wise king ruled the great kingdom from the throne

The center word is wise; stopwords like “the” are skipped. This position produces the training pairs (wise, king) and (wise, ruled).

Each pair says: “wise appeared near king — make their vectors similar.”
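The sliding-window mechanism fits in a few lines. A sketch, where the window size and stopword list are assumptions rather than the article's exact settings:

```python
# Hypothetical stopword list for illustration
STOPWORDS = {"the", "a", "from", "by", "over"}

def training_pairs(sentence, window=2):
    """Yield (center, context) pairs from a sliding window, skipping stopwords."""
    words = [w for w in sentence.lower().split() if w not in STOPWORDS]
    pairs = []
    for i, center in enumerate(words):
        # Pair the center word with each neighbor inside the window
        for j in range(max(0, i - window), min(len(words), i + window + 1)):
            if j != i:
                pairs.append((center, words[j]))
    return pairs

pairs = training_pairs("the wise king ruled the great kingdom")
# Includes pairs like ('wise', 'king') and ('king', 'ruled')
```

Every sentence in the corpus is run through this same loop, and the resulting pairs become the model's training examples.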

🌐 The big picture — co-occurrence map

We just saw how the sliding window extracts word pairs from one sentence at a time. Now let’s zoom out: what happens when we collect every pair from every sentence and draw the result?

How to read this graph
Each dot = a word
Color shows its semantic group (gold = royalty, green = animals, etc.)
Each line = “appeared nearby”
A line between two words means they were inside the same context window at least once.
Thicker line = more often
A thick line means these words co-occurred in many sentences. Thin = rare co-occurrence.

This visualization uses 16 carefully chosen sentences to make patterns obvious. The real training corpus has 479 sentences, and Google’s original Word2Vec trained on 6 billion words. The same principles apply at every scale — just with richer, noisier patterns.

What to notice
1.

Words form clusters. king, queen, crown, and throne clump together because they frequently appear in the same sentences. So do cat and dog, or sun and moon.

2.

Clusters = meaning. Words that cluster together share similar meanings. The graph discovered semantic groups — royalty, animals, nature, people — purely from word proximity, with no dictionary or human labels.

3.

This is what Word2Vec learns. The neural network’s job is to compress this co-occurrence structure into small, dense vectors — so that words in the same cluster end up at nearby positions in vector space.

You just saw that co-occurrence patterns exist. Words that share contexts share meaning. A natural question: why not just use the co-occurrence counts directly?

You could build a giant table counting how often each pair of words appears together. People tried this (count-based methods like LSA, which factorize the table with SVD). It works! But the table is enormous (vocabulary-size × vocabulary-size), mostly zeros, and it doesn’t generalize well to unseen word combinations.
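Here is what that giant table looks like in miniature: a toy count over three invented sentences, using a whole-sentence window for simplicity.

```python
from collections import Counter
from itertools import combinations

# Tiny invented corpus for illustration
corpus = [
    "wise king ruled great kingdom",
    "wise queen ruled great kingdom",
    "small cat slept warm fireplace",
]

counts = Counter()
for sentence in corpus:
    words = sentence.split()
    # Count every pair of words that share a sentence
    for a, b in combinations(words, 2):
        counts[(a, b)] += 1
        counts[(b, a)] += 1

print(counts[("king", "ruled")])   # 1
print(counts[("queen", "ruled")])  # 1
print(counts[("cat", "ruled")])    # 0
```

Even here, most entries are zero — and with a real 100,000-word vocabulary the full table would have 10 billion cells.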

Diagram showing a sparse co-occurrence matrix being compressed into dense 8-dimensional word vectors, where king and queen have similar bar patterns

Word2Vec’s insight: instead of counting co-occurrences, train a neural network to predict context from words. The network is forced to compress word meaning into small, dense vectors as a side effect of learning to predict. The next chapter builds this machine piece by piece.

03

Building the Machine

The Skip-gram model predicts context words from a center word. Here’s how it works, piece by piece.

Dot Product = Similarity

Before we build the model, we need one key tool: a way to measure how similar two vectors are. The model will use this tool constantly — at every training step, for every word pair.

That tool is the dot product. When two vectors point in the same direction, their dot product is large. When they point apart, it’s small (or negative). Drag the vectors below to build intuition.

\mathbf{a} \cdot \mathbf{b} = |\mathbf{a}|\,|\mathbf{b}|\cos\theta
Example state: at an angle of 51.0°, the dot product is 0.510 and the cosine similarity is 0.629.

Drag the vectors. When they point in the same direction, the dot product is large and positive. Perpendicular = 0. Opposite = negative. The dashed line shows the projection of b onto a.

Connection to Word2Vec: vectors a and b represent word embeddings. When the model predicts that “queen” is likely context for “king,” it means their dot product is high — their vectors point in similar directions.
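Both measures are one-liners. A sketch with hypothetical 3-dimensional embeddings, chosen so that “king” and “queen” point in similar directions:

```python
import math

def dot(a, b):
    """Dot product of two plain Python lists."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Cosine of the angle between two vectors (length-normalized dot product)."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

# Invented embeddings for illustration — real ones are learned
king  = [0.9, 0.8, 0.1]
queen = [0.8, 0.9, 0.2]
river = [-0.5, 0.1, 0.9]

print(round(cosine(king, queen), 2))  # high, near 1
print(round(cosine(king, river), 2))  # negative — pointing apart
```

The model computes exactly this kind of score at every training step, for every word pair.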

From Scores to Probabilities: Softmax

The dot product gives us raw scores, but we need probabilities — numbers between 0 and 1 that sum to 1. Softmax does this conversion.

Imagine the center word is “king.” The model computes a dot-product score for every word in the vocabulary. Softmax turns these scores into a probability distribution: “How likely is each word to appear nearby?”

Given center word “king,” how likely is each word to appear nearby?
P(w_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}
word      raw z    exp(z)    probability
queen      2.5     12.18        59.3%
throne     1.8      6.05        29.4%
cat        0.3      1.35         6.6%
river     -0.5      0.61         3.0%
apple     -1.0      0.37         1.8%

sum(P) = 1.0000 (always 1.0)

Try making one score much larger than the others. Softmax amplifies differences: the highest score gets a disproportionately large share of the probability.
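Softmax itself is a few lines of code. This sketch reproduces the probabilities above from the raw scores:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    exps = [math.exp(z) for z in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Scores for queen, throne, cat, river, apple
probs = softmax([2.5, 1.8, 0.3, -0.5, -1.0])
print([round(p, 3) for p in probs])  # [0.593, 0.294, 0.066, 0.03, 0.018]
print(round(sum(probs), 4))          # 1.0
```

Note how queen's score of 2.5 is only 0.7 above throne's 1.8, yet its probability is twice as large — the exponential stretches small gaps in score into large gaps in probability.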

The Skip-gram Architecture

Input: one-hot (96) → W (96 × 8) → Hidden: v_c (8) → W′ (8 × 96) → Softmax: P(·|c)

Step 1 of 5: given a center word like king, we represent it as a one-hot vector (95 zeros with a single 1).

Pause. The architecture is fully built — input, embedding, dot products, softmax, loss. The next three sections explain how and why this machine actually learns.
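The whole forward pass is two matrix operations plus a softmax. A sketch with randomly initialized weights (the real model trains them; the shapes match the article's 96-word vocabulary and 8-dimensional embeddings):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D = 96, 8                         # vocabulary size, embedding dimension
W = rng.normal(0, 0.1, (V, D))       # input embeddings: one row per word
W_out = rng.normal(0, 0.1, (D, V))   # output weight matrix

def forward(center_index):
    """Skip-gram forward pass: one-hot -> embedding -> scores -> probabilities."""
    v_c = W[center_index]        # embedding lookup (equivalent to one-hot @ W)
    z = v_c @ W_out              # dot-product score for every vocabulary word
    exp_z = np.exp(z - z.max())  # numerically stable softmax
    return exp_z / exp_z.sum()

probs = forward(37)  # hypothetical index of "king"
print(probs.shape)   # (96,)
```

Multiplying a one-hot vector by W just selects one row, which is why "embedding lookup" and "matrix multiply by a one-hot" are the same operation.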

Why Does This Actually Work?

Here’s the insight that makes it all click:

If “king” and “queen” both appear near the words “ruled,” “kingdom,” and “throne” — then the network must learn to predict those same context words from both “king” and “queen.”

To predict the same outputs, the network is forced to give them similar input vectors — because similar inputs are the only way to produce similar outputs through the same weight matrix.

Similar prediction targets → similar weights → similar embeddings.

King and queen both predict the same context words (ruled, kingdom, throne), forcing the network to assign them similar vectors — cosine similarity 0.91

This is why context prediction automatically discovers meaning. The network never sees a dictionary. It never reads a definition. It just tries to predict neighboring words — and meaning emerges as a side effect of getting better at that task.

Think of it this way: if you had to fill in the blank for “The wise ___ ruled the kingdom,” both “king” and “queen” work. The model learns this by making their vectors similar — exactly what you discovered in Chapter 2 with the fill-the-blank game.

The Gradient: Attraction vs Repulsion

The model knows what to learn (predict context words) and how to score (dot product + softmax). But how does it actually improve? Through gradients — the precise recipe for nudging each vector in the right direction.
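The attraction/repulsion recipe can be written directly. A sketch of the gradient for the center vector, with all values randomly initialized for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
V, D = 96, 8
U = rng.normal(0, 0.1, (V, D))   # output vectors u_w, one per word
v_c = rng.normal(0, 0.1, D)      # center word's embedding

z = U @ v_c                          # score every word
probs = np.exp(z) / np.exp(z).sum()  # softmax

true_context = 12  # hypothetical index of the observed context word

# Gradient of the loss w.r.t. v_c:
#   attraction toward the true context word's vector,
#   repulsion from every word's vector, weighted by predicted probability
grad_v = -U[true_context] + probs @ U

# Gradient descent: nudge v_c against the gradient
lr = 0.05
v_c -= lr * grad_v
```

The `-U[true_context]` term pulls the center vector toward the word that actually appeared; the `probs @ U` term pushes it away from whatever the model currently predicts.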

The gradient does the heavy lifting. The next section is a practical optimization — important for real-world scale, but you can skip it without losing the main thread.

The Shortcut: Negative Sampling

There’s a computational problem we haven’t addressed. At every training step, softmax computes exp(z) for all 96 words. With 96 words, that’s fine. But real vocabularies have 100,000+ words. Computing softmax over all of them at every step is impossibly slow.

Full softmax scores all 100,000 words (slow) vs negative sampling scores only the true context word plus a few random negatives (fast, 1000x speedup)

Mikolov’s solution: negative sampling. Instead of comparing against every word, pick the true context word plus a handful of random “negative” words. Train the model to tell the real context apart from the noise.

Full Softmax

Score every word in the vocabulary. With 100K words, this is impossibly slow.

Negative Sampling

Score only 1 real + 5 random words. Thousands of times faster.
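A sketch of the negative-sampling objective. Note that this toy version samples negatives uniformly, whereas the original samples from a smoothed unigram distribution:

```python
import numpy as np

rng = np.random.default_rng(2)
V, D = 96, 8
W = rng.normal(0, 0.1, (V, D))   # center-word embeddings
U = rng.normal(0, 0.1, (V, D))   # context-word ("output") embeddings

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, true_context, k=5):
    """Score 1 real pair high and k random 'noise' pairs low.

    Only k + 1 dot products instead of one per vocabulary word.
    """
    v_c = W[center]
    negatives = rng.integers(0, V, size=k)  # uniform sampling (a simplification)
    loss = -np.log(sigmoid(U[true_context] @ v_c))        # pull real pair together
    loss += -np.log(sigmoid(-(U[negatives] @ v_c))).sum() # push noise pairs apart
    return loss

print(neg_sampling_loss(center=37, true_context=12))
```

With k = 5, each step touches 6 word vectors instead of 100,000 — the source of the speedup described above.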

The machine is built. It takes a center word, looks up its embedding, scores every vocabulary word, and adjusts vectors to pull true context words closer while pushing irrelevant words away.

Now let’s watch it work. We pre-trained this model on a 479-sentence curated corpus with a 96-word vocabulary. The words start at random positions. Over 15,000 training steps, watch them self-organize into meaningful clusters.

What to watch for:

  • Gold dots (royalty) clustering together by step ~2,000
  • The loss curve dropping sharply in the first 2,000 steps
  • Animals (green) separating from people (blue) by step ~5,000

This is a toy corpus — real Word2Vec (Google’s original) trained on 6 billion words from Google News. But the same dynamics apply: context predicts meaning, and clusters emerge.

04

Watch It Learn

96 words start at random positions and self-organize into semantic clusters over 15,000 training steps. Drag the timeline. Pause and inspect any step.


The dots aren’t random anymore. Words that share meaning share neighborhoods. But the embedding space learned something deeper than just clustering…

The geometric relationships between vectors encode meaning. Directions in the space correspond to concepts like gender, royalty, or animal-ness. The most famous demonstration: vector arithmetic captures analogies.

05

The Payoff

The trained embeddings encode semantic relationships as geometric properties. The most famous: vector arithmetic captures analogies.
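With trained embeddings in hand, an analogy is just arithmetic plus a nearest-neighbor search. The 4-dimensional vectors below are invented so the arithmetic lands exactly; real embeddings only approximate this:

```python
import numpy as np

# Hypothetical embeddings where dimension 2 loosely encodes "femaleness"
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.2]),
    "queen": np.array([0.9, 0.8, 0.9, 0.2]),
    "man":   np.array([0.1, 0.2, 0.1, 0.7]),
    "woman": np.array([0.1, 0.2, 0.9, 0.7]),
}

def nearest(vec, exclude):
    """Word whose embedding has the highest cosine similarity to vec."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude),
               key=lambda w: cos(emb[w], vec))

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Excluding the three input words is standard practice — otherwise the nearest neighbor of king − man + woman is usually king itself.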


Why This Matters

Vector arithmetic is a fun party trick, but the real power of word embeddings is in downstream tasks — the things you build on top of embeddings:

🔍Semantic Search

Match queries to documents by meaning, not just keywords. "cheap flights" finds "affordable airfare."

💡Recommendations

Airbnb embeds listings the same way Word2Vec embeds words — listings booked in similar sessions get similar vectors.

😊Sentiment Analysis

Feed word vectors into a classifier to detect positive/negative reviews without hand-crafting features.

🌍Machine Translation

Word vectors in different languages share geometric structure — "roi" in French maps near "king" in English.

Explore Any Word

Each word is represented as 8 numbers. Individual dimensions aren’t interpretable — there’s no “royalty dimension” or “animal dimension.” The meaning is distributed across all 8 numbers working together. What matters is the pattern: similar words have similar patterns.

06

What You Just Learned

Words that appear in similar contexts
get similar embeddings.

This single principle powers Word2Vec and much of modern NLP.

The Complete Pipeline

1
\mathbf{v}_c = W[c, :]
Embedding lookup
Look up the center word's dense vector
2
z_w = \mathbf{u}_w^\top \mathbf{v}_c
Scoring
Dot product measures alignment
3
P(w|c) = \frac{\exp(z_w)}{\sum_{w'} \exp(z_{w'})}
Softmax
Scores become probabilities
4
\mathcal{L} = -\log P(w_o \mid w_c)
Loss
How surprised by the truth
5
\nabla_{\mathbf{v}_c} \mathcal{L} = -\mathbf{u}_o + \sum_w P(w|c)\,\mathbf{u}_w
Gradient
Attract to truth, repel from predictions
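The five steps above condense into a single training-step function. A sketch with illustrative shapes and learning rate, each pipeline stage mapped to one line:

```python
import numpy as np

rng = np.random.default_rng(3)
V, D, lr = 96, 8, 0.05
W = rng.normal(0, 0.1, (V, D))   # center embeddings (rows are v_c)
U = rng.normal(0, 0.1, (V, D))   # output vectors (rows are u_w)

def train_step(c, o):
    """One Skip-gram step: center word index c, observed context word index o."""
    v_c = W[c]                              # 1. embedding lookup
    z = U @ v_c                             # 2. scoring (dot products)
    p = np.exp(z - z.max()); p /= p.sum()   # 3. softmax
    loss = -np.log(p[o])                    # 4. cross-entropy loss
    grad_v = -U[o] + p @ U                  # 5. gradient: attract + repel
    W[c] -= lr * grad_v                     # gradient descent update
    return loss

before = train_step(37, 12)
after = train_step(37, 12)
print(after < before)  # the update lowered the loss on this pair
```

Repeating this step for every (center, context) pair in the corpus, over many epochs, is the entire training algorithm (real implementations also update U and use negative sampling).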

Where This Leads

The core idea — using dot products to measure similarity between learned vector representations — is the foundation of modern deep learning.

Transformers use the same dot-product scoring mechanism, but dynamically: instead of fixed embeddings, they compute queries and keys for each input. The attention score \text{Attention}(Q, K) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) is a direct descendant of Word2Vec’s z_w = \mathbf{u}_w^\top \mathbf{v}_c.

Every embedding you see in GPT, BERT, or any modern language model descends from the ideas you just explored.

The Evolution of Word Embeddings

Word2Vec was just the beginning. Here’s how word representations evolved — and where the field is heading.

2013

Word2Vec

You are here

Static embeddings via skip-gram / CBOW. One vector per word, regardless of context.

2014

GloVe

Combines global co-occurrence statistics with local context windows. Often better on analogy tasks.

2017

FastText

Extends Word2Vec with subword n-grams. Handles rare and unseen words by composing character pieces.

2018

ELMo

First contextual embeddings. Uses deep bidirectional LSTMs — the same word gets different vectors in different sentences.

2018+

BERT / GPT

Transformer-based contextual embeddings. The attention mechanism is a direct descendant of Word2Vec's dot-product scoring.


The Other Variant: CBOW

We focused on Skip-gram (predict context from center). Word2Vec actually has a second variant: CBOW (Continuous Bag of Words), which does the reverse — it predicts the center word from its surrounding context.

Skip-gram
Center → predict context
Better on small datasets, rare words
CBOW
Context → predict center
Faster to train, better on frequent words

What Word2Vec Can’t Do

Word2Vec was groundbreaking, but it has real limitations. Understanding them is key to understanding why later models (ELMo, BERT, GPT) were created.

Polysemy

"Bank" gets ONE vector, whether it means a river bank or a financial institution. Word2Vec can't distinguish different meanings of the same word.

Out-of-Vocabulary

If a word wasn't in the training data, it has no vector. Misspellings, new slang, and rare words are invisible.

Word Order Ignored

"Dog bites man" and "man bites dog" produce the exact same training pairs. Word2Vec has no concept of word order or grammar.

Static Vectors

Each word gets a fixed vector forever. "Apple" the fruit and "Apple" the company share a single embedding.

The word 'bank' stuck between river-related words and finance-related words because Word2Vec assigns only one vector per word, unable to capture multiple meanings

These limitations are exactly what the evolution below — from GloVe to BERT/GPT — was designed to solve.

Test Your Understanding

Five quick questions to check what stuck. Click an answer to reveal the explanation.

1.

If "doctor" and "nurse" appear in similar contexts, what does Word2Vec learn?

2.

Why does Word2Vec use negative sampling in practice?

3.

What is the main limitation of Word2Vec compared to BERT?

4.

In the gradient, the attractive force pulls the center word toward ___

5.

"King" and "queen" end up with similar embeddings because ___

Resources & Further Reading

A curated collection of the best resources for going deeper — from visual intros to original research papers. Organized by type. Level tags help you pick what’s right for you.


Frequently Asked Questions

Quick answers to the most common questions about Word2Vec, skip-gram, embeddings, and how they connect to modern NLP.

What is Word2Vec?

Word2Vec is a neural network model that learns dense vector representations (embeddings) for words from large text corpora. It works by training a shallow neural network to predict context words from a center word (Skip-gram) or a center word from context words (CBOW). Words appearing in similar contexts end up with similar vector representations.

How does the Skip-gram model work?

The Skip-gram model takes a center word, looks up its embedding vector, computes dot products with all context word vectors, applies softmax to get probabilities, and trains using cross-entropy loss. The gradient pushes the center word's vector closer to its actual context words (attraction) and away from non-context words (repulsion). Over thousands of training steps, this produces meaningful word embeddings.

What is the distributional hypothesis?

The distributional hypothesis states that words appearing in similar contexts tend to have similar meanings. For example, "king" and "queen" appear near words like "throne," "crown," and "ruled" — so they must be semantically related. This principle, attributed to linguist J.R. Firth (1957), is the foundation of Word2Vec and all modern word embedding methods.

What is a one-hot vector?

A one-hot vector is the simplest way to represent a word as numbers. For a vocabulary of V words, each word becomes a vector of length V with all zeros except a single 1 at the word's index. The problem: all one-hot vectors are orthogonal (cosine similarity = 0), so "king" and "queen" are just as different as "king" and "refrigerator." Word2Vec solves this by learning dense vectors where similar words have similar representations.

Why does king − man + woman ≈ queen?

This famous example shows that Word2Vec embeddings encode semantic relationships as geometric directions. The vector from "man" to "king" captures the concept of "royalty." Adding that same direction to "woman" lands near "queen." This vector arithmetic works because the training process encodes consistent semantic patterns into the geometry of the embedding space.

What is a sliding context window?

A sliding context window is a fixed-size frame that moves across a sentence word by word. At each position, the center word is paired with its neighboring words within the window. These (center, context) pairs become training examples for the Word2Vec neural network. A larger window captures more topical/semantic relationships; a smaller window captures syntactic relationships.

What is cosine similarity?

Cosine similarity measures how similar two word vectors are by computing the cosine of the angle between them. A value of 1 means identical direction (very similar words), 0 means orthogonal (unrelated), and −1 means opposite. In Word2Vec, words like "king" and "queen" have high cosine similarity (~0.8) while "king" and "cat" have low similarity (~0.1).

How is Word2Vec different from BERT?

Word2Vec produces static embeddings — each word gets one fixed vector regardless of context. "Bank" has the same vector whether it means a financial institution or a river bank. BERT produces contextual embeddings — the same word gets different vectors depending on its surrounding context. BERT uses the Transformer architecture with self-attention, while Word2Vec uses a simple two-layer neural network.

What is negative sampling?

Negative sampling is an optimization that makes Word2Vec training practical for large vocabularies. Instead of computing softmax over all 100,000+ words at each step (very slow), negative sampling only updates the true context word (positive) and a small random sample of non-context words (negatives, typically 5–20). This provides a 1000×+ speedup while producing similar quality embeddings.

What is Word2Vec used for?

Word2Vec embeddings are used as input features for downstream NLP tasks: semantic search (matching queries to documents by meaning), sentiment analysis (classifying text as positive/negative), machine translation (mapping words across languages), recommendation systems (Airbnb uses embedding techniques inspired by Word2Vec), and as initialization for larger deep learning models.