> **Note:** The canonical experience is the interactive HTML page: [Word2Vec](https://www.quantml.org/topics/word2vec). This file is a text mirror for search engines and AI tools.

# Word2Vec: From One-Hot to Word Embeddings

---

## TL;DR - What is Word2Vec?

**Word2Vec** is a neural network that learns to represent words as dense vectors (called *embeddings*) by predicting which words appear near each other in text. The core idea: words in similar contexts get similar vectors. The Skip-gram variant takes a center word, computes dot products with all vocabulary words, applies softmax to get probabilities, and adjusts the vectors via gradient descent. After thousands of training steps, semantically related words like "king" and "queen" end up close together in vector space — and you can even do arithmetic like *king − man + woman ≈ queen* (on large corpora; our 545-sentence toy model gets 3 out of 5 analogy tests right).

---

## Chapter 01 — Why Can't Computers Understand Words?

**Subtitle:** Computers only understand numbers. The simplest representation, one-hot vectors, treats every word as equally different from every other word.

One-hot encoding converts vocabulary words into vectors of all zeros except a single 1 at the word's index. Every word is an orthogonal basis vector. There is no geometry, no notion of closeness.
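
For concreteness, here is a minimal sketch of the problem (the three-word vocabulary is invented for illustration, not the lesson's actual word list):

```python
import numpy as np

vocab = ["king", "queen", "refrigerator"]   # toy vocabulary
one_hot = np.eye(len(vocab))                # each row is one word's one-hot vector

king, queen, fridge = one_hot
print(king @ queen)    # 0.0 -- "king" is no closer to "queen"...
print(king @ fridge)   # 0.0 -- ...than to "refrigerator"
```

Every pair of distinct one-hot vectors has dot product 0: the representation carries no similarity information at all.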

---

So one-hot vectors are a dead end. Every word is equally far from every other word. **"King"** is no closer to **"queen"** than to **"refrigerator."**

We need a better representation: one where *similar words live close together*. But where would that information come from? We don't want to hand-code word similarities for every language.

It turns out the answer has been hiding in plain sight: **in the text itself.**

---

## Chapter 02 — Words Are Defined by Their Friends

**Subtitle:** Words that appear in similar contexts tend to have similar meanings. This is the distributional hypothesis, the foundation of Word2Vec.

A sliding context window moves across each sentence. At every position the center word is paired with its neighbors. "King" repeatedly appears near "ruled," "kingdom," "throne" — and so does "queen." That repeated co-occurrence is the learning signal.
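
As a concrete sketch, here is one way to extract (center, context) pairs with a window of 2 on each side (the sentence is invented for illustration; the lesson's corpus and window size may differ):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center, context) pairs as the window slides across a sentence."""
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

sentence = "the wise king ruled the kingdom".split()
print(list(skipgram_pairs(sentence)))
# [('the', 'wise'), ('the', 'king'), ('wise', 'the'), ('wise', 'king'), ...]
```

Each pair becomes one training example: predict the context word from the center word.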

---

You just saw that co-occurrence patterns exist. Words that share contexts share meaning. A natural question: **why not just use the co-occurrence counts directly?**

You could build a giant table counting how often each pair of words appears together. People tried this: count-based methods like *LSA* build exactly such a table and factorize it with *SVD*. It works! But the table is enormous (vocabulary-size × vocabulary-size), mostly zeros, and it doesn't generalize well to unseen word combinations.
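
A minimal sketch of that counting approach, on an invented two-sentence corpus (a real table would have one row and one column per vocabulary word and be mostly zeros):

```python
from collections import Counter

def cooccurrence_counts(sentences, window=2):
    """Count how often each (word, neighbor) pair falls inside the same window."""
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[(word, tokens[j])] += 1
    return counts

corpus = ["the king ruled the kingdom".split(),
          "the queen ruled the kingdom".split()]
print(cooccurrence_counts(corpus)[("king", "ruled")])   # 1
```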

Word2Vec's insight: instead of counting co-occurrences, **train a neural network to predict context from words.** The network is forced to compress word meaning into small, dense vectors as a *side effect* of learning to predict. The next chapter builds this machine piece by piece.

---

## Chapter 03 — Building the Machine

**Subtitle:** The Skip-gram model predicts context words from a center word. Here's how it works, piece by piece.

### Dot Product = Similarity

Before we build the model, we need one key tool: **a way to measure how similar two vectors are.** The model will use this tool constantly, at every training step, for every word pair.

That tool is the **dot product**. When two vectors point in the same direction, their dot product is large. When they point in different directions, it's small (or negative).

> **Connection to Word2Vec:** the two vectors represent word embeddings. When the model predicts that "queen" is likely context for "king," it means their dot product is high; their vectors point in similar directions.
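
A small sketch with made-up 3-dimensional vectors (the model's real embeddings are 8-dimensional and learned, not hand-picked):

```python
import numpy as np

king  = np.array([ 0.9, 0.1,  0.3])   # hypothetical embeddings, for illustration only
queen = np.array([ 0.8, 0.2,  0.4])
cat   = np.array([-0.5, 0.7, -0.2])

print(np.dot(king, queen))   # 0.86  -> vectors point the same way: similar
print(np.dot(king, cat))     # -0.44 -> vectors point apart: dissimilar
```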

---

### From Scores to Probabilities: Softmax

The dot product gives us raw scores, but we need **probabilities**: numbers between 0 and 1 that sum to 1. *Softmax* does this conversion.

Imagine the center word is **"king."** The model computes a dot-product score for every word in the vocabulary. Softmax turns these scores into a probability distribution: "How likely is each word to appear nearby?"
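
A minimal softmax sketch (the scores are invented; in the model they come from dot products over the whole 96-word vocabulary):

```python
import numpy as np

def softmax(scores):
    """Convert raw scores into probabilities that are positive and sum to 1."""
    exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
    return exp / exp.sum()

scores = np.array([4.0, 3.5, 0.2, -1.0])   # e.g. queen, throne, cat, pizza
print(softmax(scores))                     # highest score -> highest probability
print(softmax(scores).sum())               # 1.0
```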

---

### The Skip-gram Architecture

The complete forward pass in four steps (a code sketch follows the list):

1. Convert center word to its one-hot vector.
2. Look up the corresponding row in the embedding matrix **W** — this is the word's current 8-dimensional embedding.
3. Multiply by the output weight matrix **W'** to get a raw score (logit) for every word in the vocabulary.
4. Apply softmax to get a probability for every vocabulary word; train with cross-entropy loss against the true context word.
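
Here is a minimal PyTorch sketch of those four steps under the lesson's setup (96-word vocabulary, 8-dimensional embeddings); the word indices and variable names are ours, chosen for illustration:

```python
import torch
import torch.nn as nn

V, D = 96, 8                            # vocabulary size, embedding dimension

W_in  = nn.Embedding(V, D)              # embedding matrix W (steps 1-2: the lookup is
                                        # equivalent to multiplying by a one-hot vector)
W_out = nn.Linear(D, V, bias=False)     # output matrix W' (step 3: one logit per word)
loss_fn = nn.CrossEntropyLoss()         # step 4: softmax + cross-entropy in one call

center  = torch.tensor([3])             # index of the center word
context = torch.tensor([17])            # index of the true context word

v_c    = W_in(center)                   # the center word's current 8-dim embedding
logits = W_out(v_c)                     # a raw score for every vocabulary word
loss   = loss_fn(logits, context)       # how surprised the model is by the true context
loss.backward()                         # gradients used for the update step
```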

---

### Why Does This Actually Work?

Here's the insight that makes it all click:

If **"king"** and **"queen"** both appear near the words "ruled," "kingdom," and "throne," then the network must learn to predict those same context words from *both* "king" and "queen."

To predict the same outputs, the network is forced to give them **similar input vectors**, because similar inputs are the only way to produce similar outputs through the same weight matrix.

**Similar prediction targets → similar weights → similar embeddings.**

This is why context prediction *automatically discovers meaning.* The network never sees a dictionary. It never reads a definition. It just tries to predict neighboring words, and meaning emerges as a side effect of getting better at that task.

Think of it this way: if you had to fill in the blank for "The wise ___ ruled the kingdom," both "king" and "queen" work. The model learns this by making their vectors similar.

---

### The Gradient: Attraction vs Repulsion

The model knows *what* to learn (predict context words) and *how to score* (dot product + softmax). But how does it actually **improve**? Through gradients — the precise recipe for nudging each vector in the right direction.

The gradient for a single training step splits into two forces:

- **Attraction:** the center word's vector is pulled toward the true context word's vector.
- **Repulsion:** the center word's vector is pushed away from all incorrectly predicted words.

Each step is a small nudge. Across 15,000 steps, nudges compound into geometry.
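
In symbols: for center vector v_c and output vectors u_w, the softmax cross-entropy gradient is ∂L/∂v_c = Σ_w (p_w − y_w) u_w, where p_w is the predicted probability and y_w is 1 only for the true context word. A minimal NumPy sketch of one update (tiny made-up sizes; the learning rate and initialization are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
V, D, lr = 6, 4, 0.1                         # tiny vocabulary and dimension, for illustration

W_in  = rng.normal(scale=0.1, size=(V, D))   # input (embedding) vectors
W_out = rng.normal(scale=0.1, size=(V, D))   # output (context) vectors

center, context = 0, 3                       # one (center, context) training pair

v_c    = W_in[center]
logits = W_out @ v_c                         # dot product with every word's output vector
p      = np.exp(logits - logits.max()); p /= p.sum()   # softmax probabilities

y = np.zeros(V); y[context] = 1.0
grad_v = (p - y) @ W_out                     # = -u_context + sum_w p_w * u_w

W_in[center] -= lr * grad_v                  # attraction toward the true context word,
                                             # repulsion from every over-predicted word
```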

---

### The Shortcut: Negative Sampling

There's a computational problem. At every training step, softmax computes *exp(z)* for **every word in the vocabulary**. With 96 words, that's fine. But real vocabularies have 100,000+ words, and computing softmax over all of them at every step is prohibitively slow.

Negative sampling approximates the full softmax update by only touching:
- the **true context word** (positive pair)
- a small set of **randomly sampled non-context words** (negatives, typically 5–20)

This delivers a 1000×+ speedup on large vocabularies while preserving the attraction/repulsion signal that teaches the model semantics.
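
A minimal sketch of the negative-sampling objective for one (center, context) pair, assuming negatives are drawn uniformly (real Word2Vec draws them from a smoothed unigram distribution, and avoids sampling the true context word):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(v_center, W_out, context_idx, neg_idx):
    """Binary logistic loss: pull the true pair together, push sampled negatives apart."""
    pos = -np.log(sigmoid(W_out[context_idx] @ v_center))      # attraction term
    neg = -np.log(sigmoid(-W_out[neg_idx] @ v_center)).sum()   # repulsion terms
    return pos + neg

rng = np.random.default_rng(0)
V, D = 96, 8
W_in  = rng.normal(scale=0.1, size=(V, D))
W_out = rng.normal(scale=0.1, size=(V, D))

negatives = rng.choice(V, size=5, replace=False)   # 5 random "non-context" words
print(neg_sampling_loss(W_in[0], W_out, context_idx=17, neg_idx=negatives))
```

Only 6 rows of W' are touched per step instead of all 96 (or all 100,000+), which is where the speedup comes from.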

---

The machine is built. It takes a center word, looks up its embedding, scores every vocabulary word, and adjusts vectors to pull true context words closer while pushing irrelevant words away.

Now let's watch it work. We pre-trained this model on a **545-sentence curated corpus** with a **96-word vocabulary**. The words start at random positions. Over **15,000 training steps**, watch them self-organize into meaningful clusters.

**What to watch for:**
1. Gold dots (royalty) clustering together by step ~5,000
2. The loss curve dropping sharply between steps 2,000–5,000
3. Animals (green) separating from people (blue) by step ~5,000

This is a toy corpus; real Word2Vec (Google's original) was trained on roughly 6 billion words from Google News. But the same dynamics apply: context predicts meaning, and clusters emerge.

---

## Chapter 04 — Watch It Learn

**Subtitle:** 96 words start at random positions and self-organize into semantic clusters over 15,000 training steps. Drag the timeline. Pause and inspect any step.

The interactive training player lets you scrub through every checkpoint from step 0 to 15,000. Each position on the timeline shows the full embedding map as a 2D projection, the current loss value, and per-word neighborhood.

---

The dots aren't random anymore. Words that share meaning share neighborhoods. But the embedding space learned something deeper than just clustering…

The **geometric relationships** between vectors encode meaning. Directions in the space correspond to concepts like gender, royalty, or animal-ness. The most famous demonstration: *vector arithmetic captures analogies.*

---

## Chapter 05 — The Payoff

**Subtitle:** The trained embeddings encode semantic relationships as geometric properties. The most famous: vector arithmetic captures analogies.

### Vector Arithmetic

The direction from "man" to "king" encodes "royalty." Adding that same direction to "woman" lands near "queen."

> **Toy corpus vs. real Word2Vec:** With only 545 sentences, some analogies land perfectly (*son − boy + girl = daughter*) while others are close but not exact (*king − man + woman* returns "proud" instead of "queen," though queen is the 3rd result). On Google's 6-billion-word corpus, *king − man + woman = queen* works reliably. The mechanism is the same; only the data size differs.
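
A minimal sketch of the analogy lookup, assuming a dict `embeddings` that maps each vocabulary word to its trained vector (the dict name and `topn` parameter are ours, for illustration):

```python
import numpy as np

def analogy(a, b, c, embeddings, topn=3):
    """Return the words closest to vec(b) - vec(a) + vec(c), excluding a, b, c."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    scores = {}
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue                      # standard practice: exclude the query words
        scores[word] = vec @ target / (np.linalg.norm(vec) * np.linalg.norm(target))
    return sorted(scores, key=scores.get, reverse=True)[:topn]

# analogy("man", "king", "woman", embeddings)  ->  ideally ["queen", ...]
```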

---

### Why This Matters

Vector arithmetic is a fun party trick, but the real power of word embeddings is in **downstream tasks** — the things you build *on top of* embeddings:

| Application | Description |
|---|---|
| **Semantic Search** | Match queries to documents by meaning, not just keywords. "cheap flights" finds "affordable airfare." |
| **Recommendations** | Airbnb embeds listings the same way Word2Vec embeds words — listings booked in similar sessions get similar vectors. |
| **Sentiment Analysis** | Feed word vectors into a classifier to detect positive/negative reviews without hand-crafting features. |
| **Machine Translation** | Word vectors in different languages share geometric structure — "roi" in French maps near "king" in English. |

---

### Explore Any Word

Each word is represented as **8 numbers**. Individual dimensions aren't interpretable; there's no "royalty dimension" or "animal dimension." The meaning is *distributed* across all 8 numbers working together. What matters is the **pattern**: similar words have similar patterns.
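
A minimal sketch of what "similar patterns" means, using cosine similarity on placeholder 8-dimensional vectors (not the trained model's actual values):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity: 1 = same direction, 0 = orthogonal, -1 = opposite."""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

king  = np.array([ 0.8, -0.2,  0.5,  0.1, -0.6,  0.3,  0.0,  0.4])   # placeholders
queen = np.array([ 0.7, -0.1,  0.6,  0.2, -0.5,  0.4,  0.1,  0.3])
cat   = np.array([ 0.1,  0.7,  0.2, -0.5,  0.1,  0.3, -0.6,  0.2])

print(cosine(king, queen))   # ~0.97 -> similar pattern across all 8 dimensions
print(cosine(king, cat))     # ~0.07 -> no shared pattern
```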

---

## Chapter 06 — What You Just Learned

The full arc covered in this lesson:

1. **One-hot fails** — every word is equally distant from every other.
2. **Context predicts meaning** — the distributional hypothesis provides free supervision.
3. **Skip-gram architecture** — embedding lookup → dot products → softmax → loss.
4. **Gradient mechanics** — attraction toward context, repulsion from everything else.
5. **Geometry emerges** — 15,000 small updates self-organize a semantic map.
6. **Embeddings are reusable** — semantic search, recommendations, translation alignment all benefit.

Explore the trained model in depth:

- [Weight Matrices & Embeddings](https://www.quantml.org/topics/word2vec/internals)
- [Annotated PyTorch Code](https://www.quantml.org/topics/word2vec/code)
- [Test Your Knowledge](https://www.quantml.org/topics/word2vec/quiz)

**Next Topic:** [GloVe: Global Vectors for Word Representation](https://www.quantml.org/topics/glove) — Learn how global co-occurrence statistics improve on Word2Vec's local context window approach.

---

## FAQ

### What is Word2Vec?

Word2Vec is a neural network model that learns dense vector representations (embeddings) for words from large text corpora. It works by training a shallow neural network to predict context words from a center word (Skip-gram) or a center word from context words (CBOW). Words appearing in similar contexts end up with similar vector representations.

### How does the Skip-gram model work?

The Skip-gram model takes a center word, looks up its embedding vector, computes dot products with all vocabulary words, applies softmax to get probabilities, and trains using cross-entropy loss. The gradient pushes the center word's vector closer to its actual context words (attraction) and away from non-context words (repulsion). In practice, negative sampling replaces the full softmax to make training fast (see below). Over thousands of training steps, this produces meaningful word embeddings.

### What is the distributional hypothesis?

The distributional hypothesis states that words appearing in similar contexts tend to have similar meanings. For example, "king" and "queen" appear near words like "throne," "crown," and "ruled," so they must be semantically related. This principle, attributed to linguist J.R. Firth (1957), is the foundation of Word2Vec and all modern word embedding methods.

### What is a one-hot vector?

A one-hot vector is the simplest way to represent a word as numbers. For a vocabulary of V words, each word becomes a vector of length V with all zeros except a single 1 at the word's index. The problem: all one-hot vectors are orthogonal (cosine similarity = 0), so "king" and "queen" are just as different as "king" and "refrigerator." Word2Vec solves this by learning dense vectors where similar words have similar representations.

### What does "king − man + woman = queen" mean?

This famous example shows that Word2Vec embeddings encode semantic relationships as geometric directions. The vector from "man" to "king" captures the concept of "royalty." Adding that same direction to "woman" lands near "queen." This vector arithmetic works because the training process encodes consistent semantic patterns into the geometry of the embedding space.

### What is a sliding context window in Word2Vec?

A sliding context window is a fixed-size frame that moves across a sentence word by word. At each position, the center word is paired with its neighboring words within the window. These (center, context) pairs become training examples for the Word2Vec neural network. A larger window captures more topical/semantic relationships; a smaller window captures syntactic relationships.

### What is cosine similarity in word embeddings?

Cosine similarity measures how similar two word vectors are by computing the cosine of the angle between them. A value of 1 means identical direction (very similar words), 0 means orthogonal (unrelated), and −1 means opposite. In our toy Word2Vec model, "king" and "queen" have high cosine similarity (~0.9), while "king" and "cat" score near zero — exactly what you'd expect.

### What is the difference between Word2Vec and BERT?

Word2Vec produces static embeddings: each word gets one fixed vector regardless of context. "Bank" has the same vector whether it means a financial institution or a river bank. BERT produces contextual embeddings: the same word gets different vectors depending on its surrounding context. BERT uses the Transformer architecture with self-attention, while Word2Vec uses a simple two-layer neural network.

### What is negative sampling in Word2Vec?

Negative sampling is an optimization that makes Word2Vec training practical for large vocabularies. Instead of computing softmax over all 100,000+ words at each step (very slow), negative sampling only updates the true context word (positive) and a small random sample of non-context words (negatives, typically 5–20). This provides a 1000×+ speedup while producing similar quality embeddings.

### How are Word2Vec embeddings used in practice?

Word2Vec embeddings are used as input features for downstream NLP tasks: semantic search (matching queries to documents by meaning), sentiment analysis (classifying text as positive/negative), machine translation (mapping words across languages), recommendation systems (Airbnb uses embedding techniques inspired by Word2Vec), and as initialization for larger deep learning models.
