> **Note:** The canonical experience is the interactive HTML page: [Word2Vec Quiz](https://www.quantml.org/topics/word2vec/quiz). This file is a text mirror for search engines and AI tools.

# Word2Vec Quiz: Test Your Knowledge

15 questions across 3 difficulty levels, from foundational concepts to implementation details. Each question links back to the relevant section so you can review anything you missed. One attempt per question.

---

## Tier 1 — Foundations

Core concepts from The Story: one-hot encoding, distributional hypothesis, dot products, softmax.

---

### Question 1

**Why is one-hot encoding a poor representation for words?**

- A. It uses too much memory
- B. It requires labeled training data
- C. It can only represent verbs
- D. Every word is equally distant from every other word, with no similarity signal

→ [Review in Chapter 1](https://www.quantml.org/topics/word2vec#ch-1)
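
A minimal sketch of the idea behind this question, using a hypothetical 4-word vocabulary (not the site's actual corpus): one-hot vectors give every pair of distinct words a dot product of 0 and the same distance, so no similarity signal survives.

```python
import numpy as np

# Hypothetical 4-word vocabulary; each word gets a one-hot vector.
vocab = ["king", "queen", "apple", "orange"]
one_hot = np.eye(len(vocab))

# Every pair of distinct one-hot vectors has dot product 0 and
# Euclidean distance sqrt(2), regardless of meaning.
print(one_hot[0] @ one_hot[1])                  # king . queen   -> 0.0
print(one_hot[2] @ one_hot[3])                  # apple . orange -> 0.0
print(np.linalg.norm(one_hot[0] - one_hot[2]))  # always ~1.414
```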

---

### Question 2

**The distributional hypothesis states that:**

- A. Words that are spelled similarly have similar meanings
- B. Common words carry the most meaning
- C. Words appearing in similar contexts tend to have similar meanings
- D. Longer sentences produce better embeddings

→ [Review in Chapter 2](https://www.quantml.org/topics/word2vec#ch-2)

---

### Question 3

**What does a sliding context window produce from a sentence?**

- A. A parse tree of the sentence grammar
- B. A co-occurrence matrix of all words in the corpus
- C. (center word, context word) training pairs
- D. One-hot vectors for each word

→ [Review in Chapter 2](https://www.quantml.org/topics/word2vec#ch-2)
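
A small sketch of a sliding-window pair generator; the window size and sentence are illustrative, not taken from the site's training script.

```python
def skipgram_pairs(tokens, window=2):
    """Yield (center word, context word) training pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        # Context = words within `window` positions on either side of the center.
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs("the cat sat on the mat".split())[:4])
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat')]
```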

---

### Question 4

**What does the dot product measure between two word vectors?**

- A. The Euclidean distance between them
- B. How many letters the words share
- C. The probability of one word following the other
- D. How aligned their directions are (a measure of similarity)

→ [Review in Chapter 3: Dot Product](https://www.quantml.org/topics/word2vec#ch-3)
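
An illustrative sketch with hand-made 3-d vectors (not trained embeddings): vectors pointing in similar directions give a large dot product and a cosine near 1, while unrelated directions give values near 0.

```python
import numpy as np

# Toy vectors, chosen by hand for illustration.
king  = np.array([0.9, 0.2, 0.1])
queen = np.array([0.8, 0.3, 0.1])
apple = np.array([-0.1, 0.9, -0.4])

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(king @ queen, cosine(king, queen))   # aligned   -> cosine near 1
print(king @ apple, cosine(king, apple))   # unrelated -> cosine near 0
```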

---

### Question 5

**What does the softmax function do in the skip-gram model?**

- A. Converts raw dot-product scores into a probability distribution that sums to 1
- B. Compresses the vocabulary to a smaller size
- C. Removes stopwords from the input
- D. Normalizes the embedding vectors to unit length

→ [Review in Chapter 3: Softmax](https://www.quantml.org/topics/word2vec#ch-3)
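
A minimal softmax sketch showing how raw scores become a probability distribution that sums to 1; the score values are made up for illustration.

```python
import numpy as np

def softmax(scores):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(scores - scores.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])   # raw dot-product scores
probs = softmax(scores)
print(probs, probs.sum())                  # roughly [0.64 0.23 0.10 0.03], 1.0
```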

---

## Tier 2 — Mechanics

How the model actually works: weight matrices, loss functions, gradients, and vector arithmetic.

---

### Question 6

**What does a single row of the embedding matrix W represent?**

- A. The probability of that word in the corpus
- B. The one-hot encoding of that word
- C. The dense embedding vector for that word (its "fingerprint")
- D. The gradient for that word's most recent training step

→ [Review in Inside the Model: Blueprint](https://www.quantml.org/topics/word2vec/internals#inside-blueprint)
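
A sketch of the row-lookup idea, assuming the same V = 96, d = 8 sizes used elsewhere in the quiz; the word-to-index mapping is hypothetical.

```python
import numpy as np

V, d = 96, 8                         # assumed sizes, matching the quiz's example
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))          # embedding matrix: one row per word

word_to_idx = {"cat": 7}             # hypothetical vocabulary entry
cat_vector = W[word_to_idx["cat"]]   # row 7 = the dense embedding for "cat"
print(cat_vector.shape)              # (8,)
```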

---

### Question 7

**What is the difference between W (embedding matrix) and W' (context matrix)?**

- A. W is for nouns and W' is for verbs
- B. W represents words when they are the center word; W' represents words when they are context words
- C. W is larger than W'
- D. W' is a transposed copy of W

→ [Review in Inside the Model: Weight Explorer](https://www.quantml.org/topics/word2vec/internals#inside-weights)
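
A sketch of how the two matrices meet in a skip-gram score, with assumed sizes and hypothetical word indices: a word's row in W is used when it is the center word, and its row in W' when it is a candidate context word.

```python
import numpy as np

V, d = 96, 8
rng = np.random.default_rng(0)
W  = rng.normal(size=(V, d))      # used when a word is the *center* word
Wp = rng.normal(size=(V, d))      # used when a word is a *context* word

center, context = 3, 41           # hypothetical word indices
score = W[center] @ Wp[context]   # one logit of the skip-gram output
print(score)
```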

---

### Question 8

**The cross-entropy loss L = -log P(context | center) is high when:**

- A. The model confidently predicts the correct context word
- B. The vocabulary size is large
- C. The learning rate is too high
- D. The model assigns low probability to the true context word (it was "surprised")

→ [Review in Inside the Model: Forward & Backprop](https://www.quantml.org/topics/word2vec/internals#inside-walkthrough)
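
A two-line numeric illustration of the "surprise" behaviour of −log P: the loss is small when the true context word gets high probability and large when it gets low probability.

```python
import numpy as np

# Cross-entropy for a single (center, context) pair: L = -log P(context | center)
for p_true in (0.9, 0.5, 0.01):
    print(p_true, -np.log(p_true))
# 0.9  -> 0.105  (confident and correct: low loss)
# 0.5  -> 0.693
# 0.01 -> 4.605  (the model was "surprised": high loss)
```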

---

### Question 9

**The gradient of the Word2Vec loss decomposes into two forces. The attractive force:**

- A. Pulls the center word's vector toward the true context word's vector
- B. Pushes all words toward the center of the embedding space
- C. Increases the learning rate over time
- D. Repels the center word from all vocabulary words

→ [Review in Inside the Model: Forward & Backprop](https://www.quantml.org/topics/word2vec/internals#inside-walkthrough)
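
A sketch of the two forces acting on the center word's vector under the full-softmax gradient, with random matrices and hypothetical indices standing in for real trained weights: the attractive term moves the vector toward the true context word's row of W', and the repulsive term moves it away from the probability-weighted mixture of all context rows.

```python
import numpy as np

V, d, lr = 96, 8, 0.1
rng = np.random.default_rng(0)
W, Wp = rng.normal(size=(V, d)), rng.normal(size=(V, d))
center, true_ctx = 3, 41                      # hypothetical word indices

v = W[center]
scores = Wp @ v
probs = np.exp(scores - scores.max()); probs /= probs.sum()

attract = lr * Wp[true_ctx]          # pull v toward the true context vector
repel   = -lr * (probs @ Wp)         # push v away from the predicted mixture
W[center] = v + attract + repel      # one gradient-descent step on the center vector
print(np.linalg.norm(attract), np.linalg.norm(repel))
```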

---

### Question 10

**Why does the vector arithmetic king − man + woman ≈ queen work?**

- A. The model was explicitly trained on analogy examples
- B. The training process encodes consistent semantic directions (like gender) into the geometry of the embedding space
- C. It only works if the vocabulary is small enough
- D. The softmax function preserves linear relationships

→ [Review in Chapter 5: The Payoff](https://www.quantml.org/topics/word2vec#ch-5)
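
A toy illustration of the geometric idea, with 2-d vectors built by hand so that a consistent "gender" direction exists (real embeddings learn this geometry from data rather than being given it):

```python
import numpy as np

# Hand-built toy vectors with a consistent "gender" direction (illustrative only).
emb = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.2,  1.0]),
    "woman": np.array([0.2, -1.0]),
    "apple": np.array([-1.0, 0.1]),
}
target = emb["king"] - emb["man"] + emb["woman"]

for w, v in emb.items():
    cos = target @ v / (np.linalg.norm(target) * np.linalg.norm(v))
    print(f"{w:6s} {cos:+.2f}")      # "queen" scores highest
```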

---

## Tier 3 — Implementation

From theory to code: parameter counts, the training loop, state recording, and the visualization pipeline.


---

### Question 11

**A skip-gram model with vocabulary size V = 96 and embedding dimension d = 8 has how many total trainable parameters?**

- A. 96 × 8 = 768
- B. 96 × 8 × 2 = 1,536 (W and W' matrices)
- C. 96 × 96 = 9,216
- D. 8 × 8 = 64

→ [Review in The Code: The Model](https://www.quantml.org/topics/word2vec/code#code-model)
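
The parameter-count arithmetic for the sizes stated in the question, written out as a quick check:

```python
V, d = 96, 8
params_W  = V * d             # embedding matrix W
params_Wp = V * d             # context matrix W'
print(params_W + params_Wp)   # 1536 trainable parameters in total
```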

---

### Question 12

**Negative sampling replaces the full softmax by:**

- A. Removing the loss function entirely
- B. Using a smaller vocabulary
- C. Only updating the true context word (positive) and a few random non-context words (negatives) at each step
- D. Computing softmax over the 10 most frequent words only

→ [Review in Chapter 3: Negative Sampling](https://www.quantml.org/topics/word2vec#ch-3)
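
A rough sketch of the negative-sampling loss for a single training pair, with assumed sizes, hypothetical indices, and a naive random sampler (a real sampler would avoid drawing the true context word): only 1 + k rows of W' enter the loss instead of all V softmax terms.

```python
import numpy as np

V, d, k = 96, 8, 5                        # k = number of negative samples
rng = np.random.default_rng(0)
W, Wp = rng.normal(size=(V, d)), rng.normal(size=(V, d))
center, true_ctx = 3, 41                  # hypothetical indices

sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
negatives = rng.integers(0, V, size=k)    # random non-context words

loss = -np.log(sigmoid(W[center] @ Wp[true_ctx]))             # positive term
loss += -np.log(sigmoid(-W[center] @ Wp[negatives].T)).sum()  # negative terms
print(loss)
```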

---

### Question 13

**What is the correct order of operations in one training step?**

- A. Backward → Forward → Loss → Update
- B. Loss → Forward → Update → Backward
- C. Update → Forward → Backward → Loss
- D. Forward → Loss → Backward → Update

→ [Review in The Code: Training Loop](https://www.quantml.org/topics/word2vec/code#code-loop)
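
A compact numpy sketch of one full-softmax training step, laid out in the standard order (not a copy of the site's actual training loop; sizes and indices are assumed):

```python
import numpy as np

V, d, lr = 96, 8, 0.1
rng = np.random.default_rng(0)
W, Wp = rng.normal(size=(V, d)), rng.normal(size=(V, d))
center, true_ctx = 3, 41                       # hypothetical training pair

# 1. Forward: scores and probabilities for every candidate context word.
scores = Wp @ W[center]
probs = np.exp(scores - scores.max()); probs /= probs.sum()

# 2. Loss: cross-entropy against the true context word.
loss = -np.log(probs[true_ctx])

# 3. Backward: gradients of the loss w.r.t. both matrices.
err = probs.copy(); err[true_ctx] -= 1.0       # dL/dscores
grad_center = Wp.T @ err                       # dL/dW[center]
grad_Wp = np.outer(err, W[center])             # dL/dW'

# 4. Update: one gradient-descent step.
W[center] -= lr * grad_center
Wp -= lr * grad_Wp
print(loss)
```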

---

### Question 14

**The training script records model state at every step for the first 200 steps, then every 10th step after that. Why?**

- A. Early training shows the most dramatic changes; later training is more gradual
- B. The model stops learning after step 200
- C. Memory runs out after 200 steps
- D. The loss function changes at step 200

→ [Review in The Code: Recording State](https://www.quantml.org/topics/word2vec/code#code-recording)
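
A hypothetical recording schedule matching the description in the question (the actual script's condition may differ):

```python
def should_record(step):
    # Every step for the first 200 steps, then every 10th step after that.
    return step < 200 or step % 10 == 0

recorded = [s for s in range(1000) if should_record(s)]
print(len(recorded))   # 280 snapshots for a 1000-step run
```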

---

### Question 15

**What does PCA do to the 8-dimensional embeddings in the visualization pipeline?**

- A. Clusters similar words together
- B. Projects them down to 2 dimensions while preserving the most variance, so they can be plotted on a scatter chart
- C. Normalizes them to unit length
- D. Encrypts them for secure storage

→ [Review in The Code: Visualization Pipeline](https://www.quantml.org/topics/word2vec/code#code-viz)
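
A minimal PCA-by-SVD sketch, using random stand-in embeddings of the assumed shape (96 words × 8 dimensions): center the data, keep the top two principal directions, and report how much variance the 2-d projection preserves.

```python
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(96, 8))           # assumed: 96 words, 8-d embeddings

# PCA via SVD: center the data, keep the top-2 principal directions.
centered = emb - emb.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T          # (96, 2) points for a scatter plot

explained = (S[:2] ** 2).sum() / (S ** 2).sum()
print(coords_2d.shape, f"{explained:.0%} of variance kept")
```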

---

## Related sections

- [Word2Vec Story](https://www.quantml.org/topics/word2vec)
- [Inside the Model](https://www.quantml.org/topics/word2vec/internals)
- [Code Walkthrough](https://www.quantml.org/topics/word2vec/code)