# GloVe: Interactive Visual Guide to Global Word Vectors

Static mirror of `/topics/glove`: same chapter order, prose, equations, and numbers as `GloVeStoryContent` (toy corpus: 577 sentences, 1,832 tokens, 144-word vocabulary, window size 5 per side, 91% sparse co-occurrence matrix, AdaGrad lr 0.05, 200 epochs). Interactive widgets (co-occurrence builder, ratio explorer, derivation stepper, weighting/objective explorers, gradient inspector, training player, neighbor/analogy solvers, factorization and PMI charts, complexity calculator) render only on the HTML page.

The HTML page also shows a **Chapters** table of contents (Counting Words Together through GloVe in the Landscape) and a floating **Key Equations** button; their content is inlined below as markdown.

---

## Key Equations (same as the in-page Key Equations card)

1. **Co-occurrence:** \(X_{ij} = \sum \frac{1}{d}\) for all \((i,j)\) windows  
2. **Probability:** \(P(j|i) = \frac{X_{ij}}{X_i}\)  
3. **Ratio insight:** \(F(w_i, w_j, \tilde{w}_k) = \frac{P(k|i)}{P(k|j)}\)  
4. **Model equation:** \(w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})\)  
5. **Objective:** \(J = \sum f(X_{ij}) \cdot \bigl(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\bigr)^2\)  
6. **Weighting:** \(f(x) = \begin{cases} (x/x_{\max})^{3/4} & x < x_{\max} \\ 1 & \text{otherwise} \end{cases}\)  
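7. **Gradients:** \(\frac{\partial J}{\partial w_i} = f(X_{ij}) \cdot \text{diff} \cdot \tilde{w}_j\), where diff is the prediction error \(w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij}\) and the constant factor 2 is absorbed into the learning rate  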
8. **AdaGrad:** \(w_i \leftarrow w_i - \frac{\eta \cdot \nabla}{\sqrt{G_i}}\)  
9. **PMI connection:** \(w_i^\top \tilde{w}_j \approx \text{PMI}(i,j) + \text{const}\)

---

## TL;DR: What is GloVe?

**GloVe** (Global Vectors for Word Representation) learns word vectors by factorizing a co-occurrence matrix. First, count how often every word pair appears near each other in a corpus (the co-occurrence matrix X). Then notice that probability ratios P(k|i) / P(k|j) encode meaning better than raw probabilities. Derive an elegant log-bilinear model: wᵢᵀw̃ₖ + bᵢ + b̃ₖ = log(Xᵢₖ). Train via weighted least squares with AdaGrad, using a power-law weighting function to balance rare and frequent co-occurrences. The result: word vectors where king − man + woman ≈ queen, and the factorization quality reveals which aspects of meaning the model has captured.

---

## Chapter 1: Counting Words Together

Before GloVe trains a single parameter, it does something Word2Vec never does: it reads the entire corpus and builds a complete census of word neighborhoods. For every pair of words in the vocabulary, it asks a simple question: how often does word *j* appear near word *i*? The answer is stored in a giant table called the **co-occurrence matrix X**, where each entry \(X_{ij}\) records the weighted count of times words *i* and *j* were found within a context window of size *L*. For a window of 5, that means scanning 10 context positions around every target word in the corpus.

Not all co-occurrences are counted equally. A word one position away from the target contributes a full weight of 1.0, while a word 5 positions away contributes only 1/5 = 0.2. This **1/d distance weighting** reflects a natural intuition: words standing shoulder-to-shoulder with the target are more likely to share a meaningful relationship than words at the far edge of the window. The paper puts it plainly: “very distant word pairs are expected to contain less relevant information about the words’ relationship.”

The resulting matrix is **symmetric** when using a symmetric window (looking both left and right of the target): if “ice” appears in the context of “cold,” then “cold” also appears in the context of “ice,” so \(X_{ij} = X_{ji}\). This symmetry is not just a convenience; it becomes a critical constraint in the derivation. Symmetric windows also capture **semantic** relationships (topical relatedness), while asymmetric (left-only) windows lean toward **syntactic** patterns (grammar depends on word order). GloVe defaults to symmetric.
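
The counting procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the site's implementation: the function name `build_cooccurrence` and the two-sentence corpus are made up for the example, and windows are assumed not to cross sentence boundaries.

```python
from collections import defaultdict

def build_cooccurrence(sentences, window=5):
    """Return {(word_i, word_j): weighted count} using 1/d distance weighting."""
    counts = defaultdict(float)
    for tokens in sentences:
        for pos, target in enumerate(tokens):
            # Symmetric window: look both left and right of the target.
            lo = max(0, pos - window)
            hi = min(len(tokens), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos == pos:
                    continue
                distance = abs(ctx_pos - pos)
                counts[(target, tokens[ctx_pos])] += 1.0 / distance
    return dict(counts)

# Hypothetical two-sentence corpus, for illustration only.
toy_sentences = [
    ["ice", "is", "cold", "and", "solid"],
    ["steam", "is", "hot", "gas"],
]
X = build_cooccurrence(toy_sentences, window=5)

print(X[("ice", "is")])     # adjacent words: 1.0
print(X[("ice", "solid")])  # four positions apart: 0.25
```

Because every (target, context) visit has a mirror-image visit at the same distance, the resulting dictionary satisfies \(X_{ij} = X_{ji}\); pairs that never share a window are simply absent, which is how the sparsity shows up in this representation.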

Perhaps the most striking property of X is its **sparsity**: 91% of entries in our toy matrix are zero. Most word pairs in a finite corpus simply never co-occur within any window. This isn’t a problem; it’s a feature. GloVe trains only on non-zero entries, making its complexity proportional to the number of observed co-occurrences rather than the square of the vocabulary size. The non-zero entries themselves span a wide range (in our toy corpus from 0.4 to ~51, and on production corpora from 0.2 to thousands), and this dynamic range is why GloVe’s weighting function and AdaGrad optimizer matter so much.

> **Important:** \(X_{ij} = 0\) means words *i* and *j* were never observed together in this corpus’s windows. It does *not* mean they are unrelated; with more data, the count might be non-zero. GloVe handles this by training only on observed (non-zero) entries. Absence of evidence is not evidence of absence.

> **Toy vs Production:** Our toy corpus has 577 sentences, 1,832 tokens, 144 vocabulary words, and 1,930 non-zero entries (91% sparse). The 6B GloVe used 400K vocabulary and billions of non-zero entries (>98% sparse).

> “We’ve built the matrix. But staring at raw numbers won’t reveal meaning. We need to look at these counts more carefully, and ask the right question…”

---

## Chapter 2: What Counts Reveal

Raw co-occurrence counts are a starting point, but they’re noisy. Convert them to conditional probabilities (\(P(j|i) = X_{ij}/X_i\)) and try to use them directly. Look at \(P(k|\text{ice})\) for various probe words *k*: “solid” is high, “cold” is high, “water” is high. Now look at \(P(k|\text{steam})\): “hot” is high, “water” is high, “gas” is high. Some probe words (like “water”) are high for *both*. Shared context words dominate both distributions. Staring at individual probabilities, you cannot easily tell which words are *specifically* related to ice versus steam; the shared signal obscures the discriminative signal.

What if we compare the two distributions? Subtraction (\(P(k|\text{ice}) - P(k|\text{steam})\)) is sensitive to absolute scale; a small difference between two large probabilities is uninformative. **Division** naturally normalizes out shared components: for any probe word equally related to both ice and steam, \(P(k|\text{ice})/P(k|\text{steam}) \approx 1\). Only for *discriminative* probe words does the ratio deviate from 1. This is the key insight.

“Solid” gives a ratio of about 10.5, strongly associated with ice over steam. “Gas” gives 0.11, about 9× more associated with steam. Words related to *both* (like “water,” ratio ≈ 1 because both P values are high) or *neither* (like “cat,” where both P values are at or near zero; the ratio is treated as ≈ 1, although an exact 0/0 is strictly undefined) produce uninformative ratios, for different reasons. Four categories emerge naturally from a single operation: large ratio (word1-associated), small ratio (word2-associated), near-1 with high P (shared), near-1 with low P (unrelated). The ratio cleanly separates discriminative signal from noise without requiring any thresholds or hyperparameters. This mirrors Table 1 from the GloVe paper (which uses a 6B corpus with probe words solid, gas, water, and fashion; our toy corpus shows the same pattern with different magnitudes).
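
To make the four categories concrete, here is a small sketch that computes \(P(k|i)/P(k|j)\) from a dictionary of counts. The counts are invented for the example (they are not the toy corpus values), and the helper names `conditional_prob` and `ratio` are hypothetical.

```python
def conditional_prob(X, i, k):
    """P(k|i) = X_ik / X_i, where X_i is the row total for word i."""
    row_total = sum(v for (a, _), v in X.items() if a == i)
    return X.get((i, k), 0.0) / row_total

def ratio(X, i, j, k):
    """P(k|i) / P(k|j), or None when the denominator is zero."""
    p_i, p_j = conditional_prob(X, i, k), conditional_prob(X, j, k)
    if p_j == 0.0:
        return None  # undefined (0/0) or unbounded: no usable ratio
    return p_i / p_j

# Made-up counts, chosen only to show the pattern.
counts = {
    ("ice", "solid"): 8.0, ("ice", "gas"): 1.0, ("ice", "water"): 11.0,
    ("steam", "solid"): 1.0, ("steam", "gas"): 9.0, ("steam", "water"): 10.0,
}

for probe in ("solid", "gas", "water", "cat"):
    print(probe, ratio(counts, "ice", "steam", probe))
```

With these numbers "solid" lands well above 1, "gas" well below 1, "water" close to 1, and "cat" has no defined ratio because it never co-occurs with either word.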

Here is the conceptual leap that births GloVe: if ratios encode meaning so cleanly, can we find vectors \(w_i\), \(w_j\), \(\tilde{w}_k\) such that some function of these vectors *equals* the ratio \(P(k|i) / P(k|j)\)? If so, those vectors would encode the same discriminative information, in a form that supports linear algebra operations like analogy. This question drives the entire derivation.

> “Ratios encode meaning. But they’re just numbers in a table. To make them useful for computation, we need to find vectors whose mathematical operations naturally produce these ratios. This is where GloVe’s derivation begins, and it’s one of the most elegant pieces of mathematics in all of NLP.”

---

## Chapter 3: From Ratios to Vectors

The GloVe derivation is a sequence of five principled steps, each motivated by a specific mathematical or linguistic constraint. It begins with the question from Chapter 2 (find \(F(w_i, w_j, \tilde{w}_k) = P(k|i)/P(k|j)\)) and arrives at the elegant equation \(w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})\). No step is arbitrary; each one narrows the space of possible models by demanding that a specific property hold.

The key moves: analogies live in *differences* (so F should operate on \(w_i - w_j\)), vectors must reduce to a scalar (use the dot product), the ratio is multiplicative but the dot product is additive (so F must be an exponential, because the only continuous homomorphisms from addition to multiplication have the form \(e^{cx}\)), and X’s symmetry demands bias terms that absorb word-level frequency. The result is a model of beautiful simplicity: a dot product plus two biases equals the log co-occurrence count.
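
For readers of this static mirror, the chain of moves can be written out as a condensed worked derivation. It follows the argument as summarized above; the grouping into lines here is an editorial choice and need not match the step numbering of the interactive stepper.

\[
\begin{aligned}
F(w_i, w_j, \tilde{w}_k) &= \frac{P(k|i)}{P(k|j)} && \text{(goal: reproduce the ratio)} \\
F\bigl((w_i - w_j)^\top \tilde{w}_k\bigr) &= \frac{P(k|i)}{P(k|j)} && \text{(differences, reduced by a dot product)} \\
F\bigl((w_i - w_j)^\top \tilde{w}_k\bigr) &= \frac{F(w_i^\top \tilde{w}_k)}{F(w_j^\top \tilde{w}_k)} && \text{(homomorphism, hence } F = \exp\text{)} \\
w_i^\top \tilde{w}_k &= \log P(k|i) = \log X_{ik} - \log X_i \\
w_i^\top \tilde{w}_k + b_i + \tilde{b}_k &= \log X_{ik} && \text{(} \log X_i \text{ absorbed into } b_i\text{; } \tilde{b}_k \text{ restores symmetry)}
\end{aligned}
\]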

Walk through each step in the interactive derivation on the HTML page. Pay special attention to Step 3, the homomorphism argument, where the exponential function is not chosen but *forced* by mathematical necessity. And remember the prediction from Step 4: the bias \(b_i\) should absorb \(\log(X_i)\). We’ll verify this in Chapter 6.

> “We have a beautiful equation: \(w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})\). But equations are aspirations; in practice, our vectors won’t reconstruct log(X) exactly. How do we turn this into something we can optimize?”

---

## Chapter 4: The Objective Function

The model equation says the prediction should equal \(\log(X_{ij})\). It won’t be exact, so we minimize the squared error, but two problems stand between us and a straightforward loss function. First, 91% of our toy X is zero, and log(0) is undefined. GloVe handles this by training **only on non-zero entries**, which both sidesteps the log(0) issue and makes training complexity proportional to the number of observed pairs rather than \(|V|^2\). The alternative (SVD of \(\log(1 + X)\), which includes zeros) was tested and underperforms by over 11 percentage points on analogy tasks.

Second, non-zero co-occurrences span orders of magnitude. A pair that co-occurred once carries noisy signal; a pair that co-occurred 50 times is reliable; a pair at 5,000 is very reliable but shouldn’t dominate the entire loss. GloVe introduces a **weighting function** \(f(x)\) with three properties: f(0) = 0 (zeros contribute nothing), f is non-decreasing (more data means more weight), and f saturates for large x (frequent pairs don’t overwhelm). The chosen form is \(f(x) = (x/x_{\max})^\alpha\) for \(x < x_{\max}\), and 1 otherwise, with \(\alpha\) = 0.75 and \(x_{\max}\) = 100.

The α = 3/4 exponent is sublinear: doubling \(X_{ij}\) increases weight only by \(2^{0.75} \approx 1.68\times\), not 2×. A similar 3/4 power appears in Word2Vec’s negative sampling distribution, and both compensate for the same phenomenon: Zipf’s law makes frequency distributions heavy-tailed. The \(x_{\max}\) = 100 cap is gentle; the paper notes performance “depends weakly on the cutoff.”
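
The weighting function is short enough to state directly in code. The sketch below uses the \(\alpha\) = 0.75 and \(x_{\max}\) = 100 values quoted above; the function name `weight` is just a label for this example.

```python
def weight(x, x_max=100.0, alpha=0.75):
    """GloVe weighting f(x): power law below x_max, capped at 1 from x_max on."""
    if x < x_max:
        return (x / x_max) ** alpha
    return 1.0

for x in (0.2, 1, 10, 50, 100, 5000):
    print(f"f({x}) = {weight(x):.4f}")

# Doubling a count below the cap multiplies the weight by 2 ** 0.75.
print(f"{weight(20) / weight(10):.2f}")  # 1.68
```

The last line reproduces the \(2^{0.75} \approx 1.68\times\) figure, and the values at 100 and 5,000 are both exactly 1, which is the saturation property.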

The full objective is: \(J = \sum f(X_{ij}) \cdot (w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\), summed over all non-zero entries. Why squared error instead of cross-entropy? Cross-entropy requires normalizing the model distribution Q over the full vocabulary, an O(|V|) softmax. Squared error on the log avoids normalization entirely. It’s simpler, cheaper per term, and the paper argues it is preferable to the cross-entropy formulation. The connection to skip-gram’s “global” objective makes this not just a practical choice but a principled one: GloVe replaces weighted cross-entropy with weighted squared error, and the result is a cleaner optimization landscape.
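
As a rough sketch of how the objective is evaluated, the snippet below sums the weighted squared error over a dictionary of non-zero entries. It is written in plain Python for readability (a real implementation would vectorize this), and the tiny matrix, the hand-picked biases, and the name `glove_loss` are assumptions made for the example.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted squared error over the non-zero entries of X.

    X maps (i, j) index pairs to positive co-occurrence values, so
    log(0) never occurs: zero entries are simply not stored.
    """
    total = 0.0
    for (i, j), x_ij in X.items():
        diff = dot(W[i], W_ctx[j]) + b[i] + b_ctx[j] - math.log(x_ij)
        total += weight(x_ij) * diff ** 2
    return total

# Hypothetical 2-word example with three observed entries.
X = {(0, 1): 4.0, (1, 0): 4.0, (0, 0): 0.5}
zeros = [[0.0, 0.0], [0.0, 0.0]]

loss_untrained = glove_loss(X, zeros, zeros, [0.0, 0.0], [0.0, 0.0])

# Biases chosen by hand so that b_i + b~_j = log(X_ij) for every entry.
b = [0.0, math.log(8.0)]
b_ctx = [math.log(0.5), math.log(4.0)]
loss_perfect = glove_loss(X, zeros, zeros, b, b_ctx)

print(f"{loss_untrained:.4f} {loss_perfect:.4f}")
```

The cost of one evaluation is proportional to the number of stored entries, not to \(|V|^2\), and no normalization over the vocabulary appears anywhere.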

<details>
<summary>Deep dive: From Skip-gram to GloVe</summary>

Skip-gram minimizes, over the whole corpus, the negative log probability of context words:

\[J_{\text{sg}} = -\sum_{i \in \text{corpus}} \sum_{j \in \text{context}(i)} \log Q(j|i)\]

where \(Q(j|i)\) uses a softmax. Identical \((i,j)\) pairs appear many times. Aggregating them, the objective becomes:

\[J = -\sum_{i=1}^{|V|} \sum_{j=1}^{|V|} X_{ij} \log Q(j|i)\]

This is a weighted cross-entropy, with weights \(X_{ij}\) (the co-occurrence counts). GloVe makes two key changes:

1. Replace the cross-entropy with **squared error** on the log: instead of \(-X_{ij} \log Q(j|i)\), use \((w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2\). This avoids the expensive softmax normalization entirely.
2. Replace the raw weight \(X_{ij}\) with the capped function \(f(X_{ij})\) to prevent frequent pairs from dominating the loss.

The result is GloVe’s objective: a weighted least-squares problem that is cheaper per term, avoids normalization, and explicitly works with the global co-occurrence statistics rather than sampling individual windows.

</details>

> “We have the objective: minimize the weighted squared error between our model’s predictions and \(\log(X_{ij})\), summed over all non-zero entries. Now we need to actually minimize it. This means computing gradients and choosing the right optimizer…”

---

## Chapter 5: Training GloVe

The gradient for a single pair (i, j) is elegant: let \(\text{diff} = w_i^\top \tilde{w}_j + b_i + \tilde{b}_j - \log(X_{ij})\), and let \(\text{fdiff} = f(X_{ij}) \cdot \text{diff}\). Then, up to a constant factor of 2 from differentiating the square (conventionally absorbed into the learning rate), \(\partial J/\partial w_i = \text{fdiff} \cdot \tilde{w}_j\), \(\partial J/\partial \tilde{w}_j = \text{fdiff} \cdot w_i\), and the bias gradients are simply fdiff. The intuition: if the prediction overshoots, push \(w_i\) away from \(\tilde{w}_j\); if it undershoots, pull them closer. All four parameter groups share the same weighted error scalar but receive different directional signals. In the Stanford implementation, gradients are clipped to [−100, 100] to prevent large co-occurrence values from causing instability (this is an implementation detail, not discussed in the paper).
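
The per-pair gradients can be sketched as follows. This is an illustrative version, not the Stanford code: the helper names are hypothetical, the clip is applied only to the vector components here, and the returned values omit the factor of 2 (so they are half the true derivative of the pair's loss term).

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

def pair_loss(w_i, w_ctx_j, b_i, b_ctx_j, x_ij):
    diff = dot(w_i, w_ctx_j) + b_i + b_ctx_j - math.log(x_ij)
    return weight(x_ij) * diff ** 2

def pair_gradients(w_i, w_ctx_j, b_i, b_ctx_j, x_ij, clip=100.0):
    """Gradients for one (i, j) entry, sharing the scalar fdiff."""
    diff = dot(w_i, w_ctx_j) + b_i + b_ctx_j - math.log(x_ij)
    fdiff = weight(x_ij) * diff

    def clamp(g):
        return max(-clip, min(clip, g))

    grad_w = [clamp(fdiff * c) for c in w_ctx_j]    # direction: context vector
    grad_w_ctx = [clamp(fdiff * c) for c in w_i]    # direction: word vector
    return grad_w, grad_w_ctx, fdiff, fdiff         # both bias gradients = fdiff

# Arbitrary example values.
w_i, w_ctx_j, b_i, b_ctx_j, x_ij = [0.5, -0.2], [0.1, 0.3], 0.1, -0.1, 4.0
grad_w, grad_w_ctx, grad_b, grad_b_ctx = pair_gradients(w_i, w_ctx_j, b_i, b_ctx_j, x_ij)
print(grad_w, grad_b)
```

In this example the prediction undershoots \(\log 4\), so fdiff is negative and a gradient-descent step moves \(w_i\) toward \(\tilde{w}_j\), matching the intuition in the paragraph above.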

The paper trains GloVe with **AdaGrad**, and the choice is well-motivated. Co-occurrence values span orders of magnitude, and word frequencies are just as skewed. Every non-zero pair is visited once per epoch, but a frequent word takes part in many pairs, so its vector and bias are updated many times per epoch (often with large gradients, because high-count pairs carry more weight), while a rare word’s parameters are updated only a few times. With vanilla SGD and a single learning rate, tuning for rare words makes frequent words diverge, and tuning for frequent words starves rare ones. AdaGrad solves this by accumulating squared gradients per parameter: \(G_t = G_{t-1} + g_t^2\). The update becomes \(\theta \mathrel{-}= \text{lr} \cdot g \,/\, \sqrt{G + \varepsilon}\). Frequent words accumulate large \(G\) and get smaller effective learning rates, while rare words retain high learning rates; the adaptation is automatic and per-parameter.
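
The effect on step size is easy to see with a scalar example. The sketch below applies the update rule above to a single parameter; the constant gradient of 1.0 and the helper names are assumptions chosen to keep the arithmetic transparent.

```python
import math

def adagrad_step(param, grad, accum, lr=0.05, eps=1e-8):
    """One AdaGrad update for a scalar parameter; returns (param, accum)."""
    accum = accum + grad * grad
    param = param - lr * grad / math.sqrt(accum + eps)
    return param, accum

def last_step_size(num_updates, grad=1.0, lr=0.05):
    """Size of the final step after num_updates identical gradients."""
    param, accum, step = 0.0, 0.0, 0.0
    for _ in range(num_updates):
        new_param, accum = adagrad_step(param, grad, accum, lr)
        step = abs(new_param - param)
        param = new_param
    return step

print(f"{last_step_size(1):.4f}")    # 0.0500
print(f"{last_step_size(100):.4f}")  # 0.0050
```

A parameter that has already received 100 unit gradients moves ten times less per update than one seeing its first gradient, which is the automatic per-parameter adaptation described above.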

GloVe maintains **two sets of vectors**: W (word vectors) and W̃ (context vectors), each with their own bias vectors and AdaGrad accumulators. When X is symmetric, W and W̃ are mathematically interchangeable; they differ only by random initialization. The paper uses their sum W + W̃ as the final embedding, which averages out initialization-dependent noise and gives a small but consistent performance boost. The paper motivates this by analogy with neural networks, where training multiple instances and combining the results can help reduce overfitting and noise.

Unlike Word2Vec’s continuous stream of random samples, GloVe has a natural **epoch structure**: one pass through all non-zero \((i, j, X_{ij})\) triples, shuffled. Loss drops rapidly in early epochs (1–10), then gradually stabilizes. Our toy model trains for 200 epochs with learning rate 0.05. The paper recommends 50 iterations for dimensions below 300, and 100 for 300+. Convergence is fast because the objective, while not globally convex (it’s bilinear in W and W̃), behaves well in practice as a weighted least-squares problem.
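
Putting the pieces together, an epoch-structured training loop might look like the sketch below. It is a plain-Python illustration under several assumptions: a made-up three-word matrix, 4 dimensions, AdaGrad accumulators initialized to 1.0, and the factor of 2 folded into the learning rate. It is not the site's NumPy code.

```python
import math
import random

def weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

def train_glove(X, vocab_size, dim=4, epochs=200, lr=0.05, seed=0):
    """Train on the non-zero entries of X ({(i, j): count}) with AdaGrad."""
    rng = random.Random(seed)

    def small_random_matrix():
        return [[rng.uniform(-0.5, 0.5) / dim for _ in range(dim)]
                for _ in range(vocab_size)]

    W, W_ctx = small_random_matrix(), small_random_matrix()
    b, b_ctx = [0.0] * vocab_size, [0.0] * vocab_size
    # One accumulator per parameter; starting at 1.0 is a choice of this sketch.
    acc_W = [[1.0] * dim for _ in range(vocab_size)]
    acc_C = [[1.0] * dim for _ in range(vocab_size)]
    acc_b, acc_bc = [1.0] * vocab_size, [1.0] * vocab_size

    entries = list(X.items())
    history = []
    for _ in range(epochs):
        rng.shuffle(entries)  # one epoch = one shuffled pass over the non-zeros
        epoch_loss = 0.0
        for (i, j), x_ij in entries:
            pred = sum(a * c for a, c in zip(W[i], W_ctx[j])) + b[i] + b_ctx[j]
            diff = pred - math.log(x_ij)
            fdiff = weight(x_ij) * diff
            epoch_loss += fdiff * diff
            for d in range(dim):
                g_w, g_c = fdiff * W_ctx[j][d], fdiff * W[i][d]
                acc_W[i][d] += g_w * g_w
                acc_C[j][d] += g_c * g_c
                W[i][d] -= lr * g_w / math.sqrt(acc_W[i][d])
                W_ctx[j][d] -= lr * g_c / math.sqrt(acc_C[j][d])
            acc_b[i] += fdiff * fdiff
            acc_bc[j] += fdiff * fdiff
            b[i] -= lr * fdiff / math.sqrt(acc_b[i])
            b_ctx[j] -= lr * fdiff / math.sqrt(acc_bc[j])
        history.append(epoch_loss)
    # Final embedding: sum of word and context vectors.
    embeddings = [[w + c for w, c in zip(W[i], W_ctx[i])]
                  for i in range(vocab_size)]
    return embeddings, history

# Hypothetical symmetric matrix over a 3-word vocabulary.
toy_X = {(0, 1): 10.0, (1, 0): 10.0, (0, 2): 0.5,
         (2, 0): 0.5, (1, 2): 3.0, (2, 1): 3.0}
embeddings, history = train_glove(toy_X, vocab_size=3)
print(f"loss: epoch 1 = {history[0]:.4f}, epoch 200 = {history[-1]:.4f}")
```

The `history` list holds the running loss accumulated during each epoch, so plotting it gives the kind of loss curve the training player shows.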

Why AdaGrad specifically? It accumulates squared gradients per parameter, building a per-element history \(G_i = \sum_{t} g_{t,i}^2\). Parameters of frequent words receive many gradient updates, so their \(G_i\) grows large and their effective learning rate \(\eta/\sqrt{G_i}\) shrinks automatically. Parameters of rare words retain a high learning rate. This is ideal for NLP’s Zipfian frequency distribution, where a few very frequent words account for a large share of the training signal.

One subtle detail: when \(X_{ij} < 1\) (from distance weighting; a co-occurrence at distance 5 contributes only 1/5 = 0.2), \(\log(X_{ij})\) is negative. The model must learn to produce negative predictions for these pairs. This is why word vector components can be negative; they’re not probabilities.

> “Training is done. But what did GloVe actually learn? Let’s crack open the model and inspect everything: the vectors, the biases, the quality of the factorization, and how the two sets of vectors relate to each other.”

---

## Chapter 6: What GloVe Learns

With training complete, we can inspect the model from angles unique to GloVe. Start with **nearest neighbors**: using \(W + \tilde{W}\), find the closest words to any word by cosine similarity. For “ice,” the top neighbors are “freeze”, “melts”, “solid”; thermodynamic words cluster tightly. For “king,” the top neighbors are “wise”, “old”, “proud”; properties associated with royalty in this small corpus. Then look at **analogies**: \(\text{king} - \text{man} + \text{woman} \approx \text{queen}\). On production GloVe this works reliably; on our tiny corpus the model predicts “wise” instead of “queen” (though \(\text{ice} - \text{solid} + \text{gas} = \text{steam}\) does work!), but the principle holds. This isn’t magic; it’s a direct consequence of the log-bilinear model. Vector subtraction in embedding space corresponds to co-occurrence ratio *division* in probability space. The difference vector king − man encodes the log ratio of their co-occurrences with every other word; adding woman finds the word whose profile matches that pattern.

### Why analogies work: the algebra

The model equation for word *i* and context word *k* is:

\[w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log(X_{ik})\]

Write this for “king” and “man” with the same context word *k*, then subtract:

\[(w_{\text{king}} - w_{\text{man}})^\top \tilde{w}_k = \log\!\frac{X_{\text{king},k}}{X_{\text{man},k}} - (b_{\text{king}} - b_{\text{man}})\]

The difference vector \(w_{\text{king}} - w_{\text{man}}\) encodes the **log co-occurrence ratio** with every context word *k* (up to a bias constant). This vector captures the “royalty minus male” semantic direction.

To find the word that is “royalty minus male plus female,” we solve:

\[\arg\max_j \; \cos(w_{\text{king}} - w_{\text{man}} + w_{\text{woman}}, \; w_j)\]

The paper finds the word *d* whose vector is closest to \(w_b - w_a + w_c\) by **cosine similarity**. On a production corpus, the answer is “queen” because \(w_{\text{queen}}\) has the co-occurrence profile that matches “king’s profile, adjusted for the male→female shift.” This is not magic; it is a direct algebraic consequence of the log-bilinear model.
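
The search itself is a few lines of code. The sketch below uses hand-made 2-dimensional vectors (one axis loosely standing for royalty, the other for gender) so the result can be checked by eye; the vectors, the function names, and the common practice of excluding the three query words from the candidates are assumptions of this example.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def solve_analogy(vectors, a, b, c):
    """Return the word d maximizing cos(w_b - w_a + w_c, w_d), excluding a, b, c."""
    query = [vb - va + vc
             for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = (w for w in vectors if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(query, vectors[w]))

# Hand-made vectors, purely illustrative (not trained embeddings).
toy_vectors = {
    "king":  [1.0,  1.0],
    "queen": [1.0, -1.0],
    "man":   [0.0,  1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}

# "man is to king as woman is to ?"
print(solve_analogy(toy_vectors, "man", "king", "woman"))  # queen
```

Here the query vector king − man + woman equals [1, −1], which coincides with the "queen" vector, so the cosine is exactly 1. Trained embeddings only approximate this, which is why the toy model can miss.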

> **Try it yourself:** Pick two words from the same semantic group in the explorer on the HTML page. Compute their vector difference. Which context words show the largest magnitude in that difference direction? Do they match your intuition about what distinguishes those words?

> **Toy corpus note:** Our corpus has only 144 words and 8 dimensions. Analogies that work flawlessly on a 6B-token corpus may not succeed here; the vocabulary is too small for many standard analogy sets, and the embeddings lack the capacity to encode fine-grained relationships. The explorer on the HTML page shows accuracy improving with training, but don’t expect production-level results.

Next, inspect **factorization quality**: how well does \(W \cdot \tilde{W}^\top + b + \tilde{b}\) reconstruct \(\log(X)\)? Plot the target \(\log(X_{ij})\) against the model’s prediction and you’ll see a tight correlation for frequent pairs (which had high \(f(X_{ij})\) weight) and more scatter for rare pairs (which the weighting function down-weights). On our toy corpus (144 words, 8 dimensions), the fit is modest: \(R^2\) ≈ 0.32, despite having ~2,600 parameters for ~1,930 non-zero entries. Why so low? The \(f(X_{ij})\) weighting heavily prioritizes frequent pairs, so rare pairs (the majority) contribute little gradient and remain poorly fit. With only 8 dimensions, the model also lacks capacity to capture all structure. On real corpora with 400K vocab and 300 dimensions, the parameter-to-data ratio is far healthier and the reconstruction is expected to be much tighter.
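
Two of the numbers above can be reproduced with simple arithmetic. The sketch below counts parameters for a model with two vector sets and two bias vectors, and computes an unweighted coefficient of determination. Whether the page's \(R^2\) is weighted by \(f(X_{ij})\) is not stated, so the unweighted form is an assumption, and the target and prediction values are invented.

```python
def parameter_count(vocab_size, dim):
    """Two vector sets (W and W~) plus two bias vectors."""
    return 2 * vocab_size * dim + 2 * vocab_size

def r_squared(targets, predictions):
    """Unweighted coefficient of determination."""
    mean_t = sum(targets) / len(targets)
    ss_res = sum((t - p) ** 2 for t, p in zip(targets, predictions))
    ss_tot = sum((t - mean_t) ** 2 for t in targets)
    return 1.0 - ss_res / ss_tot

print(parameter_count(144, 8))  # 2592

# Hypothetical log-count targets and model predictions for five pairs.
targets = [-0.9, 0.0, 0.7, 1.6, 3.9]
predictions = [0.3, 0.4, 0.5, 1.4, 3.7]
print(round(r_squared(targets, predictions), 3))
```

The parameter count of 2,592 is the "~2,600" quoted above. In the invented data the largest residual sits on the smallest target, mirroring the pattern of rare pairs being fit less tightly than frequent ones.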

Now the payoff from Chapter 3’s foreshadowing: **what do the biases encode?** Scatter-plot \(b_i\) against \(\log(X_i)\) and you’ll find a positive correlation, moderate on our toy corpus (r ≈ 0.56 with only 8 dimensions and 144 words), and expected to be much stronger on production models where the derivation’s assumptions hold more tightly. The derivation *predicted* that biases should absorb word frequency, and even on this tiny dataset the trend is visible. This separation is elegant: biases handle frequency, freeing the dot product \(w_i^\top \tilde{w}_k\) to encode pure semantic content. The paper notes that when X is symmetric, W and W̃ “are equivalent and differ only as a result of their random initializations.” We can verify this by comparing their per-word cosine similarity. On our toy corpus the cosines are mixed (mean ≈ 0.16) because 8 dimensions leave little room; the most aligned words (“flows”, “warm”, “fish”) tend to belong to strong semantic clusters, while the most divergent (“moon”, “pot”, “weather”) are words without a tight semantic neighborhood.

Finally, the **PMI connection** ties everything together. Since biases absorb the marginal log-frequencies, the dot product alone approximates \(\text{PMI}(i,j) = \log(X_{ij}) - \log(X_i) - \log(X_j) + \log|C|\). This means GloVe can be interpreted as factorizing a shifted PMI matrix. Separately, Levy & Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a similar shifted PMI matrix. Though arrived at from completely different starting points, both methods approximate the same underlying structure. On production corpora the scatter plot of dot products vs PMI values shows a clear linear correlation; on our toy corpus the trend is present but noisier (r ≈ 0.40) due to the limited vocabulary and 8 dimensions.
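
The PMI formula above can be evaluated directly from the co-occurrence dictionary. In this sketch the normalizer \(|C|\) is taken to be the sum of all entries of X, which is an assumption made for the example, as are the three-word counts and the function name `pmi`.

```python
import math

def row_total(X, i):
    return sum(v for (a, _), v in X.items() if a == i)

def pmi(X, i, j):
    """PMI(i, j) = log X_ij - log X_i - log X_j + log(total)."""
    total = sum(X.values())
    return (math.log(X[(i, j)]) - math.log(row_total(X, i))
            - math.log(row_total(X, j)) + math.log(total))

# Made-up symmetric counts over three words.
X = {("a", "b"): 2.0, ("b", "a"): 2.0, ("a", "c"): 1.0,
     ("c", "a"): 1.0, ("b", "c"): 1.0, ("c", "b"): 1.0}

print(f"{pmi(X, 'a', 'b'):.4f}")  # log(16/9)
print(f"{pmi(X, 'a', 'c'):.4f}")  # log(4/3)

# If the biases were exactly log X_i and log X_j, the dot product would have
# to equal log X_ij - log X_i - log X_j, which is PMI minus log(total).
implied_dot = math.log(2.0) - math.log(3.0) - math.log(3.0)
print(f"{implied_dot - pmi(X, 'a', 'b'):.4f}")  # -log(8)
```

The last line shows the "shift": under the idealized bias assumption the dot product differs from PMI by the same constant for every pair.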

---

## Chapter 7: GloVe in the Landscape

GloVe sits at the culmination of a long lineage of count-based methods. **LSA** (1990) showed that SVD of a term-document matrix captures meaning through dimensionality reduction, the first proof that statistical co-occurrence encodes semantics. **HAL** (1996) moved to word-word co-occurrence but used raw counts, which are dominated by frequency effects. **COALS** (Rohde et al., 2006) added correlation-based normalization before SVD. **PPMI + SVD** (Levy & Goldberg, 2014) computed Positive PMI and applied SVD, a strong baseline competitive with Word2Vec. Each method asked: “What transformation of the co-occurrence matrix gives the best word representations?” GloVe’s answer: a weighted factorization of the log, guided by the probability ratio insight.

The paper’s Table 2 reveals this progression quantitatively. SVD of raw X scores 7.3% on analogy tasks; frequency effects destroy the signal. SVD of \(\sqrt{X}\) jumps to 42.1% by compressing dynamic range. SVD of \(\log(1 + X)\) reaches 60.1%; the log transformation is more principled and closer to PMI. GloVe achieves 71.7%. That final 11.6-percentage-point gain comes from three sources: the weighting function \(f(X_{ij})\) that focuses on informative entries, separate biases that capture frequency, and stochastic optimization that scales where SVD cannot.

The **Levy & Goldberg result** (2014) connects the two sides of the embedding world. They showed that skip-gram with negative sampling implicitly factorizes a shifted PMI matrix: \(w_i^\top \tilde{w}_j \approx \text{PMI}(i,j) - \log(k)\), where \(k\) here is the number of negative samples (not a probe word). GloVe explicitly factorizes \(\log(X)\), and since biases absorb marginal frequencies, this is closely related to shifted PMI. Both methods, from completely different starting points, approximate similar structure, lending weight to the idea that co-occurrence statistics contain the fundamental signal for word meaning. Meanwhile, GloVe’s training complexity is **sub-linear** in corpus size thanks to Zipf’s law: the number of non-zero co-occurrence entries scales as \(O(|C|^{0.8})\), which the paper derives by modeling co-occurrence counts as a power law of their frequency rank and fitting an exponent of 1.25 on its corpora, so that \(|X| = O(|C|^{1/1.25})\).
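
A quick calculation shows what the 0.8 exponent means in practice. The function below is a back-of-the-envelope sketch that assumes the \(O(|C|^{0.8})\) scaling holds exactly with a constant prefactor; real corpora will deviate.

```python
def nonzero_growth(corpus_growth, exponent=0.8):
    """Factor by which |X| grows when the corpus grows by corpus_growth."""
    return corpus_growth ** exponent

for factor in (2, 10, 100):
    print(f"{factor}x tokens -> about {nonzero_growth(factor):.2f}x non-zero entries")
```

Under this assumption, ten times more text yields only about 6.3 times more training entries per epoch, whereas a method that streams over the corpus does ten times more work.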

Pre-trained GloVe vectors from Stanford (6B, 42B, and 840B token corpora) remain widely available and useful for initialization and lightweight tasks. But GloVe produces **static** embeddings: the same vector for “bank” whether it means riverbank or financial institution. This limitation drove the field toward contextual models such as ELMo (2018), BERT (2019), and GPT, which produce different vectors for the same word in different sentences. Yet GloVe’s core insights (that co-occurrence statistics encode meaning, that ratios are more informative than raw probabilities, and that a principled log-bilinear model can be derived from first principles) remain foundational to understanding how any embedding method works.

---

## Compare with Word2Vec

| Aspect | GloVe | Word2Vec |
|--------|--------|----------|
| Input | Entire co-occurrence matrix \(X\) | One (center, context) pair at a time |
| Objective | Weighted squared error on \(\log(X_{ij})\) | Cross-entropy with negative sampling |
| Architecture | No neural network, pure regression | Shallow neural network with softmax |
| Training | Epochs over non-zero entries | Streaming passes over corpus |
| Optimizer | AdaGrad (essential) | SGD (sufficient) |
| Output | \(W + \tilde{W}\) (two vector sets) + biases | W only (or W + W′) |
| Complexity | \(O(\lvert X \rvert) \approx O(\lvert C \rvert^{0.8})\) per epoch | \(O(\lvert C \rvert)\) per epoch |
| Analogies | Follows from the derivation | Empirical observation |
| Incrementality | Must rebuild \(X\) for new data | Can update incrementally |
| Gradients | Manual (analytical) | Autograd (PyTorch) |

### Parameter Scale Comparison

Same labels and formatted counts as the bar chart under the comparison table: This model **2.6K**; GloVe (Stanford, 2014) **241M**; Word2Vec (Google, 2013) **900M**; GPT-2 Small **124M**; BERT Base **110M**.

---

## Frequently Asked Questions

**How is GloVe different from Word2Vec?**  
Word2Vec scans context windows one at a time and updates weights incrementally (prediction-based). GloVe first counts ALL co-occurrences into a matrix, then learns vectors that factorize it (count-based). The paper shows that for the same corpus, vocabulary, window size, and training time, GloVe consistently outperforms Word2Vec on analogy tasks, but both exploit the same underlying co-occurrence statistics and produce vectors with similar linear substructure.

**Why two sets of vectors (W and W̃)?**  
The model has separate “word” and “context” vectors. When \(X\) is symmetric (as in our toy corpus), they’re mathematically equivalent and differ only by random initialization. Summing them (\(W + \tilde{W}\)) averages out noise and gives a small performance boost.

**What do the bias terms encode?**  
Biases absorb word frequency information: \(b_i \approx \log(X_i)\), the log total co-occurrence count of word \(i\). This frees the word vectors to encode meaning rather than frequency.

**Why use AdaGrad instead of SGD or Adam?**  
Co-occurrence values span orders of magnitude. AdaGrad automatically gives smaller learning rates to parameters that receive many large updates (the vectors and biases of common words) and larger rates to those updated rarely. On real corpora \(X_{ij}\) can range from 0.2 to 5,000+; our toy corpus spans 0.4 to ~51.

**Why alpha = 3/4 in the weighting function?**  
Empirical, but connected to sublinear frequency scaling (Zipf’s law compensation). \(\alpha = 3/4\) means doubling co-occurrence only increases weight by \(2^{0.75} \approx 1.68\times\), preventing common pairs from dominating.

**Why not use cross-entropy like Word2Vec?**  
Cross-entropy requires normalizing over the entire vocabulary (softmax, \(O(|V|)\) per term). Squared error on log co-occurrences avoids this entirely. It’s simpler, faster, and the paper argues it is preferable, avoiding normalization while the weighting function \(f(X_{ij})\) handles the frequency imbalance that cross-entropy handles poorly.

**Is GloVe doing matrix factorization?**  
Yes. The objective factorizes \(\log(X)\) into \(W \cdot \tilde{W}^\top\) + biases, with weighting \(f(X_{ij})\). This is closely related to SVD of \(\log(X)\), but GloVe’s weighting focuses on informative entries and skips the zero entries entirely (~91% of our toy matrix).

**How does training complexity scale?**  
Sub-linearly. The number of non-zero entries \(|X|\) scales as \(O(|C|^{0.8})\) due to Zipf’s law. GloVe trains on \(|X|\) entries per epoch, which is less than Word2Vec’s \(O(|C|)\) per epoch.

**Can GloVe handle new words after training?**  
No. Adding a new word requires rebuilding the co-occurrence matrix and retraining. For OOV words, common strategies include using the zero vector, using the average of all vectors, or switching to FastText which handles OOV via subword embeddings.
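
The first two fallback strategies can be sketched as a lookup wrapper. The function name `lookup`, the `strategy` argument, and the two-word vector table are hypothetical; subword models such as FastText are a different approach and are not shown.

```python
def lookup(vectors, word, strategy="mean"):
    """Return the vector for word, or a fallback if it is out of vocabulary."""
    if word in vectors:
        return vectors[word]
    dim = len(next(iter(vectors.values())))
    if strategy == "zero":
        return [0.0] * dim
    if strategy == "mean":
        n = len(vectors)
        return [sum(v[d] for v in vectors.values()) / n for d in range(dim)]
    raise ValueError(f"unknown strategy: {strategy}")

# Tiny made-up vector table.
vectors = {"ice": [1.0, 2.0], "steam": [3.0, 4.0]}

print(lookup(vectors, "ice"))             # [1.0, 2.0]
print(lookup(vectors, "plasma"))          # [2.0, 3.0]
print(lookup(vectors, "plasma", "zero"))  # [0.0, 0.0]
```

In practice the mean vector would be computed once and cached rather than recomputed on every miss.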

**What’s the connection to PMI?**  
Since biases absorb marginal log-frequencies, the dot product approximates \(w_i^\top \tilde{w}_j \approx \text{PMI}(i,j)\) + constant. Levy & Goldberg (2014) independently showed that skip-gram with negative sampling factorizes a similar shifted PMI matrix. Both methods, from different starting points, converge on the same statistical structure.

**Does window size matter?**  
Yes. Larger windows (8–10) capture semantic information (topical relatedness). Smaller windows (2–4) capture syntactic information (grammatical patterns). Symmetric windows perform better on semantic tasks; the paper uses a context of 10 words to the left and 10 to the right (window size = 10 per side). Our toy corpus uses window = 5 per side.

**Why do analogies (king − man + woman ≈ queen) work?**  
It follows from the log-bilinear model. Vector subtraction in embedding space corresponds to co-occurrence ratio division in probability space. \(\text{king} - \text{man}\) encodes the log ratio of their co-occurrences with every word; adding \(\text{woman}\) finds the word whose profile matches that pattern.

**What does Xᵢⱼ = 0 mean?**  
It means words \(i\) and \(j\) never co-occurred within any context window in this particular corpus. With more data, the same pair might have a non-zero count. Absence of evidence is not evidence of absence. GloVe handles this by training only on non-zero entries.

**Where can I get pre-trained GloVe vectors?**  
Stanford provides pre-trained vectors at nlp.stanford.edu/projects/glove/, trained on Wikipedia + Gigaword (6B tokens, 400K vocab), Common Crawl (42B and 840B tokens), and Twitter (27B tokens). The 6B 300d vectors are a solid starting point for most applications.
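
The downloadable text files store one word per line followed by its space-separated vector components. Assuming that layout, a minimal reader might look like this; the sample lines are invented stand-ins for file contents, and `parse_glove_text` is a hypothetical helper rather than part of any official tooling.

```python
import io

def parse_glove_text(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

# Invented sample standing in for the contents of a vector file.
sample = io.StringIO("ice 0.1 -0.2 0.3\nsteam 0.0 0.5 -0.4\n")
vectors = parse_glove_text(sample)
print(vectors["ice"])  # [0.1, -0.2, 0.3]
```

With a downloaded file, the same function can be fed an open file handle, for example `open("glove.6B.300d.txt", encoding="utf-8")`. Loading 400K vectors into Python lists is memory-hungry, so an array library is the more usual choice for real use.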

**Are GloVe embeddings still used today?**  
As standalone features, they’ve largely been replaced by contextual embeddings (BERT, GPT). But GloVe’s insights (co-occurrence statistics, the ratio property, log-bilinear models) are foundational to understanding modern NLP. Pre-trained GloVe vectors are still used for initialization and lightweight applications.

---

## Explore the trained model in depth (footer links)

- [Factorization Quality & Internals →](/topics/glove/internals)  
- [Annotated NumPy Code →](/topics/glove/code)  
- [Test Your Knowledge →](/topics/glove/quiz)

### Next Topic

**[RNN: Recurrent Neural Networks →](/topics/rnn)**  
Learn how recurrent architectures model sequences, language, and time-series data with hidden state.

---

## Other machine-readable mirrors

- Community (markdown): `/topics/glove/community.md` — [HTML](/topics/glove/community)