With training complete, we can inspect the model from angles unique to GloVe. Start with nearest neighbors: rank every other word by cosine similarity to a query word's vector. For "ice," the top neighbors are "freeze", "melts", and "solid"; thermodynamic words cluster tightly. For "king," the top neighbors are "wise", "old", and "proud"; properties associated with royalty in this small corpus. Then look at analogies: $w_{\text{king}} - w_{\text{man}} + w_{\text{woman}} \approx w_{\text{queen}}$. On production GloVe this works reliably; on our tiny corpus the model predicts "wise" instead of "queen" (though at least one analogy does work!), but the principle holds. This isn't magic; it's a direct consequence of the log-bilinear model. Vector subtraction in embedding space corresponds to co-occurrence-ratio division in probability space. The difference vector $w_{\text{king}} - w_{\text{man}}$ encodes the log ratio of their co-occurrences with every other word; adding $w_{\text{woman}}$ finds the word whose co-occurrence profile matches that pattern.
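Both lookups reduce to a few lines of linear algebra. A minimal sketch with NumPy, assuming a trained embedding matrix `W` whose rows are indexed by a `vocab` list (stubbed here with random vectors, so the printed words are meaningless; the comments show what real vectors would give):

```python
import numpy as np

# Stand-ins for the trained model; in the real explorer, W comes from training.
rng = np.random.default_rng(0)
vocab = ["ice", "freeze", "melts", "solid", "king", "man", "woman", "queen"]
W = rng.normal(size=(len(vocab), 8))
Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit rows: dot = cosine

def nearest(word, k=3):
    """Rank every other word by cosine similarity to `word`'s vector."""
    sims = Wn @ Wn[vocab.index(word)]
    order = [j for j in np.argsort(-sims) if vocab[j] != word]
    return [vocab[j] for j in order[:k]]

def analogy(a, b, c):
    """Return the word whose vector is closest to w_a - w_b + w_c."""
    t = Wn[vocab.index(a)] - Wn[vocab.index(b)] + Wn[vocab.index(c)]
    sims = Wn @ (t / np.linalg.norm(t))
    order = [j for j in np.argsort(-sims) if vocab[j] not in (a, b, c)]
    return vocab[order[0]]

print(nearest("ice"))                    # with real vectors: freeze, melts, solid
print(analogy("king", "man", "woman"))   # with real vectors: ideally queen
```

Excluding the three query words in `analogy` matters: without it, the nearest vector to $w_a - w_b + w_c$ is very often $w_a$ or $w_c$ itself.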
Why analogies work: the algebra
The model equation for word $i$ and context word $k$ is:

$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$

Write this for "king" and "man" with the same context word $k$, then subtract:

$$(w_{\text{king}} - w_{\text{man}})^\top \tilde{w}_k + (b_{\text{king}} - b_{\text{man}}) = \log \frac{X_{\text{king},k}}{X_{\text{man},k}}$$
The difference vector $w_{\text{king}} - w_{\text{man}}$ encodes the log co-occurrence ratio with every context word $k$ (up to a bias constant). This vector captures the "royalty minus male" semantic direction. To find the word that is "royalty minus male plus female," we solve:

$$w_d \approx w_{\text{king}} - w_{\text{man}} + w_{\text{woman}}$$
The paper finds the word $d$ whose vector is closest to $w_{\text{king}} - w_{\text{man}} + w_{\text{woman}}$ by cosine similarity. On a production corpus, the answer is "queen" because $w_{\text{queen}}$ has the co-occurrence profile that matches king's profile, adjusted for the male→female shift.
Try it yourself: Pick two words from the same semantic group in the explorer below. Compute their vector difference. Which context words show the largest magnitude in that difference direction? Do they match your intuition about what distinguishes those words?
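One way to run that experiment in code, sketched with hypothetical `W`, `W_tilde`, and `vocab` arrays standing in for the trained model: project every context vector onto the difference direction and sort by magnitude.

```python
import numpy as np

# Random stubs; replace with the explorer's trained matrices.
rng = np.random.default_rng(0)
vocab = ["river", "ocean", "flows", "waves", "salt", "fish", "boat", "deep"]
W = rng.normal(size=(len(vocab), 8))        # word vectors (stub)
W_tilde = rng.normal(size=(len(vocab), 8))  # context vectors (stub)

def discriminating_contexts(a, b, k=3):
    """Score each context word by its projection onto w_a - w_b.
    Large |score| marks contexts that co-occur very differently with a vs. b."""
    d = W[vocab.index(a)] - W[vocab.index(b)]
    scores = W_tilde @ d
    order = np.argsort(-np.abs(scores))
    return [(vocab[j], float(scores[j])) for j in order[:k]]

print(discriminating_contexts("river", "ocean"))
```

Positive scores are contexts predicted to co-occur more with the first word, negative scores more with the second.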
Toy corpus note: Our corpus has only 144 words and 8 dimensions. Analogies that work flawlessly on a 6B-token corpus may not succeed here; the vocabulary is too small for many standard analogy sets, and the embeddings lack the capacity to encode fine-grained relationships. The explorer below shows accuracy improving with training, but don't expect production-level results.
Next, inspect factorization quality: how well does $w_i^\top \tilde{w}_k + b_i + \tilde{b}_k$ reconstruct $\log X_{ik}$? Plot the target $\log X_{ik}$ against the model's prediction and you'll see a tight correlation for frequent pairs (which had high $f(X_{ik})$ weight) and more scatter for rare pairs (which the optimizer down-weighted). On our toy corpus (144 words, 8 dimensions), the fit is modest: $R^2 \approx 0.32$ despite having ~2,600 parameters for ~1,930 non-zero entries. Why so low? The $f(X_{ik})$ weighting heavily prioritizes frequent pairs, so rare pairs (the majority) contribute little gradient and remain poorly fit. With only 8 dimensions, the model also lacks the capacity to capture all the structure. On real corpora with a 400K-word vocabulary and 300 dimensions, the parameter-to-data ratio is far healthier and the reconstruction is expected to be much tighter.
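Measuring the fit is a few lines of NumPy. A sketch, assuming trained parameters `W`, `W_tilde`, `b`, `b_tilde` and a co-occurrence matrix `X` (stubbed randomly here, so this $R^2$ is meaningless; the trained toy model lands around 0.32):

```python
import numpy as np

# Random stand-ins for the trained model (names are assumptions).
rng = np.random.default_rng(0)
V, D = 144, 8
W, W_tilde = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
X = rng.poisson(0.1, size=(V, V))           # sparse fake co-occurrence counts

i, k = np.nonzero(X)                        # the loss only covers X_ik > 0
target = np.log(X[i, k])
pred = np.sum(W[i] * W_tilde[k], axis=1) + b[i] + b_tilde[k]

# Coefficient of determination over the non-zero entries.
ss_res = np.sum((target - pred) ** 2)
ss_tot = np.sum((target - target.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 over non-zero entries: {r2:.2f}")
```

Restricting to `np.nonzero(X)` mirrors the training objective, which never touches zero-count pairs.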
Now the payoff from Chapter 3's foreshadowing: what do the biases encode? Scatter-plot $b_i$ against $\log \sum_k X_{ik}$, each word's log total co-occurrence count, and you'll find a positive correlation: moderate on our toy corpus ($r \approx 0.56$ with only 8 dimensions and 144 words), and expected to be much stronger on production models, where the derivation's assumptions hold more tightly. The derivation there predicted that biases should absorb word frequency, and even on this tiny dataset the trend is visible. This separation is elegant: biases handle frequency, freeing the dot product $w_i^\top \tilde{w}_k$
to encode pure semantic content. The paper notes that when $X$ is symmetric, $W$ and $\tilde{W}$ "are equivalent and differ only as a result of their random initializations." We can verify this by comparing the per-word cosine similarity between $w_i$ and $\tilde{w}_i$. On our toy corpus the cosines are mixed (mean ≈ 0.16) because 8 dimensions leave little room; the most aligned words ("flows", "warm", "fish") tend to belong to strong semantic clusters, while the most divergent ("moon", "pot", "weather") are words without a tight semantic neighborhood.
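Both checks, bias vs. log frequency and per-word W/W̃ alignment, are one-liners. A sketch with random stand-ins for the trained parameters (the trained toy model gives r ≈ 0.56 and mean cosine ≈ 0.16; these stubs will not):

```python
import numpy as np

# Random stubs; substitute the trained X, W, W_tilde, and b.
rng = np.random.default_rng(0)
V, D = 144, 8
X = rng.poisson(0.5, size=(V, V))
W, W_tilde = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b = rng.normal(size=V)

# Check 1: does b_i track the word's log total co-occurrence count?
log_marginal = np.log(X.sum(axis=1) + 1)    # +1 guards against empty rows
r = np.corrcoef(b, log_marginal)[0, 1]

# Check 2: per-word cosine between the two embedding tables.
Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
Wtn = W_tilde / np.linalg.norm(W_tilde, axis=1, keepdims=True)
cos = np.sum(Wn * Wtn, axis=1)

print(f"bias vs log-frequency r = {r:.2f}, mean W/W~ cosine = {cos.mean():.2f}")
```

Sorting `cos` ascending and descending is how the most divergent and most aligned word lists above were produced.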
Finally, the PMI connection ties everything together. Since the biases absorb the marginal log-frequencies, the dot product alone approximates $\mathrm{PMI}(i,k)$ up to a constant shift. This means GloVe can be interpreted as factorizing a shifted PMI matrix. Separately, Levy & Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a similar shifted PMI matrix. Though arrived at from completely different starting points, both methods approximate the same underlying structure. On production corpora, the scatter plot of dot products vs. PMI values shows a clear linear correlation; on our toy corpus the trend is present but noisier ($r \approx 0.40$) due to the limited vocabulary and 8 dimensions.