With training complete, we can inspect the model from angles unique to GloVe. Start with nearest neighbors: rank every other word by cosine similarity to a query word's vector. For "ice," the top neighbors are "freeze", "melts", and "solid"; thermodynamic words cluster tightly. For "king," the top neighbors are "wise", "old", and "proud"; properties associated with royalty in this small corpus. Then look at analogies: $w_{\text{king}} - w_{\text{man}} + w_{\text{woman}} \approx w_{\text{queen}}$. On production GloVe this works reliably; on our tiny corpus the model predicts "wise" instead of "queen" (though at least one analogy does work!), but the principle holds. This isn't magic; it's a direct consequence of the log-bilinear model. Vector subtraction in embedding space corresponds to co-occurrence-ratio division in probability space. The difference vector $w_{\text{king}} - w_{\text{man}}$ encodes the log ratio of their co-occurrences with every other word; adding $w_{\text{woman}}$ finds the word whose co-occurrence profile matches that pattern.
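Both lookups reduce to a few lines of linear algebra. A minimal sketch with NumPy, assuming a trained embedding matrix `W` whose rows are indexed by a `vocab` list (stubbed here with random vectors, so the printed words are meaningless; the comments show what real vectors would give):

```python
import numpy as np

# Stand-ins for the trained model; in the real explorer, W comes from training.
rng = np.random.default_rng(0)
vocab = ["ice", "freeze", "melts", "solid", "king", "man", "woman", "queen"]
W = rng.normal(size=(len(vocab), 8))
Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit rows: dot = cosine

def nearest(word, k=3):
    """Rank every other word by cosine similarity to `word`'s vector."""
    sims = Wn @ Wn[vocab.index(word)]
    order = [j for j in np.argsort(-sims) if vocab[j] != word]
    return [vocab[j] for j in order[:k]]

def analogy(a, b, c):
    """Return the word whose vector is closest to w_a - w_b + w_c."""
    t = Wn[vocab.index(a)] - Wn[vocab.index(b)] + Wn[vocab.index(c)]
    sims = Wn @ (t / np.linalg.norm(t))
    order = [j for j in np.argsort(-sims) if vocab[j] not in (a, b, c)]
    return vocab[order[0]]

print(nearest("ice"))                    # with real vectors: freeze, melts, solid
print(analogy("king", "man", "woman"))   # with real vectors: ideally queen
```

Excluding the three query words in `analogy` matters: without it, the nearest vector to $w_a - w_b + w_c$ is very often $w_a$ or $w_c$ itself.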
Why analogies work: the algebra
The model equation for word $i$ and context word $k$ is:

$$w_i^\top \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik}$$

Write this for "king" and "man" with the same context word $k$, then subtract:

$$(w_{\text{king}} - w_{\text{man}})^\top \tilde{w}_k + (b_{\text{king}} - b_{\text{man}}) = \log \frac{X_{\text{king},k}}{X_{\text{man},k}}$$
The difference vector $w_{\text{king}} - w_{\text{man}}$ encodes the log co-occurrence ratio with every context word $k$ (up to a bias constant). This vector captures the "royalty minus male" semantic direction. To find the word that is "royalty minus male plus female," we solve:

$$w_d \approx w_{\text{king}} - w_{\text{man}} + w_{\text{woman}}$$
The paper finds the word $d$ whose vector is closest to $w_{\text{king}} - w_{\text{man}} + w_{\text{woman}}$ by cosine similarity. On a production corpus, the answer is "queen" because $w_{\text{queen}}$ has the co-occurrence profile that matches king's profile, adjusted for the male→female shift.
Try it yourself: Pick two words from the same semantic group in the explorer below. Compute their vector difference. Which context words show the largest magnitude in that difference direction? Do they match your intuition about what distinguishes those words?
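One way to run that experiment in code, sketched with hypothetical `W`, `W_tilde`, and `vocab` arrays standing in for the trained model: project every context vector onto the difference direction and sort by magnitude.

```python
import numpy as np

# Random stubs; replace with the explorer's trained matrices.
rng = np.random.default_rng(0)
vocab = ["river", "ocean", "flows", "waves", "salt", "fish", "boat", "deep"]
W = rng.normal(size=(len(vocab), 8))        # word vectors (stub)
W_tilde = rng.normal(size=(len(vocab), 8))  # context vectors (stub)

def discriminating_contexts(a, b, k=3):
    """Score each context word by its projection onto w_a - w_b.
    Large |score| marks contexts that co-occur very differently with a vs. b."""
    d = W[vocab.index(a)] - W[vocab.index(b)]
    scores = W_tilde @ d
    order = np.argsort(-np.abs(scores))
    return [(vocab[j], float(scores[j])) for j in order[:k]]

print(discriminating_contexts("river", "ocean"))
```

Positive scores are contexts predicted to co-occur more with the first word, negative scores more with the second.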
Toy corpus note: Our corpus has only 144 words and 8 dimensions. Analogies that work flawlessly on a 6B-token corpus may not succeed here; the vocabulary is too small for many standard analogy sets, and the embeddings lack the capacity to encode fine-grained relationships. The explorer below shows accuracy improving with training, but don't expect production-level results.
Next, inspect factorization quality: how well does $w_i^\top \tilde{w}_k + b_i + \tilde{b}_k$ reconstruct $\log X_{ik}$? Plot the target $\log X_{ik}$ against the model's prediction and you'll see a tight correlation for frequent pairs (which had high $f(X_{ik})$ weight) and more scatter for rare pairs (which the optimizer down-weighted). On our toy corpus (144 words, 8 dimensions), the fit is modest: $R^2 \approx 0.32$ despite having ~2,600 parameters for ~1,930 non-zero entries. Why so low? The $f(X_{ik})$ weighting heavily prioritizes frequent pairs, so rare pairs (the majority) contribute little gradient and remain poorly fit. With only 8 dimensions, the model also lacks the capacity to capture all the structure. On real corpora with a 400K-word vocabulary and 300 dimensions, the parameter-to-data ratio is far healthier and the reconstruction is expected to be much tighter.
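Measuring the fit is a few lines of NumPy. A sketch, assuming trained parameters `W`, `W_tilde`, `b`, `b_tilde` and a co-occurrence matrix `X` (stubbed randomly here, so this $R^2$ is meaningless; the trained toy model lands around 0.32):

```python
import numpy as np

# Random stand-ins for the trained model (names are assumptions).
rng = np.random.default_rng(0)
V, D = 144, 8
W, W_tilde = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b, b_tilde = rng.normal(size=V), rng.normal(size=V)
X = rng.poisson(0.1, size=(V, V))           # sparse fake co-occurrence counts

i, k = np.nonzero(X)                        # the loss only covers X_ik > 0
target = np.log(X[i, k])
pred = np.sum(W[i] * W_tilde[k], axis=1) + b[i] + b_tilde[k]

# Coefficient of determination over the non-zero entries.
ss_res = np.sum((target - pred) ** 2)
ss_tot = np.sum((target - target.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 over non-zero entries: {r2:.2f}")
```

Restricting to `np.nonzero(X)` mirrors the training objective, which never touches zero-count pairs.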
Now the payoff from Chapter 3's foreshadowing: what do the biases encode? Scatter-plot $b_i$ against $\log \sum_k X_{ik}$, each word's log total co-occurrence count, and you'll find a positive correlation: moderate on our toy corpus ($r \approx 0.56$ with only 8 dimensions and 144 words), and expected to be much stronger on production models, where the derivation's assumptions hold more tightly. The derivation there predicted that biases should absorb word frequency, and even on this tiny dataset the trend is visible. This separation is elegant: biases handle frequency, freeing the dot product $w_i^\top \tilde{w}_k$
to encode pure semantic content. The paper notes that when $X$ is symmetric, $W$ and $\tilde{W}$ "are equivalent and differ only as a result of their random initializations." We can verify this by comparing the per-word cosine similarity between $w_i$ and $\tilde{w}_i$. On our toy corpus the cosines are mixed (mean ≈ 0.16) because 8 dimensions leave little room; the most aligned words ("flows", "warm", "fish") tend to belong to strong semantic clusters, while the most divergent ("moon", "pot", "weather") are words without a tight semantic neighborhood.
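Both checks, bias vs. log frequency and per-word W/W̃ alignment, are one-liners. A sketch with random stand-ins for the trained parameters (the trained toy model gives r ≈ 0.56 and mean cosine ≈ 0.16; these stubs will not):

```python
import numpy as np

# Random stubs; substitute the trained X, W, W_tilde, and b.
rng = np.random.default_rng(0)
V, D = 144, 8
X = rng.poisson(0.5, size=(V, V))
W, W_tilde = rng.normal(size=(V, D)), rng.normal(size=(V, D))
b = rng.normal(size=V)

# Check 1: does b_i track the word's log total co-occurrence count?
log_marginal = np.log(X.sum(axis=1) + 1)    # +1 guards against empty rows
r = np.corrcoef(b, log_marginal)[0, 1]

# Check 2: per-word cosine between the two embedding tables.
Wn = W / np.linalg.norm(W, axis=1, keepdims=True)
Wtn = W_tilde / np.linalg.norm(W_tilde, axis=1, keepdims=True)
cos = np.sum(Wn * Wtn, axis=1)

print(f"bias vs log-frequency r = {r:.2f}, mean W/W~ cosine = {cos.mean():.2f}")
```

Sorting `cos` ascending and descending is how the most divergent and most aligned word lists above were produced.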
Finally, the PMI connection ties everything together. Since the biases absorb the marginal log-frequencies, the dot product alone approximates $\mathrm{PMI}(i,k)$ up to a constant shift. This means GloVe can be interpreted as factorizing a shifted PMI matrix. Separately, Levy & Goldberg (2014) showed that skip-gram with negative sampling implicitly factorizes a similar shifted PMI matrix. Though arrived at from completely different starting points, both methods approximate the same underlying structure. On production corpora, the scatter plot of dot products vs. PMI values shows a clear linear correlation; on our toy corpus the trend is present but noisier ($r \approx 0.40$) due to the limited vocabulary and 8 dimensions.