GloVe
The Code
Annotated walkthrough of train.py: Phase 1 builds the co-occurrence matrix, Phase 2 optimizes via weighted least squares with AdaGrad. Pure NumPy, no autograd.
GloVe's train.py follows the original two-phase design. Phase 1 builds the co-occurrence matrix from the corpus. Phase 2 optimizes word vectors using weighted least squares with AdaGrad. No neural networks, no autograd. Just NumPy. (750 lines, 16 annotated sections)
From raw text to co-occurrence matrix Xij. Vocabulary, tokenization, and sliding window counts with 1/d distance weighting.
NumPy-only implementation; no neural network framework needed.
```python
import argparse
import json
import os
import re
import time
from collections import Counter
from pathlib import Path

import numpy as np
from sklearn.decomposition import PCA


# ── stopwords ───────────────────────────────────────────────────────
STOPWORDS = frozenset({
```
Lowercasing, punctuation removal, hyphen splitting, stopword filtering.
```python
def load_corpus(filepath: str):
    """Load corpus with preprocessing: lowercase, strip punctuation,
    split hyphens, remove numeric tokens, normalize whitespace."""
    raw, cleaned = [], []
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            line = line.lower()
            line = re.sub(r"-", " ", line)
            line = re.sub(r"[^a-z\s]", "", line)
            line = re.sub(r"\s+", " ", line).strip()
            tokens = line.split()
            raw.append(tokens)
            content = [t for t in tokens if t not in STOPWORDS]
            if len(content) >= 2:
                cleaned.append(content)
    return raw, cleaned
```
Alphabetically sorted vocabulary (filtered by min_count) with word-to-index mapping.
```python
def build_vocab(sentences, min_count=3):
    counts = Counter()
    for s in sentences:
        counts.update(s)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    w2i = {w: i for i, w in enumerate(vocab)}
    i2w = {i: w for w, i in w2i.items()}
    return vocab, w2i, i2w, counts


# ── co-occurrence matrix ────────────────────────────────────────────
```
Matrix with 1/d distance weighting (symmetric by default, asymmetric via --no-symmetric), the heart of GloVe’s global statistics.
```python
def build_cooccurrence(sentences, w2i, window, symmetric):
    """Build dense co-occurrence matrix with 1/d distance weighting.

    For each word pair within the context window, the contribution
    is 1/distance (matching the GloVe paper). With --symmetric (default),
    both X[i][j] and X[j][i] are incremented. With --no-symmetric,
    only the left context contributes (asymmetric window).
    """
    V = len(w2i)
    X = np.zeros((V, V), dtype=np.float64)

    for sent in sentences:
        idxs = [w2i[w] for w in sent if w in w2i]
        n = len(idxs)
        for i in range(n):
            # one left-window pass visits each pair exactly once;
            # mirroring the increment makes the matrix symmetric
            for j in range(max(0, i - window), i):
                dist = i - j
                weight = 1.0 / dist
                X[idxs[i], idxs[j]] += weight
                if symmetric:
                    X[idxs[j], idxs[i]] += weight

    return X
```
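The 1/d weighting is easy to verify by hand on a toy sentence. The sketch below is standalone (not part of train.py); the three-word sentence, vocabulary, and window are made up for illustration:

```python
import numpy as np

# toy sentence and index map (illustrative only)
sent = ["ice", "is", "cold"]
w2i = {"ice": 0, "is": 1, "cold": 2}
window, V = 2, len(w2i)

X = np.zeros((V, V))
for i in range(len(sent)):
    for j in range(max(0, i - window), i):   # left context only
        w = 1.0 / (i - j)                    # 1/distance weighting
        X[w2i[sent[i]], w2i[sent[j]]] += w
        X[w2i[sent[j]], w2i[sent[i]]] += w   # mirror for symmetric mode

print(X[w2i["ice"], w2i["is"]])    # adjacent pair → 1.0
print(X[w2i["ice"], w2i["cold"]])  # distance-2 pair → 0.5
```

Adjacent words contribute a full count; words two positions apart contribute half, so nearer context dominates the statistics.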
Sparsity analysis, value distribution, and Zipf's law fit.
```python
def compute_cooccurrence_stats(X, vocab):
    """Compute sparsity, value distribution, and Zipf fit for X."""
    nonzero_mask = X > 0
    nonzero_vals = X[nonzero_mask]
    total_entries = X.shape[0] * X.shape[1]
    num_nonzero = int(nonzero_mask.sum())
    sparsity = 1.0 - num_nonzero / total_entries

    distribution = {}
    if len(nonzero_vals) > 0:
        distribution = {
            "min": float(nonzero_vals.min()),
            "max": float(nonzero_vals.max()),
            "mean": float(nonzero_vals.mean()),
            "median": float(np.median(nonzero_vals)),
            "percentiles": {
                str(p): float(np.percentile(nonzero_vals, p))
                for p in [25, 50, 75, 90, 95, 99]
            },
        }

    zipf_data = {}
    if len(nonzero_vals) > 10:
        sorted_vals = np.sort(nonzero_vals)[::-1]
        ranks = np.arange(1, len(sorted_vals) + 1, dtype=np.float64)
        log_ranks = np.log(ranks)
        log_vals = np.log(sorted_vals)
        valid = np.isfinite(log_vals) & np.isfinite(log_ranks)
        if valid.sum() > 2:
            coeffs = np.polyfit(log_ranks[valid], log_vals[valid], 1)
            alpha = -coeffs[0]
            predicted = coeffs[0] * log_ranks[valid] + coeffs[1]
            ss_res = np.sum((log_vals[valid] - predicted) ** 2)
            ss_tot = np.sum((log_vals[valid] - log_vals[valid].mean()) ** 2)
            r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
            zipf_data = {
                "alpha": float(alpha),
                "r_squared": float(r_squared),
                "intercept": float(coeffs[1]),
            }

    return {
        "total_entries": total_entries,
        "num_nonzero": num_nonzero,
        "sparsity": float(sparsity),
        "value_distribution": distribution,
        "zipf": zipf_data,
    }


# ── weighting function ─────────────────────────────────────────────
```
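The log-log linear fit used for the Zipf exponent can be sanity-checked on synthetic data with a known exponent. This is a standalone sketch (alpha_true and the rank range are made up), using the same np.polyfit recipe as above:

```python
import numpy as np

# exact power law: value = rank^(-alpha_true)
alpha_true = 1.2
ranks = np.arange(1, 1001, dtype=np.float64)
vals = ranks ** (-alpha_true)

# slope of log(value) vs log(rank) recovers -alpha
coeffs = np.polyfit(np.log(ranks), np.log(vals), 1)
alpha_hat = -coeffs[0]
print(round(alpha_hat, 6))  # 1.2
```

On exact power-law data the fit recovers the exponent to floating-point precision; on real co-occurrence counts the r_squared field reports how well the power law actually holds.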
Optimize J = Σ f(Xij)(wi·w̃j + bi + b̃j − log Xij)² with manual gradients and AdaGrad.
f(x) = (x/x_max)^α for x < x_max, else 1. Down-weights very frequent co-occurrences.
```python
def f_weight(x, x_max, alpha):
    """GloVe weighting: (x/x_max)^alpha if x < x_max, else 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```
NumPy-vectorized version of f(x) used during pre-computation of weights for all nonzero pairs.
```python
def f_weight_array(x_vals, x_max, alpha):
    """Vectorized weighting for an array of co-occurrence values."""
    weights = np.where(x_vals < x_max, (x_vals / x_max) ** alpha, 1.0)
    return weights
```
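A few sample values show the shape of f(x) at the default x_max=100, α=0.75 (standalone sketch; the probe values are made up):

```python
import numpy as np

def f_weight_array(x_vals, x_max, alpha):
    # (x/x_max)^alpha below the cap, flat 1.0 at or above it
    return np.where(x_vals < x_max, (x_vals / x_max) ** alpha, 1.0)

x = np.array([1.0, 10.0, 100.0, 500.0])
w = f_weight_array(x, x_max=100.0, alpha=0.75)
print(np.round(w, 4))  # ≈ [0.0316, 0.1778, 1.0, 1.0]
```

A pair seen once gets about 3% of the weight of a pair at the cap, while everything at or beyond x_max saturates at 1, so stopword-like co-occurrences cannot dominate the loss.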
Word pairs tracked for cosine similarity across epochs, and analogy tests for evaluation (e.g., king − man + woman = queen).
```python
KEY_PAIRS = [
    ("ice", "solid"), ("steam", "gas"), ("ice", "water"), ("steam", "water"),
    ("ice", "cold"), ("steam", "hot"), ("ice", "freeze"), ("steam", "boil"),
    ("king", "queen"), ("prince", "princess"), ("man", "woman"), ("boy", "girl"),
    ("father", "mother"), ("son", "daughter"), ("brother", "sister"),
    ("husband", "wife"), ("cat", "dog"), ("lion", "wolf"),
]

ANALOGY_TESTS = [
    ("king", "man", "woman", "queen"),
    ("queen", "woman", "man", "king"),
    ("ice", "solid", "gas", "steam"),
    ("steam", "gas", "solid", "ice"),
    ("prince", "boy", "girl", "princess"),
    ("father", "man", "woman", "mother"),
]
```
Manual gradient computation, AdaGrad adaptive learning, gradient clipping. The core algorithm.
```python
def train(X, vocab, w2i, i2w, dim, lr, x_max, alpha, epochs, seed):
    """GloVe weighted least squares with manual AdaGrad.

    Following the Stanford C implementation:
    - W, W_tilde initialized uniform(-0.5/d, 0.5/d)
    - b, b_tilde initialized uniform(-0.5/d, 0.5/d)
    - AdaGrad accumulators initialized to 1.0 (not 0)
    - Gradient clipping at [-100, 100]
    """
    rng = np.random.RandomState(seed)
    V = len(vocab)

    W = rng.uniform(-0.5 / dim, 0.5 / dim, (V, dim))
    W_tilde = rng.uniform(-0.5 / dim, 0.5 / dim, (V, dim))
    b = rng.uniform(-0.5 / dim, 0.5 / dim, V)
    b_tilde = rng.uniform(-0.5 / dim, 0.5 / dim, V)

    G_W = np.ones((V, dim))
    G_Wt = np.ones((V, dim))
    G_b = np.ones(V)
    G_bt = np.ones(V)

    nonzero_pairs = []
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                nonzero_pairs.append((i, j, X[i, j]))

    log_X = np.zeros_like(X)
    nz_mask = X > 0
    log_X[nz_mask] = np.log(X[nz_mask])

    f_weights_arr = np.array([f_weight(x, x_max, alpha)
                              for _, _, x in nonzero_pairs])

    print(f"  Non-zero pairs: {len(nonzero_pairs)}")
    print(f"  Parameters: {2 * V * dim + 2 * V} "
          f"({V}×{dim}×2 vectors + {V}×2 biases)")

    records = []
    all_losses = []
    t0 = time.time()

    rec = capture_epoch(0, W, W_tilde, b, b_tilde, 0.0, X, log_X,
                        nonzero_pairs, w2i, i2w, vocab, f_weights_arr)
    records.append(rec)

    for epoch in range(1, epochs + 1):
        indices = rng.permutation(len(nonzero_pairs))
        total_cost = 0.0
        total_weight = 0.0

        for idx in indices:
            i, j, x_ij = nonzero_pairs[idx]

            diff = np.dot(W[i], W_tilde[j]) + b[i] + b_tilde[j] - log_X[i, j]
            fw = f_weights_arr[idx]
            fdiff = fw * diff

            total_cost += fw * diff * diff
            total_weight += 1.0

            grad_w = fdiff * W_tilde[j]
            grad_wt = fdiff * W[i]
            grad_b = fdiff
            grad_bt = fdiff

            grad_w = np.clip(grad_w, -100, 100)
            grad_wt = np.clip(grad_wt, -100, 100)
            grad_b = np.clip(grad_b, -100, 100)
            grad_bt = np.clip(grad_bt, -100, 100)

            G_W[i] += grad_w ** 2
            W[i] -= lr * grad_w / np.sqrt(G_W[i])

            G_Wt[j] += grad_wt ** 2
            W_tilde[j] -= lr * grad_wt / np.sqrt(G_Wt[j])

            G_b[i] += grad_b ** 2
            b[i] -= lr * grad_b / np.sqrt(G_b[i])

            G_bt[j] += grad_bt ** 2
            b_tilde[j] -= lr * grad_bt / np.sqrt(G_bt[j])

        avg_loss = total_cost / total_weight if total_weight > 0 else 0.0
        all_losses.append({"epoch": epoch, "loss": round(float(avg_loss), 8)})

        if should_record(epoch, epochs):
            rec = capture_epoch(epoch, W, W_tilde, b, b_tilde, avg_loss,
                                X, log_X, nonzero_pairs, w2i, i2w, vocab,
                                f_weights_arr)
            records.append(rec)

        if epoch % 10 == 0 or epoch == epochs:
            elapsed = time.time() - t0
            print(f"  epoch {epoch:>4}/{epochs}  "
                  f"loss={avg_loss:.6f}  [{elapsed:.1f}s]  "
                  f"({len(records)} recorded)")

    final_state = {
        "W": W.copy(), "W_tilde": W_tilde.copy(),
        "b": b.copy(), "b_tilde": b_tilde.copy(),
        "G_W": G_W.copy(), "G_Wt": G_Wt.copy(),
        "G_b": G_b.copy(), "G_bt": G_bt.copy(),
    }

    return records, all_losses, final_state


# ── projections ─────────────────────────────────────────────────────
```
Every epoch for first 10, then every 5th, plus the final epoch; captures early chaos, convergence, and the final state.
```python
def should_record(epoch, total_epochs):
    if epoch <= 10:
        return True
    if epoch % 5 == 0:
        return True
    if epoch == total_epochs:
        return True
    return False
```
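For the default 100-epoch run, this cadence works out to 28 recorded epochs, on top of the epoch-0 snapshot that train() captures before the loop starts. A standalone sketch of the schedule:

```python
def should_record(epoch, total_epochs):
    # every epoch through 10, every 5th after, always the last
    return epoch <= 10 or epoch % 5 == 0 or epoch == total_epochs

recorded = [e for e in range(1, 101) if should_record(e, 100)]
print(len(recorded))   # 28
print(recorded[:13])   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25]
```

Dense sampling early captures the rapid initial reorganization of the vectors; sparse sampling later keeps the output size manageable once training has settled.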
Saves W, W̃, biases, loss, similarities, reconstruction metrics at each recorded epoch.
```python
def capture_epoch(epoch, W, W_tilde, b, b_tilde, avg_loss, X, log_X,
                  nonzero_pairs, w2i, i2w, vocab, f_weights):
    """Capture full state for a single recorded epoch."""
    V, d = W.shape
    combined = W + W_tilde

    cosines = {}
    for w1, w2 in KEY_PAIRS:
        if w1 in w2i and w2 in w2i:
            v1 = combined[w2i[w1]]
            v2 = combined[w2i[w2]]
            n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
            cos = float(np.dot(v1, v2) / (n1 * n2)) if n1 > 0 and n2 > 0 else 0.0
            cosines[f"{w1}_{w2}"] = round(cos, 6)

    errors = []
    for idx, (i, j, x_ij) in enumerate(nonzero_pairs):
        pred = float(np.dot(W[i], W_tilde[j]) + b[i] + b_tilde[j])
        target = float(log_X[i, j])
        err = abs(pred - target)
        errors.append({
            "i": int(i), "j": int(j),
            "word_i": i2w[i], "word_j": i2w[j],
            "x_ij": float(x_ij),
            "log_x": round(target, 6),
            "pred": round(pred, 6),
            "error": round(err, 6),
            "f_weight": round(float(f_weights[idx]), 6),
        })

    errors_sorted = sorted(errors, key=lambda e: e["error"], reverse=True)
    top_loss_pairs = errors_sorted[:10]

    overall_mse = np.mean([e["error"] ** 2 for e in errors])
    overall_mae = np.mean([e["error"] for e in errors])
    targets = np.array([e["log_x"] for e in errors])
    preds = np.array([e["pred"] for e in errors])
    ss_res = np.sum((targets - preds) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

    return {
        "epoch": epoch,
        "avg_loss": round(float(avg_loss), 8),
        "W": W.tolist(),
        "W_tilde": W_tilde.tolist(),
        "b": b.tolist(),
        "b_tilde": b_tilde.tolist(),
        "cosine_similarities": cosines,
        "top_loss_pairs": top_loss_pairs,
        "reconstruction": {
            "mse": round(float(overall_mse), 8),
            "mae": round(float(overall_mae), 8),
            "r_squared": round(float(r_squared), 6),
            "num_pairs": len(errors),
        },
        "all_errors": errors,
    }


# ── training loop ──────────────────────────────────────────────────
```
For the per-pair cost J = f(Xij)·diff², where diff = wi·w̃j + bi + b̃j − log Xij, the code computes:
diff = wiᵀw̃j + bi + b̃j − log(Xij)
fdiff = f(Xij) · diff
∂J/∂wi = fdiff · w̃j
∂J/∂w̃j = fdiff · wi
∂J/∂bi = fdiff
∂J/∂b̃j = fdiff
The factor of 2 from d/dx(x²) is absorbed into the learning rate, following the Stanford C implementation. Gradients are clipped to [−100, 100] before the AdaGrad update.
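One way to confirm these formulas are exact (up to the absorbed factor of 2) is a finite-difference check on a single random pair. This is a standalone sketch, not part of train.py; the dimensions and scalar values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_i, wt_j = rng.normal(size=d), rng.normal(size=d)
b_i, bt_j = 0.1, -0.2
log_x, fw = 1.5, 0.6   # stand-ins for log X_ij and f(X_ij)

def cost(w):
    diff = w @ wt_j + b_i + bt_j - log_x
    return fw * diff * diff

# analytic gradient w.r.t. w_i, factor of 2 kept explicit
diff = w_i @ wt_j + b_i + bt_j - log_x
grad_analytic = 2.0 * fw * diff * wt_j

# central finite differences along each coordinate
eps = 1e-6
grad_num = np.array([
    (cost(w_i + eps * e) - cost(w_i - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
assert np.allclose(grad_analytic, grad_num, atol=1e-4)
```

The numerical gradient matches 2·fdiff·w̃j, which is exactly the code's fdiff·w̃j after the 2 is folded into the learning rate.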
PCA projections for visualization, save embeddings and epoch snapshots to JSON, and print training summary.
2D PCA projection of W+W̃ for visualization.
```python
def add_projections(records, i2w):
    """Add 2D PCA projections of W+W_tilde to each recorded epoch."""
    V = len(i2w)
    pca = PCA(n_components=2, random_state=42)

    last_combined = np.array(records[-1]["W"]) + np.array(records[-1]["W_tilde"])
    pca.fit(last_combined)
    var = pca.explained_variance_ratio_.tolist()

    combined_snapshots = []
    for rec in records:
        combined = np.array(rec["W"]) + np.array(rec["W_tilde"])
        combined_snapshots.append(combined)

    for i, rec in enumerate(records):
        proj = pca.transform(combined_snapshots[i])
        rec["embeddings_2d"] = {i2w[j]: proj[j].tolist() for j in range(V)}
        rec["pca_variance_explained"] = var


# ── storage ─────────────────────────────────────────────────────────
```
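The design choice here is worth noting: PCA is fitted once on the final epoch's W+W̃, and that single transform is applied to every earlier snapshot, so all epochs share one coordinate system and points move smoothly rather than the axes re-rotating each frame. A minimal standalone illustration (random data; the shapes and number of snapshots are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
snapshots = [rng.normal(size=(50, 8)) for _ in range(3)]  # fake epoch states

pca = PCA(n_components=2, random_state=42)
pca.fit(snapshots[-1])                         # fit on the final snapshot only
projs = [pca.transform(s) for s in snapshots]  # same axes for every epoch
print([p.shape for p in projs])  # [(50, 2), (50, 2), (50, 2)]
```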
Full JSON output: metadata, loss curve, epoch snapshots, co-occurrence matrix.
```python
def save_local(records, all_losses, X, vocab, w2i, counts, config,
               raw_sentences, cooc_stats, out_dir):
    """Write all training data as structured local files."""
    steps_dir = os.path.join(out_dir, "steps")
    os.makedirs(steps_dir, exist_ok=True)

    V = len(vocab)
    nz_mask = X > 0
    log_X = np.zeros_like(X)
    log_X[nz_mask] = np.log(X[nz_mask])
    f_X = np.zeros_like(X)
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                f_X[i, j] = f_weight(X[i, j], config["x_max"], config["alpha"])

    # ── run_metadata.json ────────────────────────────────────
    meta = {
        "topic": "glove",
        "corpus_name": "curated_v1",
        "vocab": vocab,
        "vocab_size": V,
        "embed_dim": config["embed_dim"],
        "config": config,
        "total_epochs": config["epochs"],
        "recorded_epochs": len(records),
        "word_frequencies": {w: counts[w] for w in vocab},
        "raw_sentences": [" ".join(s) for s in raw_sentences],
        "cooccurrence_stats": cooc_stats,
    }
    _write(os.path.join(out_dir, "run_metadata.json"), meta, indent=2)

    # ── cooccurrence_matrix.json ─────────────────────────────
    cooc_data = {
        "matrix": X.tolist(),
        "log_matrix": [[round(float(log_X[i, j]), 6) if X[i, j] > 0 else None
                        for j in range(V)] for i in range(V)],
        "weights": [[round(float(f_X[i, j]), 6) if X[i, j] > 0 else None
                     for j in range(V)] for i in range(V)],
        "vocab": vocab,
        "stats": cooc_stats,
    }
    _write(os.path.join(out_dir, "cooccurrence_matrix.json"), cooc_data)

    # ── loss_curve.json ──────────────────────────────────────
    _write(os.path.join(out_dir, "loss_curve.json"), all_losses)

    # ── step_index.json ──────────────────────────────────────
    step_index = [rec["epoch"] for rec in records]
    _write(os.path.join(out_dir, "step_index.json"), step_index)

    # ── embeddings_timeline.json ─────────────────────────────
    timeline = []
    for rec in records:
        entry = {
            "epoch": rec["epoch"],
            "positions": rec.get("embeddings_2d", {}),
            "loss": rec["avg_loss"],
        }
        if rec.get("pca_variance_explained"):
            entry["pca_variance_explained"] = rec["pca_variance_explained"]
        timeline.append(entry)
    _write(os.path.join(out_dir, "embeddings_timeline.json"), timeline)

    # ── steps/{N}.json ───────────────────────────────────────
    total_step_bytes = 0
    for rec in records:
        step_path = os.path.join(steps_dir, f"{rec['epoch']}.json")
        step_data = {
            "epoch": rec["epoch"],
            "avg_loss": rec["avg_loss"],
            "W": rec["W"],
            "W_tilde": rec["W_tilde"],
            "b": rec["b"],
            "b_tilde": rec["b_tilde"],
            "cosine_similarities": rec["cosine_similarities"],
            "top_loss_pairs": rec["top_loss_pairs"],
            "reconstruction": rec["reconstruction"],
            "all_errors": rec["all_errors"],
            "embeddings_2d": rec.get("embeddings_2d"),
            "pca_variance_explained": rec.get("pca_variance_explained"),
        }
        _write(step_path, step_data)
        total_step_bytes += os.path.getsize(step_path)

    # ── report ───────────────────────────────────────────────
    sizes = {}
    for name in ["run_metadata.json", "cooccurrence_matrix.json",
                 "loss_curve.json", "step_index.json",
                 "embeddings_timeline.json"]:
        p = os.path.join(out_dir, name)
        sizes[name] = os.path.getsize(p)

    print(f"\n  Output directory: {out_dir}/")
    for name, sz in sizes.items():
        print(f"    {name:<35} {sz / 1024:>8.1f} KB")
    print(f"    steps/ ({len(records)} files){' ' * 19}"
          f"{total_step_bytes / 1e6:>7.1f} MB")
    print(f"    {'TOTAL':<35} "
          f"{(sum(sizes.values()) + total_step_bytes) / 1e6:>7.1f} MB")
```
Helper to serialize Python objects to JSON files.
```python
def _write(path, data, indent=None):
    with open(path, "w") as f:
        json.dump(data, f, indent=indent)
```
Nearest neighbors, analogies, Table 1 ratio verification, bias-frequency correlation.
```python
def print_summary(W, W_tilde, b, b_tilde, X, vocab, w2i, i2w):
    """Print nearest neighbors, analogies, and ratio verification."""
    combined = W + W_tilde
    V = len(vocab)

    norms = np.linalg.norm(combined, axis=1, keepdims=True)
    norms = np.where(norms > 0, norms, 1.0)
    normed = combined / norms
    sim_matrix = normed @ normed.T

    print("\n  Nearest neighbors (cosine, W+W̃):")
    probe_words = [
        "ice", "steam", "water", "solid", "gas", "cold", "hot",
        "king", "queen", "man", "woman", "cat", "dog", "lion",
    ]
    for word in probe_words:
        if word not in w2i:
            continue
        idx = w2i[word]
        sims = sim_matrix[idx].copy()
        sims[idx] = -2
        top = np.argsort(sims)[::-1][:5]
        nbrs = "  ".join(f"{i2w[j]}({sims[j]:.2f})" for j in top)
        print(f"    {word:>10} → {nbrs}")

    print("\n  Analogies (a − b + c = ?):")
    for wa, wb, wc, expect in ANALOGY_TESTS:
        if not all(w in w2i for w in (wa, wb, wc)):
            continue
        v = combined[w2i[wa]] - combined[w2i[wb]] + combined[w2i[wc]]
        n = np.linalg.norm(v)
        if n > 0:
            v_normed = v / n
        else:
            v_normed = v
        sims = normed @ v_normed
        for w in (wa, wb, wc):
            sims[w2i[w]] = -2
        top3 = np.argsort(sims)[::-1][:3]
        results = "  ".join(f"{i2w[j]}({sims[j]:.2f})" for j in top3)
        mark = " ✓" if i2w[top3[0]] == expect else ""
        print(f"    {wa} − {wb} + {wc} = {results}{mark}")

    # ── ratio verification (Table 1) ────────────────────────
    print("\n  Ratio verification (P(k|ice) / P(k|steam)):")
    if "ice" in w2i and "steam" in w2i:
        i_ice = w2i["ice"]
        i_steam = w2i["steam"]
        x_ice = X[i_ice]
        x_steam = X[i_steam]
        p_ice = x_ice / x_ice.sum() if x_ice.sum() > 0 else x_ice
        p_steam = x_steam / x_steam.sum() if x_steam.sum() > 0 else x_steam

        probe_k = ["solid", "gas", "water", "liquid", "cold", "hot",
                   "freeze", "boil", "king", "cat"]
        for k in probe_k:
            if k not in w2i:
                continue
            j = w2i[k]
            p_k_ice = p_ice[j]
            p_k_steam = p_steam[j]
            if p_k_steam > 0 and p_k_ice > 0:
                ratio = p_k_ice / p_k_steam
                print(f"    k={k:>8}  P(k|ice)={p_k_ice:.4f}  "
                      f"P(k|steam)={p_k_steam:.4f}  ratio={ratio:.3f}")
            elif p_k_ice > 0:
                print(f"    k={k:>8}  P(k|ice)={p_k_ice:.4f}  "
                      f"P(k|steam)=0  ratio=∞")
            elif p_k_steam > 0:
                print(f"    k={k:>8}  P(k|ice)=0  "
                      f"P(k|steam)={p_k_steam:.4f}  ratio=0")

    # ── bias-frequency correlation ──────────────────────────
    print("\n  Bias-frequency correlation:")
    x_row = np.array([X[w2i[w]].sum() for w in vocab])
    b_arr = np.asarray(b, dtype=np.float64)
    bt_arr = np.asarray(b_tilde, dtype=np.float64)
    valid_idx = np.where(x_row > 0)[0]
    if len(valid_idx) > 2:
        log_freq = np.log(x_row[valid_idx])
        b_vals = b_arr[valid_idx]
        bt_vals = bt_arr[valid_idx]
        corr_b = float(np.corrcoef(b_vals, log_freq)[0, 1])
        corr_bt = float(np.corrcoef(bt_vals, log_freq)[0, 1])
        print(f"    corr(b,  log Σ_j X_ij) = {corr_b:.4f}")
        print(f"    corr(b̃, log Σ_j X_ij) = {corr_bt:.4f}")


# ── main ────────────────────────────────────────────────────────────
```
Ties everything together: argument parsing, corpus loading, co-occurrence construction, training, projection, and output.
```python
def main():
    ap = argparse.ArgumentParser(
        description="GloVe: Global Vectors for Word Representation")
    ap.add_argument("--corpus",
                    default=str(Path(__file__).with_name("corpus.txt")))
    ap.add_argument("--output-dir",
                    default=str(Path(__file__).resolve().parent / "output"),
                    help="Where to write output files (default: ./output/)")
    ap.add_argument("--dim", type=int, default=8,
                    help="Embedding dimension (default: 8)")
    ap.add_argument("--lr", type=float, default=0.05,
                    help="Learning rate (default: 0.05)")
    ap.add_argument("--x-max", type=float, default=100.0,
                    help="x_max for weighting function (default: 100)")
    ap.add_argument("--alpha", type=float, default=0.75,
                    help="Alpha for weighting function (default: 0.75)")
    ap.add_argument("--window", type=int, default=5,
                    help="Context window size (default: 5)")
    ap.add_argument("--symmetric", action="store_true", default=True,
                    help="Symmetric context window (default)")
    ap.add_argument("--no-symmetric", action="store_false", dest="symmetric",
                    help="Asymmetric (left-only) context window")
    ap.add_argument("--epochs", type=int, default=100,
                    help="Number of training epochs (default: 100)")
    ap.add_argument("--min-count", type=int, default=3,
                    help="Minimum word frequency (default: 3)")
    ap.add_argument("--seed", type=int, default=42)
    args = ap.parse_args()

    np.random.seed(args.seed)

    # ── load ──────────────────────────────────────────────────────
    print("Loading corpus …")
    raw, cleaned = load_corpus(args.corpus)
    vocab, w2i, i2w, counts = build_vocab(cleaned, args.min_count)

    print(f"  {len(raw)} sentences → {len(cleaned)} after stopword removal")
    print(f"  Vocabulary: {len(vocab)} words (min_count={args.min_count})")

    config = {
        "learning_rate": args.lr,
        "x_max": args.x_max,
        "alpha": args.alpha,
        "window_size": args.window,
        "symmetric": args.symmetric,
        "embed_dim": args.dim,
        "epochs": args.epochs,
        "min_count": args.min_count,
        "random_seed": args.seed,
    }

    # ── Phase 1: co-occurrence matrix ──────────────────────
    print(f"\nBuilding co-occurrence matrix "
          f"(window={args.window}, symmetric={args.symmetric}) …")
    X = build_cooccurrence(cleaned, w2i, args.window, args.symmetric)
    cooc_stats = compute_cooccurrence_stats(X, vocab)

    print(f"  Matrix size: {X.shape[0]}×{X.shape[1]}")
    print(f"  Non-zero entries: {cooc_stats['num_nonzero']} "
          f"(sparsity: {cooc_stats['sparsity']:.2%})")
    if cooc_stats["value_distribution"]:
        d = cooc_stats["value_distribution"]
        print(f"  Value range: [{d['min']:.2f}, {d['max']:.2f}], "
              f"mean={d['mean']:.2f}")

    # ── Phase 2: train ─────────────────────────────────────
    print(f"\nTraining (epochs={args.epochs}, dim={args.dim}, lr={args.lr}, "
          f"x_max={args.x_max}, α={args.alpha}) …")
    records, all_losses, final_state = train(
        X, vocab, w2i, i2w, args.dim, args.lr,
        args.x_max, args.alpha, args.epochs, args.seed,
    )

    # ── 2D projections ─────────────────────────────────────
    print(f"\nProjecting to 2D (final_pca, {len(records)} snapshots) …")
    add_projections(records, i2w)

    # ── save ──────────────────────────────────────────────────────
    print(f"\nSaving to {args.output_dir} …")
    save_local(records, all_losses, X, vocab, w2i, counts, config,
               raw, cooc_stats, args.output_dir)

    # ── summary ───────────────────────────────────────────────────
    print("\n" + "=" * 60)
    print("TRAINING SUMMARY")
    print("=" * 60)
    print(f"  Total epochs trained : {args.epochs}")
    print(f"  Epochs recorded      : {len(records)}")
    print(f"  Vocab size           : {len(vocab)}")
    print(f"  Embed dim            : {args.dim}")
    if all_losses:
        print(f"  Initial loss         : {all_losses[0]['loss']:.6f}")
        print(f"  Final loss           : {all_losses[-1]['loss']:.6f}")
    recon = records[-1]["reconstruction"]
    print(f"  Final MSE            : {recon['mse']:.6f}")
    print(f"  Final R²             : {recon['r_squared']:.4f}")

    W = final_state["W"]
    W_tilde = final_state["W_tilde"]
    b = final_state["b"]
    b_tilde = final_state["b_tilde"]
    print_summary(W, W_tilde, b, b_tilde, X, vocab, w2i, i2w)

    print("\nDone ✓")


if __name__ == "__main__":
    main()
```
```shell
$ cd github-quantml/glove
$ pip install -r requirements.txt
$ python train.py                  # trains in ~8s
```
```shell
$ python explore.py                # prints neighbor & analogy evolution over training epochs
$ python explore.py --interactive  # try: ice | king - man + woman | cooc ice cold | ratio ice steam
$ python analyze.py --save-json    # runs 10 analyses, exports to JSON
```
Try different hyperparameters and compare results:
```shell
# Compare embedding dimensions
$ python train.py --dim 4          # fewer dims → harder separation
$ python train.py --dim 32         # more capacity → better fit

# Compare window sizes
$ python train.py --window 2       # narrow → syntactic
$ python train.py --window 10      # wide → semantic

# Symmetric vs asymmetric
$ python train.py --no-symmetric
```
| Argument | Default | Description |
|---|---|---|
| --dim | 8 | Embedding vector dimensionality |
| --lr | 0.05 | Initial learning rate (AdaGrad scales per-parameter) |
| --x-max | 100 | Cap on co-occurrence count in weighting function |
| --alpha | 0.75 | Exponent in f(x) = (x/x_max)^α for x < x_max, else 1 |
| --window | 5 | Context window radius (symmetric by default) |
| --epochs | 100 | Number of passes over the co-occurrence matrix |
| --min-count | 3 | Min word frequency to include in vocab |
| --symmetric | true | Use symmetric context window (both left and right) |
| --no-symmetric | - | Use asymmetric (left-only) context window |
| --seed | 42 | Random seed for reproducibility |