Python · NumPy

GloVe

The Code

Annotated walkthrough of train.py: Phase 1 builds the co-occurrence matrix, Phase 2 optimizes via weighted least squares with AdaGrad. Pure NumPy, no autograd.

Overview

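train.py implements GloVe in two core phases. Phase 1 builds the co-occurrence matrix from the corpus. Phase 2 optimizes word vectors using weighted least squares with AdaGrad. A third, analysis-only phase projects, saves, and summarizes the results. No neural networks, no autograd: training is plain NumPy, and scikit-learn appears only for the PCA projection. (750 lines, 16 annotated sections)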

Phase 1: Count (5 sections) · Phase 2: Factorize (6 sections) · Phase 3: Analyze (5 sections)
dim=8 · lr=0.05 · x_max=100 · α=0.75 · window=5 · epochs=200 (the --epochs default in train.py is 100)

Phase 1: Corpus → Co-occurrence

From raw text to co-occurrence matrix Xij. Vocabulary, tokenization, and sliding window counts with 1/d distance weighting.

Imports & Constants (train.py:58-71)

NumPy does all of the training math; scikit-learn is imported only for the PCA projection in Phase 3. No neural network framework is needed.

import argparse
import json
import os
import re
import time
from collections import Counter
from pathlib import Path

import numpy as np
from sklearn.decomposition import PCA


# ── stopwords ───────────────────────────────────────────────────────
STOPWORDS = frozenset({
Load & Preprocess Corpus (train.py:89-107)

Lowercasing, hyphen splitting, removal of every character except a-z and whitespace (punctuation and digits), and stopword filtering. Blank lines and lines starting with # are skipped, and sentences with fewer than two content tokens are left out of the cleaned set.

def load_corpus(filepath: str):
    """Load corpus with preprocessing: lowercase, strip punctuation,
    split hyphens, remove numeric tokens, normalize whitespace."""
    raw, cleaned = [], []
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            line = line.lower()
            line = re.sub(r"-", " ", line)
            line = re.sub(r"[^a-z\s]", "", line)
            line = re.sub(r"\s+", " ", line).strip()
            tokens = line.split()
            raw.append(tokens)
            content = [t for t in tokens if t not in STOPWORDS]
            if len(content) >= 2:
                cleaned.append(content)
    return raw, cleaned
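As a rough illustration of the cleaning steps above, the sketch below applies the same three regular expressions to one made-up sentence. The helper name normalize_line and the sample text are assumptions for this example, not part of train.py, and stopword filtering is left out because the STOPWORDS list is not shown in full.

```python
import re


def normalize_line(line: str) -> list[str]:
    """Hypothetical helper mirroring the regex steps in load_corpus."""
    line = line.strip().lower()
    line = re.sub(r"-", " ", line)         # split hyphenated words
    line = re.sub(r"[^a-z\s]", "", line)   # drop punctuation and digits
    line = re.sub(r"\s+", " ", line).strip()
    return line.split()


tokens = normalize_line("Ice-cold water freezes at 0 degrees!")
print(tokens)  # ['ice', 'cold', 'water', 'freezes', 'at', 'degrees']
```

The digit disappears at the character level, so a token such as "h2o" would become "ho" rather than being removed outright.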
Build Vocabulary (train.py:110-120)

Alphabetically sorted vocabulary of words that occur at least min_count times, with word-to-index and index-to-word mappings. The raw frequency counts are returned alongside.

def build_vocab(sentences, min_count=3):
    counts = Counter()
    for s in sentences:
        counts.update(s)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    w2i = {w: i for i, w in enumerate(vocab)}
    i2w = {i: w for w, i in w2i.items()}
    return vocab, w2i, i2w, counts


# ── co-occurrence matrix ────────────────────────────────────────────
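To make the ordering concrete, here is a small sketch that repeats the same counting and filtering on three invented sentences. The sentences and the min_count value of 2 are assumptions chosen for the example.

```python
from collections import Counter

# Toy corpus (made up for illustration)
sentences = [
    ["ice", "cold", "water"],
    ["steam", "hot", "water"],
    ["ice", "water", "solid"],
]

counts = Counter()
for s in sentences:
    counts.update(s)

min_count = 2
# sorted() on strings orders alphabetically, not by frequency
vocab = sorted(w for w, c in counts.items() if c >= min_count)
w2i = {w: i for i, w in enumerate(vocab)}

print(vocab)  # ['ice', 'water']
print(w2i)    # {'ice': 0, 'water': 1}
```

"water" is the most frequent word here yet receives index 1, because the index follows alphabetical order.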
Build Co-occurrence Matrix (train.py:122-151)

Dense V×V matrix with 1/d distance weighting (symmetric by default, asymmetric via --no-symmetric), the heart of GloVe's global statistics. Out-of-vocabulary tokens are dropped before windowing, so distance is measured over the remaining tokens. In symmetric mode each in-window pair is visited from both endpoints, and every visit updates both X[i][j] and X[j][i], so a pair at distance d contributes 2/d rather than 1/d to each of the two entries. The doubling is uniform: it shifts every log Xij by log 2 and doubles the argument of f(x).

def build_cooccurrence(sentences, w2i, window, symmetric):
    """Build dense co-occurrence matrix with 1/d distance weighting.

    For each word pair within the context window, the contribution
    is 1/distance (matching the GloVe paper). With --symmetric (default),
    both X[i][j] and X[j][i] are incremented. With --no-symmetric,
    only the left context contributes (asymmetric window).
    """
    V = len(w2i)
    X = np.zeros((V, V), dtype=np.float64)

    for sent in sentences:
        idxs = [w2i[w] for w in sent if w in w2i]
        n = len(idxs)
        for i in range(n):
            for j in range(max(0, i - window), i):
                dist = i - j
                weight = 1.0 / dist
                X[idxs[i], idxs[j]] += weight
                if symmetric:
                    X[idxs[j], idxs[i]] += weight
            if not symmetric:
                continue
            for j in range(i + 1, min(n, i + window + 1)):
                dist = j - i
                weight = 1.0 / dist
                X[idxs[i], idxs[j]] += weight
                X[idxs[j], idxs[i]] += weight

    return X
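The 1/d weighting is easiest to see on a three-word sentence. The sketch below is a condensed version of the left-context (asymmetric) branch only; the function name cooc_left_only and the toy indices are assumptions for illustration.

```python
import numpy as np


def cooc_left_only(idxs, V, window):
    """Hypothetical condensed form of the asymmetric branch:
    each token counts only the tokens to its left, weighted by 1/distance."""
    X = np.zeros((V, V), dtype=np.float64)
    for i in range(len(idxs)):
        for j in range(max(0, i - window), i):
            X[idxs[i], idxs[j]] += 1.0 / (i - j)
    return X


# "ice cold water" with ice=0, cold=1, water=2
X = cooc_left_only([0, 1, 2], V=3, window=2)
print(X)
# Row 2 (water) sees cold at distance 1 (weight 1.0)
# and ice at distance 2 (weight 0.5).
```

Row 0 stays empty because "ice" has no left context, which is exactly what the asymmetric setting means.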
Co-occurrence Statistics (train.py:154-204)

Sparsity, the distribution of the nonzero values, and a Zipf-style power-law fit: a linear regression of log value on log rank over the nonzero entries sorted in descending order.

def compute_cooccurrence_stats(X, vocab):
    """Compute sparsity, value distribution, and Zipf fit for X."""
    nonzero_mask = X > 0
    nonzero_vals = X[nonzero_mask]
    total_entries = X.shape[0] * X.shape[1]
    num_nonzero = int(nonzero_mask.sum())
    sparsity = 1.0 - num_nonzero / total_entries

    distribution = {}
    if len(nonzero_vals) > 0:
        distribution = {
            "min": float(nonzero_vals.min()),
            "max": float(nonzero_vals.max()),
            "mean": float(nonzero_vals.mean()),
            "median": float(np.median(nonzero_vals)),
            "percentiles": {
                str(p): float(np.percentile(nonzero_vals, p))
                for p in [25, 50, 75, 90, 95, 99]
            },
        }

    zipf_data = {}
    if len(nonzero_vals) > 10:
        sorted_vals = np.sort(nonzero_vals)[::-1]
        ranks = np.arange(1, len(sorted_vals) + 1, dtype=np.float64)
        log_ranks = np.log(ranks)
        log_vals = np.log(sorted_vals)
        valid = np.isfinite(log_vals) & np.isfinite(log_ranks)
        if valid.sum() > 2:
            coeffs = np.polyfit(log_ranks[valid], log_vals[valid], 1)
            alpha = -coeffs[0]
            predicted = coeffs[0] * log_ranks[valid] + coeffs[1]
            ss_res = np.sum((log_vals[valid] - predicted) ** 2)
            ss_tot = np.sum((log_vals[valid] - log_vals[valid].mean()) ** 2)
            r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
            zipf_data = {
                "alpha": float(alpha),
                "r_squared": float(r_squared),
                "intercept": float(coeffs[1]),
            }

    return {
        "total_entries": total_entries,
        "num_nonzero": num_nonzero,
        "sparsity": float(sparsity),
        "value_distribution": distribution,
        "zipf": zipf_data,
    }


# ── weighting function ─────────────────────────────────────────────
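The Zipf fit reduces to one np.polyfit call on log-log data. As a sanity check under idealized assumptions, the sketch below generates values that follow an exact power law (the constants 100 and 1.2 are invented) and recovers the exponent the same way the function does.

```python
import numpy as np

# Synthetic values following value = 100 * rank^(-1.2) exactly (made-up constants)
ranks = np.arange(1, 51, dtype=np.float64)
vals = 100.0 * ranks ** -1.2

# Same fit as in compute_cooccurrence_stats: a line through (log rank, log value)
slope, intercept = np.polyfit(np.log(ranks), np.log(vals), 1)
alpha = -slope

print(f"alpha={alpha:.3f}  exp(intercept)={np.exp(intercept):.1f}")
```

Real co-occurrence values will not lie on a perfect line, which is why the function also reports R² for the fit.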

Phase 2: Weighted Least Squares

Optimize J = Σ f(Xij)(wi·w̃j + bi + b̃j − log Xij)² with manual gradients and AdaGrad.

Weighting Function f(x) (train.py:206-208)

f(x) = (x/x_max)^α for x < x_max, else 1. Down-weights very frequent co-occurrences.

def f_weight(x, x_max, alpha):
    """GloVe weighting: (x/x_max)^alpha if x < x_max, else 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
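Vectorized Weighting (train.py:211-214)

NumPy-vectorized version of f(x) that evaluates the weighting for a whole array of co-occurrence values at once. In the train() listing shown here the weights for the nonzero pairs are precomputed with the scalar f_weight in a list comprehension; f_weight_array computes the same values in a single call.

def f_weight_array(x_vals, x_max, alpha):
    """Vectorized weighting for an array of co-occurrence values."""
    weights = np.where(x_vals < x_max, (x_vals / x_max) ** alpha, 1.0)
    return weights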
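A quick way to get a feel for f(x) is to evaluate it at a few counts. The sketch below copies the two functions above and uses the default x_max=100 and α=0.75; the sample counts are arbitrary.

```python
import numpy as np


def f_weight(x, x_max, alpha):
    return (x / x_max) ** alpha if x < x_max else 1.0


def f_weight_array(x_vals, x_max, alpha):
    return np.where(x_vals < x_max, (x_vals / x_max) ** alpha, 1.0)


x_vals = np.array([1.0, 10.0, 100.0, 500.0])  # arbitrary sample counts
scalar = np.array([f_weight(x, 100.0, 0.75) for x in x_vals])
vectorized = f_weight_array(x_vals, 100.0, 0.75)

# x=1 gets weight ~0.032, x=10 gets ~0.178, anything at or above x_max gets 1.0
print(np.round(vectorized, 3))
print(np.allclose(scalar, vectorized))
```

On a small corpus most counts sit far below x_max=100, so nearly every pair lands on the rising part of the curve.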
Tracked Pairs & Analogy Tests (train.py:229-244)

Word pairs tracked for cosine similarity across epochs, and analogy tests for evaluation (e.g., king − man + woman ≈ queen).

KEY_PAIRS = [
    ("ice", "solid"), ("steam", "gas"), ("ice", "water"), ("steam", "water"),
    ("ice", "cold"), ("steam", "hot"), ("ice", "freeze"), ("steam", "boil"),
    ("king", "queen"), ("prince", "princess"), ("man", "woman"), ("boy", "girl"),
    ("father", "mother"), ("son", "daughter"), ("brother", "sister"),
    ("husband", "wife"), ("cat", "dog"), ("lion", "wolf"),
]

ANALOGY_TESTS = [
    ("king", "man", "woman", "queen"),
    ("queen", "woman", "man", "king"),
    ("ice", "solid", "gas", "steam"),
    ("steam", "gas", "solid", "ice"),
    ("prince", "boy", "girl", "princess"),
    ("father", "man", "woman", "mother"),
]
Training Loop (Weighted Least Squares + AdaGrad) (train.py:309-419)

The core algorithm: stochastic updates over the shuffled nonzero pairs, with manually derived gradients, element-wise clipping to [−100, 100], and per-parameter AdaGrad step sizes. The reported loss is the weighted squared error averaged over the nonzero pairs.

def train(X, vocab, w2i, i2w, dim, lr, x_max, alpha, epochs, seed):
    """GloVe weighted least squares with manual AdaGrad.

    Following the Stanford C implementation:
    - W, W_tilde initialized uniform(-0.5/d, 0.5/d)
    - b, b_tilde initialized uniform(-0.5/d, 0.5/d)
    - AdaGrad accumulators initialized to 1.0 (not 0)
    - Gradient clipping at [-100, 100]
    """
    rng = np.random.RandomState(seed)
    V = len(vocab)

    W = rng.uniform(-0.5 / dim, 0.5 / dim, (V, dim))
    W_tilde = rng.uniform(-0.5 / dim, 0.5 / dim, (V, dim))
    b = rng.uniform(-0.5 / dim, 0.5 / dim, V)
    b_tilde = rng.uniform(-0.5 / dim, 0.5 / dim, V)

    G_W = np.ones((V, dim))
    G_Wt = np.ones((V, dim))
    G_b = np.ones(V)
    G_bt = np.ones(V)

    nonzero_pairs = []
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                nonzero_pairs.append((i, j, X[i, j]))

    log_X = np.zeros_like(X)
    nz_mask = X > 0
    log_X[nz_mask] = np.log(X[nz_mask])

    f_weights_arr = np.array([f_weight(x, x_max, alpha)
                              for _, _, x in nonzero_pairs])

    print(f" Non-zero pairs: {len(nonzero_pairs)}")
    print(f" Parameters: {2 * V * dim + 2 * V} "
          f"({V}×{dim}×2 vectors + {V}×2 biases)")

    records = []
    all_losses = []
    t0 = time.time()

    rec = capture_epoch(0, W, W_tilde, b, b_tilde, 0.0, X, log_X,
                        nonzero_pairs, w2i, i2w, vocab, f_weights_arr)
    records.append(rec)

    for epoch in range(1, epochs + 1):
        indices = rng.permutation(len(nonzero_pairs))
        total_cost = 0.0
        total_weight = 0.0

        for idx in indices:
            i, j, x_ij = nonzero_pairs[idx]

            diff = np.dot(W[i], W_tilde[j]) + b[i] + b_tilde[j] - log_X[i, j]
            fw = f_weights_arr[idx]
            fdiff = fw * diff

            total_cost += fw * diff * diff
            total_weight += 1.0

            grad_w = fdiff * W_tilde[j]
            grad_wt = fdiff * W[i]
            grad_b = fdiff
            grad_bt = fdiff

            grad_w = np.clip(grad_w, -100, 100)
            grad_wt = np.clip(grad_wt, -100, 100)
            grad_b = np.clip(grad_b, -100, 100)
            grad_bt = np.clip(grad_bt, -100, 100)

            G_W[i] += grad_w ** 2
            W[i] -= lr * grad_w / np.sqrt(G_W[i])

            G_Wt[j] += grad_wt ** 2
            W_tilde[j] -= lr * grad_wt / np.sqrt(G_Wt[j])

            G_b[i] += grad_b ** 2
            b[i] -= lr * grad_b / np.sqrt(G_b[i])

            G_bt[j] += grad_bt ** 2
            b_tilde[j] -= lr * grad_bt / np.sqrt(G_bt[j])

        avg_loss = total_cost / total_weight if total_weight > 0 else 0.0
        all_losses.append({"epoch": epoch, "loss": round(float(avg_loss), 8)})

        record = should_record(epoch, epochs)
        if record:
            rec = capture_epoch(epoch, W, W_tilde, b, b_tilde, avg_loss,
                                X, log_X, nonzero_pairs, w2i, i2w, vocab,
                                f_weights_arr)
            records.append(rec)

        if epoch % 10 == 0 or epoch == epochs:
            elapsed = time.time() - t0
            print(f" epoch {epoch:>4}/{epochs} "
                  f"loss={avg_loss:.6f} [{elapsed:.1f}s] "
                  f"({len(records)} recorded)")

    final_state = {
        "W": W.copy(), "W_tilde": W_tilde.copy(),
        "b": b.copy(), "b_tilde": b_tilde.copy(),
        "G_W": G_W.copy(), "G_Wt": G_Wt.copy(),
        "G_b": G_b.copy(), "G_bt": G_bt.copy(),
    }

    return records, all_losses, final_state


# ── projections ─────────────────────────────────────────────────────
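The four parameter updates inside the loop all follow the same pattern: clip, accumulate the squared gradient, then scale the step. A minimal sketch of that pattern in isolation, assuming a hypothetical helper named adagrad_step and invented numbers, might look like this:

```python
import numpy as np


def adagrad_step(param, grad, accum, lr, clip=100.0):
    """Hypothetical helper: one clipped AdaGrad update, applied in place."""
    grad = np.clip(grad, -clip, clip)
    accum += grad ** 2                    # accumulate first, as in the loop above
    param -= lr * grad / np.sqrt(accum)   # then take the scaled step
    return param, accum


w = np.array([0.1, -0.2])
G = np.ones(2)                   # accumulators start at 1.0, not 0
grad = np.array([3.0, 0.0])      # made-up gradient

adagrad_step(w, grad, G, lr=0.05)
print(w, G)
# G becomes [10, 1]; w[0] moves by 0.05 * 3 / sqrt(10), w[1] is unchanged
```

Starting the accumulator at 1.0 keeps the very first step bounded by lr times the gradient and avoids a division by zero for coordinates that have not yet seen a gradient.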
Epoch Recording Strategy (train.py:219-226)

Every epoch for the first 10, then every 5th, plus the final epoch; captures early chaos, convergence, and the final state.

def should_record(epoch, total_epochs):
    if epoch <= 10:
        return True
    if epoch % 5 == 0:
        return True
    if epoch == total_epochs:
        return True
    return False
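To see how many snapshots this rule produces, the sketch below copies should_record and counts the recorded epochs for a 100-epoch run, the argparse default. The epoch-0 snapshot that train() captures before the loop is added separately.

```python
def should_record(epoch, total_epochs):
    if epoch <= 10:
        return True
    if epoch % 5 == 0:
        return True
    if epoch == total_epochs:
        return True
    return False


total_epochs = 100
recorded = [e for e in range(1, total_epochs + 1) if should_record(e, total_epochs)]

print(len(recorded))       # 28: epochs 1-10 plus 15, 20, ..., 100
print(len(recorded) + 1)   # 29 once the initial epoch-0 snapshot is included
```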
State Capture per Epoch (train.py:247-307)

Saves W, W̃, biases, loss, cosine similarities (computed on W + W̃), and reconstruction metrics at each recorded epoch.

def capture_epoch(epoch, W, W_tilde, b, b_tilde, avg_loss, X, log_X,
                  nonzero_pairs, w2i, i2w, vocab, f_weights):
    """Capture full state for a single recorded epoch."""
    V, d = W.shape
    combined = W + W_tilde

    cosines = {}
    for w1, w2 in KEY_PAIRS:
        if w1 in w2i and w2 in w2i:
            v1 = combined[w2i[w1]]
            v2 = combined[w2i[w2]]
            n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
            cos = float(np.dot(v1, v2) / (n1 * n2)) if n1 > 0 and n2 > 0 else 0.0
            cosines[f"{w1}_{w2}"] = round(cos, 6)

    errors = []
    for idx, (i, j, x_ij) in enumerate(nonzero_pairs):
        pred = float(np.dot(W[i], W_tilde[j]) + b[i] + b_tilde[j])
        target = float(log_X[i, j])
        err = abs(pred - target)
        errors.append({
            "i": int(i), "j": int(j),
            "word_i": i2w[i], "word_j": i2w[j],
            "x_ij": float(x_ij),
            "log_x": round(target, 6),
            "pred": round(pred, 6),
            "error": round(err, 6),
            "f_weight": round(float(f_weights[idx]), 6),
        })

    errors_sorted = sorted(errors, key=lambda e: e["error"], reverse=True)
    top_loss_pairs = errors_sorted[:10]

    overall_mse = np.mean([e["error"] ** 2 for e in errors])
    overall_mae = np.mean([e["error"] for e in errors])
    targets = np.array([e["log_x"] for e in errors])
    preds = np.array([e["pred"] for e in errors])
    ss_res = np.sum((targets - preds) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

    return {
        "epoch": epoch,
        "avg_loss": round(float(avg_loss), 8),
        "W": W.tolist(),
        "W_tilde": W_tilde.tolist(),
        "b": b.tolist(),
        "b_tilde": b_tilde.tolist(),
        "cosine_similarities": cosines,
        "top_loss_pairs": top_loss_pairs,
        "reconstruction": {
            "mse": round(float(overall_mse), 8),
            "mae": round(float(overall_mae), 8),
            "r_squared": round(float(r_squared), 6),
            "num_pairs": len(errors),
        },
        "all_errors": errors,
    }


# ── training loop ──────────────────────────────────────────────────

For a single pair (i, j), the loss term is Jij = f(Xij) · diff², where:

diff = wiᵀw̃j + bi + b̃j − log(Xij)

fdiff = f(Xij) · diff

½ ∂Jij/∂wi = fdiff · w̃j

½ ∂Jij/∂w̃j = fdiff · wi

½ ∂Jij/∂bi = fdiff

½ ∂Jij/∂b̃j = fdiff

The factor of 2 from d/dx(x²) is absorbed into the learning rate, following the Stanford C implementation, so the code uses the fdiff expressions directly as its gradients. Each gradient is clipped element-wise to [−100, 100] before the AdaGrad update.
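One way to convince yourself of these expressions is a finite-difference check on a single pair. The sketch below uses random vectors and invented values for the biases, f(Xij), and Xij, and compares the full analytic gradient (factor of 2 kept) against a central-difference estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
w_i, w_j = rng.normal(size=4), rng.normal(size=4)
b_i, b_j = 0.1, -0.2           # made-up biases
log_x, fw = np.log(3.0), 0.5   # made-up log X_ij and f(X_ij)


def J(w):
    """Single-pair loss as a function of w_i only."""
    diff = w @ w_j + b_i + b_j - log_x
    return fw * diff ** 2


diff = w_i @ w_j + b_i + b_j - log_x
fdiff = fw * diff
analytic = 2.0 * fdiff * w_j   # the code drops the 2 and uses fdiff * w_j

eps = 1e-6
numeric = np.array([
    (J(w_i + eps * e) - J(w_i - eps * e)) / (2 * eps)
    for e in np.eye(4)
])

print(np.allclose(analytic, numeric, atol=1e-6))
```

The same check works for w̃j and the biases by swapping which argument is perturbed.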

Phase 3: Analysis & Output

PCA projections for visualization, save embeddings and epoch snapshots to JSON, and print training summary.

PCA Projections (train.py:421-441)

2D PCA projection of W+W̃ for visualization. The PCA is fitted once on the final snapshot and then applied to every recorded epoch, so all frames share the same axes and the same explained-variance figures.

def add_projections(records, i2w):
    """Add 2D PCA projections of W+W_tilde to each recorded epoch."""
    V = len(i2w)
    pca = PCA(n_components=2, random_state=42)

    last_combined = np.array(records[-1]["W"]) + np.array(records[-1]["W_tilde"])
    pca.fit(last_combined)
    var = pca.explained_variance_ratio_.tolist()

    combined_snapshots = []
    for rec in records:
        combined = np.array(rec["W"]) + np.array(rec["W_tilde"])
        combined_snapshots.append(combined)

    for i, rec in enumerate(records):
        proj = pca.transform(combined_snapshots[i])
        rec["embeddings_2d"] = {i2w[j]: proj[j].tolist() for j in range(V)}
        rec["pca_variance_explained"] = var


# ── storage ─────────────────────────────────────────────────────────
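The "fit on the last snapshot, transform all of them" idea can be sketched without scikit-learn. The version below swaps in a plain NumPy SVD for sklearn's PCA (a stand-in, not what train.py calls) and uses random matrices in place of real W + W̃ snapshots.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-ins for W + W_tilde at three recorded epochs (6 words, 8 dims)
snapshots = [rng.normal(size=(6, 8)) * scale for scale in (0.1, 0.5, 1.0)]

# "Fit" on the final snapshot: center it and take the top-2 right singular vectors
final = snapshots[-1]
mean = final.mean(axis=0)
_, _, Vt = np.linalg.svd(final - mean, full_matrices=False)
components = Vt[:2]

# "Transform" every snapshot with the same mean and axes
projected = [(s - mean) @ components.T for s in snapshots]

print([p.shape for p in projected])  # [(6, 2), (6, 2), (6, 2)]
```

Because every epoch is projected onto the same two axes, a word's movement between frames reflects changes in its vector rather than a change of basis.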
Save Training Output (train.py:443-542)

Full JSON output: metadata, loss curve, epoch snapshots, co-occurrence matrix.

def save_local(records, all_losses, X, vocab, w2i, counts, config,
               raw_sentences, cooc_stats, out_dir):
    """Write all training data as structured local files."""
    steps_dir = os.path.join(out_dir, "steps")
    os.makedirs(steps_dir, exist_ok=True)

    V = len(vocab)
    nz_mask = X > 0
    log_X = np.zeros_like(X)
    log_X[nz_mask] = np.log(X[nz_mask])
    f_X = np.zeros_like(X)
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                f_X[i, j] = f_weight(X[i, j], config["x_max"], config["alpha"])

    # ── run_metadata.json ────────────────────────────────────
    meta = {
        "topic": "glove",
        "corpus_name": "curated_v1",
        "vocab": vocab,
        "vocab_size": V,
        "embed_dim": config["embed_dim"],
        "config": config,
        "total_epochs": config["epochs"],
        "recorded_epochs": len(records),
        "word_frequencies": {w: counts[w] for w in vocab},
        "raw_sentences": [" ".join(s) for s in raw_sentences],
        "cooccurrence_stats": cooc_stats,
    }
    _write(os.path.join(out_dir, "run_metadata.json"), meta, indent=2)

    # ── cooccurrence_matrix.json ─────────────────────────────
    cooc_data = {
        "matrix": X.tolist(),
        "log_matrix": [[round(float(log_X[i, j]), 6) if X[i, j] > 0 else None
                        for j in range(V)] for i in range(V)],
        "weights": [[round(float(f_X[i, j]), 6) if X[i, j] > 0 else None
                     for j in range(V)] for i in range(V)],
        "vocab": vocab,
        "stats": cooc_stats,
    }
    _write(os.path.join(out_dir, "cooccurrence_matrix.json"), cooc_data)

    # ── loss_curve.json ──────────────────────────────────────
    _write(os.path.join(out_dir, "loss_curve.json"), all_losses)

    # ── step_index.json ──────────────────────────────────────
    step_index = [rec["epoch"] for rec in records]
    _write(os.path.join(out_dir, "step_index.json"), step_index)

    # ── embeddings_timeline.json ─────────────────────────────
    timeline = []
    for rec in records:
        entry = {
            "epoch": rec["epoch"],
            "positions": rec.get("embeddings_2d", {}),
            "loss": rec["avg_loss"],
        }
        if rec.get("pca_variance_explained"):
            entry["pca_variance_explained"] = rec["pca_variance_explained"]
        timeline.append(entry)
    _write(os.path.join(out_dir, "embeddings_timeline.json"), timeline)

    # ── steps/{N}.json ───────────────────────────────────────
    total_step_bytes = 0
    for rec in records:
        step_path = os.path.join(steps_dir, f"{rec['epoch']}.json")
        step_data = {
            "epoch": rec["epoch"],
            "avg_loss": rec["avg_loss"],
            "W": rec["W"],
            "W_tilde": rec["W_tilde"],
            "b": rec["b"],
            "b_tilde": rec["b_tilde"],
            "cosine_similarities": rec["cosine_similarities"],
            "top_loss_pairs": rec["top_loss_pairs"],
            "reconstruction": rec["reconstruction"],
            "all_errors": rec["all_errors"],
            "embeddings_2d": rec.get("embeddings_2d"),
            "pca_variance_explained": rec.get("pca_variance_explained"),
        }
        _write(step_path, step_data)
        total_step_bytes += os.path.getsize(step_path)

    # ── report ───────────────────────────────────────────────
    sizes = {}
    for name in ["run_metadata.json", "cooccurrence_matrix.json",
                 "loss_curve.json", "step_index.json",
                 "embeddings_timeline.json"]:
        p = os.path.join(out_dir, name)
        sizes[name] = os.path.getsize(p)

    print(f"\n Output directory: {out_dir}/")
    for name, sz in sizes.items():
        print(f" {name:<35} {sz / 1024:>8.1f} KB")
    print(f" steps/ ({len(records)} files){' ' * 19}"
          f"{total_step_bytes / 1e6:>7.1f} MB")
    print(f" {'TOTAL':<35} "
          f"{(sum(sizes.values()) + total_step_bytes) / 1e6:>7.1f} MB")
JSON Writer (train.py:545-547)

Helper to serialize Python objects to JSON files.

def _write(path, data, indent=None):
    with open(path, "w") as f:
        json.dump(data, f, indent=indent)
Training Summary (train.py:552-640)

Nearest neighbors, analogies, a check of the ice/steam co-occurrence probability ratios from Table 1 of the GloVe paper, and the correlation between the learned biases and log row sums of X.

def print_summary(W, W_tilde, b, b_tilde, X, vocab, w2i, i2w):
    """Print nearest neighbors, analogies, and ratio verification."""
    combined = W + W_tilde
    V = len(vocab)

    norms = np.linalg.norm(combined, axis=1, keepdims=True)
    norms = np.where(norms > 0, norms, 1.0)
    normed = combined / norms
    sim_matrix = normed @ normed.T

    print("\n Nearest neighbors (cosine, W+W̃):")
    probe_words = [
        "ice", "steam", "water", "solid", "gas", "cold", "hot",
        "king", "queen", "man", "woman", "cat", "dog", "lion",
    ]
    for word in probe_words:
        if word not in w2i:
            continue
        idx = w2i[word]
        sims = sim_matrix[idx].copy()
        sims[idx] = -2
        top = np.argsort(sims)[::-1][:5]
        nbrs = " ".join(f"{i2w[j]}({sims[j]:.2f})" for j in top)
        print(f" {word:>10} → {nbrs}")

    print("\n Analogies (a − b + c = ?):")
    for wa, wb, wc, expect in ANALOGY_TESTS:
        if not all(w in w2i for w in (wa, wb, wc)):
            continue
        v = combined[w2i[wa]] - combined[w2i[wb]] + combined[w2i[wc]]
        n = np.linalg.norm(v)
        if n > 0:
            v_normed = v / n
        else:
            v_normed = v
        sims = normed @ v_normed
        for w in (wa, wb, wc):
            sims[w2i[w]] = -2
        top3 = np.argsort(sims)[::-1][:3]
        results = " ".join(f"{i2w[j]}({sims[j]:.2f})" for j in top3)
        mark = " ✓" if i2w[top3[0]] == expect else ""
        print(f" {wa} − {wb} + {wc} = {results}{mark}")

    # ── ratio verification (Table 1) ────────────────────────
    print("\n Ratio verification (P(k|ice) / P(k|steam)):")
    if "ice" in w2i and "steam" in w2i:
        i_ice = w2i["ice"]
        i_steam = w2i["steam"]
        x_ice = X[i_ice]
        x_steam = X[i_steam]
        p_ice = x_ice / x_ice.sum() if x_ice.sum() > 0 else x_ice
        p_steam = x_steam / x_steam.sum() if x_steam.sum() > 0 else x_steam

        probe_k = ["solid", "gas", "water", "liquid", "cold", "hot",
                   "freeze", "boil", "king", "cat"]
        for k in probe_k:
            if k not in w2i:
                continue
            j = w2i[k]
            p_k_ice = p_ice[j]
            p_k_steam = p_steam[j]
            if p_k_steam > 0 and p_k_ice > 0:
                ratio = p_k_ice / p_k_steam
                print(f" k={k:>8} P(k|ice)={p_k_ice:.4f} "
                      f"P(k|steam)={p_k_steam:.4f} ratio={ratio:.3f}")
            elif p_k_ice > 0:
                print(f" k={k:>8} P(k|ice)={p_k_ice:.4f} "
                      f"P(k|steam)=0 ratio=∞")
            elif p_k_steam > 0:
                print(f" k={k:>8} P(k|ice)=0 "
                      f"P(k|steam)={p_k_steam:.4f} ratio=0")

    # ── bias-frequency correlation ──────────────────────────
    print("\n Bias-frequency correlation:")
    x_row = np.array([X[w2i[w]].sum() for w in vocab])
    b_arr = np.asarray(b, dtype=np.float64)
    bt_arr = np.asarray(b_tilde, dtype=np.float64)
    valid_idx = np.where(x_row > 0)[0]
    if len(valid_idx) > 2:
        log_freq = np.log(x_row[valid_idx])
        b_vals = b_arr[valid_idx]
        bt_vals = bt_arr[valid_idx]
        corr_b = float(np.corrcoef(b_vals, log_freq)[0, 1])
        corr_bt = float(np.corrcoef(bt_vals, log_freq)[0, 1])
        print(f" corr(b, log Σ_j X_ij) = {corr_b:.4f}")
        print(f" corr(b̃, log Σ_j X_ij) = {corr_bt:.4f}")


# ── main ────────────────────────────────────────────────────────────
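The analogy test boils down to vector arithmetic followed by a cosine nearest-neighbor search that excludes the three query words. The sketch below runs that logic on five hand-made 3-dimensional vectors; the vectors are invented so that the arithmetic works out exactly and say nothing about what trained embeddings look like.

```python
import numpy as np

# Hand-made toy embeddings (assumptions for illustration only)
words = ["man", "woman", "king", "queen", "cat"]
E = np.array([
    [1.0, 0.0, 0.0],   # man
    [0.0, 1.0, 0.0],   # woman
    [1.0, 0.0, 1.0],   # king
    [0.0, 1.0, 1.0],   # queen
    [0.5, 0.5, -1.0],  # cat
])
w2i = {w: i for i, w in enumerate(words)}

normed = E / np.linalg.norm(E, axis=1, keepdims=True)

# king − man + woman
v = E[w2i["king"]] - E[w2i["man"]] + E[w2i["woman"]]
sims = normed @ (v / np.linalg.norm(v))

# Exclude the query words, as print_summary does with the -2 sentinel
for w in ("king", "man", "woman"):
    sims[w2i[w]] = -2

best = words[int(np.argmax(sims))]
print(best)  # queen
```

The -2 sentinel works because cosine similarity never falls below -1, so a masked word can never win the argmax.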
Entry Point (train.py:642-755)

Ties everything together: argument parsing, corpus loading, co-occurrence construction, training, projection, and output.

def main():
    ap = argparse.ArgumentParser(
        description="GloVe: Global Vectors for Word Representation")
    ap.add_argument("--corpus",
                    default=str(Path(__file__).with_name("corpus.txt")))
    ap.add_argument("--output-dir",
                    default=str(Path(__file__).resolve().parent / "output"),
                    help="Where to write output files (default: ./output/)")
    ap.add_argument("--dim", type=int, default=8,
                    help="Embedding dimension (default: 8)")
    ap.add_argument("--lr", type=float, default=0.05,
                    help="Learning rate (default: 0.05)")
    ap.add_argument("--x-max", type=float, default=100.0,
                    help="x_max for weighting function (default: 100)")
    ap.add_argument("--alpha", type=float, default=0.75,
                    help="Alpha for weighting function (default: 0.75)")
    ap.add_argument("--window", type=int, default=5,
                    help="Context window size (default: 5)")
    ap.add_argument("--symmetric", action="store_true", default=True,
                    help="Symmetric context window (default)")
    ap.add_argument("--no-symmetric", action="store_false", dest="symmetric",
                    help="Asymmetric (left-only) context window")
    ap.add_argument("--epochs", type=int, default=100,
                    help="Number of training epochs (default: 100)")
    ap.add_argument("--min-count", type=int, default=3,
                    help="Minimum word frequency (default: 3)")
    ap.add_argument("--seed", type=int, default=42)
    args = ap.parse_args()

    np.random.seed(args.seed)

    # ── load ──────────────────────────────────────────────────────
    print("Loading corpus …")
    raw, cleaned = load_corpus(args.corpus)
    vocab, w2i, i2w, counts = build_vocab(cleaned, args.min_count)

    print(f" {len(raw)} sentences → {len(cleaned)} after stopword removal")
    print(f" Vocabulary: {len(vocab)} words (min_count={args.min_count})")

    config = {
        "learning_rate": args.lr,
        "x_max": args.x_max,
        "alpha": args.alpha,
        "window_size": args.window,
        "symmetric": args.symmetric,
        "embed_dim": args.dim,
        "epochs": args.epochs,
        "min_count": args.min_count,
        "random_seed": args.seed,
    }

    # ── Phase 1: co-occurrence matrix ──────────────────────
    print(f"\nBuilding co-occurrence matrix "
          f"(window={args.window}, symmetric={args.symmetric}) …")
    X = build_cooccurrence(cleaned, w2i, args.window, args.symmetric)
    cooc_stats = compute_cooccurrence_stats(X, vocab)

    print(f" Matrix size: {X.shape[0]}×{X.shape[1]}")
    print(f" Non-zero entries: {cooc_stats['num_nonzero']} "
          f"(sparsity: {cooc_stats['sparsity']:.2%})")
    if cooc_stats["value_distribution"]:
        d = cooc_stats["value_distribution"]
        print(f" Value range: [{d['min']:.2f}, {d['max']:.2f}], "
              f"mean={d['mean']:.2f}")

    # ── Phase 2: train ─────────────────────────────────────
    print(f"\nTraining (epochs={args.epochs}, dim={args.dim}, lr={args.lr}, "
          f"x_max={args.x_max}, α={args.alpha}) …")
    records, all_losses, final_state = train(
        X, vocab, w2i, i2w, args.dim, args.lr,
        args.x_max, args.alpha, args.epochs, args.seed,
    )

    # ── 2D projections ─────────────────────────────────────
    print(f"\nProjecting to 2D (final_pca, {len(records)} snapshots) ")
    add_projections(records, i2w)

    # ── save ──────────────────────────────────────────────────────
    print(f"\nSaving to {args.output_dir} ")
    save_local(records, all_losses, X, vocab, w2i, counts, config,
               raw, cooc_stats, args.output_dir)

    # ── summary ───────────────────────────────────────────────────
    print("\n" + "=" * 60)
    print("TRAINING SUMMARY")
    print("=" * 60)
    print(f" Total epochs trained : {args.epochs}")
    print(f" Epochs recorded : {len(records)}")
    print(f" Vocab size : {len(vocab)}")
    print(f" Embed dim : {args.dim}")
    if all_losses:
        print(f" Initial loss : {all_losses[0]['loss']:.6f}")
        print(f" Final loss : {all_losses[-1]['loss']:.6f}")
    recon = records[-1]["reconstruction"]
    print(f" Final MSE : {recon['mse']:.6f}")
    print(f" Final R² : {recon['r_squared']:.4f}")

    W = final_state["W"]
    W_tilde = final_state["W_tilde"]
    b = final_state["b"]
    b_tilde = final_state["b_tilde"]
    print_summary(W, W_tilde, b, b_tilde, X, vocab, w2i, i2w)

    print("\nDone ")


if __name__ == "__main__":
    main()

Run It Yourself

Quick Start

Terminal
$ cd github-quantml/glove
$ pip install -r requirements.txt
$ python train.py  # trains in ~8s

Verify Results

Terminal
$ python explore.py
# Prints neighbor & analogy evolution over training epochs

$ python explore.py --interactive
# Try: ice | king - man + woman | cooc ice cold | ratio ice steam

$ python analyze.py --save-json
# Runs 10 analyses, exports to JSON

Experiment

Try different hyperparameters and compare results:

Terminal
# Compare embedding dimensions
$ python train.py --dim 4   # fewer dims → harder separation
$ python train.py --dim 32  # more capacity → better fit

# Compare window sizes
$ python train.py --window 2  # narrow → syntactic
$ python train.py --window 10 # wide → semantic

# Symmetric vs asymmetric
$ python train.py --no-symmetric

Hyperparameters

Argument        Default  Description
--dim           8        Embedding vector dimensionality
--lr            0.05     Initial learning rate (AdaGrad scales per-parameter)
--x-max         100      Cap on co-occurrence count in weighting function
--alpha         0.75     Exponent in f(x) = (x/x_max)^α for x < x_max, else 1
--window        5        Context window radius (symmetric by default)
--epochs        100      Number of passes over the co-occurrence matrix
--min-count     3        Min word frequency to include in vocab
--symmetric     true     Use symmetric context window (both left and right)
--no-symmetric  -        Use asymmetric (left-only) context window
--seed          42       Random seed for reproducibility