GloVe
The Code
Annotated walkthrough of train.py: Phase 1 builds the co-occurrence matrix, Phase 2 optimizes via weighted least squares with AdaGrad. Pure NumPy, no autograd.
GloVe's train.py follows the original two-phase design. Phase 1 builds the co-occurrence matrix from the corpus. Phase 2 optimizes word vectors using weighted least squares with AdaGrad. No neural networks, no autograd. Just NumPy. (750 lines, 16 annotated sections)
From raw text to co-occurrence matrix Xij. Vocabulary, tokenization, and sliding window counts with 1/d distance weighting.
NumPy-only implementation; no neural network framework needed.
```python
import argparse
import json
import os
import re
import time
from collections import Counter
from pathlib import Path

import numpy as np
from sklearn.decomposition import PCA


# ── stopwords ───────────────────────────────────────────────────────
STOPWORDS = frozenset({
```
Lowercasing, punctuation removal, hyphen splitting, stopword filtering.
```python
def load_corpus(filepath: str):
    """Load corpus with preprocessing: lowercase, strip punctuation,
    split hyphens, remove numeric tokens, normalize whitespace."""
    raw, cleaned = [], []
    with open(filepath) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            line = line.lower()
            line = re.sub(r"-", " ", line)
            line = re.sub(r"[^a-z\s]", "", line)
            line = re.sub(r"\s+", " ", line).strip()
            tokens = line.split()
            raw.append(tokens)
            content = [t for t in tokens if t not in STOPWORDS]
            if len(content) >= 2:
                cleaned.append(content)
    return raw, cleaned
```
Alphabetically sorted vocabulary (filtered by min_count) with word-to-index mapping.
```python
def build_vocab(sentences, min_count=3):
    counts = Counter()
    for s in sentences:
        counts.update(s)
    vocab = sorted(w for w, c in counts.items() if c >= min_count)
    w2i = {w: i for i, w in enumerate(vocab)}
    i2w = {i: w for w, i in w2i.items()}
    return vocab, w2i, i2w, counts


# ── co-occurrence matrix ────────────────────────────────────────────
```
Matrix with 1/d distance weighting (symmetric by default, asymmetric via --no-symmetric), the heart of GloVe’s global statistics.
```python
def build_cooccurrence(sentences, w2i, window, symmetric):
    """Build dense co-occurrence matrix with 1/d distance weighting.

    For each word pair within the context window, the contribution
    is 1/distance (matching the GloVe paper). With --symmetric (default),
    both X[i][j] and X[j][i] are incremented. With --no-symmetric,
    only the left context contributes (asymmetric window).
    """
    V = len(w2i)
    X = np.zeros((V, V), dtype=np.float64)

    for sent in sentences:
        idxs = [w2i[w] for w in sent if w in w2i]
        n = len(idxs)
        for i in range(n):
            # one left-window pass visits each pair exactly once;
            # mirroring the increment makes the matrix symmetric
            for j in range(max(0, i - window), i):
                dist = i - j
                weight = 1.0 / dist
                X[idxs[i], idxs[j]] += weight
                if symmetric:
                    X[idxs[j], idxs[i]] += weight

    return X
```
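The 1/d weighting is easy to verify by hand on a toy sentence. The sketch below is standalone (not part of train.py); the three-word sentence, vocabulary, and window are made up for illustration:

```python
import numpy as np

# toy sentence and index map (illustrative only)
sent = ["ice", "is", "cold"]
w2i = {"ice": 0, "is": 1, "cold": 2}
window, V = 2, len(w2i)

X = np.zeros((V, V))
for i in range(len(sent)):
    for j in range(max(0, i - window), i):   # left context only
        w = 1.0 / (i - j)                    # 1/distance weighting
        X[w2i[sent[i]], w2i[sent[j]]] += w
        X[w2i[sent[j]], w2i[sent[i]]] += w   # mirror for symmetric mode

print(X[w2i["ice"], w2i["is"]])    # adjacent pair → 1.0
print(X[w2i["ice"], w2i["cold"]])  # distance-2 pair → 0.5
```

Adjacent words contribute a full count; words two positions apart contribute half, so nearer context dominates the statistics.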
Sparsity analysis, value distribution, and Zipf's law fit.
```python
def compute_cooccurrence_stats(X, vocab):
    """Compute sparsity, value distribution, and Zipf fit for X."""
    nonzero_mask = X > 0
    nonzero_vals = X[nonzero_mask]
    total_entries = X.shape[0] * X.shape[1]
    num_nonzero = int(nonzero_mask.sum())
    sparsity = 1.0 - num_nonzero / total_entries

    distribution = {}
    if len(nonzero_vals) > 0:
        distribution = {
            "min": float(nonzero_vals.min()),
            "max": float(nonzero_vals.max()),
            "mean": float(nonzero_vals.mean()),
            "median": float(np.median(nonzero_vals)),
            "percentiles": {
                str(p): float(np.percentile(nonzero_vals, p))
                for p in [25, 50, 75, 90, 95, 99]
            },
        }

    zipf_data = {}
    if len(nonzero_vals) > 10:
        sorted_vals = np.sort(nonzero_vals)[::-1]
        ranks = np.arange(1, len(sorted_vals) + 1, dtype=np.float64)
        log_ranks = np.log(ranks)
        log_vals = np.log(sorted_vals)
        valid = np.isfinite(log_vals) & np.isfinite(log_ranks)
        if valid.sum() > 2:
            coeffs = np.polyfit(log_ranks[valid], log_vals[valid], 1)
            alpha = -coeffs[0]
            predicted = coeffs[0] * log_ranks[valid] + coeffs[1]
            ss_res = np.sum((log_vals[valid] - predicted) ** 2)
            ss_tot = np.sum((log_vals[valid] - log_vals[valid].mean()) ** 2)
            r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0
            zipf_data = {
                "alpha": float(alpha),
                "r_squared": float(r_squared),
                "intercept": float(coeffs[1]),
            }

    return {
        "total_entries": total_entries,
        "num_nonzero": num_nonzero,
        "sparsity": float(sparsity),
        "value_distribution": distribution,
        "zipf": zipf_data,
    }


# ── weighting function ─────────────────────────────────────────────
```
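The log-log linear fit used for the Zipf exponent can be sanity-checked on synthetic data with a known exponent. This is a standalone sketch (alpha_true and the rank range are made up), using the same np.polyfit recipe as above:

```python
import numpy as np

# exact power law: value = rank^(-alpha_true)
alpha_true = 1.2
ranks = np.arange(1, 1001, dtype=np.float64)
vals = ranks ** (-alpha_true)

# slope of log(value) vs log(rank) recovers -alpha
coeffs = np.polyfit(np.log(ranks), np.log(vals), 1)
alpha_hat = -coeffs[0]
print(round(alpha_hat, 6))  # 1.2
```

On exact power-law data the fit recovers the exponent to floating-point precision; on real co-occurrence counts the r_squared field reports how well the power law actually holds.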
Optimize J = Σ f(Xij)(wi·w̃j + bi + b̃j − log Xij)² with manual gradients and AdaGrad.
f(x) = (x/x_max)^α for x < x_max, else 1. Down-weights very frequent co-occurrences.
```python
def f_weight(x, x_max, alpha):
    """GloVe weighting: (x/x_max)^alpha if x < x_max, else 1."""
    return (x / x_max) ** alpha if x < x_max else 1.0
```
NumPy-vectorized version of f(x) used during pre-computation of weights for all nonzero pairs.
```python
def f_weight_array(x_vals, x_max, alpha):
    """Vectorized weighting for an array of co-occurrence values."""
    weights = np.where(x_vals < x_max, (x_vals / x_max) ** alpha, 1.0)
    return weights
```
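A few sample values show the shape of f(x) at the default x_max=100, α=0.75 (standalone sketch; the probe values are made up):

```python
import numpy as np

def f_weight_array(x_vals, x_max, alpha):
    # (x/x_max)^alpha below the cap, flat 1.0 at or above it
    return np.where(x_vals < x_max, (x_vals / x_max) ** alpha, 1.0)

x = np.array([1.0, 10.0, 100.0, 500.0])
w = f_weight_array(x, x_max=100.0, alpha=0.75)
print(np.round(w, 4))  # ≈ [0.0316, 0.1778, 1.0, 1.0]
```

A pair seen once gets about 3% of the weight of a pair at the cap, while everything at or beyond x_max saturates at 1, so stopword-like co-occurrences cannot dominate the loss.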
Word pairs tracked for cosine similarity across epochs, and analogy tests for evaluation (e.g., king − man + woman = queen).
```python
KEY_PAIRS = [
    ("ice", "solid"), ("steam", "gas"), ("ice", "water"), ("steam", "water"),
    ("ice", "cold"), ("steam", "hot"), ("ice", "freeze"), ("steam", "boil"),
    ("king", "queen"), ("prince", "princess"), ("man", "woman"), ("boy", "girl"),
    ("father", "mother"), ("son", "daughter"), ("brother", "sister"),
    ("husband", "wife"), ("cat", "dog"), ("lion", "wolf"),
]

ANALOGY_TESTS = [
    ("king", "man", "woman", "queen"),
    ("queen", "woman", "man", "king"),
    ("ice", "solid", "gas", "steam"),
    ("steam", "gas", "solid", "ice"),
    ("prince", "boy", "girl", "princess"),
    ("father", "man", "woman", "mother"),
]
```
Manual gradient computation, AdaGrad adaptive learning, gradient clipping. The core algorithm.
```python
def train(X, vocab, w2i, i2w, dim, lr, x_max, alpha, epochs, seed):
    """GloVe weighted least squares with manual AdaGrad.

    Following the Stanford C implementation:
    - W, W_tilde initialized uniform(-0.5/d, 0.5/d)
    - b, b_tilde initialized uniform(-0.5/d, 0.5/d)
    - AdaGrad accumulators initialized to 1.0 (not 0)
    - Gradient clipping at [-100, 100]
    """
    rng = np.random.RandomState(seed)
    V = len(vocab)

    W = rng.uniform(-0.5 / dim, 0.5 / dim, (V, dim))
    W_tilde = rng.uniform(-0.5 / dim, 0.5 / dim, (V, dim))
    b = rng.uniform(-0.5 / dim, 0.5 / dim, V)
    b_tilde = rng.uniform(-0.5 / dim, 0.5 / dim, V)

    G_W = np.ones((V, dim))
    G_Wt = np.ones((V, dim))
    G_b = np.ones(V)
    G_bt = np.ones(V)

    nonzero_pairs = []
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                nonzero_pairs.append((i, j, X[i, j]))

    log_X = np.zeros_like(X)
    nz_mask = X > 0
    log_X[nz_mask] = np.log(X[nz_mask])

    f_weights_arr = np.array([f_weight(x, x_max, alpha)
                              for _, _, x in nonzero_pairs])

    print(f"  Non-zero pairs: {len(nonzero_pairs)}")
    print(f"  Parameters: {2 * V * dim + 2 * V} "
          f"({V}×{dim}×2 vectors + {V}×2 biases)")

    records = []
    all_losses = []
    t0 = time.time()

    rec = capture_epoch(0, W, W_tilde, b, b_tilde, 0.0, X, log_X,
                        nonzero_pairs, w2i, i2w, vocab, f_weights_arr)
    records.append(rec)

    for epoch in range(1, epochs + 1):
        indices = rng.permutation(len(nonzero_pairs))
        total_cost = 0.0
        total_weight = 0.0

        for idx in indices:
            i, j, x_ij = nonzero_pairs[idx]

            diff = np.dot(W[i], W_tilde[j]) + b[i] + b_tilde[j] - log_X[i, j]
            fw = f_weights_arr[idx]
            fdiff = fw * diff

            total_cost += fw * diff * diff
            total_weight += 1.0

            grad_w = fdiff * W_tilde[j]
            grad_wt = fdiff * W[i]
            grad_b = fdiff
            grad_bt = fdiff

            grad_w = np.clip(grad_w, -100, 100)
            grad_wt = np.clip(grad_wt, -100, 100)
            grad_b = np.clip(grad_b, -100, 100)
            grad_bt = np.clip(grad_bt, -100, 100)

            G_W[i] += grad_w ** 2
            W[i] -= lr * grad_w / np.sqrt(G_W[i])

            G_Wt[j] += grad_wt ** 2
            W_tilde[j] -= lr * grad_wt / np.sqrt(G_Wt[j])

            G_b[i] += grad_b ** 2
            b[i] -= lr * grad_b / np.sqrt(G_b[i])

            G_bt[j] += grad_bt ** 2
            b_tilde[j] -= lr * grad_bt / np.sqrt(G_bt[j])

        avg_loss = total_cost / total_weight if total_weight > 0 else 0.0
        all_losses.append({"epoch": epoch, "loss": round(float(avg_loss), 8)})

        if should_record(epoch, epochs):
            rec = capture_epoch(epoch, W, W_tilde, b, b_tilde, avg_loss,
                                X, log_X, nonzero_pairs, w2i, i2w, vocab,
                                f_weights_arr)
            records.append(rec)

        if epoch % 10 == 0 or epoch == epochs:
            elapsed = time.time() - t0
            print(f"  epoch {epoch:>4}/{epochs}  "
                  f"loss={avg_loss:.6f}  [{elapsed:.1f}s]  "
                  f"({len(records)} recorded)")

    final_state = {
        "W": W.copy(), "W_tilde": W_tilde.copy(),
        "b": b.copy(), "b_tilde": b_tilde.copy(),
        "G_W": G_W.copy(), "G_Wt": G_Wt.copy(),
        "G_b": G_b.copy(), "G_bt": G_bt.copy(),
    }

    return records, all_losses, final_state


# ── projections ─────────────────────────────────────────────────────
```
Every epoch for first 10, then every 5th, plus the final epoch; captures early chaos, convergence, and the final state.
```python
def should_record(epoch, total_epochs):
    if epoch <= 10:
        return True
    if epoch % 5 == 0:
        return True
    if epoch == total_epochs:
        return True
    return False
```
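For the default 100-epoch run, this cadence works out to 28 recorded epochs, on top of the epoch-0 snapshot that train() captures before the loop starts. A standalone sketch of the schedule:

```python
def should_record(epoch, total_epochs):
    # every epoch through 10, every 5th after, always the last
    return epoch <= 10 or epoch % 5 == 0 or epoch == total_epochs

recorded = [e for e in range(1, 101) if should_record(e, 100)]
print(len(recorded))   # 28
print(recorded[:13])   # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 15, 20, 25]
```

Dense sampling early captures the rapid initial reorganization of the vectors; sparse sampling later keeps the output size manageable once training has settled.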
Saves W, W̃, biases, loss, similarities, reconstruction metrics at each recorded epoch.
```python
def capture_epoch(epoch, W, W_tilde, b, b_tilde, avg_loss, X, log_X,
                  nonzero_pairs, w2i, i2w, vocab, f_weights):
    """Capture full state for a single recorded epoch."""
    V, d = W.shape
    combined = W + W_tilde

    cosines = {}
    for w1, w2 in KEY_PAIRS:
        if w1 in w2i and w2 in w2i:
            v1 = combined[w2i[w1]]
            v2 = combined[w2i[w2]]
            n1, n2 = np.linalg.norm(v1), np.linalg.norm(v2)
            cos = float(np.dot(v1, v2) / (n1 * n2)) if n1 > 0 and n2 > 0 else 0.0
            cosines[f"{w1}_{w2}"] = round(cos, 6)

    errors = []
    for idx, (i, j, x_ij) in enumerate(nonzero_pairs):
        pred = float(np.dot(W[i], W_tilde[j]) + b[i] + b_tilde[j])
        target = float(log_X[i, j])
        err = abs(pred - target)
        errors.append({
            "i": int(i), "j": int(j),
            "word_i": i2w[i], "word_j": i2w[j],
            "x_ij": float(x_ij),
            "log_x": round(target, 6),
            "pred": round(pred, 6),
            "error": round(err, 6),
            "f_weight": round(float(f_weights[idx]), 6),
        })

    errors_sorted = sorted(errors, key=lambda e: e["error"], reverse=True)
    top_loss_pairs = errors_sorted[:10]

    overall_mse = np.mean([e["error"] ** 2 for e in errors])
    overall_mae = np.mean([e["error"] for e in errors])
    targets = np.array([e["log_x"] for e in errors])
    preds = np.array([e["pred"] for e in errors])
    ss_res = np.sum((targets - preds) ** 2)
    ss_tot = np.sum((targets - targets.mean()) ** 2)
    r_squared = 1.0 - ss_res / ss_tot if ss_tot > 0 else 0.0

    return {
        "epoch": epoch,
        "avg_loss": round(float(avg_loss), 8),
        "W": W.tolist(),
        "W_tilde": W_tilde.tolist(),
        "b": b.tolist(),
        "b_tilde": b_tilde.tolist(),
        "cosine_similarities": cosines,
        "top_loss_pairs": top_loss_pairs,
        "reconstruction": {
            "mse": round(float(overall_mse), 8),
            "mae": round(float(overall_mae), 8),
            "r_squared": round(float(r_squared), 6),
            "num_pairs": len(errors),
        },
        "all_errors": errors,
    }


# ── training loop ──────────────────────────────────────────────────
```
For the per-pair cost J = f(Xij)·diff², where diff = wi·w̃j + bi + b̃j − log Xij, the code computes:
diff = wiᵀw̃j + bi + b̃j − log(Xij)
fdiff = f(Xij) · diff
∂J/∂wi = fdiff · w̃j
∂J/∂w̃j = fdiff · wi
∂J/∂bi = fdiff
∂J/∂b̃j = fdiff
The factor of 2 from d/dx(x²) is absorbed into the learning rate, following the Stanford C implementation. Gradients are clipped to [−100, 100] before the AdaGrad update.
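One way to confirm these formulas are exact (up to the absorbed factor of 2) is a finite-difference check on a single random pair. This is a standalone sketch, not part of train.py; the dimensions and scalar values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
w_i, wt_j = rng.normal(size=d), rng.normal(size=d)
b_i, bt_j = 0.1, -0.2
log_x, fw = 1.5, 0.6   # stand-ins for log X_ij and f(X_ij)

def cost(w):
    diff = w @ wt_j + b_i + bt_j - log_x
    return fw * diff * diff

# analytic gradient w.r.t. w_i, factor of 2 kept explicit
diff = w_i @ wt_j + b_i + bt_j - log_x
grad_analytic = 2.0 * fw * diff * wt_j

# central finite differences along each coordinate
eps = 1e-6
grad_num = np.array([
    (cost(w_i + eps * e) - cost(w_i - eps * e)) / (2 * eps)
    for e in np.eye(d)
])
assert np.allclose(grad_analytic, grad_num, atol=1e-4)
```

The numerical gradient matches 2·fdiff·w̃j, which is exactly the code's fdiff·w̃j after the 2 is folded into the learning rate.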
PCA projections for visualization, save embeddings and epoch snapshots to JSON, and print training summary.
2D PCA projection of W+W̃ for visualization.
```python
def add_projections(records, i2w):
    """Add 2D PCA projections of W+W_tilde to each recorded epoch."""
    V = len(i2w)
    pca = PCA(n_components=2, random_state=42)

    last_combined = np.array(records[-1]["W"]) + np.array(records[-1]["W_tilde"])
    pca.fit(last_combined)
    var = pca.explained_variance_ratio_.tolist()

    combined_snapshots = []
    for rec in records:
        combined = np.array(rec["W"]) + np.array(rec["W_tilde"])
        combined_snapshots.append(combined)

    for i, rec in enumerate(records):
        proj = pca.transform(combined_snapshots[i])
        rec["embeddings_2d"] = {i2w[j]: proj[j].tolist() for j in range(V)}
        rec["pca_variance_explained"] = var


# ── storage ─────────────────────────────────────────────────────────
```
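The design choice here is worth noting: PCA is fitted once on the final epoch's W+W̃, and that single transform is applied to every earlier snapshot, so all epochs share one coordinate system and points move smoothly rather than the axes re-rotating each frame. A minimal standalone illustration (random data; the shapes and number of snapshots are made up):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
snapshots = [rng.normal(size=(50, 8)) for _ in range(3)]  # fake epoch states

pca = PCA(n_components=2, random_state=42)
pca.fit(snapshots[-1])                         # fit on the final snapshot only
projs = [pca.transform(s) for s in snapshots]  # same axes for every epoch
print([p.shape for p in projs])  # [(50, 2), (50, 2), (50, 2)]
```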
Full JSON output: metadata, loss curve, epoch snapshots, co-occurrence matrix.
```python
def save_local(records, all_losses, X, vocab, w2i, counts, config,
               raw_sentences, cooc_stats, out_dir):
    """Write all training data as structured local files."""
    steps_dir = os.path.join(out_dir, "steps")
    os.makedirs(steps_dir, exist_ok=True)

    V = len(vocab)
    nz_mask = X > 0
    log_X = np.zeros_like(X)
    log_X[nz_mask] = np.log(X[nz_mask])
    f_X = np.zeros_like(X)
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:
                f_X[i, j] = f_weight(X[i, j], config["x_max"], config["alpha"])

    # ── run_metadata.json ────────────────────────────────────
    meta = {
        "topic": "glove",
        "corpus_name": "curated_v1",
        "vocab": vocab,
        "vocab_size": V,
        "embed_dim": config["embed_dim"],
        "config": config,
        "total_epochs": config["epochs"],
        "recorded_epochs": len(records),
        "word_frequencies": {w: counts[w] for w in vocab},
        "raw_sentences": [" ".join(s) for s in raw_sentences],
        "cooccurrence_stats": cooc_stats,
    }
    _write(os.path.join(out_dir, "run_metadata.json"), meta, indent=2)

    # ── cooccurrence_matrix.json ─────────────────────────────
    cooc_data = {
        "matrix": X.tolist(),
        "log_matrix": [[round(float(log_X[i, j]), 6) if X[i, j] > 0 else None
                        for j in range(V)] for i in range(V)],
        "weights": [[round(float(f_X[i, j]), 6) if X[i, j] > 0 else None
                     for j in range(V)] for i in range(V)],
        "vocab": vocab,
        "stats": cooc_stats,
    }
    _write(os.path.join(out_dir, "cooccurrence_matrix.json"), cooc_data)

    # ── loss_curve.json ──────────────────────────────────────
    _write(os.path.join(out_dir, "loss_curve.json"), all_losses)

    # ── step_index.json ──────────────────────────────────────
    step_index = [rec["epoch"] for rec in records]
    _write(os.path.join(out_dir, "step_index.json"), step_index)

    # ── embeddings_timeline.json ─────────────────────────────
    timeline = []
    for rec in records:
        entry = {
            "epoch": rec["epoch"],
            "positions": rec.get("embeddings_2d", {}),
            "loss": rec["avg_loss"],
        }
        if rec.get("pca_variance_explained"):
            entry["pca_variance_explained"] = rec["pca_variance_explained"]
        timeline.append(entry)
    _write(os.path.join(out_dir, "embeddings_timeline.json"), timeline)

    # ── steps/{N}.json ───────────────────────────────────────
    total_step_bytes = 0
    for rec in records:
        step_path = os.path.join(steps_dir, f"{rec['epoch']}.json")
        step_data = {
            "epoch": rec["epoch"],
            "avg_loss": rec["avg_loss"],
            "W": rec["W"],
            "W_tilde": rec["W_tilde"],
            "b": rec["b"],
            "b_tilde": rec["b_tilde"],
            "cosine_similarities": rec["cosine_similarities"],
            "top_loss_pairs": rec["top_loss_pairs"],
            "reconstruction": rec["reconstruction"],
            "all_errors": rec["all_errors"],
            "embeddings_2d": rec.get("embeddings_2d"),
            "pca_variance_explained": rec.get("pca_variance_explained"),
        }
        _write(step_path, step_data)
        total_step_bytes += os.path.getsize(step_path)

    # ── report ───────────────────────────────────────────────
    sizes = {}
    for name in ["run_metadata.json", "cooccurrence_matrix.json",
                 "loss_curve.json", "step_index.json",
                 "embeddings_timeline.json"]:
        p = os.path.join(out_dir, name)
        sizes[name] = os.path.getsize(p)

    print(f"\n  Output directory: {out_dir}/")
    for name, sz in sizes.items():
        print(f"    {name:<35} {sz / 1024:>8.1f} KB")
    print(f"    steps/ ({len(records)} files){' ' * 19}"
          f"{total_step_bytes / 1e6:>7.1f} MB")
    print(f"    {'TOTAL':<35} "
          f"{(sum(sizes.values()) + total_step_bytes) / 1e6:>7.1f} MB")
```
Helper to serialize Python objects to JSON files.
```python
def _write(path, data, indent=None):
    with open(path, "w") as f:
        json.dump(data, f, indent=indent)
```
Nearest neighbors, analogies, Table 1 ratio verification, bias-frequency correlation.
```python
def print_summary(W, W_tilde, b, b_tilde, X, vocab, w2i, i2w):
    """Print nearest neighbors, analogies, and ratio verification."""
    combined = W + W_tilde
    V = len(vocab)

    norms = np.linalg.norm(combined, axis=1, keepdims=True)
    norms = np.where(norms > 0, norms, 1.0)
    normed = combined / norms
    sim_matrix = normed @ normed.T

    print("\n  Nearest neighbors (cosine, W+W̃):")
    probe_words = [
        "ice", "steam", "water", "solid", "gas", "cold", "hot",
        "king", "queen", "man", "woman", "cat", "dog", "lion",
    ]
    for word in probe_words:
        if word not in w2i:
            continue
        idx = w2i[word]
        sims = sim_matrix[idx].copy()
        sims[idx] = -2
        top = np.argsort(sims)[::-1][:5]
        nbrs = "  ".join(f"{i2w[j]}({sims[j]:.2f})" for j in top)
        print(f"    {word:>10} → {nbrs}")

    print("\n  Analogies (a − b + c = ?):")
    for wa, wb, wc, expect in ANALOGY_TESTS:
        if not all(w in w2i for w in (wa, wb, wc)):
            continue
        v = combined[w2i[wa]] - combined[w2i[wb]] + combined[w2i[wc]]
        n = np.linalg.norm(v)
        if n > 0:
            v_normed = v / n
        else:
            v_normed = v
        sims = normed @ v_normed
        for w in (wa, wb, wc):
            sims[w2i[w]] = -2
        top3 = np.argsort(sims)[::-1][:3]
        results = "  ".join(f"{i2w[j]}({sims[j]:.2f})" for j in top3)
        mark = " ✓" if i2w[top3[0]] == expect else ""
        print(f"    {wa} − {wb} + {wc} = {results}{mark}")

    # ── ratio verification (Table 1) ────────────────────────
    print("\n  Ratio verification (P(k|ice) / P(k|steam)):")
    if "ice" in w2i and "steam" in w2i:
        i_ice = w2i["ice"]
        i_steam = w2i["steam"]
        x_ice = X[i_ice]
        x_steam = X[i_steam]
        p_ice = x_ice / x_ice.sum() if x_ice.sum() > 0 else x_ice
        p_steam = x_steam / x_steam.sum() if x_steam.sum() > 0 else x_steam

        probe_k = ["solid", "gas", "water", "liquid", "cold", "hot",
                   "freeze", "boil", "king", "cat"]
        for k in probe_k:
            if k not in w2i:
                continue
            j = w2i[k]
            p_k_ice = p_ice[j]
            p_k_steam = p_steam[j]
            if p_k_steam > 0 and p_k_ice > 0:
                ratio = p_k_ice / p_k_steam
                print(f"    k={k:>8}  P(k|ice)={p_k_ice:.4f}  "
                      f"P(k|steam)={p_k_steam:.4f}  ratio={ratio:.3f}")
            elif p_k_ice > 0:
                print(f"    k={k:>8}  P(k|ice)={p_k_ice:.4f}  "
                      f"P(k|steam)=0  ratio=∞")
            elif p_k_steam > 0:
                print(f"    k={k:>8}  P(k|ice)=0  "
                      f"P(k|steam)={p_k_steam:.4f}  ratio=0")

    # ── bias-frequency correlation ──────────────────────────
    print("\n  Bias-frequency correlation:")
    x_row = np.array([X[w2i[w]].sum() for w in vocab])
    b_arr = np.asarray(b, dtype=np.float64)
    bt_arr = np.asarray(b_tilde, dtype=np.float64)
    valid_idx = np.where(x_row > 0)[0]
    if len(valid_idx) > 2:
        log_freq = np.log(x_row[valid_idx])
        b_vals = b_arr[valid_idx]
        bt_vals = bt_arr[valid_idx]
        corr_b = float(np.corrcoef(b_vals, log_freq)[0, 1])
        corr_bt = float(np.corrcoef(bt_vals, log_freq)[0, 1])
        print(f"    corr(b,  log Σ_j X_ij) = {corr_b:.4f}")
        print(f"    corr(b̃, log Σ_j X_ij) = {corr_bt:.4f}")


# ── main ────────────────────────────────────────────────────────────
```
Ties everything together: argument parsing, corpus loading, co-occurrence construction, training, projection, and output.
```python
def main():
    ap = argparse.ArgumentParser(
        description="GloVe: Global Vectors for Word Representation")
    ap.add_argument("--corpus",
                    default=str(Path(__file__).with_name("corpus.txt")))
    ap.add_argument("--output-dir",
                    default=str(Path(__file__).resolve().parent / "output"),
                    help="Where to write output files (default: ./output/)")
    ap.add_argument("--dim", type=int, default=8,
                    help="Embedding dimension (default: 8)")
    ap.add_argument("--lr", type=float, default=0.05,
                    help="Learning rate (default: 0.05)")
    ap.add_argument("--x-max", type=float, default=100.0,
                    help="x_max for weighting function (default: 100)")
    ap.add_argument("--alpha", type=float, default=0.75,
                    help="Alpha for weighting function (default: 0.75)")
    ap.add_argument("--window", type=int, default=5,
                    help="Context window size (default: 5)")
    ap.add_argument("--symmetric", action="store_true", default=True,
                    help="Symmetric context window (default)")
    ap.add_argument("--no-symmetric", action="store_false", dest="symmetric",
                    help="Asymmetric (left-only) context window")
    ap.add_argument("--epochs", type=int, default=100,
                    help="Number of training epochs (default: 100)")
    ap.add_argument("--min-count", type=int, default=3,
                    help="Minimum word frequency (default: 3)")
    ap.add_argument("--seed", type=int, default=42)
    args = ap.parse_args()

    np.random.seed(args.seed)

    # ── load ──────────────────────────────────────────────────────
    print("Loading corpus …")
    raw, cleaned = load_corpus(args.corpus)
    vocab, w2i, i2w, counts = build_vocab(cleaned, args.min_count)

    print(f"  {len(raw)} sentences → {len(cleaned)} after stopword removal")
    print(f"  Vocabulary: {len(vocab)} words (min_count={args.min_count})")

    config = {
        "learning_rate": args.lr,
        "x_max": args.x_max,
        "alpha": args.alpha,
        "window_size": args.window,
        "symmetric": args.symmetric,
        "embed_dim": args.dim,
        "epochs": args.epochs,
        "min_count": args.min_count,
        "random_seed": args.seed,
    }

    # ── Phase 1: co-occurrence matrix ──────────────────────
    print(f"\nBuilding co-occurrence matrix "
          f"(window={args.window}, symmetric={args.symmetric}) …")
    X = build_cooccurrence(cleaned, w2i, args.window, args.symmetric)
    cooc_stats = compute_cooccurrence_stats(X, vocab)

    print(f"  Matrix size: {X.shape[0]}×{X.shape[1]}")
    print(f"  Non-zero entries: {cooc_stats['num_nonzero']} "
          f"(sparsity: {cooc_stats['sparsity']:.2%})")
    if cooc_stats["value_distribution"]:
        d = cooc_stats["value_distribution"]
        print(f"  Value range: [{d['min']:.2f}, {d['max']:.2f}], "
              f"mean={d['mean']:.2f}")

    # ── Phase 2: train ─────────────────────────────────────
    print(f"\nTraining (epochs={args.epochs}, dim={args.dim}, lr={args.lr}, "
          f"x_max={args.x_max}, α={args.alpha}) …")
    records, all_losses, final_state = train(
        X, vocab, w2i, i2w, args.dim, args.lr,
        args.x_max, args.alpha, args.epochs, args.seed,
    )

    # ── 2D projections ─────────────────────────────────────
    print(f"\nProjecting to 2D (final_pca, {len(records)} snapshots) …")
    add_projections(records, i2w)

    # ── save ──────────────────────────────────────────────────────
    print(f"\nSaving to {args.output_dir} …")
    save_local(records, all_losses, X, vocab, w2i, counts, config,
               raw, cooc_stats, args.output_dir)

    # ── summary ───────────────────────────────────────────────────
    print("\n" + "=" * 60)
    print("TRAINING SUMMARY")
    print("=" * 60)
    print(f"  Total epochs trained : {args.epochs}")
    print(f"  Epochs recorded      : {len(records)}")
    print(f"  Vocab size           : {len(vocab)}")
    print(f"  Embed dim            : {args.dim}")
    if all_losses:
        print(f"  Initial loss         : {all_losses[0]['loss']:.6f}")
        print(f"  Final loss           : {all_losses[-1]['loss']:.6f}")
    recon = records[-1]["reconstruction"]
    print(f"  Final MSE            : {recon['mse']:.6f}")
    print(f"  Final R²             : {recon['r_squared']:.4f}")

    W = final_state["W"]
    W_tilde = final_state["W_tilde"]
    b = final_state["b"]
    b_tilde = final_state["b_tilde"]
    print_summary(W, W_tilde, b, b_tilde, X, vocab, w2i, i2w)

    print("\nDone ✓")


if __name__ == "__main__":
    main()
```
```shell
$ cd github-quantml/glove
$ pip install -r requirements.txt
$ python train.py                  # trains in ~8s
```
```shell
$ python explore.py                # prints neighbor & analogy evolution over training epochs
$ python explore.py --interactive  # try: ice | king - man + woman | cooc ice cold | ratio ice steam
$ python analyze.py --save-json    # runs 10 analyses, exports to JSON
```
Try different hyperparameters and compare results:
```shell
# Compare embedding dimensions
$ python train.py --dim 4          # fewer dims → harder separation
$ python train.py --dim 32         # more capacity → better fit

# Compare window sizes
$ python train.py --window 2       # narrow → syntactic
$ python train.py --window 10      # wide → semantic

# Symmetric vs asymmetric
$ python train.py --no-symmetric
```
| Argument | Default | Description |
|---|---|---|
| --dim | 8 | Embedding vector dimensionality |
| --lr | 0.05 | Initial learning rate (AdaGrad scales per-parameter) |
| --x-max | 100 | Cap on co-occurrence count in weighting function |
| --alpha | 0.75 | Exponent in f(x) = (x/x_max)^α for x < x_max, else 1 |
| --window | 5 | Context window radius (symmetric by default) |
| --epochs | 100 | Number of passes over the co-occurrence matrix |
| --min-count | 3 | Min word frequency to include in vocab |
| --symmetric | true | Use symmetric context window (both left and right) |
| --no-symmetric | - | Use asymmetric (left-only) context window |
| --seed | 42 | Random seed for reproducibility |