PythonNumPy + PyTorch

RNN

The Code

A line-by-line walkthrough of train.py, from character encoding through the forward recurrence, manual BPTT, gradient clipping, and text generation. Every line annotated.

Overview

The CharRNN training script implements a vanilla recurrent neural network from scratch using NumPy for the core computation and PyTorch parameter management. Every line of BPTT is written by hand; no autograd. (21 annotated sections)

Phase 0: Corpus GenerationPhase 1: Data PipelinePhase 2: The ModelPhase 3: Training LoopPhase 4: Analysis & Generation
7,256 params18,710 chars
Phase 0: Corpus Generation

1 section

0

Generating the Corpus

Building the character-level training data from a generative grammar
25WORD_CATEGORIES = {
26 "NOUN_HUMAN": ["boy", "girl", "man", "woman", "king", "queen"],
27 "NOUN_ANIMAL": ["cat", "dog", "lion", "bird", "fish", "mouse"],
28 "NOUN_FOOD": ["bread", "cake", "apple", "milk", "cookie"],
29 "NOUN_OBJECT": ["book", "rock", "glass", "plate", "ball"],
30 "VERB_INTRAN": ["sleep", "walk", "think", "swim"],
31 "VERB_TRAN": ["chase", "eat", "see", "break", "read", "move"],
32 "VERB_PERCEPT":["smell", "hear", "watch"],
33}
34
Annotations
WORD_CATEGORIES = {7 semantic categories with 38 words. The grammar enforces that nouns appear in noun slots and verbs in verb slots; the RNN must discover this structure from characters alone.
"NOUN_HUMAN":The distributional hypothesis: 'boy', 'girl', 'man', 'woman', 'king', 'queen' all appear in the same grammatical positions. The RNN will cluster them together without ever being told they're related.
"VERB_PERCEPT":["smell", "hear", "watch"],Perception verbs form a sub-category that takes animate objects (NOUN_HUMAN or NOUN_ANIMAL), unlike VERB_TRAN which can take NOUN_OBJECT too. Elman's Figure 7 shows the RNN discovers this fine-grained structure.
50TEMPLATES = [
51 # 1. NOUN_HUMAN VERB_INTRAN
52 [("DET+CAT", "NOUN_HUMAN"), ("CAT", "VERB_INTRAN")],
53 # 2. NOUN_HUMAN VERB_TRAN NOUN_OBJECT
54 [("DET+CAT", "NOUN_HUMAN"), ("CAT", "VERB_TRAN"), ("DET+CAT", "NOUN_OBJECT")],
55 # ...16 templates total...
56]
57
58def generate_sentence(template: list) -> str:
59 """Fill one template with random word choices."""
60 tokens = []
61 for slot in template:
62 if slot[0] == "DET+CAT":
63 word = random.choice(WORD_CATEGORIES[slot[1]])
64 tokens.append(pick_det())
65 tokens.append(word)
66 elif slot[0] == "CAT":
67 word = random.choice(WORD_CATEGORIES[slot[1]])
68 tokens.append(word)
69 return " ".join(tokens)
70
Annotations
TEMPLATES = [16 sentence templates encode the grammar rules. Templates define WHICH slot types appear together: a NOUN_HUMAN can be followed by VERB_INTRAN (intransitive) or VERB_TRAN+NOUN_OBJECT (transitive). The RNN must infer these rules from character sequences.
random.choice(WORD_CATEGORIES[slot[1]])Each word slot independently samples from its category. The RNN sees the output ('the boy sleeps the cat chases a bird...') without any template markers, just a continuous character stream.
return " ".join(tokens)Sentences are joined with spaces into one continuous stream. No sentence boundaries, no newlines. This is exactly Elman (1990)'s design: the model must discover that spaces mark word boundaries.
119def generate_corpus(num_sentences: int, seed: int = 42):
120 random.seed(seed)
121 np.random.seed(seed)
122 sentences = []
123 for _ in range(num_sentences):
124 tidx = int(np.random.choice(len(TEMPLATES), p=normalized_weights))
125 sentence = generate_sentence(TEMPLATES[tidx])
126 sentences.append(sentence)
127 return sentences, template_counts
128
129# Usage:
130# python generate_corpus.py --num-sentences 600 --seed 42
131corpus_text = " ".join(sentences)
132
Annotations
seed: int = 42Fixed seed ensures every learner gets the identical corpus. Reproducibility is non-negotiable in experimental science; 'the RNN discovers categories' is only meaningful if every run of the experiment gets the same result.
corpus_text = " ".join(sentences)600 sentences joined into a continuous character stream of ~18,000 characters. This becomes the entire training set: load, encode, train, repeat.
np.random.choice(len(TEMPLATES), p=normalized_weights)Weighted template sampling: human-centric templates get 2x probability. This matches Elman's corpus design where human-subject sentences are more common, making NOUN_HUMAN the best-separated cluster.
Phase 1: Data Pipeline

4 sections

01

Loading the Corpus

Reading raw characters
train.py:157166
157def load_corpus(path: str) -> str:
158 """Read corpus file, lowercase, strip extra whitespace."""
159 text = Path(path).read_text(encoding="utf-8")
160 text = text.lower().strip()
161 # Collapse multiple spaces to single space
162 import re
163 text = re.sub(r" +", " ", text)
164 return text
165
166
167
Annotations
with open(path) as f:Read the entire text file as a single string, no sentence boundaries preserved
return f.read()Returns the raw character stream. No preprocessing; the model learns punctuation and structure.
02

Building the Vocabulary

Characters to indices
train.py:167174
167def build_vocab(text: str) -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
168 """Extract unique characters, sort, create char↔index mappings."""
169 chars = sorted(set(text))
170 char2idx = {c: i for i, c in enumerate(chars)}
171 idx2char = {i: c for i, c in enumerate(chars)}
172 return chars, char2idx, idx2char
173
174
175
Annotations
sorted(set(text))Vocabulary = unique characters. Sorted for reproducibility.
char2idxMaps character to integer index (0..V-1) for one-hot encoding
idx2charReverse map: integer index back to character for generation
03

Encoding and Chunking

Preparing training sequences
train.py:175179
175def encode(text: str, char2idx: Dict[str, int]) -> List[int]:
176 """Map each character to its index."""
177 return [char2idx[c] for c in text]
178
179
180
Annotations
char2idx[c]Convert each character to its integer index. Unknown chars are skipped.
04

Sequence Iterator

Sliding window with carry-forward
train.py:180194
180def split_data(data: List[int], val_fraction: float = 0.1) -> Tuple[List[int], List[int]]:
181 """
182 Split encoded corpus into train/val.
183
184 Uses the last val_fraction of the corpus for validation, matching the
185 temporal ordering of the data (avoids data leakage from future context).
186 """
187 split_idx = int(len(data) * (1.0 - val_fraction))
188 return data[:split_idx], data[split_idx:]
189
190
191# ---------------------------------------------------------------------------
192# Activation functions
193# ---------------------------------------------------------------------------
194
195
Annotations
val_fractionReserve 10% of corpus for validation loss; monitors overfitting without a separate corpus
int(len(data) * (1 - val_fraction))90% training / 10% validation split at a character boundary
Phase 2: The Model

7 sections

05

The CharRNN Class

Five raw parameter tensors
70class CharRNN:
71 """Vanilla RNN for character-level language modeling.
72
73 All parameters are raw nn.Parameter tensors. No nn.Module subclass,
74 keeping it completely manual so every operation is visible in the code
75 walkthrough.
76
77 Architecture:
78 x_t (V,1) --[W_xh]--> a_t = W_xh @ x_t + W_hh @ h_{t-1} + b_h
79 h_t = activation(a_t) H-dimensional hidden state
80 o_t = W_hy @ h_t + b_y output logits (V,1)
81 p_t = softmax(o_t) probability over next character
82 """
83
84 def __init__(self, vocab_size: int, hidden_size: int):
85 self.vocab_size = vocab_size # V ≈ 27 (characters a-z + space)
86 self.hidden_size = hidden_size # H = 64 (default)
87
88 # Input-to-hidden: (H, V)
89 self.W_xh = nn.Parameter(torch.empty(hidden_size, vocab_size))
90 # Hidden-to-hidden recurrent: (H, H)
91 self.W_hh = nn.Parameter(torch.empty(hidden_size, hidden_size))
92 # Hidden-to-output: (V, H)
93 self.W_hy = nn.Parameter(torch.empty(vocab_size, hidden_size))
94 # Biases: initialized to zeros in constructor
95 self.b_h = nn.Parameter(torch.zeros(hidden_size, 1))
96 self.b_y = nn.Parameter(torch.zeros(vocab_size, 1))
97
98 def parameters(self) -> List[nn.Parameter]:
99 return [self.W_xh, self.W_hh, self.W_hy, self.b_h, self.b_y]
100
101 def count_parameters(self) -> int:
102 return sum(p.numel() for p in self.parameters())
103
104
105# ---------------------------------------------------------------------------
106# Initialization
107# ---------------------------------------------------------------------------
108
109
Annotations
nn.Parameter(torch.empty(nn.Parameter is used ONLY for weight storage and init; all math uses .data.numpy()
W_hhHidden-to-hidden recurrent weight: the core memory mechanism (H×H)
W_xhInput-to-hidden: maps one-hot character to hidden activation (H×V)
W_hyHidden-to-output: maps hidden state to next-char logits (V×H)
b_hHidden bias, initialized to zeros, not learned orthogonally
06

Parameter Initialization

Xavier, orthogonal, and zeros
109def init_parameters(model: CharRNN, method: str = "orthogonal", seed: int = 42) -> None:
110 """
111 Initialize model parameters.
112
113 Args:
114 method: W_hh initialization strategy.
115 "orthogonal" (default): spectral radius = 1.0, optimal for gradient flow
116 "xavier" : Xavier uniform for W_hh (not RNN-specific)
117 "random" : N(0, 0.01) random, worst case for gradient flow
118 seed: torch manual seed
119
120 Why orthogonal for W_hh:
121 An orthogonal matrix has |eigenvalue| = 1 for all eigenvalues, so the
122 Jacobian product in BPTT neither grows nor shrinks. This places the
123 network at the optimal spectral radius for gradient flow, the
124 RNN-specific initialization insight that Xavier alone does not provide.
125 """
126 torch.manual_seed(seed)
127 V, H = model.vocab_size, model.hidden_size
128
129 # Xavier uniform for input projection: keeps variance stable across layers
130 bound_xh = math.sqrt(6.0 / (V + H))
131 nn.init.uniform_(model.W_xh, -bound_xh, bound_xh)
132
133 # W_hh initialization (method-dependent for ablation experiments)
134 if method == "orthogonal":
135 nn.init.orthogonal_(model.W_hh)
136 elif method == "xavier":
137 bound_hh = math.sqrt(6.0 / (H + H))
138 nn.init.uniform_(model.W_hh, -bound_hh, bound_hh)
139 elif method == "random":
140 nn.init.normal_(model.W_hh, mean=0.0, std=0.01)
141 else:
142 raise ValueError(f"Unknown init method: {method}. Choose: orthogonal/xavier/random")
143
144 # Xavier uniform for output projection
145 bound_hy = math.sqrt(6.0 / (H + V))
146 nn.init.uniform_(model.W_hy, -bound_hy, bound_hy)
147
148 # Biases stay zero (already initialized in constructor, reset here for safety)
149 nn.init.zeros_(model.b_h)
150 nn.init.zeros_(model.b_y)
151
152
153# ---------------------------------------------------------------------------
154# Corpus preprocessing
155# ---------------------------------------------------------------------------
156
157
Annotations
nn.init.orthogonal_Orthogonal init: all eigenvalues |λ|=1 at step 0, perfect gradient flow at initialization
spectral_radius = 1.0Spectral radius = 1.0 means W_hh neither amplifies nor suppresses gradients
nn.init.xavier_uniform_Xavier uniform for W_xh and W_hy: variance scales with fan_in + fan_out
torch.zeros_likeBiases start at zero, no initial bias toward any character class
07

The Forward Pass

Explicit recurrence with tanh
train.py:224314
224def forward(
225 model: CharRNN,
226 inputs: List[int],
227 h_prev: np.ndarray,
228 activation: str = "tanh",
229 use_embedding: bool = False,
230 embedding: Optional[np.ndarray] = None,
231) -> dict:
232 """
233 Forward pass through seq_len time steps.
234
235 Args:
236 model: CharRNN with parameters
237 inputs: list of seq_len+1 character indices (input[0..T-1] target[1..T])
238 h_prev: hidden state from previous chunk, shape (H, 1)
239 activation: activation function name ('tanh'/'relu'/'sigmoid')
240 use_embedding: if True, use learned embedding instead of one-hot
241 embedding: embedding matrix (H_emb, V) if use_embedding=True
242
243 Returns dict with:
244 xs: one-hot or embedded input vectors {t: (V_or_E, 1)}
245 xs_indices: original character indices {t: int}
246 hs: hidden states {t: (H,1)}, hs[-1] = h_prev
247 os: output logits {t: (V,1)}
248 ps: softmax probabilities {t: (V,1)}
249 targets: list of target character indices
250 loss: scalar total (unnormalized) cross-entropy loss
251
252 Note on loss normalization:
253 loss here is the SUM over all time steps (unnormalized). capture_step
254 divides by seq_len to get per-character bits. Gradient magnitudes in
255 backward() reflect this unnormalized loss and scale linearly with
256 seq_len, important to understand when comparing experiments with
257 different chunk lengths (Plan 7, Experiment 3).
258 """
259 act_fn, _ = _activation_fn(activation)
260
261 W_xh = model.W_xh.data.numpy()
262 W_hh = model.W_hh.data.numpy()
263 W_hy = model.W_hy.data.numpy()
264 b_h = model.b_h.data.numpy()
265 b_y = model.b_y.data.numpy()
266
267 xs, xs_indices, hs, os, ps = {}, {}, {}, {}, {}
268 hs[-1] = h_prev.copy()
269 loss = 0.0
270
271 seq_len = len(inputs) - 1 # predict inputs[1..T] from inputs[0..T-1]
272
273 for t in range(seq_len):
274 xs_indices[t] = inputs[t]
275
276 if use_embedding and embedding is not None:
277 # Learned embedding lookup: (E, 1)
278 xs[t] = embedding[:, inputs[t]].reshape(-1, 1)
279 else:
280 # One-hot encoding: (V, 1)
281 xs[t] = np.zeros((model.vocab_size, 1))
282 xs[t][inputs[t]] = 1.0
283
284 # h_t = activation(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
285 pre_act = W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h
286 hs[t] = act_fn(pre_act)
287
288 # o_t = W_hy @ h_t + b_y
289 os[t] = W_hy @ hs[t] + b_y
290
291 # softmax with numerical stability: subtract max before exp
292 exp_o = np.exp(os[t] - np.max(os[t]))
293 ps[t] = exp_o / np.sum(exp_o)
294
295 # cross-entropy: -log(p[target])
296 target_t = inputs[t + 1]
297 loss += -np.log(ps[t][target_t, 0] + 1e-12)
298
299 targets = inputs[1:]
300 return {
301 "xs": xs,
302 "xs_indices": xs_indices,
303 "hs": hs,
304 "os": os,
305 "ps": ps,
306 "targets": targets,
307 "loss": loss,
308 }
309
310
311# ---------------------------------------------------------------------------
312# Backward pass: Manual BPTT
313# ---------------------------------------------------------------------------
314
315
Annotations
W_xh @ x_t + W_hh @ h_prev + b_hPre-activation: combines current input AND previous hidden state
np.tanh(a_t)tanh squashes to (-1,+1): derivative (1-h²) causes vanishing gradients in BPTT
h_states.append(h.copy())Store hidden state at every step, needed for BPTT and analysis
loss += -np.log(Cross-entropy loss: penalizes the log probability of the correct next character
08

Output and Softmax

From hidden state to predictions
train.py:224314
224def forward(
225 model: CharRNN,
226 inputs: List[int],
227 h_prev: np.ndarray,
228 activation: str = "tanh",
229 use_embedding: bool = False,
230 embedding: Optional[np.ndarray] = None,
231) -> dict:
232 """
233 Forward pass through seq_len time steps.
234
235 Args:
236 model: CharRNN with parameters
237 inputs: list of seq_len+1 character indices (input[0..T-1] target[1..T])
238 h_prev: hidden state from previous chunk, shape (H, 1)
239 activation: activation function name ('tanh'/'relu'/'sigmoid')
240 use_embedding: if True, use learned embedding instead of one-hot
241 embedding: embedding matrix (H_emb, V) if use_embedding=True
242
243 Returns dict with:
244 xs: one-hot or embedded input vectors {t: (V_or_E, 1)}
245 xs_indices: original character indices {t: int}
246 hs: hidden states {t: (H,1)}, hs[-1] = h_prev
247 os: output logits {t: (V,1)}
248 ps: softmax probabilities {t: (V,1)}
249 targets: list of target character indices
250 loss: scalar total (unnormalized) cross-entropy loss
251
252 Note on loss normalization:
253 loss here is the SUM over all time steps (unnormalized). capture_step
254 divides by seq_len to get per-character bits. Gradient magnitudes in
255 backward() reflect this unnormalized loss and scale linearly with
256 seq_len, important to understand when comparing experiments with
257 different chunk lengths (Plan 7, Experiment 3).
258 """
259 act_fn, _ = _activation_fn(activation)
260
261 W_xh = model.W_xh.data.numpy()
262 W_hh = model.W_hh.data.numpy()
263 W_hy = model.W_hy.data.numpy()
264 b_h = model.b_h.data.numpy()
265 b_y = model.b_y.data.numpy()
266
267 xs, xs_indices, hs, os, ps = {}, {}, {}, {}, {}
268 hs[-1] = h_prev.copy()
269 loss = 0.0
270
271 seq_len = len(inputs) - 1 # predict inputs[1..T] from inputs[0..T-1]
272
273 for t in range(seq_len):
274 xs_indices[t] = inputs[t]
275
276 if use_embedding and embedding is not None:
277 # Learned embedding lookup: (E, 1)
278 xs[t] = embedding[:, inputs[t]].reshape(-1, 1)
279 else:
280 # One-hot encoding: (V, 1)
281 xs[t] = np.zeros((model.vocab_size, 1))
282 xs[t][inputs[t]] = 1.0
283
284 # h_t = activation(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
285 pre_act = W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h
286 hs[t] = act_fn(pre_act)
287
288 # o_t = W_hy @ h_t + b_y
289 os[t] = W_hy @ hs[t] + b_y
290
291 # softmax with numerical stability: subtract max before exp
292 exp_o = np.exp(os[t] - np.max(os[t]))
293 ps[t] = exp_o / np.sum(exp_o)
294
295 # cross-entropy: -log(p[target])
296 target_t = inputs[t + 1]
297 loss += -np.log(ps[t][target_t, 0] + 1e-12)
298
299 targets = inputs[1:]
300 return {
301 "xs": xs,
302 "xs_indices": xs_indices,
303 "hs": hs,
304 "os": os,
305 "ps": ps,
306 "targets": targets,
307 "loss": loss,
308 }
309
310
311# ---------------------------------------------------------------------------
312# Backward pass: Manual BPTT
313# ---------------------------------------------------------------------------
314
315
Annotations
o_t = W_hy @ h_t + b_yOutput logits: un-normalized log-probabilities over the vocabulary
o_t - o_t.max()Numerically stable softmax: subtract max before exp to prevent overflow
exp_o / exp_o.sum()Normalize to get a probability distribution over the V characters
09

Cross-Entropy Loss

Measuring prediction quality
train.py:224314
224def forward(
225 model: CharRNN,
226 inputs: List[int],
227 h_prev: np.ndarray,
228 activation: str = "tanh",
229 use_embedding: bool = False,
230 embedding: Optional[np.ndarray] = None,
231) -> dict:
232 """
233 Forward pass through seq_len time steps.
234
235 Args:
236 model: CharRNN with parameters
237 inputs: list of seq_len+1 character indices (input[0..T-1] target[1..T])
238 h_prev: hidden state from previous chunk, shape (H, 1)
239 activation: activation function name ('tanh'/'relu'/'sigmoid')
240 use_embedding: if True, use learned embedding instead of one-hot
241 embedding: embedding matrix (H_emb, V) if use_embedding=True
242
243 Returns dict with:
244 xs: one-hot or embedded input vectors {t: (V_or_E, 1)}
245 xs_indices: original character indices {t: int}
246 hs: hidden states {t: (H,1)}, hs[-1] = h_prev
247 os: output logits {t: (V,1)}
248 ps: softmax probabilities {t: (V,1)}
249 targets: list of target character indices
250 loss: scalar total (unnormalized) cross-entropy loss
251
252 Note on loss normalization:
253 loss here is the SUM over all time steps (unnormalized). capture_step
254 divides by seq_len to get per-character bits. Gradient magnitudes in
255 backward() reflect this unnormalized loss and scale linearly with
256 seq_len, important to understand when comparing experiments with
257 different chunk lengths (Plan 7, Experiment 3).
258 """
259 act_fn, _ = _activation_fn(activation)
260
261 W_xh = model.W_xh.data.numpy()
262 W_hh = model.W_hh.data.numpy()
263 W_hy = model.W_hy.data.numpy()
264 b_h = model.b_h.data.numpy()
265 b_y = model.b_y.data.numpy()
266
267 xs, xs_indices, hs, os, ps = {}, {}, {}, {}, {}
268 hs[-1] = h_prev.copy()
269 loss = 0.0
270
271 seq_len = len(inputs) - 1 # predict inputs[1..T] from inputs[0..T-1]
272
273 for t in range(seq_len):
274 xs_indices[t] = inputs[t]
275
276 if use_embedding and embedding is not None:
277 # Learned embedding lookup: (E, 1)
278 xs[t] = embedding[:, inputs[t]].reshape(-1, 1)
279 else:
280 # One-hot encoding: (V, 1)
281 xs[t] = np.zeros((model.vocab_size, 1))
282 xs[t][inputs[t]] = 1.0
283
284 # h_t = activation(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
285 pre_act = W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h
286 hs[t] = act_fn(pre_act)
287
288 # o_t = W_hy @ h_t + b_y
289 os[t] = W_hy @ hs[t] + b_y
290
291 # softmax with numerical stability: subtract max before exp
292 exp_o = np.exp(os[t] - np.max(os[t]))
293 ps[t] = exp_o / np.sum(exp_o)
294
295 # cross-entropy: -log(p[target])
296 target_t = inputs[t + 1]
297 loss += -np.log(ps[t][target_t, 0] + 1e-12)
298
299 targets = inputs[1:]
300 return {
301 "xs": xs,
302 "xs_indices": xs_indices,
303 "hs": hs,
304 "os": os,
305 "ps": ps,
306 "targets": targets,
307 "loss": loss,
308 }
309
310
311# ---------------------------------------------------------------------------
312# Backward pass: Manual BPTT
313# ---------------------------------------------------------------------------
314
315
Annotations
-np.log(ps[t][targets[t]])Cross-entropy = -log(probability of the true next character)
loss / len(inputs)Average loss per character. Random guessing gives -log(1/V) ≈ 3.2 for V=24
10

Manual BPTT

The backward pass, line by line
train.py:315431
315def backward(
316 model: CharRNN,
317 fwd: dict,
318 clip: float = 5.0,
319 activation: str = "tanh",
320 freeze_whh: bool = False,
321) -> dict:
322 """
323 Backpropagation Through Time: manual gradient computation.
324
325 Derives gradients for all 5 parameters by propagating error backward
326 through the unrolled computational graph. This is the Jacobian product
327 computed incrementally, rather than forming the full product matrix,
328 we multiply one factor at a time (see Part 7B of the plan for derivation).
329
330 Key teaching points:
331 - dh_next accumulator IS the "gradient flowing backward through time"
332 - (1 - h^2) is the tanh derivative, always in [0,1] causes shrinkage
333 - W_hh.T @ dh_raw is the recurrent propagation Jacobian product
334 - per_step_grad_norm captures vanishing gradient phenomenon directly
335
336 Args:
337 model: CharRNN
338 fwd: output from forward()
339 clip: global gradient norm clipping threshold (0 = no clipping)
340 activation: activation function name (determines derivative)
341 freeze_whh: if True, zero out dW_hh gradient (reservoir computing)
342
343 Returns dict with gradient arrays and diagnostic statistics.
344 """
345 _, deriv_fn = _activation_fn(activation)
346
347 xs, hs, ps, targets = fwd["xs"], fwd["hs"], fwd["ps"], fwd["targets"]
348 seq_len = len(targets)
349
350 W_hh = model.W_hh.data.numpy()
351 W_hy = model.W_hy.data.numpy()
352
353 dW_xh = np.zeros_like(model.W_xh.data.numpy())
354 dW_hh = np.zeros_like(model.W_hh.data.numpy())
355 dW_hy = np.zeros_like(model.W_hy.data.numpy())
356 db_h = np.zeros_like(model.b_h.data.numpy())
357 db_y = np.zeros_like(model.b_y.data.numpy())
358
359 # dh_next: gradient from future time steps flowing backward
360 dh_next = np.zeros((model.hidden_size, 1))
361 per_step_grad_norm = []
362
363 for t in reversed(range(seq_len)):
364 # --- Output layer: softmax-CE gradient ---
365 # d(loss)/d(o_t) = p_t - one_hot(target_t)
366 do = ps[t].copy()
367 do[targets[t]] -= 1.0
368
369 dW_hy += do @ hs[t].T # outer product
370 db_y += do
371
372 # --- Hidden state gradient: two contributions ---
373 # 1. From output at t: W_hy^T @ do
374 # 2. From future steps: dh_next (the recurrent term)
375 dh = W_hy.T @ do + dh_next
376
377 # Through activation: element-wise derivative
378 dh_raw = deriv_fn(hs[t]) * dh
379
380 dW_xh += dh_raw @ xs[t].T # outer product with input
381 dW_hh += dh_raw @ hs[t - 1].T # outer product with prev hidden
382 db_h += dh_raw
383
384 # Propagate gradient to previous time step (the Jacobian factor)
385 dh_next = W_hh.T @ dh_raw
386
387 # Record gradient magnitude at this step for vanishing gradient viz
388 per_step_grad_norm.append(float(np.linalg.norm(dh_next)))
389
390 per_step_grad_norm.reverse() # index 0 = earliest time step
391
392 # Reservoir computing: freeze recurrent weights by zeroing their gradient
393 if freeze_whh:
394 dW_hh[:] = 0.0
395
396 # --- Global norm gradient clipping ---
397 grads = [dW_xh, dW_hh, dW_hy, db_h, db_y]
398 total_grad_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
399 clipped = (clip > 0) and (total_grad_norm > clip)
400 clip_ratio = 1.0
401 if clipped:
402 clip_ratio = clip / total_grad_norm
403 for g in grads:
404 g *= clip_ratio
405
406 grad_norms = {
407 "W_xh": float(np.linalg.norm(dW_xh)),
408 "W_hh": float(np.linalg.norm(dW_hh)),
409 "W_hy": float(np.linalg.norm(dW_hy)),
410 "b_h": float(np.linalg.norm(db_h)),
411 "b_y": float(np.linalg.norm(db_y)),
412 }
413
414 return {
415 "dW_xh": dW_xh,
416 "dW_hh": dW_hh,
417 "dW_hy": dW_hy,
418 "db_h": db_h,
419 "db_y": db_y,
420 "grad_norms": grad_norms,
421 "total_grad_norm": total_grad_norm,
422 "clipped": clipped,
423 "clip_ratio": clip_ratio,
424 "per_step_grad_norm": per_step_grad_norm,
425 }
426
427
428# ---------------------------------------------------------------------------
429# Parameter update
430# ---------------------------------------------------------------------------
431
432
Annotations
do[targets[t]] -= 1.0Softmax-cross-entropy gradient: p_t - one_hot(y_t). Only the target index changes.
dh = W_hy.T @ do + dh_nextHidden gradient combines: (a) output error and (b) future gradient dh_next
(1.0 - hs[t] ** 2) * dhtanh derivative: always in [0,1], this is the source of vanishing gradients
dh_next = W_hh.T @ dh_rawTHE core of BPTT: recurrence propagates gradient one step back through W_hh
dW_hh += dh_raw @ hs[t - 1].TW_hh gradient accumulates over all time steps in the sequence
11

Gradient Clipping

Preventing exploding gradients
train.py:315431
315def backward(
316 model: CharRNN,
317 fwd: dict,
318 clip: float = 5.0,
319 activation: str = "tanh",
320 freeze_whh: bool = False,
321) -> dict:
322 """
323 Backpropagation Through Time: manual gradient computation.
324
325 Derives gradients for all 5 parameters by propagating error backward
326 through the unrolled computational graph. This is the Jacobian product
327 computed incrementally, rather than forming the full product matrix,
328 we multiply one factor at a time (see Part 7B of the plan for derivation).
329
330 Key teaching points:
331 - dh_next accumulator IS the "gradient flowing backward through time"
332 - (1 - h^2) is the tanh derivative, always in [0,1] causes shrinkage
333 - W_hh.T @ dh_raw is the recurrent propagation Jacobian product
334 - per_step_grad_norm captures vanishing gradient phenomenon directly
335
336 Args:
337 model: CharRNN
338 fwd: output from forward()
339 clip: global gradient norm clipping threshold (0 = no clipping)
340 activation: activation function name (determines derivative)
341 freeze_whh: if True, zero out dW_hh gradient (reservoir computing)
342
343 Returns dict with gradient arrays and diagnostic statistics.
344 """
345 _, deriv_fn = _activation_fn(activation)
346
347 xs, hs, ps, targets = fwd["xs"], fwd["hs"], fwd["ps"], fwd["targets"]
348 seq_len = len(targets)
349
350 W_hh = model.W_hh.data.numpy()
351 W_hy = model.W_hy.data.numpy()
352
353 dW_xh = np.zeros_like(model.W_xh.data.numpy())
354 dW_hh = np.zeros_like(model.W_hh.data.numpy())
355 dW_hy = np.zeros_like(model.W_hy.data.numpy())
356 db_h = np.zeros_like(model.b_h.data.numpy())
357 db_y = np.zeros_like(model.b_y.data.numpy())
358
359 # dh_next: gradient from future time steps flowing backward
360 dh_next = np.zeros((model.hidden_size, 1))
361 per_step_grad_norm = []
362
363 for t in reversed(range(seq_len)):
364 # --- Output layer: softmax-CE gradient ---
365 # d(loss)/d(o_t) = p_t - one_hot(target_t)
366 do = ps[t].copy()
367 do[targets[t]] -= 1.0
368
369 dW_hy += do @ hs[t].T # outer product
370 db_y += do
371
372 # --- Hidden state gradient: two contributions ---
373 # 1. From output at t: W_hy^T @ do
374 # 2. From future steps: dh_next (the recurrent term)
375 dh = W_hy.T @ do + dh_next
376
377 # Through activation: element-wise derivative
378 dh_raw = deriv_fn(hs[t]) * dh
379
380 dW_xh += dh_raw @ xs[t].T # outer product with input
381 dW_hh += dh_raw @ hs[t - 1].T # outer product with prev hidden
382 db_h += dh_raw
383
384 # Propagate gradient to previous time step (the Jacobian factor)
385 dh_next = W_hh.T @ dh_raw
386
387 # Record gradient magnitude at this step for vanishing gradient viz
388 per_step_grad_norm.append(float(np.linalg.norm(dh_next)))
389
390 per_step_grad_norm.reverse() # index 0 = earliest time step
391
392 # Reservoir computing: freeze recurrent weights by zeroing their gradient
393 if freeze_whh:
394 dW_hh[:] = 0.0
395
396 # --- Global norm gradient clipping ---
397 grads = [dW_xh, dW_hh, dW_hy, db_h, db_y]
398 total_grad_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
399 clipped = (clip > 0) and (total_grad_norm > clip)
400 clip_ratio = 1.0
401 if clipped:
402 clip_ratio = clip / total_grad_norm
403 for g in grads:
404 g *= clip_ratio
405
406 grad_norms = {
407 "W_xh": float(np.linalg.norm(dW_xh)),
408 "W_hh": float(np.linalg.norm(dW_hh)),
409 "W_hy": float(np.linalg.norm(dW_hy)),
410 "b_h": float(np.linalg.norm(db_h)),
411 "b_y": float(np.linalg.norm(db_y)),
412 }
413
414 return {
415 "dW_xh": dW_xh,
416 "dW_hh": dW_hh,
417 "dW_hy": dW_hy,
418 "db_h": db_h,
419 "db_y": db_y,
420 "grad_norms": grad_norms,
421 "total_grad_norm": total_grad_norm,
422 "clipped": clipped,
423 "clip_ratio": clip_ratio,
424 "per_step_grad_norm": per_step_grad_norm,
425 }
426
427
428# ---------------------------------------------------------------------------
429# Parameter update
430# ---------------------------------------------------------------------------
431
432
Annotations
global_norm = np.sqrt(Global norm = L2 norm of ALL gradient components concatenated
clip_coef = clip / global_normScale factor: if norm > threshold, scale all gradients uniformly
global_norm > clipOnly clip when necessary; avoids distorting small, well-behaved gradients
Phase 3: Training Loop

4 sections

12

The Training Loop

TBPTT with hidden state carry-forward
train.py:8501016
850def train(model: CharRNN, train_data: List[int], val_data: List[int], config: TrainConfig):
851 """
852 Main TBPTT training loop.
853
854 TBPTT design:
855 - Hidden state h carries FORWARD across chunk boundaries (memory preserved)
856 - BPTT unrolls only within each chunk (gradient truncated at boundaries)
857 - This IS the truncation: h.copy() detaches the gradient graph
858
859 At corpus wrap, h is reset to zeros (fresh start for next epoch's pass).
860 """
861 output_dir = config.output_dir
862 output_dir.mkdir(parents=True, exist_ok=True)
863 (output_dir / "steps").mkdir(exist_ok=True)
864 (output_dir / "jacobians").mkdir(exist_ok=True)
865
866 V = len(config.chars)
867 H = config.hidden_size
868
869 # Optionally initialize learned embedding (--use-embedding flag)
870 embedding = None
871 if config.use_embedding:
872 embed_dim = 16
873 embedding = np.random.randn(embed_dim, V) * 0.1
874
875 # SGD momentum velocity (all zeros initially)
876 velocity = {name: np.zeros_like(getattr(model, name).data.numpy())
877 for name in ["W_xh", "W_hh", "W_hy", "b_h", "b_y"]}
878
879 h = np.zeros((H, 1))
880 pointer = 0
881
882 loss_curve: List[dict] = []
883 step_index: List[int] = []
884 gf_summary: List[dict] = []
885 recorded_data: List[dict] = [] # kept in memory for PCA at end
886 recorded_step_count = 0
887 jacobian_count = 0
888
889 print("=" * 64)
890 print(" RNN Character-Level Language Model: Training")
891 print("=" * 64)
892 print(f" Corpus: {config.corpus} ({len(train_data) + len(val_data):,} chars, {V} unique)")
893 print(f" Architecture: vanilla_rnn (H={H}, V={V})")
894 print(f" Parameters: {model.count_parameters():,}")
895 W_hh_init_eigs = np.linalg.eigvals(model.W_hh.data.numpy())
896 print(f" W_hh init: {config.init} (spectral_radius={np.max(np.abs(W_hh_init_eigs)):.3f})")
897 print(f" Training: {config.steps:,} steps, seq_len={config.seq_len}, "
898 f"lr={config.lr}, clip={config.clip}")
899 print(f" TBPTT chunk: {config.seq_len} characters")
900 print("=" * 64)
901
902 t_start = time.time()
903
904 for step in range(1, config.steps + 1):
905 # Corpus wrap: reset hidden state and pointer
906 if pointer + config.seq_len + 1 >= len(train_data):
907 pointer = 0
908 h = np.zeros((H, 1))
909
910 inputs = train_data[pointer: pointer + config.seq_len + 1]
911 pointer += config.seq_len
912
913 # Current learning rate (step decay)
914 lr = get_lr(step, config.lr, config.lr_decay_rate, config.lr_decay_every)
915
916 # Forward pass: h carries over from previous chunk (TBPTT memory)
917 fwd = forward(
918 model, inputs, h,
919 activation=config.activation,
920 use_embedding=config.use_embedding,
921 embedding=embedding,
922 )
923
924 # Detach: copy final hidden state forward but don't backprop through it
925 # This is the truncation in TBPTT; h "remembers" but gradients don't cross
926 h = fwd["hs"][config.seq_len - 1].copy()
927
928 # Backward pass (BPTT within this chunk only)
929 grads = backward(
930 model, fwd,
931 clip=config.clip,
932 activation=config.activation,
933 freeze_whh=config.freeze_whh,
934 )
935
936 # SGD update
937 velocity = _apply_update(model, grads, lr, velocity, config.momentum)
938
939 # --- Loss curve (every step) ---
940 avg_loss = float(fwd["loss"] / config.seq_len)
941
942 # Validation loss (every val_interval steps)
943 val_loss = None
944 if len(val_data) > 0 and step % config.val_interval == 0:
945 val_loss = compute_val_loss(
946 model, val_data, config.seq_len,
947 activation=config.activation,
948 )
949
950 loss_curve.append({
951 "step": step,
952 "loss": round(avg_loss, 6),
953 "val_loss": round(val_loss, 6) if val_loss is not None else None,
954 "lr": round(lr, 6),
955 })
956
957 # --- Gradient flow summary (every recorded step) ---
958 W_hh_eigs = np.linalg.eigvals(model.W_hh.data.numpy())
959 spectral_radius = float(np.max(np.abs(W_hh_eigs)))
960
961 # --- State recording ---
962 if should_record(step, config.dense_steps, config.record_interval, config.steps):
963 record = capture_step(step, model, fwd, grads, config)
964 save_step(record, output_dir)
965 step_index.append(step)
966 gf_summary.append({
967 "step": step,
968 "total_grad_norm": grads["total_grad_norm"],
969 "clipped": grads["clipped"],
970 "clip_ratio": grads["clip_ratio"],
971 "spectral_radius": spectral_radius,
972 "per_step_grad_norm": grads["per_step_grad_norm"],
973 "grad_norms": grads["grad_norms"],
974 })
975 recorded_data.append(record)
976 recorded_step_count += 1
977
978 # --- Jacobian capture ---
979 if step % config.jacobian_interval == 0:
980 jac_t_start = max(0, config.seq_len - 1 - 10)
981 jac_t_end = config.seq_len - 1
982 jac = capture_jacobian(model, fwd, jac_t_start, jac_t_end)
983 save_jacobian(jac, step, output_dir)
984 jacobian_count += 1
985
986 # --- Console progress ---
987 if step <= 10 or step % 1000 == 0 or step == config.steps:
988 clip_flag = "clipped" if grads["clipped"] else " "
989 print(
990 f"step {step:6d}/{config.steps} | loss {avg_loss:.4f} | "
991 f"grad_norm {grads['total_grad_norm']:6.2f} | {clip_flag} | "
992 f"lr {lr:.4f} | rho {spectral_radius:.3f}"
993 )
994 if step % 5000 == 0 or step == config.steps:
995 sample = generate(
996 model, " ", 60,
997 config.char2idx, config.idx2char,
998 temperature=config.temperature,
999 activation=config.activation,
1000 )
1001 print(f"--- Sample at step {step} (T={config.temperature}): \"{sample}\" ---")
1002
1003 elapsed = time.time() - t_start
1004 print("=" * 64)
1005 print(f" Training complete in {elapsed:.1f}s. Output: {output_dir}")
1006 print(f" Steps recorded: {recorded_step_count} | Jacobian snapshots: {jacobian_count}")
1007 print(f" Final loss: {loss_curve[-1]['loss']:.4f}")
1008 print("=" * 64)
1009
1010 return loss_curve, step_index, gf_summary, recorded_data
1011
1012
1013# ---------------------------------------------------------------------------
1014# CLI
1015# ---------------------------------------------------------------------------
1016
1017
Annotations
h_carryHidden state carries across chunk boundaries; the network has memory beyond seq_len
for chunk_start in range(Sliding window over the corpus: each chunk is seq_len characters long
fwd = forward(model, inputs, targets, h_carryForward pass: returns losses, hidden states, predictions
grads = backward(model, fwd, clip=config.clipManual BPTT over the current chunk
13

Hidden State Management

Detach, carry, reset
train.py:8501016
850def train(model: CharRNN, train_data: List[int], val_data: List[int], config: TrainConfig):
851 """
852 Main TBPTT training loop.
853
854 TBPTT design:
855 - Hidden state h carries FORWARD across chunk boundaries (memory preserved)
856 - BPTT unrolls only within each chunk (gradient truncated at boundaries)
857 - This IS the truncation: h.copy() detaches the gradient graph
858
859 At corpus wrap, h is reset to zeros (fresh start for next epoch's pass).
860 """
861 output_dir = config.output_dir
862 output_dir.mkdir(parents=True, exist_ok=True)
863 (output_dir / "steps").mkdir(exist_ok=True)
864 (output_dir / "jacobians").mkdir(exist_ok=True)
865
866 V = len(config.chars)
867 H = config.hidden_size
868
869 # Optionally initialize learned embedding (--use-embedding flag)
870 embedding = None
871 if config.use_embedding:
872 embed_dim = 16
873 embedding = np.random.randn(embed_dim, V) * 0.1
874
875 # SGD momentum velocity (all zeros initially)
876 velocity = {name: np.zeros_like(getattr(model, name).data.numpy())
877 for name in ["W_xh", "W_hh", "W_hy", "b_h", "b_y"]}
878
879 h = np.zeros((H, 1))
880 pointer = 0
881
882 loss_curve: List[dict] = []
883 step_index: List[int] = []
884 gf_summary: List[dict] = []
885 recorded_data: List[dict] = [] # kept in memory for PCA at end
886 recorded_step_count = 0
887 jacobian_count = 0
888
889 print("=" * 64)
890 print(" RNN Character-Level Language Model: Training")
891 print("=" * 64)
892 print(f" Corpus: {config.corpus} ({len(train_data) + len(val_data):,} chars, {V} unique)")
893 print(f" Architecture: vanilla_rnn (H={H}, V={V})")
894 print(f" Parameters: {model.count_parameters():,}")
895 W_hh_init_eigs = np.linalg.eigvals(model.W_hh.data.numpy())
896 print(f" W_hh init: {config.init} (spectral_radius={np.max(np.abs(W_hh_init_eigs)):.3f})")
897 print(f" Training: {config.steps:,} steps, seq_len={config.seq_len}, "
898 f"lr={config.lr}, clip={config.clip}")
899 print(f" TBPTT chunk: {config.seq_len} characters")
900 print("=" * 64)
901
902 t_start = time.time()
903
904 for step in range(1, config.steps + 1):
905 # Corpus wrap: reset hidden state and pointer
906 if pointer + config.seq_len + 1 >= len(train_data):
907 pointer = 0
908 h = np.zeros((H, 1))
909
910 inputs = train_data[pointer: pointer + config.seq_len + 1]
911 pointer += config.seq_len
912
913 # Current learning rate (step decay)
914 lr = get_lr(step, config.lr, config.lr_decay_rate, config.lr_decay_every)
915
916 # Forward pass: h carries over from previous chunk (TBPTT memory)
917 fwd = forward(
918 model, inputs, h,
919 activation=config.activation,
920 use_embedding=config.use_embedding,
921 embedding=embedding,
922 )
923
924 # Detach: copy final hidden state forward but don't backprop through it
925 # This is the truncation in TBPTT; h "remembers" but gradients don't cross
926 h = fwd["hs"][config.seq_len - 1].copy()
927
928 # Backward pass (BPTT within this chunk only)
929 grads = backward(
930 model, fwd,
931 clip=config.clip,
932 activation=config.activation,
933 freeze_whh=config.freeze_whh,
934 )
935
936 # SGD update
937 velocity = _apply_update(model, grads, lr, velocity, config.momentum)
938
939 # --- Loss curve (every step) ---
940 avg_loss = float(fwd["loss"] / config.seq_len)
941
942 # Validation loss (every val_interval steps)
943 val_loss = None
944 if len(val_data) > 0 and step % config.val_interval == 0:
945 val_loss = compute_val_loss(
946 model, val_data, config.seq_len,
947 activation=config.activation,
948 )
949
950 loss_curve.append({
951 "step": step,
952 "loss": round(avg_loss, 6),
953 "val_loss": round(val_loss, 6) if val_loss is not None else None,
954 "lr": round(lr, 6),
955 })
956
957 # --- Gradient flow summary (every recorded step) ---
958 W_hh_eigs = np.linalg.eigvals(model.W_hh.data.numpy())
959 spectral_radius = float(np.max(np.abs(W_hh_eigs)))
960
961 # --- State recording ---
962 if should_record(step, config.dense_steps, config.record_interval, config.steps):
963 record = capture_step(step, model, fwd, grads, config)
964 save_step(record, output_dir)
965 step_index.append(step)
966 gf_summary.append({
967 "step": step,
968 "total_grad_norm": grads["total_grad_norm"],
969 "clipped": grads["clipped"],
970 "clip_ratio": grads["clip_ratio"],
971 "spectral_radius": spectral_radius,
972 "per_step_grad_norm": grads["per_step_grad_norm"],
973 "grad_norms": grads["grad_norms"],
974 })
975 recorded_data.append(record)
976 recorded_step_count += 1
977
978 # --- Jacobian capture ---
979 if step % config.jacobian_interval == 0:
980 jac_t_start = max(0, config.seq_len - 1 - 10)
981 jac_t_end = config.seq_len - 1
982 jac = capture_jacobian(model, fwd, jac_t_start, jac_t_end)
983 save_jacobian(jac, step, output_dir)
984 jacobian_count += 1
985
986 # --- Console progress ---
987 if step <= 10 or step % 1000 == 0 or step == config.steps:
988 clip_flag = "clipped" if grads["clipped"] else " "
989 print(
990 f"step {step:6d}/{config.steps} | loss {avg_loss:.4f} | "
991 f"grad_norm {grads['total_grad_norm']:6.2f} | {clip_flag} | "
992 f"lr {lr:.4f} | rho {spectral_radius:.3f}"
993 )
994 if step % 5000 == 0 or step == config.steps:
995 sample = generate(
996 model, " ", 60,
997 config.char2idx, config.idx2char,
998 temperature=config.temperature,
999 activation=config.activation,
1000 )
1001 print(f"--- Sample at step {step} (T={config.temperature}): \"{sample}\" ---")
1002
1003 elapsed = time.time() - t_start
1004 print("=" * 64)
1005 print(f" Training complete in {elapsed:.1f}s. Output: {output_dir}")
1006 print(f" Steps recorded: {recorded_step_count} | Jacobian snapshots: {jacobian_count}")
1007 print(f" Final loss: {loss_curve[-1]['loss']:.4f}")
1008 print("=" * 64)
1009
1010 return loss_curve, step_index, gf_summary, recorded_data
1011
1012
1013# ---------------------------------------------------------------------------
1014# CLI
1015# ---------------------------------------------------------------------------
1016
1017
Annotations
h_carry = fwd['h_final'].copy()Detach: carry forward final hidden state but stop gradient at chunk boundary
h_carry = np.zeros(Reset hidden state at the start of each epoch pass over the corpus
14

Parameter Update

SGD with momentum and LR decay
train.py:432448
432def _apply_update(model: CharRNN, grads: dict, lr: float, velocity: dict, momentum: float) -> dict:
433 """SGD update with optional momentum."""
434 param_names = ["W_xh", "W_hh", "W_hy", "b_h", "b_y"]
435 for name in param_names:
436 param = getattr(model, name)
437 g = grads[f"d{name}"]
438 if momentum > 0.0:
439 v = momentum * velocity[name] - lr * g
440 velocity[name] = v
441 delta = torch.from_numpy(v).float()
442 else:
443 delta = torch.from_numpy(-lr * g).float()
444 with torch.no_grad():
445 param.data.add_(delta)
446 return velocity
447
448
449
Annotations
velocity[name]Momentum accumulates gradient direction, reduces oscillations in loss landscape
momentum * velocity[name] + (1 - momentum) * grads[name]Momentum: exponential moving average of gradients
param.data -= lr * velocity[name]SGD update: subtract scaled gradient from parameter
15

State Recording

Capturing training dynamics
train.py:552633
552def capture_step(
553 step: int,
554 model: CharRNN,
555 fwd: dict,
556 grads: dict,
557 config,
558) -> dict:
559 """
560 Capture full training state at a given step.
561
562 Records all fields needed by the frontend internals page:
563 weight matrices, eigenvalues, gradients, hidden state trajectories,
564 prediction probabilities, saturation statistics, and a sample text.
565 """
566 W_hh_np = model.W_hh.data.numpy()
567 eigenvalues = np.linalg.eigvals(W_hh_np)
568 seq_len = len(fwd["targets"])
569
570 return {
571 "step": step,
572 "loss": float(fwd["loss"] / seq_len), # avg per-character cross-entropy
573
574 # Full weight matrices (for weight evolution visualization)
575 "W_xh": model.W_xh.data.numpy().tolist(),
576 "W_hh": model.W_hh.data.numpy().tolist(),
577 "W_hy": model.W_hy.data.numpy().tolist(),
578 "b_h": model.b_h.data.numpy().ravel().tolist(),
579 "b_y": model.b_y.data.numpy().ravel().tolist(),
580
581 # W_hh spectral analysis (for spectral radius tracking)
582 "w_hh_eigenvalues_real": eigenvalues.real.tolist(),
583 "w_hh_eigenvalues_imag": eigenvalues.imag.tolist(),
584 "spectral_radius": float(np.max(np.abs(eigenvalues))),
585
586 # Gradient diagnostics
587 "grad_norms": grads["grad_norms"],
588 "total_grad_norm": grads["total_grad_norm"],
589 "clipped": grads["clipped"],
590 "clip_ratio": grads["clip_ratio"],
591 "per_step_grad_norm": grads["per_step_grad_norm"],
592
593 # Hidden state trajectory for this chunk (for PCA / trajectory viz)
594 "hidden_states": [
595 fwd["hs"][t].ravel().tolist() for t in range(seq_len)
596 ],
597 "input_chars": [
598 config.idx2char[fwd["xs_indices"][t]] for t in range(seq_len)
599 ],
600
601 # Prediction distribution at each position (for prediction landscape)
602 "predictions": [
603 {
604 config.idx2char[i]: round(float(fwd["ps"][t][i, 0]), 6)
605 for i in range(model.vocab_size)
606 }
607 for t in range(seq_len)
608 ],
609 "target_chars": [
610 config.idx2char[fwd["targets"][t]] for t in range(seq_len)
611 ],
612
613 # Per-position accuracy (for word-boundary error analysis)
614 "correct_at_position": [
615 int(np.argmax(fwd["ps"][t]) == fwd["targets"][t])
616 for t in range(seq_len)
617 ],
618
619 # Fraction of hidden units saturated near ±1 (for saturation viz)
620 "saturation": float(np.mean(
621 np.abs(np.array([fwd["hs"][t].ravel() for t in range(seq_len)])) > 0.95
622 )),
623
624 # Sample text generated at this checkpoint
625 "sample_text": generate(
626 model, " ", config.generate_length,
627 config.char2idx, config.idx2char,
628 temperature=config.temperature,
629 activation=config.activation,
630 ),
631 }
632
633
634
Annotations
should_recordDense recording early (every 50 steps for steps ≤200), then every 50th step
w_hh_eigenvalues_realTrack W_hh eigenvalues at each snapshot: spectral radius drift over training
per_step_grad_normPer-position gradient norm within the seq_len window, vanishing gradient evidence
saturationFraction of hidden units with |h|>0.9: how much of the tanh range is being used
Phase 4: Analysis & Generation

4 sections

16

Text Generation

Sampling with temperature
458def generate(
459 model: CharRNN,
460 seed_char: str,
461 length: int,
462 char2idx: Dict[str, int],
463 idx2char: Dict[int, str],
464 temperature: float = 1.0,
465 activation: str = "tanh",
466) -> str:
467 """
468 Sample text character by character using temperature-scaled softmax.
469
470 Temperature controls randomness:
471 T < 1: sharper, more deterministic (greedy at T0)
472 T = 1: unmodified model distribution
473 T > 1: flatter, more random (uniform at T)
474 """
475 act_fn, _ = _activation_fn(activation)
476 W_xh = model.W_xh.data.numpy()
477 W_hh = model.W_hh.data.numpy()
478 W_hy = model.W_hy.data.numpy()
479 b_h = model.b_h.data.numpy()
480 b_y = model.b_y.data.numpy()
481
482 h = np.zeros((model.hidden_size, 1))
483 idx = char2idx.get(seed_char, 0)
484 result = [seed_char]
485
486 for _ in range(length):
487 x = np.zeros((model.vocab_size, 1))
488 x[idx] = 1.0
489 h = act_fn(W_xh @ x + W_hh @ h + b_h)
490 o = W_hy @ h + b_y
491
492 o = o / temperature
493 exp_o = np.exp(o - np.max(o))
494 p = exp_o / np.sum(exp_o)
495
496 idx = int(np.random.choice(model.vocab_size, p=p.ravel()))
497 result.append(idx2char[idx])
498
499 return "".join(result)
500
501
502# ---------------------------------------------------------------------------
503# Validation loss
504# ---------------------------------------------------------------------------
505
506
Annotations
temperatureT<1: conservative (peaks sharper); T=1: standard; T>1: creative/noisy
np.random.choice(range(vocab_size), p=probs)Sample from distribution, not argmax, so output varies each run
h_gen = np.tanh(Generation uses the same forward pass as training, no special generation mode
17

Hidden State PCA

Projecting to 2D for visualization
train.py:777816
777def save_hidden_state_pca(recorded_data: list, output_dir: Path) -> None:
778 """
779 Compute PCA on final-step hidden states, project all recorded steps to 2D.
780
781 PCA is fit on the FINAL step's hidden states, then applied to all steps,
782 matching the GloVe pattern of using final_pca as the reference frame.
783 """
784 from sklearn.decomposition import PCA
785
786 if not recorded_data:
787 return
788
789 # Fit PCA on final step's hidden states
790 final_step = recorded_data[-1]
791 final_hs = np.array(final_step["hidden_states"]) # (seq_len, H)
792
793 pca = PCA(n_components=2)
794 pca.fit(final_hs)
795
796 steps_pca = []
797 for record in recorded_data:
798 hs = np.array(record["hidden_states"]) # (seq_len, H)
799 projected = pca.transform(hs) # (seq_len, 2)
800 steps_pca.append({
801 "step": record["step"],
802 "trajectory_2d": projected.tolist(),
803 "input_chars": record["input_chars"],
804 })
805
806 result = {
807 "pca_variance_explained": pca.explained_variance_ratio_.tolist(),
808 "steps": steps_pca,
809 }
810 save_json(result, output_dir / "hidden_state_pca.json")
811
812
813# ---------------------------------------------------------------------------
814# Training loop
815# ---------------------------------------------------------------------------
816
817
Annotations
PCA(n_components=2)Project 64-dim hidden states to 2D for visualization
pca.fit(all_states)Fit PCA on all recorded hidden states: principal axes capture most variance
trajectories_2d2D trajectory through hidden state space as the model reads the sample sequence
18

Eigenvalue Tracking

Monitoring the spectral radius
train.py:552633
552def capture_step(
553 step: int,
554 model: CharRNN,
555 fwd: dict,
556 grads: dict,
557 config,
558) -> dict:
559 """
560 Capture full training state at a given step.
561
562 Records all fields needed by the frontend internals page:
563 weight matrices, eigenvalues, gradients, hidden state trajectories,
564 prediction probabilities, saturation statistics, and a sample text.
565 """
566 W_hh_np = model.W_hh.data.numpy()
567 eigenvalues = np.linalg.eigvals(W_hh_np)
568 seq_len = len(fwd["targets"])
569
570 return {
571 "step": step,
572 "loss": float(fwd["loss"] / seq_len), # avg per-character cross-entropy
573
574 # Full weight matrices (for weight evolution visualization)
575 "W_xh": model.W_xh.data.numpy().tolist(),
576 "W_hh": model.W_hh.data.numpy().tolist(),
577 "W_hy": model.W_hy.data.numpy().tolist(),
578 "b_h": model.b_h.data.numpy().ravel().tolist(),
579 "b_y": model.b_y.data.numpy().ravel().tolist(),
580
581 # W_hh spectral analysis (for spectral radius tracking)
582 "w_hh_eigenvalues_real": eigenvalues.real.tolist(),
583 "w_hh_eigenvalues_imag": eigenvalues.imag.tolist(),
584 "spectral_radius": float(np.max(np.abs(eigenvalues))),
585
586 # Gradient diagnostics
587 "grad_norms": grads["grad_norms"],
588 "total_grad_norm": grads["total_grad_norm"],
589 "clipped": grads["clipped"],
590 "clip_ratio": grads["clip_ratio"],
591 "per_step_grad_norm": grads["per_step_grad_norm"],
592
593 # Hidden state trajectory for this chunk (for PCA / trajectory viz)
594 "hidden_states": [
595 fwd["hs"][t].ravel().tolist() for t in range(seq_len)
596 ],
597 "input_chars": [
598 config.idx2char[fwd["xs_indices"][t]] for t in range(seq_len)
599 ],
600
601 # Prediction distribution at each position (for prediction landscape)
602 "predictions": [
603 {
604 config.idx2char[i]: round(float(fwd["ps"][t][i, 0]), 6)
605 for i in range(model.vocab_size)
606 }
607 for t in range(seq_len)
608 ],
609 "target_chars": [
610 config.idx2char[fwd["targets"][t]] for t in range(seq_len)
611 ],
612
613 # Per-position accuracy (for word-boundary error analysis)
614 "correct_at_position": [
615 int(np.argmax(fwd["ps"][t]) == fwd["targets"][t])
616 for t in range(seq_len)
617 ],
618
619 # Fraction of hidden units saturated near ±1 (for saturation viz)
620 "saturation": float(np.mean(
621 np.abs(np.array([fwd["hs"][t].ravel() for t in range(seq_len)])) > 0.95
622 )),
623
624 # Sample text generated at this checkpoint
625 "sample_text": generate(
626 model, " ", config.generate_length,
627 config.char2idx, config.idx2char,
628 temperature=config.temperature,
629 activation=config.activation,
630 ),
631 }
632
633
634
Annotations
np.linalg.eigvals(W_hh)Compute all H eigenvalues of the recurrent weight matrix
spectral_radius = float(np.max(np.abs(eigs)))Spectral radius = largest eigenvalue magnitude. >1 means W_hh can amplify
19

Jacobian Computation

Measuring gradient flow through time
train.py:634676
634def capture_jacobian(model: CharRNN, fwd: dict, t_start: int, t_end: int) -> dict:
635 """
636 Compute the Jacobian product dh_t/dh_k for k=t_start..t_end.
637
638 This is the formal quantity whose spectral norm governs whether gradients
639 vanish or explode. Saved periodically to jacobians/{step}.json for the
640 JacobianInspector frontend component.
641
642 For a sequence of length T, the Jacobian from step k to step t is:
643 J_{tk} = prod_{i=k+1}^{t} diag(1 - h_i^2) @ W_hh
644
645 We accumulate this product incrementally (one factor per step).
646 """
647 W_hh = model.W_hh.data.numpy()
648 product = np.eye(model.hidden_size)
649 jacobian_records = []
650
651 for t in range(t_start + 1, t_end + 1):
652 diag = np.diag((1.0 - fwd["hs"][t].ravel() ** 2))
653 J_t = diag @ W_hh
654 product = J_t @ product
655
656 eigs = np.linalg.eigvals(product)
657 jacobian_records.append({
658 "step": t,
659 "jacobian_norm": float(np.linalg.norm(J_t)),
660 "product_norm": float(np.linalg.norm(product)),
661 "product_spectral_radius": float(np.max(np.abs(eigs))),
662 "product_eigenvalues_real": eigs.real.tolist(),
663 })
664
665 return {
666 "from_step": t_start,
667 "to_step": t_end,
668 "jacobians": jacobian_records,
669 "final_product_norm": float(np.linalg.norm(product)),
670 }
671
672
673# ---------------------------------------------------------------------------
674# Output / save functions
675# ---------------------------------------------------------------------------
676
677
Annotations
J_t = np.diag(1 - hs[t]**2) @ W_hhSingle-step Jacobian: tanh derivative × W_hh
product = J_t @ productAccumulate product J_{t}·J_{t-1}·...·J_{1}: measures gradient shrinkage over steps
final_product_norm||J_T·...·J_1||: if <<1, gradients vanish; if >>1, they explode
Run It Yourself

1 section

20

Run It Yourself

Commands and expected output
train.py:
0# Generate corpus
1python generate_corpus.py --num-sentences 900 --seed 42
2
3# Train (5-10 min on CPU)
4python train.py --steps 30000 --hidden-size 64 --seq-len 25 --lr 0.01 --clip 5.0 --seed 42
5
6# Analyze
7python analyze.py --data-dir output/ --save-json
8
9# Explore interactively
10python explore.py --data-dir output/ --interactive
11
12# Expected at step 30000:
13# loss ~1.1 spectral_radius ~1.18 vocab_hit_rate ~0.88
20

Run It Yourself

Quick Start

Clone the repo
$ git clone https://github.com/quantml/github-quantml.git
$ cd github-quantml/rnn/
Install dependencies
$ pip install torch numpy scipy
Generate corpus
$ python generate_corpus.py --num-sentences 600 --seed 42
Train the model
$ python train.py --steps 30000 --hidden-size 64 --seq-len 25 --lr 0.01 --clip 5.0 --seed 42
Analyze results
$ python analyze.py --data-dir output/ --save-json
Explore interactively
$ python explore.py --data-dir output/ --interactive

Expected Output

================================================================
  RNN Character-Level Language Model -- Training
================================================================
  Corpus: corpus.txt (18,247 chars, 27 unique)
  Architecture: vanilla_rnn (H=64, V=27)
  Parameters: 7,579
  W_hh init: orthogonal (spectral_radius=1.000)
================================================================
step      1/30000 | loss 3.2958 | grad_norm 12.50 | clipped | lr 0.0100
step   5000/30000 | loss 1.8423 | grad_norm  2.31 |         | lr 0.0098
step  15000/30000 | loss 1.3107 | grad_norm  1.12 |         | lr 0.0095
step  30000/30000 | loss 1.1205 | grad_norm  0.85 |         | lr 0.0091
--- Sample: "the lion chases the mouse the girl reads the book" ---
================================================================

Experiments to Try

--hidden-size 16Weaker category emergence, higher loss
--hidden-size 128Sharper categories, ~4× slower
--seq-len 5Severe vanishing gradients; watch the memory probe
--seq-len 50Better memory, risk of gradient explosion
--clip 1.0Aggressive clipping; slow convergence
--lr 0.001Lower learning rate, more stable but slower

To test non-orthogonal initialization: replace nn.init.orthogonal_ with nn.init.xavier_uniform_ and observe the spectral radius deviation from 1.0.