The Code
A line-by-line walkthrough of train.py, from character encoding through the forward recurrence, manual BPTT, gradient clipping, and text generation. Every line annotated.
The CharRNN training script implements a vanilla recurrent neural network from scratch, using NumPy for the core computation and PyTorch only for parameter storage. Every line of BPTT is written by hand; no autograd.
```python
WORD_CATEGORIES = {
    "NOUN_HUMAN":   ["boy", "girl", "man", "woman", "king", "queen"],
    "NOUN_ANIMAL":  ["cat", "dog", "lion", "bird", "fish", "mouse"],
    "NOUN_FOOD":    ["bread", "cake", "apple", "milk", "cookie"],
    "NOUN_OBJECT":  ["book", "rock", "glass", "plate", "ball"],
    "VERB_INTRAN":  ["sleep", "walk", "think", "swim"],
    "VERB_TRAN":    ["chase", "eat", "see", "break", "read", "move"],
    "VERB_PERCEPT": ["smell", "hear", "watch"],
}
```
- `WORD_CATEGORIES`: Seven semantic categories of words. The grammar enforces that nouns appear in noun slots and verbs in verb slots; the RNN must discover this structure from characters alone.
- `"NOUN_HUMAN"`: The distributional hypothesis in action: 'boy', 'girl', 'man', 'woman', 'king', and 'queen' all appear in the same grammatical positions. The RNN will cluster them together without ever being told they are related.
- `"VERB_PERCEPT"`: Perception verbs form a sub-category that takes animate objects (NOUN_HUMAN or NOUN_ANIMAL), unlike VERB_TRAN, which can take NOUN_OBJECT too. Elman's Figure 7 shows the RNN discovers this fine-grained structure.

```python
TEMPLATES = [
    # 1. NOUN_HUMAN VERB_INTRAN
    [("DET+CAT", "NOUN_HUMAN"), ("CAT", "VERB_INTRAN")],
    # 2. NOUN_HUMAN VERB_TRAN NOUN_OBJECT
    [("DET+CAT", "NOUN_HUMAN"), ("CAT", "VERB_TRAN"), ("DET+CAT", "NOUN_OBJECT")],
    # ...16 templates total...
]


def generate_sentence(template: list) -> str:
    """Fill one template with random word choices."""
    tokens = []
    for slot in template:
        if slot[0] == "DET+CAT":
            word = random.choice(WORD_CATEGORIES[slot[1]])
            tokens.append(pick_det())
            tokens.append(word)
        elif slot[0] == "CAT":
            word = random.choice(WORD_CATEGORIES[slot[1]])
            tokens.append(word)
    return " ".join(tokens)
```
- `TEMPLATES`: 16 sentence templates encode the grammar rules. Templates define which slot types appear together: a NOUN_HUMAN can be followed by VERB_INTRAN (intransitive) or by VERB_TRAN + NOUN_OBJECT (transitive). The RNN must infer these rules from character sequences.
- `random.choice(WORD_CATEGORIES[slot[1]])`: Each word slot independently samples from its category. The RNN sees only the output ('the boy sleeps the cat chases a bird ...') without any template markers, just a continuous character stream.
- `return " ".join(tokens)`: Sentences are joined with spaces into one continuous stream. No sentence boundaries, no newlines. This is exactly Elman (1990)'s design: the model must discover that spaces mark word boundaries.

```python
def generate_corpus(num_sentences: int, seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    sentences = []
    for _ in range(num_sentences):
        tidx = int(np.random.choice(len(TEMPLATES), p=normalized_weights))
        sentence = generate_sentence(TEMPLATES[tidx])
        sentences.append(sentence)
    return sentences, template_counts


# Usage:
#   python generate_corpus.py --num-sentences 600 --seed 42
corpus_text = " ".join(sentences)
```
- `seed: int = 42`: A fixed seed ensures every learner gets an identical corpus. Reproducibility is non-negotiable in experimental science; 'the RNN discovers categories' is only meaningful if every run of the experiment gets the same result.
- `corpus_text = " ".join(sentences)`: 600 sentences joined into a continuous character stream of roughly 18,000 characters. This becomes the entire training set: load, encode, train, repeat.
- `np.random.choice(len(TEMPLATES), p=normalized_weights)`: Weighted template sampling: human-centric templates get 2x probability. This matches Elman's corpus design, where human-subject sentences are more common, making NOUN_HUMAN the best-separated cluster.
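The weighted sampling can be sketched in a few lines. Note that `normalized_weights` is never defined in the excerpt; the raw weights below are a hypothetical reconstruction that gives human-centric templates 2x probability:

```python
import numpy as np

# Hypothetical raw weights (not from the script): templates 0-1 are
# human-centric and get twice the mass of templates 2-3.
raw_weights = np.array([2.0, 2.0, 1.0, 1.0])
normalized_weights = raw_weights / raw_weights.sum()  # must sum to 1 for np.random.choice

rng = np.random.default_rng(42)
draws = rng.choice(len(raw_weights), size=6000, p=normalized_weights)
counts = np.bincount(draws, minlength=4)
# Human-centric templates (indices 0, 1) are drawn roughly twice as often.
```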
```python
def load_corpus(path: str) -> str:
    """Read corpus file, lowercase, strip extra whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    text = text.lower().strip()
    # Collapse multiple spaces to single space
    import re
    text = re.sub(r" +", " ", text)
    return text
```
- `Path(path).read_text(encoding="utf-8")`: Reads the entire file as a single string; no sentence boundaries are preserved.
- `re.sub(r" +", " ", text)`: The only preprocessing is lowercasing and whitespace collapsing; the model must learn word and sentence structure from the raw character stream.

```python
def build_vocab(text: str) -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
    """Extract unique characters, sort, create char↔index mappings."""
    chars = sorted(set(text))
    char2idx = {c: i for i, c in enumerate(chars)}
    idx2char = {i: c for i, c in enumerate(chars)}
    return chars, char2idx, idx2char
```
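The vocabulary construction is easy to see on a toy string; this is the same three-line recipe as `build_vocab`, inlined:

```python
# Toy run of the build_vocab logic: unique characters, sorted, with
# forward and reverse index maps.
text = "the cat sat"
chars = sorted(set(text))                       # [' ', 'a', 'c', 'e', 'h', 's', 't']
char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for i, c in enumerate(chars)}

encoded = [char2idx[c] for c in text]           # what encode() produces
decoded = "".join(idx2char[i] for i in encoded)  # round-trips back to the text
```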
- `sorted(set(text))`: Vocabulary = the unique characters, sorted for reproducibility.
- `char2idx`: Maps each character to an integer index (0..V-1) for one-hot encoding.
- `idx2char`: The reverse map, from integer index back to character, used during generation.

```python
def encode(text: str, char2idx: Dict[str, int]) -> List[int]:
    """Map each character to its index."""
    return [char2idx[c] for c in text]
```
- `char2idx[c]`: Converts each character to its integer index. A character missing from the vocabulary would raise a KeyError, but since the vocabulary is built from this same corpus, every character is present.

```python
def split_data(data: List[int], val_fraction: float = 0.1) -> Tuple[List[int], List[int]]:
    """
    Split encoded corpus into train/val.

    Uses the last val_fraction of the corpus for validation, matching the
    temporal ordering of the data (avoids data leakage from future context).
    """
    split_idx = int(len(data) * (1.0 - val_fraction))
    return data[:split_idx], data[split_idx:]


# ---------------------------------------------------------------------------
# Activation functions
# ---------------------------------------------------------------------------
```
- `val_fraction`: Reserves 10% of the corpus for validation loss, which monitors overfitting without a separate corpus.
- `int(len(data) * (1.0 - val_fraction))`: 90% training / 10% validation, split at a character boundary.
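A quick check of the split arithmetic on a toy stream makes the temporal ordering concrete: the validation slice is exactly the final 10%, so training never sees future context.

```python
# Toy run of the split_data logic on a 100-element "corpus".
data = list(range(100))
val_fraction = 0.1
split_idx = int(len(data) * (1.0 - val_fraction))
train, val = data[:split_idx], data[split_idx:]
# train covers indices 0..89, val covers 90..99: the tail of the stream.
```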
```python
class CharRNN:
    """Vanilla RNN for character-level language modeling.

    All parameters are raw nn.Parameter tensors. No nn.Module subclass,
    keeping it completely manual so every operation is visible in the code
    walkthrough.

    Architecture:
        x_t (V,1) --[W_xh]--> a_t = W_xh @ x_t + W_hh @ h_{t-1} + b_h
        h_t = activation(a_t)        <- H-dimensional hidden state
        o_t = W_hy @ h_t + b_y       <- output logits (V,1)
        p_t = softmax(o_t)           <- probability over next character
    """

    def __init__(self, vocab_size: int, hidden_size: int):
        self.vocab_size = vocab_size    # V ≈ 27 (characters a-z + space)
        self.hidden_size = hidden_size  # H = 64 (default)

        # Input-to-hidden: (H, V)
        self.W_xh = nn.Parameter(torch.empty(hidden_size, vocab_size))
        # Hidden-to-hidden recurrent: (H, H)
        self.W_hh = nn.Parameter(torch.empty(hidden_size, hidden_size))
        # Hidden-to-output: (V, H)
        self.W_hy = nn.Parameter(torch.empty(vocab_size, hidden_size))
        # Biases: initialized to zeros in constructor
        self.b_h = nn.Parameter(torch.zeros(hidden_size, 1))
        self.b_y = nn.Parameter(torch.zeros(vocab_size, 1))

    def parameters(self) -> List[nn.Parameter]:
        return [self.W_xh, self.W_hh, self.W_hy, self.b_h, self.b_y]

    def count_parameters(self) -> int:
        return sum(p.numel() for p in self.parameters())


# ---------------------------------------------------------------------------
# Initialization
# ---------------------------------------------------------------------------
```
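The parameter count is small enough to do by hand. For the default configuration (V = 27, H = 64), `count_parameters()` works out to:

```python
# Same sum count_parameters() computes: three weight matrices plus two biases.
V, H = 27, 64
n_params = (H * V) + (H * H) + (V * H) + H + V  # W_xh + W_hh + W_hy + b_h + b_y
# 1728 + 4096 + 1728 + 64 + 27 = 7643 parameters
```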
- `nn.Parameter(torch.empty(...))`: nn.Parameter is used only for weight storage and initialization; all math uses `.data.numpy()`.
- `W_xh`: Input-to-hidden, maps the one-hot character to a hidden activation (H×V).
- `W_hh`: Hidden-to-hidden recurrent weight, the core memory mechanism (H×H).
- `W_hy`: Hidden-to-output, maps the hidden state to next-character logits (V×H).
- `b_h`, `b_y`: Biases, initialized to zeros.

```python
def init_parameters(model: CharRNN, method: str = "orthogonal", seed: int = 42) -> None:
    """
    Initialize model parameters.

    Args:
        method: W_hh initialization strategy.
            "orthogonal" (default): spectral radius = 1.0, optimal for gradient flow
            "xavier"    : Xavier uniform for W_hh (not RNN-specific)
            "random"    : N(0, 0.01) random, worst case for gradient flow
        seed: torch manual seed

    Why orthogonal for W_hh:
        An orthogonal matrix has |eigenvalue| = 1 for all eigenvalues, so the
        Jacobian product in BPTT neither grows nor shrinks. This places the
        network at the optimal spectral radius for gradient flow, the
        RNN-specific initialization insight that Xavier alone does not provide.
    """
    torch.manual_seed(seed)
    V, H = model.vocab_size, model.hidden_size

    # Xavier uniform for input projection: keeps variance stable across layers
    bound_xh = math.sqrt(6.0 / (V + H))
    nn.init.uniform_(model.W_xh, -bound_xh, bound_xh)

    # W_hh initialization (method-dependent for ablation experiments)
    if method == "orthogonal":
        nn.init.orthogonal_(model.W_hh)
    elif method == "xavier":
        bound_hh = math.sqrt(6.0 / (H + H))
        nn.init.uniform_(model.W_hh, -bound_hh, bound_hh)
    elif method == "random":
        nn.init.normal_(model.W_hh, mean=0.0, std=0.01)
    else:
        raise ValueError(f"Unknown init method: {method}. Choose: orthogonal/xavier/random")

    # Xavier uniform for output projection
    bound_hy = math.sqrt(6.0 / (H + V))
    nn.init.uniform_(model.W_hy, -bound_hy, bound_hy)

    # Biases stay zero (already initialized in constructor, reset here for safety)
    nn.init.zeros_(model.b_h)
    nn.init.zeros_(model.b_y)


# ---------------------------------------------------------------------------
# Corpus preprocessing
# ---------------------------------------------------------------------------
```
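The "spectral radius = 1.0" claim is easy to verify numerically without PyTorch: taking the Q factor of a random matrix produces an orthogonal matrix (one standard way to generate them, in the spirit of `nn.init.orthogonal_`), and all of its eigenvalues sit on the unit circle.

```python
import numpy as np

# Generate a random orthogonal matrix via QR and check its spectral radius.
rng = np.random.default_rng(0)
H = 64
Q, _ = np.linalg.qr(rng.standard_normal((H, H)))

# Every eigenvalue of an orthogonal matrix has magnitude exactly 1,
# so the spectral radius max|lambda| is 1.0 (up to floating-point error).
spectral_radius = float(np.max(np.abs(np.linalg.eigvals(Q))))
```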
- `nn.init.orthogonal_`: Orthogonal init gives all eigenvalues |λ| = 1 at step 0, ideal gradient flow at initialization.
- Spectral radius = 1.0 means W_hh neither amplifies nor suppresses gradients.
- `nn.init.uniform_(..., -bound, bound)` with bound = √(6 / (fan_in + fan_out)): Xavier-uniform bounds for W_xh and W_hy; the variance scales with fan_in + fan_out.
- `nn.init.zeros_`: Biases start at zero, no initial bias toward any character class.

```python
def forward(
    model: CharRNN,
    inputs: List[int],
    h_prev: np.ndarray,
    activation: str = "tanh",
    use_embedding: bool = False,
    embedding: Optional[np.ndarray] = None,
) -> dict:
    """
    Forward pass through seq_len time steps.

    Args:
        model: CharRNN with parameters
        inputs: list of seq_len+1 character indices (input[0..T-1] → target[1..T])
        h_prev: hidden state from previous chunk, shape (H, 1)
        activation: activation function name ('tanh'/'relu'/'sigmoid')
        use_embedding: if True, use learned embedding instead of one-hot
        embedding: embedding matrix (H_emb, V) if use_embedding=True

    Returns dict with:
        xs: one-hot or embedded input vectors {t: (V_or_E, 1)}
        xs_indices: original character indices {t: int}
        hs: hidden states {t: (H,1)}, hs[-1] = h_prev
        os: output logits {t: (V,1)}
        ps: softmax probabilities {t: (V,1)}
        targets: list of target character indices
        loss: scalar total (unnormalized) cross-entropy loss

    Note on loss normalization:
        loss here is the SUM over all time steps (unnormalized). capture_step
        divides by seq_len to get the per-character loss. Gradient magnitudes
        in backward() reflect this unnormalized loss and scale linearly with
        seq_len, important to understand when comparing experiments with
        different chunk lengths (Plan 7, Experiment 3).
    """
    act_fn, _ = _activation_fn(activation)

    W_xh = model.W_xh.data.numpy()
    W_hh = model.W_hh.data.numpy()
    W_hy = model.W_hy.data.numpy()
    b_h = model.b_h.data.numpy()
    b_y = model.b_y.data.numpy()

    xs, xs_indices, hs, os, ps = {}, {}, {}, {}, {}
    hs[-1] = h_prev.copy()
    loss = 0.0

    seq_len = len(inputs) - 1  # predict inputs[1..T] from inputs[0..T-1]

    for t in range(seq_len):
        xs_indices[t] = inputs[t]

        if use_embedding and embedding is not None:
            # Learned embedding lookup: (E, 1)
            xs[t] = embedding[:, inputs[t]].reshape(-1, 1)
        else:
            # One-hot encoding: (V, 1)
            xs[t] = np.zeros((model.vocab_size, 1))
            xs[t][inputs[t]] = 1.0

        # h_t = activation(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
        pre_act = W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h
        hs[t] = act_fn(pre_act)

        # o_t = W_hy @ h_t + b_y
        os[t] = W_hy @ hs[t] + b_y

        # softmax with numerical stability: subtract max before exp
        exp_o = np.exp(os[t] - np.max(os[t]))
        ps[t] = exp_o / np.sum(exp_o)

        # cross-entropy: -log(p[target])
        target_t = inputs[t + 1]
        loss += -np.log(ps[t][target_t, 0] + 1e-12)

    targets = inputs[1:]
    return {
        "xs": xs,
        "xs_indices": xs_indices,
        "hs": hs,
        "os": os,
        "ps": ps,
        "targets": targets,
        "loss": loss,
    }


# ---------------------------------------------------------------------------
# Backward pass: Manual BPTT
# ---------------------------------------------------------------------------
```
- `W_xh @ xs[t] + W_hh @ hs[t-1] + b_h`: The pre-activation combines the current input AND the previous hidden state.
- `act_fn(pre_act)`: With tanh, the hidden state is squashed to (-1, +1); the derivative (1 - h²) causes vanishing gradients in BPTT.
- `hs[t]`: The hidden state is stored at every step, needed for BPTT and later analysis.
- `loss += -np.log(...)`: Cross-entropy loss penalizes the log probability of the correct next character.
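The vanishing-gradient mechanism mentioned above can be demonstrated in isolation: each backward step through the recurrence multiplies the gradient by `diag(1 - h²) @ W_hh.T`, and since every tanh-derivative factor is below 1, the gradient norm shrinks step after step. A minimal sketch (the orthogonal `W_hh` stand-in and the synthetic hidden state are illustrative assumptions, not values from the script):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 32
W_hh, _ = np.linalg.qr(rng.standard_normal((H, H)))  # orthogonal stand-in for W_hh
h = np.tanh(rng.standard_normal((H, 1)) * 2.0)       # a partially saturated hidden state
g = rng.standard_normal((H, 1))                      # gradient arriving at the last step

norms = []
for _ in range(20):
    # One step of the BPTT recurrence: dh_next = W_hh.T @ ((1 - h^2) * dh).
    g = W_hh.T @ ((1.0 - h ** 2) * g)
    norms.append(float(np.linalg.norm(g)))
# norms decreases monotonically: every (1 - h^2) factor is < 1, and the
# orthogonal W_hh preserves norm, so the product can only shrink.
```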
- `os[t] = W_hy @ hs[t] + b_y`: Output logits, unnormalized log-probabilities over the vocabulary.
- `os[t] - np.max(os[t])`: Numerically stable softmax, subtract the max before exponentiating to prevent overflow.
- `exp_o / np.sum(exp_o)`: Normalizes to a probability distribution over the V characters.
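Why the max-subtraction matters: softmax is invariant to adding a constant to every logit, so the shift changes nothing mathematically, but it keeps `np.exp` from overflowing on large logits.

```python
import numpy as np

def stable_softmax(o):
    # Identical result to naive softmax, but exp never sees a huge argument.
    e = np.exp(o - np.max(o))
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()  # exp overflows to inf -> nan
p = stable_softmax(logits)                         # well-defined probabilities
```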
- `-np.log(ps[t][target_t, 0] + 1e-12)`: Cross-entropy = -log(probability of the true next character); the 1e-12 guards against log(0).
- `loss / seq_len`: Average loss per character. Random guessing gives -log(1/V) ≈ 3.3 for V = 27.

```python
def backward(
    model: CharRNN,
    fwd: dict,
    clip: float = 5.0,
    activation: str = "tanh",
    freeze_whh: bool = False,
) -> dict:
    """
    Backpropagation Through Time: manual gradient computation.

    Derives gradients for all 5 parameters by propagating error backward
    through the unrolled computational graph. This is the Jacobian product
    computed incrementally: rather than forming the full product matrix,
    we multiply one factor at a time (see Part 7B of the plan for derivation).

    Key teaching points:
    - dh_next accumulator IS the "gradient flowing backward through time"
    - (1 - h^2) is the tanh derivative, always in [0,1] → causes shrinkage
    - W_hh.T @ dh_raw is the recurrent propagation → Jacobian product
    - per_step_grad_norm captures the vanishing gradient phenomenon directly

    Args:
        model: CharRNN
        fwd: output from forward()
        clip: global gradient norm clipping threshold (0 = no clipping)
        activation: activation function name (determines derivative)
        freeze_whh: if True, zero out dW_hh gradient (reservoir computing)

    Returns dict with gradient arrays and diagnostic statistics.
    """
    _, deriv_fn = _activation_fn(activation)

    xs, hs, ps, targets = fwd["xs"], fwd["hs"], fwd["ps"], fwd["targets"]
    seq_len = len(targets)

    W_hh = model.W_hh.data.numpy()
    W_hy = model.W_hy.data.numpy()

    dW_xh = np.zeros_like(model.W_xh.data.numpy())
    dW_hh = np.zeros_like(model.W_hh.data.numpy())
    dW_hy = np.zeros_like(model.W_hy.data.numpy())
    db_h = np.zeros_like(model.b_h.data.numpy())
    db_y = np.zeros_like(model.b_y.data.numpy())

    # dh_next: gradient from future time steps flowing backward
    dh_next = np.zeros((model.hidden_size, 1))
    per_step_grad_norm = []

    for t in reversed(range(seq_len)):
        # --- Output layer: softmax-CE gradient ---
        # d(loss)/d(o_t) = p_t - one_hot(target_t)
        do = ps[t].copy()
        do[targets[t]] -= 1.0

        dW_hy += do @ hs[t].T  # outer product
        db_y += do

        # --- Hidden state gradient: two contributions ---
        # 1. From output at t: W_hy^T @ do
        # 2. From future steps: dh_next (the recurrent term)
        dh = W_hy.T @ do + dh_next

        # Through activation: element-wise derivative
        dh_raw = deriv_fn(hs[t]) * dh

        dW_xh += dh_raw @ xs[t].T      # outer product with input
        dW_hh += dh_raw @ hs[t - 1].T  # outer product with prev hidden
        db_h += dh_raw

        # Propagate gradient to previous time step (the Jacobian factor)
        dh_next = W_hh.T @ dh_raw

        # Record gradient magnitude at this step for vanishing gradient viz
        per_step_grad_norm.append(float(np.linalg.norm(dh_next)))

    per_step_grad_norm.reverse()  # index 0 = earliest time step

    # Reservoir computing: freeze recurrent weights by zeroing their gradient
    if freeze_whh:
        dW_hh[:] = 0.0

    # --- Global norm gradient clipping ---
    grads = [dW_xh, dW_hh, dW_hy, db_h, db_y]
    total_grad_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    clipped = (clip > 0) and (total_grad_norm > clip)
    clip_ratio = 1.0
    if clipped:
        clip_ratio = clip / total_grad_norm
        for g in grads:
            g *= clip_ratio

    grad_norms = {
        "W_xh": float(np.linalg.norm(dW_xh)),
        "W_hh": float(np.linalg.norm(dW_hh)),
        "W_hy": float(np.linalg.norm(dW_hy)),
        "b_h": float(np.linalg.norm(db_h)),
        "b_y": float(np.linalg.norm(db_y)),
    }

    return {
        "dW_xh": dW_xh,
        "dW_hh": dW_hh,
        "dW_hy": dW_hy,
        "db_h": db_h,
        "db_y": db_y,
        "grad_norms": grad_norms,
        "total_grad_norm": total_grad_norm,
        "clipped": clipped,
        "clip_ratio": clip_ratio,
        "per_step_grad_norm": per_step_grad_norm,
    }


# ---------------------------------------------------------------------------
# Parameter update
# ---------------------------------------------------------------------------
```
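The elegant `p - one_hot(target)` gradient at the top of the loop is worth checking once against finite differences; this standalone sketch does exactly that for a single time step:

```python
import numpy as np

# Verify d(-log softmax(o)[target])/do == softmax(o) - one_hot(target)
# by central finite differences on random logits.
rng = np.random.default_rng(0)
V, target = 5, 2
o = rng.standard_normal(V)

def ce_loss(o):
    p = np.exp(o - o.max()); p /= p.sum()
    return -np.log(p[target])

p = np.exp(o - o.max()); p /= p.sum()
analytic = p.copy()
analytic[target] -= 1.0  # the do[targets[t]] -= 1.0 step

eps = 1e-6
numeric = np.array([
    (ce_loss(o + eps * np.eye(V)[i]) - ce_loss(o - eps * np.eye(V)[i])) / (2 * eps)
    for i in range(V)
])
```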
- `do[targets[t]] -= 1.0`: The softmax-cross-entropy gradient is p_t - one_hot(y_t); only the target index changes.
- `dh = W_hy.T @ do + dh_next`: The hidden gradient combines (a) the output error at step t and (b) the future gradient dh_next.
- `deriv_fn(hs[t]) * dh`: For tanh this is (1 - h²) * dh; the factor is always in [0, 1], the source of vanishing gradients.
- `dh_next = W_hh.T @ dh_raw`: THE core of BPTT: the recurrence propagates the gradient one step back through W_hh.
- `dW_hh += dh_raw @ hs[t - 1].T`: The W_hh gradient accumulates over all time steps in the sequence.
- `total_grad_norm = np.sqrt(...)`: The global norm is the L2 norm of ALL gradient components concatenated.
- `clip_ratio = clip / total_grad_norm`: When the norm exceeds the threshold, every gradient is scaled by one uniform factor.
- `total_grad_norm > clip`: Clipping only fires when necessary, which avoids distorting small, well-behaved gradients.
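A tiny numerical example shows why global-norm clipping is preferred here: one shared scale factor shrinks the update to the threshold while preserving the relative direction of every gradient component.

```python
import numpy as np

# Two toy gradient tensors with global norm sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clip = 5.0

total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
if total > clip:
    ratio = clip / total                 # 5/13, applied uniformly
    grads = [g * ratio for g in grads]

new_total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
# new_total == clip, and grads[0] still points in the (3, 4) direction.
```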
```python
def train(model: CharRNN, train_data: List[int], val_data: List[int], config: TrainConfig):
    """
    Main TBPTT training loop.

    TBPTT design:
    - Hidden state h carries FORWARD across chunk boundaries (memory preserved)
    - BPTT unrolls only within each chunk (gradient truncated at boundaries)
    - This IS the truncation: h.copy() detaches the gradient graph

    At corpus wrap, h is reset to zeros (fresh start for next epoch's pass).
    """
    output_dir = config.output_dir
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "steps").mkdir(exist_ok=True)
    (output_dir / "jacobians").mkdir(exist_ok=True)

    V = len(config.chars)
    H = config.hidden_size

    # Optionally initialize learned embedding (--use-embedding flag)
    embedding = None
    if config.use_embedding:
        embed_dim = 16
        embedding = np.random.randn(embed_dim, V) * 0.1

    # SGD momentum velocity (all zeros initially)
    velocity = {name: np.zeros_like(getattr(model, name).data.numpy())
                for name in ["W_xh", "W_hh", "W_hy", "b_h", "b_y"]}

    h = np.zeros((H, 1))
    pointer = 0

    loss_curve: List[dict] = []
    step_index: List[int] = []
    gf_summary: List[dict] = []
    recorded_data: List[dict] = []  # kept in memory for PCA at end
    recorded_step_count = 0
    jacobian_count = 0

    print("=" * 64)
    print(" RNN Character-Level Language Model: Training")
    print("=" * 64)
    print(f" Corpus: {config.corpus} ({len(train_data) + len(val_data):,} chars, {V} unique)")
    print(f" Architecture: vanilla_rnn (H={H}, V={V})")
    print(f" Parameters: {model.count_parameters():,}")
    W_hh_init_eigs = np.linalg.eigvals(model.W_hh.data.numpy())
    print(f" W_hh init: {config.init} (spectral_radius={np.max(np.abs(W_hh_init_eigs)):.3f})")
    print(f" Training: {config.steps:,} steps, seq_len={config.seq_len}, "
          f"lr={config.lr}, clip={config.clip}")
    print(f" TBPTT chunk: {config.seq_len} characters")
    print("=" * 64)

    t_start = time.time()

    for step in range(1, config.steps + 1):
        # Corpus wrap: reset hidden state and pointer
        if pointer + config.seq_len + 1 >= len(train_data):
            pointer = 0
            h = np.zeros((H, 1))

        inputs = train_data[pointer: pointer + config.seq_len + 1]
        pointer += config.seq_len

        # Current learning rate (step decay)
        lr = get_lr(step, config.lr, config.lr_decay_rate, config.lr_decay_every)

        # Forward pass: h carries over from previous chunk (TBPTT memory)
        fwd = forward(
            model, inputs, h,
            activation=config.activation,
            use_embedding=config.use_embedding,
            embedding=embedding,
        )

        # Detach: copy final hidden state forward but don't backprop through it.
        # This is the truncation in TBPTT; h "remembers" but gradients don't cross
        h = fwd["hs"][config.seq_len - 1].copy()

        # Backward pass (BPTT within this chunk only)
        grads = backward(
            model, fwd,
            clip=config.clip,
            activation=config.activation,
            freeze_whh=config.freeze_whh,
        )

        # SGD update
        velocity = _apply_update(model, grads, lr, velocity, config.momentum)

        # --- Loss curve (every step) ---
        avg_loss = float(fwd["loss"] / config.seq_len)

        # Validation loss (every val_interval steps)
        val_loss = None
        if len(val_data) > 0 and step % config.val_interval == 0:
            val_loss = compute_val_loss(
                model, val_data, config.seq_len,
                activation=config.activation,
            )

        loss_curve.append({
            "step": step,
            "loss": round(avg_loss, 6),
            "val_loss": round(val_loss, 6) if val_loss is not None else None,
            "lr": round(lr, 6),
        })

        # --- Gradient flow summary (every recorded step) ---
        W_hh_eigs = np.linalg.eigvals(model.W_hh.data.numpy())
        spectral_radius = float(np.max(np.abs(W_hh_eigs)))

        # --- State recording ---
        if should_record(step, config.dense_steps, config.record_interval, config.steps):
            record = capture_step(step, model, fwd, grads, config)
            save_step(record, output_dir)
            step_index.append(step)
            gf_summary.append({
                "step": step,
                "total_grad_norm": grads["total_grad_norm"],
                "clipped": grads["clipped"],
                "clip_ratio": grads["clip_ratio"],
                "spectral_radius": spectral_radius,
                "per_step_grad_norm": grads["per_step_grad_norm"],
                "grad_norms": grads["grad_norms"],
            })
            recorded_data.append(record)
            recorded_step_count += 1

        # --- Jacobian capture ---
        if step % config.jacobian_interval == 0:
            jac_t_start = max(0, config.seq_len - 1 - 10)
            jac_t_end = config.seq_len - 1
            jac = capture_jacobian(model, fwd, jac_t_start, jac_t_end)
            save_jacobian(jac, step, output_dir)
            jacobian_count += 1

        # --- Console progress ---
        if step <= 10 or step % 1000 == 0 or step == config.steps:
            clip_flag = "clipped" if grads["clipped"] else "       "
            print(
                f"step {step:6d}/{config.steps} | loss {avg_loss:.4f} | "
                f"grad_norm {grads['total_grad_norm']:6.2f} | {clip_flag} | "
                f"lr {lr:.4f} | rho {spectral_radius:.3f}"
            )
        if step % 5000 == 0 or step == config.steps:
            sample = generate(
                model, " ", 60,
                config.char2idx, config.idx2char,
                temperature=config.temperature,
                activation=config.activation,
            )
            print(f"--- Sample at step {step} (T={config.temperature}): \"{sample}\" ---")

    elapsed = time.time() - t_start
    print("=" * 64)
    print(f" Training complete in {elapsed:.1f}s. Output: {output_dir}")
    print(f" Steps recorded: {recorded_step_count} | Jacobian snapshots: {jacobian_count}")
    print(f" Final loss: {loss_curve[-1]['loss']:.4f}")
    print("=" * 64)

    return loss_curve, step_index, gf_summary, recorded_data


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
```
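The body of `get_lr` is not shown in this excerpt; given the call site `get_lr(step, config.lr, config.lr_decay_rate, config.lr_decay_every)` and the "step decay" comment, a plausible sketch (an assumption, not the script's actual implementation) would be:

```python
def get_lr(step: int, base_lr: float, decay_rate: float, decay_every: int) -> float:
    """Step decay sketch: multiply base_lr by decay_rate every decay_every steps."""
    return base_lr * (decay_rate ** (step // decay_every))
```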
- `h` — The hidden state carries across chunk boundaries, so the network has memory beyond `seq_len`.
- `inputs = train_data[pointer : pointer + config.seq_len + 1]` — Sliding window over the corpus; each chunk is `seq_len` characters (plus one extra character for the targets).
- `fwd = forward(...)` — Forward pass: returns the loss, hidden states, and predictions.
- `grads = backward(model, fwd, clip=config.clip, ...)` — Manual BPTT over the current chunk only.

```python
def _apply_update(model: CharRNN, grads: dict, lr: float, velocity: dict, momentum: float) -> dict:
    """SGD update with optional momentum."""
    param_names = ["W_xh", "W_hh", "W_hy", "b_h", "b_y"]
    for name in param_names:
        param = getattr(model, name)
        g = grads[f"d{name}"]
        if momentum > 0.0:
            v = momentum * velocity[name] - lr * g
            velocity[name] = v
            delta = torch.from_numpy(v).float()
        else:
            delta = torch.from_numpy(-lr * g).float()
        with torch.no_grad():
            param.data.add_(delta)
    return velocity
```
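The momentum branch of `_apply_update` implements classical (heavy-ball) momentum: `v = momentum * v - lr * g`, then `param += v`. A standalone sketch showing how the velocity accumulates when the gradient keeps pointing the same way (the toy values here are ours):

```python
import numpy as np

def momentum_step(param, grad, velocity, lr=0.1, momentum=0.9):
    """Classical (heavy-ball) momentum: v = m*v - lr*g, then p += v."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

p, v = np.array([1.0]), np.zeros(1)
g = np.array([1.0])  # a constant gradient, for illustration
for _ in range(3):
    p, v = momentum_step(p, g, v)
# The step size grows while the gradient direction is consistent:
# v after each step: -0.1, -0.19, -0.271
```

Note this is not the exponential-moving-average form `v = m*v + (1-m)*g`; the learning rate multiplies the gradient directly, so velocity is a decaying sum of past steps.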
- `velocity[name]` — Momentum accumulates the gradient direction, damping oscillations in the loss landscape.
- `v = momentum * velocity[name] - lr * g` — Classical (heavy-ball) momentum: the velocity is a decaying sum of past gradient steps, not a plain moving average of gradients.
- `param.data.add_(delta)` — SGD update: add the negative, scaled step to the parameter in place.

```python
def capture_step(
    step: int,
    model: CharRNN,
    fwd: dict,
    grads: dict,
    config,
) -> dict:
    """
    Capture full training state at a given step.

    Records all fields needed by the frontend internals page:
    weight matrices, eigenvalues, gradients, hidden state trajectories,
    prediction probabilities, saturation statistics, and a sample text.
    """
    W_hh_np = model.W_hh.data.numpy()
    eigenvalues = np.linalg.eigvals(W_hh_np)
    seq_len = len(fwd["targets"])

    return {
        "step": step,
        "loss": float(fwd["loss"] / seq_len),  # avg per-character cross-entropy

        # Full weight matrices (for weight evolution visualization)
        "W_xh": model.W_xh.data.numpy().tolist(),
        "W_hh": model.W_hh.data.numpy().tolist(),
        "W_hy": model.W_hy.data.numpy().tolist(),
        "b_h": model.b_h.data.numpy().ravel().tolist(),
        "b_y": model.b_y.data.numpy().ravel().tolist(),

        # W_hh spectral analysis (for spectral radius tracking)
        "w_hh_eigenvalues_real": eigenvalues.real.tolist(),
        "w_hh_eigenvalues_imag": eigenvalues.imag.tolist(),
        "spectral_radius": float(np.max(np.abs(eigenvalues))),

        # Gradient diagnostics
        "grad_norms": grads["grad_norms"],
        "total_grad_norm": grads["total_grad_norm"],
        "clipped": grads["clipped"],
        "clip_ratio": grads["clip_ratio"],
        "per_step_grad_norm": grads["per_step_grad_norm"],

        # Hidden state trajectory for this chunk (for PCA / trajectory viz)
        "hidden_states": [
            fwd["hs"][t].ravel().tolist() for t in range(seq_len)
        ],
        "input_chars": [
            config.idx2char[fwd["xs_indices"][t]] for t in range(seq_len)
        ],

        # Prediction distribution at each position (for prediction landscape)
        "predictions": [
            {
                config.idx2char[i]: round(float(fwd["ps"][t][i, 0]), 6)
                for i in range(model.vocab_size)
            }
            for t in range(seq_len)
        ],
        "target_chars": [
            config.idx2char[fwd["targets"][t]] for t in range(seq_len)
        ],

        # Per-position accuracy (for word-boundary error analysis)
        "correct_at_position": [
            int(np.argmax(fwd["ps"][t]) == fwd["targets"][t])
            for t in range(seq_len)
        ],

        # Fraction of hidden units saturated near ±1 (for saturation viz)
        "saturation": float(np.mean(
            np.abs(np.array([fwd["hs"][t].ravel() for t in range(seq_len)])) > 0.95
        )),

        # Sample text generated at this checkpoint
        "sample_text": generate(
            model, " ", config.generate_length,
            config.char2idx, config.idx2char,
            temperature=config.temperature,
            activation=config.activation,
        ),
    }
```
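`capture_step` stores all eigenvalues of `W_hh` and their largest magnitude as `spectral_radius`. A quick standalone check of what that number measures, using a made-up triangular matrix whose eigenvalues are simply its diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.triu(rng.normal(size=(4, 4)))           # upper triangular: eigenvalues = diagonal
np.fill_diagonal(W, [0.5, -0.8, 1.2, 0.3])

eigs = np.linalg.eigvals(W)
spectral_radius = float(np.max(np.abs(eigs)))  # 1.2 here
# rho > 1 means repeated multiplication by W can amplify some direction;
# rho < 1 means every direction eventually shrinks
```

For the trained RNN, tracking this value over snapshots shows whether `W_hh` drifts away from its orthogonal initialization (spectral radius exactly 1).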
- `should_record` — Dense recording early in training (`dense_steps`), then every `record_interval`-th step thereafter.
- `w_hh_eigenvalues_real` — Track the eigenvalues of `W_hh` at each snapshot: spectral-radius drift over training.
- `per_step_grad_norm` — Per-position gradient norm within the `seq_len` window; direct evidence of vanishing gradients.
- `saturation` — Fraction of hidden units with |h| > 0.95: how much of the tanh range sits in its flat, saturated regions.
```python
def generate(
    model: CharRNN,
    seed_char: str,
    length: int,
    char2idx: Dict[str, int],
    idx2char: Dict[int, str],
    temperature: float = 1.0,
    activation: str = "tanh",
) -> str:
    """
    Sample text character by character using temperature-scaled softmax.

    Temperature controls randomness:
      T < 1: sharper, more deterministic (greedy at T→0)
      T = 1: unmodified model distribution
      T > 1: flatter, more random (uniform at T→∞)
    """
    act_fn, _ = _activation_fn(activation)
    W_xh = model.W_xh.data.numpy()
    W_hh = model.W_hh.data.numpy()
    W_hy = model.W_hy.data.numpy()
    b_h = model.b_h.data.numpy()
    b_y = model.b_y.data.numpy()

    h = np.zeros((model.hidden_size, 1))
    idx = char2idx.get(seed_char, 0)
    result = [seed_char]

    for _ in range(length):
        x = np.zeros((model.vocab_size, 1))
        x[idx] = 1.0
        h = act_fn(W_xh @ x + W_hh @ h + b_h)
        o = W_hy @ h + b_y

        o = o / temperature
        exp_o = np.exp(o - np.max(o))
        p = exp_o / np.sum(exp_o)

        idx = int(np.random.choice(model.vocab_size, p=p.ravel()))
        result.append(idx2char[idx])

    return "".join(result)


# ---------------------------------------------------------------------------
# Validation loss
# ---------------------------------------------------------------------------
```
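The effect of dividing the logits by `temperature` before the softmax is easy to see on a toy distribution (the logit values here are made up for illustration):

```python
import numpy as np

def temperature_softmax(logits, T):
    """Softmax over logits / T, with the usual max-subtraction for numerical stability."""
    z = logits / T
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
base = temperature_softmax(logits, T=1.0)  # the model's own distribution
cold = temperature_softmax(logits, T=0.5)  # sharper: the top character gets more mass
hot = temperature_softmax(logits, T=2.0)   # flatter: closer to uniform
```

At T = 0.5 the gap between logits doubles before the exponential, so the mode dominates; at T = 2.0 the gap halves and sampling becomes noisier.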
- `temperature` — T < 1: conservative (sharper peaks); T = 1: the model's own distribution; T > 1: more random, "creative".
- `np.random.choice(model.vocab_size, p=p.ravel())` — Sample from the distribution rather than taking the argmax, so the output varies from run to run.
- `h = act_fn(W_xh @ x + W_hh @ h + b_h)` — Generation uses the same forward recurrence as training; there is no special generation mode.

```python
def save_hidden_state_pca(recorded_data: list, output_dir: Path) -> None:
    """
    Compute PCA on final-step hidden states, project all recorded steps to 2D.

    PCA is fit on the FINAL step's hidden states, then applied to all steps,
    matching the GloVe pattern of using final_pca as the reference frame.
    """
    from sklearn.decomposition import PCA

    if not recorded_data:
        return

    # Fit PCA on final step's hidden states
    final_step = recorded_data[-1]
    final_hs = np.array(final_step["hidden_states"])  # (seq_len, H)

    pca = PCA(n_components=2)
    pca.fit(final_hs)

    steps_pca = []
    for record in recorded_data:
        hs = np.array(record["hidden_states"])  # (seq_len, H)
        projected = pca.transform(hs)           # (seq_len, 2)
        steps_pca.append({
            "step": record["step"],
            "trajectory_2d": projected.tolist(),
            "input_chars": record["input_chars"],
        })

    result = {
        "pca_variance_explained": pca.explained_variance_ratio_.tolist(),
        "steps": steps_pca,
    }
    save_json(result, output_dir / "hidden_state_pca.json")


# ---------------------------------------------------------------------------
# Training loop
# ---------------------------------------------------------------------------
```
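The reference-frame trick above (fit PCA once on the final step, then project every checkpoint) can be reproduced with plain NumPy SVD, equivalent to sklearn's `PCA` up to component sign. The hidden states here are random stand-ins, not real trajectories:

```python
import numpy as np

def fit_pca_2d(X):
    """PCA via SVD: return the mean and the top-2 principal axes of X with shape (n, d)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:2]

def project_2d(X, mean, axes):
    """Project rows of X into the 2D frame defined by (mean, axes)."""
    return (X - mean) @ axes.T

rng = np.random.default_rng(0)
final_hs = rng.normal(size=(25, 64))    # stand-in for the final step's hidden states
earlier_hs = rng.normal(size=(25, 64))  # stand-in for an earlier checkpoint

mean, axes = fit_pca_2d(final_hs)                    # fit once, on the final step only
traj_final = project_2d(final_hs, mean, axes)        # (25, 2)
traj_earlier = project_2d(earlier_hs, mean, axes)    # same axes -> comparable frames
```

Fitting on the final step keeps the axes fixed, so trajectories from different training steps can be overlaid without the frame itself rotating between snapshots.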
- `PCA(n_components=2)` — Project the 64-dim hidden states to 2D for visualization.
- `pca.fit(final_hs)` — PCA is fit on the final step's hidden states only; every earlier checkpoint is then projected into that fixed reference frame.
- `trajectory_2d` — The 2D trajectory through hidden-state space as the model reads the sample sequence.
- `np.linalg.eigvals(W_hh)` — Compute all H eigenvalues of the recurrent weight matrix.
- `spectral_radius = float(np.max(np.abs(eigs)))` — Spectral radius = largest eigenvalue magnitude; > 1 means `W_hh` can amplify.

```python
def capture_jacobian(model: CharRNN, fwd: dict, t_start: int, t_end: int) -> dict:
    """
    Compute the Jacobian product dh_t/dh_k for k = t_start..t_end.

    This is the formal quantity whose spectral norm governs whether gradients
    vanish or explode. Saved periodically to jacobians/{step}.json for the
    JacobianInspector frontend component.

    For a sequence of length T, the Jacobian from step k to step t is:
        J_{t←k} = prod_{i=k+1}^{t} diag(1 - h_i^2) @ W_hh

    We accumulate this product incrementally (one factor per step).
    """
    W_hh = model.W_hh.data.numpy()
    product = np.eye(model.hidden_size)
    jacobian_records = []

    for t in range(t_start + 1, t_end + 1):
        diag = np.diag(1.0 - fwd["hs"][t].ravel() ** 2)
        J_t = diag @ W_hh
        product = J_t @ product

        eigs = np.linalg.eigvals(product)
        jacobian_records.append({
            "step": t,
            "jacobian_norm": float(np.linalg.norm(J_t)),
            "product_norm": float(np.linalg.norm(product)),
            "product_spectral_radius": float(np.max(np.abs(eigs))),
            "product_eigenvalues_real": eigs.real.tolist(),
        })

    return {
        "from_step": t_start,
        "to_step": t_end,
        "jacobians": jacobian_records,
        "final_product_norm": float(np.linalg.norm(product)),
    }


# ---------------------------------------------------------------------------
# Output / save functions
# ---------------------------------------------------------------------------
```
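The same product accumulation can be run standalone to watch gradients vanish. This sketch uses a deliberately small, made-up `W_hh` (not the trained one) so the product norm shrinks quickly:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16
W_hh = rng.normal(scale=0.1, size=(H, H))  # deliberately small recurrent weights

h = rng.uniform(-0.9, 0.9, size=H)
product = np.eye(H)
norms = []
for _ in range(20):
    J_t = np.diag(1.0 - h ** 2) @ W_hh     # single-step Jacobian for tanh
    product = J_t @ product                # J_t · J_{t-1} · ... · J_1
    norms.append(float(np.linalg.norm(product)))
    h = np.tanh(W_hh @ h)                  # advance the recurrence

# norms decays geometrically with the number of steps:
# the classic vanishing-gradient signature for small W_hh
```

With a `W_hh` whose spectral radius exceeds 1 and hidden states far from saturation, the same loop shows the opposite failure: the product norm explodes.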
- `J_t = np.diag(1.0 - fwd["hs"][t].ravel() ** 2) @ W_hh` — Single-step Jacobian: the tanh derivative times `W_hh`.
- `product = J_t @ product` — Accumulates the product J_t · J_{t-1} · … · J_1: measures how gradient signal shrinks over steps.
- `final_product_norm` — ‖J_T · … · J_1‖: if ≪ 1, gradients vanish; if ≫ 1, they explode.
```bash
# Generate corpus
python generate_corpus.py --num-sentences 900 --seed 42

# Train (5-10 min on CPU)
python train.py --steps 30000 --hidden-size 64 --seq-len 25 --lr 0.01 --clip 5.0 --seed 42

# Analyze
python analyze.py --data-dir output/ --save-json

# Explore interactively
python explore.py --data-dir output/ --interactive

# Expected at step 30000:
#   loss ~1.1   spectral_radius ~1.18   vocab_hit_rate ~0.88
```
$ git clone https://github.com/quantml/github-quantml.git
$ cd github-quantml/rnn/
$ pip install torch numpy scipy
$ python generate_corpus.py --num-sentences 600 --seed 42
$ python train.py --steps 30000 --hidden-size 64 --seq-len 25 --lr 0.01 --clip 5.0 --seed 42
$ python analyze.py --data-dir output/ --save-json
$ python explore.py --data-dir output/ --interactive
```text
================================================================
 RNN Character-Level Language Model -- Training
================================================================
 Corpus: corpus.txt (18,247 chars, 27 unique)
 Architecture: vanilla_rnn (H=64, V=27)
 Parameters: 7,579
 W_hh init: orthogonal (spectral_radius=1.000)
================================================================
step      1/30000 | loss 3.2958 | grad_norm  12.50 | clipped | lr 0.0100
step   5000/30000 | loss 1.8423 | grad_norm   2.31 |         | lr 0.0098
step  15000/30000 | loss 1.3107 | grad_norm   1.12 |         | lr 0.0095
step  30000/30000 | loss 1.1205 | grad_norm   0.85 |         | lr 0.0091
--- Sample: "the lion chases the mouse the girl reads the book" ---
================================================================
```
- `--hidden-size 16` — Weaker category emergence, higher loss
- `--hidden-size 128` — Sharper categories, ~4× slower
- `--seq-len 5` — Severe vanishing gradients; watch the memory probe
- `--seq-len 50` — Better memory, risk of gradient explosion
- `--clip 1.0` — Aggressive clipping; slow convergence
- `--lr 0.001` — Lower learning rate; more stable but slower

To test non-orthogonal initialization: replace `nn.init.orthogonal_` with `nn.init.xavier_uniform_` and observe the spectral radius deviation from 1.0.
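The difference between the two initializations can be checked directly. A NumPy sketch, using QR in place of `nn.init.orthogonal_` and reproducing Xavier-uniform's `U(-a, a)` bound by hand (the specifics of train.py's init path are assumed, not shown above):

```python
import numpy as np

rng = np.random.default_rng(42)
H = 64

# Orthogonal init via QR decomposition (standing in for nn.init.orthogonal_)
Q, _ = np.linalg.qr(rng.normal(size=(H, H)))

# Xavier-uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
a = np.sqrt(6.0 / (H + H))
W_xav = rng.uniform(-a, a, size=(H, H))

def rho(W):
    """Spectral radius: largest eigenvalue magnitude."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

rho_orth = rho(Q)      # exactly 1: orthogonal eigenvalues lie on the unit circle
rho_xav = rho(W_xav)   # only approximately 1, with random fluctuation
```

Orthogonal init pins every eigenvalue to the unit circle, so the network starts exactly at the edge of stability; Xavier only matches the right scale on average.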