The Code
A line-by-line walkthrough of train.py, from character encoding through the forward recurrence, manual BPTT, gradient clipping, and text generation. Every line annotated.
The CharRNN training script implements a vanilla recurrent neural network from scratch, using NumPy for the core computation and PyTorch only for parameter storage. Every line of BPTT is written by hand; no autograd.
```python
WORD_CATEGORIES = {
    "NOUN_HUMAN":   ["boy", "girl", "man", "woman", "king", "queen"],
    "NOUN_ANIMAL":  ["cat", "dog", "lion", "bird", "fish", "mouse"],
    "NOUN_FOOD":    ["bread", "cake", "apple", "milk", "cookie"],
    "NOUN_OBJECT":  ["book", "rock", "glass", "plate", "ball"],
    "VERB_INTRAN":  ["sleep", "walk", "think", "swim"],
    "VERB_TRAN":    ["chase", "eat", "see", "break", "read", "move"],
    "VERB_PERCEPT": ["smell", "hear", "watch"],
}
```
- `WORD_CATEGORIES`: Seven semantic categories of words. The grammar enforces that nouns appear in noun slots and verbs in verb slots; the RNN must discover this structure from characters alone.
- `"NOUN_HUMAN"`: The distributional hypothesis in action: 'boy', 'girl', 'man', 'woman', 'king', and 'queen' all appear in the same grammatical positions. The RNN will cluster them together without ever being told they are related.
- `"VERB_PERCEPT"`: Perception verbs form a sub-category that takes animate objects (NOUN_HUMAN or NOUN_ANIMAL), unlike VERB_TRAN, which can take NOUN_OBJECT too. Elman's Figure 7 shows the RNN discovers this fine-grained structure.

```python
TEMPLATES = [
    # 1. NOUN_HUMAN VERB_INTRAN
    [("DET+CAT", "NOUN_HUMAN"), ("CAT", "VERB_INTRAN")],
    # 2. NOUN_HUMAN VERB_TRAN NOUN_OBJECT
    [("DET+CAT", "NOUN_HUMAN"), ("CAT", "VERB_TRAN"), ("DET+CAT", "NOUN_OBJECT")],
    # ...16 templates total...
]


def generate_sentence(template: list) -> str:
    """Fill one template with random word choices."""
    tokens = []
    for slot in template:
        if slot[0] == "DET+CAT":
            word = random.choice(WORD_CATEGORIES[slot[1]])
            tokens.append(pick_det())
            tokens.append(word)
        elif slot[0] == "CAT":
            word = random.choice(WORD_CATEGORIES[slot[1]])
            tokens.append(word)
    return " ".join(tokens)
```
- `TEMPLATES`: 16 sentence templates encode the grammar rules. Templates define which slot types appear together: a NOUN_HUMAN can be followed by VERB_INTRAN (intransitive) or by VERB_TRAN + NOUN_OBJECT (transitive). The RNN must infer these rules from character sequences.
- `random.choice(WORD_CATEGORIES[slot[1]])`: Each word slot independently samples from its category. The RNN sees only the output ('the boy sleeps the cat chases a bird ...') without any template markers, just a continuous character stream.
- `return " ".join(tokens)`: Sentences are joined with spaces into one continuous stream. No sentence boundaries, no newlines. This is exactly Elman (1990)'s design: the model must discover that spaces mark word boundaries.

```python
def generate_corpus(num_sentences: int, seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    sentences = []
    for _ in range(num_sentences):
        tidx = int(np.random.choice(len(TEMPLATES), p=normalized_weights))
        sentence = generate_sentence(TEMPLATES[tidx])
        sentences.append(sentence)
    return sentences, template_counts


# Usage:
#   python generate_corpus.py --num-sentences 600 --seed 42
corpus_text = " ".join(sentences)
```
- `seed: int = 42`: A fixed seed ensures every learner gets an identical corpus. Reproducibility is non-negotiable in experimental science; 'the RNN discovers categories' is only meaningful if every run of the experiment gets the same result.
- `corpus_text = " ".join(sentences)`: 600 sentences joined into a continuous character stream of roughly 18,000 characters. This becomes the entire training set: load, encode, train, repeat.
- `np.random.choice(len(TEMPLATES), p=normalized_weights)`: Weighted template sampling: human-centric templates get 2x probability. This matches Elman's corpus design, where human-subject sentences are more common, making NOUN_HUMAN the best-separated cluster.
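The weighted sampling can be sketched in a few lines. Note that `normalized_weights` is never defined in the excerpt; the raw weights below are a hypothetical reconstruction that gives human-centric templates 2x probability:

```python
import numpy as np

# Hypothetical raw weights (not from the script): templates 0-1 are
# human-centric and get twice the mass of templates 2-3.
raw_weights = np.array([2.0, 2.0, 1.0, 1.0])
normalized_weights = raw_weights / raw_weights.sum()  # must sum to 1 for np.random.choice

rng = np.random.default_rng(42)
draws = rng.choice(len(raw_weights), size=6000, p=normalized_weights)
counts = np.bincount(draws, minlength=4)
# Human-centric templates (indices 0, 1) are drawn roughly twice as often.
```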
```python
def load_corpus(path: str) -> str:
    """Read corpus file, lowercase, strip extra whitespace."""
    text = Path(path).read_text(encoding="utf-8")
    text = text.lower().strip()
    # Collapse multiple spaces to single space
    import re
    text = re.sub(r" +", " ", text)
    return text
```
- `Path(path).read_text(encoding="utf-8")`: Reads the entire file as a single string; no sentence boundaries are preserved.
- `re.sub(r" +", " ", text)`: The only preprocessing is lowercasing and whitespace collapsing; the model must learn word and sentence structure from the raw character stream.

```python
def build_vocab(text: str) -> Tuple[List[str], Dict[str, int], Dict[int, str]]:
    """Extract unique characters, sort, create char↔index mappings."""
    chars = sorted(set(text))
    char2idx = {c: i for i, c in enumerate(chars)}
    idx2char = {i: c for i, c in enumerate(chars)}
    return chars, char2idx, idx2char
```
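The vocabulary construction is easy to see on a toy string; this is the same three-line recipe as `build_vocab`, inlined:

```python
# Toy run of the build_vocab logic: unique characters, sorted, with
# forward and reverse index maps.
text = "the cat sat"
chars = sorted(set(text))                       # [' ', 'a', 'c', 'e', 'h', 's', 't']
char2idx = {c: i for i, c in enumerate(chars)}
idx2char = {i: c for i, c in enumerate(chars)}

encoded = [char2idx[c] for c in text]           # what encode() produces
decoded = "".join(idx2char[i] for i in encoded)  # round-trips back to the text
```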
- `sorted(set(text))`: Vocabulary = the unique characters, sorted for reproducibility.
- `char2idx`: Maps each character to an integer index (0..V-1) for one-hot encoding.
- `idx2char`: The reverse map, from integer index back to character, used during generation.

```python
def encode(text: str, char2idx: Dict[str, int]) -> List[int]:
    """Map each character to its index."""
    return [char2idx[c] for c in text]
```
- `char2idx[c]`: Converts each character to its integer index. A character missing from the vocabulary would raise a KeyError, but since the vocabulary is built from this same corpus, every character is present.

```python
def split_data(data: List[int], val_fraction: float = 0.1) -> Tuple[List[int], List[int]]:
    """
    Split encoded corpus into train/val.

    Uses the last val_fraction of the corpus for validation, matching the
    temporal ordering of the data (avoids data leakage from future context).
    """
    split_idx = int(len(data) * (1.0 - val_fraction))
    return data[:split_idx], data[split_idx:]


# ---------------------------------------------------------------------------
# Activation functions
# ---------------------------------------------------------------------------
```
- `val_fraction`: Reserves 10% of the corpus for validation loss, which monitors overfitting without a separate corpus.
- `int(len(data) * (1.0 - val_fraction))`: 90% training / 10% validation, split at a character boundary.
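A quick check of the split arithmetic on a toy stream makes the temporal ordering concrete: the validation slice is exactly the final 10%, so training never sees future context.

```python
# Toy run of the split_data logic on a 100-element "corpus".
data = list(range(100))
val_fraction = 0.1
split_idx = int(len(data) * (1.0 - val_fraction))
train, val = data[:split_idx], data[split_idx:]
# train covers indices 0..89, val covers 90..99: the tail of the stream.
```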
```python
class CharRNN:
    """Vanilla RNN for character-level language modeling.

    All parameters are raw nn.Parameter tensors. No nn.Module subclass,
    keeping it completely manual so every operation is visible in the code
    walkthrough.

    Architecture:
        x_t (V,1) --[W_xh]--> a_t = W_xh @ x_t + W_hh @ h_{t-1} + b_h
        h_t = activation(a_t)        <- H-dimensional hidden state
        o_t = W_hy @ h_t + b_y       <- output logits (V,1)
        p_t = softmax(o_t)           <- probability over next character
    """

    def __init__(self, vocab_size: int, hidden_size: int):
        self.vocab_size = vocab_size    # V ≈ 27 (characters a-z + space)
        self.hidden_size = hidden_size  # H = 64 (default)

        # Input-to-hidden: (H, V)
        self.W_xh = nn.Parameter(torch.empty(hidden_size, vocab_size))
        # Hidden-to-hidden recurrent: (H, H)
        self.W_hh = nn.Parameter(torch.empty(hidden_size, hidden_size))
        # Hidden-to-output: (V, H)
        self.W_hy = nn.Parameter(torch.empty(vocab_size, hidden_size))
        # Biases: initialized to zeros in constructor
        self.b_h = nn.Parameter(torch.zeros(hidden_size, 1))
        self.b_y = nn.Parameter(torch.zeros(vocab_size, 1))

    def parameters(self) -> List[nn.Parameter]:
        return [self.W_xh, self.W_hh, self.W_hy, self.b_h, self.b_y]

    def count_parameters(self) -> int:
        return sum(p.numel() for p in self.parameters())


# ---------------------------------------------------------------------------
# Initialization
# ---------------------------------------------------------------------------
```
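The parameter count is small enough to do by hand. For the default configuration (V = 27, H = 64), `count_parameters()` works out to:

```python
# Same sum count_parameters() computes: three weight matrices plus two biases.
V, H = 27, 64
n_params = (H * V) + (H * H) + (V * H) + H + V  # W_xh + W_hh + W_hy + b_h + b_y
# 1728 + 4096 + 1728 + 64 + 27 = 7643 parameters
```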
- `nn.Parameter(torch.empty(...))`: nn.Parameter is used only for weight storage and initialization; all math uses `.data.numpy()`.
- `W_xh`: Input-to-hidden, maps the one-hot character to a hidden activation (H×V).
- `W_hh`: Hidden-to-hidden recurrent weight, the core memory mechanism (H×H).
- `W_hy`: Hidden-to-output, maps the hidden state to next-character logits (V×H).
- `b_h`, `b_y`: Biases, initialized to zeros.

```python
def init_parameters(model: CharRNN, method: str = "orthogonal", seed: int = 42) -> None:
    """
    Initialize model parameters.

    Args:
        method: W_hh initialization strategy.
            "orthogonal" (default): spectral radius = 1.0, optimal for gradient flow
            "xavier"    : Xavier uniform for W_hh (not RNN-specific)
            "random"    : N(0, 0.01) random, worst case for gradient flow
        seed: torch manual seed

    Why orthogonal for W_hh:
        An orthogonal matrix has |eigenvalue| = 1 for all eigenvalues, so the
        Jacobian product in BPTT neither grows nor shrinks. This places the
        network at the optimal spectral radius for gradient flow, the
        RNN-specific initialization insight that Xavier alone does not provide.
    """
    torch.manual_seed(seed)
    V, H = model.vocab_size, model.hidden_size

    # Xavier uniform for input projection: keeps variance stable across layers
    bound_xh = math.sqrt(6.0 / (V + H))
    nn.init.uniform_(model.W_xh, -bound_xh, bound_xh)

    # W_hh initialization (method-dependent for ablation experiments)
    if method == "orthogonal":
        nn.init.orthogonal_(model.W_hh)
    elif method == "xavier":
        bound_hh = math.sqrt(6.0 / (H + H))
        nn.init.uniform_(model.W_hh, -bound_hh, bound_hh)
    elif method == "random":
        nn.init.normal_(model.W_hh, mean=0.0, std=0.01)
    else:
        raise ValueError(f"Unknown init method: {method}. Choose: orthogonal/xavier/random")

    # Xavier uniform for output projection
    bound_hy = math.sqrt(6.0 / (H + V))
    nn.init.uniform_(model.W_hy, -bound_hy, bound_hy)

    # Biases stay zero (already initialized in constructor, reset here for safety)
    nn.init.zeros_(model.b_h)
    nn.init.zeros_(model.b_y)


# ---------------------------------------------------------------------------
# Corpus preprocessing
# ---------------------------------------------------------------------------
```
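The "spectral radius = 1.0" claim is easy to verify numerically without PyTorch: taking the Q factor of a random matrix produces an orthogonal matrix (one standard way to generate them, in the spirit of `nn.init.orthogonal_`), and all of its eigenvalues sit on the unit circle.

```python
import numpy as np

# Generate a random orthogonal matrix via QR and check its spectral radius.
rng = np.random.default_rng(0)
H = 64
Q, _ = np.linalg.qr(rng.standard_normal((H, H)))

# Every eigenvalue of an orthogonal matrix has magnitude exactly 1,
# so the spectral radius max|lambda| is 1.0 (up to floating-point error).
spectral_radius = float(np.max(np.abs(np.linalg.eigvals(Q))))
```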
- `nn.init.orthogonal_`: Orthogonal init gives all eigenvalues |λ| = 1 at step 0, ideal gradient flow at initialization.
- Spectral radius = 1.0 means W_hh neither amplifies nor suppresses gradients.
- `nn.init.uniform_(..., -bound, bound)` with bound = √(6 / (fan_in + fan_out)): Xavier-uniform bounds for W_xh and W_hy; the variance scales with fan_in + fan_out.
- `nn.init.zeros_`: Biases start at zero, no initial bias toward any character class.

```python
def forward(
    model: CharRNN,
    inputs: List[int],
    h_prev: np.ndarray,
    activation: str = "tanh",
    use_embedding: bool = False,
    embedding: Optional[np.ndarray] = None,
) -> dict:
    """
    Forward pass through seq_len time steps.

    Args:
        model: CharRNN with parameters
        inputs: list of seq_len+1 character indices (input[0..T-1] → target[1..T])
        h_prev: hidden state from previous chunk, shape (H, 1)
        activation: activation function name ('tanh'/'relu'/'sigmoid')
        use_embedding: if True, use learned embedding instead of one-hot
        embedding: embedding matrix (H_emb, V) if use_embedding=True

    Returns dict with:
        xs: one-hot or embedded input vectors {t: (V_or_E, 1)}
        xs_indices: original character indices {t: int}
        hs: hidden states {t: (H,1)}, hs[-1] = h_prev
        os: output logits {t: (V,1)}
        ps: softmax probabilities {t: (V,1)}
        targets: list of target character indices
        loss: scalar total (unnormalized) cross-entropy loss

    Note on loss normalization:
        loss here is the SUM over all time steps (unnormalized). capture_step
        divides by seq_len to get the per-character loss. Gradient magnitudes
        in backward() reflect this unnormalized loss and scale linearly with
        seq_len, important to understand when comparing experiments with
        different chunk lengths (Plan 7, Experiment 3).
    """
    act_fn, _ = _activation_fn(activation)

    W_xh = model.W_xh.data.numpy()
    W_hh = model.W_hh.data.numpy()
    W_hy = model.W_hy.data.numpy()
    b_h = model.b_h.data.numpy()
    b_y = model.b_y.data.numpy()

    xs, xs_indices, hs, os, ps = {}, {}, {}, {}, {}
    hs[-1] = h_prev.copy()
    loss = 0.0

    seq_len = len(inputs) - 1  # predict inputs[1..T] from inputs[0..T-1]

    for t in range(seq_len):
        xs_indices[t] = inputs[t]

        if use_embedding and embedding is not None:
            # Learned embedding lookup: (E, 1)
            xs[t] = embedding[:, inputs[t]].reshape(-1, 1)
        else:
            # One-hot encoding: (V, 1)
            xs[t] = np.zeros((model.vocab_size, 1))
            xs[t][inputs[t]] = 1.0

        # h_t = activation(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
        pre_act = W_xh @ xs[t] + W_hh @ hs[t - 1] + b_h
        hs[t] = act_fn(pre_act)

        # o_t = W_hy @ h_t + b_y
        os[t] = W_hy @ hs[t] + b_y

        # softmax with numerical stability: subtract max before exp
        exp_o = np.exp(os[t] - np.max(os[t]))
        ps[t] = exp_o / np.sum(exp_o)

        # cross-entropy: -log(p[target])
        target_t = inputs[t + 1]
        loss += -np.log(ps[t][target_t, 0] + 1e-12)

    targets = inputs[1:]
    return {
        "xs": xs,
        "xs_indices": xs_indices,
        "hs": hs,
        "os": os,
        "ps": ps,
        "targets": targets,
        "loss": loss,
    }


# ---------------------------------------------------------------------------
# Backward pass: Manual BPTT
# ---------------------------------------------------------------------------
```
- `W_xh @ xs[t] + W_hh @ hs[t-1] + b_h`: The pre-activation combines the current input AND the previous hidden state.
- `act_fn(pre_act)`: With tanh, the hidden state is squashed to (-1, +1); the derivative (1 - h²) causes vanishing gradients in BPTT.
- `hs[t]`: The hidden state is stored at every step, needed for BPTT and later analysis.
- `loss += -np.log(...)`: Cross-entropy loss penalizes the log probability of the correct next character.
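The vanishing-gradient mechanism mentioned above can be demonstrated in isolation: each backward step through the recurrence multiplies the gradient by `diag(1 - h²) @ W_hh.T`, and since every tanh-derivative factor is below 1, the gradient norm shrinks step after step. A minimal sketch (the orthogonal `W_hh` stand-in and the synthetic hidden state are illustrative assumptions, not values from the script):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 32
W_hh, _ = np.linalg.qr(rng.standard_normal((H, H)))  # orthogonal stand-in for W_hh
h = np.tanh(rng.standard_normal((H, 1)) * 2.0)       # a partially saturated hidden state
g = rng.standard_normal((H, 1))                      # gradient arriving at the last step

norms = []
for _ in range(20):
    # One step of the BPTT recurrence: dh_next = W_hh.T @ ((1 - h^2) * dh).
    g = W_hh.T @ ((1.0 - h ** 2) * g)
    norms.append(float(np.linalg.norm(g)))
# norms decreases monotonically: every (1 - h^2) factor is < 1, and the
# orthogonal W_hh preserves norm, so the product can only shrink.
```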
- `os[t] = W_hy @ hs[t] + b_y`: Output logits, unnormalized log-probabilities over the vocabulary.
- `os[t] - np.max(os[t])`: Numerically stable softmax, subtract the max before exponentiating to prevent overflow.
- `exp_o / np.sum(exp_o)`: Normalizes to a probability distribution over the V characters.
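Why the max-subtraction matters: softmax is invariant to adding a constant to every logit, so the shift changes nothing mathematically, but it keeps `np.exp` from overflowing on large logits.

```python
import numpy as np

def stable_softmax(o):
    # Identical result to naive softmax, but exp never sees a huge argument.
    e = np.exp(o - np.max(o))
    return e / e.sum()

logits = np.array([1000.0, 1001.0, 1002.0])
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(logits) / np.exp(logits).sum()  # exp overflows to inf -> nan
p = stable_softmax(logits)                         # well-defined probabilities
```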
- `-np.log(ps[t][target_t, 0] + 1e-12)`: Cross-entropy = -log(probability of the true next character); the 1e-12 guards against log(0).
- `loss / seq_len`: Average loss per character. Random guessing gives -log(1/V) ≈ 3.3 for V = 27.

```python
def backward(
    model: CharRNN,
    fwd: dict,
    clip: float = 5.0,
    activation: str = "tanh",
    freeze_whh: bool = False,
) -> dict:
    """
    Backpropagation Through Time: manual gradient computation.

    Derives gradients for all 5 parameters by propagating error backward
    through the unrolled computational graph. This is the Jacobian product
    computed incrementally: rather than forming the full product matrix,
    we multiply one factor at a time (see Part 7B of the plan for derivation).

    Key teaching points:
    - dh_next accumulator IS the "gradient flowing backward through time"
    - (1 - h^2) is the tanh derivative, always in [0,1] → causes shrinkage
    - W_hh.T @ dh_raw is the recurrent propagation → Jacobian product
    - per_step_grad_norm captures the vanishing gradient phenomenon directly

    Args:
        model: CharRNN
        fwd: output from forward()
        clip: global gradient norm clipping threshold (0 = no clipping)
        activation: activation function name (determines derivative)
        freeze_whh: if True, zero out dW_hh gradient (reservoir computing)

    Returns dict with gradient arrays and diagnostic statistics.
    """
    _, deriv_fn = _activation_fn(activation)

    xs, hs, ps, targets = fwd["xs"], fwd["hs"], fwd["ps"], fwd["targets"]
    seq_len = len(targets)

    W_hh = model.W_hh.data.numpy()
    W_hy = model.W_hy.data.numpy()

    dW_xh = np.zeros_like(model.W_xh.data.numpy())
    dW_hh = np.zeros_like(model.W_hh.data.numpy())
    dW_hy = np.zeros_like(model.W_hy.data.numpy())
    db_h = np.zeros_like(model.b_h.data.numpy())
    db_y = np.zeros_like(model.b_y.data.numpy())

    # dh_next: gradient from future time steps flowing backward
    dh_next = np.zeros((model.hidden_size, 1))
    per_step_grad_norm = []

    for t in reversed(range(seq_len)):
        # --- Output layer: softmax-CE gradient ---
        # d(loss)/d(o_t) = p_t - one_hot(target_t)
        do = ps[t].copy()
        do[targets[t]] -= 1.0

        dW_hy += do @ hs[t].T  # outer product
        db_y += do

        # --- Hidden state gradient: two contributions ---
        # 1. From output at t: W_hy^T @ do
        # 2. From future steps: dh_next (the recurrent term)
        dh = W_hy.T @ do + dh_next

        # Through activation: element-wise derivative
        dh_raw = deriv_fn(hs[t]) * dh

        dW_xh += dh_raw @ xs[t].T      # outer product with input
        dW_hh += dh_raw @ hs[t - 1].T  # outer product with prev hidden
        db_h += dh_raw

        # Propagate gradient to previous time step (the Jacobian factor)
        dh_next = W_hh.T @ dh_raw

        # Record gradient magnitude at this step for vanishing gradient viz
        per_step_grad_norm.append(float(np.linalg.norm(dh_next)))

    per_step_grad_norm.reverse()  # index 0 = earliest time step

    # Reservoir computing: freeze recurrent weights by zeroing their gradient
    if freeze_whh:
        dW_hh[:] = 0.0

    # --- Global norm gradient clipping ---
    grads = [dW_xh, dW_hh, dW_hy, db_h, db_y]
    total_grad_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    clipped = (clip > 0) and (total_grad_norm > clip)
    clip_ratio = 1.0
    if clipped:
        clip_ratio = clip / total_grad_norm
        for g in grads:
            g *= clip_ratio

    grad_norms = {
        "W_xh": float(np.linalg.norm(dW_xh)),
        "W_hh": float(np.linalg.norm(dW_hh)),
        "W_hy": float(np.linalg.norm(dW_hy)),
        "b_h": float(np.linalg.norm(db_h)),
        "b_y": float(np.linalg.norm(db_y)),
    }

    return {
        "dW_xh": dW_xh,
        "dW_hh": dW_hh,
        "dW_hy": dW_hy,
        "db_h": db_h,
        "db_y": db_y,
        "grad_norms": grad_norms,
        "total_grad_norm": total_grad_norm,
        "clipped": clipped,
        "clip_ratio": clip_ratio,
        "per_step_grad_norm": per_step_grad_norm,
    }


# ---------------------------------------------------------------------------
# Parameter update
# ---------------------------------------------------------------------------
```
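The elegant `p - one_hot(target)` gradient at the top of the loop is worth checking once against finite differences; this standalone sketch does exactly that for a single time step:

```python
import numpy as np

# Verify d(-log softmax(o)[target])/do == softmax(o) - one_hot(target)
# by central finite differences on random logits.
rng = np.random.default_rng(0)
V, target = 5, 2
o = rng.standard_normal(V)

def ce_loss(o):
    p = np.exp(o - o.max()); p /= p.sum()
    return -np.log(p[target])

p = np.exp(o - o.max()); p /= p.sum()
analytic = p.copy()
analytic[target] -= 1.0  # the do[targets[t]] -= 1.0 step

eps = 1e-6
numeric = np.array([
    (ce_loss(o + eps * np.eye(V)[i]) - ce_loss(o - eps * np.eye(V)[i])) / (2 * eps)
    for i in range(V)
])
```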
- `do[targets[t]] -= 1.0`: The softmax-cross-entropy gradient is p_t - one_hot(y_t); only the target index changes.
- `dh = W_hy.T @ do + dh_next`: The hidden gradient combines (a) the output error at step t and (b) the future gradient dh_next.
- `deriv_fn(hs[t]) * dh`: For tanh this is (1 - h²) * dh; the factor is always in [0, 1], the source of vanishing gradients.
- `dh_next = W_hh.T @ dh_raw`: THE core of BPTT: the recurrence propagates the gradient one step back through W_hh.
- `dW_hh += dh_raw @ hs[t - 1].T`: The W_hh gradient accumulates over all time steps in the sequence.
- `total_grad_norm = np.sqrt(...)`: The global norm is the L2 norm of ALL gradient components concatenated.
- `clip_ratio = clip / total_grad_norm`: When the norm exceeds the threshold, every gradient is scaled by one uniform factor.
- `total_grad_norm > clip`: Clipping only fires when necessary, which avoids distorting small, well-behaved gradients.
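A tiny numerical example shows why global-norm clipping is preferred here: one shared scale factor shrinks the update to the threshold while preserving the relative direction of every gradient component.

```python
import numpy as np

# Two toy gradient tensors with global norm sqrt(3^2 + 4^2 + 12^2) = 13.
grads = [np.array([3.0, 4.0]), np.array([12.0])]
clip = 5.0

total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
if total > clip:
    ratio = clip / total                 # 5/13, applied uniformly
    grads = [g * ratio for g in grads]

new_total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
# new_total == clip, and grads[0] still points in the (3, 4) direction.
```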
```python
def train(model: CharRNN, train_data: List[int], val_data: List[int], config: TrainConfig):
    """
    Main TBPTT training loop.

    TBPTT design:
    - Hidden state h carries FORWARD across chunk boundaries (memory preserved)
    - BPTT unrolls only within each chunk (gradient truncated at boundaries)
    - This IS the truncation: h.copy() detaches the gradient graph

    At corpus wrap, h is reset to zeros (fresh start for next epoch's pass).
    """
    output_dir = config.output_dir
    output_dir.mkdir(parents=True, exist_ok=True)
    (output_dir / "steps").mkdir(exist_ok=True)
    (output_dir / "jacobians").mkdir(exist_ok=True)

    V = len(config.chars)
    H = config.hidden_size

    # Optionally initialize learned embedding (--use-embedding flag)
    embedding = None
    if config.use_embedding:
        embed_dim = 16
        embedding = np.random.randn(embed_dim, V) * 0.1

    # SGD momentum velocity (all zeros initially)
    velocity = {name: np.zeros_like(getattr(model, name).data.numpy())
                for name in ["W_xh", "W_hh", "W_hy", "b_h", "b_y"]}

    h = np.zeros((H, 1))
    pointer = 0

    loss_curve: List[dict] = []
    step_index: List[int] = []
    gf_summary: List[dict] = []
    recorded_data: List[dict] = []  # kept in memory for PCA at end
    recorded_step_count = 0
    jacobian_count = 0

    print("=" * 64)
    print(" RNN Character-Level Language Model: Training")
    print("=" * 64)
    print(f" Corpus: {config.corpus} ({len(train_data) + len(val_data):,} chars, {V} unique)")
    print(f" Architecture: vanilla_rnn (H={H}, V={V})")
    print(f" Parameters: {model.count_parameters():,}")
    W_hh_init_eigs = np.linalg.eigvals(model.W_hh.data.numpy())
    print(f" W_hh init: {config.init} (spectral_radius={np.max(np.abs(W_hh_init_eigs)):.3f})")
    print(f" Training: {config.steps:,} steps, seq_len={config.seq_len}, "
          f"lr={config.lr}, clip={config.clip}")
    print(f" TBPTT chunk: {config.seq_len} characters")
    print("=" * 64)

    t_start = time.time()

    for step in range(1, config.steps + 1):
        # Corpus wrap: reset hidden state and pointer
        if pointer + config.seq_len + 1 >= len(train_data):
            pointer = 0
            h = np.zeros((H, 1))

        inputs = train_data[pointer: pointer + config.seq_len + 1]
        pointer += config.seq_len

        # Current learning rate (step decay)
        lr = get_lr(step, config.lr, config.lr_decay_rate, config.lr_decay_every)

        # Forward pass: h carries over from previous chunk (TBPTT memory)
        fwd = forward(
            model, inputs, h,
            activation=config.activation,
            use_embedding=config.use_embedding,
            embedding=embedding,
        )

        # Detach: copy final hidden state forward but don't backprop through it.
        # This is the truncation in TBPTT; h "remembers" but gradients don't cross
        h = fwd["hs"][config.seq_len - 1].copy()

        # Backward pass (BPTT within this chunk only)
        grads = backward(
            model, fwd,
            clip=config.clip,
            activation=config.activation,
            freeze_whh=config.freeze_whh,
        )

        # SGD update
        velocity = _apply_update(model, grads, lr, velocity, config.momentum)

        # --- Loss curve (every step) ---
        avg_loss = float(fwd["loss"] / config.seq_len)

        # Validation loss (every val_interval steps)
        val_loss = None
        if len(val_data) > 0 and step % config.val_interval == 0:
            val_loss = compute_val_loss(
                model, val_data, config.seq_len,
                activation=config.activation,
            )

        loss_curve.append({
            "step": step,
            "loss": round(avg_loss, 6),
            "val_loss": round(val_loss, 6) if val_loss is not None else None,
            "lr": round(lr, 6),
        })

        # --- Gradient flow summary (every recorded step) ---
        W_hh_eigs = np.linalg.eigvals(model.W_hh.data.numpy())
        spectral_radius = float(np.max(np.abs(W_hh_eigs)))

        # --- State recording ---
        if should_record(step, config.dense_steps, config.record_interval, config.steps):
            record = capture_step(step, model, fwd, grads, config)
            save_step(record, output_dir)
            step_index.append(step)
            gf_summary.append({
                "step": step,
                "total_grad_norm": grads["total_grad_norm"],
                "clipped": grads["clipped"],
                "clip_ratio": grads["clip_ratio"],
                "spectral_radius": spectral_radius,
                "per_step_grad_norm": grads["per_step_grad_norm"],
                "grad_norms": grads["grad_norms"],
            })
            recorded_data.append(record)
            recorded_step_count += 1

        # --- Jacobian capture ---
        if step % config.jacobian_interval == 0:
            jac_t_start = max(0, config.seq_len - 1 - 10)
            jac_t_end = config.seq_len - 1
            jac = capture_jacobian(model, fwd, jac_t_start, jac_t_end)
            save_jacobian(jac, step, output_dir)
            jacobian_count += 1

        # --- Console progress ---
        if step <= 10 or step % 1000 == 0 or step == config.steps:
            clip_flag = "clipped" if grads["clipped"] else "       "
            print(
                f"step {step:6d}/{config.steps} | loss {avg_loss:.4f} | "
                f"grad_norm {grads['total_grad_norm']:6.2f} | {clip_flag} | "
                f"lr {lr:.4f} | rho {spectral_radius:.3f}"
            )
        if step % 5000 == 0 or step == config.steps:
            sample = generate(
                model, " ", 60,
                config.char2idx, config.idx2char,
                temperature=config.temperature,
                activation=config.activation,
            )
            print(f"--- Sample at step {step} (T={config.temperature}): \"{sample}\" ---")

    elapsed = time.time() - t_start
    print("=" * 64)
    print(f" Training complete in {elapsed:.1f}s. Output: {output_dir}")
    print(f" Steps recorded: {recorded_step_count} | Jacobian snapshots: {jacobian_count}")
    print(f" Final loss: {loss_curve[-1]['loss']:.4f}")
    print("=" * 64)

    return loss_curve, step_index, gf_summary, recorded_data


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------
```
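The body of `get_lr` is not shown in this excerpt; given the call site `get_lr(step, config.lr, config.lr_decay_rate, config.lr_decay_every)` and the "step decay" comment, a plausible sketch (an assumption, not the script's actual implementation) would be:

```python
def get_lr(step: int, base_lr: float, decay_rate: float, decay_every: int) -> float:
    """Step decay sketch: multiply base_lr by decay_rate every decay_every steps."""
    return base_lr * (decay_rate ** (step // decay_every))
```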
- `h` — The hidden state carries across chunk boundaries, so the network has memory beyond `seq_len`.
- `inputs = train_data[pointer : pointer + config.seq_len + 1]` — Sliding window over the corpus; each chunk is `seq_len` characters (plus one extra character for the targets).
- `fwd = forward(...)` — Forward pass: returns the loss, hidden states, and predictions.
- `grads = backward(model, fwd, clip=config.clip, ...)` — Manual BPTT over the current chunk only.

```python
def _apply_update(model: CharRNN, grads: dict, lr: float, velocity: dict, momentum: float) -> dict:
    """SGD update with optional momentum."""
    param_names = ["W_xh", "W_hh", "W_hy", "b_h", "b_y"]
    for name in param_names:
        param = getattr(model, name)
        g = grads[f"d{name}"]
        if momentum > 0.0:
            v = momentum * velocity[name] - lr * g
            velocity[name] = v
            delta = torch.from_numpy(v).float()
        else:
            delta = torch.from_numpy(-lr * g).float()
        with torch.no_grad():
            param.data.add_(delta)
    return velocity
```
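The momentum branch of `_apply_update` implements classical (heavy-ball) momentum: `v = momentum * v - lr * g`, then `param += v`. A standalone sketch showing how the velocity accumulates when the gradient keeps pointing the same way (the toy values here are ours):

```python
import numpy as np

def momentum_step(param, grad, velocity, lr=0.1, momentum=0.9):
    """Classical (heavy-ball) momentum: v = m*v - lr*g, then p += v."""
    velocity = momentum * velocity - lr * grad
    return param + velocity, velocity

p, v = np.array([1.0]), np.zeros(1)
g = np.array([1.0])  # a constant gradient, for illustration
for _ in range(3):
    p, v = momentum_step(p, g, v)
# The step size grows while the gradient direction is consistent:
# v after each step: -0.1, -0.19, -0.271
```

Note this is not the exponential-moving-average form `v = m*v + (1-m)*g`; the learning rate multiplies the gradient directly, so velocity is a decaying sum of past steps.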
- `velocity[name]` — Momentum accumulates the gradient direction, damping oscillations in the loss landscape.
- `v = momentum * velocity[name] - lr * g` — Classical (heavy-ball) momentum: the velocity is a decaying sum of past gradient steps, not a plain moving average of gradients.
- `param.data.add_(delta)` — SGD update: add the negative, scaled step to the parameter in place.

```python
def capture_step(
    step: int,
    model: CharRNN,
    fwd: dict,
    grads: dict,
    config,
) -> dict:
    """
    Capture full training state at a given step.

    Records all fields needed by the frontend internals page:
    weight matrices, eigenvalues, gradients, hidden state trajectories,
    prediction probabilities, saturation statistics, and a sample text.
    """
    W_hh_np = model.W_hh.data.numpy()
    eigenvalues = np.linalg.eigvals(W_hh_np)
    seq_len = len(fwd["targets"])

    return {
        "step": step,
        "loss": float(fwd["loss"] / seq_len),  # avg per-character cross-entropy

        # Full weight matrices (for weight evolution visualization)
        "W_xh": model.W_xh.data.numpy().tolist(),
        "W_hh": model.W_hh.data.numpy().tolist(),
        "W_hy": model.W_hy.data.numpy().tolist(),
        "b_h": model.b_h.data.numpy().ravel().tolist(),
        "b_y": model.b_y.data.numpy().ravel().tolist(),

        # W_hh spectral analysis (for spectral radius tracking)
        "w_hh_eigenvalues_real": eigenvalues.real.tolist(),
        "w_hh_eigenvalues_imag": eigenvalues.imag.tolist(),
        "spectral_radius": float(np.max(np.abs(eigenvalues))),

        # Gradient diagnostics
        "grad_norms": grads["grad_norms"],
        "total_grad_norm": grads["total_grad_norm"],
        "clipped": grads["clipped"],
        "clip_ratio": grads["clip_ratio"],
        "per_step_grad_norm": grads["per_step_grad_norm"],

        # Hidden state trajectory for this chunk (for PCA / trajectory viz)
        "hidden_states": [
            fwd["hs"][t].ravel().tolist() for t in range(seq_len)
        ],
        "input_chars": [
            config.idx2char[fwd["xs_indices"][t]] for t in range(seq_len)
        ],

        # Prediction distribution at each position (for prediction landscape)
        "predictions": [
            {
                config.idx2char[i]: round(float(fwd["ps"][t][i, 0]), 6)
                for i in range(model.vocab_size)
            }
            for t in range(seq_len)
        ],
        "target_chars": [
            config.idx2char[fwd["targets"][t]] for t in range(seq_len)
        ],

        # Per-position accuracy (for word-boundary error analysis)
        "correct_at_position": [
            int(np.argmax(fwd["ps"][t]) == fwd["targets"][t])
            for t in range(seq_len)
        ],

        # Fraction of hidden units saturated near ±1 (for saturation viz)
        "saturation": float(np.mean(
            np.abs(np.array([fwd["hs"][t].ravel() for t in range(seq_len)])) > 0.95
        )),

        # Sample text generated at this checkpoint
        "sample_text": generate(
            model, " ", config.generate_length,
            config.char2idx, config.idx2char,
            temperature=config.temperature,
            activation=config.activation,
        ),
    }
```
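`capture_step` stores all eigenvalues of `W_hh` and their largest magnitude as `spectral_radius`. A quick standalone check of what that number measures, using a made-up triangular matrix whose eigenvalues are simply its diagonal entries:

```python
import numpy as np

rng = np.random.default_rng(0)
W = np.triu(rng.normal(size=(4, 4)))           # upper triangular: eigenvalues = diagonal
np.fill_diagonal(W, [0.5, -0.8, 1.2, 0.3])

eigs = np.linalg.eigvals(W)
spectral_radius = float(np.max(np.abs(eigs)))  # 1.2 here
# rho > 1 means repeated multiplication by W can amplify some direction;
# rho < 1 means every direction eventually shrinks
```

For the trained RNN, tracking this value over snapshots shows whether `W_hh` drifts away from its orthogonal initialization (spectral radius exactly 1).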
- `should_record` — Dense recording early in training (`dense_steps`), then every `record_interval`-th step thereafter.
- `w_hh_eigenvalues_real` — Track the eigenvalues of `W_hh` at each snapshot: spectral-radius drift over training.
- `per_step_grad_norm` — Per-position gradient norm within the `seq_len` window; direct evidence of vanishing gradients.
- `saturation` — Fraction of hidden units with |h| > 0.95: how much of the tanh range sits in its flat, saturated regions.
```python
def generate(
    model: CharRNN,
    seed_char: str,
    length: int,
    char2idx: Dict[str, int],
    idx2char: Dict[int, str],
    temperature: float = 1.0,
    activation: str = "tanh",
) -> str:
    """
    Sample text character by character using temperature-scaled softmax.

    Temperature controls randomness:
      T < 1: sharper, more deterministic (greedy at T→0)
      T = 1: unmodified model distribution
      T > 1: flatter, more random (uniform at T→∞)
    """
    act_fn, _ = _activation_fn(activation)
    W_xh = model.W_xh.data.numpy()
    W_hh = model.W_hh.data.numpy()
    W_hy = model.W_hy.data.numpy()
    b_h = model.b_h.data.numpy()
    b_y = model.b_y.data.numpy()

    h = np.zeros((model.hidden_size, 1))
    idx = char2idx.get(seed_char, 0)
    result = [seed_char]

    for _ in range(length):
        x = np.zeros((model.vocab_size, 1))
        x[idx] = 1.0
        h = act_fn(W_xh @ x + W_hh @ h + b_h)
        o = W_hy @ h + b_y

        o = o / temperature
        exp_o = np.exp(o - np.max(o))
        p = exp_o / np.sum(exp_o)

        idx = int(np.random.choice(model.vocab_size, p=p.ravel()))
        result.append(idx2char[idx])

    return "".join(result)


# ---------------------------------------------------------------------------
# Validation loss
# ---------------------------------------------------------------------------
```
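The effect of dividing the logits by `temperature` before the softmax is easy to see on a toy distribution (the logit values here are made up for illustration):

```python
import numpy as np

def temperature_softmax(logits, T):
    """Softmax over logits / T, with the usual max-subtraction for numerical stability."""
    z = logits / T
    e = np.exp(z - np.max(z))
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
base = temperature_softmax(logits, T=1.0)  # the model's own distribution
cold = temperature_softmax(logits, T=0.5)  # sharper: the top character gets more mass
hot = temperature_softmax(logits, T=2.0)   # flatter: closer to uniform
```

At T = 0.5 the gap between logits doubles before the exponential, so the mode dominates; at T = 2.0 the gap halves and sampling becomes noisier.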
- `temperature` — T < 1: conservative (sharper peaks); T = 1: the model's own distribution; T > 1: more random, "creative".
- `np.random.choice(model.vocab_size, p=p.ravel())` — Sample from the distribution rather than taking the argmax, so the output varies from run to run.
- `h = act_fn(W_xh @ x + W_hh @ h + b_h)` — Generation uses the same forward recurrence as training; there is no special generation mode.

```python
def save_hidden_state_pca(recorded_data: list, output_dir: Path) -> None:
    """
    Compute PCA on final-step hidden states, project all recorded steps to 2D.

    PCA is fit on the FINAL step's hidden states, then applied to all steps,
    matching the GloVe pattern of using final_pca as the reference frame.
    """
    from sklearn.decomposition import PCA

    if not recorded_data:
        return

    # Fit PCA on final step's hidden states
    final_step = recorded_data[-1]
    final_hs = np.array(final_step["hidden_states"])  # (seq_len, H)

    pca = PCA(n_components=2)
    pca.fit(final_hs)

    steps_pca = []
    for record in recorded_data:
        hs = np.array(record["hidden_states"])  # (seq_len, H)
        projected = pca.transform(hs)           # (seq_len, 2)
        steps_pca.append({
            "step": record["step"],
            "trajectory_2d": projected.tolist(),
            "input_chars": record["input_chars"],
        })

    result = {
        "pca_variance_explained": pca.explained_variance_ratio_.tolist(),
        "steps": steps_pca,
    }
    save_json(result, output_dir / "hidden_state_pca.json")


# ---------------------------------------------------------------------------
# Training loop
# ---------------------------------------------------------------------------
```
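The reference-frame trick above (fit PCA once on the final step, then project every checkpoint) can be reproduced with plain NumPy SVD, equivalent to sklearn's `PCA` up to component sign. The hidden states here are random stand-ins, not real trajectories:

```python
import numpy as np

def fit_pca_2d(X):
    """PCA via SVD: return the mean and the top-2 principal axes of X with shape (n, d)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:2]

def project_2d(X, mean, axes):
    """Project rows of X into the 2D frame defined by (mean, axes)."""
    return (X - mean) @ axes.T

rng = np.random.default_rng(0)
final_hs = rng.normal(size=(25, 64))    # stand-in for the final step's hidden states
earlier_hs = rng.normal(size=(25, 64))  # stand-in for an earlier checkpoint

mean, axes = fit_pca_2d(final_hs)                    # fit once, on the final step only
traj_final = project_2d(final_hs, mean, axes)        # (25, 2)
traj_earlier = project_2d(earlier_hs, mean, axes)    # same axes -> comparable frames
```

Fitting on the final step keeps the axes fixed, so trajectories from different training steps can be overlaid without the frame itself rotating between snapshots.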
- `PCA(n_components=2)` — Project the 64-dim hidden states to 2D for visualization.
- `pca.fit(final_hs)` — PCA is fit on the final step's hidden states only; every earlier checkpoint is then projected into that fixed reference frame.
- `trajectory_2d` — The 2D trajectory through hidden-state space as the model reads the sample sequence.
- `np.linalg.eigvals(W_hh)` — Compute all H eigenvalues of the recurrent weight matrix.
- `spectral_radius = float(np.max(np.abs(eigs)))` — Spectral radius = largest eigenvalue magnitude; > 1 means `W_hh` can amplify.

```python
def capture_jacobian(model: CharRNN, fwd: dict, t_start: int, t_end: int) -> dict:
    """
    Compute the Jacobian product dh_t/dh_k for k = t_start..t_end.

    This is the formal quantity whose spectral norm governs whether gradients
    vanish or explode. Saved periodically to jacobians/{step}.json for the
    JacobianInspector frontend component.

    For a sequence of length T, the Jacobian from step k to step t is:
        J_{t←k} = prod_{i=k+1}^{t} diag(1 - h_i^2) @ W_hh

    We accumulate this product incrementally (one factor per step).
    """
    W_hh = model.W_hh.data.numpy()
    product = np.eye(model.hidden_size)
    jacobian_records = []

    for t in range(t_start + 1, t_end + 1):
        diag = np.diag(1.0 - fwd["hs"][t].ravel() ** 2)
        J_t = diag @ W_hh
        product = J_t @ product

        eigs = np.linalg.eigvals(product)
        jacobian_records.append({
            "step": t,
            "jacobian_norm": float(np.linalg.norm(J_t)),
            "product_norm": float(np.linalg.norm(product)),
            "product_spectral_radius": float(np.max(np.abs(eigs))),
            "product_eigenvalues_real": eigs.real.tolist(),
        })

    return {
        "from_step": t_start,
        "to_step": t_end,
        "jacobians": jacobian_records,
        "final_product_norm": float(np.linalg.norm(product)),
    }


# ---------------------------------------------------------------------------
# Output / save functions
# ---------------------------------------------------------------------------
```
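The same product accumulation can be run standalone to watch gradients vanish. This sketch uses a deliberately small, made-up `W_hh` (not the trained one) so the product norm shrinks quickly:

```python
import numpy as np

rng = np.random.default_rng(0)
H = 16
W_hh = rng.normal(scale=0.1, size=(H, H))  # deliberately small recurrent weights

h = rng.uniform(-0.9, 0.9, size=H)
product = np.eye(H)
norms = []
for _ in range(20):
    J_t = np.diag(1.0 - h ** 2) @ W_hh     # single-step Jacobian for tanh
    product = J_t @ product                # J_t · J_{t-1} · ... · J_1
    norms.append(float(np.linalg.norm(product)))
    h = np.tanh(W_hh @ h)                  # advance the recurrence

# norms decays geometrically with the number of steps:
# the classic vanishing-gradient signature for small W_hh
```

With a `W_hh` whose spectral radius exceeds 1 and hidden states far from saturation, the same loop shows the opposite failure: the product norm explodes.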
- `J_t = np.diag(1.0 - fwd["hs"][t].ravel() ** 2) @ W_hh` — Single-step Jacobian: the tanh derivative times `W_hh`.
- `product = J_t @ product` — Accumulates the product J_t · J_{t-1} · … · J_1: measures how gradient signal shrinks over steps.
- `final_product_norm` — ‖J_T · … · J_1‖: if ≪ 1, gradients vanish; if ≫ 1, they explode.
```bash
# Generate corpus
python generate_corpus.py --num-sentences 900 --seed 42

# Train (5-10 min on CPU)
python train.py --steps 30000 --hidden-size 64 --seq-len 25 --lr 0.01 --clip 5.0 --seed 42

# Analyze
python analyze.py --data-dir output/ --save-json

# Explore interactively
python explore.py --data-dir output/ --interactive

# Expected at step 30000:
#   loss ~1.1   spectral_radius ~1.18   vocab_hit_rate ~0.88
```
$ git clone https://github.com/quantml/github-quantml.git
$ cd github-quantml/rnn/
$ pip install torch numpy scipy
$ python generate_corpus.py --num-sentences 600 --seed 42
$ python train.py --steps 30000 --hidden-size 64 --seq-len 25 --lr 0.01 --clip 5.0 --seed 42
$ python analyze.py --data-dir output/ --save-json
$ python explore.py --data-dir output/ --interactive
```text
================================================================
 RNN Character-Level Language Model -- Training
================================================================
 Corpus: corpus.txt (18,247 chars, 27 unique)
 Architecture: vanilla_rnn (H=64, V=27)
 Parameters: 7,579
 W_hh init: orthogonal (spectral_radius=1.000)
================================================================
step      1/30000 | loss 3.2958 | grad_norm  12.50 | clipped | lr 0.0100
step   5000/30000 | loss 1.8423 | grad_norm   2.31 |         | lr 0.0098
step  15000/30000 | loss 1.3107 | grad_norm   1.12 |         | lr 0.0095
step  30000/30000 | loss 1.1205 | grad_norm   0.85 |         | lr 0.0091
--- Sample: "the lion chases the mouse the girl reads the book" ---
================================================================
```
- `--hidden-size 16` — Weaker category emergence, higher loss
- `--hidden-size 128` — Sharper categories, ~4× slower
- `--seq-len 5` — Severe vanishing gradients; watch the memory probe
- `--seq-len 50` — Better memory, risk of gradient explosion
- `--clip 1.0` — Aggressive clipping; slow convergence
- `--lr 0.001` — Lower learning rate; more stable but slower

To test non-orthogonal initialization: replace `nn.init.orthogonal_` with `nn.init.xavier_uniform_` and observe the spectral radius deviation from 1.0.
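The difference between the two initializations can be checked directly. A NumPy sketch, using QR in place of `nn.init.orthogonal_` and reproducing Xavier-uniform's `U(-a, a)` bound by hand (the specifics of train.py's init path are assumed, not shown above):

```python
import numpy as np

rng = np.random.default_rng(42)
H = 64

# Orthogonal init via QR decomposition (standing in for nn.init.orthogonal_)
Q, _ = np.linalg.qr(rng.normal(size=(H, H)))

# Xavier-uniform init: U(-a, a) with a = sqrt(6 / (fan_in + fan_out))
a = np.sqrt(6.0 / (H + H))
W_xav = rng.uniform(-a, a, size=(H, H))

def rho(W):
    """Spectral radius: largest eigenvalue magnitude."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

rho_orth = rho(Q)      # exactly 1: orthogonal eigenvalues lie on the unit circle
rho_xav = rho(W_xav)   # only approximately 1, with random fluctuation
```

Orthogonal init pins every eigenvalue to the unit circle, so the network starts exactly at the edge of stability; Xavier only matches the right scale on average.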