RNN
Experiments
Run guided ablation experiments: compare initialization strategies, hidden sizes, TBPTT chunk lengths, activation functions, and more. See how each choice affects learning.
Experiments Lab · 10 guided studies
Each experiment isolates one architectural or training choice and shows exactly how it ripples through the learning dynamics. All results come from the same character-level RNN you explored in the Story and Internals tabs, making every observation directly interpretable.
Every experiment in this lab follows a controlled comparison protocol: one variable changes at a time, while all others stay fixed at the baseline defaults (H=64, seq_len=25, lr=0.01, clip=5.0, orthogonal init, tanh activation, 30,000 steps, seed=42).
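For reference, the baseline defaults can be collected into a single config object (a sketch; train.py's actual argument names may differ):

```python
# Baseline defaults shared by every experiment; each study changes exactly
# one of these values while the rest stay fixed.
BASELINE = {
    "hidden_size": 64,
    "seq_len": 25,
    "lr": 0.01,
    "clip": 5.0,
    "init": "orthogonal",
    "activation": "tanh",
    "steps": 30_000,
    "seed": 42,
}
```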
All training runs use the same character-level corpus (the synthetic grammar corpus from Plan 1). The model is the vanilla RNN from train.py, with a single hidden layer and a linear output projection.
How you initialize W_hh determines the spectral radius at step 0, and that determines everything that follows. This experiment runs three strategies and measures loss, spectral radius, gradient flow, and generation quality.
Orthogonal initialization of W_hh produces faster convergence and better gradient flow than Xavier or random, because it places all eigenvalues on the unit circle.
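The spectral radii of the three strategies are easy to check directly. A minimal numpy sketch (the 0.2 scale for the naive "random" init is a hypothetical choice, not taken from train.py):

```python
import numpy as np

H = 64
rng = np.random.default_rng(42)

def spectral_radius(W):
    """Largest eigenvalue magnitude of W."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

# Orthogonal: QR of a Gaussian matrix; every eigenvalue lies on the unit circle.
W_orth, _ = np.linalg.qr(rng.normal(size=(H, H)))

# Xavier/Glorot uniform: limit = sqrt(6 / (fan_in + fan_out)).
limit = np.sqrt(6.0 / (H + H))
W_xavier = rng.uniform(-limit, limit, size=(H, H))

# Naive random: N(0, 0.2^2), a hypothetical scale; by the circular law the
# spectral radius is roughly 0.2 * sqrt(H) = 1.6 > 1 for H = 64.
W_random = rng.normal(0.0, 0.2, size=(H, H))

print(spectral_radius(W_orth))    # 1.0 up to numerical error
print(spectral_radius(W_xavier))  # near but typically below/around 1
print(spectral_radius(W_random))  # > 1: unstable early dynamics
```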
Training loss curves
Spectral radius evolution: W_hh
Y-axis shows ρ(W_hh), the spectral radius. Orthogonal starts at exactly 1.0.
the lion chases the boy and the girl runs fast
the cat the boy runs the man sleeps
thh e c aa t sleeeps the bbboy
Final loss comparison
Observation questions
Orthogonal converges fastest. At step 0, ρ=1.0 means all eigenvalues are on the unit circle; gradients neither vanish nor explode as they flow back through time. Xavier produces ρ≈0.85 (sub-unity, leading to vanishing gradients), while random can produce ρ>1 (unstable early dynamics).
Random initialization produces fragmented, non-word character sequences ('thh e c aa t') even after 30K steps. Its final loss is 2.1 vs 1.12 for orthogonal; it never fully recovers from the chaotic early dynamics caused by ρ>1 initialization.
The effective memory depends on how well gradients survive the backward pass. With ρ≈1 (orthogonal), the product of recurrent Jacobians stays near 1, allowing gradients to flow back 18 chars on average. With random init and ρ>1 early on, the network learns a distorted W_hh that reduces effective gradient reach.
Orthogonal initialization is not just a 'good starting point'; it places the spectral radius at exactly 1.0, the mathematically optimal value for gradient flow. Every other random initialization is a compromise.
TBPTT (Truncated Backpropagation Through Time) limits how far back gradients flow by splitting the sequence into fixed-length chunks. The chunk length sets an absolute ceiling on learnable dependency distance. But there's a twist: even within the chunk, gradients vanish.
Longer TBPTT chunks allow learning longer-range dependencies, but the improvement is sub-linear because vanishing gradients shrink effective reach well below the theoretical maximum.
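The within-chunk decay can be illustrated directly: run a tanh RNN forward for one chunk, then push a unit gradient back through the per-step Jacobians diag(1 − h_t²)·W_hh and record its norm at each BPTT position. A numpy sketch with untrained weights and hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 64, 8, 25            # hidden size, input size, chunk length

W_hh, _ = np.linalg.qr(rng.normal(size=(H, H)))  # orthogonal recurrent weights
W_xh = rng.normal(0.0, 0.3, size=(H, D))

# Forward pass over one TBPTT chunk.
h = np.zeros(H)
hs = []
for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ rng.normal(size=D))
    hs.append(h)

# Backward: one step's Jacobian is diag(1 - h_t^2) @ W_hh, so moving one
# position back multiplies the gradient by the tanh' mask and W_hh.T.
g = np.ones(H) / np.sqrt(H)    # unit-norm gradient at the last position
norms = []
for t in reversed(range(T)):
    norms.append(float(np.linalg.norm(g)))
    g = W_hh.T @ (g * (1.0 - hs[t] ** 2))

# norms[k] is the gradient norm k positions back: it shrinks monotonically,
# because every tanh' factor is <= 1 even though W_hh preserves norms.
```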
Gradient decay curves (norm at each position going backward)
Gradient norm at each BPTT position (1 = current step)
Training loss curves
Memory probe accuracy by lookback distance (bins = chars back)
| Config | 1-3 | 4-7 | 8-12 | 13-18 | 19-25 |
|---|---|---|---|---|---|
| seq=5 | 78% | 42% | 25% | 18% | 15% |
| seq=10 | 85% | 67% | 44% | 28% | 21% |
| seq=25 | 91% | 78% | 62% | 51% | 43% |
| seq=50 | 93% | 82% | 68% | 57% | 50% |
Observation questions
The memory probe shows accuracy falling off sharply beyond 4 characters back; the model physically cannot learn dependencies longer than the chunk. Even within the chunk, the decay curve shows the gradient dropping to ~0.1 by position 5.
No. The memory horizon goes from 18 to 22 chars (a 22% increase). The gradient decay curve for seq_len=50 shows the gradient falling below 0.05 by position 25; even though the chunk extends to 50, vanishing gradients make positions 26–50 nearly unreachable.
Yes, around seq_len=25–35. Beyond this, gradient decay from tanh saturation and repeated W_hh multiplication effectively limits the useful horizon. Longer chunks add computational cost without a proportional memory benefit. This is why architectures like LSTMs use gates to maintain gradients instead of just longer chunks.
TBPTT chunk length sets the ceiling, but vanishing gradients set the floor. Even with seq_len=50, the effective gradient reach is ~22 chars. This is the fundamental motivation for gated architectures (LSTM, GRU).
The activation function is applied at every step to bound (or not bound) the hidden state. tanh is the classical choice, but why? This experiment runs tanh, ReLU, sigmoid, and identity through 30K steps and measures the consequences.
Different activations produce radically different gradient flow and stability characteristics. tanh provides the best balance of nonlinearity, boundedness, and gradient preservation.
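One concrete difference between these activations is the size of their derivatives, which gates how much gradient each backward step keeps. A quick numerical check:

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 1001)

dtanh = 1.0 - np.tanh(x) ** 2        # tanh'(x): peaks at 1.0 when x = 0
sig = 1.0 / (1.0 + np.exp(-x))
dsig = sig * (1.0 - sig)             # sigmoid'(x): peaks at 0.25 when x = 0
drelu = (x > 0).astype(float)        # ReLU'(x): exactly 1 for active units

print(dtanh.max(), dsig.max(), drelu.max())  # → 1.0 0.25 1.0
```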
Training loss curves
Stability verdict
Hidden state statistics at final checkpoint
| Activation | Mean | Std | Saturated | Range |
|---|---|---|---|---|
| tanh | 0.02 | 0.62 | 18% | [-1, 1] |
| ReLU | 0.48 | 0.71 | 3% | [0, +∞) |
| sigmoid | 0.51 | 0.22 | 5% | [0, 1] |
| identity | 0.00 | 8.40 | 0% | (−∞, +∞) |
Gradient norm evolution
Y-axis shows ||∇||₂ before clipping. Identity diverges to very high values.
Observation questions
Without a nonlinearity, the RNN is just a linear dynamical system: h_t = W_hh × h_{t-1} + W_xh × x_t. If any eigenvalue of W_hh exceeds 1, h_t grows exponentially. More fundamentally, without nonlinearity, the model cannot represent the complex pattern separations needed for category emergence.
ReLU does preserve gradients better for active neurons (gradient=1, not <1 like tanh). But two problems emerge: (1) 'dying ReLU' (units that become inactive never recover); (2) unbounded hidden states require aggressive gradient clipping (clip=1.0 here) to prevent instability. The hidden std of 0.71 vs tanh's 0.62 reflects this.
sigmoid maps to [0,1] with derivative ≤ 0.25, while tanh maps to [-1,1] with derivative ≤ 1.0. The 4× larger maximum derivative for tanh means stronger gradient flow and faster convergence. Additionally, sigmoid's [0,1] output creates asymmetric representations; positive-only hidden states make it harder to encode contrast (e.g., 'this is NOT a verb').
tanh is a compromise: bounded hidden states prevent explosion, zero-centered output enables contrast encoding, and derivative up to 1.0 preserves gradients. sigmoid is simply a squashed, asymmetric tanh. ReLU's unbounded output requires aggressive clipping that hurts convergence.
Gradient clipping rescales the gradient vector whenever its norm exceeds a threshold. Too small a threshold starves learning. Too large a threshold leaves the model vulnerable to gradient spikes. This experiment tests 5 threshold values from 0.5 to 50.
There's a U-shaped performance curve: too-aggressive clipping starves learning (high final loss), too-permissive clipping introduces instability. clip=5.0 sits at the sweet spot.
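Global-norm clipping itself is only a few lines. A minimal sketch of the rescaling rule (function name hypothetical):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """If the combined L2 norm of all gradient arrays exceeds max_norm,
    rescale every array by the same factor; direction is preserved."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# A gradient "spike" with global norm 5.0 passes clip=5.0 untouched,
# but is rescaled to norm 0.5 under the starved clip=0.5 setting.
spike = [np.array([3.0]), np.array([4.0])]   # global norm = 5.0
clipped, norm_before = clip_by_global_norm(spike, 0.5)
```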
Training loss curves
Fraction of steps where clipping fires
Final loss
Overall clip fraction (% of steps clipped)
Observation questions
On 92% of steps, the model receives a rescaled gradient with norm exactly 0.5. The actual gradient information (direction) is preserved, but the magnitude is systematically reduced; learning is glacially slow. After 30K steps, loss is 2.15, worse than clip=5.0's 1.12.
Rarely; very large gradient spikes slip through unchecked. These spikes cause sudden destabilizing updates that set the model back. With clip=5.0, the 28% of clipped steps are precisely the dangerous spikes; the model learns efficiently without catastrophic updates.
Clipping is most frequent in the early training phase (first 5K steps) when the model is in the chaotic high-loss region and gradients are large. All curves show decreasing clip frequency over time as the model enters a smoother loss landscape. clip=5.0's frequency drops from ~45% to ~5% by step 30K.
Gradient clipping is a safety valve, not a training heuristic. clip=5.0 triggers mainly during the unstable early phase, then allows free gradient flow. The 5 data points trace out a clear U-shaped curve in final loss: clip=0.5 (starved) → clip=5.0 (optimal) → clip=50.0 (unstable).
During training, teacher forcing feeds the correct previous character as context, even if the model predicted wrong. At inference, the model must use its own predictions. This creates 'exposure bias'; errors cascade.
A model trained with teacher forcing generates coherently when given correct context, but errors compound quickly when forced to condition on its own predictions.
Observation questions
During training, the model only ever received correct previous characters. When generating freely, the first error corrupts the context for the next prediction, which then makes another error, creating a cascade. The model learned to predict 'given correct context', a subtly different task from 'given my own context'.
The error cascade rate drops significantly; the model still makes errors but they grow more slowly. Scheduled sampling during training means the model has practiced recovering from some of its own mistakes, building a more robust representation.
It's a consequence of the training procedure, not the data. The model never saw sequences where position N was a prediction error; every training step provided a perfect context. This is called 'covariate shift': the test-time input distribution differs from the training distribution.
Teacher forcing is efficient for training but creates exposure bias: the model never practices recovering from its own errors. Scheduled sampling (mixing teacher and free-running context during training) is the classical fix. Beam search and length penalties are inference-time mitigations.
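Scheduled sampling, the classical fix mentioned above, can be sketched as a per-step coin flip (names hypothetical; `model_sample` stands in for a character sampled from the model's softmax):

```python
import random

def choose_context(gold_prev, model_sample, teacher_prob, rng=random):
    """Scheduled sampling: feed the ground-truth previous character with
    probability teacher_prob, otherwise feed the model's own prediction."""
    return gold_prev if rng.random() < teacher_prob else model_sample

def teacher_prob_at(step, total_steps):
    """Linear decay from pure teacher forcing (1.0) toward free running (0.0)."""
    return max(0.0, 1.0 - step / total_steps)
```

With teacher_prob fixed at 1.0 this reduces to ordinary teacher forcing; the decay schedule is what gradually exposes the model to its own errors during training.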
What if we freeze W_hh after initialization and only train the input and output weights? This is 'reservoir computing': the fixed random dynamics of W_hh provide the computation substrate; only the readout is learned.
An orthogonally-initialized frozen W_hh can still produce meaningful representations, because the fixed dynamics contain useful structure. Training W_hh refines, not creates, these dynamics.
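The reservoir setup can be sketched end to end: freeze an orthogonal W_hh, collect hidden states, and fit only a linear readout. Here ridge regression on a toy recall task; the dimensions and the task itself are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
H, D, T = 64, 8, 500

W_hh, _ = np.linalg.qr(rng.normal(size=(H, H)))  # frozen, rho = 1.0
W_xh = rng.normal(0.0, 0.3, size=(H, D))         # frozen input weights

xs = rng.normal(size=(T, D))
h = np.zeros(H)
states = np.zeros((T, H))
for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ xs[t])         # fixed random dynamics
    states[t] = h

# Toy task: read out the previous step's first input coordinate from h_t.
A = states[1:]
y = xs[:-1, 0]

# Only the linear readout is trained (ridge regression); W_hh never changes.
w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(H), A.T @ y)
mse = float(np.mean((A @ w - y) ** 2))
```

That the readout beats the trivial predict-the-mean baseline shows the frozen dynamics already carry usable sequence history, which is the Echo State Network principle discussed below.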
Observation questions
An orthogonal W_hh with ρ=1.0 provides a rich set of dynamics; it effectively computes a random projection of the sequence history. The linear input/output weights can still learn to read out useful patterns from this fixed projection. This is the Echo State Network principle.
The full RNN adapts W_hh to specifically separate the category-relevant dimensions. The reservoir must use whatever random projection it got; some dimensions may align with category boundaries, others may not. The full RNN's W_hh learns to maximize category separation; the reservoir's cannot.
The reservoir result shows that a well-initialized W_hh contains genuinely useful structure even before training. This is why orthogonal initialization (ρ=1.0) is so important; it provides the best possible starting dynamics for the reservoir, which then get refined by gradient descent.
Reservoir computing reveals that the recurrent weights contain useful structure even when randomly initialized (with ρ=1). Training W_hh refines this structure for the task. This connects to Echo State Networks and explains why initialization is about starting with the right dynamics, not just avoiding gradient pathology.
After training on the full corpus, we substitute a never-seen character pattern ("zog") for a known word and run the corpus through the model WITHOUT any further training. Where does "zog" end up in the hidden-state space?
"zog" should cluster with the original word's category-mates, because the RNN learned distributional semantics; it represents words by the contexts they appear in.
Observation questions
The corpus was generated from the same template grammar: "the [NOUN] chases/sleeps/runs..." After substitution, "zog" appears in exactly the same sentential positions as "man": same preceding articles, same following verb forms. The model has learned to read position-in-sentence as a proxy for semantic category.
Similarity 0.0 would mean "zog" occupies a completely unrelated point in hidden-state space; the model assigned it a random representation. Similarity 1.0 would mean "zog" is indistinguishable from "man". The 0.87 indicates near-identical distributional context, with slight deviation because the substituted corpus is not identical to the original.
Yes. When "lion" is replaced, "zog" clusters with animals (cat, dog) instead of humans, with different nearest neighbors. The representation is entirely determined by distributional context, not the surface form of "zog". This is the distributional hypothesis in action.
The RNN has rediscovered Firth's (1957) insight: "a word is characterized by the company it keeps." Character-level prediction alone, applied to a structured corpus, produces word-level semantic representations. This is the bridge between syntactic position and semantic meaning.
The learning rate controls the step size at each gradient update. Too small and learning is glacially slow; too large and training diverges. This experiment tests 5 values from 0.001 to 0.1.
lr=0.01 is near-optimal: a U-shaped curve in final loss with convergence failures at high lr and slow convergence at low lr.
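The divergence mechanism is visible even on a 1-D quadratic: gradient descent on f(x) = x² multiplies x by (1 − 2·lr) at every step, so it contracts when lr < 1 and blows up when lr > 1. A toy sketch:

```python
def sgd_on_parabola(lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x          # x <- (1 - 2*lr) * x
    return x

print(abs(sgd_on_parabola(0.1)))   # shrinks toward 0: factor 0.8 per step
print(abs(sgd_on_parabola(1.1)))   # diverges: |1 - 2.2| = 1.2 per step
```

The real loss landscape is far from quadratic, but the same amplification is what destabilizes lr=0.1 once gradients grow.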
Training loss curves
Final loss and stability
Final loss
Observation questions
At lr=0.001, each gradient step is 10× smaller than the baseline. The model makes safe, conservative updates, but after 30K steps it hasn't converged. With 300K steps it would likely reach similar performance to lr=0.01. The tradeoff is time vs safety.
For the first 3K steps, the model is in the high-loss region where gradients are relatively small (the loss landscape is smooth). Once it reaches the 'transition zone' where loss starts to decrease rapidly, gradients become large, and lr=0.1 amplifies them into destabilizing updates.
Larger learning rates take larger steps and can cross loss valleys quickly (fast early convergence) but also overshoot the minimum. lr=0.01 takes more careful steps, eventually finding a tighter minimum. This is the classic speed-precision tradeoff in optimization.
The optimal learning rate for this model is ~0.01, large enough to converge in 30K steps, small enough to avoid instability. Learning rate is the most sensitive hyperparameter: a 10× change in either direction causes significant degradation.
Combine what you've learned. Explore the precomputed grid of hidden_size × activation × seq_len combinations and find the best design for this corpus. Each cell shows the result of a complete 30K-step training run.
Combining the insights from experiments 1–9, you should be able to identify the top-performing configuration before looking at the table.
Hidden size
Activation
| seq_len | Final loss | Memory horizon | Params | Status |
|---|---|---|---|---|
| 5 | 1.283 | 3 | 7,808 | ok |
| 10 | 1.205 | 5 | 7,808 | ok |
| 25 | 1.129 | 15 | 7,808 | ok |
| 50★ best | 0.766 | 33 | 7,808 | ok |
Showing H=64, activation=tanh. Select different values above to explore the design space.
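The parameter count in the table is consistent with a bias-free vanilla RNN over a 29-character vocabulary (the vocabulary size is inferred from the numbers, not stated in the text):

```python
def rnn_param_count(hidden, vocab, bias=False):
    """Vanilla RNN parameters: W_xh (vocab x hidden), W_hh (hidden x hidden),
    and the output projection W_hy (hidden x vocab)."""
    n = vocab * hidden + hidden * hidden + hidden * vocab
    if bias:
        n += hidden + vocab            # b_h and b_y
    return n

print(rnn_param_count(64, 29))         # → 7808, matching every H=64 row
```

Note that seq_len never enters the count, which is why all four rows in the table share the same 7,808 parameters.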
Your prediction exercise
Before exploring the table: based on experiments 2, 3, and 4, which combination do you expect to perform best? The answer from this lab: H=128, tanh, seq_len=50 produces the lowest loss, but H=64, tanh, seq_len=25 is within 5% at one third of the parameters. Beyond a point, more capacity has diminishing returns on this toy grammar.
Observation questions
tanh, consistent with Experiment 4. ReLU is 0.15 loss units worse on average across configurations, and sigmoid is 0.4 worse. The ranking is stable across hidden sizes and seq_len values.
H=64 and H=128 are nearly tied for tanh + seq_len=25. H=64 provides about 95% of H=128's performance at 28% of the parameters. For this grammar, H=64 is the practical sweet spot.
Small hidden sizes (H=16, H=32) have limited representational capacity; they benefit from longer sequence context to compensate. Large hidden sizes (H=128) already capture sufficient structure from shorter sequences. Capacity and context length are partially substitutable.
The best architecture combines: orthogonal initialization (Exp 1), tanh activation (Exp 4), moderate clip=5 (Exp 5), and seq_len matching the characteristic sentence length (~25). Larger H provides diminishing returns beyond 64 for this grammar. This 'design triangle' of capacity, context, and stability determines RNN performance.