RNN
Experiments
Run guided ablation experiments: compare initialization strategies, hidden sizes, TBPTT chunk lengths, activation functions, and more. See how each choice affects learning.
Experiments Lab · 10 guided studies
Each experiment isolates one architectural or training choice and shows exactly how it ripples through the learning dynamics. All results come from the same character-level RNN you explored in the Story and Internals tabs, making every observation directly interpretable.
Every experiment in this lab follows a controlled comparison protocol: one variable changes at a time, while all others stay fixed at the baseline defaults (H=64, seq_len=25, lr=0.01, clip=5.0, orthogonal init, tanh activation, 30,000 steps, seed=42).
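For reference, the baseline defaults can be collected into a single config object (a sketch; train.py's actual argument names may differ):

```python
# Baseline defaults shared by every experiment; each study changes exactly
# one of these values while the rest stay fixed.
BASELINE = {
    "hidden_size": 64,
    "seq_len": 25,
    "lr": 0.01,
    "clip": 5.0,
    "init": "orthogonal",
    "activation": "tanh",
    "steps": 30_000,
    "seed": 42,
}
```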
All training runs use the same character-level corpus (the synthetic grammar corpus from Plan 1). The model is the vanilla RNN from train.py, with a single hidden layer and a linear output projection.
How you initialize W_hh determines the spectral radius at step 0, and that determines everything that follows. This experiment runs three strategies and measures loss, spectral radius, gradient flow, and generation quality.
Orthogonal initialization of W_hh produces faster convergence and better gradient flow than Xavier or random, because it places all eigenvalues on the unit circle.
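The spectral radii of the three strategies are easy to check directly. A minimal numpy sketch (the 0.2 scale for the naive "random" init is a hypothetical choice, not taken from train.py):

```python
import numpy as np

H = 64
rng = np.random.default_rng(42)

def spectral_radius(W):
    """Largest eigenvalue magnitude of W."""
    return float(np.max(np.abs(np.linalg.eigvals(W))))

# Orthogonal: QR of a Gaussian matrix; every eigenvalue lies on the unit circle.
W_orth, _ = np.linalg.qr(rng.normal(size=(H, H)))

# Xavier/Glorot uniform: limit = sqrt(6 / (fan_in + fan_out)).
limit = np.sqrt(6.0 / (H + H))
W_xavier = rng.uniform(-limit, limit, size=(H, H))

# Naive random: N(0, 0.2^2), a hypothetical scale; by the circular law the
# spectral radius is roughly 0.2 * sqrt(H) = 1.6 > 1 for H = 64.
W_random = rng.normal(0.0, 0.2, size=(H, H))

print(spectral_radius(W_orth))    # 1.0 up to numerical error
print(spectral_radius(W_xavier))  # near but typically below/around 1
print(spectral_radius(W_random))  # > 1: unstable early dynamics
```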
Training loss curves
Spectral radius evolution: W_hh
Y-axis shows ρ(W_hh), the spectral radius. Orthogonal starts at exactly 1.0.
the lion chases the boy and the girl runs fast
the cat the boy runs the man sleeps
thh e c aa t sleeeps the bbboy
Final loss comparison
Observation questions
Orthogonal converges fastest. At step 0, ρ=1.0 means all eigenvalues are on the unit circle; gradients neither vanish nor explode as they flow back through time. Xavier produces ρ≈0.85 (sub-unity, leading to vanishing gradients), while random can produce ρ>1 (unstable early dynamics).
Random initialization produces fragmented, non-word character sequences ('thh e c aa t') even after 30K steps. Its final loss is 2.1 vs 1.12 for orthogonal; it never fully recovers from the chaotic early dynamics caused by ρ>1 initialization.
The effective memory depends on how well gradients survive the backward pass. With ρ≈1 (orthogonal), the product of recurrent Jacobians stays near 1, allowing gradients to flow back 18 chars on average. With random init and ρ>1 early on, the network learns a distorted W_hh that reduces effective gradient reach.
Orthogonal initialization is not just a 'good starting point'; it places the spectral radius at exactly 1.0, the mathematically optimal value for gradient flow. Every other random initialization is a compromise.
TBPTT (Truncated Backpropagation Through Time) limits how far back gradients flow by splitting the sequence into fixed-length chunks. The chunk length sets an absolute ceiling on learnable dependency distance. But there's a twist: even within the chunk, gradients vanish.
Longer TBPTT chunks allow learning longer-range dependencies, but the improvement is sub-linear because vanishing gradients shrink effective reach well below the theoretical maximum.
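The within-chunk decay can be illustrated directly: run a tanh RNN forward for one chunk, then push a unit gradient back through the per-step Jacobians diag(1 − h_t²)·W_hh and record its norm at each BPTT position. A numpy sketch with untrained weights and hypothetical dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
H, D, T = 64, 8, 25            # hidden size, input size, chunk length

W_hh, _ = np.linalg.qr(rng.normal(size=(H, H)))  # orthogonal recurrent weights
W_xh = rng.normal(0.0, 0.3, size=(H, D))

# Forward pass over one TBPTT chunk.
h = np.zeros(H)
hs = []
for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ rng.normal(size=D))
    hs.append(h)

# Backward: one step's Jacobian is diag(1 - h_t^2) @ W_hh, so moving one
# position back multiplies the gradient by the tanh' mask and W_hh.T.
g = np.ones(H) / np.sqrt(H)    # unit-norm gradient at the last position
norms = []
for t in reversed(range(T)):
    norms.append(float(np.linalg.norm(g)))
    g = W_hh.T @ (g * (1.0 - hs[t] ** 2))

# norms[k] is the gradient norm k positions back: it shrinks monotonically,
# because every tanh' factor is <= 1 even though W_hh preserves norms.
```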
Gradient decay curves (norm at each position going backward)
Gradient norm at each BPTT position (1 = current step)
Training loss curves
Memory probe accuracy by lookback distance (bins = chars back)
| Config | 1-3 | 4-7 | 8-12 | 13-18 | 19-25 |
|---|---|---|---|---|---|
| seq=5 | 78% | 42% | 25% | 18% | 15% |
| seq=10 | 85% | 67% | 44% | 28% | 21% |
| seq=25 | 91% | 78% | 62% | 51% | 43% |
| seq=50 | 93% | 82% | 68% | 57% | 50% |
Observation questions
The memory probe shows accuracy falling off sharply beyond 4 characters back; the model physically cannot learn dependencies longer than the chunk. Even within the chunk, the decay curve shows the gradient dropping to ~0.1 by position 5.
No. The memory horizon goes from 18 to 22 chars (a 22% increase). The gradient decay curve for seq_len=50 shows the gradient falling below 0.05 by position 25; even though the chunk extends to 50, vanishing gradients make positions 26–50 nearly unreachable.
Yes, around seq_len=25–35. Beyond this, gradient decay from tanh saturation and repeated W_hh multiplication effectively limits the useful horizon. Longer chunks add computational cost without a proportional memory benefit. This is why architectures like LSTMs use gates to maintain gradients instead of just longer chunks.
TBPTT chunk length sets the ceiling, but vanishing gradients set the floor. Even with seq_len=50, the effective gradient reach is ~22 chars. This is the fundamental motivation for gated architectures (LSTM, GRU).
The activation function is applied at every step to bound (or not bound) the hidden state. tanh is the classical choice, but why? This experiment runs tanh, ReLU, sigmoid, and identity through 30K steps and measures the consequences.
Different activations produce radically different gradient flow and stability characteristics. tanh provides the best balance of nonlinearity, boundedness, and gradient preservation.
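One concrete difference between these activations is the size of their derivatives, which gates how much gradient each backward step keeps. A quick numerical check:

```python
import numpy as np

x = np.linspace(-6.0, 6.0, 1001)

dtanh = 1.0 - np.tanh(x) ** 2        # tanh'(x): peaks at 1.0 when x = 0
sig = 1.0 / (1.0 + np.exp(-x))
dsig = sig * (1.0 - sig)             # sigmoid'(x): peaks at 0.25 when x = 0
drelu = (x > 0).astype(float)        # ReLU'(x): exactly 1 for active units

print(dtanh.max(), dsig.max(), drelu.max())  # → 1.0 0.25 1.0
```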
Training loss curves
Stability verdict
Hidden state statistics at final checkpoint
| Activation | Mean | Std | Saturated | Range |
|---|---|---|---|---|
| tanh | 0.02 | 0.62 | 18% | [-1, 1] |
| ReLU | 0.48 | 0.71 | 3% | [0, +∞) |
| sigmoid | 0.51 | 0.22 | 5% | [0, 1] |
| identity | 0.00 | 8.40 | 0% | (−∞, +∞) |
Gradient norm evolution
Y-axis shows ||∇||₂ before clipping. Identity diverges to very high values.
Observation questions
Without a nonlinearity, the RNN is just a linear dynamical system: h_t = W_hh × h_{t-1} + W_xh × x_t. If any eigenvalue of W_hh exceeds 1, h_t grows exponentially. More fundamentally, without nonlinearity, the model cannot represent the complex pattern separations needed for category emergence.
ReLU does preserve gradients better for active neurons (gradient=1, not <1 like tanh). But two problems emerge: (1) 'dying ReLU' (units that become inactive never recover); (2) unbounded hidden states require aggressive gradient clipping (clip=1.0 here) to prevent instability. The hidden std of 0.71 vs tanh's 0.62 reflects this.
sigmoid maps to [0,1] with derivative ≤ 0.25, while tanh maps to [-1,1] with derivative ≤ 1.0. The 4× larger maximum derivative for tanh means stronger gradient flow and faster convergence. Additionally, sigmoid's [0,1] output creates asymmetric representations; positive-only hidden states make it harder to encode contrast (e.g., 'this is NOT a verb').
tanh is a compromise: bounded hidden states prevent explosion, zero-centered output enables contrast encoding, and derivative up to 1.0 preserves gradients. sigmoid is simply a squashed, asymmetric tanh. ReLU's unbounded output requires aggressive clipping that hurts convergence.
Gradient clipping rescales the gradient vector whenever its norm exceeds a threshold. Too small a threshold starves learning. Too large a threshold leaves the model vulnerable to gradient spikes. This experiment tests 5 threshold values from 0.5 to 50.
There's a U-shaped performance curve: too-aggressive clipping starves learning (high final loss), too-permissive clipping introduces instability. clip=5.0 sits at the sweet spot.
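Global-norm clipping itself is only a few lines. A minimal sketch of the rescaling rule (function name hypothetical):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """If the combined L2 norm of all gradient arrays exceeds max_norm,
    rescale every array by the same factor; direction is preserved."""
    total = float(np.sqrt(sum(np.sum(g * g) for g in grads)))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads, total

# A gradient "spike" with global norm 5.0 passes clip=5.0 untouched,
# but is rescaled to norm 0.5 under the starved clip=0.5 setting.
spike = [np.array([3.0]), np.array([4.0])]   # global norm = 5.0
clipped, norm_before = clip_by_global_norm(spike, 0.5)
```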
Training loss curves
Fraction of steps where clipping fires
Final loss
Overall clip fraction (% of steps clipped)
Observation questions
On 92% of steps, the model receives a rescaled gradient with norm exactly 0.5. The actual gradient information (direction) is preserved, but the magnitude is systematically reduced; learning is glacially slow. After 30K steps, loss is 2.15, worse than clip=5.0's 1.12.
Rarely; very large gradient spikes slip through unchecked. These spikes cause sudden destabilizing updates that set the model back. With clip=5.0, the 28% of clipped steps are precisely the dangerous spikes; the model learns efficiently without catastrophic updates.
Clipping is most frequent in the early training phase (first 5K steps) when the model is in the chaotic high-loss region and gradients are large. All curves show decreasing clip frequency over time as the model enters a smoother loss landscape. clip=5.0's frequency drops from ~45% to ~5% by step 30K.
Gradient clipping is a safety valve, not a training heuristic. clip=5.0 triggers mainly during the unstable early phase, then allows free gradient flow. The 5 data points trace out a clear U-shaped curve in final loss: clip=0.5 (starved) → clip=5.0 (optimal) → clip=50.0 (unstable).
During training, teacher forcing feeds the correct previous character as context, even if the model predicted wrong. At inference, the model must use its own predictions. This creates 'exposure bias'; errors cascade.
A model trained with teacher forcing generates coherently when given correct context, but errors compound quickly when forced to condition on its own predictions.
Observation questions
During training, the model only ever received correct previous characters. When generating freely, the first error corrupts the context for the next prediction, which then makes another error, creating a cascade. The model learned to predict 'given correct context', a subtly different task from 'given my own context'.
The error cascade rate drops significantly; the model still makes errors but they grow more slowly. Scheduled sampling during training means the model has practiced recovering from some of its own mistakes, building a more robust representation.
It's a consequence of the training procedure, not the data. The model never saw sequences where position N was a prediction error; every training step provided a perfect context. This is called 'covariate shift': the test-time input distribution differs from the training distribution.
Teacher forcing is efficient for training but creates exposure bias: the model never practices recovering from its own errors. Scheduled sampling (mixing teacher and free-running context during training) is the classical fix. Beam search and length penalties are inference-time mitigations.
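Scheduled sampling, the classical fix mentioned above, can be sketched as a per-step coin flip (names hypothetical; `model_sample` stands in for a character sampled from the model's softmax):

```python
import random

def choose_context(gold_prev, model_sample, teacher_prob, rng=random):
    """Scheduled sampling: feed the ground-truth previous character with
    probability teacher_prob, otherwise feed the model's own prediction."""
    return gold_prev if rng.random() < teacher_prob else model_sample

def teacher_prob_at(step, total_steps):
    """Linear decay from pure teacher forcing (1.0) toward free running (0.0)."""
    return max(0.0, 1.0 - step / total_steps)
```

With teacher_prob fixed at 1.0 this reduces to ordinary teacher forcing; the decay schedule is what gradually exposes the model to its own errors during training.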
What if we freeze W_hh after initialization and only train the input and output weights? This is 'reservoir computing': the fixed random dynamics of W_hh provide the computation substrate; only the readout is learned.
An orthogonally-initialized frozen W_hh can still produce meaningful representations, because the fixed dynamics contain useful structure. Training W_hh refines, not creates, these dynamics.
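The reservoir setup can be sketched end to end: freeze an orthogonal W_hh, collect hidden states, and fit only a linear readout. Here ridge regression on a toy recall task; the dimensions and the task itself are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
H, D, T = 64, 8, 500

W_hh, _ = np.linalg.qr(rng.normal(size=(H, H)))  # frozen, rho = 1.0
W_xh = rng.normal(0.0, 0.3, size=(H, D))         # frozen input weights

xs = rng.normal(size=(T, D))
h = np.zeros(H)
states = np.zeros((T, H))
for t in range(T):
    h = np.tanh(W_hh @ h + W_xh @ xs[t])         # fixed random dynamics
    states[t] = h

# Toy task: read out the previous step's first input coordinate from h_t.
A = states[1:]
y = xs[:-1, 0]

# Only the linear readout is trained (ridge regression); W_hh never changes.
w = np.linalg.solve(A.T @ A + 1e-3 * np.eye(H), A.T @ y)
mse = float(np.mean((A @ w - y) ** 2))
```

That the readout beats the trivial predict-the-mean baseline shows the frozen dynamics already carry usable sequence history, which is the Echo State Network principle discussed below.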
Observation questions
An orthogonal W_hh with ρ=1.0 provides a rich set of dynamics; it effectively computes a random projection of the sequence history. The linear input/output weights can still learn to read out useful patterns from this fixed projection. This is the Echo State Network principle.
The full RNN adapts W_hh to specifically separate the category-relevant dimensions. The reservoir must use whatever random projection it got; some dimensions may align with category boundaries, others may not. The full RNN's W_hh learns to maximize category separation; the reservoir's cannot.
The reservoir result shows that a well-initialized W_hh contains genuinely useful structure even before training. This is why orthogonal initialization (ρ=1.0) is so important; it provides the best possible starting dynamics for the reservoir, which then get refined by gradient descent.
Reservoir computing reveals that the recurrent weights contain useful structure even when randomly initialized (with ρ=1). Training W_hh refines this structure for the task. This connects to Echo State Networks and explains why initialization is about starting with the right dynamics, not just avoiding gradient pathology.
After training on the full corpus, we substitute a never-seen character pattern ("zog") for a known word and run the corpus through the model WITHOUT any further training. Where does "zog" end up in the hidden-state space?
"zog" should cluster with the original word's category-mates, because the RNN learned distributional semantics; it represents words by the contexts they appear in.
Observation questions
The corpus was generated from the same template grammar: "the [NOUN] chases/sleeps/runs..." After substitution, "zog" appears in exactly the same sentential positions as "man": same preceding articles, same following verb forms. The model has learned to read position-in-sentence as a proxy for semantic category.
Similarity 0.0 would mean "zog" occupies a completely unrelated point in hidden-state space; the model assigned it a random representation. Similarity 1.0 would mean "zog" is indistinguishable from "man". The 0.87 indicates near-identical distributional context, with slight deviation because the substituted corpus is not identical to the original.
Yes. When "lion" is replaced, "zog" clusters with animals (cat, dog) instead of humans, with different nearest neighbors. The representation is entirely determined by distributional context, not the surface form of "zog". This is the distributional hypothesis in action.
The RNN has rediscovered Firth's (1957) insight: "a word is characterized by the company it keeps." Character-level prediction alone, applied to a structured corpus, produces word-level semantic representations. This is the bridge between syntactic position and semantic meaning.
The learning rate controls the step size at each gradient update. Too small and learning is glacially slow; too large and training diverges. This experiment tests 5 values from 0.001 to 0.1.
lr=0.01 is near-optimal: a U-shaped curve in final loss with convergence failures at high lr and slow convergence at low lr.
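The divergence mechanism is visible even on a 1-D quadratic: gradient descent on f(x) = x² multiplies x by (1 − 2·lr) at every step, so it contracts when lr < 1 and blows up when lr > 1. A toy sketch:

```python
def sgd_on_parabola(lr, steps=50, x0=1.0):
    """Gradient descent on f(x) = x^2, whose gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2.0 * x          # x <- (1 - 2*lr) * x
    return x

print(abs(sgd_on_parabola(0.1)))   # shrinks toward 0: factor 0.8 per step
print(abs(sgd_on_parabola(1.1)))   # diverges: |1 - 2.2| = 1.2 per step
```

The real loss landscape is far from quadratic, but the same amplification is what destabilizes lr=0.1 once gradients grow.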
Training loss curves
Final loss and stability
Final loss
Observation questions
At lr=0.001, each gradient step is 10× smaller than the baseline. The model makes safe, conservative updates, but after 30K steps it hasn't converged. With 300K steps it would likely reach similar performance to lr=0.01. The tradeoff is time vs safety.
For the first 3K steps, the model is in the high-loss region where gradients are relatively small (the loss landscape is smooth). Once it reaches the 'transition zone' where loss starts to decrease rapidly, gradients become large, and lr=0.1 amplifies them into destabilizing updates.
Larger learning rates take larger steps and can cross loss valleys quickly (fast early convergence) but also overshoot the minimum. lr=0.01 takes more careful steps, eventually finding a tighter minimum. This is the classic speed-precision tradeoff in optimization.
The optimal learning rate for this model is ~0.01, large enough to converge in 30K steps, small enough to avoid instability. Learning rate is the most sensitive hyperparameter: a 10× change in either direction causes significant degradation.
Combine what you've learned. Explore the precomputed grid of hidden_size × activation × seq_len combinations and find the best design for this corpus. Each cell shows the result of a complete 30K-step training run.
Combining the insights from experiments 1–9, you should be able to identify the top-performing configuration before looking at the table.
Hidden size
Activation
| seq_len | Final loss | Memory horizon | Params | Status |
|---|---|---|---|---|
| 5 | 1.283 | 3 | 7,808 | ok |
| 10 | 1.205 | 5 | 7,808 | ok |
| 25 | 1.129 | 15 | 7,808 | ok |
| 50★ best | 0.766 | 33 | 7,808 | ok |
Showing H=64, activation=tanh. Select different values above to explore the design space.
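The parameter count in the table is consistent with a bias-free vanilla RNN over a 29-character vocabulary (the vocabulary size is inferred from the numbers, not stated in the text):

```python
def rnn_param_count(hidden, vocab, bias=False):
    """Vanilla RNN parameters: W_xh (vocab x hidden), W_hh (hidden x hidden),
    and the output projection W_hy (hidden x vocab)."""
    n = vocab * hidden + hidden * hidden + hidden * vocab
    if bias:
        n += hidden + vocab            # b_h and b_y
    return n

print(rnn_param_count(64, 29))         # → 7808, matching every H=64 row
```

Note that seq_len never enters the count, which is why all four rows in the table share the same 7,808 parameters.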
Your prediction exercise
Before exploring the table: based on experiments 2, 3, and 4, which combination do you expect to perform best? The answer from this lab: H=128, tanh, seq_len=50 produces the lowest loss, but H=64, tanh, seq_len=25 is within 5% at one third of the parameters. Beyond a point, more capacity has diminishing returns on this toy grammar.
Observation questions
tanh, consistent with Experiment 4. ReLU is 0.15 loss units worse on average across configurations, and sigmoid is 0.4 worse. The ranking is stable across hidden sizes and seq_len values.
H=64 and H=128 are nearly tied for tanh + seq_len=25. H=64 provides about 95% of H=128's performance at 28% of the parameters. For this grammar, H=64 is the practical sweet spot.
Small hidden sizes (H=16, H=32) have limited representational capacity; they benefit from longer sequence context to compensate. Large hidden sizes (H=128) already capture sufficient structure from shorter sequences. Capacity and context length are partially substitutable.
The best architecture combines: orthogonal initialization (Exp 1), tanh activation (Exp 4), moderate clip=5 (Exp 5), and seq_len matching the characteristic sentence length (~25). Larger H provides diminishing returns beyond 64 for this grammar. This 'design triangle' of capacity, context, and stability determines RNN performance.