RNN
Networks That Remember
Watch a network learn to carry information through time, character by character. See hidden states evolve, gradients vanish, and categories emerge from raw text.

Key Equations
A recurrent neural network carries a hidden state forward through time, compressing the input history into a fixed-size vector at each step; this enables sequence processing but causes gradients to vanish exponentially during training, limiting practical memory to short spans and motivating gated architectures like LSTM.
Snapshots vs Movies
Why sequences need memory
Toy vs Production: Our examples use a 27-character vocabulary and a 38-word grammar. Real language models process 50,000+ tokens. The simplicity is deliberate: it makes every phenomenon inspectable. When you can see all 64 hidden units and all 27 output probabilities, nothing is a black box.
The Feedback Loop
Anatomy of a recurrent cell
Recurrence alone
Without input, the hidden state either decays to zero (eigenvalues < 1) or explodes (eigenvalues > 1). Not useful yet.
Add input
Now the state combines new input with memory. But values can grow unbounded.
Add nonlinearity
tanh squashes everything to [−1, 1]. This creates a bounded 'memory register.' Each unit stores a value between −1 and +1 representing some aspect of input history.
Add output
Read out predictions from the hidden state. The softmax converts logits to a probability distribution over next characters.
Embedding Layers: Dense vs One-Hot
Misconception: “RNNs have infinite memory”
No. The hidden state is a fixed-size vector (64 numbers in [−1, +1]). Information from early inputs is progressively compressed and eventually overwritten. The effective memory horizon depends on spectral radius and task structure; for our model, it is roughly 15–20 characters. Chapter 5 will make this precise.
Misconception: “The hidden state is a recording of everything”
No. It is a lossy, task-dependent compression. Two very different input histories can produce identical hidden states (aliasing). The network retains what is useful for predicting the next character and discards the rest.
Why orthogonal initialization?: is initialized as an orthogonal matrix. This means all its eigenvalues have absolute value exactly 1.0, placing the network at the optimal spectral radius for gradient flow. Xavier initialization does not provide this guarantee. We will see why this matters in Chapter 5.
Unrolling the Loop
From loops to layers
A 2-unit RNN can compute parity, whether the number of 1s in a binary sequence is odd or even. The key insight: a single bit can be encoded in one real-valued hidden unit.
With and , the first unit computes . For the sequence [1, 0, 1, 1, 0], the hidden states evolve as: h=[0.76, 0] → [−0.60, 0] → [0.65, 0] → [0.08, 0] → [−0.08, 0]. The sign of h[0] encodes parity: positive=odd, negative=even.
This proves RNNs can implement finite automata, the simplest computational class. Regular languages, recognizable by finite automata, are learnable by Elman networks.
| Pattern | Application | Our task |
|---|---|---|
| One-to-one | Standard feedforward (no RNN) | |
| One-to-many | Image captioning | |
| Many-to-one | Sentiment analysis | |
| Many-to-many (synced) | POS tagging | ✓ next-character prediction |
| Many-to-many (encoder-decoder) | Machine translation |
Bidirectional and Stacked RNNs
The Formal Computational Hierarchy
Learning Through Time
BPTT and gradient clipping
At t = T (terminal step)
Only the direct contribution through the output layer.
At t < T
The second term is the gradient flowing backward through time, through the tanh derivative and W_hh.
RTRL, introduced by Williams and Zipser (1989), computes gradients in forward time rather than backward time. Instead of waiting until the end of the sequence to backpropagate, RTRL maintains a running computation of how each parameter affects each hidden unit.
The cost: RTRL requires operations per step (maintaining a 3D tensor ), compared to BPTT's per step. For our tiny H=64 model, that is 64⁴ = 16.7 million vs 64² = 4,096 operations per step. Impractical at any scale. Werbos (1990) compared both approaches and BPTT won definitively.
RTRL is theoretically important because it is online: it can update weights after every single time step, without truncation bias. Recent work (SNAP, R-BPTT) revisits forward-time gradients for online learning scenarios where truncated BPTT's bias is unacceptable.
Teacher Forcing: A Deep Dive
The complete derivation proceeds in four phases:
Cross-reference: this corresponds line-by-line to the backward() function in train.py. See the Code walkthrough section code-bptt for annotated correspondence.
Why Memory Fades
Vanishing gradients and spectral radius
ρ < 1Jacobian product shrinks exponentially at rate ≈ ρ^(t−k). Gradients vanish. Long-range dependencies are unlearnable.
ρ > 1Product grows exponentially. Gradients explode. Training becomes unstable without clipping.
ρ = 1The 'edge of chaos'. Gradients neither vanish nor explode. Maximum memory, maximum sensitivity.
Bengio, Simard, and Frasconi (1994) proved something stronger than “gradients tend to vanish.” They showed that for an RNN with stable fixed points (which any well-behaved RNN must have to produce consistent predictions) gradients must vanish.
Formal statement: if the RNN converges to a fixed point , then the spectral radius of the Jacobian at is strictly less than 1. If ρ ≥ 1, the fixed point is unstable; the hidden state would wander away from under small perturbations. A useful RNN must have stable fixed points, so ρ < 1 at those points, so gradients must vanish there.
The implication is stark: there is no sweet spot for vanilla RNNs. Stability and long-term memory are fundamentally in tension. You cannot have both simultaneously without changing the architecture. Pascanu, Mikolov, and Bengio (2013) extended this analysis to exploding gradients, showing both phenomena are two sides of the same Jacobian product instability.
The Information-Theoretic View
The dynamical systems counterpart to spectral radius is the Lyapunov exponent, which measures the average rate at which nearby trajectories diverge over long time horizons. A positive Lyapunov exponent corresponds to ρ > 1 (chaotic, exploding gradients); negative to ρ < 1 (contracting, vanishing); zero to ρ = 1 (edge of chaos).
Finite-Time Lyapunov Exponents (FTLEs) are the time-varying version, measuring divergence rate over a finite window rather than in the limit. The product norm at each position in our JacobianInspector IS essentially the FTLE: it measures how much nearby hidden states diverge after k steps of processing. When the FTLE heatmap shows a bright patch, the network is locally chaotic; when it shows a dark patch, the network is locally contracting.
This connects to attractor dynamics: the edge of chaos (FTLE ≈ 0) maximizes the network's sensitivity to new inputs while maintaining stability; it can update its internal state significantly in response to new information without diverging.
Attractors and the Edge of Chaos
The Curse of Memory
Misconception: “Vanishing gradients means gradients are literally zero”
No. They are exponentially small but nonzero. For ρ=0.95 and t−k=25 steps: 0.95²⁵ = 0.28, not zero, just 72% attenuated. At t−k=100: 0.95¹⁰⁰ = 0.006. The problem is that these tiny gradients are indistinguishable from noise in the loss landscape, making learning impossible in practice even though the signal technically exists.
Misconception: “More hidden units = proportionally longer memory”
No. Doubling H from 64 to 128 doubles the number of dimensions, but the memory horizon grows only logarithmically (roughly +2–3 characters). The bottleneck is the multiplicative structure of the Jacobian product, not the hidden state size. Going from H=64 to H=1024 (16×) might extend the memory horizon from 18 to ~25 characters, a modest gain for a massive parameter increase.
This is NOT just a training problem: Many tutorials present vanishing gradients as an “optimization difficulty” that better training techniques could solve. The curse of memory (2024) shows it is a fundamental architectural limitation: without gating, RNNs cannot stably maintain long-term memory regardless of how you train them. Three independent failure modes (gradient decay, information capacity, and sensitivity) all point to the same root cause.
What the Network Discovers
Emergence and interpretation
T=0.2"the the the the the..."
Too conservative: repetitive, stuck in highest-probability loop
T=0.8"the lion chases the mouse. the girl reads..."
Balanced: grammatical, varied, mostly correct
1.5"thr cxg slpeks moy brk..."
Too noisy: garbled, no longer respects grammar
Token-Level Analysis
The Zog Experiment
Misconception: “The network memorizes training examples”
No. It learns distributional statistics. The Zog experiment proves this definitively: a never-seen character pattern (“zog”) placed in the same contexts as “man” is immediately classified as NOUN_HUMAN, without any further training. Memorization would fail completely on novel inputs.
What the Network Gets Wrong
Elman's first simulation trained a tiny 2-unit RNN to maintain a running tally, effectively computing XOR across a binary sequence. The key insight: a static function (XOR is easy) becomes a memory task when cast in temporal form.
The network learned to track the parity of the sequence: whether it has seen an odd or even number of 1s. This is the simplest possible memory task: a 1-bit counter. Yet it requires the hidden state to maintain information across arbitrarily many time steps.
Reference: referance/temporal_xor.py(Plan 7) contains an implementation. The hidden state visualization shows a clear two-region structure: one half of the unit circle for “even count,” the other half for “odd count.”
The Ceiling
Why architecture must change
The Three Failure Modes: A Synthesis
| Failure Mode | Domain | Core Equation | Concrete Number | What Breaks | Reference |
|---|---|---|---|---|---|
| Vanishing gradients | Optimization | ρ=0.95: after 25 steps, 28% survives | Cannot learn long-range deps | Bengio 1994, Pascanu 2013 | |
| Finite capacity | Info theory | H=64 → ~200 effective bits | Cannot represent distant info | Data Processing Inequality | |
| Curse of memory | Dyn. systems | λ=0.993, T=100: 200× amplification | Cannot stably maintain memory | Wang & Li, ICML 2024 |
Each row addresses a different “cannot”: cannot learn, cannot represent, cannot maintain. All three must be solved simultaneously for long-range sequence processing.
Thought Exercises: Fix Each Failure Mode
📖 Recommended reading: For a deeper visual walkthrough of how LSTM solves the vanishing gradient problem, see Understanding LSTMs by Christopher Colah. For a rigorous modern treatment of state-space models, Memorization in RNNs on Distill.pub is highly recommended.
Chapter Summary: 5 Review Questions
- What single architectural change distinguishes an RNN from a feedforward network?
- Why does unfolding turn an RNN into a deep network?
- What determines whether gradients vanish or explode during BPTT?
- How did Elman show that grammar can emerge from prediction alone?
- What are the three independent reasons vanilla RNNs fail at long-range memory?
TL;DR: Recurrent Neural Networks
- What: A neural network with a feedback loop; the hidden state h_t is updated at every time step using both the current input and the previous hidden state, giving the network a form of memory.
- How it trains: Backpropagation Through Time (BPTT) unrolls the network across all time steps and sums gradients for each shared weight matrix. The same W_hh, W_xh, and W_hy are used at every step.
- Core limitation: Vanishing gradients: tanh saturates and gradients shrink exponentially as they flow backward through many time steps, making it hard to learn long-range dependencies. LSTMs and GRUs solve this with gating.
- Surprising capability: Without any explicit language supervision, RNNs trained on raw character sequences discover linguistic structure: word boundaries, parts of speech, and syntactic patterns emerge in the hidden state activations.
- When to use: Vanilla RNNs are rarely used today for production NLP (replaced by Transformers), but understanding them is essential; they introduce sequence modeling, hidden state memory, and BPTT, which are foundational to all modern sequential architectures.
| Dimension | Word2Vec ↗ | GloVe ↗ | RNN (this topic) |
|---|---|---|---|
| Core paradigm | Neural prediction | Matrix factorization | Sequential computation |
| Input | Word pairs | Co-occurrence counts | Character sequences |
| Output | Static embeddings | Static embeddings | Dynamic predictions + hidden states |
| Key challenge | Efficient training | Weighting function | Vanishing/exploding gradients |
| Key math | Softmax + neg sampling | Log-bilinear regression | Jacobian products + eigenvalues |
| Training | Skip-gram SGD | Weighted least squares | Manual BPTT + clipping |
| Visualization focus | Embedding geometry | Co-occurrence ratios | State trajectories + gradient flow |
| "Aha moment" | king − man + woman = queen | P(k|ice) / P(k|steam) | Grammar emerges from characters |
| Story arc | Words have geometry | Statistics reveal meaning | Networks can remember |
| Where it leads | GloVe | Compare to neural | LSTM & GRU (next) |
| Training objective | L = -\log \sigma(v_w \cdot v_c) - \textstyle\sum \log \sigma(-v_w \cdot v_n) | J = \textstyle\sum f(X_{ij})(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2 | L = -\tfrac{1}{T}\textstyle\sum \log p_t[y_t] |
Synthesis: All three objectives learn from distributional context (what appears near what) but through fundamentally different lenses: Word2Vec predicts context words from target words, GloVe fits a log-bilinear model to global co-occurrence counts, and our RNN predicts the next character from the hidden state. The distributional hypothesis connects them all.