RNN
Networks That Remember
Watch a network learn to carry information through time, character by character. See hidden states evolve, gradients vanish, and categories emerge from raw text.


A recurrent neural network carries a hidden state forward through time, compressing the input history into a fixed-size vector at each step; this enables sequence processing but causes gradients to vanish exponentially during training, limiting practical memory to short spans and motivating gated architectures like LSTM.
Why sequences need memory
You hear “The cat sat on the ___.” You know the answer is probably mat or couch. You know this because you heard the whole sentence, not just the last word. A feedforward neural network that processes one word at a time has no such luxury. It sees “the” in isolation, then forgets it entirely when “cat” arrives. It has no notion of what came before. ↗ Word2Vec: Why Can't Computers Understand Words?
This is not a quirk of a particular architecture. It is a fundamental property of feedforward networks: they are stateless. Each forward pass maps input to output with no memory of previous passes. For static tasks (classifying an image, predicting house prices) this is fine. For sequences, it is a fatal limitation.
The shift register fix and why it fails. The obvious workaround, described by Elman (1990) in Section 1, is to concatenate recent inputs into one big vector, a “shift register.” Feed the model the last k characters as a single input. This approach has three fatal problems.
First, a fixed-length buffer: you must pre-specify the maximum sequence length before training. Real language has no such bound. Second, absolute vs relative position: the same pattern at different positions looks geometrically different. Elman's example: the pattern [1 1 0 0] and [0 0 1 1] represent the same two-word sequence shifted by two positions, yet they are orthogonal vectors, maximally different in Euclidean space. A network that learned the first cannot transfer to the second. Third, parameters explode: each position needs its own set of weights. For a 100-step window and a 27-character vocabulary, the input-to-hidden weight matrix grows from 27×64 to 2,700×64. This scales poorly.
A biological aside. Recurrent connections are ubiquitous in cortical circuits. Even the visual cortex has massive feedback loops, roughly equal numbers of feedforward and feedback connections. The brain does not process inputs in a single sweep; it maintains and updates internal states across time. Evolution solved the sequence problem not by building wider inputs, but by building systems that carry information forward.
The core question this topic answers: What if a network's output at one moment became part of its input at the next moment? This single idea, a feedback loop, is what gives neural networks memory. The rest of this story is an exploration of what that idea makes possible, and why it eventually hits a wall that forced an architectural revolution.
Toy vs Production: Our examples use a 27-character vocabulary and a 38-word grammar. Real language models process 50,000+ tokens. The simplicity is deliberate: it makes every phenomenon inspectable. When you can see all 64 hidden units and all 27 output probabilities, nothing is a black box.
We have seen why feedforward networks fail at sequences: they have no memory. But if the problem is clear, so is the direction of the solution. What if we gave the network exactly one new ability: the ability to pass a message to itself?
Anatomy of a recurrent cell
A brief history. Jordan (1986) first used recurrent connections, feeding the network's output back as input to generate sequences. But Elman (1990) had a more powerful idea: feed the hidden layer's activations back to itself. Instead of remembering what you said, remember what you were thinking. The hidden state is richer than the output: it contains intermediate representations, not just final decisions. This distinction matters enormously.
The Elman architecture works like this: at each time step, the hidden layer receives two inputs: the current character AND a copy of its own previous state. The copy operation has a fixed weight of 1.0; it is a direct pipeline, not a learned transformation. This means the hidden state carries a compressed summary of everything seen so far. The network must learn to pack relevant history into those 64 real-valued slots.
Building the forward pass step by step. The equation is straightforward once you see it assembled piece by piece.
Recurrence alone: h_t = W_hh h_{t−1}.
Without input, the hidden state either decays to zero (eigenvalues < 1) or explodes (eigenvalues > 1). Not useful yet.
Add input: h_t = W_hh h_{t−1} + W_xh x_t.
Now the state combines new input with memory. But values can grow unbounded.
Add nonlinearity: h_t = tanh(W_hh h_{t−1} + W_xh x_t + b_h).
tanh squashes everything to [−1, 1]. This creates a bounded “memory register.” Each unit stores a value between −1 and +1 representing some aspect of input history.
Add output: y_t = softmax(W_hy h_t + b_y).
Read out predictions from the hidden state. The softmax converts logits to a probability distribution over next characters.
Our 27-character vocabulary produces one-hot vectors: a 27-dimensional vector with exactly one non-zero entry. For the letter “a,” the vector is [1, 0, 0, ..., 0]. This is information-poor: 26 of 27 entries are wasted. The network receives almost no signal before the first weight multiplication.
An embedding layer fixes this by learning a dense representation. Instead of a 27-dim sparse vector, each character gets a 16-dim dense vector: a column of a learned 16×27 embedding matrix.
This is exactly the same concept as Word2Vec embeddings , representations discovered through prediction. The skip-gram model learns word vectors by predicting context; our character embedding learns character vectors by predicting the next character. Same principle, different granularity. ↗ GloVe: GloVe in the Landscape
Our choice: we use one-hot vectors to keep everything visible and mathematically transparent. Adding a 27×16 embedding matrix would add 432 parameters and reduce that transparency.
What the hidden state actually stores. Research by Collins, Sohl-Dickstein, and Sussillo (2017) shows that each hidden unit stores approximately 1 real number from input history. With H=64 units, you get roughly 64 “slots” of memory. But this is bounded further by the data processing inequality: since x_1 → h_1 → h_2 → … → h_t forms a Markov chain, I(x_1; h_t) ≤ I(x_1; h_{t−1}), so information about early inputs can only decrease as processing continues.
This memory is lossy, task-dependent, and constantly being overwritten. Think of it as a summary, not a transcript. Two very different input histories can produce identical hidden states; the network retains what is useful for prediction and discards the rest. This is why training matters: gradient descent shapes what the hidden state chooses to retain.
Misconception: “RNNs have infinite memory”
No. The hidden state is a fixed-size vector (64 numbers in [−1, +1]). Information from early inputs is progressively compressed and eventually overwritten. The effective memory horizon depends on spectral radius and task structure; for our model, it is roughly 15–20 characters. Chapter 5 will make this precise.
Misconception: “The hidden state is a recording of everything”
No. It is a lossy, task-dependent compression. Two very different input histories can produce identical hidden states (aliasing). The network retains what is useful for predicting the next character and discards the rest.
The five parameters. Everything the network learns lives in five arrays: W_xh (27×64, input to hidden), W_hh (64×64, hidden to hidden), W_hy (64×27, hidden to output), and the biases b_h (64) and b_y (27).
Why orthogonal initialization? An orthogonal matrix has all eigenvalues of magnitude exactly 1, so initializing W_hh this way sets the spectral radius to 1.0: gradients neither vanish nor explode at the start of training. Chapter 5 develops why this matters.
The forward pass processes one step at a time, carrying hidden state forward. But to train this network, we need gradients. And computing gradients through a feedback loop requires a beautiful trick: unfolding the loop into something familiar.
From loops to layers
Take the RNN's loop and lay it flat across time. An RNN processing T characters is mathematically equivalent to a T-layer deep feedforward network. Each “layer” computes the same function with the same parameters. This is not an analogy; it is exact. The two descriptions are mathematically identical; one is just drawn differently.
Weight sharing. In the unfolded graph, every layer uses the same W_xh, W_hh, and W_hy. A pattern learned at one position automatically transfers to every other position, which is exactly what the shift-register approach could not do.
The parameter count formula is independent of T: total = V·H + H·H + H·V + H + V = 27·64 + 64·64 + 64·27 + 64 + 27 = 7,643 parameters, whether the network unrolls for 10 steps or 10,000.
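That independence can be checked in one short function (a sketch; the function name is ours):

```python
def rnn_param_count(V, H):
    """W_xh + W_hh + W_hy + the two biases; sequence length T never appears."""
    return V * H + H * H + H * V + H + V

total = rnn_param_count(27, 64)   # 1728 + 4096 + 1728 + 64 + 27
print(total)                      # 7643
```

Compare the shift-register alternative from Chapter 1: a 100-step window would need a 2,700×64 input matrix alone.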
RNNs as programs. As Karpathy (2015) puts it: “If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.” The hidden state is the program's memory; the weight matrices are the program's instructions. The same instructions execute at every time step. Siegelmann and Sontag (1995) proved that RNNs with sufficient hidden state size and real-valued precision are Turing-complete - they can simulate any computation. ↗ GloVe: GloVe in the Landscape
A 2-unit RNN can compute parity, whether the number of 1s in a binary sequence is odd or even. The key insight: a single bit can be encoded in one real-valued hidden unit.
With the right weights, one unit flips its sign every time a 1 arrives and holds its value on a 0, so the sign of that unit at the end of the sequence encodes the parity.
This proves RNNs can implement finite automata, the simplest computational class. Regular languages, recognizable by finite automata, are learnable by Elman networks.
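The sign-flip idea can be demonstrated in a few lines. This is a sketch of the 1-bit memory principle, not Elman's exact construction: for clarity it uses a multiplicative input gain on a single unit, whereas an additive 2-unit tanh network achieves the same flip.

```python
import numpy as np

def parity_rnn(bits, gain=5.0):
    """Sign of one saturated tanh unit encodes odd/even count of 1s.
    Input 1 flips the sign; input 0 preserves it."""
    h = 0.9                                   # start in the 'even' state
    for x in bits:
        h = np.tanh(gain * (1 - 2 * x) * h)   # x=1 -> negate, x=0 -> keep
    return int(h < 0)                         # 1 if an odd number of 1s was seen

assert parity_rnn([1, 0, 1, 1]) == 1          # three 1s -> odd
assert parity_rnn([1, 1, 0, 0]) == 0          # two 1s -> even
```

The large gain keeps the unit saturated near ±1, so the bit never decays no matter how long the sequence runs.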
The input/output taxonomy. RNNs enable five qualitatively different processing patterns:
| Pattern | Application | Our task |
|---|---|---|
| One-to-one | Standard feedforward (no RNN) | |
| One-to-many | Image captioning | |
| Many-to-one | Sentiment analysis | |
| Many-to-many (synced) | POS tagging | ✓ next-character prediction |
| Many-to-many (encoder-decoder) | Machine translation | |
A standard RNN reads left-to-right; at position t, it can only see positions 1 through t. But many tasks benefit from knowing what comes after too. A bidirectional RNN runs two passes: a forward pass producing hidden states from the left context and a backward pass producing hidden states from the right context; the two are concatenated at each position before the output layer.
Bidirectional RNNs excel at tasks where context on both sides matters: part-of-speech tagging (knowing what follows “run” distinguishes verb from noun), named entity recognition, and coreference resolution. But they are fundamentally incompatible with language generation (you cannot look at future tokens when generating token by token) and they require knowing the full sequence upfront, making them unsuitable for streaming.
Stacked (deep) RNNs take a different approach to adding capacity: layer the recurrent cells vertically. Layer 1's hidden states become the inputs to Layer 2; Layer 2's hidden states feed Layer 3; and so on. This creates a hierarchy where lower layers capture fine-grained temporal patterns (character-level phonology in our case) while higher layers capture more abstract features (word-level syntax and grammar).
The key distinction: temporal depth (sequence length T) creates width in the unfolded graph; vertical depth (number of layers L) creates a separate dimension of abstraction. In our toy model, L=1; all representations are at one level. Production RNNs commonly use L=2–4. Beyond L=4, the benefits diminish and training becomes harder.
Different RNN architectures have provably different computational powers. Weiss, Goldberg, and Yahav (2018) showed a strict containment hierarchy: Elman ⊊ GRU ⊊ LSTM in terms of the formal languages each can recognize.
Elman networks can learn regular languages, patterns recognizable by finite automata. Our parity checker is an example. GRU can recognize some counter languages, including bounded counting tasks. LSTM can learn context-free languages including the classic aⁿbⁿ, which requires counting the a's and matching them against the b's.
Why does gating change the computational class? In Weiss et al.'s analysis, the LSTM's cell state is unbounded and can implement genuine counters, while the GRU's state is squashed into (−1, 1) and can only count approximately within a bounded range; the Elman network's tanh state can do neither reliably.
This hierarchy has practical consequences. If your task requires keeping exact counts or deeply nested structure, a vanilla RNN is architecturally insufficient, not just harder to train, but provably incapable.
An unfolded RNN is a deep feedforward network with shared weights. We already know how to train deep networks: backpropagation. Applied to the unfolded RNN, this becomes Backpropagation Through Time.
BPTT and gradient clipping
The loss function. We want the network to predict the next character at every position. The loss is the average cross-entropy across T time steps:

L = −(1/T) Σ_{t=1}^{T} log ŷ_t[c_{t+1}]

where ŷ_t is the softmax output at step t and c_{t+1} is the index of the true next character. Minimizing L maximizes the probability the model assigns to the actual corpus.
Gradients for the output layer are the easy part. The softmax-plus-cross-entropy combination gives the clean form ∂L/∂W_hy = Σ_t (ŷ_t − y_t) h_tᵀ, where y_t is the one-hot encoding of the true next character.
Gradients for the recurrent layer are the hard part. Computing ∂L/∂h_t requires accounting not only for the loss at step t, but for every future loss the hidden state influences through the recurrence.
The BPTT recursion. Working backward from t=T:
At t = T (terminal step):
∂L/∂h_T = W_hyᵀ (ŷ_T − y_T)
Only the direct contribution through the output layer.
At t < T:
∂L/∂h_t = W_hyᵀ (ŷ_t − y_t) + W_hhᵀ [(1 − h_{t+1}²) ⊙ ∂L/∂h_{t+1}]
The second term is the gradient flowing backward through time, through the tanh derivative and W_hh.
The temporal gradient passes through two things at each step: (1) the tanh derivative (1 − h_{t+1}²), which is at most 1 and near 0 for saturated units, and (2) the transposed recurrent matrix W_hhᵀ. Both factors appear once per step, so their effects compound multiplicatively over time.
Accumulating parameter gradients. Once we have ∂L/∂h_t at every step, the shared weights collect contributions from all time steps:
∂L/∂W_hh = Σ_t [(1 − h_t²) ⊙ ∂L/∂h_t] h_{t−1}ᵀ
The gradient for W_xh has the same form with x_t in place of h_{t−1}, and the bias gradients simply sum the per-step terms.
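The recursion and accumulation can be condensed into a short NumPy sketch with a finite-difference check. Names are ours; biases and the W_xh/W_hy accumulations are omitted to keep it compact, and this is not the document's train.py.

```python
import numpy as np

rng = np.random.default_rng(1)
V, H, T = 5, 4, 6                     # tiny sizes so the check runs instantly
W_xh = rng.normal(0, 0.5, (H, V))
W_hh = rng.normal(0, 0.5, (H, H))
W_hy = rng.normal(0, 0.5, (V, H))
xs = rng.integers(0, V, T)            # input character indices
ys = rng.integers(0, V, T)            # target next-character indices

def forward(W_rec):
    """Forward pass, returning loss, stored hidden states, softmax outputs."""
    hs, ps, h, loss = {-1: np.zeros(H)}, {}, np.zeros(H), 0.0
    for t in range(T):
        x = np.zeros(V); x[xs[t]] = 1.0
        h = np.tanh(W_xh @ x + W_rec @ h)
        hs[t] = h
        logits = W_hy @ h
        p = np.exp(logits - logits.max()); p /= p.sum()
        ps[t] = p
        loss -= np.log(p[ys[t]]) / T
    return loss, hs, ps

loss, hs, ps = forward(W_hh)

# Backward sweep: the recursion from the text, line for line
dW_hh = np.zeros_like(W_hh)
dh_next = np.zeros(H)                  # gradient arriving from step t+1
for t in reversed(range(T)):
    dy = ps[t].copy(); dy[ys[t]] -= 1.0; dy /= T     # softmax + cross-entropy
    dh = W_hy.T @ dy + dh_next                       # direct + through-time terms
    dz = (1.0 - hs[t] ** 2) * dh                     # through the tanh derivative
    dW_hh += np.outer(dz, hs[t - 1])                 # accumulate shared gradient
    dh_next = W_hh.T @ dz                            # flows back to step t-1

# Finite-difference check on one entry of W_hh confirms the recursion
eps = 1e-5
Wp, Wm = W_hh.copy(), W_hh.copy()
Wp[0, 0] += eps; Wm[0, 0] -= eps
numeric = (forward(Wp)[0] - forward(Wm)[0]) / (2 * eps)
```

The numeric and analytic gradients agree to several decimal places, which is the standard sanity check for any hand-written BPTT.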
RTRL, introduced by Williams and Zipser (1989), computes gradients in forward time rather than backward time. Instead of waiting until the end of the sequence to backpropagate, RTRL maintains a running computation of how each parameter affects each hidden unit.
The cost: RTRL requires O(H³) memory for its sensitivity tensor and O(H⁴) computation per time step, versus BPTT's O(T·H) memory and O(T·H²) total compute. This is why BPTT won in practice.
RTRL is theoretically important because it is online: it can update weights after every single time step, without truncation bias. Recent work (SNAP, R-BPTT) revisits forward-time gradients for online learning scenarios where truncated BPTT's bias is unacceptable.
Truncated BPTT. Full BPTT over 18,000 characters would be impractical; the unfolded graph would have 18,000 layers with the same gradient problems. Instead, we split the text into chunks of 25 characters, backpropagate only within each chunk, and carry the hidden state forward across chunk boundaries, so memory (though not gradient) persists.
The gradient estimate is biased: events more than 25 steps ago receive no gradient signal. But this bias is acceptable: we know from Chapter 5 that gradients beyond ~20 steps are negligibly small anyway.
During training, we always feed the correct previous character as input, even if the model predicted incorrectly. This is teacher forcing: the teacher (correct label) overrides the student's (model's) mistake. It keeps training stable because gradients never compound through a chain of wrong predictions. Wikipedia: Teacher forcing ↗ Word2Vec: Why Does This Actually Work?
But teacher forcing creates exposure bias: at generation time, the model must condition on its own predictions, which may be wrong. It was never trained for this situation. The effect is dramatic. If per-character accuracy is 0.85, after 10 characters: P(all correct) = 0.85¹⁰ = 0.20. There is an 80% chance of at least one error in just 10 characters. And each error cascades; if the model predicts “b” instead of “c” after “the ca,” it has now entered a sequence (“the cab...”) it has never encountered during training.
Scheduled sampling (Bengio et al., 2015) bridges the gap with a curriculum: during training, use the ground truth with probability ε and the model's own prediction with probability 1−ε. As training progresses, ε decays from 1.0 to 0.0. The model gradually learns to handle its own mistakes.
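The scheduled-sampling curriculum is a few lines of code. A sketch with names of our choosing; Bengio et al. (2015) also propose exponential and inverse-sigmoid decay schedules, and we show only the linear one.

```python
import random

def scheduled_eps(step, total_steps):
    """Linear decay of the teacher-forcing probability from 1.0 to 0.0."""
    return max(0.0, 1.0 - step / total_steps)

def next_input(ground_truth, model_prediction, eps, rng=random):
    """Feed the true character with probability eps, else the model's own guess."""
    return ground_truth if rng.random() < eps else model_prediction
```

Early in training (eps ≈ 1) this is pure teacher forcing; late in training (eps ≈ 0) the model runs on its own predictions, the same condition it will face at generation time.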
Professor forcing (Lamb et al., 2016) takes a more principled approach: use adversarial training to match the distribution of hidden states between teacher-forced and free-running modes. When both modes produce the same hidden state distribution, the model is immune to exposure bias. This is more expensive but more principled than scheduled sampling.
For our toy model, even at temperature T=0.8, generation introduces errors because the model was only trained with perfect input. This is visible in the RNNTrainingPlayer below: high-temperature samples diverge faster and contain more grammatical violations.
Regularization for RNNs. Standard dropout breaks recurrence by randomly zeroing hidden units; if unit k is dropped at step t, all information it carried about the past is permanently lost for all future steps. The temporal signal is destroyed, not just the spatial one.
The two solutions that matter in practice are variational dropout (Gal & Ghahramani, 2016), which applies the same dropout mask at every step, preserving temporal consistency, and layer normalization (Ba et al., 2016), which normalizes within each step (batch norm fails on variable-length sequences). For our toy corpus, we use only gradient clipping. Regularization matters at scale. See Merity et al. (2018) for a thorough study.
Gradient clipping. Even with truncation, gradients can explode. Gradient clipping rescales the entire gradient vector when its norm exceeds a threshold:

g ← g · (C / ‖g‖)   if ‖g‖ > C, else g unchanged.

This preserves the gradient's direction while capping its magnitude. In our code, C=5.0. Clipping magnitude, not direction: this is why gradient clipping is safe; it cannot reverse a weight update, only moderate it.
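Global-norm clipping is a one-liner in practice (a sketch; the function name is ours):

```python
import numpy as np

def clip_gradients(grads, C=5.0):
    """If the joint norm of all gradient arrays exceeds C, rescale them all
    by C/norm. Direction is preserved; only the magnitude is capped."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if norm > C:
        grads = [g * (C / norm) for g in grads]
    return grads

big = clip_gradients([np.full(10, 100.0)])       # norm ~316 -> rescaled to 5
small = clip_gradients([np.ones(3)])             # norm ~1.7 -> untouched
```

Note the norm is computed over all parameter arrays jointly; clipping each array separately would change the update's direction.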
The complete derivation proceeds in four phases: (1) a forward pass storing every hidden state, (2) output-layer gradients at each step, (3) the backward recursion for ∂L/∂h_t, and (4) accumulation of the shared parameter gradients.
Cross-reference: this corresponds line-by-line to the backward() function in train.py. See the Code walkthrough section code-bptt for annotated correspondence.
BPTT gives us gradients. But there is a catch. Look again at the gradient arrows in the BPTT visualizer. Notice how they get fainter further back in time? This is not a coincidence. It is a mathematical inevitability.
Vanishing gradients and spectral radius
Look at the gradient flow from Chapter 4's BPTTVisualizer. The gradient at position 0 is roughly 25× smaller than at position 24. Information from the beginning of the chunk barely influences the parameter update. The network effectively has a “memory horizon” - beyond this distance, it cannot learn long-range dependencies.
The Jacobian product. Expanding the BPTT recursion across multiple time steps reveals a product of matrices:

∂h_t/∂h_k = Π_{i=k+1}^{t} ∂h_i/∂h_{i−1} = Π_{i=k+1}^{t} diag(1 − h_i²) W_hh
Each factor is a matrix. The gradient flows through this product. Its behavior is determined by the eigenvalues of these matrices. The product of T matrices, even if each has eigenvalues near 1, can dramatically change magnitude depending on whether eigenvalues compound to grow or shrink.
The tanh derivative ceiling. The factor diag(1 − h_i²) has entries in [0, 1]: exactly 1 only for a unit sitting at 0, and near 0 for a unit saturated at ±1. It can only shrink the product, never grow it; any explosive behavior must come from W_hh.
The spectral radius. The spectral radius ρ of W_hh is the largest absolute value among its eigenvalues. It determines the long-run fate of the Jacobian product:
ρ < 1: the Jacobian product shrinks exponentially at rate ≈ ρ^(t−k). Gradients vanish. Long-range dependencies are unlearnable.
ρ > 1: the product grows exponentially. Gradients explode. Training becomes unstable without clipping.
ρ = 1: the “edge of chaos.” Gradients neither vanish nor explode. Maximum memory, maximum sensitivity.
Orthogonal initialization sets ρ = 1.0, starting the network at the edge of chaos. This is why it is optimal: gradients flow freely from the very first step of training.
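The three regimes can be observed directly by multiplying matrices. A sketch: we scale an orthogonal matrix to set ρ exactly, and we omit the tanh factors, which could only shrink the product further.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 64, 25

def product_norm(rho):
    """Push a unit vector through T copies of a matrix with spectral radius rho."""
    Q, _ = np.linalg.qr(rng.normal(size=(H, H)))   # orthogonal: all |eigenvalues| = 1
    W = rho * Q                                    # spectral radius is now exactly rho
    v = np.ones(H) / np.sqrt(H)                    # unit-norm starting vector
    for _ in range(T):
        v = W @ v
    return float(np.linalg.norm(v))

shrunk = product_norm(0.9)   # ~0.9**25 ~ 0.07: vanishing
stable = product_norm(1.0)   # ~1: edge of chaos
grown  = product_norm(1.1)   # ~1.1**25 ~ 10.8: exploding
```

Because the matrix is a scaled orthogonal transform, the norm after T steps is exactly ρ^T: the telephone-game arithmetic, realized in 64 dimensions.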
Bengio, Simard, and Frasconi (1994) proved something stronger than “gradients tend to vanish.” They showed that for an RNN with stable fixed points (which any well-behaved RNN must have to produce consistent predictions) gradients must vanish.
Formal statement: if the RNN converges to a stable fixed point h* (one where the Jacobian's spectral radius is below 1), then ∂h_t/∂h_k → 0 exponentially as t − k grows. Reliable storage requires stability, and stability forces gradients to vanish.
The implication is stark: there is no sweet spot for vanilla RNNs. Stability and long-term memory are fundamentally in tension. You cannot have both simultaneously without changing the architecture. Pascanu, Mikolov, and Bengio (2013) extended this analysis to exploding gradients, showing both phenomena are two sides of the same Jacobian product instability.
The gradient problem has a companion at the level of information, not optimization. The data processing inequality says that for a Markov chain X → Y → Z, I(X; Z) ≤ I(X; Y): no amount of processing can increase the information Z carries about X.
In the RNN, each hidden state is computed only from the previous state and the current input, so x_1 → h_1 → h_2 → … → h_T is a Markov chain with respect to the first input. The information each successive state carries about x_1 can only decrease.
This is an information-theoretic ceiling, independent of gradient flow. Even with perfect gradients, perfect optimizer, and perfect initialization, if the RNN has seen T steps, information about step 1 has been compressed T times and can only have decreased. The forgetting curve from our memory probe analysis is an empirical measurement of this decay. ↗ GloVe: What Counts Reveal (how counting captures longer-range info)
The telephone game analogy. Imagine a chain of 25 people whispering a message. Each person can only hear 90% of what the previous person said (like a Jacobian factor with norm 0.9). After 25 people: 0.9²⁵ = 0.072. Only 7% of the original message survives. This is vanishing gradients. If each person amplifies the message by 10% (norm 1.1), after 25 people: 1.1²⁵ = 10.8. The message is wildly distorted. This is exploding gradients.
The dynamical systems counterpart to spectral radius is the Lyapunov exponent, which measures the average rate at which nearby trajectories diverge over long time horizons. A positive Lyapunov exponent corresponds to ρ > 1 (chaotic, exploding gradients); negative to ρ < 1 (contracting, vanishing); zero to ρ = 1 (edge of chaos).
Finite-Time Lyapunov Exponents (FTLEs) are the time-varying version, measuring divergence rate over a finite window rather than in the limit. The product norm at each position in our JacobianInspector IS essentially the FTLE: it measures how much nearby hidden states diverge after k steps of processing. When the FTLE heatmap shows a bright patch, the network is locally chaotic; when it shows a dark patch, the network is locally contracting.
This connects to attractor dynamics: the edge of chaos (FTLE ≈ 0) maximizes the network's sensitivity to new inputs while maintaining stability; it can update its internal state significantly in response to new information without diverging.
An attractor is a state the hidden state converges to under repeated input. Feed the character “a” 15 times to our trained network: the hidden state spirals toward a fixed point in 64-dimensional space. The convergence is exponential: the distance to the fixed point shrinks by a roughly constant factor each step, governed by the spectral radius of the Jacobian at that point.
This attractor is the fixed point from the Bengio condition. Its stability is the gradient problem. The three dynamical regimes correspond directly to the spectral radius cases: ρ < 1 creates point attractors (stable, forgetting past); ρ ≈ 1 creates the edge of chaos (long transients, maximum memory); ρ > 1 creates chaotic dynamics (exploding gradients, unstable training).
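The point-attractor regime (ρ < 1) is easy to reproduce with a random contracting network. A sketch under our own assumptions: random weights rather than the trained model, with a fixed input vector standing in for the repeated character.

```python
import numpy as np

rng = np.random.default_rng(3)
H = 64
Q, _ = np.linalg.qr(rng.normal(size=(H, H)))
W_hh = 0.9 * Q                        # contracting regime: spectral radius 0.9
drive = rng.normal(0, 0.5, H)         # the same input fed at every step

h = rng.normal(size=H)                # arbitrary starting state
moves = []
for _ in range(100):
    h_new = np.tanh(W_hh @ h + drive) # identical update, step after step
    moves.append(float(np.linalg.norm(h_new - h)))
    h = h_new
```

The step-to-step movement `moves` shrinks geometrically toward zero: the trajectory spirals into a single fixed point, exactly the “stable fixed point” of the Bengio condition.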
Our model self-organizes toward the edge of chaos during training; eigenvalues that started on the unit circle drift outward in directions that help maintain useful memory, while gradient clipping prevents the resulting instability from being catastrophic.
Recent research (2025) shows that training can cause attractor-merging crises - qualitative bifurcations where two separate attractor basins merge into one, causing sudden changes in representation structure. These correspond to the loss plateaus sometimes observed in training curves.
Beyond vanishing/exploding gradients, a deeper problem exists. As an RNN's memory increases, as eigenvalues approach 1, the network becomes exponentially sensitive to tiny parameter changes. Even without gradient explosion, small weight perturbations produce massive output changes. This was formalized by Wang and Li (ICML 2024) as the curse of memory.
Formal statement: sensitivity scales as exp(memory_length / τ), where τ is the time constant of the system. For a memory horizon of T steps, a weight perturbation of size δ produces an output change of approximately δ · exp(T/τ). This grows without bound as T increases.
Numerical example: for ρ = 0.999 and a memory of 100 steps, a weight perturbation of 10⁻⁶ produces an output change of ~0.1, five orders of magnitude of amplification. The network is hyper-sensitive. The gradient step that corrects one output wreaks havoc on all outputs that depend on distant context.
Adding nonlinearities does not help. tanh and ReLU change the constant factors but not the exponential scaling. The curse arises from the multiplicative structure of the Jacobian product, which is present regardless of the activation function.
Why gating does help: LSTM's forget gate provides a multiplicative shortcut that decouples memory length from sensitivity. Instead of pushing memory through a full matrix product, the cell state is updated with element-wise operations: c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t. When a forget gate sits near 1, that dimension of memory passes through essentially untouched.
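A toy calculation shows the highway effect. The gate values here are hypothetical, chosen only to illustrate the per-dimension decay.

```python
import numpy as np

# Hypothetical per-unit forget-gate values (not from any trained model):
f = np.array([0.99, 0.9, 0.1])
T = 50
carry = f ** T    # gradient carried along the cell state after T steps:
                  # a product of scalars per dimension, no matrix mixing
```

A unit whose gate learns to sit at 0.99 still carries about 60% of its signal after 50 steps, while a unit gated at 0.1 forgets almost instantly. The network chooses, per dimension, how long to remember; the vanilla RNN's shared W_hh cannot.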
Connection to S4/Mamba: Gu et al.'s S4 (2022) uses stable reparameterization, diagonal state matrices with negative real eigenvalues, to mathematically break the curse. The eigenvalues are parameterized so their real parts are forced negative (for example λ = −exp(a) + i·b), so no gradient update can push the system into the unstable regime.
Practical consequence: you cannot fix vanilla RNNs by training longer, using a better optimizer, or increasing the learning rate. These are optimization tricks; the curse is an architectural problem. Better training cannot create a gradient highway where none exists.
Misconception: “Vanishing gradients means gradients are literally zero”
No. They are exponentially small but nonzero. For ρ=0.95 and t−k=25 steps: 0.95²⁵ = 0.28, not zero, just 72% attenuated. At t−k=100: 0.95¹⁰⁰ = 0.006. The problem is that these tiny gradients are indistinguishable from noise in the loss landscape, making learning impossible in practice even though the signal technically exists.
Misconception: “More hidden units = proportionally longer memory”
No. Doubling H from 64 to 128 doubles the number of dimensions, but the memory horizon grows only logarithmically (roughly +2–3 characters). The bottleneck is the multiplicative structure of the Jacobian product, not the hidden state size. Going from H=64 to H=1024 (16×) might extend the memory horizon from 18 to ~25 characters, a modest gain for a massive parameter increase.
This is NOT just a training problem: Many tutorials present vanishing gradients as an “optimization difficulty” that better training techniques could solve. The curse of memory (2024) shows it is a fundamental architectural limitation: without gating, RNNs cannot stably maintain long-term memory regardless of how you train them. Three independent failure modes (gradient decay, information capacity, and sensitivity) all point to the same root cause.
The vanishing gradient problem limits how far an RNN can look back. But within its effective memory horizon, the network can learn remarkable things. Let us see what our trained model actually discovered.
Emergence and interpretation
Training progression. The model begins outputting random noise; at step 0, before any learning, it produces a meaningless stream like “xqkzm blg pfr...” By step 5,000, recognizable character patterns emerge: spaces appear in plausible positions, common digrams like “th” start appearing. By step 30,000, the output is grammatically structured sentences with correct subject-verb-object patterns. The model learned a grammar it was never explicitly taught.
Prediction error reveals structure. This is a replication of Elman (1990) Figures 4 and 6. When we run the trained model over the corpus and measure prediction error at each character position, the error spikes at the first character of each word and falls steadily as the word unfolds, because each successive character further constrains what can come next.
The RNN discovered “words” without anyone telling it words exist. The error signal IS a word boundary detector. Nobody labeled word boundaries; the model inferred them from sequential statistics alone.
Temperature and sampling. The same trained model generates very different text depending on the sampling temperature T, which controls the sharpness of the softmax distribution:
T=0.2: “the the the the the...” Too conservative: repetitive, stuck in the highest-probability loop.
T=0.8: “the lion chases the mouse. the girl reads...” Balanced: grammatical, varied, mostly correct.
T=1.5: “thr cxg slpeks moy brk...” Too noisy: garbled, no longer respects grammar.
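Temperature is just a scalar divide before the softmax. A minimal sketch with made-up logits (the function name and values are ours):

```python
import numpy as np

def sample_char(logits, T, rng):
    """Divide logits by T before softmax: T < 1 sharpens the distribution
    toward the argmax, T > 1 flattens it toward uniform."""
    z = np.asarray(logits, dtype=float) / T
    p = np.exp(z - z.max())
    p /= p.sum()
    return int(rng.choice(len(p), p=p))

rng = np.random.default_rng(0)
logits = np.array([2.0, 1.0, 0.1])                           # toy next-char scores
cold = [sample_char(logits, 0.05, rng) for _ in range(100)]  # nearly greedy
warm = [sample_char(logits, 1.5, rng) for _ in range(100)]   # spread out
```

At very low temperature every draw picks the top character (the “the the the” loop); at high temperature low-probability characters appear often enough to break the grammar.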
Category emergence. This is a replication of Elman (1990) Figure 7, and the crown jewel of this analysis. Take the hidden state vectors for each word, averaged across all contexts in which that word appears, then cluster them using hierarchical agglomerative clustering. The result: the dendrogram splits first into verbs versus nouns, the nouns divide into animates and inanimates, and the animates into humans and animals, even though the network was never told any of these categories exist.
Word-averaged vectors smooth over context. A more rigorous test uses individual token vectors - every single occurrence of “boy” gets its own hidden state vector, not just the average across all occurrences. This is Elman's Figure 9 replication.
The result: sub-clusters organized by preceding context. Occurrences of “boy” following “the” cluster separately from those following “a.” The model is not just classifying words; it is tracking grammatical state: which determiner was used, what verb preceded the noun phrase. The representations are not just semantic but syntactico-semantic.
Tony Plate's Warning (Elman, 1990, Footnote 6): averaging hidden states across contexts could create artificial cluster structure that reflects nothing real. Individual token vectors refute this: the sub-clusters are present even without any averaging, and they reflect genuine distributional differences in context. The structure is real, not an averaging artifact.
The most striking experiment in the study. Replace “man” with “zog” - a novel character sequence that has never appeared in the training corpus, and run it through the trained model with no further training.
Result: “zog” occupies “man's” position in the clustering dendrogram. It appears in the NOUN_HUMAN cluster, next to “man,” “boy,” and “king.” The network inferred the category from distributional context alone - because “zog” appeared in the same syntactic positions as “man” (subject of sentences, after determiners, before transitive verbs).
This is a direct demonstration of the distributional hypothesis: “a word is characterized by the company it keeps.” The model's representations are purely distributional. Memorization would fail completely on a novel input; this replication proves the model learned statistics, not examples. ↗ Word2Vec: Words Are Defined by Their Friends ↗ GloVe: Counting Words Together
Misconception: “The network memorizes training examples”
No. It learns distributional statistics. The Zog experiment proves this definitively: a never-seen character pattern (“zog”) placed in the same contexts as “man” is immediately classified as NOUN_HUMAN, without any further training. Memorization would fail completely on novel inputs.
Interpretable neurons (in the style of Karpathy's analysis): individual hidden units that track specific features. These emerge spontaneously; we did not ask for them.
The dynamical systems view. The hidden state traces a trajectory through 64-dimensional space as it reads a sequence. When projected to 2D via PCA: characters within the same word cluster together, different word categories occupy distinct regions, and the trajectory makes loops within words and jumps between words. This is the hidden state acting as a dynamical system, with different attractor regions for different grammatical contexts. ↗ Word2Vec: The Payoff (king − man + woman ≈ queen)
The successes are real but bounded. Our trained RNN makes systematic errors that reveal its limitations:
Cross-sentence dependencies. The network cannot maintain context across sentence boundaries. The subject of the previous sentence does not influence predictions in the current sentence. This is by design (our grammar generates independent sentences) but reflects a real limitation of the architecture.
Novel combinations. Word pairs never seen together in training, such as “king eats the glass” (legal grammar, unusual semantics), produce confused hidden states. The network's distributional knowledge is accurate for trained patterns but extrapolates poorly to unseen combinations.
Long words. Rare, longer words like “cookie” (6 characters) show higher per-character error than common short words. The model has seen fewer examples and the initial characters are less constraining.
Rare templates. Template 15 (adverb + sentence) has systematically higher loss than frequent templates. Rare patterns in the training distribution produce less reliable predictions, even though the grammar is perfectly deterministic.
Elman's first simulation trained a tiny 2-unit RNN to maintain a running tally, effectively computing XOR across a binary sequence. The key insight: a static function (XOR is easy) becomes a memory task when cast in temporal form.
The network learned to track the parity of the sequence: whether it has seen an odd or even number of 1s. This is the simplest possible memory task: a 1-bit counter. Yet it requires the hidden state to maintain information across arbitrarily many time steps.
Reference: referance/temporal_xor.py (Plan 7) contains an implementation. The hidden state visualization shows a clear two-region structure: one half of the unit circle for “even count,” the other half for “odd count.”
What RNNs can and cannot compute. Our network learned a simple context-free grammar. In theory, RNNs with infinite precision are Turing-complete. In practice: they reliably learn finite automata patterns (our grammar); they can learn some context-free patterns (bracket matching with bounded depth); and they struggle with deeply nested structures and fail to generalize beyond training lengths. The gap between Turing-complete theory and practical capacity is itself a lesson; infinite-precision theory does not translate to finite-precision gradient-based learning.
Our RNN discovered words, grammar categories, and interpretable neurons, all from raw character sequences. But it hit a ceiling. Real language requires memory spanning hundreds of words, not a dozen characters. The history of deep learning is the story of pushing past this ceiling.
Why architecture must change
Everything our RNN learned in Chapter 6 was real: word boundaries, category structure, distributional semantics. But every success came with an asterisk: within the memory horizon. Beyond ~15–20 characters, the network is effectively blind. This chapter asks: why, precisely, is that ceiling so hard? And why can't we just push past it with better training?
| Failure Mode | Domain | Concrete Number | What Breaks | Reference |
|---|---|---|---|---|
| Vanishing gradients | Optimization | ρ=0.95: after 25 steps, 28% survives | Cannot learn long-range deps | Bengio 1994, Pascanu 2013 |
| Finite capacity | Info theory | H=64 → ~200 effective bits | Cannot represent distant info | Data Processing Inequality |
| Curse of memory | Dyn. systems | λ=0.993, T=100: 200× amplification | Cannot stably maintain memory | Wang & Li, ICML 2024 |
Each row addresses a different “cannot”: cannot learn, cannot represent, cannot maintain. All three must be solved simultaneously for long-range sequence processing.
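The vanishing-gradient entry in the table is plain geometric decay, assuming ρ is the per-step gradient attenuation factor:

```python
rho = 0.95                # per-step attenuation (the table's assumption)
print(f"{rho**25:.1%}")   # → 27.7%, the "28% survives" entry
print(f"{rho**50:.1%}")   # → 7.7%: double the distance, far less than half the signal
```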
The root cause: one equation. All three failures trace to one line of code, the recurrence h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h):
The problem is that every piece of information (new input, old memory, computational state) flows through the same matrix multiplication and nonlinearity. There is no way to selectively update some dimensions while preserving others. The tanh forces a global write at every step. Memory is overwritten, not managed. Every bit of new information costs a bit of old information.
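The overwrite can be watched numerically. A toy demonstration with random weights (all values illustrative): after a few dozen steps of the recurrence, the hidden state has essentially no overlap with what it held at step 0.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 64
W_hh = rng.normal(0, 1.0 / np.sqrt(n), (n, n)) * 0.95  # illustrative scaling
h0 = rng.normal(size=n)
h = np.tanh(h0)

for t in range(25):
    x = rng.normal(size=n)           # every new input forces a full rewrite of h
    h = np.tanh(W_hh @ h + 0.5 * x)

# cosine overlap between the current state and the initial state:
overlap = abs(h @ h0) / (np.linalg.norm(h) * np.linalg.norm(h0))
# overlap is near zero: the original information was overwritten, not managed
```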
What comes next. In the LSTM & GRU topic, we introduce a separate memory channel, the cell state, that uses element-wise operations instead of matrix multiplication. The gradient flows through this highway without the Jacobian product decay. Three learned gates (forget, input, output) give the network fine-grained memory control. The gated architecture addresses all three failure modes simultaneously.
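A minimal sketch of the gated update, assuming the standard LSTM equations with the four gate pre-activations stacked in one vector (names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. W: (4H, X), U: (4H, H), b: (4H,), gates stacked f,i,o,g."""
    z = W @ x + U @ h + b
    H = h.shape[0]
    f = sigmoid(z[0:H])          # forget gate: how much old memory to keep
    i = sigmoid(z[H:2*H])        # input gate: how much new candidate to write
    o = sigmoid(z[2*H:3*H])      # output gate: how much memory to expose
    g = np.tanh(z[3*H:4*H])      # candidate memory
    c_new = f * c + i * g        # element-wise highway: no W_hh on this path
    h_new = o * np.tanh(c_new)
    return h_new, c_new
```

With f ≈ 1 and i ≈ 0, the cell state passes through a step completely unchanged, which is precisely the selective "keep some dimensions, rewrite others" behaviour the vanilla tanh recurrence cannot express.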
📖 Recommended reading: For a deeper visual walkthrough of how LSTM solves the vanishing gradient problem, see Understanding LSTM Networks by Christopher Olah. For a visual study of what recurrent networks actually store, Visualizing Memorization in RNNs on Distill.pub is highly recommended.
| Dimension | Word2Vec ↗ | GloVe ↗ | RNN (this topic) |
|---|---|---|---|
| Core paradigm | Neural prediction | Matrix factorization | Sequential computation |
| Input | Word pairs | Co-occurrence counts | Character sequences |
| Output | Static embeddings | Static embeddings | Dynamic predictions + hidden states |
| Key challenge | Efficient training | Weighting function | Vanishing/exploding gradients |
| Key math | Softmax + neg sampling | Log-bilinear regression | Jacobian products + eigenvalues |
| Training | Skip-gram SGD | Weighted least squares | Manual BPTT + clipping |
| Visualization focus | Embedding geometry | Co-occurrence ratios | State trajectories + gradient flow |
| "Aha moment" | king − man + woman = queen | P(k|ice) / P(k|steam) | Grammar emerges from characters |
| Story arc | Words have geometry | Statistics reveal meaning | Networks can remember |
| Where it leads | GloVe | Compare to neural | LSTM & GRU (next) |
| Training objective | L = -\log \sigma(v_w \cdot v_c) - \textstyle\sum \log \sigma(-v_w \cdot v_n) | J = \textstyle\sum f(X_{ij})(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2 | L = -\tfrac{1}{T}\textstyle\sum \log p_t[y_t] |
Synthesis: All three objectives learn from distributional context (what appears near what) but through fundamentally different lenses: Word2Vec predicts context words from target words, GloVe fits a log-bilinear model to global co-occurrence counts, and our RNN predicts the next character from the hidden state. The distributional hypothesis connects them all.
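The RNN column's objective is ordinary next-character cross-entropy and can be computed directly; a small helper makes it concrete (assumed shapes: logits (T, V), targets (T,)):

```python
import numpy as np

def sequence_loss(logits, targets):
    """L = -(1/T) * sum_t log p_t[y_t], via a numerically stable log-softmax."""
    logits = logits - logits.max(axis=1, keepdims=True)
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()

# Uniform logits over a 27-character vocabulary give the chance-level loss log(27):
T, V = 10, 27
loss = sequence_loss(np.zeros((T, V)), np.zeros(T, dtype=int))
# loss == log(27) ≈ 3.296
```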
Same character, different predictions, because context matters.
Build up the recurrent cell from feedforward to Elman step by step.
No memory; each input is processed independently. The network sees only the current input and produces an output with no knowledge of what came before.
Every matrix multiplication, step by step. 3 hidden units, 6-char vocab.
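At those dimensions the whole step is three small matrix products; a sketch with random weights (purely illustrative values):

```python
import numpy as np

V, H = 6, 3                        # 6-char vocab, 3 hidden units
rng = np.random.default_rng(42)
W_xh = rng.normal(0, 0.1, (H, V))  # input -> hidden
W_hh = rng.normal(0, 0.1, (H, H))  # hidden -> hidden (the recurrence)
W_hy = rng.normal(0, 0.1, (V, H))  # hidden -> output logits
b_h, b_y = np.zeros(H), np.zeros(V)

x = np.zeros(V); x[2] = 1.0        # one-hot for the 3rd character
h_prev = np.zeros(H)

h = np.tanh(W_xh @ x + W_hh @ h_prev + b_h)  # new hidden state
logits = W_hy @ h + b_y
p = np.exp(logits) / np.exp(logits).sum()    # next-char distribution
```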
Click a character to see how it's represented in each scheme.
↗ Embeddings are the same idea behind Word2Vec's skip-gram architecture , learning dense representations through prediction.
The same RNN cell repeated T times; weight matrices shared across all positions.
Elman ⊊ GRU ⊊ LSTM; each architecture strictly more powerful than the last.
Recognizes regular languages. Can track simple patterns but has no ability to count.
aaaaabbbbb: ✓ passes. Accuracy drops sharply; Elman cannot truly count, only estimate.
Gradient arrows flow backward through time. Watch them fade the further back they travel.
Teacher-Forced
Avg accuracy: 96.3%. Input sequence:
Per-position accuracy (green = high, red = low):
Given the correct previous character, the model predicts the next one with high accuracy. This is what the model learned to do during training.
Free-Running
Error rate: 16.3%. Generated text (red = error vs teacher-forced):
Errors compound; each mistake changes the context for the next prediction, creating a cascade. This is the exposure bias problem.
Error rate vs sequence position
Scheduled sampling fraction (teacher input)
0% teacher input (pure free-running): errors cascade rapidly after the first mistake.
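The slider interpolates between the two regimes; scheduled sampling can be sketched as follows, where `model_step` is a hypothetical stand-in for the trained cell:

```python
import numpy as np

def generate(model_step, targets, h0, teacher_frac, rng):
    """Scheduled sampling sketch: with probability teacher_frac, feed the
    ground-truth previous character; otherwise feed the model's own sample.
    model_step(x, h) -> (probs, h) is a stand-in for the trained RNN."""
    h, x = h0, targets[0]
    out = []
    for y in targets[1:]:
        probs, h = model_step(x, h)
        sampled = int(rng.choice(len(probs), p=probs))
        out.append(sampled)
        # teacher_frac = 1.0 -> pure teacher forcing; 0.0 -> pure free-running
        x = y if rng.random() < teacher_frac else sampled
    return out
```

Annealing `teacher_frac` from 1 toward 0 during training exposes the model to its own mistakes gradually, which is the standard mitigation for exposure bias.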
Clipping caps gradient magnitude while preserving direction.
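A common recipe for this is clipping by global norm (a sketch, not necessarily the demo's exact code):

```python
import numpy as np

def clip_global_norm(grads, max_norm=5.0):
    """Rescale all gradients by one shared factor so their joint L2 norm
    is at most max_norm; the direction in parameter space is unchanged."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total > max_norm:
        scale = max_norm / total
        grads = [g * scale for g in grads]
    return grads

g = [np.full(4, 10.0)]                       # global norm 20
clipped = clip_global_norm(g, max_norm=5.0)  # norm is now 5, same direction
```

Because every gradient is scaled by the same factor, the update still points the same way; only the step size shrinks, which tames exploding gradients without biasing the direction.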
The product ||∂h_t / ∂h_k|| shows how gradient information decays as you go further back.
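What the chart plots can be computed directly: each backward step multiplies by the one-step Jacobian diag(1 − h_t²) · W_hh, so once the weights are contractive the norm shrinks geometrically. A sketch with random weights (illustrative, not the demo's exact setup):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 32
W_hh = rng.normal(size=(n, n))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)    # scale spectral norm to 0.9: contractive

h = np.zeros(n)
J = np.eye(n)                            # accumulates ∂h_t / ∂h_0
norms = []
for t in range(30):
    h = np.tanh(W_hh @ h + rng.normal(size=n))
    J = np.diag(1.0 - h**2) @ W_hh @ J   # one more factor of the chain rule
    norms.append(np.linalg.norm(J, 2))

# each factor has spectral norm <= 0.9, so ||J|| <= 0.9**t:
# after 30 steps less than 5% of the gradient magnitude can survive
```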
Drag the slider or click the chart to explore gradient flow through tanh.
W_hh eigenvalues on the complex plane. The unit circle is the stability boundary.
Orthogonal init: all 64 eigenvalues have |λ| = 1.0
Adjust parameters to see how gradient magnitude decays over distance.
Vanishing regime: gradients decay at rate ρ^t (ρ=0.97)
Three panels: convergence to fixed points, regime transitions, Lyapunov heatmap.
Slow convergence, long spiral, optimal for memory. Training pushes ρ ≈ 1.0.
The razor's edge: longer memory requires exponentially more precise eigenvalues.
Prediction accuracy vs required memory distance. The forgetting curve.
Beyond ~8 characters, accuracy drops below 80%; this is the practical memory limit imposed by vanishing gradients. Beyond ~14 characters, the network performs near chance.
Watch the model learn. Loss, generated text, and gradients synchronized.
Error spikes at word onsets; the network learned where words begin.
Hierarchical clustering dendrogram. Watch grammatical categories emerge as training progresses.
Select a hidden unit and see its activation pattern across the sequence.
PCA projection of hidden states as the RNN processes a sequence. Hover for details.