InteractiveCharacter-level

RNN

Networks That Remember

Watch a network learn to carry information through time, character by character. See hidden states evolve, gradients vanish, and categories emerge from raw text.

~50 min read|Prerequisites: Word2Vec (or comfort with neural networks)|Published Mar 2026
RNN visualization: hidden state trajectories, gradient flow through time, category emergence from character sequences, and spectral radius evolution on a dark cinematic background

A recurrent neural network carries a hidden state forward through time, compressing the input history into a fixed-size vector at each step; this enables sequence processing but causes gradients to vanish exponentially during training, limiting practical memory to short spans and motivating gated architectures like LSTM.

Chapter 1

Snapshots vs Movies

Why sequences need memory

You hear “The cat sat on the ___.” You know the answer is probably mat or couch. You know this because you heard the whole sentence, not just the last word. A feedforward neural network that processes one word at a time has no such luxury. It sees “the” in isolation, then forgets it entirely when “cat” arrives. It has no notion of what came before. Word2Vec: Why Can't Computers Understand Words?
This is not a quirk of a particular architecture. It is a fundamental property of feedforward networks: they are stateless. Each forward pass maps input to output with no memory of previous passes. For static tasks (classifying an image, predicting house prices) this is fine. For sequences, it is a fatal limitation.
The shift register fix and why it fails. The obvious workaround, described by Elman (1990) in Section 1, is to concatenate recent inputs into one big vector, a “shift register.” Feed the model the last k characters as a single input. This approach has three fatal problems.
First, a fixed-length buffer: you must pre-specify the maximum sequence length before training. Real language has no such bound. Second, absolute vs relative position: the same pattern at different positions looks geometrically different. Elman's example: the pattern [1 1 0 0] and [0 0 1 1] represent the same two-word sequence shifted by two positions, yet they are orthogonal vectors, maximally different in Euclidean space. A network that learned the first cannot transfer to the second. Third, parameters explode: each position needs its own set of weights. For a 100-step window and a 27-character vocabulary, the input-to-hidden weight matrix grows from 27×64 to 2,700×64. This scales poorly.
A biological aside. Recurrent connections are ubiquitous in cortical circuits. Even the visual cortex has massive feedback loops, roughly equal numbers of feedforward and feedback connections. The brain does not process inputs in a single sweep; it maintains and updates internal states across time. Evolution solved the sequence problem not by building wider inputs, but by building systems that carry information forward.
The core question this topic answers: What if a network's output at one moment became part of its input at the next moment? This single idea, a feedback loop, is what gives neural networks memory. The rest of this story is an exploration of what that idea makes possible, and why it eventually hits a wall that forced an architectural revolution.

Toy vs Production: Our examples use a 27-character vocabulary and a 38-word grammar. Real language models process 50,000+ tokens. The simplicity is deliberate: it makes every phenomenon inspectable. When you can see all 64 hidden units and all 27 output probabilities, nothing is a black box.

We have seen why feedforward networks fail at sequences: they have no memory. But if the problem is clear, so is the direction of the solution. What if we gave the network exactly one new ability: the ability to pass a message to itself?
Chapter 2

The Feedback Loop

Anatomy of a recurrent cell

A brief history. Jordan (1986) first used recurrent connections, feeding the network's output back as input to generate sequences. But Elman (1990) had a more powerful idea: feed the hidden layer's activations back to itself. Instead of remembering what you said, remember what you were thinking. The hidden state is richer than the output: it contains intermediate representations, not just final decisions. This distinction matters enormously.
The Elman architecture works like this: at each time step, the hidden layer receives two inputs: the current character AND a copy of its own previous state. The copy operation has a fixed weight of 1.0; it is a direct pipeline, not a learned transformation. This means the hidden state carries a compressed summary of everything seen so far. The network must learn to pack relevant history into those 64 real-valued slots.
Building the forward pass step by step. The equation is straightforward once you see it assembled piece by piece.
A

Recurrence alone

Without input, the hidden state either decays to zero (eigenvalues < 1) or explodes (eigenvalues > 1). Not useful yet.

B

Add input

Now the state combines new input with memory. But values can grow unbounded.

C

Add nonlinearity

tanh squashes everything to [−1, 1]. This creates a bounded 'memory register.' Each unit stores a value between −1 and +1 representing some aspect of input history.

D

Add output

Read out predictions from the hidden state. The softmax converts logits to a probability distribution over next characters.

Embedding Layers: Dense vs One-Hot

Our 27-character vocabulary produces one-hot vectors: a 27-dimensional vector with exactly one non-zero entry. For the letter “a,” the vector is [1, 0, 0, ..., 0]. This is information-poor: 26 of 27 entries are wasted. The network receives almost no signal before the first weight multiplication.
An embedding layer fixes this by learning a dense representation. Instead of a 27-dim sparse vector, each character gets a 16-dim dense vector, a column of a learned matrix . The forward pass becomes a simple column selection: , where idx is the integer index of the current character. Word2Vec: Building the Machine (Skip-gram)
This is exactly the same concept as Word2Vec embeddings , representations discovered through prediction. The skip-gram model learns word vectors by predicting context; our character embedding learns character vectors by predicting the next character. Same principle, different granularity. GloVe: GloVe in the Landscape
Our choice: we use one-hot vectors to keep everything visible and mathematically transparent. Adding a 27×16 embedding matrix would add 432 parameters and reduce from 27×64 to 16×64, a net saving of 1,296 parameters. But it would add an abstraction layer that hides the raw input. For learning, the tradeoff is almost always worth it. For pedagogy, transparency wins.
What the hidden state actually stores. Research by Collins, Sohl-Dickstein, and Sussillo (2017) shows that each hidden unit stores approximately 1 real number from input history. With H=64 units, you get roughly 64 “slots” of memory. But this is bounded further by the data processing inequality: : mutual information between the current hidden state and any earlier input can only decrease or stay constant as time passes. The hidden state cannot recover information that was lost in a previous step.
This memory is lossy, task-dependent, and constantly being overwritten. Think of it as a summary, not a transcript. Two very different input histories can produce identical hidden states; the network retains what is useful for prediction and discards the rest. This is why training matters: gradient descent shapes what the hidden state chooses to retain.

Misconception: “RNNs have infinite memory”

No. The hidden state is a fixed-size vector (64 numbers in [−1, +1]). Information from early inputs is progressively compressed and eventually overwritten. The effective memory horizon depends on spectral radius and task structure; for our model, it is roughly 15–20 characters. Chapter 5 will make this precise.

Misconception: “The hidden state is a recording of everything”

No. It is a lossy, task-dependent compression. Two very different input histories can produce identical hidden states (aliasing). The network retains what is useful for predicting the next character and discards the rest.

The five parameters. (27×64 = 1,728), (64×64 = 4,096), (64×27 = 1,728), (64), (27). Total: 7,643 parameters. The entire “program” that processes sequences, from raw characters to structured predictions, fits in these five matrices. The initial hidden state .

Why orthogonal initialization?: is initialized as an orthogonal matrix. This means all its eigenvalues have absolute value exactly 1.0, placing the network at the optimal spectral radius for gradient flow. Xavier initialization does not provide this guarantee. We will see why this matters in Chapter 5.

The forward pass processes one step at a time, carrying hidden state forward. But to train this network, we need gradients. And computing gradients through a feedback loop requires a beautiful trick: unfolding the loop into something familiar.
Chapter 3

Unrolling the Loop

From loops to layers

Take the RNN's loop and lay it flat across time. An RNN processing T characters is mathematically equivalent to a T-layer deep feedforward network. Each “layer” computes the same function with the same parameters. This is not an analogy; it is exact. The two descriptions are mathematically identical; one is just drawn differently.
Weight sharing. In the unfolded graph, every layer uses the same . This is why RNNs handle variable-length sequences ; the same “program” runs for any T. Parameter count does not grow with sequence length (still 7,643 whether T=10 or T=1000). The network learns a time-invariant rule: given this input and this state, produce that state. The same rule applies at every position.
The parameter count formula is independent of T: , where V=27 (vocabulary) and H=64 (hidden size). The unfolded depth equals the sequence length T. At seq_len=25, this is a 25-layer deep network, deeper than ResNet-18.
RNNs as programs. As Karpathy (2015) puts it: “If training vanilla neural nets is optimization over functions, training recurrent nets is optimization over programs.” The hidden state is the program's memory; the weight matrices are the program's instructions. The same instructions execute at every time step. Siegelmann and Sontag (1995) proved that RNNs with sufficient hidden state size and real-valued precision are Turing-complete - they can simulate any computation. GloVe: GloVe in the Landscape

A 2-unit RNN can compute parity, whether the number of 1s in a binary sequence is odd or even. The key insight: a single bit can be encoded in one real-valued hidden unit.

With and , the first unit computes . For the sequence [1, 0, 1, 1, 0], the hidden states evolve as: h=[0.76, 0] → [−0.60, 0] → [0.65, 0] → [0.08, 0] → [−0.08, 0]. The sign of h[0] encodes parity: positive=odd, negative=even.

This proves RNNs can implement finite automata, the simplest computational class. Regular languages, recognizable by finite automata, are learnable by Elman networks.

The input/output taxonomy. RNNs enable five qualitatively different processing patterns:
PatternApplicationOur task
One-to-oneStandard feedforward (no RNN)
One-to-manyImage captioning
Many-to-oneSentiment analysis
Many-to-many (synced)POS tagging✓ next-character prediction
Many-to-many (encoder-decoder)Machine translation

Bidirectional and Stacked RNNs

A standard RNN reads left-to-right; at position t, it can only see positions 1 through t. But many tasks benefit from knowing what comes after too. A bidirectional RNN runs two passes: a forward pass producing and a backward pass (reading right-to-left) producing . The final representation at each position is their concatenation: .
Bidirectional RNNs excel at tasks where context on both sides matters: part-of-speech tagging (knowing what follows “run” distinguishes verb from noun), named entity recognition, and coreference resolution. But they are fundamentally incompatible with language generation (you cannot look at future tokens when generating token by token) and they require knowing the full sequence upfront, making them unsuitable for streaming.
Stacked (deep) RNNstake a different approach to adding capacity: layer the recurrent cells vertically. Layer 1's hidden states become the inputs to Layer 2; Layer 2's hidden states feed Layer 3; and so on. This creates a hierarchy where lower layers capture fine-grained temporal patterns (character-level phonology in our case) while higher layers capture more abstract features (word-level syntax and grammar).
The key distinction: temporal depth (sequence length T) creates width in the unfolded graph; vertical depth (number of layers L) creates a separate dimension of abstraction. In our toy model, L=1; all representations are at one level. Production RNNs commonly use L=2–4. Beyond L=4, the benefits diminish and training becomes harder.

The Formal Computational Hierarchy

Different RNN architectures have provably different computational powers. Weiss, Goldberg, and Yahav (2018) showed a strict containment hierarchy: Elman ⊊ GRU ⊊ LSTM in terms of the formal languages each can recognize.
Elman networks can learn regular languages, patterns recognizable by finite automata. Our parity checker is an example. GRU can recognize some counter languages, including bounded counting tasks. LSTM can learn context-free languages including the classic pattern: n copies of “a” followed by exactly n copies of “b.”
Why does require LSTM? The network must count exactly how many “a’s” appeared, then count down while reading “b’s.” Vanilla RNNs cannot maintain an unbounded integer counter because the tanh hidden state is bounded. LSTM's cell state can implement an exact counter because it uses additive updates without a nonlinearity: . Set during the “a” phase; setduring the “b” phase with a decrement. The additive structure breaks the bound.
This hierarchy has practical consequences. If your task requires keeping exact counts or deeply nested structure, a vanilla RNN is architecturally insufficient, not just harder to train, but provably incapable.
An unfolded RNN is a deep feedforward network with shared weights. We already know how to train deep networks: backpropagation. Applied to the unfolded RNN, this becomes Backpropagation Through Time.
Chapter 4

Learning Through Time

BPTT and gradient clipping

The loss function. We want the network to predict the next character at every position. The loss is the average cross-entropy across T time steps:
where is the correct next character and is the softmax output at time t. This is just cross-entropy, averaged over all T positions. Backpropagation through time computes gradients with respect to all parameters simultaneously.
Gradients for the output layer are the easy part. is a sum of outer products at each time step; no temporal dependency. The gradient at step t depends only on what the network predicted at step t versus what it should have predicted.
Gradients for the recurrent layer are the hard part. Computing requires for every t. The complication: affects the loss BOTH through directly AND through all future hidden states . Both paths must be summed. This is what makes BPTT non-trivial.
The BPTT recursion. Working backward from t=T:

At t = T (terminal step)

Only the direct contribution through the output layer.

At t < T

The second term is the gradient flowing backward through time, through the tanh derivative and W_hh.

The temporal gradient passes through two things at each step: (1) the tanh derivative , which has entries in [0, 1]; it always shrinks or preserves the gradient; and (2) the recurrent weight matrix , which can amplify or attenuate depending on its eigenvalues. Both of these conspire to make long-range gradient flow difficult.
Accumulating parameter gradients. Once we have at every step, the parameter gradients are:
The gradient for is analogous, with in place of . These gradients are summed across all T time steps; the weight matrices accumulate evidence from the entire sequence.

RTRL, introduced by Williams and Zipser (1989), computes gradients in forward time rather than backward time. Instead of waiting until the end of the sequence to backpropagate, RTRL maintains a running computation of how each parameter affects each hidden unit.

The cost: RTRL requires operations per step (maintaining a 3D tensor ), compared to BPTT's per step. For our tiny H=64 model, that is 64⁴ = 16.7 million vs 64² = 4,096 operations per step. Impractical at any scale. Werbos (1990) compared both approaches and BPTT won definitively.

RTRL is theoretically important because it is online: it can update weights after every single time step, without truncation bias. Recent work (SNAP, R-BPTT) revisits forward-time gradients for online learning scenarios where truncated BPTT's bias is unacceptable.

Truncated BPTT. Full BPTT over 18,000 characters would be impractical; the unfolded graph would have 18,000 layers with the same gradient problems. Instead:
SplitCorpus → chunks of seq_len=25 characters
ForwardCarry hidden state h from end of one chunk to start of next (memory persists across chunks)
BackwardBackpropagate ONLY within the current chunk (gradients truncated at chunk boundaries)
ResultHidden state remembers; gradients do not. This is TBPTT: biased but practical.
The gradient estimate is biased: events more than 25 steps ago receive no gradient signal. But this bias is acceptable: we know from Chapter 5 that gradients beyond ~20 steps are negligibly small anyway.

Teacher Forcing: A Deep Dive

During training, we always feed the correct previous character as input, even if the model predicted incorrectly. This is teacher forcing: the teacher (correct label) overrides the student's (model's) mistake. It keeps training stable because gradients never compound through a chain of wrong predictions. Wikipedia: Teacher forcing Word2Vec: Why Does This Actually Work?
But teacher forcing creates exposure bias: at generation time, the model must condition on its own predictions, which may be wrong. It was never trained for this situation. The effect is dramatic. If per-character accuracy is 0.85, after 10 characters: P(all correct) = 0.85¹⁰ = 0.20. There is an 80% chance of at least one error in just 10 characters. And each error cascades; if the model predicts “b” instead of “c” after “the ca,” it has now entered a sequence (“the cab...”) it has never encountered during training.
Scheduled sampling (Bengio et al., 2015) bridges the gap with a curriculum: during training, use the ground truth with probability ε and the model's own prediction with probability 1−ε. As training progresses, ε decays from 1.0 to 0.0. The model gradually learns to handle its own mistakes.
Professor forcing (Lamb et al., 2016) takes a more principled approach: use adversarial training to match the distribution of hidden states between teacher-forced and free-running modes. When both modes produce the same hidden state distribution, the model is immune to exposure bias. This is more expensive but more principled than scheduled sampling.
For our toy model, even at temperature T=0.8, generation introduces errors because the model was only trained with perfect input. This is visible in the RNNTrainingPlayer below: high-temperature samples diverge faster and contain more grammatical violations.
Regularization for RNNs. Standard dropout breaks recurrence by randomly zeroing hidden units; if unit k is dropped at step t, all information it carried about the past is permanently lost for all future steps. The temporal signal is destroyed, not just the spatial one.
The two solutions that matter in practice are variational dropout (Gal & Ghahramani, 2016), which applies the same dropout mask at every step, preserving temporal consistency, and layer normalization (Ba et al., 2016), which normalizes within each step (batch norm fails on variable-length sequences). For our toy corpus, we use only gradient clipping. Regularization matters at scale. See Merity et al. (2018) for a thorough study.
Gradient clipping. Even with truncation, gradients can explode. Gradient clipping rescales the entire gradient vector when its norm exceeds a threshold:
This preserves the gradient's direction while capping its magnitude. In our code, C=5.0. Clipping direction, not magnitude: this is why gradient clipping is safe; it cannot reverse a weight update, only moderate it.

The complete derivation proceeds in four phases:

1. Forward pass: Compute all and store them.
2. Output gradients: , so .
3. Hidden gradients: Initialize . For t=T down to 1: .
4. Parameter gradients: .

Cross-reference: this corresponds line-by-line to the backward() function in train.py. See the Code walkthrough section code-bptt for annotated correspondence.

BPTT gives us gradients. But there is a catch. Look again at the gradient arrows in the BPTT visualizer. Notice how they get fainter further back in time? This is not a coincidence. It is a mathematical inevitability.
Chapter 5

Why Memory Fades

Vanishing gradients and spectral radius

Look at the gradient flow from Chapter 4's BPTTVisualizer. The gradient at position 0 is roughly 25× smaller than at position 24. Information from the beginning of the chunk barely influences the parameter update. The network effectively has a “memory horizon” - beyond this distance, it cannot learn long-range dependencies.
The Jacobian product. Expanding the BPTT recursion across multiple time steps reveals a product of matrices:
Each factor is a matrix. The gradient flows through this product. Its behavior is determined by the eigenvalues of these matrices. The product of T matrices, even if each has eigenvalues near 1, can dramatically change magnitude depending on whether eigenvalues compound to grow or shrink.
The tanh derivative ceiling. is a diagonal matrix with entries in [0, 1]. When, the derivative is close to 1; gradient passes through. When(saturated), the derivative approaches 0; gradient is killed. The tanh derivative can only shrink the gradient; it is strictly ≤ 1, never > 1. The gradient cannot be amplified by the activation function alone.
The spectral radius. The spectral radius , the largest absolute eigenvalue of the recurrent weight matrix, determines the long-run behavior of the Jacobian product:
ρ < 1

Jacobian product shrinks exponentially at rate ≈ ρ^(t−k). Gradients vanish. Long-range dependencies are unlearnable.

ρ > 1

Product grows exponentially. Gradients explode. Training becomes unstable without clipping.

ρ = 1

The 'edge of chaos'. Gradients neither vanish nor explode. Maximum memory, maximum sensitivity.

Orthogonal initialization sets ρ = 1.0, starting the network at the edge of chaos. This is why it is optimal: gradients flow freely from the very first step of training.

Bengio, Simard, and Frasconi (1994) proved something stronger than “gradients tend to vanish.” They showed that for an RNN with stable fixed points (which any well-behaved RNN must have to produce consistent predictions) gradients must vanish.

Formal statement: if the RNN converges to a fixed point , then the spectral radius of the Jacobian at is strictly less than 1. If ρ ≥ 1, the fixed point is unstable; the hidden state would wander away from under small perturbations. A useful RNN must have stable fixed points, so ρ < 1 at those points, so gradients must vanish there.

The implication is stark: there is no sweet spot for vanilla RNNs. Stability and long-term memory are fundamentally in tension. You cannot have both simultaneously without changing the architecture. Pascanu, Mikolov, and Bengio (2013) extended this analysis to exploding gradients, showing both phenomena are two sides of the same Jacobian product instability.

The Information-Theoretic View

The gradient problem has a companion at the level of information, not optimization. The data processing inequality says that for a Markov chain, mutual information can only decrease:
In the RNN, each hidden state is a deterministic function of the previous state and current input, a Markov chain. The data processing inequality then says the hidden state at time T cannot contain more information about the very first input than the hidden state at time T−1 did. Information about distant inputs is physically lost, not just hard to learn.
This is an information-theoretic ceiling, independent of gradient flow. Even with perfect gradients, perfect optimizer, and perfect initialization, if the RNN has seen T steps, information about step 1 has been compressed T times and can only have decreased. The forgetting curve from our memory probe analysis is an empirical measurement of this decay. GloVe: What Counts Reveal (how counting captures longer-range info)
The telephone game analogy. Imagine a chain of 25 people whispering a message. Each person can only hear 90% of what the previous person said (like a Jacobian factor with norm 0.9). After 25 people: 0.9²⁵ = 0.072. Only 7% of the original message survives. This is vanishing gradients. If each person amplifies the message by 10% (norm 1.1), after 25 people: 1.1²⁵ = 10.8. The message is wildly distorted. This is exploding gradients.

The dynamical systems counterpart to spectral radius is the Lyapunov exponent, which measures the average rate at which nearby trajectories diverge over long time horizons. A positive Lyapunov exponent corresponds to ρ > 1 (chaotic, exploding gradients); negative to ρ < 1 (contracting, vanishing); zero to ρ = 1 (edge of chaos).

Finite-Time Lyapunov Exponents (FTLEs) are the time-varying version, measuring divergence rate over a finite window rather than in the limit. The product norm at each position in our JacobianInspector IS essentially the FTLE: it measures how much nearby hidden states diverge after k steps of processing. When the FTLE heatmap shows a bright patch, the network is locally chaotic; when it shows a dark patch, the network is locally contracting.

This connects to attractor dynamics: the edge of chaos (FTLE ≈ 0) maximizes the network's sensitivity to new inputs while maintaining stability; it can update its internal state significantly in response to new information without diverging.

Attractors and the Edge of Chaos

An attractoris a state the hidden state converges to under repeated input. Feed the character “a” 15 times to our trained network: the hidden state spirals toward a fixed point in 64-dimensional space. The convergence is exponential: . Beyond ~10 repetitions, the hidden state has essentially forgotten any prior context; it has been “washed out” by the attractor.
This attractor isthe fixed point from the Bengio condition. Its stability is the gradient problem. The three dynamical regimes correspond directly to the spectral radius cases: ρ < 1 creates point attractors (stable, forgetting past); ρ ≈ 1 creates the edge of chaos(long transients, maximum memory); ρ > 1 creates chaotic dynamics (exploding gradients, unstable training).
Our model self-organizes toward the edge of chaos during training; eigenvalues that started on the unit circle drift outward in directions that help maintain useful memory, while gradient clipping prevents the resulting instability from being catastrophic.
Recent research (2025) shows that training can cause attractor-merging crises - qualitative bifurcations where two separate attractor basins merge into one, causing sudden changes in representation structure. These correspond to the loss plateaus sometimes observed in training curves.

The Curse of Memory

Beyond vanishing/exploding gradients, a deeper problem exists. As an RNN's memory increases, as eigenvalues approach 1, the network becomes exponentially sensitive to tiny parameter changes. Even without gradient explosion, small weight perturbations produce massive output changes. This was formalized by Wang and Li (ICML 2024) as the curse of memory.
Formal statement: sensitivity scales as exp(memory_length / τ), where τ is the time constant of the system. For a memory horizon of T steps, a weight perturbation of size δ produces an output change of approximately δ · exp(T/τ). This grows without bound as T increases.
Numerical example: for ρ = 0.999 and a memory of 100 steps, a weight perturbation of 10⁻⁶ produces an output change of ~0.1, seven orders of magnitude amplification. The network is hyper-sensitive. The gradient step that corrects one output wreaks havoc on all outputs that depend on distant context.
Adding nonlinearities does not help. tanh and ReLU change the constant factors but not the exponential scaling. The curse arises from the multiplicative structure of the Jacobian product, which is present regardless of the activation function.
Why gating does help:LSTM's forget gate provides a multiplicative shortcut that decouples memory length from sensitivity. Instead of computing a full matrix product, the cell state uses element-wise multiplication:. The gradient factor per step is instead of. Since , the gradient factor is bounded and well-controlled.
Connection to S4/Mamba: Gu et al.'s S4 (2022) uses stable reparameterization, diagonal state matrices with negative real eigenvalues, to mathematically break the curse. The eigenvalues are parameterized as where , guaranteeing they lie inside the unit disk with bounded sensitivity. Mamba inherits this structure.
Practical consequence: you cannot fix vanilla RNNs by training longer, using a better optimizer, or increasing the learning rate. These are optimization tricks; the curse is an architectural problem. Better training cannot create a gradient highway where none exists.

Misconception: “Vanishing gradients means gradients are literally zero”

No. They are exponentially small but nonzero. For ρ=0.95 and t−k=25 steps: 0.95²⁵ = 0.28, not zero, just 72% attenuated. At t−k=100: 0.95¹⁰⁰ = 0.006. The problem is that these tiny gradients are indistinguishable from noise in the loss landscape, making learning impossible in practice even though the signal technically exists.

Misconception: “More hidden units = proportionally longer memory”

No. Doubling H from 64 to 128 doubles the number of dimensions, but the memory horizon grows only logarithmically (roughly +2–3 characters). The bottleneck is the multiplicative structure of the Jacobian product, not the hidden state size. Going from H=64 to H=1024 (16×) might extend the memory horizon from 18 to ~25 characters, a modest gain for a massive parameter increase.

This is NOT just a training problem: Many tutorials present vanishing gradients as an “optimization difficulty” that better training techniques could solve. The curse of memory (2024) shows it is a fundamental architectural limitation: without gating, RNNs cannot stably maintain long-term memory regardless of how you train them. Three independent failure modes (gradient decay, information capacity, and sensitivity) all point to the same root cause.

The vanishing gradient problem limits how far an RNN can look back. But within its effective memory horizon, the network can learn remarkable things. Let us see what our trained model actually discovered.
Chapter 6

What the Network Discovers

Emergence and interpretation

Training progression. The model begins outputting random noise; at step 0, before any learning, it produces a meaningless stream like “xqkzm blg pfr...” By step 5,000, recognizable character patterns emerge: spaces appear in plausible positions, common digrams like “th” start appearing. By step 30,000, the output is grammatically structured sentences with correct subject-verb-object patterns. The model learned a grammar it was never explicitly taught.
Prediction error reveals structure. This is a replication of Elman (1990) Figures 4 and 6. When we run the trained model over the corpus and measure prediction error at each character position:
Word onsetHighModel does not know which word comes next, reflecting maximum uncertainty
Mid-wordLowAfter the first 2–3 characters, the word is predictable from context
SpaceMediumModel knows a space is coming (after word ends) but not what follows
The RNN discovered “words” without anyone telling it words exist. The error signal IS a word boundary detector. Nobody labeled word boundaries; the model inferred them from sequential statistics alone.
Temperature and sampling. The same trained model generates very different text depending on the sampling temperature T, which controls the sharpness of the softmax distribution:
T=0.2

"the the the the the..."

Too conservative: repetitive, stuck in highest-probability loop

T=0.8

"the lion chases the mouse. the girl reads..."

Balanced: grammatical, varied, mostly correct

1.5

"thr cxg slpeks moy brk..."

Too noisy: garbled, no longer respects grammar

Category emergence. This is a replication of Elman (1990) Figure 7, and the crown jewel of this analysis. Take the hidden state vectors for each word, averaged across all contexts in which that word appears, then cluster them using hierarchical agglomerative clustering. The result:
Top-level split: Content words vs function words (OR nouns vs verbs)
Within nouns: NOUN_HUMAN (boy, girl, man, woman, king, queen) | NOUN_ANIMAL (cat, dog, lion, bird) | food | objects
Within verbs: Transitive (chase, eat, read) | Intransitive (sleep, walk, think)
Nobody told the network about parts of speech or animacy. It discovered these purely from sequential statistics.

Token-Level Analysis

Word-averaged vectors smooth over context. A more rigorous test uses individual tokenvectors - every single occurrence of “boy” gets its own hidden state vector, not just the average across all occurrences. This is Elman's Figure 9 replication.
The result: sub-clusters organized by preceding context. Occurrences of “boy” following “the” cluster separately from those following “a.” The model is not just classifying words; it is tracking grammatical state: which determiner was used, what verb preceded the noun phrase. The representations are not just semantic but syntactico-semantic.
Tony Plate's Warning (Elman, 1990, Footnote 6): averaging hidden states across contexts could create artificial cluster structure that reflects nothing real. Individual token vectors refute this: the sub-clusters are present even without any averaging, and they reflect genuine distributional differences in context. The structure is real, not an averaging artifact.

The Zog Experiment

The most striking experiment in the study. Replace “man” with “zog” - a novel character sequence that has never appeared in the training corpus, and run it through the trained model with no further training.
Result: “zog” occupies “man's” position in the clustering dendrogram. It appears in the NOUN_HUMAN cluster, next to “man,” “boy,” and “king.” The network inferred the category from distributional context alone - because “zog” appeared in the same syntactic positions as “man” (subject of sentences, after determiners, before transitive verbs).
This is a direct demonstration of the distributional hypothesis: “a word is characterized by the company it keeps.” The model's representations are purely distributional. Memorization would fail completely on a novel input; this replication proves the model learned statistics, not examples. Word2Vec: Words Are Defined by Their Friends GloVe: Counting Words Together

Misconception: “The network memorizes training examples”

No. It learns distributional statistics. The Zog experiment proves this definitively: a never-seen character pattern (“zog”) placed in the same contexts as “man” is immediately classified as NOUN_HUMAN, without any further training. Memorization would fail completely on novel inputs.

Interpretable neurons (in the style of Karpathy's analysis): individual hidden units that track specific features. These emerge spontaneously; we did not ask for them:
Word boundary detectorActivation spikes at spaces and word onsets
Position counterActivation correlates with position-within-word (1st char → high, 4th char → low)
Vowel detectorActivation distinguishes vowels (a, e, i, o, u) from consonants
Verb tense trackerActivation differs between words appearing in subject vs object positions
The dynamical systems view. The hidden state traces a trajectory through 64-dimensional space as it reads a sequence. When projected to 2D via PCA: characters within the same word cluster together, different word categories occupy distinct regions, and the trajectory makes loops within words and jumps between words. This is the hidden state acting as a dynamical system, with different attractor regions for different grammatical contexts. Word2Vec: The Payoff (king − man + woman ≈ queen)

What the Network Gets Wrong

The successes are real but bounded. Our trained RNN makes systematic errors that reveal its limitations:
Cross-sentence dependencies. The network cannot maintain context across sentence boundaries. The subject of the previous sentence does not influence predictions in the current sentence. This is by design (our grammar generates independent sentences) but reflects a real limitation of the architecture.
Novel combinations.Word pairs never seen together in training, such as “king eats the glass” (legal grammar, unusual semantics), produce confused hidden states. The network's distributional knowledge is accurate for trained patterns but extrapolates poorly to unseen combinations.
Long words.Rare, longer words like “cookie” (6 characters) show higher per-character error than common short words. The model has seen fewer examples and the initial characters are less constraining.
Rare templates. Template 15 (adverb + sentence) has systematically higher loss than frequent templates. Rare patterns in the training distribution produce less reliable predictions, even though the grammar is perfectly deterministic.

Elman's first simulation trained a tiny 2-unit RNN to maintain a running tally, effectively computing XOR across a binary sequence. The key insight: a static function (XOR is easy) becomes a memory task when cast in temporal form.

The network learned to track the parity of the sequence: whether it has seen an odd or even number of 1s. This is the simplest possible memory task: a 1-bit counter. Yet it requires the hidden state to maintain information across arbitrarily many time steps.

Reference: referance/temporal_xor.py(Plan 7) contains an implementation. The hidden state visualization shows a clear two-region structure: one half of the unit circle for “even count,” the other half for “odd count.”

What RNNs can and cannot compute. Our network learned a simple context-free grammar. In theory, RNNs with infinite precision are Turing-complete. In practice: they reliably learn finite automata patterns (our grammar); they can learn some context-free patterns (bracket matching with bounded depth); and they struggle with deeply nested structures and fail to generalize beyond training lengths. The gap between Turing-complete theory and practical capacity is itself a lesson; infinite-precision theory does not translate to finite-precision gradient-based learning.
Our RNN discovered words, grammar categories, and interpretable neurons, all from raw character sequences. But it hit a ceiling. Real language requires memory spanning hundreds of words, not a dozen characters. The history of deep learning is the story of pushing past this ceiling.
Chapter 7

The Ceiling

Why architecture must change

Everything our RNN learned in Chapter 6 was real: word boundaries, category structure, distributional semantics. But every success came with an asterisk: within the memory horizon. Beyond ~15–20 characters, the network is effectively blind. This chapter asks: why, precisely, is that ceiling so hard? And why can't we just push past it with better training?

The Three Failure Modes: A Synthesis

Failure ModeDomainCore EquationConcrete NumberWhat BreaksReference
Vanishing gradientsOptimizationρ=0.95: after 25 steps, 28% survivesCannot learn long-range depsBengio 1994, Pascanu 2013
Finite capacityInfo theoryH=64 → ~200 effective bitsCannot represent distant infoData Processing Inequality
Curse of memoryDyn. systemsλ=0.993, T=100: 200× amplificationCannot stably maintain memoryWang & Li, ICML 2024

Each row addresses a different “cannot”: cannot learn, cannot represent, cannot maintain. All three must be solved simultaneously for long-range sequence processing.

The root cause: one equation. All three failures trace to one line of code:
The problem is that every piece of information, new input, old memory, computational state, flows through the same matrix multiplication and nonlinearity. There is no way to selectively update some dimensions while preserving others. The tanh forces a global write at every step. Memory is overwritten, not managed. Every bit of new information costs a bit of old information.

Thought Exercises: Fix Each Failure Mode

Answer: Provide a gradient highway that bypasses the Jacobian product. LSTM's cell state does this: it uses element-wise multiplication with the forget gate, so the gradient factor per step is diag(f_t) instead of diag(1−h²)·W_hh. Since f_t ∈ [0,1]^H, the gradient can flow through unchanged when f_t ≈ 1.
Answer: Increase the effective state space or provide external memory. LSTM doubles the state (both h_t AND c_t carry information), attention mechanisms provide access to ALL encoder states (not just the last one), and memory-augmented networks add explicit tape storage.
Answer: Reparameterize memory duration through a bounded gate instead of raw eigenvalues. LSTM's forget gate f_t ∈ [0,1] controls retention smoothly and is trainable via sigmoid; no eigenvalue instability. S4/Mamba use stable diagonal reparameterization: eigenvalues of the form e^{−Δλ} with Δ, λ > 0, guaranteed inside the unit disk.
What comes next. In the LSTM & GRU topic, we introduce a separate memory channel, the cell state, that uses element-wise operations instead of matrix multiplication. The gradient flows through this highway without the Jacobian product decay. Three learned gates (forget, input, output) give the network fine-grained memory control. The gated architecture addresses all three failure modes simultaneously.

📖 Recommended reading: For a deeper visual walkthrough of how LSTM solves the vanishing gradient problem, see Understanding LSTMs by Christopher Colah. For a rigorous modern treatment of state-space models, Memorization in RNNs on Distill.pub is highly recommended.

Chapter Summary: 5 Review Questions

  1. What single architectural change distinguishes an RNN from a feedforward network?
  2. Why does unfolding turn an RNN into a deep network?
  3. What determines whether gradients vanish or explode during BPTT?
  4. How did Elman show that grammar can emerge from prediction alone?
  5. What are the three independent reasons vanilla RNNs fail at long-range memory?

TL;DR: Recurrent Neural Networks

  • What: A neural network with a feedback loop; the hidden state h_t is updated at every time step using both the current input and the previous hidden state, giving the network a form of memory.
  • How it trains: Backpropagation Through Time (BPTT) unrolls the network across all time steps and sums gradients for each shared weight matrix. The same W_hh, W_xh, and W_hy are used at every step.
  • Core limitation: Vanishing gradients: tanh saturates and gradients shrink exponentially as they flow backward through many time steps, making it hard to learn long-range dependencies. LSTMs and GRUs solve this with gating.
  • Surprising capability: Without any explicit language supervision, RNNs trained on raw character sequences discover linguistic structure: word boundaries, parts of speech, and syntactic patterns emerge in the hidden state activations.
  • When to use: Vanilla RNNs are rarely used today for production NLP (replaced by Transformers), but understanding them is essential; they introduce sequence modeling, hidden state memory, and BPTT, which are foundational to all modern sequential architectures.
DimensionWord2Vec ↗GloVe ↗RNN (this topic)
Core paradigmNeural predictionMatrix factorizationSequential computation
InputWord pairsCo-occurrence countsCharacter sequences
OutputStatic embeddingsStatic embeddingsDynamic predictions + hidden states
Key challengeEfficient trainingWeighting functionVanishing/exploding gradients
Key mathSoftmax + neg samplingLog-bilinear regressionJacobian products + eigenvalues
TrainingSkip-gram SGDWeighted least squaresManual BPTT + clipping
Visualization focusEmbedding geometryCo-occurrence ratiosState trajectories + gradient flow
"Aha moment"king − man + woman = queenP(k|ice) / P(k|steam)Grammar emerges from characters
Story arcWords have geometryStatistics reveal meaningNetworks can remember
Where it leadsGloVeCompare to neuralLSTM & GRU (next)
Training objectiveL = -\log \sigma(v_w \cdot v_c) - \textstyle\sum \log \sigma(-v_w \cdot v_n)J = \textstyle\sum f(X_{ij})(w_i \cdot \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij})^2L = -\tfrac{1}{T}\textstyle\sum \log p_t[y_t]

Synthesis: All three objectives learn from distributional context (what appears near what) but through fundamentally different lenses: Word2Vec predicts context words from target words, GloVe fits a log-bilinear model to global co-occurrence counts, and our RNN predicts the next character from the hidden state. The distributional hypothesis connects them all.

Frequently Asked Questions

An RNN is a neural network with a feedback loop: the hidden layer's output at one time step becomes part of its input at the next step. This gives the network memory; it can process sequences of arbitrary length using the same parameters.
A feedforward network processes each input independently with no memory of previous inputs. An RNN carries a hidden state forward through time, so its output depends on the entire input history, not just the current input.
The hidden state is a vector (in our model, 64 numbers) that represents a compressed, task-dependent summary of everything the network has processed so far. Each element is between −1 and +1 due to tanh.
BPTT is the algorithm for computing gradients in an RNN. It works by “unfolding” the recurrent loop across T time steps, creating an equivalent deep feedforward network, and then applying standard backpropagation. Gradients for the shared weights are summed across all time steps.
Gradients pass through a product of Jacobian matrices, one per time step. Each factor involves and the tanh derivative (always in [0,1]). If the spectral radius of is less than 1, this product shrinks exponentially, and gradients from distant time steps become negligible.
The spectral radius is the largest absolute eigenvalue of the recurrent weight matrix . It determines whether gradients vanish (), explode (), or are preserved (). Orthogonal initialization sets .
Instead of backpropagating through the entire sequence (which can be thousands of steps), truncated BPTT only backpropagates through chunks of fixed length (e.g., 25 steps). The hidden state carries forward across chunks, but gradients do not. This trades gradient accuracy for computational feasibility.
When gradients become very large (exploding gradients), gradient clipping rescales the entire gradient vector to have a maximum norm. This preserves the direction of the update while preventing catastrophically large weight changes. Formula: if , set .
After training, feed a seed character, get a probability distribution over next characters, sample from it, feed the sampled character back as input, and repeat. The 'temperature' parameter controls randomness: low temperature produces conservative, repetitive text; high temperature produces creative but error-prone text.
During training, the RNN always receives the correct previous character as input, even if it predicted incorrectly. This stabilizes training but creates 'exposure bias': at generation time, the model must condition on its own predictions (which may be wrong). Scheduled sampling gradually transitions from teacher forcing to self-conditioning.
An orthogonal matrix has all eigenvalues with absolute value exactly 1 (spectral radius = 1). This means gradients neither grow nor shrink during BPTT, providing optimal gradient flow from the start of training. Standard Xavier initialization does not guarantee this property.
In practice, vanilla RNNs have an effective memory of roughly 10–20 time steps for our toy model. Beyond this, gradient signal is too weak for learning. The exact horizon depends on hidden size, spectral radius, and task structure. This limitation is what motivated LSTM and GRU.
A 2024 research finding: as an RNN's memory capacity increases (eigenvalues approach 1), the network becomes exponentially sensitive to parameter changes. This makes gradient-based training unreliable even without exploding gradients. Only gating mechanisms (LSTM, GRU) or stable reparameterization (modern SSMs) can break this curse.
Vanilla RNNs are rarely used in production. LSTM and GRU variants are still common for some sequence tasks. The Transformer architecture has largely replaced RNNs for language modeling. However, modern state-space models (Mamba, S4) bring back recurrence-like computation with O(1) inference cost, showing that the idea of recurrence is very much alive.
Elman trained a simple recurrent network to predict the next word in sentences generated by a grammar. The network discovered grammatical categories (nouns vs verbs, animate vs inanimate) purely from sequential statistics, without any explicit linguistic knowledge. This showed that structure can emerge from prediction, a finding that foreshadowed modern representation learning. Word2Vec: Words Are Defined by Their Friends