M 09 · 9.2.3 · conceptual

9.2.3 Recurrent Networks

Process sequences with a network that has *memory*: a hidden state passed from step to step. The vanilla RNN learns to generate sequences character by character; LSTMs and GRUs are gated descendants of the same recurrent idea, and Transformers later replaced recurrence with attention.

Duration: 30–40 min

Level: intermediate-advanced

Load: 3 core concepts

Prereqs: 9.2.1 (feedforward), 9.1.3 (activations)

Big question

How does a neural network handle data that arrives one item at a time, where the meaning of each item depends on what came before? Text is the canonical example: the meaning of “bank” depends on whether we previously said “river” or “money.” A feedforward network has no way to remember previous inputs. Recurrent Neural Networks (RNNs) maintain a hidden state vector that is updated at every time step, carrying information forward [1]. The recurrent networks of the 1980s were followed by the LSTM of 1997 [2] and the GRU of 2014 [3], which add gates to the same recurrence, and eventually by the Transformer of 2017 [4], which replaces recurrence with attention. Sequence modeling has moved on, but it began here.

Learning objectives

Express an RNN as a function h_t = tanh(W_x x_t + W_h h_{t-1} + b) that maps an input + previous state to a new state.
Unroll the recurrence in time and apply backpropagation through time (BPTT).
Train a character-level RNN on a small text corpus and generate plausible-sounding output.
Identify the vanishing-gradient problem that motivated LSTMs and GRUs.

Part 1 — The recurrent unit

A vanilla RNN cell maps an input vector x_t and a previous hidden state h_{t-1} to a new hidden state:

import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    # One recurrence step: combine the current input with the previous state.
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

The same weights W_x, W_h, b are reused at every time step. Computing the output sequence is a simple loop:

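def forward(self, x_seq):
    # Method of the model class; start from an all-zero hidden state.
    h = np.zeros(self.hidden_size)
    states = [h]  # states[0] is the initial state; states[t] follows input t
    for x_t in x_seq:
        h = rnn_step(x_t, h, self.W_x, self.W_h, self.b)
        states.append(h)
    return states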

The hidden state is the network’s memory: it carries a compressed summary of everything seen so far. The capacity of that memory is limited by the dimensionality of h; larger hidden vectors can hold more information [5], at the cost of more parameters and more computation per step.
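To make the loop concrete, the sketch below runs it on random data outside of any class. The sizes, the random weights, and the seed are illustrative assumptions, not values from the course materials.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

rng = np.random.default_rng(0)              # arbitrary seed
input_size, hidden_size, T = 3, 4, 6        # illustrative sizes

# One set of weights, reused at every time step.
W_x = rng.normal(0, 0.5, (hidden_size, input_size))
W_h = rng.normal(0, 0.5, (hidden_size, hidden_size))
b = np.zeros(hidden_size)

x_seq = rng.normal(0, 1, (T, input_size))   # T random input vectors
h = np.zeros(hidden_size)
states = [h]
for x_t in x_seq:
    h = rnn_step(x_t, h, W_x, W_h, b)
    states.append(h)

print(len(states))        # 7: the initial state plus one state per input
print(states[-1].shape)   # (4,)
```

Whatever the sequence length, the state stays a vector of `hidden_size` numbers, which is why the hidden size bounds how much the network can remember.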

A diagram of an RNN cell with input arrow, previous-hidden-state arrow, both feeding into a node that produces a new hidden state and an output. The same node is shown unrolled in time as a chain of identical cells. — Fig. 1 The RNN cell (left) and its unrolled-in-time view (right). The same weights are applied at every step; only the inputs and the carried state change.

Part 2 — Unrolling and backpropagation through time

To train, we treat the RNN as a feedforward network with T layers (one per time step), all sharing the same weights. Backpropagation propagates gradients backward through every step:

∂L/∂W_h = Σ_t  ∂L_t/∂h_t · ∂h_t/∂W_h

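This is Backpropagation Through Time (BPTT) [1]. Here ∂h_t/∂W_h is a total derivative: h_t depends on W_h directly and also through every earlier state, so expanding it produces products of the per-step Jacobians ∂h_j/∂h_{j-1} = diag(1 − h_j²) · W_h. The catch: because W_h appears in every factor, gradients can vanish or explode exponentially as they propagate backward through many time steps. For T = 100 steps, even a small deviation of the spectral radius from 1 compounds: 0.9^100 ≈ 2.7 × 10⁻⁵, while 1.1^100 ≈ 1.4 × 10⁴.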
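The compounding can be checked numerically. The sketch below is a simplified illustration, not the full BPTT computation: it assumes W_h is a scaled orthogonal matrix (so every singular value equals the chosen spectral radius `rho`) and treats the tanh derivative as 1. Since the real tanh derivative is at most 1, it can only shrink the gradient further. The names `backprop_norm` and `rho` are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
H, T = 16, 100

# Random orthogonal matrix: all singular values are exactly 1.
Q, _ = np.linalg.qr(rng.normal(size=(H, H)))

def backprop_norm(rho, T):
    """Norm of a unit gradient after T backward steps through W_h = rho * Q."""
    W_h = rho * Q
    g = np.ones(H) / np.sqrt(H)      # unit-norm gradient at the last step
    for _ in range(T):
        g = W_h.T @ g                # one step back in time (tanh' taken as 1)
    return np.linalg.norm(g)

for rho in (0.9, 1.0, 1.1):
    print(rho, backprop_norm(rho, T))
```

Under these assumptions the norm is exactly rho^T, so the three printed values should be on the order of 10⁻⁵, 1, and 10⁴.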

A diagram showing the RNN unrolled across five time steps. Each step has its own copy of the cell, with arrows showing how the hidden state passes forward in time and gradients flow backward. — Fig. 2 BPTT: gradients flow backward through every time step. The same weights are used at every step, so the chain of multiplications can make gradients explode or vanish.
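The backward flow in Fig. 2 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the code in rnn_starter.py: the loss is a squared error on the hidden states themselves (no output layer), and the function name `bptt`, the targets `y_seq`, and all sizes are invented for the example. A finite-difference check at the end confirms the analytic gradient for one entry of W_h.

```python
import numpy as np

def bptt(x_seq, y_seq, W_x, W_h, b):
    """Loss and gradients for L = sum_t 0.5 * ||h_t - y_t||^2 (assumed loss)."""
    T, H = len(x_seq), b.shape[0]
    hs = [np.zeros(H)]                       # hs[t] is the state after t inputs
    for x_t in x_seq:
        hs.append(np.tanh(W_x @ x_t + W_h @ hs[-1] + b))
    loss = 0.5 * sum(np.sum((hs[t + 1] - y_seq[t]) ** 2) for t in range(T))

    dW_x, dW_h, db = np.zeros_like(W_x), np.zeros_like(W_h), np.zeros_like(b)
    dh_next = np.zeros(H)                    # gradient arriving from later steps
    for t in reversed(range(T)):
        dh = (hs[t + 1] - y_seq[t]) + dh_next   # local loss term + future terms
        da = dh * (1.0 - hs[t + 1] ** 2)        # back through tanh
        dW_x += np.outer(da, x_seq[t])          # shared weights: accumulate
        dW_h += np.outer(da, hs[t])
        db += da
        dh_next = W_h.T @ da                    # hand gradient to the previous step
    return loss, dW_x, dW_h, db

rng = np.random.default_rng(0)
D, H, T = 3, 4, 6                            # illustrative sizes
W_x = rng.normal(0, 0.5, (H, D))
W_h = rng.normal(0, 0.5, (H, H))
b = np.zeros(H)
x_seq = rng.normal(size=(T, D))
y_seq = rng.uniform(-1, 1, size=(T, H))

loss, dW_x, dW_h, db = bptt(x_seq, y_seq, W_x, W_h, b)

# Central finite difference on a single entry of W_h.
eps = 1e-5
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (bptt(x_seq, y_seq, W_x, W_plus, b)[0]
           - bptt(x_seq, y_seq, W_x, W_minus, b)[0]) / (2 * eps)
print(abs(numeric - dW_h[0, 1]))             # should be very small
```

Because the weights are shared, each step adds its contribution to the same `dW_h`; the line `dh_next = W_h.T @ da` is where the repeated multiplication by W_h enters.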

Part 3 — Character-level language model

The classic RNN demo (Karpathy 2015) is a character-level language model: input one character, predict the next. Train on Shakespeare or Linux kernel source, and the network generates eerily plausible text [6].

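# Forward pass for one input character.
# one_hot and softmax are helper functions, and vocab is the list of
# characters; all are assumed to be defined elsewhere.
h = rnn_step(one_hot(char), h, W_x, W_h, b)
logits = W_y @ h + b_y                              # one score per character
probs = softmax(logits)                             # scores -> probabilities
next_char = np.random.choice(len(vocab), p=probs)   # index of the sampled character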
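For readers who want to run the sampling step on its own, here is a self-contained sketch. The helpers `one_hot` and `softmax`, the toy vocabulary, the hidden size, and the function name `sample` are assumptions for illustration. The weights are random and untrained, so the output is noise; the point is only the mechanics of feeding each sampled character back in as the next input.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = sorted(set("hello world"))           # toy vocabulary of 8 characters
V, H = len(vocab), 8
char_to_ix = {c: i for i, c in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(V)
    v[char_to_ix[ch]] = 1.0
    return v

def softmax(z):
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

def rnn_step(x_t, h_prev, W_x, W_h, b):
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

# Untrained, randomly initialised parameters.
W_x = rng.normal(0, 0.1, (H, V))
W_h = rng.normal(0, 0.1, (H, H))
b = np.zeros(H)
W_y = rng.normal(0, 0.1, (V, H))
b_y = np.zeros(V)

def sample(seed_char, n):
    """Generate n characters, feeding each sample back in as the next input."""
    h = np.zeros(H)
    ch = seed_char
    out = [ch]
    for _ in range(n):
        h = rnn_step(one_hot(ch), h, W_x, W_h, b)
        probs = softmax(W_y @ h + b_y)
        ch = vocab[rng.choice(V, p=probs)]   # draw an index, map it to a character
        out.append(ch)
    return "".join(out)

text = sample("h", 20)
print(text)
```

Sampling from `probs` rather than taking the most likely character is what keeps generated text from collapsing into a repeating loop.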

After enough training, the network learns spelling, basic grammar, and even some style. It does not learn semantics: the output is fluent gibberish, not coherent thought.

A textual sample showing generated characters that resemble English: real-looking words, occasional sentence-like structure, but no actual meaning. — Fig. 3 Sample output from a vanilla RNN trained on Shakespeare. The model has learned letter frequencies, word patterns, and even sentence structure — but the result is gibberish, not literature.

A comparison of generated text samples from RNNs with hidden sizes 16, 64, 256. The smallest RNN produces near-random gibberish; the medium produces word-like sequences; the largest produces sentence-like output. — Fig. 4 Hidden size matters. Tiny RNNs produce noise; larger ones produce plausible-but-meaningless text.

Synthesis project

EXECUTE I.

Run the char-level RNN

Run rnn_starter.py from the downloads. It trains a small RNN on a sample text corpus for 100 epochs and generates sample output every 10 epochs.

Reflection questions

Why does early-training output look like random noise?
After 100 epochs, the output has real words but no meaning. What is the network not learning?
How would the same network need to change to handle numeric time-series data (stock prices, sensor readings) instead of text?

MODIFY II.

Hidden-size sweep

Train RNNs with hidden sizes 8, 64, 256 on the same corpus. Compare:

Goals

Final training loss for each.
Sample output quality at each.
Wall-clock training time per epoch.
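When interpreting the timing results, it may help to know how the parameter count scales with hidden size. The sketch below counts the parameters of the character-level model from Part 3 (W_x, W_h, b, W_y, b_y). The function name `rnn_param_count` and the vocabulary size of 65 are assumptions for illustration; substitute the vocabulary size of your own corpus.

```python
def rnn_param_count(hidden_size, vocab_size):
    """Parameters in a vanilla char-level RNN with one-hot inputs."""
    w_x = hidden_size * vocab_size      # input-to-hidden
    w_h = hidden_size * hidden_size     # hidden-to-hidden
    b = hidden_size                     # hidden bias
    w_y = vocab_size * hidden_size      # hidden-to-output
    b_y = vocab_size                    # output bias
    return w_x + w_h + b + w_y + b_y

for H in (8, 64, 256):
    print(H, rnn_param_count(H, vocab_size=65))   # vocab_size=65 is hypothetical
# 8 1177
# 64 12545
# 256 99137
```

The W_h term grows with the square of the hidden size, so it dominates for large hidden sizes.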

CREATE III.

Spot the vanishing gradient

Train an RNN on a sequence task where the network must remember the first element of a long sequence in order to make its prediction at the last step. With sequence length 5, the network learns fine. With sequence length 100, it typically fails.

Task and outcome

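import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

# Task: input sequence of length T; first element is class 0 or 1;
# remaining T-1 elements are noise; predict the first element at step T.
def make_batch(T, batch_size=64):
    cls = rng.integers(0, 2, batch_size)       # class label, 0 or 1
    xs = rng.normal(0, 1, (batch_size, T, 1))  # Gaussian noise at every step
    xs[:, 0, 0] = cls                          # overwrite the first step with the label
    return xs, cls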

A vanilla RNN with T = 5 learns this task in a few epochs. With T = 100, the gradient from step 100 back to step 1 passes through 100 Jacobians, each a product of W_h and a tanh derivative that is at most 1. If the spectral radius of W_h is below 1, that gradient shrinks exponentially and has effectively vanished by the time it reaches step 1, so the network never connects the prediction to the input. LSTMs were designed to address this problem: their gates control an additive update to a separate cell state, which lets gradients survive many more steps than repeated multiplication by W_h allows.
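The gated additive update can be sketched as a single LSTM step. This follows a common formulation with input, forget, and output gates; the stacked weight layout, the gate order, and the names `lstm_step` and `sigmoid` are choices made for this example. The biases are set by hand (not learned) to show what a network would need to learn: with the forget gate nearly open and the input gate nearly closed, a value stored in the cell survives 100 noisy steps.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W: (4*H, D+H), b: (4*H,). Gate order: i, f, o, g."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    i = sigmoid(z[0:H])              # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])          # forget gate: how much old content to keep
    o = sigmoid(z[2 * H:3 * H])      # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])      # candidate content
    c = f * c_prev + i * g           # additive cell update
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H, T = 1, 4, 100
W = rng.normal(0, 0.01, (4 * H, D + H))
b = np.zeros(4 * H)
b[0:H] = -10.0                       # hand-set: input gate almost closed
b[H:2 * H] = 10.0                    # hand-set: forget gate almost fully open

h, c = np.zeros(H), np.ones(H)       # pretend step 1 stored 1.0 in every cell
for _ in range(T):
    h, c = lstm_step(rng.normal(size=D), h, c, W, b)
print(c)                             # still close to 1.0 after 100 noise inputs
```

The key line is `c = f * c_prev + i * g`: the old cell state is scaled by a gate value near 1 rather than multiplied by W_h and squashed through tanh, so both the stored value and its gradient decay far more slowly.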

Downloads

rnn_starter.py — char-level RNN

References

[1] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536.
[2] Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
[3] Cho, K., van Merriënboer, B., Gulcehre, C., et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. EMNLP, 1724–1734.
[4] Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). Attention is all you need. NeurIPS 30.
[5] Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
[6] Karpathy, A. (2015). The unreasonable effectiveness of recurrent neural networks. Andrej Karpathy Blog.
[7] Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. NAACL-HLT, 4171–4186.