M 09 · 9.1.2 · conceptual

9.1.2 Backpropagation Visualisation

Train a 2-2-1 network on the XOR problem the perceptron couldn't solve. Watch the chain rule propagate gradients backward through the hidden layer, and visualise the decision boundary as it warps from a straight line into a curve.

Duration: 30–40 min

Level: intermediate

Load: 3 core concepts

Prereqs: 9.1.1 (perceptron), basic calculus (chain rule)

Big question

How does a neural network with hidden layers learn? A single perceptron can’t solve XOR because no straight line separates the two classes. Adding a hidden layer of neurons gives the network enough flexibility to learn a curved boundary, but it raises a new problem: how do you adjust the hidden weights when you can only observe the error at the output? The answer is backpropagation, an application of the chain rule that Werbos described in 1974 [4] and that Rumelhart, Hinton, and Williams popularised for neural networks in 1986 [1]. This lesson trains a 2-2-1 network on XOR and traces each gradient step so the math feels concrete.

Learning objectives

Build a tiny network with two inputs, two hidden neurons, and one output: exactly 9 parameters in total.
Run the forward pass to compute hidden activations and output prediction.
Run the backward pass to compute gradients via the chain rule.
Apply gradient descent updates to all weights and watch the network solve XOR over a few thousand epochs.

Part 1 — The architecture of a multi-layer network

A 2-2-1 network has:

2 inputs (x₁, x₂)
A hidden layer of 2 neurons with sigmoid activations
An output neuron with sigmoid activation
9 parameters total: 4 input→hidden weights, 2 hidden biases, 2 hidden→output weights, 1 output bias
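The parameter count above can be checked with a short sketch. The dictionary layout, the random initialisation, and the seed are assumptions made for illustration; the lesson's scripts may organise their parameters differently.

```python
import numpy as np

# Hypothetical initialisation: random weights, zero biases.
rng = np.random.default_rng(0)
params = {
    "W1": rng.normal(0.0, 1.0, size=(2, 2)),  # input -> hidden weights (hidden x input)
    "b1": np.zeros(2),                         # hidden biases
    "W2": rng.normal(0.0, 1.0, size=(1, 2)),  # hidden -> output weights (output x hidden)
    "b2": np.zeros(1),                         # output bias
}

# 4 + 2 + 2 + 1 learnable numbers in total.
n_params = sum(p.size for p in params.values())
print(n_params)
```

Storing each weight matrix as (neurons in this layer) x (neurons in the previous layer) lets the forward pass be written as `W @ x + b`.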

A diagram of a small neural network: two input nodes on the left, two hidden nodes in the middle, and one output node on the right, with weighted connections between each layer. — Fig. 1 The 2-2-1 architecture. Each connection has a learnable weight; each non-input neuron has a bias.

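The key change is replacing the step function with the sigmoid. The sigmoid is smooth and differentiable everywhere. The step function is not: its derivative is zero everywhere except at the jump, where it is undefined, so it gives no signal about which way to move a weight. Gradient-based learning needs activations whose derivatives are defined and non-zero.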

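import numpy as np

def sigmoid(z):
    # Logistic function: squashes any real z into (0, 1).
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    # d(sigmoid)/dz = s * (1 - s), where s = sigmoid(z).
    s = sigmoid(z)
    return s * (1 - s)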
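A quick way to convince yourself that the closed-form derivative is right is to compare it with a central finite difference. The sketch below repeats the two functions so it runs on its own; the test points and the step size `eps` are arbitrary choices.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

z = np.linspace(-4.0, 4.0, 9)   # a handful of test points
eps = 1e-5                      # finite-difference step (arbitrary)

# Central difference approximation of the slope at each test point.
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - sigmoid_deriv(z)))

print(max_gap)             # should be tiny
print(sigmoid_deriv(0.0))  # the slope peaks at z = 0
```

The derivative is largest at z = 0 and shrinks toward zero for large |z|, which is why saturated neurons learn slowly.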

Part 2 — Forward pass

Compute the hidden activations then the output:

def forward(self, x):
    z1 = self.W1 @ x + self.b1        # hidden pre-activation
    h  = sigmoid(z1)                  # hidden activation
    z2 = self.W2 @ h + self.b2        # output pre-activation
    y  = sigmoid(z2)                  # output activation
    return y, (z1, h, z2)             # also cache for backprop

The cache (z1, h, z2) is essential: the backward pass needs z1 and z2 to evaluate the sigmoid derivatives and h to form the output-weight gradient, and storing them avoids recomputing the forward pass.
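To see the forward pass produce XOR, the sketch below plugs in hand-picked parameters rather than learned ones. The specific numbers are an assumption chosen so that one hidden neuron behaves like OR and the other like AND; a trained network will usually land on different values. The forward function is written as a free function here, not a method.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hand-picked (hypothetical) parameters, not learned ones.
W1 = np.array([[20.0, 20.0],    # hidden neuron 1 acts like OR
               [20.0, 20.0]])   # hidden neuron 2 acts like AND
b1 = np.array([-10.0, -30.0])
W2 = np.array([[20.0, -20.0]])  # output fires for "OR but not AND"
b2 = np.array([-10.0])

def forward(x):
    z1 = W1 @ x + b1            # hidden pre-activation
    h = sigmoid(z1)             # hidden activation
    z2 = W2 @ h + b2            # output pre-activation
    y = sigmoid(z2)             # output activation
    return y, (z1, h, z2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
preds = np.array([forward(x)[0][0] for x in X])
print(np.round(preds, 3))       # close to 0, 1, 1, 0
```

The output neuron only ever sees the two hidden activations, and in that hidden space the four XOR points are linearly separable.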

A diagram tracing numerical values forward through the 2-2-1 network: from inputs through the two hidden activations to the output. — Fig. 2 One forward pass with concrete numbers. Each layer computes a weighted sum, then a sigmoid.

Part 3 — Backward pass via the chain rule

The loss for one training example is squared error:

L = (y_pred − y_true)²

We want ∂L/∂w for every weight w. The chain rule propagates the error backward through each operation:

def backward(self, x, y_true, cache):
    z1, h, z2 = cache
    y_pred = sigmoid(z2)

    dL_dy   = 2 * (y_pred - y_true)             # ∂L/∂y_pred
    dy_dz2  = sigmoid_deriv(z2)                 # ∂y_pred/∂z2
    dL_dz2  = dL_dy * dy_dz2                    # combine

    dL_dW2  = np.outer(dL_dz2, h)               # output weights
    dL_db2  = dL_dz2                            # output bias

    dL_dh   = self.W2.T @ dL_dz2                # error at hidden activations
    dL_dz1  = dL_dh * sigmoid_deriv(z1)         # combine with sigmoid deriv

    dL_dW1  = np.outer(dL_dz1, x)               # input weights
    dL_db1  = dL_dz1                            # hidden biases

    return dL_dW1, dL_db1, dL_dW2, dL_db2

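Each line is one application of the chain rule. The efficiency of backprop comes from reuse: the cached forward values (z1, h, z2) are not recomputed, and the output error dL_dz2 is computed once and shared by the output-weight gradients and by the error sent back to the hidden layer [2, 3].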
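Written out as equations, the backward pass above amounts to the following. Here ŷ stands for y_pred, y for y_true, σ for the sigmoid, and ⊙ for element-wise multiplication; δ₁ and δ₂ are shorthand introduced here (not names used in the lesson code) for dL_dz1 and dL_dz2.

```latex
\begin{aligned}
\delta_2 &= \frac{\partial L}{\partial z_2} = 2(\hat{y} - y)\,\sigma'(z_2) \\
\frac{\partial L}{\partial W_2} &= \delta_2\, h^{\top}, \qquad \frac{\partial L}{\partial b_2} = \delta_2 \\
\frac{\partial L}{\partial h} &= W_2^{\top} \delta_2 \\
\delta_1 &= \frac{\partial L}{\partial z_1} = \left(W_2^{\top} \delta_2\right) \odot \sigma'(z_1) \\
\frac{\partial L}{\partial W_1} &= \delta_1\, x^{\top}, \qquad \frac{\partial L}{\partial b_1} = \delta_1
\end{aligned}
```

Each row corresponds to one group of lines in the code: the output error, the output-layer gradients, the error passed back to the hidden activations, the hidden error, and the input-layer gradients.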

A diagram showing gradients flowing backward through the network: from the output error, through the output weights, into the hidden layer, and back to the input weights. — Fig. 3 The backward pass: error gradients propagate from output to input via the chain rule. Each weight's gradient is a product of forward activations and downstream errors.
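A common sanity check for a hand-written backward pass is to compare it against finite differences. The sketch below is one way to do that, assuming the parameters are kept in a plain list and the passes are free functions (`loss`, `backward`) rather than the methods shown earlier; the seed, the test input, and `eps` are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(params, x, y_true):
    W1, b1, W2, b2 = params
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    return float(np.sum((y - y_true) ** 2))

def backward(params, x, y_true):
    # Same chain-rule steps as the lesson, using s * (1 - s) for the sigmoid slope.
    W1, b1, W2, b2 = params
    h = sigmoid(W1 @ x + b1)
    y = sigmoid(W2 @ h + b2)
    dL_dz2 = 2 * (y - y_true) * y * (1 - y)
    dL_dz1 = (W2.T @ dL_dz2) * h * (1 - h)
    return [np.outer(dL_dz1, x), dL_dz1, np.outer(dL_dz2, h), dL_dz2]

rng = np.random.default_rng(0)
params = [rng.normal(size=(2, 2)), rng.normal(size=2),
          rng.normal(size=(1, 2)), rng.normal(size=1)]
x, y_true = np.array([0.0, 1.0]), 1.0

analytic = backward(params, x, y_true)
eps = 1e-6
max_gap = 0.0
for p, g in zip(params, analytic):
    for idx in np.ndindex(p.shape):
        old = p[idx]
        p[idx] = old + eps                  # nudge one parameter up
        up = loss(params, x, y_true)
        p[idx] = old - eps                  # and down
        down = loss(params, x, y_true)
        p[idx] = old                        # restore it
        numeric = (up - down) / (2 * eps)
        max_gap = max(max_gap, abs(numeric - g[idx]))

print(max_gap)  # should be tiny if all 9 gradients agree
```

If the gap is not tiny, the usual suspects are a missing sigmoid derivative or a transposed weight matrix.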

Synthesis project

EXECUTE I.

Train on XOR

Run simple_xor_train.py. The script initialises a 2-2-1 network with random weights and trains for 5000 epochs using batch gradient descent. The script saves a sequence of decision-boundary images as training progresses.
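For readers without the download at hand, a batch training loop in the spirit of that description might look like the sketch below. It is not the contents of simple_xor_train.py: the variable names, the seed, the initialisation, and the use of a mean over the four examples are assumptions, the learning rate of 0.5 is borrowed from the later experiment, and image saving is left out.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)  # shape (4, 2)
T = np.array([[0], [1], [1], [0]], dtype=float)              # shape (4, 1)

rng = np.random.default_rng(0)          # seed is an arbitrary choice
W1 = rng.normal(0.0, 1.0, size=(2, 2))  # rows = hidden neurons
b1 = np.zeros(2)
W2 = rng.normal(0.0, 1.0, size=(1, 2))
b2 = np.zeros(1)
lr, epochs = 0.5, 5000

losses = []
for epoch in range(epochs):
    # Forward pass on the whole batch (one row per example).
    H = sigmoid(X @ W1.T + b1)          # (4, 2)
    Y = sigmoid(H @ W2.T + b2)          # (4, 1)
    losses.append(float(np.mean((Y - T) ** 2)))

    # Backward pass: same chain rule, averaged over the batch.
    dZ2 = 2 * (Y - T) * Y * (1 - Y) / len(X)   # (4, 1)
    dZ1 = (dZ2 @ W2) * H * (1 - H)             # (4, 2)

    # Gradient descent step on all 9 parameters.
    W2 -= lr * dZ2.T @ H
    b2 -= lr * dZ2.sum(axis=0)
    W1 -= lr * dZ1.T @ X
    b1 -= lr * dZ1.sum(axis=0)

print(losses[0], losses[-1])
```

Whether a given run reaches a near-zero loss within 5000 epochs depends on the seed and the learning rate, so it is worth trying a few seeds before drawing conclusions.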

Four XOR data points plotted in 2D: points (0,0) and (1,1) labeled class 0, points (0,1) and (1,0) labeled class 1. No straight line can separate the two classes. — Fig. 4 XOR — the dataset a single perceptron *can't* learn. With a hidden layer, the network can learn a curved boundary that does.

A plot showing loss decreasing over training epochs. Sharp initial drop followed by a slow decay toward zero. — Fig. 5 Loss curve. After ~2000 epochs the network has solved XOR; training keeps going until loss is negligibly small.

Reflection questions

Why does the loss curve have a long plateau near the start?
Why is the final boundary curved, not straight?
What happens if you initialise all weights to zero?

MODIFY II.

Three architecture experiments

Edit simple_xor_train.py to try these variations.

Goals

More hidden neurons — bump from 2 to 8. Faster convergence? Smoother boundary?
Different activation — replace sigmoid with tanh (and its derivative 1 − tanh²).
Larger learning rate — bump lr from 0.5 to 5.0. Watch for divergence.
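For the tanh experiment, the activation and its derivative could be sketched as follows, with a finite-difference check that the 1 − tanh² formula is right. The function names `tanh` and `tanh_deriv` are hypothetical; the script may name its activation functions differently.

```python
import numpy as np

def tanh(z):
    return np.tanh(z)

def tanh_deriv(z):
    # d(tanh)/dz = 1 - tanh(z)^2
    t = np.tanh(z)
    return 1 - t ** 2

z = np.linspace(-3.0, 3.0, 7)   # a handful of test points
eps = 1e-5                      # finite-difference step (arbitrary)
numeric = (tanh(z + eps) - tanh(z - eps)) / (2 * eps)
max_gap = np.max(np.abs(numeric - tanh_deriv(z)))
print(max_gap)                  # should be tiny
```

Because tanh outputs values in (−1, 1), one option is to swap only the hidden layer and keep the sigmoid on the output neuron so predictions stay in (0, 1) and match the 0/1 targets.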

CREATE III.

Replace XOR with a circle dataset

Generate a dataset where one class is a disc and the other is the surrounding annulus. Train the same 2-2-1 network. Plot the learned decision boundary.

Hint and outcome

import numpy as np

rng = np.random.default_rng(0)
N = 200
inside = rng.normal(0, 0.5, (N, 2))    # class 1: cluster around the origin
outside = rng.normal(0, 1.6, (N, 2))   # class 0 candidates: wider cloud
outside = outside[np.linalg.norm(outside, axis=1) > 1.2]   # keep only points beyond radius 1.2
X = np.vstack([inside, outside])
y = np.concatenate([np.ones(len(inside)), np.zeros(len(outside))])

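With only two hidden neurons the network cannot fully enclose the disc: each sigmoid neuron contributes one soft straight edge, so the best it can do is a band or wedge that cuts across the data. At least three neurons are needed to close the boundary, and 4+ give a much cleaner circle. The exercise shows that expressive power grows with hidden-layer width. Cybenko's universal approximation theorem (1989) proves that one hidden layer with enough sigmoid neurons can approximate any continuous function on a compact domain to arbitrary accuracy [5].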
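To plot a learned boundary, one approach is to evaluate the model on a grid and hand the result to a contour plot. The helper below is a sketch: `decision_grid` and `toy_predict` are hypothetical names, and `toy_predict` is a stand-in for a trained network's prediction function, used only so the example runs on its own.

```python
import numpy as np

def decision_grid(predict, lo=-3.0, hi=3.0, steps=50):
    """Evaluate predict on a square grid; returns xx, yy, zz for contour plots."""
    xs = np.linspace(lo, hi, steps)
    xx, yy = np.meshgrid(xs, xs)
    points = np.column_stack([xx.ravel(), yy.ravel()])  # (steps*steps, 2)
    zz = np.array([predict(p) for p in points]).reshape(xx.shape)
    return xx, yy, zz

# Stand-in model for illustration only: a perfect disc of radius 1.2.
def toy_predict(p):
    return 1.0 if np.linalg.norm(p) < 1.2 else 0.0

xx, yy, zz = decision_grid(toy_predict)
print(zz.shape, zz.mean())  # grid shape and fraction of the grid labelled class 1
```

With Matplotlib installed, `plt.contourf(xx, yy, zz, levels=[0, 0.5, 1])` would shade the two regions; replacing `toy_predict` with the network's forward pass shows how close the learned boundary comes to a circle.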

Downloads

simple_xor_train.py: train + plot training
backprop_solution.py: reference implementation

References

[1] Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533–536. doi:10.1038/323533a0
[2] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
[3] Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press. neuralnetworksanddeeplearning.com
[4] Werbos, P. J. (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences. PhD dissertation, Harvard University.
[5] Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303–314.
[6] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.