M 09 · 9.2.2 · conceptual

9.2.2 Convolutional Networks

Replace fully-connected layers with convolutions: shared weights that slide across the image. The same operation used as a kernel in 3.4.1 becomes the central operation of convolutional vision networks, from LeNet (1998) onward.

Duration: 35–40 min

Level: intermediate-advanced

Load: 3 core concepts

Prereqs: 3.4.1 (convolution), 9.2.1 (feedforward nets)

Big question

Why don’t we just use a giant feedforward network on images? A 224×224 RGB image has 150,528 input values (224 × 224 pixels × 3 colour channels). A single fully-connected hidden layer with 1000 neurons would need about 150 million weights, and that is for one layer. Most of those weights would learn redundant local-pattern detectors, one copy per pixel position. Yann LeCun’s insight in 1989 was that the same edge detector should work at every location, so the weights can be shared across positions. The result is the convolutional layer: a tiny 3×3 or 5×5 filter, slid across the image, learning one feature detector that runs everywhere [1, 2]. This single architectural idea is what made deep learning for computer vision practical.
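The arithmetic behind that comparison can be checked in a few lines. This is a rough sketch: the 64-filter convolutional layer is a hypothetical size chosen only for illustration, not a figure from the lesson.

```python
# Parameter count: fully-connected layer vs. convolutional layer on a 224x224 RGB image.
height, width, channels = 224, 224, 3
inputs = height * width * channels            # one value per pixel per colour channel

hidden_units = 1000
fc_weights = inputs * hidden_units            # every input connects to every hidden unit

# Hypothetical conv layer for comparison: 64 filters of size 3x3 over 3 input channels.
num_filters, k = 64, 3
conv_weights = num_filters * k * k * channels  # the same weights are reused at every position

print(f"inputs:       {inputs:,}")        # 150,528
print(f"FC weights:   {fc_weights:,}")    # 150,528,000
print(f"conv weights: {conv_weights:,}")  # 1,728
```

The convolutional count does not depend on the image size at all, which is the practical payoff of weight sharing.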

Learning objectives

Connect convolution-as-kernel (3.4.1) to convolution-as-CNN-layer — same math, learned weights.
Implement a forward convolution in NumPy and apply learned filters to an image.
Understand the CNN architecture: convolution → activation → pooling, stacked.
Visualise feature maps — the per-filter response at each layer — to see what the network is “looking at.”

Part 1 — Shared weights, sliding filters

A fully-connected layer treats every pixel as an independent input feature: each hidden neuron has its own weight w(i, j) for every input pixel (i, j). A convolutional layer shares weights across positions: one 3×3 filter (nine weights) is applied at every location of the input.

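import numpy as np

def conv2d(image, kernel, stride=1, pad='same'):
    """Slide a square k×k kernel over a 2D image and sum the element-wise products."""
    k = kernel.shape[0]
    if pad == 'same':
        p = k // 2                                  # assumes an odd kernel size
        image = np.pad(image, p, mode='edge')       # repeat border pixels
    H, W = image.shape                              # size after any padding
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            top, left = y * stride, x * stride      # window origin in the input
            out[y, x] = np.sum(image[top:top+k, left:left+k] * kernel)
    return out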

This is the same convolution as 3.4.1. The difference in a CNN: the kernel weights are learned by backpropagation rather than hand-designed [3].
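To make "learned rather than hand-designed" concrete, here is a minimal sketch that recovers a kernel by gradient descent. It assumes a single convolution with no padding, a mean-squared-error loss, and a standard Sobel kernel as the hidden target; the helper names `conv2d_valid` and `kernel_gradient`, the learning rate, and the step count are all illustrative choices. A real CNN backpropagates through many layers, but the kernel update has the same form.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Convolution without padding: the output shrinks by k - 1 along each axis."""
    k = kernel.shape[0]
    H, W = image.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(image[y:y+k, x:x+k] * kernel)
    return out

def kernel_gradient(image, grad_out):
    """dLoss/dKernel. Each weight is used at every position, so its gradient sums over positions."""
    out_h, out_w = grad_out.shape
    k = image.shape[0] - out_h + 1
    grad = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            grad[i, j] = np.sum(image[i:i+out_h, j:j+out_w] * grad_out)
    return grad

rng = np.random.default_rng(0)
image = rng.random((12, 12))
target_kernel = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
target = conv2d_valid(image, target_kernel)      # the responses we want to reproduce

kernel = np.zeros((3, 3))                        # start with no knowledge of the filter
losses = []
for step in range(500):
    err = conv2d_valid(image, kernel) - target
    losses.append(np.mean(err ** 2))
    grad_out = 2 * err / err.size                # derivative of the MSE w.r.t. each output
    kernel -= 0.1 * kernel_gradient(image, grad_out)

print(f"loss: {losses[0]:.4f} -> {losses[-1]:.6f}")
print(np.round(kernel, 2))
```

The summation inside `kernel_gradient` is the flip side of weight sharing: one weight receives a gradient contribution from every location where the filter was applied.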

A diagram of a small CNN: an input image flows into a 3x3 convolution producing several feature maps, then a max-pooling layer reduces spatial size, then a flatten and dense layer, finally a classification output. — Fig. 1 The basic CNN block: convolution → activation → pooling, repeated. Each layer learns features built from features of the layer below.

Part 2 — Convolutional layer ingredients

A practical convolutional layer has four ingredients:

Filters — K filters of shape (k, k, C_in) produce K output channels.
Stride — step size when sliding the filter. Stride=1 keeps spatial size; stride=2 halves it.
Padding — same keeps the output size equal to input; valid shrinks by k-1.
Activation — typically ReLU, applied element-wise after convolution.

out_height = floor((in_height - kernel + 2*pad) / stride) + 1
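A few worked values make the size formula easier to trust. The helper name `conv_out_size` is hypothetical, and the sketch rounds down when the stride does not divide evenly, which is the usual convention.

```python
def conv_out_size(in_size, kernel, pad=0, stride=1):
    """Output length along one spatial axis; floor division drops any partial window."""
    return (in_size - kernel + 2 * pad) // stride + 1

print(conv_out_size(28, 3, pad=1, stride=1))   # 28: 'same' padding for a 3x3 kernel
print(conv_out_size(28, 3, pad=0, stride=1))   # 26: 'valid' padding shrinks by k - 1
print(conv_out_size(28, 3, pad=1, stride=2))   # 14: stride 2 halves the size
```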

After the convolution + activation, a max-pooling layer reduces spatial size by taking the maximum over each 2×2 block. Convolution alone is translation-equivariant: shift the input and the feature map shifts with it. Pooling adds a degree of local translation invariance, so the same feature at a slightly different position still fires the same downstream units [4].
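The effect of pooling on small shifts can be seen on a toy feature map. This is a single-channel sketch with hypothetical helper names (`maxpool_2x2`, `one_hot_map`); it assumes even height and width and non-overlapping 2×2 blocks.

```python
import numpy as np

def maxpool_2x2(x):
    """Max over non-overlapping 2x2 blocks; assumes even height and width."""
    H, W = x.shape
    return x.reshape(H // 2, 2, W // 2, 2).max(axis=(1, 3))

def one_hot_map(row, col, size=4):
    """A feature map that is zero except for a single activation."""
    m = np.zeros((size, size))
    m[row, col] = 1.0
    return m

a = maxpool_2x2(one_hot_map(0, 0))
b = maxpool_2x2(one_hot_map(1, 1))   # shifted, but still inside the same 2x2 block
c = maxpool_2x2(one_hot_map(1, 2))   # shifted across a block boundary

print(np.array_equal(a, b))   # True: the shift is absorbed by the pooling block
print(np.array_equal(a, c))   # False: the activation landed in a different block
```

The second comparison shows why the invariance is only local: a shift that crosses a block boundary still changes the pooled output.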

An animation showing a 3 by 3 filter sliding across an input image, computing dot products with each receptive field, and assembling the results into a feature map. — Fig. 2 The convolution operation: same kernel, every position. The feature map is the per-position response of one filter.

Part 3 — Feature maps

After applying K filters, you have K feature maps — each one a 2D image showing where that filter “fires.” Visualising these is how you build intuition about what a CNN has learned.
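As a small illustration, the sketch below computes K = 2 feature maps for a toy image containing a single vertical edge. The two filters are standard Sobel kernels standing in for learned ones, and the vectorised approach (NumPy's `sliding_window_view` plus `einsum`, with no padding) is one convenient choice rather than how `cnn_visualization.py` necessarily works.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

# Toy 8x8 image: dark left half, bright right half (one vertical edge).
image = np.zeros((8, 8))
image[:, 4:] = 1.0

# K = 2 hand-written filters standing in for learned ones.
filters = np.array([
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],    # responds to left-to-right brightness change
    [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],    # responds to top-to-bottom brightness change
], dtype=float)

# All 3x3 patches, shape (6, 6, 3, 3), then one dot product per patch per filter.
patches = sliding_window_view(image, (3, 3))
feature_maps = np.einsum('yxij,kij->kyx', patches, filters)   # shape (K, 6, 6)

print(feature_maps.shape)   # (2, 6, 6)
print(feature_maps[0])      # non-zero only in the columns that straddle the edge
print(feature_maps[1])      # all zeros: nothing changes from top to bottom
```

Bright entries in a feature map mark the positions where the image patch matches that filter's pattern.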

A grid of grayscale feature maps. Each shows the response of one learned filter to the same input image: some respond to horizontal edges, others to vertical edges, some to corners, some to textures. — Fig. 3 Twelve feature maps from a first-layer convolution. Each filter has specialised in a different low-level feature.

A grid of 3 by 3 filter weight matrices visualised as small images. Some look like edge detectors, others like Gabor-style oriented patterns, others like blob detectors. — Fig. 4 The filter weights themselves, visualised as 3×3 images. Many look like classical computer-vision filters (Sobel, Gabor, blob detectors) that were once hand-engineered.

Astonishingly, early-layer filters in trained CNNs look like Gabor filters — the same oriented-edge detectors that the 1980s computer-vision community had already designed by hand. The network rediscovers these features from scratch, by gradient descent on a classification objective [2, 5].
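For reference, a Gabor filter is a Gaussian envelope multiplied by an oriented cosine wave. The sketch below builds a small bank of them; every default (7×7 size, sigma, wavelength, aspect ratio gamma, four orientations) is an arbitrary illustrative value, and the 7×7 size is larger than the 3×3 filters in Fig. 4 only so that the orientation is visible.

```python
import numpy as np

def gabor_kernel(size=7, sigma=2.0, theta=0.0, wavelength=4.0, gamma=0.5, psi=0.0):
    """Real part of a 2D Gabor filter: Gaussian envelope times an oriented cosine wave."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate coordinates so the wave varies along direction theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * x_r / wavelength + psi)
    return envelope * carrier

# Four orientations: 0, 45, 90 and 135 degrees.
bank = np.stack([gabor_kernel(theta=t) for t in np.linspace(0, np.pi, 4, endpoint=False)])
print(bank.shape)   # (4, 7, 7)
```

Plotting each slice of `bank` as a small image gives striped, oriented patches that can be compared by eye with learned first-layer filters.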

Synthesis project

EXECUTE I.

Apply learned filters to an image

Run cnn_visualization.py from the downloads. It loads pre-trained filter kernels (Sobel-like horizontal, Sobel-like vertical, blob detector, sharpening, blur), applies each to an input image, and saves the feature maps.

Reflection questions

Each feature map is a 2D image. What do bright pixels in the feature map mean?
The Sobel-horizontal filter is hand-designed in 3.4.1. Why does a CNN trained on image classification learn something nearly identical in its first layer?
What happens if you apply multiple convolutional layers to the same input?

MODIFY II.

Artistic filters via convolution

Apply each of these hand-crafted kernels to an image and observe the artistic effects:

Goals

Edge detection — Sobel kernels (3.4.2) for grayscale outline.
Emboss — [[-2, -1, 0], [-1, 1, 1], [0, 1, 2]] for 3D-looking shading.
Sharpen — [[0, -1, 0], [-1, 5, -1], [0, -1, 0]] for crisp detail.
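One way to predict what each kernel does before running it is to look at the sum of its weights. The sketch below applies the kernels to a featureless grey image. It assumes the standard horizontal-gradient Sobel kernel (the exact form used in 3.4.2 may differ), and the helper `apply_kernel` with edge padding is a hypothetical convenience.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

kernels = {
    "sobel_x": np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float),
    "emboss":  np.array([[-2, -1, 0], [-1, 1, 1], [0, 1, 2]], dtype=float),
    "sharpen": np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]], dtype=float),
}

def apply_kernel(image, kernel):
    """Apply a 3x3 kernel with edge padding so the output matches the input size."""
    padded = np.pad(image, 1, mode='edge')
    patches = sliding_window_view(padded, kernel.shape)
    return np.einsum('yxij,ij->yx', patches, kernel)

flat = np.full((6, 6), 0.5)          # a featureless mid-grey image
results = {}
for name, kernel in kernels.items():
    results[name] = apply_kernel(flat, kernel)
    print(name, "sum =", kernel.sum(), "flat response =", results[name][0, 0])
```

Kernels whose weights sum to 1 (emboss, sharpen) leave flat regions at their original brightness, while a kernel that sums to 0 (Sobel) maps them to black. On real images the outputs can fall outside the valid intensity range, so clip them (for example with `np.clip`) before display.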

A grid showing the same source image transformed by four artistic filters: original, edge-detected, embossed, and sharpened. — Fig. 5 Same input, four learned-or-handcrafted filters. The CNN reaches for these kinds of low-level features in early layers.

CREATE III.

Build a tiny CNN forward pass

Implement a 2-conv-layer CNN forward pass in NumPy. Input is a 28×28 image (MNIST-style); the network has:

Conv layer 1: 4 filters of size 3×3, stride 1, ReLU.
Max-pool: 2×2, stride 2.
Conv layer 2: 8 filters of size 3×3, stride 1, ReLU.
Max-pool: 2×2, stride 2.
Output: flatten → 1 dense layer.

python · exercise3_starter.py

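import numpy as np

def conv2d(x, W, b, stride=1):
    """Multi-filter 2D convolution with bias and 'same' zero padding.
    x: (H, W_in, C_in), W: (K, k, k, C_in), b: (K,)
    Returns: (H_out, W_out, K); H_out = H and W_out = W_in when stride=1.
    """
    H, W_in, C = x.shape
    K, kh, kw, _ = W.shape
    # TODO: implement using nested loops or im2col.

def maxpool2d(x, size=2, stride=2):
    """Max pool over size×size windows (default 2×2, stride 2), per channel."""
    # TODO

def relu(x):
    return np.maximum(0, x)

# Placeholder parameters: random values so forward() runs; a trained network would load learned ones.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (4, 3, 3, 1)), np.zeros(4)
W2, b2 = rng.normal(0, 0.1, (8, 3, 3, 4)), np.zeros(8)
W3, b3 = rng.normal(0, 0.1, (10, 7 * 7 * 8)), np.zeros(10)   # assumes 10 classes (MNIST-style)

# Build the network
def forward(image):
    x = image[..., None]                # (28, 28, 1)
    x = relu(conv2d(x, W1, b1))         # (28, 28, 4)
    x = maxpool2d(x)                    # (14, 14, 4)
    x = relu(conv2d(x, W2, b2))         # (14, 14, 8)
    x = maxpool2d(x)                    # (7, 7, 8)
    x = x.flatten()                     # (392,)
    return W3 @ x + b3                  # logits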
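Before filling in the TODOs, it can help to trace the shapes and parameter counts on paper. The sketch below does that bookkeeping only and does not implement the layers. It assumes 'same' padding in both convolutions (which the shape comments in `forward` imply) and ten output classes; the helper names `conv_same` and `pool` are hypothetical.

```python
def conv_same(shape, num_filters, k=3):
    """Shape and parameter count of a stride-1, 'same'-padded conv layer."""
    h, w, c_in = shape
    params = num_filters * (k * k * c_in) + num_filters   # weights + one bias per filter
    return (h, w, num_filters), params

def pool(shape, size=2):
    """Shape after non-overlapping size x size max pooling (no parameters)."""
    h, w, c = shape
    return (h // size, w // size, c)

num_classes = 10   # assumption: ten digit classes, as in MNIST

shape = (28, 28, 1)
shape, p1 = conv_same(shape, 4)     # (28, 28, 4)
shape = pool(shape)                 # (14, 14, 4)
shape, p2 = conv_same(shape, 8)     # (14, 14, 8)
shape = pool(shape)                 # (7, 7, 8)
flat = shape[0] * shape[1] * shape[2]
p3 = flat * num_classes + num_classes

print(shape, flat)      # (7, 7, 8) 392
print(p1, p2, p3)       # 40 296 3930
```

Under these assumptions, almost all of the parameters sit in the final dense layer; the two convolutional layers together hold only 336.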

Downloads

cnn_visualization.py: apply learned filters
cnn_starter.py: tiny CNN forward pass

References

[1] LeCun, Y., Boser, B., Denker, J. S., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
[2] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
[3] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
[4] Boureau, Y.-L., Ponce, J., & LeCun, Y. (2010). A theoretical analysis of feature pooling in visual recognition. ICML 27.
[5] Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. ECCV, 818–833.
[6] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS 25, 1097–1105.
[7] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR, 770–778.