9.2.2 Convolutional Networks

Replace fully-connected layers with convolutions: shared weights that slide across the image. The same trick used as a kernel in 3.4.1 becomes the central operation of convolutional vision networks from LeNet (1998) onward.

Duration: 35–40 min
Level: intermediate-advanced
Load: 3 core concepts
Prereqs: 3.4.1 (convolution), 9.2.1 (feedforward nets)

Big question

Why don’t we just use a giant feedforward network on images? A 224×224 RGB image has 150,528 input values (224 × 224 pixels × 3 colour channels). A single fully-connected hidden layer with 1000 neurons would need about 150 million weights, and that is for one layer. Most of those weights would learn redundant copies of the same local-pattern detectors, one per position. Yann LeCun’s insight in 1989 was that the same edge detector should work at every location: share the weights across positions. The result is the convolutional layer: a tiny 3×3 or 5×5 filter, slid across the image, learning one feature detector that runs everywhere [1, 2]. This single architectural idea is what made deep learning for computer vision practical.
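
To make the arithmetic concrete, here is a quick count of the weights in the two designs. The 64-filter convolutional layer is an illustrative assumption, not a figure from the text; the point is only the difference in scale.

```python
# Weight counts for one layer on a 224x224 RGB image.
# The 64-filter conv layer is an illustrative assumption.
height, width, channels = 224, 224, 3
hidden_units = 1000

n_inputs = height * width * channels       # one value per pixel per colour channel
dense_weights = n_inputs * hidden_units    # every input connects to every hidden unit

kernel, n_filters = 3, 64
conv_weights = kernel * kernel * channels * n_filters   # shared across all positions

print(n_inputs)        # 150528
print(dense_weights)   # 150528000
print(conv_weights)    # 1728
```

Biases are left out of both counts; they would add 1000 and 64 parameters respectively.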

Learning objectives

  1. Connect convolution-as-kernel (3.4.1) to convolution-as-CNN-layer — same math, learned weights.
  2. Implement a forward convolution in NumPy and apply learned filters to an image.
  3. Understand the CNN architecture: convolution → activation → pooling, stacked.
  4. Visualise feature maps — the per-filter response at each layer — to see what the network is “looking at.”

Part 1 — Shared weights, sliding filters

A fully-connected layer treats every pixel as an independent input feature: a separate weight connects each input pixel (i, j) to each hidden neuron. A convolutional layer shares weights across positions: one 3×3 filter is applied at every location of the input.

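import numpy as np

def conv2d(image, kernel, stride=1, pad='same'):
    """Slide one k x k kernel over a 2D image and return the response map."""
    k = kernel.shape[0]
    if pad == 'same':
        p = k // 2  # odd k assumed, so stride 1 keeps the input size
        image = np.pad(image, p, mode='edge')
    H, W = image.shape  # measured after padding
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for y in range(out_h):
        for x in range(out_w):
            ys, xs = y * stride, x * stride  # top-left corner of the window
            out[y, x] = np.sum(image[ys:ys+k, xs:xs+k] * kernel)
    return out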

This is the same convolution as 3.4.1. The difference in a CNN: the kernel weights are learned by backpropagation rather than hand-designed [3].
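
The double loop above is easy to read but slow in pure Python. As a sketch of the same stride-1 computation without explicit loops, the version below uses `sliding_window_view`, which assumes NumPy 1.20 or later. The name `conv2d_same` and the synthetic test image are hypothetical, introduced only for this illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_same(image, kernel):
    """Stride-1, 'same'-padded sliding-filter response (hypothetical helper)."""
    k = kernel.shape[0]                              # square, odd-sized kernel assumed
    padded = np.pad(image, k // 2, mode='edge')
    windows = sliding_window_view(padded, (k, k))    # (H, W, k, k) view of every patch
    return np.einsum('hwij,ij->hw', windows, kernel) # dot product of each patch with the kernel

# Sobel-style kernel that responds to left-to-right brightness increases.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=float)

# Synthetic test image: dark left half, bright right half.
image = np.zeros((8, 8))
image[:, 4:] = 1.0

response = conv2d_same(image, sobel_x)
print(response[0])   # [0. 0. 0. 4. 4. 0. 0. 0.]
```

The filter fires only in the two columns that straddle the dark-to-bright boundary, which is the weight-sharing idea in miniature: one set of nine weights detects the edge wherever it occurs.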

A diagram of a small CNN: an input image flows into a 3x3 convolution producing several feature maps, then a max-pooling layer reduces spatial size, then a flatten and dense layer, finally a classification output.
Fig. 1 The basic CNN block: convolution → activation → pooling, repeated. Each layer learns features built from features of the layer below.

Part 2 — Convolutional layer ingredients

A practical convolutional layer has four ingredients:

  • Filters: K filters of shape (k, k, C_in) produce K output channels.
  • Stride: step size when sliding the filter. With same padding, stride=1 keeps spatial size; stride=2 halves it.
  • Padding: same keeps the output size equal to the input (at stride 1); valid shrinks it by k-1.
  • Activation: typically ReLU, applied element-wise after convolution.

out_height = floor((in_height - kernel + 2*pad) / stride) + 1
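
As a quick check of the output-size formula, the sketch below evaluates it for a few common settings. Integer division is used so that sizes that do not divide evenly round down; the function name `conv_output_size` is hypothetical.

```python
def conv_output_size(in_size, kernel, stride=1, pad=0):
    """Spatial output size along one axis for a convolution or pooling window."""
    return (in_size - kernel + 2 * pad) // stride + 1

print(conv_output_size(28, 3, stride=1, pad=1))   # 28  ('same' padding for a 3x3 kernel)
print(conv_output_size(28, 3, stride=1, pad=0))   # 26  ('valid': shrinks by k - 1)
print(conv_output_size(28, 3, stride=2, pad=1))   # 14  (stride 2 halves the size)
print(conv_output_size(28, 2, stride=2, pad=0))   # 14  (2x2 pooling window, stride 2)
```

The same formula applies to the width, and to pooling layers if the pooling window is treated as the kernel.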

After the convolution + activation, a max-pooling layer reduces spatial size by taking the maximum over each 2×2 block. Pooling gives CNNs a degree of invariance to small translations: the same feature at a slightly different position still fires the same downstream units [4].
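
A small experiment shows both the effect and its limits. The sketch below pools a single-channel feature map with a reshape trick; the name `maxpool2x2` and the toy 4×4 maps are assumptions made for illustration, and even height and width are assumed.

```python
import numpy as np

def maxpool2x2(feature_map):
    """2x2 max pooling with stride 2 on one 2D feature map (even H and W assumed)."""
    H, W = feature_map.shape
    blocks = feature_map.reshape(H // 2, 2, W // 2, 2)   # split into 2x2 blocks
    return blocks.max(axis=(1, 3))                       # maximum within each block

# One strong activation, then the same activation shifted one pixel to the right.
a = np.zeros((4, 4)); a[0, 0] = 1.0
b = np.zeros((4, 4)); b[0, 1] = 1.0
print(np.array_equal(maxpool2x2(a), maxpool2x2(b)))   # True: the shift stayed inside one block

# A shift that crosses a block boundary does change the pooled output.
c = np.zeros((4, 4)); c[0, 2] = 1.0
print(np.array_equal(maxpool2x2(a), maxpool2x2(c)))   # False
```

The invariance is therefore local: it absorbs shifts smaller than the pooling window, and stacking several pooling layers widens the range of shifts that are absorbed.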

An animation showing a 3 by 3 filter sliding across an input image, computing dot products with each receptive field, and assembling the results into a feature map.
Fig. 2 The convolution operation: same kernel, every position. The feature map is the per-position response of one filter.

Part 3 — Feature maps

After applying K filters, you have K feature maps — each one a 2D image showing where that filter “fires.” Visualising these is how you build intuition about what a CNN has learned.
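
To see what a stack of feature maps looks like in code, here is a minimal sketch that applies K = 3 filters to one grayscale image. The filters are hand-written stand-ins for learned ones, and the function name `feature_maps` and the synthetic image are hypothetical; `sliding_window_view` assumes NumPy 1.20 or later.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def feature_maps(image, filters):
    """Apply K filters of shape (k, k) to a 2D image; returns an array of shape (K, H, W)."""
    k = filters.shape[-1]
    padded = np.pad(image, k // 2, mode='edge')
    windows = sliding_window_view(padded, (k, k))          # (H, W, k, k)
    return np.einsum('hwij,fij->fhw', windows, filters)    # one map per filter

# Hand-written stand-ins for learned filters (illustrative only).
filters = np.array([
    [[-1, -2, -1], [0, 0, 0], [1, 2, 1]],   # dark above, bright below
    [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]],   # dark left, bright right
    [[1, 1, 1], [1, 1, 1], [1, 1, 1]],      # box blur (normalised below)
], dtype=float)
filters[2] /= 9.0

# Synthetic image: a bright square on a dark background.
image = np.zeros((16, 16))
image[4:12, 4:12] = 1.0

maps = np.maximum(0, feature_maps(image, filters))   # ReLU after convolution
print(maps.shape)   # (3, 16, 16)
```

After the ReLU, the first map is bright only along the top edge of the square and the second only along its left edge, while the third is a softened copy of the input. Passing each `maps[i]` to an image viewer reproduces the kind of grid described in this part.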

A grid of grayscale feature maps. Each shows the response of one learned filter to the same input image: some respond to horizontal edges, others to vertical edges, some to corners, some to textures.
Fig. 3 Twelve feature maps from a first-layer convolution. Each filter has specialised in a different low-level feature.
A grid of 3 by 3 filter weight matrices visualised as small images. Some look like edge detectors, others like Gabor-style oriented patterns, others like blob detectors.
Fig. 4 The filter weights themselves, visualised as 3×3 images. Many look like classical computer-vision filters (Sobel, Gabor, blob detectors) that were once hand-engineered.

Astonishingly, early-layer filters in trained CNNs often resemble Gabor filters, the same kind of oriented-edge detectors that the 1980s computer-vision community had already designed by hand. The network rediscovers these features from scratch, by gradient descent on a classification objective [2, 5].

Synthesis project

EXECUTE I.

Apply learned filters to an image

Run cnn_visualization.py from the downloads. It loads pre-trained filter kernels (Sobel-like horizontal, Sobel-like vertical, blob detector, sharpening, blur), applies each to an input image, and saves the feature maps.

Reflection questions

  • Each feature map is a 2D image. What do bright pixels in the feature map mean?
  • The Sobel-horizontal filter is hand-designed in 3.4.1. Why does a CNN trained on image classification learn something nearly identical in its first layer?
  • What happens if you apply multiple convolutional layers to the same input?

MODIFY II.

Artistic filters via convolution

Apply each of these hand-crafted kernels to an image and observe the artistic effects:

Goals

  1. Edge detection: Sobel kernels (3.4.2) for grayscale outline.
  2. Emboss: [[-2, -1, 0], [-1, 1, 1], [0, 1, 2]] for 3D-looking shading.
  3. Sharpen: [[0, -1, 0], [-1, 5, -1], [0, -1, 0]] for crisp detail.
A grid showing the same source image transformed by four artistic filters: original, edge-detected, embossed, and sharpened.
Fig. 5 Same input, four learned-or-handcrafted filters. The CNN reaches for these kinds of low-level features in early layers.
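
As a starting point for this exercise, the sketch below applies the emboss and sharpen kernels listed in the goals to a synthetic grayscale image. The helper name `apply_kernel`, the test image, and the clipping to [0, 1] are assumptions made for illustration; swap in your own image array and add the Sobel kernels for the edge-detection goal.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

emboss = np.array([[-2, -1, 0],
                   [-1,  1, 1],
                   [ 0,  1, 2]], dtype=float)
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=float)

def apply_kernel(image, kernel):
    """Apply a 3x3 kernel to a grayscale image with values in [0, 1] (hypothetical helper)."""
    padded = np.pad(image, 1, mode='edge')
    windows = sliding_window_view(padded, (3, 3))       # every 3x3 patch
    out = np.einsum('hwij,ij->hw', windows, kernel)
    return np.clip(out, 0.0, 1.0)                       # keep the result displayable

# Synthetic grayscale image: horizontal gradient with a brighter square.
image = np.tile(np.linspace(0.2, 0.6, 32), (32, 1))
image[10:22, 10:22] += 0.3

embossed = apply_kernel(image, emboss)
sharpened = apply_kernel(image, sharpen)

# Both kernels sum to 1, so flat regions keep their brightness.
flat = np.full((8, 8), 0.5)
print(np.allclose(apply_kernel(flat, sharpen), 0.5))   # True
print(np.allclose(apply_kernel(flat, emboss), 0.5))    # True
```

Because each kernel sums to 1, only regions where brightness changes are altered, which is why these filters read as stylistic effects on an otherwise recognisable image.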

CREATE III.

Build a tiny CNN forward pass

Implement a 2-conv-layer CNN forward pass in NumPy. Input is a 28×28 image (MNIST-style); the network has:

  • Conv layer 1: 4 filters of size 3×3, stride 1, same padding, ReLU.
  • Max-pool: 2×2, stride 2.
  • Conv layer 2: 8 filters of size 3×3, stride 1, same padding, ReLU.
  • Max-pool: 2×2, stride 2.
  • Output: flatten → 1 dense layer.
python · exercise3_starter.py
import numpy as np

def conv2d(x, W, b, stride=1):
    """Multi-filter 2D convolution with bias.
    x: (H, W, C_in), W: (K, k, k, C_in), b: (K,)
    Returns: (H_out, W_out, K)
    """
    H, W_in, C = x.shape
    K, kh, kw, _ = W.shape
    # TODO: implement using nested loops or im2col.

def maxpool2d(x, size=2, stride=2):
    """2x2 max pool with stride 2."""
    # TODO

def relu(x):
    return np.maximum(0, x)

# Build the network.
# W1, b1, W2, b2, W3, b3 are module-level arrays that must be defined before calling forward().
def forward(image):
    x = image[..., None]                # (28, 28, 1)
    x = relu(conv2d(x, W1, b1))         # (28, 28, 4), assumes 'same' padding
    x = maxpool2d(x)                    # (14, 14, 4)
    x = relu(conv2d(x, W2, b2))         # (14, 14, 8)
    x = maxpool2d(x)                    # (7, 7, 8)
    x = x.flatten()                     # (392,)
    return W3 @ x + b3                  # logits
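
Before filling in the TODOs, it can help to work out the shapes the weights need. The sketch below does only that bookkeeping and leaves the exercise itself open. It assumes 'same' padding for both convolutions, as the shape comments in `forward` imply, and ten output classes, which is an assumption based on the MNIST-style input rather than something the starter file states.

```python
from math import prod

def conv_out(size, kernel=3, stride=1, pad=1):
    """Output size along one axis; pad=1 gives 'same' size for a 3x3 kernel."""
    return (size - kernel + 2 * pad) // stride + 1

def pool_out(size, window=2, stride=2):
    """Output size along one axis after 2x2 max pooling with stride 2."""
    return (size - window) // stride + 1

n_classes = 10            # assumption: ten MNIST-style digit classes

s = 28
s = conv_out(s)           # 28 after conv layer 1 (4 filters)
s = pool_out(s)           # 14
s = conv_out(s)           # 14 after conv layer 2 (8 filters)
s = pool_out(s)           # 7
flat = s * s * 8          # length of the flattened vector

# Filter layout (K, k, k, C_in), matching the conv2d docstring.
shapes = {
    "W1": (4, 3, 3, 1), "b1": (4,),
    "W2": (8, 3, 3, 4), "b2": (8,),
    "W3": (n_classes, flat), "b3": (n_classes,),
}
n_params = sum(prod(shape) for shape in shapes.values())

print(flat)       # 392
print(n_params)   # 4266
```

Most of the parameters sit in the final dense layer (3,930 of 4,266 under these assumptions); the two convolutional layers together need only 336.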

Downloads

cnn_visualization.py: apply learned filters
cnn_starter.py: tiny CNN forward pass

References

  [1] LeCun, Y., Boser, B., Denker, J. S., et al. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
  [2] LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
  [3] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  [4] Boureau, Y.-L., Ponce, J., & LeCun, Y. (2010). A theoretical analysis of feature pooling in visual recognition. ICML 27.
  [5] Zeiler, M. D., & Fergus, R. (2014). Visualizing and understanding convolutional networks. ECCV, 818–833.
  [6] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS 25, 1097–1105.
  [7] He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. CVPR, 770–778.