Pixels2GenAI

9.1.3 Activation Functions as Art

Visualise the seven canonical activation functions (sigmoid, tanh, ReLU, leaky ReLU, ELU, GELU, Swish) and feed an image through each. The choice of activation is one of the few design decisions where everyday practice has barely shifted in a decade: ReLU-family functions remain the default.

Duration: 25–35 min
Level: intermediate
Load: 3 core concepts
Prereqs: 9.1.1, 9.1.2 (forward pass intuition)

Big question

What does the activation function actually do? Without it, a neural network is just a stack of linear transforms, and any stack of linear transforms collapses into a single linear transform [1]. The activation introduces non-linearity, the one ingredient that lets the network learn curved decision boundaries, image features, and language structure. Choosing the activation function is one of the few neural-network decisions where the field has reached near-consensus: ReLU and its variants dominate hidden layers, with sigmoid and tanh largely confined to recurrent networks and output layers [2].

Learning objectives

  1. Plot the curve and derivative of each canonical activation function.
  2. Explain why ReLU (max(0, x)) became standard around 2012: sparse activations, no saturation for positive inputs, and a cheap gradient.
  3. Apply each activation pixel-wise to an image and observe how it transforms the value distribution.
  4. Connect the activation choice to the vanishing gradient problem in deep networks.

Part 1 — Why non-linearity matters

Without an activation, a multi-layer network reduces to a single matrix multiplication:

y = W₃ (W₂ (W₁ x)) = (W₃ W₂ W₁) x = W' x
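
The collapse can be checked numerically. The sketch below uses three small, hand-picked matrices (illustrative values, not taken from the course scripts) and confirms that the three-layer linear stack equals the single matrix W', while inserting ReLU between the layers gives a different answer.

```python
import numpy as np

# Hand-picked illustrative weights; any matrices with compatible shapes would do.
W1 = np.array([[1.0, -1.0],
               [0.5,  2.0]])
W2 = np.array([[ 2.0, 0.0],
               [-1.0, 1.0]])
W3 = np.array([[1.0, 1.0]])
x = np.array([1.0, 2.0])

# Three linear layers applied one after another.
stacked = W3 @ (W2 @ (W1 @ x))

# The same computation as one pre-multiplied matrix W'.
W_prime = W3 @ W2 @ W1
collapsed = W_prime @ x
linear_collapses = np.allclose(stacked, collapsed)

def relu(z):
    return np.maximum(0, z)

# With ReLU between layers, the result no longer matches W' x.
nonlinear = W3 @ relu(W2 @ relu(W1 @ x))
relu_differs = not np.allclose(nonlinear, collapsed)

print("linear stack equals W' x:", linear_collapses)
print("ReLU network equals W' x:", not relu_differs)
```

Here the first layer produces a negative entry, which ReLU zeroes out; that single clipped value is enough to break the equivalence with W'.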

The composition of linear functions is linear. To learn anything more interesting than a hyperplane separator, you need a non-linear function between layers. Any non-linear function works in principle; in practice, the seven canonical activations balance three concerns:

  • Differentiability: backpropagation needs a gradient almost everywhere (ReLU's kink at zero is handled with a subgradient).
  • Saturation: sigmoid and tanh squash inputs into bounded ranges, which causes gradients to vanish for large |x|.
  • Computational cost: max(0, x) is one comparison; tanh(x) requires evaluating exponentials, which is far more expensive.
A plot showing several activation functions overlaid: sigmoid S-shape between 0 and 1, tanh similar shape between -1 and 1, ReLU as a ramp at zero, leaky ReLU as a softer ramp, and ELU as a smooth version of ReLU.
Fig. 1 Five canonical activation functions on the same axes. ReLU's kink at zero is what makes it cheap; leaky ReLU and ELU are different ways to soften its flat negative side.
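
To make the saturation point concrete, here is a minimal sketch that evaluates the analytic derivatives of sigmoid, tanh, and ReLU at a few inputs. The helper names (`d_sigmoid`, `d_tanh`, `d_relu`) are illustrative and not part of the course scripts; the ReLU derivative at exactly zero is set to 0, which is one common convention.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def d_sigmoid(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x)); peaks at 0.25 when x = 0.
    s = sigmoid(x)
    return s * (1 - s)

def d_tanh(x):
    # d/dx tanh(x) = 1 - tanh(x)^2; peaks at 1 when x = 0.
    return 1 - np.tanh(x) ** 2

def d_relu(x):
    # 1 for positive inputs, 0 otherwise (subgradient choice at x = 0).
    return (x > 0).astype(float)

xs = np.array([0.0, 2.0, 5.0, 10.0])
for name, grad in [("sigmoid", d_sigmoid), ("tanh", d_tanh), ("relu", d_relu)]:
    print(f"{name:8s}", np.array2string(grad(xs), precision=6, suppress_small=False))
```

The sigmoid and tanh gradients shrink towards zero as x grows, while the ReLU gradient stays at 1 for every positive input.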

Part 2 — The canonical functions

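import numpy as np

# Each function works element-wise on scalars or NumPy arrays.
def sigmoid(x):     return 1 / (1 + np.exp(-x))
def tanh(x):        return np.tanh(x)
def relu(x):        return np.maximum(0, x)
# alpha is the slope applied to negative inputs.
def leaky_relu(x, alpha=0.01): return np.where(x > 0, x, alpha * x)
# The negative branch levels off at -alpha.
def elu(x, alpha=1.0): return np.where(x > 0, x, alpha * (np.exp(x) - 1))
# tanh-based approximation of GELU from Hendrycks & Gimpel [4].
def gelu(x):        return 0.5 * x * (1 + np.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))
def swish(x):       return x * sigmoid(x)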
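
The `gelu` definition above is the tanh-based approximation. As a rough check, the sketch below compares it with the exact form x · Φ(x), where Φ is the standard normal CDF, written via `math.erf`. The names `gelu_tanh`, `gelu_exact`, and `max_gap` are illustrative, not from the course scripts.

```python
import math
import numpy as np

def gelu_tanh(x):
    # Same tanh-based approximation as the gelu() definition in Part 2.
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def gelu_exact(x):
    # Exact GELU: x * Phi(x), with Phi(x) = 0.5 * (1 + erf(x / sqrt(2))).
    erf = np.vectorize(math.erf)  # math.erf is scalar-only, so vectorise it
    return 0.5 * x * (1 + erf(x / math.sqrt(2)))

xs = np.linspace(-5, 5, 1001)
max_gap = float(np.max(np.abs(gelu_tanh(xs) - gelu_exact(xs))))
print(f"largest |approx - exact| on [-5, 5]: {max_gap:.2e}")
```

The gap is small enough that the two forms are indistinguishable in a plot, which is why the cheaper approximation is often used in practice.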

The historical sequence is instructive:

| Function | Year | Notes |
| --- | --- | --- |
| Sigmoid | 1958 | Smooth, but saturates → vanishing gradients in deep nets |
| Tanh | 1980s | Zero-centred sigmoid; better for backprop but still saturates |
| ReLU | 2010 | One comparison; no saturation for x > 0. Default for CNNs by 2012 [3] |
| Leaky ReLU | 2013 | Tiny negative slope to fix the “dying ReLU” problem |
| ELU | 2015 | Smoother negative branch via exp |
| GELU | 2016 | Stochastic + smooth; default in transformers [4] |
| Swish | 2017 | x · sigmoid(x); found by automated search over candidate activation functions [7] |

The rectifier paper of Glorot, Bordes & Bengio (2011) was the inflection point: on the same data and architecture, ReLU-based networks matched or beat networks built on saturating units such as tanh. AlexNet (2012) then reported that ReLU networks trained several times faster than tanh equivalents, and ReLU has been the standard choice since [3, 5].

Part 3 — Apply each to an image

A useful way to see what each activation does is to apply it pixel-wise to a normalised image. The transformations are striking:

A grid showing the same input image transformed by different activation functions. Each panel shows how the activation reshapes the brightness distribution.
Fig. 2 One image, seven activations. Sigmoid and tanh compress the dynamic range; ReLU clips half the image to black; GELU and Swish do something subtler.

The effect on real images:

  • Sigmoid: compresses everything into the middle of the brightness range, so the image looks low-contrast.
  • Tanh: the same S-shape, but signed and centred on zero. Slightly more contrast than sigmoid.
  • ReLU: anything below zero (after centring) goes to black. A sharp half-clip effect.
  • Leaky ReLU: the same as ReLU, but with very dim negatives instead of pure black.
  • ELU: like leaky ReLU, but with a smooth negative tail that levels off at -alpha.
  • GELU/Swish: a shallow negative dip just below zero that fades back to zero for very negative inputs and blends smoothly into the identity for positive ones. Hard to distinguish visually from ReLU at first glance, but training dynamics differ.
An artistic rendering showing the activation function applied to a colourful gradient image, producing a stylised look where the activation reshapes colour bands.
Fig. 3 The same gradient image after applying a non-linearity. The activation function is now an artistic filter.
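
The effects described in Part 3 can be measured as well as seen. The sketch below uses a synthetic horizontal ramp as a hypothetical stand-in for a real photo, centres it to [-2, 2], and reports the output range of three activations plus the fraction of pixels that ReLU sends to zero. The variable names are illustrative.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

# Hypothetical test image: a left-to-right brightness ramp with values in [0, 1].
height, width = 64, 256
image = np.tile(np.linspace(0.0, 1.0, width), (height, 1))

# Centre and stretch so pixel values span [-2, 2].
centred = image * 4.0 - 2.0

out_sigmoid = sigmoid(centred)
out_tanh = np.tanh(centred)
out_relu = relu(centred)

# Fraction of pixels that ReLU clips to exactly zero (rendered as black).
black_fraction = float(np.mean(out_relu == 0))

for name, out in [("sigmoid", out_sigmoid), ("tanh", out_tanh), ("relu", out_relu)]:
    print(f"{name:8s} min={out.min():+.3f} max={out.max():+.3f}")
print(f"ReLU clips {black_fraction:.0%} of the pixels to zero")
```

On a ramp that is symmetric around zero, ReLU zeroes the darker half, while sigmoid squeezes the whole range into a narrow band around 0.5.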

Synthesis project

EXECUTE I.

Plot all activations and their derivatives

Run activation_functions_art.py. The script plots all seven functions on a shared axis and their derivatives separately.

Reflection questions

  • Which activations saturate (have near-zero derivatives for large |x|)?
  • Why does saturation cause the “vanishing gradient” problem in deep networks?
  • Which activations are not zero-centred? Why does that matter?
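
As a starting point for the second question, here is a deliberately simplified sketch. It ignores the weight matrices entirely (an assumption made only for illustration) and multiplies the best-case activation derivative across a chain of layers: 0.25 for sigmoid, 1.0 for ReLU on an active path.

```python
# Backpropagation multiplies one activation derivative per layer (chain rule).
# Simplifying assumption: weights are ignored, so only activation slopes remain.
SIGMOID_MAX_SLOPE = 0.25   # sigmoid'(0), the largest value the derivative can take
RELU_ACTIVE_SLOPE = 1.0    # ReLU'(x) for any x > 0

depths = [1, 5, 10, 20]
sigmoid_best_case = {n: SIGMOID_MAX_SLOPE ** n for n in depths}
relu_active_path = {n: RELU_ACTIVE_SLOPE ** n for n in depths}

for n in depths:
    print(f"{n:2d} layers: sigmoid <= {sigmoid_best_case[n]:.3e}, "
          f"relu = {relu_active_path[n]:.1f}")
```

Even in the best case the sigmoid factor shrinks geometrically with depth, whereas an active ReLU path passes the gradient through unchanged.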
MODIFY II.

Apply activations to an image

Edit activation_starter.py to apply each activation to a normalised input image and save an 8-panel comparison (the original plus the seven activations).

Goals

  1. Normalise the image to [-2, 2] before applying the activations, so the non-linearities have room to show their shape.
  2. Apply each function pixel-wise.
  3. Stack into a 2×4 grid with the original in the first slot for reference.
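
The three goals above could be sketched as follows. This is one possible approach and not the contents of activation_starter.py: the helper names (`normalise`, `comparison_grid`), the synthetic `demo` input, and the choice to clip activation outputs to [0, 1] for display are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

# The seven functions from Part 2, in panel order.
ACTIVATIONS = {
    "sigmoid": sigmoid,
    "tanh": np.tanh,
    "relu": lambda x: np.maximum(0, x),
    "leaky_relu": lambda x: np.where(x > 0, x, 0.01 * x),
    "elu": lambda x: np.where(x > 0, x, np.exp(x) - 1),
    "gelu": lambda x: 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3))),
    "swish": lambda x: x * sigmoid(x),
}

def normalise(img, lo=-2.0, hi=2.0):
    """Linearly rescale pixel values so they span [lo, hi]."""
    img = np.asarray(img, dtype=np.float64)
    span = img.max() - img.min()
    if span == 0:
        # A flat image has no range to stretch; place it at the midpoint.
        return np.full_like(img, (lo + hi) / 2)
    return lo + (img - img.min()) / span * (hi - lo)

def comparison_grid(img):
    """Return a 2x4 mosaic: the original first, then the seven activations."""
    x = normalise(img)
    original = (x + 2.0) / 4.0  # map [-2, 2] back to [0, 1] for display
    # Assumed display choice: clip each activation output to [0, 1].
    outputs = [np.clip(fn(x), 0.0, 1.0) for fn in ACTIVATIONS.values()]
    panels = [original] + outputs
    return np.vstack([np.hstack(panels[:4]), np.hstack(panels[4:])])

# Hypothetical greyscale input: a diagonal gradient, 32 rows by 48 columns.
demo = np.add.outer(np.arange(32), np.arange(48)).astype(np.uint8)
grid = comparison_grid(demo)
print(grid.shape)
```

Clipping to a shared [0, 1] range keeps the panels comparable; rescaling each panel independently would hide the range compression that sigmoid and tanh introduce.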
CREATE III.

Aurora — the activation as art

Build an artistic composition using activation functions: generate a smooth field (e.g. fbm noise from Module 06), apply different activations to different regions, and combine into one image.

An aurora-like artistic composition: smooth flowing bands of colour with sharp transitions at activation-function boundaries.
Fig. 4 Reference output for the aurora challenge: a noise field reshaped by activation curves produces flowing, banded colour.
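
One way to approach the challenge is sketched below. It is not the challenge_aurora.py reference: a sum of sinusoids stands in for the fbm noise from Module 06, the image is split into three vertical regions, and the colour mapping is an arbitrary illustrative palette.

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

height, width = 120, 240
ys, xs = np.mgrid[0:height, 0:width]
u = xs / width    # horizontal position in [0, 1)
v = ys / height   # vertical position in [0, 1)

# Stand-in smooth field (assumption: replaces the fbm noise from Module 06).
field = (np.sin(2 * np.pi * (1.5 * u + 0.7 * v))
         + 0.5 * np.sin(2 * np.pi * (3.1 * u - 2.3 * v) + 1.0)
         + 0.25 * np.sin(2 * np.pi * (6.7 * u + 5.1 * v) + 2.0))
field = 2.0 * field / np.abs(field).max()   # scale into [-2, 2]

# Split the canvas into three vertical regions, one activation each.
region = np.minimum((u * 3).astype(int), 2)
shaped = np.select(
    [region == 0, region == 1, region == 2],
    [np.tanh(field), np.maximum(0, field), field * sigmoid(field)],
)

# Map the shaped field (roughly [-1, 2]) to [0, 1], then to an arbitrary palette.
t = np.clip((shaped + 1.0) / 3.0, 0.0, 1.0)
rgb = np.stack([0.3 * t, t, 0.5 + 0.5 * t], axis=-1)
image = (rgb * 255).round().astype(np.uint8)
print(image.shape, image.dtype)
```

The hard region boundaries produce visible seams where the activation changes; blending the region masks smoothly would be one way to soften them.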

Downloads

  • activation_functions_art.py: plot curves
  • activation_starter.py: apply to images
  • challenge_aurora.py: aurora reference

References

  [1] Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
  [2] LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444.
  [3] Glorot, X., Bordes, A., & Bengio, Y. (2011). Deep sparse rectifier neural networks. Proceedings of AISTATS, 15, 315–323.
  [4] Hendrycks, D., & Gimpel, K. (2016). Gaussian Error Linear Units (GELUs). arXiv:1606.08415.
  [5] Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. NeurIPS 25, 1097–1105.
  [6] He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. ICCV, 1026–1034.
  [7] Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv:1710.05941.