SolveWithPython

Backpropagation With Activation Functions — Completing the Gradient Chain

In the previous article, we successfully trained a neuron by:

  • Computing a forward pass
  • Measuring loss
  • Computing gradients
  • Updating weights and bias

However, we intentionally simplified one detail:

We assumed the activation function was the identity.

Real neural networks do not work this way.

In practice, every neuron includes a non-linear activation, and that activation affects how gradients flow backward.

This article shows exactly how.

Why Activations Change Backpropagation

Recall the forward computation of a real neuron:

z = w \cdot x + b
a = f(z)
L = \text{loss}(a, y)

Previously, we had a = z.

Now, a depends on z through a non-linear function.

That adds one extra derivative to the chain rule.
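Before we add the gradients, here is a minimal sketch of this forward pass in code; the helper name forward and the placeholder f are this article's own choices, with f standing in for whichever activation we plug in below.

Python
def forward(x, w, b, f):
    # Linear step followed by the activation, mirroring the equations above
    z = w * x + b
    a = f(z)
    return z, a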

The Full Gradient Chain (Now Complete)

With an activation function, the gradient becomes:

\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}

The new term is:

\frac{\partial a}{\partial z}

This term depends entirely on the activation function.

Case 1: ReLU Activation

ReLU Definition

\text{ReLU}(z) = \max(0, z)

ReLU Derivative

\frac{d}{dz}\text{ReLU}(z) = \begin{cases} 1 & z > 0 \\ 0 & z \le 0 \end{cases}

This is simple—and powerful.

Python
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0
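A quick sanity check of these helpers; the sample inputs below are only illustrative.

Python
print(relu(3.5))              # 3.5 -> positive inputs pass through
print(relu(-1.0))             # 0.0 -> negative inputs are clipped
print(relu_derivative(3.5))   # 1.0 -> gradient flows
print(relu_derivative(-1.0))  # 0.0 -> gradient is blocked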

Backpropagation With ReLU (Concrete Example)

Let’s reuse the same setup, but add ReLU.

Python
x = 2.0
w = 1.5
b = 0.5
y = 4.0

Forward Pass

Python
z = w * x + b # 3.5
a = relu(z) # 3.5
loss = (a - y) ** 2 # 0.25

Backward Pass

Python
dL_da = 2 * (a - y) # -1.0
da_dz = relu_derivative(z) # 1.0
dz_dw = x # 2.0
dz_db = 1.0

Gradients

Python
dL_dw = dL_da * da_dz * dz_dw # -2.0
dL_db = dL_da * da_dz * dz_db # -1.0

Same result as before — because ReLU was active.
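To connect this back to the previous article, the gradients plug into the same update rule as before; the learning rate of 0.1 is only an illustrative choice.

Python
lr = 0.1              # illustrative learning rate
w = w - lr * dL_dw    # 1.5 - 0.1 * (-2.0) = 1.7
b = b - lr * dL_db    # 0.5 - 0.1 * (-1.0) = 0.6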

What Happens When ReLU Is Inactive?

If z <= 0:

Python
da_dz = 0.0

Then:

dL_dw = 0
dL_db = 0

No gradient flows.

This is why ReLU neurons can “die” if they stay inactive.
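Here is a concrete sketch of a dead ReLU; the negative weight is chosen only to force z below zero.

Python
x = 2.0
w = -1.5                      # chosen so that z ends up negative
b = 0.5
y = 4.0

z = w * x + b                 # -2.5
a = relu(z)                   # 0.0
dL_da = 2 * (a - y)           # -8.0 (the loss still wants a to change)
da_dz = relu_derivative(z)    # 0.0
dL_dw = dL_da * da_dz * x     # 0.0 -> no learning signal for w
dL_db = dL_da * da_dz * 1.0   # 0.0 -> no learning signal for b

However large the loss gradient is, the zero from relu_derivative erases it before it reaches w and b.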

Case 2: Sigmoid Activation

Sigmoid Definition

\sigma(z) = \frac{1}{1 + e^{-z}}

Sigmoid Derivative

A key identity:

\frac{d}{dz}\sigma(z) = \sigma(z)\,(1 - \sigma(z))

This means the derivative depends on the output, not the input.

Implementing Sigmoid and Its Derivative

Python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_derivative(a):
    return a * (1 - a)

Note:

  • We pass the output a, not the input z
  • This avoids computing the sigmoid twice (a quick numerical check follows below)
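A minimal check of this shortcut, comparing it against a central finite difference at an arbitrary sample point:

Python
z = 0.7                                              # arbitrary sample point
a = sigmoid(z)
analytic = sigmoid_derivative(a)                     # uses the output a
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
print(analytic, numeric)                             # both ~0.2217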

Backpropagation With Sigmoid

Forward Pass

Python
z = w * x + b
a = sigmoid(z)
loss = (a - y) ** 2

Backward Pass

Python
dL_da = 2 * (a - y)
da_dz = sigmoid_derivative(a)
dz_dw = x
dz_db = 1.0

Gradients

Python
dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db

Now the gradient magnitude depends on:

  • How saturated the sigmoid is
  • How confident the neuron already is
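To make both points concrete, here is the sigmoid version with actual numbers. Since sigmoid outputs live in (0, 1), the earlier target of y = 4.0 no longer makes sense, so the example below assumes a target of y = 1.0 while reusing the same x, w, and b.

Python
x, w, b = 2.0, 1.5, 0.5
y = 1.0                         # target chosen inside sigmoid's output range

z = w * x + b                   # 3.5
a = sigmoid(z)                  # ~0.9707 (already saturated)
loss = (a - y) ** 2             # ~0.00086

dL_da = 2 * (a - y)             # ~-0.0586
da_dz = sigmoid_derivative(a)   # ~0.0285
dL_dw = dL_da * da_dz * x       # ~-0.0033
dL_db = dL_da * da_dz * 1.0     # ~-0.0017

Compare da_dz here (~0.0285) with ReLU's 1.0: the activation derivative alone scales the gradient down by a factor of about 35 before it ever reaches w and b.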

Why Sigmoid Can Cause Vanishing Gradients

If a is close to 0 or 1:

Python
sigmoid_derivative(a)   # ≈ 0

Which means:

  • Gradients shrink
  • Learning slows down
  • Deep networks struggle
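A tiny sweep over possible outputs makes the shrinkage visible; the values of a below are just representative.

Python
for a in (0.5, 0.9, 0.99, 0.999):
    print(a, sigmoid_derivative(a))   # ~0.25, ~0.09, ~0.0099, ~0.001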

This is why ReLU dominates modern architectures.

A General Backpropagation Pattern

Every neuron follows this structure:

  • Loss derivative: \frac{\partial L}{\partial a}
  • Activation derivative: \frac{\partial a}{\partial z}
  • Linear derivative: \frac{\partial z}{\partial w} and \frac{\partial z}{\partial b}
  • Weight and bias gradients: the product of the terms above

Backpropagation is simply applying this pattern repeatedly.
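As a sketch of that pattern, here is one way to package a single activation-aware step; the function name backprop_step and its signature are this article's own, not a standard API, and f_prime is assumed to take z.

Python
def backprop_step(x, w, b, y, f, f_prime):
    # Forward pass
    z = w * x + b
    a = f(z)
    loss = (a - y) ** 2

    # Backward pass: loss term, then activation term, then linear term
    dL_da = 2 * (a - y)
    da_dz = f_prime(z)
    dL_dw = dL_da * da_dz * x
    dL_db = dL_da * da_dz * 1.0
    return loss, dL_dw, dL_db

With ReLU, backprop_step(2.0, 1.5, 0.5, 4.0, relu, relu_derivative) reproduces the gradients above: (0.25, -2.0, -1.0). For sigmoid, pass a small adapter such as lambda z: sigmoid_derivative(sigmoid(z)), since sigmoid_derivative expects the output a rather than z.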

What We Have Completed

At this point, you now understand:

  • Backpropagation for a neuron
  • How activation functions affect gradients
  • Why ReLU and Sigmoid behave differently
  • Where vanishing gradients come from

This is the core math behind training real neural networks.

What’s Next in the Series

In Article #9, we will:

  • Backpropagate through a full layer
  • Accumulate gradients for multiple neurons
  • Prepare a clean, reusable training loop
  • Move from “a neuron learns” to “a network learns”

This is the step where everything scales.

GitHub Code

Activation-aware backpropagation code will be added next:

👉 [link to your GitHub repository]

Series Progress

Neural Networks From Scratch (Pure Python)
✔ Article #1 — What a Neuron Really Computes
✔ Article #2 — Activation Functions
✔ Article #3 — Building a Layer
✔ Article #4 — Forward Propagation
✔ Article #5 — Loss Functions
✔ Article #6 — Gradients Explained
✔ Article #7 — Backpropagation (Single Neuron)
✔ Article #8 — Backpropagation With Activations
➡ Article #9 — Backpropagation Through a Layer