In the previous article, we successfully trained a neuron by:
- Computing a forward pass
- Measuring loss
- Computing gradients
- Updating weights and bias
However, we intentionally simplified one detail:
We assumed the activation function was the identity.
Real neural networks do not work this way.
In practice, every neuron includes a non-linear activation, and that activation affects how gradients flow backward.
This article shows exactly how.
Why Activations Change Backpropagation
Recall the forward computation of a real neuron:
z = w * x + b
a = f(z)
Previously, we had the identity activation:
a = z
Now, a depends on z through a non-linear function f.
That adds one extra derivative to the chain rule.
The Full Gradient Chain (Now Complete)
With an activation function, the gradient becomes:
dL/dw = dL/da · da/dz · dz/dw
dL/db = dL/da · da/dz · dz/db
The new term is:
da/dz = f'(z)
This term depends entirely on the activation function.
Case 1: ReLU Activation
ReLU Definition
relu(z) = max(0, z)
ReLU Derivative
relu'(z) = 1 if z > 0, otherwise 0
This is simple and powerful.
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0
Backpropagation With ReLU (Concrete Example)
Let’s reuse the same setup, but add ReLU.
x = 2.0
w = 1.5
b = 0.5
y = 4.0
Forward Pass
z = w * x + b        # 3.5
a = relu(z)          # 3.5
loss = (a - y) ** 2  # 0.25
Backward Pass
dL_da = 2 * (a - y)         # -1.0
da_dz = relu_derivative(z)  # 1.0
dz_dw = x                   # 2.0
dz_db = 1.0
Gradients
dL_dw = dL_da * da_dz * dz_dw  # -2.0
dL_db = dL_da * da_dz * dz_db  # -1.0
Same result as before — because ReLU was active.
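If we then take one gradient-descent step, the update works exactly as in the previous article. A minimal sketch, assuming a learning rate of 0.1:

lr = 0.1             # assumed learning rate, for illustration only
w = w - lr * dL_dw   # 1.5 - 0.1 * (-2.0) = 1.7
b = b - lr * dL_db   # 0.5 - 0.1 * (-1.0) = 0.6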
What Happens When ReLU Is Inactive?
If z <= 0:
da_dz = 0.0
Then:
dL_dw = 0
dL_db = 0
No gradient flows.
This is why ReLU neurons can “die” if they stay inactive.
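To see this concretely, here is a small sketch with hypothetical values chosen so the pre-activation is negative:

x = 2.0
w = -1.5                    # hypothetical weight chosen so that z <= 0
b = 0.5
y = 4.0

z = w * x + b               # -2.5
a = relu(z)                 # 0.0
dL_da = 2 * (a - y)         # -8.0
da_dz = relu_derivative(z)  # 0.0  <- the chain is cut here
dL_dw = dL_da * da_dz * x   # 0.0
dL_db = dL_da * da_dz       # 0.0

However large dL_da is, the zero from relu_derivative wipes it out, so the weight and bias stop moving for as long as the neuron stays inactive.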
Case 2: Sigmoid Activation
Sigmoid Definition
sigmoid(z) = 1 / (1 + e^(-z))
Sigmoid Derivative
sigmoid'(z) = sigmoid(z) · (1 - sigmoid(z))
A key identity:
sigmoid'(z) = a · (1 - a), where a = sigmoid(z)
This means the derivative depends on the output, not the input.
Implementing Sigmoid and Its Derivative
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_derivative(a):
    return a * (1 - a)
Note:
- We pass a, not z
- This avoids computing the sigmoid a second time
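A quick sanity check at z = 0, where the values are exact:

a = sigmoid(0.0)       # 0.5
sigmoid_derivative(a)  # 0.25, the largest value the derivative can take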
Backpropagation With Sigmoid
Forward Pass
z = w * x + b
a = sigmoid(z)
loss = (a - y) ** 2
Backward Pass
dL_da = 2 * (a - y)
da_dz = sigmoid_derivative(a)
dz_dw = x
dz_db = 1.0
Gradients
dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db
Now the gradient magnitude depends on:
- How saturated the sigmoid is
- How confident the neuron already is
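Running the same setup as the ReLU example (x = 2.0, w = 1.5, b = 0.5, y = 4.0) through this sigmoid version gives, approximately:

z = w * x + b                  # 3.5
a = sigmoid(z)                 # ≈ 0.9707
dL_da = 2 * (a - y)            # ≈ -6.059
da_dz = sigmoid_derivative(a)  # ≈ 0.0285
dL_dw = dL_da * da_dz * x      # ≈ -0.345
dL_db = dL_da * da_dz          # ≈ -0.172

Compare this with the ReLU gradients (-2.0 and -1.0): the saturated sigmoid shrinks the gradient by roughly a factor of six, even though the target is far away.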
Why Sigmoid Can Cause Vanishing Gradients
If a is close to 0 or 1:
sigmoid_derivative(a) ≈ 0
Which means:
- Gradients shrink
- Learning slows down
- Deep networks struggle
This is why ReLU dominates modern architectures.
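To make the saturation effect concrete, here is what sigmoid_derivative(a) = a * (1 - a) returns at a few outputs:

sigmoid_derivative(0.5)    # 0.25 (the maximum)
sigmoid_derivative(0.9)    # ≈ 0.09
sigmoid_derivative(0.99)   # ≈ 0.0099
sigmoid_derivative(0.999)  # ≈ 0.000999

Each sigmoid layer a gradient passes through multiplies in another factor like this, which is how gradients vanish as networks get deeper.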
A General Backpropagation Pattern
Every neuron follows this structure:
Loss
  ↑
Activation derivative
  ↑
Linear derivative
  ↑
Weight / bias
Backpropagation is simply applying this pattern repeatedly.
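As a minimal sketch of that pattern (the helper name and signature are illustrative, not part of the series code), here is a single function that runs the forward pass and returns both gradients for any activation whose derivative is expressed in terms of z:

def neuron_backward(x, w, b, y, activation, activation_derivative):
    # forward pass
    z = w * x + b
    a = activation(z)
    # backward pass: loss term, then activation term, then linear term
    dL_da = 2 * (a - y)
    da_dz = activation_derivative(z)
    dL_dw = dL_da * da_dz * x
    dL_db = dL_da * da_dz
    return dL_dw, dL_db

# ReLU neuron from the example above
neuron_backward(2.0, 1.5, 0.5, 4.0, relu, relu_derivative)  # (-2.0, -1.0)

# Sigmoid neuron: wrap the derivative, since sigmoid_derivative expects a, not z
neuron_backward(2.0, 1.5, 0.5, 4.0, sigmoid,
                lambda z: sigmoid_derivative(sigmoid(z)))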
What We Have Completed
At this point, you understand:
- Backpropagation for a neuron
- How activation functions affect gradients
- Why ReLU and Sigmoid behave differently
- Where vanishing gradients come from
This is all the math needed for real neural networks.
What’s Next in the Series
In Article #9, we will:
- Backpropagate through a full layer
- Accumulate gradients for multiple neurons
- Prepare a clean, reusable training loop
- Move from “a neuron learns” to “a network learns”
This is the step where everything scales.
GitHub Code
Activation-aware backpropagation code will be added next:
👉 [link to your GitHub repository]
Series Progress
Neural Networks From Scratch (Pure Python)
✔ Article #1 — What a Neuron Really Computes
✔ Article #2 — Activation Functions
✔ Article #3 — Building a Layer
✔ Article #4 — Forward Propagation
✔ Article #5 — Loss Functions
✔ Article #6 — Gradients Explained
✔ Article #7 — Backpropagation (Single Neuron)
✔ Article #8 — Backpropagation With Activations
➡ Article #9 — Backpropagation Through a Layer