In the previous article, we successfully trained a neuron by:
- Computing a forward pass
- Measuring loss
- Computing gradients
- Updating weights and bias
However, we intentionally simplified one detail:
We assumed the activation function was the identity.
Real neural networks do not work this way.
In practice, every neuron includes a non-linear activation, and that activation affects how gradients flow backward.
This article shows exactly how.
Why Activations Change Backpropagation
Recall the forward computation of a real neuron:
z = w * x + b
a = f(z)
Previously, we had the identity activation:
a = z
Now, a depends on z through a non-linear function f.
That adds one extra derivative to the chain rule.
The Full Gradient Chain (Now Complete)
With an activation function, the gradient becomes:
dL/dw = dL/da · da/dz · dz/dw
dL/db = dL/da · da/dz · dz/db
The new term is:
da/dz = f'(z)
This term depends entirely on the activation function.
Case 1: ReLU Activation
ReLU Definition
relu(z) = max(0, z)
ReLU Derivative
relu'(z) = 1 if z > 0, otherwise 0
This is simple and powerful.
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0
Backpropagation With ReLU (Concrete Example)
Let’s reuse the same setup, but add ReLU.
x = 2.0
w = 1.5
b = 0.5
y = 4.0
Forward Pass
z = w * x + b        # 3.5
a = relu(z)          # 3.5
loss = (a - y) ** 2  # 0.25
Backward Pass
dL_da = 2 * (a - y)         # -1.0
da_dz = relu_derivative(z)  # 1.0
dz_dw = x                   # 2.0
dz_db = 1.0
Gradients
dL_dw = dL_da * da_dz * dz_dw  # -2.0
dL_db = dL_da * da_dz * dz_db  # -1.0
Same result as before — because ReLU was active.
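If we then take one gradient-descent step, the update works exactly as in the previous article. A minimal sketch, assuming a learning rate of 0.1:

lr = 0.1             # assumed learning rate, for illustration only
w = w - lr * dL_dw   # 1.5 - 0.1 * (-2.0) = 1.7
b = b - lr * dL_db   # 0.5 - 0.1 * (-1.0) = 0.6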
What Happens When ReLU Is Inactive?
If z <= 0:
da_dz = 0.0
Then:
dL_dw = 0
dL_db = 0
No gradient flows.
This is why ReLU neurons can “die” if they stay inactive.
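To see this concretely, here is a small sketch with hypothetical values chosen so the pre-activation is negative:

x = 2.0
w = -1.5                    # hypothetical weight chosen so that z <= 0
b = 0.5
y = 4.0

z = w * x + b               # -2.5
a = relu(z)                 # 0.0
dL_da = 2 * (a - y)         # -8.0
da_dz = relu_derivative(z)  # 0.0  <- the chain is cut here
dL_dw = dL_da * da_dz * x   # 0.0
dL_db = dL_da * da_dz       # 0.0

However large dL_da is, the zero from relu_derivative wipes it out, so the weight and bias stop moving for as long as the neuron stays inactive.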
Case 2: Sigmoid Activation
Sigmoid Definition
sigmoid(z) = 1 / (1 + e^(-z))
Sigmoid Derivative
sigmoid'(z) = sigmoid(z) · (1 - sigmoid(z))
A key identity:
sigmoid'(z) = a · (1 - a), where a = sigmoid(z)
This means the derivative depends on the output, not the input.
Implementing Sigmoid and Its Derivative
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def sigmoid_derivative(a):
    return a * (1 - a)
Note:
- We pass a, not z
- This avoids computing the sigmoid a second time
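A quick sanity check at z = 0, where the values are exact:

a = sigmoid(0.0)       # 0.5
sigmoid_derivative(a)  # 0.25, the largest value the derivative can take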
Backpropagation With Sigmoid
Forward Pass
z = w * x + b
a = sigmoid(z)
loss = (a - y) ** 2
Backward Pass
dL_da = 2 * (a - y)
da_dz = sigmoid_derivative(a)
dz_dw = x
dz_db = 1.0
Gradients
dL_dw = dL_da * da_dz * dz_dw
dL_db = dL_da * da_dz * dz_db
Now the gradient magnitude depends on:
- How saturated the sigmoid is
- How confident the neuron already is
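Running the same setup as the ReLU example (x = 2.0, w = 1.5, b = 0.5, y = 4.0) through this sigmoid version gives, approximately:

z = w * x + b                  # 3.5
a = sigmoid(z)                 # ≈ 0.9707
dL_da = 2 * (a - y)            # ≈ -6.059
da_dz = sigmoid_derivative(a)  # ≈ 0.0285
dL_dw = dL_da * da_dz * x      # ≈ -0.345
dL_db = dL_da * da_dz          # ≈ -0.172

Compare this with the ReLU gradients (-2.0 and -1.0): the saturated sigmoid shrinks the gradient by roughly a factor of six, even though the target is far away.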
Why Sigmoid Can Cause Vanishing Gradients
If a is close to 0 or 1:
sigmoid_derivative(a) ≈ 0
Which means:
- Gradients shrink
- Learning slows down
- Deep networks struggle
This is why ReLU dominates modern architectures.
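To make the saturation effect concrete, here is what sigmoid_derivative(a) = a * (1 - a) returns at a few outputs:

sigmoid_derivative(0.5)    # 0.25 (the maximum)
sigmoid_derivative(0.9)    # ≈ 0.09
sigmoid_derivative(0.99)   # ≈ 0.0099
sigmoid_derivative(0.999)  # ≈ 0.000999

Each sigmoid layer a gradient passes through multiplies in another factor like this, which is how gradients vanish as networks get deeper.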
A General Backpropagation Pattern
Every neuron follows this structure:
Loss
  ↑
Activation derivative
  ↑
Linear derivative
  ↑
Weight / bias
Backpropagation is simply applying this pattern repeatedly.
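As a minimal sketch of that pattern (the helper name and signature are illustrative, not part of the series code), here is a single function that runs the forward pass and returns both gradients for any activation whose derivative is expressed in terms of z:

def neuron_backward(x, w, b, y, activation, activation_derivative):
    # forward pass
    z = w * x + b
    a = activation(z)
    # backward pass: loss term, then activation term, then linear term
    dL_da = 2 * (a - y)
    da_dz = activation_derivative(z)
    dL_dw = dL_da * da_dz * x
    dL_db = dL_da * da_dz
    return dL_dw, dL_db

# ReLU neuron from the example above
neuron_backward(2.0, 1.5, 0.5, 4.0, relu, relu_derivative)  # (-2.0, -1.0)

# Sigmoid neuron: wrap the derivative, since sigmoid_derivative expects a, not z
neuron_backward(2.0, 1.5, 0.5, 4.0, sigmoid,
                lambda z: sigmoid_derivative(sigmoid(z)))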
What We Have Completed
At this point, you understand:
- Backpropagation for a neuron
- How activation functions affect gradients
- Why ReLU and Sigmoid behave differently
- Where vanishing gradients come from
This is all the math needed for real neural networks.
What’s Next in the Series
In Article #9, we will:
- Backpropagate through a full layer
- Accumulate gradients for multiple neurons
- Prepare a clean, reusable training loop
- Move from “a neuron learns” to “a network learns”
This is the step where everything scales.
GitHub Code
Activation-aware backpropagation code will be added next:
👉 [link to your GitHub repository]
Series Progress
Neural Networks From Scratch (Pure Python)
✔ Article #1 — What a Neuron Really Computes
✔ Article #2 — Activation Functions
✔ Article #3 — Building a Layer
✔ Article #4 — Forward Propagation
✔ Article #5 — Loss Functions
✔ Article #6 — Gradients Explained
✔ Article #7 — Backpropagation (Single Neuron)
✔ Article #8 — Backpropagation With Activations
➡ Article #9 — Backpropagation Through a Layer