SolveWithPython

Backpropagation Step by Step — Computing Gradients for a Single Neuron

In the previous article, we introduced gradients and answered a fundamental question:

How does a change in a weight affect the loss?

We saw that the answer is a derivative, and that learning requires computing these derivatives systematically.

In this article, we will do exactly that.

We will:

  • Take a single neuron
  • Walk through forward computation
  • Compute the loss
  • Derive gradients step by step
  • Implement the math directly in Python

This is the moment where neural networks stop being conceptual—and start learning.

The Simplest Learnable Setup

We will work with the smallest meaningful system:

  • One input
  • One weight
  • One bias
  • One activation
  • One loss function

No layers. No loops. No shortcuts.

Step 1: Define the Forward Computation

Our neuron:

z = w · x + b

Activation (identity, for now):

a = z

Loss (Mean Squared Error):

L = (a - y)^2

This setup removes distractions so we can focus on gradients.

Step 2: Concrete Values

Let’s assign real numbers:

Python
x = 2.0 # input
w = 1.5 # weight
b = 0.5 # bias
y = 4.0 # target

Forward Pass

Python
z = w * x + b # 1.5 * 2.0 + 0.5 = 3.5
a = z
loss = (a - y) ** 2 # (3.5 - 4.0)^2 = 0.25

The network is wrong — but not by much.

Now comes the important part.

Step 3: What Gradients Do We Need?

To update the parameters, we need:

∂L/∂w and ∂L/∂b

We compute these using the chain rule.

Step 4: Apply the Chain Rule

Recall the dependency chain:

  • Loss depends on a
  • a depends on z
  • z depends on w and b

So:

∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w

And similarly for the bias:

∂L/∂b = ∂L/∂a · ∂a/∂z · ∂z/∂b

Step 5: Compute Each Derivative

1. Loss with respect to activation

L = (a - y)^2

∂L/∂a = 2(a - y)

Python
dL_da = 2 * (a - y) # 2 * (3.5 - 4.0) = -1.0

2. Activation with respect to z

Since a = z:

∂a/∂z = 1

Python
da_dz = 1.0

3. z with respect to weight

z = w · x + b

∂z/∂w = x

Python
dz_dw = x # 2.0

4. z with respect to bias

∂z/∂b = 1

Python
dz_db = 1.0

Step 6: Combine Gradients

Gradient with respect to weight

Python
dL_dw = dL_da * da_dz * dz_dw
# -1.0 * 1.0 * 2.0 = -2.0

Gradient with respect to bias

Python
dL_db = dL_da * da_dz * dz_db
# -1.0 * 1.0 * 1.0 = -1.0

What These Numbers Mean

  • dL_dw = -2.0 → increasing w will reduce loss
  • dL_db = -1.0 → increasing b will reduce loss

Gradients do not update anything yet.

They only point in the right direction.
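Before using the gradients, we can sanity-check them numerically with a central finite difference (a quick sketch; the helper `loss_fn` and the step size `eps` are illustrative, not part of the article's code):

```python
# Numerically approximate dL/dw and dL/db with central differences
x, w, b, y = 2.0, 1.5, 0.5, 4.0

def loss_fn(w, b):
    a = w * x + b          # identity activation: a = z
    return (a - y) ** 2    # mean squared error

eps = 1e-6
dL_dw_numeric = (loss_fn(w + eps, b) - loss_fn(w - eps, b)) / (2 * eps)
dL_db_numeric = (loss_fn(w, b + eps) - loss_fn(w, b - eps)) / (2 * eps)

print(dL_dw_numeric)  # ≈ -2.0, matching the analytic gradient
print(dL_db_numeric)  # ≈ -1.0
```

If the numeric and analytic values disagree, the chain-rule bookkeeping has a bug somewhere.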

Step 7: Update the Parameters (Gradient Descent)

Now we apply a learning rate.

Python
learning_rate = 0.1
w = w - learning_rate * dL_dw
b = b - learning_rate * dL_db

New values:

Python
w = 1.5 - 0.1 * (-2.0)  # = 1.7
b = 0.5 - 0.1 * (-1.0)  # = 0.6

Step 8: Forward Pass Again

Python
z = w * x + b # 1.7 * 2.0 + 0.6 = 4.0
a = z
loss = (a - y) ** 2 # (4.0 - 4.0)^2 = 0.0

The loss is now zero.

The neuron has learned.

This Is Backpropagation

All that happened was:

  1. Forward computation
  2. Loss calculation
  3. Derivative calculation
  4. Parameter update

Backpropagation is not magic.

It is organized calculus applied repeatedly.
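The four steps above can be packed into one short loop (a sketch reusing the article's numbers; the step count of 5 is an arbitrary choice):

```python
x, y = 2.0, 4.0            # input and target
w, b = 1.5, 0.5            # initial parameters
learning_rate = 0.1

for step in range(5):
    # 1. forward computation
    z = w * x + b
    a = z                  # identity activation
    # 2. loss calculation
    loss = (a - y) ** 2
    # 3. derivative calculation (chain rule)
    dL_dw = 2 * (a - y) * 1.0 * x
    dL_db = 2 * (a - y) * 1.0 * 1.0
    # 4. parameter update
    w -= learning_rate * dL_dw
    b -= learning_rate * dL_db
    print(step, round(loss, 4), round(w, 4), round(b, 4))
```

With these particular numbers the loss reaches zero after a single update, so later iterations leave the parameters unchanged.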

Where Activation Functions Fit In

In real networks:

  • a ≠ z
  • Activations introduce an extra derivative term

For example:

  • ReLU → derivative is 0 or 1
  • Sigmoid → derivative depends on output value

We will add these next.
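As a preview of that extra term, the two derivatives might be sketched like this (illustrative helper names such as `relu_grad` and `sigmoid_grad`, not code from this series):

```python
import math

def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # derivative of ReLU: 0 for negative inputs, 1 for positive
    return 1.0 if z > 0 else 0.0

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    # derivative depends on the output value: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

print(relu_grad(3.5))     # 1.0
print(sigmoid_grad(0.0))  # 0.25
```

Each of these would multiply into the chain as the ∂a/∂z factor that was simply 1 in this article.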

Common Beginner Misconceptions

Mistake 1: Thinking backpropagation is one formula
→ It is a process, not a single equation.

Mistake 2: Confusing gradients with updates
→ Gradients describe change; learning rate applies it.

Mistake 3: Thinking frameworks do something different
→ They do exactly this—just faster and in bulk.

What We Have Achieved So Far

At this point, you understand:

  • Forward propagation
  • Loss functions
  • Gradients
  • Backpropagation for a neuron
  • Gradient descent updates

This is the core of all neural networks.

Everything else is scale.

What’s Next in the Series

In Article #8, we will:

  • Add activation functions to backpropagation
  • Compute ReLU and Sigmoid derivatives
  • Backpropagate through a full neuron with activation
  • Prepare for multi-neuron layers

This is where the math becomes reusable.

GitHub Code

This article’s code will be added as a standalone, readable example:

👉 [link to your GitHub repository]

Series Progress

Neural Networks From Scratch (Pure Python)
✔ Article #1 — What a Neuron Really Computes
✔ Article #2 — Activation Functions
✔ Article #3 — Building a Layer
✔ Article #4 — Forward Propagation
✔ Article #5 — Loss Functions
✔ Article #6 — Gradients Explained
✔ Article #7 — Backpropagation Step by Step
➡ Article #8 — Backpropagation with Activation Functions