SolveWithPython

Backpropagation Through a Layer — How Neural Networks Learn at Scale

So far, we have done something crucial:

  • We trained a single neuron
  • We computed gradients explicitly
  • We included activation functions in backpropagation

This already covers the entire mathematical foundation of neural networks.

Now we take the next step:

How does backpropagation work when a layer has many neurons?

The answer is reassuringly simple.

A layer does not introduce new math.
It just repeats the same math many times.

What Changes When We Move From a Neuron to a Layer?

Recall what a dense layer is:

  • Multiple neurons
  • Same input vector
  • Different weights and biases
  • Independent activations

Each neuron:

  • Produces its own output
  • Contributes to the next layer
  • Has its own gradients

Backpropagation through a layer means:

Compute gradients per neuron, then aggregate them.

The Forward Computation (Layer Recap)

For a dense layer with $k$ neurons:

$$z_i = \sum_j w_{ij} x_j + b_i, \qquad a_i = f(z_i)$$

The layer output is the vector:

$$\mathbf{a} = [a_1, a_2, \dots, a_k]$$

Nothing new here.
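
As a quick sanity check, here is a tiny worked example with made-up numbers, two inputs, one neuron, and ReLU as the assumed activation:

$$x = [1, 2], \qquad w_1 = [0.5, -0.25], \qquad b_1 = 0.1$$

$$z_1 = 0.5 \cdot 1 + (-0.25) \cdot 2 + 0.1 = 0.1, \qquad a_1 = \max(0, 0.1) = 0.1$$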

What Gradients Do We Need for a Layer?

For each neuron $i$, we need:

  • $\frac{\partial L}{\partial w_{ij}}$ for every weight
  • $\frac{\partial L}{\partial b_i}$ for every bias
  • $\frac{\partial L}{\partial x_j}$ to pass backward to the previous layer

That last one is important.

Why Inputs Need Gradients Too

In a multi-layer network, the “input” to one layer is the “output” of the previous layer.

So during backpropagation:

  • Each layer must return gradients with respect to its inputs
  • Those gradients become the upstream signal for the layer before it

This is how gradients flow through the network.
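
In symbols (using a layer superscript, which is notation introduced here only for clarity): if the input to layer $\ell$ is the previous layer's activation, $x_j^{(\ell)} = a_j^{(\ell-1)}$, then the input gradients computed by layer $\ell$ are exactly the upstream gradients needed by layer $\ell-1$:

$$\frac{\partial L}{\partial a_j^{(\ell-1)}} = \frac{\partial L}{\partial x_j^{(\ell)}}$$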

Step-by-Step: Backpropagation for One Neuron in a Layer

For neuron $i$:

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a_i} \cdot \frac{\partial a_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_{ij}}$$

Where:

  • $\frac{\partial z_i}{\partial w_{ij}} = x_j$
  • $\frac{\partial z_i}{\partial b_i} = 1$

This is exactly what we already did — just indexed.
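
It helps to name the shared part of this chain once. The quantity below is what the code later stores as dL_dz, and the weight and bias gradients both fall out of it:

$$\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial a_i} \cdot f'(z_i), \qquad \frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial z_i} \cdot x_j, \qquad \frac{\partial L}{\partial b_i} = \frac{\partial L}{\partial z_i}$$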

Implementing Layer Backpropagation in Python

Let’s assume:

  • One dense layer
  • ReLU activation
  • Incoming gradient from the next layer: dL_da_list

Forward Cache (Needed for Backprop)

Python
def dense_forward(inputs, weights_list, bias_list, activation):
    z_list = []
    a_list = []
    for weights, bias in zip(weights_list, bias_list):
        # Weighted sum for this neuron
        z = sum(x * w for x, w in zip(inputs, weights)) + bias
        # Activation output
        a = activation(z)
        z_list.append(z)
        a_list.append(a)
    return a_list, z_list

We store z_list because activation derivatives need it.
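
For example, a call might look like this. The relu helper and all the numbers are purely illustrative, not part of the layer code itself:

Python
# Illustrative forward pass: 3 inputs, 2 neurons, ReLU activation
def relu(z):
    return max(0.0, z)

inputs = [1.0, 2.0, 3.0]
weights_list = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
bias_list = [0.05, -0.2]

a_list, z_list = dense_forward(inputs, weights_list, bias_list, relu)
print(a_list)  # activations, one per neuron
print(z_list)  # pre-activations, cached for the backward pass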

Backward Pass for the Layer

Python
def dense_backward(inputs, z_list, dL_da_list, weights_list, activation_derivative):
    dL_dw = []
    dL_db = []
    dL_dx = [0.0 for _ in inputs]
    for i in range(len(weights_list)):
        # Local gradient of the activation: da_i/dz_i
        da_dz = activation_derivative(z_list[i])
        # Chain rule: dL/dz_i = dL/da_i * da_i/dz_i
        dL_dz = dL_da_list[i] * da_dz
        # Gradients for weights and bias
        neuron_dw = []
        for j in range(len(inputs)):
            neuron_dw.append(dL_dz * inputs[j])
            # Accumulate this neuron's contribution to the input gradient
            dL_dx[j] += dL_dz * weights_list[i][j]
        dL_dw.append(neuron_dw)
        dL_db.append(dL_dz)
    return dL_dw, dL_db, dL_dx

What This Code Is Doing

For each neuron:

  • Compute its local gradient
  • Compute gradients for its weights
  • Compute gradient for its bias

For the layer:

  • Sum contributions to dL_dx
  • Return gradients upstream

This aggregation is the key idea.
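
Continuing the same illustrative setup, a backward pass might look like this. The relu_derivative helper and the upstream gradient values are assumptions made only for the demo:

Python
# Illustrative backward pass through the same 3-input, 2-neuron layer
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

inputs = [1.0, 2.0, 3.0]
weights_list = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
bias_list = [0.05, -0.2]

a_list, z_list = dense_forward(inputs, weights_list, bias_list, relu)

# Pretend the next layer sent back these gradients dL/da_i
dL_da_list = [0.1, -0.4]

dL_dw, dL_db, dL_dx = dense_backward(inputs, z_list, dL_da_list,
                                     weights_list, relu_derivative)
print(dL_dw)  # same shape as weights_list: one gradient per weight
print(dL_db)  # one gradient per bias
print(dL_dx)  # one gradient per input, summed over both neurons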

Why dL_dx Is a Sum

Each input affects every neuron in the layer.

So the total gradient with respect to an input is the sum of all paths through which it influences the loss.

This is the chain rule applied at scale.
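
In formula form, the line dL_dx[j] += dL_dz * weights_list[i][j] in the code above accumulates exactly this sum over all neurons $i$ in the layer:

$$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial a_i} \cdot f'(z_i) \cdot w_{ij} = \sum_i \frac{\partial L}{\partial z_i} \cdot w_{ij}$$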

Updating the Layer Parameters

Python
def update_layer(weights_list, bias_list, dL_dw, dL_db, learning_rate):
    for i in range(len(weights_list)):
        for j in range(len(weights_list[i])):
            # Gradient descent step for each weight
            weights_list[i][j] -= learning_rate * dL_dw[i][j]
        # Gradient descent step for the bias
        bias_list[i] -= learning_rate * dL_db[i]

This completes one learning step for the layer.
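
To see all three functions working together, here is a minimal sketch of one complete step. It assumes ReLU, a squared-error loss against a made-up target vector, and the dense_forward, dense_backward, and update_layer functions defined above; none of the specific numbers matter:

Python
# One complete learning step for the layer:
# forward pass -> loss gradient -> backward pass -> parameter update
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

inputs = [1.0, 2.0, 3.0]
weights_list = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
bias_list = [0.05, -0.2]
targets = [1.0, 0.0]   # made-up targets for a squared-error loss
learning_rate = 0.05

# Forward pass and loss before the update
a_list, z_list = dense_forward(inputs, weights_list, bias_list, relu)
loss_before = sum((a - t) ** 2 for a, t in zip(a_list, targets))

# dL/da_i for L = sum_i (a_i - t_i)^2
dL_da_list = [2.0 * (a - t) for a, t in zip(a_list, targets)]

# Backward pass and gradient descent update
dL_dw, dL_db, dL_dx = dense_backward(inputs, z_list, dL_da_list,
                                     weights_list, relu_derivative)
update_layer(weights_list, bias_list, dL_dw, dL_db, learning_rate)

# Forward pass again: the loss should be lower than before
a_list, _ = dense_forward(inputs, weights_list, bias_list, relu)
loss_after = sum((a - t) ** 2 for a, t in zip(a_list, targets))
print(loss_before, loss_after)

Repeating this step in a loop over many examples is exactly what the full training loop in the next article will do.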

What We Have Built Now

At this point, you understand:

  • Backpropagation for a neuron
  • Backpropagation with activation functions
  • Backpropagation through a full dense layer
  • How gradients flow backward between layers

This is the full engine of learning in neural networks.

Everything else is:

  • Vectorization
  • Performance optimization
  • Engineering convenience

Common Beginner Misconceptions

Mistake 1: Thinking layers require new math
→ They don’t. Just repetition and aggregation.

Mistake 2: Forgetting input gradients
→ Without them, networks cannot stack layers.

Mistake 3: Thinking frameworks do something different
→ They do exactly this, just faster.

What’s Next in the Series

In Article #10, we will:

  • Combine everything into a full training loop
  • Train a multi-layer neural network end to end
  • Watch loss decrease over epochs
  • See learning happen in real time

This is where all pieces finally come together.

GitHub Code

Layer-level backpropagation code will be added to the repository:

👉 [link to your GitHub repository]

Series Progress

Neural Networks From Scratch (Pure Python)
✔ Article #1 — What a Neuron Really Computes
✔ Article #2 — Activation Functions
✔ Article #3 — Building a Layer
✔ Article #4 — Forward Propagation
✔ Article #5 — Loss Functions
✔ Article #6 — Gradients Explained
✔ Article #7 — Backpropagation (Single Neuron)
✔ Article #8 — Backpropagation With Activations
✔ Article #9 — Backpropagation Through a Layer
➡ Article #10 — Training a Neural Network End to End