SolveWithPython

Backpropagation Through a Layer — How Neural Networks Learn at Scale

So far, we have done something crucial:

  • We trained a single neuron
  • We computed gradients explicitly
  • We included activation functions in backpropagation

This already covers the entire mathematical foundation of neural networks.

Now we take the next step:

How does backpropagation work when a layer has many neurons?

The answer is reassuringly simple.

A layer does not introduce new math.
It just repeats the same math many times.

What Changes When We Move From a Neuron to a Layer?

Recall what a dense layer is:

  • Multiple neurons
  • Same input vector
  • Different weights and biases
  • Independent activations

Each neuron:

  • Produces its own output
  • Contributes to the next layer
  • Has its own gradients

Backpropagation through a layer means:

Compute gradients per neuron, then aggregate them.

The Forward Computation (Layer Recap)

For a dense layer with $k$ neurons:

$$z_i = \sum_j w_{ij} x_j + b_i, \qquad a_i = f(z_i)$$

The layer output is the vector:

$$\mathbf{a} = [a_1, a_2, \dots, a_k]$$

Nothing new here.
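
As a quick sanity check, here is a tiny worked example with made-up numbers, two inputs, one neuron, and ReLU as the assumed activation:

$$x = [1, 2], \qquad w_1 = [0.5, -0.25], \qquad b_1 = 0.1$$

$$z_1 = 0.5 \cdot 1 + (-0.25) \cdot 2 + 0.1 = 0.1, \qquad a_1 = \max(0, 0.1) = 0.1$$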

What Gradients Do We Need for a Layer?

For each neuron $i$, we need:

  • $\frac{\partial L}{\partial w_{ij}}$ for every weight
  • $\frac{\partial L}{\partial b_i}$ for every bias
  • $\frac{\partial L}{\partial x_j}$ to pass backward to the previous layer

That last one is important.

Why Inputs Need Gradients Too

In a multi-layer network, the “input” to one layer is the “output” of the previous layer.

So during backpropagation:

  • Each layer must return gradients with respect to its inputs
  • Those gradients become the upstream signal for the layer before it

This is how gradients flow through the network.
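
In symbols (using a layer superscript, which is notation introduced here only for clarity): if the input to layer $\ell$ is the previous layer's activation, $x_j^{(\ell)} = a_j^{(\ell-1)}$, then the input gradients computed by layer $\ell$ are exactly the upstream gradients needed by layer $\ell-1$:

$$\frac{\partial L}{\partial a_j^{(\ell-1)}} = \frac{\partial L}{\partial x_j^{(\ell)}}$$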

Step-by-Step: Backpropagation for One Neuron in a Layer

For neuron $i$:

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a_i} \cdot \frac{\partial a_i}{\partial z_i} \cdot \frac{\partial z_i}{\partial w_{ij}}$$

Where:

  • $\frac{\partial z_i}{\partial w_{ij}} = x_j$
  • $\frac{\partial z_i}{\partial b_i} = 1$

This is exactly what we already did — just indexed.
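
It helps to name the shared part of this chain once. The quantity below is what the code later stores as dL_dz, and the weight and bias gradients both fall out of it:

$$\frac{\partial L}{\partial z_i} = \frac{\partial L}{\partial a_i} \cdot f'(z_i), \qquad \frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial z_i} \cdot x_j, \qquad \frac{\partial L}{\partial b_i} = \frac{\partial L}{\partial z_i}$$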

Implementing Layer Backpropagation in Python

Let’s assume:

  • One dense layer
  • ReLU activation
  • Incoming gradient from the next layer: dL_da_list

Forward Cache (Needed for Backprop)

Python
def dense_forward(inputs, weights_list, bias_list, activation):
    z_list = []
    a_list = []
    for weights, bias in zip(weights_list, bias_list):
        # Weighted sum for this neuron
        z = sum(x * w for x, w in zip(inputs, weights)) + bias
        # Activation output
        a = activation(z)
        z_list.append(z)
        a_list.append(a)
    return a_list, z_list

We store z_list because activation derivatives need it.
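
For example, a call might look like this. The relu helper and all the numbers are purely illustrative, not part of the layer code itself:

Python
# Illustrative forward pass: 3 inputs, 2 neurons, ReLU activation
def relu(z):
    return max(0.0, z)

inputs = [1.0, 2.0, 3.0]
weights_list = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
bias_list = [0.05, -0.2]

a_list, z_list = dense_forward(inputs, weights_list, bias_list, relu)
print(a_list)  # activations, one per neuron
print(z_list)  # pre-activations, cached for the backward pass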

Backward Pass for the Layer

Python
def dense_backward(inputs, z_list, dL_da_list, weights_list, activation_derivative):
    dL_dw = []
    dL_db = []
    dL_dx = [0.0 for _ in inputs]
    for i in range(len(weights_list)):
        # Local gradient of the activation: da_i/dz_i
        da_dz = activation_derivative(z_list[i])
        # Chain rule: dL/dz_i = dL/da_i * da_i/dz_i
        dL_dz = dL_da_list[i] * da_dz
        # Gradients for weights and bias
        neuron_dw = []
        for j in range(len(inputs)):
            neuron_dw.append(dL_dz * inputs[j])
            # Accumulate this neuron's contribution to the input gradient
            dL_dx[j] += dL_dz * weights_list[i][j]
        dL_dw.append(neuron_dw)
        dL_db.append(dL_dz)
    return dL_dw, dL_db, dL_dx

What This Code Is Doing

For each neuron:

  • Compute its local gradient
  • Compute gradients for its weights
  • Compute gradient for its bias

For the layer:

  • Sum contributions to dL_dx
  • Return gradients upstream

This aggregation is the key idea.
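
Continuing the same illustrative setup, a backward pass might look like this. The relu_derivative helper and the upstream gradient values are assumptions made only for the demo:

Python
# Illustrative backward pass through the same 3-input, 2-neuron layer
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

inputs = [1.0, 2.0, 3.0]
weights_list = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
bias_list = [0.05, -0.2]

a_list, z_list = dense_forward(inputs, weights_list, bias_list, relu)

# Pretend the next layer sent back these gradients dL/da_i
dL_da_list = [0.1, -0.4]

dL_dw, dL_db, dL_dx = dense_backward(inputs, z_list, dL_da_list,
                                     weights_list, relu_derivative)
print(dL_dw)  # same shape as weights_list: one gradient per weight
print(dL_db)  # one gradient per bias
print(dL_dx)  # one gradient per input, summed over both neurons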

Why dL_dx Is a Sum

Each input affects every neuron in the layer.

So the total gradient with respect to an input is the sum of all paths through which it influences the loss.

This is the chain rule applied at scale.
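
In formula form, the line dL_dx[j] += dL_dz * weights_list[i][j] in the code above accumulates exactly this sum over all neurons $i$ in the layer:

$$\frac{\partial L}{\partial x_j} = \sum_i \frac{\partial L}{\partial a_i} \cdot f'(z_i) \cdot w_{ij} = \sum_i \frac{\partial L}{\partial z_i} \cdot w_{ij}$$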

Updating the Layer Parameters

Python
def update_layer(weights_list, bias_list, dL_dw, dL_db, learning_rate):
    for i in range(len(weights_list)):
        for j in range(len(weights_list[i])):
            # Gradient descent step for each weight
            weights_list[i][j] -= learning_rate * dL_dw[i][j]
        # Gradient descent step for the bias
        bias_list[i] -= learning_rate * dL_db[i]

This completes one learning step for the layer.
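
To see all three functions working together, here is a minimal sketch of one complete step. It assumes ReLU, a squared-error loss against a made-up target vector, and the dense_forward, dense_backward, and update_layer functions defined above; none of the specific numbers matter:

Python
# One complete learning step for the layer:
# forward pass -> loss gradient -> backward pass -> parameter update
def relu(z):
    return max(0.0, z)

def relu_derivative(z):
    return 1.0 if z > 0 else 0.0

inputs = [1.0, 2.0, 3.0]
weights_list = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.1]]
bias_list = [0.05, -0.2]
targets = [1.0, 0.0]   # made-up targets for a squared-error loss
learning_rate = 0.05

# Forward pass and loss before the update
a_list, z_list = dense_forward(inputs, weights_list, bias_list, relu)
loss_before = sum((a - t) ** 2 for a, t in zip(a_list, targets))

# dL/da_i for L = sum_i (a_i - t_i)^2
dL_da_list = [2.0 * (a - t) for a, t in zip(a_list, targets)]

# Backward pass and gradient descent update
dL_dw, dL_db, dL_dx = dense_backward(inputs, z_list, dL_da_list,
                                     weights_list, relu_derivative)
update_layer(weights_list, bias_list, dL_dw, dL_db, learning_rate)

# Forward pass again: the loss should be lower than before
a_list, _ = dense_forward(inputs, weights_list, bias_list, relu)
loss_after = sum((a - t) ** 2 for a, t in zip(a_list, targets))
print(loss_before, loss_after)

Repeating this step in a loop over many examples is exactly what the full training loop in the next article will do.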

What We Have Built Now

At this point, you understand:

  • Backpropagation for a neuron
  • Backpropagation with activation functions
  • Backpropagation through a full dense layer
  • How gradients flow backward between layers

This is the full engine of learning in neural networks.

Everything else is:

  • Vectorization
  • Performance optimization
  • Engineering convenience

Common Beginner Misconceptions

Mistake 1: Thinking layers require new math
→ They don’t. Just repetition and aggregation.

Mistake 2: Forgetting input gradients
→ Without them, networks cannot stack layers.

Mistake 3: Thinking frameworks do something different
→ They do exactly this, just faster.

What’s Next in the Series

In Article #10, we will:

  • Combine everything into a full training loop
  • Train a multi-layer neural network end to end
  • Watch loss decrease over epochs
  • See learning happen in real time

This is where all pieces finally come together.

GitHub Code

Layer-level backpropagation code will be added to the repository:

👉 [link to your GitHub repository]

Series Progress

Neural Networks From Scratch (Pure Python)
✔ Article #1 — What a Neuron Really Computes
✔ Article #2 — Activation Functions
✔ Article #3 — Building a Layer
✔ Article #4 — Forward Propagation
✔ Article #5 — Loss Functions
✔ Article #6 — Gradients Explained
✔ Article #7 — Backpropagation (Single Neuron)
✔ Article #8 — Backpropagation With Activations
✔ Article #9 — Backpropagation Through a Layer
➡ Article #10 — Training a Neural Network End to End