SolveWithPython

Vectorized Backpropagation With NumPy — Training at Real Scale

In Article #11, we made a critical shift:

  • We stopped looping over neurons in Python
  • We expressed the same math using NumPy arrays
  • We vectorized the forward pass

That already gave us a major speed boost.

But a neural network does not learn during the forward pass.

It learns during backpropagation.

In this article, we complete the transition by:

  • Vectorizing backpropagation
  • Vectorizing gradient computation
  • Vectorizing the full training loop

This is the point where your neural network stops being a teaching example and starts behaving like a real system.

The Key Idea (Before Any Code)

Vectorized backpropagation is not new math.

It is the same chain rule, expressed with matrices instead of loops.

If you understand Part I, you already understand this article.
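Concretely, for a single dense layer with z = W·x + b and a = σ(z), the chain rule in vector form reads:

  • dL/dz = dL/da ⊙ σ′(z) (elementwise product)
  • dL/dW = outer(dL/dz, x)
  • dL/db = dL/dz
  • dL/dx = Wᵀ · dL/dz

Step 4 below is a direct transcription of these four lines into NumPy.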

Recap: Shapes Matter More Than Code

We will work with a simple network:

Input (n) → Dense (k) → Output (m)

Shapes:

  • Inputs: (n,)
  • Weights: (k, n)
  • Biases: (k,)
  • Activations: (k,)

Gradients will follow the same shapes.

If the shapes make sense, the code will too.
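As a quick sanity check, here is a minimal sketch (with hypothetical values for n = 2 and k = 3) showing that a (k, n) weight matrix applied to an (n,) input produces a (k,) output:

Python
import numpy as np

x = np.ones(2)            # inputs, shape (n,) with n = 2
W = np.ones((3, 2))       # weights, shape (k, n) with k = 3
b = np.zeros(3)           # biases, shape (k,)

print((W @ x + b).shape)  # (3,) -- one value per neuron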

Step 1: Forward Pass (Vectorized)

Python
import numpy as np

def dense_forward(inputs, weights, biases, activation):
    z = weights @ inputs + biases   # (k, n) @ (n,) + (k,) -> (k,)
    a = activation(z)               # elementwise activation, shape (k,)
    return a, z

The single line z = weights @ inputs + biases replaces:

  • nested loops
  • per-neuron computation
  • manual accumulation
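For example, a single call evaluates a whole hypothetical layer at once (np.tanh stands in for the activation here, and the values are arbitrary):

Python
x = np.array([1.0, 2.0])       # inputs, shape (2,)
W = np.random.randn(3, 2)      # weights, shape (3, 2)
b = np.zeros(3)                # biases, shape (3,)

a, z = dense_forward(x, W, b, np.tanh)
print(z)                       # three pre-activations, shape (3,)
print(a)                       # three activations, shape (3,)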

Step 2: Vectorized Activation Derivatives

ReLU

Python
def relu(z):
    return np.maximum(0, z)          # elementwise max(0, z)

def relu_derivative(z):
    return (z > 0).astype(float)     # 1.0 where z > 0, otherwise 0.0

This returns a vector of 0s and 1s.
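For instance, applying both functions (as defined above) to a small hypothetical vector:

Python
z = np.array([-1.0, 0.0, 2.0])
print(relu(z))               # [0. 0. 2.]
print(relu_derivative(z))    # [0. 0. 1.]  (the derivative at z = 0 is taken as 0 here)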

Step 3: Vectorized Loss and Its Gradient

We use Mean Squared Error for clarity.

Python
def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def mse_derivative(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.size
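A quick hypothetical check of both functions:

Python
y_true = np.array([1.0, 2.0])
y_pred = np.array([1.5, 1.0])

print(mse(y_true, y_pred))             # 0.625
print(mse_derivative(y_true, y_pred))  # [ 0.5 -1. ]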

Step 4: Vectorized Backpropagation for One Layer

This is the core of the article.

Assume:

  • dL_da arrives from the next layer
  • We cached z and inputs during the forward pass

Backward Pass (Vectorized)

Python
def dense_backward(inputs, z, dL_da, weights, activation_derivative):
    dL_dz = dL_da * activation_derivative(z)   # shape (k,)
    dL_dw = np.outer(dL_dz, inputs)            # shape (k, n), same as weights
    dL_db = dL_dz                              # shape (k,)
    dL_dx = weights.T @ dL_dz                  # shape (n,), flows to the previous layer
    return dL_dw, dL_db, dL_dx

Why This Works (Important)

This line:

dL_dw = np.outer(dL_dz, inputs)

is exactly the vectorized form of this nested loop:

Python
for i in range(len(dL_dz)):          # one row per neuron
    for j in range(len(inputs)):     # one column per input
        dL_dw[i][j] = dL_dz[i] * inputs[j]

Same math.
One operation.
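A tiny numeric example of np.outer with hypothetical values makes the shape explicit:

Python
dL_dz = np.array([1.0, 2.0, 3.0])    # shape (k,) = (3,)
inputs = np.array([10.0, 20.0])      # shape (n,) = (2,)

print(np.outer(dL_dz, inputs))
# [[10. 20.]
#  [20. 40.]
#  [30. 60.]]   shape (3, 2): one row per neuron, one column per input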

Step 5: Vectorized Parameter Update

Python
def update_layer(weights, biases, dL_dw, dL_db, lr):
    weights -= lr * dL_dw   # in-place: the caller's array is modified
    biases -= lr * dL_db    # in-place, so nothing needs to be returned

No loops.
No indexing.
Just linear algebra.

Step 6: Building a Fully Vectorized Network

We will train:

2 → 3 → 1 (two inputs, three hidden neurons, one output)

Initialization

Python
np.random.seed(0)
W1 = np.random.randn(3, 2)   # hidden layer weights, shape (3, 2)
b1 = np.zeros(3)             # hidden layer biases, shape (3,)
W2 = np.random.randn(1, 3)   # output layer weights, shape (1, 3)
b2 = np.zeros(1)             # output layer bias, shape (1,)

Step 7: Vectorized Training Loop

Python
learning_rate = 0.01
epochs = 1000

X = np.array([
    [1.0, 2.0],
    [2.0, 1.0],
    [3.0, 1.0],
    [1.0, 3.0]
])
y = np.array([4.0, 3.0, 5.0, 5.0])

for epoch in range(epochs):
    total_loss = 0.0
    for x_i, y_i in zip(X, y):
        # Forward
        a1, z1 = dense_forward(x_i, W1, b1, relu)
        a2, z2 = dense_forward(a1, W2, b2, lambda z: z)   # linear (identity) output layer
        loss = mse(y_i, a2)
        total_loss += loss

        # Backward
        dL_da2 = mse_derivative(y_i, a2)
        dW2, db2, dL_da1 = dense_backward(
            a1, z2, dL_da2, W2, lambda z: 1.0             # the identity's derivative is 1
        )
        dW1, db1, _ = dense_backward(
            x_i, z1, dL_da1, W1, relu_derivative
        )

        # Update
        update_layer(W2, b2, dW2, db2, learning_rate)
        update_layer(W1, b1, dW1, db1, learning_rate)

    if epoch % 100 == 0:
        print(f"Epoch {epoch}, Loss: {total_loss:.4f}")
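After training, the same forward functions produce predictions. A minimal inference sketch on a hypothetical new input:

Python
x_new = np.array([2.0, 2.0])                             # hypothetical unseen input, shape (2,)
a1, _ = dense_forward(x_new, W1, b1, relu)               # hidden layer
prediction, _ = dense_forward(a1, W2, b2, lambda z: z)   # linear output layer
print(prediction)                                        # shape (1,)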

What You Have Achieved

At this point, you have:

  • Fully vectorized forward propagation
  • Fully vectorized backpropagation
  • A NumPy-based training loop
  • Code that scales far beyond toy examples

This is exactly what deep learning frameworks do internally.

Why This Is a Major Milestone

You can now:

  • Read PyTorch code and understand it
  • Debug gradient issues intelligently
  • Reason about performance bottlenecks
  • Scale models without losing intuition

Frameworks are no longer a crutch.

They are a convenience.

Common Mistakes at This Stage

Mistake 1: Losing track of shapes
→ Always write shapes in comments.

Mistake 2: Over-optimizing too early
→ Correctness first, speed second.

Mistake 3: Treating NumPy as magic
→ It’s just vectorized math.

What’s Next in Part II

In Article #13, we will:

  • Vectorize training over entire batches
  • Remove the inner loop over samples
  • Introduce mini-batch gradient descent
  • Prepare for GPU-style computation

This is the final step before frameworks.

Series Status

You are now firmly in real neural network territory.