SolveWithPython

Batch and Mini-Batch Training — Removing the Last Loop

In Article #12, we achieved a major milestone:

  • Forward propagation was vectorized
  • Backpropagation was vectorized
  • Gradients were computed with matrix operations
  • The network trained efficiently using NumPy

But there is still one loop left.

This one:

Python
for x_i, y_i in zip(X, y):

In this article, we remove it.

This is the final conceptual step before:

  • GPUs
  • deep learning frameworks
  • industrial-scale training

Why That Last Loop Matters

So far, we have been training the network:

  • One sample at a time
  • Updating weights after every example

This approach is called:

Stochastic Gradient Descent (SGD)

SGD works — but it is not how most real systems train.

The Three Training Modes (Big Picture)

Neural networks are trained using one of three strategies:

1. Stochastic Gradient Descent (SGD)

  • Batch size = 1
  • Very noisy gradients
  • Fast updates, unstable learning

2. Batch Gradient Descent

  • Batch size = full dataset
  • Stable gradients
  • Slow updates, high memory usage

3. Mini-Batch Gradient Descent (⭐ default)

  • Batch size = 16–1024
  • Balance of speed and stability
  • Used in almost all real training setups

This article focuses on batch and mini-batch training.
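
To make the trade-off concrete, here is a tiny sketch (with a hypothetical dataset of 1,000 samples and a mini-batch size of 32) of how many parameter updates each strategy performs in one epoch:

Python
import math

N = 1_000  # hypothetical dataset size, just for illustration

for name, batch_size in [("SGD", 1), ("Full batch", N), ("Mini-batch", 32)]:
    updates = math.ceil(N / batch_size)
    print(f"{name:>10}: batch size {batch_size:>4} -> {updates} parameter updates per epoch")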

The Core Idea of Batch Training

Instead of computing gradients for one sample:

x → forward → loss → backward

We compute gradients for many samples at once:

X_batch → forward → loss → backward

Then:

  • gradients are averaged
  • parameters are updated once per batch

Shapes Change — Math Does Not

Let’s define shapes clearly.

Inputs

  • X: (batch_size, n_features)

Weights

  • W: (n_neurons, n_features)

Biases

  • b: (n_neurons,)

Step 1: Batch Forward Propagation

Previously:

z = W @ x + b

Now, for a batch:

Z = X @ W.T + b

Shapes

  • X: (B, n)
  • W.T: (n, k)
  • Z: (B, k)

Each row of Z corresponds to one sample.
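
For reference, a batch-aware dense_forward consistent with the call sites later in this article might look like the sketch below; the (A, Z) return order is taken from how it is used in the training loop, the rest is an assumption:

Python
def dense_forward(X, W, b, activation):
    # X: (B, n_features), W: (n_neurons, n_features), b: (n_neurons,)
    Z = X @ W.T + b        # pre-activations, shape (B, n_neurons)
    A = activation(Z)      # activations, same shape as Z
    return A, Z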

Batch ReLU

Python
def relu(Z):
    return np.maximum(0, Z)

Because np.maximum is applied element-wise, the same function works on matrices automatically.
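
The training loop later also calls relu_derivative, defined in an earlier article. A batch-friendly sketch of it (assuming the usual convention of treating the derivative at zero as 0) is:

Python
def relu_derivative(Z):
    # 1.0 where the pre-activation is positive, 0.0 elsewhere
    return (Z > 0).astype(float)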

Step 2: Batch Loss

Mean Squared Error over a batch:

Python
def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

This averages error across all samples.

Step 3: Batch Loss Gradient

Python
def mse_derivative(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.shape[0]

Dividing by the batch size averages the gradient across samples, so the size of each update does not depend on how many samples are in the batch.
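
If you want to convince yourself the derivative is right, a finite-difference check on a tiny random batch works well (a sanity-check sketch, not part of the training code):

Python
import numpy as np

rng = np.random.default_rng(0)
y_true = rng.normal(size=(4, 1))
y_pred = rng.normal(size=(4, 1))

analytic = mse_derivative(y_true, y_pred)

# Perturb each prediction slightly and compare with the numerical gradient
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.shape[0]):
    bumped = y_pred.copy()
    bumped[i, 0] += eps
    numeric[i, 0] = (mse(y_true, bumped) - mse(y_true, y_pred)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))  # should print True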

Step 4: Batch Backpropagation (Key Insight)

For a batch, gradients become matrix operations.

Gradient w.r.t. weights

\frac{\partial L}{\partial W} = \frac{1}{B} \cdot \left(\frac{\partial L}{\partial Z}\right)^{T} \cdot X

In the implementation below, the 1/B factor is already applied inside mse_derivative (the division by y_true.shape[0]), so the weight gradient is computed as dZ.T @ X without dividing again.

Vectorized Implementation

Python
def dense_backward_batch(X, Z, dL_dA, W, activation_derivative):
    dZ = dL_dA * activation_derivative(Z)   # chain rule through the activation, shape (B, k)
    dW = dZ.T @ X                           # gradient w.r.t. weights, shape (k, n)
    db = np.sum(dZ, axis=0)                 # gradient w.r.t. biases, shape (k,)
    dX = dZ @ W                             # gradient passed to the previous layer, shape (B, n)
    return dW, db, dX

This replaces:

  • looping over samples
  • looping over neurons
  • manual accumulation
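
To confirm that the matrix version really is the per-sample loop in disguise, here is a small check against explicit accumulation (a verification sketch with random data and the relu_derivative helper from above):

Python
import numpy as np

rng = np.random.default_rng(1)
B, n, k = 8, 3, 4                      # batch size, input features, neurons
X = rng.normal(size=(B, n))
W = rng.normal(size=(k, n))
Z = X @ W.T                            # pre-activations (bias omitted for brevity)
dL_dA = rng.normal(size=(B, k))        # upstream gradient

dW_vec, db_vec, _ = dense_backward_batch(X, Z, dL_dA, W, relu_derivative)

# Same quantities, accumulated one sample at a time
dW_loop = np.zeros_like(W)
db_loop = np.zeros(k)
for x_i, z_i, g_i in zip(X, Z, dL_dA):
    dz_i = g_i * relu_derivative(z_i)
    dW_loop += np.outer(dz_i, x_i)
    db_loop += dz_i

print(np.allclose(dW_vec, dW_loop), np.allclose(db_vec, db_loop))  # should print True True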

Step 5: Full Batch Training Loop
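
The loop below reuses X, y, the parameters W1, b1, W2, b2, and the helpers dense_forward and relu_derivative from the earlier articles. If you are running this standalone, a minimal setup might look like this sketch (hypothetical layer sizes and a toy regression target):

Python
import numpy as np

rng = np.random.default_rng(42)

n_samples, n_features, n_hidden, n_outputs = 200, 3, 8, 1

X = rng.normal(size=(n_samples, n_features))
y = (X.sum(axis=1, keepdims=True) ** 2) * 0.1   # toy nonlinear target, shape (200, 1)

# Small random weights, shaped (n_neurons, n_features) as defined above
W1 = rng.normal(scale=0.1, size=(n_hidden, n_features))
b1 = np.zeros(n_hidden)
W2 = rng.normal(scale=0.1, size=(n_outputs, n_hidden))
b2 = np.zeros(n_outputs)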

Python
learning_rate = 0.01
epochs = 500

for epoch in range(epochs):
    # Forward
    A1, Z1 = dense_forward(X, W1, b1, relu)
    A2, Z2 = dense_forward(A1, W2, b2, lambda z: z)
    loss = mse(y, A2)

    # Backward
    dL_dA2 = mse_derivative(y, A2)
    dW2, db2, dA1 = dense_backward_batch(
        A1, Z2, dL_dA2, W2, lambda z: 1.0
    )
    dW1, db1, _ = dense_backward_batch(
        X, Z1, dA1, W1, relu_derivative
    )

    # Update
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if epoch % 50 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")

No sample loop.

This is true batch gradient descent.

Mini-Batch Training (Practical Default)

In practice:

  • Full batch can be memory-heavy
  • SGD is noisy

So we use mini-batches.

Mini-Batch Loop

Python
batch_size = 16

for epoch in range(epochs):
    for i in range(0, len(X), batch_size):
        X_batch = X[i:i+batch_size]
        y_batch = y[i:i+batch_size]
        # forward, backward, update (same as above)

This is how almost all deep learning training works.
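
Filling in the placeholder comment, one full mini-batch version of the training loop might look like the sketch below. The per-epoch shuffle is a common addition, not something the math requires:

Python
batch_size = 16

for epoch in range(epochs):
    # Shuffle sample order each epoch (common practice)
    perm = np.random.permutation(len(X))
    X_shuffled, y_shuffled = X[perm], y[perm]

    for i in range(0, len(X), batch_size):
        X_batch = X_shuffled[i:i+batch_size]
        y_batch = y_shuffled[i:i+batch_size]

        # Forward
        A1, Z1 = dense_forward(X_batch, W1, b1, relu)
        A2, Z2 = dense_forward(A1, W2, b2, lambda z: z)

        # Backward
        dL_dA2 = mse_derivative(y_batch, A2)
        dW2, db2, dA1 = dense_backward_batch(A1, Z2, dL_dA2, W2, lambda z: 1.0)
        dW1, db1, _ = dense_backward_batch(X_batch, Z1, dA1, W1, relu_derivative)

        # Update (once per mini-batch)
        W2 -= learning_rate * dW2
        b2 -= learning_rate * db2
        W1 -= learning_rate * dW1
        b1 -= learning_rate * db1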

Why Mini-Batches Are So Effective

Mini-batch training:

  • smooths gradients
  • improves generalization
  • fits GPU memory
  • enables parallelism

This is why GPUs matter.

What You Have Achieved So Far

At this point, you have built:

  • A neural network from scratch
  • Vectorized forward propagation
  • Vectorized backpropagation
  • Batch and mini-batch training
  • A training loop that scales

This is the complete core of deep learning.

Frameworks now become optional.

Common Mistakes at This Stage

Mistake 1: Forgetting to average gradients
→ Leads to exploding updates.

Mistake 2: Confusing batch size with epochs
→ One epoch = one pass over the dataset.

Mistake 3: Thinking frameworks add intelligence
→ They add engineering, not math.

What’s Next in Part II

In Article #14, we will:

  • Visualize loss curves
  • Detect overfitting
  • Introduce validation sets
  • Explain why “lower loss” is not always better

This is where training becomes diagnosable, not just runnable.

Series Status

  • Part I — Foundations ✔ Complete
  • Part II — Vectorization & Scaling ▶ Nearly Complete

You now understand neural networks at a level where:

  • frameworks make sense
  • debugging is possible
  • intuition matches code