In Article #12, we achieved a major milestone:
- Forward propagation was vectorized
- Backpropagation was vectorized
- Gradients were computed with matrix operations
- The network trained efficiently using NumPy
But there is still one loop left.
This one:
for x_i, y_i in zip(X, y):
In this article, we remove it.
This is the final conceptual step before:
- GPUs
- deep learning frameworks
- industrial-scale training
Why That Last Loop Matters
So far, we have been training the network:
- One sample at a time
- Updating weights after every example
This approach is called:
Stochastic Gradient Descent (SGD)
SGD works — but it is not how most real systems train.
The Three Training Modes (Big Picture)
Neural networks are trained using one of three strategies:
1. Stochastic Gradient Descent (SGD)
- Batch size = 1
- Very noisy gradients
- Fast updates, unstable learning
2. Batch Gradient Descent
- Batch size = full dataset
- Stable gradients
- Slow updates, high memory usage
3. Mini-Batch Gradient Descent (⭐ default)
- Batch size = 16–1024
- Balance of speed and stability
- Used in almost all real training setups
This article focuses on batch and mini-batch training.
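The difference between the three modes is just one number. A minimal sketch, assuming a toy dataset and a hypothetical iterate_batches helper (neither is part of the series code):

import numpy as np

def iterate_batches(X, y, batch_size):
    # Yield consecutive slices of the dataset, batch_size rows at a time.
    for i in range(0, len(X), batch_size):
        yield X[i:i + batch_size], y[i:i + batch_size]

X = np.random.randn(100, 3)   # toy data: 100 samples, 3 features
y = np.random.randn(100, 1)

# batch_size = 1       -> stochastic gradient descent
# batch_size = len(X)  -> batch gradient descent
# batch_size = 16      -> mini-batch gradient descent
for X_batch, y_batch in iterate_batches(X, y, batch_size=16):
    pass  # forward, backward, and one parameter update per batch go here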
The Core Idea of Batch Training
Instead of computing gradients for one sample:
x → forward → loss → backward
We compute gradients for many samples at once:
X_batch → forward → loss → backward
Then:
- gradients are averaged
- parameters are updated once per batch
Shapes Change — Math Does Not
Let’s define shapes clearly.
Inputs
X: (batch_size, n_features)
Weights
W: (n_neurons, n_features)
Biases
b: (n_neurons,)
Step 1: Batch Forward Propagation
Previously:
z = W @ x + b
Now, for a batch:
Z = X @ W.T + b
Shapes
X: (B, n)
W.T: (n, k)
Z: (B, k)
where B = batch_size, n = n_features, and k = n_neurons.
Each row of Z corresponds to one sample.
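To see the shapes line up, here is a quick check with made-up dimensions (B = 4 samples, n = 3 features, k = 5 neurons); the values are random and only the shapes matter:

import numpy as np

B, n, k = 4, 3, 5                 # batch size, features, neurons
X = np.random.randn(B, n)
W = np.random.randn(k, n)         # (n_neurons, n_features), as defined above
b = np.random.randn(k)

Z = X @ W.T + b                   # (B, n) @ (n, k) + (k,) -> (B, k)
print(Z.shape)                    # (4, 5): one row of pre-activations per sample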
Batch ReLU
def relu(Z):
    return np.maximum(0, Z)
Works automatically on matrices.
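A quick check on a small matrix (using relu as defined above): the maximum is applied elementwise, one entry at a time.

import numpy as np

Z = np.array([[-1.0,  2.0],
              [ 3.0, -4.0]])
print(relu(Z))
# [[0. 2.]
#  [3. 0.]]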
Step 2: Batch Loss
Mean Squared Error over a batch:
def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)
This averages error across all samples.
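For example, on a small batch of three predictions (using mse as defined above), the errors collapse to a single scalar:

import numpy as np

y_true = np.array([[1.0], [2.0], [3.0]])
y_pred = np.array([[1.5], [2.0], [2.0]])
print(mse(y_true, y_pred))   # (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167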
Step 3: Batch Loss Gradient
def mse_derivative(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.shape[0]
This distributes error evenly across the batch.
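A useful sanity check (not part of the training code): the analytic gradient should match a finite-difference approximation of the batch loss. The sketch below assumes a single output per sample, so y_true.shape[0] equals the number of averaged terms.

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def mse_derivative(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.shape[0]

y_true = np.array([[1.0], [2.0], [3.0]])
y_pred = np.array([[1.2], [1.8], [3.5]])

analytic = mse_derivative(y_true, y_pred)

# Nudge each prediction and measure how much the batch loss changes.
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.shape[0]):
    bumped = y_pred.copy()
    bumped[i, 0] += eps
    numeric[i, 0] = (mse(y_true, bumped) - mse(y_true, y_pred)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))  # True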
Step 4: Batch Backpropagation (Key Insight)
For a batch, gradients become matrix operations.
Gradient w.r.t. weights
For the whole batch, the per-sample outer products collapse into a single matrix product: dW = dZ.T @ X. The bias gradient is the column-wise sum of dZ, and the gradient passed back to the previous layer is dZ @ W.
Vectorized Implementation
def dense_backward_batch(X, Z, dL_dA, W, activation_derivative):
    dZ = dL_dA * activation_derivative(Z)
    dW = dZ.T @ X
    db = np.sum(dZ, axis=0)
    dX = dZ @ W
    return dW, db, dX
This replaces:
- looping over samples
- looping over neurons
- manual accumulation
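If you want to convince yourself the matrix form is equivalent, compare it against an explicit per-sample accumulation on random numbers. The loop below exists only for this check:

import numpy as np

# Random stand-ins for one layer's batch quantities.
B, n, k = 8, 3, 5
X = np.random.randn(B, n)      # batch of inputs to the layer
W = np.random.randn(k, n)      # layer weights
dZ = np.random.randn(B, k)     # upstream gradient after the activation derivative

# Vectorized form (what dense_backward_batch computes).
dW_vec = dZ.T @ X

# Explicit per-sample accumulation, for verification only.
dW_loop = np.zeros_like(W)
for i in range(B):
    dW_loop += np.outer(dZ[i], X[i])

print(np.allclose(dW_vec, dW_loop))  # True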
Step 5: Full Batch Training Loop
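The loop below relies on dense_forward and relu_derivative from Article #12, plus data and initialized weights. As a reminder, here is a minimal sketch of compatible definitions, written to match how they are called here; the shapes and initialization values are illustrative only:

import numpy as np

def dense_forward(X, W, b, activation):
    # Batch pre-activations: (B, n) @ (n, k) + (k,) -> (B, k)
    Z = X @ W.T + b
    return activation(Z), Z

def relu_derivative(Z):
    # 1 where the pre-activation was positive, 0 elsewhere
    return (Z > 0).astype(float)

# Illustrative data and parameters: 200 samples, 3 features,
# a hidden layer of 8 neurons, and a single output.
X = np.random.randn(200, 3)
y = np.random.randn(200, 1)
W1 = np.random.randn(8, 3) * 0.1
b1 = np.zeros(8)
W2 = np.random.randn(1, 8) * 0.1
b2 = np.zeros(1)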
learning_rate = 0.01
epochs = 500

for epoch in range(epochs):
    # Forward
    A1, Z1 = dense_forward(X, W1, b1, relu)
    A2, Z2 = dense_forward(A1, W2, b2, lambda z: z)
    loss = mse(y, A2)

    # Backward
    dL_dA2 = mse_derivative(y, A2)
    dW2, db2, dA1 = dense_backward_batch(A1, Z2, dL_dA2, W2, lambda z: 1.0)
    dW1, db1, _ = dense_backward_batch(X, Z1, dA1, W1, relu_derivative)

    # Update
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if epoch % 50 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
No sample loop.
This is true batch gradient descent.
Mini-Batch Training (Practical Default)
In practice:
- Full batch can be memory-heavy
- SGD is noisy
So we use mini-batches.
Mini-Batch Loop
batch_size = 16

for epoch in range(epochs):
    for i in range(0, len(X), batch_size):
        X_batch = X[i:i+batch_size]
        y_batch = y[i:i+batch_size]
        # forward, backward, update (same as above)
This is how almost all deep learning training works.
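One practical detail the slicing loop above leaves out: in real training runs the sample order is usually shuffled every epoch, so consecutive passes see different batches. A minimal sketch, with illustrative data:

import numpy as np

X = np.random.randn(200, 3)   # illustrative data
y = np.random.randn(200, 1)
batch_size = 16
epochs = 500

for epoch in range(epochs):
    # Visit the samples in a different order each epoch.
    perm = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        X_batch, y_batch = X[idx], y[idx]
        # forward, backward, update (same as above)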
Why Mini-Batches Are So Effective
Mini-batch training:
- smooths gradients
- improves generalization
- fits GPU memory
- enables parallelism
This is why GPUs matter.
What You Have Achieved So Far
At this point, you have built:
- A neural network from scratch
- Vectorized forward propagation
- Vectorized backpropagation
- Batch and mini-batch training
- A training loop that scales
This is the complete core of deep learning.
Frameworks now become optional.
Common Mistakes at This Stage
Mistake 1: Forgetting to average gradients
→ Leads to exploding updates. In the code above, the averaging happens inside mse_derivative, which divides by the batch size; drop that division and the gradient (and so the effective learning rate) scales with the batch size.
Mistake 2: Confusing batch size with epochs
→ One epoch = one pass over the dataset. With 1,000 samples and a batch size of 16, one epoch means 63 parameter updates, not one.
Mistake 3: Thinking frameworks add intelligence
→ They add engineering, not math.
What’s Next in Part II
In Article #14, we will:
- Visualize loss curves
- Detect overfitting
- Introduce validation sets
- Explain why “lower loss” is not always better
This is where training becomes diagnosable, not just runnable.
Series Status
- Part I — Foundations ✔ Complete
- Part II — Vectorization & Scaling ▶ Nearly Complete
You now understand neural networks at a level where:
- frameworks make sense
- debugging is possible
- intuition matches code