In Article #12, we achieved a major milestone:
- Forward propagation was vectorized
- Backpropagation was vectorized
- Gradients were computed with matrix operations
- The network trained efficiently using NumPy
But there is still one loop left.
This one:
for x_i, y_i in zip(X, y):
In this article, we remove it.
This is the final conceptual step before:
- GPUs
- deep learning frameworks
- industrial-scale training
Why That Last Loop Matters
So far, we have been training the network:
- One sample at a time
- Updating weights after every example
This approach is called:
Stochastic Gradient Descent (SGD)
SGD works — but it is not how most real systems train.
The Three Training Modes (Big Picture)
Neural networks are trained using one of three strategies:
1. Stochastic Gradient Descent (SGD)
- Batch size = 1
- Very noisy gradients
- Fast updates, unstable learning
2. Batch Gradient Descent
- Batch size = full dataset
- Stable gradients
- Slow updates, high memory usage
3. Mini-Batch Gradient Descent (⭐ default)
- Batch size = 16–1024
- Balance of speed and stability
- Used in almost all real training setups
This article focuses on batch and mini-batch training.
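The difference between the three modes is just one number. A minimal sketch, assuming a toy dataset and a hypothetical iterate_batches helper (neither is part of the series code):

import numpy as np

def iterate_batches(X, y, batch_size):
    # Yield consecutive slices of the dataset, batch_size rows at a time.
    for i in range(0, len(X), batch_size):
        yield X[i:i + batch_size], y[i:i + batch_size]

X = np.random.randn(100, 3)   # toy data: 100 samples, 3 features
y = np.random.randn(100, 1)

# batch_size = 1       -> stochastic gradient descent
# batch_size = len(X)  -> batch gradient descent
# batch_size = 16      -> mini-batch gradient descent
for X_batch, y_batch in iterate_batches(X, y, batch_size=16):
    pass  # forward, backward, and one parameter update per batch go here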
The Core Idea of Batch Training
Instead of computing gradients for one sample:
x → forward → loss → backward
We compute gradients for many samples at once:
X_batch → forward → loss → backward
Then:
- gradients are averaged
- parameters are updated once per batch
Shapes Change — Math Does Not
Let’s define shapes clearly.
Inputs
X: (batch_size, n_features)
Weights
W: (n_neurons, n_features)
Biases
b: (n_neurons,)
Step 1: Batch Forward Propagation
Previously:
z = W @ x + b
Now, for a batch:
Z = X @ W.T + b
Shapes
X: (B, n)
W.T: (n, k)
Z: (B, k)
where B = batch_size, n = n_features, and k = n_neurons.
Each row of Z corresponds to one sample.
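To see the shapes line up, here is a quick check with made-up dimensions (B = 4 samples, n = 3 features, k = 5 neurons); the values are random and only the shapes matter:

import numpy as np

B, n, k = 4, 3, 5                 # batch size, features, neurons
X = np.random.randn(B, n)
W = np.random.randn(k, n)         # (n_neurons, n_features), as defined above
b = np.random.randn(k)

Z = X @ W.T + b                   # (B, n) @ (n, k) + (k,) -> (B, k)
print(Z.shape)                    # (4, 5): one row of pre-activations per sample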
Batch ReLU
def relu(Z):
    return np.maximum(0, Z)
Works automatically on matrices.
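A quick check on a small matrix (using relu as defined above): the maximum is applied elementwise, one entry at a time.

import numpy as np

Z = np.array([[-1.0,  2.0],
              [ 3.0, -4.0]])
print(relu(Z))
# [[0. 2.]
#  [3. 0.]]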
Step 2: Batch Loss
Mean Squared Error over a batch:
def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)
This averages error across all samples.
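For example, on a small batch of three predictions (using mse as defined above), the errors collapse to a single scalar:

import numpy as np

y_true = np.array([[1.0], [2.0], [3.0]])
y_pred = np.array([[1.5], [2.0], [2.0]])
print(mse(y_true, y_pred))   # (0.25 + 0.0 + 1.0) / 3 ≈ 0.4167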
Step 3: Batch Loss Gradient
def mse_derivative(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.shape[0]
This distributes error evenly across the batch.
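A useful sanity check (not part of the training code): the analytic gradient should match a finite-difference approximation of the batch loss. The sketch below assumes a single output per sample, so y_true.shape[0] equals the number of averaged terms.

import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_pred - y_true) ** 2)

def mse_derivative(y_true, y_pred):
    return 2 * (y_pred - y_true) / y_true.shape[0]

y_true = np.array([[1.0], [2.0], [3.0]])
y_pred = np.array([[1.2], [1.8], [3.5]])

analytic = mse_derivative(y_true, y_pred)

# Nudge each prediction and measure how much the batch loss changes.
eps = 1e-6
numeric = np.zeros_like(y_pred)
for i in range(y_pred.shape[0]):
    bumped = y_pred.copy()
    bumped[i, 0] += eps
    numeric[i, 0] = (mse(y_true, bumped) - mse(y_true, y_pred)) / eps

print(np.allclose(analytic, numeric, atol=1e-4))  # True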
Step 4: Batch Backpropagation (Key Insight)
For a batch, gradients become matrix operations.
Gradient w.r.t. weights
For the whole batch, the per-sample outer products collapse into a single matrix product: dW = dZ.T @ X. The bias gradient is the column-wise sum of dZ, and the gradient passed back to the previous layer is dZ @ W.
Vectorized Implementation
def dense_backward_batch(X, Z, dL_dA, W, activation_derivative):
    dZ = dL_dA * activation_derivative(Z)
    dW = dZ.T @ X
    db = np.sum(dZ, axis=0)
    dX = dZ @ W
    return dW, db, dX
This replaces:
- looping over samples
- looping over neurons
- manual accumulation
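If you want to convince yourself the matrix form is equivalent, compare it against an explicit per-sample accumulation on random numbers. The loop below exists only for this check:

import numpy as np

# Random stand-ins for one layer's batch quantities.
B, n, k = 8, 3, 5
X = np.random.randn(B, n)      # batch of inputs to the layer
W = np.random.randn(k, n)      # layer weights
dZ = np.random.randn(B, k)     # upstream gradient after the activation derivative

# Vectorized form (what dense_backward_batch computes).
dW_vec = dZ.T @ X

# Explicit per-sample accumulation, for verification only.
dW_loop = np.zeros_like(W)
for i in range(B):
    dW_loop += np.outer(dZ[i], X[i])

print(np.allclose(dW_vec, dW_loop))  # True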
Step 5: Full Batch Training Loop
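The loop below relies on dense_forward and relu_derivative from Article #12, plus data and initialized weights. As a reminder, here is a minimal sketch of compatible definitions, written to match how they are called here; the shapes and initialization values are illustrative only:

import numpy as np

def dense_forward(X, W, b, activation):
    # Batch pre-activations: (B, n) @ (n, k) + (k,) -> (B, k)
    Z = X @ W.T + b
    return activation(Z), Z

def relu_derivative(Z):
    # 1 where the pre-activation was positive, 0 elsewhere
    return (Z > 0).astype(float)

# Illustrative data and parameters: 200 samples, 3 features,
# a hidden layer of 8 neurons, and a single output.
X = np.random.randn(200, 3)
y = np.random.randn(200, 1)
W1 = np.random.randn(8, 3) * 0.1
b1 = np.zeros(8)
W2 = np.random.randn(1, 8) * 0.1
b2 = np.zeros(1)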
learning_rate = 0.01
epochs = 500

for epoch in range(epochs):
    # Forward
    A1, Z1 = dense_forward(X, W1, b1, relu)
    A2, Z2 = dense_forward(A1, W2, b2, lambda z: z)
    loss = mse(y, A2)

    # Backward
    dL_dA2 = mse_derivative(y, A2)
    dW2, db2, dA1 = dense_backward_batch(A1, Z2, dL_dA2, W2, lambda z: 1.0)
    dW1, db1, _ = dense_backward_batch(X, Z1, dA1, W1, relu_derivative)

    # Update
    W2 -= learning_rate * dW2
    b2 -= learning_rate * db2
    W1 -= learning_rate * dW1
    b1 -= learning_rate * db1

    if epoch % 50 == 0:
        print(f"Epoch {epoch}, Loss: {loss:.4f}")
No sample loop.
This is true batch gradient descent.
Mini-Batch Training (Practical Default)
In practice:
- Full batch can be memory-heavy
- SGD is noisy
So we use mini-batches.
Mini-Batch Loop
batch_size = 16

for epoch in range(epochs):
    for i in range(0, len(X), batch_size):
        X_batch = X[i:i+batch_size]
        y_batch = y[i:i+batch_size]
        # forward, backward, update (same as above)
This is how almost all deep learning training works.
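One practical detail the slicing loop above leaves out: in real training runs the sample order is usually shuffled every epoch, so consecutive passes see different batches. A minimal sketch, with illustrative data:

import numpy as np

X = np.random.randn(200, 3)   # illustrative data
y = np.random.randn(200, 1)
batch_size = 16
epochs = 500

for epoch in range(epochs):
    # Visit the samples in a different order each epoch.
    perm = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        X_batch, y_batch = X[idx], y[idx]
        # forward, backward, update (same as above)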
Why Mini-Batches Are So Effective
Mini-batch training:
- smooths gradients
- improves generalization
- fits GPU memory
- enables parallelism
This is why GPUs matter.
What You Have Achieved So Far
At this point, you have built:
- A neural network from scratch
- Vectorized forward propagation
- Vectorized backpropagation
- Batch and mini-batch training
- A training loop that scales
This is the complete core of deep learning.
Frameworks now become optional.
Common Mistakes at This Stage
Mistake 1: Forgetting to average gradients
→ Leads to exploding updates. In the code above, the averaging happens inside mse_derivative, which divides by the batch size; drop that division and the gradient (and so the effective learning rate) scales with the batch size.
Mistake 2: Confusing batch size with epochs
→ One epoch = one pass over the dataset. With 1,000 samples and a batch size of 16, one epoch means 63 parameter updates, not one.
Mistake 3: Thinking frameworks add intelligence
→ They add engineering, not math.
What’s Next in Part II
In Article #14, we will:
- Visualize loss curves
- Detect overfitting
- Introduce validation sets
- Explain why “lower loss” is not always better
This is where training becomes diagnosable, not just runnable.
Series Status
- Part I — Foundations ✔ Complete
- Part II — Vectorization & Scaling ▶ Nearly Complete
You now understand neural networks at a level where:
- frameworks make sense
- debugging is possible
- intuition matches code