In the previous article, we learned how to detect overfitting:
- training loss keeps going down
- validation loss starts going up
- the model is memorizing instead of generalizing
Detection is essential.
But now we move to the next level:
How do we actively prevent overfitting while training?
The simplest and most fundamental answer is L2 regularization, also known as weight decay.
The Core Idea of Regularization
Regularization adds a preference to the learning process.
Not just:
“Minimize the loss.”
But:
“Minimize the loss and keep the model simple.”
L2 regularization does this by discouraging large weights.
Why Large Weights Are a Problem
Large weights mean:
- the model reacts strongly to small input changes
- tiny fluctuations can cause big output swings
- noise gets amplified
- memorization becomes easy
In short:
Large weights → brittle models.
Regularization gently pushes weights to stay small unless absolutely necessary.
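To make this concrete, here is a tiny illustration (the weight values are arbitrary, chosen only to show the contrast):

```python
import numpy as np

x = np.array([1.0, 2.0])
x_noisy = x + np.array([0.01, -0.01])  # a tiny input perturbation

w_small = np.array([0.1, -0.2])
w_large = np.array([10.0, -20.0])

# Same noise, very different output swings
print(abs(w_small @ x - w_small @ x_noisy))  # ~0.003
print(abs(w_large @ x - w_large @ x_noisy))  # ~0.3
```

The large weights turn a 1% input wiggle into a visible output jump.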
What Is L2 Regularization (Mathematically)?
Original loss (for one sample):
L = L_data
With L2 regularization:
L = L_data + λ · Σ w²
Where:
- w are the weights
- λ (lambda) controls regularization strength
Biases are usually not regularized.
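As a quick sanity check, here is the penalty computed for a toy weight vector (values chosen purely for illustration):

```python
import numpy as np

lambda_reg = 0.01
w = np.array([3.0, -2.0, 0.5])

penalty = lambda_reg * np.sum(w**2)  # 0.01 * (9 + 4 + 0.25) = 0.1325
```

This penalty is simply added to the data loss before backpropagation.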
Intuition (No Math)
L2 regularization says:
“I will allow large weights — but only if they really help reduce error.”
If two solutions fit the data equally well:
- the one with smaller weights wins
This bias toward simplicity improves generalization.
How L2 Regularization Affects Gradients
This is the key insight.
Without regularization:
∂L/∂w = ∂L_data/∂w
With L2 regularization:
∂L/∂w = ∂L_data/∂w + 2λw
So every update includes a pull toward zero.
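You can watch this pull in isolation: if the data gradient happened to be zero, the update would shrink every weight by the same factor each step. A minimal sketch with illustrative values:

```python
lr, lambda_reg = 0.1, 0.01
w = 5.0

for step in range(3):
    dW = 0.0  # pretend the data gradient vanishes
    w -= lr * (dW + 2 * lambda_reg * w)
    print(w)  # ≈ 4.99, 4.98002, 4.97006: steady decay toward zero
```

With a real data gradient, this decay competes with the pressure to fit the data.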
Implementing L2 Regularization From Scratch
We integrate it directly into backpropagation.
Step 1: Choose a Regularization Strength
lambda_reg = 0.01
Typical values range from 0.0001 (very mild) to 0.1 (strong).
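If you are unsure which value to use, a simple log-spaced sweep works well. A sketch, where `train_and_evaluate` is a hypothetical stand-in for whatever training loop you already have:

```python
# train_and_evaluate is a hypothetical helper:
# it trains a model with the given strength and returns validation loss
for lambda_reg in [0.0001, 0.001, 0.01, 0.1]:
    val_loss = train_and_evaluate(lambda_reg)
    print(f"lambda={lambda_reg}: val_loss={val_loss:.4f}")
```

Pick the value with the lowest validation loss.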
Step 2: Modify Weight Gradients
Original gradient update:
W -= lr * dW
With L2 regularization:
W -= lr * (dW + 2 * lambda_reg * W)
That’s it.
No new loops.
No new architecture.
Just one extra term.
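It also helps to see why this is called weight decay: the same update can be rewritten so that the weights are first scaled toward zero, then moved by the data gradient. The two forms are algebraically identical (a sketch with placeholder values):

```python
import numpy as np

lr, lambda_reg = 0.1, 0.01
W = np.array([[0.5, -1.0], [2.0, 0.3]])
dW = np.zeros_like(W)  # stand-in for the data gradient

# W -= lr * (dW + 2 * lambda_reg * W)  is equivalent to:
W *= (1 - 2 * lr * lambda_reg)  # first decay every weight toward zero
W -= lr * dW                    # then apply the usual data gradient
```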
Step 3: Apply It to Our Vectorized Training Loop
Output Layer Update
W2 -= learning_rate * (dW2 + 2 * lambda_reg * W2)
b2 -= learning_rate * db2
Hidden Layer Update
W1 -= learning_rate * (dW1 + 2 * lambda_reg * W1)
b1 -= learning_rate * db1
Biases remain unchanged.
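Putting it all together, one full training loop with L2 regularization might look like this. A minimal, self-contained sketch: the data, layer sizes, and hyperparameters are made up for illustration, and the network is a plain two-layer ReLU model trained with MSE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (illustrative only)
X = rng.normal(size=(200, 4))
y = 2.0 * X[:, :1] + rng.normal(scale=0.1, size=(200, 1))

# Two-layer network with made-up sizes
W1 = rng.normal(scale=0.1, size=(4, 8)); b1 = np.zeros(8)
W2 = rng.normal(scale=0.1, size=(8, 1)); b2 = np.zeros(1)

learning_rate, lambda_reg = 0.05, 0.01

for epoch in range(100):
    # Forward pass
    h = np.maximum(0, X @ W1 + b1)  # ReLU hidden layer
    y_hat = h @ W2 + b2

    # Backward pass for MSE loss
    n = len(X)
    d_out = 2 * (y_hat - y) / n
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (h > 0)
    dW1 = X.T @ d_h
    db1 = d_h.sum(axis=0)

    # Updates: L2 applied to weights only, biases unregularized
    W2 -= learning_rate * (dW2 + 2 * lambda_reg * W2)
    b2 -= learning_rate * db2
    W1 -= learning_rate * (dW1 + 2 * lambda_reg * W1)
    b1 -= learning_rate * db1
```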
Step 4: (Optional) Track Regularized Loss
If you want to see the effect:
reg_loss = lambda_reg * (np.sum(W1**2) + np.sum(W2**2))
total_loss = data_loss + reg_loss
But remember:
- the extra gradient term is what changes training; the logged number is only for monitoring
What Weight Decay Actually Does During Training
Every update step now has two forces:
- Data gradient → fit the data
- Regularization gradient → keep weights small
Learning becomes a negotiation instead of a race.
Visual Effect on Loss Curves
With L2 regularization, you often see:
- slightly higher training loss
- lower validation loss
- smaller train/validation gap
This is exactly what we want.
Common Beginner Misconceptions
Mistake 1: “Regularization hurts accuracy”
→ Training accuracy may dip slightly, but validation and real-world accuracy often improve.
Mistake 2: “More regularization is always better”
→ Too much leads to underfitting.
Mistake 3: “Regularization replaces more data”
→ It helps, but data is still king.
How L2 Regularization Fits in the Big Picture
L2 regularization is:
- simple
- cheap
- mathematically clean
- widely used
It is the default defense against overfitting.
Other methods (dropout, data augmentation) build on this foundation.
What You Have Achieved So Far
At this point, you can:
- detect overfitting
- understand why it happens
- actively reduce it
- modify loss functions safely
- reason about generalization
You are now doing model tuning, not just model building.
What’s Next in the Series
In Article #16, we will introduce:
- Dropout
- Why randomly disabling neurons helps
- How dropout differs from regularization
- A clean from-scratch implementation
This will complete your core overfitting defense toolkit.
Series Status
- Part I — Foundations ✔
- Part II — Scaling & Generalization ▶ Deepening
You are now operating at a level where:
- training behavior is explainable
- fixes are intentional
- results are interpretable