SolveWithPython

Gradients in Neural Networks — Why Learning Requires Derivatives

Up to this point, we have built a complete neural network pipeline:

  • Inputs flow forward through layers
  • The network produces a prediction
  • A loss function measures how wrong that prediction is

At this stage, the network can evaluate itself.

But it still cannot improve.

To improve, the network must answer a deeper question:

Which weights caused the error, and by how much?

The tool that answers this question is the gradient.

This article introduces gradients from first principles and explains why derivatives are the engine of learning in neural networks.

The Core Problem of Learning

Suppose your network produces a loss of 0.25.

That number tells you:

  • The prediction is not perfect

But it does not tell you:

  • Which weight caused the error
  • Whether to increase or decrease a weight
  • How much to change it

Loss alone is just a measurement.

To learn, we need direction.

What Is a Gradient?

A gradient tells us:

How much the loss changes when a parameter changes.

In simpler terms:

  • If I nudge this weight slightly…
  • Will the loss go up or down?
  • And how fast?

Mathematically, this is a derivative.
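The "nudge the weight slightly" intuition can be made concrete with a finite-difference estimate. Here is a minimal sketch in pure Python; the quadratic `loss` function is just a stand-in for any loss:

```python
def numerical_gradient(f, w, h=1e-6):
    """Estimate df/dw by nudging w slightly and measuring the change in f."""
    return (f(w + h) - f(w - h)) / (2 * h)

# A stand-in loss that grows as w moves away from 2
loss = lambda w: (w - 2) ** 2

print(numerical_gradient(loss, 5.0))  # ≈ +6.0 → loss rises if w increases
print(numerical_gradient(loss, 0.0))  # ≈ -4.0 → loss falls if w increases
```

The sign answers "up or down?", and the magnitude answers "how fast?".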

One Weight, One Question

Let’s simplify the problem to its smallest form.

Imagine a neuron with:

  • One input
  • One weight
  • One bias
  • One loss value

The learning question becomes:

How does the loss change if I change the weight?

This is written as:

∂Loss/∂w

This is the gradient of the loss with respect to the weight.

Why Derivatives Matter

Derivatives give us two critical pieces of information:

  1. Direction
    • Positive gradient → increasing weight increases loss
    • Negative gradient → increasing weight decreases loss
  2. Sensitivity
    • Large gradient → small changes matter a lot
    • Small gradient → changes barely matter

Learning is simply moving weights in the direction that reduces loss.

A Concrete Example (No Neural Network Yet)

Let’s step away from networks for a moment.

Consider this simple function:

f(w) = (w - 3)²

This function has:

  • A minimum at w = 3
  • Higher values as you move away from 3

Its derivative is:

f'(w) = 2(w - 3)

What the Derivative Tells Us

  • If w = 5, derivative = +4 → decrease w
  • If w = 1, derivative = -4 → increase w
  • If w = 3, derivative = 0 → stop

This is exactly how neural networks learn — just with more variables.
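Following the derivative's direction can be written as a short update loop. This is a minimal sketch; the learning rate 0.1 and starting point 5.0 are arbitrary choices:

```python
def f(w):
    return (w - 3) ** 2

def f_prime(w):
    return 2 * (w - 3)

w = 5.0   # start away from the minimum
lr = 0.1  # step size (learning rate)
for step in range(50):
    w -= lr * f_prime(w)  # move against the gradient

print(round(w, 4))  # → 3.0, the minimum of f
```

At w = 5 the derivative is positive, so the update decreases w; as w approaches 3 the derivative shrinks toward zero and the steps naturally get smaller.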

Gradients in a Neuron

Now let’s return to our neural network.

Recall a simple neuron:

z = w·x + b

With an activation function:

a = f(z)

And a loss function:

L = loss(a, y)

The loss depends on the weight indirectly.

To compute the gradient, we apply the chain rule.

The Chain Rule (Conceptual, Not Formal)

The chain rule tells us:

If A affects B, and B affects C, then A affects C.

In neural networks:

  • Weight affects z
  • z affects activation a
  • a affects loss L

So:

∂L/∂w = ∂L/∂a · ∂a/∂z · ∂z/∂w

This is the backbone of backpropagation.

We will compute each part explicitly in the next article.
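As a preview, here is the three-factor chain rule applied to a single neuron, assuming a sigmoid activation and a squared-error loss (the input, target, and parameter values are arbitrary; the next article derives each factor properly):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

# Forward pass
x, y = 1.5, 1.0   # input and target
w, b = 0.8, 0.1   # parameters
z = w * x + b
a = sigmoid(z)
L = (a - y) ** 2

# Backward pass: the chain rule, one factor at a time
dL_da = 2 * (a - y)   # ∂L/∂a for squared error
da_dz = a * (1 - a)   # ∂a/∂z for sigmoid
dz_dw = x             # ∂z/∂w
dL_dw = dL_da * da_dz * dz_dw
print(dL_dw)  # negative here: increasing w would reduce the loss
```

Notice that each factor is local to one step of the forward pass; multiplying them connects the weight all the way to the loss.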

A First Gradient Calculation (Manual)

Let’s compute one part right now.

From:

z = w·x + b

The derivative with respect to w is:

∂z/∂w = x

This means:

  • The input directly scales how much the weight matters
  • Larger inputs → larger gradients

This is not an accident — it is fundamental.
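A quick numerical check of this fact, again using a tiny nudge (the particular inputs and the perturbation h are arbitrary):

```python
def z(w, x, b=0.0):
    return w * x + b

h = 1e-6
for x_val in (0.5, 2.0, 10.0):
    # Nudge w around 1.0 and measure how z responds
    dz_dw = (z(1.0 + h, x_val) - z(1.0 - h, x_val)) / (2 * h)
    print(x_val, round(dz_dw, 4))  # the gradient equals the input itself
```

Whatever the input, the measured gradient matches x exactly, which is why large inputs amplify weight updates.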

Why Gradients Are Computed Backward

Notice something important:

  • The loss is computed last
  • But gradients are needed for the earliest weights

This means:

  • We must start at the loss
  • And move backward through the network

This is why the algorithm is called backpropagation.

Common Beginner Misconceptions

Mistake 1: Thinking gradients are magic
→ They are just derivatives.

Mistake 2: Thinking gradients update weights
→ Gradients only describe change. Updates come later.

Mistake 3: Fearing calculus
→ You only need simple derivatives, applied systematically.

What We Have Achieved So Far

At this point, we understand:

  • Why loss alone is insufficient
  • Why derivatives are necessary
  • What a gradient represents
  • How gradients relate to learning

We are now ready to compute gradients end to end.

What’s Next in the Series

In Article #7, we will:

  • Compute gradients for a full neuron
  • Differentiate the loss function
  • Differentiate activation functions (ReLU, Sigmoid)
  • Combine everything using the chain rule

This will be our first true step into backpropagation.

GitHub Code

In the next article, we will begin adding explicit gradient calculations to the repository.

👉 [link to your GitHub repository]

Series Progress

You are reading:

Neural Networks From Scratch (Pure Python)
✔ Article #1 — What a Neuron Really Computes
✔ Article #2 — Activation Functions
✔ Article #3 — Building a Layer
✔ Article #4 — Forward Propagation
✔ Article #5 — Loss Functions
✔ Article #6 — Gradients Explained
➡ Article #7 — Backpropagation Step by Step