SolveWithPython

Activation Functions in Neural Networks — Why a Network Without Them Cannot Learn

In the previous article, we built a real artificial neuron in pure Python.
It took inputs, applied weights, added a bias, and produced an output.

At that point, we had something important—but also something fundamentally limited.

A network made only of those neurons cannot learn complex patterns, no matter how many layers you stack.

This article explains why, and introduces the single idea that turns linear math into learning:
activation functions.

The Hidden Problem With Linear Neurons

Recall the neuron we built:

z = (x1 · w1) + (x2 · w2) + ... + b

This is a linear function.

Now here is the key insight:

A stack of linear functions is still just a linear function.

That means:

  • 1 layer → linear
  • 10 layers → still linear
  • 1,000 layers → still linear

No matter how deep the network is, it cannot model non-linear relationships.

This is why a neural network without activation functions is mathematically pointless.

A Simple Proof (Intuition, Not Formal Math)

Suppose we have two layers:

Layer 1: z1 = W1 · x + b1

Layer 2: z2 = W2 · z1 + b2

Substitute the first into the second:

z2 = W2 · (W1 · x + b1) + b2

Which simplifies to:

z2 = (W2 · W1) · x + (W2 · b1 + b2)

That is still a single linear transformation.

Depth alone does nothing.
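
A quick numeric check makes this concrete. The sketch below uses made-up weights and a single input per layer for simplicity; it runs an input through two linear "layers" and then through one collapsed linear layer, and the outputs always match.

Python
w1, b1 = 2.0, 1.0    # layer 1: z1 = w1 * x + b1  (hypothetical weights)
w2, b2 = -0.5, 3.0   # layer 2: z2 = w2 * z1 + b2 (hypothetical weights)
def two_layers(x):
    z1 = w1 * x + b1
    return w2 * z1 + b2
def one_layer(x):
    # both layers collapsed into a single weight and a single bias
    return (w2 * w1) * x + (w2 * b1 + b2)
for x in [-2.0, 0.0, 1.0, 4.0]:
    print(two_layers(x), one_layer(x))  # identical outputs for every x

However the two weights are chosen, the composition is always just another straight line.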

What Activation Functions Do

Activation functions introduce non-linearity.

Instead of outputting z directly, a neuron outputs:

a = f(z)

where f is a non-linear function.

This one change allows neural networks to:

  • Bend decision boundaries
  • Learn curves, shapes, and patterns
  • Approximate complex functions

Without activation functions, neural networks collapse into linear regression.

The Two Most Important Activation Functions

We will start with the two that matter most conceptually.

1. ReLU (Rectified Linear Unit)

Definition:

ReLU(z) = max(0, z)

Interpretation:

  • Negative values → 0
  • Positive values → unchanged

Why it works well:

  • Simple
  • Efficient
  • Avoids saturation for positive values
  • Dominates modern deep learning

Implementing ReLU in Python

Python
def relu(z):
    return max(0.0, z)

Example:

Python
print(relu(-3.0)) # 0.0
print(relu(2.5)) # 2.5
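
ReLU is applied element-wise, one value at a time. As a small sketch (the list of pre-activation values here is made up), this is what it looks like when several values pass through at once, which is exactly what will happen inside a layer:

Python
z_values = [-3.0, -0.5, 0.0, 1.2, 4.0]     # hypothetical pre-activations
activations = [relu(z) for z in z_values]
print(activations)                         # [0.0, 0.0, 0.0, 1.2, 4.0]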

2. Sigmoid

Definition:

σ(z) = 1 / (1 + e^(-z))

Interpretation:

  • Maps values to the range (0, 1)
  • Can be interpreted as probability

When it’s used:

  • Binary classification
  • Output layers

Implementing Sigmoid in Python

Python
import math
def sigmoid(z):
    return 1 / (1 + math.exp(-z))

Example:

Python
print(sigmoid(-2.0)) # ~0.12
print(sigmoid(0.0)) # 0.5
print(sigmoid(2.0)) # ~0.88

Adding Activation to Our Neuron

Let’s extend the neuron from Article #1.

Python
def neuron(inputs, weights, bias, activation):
    # weighted sum of inputs plus bias, exactly as in Article #1
    total = 0.0
    for x, w in zip(inputs, weights):
        total += x * w
    total += bias
    # the only new step: pass the result through the activation function
    return activation(total)

Now the neuron is no longer purely linear.

Example: Neuron With ReLU

Python
inputs = [2.0, 3.0]
weights = [0.5, -1.0]
bias = 1.0
output = neuron(inputs, weights, bias, relu)
print(output)

Previously, the raw output was -1.0.

After ReLU:

  • relu(-1.0) → 0.0

This single decision changes how information flows through the network.
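
For comparison, here is the same neuron with sigmoid instead of ReLU, a minimal sketch that reuses the functions defined above. The negative pre-activation is dampened rather than silenced:

Python
output = neuron(inputs, weights, bias, sigmoid)
print(output)  # ~0.27, because sigmoid(-1.0) ≈ 0.27

The choice of activation decides whether a negative pre-activation is cut to zero (ReLU) or squashed into a small positive value (sigmoid).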

Why Non-Linearity Enables Learning

With activation functions:

  • Different neurons activate for different regions of input space
  • Layers can progressively reshape the data
  • Decision boundaries become curved instead of straight

This is what allows neural networks to solve problems like:

  • XOR
  • Image recognition
  • Speech
  • Language

Without activation functions, none of that is possible. The sketch below shows XOR as a concrete case.
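
Here is a minimal sketch with hand-picked (not learned) weights, reusing relu from above. Two hidden ReLU neurons feed one linear output neuron, and together they compute XOR, which no single linear neuron can do:

Python
def xor_net(x1, x2):
    # hidden layer: two ReLU neurons with hand-picked weights
    h1 = relu(x1 + x2)        # fires when at least one input is 1
    h2 = relu(x1 + x2 - 1.0)  # fires only when both inputs are 1
    # linear output neuron combines the two hidden activations
    return h1 - 2.0 * h2
for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, "->", xor_net(a, b))  # prints 0, 1, 1, 0

Remove relu from the hidden layer and the same weights collapse into the linear function 2 - x1 - x2, which cannot separate XOR.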

Common Beginner Mistakes

Mistake 1: Using no activation at all
→ The network becomes linear regression.

Mistake 2: Using sigmoid everywhere
→ Gradients vanish in deep networks (illustrated in the sketch after this list).

Mistake 3: Thinking activation is optional
→ It is the core of neural networks.
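
Mistake 2 becomes clearer if you look at the slope of the sigmoid. Its derivative is σ(z) · (1 - σ(z)), which never exceeds 0.25 and shrinks towards zero for large positive or negative z. A small sketch, reusing the sigmoid defined above:

Python
def sigmoid_slope(z):
    s = sigmoid(z)
    return s * (1 - s)       # derivative of sigmoid at z
print(sigmoid_slope(0.0))    # 0.25 -- the largest slope sigmoid ever has
print(sigmoid_slope(5.0))    # ~0.0066
print(sigmoid_slope(10.0))   # ~0.000045

When many of these tiny slopes are multiplied together across deep layers, the learning signal all but vanishes, which is why sigmoid is usually kept for the output layer.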

What We Have Built So Far

We now have:

  • A neuron
  • Weights and bias
  • A non-linear activation function
  • A neuron capable of expressing complex behavior

But we still have a limitation:

Our neuron works alone.

Neural networks gain power when neurons work in groups.

What’s Next in the Series

In Article #3, we will:

  • Combine neurons into a layer
  • Implement a dense (fully connected) layer in pure Python
  • Understand how data flows through multiple neurons
  • Prepare for full forward propagation

This is where a neural network truly begins.

GitHub Code

All code for this article is available here:

👉 https://github.com/Benard-Kemp/Activation-Functions-in-Neural-Networks

Each article adds exactly one new concept and one new file.

Series Progress

You are reading:

Neural Networks From Scratch (Pure Python)
✔ Article #1 — What a Neuron Really Computes
✔ Article #2 — Activation Functions
➡ Article #3 — Building a Layer From Neurons