In Article #2, we removed unimportant weights.
Now we move to something equally powerful — and often more biologically intuitive:
What if most neurons simply didn’t activate in the first place?
This is called activation sparsity.
Instead of removing connections, we control how many neurons are active per input.
In this article, you will:
- Measure neuron activation rates in a dense network
- Implement k-Winners-Take-All (k-WTA)
- Compare dense vs sparse activations
- Quantify compute savings per batch
As always: Python first, minimal math second.
1. What Is Activation Sparsity?
A layer is activation-sparse when:
Only a small fraction of neurons produce non-zero outputs.
ReLU already introduces partial sparsity:
ReLU(x) = max(0, x)
But ReLU sparsity is uncontrolled.
We want explicit control over how many neurons fire.
2. Measuring Activation Sparsity in a Dense Model
Let’s begin by measuring how many neurons are active after a ReLU layer.
Dense Model
```python
import torch
import torch.nn as nn

class DenseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()

    def forward(self, x):
        a1 = self.relu(self.fc1(x))
        a2 = self.relu(self.fc2(a1))
        out = self.fc3(a2)
        # Return the hidden activations too, so we can measure their sparsity.
        return out, a1, a2
```
Activation Measurement Utility
```python
def activation_ratio(tensor):
    total = tensor.numel()
    active = (tensor > 0).sum().item()
    return active / total
```
Test It
```python
model = DenseMLP(100, 256, 10)
x = torch.randn(512, 100)

_, a1, a2 = model(x)

print("Layer 1 activation ratio:", activation_ratio(a1))
print("Layer 2 activation ratio:", activation_ratio(a2))
```
Typical output:
```
Layer 1 activation ratio: ~0.50
Layer 2 activation ratio: ~0.48
```
Because the pre-activations here are roughly symmetric around zero, ReLU alone leaves about half the neurons active.
We can do better.
3. k-Winners-Take-All (k-WTA)
The idea is simple:
For each input sample, only keep the top-k activations.
Everything else becomes zero.
This guarantees exact sparsity.
4. Implementing k-WTA in PyTorch
```python
class KWinnersTakeAll(nn.Module):
    def __init__(self, k):
        super().__init__()
        self.k = k

    def forward(self, x):
        # x shape: (batch_size, features)
        topk_vals, topk_idx = torch.topk(x, self.k, dim=1)
        # Build a 0/1 mask that keeps only the top-k entries of each row.
        mask = torch.zeros_like(x)
        mask.scatter_(1, topk_idx, 1.0)
        return x * mask
```
Key properties:
- Exactly k neurons active per sample
- Differentiable in practice (the gradient flows only through the k winning units; see the quick check below)
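As a quick check you can run yourself (a small sketch using arbitrary shapes, not code from the article's repo), the gradient reaches only the winning units:

```python
import torch

kwta = KWinnersTakeAll(k=2)          # module defined above
x = torch.randn(4, 8, requires_grad=True)
y = kwta(x)
y.sum().backward()

# The gradient is 1 where a unit won (it passed through unchanged) and 0
# elsewhere, so each row of x.grad has exactly k non-zero entries.
print((x.grad != 0).sum(dim=1))      # tensor([2, 2, 2, 2])
```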
5. Sparse Activation MLP
```python
class SparseActivationMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, k):
        super().__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, hidden_dim)
        self.fc3 = nn.Linear(hidden_dim, output_dim)
        self.relu = nn.ReLU()
        self.kwta = KWinnersTakeAll(k)

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.kwta(x)
        x = self.relu(self.fc2(x))
        x = self.kwta(x)
        return self.fc3(x)
```
6. Measuring Controlled Sparsity
```python
model = SparseActivationMLP(100, 256, 10, k=32)
x = torch.randn(512, 100)

with torch.no_grad():
    a1 = model.relu(model.fc1(x))
    a1_sparse = model.kwta(a1)

print("Activation ratio after k-WTA:", activation_ratio(a1_sparse))
```
Expected output:
```
Activation ratio after k-WTA: 0.125
```
Since 32 / 256 = 0.125.
This gives us exact activation sparsity.
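To double-check that the sparsity is exact rather than approximate, we can count non-zero entries per sample, continuing from the snippet above (a row could hold fewer than k non-zeros only if ReLU left it with fewer than k positive values):

```python
# Each of the 512 samples should keep at most k = 32 non-zero activations.
nonzero_per_sample = (a1_sparse != 0).sum(dim=1)
print(nonzero_per_sample.min().item(), nonzero_per_sample.max().item())  # typically 32 32
```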
7. Compute Implications
For a dense layer:
Compute ≈ batch × input_dim × output_dim
With activation sparsity, the layer that consumes a k-sparse input only needs the weight columns of its k active inputs:
Effective compute ≈ batch × k × output_dim
When k ≪ input_dim (here 32 ≪ 256), compute drops significantly.
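A back-of-the-envelope count for fc2 in the model above (multiply-accumulates only, ignoring the cost of the top-k selection itself) makes the gap concrete:

```python
batch, hidden_dim, k = 512, 256, 32

dense_macs  = batch * hidden_dim * hidden_dim   # fc2 fed a dense 256-wide input
sparse_macs = batch * k * hidden_dim            # fc2 fed a 32-sparse input

print(f"dense:  {dense_macs:,} MACs")
print(f"sparse: {sparse_macs:,} MACs ({dense_macs // sparse_macs}x fewer)")
```

This is an idealized count: a plain dense matmul still multiplies the zeros, so the saving is only realized with kernels or hardware that can skip inactive inputs.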
8. Why Activation Sparsity Matters
Activation sparsity:
- Reduces memory bandwidth
- Reduces downstream computation
- Encourages specialization
- Aligns with biological neural systems
It is also the foundation of:
- Mixture of Experts (MoE)
- Conditional computation
- Token routing in transformers
9. Comparing Weight vs Activation Sparsity
| Type | What Is Zero? | When Applied |
|---|---|---|
| Weight Sparsity | Connections | After training |
| Activation Sparsity | Neuron outputs | During forward pass |
Weight sparsity removes connections.
Activation sparsity removes activity.
Modern architectures often combine both.
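As a rough sketch of combining the two (reusing the magnitude-pruning idea from Article #2 rather than its exact code, with an arbitrary 80% pruning level), one can mask small weights and gate activations in the same layer:

```python
import torch
import torch.nn as nn

hidden = nn.Linear(256, 256)

# Weight sparsity: zero the smallest 80% of weights by magnitude.
with torch.no_grad():
    threshold = hidden.weight.abs().quantile(0.8)
    hidden.weight *= (hidden.weight.abs() >= threshold).float()

# Activation sparsity: keep only the top-32 outputs per sample.
kwta = KWinnersTakeAll(k=32)

x = torch.randn(512, 256)
a = kwta(torch.relu(hidden(x)))

print("weight sparsity:    ", (hidden.weight == 0).float().mean().item())
print("activation sparsity:", 1 - activation_ratio(a))
```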
10. What Comes Next
We have now seen two paths to sparsity:
- Remove unimportant weights
- Control which neurons activate
The next logical step:
Can we train sparse networks directly — without ever being dense?
In Article #4, we implement Sparse Training From Scratch.
Code Location
All code for this article lives in:
03_activation_sparsity/
Experiment ideas:
- Try different values of k (a starter sweep is sketched after this list)
- Visualize activation distributions
- Compare training curves
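For the first idea, a starter sweep might look like this (the particular k values are arbitrary):

```python
for k in (8, 16, 32, 64, 128):
    model = SparseActivationMLP(100, 256, 10, k=k)
    x = torch.randn(512, 100)
    with torch.no_grad():
        a1 = model.kwta(model.relu(model.fc1(x)))
    print(f"k={k:3d} -> activation ratio {activation_ratio(a1):.3f} (expected {k / 256:.3f})")
```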
Sparsity becomes powerful when it is controlled — not accidental.