So far in this series, we have:
- Built dense and sparse models
- Pruned trained networks
- Controlled activation sparsity
- Trained sparse models from scratch
- Rewired networks dynamically
Now we answer the most important practical question:
Do sparse neural networks actually save compute — and at what cost?
In this article, we build a fair benchmark comparing:
- Dense model
- Static sparse model
- Dynamic sparse model
We measure:
• Parameter count
• Effective sparsity
• Training time
• Final loss
• Approximate FLOPs
As always — fully reproducible Python.
1. Benchmark Design Principles
To keep this fair, we:
• Use identical architectures (same layer sizes)
• Use the same dataset
• Use the same optimizer and learning rate
• Keep sparsity fixed (e.g., 80%)
• Train for the same number of epochs
The only difference is connectivity.
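To keep those choices in one place, it helps to pin them down as a shared configuration that every model reads from. The sketch below is illustrative; the `CONFIG` dict and its field names are our own convention, not part of the benchmark code itself.

```python
# Hypothetical shared configuration: every model in the benchmark reads from
# this dict, so the only remaining difference between runs is connectivity.
CONFIG = {
    "input_dim": 100,
    "hidden_dim": 256,
    "output_dim": 10,
    "sparsity": 0.8,   # fraction of weights removed in the sparse models
    "epochs": 20,
    "lr": 1e-3,
    "n_samples": 4096,
}
```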
2. Shared Dataset and Utilities
```python
import torch
import torch.nn as nn
import torch.optim as optim
import time


def generate_data(n=4096, input_dim=100, num_classes=10):
    X = torch.randn(n, input_dim)
    y = torch.randint(0, num_classes, (n,))
    return X, y


def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


def measure_weight_sparsity(model):
    # Measure sparsity of the *effective* weights. Masked layers keep a dense
    # weight tensor and store the sparsity pattern in a mask buffer, so we apply
    # the mask (when present) before counting zeros.
    total = 0
    zeros = 0
    for module in model.modules():
        weight = getattr(module, "weight", None)
        if weight is None or weight.dim() < 2:
            continue
        mask = getattr(module, "mask", None)
        effective = weight * mask if mask is not None else weight
        total += effective.numel()
        zeros += (effective == 0).sum().item()
    return zeros / total
```
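As a quick sanity check of the utilities (our own illustrative snippet, with values that follow from the shapes rather than from a particular run):

```python
# Illustrative sanity check of the shared utilities.
layer = nn.Linear(100, 256)
print(count_parameters(layer))         # 100 * 256 + 256 = 25856
print(measure_weight_sparsity(layer))  # ~0.0 for a freshly initialized dense layer
```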
3. Dense Baseline Model
```python
class DenseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)
```
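For the sizes used in the benchmark (100 → 256 → 256 → 10), the dense parameter count works out as follows; the printout is our own check, not part of the benchmark script.

```python
# Dense parameter count for input_dim=100, hidden_dim=256, output_dim=10:
#   layer 1: 100*256 + 256 = 25,856
#   layer 2: 256*256 + 256 = 65,792
#   layer 3: 256*10  + 10  =  2,570
#   total                  = 94,218
dense = DenseMLP(100, 256, 10)
print(count_parameters(dense))  # 94218
```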
4. Static Sparse Model (Masking)
```python
class StaticSparseLinear(nn.Module):
    def __init__(self, in_features, out_features, sparsity):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        mask = torch.rand(out_features, in_features)
        mask = (mask > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return x @ (self.weight * self.mask).t() + self.bias


class StaticSparseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, sparsity):
        super().__init__()
        self.fc1 = StaticSparseLinear(input_dim, hidden_dim, sparsity)
        self.fc2 = StaticSparseLinear(hidden_dim, hidden_dim, sparsity)
        self.fc3 = StaticSparseLinear(hidden_dim, output_dim, sparsity)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
```
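One subtlety worth verifying: because the mask multiplies the weight in the forward pass, pruned positions contribute nothing to the output and receive zero gradient, so the sparsity pattern stays effective throughout training. A quick illustrative check:

```python
# Illustrative check: masked positions get zero gradient, so they stay inactive.
layer = StaticSparseLinear(100, 256, sparsity=0.8)
out = layer(torch.randn(32, 100))
out.sum().backward()

pruned = layer.mask == 0
print(layer.weight.grad[pruned].abs().max().item())  # 0.0
print(measure_weight_sparsity(layer))                # ~0.8
```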
5. Dynamic Sparse Model
We reuse the dynamic layer from Article #5.
```python
class DynamicSparseLinear(nn.Module):
    def __init__(self, in_features, out_features, sparsity):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        mask = torch.rand(out_features, in_features)
        mask = (mask > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return x @ (self.weight * self.mask).t() + self.bias
```
(Prune + regrow functions omitted here for brevity — see Article #5.)
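For readers who do not have Article #5 open, here is a minimal magnitude-prune + random-regrow step written against the layer above. It is a sketch of the idea under our own conventions, not necessarily identical to the Article #5 implementation.

```python
@torch.no_grad()
def prune_and_regrow(layer, fraction=0.1):
    """Drop the weakest `fraction` of active weights, then regrow the same
    number of connections at randomly chosen inactive positions (sketch only)."""
    active = layer.mask.bool()
    n_prune = int(fraction * active.sum().item())
    if n_prune == 0:
        return

    # Prune: deactivate the smallest-magnitude active weights.
    magnitudes = (layer.weight.abs() * layer.mask).flatten()
    magnitudes[~active.flatten()] = float("inf")  # never "prune" inactive slots
    drop_idx = torch.topk(magnitudes, n_prune, largest=False).indices
    layer.mask.view(-1)[drop_idx] = 0.0

    # Regrow: activate the same number of inactive positions at random
    # (may occasionally re-pick a just-pruned slot; acceptable for a sketch).
    inactive_idx = (layer.mask.view(-1) == 0).nonzero(as_tuple=True)[0]
    grow_idx = inactive_idx[torch.randperm(inactive_idx.numel())[:n_prune]]
    layer.mask.view(-1)[grow_idx] = 1.0
    layer.weight.view(-1)[grow_idx] = 0.0  # new connections start from zero
```

Because the number of pruned and regrown connections matches, the overall sparsity level stays fixed while the pattern moves.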
6. Training Function
```python
def train_model(model, X, y, epochs=20):
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    start = time.time()
    for _ in range(epochs):
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    elapsed = time.time() - start
    return loss.item(), elapsed
```
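For the dynamic model we will also want to rewire during training. One lightweight way to do that (the hook argument and the name `train_model_with_hook` are our own convention, not from the benchmark repo) is an optional per-epoch callback:

```python
def train_model_with_hook(model, X, y, epochs=20, epoch_hook=None):
    # Same full-batch loop as train_model, plus an optional callback
    # (e.g. prune/regrow) invoked at the end of each epoch.
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    start = time.time()
    for epoch in range(epochs):
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch_hook is not None:
            epoch_hook(model, epoch)
    return loss.item(), time.time() - start
```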
7. Running the Benchmark
```python
X, y = generate_data()

# Dense
dense = DenseMLP(100, 256, 10)
dense_loss, dense_time = train_model(dense, X, y)

# Static Sparse
static_sparse = StaticSparseMLP(100, 256, 10, sparsity=0.8)
static_loss, static_time = train_model(static_sparse, X, y)

print("Dense Loss:", dense_loss)
print("Static Sparse Loss:", static_loss)
print("Dense Time:", dense_time)
print("Static Sparse Time:", static_time)
print("Static Sparsity:", measure_weight_sparsity(static_sparse))
```
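The run above covers the dense and static models. The dynamic model can be added the same way; the sketch below uses the `prune_and_regrow` and `train_model_with_hook` helpers defined earlier, which are our own illustrative versions.

```python
# Dynamic Sparse (sketch): same topology, but the masks are rewired during training.
class DynamicSparseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, sparsity):
        super().__init__()
        self.fc1 = DynamicSparseLinear(input_dim, hidden_dim, sparsity)
        self.fc2 = DynamicSparseLinear(hidden_dim, hidden_dim, sparsity)
        self.fc3 = DynamicSparseLinear(hidden_dim, output_dim, sparsity)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)


def rewire(model, epoch):
    # Rewire every 5 epochs (cadence chosen arbitrarily for this sketch).
    if (epoch + 1) % 5 == 0:
        for module in model.modules():
            if isinstance(module, DynamicSparseLinear):
                prune_and_regrow(module, fraction=0.1)


dynamic_sparse = DynamicSparseMLP(100, 256, 10, sparsity=0.8)
dynamic_loss, dynamic_time = train_model_with_hook(dynamic_sparse, X, y, epoch_hook=rewire)

print("Dynamic Sparse Loss:", dynamic_loss)
print("Dynamic Sparse Time:", dynamic_time)
print("Dynamic Sparsity:", measure_weight_sparsity(dynamic_sparse))
```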
8. What You Should Expect
On CPU, you may observe:
• Similar training time (masking still performs a full dense matmul)
• Similar final loss (at moderate sparsity)
• High measured sparsity (≈ 0.8)
Important insight:
Naive masking does NOT automatically speed up computation.
True speedups require (see the kernel sketch after this list):
• Sparse matrix kernels
• Block sparsity
• Hardware support
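To make the kernel point concrete, here is one way to route the masked weights through PyTorch's sparse COO matmul. This is illustrative only; at these layer sizes and with unstructured 80% sparsity, the sparse kernel may well be slower than the dense matmul on CPU or GPU.

```python
# Illustrative: run one masked layer through a true sparse (COO) matmul kernel.
layer = static_sparse.fc1
with torch.no_grad():
    w_eff = layer.weight * layer.mask
    w_sparse = w_eff.to_sparse()  # COO sparse tensor

    x = torch.randn(4096, 100)
    dense_out = x @ w_eff.t() + layer.bias
    sparse_out = torch.sparse.mm(w_sparse, x.t()).t() + layer.bias

print(torch.allclose(dense_out, sparse_out, atol=1e-5))  # same math, different kernel
```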
9. Approximate FLOPs Comparison
For hidden_dim = 256:
First-layer multiply-accumulates per input sample (ignoring biases):
100 × 256 = 25,600
At 80% sparsity, only the surviving 20% of connections do useful work:
Effective ≈ 0.2 × 25,600 = 5,120
Theoretical savings are large.
Practical savings depend on hardware.
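A small helper makes the same estimate for the full three-layer MLP. The function below is our own rough estimator (multiply-accumulates per sample, biases ignored), not something from the benchmark repo.

```python
def approx_macs_per_sample(layer_dims, density=1.0):
    # Rough multiply-accumulate count per input sample for an MLP,
    # scaled by the fraction of surviving weights (biases ignored).
    return sum(d_in * d_out for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:])) * density

dims = [100, 256, 256, 10]
print(approx_macs_per_sample(dims))               # 93,696 dense MACs per sample
print(approx_macs_per_sample(dims, density=0.2))  # ~18,739 at 80% sparsity
```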
10. Key Takeaways
- Sparse models can match dense accuracy at high sparsity.
- Structural sparsity alone does not guarantee runtime speedup.
- Real gains require hardware-aware sparsity.
- Sparse scaling is about compute control — not just compression.
11. Where This Leads
We’ve now benchmarked sparsity at the structural (weight) level.
The next frontier is:
Sparse Attention and Mixture of Experts
Where sparsity becomes conditional computation at scale.
Code Location
All benchmark code lives in:
06_dense_vs_sparse_benchmark/
Try:
• Increasing sparsity to 90%
• Measuring GPU runtime
• Comparing static vs dynamic sparse
Sparse neural networks are powerful.
But only when measured carefully.