
Dense vs Sparse: A Fair Benchmark in Python

So far in this series, we have:

  1. Built dense and sparse models
  2. Pruned trained networks
  3. Controlled activation sparsity
  4. Trained sparse models from scratch
  5. Rewired networks dynamically

Now we answer the most important practical question:

Do sparse neural networks actually save compute — and at what cost?

In this article, we build a fair benchmark comparing:

  • Dense model
  • Static sparse model
  • Dynamic sparse model

We measure:

• Parameter count
• Effective sparsity
• Training time
• Final loss
• Approximate FLOPs

As always — fully reproducible Python.

1. Benchmark Design Principles

To keep this fair, we:

• Use identical architectures (same layer sizes)
• Use the same dataset
• Use the same optimizer
• Keep sparsity fixed (e.g., 80%)
• Train for the same number of epochs

The only difference is connectivity.
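
To make these constraints explicit, it helps to pin them in one shared configuration that every model reads from. A minimal sketch, with illustrative names that are not part of the benchmark repo (the values match the runs later in this article):

Python
import torch

# Shared benchmark configuration (illustrative; values match the runs below).
CONFIG = {
    "input_dim": 100,
    "hidden_dim": 256,
    "output_dim": 10,
    "sparsity": 0.8,   # fixed for all sparse variants
    "epochs": 20,
    "lr": 1e-3,
    "seed": 0,
}

# Reseeding before building each model is one simple way to keep
# initialization and data comparable across runs.
torch.manual_seed(CONFIG["seed"])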

2. Shared Dataset and Utilities

Python
import torch
import torch.nn as nn
import torch.optim as optim
import time
def generate_data(n=4096, input_dim=100, num_classes=10):
    """Synthetic classification data: random features, random labels."""
    X = torch.randn(n, input_dim)
    y = torch.randint(0, num_classes, (n,))
    return X, y

def count_parameters(model):
    return sum(p.numel() for p in model.parameters())

def measure_weight_sparsity(model):
    """Fraction of zeros in the effective weights (mask applied if present)."""
    total = 0
    zeros = 0
    for module in model.modules():
        weight = getattr(module, "weight", None)
        if weight is None or weight.dim() < 2:
            continue
        # Masked layers keep dense weight tensors; the zeros live in the mask.
        mask = getattr(module, "mask", None)
        effective = weight * mask if mask is not None else weight
        total += effective.numel()
        zeros += (effective == 0).sum().item()
    return zeros / total

3. Dense Baseline Model

Python
class DenseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)

4. Static Sparse Model (Masking)

Python
class StaticSparseLinear(nn.Module):
    def __init__(self, in_features, out_features, sparsity):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Fixed random mask: each weight is kept with probability (1 - sparsity).
        mask = torch.rand(out_features, in_features)
        mask = (mask > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        # The masked weights stay in memory, so this is still a dense matmul.
        return x @ (self.weight * self.mask).t() + self.bias


class StaticSparseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, sparsity):
        super().__init__()
        self.fc1 = StaticSparseLinear(input_dim, hidden_dim, sparsity)
        self.fc2 = StaticSparseLinear(hidden_dim, hidden_dim, sparsity)
        self.fc3 = StaticSparseLinear(hidden_dim, output_dim, sparsity)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
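
A quick sanity check, separate from the benchmark itself: at 80% sparsity the mask should keep roughly 20% of the connections.

Python
layer = StaticSparseLinear(100, 256, sparsity=0.8)
print("Kept fraction:", layer.mask.mean().item())       # ≈ 0.2
print("Zero fraction:", 1 - layer.mask.mean().item())   # ≈ 0.8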

5. Dynamic Sparse Model

We reuse the dynamic layer from Article #5.

Python
class DynamicSparseLinear(nn.Module):
    def __init__(self, in_features, out_features, sparsity):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        # Same masked layer as StaticSparseLinear; the difference is that the
        # mask buffer gets rewired (pruned and regrown) during training.
        mask = torch.rand(out_features, in_features)
        mask = (mask > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return x @ (self.weight * self.mask).t() + self.bias

(Prune + regrow functions omitted here for brevity — see Article #5.)
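
If you do not have Article #5 at hand, the sketch below shows one possible shape of the rewiring step: prune the smallest-magnitude active weights, then regrow the same number of connections at random. This is an illustrative sketch, not necessarily the exact version used in Article #5.

Python
@torch.no_grad()
def prune_and_regrow(layer, fraction=0.1):
    """Drop the weakest active connections, then regrow as many at random."""
    active = layer.mask.bool()
    n_change = max(1, int(fraction * active.sum().item()))

    # Prune: zero the mask for the smallest-magnitude active weights.
    magnitudes = (layer.weight * layer.mask).abs()
    magnitudes[~active] = float("inf")  # never select already-inactive slots
    _, prune_idx = torch.topk(magnitudes.flatten(), n_change, largest=False)
    layer.mask.view(-1)[prune_idx] = 0.0

    # Regrow: reactivate the same number of inactive connections at random
    # (a new connection may occasionally land on a just-pruned slot, which
    # is acceptable for a sketch).
    inactive_idx = (layer.mask.view(-1) == 0).nonzero(as_tuple=True)[0]
    grow_idx = inactive_idx[torch.randperm(len(inactive_idx))[:n_change]]
    layer.mask.view(-1)[grow_idx] = 1.0
    layer.weight.view(-1)[grow_idx] = 0.0  # regrown weights start from zero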

6. Training Function

Python
def train_model(model, X, y, epochs=20):
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    start = time.time()
    for _ in range(epochs):
        # Full-batch training keeps the timing comparison simple.
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    elapsed = time.time() - start
    return loss.item(), elapsed

7. Running the Benchmark

Python
X, y = generate_data()
# Dense
dense = DenseMLP(100, 256, 10)
dense_loss, dense_time = train_model(dense, X, y)
# Static Sparse
static_sparse = StaticSparseMLP(100, 256, 10, sparsity=0.8)
static_loss, static_time = train_model(static_sparse, X, y)
print("Dense Loss:", dense_loss)
print("Static Sparse Loss:", static_loss)
print("Dense Time:", dense_time)
print("Static Sparse Time:", static_time)
print("Static Sparsity:", measure_weight_sparsity(static_sparse))

8. What You Should Expect

On CPU, you may observe:

• Similar training time (masking still computes dense matmul)
• Similar final loss (at moderate sparsity)
• High measured sparsity (≈ 0.8)

Important insight:

Naive masking does NOT automatically speed up computation.

True speedups require one or more of the following (see the torch.sparse sketch after this list):

• Sparse matrix kernels
• Block sparsity
• Hardware support
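
As a rough illustration of the first point, PyTorch ships sparse tensor kernels that operate only on stored values. The sketch below swaps the masked dense matmul for torch.sparse.mm; whether it is actually faster depends on the sparsity level, the matrix shapes, and the backend, so treat it as an experiment rather than a guaranteed optimization.

Python
def sparse_linear_forward(x, weight, mask, bias):
    """Linear forward pass using a sparse (COO) weight matrix."""
    w_sparse = (weight * mask).to_sparse()  # only nonzero entries are stored
    # torch.sparse.mm takes the sparse operand first: (out, in) @ (in, batch).
    return torch.sparse.mm(w_sparse, x.t()).t() + bias

# Example with the static sparse layer defined above:
layer = StaticSparseLinear(100, 256, sparsity=0.8)
x = torch.randn(32, 100)
out = sparse_linear_forward(x, layer.weight, layer.mask, layer.bias)
print(out.shape)  # torch.Size([32, 256])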

9. Approximate FLOPs Comparison

For input_dim = 100 and hidden_dim = 256, the first linear layer performs roughly one multiply-accumulate per weight, per sample.

Dense first-layer FLOPs (approximate):

100 × 256 = 25,600

At 80% sparsity, only about 20% of those weights are active:

Effective ≈ 0.2 × 25,600 = 5,120

Theoretical savings are large.

Practical savings depend on hardware.
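
The same arithmetic extends to the whole model. Here is a small helper that counts one multiply-accumulate per active weight per sample (an approximation, not a strict FLOP count):

Python
def approx_flops(model):
    """Approximate per-sample FLOPs: one MAC per weight, masked weights excluded."""
    dense_flops = 0
    effective_flops = 0
    for module in model.modules():
        weight = getattr(module, "weight", None)
        if weight is None or weight.dim() < 2:
            continue
        dense_flops += weight.numel()
        mask = getattr(module, "mask", None)
        active = mask.sum().item() if mask is not None else weight.numel()
        effective_flops += int(active)
    return dense_flops, effective_flops

print(approx_flops(dense))          # (93696, 93696)
print(approx_flops(static_sparse))  # (93696, ≈18700 at 80% sparsity)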

10. Key Takeaways

  1. Sparse models can match dense accuracy at high sparsity.
  2. Structural sparsity alone does not guarantee runtime speedup.
  3. Real gains require hardware-aware sparsity.
  4. Sparse scaling is about compute control — not just compression.

11. Where This Leads

We’ve now validated sparsity structurally.

The next frontier is:

Sparse Attention and Mixture of Experts

Where sparsity becomes conditional computation at scale.

Code Location

All benchmark code lives in:

06_dense_vs_sparse_benchmark/

Try:

• Increasing sparsity to 90%
• Measuring GPU runtime (see the timing sketch below)
• Comparing static vs dynamic sparse
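
For the GPU experiment, keep in mind that CUDA kernels launch asynchronously, so synchronize before reading the clock. A minimal timing sketch:

Python
if torch.cuda.is_available():
    device = torch.device("cuda")
    X_gpu, y_gpu = X.to(device), y.to(device)

    model = DenseMLP(100, 256, 10).to(device)
    torch.cuda.synchronize()
    start = time.time()
    loss, _ = train_model(model, X_gpu, y_gpu)
    torch.cuda.synchronize()
    print("GPU dense loss:", loss)
    print("GPU dense time:", time.time() - start)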

Sparse neural networks are powerful.

But only when measured carefully.