So far in this series, we have:
- Built dense and sparse models
- Pruned trained networks
- Controlled activation sparsity
- Trained sparse models from scratch
- Rewired networks dynamically
Now we answer the most important practical question:
Do sparse neural networks actually save compute — and at what cost?
In this article, we build a fair benchmark comparing:
- Dense model
- Static sparse model
- Dynamic sparse model
We measure:
• Parameter count
• Effective sparsity
• Training time
• Final loss
• Approximate FLOPs
As always — fully reproducible Python.
1. Benchmark Design Principles
To keep this fair, we:
• Use identical architectures (same layer sizes)
• Use the same dataset
• Use the same optimizer and learning rate
• Keep sparsity fixed (e.g., 80%)
• Train for the same number of epochs
The only difference is connectivity.
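To keep those choices in one place, it helps to pin them down as a shared configuration that every model reads from. The sketch below is illustrative; the `CONFIG` dict and its field names are our own convention, not part of the benchmark code itself.

```python
# Hypothetical shared configuration: every model in the benchmark reads from
# this dict, so the only remaining difference between runs is connectivity.
CONFIG = {
    "input_dim": 100,
    "hidden_dim": 256,
    "output_dim": 10,
    "sparsity": 0.8,   # fraction of weights removed in the sparse models
    "epochs": 20,
    "lr": 1e-3,
    "n_samples": 4096,
}
```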
2. Shared Dataset and Utilities
```python
import torch
import torch.nn as nn
import torch.optim as optim
import time


def generate_data(n=4096, input_dim=100, num_classes=10):
    X = torch.randn(n, input_dim)
    y = torch.randint(0, num_classes, (n,))
    return X, y


def count_parameters(model):
    return sum(p.numel() for p in model.parameters())


def measure_weight_sparsity(model):
    # Measure sparsity of the *effective* weights. Masked layers keep a dense
    # weight tensor and store the sparsity pattern in a mask buffer, so we apply
    # the mask (when present) before counting zeros.
    total = 0
    zeros = 0
    for module in model.modules():
        weight = getattr(module, "weight", None)
        if weight is None or weight.dim() < 2:
            continue
        mask = getattr(module, "mask", None)
        effective = weight * mask if mask is not None else weight
        total += effective.numel()
        zeros += (effective == 0).sum().item()
    return zeros / total
```
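As a quick sanity check of the utilities (our own illustrative snippet, with values that follow from the shapes rather than from a particular run):

```python
# Illustrative sanity check of the shared utilities.
layer = nn.Linear(100, 256)
print(count_parameters(layer))         # 100 * 256 + 256 = 25856
print(measure_weight_sparsity(layer))  # ~0.0 for a freshly initialized dense layer
```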
3. Dense Baseline Model
```python
class DenseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, output_dim),
        )

    def forward(self, x):
        return self.net(x)
```
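For the sizes used in the benchmark (100 → 256 → 256 → 10), the dense parameter count works out as follows; the printout is our own check, not part of the benchmark script.

```python
# Dense parameter count for input_dim=100, hidden_dim=256, output_dim=10:
#   layer 1: 100*256 + 256 = 25,856
#   layer 2: 256*256 + 256 = 65,792
#   layer 3: 256*10  + 10  =  2,570
#   total                  = 94,218
dense = DenseMLP(100, 256, 10)
print(count_parameters(dense))  # 94218
```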
4. Static Sparse Model (Masking)
```python
class StaticSparseLinear(nn.Module):
    def __init__(self, in_features, out_features, sparsity):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        mask = torch.rand(out_features, in_features)
        mask = (mask > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return x @ (self.weight * self.mask).t() + self.bias


class StaticSparseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, sparsity):
        super().__init__()
        self.fc1 = StaticSparseLinear(input_dim, hidden_dim, sparsity)
        self.fc2 = StaticSparseLinear(hidden_dim, hidden_dim, sparsity)
        self.fc3 = StaticSparseLinear(hidden_dim, output_dim, sparsity)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)
```
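One subtlety worth verifying: because the mask multiplies the weight in the forward pass, pruned positions contribute nothing to the output and receive zero gradient, so the sparsity pattern stays effective throughout training. A quick illustrative check:

```python
# Illustrative check: masked positions get zero gradient, so they stay inactive.
layer = StaticSparseLinear(100, 256, sparsity=0.8)
out = layer(torch.randn(32, 100))
out.sum().backward()

pruned = layer.mask == 0
print(layer.weight.grad[pruned].abs().max().item())  # 0.0
print(measure_weight_sparsity(layer))                # ~0.8
```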
5. Dynamic Sparse Model
We reuse the dynamic layer from Article #5.
```python
class DynamicSparseLinear(nn.Module):
    def __init__(self, in_features, out_features, sparsity):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_features))
        mask = torch.rand(out_features, in_features)
        mask = (mask > sparsity).float()
        self.register_buffer("mask", mask)

    def forward(self, x):
        return x @ (self.weight * self.mask).t() + self.bias
```
(Prune + regrow functions omitted here for brevity — see Article #5.)
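For readers who do not have Article #5 open, here is a minimal magnitude-prune + random-regrow step written against the layer above. It is a sketch of the idea under our own conventions, not necessarily identical to the Article #5 implementation.

```python
@torch.no_grad()
def prune_and_regrow(layer, fraction=0.1):
    """Drop the weakest `fraction` of active weights, then regrow the same
    number of connections at randomly chosen inactive positions (sketch only)."""
    active = layer.mask.bool()
    n_prune = int(fraction * active.sum().item())
    if n_prune == 0:
        return

    # Prune: deactivate the smallest-magnitude active weights.
    magnitudes = (layer.weight.abs() * layer.mask).flatten()
    magnitudes[~active.flatten()] = float("inf")  # never "prune" inactive slots
    drop_idx = torch.topk(magnitudes, n_prune, largest=False).indices
    layer.mask.view(-1)[drop_idx] = 0.0

    # Regrow: activate the same number of inactive positions at random
    # (may occasionally re-pick a just-pruned slot; acceptable for a sketch).
    inactive_idx = (layer.mask.view(-1) == 0).nonzero(as_tuple=True)[0]
    grow_idx = inactive_idx[torch.randperm(inactive_idx.numel())[:n_prune]]
    layer.mask.view(-1)[grow_idx] = 1.0
    layer.weight.view(-1)[grow_idx] = 0.0  # new connections start from zero
```

Because the number of pruned and regrown connections matches, the overall sparsity level stays fixed while the pattern moves.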
6. Training Function
```python
def train_model(model, X, y, epochs=20):
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    start = time.time()
    for _ in range(epochs):
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    elapsed = time.time() - start
    return loss.item(), elapsed
```
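For the dynamic model we will also want to rewire during training. One lightweight way to do that (the hook argument and the name `train_model_with_hook` are our own convention, not from the benchmark repo) is an optional per-epoch callback:

```python
def train_model_with_hook(model, X, y, epochs=20, epoch_hook=None):
    # Same full-batch loop as train_model, plus an optional callback
    # (e.g. prune/regrow) invoked at the end of each epoch.
    optimizer = optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()
    start = time.time()
    for epoch in range(epochs):
        logits = model(X)
        loss = loss_fn(logits, y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if epoch_hook is not None:
            epoch_hook(model, epoch)
    return loss.item(), time.time() - start
```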
7. Running the Benchmark
```python
X, y = generate_data()

# Dense
dense = DenseMLP(100, 256, 10)
dense_loss, dense_time = train_model(dense, X, y)

# Static Sparse
static_sparse = StaticSparseMLP(100, 256, 10, sparsity=0.8)
static_loss, static_time = train_model(static_sparse, X, y)

print("Dense Loss:", dense_loss)
print("Static Sparse Loss:", static_loss)
print("Dense Time:", dense_time)
print("Static Sparse Time:", static_time)
print("Static Sparsity:", measure_weight_sparsity(static_sparse))
```
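The run above covers the dense and static models. The dynamic model can be added the same way; the sketch below uses the `prune_and_regrow` and `train_model_with_hook` helpers defined earlier, which are our own illustrative versions.

```python
# Dynamic Sparse (sketch): same topology, but the masks are rewired during training.
class DynamicSparseMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, sparsity):
        super().__init__()
        self.fc1 = DynamicSparseLinear(input_dim, hidden_dim, sparsity)
        self.fc2 = DynamicSparseLinear(hidden_dim, hidden_dim, sparsity)
        self.fc3 = DynamicSparseLinear(hidden_dim, output_dim, sparsity)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.relu(self.fc1(x))
        x = self.relu(self.fc2(x))
        return self.fc3(x)


def rewire(model, epoch):
    # Rewire every 5 epochs (cadence chosen arbitrarily for this sketch).
    if (epoch + 1) % 5 == 0:
        for module in model.modules():
            if isinstance(module, DynamicSparseLinear):
                prune_and_regrow(module, fraction=0.1)


dynamic_sparse = DynamicSparseMLP(100, 256, 10, sparsity=0.8)
dynamic_loss, dynamic_time = train_model_with_hook(dynamic_sparse, X, y, epoch_hook=rewire)

print("Dynamic Sparse Loss:", dynamic_loss)
print("Dynamic Sparse Time:", dynamic_time)
print("Dynamic Sparsity:", measure_weight_sparsity(dynamic_sparse))
```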
8. What You Should Expect
On CPU, you may observe:
• Similar training time (masking still performs a full dense matmul)
• Similar final loss (at moderate sparsity)
• High measured sparsity (≈ 0.8)
Important insight:
Naive masking does NOT automatically speed up computation.
True speedups require (see the kernel sketch after this list):
• Sparse matrix kernels
• Block sparsity
• Hardware support
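To make the kernel point concrete, here is one way to route the masked weights through PyTorch's sparse COO matmul. This is illustrative only; at these layer sizes and with unstructured 80% sparsity, the sparse kernel may well be slower than the dense matmul on CPU or GPU.

```python
# Illustrative: run one masked layer through a true sparse (COO) matmul kernel.
layer = static_sparse.fc1
with torch.no_grad():
    w_eff = layer.weight * layer.mask
    w_sparse = w_eff.to_sparse()  # COO sparse tensor

    x = torch.randn(4096, 100)
    dense_out = x @ w_eff.t() + layer.bias
    sparse_out = torch.sparse.mm(w_sparse, x.t()).t() + layer.bias

print(torch.allclose(dense_out, sparse_out, atol=1e-5))  # same math, different kernel
```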
9. Approximate FLOPs Comparison
For hidden_dim = 256:
First-layer multiply-accumulates per input sample (ignoring biases):
100 × 256 = 25,600
At 80% sparsity, only the surviving 20% of connections do useful work:
Effective ≈ 0.2 × 25,600 = 5,120
Theoretical savings are large.
Practical savings depend on hardware.
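A small helper makes the same estimate for the full three-layer MLP. The function below is our own rough estimator (multiply-accumulates per sample, biases ignored), not something from the benchmark repo.

```python
def approx_macs_per_sample(layer_dims, density=1.0):
    # Rough multiply-accumulate count per input sample for an MLP,
    # scaled by the fraction of surviving weights (biases ignored).
    return sum(d_in * d_out for d_in, d_out in zip(layer_dims[:-1], layer_dims[1:])) * density

dims = [100, 256, 256, 10]
print(approx_macs_per_sample(dims))               # 93,696 dense MACs per sample
print(approx_macs_per_sample(dims, density=0.2))  # ~18,739 at 80% sparsity
```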
10. Key Takeaways
- Sparse models can match dense accuracy at high sparsity.
- Structural sparsity alone does not guarantee runtime speedup.
- Real gains require hardware-aware sparsity.
- Sparse scaling is about compute control — not just compression.
11. Where This Leads
We’ve now benchmarked sparsity at the structural (weight) level.
The next frontier is:
Sparse Attention and Mixture of Experts
Where sparsity becomes conditional computation at scale.
Code Location
All benchmark code lives in:
06_dense_vs_sparse_benchmark/
Try:
• Increasing sparsity to 90%
• Measuring GPU runtime
• Comparing static vs dynamic sparse
Sparse neural networks are powerful.
But only when measured carefully.