What This Investigation Covers
Leaderboards present rankings as if they were precise measurements. They are not. Any ranking computed from a finite eval set carries sampling uncertainty. If the confidence interval on two models' score difference includes zero, the models are statistically tied, regardless of which point estimate is larger.
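To make "statistically tied" concrete, here is a minimal sketch of a paired bootstrap on the score difference between two models evaluated on the same examples. The data is simulated and the helper name `paired_diff_ci` is illustrative, not part of any library.

```python
import numpy as np

def paired_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI on the mean score difference (A - B) over a shared eval set.

    scores_a, scores_b: per-example scores (e.g. 0/1 correctness) for the same
    N examples. Resampling example indices jointly preserves the pairing.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))   # bootstrap resamples of example indices
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)

# Illustrative example: two models that differ by about 1 point on an N=500 eval set.
rng = np.random.default_rng(1)
model_a = rng.binomial(1, 0.81, size=500)
model_b = rng.binomial(1, 0.80, size=500)
diff, (lo, hi) = paired_diff_ci(model_a, model_b)
print(f"diff = {diff:+.3f}, 95% CI = [{lo:+.3f}, {hi:+.3f}]")
# If the CI contains zero, ranking A above B is not supported at the 95% level.
```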
What you’ll learn
Computing confidence intervals on ranks, not just scores: How to bootstrap the full ranking and compute a CI on each model’s rank position, separate from its score CI (sketched in the rank-bootstrap example after this list).
Testing whether adjacently ranked models are distinguishable: How to directly test whether Model #3 and Model #4 are actually distinguishable at the stated confidence level (the rank-bootstrap example below also reports pairwise win rates for exactly this test).
Estimating the fraction of “real” improvements in a benchmark history: Given a leaderboard’s history of score improvements, how many exceed the CI noise floor? What does this look like for a typical academic NLP benchmark? (A crude noise-floor sketch follows the list.)
How N and leaderboard size interact: How the eval-set size N required to reliably rank k models scales with k, and why large leaderboards with small eval sets produce mostly noise (see the scaling simulation below).
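To sketch the first two items, the snippet below bootstraps the full ranking of k models that share one eval set, reports a CI on each model's rank, and tallies pairwise win rates that answer the adjacent-pair question. The input format (a k × N matrix of per-example scores), the simulated accuracies, and the function name `bootstrap_ranks` are assumptions made for illustration.

```python
import numpy as np

def bootstrap_ranks(score_matrix, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap rank CIs for k models sharing one eval set.

    score_matrix: shape (k, N) of per-example scores (e.g. 0/1 correctness).
    Returns (rank_ci, win_rate): rank_ci[m] is a (lo, hi) CI on model m's rank
    (1 = best); win_rate[i, j] is the fraction of resamples where model i
    outscores model j.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(score_matrix, dtype=float)
    k, n = scores.shape
    ranks = np.empty((n_boot, k), dtype=int)
    wins = np.zeros((k, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample examples, shared by all models
        means = scores[:, idx].mean(axis=1)
        order = np.argsort(-means)                # best first
        ranks[b, order] = np.arange(1, k + 1)
        wins += means[:, None] > means[None, :]
    lo = np.floor(np.quantile(ranks, alpha / 2, axis=0)).astype(int)
    hi = np.ceil(np.quantile(ranks, 1 - alpha / 2, axis=0)).astype(int)
    return list(zip(lo, hi)), wins / n_boot

# Illustrative example: 5 models with true accuracies 1 point apart, N=1000 examples.
rng = np.random.default_rng(2)
true_acc = np.array([0.80, 0.79, 0.78, 0.77, 0.76])
scores = rng.binomial(1, true_acc[:, None], size=(5, 1000))
rank_ci, win_rate = bootstrap_ranks(scores, n_boot=2000)
for m, (lo, hi) in enumerate(rank_ci):
    print(f"model {m}: rank CI = [{lo}, {hi}]")
# Adjacent-pair test (0-indexed, so leaderboard #3 vs #4 is indices 2 vs 3):
# the pair is separable at roughly the 95% level only when this win rate
# exceeds ~0.975, i.e. the bootstrap CI on their score difference excludes zero.
print("P(model 2 outscores model 3 in resamples):", win_rate[2, 3])
```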
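For the benchmark-history item, the sketch below uses a deliberately crude noise floor: the normal-approximation CI half-width on the difference of two independent accuracies. A proper analysis would bootstrap the underlying eval sets; the history values, eval-set size, and function name here are made up for illustration.

```python
import numpy as np

def real_improvement_fraction(history, n_examples, z=1.96):
    """Fraction of successive leaderboard improvements that clear a crude noise floor.

    history: best-so-far accuracies over time. n_examples: eval set size.
    The floor is z * SE of the difference of two independent accuracies.
    """
    history = np.asarray(history, dtype=float)
    deltas = np.diff(history)                     # size of each reported improvement
    p = history[:-1]
    se = np.sqrt(2 * p * (1 - p) / n_examples)    # SE of an accuracy difference
    return (deltas > z * se).mean(), deltas, z * se

frac, deltas, floor = real_improvement_fraction(
    [0.710, 0.716, 0.730, 0.733, 0.751], n_examples=10_000
)
print(f"{frac:.0%} of the reported improvements exceed the 95% noise floor")
for d, f in zip(deltas, floor):
    print(f"  +{d:.3f} vs floor {f:.3f} -> {'real' if d > f else 'within noise'}")
```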
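For the last item, here is a small simulation that estimates how often a finite eval set recovers the true ordering of k models. The one-point gaps, top accuracy, and trial counts are arbitrary assumptions chosen only to show how the exact-order probability moves with k and N.

```python
import numpy as np

def p_correct_ranking(k, n_examples, gap=0.01, top=0.80, n_trials=2000, seed=0):
    """Probability that the empirical leaderboard order matches the true order.

    Assumes k models with true accuracies spaced `gap` apart (best model at `top`),
    each evaluated on n_examples independent binary examples.
    """
    rng = np.random.default_rng(seed)
    true_acc = top - gap * np.arange(k)           # true accuracies, best first
    correct = 0
    for _ in range(n_trials):
        obs = rng.binomial(n_examples, true_acc) / n_examples   # observed accuracies
        # ranking is "correct" if the observed descending order matches the true order
        # (ties are counted as correct here for simplicity)
        if np.all(np.argsort(-obs, kind="stable") == np.arange(k)):
            correct += 1
    return correct / n_trials

for k in (3, 5, 10):
    for n in (500, 2000, 10_000):
        print(f"k={k:2d}  N={n:6d}  P(exact order recovered) = {p_correct_ranking(k, n):.2f}")
# Every adjacent pair must be resolved, and the number of adjacent pairs grows with k,
# so the N needed to keep the exact-order probability fixed grows as the leaderboard grows.
```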
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.