What This Investigation Covers
Leaderboards present rankings as if they were precise measurements. They are not. Any ranking computed from a finite eval set carries sampling uncertainty. If the confidence interval on two models' score difference includes zero, the models are statistically tied, regardless of which point estimate is larger.
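To make "statistically tied" concrete, here is a minimal sketch of a paired bootstrap on the score difference between two models evaluated on the same examples. The data is simulated and the helper name `paired_diff_ci` is illustrative, not part of any library.

```python
import numpy as np

def paired_diff_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap CI on the mean score difference (A - B) over a shared eval set.

    scores_a, scores_b: per-example scores (e.g. 0/1 correctness) for the same
    N examples. Resampling example indices jointly preserves the pairing.
    """
    rng = np.random.default_rng(seed)
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    n = len(a)
    idx = rng.integers(0, n, size=(n_boot, n))   # bootstrap resamples of example indices
    diffs = a[idx].mean(axis=1) - b[idx].mean(axis=1)
    lo, hi = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lo, hi)

# Illustrative example: two models that differ by about 1 point on an N=500 eval set.
rng = np.random.default_rng(1)
model_a = rng.binomial(1, 0.81, size=500)
model_b = rng.binomial(1, 0.80, size=500)
diff, (lo, hi) = paired_diff_ci(model_a, model_b)
print(f"diff = {diff:+.3f}, 95% CI = [{lo:+.3f}, {hi:+.3f}]")
# If the CI contains zero, ranking A above B is not supported at the 95% level.
```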
What you’ll learn
Computing confidence intervals on ranks, not just scores: How to bootstrap the full ranking and compute a CI on each model’s rank position, separate from its score CI (sketched in the rank-bootstrap example after this list).
Testing whether adjacently ranked models are distinguishable: How to directly test whether Model #3 and Model #4 are actually distinguishable at the stated confidence level (the rank-bootstrap example below also reports pairwise win rates for exactly this test).
Estimating the fraction of “real” improvements in a benchmark history: Given a leaderboard’s history of score improvements, how many exceed the CI noise floor? What does this look like for a typical academic NLP benchmark? (A crude noise-floor sketch follows the list.)
How N and leaderboard size interact: How the eval-set size N required to reliably rank k models scales with k, and why large leaderboards with small eval sets produce mostly noise (see the scaling simulation below).
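To sketch the first two items, the snippet below bootstraps the full ranking of k models that share one eval set, reports a CI on each model's rank, and tallies pairwise win rates that answer the adjacent-pair question. The input format (a k × N matrix of per-example scores), the simulated accuracies, and the function name `bootstrap_ranks` are assumptions made for illustration.

```python
import numpy as np

def bootstrap_ranks(score_matrix, n_boot=10_000, alpha=0.05, seed=0):
    """Bootstrap rank CIs for k models sharing one eval set.

    score_matrix: shape (k, N) of per-example scores (e.g. 0/1 correctness).
    Returns (rank_ci, win_rate): rank_ci[m] is a (lo, hi) CI on model m's rank
    (1 = best); win_rate[i, j] is the fraction of resamples where model i
    outscores model j.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(score_matrix, dtype=float)
    k, n = scores.shape
    ranks = np.empty((n_boot, k), dtype=int)
    wins = np.zeros((k, k))
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample examples, shared by all models
        means = scores[:, idx].mean(axis=1)
        order = np.argsort(-means)                # best first
        ranks[b, order] = np.arange(1, k + 1)
        wins += means[:, None] > means[None, :]
    lo = np.floor(np.quantile(ranks, alpha / 2, axis=0)).astype(int)
    hi = np.ceil(np.quantile(ranks, 1 - alpha / 2, axis=0)).astype(int)
    return list(zip(lo, hi)), wins / n_boot

# Illustrative example: 5 models with true accuracies 1 point apart, N=1000 examples.
rng = np.random.default_rng(2)
true_acc = np.array([0.80, 0.79, 0.78, 0.77, 0.76])
scores = rng.binomial(1, true_acc[:, None], size=(5, 1000))
rank_ci, win_rate = bootstrap_ranks(scores, n_boot=2000)
for m, (lo, hi) in enumerate(rank_ci):
    print(f"model {m}: rank CI = [{lo}, {hi}]")
# Adjacent-pair test (0-indexed, so leaderboard #3 vs #4 is indices 2 vs 3):
# the pair is separable at roughly the 95% level only when this win rate
# exceeds ~0.975, i.e. the bootstrap CI on their score difference excludes zero.
print("P(model 2 outscores model 3 in resamples):", win_rate[2, 3])
```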
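For the benchmark-history item, the sketch below uses a deliberately crude noise floor: the normal-approximation CI half-width on the difference of two independent accuracies. A proper analysis would bootstrap the underlying eval sets; the history values, eval-set size, and function name here are made up for illustration.

```python
import numpy as np

def real_improvement_fraction(history, n_examples, z=1.96):
    """Fraction of successive leaderboard improvements that clear a crude noise floor.

    history: best-so-far accuracies over time. n_examples: eval set size.
    The floor is z * SE of the difference of two independent accuracies.
    """
    history = np.asarray(history, dtype=float)
    deltas = np.diff(history)                     # size of each reported improvement
    p = history[:-1]
    se = np.sqrt(2 * p * (1 - p) / n_examples)    # SE of an accuracy difference
    return (deltas > z * se).mean(), deltas, z * se

frac, deltas, floor = real_improvement_fraction(
    [0.710, 0.716, 0.730, 0.733, 0.751], n_examples=10_000
)
print(f"{frac:.0%} of the reported improvements exceed the 95% noise floor")
for d, f in zip(deltas, floor):
    print(f"  +{d:.3f} vs floor {f:.3f} -> {'real' if d > f else 'within noise'}")
```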
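For the last item, here is a small simulation that estimates how often a finite eval set recovers the true ordering of k models. The one-point gaps, top accuracy, and trial counts are arbitrary assumptions chosen only to show how the exact-order probability moves with k and N.

```python
import numpy as np

def p_correct_ranking(k, n_examples, gap=0.01, top=0.80, n_trials=2000, seed=0):
    """Probability that the empirical leaderboard order matches the true order.

    Assumes k models with true accuracies spaced `gap` apart (best model at `top`),
    each evaluated on n_examples independent binary examples.
    """
    rng = np.random.default_rng(seed)
    true_acc = top - gap * np.arange(k)           # true accuracies, best first
    correct = 0
    for _ in range(n_trials):
        obs = rng.binomial(n_examples, true_acc) / n_examples   # observed accuracies
        # ranking is "correct" if the observed descending order matches the true order
        # (ties are counted as correct here for simplicity)
        if np.all(np.argsort(-obs, kind="stable") == np.arange(k)):
            correct += 1
    return correct / n_trials

for k in (3, 5, 10):
    for n in (500, 2000, 10_000):
        print(f"k={k:2d}  N={n:6d}  P(exact order recovered) = {p_correct_ranking(k, n):.2f}")
# Every adjacent pair must be resolved, and the number of adjacent pairs grows with k,
# so the N needed to keep the exact-order probability fixed grows as the leaderboard grows.
```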
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.