What This Investigation Covers

Large eval sets are expensive to run and maintain. But shrinking them arbitrarily, by grabbing a random slice, risks discarding the items that most sharply differentiate models. Benchmark distillation is the principled alternative: use pilot data to identify the items that contribute independent statistical information and discriminate between models that would otherwise appear identical, while preserving the full set's difficulty coverage. This investigation covers the core techniques, from stratified reduction through IRT-inspired item selection, and shows how to validate that a compact benchmark is trustworthy.

What you’ll learn

01. Identifying redundant items: How to detect items whose outcomes are highly correlated with those of other items across a model population, and how removing them shrinks the benchmark without losing coverage or discriminative power.
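
As a first taste of what the worked examples will look like, here is a minimal sketch of the redundancy check, assuming a pilot outcome matrix of shape (n_models, n_items) with one score per model per item. The function name drop_redundant_items and the 0.95 correlation cutoff are illustrative choices, not fixed recommendations.

```python
import numpy as np

def drop_redundant_items(outcomes: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Greedy redundancy filter over the item columns of a pilot outcome matrix.

    outcomes  : (n_models, n_items) array of per-item scores (e.g. 0/1 correctness)
    threshold : absolute pairwise correlation above which an item is treated as
                adding little independent information
    Returns the column indices of the items to keep.
    """
    n_models, n_items = outcomes.shape
    # Items every model gets right (or wrong) carry no discriminative signal
    # on this pilot population, so set them aside before correlating.
    informative = [j for j in range(n_items) if outcomes[:, j].std() > 0]

    # Pairwise Pearson correlation between the remaining item columns.
    corr = np.corrcoef(outcomes[:, informative], rowvar=False)

    kept: list[int] = []  # positions within `informative`
    for pos in range(len(informative)):
        # Keep an item only if it is not highly correlated with anything kept so far.
        if all(abs(corr[pos, k]) < threshold for k in kept):
            kept.append(pos)
    return [informative[pos] for pos in kept]
```

The greedy scan keeps whichever member of a correlated cluster it reaches first; ordering items by pilot pass rate or estimated information before the scan is a natural refinement.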

02. Stratified reduction strategies: How to preserve the difficulty distribution and category coverage of the full benchmark while dramatically reducing item count, so the compact set doesn't accidentally over-represent easy or hard cases.
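
As a preliminary sketch of one such strategy: proportional allocation over (category, difficulty-quartile) strata, using NumPy only. The stratified_subset name, the quartile binning, and the minimum of one item per stratum are assumptions of this sketch rather than settled choices.

```python
import numpy as np
from collections import defaultdict

def stratified_subset(difficulty, category, target_size, rng=None):
    """Sample a compact item set that preserves the full benchmark's joint
    distribution over difficulty bins and task categories.

    difficulty  : per-item difficulty scores (e.g. 1 - pilot pass rate)
    category    : per-item category labels
    target_size : desired number of items in the compact benchmark
    Returns sorted indices of the selected items.
    """
    rng = rng or np.random.default_rng(0)
    difficulty = np.asarray(difficulty)
    n = len(difficulty)

    # Bin difficulty into quartiles; each stratum is a (category, difficulty bin) cell.
    edges = np.quantile(difficulty, [0.25, 0.5, 0.75])
    bins = np.digitize(difficulty, edges)
    strata = defaultdict(list)
    for i in range(n):
        strata[(category[i], bins[i])].append(i)

    selected = []
    for items in strata.values():
        # Proportional allocation: each stratum keeps roughly the share it had
        # in the full benchmark, with at least one item so coverage survives.
        quota = max(1, round(target_size * len(items) / n))
        selected.extend(rng.choice(items, size=min(quota, len(items)), replace=False))
    return sorted(int(i) for i in selected)
```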

03. IRT-inspired item selection: A practical introduction to item response theory, using pilot data to estimate item discrimination and difficulty, then selecting the items that best differentiate model capabilities across the ability range.
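
A compressed sketch of that pipeline, assuming SciPy is available: each model's standardized total score stands in for its latent ability, each item gets an independent 2PL fit via curve_fit, and items are ranked by Fisher information summed over an ability grid. This is a shortcut relative to a full marginal maximum-likelihood IRT fit, and the function names and ability proxy are illustrative.

```python
import numpy as np
from scipy.optimize import curve_fit

def two_pl(theta, a, b):
    """2PL item response curve: P(correct | ability theta) = sigmoid(a * (theta - b))."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def fit_items(outcomes):
    """Estimate per-item discrimination (a) and difficulty (b) from pilot data.

    outcomes : (n_models, n_items) 0/1 matrix. Each model's standardized total
    score serves as a proxy for ability; each item is fit independently.
    Returns an (n_items, 2) array of (a, b) estimates.
    """
    totals = outcomes.mean(axis=1)
    theta = (totals - totals.mean()) / (totals.std() + 1e-9)
    params = []
    for j in range(outcomes.shape[1]):
        try:
            (a, b), _ = curve_fit(two_pl, theta, outcomes[:, j],
                                  p0=[1.0, 0.0], maxfev=2000)
        except RuntimeError:
            a, b = 0.0, 0.0  # fit failed: treat the item as uninformative
        params.append((a, b))
    return np.array(params)

def select_items(params, k, theta_grid=np.linspace(-3, 3, 13)):
    """Pick the k items with the largest Fisher information summed over an ability grid."""
    a, b = params[:, 0], params[:, 1]
    p = two_pl(theta_grid[:, None], a, b)        # shape (grid points, items)
    info = (a ** 2 * p * (1 - p)).sum(axis=0)    # 2PL information: a^2 * p * (1 - p)
    return np.argsort(info)[::-1][:k]
```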

04. Validating the distilled benchmark: How to measure whether your reduced set produces scores that correlate tightly with full-benchmark scores across a range of held-out models, and the minimum correlation you should require before relying on the compact version.
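
A sketch of that check, assuming a set of held-out models that were not used to choose the compact set, with SciPy for the correlation statistics. The 0.98 bar below is purely a placeholder for illustration; the investigation's simulation-backed recommendation is still to come.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def validate_compact_set(outcomes_heldout, selected, min_r=0.98):
    """Compare compact-benchmark scores with full-benchmark scores on held-out models.

    outcomes_heldout : (n_models, n_items) outcome matrix for models NOT used
                       when selecting the compact item set
    selected         : indices of the items in the compact benchmark
    min_r            : correlation bar to clear before trusting the compact set
                       (0.98 is an illustrative value, not a universal standard)
    """
    full_scores = outcomes_heldout.mean(axis=1)
    compact_scores = outcomes_heldout[:, selected].mean(axis=1)
    r, _ = pearsonr(full_scores, compact_scores)
    rho, _ = spearmanr(full_scores, compact_scores)  # rank agreement matters for leaderboards
    return {"pearson": r, "spearman": rho, "passes": r >= min_r and rho >= min_r}
```

Checking rank (Spearman) agreement alongside Pearson correlation is a deliberate choice here: a compact set can track absolute scores closely while still reordering closely matched models.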

Looking for quick usage examples? Check out the Example Usage page.

Investigation in progress

Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.