What This Investigation Covers
Most eval datasets are sized by convenience (“I had 50 good test cases”) rather than by statistical need. Power analysis flips this: given the minimum difference you care about detecting, compute how large N must be to detect it reliably.
What you’ll learn
Framing eval design as a power problem: How to specify the minimum detectable effect (MDE), desired power (1−β), and significance level (α) before choosing N.
How N scales with effect size and variance: halving the MDE quadruples the required N (N ∝ σ²/MDE²); doubling the variance doubles N; and binary pass/fail metrics typically need more samples than continuous scores to detect a comparable effect, since each binary observation carries less information.
Using pilot data to estimate variance: How to run a small pilot (N=20–30) to estimate score variance, then use that to compute required N for the full study.
Power curves for common eval scenarios: Reference curves for binary pass/fail (Wilson-based power), numeric scores (bootstrap-based power), and paired vs. unpaired designs.
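As a concrete sketch of the first two items, the standard normal-approximation formula for a paired (one-sample) comparison of mean scores is N = ((z₁₋α/₂ + z₁₋β) · σ / MDE)², which makes the scaling relationships explicit. A minimal standard-library sketch (the function name `required_n` is illustrative, not from the investigation's code):

```python
import math
from statistics import NormalDist

def required_n(mde: float, sigma: float,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation N for a paired comparison of mean scores:
    N = ((z_{1-alpha/2} + z_{1-beta}) * sigma / mde)^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / mde) ** 2)

# Halving the MDE quadruples N; doubling the variance doubles it.
print(required_n(0.2, 1.0))             # -> 197
print(required_n(0.1, 1.0))             # -> 785  (about 4x)
print(required_n(0.2, math.sqrt(2.0)))  # -> 393  (about 2x)
```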
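The pilot-then-size workflow can be sketched the same way: estimate σ from a small pilot, then plug it into the formula. The pilot scores below are synthetic stand-ins (in practice you would collect real per-example scores from an N=20–30 pilot run):

```python
import math
from statistics import NormalDist, stdev

# Synthetic pilot: 24 per-example numeric scores in [0, 1].
pilot_scores = [0.62, 0.71, 0.55, 0.80, 0.66, 0.74, 0.58, 0.69,
                0.77, 0.60, 0.83, 0.64, 0.72, 0.57, 0.70, 0.79,
                0.61, 0.68, 0.75, 0.63, 0.81, 0.59, 0.73, 0.67]

sigma_hat = stdev(pilot_scores)  # sample standard deviation from the pilot

def n_for(mde: float, sigma: float,
          alpha: float = 0.05, power: float = 0.80) -> int:
    """Required N given an MDE and a (pilot-estimated) score std dev."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil((z * sigma / mde) ** 2)

print(f"pilot sigma ~ {sigma_hat:.3f}")
print(f"N to detect a 0.05 improvement: {n_for(0.05, sigma_hat)}")
print(f"N to detect a 0.02 improvement: {n_for(0.02, sigma_hat)}")
```

Because the pilot estimate of σ is itself noisy, treating the resulting N as a floor (and padding it somewhat) is safer than treating it as exact.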
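And for power curves, a small simulation sketch for the binary pass/fail case. This uses a plain two-proportion z-test as a stand-in for the investigation's Wilson-based procedure, and the true pass rates 0.70 and 0.80 are made-up illustration values:

```python
import math
import random
from statistics import NormalDist

def simulated_power(n: int, p_a: float, p_b: float,
                    sims: int = 2000, alpha: float = 0.05,
                    seed: int = 0) -> float:
    """Fraction of simulated unpaired A/B evals (n examples per arm)
    where a two-proportion z-test detects the gap between true pass
    rates p_a and p_b."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(sims):
        pass_a = sum(rng.random() < p_a for _ in range(n))
        pass_b = sum(rng.random() < p_b for _ in range(n))
        pool = (pass_a + pass_b) / (2 * n)       # pooled pass rate
        se = math.sqrt(pool * (1 - pool) * 2 / n)
        if se > 0 and abs(pass_b - pass_a) / n / se > z_crit:
            hits += 1
    return hits / sims

# Power curve: probability of detecting a 0.70 -> 0.80 improvement.
for n in (50, 200, 500):
    print(n, round(simulated_power(n, 0.70, 0.80), 3))
```

A paired design (scoring both models on the same prompts) generally needs fewer samples than this unpaired sketch, since per-prompt difficulty cancels out of the comparison.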
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.