What This Investigation Covers
Most eval datasets are sized by convenience (“I had 50 good test cases”) rather than by statistical need. Power analysis flips this: given the minimum difference you care about detecting, compute how large N must be to detect it reliably.
What you’ll learn
Framing eval design as a power problem: How to specify the minimum detectable effect (MDE), desired power (1−β), and significance level (α) before choosing N.
How N scales with effect size and variance: halving the MDE quadruples the required N (N ∝ σ²/MDE²); doubling the variance doubles N; and binary pass/fail metrics typically need more samples than continuous scores to detect a comparable effect, since each binary observation carries less information.
Using pilot data to estimate variance: How to run a small pilot (N=20–30) to estimate score variance, then use that to compute required N for the full study.
Power curves for common eval scenarios: Reference curves for binary pass/fail (Wilson-based power), numeric scores (bootstrap-based power), and paired vs. unpaired designs.
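As a concrete sketch of the first two items, the standard normal-approximation formula for a paired (one-sample) comparison of mean scores is N = ((z₁₋α/₂ + z₁₋β) · σ / MDE)², which makes the scaling relationships explicit. A minimal standard-library sketch (the function name `required_n` is illustrative, not from the investigation's code):

```python
import math
from statistics import NormalDist

def required_n(mde: float, sigma: float,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Normal-approximation N for a paired comparison of mean scores:
    N = ((z_{1-alpha/2} + z_{1-beta}) * sigma / mde)^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_beta) * sigma / mde) ** 2)

# Halving the MDE quadruples N; doubling the variance doubles it.
print(required_n(0.2, 1.0))             # -> 197
print(required_n(0.1, 1.0))             # -> 785  (about 4x)
print(required_n(0.2, math.sqrt(2.0)))  # -> 393  (about 2x)
```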
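The pilot-then-size workflow can be sketched the same way: estimate σ from a small pilot, then plug it into the formula. The pilot scores below are synthetic stand-ins (in practice you would collect real per-example scores from an N=20–30 pilot run):

```python
import math
from statistics import NormalDist, stdev

# Synthetic pilot: 24 per-example numeric scores in [0, 1].
pilot_scores = [0.62, 0.71, 0.55, 0.80, 0.66, 0.74, 0.58, 0.69,
                0.77, 0.60, 0.83, 0.64, 0.72, 0.57, 0.70, 0.79,
                0.61, 0.68, 0.75, 0.63, 0.81, 0.59, 0.73, 0.67]

sigma_hat = stdev(pilot_scores)  # sample standard deviation from the pilot

def n_for(mde: float, sigma: float,
          alpha: float = 0.05, power: float = 0.80) -> int:
    """Required N given an MDE and a (pilot-estimated) score std dev."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return math.ceil((z * sigma / mde) ** 2)

print(f"pilot sigma ~ {sigma_hat:.3f}")
print(f"N to detect a 0.05 improvement: {n_for(0.05, sigma_hat)}")
print(f"N to detect a 0.02 improvement: {n_for(0.02, sigma_hat)}")
```

Because the pilot estimate of σ is itself noisy, treating the resulting N as a floor (and padding it somewhat) is safer than treating it as exact.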
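And for power curves, a small simulation sketch for the binary pass/fail case. This uses a plain two-proportion z-test as a stand-in for the investigation's Wilson-based procedure, and the true pass rates 0.70 and 0.80 are made-up illustration values:

```python
import math
import random
from statistics import NormalDist

def simulated_power(n: int, p_a: float, p_b: float,
                    sims: int = 2000, alpha: float = 0.05,
                    seed: int = 0) -> float:
    """Fraction of simulated unpaired A/B evals (n examples per arm)
    where a two-proportion z-test detects the gap between true pass
    rates p_a and p_b."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(sims):
        pass_a = sum(rng.random() < p_a for _ in range(n))
        pass_b = sum(rng.random() < p_b for _ in range(n))
        pool = (pass_a + pass_b) / (2 * n)       # pooled pass rate
        se = math.sqrt(pool * (1 - pool) * 2 / n)
        if se > 0 and abs(pass_b - pass_a) / n / se > z_crit:
            hits += 1
    return hits / sims

# Power curve: probability of detecting a 0.70 -> 0.80 improvement.
for n in (50, 200, 500):
    print(n, round(simulated_power(n, 0.70, 0.80), 3))
```

A paired design (scoring both models on the same prompts) generally needs fewer samples than this unpaired sketch, since per-prompt difficulty cancels out of the comparison.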
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.