What This Investigation Covers
Fine-tuning comparisons are high-stakes: a spurious positive could mean shipping a regression, and a spurious negative could mean discarding real progress. The paired structure of before/after evals (same inputs, two checkpoints) is statistically powerful when used correctly.
What you’ll learn
Framing before/after comparison as a paired test: Why scores from the same inputs on two checkpoints should be analyzed as per-input differences, not as two independent samples, and how pairing dramatically reduces the required N.
Setting a deployment confidence interval (CI) threshold: How to decide in advance what CI gap justifies shipping, and how to frame this as a statistical decision rule rather than a subjective call.
Detecting category-level regressions: How to check whether an overall improvement masks regressions on specific task subsets, and how to correct for multiple category comparisons.
Minimum detectable effect at your sample size: How to know before running the eval whether your N is sufficient to detect the effect size you care about.
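The paired framing above can be sketched in a few lines. This is a minimal illustration, not the investigation's actual code: the function name and the toy scores are ours, and the CI uses a normal approximation that is reasonable for large N.

```python
# Paired before/after comparison: analyze per-input differences,
# not two independent samples. (Illustrative sketch; toy data.)
import math
import statistics

def paired_ci(before, after, z=1.96):
    """Mean of per-input score differences with an approximate 95% CI
    (normal approximation; adequate for large N)."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean, (mean - z * se, mean + z * se)

# Toy data: "after" scores the same inputs consistently ~0.05 higher.
before = [0.60, 0.55, 0.70, 0.65, 0.50, 0.62, 0.58, 0.67]
after  = [0.66, 0.59, 0.74, 0.71, 0.55, 0.66, 0.64, 0.71]
mean, (lo, hi) = paired_ci(before, after)
print(f"mean diff = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Because each difference cancels the per-input difficulty shared by both checkpoints, the CI here excludes zero even at N=8, where two unpaired samples of these scores would not separate.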
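For the category-level check, one standard correction for multiple comparisons is Holm's step-down procedure. A minimal sketch, with made-up category names and p-values; this names one common technique, not necessarily the one the full write-up will use:

```python
# Holm–Bonferroni step-down over per-category p-values.
# (Illustrative sketch; categories and p-values are invented.)
def holm_flags(pvalues, alpha=0.05):
    """Return which hypotheses are rejected under Holm's procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # step-down: stop at the first failure
    return rejected

categories = ["math", "coding", "summarization", "safety"]
pvals = [0.001, 0.030, 0.200, 0.004]
flags = holm_flags(pvals)
for cat, p, rej in zip(categories, pvals, flags):
    print(f"{cat:14s} p={p:.3f} {'significant' if rej else 'not significant'}")
```

Note that coding's raw p=0.030 would pass an uncorrected 0.05 cutoff but fails after correction, which is exactly the kind of false category-level "regression detected" (or "improvement detected") the correction guards against.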
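The minimum-detectable-effect question can be answered before running the eval with the standard power formula MDE = (z_{1-α/2} + z_{1-β}) · sd / √N, where sd is the standard deviation of per-input score differences. The sd value below is an assumed pilot estimate, not a measured result:

```python
# Minimum detectable effect for a paired eval at a given sample size.
# sd_diff = std. dev. of per-input score differences (assumed from a pilot).
import math
from statistics import NormalDist

def mde(n, sd_diff, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * sd_diff / math.sqrt(n)

# e.g. with N=500 paired prompts and sd_diff ≈ 0.30, the smallest
# mean improvement the eval can reliably detect is:
print(f"MDE at N=500: {mde(500, 0.30):.4f}")
```

If the effect size you care about is smaller than this number, the eval is underpowered at that N and a null result is uninformative.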
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.