What This Investigation Covers

Fine-tuning comparisons are high-stakes: a spurious positive could mean shipping a regression, and a spurious negative could mean discarding real progress. The paired structure of before/after evals (same inputs, two checkpoints) is statistically powerful when used correctly.

What you’ll learn

1. Framing fine-tuning as a paired test: Why the “before” and “after” scores should be treated as paired differences, not independent samples, and how this dramatically reduces the required N (a paired-test sketch follows this list).

2. Setting a deployment CI threshold: How to decide in advance what CI gap justifies shipping, and how to frame this as a statistical decision rule rather than a subjective call (see the decision-rule sketch below).

3. Detecting category-level regressions: How to check whether an overall improvement masks regressions on specific task subsets, and how to correct for multiple category comparisons (see the per-category sketch below).

4. Minimum detectable effect at your sample size: How to know before running the eval whether your N is sufficient to detect the effect size you care about (see the MDE sketch below).
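
A minimal sketch of the paired framing in item 1, on synthetic data. The array names and the simulated scores are assumptions for illustration; the point is that per-prompt difficulty is shared by both checkpoints and cancels out of the paired differences, which an independent-samples test cannot exploit.

```python
# Paired vs. unpaired comparison on synthetic per-example scores.
# Each prompt has a shared "difficulty" term seen by both checkpoints;
# pairing cancels it, an independent-samples test does not.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 200
difficulty = rng.normal(0.0, 1.0, size=n)                    # shared per-prompt variation
before = 0.70 + 0.15 * difficulty + rng.normal(0, 0.05, n)   # "before" checkpoint scores
after = 0.73 + 0.15 * difficulty + rng.normal(0, 0.05, n)    # "after" checkpoint, +0.03 true gain

diffs = after - before
print(f"mean improvement:  {diffs.mean():+.3f}")
print(f"paired t-test p:   {stats.ttest_rel(after, before).pvalue:.2e}")
print(f"unpaired t-test p: {stats.ttest_ind(after, before).pvalue:.2e}")  # far weaker on the same data
```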
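
A sketch of the decision rule in item 2. The threshold value and the ship/hold wording are placeholders; the idea is that the rule, comparing the lower bound of the paired CI against a pre-registered threshold, is fixed before the results are seen.

```python
# Pre-registered ship/hold rule: ship only if the lower bound of the paired
# CI on the mean improvement clears a threshold chosen before the eval runs.
import numpy as np
from scipy import stats

def paired_ci(after, before, confidence=0.95):
    """t-based confidence interval for the mean per-example improvement."""
    d = np.asarray(after, float) - np.asarray(before, float)
    sem = d.std(ddof=1) / np.sqrt(len(d))
    half = stats.t.ppf(0.5 + confidence / 2, df=len(d) - 1) * sem
    return d.mean() - half, d.mean() + half

THRESHOLD = 0.0                                  # illustrative: "ship if any gain is statistically clear"
rng = np.random.default_rng(1)
difficulty = rng.normal(size=300)
before = 0.70 + 0.15 * difficulty + rng.normal(0, 0.05, 300)
after = before + rng.normal(0.02, 0.05, 300)     # hypothetical paired "after" scores

lo, hi = paired_ci(after, before)
print(f"95% CI on improvement: [{lo:+.3f}, {hi:+.3f}]")
print("decision:", "ship" if lo > THRESHOLD else "hold")
```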
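
A sketch of the per-category check in item 3. The category names, sizes, and the simulated regression are invented, and the correction shown is plain Bonferroni (test each category at alpha divided by the number of categories), used here as one simple way to handle the multiple comparisons.

```python
# Per-category paired tests with a Bonferroni correction, so that an
# overall gain cannot quietly hide a regression on one subset.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
categories = {"math": 120, "code": 150, "safety": 80}     # hypothetical subsets and sizes
alpha = 0.05

results = {}
for name, n in categories.items():
    difficulty = rng.normal(size=n)
    before = 0.70 + 0.15 * difficulty + rng.normal(0, 0.05, n)
    shift = -0.04 if name == "safety" else 0.03           # simulate a hidden regression on one subset
    after = before + rng.normal(shift, 0.05, n)
    results[name] = ((after - before).mean(), stats.ttest_rel(after, before).pvalue)

k = len(results)                                          # Bonferroni: significant if p < alpha / k
for name, (mean_d, p) in results.items():
    flag = "REGRESSION" if (mean_d < 0 and p < alpha / k) else ""
    print(f"{name:8s} mean diff {mean_d:+.3f}   p = {p:.2e}   {flag}")
```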
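
A sketch of the minimum-detectable-effect calculation in item 4, using the standard normal-approximation formula for a paired test. The standard deviation of per-example differences (sigma_d) is something you would estimate from a pilot run or a previous eval; the value below is made up.

```python
# Minimum detectable effect (MDE) for a paired comparison at a given N,
# two-sided test, normal approximation: (z_{1-alpha/2} + z_{power}) * sigma_d / sqrt(N).
import numpy as np
from scipy import stats

def paired_mde(n, sigma_d, alpha=0.05, power=0.80):
    """Smallest mean improvement detectable with the given power at sample size n."""
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_power = stats.norm.ppf(power)
    return (z_alpha + z_power) * sigma_d / np.sqrt(n)

sigma_d = 0.30   # assumed std of per-example score differences (e.g. from a pilot run)
for n in (100, 400, 1600):
    print(f"N = {n:5d}: MDE ≈ {paired_mde(n, sigma_d):.3f}")
```

Rearranging the same formula for N gives the sample size required to detect a target effect, which is the "know before running the eval" check this item describes.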

Looking for quick usage examples? Check out the Example Usage page.

Investigation in progress

Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.