What This Investigation Covers
Fine-tuning comparisons are high-stakes: a spurious positive could mean shipping a regression, and a spurious negative could mean discarding real progress. The paired structure of before/after evals (same inputs, two checkpoints) is statistically powerful when used correctly.
What you’ll learn
Framing before/after comparison as a paired test: Why scores from the same inputs on two checkpoints should be analyzed as per-input differences, not as two independent samples, and how pairing dramatically reduces the required N.
Setting a deployment confidence interval (CI) threshold: How to decide in advance what CI gap justifies shipping, and how to frame this as a statistical decision rule rather than a subjective call.
Detecting category-level regressions: How to check whether an overall improvement masks regressions on specific task subsets, and how to correct for multiple category comparisons.
Minimum detectable effect at your sample size: How to know before running the eval whether your N is sufficient to detect the effect size you care about.
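The paired framing above can be sketched in a few lines. This is a minimal illustration, not the investigation's actual code: the function name and the toy scores are ours, and the CI uses a normal approximation that is reasonable for large N.

```python
# Paired before/after comparison: analyze per-input differences,
# not two independent samples. (Illustrative sketch; toy data.)
import math
import statistics

def paired_ci(before, after, z=1.96):
    """Mean of per-input score differences with an approximate 95% CI
    (normal approximation; adequate for large N)."""
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean = statistics.fmean(diffs)
    se = statistics.stdev(diffs) / math.sqrt(n)
    return mean, (mean - z * se, mean + z * se)

# Toy data: "after" scores the same inputs consistently ~0.05 higher.
before = [0.60, 0.55, 0.70, 0.65, 0.50, 0.62, 0.58, 0.67]
after  = [0.66, 0.59, 0.74, 0.71, 0.55, 0.66, 0.64, 0.71]
mean, (lo, hi) = paired_ci(before, after)
print(f"mean diff = {mean:.3f}, 95% CI = ({lo:.3f}, {hi:.3f})")
```

Because each difference cancels the per-input difficulty shared by both checkpoints, the CI here excludes zero even at N=8, where two unpaired samples of these scores would not separate.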
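For the category-level check, one standard correction for multiple comparisons is Holm's step-down procedure. A minimal sketch, with made-up category names and p-values; this names one common technique, not necessarily the one the full write-up will use:

```python
# Holm–Bonferroni step-down over per-category p-values.
# (Illustrative sketch; categories and p-values are invented.)
def holm_flags(pvalues, alpha=0.05):
    """Return which hypotheses are rejected under Holm's procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    rejected = [False] * m
    for rank, i in enumerate(order):
        if pvalues[i] <= alpha / (m - rank):
            rejected[i] = True
        else:
            break  # step-down: stop at the first failure
    return rejected

categories = ["math", "coding", "summarization", "safety"]
pvals = [0.001, 0.030, 0.200, 0.004]
flags = holm_flags(pvals)
for cat, p, rej in zip(categories, pvals, flags):
    print(f"{cat:14s} p={p:.3f} {'significant' if rej else 'not significant'}")
```

Note that coding's raw p=0.030 would pass an uncorrected 0.05 cutoff but fails after correction, which is exactly the kind of false category-level "regression detected" (or "improvement detected") the correction guards against.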
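The minimum-detectable-effect question can be answered before running the eval with the standard power formula MDE = (z_{1-α/2} + z_{1-β}) · sd / √N, where sd is the standard deviation of per-input score differences. The sd value below is an assumed pilot estimate, not a measured result:

```python
# Minimum detectable effect for a paired eval at a given sample size.
# sd_diff = std. dev. of per-input score differences (assumed from a pilot).
import math
from statistics import NormalDist

def mde(n, sd_diff, alpha=0.05, power=0.80):
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * sd_diff / math.sqrt(n)

# e.g. with N=500 paired prompts and sd_diff ≈ 0.30, the smallest
# mean improvement the eval can reliably detect is:
print(f"MDE at N=500: {mde(500, 0.30):.4f}")
```

If the effect size you care about is smaller than this number, the eval is underpowered at that N and a null result is uninformative.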
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.