What This Investigation Covers
Model updates are an ongoing source of silent regressions. A new checkpoint that improves summarization may degrade instruction-following. A fine-tune that improves math may hurt creative writing. Regression Guard is a systematic, statistics-first approach to catching these before they ship.
What you’ll learn
Defining regression statistically: Why a lower mean score alone is not the right definition of regression (the drop may be noise), and how to define it instead as the confidence-interval (CI) lower bound on the score difference crossing a threshold.
Setting CI-based release gates: How to specify a release criterion (“ship only if the CI lower bound on the score difference is above −2 points”) and how to calibrate that threshold for your use case.
Category-level regression detection with correction: How to test all task categories simultaneously and apply the Holm correction, so that a regression anywhere in the eval is flagged while the family-wise false-positive rate stays controlled.
Running regression guard in CI/CD: How to integrate statistical regression testing into a continuous delivery pipeline so every checkpoint is evaluated automatically.
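The first two items above can be sketched in a few lines. This is a minimal illustration, not Regression Guard's actual API: the scores, the −2-point threshold, and the bootstrap settings are all hypothetical, and the paired per-example scores are simulated.

```python
import numpy as np

rng = np.random.default_rng(0)

def diff_ci_lower(scores_new, scores_old, alpha=0.05, n_boot=2000):
    """Bootstrap lower confidence bound on the mean paired score difference."""
    diffs = np.asarray(scores_new) - np.asarray(scores_old)
    n = len(diffs)
    boot_means = np.array([
        rng.choice(diffs, size=n, replace=True).mean() for _ in range(n_boot)
    ])
    # One-sided lower bound: the alpha quantile of the bootstrap distribution.
    return float(np.quantile(boot_means, alpha))

# Hypothetical per-example scores on the same prompts for old and new checkpoints.
old = rng.normal(70, 10, size=200)
new = old + rng.normal(-0.5, 5, size=200)  # new model: slightly worse on average

lower = diff_ci_lower(new, old)
THRESHOLD = -2.0  # release gate: ship only if the CI lower bound stays above -2
print(f"CI lower bound: {lower:.2f}; ship: {lower > THRESHOLD}")
```

Note that the gate acts on the lower bound, not the mean: a new checkpoint whose mean difference is near zero can still fail the gate if the eval set is too small to rule out a 2-point drop.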
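The Holm step-down procedure mentioned above is simple enough to show directly. A minimal sketch, with hypothetical category names and p-values (assumed to come from one-sided per-category difference tests):

```python
def holm_flags(pvals, alpha=0.05):
    """Holm step-down: flag categories as regressions while controlling
    the family-wise error rate at alpha."""
    items = sorted(pvals.items(), key=lambda kv: kv[1])  # ascending p-values
    m = len(items)
    flagged = {}
    still_rejecting = True
    for i, (cat, p) in enumerate(items):
        # Compare the i-th smallest p-value against alpha / (m - i).
        if still_rejecting and p <= alpha / (m - i):
            flagged[cat] = True
        else:
            still_rejecting = False  # step-down: stop at the first failure
            flagged[cat] = False
    return flagged

# Hypothetical one-sided p-values from per-category difference tests.
pvals = {"math": 0.001, "code": 0.004, "summarization": 0.03, "creative": 0.20}
print(holm_flags(pvals))
# → {'math': True, 'code': True, 'summarization': False, 'creative': False}
```

Holm is uniformly more powerful than Bonferroni at the same family-wise error rate, which matters when the eval has many categories and each test is already underpowered.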
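For the CI/CD item, one common pattern is a gate script whose exit code fails the pipeline job when any category regresses. A sketch under assumed names (the `regression_gate` function and the category bounds are illustrative, not part of Regression Guard):

```python
import sys

def regression_gate(ci_lower_bounds, threshold=-2.0):
    """Return a nonzero exit code if any category's CI lower bound crosses
    the threshold, so the CI/CD job fails and the checkpoint is not promoted."""
    failures = {c: lb for c, lb in ci_lower_bounds.items() if lb <= threshold}
    for cat, lb in failures.items():
        print(f"REGRESSION {cat}: CI lower bound {lb:.2f} <= {threshold}")
    return 1 if failures else 0

# Hypothetical per-category CI lower bounds produced by the eval step.
bounds = {"math": 0.8, "summarization": -3.1, "creative": -0.4}
exit_code = regression_gate(bounds)
# In a real pipeline, end with: sys.exit(exit_code)
```

Wiring this into a pipeline is then a matter of running the script as a job step; most CI systems treat a nonzero exit code as a failed step, which blocks promotion of the checkpoint.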
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.