What This Investigation Covers

Model updates are an ongoing source of silent regressions. A new checkpoint that improves summarization may degrade instruction-following. A fine-tune that improves math may hurt creative writing. Regression Guard is a systematic, statistics-first approach to catching these before they ship.

What you’ll learn

1. Defining regression statistically: Why “lower mean score” is not the right definition of regression, and how to define it instead as a CI lower bound crossing a threshold.
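Under that definition, the quantity to compute is the lower bound of a confidence interval on the per-prompt score difference. A minimal sketch, assuming paired scores for the same prompts under the old and new checkpoints (the function and variable names here are illustrative, not part of Regression Guard):

```python
import random

def diff_ci_lower(old_scores, new_scores, iters=5000, alpha=0.05, seed=0):
    """Bootstrap lower bound of a two-sided CI on mean(new - old).

    old_scores/new_scores are paired: index i is the same prompt scored
    under the old and new checkpoint.
    """
    rng = random.Random(seed)
    diffs = [n - o for o, n in zip(old_scores, new_scores)]
    n = len(diffs)
    # Resample the per-prompt differences and collect bootstrap means.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(iters)
    )
    return means[int((alpha / 2) * iters)]
```

A checkpoint regresses when this lower bound falls below the chosen threshold, not merely when the mean difference is negative: a negative mean with a CI that comfortably includes zero is noise, not evidence.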

2. Setting CI-based release gates: How to specify a release criterion (“ship if the CI lower bound on the score difference is above −2 points”) and how to calibrate the threshold for your use case.
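That release criterion can be sketched directly. The version below uses a normal-approximation CI for brevity (names are illustrative; a bootstrap interval works the same way), with the −2-point threshold from the criterion above as the default:

```python
import math

def gate(diffs, threshold=-2.0, z=1.96):
    """Release gate: ship iff the ~95% CI lower bound on the mean
    per-prompt score difference (new - old) exceeds the threshold.

    threshold=-2.0 tolerates up to a 2-point true regression; tighten it
    for safety-critical evals, loosen it for noisy or low-stakes ones.
    """
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    lower = mean - z * math.sqrt(var / n)
    return lower > threshold, lower
```

Returning the lower bound alongside the decision keeps the gate auditable: a blocked release shows exactly how far the interval fell below the threshold.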

3. Category-level regression detection with correction: How to test all task categories simultaneously and apply Holm correction so that a regression anywhere in the eval is flagged reliably.
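The Holm step-down procedure itself is short enough to sketch. Given one p-value per category (how those p-values are produced is the subject of the earlier items), it tests from the smallest p upward against a shrinking denominator and stops at the first non-rejection:

```python
def holm_flags(pvalues, alpha=0.05):
    """Holm step-down correction.

    Compare the k-th smallest p-value to alpha / (m - k); once one test
    fails, all larger p-values also fail. Controls the family-wise error
    rate, so a flag anywhere in the eval remains trustworthy.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    flagged = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            flagged[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return flagged
```

Unlike Bonferroni, Holm spends the alpha budget adaptively, so it never flags fewer categories while offering the same family-wise guarantee.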

4. Running Regression Guard in CI/CD: How to integrate statistical regression testing into a continuous delivery pipeline so every checkpoint is evaluated automatically.
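In a pipeline, the integration point is usually just an exit code: the eval job writes per-category CI lower bounds to a report, and a check step fails the build if any category breaches the threshold. A sketch under that assumption (the report filename and its shape are hypothetical, not a Regression Guard format):

```python
import json
import sys

def failing_categories(report, threshold=-2.0):
    """Return categories whose CI lower bound breaches the threshold.

    report maps category name -> CI lower bound on the score difference,
    e.g. {"math": -0.4, "summarization": -3.1}.
    """
    return {c: lb for c, lb in report.items() if lb <= threshold}

def main(report_path="eval_report.json"):
    with open(report_path) as f:
        report = json.load(f)
    failures = failing_categories(report)
    if failures:
        print(f"REGRESSION detected, blocking release: {failures}")
        return 1  # nonzero exit fails the CI stage
    print("All categories clear the gate.")
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:2]))
```

Wiring this as a required step in the delivery pipeline means every new checkpoint is gated automatically, with no one needing to remember to run the eval by hand.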

Looking for quick usage examples? Check out the Example Usage page.

Investigation in progress

Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.