What This Investigation Covers

Model updates are an ongoing source of silent regressions. A new checkpoint that improves summarization may degrade instruction-following. A fine-tune that improves math may hurt creative writing. Regression Guard is a systematic, statistics-first approach to catching these before they ship.

What you’ll learn

1. Defining regression statistically: Why “lower mean score” is not the right definition of regression, and how to define it instead as a CI lower bound crossing a threshold.
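Under that definition, the quantity to compute is the lower bound of a confidence interval on the per-prompt score difference. A minimal sketch, assuming paired scores for the same prompts under the old and new checkpoints (the function and variable names here are illustrative, not part of Regression Guard):

```python
import random

def diff_ci_lower(old_scores, new_scores, iters=5000, alpha=0.05, seed=0):
    """Bootstrap lower bound of a two-sided CI on mean(new - old).

    old_scores/new_scores are paired: index i is the same prompt scored
    under the old and new checkpoint.
    """
    rng = random.Random(seed)
    diffs = [n - o for o, n in zip(old_scores, new_scores)]
    n = len(diffs)
    # Resample the per-prompt differences and collect bootstrap means.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n
        for _ in range(iters)
    )
    return means[int((alpha / 2) * iters)]
```

A checkpoint regresses when this lower bound falls below the chosen threshold, not merely when the mean difference is negative: a negative mean with a CI that comfortably includes zero is noise, not evidence.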

2. Setting CI-based release gates: How to specify a release criterion (“ship if the CI lower bound on the score difference is above −2 points”) and how to calibrate the threshold for your use case.
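That release criterion can be sketched directly. The version below uses a normal-approximation CI for brevity (names are illustrative; a bootstrap interval works the same way), with the −2-point threshold from the criterion above as the default:

```python
import math

def gate(diffs, threshold=-2.0, z=1.96):
    """Release gate: ship iff the ~95% CI lower bound on the mean
    per-prompt score difference (new - old) exceeds the threshold.

    threshold=-2.0 tolerates up to a 2-point true regression; tighten it
    for safety-critical evals, loosen it for noisy or low-stakes ones.
    """
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    lower = mean - z * math.sqrt(var / n)
    return lower > threshold, lower
```

Returning the lower bound alongside the decision keeps the gate auditable: a blocked release shows exactly how far the interval fell below the threshold.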

3. Category-level regression detection with correction: How to test all task categories simultaneously and apply Holm correction so that a regression anywhere in the eval is flagged reliably.
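The Holm step-down procedure itself is short enough to sketch. Given one p-value per category (how those p-values are produced is the subject of the earlier items), it tests from the smallest p upward against a shrinking denominator and stops at the first non-rejection:

```python
def holm_flags(pvalues, alpha=0.05):
    """Holm step-down correction.

    Compare the k-th smallest p-value to alpha / (m - k); once one test
    fails, all larger p-values also fail. Controls the family-wise error
    rate, so a flag anywhere in the eval remains trustworthy.
    """
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    flagged = [False] * m
    for k, i in enumerate(order):
        if pvalues[i] <= alpha / (m - k):
            flagged[i] = True
        else:
            break  # step-down: stop at the first non-rejection
    return flagged
```

Unlike Bonferroni, Holm spends the alpha budget adaptively, so it never flags fewer categories while offering the same family-wise guarantee.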

4. Running Regression Guard in CI/CD: How to integrate statistical regression testing into a continuous delivery pipeline so every checkpoint is evaluated automatically.
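In a pipeline, the integration point is usually just an exit code: the eval job writes per-category CI lower bounds to a report, and a check step fails the build if any category breaches the threshold. A sketch under that assumption (the report filename and its shape are hypothetical, not a Regression Guard format):

```python
import json
import sys

def failing_categories(report, threshold=-2.0):
    """Return categories whose CI lower bound breaches the threshold.

    report maps category name -> CI lower bound on the score difference,
    e.g. {"math": -0.4, "summarization": -3.1}.
    """
    return {c: lb for c, lb in report.items() if lb <= threshold}

def main(report_path="eval_report.json"):
    with open(report_path) as f:
        report = json.load(f)
    failures = failing_categories(report)
    if failures:
        print(f"REGRESSION detected, blocking release: {failures}")
        return 1  # nonzero exit fails the CI stage
    print("All categories clear the gate.")
    return 0

if __name__ == "__main__":
    sys.exit(main(*sys.argv[1:2]))
```

Wiring this as a required step in the delivery pipeline means every new checkpoint is gated automatically, with no one needing to remember to run the eval by hand.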

Looking for quick usage examples? Check out the Example Usage page.

Investigation in progress

Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.