What This Investigation Covers
A “model comparison” that uses a single prompt template per model is not really comparing models — it’s comparing model+prompt pairs. Prompt sensitivity analysis is the corrective: run each model on k semantically equivalent phrasings, then ask how stable the ranking is across phrasings.
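Concretely, the whole study reduces to one object: a models × prompts score matrix, where entry (i, j) is model i's score under paraphrase j. Here is a minimal simulated sketch of that setup; the effect sizes and noise levels are made up for illustration, not real results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for a real eval harness: each model has a true
# skill level, each paraphrase shifts scores by a prompt effect,
# plus per-run noise. All numbers here are illustrative.
n_models, k_prompts = 3, 5
model_effect = rng.normal(0.70, 0.05, size=n_models)    # per-model skill
prompt_effect = rng.normal(0.00, 0.03, size=k_prompts)  # per-paraphrase shift
noise = rng.normal(0.00, 0.02, size=(n_models, k_prompts))

# scores[i, j] = score of model i under prompt variant j
scores = model_effect[:, None] + prompt_effect[None, :] + noise
```

Every analysis below — the variance split and the stability score alike — is an operation on this matrix.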
What you’ll learn
- **Running a prompt sensitivity study:** How to design a suite of k paraphrase variants that test the same capability while varying only surface-level phrasing (an illustrative suite follows this list).
- **Decomposing variance (model vs. prompt):** How to partition the total score variance into a model component and a prompt component using a simple two-way variance decomposition (sketched in code below).
- **Computing a "ranking stability score":** How to estimate the probability that the observed Model A > Model B ranking would hold under a new, unseen prompt, and what that probability looks like in practice (see the bootstrap sketch below).
- **Reporting rankings that are robust to prompt choice:** How to report model comparisons backed by multi-prompt evidence, and how to flag comparisons that are not (see the reporting sketch at the end of this section).
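For the first item, a paraphrase suite is just k templates that request the same capability in different surface forms. The templates below are illustrative examples for a summarization task, not a vetted suite.

```python
# k = 5 semantically equivalent phrasings of one summarization task.
# Only the surface wording varies; the capability tested is fixed.
PARAPHRASE_SUITE = [
    "Summarize the following article in one sentence:\n\n{article}",
    "In one sentence, what is the main point of the article below?\n\n{article}",
    "Read this article and reply with a single-sentence summary.\n\n{article}",
    "{article}\n\nTL;DR (one sentence):",
    "Give a one-line summary of the article:\n\n{article}",
]

prompt = PARAPHRASE_SUITE[0].format(article="...")  # fill per test item
```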
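For the variance decomposition, a simple two-way additive breakdown works when there is one score per (model, prompt) cell: total variance splits exactly into a model term, a prompt term, and a residual (which, without replication, also absorbs the model × prompt interaction). The score values below are made up for demonstration.

```python
import numpy as np

# Illustrative scores[i, j]: model i's score on prompt variant j.
scores = np.array([
    [0.82, 0.74, 0.79, 0.68, 0.77],  # model A
    [0.71, 0.73, 0.65, 0.70, 0.69],  # model B
    [0.60, 0.75, 0.58, 0.72, 0.61],  # model C
])

grand = scores.mean()
model_means = scores.mean(axis=1)   # average over prompts, one per model
prompt_means = scores.mean(axis=0)  # average over models, one per prompt

# Two-way additive decomposition: the three components are orthogonal,
# so they sum exactly to the total variance.
var_model = np.mean((model_means - grand) ** 2)
var_prompt = np.mean((prompt_means - grand) ** 2)
resid = scores - model_means[:, None] - prompt_means[None, :] + grand
var_resid = np.mean(resid ** 2)

total = np.mean((scores - grand) ** 2)
assert np.isclose(total, var_model + var_prompt + var_resid)
print(f"model: {var_model/total:.0%}, prompt: {var_prompt/total:.0%}, "
      f"residual: {var_resid/total:.0%}")
```

If the prompt share rivals the model share, a single-prompt ranking is telling you as much about your template as about the models.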
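One way to turn "would A > B hold on a new prompt?" into a number is a bootstrap over prompt variants: resample the k columns with replacement and count how often A's mean stays above B's. This is one plausible estimator sketched under that assumption; the full write-up may use a different one.

```python
import numpy as np

def ranking_stability(scores_a, scores_b, n_boot=10_000, seed=0):
    """Estimate P(mean score of A > mean score of B) under a bootstrap
    over prompt variants. scores_a and scores_b are length-k arrays of
    per-variant scores for the same k paraphrases."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    k = len(scores_a)
    idx = rng.integers(0, k, size=(n_boot, k))  # resampled prompt indices
    wins = scores_a[idx].mean(axis=1) > scores_b[idx].mean(axis=1)
    return wins.mean()

# Illustrative data: A beats B on average, but not on every variant.
a = [0.82, 0.74, 0.79, 0.68, 0.77]
b = [0.71, 0.73, 0.65, 0.70, 0.69]
print(ranking_stability(a, b))  # close to 1.0 -> A > B is robust to prompt choice
```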
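And for reporting, one workable convention (an assumption here, not an established standard) is to publish each pairwise comparison alongside its stability estimate and flag anything below a chosen threshold.

```python
STABILITY_THRESHOLD = 0.95  # assumed cutoff; set to match your risk tolerance

def report_pair(name_a: str, name_b: str, stability: float) -> None:
    """Print one pairwise comparison with its stability and a flag."""
    verdict = "robust" if stability >= STABILITY_THRESHOLD else "PROMPT-SENSITIVE"
    print(f"{name_a} > {name_b}: stability {stability:.2f} [{verdict}]")

report_pair("model-a", "model-b", 0.97)  # robust
report_pair("model-b", "model-c", 0.61)  # PROMPT-SENSITIVE
```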
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.