What This Investigation Covers
A “model comparison” that uses a single prompt template per model is not really comparing models — it’s comparing model+prompt pairs. Prompt sensitivity analysis is the corrective: run each model on k semantically equivalent phrasings, then ask how stable the ranking is across phrasings.
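Concretely, the whole study reduces to one object: a models × prompts score matrix, where entry (i, j) is model i's score under paraphrase j. Here is a minimal simulated sketch of that setup; the effect sizes and noise levels are made up for illustration, not real results.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated stand-in for a real eval harness: each model has a true
# skill level, each paraphrase shifts scores by a prompt effect,
# plus per-run noise. All numbers here are illustrative.
n_models, k_prompts = 3, 5
model_effect = rng.normal(0.70, 0.05, size=n_models)    # per-model skill
prompt_effect = rng.normal(0.00, 0.03, size=k_prompts)  # per-paraphrase shift
noise = rng.normal(0.00, 0.02, size=(n_models, k_prompts))

# scores[i, j] = score of model i under prompt variant j
scores = model_effect[:, None] + prompt_effect[None, :] + noise
```

Every analysis below — the variance split and the stability score alike — is an operation on this matrix.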
What you’ll learn
- **Running a prompt sensitivity study:** How to design a suite of k paraphrase variants that test the same capability while varying only surface-level phrasing (an illustrative suite follows this list).
- **Decomposing variance (model vs. prompt):** How to partition the total score variance into a model component and a prompt component using a simple two-way variance decomposition (sketched in code below).
- **Computing a "ranking stability score":** How to estimate the probability that the observed Model A > Model B ranking would hold under a new, unseen prompt, and what that probability looks like in practice (see the bootstrap sketch below).
- **Reporting rankings that are robust to prompt choice:** How to report model comparisons backed by multi-prompt evidence, and how to flag comparisons that are not (see the reporting sketch at the end of this section).
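For the first item, a paraphrase suite is just k templates that request the same capability in different surface forms. The templates below are illustrative examples for a summarization task, not a vetted suite.

```python
# k = 5 semantically equivalent phrasings of one summarization task.
# Only the surface wording varies; the capability tested is fixed.
PARAPHRASE_SUITE = [
    "Summarize the following article in one sentence:\n\n{article}",
    "In one sentence, what is the main point of the article below?\n\n{article}",
    "Read this article and reply with a single-sentence summary.\n\n{article}",
    "{article}\n\nTL;DR (one sentence):",
    "Give a one-line summary of the article:\n\n{article}",
]

prompt = PARAPHRASE_SUITE[0].format(article="...")  # fill per test item
```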
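For the variance decomposition, a simple two-way additive breakdown works when there is one score per (model, prompt) cell: total variance splits exactly into a model term, a prompt term, and a residual (which, without replication, also absorbs the model × prompt interaction). The score values below are made up for demonstration.

```python
import numpy as np

# Illustrative scores[i, j]: model i's score on prompt variant j.
scores = np.array([
    [0.82, 0.74, 0.79, 0.68, 0.77],  # model A
    [0.71, 0.73, 0.65, 0.70, 0.69],  # model B
    [0.60, 0.75, 0.58, 0.72, 0.61],  # model C
])

grand = scores.mean()
model_means = scores.mean(axis=1)   # average over prompts, one per model
prompt_means = scores.mean(axis=0)  # average over models, one per prompt

# Two-way additive decomposition: the three components are orthogonal,
# so they sum exactly to the total variance.
var_model = np.mean((model_means - grand) ** 2)
var_prompt = np.mean((prompt_means - grand) ** 2)
resid = scores - model_means[:, None] - prompt_means[None, :] + grand
var_resid = np.mean(resid ** 2)

total = np.mean((scores - grand) ** 2)
assert np.isclose(total, var_model + var_prompt + var_resid)
print(f"model: {var_model/total:.0%}, prompt: {var_prompt/total:.0%}, "
      f"residual: {var_resid/total:.0%}")
```

If the prompt share rivals the model share, a single-prompt ranking is telling you as much about your template as about the models.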
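One way to turn "would A > B hold on a new prompt?" into a number is a bootstrap over prompt variants: resample the k columns with replacement and count how often A's mean stays above B's. This is one plausible estimator sketched under that assumption; the full write-up may use a different one.

```python
import numpy as np

def ranking_stability(scores_a, scores_b, n_boot=10_000, seed=0):
    """Estimate P(mean score of A > mean score of B) under a bootstrap
    over prompt variants. scores_a and scores_b are length-k arrays of
    per-variant scores for the same k paraphrases."""
    rng = np.random.default_rng(seed)
    scores_a, scores_b = np.asarray(scores_a), np.asarray(scores_b)
    k = len(scores_a)
    idx = rng.integers(0, k, size=(n_boot, k))  # resampled prompt indices
    wins = scores_a[idx].mean(axis=1) > scores_b[idx].mean(axis=1)
    return wins.mean()

# Illustrative data: A beats B on average, but not on every variant.
a = [0.82, 0.74, 0.79, 0.68, 0.77]
b = [0.71, 0.73, 0.65, 0.70, 0.69]
print(ranking_stability(a, b))  # close to 1.0 -> A > B is robust to prompt choice
```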
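And for reporting, one workable convention (an assumption here, not an established standard) is to publish each pairwise comparison alongside its stability estimate and flag anything below a chosen threshold.

```python
STABILITY_THRESHOLD = 0.95  # assumed cutoff; set to match your risk tolerance

def report_pair(name_a: str, name_b: str, stability: float) -> None:
    """Print one pairwise comparison with its stability and a flag."""
    verdict = "robust" if stability >= STABILITY_THRESHOLD else "PROMPT-SENSITIVE"
    print(f"{name_a} > {name_b}: stability {stability:.2f} [{verdict}]")

report_pair("model-a", "model-b", 0.97)  # robust
report_pair("model-b", "model-c", 0.61)  # PROMPT-SENSITIVE
```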
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.