What This Investigation Covers

A “model comparison” that uses a single prompt template per model is not really comparing models — it’s comparing model+prompt pairs. Prompt sensitivity analysis is the corrective: run each model on k semantically equivalent phrasings, then ask how stable the ranking is across phrasings.

What you’ll learn

01. Running a prompt sensitivity study: How to design a suite of k paraphrase variants that test the same capability while varying only surface-level phrasing. Minimal sketches of this step and the three that follow appear after this list.

02. Decomposing variance (model vs. prompt): How to partition the total score variance into a model component and a prompt component using a simple variance decomposition.

03. Computing a “ranking stability score”: How to estimate the probability that the observed Model A > Model B ranking would hold under a new, unseen prompt, and what that probability looks like in practice.

04. Reporting rankings that are robust to prompt choice: How to report model comparisons that are backed by multi-prompt evidence, and how to flag comparisons that are not.

Looking for quick usage examples? Check out the Example Usage page.

Investigation in progress

Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.