What This Investigation Covers

The 2D model×prompt comparison is where most real-world eval work lives. It’s also where naive statistics fail most spectacularly: picking the highest cell in a 3×5 matrix without correction virtually guarantees a spurious winner.
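That claim is easy to check with a quick simulation. The sketch below uses made-up Gaussian scores (all distributions identical, so any "winner" is pure noise), picks the best- and worst-looking of the 15 cells, and runs an uncorrected Welch test between them; the cell count, sample size, and score scale are illustrative assumptions, not values from this investigation.

```python
# Null simulation: all 15 cells share one score distribution, yet an
# uncorrected best-vs-worst test frequently comes back "significant".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
trials, hits = 500, 0
for _ in range(trials):
    cells = rng.normal(0.7, 0.1, (15, 40))   # 3x5 = 15 cells, 40 examples each
    means = cells.mean(axis=1)
    best, worst = means.argmax(), means.argmin()
    # Welch t-test on the post-hoc selected pair, no multiplicity correction
    p = ttest_ind(cells[best], cells[worst], equal_var=False).pvalue
    hits += p < 0.05
print(f"uncorrected best-vs-worst is 'significant' in {hits / trials:.0%} of null runs")
```

Selecting the extremes first and testing second is exactly the mistake the correction methods below are designed to prevent.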

What you’ll learn

01. All-pairs comparison with family-wise correction: How to compare all 15 model×prompt cells pairwise, yielding C(15, 2) = 105 tests, and apply the Holm correction across that full family of tests.
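A minimal sketch of the idea, with simulated per-cell scores standing in for real eval results (the cell sizes and score distribution are assumptions): enumerate every pair of cells, collect Welch p-values, and apply the Holm step-down procedure by hand so the dependency footprint stays small.

```python
# All-pairs Welch tests over a 3x5 score matrix, with Holm correction.
from itertools import combinations
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
models, prompts, n = 3, 5, 40
# cell_scores[(m, p)] -> per-example scores for that model x prompt cell
cell_scores = {(m, p): rng.normal(0.7, 0.1, n)
               for m in range(models) for p in range(prompts)}

pairs = list(combinations(cell_scores, 2))   # C(15, 2) = 105 pairwise tests
pvals = np.array([ttest_ind(cell_scores[a], cell_scores[b],
                            equal_var=False).pvalue
                  for a, b in pairs])

# Holm step-down: sorted p-values face thresholds alpha/m, alpha/(m-1), ...
alpha, m = 0.05, len(pvals)
order = np.argsort(pvals)
reject = np.zeros(m, dtype=bool)
for rank, idx in enumerate(order):
    if pvals[idx] <= alpha / (m - rank):
        reject[idx] = True
    else:
        break                                # stop at the first failure
print(f"{m} tests, {int(reject.sum())} rejected after Holm correction")
```

`statsmodels.stats.multitest.multipletests(..., method="holm")` gives the same decisions; the manual loop is shown only to make the step-down thresholds explicit.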

02. Detecting interaction effects: How to check whether Model A's advantage over Model B is consistent across prompt templates, or appears only with certain phrasings.
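One way to formalize "is the advantage consistent?" is a heterogeneity test borrowed from meta-analysis: compute the A-minus-B gap per prompt, pool the gaps with inverse-variance weights, and test the spread with Cochran's Q. The sketch below uses simulated scores and assumes two models evaluated on the same five prompts; it is an illustration of the test, not the investigation's final method.

```python
# Per-prompt A-vs-B gaps, pooled estimate, and Cochran's Q heterogeneity test.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, prompts = 40, 5
a = rng.normal(0.75, 0.1, (prompts, n))    # Model A scores, one row per prompt
b = rng.normal(0.70, 0.1, (prompts, n))    # Model B scores, one row per prompt

delta = a.mean(axis=1) - b.mean(axis=1)    # A's advantage on each prompt
se2 = a.var(axis=1, ddof=1) / n + b.var(axis=1, ddof=1) / n
w = 1.0 / se2                              # inverse-variance weights
pooled = (w * delta).sum() / w.sum()       # pooled advantage across prompts
Q = (w * (delta - pooled) ** 2).sum()      # Cochran's Q statistic
p_het = chi2.sf(Q, df=prompts - 1)         # small p => advantage varies by prompt
print(f"pooled advantage {pooled:.3f}, heterogeneity p = {p_het:.3f}")
```

A small heterogeneity p-value is the signal that "Model A wins" is really "Model A wins under some phrasings", which changes what a fair headline claim looks like.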

03. Visualizing the score matrix with uncertainty: How to build a score heatmap that shows confidence-interval width alongside cell means, so narrow wins are visually distinct from robust ones.
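A sketch of one way to build such a heatmap, again on simulated scores (the matrix shape, sample size, and file name are assumptions): annotate each cell with its mean and 95% CI half-width so a 0.72 ± 0.01 cell reads differently from a 0.72 ± 0.08 one.

```python
# Heatmap of cell means, each annotated with its 95% CI half-width.
import matplotlib
matplotlib.use("Agg")                      # headless backend for scripts/CI
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
models, prompts, n = 3, 5, 40
scores = rng.normal(0.7, 0.1, (models, prompts, n))  # per-example cell scores

means = scores.mean(axis=2)
sem = scores.std(axis=2, ddof=1) / np.sqrt(n)
half = t.ppf(0.975, df=n - 1) * sem        # 95% CI half-width per cell

fig, ax = plt.subplots(figsize=(7, 3))
im = ax.imshow(means, cmap="viridis")
for i in range(models):
    for j in range(prompts):
        ax.text(j, i, f"{means[i, j]:.2f}\n±{half[i, j]:.2f}",
                ha="center", va="center", color="white", fontsize=8)
ax.set_xlabel("prompt template")
ax.set_ylabel("model")
fig.colorbar(im, ax=ax, label="mean score")
fig.savefig("score_heatmap.png", dpi=150)
```

Color still encodes the mean, so the usual at-a-glance reading works; the ± annotation is what keeps a lucky narrow win from masquerading as a robust one.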

04. Reporting the winning combination: How to report a model+prompt winner while being honest about whether the win generalizes beyond these specific templates.
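One cheap honesty check, sketched below on a simulated mean matrix (the numbers are made up): alongside the best cell, report how many prompt columns the winning model actually tops. "Best cell" and "best under every template" are different claims, and this surfaces which one you can make.

```python
# Report the best model x prompt cell plus a column-wise consistency count.
import numpy as np

rng = np.random.default_rng(3)
means = rng.normal(0.7, 0.05, (3, 5))      # cell means: models x prompts

best_m, best_p = np.unravel_index(means.argmax(), means.shape)
per_prompt_winner = means.argmax(axis=0)   # winning model for each prompt
wins = int((per_prompt_winner == best_m).sum())
print(f"best cell: model {best_m}, prompt {best_p}; "
      f"model {best_m} tops {wins}/{means.shape[1]} prompt columns")
```

A 5/5 count supports "model X wins"; a 2/5 count supports only "model X wins with prompt Y", which is the honest headline in that case.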

Looking for quick usage examples? Check out the Example Usage page.

Investigation in progress

Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.