What This Investigation Covers

The 2D model×prompt comparison is where most real-world eval work lives. It’s also where naive statistics fail most spectacularly: picking the highest cell in a 3×5 matrix without correction virtually guarantees a spurious winner.
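That claim is easy to check with a quick simulation. The sketch below uses made-up Gaussian scores (all distributions identical, so any "winner" is pure noise), picks the best- and worst-looking of the 15 cells, and runs an uncorrected Welch test between them; the cell count, sample size, and score scale are illustrative assumptions, not values from this investigation.

```python
# Null simulation: all 15 cells share one score distribution, yet an
# uncorrected best-vs-worst test frequently comes back "significant".
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
trials, hits = 500, 0
for _ in range(trials):
    cells = rng.normal(0.7, 0.1, (15, 40))   # 3x5 = 15 cells, 40 examples each
    means = cells.mean(axis=1)
    best, worst = means.argmax(), means.argmin()
    # Welch t-test on the post-hoc selected pair, no multiplicity correction
    p = ttest_ind(cells[best], cells[worst], equal_var=False).pvalue
    hits += p < 0.05
print(f"uncorrected best-vs-worst is 'significant' in {hits / trials:.0%} of null runs")
```

Selecting the extremes first and testing second is exactly the mistake the correction methods below are designed to prevent.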

What you’ll learn

01. All-pairs comparison with family-wise correction: How to compare all 15 model×prompt cells pairwise, yielding C(15, 2) = 105 tests, and apply the Holm correction across that full family of tests.
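A minimal sketch of the idea, with simulated per-cell scores standing in for real eval results (the cell sizes and score distribution are assumptions): enumerate every pair of cells, collect Welch p-values, and apply the Holm step-down procedure by hand so the dependency footprint stays small.

```python
# All-pairs Welch tests over a 3x5 score matrix, with Holm correction.
from itertools import combinations
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
models, prompts, n = 3, 5, 40
# cell_scores[(m, p)] -> per-example scores for that model x prompt cell
cell_scores = {(m, p): rng.normal(0.7, 0.1, n)
               for m in range(models) for p in range(prompts)}

pairs = list(combinations(cell_scores, 2))   # C(15, 2) = 105 pairwise tests
pvals = np.array([ttest_ind(cell_scores[a], cell_scores[b],
                            equal_var=False).pvalue
                  for a, b in pairs])

# Holm step-down: sorted p-values face thresholds alpha/m, alpha/(m-1), ...
alpha, m = 0.05, len(pvals)
order = np.argsort(pvals)
reject = np.zeros(m, dtype=bool)
for rank, idx in enumerate(order):
    if pvals[idx] <= alpha / (m - rank):
        reject[idx] = True
    else:
        break                                # stop at the first failure
print(f"{m} tests, {int(reject.sum())} rejected after Holm correction")
```

`statsmodels.stats.multitest.multipletests(..., method="holm")` gives the same decisions; the manual loop is shown only to make the step-down thresholds explicit.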

02. Detecting interaction effects: How to check whether Model A's advantage over Model B is consistent across prompt templates, or appears only with certain phrasings.
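One way to formalize "is the advantage consistent?" is a heterogeneity test borrowed from meta-analysis: compute the A-minus-B gap per prompt, pool the gaps with inverse-variance weights, and test the spread with Cochran's Q. The sketch below uses simulated scores and assumes two models evaluated on the same five prompts; it is an illustration of the test, not the investigation's final method.

```python
# Per-prompt A-vs-B gaps, pooled estimate, and Cochran's Q heterogeneity test.
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(1)
n, prompts = 40, 5
a = rng.normal(0.75, 0.1, (prompts, n))    # Model A scores, one row per prompt
b = rng.normal(0.70, 0.1, (prompts, n))    # Model B scores, one row per prompt

delta = a.mean(axis=1) - b.mean(axis=1)    # A's advantage on each prompt
se2 = a.var(axis=1, ddof=1) / n + b.var(axis=1, ddof=1) / n
w = 1.0 / se2                              # inverse-variance weights
pooled = (w * delta).sum() / w.sum()       # pooled advantage across prompts
Q = (w * (delta - pooled) ** 2).sum()      # Cochran's Q statistic
p_het = chi2.sf(Q, df=prompts - 1)         # small p => advantage varies by prompt
print(f"pooled advantage {pooled:.3f}, heterogeneity p = {p_het:.3f}")
```

A small heterogeneity p-value is the signal that "Model A wins" is really "Model A wins under some phrasings", which changes what a fair headline claim looks like.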

03. Visualizing the score matrix with uncertainty: How to build a score heatmap that shows confidence-interval width alongside cell means, so narrow wins are visually distinct from robust ones.
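A sketch of one way to build such a heatmap, again on simulated scores (the matrix shape, sample size, and file name are assumptions): annotate each cell with its mean and 95% CI half-width so a 0.72 ± 0.01 cell reads differently from a 0.72 ± 0.08 one.

```python
# Heatmap of cell means, each annotated with its 95% CI half-width.
import matplotlib
matplotlib.use("Agg")                      # headless backend for scripts/CI
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(2)
models, prompts, n = 3, 5, 40
scores = rng.normal(0.7, 0.1, (models, prompts, n))  # per-example cell scores

means = scores.mean(axis=2)
sem = scores.std(axis=2, ddof=1) / np.sqrt(n)
half = t.ppf(0.975, df=n - 1) * sem        # 95% CI half-width per cell

fig, ax = plt.subplots(figsize=(7, 3))
im = ax.imshow(means, cmap="viridis")
for i in range(models):
    for j in range(prompts):
        ax.text(j, i, f"{means[i, j]:.2f}\n±{half[i, j]:.2f}",
                ha="center", va="center", color="white", fontsize=8)
ax.set_xlabel("prompt template")
ax.set_ylabel("model")
fig.colorbar(im, ax=ax, label="mean score")
fig.savefig("score_heatmap.png", dpi=150)
```

Color still encodes the mean, so the usual at-a-glance reading works; the ± annotation is what keeps a lucky narrow win from masquerading as a robust one.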

04. Reporting the winning combination: How to report a model+prompt winner while being honest about whether the win generalizes beyond these specific templates.
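One cheap honesty check, sketched below on a simulated mean matrix (the numbers are made up): alongside the best cell, report how many prompt columns the winning model actually tops. "Best cell" and "best under every template" are different claims, and this surfaces which one you can make.

```python
# Report the best model x prompt cell plus a column-wise consistency count.
import numpy as np

rng = np.random.default_rng(3)
means = rng.normal(0.7, 0.05, (3, 5))      # cell means: models x prompts

best_m, best_p = np.unravel_index(means.argmax(), means.shape)
per_prompt_winner = means.argmax(axis=0)   # winning model for each prompt
wins = int((per_prompt_winner == best_m).sum())
print(f"best cell: model {best_m}, prompt {best_p}; "
      f"model {best_m} tops {wins}/{means.shape[1]} prompt columns")
```

A 5/5 count supports "model X wins"; a 2/5 count supports only "model X wins with prompt Y", which is the honest headline in that case.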

Looking for quick usage examples? Check out the Example Usage page.

Investigation in progress

Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.