This roadmap covers planned investigations, library features, and simulation extensions. Items are ordered roughly by priority. The current focus is on establishing solid defaults for the most common eval tasks, with worked Jupyter notebooks, visual terminal outputs, and plots for each.
Current Focus
I am focused on covering the core cases every LLM practitioner encounters: comparing models, comparing prompts, and comparing model×prompt combinations, across both single-run and multi-run (seed variance) settings. Multi-run analysis is treated as a first-class concern: every single-run method has a nested-bootstrap counterpart when repeated sampling is available. These settings will be supported by the core API and visual terminal outputs, and will be informed by Monte Carlo simulations testing the quality of the default statistical methods across realistic eval conditions.
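To make the multi-run idea concrete, below is a minimal sketch of a nested bootstrap over items and runs, assuming scores arrive as an (n_items, n_runs) matrix; the function name, defaults, and example data are illustrative stand-ins, not the evalstats API.

```python
import numpy as np

def nested_bootstrap_ci(scores, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean score from an (n_items, n_runs) matrix.

    Resamples items (outer level), then runs within each sampled item
    (inner level), so both item-to-item and run-to-run variance feed into
    the interval. Names and defaults are illustrative, not the library API.
    """
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    n_items, n_runs = scores.shape
    boot_means = np.empty(n_boot)
    for b in range(n_boot):
        item_idx = rng.integers(0, n_items, size=n_items)           # resample items
        run_idx = rng.integers(0, n_runs, size=(n_items, n_runs))   # resample runs per item
        resampled = scores[item_idx[:, None], run_idx]
        boot_means[b] = resampled.mean()
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return scores.mean(), (lo, hi)

# Example: 200 eval items scored as pass/fail across 5 seeds.
demo = np.random.default_rng(1).binomial(1, 0.72, size=(200, 5))
print(nested_bootstrap_ci(demo))
```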
Practitioners focused on the current hot topics, such as agent and RAG engineering, might worry that this roadmap is too focused on an "old" problem. First, prompts and models are foundational to all LLM-integrated systems, and getting the defaults right for them will cover a huge fraction of use cases. Second, and more importantly, "models" and "prompts" are easily abstracted, so the same analysis methods can be reused, for instance to compare RAG pipeline configurations or component variations in agent architectures. Getting the defaults right is therefore really important, as it lays the groundwork for more complex analyses.
Current investigations include:
- Model A vs. Model B: Is the Gap Real?: CIs on the score difference (Newcombe / smooth bootstrap; see the sketch after this list), single-run and multi-run nested-bootstrap variants, minimum N to detect a given gap, and honest reporting of statistical ties.
- Finding Your Best Prompt: Per-variant CIs with multiple-comparison correction, single-run and multi-run modes, and how to report “Prompt C wins” with appropriate hedging.
- Finding the Best Model×Prompt Combo: Multi-comparison correction, factorial (two-way) analysis to detect interaction effects, score-matrix heatmaps with CI width, and multi-run variants. Includes analyze_factorial() for the common tagged-dataset format. In progress.
- Comparing Models Across Multiple Metrics: Per-metric CIs, Pareto dominance as a weight-free criterion, composite score uncertainty, and visualizations that surface tradeoffs rather than a pre-digested verdict. In progress.
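As a concrete illustration of the first investigation above, here is a minimal sketch of the Newcombe (Wilson-based) interval for the difference between two pass rates; the helper names and example counts are assumptions for illustration, not the evalstats implementation.

```python
import numpy as np
from scipy.stats import norm

def wilson_ci(successes, n, alpha=0.05):
    """Wilson score interval for a single proportion."""
    z = norm.ppf(1 - alpha / 2)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * np.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

def newcombe_diff_ci(successes_a, n_a, successes_b, n_b, alpha=0.05):
    """Newcombe interval for p_a - p_b, built from the two Wilson intervals."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    l_a, u_a = wilson_ci(successes_a, n_a, alpha)
    l_b, u_b = wilson_ci(successes_b, n_b, alpha)
    diff = p_a - p_b
    lower = diff - np.sqrt((p_a - l_a) ** 2 + (u_b - p_b) ** 2)
    upper = diff + np.sqrt((u_a - p_a) ** 2 + (p_b - l_b) ** 2)
    return diff, (lower, upper)

# Example: Model A passes 172/200 items, Model B passes 158/200.
print(newcombe_diff_ci(172, 200, 158, 200))
```

The appeal of this construction is that it inherits the good small-sample behavior of the Wilson interval for each arm, rather than relying on a normal approximation to the raw difference.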
Supporting these investigations: the evalstats library API (compare_models(), compare_prompts(), analyze(), analyze_factorial()) with .summary() terminal output and .plot(); and Monte Carlo simulations covering CI coverage and width across binary, continuous, Likert, and grade-scale eval types at varied sample sizes.
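For a flavor of what those simulations check, here is a minimal coverage experiment for the binary case: simulate many evals at a known pass rate, compute a Wilson interval each time, and record how often it contains the truth and how wide it is. The scenario parameters are arbitrary stand-ins, not the actual simulation grid.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

def wilson_coverage(true_p=0.7, n_items=200, n_trials=5000, alpha=0.05, seed=0):
    """Fraction of simulated evals whose Wilson CI contains the true pass rate."""
    rng = np.random.default_rng(seed)
    hits = 0
    widths = []
    for _ in range(n_trials):
        successes = rng.binomial(n_items, true_p)
        lo, hi = proportion_confint(successes, n_items, alpha=alpha, method="wilson")
        hits += lo <= true_p <= hi
        widths.append(hi - lo)
    return hits / n_trials, float(np.mean(widths))

# A nominal 95% interval should land near 0.95 coverage; width shows how informative it is.
print(wilson_coverage())
```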
Near-Future Planned Foci
More nuanced investigations for practitioners who need to answer harder questions about seed variance, fine-tune validation, pipeline optimization, and eval set sizing.
- Checking Response Consistency: Measuring run-to-run variance and related factors.
- Before / After: Did My Fine-Tune Actually Help?: Simple before/after comparison methods.
- How Many Eval Items Do I Actually Need?: Estimating required sample sizes.
- Prompt Sensitivity Analysis: Exploring prompt effects and stability.
- Stats for RAG Pipelines: Methods for analyzing RAG pipeline setups.
- Regression Guard: Tools for regression detection and CI integration.
- Auditing Your LLM Judge: Approaches for measuring annotation agreement.
- When Model Rankings Are Just Noise: Assessing ranking reliability.
- Distilling a Benchmark: Reducing and validating benchmark sets.
Far-Out Explorations
I intend to explore alternative statistical paradigms, like Bayesian methods and e-values, which could provide more nuanced insights and better support for the iterative, noisy nature of LLM evals. However, I want to first provide some sound, "best effort" defaults within a more familiar frequentist framework, which many practitioners are already using and which is still widely requested by stakeholders, reviewers, and auditors.
One idea here is for evalstats to add a Bayesian mode, where users can toggle between the two paradigms and get outputs that are consistent with the underlying philosophy without needing to change top-level API calls (e.g., compare_models() will report either frequentist or Bayesian outputs depending on the mode). This guide would then gain an "alternative universe" in which the same website and investigation notebooks are re-framed through a Bayesian lens, and unique Bayesian investigations are added (e.g., sequential testing and evidence combination across tests).
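As a rough sketch of what a Bayesian-mode output could report, the snippet below computes the posterior probability that model A's pass rate exceeds model B's from a Beta-Binomial model; the flat priors and the reporting format are assumptions, not a committed evalstats design.

```python
import numpy as np

def prob_a_beats_b(passes_a, n_a, passes_b, n_b, n_draws=100_000, seed=0):
    """Posterior P(p_a > p_b) under independent Beta(1, 1) priors on each pass rate."""
    rng = np.random.default_rng(seed)
    draws_a = rng.beta(1 + passes_a, 1 + n_a - passes_a, size=n_draws)
    draws_b = rng.beta(1 + passes_b, 1 + n_b - passes_b, size=n_draws)
    return float(np.mean(draws_a > draws_b))

# Same pass/fail counts as the earlier frequentist example: 172/200 vs. 158/200.
print(prob_a_beats_b(172, 200, 158, 200))
```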