Statistics is a complex and often confusing field, with a dizzying array of methods, tests, and metrics to choose from. The purpose of this project is to identify reasonable defaults for statistical methods in the specific context of LLM evals. To do that, we need to establish some principles to guide our choices and to help us navigate the complex landscape.
Philosophy
Estimation-first, conservative, non-parametric, pairwise, CI-focused, sample-floor-enforcing, simulation-grounded, p-value-permissive.
There are many statistical philosophies to choose from. The most prominent are the frequentist and Bayesian approaches, but there are also variations, such as estimation statistics and decision theory. Frequentist statistics remains the dominant (and contentious) approach, centered on p-values and null hypothesis significance testing (NHST). Many scientific disciplines, including ML, have historically relied on NHST. However, there is growing recognition of the limitations and pitfalls of NHST, especially in the context of noisy measurements, small sample sizes, and repeated testing, all of which are common in LLM evals.
Therefore, the Stats for Evals project (and evalstats API) takes an estimation-first approach to statistical analysis,
foregrounding confidence intervals (particularly bootstrapping), point estimates,
effect sizes, pairwise differences, and visualization,
rather than p-values. As many scientists across fields have recognized,
p-values encourage dichotomous thinking and overconfidence, which is especially dangerous
in the context of LLM evals, where measurements are noisy, sample sizes are often small, and repeated tests are rampant, with tests re-run upon every minor change to the model, prompt, or eval set.
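To make this concrete, here is a minimal sketch of the kind of analysis an estimation-first approach implies: a percentile-bootstrap 99% CI on the pairwise accuracy difference between two models scored on the same questions. The variable names and placeholder data are illustrative, not the evalstats API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Per-question correctness (1/0) for two models on the SAME questions.
model_a = rng.integers(0, 2, size=200)  # placeholder data
model_b = rng.integers(0, 2, size=200)

diffs = model_a - model_b                  # paired per-question differences
point_estimate = diffs.mean()              # effect size: accuracy difference

# Percentile bootstrap: resample questions with replacement.
boot = np.array([
    rng.choice(diffs, size=diffs.size, replace=True).mean()
    for _ in range(10_000)
])
lo, hi = np.percentile(boot, [0.5, 99.5])  # 99% CI (alpha = 0.01)
print(f"A - B accuracy: {point_estimate:+.3f}, 99% CI [{lo:+.3f}, {hi:+.3f}]")
```

Note that the point estimate, the effect size, and the interval are the headline results here; no binary significant/not-significant verdict is required.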
That said, evalstats does allow users to print p-values and run NHST tests if they wish,
as many people are still familiar with that framework and may be asked to use it when communicating with stakeholders (e.g., paper reviewers).
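For readers who need such a p-value, one standard paired test that fits the same per-question data is a sign-flip permutation test, sketched generically below (this is one reasonable choice, not necessarily the exact test evalstats implements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Paired per-question score differences between two models (placeholder data).
diffs = rng.integers(0, 2, size=200) - rng.integers(0, 2, size=200)
obs = abs(diffs.mean())

# Under H0 (no difference), each paired difference is symmetric around zero,
# so randomly flipping signs simulates the null distribution.
flips = rng.choice([-1, 1], size=(10_000, diffs.size))
null = np.abs((flips * diffs).mean(axis=1))
p_value = (null >= obs).mean()
print(f"two-sided sign-flip permutation p = {p_value:.4f}")
```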
To provide "at a glance" decision support, evalstats also uses CIs for significance-style judgments about the most
likely top performers among multiple candidates, adapting rank bands and critical difference diagrams from the multiple-comparisons literature to an estimation-first framework
built on simultaneous confidence intervals and bootstrap-resampled rank distributions.
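A sketch of the bootstrap-resampled rank idea follows (illustrative only; evalstats' actual interface and plotting are not shown). Each bootstrap replicate resamples questions, recomputes every model's mean score, and ranks the models; the per-model distribution of ranks then yields rank bands:

```python
import numpy as np

rng = np.random.default_rng(0)

# scores[m, q]: per-question scores for M models on the same Q questions.
M, Q = 4, 300
scores = rng.random((M, Q))  # placeholder data

B = 5_000
ranks = np.empty((B, M), dtype=int)
for b in range(B):
    idx = rng.integers(0, Q, size=Q)        # resample questions
    means = scores[:, idx].mean(axis=1)     # one mean per model
    # rank 1 = best; argsort twice converts scores to rank positions
    ranks[b] = (-means).argsort().argsort() + 1

for m in range(M):
    lo, hi = np.percentile(ranks[:, m], [0.5, 99.5])
    print(f"model {m}: median rank {np.median(ranks[:, m]):.0f}, "
          f"99% rank band [{lo:.0f}, {hi:.0f}]")
```

Models whose rank bands exclude rank 1 can be set aside at a glance, without any NHST machinery.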
Principles
Based on our philosophy, we establish the following principles to guide our statistical choices in the specific context of LLM evals:
evalstats uses 99% CIs (an alpha of 0.01 for CIs and p-values), which is more conservative than the traditional 0.05,
as a Type I error risk of 5% is too high for the noisy, high-stakes, repeated-testing context of LLM evals. (Arguably, even 0.01 is too high a threshold,
but we also don't want to be so conservative as to never find anything of importance.)
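As a concrete illustration of what alpha = 0.01 means for a binary pass rate, here is a 99% Wilson score interval computed with statsmodels, a standard choice for proportions (shown as an example, not as the interval evalstats uses in every case):

```python
from statsmodels.stats.proportion import proportion_confint

passes, n = 86, 100                      # 86/100 questions passed
lo95, hi95 = proportion_confint(passes, n, alpha=0.05, method="wilson")
lo99, hi99 = proportion_confint(passes, n, alpha=0.01, method="wilson")
print(f"95% CI: [{lo95:.3f}, {hi95:.3f}]")   # narrower
print(f"99% CI: [{lo99:.3f}, {hi99:.3f}]")   # wider, more conservative
```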
method="auto" defaults in evalstats should adapt to sample size N and data type,
based on which methods performed best in simulations that reflect those conditions.
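A hypothetical sketch of what such dispatch logic could look like; the thresholds and method names below are invented for illustration and are not evalstats' actual rules:

```python
def resolve_method(data_type: str, n: int) -> str:
    """Illustrative only: pick a CI method from data type and sample size.

    The cutoffs and method names here are hypothetical placeholders;
    the real defaults should come from simulation results.
    """
    if data_type == "binary":
        if n < 30:
            return "bayes_beta"        # e.g., a Bayesian interval at low N
        return "wilson"                # score interval for proportions
    if data_type == "continuous":
        if n < 30:
            return "bootstrap_bca"     # bias-corrected bootstrap
        return "bootstrap_percentile"
    raise ValueError(f"unknown data_type: {data_type!r}")

print(resolve_method("binary", n=12))  # -> "bayes_beta"
```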
Commentary and Responses to Common Questions
To make progress, we've gotta start somewhere
Like any new project, this one can't be perfect from the start. Statistics nerds may well point out some flaws with this analysis. I can already hear the nay-sayers, spouting terms like "content validity," "pre-registration," the Type I risks of repeated testing, etc. And yes, all of those are valid concerns! But we have to start somewhere, and it's certainly better to identify some reasonable approaches like estimation statistics than to continue with the rampant status quo of no uncertainty quantification at all. If you're a stats nerd, please reach out if you'd like to contribute to improving this project.
Why no Bayesian approach?
There could be a Bayesian alternative to this guide (among other alternative philosophies),
which would change how we approach and report our evals analysis.
The main benefit of a Bayesian approach in this context would be its support for sequential testing
and combining evidence across tests, which is common in evals as developers often start with small test sets
and expand them over time. Although evalstats uses some Bayesian methods
(like bayes_evals
to approximate CIs at low sample size for binary data), it is decidedly not a Bayesian framework.
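The underlying idea, shown generically here rather than via bayes_evals' API: with a Beta prior on a pass rate, the posterior after s successes in n trials is also Beta, and its quantiles give a credible interval even at very small n. The Jeffreys prior below is a common default, assumed here for illustration:

```python
from scipy.stats import beta

s, n = 3, 10                  # 3 passes out of 10: too few for a CLT-based CI
a0, b0 = 0.5, 0.5             # Jeffreys prior
posterior = beta(a0 + s, b0 + (n - s))
lo, hi = posterior.ppf([0.005, 0.995])   # central 99% credible interval
print(f"pass rate: {s/n:.2f}, 99% credible interval [{lo:.3f}, {hi:.3f}]")
```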
We think that an estimation-first frequentist approach is more accessible to the average practitioner,
closer to the traditional methods that they may be familiar with, and thus more likely to be adopted in the short term.
If our goal is changing practices and making some inroads for uncertainty quantification in the LLM eval community,
then we think this is the best place to start (and honestly, we're already doing some "non-traditional" things here—like using 99% CIs by default,
or adapting a max-T CI correction for multiple comparisons).
However, we intend to explore a Bayesian mode in a future release.
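As a closing illustration, here is a sketch of the max-T simultaneous-CI idea mentioned above: widen all pairwise intervals by the bootstrap quantile of the maximum studentized statistic, so the whole family of comparisons holds jointly at the 99% level. This is a generic rendering of the technique, not evalstats' exact implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

# scores[m, q]: M models scored on the same Q questions.
M, Q, B = 4, 300, 5_000
scores = rng.random((M, Q))  # placeholder data

baseline = 0                                 # compare every model to model 0
diffs = scores - scores[baseline]            # per-question paired differences
others = [m for m in range(M) if m != baseline]

est = diffs[others].mean(axis=1)
se = diffs[others].std(axis=1, ddof=1) / np.sqrt(Q)

# Bootstrap the max absolute studentized deviation across all comparisons.
max_t = np.empty(B)
for b in range(B):
    idx = rng.integers(0, Q, size=Q)
    d = diffs[others][:, idx]
    t = (d.mean(axis=1) - est) / (d.std(axis=1, ddof=1) / np.sqrt(Q))
    max_t[b] = np.abs(t).max()

crit = np.quantile(max_t, 0.99)              # one critical value for all CIs
for m, e, s_ in zip(others, est, se):
    print(f"model {m} - model {baseline}: {e:+.3f} "
          f"+/- {crit * s_:.3f} (simultaneous 99% CI)")
```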