Statistics is a complex and often confusing field, with a dizzying array of methods, tests, and metrics to choose from. The purpose of this project is to identify reasonable defaults for statistical methods in the specific context of LLM evals. To do that, we need to establish some principles to guide our choices and to help us navigate the complex landscape.

Philosophy

Estimation-first, conservative, non-parametric, pairwise, CI-focused, sample floor-enforcing, simulation-grounded, p-value permissive.

There are many different statistical philosophies to choose from. The most prominent ones are frequentist and Bayesian approaches, but there are also variations, such as estimation statistics and decision theory. Frequentist statistics remains the most dominant (and contentious), centered around p-values and null hypothesis significance testing (NHST). Many scientific disciplines have historically relied on NHST, including ML. However, there has been a growing recognition of the limitations and pitfalls of NHST, especially in the context of noisy measurements, small sample sizes, and repeated testing, all of which are common in LLM evals.

Therefore, the Stats for Evals project (and evalstats API) takes an estimation-first approach to statistical analysis, foregrounding confidence intervals (particularly bootstrapped ones), point estimates, effect sizes, pairwise differences, and visualization, rather than p-values. As many scientists across fields have recognized, p-values encourage dichotomous thinking and overconfidence, which is especially dangerous in the context of LLM evals, where measurements are noisy, sample sizes are often small, and repeated tests are rampant, with tests re-run upon every minor change to the model, prompt, or eval set. That said, evalstats does allow users to print p-values and NHST results, should they wish, as many people are still familiar with that framework and may be asked to use it when communicating with stakeholders (e.g., paper reviewers). To provide "at a glance" decision support, evalstats also leans into CI-based significance judgments to estimate the most likely top performers among multiple candidates, adapting rank bands and critical difference diagrams from the multiple comparisons literature to an estimation-first framework that instead uses simultaneous confidence intervals and bootstrap-resampled rank distributions.
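For instance, here is a minimal sketch of the kind of output we foreground: a point estimate of a paired difference plus a 99% percentile-bootstrap CI. It uses plain NumPy, and the pass/fail scores are made up for illustration:

```python
# Estimation-first in miniature: report a point estimate and a bootstrap CI
# for a paired difference, not a p-value. All scores below are synthetic.
import numpy as np

rng = np.random.default_rng(0)
model_a = rng.binomial(1, 0.82, size=200)  # per-item pass/fail for model A
model_b = rng.binomial(1, 0.76, size=200)  # same items, scored for model B

diffs = model_a - model_b                  # paired per-item differences
point = diffs.mean()

# Percentile bootstrap: resample items with replacement, keeping pairs intact.
boot = rng.choice(diffs, size=(10_000, diffs.size), replace=True).mean(axis=1)
lo, hi = np.percentile(boot, [0.5, 99.5])  # 99% CI (alpha = 0.01)

print(f"A - B accuracy difference: {point:.3f}, 99% CI [{lo:.3f}, {hi:.3f}]")
```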

Principles

Based on our philosophy, we establish the following principles to guide our statistical choices in the specific context of LLM evals:

01
Err conservative by default. LLM evals suffer from construct validity issues, noisy measurements, and often small sample sizes. Basically, we should assume all evals are ill-designed, underpowered, and therefore noisy until proven otherwise. What's more, many evals concern safety, and a wider confidence interval that actually contains the truth is far safer than a narrow interval that misses it. Thus, we prefer tests and methods that sacrifice some statistical power to guard against overconfidence. By default, evalstats uses 99% CIs (an alpha of 0.01 for CIs and p-values), which is more conservative than the traditional 0.05, as a Type I error risk of 5% is too high for the noisy, high-stakes, repeated-testing context of LLM evals. (Arguably, even 0.01 is too high a threshold, but we also don't want to be so conservative as to never find anything of importance.)
02
Prefer confidence intervals over standard errors, and plot CIs as error bars. Standard errors assume normality and symmetry, and they break down in the presence of hierarchical data structures. Confidence intervals make fewer assumptions and communicate uncertainty more honestly. This recommendation is consistent with the principles of estimation statistics (the "New Statistics"), a widely cited movement that changed how psychologists approach data analysis (specifically the guideline: "Prefer 95% CIs to SE bars. Routinely report 95% CIs, and use error bars to depict them in figures").
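As a concrete illustration, here is a hedged sketch of computing bootstrap CIs and plotting them as (possibly asymmetric) error bars with matplotlib; the model names and scores are synthetic placeholders:

```python
# Plot bootstrap CIs (rather than SE bars) as error bars with matplotlib.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
scores = {"model_a": rng.binomial(1, 0.85, 150),   # synthetic pass/fail scores
          "model_b": rng.binomial(1, 0.78, 150)}

means, lowers, uppers = [], [], []
for s in scores.values():
    boot = rng.choice(s, size=(10_000, s.size), replace=True).mean(axis=1)
    lo, hi = np.percentile(boot, [0.5, 99.5])      # 99% percentile-bootstrap CI
    means.append(s.mean()); lowers.append(lo); uppers.append(hi)

means = np.asarray(means)
# Error bars take *offsets* from the point estimate; CIs may be asymmetric.
yerr = [means - np.asarray(lowers), np.asarray(uppers) - means]
x = np.arange(len(scores))
plt.errorbar(x, means, yerr=yerr, fmt="o", capsize=4)
plt.xticks(x, list(scores))
plt.ylabel("accuracy (99% CI)")
plt.show()
```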
03
Prefer pairwise tests whenever possible. Pairwise comparisons (each item evaluated under both conditions) cancel shared noise, dramatically reducing variance. This is especially important when eval items vary widely in difficulty. Said differently, comparing two models on the same set of items is more powerful than comparing them on different sets, even if the latter are larger. This is also consistent with the "New Statistics" recommendation to calculate CIs on differences rather than on raw means.
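The variance reduction from pairing is easy to see in a toy simulation. In the sketch below (all numbers synthetic), items vary widely in difficulty, and the 99% CI on the paired difference comes out much narrower than the CI computed as if the two models had been run on separate item sets:

```python
# Why pairing helps: items vary in difficulty, and evaluating both models
# on the same items cancels that shared variation.
import numpy as np

rng = np.random.default_rng(2)
n = 300
difficulty = rng.uniform(0.2, 0.95, n)                  # per-item pass probability
a = rng.binomial(1, np.clip(difficulty + 0.05, 0, 1))   # model A, slightly better
b = rng.binomial(1, difficulty)                         # model B on the SAME items

# Paired: bootstrap the per-item differences, keeping (a_i, b_i) together.
boot_d = rng.choice(a - b, size=(10_000, n), replace=True).mean(axis=1)
paired = np.percentile(boot_d, [0.5, 99.5])

# Unpaired: bootstrap each model independently, as if on different item sets.
boot_a = rng.choice(a, size=(10_000, n), replace=True).mean(axis=1)
boot_b = rng.choice(b, size=(10_000, n), replace=True).mean(axis=1)
unpaired = np.percentile(boot_a - boot_b, [0.5, 99.5])

print(f"paired 99% CI width:   {paired[1] - paired[0]:.3f}")
print(f"unpaired 99% CI width: {unpaired[1] - unpaired[0]:.3f}")
```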
04
Prefer non-parametric tests that don't assume normality. LLM outputs—especially binary pass/fail, Likert scores, and bounded floats—are rarely normally distributed. Methods like the bootstrap, and tests like the Wilcoxon signed-rank test, the sign test, and the Mann-Whitney U test, make fewer assumptions and are more robust to deviations from normality.
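For reference, all three tests are a few lines of SciPy (with the sign test expressed as a binomial test on the direction of non-tied differences); the Likert scores below are synthetic:

```python
# The non-parametric tests named above, via SciPy, on synthetic Likert data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
a = rng.integers(1, 6, 50)                        # 1-5 Likert scores, model A
b = np.clip(a + rng.integers(-1, 3, 50), 1, 5)    # model B on the same items

# Paired, same items: Wilcoxon signed-rank test (zeros dropped by default).
print(stats.wilcoxon(a, b))

# Paired, weakest assumptions: sign test, i.e., a binomial test on how many
# non-tied differences favor model B.
d = b - a
print(stats.binomtest(int((d > 0).sum()), int((d != 0).sum()), 0.5))

# Independent samples (different item sets): Mann-Whitney U.
c = rng.integers(1, 6, 40)                        # scores from an unpaired set
print(stats.mannwhitneyu(a, c))
```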
05
Prefer means over medians in eval contexts. LLM evals commonly operate in the 70–95% accuracy range with small, bounded scores. Medians don't make sense for many data types (e.g., binary pass/fail or 1–5 Likert), and even when they do, they often fail to capture meaningful differences. While medians are more robust to outliers, we actually want to capture outliers in evals, because they often represent important failure modes (e.g., model A fails on rare but critical cases, while model B performs well on those same cases).
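A tiny synthetic example makes the point: two pass/fail runs with clearly different accuracy can share the same median, so the median hides the gap entirely:

```python
# Two synthetic pass/fail runs with very different accuracy, same median.
import numpy as np

a = np.array([1] * 72 + [0] * 28)     # 72% accuracy
b = np.array([1] * 95 + [0] * 5)      # 95% accuracy
print(np.median(a), np.median(b))     # 1.0 and 1.0 -- indistinguishable
print(a.mean(), b.mean())             # 0.72 vs 0.95 -- the real difference
```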
06
Enforce a minimum sample floor. Statistics needs some data to work. We just stated that we should err conservative by default, because LLM evals are often poorly designed and thus noisy estimators. That means we need to enforce a minimum sample size floor below which we don't trust the results at all. In our simulations, N=30 is roughly where methods start to become reliable, and below N=15 they're basically useless, or so prone to error as to be actively misleading. However, developers often start with small eval sets and expand them over time. Thus, we make a trade-off: we recommend a hard cut-off at N=15, below which no statistical tests are reported. In the N=15–30 range, we print a loud warning to developers but still output stats. This strategy will, hopefully, push developers toward expanding their test sets. (A sketch of this policy follows below.)
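Here is one way such a floor policy might look in code. To be clear, the function name, wiring, and message wording below are illustrative, not the actual evalstats API:

```python
# Hypothetical sketch of the sample-floor policy described above.
import warnings

HARD_FLOOR = 15   # below this, report no statistics at all
SOFT_FLOOR = 30   # below this, report statistics with a loud warning

def check_sample_floor(n: int) -> bool:
    """Return True if statistics should be reported for n eval items."""
    if n < HARD_FLOOR:
        raise ValueError(
            f"N={n} is below the hard floor of {HARD_FLOOR}; "
            "results would be too unreliable to report."
        )
    if n < SOFT_FLOOR:
        warnings.warn(
            f"N={n} is below {SOFT_FLOOR}; statistics are reported but "
            "are likely unreliable. Consider expanding the eval set.",
            stacklevel=2,
        )
    return True
```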
07
Ground recommendations in Monte Carlo simulations. Statistical defaults should be justified by empirical behavior, not by convention or even by trusting research papers. Conventions were established for (likely) different data types and contexts than LLM evals, and may not be suited to our needs. Thus, whenever possible, we should run simulations that mirror real eval conditions to estimate how different methods perform across realistic data types, sample sizes, and noise. The method="auto" defaults in evalstats should adapt to sample size N and data type, based on which methods performed best in simulations that reflect those conditions.
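For example, a simulation along these lines (parameters are illustrative) estimates the empirical coverage of a 99% percentile-bootstrap CI for binary accuracy at various N, which is the kind of evidence that can justify both method defaults and a sample floor:

```python
# Monte Carlo sketch: how often does a 99% percentile-bootstrap CI for
# binary accuracy actually contain the true value at a given N?
import numpy as np

def coverage(p_true=0.8, n=30, n_trials=2000, n_boot=2000, alpha=0.01):
    rng = np.random.default_rng(4)
    hits = 0
    for _ in range(n_trials):
        sample = rng.binomial(1, p_true, n)
        boot = rng.choice(sample, size=(n_boot, n), replace=True).mean(axis=1)
        lo, hi = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        hits += lo <= p_true <= hi
    return hits / n_trials   # should be close to 0.99 if the method is sound

for n in (10, 15, 30, 100):
    print(f"N={n:>3}: empirical coverage {coverage(n=n):.3f}")
```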

Commentary and Responses to Common Questions

To make progress, we've gotta start somewhere

Like any new project, we can't afford to be perfect. Statistics nerds will, perhaps, point out some flaws with this analysis. I can already hear the naysayers, spouting terms like "content validity," "pre-registration," the Type I risks of repeated testing, etc. And yes, all of those are valid concerns! But we have to start somewhere, and it's certainly better to identify some reasonable approaches like estimation statistics than to continue with the rampant status quo of no uncertainty quantification at all. If you're a stats nerd, please reach out if you'd like to contribute to improving this project.

Why no Bayesian approach?

There could be a Bayesian alternative to this guide (among other alternative philosophies), which would change how we approach and report our evals analysis. The main benefit of a Bayesian approach in this context would be its support for sequential testing and for combining evidence across tests, which is common in evals, as developers often start with small test sets and expand them over time. Although evalstats uses some Bayesian methods (like bayes_evals to approximate CIs at low sample sizes for binary data), it is decidedly not a Bayesian framework. We think that an estimation-first frequentist approach is more accessible to the average practitioner, closer to the traditional methods that they may be familiar with, and thus more likely to be adopted in the short term. If our goal is changing practices and making some inroads for uncertainty quantification in the LLM eval community, then we think this is the best place to start (and honestly, we're already doing some "non-traditional" things here—like using 99% CIs by default, or adapting a max-T CI correction for multiple comparisons). However, we intend to explore a Bayesian mode in a future release.
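For the curious, the flavor of interval involved for binary data at low N is easy to sketch: a Beta-Binomial credible interval under a uniform prior. The snippet below illustrates the idea only; it is not the bayes_evals implementation:

```python
# Beta-Binomial credible interval for a pass rate at low N, with a uniform
# Beta(1, 1) prior. Illustrative only; not how bayes_evals computes its CIs.
from scipy import stats

k, n = 11, 14                            # e.g., 11 passes out of 14 items
posterior = stats.beta(1 + k, 1 + n - k) # conjugate posterior over pass rate
lo, hi = posterior.ppf([0.005, 0.995])   # central 99% credible interval
print(f"pass rate ~ {k/n:.2f}, 99% credible interval [{lo:.2f}, {hi:.2f}]")
```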