Recommendations and Defaults
Based on research and Monte Carlo simulations, we recommend the following methods for
calculating confidence intervals and p-values in LLM eval contexts.
These recommendations are implemented as the
method="auto" defaults in evalstats,
which adapt to data type and sample size based on which methods performed best in our simulations.
Note that the contents of this section are subject to change as we gather more evidence for different situations.
Methods to Compute Confidence Intervals (By Data Type)
In short: use bayes_evals only below N = 100; switch to the bootstrap at N = 100 and above, both for safety and to enable multiple-comparisons correction of CIs (simultaneous CIs via the studentized max-T method). (Warning: around N ≈ 200+, the paired method in bayes_evals collapses, hence the switch.)
By Analysis Type (Confidence Intervals)
| Task | Data Type | Recommended Method | Notes |
|---|---|---|---|
| Single-sample CI on mean score | Binary | Wilson score interval | Closed-form, best coverage |
| Single-sample CI on mean score | Numeric/Interval | Smooth bootstrap (Gaussian KDE) | Conservative; most reliable at small N |
| Pairwise comparison (A vs B) | Binary | Bayesian Paired (N < 100), Bootstrap (N ≥ 100) | Bayesian paired is strong at small N but can fail around N≈200+; switch at N=100 to bootstrap for safety and multi-comparisons correction |
| Pairwise comparison (A vs B) | Numeric/Interval | Smooth bootstrap (Gaussian KDE) | Highest pairwise coverage; BCa is explicitly not recommended |
| All pairwise comparisons (k templates) | Any | As above + multi-comparisons correction | We recommend the max-T resampling procedure (simultaneous_ci=True will use max-T if the method is also bootstrap-based); there is precedent for this in the ML literature |
| Hierarchical data (multi-run) | Any | Nested smooth bootstrap (Gaussian KDE) | Resample inputs, then runs within inputs; propagates run stochasticity into CIs |
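To make the single-sample rows above concrete, here is a minimal sketch (not the evalstats implementation) of the two recommended methods: a Wilson score interval for binary scores via statsmodels, and a smooth (Gaussian-KDE) bootstrap CI for numeric scores. The Silverman bandwidth rule, the helper name, and the example data are illustrative assumptions.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)

# Wilson score interval for binary scores (closed form, via statsmodels).
binary_scores = rng.binomial(1, 0.7, size=80)            # hypothetical 0/1 eval scores
lo, hi = proportion_confint(binary_scores.sum(), len(binary_scores),
                            alpha=0.05, method="wilson")
print(f"Wilson 95% CI: [{lo:.3f}, {hi:.3f}]")

# Smooth (Gaussian-KDE) bootstrap CI on the mean of numeric scores.
def smooth_bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    h = 1.06 * scores.std(ddof=1) * n ** (-1 / 5)        # Silverman rule-of-thumb bandwidth
    idx = rng.integers(0, n, size=(n_boot, n))           # ordinary resampling...
    jitter = rng.normal(0.0, h, size=(n_boot, n))        # ...plus KDE noise = smooth bootstrap
    boot_means = (scores[idx] + jitter).mean(axis=1)
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

numeric_scores = rng.beta(2, 5, size=50)                 # hypothetical 0-1 judge scores
print("Smooth bootstrap 95% CI:", smooth_bootstrap_ci(numeric_scores))
```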
By Analysis Type (P-values)
| Task | Data Type | Recommended Method | Notes |
|---|---|---|---|
| Pairwise test (A vs B) | Any paired eval scores | Wilcoxon Signed Ranks | Slightly conservative; consistent across data types and sample sizes; no Type I error inflation above the nominal alpha in our simulations. |
| Pairwise test (A vs B), but in a safety-critical setting | Any paired eval scores | Sign Test | More conservative than Wilcoxon; useful when false positives are very costly. |
| Multiple pairwise tests (k variants) | Any | Benjamini-Hochberg (fdr_bh) | Default correction in evalstats; controls the expected false discovery rate with better power than stricter family-wise procedures |
Do not rely on NHST alone for eval decisions. If you report p-values, report them alongside effect sizes and confidence intervals. Prefer effect sizes and CIs as primary evidence, and treat p-values as secondary.
Sample Size Guidance
- N < 15: Do not report statistics. The CI width will be so large as to be meaningless (see the sketch below), and the sample is extremely unlikely to be representative.
- 15 ≤ N < 30: Report statistics with a prominent warning. Results should be treated as exploratory only.
- N ≥ 30: Bootstrapped CIs and score intervals begin to become reliable.
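For intuition about these cutoffs, the quick sketch below (our own illustration, with an arbitrary observed accuracy of 0.7) shows how wide a Wilson 95% CI is at different sample sizes:

```python
from statsmodels.stats.proportion import proportion_confint

# Width of a Wilson 95% CI for an observed accuracy of ~0.7 at different N.
# Illustrates why very small samples yield intervals too wide to act on.
for n in (10, 15, 30, 100, 300):
    lo, hi = proportion_confint(round(0.7 * n), n, alpha=0.05, method="wilson")
    print(f"N={n:4d}  CI=[{lo:.2f}, {hi:.2f}]  width={hi - lo:.2f}")
```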
Simultaneous Confidence Intervals for Multiple Comparisons
When comparing multiple prompts or models, it is common to run multiple pairwise comparisons. In these cases, controlling the family-wise error rate (FWER) is essential to avoid spurious findings. The max-T resampling procedure is a well-established method in genomics and neuroscience for constructing simultaneous confidence intervals (CIs) that jointly cover all comparisons, rather than treating each CI in isolation (basically, it widens the CIs to ensure the alpha level is maintained across all comparisons). It also has very few assumptions, allowing for robust inference even when candidates are correlated. Despite its power and flexibility, max-T is virtually unknown in LLM evaluation and ML contexts.
The max-T procedure works by resampling the data (e.g., via bootstrap), computing the test statistic (such as the difference in means) for all pairs in each resample, and then recording the maximum absolute value of the test statistics across all comparisons for each resample. The distribution of these maxima is then used to set CI bounds that guarantee the desired coverage simultaneously for all comparisons, not just marginally. This approach is less conservative than Bonferroni correction, yet provides strong control over false positives.
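Here is a minimal NumPy sketch of that procedure for all pairwise mean differences among k candidates, using studentized bootstrap statistics. It illustrates the idea only; the function name, example data, and implementation details are our own, not the exact evalstats implementation.

```python
import numpy as np

def maxt_simultaneous_cis(scores, alpha=0.05, n_boot=10_000, seed=0):
    """Bootstrap max-T simultaneous CIs for all pairwise mean differences.

    scores: (n_inputs, k_candidates) array of paired per-input scores.
    A sketch of the idea, not the evalstats implementation.
    """
    rng = np.random.default_rng(seed)
    n, k = scores.shape
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]

    # Per-input paired differences for every comparison, shape (n, n_pairs).
    diffs = scores[:, [i for i, _ in pairs]] - scores[:, [j for _, j in pairs]]
    est = diffs.mean(axis=0)                        # observed mean differences
    se = diffs.std(axis=0, ddof=1) / np.sqrt(n)     # and their standard errors

    # Bootstrap the maximum absolute studentized deviation across all comparisons.
    max_t = np.empty(n_boot)
    for b in range(n_boot):
        d = diffs[rng.integers(0, n, size=n)]
        se_b = d.std(axis=0, ddof=1) / np.sqrt(n)
        t = np.abs(d.mean(axis=0) - est) / np.where(se_b > 0, se_b, np.inf)
        max_t[b] = t.max()

    crit = np.quantile(max_t, 1 - alpha)            # one critical value for all pairs
    return {p: (e - crit * s, e + crit * s) for p, e, s in zip(pairs, est, se)}

# Example: simultaneous 95% CIs for 4 prompt variants scored on the same 120 inputs.
rng = np.random.default_rng(1)
example_scores = np.clip(rng.beta(6, 2, size=(120, 4)) + rng.normal(0, 0.05, size=(120, 4)), 0, 1)
print(maxt_simultaneous_cis(example_scores))
```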
evalstats uses the max-T procedure by default whenever you output multiple pairwise comparisons with bootstrapped CIs. This ensures that the reported intervals account for the increased uncertainty from making many comparisons at once. For consistency, evalstats also applies the max-T correction to p-values when users request them, so that both CIs and p-values are corrected in the same principled way.
The max-T method is recommended by Rink & Brannath, 2025, a recent ML paper on multiple comparisons correction when comparing many candidate models that may be correlated. For more foundational background, see the Methods Used in evalstats and Statistical Pitfalls in ML sections of the Resources page.
Simulation Study
We ran simulation studies over eval-like distributions to compare methods for calculating CIs head-to-head, rather than relying solely on others' results.
We constructed realistic distributions, randomly sampled from them, and measured how often each CI
method captures the true value across repeated trials. Two estimands were tested: single-sample CIs
(does the interval contain the true mean?) and pairwise difference CIs (does the interval contain
the true difference?). Sample sizes: N = 10, 20, 30, 50, 100, 200. Bootstrap and Bayesian draws: 10,000.
Monte-Carlo trials: 2,000 per condition. Simulations are implemented in
the simulations/ folder of the project repository.
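For intuition, here is a stripped-down version of the coverage loop for a single condition (percentile bootstrap, Beta(2, 8) scores, N = 30); the real code in simulations/ sweeps all methods, distributions, and sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def percentile_bootstrap_ci(x, n_boot=2_000, alpha=0.05):
    means = rng.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

true_mean = 2 / (2 + 8)                 # mean of Beta(2, 8)
n_trials, hits = 2_000, 0
for _ in range(n_trials):
    sample = rng.beta(2, 8, size=30)    # one simulated eval of 30 inputs
    lo, hi = percentile_bootstrap_ci(sample)
    hits += lo <= true_mean <= hi
print(f"Estimated coverage: {hits / n_trials:.3f}")   # target is 0.95
```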
Distributions Tested
- Binary scores ({0, 1}): Bernoulli with p ∈ {0.1, 0.3, 0.5, 0.7, 0.9}
- Continuous scores (0–1 floats): Beta families covering uniform, U-shaped, right-skewed, left-skewed, and moderate-skew (e.g., Beta(1,1), Beta(0.5,0.5), Beta(2,8), Beta(8,2), Beta(2,5))
- Likert scores (integers 1–5): discrete scenarios including uniform, skewed-low, skewed-high, bimodal, and center-peaked
- Grades (0–100 floats): clipped normal scenarios including symmetric, high-scoring, low-scoring, ceiling-heavy, and floor-heavy
Single-Sample Coverage (target: 0.95)
The following video and table present the results of the Monte Carlo simulation, which took roughly 14 hours to run. The video shows how the coverage rate of each method evolves as the number of samples increases from 10 to 200, averaged across all distributions and data types. The table breaks down the final average coverage rates at each sample size, with notes on performance. (Coverage rate = fraction of trials where the CI contains the true population mean. Higher is better; under-coverage means the method is over-confident.)
| Method | Coverage | MCSE | Mean width | Dev. from 0.95 | Notes |
|---|---|---|---|---|---|
| Percentile bootstrap | 0.906▼ | 0.0004 | 2.9396 | -0.044 | Moderate under-coverage; narrower than smooth bootstrap |
| BCa | 0.908▼ | 0.0004 | 2.9628 | -0.042 | Slightly better than percentile bootstrap, still under target |
| Bayes bootstrap | 0.899▼ | 0.0004 | 2.8772 | -0.051 | Lowest coverage among bootstrap variants; over-optimistic |
| Smooth Bootstrap (Gaussian KDE) | 0.922 | 0.0004 | 3.3556 | -0.028 | Best bootstrap coverage; widest (most conservative) interval |
| Wilson score interval | 0.950 | 0.0007 | 0.2089 | -0.000 | Binary only; exact-on-target coverage |
| Bayesian Independent Samples | 0.939 | 0.0007 | 0.2095 | -0.011 | Binary-only Bayesian baseline; slightly under target |
Coverage target is 0.95. ▼ marks pronounced under-coverage. Bootstrap-family methods are shown alongside binary-only references (Wilson and bayes_indep) for comparison.
Pairwise Comparison Coverage (target: 0.95)
For paired comparisons, coverage measures whether the CI on the difference contains the true difference. This is the more practically important statistic for comparing two prompts or models.
| Method | Coverage | MCSE | Mean width | Dev. from 0.95 | Notes |
|---|---|---|---|---|---|
| Percentile bootstrap | 0.917 | 0.0006 | 0.6159 | -0.033 | Under target; stable but mildly over-confident |
| BCa | 0.902▼ | 0.0006 | 0.6247 | -0.048 | Lowest frequentist coverage; clearly under target |
| Bayes bootstrap | 0.912 | 0.0006 | 0.6050 | -0.038 | Under target with narrower intervals |
| Smooth Bootstrap (Gaussian KDE) | 0.951 | 0.0004 | 0.7042 | +0.001 | Best bootstrap method; near-exact target with widest CI |
| Newcombe score interval | 0.916 | 0.0011 | 0.2323 | -0.034 | Binary-only frequentist score interval; under target overall |
| Bayesian Independent Comparison | 0.969 | 0.0007 | 0.3400 | +0.019 | Binary-only independent Bayesian comparison; conservative |
| Bayesian Paired Comparison | 0.952 | 0.0009 | 0.3213 | +0.002 | Binary-only paired Bayesian comparison; near target in aggregate but breaks down around N≈200+ with overconfident CIs |
Coverage target is 0.95. ▼ marks pronounced under-coverage. Results are averaged across all eval types, scenarios, and sample sizes; binary-specific methods are included for reference.
Averaged across conditions, paired Bayes looks excellent (0.952), but this hides a critical failure mode:
around N≈200 and above, the bayes_evals paired method can collapse into dramatically
overconfident intervals. So the practical recommendation is conditional: use Bayesian paired only
at smaller sample sizes, then switch to an ordinary bootstrap at N = 100. Among bootstrap methods, the smooth bootstrap
remains the most conservative and best-calibrated general option, while BCa under-covers most severely (0.902).
The smooth bootstrap's performance on binary data is especially impressive given that
it is not designed for that data type and is technically "wrong" to run there, since one
cannot add Gaussian noise to binary data without straying outside the data range. Even so,
it still achieves reasonable performance at N = 30 and above.
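For reference, here is a small sketch of a paired smooth-bootstrap CI on the difference in means. It jitters the paired differences with a Silverman-bandwidth Gaussian kernel, which is one reasonable way to apply the method to binary scores; evalstats may implement the details differently, and the example data are made up.

```python
import numpy as np

def paired_smooth_bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Smooth-bootstrap CI on mean(a) - mean(b) for paired per-input scores."""
    rng = np.random.default_rng(seed)
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)   # paired differences
    n = len(d)
    h = 1.06 * d.std(ddof=1) * n ** (-1 / 5)                      # Silverman bandwidth
    idx = rng.integers(0, n, size=(n_boot, n))
    boot = (d[idx] + rng.normal(0.0, h, size=(n_boot, n))).mean(axis=1)
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])

# Hypothetical binary scores for two prompts on the same 60 inputs.
rng = np.random.default_rng(1)
prompt_a = rng.binomial(1, 0.72, size=60)
prompt_b = rng.binomial(1, 0.64, size=60)
print(paired_smooth_bootstrap_diff_ci(prompt_a, prompt_b))
```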
Comparison with Bayesian Methods for Binary Scores
To address the "Don't Use the CLT" paper's claims, we ensured the simulation included
the Bayesian beta-posterior methods the authors advocate, implemented using their bayes_evals library.
For the binary single-sample case, we find:
- Wilson score interval lands exactly on target coverage (0.950).
- The binary Bayesian single-sample baseline under-covers (0.939).
For pairwise comparisons in this simulation summary, Bayesian Paired Comparison
(0.952 aggregate) and Smooth Bootstrap (0.951) are both near target on average,
while Newcombe score interval under-covers overall (0.916). But the average masks
a severe regime issue: around N≈200+, the bayes_evals Bayesian paired method can fail
with overconfident CIs. (We ran additional simulations up to N=1000 confirming this pattern.
The reason appears to be that the importance sampling procedure becomes unstable at larger N.)
That makes the operational rule straightforward:
- For binary single-sample estimates, use Wilson.
- For binary pairwise comparisons, use Bayesian paired only when N < 100, then switch to Newcombe at N ≥ 100 for speed and better large-N accuracy; around N ≈ 200+, the bayes_evals paired method can become dangerously overconfident.
- For numeric/interval scores, the smooth bootstrap is the practical winner, and it remains the strongest frequentist option across mixed data types.
Thus, while Bayesian methods can perform well for binary data,
they are not universally superior to frequentist methods (at least in these comparisons), and
their advantages may not extend to non-binary data types common in LLM evals. We
call on statistics experts to extend our simulation code and meta-analyses to further
clarify the conditions under which Bayesian methods outperform frequentist alternatives,
and to identify any scenarios where they may underperform.
Is BCa Actually Better Than the Naive Bootstrap?
Common wisdom says yes, but our simulations say otherwise. (GPT-5, for instance, seems convinced BCa is the go-to method and loves to suggest it.) At least one external simulation study has reached a similar conclusion, consistent with our own results for LLM eval data.
Which p-value Method Works Best?
We ran a second Monte Carlo simulation study to compare p-value methods for pairwise
testing across LLM eval data types and sample sizes, implemented in
simulations/sim_compare_pvalues.py. The criterion here is Type I error
control under the null (target: α = 0.05) plus consistency across
data distributions and N.
Wilcoxon Signed Ranks is our recommended default for pairwise
comparisons for those adopting a NHST framework. In this simulation, it is slightly conservative, consistently
performant across sample sizes, data types, and distributions, and does not show pathological
behavior where Type I error is inflated above the nominal threshold
(α = 0.05). Wilcoxon is a non-parametric test that
makes fewer assumptions about the data distribution, which likely contributes
to its robustness in this context. It is also widely implemented, well-understood,
and computationally efficient, making it a practical choice for developers.
That said, the Sign Test is also a reasonable choice for safety-critical settings where false positives are especially costly, because it is much more conservative than Wilcoxon.
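Both tests are available in SciPy. The sketch below runs them on hypothetical paired scores (the data and effect size are made up); the sign test is implemented as a binomial test on the count of positive, non-tied differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical paired judge scores (0-1) for two prompt variants on the same 40 inputs.
a = rng.beta(8, 2, size=40)
b = np.clip(a - rng.normal(0.03, 0.10, size=40), 0, 1)

# Wilcoxon signed-rank test on the paired scores.
print(stats.wilcoxon(a, b))

# Sign test: binomial test on the number of positive (non-tied) differences.
d = a - b
print(stats.binomtest(int((d > 0).sum()), int((d != 0).sum()), p=0.5))
```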
For comparing 3+ models/prompts, we recommend the Friedman test as the default, with Wilcoxon signed ranks as the post-hoc pairwise test. Friedman is a non-parametric test for comparing more than two related samples: it does not assume normality or homoscedasticity (equal variance of residuals). Friedman tests are already used in some ML contexts for comparing models, as advocated by Demsar (2006). Note that evalstats does not output Friedman by default, since it does not foreground an NHST-style analysis pipeline, but it is available as an option for users who want it.
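A minimal SciPy sketch of that workflow, with hypothetical scores for three prompt variants: an omnibus Friedman test first, then post-hoc pairwise Wilcoxon tests (whose p-values should be corrected as described below).

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical scores for three prompt variants on the same 50 inputs.
scores = {name: rng.beta(a, 2, size=50) for name, a in [("p1", 6), ("p2", 7), ("p3", 8)]}

# Omnibus Friedman test across the related samples.
print(stats.friedmanchisquare(*scores.values()))

# Post-hoc pairwise Wilcoxon signed-rank tests.
for x, y in combinations(scores, 2):
    print(x, "vs", y, stats.wilcoxon(scores[x], scores[y]))
```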
For multiple comparisons correction,
we recommend the Benjamini-Hochberg procedure (correction="fdr_bh")
for lower-stakes contexts: it performed comparably to other methods in our simulation,
but had the greatest power among methods that maintained Type I error control.
In extremely safety-critical contexts, the Bonferroni-Holm correction (correction="holm")
is more appropriate for its strong control over false positives, albeit at the cost of increased false negatives.
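Both corrections are a one-liner with statsmodels; the sketch below applies fdr_bh and holm to a hypothetical set of raw p-values from pairwise tests.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

raw_p = np.array([0.004, 0.012, 0.030, 0.210, 0.480])   # hypothetical raw p-values

# Benjamini-Hochberg (FDR control) for lower-stakes settings...
reject_bh, p_bh, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
# ...and Holm for safety-critical settings with strict family-wise control.
reject_holm, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

print("BH adjusted:  ", np.round(p_bh, 3), reject_bh)
print("Holm adjusted:", np.round(p_holm, 3), reject_holm)
```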
We ran simulations to provide guidance on which p-value method is most reliable for LLM evals, for those who want to use them, but we do not recommend relying on p-values. Null-hypothesis significance testing (NHST) should not be the primary basis for eval decisions. A p-value is only a compatibility check against a null model; it does not tell you the magnitude, direction, or practical importance of an effect.
If you report p-values at all, always report them alongside effect sizes and confidence intervals. In practice, prefer effect sizes and CIs as the main evidence, and use p-values as secondary context. Prefer alpha thresholds that are more conservative than 0.05, if you use them at all (e.g., 0.001).