Recommendations and Defaults
Based on research and Monte Carlo simulations, we recommend the following methods for
calculating confidence intervals and p-values in LLM eval contexts.
These recommendations are implemented as the
method="auto" defaults in evalstats,
which adapt to data type and sample size based on which methods performed best in our simulations.
Note that the contents of this section are subject to change as we gather more evidence for different situations.
Methods to Compute Confidence Intervals (By Data Type)
In short: use bayes_evals only below N = 100; switch to the bootstrap at N = 100 and above, both for safety and to enable multiple-comparisons correction of CIs (simultaneous CIs via the studentized max-T method). (Warning: around N ≈ 200+, the paired method in bayes_evals collapses, hence the switch.)
By Analysis Type (Confidence Intervals)
| Task | Data Type | Recommended Method | Notes |
|---|---|---|---|
| Single-sample CI on mean score | Binary | Wilson score interval | Closed-form, best coverage |
| Single-sample CI on mean score | Numeric/Interval | Smooth bootstrap (Gaussian KDE) | Conservative; most reliable at small N |
| Pairwise comparison (A vs B) | Binary | Bayesian Paired (N < 100), Bootstrap (N ≥ 100) | Bayesian paired is strong at small N but can fail around N≈200+; switch at N=100 to bootstrap for safety and multi-comparisons correction |
| Pairwise comparison (A vs B) | Numeric/Interval | Smooth bootstrap (Gaussian KDE) | Highest pairwise coverage; BCa is explicitly not recommended |
| All pairwise comparisons (k templates) | Any | As above + multi-comparisons correction | We recommend the max-T resampling procedure (simultaneous_ci=True will use max-T if the method is also bootstrap-based); there is precedent for this in the ML literature |
| Hierarchical data (multi-run) | Any | Nested smooth bootstrap (Gaussian KDE) | Resample inputs, then runs within inputs; propagates run stochasticity into CIs |
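To make the single-sample rows above concrete, here is a minimal sketch (not the evalstats implementation) of the two recommended methods: a Wilson score interval for binary scores via statsmodels, and a smooth (Gaussian-KDE) bootstrap CI for numeric scores. The Silverman bandwidth rule, the helper name, and the example data are illustrative assumptions.

```python
import numpy as np
from statsmodels.stats.proportion import proportion_confint

rng = np.random.default_rng(0)

# Wilson score interval for binary scores (closed form, via statsmodels).
binary_scores = rng.binomial(1, 0.7, size=80)            # hypothetical 0/1 eval scores
lo, hi = proportion_confint(binary_scores.sum(), len(binary_scores),
                            alpha=0.05, method="wilson")
print(f"Wilson 95% CI: [{lo:.3f}, {hi:.3f}]")

# Smooth (Gaussian-KDE) bootstrap CI on the mean of numeric scores.
def smooth_bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    h = 1.06 * scores.std(ddof=1) * n ** (-1 / 5)        # Silverman rule-of-thumb bandwidth
    idx = rng.integers(0, n, size=(n_boot, n))           # ordinary resampling...
    jitter = rng.normal(0.0, h, size=(n_boot, n))        # ...plus KDE noise = smooth bootstrap
    boot_means = (scores[idx] + jitter).mean(axis=1)
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

numeric_scores = rng.beta(2, 5, size=50)                 # hypothetical 0-1 judge scores
print("Smooth bootstrap 95% CI:", smooth_bootstrap_ci(numeric_scores))
```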
By Analysis Type (P-values)
| Task | Data Type | Recommended Method | Notes |
|---|---|---|---|
| Pairwise test (A vs B) | Any paired eval scores | Wilcoxon Signed Ranks | Slightly conservative; consistent across data types and sample sizes; no Type I error inflation above the nominal alpha in our simulations. |
| Pairwise test (A vs B), but in a safety-critical setting | Any paired eval scores | Sign Test | More conservative than Wilcoxon; useful when false positives are very costly. |
| Multiple pairwise tests (k variants) | Any | Benjamini-Hochberg (fdr_bh) | Default correction in evalstats; controls the expected false discovery rate with better power than stricter family-wise procedures |
Do not rely on NHST alone for eval decisions. If you report p-values, report them alongside effect sizes and confidence intervals. Prefer effect sizes and CIs as primary evidence, and treat p-values as secondary.
Sample Size Guidance
- N < 15: Do not report statistics. The CI width will be so large as to be meaningless (see the sketch below), and the sample is extremely unlikely to be representative.
- 15 ≤ N < 30: Report statistics with a prominent warning. Results should be treated as exploratory only.
- N ≥ 30: Bootstrapped CIs and score intervals begin to become reliable.
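For intuition about these cutoffs, the quick sketch below (our own illustration, with an arbitrary observed accuracy of 0.7) shows how wide a Wilson 95% CI is at different sample sizes:

```python
from statsmodels.stats.proportion import proportion_confint

# Width of a Wilson 95% CI for an observed accuracy of ~0.7 at different N.
# Illustrates why very small samples yield intervals too wide to act on.
for n in (10, 15, 30, 100, 300):
    lo, hi = proportion_confint(round(0.7 * n), n, alpha=0.05, method="wilson")
    print(f"N={n:4d}  CI=[{lo:.2f}, {hi:.2f}]  width={hi - lo:.2f}")
```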
Simultaneous Confidence Intervals for Multiple Comparisons
When comparing multiple prompts or models, it is common to run multiple pairwise comparisons. In these cases, controlling the family-wise error rate (FWER) is essential to avoid spurious findings. The max-T resampling procedure is a well-established method in genomics and neuroscience for constructing simultaneous confidence intervals (CIs) that jointly cover all comparisons, rather than treating each CI in isolation (basically, it widens the CIs to ensure the alpha level is maintained across all comparisons). It also has very few assumptions, allowing for robust inference even when candidates are correlated. Despite its power and flexibility, max-T is virtually unknown in LLM evaluation and ML contexts.
The max-T procedure works by resampling the data (e.g., via bootstrap), computing the test statistic (such as the difference in means) for all pairs in each resample, and then recording the maximum absolute value of the test statistics across all comparisons for each resample. The distribution of these maxima is then used to set CI bounds that guarantee the desired coverage simultaneously for all comparisons, not just marginally. This approach is less conservative than Bonferroni correction, yet provides strong control over false positives.
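Here is a minimal NumPy sketch of that procedure for all pairwise mean differences among k candidates, using studentized bootstrap statistics. It illustrates the idea only; the function name, example data, and implementation details are our own, not the exact evalstats implementation.

```python
import numpy as np

def maxt_simultaneous_cis(scores, alpha=0.05, n_boot=10_000, seed=0):
    """Bootstrap max-T simultaneous CIs for all pairwise mean differences.

    scores: (n_inputs, k_candidates) array of paired per-input scores.
    A sketch of the idea, not the evalstats implementation.
    """
    rng = np.random.default_rng(seed)
    n, k = scores.shape
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]

    # Per-input paired differences for every comparison, shape (n, n_pairs).
    diffs = scores[:, [i for i, _ in pairs]] - scores[:, [j for _, j in pairs]]
    est = diffs.mean(axis=0)                        # observed mean differences
    se = diffs.std(axis=0, ddof=1) / np.sqrt(n)     # and their standard errors

    # Bootstrap the maximum absolute studentized deviation across all comparisons.
    max_t = np.empty(n_boot)
    for b in range(n_boot):
        d = diffs[rng.integers(0, n, size=n)]
        se_b = d.std(axis=0, ddof=1) / np.sqrt(n)
        t = np.abs(d.mean(axis=0) - est) / np.where(se_b > 0, se_b, np.inf)
        max_t[b] = t.max()

    crit = np.quantile(max_t, 1 - alpha)            # one critical value for all pairs
    return {p: (e - crit * s, e + crit * s) for p, e, s in zip(pairs, est, se)}

# Example: simultaneous 95% CIs for 4 prompt variants scored on the same 120 inputs.
rng = np.random.default_rng(1)
example_scores = np.clip(rng.beta(6, 2, size=(120, 4)) + rng.normal(0, 0.05, size=(120, 4)), 0, 1)
print(maxt_simultaneous_cis(example_scores))
```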
evalstats uses the max-T procedure by default whenever you output multiple pairwise comparisons with bootstrapped CIs. This ensures that the reported intervals account for the increased uncertainty from making many comparisons at once. For consistency, evalstats also applies the max-T correction to p-values when users request them, so that both CIs and p-values are corrected in the same principled way.
The max-T method is recommended by Rink & Brannath, 2025, a recent ML paper on multiple comparisons correction when comparing many candidate models that may be correlated. For more foundational background, see the Methods Used in evalstats and Statistical Pitfalls in ML sections of the Resources page.
Simulation Study
We ran simulation studies over eval-like distributions to compare methods for calculating CIs head-to-head, rather than relying solely on others' results.
We constructed realistic distributions, randomly sampled from them, and measured how often each CI
method captures the true value across repeated trials. Two estimands were tested: single-sample CIs
(does the interval contain the true mean?) and pairwise difference CIs (does the interval contain
the true difference?). Sample sizes: N = 10, 20, 30, 50, 100, 200. Bootstrap and Bayesian draws: 10,000.
Monte-Carlo trials: 2,000 per condition. Simulations are implemented in
the simulations/ folder of the project repository.
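For intuition, here is a stripped-down version of the coverage loop for a single condition (percentile bootstrap, Beta(2, 8) scores, N = 30); the real code in simulations/ sweeps all methods, distributions, and sample sizes.

```python
import numpy as np

rng = np.random.default_rng(0)

def percentile_bootstrap_ci(x, n_boot=2_000, alpha=0.05):
    means = rng.choice(x, size=(n_boot, len(x)), replace=True).mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

true_mean = 2 / (2 + 8)                 # mean of Beta(2, 8)
n_trials, hits = 2_000, 0
for _ in range(n_trials):
    sample = rng.beta(2, 8, size=30)    # one simulated eval of 30 inputs
    lo, hi = percentile_bootstrap_ci(sample)
    hits += lo <= true_mean <= hi
print(f"Estimated coverage: {hits / n_trials:.3f}")   # target is 0.95
```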
Distributions Tested
- Binary scores ({0, 1}): Bernoulli with p ∈ {0.1, 0.3, 0.5, 0.7, 0.9}
- Continuous scores (0–1 floats): Beta families covering uniform, U-shaped, right-skewed, left-skewed, and moderate-skew (e.g., Beta(1,1), Beta(0.5,0.5), Beta(2,8), Beta(8,2), Beta(2,5))
- Likert scores (integers 1–5): discrete scenarios including uniform, skewed-low, skewed-high, bimodal, and center-peaked
- Grades (0–100 floats): clipped normal scenarios including symmetric, high-scoring, low-scoring, ceiling-heavy, and floor-heavy
Single-Sample Coverage (target: 0.95)
The following video and table present the results of the Monte Carlo simulation, which took roughly 14 hours to run. The video shows how the coverage rate of each method evolves as the number of samples increases from 10 to 200, averaged across all distributions and data types. The table breaks down the final average coverage rates at each sample size, with notes on performance. (Coverage rate = fraction of trials where the CI contains the true population mean. Higher is better; under-coverage means the method is over-confident.)
| Method | Coverage | MCSE | Mean width | Dev. from 0.95 | Notes |
|---|---|---|---|---|---|
| Percentile bootstrap | 0.906▼ | 0.0004 | 2.9396 | -0.044 | Moderate under-coverage; narrower than smooth bootstrap |
| BCa | 0.908▼ | 0.0004 | 2.9628 | -0.042 | Slightly better than percentile bootstrap, still under target |
| Bayes bootstrap | 0.899▼ | 0.0004 | 2.8772 | -0.051 | Lowest coverage among bootstrap variants; over-optimistic |
| Smooth Bootstrap (Gaussian KDE) | 0.922 | 0.0004 | 3.3556 | -0.028 | Best bootstrap coverage; widest (most conservative) interval |
| Wilson score interval | 0.950 | 0.0007 | 0.2089 | -0.000 | Binary only; exact-on-target coverage |
| Bayesian Independent Samples | 0.939 | 0.0007 | 0.2095 | -0.011 | Binary-only Bayesian baseline; slightly under target |
Coverage target is 0.95. ▼ marks pronounced under-coverage. Bootstrap-family methods are shown alongside binary-only references (Wilson and bayes_indep) for comparison.
Pairwise Comparison Coverage (target: 0.95)
For paired comparisons, coverage measures whether the CI on the difference contains the true difference. This is the more practically important statistic for comparing two prompts or models.
| Method | Coverage | MCSE | Mean width | Dev. from 0.95 | Notes |
|---|---|---|---|---|---|
| Percentile bootstrap | 0.917 | 0.0006 | 0.6159 | -0.033 | Under target; stable but mildly over-confident |
| BCa | 0.902▼ | 0.0006 | 0.6247 | -0.048 | Lowest frequentist coverage; clearly under target |
| Bayes bootstrap | 0.912 | 0.0006 | 0.6050 | -0.038 | Under target with narrower intervals |
| Smooth Bootstrap (Gaussian KDE) | 0.951 | 0.0004 | 0.7042 | +0.001 | Best bootstrap method; near-exact target with widest CI |
| Newcombe score interval | 0.916 | 0.0011 | 0.2323 | -0.034 | Binary-only frequentist score interval; under target overall |
| Bayesian Independent Comparison | 0.969 | 0.0007 | 0.3400 | +0.019 | Binary-only independent Bayesian comparison; conservative |
| Bayesian Paired Comparison | 0.952 | 0.0009 | 0.3213 | +0.002 | Binary-only paired Bayesian comparison; near target in aggregate but breaks down around N≈200+ with overconfident CIs |
Coverage target is 0.95. ▼ marks pronounced under-coverage. Results are averaged across all eval types, scenarios, and sample sizes; binary-specific methods are included for reference.
Averaged across conditions, paired Bayes looks excellent (0.952), but this hides a critical failure mode:
around N≈200 and above, the bayes_evals paired method can collapse into dramatically
overconfident intervals. So the practical recommendation is conditional: use Bayesian paired only
at smaller sample sizes, then switch to an ordinary bootstrap at N = 100. Among bootstrap methods, the smooth bootstrap
remains the most conservative and best-calibrated general option, while BCa under-covers most severely (0.902).
The smooth bootstrap's performance on binary data is especially impressive given that
it is not designed for that data type and is technically "wrong" to run there, since one
cannot add Gaussian noise to binary data without straying outside the data range. Even so,
it still achieves reasonable performance at N = 30 and above.
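For reference, here is a small sketch of a paired smooth-bootstrap CI on the difference in means. It jitters the paired differences with a Silverman-bandwidth Gaussian kernel, which is one reasonable way to apply the method to binary scores; evalstats may implement the details differently, and the example data are made up.

```python
import numpy as np

def paired_smooth_bootstrap_diff_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Smooth-bootstrap CI on mean(a) - mean(b) for paired per-input scores."""
    rng = np.random.default_rng(seed)
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)   # paired differences
    n = len(d)
    h = 1.06 * d.std(ddof=1) * n ** (-1 / 5)                      # Silverman bandwidth
    idx = rng.integers(0, n, size=(n_boot, n))
    boot = (d[idx] + rng.normal(0.0, h, size=(n_boot, n))).mean(axis=1)
    return np.quantile(boot, [alpha / 2, 1 - alpha / 2])

# Hypothetical binary scores for two prompts on the same 60 inputs.
rng = np.random.default_rng(1)
prompt_a = rng.binomial(1, 0.72, size=60)
prompt_b = rng.binomial(1, 0.64, size=60)
print(paired_smooth_bootstrap_diff_ci(prompt_a, prompt_b))
```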
Comparison with Bayesian Methods for Binary Scores
To address the "Don't Use the CLT" paper's claims, we ensured the simulation included
the Bayesian beta-posterior methods the authors advocate, implemented using their bayes_evals library.
For the binary single-sample case, we find:
- Wilson score interval lands exactly on target coverage (0.950).
- The binary Bayesian single-sample baseline under-covers (0.939).
For pairwise comparisons in this simulation summary, Bayesian Paired Comparison
(0.952 aggregate) and Smooth Bootstrap (0.951) are both near target on average,
while Newcombe score interval under-covers overall (0.916). But the average masks
a severe regime issue: around N≈200+, the bayes_evals Bayesian paired method can fail
with overconfident CIs. (We ran additional simulations up to N=1000 confirming this pattern.
The reason appears to be that the importance sampling procedure becomes unstable at larger N.)
That makes the operational rule straightforward:
- For binary single-sample estimates, use Wilson.
- For binary pairwise comparisons, use Bayesian paired only when N < 100, then switch to Newcombe at N ≥ 100 for speed and better large-N accuracy; around N ≈ 200+, the bayes_evals paired method can become dangerously overconfident.
- For numeric/interval scores, the smooth bootstrap is the practical winner, and it remains the strongest frequentist option across mixed data types.
Thus, while Bayesian methods can perform well for binary data,
they are not universally superior to frequentist methods (at least in these comparisons), and
their advantages may not extend to non-binary data types common in LLM evals. We
call on statistics experts to extend our simulation code and meta-analyses to further
clarify the conditions under which Bayesian methods outperform frequentist alternatives,
and to identify any scenarios where they may underperform.
Is BCa Actually Better Than the Naive Bootstrap?
Common wisdom says yes, but our simulations say otherwise. (GPT-5, for instance, seems convinced BCa is the go-to method and loves to suggest it.) At least one external simulation study has reached a similar conclusion, consistent with our own results for LLM eval data.
Which p-value Method Works Best?
We ran a second Monte Carlo simulation study to compare p-value methods for pairwise
testing across LLM eval data types and sample sizes, implemented in
simulations/sim_compare_pvalues.py. The criterion here is Type I error
control under the null (target: α = 0.05) plus consistency across
data distributions and N.
Wilcoxon Signed Ranks is our recommended default for pairwise
comparisons for those adopting a NHST framework. In this simulation, it is slightly conservative, consistently
performant across sample sizes, data types, and distributions, and does not show pathological
behavior where Type I error is inflated above the nominal threshold
(α = 0.05). Wilcoxon is a non-parametric test that
makes fewer assumptions about the data distribution, which likely contributes
to its robustness in this context. It is also widely implemented, well-understood,
and computationally efficient, making it a practical choice for developers.
That said, the Sign Test is also a reasonable choice for safety-critical settings where false positives are especially costly, because it is much more conservative than Wilcoxon.
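Both tests are available in SciPy. The sketch below runs them on hypothetical paired scores (the data and effect size are made up); the sign test is implemented as a binomial test on the count of positive, non-tied differences.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical paired judge scores (0-1) for two prompt variants on the same 40 inputs.
a = rng.beta(8, 2, size=40)
b = np.clip(a - rng.normal(0.03, 0.10, size=40), 0, 1)

# Wilcoxon signed-rank test on the paired scores.
print(stats.wilcoxon(a, b))

# Sign test: binomial test on the number of positive (non-tied) differences.
d = a - b
print(stats.binomtest(int((d > 0).sum()), int((d != 0).sum()), p=0.5))
```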
For comparing 3+ models/prompts, we recommend the Friedman test as the default, with Wilcoxon signed ranks as the post-hoc pairwise test. Friedman is a non-parametric test for comparing more than two related samples: it does not assume normality or homoscedasticity (equal variance of residuals). Friedman tests are already used in some ML contexts for comparing models, as advocated by Demsar (2006). Note that evalstats does not output Friedman by default, since it does not foreground an NHST-style analysis pipeline, but it is available as an option for users who want it.
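A minimal SciPy sketch of that workflow, with hypothetical scores for three prompt variants: an omnibus Friedman test first, then post-hoc pairwise Wilcoxon tests (whose p-values should be corrected as described below).

```python
from itertools import combinations
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical scores for three prompt variants on the same 50 inputs.
scores = {name: rng.beta(a, 2, size=50) for name, a in [("p1", 6), ("p2", 7), ("p3", 8)]}

# Omnibus Friedman test across the related samples.
print(stats.friedmanchisquare(*scores.values()))

# Post-hoc pairwise Wilcoxon signed-rank tests.
for x, y in combinations(scores, 2):
    print(x, "vs", y, stats.wilcoxon(scores[x], scores[y]))
```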
For multiple comparisons correction,
we recommend the Benjamini-Hochberg procedure (correction="fdr_bh")
for lower-stakes contexts: it performed comparably to other methods in our simulation,
but had the greatest power among methods that maintained Type I error control.
In extremely safety-critical contexts, the Bonferroni-Holm correction (correction="holm")
is more appropriate for its strong control over false positives, albeit at the cost of increased false negatives.
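Both corrections are a one-liner with statsmodels; the sketch below applies fdr_bh and holm to a hypothetical set of raw p-values from pairwise tests.

```python
import numpy as np
from statsmodels.stats.multitest import multipletests

raw_p = np.array([0.004, 0.012, 0.030, 0.210, 0.480])   # hypothetical raw p-values

# Benjamini-Hochberg (FDR control) for lower-stakes settings...
reject_bh, p_bh, _, _ = multipletests(raw_p, alpha=0.05, method="fdr_bh")
# ...and Holm for safety-critical settings with strict family-wise control.
reject_holm, p_holm, _, _ = multipletests(raw_p, alpha=0.05, method="holm")

print("BH adjusted:  ", np.round(p_bh, 3), reject_bh)
print("Holm adjusted:", np.round(p_holm, 3), reject_holm)
```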
We ran simulations to provide guidance on which p-value method is most reliable for LLM evals, for those who want to use them, but we do not recommend relying on p-values. Null-hypothesis significance testing (NHST) should not be the primary basis for eval decisions. A p-value is only a compatibility check against a null model; it does not tell you the magnitude, direction, or practical importance of an effect.
If you report p-values at all, always report them alongside effect sizes and confidence intervals. In practice, prefer effect sizes and CIs as the main evidence, and use p-values as secondary context. Prefer alpha thresholds that are more conservative than 0.05, if you use them at all (e.g., 0.001).