Papers, Articles, and Commentary
A curated list of papers and articles that I've encountered and noted while working on this project. The list is organized into two sections: Commentary on select papers, and Papers, articles, and other educational resources on statistical methods and perspectives in LLM eval, ML, and HCI contexts.
Commentary on select papers
The "Evals Need Error Bars" Paper
In 2024, Evan Miller at Anthropic made an early, important case for error bars in evaluations. The argument centered on standard errors: combine standard deviation with sample size, report alongside the mean.
This is reasonable, but limited for the evals landscape of today. The framework assumes thousands or hundreds of thousands of datapoints — as Anthropic's public benchmarks have. That is not the world of most developers. A typical custom developer eval has 20–100 items, not 10,000. At that scale:
- Standard errors assume normal distributions of LLM outputs, which rarely hold.
- They assume symmetry around the mean — also rarely true for bounded scores at high performance.
- They collapse with hierarchical data (e.g., same inputs evaluated across multiple prompts and models).
What we want instead is confidence intervals on pairwise differences, with no strong distributional assumptions. To trust these intervals, we need to use methods that are robust at small N. To understand which methods are robust, we would ideally run simulations that reflect the kinds of data we see in real evals.
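To make that concrete, here is a minimal sketch (not the evalstats implementation; function names and data are illustrative) of a percentile bootstrap CI on the paired per-item difference between two prompts scored on the same inputs, with no distributional assumptions beyond resampling:

```python
# Sketch (not the evalstats API): percentile bootstrap CI on the paired
# per-item difference between two prompts scored on the same input set.
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """CI for mean(scores_a - scores_b), resampling items with replacement."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_boot, n))   # resample item indices
    boot_means = diffs[idx].mean(axis=1)         # bootstrap distribution of the mean difference
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)

# Example: 40 items scored in [0, 1] by a judge for two prompt variants.
a = rng.uniform(0.5, 1.0, size=40)
b = np.clip(a - rng.normal(0.05, 0.1, size=40), 0, 1)
mean_diff, (lo, hi) = paired_bootstrap_ci(a, b)
print(f"mean difference {mean_diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```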
The Forgotten Paper on Statistical Comparisons in ML: Demšar (2006)
Before the LLM era, the machine learning community had a well-established answer to the question of how to statistically compare multiple models across multiple evaluation sets. Janez Demšar’s 2006 Journal of Machine Learning Research paper, Statistical Comparisons of Classifiers over Multiple Data Sets, argued that parametric tests like ANOVA are routinely misapplied to classifier comparisons, violating normality and independence assumptions. He recommended non-parametric alternatives (the Wilcoxon signed-ranks test for two models, the Friedman test plus post-hoc corrections for more than two) and introduced the critical difference (CD) diagram as a clean visual for communicating the results.
The paper is highly cited in classical ML but has largely been forgotten in the LLM benchmark discourse. That is a shame, because the situation Demšar addressed is precisely the one we face: many models (or prompts), many evaluation sets (or items), non-normal distributions, and a need to make comparative claims that are statistically honest about uncertainty.
Demšar’s core insight is that comparing the rank of methods averaged across datasets is more robust than comparing raw means, and that multiple comparisons require explicit correction. This transfers directly to LLM evals whenever you are comparing N prompts or models across M evaluation sets and asking which is best.
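As a concrete sketch of that rank-based workflow (data and variable names here are illustrative, not Demšar's code), one can rank prompts within each evaluation set, average the ranks, and run the Friedman omnibus test with SciPy:

```python
# Sketch: average ranks of k prompts across m evaluation sets, plus the
# Friedman omnibus test. The score matrix is made up for illustration.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores[i, j] = score of prompt j on evaluation set i (higher is better)
scores = np.array([
    [0.82, 0.79, 0.91],
    [0.75, 0.80, 0.88],
    [0.90, 0.85, 0.87],
    [0.70, 0.72, 0.81],
    [0.88, 0.84, 0.93],
])

# Rank within each evaluation set (rank 1 = best), then average across sets.
ranks = rankdata(-scores, axis=1)
print("average ranks per prompt:", ranks.mean(axis=0))

# Friedman test: are the prompts' rank distributions distinguishable at all?
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```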
That said, the framework needs adaptation for modern eval practice. CD diagrams and rank-band visualizations are built around a narrow, rigid NHST pipeline: run a Friedman test, apply a post-hoc correction (Nemenyi or Holm), then draw cliques of methods that are not significantly different. This works, but it doesn't adapt to the estimation-first approach that we advocate, which focuses on effect sizes and confidence intervals rather than p-values. Essentially, if we want to run any test other than Friedman, we can't use the CD diagram as Demšar’s paper defines it.
But the CD diagram is just a visualization! There's nothing stopping us from adapting it to other tests and other philosophies. In fact, evalstats adapts the CD diagram to work with pairwise confidence intervals (adjusted for multiple comparisons) and estimated ranks from a bootstrapped distribution. However, it also supports the traditional NHST Friedman test and CD diagrams generated from it, for users who want to run a more traditional analysis.
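As a rough sketch of what "estimated ranks from a bootstrapped distribution" can look like in practice (illustrative only, not the evalstats implementation; a real pipeline would also adjust the pairwise intervals for multiple comparisons), one can resample items, re-rank the prompts on each resample, and summarize each prompt's rank distribution:

```python
# Sketch (not the evalstats API): bootstrap the per-item scores, re-rank the
# prompts on each resample, and summarize each prompt's rank distribution.
import numpy as np

rng = np.random.default_rng(1)
# scores[i, j] = score of prompt j on item i (synthetic data)
scores = rng.beta(8, 2, size=(50, 4))

n_items, n_prompts = scores.shape
n_boot = 5_000
ranks = np.empty((n_boot, n_prompts))
for b in range(n_boot):
    sample = scores[rng.integers(0, n_items, size=n_items)]   # resample items
    order = (-sample.mean(axis=0)).argsort()                  # best prompt first
    ranks[b, order] = np.arange(1, n_prompts + 1)             # rank 1 = best

for j in range(n_prompts):
    lo, hi = np.quantile(ranks[:, j], [0.025, 0.975])
    print(f"prompt {j}: mean rank {ranks[:, j].mean():.2f}, "
          f"95% rank interval [{lo:.0f}, {hi:.0f}]")
```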
The "Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints" Paper
An ICML position paper titled "Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints" argues that for N<300 samples you must use Bayesian statistics, and that frequentist and bootstrap methods are untrustworthy at small N. But is that really true?
The paper evaluates Bayesian methods under simulation assumptions closely aligned with the Bayesian methods' own priors. That setup is valid for answering a specific question, but it can favor Bayesian approaches relative to methods designed to hold up in broader or misspecified settings.
Despite the broad title, the analysis is focused on Bernoulli (binary) outcomes. It does not cover numeric/interval eval scores (e.g., Likert or 0–100 grades), and it compares against a limited set of frequentist baselines rather than the strongest modern alternatives.
This matters because other simulation work has reported different rankings under different designs, including settings where certain frequentist intervals achieve higher coverage than Bayesian variants.
In ML terms, this is a precision–recall style trade-off: some methods give tighter intervals, others give higher coverage. Which is preferable depends on risk tolerance, data type, and whether under-coverage or over-conservatism is the bigger error in your application.
In our own simulations (see below), Wilson is strongest for binary single-sample coverage, Bayesian paired is strongest for binary pairwise coverage at small-to-mid N, and the smooth bootstrap with kernel density estimation (KDE) is the most reliable general-purpose method across non-binary eval scores. However, in our larger-N runs, the bayes_evals Bayesian paired method breaks down around N ≈ 200 and above, producing overconfident intervals that can be dangerously wrong. Thus, evalstats uses Bayesian paired only when N < 100, then switches to bootstrap for better accuracy at larger N and to take advantage of multiple-comparisons correction for pairwise CIs.
LLM Eval Statistics and Debates
Papers that have directly shaped how the field thinks about statistical rigor in LLM evaluation. There's also some commentary on select papers.
An influential argument for treating LLM eval scores as statistical estimates with quantified uncertainty. Proposes bootstrap CIs, paired tests for model comparisons, and power analysis for eval design. Addresses cluster structure (multi-run, multi-judge settings) via cluster-robust standard errors. The paper identified a genuine gap: the industry norm of reporting bare means without uncertainty is indefensible at any sample size. See the Debunking section for how evalstats’ smooth bootstrap improves on SE-based intervals at small N.
Argues that for binary LLM eval scores with N < 300, Bayesian beta-posterior methods substantially outperform frequentist and bootstrap alternatives. The paper’s simulation design evaluates Bayesian estimates on distributions that match the Bayesian prior, which inflates its apparent advantage. It also focuses exclusively on the binary case and does not benchmark against the Wilson interval or smoothed bootstrap. See the Debunking section of the Stats Reference Guide for a detailed critique. The paper’s core call for more rigorous statistics in LLM evals is well-founded, even if the specific recommendation is overstated.
A NIST standards document advocating for statistical rigor in AI evaluation, with formal guidance on CI estimation, test selection, and uncertainty quantification for model comparisons. Argues that the dominant evaluation culture — reporting point estimates without uncertainty — cannot support the regulatory and procurement decisions now being made on the basis of benchmark scores. Provides a taxonomy of evaluation tasks and matching statistical approaches, making it a useful framework for practitioners choosing between methods.
A companion to NIST AI 800-3, focused specifically on automated benchmark evaluation practices for language models. Provides standardized guidance on how to design, run, and report automated benchmark evaluations, with an eye toward reproducibility and consistency across the AI research community. Where 800-3 addresses statistical modeling and uncertainty quantification, 800-2 addresses the procedural infrastructure: how benchmarks should be constructed, scored, and compared.
A formal public comment submitted to the NIST AI 800-2 process by Princeton’s SAGE lab, focused on benchmark reliability concerns. Provides technical perspective on the limitations and failure modes of automated LLM benchmarks from a research group actively studying benchmark validity. A useful complement to the official NIST document, representing practitioner and researcher pushback on where current benchmark evaluation standards fall short.
LLM Evaluation Methodology
Papers on how LLM evals are structured, run, and reported — and where they go wrong.
A 15-chapter online textbook tracing the foundations, current challenges, and future directions of ML benchmarking. Covers statistical testing, the holdout method, test-set reuse, replication crises, generative models, and LLM evaluation. Opens with a dual diagnosis: benchmarks are “the iron rule to tame the anything goes” in ML research, yet “a crisis grips the benchmarking enterprise.” A thorough treatment of why aggregate benchmark scores are both essential and fragile — and what rigorous alternatives look like.
Introduces HELM: evaluating models across 42 scenarios, 7 metric categories, and hundreds of sub-scenarios. One of the most comprehensive eval frameworks published. Aggregate scores are reported as means without confidence intervals, making it a useful illustration of scale-dependent eval limitations: at the scale at which HELM operates (thousands of items per scenario), standard errors are small enough to be negligible, but for the typical developer running a custom eval at N=50, HELM’s methodology does not transfer.
Introduces and validates the LLM-as-a-judge paradigm: using a capable model (GPT-4) as a scoring proxy for human preference. Shows ~80% agreement with human judges on pairwise comparisons. Also introduces Chatbot Arena and its Elo-rating leaderboard. The paper identifies key judge failure modes (position bias, verbosity bias, self-enhancement) but does not quantify judge-induced variance in downstream CI estimates.
Shows that “emergent” abilities — capabilities that appear suddenly at a threshold model scale — are largely artifacts of the choice of metric. Discontinuous metrics (exact match) create apparent phase transitions; replacing them with continuous metrics (edit distance, token probability) reveals smooth, continuous improvement curves. A precise demonstration of why the score function is part of the statistical model: the metric you choose changes the conclusions you can draw.
Argues that MCC is the most informative single metric for binary classification: a high MCC guarantees all four confusion-matrix rates are high (TPR, TNR, PPV, NPV), whereas balanced accuracy or bookmaker informedness only guarantee three. Uses mathematical derivations and worked examples to show scenarios where BA and BM signal near-perfect performance while MCC correctly flags failure. Directly relevant to LLM binary evals: when pass/fail or correct/incorrect scores are the output, choosing the wrong summary metric can mask systematic failure modes on one class. A concrete companion to Schaeffer et al.: not only does the shape of the metric matter (continuous vs. discontinuous), but so does which confusion-matrix cell it is sensitive to.
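A quick illustration of the failure mode, with made-up numbers rather than the paper's own examples: on a heavily imbalanced pass/fail eval, balanced accuracy can look strong while MCC exposes a near-useless positive class.

```python
# Illustrative numbers (not from the paper): high TPR and TNR give a flattering
# balanced accuracy, but 500 false alarms against 9 true positives crater MCC.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

# Confusion matrix: TP=9, FN=1, TN=9500, FP=500
y_true = np.concatenate([np.ones(10), np.zeros(10_000)])
y_pred = np.concatenate([
    np.ones(9), np.zeros(1),        # positives: 9 caught, 1 missed
    np.ones(500), np.zeros(9_500),  # negatives: 500 false alarms
])

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # ~0.93
print("MCC:              ", matthews_corrcoef(y_true, y_pred))        # ~0.12
```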
Documents that different BLEU implementations produce scores that differ by up to 1–2 BLEU points on the same output, making cross-paper comparisons unreliable. An early example of what has become a general problem: aggregate metrics reported without enough implementation detail to reproduce them. The lesson generalizes directly to LLM evals: a score of 78% is only meaningful if the scoring function, judge model, prompt template, and evaluation protocol are fully specified.
A large-scale collaborative benchmark covering 204 tasks designed to be beyond the capabilities of then-current models. Beyond its task coverage, BIG-Bench is notable methodologically for its use of multiple human raters with inter-rater reliability checks, calibrated difficulty ratings, and discussion of metric sensitivity. A useful contrast to single-metric benchmarks, though statistical uncertainty of aggregate scores is still not prominently reported.
Introduces ProSA, a framework for measuring how sensitive LLM performance is to surface-level prompt variation (phrasing, ordering, delimiter choice). Finds that sensitivity varies substantially across models, tasks, and prompt dimensions, and that models with similar mean accuracy can have very different variance profiles. Directly motivates the Prompt Sensitivity investigation: a model’s mean score on a single prompt is a noisy proxy for its true task capability, and the noise is correlated with prompt style rather than item difficulty.
A practitioner’s guide to reproducibility failures in LLM evaluation, drawn from experience running large-scale evals. Identifies sources of variance that are rarely reported: temperature and sampling randomness, prompt template sensitivity, tokenization differences between frameworks, and hardware-level floating-point non-determinism. The paper makes the case that reproducibility in LLM evals requires specifying far more than just the model name and task description — a concrete list of what “full specification” actually entails.
Shows that LLM performance on benchmarks fluctuates dramatically with superficial changes to evaluation setup — from option ordering to question framing — producing score swings large enough to change model rankings. Argues that standard benchmark scores do not measure a stable underlying capability but rather a model’s response to a specific configuration. Directly motivates the When Rankings Flip investigation: a CI on a single eval configuration cannot capture configuration-induced variance.
A systematic review of 445 LLM benchmarks from major NLP and ML conferences, examining whether they actually measure the concepts they claim to — safety, robustness, reasoning, and so on. Using 29 expert reviewers, the authors identify recurring patterns in tasks, metrics, and phenomena that undermine construct validity. Concludes with eight recommendations for more rigorous benchmark development, and a practical construct validity checklist for benchmarks. See also the Communications of the ACM post for an accessible summary.
Evaluates LLMs on class-level code generation using real-world open-source repositories rather than synthetic or curated problems. Finds that model rankings on synthetic benchmarks (HumanEval, MBPP) diverge substantially from rankings on authentic tasks, and that benchmark-to-benchmark transfer is weak even within coding evals. A concrete illustration of external validity failure: even a statistically rigorous CI on a synthetic benchmark cannot bound performance on real-world tasks if the benchmark distribution is unrepresentative.
A systematic guide to inter-annotator agreement (IAA) metric selection for NLP tasks. Compares Cohen’s κ, Krippendorff’s α, Fleiss’ κ, and percent agreement, clarifying when each is appropriate and when they give misleading results. Particularly relevant for LLM-as-judge pipelines: when multiple judges (human or model) score the same items, IAA measures reveal whether the scoring process is reliable — a prerequisite for trusting any downstream CI built on those scores.
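A minimal sketch of the basic reliability check, with invented judge labels: when one label dominates, raw percent agreement flatters the judges relative to Cohen's κ.

```python
# Sketch: percent agreement vs. Cohen's kappa for two judges on the same items.
# With a dominant "pass" label, raw agreement looks high while kappa is modest.
import numpy as np
from sklearn.metrics import cohen_kappa_score

judge_a = np.array(["pass"] * 45 + ["fail"] * 5)
judge_b = np.array(["pass"] * 42 + ["fail"] * 3 + ["pass"] * 3 + ["fail"] * 2)

agreement = (judge_a == judge_b).mean()
kappa = cohen_kappa_score(judge_a, judge_b)
print(f"percent agreement = {agreement:.2f}, Cohen's kappa = {kappa:.2f}")
# -> agreement 0.88, kappa ~0.33: most of the "agreement" is the base rate.
```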
Methods and Perspectives Used in evalstats
Papers behind the statistical methods implemented in the library, as well as the estimation approach to statistics adopted here.
Derives what is now called the Wilson score interval for proportions by inverting the score test statistic directly, rather than approximating p by p̂ in the standard error. The resulting interval never extends outside [0, 1], and achieves near-nominal coverage even at small N and extreme probabilities — both conditions common in binary LLM evals. The Wald interval (p̂ ± z⋅SE) is a first-order approximation that fails badly here.
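Assuming statsmodels is available, the Wilson interval is one call away, and the contrast with the Wald interval at small N is easy to see (the 18/20 example is invented):

```python
# Wilson vs. Wald interval for 18/20 passes: the Wald interval overshoots 1.0.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

count, nobs = 18, 20
p_hat = count / nobs

wilson = proportion_confint(count, nobs, alpha=0.05, method="wilson")
wald_se = np.sqrt(p_hat * (1 - p_hat) / nobs)
wald = (p_hat - 1.96 * wald_se, p_hat + 1.96 * wald_se)

print("Wilson:", wilson)   # stays inside [0, 1]
print("Wald:  ", wald)     # upper limit exceeds 1.0
```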
Systematically compares 11 CI methods for the difference between two proportions. The Wilson-hybrid method (Method 10 in the paper) consistently achieves near-nominal coverage across the full probability space tested. evalstats uses the paired-samples extension of this method for pairwise binary comparisons between prompts or models evaluated on the same input set. Note: this paper addresses independent proportions; the paired variant used in evalstats follows the same Wilson-square-root construction applied to paired differences.
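For reference, here is a sketch of the independent-proportions version of Newcombe's square-and-add hybrid, built from the Wilson limits of each proportion (function name and example counts are illustrative; the paired extension used in evalstats applies the same construction to paired differences):

```python
# Sketch of Newcombe's square-and-add hybrid (Method 10 in the paper) for the
# difference between two independent proportions, built from Wilson limits.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

def newcombe_diff_ci(x1, n1, x2, n2, alpha=0.05):
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = proportion_confint(x1, n1, alpha=alpha, method="wilson")
    l2, u2 = proportion_confint(x2, n2, alpha=alpha, method="wilson")
    d = p1 - p2
    lower = d - np.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + np.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, (lower, upper)

# Prompt A passes 42/50 items, prompt B passes 35/50 items (disjoint item sets).
print(newcombe_diff_ci(42, 50, 35, 50))
```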
The foundational bootstrap paper. Introduces resampling with replacement as a general method for estimating the sampling distribution of any statistic, without making distributional assumptions. Establishes the theoretical properties that make the bootstrap valid for smooth statistics and justifies its use for constructing CIs from finite samples. All bootstrap methods in evalstats (percentile, BCa, smooth) trace their lineage here.
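SciPy packages the percentile and BCa variants behind one function, which is a convenient way to compare them on the same sample (the scores below are synthetic):

```python
# Percentile vs. BCa bootstrap CIs for the mean of a single sample of eval scores.
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(2)
scores = rng.beta(8, 2, size=40)   # 40 bounded eval scores in [0, 1]

for method in ("percentile", "BCa"):
    res = bootstrap((scores,), np.mean, confidence_level=0.95, method=method)
    ci = res.confidence_interval
    print(f"{method:>10}: [{ci.low:.3f}, {ci.high:.3f}]")
```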
Shows that the standard cross-validation variance estimator is biased because it ignores the dependence between training sets in k-fold CV. Derives a corrected variance estimator that accounts for both test-set variability and training-set randomness, yielding significance tests that achieve the correct level rather than inflating type-I error. The “corrected resampled t-test” from this paper is the standard reference for rigorous hypothesis testing when comparing two algorithms on a single dataset via repeated cross-validation.
Introduces the PSR estimator: adding KDE-bandwidth Gaussian perturbations to bootstrap resamples, which smooths the discrete empirical distribution. Shows improved coverage over the plain percentile bootstrap for continuous data at small sample sizes, because the smoothing reduces the discretization artifacts that cause the standard bootstrap to under-cover at small N. evalstats implements this as smooth_bootstrap using Scott’s rule (h = n^(−1/5) × s) for the bandwidth.
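A minimal sketch of the idea (illustrative, and not necessarily identical to what smooth_bootstrap does internally; the function name below is invented): resample with replacement, then jitter each resampled point with Gaussian noise at the Scott's-rule bandwidth.

```python
# Sketch of a smoothed (kernel) bootstrap: ordinary resampling plus Gaussian
# jitter with bandwidth h = s * n**(-1/5), which smooths the discrete
# empirical distribution at small N.
import numpy as np

rng = np.random.default_rng(3)

def smooth_bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = x.std(ddof=1) * n ** (-1 / 5)                          # Scott's-rule bandwidth
    idx = rng.integers(0, n, size=(n_boot, n))                 # plain resampling
    samples = x[idx] + rng.normal(0.0, h, size=(n_boot, n))    # KDE-bandwidth jitter
    boot_means = samples.mean(axis=1)
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

scores = rng.normal(0.7, 0.15, size=25)   # small-N numeric eval scores
print(smooth_bootstrap_ci(scores))
```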
An extended treatment of the Banks PSR and related smoothed bootstrap estimators. Argues that Banks’ method is substantially underappreciated in the statistical literature and particularly effective for small-N settings where the standard bootstrap discretization error is largest. Provides theoretical coverage guarantees and simulation evidence consistent with evalstats’ own simulation results favoring the smooth bootstrap.
Simulation study comparing percentile, BCa, and studentized bootstrap CIs for factor analysis loadings. Finds that the percentile method achieves better overall coverage than BCa in typical settings, and that BCa’s reputation for superiority depends on conditions that are often not met. Although the domain (factor analysis) differs from LLM evals, the qualitative finding is consistent with evalstats’ simulation results: BCa underperforms on pairwise comparisons, where the asymmetry it was designed to correct does not manifest cleanly.
Introduces the Holm–Bonferroni step-down procedure: sort p-values in ascending order, then compare the k-th smallest against α/(m−k+1). This is uniformly more powerful than plain Bonferroni while maintaining strong family-wise error rate control under any dependence structure. Applied in evalstats when comparing k prompts or models simultaneously, replacing the conventional but unnecessarily conservative Bonferroni correction.
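In Python, statsmodels implements the step-down procedure directly; the p-values below are invented for illustration:

```python
# Holm's step-down correction on a set of pairwise p-values, via statsmodels.
from statsmodels.stats.multitest import multipletests

pvals = [0.004, 0.021, 0.038, 0.170]   # e.g. four pairwise prompt comparisons
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print("adjusted p-values:", p_adj)
print("rejected:", reject)
```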
The canonical reference for comparing multiple classifiers across multiple datasets. Argues that parametric tests (t-tests, ANOVA) are routinely misapplied in this setting because their normality and independence assumptions are violated, and recommends non-parametric alternatives: the Wilcoxon signed-ranks test for two-classifier comparisons and the Friedman test with post-hoc corrections for more than two. Introduces the critical difference (CD) diagram — the standard visualization for showing which classifiers differ significantly after correcting for multiple comparisons. CD diagrams are implemented in evalstats for multi-prompt and multi-model comparisons.
Extends Demšar (2006) by providing rigorous post-hoc procedures specifically for all pairwise comparisons among N classifiers, rather than comparisons against a single control. Derives adjusted p-values (Holm, Shaffer, Bergmann-Hommel) that maintain family-wise error rate control across the full N(N−1)/2 comparison matrix. Essential companion to Demšar when the goal is to rank all prompts or models against each other rather than against a single baseline.
Demonstrates a troubling property of mean-rank post-hoc tests (Nemenyi, Bonferroni on ranks): whether two algorithms are declared significantly different depends on which other algorithms are included in the comparison set. Adding or removing an irrelevant third algorithm can flip the significance verdict between two others — a violation of basic statistical independence. Recommends pairwise sign-tests or Wilcoxon signed-rank tests instead, which depend only on the two algorithms being directly compared. A direct critique of the standard Demšar CD-diagram pipeline.
A tutorial-style paper by the same group as the 2015 critique above, proposing Bayesian replacements for the Friedman/Nemenyi NHST pipeline. Rather than asking “is there a significant difference?,” the Bayesian approach answers “what is the posterior probability that algorithm A outperforms B?” and can incorporate a region of practical equivalence (ROPE) to distinguish meaningful from negligible differences. Includes practical Bayesian signed-rank and correlated t-tests with software. The Bayesian estimation framing aligns closely with evalstats’ CI-first philosophy.
Extends the Bayesian approach from Benavoli et al. (2017) by fitting a hierarchical model across all datasets simultaneously rather than treating each independently. The hierarchical structure pools information across datasets, reducing variance in posterior estimates and making conclusions more stable when individual datasets are noisy or small. Produces posterior probabilities that one classifier outperforms another with joint analysis of the full multi-dataset structure.
Applies the Bradley–Terry paired-comparison model in a Bayesian framework to rank multiple machine learning algorithms across datasets. Unlike rank-based non-parametric tests, the Bradley–Terry model produces a full posterior over algorithm rankings and allows defining regions of practical equivalence for any performance metric. Includes R and Python implementations. Useful for LLM eval workflows where models or prompts need to be ranked and the ranking uncertainty matters as much as the ranking itself.
A hands-on tutorial walking through the full non-parametric testing workflow for comparing algorithms across multiple benchmark problems: Wilcoxon signed-rank for pairwise comparisons, Friedman for k-algorithm omnibus tests, and Holm/Shaffer/Bergmann post-hoc corrections for multiple pairwise conclusions. Uses CEC’2005 optimization benchmark results as running examples. Despite the evolutionary computing framing, this is the most practically useful step-by-step reference for anyone applying the Demšar framework to a new domain — including LLM evals.
Presents a unified toolkit of exploratory and inferential methods for benchmark experiment analysis, combining novel visualizations with parametric and non-parametric tests to establish statistically sound algorithm rankings. Goes beyond the Demšar pipeline by treating benchmark analysis as an exploratory data analysis problem first, with formal inference as a secondary step. Useful for analysts who want richer visual diagnostics of multi-algorithm, multi-dataset results before committing to a single statistical summary.
An R package that unifies the full Demšar/García statistical comparison workflow — data loading, hypothesis testing (Friedman, Wilcoxon, Holm, Shaffer), CD diagram generation, and report output — into a single coherent API. Also implements the Bayesian tests from Benavoli et al. (2017). The reference implementation for anyone applying the non-parametric multi-algorithm comparison framework in R, and a useful touchstone for understanding what evalstats’ CD diagram output corresponds to in the classical literature.
Proposes assigning Dirichlet-distributed weights to the observed data points instead of resampling with replacement. The resulting posterior is smoother than the standard bootstrap’s discrete empirical distribution. In simulation, the Bayesian bootstrap tends to produce narrower (higher-precision) intervals than plain percentile bootstrap, but at the cost of lower coverage (lower recall) at small N — the trade-off discussed in the Stats Reference Guide.
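A minimal sketch of the mechanism, with synthetic scores: draw Dirichlet(1, ..., 1) weights over the observed points and compute the weighted mean for each draw.

```python
# Sketch of the Bayesian bootstrap: Dirichlet weights over the observed points
# instead of integer resampling counts, giving a smoother posterior for the mean.
import numpy as np

rng = np.random.default_rng(4)
scores = rng.beta(8, 2, size=30)

n_draws = 10_000
weights = rng.dirichlet(np.ones(len(scores)), size=n_draws)   # one weight vector per draw
posterior_means = weights @ scores
print("posterior mean:", posterior_means.mean())
print("95% credible interval:", np.quantile(posterior_means, [0.025, 0.975]))
```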
A practical survey of bootstrap CI methods (basic, percentile, BCa, studentized) with guidance on when each is appropriate. Explains why BCa often outperforms percentile in asymmetric situations, but also identifies conditions under which BCa fails — particularly when the acceleration constant estimate is unstable, which tends to occur in pairwise comparison settings. Useful background for understanding the BCa coverage collapse in evalstats’ simulations.
The most widely recommended introduction to Bayesian data analysis for scientists who want to build genuine probabilistic models rather than run canned tests. Works through the full Bayesian workflow — prior selection, likelihood specification, posterior sampling via MCMC, and model criticism — using a deliberately slow, conceptual pace that demystifies what inference actually does. The annual lecture series (see the 2026 GitHub repo for videos and code) keeps the material current. Relevant to LLM eval practice as grounding for anyone who wants to go beyond bootstrap CIs toward full Bayesian models of eval score distributions or judge reliability.
The canonical textbook for the estimation-based approach to statistics: report effect sizes and confidence intervals, not just p-values and reject/fail-to-reject decisions. Argues that point estimates accompanied by CIs communicate more information than binary significance tests, support meta-analytic thinking, and encourage replication. Directly underpins the philosophy of this guide: the goal of an LLM eval is not to “pass a significance threshold” but to estimate a quantity — a score, a difference, an improvement — with calibrated uncertainty. Also covers open-science practices (pre-registration, effect size reporting) that translate naturally to rigorous eval design.
Statistical Testing in NLP
Papers on how statistical significance testing should be applied to NLP system comparisons — a direct precursor to the same questions now facing LLM evaluation.
Empirically examines the relationship between metric gains, test-set size, system similarity, and statistical significance in NLP evaluation. Investigates whether significance on a held-out test set predicts out-of-domain performance — an external validity question that maps directly onto LLM eval practice. Uses bootstrap resampling throughout. The findings ground subsequent debates: apparent improvements in NLP are frequently within noise, and the conditions under which significance generalizes are narrower than the field assumed.
A survey and practical protocol for selecting the right significance test for NLP comparisons, covering bootstrap, permutation, Wilcoxon, McNemar, and parametric tests across different output types (sequences, labels, scores). A survey of ACL and TACL 2017 papers found that significance testing was frequently ignored or misapplied. The paper’s decision-tree protocol — choose a test based on output type, sample size, and dependence structure — transfers directly to LLM eval: the same choices arise when comparing models on classification, generation, or rating tasks.
Statistical Debates
Foundational papers in ongoing debates about p-values, estimation, and the reliability of published research.
A mathematical argument that the majority of claimed research findings in biomedicine are likely false, given typical study power, significance thresholds, researcher degrees of freedom, and the ratio of true to tested hypotheses. The framework is general: any field that runs many low-powered tests on a sparse signal, reports only significant results, and allows flexibility in analysis will produce a literature where most positive findings are noise. The diagnosis applies directly to LLM benchmark claims: small N evals with many implicit comparisons (prompts, models, metrics) and no correction for multiplicity generate exactly the conditions Ioannidis identifies as most likely to produce false positives.
A systematic catalog of 25 common misinterpretations of p-values, confidence intervals, and statistical power, co-authored by some of the most prominent statisticians and epidemiologists working on these problems. Establishes the precise meaning of each concept and shows how each is routinely misread — including the widespread belief that a 95% CI contains the true value with 95% probability, or that a non-significant p-value means the null hypothesis is true. Essential reading alongside the Cumming debate: even researchers who have abandoned p-values often carry the same misconceptions into their CI interpretations.
The seminal journal article making the case for replacing null-hypothesis significance testing with an estimation approach: report effect sizes and confidence intervals, pursue meta-analytic thinking, and pre-specify studies where possible. Argues that NHST encourages binary thinking that impedes cumulative science, while estimation methods produce results that compound naturally across replications. The companion to Cumming & Calin-Jageman’s textbook and a direct influence on subsequent debates in HCI and beyond about whether p-values should be abandoned outright. (If you're short on time, I recommend the talk version.)
A direct reply to Cumming (2014), broadly endorsing open science and the shift away from mechanical p-value thresholds, but pushing back on the claim that estimation alone is sufficient. The core argument: testing a theory requires knowing what you’d expect under both the theory and its negation, plus a principled method for choosing between them. CIs don’t provide the second or third component — they can show that data are consistent with a theory, but not that the data couldn’t just as easily arise if the theory were false. A useful counterweight in the estimation-vs-testing debate: the disagreement here is not about whether to report CIs, but about whether hypothesis tests can ever be informative in addition to them.
Statistical Pitfalls in ML
Papers on broader patterns of statistical misuse in machine learning research.
Systematic study of variance in ML benchmark results from random seeds, data ordering, and hardware. Shows that a substantial fraction of published improvements fall within the natural variance of the same algorithm under different random initialization. The paper introduces a framework for separating “algorithmic variance” from “trial variance,” and argues that current reporting standards make it impossible to know which published improvements are real. Directly motivates the need for bootstrap CIs in eval comparisons.
When you evaluate N models and report the performance of the best one, the confidence interval for that model is no longer valid: the selection step introduces optimism bias that standard CIs don’t account for. This paper constructs valid lower confidence bounds that explicitly condition on the selection procedure, combining bootstrap tilting with a max-T multiplicity correction to cover all candidate models simultaneously. The result is a principled answer to a problem common in LLM eval workflows: reporting the performance of the prompt or model that ranked highest in a sweep, without acknowledging that the top rank is itself a noisy outcome. (Although we arrived at the need to use max-T correction on CIs independently, this paper provides a rigorous theoretical and simulation-backed foundation for that choice in an ML context.)
Identifies failure modes in ML research writing: overclaiming, conflating ablations with explanations, failure to engage with negative results, and misuse of evaluation protocols. Several of the patterns identified directly apply to LLM eval claims: cherry-picked examples presented as representative benchmarks, single-prompt evaluations reported as capability measurements, and improvements that disappear under slightly different conditions.
Introduces behavioral testing: a structured suite of test types (minimum functionality, invariance, directional expectation) designed to probe specific capabilities rather than aggregate accuracy. Shows that models achieving high accuracy on standard benchmarks fail systematically on targeted behavioral tests. A complement to statistical inference methods: knowing whether a CI is tight is less useful if the eval items don’t actually represent the capability you care about.
Surveys obstacles in NLG evaluation: metrics that fail to correlate with human judgment, benchmark contamination, under-specified evaluation protocols, and lack of statistical testing. Identifies six categories of failure across 252 papers. Particularly relevant sections cover inter-annotator agreement methodology and the mismatch between automated metrics and human preference — problems now manifesting in LLM-as-judge pipelines.
A scroll-driven, animated introduction to permutation testing — the nonparametric method for asking whether an observed difference between two groups could plausibly arise by chance alone. Rather than assuming a parametric distribution, a permutation test repeatedly shuffles the group labels and measures how often random reassignments produce a difference as large as the one observed, building an exact null distribution. A natural complement to bootstrap CIs for model comparison: where a bootstrap CI quantifies uncertainty around an estimate, a permutation test directly answers the sharp question of statistical significance.
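A short example of the mechanics using SciPy's built-in routine, with synthetic scores for two models:

```python
# Two-sample permutation test on the difference in mean scores: shuffle the
# group labels to build the null distribution, as described above.
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(5)
scores_a = rng.normal(0.78, 0.10, size=30)
scores_b = rng.normal(0.72, 0.10, size=30)

def mean_diff(x, y, axis=0):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

res = permutation_test((scores_a, scores_b), mean_diff,
                       permutation_type="independent",
                       n_resamples=10_000, alternative="two-sided")
print(f"observed difference = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```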
Statistical Debates in HCI
HCI has recently been an active arena for rethinking statistical practice. HCI often deals with low sample sizes, so insights from this field may be particularly relevant for LLM evals, which face similar challenges.
An early call for reform of statistical practice in HCI research, diagnosing three entrenched problems: the transposed conditional fallacy (treating p-values as the probability the hypothesis is true), neglect of statistical power, and reluctance to interpret effect sizes. The authors argue these compound to produce weak theories from vaguely specified hypotheses and advocate shifting from significance thresholds to effect magnitude as the primary criterion for evaluating results. Uses CHI 2010 publications as illustrative data. A useful precursor to the Dragicevic and Kay et al. papers further down: it establishes the critique that motivates their more prescriptive reform proposals.
Argues that p-values and dichotomous significance testing are poor tools for scientific communication — not because researchers misuse them, but because the tools themselves are poorly suited to the task. Written for the HCI community but directly applicable to LLM evals: the paper explains in non-technical terms why switching to an estimation approach (reporting effect sizes and interval estimates with informative charts) produces clearer, more honest, and more actionable research findings. Offers concrete guidance on communicating empirical results without any tests or p-values, emphasizing nuanced interpretation over binary pass/fail verdicts.
Frames the choice of statistical methodology as a user-centered design problem: existing frequentist practice fails researchers because study results rarely accumulate into progressively more precise estimates — knowledge accrual stalls. The authors use simulation to compare frequentist and Bayesian publication worlds, showing that Bayesian analysis supports knowledge accrual with each new study and enables more principled conclusions from small-N work on novel techniques. Directly relevant to LLM eval practice: the argument for methods that compound information across studies rather than producing isolated significance verdicts applies equally to iterative model evaluation.
A community-maintained reference covering the most commonly mishandled statistical topics in HCI research: effect sizes, p-values, inferential statistics, pre-registration, multiple comparisons, inter-rater reliability, Bayesian inference, and Likert-scale data. Each chapter provides an FAQ with direct answers to specific questions, plus code exemplars with worked interpretations. A practical companion to the more argumentative papers in this section — useful for quickly resolving questions like “how should I report this Likert result?” or “when do I need to correct for multiple comparisons?”