Papers, Articles, and Commentary
A curated list of papers and articles that I've encountered and noted while working on this project. The list is organized into two sections: Commentary on select papers, and Papers, articles, and other educational resources on statistical methods and perspectives in LLM eval, ML, and HCI contexts.
Commentary on select papers
The "Evals Need Error Bars" Paper
In 2024, Evan Miller at Anthropic made an early, important case for error bars in evaluations. The argument centered on standard errors: combine standard deviation with sample size, report alongside the mean.
This is reasonable, but limited for the evals landscape of today. The framework assumes thousands or hundreds of thousands of datapoints — as Anthropic's public benchmarks have. That is not the world of most developers. A typical custom developer eval has 20–100 items, not 10,000. At that scale:
- Standard errors assume normal distributions of LLM outputs, which rarely hold.
- They assume symmetry around the mean — also rarely true for bounded scores at high performance.
- They collapse with hierarchical data (e.g., same inputs evaluated across multiple prompts and models).
What we want instead is confidence intervals on pairwise differences, with no strong distributional assumptions. To trust these intervals, we need to use methods that are robust at small N. To understand which methods are robust, we would ideally run simulations that reflect the kinds of data we see in real evals.
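To make that concrete, here is a minimal sketch (not the evalstats implementation; function names and data are illustrative) of a percentile bootstrap CI on the paired per-item difference between two prompts scored on the same inputs, with no distributional assumptions beyond resampling:

```python
# Sketch (not the evalstats API): percentile bootstrap CI on the paired
# per-item difference between two prompts scored on the same input set.
import numpy as np

rng = np.random.default_rng(0)

def paired_bootstrap_ci(scores_a, scores_b, n_boot=10_000, alpha=0.05):
    """CI for mean(scores_a - scores_b), resampling items with replacement."""
    diffs = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    n = len(diffs)
    idx = rng.integers(0, n, size=(n_boot, n))   # resample item indices
    boot_means = diffs[idx].mean(axis=1)         # bootstrap distribution of the mean difference
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), (lo, hi)

# Example: 40 items scored in [0, 1] by a judge for two prompt variants.
a = rng.uniform(0.5, 1.0, size=40)
b = np.clip(a - rng.normal(0.05, 0.1, size=40), 0, 1)
mean_diff, (lo, hi) = paired_bootstrap_ci(a, b)
print(f"mean difference {mean_diff:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```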
The Forgotten Paper on Statistical Comparisons in ML: Demšar (2006)
Before the LLM era, the machine learning community had a well-established answer to the question of how to statistically compare multiple models across multiple evaluation sets. Janez Demšar’s 2006 Journal of Machine Learning Research paper, Statistical Comparisons of Classifiers over Multiple Data Sets, argued that parametric tests like ANOVA are routinely misapplied to classifier comparisons, violating normality and independence assumptions. He recommended non-parametric alternatives (the Wilcoxon signed-ranks test for two models, the Friedman test plus post-hoc corrections for more than two) and introduced the critical difference (CD) diagram as a clean visual for communicating the results.
The paper is highly cited in classical ML but has largely been forgotten in the LLM benchmark discourse. That is a shame, because the situation Demšar addressed is precisely the one we face: many models (or prompts), many evaluation sets (or items), non-normal distributions, and a need to make comparative claims that are statistically honest about uncertainty.
Demšar’s core insight is that comparing the rank of methods averaged across datasets is more robust than comparing raw means, and that multiple comparisons require explicit correction. This transfers directly to LLM evals whenever you are comparing N prompts or models across M evaluation sets and asking which is best.
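As a concrete sketch of that rank-based workflow (data and variable names here are illustrative, not Demšar's code), one can rank prompts within each evaluation set, average the ranks, and run the Friedman omnibus test with SciPy:

```python
# Sketch: average ranks of k prompts across m evaluation sets, plus the
# Friedman omnibus test. The score matrix is made up for illustration.
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# scores[i, j] = score of prompt j on evaluation set i (higher is better)
scores = np.array([
    [0.82, 0.79, 0.91],
    [0.75, 0.80, 0.88],
    [0.90, 0.85, 0.87],
    [0.70, 0.72, 0.81],
    [0.88, 0.84, 0.93],
])

# Rank within each evaluation set (rank 1 = best), then average across sets.
ranks = rankdata(-scores, axis=1)
print("average ranks per prompt:", ranks.mean(axis=0))

# Friedman test: are the prompts' rank distributions distinguishable at all?
stat, p = friedmanchisquare(*scores.T)
print(f"Friedman chi-square = {stat:.2f}, p = {p:.4f}")
```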
That said, the framework needs adaptation for modern eval practice. CD diagrams and rank-band visualizations are built around a narrow, rigid NHST pipeline: run a Friedman test, apply a post-hoc correction (Nemenyi or Holm), then draw cliques of methods that are not significantly different. This works, but it doesn't adapt to the estimation-first approach that we advocate, which focuses on effect sizes and confidence intervals rather than p-values. Essentially, if we want to run any test other than Friedman, we can't use the CD diagram as Demšar’s paper defines it.
But the CD diagram is just a visualization! There's nothing stopping us from adapting it to other tests and other philosophies. In fact, evalstats adapts the CD diagram to work with pairwise confidence intervals (adjusted for multiple comparisons) and estimated ranks from a bootstrapped distribution. However, it also supports the traditional NHST Friedman test and CD diagrams generated from it, for users who want to run a more traditional analysis.
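As a rough sketch of what "estimated ranks from a bootstrapped distribution" can look like in practice (illustrative only, not the evalstats implementation; a real pipeline would also adjust the pairwise intervals for multiple comparisons), one can resample items, re-rank the prompts on each resample, and summarize each prompt's rank distribution:

```python
# Sketch (not the evalstats API): bootstrap the per-item scores, re-rank the
# prompts on each resample, and summarize each prompt's rank distribution.
import numpy as np

rng = np.random.default_rng(1)
# scores[i, j] = score of prompt j on item i (synthetic data)
scores = rng.beta(8, 2, size=(50, 4))

n_items, n_prompts = scores.shape
n_boot = 5_000
ranks = np.empty((n_boot, n_prompts))
for b in range(n_boot):
    sample = scores[rng.integers(0, n_items, size=n_items)]   # resample items
    order = (-sample.mean(axis=0)).argsort()                  # best prompt first
    ranks[b, order] = np.arange(1, n_prompts + 1)             # rank 1 = best

for j in range(n_prompts):
    lo, hi = np.quantile(ranks[:, j], [0.025, 0.975])
    print(f"prompt {j}: mean rank {ranks[:, j].mean():.2f}, "
          f"95% rank interval [{lo:.0f}, {hi:.0f}]")
```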
The "Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints" Paper
An ICML position paper titled "Don't Use the CLT in LLM Evals With Fewer Than a Few Hundred Datapoints" argues that for N<300 samples you must use Bayesian statistics, and that frequentist and bootstrap methods are untrustworthy at small N. But is that really true?
The paper evaluates Bayesian methods under simulation assumptions closely aligned with the Bayesian methods' own priors. That setup is valid for answering a specific question, but it can favor Bayesian approaches relative to methods designed to hold up in broader or misspecified settings.
Despite the broad title, the analysis is focused on Bernoulli (binary) outcomes. It does not cover numeric/interval eval scores (e.g., Likert or 0–100 grades), and it compares against a limited set of frequentist baselines rather than the strongest modern alternatives.
This matters because other simulation work has reported different rankings under different designs, including settings where certain frequentist intervals achieve higher coverage than Bayesian variants.
In ML terms, this is a precision–recall style trade-off: some methods give tighter intervals, others give higher coverage. Which is preferable depends on risk tolerance, data type, and whether under-coverage or over-conservatism is the bigger error in your application.
In our own simulations (see below), Wilson is strongest for binary single-sample coverage, Bayesian paired is strongest for binary pairwise coverage at small-to-mid N, and the smooth bootstrap with kernel density estimation (KDE) is the most reliable general-purpose method across non-binary eval scores. However, in our larger-N runs, the bayes_evals Bayesian paired method breaks down around N ≈ 200 and above, producing overconfident intervals that can be dangerously wrong. Thus, evalstats uses Bayesian paired only when N < 100, then switches to bootstrap for better accuracy at larger N and to take advantage of multiple-comparisons correction for pairwise CIs.
LLM Eval Statistics and Debates
Papers that have directly shaped how the field thinks about statistical rigor in LLM evaluation. There's also some commentary on select papers.
An influential argument for treating LLM eval scores as statistical estimates with quantified uncertainty. Proposes bootstrap CIs, paired tests for model comparisons, and power analysis for eval design. Addresses cluster structure (multi-run, multi-judge settings) via cluster-robust standard errors. The paper identified a genuine gap: the industry norm of reporting bare means without uncertainty is indefensible at any sample size. See the Debunking section for how evalstats’ smooth bootstrap improves on SE-based intervals at small N.
Argues that for binary LLM eval scores with N < 300, Bayesian beta-posterior methods substantially outperform frequentist and bootstrap alternatives. The paper’s simulation design evaluates Bayesian estimates on distributions that match the Bayesian prior, which inflates its apparent advantage. It also focuses exclusively on the binary case and does not benchmark against the Wilson interval or smoothed bootstrap. See the Debunking section of the Stats Reference Guide for a detailed critique. The paper’s core call for more rigorous statistics in LLM evals is well-founded, even if the specific recommendation is overstated.
A NIST standards document advocating for statistical rigor in AI evaluation, with formal guidance on CI estimation, test selection, and uncertainty quantification for model comparisons. Argues that the dominant evaluation culture — reporting point estimates without uncertainty — cannot support the regulatory and procurement decisions now being made on the basis of benchmark scores. Provides a taxonomy of evaluation tasks and matching statistical approaches, making it a useful framework for practitioners choosing between methods.
A companion to NIST AI 800-3, focused specifically on automated benchmark evaluation practices for language models. Provides standardized guidance on how to design, run, and report automated benchmark evaluations, with an eye toward reproducibility and consistency across the AI research community. Where 800-3 addresses statistical modeling and uncertainty quantification, 800-2 addresses the procedural infrastructure: how benchmarks should be constructed, scored, and compared.
A formal public comment submitted to the NIST AI 800-2 process by Princeton’s SAGE lab, focused on benchmark reliability concerns. Provides technical perspective on the limitations and failure modes of automated LLM benchmarks from a research group actively studying benchmark validity. A useful complement to the official NIST document, representing practitioner and researcher pushback on where current benchmark evaluation standards fall short.
LLM Evaluation Methodology
Papers on how LLM evals are structured, run, and reported — and where they go wrong.
A 15-chapter online textbook tracing the foundations, current challenges, and future directions of ML benchmarking. Covers statistical testing, the holdout method, test-set reuse, replication crises, generative models, and LLM evaluation. Opens with a dual diagnosis: benchmarks are “the iron rule to tame the anything goes” in ML research, yet “a crisis grips the benchmarking enterprise.” A thorough treatment of why aggregate benchmark scores are both essential and fragile — and what rigorous alternatives look like.
Introduces HELM: evaluating models across 42 scenarios, 7 metric categories, and hundreds of sub-scenarios. One of the most comprehensive eval frameworks published. Aggregate scores are reported as means without confidence intervals, making it a useful illustration of scale-dependent eval limitations: at the scale at which HELM operates (thousands of items per scenario), standard errors are small enough to be negligible, but for the typical developer running a custom eval at N=50, HELM’s methodology does not transfer.
Introduces and validates the LLM-as-a-judge paradigm: using a capable model (GPT-4) as a scoring proxy for human preference. Shows ~80% agreement with human judges on pairwise comparisons. Also introduces Chatbot Arena and its Elo-rating leaderboard. The paper identifies key judge failure modes (position bias, verbosity bias, self-enhancement) but does not quantify judge-induced variance in downstream CI estimates.
Shows that “emergent” abilities — capabilities that appear suddenly at a threshold model scale — are largely artifacts of the choice of metric. Discontinuous metrics (exact match) create apparent phase transitions; replacing them with continuous metrics (edit distance, token probability) reveals smooth, continuous improvement curves. A precise demonstration of why the score function is part of the statistical model: the metric you choose changes the conclusions you can draw.
Argues that MCC is the most informative single metric for binary classification: a high MCC guarantees all four confusion-matrix rates are high (TPR, TNR, PPV, NPV), whereas balanced accuracy or bookmaker informedness only guarantee three. Uses mathematical derivations and worked examples to show scenarios where BA and BM signal near-perfect performance while MCC correctly flags failure. Directly relevant to LLM binary evals: when pass/fail or correct/incorrect scores are the output, choosing the wrong summary metric can mask systematic failure modes on one class. A concrete companion to Schaeffer et al.: not only does the shape of the metric matter (continuous vs. discontinuous), but so does which confusion-matrix cell it is sensitive to.
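A quick illustration of the failure mode, with made-up numbers rather than the paper's own examples: on a heavily imbalanced pass/fail eval, balanced accuracy can look strong while MCC exposes a near-useless positive class.

```python
# Illustrative numbers (not from the paper): high TPR and TNR give a flattering
# balanced accuracy, but 500 false alarms against 9 true positives crater MCC.
import numpy as np
from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

# Confusion matrix: TP=9, FN=1, TN=9500, FP=500
y_true = np.concatenate([np.ones(10), np.zeros(10_000)])
y_pred = np.concatenate([
    np.ones(9), np.zeros(1),        # positives: 9 caught, 1 missed
    np.ones(500), np.zeros(9_500),  # negatives: 500 false alarms
])

print("balanced accuracy:", balanced_accuracy_score(y_true, y_pred))  # ~0.93
print("MCC:              ", matthews_corrcoef(y_true, y_pred))        # ~0.12
```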
Documents that different BLEU implementations produce scores that differ by up to 1–2 BLEU points on the same output, making cross-paper comparisons unreliable. An early example of what has become a general problem: aggregate metrics reported without enough implementation detail to reproduce them. The lesson generalizes directly to LLM evals: a score of 78% is only meaningful if the scoring function, judge model, prompt template, and evaluation protocol are fully specified.
A large-scale collaborative benchmark covering 204 tasks designed to be beyond the capabilities of then-current models. Beyond its task coverage, BIG-Bench is notable methodologically for its use of multiple human raters with inter-rater reliability checks, calibrated difficulty ratings, and discussion of metric sensitivity. A useful contrast to single-metric benchmarks, though statistical uncertainty of aggregate scores is still not prominently reported.
Introduces ProSA, a framework for measuring how sensitive LLM performance is to surface-level prompt variation (phrasing, ordering, delimiter choice). Finds that sensitivity varies substantially across models, tasks, and prompt dimensions, and that models with similar mean accuracy can have very different variance profiles. Directly motivates the Prompt Sensitivity investigation: a model’s mean score on a single prompt is a noisy proxy for its true task capability, and the noise is correlated with prompt style rather than item difficulty.
A practitioner’s guide to reproducibility failures in LLM evaluation, drawn from experience running large-scale evals. Identifies sources of variance that are rarely reported: temperature and sampling randomness, prompt template sensitivity, tokenization differences between frameworks, and hardware-level floating-point non-determinism. The paper makes the case that reproducibility in LLM evals requires specifying far more than just the model name and task description — a concrete list of what “full specification” actually entails.
Shows that LLM performance on benchmarks fluctuates dramatically with superficial changes to evaluation setup — from option ordering to question framing — producing score swings large enough to change model rankings. Argues that standard benchmark scores do not measure a stable underlying capability but rather a model’s response to a specific configuration. Directly motivates the When Rankings Flip investigation: a CI on a single eval configuration cannot capture configuration-induced variance.
A systematic review of 445 LLM benchmarks from major NLP and ML conferences, examining whether they actually measure the concepts they claim to — safety, robustness, reasoning, and so on. Using 29 expert reviewers, the authors identify recurring patterns in tasks, metrics, and phenomena that undermine construct validity. Concludes with eight recommendations for more rigorous benchmark development, and a practical construct validity checklist for benchmarks. See also the Communications of the ACM post for an accessible summary.
Evaluates LLMs on class-level code generation using real-world open-source repositories rather than synthetic or curated problems. Finds that model rankings on synthetic benchmarks (HumanEval, MBPP) diverge substantially from rankings on authentic tasks, and that benchmark-to-benchmark transfer is weak even within coding evals. A concrete illustration of external validity failure: even a statistically rigorous CI on a synthetic benchmark cannot bound performance on real-world tasks if the benchmark distribution is unrepresentative.
A systematic guide to inter-annotator agreement (IAA) metric selection for NLP tasks. Compares Cohen’s κ, Krippendorff’s α, Fleiss’ κ, and percent agreement, clarifying when each is appropriate and when they give misleading results. Particularly relevant for LLM-as-judge pipelines: when multiple judges (human or model) score the same items, IAA measures reveal whether the scoring process is reliable — a prerequisite for trusting any downstream CI built on those scores.
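A minimal sketch of the basic reliability check, with invented judge labels: when one label dominates, raw percent agreement flatters the judges relative to Cohen's κ.

```python
# Sketch: percent agreement vs. Cohen's kappa for two judges on the same items.
# With a dominant "pass" label, raw agreement looks high while kappa is modest.
import numpy as np
from sklearn.metrics import cohen_kappa_score

judge_a = np.array(["pass"] * 45 + ["fail"] * 5)
judge_b = np.array(["pass"] * 42 + ["fail"] * 3 + ["pass"] * 3 + ["fail"] * 2)

agreement = (judge_a == judge_b).mean()
kappa = cohen_kappa_score(judge_a, judge_b)
print(f"percent agreement = {agreement:.2f}, Cohen's kappa = {kappa:.2f}")
# -> agreement 0.88, kappa ~0.33: most of the "agreement" is the base rate.
```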
Methods and Perspectives Used in evalstats
Papers behind the statistical methods implemented in the library, as well as the estimation approach to statistics adopted here.
Derives what is now called the Wilson score interval for proportions by inverting the score test statistic directly, rather than approximating p by p̂ in the standard error. The resulting interval never extends outside [0, 1], and achieves near-nominal coverage even at small N and extreme probabilities — both conditions common in binary LLM evals. The Wald interval (p̂ ± z⋅SE) is a first-order approximation that fails badly here.
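Assuming statsmodels is available, the Wilson interval is one call away, and the contrast with the Wald interval at small N is easy to see (the 18/20 example is invented):

```python
# Wilson vs. Wald interval for 18/20 passes: the Wald interval overshoots 1.0.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

count, nobs = 18, 20
p_hat = count / nobs

wilson = proportion_confint(count, nobs, alpha=0.05, method="wilson")
wald_se = np.sqrt(p_hat * (1 - p_hat) / nobs)
wald = (p_hat - 1.96 * wald_se, p_hat + 1.96 * wald_se)

print("Wilson:", wilson)   # stays inside [0, 1]
print("Wald:  ", wald)     # upper limit exceeds 1.0
```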
Systematically compares 11 CI methods for the difference between two proportions. The Wilson-hybrid method (Method 10 in the paper) consistently achieves near-nominal coverage across the full probability space tested. evalstats uses the paired-samples extension of this method for pairwise binary comparisons between prompts or models evaluated on the same input set. Note: this paper addresses independent proportions; the paired variant used in evalstats follows the same Wilson-square-root construction applied to paired differences.
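For reference, here is a sketch of the independent-proportions version of Newcombe's square-and-add hybrid, built from the Wilson limits of each proportion (function name and example counts are illustrative; the paired extension used in evalstats applies the same construction to paired differences):

```python
# Sketch of Newcombe's square-and-add hybrid (Method 10 in the paper) for the
# difference between two independent proportions, built from Wilson limits.
import numpy as np
from statsmodels.stats.proportion import proportion_confint

def newcombe_diff_ci(x1, n1, x2, n2, alpha=0.05):
    p1, p2 = x1 / n1, x2 / n2
    l1, u1 = proportion_confint(x1, n1, alpha=alpha, method="wilson")
    l2, u2 = proportion_confint(x2, n2, alpha=alpha, method="wilson")
    d = p1 - p2
    lower = d - np.sqrt((p1 - l1) ** 2 + (u2 - p2) ** 2)
    upper = d + np.sqrt((u1 - p1) ** 2 + (p2 - l2) ** 2)
    return d, (lower, upper)

# Prompt A passes 42/50 items, prompt B passes 35/50 items (disjoint item sets).
print(newcombe_diff_ci(42, 50, 35, 50))
```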
The foundational bootstrap paper. Introduces resampling with replacement as a general method for estimating the sampling distribution of any statistic, without making distributional assumptions. Establishes the theoretical properties that make the bootstrap valid for smooth statistics and justifies its use for constructing CIs from finite samples. All bootstrap methods in evalstats (percentile, BCa, smooth) trace their lineage here.
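SciPy packages the percentile and BCa variants behind one function, which is a convenient way to compare them on the same sample (the scores below are synthetic):

```python
# Percentile vs. BCa bootstrap CIs for the mean of a single sample of eval scores.
import numpy as np
from scipy.stats import bootstrap

rng = np.random.default_rng(2)
scores = rng.beta(8, 2, size=40)   # 40 bounded eval scores in [0, 1]

for method in ("percentile", "BCa"):
    res = bootstrap((scores,), np.mean, confidence_level=0.95, method=method)
    ci = res.confidence_interval
    print(f"{method:>10}: [{ci.low:.3f}, {ci.high:.3f}]")
```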
Shows that the standard cross-validation variance estimator is biased because it ignores the dependence between training sets in k-fold CV. Derives a corrected variance estimator that accounts for both test-set variability and training-set randomness, yielding significance tests that achieve the correct level rather than inflating type-I error. The “corrected resampled t-test” from this paper is the standard reference for rigorous hypothesis testing when comparing two algorithms on a single dataset via repeated cross-validation.
Introduces the PSR estimator: adding KDE-bandwidth Gaussian perturbations to bootstrap resamples, which smooths the discrete empirical distribution. Shows improved coverage over the plain percentile bootstrap for continuous data at small sample sizes, because the smoothing reduces the discretization artifacts that cause the standard bootstrap to under-cover at small N. evalstats implements this as smooth_bootstrap using Scott’s rule (h = n^(−1/5) × s) for the bandwidth.
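A minimal sketch of the idea (illustrative, and not necessarily identical to what smooth_bootstrap does internally; the function name below is invented): resample with replacement, then jitter each resampled point with Gaussian noise at the Scott's-rule bandwidth.

```python
# Sketch of a smoothed (kernel) bootstrap: ordinary resampling plus Gaussian
# jitter with bandwidth h = s * n**(-1/5), which smooths the discrete
# empirical distribution at small N.
import numpy as np

rng = np.random.default_rng(3)

def smooth_bootstrap_ci(x, n_boot=10_000, alpha=0.05):
    x = np.asarray(x, dtype=float)
    n = len(x)
    h = x.std(ddof=1) * n ** (-1 / 5)                          # Scott's-rule bandwidth
    idx = rng.integers(0, n, size=(n_boot, n))                 # plain resampling
    samples = x[idx] + rng.normal(0.0, h, size=(n_boot, n))    # KDE-bandwidth jitter
    boot_means = samples.mean(axis=1)
    return np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])

scores = rng.normal(0.7, 0.15, size=25)   # small-N numeric eval scores
print(smooth_bootstrap_ci(scores))
```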
An extended treatment of the Banks PSR and related smoothed bootstrap estimators. Argues that Banks’ method is substantially underappreciated in the statistical literature and particularly effective for small-N settings where the standard bootstrap discretization error is largest. Provides theoretical coverage guarantees and simulation evidence consistent with evalstats’ own simulation results favoring the smooth bootstrap.
Simulation study comparing percentile, BCa, and studentized bootstrap CIs for factor analysis loadings. Finds that the percentile method achieves better overall coverage than BCa in typical settings, and that BCa’s reputation for superiority depends on conditions that are often not met. Although the domain (factor analysis) differs from LLM evals, the qualitative finding is consistent with evalstats’ simulation results: BCa underperforms on pairwise comparisons, where the asymmetry it was designed to correct does not manifest cleanly.
Introduces the Holm–Bonferroni step-down procedure: sort p-values in ascending order, then compare the k-th smallest against α/(m−k+1). This is uniformly more powerful than plain Bonferroni while maintaining strong family-wise error rate control under any dependence structure. Applied in evalstats when comparing k prompts or models simultaneously, replacing the conventional but unnecessarily conservative Bonferroni correction.
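In Python, statsmodels implements the step-down procedure directly; the p-values below are invented for illustration:

```python
# Holm's step-down correction on a set of pairwise p-values, via statsmodels.
from statsmodels.stats.multitest import multipletests

pvals = [0.004, 0.021, 0.038, 0.170]   # e.g. four pairwise prompt comparisons
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
print("adjusted p-values:", p_adj)
print("rejected:", reject)
```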
The canonical reference for comparing multiple classifiers across multiple datasets. Argues that parametric tests (t-tests, ANOVA) are routinely misapplied in this setting because their normality and independence assumptions are violated, and recommends non-parametric alternatives: the Wilcoxon signed-ranks test for two-classifier comparisons and the Friedman test with post-hoc corrections for more than two. Introduces the critical difference (CD) diagram — the standard visualization for showing which classifiers differ significantly after correcting for multiple comparisons. CD diagrams are implemented in evalstats for multi-prompt and multi-model comparisons.
Extends Demšar (2006) by providing rigorous post-hoc procedures specifically for all pairwise comparisons among N classifiers, rather than comparisons against a single control. Derives adjusted p-values (Holm, Shaffer, Bergmann-Hommel) that maintain family-wise error rate control across the full N(N−1)/2 comparison matrix. Essential companion to Demšar when the goal is to rank all prompts or models against each other rather than against a single baseline.
Demonstrates a troubling property of mean-rank post-hoc tests (Nemenyi, Bonferroni on ranks): whether two algorithms are declared significantly different depends on which other algorithms are included in the comparison set. Adding or removing an irrelevant third algorithm can flip the significance verdict between two others — a violation of basic statistical independence. Recommends pairwise sign-tests or Wilcoxon signed-rank tests instead, which depend only on the two algorithms being directly compared. A direct critique of the standard Demšar CD-diagram pipeline.
A tutorial-style paper by the same group as the 2015 critique above, proposing Bayesian replacements for the Friedman/Nemenyi NHST pipeline. Rather than asking “is there a significant difference?,” the Bayesian approach answers “what is the posterior probability that algorithm A outperforms B?” and can incorporate a region of practical equivalence (ROPE) to distinguish meaningful from negligible differences. Includes practical Bayesian signed-rank and correlated t-tests with software. The Bayesian estimation framing aligns closely with evalstats’ CI-first philosophy.
Extends the Bayesian approach from Benavoli et al. (2017) by fitting a hierarchical model across all datasets simultaneously rather than treating each independently. The hierarchical structure pools information across datasets, reducing variance in posterior estimates and making conclusions more stable when individual datasets are noisy or small. Produces posterior probabilities that one classifier outperforms another with joint analysis of the full multi-dataset structure.
Applies the Bradley–Terry paired-comparison model in a Bayesian framework to rank multiple machine learning algorithms across datasets. Unlike rank-based non-parametric tests, the Bradley–Terry model produces a full posterior over algorithm rankings and allows defining regions of practical equivalence for any performance metric. Includes R and Python implementations. Useful for LLM eval workflows where models or prompts need to be ranked and the ranking uncertainty matters as much as the ranking itself.
A hands-on tutorial walking through the full non-parametric testing workflow for comparing algorithms across multiple benchmark problems: Wilcoxon signed-rank for pairwise comparisons, Friedman for k-algorithm omnibus tests, and Holm/Shaffer/Bergmann post-hoc corrections for multiple pairwise conclusions. Uses CEC’2005 optimization benchmark results as running examples. Despite the evolutionary computing framing, this is the most practically useful step-by-step reference for anyone applying the Demšar framework to a new domain — including LLM evals.
Presents a unified toolkit of exploratory and inferential methods for benchmark experiment analysis, combining novel visualizations with parametric and non-parametric tests to establish statistically sound algorithm rankings. Goes beyond the Demšar pipeline by treating benchmark analysis as an exploratory data analysis problem first, with formal inference as a secondary step. Useful for analysts who want richer visual diagnostics of multi-algorithm, multi-dataset results before committing to a single statistical summary.
An R package that unifies the full Demšar/García statistical comparison workflow — data loading, hypothesis testing (Friedman, Wilcoxon, Holm, Shaffer), CD diagram generation, and report output — into a single coherent API. Also implements the Bayesian tests from Benavoli et al. (2017). The reference implementation for anyone applying the non-parametric multi-algorithm comparison framework in R, and a useful touchstone for understanding what evalstats’ CD diagram output corresponds to in the classical literature.
Proposes assigning Dirichlet-distributed weights to the observed data points instead of resampling with replacement. The resulting posterior is smoother than the standard bootstrap’s discrete empirical distribution. In simulation, the Bayesian bootstrap tends to produce narrower (higher-precision) intervals than plain percentile bootstrap, but at the cost of lower coverage (lower recall) at small N — the trade-off discussed in the Stats Reference Guide.
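A minimal sketch of the mechanism, with synthetic scores: draw Dirichlet(1, ..., 1) weights over the observed points and compute the weighted mean for each draw.

```python
# Sketch of the Bayesian bootstrap: Dirichlet weights over the observed points
# instead of integer resampling counts, giving a smoother posterior for the mean.
import numpy as np

rng = np.random.default_rng(4)
scores = rng.beta(8, 2, size=30)

n_draws = 10_000
weights = rng.dirichlet(np.ones(len(scores)), size=n_draws)   # one weight vector per draw
posterior_means = weights @ scores
print("posterior mean:", posterior_means.mean())
print("95% credible interval:", np.quantile(posterior_means, [0.025, 0.975]))
```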
A practical survey of bootstrap CI methods (basic, percentile, BCa, studentized) with guidance on when each is appropriate. Explains why BCa often outperforms percentile in asymmetric situations, but also identifies conditions under which BCa fails — particularly when the acceleration constant estimate is unstable, which tends to occur in pairwise comparison settings. Useful background for understanding the BCa coverage collapse in evalstats’ simulations.
The most widely recommended introduction to Bayesian data analysis for scientists who want to build genuine probabilistic models rather than run canned tests. Works through the full Bayesian workflow — prior selection, likelihood specification, posterior sampling via MCMC, and model criticism — using a deliberately slow, conceptual pace that demystifies what inference actually does. The annual lecture series (see the 2026 GitHub repo for videos and code) keeps the material current. Relevant to LLM eval practice as grounding for anyone who wants to go beyond bootstrap CIs toward full Bayesian models of eval score distributions or judge reliability.
The canonical textbook for the estimation-based approach to statistics: report effect sizes and confidence intervals, not just p-values and reject/fail-to-reject decisions. Argues that point estimates accompanied by CIs communicate more information than binary significance tests, support meta-analytic thinking, and encourage replication. Directly underpins the philosophy of this guide: the goal of an LLM eval is not to “pass a significance threshold” but to estimate a quantity — a score, a difference, an improvement — with calibrated uncertainty. Also covers open-science practices (pre-registration, effect size reporting) that translate naturally to rigorous eval design.
Statistical Testing in NLP
Papers on how statistical significance testing should be applied to NLP system comparisons — a direct precursor to the same questions now facing LLM evaluation.
Empirically examines the relationship between metric gains, test-set size, system similarity, and statistical significance in NLP evaluation. Investigates whether significance on a held-out test set predicts out-of-domain performance — an external validity question that maps directly onto LLM eval practice. Uses bootstrap resampling throughout. The findings ground subsequent debates: apparent improvements in NLP are frequently within noise, and the conditions under which significance generalizes are narrower than the field assumed.
A survey and practical protocol for selecting the right significance test for NLP comparisons, covering bootstrap, permutation, Wilcoxon, McNemar, and parametric tests across different output types (sequences, labels, scores). A survey of ACL and TACL 2017 papers found that significance testing was frequently ignored or misapplied. The paper’s decision-tree protocol — choose a test based on output type, sample size, and dependence structure — transfers directly to LLM eval: the same choices arise when comparing models on classification, generation, or rating tasks.
Statistical Debates
Foundational papers in ongoing debates about p-values, estimation, and the reliability of published research.
A mathematical argument that the majority of claimed research findings in biomedicine are likely false, given typical study power, significance thresholds, researcher degrees of freedom, and the ratio of true to tested hypotheses. The framework is general: any field that runs many low-powered tests on a sparse signal, reports only significant results, and allows flexibility in analysis will produce a literature where most positive findings are noise. The diagnosis applies directly to LLM benchmark claims: small N evals with many implicit comparisons (prompts, models, metrics) and no correction for multiplicity generate exactly the conditions Ioannidis identifies as most likely to produce false positives.
A systematic catalog of 25 common misinterpretations of p-values, confidence intervals, and statistical power, co-authored by some of the most prominent statisticians and epidemiologists working on these problems. Establishes the precise meaning of each concept and shows how each is routinely misread — including the widespread belief that a 95% CI contains the true value with 95% probability, or that a non-significant p-value means the null hypothesis is true. Essential reading alongside the Cumming debate: even researchers who have abandoned p-values often carry the same misconceptions into their CI interpretations.
The seminal journal article making the case for replacing null-hypothesis significance testing with an estimation approach: report effect sizes and confidence intervals, pursue meta-analytic thinking, and pre-specify studies where possible. Argues that NHST encourages binary thinking that impedes cumulative science, while estimation methods produce results that compound naturally across replications. The companion to Cumming & Calin-Jageman’s textbook and a direct influence on subsequent debates in HCI and beyond about whether p-values should be abandoned outright. (If you're short on time, I recommend the talk version.)
A direct reply to Cumming (2014), broadly endorsing open science and the shift away from mechanical p-value thresholds, but pushing back on the claim that estimation alone is sufficient. The core argument: testing a theory requires knowing what you’d expect under both the theory and its negation, plus a principled method for choosing between them. CIs don’t provide the second or third component — they can show that data are consistent with a theory, but not that the data couldn’t just as easily arise if the theory were false. A useful counterweight in the estimation-vs-testing debate: the disagreement here is not about whether to report CIs, but about whether hypothesis tests can ever be informative in addition to them.
Statistical Pitfalls in ML
Papers on broader patterns of statistical misuse in machine learning research.
Systematic study of variance in ML benchmark results from random seeds, data ordering, and hardware. Shows that a substantial fraction of published improvements fall within the natural variance of the same algorithm under different random initialization. The paper introduces a framework for separating “algorithmic variance” from “trial variance,” and argues that current reporting standards make it impossible to know which published improvements are real. Directly motivates the need for bootstrap CIs in eval comparisons.
When you evaluate N models and report the performance of the best one, the confidence interval for that model is no longer valid: the selection step introduces optimism bias that standard CIs don’t account for. This paper constructs valid lower confidence bounds that explicitly condition on the selection procedure, combining bootstrap tilting with a max-T multiplicity correction to cover all candidate models simultaneously. The result is a principled answer to a problem common in LLM eval workflows: reporting the performance of the prompt or model that ranked highest in a sweep, without acknowledging that the top rank is itself a noisy outcome. (Although we arrived at the need to use max-T correction on CIs independently, this paper provides a rigorous theoretical and simulation-backed foundation for that choice in an ML context.)
Identifies failure modes in ML research writing: overclaiming, conflating ablations with explanations, failure to engage with negative results, and misuse of evaluation protocols. Several of the patterns identified directly apply to LLM eval claims: cherry-picked examples presented as representative benchmarks, single-prompt evaluations reported as capability measurements, and improvements that disappear under slightly different conditions.
Introduces behavioral testing: a structured suite of test types (minimum functionality, invariance, directional expectation) designed to probe specific capabilities rather than aggregate accuracy. Shows that models achieving high accuracy on standard benchmarks fail systematically on targeted behavioral tests. A complement to statistical inference methods: knowing whether a CI is tight is less useful if the eval items don’t actually represent the capability you care about.
Surveys obstacles in NLG evaluation: metrics that fail to correlate with human judgment, benchmark contamination, under-specified evaluation protocols, and lack of statistical testing. Identifies six categories of failure across 252 papers. Particularly relevant sections cover inter-annotator agreement methodology and the mismatch between automated metrics and human preference — problems now manifesting in LLM-as-judge pipelines.
A scroll-driven, animated introduction to permutation testing — the nonparametric method for asking whether an observed difference between two groups could plausibly arise by chance alone. Rather than assuming a parametric distribution, a permutation test repeatedly shuffles the group labels and measures how often random reassignments produce a difference as large as the one observed, building an exact null distribution. A natural complement to bootstrap CIs for model comparison: where a bootstrap CI quantifies uncertainty around an estimate, a permutation test directly answers the sharp question of statistical significance.
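A short example of the mechanics using SciPy's built-in routine, with synthetic scores for two models:

```python
# Two-sample permutation test on the difference in mean scores: shuffle the
# group labels to build the null distribution, as described above.
import numpy as np
from scipy.stats import permutation_test

rng = np.random.default_rng(5)
scores_a = rng.normal(0.78, 0.10, size=30)
scores_b = rng.normal(0.72, 0.10, size=30)

def mean_diff(x, y, axis=0):
    return np.mean(x, axis=axis) - np.mean(y, axis=axis)

res = permutation_test((scores_a, scores_b), mean_diff,
                       permutation_type="independent",
                       n_resamples=10_000, alternative="two-sided")
print(f"observed difference = {res.statistic:.3f}, p = {res.pvalue:.4f}")
```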
Statistical Debates in HCI
HCI has recently been an active arena for rethinking statistical practice. HCI often deals with low sample sizes, so insights from this field may be particularly relevant for LLM evals, which face similar challenges.
An early call for reform of statistical practice in HCI research, diagnosing three entrenched problems: the transposed conditional fallacy (treating p-values as the probability the hypothesis is true), neglect of statistical power, and reluctance to interpret effect sizes. The authors argue these compound to produce weak theories from vaguely specified hypotheses and advocate shifting from significance thresholds to effect magnitude as the primary criterion for evaluating results. Uses CHI 2010 publications as illustrative data. A useful precursor to the Dragicevic and Kay et al. papers further down: it establishes the critique that motivates their more prescriptive reform proposals.
Argues that p-values and dichotomous significance testing are poor tools for scientific communication — not because researchers misuse them, but because the tools themselves are poorly suited to the task. Written for the HCI community but directly applicable to LLM evals: the paper explains in non-technical terms why switching to an estimation approach (reporting effect sizes and interval estimates with informative charts) produces clearer, more honest, and more actionable research findings. Offers concrete guidance on communicating empirical results without any tests or p-values, emphasizing nuanced interpretation over binary pass/fail verdicts.
Frames the choice of statistical methodology as a user-centered design problem: existing frequentist practice fails researchers because study results rarely accumulate into progressively more precise estimates — knowledge accrual stalls. The authors use simulation to compare frequentist and Bayesian publication worlds, showing that Bayesian analysis supports knowledge accrual with each new study and enables more principled conclusions from small-N work on novel techniques. Directly relevant to LLM eval practice: the argument for methods that compound information across studies rather than producing isolated significance verdicts applies equally to iterative model evaluation.
A community-maintained reference covering the most commonly mishandled statistical topics in HCI research: effect sizes, p-values, inferential statistics, pre-registration, multiple comparisons, inter-rater reliability, Bayesian inference, and Likert-scale data. Each chapter provides an FAQ with direct answers to specific questions, plus code exemplars with worked interpretations. A practical companion to the more argumentative papers in this section — useful for quickly resolving questions like “how should I report this Likert result?” or “when do I need to correct for multiple comparisons?”