Quick Start (CLI)
This guide walks you through running your first analysis using the command-line interface (CLI).
All you need is a CSV file of your eval scores in the expected format, and a few seconds to run the analysis and read the report.
The evalstats workflow consists of four short steps:
0. Install evalstats
You can install evalstats via pip or uv:
pip install evalstats
# or with uv
uv pip install evalstats
1. Prepare your CSV
evalstats expects a long-format (tidy) CSV: one row per
observation. At minimum you need input and score, plus either
prompt/template or model.
The input column identifies which individual test case each
row belongs to (e.g., questions in a Q&A dataset). This lets evalstats use paired statistics that account
for per-input difficulty differences.
If you evaluated each prompt multiple times per input (e.g. at different random seeds),
add a run column. With 3 or more runs per cell, evalstats
automatically switches to a method such as the nested bootstrap, which
accounts for run-level variance on top of input-level variance and
gives you more honest uncertainty estimates.
| prompt | input | run | score |
|---|---|---|---|
| prompt_a | q_001 | 0 | 0.68 |
| prompt_a | q_001 | 1 | 0.71 |
| prompt_a | q_001 | 2 | 0.74 |
| prompt_b | q_001 | 0 | 0.82 |
| prompt_b | q_001 | 1 | 0.79 |
| prompt_b | q_001 | 2 | 0.85 |
| prompt_c | q_001 | 0 | 0.89 |
| prompt_c | q_001 | 1 | 0.93 |
| prompt_c | q_001 | 2 | 0.91 |
| … | … | … | … |
Column names are matched case-insensitively. Accepted aliases:
- prompt / template / prompt_template
- input / example / item / id
- score / value / result / metric
- run / seed / trial
Add a model column for multi-model data.
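If your eval harness produces nested per-run results rather than a tidy CSV, a few lines of pandas can reshape them into this layout. This is just a sketch (pandas is assumed here, not required by evalstats), using the column names from the table above and made-up scores:

import pandas as pd

# Hypothetical raw results: {prompt: {input_id: [score per run]}}
raw = {
    "prompt_a": {"q_001": [0.68, 0.71, 0.74], "q_002": [0.55, 0.60, 0.58]},
    "prompt_b": {"q_001": [0.82, 0.79, 0.85], "q_002": [0.63, 0.66, 0.61]},
}

# One row per (prompt, input, run) observation, matching the long format above
rows = [
    {"prompt": prompt, "input": input_id, "run": run, "score": score}
    for prompt, per_input in raw.items()
    for input_id, run_scores in per_input.items()
    for run, score in enumerate(run_scores)
]

pd.DataFrame(rows).to_csv("results.csv", index=False)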
2. Run the analysis
Point evalstats analyze at your file. It auto-detects the data type
(binary, continuous, Likert, etc.), recognizes the multi-run structure, and picks an
appropriate CI method automatically.
evalstats analyze results.csv
3. Read the output
The terminal report has three sections. Here's what you'd see for the CSV above (20 inputs × 3 prompts × 3 runs, continuous scores):
====================
PROMPTS COMPARISON
====================
3 prompts | 20 inputs | method=smooth_bootstrap | 99% confidence intervals (CI)
--- Mean Performance (marginal bootstrap CIs) ---
axis: [0.561, 1.009] (· ±1σ, ─ CI, ● mean, │ grand mean)
Prompt Interval Plot Mean CI Low CI High
prompt_a ···──────●─────···│ 0.680 0.616 0.744
prompt_b ····──────●│─────···· 0.776 0.707 0.846
prompt_c │ ···─────●────··· 0.906 0.852 0.957
--- Pairwise Comparisons (nested smooth bootstrap (n=10000, R=3) (simultaneous CIs computed with max-T)) ---
legend: (· ±1σ, ─ CI, ● mean, │ zero) axis: [-0.351, +0.351] effect: Left - Right
Left Right Interval Plot Mean CI Low CI High ES
prompt_c prompt_b ·│──────●──────··· +0.1297 +0.0235 +0.2359 0.790
prompt_c prompt_a │ ··─────●─────·· +0.2261 +0.1345 +0.3177 0.990
prompt_b prompt_a ··──│────●────────·· +0.0965 -0.0265 +0.2195 0.600
ES = Effect Size (r_rb) = rank biserial correlation (small≈0.1, medium≈0.3, large≈0.5)
Statistically indistinguishable rank bands (similar to critical difference diagrams) computed from 99% CI, max_t-adjusted:
#2–#3: [prompt_b ─ prompt_a]
-> Evidence suggests a clear best option: 'prompt_c'
--- Executive Summary (Prompt leaderboard) ---
Prompt Grp Mean CI Stability Verdict
──────────────────────────────────────────────────────────────────────────
prompt_c #1 0.906 [0.852, 0.957] Stable Likely best
prompt_b #2 0.776 [0.707, 0.846] Stable Significant drop-off
prompt_a #2 0.680 [0.616, 0.744] Stable Significant drop-off
──────────────────────────────────────────────────────────────────────────
Start at the Executive Summary at the bottom. Entities in the same rank
group (#1, #2, …) are statistically indistinguishable
after multiple-comparisons correction — even if one has a higher mean. Here,
prompt_c is the clear best (#1), while
prompt_b and prompt_a both land in #2,
meaning the gap between them is not yet statistically significant.
The Pairwise Comparisons section shows a visual CI plot and effect size
(ES) for each pair. When the interval plot spans the zero line (│), the
difference is not significant. The Statistically indistinguishable rank
bands line summarizes which adjacent groups cannot be distinguished.
Saving output
# Save a Markdown report and plot
evalstats analyze results.csv --out report.md plot.png
# Save structured JSON for programmatic use
evalstats analyze results.csv --out analysis.json
# Print just the executive leaderboard (fastest read)
evalstats analyze results.csv --brief
See evalstats analyze --help for the full option list.
Python API
All functions accept NumPy arrays or plain Python lists and return a result object
with a .summary() method that prints the same terminal report as the CLI.
Comparing prompts
Pass a dict mapping prompt name → per-input score array. Each array must be the same length and correspond to the same inputs (paired data).
import numpy as np
import evalstats as estats
scores = {
"prompt_a": np.array([0.72, 0.60, 0.88, 0.75, ...]),
"prompt_b": np.array([0.85, 0.78, 0.90, 0.82, ...]),
"prompt_c": np.array([0.91, 0.83, 0.87, 0.95, ...]),
}
result = estats.compare_prompts(scores)
result.summary() # full terminal report
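If your scores already live in the long CSV layout from step 1, you don't have to build this dict by hand. One way (a sketch assuming pandas, not part of the evalstats API) is to average over runs and pivot, so each prompt becomes a column aligned on the same inputs and the arrays stay paired:

import pandas as pd
import evalstats as estats

df = pd.read_csv("results.csv")

# Collapse runs (if any) to a per-input mean, then pivot to one column per prompt
wide = (
    df.groupby(["prompt", "input"], as_index=False)["score"].mean()
      .pivot(index="input", columns="prompt", values="score")
      .dropna()  # keep only inputs that every prompt was scored on
)

scores = {prompt: wide[prompt].to_numpy() for prompt in wide.columns}
result = estats.compare_prompts(scores)
result.summary()

Note that averaging over runs discards run-level variance; to keep it, load the file directly with estats.analyze (see below).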
Comparing models
Same interface as compare_prompts, keyed by model name instead.
scores = {
"gpt-4o-mini": np.array([0.82, 0.91, 0.78, ...]),
"gemma-3-4b-it": np.array([0.75, 0.84, 0.71, ...]),
"qwen3-8b": np.array([0.88, 0.93, 0.85, ...]),
}
result = estats.compare_models(scores)
result.summary()
Comparing models × prompts
Pass a nested dict — outer keys are models, inner keys are prompt templates.
evalstats reports statistics for every model–prompt combination,
letting you see whether a prompt improvement generalizes across models.
scores = {
"gpt-4o-mini": {
"5-shot": np.array([0.79, 0.85, 0.83, ...]),
"0-shot": np.array([0.61, 0.73, 0.69, ...]),
},
"gemma-3-4b-it": {
"5-shot": np.array([0.68, 0.77, 0.72, ...]),
"0-shot": np.array([0.60, 0.72, 0.65, ...]),
},
}
result = estats.compare_models(scores)
result.full_summary()
================================================
CROSS-MODEL RANKING (ALL MODEL/TEMPLATE PAIRS)
================================================
5-shot 0-shot
─────────────────────────────────
gpt-4o-mini 0.814 █* 0.656 ·
gemma-3-4b-it 0.704 ▒ 0.624 ·
─────────────────────────────────
* = best pair by mean | heat: · (low) → █ (high), range [0.624, 0.814]
--- Rank Probabilities: All 4 by P(Best) (smooth bootstrap, n=10000, ranked by mean) ---
Model Template P(Best) E[Rank]
gpt-4o-mini 5-shot 100.0% ██████████████ 1.00 █▆▃───────────
gpt-4o-mini 0-shot 0.0% ░░░░░░░░░░░░░░ 3.07 ───────▃▆█▆▃──
gemma-3-4b-it 5-shot 0.0% ░░░░░░░░░░░░░░ 2.04 ──▃▆█▆▃───────
gemma-3-4b-it 0-shot 0.0% ░░░░░░░░░░░░░░ 3.89 ───────────▃▆█
--- Mean Performance: All 4 (marginal CIs) ---
axis: [0.517, 0.921] (· ±1σ, ─ CI, ● mean, │ grand mean)
Model Template Interval Plot Mean CI Low CI High
gpt-4o-mini 5-shot │ ····────●─────···· 0.814 0.770 0.860
gpt-4o-mini 0-shot ·····─────●───│─····· 0.656 0.604 0.711
gemma-3-4b-it 5-shot ···────│●────··· 0.704 0.663 0.744
gemma-3-4b-it 0-shot ····─────●────··│· 0.624 0.579 0.670
--- Executive Summary (Cross-model pair leaderboard) ---
Model Template Grp Mean CI Verdict
─────────────────────────────────────────────────────────────────────
gpt-4o-mini 5-shot #1 0.814 [0.770, 0.860] Likely best
gemma-3-4b-it 5-shot #2 0.704 [0.663, 0.744] Significant drop-off
gpt-4o-mini 0-shot #2 0.656 [0.604, 0.711] Significant drop-off
gemma-3-4b-it 0-shot #3 0.624 [0.579, 0.670] Significant drop-off
─────────────────────────────────────────────────────────────────────
In the crossed leaderboard, gpt-4o-mini / 5-shot ranks first.
The 5-shot setting is stronger for both models, and the
gemma-3-4b-it / 5-shot and gpt-4o-mini / 0-shot
rows landing in the same group (#2) indicate those two
middle conditions are not cleanly separated at this confidence level.
Looking for p-values?
Pass p_values=True to print corrected p-values in the pairwise table.
You can optionally set pairwise_test="wilcoxon" to force Wilcoxon
signed-rank p-values (instead of auto-selection). With omnibus=True,
evalstats also prepends a Friedman omnibus test line before the
pairwise rows.
scores = {
"gpt-4o-mini": np.array([0.82, 0.91, 0.78, 0.86, 0.74, ...]),
"gemma-3-4b-it": np.array([0.75, 0.84, 0.71, 0.79, 0.68, ...]),
"qwen3-8b": np.array([0.88, 0.93, 0.85, 0.91, 0.83, ...]),
}
result = estats.compare_models(
scores,
omnibus=True,
p_values=True,
pairwise_test="wilcoxon",
)
result.summary()
===================
MODELS COMPARISON
===================
3 models | 40 inputs | method=smooth_bootstrap | 99% confidence intervals (CI)
--- Mean Performance (marginal bootstrap CIs) ---
axis: [0.683, 0.937] (· ±1σ, ─ CI, ● mean, │ grand mean)
Model Interval Plot Mean CI Low CI High
gpt-4o-mini ······───│─●────······ 0.827 0.798 0.857
gemma-3-4b-it ·····────●────·····│ 0.753 0.726 0.778
qwen3-8b ··│··────●─────····· 0.863 0.836 0.891
--- Pairwise Comparisons (smooth bootstrap (n=10000) (fdr_bh-corrected p-values) (simultaneous CIs computed with max-T)) ---
Friedman omnibus: χ²(2) = 28.950, p = 5.171e-07***
legend: (· ±1σ, ─ CI, ● mean, │ zero) axis: [-0.193, +0.193] effect: Left - Right
Left Right Interval Plot Mean CI Low CI High ES p (wsr)
qwen3-8b gpt-4o-mini ····─│───●────····· +0.0361 -0.0082 +0.0804 0.429 0.01716
qwen3-8b gemma-3-4b-it │ ····────●─────···· +0.1101 +0.0681 +0.1521 0.944 3.492e-09***
gpt-4o-mini gemma-3-4b-it ··│··─────●─────····· +0.0740 +0.0249 +0.1231 0.702 6.606e-05***
ES = Effect Size (r_rb) = rank biserial correlation (small≈0.1, medium≈0.3, large≈0.5)
p (wsr) = Wilcoxon signed-rank (fdr_bh-corrected)
stars: * p<0.01, ** p<0.001, *** p<0.0001
Statistically indistinguishable rank bands (similar to critical difference diagrams) computed from 99% CI, max_t-adjusted:
#1–#2: [qwen3-8b ─ gpt-4o-mini]
--- Executive Summary (Model leaderboard) ---
Model Grp Mean CI Verdict
────────────────────────────────────────────────────────
qwen3-8b #1 0.863 [0.836, 0.891] Tied with gpt-4o-mini as best
gpt-4o-mini #1 0.827 [0.798, 0.857] Tied with qwen3-8b as best
gemma-3-4b-it #2 0.753 [0.726, 0.778] Significant drop-off
────────────────────────────────────────────────────────
The Friedman omnibus test (χ²(2) = 28.95, p < 0.001) confirms there is overall
variation across models. In the pairwise comparisons, qwen3-8b vs
gpt-4o-mini has a CI that crosses zero ([-0.008, +0.080] — not
significant), so both land in rank group #1. The new
p (wsr) column reports corrected Wilcoxon p-values for each pair.
Only the gaps to gemma-3-4b-it are conclusive.
Loading from a CSV or Excel file
estats.analyze() mirrors the CLI — pass a file path and it handles
format detection automatically. Useful when you want to run the analysis from a
script or notebook.
import evalstats as estats
result = estats.analyze("results.csv")
result.summary()
# With options
result = estats.analyze(
"results.xlsx",
sheet="Eval Results",
ci=0.95,
method="lmm", # mixed-effects model
n_bootstrap=20000,
)
result.summary()
Common Options
These parameters are accepted by all API functions (compare_prompts,
compare_models, analyze) and their CLI equivalents.
| Parameter | Default | Notes |
|---|---|---|
| ci | 0.99 | Confidence level. Use 0.95 for the traditional default. |
| method | "auto" | Auto-selects based on data type. Options: "lmm", "bootstrap", "smooth_bootstrap", "bca", "bayes_bootstrap", "bayes_binary", "wilson", "newcombe", "permutation", "fisher_exact", "sign_test". |
| n_bootstrap | 10000 | Bootstrap resamples. Increase for smoother CI estimates. |
| simultaneous_ci | True | Family-wise CI correction (max-T). Set to False for marginal CIs. |
| statistic | "mean" | Central tendency. Also supports "median". |
| reference | "grand_mean" | Comparison baseline for the advantage plot. Pass a template label to compare against a specific prompt or model. |
| omnibus | False | Run a Friedman omnibus test before pairwise comparisons (k ≥ 3). |
| p_values | False | Show pairwise p-values in the report. CLI: --p-values. |
| pairwise_test | "auto" | Pairwise p-value test. Options: "auto", "bootstrap", "wilcoxon", "nemenyi". CLI: --pairwise-test TEST (setting this explicitly also enables p-values). |
| correction | "holm" | Multiple-comparison correction. Options: "holm", "bonferroni", "fdr_bh", "none". |
| failure_threshold | None | Flag inputs scoring below this value in the robustness table. |
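As an illustration, several of these options can be combined in a single call. The values below are placeholders, not recommendations, and scores is the same dict used in the earlier examples:

result = estats.compare_prompts(
    scores,
    ci=0.95,                  # traditional 95% intervals instead of the 99% default
    statistic="median",       # compare medians rather than means
    correction="bonferroni",  # stricter multiple-comparison correction
    p_values=True,            # show corrected p-values in the pairwise table
    failure_threshold=0.5,    # flag inputs scoring below 0.5 in the robustness table
)
result.summary()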
Full runnable examples are in the
examples/ folder on GitHub.