Quick Start (CLI)
This guide walks you through running your first analysis using the command-line interface (CLI).
All you need is a CSV file of your eval scores in the expected format, and a few seconds to run the analysis and read the report.
The evalstats workflow consists of four short steps:
0. Install evalstats
You can install evalstats via pip or uv:
pip install evalstats
# or with uv
uv pip install evalstats
1. Prepare your CSV
evalstats expects a long-format (tidy) CSV: one row per
observation. At minimum you need input and score, plus either
prompt/template or model.
The input column identifies which individual test case each
row belongs to (e.g., questions in a Q&A dataset). This lets evalstats use paired statistics that account
for per-input difficulty differences.
If you evaluated each prompt multiple times per input (e.g. at different random seeds),
add a run column. With 3 or more runs per cell, evalstats
automatically switches to a method such as the nested bootstrap, which
accounts for run-level variance on top of input-level variance and
gives you more honest uncertainty estimates.
| prompt | input | run | score |
|---|---|---|---|
| prompt_a | q_001 | 0 | 0.68 |
| prompt_a | q_001 | 1 | 0.71 |
| prompt_a | q_001 | 2 | 0.74 |
| prompt_b | q_001 | 0 | 0.82 |
| prompt_b | q_001 | 1 | 0.79 |
| prompt_b | q_001 | 2 | 0.85 |
| prompt_c | q_001 | 0 | 0.89 |
| prompt_c | q_001 | 1 | 0.93 |
| prompt_c | q_001 | 2 | 0.91 |
| … | … | … | … |
Column names are matched case-insensitively. Accepted aliases:
- prompt / template / prompt_template
- input / example / item / id
- score / value / result / metric
- run / seed / trial
Add a model column for multi-model data.
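If your eval harness produces nested per-run results rather than a tidy CSV, a few lines of pandas can reshape them into this layout. This is just a sketch (pandas is assumed here, not required by evalstats), using the column names from the table above and made-up scores:

import pandas as pd

# Hypothetical raw results: {prompt: {input_id: [score per run]}}
raw = {
    "prompt_a": {"q_001": [0.68, 0.71, 0.74], "q_002": [0.55, 0.60, 0.58]},
    "prompt_b": {"q_001": [0.82, 0.79, 0.85], "q_002": [0.63, 0.66, 0.61]},
}

# One row per (prompt, input, run) observation, matching the long format above
rows = [
    {"prompt": prompt, "input": input_id, "run": run, "score": score}
    for prompt, per_input in raw.items()
    for input_id, run_scores in per_input.items()
    for run, score in enumerate(run_scores)
]

pd.DataFrame(rows).to_csv("results.csv", index=False)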
2. Run the analysis
Point evalstats analyze at your file. It auto-detects the data type
(binary, continuous, Likert, etc.), recognizes the multi-run structure, and picks an
appropriate CI method automatically.
evalstats analyze results.csv
3. Read the output
The terminal report has three sections. Here's what you'd see for the CSV above (20 inputs × 3 prompts × 3 runs, continuous scores):
====================
PROMPTS COMPARISON
====================
3 prompts | 20 inputs | method=smooth_bootstrap | 99% confidence intervals (CI)
--- Mean Performance (marginal bootstrap CIs) ---
axis: [0.561, 1.009] (· ±1σ, ─ CI, ● mean, │ grand mean)
Prompt Interval Plot Mean CI Low CI High
prompt_a ···──────●─────···│ 0.680 0.616 0.744
prompt_b ····──────●│─────···· 0.776 0.707 0.846
prompt_c │ ···─────●────··· 0.906 0.852 0.957
--- Pairwise Comparisons (nested smooth bootstrap (n=10000, R=3) (simultaneous CIs computed with max-T)) ---
legend: (· ±1σ, ─ CI, ● mean, │ zero) axis: [-0.351, +0.351] effect: Left - Right
Left Right Interval Plot Mean CI Low CI High ES
prompt_c prompt_b ·│──────●──────··· +0.1297 +0.0235 +0.2359 0.790
prompt_c prompt_a │ ··─────●─────·· +0.2261 +0.1345 +0.3177 0.990
prompt_b prompt_a ··──│────●────────·· +0.0965 -0.0265 +0.2195 0.600
ES = Effect Size (r_rb) = rank biserial correlation (small≈0.1, medium≈0.3, large≈0.5)
Statistically indistinguishable rank bands (similar to critical difference diagrams) computed from 99% CI, max_t-adjusted:
#2–#3: [prompt_b ─ prompt_a]
-> Evidence suggests a clear best option: 'prompt_c'
--- Executive Summary (Prompt leaderboard) ---
Prompt Grp Mean CI Stability Verdict
──────────────────────────────────────────────────────────────────────────
prompt_c #1 0.906 [0.852, 0.957] Stable Likely best
prompt_b #2 0.776 [0.707, 0.846] Stable Significant drop-off
prompt_a #2 0.680 [0.616, 0.744] Stable Significant drop-off
──────────────────────────────────────────────────────────────────────────
Start at the Executive Summary at the bottom. Entities in the same rank
group (#1, #2, …) are statistically indistinguishable
after multiple-comparisons correction — even if one has a higher mean. Here,
prompt_c is the clear best (#1), while
prompt_b and prompt_a both land in #2,
meaning the gap between them is not yet statistically significant.
The Pairwise Comparisons section shows a visual CI plot and effect size
(ES) for each pair. When the interval plot spans the zero line (│), the
difference is not significant. The Statistically indistinguishable rank
bands line summarizes which adjacent groups cannot be distinguished.
Saving output
# Save a Markdown report and plot
evalstats analyze results.csv --out report.md plot.png
# Save structured JSON for programmatic use
evalstats analyze results.csv --out analysis.json
# Print just the executive leaderboard (fastest read)
evalstats analyze results.csv --brief
See evalstats analyze --help for the full option list.
Python API
All functions accept NumPy arrays or plain Python lists and return a result object
with a .summary() method that prints the same terminal report as the CLI.
Comparing prompts
Pass a dict mapping prompt name → per-input score array. Each array must be the same length and correspond to the same inputs (paired data).
import numpy as np
import evalstats as estats
scores = {
"prompt_a": np.array([0.72, 0.60, 0.88, 0.75, ...]),
"prompt_b": np.array([0.85, 0.78, 0.90, 0.82, ...]),
"prompt_c": np.array([0.91, 0.83, 0.87, 0.95, ...]),
}
result = estats.compare_prompts(scores)
result.summary() # full terminal report
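If your scores already live in the long CSV layout from step 1, you don't have to build this dict by hand. One way (a sketch assuming pandas, not part of the evalstats API) is to average over runs and pivot, so each prompt becomes a column aligned on the same inputs and the arrays stay paired:

import pandas as pd
import evalstats as estats

df = pd.read_csv("results.csv")

# Collapse runs (if any) to a per-input mean, then pivot to one column per prompt
wide = (
    df.groupby(["prompt", "input"], as_index=False)["score"].mean()
      .pivot(index="input", columns="prompt", values="score")
      .dropna()  # keep only inputs that every prompt was scored on
)

scores = {prompt: wide[prompt].to_numpy() for prompt in wide.columns}
result = estats.compare_prompts(scores)
result.summary()

Note that averaging over runs discards run-level variance; to keep it, load the file directly with estats.analyze (see below).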
Comparing models
Same interface as compare_prompts, keyed by model name instead.
scores = {
"gpt-4o-mini": np.array([0.82, 0.91, 0.78, ...]),
"gemma-3-4b-it": np.array([0.75, 0.84, 0.71, ...]),
"qwen3-8b": np.array([0.88, 0.93, 0.85, ...]),
}
result = estats.compare_models(scores)
result.summary()
Comparing models × prompts
Pass a nested dict — outer keys are models, inner keys are prompt templates.
evalstats reports statistics for every model–prompt combination,
letting you see whether a prompt improvement generalizes across models.
scores = {
"gpt-4o-mini": {
"5-shot": np.array([0.79, 0.85, 0.83, ...]),
"0-shot": np.array([0.61, 0.73, 0.69, ...]),
},
"gemma-3-4b-it": {
"5-shot": np.array([0.68, 0.77, 0.72, ...]),
"0-shot": np.array([0.60, 0.72, 0.65, ...]),
},
}
result = estats.compare_models(scores)
result.full_summary()
================================================
CROSS-MODEL RANKING (ALL MODEL/TEMPLATE PAIRS)
================================================
5-shot 0-shot
─────────────────────────────────
gpt-4o-mini 0.814 █* 0.656 ·
gemma-3-4b-it 0.704 ▒ 0.624 ·
─────────────────────────────────
* = best pair by mean | heat: · (low) → █ (high), range [0.624, 0.814]
--- Rank Probabilities: All 4 by P(Best) (smooth bootstrap, n=10000, ranked by mean) ---
Model Template P(Best) E[Rank]
gpt-4o-mini 5-shot 100.0% ██████████████ 1.00 █▆▃───────────
gpt-4o-mini 0-shot 0.0% ░░░░░░░░░░░░░░ 3.07 ───────▃▆█▆▃──
gemma-3-4b-it 5-shot 0.0% ░░░░░░░░░░░░░░ 2.04 ──▃▆█▆▃───────
gemma-3-4b-it 0-shot 0.0% ░░░░░░░░░░░░░░ 3.89 ───────────▃▆█
--- Mean Performance: All 4 (marginal CIs) ---
axis: [0.517, 0.921] (· ±1σ, ─ CI, ● mean, │ grand mean)
Model Template Interval Plot Mean CI Low CI High
gpt-4o-mini 5-shot │ ····────●─────···· 0.814 0.770 0.860
gpt-4o-mini 0-shot ·····─────●───│─····· 0.656 0.604 0.711
gemma-3-4b-it 5-shot ···────│●────··· 0.704 0.663 0.744
gemma-3-4b-it 0-shot ····─────●────··│· 0.624 0.579 0.670
--- Executive Summary (Cross-model pair leaderboard) ---
Model Template Grp Mean CI Verdict
─────────────────────────────────────────────────────────────────────
gpt-4o-mini 5-shot #1 0.814 [0.770, 0.860] Likely best
gemma-3-4b-it 5-shot #2 0.704 [0.663, 0.744] Significant drop-off
gpt-4o-mini 0-shot #2 0.656 [0.604, 0.711] Significant drop-off
gemma-3-4b-it 0-shot #3 0.624 [0.579, 0.670] Significant drop-off
─────────────────────────────────────────────────────────────────────
In the crossed leaderboard, gpt-4o-mini / 5-shot ranks first.
The 5-shot setting is stronger for both models, and the
gemma-3-4b-it / 5-shot and gpt-4o-mini / 0-shot
rows landing in the same group (#2) indicate those two
middle conditions are not cleanly separated at this confidence level.
Looking for p-values?
Pass p_values=True to print corrected p-values in the pairwise table.
You can optionally set pairwise_test="wilcoxon" to force Wilcoxon
signed-rank p-values (instead of auto-selection). With omnibus=True,
evalstats also prepends a Friedman omnibus test line before the
pairwise rows.
scores = {
"gpt-4o-mini": np.array([0.82, 0.91, 0.78, 0.86, 0.74, ...]),
"gemma-3-4b-it": np.array([0.75, 0.84, 0.71, 0.79, 0.68, ...]),
"qwen3-8b": np.array([0.88, 0.93, 0.85, 0.91, 0.83, ...]),
}
result = estats.compare_models(
scores,
omnibus=True,
p_values=True,
pairwise_test="wilcoxon",
)
result.summary()
===================
MODELS COMPARISON
===================
3 models | 40 inputs | method=smooth_bootstrap | 99% confidence intervals (CI)
--- Mean Performance (marginal bootstrap CIs) ---
axis: [0.683, 0.937] (· ±1σ, ─ CI, ● mean, │ grand mean)
Model Interval Plot Mean CI Low CI High
gpt-4o-mini ······───│─●────······ 0.827 0.798 0.857
gemma-3-4b-it ·····────●────·····│ 0.753 0.726 0.778
qwen3-8b ··│··────●─────····· 0.863 0.836 0.891
--- Pairwise Comparisons (smooth bootstrap (n=10000) (fdr_bh-corrected p-values) (simultaneous CIs computed with max-T)) ---
Friedman omnibus: χ²(2) = 28.950, p = 5.171e-07***
legend: (· ±1σ, ─ CI, ● mean, │ zero) axis: [-0.193, +0.193] effect: Left - Right
Left Right Interval Plot Mean CI Low CI High ES p (wsr)
qwen3-8b gpt-4o-mini ····─│───●────····· +0.0361 -0.0082 +0.0804 0.429 0.01716
qwen3-8b gemma-3-4b-it │ ····────●─────···· +0.1101 +0.0681 +0.1521 0.944 3.492e-09***
gpt-4o-mini gemma-3-4b-it ··│··─────●─────····· +0.0740 +0.0249 +0.1231 0.702 6.606e-05***
ES = Effect Size (r_rb) = rank biserial correlation (small≈0.1, medium≈0.3, large≈0.5)
p (wsr) = Wilcoxon signed-rank (fdr_bh-corrected)
stars: * p<0.01, ** p<0.001, *** p<0.0001
Statistically indistinguishable rank bands (similar to critical difference diagrams) computed from 99% CI, max_t-adjusted:
#1–#2: [qwen3-8b ─ gpt-4o-mini]
--- Executive Summary (Model leaderboard) ---
Model Grp Mean CI Verdict
────────────────────────────────────────────────────────
qwen3-8b #1 0.863 [0.836, 0.891] Tied with gpt-4o-mini as best
gpt-4o-mini #1 0.827 [0.798, 0.857] Tied with qwen3-8b as best
gemma-3-4b-it #2 0.753 [0.726, 0.778] Significant drop-off
────────────────────────────────────────────────────────
The Friedman omnibus test (χ²(2) = 28.95, p < 0.001) confirms there is overall
variation across models. In the pairwise comparisons, qwen3-8b vs
gpt-4o-mini has a CI that crosses zero ([-0.008, +0.080] — not
significant), so both land in rank group #1. The new
p (wsr) column reports corrected Wilcoxon p-values for each pair.
Only the gaps to gemma-3-4b-it are conclusive.
Loading from a CSV or Excel file
estats.analyze() mirrors the CLI — pass a file path and it handles
format detection automatically. Useful when you want to run the analysis from a
script or notebook.
import evalstats as estats
result = estats.analyze("results.csv")
result.summary()
# With options
result = estats.analyze(
"results.xlsx",
sheet="Eval Results",
ci=0.95,
method="lmm", # mixed-effects model
n_bootstrap=20000,
)
result.summary()
Common Options
These parameters are accepted by all API functions (compare_prompts,
compare_models, analyze) and their CLI equivalents.
| Parameter | Default | Notes |
|---|---|---|
| ci | 0.99 | Confidence level. Use 0.95 for the traditional default. |
| method | "auto" | Auto-selects based on data type. Options: "lmm", "bootstrap", "smooth_bootstrap", "bca", "bayes_bootstrap", "bayes_binary", "wilson", "newcombe", "permutation", "fisher_exact", "sign_test". |
| n_bootstrap | 10000 | Bootstrap resamples. Increase for smoother CI estimates. |
| simultaneous_ci | True | Family-wise CI correction (max-T). Set to False for marginal CIs. |
| statistic | "mean" | Central tendency. Also supports "median". |
| reference | "grand_mean" | Comparison baseline for the advantage plot. Pass a template label to compare against a specific prompt or model. |
| omnibus | False | Run a Friedman omnibus test before pairwise comparisons (k ≥ 3). |
| p_values | False | Show pairwise p-values in the report. CLI: --p-values. |
| pairwise_test | "auto" | Pairwise p-value test. Options: "auto", "bootstrap", "wilcoxon", "nemenyi". CLI: --pairwise-test TEST (setting this explicitly also enables p-values). |
| correction | "holm" | Multiple-comparison correction. Options: "holm", "bonferroni", "fdr_bh", "none". |
| failure_threshold | None | Flag inputs scoring below this value in the robustness table. |
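As an illustration, several of these options can be combined in a single call. The values below are placeholders, not recommendations, and scores is the same dict used in the earlier examples:

result = estats.compare_prompts(
    scores,
    ci=0.95,                  # traditional 95% intervals instead of the 99% default
    statistic="median",       # compare medians rather than means
    correction="bonferroni",  # stricter multiple-comparison correction
    p_values=True,            # show corrected p-values in the pairwise table
    failure_threshold=0.5,    # flag inputs scoring below 0.5 in the robustness table
)
result.summary()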
Full runnable examples are in the
examples/ folder on GitHub.