What This Investigation Covers

Most LLM eval pipelines run each test item once. But at temperature > 0, the same prompt produces different outputs — and different scores — on every call. This variance has two sources: the model’s stochastic generation and the judge’s stochastic scoring. Until you measure it, you can’t know whether your eval score is a stable estimate or a lucky draw. This investigation shows how to quantify run-to-run variance, separate its two sources, and decide how many runs per item your pipeline actually needs.

What you’ll learn

01

Quantifying run-to-run variance: How to score the same inputs multiple times and estimate the within-item standard deviation. When is the variance small enough to trust a single pass, and when does it swamp your signal?
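A minimal sketch of that first step, assuming per-run scores have already been collected into a mapping from item id to the list of scores that item received across repeated runs (the scores_by_item layout and the function name are illustrative, not part of any particular pipeline):

```python
import statistics

def pooled_within_item_sd(scores_by_item: dict[str, list[float]]) -> float:
    """Estimate the typical run-to-run spread of scores on a single item."""
    per_item_var = [
        statistics.variance(scores)   # sample variance across repeated runs of one item
        for scores in scores_by_item.values()
        if len(scores) >= 2           # need at least two runs to estimate spread
    ]
    # Pool by averaging the variances, then take the square root: the result is
    # the standard deviation to expect between two runs of the same item.
    return (sum(per_item_var) / len(per_item_var)) ** 0.5
```

Comparing this pooled standard deviation against the score differences you care about answers the single-pass question: if the gap you want to detect is several times larger than the pooled SD, one run per item is probably fine; if the two are comparable, it is not.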

02

Separating model variance from judge variance: How to design a replicated experiment — scoring fixed outputs multiple times, and generating multiple outputs per item — to isolate how much variability comes from the model vs. the scorer.
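One way to carry out that decomposition, sketched under the assumption that for each item you generate G outputs and have the judge score each output J times, with G and J both at least 2 (the nested-list layout and function name are illustrative). This is the standard one-way variance-components estimate:

```python
import statistics

def variance_components(scores: list[list[float]]) -> tuple[float, float]:
    """Split score variance for one item into model noise and judge noise.

    scores[g][j] is the j-th judge score of the g-th generated output for the
    same prompt: G generated outputs, each judged J times (G, J >= 2).
    """
    n_judge = len(scores[0])
    # Judge variance: how much the score moves when the output is held fixed.
    judge_var = statistics.mean(statistics.variance(s) for s in scores)
    # Spread of per-output mean scores across different generations.
    output_means = [statistics.mean(s) for s in scores]
    between_var = statistics.variance(output_means)
    # Model variance: between-output spread minus the part that judge noise
    # alone would induce in the per-output means (judge_var / J).
    model_var = max(between_var - judge_var / n_judge, 0.0)
    return model_var, judge_var
```

Averaging the two components over items tells you where to spend effort: if judge variance dominates, re-scoring fixed outputs or tightening the rubric buys more stability than regenerating; if model variance dominates, the opposite.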

03

Computing the minimum runs needed per item: The calculation for deciding how many samples per item are required to keep your aggregate score reliable to within a target margin, given your measured within-item variance.
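A back-of-the-envelope version of that calculation, budgeting only for run-to-run noise on an equal-weighted mean over items (the function and its parameter names are illustrative):

```python
import math

def min_runs_per_item(within_item_sd: float, n_items: int,
                      margin: float, z: float = 1.96) -> int:
    """Smallest runs-per-item count such that run-to-run noise alone keeps the
    aggregate (mean-over-items) score within `margin` of its expected value,
    at roughly 95% confidence for the default z = 1.96.
    """
    # Noise contribution to the aggregate's standard error is sd / sqrt(n_items * k);
    # solve z * sd / sqrt(n_items * k) <= margin for k.
    k = (z * within_item_sd / margin) ** 2 / n_items
    return max(1, math.ceil(k))
```

For instance, a within-item SD of 0.3 on a 0-to-1 score, 200 items, and a ±0.01 target margin gives 18 runs per item; relaxing the margin to ±0.02 drops that to 5. Note this budgets only the within-item component: item-to-item variation is a property of the dataset and is not reduced by re-running the same items.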

04

When consistency is itself a quality signal: How to report response consistency as a standalone metric. A model that scores 80% with low variance may be more deployable than one that scores 85% with high variance on the same items.
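A sketch of what such a report might include, again assuming per-run scores keyed by item id (the names are illustrative): the headline mean score paired with two views of consistency.

```python
import statistics

def consistency_report(scores_by_item: dict[str, list[float]]) -> dict[str, float]:
    """Headline score plus consistency metrics, so 'how often is it right' and
    'how stably is it right' can be compared across models on the same items.
    """
    item_means = [statistics.mean(s) for s in scores_by_item.values()]
    item_sds = [statistics.stdev(s) for s in scores_by_item.values() if len(s) >= 2]
    return {
        "mean_score": statistics.mean(item_means),
        # Typical run-to-run spread on a single item; lower means more consistent.
        "mean_within_item_sd": statistics.mean(item_sds),
        # Fraction of items whose repeated runs all produced the identical score.
        "fully_consistent_items": sum(
            1 for s in scores_by_item.values() if max(s) == min(s)
        ) / len(scores_by_item),
    }
```

Reporting the pair side by side is what makes the 80%-but-stable versus 85%-but-noisy comparison explicit, rather than hiding it inside a single averaged number.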

Looking for quick usage examples? Check out the Example Usage page.

Investigation in progress

Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.