What This Investigation Covers
Most LLM eval pipelines run each test item once. But at temperature > 0, the same prompt produces different outputs — and different scores — on every call. This variance has two sources: the model’s stochastic generation and the judge’s stochastic scoring. Until you measure it, you can’t know whether your eval score is a stable estimate or a lucky draw. This investigation shows how to quantify run-to-run variance, separate its two sources, and decide how many runs per item your pipeline actually needs.
What you’ll learn
Quantifying run-to-run variance: How to score the same inputs multiple times and estimate within-item standard deviation. When is variance small enough to trust a single pass, and when does it swamp your signal?
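As a minimal sketch of this estimate: given several scores per item from repeated runs, pool the per-item sample variances and take the square root. The item names and score values below are hypothetical placeholders, not data from the investigation.

```python
import statistics

# Hypothetical scores: 3 items, each scored on 5 independent runs.
# In a real pipeline these would come from repeated eval passes.
scores = {
    "item_a": [0.80, 0.85, 0.78, 0.82, 0.80],
    "item_b": [0.60, 0.75, 0.55, 0.70, 0.65],
    "item_c": [0.90, 0.92, 0.91, 0.89, 0.93],
}

def within_item_std(runs_by_item):
    """Pooled within-item standard deviation across repeated runs."""
    # Average the per-item sample variances, then take the square root.
    variances = [statistics.variance(runs) for runs in runs_by_item.values()]
    return (sum(variances) / len(variances)) ** 0.5

print(f"pooled within-item std: {within_item_std(scores):.3f}")
```

Comparing this pooled standard deviation to the score differences you care about is one way to judge whether a single pass is trustworthy.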
Separating model variance from judge variance: How to design a replicated experiment — scoring fixed outputs multiple times, and generating multiple outputs per item — to isolate how much variability comes from the model vs. the judge.
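One common replicated design (an assumption here, not necessarily the investigation's exact protocol) is to rescore a single fixed output several times to measure judge-only variance, and score one fresh generation per run to measure total variance; the model's share is then the remainder. All score values below are hypothetical.

```python
import statistics

# Hypothetical replicated design for one item:
# - judge_scores: the SAME fixed output scored 5 times -> judge-only variance
# - full_scores:  a fresh generation each run, scored once -> model + judge variance
judge_scores = [0.78, 0.82, 0.80, 0.79, 0.81]  # fixed output, rescored
full_scores = [0.60, 0.85, 0.70, 0.90, 0.65]   # fresh generation each run

judge_var = statistics.variance(judge_scores)
total_var = statistics.variance(full_scores)
# Treating the two noise sources as independent, the generation-side
# variance is estimated as the remainder (floored at zero).
model_var = max(total_var - judge_var, 0.0)

print(f"judge variance: {judge_var:.5f}")
print(f"model variance (est.): {model_var:.5f}")
```

In this toy data the judge contributes only a small fraction of the total variance, so most of the run-to-run noise would be attributed to generation.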
Computing the minimum runs needed per item: The calculation for deciding how many samples per item are required to keep your aggregate score reliable to within a target margin, given your measured within-item variance.
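One standard form of this calculation (a sketch under a normal approximation, assuming the investigation uses the familiar sample-size formula) solves for the number of runs n such that the confidence-interval half-width z·σ/√n stays within the target margin:

```python
import math

def runs_needed(within_item_std, margin, z=1.96):
    """Runs per item so the 95% CI half-width on the per-item mean
    stays within `margin`, under a normal approximation:
    n >= (z * sigma / margin)**2, rounded up."""
    return math.ceil((z * within_item_std / margin) ** 2)

# E.g. a measured within-item std of 0.05 and a target margin of ±0.02:
print(runs_needed(0.05, 0.02))  # -> 25 runs per item
```

Note how the cost grows quadratically: halving the target margin quadruples the required runs.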
When consistency is itself a quality signal: How to report response consistency as a standalone metric: a model that scores 80% with low variance may be more deployable than one that scores 85% with high variance on the same items.
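Reporting this is straightforward: alongside each model's mean score, report the standard deviation across runs. The two models and their per-run scores below are hypothetical, chosen to mirror the 80%-vs-85% example above.

```python
import statistics

# Hypothetical per-run aggregate scores for two models on the same items.
model_a = [0.79, 0.80, 0.81, 0.80, 0.80]  # ~80% mean, low variance
model_b = [0.75, 0.95, 0.80, 0.90, 0.85]  # ~85% mean, high variance

for name, runs in [("A", model_a), ("B", model_b)]:
    mean = statistics.mean(runs)
    spread = statistics.stdev(runs)
    print(f"model {name}: {mean:.2f} +/- {spread:.3f}")
```

Here model B's higher mean comes with roughly ten times the run-to-run spread, which is exactly the trade-off the consistency metric surfaces.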
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.