What This Investigation Covers
LLM-as-judge pipelines are trusted at scale, but the key question is rarely asked rigorously: how well does the judge actually agree with humans? Simple accuracy against a held-out set understates the problem. Agreement metrics must account for scale type (binary, ordinal, continuous), chance-level agreement, and the difference between systematic bias and random disagreement. This investigation covers how to design a human annotation study for judge validation, compute the right agreement statistics with confidence intervals, and interpret the results in terms of when automated scoring is — and isn’t — an adequate proxy.
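As a concrete example of the statistics-with-intervals part, here is a minimal sketch (the labels, function name, and sample size are hypothetical; it assumes NumPy and scikit-learn are available) of Cohen's κ with a percentile bootstrap confidence interval. A point estimate alone can look far more reassuring than a small validation set actually supports.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

def kappa_with_ci(judge, human, n_boot=2000, ci=0.95, seed=0):
    """Cohen's kappa plus a percentile bootstrap confidence interval over items."""
    judge, human = np.asarray(judge), np.asarray(human)
    rng = np.random.default_rng(seed)
    point = cohen_kappa_score(judge, human)
    n = len(judge)
    boots = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)          # resample items with replacement
        boots.append(cohen_kappa_score(judge[idx], human[idx]))
    # Degenerate resamples (a single label for both raters) yield nan; ignore them.
    lo, hi = np.nanpercentile(boots, [(1 - ci) / 2 * 100, (1 + ci) / 2 * 100])
    return point, (lo, hi)

# Hypothetical binary pass/fail labels from the judge and one human annotator.
judge = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]
human = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0]
kappa, (lo, hi) = kappa_with_ci(judge, human)
print(f"kappa = {kappa:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The same resampling loop works for weighted κ or any other item-level agreement statistic; only the scoring call changes.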
What you’ll learn
Choosing the right agreement metric: Which of Cohen’s κ, weighted κ, Krippendorff’s α, and the intraclass correlation coefficient (ICC) to use, based on your scale type (binary, ordinal, continuous) and whether you have two raters or many; a side-by-side sketch follows this list.
Structuring your annotation sample: How many items to annotate, how to stratify across score levels to avoid ceiling and floor effects (see the sampling sketch after this list), and how to resolve disagreements between human annotators before treating their labels as a gold standard.
The minimum agreement threshold: What κ or ICC value is high enough for automated judge scores to substitute for human labels, and how to set your own threshold based on the consequences of a wrong call in your deployment context.
Diagnosing disagreement patterns: When the judge and humans diverge, is the divergence systematic (a consistent directional bias you can correct for) or idiosyncratic (random noise that requires more human labels to average out)? A bias-versus-noise sketch follows this list.
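To make the metric choices in the first item concrete, the sketch below computes unweighted κ, quadratic-weighted κ, ordinal Krippendorff’s α, and a two-way ICC on the same invented 1–5 ratings. It assumes scikit-learn, pandas, pingouin, and the third-party krippendorff package are installed; all scores and variable names are illustrative.

```python
import numpy as np
import pandas as pd
import krippendorff                      # pip install krippendorff
import pingouin as pg                    # pip install pingouin
from sklearn.metrics import cohen_kappa_score

# Hypothetical 1-5 ordinal scores from an LLM judge and one human rater.
judge = np.array([5, 4, 4, 2, 3, 5, 1, 4, 3, 2, 5, 4])
human = np.array([4, 4, 3, 2, 3, 5, 2, 4, 4, 2, 5, 3])

# Two raters, scale treated as nominal: unweighted Cohen's kappa.
print("kappa          :", cohen_kappa_score(judge, human))

# Ordinal scale: quadratic weights penalise large disagreements more.
print("weighted kappa :", cohen_kappa_score(judge, human, weights="quadratic"))

# Krippendorff's alpha generalises to many raters and missing labels;
# reliability_data is raters x items, with np.nan for unlabeled cells.
print("alpha (ordinal):", krippendorff.alpha(
    reliability_data=np.vstack([judge, human]),
    level_of_measurement="ordinal"))

# Continuous or near-continuous scores: intraclass correlation (long format).
long = pd.DataFrame({
    "item":  np.tile(np.arange(len(judge)), 2),
    "rater": ["judge"] * len(judge) + ["human"] * len(human),
    "score": np.concatenate([judge, human]),
})
icc = pg.intraclass_corr(data=long, targets="item", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])
```

Quadratic weights and the ordinal α both give partial credit for near-misses, which matters on graded scales where a 4-versus-5 disagreement is not the same failure as a 1-versus-5 disagreement.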
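For the sampling item, one workable approach is to stratify on the judge’s own scores before routing items to human annotators, so that rare score levels still appear in the validation set. A rough sketch, with hypothetical column names and counts:

```python
import pandas as pd

def stratified_annotation_sample(df, score_col="judge_score", n_per_level=30, seed=0):
    """Draw up to n_per_level items from each judge score level.

    Stratifying on the judge's score keeps rare levels (very low or very high
    scores) represented, which protects the agreement estimate from ceiling
    and floor effects where almost every sampled item shares one label.
    """
    return (
        df.groupby(score_col, group_keys=False)
          .apply(lambda g: g.sample(n=min(n_per_level, len(g)), random_state=seed))
    )

# Hypothetical usage: 5,000 judge-scored items on a 1-5 scale.
# sample = stratified_annotation_sample(scored_items, n_per_level=40)
# sample.to_csv("for_human_annotation.csv", index=False)
```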
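For the last item, a quick diagnostic is to look at the signed judge-minus-human differences: a mean far from zero points to a correctable directional bias, while a mean near zero with a wide spread points to noise. A minimal sketch on invented scores:

```python
import numpy as np
from scipy import stats

judge = np.array([4, 5, 3, 4, 5, 2, 4, 5, 3, 4])   # hypothetical 1-5 scores
human = np.array([3, 4, 3, 3, 4, 2, 3, 4, 3, 3])

diff = judge - human                      # signed, judge minus human
bias = diff.mean()                        # systematic component
spread = diff.std(ddof=1)                 # idiosyncratic component

# Paired t-test: is the mean difference distinguishable from zero?
t, p = stats.ttest_rel(judge, human)
print(f"mean bias = {bias:+.2f} points (sd {spread:.2f}), t = {t:.2f}, p = {p:.3f}")

# Rule of thumb for this sketch: a clearly nonzero mean suggests a directional
# bias you can calibrate away (e.g. subtract the offset); a near-zero mean with
# a large spread suggests noise, where the remedy is more human labels or a
# better judge rather than a correction term.
```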
Looking for quick usage examples? Check out the Example Usage page.
Full worked examples, interactive code, and simulation-backed results are coming soon. Follow on GitHub for updates.