LLM judges hide instability: 33-67% of documents break consistency
Original: Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
LLM-as-judge has become a default fixture in model evals, summarization scoring, and agent benchmarks. A new April 16 arXiv paper, Diagnosing LLM Judge Reliability, goes after a weakness that is easy to miss: aggregate scores can look stable while individual inputs receive internally inconsistent judgments.
The authors apply two diagnostics to SummEval. The first is transitivity analysis. At the aggregate level, violation rates look small, between 0.8% and 4.1%. But at the document level, 33-67% of documents contain at least one directed 3-cycle. In plain terms, a judge can prefer A over B, B over C, and C over A on the same input family. That is not a small bookkeeping issue if the score drives leaderboard claims, model selection, or production routing.
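The document-level check above amounts to scanning a judge's pairwise preferences for directed 3-cycles. A minimal sketch (this helper and its data layout are illustrative assumptions, not the paper's released code):

```python
from itertools import permutations

def count_3cycles(prefs):
    """Count directed 3-cycles (A > B, B > C, C > A) among one judge's
    pairwise preferences on a single document. `prefs` is a set of
    (winner, loser) tuples. Hypothetical helper for illustration."""
    items = {x for pair in prefs for x in pair}
    cycles = 0
    for a, b, c in permutations(items, 3):
        if (a, b) in prefs and (b, c) in prefs and (c, a) in prefs:
            cycles += 1
    # each directed 3-cycle is found once per rotation, so divide by 3
    return cycles // 3

# Judge prefers A over B and B over C, yet C over A: one violation
prefs = {("A", "B"), ("B", "C"), ("C", "A")}
print(count_3cycles(prefs))  # → 1
```

A document "contains a violation" in the paper's sense whenever this count is nonzero, even if the judge's aggregate win rates look perfectly ordered.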
The second diagnostic uses split conformal prediction sets over 1-5 Likert scores. The method provides theoretically guaranteed >=1-alpha coverage, while the width of the prediction set becomes a per-instance reliability signal. The paper reports that pooled prediction-set width correlates with absolute error (Spearman r_s = +0.576 over N = 1,918 examples, p < 10^-100). Wider sets are not just a mathematical artifact: they identify inputs where the judge is more likely to be wrong.
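The mechanics of split conformal over a 1-5 Likert scale can be sketched as follows. This uses a simple absolute-residual nonconformity score on a held-out calibration set; the paper's exact score function may differ, so treat this as a generic illustration of the method, not the authors' implementation:

```python
import math

def split_conformal_sets(cal_preds, cal_labels, test_preds, alpha=0.1):
    """Split conformal prediction over 1-5 Likert labels.
    Nonconformity = |label - judge score| on a calibration split;
    the (1 - alpha) conformal quantile of those scores then defines
    a per-instance prediction set, whose width is the reliability
    signal. Simplified sketch under assumed inputs."""
    n = len(cal_labels)
    residuals = sorted(abs(y - p) for y, p in zip(cal_labels, cal_preds))
    # quantile index giving the finite-sample >= 1 - alpha coverage guarantee
    k = math.ceil((n + 1) * (1 - alpha)) - 1
    q = residuals[min(k, n - 1)]
    return [[y for y in range(1, 6) if abs(y - p) <= q] for p in test_preds]
```

If the judge tracks human labels tightly on the calibration split, q stays small and test-time sets are narrow; a judge that is systematically off by a point or more inflates q, and the sets widen toward the full 1-5 range, which is exactly the regime the paper reports for fluency and consistency.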
The criterion-level result is just as useful for practitioners. Across four judges and four criteria, the type of criterion matters more than the judge identity. Relevance is relatively reliable, with an average set size around 3.0. Coherence is weaker at about 3.9. Fluency and consistency are much less reliable, with average set sizes around 4.9, nearly the full 1-5 range. A team using the same LLM judge for every quality dimension may therefore be mixing strong and weak signals without realizing it.
The paper is not an argument to stop automated evaluation. It is an argument to stop treating single LLM-judge scores as clean measurements. The authors say they release code, prompts, and cached results, which makes the diagnostic reproducible. For future leaderboards and internal eval systems, the practical takeaway is direct: publish uncertainty and inconsistency checks next to the score, or risk optimizing against a judge that is quietly unstable on the very examples that matter.
Related Articles
HN did not just ask whether Claude Opus 4.7 scores higher; it asked whether the product behavior is stable enough to build around. The thread quickly moved into adaptive thinking, tokenizer costs, safety filters, and bruised trust after recent Claude complaints.
A Show HN repo claims that duplicating a few LLM layers can improve reasoning without training or weight changes. The underlying README, however, shows real tradeoffs, making this more convincing as capability steering than as a universal model upgrade.
Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.