LLM judges hide instability: 33-67% of documents break consistency

Original: Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations

Apr 17, 2026 · By Insights AI · 2 min read

LLM-as-judge has become a default fixture in model evals, summarization scoring, and agent benchmarks. A new April 16 arXiv paper, Diagnosing LLM Judge Reliability, goes after a weakness that is easy to miss: aggregate scores can look stable while individual inputs receive internally inconsistent judgments.

The authors apply two diagnostics to SummEval. The first is transitivity analysis. At the aggregate level, violation rates look small, between 0.8% and 4.1%. But at the document level, 33-67% of documents contain at least one directed 3-cycle. In plain terms, a judge can prefer A over B, B over C, and C over A on the same input family. That is not a small bookkeeping issue if the score drives leaderboard claims, model selection, or production routing.
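Document-level cycle counting is easy to sketch. The snippet below is a minimal illustration of the diagnostic idea, not the paper's released code, and the dictionary-of-pairwise-preferences format is an assumption for the example:

```python
from itertools import combinations

def count_3cycles(prefs):
    """Count directed 3-cycles (A>B, B>C, C>A) among pairwise judge
    preferences. `prefs` maps an ordered pair (x, y) to True when the
    judge prefers x over y. Hypothetical data format for illustration."""
    items = sorted({x for pair in prefs for x in pair})
    cycles = 0
    for a, b, c in combinations(items, 3):
        # Check both cyclic orientations of the triangle.
        if prefs.get((a, b)) and prefs.get((b, c)) and prefs.get((c, a)):
            cycles += 1
        if prefs.get((b, a)) and prefs.get((c, b)) and prefs.get((a, c)):
            cycles += 1
    return cycles

# An intransitive judge: prefers A over B, B over C, yet C over A.
judgments = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
count_3cycles(judgments)  # one directed 3-cycle
```

A document "breaks consistency" in the paper's sense as soon as this count is nonzero for any candidate triple judged on that document.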

The second diagnostic uses split conformal prediction sets over 1-5 Likert scores. The method carries a finite-sample guarantee of at least 1 - alpha coverage, while the width of each prediction set doubles as a per-instance reliability signal. The paper reports that pooled prediction-set width correlates with absolute error (Spearman r_s = 0.576 over N = 1,918 examples, p < 10^-100). Wider sets are not just a mathematical artifact: they flag the inputs where the judge is most likely to be wrong.
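A minimal sketch of split conformal over the five discrete labels, assuming the judge exposes a per-label probability vector and using 1 minus the probability of the true label as the nonconformity score (the paper's exact scoring function may differ):

```python
import numpy as np

def conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.1):
    """Split conformal prediction sets over 1-5 Likert labels.

    cal_probs / test_probs: (n, 5) arrays of judge probabilities per label.
    cal_labels: true 1-5 scores for the held-out calibration split.
    Illustrative sketch, not the paper's released code.
    """
    n = len(cal_labels)
    # Nonconformity: one minus the probability assigned to the true label.
    nonconf = 1.0 - cal_probs[np.arange(n), cal_labels - 1]
    # Conformal quantile with the standard finite-sample correction.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(nonconf, level, method="higher")
    # Include every label whose nonconformity falls below the threshold.
    sets = [(np.where(1.0 - row <= q)[0] + 1).tolist() for row in test_probs]
    widths = [len(s) for s in sets]  # per-instance reliability signal
    return sets, widths
```

A confident, well-calibrated judge yields narrow sets; a judge that spreads probability across labels is forced into wide sets, which is exactly the signal the width-versus-error correlation exploits.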

The criterion-level result is just as useful for practitioners. Across four judges and four criteria, the type of criterion matters more than the judge identity. Relevance is relatively reliable, with an average set size around 3.0. Coherence is weaker at about 3.9. Fluency and consistency are much less reliable, with average set sizes around 4.9, nearly the full 1-5 range. A team using the same LLM judge for every quality dimension may therefore be mixing strong and weak signals without realizing it.
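The comparison behind that finding is a simple group-by: pool per-example set widths by criterion and by judge, then compare the spread of the group means. The data below is hypothetical, with numbers loosely echoing the paper's criterion-level averages:

```python
import statistics

# Hypothetical per-example prediction-set widths keyed by (judge, criterion).
widths = {
    ("judge_a", "relevance"): [3, 3, 3],
    ("judge_b", "relevance"): [3, 3, 3],
    ("judge_a", "fluency"): [5, 5, 5],
    ("judge_b", "fluency"): [5, 4, 5],
}

def mean_width_by(key_index):
    """Average set width grouped by judge (0) or criterion (1)."""
    groups = {}
    for key, ws in widths.items():
        groups.setdefault(key[key_index], []).extend(ws)
    return {k: statistics.mean(v) for k, v in groups.items()}

# Criterion-level means vary far more than judge-level means,
# mirroring the paper's "criterion matters more than judge" result.
by_criterion = mean_width_by(1)
by_judge = mean_width_by(0)
```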

The paper is not an argument to stop automated evaluation. It is an argument to stop treating single LLM-judge scores as clean measurements. The authors say they release code, prompts, and cached results, which makes the diagnostic reproducible. For future leaderboards and internal eval systems, the practical takeaway is direct: publish uncertainty and inconsistency checks next to the score, or risk optimizing against a judge that is quietly unstable on the very examples that matter.

