LLM judges hide instability: 33-67% of documents break consistency
Original: Diagnosing LLM Judge Reliability: Conformal Prediction Sets and Transitivity Violations
LLM-as-judge has become a default fixture in model evals, summarization scoring, and agent benchmarks. A new April 16 arXiv paper, Diagnosing LLM Judge Reliability, goes after a weakness that is easy to miss: aggregate scores can look stable while individual inputs receive internally inconsistent judgments.
The authors apply two diagnostics to SummEval. The first is transitivity analysis. At the aggregate level, violation rates look small, between 0.8% and 4.1%. But at the document level, 33-67% of documents contain at least one directed 3-cycle. In plain terms, a judge can prefer A over B, B over C, and C over A on the same input family. That is not a small bookkeeping issue if the score drives leaderboard claims, model selection, or production routing.
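The document-level check above amounts to scanning a judge's pairwise preferences for directed 3-cycles. A minimal sketch (this helper and its data layout are illustrative assumptions, not the paper's released code):

```python
from itertools import permutations

def count_3cycles(prefs):
    """Count directed 3-cycles (A > B, B > C, C > A) among one judge's
    pairwise preferences on a single document. `prefs` is a set of
    (winner, loser) tuples. Hypothetical helper for illustration."""
    items = {x for pair in prefs for x in pair}
    cycles = 0
    for a, b, c in permutations(items, 3):
        if (a, b) in prefs and (b, c) in prefs and (c, a) in prefs:
            cycles += 1
    # each directed 3-cycle is found once per rotation, so divide by 3
    return cycles // 3

# Judge prefers A over B and B over C, yet C over A: one violation
prefs = {("A", "B"), ("B", "C"), ("C", "A")}
print(count_3cycles(prefs))  # → 1
```

A document "contains a violation" in the paper's sense whenever this count is nonzero, even if the judge's aggregate win rates look perfectly ordered.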
The second diagnostic uses split conformal prediction sets over 1-5 Likert scores. The method provides theoretically guaranteed >=1-alpha coverage, while the width of the prediction set becomes a per-instance reliability signal. The paper reports that pooled prediction-set width correlates with absolute error (Spearman r_s = +0.576 over N = 1,918 examples, p < 10^-100). Wider sets are not just a mathematical artifact: they identify inputs where the judge is more likely to be wrong.
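The mechanics of split conformal over a 1-5 Likert scale can be sketched as follows. This uses a simple absolute-residual nonconformity score on a held-out calibration set; the paper's exact score function may differ, so treat this as a generic illustration of the method, not the authors' implementation:

```python
import math

def split_conformal_sets(cal_preds, cal_labels, test_preds, alpha=0.1):
    """Split conformal prediction over 1-5 Likert labels.
    Nonconformity = |label - judge score| on a calibration split;
    the (1 - alpha) conformal quantile of those scores then defines
    a per-instance prediction set, whose width is the reliability
    signal. Simplified sketch under assumed inputs."""
    n = len(cal_labels)
    residuals = sorted(abs(y - p) for y, p in zip(cal_labels, cal_preds))
    # quantile index giving the finite-sample >= 1 - alpha coverage guarantee
    k = math.ceil((n + 1) * (1 - alpha)) - 1
    q = residuals[min(k, n - 1)]
    return [[y for y in range(1, 6) if abs(y - p) <= q] for p in test_preds]
```

If the judge tracks human labels tightly on the calibration split, q stays small and test-time sets are narrow; a judge that is systematically off by a point or more inflates q, and the sets widen toward the full 1-5 range, which is exactly the regime the paper reports for fluency and consistency.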
The criterion-level result is just as useful for practitioners. Across four judges and four criteria, the type of criterion matters more than the judge identity. Relevance is relatively reliable, with an average set size around 3.0. Coherence is weaker at about 3.9. Fluency and consistency are much less reliable, with average set sizes around 4.9, nearly the full 1-5 range. A team using the same LLM judge for every quality dimension may therefore be mixing strong and weak signals without realizing it.
The paper is not an argument to stop automated evaluation. It is an argument to stop treating single LLM-judge scores as clean measurements. The authors say they release code, prompts, and cached results, which makes the diagnostic reproducible. For future leaderboards and internal eval systems, the practical takeaway is direct: publish uncertainty and inconsistency checks next to the score, or risk optimizing against a judge that is quietly unstable on the very examples that matter.
Related Articles
HN did not just ask whether Claude Opus 4.7 scores higher; it asked whether the product behavior is stable enough to build around. The thread quickly moved into adaptive thinking, tokenizer costs, safety filters, and bruised trust after recent Claude complaints.
A Show HN repo claims that duplicating a few LLM layers can improve reasoning without training or weight changes. The underlying README, however, shows real tradeoffs, making this more convincing as capability steering than as a universal model upgrade.
Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.