r/artificial spotlights BullshitBench v2 as Claude leads the nonsense-detection board
Original: Claude is the least bullshit-y AI
What r/artificial surfaced
A March 29, 2026 r/artificial link post pushed BullshitBench v2 back into view. The benchmark is designed to test whether models reject nonsense instead of confidently building on a broken premise, a narrower and more practical framing than generic talk of hallucinations. The README says responses are grouped into three buckets: clear pushback, partial challenge, and accepted nonsense.
According to the README, BullshitBench v2 uses 100 nonsense prompts across 5 domains: software, finance, legal, medical, and physics. The public v2 leaderboard also says the scoring pipeline uses a 3-judge panel with mean aggregation, specifically anthropic/claude-sonnet-4.6, openai/gpt-5.2, and google/gemini-3.1-pro-preview, and the published board currently contains 80 model or reasoning rows.
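For intuition, here is a minimal Python sketch of what a 3-judge panel with mean aggregation could look like. The label names and the 2/1/0 label-to-score mapping are assumptions for illustration; the README excerpt does not spell out the scoring weights, and this is not the repo's actual code.

```python
# Hypothetical sketch of BullshitBench v2's judge aggregation.
# ASSUMPTION: labels map to 2 (clear pushback), 1 (partial challenge),
# 0 (accepted nonsense); the repo may use a different scheme.
from statistics import mean

LABEL_SCORE = {
    "clear_pushback": 2,
    "partial_challenge": 1,
    "accepted_nonsense": 0,
}

# The three judges named on the public v2 leaderboard.
JUDGES = [
    "anthropic/claude-sonnet-4.6",
    "openai/gpt-5.2",
    "google/gemini-3.1-pro-preview",
]

def score_response(labels_by_judge: dict[str, str]) -> float:
    """Mean-aggregate one response's labels across the judge panel."""
    return mean(LABEL_SCORE[labels_by_judge[j]] for j in JUDGES)

# Example: two judges see clear pushback, one sees a partial challenge.
print(score_response({
    "anthropic/claude-sonnet-4.6": "clear_pushback",
    "openai/gpt-5.2": "clear_pushback",
    "google/gemini-3.1-pro-preview": "partial_challenge",
}))  # 1.666...
```

A model's overall score would then be the average of these per-response scores across the 100 prompts.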
What the latest leaderboard shows
The Reddit headline is accurate in a narrow sense. The published leaderboard.csv puts anthropic/claude-sonnet-4.6@reasoning=high at rank 1 with an avg_score of 1.87, a green_rate of 0.91, and a red_rate of 0.03. In other words, 91 of the 100 prompts were scored as clear pushback and 3 as accepted nonsense. Several of the next rows are also Anthropic models. By contrast, openai/gpt-5.4@reasoning=none appears at rank 17 with a green_rate of 0.48 and a red_rate of 0.16, while qwen/qwen3.5-397b-a17b@reasoning=high sits at rank 6 with a green_rate of 0.78 and a red_rate of 0.05.
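Under the assumed 2/1/0 mapping from the sketch above, the rank-1 numbers roughly cohere: with 0.91 green and 0.03 red, the remaining 0.06 partial gives 2 × 0.91 + 1 × 0.06 = 1.88, within rounding of the reported 1.87. To check the figures yourself, a short script like the one below can filter the published leaderboard.csv; the column names here are inferred from the numbers quoted above and may not match the file's actual header.

```python
# Minimal sketch for inspecting BullshitBench v2's leaderboard.csv.
# ASSUMPTION: columns are named model, avg_score, green_rate, red_rate;
# check the real header before running.
import csv

with open("leaderboard.csv", newline="") as f:
    rows = list(csv.DictReader(f))

# Sort by avg_score descending and show only models that pushed back
# clearly on at least 75% of the nonsense prompts.
rows.sort(key=lambda r: float(r["avg_score"]), reverse=True)
for r in rows:
    if float(r["green_rate"]) >= 0.75:
        print(r["model"], r["avg_score"], r["green_rate"], r["red_rate"])
```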
Why the result needs caveats
This is still a community benchmark, not a neutral industry standard. The repo was updated on March 12, 2026, the question set is curated by the project itself, and one of the three judges is an Anthropic model, so the leaderboard should be read as a useful signal rather than a final verdict. Even so, the benchmark is interesting because it makes the failure mode concrete. The practical question is not only which model knows more facts, but which one notices that a premise is broken and stops the user before continuing.
That is why the chart resonated on r/artificial. In software, medical, legal, and finance workflows, the useful model is not just the one with more answers. It is the one that can say clearly, and early, that the question itself does not make sense.