Opus 4.7’s Reddit benchmark fight was really about refusals versus regression
Original: "opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%." View original →
Opus 4.7 became a Reddit argument because the headline number was too sharp to ignore. A 2026-04-17 r/singularity post pointed to the NYT Connections extended benchmark, where Opus 4.7 (high) was reported at 41.0% while Opus 4.6 scored 94.7%. The thread climbed to roughly 1,000 upvotes and more than 150 comments because users were not only asking whether the model was worse; they were asking what "worse" means now.
The linked benchmark project evaluates LLMs on NYT Connections puzzles extended with extra trick words. That makes it a compact test of grouping, abstraction, and trap avoidance. One high-scoring Reddit comment added that Opus 4.7 without reasoning landed at 15.3%, last among 62 models, which gave the thread its initial shock value.
Then the caveat arrived, and it changed the conversation. Commenters noted that much of the gap may come from refusals rather than wrong answers. One relayed an update from the benchmark creator: on the puzzles Opus 4.7 did not refuse, it scored 90.9%. That is still below Opus 4.6, but it turns the story from a simple collapse into a harder question about safety behavior, evaluation rules, and model routing.
That distinction matters. A model that fails because it cannot solve a puzzle is different from one that refuses to engage, and both are different from a provider silently steering workloads through cheaper or differently tuned paths. Reddit users brought all three theories into the thread. Some accused Anthropic of shipping a cost-saving model. Others said coding use still felt strong while math, teaching, and reasoning-heavy workflows felt worse.
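To make the scoring question concrete, here is a minimal sketch (the per-puzzle record format and the counts are hypothetical, not the benchmark's actual data) of how counting refusals as failures versus scoring only attempted puzzles can pull the same model from roughly 90% down to roughly 40%.

```python
# Minimal sketch: how refusal handling changes a headline benchmark number.
# The record format and counts below are illustrative assumptions only.

def summarize(results):
    """results: list of dicts with 'refused' (bool) and 'correct' (bool)."""
    total = len(results)
    attempted = [r for r in results if not r["refused"]]
    solved = sum(r["correct"] for r in attempted)

    # Policy A: refusals count as failures (the reading behind a 41.0%-style headline).
    strict = solved / total if total else 0.0
    # Policy B: score only the puzzles the model agreed to attempt
    # (the reading behind the relayed ~90% figure).
    attempted_only = solved / len(attempted) if attempted else 0.0

    return {
        "strict": strict,
        "attempted_only": attempted_only,
        "refusal_rate": 1 - len(attempted) / total if total else 0.0,
    }

# Hypothetical counts: 100 puzzles, 55 refusals, 41 solved, 4 wrong.
demo = (
    [{"refused": True, "correct": False}] * 55
    + [{"refused": False, "correct": True}] * 41
    + [{"refused": False, "correct": False}] * 4
)
print(summarize(demo))  # strict = 0.41, attempted_only ≈ 0.91
```

The point of the sketch is not to reproduce the benchmark's numbers but to show that the gap between the two policies is driven almost entirely by the refusal rate, which is exactly the variable the thread was arguing about.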
The useful signal is not that one benchmark definitively ranks Opus 4.7. It is that modern benchmark scores now entangle capability, refusal policy, reasoning mode, token budget, and provider-side product choices. The community energy came from that ambiguity. The old straight-line story that every new frontier model is simply better is getting harder to maintain when users can point to a single task and ask whether the model failed, refused, or was tuned for a different job.