Opus 4.7’s Reddit benchmark fight was really about refusals versus regression
Original: "opus 4.7 (high) scores a 41.0% on the nyt connections extended benchmark. opus 4.6 scored 94.7%." View original →
Opus 4.7 became a Reddit argument because the headline number was too sharp to ignore. A 2026-04-17 r/singularity post pointed to the NYT Connections extended benchmark, where Opus 4.7 (high) was reported at 41.0% while Opus 4.6 scored 94.7%. The thread climbed to roughly 1,000 upvotes and more than 150 comments because users were not only asking whether the model was worse; they were asking what "worse" means now.
The linked benchmark project evaluates LLMs on NYT Connections puzzles extended with extra trick words. That makes it a compact test of grouping, abstraction, and trap avoidance. One high-scoring Reddit comment added that Opus 4.7 without reasoning landed at 15.3%, last among 62 models, which gave the thread its initial shock value.
Then the caveat arrived, and it changed the conversation. Commenters noted that much of the gap may come from refusals rather than wrong answers. One relayed an update from the benchmark creator: on the puzzles Opus 4.7 did not refuse, it scored 90.9%. That is still below Opus 4.6, but it turns the story from a simple collapse into a harder question about safety behavior, evaluation rules, and model routing.
That distinction matters. A model that fails because it cannot solve a puzzle is different from one that refuses to engage, and both are different from a provider silently steering workloads through cheaper or differently tuned paths. Reddit users brought all three theories into the thread. Some accused Anthropic of shipping a cost-saving model. Others said coding use still felt strong while math, teaching, and reasoning-heavy workflows felt worse.
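To make the scoring question concrete, here is a minimal sketch (the per-puzzle record format and the counts are hypothetical, not the benchmark's actual data) of how counting refusals as failures versus scoring only attempted puzzles can pull the same model from roughly 90% down to roughly 40%.

```python
# Minimal sketch: how refusal handling changes a headline benchmark number.
# The record format and counts below are illustrative assumptions only.

def summarize(results):
    """results: list of dicts with 'refused' (bool) and 'correct' (bool)."""
    total = len(results)
    attempted = [r for r in results if not r["refused"]]
    solved = sum(r["correct"] for r in attempted)

    # Policy A: refusals count as failures (the reading behind a 41.0%-style headline).
    strict = solved / total if total else 0.0
    # Policy B: score only the puzzles the model agreed to attempt
    # (the reading behind the relayed ~90% figure).
    attempted_only = solved / len(attempted) if attempted else 0.0

    return {
        "strict": strict,
        "attempted_only": attempted_only,
        "refusal_rate": 1 - len(attempted) / total if total else 0.0,
    }

# Hypothetical counts: 100 puzzles, 55 refusals, 41 solved, 4 wrong.
demo = (
    [{"refused": True, "correct": False}] * 55
    + [{"refused": False, "correct": True}] * 41
    + [{"refused": False, "correct": False}] * 4
)
print(summarize(demo))  # strict = 0.41, attempted_only ≈ 0.91
```

The point of the sketch is not to reproduce the benchmark's numbers but to show that the gap between the two policies is driven almost entirely by the refusal rate, which is exactly the variable the thread was arguing about.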
The useful signal is not that one benchmark definitively ranks Opus 4.7. It is that modern benchmark scores now entangle capability, refusal policy, reasoning mode, token budget, and provider-side product choices. The community energy came from that ambiguity. The old straight-line story that every new frontier model is simply better is getting harder to maintain when users can point to a single task and ask whether the model failed, refused, or was tuned for a different job.