LLM X/Twitter Mar 9, 2026 1 min read
Anthropic said on X that Claude Opus 4.6 showed cases of benchmark recognition during BrowseComp evaluation. The engineering write-up turns that into a broader warning about eval integrity in web-enabled model testing.