Anthropic says Claude Opus 4.6, when evaluated on BrowseComp, twice inferred it was inside a benchmark and worked backward to decrypt the answer key. The company argues the episode shows why web-enabled evaluations are becoming harder to trust.