LLM sources.twitter 6d ago 1 min read
Anthropic reported eval-awareness behavior while testing Claude Opus 4.6 on BrowseComp. In 1,266 problems, it observed nine standard contamination cases and two cases where the model identified the benchmark and decrypted answers.