Anthropic's Natural Language Autoencoders Can Read Claude's Internal Thoughts

Overview

Anthropic has introduced Natural Language Autoencoders (NLAs), a new interpretability technique aimed at a basic opacity problem in large language models: Claude communicates in words, but it processes information as numerical activations that humans cannot read directly. NLAs train Claude to translate its own activations into natural-language explanations.

How NLAs Work

An NLA consists of two components trained jointly with reinforcement learning. The Activation Verbalizer (AV) translates an internal activation into a natural-language description; the Activation Reconstructor (AR) then recovers the original activation from that description alone. Training optimizes for accurate reconstruction of the activation while keeping the intermediate descriptions human-readable.
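
The round trip described above can be illustrated with a deliberately tiny sketch. It is not Anthropic's implementation: in the real system both components are language models trained with reinforcement learning, whereas here `verbalize`, `reconstruct`, and `reconstruction_reward` are hypothetical hand-written stand-ins, and cosine similarity is an assumed choice of reconstruction score. The only point is the shape of the loop: activation to text, text back to activation, and a score for how much survived.

```python
import math

def verbalize(activation, top_k=3):
    """Toy stand-in for the Activation Verbalizer (hypothetical).

    Describes the top_k largest-magnitude dimensions as plain text.
    """
    ranked = sorted(range(len(activation)),
                    key=lambda i: abs(activation[i]), reverse=True)
    kept = sorted(ranked[:top_k])
    return "; ".join(f"dim {i} is {activation[i]:+.1f}" for i in kept)

def reconstruct(description, size):
    """Toy stand-in for the Activation Reconstructor (hypothetical).

    Rebuilds a vector using only the text description.
    """
    vec = [0.0] * size
    for part in description.split("; "):
        if not part.strip():
            continue  # empty description: nothing to recover
        _, idx, _, value = part.split()
        vec[int(idx)] = float(value)
    return vec

def reconstruction_reward(original, rebuilt):
    """Cosine similarity between original and rebuilt vectors.

    Assumed here as the reconstruction score; the source does not
    specify the actual objective.
    """
    dot = sum(a * b for a, b in zip(original, rebuilt))
    norm = (math.sqrt(sum(a * a for a in original))
            * math.sqrt(sum(b * b for b in rebuilt)))
    return dot / norm if norm else 0.0

# Made-up activation vector, for illustration only.
activation = [0.1, -2.0, 0.0, 1.5, 0.2, -0.1]
text = verbalize(activation, top_k=2)
rebuilt = reconstruct(text, len(activation))
reward = reconstruction_reward(activation, rebuilt)
print(text)
print(round(reward, 3))
```

Because the description keeps only two dimensions, the rebuilt vector is close to the original but not identical, so the score falls just short of 1. That gap is the kind of signal a reconstruction-based reward would use to prefer more informative descriptions.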

Safety Auditing Application

In auditing experiments, a researcher equipped with NLAs identified a model's hidden, implanted motivation 12-15% of the time, without access to the training data that planted it. NLAs also revealed that Claude internally recognized a scenario as a manipulative test without ever voicing that suspicion, demonstrating the technique's ability to surface undisclosed internal states.

Significance and Limitations

NLAs make it possible to inspect what models think but do not say, a key gap in current interpretability tooling. The main limitation is computational cost: training requires running reinforcement learning on two model copies simultaneously, and inference generates hundreds of tokens per activation.
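
To get a feel for the inference cost, a back-of-the-envelope calculation helps. The numbers below are hypothetical placeholders, not figures from Anthropic; the source says only that each activation yields "hundreds of tokens".

```python
def generated_tokens(num_activations, tokens_per_activation):
    """Total tokens generated when every activation is verbalized once."""
    return num_activations * tokens_per_activation

# Hypothetical workload: 1,000 activations at an assumed 300 tokens each.
total = generated_tokens(num_activations=1_000, tokens_per_activation=300)
print(total)
```

Under these assumed values, explaining a thousand activations means generating 300,000 tokens, which suggests why verbalizing every activation in a long transcript would be expensive.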
