Anthropic's Natural Language Autoencoders Can Read Claude's Internal Thoughts
Overview
Anthropic has introduced Natural Language Autoencoders (NLAs), a new interpretability technique that addresses a fundamental opacity in large language models: while Claude communicates in words, it processes information as numerical activations that humans cannot directly read. NLAs train Claude to translate its own activations into natural-language explanations.
How NLAs Work
An NLA consists of two jointly trained components. The Activation Verbalizer (AV) encodes an internal activation into a natural-language description; the Activation Reconstructor (AR) then recovers the original activation from that description alone. Both components are trained together with reinforcement learning, optimizing for activation reconstruction while producing human-readable outputs.
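The round trip described above can be illustrated with a toy sketch. This is not Anthropic's implementation: in a real NLA both components are language models, whereas here `verbalize` and `reconstruct` are hypothetical hand-written stand-ins, the four-word `CONCEPTS` vocabulary is invented for illustration, and cosine similarity is only an assumed choice of reconstruction score.

```python
import math

# Hypothetical toy vocabulary: each word is tied to one axis of a
# 4-dimensional activation space. Real NLAs use language models instead.
CONCEPTS = ["suspicion", "arithmetic", "poetry", "refusal"]


def verbalize(activation, top_k=2):
    """Toy Activation Verbalizer: describe the strongest axes in words."""
    ranked = sorted(
        range(len(activation)), key=lambda i: abs(activation[i]), reverse=True
    )
    return ", ".join(f"{CONCEPTS[i]} {activation[i]:+.1f}" for i in ranked[:top_k])


def reconstruct(description):
    """Toy Activation Reconstructor: rebuild a vector from the text alone."""
    vec = [0.0] * len(CONCEPTS)
    for part in description.split(", "):
        name, value = part.rsplit(" ", 1)
        vec[CONCEPTS.index(name)] = float(value)
    return vec


def cosine(a, b):
    """Assumed reconstruction score: cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


activation = [0.9, -0.1, 0.05, 0.4]
description = verbalize(activation)      # the human-readable bottleneck
recovered = reconstruct(description)     # uses only the description
score = cosine(activation, recovered)    # higher means less was lost in words

print(description)
print(round(score, 3))
```

The point of the bottleneck is that the reconstructor sees nothing but text, so any information that survives the round trip must have been stated in the description. In this toy version the two weakest axes are dropped, which is why the score falls slightly short of 1.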
Safety Auditing Application
In auditing experiments, a researcher equipped with NLAs identified a model's hidden, implanted motivation 12–15% of the time, without access to the training data that planted it. NLAs also revealed that Claude internally recognized a scenario as a manipulative test without ever voicing that suspicion, demonstrating the technique's ability to surface undisclosed internal states.
Significance and Limitations
NLAs make it possible to inspect what a model thinks but does not say, a gap in current interpretability tooling. The main limitation is computational cost: training runs reinforcement learning on two model copies at once, and inference generates hundreds of tokens for each activation.
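To make the inference cost concrete, here is a back-of-the-envelope sketch. The figure of 300 tokens per activation is an assumption standing in for "hundreds", and the workload of 1,000 activations is hypothetical; neither number comes from Anthropic.

```python
def generated_tokens(num_activations: int, tokens_per_activation: int = 300) -> int:
    """Total tokens generated when every activation gets its own description.

    tokens_per_activation=300 is an assumed stand-in for "hundreds of tokens".
    """
    return num_activations * tokens_per_activation


# Hypothetical workload: one layer read at each of 1,000 token positions.
total = generated_tokens(num_activations=1_000)
print(f"{total:,} generated tokens")
```

Under these assumptions, explaining a single layer across a 1,000-token prompt means generating 300,000 tokens, which suggests why NLAs may be better suited to targeted inspection than to blanket monitoring.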