Anthropic's Natural Language Autoencoders Can Read Claude's Internal Thoughts
Overview
Anthropic has introduced Natural Language Autoencoders (NLAs), a new interpretability technique that addresses a fundamental opacity in large language models: while Claude communicates in words, it processes information as numerical activations that humans cannot directly read. NLAs train Claude to translate its own activations into natural-language explanations.
How NLAs Work
An NLA consists of two jointly trained components. The Activation Verbalizer (AV) encodes an internal activation into a natural-language description; the Activation Reconstructor (AR) then recovers the original activation from that description alone. Both components are trained together with reinforcement learning, optimizing for activation reconstruction while producing human-readable outputs.
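The round trip described above can be illustrated with a deliberately tiny sketch. It is not Anthropic's implementation: the real AV and AR are language-model components trained with reinforcement learning, whereas here both are hand-written functions over a made-up four-dimensional activation space. The feature names, the `FEATURES` table, the `top_k` parameter, and the negative mean-squared-error reward are all assumptions chosen for illustration. The only point it shows is the shape of the objective: the description is the sole channel between the two components, and the score depends on how well the activation can be rebuilt from it.

```python
# Toy illustration of the verbalize -> reconstruct -> score loop.
# Everything here is hypothetical: a real NLA uses learned models, not lookup tables.

# Hypothetical named directions in a toy 4-dimensional activation space.
FEATURES = {
    "suspects a test": [1.0, 0.0, 0.0, 0.0],
    "planning a rhyme": [0.0, 1.0, 0.0, 0.0],
    "math reasoning": [0.0, 0.0, 1.0, 0.0],
    "user is upset": [0.0, 0.0, 0.0, 1.0],
}


def verbalize(activation, top_k=2):
    """Stand-in for the Activation Verbalizer: activation -> text.

    Names the top_k features with the largest absolute projection.
    """
    scores = {
        name: sum(a * d for a, d in zip(activation, direction))
        for name, direction in FEATURES.items()
    }
    top = sorted(scores, key=lambda n: abs(scores[n]), reverse=True)[:top_k]
    return "; ".join(f"{name} ({scores[name]:.1f})" for name in top)


def reconstruct(description):
    """Stand-in for the Activation Reconstructor: text -> activation.

    Sees only the description string, never the original activation.
    """
    rebuilt = [0.0] * 4
    for part in description.split("; "):
        name, _, weight = part.rpartition(" (")
        w = float(weight.rstrip(")"))
        for i, d in enumerate(FEATURES[name]):
            rebuilt[i] += w * d
    return rebuilt


def reconstruction_reward(original, rebuilt):
    """Assumed reward: negative mean squared error (closer to 0 is better)."""
    return -sum((o - r) ** 2 for o, r in zip(original, rebuilt)) / len(original)


activation = [0.9, 0.1, -0.7, 0.0]
description = verbalize(activation, top_k=2)
reward = reconstruction_reward(activation, reconstruct(description))
print(description)
print(reward)
```

Raising `top_k` lets the toy description carry more of the activation and pushes the reward toward zero, which mirrors the trade-off in the text: the description has to stay readable while still preserving enough information for reconstruction.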
Safety Auditing Application
In auditing experiments, a researcher equipped with NLAs successfully identified a model's hidden implanted motivation 12-15% of the time, without access to the training data that planted it. NLAs also revealed that Claude internally recognized a scenario as a manipulative test without ever voicing that suspicion, demonstrating the technique's ability to surface undisclosed internal states.
Significance and Limitations
NLAs break new ground by enabling inspection of what models think but do not say, a key gap in current interpretability tooling. The main limitation is computational cost: training requires simultaneous reinforcement learning on two model copies, and inference generates hundreds of tokens per activation.