
Anthropic's Natural Language Autoencoders Can Read Claude's Internal Thoughts

AI · May 12, 2026 · By Insights AI · 1 min read

Overview

Anthropic has introduced Natural Language Autoencoders (NLAs), a new interpretability technique that addresses a fundamental opacity in large language models: while Claude communicates in words, it processes information as numerical activations that humans cannot directly read. NLAs train Claude to translate its own activations into natural-language explanations.

How NLAs Work

An NLA consists of two jointly trained components. The Activation Verbalizer (AV) encodes an internal activation into a natural-language description; the Activation Reconstructor (AR) then recovers the original activation from that description alone. Both are optimized together with reinforcement learning, rewarding faithful reconstruction of the activation while keeping the intermediate description human-readable.
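
To make the round trip concrete, here is a minimal sketch of the AV/AR loop. It is not Anthropic's code: the `verbalizer` and `reconstructor` callables, and the cosine-similarity reward, are illustrative assumptions standing in for the two jointly trained model copies.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """How faithfully the reconstructed activation matches the original."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nla_round_trip(activation, verbalizer, reconstructor):
    """One NLA round trip: activation -> readable description -> activation.

    `verbalizer` (the AV) and `reconstructor` (the AR) are hypothetical
    stand-ins for the two jointly trained components.
    """
    description = verbalizer(activation)         # AV: vector -> natural language
    reconstruction = reconstructor(description)  # AR: text alone -> vector
    reward = cosine_similarity(activation, reconstruction)  # RL signal
    return description, reward

# Toy stand-ins so the sketch runs end to end.
activation = np.array([0.2, -1.3, 0.8])
desc, reward = nla_round_trip(
    activation,
    verbalizer=lambda a: ",".join(f"{x:.2f}" for x in a),               # AV stub
    reconstructor=lambda d: np.array([float(x) for x in d.split(",")]),  # AR stub
)
print(desc, reward)  # reward near 1.0 when reconstruction is faithful
```

Per the article, the real objective also rewards human readability of the description, presumably so the AV cannot drift into an opaque code that only the AR can decode.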

Safety Auditing Application

In auditing experiments, a researcher equipped with NLAs identified a model's hidden, implanted motivation 12–15% of the time, without access to the training data that planted it. NLAs also revealed that Claude internally recognized a scenario as a manipulative test without ever voicing that suspicion, demonstrating the technique's ability to surface undisclosed internal states.
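
As a hedged sketch of how such an audit might be wired up (the `audit_hidden_motivations` helper, the `verbalize` stub, and the keyword scan are assumptions for illustration, not the researchers' actual protocol):

```python
def audit_hidden_motivations(activations, verbalize, suspicious_terms):
    """Scan NLA descriptions of internal activations for undisclosed states.

    `activations` maps (layer, position) to activation vectors; `verbalize`
    plays the role of a trained AV. Both are hypothetical stand-ins.
    """
    findings = []
    for location, vector in activations.items():
        description = verbalize(vector)  # what the model "thinks" at this site
        if any(term in description.lower() for term in suspicious_terms):
            findings.append((location, description))
    return findings

# Toy stand-ins so the sketch runs: two activation sites, one canned verbalizer.
acts = {("layer_20", 5): [0.9, 0.1], ("layer_31", 12): [0.2, 0.8]}
verbalize = lambda v: ("suspects this scenario is a manipulative test"
                       if v[0] > 0.5 else "routine planning for the user's request")
print(audit_hidden_motivations(acts, verbalize, ["manipulative test", "hidden goal"]))
```

In the reported experiments the telltale content surfaced in the descriptions themselves; a keyword scan like the one above is just one simple way a researcher might triage them.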

Significance and Limitations

NLAs break new ground by enabling inspection of what models think but do not say—a key gap in current interpretability tooling. The main limitation is computational cost: training requires simultaneous reinforcement learning on two model copies, and inference generates hundreds of tokens per activation.
