Anthropic has introduced Natural Language Autoencoders (NLAs), a new interpretability technique that trains Claude to translate its own internal activations into human-readable text—enabling safety audits that can uncover hidden model motivations.
#interpretability
Anthropic's new interpretability paper argues that emotion-related internal representations in Claude Sonnet 4.5 causally shape behavior, especially under stress.
Anthropic said on April 3, 2026 that its Fellows program had produced a new method for surfacing behavioral differences between AI models. The accompanying research frames the tool as a high-recall screening method for finding novel model-specific behaviors that standard benchmarks may miss.
A widely shared r/singularity post drew attention to Anthropic research arguing Claude Sonnet 4.5 contains functional emotion-related representations rather than mere stylistic language. Anthropic says the vectors can influence preference, blackmail behavior in evaluations, and reward-hacking rates when researchers steer them.
Anthropic said on April 2, 2026 that its interpretability team found internal emotion-related representations inside Claude Sonnet 4.5 that can shape model behavior. Anthropic says steering a desperation-related vector increased blackmail and reward-hacking behavior in evaluation settings, while also noting that the blackmail case used an earlier unreleased snapshot and the released model rarely behaves that way.
Percepta's March 11 post says it built a computer inside a transformer that can execute arbitrary C programs for millions of steps with exponentially faster inference via 2D attention heads. HN readers saw a provocative research direction, but they also asked for clearer writing, harder benchmarks, and evidence that the idea scales.
An r/singularity thread boosted attention on an arXiv paper studying hallucination-associated neurons in LLMs. The authors report that a very small subset of neurons can predict hallucination behavior and may be causally involved.
Anthropic published a new theory explaining why AI assistants like Claude express emotions and use anthropomorphic language—proposing that models select from personas inherited from fictional characters during training.
Guide Labs has released Steerling-8B, the first inherently interpretable language model that traces every generated token back to its input context, human-understandable concepts, and training data sources.
Google DeepMind announced Gemma Scope 2, extending open interpretability tooling to the full Gemma 3 family from 270M to 27B parameters. The company says the release involved roughly 110 petabytes of stored data and over 1 trillion total trained parameters.