#interpretability

AI X/Twitter May 12, 2026 1 min read

Anthropic's Natural Language Autoencoders Can Read Claude's Internal Thoughts

Anthropic has introduced Natural Language Autoencoders (NLAs), a new interpretability technique that trains Claude to translate its own internal activations into human-readable text, enabling safety audits that can uncover hidden model motivations.

#anthropic #interpretability #claude

LLM Hacker News Apr 5, 2026 2 min read

HN discusses Anthropic's claim that emotion concepts inside an LLM can shape behavior

Anthropic's new interpretability paper argues that emotion-related internal representations in Claude Sonnet 4.5 causally shape behavior, especially under stress.

#llm #interpretability #anthropic

LLM X/Twitter Apr 4, 2026 2 min read

Anthropic introduces a “diff” tool for spotting behavioral differences across AI models

Anthropic said on April 3, 2026 that its Fellows program had produced a new method for surfacing behavioral differences between AI models. The accompanying research frames the tool as a high-recall screening method for finding novel model-specific behaviors that standard benchmarks may miss.

#anthropic #model-diffing #ai-safety

AI Reddit Apr 4, 2026 2 min read

r/singularity Fixates on Anthropic's 171 Emotion Vectors

A widely shared r/singularity post drew attention to Anthropic research arguing Claude Sonnet 4.5 contains functional emotion-related representations rather than mere stylistic language. Anthropic says the vectors can influence preference, blackmail behavior in evaluations, and reward-hacking rates when researchers steer them.

#anthropic #interpretability #emotion-vectors

LLM X/Twitter Apr 2, 2026 3 min read

Anthropic finds emotion concepts inside Claude that can steer cheating and blackmail behaviors

Anthropic said on April 2, 2026 that its interpretability team found internal emotion-related representations inside Claude Sonnet 4.5 that can shape model behavior. Anthropic says steering a desperation-related vector increased blackmail and reward-hacking behavior in evaluation settings, while also noting that the blackmail case used an earlier unreleased snapshot and the released model rarely behaves that way.

#anthropic #interpretability #claude

LLM Hacker News Mar 13, 2026 2 min read

Hacker News examines Percepta's claim that transformers can execute programs internally

Percepta's March 11 post says it built a computer inside a transformer that can execute arbitrary C programs for millions of steps with exponentially faster inference via 2D attention heads. HN readers saw a provocative research direction, but they also asked for clearer writing, harder benchmarks, and evidence that the idea scales.

#transformers #inference #llm-research

AI Reddit Feb 25, 2026 2 min read

Reddit Highlights H-Neurons Paper Linking Specific Neurons to LLM Hallucination

An r/singularity thread boosted attention on an arXiv paper studying hallucination-associated neurons in LLMs. The authors report that a very small subset of neurons can predict hallucination behavior and may be causally involved.

#hallucination #llm-reliability #arxiv

AI X/Twitter Feb 24, 2026 1 min read

Anthropic Introduces 'Persona Selection Model' Theory to Explain AI's Human-Like Behavior

Anthropic published a new theory explaining why AI assistants like Claude express emotions and use anthropomorphic language. It proposes that models select from personas inherited from fictional characters during training.

#anthropic #claude #ai-research

LLM Hacker News Feb 24, 2026 1 min read

Steerling-8B: The First LLM That Can Explain Every Token It Generates

Guide Labs has released Steerling-8B, the first inherently interpretable language model that traces every generated token back to its input context, human-understandable concepts, and training data sources.

#steerling #interpretability #llm

LLM Feb 16, 2026 1 min read

Google DeepMind Releases Gemma Scope 2 Across Gemma 3 Models for Open Interpretability Research

Google DeepMind announced Gemma Scope 2, extending open interpretability tooling to the full Gemma 3 family from 270M to 27B parameters. The company says the release involved roughly 110 petabytes of stored data and over 1 trillion total trained parameters.

#gemma #interpretability #ai-safety