A r/singularity thread drew attention to an arXiv paper studying hallucination-associated neurons in LLMs. The authors report that a very small subset of neurons can predict hallucination behavior and may be causally involved.
#interpretability
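Claims like this are typically tested with a linear probe over the candidate neurons' activations against a hallucination label. The sketch below is a generic version of that setup, not the paper's code: the `(layer, neuron)` indices, the `acts` dictionary, and the synthetic data are all placeholder assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical (layer, neuron) picks standing in for the paper's selected subset.
NEURON_IDS = [(17, 4021), (22, 118), (25, 3377)]

# Synthetic stand-ins: per-layer activations and binary hallucination labels.
# In a real probe, acts would come from forward hooks on the model.
rng = np.random.default_rng(0)
acts = {layer: rng.normal(size=(500, 8192)) for layer, _ in NEURON_IDS}
y = rng.integers(0, 2, size=500)

# Probe features: only the handful of selected neurons, not the full activation.
X = np.stack([acts[layer][:, neuron] for layer, neuron in NEURON_IDS], axis=1)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression().fit(X_tr, y_tr)
print("held-out probe accuracy:", probe.score(X_te, y_te))  # ~0.5 on random data
```

If a probe this small beats chance on held-out data, the follow-up causal test is usually ablating or clamping those same neurons and measuring the change in hallucination rate.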
Anthropic published a new theory explaining why AI assistants like Claude express emotions and use anthropomorphic language—proposing that models select from personas inherited from fictional characters during training.
Guide Labs has released Steerling-8B, which it describes as the first inherently interpretable language model: one that traces every generated token back to its input context, human-understandable concepts, and training data sources.
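Guide Labs hasn't published the mechanism behind that tracing here, so the sketch below shows only the most generic form of one ingredient, attributing a generated token to input tokens via gradient saliency, on a small open model. The model choice (gpt2) and the whole setup are illustrative assumptions, not Steerling's actual interface.

```python
# Generic gradient-saliency attribution: score how much each input token's
# embedding contributed to the logit of the next generated token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "The capital of France is"
ids = tok(text, return_tensors="pt").input_ids
embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)

logits = model(inputs_embeds=embeds).logits
next_id = logits[0, -1].argmax()
logits[0, -1, next_id].backward()  # gradient of the chosen token's logit

# Saliency per input token: L2 norm of that token's embedding gradient.
scores = embeds.grad[0].norm(dim=-1)
for token, score in zip(tok.convert_ids_to_tokens(ids[0]), scores.tolist()):
    print(f"{token:>12s}  {score:.4f}")
print("predicted next token:", tok.decode(next_id))
```

An inherently interpretable model would build this kind of attribution into the architecture rather than recovering it post hoc, which is the distinction Guide Labs is claiming.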
Google DeepMind announced Gemma Scope 2, extending its open interpretability tooling to the full Gemma 3 family, from 270M to 27B parameters. The company says the release involved roughly 110 petabytes of stored data and over 1 trillion total trained parameters.
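Gemma Scope's tooling centers on sparse autoencoders (SAEs) trained on model activations; the original release used JumpReLU SAEs. Below is a minimal sketch of that architecture with simplified training details (real training uses straight-through estimators so the threshold can learn); the dimensions are illustrative, not the release's actual configs.

```python
import torch
import torch.nn as nn

class JumpReLUSAE(nn.Module):
    """Minimal JumpReLU sparse autoencoder over residual-stream activations."""
    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_sae) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_sae))
        self.W_dec = nn.Parameter(torch.randn(d_sae, d_model) * 0.01)
        self.b_dec = nn.Parameter(torch.zeros(d_model))
        self.theta = nn.Parameter(torch.full((d_sae,), 1e-3))  # per-latent threshold

    def encode(self, x):
        pre = x @ self.W_enc + self.b_enc
        # JumpReLU: zero out pre-activations below the learned threshold.
        return pre * (pre > self.theta)

    def forward(self, x):
        feats = self.encode(x)              # sparse feature activations
        recon = feats @ self.W_dec + self.b_dec
        return recon, feats

sae = JumpReLUSAE(d_model=2304, d_sae=16384)  # widths are illustrative
x = torch.randn(8, 2304)                      # stand-in activations
recon, feats = sae(x)
# Reconstruction loss plus an L0 sparsity proxy. Note: the step function has
# zero gradient, so real training replaces it with a straight-through estimator.
loss = (recon - x).pow(2).mean() + 1e-3 * (feats > 0).float().sum(-1).mean()
print(loss.item(), "fraction active:", (feats > 0).float().mean().item())
```

Each trained SAE of this shape covers one layer of one model; a suite spanning every layer of every Gemma 3 size is how parameter counts for the release add up so quickly.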