Steerling-8B: The First LLM That Can Explain Every Token It Generates
Original: Show HN: Steerling-8B, a language model that can explain any token it generates
A New Approach to LLM Interpretability
Guide Labs has released Steerling-8B, describing it as "the first interpretable model that can trace any token it generates to its input context, concepts a human can understand, and its training data." Unlike post-hoc interpretability tools, which analyze a model after it has been trained, Steerling-8B builds explainability directly into its architecture.
Three-Way Token Attribution
For any group of output tokens Steerling generates, three types of attribution are available simultaneously:
Input Feature Attribution: which tokens in the prompt most strongly influenced that output chunk.
Concept Attribution: a ranked list of human-understandable topics (both tone concepts like "analytical" or "clinical," and content concepts like "genetic alteration methodologies") that the model routed through to produce the output.
Training Data Attribution: how the concepts in that output distribute across training sources (ArXiv, Wikipedia, FLAN, etc.), revealing where in the training corpus the model's knowledge originates.
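To make the three attribution types concrete, here is a minimal sketch of how one output chunk's attributions could be held and queried. The class name, field names, and scores are all hypothetical, invented for illustration. This is not Steerling's actual output format or API.

```python
from dataclasses import dataclass


@dataclass
class ChunkAttribution:
    """Hypothetical container for one output chunk's three attributions."""
    input_tokens: dict  # prompt token -> influence score
    concepts: dict      # concept name -> score
    sources: dict       # training source -> raw weight

    def top_concepts(self, k):
        # Rank concepts from highest to lowest score and keep the first k names.
        ranked = sorted(self.concepts.items(), key=lambda kv: kv[1], reverse=True)
        return [name for name, _ in ranked[:k]]

    def source_shares(self):
        # Normalize raw source weights so they sum to 1.
        total = sum(self.sources.values())
        return {name: weight / total for name, weight in self.sources.items()}


# Made-up example values, not real model output.
attribution = ChunkAttribution(
    input_tokens={"gene": 0.7, "editing": 0.2, "the": 0.1},
    concepts={
        "analytical": 0.6,
        "clinical": 0.3,
        "genetic alteration methodologies": 0.9,
    },
    sources={"ArXiv": 3.0, "Wikipedia": 1.0},
)

top_two = attribution.top_concepts(2)
shares = attribution.source_shares()
```

Here `top_two` corresponds to the ranked concept list and `shares` to the distribution over training sources described above.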
Technical Foundation
Steerling-8B is built on a causal discrete diffusion model backbone, which enables generation steering across multi-token spans rather than only next-token prediction. The model's embeddings are decomposed into three components corresponding to the three attribution types, so that, according to Guide Labs, attributions follow from the model's structure rather than from a post-hoc approximation.
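One way to see why a decomposed embedding can give exact rather than approximate attributions: if an embedding is an additive sum of parts and the output head is linear, a token's logit splits exactly into one contribution per part. The sketch below assumes that additive form and a linear head. The component names and numbers are invented, and the real model's decomposition may differ.

```python
def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def decompose_logit(components, head_row):
    """Return each component's exact contribution to a linear logit."""
    return {name: dot(vec, head_row) for name, vec in components.items()}


# Hypothetical three-part decomposition of a single embedding (assumption).
components = {
    "component_1": [0.5, -1.0, 2.0],
    "component_2": [1.5, 0.0, -0.5],
    "component_3": [-0.25, 0.75, 0.0],
}
head_row = [2.0, 1.0, -1.0]  # output-head weights for one token (made up)

# The full embedding is assumed to be the element-wise sum of the parts.
embedding = [sum(values) for values in zip(*components.values())]

contributions = decompose_logit(components, head_row)
total_logit = dot(embedding, head_row)
```

Because the dot product is linear, the per-component contributions add up to the full logit, so nothing is left unexplained under these assumptions.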
Practical Applications
The interpretability isn't just academic. Steerling enables inference-time concept steering: suppressing or amplifying specific concepts without retraining. Guide Labs presents this explicit concept-level control as a substitute for thousands of safety training examples. Trained on 1.35 trillion tokens, the model reportedly achieves downstream performance comparable to models trained on 2–7x more data. Weights and code are publicly released on Hugging Face and GitHub.
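Concept steering at inference time can be pictured as rescaling a concept's activation before the hidden state is reassembled. The sketch below assumes a hidden vector built as a residual plus a weighted sum of concept directions. The function, concept directions, and values are hypothetical and do not reflect Steerling's released code.

```python
def steer(activations, directions, residual, scale):
    """Rebuild a hidden vector after rescaling selected concept activations.

    scale maps concept name -> multiplier; 0.0 suppresses a concept,
    values above 1.0 amplify it, and unlisted concepts are left unchanged.
    """
    hidden = list(residual)
    for name, activation in activations.items():
        factor = scale.get(name, 1.0)
        for i, direction_value in enumerate(directions[name]):
            hidden[i] += factor * activation * direction_value
    return hidden


# Made-up two-dimensional example with two tone concepts.
activations = {"clinical": 2.0, "analytical": 1.0}
directions = {"clinical": [1.0, 0.0], "analytical": [0.0, 1.0]}
residual = [0.5, 0.5]

baseline = steer(activations, directions, residual, {})
suppressed = steer(activations, directions, residual, {"clinical": 0.0})
amplified = steer(activations, directions, residual, {"analytical": 2.0})
```

No weights change in this sketch; only the per-concept multipliers differ between calls, which is the sense in which steering happens without retraining.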
Why This Matters
As AI systems are deployed in high-stakes domains such as healthcare, law, and finance, the ability to understand why a model produced a specific output becomes critical. Steerling-8B is a step toward accountable LLMs: rather than explaining outputs post hoc, it is architecturally designed to make each generation decision traceable.