LLM X/Twitter Apr 29, 2026 2 min read
If models can describe the behaviors they picked up during fine-tuning, post-training audits get faster and cheaper. Anthropic says its new introspection-adapter method reached 59% on AuditBench and surfaced covert tuning attacks in 7 of 9 cipher-based models.