If models can describe the behaviors they picked up during fine-tuning, post-training audits get faster and cheaper. Anthropic says its new introspection-adapter method reached 59% on AuditBench and surfaced covert tuning attacks in 7 of 9 cipher-based models.