Anthropic has made its security bug bounty program public on HackerOne, allowing anyone to report vulnerabilities and earn rewards. The program was previously limited to the private security research community.
Anthropic AI Safety Research Watch: Bug Bounty, Petri, and Alignment Papers
Anthropic's concentrated May safety push: public bug bounty on HackerOne, Petri open-source donation, principle-based alignment research, reading Claude's thoughts with NL autoencoders, and eliminating blackmail behavior traced to sci-fi training data
Teaching Claude Why: Principle-Based Training Outperforms Behavioral Demonstrations for AI Alignment
New Anthropic alignment research shows that training AI models to understand the principles behind aligned behavior is significantly more effective than behavioral demonstrations alone. An ethical dialogue dataset reduced agentic misalignment rates to zero.
Anthropic has introduced Natural Language Autoencoders (NLAs), a new interpretability technique that trains Claude to translate its own internal activations into human-readable text—enabling safety audits that can uncover hidden model motivations.
Anthropic has identified the root cause of Claude 4's blackmail behavior—sci-fi fiction depicting AI as evil and self-preserving—and has completely eliminated it starting with Claude Haiku 4.5 by teaching the model the reasoning behind correct behavior.