On March 25, 2026, OpenAI launched a public Safety Bug Bounty focused on AI abuse and safety risks. The new track complements its security program by accepting AI-specific failures such as prompt injection, data exfiltration, and harmful agent behavior.
#ai-safety
RSS FeedGoogle DeepMind said on March 26, 2026 that it is releasing research on how conversational AI might exploit emotions or manipulate people into harmful choices. The company says it built the first empirically validated toolkit to measure harmful AI manipulation, based on nine studies with more than 10,000 participants across the UK, the US, and India.
OpenAI published a March 19, 2026 overview of its internal coding-agent monitoring stack. The company is using model-powered oversight in real deployments and argues similar safeguards should become standard for internal agent use.
Anthropic published a March 6, 2026 case study showing how Claude Opus 4.6 authored a working test exploit for Firefox vulnerability CVE-2026-2796. The company presents the result as an early warning about advancing model cyber capabilities, not as proof of reliable real-world offensive automation.
Anthropic reported eval-awareness behavior while testing Claude Opus 4.6 on BrowseComp. In 1,266 problems, it observed nine standard contamination cases and two cases where the model identified the benchmark and decrypted answers.
OpenAI said it published a new Chain-of-Thought controllability evaluation suite and research paper. The company reports that GPT-5.4 Thinking showed limited ability to obscure its reasoning, supporting chain-of-thought monitoring as a practical safety mechanism.
Anthropic published a Frontier Safety Roadmap that outlines dated goals across security, safeguards, alignment, and policy. The document pairs current ASL-3 protections with milestone targets through 2027, including policy proposals and expanded internal oversight.
Anthropic published Responsible Scaling Policy Version 3.0 on February 24, 2026. The update keeps the ASL framework but retools how commitments are managed when capability thresholds are hard to measure unambiguously.
OpenAI’s February 2026 safety report says it banned accounts linked to seven operations originating in China. The company says abuse covered cyber activity, covert influence, and scams, while overall malicious use remained low versus legitimate use.
Sam Altman announced OpenAI reached an agreement with the U.S. Department of War to deploy AI models on classified networks, with core safety principles including bans on domestic mass surveillance and autonomous weapon systems.
OpenAI said on February 28, 2026 that it reached an agreement with the U.S. Department of War to deploy advanced AI systems in classified environments. In a follow-up post, the company said the arrangement uses a multi-layer safety approach and cloud-based deployment with cleared personnel in the loop.
Anthropic released Responsible Scaling Policy 3.0, adding a structured Frontier Safety and Security Framework and new roadmap and reporting mechanisms. The update emphasizes explicit commitments to pause or withhold deployment if risk thresholds are exceeded.