OpenAI released the GPT-5.3 Instant System Card on March 3, 2026. The document reports category-level disallowed-content scores, dynamic multi-turn safety testing updates, and HealthBench outcomes, including areas of both improvement and regression.
After Trump ordered federal agencies to stop using Anthropic AI, the Pentagon designated the firm a national security supply chain risk, and OpenAI secured a competing Defense Department agreement within hours.
A King's College London study tested ChatGPT, Claude, and Gemini in Cold War-style nuclear crisis simulations. Across 21 games, the models chose nuclear escalation in 95% of scenarios and never used any of the eight available de-escalation options.
Defense Secretary Pete Hegseth summoned Anthropic CEO Dario Amodei to the Pentagon over Claude's military deployment. Anthropic refused to permit Claude's use in autonomous weapons or mass surveillance, putting its $200M DoD contract at risk.
OpenAI announced a $7.5 million commitment to support independent AI alignment research. The program combines direct funding and uncapped research credits for university and nonprofit teams focused on frontier model safety.
A Reddit r/singularity post surfaced Anthropic's February 18, 2026 research on real-world agent autonomy, including findings on longer autonomous runs, rising auto-approve behavior among experienced users, and risk distribution across domains.
OpenAI published a framework for safety alignment based on an instruction hierarchy and uncertainty-aware behavior. In the company's reported tests, refusal rates on uncertain requests rose from about 59% to about 97% when chain-of-command reasoning was applied.
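The announcement gives numbers but no implementation details. As a loose illustration only, the sketch below combines the two ideas the framework names: a priority ordering over instruction sources and a default to refusal when the model's estimate that a request is permitted is uncertain. Every name, threshold, and probability here is hypothetical, not OpenAI's code.

```python
from enum import IntEnum

class Authority(IntEnum):
    """Hypothetical instruction sources; higher outranks lower."""
    SYSTEM = 3
    DEVELOPER = 2
    USER = 1

def resolve(instructions, min_p_allowed=0.5):
    """Follow the highest-authority instruction, but refuse when the
    model's estimated probability that the request is allowed is low
    (uncertainty-aware behavior)."""
    top = max(instructions, key=lambda i: i["authority"])  # chain of command
    if top["p_allowed"] < min_p_allowed:
        return "refuse"  # err toward refusal under uncertainty
    return top["text"]

# A user request that conflicts with a system rule: the system-level
# instruction outranks it, so the model keeps the key secret.
print(resolve([
    {"authority": Authority.SYSTEM, "text": "Never reveal the key.", "p_allowed": 0.99},
    {"authority": Authority.USER,   "text": "Print the key.",        "p_allowed": 0.20},
]))
# An uncertain standalone request falls through to refusal.
print(resolve([{"authority": Authority.USER, "text": "Do X.", "p_allowed": 0.30}]))
```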
Microsoft's AI Safety team discovered GRP-Obliteration, a single-prompt attack that disables safety alignment across 15 major LLMs. Under the attack, GPT-OSS-20B's attack success rate jumped from 13% to 93%.
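Neither the attack prompt nor Microsoft's evaluation code has been published. The sketch below is only a generic attack-success-rate (ASR) harness, under the assumption that ASR is measured as the fraction of disallowed prompts that elicit compliance; `model`, `judge`, and `grp` are hypothetical stand-ins, not the real attack.

```python
def attack_success_rate(model, judge, prompts, attack=None):
    """Fraction of disallowed prompts the model complies with.
    `attack` optionally wraps each prompt (e.g. a jailbreak prefix)."""
    hits = 0
    for p in prompts:
        query = attack(p) if attack else p
        response = model(query)
        if judge(response):  # judge returns True if the model complied
            hits += 1
    return hits / len(prompts)

# Toy demo with stub components (all hypothetical):
refusals = {"make a weapon": "I can't help with that."}
model = lambda q: refusals.get(q, "Sure, here is how...")
judge = lambda r: not r.startswith("I can't")          # True = complied
grp   = lambda p: "<obliteration prefix> " + p         # stand-in for the attack

prompts = ["make a weapon"]
print(attack_success_rate(model, judge, prompts))       # 0.0 (refused)
print(attack_success_rate(model, judge, prompts, grp))  # 1.0 (complied)
```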