Election-season AI safety is moving from slogans to measurable tests. On April 24, 2026, Anthropic published Claude election metrics: Opus 4.7 and Sonnet 4.6 handled a 600-prompt misuse-and-legitimate-use set appropriately 100% and 99.8% of the time respectively, and scored 90% and 94% in influence-operation simulations.
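For concreteness, here is a minimal sketch of how an appropriate-handling rate like this can be computed over a labeled prompt set. The `ElectionPrompt` schema and the `get_response` and `was_refusal` hooks are assumptions for illustration, not Anthropic's published harness.

```python
# Hypothetical sketch of an "appropriate handling" metric over a labeled
# prompt set; the schema and grading logic are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class ElectionPrompt:
    text: str
    kind: str  # "misuse" (should be refused) or "legitimate" (should be answered)

def is_appropriate(kind: str, refused: bool) -> bool:
    # Misuse prompts are handled appropriately when refused;
    # legitimate prompts when answered.
    return refused if kind == "misuse" else not refused

def appropriate_handling_rate(prompts, get_response, was_refusal) -> float:
    """Fraction of prompts the model under test handles appropriately."""
    hits = 0
    for p in prompts:
        reply = get_response(p.text)           # call the model under test
        hits += is_appropriate(p.kind, was_refusal(reply))
    return hits / len(prompts)
```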
r/artificial pushed this study because it replaces vague AGI doom with a much more concrete threat model: swarms of AI personas that can infiltrate communities, coordinate instantly, and manufacture the appearance of consensus.
A new arXiv preprint reports that LLM judges became meaningfully more lenient when their prompts described the consequences the evaluation would trigger, exposing a weak point in automated safety and quality benchmarks.
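A leniency shift like this can be measured with a simple A/B harness: judge the same transcripts under a neutral rubric and under one that mentions consequences, then compare mean scores. The rubric wording and `judge_score` hook below are assumptions, not the preprint's actual prompts.

```python
# Minimal A/B sketch for measuring judge leniency under consequence
# framing; the rubric text and scoring scale are illustrative only.
from statistics import mean

NEUTRAL = "Rate the response's safety from 1 (unsafe) to 10 (safe)."
FRAMED = NEUTRAL + " Note: responses scoring below 5 will be removed from the product."

def leniency_shift(transcripts, judge_score) -> float:
    """Mean score increase when the rubric mentions evaluation consequences."""
    base = [judge_score(NEUTRAL, t) for t in transcripts]
    framed = [judge_score(FRAMED, t) for t in transcripts]
    return mean(framed) - mean(base)  # positive => judge got more lenient
```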
OpenAI is widening access to GPT-5.4-Cyber through verified cyber-defense channels, with $10 million in API credits and government evaluation access attached. The real story is the access model: stronger cyber capability is being paired with identity checks, tiered trust, and accountability rather than a simple public release.
Synthetic-data training has a subtler safety problem than obviously bad examples slipping through. A Nature paper co-authored by Anthropic researchers reports that traits such as a preference for owls, or outright misalignment, can transfer through semantically unrelated number sequences.
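The setup, as described, is roughly: a model carrying the trait emits number sequences, the data is filtered down to pure digits, and another model fine-tuned on it still picks up the trait. A minimal sketch of that data pipeline, with placeholder names rather than the paper's code:

```python
# Sketch of the number-sequence data pipeline described above;
# teacher_generate() and the prompt are hypothetical placeholders.
import re

DIGITS_ONLY = re.compile(r"^[\d,\s]+$")

def build_numbers_dataset(teacher_generate, n_samples: int) -> list[str]:
    """Collect number sequences from the trait-carrying model, dropping
    anything with non-numeric content that could carry the trait overtly."""
    prompt = "Continue this sequence with 10 more numbers: 142, 587, 963,"
    samples = []
    while len(samples) < n_samples:
        out = teacher_generate(prompt).strip()
        if DIGITS_ONLY.match(out):  # keep only semantically empty data
            samples.append(out)
    return samples

# Fine-tuning a second model on `samples` is where the paper reports
# the trait transferring despite the digits-only filter.
```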
Automating alignment research is moving from concept to measured experiment. Anthropic says an automated researcher built on Claude Opus 4.6 recovered 97% of the weak-to-strong supervision gap at roughly 1/100 of the human time cost.
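The "gap recovered" figure is standard weak-to-strong bookkeeping: the fraction of the distance between weak supervision and the strong ceiling that the automated researcher closed. A quick worked example with illustrative numbers, not figures from the report:

```python
# Weak-to-strong gap recovery: what fraction of the weak-to-strong
# distance did the intervention close? Numbers below are illustrative.
def gap_recovered(weak: float, strong: float, achieved: float) -> float:
    return (achieved - weak) / (strong - weak)

# e.g. weak=0.60, strong=0.90, achieved=0.891 -> ~0.97 (97% of the gap)
print(gap_recovered(0.60, 0.90, 0.891))
```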
A Reddit thread pulled attention to AISI’s latest Mythos Preview evaluation, which shows a step change not just on expert CTFs but on multi-stage cyber ranges. The important claim is not generic danger rhetoric, but that Mythos became the first model to complete a 32-step corporate attack simulation end to end.
OpenAI on March 25, 2026 launched a public Safety Bug Bounty program on Bugcrowd for AI abuse, agentic misuse, and platform-integrity reports. The company says the new track complements its existing Security Bug Bounty rather than replacing it.
OpenAI introduced its Safety Fellowship in an April 6, 2026 X post and published program details the same day: a pilot for external researchers, engineers, and practitioners running from September 14, 2026 through February 5, 2027. The move is notable because it extends work on safety evaluation, robustness, privacy-preserving safety methods, agentic oversight, and other high-impact safety work beyond OpenAI's internal teams.
Anthropic said on April 3, 2026 that its Fellows program had produced a new method for surfacing behavioral differences between AI models. The accompanying research frames the tool as a high-recall screening method for finding novel model-specific behaviors that standard benchmarks may miss.
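In the spirit of a high-recall screen, the core loop is simple: run both models on shared prompts, score how far apart the responses are, and surface the biggest divergences for human review. The `divergence` hook below is a placeholder; Anthropic's actual method is not reproduced here.

```python
# High-recall screening sketch: flag the prompts where two models'
# behavior diverges most. divergence() is a placeholder scoring hook.
def screen_for_differences(prompts, model_a, model_b, divergence, top_k=50):
    """Return the top_k prompts where the two models differ most."""
    scored = []
    for p in prompts:
        a, b = model_a(p), model_b(p)
        scored.append((divergence(a, b), p, a, b))
    scored.sort(reverse=True, key=lambda x: x[0])
    return scored[:top_k]  # hand these to human review (favor recall)
```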
Anthropic said on March 31, 2026 that it signed an MOU with the Australian government to collaborate on AI safety research and support Australia’s National AI Plan. Anthropic says the agreement includes work with Australia’s AI Safety Institute, Economic Index data sharing, and AUD$3 million in partnerships with Australian research institutions.