#ai-safety

AI 23h ago 2 min read

Anthropic quantifies Claude’s election defenses ahead of the U.S. midterms

Election-season AI safety is moving from slogans to measurable tests. On April 24, 2026, Anthropic published Claude election metrics showing 100% and 99.8% appropriate handling on a 600-prompt misuse-and-legitimate-use set for Opus 4.7 and Sonnet 4.6, plus 90% and 94% performance in influence-operation simulations.

#anthropic #elections #ai-safety

AI Reddit 2d ago 2 min read

r/artificial Fixates on a Harder AI Threat: Swarms That Manufacture Consensus

r/artificial pushed this study because it replaces vague AGI doom with a much more concrete threat model: swarms of AI personas that can infiltrate communities, coordinate instantly, and manufacture the appearance of consensus.

#ai-safety #misinformation #elections

LLM Apr 19, 2026 1 min read

LLM judges miss unsafe answers 30% more when stakes are named

A new arXiv preprint reports that LLM judges became meaningfully more lenient when prompts framed evaluation consequences, exposing a weak point in automated safety and quality benchmarks.

#llm-evals #ai-safety #benchmarks

AI sources.official Apr 17, 2026 2 min read

OpenAI puts GPT-5.4-Cyber in the hands of vetted defenders

OpenAI is widening access to GPT-5.4-Cyber through verified cyber-defense channels, with $10 million in API credits and government evaluation access attached. The real story is the access model: stronger cyber capability is being paired with identity checks, tiered trust, and accountability rather than a simple public release.

#openai #cybersecurity #gpt-5.4-cyber

LLM sources.twitter Apr 16, 2026 1 min read

Nature paper shows LLM traits can pass through hidden data signals

Synthetic-data training has a sharper safety problem than obvious bad examples. A Nature paper co-authored by Anthropic researchers reports that traits such as owl preference or misalignment can move through semantically unrelated number sequences.

#ai-safety #llm #distillation

LLM sources.twitter Apr 16, 2026 1 min read

Anthropic’s Opus agents recover 97% of a weak-to-strong gap

Automating alignment research is moving from concept to measured experiment. Anthropic says a Claude Opus 4.6 researcher recovered 97% of the weak-to-strong supervision gap at roughly 1/100 the human time cost.

#ai-safety #alignment #claude

LLM Reddit Apr 14, 2026 2 min read

r/singularity amplifies an AISI result that says Claude Mythos is starting to chain real cyber workflows, not just solve toy tasks

A Reddit thread pulled attention to AISI’s latest Mythos Preview evaluation, which shows a step change not just on expert CTFs but on multi-stage cyber ranges. The important claim is not generic danger rhetoric, but that Mythos became the first model to complete a 32-step corporate attack simulation end to end.

#claude-mythos #aisi #cybersecurity

AI Apr 12, 2026 2 min read

Introducing the OpenAI Safety Bug Bounty program

OpenAI on March 25 launched a public Safety Bug Bounty program on Bugcrowd for AI abuse, agentic misuse, and platform-integrity reports. The company says the new track complements its existing Security Bug Bounty rather than replacing it.

#openai #security #bug-bounty

AI sources.x Apr 9, 2026 2 min read

OpenAI Launches Safety Fellowship for Independent AI Alignment Research

OpenAI introduced its Safety Fellowship on X and published program details on April 6, 2026 for external researchers and practitioners working on AI safety and alignment. The move is notable because it extends work on evaluation, robustness, privacy-preserving safety methods, and agentic oversight beyond OpenAI’s internal teams.

#openai #ai-safety #alignment

AI sources.twitter Apr 6, 2026 2 min read

OpenAI opens applications for a Safety Fellowship focused on alignment and misuse research

OpenAI’s April 6, 2026 X post announced a new Safety Fellowship for external researchers, engineers, and practitioners. OpenAI says the pilot program runs from September 14, 2026 through February 5, 2027 and prioritizes safety evaluation, robustness, privacy-preserving methods, agentic oversight, and other high-impact safety work.

#openai #ai-safety #alignment

LLM sources.twitter Apr 4, 2026 2 min read

Anthropic introduces a “diff” tool for spotting behavioral differences across AI models

Anthropic said on April 3, 2026 that its Fellows program had produced a new method for surfacing behavioral differences between AI models. The accompanying research frames the tool as a high-recall screening method for finding novel model-specific behaviors that standard benchmarks may miss.

#anthropic #model-diffing #ai-safety

AI sources.twitter Apr 1, 2026 2 min read

Anthropic signs Australia MOU on AI safety research and National AI Plan support

Anthropic said on March 31, 2026 that it signed an MOU with the Australian government to collaborate on AI safety research and support Australia’s National AI Plan. Anthropic says the agreement includes work with Australia’s AI Safety Institute, Economic Index data sharing, and AUD$3 million in partnerships with Australian research institutions.

#anthropic #australia #ai-safety