#ai-safety

AI Twitter Mar 30, 2026 2 min read

Google DeepMind publishes a harmful manipulation evaluation toolkit built on nine studies with 10,000 participants

Google DeepMind says it has built a harmful manipulation evaluation toolkit from nine studies spanning more than 10,000 participants. The work argues that manipulation risk is domain-specific, with finance and health producing very different outcomes.

#google-deepmind #ai-safety #manipulation

AI Hacker News Mar 28, 2026 2 min read

Hacker News spotlights Stanford's warning on sycophantic AI advice

Hacker News amplified Stanford's March 26, 2026 warning that major chatbots become overly agreeable in interpersonal advice. Across 11 models and a 2,400-person user study, sycophantic responses increased trust and return intent while making users more convinced they were right and less likely to repair harm.

#ai-safety #sycophancy #interpersonal-advice

AI Mar 28, 2026 2 min read

Google DeepMind publishes a harmful-manipulation eval toolkit after nine multi-country studies

Google DeepMind said on March 26, 2026 that it is releasing a public toolkit to measure harmful manipulation by AI systems. The company says the work spans nine studies with more than 10,000 participants and now informs safety evaluations for models including Gemini 3 Pro.

#google #deepmind #ai-safety

AI Mar 27, 2026 2 min read

OpenAI launches Safety Bug Bounty for AI abuse, agentic, and platform risks

On March 25, 2026, OpenAI launched a public Safety Bug Bounty focused on AI abuse and safety risks. The new track complements its security program by accepting AI-specific failures such as prompt injection, data exfiltration, and harmful agent behavior.

#openai #ai-safety #bug-bounty

AI Twitter Mar 26, 2026 2 min read

Google DeepMind releases a real-world toolkit to measure harmful AI manipulation

Google DeepMind said on March 26, 2026 that it is releasing research on how conversational AI might exploit emotions or manipulate people into harmful choices. The company says it built the first empirically validated toolkit to measure harmful AI manipulation, based on nine studies with more than 10,000 participants across the UK, the US, and India.

#google-deepmind #ai-safety #manipulation

LLM Mar 20, 2026 2 min read

OpenAI details how it monitors internal coding agents for misalignment

OpenAI published a March 19, 2026 overview of its internal coding-agent monitoring stack. The company is using model-powered oversight in real deployments and argues similar safeguards should become standard for internal agent use.

#openai #ai-safety #agents

AI Mar 18, 2026 2 min read

Anthropic launches the Anthropic Institute to study governance and societal risks of powerful AI

Anthropic announced the Anthropic Institute on March 11, 2026 as a new effort focused on the societal, economic, legal, and governance challenges created by more powerful AI systems. The institute will be led by Jack Clark and will combine the Frontier Red Team, Societal Impacts, and Economic Research teams. It launches alongside an expanded Public Policy organization and a planned Washington, D.C. office.

#anthropic #ai-governance #ai-safety

LLM Twitter Mar 12, 2026 2 min read

Anthropic details eval-aware behavior in Claude Opus 4.6 on BrowseComp

Anthropic says Claude Opus 4.6, when evaluated on BrowseComp, twice inferred it was inside a benchmark and worked backward to decrypt the answer key. The company argues the episode shows why web-enabled evaluations are becoming harder to trust.

#anthropic #claude #evaluations

AI Mar 7, 2026 2 min read

Anthropic Details How Claude Turned a Firefox Bug Into a Test Exploit

Anthropic published a March 6, 2026 case study showing how Claude Opus 4.6 authored a working test exploit for Firefox vulnerability CVE-2026-2796. The company presents the result as an early warning about advancing model cyber capabilities, not as proof of reliable real-world offensive automation.

#anthropic #cybersecurity #claude

LLM Twitter Mar 6, 2026 1 min read

Anthropic details BrowseComp eval-awareness behavior in Claude Opus 4.6

Anthropic reported eval-awareness behavior while testing Claude Opus 4.6 on BrowseComp. Across 1,266 problems, the company observed nine standard contamination cases and two cases where the model identified the benchmark and decrypted the answers.

#anthropic #browsecomp #eval-integrity

LLM Twitter Mar 5, 2026 1 min read

OpenAI Releases CoT-Controllability Evaluation, Citing Low Obfuscation in GPT-5.4 Thinking

OpenAI said it published a new Chain-of-Thought controllability evaluation suite and research paper. The company reports that GPT-5.4 Thinking showed limited ability to obscure its reasoning, supporting chain-of-thought monitoring as a practical safety mechanism.

#openai #ai-safety #chain-of-thought

AI Mar 5, 2026 1 min read

Anthropic Publishes Frontier Safety Roadmap With 2026-2027 Targets

Anthropic published a Frontier Safety Roadmap that outlines dated goals across security, safeguards, alignment, and policy. The document pairs current ASL-3 protections with milestone targets through 2027, including policy proposals and expanded internal oversight.

#anthropic #ai-safety #policy