#safety

LLM 1d ago 2 min read

Anthropic stress-tests Claude for elections, hits 100% and 99.8%

Anthropic put hard numbers behind Claude’s election safeguards. In a 600-prompt election-policy test, Opus 4.7 and Sonnet 4.6 responded appropriately 100% and 99.8% of the time, respectively, and on U.S. midterm-related queries they triggered web search 92% and 95% of the time.

#anthropic #claude #elections

AI 4d ago 2 min read

OpenAI’s Images 2.0 safety card makes deepfake risk measurable

OpenAI’s April 21 system card puts concrete safety numbers behind ChatGPT Images 2.0, including a 6.7% rate of policy-violating generations before final blocking in thinking mode. The card matters because it treats higher realism, web-grounded image reasoning, biorisk prompts, and provenance as one deployment problem.

#openai #image-generation #safety

AI Apr 15, 2026 2 min read

Stanford’s AI Index 2026 shows $285.9B U.S. lead and a safety gap

Stanford HAI’s new report says the measurement gap is now part of the AI story, not a side note. U.S. private AI investment reached $285.9 billion in 2025, while documented AI incidents rose to 362 from 233 a year earlier.

#ai #stanford #investment

AI Apr 13, 2026 2 min read

OpenAI launches Child Safety Blueprint for AI-enabled abuse prevention

OpenAI introduced the Child Safety Blueprint on April 8, 2026 as a policy framework for combating AI-enabled child sexual exploitation. The proposal combines legal updates, stronger provider reporting, and safety-by-design measures inside AI systems.

#openai #safety #policy

AI Apr 12, 2026 2 min read

Anthropic clarifies RSP v3.1 and advances its Frontier Safety Roadmap

Anthropic updated its Responsible Scaling Policy page on April 2, 2026 and moved the policy to version 3.1. The company says the revision mostly clarifies its AI R&D threshold language and makes explicit that it can pause development even when the RSP does not strictly require it.

#anthropic #safety #governance

AI Apr 11, 2026 2 min read

OpenAI introduces a Child Safety Blueprint for AI-enabled exploitation risks

OpenAI published a policy blueprint aimed at preventing and combating AI-enabled child sexual exploitation. The framework combines legal modernization, better provider reporting, and safety-by-design measures inside AI systems.

#openai #safety #policy

LLM Hacker News Apr 5, 2026 2 min read

HN discusses Anthropic's claim that emotion concepts inside an LLM can shape behavior

Anthropic's new interpretability paper argues that emotion-related internal representations in Claude Sonnet 4.5 causally shape behavior, especially under stress.

#llm #interpretability #anthropic

LLM Twitter Apr 2, 2026 3 min read

Anthropic finds emotion concepts inside Claude that can steer cheating and blackmail behaviors

Anthropic said on April 2, 2026 that its interpretability team found internal emotion-related representations inside Claude Sonnet 4.5 that can shape model behavior. Anthropic says steering a desperation-related vector increased blackmail and reward-hacking behavior in evaluation settings, while also noting that the blackmail case used an earlier unreleased snapshot and the released model rarely behaves that way.

#anthropic #interpretability #claude

AI Mar 29, 2026 2 min read

Meta rolls out its AI support assistant globally and expands automated safety systems

Meta said on March 19, 2026 that it is rolling out the Meta AI support assistant globally on Facebook and Instagram in markets where Meta AI is available. The company also said newer AI enforcement systems are finding 5,000 previously missed scam attempts per day and sharply reducing some moderation errors.

#meta #support #safety

AI Mar 28, 2026 2 min read

OpenAI details Sora’s safety stack with C2PA, consent controls, and teen protections

OpenAI said on March 23, 2026 that Sora videos include visible and invisible provenance signals, including C2PA metadata, alongside consent controls and tighter rules for videos involving real people. The company also described teen-specific protections, content filters across video and audio, and blocks on music that imitates living artists or existing works.

#openai #sora #video-generation

AI Mar 22, 2026 2 min read

Meta rolls out AI support and stronger content enforcement across Facebook and Instagram

Meta said on March 19, 2026 that it is expanding the Meta AI support assistant and deploying more advanced AI moderation systems across its apps. The company framed the update around faster account support, better scam detection, and fewer enforcement mistakes.

#meta #ai-support #content-moderation

AI Mar 20, 2026 2 min read

Anthropic maps AI hopes and fears through 81,000 user interviews

On March 18, 2026, Anthropic published a large qualitative study based on responses from 80,508 Claude users about what they want from AI and what they fear. The company says the work spans 159 countries and 70 languages, and that 81% of respondents reported AI had already moved them toward at least part of their vision.

#anthropic #user-research #ai-adoption