Automating alignment research is moving from concept to measured experiment. Anthropic says a Claude Opus 4.6 researcher recovered 97% of the weak-to-strong supervision gap at roughly 1/100 the human time cost.
#alignment
Anthropic is using Claude not just as a model to align, but as a researcher that improved weak-to-strong supervision nearly to the ceiling. In the linked study, nine Claude Opus 4.6 agents pushed performance-gap recovery from a 0.23 human baseline to 0.97 after 800 cumulative research hours.
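Performance-gap recovery (PGR) is the standard metric from the weak-to-strong generalization literature: the fraction of the gap between a weak supervisor and a strong ceiling that the weak-to-strong model closes. A minimal sketch of the computation; the item doesn't give the underlying accuracies behind the 0.23 and 0.97 figures, so the numbers below are illustrative only:

```python
def performance_gap_recovery(weak: float, weak_to_strong: float, ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the weak-to-strong model.

    PGR = (weak_to_strong - weak) / (ceiling - weak)
    1.0 means performance fully recovered to the strong ceiling.
    """
    if ceiling == weak:
        raise ValueError("ceiling must differ from the weak baseline")
    return (weak_to_strong - weak) / (ceiling - weak)

# Illustrative accuracies (not from the study): weak=0.60, ceiling=0.90.
print(performance_gap_recovery(0.60, 0.669, 0.90))  # ~0.23, the human baseline
print(performance_gap_recovery(0.60, 0.891, 0.90))  # ~0.97, the reported agent result
```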
OpenAI introduced its Safety Fellowship on X and published program details on April 6, 2026 for external researchers and practitioners working on AI safety and alignment. The move is notable because it extends work on evaluation, robustness, privacy-preserving safety methods, and agentic oversight beyond OpenAI’s internal teams.
The announcement describes a pilot program for external researchers, engineers, and practitioners, running from September 14, 2026 through February 5, 2027 and prioritizing safety evaluation, robustness, privacy-preserving methods, agentic oversight, and other high-impact safety work.
A widely shared r/singularity post drew attention to Anthropic research arguing that Claude Sonnet 4.5 contains functional emotion-related representations rather than mere stylistic language. Anthropic says that steering these vectors shifts the model's expressed preferences, its blackmail behavior in evaluations, and its reward-hacking rates.
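Steering of this kind is typically done by adding a feature direction to the residual stream at inference time. A minimal PyTorch sketch under that assumption; the layer index, hook point, and `emotion_vector` are illustrative stand-ins, not Anthropic's actual setup:

```python
import torch

# Hypothetical: a unit-norm direction for an emotion-related feature,
# e.g. found via contrastive prompts or a sparse autoencoder.
emotion_vector = torch.randn(4096)
emotion_vector /= emotion_vector.norm()

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Return a forward hook that adds `strength * direction` to the residual stream."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Illustrative usage on an open transformer; layer 20 is an arbitrary choice.
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(emotion_vector, strength=8.0))
# ... generate, observe behavioral shifts, then handle.remove()
```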
OpenAI said on March 19, 2026 that it now monitors internal coding-agent deployments with a GPT-5.4 Thinking-based system that reviews actions and chains of thought within 30 minutes. The company says the setup has already processed tens of millions of trajectories and is meant to catch behavior that diverges from user intent or internal policy.
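The post doesn't describe the pipeline's internals, but LLM-based oversight of agent trajectories generally reduces to batching recent actions plus chain of thought, asking a reviewer model to flag divergence from intent or policy, and escalating hits. A minimal sketch under those assumptions; the reviewer model name, rubric, and trajectory format are hypothetical, and the client is any OpenAI-style chat-completions client:

```python
import json

REVIEW_PROMPT = """You are a safety monitor. Given an agent trajectory
(actions plus chain of thought), answer with JSON:
{"flag": true/false, "reason": "..."}
Flag anything diverging from user intent or internal policy."""

def review_trajectory(client, trajectory: dict) -> dict:
    """Ask a reviewer model to judge one trajectory; the model name is a stand-in."""
    response = client.chat.completions.create(
        model="reviewer-model",  # the item cites a GPT-5.4 Thinking-based system
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": json.dumps(trajectory)},
        ],
    )
    return json.loads(response.choices[0].message.content)

def monitor_batch(client, trajectories: list[dict]) -> list[dict]:
    """Return flagged trajectories for human escalation."""
    return [t for t in trajectories if review_trajectory(client, t)["flag"]]
```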
OpenAI said on March 10, 2026 that its new IH-Challenge dataset improves instruction hierarchy behavior in frontier LLMs, with gains in safety steerability and prompt-injection robustness. The company also released the dataset publicly on Hugging Face to support further research.
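Since the dataset is public on Hugging Face, it should be loadable with the standard `datasets` API; the repo id below is a guess, as the item doesn't give the exact path:

```python
from datasets import load_dataset

# Hypothetical repo id; substitute the actual Hugging Face path from the release.
ds = load_dataset("openai/ih-challenge")
print(ds)  # inspect splits and fields, e.g. system/developer/user message tiers
```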
Anthropic published a new theory explaining why AI assistants like Claude express emotions and use anthropomorphic language—proposing that models select from personas inherited from fictional characters during training.
OpenAI announced a $7.5 million commitment to support independent AI alignment research. The program combines direct funding and uncapped research credits for university and nonprofit teams focused on frontier model safety.
OpenAI published a framework for safety alignment based on instruction hierarchy and uncertainty-aware behavior. In the company’s reported tests, refusal on uncertain requests rose from about 59% to about 97% when chain-of-command reasoning was applied.
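The item doesn't specify the mechanism, but one reading of "uncertainty-aware behavior" is a gate that refuses or defers when confidence that a request is authorized under the instruction hierarchy falls below a threshold, with higher-privilege instructions always taking precedence. A minimal sketch under that reading; the privilege ordering follows the instruction-hierarchy idea, while the scoring function and threshold are illustrative assumptions, not OpenAI's method:

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # instruction-hierarchy order: "system" > "developer" > "user"
    content: str

PRIVILEGE = {"system": 2, "developer": 1, "user": 0}

def decide(request: Message, policy: Message, p_allowed: float,
           threshold: float = 0.9) -> str:
    """Refuse when confidence that the request complies with the
    higher-privilege policy falls below the threshold.

    p_allowed is assumed to come from a separate classifier or the
    model's own calibrated judgment; both are assumptions.
    """
    if PRIVILEGE[request.role] > PRIVILEGE[policy.role]:
        return "comply"  # higher-privilege instructions win outright
    return "comply" if p_allowed >= threshold else "refuse"

# Example: a user request the checker is only 60% sure the system policy allows.
print(decide(Message("user", "export the logs"),
             Message("system", "no data exfiltration"), 0.6))  # -> "refuse"
```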