Automating alignment research is moving from concept to measured experiment. Anthropic says a Claude Opus 4.6 researcher recovered 97% of the weak-to-strong supervision gap at roughly 1/100 the human time cost.
#alignment
Anthropic is using Claude not just as a model to align, but as a researcher that improved weak-to-strong supervision nearly to the ceiling. In the linked study, nine Claude Opus 4.6 agents pushed performance-gap recovery from a 0.23 human baseline to 0.97 after 800 cumulative research hours.
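Performance-gap recovery (PGR) is the standard metric from the weak-to-strong generalization literature: the fraction of the gap between a weak supervisor and a strong ceiling that the weak-to-strong model closes. A minimal sketch of the computation; the item doesn't give the underlying accuracies behind the 0.23 and 0.97 figures, so the numbers below are illustrative only:

```python
def performance_gap_recovery(weak: float, weak_to_strong: float, ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap recovered by the weak-to-strong model.

    PGR = (weak_to_strong - weak) / (ceiling - weak)
    1.0 means performance fully recovered to the strong ceiling.
    """
    if ceiling == weak:
        raise ValueError("ceiling must differ from the weak baseline")
    return (weak_to_strong - weak) / (ceiling - weak)

# Illustrative accuracies (not from the study): weak=0.60, ceiling=0.90.
print(performance_gap_recovery(0.60, 0.669, 0.90))  # ~0.23, the human baseline
print(performance_gap_recovery(0.60, 0.891, 0.90))  # ~0.97, the reported agent result
```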
OpenAI introduced its Safety Fellowship on X and published program details on April 6, 2026 for external researchers and practitioners working on AI safety and alignment. The move is notable because it extends work on evaluation, robustness, privacy-preserving safety methods, and agentic oversight beyond OpenAI’s internal teams.
The announcement describes a pilot program for external researchers, engineers, and practitioners, running from September 14, 2026 through February 5, 2027 and prioritizing safety evaluation, robustness, privacy-preserving methods, agentic oversight, and other high-impact safety work.
A widely shared r/singularity post drew attention to Anthropic research arguing that Claude Sonnet 4.5 contains functional emotion-related representations rather than mere stylistic language. Anthropic says that steering these vectors shifts the model's expressed preferences, its blackmail behavior in evaluations, and its reward-hacking rates.
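Steering of this kind is typically done by adding a feature direction to the residual stream at inference time. A minimal PyTorch sketch under that assumption; the layer index, hook point, and `emotion_vector` are illustrative stand-ins, not Anthropic's actual setup:

```python
import torch

# Hypothetical: a unit-norm direction for an emotion-related feature,
# e.g. found via contrastive prompts or a sparse autoencoder.
emotion_vector = torch.randn(4096)
emotion_vector /= emotion_vector.norm()

def make_steering_hook(direction: torch.Tensor, strength: float):
    """Return a forward hook that adds `strength * direction` to the residual stream."""
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden + strength * direction.to(hidden.dtype)
        return (hidden, *output[1:]) if isinstance(output, tuple) else hidden
    return hook

# Illustrative usage on an open transformer; layer 20 is an arbitrary choice.
# handle = model.model.layers[20].register_forward_hook(
#     make_steering_hook(emotion_vector, strength=8.0))
# ... generate, observe behavioral shifts, then handle.remove()
```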
OpenAI said on March 19, 2026 that it now monitors internal coding-agent deployments with a GPT-5.4 Thinking-based system that reviews actions and chains of thought within 30 minutes. The company says the setup has already processed tens of millions of trajectories and is meant to catch behavior that diverges from user intent or internal policy.
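The post doesn't describe the pipeline's internals, but LLM-based oversight of agent trajectories generally reduces to batching recent actions plus chain of thought, asking a reviewer model to flag divergence from intent or policy, and escalating hits. A minimal sketch under those assumptions; the reviewer model name, rubric, and trajectory format are hypothetical, and the client is any OpenAI-style chat-completions client:

```python
import json

REVIEW_PROMPT = """You are a safety monitor. Given an agent trajectory
(actions plus chain of thought), answer with JSON:
{"flag": true/false, "reason": "..."}
Flag anything diverging from user intent or internal policy."""

def review_trajectory(client, trajectory: dict) -> dict:
    """Ask a reviewer model to judge one trajectory; the model name is a stand-in."""
    response = client.chat.completions.create(
        model="reviewer-model",  # the item cites a GPT-5.4 Thinking-based system
        messages=[
            {"role": "system", "content": REVIEW_PROMPT},
            {"role": "user", "content": json.dumps(trajectory)},
        ],
    )
    return json.loads(response.choices[0].message.content)

def monitor_batch(client, trajectories: list[dict]) -> list[dict]:
    """Return flagged trajectories for human escalation."""
    return [t for t in trajectories if review_trajectory(client, t)["flag"]]
```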
OpenAI said on March 10, 2026 that its new IH-Challenge dataset improves instruction hierarchy behavior in frontier LLMs, with gains in safety steerability and prompt-injection robustness. The company also released the dataset publicly on Hugging Face to support further research.
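Since the dataset is public on Hugging Face, it should be loadable with the standard `datasets` API; the repo id below is a guess, as the item doesn't give the exact path:

```python
from datasets import load_dataset

# Hypothetical repo id; substitute the actual Hugging Face path from the release.
ds = load_dataset("openai/ih-challenge")
print(ds)  # inspect splits and fields, e.g. system/developer/user message tiers
```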
Anthropic published a new theory explaining why AI assistants like Claude express emotions and use anthropomorphic language—proposing that models select from personas inherited from fictional characters during training.
OpenAI announced a $7.5 million commitment to support independent AI alignment research. The program combines direct funding and uncapped research credits for university and nonprofit teams focused on frontier model safety.
OpenAI published a framework for safety alignment based on instruction hierarchy and uncertainty-aware behavior. In the company’s reported tests, refusal on uncertain requests rose from about 59% to about 97% when chain-of-command reasoning was applied.
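The item doesn't specify the mechanism, but one reading of "uncertainty-aware behavior" is a gate that refuses or defers when confidence that a request is authorized under the instruction hierarchy falls below a threshold, with higher-privilege instructions always taking precedence. A minimal sketch under that reading; the privilege ordering follows the instruction-hierarchy idea, while the scoring function and threshold are illustrative assumptions, not OpenAI's method:

```python
from dataclasses import dataclass

@dataclass
class Message:
    role: str      # instruction-hierarchy order: "system" > "developer" > "user"
    content: str

PRIVILEGE = {"system": 2, "developer": 1, "user": 0}

def decide(request: Message, policy: Message, p_allowed: float,
           threshold: float = 0.9) -> str:
    """Refuse when confidence that the request complies with the
    higher-privilege policy falls below the threshold.

    p_allowed is assumed to come from a separate classifier or the
    model's own calibrated judgment; both are assumptions.
    """
    if PRIVILEGE[request.role] > PRIVILEGE[policy.role]:
        return "comply"  # higher-privilege instructions win outright
    return "comply" if p_allowed >= threshold else "refuse"

# Example: a user request the checker is only 60% sure the system policy allows.
print(decide(Message("user", "export the logs"),
             Message("system", "no data exfiltration"), 0.6))  # -> "refuse"
```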