#ai-evals

AI · Hacker News · Apr 12, 2026 · 2 min read

Small Open Models Reproduce Key Mythos Vulnerability Analysis

An AISLE post that surged on Hacker News argues that Anthropic’s Mythos launch validates the category but does not establish an exclusive moat. In AISLE’s tests, small and open models recovered major parts of the showcased vulnerability analysis once the relevant code path was isolated.

#cybersecurity #ai-evals #mythos

AI · Reddit · Mar 29, 2026 · 2 min read

r/artificial spotlights BullshitBench v2 as Claude leads the nonsense-detection board

An r/artificial link post resurfaced BullshitBench v2, a community benchmark built around 100 nonsense prompts and a 3-judge panel. The current public leaderboard places Claude Sonnet 4.6 with high reasoning at a 91% green rate and a 3% red rate, but the results should be read as a community signal rather than a neutral standard.

#ai-evals #benchmarking #claude