A post on r/MachineLearning argues that LoCoMo’s leaderboard is being treated with more confidence than its evaluation setup deserves. The audit claims the benchmark has a 6.4% ground-truth error rate and that its judge accepts intentionally wrong but topically adjacent answers far too often, turning attention from raw scores to benchmark reliability.
#benchmarks
RSS FeedA Hacker News post pushed ATLAS into the spotlight by framing a consumer-GPU coding agent as a serious cost challenger to hosted systems. The headline benchmark is interesting, but the repository itself makes clear that its 74.6% result is not a controlled head-to-head against Claude 4.5 Sonnet because the task counts and evaluation protocols differ.
Google DeepMind has published a cognitive taxonomy for evaluating progress toward AGI and paired it with a Kaggle hackathon to build new benchmarks. The framework maps AI systems against human baselines across 10 cognitive abilities instead of relying on a single headline score.
A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains on prompt processing rather than raw generation alone. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.
A fresh r/LocalLLaMA post argues that the main bottleneck in Graph-RAG multi-hop QA is often reasoning rather than retrieval. The linked paper suggests structured prompting and graph-based context compression can let an open Llama 8B model match or beat a plain 70B baseline at a much lower cost.
A Show HN repo claims that duplicating a few LLM layers can improve reasoning without training or weight changes. The underlying README, however, shows real tradeoffs, making this more convincing as capability steering than as a universal model upgrade.
Google DeepMind published a new framework for evaluating progress toward AGI on March 17, 2026. The proposal tries to shift the discussion from single benchmark scores toward a structured map of human-like cognitive capabilities.
Google DeepMind said on March 17, 2026 that it has published a new cognitive-science framework for evaluating progress toward AGI and launched a Kaggle hackathon to turn that framework into practical benchmarks. The proposal defines 10 cognitive abilities, recommends comparison against human baselines, and puts $200,000 behind community-built evaluations.
Google DeepMind said on X that it is launching a Kaggle hackathon with $200,000 in prizes to build new cognitive evaluations for AI. The linked Google post says the effort is part of a broader framework for measuring AGI progress across 10 cognitive abilities rather than a single benchmark.
OpenAI said on March 10, 2026 that its new IH-Challenge dataset improves instruction hierarchy behavior in frontier LLMs, with gains in safety steerability and prompt-injection robustness. The company also released the dataset publicly on Hugging Face to support further research.
A fast-rising r/LocalLLaMA thread says the community has already submitted nearly 10,000 Apple Silicon benchmark runs across more than 400 models. The post matters because it replaces scattered anecdotes with a shared dataset that begins to show consistent throughput patterns across M-series chips and context lengths.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.