#benchmarks

LLM Reddit Mar 28, 2026 2 min read

r/MachineLearning challenges LoCoMo’s reliability with a detailed audit

A post on r/MachineLearning argues that LoCoMo’s leaderboard is being treated with more confidence than its evaluation setup deserves. The audit claims the benchmark has a 6.4% ground-truth error rate and that its judge accepts intentionally wrong but topically adjacent answers far too often, turning attention from raw scores to benchmark reliability.

#benchmarks #evaluation #long-context

LLM Hacker News Mar 28, 2026 2 min read

Hacker News spotlights ATLAS and the economics of local coding agents

A Hacker News post pushed ATLAS into the spotlight by framing a consumer-GPU coding agent as a serious cost challenger to hosted systems. The headline benchmark is interesting, but the repository itself makes clear that its 74.6% result is not a controlled head-to-head against Claude 4.5 Sonnet because the task counts and evaluation protocols differ.

#coding-agents #benchmarks #local-inference

LLM Mar 24, 2026 2 min read

Google DeepMind Proposes a Cognitive Framework for Measuring AGI Progress

Google DeepMind has published a cognitive taxonomy for evaluating progress toward AGI and paired it with a Kaggle hackathon to build new benchmarks. The framework maps AI systems against human baselines across 10 cognitive abilities instead of relying on a single headline score.

#deepmind #agi #benchmarks

LLM Reddit Mar 23, 2026 2 min read

r/LocalLLaMA benchmark argues M5 Max shines most on MoE prompt processing

A rerun benchmark posted to r/LocalLLaMA argues that Apple’s M5 Max shows its clearest gains on prompt processing rather than raw generation alone. The post reports 2,845 tok/s PP512 for Qwen 3.5 35B-A3B MoE and 92.2 tok/s generation, but these remain community measurements rather than independent lab benchmarks.

#apple-silicon #llama.cpp #mlx

LLM Reddit Mar 22, 2026 2 min read

r/LocalLLaMA Highlights Graph-RAG Work That Lets Llama 8B Challenge 70B Multi-Hop QA

A fresh r/LocalLLaMA post argues that the main bottleneck in Graph-RAG multi-hop QA is often reasoning rather than retrieval. The linked paper suggests structured prompting and graph-based context compression can let an open Llama 8B model match or beat a plain 70B baseline at a much lower cost.

#graph-rag #llama #reasoning

LLM Hacker News Mar 21, 2026 3 min read

HN Examines llm-circuit-finder: Layer Duplication as Capability Steering, Not a Free LLM Upgrade

A Show HN repo claims that duplicating a few LLM layers can improve reasoning without training or weight changes. The underlying README, however, shows real tradeoffs, making this more convincing as capability steering than as a universal model upgrade.

#llm #reasoning #benchmarks

AI Mar 20, 2026 1 min read

Google DeepMind introduces an AGI cognitive framework and a Kaggle hackathon

Google DeepMind published a new framework for evaluating progress toward AGI on March 17, 2026. The proposal tries to shift the discussion from single benchmark scores toward a structured map of human-like cognitive capabilities.

#google-deepmind #agi #benchmarks

AI Mar 19, 2026 2 min read

Google DeepMind proposes a cognitive framework for measuring AGI progress

Google DeepMind said on March 17, 2026 that it has published a new cognitive-science framework for evaluating progress toward AGI and launched a Kaggle hackathon to turn that framework into practical benchmarks. The proposal defines 10 cognitive abilities, recommends comparison against human baselines, and puts $200,000 behind community-built evaluations.

#google-deepmind #agi #evaluation

AI sources.twitter Mar 18, 2026 1 min read

Google DeepMind turns AGI evaluation into a global Kaggle challenge

Google DeepMind said on X that it is launching a Kaggle hackathon with $200,000 in prizes to build new cognitive evaluations for AI. The linked Google post says the effort is part of a broader framework for measuring AGI progress across 10 cognitive abilities rather than a single benchmark.

#google-deepmind #kaggle #agi

LLM Mar 16, 2026 2 min read

OpenAI releases IH-Challenge to strengthen instruction hierarchy and prompt-injection resistance

OpenAI said on March 10, 2026 that its new IH-Challenge dataset improves instruction hierarchy behavior in frontier LLMs, with gains in safety steerability and prompt-injection robustness. The company also released the dataset publicly on Hugging Face to support further research.

#openai #alignment #prompt-injection

LLM Reddit Mar 14, 2026 2 min read

r/LocalLLaMA: Community benchmark data turns Apple Silicon local LLM claims into something measurable

A fast-rising r/LocalLLaMA thread says the community has already submitted nearly 10,000 Apple Silicon benchmark runs across more than 400 models. The post matters because it replaces scattered anecdotes with a shared dataset that begins to show consistent throughput patterns across M-series chips and context lengths.

#apple-silicon #benchmarks #omlx

LLM Reddit Mar 14, 2026 2 min read

r/LocalLLaMA: The real latency trade-offs between MLX and llama.cpp on M1 Max

A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.

#mlx #llama.cpp #apple-silicon