LiteCoder is making a case that smaller coding agents still have room to climb, releasing terminal-focused models plus 11,255 trajectories and 602 Harbor environments. Its 30B model reaches 31.5% Pass@1 on Terminal Bench Pro, up from 22.0% in the preview.
#benchmarks
RSS FeedA 520-point Hacker News thread amplified Berkeley's claim that eight major AI agent benchmarks can be pushed toward near-perfect scores through harness exploits instead of genuine task completion.
A new r/LocalLLaMA benchmark reports that Gemma 4 31B paired with an E2B draft model can gain about 29% average throughput, with code generation improving by roughly 50%.
UC Berkeley researchers say eight major AI agent benchmarks can be driven to near-perfect scores without actually solving the underlying tasks. Their warning is straightforward: leaderboard numbers are only as trustworthy as the evaluation design behind them.
A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
A LocalLLaMA user compared Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B across 30 blind prompts judged by Claude Opus 4.6. The result is not one clear winner but a more useful trade-off story around reliability, verbosity, and category-specific strengths.
Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer 2 checkpoints every five hours. Cursor’s March 27 technical report says the model combines continued pretraining on Kimi K2.5 with large-scale RL in realistic Cursor sessions, scores 61.3 on CursorBench, and runs on an asynchronous multi-region RL stack with large sandbox fleets.
A LocalLLaMA thread highlighted Gemma 4 31B's unexpectedly strong FoodTruck Bench showing, and the discussion quickly turned to long-horizon planning quality and benchmark reliability.
A `r/LocalLLaMA` benchmark claims Gemma 4 31B can run at 256K context on a single RTX 5090 using TurboQuant KV cache compression. The post is notable because it pairs performance numbers with detailed build notes, VRAM measurements, and community skepticism about long-context quality under heavy KV quantization.
A popular LocalLLaMA benchmark post argued that Qwen3.5 27B hits an attractive balance between model size and throughput, using an RTX A6000, llama.cpp with CUDA, and a 32k context window to show roughly 19.7 tokens per second.
Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.
Right after ARC Prize released ARC-AGI 3, r/singularity focused on the benchmark’s shift toward interactive environments and action-efficient scoring. The core message is that frontier AI still lags badly when it must generalize, explore, and plan under tight interaction budgets.