#benchmarks

LLM Apr 15, 2026 2 min read

LiteCoder pushes terminal agents to 31.5% on Terminal Bench Pro

LiteCoder is making a case that smaller coding agents still have room to climb, releasing terminal-focused models plus 11,255 trajectories and 602 Harbor environments. Its 30B model reaches 31.5% Pass@1 on Terminal Bench Pro, up from 22.0% in the preview.

#litecoder #coding-agents #benchmarks

AI Hacker News Apr 13, 2026 2 min read

Hacker News spotlights Berkeley's warning that top AI agent benchmarks are vulnerable to score hacking

A 520-point Hacker News thread amplified Berkeley's claim that eight major AI agent benchmarks can be pushed toward near-perfect scores through harness exploits instead of genuine task completion.

#ai-agents #benchmarks #evaluation

LLM Reddit Apr 12, 2026 2 min read

LocalLLaMA Benchmarks Gemma 4 Speculative Decoding at a 29% Average Speedup

A new r/LocalLLaMA benchmark reports that Gemma 4 31B paired with an E2B draft model can gain about 29% average throughput, with code generation improving by roughly 50%.

#gemma-4 #speculative-decoding #llama-cpp

AI Hacker News Apr 12, 2026 1 min read

Berkeley Shows How Benchmark Hacking Can Inflate AI Agent Scores

UC Berkeley researchers say eight major AI agent benchmarks can be driven to near-perfect scores without actually solving the underlying tasks. Their warning is straightforward: leaderboard numbers are only as trustworthy as the evaluation design behind them.

#benchmarks #ai-agents #evaluation

LLM Reddit Apr 10, 2026 2 min read

LocalLLaMA Benchmarks Qwen3.5-122B at 198 tok/s on Dual RTX PRO 6000 Blackwell

A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.

#qwen #blackwell #inference

LLM Reddit Apr 5, 2026 2 min read

A LocalLLaMA blind eval finds Qwen 3.5 wins more matchups while Gemma 4 posts higher averages

A LocalLLaMA user compared Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B across 30 blind prompts judged by Claude Opus 4.6. The result is not one clear winner but a more useful trade-off story around reliability, verbosity, and category-specific strengths.

#gemma-4 #qwen3.5 #benchmarks

LLM sources.twitter Apr 5, 2026 2 min read

Cursor details Composer 2’s training stack, from continued pretraining to real-world RL

Cursor said on March 26, 2026 that real-time reinforcement learning lets it ship improved Composer 2 checkpoints every five hours. Cursor’s March 27 technical report says the model combines continued pretraining on Kimi K2.5 with large-scale RL in realistic Cursor sessions, scores 61.3 on CursorBench, and runs on an asynchronous multi-region RL stack with large sandbox fleets.

#cursor #composer-2 #reinforcement-learning

LLM Reddit Apr 5, 2026 2 min read

LocalLLaMA debates Gemma 4 31B's surprising FoodTruck Bench result

A LocalLLaMA thread highlighted Gemma 4 31B's unexpectedly strong FoodTruck Bench showing, and the discussion quickly turned to long-horizon planning quality and benchmark reliability.

#llm #gemma #benchmarks

LLM Reddit Apr 4, 2026 2 min read

LocalLLaMA Benchmarks Gemma 4 31B at 256K Context on One RTX 5090

A `r/LocalLLaMA` benchmark claims Gemma 4 31B can run at 256K context on a single RTX 5090 using TurboQuant KV cache compression. The post is notable because it pairs performance numbers with detailed build notes, VRAM measurements, and community skepticism about long-context quality under heavy KV quantization.

#gemma4 #llama.cpp #kv-cache

LLM Reddit Mar 31, 2026 2 min read

LocalLLaMA Debates Qwen3.5 27B as a Practical Sweet Spot

A popular LocalLLaMA benchmark post argued that Qwen3.5 27B hits an attractive balance between model size and throughput, using an RTX A6000, llama.cpp with CUDA, and a 32k context window to show roughly 19.7 tokens per second.

#qwen3.5 #local-llm #benchmarks

LLM Reddit Mar 30, 2026 2 min read

r/MachineLearning Flags LoCoMo Errors and Weak Judge Reliability

Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.

#benchmarks #memory-systems #evaluation

AI Reddit Mar 30, 2026 2 min read

r/singularity Zeroes In on ARC-AGI 3 and Action-Efficiency Scoring

Right after ARC Prize released ARC-AGI 3, r/singularity focused on the benchmark’s shift toward interactive environments and action-efficient scoring. The core message is that frontier AI still lags badly when it must generalize, explore, and plan under tight interaction budgets.

#arc-agi #benchmarks #reasoning