#benchmarks

LLM Reddit Apr 18, 2026 1 min read

Qwen3.6 excitement turned into a GGUF runtime checklist on r/LocalLLaMA

The LocalLLaMA thread cared less about a release headline and more about which Qwen3.6 GGUF quant actually works. Unsloth’s benchmark post pushed the discussion into KLD, disk size, CUDA 13.2 failures, and the messy details that decide local inference quality.

#qwen #gguf #local-llm

LLM Research Apr 17, 2026 2 min read

LLM judges hide instability: 33-67% of documents break consistency

A new arXiv paper shows why low average violation rates can make LLM judges look safer than they are. On SummEval, 33-67% of documents showed at least one directed 3-cycle, and prediction-set width tracked absolute error strongly.

#llm #evaluation #benchmarks
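
As a rough illustration of the "directed 3-cycle" in the item above: when a judge's pairwise preferences over three outputs form a loop (A beats B, B beats C, C beats A), no consistent ranking exists. The sketch below checks one document's pairwise judgments for such a loop. The function name, the `beats` data layout, and the sample labels are assumptions for illustration, not the paper's code.

```python
from itertools import combinations


def has_directed_3_cycle(beats, items):
    """Return True if any three items form a cycle a > b > c > a.

    `beats` is a set of (winner, loser) pairs from pairwise judgments.
    """
    for a, b, c in combinations(items, 3):
        # An unordered triple can be cyclic in two orientations.
        forward = (a, b) in beats and (b, c) in beats and (c, a) in beats
        backward = (b, a) in beats and (c, b) in beats and (a, c) in beats
        if forward or backward:
            return True
    return False


# Hypothetical judgments over three summaries of one document.
cyclic = {("s1", "s2"), ("s2", "s3"), ("s3", "s1")}
transitive = {("s1", "s2"), ("s2", "s3"), ("s1", "s3")}

print(has_directed_3_cycle(cyclic, ["s1", "s2", "s3"]))      # True
print(has_directed_3_cycle(transitive, ["s1", "s2", "s3"]))  # False
```

A per-document check like this is why a low average violation rate can coexist with a high share of affected documents: a single cyclic triple is enough to flag the whole document.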

Sciences Reddit Apr 17, 2026 1 min read

Four failed replications put r/MachineLearning back on reproducibility

r/MachineLearning reacted to a sample that was small but painfully familiar: one user said 4 of 7 paper claims they checked this year did not reproduce, and 2 of those remain open as unresolved GitHub issues. The comments moved from resignation about reviewers not running code to concrete demands for submission-time reproducibility reports.

#machine-learning #reproducibility #research

LLM Apr 17, 2026 2 min read

HWE-Bench finds agents fix 70.7% of real hardware bugs

HWE-Bench moves LLM agent evaluation from isolated HDL tasks to repository-scale hardware repairs. The best agent solved 70.7% overall, but performance fell below 65% on complex SoC-level projects.

#agents #hardware #benchmarks

LLM Apr 17, 2026 2 min read

AIBuildAI reaches 63.1% medal rate for model-building agents

A new arXiv paper puts a hierarchical agent system at the top of MLE-Bench with a 63.1% medal rate. The result matters because the agent handles design, coding, debugging, training, and tuning from a task description plus data.

#agents #automl #benchmarks

AI Twitter Apr 17, 2026 2 min read

Claude Opus 4.7 hits 70% on CursorBench while keeping Opus price

Why it matters: Anthropic is pushing Opus toward longer autonomous coding work without raising the premium model price. The linked launch page says Opus 4.7 reaches 70% on CursorBench versus 58% for Opus 4.6, while API pricing stays at $5 per million input tokens and $25 per million output tokens.

#anthropic #claude-opus-4.7 #benchmarks
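
To make the quoted prices in the item above concrete, here is a small sketch of the per-request arithmetic at $5 per million input tokens and $25 per million output tokens. The function name and the token counts are hypothetical, and the sketch ignores any caching or batch discounts, which the summary does not mention.

```python
INPUT_USD_PER_MILLION = 5    # quoted input price
OUTPUT_USD_PER_MILLION = 25  # quoted output price


def request_cost_usd(input_tokens, output_tokens):
    """Cost of one request in USD at the quoted per-million-token prices."""
    total = (input_tokens * INPUT_USD_PER_MILLION
             + output_tokens * OUTPUT_USD_PER_MILLION)
    return total / 1_000_000


# Hypothetical long coding session: 200k tokens in, 20k tokens out.
print(request_cost_usd(200_000, 20_000))  # 1.5
```

Output tokens cost five times as much as input tokens at these rates, so long autonomous runs that generate a lot of text shift the bill toward the output side.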

LLM Apr 17, 2026 2 min read

IBM's VAKRA benchmark exposes where tool agents fail

IBM Research’s VAKRA moves agent evaluation from static Q&A into executable tool environments. With 8,000+ locally hosted APIs across 62 domains and 3-7 step reasoning chains, the benchmark finds a gap between surface tool use and reliable enterprise agents.

#agents #benchmarks #ibm

LLM Reddit Apr 17, 2026 2 min read

LocalLLaMA Turns a 'Model Got Dumber' Complaint Into a Measurement Problem

LocalLLaMA did not just vent about weaker models; the thread turned the feeling into questions about provider routing, quantization, peak-time behavior, and how to prove a silent downgrade. The evidence is not settled, but the anxiety is real.

#local-llm #benchmarks #model-quality

LLM Hacker News Apr 17, 2026 2 min read

HN Looks Past the Claude Opus 4.7 Headline to Adaptive Thinking, Tokens, and Trust

HN did not just ask whether Claude Opus 4.7 scores higher; it asked whether the product behavior is stable enough to build around. The thread quickly moved into adaptive thinking, tokenizer costs, safety filters, and bruised trust after recent Claude complaints.

#claude #llm #adaptive-thinking

AI Twitter Apr 16, 2026 1 min read

Cursor agents lift NVIDIA Blackwell CUDA kernels by 38%

Coding agents are being tested on GPU performance work, not just app scaffolding. Cursor says its NVIDIA collaboration produced a 38% geomean speedup across 235 CUDA kernel problems in three weeks.

#ai-agents #cuda #nvidia
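
For readers unfamiliar with the "geomean speedup" in the item above: a geometric mean averages ratios multiplicatively, so a 2x win and a 2x loss cancel out instead of looking like a net gain. The sketch below shows the calculation on made-up numbers; the function name and the sample values are assumptions, not Cursor's data or method.

```python
import math


def geomean(ratios):
    """Geometric mean of positive speedup ratios (1.0 means no change)."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical per-kernel speedups: one kernel 2x faster, one 2x slower.
speedups = [2.0, 0.5]

print(round(geomean(speedups), 6))    # 1.0
print(sum(speedups) / len(speedups))  # 1.25
```

Under this convention, a 38% geomean speedup corresponds to a geometric mean ratio of 1.38 across the kernel set.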

LLM Reddit Apr 15, 2026 2 min read

Reddit Tries to Put Numbers on the Feeling That Claude Got More Cautious

r/artificial latched onto this because it turned a vague complaint about Claude feeling drier and more evasive into a pile of concrete counts. The post is not an official benchmark, but that is exactly why it traveled: it reads like a field report from someone with enough logs to make the frustration measurable.

#claude #model-behavior #benchmarks

LLM Apr 15, 2026 2 min read

LiteCoder pushes terminal agents to 31.5% on Terminal Bench Pro

LiteCoder is making a case that smaller coding agents still have room to climb, releasing terminal-focused models plus 11,255 trajectories and 602 Harbor environments. Its 30B model reaches 31.5% Pass@1 on Terminal Bench Pro, up from 22.0% in the preview.

#litecoder #coding-agents #benchmarks