HN treated GPT-5.5 less like another model launch and more like a test of whether AI can actually carry messy computer tasks to completion. The discussion kept drifting from benchmarks to rollout timing, API access, and whether the gains show up in real coding work.
#benchmark
A r/LocalLLaMA benchmark compared 21 local coding models on HumanEval+, speed, and memory, putting Qwen 3.6 35B-A3B on top while surfacing practical RAM and tok/s trade-offs.
Why it matters: document agents fail when parsers drop tables, chart values, or visual grounding. ParseBench probes exactly that, with about 2,000 enterprise document pages, 167K+ rule-based checks, and 14 evaluated parsing methods.
r/singularity did not read an 88% fail rate as pure failure; many users saw the same number as a 12% foothold, while others cautioned that the benchmark's age and the lack of real robot platforms limit what the number shows.
Quantization only matters when the accuracy hit stays small enough for production use. Red Hat AI says its quantized Gemma 4 31B keeps 99%+ accuracy while delivering nearly 2x tokens/sec at half the memory footprint, with weights released openly via LLM Compressor.
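The "half the memory footprint" claim follows directly from the bit-width arithmetic. A back-of-envelope sketch (not Red Hat's methodology; it counts weight storage only and ignores KV cache, activations, and runtime overhead):

```python
def weight_memory_gb(params_b: float, bits_per_weight: float) -> float:
    """Approximate weight storage for a model: params * bits / 8 bytes.
    Ignores KV cache, activations, and framework overhead."""
    return params_b * 1e9 * bits_per_weight / 8 / 1e9

fp16 = weight_memory_gb(31, 16)  # 31B params at 16-bit
int8 = weight_memory_gb(31, 8)   # same model quantized to 8-bit
print(f"FP16: {fp16:.0f} GB, INT8: {int8:.0f} GB, ratio: {fp16 / int8:.1f}x")
```

The near-2x throughput gain tracks the same ratio: memory-bandwidth-bound decoding moves half as many bytes per token.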
This is the kind of numeric jump that makes multi-agent research hard to ignore. Together says EinsteinArena agents raised the 11-dimensional kissing number lower bound from 593 to 604 and had already logged 11 new SOTA results on open problems by April 11.
A detailed `r/LocalLLaMA` benchmark reports that pairing `Gemma 4 31B` with `Gemma 4 E2B` as a draft model in `llama.cpp` lifted average throughput from `57.17 t/s` to `73.73 t/s`.
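The speedup comes from speculative decoding: the small draft model guesses several tokens cheaply, and the large model verifies them in one pass, keeping the longest agreeing prefix. A minimal greedy-variant sketch with toy models standing in for real LLMs (real implementations like llama.cpp's batch the verification step rather than calling the target per token):

```python
from typing import Callable, List

def speculative_step(target: Callable[[List[int]], int],
                     draft: Callable[[List[int]], int],
                     ctx: List[int], k: int) -> List[int]:
    """One speculative round (greedy variant): the draft proposes k tokens,
    the target checks them in order, and we keep the longest agreeing prefix
    plus one correction token from the target so progress is always made."""
    proposal, c = [], list(ctx)
    for _ in range(k):
        t = draft(c)
        proposal.append(t)
        c.append(t)
    accepted, c = [], list(ctx)
    for t in proposal:
        if target(c) == t:     # draft guess matches the target's choice
            accepted.append(t)
            c.append(t)
        else:
            break              # first disagreement: discard the rest
    accepted.append(target(c)) # target always contributes one token
    return accepted
```

When the draft agrees often, each expensive target pass yields several tokens instead of one, which is where uplifts like 57 to 74 t/s come from.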
vLLM said NVIDIA used the framework for the first MLPerf vision-language benchmark submission built on Qwen3-VL. NVIDIA’s accompanying blog places that result inside a broader Blackwell Ultra push that claims up to 2.7x throughput gains and more than 60% lower token cost on the same infrastructure for some workloads.
A LocalLLaMA post with roughly 350 points argues that Gemma 4 26B A3B becomes unusually effective for local coding-agent and tool-calling workflows when paired with the right runtime settings. The poster contrasts this with prompt-caching and function-calling problems they hit in other local-model setups.
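The tool-calling loop those workflows depend on is simple at its core: the model emits a structured call, and the runtime parses it and dispatches to a local function. A minimal sketch using the common `{"name": ..., "arguments": {...}}` JSON convention (an illustrative format, not tied to any particular model or runtime):

```python
import json
from typing import Any, Callable, Dict

def dispatch_tool_call(raw: str, tools: Dict[str, Callable[..., Any]]) -> Any:
    """Parse a model's JSON tool call and invoke the matching local function.
    Raises KeyError for unknown tools, json.JSONDecodeError for malformed
    output, which is exactly where weaker local models tend to fail."""
    call = json.loads(raw)
    fn = tools[call["name"]]
    return fn(**call.get("arguments", {}))

tools = {"add": lambda a, b: a + b}
result = dispatch_tool_call('{"name": "add", "arguments": {"a": 2, "b": 3}}', tools)
print(result)  # 5
```

Much of the "right runtime settings" discussion amounts to keeping the model reliably inside this JSON contract, since one malformed call breaks the agent loop.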
A recent LocalLLaMA discussion shared results from Mac LLM Bench, an open benchmark workflow for Apple Silicon systems. The most useful takeaway is practical: dense 32B models hit a clear wall on a 32 GB MacBook Air M5, while some MoE models offer a much better latency-to-capability tradeoff.
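The dense-vs-MoE gap has a simple bandwidth explanation: decoding is memory-bound, and every generated token must read all active weights. A dense 32B model touches every parameter per token, while a MoE with a few billion active parameters touches only those. A rough ceiling estimate, where the ~120 GB/s bandwidth figure and 4-bit quantization are illustrative assumptions, not measurements from the post:

```python
def decode_tps(active_params_b: float, bits: float, bandwidth_gbs: float) -> float:
    """Rough decode-speed ceiling for memory-bandwidth-bound generation:
    each token reads every active weight once."""
    bytes_per_token = active_params_b * 1e9 * bits / 8
    return bandwidth_gbs * 1e9 / bytes_per_token

BW = 120  # assumed unified-memory bandwidth in GB/s (illustrative)
dense = decode_tps(32, 4, BW)  # dense: all 32B params read per token
moe = decode_tps(3, 4, BW)     # MoE: only ~3B active params read per token
print(f"dense 32B: {dense:.1f} tok/s, ~3B-active MoE: {moe:.1f} tok/s")
```

Under these assumptions the MoE's per-token ceiling is an order of magnitude higher, which matches the latency-to-capability tradeoff the benchmark surfaced.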
A detailed LocalLLaMA post compared a $10K Mac Studio M3 Ultra 512GB with a similarly priced dual DGX Spark setup for running Qwen3.5 397B A17B locally. The Mac delivered 30 to 40 tok/s and easier setup, while the dual Spark build offered faster prefill and embedding performance at much higher operational complexity.
A March 2026 r/singularity post with 203 points and 82 comments highlighted Symbolica’s claim that its Agentica SDK reached an unverified 36.08% on ARC-AGI-3. The headline numbers were 113 of 182 playable levels solved, 7 of 25 games completed, and a much lower reported cost than chain-of-thought baselines.