A LocalLLaMA blind eval finds Qwen 3.5 wins more matchups while Gemma 4 posts higher averages
Original: Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge
A new LocalLLaMA post offers a compact but revealing blind evaluation of three popular local models: Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B. The author ran 30 prompts across five categories, held prompts and temperature constant across the three models, and used Claude Opus 4.6 as a single structured judge. Even if readers disagree with the judge choice, the write-up is useful because it separates win rate, average score, reliability, and category-specific behavior instead of collapsing everything into one headline number.
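The post does not publish its harness, so the following is only a minimal sketch of the aggregation it describes: per-task judge scores turned into win counts, raw averages, failure counts, and failure-excluded averages. The model keys are the real names from the post, but the score lists and the tie-breaking rule are illustrative placeholders, not the author's data.

```python
from collections import defaultdict

# Illustrative per-task judge scores on a 0-10 scale; the post's real 30-task
# score sheet is not reproduced here, so these numbers are placeholders.
scores = {
    "gemma4-31b":     [9.0, 8.5, 9.5, 8.0, 9.0],
    "gemma4-26b-a4b": [9.0, 0.0, 9.5, 8.5, 9.0],  # 0.0 marks an error / format failure
    "qwen3.5-27b":    [9.5, 9.0, 0.0, 9.5, 9.0],
}

def summarize(scores: dict[str, list[float]]) -> dict[str, dict]:
    n_tasks = len(next(iter(scores.values())))
    wins = defaultdict(int)
    for t in range(n_tasks):
        # Simplistic tie-break: first model in dict order takes a tied task.
        best = max(scores, key=lambda m: scores[m][t])
        wins[best] += 1
    report = {}
    for model, s in scores.items():
        completed = [x for x in s if x > 0.0]
        report[model] = {
            "wins": wins[model],
            "avg": round(sum(s) / len(s), 2),
            "avg_excl_failures": round(sum(completed) / len(completed), 2) if completed else 0.0,
            "failures": len(s) - len(completed),
        }
    return report

for model, stats in summarize(scores).items():
    print(model, stats)
```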
The results are more nuanced than a simple winner-takes-all story. Qwen 3.5 27B won 14 of 30 tasks, versus 12 for Gemma 4 31B and 4 for Gemma 4 26B-A4B. Average scores point the other way: both Gemma models landed at 8.82, while Qwen ended at 8.17 because it posted three 0.0 failures. The author argues that if those format failures or refusals are stripped out, Qwen’s adjusted average would jump to about 9.08. In short, Qwen may have the highest ceiling in this test, but Gemma appears steadier under the chosen setup.
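The ~9.08 figure is easy to sanity-check. Assuming the reported 8.17 mean covers all 30 tasks and the three failures scored exactly 0.0 (both assumptions, since the post only quotes summary numbers), the arithmetic works out as follows.

```python
# Back-of-the-envelope check of the adjusted average. Assumes the 8.17 mean
# spans all 30 tasks and the three failed tasks contributed exactly 0.0.
overall_mean, n_tasks, n_failures = 8.17, 30, 3
total_points = overall_mean * n_tasks             # 245.1 points over 30 tasks
adjusted = total_points / (n_tasks - n_failures)  # 245.1 / 27
print(round(adjusted, 2))                         # -> 9.08, matching the post
```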
The category breakdown is also instructive. Qwen led reasoning and analysis, Gemma 4 31B led communication, and code was basically tied. The 26B-A4B MoE variant matched the dense 31B model’s average when it worked but errored out twice, which makes it interesting from an efficiency perspective if reliability improves. The post also notes that Qwen generated 3-5x more tokens per response, raising the familiar question of whether higher raw performance justifies a verbosity tax in local deployments.
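The verbosity point is worth quantifying, even roughly. The 3-5x multiplier comes from the post, but the baseline response length and decode speed below are illustrative assumptions, since the thread does not report either number directly.

```python
# Rough verbosity-tax estimate. The 3x and 5x multipliers are from the post;
# the 400-token baseline response and 20 tok/s decode speed are assumptions
# chosen only to make the wall-clock impact concrete.
baseline_tokens, tokens_per_second = 400, 20.0
for multiplier in (3, 5):
    extra_tokens = baseline_tokens * (multiplier - 1)
    extra_seconds = extra_tokens / tokens_per_second
    print(f"{multiplier}x output -> ~{extra_tokens} extra tokens, "
          f"~{extra_seconds:.0f}s more decode time per response")
```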
The comments immediately push on methodology. Readers question LLM-as-judge bias, quantization choices, llama.cpp build regressions, and sample size. Those objections are fair, and the author openly lists similar caveats. That openness is precisely why the thread matters. It reflects where local-model evaluation is heading in 2026: not toward one definitive benchmark, but toward community experiments that compare reliability, speed, latency, output shape, and practical usefulness together.
Related Articles
A popular LocalLLaMA benchmark post argued that Qwen3.5 27B hits an attractive balance between model size and throughput, using an RTX A6000, llama.cpp with CUDA, and a 32k context window to show roughly 19.7 tokens per second.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.
Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.