A LocalLLaMA blind eval finds Qwen 3.5 wins more matchups while Gemma 4 posts higher averages
Original: Gemma 4 31B vs Gemma 4 26B-A4B vs Qwen 3.5 27B — 30-question blind eval with Claude Opus 4.6 as judge
A new LocalLLaMA post offers a compact but revealing blind evaluation of three popular local models: Gemma 4 31B, Gemma 4 26B-A4B, and Qwen 3.5 27B. The author ran 30 prompts across five categories, held prompts and temperature constant across the three models, and used Claude Opus 4.6 as a single structured judge. Even if readers disagree with the judge choice, the write-up is useful because it separates win rate, average score, reliability, and category-specific behavior instead of collapsing everything into one headline number.
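The post does not publish its harness, so the following is only a minimal sketch of the aggregation it describes: per-task judge scores turned into win counts, raw averages, failure counts, and failure-excluded averages. The model keys are the real names from the post, but the score lists and the tie-breaking rule are illustrative placeholders, not the author's data.

```python
from collections import defaultdict

# Illustrative per-task judge scores on a 0-10 scale; the post's real 30-task
# score sheet is not reproduced here, so these numbers are placeholders.
scores = {
    "gemma4-31b":     [9.0, 8.5, 9.5, 8.0, 9.0],
    "gemma4-26b-a4b": [9.0, 0.0, 9.5, 8.5, 9.0],  # 0.0 marks an error / format failure
    "qwen3.5-27b":    [9.5, 9.0, 0.0, 9.5, 9.0],
}

def summarize(scores: dict[str, list[float]]) -> dict[str, dict]:
    n_tasks = len(next(iter(scores.values())))
    wins = defaultdict(int)
    for t in range(n_tasks):
        # Simplistic tie-break: first model in dict order takes a tied task.
        best = max(scores, key=lambda m: scores[m][t])
        wins[best] += 1
    report = {}
    for model, s in scores.items():
        completed = [x for x in s if x > 0.0]
        report[model] = {
            "wins": wins[model],
            "avg": round(sum(s) / len(s), 2),
            "avg_excl_failures": round(sum(completed) / len(completed), 2) if completed else 0.0,
            "failures": len(s) - len(completed),
        }
    return report

for model, stats in summarize(scores).items():
    print(model, stats)
```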
The results are more nuanced than a simple winner-takes-all story. Qwen 3.5 27B won 14 of 30 tasks, versus 12 for Gemma 4 31B and 4 for Gemma 4 26B-A4B. Average scores point the other way: both Gemma models landed at 8.82, while Qwen ended at 8.17 because it posted three 0.0 failures. The author argues that if those format failures or refusals are stripped out, Qwen’s adjusted average would jump to about 9.08. In short, Qwen may have the highest ceiling in this test, but Gemma appears steadier under the chosen setup.
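The ~9.08 figure is easy to sanity-check. Assuming the reported 8.17 mean covers all 30 tasks and the three failures scored exactly 0.0 (both assumptions, since the post only quotes summary numbers), the arithmetic works out as follows.

```python
# Back-of-the-envelope check of the adjusted average. Assumes the 8.17 mean
# spans all 30 tasks and the three failed tasks contributed exactly 0.0.
overall_mean, n_tasks, n_failures = 8.17, 30, 3
total_points = overall_mean * n_tasks             # 245.1 points over 30 tasks
adjusted = total_points / (n_tasks - n_failures)  # 245.1 / 27
print(round(adjusted, 2))                         # -> 9.08, matching the post
```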
The category breakdown is also instructive. Qwen led reasoning and analysis, Gemma 4 31B led communication, and code was basically tied. The 26B-A4B MoE variant matched the dense 31B model’s average when it worked but errored out twice, which makes it interesting from an efficiency perspective if reliability improves. The post also notes that Qwen generated 3-5x more tokens per response, raising the familiar question of whether higher raw performance justifies a verbosity tax in local deployments.
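The verbosity point is worth quantifying, even roughly. The 3-5x multiplier comes from the post, but the baseline response length and decode speed below are illustrative assumptions, since the thread does not report either number directly.

```python
# Rough verbosity-tax estimate. The 3x and 5x multipliers are from the post;
# the 400-token baseline response and 20 tok/s decode speed are assumptions
# chosen only to make the wall-clock impact concrete.
baseline_tokens, tokens_per_second = 400, 20.0
for multiplier in (3, 5):
    extra_tokens = baseline_tokens * (multiplier - 1)
    extra_seconds = extra_tokens / tokens_per_second
    print(f"{multiplier}x output -> ~{extra_tokens} extra tokens, "
          f"~{extra_seconds:.0f}s more decode time per response")
```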
The comments immediately push on methodology. Readers question LLM-as-judge bias, quantization choices, llama.cpp build regressions, and sample size. Those objections are fair, and the author openly lists similar caveats. That openness is precisely why the thread matters. It reflects where local-model evaluation is heading in 2026: not toward one definitive benchmark, but toward community experiments that compare reliability, speed, latency, output shape, and practical usefulness together.
Related Articles
A popular LocalLLaMA benchmark post argued that Qwen3.5 27B hits an attractive balance between model size and throughput, using an RTX A6000, llama.cpp with CUDA, and a 32k context window to show roughly 19.7 tokens per second.
A recent r/LocalLLaMA benchmark thread argues that tokens-per-second screenshots hide the real trade-offs between MLX and llama.cpp on Apple Silicon. MLX still wins on short-context generation, but long-context workloads can erase that headline speedup because prefill dominates total latency.
Penfield Labs argues that LoCoMo still circulates as a major memory benchmark even though 99 of its 1,540 answer-key entries are score-corrupting and its gpt-4o-mini judge passed 62.81% of intentionally wrong answers in an audit.