A Pac-Man prompt pushed LocalLLaMA to argue about something bigger than tokens per second
Original: Qwen 3.6 27B vs Gemma 4 31B - making Packman game!
This comparison spread through LocalLLaMA because the setup asks for a real, working artifact rather than a clean benchmark score. The author ran the same Pac-Man-style game prompt through two local models on a MacBook Pro M5 Max with 64GB of RAM and asked each for a complete standalone HTML file. The prompt is long and operational: generate a procedural 21×21 maze, manage four ghosts with distinct behaviors, support mobile and keyboard controls, store scores in localStorage, drive the game with requestAnimationFrame, add particle effects, and avoid broken movement, unreachable pellets, or frozen entities.
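To make the last two plumbing requirements concrete, here is a minimal sketch, not taken from either model's output, of a requestAnimationFrame game loop with a frame-rate-independent update step and localStorage high-score persistence. The storage key and function names are illustrative placeholders.

```javascript
// Minimal sketch (not from the thread): a requestAnimationFrame loop with
// frame-rate-independent updates and localStorage high-score persistence.
const HIGH_SCORE_KEY = "pacman-high-score"; // hypothetical storage key

let score = 0;
let highScore = Number(localStorage.getItem(HIGH_SCORE_KEY)) || 0;
let lastTime = performance.now();

function update(dt) {
  // Real game logic goes here: move Pac-Man, advance the four ghosts,
  // eat pellets, emit particles. Scaling movement by dt keeps entity
  // speed independent of the display's refresh rate.
}

function saveHighScore() {
  if (score > highScore) {
    highScore = score;
    localStorage.setItem(HIGH_SCORE_KEY, String(highScore));
  }
}

function frame(now) {
  const dt = (now - lastTime) / 1000; // seconds since the previous frame
  lastTime = now;
  update(dt);
  requestAnimationFrame(frame); // schedule the next frame
}

requestAnimationFrame(frame);
```

Passing dt into update() is the standard way to satisfy a "keep performance stable" requirement: entity speeds stop depending on whether the display runs at 60Hz or 120Hz. The rest of the prompt, maze generation, ghost AI, and input handling, all hangs off that update step.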
The result did not line up neatly with raw speed. Qwen 3.6 27B produced 33,946 tokens at 32 tokens per second and took 18 minutes 4 seconds. Gemma 4 31B ran slightly slower at 27 tokens per second, but finished in 3 minutes 51 seconds because it emitted only 6,209 tokens, fewer than a fifth of Qwen's output. The poster said Qwen felt more creative and visually expressive, while Gemma delivered cleaner logic and stronger interactions with walls, ghosts, clicks, and effects. For this one-shot game build, Gemma was the clear winner.
The comments turned that into a broader argument about evaluation design. One popular reply joked that “keep performance stable” and “no bugs” are hilarious prompt requirements, because they sound more like a shipping checklist than a benchmark. Another commenter pushed back harder and said the prompt is still underspecified, so the test may mostly reveal how much Pac-Man logic the model already carries. A third user tried a looser version of the task with Qwen and got a noticeably different result, which underlined how much these comparisons depend on prompt framing.
That is why this post landed. It points to a growing gap between throughput metrics and completion quality for local models acting more like agents than chatbots. In that setting, the winning model is not necessarily the one that writes the most tokens or writes them fastest. It is the one that reaches a playable, coherent result with less waste and fewer failure modes. LocalLLaMA treated this Pac-Man test as evidence that the scorecard for local models is shifting.
Source: Reddit discussion