A Pac-Man prompt pushed LocalLLaMA to argue about something bigger than tokens per second
Original: Qwen 3.6 27B vs Gemma 4 31B - making Packman game!
This comparison spread through LocalLLaMA because the task looks more like a real build request than a clean benchmark. The author ran the same Pac-Man-style game prompt through two local models on a MacBook Pro M5 Max with 64GB of RAM and asked each for a complete standalone HTML file. The prompt is long and operational: generate a procedural 21×21 maze, manage four ghosts with distinct behaviors, support mobile and keyboard controls, store scores in localStorage, run on requestAnimationFrame, add particle effects, and avoid broken movement, unreachable pellets, or frozen entities.
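The "no unreachable pellets" requirement is one of the few items in that prompt that can be checked mechanically. The sketch below is illustrative only: neither the poster's prompt nor either model's output is reproduced here, so the grid encoding (`#` for walls, `.` for pellets), the function name `unreachable_pellets`, and the tiny demo maze are all assumptions. It also assumes four-way movement with no wraparound tunnels. It flood-fills from the player's start cell and reports any pellet the fill never reaches.

```python
from collections import deque

# Hypothetical cell encoding, chosen for this sketch only.
WALL, PELLET = "#", "."


def unreachable_pellets(grid, start):
    """Return the set of (row, col) pellet cells not reachable from `start`."""
    rows, cols = len(grid), len(grid[0])
    seen = {start}
    queue = deque([start])
    # Breadth-first flood fill over non-wall cells, four-way movement.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] != WALL and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append((nr, nc))
    pellets = {
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if grid[r][c] == PELLET
    }
    return pellets - seen


# A deliberately broken demo maze: the right-hand pellets are walled off.
demo = [
    "#######",
    "#..#..#",
    "#.##.##",
    "#...#.#",
    "#######",
]

print(sorted(unreachable_pellets(demo, (1, 1))))
```

A generated 21×21 maze would pass this check when the returned set is empty. A check like this says nothing about ghost behavior or input handling, which is part of why commenters found the prompt hard to treat as a benchmark.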
The result did not line up neatly with raw speed. Qwen 3.6 27B produced 33,946 tokens at 32 tokens per second and took 18 minutes 04 seconds. Gemma 4 31B ran slightly slower at 27 tokens per second, but finished in 3 minutes 51 seconds with only 6,209 tokens. The poster said Qwen felt more creative and visually expressive, while Gemma delivered cleaner logic and stronger interactions with walls, ghosts, clicks, and effects. For this one-shot game build, Gemma was the clear winner.
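The reported figures are roughly self-consistent, and a quick calculation shows why the slower model finished first. The sketch below divides each token count by its throughput and compares the estimate with the reported wall-clock time. The numbers come from the post as quoted above; the helper names `generation_seconds` and `fmt` are made up for this example, and the estimate assumes a constant generation rate.

```python
def generation_seconds(tokens, tokens_per_second):
    """Estimated generation time, assuming a constant token rate."""
    return tokens / tokens_per_second


def fmt(seconds):
    """Format a duration in seconds as minutes and seconds."""
    minutes, secs = divmod(round(seconds), 60)
    return f"{minutes}m {secs:02d}s"


# Figures as reported in the post.
runs = {
    "Qwen 3.6 27B": {"tokens": 33_946, "tok_per_s": 32, "reported_s": 18 * 60 + 4},
    "Gemma 4 31B": {"tokens": 6_209, "tok_per_s": 27, "reported_s": 3 * 60 + 51},
}

for name, run in runs.items():
    estimate = generation_seconds(run["tokens"], run["tok_per_s"])
    print(f"{name}: estimated {fmt(estimate)}, reported {fmt(run['reported_s'])}")
```

This prints an estimate of 17m 41s for Qwen and 3m 50s for Gemma, both within about half a minute of the reported times. Qwen was about 19% faster per token but emitted roughly 5.5 times as many tokens, so output length, not throughput, decided the wall-clock result.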
The comments turned that into a broader argument about evaluation design. One popular reply joked that “keep performance stable” and “no bugs” are hilarious prompt requirements, because they sound more like a shipping checklist than a benchmark. Another commenter pushed back harder and said the prompt is still underspecified, so the test may mostly reveal how much Pac-Man logic the model already carries. A third user tried a looser version of the task with Qwen and got a noticeably different result, which underlined how much these comparisons depend on prompt framing.
That is why this post landed. It points to a growing gap between throughput metrics and completion quality for local models acting more like agents than chatbots. In that setting, the winning model is not necessarily the one that writes the most tokens or generates them fastest. It is the one that reaches a playable, coherent result with less waste and fewer failure modes. LocalLLaMA treated this Pac-Man test as evidence that the scorecard for local models is shifting.
Source: Reddit discussion