A Pac-Man prompt pushed LocalLLaMA to argue about something bigger than tokens per second
Original: Qwen 3.6 27B vs Gemma 4 31B - making Packman game!
This comparison spread through LocalLLaMA because the setup asks for a real, working artifact rather than a clean benchmark score. The author ran the same Pac-Man-style game prompt through two local models on a MacBook Pro M5 Max with 64GB of RAM and asked each for a complete standalone HTML file. The prompt is long and operational: generate a procedural 21×21 maze, manage four ghosts with distinct behaviors, support mobile and keyboard controls, store scores in localStorage, drive the game with requestAnimationFrame, add particle effects, and avoid broken movement, unreachable pellets, or frozen entities.
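To make the last two plumbing requirements concrete, here is a minimal sketch, not taken from either model's output, of a requestAnimationFrame game loop with a frame-rate-independent update step and localStorage high-score persistence. The storage key and function names are illustrative placeholders.

```javascript
// Minimal sketch (not from the thread): a requestAnimationFrame loop with
// frame-rate-independent updates and localStorage high-score persistence.
const HIGH_SCORE_KEY = "pacman-high-score"; // hypothetical storage key

let score = 0;
let highScore = Number(localStorage.getItem(HIGH_SCORE_KEY)) || 0;
let lastTime = performance.now();

function update(dt) {
  // Real game logic goes here: move Pac-Man, advance the four ghosts,
  // eat pellets, emit particles. Scaling movement by dt keeps entity
  // speed independent of the display's refresh rate.
}

function saveHighScore() {
  if (score > highScore) {
    highScore = score;
    localStorage.setItem(HIGH_SCORE_KEY, String(highScore));
  }
}

function frame(now) {
  const dt = (now - lastTime) / 1000; // seconds since the previous frame
  lastTime = now;
  update(dt);
  requestAnimationFrame(frame); // schedule the next frame
}

requestAnimationFrame(frame);
```

Passing dt into update() is the standard way to satisfy a "keep performance stable" requirement: entity speeds stop depending on whether the display runs at 60Hz or 120Hz. The rest of the prompt, maze generation, ghost AI, and input handling, all hangs off that update step.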
The result did not line up neatly with raw speed. Qwen 3.6 27B produced 33,946 tokens at 32 tokens per second and took 18 minutes 4 seconds. Gemma 4 31B ran slightly slower at 27 tokens per second, but finished in 3 minutes 51 seconds because it emitted only 6,209 tokens, fewer than a fifth of Qwen's output. The poster said Qwen felt more creative and visually expressive, while Gemma delivered cleaner logic and stronger interactions with walls, ghosts, clicks, and effects. For this one-shot game build, Gemma was the clear winner.
The comments turned that into a broader argument about evaluation design. One popular reply joked that “keep performance stable” and “no bugs” are hilarious prompt requirements, because they sound more like a shipping checklist than a benchmark. Another commenter pushed back harder and said the prompt is still underspecified, so the test may mostly reveal how much Pac-Man logic the model already carries. A third user tried a looser version of the task with Qwen and got a noticeably different result, which underlined how much these comparisons depend on prompt framing.
That is why this post landed. It points to a growing gap between throughput metrics and completion quality for local models acting more like agents than chatbots. In that setting, the winning model is not necessarily the one that writes the most tokens or writes them fastest. It is the one that reaches a playable, coherent result with less waste and fewer failure modes. LocalLLaMA treated this Pac-Man test as evidence that the scorecard for local models is shifting.
Source: Reddit discussion