LocalLLaMA debates Gemma 4 31B's surprising FoodTruck Bench result

On April 4, 2026, a LocalLLaMA discussion drew roughly 277 upvotes as users reacted to Gemma 4 31B's unexpectedly strong showing on FoodTruck Bench. The original post argued that Gemma 4 31B had reached third place, beating GLM 5, Qwen 3.5 397B, and the Claude Sonnet variants. Unlike a standard coding or knowledge benchmark, FoodTruck Bench measures whether an AI agent can run a 30-day food-truck business under uncertainty.

The official leaderboard backs the core ranking. As of April 5, 2026, FoodTruck Bench listed Gemma 4 31B at No. 3 with a median net worth of $24,878, behind only Claude Opus 4.6 and GPT-5.2. According to the site's methodology, each model is run five times under identical conditions and the median run is shown. The benchmark spans 30 simulated days, 34 agent tools, pricing, staffing, inventory, and location choices, so the ranking is really about sustained multi-step decision-making rather than one-shot answer accuracy.
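The median-of-five reporting described above can be sketched in a few lines. This is a minimal illustration, not FoodTruck Bench's actual scoring code: the function name `median_net_worth` and the run values are hypothetical and chosen only to show why a median is less sensitive to one unusually bad or good run than a mean would be.

```python
from statistics import median


def median_net_worth(run_results):
    """Return the median final net worth across repeated runs of one model."""
    if not run_results:
        raise ValueError("need at least one run")
    return median(run_results)


# Hypothetical final net worths (USD) for five runs; NOT real benchmark data.
example_runs = [18_200.0, 31_500.0, 24_000.0, -2_000.0, 26_750.0]

# The single losing run (-2,000) does not drag the reported figure down,
# because the median only looks at the middle value of the sorted runs.
print(median_net_worth(example_runs))
```

With an odd number of runs, the reported figure is always the outcome of one actual run, which is consistent with the site's description of showing "the median run."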

That is why the thread resonated with local-model users. A 31B-class open model placing that high suggests that smaller open-weight models may be getting more credible at long-horizon agent work, not just short prompts. The original poster specifically pointed to Gemma's apparent ability to stay on plan over multiple simulated days, which matches what many local-LLM practitioners care about most: whether a model can maintain coherence after repeated tool use, delayed consequences, and self-generated notes.

The comments, however, were far from unanimous in celebrating. Several readers questioned how robust FoodTruck Bench really is and warned about “benchmaxxing,” contamination, or rapid model-specific optimization against small public leaderboards. That skepticism matters. The takeaway is not that Gemma 4 31B has definitively solved agentic reasoning, but that an open 31B model now has a credible result on a benchmark designed around economic decisions and state carry-over. For anyone building local agents, that is enough reason to test it seriously rather than dismiss it out of hand.

FoodTruck Bench evaluates 30 days of business decisions rather than one-shot QA or coding tasks.
As of April 5, 2026, Gemma 4 31B ranked third on the official leaderboard with a median net worth of $24,878.
The Reddit discussion split between excitement over open-model progress and skepticism about benchmark gaming or contamination.
