LocalLLaMA debates Gemma 4 31B's surprising FoodTruck Bench result
Original: Gemma 4 31B beats several frontier models on the FoodTruck Bench
On April 4, 2026, a LocalLLaMA discussion drew roughly 277 upvotes as users reacted to Gemma 4 31B's unexpectedly strong showing on FoodTruck Bench. The original post argued that Gemma 4 31B had reached third place, beating GLM 5, Qwen 3.5 397B, and the Claude Sonnet variants. Unlike a standard coding or knowledge benchmark, FoodTruck Bench measures whether an AI agent can run a 30-day food-truck business under uncertainty.
The official leaderboard backs the core ranking. As of April 5, 2026, FoodTruck Bench listed Gemma 4 31B at No. 3 with a median net worth of $24,878, behind only Claude Opus 4.6 and GPT-5.2. According to the site's methodology, each model is run five times under identical conditions and the median run is shown. The benchmark spans 30 simulated days, 34 agent tools, pricing, staffing, inventory, and location choices, so the ranking is really about sustained multi-step decision-making rather than one-shot answer accuracy.
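The scoring rule described above can be sketched in a few lines. This is a hypothetical illustration of "run each model five times, report the median run"; the function name and the run values are invented, and only the $24,878 median echoes the leaderboard figure.

```python
from statistics import median

def leaderboard_score(run_net_worths: list[float]) -> float:
    """Return the reported score: the median final net worth across runs.

    FoodTruck Bench is described as running each model five times under
    identical conditions and showing the median run, which makes the
    score robust to a single lucky or catastrophic simulation.
    """
    assert len(run_net_worths) == 5, "methodology specifies five runs per model"
    return median(run_net_worths)

# Five hypothetical run outcomes for one model (illustrative, not real data):
runs = [21_400.0, 23_950.0, 24_878.0, 26_100.0, 27_300.0]
print(leaderboard_score(runs))  # -> 24878.0, the middle of the sorted runs
```

Using the median rather than the mean means one blown-up run (say, bankrupting the truck on day 3) does not drag the reported score down, which matters for a 30-day simulation with compounding decisions.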
That is why the thread resonated with local-model users. A 31B-class open model placing that high suggests that smaller open models are becoming credible at long-horizon agent work, not just short prompts. The original poster specifically pointed to Gemma's apparent ability to stay on plan over multiple simulated days, which matches what many local-LLM practitioners care about most: whether a model can maintain coherence through repeated tool use, delayed consequences, and self-generated notes.
The comments, however, were far from celebratory consensus. Several readers questioned how robust FoodTruck Bench really is and warned about “benchmaxxing,” contamination, or rapid model-specific optimization against small public leaderboards. That skepticism matters. The takeaway is not that Gemma 4 31B has definitively solved agentic reasoning, but that an open 31B model now has a credible result on a benchmark designed around economic decisions and state carry-over. For anyone building local agents, that is enough reason to test it seriously rather than dismiss it out of hand.
- FoodTruck Bench evaluates 30 days of business decisions rather than one-shot QA or coding tasks.
- As of April 5, 2026, Gemma 4 31B ranked third on the official leaderboard with a median net worth of $24,878.
- The Reddit discussion split between excitement over open-model progress and skepticism about benchmark gaming or contamination.
Related Articles
Google said on April 2, 2026 that Gemma 4 is its most capable open model family so far, built from the same technology base as Gemini 3. Google says the family spans E2B, E4B, 26B MoE, and 31B Dense models, adds function-calling and structured JSON support, and offers up to 256K context with an Apache 2.0 license.
Together AI said on March 13, 2026 that v2 of Open Deep Research is fully free and open source. The companion blog describes a planner and self-reflection workflow for multi-hop web research and ships code plus evaluation assets for developers.
Anthropic introduced Claude Sonnet 4.6 on February 17, 2026, adding a beta 1M token context window while keeping API pricing at $3/$15 per million tokens. The company says the new default model improves coding, computer use, and long-context reasoning enough to cover more work that previously pushed users toward Opus-class models.