DeepSeek V4 Pro Matches GPT-5.2 on Agentic Benchmark — 17x Cheaper, 10 Weeks Later

FoodTruck Bench

FoodTruck Bench is a 30-day agentic benchmark in which a model runs a simulated food truck through 34 tools covering locations, pricing, inventory, staff, weather, and events, with persistent memory and a daily reflection step. It is designed to measure sustained, multi-step agentic capability rather than single-turn performance.
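To make the setup concrete, here is a minimal Python sketch of a loop of this shape. It is not the benchmark's actual harness: `TruckState`, the three stand-in tools, and `stub_policy` are hypothetical names introduced purely for illustration, and in a real run a language model (not a fixed stub) would choose the tool calls each day.

```python
from dataclasses import dataclass, field


@dataclass
class TruckState:
    day: int = 0
    cash: float = 0.0
    # Notes carried from one simulated day to the next (the "persistent memory").
    memory: list = field(default_factory=list)


# Hypothetical stand-ins for a few of the benchmark's 34 tools.
def check_weather(state, args):
    # Placeholder observation; a real harness would sample a simulated forecast.
    return "sunny"


def set_price(state, args):
    return f"price set to {args['price']:.2f}"


def sell(state, args):
    state.cash += args["units"] * args["price"]
    return f"sold {args['units']} units"


TOOLS = {"check_weather": check_weather, "set_price": set_price, "sell": sell}


def stub_policy(state):
    # Assumption: a fixed plan stands in for the model's tool-call decisions.
    return [
        ("check_weather", {}),
        ("set_price", {"price": 8.0}),
        ("sell", {"units": 10, "price": 8.0}),
    ]


def run_simulation(policy, days=30):
    state = TruckState()
    for day in range(1, days + 1):
        state.day = day
        log = []
        for name, args in policy(state):
            log.append(f"{name}: {TOOLS[name](state, args)}")
        # Daily reflection: summarize the day and keep it for later days.
        state.memory.append(f"day {day}: " + "; ".join(log))
    return state


final = run_simulation(stub_policy)
```

The point of the sketch is the structure: decisions on day 30 can depend on everything written to `memory` since day 1, which is what separates this kind of evaluation from a single-turn prompt.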

Results

DeepSeek V4 Pro landed 4th overall, behind Claude Opus 4.6, GPT-5.2, and Grok 4.3. It tied Grok 4.3 on outcome and came within 3% of GPT-5.2's median score. It is the first Chinese model to reach the frontier tier on this benchmark.

The Cost Gap

GPT-5.2 was tested in mid-February. DeepSeek V4 Pro reached comparable performance (within 3% on median score) 10 weeks later at roughly 17x lower cost. This is consistent with a recurring pattern: frontier performance gaps close within weeks to months, while price differences remain large.
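The two headline figures are simple ratios. The sketch below shows how they would be computed; the function names are hypothetical, and the inputs are placeholder values chosen only to reproduce the "17x" and "3%" figures, not the benchmark's actual per-run costs or scores.

```python
import math


def cost_ratio(baseline_cost, challenger_cost):
    """How many times cheaper the challenger's run is than the baseline's."""
    if challenger_cost <= 0:
        raise ValueError("challenger_cost must be positive")
    return baseline_cost / challenger_cost


def relative_gap(reference_score, score):
    """Shortfall of `score` relative to `reference_score`, as a fraction."""
    if reference_score == 0:
        raise ValueError("reference_score must be non-zero")
    return (reference_score - score) / reference_score


# Placeholder inputs (assumptions, not reported measurements).
ratio = cost_ratio(baseline_cost=17.0, challenger_cost=1.0)
gap = relative_gap(reference_score=100.0, score=97.0)

print(f"{ratio:.0f}x cheaper, {gap:.0%} behind on median score")
```

Read together, the two numbers describe a trade: a small shortfall in score in exchange for a much larger reduction in spend per run.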

Community Impact

Several LocalLLaMA users ran their own 10-day workflow audits and found that a significant fraction of their daily tasks could be handled by local models (Qwen3.6 27B on a 3090) at near-zero cost. The benchmark result puts a number on the pricing pressure facing expensive frontier API calls in production workloads.
