DeepSeek V4 Pro Matches GPT-5.2 on Agentic Benchmark — 17x Cheaper, 10 Weeks Later

LLM · May 5, 2026 · By Insights AI (Reddit) · 1 min read

FoodTruck Bench

FoodTruck Bench is a 30-day agentic benchmark in which models run a food truck through 34 tools covering locations, pricing, inventory, staff, weather, and events, with persistent memory and daily reflection. It measures sustained agentic capability rather than single-turn performance.
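The structure described above (a multi-day loop of tool calls, a persistent memory, and an end-of-day reflection step) can be sketched in miniature. This is an illustrative toy, not the benchmark's actual harness: every name here (`Memory`, `run_benchmark`, the two stand-in tools) is invented, and the "environment" is a trivial stub rather than the real 34-tool simulator.

```python
# Toy sketch of a FoodTruck-Bench-style daily agent loop.
# All names and numbers are illustrative, not the benchmark's API.

from dataclasses import dataclass, field

@dataclass
class Memory:
    """Persistent memory carried across simulated days."""
    notes: list = field(default_factory=list)

    def reflect(self, day: int, outcome: dict) -> None:
        # Daily reflection: record the day's result for future decisions.
        self.notes.append(f"day {day}: profit={outcome['profit']:.2f}")

def pick_location(day: int) -> str:
    # Stand-in for an agent decision made via tool calls.
    return "downtown" if day % 2 == 0 else "stadium"

def simulate_sales(location: str) -> dict:
    # Stub environment; the real benchmark models weather, events, etc.
    return {"profit": 120.0 if location == "downtown" else 90.0}

def run_benchmark(days: int = 30) -> tuple[float, Memory]:
    memory = Memory()
    total = 0.0
    for day in range(1, days + 1):
        location = pick_location(day)       # agent acts
        outcome = simulate_sales(location)  # environment responds
        memory.reflect(day, outcome)        # memory persists across days
        total += outcome["profit"]
    return total, memory

total, memory = run_benchmark()
```

The point of the 30-day horizon is visible even in the stub: the score depends on a long chain of decisions and accumulated memory, not on any single turn.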

Results

DeepSeek V4 Pro landed 4th overall, behind Claude Opus 4.6, GPT-5.2, and Grok 4.3. It tied Grok 4.3 on final outcome and came within 3% of GPT-5.2's median score, making it the first Chinese model to reach the frontier tier on this benchmark.

The Cost Gap

GPT-5.2 was tested in mid-February; DeepSeek V4 Pro reached equivalent performance 10 weeks later at roughly 17x lower cost. This fits a recurring pattern: frontier performance gaps close within weeks to months, while price gaps stay large.
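To make the 17x figure concrete, here is a back-of-the-envelope comparison. The prices and token counts below are hypothetical placeholders (the article gives only the ratio), chosen solely to show how the gap compounds over a token-hungry agentic run.

```python
# Illustrative cost arithmetic; only the ~17x ratio comes from the article.
frontier_price = 10.00                    # $/M tokens (hypothetical)
challenger_price = frontier_price / 17    # ~17x cheaper

tokens_per_run = 5_000_000                # assumed size of one 30-day agentic run

frontier_cost = frontier_price * tokens_per_run / 1_000_000
challenger_cost = challenger_price * tokens_per_run / 1_000_000
print(frontier_cost, round(challenger_cost, 2))
```

Whatever the absolute prices, a fixed 17x ratio means the same agentic workload costs pennies on the dollar once the performance gap closes.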

Community Impact

Several LocalLLaMA users ran their own 10-day workflow audits and found that a significant fraction of daily tasks could be handled by local models (Qwen3.6 27B on a 3090) at near-zero cost. The benchmark result puts a number on the pricing pressure facing expensive frontier API calls in production workloads.
