
LocalLLaMA sees 38.2% as the moment local coding stops feeling theoretical

Original: Local model on coding has reached a certain threshold to be feasible for real work

LLM · Apr 28, 2026 · By Insights AI (Reddit) · 2 min read

What caught fire in LocalLLaMA was not 38.2% as an isolated number. It was what that number allegedly means on the timeline. The post says open-weight 27B-32B models were run on Terminal-Bench 2.0's 89 tasks under the default per-task timeout, and Qwen 3.6-27B finished at 34 out of 89, or 38.2%. The argument is that this is the first moment local coding models feel less like a curiosity and more like something you can actually route work to.

The framing relies on a time-equivalence comparison against the verified Terminal-Bench leaderboard. The post places 38.2% right beside late-2025 hosted runs: Terminus 2 plus Claude Opus 4.1 at 38.0%, GPT-5.1-Codex at 36.9%, Claude Code plus Sonnet 4.5 at 40.1%, and Codex CLI plus GPT-5-Codex at 44.3%. That does not erase the current frontier gap, with today's best hosted agents sitting near 80%, but it changes the practical question. A local model that looks roughly six to eight months behind the frontier is suddenly relevant for regulated environments, air-gapped networks, on-prem pipelines, and cost-sensitive batch work.
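
To make that comparison concrete, here is a minimal Python sketch that recomputes the local pass rate and the gaps against the quoted hosted runs. The hosted percentages are taken directly from the post, not re-derived from the leaderboard:

```python
# Sanity-check the pass rates quoted in the thread (illustrative only).
TOTAL_TASKS = 89   # Terminal-Bench 2.0 task count from the post
SOLVED = 34        # Qwen 3.6-27B, default per-task timeout

local_rate = SOLVED / TOTAL_TASKS
print(f"Qwen 3.6-27B (local): {SOLVED}/{TOTAL_TASKS} = {local_rate:.1%}")  # 38.2%

# Late-2025 hosted runs, percentages as quoted in the post.
hosted_runs = {
    "Terminus 2 + Claude Opus 4.1": 0.380,
    "GPT-5.1-Codex": 0.369,
    "Claude Code + Sonnet 4.5": 0.401,
    "Codex CLI + GPT-5-Codex": 0.443,
}
for name, rate in hosted_runs.items():
    print(f"{name}: {rate:.1%} ({rate - local_rate:+.1%} vs. the local run)")
```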

The linked Antigma write-up adds two useful caveats. First, the headline 38.2% is a default-timeout number, not a hard ceiling. The same Qwen 3.6-27B reaches 59.3% in Qwen's own evaluation when the timeout is stretched to three hours, suggesting that a meaningful chunk of failures are budget failures rather than pure correctness failures. Second, local usability depends heavily on hardware fit. On a 64 GB RAM plus RTX 3060 12 GB machine, MoE models feel much better than dense ones. On an RTX 5090 32 GB, even dense Qwen 3.6-27B reportedly reaches interactive speeds that feel normal rather than borderline.
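
To see why the timeout matters, a quick back-of-envelope converts the two quoted percentages into task counts. The per-task split below is inferred from those figures, not something reported in the write-up:

```python
# Back-of-envelope: how many of the 89 tasks the time budget alone appears to gate.
# Both percentages are quoted figures; the task counts are inferred from them.
TOTAL_TASKS = 89
at_default_timeout = round(0.382 * TOTAL_TASKS)  # 34 tasks
at_three_hours = round(0.593 * TOTAL_TASKS)      # ~53 tasks

print(f"solved at default timeout: ~{at_default_timeout}")
print(f"solved at 3-hour timeout:  ~{at_three_hours}")
print(f"apparently budget-bound:   ~{at_three_hours - at_default_timeout} tasks")
```

On these numbers, roughly 19 of the 89 tasks flip from fail to pass purely by raising the time budget.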

  • Benchmark: Terminal-Bench 2.0, 89 tasks
  • Default-timeout result: Qwen 3.6-27B at 38.2%
  • Thread's interpretation: roughly late-2025 frontier quality
  • Extended-timeout context: Qwen's own run reaches 59.3%

The first reply asking whether the tests all used RTX 5090 hardware says a lot about why the post traveled. LocalLLaMA did not read this as a victory lap over frontier APIs. It read it as a deployment question becoming real. The thread's energy comes from that shift: local coding is still behind, but it is now behind by an amount that many teams can work with.

Source links: Reddit thread, linked benchmark write-up.

