
LocalLLaMA sees 38.2% as the moment local coding stops feeling theoretical

Original: Local model on coding has reached a certain threshold to be feasible for real work

LLM · Apr 28, 2026 · By Insights AI (Reddit) · 2 min read

What caught fire in LocalLLaMA was not 38.2% as an isolated number. It was what that number allegedly means on the timeline. The post says open-weight 27B-32B models were run on Terminal-Bench 2.0's 89 tasks under the default per-task timeout, and Qwen 3.6-27B finished at 34 out of 89, or 38.2%. The argument is that this is the first moment local coding models feel less like a curiosity and more like something you can actually route work to.

The framing relies on a time-equivalence comparison against the verified Terminal-Bench leaderboard. The post places 38.2% right beside late-2025 hosted runs: Terminus 2 plus Claude Opus 4.1 at 38.0%, GPT-5.1-Codex at 36.9%, Claude Code plus Sonnet 4.5 at 40.1%, and Codex CLI plus GPT-5-Codex at 44.3%. That does not erase the current frontier gap, with today's best hosted agents sitting near 80%, but it changes the practical question. A local model that looks roughly six to eight months behind the frontier is suddenly relevant for regulated environments, air-gapped networks, on-prem pipelines, and cost-sensitive batch work.
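
To make that comparison concrete, here is a minimal Python sketch that recomputes the local pass rate and the gaps against the quoted hosted runs. The hosted percentages are taken directly from the post, not re-derived from the leaderboard:

```python
# Sanity-check the pass rates quoted in the thread (illustrative only).
TOTAL_TASKS = 89   # Terminal-Bench 2.0 task count from the post
SOLVED = 34        # Qwen 3.6-27B, default per-task timeout

local_rate = SOLVED / TOTAL_TASKS
print(f"Qwen 3.6-27B (local): {SOLVED}/{TOTAL_TASKS} = {local_rate:.1%}")  # 38.2%

# Late-2025 hosted runs, percentages as quoted in the post.
hosted_runs = {
    "Terminus 2 + Claude Opus 4.1": 0.380,
    "GPT-5.1-Codex": 0.369,
    "Claude Code + Sonnet 4.5": 0.401,
    "Codex CLI + GPT-5-Codex": 0.443,
}
for name, rate in hosted_runs.items():
    print(f"{name}: {rate:.1%} ({rate - local_rate:+.1%} vs. the local run)")
```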

The linked Antigma write-up adds two useful caveats. First, the headline 38.2% is a default-timeout number, not a hard ceiling. The same Qwen 3.6-27B reaches 59.3% in Qwen's own evaluation when the timeout is stretched to three hours, suggesting that a meaningful chunk of failures are budget failures rather than pure correctness failures. Second, local usability depends heavily on hardware fit. On a 64 GB RAM plus RTX 3060 12 GB machine, MoE models feel much better than dense ones. On an RTX 5090 32 GB, even dense Qwen 3.6-27B reportedly reaches interactive speeds that feel normal rather than borderline.
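
To see why the timeout matters, a quick back-of-envelope converts the two quoted percentages into task counts. The per-task split below is inferred from those figures, not something reported in the write-up:

```python
# Back-of-envelope: how many of the 89 tasks the time budget alone appears to gate.
# Both percentages are quoted figures; the task counts are inferred from them.
TOTAL_TASKS = 89
at_default_timeout = round(0.382 * TOTAL_TASKS)  # 34 tasks
at_three_hours = round(0.593 * TOTAL_TASKS)      # ~53 tasks

print(f"solved at default timeout: ~{at_default_timeout}")
print(f"solved at 3-hour timeout:  ~{at_three_hours}")
print(f"apparently budget-bound:   ~{at_three_hours - at_default_timeout} tasks")
```

On these numbers, roughly 19 of the 89 tasks flip from fail to pass purely by raising the time budget.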

  • Benchmark: Terminal-Bench 2.0, 89 tasks
  • Default-timeout result: Qwen 3.6-27B at 38.2%
  • Thread's interpretation: roughly late-2025 frontier quality
  • Extended-timeout context: Qwen's own run reaches 59.3%

The first reply asking whether the tests all used RTX 5090 hardware says a lot about why the post traveled. LocalLLaMA did not read this as a victory lap over frontier APIs. It read it as a deployment question becoming real. The thread's energy comes from that shift: local coding is still behind, but it is now behind by an amount that many teams can work with.

Source links: Reddit thread, linked benchmark write-up.

