A 145-result coding eval put Kimi K2.6, Opus 4.7, GLM 5.1 and Minimax under LocalLLaMA review
Original: Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and more tested in coding
Community Spark
An r/LocalLLaMA post compared Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and other coding models, then pointed readers to the SanityHarness leaderboard. The author says the old and new leaderboards together now hold 145 results, built up from repeated coding-agent evaluations rather than one-off vibe checks.
What Was Tested
The post describes SanityHarness as a coding-agent-agnostic benchmark. Its GitHub README says the harness runs compact but challenging problems inside isolated Docker containers, spans six languages, applies weighted scoring, and includes integrity checks plus hidden tests. In this latest pass, the author says Kimi K2.6-Code-Preview was tested through early access alongside Opus 4.7, GLM 5.1, Minimax M2.7 and other systems.
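The post links the README rather than walking through internals, but the scoring shape it describes is easy to illustrate. The sketch below is not SanityHarness code: the Task fields, the per-language weights and the run_in_container helper are assumptions made for illustration, and only the broad outline (isolated Docker runs, weighted aggregation, visible plus hidden tests) follows the README's description.

    import subprocess
    from dataclasses import dataclass

    @dataclass
    class Task:
        # Hypothetical task record; the real SanityHarness schema is not published in the post.
        name: str
        language: str
        image: str          # Docker image providing the toolchain for this language
        visible_cmd: str    # test command the model can see while solving
        hidden_cmd: str     # held-back tests used for the final, integrity-checked score

    # Assumed per-language weights; the README only says scoring is weighted.
    WEIGHTS = {"python": 1.0, "rust": 1.2, "cpp": 1.2, "go": 1.0, "typescript": 1.0, "java": 1.0}

    def run_in_container(image: str, workdir: str, cmd: str, timeout: int = 300) -> bool:
        # Execute one test command inside an isolated, network-less container.
        result = subprocess.run(
            ["docker", "run", "--rm", "--network", "none",
             "-v", f"{workdir}:/work", "-w", "/work", image, "sh", "-c", cmd],
            capture_output=True, timeout=timeout,
        )
        return result.returncode == 0

    def score(tasks: list[Task], solutions: dict[str, str]) -> float:
        # Weighted pass rate: a task counts only if visible and hidden tests both pass.
        earned = total = 0.0
        for task in tasks:
            weight = WEIGHTS.get(task.language, 1.0)
            total += weight
            workdir = solutions.get(task.name)
            if workdir is None:
                continue
            if run_in_container(task.image, workdir, task.visible_cmd) and \
               run_in_container(task.image, workdir, task.hidden_cmd):
                earned += weight
        return 100.0 * earned / total if total else 0.0

The hidden command is what makes the integrity check bite: a solution that only satisfies the visible tests earns nothing toward the weighted score.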
The Tension In The Results
The community hook was not a clean leaderboard winner but the gap between scores and day-to-day use. The author says Opus 4.7 can score well in evals while feeling much worse in actual coding sessions, with frequent hallucination and stubbornly incorrect assumptions. Kimi K2.6 is described as a step up from Kimi K2.5 and slightly ahead of GLM 5.1 in the author's testing. Minimax M2.7 and Qwen 3.6 Plus are framed as useful middle-tier options, especially on price or local availability, but not as replacements for the strongest API models.
What The Comments Added
Replies pressed on benchmark validity. One commenter questioned whether the Kimi-for-coding backend always respects the requested model ID. Another said their own C, C++, Rust, LISP and math work still favors GPT and Gemini 3.1 Pro. That made the thread useful: rather than treating the leaderboard as final truth, it exposed the variables that make coding-agent evals hard to compare, including provider routing, framework behavior, cost, task mix and the gap between score and daily use. For teams choosing agents, that separation of benchmark pass rate from reliability during messy repository work is the most practical takeaway.
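The routing concern is also something a team can sanity-check directly. The snippet below is a minimal sketch assuming an OpenAI-compatible endpoint; the base URL, environment variable names and the model ID passed at the bottom are placeholders, and a matching model field in the response only rules out obvious silent substitution rather than proving which weights actually served the request.

    import os
    import requests

    # Placeholder endpoint and key names; point these at whichever backend your agent uses.
    BASE_URL = os.environ.get("CODING_API_BASE", "https://api.example.com/v1")
    API_KEY = os.environ["CODING_API_KEY"]

    def check_model_id(requested: str) -> None:
        # Send a trivial completion and compare the model ID the backend reports back.
        resp = requests.post(
            f"{BASE_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": requested,
                "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
                "max_tokens": 8,
            },
            timeout=60,
        )
        resp.raise_for_status()
        served = resp.json().get("model", "<missing>")
        print(f"requested={requested!r} served={served!r} -> "
              f"{'match' if served == requested else 'MISMATCH'}")

    if __name__ == "__main__":
        check_model_id("kimi-k2.6-code-preview")  # hypothetical model ID; adjust to your provider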
Sources: r/LocalLLaMA discussion, SanityHarness leaderboard, SanityHarness GitHub.
Related Articles
LiteCoder is making a case that smaller coding agents still have room to climb, releasing terminal-focused models plus 11,255 trajectories and 602 Harbor environments. Its 30B model reaches 31.5% Pass@1 on Terminal Bench Pro, up from 22.0% in the preview.
The r/singularity thread did not just react to Opus 4.7 scoring 41.0% where Opus 4.6 scored 94.7%. The interesting part was the community trying to separate real capability loss from refusal behavior, routing, and benchmark interpretation.
A Hacker News post pushed ATLAS into the spotlight by framing a consumer-GPU coding agent as a serious cost challenger to hosted systems. The headline benchmark is interesting, but the repository itself makes clear that its 74.6% result is not a controlled head-to-head against Claude 4.5 Sonnet because the task counts and evaluation protocols differ.