A 145-result coding eval put Kimi K2.6, Opus 4.7, GLM 5.1 and Minimax under LocalLLaMA review
Original: Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and more tested in coding
Community Spark
An r/LocalLLaMA post compared Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and other coding models, then pointed readers to the SanityHarness leaderboard. The author says the old and newer leaderboards together now contain 145 results, built from repeated coding-agent evaluations rather than one-off vibe checks.
What Was Tested
The post describes SanityHarness as a coding-agent-agnostic benchmark. Its GitHub README says the harness runs compact but challenging problems inside isolated Docker containers, spans six languages, applies weighted scoring, and includes integrity checks plus hidden tests. In this latest pass, the author says Kimi K2.6-Code-Preview was tested through early access alongside Opus 4.7, GLM 5.1, Minimax M2.7 and other systems.
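As a rough illustration of why weighted scoring matters, the sketch below compares a plain pass rate with a weight-adjusted score. It is not SanityHarness's actual formula: the `TaskResult` fields, the task names, the weights and the 0-100 scale are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    name: str        # hypothetical task name
    language: str    # hypothetical language label
    weight: float    # hypothetical difficulty weight
    passed: bool     # assumed True only if visible and hidden tests both pass


def pass_rate(results: list[TaskResult]) -> float:
    """Unweighted share of tasks passed, on a 0-100 scale."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


def weighted_score(results: list[TaskResult]) -> float:
    """Share of total weight earned, on a 0-100 scale."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in results if r.passed)
    return 100.0 * earned / total


# Illustrative data only; these are not real leaderboard entries.
results = [
    TaskResult("task-a", "go", 1.0, True),
    TaskResult("task-b", "rust", 3.0, False),
    TaskResult("task-c", "typescript", 2.0, True),
]

print(f"pass rate:      {pass_rate(results):.1f}")
print(f"weighted score: {weighted_score(results):.1f}")
```

In this made-up run the agent passes two of three tasks, yet earns only half the available weight because it fails the heaviest one. A weighting scheme like this rewards solving hard problems rather than collecting easy passes.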
The Tension In The Results
What drew the community in was not a clean leaderboard winner. The author says Opus 4.7 can score well in evals while feeling much worse in actual coding sessions, with frequent hallucination and stubborn incorrect assumptions. Kimi K2.6 is described as a step up from Kimi K2.5 and slightly above GLM 5.1 in the author’s testing. Minimax M2.7 and Qwen 3.6 Plus are framed as useful middle-tier options, especially on price or local availability, but not as replacements for the strongest API models.
What The Comments Added
Replies pressed on benchmark validity. One commenter questioned whether the Kimi-for-coding backend always respects the requested model ID. Another said their own C, C++, Rust, LISP and math work still favors GPT and Gemini 3.1 Pro. That made the thread useful: it did not treat the board as final truth. Instead, it exposed the variables that make coding-agent evals hard to compare, including provider routing, framework behavior, cost, task mix and the gap between score and daily use. The discussion is especially useful for teams choosing agents because it separates benchmark pass rate from reliability during messy repository work.
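The provider-routing concern can be partly checked from the client side. The sketch below assumes an OpenAI-style chat completion response, parsed into a dict, that reports the serving model in a top-level `model` field. Whether any particular backend fills that field honestly is not something the thread establishes, and the model IDs shown are hypothetical.

```python
def model_mismatch(requested: str, response: dict) -> bool:
    """Return True when the response names a different model than requested.

    Assumes an OpenAI-style response dict with a top-level "model" field.
    A missing field is treated as "unknown", not as a mismatch.
    """
    served = response.get("model")
    if served is None:
        return False
    return served.strip().lower() != requested.strip().lower()


# Hypothetical model IDs, for illustration only.
print(model_mismatch("kimi-k2.6-code-preview", {"model": "kimi-k2.5"}))
print(model_mismatch("kimi-k2.6-code-preview", {"model": "kimi-k2.6-code-preview"}))
```

A mismatch is a useful warning sign, but a match proves little, since a router could echo the requested ID while serving something else. Logging this field for every eval run at least makes silent substitutions easier to spot.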
Sources: r/LocalLLaMA discussion, SanityHarness leaderboard, SanityHarness GitHub.