A 145-result coding eval put Kimi K2.6, Opus 4.7, GLM 5.1 and Minimax under LocalLLaMA review

Community Spark

An r/LocalLLaMA post compared Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and other coding models, then pointed readers to the SanityHarness leaderboard. The author says the old and newer leaderboards now contain 145 results, built from repeated coding-agent evaluations rather than one-off vibe checks.

What Was Tested

The post describes SanityHarness as a coding-agent-agnostic benchmark. Its GitHub README says the harness runs compact but challenging problems inside isolated Docker containers, spans six languages, applies weighted scoring, and includes integrity checks plus hidden tests. In this latest pass, the author says Kimi K2.6-Code-Preview was tested through early access alongside Opus 4.7, GLM 5.1, Minimax M2.7 and other systems.
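To make "weighted scoring" concrete, here is a minimal sketch of how a weighted pass score can differ from a plain pass rate. It is not SanityHarness's actual formula, which this article does not reproduce: the `TaskResult` fields, the weights and the rule that a failed integrity check forfeits a task's credit are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """One task outcome. Field names are hypothetical, not the harness's schema."""
    task_id: str
    language: str
    weight: float              # assumed difficulty weight; harder tasks count more
    passed: bool               # assumed to mean visible and hidden tests both passed
    integrity_ok: bool = True  # assumed flag: False if test files were tampered with


def earned_credit(result: TaskResult) -> bool:
    # Assumption: a task only counts when it passes and the integrity check holds.
    return result.passed and result.integrity_ok


def weighted_score(results: list[TaskResult]) -> float:
    """Percentage of total weight earned across all tasks."""
    total = sum(r.weight for r in results)
    if total == 0:
        return 0.0
    earned = sum(r.weight for r in results if earned_credit(r))
    return 100.0 * earned / total


def plain_pass_rate(results: list[TaskResult]) -> float:
    """Unweighted percentage of tasks that earned credit."""
    if not results:
        return 0.0
    return 100.0 * sum(earned_credit(r) for r in results) / len(results)


results = [
    TaskResult("easy-1", "go", weight=1.0, passed=True),
    TaskResult("hard-1", "rust", weight=3.0, passed=False),
    TaskResult("easy-2", "python", weight=1.0, passed=True, integrity_ok=False),
]

print(f"weighted score:  {weighted_score(results):.1f}")
print(f"plain pass rate: {plain_pass_rate(results):.1f}")
```

In this toy run, one of three tasks earns credit, but it carries only one fifth of the total weight, so the weighted score comes out lower than the plain pass rate. That gap is the reason a weighted board can rank models differently from a simple count of solved tasks.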

The Tension In The Results

What hooked the community was not a clean leaderboard winner but a gap between scores and experience. The author says Opus 4.7 can score well in evals while feeling much worse in actual coding sessions, with frequent hallucinations and stubborn incorrect assumptions. Kimi K2.6 is described as a step up from Kimi K2.5 and slightly above GLM 5.1 in the author’s testing. Minimax M2.7 and Qwen 3.6 Plus are framed as useful middle-tier options, especially around price or local availability, but not replacements for the strongest API models.

What The Comments Added

Replies pressed on benchmark validity. One commenter questioned whether the Kimi-for-coding backend always respects the requested model ID. Another said their own C, C++, Rust, LISP and math work still favors GPT and Gemini 3.1 Pro. That made the thread useful: it did not treat the board as final truth. Instead, it exposed the variables that make coding-agent evals hard to compare, including provider routing, framework behavior, cost, task mix and the gap between score and daily use. That last distinction matters most for teams choosing agents, because it separates benchmark pass rate from reliability during messy repository work.
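The routing question can at least be spot-checked from logs. The sketch below assumes each logged call records the model ID that was requested and the model ID the provider reported back; the record keys and the placeholder IDs `model-a` and `model-b` are hypothetical, and nothing here reflects how the Kimi backend or SanityHarness actually logs requests.

```python
from collections import Counter


def routing_mismatches(records: list[dict]) -> Counter:
    """Count (requested, reported) pairs where the two model IDs differ.

    Each record is assumed to carry a "requested_model" key and, when the
    provider returned one, a "reported_model" key. A missing reported ID
    is counted as a mismatch so that it gets reviewed.
    """
    mismatches: Counter = Counter()
    for record in records:
        requested = record["requested_model"]
        reported = record.get("reported_model")
        if reported != requested:
            mismatches[(requested, reported)] += 1
    return mismatches


# Placeholder log entries, not real model identifiers.
log = [
    {"requested_model": "model-a", "reported_model": "model-a"},
    {"requested_model": "model-a", "reported_model": "model-b"},
    {"requested_model": "model-a"},
]

for (requested, reported), count in routing_mismatches(log).items():
    print(f"requested {requested!r}, reported {reported!r}: {count} call(s)")
```

A clean result here is weak evidence at best, since the reported ID is whatever the provider chooses to return. A mismatch, on the other hand, is a concrete reason to treat that provider's leaderboard rows with caution.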

Sources: r/LocalLLaMA discussion, SanityHarness leaderboard, SanityHarness GitHub.
