A 145-result coding eval put Kimi K2.6, Opus 4.7, GLM 5.1 and Minimax under LocalLLaMA review
Original: Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and more tested in coding
Community Spark
An r/LocalLLaMA post compared Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and other coding models, then pointed readers to the SanityHarness leaderboard. The author says the old and new leaderboards together now hold 145 results, built up from repeated coding-agent evaluations rather than one-off vibe checks.
What Was Tested
The post describes SanityHarness as a coding-agent-agnostic benchmark. Its GitHub README says the harness runs compact but challenging problems inside isolated Docker containers, spans six languages, applies weighted scoring, and includes integrity checks plus hidden tests. In this latest pass, the author says Kimi K2.6-Code-Preview was tested through early access alongside Opus 4.7, GLM 5.1, Minimax M2.7 and other systems.
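The post links the README rather than walking through internals, but the scoring shape it describes is easy to illustrate. The sketch below is not SanityHarness code: the Task fields, the per-language weights and the run_in_container helper are assumptions made for illustration, and only the broad outline (isolated Docker runs, weighted aggregation, visible plus hidden tests) follows the README's description.

    import subprocess
    from dataclasses import dataclass

    @dataclass
    class Task:
        # Hypothetical task record; the real SanityHarness schema is not published in the post.
        name: str
        language: str
        image: str          # Docker image providing the toolchain for this language
        visible_cmd: str    # test command the model can see while solving
        hidden_cmd: str     # held-back tests used for the final, integrity-checked score

    # Assumed per-language weights; the README only says scoring is weighted.
    WEIGHTS = {"python": 1.0, "rust": 1.2, "cpp": 1.2, "go": 1.0, "typescript": 1.0, "java": 1.0}

    def run_in_container(image: str, workdir: str, cmd: str, timeout: int = 300) -> bool:
        # Execute one test command inside an isolated, network-less container.
        result = subprocess.run(
            ["docker", "run", "--rm", "--network", "none",
             "-v", f"{workdir}:/work", "-w", "/work", image, "sh", "-c", cmd],
            capture_output=True, timeout=timeout,
        )
        return result.returncode == 0

    def score(tasks: list[Task], solutions: dict[str, str]) -> float:
        # Weighted pass rate: a task counts only if visible and hidden tests both pass.
        earned = total = 0.0
        for task in tasks:
            weight = WEIGHTS.get(task.language, 1.0)
            total += weight
            workdir = solutions.get(task.name)
            if workdir is None:
                continue
            if run_in_container(task.image, workdir, task.visible_cmd) and \
               run_in_container(task.image, workdir, task.hidden_cmd):
                earned += weight
        return 100.0 * earned / total if total else 0.0

The hidden command is what makes the integrity check bite: a solution that only satisfies the visible tests earns nothing toward the weighted score.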
The Tension In The Results
The community hook was not a clean leaderboard winner but the gap between scores and day-to-day use. The author says Opus 4.7 can score well in evals while feeling much worse in actual coding sessions, with frequent hallucination and stubbornly incorrect assumptions. Kimi K2.6 is described as a step up from Kimi K2.5 and slightly ahead of GLM 5.1 in the author's testing. Minimax M2.7 and Qwen 3.6 Plus are framed as useful middle-tier options, especially on price or local availability, but not as replacements for the strongest API models.
What The Comments Added
Replies pressed on benchmark validity. One commenter questioned whether the Kimi-for-coding backend always respects the requested model ID. Another said their own C, C++, Rust, LISP and math work still favors GPT and Gemini 3.1 Pro. That made the thread useful: rather than treating the leaderboard as final truth, it exposed the variables that make coding-agent evals hard to compare, including provider routing, framework behavior, cost, task mix and the gap between score and daily use. For teams choosing agents, that separation of benchmark pass rate from reliability during messy repository work is the most practical takeaway.
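The routing concern is also something a team can sanity-check directly. The snippet below is a minimal sketch assuming an OpenAI-compatible endpoint; the base URL, environment variable names and the model ID passed at the bottom are placeholders, and a matching model field in the response only rules out obvious silent substitution rather than proving which weights actually served the request.

    import os
    import requests

    # Placeholder endpoint and key names; point these at whichever backend your agent uses.
    BASE_URL = os.environ.get("CODING_API_BASE", "https://api.example.com/v1")
    API_KEY = os.environ["CODING_API_KEY"]

    def check_model_id(requested: str) -> None:
        # Send a trivial completion and compare the model ID the backend reports back.
        resp = requests.post(
            f"{BASE_URL}/chat/completions",
            headers={"Authorization": f"Bearer {API_KEY}"},
            json={
                "model": requested,
                "messages": [{"role": "user", "content": "Reply with the single word: ok"}],
                "max_tokens": 8,
            },
            timeout=60,
        )
        resp.raise_for_status()
        served = resp.json().get("model", "<missing>")
        print(f"requested={requested!r} served={served!r} -> "
              f"{'match' if served == requested else 'MISMATCH'}")

    if __name__ == "__main__":
        check_model_id("kimi-k2.6-code-preview")  # hypothetical model ID; adjust to your provider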
Sources: r/LocalLLaMA discussion, SanityHarness leaderboard, SanityHarness GitHub.
Related Articles
LiteCoder is making a case that smaller coding agents still have room to climb, releasing terminal-focused models plus 11,255 trajectories and 602 Harbor environments. Its 30B model reaches 31.5% Pass@1 on Terminal Bench Pro, up from 22.0% in the preview.
The r/singularity thread did not just react to Opus 4.7 scoring 41.0% where Opus 4.6 scored 94.7%. The interesting part was the community trying to separate real capability loss from refusal behavior, routing, and benchmark interpretation.
A Hacker News post pushed ATLAS into the spotlight by framing a consumer-GPU coding agent as a serious cost challenger to hosted systems. The headline benchmark is interesting, but the repository itself makes clear that its 74.6% result is not a controlled head-to-head against Claude 4.5 Sonnet because the task counts and evaluation protocols differ.