A 145-result coding eval put Kimi K2.6, Opus 4.7, GLM 5.1 and Minimax under LocalLLaMA review

Original post: Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and more tested in coding

LLM · Apr 19, 2026 · By Insights AI (Reddit) · 2 min read

Community Spark

A r/LocalLLaMA post compared Kimi K2.6-Code-Preview, Opus 4.7, GLM 5.1, Minimax M2.7 and other coding models, then pointed readers to the SanityHarness leaderboard. The author says the combined old and new leaderboards now contain 145 results, built from repeated coding-agent evaluations rather than one-off vibe checks.

What Was Tested

The post describes SanityHarness as a coding-agent-agnostic benchmark. Its GitHub README says the harness runs compact but challenging problems inside isolated Docker containers, spans six languages, applies weighted scoring, and includes integrity checks plus hidden tests. In this latest pass, the author says Kimi K2.6-Code-Preview was tested through early access alongside Opus 4.7, GLM 5.1, Minimax M2.7 and other systems.
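The README's description of weighted scoring with hidden tests can be sketched in a few lines. This is a hedged illustration of the general idea, not SanityHarness's actual implementation; the problem names and weights below are made up.

```python
# Minimal sketch of weighted scoring with hidden tests, in the style the
# SanityHarness README describes. All names and weights are hypothetical.

def score(results, weights):
    """Weighted pass rate: each problem contributes its weight if passed."""
    total = sum(weights[p] for p in results)
    earned = sum(weights[p] for p, passed in results.items() if passed)
    return earned / total if total else 0.0

# A problem only counts as passed if it clears both the visible tests and
# the hidden ones, which is what catches overfitting to the public suite.
results = {
    "rust/parser": True,      # passed public and hidden tests
    "python/cache": False,    # passed public tests, failed a hidden test
    "c/allocator": True,
}
weights = {"rust/parser": 3, "python/cache": 2, "c/allocator": 1}

print(round(score(results, weights), 2))  # 4/6 ≈ 0.67
```

Weighting means a harder problem moves the leaderboard more than an easy one, so two models with the same raw pass count can land in different places.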

The Tension In The Results

The community hook was not a clean leaderboard winner. The author says Opus 4.7 can score well in evals while feeling much worse in actual coding sessions, with frequent hallucination and stubborn incorrect assumptions. Kimi K2.6 is described as a step up from Kimi K2.5 and slightly above GLM 5.1 in the author’s testing. Minimax M2.7 and Qwen 3.6 Plus are framed as useful middle-tier options, especially around price or local availability, but not replacements for the strongest API models.

What The Comments Added

Replies pressed on benchmark validity. One commenter questioned whether the Kimi-for-coding backend always respects the requested model ID. Another said their own C, C++, Rust, LISP and math work still favors GPT and Gemini 3.1 Pro. That made the thread useful: it did not treat the board as final truth. Instead, it exposed the variables that make coding-agent evals hard to compare, including provider routing, framework behavior, cost, task mix and the gap between score and daily use. It is especially useful for teams choosing agents because it separates benchmark pass rate from reliability during messy repository work.
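The routing concern raised in the comments is checkable in principle: many OpenAI-compatible backends echo a model ID in their response metadata, which can be compared against the one requested. The field name and matching rule below are assumptions based on that convention, not anything from the thread or from SanityHarness.

```python
# Hedged sketch: sanity-check that a provider's response metadata matches
# the model ID you requested. The "model" field and the dated-suffix
# convention are assumptions from OpenAI-compatible APIs, not SanityHarness.

def model_id_matches(requested: str, response_meta: dict) -> bool:
    """True if the backend reports the requested model, allowing a
    dated variant suffix like '<id>-20260401'."""
    reported = response_meta.get("model", "")
    return reported == requested or reported.startswith(requested + "-")

# A router silently substituting a different model would fail this check.
print(model_id_matches("kimi-k2.6-code-preview",
                       {"model": "kimi-k2.6-code-preview-20260401"}))  # True
print(model_id_matches("kimi-k2.6-code-preview",
                       {"model": "kimi-k2.5"}))  # False
```

A check like this does not prove the weights behind the endpoint are what the label claims, but it does catch the simplest form of silent substitution.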

Sources: r/LocalLLaMA discussion, SanityHarness leaderboard, SanityHarness GitHub.


