GLM-5 Becomes Top Open-Weights Model on Extended NYT Connections Benchmark
Original: GLM-5 is the new top open-weights model on the Extended NYT Connections benchmark, with a score of 81.8, edging out Kimi K2.5 Thinking (78.3).
GLM-5 Takes the Lead
Zhipu AI's GLM-5 has achieved a score of 81.8 on the Extended NYT Connections benchmark, making it the new top-performing open-weights language model on this evaluation. It edges out the previous leader, Kimi K2.5 Thinking, which scored 78.3, a meaningful 3.5-point margin.
What the NYT Connections Benchmark Tests
The Extended NYT Connections benchmark is based on The New York Times' word-association puzzle, adapted for LLM evaluation. Players (or models) must sort 16 words into 4 hidden categories of 4 words each. What makes this challenging for LLMs is that it requires genuine conceptual reasoning beyond statistical pattern matching: understanding polysemy, cultural references, lateral thinking, and semantic groupings that aren't immediately obvious.
Unlike standard benchmarks that can be gamed by memorization, NYT Connections tests flexible, contextual intelligence. A model that does well here is demonstrating something closer to genuine language understanding. The full benchmark results are available at github.com/lechmazur/nyt-connections.
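To make the puzzle format concrete, here is a minimal Python sketch of exact-match grading for a single puzzle. The words, categories, and scoring rule below are illustrative assumptions, not taken from the benchmark; the actual prompting and scoring are defined in the repository linked above.

```python
def score_attempt(solution, attempt):
    """Fraction of proposed 4-word groups that exactly match a hidden category.

    solution: list of 4 sets, each holding the 4 words of one hidden category.
    attempt:  list of 4 sets, the model's proposed grouping of the 16 words.
    """
    gold = {frozenset(group) for group in solution}
    return sum(1 for group in attempt if frozenset(group) in gold) / len(solution)


# Invented toy puzzle (not a real NYT or benchmark puzzle).
solution = [
    {"sole", "perch", "flounder", "carp"},   # fish
    {"bass", "tenor", "alto", "soprano"},    # voice parts ("bass" is also a fish)
    {"red", "dead", "salton", "caspian"},    # ___ Sea
    {"oak", "elm", "maple", "birch"},        # trees
]

# A model misled by the polysemous "bass" swaps it into the fish group.
attempt = [
    {"sole", "perch", "flounder", "bass"},
    {"carp", "tenor", "alto", "soprano"},
    {"red", "dead", "salton", "caspian"},
    {"oak", "elm", "maple", "birch"},
]

print(score_attempt(solution, attempt))  # 0.5: only two of four groups match
```

The trap here mirrors what the prose describes: grouping "bass" with other fish looks locally plausible, but it breaks the single globally consistent partition the puzzle demands.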
Chinese Open-Source AI's Rising Tide
Zhipu AI is a Beijing-based AI startup with strong ties to Tsinghua University, known for its General Language Model (GLM) series. GLM-5's achievement highlights the rapid progress of Chinese open-source AI — particularly notable given that its main competition (Kimi K2.5 Thinking from Moonshot AI) is also a Chinese startup.
Open-Weights Competition Intensifies
This result signals that Chinese models are increasingly competitive in the open-weights space, challenging Western counterparts such as Meta's Llama series and Mistral's models. GLM-5's score of 81.8 also compares favorably with many proprietary models, suggesting the gap between open and closed models continues to narrow rapidly.
Related Articles
HN read Kimi K2.6 as a test of whether open-weight coding agents can last through real engineering work. The 12-hour and 13-hour coding cases drew attention, while commenters immediately pressed on speed, provider accuracy, and benchmark realism.
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
Why it matters: an open-weight 27B dense model is now being pitched against much larger coding systems on real agent tasks. Qwen’s own model card lists SWE-bench Verified at 77.2 for Qwen3.6-27B versus 76.2 for Qwen3.5-397B-A17B, with Apache 2.0 licensing.