GLM-5 Becomes Top Open-Weights Model on Extended NYT Connections Benchmark
Original: GLM-5 is the new top open-weights model on the Extended NYT Connections benchmark, with a score of 81.8, edging out Kimi K2.5 Thinking (78.3).
GLM-5 Takes the Lead
Zhipu AI's GLM-5 has achieved a score of 81.8 on the Extended NYT Connections benchmark, making it the new top-performing open-weights language model on this evaluation. It edges out the previous leader, Kimi K2.5 Thinking, which scored 78.3, a meaningful 3.5-point margin.
What the NYT Connections Benchmark Tests
The Extended NYT Connections benchmark is based on The New York Times' word-association puzzle, adapted for LLM evaluation. Players (or models) must sort 16 words into 4 hidden categories of 4 words each. What makes this challenging for LLMs is that it requires genuine conceptual reasoning beyond statistical pattern matching: understanding polysemy, cultural references, lateral thinking, and semantic groupings that aren't immediately obvious.
Unlike standard benchmarks that can be gamed by memorization, NYT Connections tests flexible, contextual intelligence. A model that does well here is demonstrating something closer to genuine language understanding. The full benchmark results are available at github.com/lechmazur/nyt-connections.
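To make the puzzle format concrete, here is a minimal Python sketch of exact-match grading for a single puzzle. The words, categories, and scoring rule below are illustrative assumptions, not taken from the benchmark; the actual prompting and scoring are defined in the repository linked above.

```python
def score_attempt(solution, attempt):
    """Fraction of proposed 4-word groups that exactly match a hidden category.

    solution: list of 4 sets, each holding the 4 words of one hidden category.
    attempt:  list of 4 sets, the model's proposed grouping of the 16 words.
    """
    gold = {frozenset(group) for group in solution}
    return sum(1 for group in attempt if frozenset(group) in gold) / len(solution)


# Invented toy puzzle (not a real NYT or benchmark puzzle).
solution = [
    {"sole", "perch", "flounder", "carp"},   # fish
    {"bass", "tenor", "alto", "soprano"},    # voice parts ("bass" is also a fish)
    {"red", "dead", "salton", "caspian"},    # ___ Sea
    {"oak", "elm", "maple", "birch"},        # trees
]

# A model misled by the polysemous "bass" swaps it into the fish group.
attempt = [
    {"sole", "perch", "flounder", "bass"},
    {"carp", "tenor", "alto", "soprano"},
    {"red", "dead", "salton", "caspian"},
    {"oak", "elm", "maple", "birch"},
]

print(score_attempt(solution, attempt))  # 0.5: only two of four groups match
```

The trap here mirrors what the prose describes: grouping "bass" with other fish looks locally plausible, but it breaks the single globally consistent partition the puzzle demands.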
Chinese Open-Source AI's Rising Tide
Zhipu AI is a Beijing-based AI startup with strong ties to Tsinghua University, known for its General Language Model (GLM) series. GLM-5's achievement highlights the rapid progress of Chinese open-source AI — particularly notable given that its main competition (Kimi K2.5 Thinking from Moonshot AI) is also a Chinese startup.
Open-Weights Competition Intensifies
This result signals that Chinese models are increasingly competitive in the open-weights space, challenging Western counterparts such as Meta's Llama series and Mistral's models. GLM-5's score of 81.8 also compares favorably with many proprietary models, suggesting the gap between open and closed models continues to narrow rapidly.
Related Articles
HN read Kimi K2.6 as a test of whether open-weight coding agents can last through real engineering work. The 12-hour and 13-hour coding cases drew attention, while commenters immediately pressed on speed, provider accuracy, and benchmark realism.
HN did not latch onto DeepSeek V4 because of a polished launch page. The thread took off when commenters realized the front-page link was just updated docs while the weights and base models were already live for inspection.
Why it matters: an open-weight 27B dense model is now being pitched against much larger coding systems on real agent tasks. Qwen’s own model card lists SWE-bench Verified at 77.2 for Qwen3.6-27B versus 76.2 for Qwen3.5-397B-A17B, with Apache 2.0 licensing.