GLM-5 Becomes Top Open-Weights Model on Extended NYT Connections Benchmark
GLM-5 is the new top open-weights model on the Extended NYT Connections benchmark, with a score of 81.8, edging out Kimi K2.5 Thinking (78.3).
GLM-5 Takes the Lead
Zhipu AI's GLM-5 has achieved a score of 81.8 on the Extended NYT Connections benchmark, making it the new top-performing open-weights language model on this evaluation. It edges out the previous leader Kimi K2.5 Thinking, which scored 78.3 — a meaningful 3.5-point gap.
What the NYT Connections Benchmark Tests
The Extended NYT Connections benchmark is based on The New York Times' word association puzzle game, adapted for LLM evaluation. Players (or models) must sort 16 words into 4 hidden categories. What makes this benchmark challenging for LLMs is that it requires genuine conceptual reasoning beyond statistical pattern matching — understanding polysemy, cultural references, lateral thinking, and semantic groupings that aren't immediately obvious.
Unlike standard benchmarks that can be gamed through memorization, NYT Connections tests flexible, contextual intelligence. A model that does well here demonstrates something closer to genuine language understanding. The full benchmark results are available at github.com/lechmazur/nyt-connections.
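The puzzle structure described above is easy to sketch in code. The following Python is a hypothetical illustration, not the benchmark's actual harness: the words, categories, and scoring function are invented to show why the 16-words / 4-groups format punishes shallow pattern matching.

```python
# A minimal sketch of a Connections-style puzzle, assuming the basic
# format: partition 16 words into 4 hidden groups of 4. Everything
# here (words, categories, scoring) is an illustrative assumption,
# not taken from the Extended NYT Connections benchmark itself.

def score_puzzle(guess, answer):
    """Count how many of the 4 proposed groups exactly match a
    hidden category (group order does not matter)."""
    remaining = list(answer)
    correct = 0
    for group in guess:
        if group in remaining:      # set equality against unmatched categories
            remaining.remove(group)
            correct += 1
    return correct

# Hidden categories (hypothetical). Note the polysemy trap:
# "bass" and "drum" could plausibly be fish, but belong elsewhere.
answer = [
    {"pike", "sole", "carp", "perch"},       # fish
    {"bass", "drum", "horn", "organ"},       # instruments
    {"march", "may", "june", "august"},      # months
    {"mercury", "mars", "venus", "saturn"},  # planets
]

# A model that falls for the trap misses both overlapping groups.
guess = [
    {"pike", "sole", "carp", "bass"},
    {"perch", "drum", "horn", "organ"},
    {"march", "may", "june", "august"},
    {"mercury", "mars", "venus", "saturn"},
]
print(score_puzzle(guess, answer))  # 2 of 4 groups correct
```

One mistaken word ruins two groups at once, which is why the puzzle rewards resolving ambiguous words in context rather than matching surface-level associations.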
Chinese Open-Source AI's Rising Tide
Zhipu AI is a Beijing-based AI startup with strong ties to Tsinghua University, known for its General Language Model (GLM) series. GLM-5's achievement highlights the rapid progress of Chinese open-source AI — particularly notable given that its main competition (Kimi K2.5 Thinking from Moonshot AI) is also a Chinese startup.
Open-Weights Competition Intensifies
This result signals that Chinese models are increasingly competitive in the open-weights space, challenging Western counterparts such as Meta's Llama series and Mistral's models. GLM-5's score of 81.8 also compares favorably with many proprietary models, suggesting the gap between open and closed models continues to narrow.
Related Articles
Google AI Developers has released Android Bench, an official leaderboard for LLMs on Android development tasks. In the first results, Gemini 3.1 Pro ranks first, and Google is also publishing the benchmark, dataset, and test harness.
China's GLM-5 model achieves a score of 50 on the Intelligence Index, claiming top performance among open-source large language models.
Anthropic released Claude Opus 4.6, achieving industry-leading performance in coding, long-context retrieval, and knowledge work.