GLM-5.1 inference gains came from network topology, not new GPUs
Original: Zai replaced the network architecture running GLM-5.1 inference and the gains are pretty wild
A LocalLLaMA post about Zai’s GLM-5.1 inference cluster drew attention because the gains came from the network layer, not a new model or new GPUs. According to the post, Zai replaced a standard ROFT setup with ZCube, developed with Tsinghua University and HarnetsAI, on a thousand-GPU cluster running GLM-5.1 coding inference. What makes the claim notable is that the GPUs, software stack, and model reportedly stayed the same, so the improvement is attributed to the fabric alone.
The reported production numbers were concrete: switch and optical module costs down 33%, GPU inference throughput up 15%, and first-token P99 tail latency down 40.6%. That is not the usual tradeoff operators expect. Higher network performance often implies more hardware spend. Here, the claim is that a topology change lowered cost while improving throughput and tail latency.
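Because the cost and throughput figures move in opposite directions, they compound. As a rough back-of-the-envelope check using only the percentages reported in the post, the sketch below computes network spend per unit of inference throughput. It covers switches and optical modules only, not total cluster cost. The variable names are illustrative, and the calculation assumes both percentages are measured against the same baseline cluster, which the post does not spell out.

```python
# Figures as reported in the post (relative changes versus the old fabric).
network_cost_change = -0.33   # switch and optical module cost
throughput_change = 0.15      # GPU inference throughput

# Assumption: both changes are relative to the same baseline cluster.
relative_cost = 1 + network_cost_change
relative_throughput = 1 + throughput_change

# Network spend needed per unit of throughput, relative to the baseline (1.0).
cost_per_throughput = relative_cost / relative_throughput
reduction_pct = (1 - cost_per_throughput) * 100

print(round(cost_per_throughput, 3))  # 0.583
print(round(reduction_pct, 1))        # 41.7
```

Under that assumption, network cost per unit of throughput would fall by roughly 42%, more than either headline number suggests on its own.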
The technical issue is Prefill-Decode disaggregated inference, where prefill nodes compute the KV Cache and ship it to separate decode nodes. Those KV Cache transfers create asymmetric traffic between nodes, and a topology that works for training can map poorly to inference traffic. In the post’s explanation, ROFT’s static rail mapping led to hotspots on particular Leaf switches and PFC backpressure. ZCube removes the Spine layer and uses a flattened complete bipartite interconnect between two switch groups, reducing a class of congestion by design.
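To make "complete bipartite interconnect between two switch groups" concrete, here is a minimal Python sketch of the graph structure only. It is not ZCube's actual design: the group sizes, switch names, and helper functions are all hypothetical, and the post gives no details about how hosts attach to switches or how flows are routed. The sketch simply shows that in a complete bipartite graph every switch in one group is one hop from every switch in the other, and that two switches in the same group have as many disjoint two-hop paths as there are switches in the opposite group.

```python
from collections import deque
from itertools import product


def complete_bipartite(group_a, group_b):
    """Adjacency map in which every switch in group_a links to every switch in group_b."""
    adj = {switch: set() for switch in (*group_a, *group_b)}
    for a, b in product(group_a, group_b):
        adj[a].add(b)
        adj[b].add(a)
    return adj


def hops(adj, src, dst):
    """Shortest hop count between two switches, found by breadth-first search."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return dist[node]
        for peer in adj[node]:
            if peer not in dist:
                dist[peer] = dist[node] + 1
                queue.append(peer)
    raise ValueError(f"{dst} is unreachable from {src}")


def two_hop_paths(adj, src, dst):
    """Number of distinct two-hop paths, one per neighbor shared by src and dst."""
    return len(adj[src] & adj[dst])


# Hypothetical sizes: four switches per group.
group_a = [f"A{i}" for i in range(4)]
group_b = [f"B{i}" for i in range(4)]
fabric = complete_bipartite(group_a, group_b)

# Each link appears twice in the adjacency map, once per endpoint.
link_count = sum(len(peers) for peers in fabric.values()) // 2

print(link_count)                         # 16
print(hops(fabric, "A0", "B3"))           # 1
print(hops(fabric, "A0", "A3"))           # 2
print(two_hop_paths(fabric, "A0", "A3"))  # 4
```

The path count is the part relevant to congestion: with several equivalent routes between any two switches and no Spine tier in between, a fabric has more room to spread asymmetric traffic than a mapping that pins each flow to one fixed Leaf. Whether and how ZCube exploits this is not described in the post.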
The most useful community reaction was that the bottleneck keeps moving lower in the stack. LLM inference optimization is no longer just about weights, quantization, or scheduler tricks. At large scale, the fabric carrying KV Cache traffic can decide both cost and responsiveness. For operators, this is a reminder to profile network topology before assuming the next performance step requires more GPUs.