Community Builds 16-Node NVIDIA DGX Spark Cluster for Unified-Memory LLM Inference
Original: 16x Spark Cluster (Build Update)
Build Complete
A LocalLLaMA community member has completed a 16-node NVIDIA DGX Spark cluster, connecting all nodes through an FS N8510 switch with QSFP56 cables. The setup achieves 100–111 Gbps per rail across two rails, aggregating to the advertised 200 Gbps per node.
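The per-node figure follows from simple addition across the two rails. A minimal sketch of that arithmetic, using only the per-rail range quoted above; the function names are illustrative and not taken from the original post:

```python
# Per-rail throughput range reported for the build, in Gbps.
RAIL_GBPS_MIN = 100
RAIL_GBPS_MAX = 111
RAILS_PER_NODE = 2  # "dual rail"


def aggregate_gbps(per_rail_gbps: float, rails: int = RAILS_PER_NODE) -> float:
    """Sum identical rails to get per-node bandwidth in Gbps."""
    return per_rail_gbps * rails


def gbps_to_gigabytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8


low = aggregate_gbps(RAIL_GBPS_MIN)
high = aggregate_gbps(RAIL_GBPS_MAX)
print(f"per node: {low:.0f} to {high:.0f} Gbps")
print(f"in bytes: {gbps_to_gigabytes_per_s(low):.2f} to "
      f"{gbps_to_gigabytes_per_s(high):.2f} GB/s")
```

The low end of the measured range lands exactly on the 200 Gbps figure; the byte conversion is included because model-shard transfer sizes are usually quoted in gigabytes rather than gigabits.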
Why DGX Spark Over H100s or GB300?
The answer is unified memory: the builder's primary goal was to maximize unified memory capacity within the NVIDIA ecosystem. At 8 nodes, the setup served GLM-5.1-NVFP4 (434 GB) at TP=8, that is, with tensor parallelism spread across all eight nodes. With 16 nodes, the plan is to test DeepSeek and Kimi alongside a prefill/decode split architecture.
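A rough capacity calculation shows why memory drives the design. The sketch below assumes 128 GB of unified memory per DGX Spark, which is NVIDIA's published spec rather than a number from the post, and it idealizes tensor parallelism as an even split of the weights (real deployments replicate some tensors and also need room for KV cache and the OS):

```python
NODE_UNIFIED_MEMORY_GB = 128  # assumption: published DGX Spark spec, not stated in the post
MODEL_WEIGHTS_GB = 434        # GLM-5.1-NVFP4 size quoted in the article


def weights_per_node_gb(model_gb: float, tp_degree: int) -> float:
    """Idealized even split of model weights across tensor-parallel ranks."""
    return model_gb / tp_degree


def headroom_per_node_gb(model_gb: float, tp_degree: int,
                         node_gb: float = NODE_UNIFIED_MEMORY_GB) -> float:
    """Memory left on each node for KV cache, activations, and the OS."""
    return node_gb - weights_per_node_gb(model_gb, tp_degree)


def total_memory_gb(nodes: int, node_gb: float = NODE_UNIFIED_MEMORY_GB) -> float:
    """Pooled unified memory across the whole cluster."""
    return nodes * node_gb


print(f"weights per node at TP=8: {weights_per_node_gb(MODEL_WEIGHTS_GB, 8):.2f} GB")
print(f"headroom per node at TP=8: {headroom_per_node_gb(MODEL_WEIGHTS_GB, 8):.2f} GB")
print(f"pooled memory, 8 vs 16 nodes: {total_memory_gb(8):.0f} GB vs {total_memory_gb(16):.0f} GB")
```

Under these assumptions each node holds about 54 GB of weights at TP=8, and doubling to 16 nodes doubles the pooled memory, which is what makes larger models plausible candidates.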
Setup Process
Each DGX Spark ships with NVIDIA's Ubuntu flavor with most software pre-installed. The setup process involved racking the units, creating matching user accounts across all nodes, waiting ~20 minutes per node for updates, then scripting passwordless SSH, jumbo frames, and IP configuration.
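The scripted part of that process might look something like the following sketch, which only generates the shell commands for each node and does not run them. The hostnames, user name, interface names, and subnets are all hypothetical placeholders; the post does not give the builder's actual values or scripts.

```python
import ipaddress

NODE_COUNT = 16
USER = "cluster"                          # hypothetical shared account name
IFACES = ["rail0", "rail1"]               # hypothetical interface names, one per rail
SUBNETS = ["10.0.0.0/24", "10.0.1.0/24"]  # hypothetical subnets, one per rail
JUMBO_MTU = 9000                          # common jumbo-frame MTU


def node_plan(index: int) -> list[str]:
    """Return the commands to prepare node `index` (1-based)."""
    host = f"spark-{index:02d}"  # hypothetical hostname scheme
    # Passwordless SSH: copy the local public key to the node.
    cmds = [f"ssh-copy-id {USER}@{host}"]
    for iface, subnet in zip(IFACES, SUBNETS):
        net = ipaddress.ip_network(subnet)
        addr = net.network_address + index  # node 1 gets .1, node 16 gets .16
        # Jumbo frames, then a static address on this rail.
        cmds.append(f"ssh {USER}@{host} sudo ip link set dev {iface} mtu {JUMBO_MTU}")
        cmds.append(f"ssh {USER}@{host} sudo ip addr add {addr}/{net.prefixlen} dev {iface}")
    return cmds


for i in range(1, NODE_COUNT + 1):
    for cmd in node_plan(i):
        print(cmd)
```

Printing the commands instead of executing them makes the plan easy to review before touching 16 machines. Settings applied with `ip` do not survive a reboot, so a real deployment would likely write the same values into the system's persistent network configuration.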
What This Signals
This build shows that large multi-node GPU clusters are increasingly within reach of individuals and small teams. The focus on unified memory over raw compute reflects a maturing approach to LLM inference infrastructure: optimizing for model capacity rather than pure throughput.