LocalLLaMA Benchmark Pits Dual DGX Sparks Against a 512GB Mac Studio for Qwen3.5 397B
Original: Dual DGX Sparks vs Mac Studio M3 Ultra 512GB: Running Qwen3.5 397B locally on both. Here's what I found.
Why this comparison mattered
A long-form benchmark post on r/LocalLLaMA delivered one of the clearest first-hand comparisons yet of what it actually takes to run a very large open model locally. At crawl time, the thread had 402 points and 229 comments. The author says they had been spending about $2K per month on Claude API usage for a personal assistant running through Slack, then decided to compare local hardware paths by buying a $10K Mac Studio M3 Ultra 512GB and a similarly priced dual DGX Spark setup.
The test model was Qwen3.5 397B A17B. On the Mac Studio, the author used MLX 6-bit quantization, loading a 323GB model into 512GB of unified memory. Reported generation speed was 30 to 40 tok/s. The author credits the Mac's roughly 800 GB/s of memory bandwidth, which makes giant-model token generation feel smooth in a single box. Setup was described as relatively easy, but prefill on large system prompts took 30+ seconds, and MLX VLM lacked built-in handling for tool calls and for stripping thinking tokens, which forced the author to build a custom async proxy.
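The author's proxy code is not shown in the post, but the thinking-token problem it solves is easy to illustrate. Below is a minimal sketch (not the author's implementation): a regex that drops `<think>…</think>` blocks, wrapped in an async relay stand-in where a real proxy would forward the cleaned reply to the downstream client.

```python
import asyncio
import re

# Qwen-style reasoning is emitted inside <think>...</think> tags;
# DOTALL lets the pattern span multi-line reasoning blocks.
THINK_RE = re.compile(r"<think>.*?</think>\s*", re.DOTALL)

def strip_thinking(text: str) -> str:
    """Remove chain-of-thought blocks so only the final answer remains."""
    return THINK_RE.sub("", text)

async def relay(reply: str) -> str:
    # Stand-in for the proxy hop: a real proxy would forward the cleaned
    # text on (e.g. to a Slack webhook) instead of just returning it.
    await asyncio.sleep(0)
    return strip_thinking(reply)

print(asyncio.run(relay("<think>2+2, carry nothing</think>The answer is 4.")))
# → The answer is 4.
```

A production proxy would also need the tool-call handling the author mentions, which this sketch omits.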
What the dual DGX Spark setup changed
For the dual Spark build, the author used INT4 AutoRound, loading 98GB per node across two 128GB nodes with vLLM TP=2. Generation landed at 27 to 28 tok/s, a bit lower than the Mac on steady decode, but prefill was noticeably faster and batch embedding performance was far better. In the author’s view, CUDA tensor cores, vLLM kernels, and tensor parallelism make the Spark platform more attractive when inference has to coexist with RAG, embedding, or reranking workloads.
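The reported footprints line up with a back-of-envelope weight-size estimate. The sketch below is only that, an estimate: it ignores KV cache, activations, and quantization scale overhead, which is why the post's 323GB MLX figure runs above the raw 6-bit weight size.

```python
def weight_gb(params_billion: float, bits: float) -> float:
    # Raw weight footprint: (bits / 8) bytes per parameter, so one
    # billion params at 4 bits is roughly 0.5 GB of weights.
    return params_billion * bits / 8

total_int4 = weight_gb(397, 4)   # ≈ 198.5 GB for the full model at INT4
per_node = total_int4 / 2        # ≈ 99 GB per node under TP=2 (post: 98GB)
mlx_6bit = weight_gb(397, 6)     # ≈ 298 GB raw; the 323GB load adds overhead
print(total_int4, per_node, mlx_6bit)
```

Both reported numbers sit within a few percent of the raw estimates, which suggests the author is quoting weight footprints rather than total runtime memory.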
The tradeoff was operational pain. The post says only one QSFP cable worked reliably, Node2’s IP disappeared after reboot, the safe GPU memory ceiling had to be found by binary search around 0.88, page cache needed to be flushed on both nodes before model load, and some units thermally throttled within 20 minutes. In other words, the Spark setup offered more ecosystem flexibility and better throughput for auxiliary tasks, but demanded far more tuning time.
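The memory-ceiling hunt the author describes amounts to bisecting vLLM's `--gpu-memory-utilization` flag. A sketch of that search, with a hypothetical `try_load` probe standing in for actually launching vLLM and catching an out-of-memory failure:

```python
def find_safe_gpu_fraction(try_load, lo=0.5, hi=0.95, tol=0.01):
    """Bisect the highest memory fraction at which the model still loads.

    try_load(frac) -> bool attempts a load at the given fraction and
    reports success; here it is a stand-in for a real vLLM launch.
    """
    best = lo
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if try_load(mid):
            best, lo = mid, mid   # fits: push the floor up
        else:
            hi = mid              # OOM: pull the ceiling down
    return round(best, 2)

# Simulated probe: pretend anything at or below 0.88 fits in memory.
print(find_safe_gpu_fraction(lambda frac: frac <= 0.88))  # → 0.88
```

Each real probe is a full model load, so even a dozen bisection steps can eat an afternoon, which matches the tuning pain the post describes.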
The practical conclusion
The strongest takeaway is not that one platform decisively won. The author instead split roles: Mac Studio for inference, dual Sparks for RAG, embedding, and reranking, connected over Tailscale. They also estimate that $20K in hardware would break even in about 10 months against a $2K/month API bill.
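The break-even claim is simple division, and it checks out:

```python
def breakeven_months(hardware_cost: float, monthly_api_cost: float) -> float:
    # Months until the one-time hardware spend equals the recurring API bill.
    return hardware_cost / monthly_api_cost

print(breakeven_months(20_000, 2_000))  # → 10.0
```

The estimate ignores electricity and the author's tuning time, so treat 10 months as a floor rather than a precise payback date.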
For local-LLM operators, that makes this thread more than a brag post. It is a concrete illustration that large-model local deployment is becoming an architecture choice with measurable cost, bandwidth, and workflow tradeoffs rather than a niche experiment.
Source: r/LocalLLaMA thread