Why this comparison mattered

A long-form benchmark post on r/LocalLLaMA delivered one of the clearest first-hand comparisons yet of what it actually takes to run a very large open model locally. At crawl time, the thread had 402 points and 229 comments. The author says they had been spending about $2K per month on Claude API usage for a personal assistant running through Slack, then decided to compare local hardware paths by buying a $10K Mac Studio M3 Ultra 512GB and a similarly priced dual DGX Spark setup.

The test model was Qwen3.5 397B A17B. On the Mac Studio, the author used 6-bit MLX quantization, loading a 323GB model into 512GB of unified memory. The reported generation speed was 30 to 40 tok/s. The author argues that the main advantage is roughly 800 GB/s of memory bandwidth, which makes giant-model token generation feel smooth in a single box. Setup was described as relatively easy, but prefill on large system prompts took 30+ seconds, and MLX VLM lacked built-in handling for tool calls and for stripping thinking tokens, so the author had to build a custom async proxy.
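The bandwidth argument can be checked with back-of-the-envelope arithmetic. The sketch below is not from the thread: it assumes decode is purely memory-bound, that the "A17B" in the model name means about 17B active parameters per token, and that each active weight is read once per token at 6 bits. It ignores KV-cache reads, quantization scales, and compute, so the result is a rough ceiling, not a prediction.

```python
def decode_ceiling_tok_s(active_params_b: float,
                         bits_per_weight: float,
                         bandwidth_gb_s: float) -> float:
    """Rough upper bound on decode speed for a memory-bound model.

    Assumes every active weight is read exactly once per generated token.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


# Assumed inputs: 17B active parameters, 6-bit weights, ~800 GB/s bandwidth.
ceiling = decode_ceiling_tok_s(active_params_b=17, bits_per_weight=6, bandwidth_gb_s=800)

# Reported range from the post, expressed as a fraction of the ceiling.
low_fraction = 30 / ceiling
high_fraction = 40 / ceiling

print(f"ceiling: {ceiling:.1f} tok/s")
print(f"reported 30-40 tok/s is {low_fraction:.0%} to {high_fraction:.0%} of that ceiling")
```

Under these assumptions, the reported 30 to 40 tok/s sits at somewhere around half to two thirds of the theoretical ceiling, which is consistent with the author's claim that bandwidth is the dominant factor.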

What the dual DGX Spark setup changed

For the dual Spark build, the author used INT4 AutoRound quantization, loading 98GB per node across two 128GB nodes with vLLM tensor parallelism (TP=2). Generation landed at 27 to 28 tok/s, a bit lower than the Mac in steady-state decode, but prefill was noticeably faster and batch embedding performance was far better. In the author’s view, CUDA tensor cores, vLLM kernels, and tensor parallelism make the Spark platform more attractive when inference has to coexist with RAG, embedding, or reranking workloads.
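The reported memory figures can be sanity-checked against the parameter count. This is a naive sketch, assuming exactly 397B parameters, an even split across the two tensor-parallel nodes, and decimal gigabytes; real checkpoints carry quantization scales and may keep some layers at higher precision, so the numbers will not match exactly.

```python
TOTAL_PARAMS_B = 397  # taken from the model name; the exact count is an assumption


def weight_footprint_gb(total_params_b: float, bits_per_weight: float) -> float:
    """Naive weight size in GB: parameters times bits, converted to bytes."""
    return total_params_b * bits_per_weight / 8


def effective_bits_per_weight(reported_gb: float, total_params_b: float) -> float:
    """Bits per parameter implied by a reported on-disk or in-memory size."""
    return reported_gb * 8 / total_params_b


# Naive INT4 footprint, split evenly across two nodes (TP=2).
int4_total_gb = weight_footprint_gb(TOTAL_PARAMS_B, 4)
per_node_gb = int4_total_gb / 2

# Effective bits implied by the sizes reported in the post.
mlx_bits = effective_bits_per_weight(323, TOTAL_PARAMS_B)        # Mac, 6-bit MLX
spark_bits = effective_bits_per_weight(2 * 98, TOTAL_PARAMS_B)   # Sparks, INT4

print(f"naive INT4 per node: {per_node_gb:.2f} GB (post reports 98 GB)")
print(f"implied bits per weight: MLX {mlx_bits:.2f}, Spark {spark_bits:.2f}")
```

The naive estimate lands close to the 98GB per node in the post, which leaves roughly 30GB of headroom on each 128GB node for KV cache, activations, and the operating system.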

The tradeoff was operational pain. The post says only one QSFP cable worked reliably, Node2’s IP disappeared after reboot, the safe GPU memory utilization ceiling had to be found by binary search and landed around 0.88, the page cache needed to be flushed on both nodes before each model load, and some units thermally throttled within 20 minutes. In other words, the Spark setup offered more ecosystem flexibility and better throughput for auxiliary tasks, but demanded far more tuning time.
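The binary search for a safe memory ceiling can be sketched as follows. This is an illustration, not the author's script: `loads_ok` is a hypothetical callback that would, in practice, launch the server with a given memory fraction (for example via vLLM's `--gpu-memory-utilization` option) and report whether the model loaded without running out of memory. The search assumes the outcome is monotonic, meaning every fraction below the true ceiling succeeds and every fraction above it fails.

```python
from typing import Callable


def find_max_safe_fraction(loads_ok: Callable[[float], bool],
                           lo: float = 0.50,
                           hi: float = 0.95,
                           tol: float = 0.01) -> float:
    """Return the largest fraction in [lo, hi] (within tol) where loads_ok is True."""
    if not loads_ok(lo):
        raise ValueError("model does not load even at the lower bound")
    if loads_ok(hi):
        return hi  # the upper bound is already safe
    # Invariant: loads_ok(lo) is True and loads_ok(hi) is False.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if loads_ok(mid):
            lo = mid
        else:
            hi = mid
    return lo


def fake_probe(fraction: float) -> bool:
    """Stand-in for a real load attempt; pretends the true ceiling is 0.88."""
    return fraction <= 0.88


best = find_max_safe_fraction(fake_probe)
print(f"largest safe fraction found: {best:.3f}")
```

Each real probe means a full model load, so a coarse tolerance like 0.01 keeps the number of attempts small: the example above needs eight probes to narrow 0.50 to 0.95 down to just under 0.88.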

The practical conclusion

The strongest takeaway is not that one platform decisively won. The author instead split roles: Mac Studio for inference, dual Sparks for RAG, embedding, and reranking, connected over Tailscale. They also estimate that $20K in hardware would break even in about 10 months against a $2K/month API bill.
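The break-even estimate is simple division, shown here with an optional running-cost term. The $20K and $2K/month figures come from the post; the `monthly_running_cost` parameter and the $100 value used in the second call are hypothetical, added only to show how electricity or maintenance would stretch the payback period.

```python
def break_even_months(hardware_cost: float,
                      monthly_api_cost: float,
                      monthly_running_cost: float = 0.0) -> float:
    """Months until hardware spend equals the API spend it replaces."""
    monthly_savings = monthly_api_cost - monthly_running_cost
    if monthly_savings <= 0:
        raise ValueError("local running costs exceed the API bill; no break-even")
    return hardware_cost / monthly_savings


# Figures from the post: $20K hardware against a $2K/month API bill.
months = break_even_months(20_000, 2_000)

# Hypothetical: the same purchase with $100/month in power and upkeep.
months_with_power = break_even_months(20_000, 2_000, monthly_running_cost=100)

print(f"break-even: {months:.1f} months, or {months_with_power:.1f} with running costs")
```

The calculation assumes the local setup fully replaces the API usage and ignores resale value and the time spent on tuning.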

For local-LLM operators, that makes this thread more than a brag post. It is a concrete illustration that large-model local deployment is becoming an architecture choice with measurable cost, bandwidth, and workflow tradeoffs rather than a niche experiment.

Source: r/LocalLLaMA thread


LocalLLaMA Benchmark Pits Dual DGX Sparks Against a 512GB Mac Studio for Qwen3.5 397B