A LocalLLaMA benchmark maps where RTX 5090, AI395, and dual R9700 actually win
Original: [Benchmark] The Ultimate Llama.cpp Shootout: RTX 5090 vs DGX Spark vs AMD AI395 & R9700 (ROCm/Vulkan)
Why this benchmark mattered
The Ultimate Llama.cpp Shootout thread on r/LocalLLaMA collected 55 upvotes and 81 comments. It was not the biggest post of the day, but it carried a lot of technical value. The author used llama-bench build 8463 to compare RTX 5090, DGX Spark GB10, AMD AI395, and single and dual AMD R9700 setups under the same test parameters, across both dense and mixture-of-experts models.
The tested models were Qwen2.5 32B, Qwen3.5 35B MoE, Qwen2.5 70B, and Qwen3.5 122B MoE, using -ngl 99 -fa 1 -p 2048 -n 256 -b 512. That kind of benchmark is unusually useful because local inference decisions are rarely about peak speed alone. What matters is whether a model fits, how stable the backend is, and what trade-offs emerge once you move beyond a single friendly benchmark case.
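The reported flags translate into an invocation along these lines; this is a sketch, and the model filename is hypothetical, not taken from the post:

```shell
# Sketch of the post's llama-bench setup (model filename is hypothetical).
# -ngl 99: offload all layers to the GPU; -fa 1: enable flash attention;
# -p 2048: prompt-processing test length; -n 256: generation test length;
# -b 512: batch size.
./llama-bench -m models/qwen2.5-32b-q4_k_m.gguf \
  -ngl 99 -fa 1 -p 2048 -n 256 -b 512
```

Running the same command on each machine is what makes the cross-vendor numbers comparable.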
Key takeaways
According to the post, the RTX 5090 was dominant when the model fit inside 32GB of VRAM. On Qwen3.5 35B MoE it reached 5988.83 t/s in prompt processing and 205.36 t/s in generation. But it could not load the 70B Q4_K_M or 122B models at all.
- The AMD AI395, with 98GB of shared memory, was the only non-enterprise node in the comparison able to run the 122B MoE model.
- The author reported that the AI395 required -mmp 0, and then delivered nearly 20 t/s generation while peaking around 108W.
- A dual R9700 setup, with 60GB total VRAM, ran the 70B model at 11.49 t/s generation and nearly 600 t/s prompt processing under ROCm.
- ROCm consistently won on prompt processing, while Vulkan sometimes posted better generation speeds but was less stable, including vk::DeviceLostError failures.
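Under the same assumptions as above (model paths are hypothetical), the AI395 workaround and the dual-R9700 run might look like:

```shell
# AI395 (98GB shared memory): per the post, mmap had to be disabled via
# -mmp 0 for the 122B MoE model to run. Model path is hypothetical.
./llama-bench -m models/qwen3.5-122b-moe-q4_k_m.gguf \
  -ngl 99 -fa 1 -p 2048 -n 256 -b 512 -mmp 0

# Dual R9700 under ROCm: llama.cpp splits layers across visible GPUs by
# default; HIP_VISIBLE_DEVICES pins the pair explicitly.
HIP_VISIBLE_DEVICES=0,1 ./llama-bench -m models/qwen2.5-70b-q4_k_m.gguf \
  -ngl 99 -fa 1 -p 2048 -n 256 -b 512
```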
Why it matters
The post makes a practical point that matters more to hardware buyers than marketing charts. The 5090 looks unmatched if a model fits. The AI395 is slower but unusually flexible because of its memory capacity. Dual R9700 is not the fastest path, but it opens a realistic route to 70B-class local models on AMD hardware.
That means the right choice depends on workload, not just headline tokens per second. Do you want the fastest small-to-mid MoE runs, or do you need enough memory headroom to push 70B and 122B-class models? Community benchmarks like this are valuable because they surface fit, throughput, and backend stability in the same table.
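As a rough illustration of the fit question (assuming Q4_K_M averages about 4.85 bits per weight, a common community estimate), the weights of a 70B model alone land near 40 GiB, which is why it overflows the 5090's 32GB but fits in 60GB of dual-R9700 VRAM:

```shell
# Back-of-envelope weight size for a 70B model at ~4.85 bits/weight
# (Q4_K_M estimate); KV cache and activations add several GiB on top.
awk 'BEGIN { printf "%.1f GiB\n", 70e9 * 4.85 / 8 / 2^30 }'
# Prints roughly 39.5 GiB
```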
Original source: Reddit benchmark post
Related Articles
A few weeks after release, r/LocalLLaMA is converging on task-specific sampler and reasoning-budget presets for Qwen3.5 rather than one default setup.
A high-signal r/LocalLLaMA benchmark post said moving Qwen 3.5 27B from mainline llama.cpp to ik_llama.cpp raised prompt evaluation from about 43 tok/sec to 1,122 tok/sec on a Blackwell RTX PRO 4000, with generation climbing from 7.5 tok/sec to 26 tok/sec.
A benchmark thread on r/LocalLLaMA compared ROCm 7 nightlies and Vulkan on an AMD Mi50 for llama.cpp, arguing that Vulkan wins short dense workloads while ROCm pulls ahead on long context and some MoE scenarios.