A LocalLLaMA benchmark maps where RTX 5090, AI395, and dual R9700 actually win
Original: [Benchmark] The Ultimate Llama.cpp Shootout: RTX 5090 vs DGX Spark vs AMD AI395 & R9700 (ROCm/Vulkan)
Why this benchmark mattered
The Ultimate Llama.cpp Shootout thread on r/LocalLLaMA collected 55 upvotes and 81 comments. It was not the biggest post of the day, but it carried a lot of technical value. The author used llama-bench build 8463 to compare RTX 5090, DGX Spark GB10, AMD AI395, and single and dual AMD R9700 setups under the same test parameters, across both dense and mixture-of-experts models.
The tested models were Qwen2.5 32B, Qwen3.5 35B MoE, Qwen2.5 70B, and Qwen3.5 122B MoE, using -ngl 99 -fa 1 -p 2048 -n 256 -b 512. That kind of benchmark is unusually useful because local inference decisions are rarely about peak speed alone. What matters is whether a model fits, how stable the backend is, and what trade-offs emerge once you move beyond a single friendly benchmark case.
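For readers who want to repeat the run on their own hardware, the parameters above can be assembled into a llama-bench invocation. The following is a minimal sketch, not the author's script: it assumes a `llama-bench` binary on the PATH, and the model path and the helper name `build_llama_bench_cmd` are hypothetical. The flags themselves are the ones quoted in the post.

```python
import shlex

# Flags quoted in the post. The comments describe what each one controls.
BENCH_FLAGS = {
    "-ngl": 99,   # number of layers offloaded to the GPU (99 is effectively "all")
    "-fa": 1,     # flash attention enabled
    "-p": 2048,   # prompt-processing test length, in tokens
    "-n": 256,    # generation test length, in tokens
    "-b": 512,    # batch size
}


def build_llama_bench_cmd(model_path, extra_flags=None, binary="llama-bench"):
    """Return the argument list for one benchmark run.

    `model_path` and `binary` are placeholders (assumptions); point them at
    your own GGUF file and llama.cpp build.
    """
    flags = dict(BENCH_FLAGS)
    flags.update(extra_flags or {})
    cmd = [binary, "-m", model_path]
    for flag, value in flags.items():
        cmd += [flag, str(value)]
    return cmd


if __name__ == "__main__":
    # Hypothetical model file, used only to show the resulting command line.
    cmd = build_llama_bench_cmd("models/example-q4_k_m.gguf")
    print(shlex.join(cmd))
    # To actually run it: subprocess.run(cmd, check=True)
```

Keeping the flags in one dictionary makes it easy to hold every parameter fixed across machines and change only the model file, which is what makes results comparable. Per-machine extras, such as the -mmp 0 setting the post reports for the AI395, can be passed through `extra_flags`.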
Key takeaways
According to the post:
- The RTX 5090 was dominant when the model fit inside 32GB of VRAM. On Qwen3.5 35B MoE it reached 5988.83 t/s in prompt processing and 205.36 t/s in generation. But it could not load the 70B Q4_K_M or 122B models at all.
- The AMD AI395, with 98GB of shared memory, was the only non-enterprise node in the comparison able to run the 122B MoE model.
- The author reported that the AI395 required -mmp 0, and then delivered nearly 20 t/s generation while peaking around 108W.
- A dual R9700 setup, with 60GB total VRAM, ran the 70B model at 11.49 t/s generation and nearly 600 t/s prompt processing under ROCm.
- ROCm consistently won on prompt processing, while Vulkan sometimes posted better generation speeds but was less stable, including vk::DeviceLostError failures.
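The fit results in these takeaways line up with simple arithmetic on model size. Below is a rough sketch under stated assumptions, not a figure from the post: it assumes about 4.85 bits per weight for Q4_K_M (an approximate average), uses the nominal parameter counts from the model names, assumes the 122B model was also run at a 4-bit quantization (the post summary does not say), and adds a hypothetical flat 4 GB allowance for KV cache and buffers. The memory sizes are the ones reported in the post.

```python
# Assumption: approximate average bits per weight for Q4_K_M quantization.
BITS_PER_WEIGHT_Q4_K_M = 4.85
# Assumption: hypothetical flat allowance for KV cache and runtime buffers.
OVERHEAD_GB = 4.0

# Memory capacities as reported in the post (GB).
MEMORY_GB = {"RTX 5090": 32, "dual R9700": 60, "AI395": 98}
# Nominal parameter counts, in billions, taken from the model names.
MODELS_B = {"32B": 32, "35B MoE": 35, "70B": 70, "122B MoE": 122}


def estimated_weights_gb(params_billion, bits_per_weight=BITS_PER_WEIGHT_Q4_K_M):
    """Estimate weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8


def fits(params_billion, memory_gb, overhead_gb=OVERHEAD_GB):
    """True if the estimated weights plus the overhead allowance fit in memory."""
    return estimated_weights_gb(params_billion) + overhead_gb <= memory_gb


if __name__ == "__main__":
    for model, params in MODELS_B.items():
        size = estimated_weights_gb(params)
        verdict = ", ".join(
            f"{name}: {'fits' if fits(params, mem) else 'does not fit'}"
            for name, mem in MEMORY_GB.items()
        )
        print(f"{model}: ~{size:.1f} GB weights -> {verdict}")
```

Under these assumptions the 70B model needs roughly 42 GB for weights alone, which exceeds 32GB but fits in 60GB, and the 122B model needs roughly 74 GB, which only the 98GB configuration can hold. Real requirements vary with context length, quantization mix, and backend, so treat this as a sanity check rather than a prediction.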
Why it matters
The post makes a practical point that hardware buyers care about more than marketing charts. The 5090 looks unmatched if a model fits. The AI395 is slower but unusually flexible because of memory capacity. Dual R9700 is not the fastest path, but it opens a realistic route to 70B-class local models on AMD hardware.
That means the right choice depends on workload, not just headline tokens per second. Do you want the fastest small-to-mid MoE runs, or do you need enough memory headroom to push 70B and 122B-class models? Community benchmarks like this are valuable because they surface fit, throughput, and backend stability in the same table.
Original source: Reddit benchmark post