A LocalLLaMA benchmark maps where RTX 5090, AI395, and dual R9700 actually win

Original: [Benchmark] The Ultimate Llama.cpp Shootout: RTX 5090 vs DGX Spark vs AMD AI395 & R9700 (ROCm/Vulkan)

LLM · Mar 26, 2026 · By Insights AI (Reddit)

Why this benchmark mattered

The Ultimate Llama.cpp Shootout thread on r/LocalLLaMA collected 55 upvotes and 81 comments. It was not the biggest post of the day, but it carried a lot of technical value. The author used llama-bench build 8463 to compare RTX 5090, DGX Spark GB10, AMD AI395, and single and dual AMD R9700 setups under the same test parameters, across both dense and mixture-of-experts models.

The tested models were Qwen2.5 32B, Qwen3.5 35B MoE, Qwen2.5 70B, and Qwen3.5 122B MoE, using -ngl 99 -fa 1 -p 2048 -n 256 -b 512. That kind of benchmark is unusually useful because local inference decisions are rarely about peak speed alone. What matters is whether a model fits, how stable the backend is, and what trade-offs emerge once you move beyond a single friendly benchmark case.
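For readers who want to reproduce the setup, here is a minimal sketch of the invocation. The flags are taken straight from the post; the model path and filename are placeholders we chose, not details from the thread:

    # llama-bench ships with llama.cpp (the author used build 8463).
    # -ngl 99 offloads all layers to the GPU, -fa 1 enables flash
    # attention, -p/-n set the prompt-processing and generation test
    # lengths, and -b sets the batch size.
    ./llama-bench -m models/qwen2.5-32b-q4_k_m.gguf \
      -ngl 99 -fa 1 -p 2048 -n 256 -b 512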

Key takeaways

According to the post, the RTX 5090 was dominant when the model fit inside 32GB of VRAM. On Qwen3.5 35B MoE it reached 5988.83 t/s in prompt processing and 205.36 t/s in generation. But it could not load the 70B Q4_K_M or 122B models at all.
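A rough capacity check makes that cutoff unsurprising. Q4_K_M averages roughly 4.8 bits per weight (that figure is our approximation, not from the post), so for a 70B dense model the weights alone already exceed the card:

    70e9 weights × ~4.8 bits ÷ 8 ≈ 42 GB for weights alone
    42 GB + KV cache + activations  >  32 GB of RTX 5090 VRAM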

  • The AMD AI395, with 98GB of shared memory, was the only non-enterprise node in the comparison able to run the 122B MoE model.
  • The author reported that the AI395 required -mmp 0, and then delivered nearly 20 t/s generation while peaking around 108W (a hedged command sketch follows this list).
  • A dual R9700 setup, with 60GB total VRAM, ran the 70B model at 11.49 t/s generation and nearly 600 t/s prompt processing under ROCm.
  • ROCm consistently won on prompt processing, while Vulkan sometimes posted better generation speeds but was less stable, including vk::DeviceLostError failures.
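Two of the configurations above are easy to sketch as concrete invocations. These are reconstructions under stated assumptions: the model filenames are placeholders, and only -mmp 0 and the backend choices come from the post.

    # AI395 (98GB shared memory): the author reported -mmp 0 was
    # required, i.e. memory mapping disabled at model load.
    ./llama-bench -m models/qwen3.5-122b-moe-q4_k_m.gguf \
      -ngl 99 -fa 1 -p 2048 -n 256 -b 512 -mmp 0

    # Dual R9700 under ROCm: HIP_VISIBLE_DEVICES exposes both GPUs,
    # and -sm layer makes the default layer split across cards explicit.
    HIP_VISIBLE_DEVICES=0,1 ./llama-bench -m models/qwen2.5-70b-q4_k_m.gguf \
      -ngl 99 -fa 1 -p 2048 -n 256 -b 512 -sm layer

Note that ROCm versus Vulkan is a build-time choice rather than a runtime flag: the llama-bench binary is compiled against one backend or the other (for example with -DGGML_HIP=ON or -DGGML_VULKAN=ON at CMake time), which is why the post reports them as separate result columns.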

Why it matters

The post lands on the kind of practical point hardware buyers care about more than marketing charts. The 5090 looks unmatched if a model fits. The AI395 is slower but unusually flexible thanks to its memory capacity. Dual R9700 is not the fastest path, but it opens a realistic route to 70B-class local models on AMD hardware.

That means the right choice depends on workload, not just headline tokens per second. Do you want the fastest small-to-mid MoE runs, or do you need enough memory headroom to push 70B and 122B-class models? Community benchmarks like this are valuable because they surface fit, throughput, and backend stability in the same table.

Original source: Reddit benchmark post
