A LocalLLaMA benchmark maps where RTX 5090, AI395, and dual R9700 actually win

Why this benchmark mattered

The "Ultimate Llama.cpp Shootout" thread on r/LocalLLaMA collected 55 upvotes and 81 comments. It was not the biggest post of the day, but it was technically dense. The author used llama-bench build 8463 to compare an RTX 5090, a DGX Spark GB10, an AMD AI395, and single and dual AMD R9700 setups under identical test parameters, across both dense and mixture-of-experts (MoE) models.

The tested models were Qwen2.5 32B, Qwen3.5 35B MoE, Qwen2.5 70B, and Qwen3.5 122B MoE, each run with -ngl 99 -fa 1 -p 2048 -n 256 -b 512: up to 99 layers offloaded to the GPU, flash attention enabled, a 2048-token prompt-processing test, a 256-token generation test, and a batch size of 512. A benchmark like this is useful because local inference decisions are rarely about peak speed alone. What matters is whether a model fits in memory, how stable the backend is, and which trade-offs emerge once you move beyond a single friendly test case.
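To make the shared test parameters concrete, here is a minimal Python sketch that assembles a llama-bench command line from the flags quoted above. The flags and their values come from the post; the model path, the helper name build_bench_command, and the idea of scripting the run this way are illustrative assumptions, not the author's actual setup.

```python
import shlex

# Flags and values as quoted in the post.
BENCH_FLAGS = {
    "-ngl": 99,   # offload up to 99 layers to the GPU
    "-fa": 1,     # flash attention on
    "-p": 2048,   # prompt-processing test length, in tokens
    "-n": 256,    # generation test length, in tokens
    "-b": 512,    # batch size
}


def build_bench_command(model_path, extra=None, binary="llama-bench"):
    """Return the argv list for one llama-bench run.

    `extra` holds per-device overrides, e.g. {"-mmp": 0} for the AI395.
    """
    argv = [binary, "-m", model_path]
    for flag, value in {**BENCH_FLAGS, **(extra or {})}.items():
        argv += [flag, str(value)]
    return argv


# Hypothetical model path, used only for illustration.
cmd = build_bench_command("models/example-q4_k_m.gguf")
print(shlex.join(cmd))
```

Keeping the flags in one place is what makes results comparable across devices: only the model file and any device-specific override change between runs.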

Key takeaways

According to the post, the RTX 5090 was dominant when the model fit inside 32GB of VRAM. On Qwen3.5 35B MoE it reached 5988.83 t/s in prompt processing and 205.36 t/s in generation. But it could not load the 70B Q4_K_M or 122B models at all.

The AMD AI395, with 98GB of shared memory, was the only non-enterprise node in the comparison able to run the 122B MoE model.
The author reported that the AI395 required -mmp 0 (memory mapping disabled), and then delivered nearly 20 t/s in generation while peaking at around 108W.
A dual R9700 setup, with 60GB total VRAM, ran the 70B model at 11.49 t/s generation and nearly 600 t/s prompt processing under ROCm.
ROCm consistently won on prompt processing, while Vulkan sometimes posted better generation speeds but was less stable, including vk::DeviceLostError failures.
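The fit results above follow from simple arithmetic on weight size versus memory capacity. The sketch below is a rough estimate only: the capacities are the figures quoted in the post, while the 4.8 bits per weight for Q4_K_M, the 4 GB allowance for KV cache and runtime overhead, and the use of the same quantization for every model are assumptions made for illustration. Real GGUF file sizes vary by model.

```python
# Assumed average bits per weight for Q4_K_M; actual files differ by model.
ASSUMED_BITS_PER_WEIGHT = 4.8

# Memory capacities as quoted in the post, in GB.
CAPACITY_GB = {
    "RTX 5090": 32,
    "dual R9700": 60,
    "AI395": 98,
}


def weight_size_gb(params_billions, bits_per_weight=ASSUMED_BITS_PER_WEIGHT):
    """Approximate size of the quantized weights in GB (10^9 bytes)."""
    return params_billions * bits_per_weight / 8


def devices_that_fit(params_billions, overhead_gb=4.0):
    """List devices whose capacity covers weights plus an assumed overhead."""
    need = weight_size_gb(params_billions) + overhead_gb
    return [name for name, cap in CAPACITY_GB.items() if cap >= need]


for size in (32, 70, 122):
    print(size, round(weight_size_gb(size), 1), devices_that_fit(size))
```

Under these assumptions a 70B model needs roughly 42 GB for weights alone, which exceeds the 5090's 32GB, and a 122B model needs roughly 73 GB, which leaves the AI395 as the only device in this list with enough room.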

Why it matters

The post makes a practical point: hardware buyers need more than marketing charts. The 5090 looks unmatched as long as the model fits in its 32GB of VRAM. The AI395 is slower but unusually flexible because of its memory capacity. Dual R9700 is not the fastest path, but it opens a realistic route to 70B-class local models on AMD hardware.

That means the right choice depends on workload, not just headline tokens per second. Do you want the fastest small-to-mid MoE runs, or do you need enough memory headroom to push 70B and 122B-class models? Community benchmarks like this are valuable because they surface fit, throughput, and backend stability in the same table.
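One way to look past headline tokens per second is energy per generated token. The sketch below uses the AI395 figures from the takeaways (nearly 20 t/s at a peak of around 108W); both inputs are rounded, and because 108W is a peak rather than an average, the result is best read as a rough upper-end estimate. The function name is hypothetical, and the post does not give power figures for the other devices.

```python
def joules_per_token(watts, tokens_per_second):
    """Energy per generated token: watts are joules per second."""
    return watts / tokens_per_second


# Rounded AI395 figures from the post: about 108 W peak, nearly 20 t/s.
ai395 = joules_per_token(108, 20)
print(f"AI395: about {ai395:.1f} J per generated token")
```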

Original source: Reddit benchmark post
