Intel Arc Pro B70 Community Benchmark Suggests Viable Qwen3.5-27B Serving
Original: Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4
What the community post measured
This r/LocalLLaMA post, at a score of 82 with 44 comments as of April 12, 2026, offers one of the more useful community benchmarks for Intel’s Arc Pro B70 32GB. The author says they spent multiple nights getting Intel’s llm-scaler-vllm fork working correctly, then published both single-GPU and dual-GPU measurements for Qwen3.5-27B at int4 quantization.
The results point to a setup that is not effortless, but is clearly usable. On a single GPU, the author reports roughly 12 to 14 tokens per second for generation, while 2048-token prefill lands around 1700 t/s. At higher concurrency, total throughput rises much more aggressively: the post reports 130.90 total t/s for 512-token generation (tg512) at concurrency 32. In dual-GPU mode, tensor parallel underperformed, while pipeline parallel improved the high-concurrency picture and pushed total tg512 throughput to 195.82 t/s at concurrency 32.
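To put those aggregate figures in per-request terms, dividing total throughput by the concurrency level gives the approximate rate each concurrent stream sees. The numbers below come from the post; the helper function itself is just illustrative arithmetic, not anything the author published:

```python
# Convert aggregate throughput into an approximate per-stream rate.
# Benchmark figures are from the post; this helper is illustrative.

def per_stream_tps(total_tps: float, concurrency: int) -> float:
    """Approximate tokens/sec seen by each concurrent request."""
    return total_tps / concurrency

# Single GPU, tg512 at concurrency 32: 130.90 total t/s
single = per_stream_tps(130.90, 32)

# Dual GPU with pipeline parallel, same workload: 195.82 total t/s
dual = per_stream_tps(195.82, 32)

print(f"single GPU:  {single:.2f} t/s per stream")
print(f"dual GPU PP: {dual:.2f} t/s per stream")
```

Note that each stream lands around 4 to 6 t/s, well below the 12 to 14 t/s single-query figure, which is the usual trade of per-request latency for batch throughput.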
Operational takeaways from the post
- Tensor parallel degraded performance across the author’s tests.
- Pipeline parallel hurt single-query generation but improved throughput under heavier concurrent load.
- The author compares the 32-concurrency result to an RTX Pro 4500 32GB and says Intel lands about 20% lower on total generation while drawing about 50% more power.
- The working setup depended on a recent beta fork, and the author says Ubuntu 26.04 pre-release worked where Ubuntu 24.04.4 did not.
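The author’s actual invocation (Docker-based, against Intel’s llm-scaler-vllm fork) is in the original thread. For orientation only, the two dual-GPU modes in the bullets above map onto standard vLLM launch flags roughly like this; the model path is a placeholder and this is generic vLLM usage, not the author’s command:

```shell
# Illustrative sketch, not the author's Docker command.
# Contrasts the two dual-GPU modes discussed in the post.

# Tensor parallel: shards each layer across both GPUs.
# The author found this degraded performance on the B70 pair.
vllm serve <model-path> --tensor-parallel-size 2

# Pipeline parallel: splits the layer stack between the GPUs.
# Worse for single queries, but lifted tg512 throughput to
# 195.82 total t/s at concurrency 32 in the author's tests.
vllm serve <model-path> --pipeline-parallel-size 2
```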
Why it matters
This is still a community measurement, not a controlled lab benchmark, and the author is careful to present it as such. Even so, the post is valuable because it goes beyond impressions. It includes a concrete Docker command, notes on the Intel XPU target path, and per-concurrency tables for both single- and dual-card runs. For anyone trying to decide whether Intel hardware can serve a Qwen3.5-27B-class model locally, that level of detail is far more actionable than generic launch-day claims.
Original source: r/LocalLLaMA post.
Related Articles
A r/LocalLLaMA field report showed how a very specific local inference workload was tuned for throughput. The author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
A March 12, 2026 LocalLLaMA benchmark post claims the best sustained decode for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 Blackwell GPUs is 50.5 tok/s with Marlin, because native CUTLASS grouped GEMM paths on SM120 fail or fall back.