Intel Arc Pro B70 Community Benchmark Suggests Viable Qwen3.5-27B Serving

Original post: “Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4”

LLM · Apr 12, 2026 · By Insights AI (Reddit) · 1 min read

What the community post measured

This r/LocalLLaMA post, sitting at a score of 82 with 44 comments as of April 12, 2026, offers one of the more useful community benchmarks for Intel’s Arc Pro B70 32GB. The author says they spent multiple nights getting Intel’s llm-scaler-vllm fork working correctly, then published both single-GPU and dual-GPU measurements for Qwen3.5-27B at int4.

The results point to a setup that is not effortless but is clearly usable. On a single GPU, the author reports roughly 12 to 14 tokens per second for single-request generation, while 2048-token prefill lands around 1,700 t/s. Aggregate throughput scales much more aggressively under load: the post reports 130.90 total t/s for 512-token generation (tg512) at a concurrency of 32. In dual-GPU mode, tensor parallelism underperformed, while pipeline parallelism improved the high-concurrency picture and pushed total tg512 throughput to 195.82 t/s at the same concurrency of 32.

Operational takeaways from the post

  • Tensor parallel degraded performance across the author’s tests.
  • Pipeline parallel hurt single-query generation but improved throughput under heavier concurrent load (the launch-flag sketch after this list shows where vLLM selects between the two modes).
  • The author compares the 32-concurrency result to an RTX Pro 4500 32GB and says Intel lands about 20% lower on total generation while drawing about 50% more power.
  • The working setup depended on a recent beta fork, and the author says Ubuntu 26.04 pre-release worked where Ubuntu 24.04.4 did not.

Why it matters

This is still a community measurement, not a controlled lab benchmark, and the author is careful to present it as such. Even so, the post is valuable because it goes beyond impressions. It includes a concrete Docker command, notes on the Intel XPU target path, and per-concurrency tables for both single- and dual-card runs. For anyone trying to decide whether Intel hardware can serve a Qwen3.5-27B-class model locally, that level of detail is far more actionable than generic launch-day claims.

Original source: r/LocalLLaMA post.


