Intel Arc Pro B70 Community Benchmark Suggests Viable Qwen3.5-27B Serving

Original post: “Intel Arc Pro B70 32GB performance on Qwen3.5-27B@Q4”

LLM · Apr 12, 2026 · By Insights AI (Reddit) · 1 min read

What the community post measured

This r/LocalLLaMA post, sitting at a score of 82 with 44 comments as of April 12, 2026, offers one of the more useful community benchmarks for Intel’s Arc Pro B70 32GB. The author says they spent multiple nights getting Intel’s llm-scaler-vllm fork working correctly, then published both single-GPU and dual-GPU measurements for Qwen3.5-27B at int4.

The results point to a setup that is not effortless but is clearly usable. On a single GPU, the author reports roughly 12 to 14 tokens per second for single-request generation, while 2048-token prefill lands around 1,700 t/s. Aggregate throughput scales much more aggressively under load: the post reports 130.90 total t/s for 512-token generation (tg512) at a concurrency of 32. In dual-GPU mode, tensor parallelism underperformed, while pipeline parallelism improved the high-concurrency picture and pushed total tg512 throughput to 195.82 t/s at the same concurrency of 32.

Operational takeaways from the post

  • Tensor parallel degraded performance across the author’s tests.
  • Pipeline parallel hurt single-query generation but improved throughput under heavier concurrent load (the launch-flag sketch after this list shows where vLLM selects between the two modes).
  • The author compares the 32-concurrency result to an RTX Pro 4500 32GB and says Intel lands about 20% lower on total generation while drawing about 50% more power.
  • The working setup depended on a recent beta fork, and the author says Ubuntu 26.04 pre-release worked where Ubuntu 24.04.4 did not.

Why it matters

This is still a community measurement, not a controlled lab benchmark, and the author is careful to present it as such. Even so, the post is valuable because it goes beyond impressions. It includes a concrete Docker command, notes on the Intel XPU target path, and per-concurrency tables for both single- and dual-card runs. For anyone trying to decide whether Intel hardware can serve a Qwen3.5-27B-class model locally, that level of detail is far more actionable than generic launch-day claims.

Original source: r/LocalLLaMA post.


