#vllm

LLM Reddit Apr 12, 2026 1 min read

Intel Arc Pro B70 Community Benchmark Suggests Viable Qwen3.5-27B Serving

A detailed r/LocalLLaMA benchmark reports single- and dual-GPU numbers for Qwen3.5-27B int4 on Intel Arc Pro B70 32GB using Intel’s vLLM fork. The setup is still finicky, but the measurements outline a practical path for local serving on Intel hardware.

#localllm #intel-arc #qwen

LLM X/Twitter Apr 10, 2026 1 min read

vLLM Lands in the First MLPerf Vision-Language Benchmark Submission

vLLM said NVIDIA used the framework for the first MLPerf vision-language benchmark submission built on Qwen3-VL. NVIDIA’s accompanying blog places that result inside a broader Blackwell Ultra push that claims up to 2.7x throughput gains and more than 60% lower token cost on the same infrastructure for some workloads.

#vllm #mlperf #benchmark

LLM Reddit Apr 8, 2026 2 min read

r/LocalLLaMA Shares a University-Hospital Stack Serving 1B+ Tokens Per Day Locally

A popular r/LocalLLaMA self-post lays out a concrete 2x H200 serving stack for GPT-OSS-120B, including routing, monitoring, and queueing tradeoffs. The appeal is not just the headline throughput, but the unusually detailed operational data behind it.

#localllama #vllm #litellm

LLM Reddit Apr 7, 2026 2 min read

LocalLLaMA Flags DFlash as an Open-Source Route to Faster Speculative Decoding

A LocalLLaMA thread pulled attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.

#speculative-decoding #inference #vllm

LLM Reddit Mar 28, 2026 2 min read

LocalLLaMA Follows a 1.1M Tok/s Qwen 3.5 27B Run as vLLM Tuning Becomes the Real Story

A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.

#qwen #vllm #nvidia-b200

LLM Reddit Mar 16, 2026 2 min read

LocalLLaMA Benchmark Argues RTX PRO 6000 SM120 Is Being Held Back by Broken CUTLASS NVFP4 MoE Kernels

A March 12, 2026 LocalLLaMA benchmark post claims the best sustained decode for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 Blackwell GPUs is 50.5 tok/s with Marlin, because native CUTLASS grouped GEMM paths on SM120 fail or fall back.

#qwen #blackwell #vllm

LLM Reddit Mar 15, 2026 2 min read

LocalLLaMA Patch Claims Faster Qwen3.5-397B Inference on Blackwell Workstations With a K=64 Kernel Fix

A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.

#qwen #blackwell #vllm

LLM Reddit Mar 7, 2026 2 min read

LocalLLaMA PSA: Test New Models on Base Runtimes Before Convenience Layers

A well-received PSA on r/LocalLLaMA argues that convenience layers such as Ollama and LM Studio can change model behavior enough to distort evaluation. The more durable lesson from the thread is reproducibility: hold templates, stop tokens, sampling, runtime versions, and quantization constant before judging a model.

#local-llm #model-evaluation #llama-cpp

LLM Reddit Mar 4, 2026 2 min read

LocalLLaMA Experiment Claims Qwen3.5-35B-A3B Reaches 37.8% on SWE-bench Verified Hard

A LocalLLaMA post reports that a simple “verify after every edit” loop raised Qwen3.5-35B-A3B from 22.2% to 37.8% on SWE-bench Verified Hard, approaching a cited 40% reference for Claude Opus 4.6.

#swe-bench #coding-agents #qwen

LLM Reddit Mar 2, 2026 1 min read

How to Run Qwen3.5 27B with 170k Context at 100+ t/s on 2x RTX 3090

A community developer reports 100+ t/s decode speed for a single request and 585 t/s aggregate throughput across 8 simultaneous requests when running Qwen3.5 27B on a dual RTX 3090 setup with NVLink, using vLLM with tensor parallelism and MTP optimization.

#qwen #local-inference #vllm
