r/LocalLLaMA was interested for a reason beyond a flashy speed number. This was not just another "new model out" post: the claim was concrete, with Qwen3.6-27B-INT4 running at about 80 tokens per second with a 218k context window on a single RTX 5090 via vLLM 0.19. Rather than just cheering the number, the thread shifted to prompt length, quantization tradeoffs, how much quality survives once local inference gets this fast, and whether the setup really holds up in practice.
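For readers wiring this up themselves, this is roughly the shape of the single-GPU launch being described. A minimal sketch using vLLM's offline API; the Hugging Face repo id is a placeholder and the memory setting is an assumption, not the poster's exact flags.

```python
from vllm import LLM, SamplingParams

# Placeholder repo id for the INT4 checkpoint; vLLM detects the
# quantization scheme from the checkpoint config, so no explicit flag.
llm = LLM(
    model="Qwen/Qwen3.6-27B-INT4",
    max_model_len=218_000,          # the 218k context window from the post
    gpu_memory_utilization=0.95,    # assumed: leave a little headroom on the 32 GB card
)

out = llm.generate(
    ["Summarize the tradeoffs of INT4 quantization for long-context serving."],
    SamplingParams(max_tokens=256, temperature=0.7),
)
print(out[0].outputs[0].text)
```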
Why it matters: inference cost is now a product constraint, not only an infrastructure problem. Cohere said its W4A8 path in vLLM is up to 58% faster on time to first token (TTFT) and 45% faster on time per output token (TPOT) versus W4A16 on Hopper.
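Both metrics fall out of a single streamed request. A sketch of how they are commonly measured against an OpenAI-compatible vLLM endpoint; the URL and model name are placeholders, and counting one stream chunk as one token is an approximation.

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
first_token_at = None
n_chunks = 0

stream = client.chat.completions.create(
    model="my-w4a8-model",  # placeholder served-model name
    messages=[{"role": "user", "content": "Explain W4A8 vs W4A16 briefly."}],
    stream=True,
    max_tokens=128,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        if first_token_at is None:
            first_token_at = time.perf_counter()  # first content arrives -> TTFT
        n_chunks += 1
end = time.perf_counter()

ttft = first_token_at - start
tpot = (end - first_token_at) / max(n_chunks - 1, 1)  # average gap between output tokens
print(f"TTFT {ttft * 1000:.0f} ms, TPOT {tpot * 1000:.1f} ms/token")
```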
The Reddit thread is not about mourning TGI, Hugging Face's Text Generation Inference. It reads like operators comparing notes after momentum shifted away from it: most commenters say vLLM is now the safer default for general inference serving because the migration path is lighter and the performance case is easier to defend.
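The "lighter migration" claim is easiest to see in the request shapes. A sketch comparing TGI's native /generate route with vLLM's OpenAI-compatible /v1/completions route; the hosts, ports, and model name are placeholders.

```python
import requests

PROMPT = "Write a haiku about GPUs."

# TGI's native route: payload goes under "inputs"/"parameters".
tgi = requests.post(
    "http://tgi-host:8080/generate",
    json={"inputs": PROMPT, "parameters": {"max_new_tokens": 64}},
).json()
print(tgi["generated_text"])

# vLLM speaks the OpenAI schema, so existing OpenAI-SDK client code
# usually only needs a new base URL.
vllm = requests.post(
    "http://vllm-host:8000/v1/completions",
    json={"model": "my-model", "prompt": PROMPT, "max_tokens": 64},
).json()
print(vllm["choices"][0]["text"])
```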
Quantization only matters when the accuracy hit stays small enough for production use. Red Hat AI says its quantized Gemma 4 31B keeps 99%+ accuracy while delivering nearly 2x tokens/sec at half the memory footprint, with weights released openly via LLM Compressor.
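LLM Compressor's public workflow is a one-shot calibration pass. A sketch in the spirit of its documented examples; the model id, scheme, and calibration settings are illustrative rather than Red Hat's actual Gemma recipe, and import paths vary slightly across library versions.

```python
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

# Quantize every Linear layer to 4-bit weights while keeping the LM head
# in full precision; targets and scheme are illustrative choices.
recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])

oneshot(
    model="google/gemma-4-31b",        # hypothetical checkpoint id from the claim
    dataset="open_platypus",           # stock calibration set used in the examples
    recipe=recipe,
    output_dir="gemma-4-31b-w4a16",    # vLLM can load the result directly
    max_seq_length=2048,
    num_calibration_samples=512,
)
```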
A detailed r/LocalLLaMA benchmark reports single- and dual-GPU numbers for Qwen3.5-27B int4 on Intel Arc Pro B70 32GB using Intel’s vLLM fork. The setup is still finicky, but the measurements outline a practical path for local serving on Intel hardware.
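Assuming Intel's fork keeps upstream vLLM's Python surface, the dual-GPU configuration in the benchmark reduces to a tensor-parallel launch; the repo id and context length below are placeholders.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3.5-27B-INT4",  # placeholder id for the int4 build
    tensor_parallel_size=2,         # shard the weights across both B70 cards
    max_model_len=32_768,           # assumed: fit weights plus KV cache in 2x32 GB
)
print(llm.generate(["Hello from Arc."], SamplingParams(max_tokens=32))[0].outputs[0].text)
```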
vLLM said NVIDIA used the framework for the first MLPerf vision-language benchmark submission built on Qwen3-VL. NVIDIA’s accompanying blog places that result inside a broader Blackwell Ultra push that claims up to 2.7x throughput gains and more than 60% lower token cost on the same infrastructure for some workloads.
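For readers who have not served a vision-language model with vLLM, image inputs ride alongside the prompt. A sketch using vLLM's multimodal prompt dict; the model id is a placeholder, and the prompt template is model-specific, shown here only schematically.

```python
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-VL-8B")  # placeholder id

out = llm.generate(
    {
        # The real Qwen-VL chat template inserts its own image tokens;
        # this generic form is schematic.
        "prompt": "USER: <image>\nDescribe this chart. ASSISTANT:",
        "multi_modal_data": {"image": Image.open("chart.png")},
    },
    SamplingParams(max_tokens=128),
)
print(out[0].outputs[0].text)
```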
A popular r/LocalLLaMA self-post lays out a concrete 2x H200 serving stack for GPT-OSS-120B, including routing, monitoring, and queueing tradeoffs. The appeal is not just the headline throughput, but the unusually detailed operational data behind it.
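The queueing discussion in particular translates directly into monitoring code, since vLLM exposes Prometheus gauges on its /metrics route. A sketch that scrapes them for a crude load-shedding signal; the host and threshold are placeholders, and gauge names can shift between versions.

```python
import requests

def queue_snapshot(base_url: str = "http://h200-node:8000") -> dict[str, float]:
    """Scrape the running/waiting request gauges from a vLLM server."""
    gauges = {}
    for line in requests.get(f"{base_url}/metrics").text.splitlines():
        if line.startswith(("vllm:num_requests_running", "vllm:num_requests_waiting")):
            name, value = line.rsplit(" ", 1)
            gauges[name] = float(value)
    return gauges

stats = queue_snapshot()
waiting = sum(v for k, v in stats.items() if "waiting" in k)
if waiting > 8:  # arbitrary threshold: send new traffic to the other replica
    print("queue is deep, prefer replica B")
```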
A LocalLLaMA thread drew attention to DFlash, a block-diffusion draft model for speculative decoding whose paper claims lossless acceleration above 6x and direct support for vLLM, SGLang, and selected Transformers backends.
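"Direct support for vLLM" presumably means plugging into vLLM's standard speculative decoding configuration; the draft checkpoint id and token count below are assumptions for illustration, not DFlash's published settings.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-32B",            # target model (placeholder)
    speculative_config={
        "model": "org/dflash-draft",   # hypothetical draft checkpoint id
        "num_speculative_tokens": 5,   # draft tokens proposed per verification step
    },
)
# Speculative decoding is lossless by construction: the target model
# verifies every drafted token, so the output distribution matches
# plain decoding. That is the sense of the paper's "lossless" claim.
print(llm.generate(["Why is the sky blue?"], SamplingParams(max_tokens=64))[0].outputs[0].text)
```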
A March 26, 2026 r/LocalLLaMA post about serving Qwen 3.5 27B on Google Cloud B200 clusters reached 205 points and 52 comments at crawl time. The linked write-up reports 1,103,941 total tokens per second on 12 nodes after switching from tensor to data parallelism, shrinking context length, enabling FP8 KV cache, and using MTP-1 speculative decoding.
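Two of those levers map onto standard vLLM engine arguments, and the headline number is easy to sanity-check per node. A sketch with a placeholder model id; the data-parallel topology and MTP-1 settings live in their cluster launcher and are not reproduced here.

```python
from vllm import LLM

llm = LLM(
    model="Qwen/Qwen3.5-27B",  # placeholder id
    kv_cache_dtype="fp8",      # FP8 KV cache: half the cache memory per token
    max_model_len=8_192,       # assumed value for the post's "shrunk context length"
)

# Sanity arithmetic on the headline throughput:
print(f"{1_103_941 / 12:,.0f} tok/s per node")  # ~92k tok/s on each B200 node
```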
A March 12, 2026 LocalLLaMA benchmark post claims the best sustained decode for Qwen3.5-397B NVFP4 on 4x RTX PRO 6000 Blackwell GPUs is 50.5 tok/s with Marlin, because the native CUTLASS grouped-GEMM paths on SM120 either fail or fall back to slower kernels.
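For reference, "sustained decode" figures like this are usually measured by timing long forced generations, as in the sketch below; the repo id is a placeholder, and the kernel choice (Marlin vs. CUTLASS) happens inside vLLM rather than through this API.

```python
import time
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3.5-397B-NVFP4", tensor_parallel_size=4)  # placeholder id

params = SamplingParams(max_tokens=1024, ignore_eos=True)  # force a full-length decode
start = time.perf_counter()
outputs = llm.generate(["Benchmark prompt."], params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tok/s (decode-dominated; short prefill included)")
```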