#vllm

LLM Hacker News Jun 28, 2026 2 min read

Two Strix Halo boards as a vLLM cluster: the hard part is RDMA

Local LLM builders are moving from “can it run?” to “can two small unified-memory boxes behave like one machine?” This guide walks through Framework Strix Halo boards, Intel E810 RoCE v2, and vLLM serving.

#amd #strix-halo #vllm

LLM Reddit Jun 16, 2026 2 min read

vLLM’s Qwen3+ streaming parser targets a real local-agent pain point

LocalLLaMA users reacted strongly to a small but practical vLLM nightly change. The new Qwen3+ streaming parser is aimed at mid-turn stops and streaming tool-call failures that can break Qwen3.6 agent loops.

#vllm #qwen #tool-calling

LLM Hacker News May 31, 2026 1 min read

Tiny-vLLM teaches LLM inference by rebuilding the stack in C++ and CUDA

The HN reaction centered on the README as much as the code: a small engine that turns vLLM concepts into a guided implementation path.

#llm #cuda #inference

LLM Reddit May 28, 2026 1 min read

Starlette BadHost bug puts vLLM, MCP servers, and AI tool stacks on notice

LocalLLaMA readers quickly turned the story into an operator checklist: check Starlette, FastAPI, vLLM, LiteLLM, MCP servers, and anything exposed to the Internet.

#security #starlette #mcp

LLM Reddit May 1, 2026 2 min read

LocalLLaMA cared less about peak speed than a 3090 setup that finally stopped crashing at 218K context

LocalLLaMA cared less about headline speed than about stability: a Qwen3.6 setup on one RTX 3090 reached 218K context and stopped crashing on long tool outputs.

#qwen #rtx-3090 #vllm

LLM Reddit Apr 30, 2026 2 min read

LocalLLaMA Fixates on a Qwen3.6 27B Setup That Pushes 204k Context on Two 16GB GPUs

LocalLLaMA reacted to this post because it brought hard numbers, not vendor marketing: a dual RTX 5060 Ti 16GB setup pushing Qwen3.6 27B to roughly 60 tok/s with a 204k context window.

#qwen #local-llm #vllm

LLM X/Twitter Apr 28, 2026 2 min read

vLLM lifts FP8 long-context accuracy from 13% to 89%

Why it matters: FP8 inference only pays off if the accuracy collapse is fixable. vLLM says a two-level accumulation change lifted 128k needle-in-a-haystack accuracy from 13% to 89% while preserving FP8 decode speed.

#vllm #fp8 #inference

LLM Reddit Apr 27, 2026 2 min read

Qwen3.6 27B Hits 100 tps on One RTX 5090, and LocalLLaMA Immediately Asks About Quality

LocalLLaMA was interested for a reason beyond a flashy speed number. A post claiming 105-108 tps and a full 256k native context window for Qwen3.6-27B-INT4 on a single RTX 5090 turned the thread into a practical discussion about how much quality survives once local inference gets this fast.

#qwen #vllm #rtx-5090

LLM Reddit Apr 25, 2026 2 min read

LocalLLaMA Sees a New Local Bar: Qwen3.6 27B at ~80 t/s on One RTX 5090

r/LocalLLaMA reacted because this was not just another “new model out” post. The claim was concrete: Qwen3.6-27B running at about 80 tokens per second with a 218k context window on a single RTX 5090 via vLLM 0.19.

#qwen #vllm #rtx-5090

LLM X/Twitter Apr 23, 2026 1 min read

Cohere W4A8 vLLM path claims 58% faster first-token latency

Why it matters: inference cost is now a product constraint, not only an infrastructure problem. Cohere said its W4A8 path in vLLM is up to 58% faster on TTFT and 45% faster on TPOT versus W4A16 on Hopper.

#cohere #vllm #inference

LLM Reddit Apr 16, 2026 2 min read

LocalLLaMA Reads TGI’s Maintenance Mode as the Moment vLLM Became the Default

The Reddit thread is not about mourning TGI. It reads like operators comparing notes after momentum shifted away from TGI. Most commenters say vLLM is now the safer default for general inference serving because the migration path is lighter and the performance case is easier to defend.

#llm #inference #vllm

LLM X/Twitter Apr 14, 2026 2 min read

Quantized Gemma 4 31B nearly doubles throughput at half memory

Quantization only matters when the accuracy hit stays small enough to use in production. Red Hat AI says its quantized Gemma 4 31B keeps 99%+ accuracy while delivering nearly 2x tokens/sec at half the memory footprint, with weights released openly via LLM Compressor.

#gemma-4 #quantization #vllm