LocalLLaMA Fixates on a Qwen3.6 27B Setup That Pushes 204k Context on Two 16GB GPUs
Original: Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working
Why this Reddit post landed
The LocalLLaMA thread reached 111 upvotes and 44 comments at crawl time, and the reason is obvious once you read it: the post is packed with reproducible settings and uncomfortable caveats. Instead of “my local rig feels fast,” the author laid out the hardware stack, runtime versions, flags, memory headroom, and failure modes for running Qwen3.6 27B on 2x RTX 5060 Ti 16GB. That is exactly the kind of information the community values, because it helps separate real local inference progress from vague brag posts.
The setup and the headline numbers
The post uses a Proxmox LXC environment with 32GB total VRAM, 16 vCPU, about 60GB RAM, CUDA 13, Torch 2.11 nightly, and vLLM 0.19.2rc1.dev. The model is listed as sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP. Reported performance came in around 50-52 tok/s at 8K context with MTP n=1, 62-66 tok/s with MTP n=3, and 59-66 tok/s at 32K context. The headline claim is that a 204800-token window does start and work, though the author is explicit that this is right at the edge.
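The settings named in the post map onto vLLM's offline Python API roughly as follows. This is a hedged sketch, not the author's actual launch invocation (they likely ran a server process), and the MTP speculative-decoding settings are left out because the post does not spell them out:

```python
# Hedged sketch of the reported configuration via vLLM's Python API.
# The parameter names below are standard vLLM arguments, but the exact
# invocation, and any MTP speculative-decoding config, is not in the post.
from vllm import LLM

llm = LLM(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    tensor_parallel_size=2,       # split across the two RTX 5060 Ti cards
    max_model_len=204800,         # the headline 204k context window
    gpu_memory_utilization=0.95,  # 0.94 reportedly failed KV allocation
    max_num_seqs=1,               # single sequence: not built for concurrency
)
```

Loading this configuration requires both GPUs and the quantized checkpoint, so treat it as a starting point for reproduction rather than a verified command.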
Where the caveats start to matter
The most useful part of the thread is not the best-case throughput but the operating envelope. The author says idle VRAM at 204k sits around 14.45 GiB per GPU, rises to about 15.65 GiB after a 168k-token prefill, and that a 168k retrieval smoke test completed in roughly 256 seconds. They also note that gpu_memory_utilization=0.94 failed KV allocation while 0.95 worked, startup takes several minutes because of compile and autotune behavior, and the config is not meant for high concurrency because max_num_seqs=1. Top comments immediately asked about stability, PCIe generation, and whether NVFP4 support on Blackwell was the real enabler, which shows the community saw the post as a deployable recipe, not a meme.
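The VRAM ceiling the author is skirting can be sanity-checked with back-of-the-envelope KV-cache arithmetic. The model dimensions below are illustrative guesses for a dense ~27B transformer, not Qwen3.6 27B's published architecture, which the post does not give:

```python
def kv_cache_gib_per_gpu(tokens, layers, kv_heads, head_dim,
                         bytes_per_elem, tensor_parallel):
    """Approximate KV-cache footprint per GPU for a dense transformer.

    The factor of 2 accounts for storing both K and V; tensor
    parallelism splits the KV heads evenly across GPUs.
    """
    per_token = 2 * layers * (kv_heads // tensor_parallel) * head_dim * bytes_per_elem
    return tokens * per_token / 2**30

# Illustrative numbers only: 48 layers, 8 KV heads, head_dim 128,
# FP8 KV cache (1 byte per element), split across 2 GPUs.
print(kv_cache_gib_per_gpu(204800, 48, 8, 128, 1, 2))  # ~9.4 GiB per GPU
```

At that scale, a single percentage point of `gpu_memory_utilization` on a 16GB card is only about 160 MiB, which is consistent with 0.94 failing KV allocation while 0.95 squeaks through.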
Why this matters for local inference
The broader takeaway is that "serious enough" local LLM setups keep moving down-market. This is not generous hardware, and the author does not pretend otherwise. But a two-card, 16GB-per-card Blackwell box sustaining this class of context and throughput changes how hobbyists and small teams think about local experimentation. The post is valuable precisely because it includes the sharp edges along with the wins.
Related Articles
LocalLLaMA lit up at the idea that a 27B model could tie Sonnet 4.6 on an agentic index, but the thread turned just as fast to benchmark gaming, real context windows, and what people can actually run at home.
A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
The LocalLLaMA thread cared less about a release headline and more about which Qwen3.6 GGUF quant actually works. Unsloth’s benchmark post pushed the discussion into KLD, disk size, CUDA 13.2 failures, and the messy details that decide local inference quality.