LocalLLaMA Fixates on a Qwen3.6 27B Setup That Pushes 204k Context on Two 16GB GPUs

Original: Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working

LLM · Apr 30, 2026 · By Insights AI (Reddit) · 2 min read

Why this Reddit post landed

The LocalLLaMA thread reached 111 upvotes and 44 comments at crawl time, and the reason is obvious once you read it: the post is packed with reproducible settings and uncomfortable caveats. Instead of “my local rig feels fast,” the author laid out the hardware stack, runtime versions, flags, memory headroom, and failure modes for running Qwen3.6 27B on 2x RTX 5060 Ti 16GB. That is exactly the kind of information the community values, because it helps separate real local inference progress from vague brag posts.

The setup and the headline numbers

The setup runs in a Proxmox LXC with 32GB of total VRAM across the two cards, 16 vCPUs, about 60GB of RAM, CUDA 13, a Torch 2.11 nightly, and vLLM 0.19.2rc1.dev. The model is listed as sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP. Reported performance came in around 50-52 tok/s at 8K context with MTP (multi-token prediction, used for speculative decoding) at n=1, 62-66 tok/s with MTP n=3, and 59-66 tok/s at 32K context. The headline claim is that a 204800-token window does start and work, though the author is explicit that this is right at the edge.
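
To make the shape of the recipe concrete, here is a minimal sketch of loading this model through vLLM's offline Python API. Only the model name, the two-GPU tensor-parallel split, the 204800-token window, and the memory and concurrency values quoted in the thread come from the post; the prompt and sampling settings are placeholders, and the author's exact launch command and MTP speculative-decoding flags are not reproduced here.

```python
# Minimal sketch of loading the reported model with vLLM's offline API,
# NOT the author's exact launch command. The model name, two-way tensor
# parallelism, the 204800-token window, and the memory/concurrency values
# are taken from the thread; the MTP speculative-decoding flags are omitted
# because their exact vLLM syntax is not given in the summary.
from vllm import LLM, SamplingParams

llm = LLM(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    tensor_parallel_size=2,        # split the 27B weights across the two 16GB cards
    max_model_len=204800,          # the headline 204k context window
    gpu_memory_utilization=0.95,   # the value the author found workable
    max_num_seqs=1,                # single-sequence config, not built for concurrency
)

params = SamplingParams(temperature=0.7, max_tokens=256)
out = llm.generate(["Explain NVFP4 quantization in two sentences."], params)
print(out[0].outputs[0].text)
```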

Where the caveats start to matter

The most useful part of the thread is not the best-case throughput but the operating envelope. The author reports that idle VRAM at 204k sits around 14.45 GiB per GPU and rises to about 15.65 GiB after a 168k-token prefill, and that a 168k retrieval smoke test completed in roughly 256 seconds. They also note that gpu_memory_utilization=0.94 failed KV allocation while 0.95 worked, that startup takes several minutes because of compile and autotune behavior, and that the config is not meant for high concurrency because max_num_seqs=1. Top comments immediately asked about stability, PCIe generation, and whether NVFP4 support on Blackwell was the real enabler, which shows the community read the post as a deployable recipe, not a meme.
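
The 168k retrieval check is the closest thing in the thread to an acceptance test, so a rough sketch of that kind of smoke test is shown below, assuming the model is exposed through vLLM's OpenAI-compatible server. The endpoint address, the needle phrase, the padding text, and the repeat count are all illustrative assumptions; the author's actual test prompt is not included in the post.

```python
# Rough sketch of a long-context "needle" retrieval smoke test, in the spirit
# of the author's 168k-token check but NOT their actual script. Assumes the
# model is already being served via vLLM's OpenAI-compatible endpoint on
# localhost:8000; the needle, the filler text, and the repeat count are made up.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

needle = "The maintenance code for the pump is 7341."
# Tune the repeat count against the model's tokenizer to land near the target
# context size; this is only a placeholder, not a calibrated 168k-token prompt.
filler = "The quick brown fox jumps over the lazy dog. " * 17000
haystack = filler[: len(filler) // 2] + needle + filler[len(filler) // 2 :]

resp = client.chat.completions.create(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    messages=[
        {"role": "user",
         "content": haystack + "\n\nWhat is the maintenance code for the pump?"},
    ],
    max_tokens=32,
    temperature=0.0,
)
print(resp.choices[0].message.content)  # a passing run should surface "7341"
```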

Why this matters for local inference

The broader takeaway is that "serious enough" local LLM setups keep moving down-market. This is not roomy hardware, and the author does not pretend otherwise. But a two-card, 16GB-per-card Blackwell box sustaining this class of context and throughput changes how hobbyists and small teams think about local experimentation. The post is valuable precisely because it includes the sharp edges along with the wins.
