LocalLLaMA Fixates on a Qwen3.6 27B Setup That Pushes 204k Context on Two 16GB GPUs
Original: Qwen3.6 27B on dual RTX 5060 Ti 16GB with vLLM: ~60 tok/s, 204k context working
Why this Reddit post landed
The LocalLLaMA thread reached 111 upvotes and 44 comments at crawl time, and the reason is obvious once you read it: the post is packed with reproducible settings and uncomfortable caveats. Instead of “my local rig feels fast,” the author laid out the hardware stack, runtime versions, flags, memory headroom, and failure modes for running Qwen3.6 27B on 2x RTX 5060 Ti 16GB. That is exactly the kind of information the community values, because it helps separate real local inference progress from vague brag posts.
The setup and the headline numbers
The post uses a Proxmox LXC environment with 32GB total VRAM, 16 vCPU, about 60GB RAM, CUDA 13, Torch 2.11 nightly, and vLLM 0.19.2rc1.dev. The model is listed as sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP. Reported performance came in around 50-52 tok/s at 8K context with MTP n=1, 62-66 tok/s with MTP n=3, and 59-66 tok/s at 32K context. The headline claim is that a 204800-token window does start and work, though the author is explicit that this is right at the edge.
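The settings named in the post map onto vLLM's offline Python API roughly as follows. This is a hedged sketch, not the author's actual launch invocation (they likely ran a server process), and the MTP speculative-decoding settings are left out because the post does not spell them out:

```python
# Hedged sketch of the reported configuration via vLLM's Python API.
# The parameter names below are standard vLLM arguments, but the exact
# invocation, and any MTP speculative-decoding config, is not in the post.
from vllm import LLM

llm = LLM(
    model="sakamakismile/Qwen3.6-27B-Text-NVFP4-MTP",
    tensor_parallel_size=2,       # split across the two RTX 5060 Ti cards
    max_model_len=204800,         # the headline 204k context window
    gpu_memory_utilization=0.95,  # 0.94 reportedly failed KV allocation
    max_num_seqs=1,               # single sequence: not built for concurrency
)
```

Loading this configuration requires both GPUs and the quantized checkpoint, so treat it as a starting point for reproduction rather than a verified command.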
Where the caveats start to matter
The most useful part of the thread is not the best-case throughput but the operating envelope. The author says idle VRAM at 204k sits around 14.45 GiB per GPU, rises to about 15.65 GiB after a 168k-token prefill, and that a 168k retrieval smoke test completed in roughly 256 seconds. They also note that gpu_memory_utilization=0.94 failed KV allocation while 0.95 worked, startup takes several minutes because of compile and autotune behavior, and the config is not meant for high concurrency because max_num_seqs=1. Top comments immediately asked about stability, PCIe generation, and whether NVFP4 support on Blackwell was the real enabler, which shows the community saw the post as a deployable recipe, not a meme.
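The VRAM ceiling the author is skirting can be sanity-checked with back-of-the-envelope KV-cache arithmetic. The model dimensions below are illustrative guesses for a dense ~27B transformer, not Qwen3.6 27B's published architecture, which the post does not give:

```python
def kv_cache_gib_per_gpu(tokens, layers, kv_heads, head_dim,
                         bytes_per_elem, tensor_parallel):
    """Approximate KV-cache footprint per GPU for a dense transformer.

    The factor of 2 accounts for storing both K and V; tensor
    parallelism splits the KV heads evenly across GPUs.
    """
    per_token = 2 * layers * (kv_heads // tensor_parallel) * head_dim * bytes_per_elem
    return tokens * per_token / 2**30

# Illustrative numbers only: 48 layers, 8 KV heads, head_dim 128,
# FP8 KV cache (1 byte per element), split across 2 GPUs.
print(kv_cache_gib_per_gpu(204800, 48, 8, 128, 1, 2))  # ~9.4 GiB per GPU
```

At that scale, a single percentage point of `gpu_memory_utilization` on a 16GB card is only about 160 MiB, which is consistent with 0.94 failing KV allocation while 0.95 squeaks through.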
Why this matters for local inference
The broader takeaway is that "serious enough" local LLM setups keep moving down-market. This is not generous hardware, and the author does not pretend otherwise. But a two-card, 16GB-per-card Blackwell box sustaining this class of context and throughput changes how hobbyists and small teams think about local experimentation. The post is valuable precisely because it includes the sharp edges along with the wins.
Related Articles
LocalLLaMA lit up at the idea that a 27B model could tie Sonnet 4.6 on an agentic index, but the thread turned just as fast to benchmark gaming, real context windows, and what people can actually run at home.
A high-engagement LocalLLaMA post shared reproducible benchmark data showing Qwen3.5-122B NVFP4 decoding around 198 tok/s on a dual RTX PRO 6000 Blackwell system using SGLang b12x+NEXTN and a PCIe switch topology.
The LocalLLaMA thread cared less about a release headline and more about which Qwen3.6 GGUF quant actually works. Unsloth’s benchmark post pushed the discussion into KLD, disk size, CUDA 13.2 failures, and the messy details that decide local inference quality.