r/LocalLLaMA: Qwen 3.5 27B Hits ~2000 TPS in a Document-Classification Setup
Original post: "2000 TPS with Qwen 3.5 27B on RTX 5090"
A high-signal r/LocalLLaMA thread turned a narrow local-inference workload into a useful tuning discussion. At crawl time, the post had 203 upvotes and 73 comments. The author said the job was classifying markdown documents with lots of input tokens, very little output, and almost no cache reuse because each document was different. This was not presented as a universal benchmark. It was a production-shaped workload where throughput mattered more than chatbot-style interaction.
In that setup, the author reported processing 1,214,072 input tokens and only 815 output tokens across 320 documents in ten minutes, which they summarized as roughly 2,000 tokens per second. The stack was unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf running on the official llama.cpp:server-cuda13 image. The post also listed the configuration choices that seemed to matter most: no vision model or mmproj, a no-thinking mode, keeping the full inference footprint inside available VRAM, reducing context size to 128k, and matching parallelism to a batch size of eight.
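The headline number checks out from the reported figures alone. This quick sanity computation is not part of the original post, just arithmetic on the numbers it states:

```python
# Sanity-check the reported throughput from the post's own figures.
input_tokens = 1_214_072
output_tokens = 815
documents = 320
minutes = 10

total_tokens = input_tokens + output_tokens
tps = total_tokens / (minutes * 60)       # tokens per second overall
per_doc = input_tokens / documents        # average input size per document

print(f"{tps:.0f} tokens/s")              # ~2025, i.e. "roughly 2,000"
print(f"{per_doc:.0f} input tokens/doc")  # ~3794 tokens per document
```

Note that almost all of the 2,000 tokens per second is prompt processing: the 815 output tokens contribute barely one token per second to the total.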
What the thread actually teaches
- The author explicitly framed the result as workload-specific rather than a generic “27B on 5090” number.
- The eight-way setup gave each request about 16k of context, while the rare larger documents were pushed into a separate path.
- Commenters suggested testing a unified KV cache with `-kvu` and highlighted that continuous batching (`-cb`) was part of the tuning story.
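Putting the discussed settings together, a launch might look roughly like the sketch below. The model path and exact flag set are assumptions rather than the author's verbatim command, and flag names vary across llama.cpp versions, so check `llama-server --help` for your build:

```shell
# Hedged sketch of a llama.cpp server launch approximating the setup above.
#   -ngl 99  : offload all layers, keeping the full footprint in VRAM
#   -c 131072: 128k total context, shared across parallel slots
#   -np 8    : eight parallel slots, roughly 16k of context each
#   -cb      : continuous batching, noted by commenters
#   -kvu     : unified KV cache, suggested in the thread
llama-server -m ./Qwen3.5-27B-UD-Q5_K_XL.gguf \
  -ngl 99 -c 131072 -np 8 -cb -kvu --port 8080
```

With `-np 8` the context budget is split across slots, which matches the author's note that rare oversized documents had to take a separate path.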
That makes the post more useful than a brag screenshot. Local model discussion often collapses into single-token decode speed, but many real tasks look closer to this example: read a large document, classify it, emit a tiny structured answer, and move on to the next file. In that regime, batching, context budgeting, and avoiding unnecessary multimodal overhead can matter more than conversational polish. The thread is valuable because it anchors the performance claim to a clearly stated workload instead of pretending one number covers every use case.
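For this kind of pipeline, the request shape matters as much as the server flags: a large document in, a hard cap on output tokens out. A minimal sketch of how such a request might be built against llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint (the helper name and label set are hypothetical, not from the post):

```python
# Hypothetical helper: build one classification request for an
# OpenAI-compatible llama.cpp server. Large input, tiny constrained
# output -- the shape of the workload described above.
def build_classify_request(document_text: str, labels: list[str]) -> dict:
    return {
        "messages": [
            {
                "role": "system",
                "content": "Classify the document into exactly one label: "
                           + ", ".join(labels)
                           + ". Reply with the label only.",
            },
            {"role": "user", "content": document_text},
        ],
        "max_tokens": 8,      # output is a single label, so cap it hard
        "temperature": 0.0,   # deterministic classification
    }

req = build_classify_request("# Quarterly report\n...", ["finance", "legal", "other"])
print(req["max_tokens"])  # prints 8
```

Capping `max_tokens` this aggressively is what keeps the workload prompt-dominated, which is the regime where the batching and context-budgeting choices above pay off.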
The subreddit response reflects that shift. r/LocalLLaMA treated the post as a practical field report that others could refine, critique, or adapt. Even skeptical replies helped clarify the boundaries of the claim, which is exactly what makes community posts useful when they contain enough operational detail. For teams running local models on repetitive document pipelines, this is the kind of tuning recipe that is more actionable than a polished benchmark chart.
Source and community discussion: r/LocalLLaMA
Related Articles
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
NVIDIA AI Developer introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context window. NVIDIA says the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super model.