r/LocalLLaMA: Qwen 3.5 27B Hits ~2000 TPS in a Document-Classification Setup
Original: 2000 TPS with QWEN 3.5 27b on RTX-5090
A high-signal r/LocalLLaMA thread turned a narrow local-inference workload into a useful tuning discussion. At crawl time, the post had 203 upvotes and 73 comments. The author said the job was classifying markdown documents with lots of input tokens, very little output, and almost no cache reuse because each document was different. This was not presented as a universal benchmark. It was a production-shaped workload where throughput mattered more than chatbot-style interaction.
In that setup, the author reported processing 1,214,072 input tokens and only 815 output tokens across 320 documents in ten minutes, which they summarized as roughly 2,000 tokens per second. The stack was unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf running on the official llama.cpp:server-cuda13 image. The post also listed the configuration choices that seemed to matter most: no vision model or mmproj, a no-thinking mode, keeping the full inference footprint inside available VRAM, reducing context size to 128k, and matching parallelism to a batch size of eight.
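The post did not include the exact launch command, but the settings it describes map onto a llama.cpp server invocation along these lines. Flag spellings and the model path are assumptions, not quotes from the thread; check `llama-server --help` for your build:

```shell
# Sketch of a llama-server launch matching the described setup. The exact
# command was not in the post, so treat this as illustrative.
llama-server \
  -m Qwen3.5-27B-UD-Q5_K_XL.gguf \
  -ngl 99 \
  -c 131072 \
  --parallel 8 \
  --no-mmproj \
  --reasoning-budget 0
# -ngl 99              : offload all layers so the footprint stays in VRAM
# -c 131072            : 128k total context, divided across slots
# --parallel 8         : eight concurrent slots (~16k context each)
# --no-mmproj          : load no vision projector
# --reasoning-budget 0 : suppress thinking-mode tokens
```

The same flags apply when running the official `llama.cpp:server-cuda13` container; only the entrypoint and volume mounts differ.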
What the thread actually teaches
- The author explicitly framed the result as workload-specific rather than a generic “27B on 5090” number.
- The eight-way setup gave each request about 16k of context, while the rare larger documents were pushed into a separate path.
- Commenters suggested testing a unified KV cache with `-kvu` and highlighted that continuous batching (`-cb`) was part of the tuning story.
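The arithmetic behind the headline number and the per-slot budget is easy to verify (assuming the "ten minutes" is exactly 600 seconds):

```shell
# Sanity-check the figures reported in the post.
total_tokens=$((1214072 + 815))                        # input + output tokens
echo "throughput: $((total_tokens / 600)) tokens/s"    # integer division
echo "per-slot context: $((131072 / 8)) tokens"        # 128k over 8 slots
```

That works out to about 2,024 tokens per second overall and 16,384 tokens of context per slot, matching the ~2,000 TPS and ~16k figures in the post.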
That makes the post more useful than a brag screenshot. Local model discussion often collapses into single-token decode speed, but many real tasks look closer to this example: read a large document, classify it, emit a tiny structured answer, and move on to the next file. In that regime, batching, context budgeting, and avoiding unnecessary multimodal overhead can matter more than conversational polish. The thread is valuable because it anchors the performance claim to a clearly stated workload instead of pretending one number covers every use case.
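The pattern described above, large input, tiny structured output, one document after the next, can be sketched as a client loop against llama-server's OpenAI-compatible endpoint. Everything here (port, paths, prompt, concurrency guard) is hypothetical and requires bash and jq:

```shell
# Hypothetical driver loop: classify each markdown file, keeping up to 8
# requests in flight to match the server's --parallel 8 slots.
for doc in docs/*.md; do
  jq -n --rawfile body "$doc" \
    '{messages: [{role: "user",
                  content: ("Classify this document. Reply with one label.\n\n" + $body)}],
      max_tokens: 8}' |
  curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' -d @- &
  # throttle: wait for a slot to free up once 8 requests are running
  while [ "$(jobs -rp | wc -l)" -ge 8 ]; do wait -n; done
done
wait
```

Keeping client concurrency equal to `--parallel` is what lets continuous batching keep every slot busy without oversubscribing the server.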
The subreddit response reflects that shift. r/LocalLLaMA treated the post as a practical field report that others could refine, critique, or adapt. Even skeptical replies helped clarify the boundaries of the claim, which is exactly what makes community posts useful when they contain enough operational detail. For teams running local models on repetitive document pipelines, this is the kind of tuning recipe that is more actionable than a polished benchmark chart.
Source and community discussion: r/LocalLLaMA
Related Articles
A recent r/LocalLLaMA post presents Qwen3.5 27B as an unusually strong local inference sweet spot. The author reports about 19.7 tokens per second on an RTX A6000 48GB with llama.cpp and a 32K context, while the comments turn into a detailed debate about dense-versus-MoE VRAM economics.
r/LocalLLaMA reacted because this was not just another “new model out” post. The claim was concrete: Qwen3.6-27B running at about 80 tokens per second with a 218k context window on a single RTX 5090 via vLLM 0.19.
LocalLLaMA lit up at the idea that a 27B model could tie Sonnet 4.6 on an agentic index, but the thread turned just as fast to benchmark gaming, real context windows, and what people can actually run at home.