r/LocalLLaMA: Qwen 3.5 27B Hits ~2000 TPS in a Document-Classification Setup
Original post: "2000 TPS with Qwen 3.5 27B on RTX 5090"
A high-signal r/LocalLLaMA thread turned a narrow local-inference workload into a useful tuning discussion. At crawl time, the post had 203 upvotes and 73 comments. The author said the job was classifying markdown documents with lots of input tokens, very little output, and almost no cache reuse because each document was different. This was not presented as a universal benchmark. It was a production-shaped workload where throughput mattered more than chatbot-style interaction.
In that setup, the author reported processing 1,214,072 input tokens and only 815 output tokens across 320 documents in ten minutes, which they summarized as roughly 2,000 tokens per second. The stack was unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf running on the official llama.cpp:server-cuda13 image. The post also listed the configuration choices that seemed to matter most: no vision model or mmproj, a no-thinking mode, keeping the full inference footprint inside available VRAM, reducing context size to 128k, and matching parallelism to a batch size of eight.
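The headline number checks out from the reported figures alone. This quick sanity computation is not part of the original post, just arithmetic on the numbers it states:

```python
# Sanity-check the reported throughput from the post's own figures.
input_tokens = 1_214_072
output_tokens = 815
documents = 320
minutes = 10

total_tokens = input_tokens + output_tokens
tps = total_tokens / (minutes * 60)       # tokens per second overall
per_doc = input_tokens / documents        # average input size per document

print(f"{tps:.0f} tokens/s")              # ~2025, i.e. "roughly 2,000"
print(f"{per_doc:.0f} input tokens/doc")  # ~3794 tokens per document
```

Note that almost all of the 2,000 tokens per second is prompt processing: the 815 output tokens contribute barely one token per second to the total.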
What the thread actually teaches
- The author explicitly framed the result as workload-specific rather than a generic “27B on 5090” number.
- The eight-way setup gave each request about 16k of context, while the rare larger documents were pushed into a separate path.
- Commenters suggested testing a unified KV cache with `-kvu` and highlighted that continuous batching (`-cb`) was part of the tuning story.
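Putting the discussed settings together, a launch might look roughly like the sketch below. The model path and exact flag set are assumptions rather than the author's verbatim command, and flag names vary across llama.cpp versions, so check `llama-server --help` for your build:

```shell
# Hedged sketch of a llama.cpp server launch approximating the setup above.
#   -ngl 99  : offload all layers, keeping the full footprint in VRAM
#   -c 131072: 128k total context, shared across parallel slots
#   -np 8    : eight parallel slots, roughly 16k of context each
#   -cb      : continuous batching, noted by commenters
#   -kvu     : unified KV cache, suggested in the thread
llama-server -m ./Qwen3.5-27B-UD-Q5_K_XL.gguf \
  -ngl 99 -c 131072 -np 8 -cb -kvu --port 8080
```

With `-np 8` the context budget is split across slots, which matches the author's note that rare oversized documents had to take a separate path.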
That makes the post more useful than a brag screenshot. Local model discussion often collapses into single-token decode speed, but many real tasks look closer to this example: read a large document, classify it, emit a tiny structured answer, and move on to the next file. In that regime, batching, context budgeting, and avoiding unnecessary multimodal overhead can matter more than conversational polish. The thread is valuable because it anchors the performance claim to a clearly stated workload instead of pretending one number covers every use case.
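For this kind of pipeline, the request shape matters as much as the server flags: a large document in, a hard cap on output tokens out. A minimal sketch of how such a request might be built against llama.cpp's OpenAI-compatible `/v1/chat/completions` endpoint (the helper name and label set are hypothetical, not from the post):

```python
# Hypothetical helper: build one classification request for an
# OpenAI-compatible llama.cpp server. Large input, tiny constrained
# output -- the shape of the workload described above.
def build_classify_request(document_text: str, labels: list[str]) -> dict:
    return {
        "messages": [
            {
                "role": "system",
                "content": "Classify the document into exactly one label: "
                           + ", ".join(labels)
                           + ". Reply with the label only.",
            },
            {"role": "user", "content": document_text},
        ],
        "max_tokens": 8,      # output is a single label, so cap it hard
        "temperature": 0.0,   # deterministic classification
    }

req = build_classify_request("# Quarterly report\n...", ["finance", "legal", "other"])
print(req["max_tokens"])  # prints 8
```

Capping `max_tokens` this aggressively is what keeps the workload prompt-dominated, which is the regime where the batching and context-budgeting choices above pay off.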
The subreddit response reflects that shift. r/LocalLLaMA treated the post as a practical field report that others could refine, critique, or adapt. Even skeptical replies helped clarify the boundaries of the claim, which is exactly what makes community posts useful when they contain enough operational detail. For teams running local models on repetitive document pipelines, this is the kind of tuning recipe that is more actionable than a polished benchmark chart.
Source and community discussion: r/LocalLLaMA
Related Articles
A r/LocalLLaMA post pointed Mac users to llama.cpp pull request #20361, merged on March 11, 2026, adding a fused GDN recurrent Metal kernel. The PR shows around 12-36% throughput gains on Qwen 3.5 variants, while Reddit commenters noted the change is merged but can still trail MLX on some local benchmarks.
A March 14, 2026 LocalLLaMA post outlined a CUTLASS and FlashInfer patch for SM120 Blackwell workstations, claiming major gains for Qwen3.5-397B NVFP4 inference and linking the work to FlashInfer PR #2786.
NVIDIA AI Developer introduced Nemotron 3 Super on March 11, 2026 as an open 120B-parameter hybrid MoE model with 12B active parameters and a native 1M-token context window. NVIDIA says the model targets agentic workloads with up to 5x higher throughput than the previous Nemotron Super model.