r/LocalLLaMA: Qwen 3.5 27B Hits ~2000 TPS in a Document-Classification Setup

Original: 2000 TPS with QWEN 3.5 27b on RTX-5090

LLM · Mar 15, 2026 · By Insights AI (Reddit) · 2 min read

A high-signal r/LocalLLaMA thread turned a narrow local-inference workload into a useful tuning discussion. At crawl time, the post had 203 upvotes and 73 comments. The author said the job was classifying markdown documents with lots of input tokens, very little output, and almost no cache reuse because each document was different. This was not presented as a universal benchmark. It was a production-shaped workload where throughput mattered more than chatbot-style interaction.

In that setup, the author reported processing 1,214,072 input tokens and only 815 output tokens across 320 documents in ten minutes, which they summarized as roughly 2,000 tokens per second. The stack was unsloth/Qwen3.5-27B-UD-Q5_K_XL.gguf running on the official llama.cpp:server-cuda13 image. The post also listed the configuration choices that seemed to matter most: no vision model or mmproj loaded, thinking mode disabled, keeping the full inference footprint inside available VRAM, reducing the context window to 128k, and running eight requests in parallel to match a batch size of eight.
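The headline figure is easy to sanity-check: summing the reported input and output tokens over the ten-minute window lands just above 2,000 tokens per second.

```python
# Back-of-envelope check of the reported throughput,
# using only the numbers stated in the post.
input_tokens = 1_214_072
output_tokens = 815
seconds = 10 * 60  # ten-minute run

tps = (input_tokens + output_tokens) / seconds
print(round(tps))  # ≈ 2025 tokens/second, matching the ~2,000 TPS summary
```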

What the thread actually teaches

  • The author explicitly framed the result as workload-specific rather than a generic “27B on 5090” number.
  • The eight-way setup gave each request about 16k of context, while the rare larger documents were pushed into a separate path.
  • Commenters suggested testing a unified KV cache via the -kvu flag and highlighted that continuous batching (-cb) was part of the tuning story.
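Putting those choices together, a llama-server invocation in the spirit of the post might look like the sketch below. This is an approximation, not the author's exact command: flag spellings vary across llama.cpp versions, so verify against llama-server --help before copying.

```shell
# Sketch of a llama-server launch approximating the post's setup.
llama-server \
  -m Qwen3.5-27B-UD-Q5_K_XL.gguf \
  -ngl 999 \
  -c 131072 \
  -np 8 \
  -cb \
  -kvu
# -ngl 999: offload all layers so the inference footprint stays inside VRAM
# -c 131072 / -np 8: 128k total context split into eight ~16k slots
# -cb: continuous batching, flagged by commenters as part of the tuning story
# -kvu: unified KV cache, which commenters suggested testing
# No --mmproj is passed, so no vision projector is loaded.
```

Documents too large for a ~16k slot would, per the post, be routed to a separate path rather than inflating the shared context budget.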

That makes the post more useful than a brag screenshot. Local model discussion often collapses into single-token decode speed, but many real tasks look closer to this example: read a large document, classify it, emit a tiny structured answer, and move on to the next file. In that regime, batching, context budgeting, and avoiding unnecessary multimodal overhead can matter more than conversational polish. The thread is valuable because it anchors the performance claim to a clearly stated workload instead of pretending one number covers every use case.

The subreddit response reflects that shift. r/LocalLLaMA treated the post as a practical field report that others could refine, critique, or adapt. Even skeptical replies helped clarify the boundaries of the claim, which is exactly what makes community posts useful when they contain enough operational detail. For teams running local models on repetitive document pipelines, this is the kind of tuning recipe that is more actionable than a polished benchmark chart.

Source and community discussion: r/LocalLLaMA


