#rtx-5090

LLM Reddit 3d ago 2 min read

Qwen3.6 27B Hits 100 tps on One RTX 5090, and LocalLLaMA Immediately Asks About Quality

LocalLLaMA's interest went beyond a flashy speed number. A post claiming 105-108 tps and the full 256k native context window for Qwen3.6-27B-INT4 on a single RTX 5090 turned the thread into a practical discussion of how much quality survives once local inference gets this fast.

#qwen #vllm #rtx-5090

LLM Reddit 4d ago 2 min read

LocalLLaMA Loves the 80 TPS Qwen3.6 Demo, Then Immediately Starts Auditing the Fine Print

LocalLLaMA did not just cheer the number. The moment 80 tps and a 218k context window appeared, the thread shifted to prompt length, quantization tradeoffs, and whether the vLLM setup really holds up in practice.

#qwen3-6 #vllm #rtx-5090

LLM Reddit 5d ago 2 min read

LocalLLaMA Sees a New Local Bar: Qwen 3.6 27B at ~80 t/s on One RTX 5090

r/LocalLLaMA reacted because this was not just another “new model out” post. The claim was concrete: Qwen3.6-27B running at about 80 tokens per second with a 218k context window on a single RTX 5090 via vLLM 0.19.

#qwen #vllm #rtx-5090

AI Reddit Apr 11, 2026 2 min read

Reddit Flags a Possible cuBLAS Regression on RTX 5090 Batched FP32 Workloads

A MachineLearning thread argues that cuBLAS may be choosing an inefficient kernel for batched FP32 matrix multiplication on RTX 5090. The post matters not only for the claimed slowdown but also because it includes reproducible benchmark tables, profiling notes, and linked repro material.

#cublas #rtx-5090 #cuda

LLM Reddit Mar 15, 2026 2 min read

r/LocalLLaMA: Qwen 3.5 27B Hits ~2000 TPS in a Document-Classification Setup

An r/LocalLLaMA field report showed how a very specific local inference workload was tuned for throughput. The author reported about 2,000 tokens per second while classifying markdown documents with Qwen 3.5 27B, and the comment thread turned the post into a practical optimization discussion.

#qwen #localllm #llama-cpp