LocalLLaMA likes Luce DFlash because the 3090 speedup looks practical
Original: Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090
LocalLLaMA pushed this one up because it felt tangible. The post describes a standalone C++/CUDA stack on top of ggml that runs Qwen3.6-27B on a single 24 GB RTX 3090 and uses speculative decoding to nearly double throughput relative to plain autoregressive decoding. The detail that really matters is not the marketing flourish, but the claim of zero retraining: the speedup is supposed to come from the execution path, not from shipping a conveniently altered benchmark model.
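To make the zero-retraining claim concrete, here is a minimal sketch of greedy speculative decoding. This is illustrative only, not the Luce DFlash implementation: `draft_next` and `target_next` are hypothetical toy models standing in for a small draft model and the full 27B target. The key property is that the output is token-for-token identical to greedy decoding with the target alone, which is why no retraining is needed.

```python
# Greedy speculative decoding sketch (toy models, NOT the Luce DFlash code):
# a cheap draft model proposes K tokens, the target verifies them, and we
# keep the longest agreeing prefix plus one corrected token per step.

def draft_next(ctx):           # hypothetical cheap draft model
    return (sum(ctx) * 7 + 3) % 11

def target_next(ctx):          # hypothetical expensive target model
    return (sum(ctx) * 7 + 3) % 13

def speculative_step(ctx, k=4):
    # 1. Draft K candidate tokens autoregressively (cheap).
    proposal = []
    for _ in range(k):
        proposal.append(draft_next(ctx + proposal))
    # 2. Target checks each position (in a real stack, one batched pass).
    accepted = []
    for tok in proposal:
        t = target_next(ctx + accepted)
        if t == tok:
            accepted.append(tok)   # draft agreed: token verified for free
        else:
            accepted.append(t)     # mismatch: take the target's token, stop
            break
    else:
        # All K drafts accepted: the target's pass yields one bonus token.
        accepted.append(target_next(ctx + accepted))
    return accepted

def generate(prompt, n=16, k=4):
    out = list(prompt)
    while len(out) < len(prompt) + n:
        out += speculative_step(out, k)
    return out[:len(prompt) + n]

def greedy(prompt, n=16):      # plain autoregressive baseline
    out = list(prompt)
    for _ in range(n):
        out.append(target_next(out))
    return out

# Every accepted token is the target's greedy choice given its prefix,
# so the two decodings match exactly.
assert generate([1, 2, 3]) == greedy([1, 2, 3])
```

The speedup comes from the draft's acceptance rate: each verified draft token costs roughly one cheap forward pass instead of one expensive one, while mismatches fall back to exactly the baseline behavior.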
The posted numbers are specific enough to sound testable. Across HumanEval, GSM8K, and Math500, the author reports an autoregressive mean of 34.97 tok/s and a DFlash mean of 69.19 tok/s, for a 1.98x average speedup. HumanEval rises from 34.90 to 78.16 tok/s, Math500 from 35.13 to 69.77 tok/s, and GSM8K from 34.89 to 59.65 tok/s. The same write-up also says the stack compresses the KV cache to TQ3_0 so a 256K context can fit in 24 GB, and that sliding-window flash attention keeps decoding at a 60K context around 89.7 tok/s instead of collapsing toward 25.8 tok/s.
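A quick back-of-envelope calculation shows why KV-cache compression is the load-bearing claim for the 256K-context goal. The layer, head, and bits-per-element numbers below are assumptions for illustration (the post does not publish Qwen3.6-27B's dimensions, and TQ3_0's effective bit rate is approximated), but the shape of the arithmetic holds for any grouped-query-attention model:

```python
# Illustrative KV-cache sizing. The model dimensions and the ~3.4 bits/elem
# figure for a ternary-style quant are ASSUMPTIONS, not published specs.
def kv_cache_gib(tokens, layers, kv_heads, head_dim, bits_per_elem):
    # K and V each store layers * kv_heads * head_dim values per token.
    elems = 2 * layers * kv_heads * head_dim * tokens
    return elems * bits_per_elem / 8 / 2**30

ctx = 256 * 1024  # the 256K-context goal from the post
fp16 = kv_cache_gib(ctx, layers=48, kv_heads=8, head_dim=128, bits_per_elem=16)
q3   = kv_cache_gib(ctx, layers=48, kv_heads=8, head_dim=128, bits_per_elem=3.4)
print(f"FP16 KV: {fp16:.1f} GiB, ~3-bit KV: {q3:.1f} GiB")
# Under these assumptions, FP16 KV alone (48 GiB) would blow past a 24 GB
# card before counting weights, while a ~3-bit cache lands near 10 GiB.
```

The exact numbers will differ for the real model, but the order of magnitude explains why an aggressive KV quant is a prerequisite for long context on one 3090, not an optional optimization.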
The GitHub repository frames the project in exactly the way LocalLLaMA likes to hear: hand-tuned inference for specific consumer hardware, not a vague future promise. That is why the post resonated. The core question in this subreddit is rarely whether a result is impressive on eight H100s. It is whether a real person with a 3090 can reproduce something useful tonight. Qwen3.6-27B, GGUF weights, and one 24 GB card is an immediately legible setup for the community.
- Reported average speedup: 1.98x over autoregressive decoding
- HumanEval: 34.90 to 78.16 tok/s
- Memory trick: TQ3_0 KV cache compression for 256K-context goals
- Serving options: OpenAI-compatible HTTP endpoint or local REPL
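The headline averages can be sanity-checked directly from the per-benchmark figures quoted above (all numbers are taken from the post, not measured independently):

```python
# Verify that the reported per-benchmark throughputs average out to the
# quoted 34.97 -> 69.19 tok/s means and the 1.98x speedup.
baseline = {"HumanEval": 34.90, "Math500": 35.13, "GSM8K": 34.89}
dflash   = {"HumanEval": 78.16, "Math500": 69.77, "GSM8K": 59.65}

base_mean = sum(baseline.values()) / len(baseline)  # -> 34.97 tok/s
fast_mean = sum(dflash.values()) / len(dflash)      # -> 69.19 tok/s
print(f"{base_mean:.2f} -> {fast_mean:.2f} tok/s, {fast_mean/base_mean:.2f}x")
```

The arithmetic checks out, which at least means the summary statistics are internally consistent; whether the raw measurements reproduce is the part the community can test.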
That is why the community energy here feels different from a pure leaderboard post. LocalLLaMA is not rewarding a prettier chart. It is rewarding a credible claim that consumer-hardware local inference can move from barely acceptable to genuinely comfortable with the right systems work. Luce DFlash hit the exact nerve this subreddit watches most closely: does the software rewrite make old hardware feel new enough to matter?
Source links: Reddit thread, Lucebox repository.