
LocalLLaMA likes Luce DFlash because the 3090 speedup looks practical

Original: Luce DFlash: Qwen3.6-27B at up to 2x throughput on a single RTX 3090

LLM · Apr 28, 2026 · By Insights AI (Reddit)

LocalLLaMA pushed this one up because it felt tangible. The post describes a standalone C++/CUDA stack built on top of ggml that runs Qwen3.6-27B on a single 24 GB RTX 3090 and uses speculative decoding to nearly double throughput compared with plain autoregressive decoding. The detail that really matters is not the marketing flourish but the claim of zero retraining: the speedup is supposed to come from the execution path, not from shipping a conveniently altered benchmark model.
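For readers who have not run speculative decoding locally, the shape of the win is easy to see in a toy sketch. The Python below is a simplified greedy variant with made-up stand-in models; it is not the Luce DFlash implementation (which is C++/CUDA on ggml), but it shows why accepted draft tokens come essentially for free: the expensive model spends one verification pass per draft block instead of one pass per token.

```python
# Toy greedy speculative decoding over a tiny integer "vocabulary".
# Everything here is a hypothetical stand-in, not Luce DFlash code.

def draft_propose(prefix, k):
    """Cheap drafter: propose k next tokens (a rule standing in for a small model)."""
    cur = list(prefix)
    for _ in range(k):
        cur.append((cur[-1] * 3 + 1) % 50)
    return cur[len(prefix):]

def target_greedy(prefix):
    """Toy stand-in for the big model's greedy next-token choice."""
    return (prefix[-1] * 3 + 1) % 50 if prefix[-1] % 5 else (prefix[-1] + 7) % 50

def target_verify(prefix, draft):
    """The target's greedy pick at every draft position, conditioned on the draft
    tokens before it. A real engine computes all of these in one batched forward
    pass, which is where the speedup comes from."""
    return [target_greedy(prefix + draft[:i]) for i in range(len(draft))]

def speculative_decode(prompt, n_tokens, k=4):
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_tokens:
        draft = draft_propose(seq, k)
        expected = target_verify(seq, draft)  # one "expensive" pass per draft block
        target_passes += 1
        accepted = []
        for d, e in zip(draft, expected):
            if d == e:
                accepted.append(d)   # drafted token agreed with the target: free
            else:
                accepted.append(e)   # first disagreement: keep the target's token
                break
        seq.extend(accepted)
    return seq[len(prompt):], target_passes

if __name__ == "__main__":
    out, passes = speculative_decode([1], n_tokens=32, k=4)
    print(f"{len(out)} tokens from {passes} target passes "
          f"({len(out) / passes:.2f} tokens per pass)")
```

The throughput gain is exactly that last ratio: the better the drafter agrees with the target, the more tokens each expensive pass yields.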

The posted numbers are specific enough to sound testable. Across HumanEval, GSM8K, and Math500, the author reports an autoregressive mean of 34.97 tok/s and a DFlash mean of 69.19 tok/s, a 1.98x average speedup. HumanEval rises from 34.90 to 78.16 tok/s, Math500 from 35.13 to 69.77 tok/s, and GSM8K from 34.89 to 59.65 tok/s. The same write-up also says the stack compresses the KV cache to TQ3_0 so that a 256K context can fit in 24 GB, and that sliding-window flash attention keeps 60K-context decoding around 89.7 tok/s instead of letting it collapse toward 25.8 tok/s.
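Those figures are at least internally consistent: the per-benchmark values reproduce the claimed means and the 1.98x average. A quick check, with the numbers copied straight from the post:

```python
# Sanity check of the reported throughput (tok/s), values copied from the write-up.
reported = {
    "HumanEval": (34.90, 78.16),  # (autoregressive, DFlash)
    "Math500":   (35.13, 69.77),
    "GSM8K":     (34.89, 59.65),
}

ar_mean = sum(a for a, _ in reported.values()) / len(reported)
df_mean = sum(d for _, d in reported.values()) / len(reported)

for name, (a, d) in reported.items():
    print(f"{name:10s} {a:6.2f} -> {d:6.2f} tok/s  ({d / a:.2f}x)")
print(f"{'mean':10s} {ar_mean:6.2f} -> {df_mean:6.2f} tok/s  ({df_mean / ar_mean:.2f}x)")
```

Running it gives 34.97 -> 69.19 tok/s and 1.98x for the means, matching the post, with per-benchmark speedups ranging from about 1.71x on GSM8K to 2.24x on HumanEval.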

The GitHub repository pitches the project in exactly the terms LocalLLaMA likes to hear: hand-tuned inference for specific consumer hardware, not a vague future promise. That is why the post resonated. The core question in this subreddit is rarely whether a result is impressive on eight H100s; it is whether a real person with a 3090 can reproduce something useful tonight. Qwen3.6-27B, GGUF weights, and one 24 GB card make for an immediately legible setup for the community.

  • Reported average speedup: 1.98x over autoregressive decoding
  • HumanEval: 34.90 to 78.16 tok/s
  • Memory trick: TQ3_0 KV cache compression for 256K-context goals (rough sizing sketch after this list)
  • Serving options: OpenAI-compatible HTTP endpoint or local REPL
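
That 256K-context goal is where the KV-cache compression earns its keep. The sizing sketch below uses hypothetical layer, head, and dimension values, since the post does not spell out Qwen3.6-27B's shapes, and an assumed sub-byte footprint standing in for TQ3_0; the point is only the order of magnitude, namely that a full-precision cache at that length would not fit in 24 GB on its own, while an aggressively quantized one shrinks the bill by roughly 5x.

```python
# Back-of-the-envelope KV-cache sizing. The model shapes and the 0.4 bytes/element
# figure are hypothetical placeholders, not Qwen3.6-27B's or TQ3_0's real numbers.

def kv_cache_gib(n_layers, n_kv_heads, head_dim, context, bytes_per_elem):
    # K and V each hold context * n_kv_heads * head_dim elements per layer.
    elements = 2 * n_layers * n_kv_heads * head_dim * context
    return elements * bytes_per_elem / 1024**3

ASSUMED = dict(n_layers=48, n_kv_heads=8, head_dim=128, context=262_144)  # 256K tokens

print(f"fp16 KV cache:           {kv_cache_gib(**ASSUMED, bytes_per_elem=2.0):5.1f} GiB")
print(f"sub-byte KV cache (~3b): {kv_cache_gib(**ASSUMED, bytes_per_elem=0.4):5.1f} GiB")
```

Under those assumptions the fp16 cache alone lands near 48 GiB, well past a 3090, while the compressed cache drops to roughly 10 GiB, which is the kind of headroom that makes the 256K claim at least arithmetically plausible.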

That is why the community energy here feels different from a pure leaderboard post. LocalLLaMA is not rewarding a prettier chart. It is rewarding a credible claim that consumer-hardware local inference can move from barely acceptable to genuinely comfortable with the right systems work. Luce DFlash hit the exact nerve this subreddit watches most closely: does the software rewrite make old hardware feel new enough to matter?

Source links: Reddit thread, Lucebox repository.
