Orthrus-Qwen3 Delivers Up to 7.8× Faster Inference With Identical Output
Original: Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution
What Orthrus Does
Orthrus is an inference framework that breaks the sequential bottleneck of standard autoregressive LLM decoding. Applied to Qwen3, it achieves up to 7.8× tokens per forward pass while preserving the original model's output distribution exactly — no quality tradeoff, just speed.
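The post doesn't spell out how Orthrus enforces the "identical output distribution" guarantee, but draft-and-verify methods typically rely on a rejection-sampling acceptance test like the one used in speculative sampling. The sketch below is an illustrative version of that test, not Orthrus's actual code; the function name, tensor shapes, and shown probabilities are all assumptions for the example.

```python
import torch

def verify_draft(p_base, q_draft, draft_tokens):
    """Illustrative speculative-sampling acceptance test (not Orthrus's real API).

    p_base:       (k, vocab) base-model probabilities at each draft position
    q_draft:      (k, vocab) drafter (e.g. diffusion-view) probabilities at the same positions
    draft_tokens: (k,) proposed token ids

    Token i is kept with probability min(1, p_base/q_draft); on the first rejection we
    resample from the renormalized residual max(p_base - q_draft, 0) and stop.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = p_base[i, tok], q_draft[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(tok)  # draft token kept as-is
        else:
            # Resampling from the residual distribution is what makes the overall
            # output exactly match the base model's distribution.
            residual = torch.clamp(p_base[i] - q_draft[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break  # later drafts were conditioned on the rejected token
    return accepted
```

The key property of this style of test is that accepted tokens plus the occasional residual resample are, in aggregate, exact samples from the base model, so an entire block of candidates can be checked in one verification pass without changing what the model would have generated.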
The Dual-View Architecture
Unlike speculative decoding, which relies on a separate draft model, Orthrus unifies two generation pathways within a single model via a shared KV cache: the diffusion view proposes multiple candidate tokens in parallel, and the autoregressive view verifies them. Only 16% of the parameters are fine-tuned while the base weights stay frozen, so Orthrus can be retrofitted onto existing models without full retraining.
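To make the dual-view idea concrete, here is a minimal sketch of the propose-and-verify decode loop described above, reusing the verify_draft test from the earlier snippet. Every interface here (prefill, draft_block, verify_block, commit, block_size) is hypothetical and will not match the actual Orthrus API; the point is the control flow, in which both views read and write one KV cache, the diffusion view drafts a block of tokens in a single forward pass, and the autoregressive view scores that whole block in a second pass.

```python
# Minimal sketch of a dual-view decode loop, under the assumptions stated above.
def generate(model, prompt_ids, max_new_tokens, block_size=8):
    kv_cache = model.prefill(prompt_ids)        # hypothetical: one cache shared by both views
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new_tokens:
        # Diffusion view: propose block_size candidate tokens in one forward pass,
        # returning the token ids and their proposal probabilities.
        draft_tokens, q_draft = model.draft_block(kv_cache, block_size)
        # Autoregressive view: score every candidate position in one verification
        # pass over the same cache.
        p_base = model.verify_block(kv_cache, draft_tokens)
        # Accept a prefix of the block, e.g. with the rejection test sketched earlier.
        kept = verify_draft(p_base, q_draft, draft_tokens)
        out.extend(kept)
        kv_cache = model.commit(kv_cache, kept)  # hypothetical: drop rejected positions
    return out[:len(prompt_ids) + max_new_tokens]
```

Because verification accepts a variable-length prefix of each block, the realized tokens-per-forward ratio depends on how often the two views agree, which is presumably why the headline figure is stated as "up to" 7.8×.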
Practical Benefits
A 4–7.8× speedup without memory overhead or a separate draft model simplifies deployment significantly. The gains are especially pronounced on longer contexts. The framework is open-source, making it accessible for the broader community to apply to other model families beyond Qwen3.
Reception
The project earned 176 points on Hacker News and 260+ on r/LocalLLaMA simultaneously, with the Qwen3-8B variant drawing particular enthusiasm from the local AI community. The combination of measurable speedup, identical output guarantee, and easy applicability makes Orthrus a standout contribution to the inference optimization space.
Related Articles
PR #22673, which adds Multi-Token Prediction support to llama.cpp, has been merged into master. The change brings the inference technique popularized by DeepSeek to the most widely used local LLM inference engine.
LocalLLaMA did not treat Luce DFlash as just another benchmark screenshot. The post took off because it promised almost 2× mean throughput for Qwen3.6-27B on a single RTX 3090, with no retraining and enough memory engineering to keep long-context local inference practical.
Open-source PFlash uses speculative prefill to dramatically cut time-to-first-token for long-context LLM inference, achieving a 10.4× speedup on Qwen3.6-27B Q4_K_M with a consumer RTX 3090.