
Orthrus-Qwen3 Delivers 7.8× Faster Inference With Identical Output

Original: Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution

LLM · May 16, 2026 · By Insights AI (HN) · 1 min read

What Orthrus Does

Orthrus is an inference framework that breaks the sequential bottleneck of standard autoregressive LLM decoding. Applied to Qwen3, it generates up to 7.8× as many tokens per forward pass while exactly preserving the original model's output distribution: no quality tradeoff, just speed.

The Dual-View Architecture

Unlike speculative decoding, which uses a separate draft model, Orthrus unifies two generation pathways within a single model via a shared KV cache. The diffusion view generates multiple candidate tokens in parallel; the autoregressive view verifies them. Only 16% of parameters require fine-tuning, and the base model remains frozen — meaning Orthrus can be applied to existing models without full retraining.
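The propose-and-verify loop this dual-view design implies can be sketched in miniature. Everything below is a hypothetical illustration, not Orthrus's actual code or API: a toy deterministic function stands in for the frozen base model, a deliberately imperfect drafter stands in for the diffusion view, and greedy agreement with the base model serves as verification, which is what guarantees the output is identical to plain decoding.

```python
# Hypothetical sketch (not the Orthrus implementation): draft k candidate
# tokens at once, then accept the longest prefix the base model's own
# greedy choices agree with. Accepted tokens therefore match standard
# autoregressive decoding exactly; the speedup comes from accepting
# several tokens per verifier pass.

def base_model_next(context):
    """Stand-in for the frozen base model's greedy next-token rule."""
    return (sum(context) * 31 + 7) % 50          # toy deterministic "model"

def draft_k(context, k):
    """Stand-in for the diffusion view: drafts k candidates in one shot.
    Mostly agrees with the base model, with a periodic injected error."""
    out, ctx = [], list(context)
    for i in range(k):
        tok = base_model_next(ctx)
        if (len(context) + i) % 5 == 4:          # inject an occasional draft error
            tok = (tok + 1) % 50
        out.append(tok)
        ctx.append(tok)
    return out

def decode(prompt, n_tokens, k=4):
    """Greedy decoding via draft + verify; returns (tokens, verifier_passes)."""
    ctx, passes = list(prompt), 0
    while len(ctx) - len(prompt) < n_tokens:
        candidates = draft_k(ctx, k)
        passes += 1        # in a real system, one batched forward checks all k
        accepted, check = [], list(ctx)
        for tok in candidates:
            if base_model_next(check) != tok:    # verifier disagrees: stop here
                break
            accepted.append(tok)
            check.append(tok)
        if not accepted:                         # guarantee progress each pass
            accepted = [base_model_next(ctx)]
        ctx += accepted
    return ctx[len(prompt):len(prompt) + n_tokens], passes

def plain_greedy(prompt, n_tokens):
    """Baseline: one forward pass per token."""
    ctx = list(prompt)
    for _ in range(n_tokens):
        ctx.append(base_model_next(ctx))
    return ctx[len(prompt):]

tokens, passes = decode([1, 2, 3], 12)
assert tokens == plain_greedy([1, 2, 3], 12)     # identical output
print(f"12 tokens in {passes} verifier passes")  # prints "12 tokens in 6 verifier passes"
```

In this toy run, 12 tokens cost only 6 verifier passes while matching the baseline token for token; Orthrus's shared KV cache additionally lets the drafting and verifying views reuse each other's state instead of running two separate models.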

Practical Benefits

A 4–7.8× speedup without memory overhead or a separate draft model simplifies deployment significantly. The gains are especially pronounced on longer contexts. The framework is open-source, making it accessible for the broader community to apply to other model families beyond Qwen3.
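The reported range is consistent with a simple acceptance-rate model. As a back-of-envelope sketch (an illustrative assumption, not Orthrus's published analysis): if each drafted token independently survives verification with probability p and up to k tokens are drafted per verifier pass, the expected tokens emitted per forward pass is the geometric sum 1 + p + p² + ... + pᵏ.

```python
# Back-of-envelope estimate (assumption for illustration, not taken from
# the Orthrus release): expected tokens per verifier forward pass given a
# per-token acceptance probability p and draft length k. The "+1" term is
# the verifier's own token, so progress is guaranteed even at p = 0.

def expected_tokens_per_pass(p: float, k: int) -> float:
    return sum(p ** i for i in range(k + 1))

for p in (0.60, 0.80, 0.95):
    rate = expected_tokens_per_pass(p, 8)
    print(f"p={p:.2f}, k=8 -> {rate:.2f}x tokens/forward")
# p=0.60, k=8 -> 2.47x tokens/forward
# p=0.80, k=8 -> 4.33x tokens/forward
# p=0.95, k=8 -> 7.40x tokens/forward
```

Under this crude model, acceptance rates between roughly 0.8 and 0.95 span the reported 4–7.8× range, which also suggests why longer contexts help: more context tends to make the draft view's guesses easier to confirm.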

Reception

The project earned 176 points on Hacker News and over 260 upvotes on r/LocalLLaMA at the same time, with the Qwen3-8B variant drawing particular enthusiasm from the local-AI community. The combination of measurable speedup, identical-output guarantee, and easy applicability makes Orthrus a standout contribution to the inference optimization space.
