Orthrus-Qwen3 Delivers Up to 7.8× Faster Inference With Identical Output
Original: Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution
What Orthrus Does
Orthrus is an inference framework that breaks the sequential bottleneck of standard autoregressive LLM decoding. Applied to Qwen3, it achieves up to 7.8× tokens per forward pass while preserving the original model's output distribution exactly — no quality tradeoff, just speed.
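The post doesn't spell out how Orthrus enforces the "identical output distribution" guarantee, but draft-and-verify methods typically rely on a rejection-sampling acceptance test like the one used in speculative sampling. The sketch below is an illustrative version of that test, not Orthrus's actual code; the function name, tensor shapes, and shown probabilities are all assumptions for the example.

```python
import torch

def verify_draft(p_base, q_draft, draft_tokens):
    """Illustrative speculative-sampling acceptance test (not Orthrus's real API).

    p_base:       (k, vocab) base-model probabilities at each draft position
    q_draft:      (k, vocab) drafter (e.g. diffusion-view) probabilities at the same positions
    draft_tokens: (k,) proposed token ids

    Token i is kept with probability min(1, p_base/q_draft); on the first rejection we
    resample from the renormalized residual max(p_base - q_draft, 0) and stop.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens.tolist()):
        p, q = p_base[i, tok], q_draft[i, tok]
        if torch.rand(()) < torch.clamp(p / q, max=1.0):
            accepted.append(tok)  # draft token kept as-is
        else:
            # Resampling from the residual distribution is what makes the overall
            # output exactly match the base model's distribution.
            residual = torch.clamp(p_base[i] - q_draft[i], min=0.0)
            accepted.append(torch.multinomial(residual / residual.sum(), 1).item())
            break  # later drafts were conditioned on the rejected token
    return accepted
```

The key property of this style of test is that accepted tokens plus the occasional residual resample are, in aggregate, exact samples from the base model, so an entire block of candidates can be checked in one verification pass without changing what the model would have generated.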
The Dual-View Architecture
Unlike speculative decoding, which relies on a separate draft model, Orthrus unifies two generation pathways within a single model via a shared KV cache: the diffusion view proposes multiple candidate tokens in parallel, and the autoregressive view verifies them. Only 16% of the parameters are fine-tuned while the base weights stay frozen, so Orthrus can be retrofitted onto existing models without full retraining.
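To make the dual-view idea concrete, here is a minimal sketch of the propose-and-verify decode loop described above, reusing the verify_draft test from the earlier snippet. Every interface here (prefill, draft_block, verify_block, commit, block_size) is hypothetical and will not match the actual Orthrus API; the point is the control flow, in which both views read and write one KV cache, the diffusion view drafts a block of tokens in a single forward pass, and the autoregressive view scores that whole block in a second pass.

```python
# Minimal sketch of a dual-view decode loop, under the assumptions stated above.
def generate(model, prompt_ids, max_new_tokens, block_size=8):
    kv_cache = model.prefill(prompt_ids)        # hypothetical: one cache shared by both views
    out = list(prompt_ids)
    while len(out) - len(prompt_ids) < max_new_tokens:
        # Diffusion view: propose block_size candidate tokens in one forward pass,
        # returning the token ids and their proposal probabilities.
        draft_tokens, q_draft = model.draft_block(kv_cache, block_size)
        # Autoregressive view: score every candidate position in one verification
        # pass over the same cache.
        p_base = model.verify_block(kv_cache, draft_tokens)
        # Accept a prefix of the block, e.g. with the rejection test sketched earlier.
        kept = verify_draft(p_base, q_draft, draft_tokens)
        out.extend(kept)
        kv_cache = model.commit(kv_cache, kept)  # hypothetical: drop rejected positions
    return out[:len(prompt_ids) + max_new_tokens]
```

Because verification accepts a variable-length prefix of each block, the realized tokens-per-forward ratio depends on how often the two views agree, which is presumably why the headline figure is stated as "up to" 7.8×.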
Practical Benefits
A 4–7.8× speedup without memory overhead or a separate draft model simplifies deployment significantly. The gains are especially pronounced on longer contexts. The framework is open-source, making it accessible for the broader community to apply to other model families beyond Qwen3.
Reception
The project earned 176 points on Hacker News and 260+ on r/LocalLLaMA simultaneously, with the Qwen3-8B variant drawing particular enthusiasm from the local AI community. The combination of measurable speedup, identical output guarantee, and easy applicability makes Orthrus a standout contribution to the inference optimization space.
Related Articles
PR #22673, which adds Multi-Token Prediction support to llama.cpp, has been merged into master. The change brings the inference technique popularized by DeepSeek to the most widely used local LLM inference engine.
LocalLLaMA did not treat Luce DFlash as just another benchmark screenshot. The post took off because it promised almost 2× mean throughput for Qwen3.6-27B on a single RTX 3090, with no retraining and enough memory engineering to keep long-context local inference practical.
Open-source PFlash uses speculative prefill to dramatically cut time-to-first-token for long-context LLM inference, achieving a 10.4× speedup on Qwen3.6-27B Q4_K_M with a consumer RTX 3090.