The Orthrus framework achieves up to 7.8× tokens per forward pass on Qwen3 models while maintaining a provably identical output distribution to the original. Its dual-view architecture shares a single KV cache between autoregressive and diffusion pathways.
#qwen3
llama.cpp's Multi-Token Prediction (MTP) support has entered beta, currently covering Qwen3.5 MTP. Combined with maturing tensor-parallel support, most token generation speed gaps between llama.cpp and vLLM are expected to close.
Lightning OPD attacks a practical bottleneck in on-policy distillation: keeping a live teacher model running throughout training. The paper reports 69.9% on AIME 2024 from Qwen3-8B-Base in 30 GPU hours, a 4.0× speedup over standard OPD.
A 54-point Reddit post flagged merged PR #19441 as the moment qwen3-omni-moe and qwen3-asr support reached llama.cpp, with commenters focused on local multimodal and ASR use cases.
StepFun went beyond publishing a model card by releasing the Step-3.5-Flash-SFT dataset on Hugging Face. The repo bundles raw JSON data, tokenizer snapshots, and StepTronOSS-oriented compiled shards, while the Reddit discussion focused on reproducibility, reasoning traces, and the implications of the dual-license setup.
Qwen3's TTS model encodes voices into 1024-dimensional vectors, enabling gender swapping, pitch adjustment, voice mixing, and semantic voice search through vector math; it is now available as a standalone lightweight encoder on Hugging Face.
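As a rough illustration of what "vector math" on voice embeddings can look like, here is a minimal NumPy sketch. It does not use the actual encoder: the embeddings are random stand-ins, and every function name (`mix_voices`, `attribute_direction`, `shift_voice`, `nearest_voices`) is hypothetical. Only the 1024-dimensional size comes from the summary above.

```python
import numpy as np

DIM = 1024  # embedding size stated in the summary; everything else is assumed


def mix_voices(a, b, weight=0.5):
    """Blend two voice embeddings by linear interpolation."""
    return (1.0 - weight) * a + weight * b


def attribute_direction(group_a, group_b):
    """Estimate an attribute axis (e.g. a gender or pitch axis) as the
    difference between the mean embeddings of two labeled groups."""
    return np.mean(group_b, axis=0) - np.mean(group_a, axis=0)


def shift_voice(v, direction, strength=1.0):
    """Move an embedding along an attribute axis."""
    return v + strength * direction


def nearest_voices(query, library, k=3):
    """Return indices of the k library voices most similar to the query,
    ranked by cosine similarity."""
    lib = library / np.linalg.norm(library, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    scores = lib @ q
    return np.argsort(-scores)[:k]


# Random stand-ins for real encoder outputs (assumption, for illustration only).
rng = np.random.default_rng(0)
library = rng.normal(size=(10, DIM))

blend = mix_voices(library[0], library[1], weight=0.25)
direction = attribute_direction(library[:5], library[5:])
shifted = shift_voice(library[0], direction, strength=0.5)
top = nearest_voices(library[3], library, k=3)
```

With a real encoder, the two groups passed to `attribute_direction` would be embeddings of labeled voice samples, and the shifted or blended vector would be fed back to the TTS model as the speaker condition. Whether simple linear operations behave this cleanly depends on the embedding space, which this sketch does not verify.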