r/LocalLLaMA tracks the llama.cpp merge that brings in Qwen3 audio support

Original: mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)

LLM Apr 13, 2026 By Insights AI (Reddit)

A LocalLLaMA post pointed users to llama.cpp PR #19441, which has now been merged into master and adds long-requested support for Qwen3 audio models. The author summarized the result in two lines: qwen3-omni-moe works for vision plus audio input, and qwen3-asr works for speech recognition. The post also linked ready-to-test GGUF conversions on Hugging Face, making it immediately useful for local inference users.

The linked PR shows why the change attracted attention. Follow-up notes describe a dedicated audio path for Qwen3-ASR with a Conv2d encoder, a Whisper-like transformer encoder, and an MLP projector. That matters because it moves support beyond basic model loading and into practical multimodal and ASR workflows inside the most widely used open local inference stack.
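The shape flow through that path can be sketched end to end. The following is a minimal NumPy sketch, illustrative only: the dimensions, the pooling front-end, and the single-matrix "encoder" stand in for the real Conv2d stack, attention encoder, and projector whose actual hyperparameters live in the PR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only to make the shapes concrete.
N_MELS = 128   # mel bins in the input spectrogram
D_ENC = 256    # audio encoder hidden size
D_LLM = 512    # LLM embedding size the projector maps into

def conv2d_downsample(mel):
    """Stand-in for the Conv2d front-end: 2x downsample in time and
    frequency via average pooling, then project to the encoder width."""
    t, f = mel.shape
    t2, f2 = t // 2, f // 2
    pooled = mel[: t2 * 2, : f2 * 2].reshape(t2, 2, f2, 2).mean(axis=(1, 3))
    w = rng.standard_normal((f2, D_ENC)) / np.sqrt(f2)
    return pooled @ w                      # (t/2, D_ENC)

def whisper_like_encoder(x):
    """Stand-in for the transformer encoder; a single linear mix here so
    the shapes flow. The real encoder is attention-based."""
    w = rng.standard_normal((D_ENC, D_ENC)) / np.sqrt(D_ENC)
    return np.tanh(x @ w)                  # (t/2, D_ENC)

def mlp_projector(x):
    """Stand-in for the MLP projector into the LLM embedding space."""
    w1 = rng.standard_normal((D_ENC, D_ENC)) / np.sqrt(D_ENC)
    w2 = rng.standard_normal((D_ENC, D_LLM)) / np.sqrt(D_ENC)
    return np.maximum(x @ w1, 0.0) @ w2    # (t/2, D_LLM)

mel = rng.standard_normal((3000, N_MELS))  # ~30 s of 10 ms mel frames
tokens = mlp_projector(whisper_like_encoder(conv2d_downsample(mel)))
print(tokens.shape)  # (1500, 512)
```

The point of the sketch is the data shape: a mel spectrogram enters, gets downsampled in time, and leaves as a shorter sequence of vectors sized for the LLM, which is what lets audio tokens sit in the same context as text.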

At the same time, the discussion makes clear that support is still an engineering process rather than a finished product. Review comments mention a likely Whisper preprocessing bug that can drop the last audio chunk, different audio boundary tokens for Qwen3-ASR, and the need for chunked or windowed attention. One contributor said full attention over a 30-second chunk produced poor results and that 8-second chunking worked better in a private fork used for Chinese lecture transcription.

  • The PR targeted support for both qwen3-omni-moe and qwen3-asr and ultimately landed in master.
  • Follow-up notes highlighted concrete implementation details such as a Conv2d encoder, Whisper-like encoder, and MLP projector.
  • The remaining work is mostly about practical stability: chunking, token handling, and preprocessing quality.

The Reddit comments were small in number but consistent in tone. Users were glad that qwen3-asr support finally landed, wanted to test Qwen3-Omni-30B-A3B-Thinking against audio plus video frames, and noted that local multimodal releases are arriving faster than many people can track. The significance is straightforward: once support lands in llama.cpp, the gap between a new model release and community experimentation gets much shorter.

