r/LocalLLaMA tracks the llama.cpp merge that brings in Qwen3 audio support

Original: mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)

LLM Apr 13, 2026 By Insights AI (Reddit)

A LocalLLaMA post pointed users to llama.cpp PR #19441, which has now been merged into master and adds long-requested support for Qwen3 audio models. The author summarized the result in two lines: qwen3-omni-moe works for vision plus audio input, and qwen3-asr works for speech recognition. The post also linked ready-to-test GGUF conversions on Hugging Face, making it immediately useful for local inference users.

The linked PR shows why the change attracted attention. Follow-up notes describe a dedicated audio path for Qwen3-ASR with a Conv2d encoder, a Whisper-like transformer encoder, and an MLP projector. That matters because it moves support beyond basic model loading and into practical multimodal and ASR workflows inside the most widely used open local inference stack.
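The shape flow through that path can be sketched end to end. The following is a minimal NumPy sketch, illustrative only: the dimensions, the pooling front-end, and the single-matrix "encoder" stand in for the real Conv2d stack, attention encoder, and projector whose actual hyperparameters live in the PR.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only to make the shapes concrete.
N_MELS = 128   # mel bins in the input spectrogram
D_ENC = 256    # audio encoder hidden size
D_LLM = 512    # LLM embedding size the projector maps into

def conv2d_downsample(mel):
    """Stand-in for the Conv2d front-end: 2x downsample in time and
    frequency via average pooling, then project to the encoder width."""
    t, f = mel.shape
    t2, f2 = t // 2, f // 2
    pooled = mel[: t2 * 2, : f2 * 2].reshape(t2, 2, f2, 2).mean(axis=(1, 3))
    w = rng.standard_normal((f2, D_ENC)) / np.sqrt(f2)
    return pooled @ w                      # (t/2, D_ENC)

def whisper_like_encoder(x):
    """Stand-in for the transformer encoder; a single linear mix here so
    the shapes flow. The real encoder is attention-based."""
    w = rng.standard_normal((D_ENC, D_ENC)) / np.sqrt(D_ENC)
    return np.tanh(x @ w)                  # (t/2, D_ENC)

def mlp_projector(x):
    """Stand-in for the MLP projector into the LLM embedding space."""
    w1 = rng.standard_normal((D_ENC, D_ENC)) / np.sqrt(D_ENC)
    w2 = rng.standard_normal((D_ENC, D_LLM)) / np.sqrt(D_ENC)
    return np.maximum(x @ w1, 0.0) @ w2    # (t/2, D_LLM)

mel = rng.standard_normal((3000, N_MELS))  # ~30 s of 10 ms mel frames
tokens = mlp_projector(whisper_like_encoder(conv2d_downsample(mel)))
print(tokens.shape)  # (1500, 512)
```

The point of the sketch is the data shape: a mel spectrogram enters, gets downsampled in time, and leaves as a shorter sequence of vectors sized for the LLM, which is what lets audio tokens sit in the same context as text.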

At the same time, the discussion makes clear that support is still an engineering process rather than a finished product. Review comments mention a likely Whisper preprocessing bug that can drop the last audio chunk, different audio boundary tokens for Qwen3-ASR, and the need for chunked or windowed attention. One contributor said full attention over a 30-second chunk produced poor results and that 8-second chunking worked better in a private fork used for Chinese lecture transcription.

  • The PR targeted support for both qwen3-omni-moe and qwen3-asr and ultimately landed in master.
  • Follow-up notes highlighted concrete implementation details such as a Conv2d encoder, Whisper-like encoder, and MLP projector.
  • The remaining work is mostly about practical stability: chunking, token handling, and preprocessing quality.

The Reddit comments were small in number but consistent in tone. Users were glad that qwen3-asr support finally landed, wanted to test Qwen3-Omni-30B-A3B-Thinking against audio plus video frames, and noted that local multimodal releases are arriving faster than many people can track. The significance is straightforward: once support lands in llama.cpp, the gap between a new model release and community experimentation gets much shorter.

