r/LocalLLaMA tracks the llama.cpp merge that brings in Qwen3 audio support
Original: mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)
A LocalLLaMA post pointed users to llama.cpp PR #19441, which has now been merged into master and adds long-requested support for Qwen3 audio models. The author summarized the result in two lines: qwen3-omni-moe works for vision plus audio input, and qwen3-asr works for speech recognition. The post also linked ready-to-test GGUF conversions on Hugging Face, making it immediately useful for local inference users.
The linked PR shows why the change attracted attention. Follow-up notes describe a dedicated audio path for Qwen3-ASR with a Conv2d encoder, a Whisper-like transformer encoder, and an MLP projector. That matters because it moves support beyond basic model loading and into practical multimodal and ASR workflows inside the most widely used open local inference stack.
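To make the pipeline shape concrete, here is a minimal sketch of how a stride-2 Conv2d front end downsamples the time axis of a mel spectrogram before a Whisper-like transformer encoder. The kernel, stride, and padding values below are illustrative assumptions, not parameters taken from the llama.cpp PR.

```python
# Sketch only: illustrates the downsampling arithmetic of a Conv2d
# audio front end. All hyperparameters here are assumed for
# illustration, not read from the PR.

def conv_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output-length formula (dilation = 1)."""
    return (length + 2 * padding - kernel) // stride + 1

def encoder_frames(mel_frames: int) -> int:
    """Two hypothetical stride-2 3x3 convs reduce the time axis
    roughly 4x before the transformer encoder sees it."""
    t = conv_out_len(mel_frames, kernel=3, stride=2, padding=1)
    t = conv_out_len(t, kernel=3, stride=2, padding=1)
    return t

# A 30-second clip at the common 10 ms mel hop is 3000 frames;
# two stride-2 convs would hand the transformer 750 positions.
print(encoder_frames(3000))
```

The MLP projector mentioned in the follow-up notes would then map each of those encoder positions into the language model's embedding space, the same role projectors play in llama.cpp's existing vision paths.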
At the same time, the discussion makes clear that support is still an engineering process rather than a finished product. Review comments mention a likely Whisper preprocessing bug that can drop the last audio chunk, different audio boundary tokens for Qwen3-ASR, and the need for chunked or windowed attention. One contributor said full attention over a 30-second chunk produced poor results and that 8-second chunking worked better in a private fork used for Chinese lecture transcription.
- The PR targeted support for both qwen3-omni-moe and qwen3-asr and ultimately landed in master.
- Follow-up notes highlighted concrete implementation details such as a Conv2d encoder, Whisper-like encoder, and MLP projector.
- The remaining work is mostly about practical stability: chunking, token handling, and preprocessing quality.
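The chunking and preprocessing issues above are easy to picture in code. The sketch below, assuming 16 kHz input and an 8-second window (the duration one contributor reported working well), splits long audio into fixed windows and pads the trailing remainder with silence instead of discarding it, which is the failure mode the review comments describe.

```python
# Illustrative sketch of windowed audio chunking. The 16 kHz rate and
# 8-second window are assumptions for illustration, not values
# confirmed by the PR.

SAMPLE_RATE = 16_000

def chunk_audio(samples: list[float], chunk_seconds: float = 8.0) -> list[list[float]]:
    chunk_len = int(chunk_seconds * SAMPLE_RATE)
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        if len(chunk) < chunk_len:
            # Keep and zero-pad the final partial chunk rather than
            # silently dropping it, the bug flagged in review comments.
            chunk = chunk + [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks
```

For example, 10 seconds of audio yields two 8-second chunks, the second mostly padding; a naive loop that only emits full windows would have returned one and lost the last two seconds of speech.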
The Reddit comments were few but consistent in tone. Users were glad that qwen3-asr support finally landed, wanted to test Qwen3-Omni-30B-A3B-Thinking against audio plus video frames, and noted that local multimodal releases are arriving faster than many people can track. The significance is straightforward: once support lands in llama.cpp, the gap between a new model release and community experimentation gets much shorter.
Related Articles
StepFun opened more than a model card by releasing the Step-3.5-Flash-SFT dataset on Hugging Face. The repo bundles raw JSON data, tokenizer snapshots, and StepTronOSS-oriented compiled shards, while the Reddit discussion focused on reproducibility, reasoning traces, and the implications of the dual-license setup.
Mistral announced Mistral Small 4 on March 16, 2026 as a single open model that combines reasoning, multimodal input, and agentic coding. Key specs include 119B total parameters, 6B active parameters per token, a 256k context window, Apache 2.0 licensing, and configurable reasoning effort.
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.