r/LocalLLaMA tracks the llama.cpp merge that brings in Qwen3 audio support
Original: mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)
A LocalLLaMA post pointed users to llama.cpp PR #19441, which has now been merged into master and adds long-requested support for Qwen3 audio models. The author summarized the result in two lines: qwen3-omni-moe works for vision plus audio input, and qwen3-asr works for speech recognition. The post also linked ready-to-test GGUF conversions on Hugging Face, making it immediately useful for local inference users.
The linked PR shows why the change attracted attention. Follow-up notes describe a dedicated audio path for Qwen3-ASR with a Conv2d encoder, a Whisper-like transformer encoder, and an MLP projector. That matters because it moves support beyond basic model loading and into practical multimodal and ASR workflows inside the most widely used open local inference stack.
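To make the pipeline shape concrete, here is a minimal sketch of how a stride-2 Conv2d front end downsamples the time axis of a mel spectrogram before a Whisper-like transformer encoder. The kernel, stride, and padding values below are illustrative assumptions, not parameters taken from the llama.cpp PR.

```python
# Sketch only: illustrates the downsampling arithmetic of a Conv2d
# audio front end. All hyperparameters here are assumed for
# illustration, not read from the PR.

def conv_out_len(length: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output-length formula (dilation = 1)."""
    return (length + 2 * padding - kernel) // stride + 1

def encoder_frames(mel_frames: int) -> int:
    """Two hypothetical stride-2 3x3 convs reduce the time axis
    roughly 4x before the transformer encoder sees it."""
    t = conv_out_len(mel_frames, kernel=3, stride=2, padding=1)
    t = conv_out_len(t, kernel=3, stride=2, padding=1)
    return t

# A 30-second clip at the common 10 ms mel hop is 3000 frames;
# two stride-2 convs would hand the transformer 750 positions.
print(encoder_frames(3000))
```

The MLP projector mentioned in the follow-up notes would then map each of those encoder positions into the language model's embedding space, the same role projectors play in llama.cpp's existing vision paths.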
At the same time, the discussion makes clear that support is still an engineering process rather than a finished product. Review comments mention a likely Whisper preprocessing bug that can drop the last audio chunk, different audio boundary tokens for Qwen3-ASR, and the need for chunked or windowed attention. One contributor said full attention over a 30-second chunk produced poor results and that 8-second chunking worked better in a private fork used for Chinese lecture transcription.
- The PR targeted support for both qwen3-omni-moe and qwen3-asr and ultimately landed in master.
- Follow-up notes highlighted concrete implementation details such as a Conv2d encoder, Whisper-like encoder, and MLP projector.
- The remaining work is mostly about practical stability: chunking, token handling, and preprocessing quality.
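The chunking and preprocessing issues above are easy to picture in code. The sketch below, assuming 16 kHz input and an 8-second window (the duration one contributor reported working well), splits long audio into fixed windows and pads the trailing remainder with silence instead of discarding it, which is the failure mode the review comments describe.

```python
# Illustrative sketch of windowed audio chunking. The 16 kHz rate and
# 8-second window are assumptions for illustration, not values
# confirmed by the PR.

SAMPLE_RATE = 16_000

def chunk_audio(samples: list[float], chunk_seconds: float = 8.0) -> list[list[float]]:
    chunk_len = int(chunk_seconds * SAMPLE_RATE)
    chunks = []
    for start in range(0, len(samples), chunk_len):
        chunk = samples[start:start + chunk_len]
        if len(chunk) < chunk_len:
            # Keep and zero-pad the final partial chunk rather than
            # silently dropping it, the bug flagged in review comments.
            chunk = chunk + [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
    return chunks
```

For example, 10 seconds of audio yields two 8-second chunks, the second mostly padding; a naive loop that only emits full windows would have returned one and lost the last two seconds of speech.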
The Reddit comments were few but consistent in tone. Users were glad that qwen3-asr support finally landed, wanted to test Qwen3-Omni-30B-A3B-Thinking against audio plus video frames, and noted that local multimodal releases are arriving faster than many people can track. The significance is straightforward: once support lands in llama.cpp, the gap between a new model release and community experimentation gets much shorter.
Related Articles
StepFun opened more than a model card by releasing the Step-3.5-Flash-SFT dataset on Hugging Face. The repo bundles raw JSON data, tokenizer snapshots, and StepTronOSS-oriented compiled shards, while the Reddit discussion focused on reproducibility, reasoning traces, and the implications of the dual-license setup.
Mistral announced Mistral Small 4 on March 16, 2026 as a single open model that combines reasoning, multimodal input, and agentic coding. Key specs include 119B total parameters, 6B active parameters per token, a 256k context window, Apache 2.0 licensing, and configurable reasoning effort.
A fresh LocalLLaMA thread argues that some early Gemma 4 failures are really inference-stack bugs rather than model quality problems. By linking active llama.cpp pull requests and user reports after updates, the post reframes launch benchmarks as a full-stack issue.