The thread’s energy centered on the architecture claim: what does “encoder-free” really mean for a 12B multimodal model?
Local multimodal AI is moving into the 12B class. Google Gemma introduced Gemma 4 12B under Apache 2.0, describing a unified encoder-free design for image, audio, and text inputs.
Google’s I/O 2026 AI story is about distribution as much as models. Gemini 3.5 Flash is now generally available across API, Antigravity, Android Studio, enterprise tools, Search, and the Gemini app, while Gemini Omni Flash brings video generation into the same push.
At Google I/O 2026, Google DeepMind unveiled Gemini Omni — its first model capable of generating video from any input including text, images, audio, and video. Combining Gemini's intelligence with Google's generative media systems, it is available now through the Gemini app and YouTube Shorts.
ByteDance Research has open-sourced Lance, a 3B-parameter unified multimodal model that handles image and video generation, editing, and understanding in a single framework. It achieves top-tier benchmark scores, matching or outperforming models twice its size.
Google has updated the Gemini API File Search tool to support multimodal content including images, audio, and video, making it easier for developers to build efficient, verifiable RAG systems.
IBM Research has published MAMMAL, a multi-modal model that unifies proteins, molecules, and gene data. It achieves state-of-the-art results on 9 of 11 biological benchmarks and outperforms AlphaFold 3 on several drug-discovery tasks.
The r/singularity community found that Claude Mythos can generate image outputs, reportedly marking Anthropic's first foray into image generation models.
LocalLLaMA reacted hard because DeepSeek's visual-primitives idea makes points and boxes part of reasoning itself, and the repo going private only made the thread hotter.
The important medical AI story here is not replacement but reliability. Google DeepMind says its AI co-clinician produced zero critical errors in 97 of 98 realistic primary-care queries, while physicians still beat it overall in multimodal telemedicine simulations.
NVIDIA is targeting the cost bottleneck in multimodal agents, not just the demo factor. Nemotron 3 Nano Omni claims up to 9x higher throughput, a 256K context window, and six leaderboard wins for document, video, and audio understanding.
Multimodal agents still pay a tax for chaining separate vision, audio, and text models. NVIDIA says Nemotron 3 Nano Omni collapses that stack into a 30B model with 256K context and up to 9.2x higher effective video system capacity at the same responsiveness target.