A LocalLLaMA demo pointed to Parlor, which runs speech and vision understanding with Gemma 4 E2B and uses Kokoro for text-to-speech, all on-device. The README reports roughly 2.5-3.0 seconds end-to-end latency and about 83 tokens/sec decode speed on an Apple M3 Pro.
#multimodal
xAI used a recent X thread to spell out one of the capability upgrades behind Grok Imagine’s Quality mode. The company says the mode improves world knowledge and prompt understanding, letting the image model better interpret complex scenes, physics, object relationships, and specific cultural or brand references.
Together AI said on April 3, 2026 that Wan 2.7 from Alibaba Cloud is now available on its platform. The accompanying product post says text-to-video is live now, with image-to-video, reference-to-video, and video edit workflows rolling out on the same API, auth, and billing surface.
Google AI said on March 26, 2026 that Gemini 3.1 Flash Live is launching for developers building real-time voice and vision agents. Google highlighted faster natural dialogue, better task completion in noisy environments, and stronger complex-instruction following, while its Live API docs describe low-latency multimodal streaming with tool use and 70-language support.
Alibaba Cloud positioned Qwen3.6-Plus as a 1M-context model for agentic coding, tool use, and multimodal reasoning, and Hacker News quickly surfaced it as a high-interest AI launch.
Google DeepMind has introduced Gemma 4 as a new open-model family built from Gemini 3 research. The lineup spans E2B and E4B edge models through 26B and 31B local-workstation models, with function calling, multimodal reasoning, and 140-language support at the center of the release.
Meta said on March 26, 2026 that TRIBE v2 is a foundation model for predicting human brain responses to sight, sound, and language. The supporting paper and demo highlight zero-shot generalization, prediction across 70,000 voxels, and public releases of the paper, code, and model weights.
Google DeepMind said on March 26, 2026 that Gemini 3.1 Flash Live is rolling out in Gemini Live and Google Search Live, while developers can access it through Google AI Studio. Google’s announcement positions 3.1 Flash Live as its highest-quality audio model, with lower latency, improved tonal understanding, and benchmark gains including 90.8% on ComplexFuncBench Audio.
Google expanded Search Live on March 26, 2026 to every language and location where AI Mode is available. The move pushes multimodal voice-and-camera search to more than 200 countries and territories and gives Gemini’s live audio stack a much larger real-world footprint.
Mistral announced Mistral Small 4 on March 16, 2026 as a single open model that combines reasoning, multimodal input, and agentic coding. Key specs include 119B total parameters, 6B active parameters per token, a 256k context window, Apache 2.0 licensing, and configurable reasoning effort.
OpenAI announced GPT-5.4 mini and nano on March 17, 2026. The company says mini is more than 2x faster than GPT-5 mini and improves coding, reasoning, multimodal understanding, and tool use, while nano targets low-cost classification, extraction, ranking, and simpler coding subagents.
NVIDIA AI Dev highlighted on March 27, 2026 that Edison's PaperQA3 can reason over more than 150 million research papers and patents and has posted strong LABBench2 results. Edison's article says the multimodal system can now read figures and tables and compare hundreds of visual elements before answering, and that it ranks among the strongest deep-research agents on relevant LABBench2 subsets.