Meituan puts LongCat-Video-Avatar 1.5 on Hugging Face with MIT license
Original: Meituan Releases LongCat-Video-Avatar 1.5 as MIT Model
Open avatar generation gets a stronger reference point
Audio-driven avatar generation is moving beyond closed demos and into model hubs where developers can inspect, run, and adapt the stack. In the source tweet, Gorden Sun described LongCat-Video-Avatar 1.5 as an “audio-driven video generation” model.
The project page says LongCat-Video-Avatar 1.5 was built by the Meituan LongCat Team on top of LongCat-Video. Its demos cover lip-sync, singing, animation, and multi-person interaction, with the 1.0-to-1.5 comparison emphasizing better mouth-shape accuracy, stronger identity preservation in long videos, broader interaction scenarios, and faster 8-step generation. The comparison section names HeyGen, Kling Avatar 2.0, and OmniHuman-1.5, placing the release directly against commercial and frontier avatar systems.
The Hugging Face model card is the practical part of the story. The model is tagged for Diffusers, ONNX, Safetensors, and Transformers, and the task tags include audio-text-to-video, audio-image-text-to-video, audio-driven-video-continuation, avatar, and video-generation. It also lists an MIT license and provides starter code for using the model with Diffusers, lowering the barrier for developers who want to test the release locally or build evaluation harnesses around it.
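Before downloading any weights, a team could check the card's metadata against its own requirements. The following is a minimal sketch, not the card's starter code: `check_card`, `EXPECTED_TASK_TAGS`, and `ALLOWED_LICENSES` are hypothetical names, and the lowercase tag strings are an assumption about how the tags listed above are spelled on the Hub. The sample metadata is typed in by hand rather than fetched.

```python
# Hypothetical helper: verify that a model card's metadata matches what a
# project needs before any weights are downloaded.

# Task tags and license as described for the LongCat-Video-Avatar 1.5 card.
# The exact lowercase spellings are an assumption.
EXPECTED_TASK_TAGS = {
    "audio-text-to-video",
    "audio-image-text-to-video",
    "audio-driven-video-continuation",
}
ALLOWED_LICENSES = {"mit"}


def check_card(tags, license_id):
    """Return a list of human-readable problems; an empty list means a pass."""
    problems = []
    missing = EXPECTED_TASK_TAGS - set(tags)
    if missing:
        problems.append("missing task tags: " + ", ".join(sorted(missing)))
    if license_id is None or license_id.lower() not in ALLOWED_LICENSES:
        problems.append(f"license not allowed: {license_id}")
    return problems


# Sample metadata copied by hand from the description above, not from the Hub.
card_tags = [
    "diffusers", "onnx", "safetensors", "transformers",
    "audio-text-to-video", "audio-image-text-to-video",
    "audio-driven-video-continuation", "avatar", "video-generation",
]

print(check_card(card_tags, "mit"))          # card as described
print(check_card(["avatar"], "apache-2.0"))  # a card that fails both checks
```

In practice the tag list and license would be read from the live repository instead of a hand-written list, so the check reflects whatever the card says at download time.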
The next thing to watch is how developers reconcile openness with deployment risk. The project page says some demo images and audio come from real videos for academic demonstration, while the Hugging Face card asks downstream users to evaluate accuracy, safety, fairness, data protection, privacy, and content safety before sensitive use. If independent tests confirm stable identity, lip motion, and inference speed, LongCat-Video-Avatar 1.5 could become a useful baseline for open avatar research. If not, its largest impact may still be forcing clearer comparisons in a market where many avatar systems remain difficult to audit.
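An independent test of identity stability and inference speed could start from something like the sketch below. It is an assumed approach, not a method described by the project page or the model card: identity is scored as the cosine similarity between a reference face embedding and each frame's embedding, and speed as the best wall-clock time over a few runs. The embeddings are toy vectors here; in a real harness they would come from a face-recognition model of the tester's choice, and lip-sync accuracy would need a separate metric.

```python
import math
import time


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-length embedding")
    return dot / (norm_a * norm_b)


def identity_drift(reference, frame_embeddings):
    """Similarity of each frame's embedding to the reference identity.

    Values that fall over the course of a long video suggest identity drift.
    """
    return [cosine_similarity(reference, emb) for emb in frame_embeddings]


def time_call(fn, *args, repeats=3):
    """Best wall-clock time in seconds over `repeats` calls to `fn`."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best


# Toy data standing in for real face embeddings (hypothetical values).
reference = [1.0, 0.0]
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
scores = identity_drift(reference, frames)
print("per-frame similarity:", scores)
print("worst frame:", min(scores))

# Stand-in for a real generation call; replace with the pipeline under test.
print("best time (s):", time_call(lambda: sum(range(10_000))))
```

Reporting the worst frame rather than the mean is a deliberate choice in this sketch, since a single identity slip in a long video is usually what viewers notice.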