ByteDance Releases Lance: 3B Unified Multimodal Model Matching 7B Benchmarks
Original: ByteDance Releases Lance: A 3B Unified Multimodal Model for Image and Video Generation
One Model, All Modalities
ByteDance Research has released Lance, a lightweight unified multimodal model with 3 billion active parameters. Unlike siloed models that specialize in a single task, Lance handles image generation, video generation, image editing, video editing, image understanding, and video QA within a single architecture. It is available under the Apache 2.0 license.
Core Capabilities
Lance supports text-to-image (768×768), text-to-video (up to 121 frames at 480p), instruction-based image editing, frame-aware video editing, and visual question answering for both images and video. The model was fine-tuned from Qwen2.5-VL-3B-Instruct using a staged multi-task training recipe on 128 A100 GPUs.
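To put the stated output limits in concrete terms, here is a minimal arithmetic sketch. The 768×768 and 121-frame figures come from the description above. The frame rates are assumptions chosen for illustration, since the source does not state the frame rate of Lance's video output, and the function names are hypothetical.

```python
# Figures stated in the article.
IMAGE_SIDE_PX = 768
MAX_VIDEO_FRAMES = 121


def image_pixels(side_px: int = IMAGE_SIDE_PX) -> int:
    """Pixel count of a square image with the given side length."""
    return side_px * side_px


def clip_seconds(frames: int, fps: float) -> float:
    """Duration in seconds of a clip with `frames` frames played at `fps`."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


if __name__ == "__main__":
    print(f"768x768 image: {image_pixels():,} pixels")
    # ASSUMPTION: these frame rates are illustrative, not taken from the release.
    for fps in (16, 24, 30):
        seconds = clip_seconds(MAX_VIDEO_FRAMES, fps)
        print(f"{MAX_VIDEO_FRAMES} frames at {fps} fps: {seconds:.2f} s")
```

At an assumed 24 fps, for example, the 121-frame ceiling works out to roughly five seconds of video.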
Benchmark Performance
Despite its compact size, Lance posts strong reported scores: DPG 84.67 (competitive with 7B models), GenEval 0.90, GEdit 7.30 (best-in-class among unified models), and VBench 85.11 (highest among the tested models for video generation). These results challenge the assumption that unified models must sacrifice quality for versatility.
Availability
Model weights and inference scripts are available on GitHub (bytedance/Lance) and Hugging Face (bytedance-research/Lance). A minimum of 40GB VRAM is required. The release drew attention on r/LocalLLaMA, where it received over 600 upvotes as an option for locally run multimodal tasks.
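As a rough way to read the 40GB requirement, the sketch below estimates how much memory the weights alone would occupy at common precisions and checks a GPU against the stated minimum. It assumes 3 billion total parameters, which the source does not confirm (it says "active" parameters). It ignores activations, auxiliary components, and framework overhead, none of which the source quantifies. The helper names are hypothetical.

```python
PARAMS = 3_000_000_000   # ASSUMPTION: total parameter count equals the stated 3B
REQUIRED_VRAM_GB = 40    # minimum stated in the release

# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "int8": 1}


def weight_gb(params: int, dtype: str) -> float:
    """Memory for the weights only, in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * BYTES_PER_PARAM[dtype] / 1e9


def meets_minimum(available_gb: float, required_gb: float = REQUIRED_VRAM_GB) -> bool:
    """True if a GPU with `available_gb` of VRAM meets the stated minimum."""
    return available_gb >= required_gb


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype}: {weight_gb(PARAMS, dtype):.1f} GB for weights")
    for vram in (24, 40, 80):
        print(f"{vram} GB GPU meets minimum: {meets_minimum(vram)}")
```

Under these assumptions the bfloat16 weights come to about 6 GB, which suggests that most of the stated 40GB budget covers memory this sketch does not model.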