Google DeepMind rolls out Gemini 3.1 Flash-Lite for high-volume, low-cost workloads
Original: 3.1 Flash-Lite outperforms 2.5 Flash with faster performance at a lower price. New ‘thinking levels’ let you dial in reasoning to adapt for different tasks, while still being able to handle complex workloads - like generating UI and dashboards or creating simulations.
What Google DeepMind said on X
On March 3, 2026, Google DeepMind said Gemini 3.1 Flash-Lite outperforms Gemini 2.5 Flash at a lower price and with faster performance. The X post also highlighted new “thinking levels,” a control that lets developers trade off reasoning effort against latency and cost depending on the task. Google framed the model as capable of both high-throughput work and more complex jobs such as generating UIs, dashboards, and simulations.
The announcement post on Google’s blog makes the commercial positioning clearer. Google says Gemini 3.1 Flash-Lite is rolling out in preview through the Gemini API in Google AI Studio and through Vertex AI. It is priced at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens, and Google says it improves on 2.5 Flash with a 2.5x faster time to first token and a 45% increase in output speed, according to Artificial Analysis benchmarks.
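At those rates, per-request cost is straightforward arithmetic. A minimal sketch, using only the preview prices quoted above; the workload sizes (500 input tokens, 20 output tokens per request) are made-up illustrative numbers, not figures from Google:

```python
# Estimate per-request cost from the preview prices quoted above:
# $0.25 per 1M input tokens, $1.50 per 1M output tokens.

INPUT_PRICE_PER_M = 0.25   # USD per 1M input tokens
OUTPUT_PRICE_PER_M = 1.50  # USD per 1M output tokens

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of a single request at the preview rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M \
         + (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M

# A hypothetical classification workload: 1M requests of ~500 input
# and ~20 output tokens each.
per_request = request_cost(500, 20)
print(f"${per_request:.6f} per request")                    # $0.000155
print(f"${per_request * 1_000_000:,.2f} per 1M requests")   # $155.00
```

At this scale the output-token price dominates cost only for generation-heavy tasks; for short-output work like classification, input tokens drive the bill.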
What the model card adds
Google DeepMind’s published model card describes 3.1 Flash-Lite as a natively multimodal reasoning model in the Gemini 3 family, based on Gemini 3 Pro, with up to a 1 million token context window and up to 64K output tokens. The card says the model is optimized for high-volume, latency-sensitive workloads such as translation and classification. It also publishes benchmark results that place the model at 86.9% on GPQA Diamond, 76.8% on MMMU-Pro, and 72.0% on LiveCodeBench, while listing output speed at 363 tokens per second.
Why this launch matters
This release is important because low-cost models increasingly define the economics of production AI. Many enterprise and consumer features do not need the largest model available; they need predictable latency, controllable reasoning, and pricing that works at very high request volume. Google is clearly targeting that layer with a model that can cover translation, moderation, classification, and lightweight agent-style tasks without forcing developers into a premium-tier cost structure.
The inclusion of thinking levels is also notable. Rather than splitting cheap, fast models and deeper-reasoning models into entirely separate products, Google is making reasoning depth adjustable within the same serving tier. For developers building real-time applications, that can simplify model routing and make cost-performance tuning more operationally practical.
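The routing simplification might look like the sketch below. Note this is purely illustrative: the announcement confirms adjustable thinking levels but this snippet does not use real Gemini API parameter names, and the level names and task categories are hypothetical:

```python
# Illustrative in-app router: choose a reasoning ("thinking") level per
# task category instead of routing between different models.
# Level names and categories here are hypothetical, not Gemini API values.

def pick_thinking_level(task: str) -> str:
    """Map a task category to a reasoning-effort level."""
    fast_path = {"translation", "classification", "moderation"}
    if task in fast_path:
        return "minimal"   # latency-sensitive, high-volume work
    if task in {"summarization", "extraction"}:
        return "low"       # light reasoning, still cheap
    return "high"          # e.g. UI generation, simulations, agent steps

print(pick_thinking_level("classification"))   # minimal
print(pick_thinking_level("ui_generation"))    # high
```

The operational upside is that one function tunes cost and latency, rather than maintaining separate prompts, quotas, and fallbacks for two different models.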
Sources: Google DeepMind X post, Google blog, Google DeepMind model card