Google DeepMind rolls out Gemini 3.1 Flash-Lite for high-volume, low-cost workloads

Original post: "3.1 Flash-Lite outperforms 2.5 Flash with faster performance at a lower price. New 'thinking levels' let you dial in reasoning to adapt for different tasks, while still being able to handle complex workloads, like generating UI and dashboards or creating simulations."

LLM · Mar 7, 2026 · By Insights AI

What Google DeepMind said on X

On March 3, 2026, Google DeepMind said Gemini 3.1 Flash-Lite outperforms Gemini 2.5 Flash at a lower price and with faster performance. The X post also highlighted new “thinking levels,” a control that lets developers trade off reasoning effort against latency and cost depending on the task. Google framed the model as capable of both high-throughput work and more complex jobs such as generating UIs, dashboards, and simulations.

The announcement post on Google’s blog makes the commercial positioning clearer. Google says Gemini 3.1 Flash-Lite is rolling out in preview through the Gemini API in Google AI Studio and through Vertex AI. It is priced at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens, and Google says it improves on 2.5 Flash with a 2.5x faster time to first token and a 45% increase in output speed according to Artificial Analysis.
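At those rates, per-request cost is simple arithmetic. A minimal sketch using the published preview prices; the helper name and the example token counts are illustrative, not from the announcement:

```python
# Published preview pricing for Gemini 3.1 Flash-Lite (USD per 1M tokens).
INPUT_PER_M = 0.25
OUTPUT_PER_M = 1.50

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one API call at the published preview rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Example: a classification call with a 2,000-token prompt and a 50-token label
# costs $0.000575, so 10 million such calls per day run about $5,750/day.
cost = request_cost(2_000, 50)
```

This is the kind of back-of-envelope math that decides whether a feature can run on every request or only on a sampled fraction.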

What the model card adds

Google DeepMind’s published model card describes 3.1 Flash-Lite as a natively multimodal reasoning model in the Gemini 3 family, based on Gemini 3 Pro, with up to a 1 million token context window and up to 64K output tokens. The card says the model is optimized for high-volume, latency-sensitive workloads such as translation and classification. It also publishes benchmark results that place the model at 86.9% on GPQA Diamond, 76.8% on MMMU-Pro, and 72.0% on LiveCodeBench, while listing output speed at 363 tokens per second.
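Those serving numbers translate directly into wall-clock estimates. A rough sketch of how long a maximum-length response would stream at the listed output speed; this assumes sustained throughput, which real serving may not deliver:

```python
MAX_OUTPUT_TOKENS = 64_000       # model card output limit (64K)
OUTPUT_TOKENS_PER_SEC = 363      # output speed cited from Artificial Analysis

def stream_seconds(tokens: int, tps: float = OUTPUT_TOKENS_PER_SEC) -> float:
    """Seconds to stream `tokens` output tokens at a sustained rate."""
    return tokens / tps

# A full 64K-token response would take roughly 176 seconds to stream;
# a typical 500-token answer streams in under 1.5 seconds.
full_response = stream_seconds(MAX_OUTPUT_TOKENS)
short_answer = stream_seconds(500)
```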

Why this launch matters

This release is important because low-cost models increasingly define the economics of production AI. Many enterprise and consumer features do not need the largest model available; they need predictable latency, controllable reasoning, and pricing that works at very high request volume. Google is clearly targeting that layer with a model that can cover translation, moderation, classification, and lightweight agent-style tasks without forcing developers into a premium-tier cost structure.

The inclusion of thinking levels is also notable. Rather than separating cheap models from more reasoned models entirely, Google is trying to make reasoning depth adjustable inside the same serving tier. For developers building real-time applications, that can simplify model routing and make cost-performance tuning more operationally practical.
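One way an adjustable reasoning depth can simplify routing is to pick the level per task class rather than per model. A minimal routing sketch; the task categories and level labels below are illustrative assumptions, not the Gemini API's own parameter names or enums:

```python
# Hypothetical mapping from task class to a thinking level. The category
# names and level labels are illustrative, not the Gemini API's own values.
THINKING_LEVELS = {
    "translation": "low",            # high-volume, latency-sensitive
    "classification": "low",
    "dashboard_generation": "high",  # complex UI/simulation-style work
    "simulation": "high",
}

def pick_thinking_level(task: str, default: str = "low") -> str:
    """Route a request to a reasoning depth within the same serving tier."""
    return THINKING_LEVELS.get(task, default)
```

The point of the sketch is the operational shape: one model endpoint, one price band, with reasoning effort chosen at request time instead of by swapping models.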

Sources: Google DeepMind X post, Google blog, Google DeepMind model card
