Google DeepMind rolls out Gemini 3.1 Flash-Lite for high-volume, low-cost workloads

Original post: "3.1 Flash-Lite outperforms 2.5 Flash with faster performance at a lower price. New 'thinking levels' let you dial in reasoning to adapt for different tasks, while still being able to handle complex workloads, like generating UI and dashboards or creating simulations."

LLM · Mar 7, 2026 · By Insights AI

What Google DeepMind said on X

On March 3, 2026, Google DeepMind said Gemini 3.1 Flash-Lite outperforms Gemini 2.5 Flash at a lower price and with faster performance. The X post also highlighted new “thinking levels,” a control that lets developers trade off reasoning effort against latency and cost depending on the task. Google framed the model as capable of both high-throughput work and more complex jobs such as generating UIs, dashboards, and simulations.

The announcement post on Google’s blog makes the commercial positioning clearer. Google says Gemini 3.1 Flash-Lite is rolling out in preview through the Gemini API in Google AI Studio and through Vertex AI. It is priced at $0.25 per 1 million input tokens and $1.50 per 1 million output tokens, and Google says it improves on 2.5 Flash with a 2.5x faster time to first token and a 45% increase in output speed according to Artificial Analysis.
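At those rates, per-request cost is simple arithmetic. A minimal sketch using the published preview prices; the helper name and the example token counts are illustrative, not from the announcement:

```python
# Published preview pricing for Gemini 3.1 Flash-Lite (USD per 1M tokens).
INPUT_PER_M = 0.25
OUTPUT_PER_M = 1.50

def request_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one API call at the published preview rates."""
    return input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M

# Example: a classification call with a 2,000-token prompt and a 50-token label
# costs $0.000575, so 10 million such calls per day run about $5,750/day.
cost = request_cost(2_000, 50)
```

This is the kind of back-of-envelope math that decides whether a feature can run on every request or only on a sampled fraction.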

What the model card adds

Google DeepMind’s published model card describes 3.1 Flash-Lite as a natively multimodal reasoning model in the Gemini 3 family, based on Gemini 3 Pro, with up to a 1 million token context window and up to 64K output tokens. The card says the model is optimized for high-volume, latency-sensitive workloads such as translation and classification. It also publishes benchmark results that place the model at 86.9% on GPQA Diamond, 76.8% on MMMU-Pro, and 72.0% on LiveCodeBench, while listing output speed at 363 tokens per second.
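Those serving numbers translate directly into wall-clock estimates. A rough sketch of how long a maximum-length response would stream at the listed output speed; this assumes sustained throughput, which real serving may not deliver:

```python
MAX_OUTPUT_TOKENS = 64_000       # model card output limit (64K)
OUTPUT_TOKENS_PER_SEC = 363      # output speed cited from Artificial Analysis

def stream_seconds(tokens: int, tps: float = OUTPUT_TOKENS_PER_SEC) -> float:
    """Seconds to stream `tokens` output tokens at a sustained rate."""
    return tokens / tps

# A full 64K-token response would take roughly 176 seconds to stream;
# a typical 500-token answer streams in under 1.5 seconds.
full_response = stream_seconds(MAX_OUTPUT_TOKENS)
short_answer = stream_seconds(500)
```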

Why this launch matters

This release is important because low-cost models increasingly define the economics of production AI. Many enterprise and consumer features do not need the largest model available; they need predictable latency, controllable reasoning, and pricing that works at very high request volume. Google is clearly targeting that layer with a model that can cover translation, moderation, classification, and lightweight agent-style tasks without forcing developers into a premium-tier cost structure.

The inclusion of thinking levels is also notable. Rather than separating cheap models from more reasoned models entirely, Google is trying to make reasoning depth adjustable inside the same serving tier. For developers building real-time applications, that can simplify model routing and make cost-performance tuning more operationally practical.
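One way an adjustable reasoning depth can simplify routing is to pick the level per task class rather than per model. A minimal routing sketch; the task categories and level labels below are illustrative assumptions, not the Gemini API's own parameter names or enums:

```python
# Hypothetical mapping from task class to a thinking level. The category
# names and level labels are illustrative, not the Gemini API's own values.
THINKING_LEVELS = {
    "translation": "low",            # high-volume, latency-sensitive
    "classification": "low",
    "dashboard_generation": "high",  # complex UI/simulation-style work
    "simulation": "high",
}

def pick_thinking_level(task: str, default: str = "low") -> str:
    """Route a request to a reasoning depth within the same serving tier."""
    return THINKING_LEVELS.get(task, default)
```

The point of the sketch is the operational shape: one model endpoint, one price band, with reasoning effort chosen at request time instead of by swapping models.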

Sources: Google DeepMind X post, Google blog, Google DeepMind model card
