On April 22, 2026, Google released Gemini 3 Flash — a model designed to bring Gemini 3 Pro’s reasoning quality into the latency, cost, and throughput envelope developers expect from “Flash.” It is the first time Google has compressed frontier-class reasoning into a model cheap enough to ship in real-time products without flinching at the bill.
The headline numbers from the launch:
- $0.50 per million input tokens, $3.00 per million output tokens
- 1M-token context window
- 78% on SWE-bench Verified for agentic coding — beating both the entire 2.5 series and Gemini 3 Pro
- 90.4% on GPQA Diamond, 81.2% on MMMU Pro, 33.7% on Humanity’s Last Exam (without tools)
- Native vision, function calling, and tool use across the API and the Gemini CLI
That last benchmark — Flash beating Pro on agentic SWE-bench — is what makes this release more than a routine cadence bump.
What “Real-Time AI” actually means here
Speed talk in AI is often hand-wavy. The concrete improvements at this generation are measurable:
- The companion Gemini 3.1 Flash-Lite ships 2.5× faster Time to First Answer Token than 2.5 Flash and a 45% higher output speed. That is the difference between a chatbot that streams and one that appears.
- Latency this low is what lets Search Live, the Live API, and conversational voice agents work without the awkward half-second pause that gives away the model.
- The Live API preview (
gemini-3.1-flash-live-preview) is currently listed at $0 in the Gemini API price page — Google is heavily subsidising the real-time use case to drive adoption.
If you are building a product where a user is waiting for the model — a coding assistant, a voice agent, a Search-grounded question-answer flow — the cost of latency is no longer “annoying”; it is the product. Flash 3 is the first time the latency-optimised tier and the quality-optimised tier are not on different sides of a tradeoff.
Function calling, tool use, and agents
Function calling is no longer a side feature. Gemini 3 Flash is a first-class agent model:
- Tool use is native — define a JSON schema for a function, the model calls it with structured arguments, you return a result, the model integrates and continues.
- Search grounding is built in. Customer-submitted requests can trigger one or more Google Search queries, with retrieved context returned as grounded citations. Google gives 5,000 free grounded prompts per month shared across Gemini 3 models, then $14 per 1,000 thereafter.
- The 1M-token context lets you load a real codebase, a long document set, or a multi-step agent trajectory directly without summarisation tricks.
Combined with the SWE-bench score, this is why the Flash variant — not Pro — is showing up in coding agents and IDE integrations first.
Search Live and the consumer surface
Search Live is the most visible deployment of Gemini 3 Flash for non-developers. It is the model behind the new conversational answers in Google Search — questions get a streamed response with sources, and follow-ups stay in context. The reason Google can ship that to billions of users without melting the GPU pool is exactly the price/latency profile of Flash 3.
Whether you find Search Live useful or worry about source attribution, it is a useful proof point: this is a model that runs at consumer scale.
How to use it today
Three straightforward paths:
- Gemini API.
gemini-3-flashis generally available. Direct fetch fromhttps://generativelanguage.googleapis.com, OpenAI-compatible SDK shim, or the officialgoogle-genaiSDK. - Vertex AI. Same model, enterprise controls, regional deployment, contracted commitments. Useful if you need data-residency guarantees or are already on Google Cloud.
- Gemini CLI.
gemini-3-flashis now the default in Gemini CLI for code-focused tasks — fastest path to “play with it on real code.”
For agent frameworks: LangChain, LlamaIndex, AutoGen, and Vercel AI SDK already have Gemini 3 adapters. Drop gemini-3-flash in as the model name, keep your existing tool definitions, and most pipelines work as-is.
Picking Flash vs Pro vs Flash-Lite
A rough decision tree based on the published benchmarks and pricing:
- Gemini 3 Pro — when you need the absolute quality ceiling on a single hard reasoning request and cost per call doesn’t matter.
- Gemini 3 Flash — the new default. Pro-class reasoning, agent-grade tool use, fast enough for real-time, cheap enough for production. The SWE-bench score means it is the right pick for coding agents.
- Gemini 3.1 Flash-Lite — when you need more speed and lower cost and your task is well-defined enough that the small quality gap doesn’t matter (classification, routing, structured extraction, large-batch tagging). $0.25/1M in, $1.50/1M out.
What changes from here
For two years the assumption has been that frontier-quality models were too slow and too expensive to put in real-time products. Gemini 3 Flash is the first model where that stops being true on the major dimensions at once: it is fast enough for streaming voice, cheap enough for high-throughput agents, smart enough to outscore Pro on SWE-bench, and capable of native multi-step tool use over a 1M-token window.
If you are still defaulting to Pro for everything because Flash “isn’t smart enough” — re-run your evals. The hierarchy moved.
