On April 2, 2026, Google DeepMind released Gemma 4, the most ambitious open-weight model family the company has shipped to date. Built on the same research foundations as Gemini 3 and licensed under Apache 2.0, Gemma 4 lands as four distinct models — two designed for phones and edge devices, two for consumer GPUs and workstations — and it changes what "local AI" actually means in practice.
The headline claim from the launch blog is that Gemma 4 "outcompetes models 20× its size." That sounds like marketing, but the Arena AI leaderboard backs it: as of April 1, 2026, the 31B Dense model sits at #3 among all open models globally, and the 26B MoE at #6. That puts a 20 GB Ollama download in the same conversation as flagship 600B+ closed APIs.
The four sizes, decoded
Google didn't just ship "small, medium, large." Each Gemma 4 variant uses a different architectural trick to hit a specific deployment target.
E2B — phone-class reasoning
2.3B effective parameters (5.1B with Per-Layer Embeddings), 35 layers, 128K context, multimodal with native audio. Designed to run on a phone. The "E" prefix stands for effective parameters: PLE feeds a secondary embedding signal into every decoder layer, so a small core model behaves like a larger one without the memory cost of actually being one.
E4B — laptop-grade multimodal
4.5B effective parameters (8B with embeddings), 42 layers, 128K context, native audio. This is the sweet spot for a developer laptop — a 9.6 GB Ollama download that handles images, video, and speech on-device.
26B MoE — consumer GPU
A Mixture-of-Experts model: 25.2B total parameters but only 3.8B fire per forward pass (8 experts plus 1 shared, out of 128). 256K context. The point of MoE here is inference economics — you pay the memory cost of 25B but the compute cost of ~4B, which is why a single consumer GPU can run it at usable speed.
31B Dense — workstation
30.7B dense parameters, 60 layers, 256K context. This is the quality ceiling — Google's recommended starting point for fine-tuning, and the variant currently sitting at #3 on the open Arena leaderboard.
Why this is "local AI" rather than just "small AI"
Three things have to be true for local AI to actually work for a developer. Gemma 4 is the first open release where all three line up at once.
The license has to permit it. Apache 2.0 means no monthly active user caps, no acceptable-use carve-outs, no special permissions. Ship it in your product, fine-tune it, sell access — Google explicitly does not care.
The context has to be long enough to hold a real codebase or document. 256K tokens on the 26B and 31B is enough for most repositories, full books, and long multi-turn agent traces. The edge models cap at 128K, which is still ahead of where GPT-4 launched.
Tool use has to be native. Gemma 4 ships with first-class function calling, structured JSON output, and native system instructions. You can build an agent that reads files, hits APIs, and returns structured responses without coercing the model with a wrapper prompt.
It also supports configurable thinking modes across all four sizes — the model can spend more compute on harder questions when you tell it to, in the same spirit as Gemini's deep-think mode.
Multimodal — including audio on the small ones
Every Gemma 4 model accepts text and image input, and all sizes also accept video. The two edge models (E2B and E4B) additionally accept native audio input, which makes them practical bases for on-device transcription and voice agents — no separate Whisper pipeline needed.
Image handling is more flexible than Gemma 3. Variable aspect ratios and resolutions are supported, and you can dial the visual token budget per request (70, 140, 280, 560, or 1120 tokens) — useful when you're trading detail for latency.
Across all sizes, multilingual support spans 140+ languages.
Run it locally in 30 seconds
Every variant is on Ollama. Pick a size and pull:
# Phone-class — 7.2 GB, multimodal with audio
ollama run gemma4:e2b
# Laptop sweet spot — 9.6 GB
ollama run gemma4:e4b
# Consumer GPU MoE — 18 GB
ollama run gemma4:26b
# Workstation dense — 20 GB
ollama run gemma4:31b
For quick prototyping, ollama run gemma4 resolves to the default tag (currently the E4B build at 9.6 GB) and is the fastest path to a working REPL.
If you want a quantized variant — useful when you're tight on RAM — Ollama publishes the standard q4_K_M and q8_0 tags too:
# 4-bit quantized E2B — runs comfortably on 8 GB RAM
ollama pull gemma4:e2b-it-q4_K_M
# 8-bit quantized 31B — for higher fidelity than the default
ollama pull gemma4:31b-it-q8_0
Picking the right size
Shipping on-device or in a mobile app: E2B (phones) or E4B (laptops). The native audio support is the real differentiator here.
Single consumer GPU, 24 GB class: 26B MoE. Same ballpark quality as the 31B at a fraction of the inference compute.
Workstation, server, or fine-tuning target: 31B Dense. Best quality, most stable target for further training.
What it actually changes
For two years, "local AI" has meant a tradeoff: small models for privacy and speed, big closed APIs for actual capability. Gemma 4 narrows that gap sharply. A 9.6 GB download running on a MacBook now does multimodal reasoning, function calling, and 128K-token context — capabilities that, eighteen months ago, required a frontier API and a credit card.
It's not the most capable model in the world. It's the most capable model you can run, ship, and own outright — and that's the line that matters.
