Qwen2.5-Coder is one of the strongest open-weight coding models that runs comfortably on a developer laptop. The 32B-Instruct variant scores 73.7 on Aider’s code-editing benchmark, putting it within striking distance of GPT-4o on real-world code edits, and you can run the whole family locally with Ollama in about three commands.
This guide covers everything you actually need: install, pick the right size for your hardware, wire it into VS Code, and verify it’s working.
Why bother with a local model
Three reasons local makes sense for coding specifically:
- Privacy. Your code never leaves the machine. Particularly useful if you’re working under an NDA, on client code, or with internal repos that aren’t supposed to hit a third-party API.
- Cost. Free at inference time. If you run an autocomplete loop all day, the API bill on a hosted model adds up fast.
- Latency. On Apple Silicon (M2 or above) or an RTX 3060+, autocomplete first-token latency can drop under 350 ms, and there is no network round trip on top of that.
Quality is the tradeoff. The 7B model isn’t GPT-4. But it is good enough for autocomplete, reasonable enough for refactors, and the 32B version genuinely competes for many edit-style tasks.
Step 1 — Install Ollama
Ollama is the runtime. One installer, no config:
# macOS
brew install ollama
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# Windows: download the installer from ollama.com/download
Start the service:
ollama serve
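Leave that running in a terminal. It binds to localhost:11434. On Linux, the install script typically registers Ollama as a systemd service that is already running; if ollama serve reports that the address is in use, skip this step.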
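To confirm the server is actually reachable before going further, a minimal Python sketch can query Ollama's /api/tags endpoint, which lists the models pulled on this machine. The helper names (model_names, list_local_models) are made up for this example, and the base URL assumes the default port mentioned above.

```python
import json
import urllib.request
from typing import List

OLLAMA_URL = "http://localhost:11434"  # assumed default bind address


def model_names(payload: dict) -> List[str]:
    """Extract model tags from a parsed /api/tags response."""
    return [m["name"] for m in payload.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL, timeout: float = 5.0) -> List[str]:
    """Return the tags of all models pulled on this machine."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return model_names(json.load(resp))


if __name__ == "__main__":
    try:
        print(list_local_models())
    except OSError as exc:  # URLError and connection errors are OSError subclasses
        print(f"Ollama is not reachable: {exc}")
```

An empty list means the server is up but nothing has been pulled yet, which is expected at this point in the guide.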
Step 2 — Pick your size
Qwen2.5-Coder ships in 6 sizes. Pick based on your RAM/VRAM:
| Tag | Disk | Min RAM | Use case |
|---|---|---|---|
| qwen2.5-coder:0.5b | ~400 MB | 2 GB | Toy / Raspberry Pi |
| qwen2.5-coder:1.5b | ~1 GB | 4 GB | Cheap autocomplete |
| qwen2.5-coder:3b | ~2 GB | 6 GB | Lightweight laptops |
| qwen2.5-coder:7b | ~4.2 GB | 8 GB | Sweet spot for most laptops |
| qwen2.5-coder:14b | ~8 GB | 16 GB | High-end laptop / desktop |
| qwen2.5-coder:32b | ~19 GB | 32 GB / 24 GB VRAM | Workstation, GPT-4o-class quality |
If you don’t know which to pick: start with 7B. It is the right tradeoff between quality and speed for almost everyone.
ollama pull qwen2.5-coder:7b
That’s a one-time ~4.2 GB download. From here on it runs entirely offline.
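The sizing table can also be read as a simple lookup: take the largest tag whose minimum RAM fits your machine. The sketch below encodes that rule in Python. The function name pick_tag is hypothetical, the thresholds are copied from the table (using the system RAM figure for the 32B row, not the VRAM one), and it deliberately ignores everything else that affects fit, such as other running applications.

```python
from typing import Optional

# (tag, minimum RAM in GB) pairs copied from the sizing table, smallest first.
SIZES = [
    ("qwen2.5-coder:0.5b", 2),
    ("qwen2.5-coder:1.5b", 4),
    ("qwen2.5-coder:3b", 6),
    ("qwen2.5-coder:7b", 8),
    ("qwen2.5-coder:14b", 16),
    ("qwen2.5-coder:32b", 32),
]


def pick_tag(ram_gb: float) -> Optional[str]:
    """Return the largest tag whose minimum RAM fits, or None if none does."""
    best = None
    for tag, min_ram in SIZES:
        if ram_gb >= min_ram:
            best = tag
    return best


if __name__ == "__main__":
    for ram in (4, 8, 16, 64):
        print(f"{ram} GB -> {pick_tag(ram)}")
```

Treat the result as an upper bound. On a machine that sits exactly at a threshold, the next size down will usually feel more responsive.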
Step 3 — Verify it works
Quick smoke test in the terminal:
ollama run qwen2.5-coder:7b "Write a Python function that returns the nth Fibonacci number, with memoization."
You should see streaming output within ~1 second and a complete function in 3–5 seconds on modern hardware.
If you want a programmatic check:
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen2.5-coder:7b",
  "prompt": "// TypeScript function: debounce<T extends (...args:any[]) => any>(fn: T, wait: number)\n",
  "stream": false
}' | jq -r '.response'
This is the same HTTP API that editor extensions such as Continue talk to.
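If you would rather run the check from a script than from curl and jq, the same request can be sketched in Python with only the standard library. The function names (build_request, generate) and the sample prompt are made up for this example. The endpoint, the request fields, and the response key mirror the curl call above.

```python
import json
import urllib.request


def build_request(prompt: str, model: str = "qwen2.5-coder:7b") -> dict:
    """Build the same JSON body the curl command sends."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt: str, base_url: str = "http://localhost:11434",
             timeout: float = 120.0) -> str:
    """POST to /api/generate and return the completed text."""
    body = json.dumps(build_request(prompt)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/api/generate",
        data=body,  # supplying data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["response"]


if __name__ == "__main__":
    try:
        print(generate("# Python function that reverses a string\n"))
    except OSError as exc:  # server not running, timeout, or HTTP error
        print(f"Request failed: {exc}")
```

The generous timeout is a guess that leaves room for the first request, which also has to load the model into memory.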
Step 4 — Wire it into VS Code
The most useful integration is Continue — an open-source AI coding extension that points at any local Ollama model.
- Install Continue from the VS Code marketplace.
- Open Continue’s settings (~/.continue/config.json).
- Replace the models block:
{
  "models": [
    {
      "title": "Qwen 2.5 Coder 7B (local)",
      "provider": "ollama",
      "model": "qwen2.5-coder:7b",
      "apiBase": "http://localhost:11434"
    }
  ],
  "tabAutocompleteModel": {
    "title": "Qwen 2.5 Coder 7B (autocomplete)",
    "provider": "ollama",
    "model": "qwen2.5-coder:7b"
  }
}
Restart VS Code. You now have:
- Inline autocomplete as you type (ghost text, accept with Tab)
- Chat panel (Cmd/Ctrl+L) — ask questions about the open file
- Edit mode (Cmd/Ctrl+I) — highlight code, describe a change in natural language
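Every model request routes through your local Ollama, so no API key is needed. Continue itself sends anonymous usage telemetry by default; set allowAnonymousTelemetry to false in config.json if you want that off too.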
Step 5 — (Optional) Use the 32B for hard problems
If you have a workstation or 24 GB+ GPU, run qwen2.5-coder:32b as a “second model” you switch to for actual refactors:
ollama pull qwen2.5-coder:32b
Then in config.json, add it as a second models entry. Continue lets you switch models per-conversation, so you can use 7B for autocomplete and 32B for the harder edit-mode requests. The 32B Instruct variant is the one that scores 73.7 on Aider — good enough that for a meaningful share of tasks you would not notice the difference vs a paid hosted model.
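Adding the second entry by hand is easy to get wrong if the file already has other models in it. As a rough sketch, the Python below appends a 32B entry to the models list without touching anything else. It assumes the config lives at ~/.continue/config.json and is plain JSON, as in Step 4. The function name add_model and the title string are hypothetical.

```python
import json
from pathlib import Path

CONFIG_PATH = Path.home() / ".continue" / "config.json"  # path named in Step 4


def add_model(config: dict, title: str, model: str) -> dict:
    """Return a copy of config with an extra Ollama model entry, skipping duplicates."""
    models = list(config.get("models", []))
    if not any(m.get("model") == model for m in models):
        models.append({
            "title": title,
            "provider": "ollama",
            "model": model,
            "apiBase": "http://localhost:11434",
        })
    return {**config, "models": models}


if __name__ == "__main__":
    if CONFIG_PATH.exists():
        config = json.loads(CONFIG_PATH.read_text(encoding="utf-8"))
        updated = add_model(config, "Qwen 2.5 Coder 32B (local)", "qwen2.5-coder:32b")
        # Print instead of writing, so nothing changes without review.
        print(json.dumps(updated, indent=2))
    else:
        print(f"No config found at {CONFIG_PATH}")
```

The script only prints the merged config. Copy the output into the file yourself once it looks right, and leave tabAutocompleteModel pointing at the 7B.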
Performance reality check
A few honest notes from running this in practice:
- Apple Silicon (M2 Pro and up): 7B at Q4 quantization gives ~30–50 tok/s, autocomplete latency well under 350ms. Genuinely usable as your daily driver.
- RTX 3060 / 4060 / 4070: Comparable speed; the 14B is also viable.
- CPU-only laptops: 7B works but feels slow (~5–10 tok/s). The 3B is the practical pick.
- Memory pressure: Ollama keeps the model loaded for a while after the last request (5 minutes by default, adjustable with the keep_alive setting). If you switch models a lot, expect a few seconds of reload each time.
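You can check these numbers on your own hardware instead of taking them on faith. A non-streaming /api/generate response includes timing fields, reported in nanoseconds, and the sketch below turns them into tokens per second and model load time. The function names are made up for this example, and the sample response is invented to show the arithmetic. It is not a measurement.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from Ollama's eval_count and eval_duration (nanoseconds)."""
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        return 0.0
    return response["eval_count"] * 1e9 / duration_ns


def load_seconds(response: dict) -> float:
    """Time spent loading the model for this request, in seconds."""
    return response.get("load_duration", 0) / 1e9


if __name__ == "__main__":
    # Invented numbers for illustration: 120 tokens generated in 3 seconds,
    # after a 2.5 second model load.
    sample = {
        "eval_count": 120,
        "eval_duration": 3_000_000_000,
        "load_duration": 2_500_000_000,
    }
    print(f"{tokens_per_second(sample):.1f} tok/s")  # prints "40.0 tok/s"
    print(f"{load_seconds(sample):.1f} s load")      # prints "2.5 s load"
```

A load time near zero means the model was already resident from an earlier request, while a multi-second value is the reload cost mentioned above.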
Languages it actually understands
Qwen2.5-Coder was trained on 92+ programming languages, including the long tail (Solidity, OCaml, Lean, Zig). For mainstream stacks (TypeScript, Python, Go, Rust, Java, Swift) the quality is consistently good. For very niche languages, validate before trusting it.
Where this fits in 2026
Hosted coding assistants are still ahead at the very top of the curve: GPT-5-class models and Gemini 3 Pro handle harder reasoning. But the gap that mattered two years ago, where local was a toy and hosted was production, has closed. Qwen2.5-Coder 32B is a real coding model. 7B is a real autocomplete engine. And both run on hardware you already own.
If you’ve been paying for a coding assistant out of habit, it’s worth one afternoon to see whether a local setup covers 90% of your workflow.
