The most common question we get from local-LLM newcomers in 2026 isn’t “which model should I use” — it’s “will this model run on my GPU?”
This guide is the answer. We’ve tested every major open LLM at every common quantization on hardware ranging from a 12 GB RTX 3060 to an 80 GB H100, and what follows is the cheat sheet we wish existed when we started.
A reminder for the impatient: VRAM is the binding constraint. If model weights + KV cache + runtime overhead don’t fit in VRAM, inference falls off a cliff. Everything below assumes you want pure GPU inference; if you’re willing to do CPU offload, expect throughput to drop by 5–10×.
Key takeaways
- 12 GB VRAM: 7–8 B models at Q5+, 13 B at Q4. Llama 3 8B, Mistral 7B, Phi-4 Mini.
- 16 GB VRAM: 13–14 B at Q5+. Awkward tier — too much for 8B, not enough for 30B.
- 24 GB VRAM: 30 B at Q5+, 70 B at Q3_K_S (tight). The sweet spot.
- 32 GB VRAM: 70 B at Q4_K_M comfortably, 30 B at Q8.
- 48 GB VRAM: 70 B at Q5_K_M, 100 B+ at Q3/Q4.
- 128 GB unified (M4 Max): 405 B at Q4, but slower per-token than Nvidia.
The quick-reference table
Every major 2026 open LLM and its VRAM needs at common quantization levels. Numbers are for the model weights only; the KV cache comes on top and scales with the context you actually use. As headroom, budget roughly 1–2 GB of KV cache per 8 K tokens of context.
| Model | FP16 | Q8_0 | Q5_K_M | Q4_K_M | Q3_K_M | IQ2_XXS |
|---|---|---|---|---|---|---|
| Phi-4 Mini (3.8 B) | 7.6 GB | 4.0 GB | 2.7 GB | 2.3 GB | 1.9 GB | 1.4 GB |
| Gemma 2 2B | 5.0 GB | 2.7 GB | 1.8 GB | 1.6 GB | 1.3 GB | 1.0 GB |
| Llama 3 8B | 16.1 GB | 8.5 GB | 5.7 GB | 4.9 GB | 4.0 GB | 2.9 GB |
| Mistral 7B v0.3 | 14.5 GB | 7.7 GB | 5.1 GB | 4.4 GB | 3.6 GB | 2.6 GB |
| Qwen 2.5 7B | 15.2 GB | 8.1 GB | 5.4 GB | 4.7 GB | 3.8 GB | 2.7 GB |
| Phi-4 (14 B) | 28.0 GB | 14.9 GB | 10.0 GB | 8.5 GB | 7.0 GB | 5.0 GB |
| Qwen 2.5 14B | 29.5 GB | 15.7 GB | 10.5 GB | 9.0 GB | 7.4 GB | 5.3 GB |
| Mistral Nemo 12B | 24.5 GB | 13.0 GB | 8.7 GB | 7.5 GB | 6.1 GB | 4.4 GB |
| Qwen 2.5 32B | 65.0 GB | 34.6 GB | 23.0 GB | 19.8 GB | 16.3 GB | 11.6 GB |
| Yi-1.5 34B | 68.5 GB | 36.4 GB | 24.3 GB | 20.7 GB | 17.1 GB | 12.2 GB |
| Llama 3 70B | 141.0 GB | 74.9 GB | 49.9 GB | 42.5 GB | 34.7 GB | 24.9 GB |
| Qwen 2.5 72B | 145.0 GB | 77.1 GB | 51.4 GB | 43.8 GB | 35.7 GB | 25.6 GB |
| Command R+ 104B | 208.0 GB | 110.5 GB | 73.8 GB | 62.7 GB | 51.6 GB | 36.8 GB |
| Mistral Large 2 (123B) | 247.0 GB | 131.4 GB | 87.5 GB | 74.5 GB | 61.0 GB | 43.6 GB |
| Mixtral 8x22B (141 B) | 282.0 GB | 150.0 GB | 100.0 GB | 85.1 GB | 69.8 GB | 49.9 GB |
| DeepSeek V2 (236 B MoE) | 475.0 GB | 252.0 GB | 168.5 GB | 143.6 GB | 117.4 GB | 84.1 GB |
| Llama 3.1 405B | 810.0 GB | 431.0 GB | 287.0 GB | 244.5 GB | 200.1 GB | 143.0 GB |
A practical note: for daily use, Q4_K_M is the recommended balance of size and quality. The quality drop versus FP16 is small (typical perplexity increase < 2%) and the memory savings are enormous (~3.3× smaller). Q5_K_M is marginally better quality at ~17% more memory. Q3 and IQ2 are emergency-only — quality degrades noticeably.
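The size ratios quoted above can be checked against the table itself. The following is a minimal sketch, using the Llama 3 8B row as the sample and assuming a parameter count of roughly 8.0 billion (an assumption for illustration, not a figure from the table). The helper names are hypothetical.

```python
# Llama 3 8B row from the quick-reference table (weights only, in GB).
LLAMA3_8B_GB = {
    "FP16": 16.1, "Q8_0": 8.5, "Q5_K_M": 5.7,
    "Q4_K_M": 4.9, "Q3_K_M": 4.0, "IQ2_XXS": 2.9,
}
PARAMS_BILLION = 8.0  # assumption: approximate parameter count


def shrink_factor(sizes, quant, baseline="FP16"):
    """How many times smaller `quant` is than the baseline."""
    return sizes[baseline] / sizes[quant]


def extra_memory_pct(sizes, quant, versus):
    """Extra memory `quant` needs relative to `versus`, in percent."""
    return (sizes[quant] / sizes[versus] - 1.0) * 100.0


def effective_bits_per_weight(size_gb, params_billion):
    """GB * 8 bits / billions of parameters (1 GB taken as 1e9 bytes)."""
    return size_gb * 8.0 / params_billion


print(f"FP16 -> Q4_K_M: {shrink_factor(LLAMA3_8B_GB, 'Q4_K_M'):.2f}x smaller")
print(f"Q5_K_M vs Q4_K_M: +{extra_memory_pct(LLAMA3_8B_GB, 'Q5_K_M', 'Q4_K_M'):.1f}% memory")
for quant, gb in LLAMA3_8B_GB.items():
    print(f"{quant:8s} ~{effective_bits_per_weight(gb, PARAMS_BILLION):.1f} bits/weight")
```

The effective bits per weight come out slightly above the nominal bit count of each quant, which is expected: K-quants store scales alongside the weights and keep some tensors at higher precision.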
KV cache memory — the part everyone forgets
The numbers above are model weights only. The KV cache — the running memory of all tokens in your conversation — also lives in VRAM and grows linearly with context length.
Rough KV cache size, per 1 K tokens of context, at FP16:
| Model class | KV per 1K tokens | KV per 32K context |
|---|---|---|
| 7–8 B models | ~32 MB | ~1.0 GB |
| 13–14 B models | ~50 MB | ~1.6 GB |
| 30–34 B models | ~80 MB | ~2.6 GB |
| 70–72 B models | ~160 MB | ~5.1 GB |
| 100–123 B models | ~220 MB | ~7.0 GB |
| 405 B | ~500 MB | ~16.0 GB |
Quantizing the KV cache (an option in llama.cpp and vLLM in 2026) cuts this by ~2–4× with a small quality cost. Most production setups now use Q8 KV cache — it’s nearly free quality-wise and saves substantial VRAM at long context.
If you plan to use 32 K or longer context, add KV cache to your VRAM math before picking a GPU.
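To make that math concrete, here is a rough sketch that turns the per-1 K figures from the table into a KV cache size, and then asks the reverse question: how much context fits in whatever VRAM is left after the weights. The function names are hypothetical, `kv_shrink=2.0` models a Q8 KV cache as a straight halving (an approximation), and the 1 GB headroom default is the same allowance this guide uses elsewhere.

```python
# Rough FP16 KV-cache cost per 1 K tokens, in MB, from the table above.
KV_MB_PER_1K = {
    "7-8B": 32, "13-14B": 50, "30-34B": 80,
    "70-72B": 160, "100-123B": 220, "405B": 500,
}


def kv_cache_gb(model_class, context_k, kv_shrink=1.0):
    """KV cache in GB for `context_k` thousand tokens.

    kv_shrink: 1.0 for an FP16 cache, 2.0 to approximate a Q8 cache.
    """
    return KV_MB_PER_1K[model_class] * context_k / kv_shrink / 1000.0


def max_context_k(vram_gb, weights_gb, model_class,
                  kv_shrink=1.0, headroom_gb=1.0):
    """Thousands of tokens of context that fit beside the weights."""
    free_gb = vram_gb - weights_gb - headroom_gb
    if free_gb <= 0:
        return 0.0
    return free_gb * 1000.0 * kv_shrink / KV_MB_PER_1K[model_class]


# 70B-class model at 32 K context, FP16 cache vs. approximate Q8 cache.
print(kv_cache_gb("70-72B", 32))
print(kv_cache_gb("70-72B", 32, kv_shrink=2.0))

# Qwen 2.5 32B at Q4_K_M (19.8 GB) on a 24 GB card.
print(max_context_k(24, 19.8, "30-34B"))
print(max_context_k(24, 19.8, "30-34B", kv_shrink=2.0))
```

Because the per-1 K figures are class averages, treat the result as a planning estimate and check the actual allocation reported by your inference engine.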
GPU compatibility matrix
Which models comfortably fit on each common GPU, at recommended quants, with 8 K context? “Comfortably” means model + KV cache + 1 GB system headroom.
| GPU | VRAM | Best fit (Q4_K_M) | Best fit (Q5_K_M) | Maximum (any quant) |
|---|---|---|---|---|
| RTX 3060 12 GB | 12 GB | 8 B | 8 B | 14 B at Q3 |
| RTX 4060 Ti 16 GB | 16 GB | 13 B | 13 B | 30 B at IQ2 |
| RTX 5080 / 5070 Ti | 16 GB | 13 B | 13 B | 30 B at IQ2 |
| RTX 3090 / 4090 | 24 GB | 30 B (Qwen 32B) | 30 B | 70 B at Q3_K_S |
| RX 7900 XTX | 24 GB | 30 B | 30 B | 70 B at Q3_K_S |
| RTX 5090 | 32 GB | 70 B | 70 B (tight) | 70 B at Q5_K_M |
| 2× RTX 3090 / 4090 | 48 GB | 70 B | 70 B | 104 B at Q3 |
| RTX A6000 / 6000 Ada | 48 GB | 70 B | 70 B | 104 B at Q3 |
| Mac Studio M4 Max 64 GB | 64 GB unified | 70 B | 70 B | 123 B at Q3 |
| H100 80 GB | 80 GB | 70 B (up to Q8_0) | 104 B | 123 B at Q4 |
| Mac Studio M4 Max 128 GB | 128 GB unified | 104 B | 123 B | 405 B at IQ2 (slow) |
| H200 / DIGITS | 141 GB / 128 GB unified | 123 B | 123 B | 405 B at Q3 (slow) |
| B200 | 192 GB | 123 B | 123 B | 405 B at Q4 (tight) |
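The “comfortably fits” rule behind this matrix (weights + KV cache + 1 GB headroom) is simple enough to apply to any model and card yourself. The sketch below is one way to do it, assuming two sample rows copied from the quick-reference table and KV cache sizes taken from the rough per-1 K figures at 8 K context. The function name and data layout are hypothetical.

```python
# Column order of the quick-reference table, largest file first.
QUANTS_LARGEST_FIRST = ["FP16", "Q8_0", "Q5_K_M", "Q4_K_M", "Q3_K_M", "IQ2_XXS"]

# Weights-only sizes in GB, copied from two rows of the table.
ROWS = {
    "Llama 3 8B": [16.1, 8.5, 5.7, 4.9, 4.0, 2.9],
    "Qwen 2.5 32B": [65.0, 34.6, 23.0, 19.8, 16.3, 11.6],
}


def largest_quant_that_fits(weights_by_quant, vram_gb, kv_gb, headroom_gb=1.0):
    """Return the highest-precision quant whose total footprint fits, or None."""
    for quant, weights_gb in zip(QUANTS_LARGEST_FIRST, weights_by_quant):
        if weights_gb + kv_gb + headroom_gb <= vram_gb:
            return quant
    return None


# KV cache at 8 K context: ~32 MB/1K for 8B-class, ~80 MB/1K for 32B-class.
print(largest_quant_that_fits(ROWS["Llama 3 8B"], vram_gb=12, kv_gb=0.256))
print(largest_quant_that_fits(ROWS["Qwen 2.5 32B"], vram_gb=24, kv_gb=0.64))
print(largest_quant_that_fits(ROWS["Qwen 2.5 32B"], vram_gb=12, kv_gb=0.64))
```

A strict check like this is deliberately conservative; a configuration that misses by a few hundred megabytes may still run with a shorter context or a quantized KV cache.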
The patterns to internalize:
1. 12 GB is the entry floor. Below it, you’re constrained to tiny models that don’t justify a dedicated GPU.
2. 24 GB is the inflection point. It’s the cheapest tier where Llama 3 70B becomes possible (at compromised quants).
3. 32 GB unlocks 70B properly. This is the entire reason to choose the RTX 5090 over the 4090.
4. 48 GB is comfortable territory. Most things you want to do fit cleanly.
5. 128 GB unified is the consumer ceiling. Above this, you’re buying server hardware.
Choosing the right quant for your hardware
The right quantization isn’t always “the biggest one that fits.” Quality matters, and sometimes a smaller model at a better quant beats a bigger model at a worse one.
Rough quality ranking (perplexity-based, lower is better):
- FP16 / BF16 — original. Quality reference baseline.
- Q8_0 — ~0.3% perplexity increase. Essentially indistinguishable.
- Q6_K — ~0.5% increase. Indistinguishable in practice.
- Q5_K_M — ~1.0% increase. Slight quality drop, still very high quality.
- Q4_K_M — ~1.5–2.5% increase. Recommended for most users.
- Q4_K_S — ~3% increase. Noticeably worse than Q4_K_M for similar size.
- Q3_K_M — ~5–8% increase. Visibly affected output.
- Q3_K_S — ~10% increase. Use only if Q4 won’t fit.
- IQ2_XXS — ~15–25% increase. Last resort.
The general rule: prefer a smaller-parameter model at Q5_K_M over a bigger model at Q3_K_S for everyday tasks. A Qwen 32B at Q5 generally beats a Llama 3 70B at IQ2_XXS on real-world benchmarks despite the latter sounding more impressive on paper.
The exception is coding and reasoning tasks, where the bigger model’s raw knowledge advantage often survives heavy quantization. For code generation specifically, even a 70B model at Q3_K_S can outperform a 30B model at Q5_K_M.
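One way to turn the quality ranking into a decision is to set a tolerance for perplexity increase and take the smallest quant that stays within it. The sketch below assumes the upper end of each range quoted in the ranking above; the function name and the idea of a single tolerance number are illustrative simplifications, not a standard method.

```python
# (quant, upper end of the perplexity increase quoted above, in percent),
# ordered from largest to smallest file size.
QUALITY_LADDER = [
    ("FP16", 0.0), ("Q8_0", 0.3), ("Q6_K", 0.5), ("Q5_K_M", 1.0),
    ("Q4_K_M", 2.5), ("Q4_K_S", 3.0), ("Q3_K_M", 8.0),
    ("Q3_K_S", 10.0), ("IQ2_XXS", 25.0),
]


def smallest_acceptable_quant(max_ppl_increase_pct):
    """Smallest quant whose quoted perplexity increase is within tolerance."""
    chosen = "FP16"  # always acceptable: it is the reference baseline
    for quant, ppl_increase in QUALITY_LADDER:
        if ppl_increase <= max_ppl_increase_pct:
            chosen = quant  # later entries are smaller, so keep overwriting
    return chosen


print(smallest_acceptable_quant(2.5))  # everyday use
print(smallest_acceptable_quant(1.0))  # stricter, e.g. serious coding
```

Perplexity is only a proxy for task quality, so a tolerance chosen this way is a starting point to validate against your own prompts.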
MoE models — the asterisk
Mixture-of-experts (MoE) models like Mixtral 8x22B and DeepSeek V2 have an asymmetry that confuses newcomers:
- VRAM needed = total parameters (because you must hold all experts)
- Compute used = active parameters per token (much less)
Mixtral 8x22B is 141 B total / 39 B active. It needs 85+ GB of VRAM at Q4_K_M, but the per-token speed is closer to running a 40 B dense model.
DeepSeek V2 is 236 B total / 21 B active. It needs 150 GB+ of VRAM, but token speed approaches a 20 B dense model. This is why DeepSeek V2 is “fast for its size”: you pay the VRAM tax but get the compute discount.
If your hardware can hold an MoE model, it’s often the best choice. If it can’t, the dense model in the same parameter class is what you want.
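The asymmetry is easy to see with a little arithmetic. The sketch below uses Mixtral 8x22B’s figures from above and assumes about 4.83 bits per weight for Q4_K_M, a value back-derived from the Q4_K_M column of the quick-reference table rather than an official constant. The function names are hypothetical.

```python
Q4_K_M_BITS_PER_WEIGHT = 4.83  # assumption: derived from the table, not official


def weights_gb(params_billion, bits_per_weight):
    """Approximate weight size in GB (1 GB taken as 1e9 bytes)."""
    return params_billion * bits_per_weight / 8.0


def active_fraction(total_params_billion, active_params_billion):
    """Share of the weights actually used to compute each token."""
    return active_params_billion / total_params_billion


total_b, active_b = 141, 39  # Mixtral 8x22B: total / active parameters

moe_vram = weights_gb(total_b, Q4_K_M_BITS_PER_WEIGHT)
dense_vram = weights_gb(active_b, Q4_K_M_BITS_PER_WEIGHT)

print(f"VRAM for all experts:        {moe_vram:.1f} GB")
print(f"VRAM for a 39 B dense model: {dense_vram:.1f} GB")
print(f"Weights touched per token:   {active_fraction(total_b, active_b):.0%}")
```

Memory is paid on the total parameter count, while per-token work tracks only the active share, which is the “VRAM tax, compute discount” trade described above.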
Quick-start setups by budget
For people who want a concrete answer, here are tested setups at six budget points in 2026:
| Budget | GPU | Best model | Tokens/sec |
|---|---|---|---|
| $300 | RTX 3060 12 GB | Llama 3 8B Q5_K_M | ~48 |
| $700 | Used RTX 3090 | Qwen 2.5 32B Q5_K_M | ~28 |
| $1,300 | Used RTX 4090 | Llama 3 70B Q3_K_S | ~13 |
| $1,400 | 2× Used RTX 3090 + NVLink | Llama 3 70B Q4_K_M | ~15 |
| $2,400 | RTX 5090 | Llama 3 70B Q5_K_M | ~18 |
| $5,000 | Mac Studio M4 Max 128 GB | Mistral Large 2 Q4 | ~6 |
The “best value tier” in 2026 remains the used RTX 3090 / 4090 — these are the only consumer GPUs where the price-per-VRAM math is favorable, and both will remain capable through at least 2028.
For the deep dive on which GPU to pick, see best GPUs for local LLMs in 2026.
FAQ
How much VRAM do I need to run Llama 3 70B locally in 2026?
Minimum 24 GB for Llama 3 70B at Q3_K_S (which is rough quality). 32 GB lets you run Q4_K_M comfortably (the recommended quant). 40+ GB is needed for Q5_K_M. With 24 GB and 8 K context, you have basically zero headroom; pushing context to 32 K requires CPU offload or a more aggressive quant.
What’s the difference between Q4_K_M and Q4_K_S?
Both are 4-bit quantizations of the same model. Q4_K_M (“medium”) keeps some critical tensors at higher precision (6-bit in llama.cpp’s scheme), making it slightly larger but noticeably better quality than Q4_K_S (“small”), which uses 4-bit for all of them. Because the VRAM difference is small, Q4_K_M is preferred. Q4_K_S only makes sense when you’re trying to squeeze a model into a tight VRAM budget.
Can I run an LLM that’s bigger than my VRAM?
Yes, using CPU offload, where some model layers run on the CPU using system RAM instead of GPU VRAM. The performance penalty is severe (5–10× slower), but it lets you run models that wouldn’t otherwise fit. Practical for occasional use, painful as a daily driver. Both llama.cpp and Ollama support this out of the box: llama.cpp through the --n-gpu-layers (-ngl) flag, Ollama through the num_gpu option.
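A starting value for the number of GPU layers can be estimated before any trial and error. The sketch below is a simplification under stated assumptions: every layer is treated as the same size (embeddings and the output head are ignored), the whole KV cache is counted against VRAM, and 1 GB is reserved as headroom. The function name is hypothetical; the layer counts (80 for Llama 3 70B, 32 for Llama 3 8B) are the models’ published architectures.

```python
import math


def gpu_layers_that_fit(weights_gb, n_layers, vram_gb,
                        kv_gb=0.0, headroom_gb=1.0):
    """Estimate how many transformer layers can be kept on the GPU."""
    per_layer_gb = weights_gb / n_layers  # assumption: equal-sized layers
    budget_gb = vram_gb - kv_gb - headroom_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


# Llama 3 70B at Q4_K_M (42.5 GB, 80 layers) on a 24 GB card,
# with ~1.28 GB of KV cache for 8 K context (160 MB per 1 K tokens).
print(gpu_layers_that_fit(42.5, 80, vram_gb=24, kv_gb=1.28))

# Llama 3 8B at Q4_K_M (4.9 GB, 32 layers) on a 12 GB card: everything fits.
print(gpu_layers_that_fit(4.9, 32, vram_gb=12))
```

Use the result as a first guess for the GPU-layers setting, then lower it by a few layers if the engine still reports an out-of-memory error.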
Does the KV cache really matter for VRAM planning?
Yes, especially at long context. For Llama 3 70B at 32 K context, the KV cache alone is ~5 GB. If you’re already at the edge of your VRAM, you’ll OOM the moment a conversation gets long. Plan for KV cache and consider Q8 KV-cache quantization (option in modern inference engines) to roughly halve it.
Is there a way to run Llama 3 405B at home?
Yes, but you need 200+ GB of memory at usable quants. The realistic 2026 paths: Mac Studio M4 Ultra 512 GB ($12K, slow per-token but works), 8× RTX 4090 ($13K, complex setup), Nvidia DIGITS ($3K, purpose-built), or CPU + 256 GB DDR5 RAM with mid-range GPU for partial offload ($8K, slow). See our how-to guide on running Llama 3 405B at home.
Are there any 2026 quantization formats I should know besides GGUF?
Yes. AWQ (Activation-aware Weight Quantization) and GPTQ are both still widely used, especially for vLLM and TensorRT-LLM deployments. In some cases they give slightly better quality than GGUF quants at the same bit count. For consumer local-LLM use with llama.cpp/Ollama/LM Studio, GGUF remains dominant in 2026 because of its simplicity and broad tooling support.
Will Q4 quantization affect coding ability?
Less than you’d think, but yes. For straightforward code completion, Q4_K_M is essentially identical to FP16. For complex multi-step reasoning across a codebase, Q4 occasionally produces worse logic than Q5+. If you do serious coding with local models, prefer Q5_K_M and choose your hardware to support it.
Bottom line
VRAM planning for local LLMs in 2026 isn’t complicated, but it does reward precision. Pick the parameter class first (the model size that has the capability you need), then pick the smallest quant that gives acceptable quality (Q4_K_M is usually right), then add KV cache for your real context length, then size your GPU accordingly.
If you only remember three numbers, remember these:
- 12 GB runs 8 B models cleanly.
- 24 GB runs 30 B at quality quants, 70 B uncomfortably.
- 32 GB runs 70 B at quality quants.
Everything past 32 GB means multi-GPU rigs, unified-memory machines, or server hardware, and everything below 12 GB enters phone/embedded territory. The bulk of 2026 local-LLM activity happens in the 12–32 GB range, which is exactly the single consumer GPU range, and that is by design, not coincidence.
