Wednesday, 27 May 2026 | Daily Update: sharp insight into AI, written for builders

VRAM Requirements for Every Major LLM in 2026 (Quantization Cheat Sheet)

The most common question we get from local-LLM newcomers in 2026 isn’t “which model should I use” — it’s “will this model run on my GPU?”

This guide is the answer. We’ve tested every major open LLM at every common quantization on hardware ranging from a 12 GB RTX 3060 to an 80 GB H100, and what follows is the cheat sheet we wish existed when we started.

A reminder for the impatient: VRAM is the binding constraint. If the model weights plus the KV cache for your context don’t fit in VRAM, inference falls off a cliff. Everything below assumes you want pure GPU inference; if you’re willing to use CPU offload, expect throughput to drop by 5–10×.

Key takeaways

  • 12 GB VRAM: 7–8 B models at Q5+, 13 B at Q4. Llama 3 8B, Mistral 7B, Phi-4 Mini.
  • 16 GB VRAM: 13–14 B at Q5+. Awkward tier — too much for 8B, not enough for 30B.
  • 24 GB VRAM: 30 B at Q5+, 70 B at Q3_K_S (tight). The sweet spot.
  • 32 GB VRAM: 70 B at Q4_K_M comfortably, 30 B at Q8.
  • 48 GB VRAM: 70 B at Q5_K_M, 100 B+ at Q3/Q4.
  • 128 GB unified (M4 Max): 100–123 B at Q4/Q5; 405 B only at the most aggressive quants, and slower per-token than Nvidia.

The quick-reference table

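Every major 2026 open LLM and its VRAM needs at common quantization levels. Numbers are for the model weights only and do not change with context length. The KV cache comes on top: per 8 K of context you actually use, budget roughly 0.25 GB for a 7–8 B model and about 1.3 GB for a 70 B model (see the KV cache section below).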

| Model | FP16 | Q8_0 | Q5_K_M | Q4_K_M | Q3_K_M | IQ2_XXS |
|---|---|---|---|---|---|---|
| Phi-4 Mini (3.8 B) | 7.6 GB | 4.0 GB | 2.7 GB | 2.3 GB | 1.9 GB | 1.4 GB |
| Gemma 2 2B | 5.0 GB | 2.7 GB | 1.8 GB | 1.6 GB | 1.3 GB | 1.0 GB |
| Llama 3 8B | 16.1 GB | 8.5 GB | 5.7 GB | 4.9 GB | 4.0 GB | 2.9 GB |
| Mistral 7B v0.3 | 14.5 GB | 7.7 GB | 5.1 GB | 4.4 GB | 3.6 GB | 2.6 GB |
| Qwen 2.5 7B | 15.2 GB | 8.1 GB | 5.4 GB | 4.7 GB | 3.8 GB | 2.7 GB |
| Phi-4 (14 B) | 28.0 GB | 14.9 GB | 10.0 GB | 8.5 GB | 7.0 GB | 5.0 GB |
| Qwen 2.5 14B | 29.5 GB | 15.7 GB | 10.5 GB | 9.0 GB | 7.4 GB | 5.3 GB |
| Mistral Nemo 12B | 24.5 GB | 13.0 GB | 8.7 GB | 7.5 GB | 6.1 GB | 4.4 GB |
| Qwen 2.5 32B | 65.0 GB | 34.6 GB | 23.0 GB | 19.8 GB | 16.3 GB | 11.6 GB |
| Yi-1.5 34B | 68.5 GB | 36.4 GB | 24.3 GB | 20.7 GB | 17.1 GB | 12.2 GB |
| Llama 3 70B | 141.0 GB | 74.9 GB | 49.9 GB | 42.5 GB | 34.7 GB | 24.9 GB |
| Qwen 2.5 72B | 145.0 GB | 77.1 GB | 51.4 GB | 43.8 GB | 35.7 GB | 25.6 GB |
| Command R+ 104B | 208.0 GB | 110.5 GB | 73.8 GB | 62.7 GB | 51.6 GB | 36.8 GB |
| Mistral Large 2 (123B) | 247.0 GB | 131.4 GB | 87.5 GB | 74.5 GB | 61.0 GB | 43.6 GB |
| Mixtral 8x22B (141 B) | 282.0 GB | 150.0 GB | 100.0 GB | 85.1 GB | 69.8 GB | 49.9 GB |
| DeepSeek V2 (236 B MoE) | 475.0 GB | 252.0 GB | 168.5 GB | 143.6 GB | 117.4 GB | 84.1 GB |
| Llama 3.1 405B | 810.0 GB | 431.0 GB | 287.0 GB | 244.5 GB | 200.1 GB | 143.0 GB |

A practical note: for daily use, Q4_K_M is the recommended balance of size and quality. The quality drop versus FP16 is small (typical perplexity increase < 2%) and the memory savings are enormous (~3.3× smaller). Q5_K_M is marginally better quality at ~17% more memory. Q3 and IQ2 are emergency-only — quality degrades noticeably.
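To a first approximation, the weight sizes in the table are just parameter count times effective bits per weight. Here is a minimal sketch of that arithmetic. The bits-per-weight values are back-derived from the Llama 3 8B row, assuming about 8.03 B parameters (a figure not stated in the table), and the function name is illustrative. Real GGUF files vary by architecture, so treat the result as an estimate rather than a file size.

```python
# Effective bits per weight, back-derived from the Llama 3 8B row of the table
# (assumes ~8.03 B parameters). Rough, illustrative values only.
EFFECTIVE_BPW = {
    "FP16": 16.0,
    "Q8_0": 8.5,
    "Q5_K_M": 5.7,
    "Q4_K_M": 4.9,
    "Q3_K_M": 4.0,
    "IQ2_XXS": 2.9,
}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Estimate weight size in GB (10^9 bytes) for a dense model."""
    total_bits = params_billion * 1e9 * EFFECTIVE_BPW[quant]
    return total_bits / 8 / 1e9


if __name__ == "__main__":
    # 70.6 B is an assumed parameter count for a "70 B" class model.
    for quant in EFFECTIVE_BPW:
        print(f"70 B class at {quant}: ~{estimate_weights_gb(70.6, quant):.1f} GB")
```

The same ratios explain the figures quoted above: 16 / 4.9 is about 3.3× (FP16 versus Q4_K_M), and 5.7 / 4.9 is about 1.16, the roughly 17% extra that Q5_K_M costs.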

KV cache memory — the part everyone forgets

The numbers above are model weights only. The KV cache — the running memory of all tokens in your conversation — also lives in VRAM and grows linearly with context length.

Rough KV cache size, per 1 K tokens of context, at FP16:

| Model class | KV per 1K tokens | KV per 32K context |
|---|---|---|
| 7–8 B models | ~32 MB | ~1.0 GB |
| 13–14 B models | ~50 MB | ~1.6 GB |
| 30–34 B models | ~80 MB | ~2.6 GB |
| 70–72 B models | ~160 MB | ~5.1 GB |
| 100–123 B models | ~220 MB | ~7.0 GB |
| 405 B | ~500 MB | ~16.0 GB |

Quantizing the KV cache (an option in llama.cpp and vLLM in 2026) cuts this by ~2–4× with a small quality cost. Most production setups now use Q8 KV cache — it’s nearly free quality-wise and saves substantial VRAM at long context.
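KV cache size depends on the attention layout (layers, KV heads, head dimension), not directly on parameter count. Here is a minimal sketch of the usual estimate, in which each layer stores one K and one V tensor per token. The Llama 3.1 405B shape used below (126 layers, 8 KV heads, head dimension 128) comes from the published model config and is an assumption as far as this article goes. Other models scale with their own layer and head counts, so plug in the config of the model you actually run.

```python
def kv_cache_bytes(
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    n_tokens: int,
    bytes_per_element: float = 2.0,  # 2.0 = FP16; ~1.0 approximates an 8-bit cache
) -> float:
    """Approximate KV cache size: K and V tensors for every layer and token."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_element


if __name__ == "__main__":
    # Assumed shape for Llama 3.1 405B (from its published config).
    shape = dict(n_layers=126, n_kv_heads=8, head_dim=128)

    per_1k_mb = kv_cache_bytes(**shape, n_tokens=1_000) / 1e6
    fp16_32k_gb = kv_cache_bytes(**shape, n_tokens=32_000) / 1e9
    q8_32k_gb = kv_cache_bytes(**shape, n_tokens=32_000, bytes_per_element=1.0) / 1e9

    print(f"FP16, per 1K tokens: ~{per_1k_mb:.0f} MB")
    print(f"FP16, 32K context:   ~{fp16_32k_gb:.1f} GB")
    print(f"8-bit, 32K context:  ~{q8_32k_gb:.1f} GB")
```

Treating an 8-bit cache as exactly one byte per element ignores the small per-block scale overhead of real quantized formats, so the halving shown here is slightly optimistic.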

If you plan to use 32 K or longer context, add KV cache to your VRAM math before picking a GPU.

GPU compatibility matrix

Which models comfortably fit on each common GPU, at recommended quants, with 8 K context? “Comfortably” means model + KV cache + 1 GB system headroom.

| GPU | VRAM | Best fit (Q4_K_M) | Best fit (Q5_K_M) | Maximum (any quant) |
|---|---|---|---|---|
| RTX 3060 12 GB | 12 GB | 8 B | 8 B | 14 B at Q3 |
| RTX 4060 Ti 16 GB | 16 GB | 13 B | 13 B | 30 B at IQ2 |
| RTX 5080 / 5070 Ti | 16 GB | 13 B | 13 B | 30 B at IQ2 |
| RTX 3090 / 4090 | 24 GB | 30 B (Qwen 32B) | 30 B | 70 B at Q3_K_S |
| RX 7900 XTX | 24 GB | 30 B | 30 B | 70 B at Q3_K_S |
| RTX 5090 | 32 GB | 70 B | 70 B (tight) | 70 B at Q5_K_M |
| 2× RTX 3090 / 4090 | 48 GB | 70 B | 70 B | 104 B at Q3 |
| RTX A6000 / 6000 Ada | 48 GB | 70 B | 70 B | 104 B at Q3 |
| Mac Studio M4 Max 64 GB | 64 GB unified | 70 B | 70 B | 123 B at Q3 |
| H100 80 GB | 80 GB | 70 B (FP16-ish) | 104 B | 123 B at Q4 |
| Mac Studio M4 Max 128 GB | 128 GB unified | 104 B | 123 B | 405 B at IQ2 (slow) |
| H200 / DIGITS | 141 GB / 128 GB unified | 123 B | 123 B | 405 B at Q3 (slow) |
| B200 | 192 GB | 123 B | 123 B | 405 B at Q4 (tight) |
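The "comfortably fits" rule used for this matrix (model + KV cache + 1 GB system headroom) is easy to check for your own combination. This is a minimal sketch; `FitReport` and `check_fit` are illustrative names, and the inputs are the weight and KV figures from the tables in this guide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FitReport:
    required_gb: float
    spare_gb: float
    fits: bool


def check_fit(
    weights_gb: float,
    kv_cache_gb: float,
    vram_gb: float,
    headroom_gb: float = 1.0,  # system headroom, per the definition above
) -> FitReport:
    """Apply the rule: weights + KV cache + headroom must not exceed VRAM."""
    required = weights_gb + kv_cache_gb + headroom_gb
    spare = vram_gb - required
    return FitReport(round(required, 2), round(spare, 2), spare >= 0)


if __name__ == "__main__":
    # Llama 3 8B at Q5_K_M (5.7 GB), ~0.26 GB KV at 8K context, on a 12 GB card.
    print(check_fit(5.7, 0.26, 12))
    # Llama 3 70B at Q4_K_M (42.5 GB), ~1.28 GB KV at 8K context, on a 24 GB card.
    print(check_fit(42.5, 1.28, 24))
```

A negative `spare_gb` is the amount you would have to recover through a smaller quant, a shorter context, or CPU offload.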

The patterns to internalize:

1. 12 GB is the entry floor. Below it, you’re constrained to tiny models that don’t justify a dedicated GPU.
2. 24 GB is the inflection point. It’s the cheapest tier where Llama 3 70B becomes possible (at compromised quants).
3. 32 GB unlocks 70B properly. This is the entire reason to choose the RTX 5090 over the 4090.
4. 48 GB is comfortable territory. Most things you want to do fit cleanly.
5. 128 GB unified is the consumer ceiling. Above this, you’re buying server hardware.

Choosing the right quant for your hardware

The right quantization isn’t always “the biggest one that fits.” Quality matters, and sometimes a smaller model at a better quant beats a bigger model at a worse one.

Rough quality ranking (perplexity-based, lower is better):

  • FP16 / BF16 — original. Quality reference baseline.
  • Q8_0 — ~0.3% perplexity increase. Essentially indistinguishable.
  • Q6_K — ~0.5% increase. Indistinguishable in practice.
  • Q5_K_M — ~1.0% increase. Slight quality drop, still very high quality.
  • Q4_K_M — ~1.5–2.5% increase. Recommended for most users.
  • Q4_K_S — ~3% increase. Noticeably worse than Q4_K_M for similar size.
  • Q3_K_M — ~5–8% increase. Visibly affected output.
  • Q3_K_S — ~10% increase. Use only if Q4 won’t fit.
  • IQ2_XXS — ~15–25% increase. Last resort.

The general rule: prefer a smaller-parameter model at Q5_K_M over a bigger model at Q3_K_S for everyday tasks. A Qwen 32B at Q5 generally beats a Llama 3 70B at IQ2_XXS on real-world benchmarks despite the latter sounding more impressive on paper.

Exception: on coding and reasoning tasks, the bigger model’s raw knowledge advantage often survives heavy quantization. For code generation specifically, even a 70B model at Q3_K_S can outperform a 30B model at Q5_K_M.
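One way to apply the ranking is to walk the quants from best to worst and stop at the first one that fits. This is a minimal sketch, using rows from the quick-reference table and the same weights + KV cache + 1 GB headroom rule as the compatibility matrix. The function name and the "emergency" set are illustrative.

```python
from typing import Optional

# Best to worst, restricted to the quants that appear in the quick-reference table.
QUALITY_ORDER = ["FP16", "Q8_0", "Q5_K_M", "Q4_K_M", "Q3_K_M", "IQ2_XXS"]
EMERGENCY_QUANTS = {"Q3_K_M", "IQ2_XXS"}  # the guide calls these emergency-only


def best_quant_that_fits(
    weights_by_quant: dict,
    vram_gb: float,
    kv_cache_gb: float,
    headroom_gb: float = 1.0,
) -> Optional[str]:
    """Return the highest-quality quant whose weights + KV + headroom fit in VRAM."""
    for quant in QUALITY_ORDER:
        size = weights_by_quant.get(quant)
        if size is not None and size + kv_cache_gb + headroom_gb <= vram_gb:
            return quant
    return None


# Weight sizes in GB, copied from the quick-reference table.
LLAMA3_8B = {"FP16": 16.1, "Q8_0": 8.5, "Q5_K_M": 5.7,
             "Q4_K_M": 4.9, "Q3_K_M": 4.0, "IQ2_XXS": 2.9}
QWEN25_14B = {"FP16": 29.5, "Q8_0": 15.7, "Q5_K_M": 10.5,
              "Q4_K_M": 9.0, "Q3_K_M": 7.4, "IQ2_XXS": 5.3}

if __name__ == "__main__":
    for name, table, vram, kv in [("Llama 3 8B", LLAMA3_8B, 12, 0.26),
                                  ("Qwen 2.5 14B", QWEN25_14B, 16, 0.4),
                                  ("Qwen 2.5 14B", QWEN25_14B, 8, 0.4)]:
        quant = best_quant_that_fits(table, vram, kv)
        note = " (consider a smaller model instead)" if quant in EMERGENCY_QUANTS else ""
        print(f"{name} on {vram} GB: {quant}{note}")
```

When the answer lands in the emergency set, the rule above suggests dropping a parameter class rather than accepting the quant, unless the workload is one of the coding or reasoning exceptions.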

MoE models — the asterisk

Mixture-of-experts (MoE) models like Mixtral 8x22B and DeepSeek V2 have an asymmetry that confuses newcomers:

  • VRAM needed = total parameters (because you must hold all experts)
  • Compute used = active parameters per token (much less)

Mixtral 8x22B is 141 B total / 39 B active. It needs 80+ GB of VRAM to run, but the per-token speed is closer to running a 40 B dense model.

DeepSeek V2 is 236 B total / 21 B active. It needs 150 GB+ of VRAM, but token speed approaches a 20 B dense model. This is why DeepSeek V2 is “fast for its size”: you pay the VRAM tax but get the compute discount.

If your hardware can hold an MoE model, it’s often the best choice. If it can’t, the dense model in the same parameter class is what you want.
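The asymmetry is easy to see in numbers. This is a minimal sketch in which memory is sized from total parameters and per-token compute from active parameters. The 4.83 bits per weight is back-derived from the Mixtral 8x22B Q4_K_M entry in the quick-reference table (85.1 GB / 141 B), and `MoEModel` is an illustrative name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MoEModel:
    name: str
    total_params_b: float   # every expert must be resident in memory
    active_params_b: float  # parameters actually used per token

    def weights_gb(self, bits_per_weight: float) -> float:
        """Memory is driven by TOTAL parameters."""
        return self.total_params_b * bits_per_weight / 8

    def active_fraction(self) -> float:
        """Share of the weights touched per token, a rough proxy for compute."""
        return self.active_params_b / self.total_params_b


if __name__ == "__main__":
    mixtral = MoEModel("Mixtral 8x22B", total_params_b=141, active_params_b=39)
    print(f"{mixtral.name}: ~{mixtral.weights_gb(4.83):.1f} GB at Q4_K_M, "
          f"{mixtral.active_fraction():.0%} of weights active per token")
```

Per-token speed also depends on memory bandwidth and expert routing overhead, so the active fraction is only a rough guide to throughput.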

Quick-start setups by budget

For people who want a concrete answer, here are tested setups at six budget points in 2026:

| Budget | GPU | Best model | Tokens/sec |
|---|---|---|---|
| $300 | RTX 3060 12 GB | Llama 3 8B Q5_K_M | ~48 |
| $700 | Used RTX 3090 | Qwen 2.5 32B Q5_K_M | ~28 |
| $1,300 | Used RTX 4090 | Llama 3 70B Q3_K_S | ~13 |
| $1,400 | 2× Used RTX 3090 + NVLink | Llama 3 70B Q4_K_M | ~15 |
| $2,400 | RTX 5090 | Llama 3 70B Q5_K_M | ~18 |
| $5,000 | Mac Studio M4 Max 128 GB | Mistral Large 2 Q4 | ~6 |

The “best value tier” in 2026 remains the used RTX 3090 / 4090 — these are the only consumer GPUs where the price-per-VRAM math is favorable, and both will remain capable through at least 2028.

For the deep dive on which GPU to pick, see best GPUs for local LLMs in 2026.

Frequently asked questions

How much VRAM do I need to run Llama 3 70B locally in 2026?

Minimum 24 GB for Llama 3 70B at Q3_K_S (which is rough quality). 32 GB lets you run Q4_K_M comfortably (the recommended quant). 40+ GB is needed for Q5_K_M. With 24 GB and 8 K context, you have basically zero headroom; pushing context to 32 K requires CPU offload or a more aggressive quant.

What’s the difference between Q4_K_M and Q4_K_S?

Both are 4-bit quantizations of the same model. Q4_K_M (“medium”) keeps some of the most sensitive tensors (parts of the attention and feed-forward weights) at higher, 6-bit precision, making it slightly larger but noticeably better quality than Q4_K_S (“small”). For nearly identical VRAM, Q4_K_M is preferred. Q4_K_S only makes sense when you’re trying to squeeze a model into a tight VRAM budget.

Can I run an LLM that’s bigger than my VRAM?

Yes, using CPU offload, where some model layers run on the CPU using system RAM instead of GPU VRAM. The performance penalty is severe (5–10× slower), but it lets you run models that wouldn’t otherwise fit. Practical for occasional use, painful as a daily driver. Both llama.cpp and Ollama support this out of the box: llama.cpp through its --n-gpu-layers (-ngl) flag, Ollama through the num_gpu option.
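A rough way to choose the GPU layer count is to divide the VRAM left after KV cache and headroom by the average size of one layer. This is a minimal sketch that assumes all layers are the same size and that the whole KV cache stays on the GPU, both of which are simplifications. The 80-layer figure for Llama 3 70B comes from its published config and is an assumption as far as this article goes. The result is the kind of value you would pass to llama.cpp's `--n-gpu-layers` flag, then adjust by trial.

```python
import math


def gpu_layers_that_fit(
    weights_gb: float,
    n_layers: int,
    vram_gb: float,
    kv_cache_gb: float,
    headroom_gb: float = 1.0,
) -> int:
    """Estimate how many transformer layers can stay on the GPU."""
    per_layer_gb = weights_gb / n_layers          # assumes equally sized layers
    budget_gb = vram_gb - kv_cache_gb - headroom_gb
    if budget_gb <= 0:
        return 0
    return min(n_layers, math.floor(budget_gb / per_layer_gb))


if __name__ == "__main__":
    # Llama 3 70B at Q4_K_M (42.5 GB, assumed 80 layers), ~1.28 GB KV at 8K context.
    for vram in (24, 48):
        layers = gpu_layers_that_fit(42.5, 80, vram, kv_cache_gb=1.28)
        print(f"{vram} GB card: keep about {layers} of 80 layers on the GPU")
```

The remaining layers run from system RAM, which is where the 5–10× slowdown comes from.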

Does the KV cache really matter for VRAM planning?

Yes, especially at long context. For Llama 3 70B at 32 K context, the KV cache alone is ~5 GB. If you’re already at the edge of your VRAM, you’ll OOM the moment a conversation gets long. Plan for KV cache and consider Q8 KV-cache quantization (option in modern inference engines) to roughly halve it.

Is there a way to run Llama 3 405B at home?

Yes, but you need 200+ GB of memory at usable quants. The realistic 2026 paths: Mac Studio M3 Ultra 512 GB ($12K, slow per-token but works), 8× RTX 4090 ($13K, complex setup), two linked Nvidia DIGITS units (about $3K each, purpose-built, 128 GB unified apiece), or CPU + 256 GB DDR5 RAM with mid-range GPU for partial offload ($8K, slow). See our how-to guide on running Llama 3 405B at home.

Are there any 2026 quantization formats I should know besides GGUF?

Yes — AWQ (Activation-aware Weight Quantization) and GPTQ are both still widely used, especially for vLLM and TensorRT-LLM deployments. They’re slightly better quality at the same bit count than GGUF in some cases. For consumer local-LLM use with llama.cpp/Ollama/LM Studio, GGUF remains dominant in 2026 because of its simplicity and broad tooling support.

Will Q4 quantization affect coding ability?

Less than you’d think, but yes. For straightforward code completion, Q4_K_M is essentially identical to FP16. For complex multi-step reasoning across a codebase, Q4 occasionally produces worse logic than Q5+. If you do serious coding with local models, prefer Q5_K_M and choose your hardware to support it.

Bottom line

VRAM planning for local LLMs in 2026 isn’t complicated, but it does reward precision. Pick the parameter class first (the model size that has the capability you need), then pick the smallest quant that gives acceptable quality (Q4_K_M is usually right), then add KV cache for your real context length, then size your GPU accordingly.
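The four steps can be sketched as one small calculation. This is a minimal sketch; the tier list mirrors the VRAM sizes discussed in this guide, the weight and KV figures come from its tables, and `smallest_tier` is an illustrative name.

```python
from typing import Optional, Tuple

# VRAM sizes (GB) discussed in this guide, smallest first.
VRAM_TIERS_GB = [12, 16, 24, 32, 48, 80, 128]


def smallest_tier(
    weights_gb: float,
    kv_mb_per_1k_tokens: float,
    context_tokens: int,
    headroom_gb: float = 1.0,
) -> Tuple[Optional[int], float]:
    """Return (smallest tier that fits or None, total GB required)."""
    kv_gb = kv_mb_per_1k_tokens * (context_tokens / 1000) / 1000
    required = weights_gb + kv_gb + headroom_gb
    for tier in VRAM_TIERS_GB:
        if required <= tier:
            return tier, required
    return None, required


if __name__ == "__main__":
    # Qwen 2.5 32B at Q4_K_M (19.8 GB), ~80 MB KV per 1K tokens, 32K context.
    tier, required = smallest_tier(19.8, 80, 32_000)
    print(f"Need ~{required:.1f} GB, so the smallest tier is {tier} GB")
```

Changing only the context length in this call shows how quickly a long conversation can push a model into the next tier.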

If you only remember three numbers, remember these:

  • 12 GB runs 8 B models cleanly.
  • 24 GB runs 30 B at quality quants, 70 B uncomfortably.
  • 32 GB runs 70 B at quality quants.

Everything past 32 GB means multi-GPU rigs, unified-memory machines, or workstation and server cards, and everything below 12 GB enters phone/embedded territory. The bulk of 2026 local-LLM activity happens in the 12–32 GB range, which is exactly the single-card consumer GPU range, by design rather than coincidence.
