Wondering how much GPU memory you need to run a large language model locally? This free calculator estimates the VRAM required for any LLM — based on its size, your quantization, and your context length — then shows you exactly which GPUs and Macs can run it.
Which hardware can run it?
Estimate only. Real usage varies with framework (llama.cpp, vLLM, ExLlama), KV-cache precision, and flash-attention. KV cache is computed at FP16; MoE models are sized by total parameters for weights. Use this as a planning guide, not a guarantee.
How to use the VRAM calculator
- Pick your model — choose a popular model (Llama 3, Qwen, Mixtral, DeepSeek, Gemma…) or select “Custom” and enter any parameter count.
- Choose a quantization — full precision (FP16) is the most accurate but largest; Q4_K_M is the most popular balance of size and quality. Lower quants shrink VRAM at some quality cost.
- Set your context length — longer context means a bigger KV cache and more VRAM. Most chat use cases are fine at 8K–32K.
- Read the result — the total VRAM estimate breaks down into model weights, KV cache, and overhead, and every GPU is marked ✓ Runs, ⚠ Tight, or ✗ Too small.
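
As a rough illustration of how a ✓ / ⚠ / ✗ verdict could be derived, the sketch below compares a VRAM estimate against a card's capacity. The function name `classify_fit` and the 10% "tight" margin are hypothetical choices for this example; the calculator's actual thresholds are not stated on this page.

```python
def classify_fit(required_gb: float, vram_gb: float, tight_margin: float = 0.10) -> str:
    """Label a GPU by how comfortably an estimate fits (hypothetical thresholds)."""
    if required_gb > vram_gb:
        return "✗ Too small"
    # Assumed rule: "tight" when less than tight_margin of the card is left free.
    if required_gb > vram_gb * (1 - tight_margin):
        return "⚠ Tight"
    return "✓ Runs"


# Memory capacities in GB for three cards mentioned on this page.
gpus = {"RTX 4090": 24, "RTX A6000": 48, "A100 80GB": 80}

estimate_gb = 47.6  # example estimate for a 70B model at Q4_K_M with 8K context
for name, vram in gpus.items():
    print(f"{name:10s} {vram:3d} GB  {classify_fit(estimate_gb, vram)}")
```

With these assumed thresholds, the 24 GB card is too small, the 48 GB card is tight, and the 80 GB card runs comfortably.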
How LLM VRAM is calculated
The memory an LLM needs to run (inference) comes from three parts:
- Model weights = parameters × bytes-per-parameter. At FP16 that’s 2 bytes per parameter, so a 70B model needs ~140 GB. Quantizing to 4-bit (Q4) cuts that to roughly 40 GB.
- KV cache = the attention key/value memory that grows with context length and batch size. For long contexts it can rival the weights themselves.
- Overhead = activations, CUDA/Metal buffers, and framework reserves — typically 5–15% on top.
The quick rule of thumb: VRAM in GB ≈ (parameters in billions × bytes-per-parameter) + KV cache in GB + ~10% overhead. The calculator above does the full math for you, including per-model layer and attention-head counts.
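
The rule of thumb can be sketched in a few lines of Python. This is a simplified illustration, not the calculator's own code: the function name `estimate_vram_gb` is made up for the example, the KV cache is passed in as a ready-made figure, and the overhead is assumed to apply to weights and KV cache combined.

```python
def estimate_vram_gb(params_billions: float, bytes_per_param: float,
                     kv_cache_gb: float, overhead_fraction: float = 0.10) -> dict:
    """Rule-of-thumb VRAM estimate in GB, broken into its three parts."""
    weights_gb = params_billions * bytes_per_param  # 1B params at 1 byte each ~ 1 GB
    # Assumption: overhead is a fraction of weights plus KV cache.
    overhead_gb = (weights_gb + kv_cache_gb) * overhead_fraction
    return {
        "weights_gb": weights_gb,
        "kv_cache_gb": kv_cache_gb,
        "overhead_gb": overhead_gb,
        "total_gb": weights_gb + kv_cache_gb + overhead_gb,
    }


# Example: a 70B model at Q4_K_M (~0.58 bytes/param) with ~2.7 GB of KV cache.
result = estimate_vram_gb(70, 0.58, 2.7)
for part, gb in result.items():
    print(f"{part:12s} {gb:6.1f} GB")
```

For these inputs the total comes to about 47.6 GB, which sits inside the 43–48 GB range quoted for Llama 3 70B in the FAQ.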
Quantization quick reference
| Quantization | Bytes / param | 70B model weights | Quality |
|---|---|---|---|
| FP16 / BF16 | 2.0 | ~141 GB | Reference (best) |
| Q8 / FP8 | 1.0 | ~70 GB | Near-lossless |
| Q4_K_M | ~0.58 | ~41 GB | Best balance (recommended) |
| Q3_K_M | ~0.46 | ~33 GB | Noticeable loss |
| Q2_K | ~0.35 | ~25 GB | Last resort |
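
One way to use the table is to find the highest-quality quantization whose weights fit a given memory budget. The sketch below does this with the bytes-per-parameter figures from the table; the helper name `best_quant_for_budget` is hypothetical, and it counts weights only, so KV cache and overhead still need to be added on top.

```python
# (name, approximate bytes per parameter), ordered from highest to lowest quality.
QUANTS = [
    ("FP16 / BF16", 2.0),
    ("Q8 / FP8", 1.0),
    ("Q4_K_M", 0.58),
    ("Q3_K_M", 0.46),
    ("Q2_K", 0.35),
]


def best_quant_for_budget(params_billions: float, budget_gb: float):
    """Return (name, weights_gb) for the first quantization that fits, or None."""
    for name, bytes_per_param in QUANTS:
        weights_gb = params_billions * bytes_per_param
        if weights_gb <= budget_gb:
            return name, weights_gb
    return None  # even the smallest listed quantization is too large


print(best_quant_for_budget(70, 48))  # 70B model, 48 GB budget
print(best_quant_for_budget(70, 24))  # 70B model, 24 GB budget
print(best_quant_for_budget(32, 24))  # 32B model, 24 GB budget
```

Under these figures, a 70B model lands on Q4_K_M with a 48 GB budget and has no fitting option at 24 GB, while a 32B model fits at Q4_K_M in 24 GB.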
Frequently asked questions
How much VRAM do I need to run Llama 3 70B?
At Q4_K_M with an 8K context, roughly 43–48 GB, so you need a single 48 GB card (RTX A6000), two 24 GB GPUs (2× RTX 4090/3090), or a 64 GB+ Mac. At full FP16 you’d need ~150 GB, for example a pair of A100 80GB cards or two H200s (a single 141 GB H200 falls just short). Use the calculator above for your exact settings.
Can I run a 70B model on a 24 GB GPU like the RTX 4090?
Not at useful quality on a single 24 GB card: even at Q3, a 70B model needs around 35 GB once the KV cache and overhead are added to its ~33 GB of weights. You can run it across two 24 GB GPUs, or step down to a 32B-class model (Qwen 2.5 32B), which fits comfortably at Q4 on a single 4090.
Does quantization hurt quality?
A little. Q8 is effectively lossless; Q4_K_M loses very little for most tasks and is the community default; below Q3 the degradation becomes noticeable. For coding and reasoning, stay at Q4 or higher when you can.
Why does context length increase VRAM so much?
The KV cache stores attention state for every token in the context, for every layer. Doubling the context roughly doubles the KV cache. At very long contexts (128K), the cache alone can exceed the model weights — which is why long-context inference needs so much memory.
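
The linear growth can be seen in the standard KV-cache size formula: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below applies it at FP16. The shape used (80 layers, 8 KV heads, head dimension 128) is assumed here to match a Llama 3 70B-class model with grouped-query attention, and the function name `kv_cache_bytes` is chosen for the example.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2,
                   batch_size: int = 1) -> int:
    """KV cache size in bytes; bytes_per_value=2 corresponds to FP16."""
    # The leading 2 counts one key tensor plus one value tensor per layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * context_tokens * bytes_per_value * batch_size)


# Assumed shape for a 70B-class model with grouped-query attention.
shape = dict(n_layers=80, n_kv_heads=8, head_dim=128)

for context in (8_192, 32_768, 131_072):
    gb = kv_cache_bytes(context_tokens=context, **shape) / 1e9
    print(f"{context:>7,} tokens -> {gb:5.1f} GB")
# 8K tokens comes to about 2.7 GB; 128K tokens comes to about 42.9 GB
```

At 128K tokens this shape needs about 43 GB of cache, which is more than the ~41 GB of Q4_K_M weights for a 70B model.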
How accurate is this calculator?
It’s a planning estimate, typically within ~10–15% of real-world usage. Actual memory depends on your framework (llama.cpp, vLLM, ExLlama), whether you quantize the KV cache, whether flash-attention is enabled, and how much memory the OS reserves. Always leave headroom.
Building or buying a rig for local AI? See our guides on the best GPUs for local LLMs and VRAM requirements for every major LLM.
