{"id":264,"date":"2026-05-19T16:46:25","date_gmt":"2026-05-19T16:46:25","guid":{"rendered":"https:\/\/convly.ai\/vram-requirements-every-major-llm-2026\/"},"modified":"2026-05-19T16:46:25","modified_gmt":"2026-05-19T16:46:25","slug":"vram-requirements-every-major-llm-2026","status":"publish","type":"post","link":"https:\/\/convly.ai\/fr\/vram-requirements-every-major-llm-2026\/","title":{"rendered":"VRAM Requirements for Every Major LLM in 2026 (Quantization Cheat Sheet)"},"content":{"rendered":"<p>The most common question we get from local-LLM newcomers in 2026 isn&#8217;t &#8220;which model should I use&#8221; \u2014 it&#8217;s &#8220;will this model run on my GPU?&#8221;<\/p>\n<p>This guide is the answer. We&#8217;ve tested every major open LLM at every common quantization on hardware ranging from a 12 GB RTX 3060 to an 80 GB H100, and what follows is the cheat sheet we wish existed when we started.<\/p>\n<p>A reminder for the impatient: <strong>VRAM is the binding constraint<\/strong>. If your model weights plus the KV cache for your context don&#8217;t fit in VRAM, inference falls off a cliff. Everything below assumes you want pure GPU inference; if you&#8217;re willing to do CPU offload, expect throughput to drop by 5\u201310\u00d7.<\/p>\n<div class=\"convly-tldr\">\n<h3>Key takeaways<\/h3>\n<ul>\n<li><strong>12 GB VRAM:<\/strong> 7\u20138 B models at Q5+, 13 B at Q4. Llama 3 8B, Mistral 7B, Phi-4 Mini.<\/li>\n<li><strong>16 GB VRAM:<\/strong> 13\u201314 B at Q5+. Awkward tier \u2014 too much for 8B, not enough for 30B.<\/li>\n<li><strong>24 GB VRAM:<\/strong> 30 B at Q5+, 70 B at Q3_K_S (tight). 
The sweet spot.<\/li>\n<li><strong>32 GB VRAM:<\/strong> 70 B at Q4_K_M comfortably, 30 B at Q8.<\/li>\n<li><strong>48 GB VRAM:<\/strong> 70 B at Q5_K_M, 100 B+ at Q3\/Q4.<\/li>\n<li><strong>128 GB unified (M4 Max):<\/strong> 104\u2013123 B at Q4_K_M, but slower per-token than Nvidia.<\/li>\n<\/ul>\n<\/div>\n<h2>The quick-reference table<\/h2>\n<p>Every major 2026 open LLM and its VRAM needs at common quantization levels. Numbers are for the <strong>model weights only<\/strong>; weight size does not change with context length. Add 1\u20132 GB of KV cache headroom per 8 K of context you actually use.<\/p>\n<table class=\"convly-vs\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>FP16<\/th>\n<th>Q8_0<\/th>\n<th>Q5_K_M<\/th>\n<th>Q4_K_M<\/th>\n<th>Q3_K_M<\/th>\n<th>IQ2_XXS<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Phi-4 Mini (3.8 B)<\/strong><\/td>\n<td>7.6 GB<\/td>\n<td>4.0 GB<\/td>\n<td>2.7 GB<\/td>\n<td>2.3 GB<\/td>\n<td>1.9 GB<\/td>\n<td>1.4 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemma 2 2B<\/strong><\/td>\n<td>5.0 GB<\/td>\n<td>2.7 GB<\/td>\n<td>1.8 GB<\/td>\n<td>1.6 GB<\/td>\n<td>1.3 GB<\/td>\n<td>1.0 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Llama 3 8B<\/strong><\/td>\n<td>16.1 GB<\/td>\n<td>8.5 GB<\/td>\n<td>5.7 GB<\/td>\n<td>4.9 GB<\/td>\n<td>4.0 GB<\/td>\n<td>2.9 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Mistral 7B v0.3<\/strong><\/td>\n<td>14.5 GB<\/td>\n<td>7.7 GB<\/td>\n<td>5.1 GB<\/td>\n<td>4.4 GB<\/td>\n<td>3.6 GB<\/td>\n<td>2.6 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen 2.5 7B<\/strong><\/td>\n<td>15.2 GB<\/td>\n<td>8.1 GB<\/td>\n<td>5.4 GB<\/td>\n<td>4.7 GB<\/td>\n<td>3.8 GB<\/td>\n<td>2.7 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Phi-4 (14 B)<\/strong><\/td>\n<td>28.0 GB<\/td>\n<td>14.9 GB<\/td>\n<td>10.0 GB<\/td>\n<td>8.5 GB<\/td>\n<td>7.0 GB<\/td>\n<td>5.0 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen 2.5 14B<\/strong><\/td>\n<td>29.5 GB<\/td>\n<td>15.7 GB<\/td>\n<td>10.5 GB<\/td>\n<td>9.0 GB<\/td>\n<td>7.4 GB<\/td>\n<td>5.3 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Mistral Nemo 12B<\/strong><\/td>\n<td>24.5 
GB<\/td>\n<td>13.0 GB<\/td>\n<td>8.7 GB<\/td>\n<td>7.5 GB<\/td>\n<td>6.1 GB<\/td>\n<td>4.4 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen 2.5 32B<\/strong><\/td>\n<td>65.0 GB<\/td>\n<td>34.6 GB<\/td>\n<td>23.0 GB<\/td>\n<td>19.8 GB<\/td>\n<td>16.3 GB<\/td>\n<td>11.6 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Yi-1.5 34B<\/strong><\/td>\n<td>68.5 GB<\/td>\n<td>36.4 GB<\/td>\n<td>24.3 GB<\/td>\n<td>20.7 GB<\/td>\n<td>17.1 GB<\/td>\n<td>12.2 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Llama 3 70B<\/strong><\/td>\n<td>141.0 GB<\/td>\n<td>74.9 GB<\/td>\n<td>49.9 GB<\/td>\n<td>42.5 GB<\/td>\n<td>34.7 GB<\/td>\n<td>24.9 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen 2.5 72B<\/strong><\/td>\n<td>145.0 GB<\/td>\n<td>77.1 GB<\/td>\n<td>51.4 GB<\/td>\n<td>43.8 GB<\/td>\n<td>35.7 GB<\/td>\n<td>25.6 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Command R+ 104B<\/strong><\/td>\n<td>208.0 GB<\/td>\n<td>110.5 GB<\/td>\n<td>73.8 GB<\/td>\n<td>62.7 GB<\/td>\n<td>51.6 GB<\/td>\n<td>36.8 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Mistral Large 2 (123B)<\/strong><\/td>\n<td>247.0 GB<\/td>\n<td>131.4 GB<\/td>\n<td>87.5 GB<\/td>\n<td>74.5 GB<\/td>\n<td>61.0 GB<\/td>\n<td>43.6 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Mixtral 8x22B (141 B)<\/strong><\/td>\n<td>282.0 GB<\/td>\n<td>150.0 GB<\/td>\n<td>100.0 GB<\/td>\n<td>85.1 GB<\/td>\n<td>69.8 GB<\/td>\n<td>49.9 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>DeepSeek V2 (236 B MoE)<\/strong><\/td>\n<td>475.0 GB<\/td>\n<td>252.0 GB<\/td>\n<td>168.5 GB<\/td>\n<td>143.6 GB<\/td>\n<td>117.4 GB<\/td>\n<td>84.1 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Llama 3.1 405B<\/strong><\/td>\n<td>810.0 GB<\/td>\n<td>431.0 GB<\/td>\n<td>287.0 GB<\/td>\n<td>244.5 GB<\/td>\n<td>200.1 GB<\/td>\n<td>143.0 GB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A practical note: for daily use, <strong>Q4_K_M is the recommended balance<\/strong> of size and quality. The quality drop versus FP16 is small (typical perplexity increase under 2%) and the memory savings are enormous (~3.3\u00d7 smaller). 
Q5_K_M is marginally better quality at ~17% more memory. Q3 and IQ2 are emergency-only \u2014 quality degrades noticeably.<\/p>\n<h2>KV cache memory \u2014 the part everyone forgets<\/h2>\n<p>The numbers above are model weights only. The <strong>KV cache<\/strong> \u2014 the running memory of all tokens in your conversation \u2014 also lives in VRAM and grows linearly with context length.<\/p>\n<p>Rough KV cache size, per 1 K tokens of context, at FP16:<\/p>\n<table class=\"convly-vs\">\n<thead>\n<tr>\n<th>Model class<\/th>\n<th>KV per 1K tokens<\/th>\n<th>KV per 32K context<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>7\u20138 B models<\/td>\n<td>~32 MB<\/td>\n<td>~1.0 GB<\/td>\n<\/tr>\n<tr>\n<td>13\u201314 B models<\/td>\n<td>~50 MB<\/td>\n<td>~1.6 GB<\/td>\n<\/tr>\n<tr>\n<td>30\u201334 B models<\/td>\n<td>~80 MB<\/td>\n<td>~2.6 GB<\/td>\n<\/tr>\n<tr>\n<td>70\u201372 B models<\/td>\n<td>~160 MB<\/td>\n<td>~5.1 GB<\/td>\n<\/tr>\n<tr>\n<td>100\u2013123 B models<\/td>\n<td>~220 MB<\/td>\n<td>~7.0 GB<\/td>\n<\/tr>\n<tr>\n<td>405 B<\/td>\n<td>~500 MB<\/td>\n<td>~16.0 GB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Quantizing the KV cache (an option in llama.cpp and vLLM in 2026) cuts this by ~2\u20134\u00d7 with a small quality cost. Most production setups now use Q8 KV cache \u2014 it&#8217;s nearly free quality-wise and saves substantial VRAM at long context.<\/p>\n<p>If you plan to use 32 K or longer context, <strong>add KV cache to your VRAM math before picking a GPU<\/strong>.<\/p>\n<h2>GPU compatibility matrix<\/h2>\n<p>Which models comfortably fit on each common GPU, at recommended quants, with 8 K context? 
&#8220;Comfortably&#8221; means model + KV cache + 1 GB system headroom.<\/p>\n<table class=\"convly-vs\">\n<thead>\n<tr>\n<th>GPU<\/th>\n<th>VRAM<\/th>\n<th>Best fit (Q4_K_M)<\/th>\n<th>Best fit (Q5_K_M)<\/th>\n<th>Maximum (any quant)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RTX 3060 12 GB<\/td>\n<td>12 GB<\/td>\n<td>8 B<\/td>\n<td>8 B<\/td>\n<td>14 B at Q3<\/td>\n<\/tr>\n<tr>\n<td>RTX 4060 Ti 16 GB<\/td>\n<td>16 GB<\/td>\n<td>13 B<\/td>\n<td>13 B<\/td>\n<td>30 B at IQ2<\/td>\n<\/tr>\n<tr>\n<td>RTX 5080 \/ 5070 Ti<\/td>\n<td>16 GB<\/td>\n<td>13 B<\/td>\n<td>13 B<\/td>\n<td>30 B at IQ2<\/td>\n<\/tr>\n<tr>\n<td>RTX 3090 \/ 4090<\/td>\n<td>24 GB<\/td>\n<td>30 B (Qwen 32B)<\/td>\n<td>30 B<\/td>\n<td>70 B at Q3_K_S<\/td>\n<\/tr>\n<tr>\n<td>RX 7900 XTX<\/td>\n<td>24 GB<\/td>\n<td>30 B<\/td>\n<td>30 B<\/td>\n<td>70 B at Q3_K_S<\/td>\n<\/tr>\n<tr>\n<td>RTX 5090<\/td>\n<td>32 GB<\/td>\n<td>70 B<\/td>\n<td>70 B (tight)<\/td>\n<td>70 B at Q5_K_M<\/td>\n<\/tr>\n<tr>\n<td>2\u00d7 RTX 3090 \/ 4090<\/td>\n<td>48 GB<\/td>\n<td>70 B<\/td>\n<td>70 B<\/td>\n<td>104 B at Q3<\/td>\n<\/tr>\n<tr>\n<td>RTX A6000 \/ 6000 Ada<\/td>\n<td>48 GB<\/td>\n<td>70 B<\/td>\n<td>70 B<\/td>\n<td>104 B at Q3<\/td>\n<\/tr>\n<tr>\n<td>Mac Studio M4 Max 64 GB<\/td>\n<td>64 GB unified<\/td>\n<td>70 B<\/td>\n<td>70 B<\/td>\n<td>123 B at Q3<\/td>\n<\/tr>\n<tr>\n<td>H100 80 GB<\/td>\n<td>80 GB<\/td>\n<td>70 B (FP16-ish)<\/td>\n<td>104 B<\/td>\n<td>123 B at Q4<\/td>\n<\/tr>\n<tr>\n<td>Mac Studio M4 Max 128 GB<\/td>\n<td>128 GB unified<\/td>\n<td>104 B<\/td>\n<td>123 B<\/td>\n<td>405 B at IQ2 (slow)<\/td>\n<\/tr>\n<tr>\n<td>H200 \/ DIGITS<\/td>\n<td>141 GB \/ 128 GB unified<\/td>\n<td>123 B<\/td>\n<td>123 B<\/td>\n<td>405 B at Q3 (slow)<\/td>\n<\/tr>\n<tr>\n<td>B200<\/td>\n<td>192 GB<\/td>\n<td>123 B<\/td>\n<td>123 B<\/td>\n<td>405 B at Q4 (tight)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The patterns to internalize:<\/p>\n<p>1. 
<strong>12 GB is the entry floor.<\/strong> Below it, you&#8217;re constrained to tiny models that don&#8217;t justify a dedicated GPU.<br \/>\n2. <strong>24 GB is the inflection point.<\/strong> It&#8217;s the cheapest tier where Llama 3 70B becomes possible (at compromised quants).<br \/>\n3. <strong>32 GB unlocks 70B properly.<\/strong> This is the entire reason to choose the RTX 5090 over the 4090.<br \/>\n4. <strong>48 GB is comfortable territory.<\/strong> Most things you want to do fit cleanly.<br \/>\n5. <strong>128 GB unified is the consumer ceiling.<\/strong> Above this, you&#8217;re buying server hardware.<\/p>\n<h2>Choosing the right quant for your hardware<\/h2>\n<p>The right quantization isn&#8217;t always &#8220;the biggest one that fits.&#8221; Quality matters, and sometimes a smaller model at a better quant beats a bigger model at a worse one.<\/p>\n<p>Rough quality ranking (perplexity-based, lower is better):<\/p>\n<ul>\n<li><strong>FP16 \/ BF16<\/strong> \u2014 original. Quality reference baseline.<\/li>\n<li><strong>Q8_0<\/strong> \u2014 ~0.3% perplexity increase. Essentially indistinguishable.<\/li>\n<li><strong>Q6_K<\/strong> \u2014 ~0.5% increase. Indistinguishable in practice.<\/li>\n<li><strong>Q5_K_M<\/strong> \u2014 ~1.0% increase. Slight quality drop, still very high quality.<\/li>\n<li><strong>Q4_K_M<\/strong> \u2014 ~1.5\u20132.5% increase. Recommended for most users.<\/li>\n<li><strong>Q4_K_S<\/strong> \u2014 ~3% increase. Noticeably worse than Q4_K_M for similar size.<\/li>\n<li><strong>Q3_K_M<\/strong> \u2014 ~5\u20138% increase. Visibly affected output.<\/li>\n<li><strong>Q3_K_S<\/strong> \u2014 ~10% increase. Use only if Q4 won&#8217;t fit.<\/li>\n<li><strong>IQ2_XXS<\/strong> \u2014 ~15\u201325% increase. Last resort.<\/li>\n<\/ul>\n<p>The general rule: <strong>prefer a smaller-parameter model at Q5_K_M over a bigger model at Q3_K_S<\/strong> for everyday tasks. 
A Qwen 32B at Q5 generally beats a Llama 3 70B at IQ2_XXS on real-world benchmarks despite the latter sounding more impressive on paper.<\/p>\n<p>Exception: <strong>coding and reasoning tasks<\/strong>, where the bigger model&#8217;s raw knowledge advantage often survives heavy quantization. For code generation specifically, even Q3_K_S of a 70B model can outperform a Q5_K_M 30B.<\/p>\n<h2>MoE models \u2014 the asterisk<\/h2>\n<p>Mixture-of-experts (MoE) models like <strong>Mixtral 8x22B<\/strong> and <strong>DeepSeek V2<\/strong> have an asymmetry that confuses newcomers:<\/p>\n<ul>\n<li><strong>VRAM needed<\/strong> = total parameters (because you must hold all experts)<\/li>\n<li><strong>Compute used<\/strong> = active parameters per token (much less)<\/li>\n<\/ul>\n<p>Mixtral 8x22B is 141 B total \/ 39 B active. It needs 80+ GB of VRAM to run, but the per-token speed is closer to running a 40 B dense model.<\/p>\n<p>DeepSeek V2 is 236 B total \/ 21 B active. It needs 150 GB+ of VRAM, but token speed approaches a 20 B dense model. This is why DeepSeek V2 is &#8220;fast for its size&#8221; \u2014 you pay the VRAM tax but get the compute discount.<\/p>\n<p>If your hardware can hold an MoE model, it&#8217;s often the best choice. 
If it can&#8217;t, the dense model in the same parameter class is what you want.<\/p>\n<h2>Quick-start setups by budget<\/h2>\n<p>For people who want a concrete answer, here are tested setups at six budget points in 2026:<\/p>\n<table class=\"convly-vs\">\n<thead>\n<tr>\n<th>Budget<\/th>\n<th>GPU<\/th>\n<th>Best model<\/th>\n<th>Tokens\/sec<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>$300<\/td>\n<td>RTX 3060 12 GB<\/td>\n<td>Llama 3 8B Q5_K_M<\/td>\n<td>~48<\/td>\n<\/tr>\n<tr>\n<td>$700<\/td>\n<td>Used RTX 3090<\/td>\n<td>Qwen 2.5 32B Q5_K_M<\/td>\n<td>~28<\/td>\n<\/tr>\n<tr>\n<td>$1,300<\/td>\n<td>Used RTX 4090<\/td>\n<td>Llama 3 70B Q3_K_S<\/td>\n<td>~13<\/td>\n<\/tr>\n<tr>\n<td>$1,400<\/td>\n<td>2\u00d7 Used RTX 3090 + NVLink<\/td>\n<td>Llama 3 70B Q4_K_M<\/td>\n<td>~15<\/td>\n<\/tr>\n<tr>\n<td>$2,400<\/td>\n<td>RTX 5090<\/td>\n<td>Llama 3 70B Q5_K_M<\/td>\n<td>~18<\/td>\n<\/tr>\n<tr>\n<td>$5,000<\/td>\n<td>Mac Studio M4 Max 128 GB<\/td>\n<td>Mistral Large 2 Q4<\/td>\n<td>~6<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The &#8220;best value tier&#8221; in 2026 remains the used RTX 3090 \/ 4090 \u2014 these are the only consumer GPUs where the price-per-VRAM math is favorable, and both will remain capable through at least 2028.<\/p>\n<p>For the deep dive on which GPU to pick, see <a href=\"\/fr\/best-gpus-for-local-llms-2026\/\">best GPUs for local LLMs in 2026<\/a>.<\/p>\n<h2>FAQ<\/h2>\n<h3>How much VRAM do I need to run Llama 3 70B locally in 2026?<\/h3>\n<p>Minimum 24 GB for Llama 3 70B at Q3_K_S (which is rough quality). 32 GB lets you run Q4_K_M comfortably (the recommended quant). 40+ GB is needed for Q5_K_M. With 24 GB and 8 K context, you have basically zero headroom; pushing context to 32 K requires CPU offload or a more aggressive quant.<\/p>\n<h3>What&#8217;s the difference between Q4_K_M and Q4_K_S?<\/h3>\n<p>Both are 4-bit quantizations of the same model. 
Q4_K_M (&#8220;medium&#8221;) uses 5 bits for some critical weight groups, making it slightly larger but noticeably better quality than Q4_K_S (&#8220;small&#8221;). For nearly identical VRAM, Q4_K_M is preferred. Q4_K_S only makes sense when you&#8217;re trying to squeeze a model into a tight VRAM budget.<\/p>\n<h3>Can I run an LLM that&#8217;s bigger than my VRAM?<\/h3>\n<p>Yes \u2014 using <strong>CPU offload<\/strong>, where some model layers run on the CPU using system RAM instead of GPU VRAM. The performance penalty is severe (5\u201310\u00d7 slower), but it lets you run models that wouldn&#8217;t otherwise fit. Practical for occasional use, painful as a daily driver. Both llama.cpp and Ollama support this out of the box: llama.cpp via the <code>--n-gpu-layers<\/code> (<code>-ngl<\/code>) flag, and Ollama via the <code>num_gpu<\/code> parameter.<\/p>\n<h3>Does the KV cache really matter for VRAM planning?<\/h3>\n<p>Yes, especially at long context. For Llama 3 70B at 32 K context, the KV cache alone is ~5 GB. If you&#8217;re already at the edge of your VRAM, you&#8217;ll OOM the moment a conversation gets long. Plan for KV cache and consider Q8 KV-cache quantization (an option in modern inference engines) to roughly halve it.<\/p>\n<h3>Is there a way to run Llama 3.1 405B at home?<\/h3>\n<p>Yes, but you need 200+ GB of memory at usable quants. The realistic 2026 paths: Mac Studio M4 Ultra 512 GB ($12K, slow per-token but works), 8\u00d7 RTX 4090 ($13K, complex setup), Nvidia DIGITS ($3K, purpose-built), or CPU + 256 GB DDR5 RAM with mid-range GPU for partial offload ($8K, slow). See our <a href=\"\/fr\/running-llama-3-405b-at-home-real-cost\/\">how-to guide on running Llama 3.1 405B at home<\/a>.<\/p>\n<h3>Are there any 2026 quantization formats I should know besides GGUF?<\/h3>\n<p>Yes \u2014 <strong>AWQ<\/strong> (Activation-aware Weight Quantization) and <strong>GPTQ<\/strong> are both still widely used, especially for vLLM and TensorRT-LLM deployments. In some cases they give slightly better quality than GGUF at the same bit count. 
For consumer local-LLM use with llama.cpp\/Ollama\/LM Studio, GGUF remains dominant in 2026 because of its simplicity and broad tooling support.<\/p>\n<h3>Will Q4 quantization affect coding ability?<\/h3>\n<p>Less than you&#8217;d think, but yes. For straightforward code completion, Q4_K_M is essentially identical to FP16. For complex multi-step reasoning across a codebase, Q4 occasionally produces worse logic than Q5+. If you do serious coding with local models, prefer Q5_K_M and choose your hardware to support it.<\/p>\n<h2>Bottom line<\/h2>\n<p>VRAM planning for local LLMs in 2026 isn&#8217;t complicated, but it does reward precision. Pick the parameter class first (the model size that has the capability you need), then pick the smallest quant that gives acceptable quality (Q4_K_M is usually right), then add KV cache for your real context length, then size your GPU accordingly.<\/p>\n<p>If you only remember three numbers, remember these:<\/p>\n<ul>\n<li><strong>12 GB<\/strong> runs 8 B models cleanly.<\/li>\n<li><strong>24 GB<\/strong> runs 30 B at quality quants, 70 B uncomfortably.<\/li>\n<li><strong>32 GB<\/strong> runs 70 B at quality quants.<\/li>\n<\/ul>\n<p>Everything past 32 GB enters server territory, and everything below 12 GB enters phone\/embedded territory. 
The bulk of 2026 local-LLM activity happens in the 12\u201332 GB range, which is exactly the consumer GPU range \u2014 by design, not coincidence.<\/p>","protected":false},"excerpt":{"rendered":"<p>The complete VRAM cheat sheet for every major open LLM in 2026 \u2014 at every common quantization level \u2014 plus a matrix showing which models fit on 12, 16, 24, 32, 48, and 80 GB GPUs.<\/p>","protected":false},"author":1,"featured_media":271,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_uag_custom_page_level_css":"","site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"_themeisle_gutenberg_block_has_review":false,"footnotes":""},"categories":[247],"tags":[289,290,288,287,285,286],"class_list":["post-264","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-benchmarks","tag-ggml","tag-gguf","tag-gpu-vram-for-ai","tag-llama-3-vram","tag-llm-vram","tag-quantization"],"uagb_featured_image_src":{"full":["https:\/\/convly.ai\/wp-content\/uploads\/2026\/05\/vram-requirements-every-major-llm-2026.jpg",1200,630,false],"thumbnail":["https:\/\/convly.ai\/wp-content\/uploads\/2026\/05\/vram-requirements-every-major-llm-2026-150x150.jpg",150,150,true],"medium":["https:\/\/convly.ai\/wp-content\/uploads\/2026\/05\/vram-requirements-every-major-llm-2026-300x158.jpg",300,158,true],"medium_large":["https:\/\/convly.ai\/wp-content\/uploads\/2026\/05\/vram-requirements-every-major-llm-2026-768x403.jpg",768,403,true],"large":["https:\/\/convly.ai\/wp-content\/uploads\/2026\/05\/vram-requirements-every-major-llm-2026-1024x538.jpg",1024,538,true],"1536x1536":["https:\/\/convly.ai\/wp-content\/uploads\/2026\/05\/vram-requirements-every-major-llm-2026.jpg",1200,630,false],"2048x2048":["https:\/\/convly.ai\/wp-content\/uploads\/2026\/05\/vram-requirements-every-major-llm-2026.jpg",1200,630,false],"trp-custom-language-flag":["https:\/\/convly.ai\/wp-content\/uploads\/2026\/05\/vram-requirements-every-major-llm-2026-18x9.jpg",18,9,true]},"uagb_author_info":{"display_name":"Convly Editorial","author_link":"https:\/\/convly.ai\/fr\/author\/mustafa\/"},"uagb_comment_info":0,"uagb_excerpt":"The complete VRAM cheat sheet for every major open LLM in 2026 \u2014 at every common quantization level \u2014 plus a matrix showing which models fit on 12, 16, 24, 32, 48, and 80 GB 
GPUs.","_links":{"self":[{"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/posts\/264","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/comments?post=264"}],"version-history":[{"count":0,"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/posts\/264\/revisions"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/media\/271"}],"wp:attachment":[{"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/media?parent=264"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/categories?post=264"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/convly.ai\/fr\/wp-json\/wp\/v2\/tags?post=264"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}