{"id":264,"date":"2026-05-19T16:46:25","date_gmt":"2026-05-19T16:46:25","guid":{"rendered":"https:\/\/convly.ai\/vram-requirements-every-major-llm-2026\/"},"modified":"2026-06-10T05:05:25","modified_gmt":"2026-06-10T05:05:25","slug":"vram-requirements-every-major-llm-2026","status":"publish","type":"post","link":"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/","title":{"rendered":"VRAM Requirements for Every Major LLM in 2026 (Quantization Cheat Sheet)"},"content":{"rendered":"<p>The most common question we get from local-LLM newcomers in 2026 isn&#8217;t &#8220;which model should I use&#8221; \u2014 it&#8217;s &#8220;will this model run on my GPU?&#8221;<\/p>\n<p>This guide is the answer. We&#8217;ve tested every major open LLM at every common quantization on hardware ranging from a 12 GB RTX 3060 to an 80 GB H100, and what follows is the cheat sheet we wish existed when we started.<\/p>\n<p>A reminder for the impatient: <strong>VRAM is the binding constraint<\/strong>. If your model weights plus the KV cache for your context don&#8217;t fit in VRAM, inference falls off a cliff. Everything below assumes you want pure GPU inference; if you&#8217;re willing to do CPU offload, expect throughput to drop by 5\u201310\u00d7.<\/p>\n<div class=\"convly-tldr\">\n<h3>Key takeaways<\/h3>\n<ul>\n<li><strong>12 GB VRAM:<\/strong> 7\u20138 B models at Q5+, 13 B at Q4. Llama 3 8B, Mistral 7B, Phi-4 Mini.<\/li>\n<li><strong>16 GB VRAM:<\/strong> 13\u201314 B at Q5+. Awkward tier \u2014 too much for 8B, not enough for 30B.<\/li>\n<li><strong>24 GB VRAM:<\/strong> 30 B at Q5+, 70 B at Q3_K_S (tight). 
The sweet spot.<\/li>\n<li><strong>32 GB VRAM:<\/strong> 70 B at Q4_K_M comfortably, 30 B at Q8.<\/li>\n<li><strong>48 GB VRAM:<\/strong> 70 B at Q5_K_M, 100 B+ at Q3\/Q4.<\/li>\n<li><strong>128 GB unified (M4 Max):<\/strong> 405 B at Q4, but slower per-token than Nvidia.<\/li>\n<\/ul>\n<\/div>\n<div id=\"ez-toc-container\" class=\"ez-toc-v2_0_84 counter-flat ez-toc-counter ez-toc-container-direction\">\n<nav><ul class='ez-toc-list ez-toc-list-level-1' ><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-1\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#The_quick-reference_table\" >The quick-reference table<\/a><\/li><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-2\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#KV_cache_memory_%E2%80%94_the_part_everyone_forgets\" >KV cache memory \u2014 
the part everyone forgets<\/a><\/li><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-3\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#GPU_compatibility_matrix\" >GPU compatibility matrix<\/a><\/li><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-4\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#Choosing_the_right_quant_for_your_hardware\" >Choosing the right quant for your hardware<\/a><\/li><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-5\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#MoE_models_%E2%80%94_the_asterisk\" >MoE models \u2014 the asterisk<\/a><\/li><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-6\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#Quick-start_setups_by_budget\" >Quick-start setups by budget<\/a><\/li><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-7\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#FAQ\" >FAQ<\/a><\/li><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-8\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#Bottom_line\" >Bottom line<\/a><\/li><li class='ez-toc-page-1'><a class=\"ez-toc-link ez-toc-heading-9\" href=\"https:\/\/convly.ai\/pt\/vram-requirements-every-major-llm-2026\/#Related_articles\" >Related articles<\/a><\/li><\/ul><\/nav><\/div>\n<h2><span class=\"ez-toc-section\" id=\"The_quick-reference_table\"><\/span>The quick-reference table<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Every major 2026 open LLM and its VRAM needs at common quantization levels. Numbers are for the <strong>model weights only<\/strong>, which do not change with context length. 
Add 1\u20132 GB for KV cache headroom per 8 K of context you actually use.<\/p>\n<table class=\"convly-vs\">\n<thead>\n<tr>\n<th>Model<\/th>\n<th>FP16<\/th>\n<th>Q8_0<\/th>\n<th>Q5_K_M<\/th>\n<th>Q4_K_M<\/th>\n<th>Q3_K_M<\/th>\n<th>IQ2_XXS<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td><strong>Phi-4 Mini (3.8 B)<\/strong><\/td>\n<td>7.6 GB<\/td>\n<td>4.0 GB<\/td>\n<td>2.7 GB<\/td>\n<td>2.3 GB<\/td>\n<td>1.9 GB<\/td>\n<td>1.4 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Gemma 2 2B<\/strong><\/td>\n<td>5.0 GB<\/td>\n<td>2.7 GB<\/td>\n<td>1.8 GB<\/td>\n<td>1.6 GB<\/td>\n<td>1.3 GB<\/td>\n<td>1.0 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Llama 3 8B<\/strong><\/td>\n<td>16.1 GB<\/td>\n<td>8.5 GB<\/td>\n<td>5.7 GB<\/td>\n<td>4.9 GB<\/td>\n<td>4.0 GB<\/td>\n<td>2.9 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Mistral 7B v0.3<\/strong><\/td>\n<td>14.5 GB<\/td>\n<td>7.7 GB<\/td>\n<td>5.1 GB<\/td>\n<td>4.4 GB<\/td>\n<td>3.6 GB<\/td>\n<td>2.6 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen 2.5 7B<\/strong><\/td>\n<td>15.2 GB<\/td>\n<td>8.1 GB<\/td>\n<td>5.4 GB<\/td>\n<td>4.7 GB<\/td>\n<td>3.8 GB<\/td>\n<td>2.7 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Phi-4 (14 B)<\/strong><\/td>\n<td>28.0 GB<\/td>\n<td>14.9 GB<\/td>\n<td>10.0 GB<\/td>\n<td>8.5 GB<\/td>\n<td>7.0 GB<\/td>\n<td>5.0 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen 2.5 14B<\/strong><\/td>\n<td>29.5 GB<\/td>\n<td>15.7 GB<\/td>\n<td>10.5 GB<\/td>\n<td>9.0 GB<\/td>\n<td>7.4 GB<\/td>\n<td>5.3 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Mistral Nemo 12B<\/strong><\/td>\n<td>24.5 GB<\/td>\n<td>13.0 GB<\/td>\n<td>8.7 GB<\/td>\n<td>7.5 GB<\/td>\n<td>6.1 GB<\/td>\n<td>4.4 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen 2.5 32B<\/strong><\/td>\n<td>65.0 GB<\/td>\n<td>34.6 GB<\/td>\n<td>23.0 GB<\/td>\n<td>19.8 GB<\/td>\n<td>16.3 GB<\/td>\n<td>11.6 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Yi-1.5 34B<\/strong><\/td>\n<td>68.5 GB<\/td>\n<td>36.4 GB<\/td>\n<td>24.3 GB<\/td>\n<td>20.7 GB<\/td>\n<td>17.1 GB<\/td>\n<td>12.2 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Llama 3 
70B<\/strong><\/td>\n<td>141.0 GB<\/td>\n<td>74.9 GB<\/td>\n<td>49.9 GB<\/td>\n<td>42.5 GB<\/td>\n<td>34.7 GB<\/td>\n<td>24.9 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Qwen 2.5 72B<\/strong><\/td>\n<td>145.0 GB<\/td>\n<td>77.1 GB<\/td>\n<td>51.4 GB<\/td>\n<td>43.8 GB<\/td>\n<td>35.7 GB<\/td>\n<td>25.6 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Command R+ 104B<\/strong><\/td>\n<td>208.0 GB<\/td>\n<td>110.5 GB<\/td>\n<td>73.8 GB<\/td>\n<td>62.7 GB<\/td>\n<td>51.6 GB<\/td>\n<td>36.8 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Mistral Large 2 (123B)<\/strong><\/td>\n<td>247.0 GB<\/td>\n<td>131.4 GB<\/td>\n<td>87.5 GB<\/td>\n<td>74.5 GB<\/td>\n<td>61.0 GB<\/td>\n<td>43.6 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Mixtral 8x22B (141 B)<\/strong><\/td>\n<td>282.0 GB<\/td>\n<td>150.0 GB<\/td>\n<td>100.0 GB<\/td>\n<td>85.1 GB<\/td>\n<td>69.8 GB<\/td>\n<td>49.9 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>DeepSeek V2 (236 B MoE)<\/strong><\/td>\n<td>475.0 GB<\/td>\n<td>252.0 GB<\/td>\n<td>168.5 GB<\/td>\n<td>143.6 GB<\/td>\n<td>117.4 GB<\/td>\n<td>84.1 GB<\/td>\n<\/tr>\n<tr>\n<td><strong>Llama 3.1 405B<\/strong><\/td>\n<td>810.0 GB<\/td>\n<td>431.0 GB<\/td>\n<td>287.0 GB<\/td>\n<td>244.5 GB<\/td>\n<td>200.1 GB<\/td>\n<td>143.0 GB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>A practical note: for daily use, <strong>Q4_K_M is the recommended balance<\/strong> of size and quality. The quality drop versus FP16 is small (typical perplexity increase of around 2%) and the memory savings are enormous (~3.3\u00d7 smaller). Q5_K_M is marginally better quality at ~17% more memory. Q3 and IQ2 are emergency-only \u2014 quality degrades noticeably.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"KV_cache_memory_%E2%80%94_the_part_everyone_forgets\"><\/span>KV cache memory \u2014 the part everyone forgets<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The numbers above are model weights only. 
The <strong>KV cache<\/strong> \u2014 the running memory of all tokens in your conversation \u2014 also lives in VRAM and grows linearly with context length.<\/p>\n<p>Rough KV cache size, per 1 K tokens of context, at FP16:<\/p>\n<table class=\"convly-vs\">\n<thead>\n<tr>\n<th>Model class<\/th>\n<th>KV per 1K tokens<\/th>\n<th>KV per 32K context<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>7\u20138 B models<\/td>\n<td>~32 MB<\/td>\n<td>~1.0 GB<\/td>\n<\/tr>\n<tr>\n<td>13\u201314 B models<\/td>\n<td>~50 MB<\/td>\n<td>~1.6 GB<\/td>\n<\/tr>\n<tr>\n<td>30\u201334 B models<\/td>\n<td>~80 MB<\/td>\n<td>~2.6 GB<\/td>\n<\/tr>\n<tr>\n<td>70\u201372 B models<\/td>\n<td>~160 MB<\/td>\n<td>~5.1 GB<\/td>\n<\/tr>\n<tr>\n<td>100\u2013123 B models<\/td>\n<td>~220 MB<\/td>\n<td>~7.0 GB<\/td>\n<\/tr>\n<tr>\n<td>405 B<\/td>\n<td>~500 MB<\/td>\n<td>~16.0 GB<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>Quantizing the KV cache (an option in llama.cpp and vLLM in 2026) cuts this by ~2\u20134\u00d7 with a small quality cost. Most production setups now use Q8 KV cache \u2014 it&#8217;s nearly free quality-wise and saves substantial VRAM at long context.<\/p>\n<p>If you plan to use 32 K or longer context, <strong>add KV cache to your VRAM math before picking a GPU<\/strong>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"GPU_compatibility_matrix\"><\/span>GPU compatibility matrix<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Which models comfortably fit on each common GPU, at recommended quants, with 8 K context? 
&#8220;Comfortably&#8221; means model + KV cache + 1 GB system headroom.<\/p>\n<table class=\"convly-vs\">\n<thead>\n<tr>\n<th>GPU<\/th>\n<th>VRAM<\/th>\n<th>Best fit (Q4_K_M)<\/th>\n<th>Best fit (Q5_K_M)<\/th>\n<th>Maximum (any quant)<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>RTX 3060 12 GB<\/td>\n<td>12 GB<\/td>\n<td>8 B<\/td>\n<td>8 B<\/td>\n<td>14 B at Q3<\/td>\n<\/tr>\n<tr>\n<td>RTX 4060 Ti 16 GB<\/td>\n<td>16 GB<\/td>\n<td>13 B<\/td>\n<td>13 B<\/td>\n<td>30 B at IQ2<\/td>\n<\/tr>\n<tr>\n<td>RTX 5080 \/ 5070 Ti<\/td>\n<td>16 GB<\/td>\n<td>13 B<\/td>\n<td>13 B<\/td>\n<td>30 B at IQ2<\/td>\n<\/tr>\n<tr>\n<td>RTX 3090 \/ 4090<\/td>\n<td>24 GB<\/td>\n<td>30 B (Qwen 32B)<\/td>\n<td>30 B<\/td>\n<td>70 B at Q3_K_S<\/td>\n<\/tr>\n<tr>\n<td>RX 7900 XTX<\/td>\n<td>24 GB<\/td>\n<td>30 B<\/td>\n<td>30 B<\/td>\n<td>70 B at Q3_K_S<\/td>\n<\/tr>\n<tr>\n<td>RTX 5090<\/td>\n<td>32 GB<\/td>\n<td>70 B<\/td>\n<td>70 B (tight)<\/td>\n<td>70 B at Q5_K_M<\/td>\n<\/tr>\n<tr>\n<td>2\u00d7 RTX 3090 \/ 4090<\/td>\n<td>48 GB<\/td>\n<td>70 B<\/td>\n<td>70 B<\/td>\n<td>104 B at Q3<\/td>\n<\/tr>\n<tr>\n<td>RTX A6000 \/ 6000 Ada<\/td>\n<td>48 GB<\/td>\n<td>70 B<\/td>\n<td>70 B<\/td>\n<td>104 B at Q3<\/td>\n<\/tr>\n<tr>\n<td>Mac Studio M4 Max 64 GB<\/td>\n<td>64 GB unified<\/td>\n<td>70 B<\/td>\n<td>70 B<\/td>\n<td>123 B at Q3<\/td>\n<\/tr>\n<tr>\n<td>H100 80 GB<\/td>\n<td>80 GB<\/td>\n<td>70 B (FP16-ish)<\/td>\n<td>104 B<\/td>\n<td>123 B at Q4<\/td>\n<\/tr>\n<tr>\n<td>Mac Studio M4 Max 128 GB<\/td>\n<td>128 GB unified<\/td>\n<td>104 B<\/td>\n<td>123 B<\/td>\n<td>405 B at IQ2 (slow)<\/td>\n<\/tr>\n<tr>\n<td>H200 \/ DIGITS<\/td>\n<td>141 GB \/ 128 GB unified<\/td>\n<td>123 B<\/td>\n<td>123 B<\/td>\n<td>405 B at Q3 (slow)<\/td>\n<\/tr>\n<tr>\n<td>B200<\/td>\n<td>192 GB<\/td>\n<td>123 B<\/td>\n<td>123 B<\/td>\n<td>405 B at Q4 (tight)<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The patterns to internalize:<\/p>\n<p>1. 
<strong>12 GB is the entry floor.<\/strong> Below it, you&#8217;re constrained to tiny models that don&#8217;t justify a dedicated GPU.<br \/>\n2. <strong>24 GB is the inflection point.<\/strong> It&#8217;s the cheapest tier where Llama 3 70B becomes possible (at compromised quants).<br \/>\n3. <strong>32 GB unlocks 70B properly.<\/strong> This is the entire reason to choose the RTX 5090 over the 4090.<br \/>\n4. <strong>48 GB is comfortable territory.<\/strong> Most things you want to do fit cleanly.<br \/>\n5. <strong>128 GB unified is the consumer ceiling.<\/strong> Above this, you&#8217;re buying server hardware.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Choosing_the_right_quant_for_your_hardware\"><\/span>Choosing the right quant for your hardware<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>The right quantization isn&#8217;t always &#8220;the biggest one that fits.&#8221; Quality matters, and sometimes a smaller model at a better quant beats a bigger model at a worse one.<\/p>\n<p>Rough quality ranking (perplexity-based, lower is better):<\/p>\n<ul>\n<li><strong>FP16 \/ BF16<\/strong> \u2014 original. Quality reference baseline.<\/li>\n<li><strong>Q8_0<\/strong> \u2014 ~0.3% perplexity increase. Essentially indistinguishable.<\/li>\n<li><strong>Q6_K<\/strong> \u2014 ~0.5% increase. Indistinguishable in practice.<\/li>\n<li><strong>Q5_K_M<\/strong> \u2014 ~1.0% increase. Slight quality drop, still very high quality.<\/li>\n<li><strong>Q4_K_M<\/strong> \u2014 ~1.5\u20132.5% increase. Recommended for most users.<\/li>\n<li><strong>Q4_K_S<\/strong> \u2014 ~3% increase. Noticeably worse than Q4_K_M for similar size.<\/li>\n<li><strong>Q3_K_M<\/strong> \u2014 ~5\u20138% increase. Visibly affected output.<\/li>\n<li><strong>Q3_K_S<\/strong> \u2014 ~10% increase. Use only if Q4 won&#8217;t fit.<\/li>\n<li><strong>IQ2_XXS<\/strong> \u2014 ~15\u201325% increase. 
Last resort.<\/li>\n<\/ul>\n<p>The general rule: <strong>prefer a smaller-parameter model at Q5_K_M over a bigger model at Q3_K_S<\/strong> for everyday tasks. A Qwen 32B at Q5 generally beats a Llama 3 70B at IQ2_XXS on real-world benchmarks despite the latter sounding more impressive on paper.<\/p>\n<p>Exception: <strong>coding and reasoning tasks<\/strong>, where the bigger model&#8217;s raw knowledge advantage often survives heavy quantization. For code generation specifically, even Q3_K_S of a 70B model can outperform a Q5_K_M 30B.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"MoE_models_%E2%80%94_the_asterisk\"><\/span>MoE models \u2014 the asterisk<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>Mixture-of-experts (MoE) models like <strong>Mixtral 8x22B<\/strong> and <strong>DeepSeek V2<\/strong> have an asymmetry that confuses newcomers:<\/p>\n<ul>\n<li><strong>VRAM needed<\/strong> = total parameters (because you must hold all experts)<\/li>\n<li><strong>Compute used<\/strong> = active parameters per token (much less)<\/li>\n<\/ul>\n<p>Mixtral 8x22B is 141 B total \/ 39 B active. It needs 80+ GB of VRAM to run, but the per-token speed is closer to running a 40 B dense model.<\/p>\n<p>DeepSeek V2 is 236 B total \/ 21 B active. It needs 150 GB+ of VRAM, but token speed approaches a 20 B dense model. This is why DeepSeek V2 is &#8220;fast for its size&#8221;: you pay the VRAM tax but get the compute discount. DeepSeek V3 applies the same design at 671 B total \/ 37 B active, so its memory needs are far higher.<\/p>\n<p>If your hardware can hold an MoE model, it&#8217;s often the best choice. 
If it can&#8217;t, the dense model in the same parameter class is what you want.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Quick-start_setups_by_budget\"><\/span>Quick-start setups by budget<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>For people who want a concrete answer, here are tested setups at six budget points in 2026:<\/p>\n<table class=\"convly-vs\">\n<thead>\n<tr>\n<th>Budget<\/th>\n<th>GPU<\/th>\n<th>Best model<\/th>\n<th>Tokens\/sec<\/th>\n<\/tr>\n<\/thead>\n<tbody>\n<tr>\n<td>$300<\/td>\n<td>RTX 3060 12 GB<\/td>\n<td>Llama 3 8B Q5_K_M<\/td>\n<td>~48<\/td>\n<\/tr>\n<tr>\n<td>$700<\/td>\n<td>Used RTX 3090<\/td>\n<td>Qwen 2.5 32B Q5_K_M<\/td>\n<td>~28<\/td>\n<\/tr>\n<tr>\n<td>$1,300<\/td>\n<td>Used RTX 4090<\/td>\n<td>Llama 3 70B Q3_K_S<\/td>\n<td>~13<\/td>\n<\/tr>\n<tr>\n<td>$1,400<\/td>\n<td>2\u00d7 Used RTX 3090 + NVLink<\/td>\n<td>Llama 3 70B Q4_K_M<\/td>\n<td>~15<\/td>\n<\/tr>\n<tr>\n<td>$2,400<\/td>\n<td>RTX 5090<\/td>\n<td>Llama 3 70B Q5_K_M<\/td>\n<td>~18<\/td>\n<\/tr>\n<tr>\n<td>$5,000<\/td>\n<td>Mac Studio M4 Max 128 GB<\/td>\n<td>Mistral Large 2 Q4<\/td>\n<td>~6<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The &#8220;best value tier&#8221; in 2026 remains the used RTX 3090 \/ 4090 \u2014 these are the only consumer GPUs where the price-per-VRAM math is favorable, and both will remain capable through at least 2028.<\/p>\n<p>For the deep dive on which GPU to pick, see <a href=\"\/pt\/best-gpus-for-local-llms-2026\/\">best GPUs for local LLMs in 2026<\/a>.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"FAQ\"><\/span>FAQ<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<h3>How much VRAM do I need to run Llama 3 70B locally in 2026?<\/h3>\n<p>Minimum 24 GB for Llama 3 70B at Q3_K_S (which is rough quality). 32 GB lets you run Q4_K_M comfortably (the recommended quant). 40+ GB is needed for Q5_K_M. 
With 24 GB and 8 K context, you have basically zero headroom; pushing context to 32 K requires CPU offload or a more aggressive quant.<\/p>\n<h3>What&#8217;s the difference between Q4_K_M and Q4_K_S?<\/h3>\n<p>Both are 4-bit quantizations of the same model. Q4_K_M (&#8220;medium&#8221;) keeps a subset of the most sensitive attention and feed-forward tensors at higher precision (6-bit in llama.cpp&#8217;s standard recipe), making it slightly larger but noticeably better quality than Q4_K_S (&#8220;small&#8221;). For nearly identical VRAM, Q4_K_M is preferred. Q4_K_S only makes sense when you&#8217;re trying to squeeze a model into a tight VRAM budget.<\/p>\n<h3>Can I run an LLM that&#8217;s bigger than my VRAM?<\/h3>\n<p>Yes \u2014 using <strong>CPU offload<\/strong>, where some model layers run on the CPU using system RAM instead of GPU VRAM. The performance penalty is severe (5\u201310\u00d7 slower), but it lets you run models that wouldn&#8217;t otherwise fit. Practical for occasional use, painful as a daily driver. Both llama.cpp and Ollama support this out of the box: llama.cpp through its <code>--n-gpu-layers<\/code> flag, and Ollama through the <code>num_gpu<\/code> parameter.<\/p>\n<h3>Does the KV cache really matter for VRAM planning?<\/h3>\n<p>Yes, especially at long context. For Llama 3 70B at 32 K context, the KV cache alone is ~5 GB. If you&#8217;re already at the edge of your VRAM, you&#8217;ll OOM the moment a conversation gets long. Plan for KV cache and consider Q8 KV-cache quantization (an option in modern inference engines) to roughly halve it.<\/p>\n<h3>Is there a way to run Llama 3.1 405B at home?<\/h3>\n<p>Yes, but you need 200+ GB of memory at usable quants. The realistic 2026 paths: Mac Studio M3 Ultra 512 GB ($12K, slow per-token but works), 8\u00d7 RTX 4090 ($13K, complex setup), Nvidia DIGITS ($3K, purpose-built), or CPU + 256 GB DDR5 RAM with mid-range GPU for partial offload ($8K, slow). 
See our <a href=\"\/pt\/running-llama-3-405b-at-home-real-cost\/\">how-to guide on running Llama 3.1 405B at home<\/a>.<\/p>\n<h3>Are there any 2026 quantization formats I should know besides GGUF?<\/h3>\n<p>Yes \u2014 <strong>AWQ<\/strong> (Activation-aware Weight Quantization) and <strong>GPTQ<\/strong> are both still widely used, especially for vLLM and TensorRT-LLM deployments. In some cases they give slightly better quality than GGUF at the same bit count. For consumer local-LLM use with llama.cpp\/Ollama\/LM Studio, GGUF remains dominant in 2026 because of its simplicity and broad tooling support.<\/p>\n<h3>Will Q4 quantization affect coding ability?<\/h3>\n<p>Less than you&#8217;d think, but yes. For straightforward code completion, Q4_K_M is essentially identical to FP16. For complex multi-step reasoning across a codebase, Q4 occasionally produces worse logic than Q5+. If you do serious coding with local models, prefer Q5_K_M and choose your hardware to support it.<\/p>\n<h2><span class=\"ez-toc-section\" id=\"Bottom_line\"><\/span>Bottom line<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<p>VRAM planning for local LLMs in 2026 isn&#8217;t complicated, but it does reward precision. Pick the parameter class first (the model size that has the capability you need), then pick the smallest quant that gives acceptable quality (Q4_K_M is usually right), then add KV cache for your real context length, then size your GPU accordingly.<\/p>\n<p>If you only remember three numbers, remember these:<\/p>\n<ul>\n<li><strong>12 GB<\/strong> runs 8 B models cleanly.<\/li>\n<li><strong>24 GB<\/strong> runs 30 B at quality quants, 70 B uncomfortably.<\/li>\n<li><strong>32 GB<\/strong> runs 70 B at quality quants.<\/li>\n<\/ul>\n<p>Everything past 32 GB enters server territory, and everything below 12 GB enters phone\/embedded territory. 
The bulk of 2026 local-LLM activity happens in the 12\u201332 GB range, which is exactly the consumer GPU range \u2014 by design, not coincidence.<\/p>\n<p><!--related-block--><\/p>\n<div class=\"convly-related\">\n<h2><span class=\"ez-toc-section\" id=\"Related_articles\"><\/span>Related articles<span class=\"ez-toc-section-end\"><\/span><\/h2>\n<ul>\n<li><a href=\"https:\/\/convly.ai\/pt\/open-source-llm-leaderboard-hardware-2026\/\">Open-Source LLM Leaderboard 2026: Hardware Needed to Run Each Top Model<\/a><\/li>\n<li><a href=\"https:\/\/convly.ai\/pt\/claude-5-new-ai-models-june-2026\/\">Is There a Claude 5? Claude Fable 5 and All the Major AI Models of June 2026<\/a><\/li>\n<li><a href=\"https:\/\/convly.ai\/pt\/veo-3-vs-kling-3-for-ai-video-2026\/\">Veo 3.1 vs Kling 3.0 for AI Video in 2026: Which Delivers More Realism?<\/a><\/li>\n<\/ul>\n<\/div>","protected":false},"excerpt":{"rendered":"<p>The complete VRAM cheat sheet for every major open LLM in 2026 \u2014 at every common quantization level \u2014 plus a matrix showing which models fit on 12, 16, 24, 32, 48, and 80 GB 
GPUs.<\/p>","protected":false},"author":1,"featured_media":271,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"site-sidebar-layout":"default","site-content-layout":"","ast-site-content-layout":"default","site-content-style":"default","site-sidebar-style":"default","ast-global-header-display":"","ast-banner-title-visibility":"","ast-main-header-display":"","ast-hfb-above-header-display":"","ast-hfb-below-header-display":"","ast-hfb-mobile-header-display":"","site-post-title":"","ast-breadcrumbs-content":"","ast-featured-img":"","footer-sml-layout":"","ast-disable-related-posts":"","theme-transparent-header-meta":"","adv-header-id-meta":"","stick-header-meta":"","header-above-stick-meta":"","header-main-stick-meta":"","header-below-stick-meta":"","astra-migrate-meta-layouts":"default","ast-page-background-enabled":"default","ast-page-background-meta":{"desktop":{"background-color":"var(--ast-global-color-5)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"ast-content-background-meta":{"desktop":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"tablet":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""},"mobile":{"background-color":"var(--ast-global-color-4)","background-image":"","background-repeat":"repeat","background-position":"center 
center","background-size":"auto","background-attachment":"scroll","background-type":"","background-media":"","overlay-type":"","overlay-color":"","overlay-opacity":"","overlay-gradient":""}},"footnotes":""},"categories":[247],"tags":[289,290,288,287,285,286],"class_list":["post-264","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai-benchmarks","tag-ggml","tag-gguf","tag-gpu-vram-for-ai","tag-llama-3-vram","tag-llm-vram","tag-quantization"],"_links":{"self":[{"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/posts\/264","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/comments?post=264"}],"version-history":[{"count":1,"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/posts\/264\/revisions"}],"predecessor-version":[{"id":1001,"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/posts\/264\/revisions\/1001"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/media\/271"}],"wp:attachment":[{"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/media?parent=264"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/categories?post=264"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/convly.ai\/pt\/wp-json\/wp\/v2\/tags?post=264"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}