Saturday, 6 June 2026 | Daily update | Artificial intelligence for builders

Ollama System Requirements in 2026: How Much RAM and VRAM You Really Need

The single most common reason a model won’t run in Ollama isn’t a bug — it’s that the model is bigger than your memory. Ollama itself is tiny; the models are what demand hardware. This guide gives you the real RAM and VRAM numbers for each model size in 2026, plus a simple formula so you know what fits before you spend ten minutes downloading something that won’t load.

If you haven’t installed Ollama yet, start with our step-by-step install guide.

Key takeaways

  • The rule of thumb: a quantized (Q4) model needs roughly 0.6 GB of memory per billion parameters, plus headroom for context.
  • 2–3B models: run on CPU, ~2–4 GB RAM. Fine on a basic laptop.
  • 7–8B models: ~6–8 GB RAM/VRAM. The sweet spot for most laptops.
  • 27–34B models: ~20–24 GB VRAM. Needs a high-end GPU or Apple Silicon with lots of unified memory.
  • 70B+ models: 40 GB+ — a workstation GPU, multi-GPU rig, or 64 GB+ unified memory.

Why memory is the whole story

To generate text, a model’s weights have to sit in fast memory — your GPU’s VRAM, or system RAM if you’re running on CPU. If the model doesn’t fit, one of two things happens: Ollama spills part of it to slower memory (and performance collapses), or it refuses to load with an out-of-memory error. Everything else — CPU speed, disk, OS — matters far less than having enough of the right memory.

Two factors set the requirement:

  1. Parameter count: a 7B model has 7 billion weights; a 70B model has ten times as many.
  2. Quantization: Ollama uses compressed GGUF weights. A 4-bit (Q4) quant cuts memory roughly in half versus 8-bit, with minimal quality loss, which is why it’s the default sweet spot.
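
To make the two factors concrete, here is a minimal sketch of how parameter count and bits per weight combine into the size of the weights alone. The function name is hypothetical, and the figures ignore the per-block scale data that real quantized files carry, which is part of why the rule of thumb in this guide uses about 0.6 GB per billion parameters rather than a flat 0.5.

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GB (1 GB = 1e9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # (params_billions * 1e9 weights) * bytes_per_weight / (1e9 bytes per GB)
    return params_billions * bytes_per_weight


# Compare the same 7B model at three precisions.
for label, bits in [("FP16", 16), ("Q8", 8), ("Q4", 4)]:
    print(f"7B at {label}: ~{weight_memory_gb(7, bits):.1f} GB")
```

Each halving of the bit width halves the weight memory, which is the trade the quantization step is making.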

The simple formula

For a 4-bit quantized model — what Ollama pulls by default — estimate:

Memory needed ≈ (parameters in billions) × 0.6 GB + context overhead

So a 7B model needs roughly 4–5 GB, a 13B model about 8 GB, a 27B model around 18–20 GB, and a 70B model 40 GB or more. Add a bit on top for the KV cache, which grows with how long your conversations get. Always leave a few gigabytes of headroom for your operating system.
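
The formula is easy to turn into a quick check before you pull a model. The sketch below is only an illustration: the 1 GB context overhead and 2 GB operating-system headroom are assumed defaults chosen for the example, not values Ollama reports, and the function names are hypothetical.

```python
GB_PER_BILLION_Q4 = 0.6  # rule of thumb from this guide for 4-bit models


def estimate_q4_memory_gb(params_billions: float,
                          context_overhead_gb: float = 1.0) -> float:
    """Estimated memory for a Q4 model: weights plus assumed context overhead."""
    return params_billions * GB_PER_BILLION_Q4 + context_overhead_gb


def fits(params_billions: float, available_gb: float,
         context_overhead_gb: float = 1.0,
         os_headroom_gb: float = 2.0) -> bool:
    """True if the estimate plus assumed OS headroom fits in available memory."""
    needed = estimate_q4_memory_gb(params_billions, context_overhead_gb)
    return needed + os_headroom_gb <= available_gb


for size in (7, 13, 27, 70):
    print(f"{size}B: ~{estimate_q4_memory_gb(size):.1f} GB")

print("7B in 8 GB:", fits(7, 8))
print("27B in 16 GB:", fits(27, 16))
```

Raise `context_overhead_gb` if you plan on long conversations, since the KV cache grows with context length.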

Requirements by model size

Model size  | Memory (Q4) | Runs on                       | Example models
2–3B        | ~2–4 GB     | CPU / any laptop              | Gemma2 2B, Phi-4 mini
7–8B        | ~6–8 GB     | Entry GPU / 16 GB laptop      | DeepSeek-R1 7B, Llama 3.3 8B
13–14B      | ~10–12 GB   | Mid-range GPU                 | Phi-4, mid Qwen
27–34B      | ~18–24 GB   | High-end GPU / Apple Silicon  | Gemma 4 26B, Qwen 3.6 27B
70B         | ~40–48 GB   | Workstation / multi-GPU       | Llama 70B class
200B+ (MoE) | 100 GB+     | Server / huge unified memory  | Qwen3 235B-A22B
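
If you would rather look the answer up than calculate it, the table can be encoded as a small lookup. This is a sketch under one assumption: it uses the upper end of each memory range as the threshold, which is the conservative reading of the table. The 200B+ row is left out because the table gives it no upper bound.

```python
# (memory needed in GB, model size class), using the upper end of each range.
SIZE_CLASSES = [
    (4, "2-3B"),
    (8, "7-8B"),
    (12, "13-14B"),
    (24, "27-34B"),
    (48, "70B"),
]


def largest_class_that_fits(available_gb: float):
    """Return the largest size class whose upper-bound memory fits, or None."""
    best = None
    for needed_gb, label in SIZE_CLASSES:
        if available_gb >= needed_gb:
            best = label
    return best


for memory in (3, 8, 16, 24, 64):
    print(f"{memory} GB -> {largest_class_that_fits(memory)}")
```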

For a deeper breakdown across specific models, see our guide to VRAM requirements for all major LLMs.

GPU vs CPU vs Apple Silicon

NVIDIA GPU: the gold standard. VRAM is the hard limit, because the model must fit in your card’s memory to run fast. A 24 GB card such as the RTX 4090 comfortably runs models up to roughly 27–34B at Q4; the RTX 5090 has 32 GB, which leaves more room for context.

CPU only: works for small models (2–8B) but is much slower, since system RAM bandwidth can’t match a GPU. Perfectly fine for light tasks on a laptop with no discrete GPU.

Apple Silicon: a special case, and a strong one. Because Macs use unified memory shared between CPU and GPU, a Mac with 64 GB can load models that would otherwise need an expensive multi-GPU PC. Since Ollama v0.19 (March 2026) added the MLX backend, Apple Silicon has also become much faster, making a high-memory Mac one of the best single-box local-LLM machines you can buy. For how that stacks up against a discrete GPU, see Strix Halo vs Apple M4 Pro.

AMD GPU: supported via ROCm. It works well for inference in 2026; check our ROCm vs CUDA breakdown for the current state.
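
Why bandwidth matters so much can be seen with a common back-of-envelope bound: if generating each token requires reading every weight once, decode speed cannot exceed memory bandwidth divided by model size. The sketch below assumes a dense model, and the bandwidth figures are hypothetical round numbers picked for illustration, not specifications of any particular device.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    if model_gb <= 0:
        raise ValueError("model_gb must be positive")
    return bandwidth_gb_s / model_gb


# Hypothetical, illustrative bandwidths; check your own hardware's figures.
examples = {
    "system RAM (assumed 80 GB/s)": 80.0,
    "discrete GPU (assumed 1000 GB/s)": 1000.0,
}
model_gb = 5.0  # roughly a 7B model at Q4, per the formula in this guide

for name, bandwidth in examples.items():
    bound = max_tokens_per_second(bandwidth, model_gb)
    print(f"{name}: at most ~{bound:.0f} tokens/s")
```

Real throughput lands below this bound, but the ratio between the two memory types is what separates a CPU-only setup from a GPU.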

How to make a big model fit

If the model you want is just over your memory, you have options before giving up:

  • Use a smaller quant — pull a q4 or even q3 variant instead of q8. You trade a little quality for a big memory saving.
  • Pick a smaller model size — a well-chosen 8B often beats a barely-running, swapped-out 27B.
  • Shorten the context window — a smaller context uses less KV-cache memory.
  • Close other apps — on a CPU/unified-memory machine, free RAM is your budget.
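
To see how much shortening the context can save, here is a rough KV-cache estimate. It uses the standard sizing for a transformer cache (keys and values, per layer, per KV head, per token). The architecture numbers in the example are assumptions meant to resemble an 8B-class model with grouped-query attention and a 16-bit cache; they are not read from any specific Ollama model.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_value: int = 2) -> int:
    """Approximate KV-cache size: keys and values for every layer and token."""
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * context_tokens


# Assumed architecture for illustration: 32 layers, 8 KV heads, head_dim 128.
for context in (2048, 8192, 32768):
    size_gib = kv_cache_bytes(32, 8, 128, context) / 2**30
    print(f"{context:>6} tokens: ~{size_gib:.2f} GiB")
```

The cache scales linearly with context length, so quartering the window quarters this part of the memory bill. In Ollama the window is set with the `num_ctx` parameter.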

To pick a model matched to your hardware, see the best local LLMs to run on Ollama.

FAQ

How much RAM do I need to run Ollama?

It depends entirely on the model. Ollama itself needs almost nothing; the model sets the requirement. As a rule, a 4-bit model needs about 0.6 GB per billion parameters — so ~4–5 GB for a 7B model, ~8 GB for 13B, and 40 GB+ for a 70B. Always leave a few gigabytes free for your OS.

Can I run Ollama without a GPU?

Yes. Small models (2–8B) run fine on CPU, just more slowly than on a GPU. A model like Gemma2 2B needs only about 1.7 GB of RAM and works on basic laptops. For models above ~13B, a GPU or Apple Silicon with unified memory makes a real difference.

How much VRAM do I need for a 7B model?

About 6–8 GB for a 4-bit quantized 7B model, including some context overhead. That fits comfortably on most entry-level discrete GPUs and on laptops with 16 GB of unified or system memory.

Why is Ollama running so slowly?

Almost always because the model doesn’t fully fit in your GPU’s VRAM, so part of it has spilled to system RAM and runs on the CPU. Check with ollama ps: if the PROCESSOR column shows a CPU share rather than 100% GPU, switch to a smaller model or a more aggressive quant so the whole model fits in fast memory.
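
The same check can be scripted against Ollama’s local HTTP API, which exposes running models at /api/ps with a total `size` and a `size_vram` field. The sketch below assumes the default address http://localhost:11434 and those two field names; the helper names and the sample data are hypothetical, and `fetch_ps` is only called if you invoke it yourself against a running server.

```python
import json
from urllib.request import urlopen


def gpu_fraction(model: dict) -> float:
    """Share of a loaded model that sits in VRAM (0.0 to 1.0)."""
    size = model.get("size", 0)
    if size <= 0:
        return 0.0
    return model.get("size_vram", 0) / size


def report(ps_response: dict) -> list[str]:
    """One human-readable line per running model."""
    lines = []
    for model in ps_response.get("models", []):
        fraction = gpu_fraction(model)
        status = ("fully on GPU" if fraction >= 0.999
                  else f"only {fraction:.0%} on GPU")
        lines.append(f"{model['name']}: {status}")
    return lines


def fetch_ps(base_url: str = "http://localhost:11434") -> dict:
    """Query a running Ollama server (assumed default address)."""
    with urlopen(f"{base_url}/api/ps") as response:
        return json.load(response)


# Made-up sample data so the example runs without a server.
sample = {"models": [{"name": "example:7b",
                      "size": 6_000_000_000,
                      "size_vram": 3_000_000_000}]}
print(report(sample))
```

With a server running, `report(fetch_ps())` gives the same summary for whatever is actually loaded.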

Is a Mac good for running Ollama?

Yes, often excellent. Apple Silicon’s unified memory lets a 64 GB Mac run models that would otherwise need a costly multi-GPU PC, and the MLX backend (since v0.19) made it fast too. A high-memory Mac is one of the best single-machine options for local LLMs in 2026.

Bottom line

Before you download anything, do the quick math: parameters × 0.6 GB for a 4-bit model, plus headroom. Match that to your VRAM (NVIDIA/AMD) or unified memory (Apple), and you’ll never hit a frustrating out-of-memory error again. When in doubt, start one size smaller than you think — a model that fits and runs fast beats a bigger one that crawls.
