Ollama System Requirements in 2026: How Much RAM and VRAM You Really Need

Updated June 15, 2026 · Originally published June 6, 2026

The single most common reason a model won’t run in Ollama isn’t a bug — it’s that the model is bigger than your memory. Ollama itself is tiny; the models are what demand hardware. This guide gives you the real RAM and VRAM numbers for each model size in 2026, plus a simple formula so you know what fits before you spend ten minutes downloading something that won’t load.

If you haven’t installed Ollama yet, start with our step-by-step install guide.

Key takeaways

The rule of thumb: a quantized (Q4) model needs roughly 0.6 GB of memory per billion parameters, plus headroom for context.
2–3B models: run on CPU, ~2–4 GB RAM. Fine on a basic laptop.
7–8B models: ~6–8 GB RAM/VRAM. The sweet spot for most laptops.
27–34B models: ~20–24 GB VRAM. Needs a high-end GPU or Apple Silicon with lots of unified memory.
70B+ models: 40 GB+ — a workstation GPU, multi-GPU rig, or 64 GB+ unified memory.

Why memory is the whole story

To generate text, a model’s weights have to sit in fast memory — your GPU’s VRAM, or system RAM if you’re running on CPU. If the model doesn’t fit, one of two things happens: Ollama spills part of it to slower memory (and performance collapses), or it refuses to load with an out-of-memory error. Everything else — CPU speed, disk, OS — matters far less than having enough of the right memory.

Two factors set the requirement:

Parameter count — a 7B model has 7 billion weights; a 70B model has ten times as many.
Quantization — Ollama uses compressed GGUF weights. A 4-bit (Q4) quant cuts memory roughly in half versus 8-bit, with minimal quality loss, which is why it’s the default sweet spot.
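To see why bit width matters so much, here is a minimal sketch that computes the raw size of the weights alone at different precisions. The function name is hypothetical, and the result ignores file-format and runtime overhead. Real Q4 GGUF files store scale factors alongside the 4-bit values, so they come out somewhat larger than the raw figure, which is consistent with the 0.6 GB per billion rule sitting above 0.5.

```python
def raw_weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in GB (10^9 bytes), with no overhead."""
    total_bytes = params_billions * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare a 7B model at 16-bit, 8-bit and 4-bit precision.
for bits in (16, 8, 4):
    print(f"7B at {bits}-bit: {raw_weight_gb(7, bits):.1f} GB")
```

Halving the bits halves the raw size, which is the whole appeal of dropping from 8-bit to 4-bit.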

The simple formula

For a 4-bit quantized model — what Ollama pulls by default — estimate:

Memory needed ≈ (parameters in billions) × 0.6 GB + context overhead

So a 7B model needs roughly 4–5 GB, a 13B model about 8 GB, a 27B model around 18–20 GB, and a 70B model 40 GB or more. Add a bit on top for the KV cache, which grows with how long your conversations get. Always leave a few gigabytes of headroom for your operating system.
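The formula above is easy to turn into a quick pre-download check. This is a rough sketch, not a measurement: the 0.6 factor comes from this guide, while the default context overhead (1 GB) and OS headroom (2 GB) are placeholder assumptions you should adjust for your own setup.

```python
GB_PER_BILLION_Q4 = 0.6  # rule-of-thumb factor from this guide


def estimate_q4_memory_gb(params_billions: float,
                          context_overhead_gb: float = 1.0) -> float:
    """Estimated memory for a 4-bit model: parameters x 0.6 GB + context overhead."""
    return params_billions * GB_PER_BILLION_Q4 + context_overhead_gb


def fits(params_billions: float, available_gb: float,
         os_headroom_gb: float = 2.0,
         context_overhead_gb: float = 1.0) -> bool:
    """True if the estimate plus headroom fits in the available memory."""
    needed = estimate_q4_memory_gb(params_billions, context_overhead_gb)
    return needed + os_headroom_gb <= available_gb


for size in (7, 13, 27, 70):
    print(f"{size}B: ~{estimate_q4_memory_gb(size):.1f} GB")
```

For example, `fits(7, 8)` asks whether a 7B model is a safe bet on an 8 GB machine under these assumptions.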

Requirements by model size

Model size	Memory (Q4)	Runs on	Example models
2–3B	~2–4 GB	CPU / any laptop	Gemma2 2B, Phi-4 mini
7–8B	~6–8 GB	Entry GPU / 16 GB laptop	DeepSeek-R1 7B, Llama 3.1 8B
13–14B	~10–12 GB	Mid-range GPU	Phi-4, mid Qwen
27–34B	~18–24 GB	High-end GPU / Apple Silicon	Gemma 4 26B, Qwen 3.6 27B
70B	~40–48 GB	Workstation / multi-GPU	Llama 70B class
200B+ (MoE)	100 GB+	Server / huge unified memory	Qwen3 235B-A22B
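If you prefer to go the other way, from a memory budget to a size class, the table can be encoded as a small lookup. This sketch is deliberately conservative and uses the upper end of each memory range from the table (and 100 GB for the MoE row); the names are hypothetical.

```python
# (size class, memory in GB needed at Q4), taken from the table above.
TIERS = [
    ("2-3B", 4),
    ("7-8B", 8),
    ("13-14B", 12),
    ("27-34B", 24),
    ("70B", 48),
    ("200B+ (MoE)", 100),
]


def largest_tier(memory_gb: float):
    """Return the largest size class whose requirement fits, or None."""
    best = None
    for name, needed_gb in TIERS:
        if needed_gb <= memory_gb:
            best = name
    return best


print(largest_tier(24))
```

A result of `None` means even the smallest class is a tight fit and a smaller quant or a smaller model is worth considering.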

For a deeper breakdown across specific models, see our guide to VRAM requirements for every major LLM.

GPU vs CPU vs Apple Silicon

NVIDIA GPU: the gold standard. VRAM is the hard limit: the model must fit in your card’s memory to run fast. A 24 GB card (RTX 3090/4090) comfortably runs up to ~27–34B models; the RTX 5090 has 32 GB, which leaves extra room for context.

CPU only — works for small models (2–8B) but is much slower, since system RAM bandwidth can’t match a GPU. Perfectly fine for light tasks on a laptop with no discrete GPU.

Apple Silicon — a special case, and a strong one. Because Macs use unified memory shared between CPU and GPU, a Mac with 64 GB can load models that would need an expensive multi-GPU PC. Since Ollama v0.19 (March 2026) added the MLX backend, Apple Silicon also got much faster — making a high-memory Mac one of the best single-box local-LLM machines you can buy. For how that stacks up against a discrete GPU, see Strix Halo vs Apple M4 Pro.

AMD GPU: supported via ROCm. It works well for inference in 2026; see our ROCm vs CUDA comparison for the current state.

How to make a big model fit

If the model you want is just over your memory, you have options before giving up:

Use a smaller quant — pull a q4 or even q3 variant instead of q8. You trade a little quality for a big memory saving.
Pick a smaller model size — a well-chosen 8B often beats a barely-running, swapped-out 27B.
Shorten the context window — a smaller context uses less KV-cache memory.
Close other apps — on a CPU/unified-memory machine, free RAM is your budget.
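To get a feel for how much a shorter context saves, the KV cache can be estimated from a model's architecture: one key and one value vector per layer, per KV head, per token. The sketch below assumes 16-bit cache values, and the example architecture figures (32 layers, 8 KV heads, head dimension 128) are illustrative numbers for an 8B-class model, not values read from any particular Ollama build.

```python
def kv_cache_gb(context_tokens: int, n_layers: int, n_kv_heads: int,
                head_dim: int, bytes_per_value: int = 2) -> float:
    """Approximate KV-cache size in GB (10^9 bytes) for a given context length."""
    # The factor 2 covers one key tensor plus one value tensor per layer.
    per_token_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token_bytes * context_tokens / 1e9


# Assumed 8B-class architecture, for illustration only.
for ctx in (2048, 8192, 32768):
    print(f"{ctx} tokens: ~{kv_cache_gb(ctx, 32, 8, 128):.2f} GB")
```

The cache grows linearly with context length, so cutting the context to a quarter cuts this part of the budget to a quarter as well.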

To pick a model matched to your hardware, see the best local LLMs to run in Ollama.

Storage and software prerequisites people forget

RAM and VRAM get all the attention, but two quieter requirements trip up more first-time installs than anything else: disk space and the software stack underneath. Get these wrong and Ollama either refuses to install or fails halfway through a model download.

Disk space. The Ollama binary itself is small; budget roughly 4 GB for the install. The models are what eat your drive. Every model is downloaded once and cached on disk, then loaded into memory at runtime, so you need free space for the full weights of every model you keep, on top of the install. As a rough guide at common 4-bit quantization:

An 8B model (e.g. Llama 3.1 8B): about 5 GB on disk.
A 20B-class model: roughly 12–14 GB.
A 70B model: around 40 GB.
A very large MoE model (Llama 4-class): 65 GB or more.

These stack up fast. A casual collection of a few models lands at 30–80 GB; keep several large variants and you will cross 200 GB without trying. A 512 GB SSD is a sensible floor if you plan to collect models.
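A quick way to sanity-check a planned collection against your drive is sketched below. The per-model sizes are the approximate figures listed above (13 GB is the midpoint of the 20B range), and the 20 GB reserve is an arbitrary assumption; `shutil.disk_usage` is from the Python standard library.

```python
import shutil

# Approximate on-disk sizes at 4-bit quantization, from the list above.
APPROX_Q4_DISK_GB = {"8B": 5, "20B": 13, "70B": 40, "large MoE": 65}


def collection_size_gb(sizes) -> int:
    """Total disk space for a list of size classes, e.g. ["8B", "70B"]."""
    return sum(APPROX_Q4_DISK_GB[s] for s in sizes)


def free_disk_gb(path: str = ".") -> float:
    """Free space on the drive holding `path`, in GB (10^9 bytes)."""
    return shutil.disk_usage(path).free / 1e9


def has_room(sizes, path: str = ".", reserve_gb: float = 20) -> bool:
    """True if the collection fits while leaving `reserve_gb` free."""
    return collection_size_gb(sizes) + reserve_gb <= free_disk_gb(path)


wanted = ["8B", "20B", "70B"]
print(f"Planned: {collection_size_gb(wanted)} GB, free: {free_disk_gb():.0f} GB")
```

Point `path` at the drive where your models live, since that may not be the drive you run the script from.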

Use an SSD, ideally NVMe. Because the weights are read off disk into RAM or VRAM every time a model first loads, a slow mechanical drive shows up directly as sluggish startup — a 40 GB model crawls off a spinning disk. Fast storage does not change tokens-per-second once the model is loaded, but it makes the first prompt feel instant instead of a 30-second stall.

Operating system and drivers. Ollama runs natively on all three platforms, but each has a floor:

macOS: 11 (Big Sur) or newer, on both Apple Silicon and Intel.
Windows: Windows 10 22H2 or newer (Home or Pro), on x86_64 and ARM64 — so Snapdragon machines run it natively, without x86 emulation.
Linux: most modern distributions (Ubuntu 18.04+, Debian, Fedora, RHEL, Arch).

For GPU acceleration you also need current drivers: a recent NVIDIA driver — 531 or newer (and 570 or newer for older Maxwell- and Pascal-era cards) — for CUDA, or a Vulkan-capable or ROCm v7 driver stack on AMD Radeon. Miss the driver and Ollama silently falls back to CPU — which is the most common reason a machine “with a good GPU” runs slowly.
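On NVIDIA systems you can check the installed driver against that floor before blaming Ollama for slow output. The sketch below shells out to `nvidia-smi` (which ships with the NVIDIA driver) and compares the major version to the 531 figure quoted above; the helper names are hypothetical, and it returns `None` rather than failing when no NVIDIA driver is present.

```python
import shutil
import subprocess
from typing import Optional

MIN_NVIDIA_DRIVER_MAJOR = 531  # floor quoted in this guide


def driver_major(version: str) -> int:
    """Extract the major number from a version string like '535.104.05'."""
    return int(version.strip().split(".")[0])


def meets_floor(version: str, floor: int = MIN_NVIDIA_DRIVER_MAJOR) -> bool:
    return driver_major(version) >= floor


def installed_nvidia_driver() -> Optional[str]:
    """Return the driver version reported by nvidia-smi, or None if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return None
    try:
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        )
    except (subprocess.SubprocessError, OSError):
        return None
    lines = result.stdout.strip().splitlines()
    return lines[0].strip() if lines else None


version = installed_nvidia_driver()
if version is None:
    print("No NVIDIA driver detected via nvidia-smi.")
else:
    print(f"NVIDIA driver {version} detected.")
```

Pass the detected version to `meets_floor` to compare it with the floor; older Maxwell- and Pascal-era cards would need the higher 570 floor passed in explicitly.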

Frequently asked questions

How much RAM do I need to run Ollama?

It depends entirely on the model. Ollama itself needs almost nothing; the model sets the requirement. As a rule, a 4-bit model needs about 0.6 GB per billion parameters — so ~4–5 GB for a 7B model, ~8 GB for 13B, and 40 GB+ for a 70B. Always leave a few gigabytes free for your OS.

Can I run Ollama without a GPU?

Yes. Small models (2–8B) run fine on CPU, just more slowly than on a GPU. A model like Gemma2 2B needs only about 1.7 GB of RAM and works on basic laptops. For models above ~13B, a GPU or Apple Silicon with unified memory makes a real difference.

How much VRAM do I need for a 7B model?

About 6–8 GB for a 4-bit quantized 7B model, including some context overhead. That fits comfortably on most entry-level discrete GPUs and on laptops with 16 GB of unified or system memory.

Why is Ollama running so slowly?

Almost always because the model doesn’t fully fit in your GPU’s VRAM, so part of it has spilled to system RAM and is running on the CPU. Check with ollama ps: if the PROCESSOR column shows a CPU share instead of 100% GPU, switch to a smaller model or a more aggressive quant so the whole model fits in fast memory.
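If you want to script that check, the processor split reported by ollama ps can be parsed with a few lines. This sketch assumes the value looks like "100% GPU", "100% CPU", or "48%/52% CPU/GPU"; treat those formats as an assumption and verify them against the output of your own Ollama version.

```python
def gpu_percent(processor: str) -> int:
    """Return the share of the model on the GPU, given a PROCESSOR value.

    Assumed formats: "100% GPU", "100% CPU", "48%/52% CPU/GPU".
    """
    amounts, labels = processor.split()
    percents = [int(p.rstrip("%")) for p in amounts.split("/")]
    shares = dict(zip(labels.split("/"), percents))
    return shares.get("GPU", 0)


for value in ("100% GPU", "48%/52% CPU/GPU", "100% CPU"):
    print(f"{value!r}: {gpu_percent(value)}% on GPU")
```

Anything below 100 means part of the model is running from slower memory.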

Is a Mac good for running Ollama?

Yes, often excellent. Apple Silicon’s unified memory lets a 64 GB Mac run models that would otherwise need a costly multi-GPU PC, and the MLX backend (since v0.19) made it fast too. A high-memory Mac is one of the best single-machine options for local LLMs in 2026.

How much disk space do I need for Ollama?

Plan for about 4 GB for the Ollama install itself, then add the size of each model you pull. At 4-bit quantization an 8B model is roughly 5 GB, a 70B is around 40 GB, and the largest models exceed 65 GB. A typical multi-model setup lands between 30 and 80 GB, so a 512 GB SSD is a comfortable starting point. An SSD (preferably NVMe) is strongly recommended, because models load off disk every time you first run them.

Where does Ollama store models, and can I move them to another drive?

By default Ollama keeps downloaded models in a hidden folder in your home directory: ~/.ollama on macOS and Linux, and %HOMEPATH%\.ollama on Windows. If your system drive is small, you can redirect storage to a larger or external disk by setting the OLLAMA_MODELS environment variable to a new path before starting Ollama. This is the cleanest fix when your boot drive runs out of room.
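A small script can tell you which directory is in effect. This sketch only mirrors the lookup described above: it honors OLLAMA_MODELS when set and otherwise falls back to a models subfolder under .ollama in your home directory. The subfolder name and the function name are assumptions, and service-style Linux installs may use a different default.

```python
import os
from pathlib import Path


def models_dir(env=None) -> Path:
    """Resolve the model directory: OLLAMA_MODELS if set, else ~/.ollama/models."""
    env = os.environ if env is None else env
    override = env.get("OLLAMA_MODELS")
    if override:
        return Path(override).expanduser()
    return Path.home() / ".ollama" / "models"


print(f"Models directory in effect: {models_dir()}")
```

Passing a dictionary as `env` lets you preview the effect of a new OLLAMA_MODELS value without changing your real environment.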

Which operating systems does Ollama support?

Ollama runs natively on macOS 11 (Big Sur) or newer, Windows 10 22H2 or newer (64-bit, including ARM64 devices like Snapdragon laptops), and most modern Linux distributions such as Ubuntu 18.04+, Fedora, and Arch. For GPU acceleration you also need an up-to-date driver — a recent NVIDIA driver for CUDA, or a ROCm/Vulkan-capable driver on AMD — otherwise Ollama runs on the CPU instead.

Conclusion

Before you download anything, do the quick math: parameters × 0.6 GB for a 4-bit model, plus headroom. Match that to your VRAM (NVIDIA/AMD) or unified memory (Apple), and you’ll never hit a frustrating out-of-memory error again. When in doubt, start one size smaller than you think — a model that fits and runs fast beats a bigger one that crawls.