If you’ve spent any time around local AI in the last two years, you’ve heard the name. Ollama is the tool that turned “run a large language model on your own machine” from a weekend of CUDA errors into a single command: ollama run llama3.3.
This guide explains exactly what Ollama is, how it works under the hood, what it can and can’t do, and whether it’s the right tool for you in 2026.
Punti chiave
- What it is: a free, open-source tool that downloads, manages, and runs open LLMs locally with one command — no cloud, no API keys, no data leaving your machine.
- How it works: it wraps the
llama.cppengine (and Apple’s MLX on Mac since v0.19) and handles model downloads, quantization, GPU allocation, and a REST API on port11434. - Who it’s for: developers and tinkerers who want the lowest-friction way to prototype with local models. It’s the “lowest regret” entry point in 2026.
- Who it isn’t for: high-concurrency production serving — for that, vLLM is roughly 16–20× faster under load.
- Cost: $0. It’s MIT-licensed and runs entirely on your hardware.
What Ollama actually is
Ollama is an open-source runtime for large language models that runs on your own computer — Mac, Windows, or Linux. Think of it as the “Docker for LLMs”: instead of wrestling with Python environments, model weights, and GPU drivers, you type one command and a model is running.
The pitch is simple: keep your data on your machine, pay nothing per token, and work offline. When you run ollama run gemma4, Ollama downloads the model, loads it into your GPU’s memory (or system RAM if you don’t have a GPU), and drops you into a chat prompt. That’s it.
Behind that simplicity, Ollama is doing a lot of work for you:
- Model management — pulling, versioning, and storing models from its registry, the way a package manager handles software.
- Quantization — automatically using compressed (GGUF) versions of models so a 27-billion-parameter model fits in consumer memory.
- GPU layer allocation — deciding how much of the model lives on your GPU versus CPU, based on the VRAM you have.
- Context and KV-cache management — handling the memory that grows as a conversation gets longer.
- A REST API — exposing everything on
http://localhost:11434so your own apps can talk to it.
How it works under the hood
Ollama is not itself an inference engine. It’s an experience layer wrapped around one. Under the hood it uses llama.cpp, the C++ engine that does the actual math of running a quantized model efficiently on CPUs and GPUs. As of v0.19 (March 2026), Ollama also uses Apple’s MLX backend on Apple Silicon — a change that delivered enormous speedups (on an M5 Max running Qwen 3.5, decode throughput nearly doubled).
The workflow looks like this:
- You run a command —
ollama run qwen3from the terminal, or a request to the API. - Ollama resolves the model — if it isn’t already downloaded, it pulls the GGUF weights from the registry.
- It loads the model into memory — splitting layers between GPU and CPU based on available VRAM.
- It serves responses — either interactively in your terminal or as JSON over the REST API.
That REST API is the part developers care about most. Any app that can make an HTTP request can use a local model through Ollama — and because Ollama added an OpenAI-compatible endpoint, a lot of existing code works by just changing the base URL.
What you can build with it
Ollama is the engine behind a huge range of local-AI projects in 2026:
- Private chatbots that never send a word to the cloud.
- Coding assistants — the newer
ollama launchcommand wires up tools like Claude Code, OpenCode, and Codex to a local or cloud model with no config files. - RAG systems using Ollama’s batch embedding API to index your own documents.
- Agents and automations that call local models for classification, extraction, or summarization at zero marginal cost.
- Structured-output pipelines — Ollama can now constrain a model’s output to a JSON schema, which makes it reliable for programmatic use.
Where Ollama fits among the alternatives
Ollama isn’t the only way to run models locally, and it isn’t always the best. Here’s the honest landscape:
| Strumento | Ideale per | Trade-off |
|---|---|---|
| Ollama | One-developer prototyping on any OS | Slow under heavy concurrency |
| LM Studio | A polished GUI to browse and chat with models | Less scriptable; desktop-first |
| vLLM | Multi-user production serving on GPUs | Complex setup; not local-first |
| llama.cpp | Maximum speed and embedded/edge hardware | Lowest-level; you assemble it yourself |
If you’re one person experimenting, Ollama wins on sheer convenience. The moment you need to serve many users at once, you’ll want to read our full breakdown of Ollama vs LM Studio vs vLLM vs llama.cpp.
Getting started in two minutes
The barrier to entry is genuinely tiny:
- Install it — download the app for your OS (see our step-by-step install guide).
- Pull and run a model —
ollama run gemma4for a strong all-rounder, orollama run qwen3for coding. - Talk to it — chat in the terminal, or point your app at
http://localhost:11434.
Before you pick a model, check that your machine can handle it — our guide to Ollama’s system requirements maps model sizes to the RAM and VRAM you actually need.
What hardware do you actually need?
Ollama will start on almost any machine with a CPU and 8 GB of RAM, but “starts” and “feels usable” are different questions. The single number that decides your experience is how much memory the model fits into, because the entire model has to sit in RAM (or, ideally, GPU VRAM) while it runs. A reliable rule of thumb is roughly 0.6 GB of memory per billion parameters at the default Q4_K_M quantization, plus a little headroom for context.
That math gives you a quick sizing guide for the most common model classes:
| Model class | Approx. download (Q4_K_M) | Comfortable memory |
|---|---|---|
| 7–8B (Llama 3.x, Mistral) | ~5 GB | 8 GB+ |
| 13–14B (Qwen, Phi) | ~9 GB | 16 GB+ |
| 32B | ~20 GB | 24 GB+ |
| 70B (Llama 3.3) | ~43 GB | 64 GB+ |
For most people the practical sweet spot is a GPU or Mac with around 16 GB of VRAM or unified memory — enough to run 7B–14B models at speeds that feel instant. A 16 GB RTX-class card or a 16 GB Apple Silicon Mac both land squarely in this zone.
Two architectural notes matter when you choose. A discrete NVIDIA GPU wins decisively whenever the model fits inside its VRAM, delivering the fastest tokens per second. Apple Silicon’s unified memory is the opposite trade-off: it shares all system RAM with the GPU, so a 64 GB or 128 GB Mac can run 32B–70B models that simply won’t load on a consumer graphics card — just at lower throughput. The crossover sits around the 24 GB model mark.
You può run Ollama with no GPU at all. A modern multi-core CPU handles a 7B model at a workable few-to-low-double-digit tokens per second, but large 70B models on CPU drop below one token per second — fine for overnight batch jobs, painful for chat. If interactive speed matters, GPU or Apple Silicon acceleration is the deciding factor.
Domande frequenti
Is Ollama free?
Yes. Ollama is open-source under the MIT license and completely free. The only “cost” is the hardware you run it on and the electricity it uses — there are no per-token charges because nothing goes to a cloud provider.
Does Ollama send my data anywhere?
No. By design, inference happens entirely on your machine. The only network traffic is downloading a model the first time you pull it. This is the main reason teams in healthcare, legal, and finance use it — sensitive prompts never leave the building.
Do I need a GPU to run Ollama?
No, but it helps a lot. Ollama runs on CPU alone for smaller models (a 2–3B model is comfortable on a modern laptop), and uses your GPU automatically when one is available. For models above ~13B parameters, a GPU or Apple Silicon with unified memory makes a big difference. See our system requirements guide for specifics.
What models can Ollama run?
Over 100 open models, including Meta’s Llama 3.3 and Llama 4, Google’s Gemma 4, Alibaba’s Qwen 3 series, DeepSeek V3 and R1, Mistral, and Microsoft’s Phi-4. Our pick of the migliori LLM locali da eseguire su Ollama breaks down which to use for which job.
Is Ollama better than ChatGPT?
Different tools. ChatGPT gives you a frontier model with no setup but sends your data to the cloud and charges a subscription. Ollama runs smaller open models locally, free and private, but a top local model still trails the very best cloud models on the hardest tasks. For privacy, cost, and offline use, Ollama wins; for raw capability on complex reasoning, the cloud frontier is still ahead.
What is the Ollama API port?
Ollama exposes its REST API on http://localhost:11434 by default. It also offers an OpenAI-compatible endpoint, so a lot of existing OpenAI-SDK code works by simply pointing the base URL at your local Ollama instance.
Can Ollama replace the OpenAI API in my existing app?
For most apps, yes. Ollama exposes an OpenAI-compatible endpoint at http://localhost:11434/v1, including the /v1/chat/completions route that most tools call. Point your OpenAI client’s base_url at it, pass any placeholder API key, and set the model field to an installed Ollama tag. Embeddings, vision, and tool-calling are supported too, so many projects switch by changing two lines. It covers parts of the OpenAI API rather than every parameter, so verify any exotic fields your app relies on.
Can I run Ollama without a GPU?
Yes. Ollama runs entirely on CPU when no compatible GPU is present — you just need enough system RAM to hold the model. A current multi-core CPU runs a 7B model at usable speeds, but throughput falls off sharply as models grow, and 70B-class models on CPU are too slow for interactive use. For day-to-day chat, a GPU or Apple Silicon Mac makes the difference between sluggish and snappy.
How much disk space do Ollama models take, and where are they stored?
Plan for the download sizes above: a 7B model is roughly 5 GB on disk, a 70B model around 43 GB, and pulling several models adds up quickly. By default they live under ~/.ollama/models (o C:Users<you>.ollamamodels on Windows). You can relocate that directory with the OLLAMA_MODELS environment variable, and remove anything you no longer need with ollama rm <model>.
Conclusione
Ollama won the local-LLM space in 2026 by doing one thing extremely well: removing friction. It’s free, private, runs on hardware you already own, and gets you from “I want to try a local model” to a running model in about two minutes. It isn’t the fastest option under heavy load, and a local model won’t beat the best cloud frontier on the hardest problems — but as the on-ramp to local AI, nothing else comes close. If you’re starting out, start here.
