LM Studio is the closest thing the local-AI world has to a “just works” desktop app. You download it, search a model from a built-in catalog, click load, and start chatting — no terminal, no Docker, no config files. Behind that friendly window sits the same llama.cpp and MLX engines that power most of the local-LLM ecosystem, plus a one-click server that mimics the OpenAI API so your existing code can talk to a model running on your own machine.
This guide takes you from zero to a running local model through the GUI. We cover what LM Studio actually is in mid-2026, how to install it on Windows, macOS, and Linux, how to pick a model and quantization that fits your hardware, how to flip on the local server, and roughly how much VRAM and RAM you need. We also draw an honest line between LM Studio and Ollama, because they solve overlapping but different problems.
Principaux enseignements
- LM Studio is a free desktop GUI built by Element Labs (the company behind LM Studio, founded by the app’s original creator) for running open-weight LLMs locally — free for personal and commercial use since July 8, 2025, with no license or form required.
- The latest stable release is 0.4.16 (June 8, 2026), which bumped the default context length to 8k tokens and shipped “Locally,” a companion mobile app for iPhone and iPad.
- It runs two engines: llama.cpp for GGUF models (NVIDIA/AMD/Intel/CPU) and MLX for Apple Silicon, with recent additions like tensor-parallel multi-GPU (0.4.15) and stable MTP speculative decoding (0.4.14).
- A built-in OpenAI-compatible server exposes any loaded model at
http://localhost:1234/v1— point any OpenAI SDK at that URL and it works with no code changes. - Hardware floor: AVX2 CPU, 16GB+ RAM recommended, and roughly 6–9GB VRAM for a comfortable 7B–13B model at Q4. macOS needs Apple Silicon and macOS 14+.
- Choose LM Studio for exploring and chatting; choose Ollama for headless servers and automation. They’re complementary, not rivals.
What LM Studio actually is
LM Studio is a desktop application that downloads and runs large language models entirely on your own hardware. Nothing leaves your machine. It bundles two inference engines: llama.cpp, which runs the widely used GGUF model format on NVIDIA, AMD, Intel, and CPU-only systems, and Apple’s MLX, which runs MLX-format models natively on M-series Macs. You get a model browser, a ChatGPT-style chat window, per-model inference settings, and a server toggle — all in one window.
The product is built by Element Labs, Inc., the company behind LM Studio, founded in 2023 by Yagil Burowski — the app’s original creator. As of July 8, 2025 it became free for use at work, dropping the previous requirement to request a separate commercial license. You and your team can install it and use it commercially with no form, no registration, and no fee. There is a separate paid LM Studio Enterprise tier for organizations that want advanced features like SSO, model/MCP gating, and private collaboration, but the core app most people want is free.
The current stable build is 0.4.16, released June 8, 2026. Recent versions have moved fast: 0.4.10 added OAuth for MCP servers, 0.4.14 shipped stable MTP speculative decoding (faster generation on models with multi-token-prediction heads), 0.4.15 added tensor parallelism for splitting a model across multiple GPUs, and 0.4.16 raised the default context window to 8k tokens and introduced “Locally,” a companion iPhone/iPad app that streams from your desktop over LM Link.
Installing LM Studio on Windows, macOS, and Linux
Installation is a normal app install — download the build for your OS from lmstudio.ai and run it. The catch is the platform requirements, which matter more than for typical software because LLMs lean hard on your CPU instruction set and memory.
| Platform | Requirement | Notes |
|---|---|---|
| Windows | x64 or ARM, AVX2 CPU | Snapdragon X Elite (ARM) supported; standard .exe installer |
| macOS | Apple Silicon (M1–M4), macOS 14.0+ | Intel Macs are unsupported; unlocks the MLX engine |
| Linux | x64 or ARM64, Ubuntu 20.04+ | Ships as an AppImage; distros beyond Ubuntu 22 are less tested |
Les AVX2 instruction set is mandatory on x64 systems. In practice that covers Intel Core chips from 4th generation (Haswell, 2013) onward and every AMD Ryzen, so any reasonably modern PC qualifies. The big gotcha is macOS: Intel Macs are not supported at all in current builds — you need an M-series chip. On Linux, the AppImage means there’s nothing to install system-wide; you make it executable and run it.
After first launch, LM Studio walks you through the Discover tab and, on a fresh install, may suggest a starter model. Don’t accept blindly — pick a model that matches your hardware, which is the next step.
Downloading and choosing a model
Open the Discover tab. The built-in downloader pulls models from Hugging Face, and you can search by keyword (“qwen”, “gemma”), by a specific user/model identifier, or by pasting a full Hugging Face URL. Each model lists several quantization variants — labels like Q4_K_M, Q5_K_M, or Q8_0. Quantization compresses the weights to shrink the file and the memory footprint, trading a little quality for a lot of size.
For most people, Q4_K_M is the sweet spot. It cuts a 7B model from roughly 13–14GB at full (FP16) precision down to about 4GB — around 70% smaller — while keeping the vast majority of output quality; on standard perplexity benchmarks the gap from full precision is small enough that it rarely shows in everyday chat. The “K_M” means medium K-quant: it spends more bits on the most sensitive tensors (such as attention output projections, kept at a higher precision) and fewer elsewhere. Go higher only if you have the headroom, lower only if you must.
Matching quantization to your VRAM
| Available VRAM | Recommended quant | Rule of thumb |
|---|---|---|
| Under 8GB | Q2_K / Q3_K_M | Stick to 7B–8B models |
| 8–12GB | Q4_K_M (recommended) | 7B comfortably; 13B fits a 12GB card |
| 12–16GB | Q5_K_M / Q6_K | Higher quality on mid models |
| 16–24GB | Q8_0 | Near-lossless on 7B–13B |
| 24GB+ | F16 (full precision) | Or larger models at Q4/Q5 |
Rough storage and memory sizes by model size at Q4: a 7B is about 4–5GB, a 13B is 8–9GB, a 30B is 18–20GB, and a 70B exceeds 40GB. A 13B at Q4_K_M occupies roughly 8–9GB of weights, so a 12GB GPU can host it (weights plus a modest KV cache) on the GPU; otherwise LM Studio offloads what fits and runs the rest on CPU, which is slower. Remember that the KV cache and context length add to these figures, so leave a couple of gigabytes of headroom. If you’re still picking your first model, our roundup of the best local LLMs to run on Ollama in 2026 maps almost one-to-one to LM Studio, since both use the same GGUF files.
On Apple Silicon, prefer MLX builds where available. On supported models, MLX-format builds are often faster than the equivalent GGUF on the same M-series chip — commonly in the ballpark of 10–40%, though the gap varies by model and can be near zero (and in some recent models GGUF even edges ahead). Quality is broadly comparable, but it isn’t always identical: GGUF’s mixed-precision Q4_K_M assigns more bits to sensitive layers, whereas MLX 4-bit is more uniform, so it’s worth comparing both for a model you’ll use heavily. LM Studio lets you switch format per model from the UI, so you can grab the MLX variant when one exists and fall back to GGUF when it doesn’t.
The built-in local server (OpenAI-compatible API)
This is the feature that turns LM Studio from a chat toy into a developer tool. Load a model, open the Developer/Server tab, and toggle the server on. LM Studio then serves an OpenAI-compatible REST API at http://localhost:1234/v1, exposing endpoints for chat completions, completions, embeddings, and responses. Any client that speaks the OpenAI Chat Completions schema — the Python openai SDK, the Node openai package, LangChain’s OpenAI wrapper, or a raw curl — connects by simply pointing its base_url / baseURL at that address.
There’s no real API key requirement and no network egress: requests stay on your machine, there are no rate limits, and there’s no per-token cost. In code, you typically pass a placeholder key like "lm-studio" and set the base URL, and existing OpenAI calls work unchanged. That makes LM Studio a clean drop-in for development, testing, and privacy-sensitive workloads where you can’t send data to a cloud API.
Where the server shines
- One toggle — no YAML, no separate daemon to configure
- Drop-in OpenAI compatibility; swap the base URL and ship
- Fully local: zero cost, no rate limits, no data leaving the box
- Great for prototyping agents and RAG against a free local model
Where it falls short
- Tied to the desktop GUI — not designed for headless servers or a VPS
- Higher idle memory overhead than a CLI runtime
- Single-box scope; no built-in clustering or load balancing
- For always-on production serving, a dedicated runtime fits better
If you outgrow a single desktop and need headless, always-on serving, that’s exactly the line where Ollama or a heavier engine takes over — see our Ollama vs LM Studio vs vLLM vs llama.cpp comparison pour une analyse complète.
Hardware and VRAM: what you actually need
The honest baseline is an AVX2 CPU and 16GB of system RAM (8GB will run small models, but you’ll feel the ceiling fast — short context, small models, and noticeable slowdowns). RAM matters even on GPU setups because any layers that don’t fit in VRAM spill to system memory.
For GPU acceleration, at least 4GB of dedicated VRAM is the recommended floor, and more is strictly better. A practical target for a smooth 7B–13B experience is an 8–12GB card. Larger models scale up fast: a 70B at Q4 needs roughly 40GB+ across VRAM and RAM, which is why running one comfortably typically means 48–64GB of system memory if you can’t fit it entirely on the GPU. On Apple Silicon, the unified memory architecture pools RAM and VRAM, so a 32GB or 64GB Mac punches above its weight for mid-sized models. If you’re shopping for a card specifically for this, our guide to the best GPUs for local LLMs in 2026 breaks down the price-per-gigabyte math.
LM Studio vs Ollama: which one is for you
These two get compared constantly, and the short answer is that they’re built for different people. Ollama is a developer-first CLI and HTTP service you run headless; LM Studio is a polished GUI you click. Both run GGUF models through llama.cpp, so raw speed per token is essentially the same for an identical model and quantization. The differences are about ergonomics and deployment.
| Dimension | LM Studio | Ollama |
|---|---|---|
| Primary interface | Desktop GUI | CLI + HTTP API |
| Idle footprint | Heavier (full GUI) | Lighter (background service) |
| Model format | GGUF + MLX | GGUF |
| OpenAI-compatible server | Yes, port 1234 | Yes, port 11434 |
| Headless / server use | Not the intended use | Designed for it |
| Meilleur pour | Exploring and chatting | Automation and deployment |
Pick LM Studio if you mostly want to chat with models on a laptop, browse and try many models with no friction, and avoid the terminal entirely — it gives Windows users especially a smooth, installer-driven experience. Pick Ollama if you’re wiring models into a codebase, deploying to a VPS, or scripting a pipeline. Many people run both: LM Studio to find and evaluate a model, Ollama to serve it in production. If you’re weighing GUI alternatives specifically, our Ollama vs Jan comparison covers another open-source contender in the same space.
FAQ
Is LM Studio free for commercial use?
Yes. Since July 8, 2025, LM Studio is free for both personal and commercial/workplace use, and you no longer need to request a separate license or fill out any form. There is an optional paid Enterprise tier for organizations wanting advanced administration features (such as SSO and model/MCP gating), but the standard app is free.
Does LM Studio work on Intel Macs?
No. Current LM Studio builds require Apple Silicon (M1 through M4 and their variants) and macOS 14.0 or newer. Intel-based Macs are unsupported. On Apple Silicon you also get the faster MLX engine in addition to GGUF.
What model format does LM Studio use?
LM Studio runs GGUF models through its bundled llama.cpp engine on virtually all hardware, and MLX-format models through Apple’s MLX engine on M-series Macs. GGUF is the single-file standard shared by LM Studio, Ollama, Jan, and GPT4All, so models are largely interchangeable across these tools.
What is the difference between Q4_K_M and Q8_0?
Both are quantization levels. Q4_K_M is 4-bit and roughly a third the size of full precision while keeping the large majority of quality — the recommended default for most hardware. Q8_0 is 8-bit, larger and effectively near-lossless, worth using only if you have 16–24GB of VRAM to spare.
How do I connect my code to LM Studio’s local server?
Enable the server in the Developer/Server tab with a model loaded, then point any OpenAI SDK’s base URL at http://localhost:1234/v1. No real API key is needed (pass any placeholder string), and existing OpenAI Chat Completions code works without other changes.
How much VRAM do I need to run a 7B model?
A 7B model at Q4_K_M is about 4–5GB on disk, and with the KV cache and overhead a card with 6–8GB of VRAM runs it comfortably and fully on the GPU. With less VRAM, LM Studio offloads the overflow to system RAM and CPU, which still works but runs slower.
Can I run LM Studio as a server on a VPS?
It’s not the intended use case. LM Studio is built around its desktop GUI, and the server toggle assumes a local machine. For headless, always-on hosting on a VPS, Ollama or a dedicated inference engine is the better fit.
Résultat
LM Studio is the easiest on-ramp to local LLMs in 2026, and it’s now genuinely free for any use. If you want to download a model, chat with it, and occasionally point your own code at a private OpenAI-compatible endpoint — all without touching a terminal — nothing else is this approachable. The 0.4.x line has also closed real gaps with features like tensor-parallel multi-GPU and speculative decoding, so it’s no longer just a beginner toy.
Where it stops short is deployment. The GUI overhead and desktop-bound server mean LM Studio isn’t the tool for headless production serving — that’s Ollama’s or vLLM’s job. The pragmatic move is to treat LM Studio as your exploration and chat workbench, lean on it to find the right model and quantization for your hardware, and reach for a dedicated runtime when you need to serve that model around the clock. For most individuals running models on a laptop or desktop, though, this is the first app to install.
