Retrieval-augmented generation stopped being a research curiosity years ago. In 2026 it is the default way to put an LLM in front of your own documents without paying to fine-tune a model or risking it confidently inventing answers. The pattern is simple to describe and full of sharp edges to implement: find the right text, hand it to the model, and let the model write the answer.
This is a build guide, not a survey. By the end you will know exactly which components a working RAG pipeline needs in 2026, which specific tools and model versions to reach for, and a minimal code sketch you can run locally or against an API. We verified every version number, price, and benchmark below against current sources — because the worst RAG bug is the one you copy from a blog post written for last year’s libraries.
Key takeaways
- Six stages, in order: chunk, embed, store, retrieve, rerank, generate. Skip the reranker and your top results are noticeably worse; skip evaluation and you’ll never know.
- Boring chunking wins. Recursive splitting at ~512 tokens with 10–20% overlap beat fancy semantic chunking (69% vs 54% accuracy) in a 2026 benchmark. Start there.
- Embeddings: nomic-embed-text (768 dims, free, local) for prototypes; OpenAI text-embedding-3-large ($0.13/1M tokens, 3072 dims) or Voyage-3.5 for quality at scale.
- Vector DB: pgvector if you already run Postgres; Qdrant v1.18 (Apache 2.0, Rust) when you need fast filtered search; Chroma for quick local work.
- Frameworks: LangChain 1.x (LangGraph runtime) for agentic flows, LlamaIndex 0.14.x for retrieval-heavy apps — and you can run a useful pipeline in ~40 lines without either.
- Add a reranker. Cohere Rerank 3.5 ($2 per 1,000 searches) or open-source BGE-reranker-v2-m3 (free, ~50–100ms on GPU) cheaply lifts top-k relevance.
How a RAG pipeline actually works
A RAG system has two phases. Indexing happens once (or whenever your documents change): you split source files into chunks, convert each chunk to a vector with an embedding model, and store those vectors in a database. Querying happens on every request: you embed the user’s question, find the most similar chunks, optionally rerank them, paste the best ones into a prompt, and call an LLM.
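To make the two phases concrete, here is a toy sketch in plain Python. It assumes a bag-of-words counter as a stand-in for a real embedding model and a Python list as a stand-in for a vector database; the names `embed`, `cosine`, and `retrieve` are illustrative and not taken from any library.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Indexing phase: runs once, or whenever the documents change.
chunks = [
    "Refunds are accepted within 30 days of purchase.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]
index = [(chunk, embed(chunk)) for chunk in chunks]

# Query phase: runs on every request.
def retrieve(question: str, k: int = 2) -> list[str]:
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top = retrieve("How many days do refunds take?", k=1)
print(top)
```

A real pipeline swaps `embed` for a neural embedding model and the list for a vector database, but the shape of the two phases stays the same.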
That is the whole idea. The engineering is in the details — chunk size, which embedding model, how many results to retrieve, whether to rerank, and how you measure if any of it works. If you want the conceptual background before building, our RAG explainer covers the theory; this piece is about wiring it up. And if you’re still deciding between RAG and customizing the model itself, the fine-tuning vs RAG comparison is the right place to start — for most teams feeding private, changing data to an LLM, RAG is the cheaper and more maintainable answer.
Step 1: Chunk your documents
Embedding models have a context limit and, more importantly, lose precision on long passages. So you split documents into chunks. The 2026 consensus, backed by benchmarks rather than vibes, is unglamorous: use a recursive character splitter targeting roughly 512 tokens with 10–20% overlap (50–100 tokens).
A February 2026 evaluation across 50 real documents found that naive recursive splitting at 512 tokens scored 69% retrieval accuracy, while semantic chunking, which tries to split on meaning boundaries, scored only 54%. The reason is mundane: semantic chunking produced fragments averaging 43 tokens, too small to give the model enough context to answer. Meanwhile, a separate January 2026 study using SPLADE retrieval found that overlap added indexing cost with no measurable benefit on its dataset, so treat the 10–20% overlap as a starting default to test rather than a guaranteed win. The honest takeaway: start with fixed-size recursive chunks, and only reach for semantic or page-level chunking if your evaluation metrics prove you need it on your specific documents.
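The size-and-overlap mechanics can be sketched in a few lines. This is a simplified sliding-window splitter, not a true recursive splitter (which prefers paragraph and sentence boundaries before falling back to fixed sizes), and it assumes whitespace-separated words as a rough proxy for tokens. The function name `chunk_words` is hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Fixed-size sliding-window chunker; whitespace words stand in for tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

# 1,200 synthetic "tokens", 512 per chunk, 64 shared between neighbours (12.5%).
sample = " ".join(f"w{i}" for i in range(1200))
pieces = chunk_words(sample, chunk_size=512, overlap=64)
print(len(pieces), [len(p.split()) for p in pieces])
```

Setting `overlap=0` gives you the no-overlap variant, which makes it easy to compare both settings in your own evaluation.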
Step 2: Choose an embedding model
This is the most consequential decision in the pipeline, and the gap between options is real. Here are the choices worth considering in mid-2026, with verified numbers.
| Model | Dimensions | Context | Price / 1M tokens | Notes |
|---|---|---|---|---|
| nomic-embed-text v1.5 | 768 (MRL 64–768) | 8,192 | Free (local) | 274MB; the default local pick |
| mxbai-embed-large | 1024 | 512 | Free (local) | 670MB; higher quality, short context |
| BGE-M3 | 1024 + sparse | 8,192 | Free (local) | MIT license, 100+ languages |
| OpenAI text-embedding-3-small | 1536 | 8,191 | $0.02 | Cheap API baseline |
| OpenAI text-embedding-3-large | 3072 | 8,191 | $0.13 | $0.065 via Batch API |
| Voyage-3.5 | 2048 (MRL 256–2048) | 32,000 | $0.06 | Beats 3-large by ~8% on retrieval |
| Gemini Embedding | 3072 | — | API | Tops MTEB v2 (~68.3) |
For a prototype, start local with nomic-embed-text — it’s fast, free, fits on a 16GB laptop, and reportedly beats OpenAI’s older text-embedding-ada-002. For production, the open-source field has genuinely caught up: BGE-M3 is the MIT-licensed workhorse most self-hosted stacks default to, while Voyage-3.5 and Gemini Embedding lead the managed-API benchmarks. The one rule that matters: whatever you embed your documents with, you must embed your queries with the same model. Mixing models silently destroys retrieval.
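One way to enforce the same-model rule is to record the embedding model's name alongside the index and refuse vectors that come from anything else. The sketch below assumes a hypothetical in-memory `VectorIndex` class; real vector databases do not necessarily perform this check for you, so it usually lives in your own wrapper code.

```python
from dataclasses import dataclass, field

@dataclass
class VectorIndex:
    """Hypothetical in-memory index that remembers which model built it."""
    model_name: str
    vectors: list[list[float]] = field(default_factory=list)

    def add(self, vector: list[float], model_name: str) -> None:
        self._check(model_name)
        self.vectors.append(vector)

    def query(self, vector: list[float], model_name: str) -> int:
        """Return the position of the stored vector with the highest dot product."""
        self._check(model_name)
        scores = [sum(a * b for a, b in zip(vector, v)) for v in self.vectors]
        return scores.index(max(scores))

    def _check(self, model_name: str) -> None:
        # Vectors from different models live in unrelated spaces, so fail loudly.
        if model_name != self.model_name:
            raise ValueError(
                f"index built with {self.model_name!r}, got vector from {model_name!r}"
            )

index = VectorIndex(model_name="nomic-embed-text")
index.add([1.0, 0.0], model_name="nomic-embed-text")
index.add([0.0, 1.0], model_name="nomic-embed-text")
best = index.query([0.1, 0.9], model_name="nomic-embed-text")

try:
    index.query([0.1, 0.9], model_name="text-embedding-3-small")
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
```

Failing with an explicit error is the point of the design: without the check, a mismatched query still returns results, just meaningless ones.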
Step 3: Store vectors in a vector database
Once you have embeddings, they need to live somewhere that supports fast nearest-neighbor search. You have three sensible tiers in 2026.
Reach for these
- pgvector 0.8 if you already run Postgres. With an HNSW index it serves 1M vectors at a p95 latency in the single-digit to low-double-digit millisecond range. Version 0.8 added iterative scans so filtered queries return enough results. No new infrastructure.
- Qdrant v1.18 (Apache 2.0, Rust) when filtering matters. Its ACORN algorithm (added in 1.16) tackles the classic “filter kills my recall” problem by widening the HNSW search under restrictive filters, and is among the strongest options for filtered search. One Docker command to self-host.
- Chroma for local prototyping. Best developer experience, embedded mode, zero ops — perfect until you outgrow it.
Watch out for
- Managed services bill by usage and surprise people: at 100M vectors, Pinecone can run $5,000+/month versus a far cheaper self-hosted Qdrant or pgvector on your own VMs. Audit before you scale.
- HNSW index builds are slow at scale, and the index can hit ~8GB for 1M vectors at 1536 dims (use halfvec to roughly halve that).
- Storage hardware dominates throughput: the same pgvector setup did ~410 QPS on cloud SSD versus 2,150 QPS on NVMe.
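The index-size figure above is easy to sanity-check with arithmetic. The sketch below counts only the raw vector payload, assuming 4 bytes per dimension for float32 vectors and 2 bytes per dimension for half-precision (`halfvec`); the HNSW graph links and per-row overhead come on top, which presumably accounts for the gap up to roughly 8GB.

```python
def raw_vector_bytes(num_vectors: int, dims: int, bytes_per_dim: int) -> int:
    """Bytes needed for the vector payload alone (no graph links, no row overhead)."""
    return num_vectors * dims * bytes_per_dim

full = raw_vector_bytes(1_000_000, 1536, 4)  # float32 storage
half = raw_vector_bytes(1_000_000, 1536, 2)  # float16 storage (halfvec)

print(f"float32: {full / 1e9:.2f} GB, float16: {half / 1e9:.2f} GB")
# prints: float32: 6.14 GB, float16: 3.07 GB
```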
A deeper breakdown lives in our vector database guide, but for most teams the decision tree is short: already on Postgres → pgvector; need heavy filtering or billions of vectors → Qdrant or Milvus; just experimenting → Chroma.
Step 4: Retrieve and rerank
Retrieval itself is one call: embed the query, ask the database for the top-k nearest chunks (k of 20–50 is typical). But raw vector similarity is a blunt instrument. A reranker — a cross-encoder that scores each query-document pair individually — re-sorts those candidates and surfaces the genuinely relevant ones before they reach the model.
The standard pattern: retrieve top 50 with your bi-encoder, rerank, keep the top 5–10. Cohere Rerank 3.5 costs $0.002 per search ($2 per 1,000) and typically adds on the order of 100–300ms of latency. If you have a GPU and want zero per-query cost, the open-source BGE-reranker-v2-m3 runs in ~50–100ms and supports multilingual content. Reranking is one of the highest-leverage, lowest-effort upgrades you can make — most pipelines that “retrieve garbage” are missing this step.
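The retrieve-wide, rerank-narrow pattern can be sketched with two pluggable scoring functions. Both scorers here are toys chosen for illustration: `word_overlap` stands in for fast bi-encoder similarity and `phrase_match` stands in for a slower, more precise cross-encoder. Neither is how Cohere Rerank or BGE-reranker actually scores pairs.

```python
from typing import Callable

Scorer = Callable[[str, str], float]

def retrieve_then_rerank(
    query: str,
    docs: list[str],
    cheap_score: Scorer,
    precise_score: Scorer,
    fetch_k: int = 50,
    keep_k: int = 5,
) -> list[str]:
    """Stage 1: cheap scorer over everything. Stage 2: precise scorer over the shortlist."""
    candidates = sorted(docs, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    return sorted(candidates, key=lambda d: precise_score(query, d), reverse=True)[:keep_k]

def word_overlap(query: str, doc: str) -> float:
    """Toy stage-1 score: number of shared lowercase words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))

def phrase_match(query: str, doc: str) -> float:
    """Toy stage-2 score: 1.0 if the document contains the query as a phrase."""
    return 1.0 if query.lower() in doc.lower() else 0.0

docs = [
    "The window in the lobby was replaced after the refund dispute.",
    "Our refund window is 30 days.",
    "Shipping takes five days.",
]
result = retrieve_then_rerank(
    "refund window", docs, word_overlap, phrase_match, fetch_k=2, keep_k=1
)
print(result)
```

The first two documents tie on word overlap, so stage 1 alone cannot separate them; the second stage is what pushes the relevant one to the top.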
Step 5: Augment the prompt and generate
Now assemble the prompt: a short system instruction telling the model to answer only from the supplied context, the reranked chunks, and the user’s question. Then call your LLM.
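A minimal sketch of that assembly step follows. It assumes the chunks arrive best-first from the reranker and that a simple character budget is an acceptable way to cap context size; the function name `build_prompt`, the `[n]` source labels, and the 6,000-character default are illustrative choices, not a standard.

```python
def build_prompt(question: str, chunks: list[str], max_context_chars: int = 6000) -> str:
    """Build a grounded prompt from best-first chunks, stopping at a character budget."""
    selected, used = [], 0
    for i, chunk in enumerate(chunks, start=1):
        if used + len(chunk) > max_context_chars:
            break  # chunks are ordered best-first, so drop the tail
        selected.append(f"[{i}] {chunk}")
        used += len(chunk)
    context = "\n\n".join(selected)
    return (
        "Answer using ONLY the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt(
    "What is the refund window?",
    ["Refunds are accepted within 30 days.", "Gift cards are non-refundable."],
    max_context_chars=40,  # tiny budget so only the first chunk fits
)
print(prompt)
```

Numbering the chunks also lets you ask the model to cite which source it used, which helps when debugging wrong answers.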
For the generation model you can go local or API. Locally via Ollama, the 2026 sweet spot is an 8B-class model — Qwen3 8B or Llama 3.1 8B at Q4_K_M quantization — which fits in 8–12GB of VRAM and runs at 40+ tokens/second on a modern GPU. Qwen3 14B (~8–9GB at Q4) is a strong step up with a 128K context window for stuffing in more retrieved text. For a hosted, higher-ceiling option, a frontier API model works well; our Claude API chatbot tutorial walks through that path end to end. A useful reminder from practitioners: for RAG, retrieval quality usually matters more than model size — clean chunks plus a good embedder plus a small LLM beats a huge model fed bad context.
Step 6: A minimal code sketch
Here is a complete local pipeline using LangChain 1.x, Chroma, and Ollama. It indexes a document and answers a question — no API keys required.
```python
# pip install langchain langchain-community langchain-chroma langchain-ollama
# Assumes Ollama is running locally with both models pulled:
#   ollama pull nomic-embed-text && ollama pull qwen3:8b
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_ollama import OllamaEmbeddings, ChatOllama
from langchain_chroma import Chroma

# 1. Load + chunk (~512 tokens, ~15% overlap; sizes are in characters)
docs = TextLoader("handbook.txt").load()
chunks = RecursiveCharacterTextSplitter(
    chunk_size=2000, chunk_overlap=300
).split_documents(docs)

# 2. Embed + 3. Store (in-memory Chroma collection)
embeddings = OllamaEmbeddings(model="nomic-embed-text")
store = Chroma.from_documents(chunks, embeddings)

# 4. Retrieve (top 4)
retriever = store.as_retriever(search_kwargs={"k": 4})

# 5. Augment + generate
llm = ChatOllama(model="qwen3:8b")
question = "What is the refund window?"
context = "\n\n".join(d.page_content for d in retriever.invoke(question))
prompt = (f"Answer using ONLY the context. If it's not there, say so.\n\n"
          f"Context:\n{context}\n\nQuestion: {question}")
print(llm.invoke(prompt).content)
```
That’s the whole loop. To add reranking, insert a ContextualCompressionRetriever with a cross-encoder between steps 4 and 5. With LlamaIndex 0.14.x the same flow is typically less code thanks to its purpose-built retrieval abstractions — it’s the better choice for retrieval-heavy apps, while LangChain’s LangGraph runtime shines when you need stateful, multi-step agents. (Choosing an orchestration layer is its own topic; see our AI agent frameworks comparison.)
Step 7: Evaluate — don’t skip this
The difference between a demo and a product is measurement. The standard tool is RAGAS, which scores faithfulness (did the answer stick to the context?), context precision, and context recall using an LLM as judge. Build a small set of 20–50 question-answer pairs from your real documents and run it on every change.
This is also how you make every earlier decision honestly. Should you switch to semantic chunking? Add a reranker? Bump k from 4 to 8? Don’t guess — change one variable, rerun RAGAS, and keep the change only if the numbers improve. Without this loop you’re tuning blind.
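The change-one-variable loop can be illustrated with a much simpler metric than RAGAS. The sketch below computes a retrieval-only hit rate at k, assuming each test question is labeled with the ID of the chunk that should be retrieved. It is not RAGAS and involves no LLM judge; `hit_rate_at_k`, `toy_retrieve`, and the document IDs are hypothetical.

```python
from typing import Callable

def hit_rate_at_k(
    retrieve: Callable[[str, int], list[str]],
    test_set: list[tuple[str, str]],
    k: int,
) -> float:
    """Fraction of questions whose expected chunk ID appears in the top-k results."""
    hits = sum(1 for question, expected in test_set if expected in retrieve(question, k))
    return hits / len(test_set)

# Toy retriever: a fixed ranking per question, standing in for a real pipeline.
ranked = {
    "refund window?": ["doc-shipping", "doc-refunds"],
    "office hours?": ["doc-hours", "doc-refunds"],
}

def toy_retrieve(question: str, k: int) -> list[str]:
    return ranked[question][:k]

test_set = [("refund window?", "doc-refunds"), ("office hours?", "doc-hours")]

score_k1 = hit_rate_at_k(toy_retrieve, test_set, k=1)
score_k2 = hit_rate_at_k(toy_retrieve, test_set, k=2)
print(score_k1, score_k2)
```

Here raising k from 1 to 2 lifts the hit rate from 0.5 to 1.0, which is the kind of evidence that justifies keeping a change. RAGAS adds answer-level metrics such as faithfulness on top of this sort of retrieval check.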
FAQ
How much does it cost to run a RAG pipeline?
Almost free to prototype. With local Ollama embeddings, Chroma, and a local LLM, your only cost is electricity. At scale, the main bills are the vector DB (a self-hosted Qdrant or pgvector instance on your own VM is dramatically cheaper than managed offerings, which can exceed $5,000/month at 100M vectors) and, if you use APIs, embeddings (OpenAI text-embedding-3-large is $0.13 per million tokens) plus generation calls.
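A back-of-the-envelope estimate, using the per-token and per-search prices quoted in this guide, might look like the following. The corpus size and search volume are made-up inputs for illustration, and the estimate ignores chunk overlap, which would raise the embedded token count by roughly the overlap percentage.

```python
# Prices per 1M tokens, as quoted in the embedding table in this guide.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "voyage-3.5": 0.06,
}
RERANK_PRICE_PER_SEARCH = 0.002  # Cohere Rerank 3.5, $2 per 1,000 searches

def embedding_cost_usd(total_tokens: int, model: str) -> float:
    """One-time cost to embed a corpus of the given size."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]

def rerank_cost_usd(searches: int) -> float:
    """Recurring cost of reranking the given number of searches."""
    return searches * RERANK_PRICE_PER_SEARCH

corpus_tokens = 10_000 * 2_000  # hypothetical: 10,000 documents of ~2,000 tokens each
embed_cost = embedding_cost_usd(corpus_tokens, "text-embedding-3-large")
monthly_rerank = rerank_cost_usd(100_000)  # hypothetical: 100,000 searches per month

print(f"embed once: ${embed_cost:.2f}, rerank per month: ${monthly_rerank:.2f}")
```

With these inputs, embedding 20M tokens is a one-time cost of a few dollars, while reranking is the recurring line item, which is why the self-hosted reranker becomes attractive at volume.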
Do I need a vector database, or can I use a regular one?
You need vector search, but not necessarily a dedicated product. pgvector adds it to PostgreSQL and handles 1M vectors at low p95 latency (single-digit ms on NVMe, higher on cloud SSD), so if you already run Postgres you can avoid new infrastructure entirely. Reach for a dedicated DB like Qdrant when you need heavy metadata filtering or billions of vectors.
What chunk size should I use?
Start at roughly 512 tokens with 10–20% overlap using a recursive splitter. A 2026 benchmark found this beat semantic chunking 69% to 54% on retrieval accuracy. Only move to more sophisticated chunking if your evaluation metrics show it helps on your specific documents.
Is a reranker actually necessary?
Not to get something working, but it’s one of the cheapest quality upgrades available. Retrieve a wide set (top 50), rerank with Cohere Rerank 3.5 or open-source BGE-reranker-v2-m3, and keep the top 5–10. Most pipelines that surface irrelevant chunks are simply missing this step.
Can I build RAG without LangChain or LlamaIndex?
Yes. The core loop — embed, search, prompt, generate — is about 40 lines of plain Python calling your embedding model, vector DB client, and LLM directly. Frameworks save time on loaders, rerankers, and agentic orchestration, but they’re optional, and a from-scratch build gives you full control over every step.
Should I use a local model or an API for generation?
Local (via Ollama, with an 8B model on 8–12GB of VRAM) is great for privacy, cost control, and offline use. An API gives you a higher quality ceiling and zero ops. Many teams prototype locally to iterate cheaply, then choose per-deployment based on data-sensitivity and budget.
How do I keep the index fresh as documents change?
Re-embed and upsert only what changed rather than rebuilding everything. Track a content hash or modified-date per source document, and on update delete the old chunks for that document and insert new ones. Most vector DBs support upserts and deletes by metadata filter, which makes incremental updates straightforward.
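The content-hash approach can be sketched as a small planning function. It assumes you keep a mapping from document ID to the hash that was current when the document was last indexed; `plan_updates` and the example IDs are hypothetical, and the actual upsert and delete calls depend on your vector database client.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_updates(
    current_docs: dict[str, str],
    indexed_hashes: dict[str, str],
) -> tuple[list[str], list[str]]:
    """Return (IDs to re-embed and upsert, IDs whose chunks should be deleted)."""
    to_upsert = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or changed
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_upsert, to_delete

# State recorded at the last indexing run.
indexed = {
    "a": content_hash("old text of a"),
    "b": content_hash("text of b"),
    "c": content_hash("text of c"),
}
# Documents as they exist now: "a" changed, "b" unchanged, "c" removed, "d" added.
current = {"a": "new text of a", "b": "text of b", "d": "brand new document"}

to_upsert, to_delete = plan_updates(current, indexed)
print(to_upsert, to_delete)
```

For each ID in `to_upsert` that already exists in the index, delete its old chunks before inserting the new ones, so stale fragments never linger next to fresh ones.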
Bottom line
Building a RAG pipeline in 2026 is genuinely approachable: six stages, a handful of mature tools, and roughly 40 lines of code to a working prototype. The traps are not in the architecture — they’re in the defaults. Use boring 512-token chunks, match your query and document embedders, add a reranker, and never tune without RAGAS in the loop. Start local and free with nomic-embed-text, Chroma, and an 8B Ollama model; graduate individual components to pgvector, Qdrant, Voyage, or a frontier API only when your evaluation numbers — not a blog post — tell you to. Get the retrieval right and a small model will carry you surprisingly far.
