RAG, end to end

2026-07-10 · 6 min read

Ask a raw language model about your company's refund policy and it will confidently make one up — because it was never trained on it, and it can't tell what it doesn't know. RAG — Retrieval-Augmented Generation — solves this the way a good student does an open-book exam: don't answer from memory, look it up first, then write.

The three steps

[Figure: The RAG pipeline. Question (what you asked) → Retrieve (top-k matching chunks from the knowledge base, i.e. your docs) → Augment (chunks + question) → LLM → Answer (grounded).]
Retrieve the relevant facts, paste them into the prompt, and let the model write only from what's in front of it.
  1. Retrieve. Your documents are pre-split into chunks and turned into embeddings. At query time, you embed the question and pull the handful of chunks that sit closest in meaning.
  2. Augment. You paste those chunks into the prompt: "Using the context below, answer the question. Context: …"
  3. Generate. The model writes an answer grounded in the supplied text — and, ideally, cites which chunk it used.
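
The three steps can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a production setup: `embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the sample chunks are invented, and the generate step is left as a prompt you would send to whichever model you use.

```python
import math
import re
from collections import Counter

# Invented sample chunks; in practice these come from your pre-split documents.
CHUNKS = [
    "Refunds are available within 30 days of purchase with a receipt.",
    "Our office is closed on public holidays.",
    "Shipping takes three to five business days.",
]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Step 1: return the k chunks closest to the question."""
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


def augment(question: str, context: list[str]) -> str:
    """Step 2: paste the retrieved chunks into the prompt."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Using the context below, answer the question.\n"
        f"Context:\n{joined}\n"
        f"Question: {question}"
    )


question = "Are refunds available after purchase?"
top_chunks = retrieve(question, CHUNKS, k=1)
prompt = augment(question, top_chunks)
# Step 3 (generate) would send `prompt` to the model of your choice.
print(prompt)
```

A real system would swap the word counts for dense vectors from an embedding model and keep them in a vector index, but the shape of the pipeline stays the same: rank, paste, ask.
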
The gist

Don't ask the model to remember your data — hand it the relevant pages at question time and ask it to read. Open-book beats closed-book.

Why it's the default for "chat with your docs"

  • Fresh & specific. Update a document and the answer updates — no retraining.
  • Fewer hallucinations. With the source text sitting in the prompt, the model's most probable answer is to restate a real passage rather than invent one.
  • Citable. You can show where each claim came from, which is often the whole point.
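
One way to make answers citable, sketched here under assumptions: number each chunk in the prompt, ask the model to cite those numbers, then map the numbers in its reply back to sources. The file names, the bracketed `[n]` convention, and the sample reply are all invented for illustration, and a real model may not follow the convention reliably.

```python
import re

# Hypothetical retrieved chunks, each tagged with the document it came from.
retrieved = [
    {"source": "refund-policy.md", "text": "Refunds are available within 30 days."},
    {"source": "shipping.md", "text": "Shipping takes three to five business days."},
]


def numbered_context(chunks: list[dict]) -> str:
    """Label each chunk [1], [2], ... so the model can point at it."""
    return "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))


def cited_sources(answer: str, chunks: list[dict]) -> list[str]:
    """Map [n] markers in the answer back to source names.

    Out-of-range numbers are ignored and each source is listed once.
    """
    sources = []
    for n in re.findall(r"\[(\d+)\]", answer):
        idx = int(n) - 1
        if 0 <= idx < len(chunks) and chunks[idx]["source"] not in sources:
            sources.append(chunks[idx]["source"])
    return sources


context = numbered_context(retrieved)
# Invented example of what a model reply might look like.
answer = "You can get a refund within 30 days [1]."
print(cited_sources(answer, retrieved))
```

Ignoring out-of-range numbers matters because a model can cite a chunk that was never supplied; dropping those markers is safer than showing a source that does not exist.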

Where it goes wrong

RAG is only as good as its retrieval. If the right chunk isn't fetched, the model has nothing to read from, fills the gap from memory, and hallucinates anyway. Chunk too big and you waste the context window; chunk too small and you sever the meaning. And retrieved content is untrusted input, so watch for prompt injection hiding in the documents you pull in. Get retrieval right, though, and a general model becomes an expert on your world without a single training run.
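
The chunk-size trade-off is easier to see with a concrete splitter. The sketch below splits on words with a fixed window and an overlap, so text cut at one boundary still appears intact in the neighbouring chunk. The function name `chunk_words` and the sizes are illustrative assumptions; real pipelines often split on tokens, sentences, or headings instead.

```python
def chunk_words(text: str, size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of `size` words, each sharing `overlap` words
    with the previous chunk."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # this window already reached the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(12))
for chunk in chunk_words(sample, size=5, overlap=2):
    print(chunk)
```

Raising `size` keeps more meaning together but spends more of the context window per retrieved chunk; raising `overlap` protects boundaries at the cost of storing and embedding duplicated text.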

