Retrieval-Augmented Generation
A technique combining a language model with external document retrieval to ground its answers.
What Is Retrieval-Augmented Generation
Retrieval-augmented generation, commonly abbreviated as RAG, is a technique that combines a large language model with an external retrieval step, so the model generates its response using relevant documents or data fetched at request time rather than relying solely on knowledge encoded in its training data.
How It Works
A typical RAG pipeline has two stages. First, given a query, a retrieval system searches an external knowledge source, such as a document store or vector database, and returns the passages most relevant to that query, often using semantic search based on embeddings rather than exact keyword matching. Second, those retrieved passages are inserted into the model's context window alongside the original query, and the model generates its answer grounded in that supplied material rather than from memory alone. The quality of the final answer depends heavily on the quality of retrieval: irrelevant or incomplete retrieved content tends to produce a correspondingly weak or inaccurate response.
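The two stages above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not a reference implementation. The three documents are invented, and a bag-of-words vector stands in for a learned embedding model, so unlike real semantic search it only matches shared words. `embed`, `cosine`, `retrieve`, and `build_prompt` are hypothetical names, not any library's API. The generation call itself is left out: `build_prompt` produces the text that would be placed in the model's context window.

```python
import math
import re
from collections import Counter

# Invented toy knowledge source; a real system would use a document store or vector database.
DOCUMENTS = [
    "The refund window for annual plans is 30 days from purchase.",
    "Support is available by email on weekdays.",
    "Annual plans can be upgraded at any time from the billing page.",
]


def embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts (bag of words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Done once, ahead of time: embed every document and keep the vectors.
INDEX = [(doc, embed(doc)) for doc in DOCUMENTS]


def retrieve(query: str, k: int = 2) -> list[str]:
    """Stage 1: rank all indexed passages by similarity to the query and keep the top k."""
    q = embed(query)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]


def build_prompt(query: str, passages: list[str]) -> str:
    """Stage 2 (input side): put the retrieved passages in the context next to the query."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


query = "How long is the refund window for annual plans?"
print(build_prompt(query, retrieve(query)))
```

In a production pipeline, `embed` would call an embedding model and `INDEX` would live in a vector database, but the shape of `retrieve` (score every candidate, sort, truncate to k) and the way its output feeds the prompt stay the same. The sketch also makes the dependence on retrieval quality concrete: whatever `retrieve` returns is all the model gets to see.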
Why It Matters
RAG addresses two structural limitations of language models: a fixed training cutoff and the lack of built-in access to private or frequently changing data. Instead of fine-tuning a model every time the underlying information changes, an application can update its retrieval source, such as a document index, and the model will reflect the new information on the next request that retrieves it. RAG also tends to reduce hallucination, because the model has concrete source material to draw from and can be instructed to answer using only the retrieved content or to cite it directly.
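The update-instead-of-retrain workflow and the grounding instruction described above can be sketched as follows. Everything here is hypothetical: `DocumentIndex`, `upsert`, `search`, `tokens`, and `grounded_prompt` are illustrative names rather than any library's API, scoring is simple word overlap rather than embedding search, and the instruction text is only one possible phrasing.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased word set; a crude stand-in for real retrieval scoring."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class DocumentIndex:
    """Hypothetical in-memory retrieval source, updated without touching any model."""

    def __init__(self) -> None:
        self.docs: dict[str, str] = {}

    def upsert(self, doc_id: str, text: str) -> None:
        # Adding or replacing a document is visible to the very next search.
        self.docs[doc_id] = text

    def search(self, query: str, k: int = 3) -> list[tuple[str, str]]:
        # Score by number of shared words and drop documents with no overlap.
        q = tokens(query)
        scored = [(len(q & tokens(text)), doc_id, text) for doc_id, text in self.docs.items()]
        scored = [s for s in scored if s[0] > 0]
        scored.sort(key=lambda s: s[0], reverse=True)
        return [(doc_id, text) for _, doc_id, text in scored[:k]]


def grounded_prompt(query: str, hits: list[tuple[str, str]]) -> str:
    """Tell the model to answer only from the supplied sources and cite them by id."""
    sources = "\n".join(f"[{doc_id}] {text}" for doc_id, text in hits)
    return (
        "Answer using only the sources below, citing each claim by its [id]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )


index = DocumentIndex()
index.upsert("pricing", "The Pro plan costs 20 dollars per month.")
before = index.search("Pro plan price")
# The price changes: update the index; no retraining or fine-tuning is involved.
index.upsert("pricing", "The Pro plan costs 25 dollars per month.")
after = index.search("Pro plan price")
prompt = grounded_prompt("How much is the Pro plan?", after)
print(prompt)
```

Here `before` still holds the old price while `after` returns the new one, which is the point made above: changing what the model sees is a data update, not a training run. Whether a particular model reliably follows the answer-only-from-sources instruction is a separate question that has to be checked in practice.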
RAG vs Fine-Tuning
RAG and fine-tuning are often confused but address different needs. RAG supplies external, current, or private knowledge without changing the model's parameters, making it well suited to fast-changing information and to keeping data separate from the model itself. Fine-tuning changes how the model behaves or writes by updating its parameters, which suits stylistic or behavioral adaptation more than knowledge that changes often. The two techniques are complementary and are frequently combined; a common production pattern is a model fine-tuned to answer in a consistent format, with the content of those answers grounded in retrieved passages.
Common Use Cases
- Question answering over internal documentation or a knowledge base.
- Customer support assistants that need current product information.
- Coding agents that retrieve relevant source files before generating a change.