RAG vs Fine-tuning: How to Make a Base LLM Context-Aware
Base LLMs are trained on public data. They know a lot about the world in general and almost nothing about your organization's documents, codebase, or internal knowledge. When you need a model that can reason over private information, you have three tools available: prompt engineering, retrieval-augmented generation, and fine-tuning.
Each has a different cost-complexity tradeoff.
Prompt engineering
The simplest approach: include the relevant context directly in the prompt.
You are a helpful assistant. Answer questions about the following document:
{document_text}
Question: {user_question}
Works well when the relevant context is small and known ahead of time. Breaks down when the knowledge base grows beyond what fits in a context window, or when you do not know which documents are relevant before the user asks.
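In code, this amounts to filling the template and making a single model call. A minimal sketch, assuming the official openai Python client (the model name is illustrative):

```python
# Prompt stuffing: the whole document rides along in every request, so this
# only works while it fits comfortably in the model's context window.
from openai import OpenAI

client = OpenAI()

def answer(document_text: str, user_question: str) -> str:
    prompt = (
        "You are a helpful assistant. Answer questions about the following document:\n\n"
        f"{document_text}\n\n"
        f"Question: {user_question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```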
Fine-tuning
Fine-tuning adjusts the model's weights on your private data. The information gets "baked in" — no retrieval step, no context stuffing.
The cost is high. GPT-4's parameter count has never been confirmed, but credible estimates put it well over a trillion, and even small open models (7B–13B parameters) require significant GPU memory and training time to fine-tune properly. More importantly, a fine-tuned model is a snapshot: when your knowledge base updates, you retrain.
Fine-tuning is the right choice when you need the model to behave differently (adopt a specific style, follow domain-specific reasoning patterns) rather than just know different facts. For knowledge injection alone, it is usually overkill.
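In practice, 7B-class models are usually adapted with parameter-efficient methods rather than full fine-tuning. A minimal sketch, assuming the Hugging Face transformers and peft libraries (the model name and hyperparameters are illustrative, not a recommendation):

```python
# LoRA trains small adapter matrices instead of all ~7B base weights, which is
# what makes fine-tuning feasible on a single GPU in the first place.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable

# From here the model plugs into a standard training loop (e.g. transformers.Trainer)
# over your examples. Note the maintenance cost: when the knowledge changes, you retrain.
```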
Retrieval-augmented generation
RAG adds a retrieval layer between the user query and the LLM. Instead of hoping the model "knows" the answer, you look up the relevant context at query time and hand it to the model explicitly.
user query
│
▼
embed query → search vector DB → retrieve top-k chunks
│
▼
LLM(query + context) → answer
The knowledge base lives outside the model. Updating it means re-embedding new documents, not retraining.
The pipeline in full
Data preparation (offline)
- Extract text from raw sources (PDFs, HTML, databases)
- Chunk into segments the model can reason over (typically 200–500 tokens)
- Generate embeddings for each chunk
- Store in a vector database (ChromaDB, Pinecone, pgvector, etc.); a sketch of these steps follows
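A minimal sketch of the offline steps above, assuming ChromaDB and sentence-transformers (chunk size, model name, and paths are illustrative):

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection(name="docs")

def chunk(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    # Naive word-based chunking; real pipelines chunk on tokens and try to
    # respect document structure (headings, paragraphs) so passages stay intact.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def ingest(doc_id: str, text: str) -> None:
    chunks = chunk(text)
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        metadatas=[{"source": doc_id}] * len(chunks),
    )
```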
Retrieval (online, per query)
- Embed the user's query with the same model used for documents
- Run a nearest-neighbor search against the stored embeddings
- Return the top-k most semantically similar chunks (sketched below)
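A sketch of the retrieval step, reusing the same collection and embedding model as the ingestion sketch above:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Must be the same model that embedded the documents at index time.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="./rag_index").get_or_create_collection(name="docs")

def retrieve(query: str, k: int = 4) -> list[str]:
    results = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=k,
    )
    return results["documents"][0]  # top-k chunks for this single query
```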
Generation (online, per query)
- Inject retrieved chunks into the prompt as context
- Call the LLM with the grounded prompt
- Return the response (sketched below)
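And the generation step, assuming the hypothetical retrieve() helper from the previous sketch and the openai client:

```python
from openai import OpenAI

llm = OpenAI()

def answer_with_rag(question: str) -> str:
    # Ground the prompt in whatever the retriever found for this query.
    context = "\n\n".join(retrieve(question))  # retrieve() defined in the retrieval sketch
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```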
The embedding model used for documents and queries must be the same. Mismatched models are a common source of silent bugs: the retrieval step keeps running but returns garbage.
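One cheap guard, assuming you control both the ingestion job and the query path (a code-organization suggestion, not a library requirement): keep the model name in a single shared constant and record it on the collection, so an index built with a different model is at least auditable.

```python
import chromadb

EMBED_MODEL = "all-MiniLM-L6-v2"  # imported by BOTH the ingestion job and the query path

client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection(
    name="docs",
    metadata={"embed_model": EMBED_MODEL},  # stored with the index for later sanity checks
)
```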
Decision framework
| Situation | Recommended approach |
|---|---|
| Context is small and static | Prompt engineering |
| Large knowledge base, frequent queries | RAG |
| Knowledge updates frequently | RAG |
| Need to adapt model behavior or style | Fine-tuning |
| Both knowledge injection and style | Fine-tuning on top of RAG |
Start with prompt engineering. When the context window becomes a constraint, move to RAG. Reserve fine-tuning for when you need behavior change, not just knowledge access.
RAG tradeoffs
RAG is not free. Retrieval quality depends on embedding quality — a bad embedding model or poor chunking strategy produces irrelevant context, which produces worse answers than no context at all. And the added latency (embed query → search → retrieve) is measurable, particularly when the vector store is large or hosted remotely.
The failure modes are also subtle. If the relevant chunk was not indexed, the model will not find it. If the chunking split a key passage across two chunks, neither chunk may score high enough. Getting RAG right in production requires treating the retrieval component as seriously as the generation component.
That said, for most use cases — private document Q&A, knowledge base search, domain-specific assistants — RAG is the right first move. It is cheaper than fine-tuning, more flexible than prompt stuffing, and its failure modes are debuggable.