RAG vs Fine-tuning: How to Make a Base LLM Context-Aware
Base LLMs are trained on public data. They know a lot about the world in general and almost nothing about your organization's documents, codebase, or internal knowledge. When you need a model that can reason over private information, you have three tools available: prompt engineering, retrieval-augmented generation, and fine-tuning.
Each has a different cost-complexity tradeoff.
Prompt engineering
The simplest approach: include the relevant context directly in the prompt.
You are a helpful assistant. Answer questions about the following document:
{document_text}
Question: {user_question}
Works well when the relevant context is small and known ahead of time. Breaks down when the knowledge base grows beyond what fits in a context window, or when you do not know which documents are relevant before the user asks.
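In code, this amounts to filling the template and making a single model call. A minimal sketch, assuming the official openai Python client (the model name is illustrative):

```python
# Prompt stuffing: the whole document rides along in every request, so this
# only works while it fits comfortably in the model's context window.
from openai import OpenAI

client = OpenAI()

def answer(document_text: str, user_question: str) -> str:
    prompt = (
        "You are a helpful assistant. Answer questions about the following document:\n\n"
        f"{document_text}\n\n"
        f"Question: {user_question}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; any chat model works
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```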
Fine-tuning
Fine-tuning adjusts the model's weights on your private data. The information gets "baked in" — no retrieval step, no context stuffing.
The cost is high. GPT-4's parameter count has never been confirmed, but credible estimates put it well over a trillion, and even small open models (7B–13B parameters) require significant GPU memory and training time to fine-tune properly. More importantly, a fine-tuned model is a snapshot: when your knowledge base updates, you retrain.
Fine-tuning is the right choice when you need the model to behave differently (adopt a specific style, follow domain-specific reasoning patterns) rather than just know different facts. For knowledge injection alone, it is usually overkill.
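In practice, 7B-class models are usually adapted with parameter-efficient methods rather than full fine-tuning. A minimal sketch, assuming the Hugging Face transformers and peft libraries (the model name and hyperparameters are illustrative, not a recommendation):

```python
# LoRA trains small adapter matrices instead of all ~7B base weights, which is
# what makes fine-tuning feasible on a single GPU in the first place.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")

lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable

# From here the model plugs into a standard training loop (e.g. transformers.Trainer)
# over your examples. Note the maintenance cost: when the knowledge changes, you retrain.
```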
Retrieval-augmented generation
RAG adds a retrieval layer between the user query and the LLM. Instead of hoping the model "knows" the answer, you look up the relevant context at query time and hand it to the model explicitly.
user query
│
▼
embed query → search vector DB → retrieve top-k chunks
│
▼
LLM(query + context) → answer
The knowledge base lives outside the model. Updating it means re-embedding new documents, not retraining.
The pipeline in full
Data preparation (offline)
- Extract text from raw sources (PDFs, HTML, databases)
- Chunk into segments the model can reason over (typically 200–500 tokens)
- Generate embeddings for each chunk
- Store in a vector database (ChromaDB, Pinecone, pgvector, etc.); a sketch of these steps follows
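A minimal sketch of the offline steps above, assuming ChromaDB and sentence-transformers (chunk size, model name, and paths are illustrative):

```python
import chromadb
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection(name="docs")

def chunk(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    # Naive word-based chunking; real pipelines chunk on tokens and try to
    # respect document structure (headings, paragraphs) so passages stay intact.
    words = text.split()
    step = size - overlap
    return [" ".join(words[i:i + size]) for i in range(0, len(words), step)]

def ingest(doc_id: str, text: str) -> None:
    chunks = chunk(text)
    collection.add(
        ids=[f"{doc_id}-{i}" for i in range(len(chunks))],
        documents=chunks,
        embeddings=embedder.encode(chunks).tolist(),
        metadatas=[{"source": doc_id}] * len(chunks),
    )
```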
Retrieval (online, per query)
- Embed the user's query with the same model used for documents
- Run a nearest-neighbor search against the stored embeddings
- Return the top-k most semantically similar chunks (sketched below)
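A sketch of the retrieval step, reusing the same collection and embedding model as the ingestion sketch above:

```python
import chromadb
from sentence_transformers import SentenceTransformer

# Must be the same model that embedded the documents at index time.
embedder = SentenceTransformer("all-MiniLM-L6-v2")
collection = chromadb.PersistentClient(path="./rag_index").get_or_create_collection(name="docs")

def retrieve(query: str, k: int = 4) -> list[str]:
    results = collection.query(
        query_embeddings=embedder.encode([query]).tolist(),
        n_results=k,
    )
    return results["documents"][0]  # top-k chunks for this single query
```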
Generation (online, per query)
- Inject retrieved chunks into the prompt as context
- Call the LLM with the grounded prompt
- Return the response (sketched below)
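And the generation step, assuming the hypothetical retrieve() helper from the previous sketch and the openai client:

```python
from openai import OpenAI

llm = OpenAI()

def answer_with_rag(question: str) -> str:
    # Ground the prompt in whatever the retriever found for this query.
    context = "\n\n".join(retrieve(question))  # retrieve() defined in the retrieval sketch
    prompt = (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )
    response = llm.chat.completions.create(
        model="gpt-4o-mini",  # illustrative
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content
```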
The embedding model used for documents and queries must be the same. Mismatched models are a common source of silent bugs: the retrieval step keeps running but returns garbage.
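One cheap guard, assuming you control both the ingestion job and the query path (a code-organization suggestion, not a library requirement): keep the model name in a single shared constant and record it on the collection, so an index built with a different model is at least auditable.

```python
import chromadb

EMBED_MODEL = "all-MiniLM-L6-v2"  # imported by BOTH the ingestion job and the query path

client = chromadb.PersistentClient(path="./rag_index")
collection = client.get_or_create_collection(
    name="docs",
    metadata={"embed_model": EMBED_MODEL},  # stored with the index for later sanity checks
)
```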
Decision framework
| Situation | Recommended approach |
|---|---|
| Context is small and static | Prompt engineering |
| Large knowledge base, frequent queries | RAG |
| Knowledge updates frequently | RAG |
| Need to adapt model behavior or style | Fine-tuning |
| Both knowledge injection and style | Fine-tuning on top of RAG |
Start with prompt engineering. When the context window becomes a constraint, move to RAG. Reserve fine-tuning for when you need behavior change, not just knowledge access.
RAG tradeoffs
RAG is not free. Retrieval quality depends on embedding quality — a bad embedding model or poor chunking strategy produces irrelevant context, which produces worse answers than no context at all. And the added latency (embed query → search → retrieve) is measurable, particularly when the vector store is large or hosted remotely.
The failure modes are also subtle. If the relevant chunk was not indexed, the model will not find it. If the chunking split a key passage across two chunks, neither chunk may score high enough. Getting RAG right in production requires treating the retrieval component as seriously as the generation component.
That said, for most use cases — private document Q&A, knowledge base search, domain-specific assistants — RAG is the right first move. It is cheaper than fine-tuning, more flexible than prompt stuffing, and its failure modes are debuggable.