RAG vs Fine-Tuning: When to Use What in Production
Everyone Asks the Wrong Question
"Should we use RAG or fine-tuning?" is like asking "should we use a database or an API?" The answer is: it depends on what you're building, and sometimes you need both.
Here's the decision framework we use with every client.
RAG: When Your Data Changes
Retrieval-Augmented Generation is the right choice when:
- Your knowledge base updates frequently (docs, products, policies)
- You need citations and source attribution
- Accuracy on specific facts matters more than style
- You want to control exactly what the model can reference
How RAG Actually Works in Production
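Here's a simplified version of the pipeline. `embed`, `vectorStore`, `reranker`, and `llm` are stand-ins for your embedding model, vector database, reranker, and LLM client; `AgentResponse` and `SYSTEM_PROMPT` are assumed to be defined elsewhere in your codebase.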
```ts
async function ragPipeline(query: string): Promise<AgentResponse> {
  // 1. Embed the query
  const queryEmbedding = await embed(query);

  // 2. Retrieve relevant chunks
  const chunks = await vectorStore.search(queryEmbedding, {
    topK: 5,
    minScore: 0.75,
    filter: { status: "published" },
  });

  // 3. Rerank for relevance
  const reranked = await reranker.rank(query, chunks);
  const context = reranked.slice(0, 3);

  // 4. Generate with context
  const response = await llm.generate({
    system: SYSTEM_PROMPT,
    context: context.map((c) => c.text).join("\n\n"),
    query,
    temperature: 0.1, // Low temp for factual responses
  });

  // 5. Return with sources
  return {
    answer: response.text,
    sources: context.map((c) => ({
      title: c.metadata.title,
      url: c.metadata.url,
      score: c.score,
    })),
  };
}
```

RAG Pitfalls We See Constantly
- Chunking too aggressively — splitting mid-paragraph destroys context (see the chunker sketch after this list)
- No reranking step — vector similarity alone isn't enough
- Stale embeddings — your vectors need to update when your docs do
- Ignoring retrieval quality — if you retrieve garbage, the LLM generates garbage with confidence
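To make the chunking point concrete, here's a minimal paragraph-aware chunker. This is an illustrative sketch, not our production code; the `maxChars` budget and the blank-line splitting heuristic are assumptions you'd tune for your own documents.

```ts
// Minimal paragraph-aware chunker (illustrative sketch).
// Splits on blank lines and packs whole paragraphs into chunks,
// so no chunk ever starts or ends mid-paragraph.
function chunkByParagraph(text: string, maxChars = 1500): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    // Start a new chunk when adding this paragraph would blow the budget
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n\n${para}` : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```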
Fine-Tuning: When You Need a Different Model
Fine-tuning is the right choice when:
- You need the model to behave differently (tone, format, reasoning style)
- You have a specific task with consistent input/output patterns
- Latency matters and you can't afford the retrieval step
- You need the model to learn patterns, not just facts
When Fine-Tuning Actually Makes Sense
```python
# Good fine-tuning use case: structured extraction
# Input: messy customer emails
# Output: structured JSON with intent, urgency, entities
training_examples = [
    {
        "input": "Hey, my order #4521 hasn't arrived and it's been 2 weeks. This is ridiculous.",
        "output": {
            "intent": "order_status_complaint",
            "urgency": "high",
            "order_id": "4521",
            "sentiment": "frustrated",
            "days_waiting": 14
        }
    },
    # ... hundreds more examples
]
```

Fine-Tuning Pitfalls
- Not enough data — you need hundreds of high-quality examples minimum
- Overfitting to training data — the model memorizes instead of generalizing (the held-out check sketched after this list catches this early)
- Stale models — fine-tuned models don't update when your business changes
- Cost of iteration — every change means retraining
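One cheap guard against overfitting: hold out a slice of your examples before training and evaluate on both sets afterward. A minimal sketch, assuming a generic `Example` shape; the 10% holdout ratio is a convention, not a rule.

```ts
interface Example {
  input: string;
  output: unknown;
}

// Hold out ~10% of examples for validation (illustrative sketch).
// If training accuracy climbs while validation accuracy stalls or
// drops, the model is memorizing rather than generalizing.
function splitDataset(examples: Example[], holdout = 0.1) {
  // Fisher-Yates shuffle so the holdout slice is a random sample
  const shuffled = [...examples];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * (1 - holdout));
  return {
    train: shuffled.slice(0, cut),
    validation: shuffled.slice(cut),
  };
}
```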
The Decision Framework
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data changes frequently | ✅ Best choice | ❌ Requires retraining |
| Need source citations | ✅ Built-in | ❌ Not possible |
| Consistent output format | ⚠️ Possible with prompting | ✅ Best choice |
| Custom tone/personality | ⚠️ Prompt engineering | ✅ Best choice |
| Latency-sensitive | ⚠️ Retrieval adds ~200ms | ✅ No retrieval step |
| Small dataset (< 100 examples) | ✅ Works with any amount | ❌ Not enough data |
| Need to reason over private data | ✅ Best choice | ⚠️ Risk of data leakage |
| Budget-constrained | ✅ Use existing models | ⚠️ Training costs add up |
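The table collapses into a small routing function. This is purely illustrative; the factor names and the 100-example threshold are our shorthand, not a standard API or a hard rule.

```ts
interface ProjectFactors {
  dataChangesFrequently: boolean;
  needsCitations: boolean;
  needsCustomToneOrFormat: boolean;
  latencySensitive: boolean;
  exampleCount: number;
}

// Encodes the decision table above (illustrative sketch).
function chooseApproach(f: ProjectFactors): "rag" | "fine-tune" | "hybrid" {
  const ragSignals = f.dataChangesFrequently || f.needsCitations;
  const ftSignals =
    (f.needsCustomToneOrFormat || f.latencySensitive) && f.exampleCount >= 100;
  if (ragSignals && ftSignals) return "hybrid";
  if (ftSignals) return "fine-tune";
  return "rag"; // default: faster to prototype, easier to debug
}
```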
The Hybrid Approach (What We Usually Recommend)
Most production systems benefit from both:
```
User Query
    ↓
[Fine-tuned model for intent classification] → Fast, consistent
    ↓
[RAG pipeline for knowledge retrieval]       → Accurate, up-to-date
    ↓
[Base model for response generation]         → Flexible, contextual
    ↓
Response with sources
```
The fine-tuned model handles the how (format, tone, routing). RAG handles the what (facts, data, citations). The base model ties it together.
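Wired together, the hybrid flow looks roughly like this. A sketch under assumptions: `intentClassifier` and `baseModel` are hypothetical clients, and `retrieveContext` stands in for the retrieval half (embed, search, rerank) of the `ragPipeline` shown earlier.

```ts
// Hybrid pipeline sketch. All three clients are hypothetical
// stand-ins; adapt the calls to whatever SDKs you actually use.
async function handleQuery(query: string) {
  // 1. Fine-tuned model: the "how" (routing, tone, format)
  const intent = await intentClassifier.classify(query);

  // 2. RAG: the "what" (facts, data, citations)
  const context = await retrieveContext(query); // embed -> search -> rerank

  // 3. Base model ties it together
  const response = await baseModel.generate({
    system: `${SYSTEM_PROMPT}\nDetected intent: ${intent}`,
    context: context.map((c) => c.text).join("\n\n"),
    query,
  });

  return {
    answer: response.text,
    sources: context.map((c) => ({
      title: c.metadata.title,
      url: c.metadata.url,
    })),
  };
}
```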
What This Means for Your Project
Before you commit to either approach:
- Define your success metric — is it accuracy? Speed? Consistency?
- Audit your data — do you have enough quality examples for fine-tuning?
- Map your update frequency — how often does your knowledge change?
- Set a latency budget — can you afford the retrieval step?
- Start with RAG — it's faster to prototype and easier to debug
The companies that get AI right don't pick a technique first. They define the problem, then pick the tool. The ones that fail start with "we should use RAG" and work backwards.