Fine-Tuning vs. RAG — The Decision Framework for Production AI
ScaledByDesign
Tags: ai, fine-tuning, rag, llm, machine-learning
The $200K Fine-Tuning Mistake
A client spent $200K fine-tuning GPT-4 on their customer support data. Three months of labeling, training, and evaluation. The result: a model that answered historical questions well but couldn't answer anything about products launched after the training cutoff. They needed RAG — not fine-tuning.
Different problem, different solution. Here's how to pick the right one.
What Each Approach Does
RAG (Retrieval-Augmented Generation):
→ Augments the LLM with external knowledge at query time
→ Knowledge base can be updated without retraining
→ Model stays generic; context makes it specific
→ Cost: per-query retrieval + generation
→ Setup: days to weeks
Fine-Tuning:
→ Modifies the model's weights based on your data
→ Changes the model's behavior, style, or capabilities
→ Knowledge baked into model weights (static)
→ Cost: training compute + inference
→ Setup: weeks to months
The Decision Matrix
Use RAG when:
✓ Knowledge changes frequently (products, pricing, policies)
✓ You need to cite sources (verifiable, auditable answers)
✓ You have a large knowledge base (docs, FAQs, manuals)
✓ Accuracy > style (customer support, technical documentation)
✓ You need to deploy fast (days, not months)
Use Fine-Tuning when:
✓ You need a specific writing style or tone
✓ The task is highly specialized (medical coding, legal analysis)
✓ You need faster inference (no retrieval step)
✓ You're working with structured output formats
✓ The base model can't follow complex instructions reliably
Use Both when:
✓ You need a specific style AND dynamic knowledge
✓ Example: fine-tune for your brand voice + RAG for product catalog
RAG: The Implementation
// RAG pipeline for customer support
async function ragAnswer(query: string, customerId?: string) {
  // 1. Retrieve relevant documents
  const embedding = await embed(query);
  const documents = await vectorDB.search(embedding, {
    topK: 5,
    filter: { status: "published" }, // Only approved content
  });

  // 2. Optionally add customer context
  let customerContext = "";
  if (customerId) {
    const customer = await getCustomer(customerId);
    customerContext = `Customer: ${customer.tier} tier, ${customer.orderCount} orders, member since ${customer.joinDate}`;
  }

  // 3. Build prompt with retrieved context
  const prompt = `
Answer the customer's question using ONLY the provided documents.
If the answer isn't in the documents, say "I don't have that information."

${customerContext}

Documents:
${documents.map(d => `[${d.title}]: ${d.content}`).join("\n\n")}

Question: ${query}
`;

  // 4. Generate answer
  return await llm.chat({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: prompt }],
  });
}

RAG Advantages:
✓ Knowledge always current (update docs → answers update)
✓ Auditable: every answer traceable to source documents
✓ No training required: works with any base model
✓ Easy to fix: wrong answer? Update the document, not the model
✓ Cost: $0.001-0.01 per query (retrieval + generation)
RAG Limitations:
✗ Retrieval quality caps answer quality (bad search = bad answers)
✗ Slower: retrieval step adds 100-500ms latency
✗ Context window limits: can't use all documents at once
✗ Doesn't change model behavior or style
✗ Complex queries spanning many documents are challenging
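The context-window limitation above is usually mitigated by chunking documents before embedding, so retrieval returns only the most relevant slices rather than whole manuals. A minimal sketch — `chunkText` and its size/overlap parameters are illustrative defaults, not part of the pipeline above:

```typescript
// Split a document into overlapping chunks before embedding.
// chunkSize and overlap are illustrative values, not tuned recommendations.
function chunkText(text: string, chunkSize = 800, overlap = 100): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    const end = Math.min(start + chunkSize, text.length);
    chunks.push(text.slice(start, end));
    if (end === text.length) break;
    start = end - overlap; // overlap preserves context across chunk boundaries
  }
  return chunks;
}
```

The overlap matters: without it, a sentence split across two chunks can be unretrievable from either. Production systems typically chunk on semantic boundaries (paragraphs, headings) rather than raw character counts.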
Fine-Tuning: The Implementation
// Fine-tuning training data format
const trainingExamples = [
  {
    messages: [
      { role: "system", content: "You are a customer service agent for AcmeSkin." },
      { role: "user", content: "What's your return policy?" },
      { role: "assistant", content: "Hey! Great question. We offer 30-day no-hassle returns on all products. Just reach out to us and we'll send you a prepaid label. Easy peasy. 🎉" },
    ],
  },
  // ... 500-5000 examples of desired behavior
];

// Key: examples should demonstrate the BEHAVIOR you want, not just knowledge
// The model learns HOW to respond, not WHAT to know

Fine-Tuning Advantages:
✓ Consistent style/tone across all responses
✓ Faster inference (no retrieval step)
✓ Better at following complex output formats
✓ Can learn specialized domain patterns
✓ Works well for classification and extraction tasks
Fine-Tuning Limitations:
✗ Static knowledge (training cutoff)
✗ Expensive: $50-5,000+ per training run
✗ Slow iteration: days per experiment
✗ Hallucination risk: the model can blur fine-tuned knowledge with its pretraining, producing confident but wrong answers
✗ Data quality critical: garbage in → garbage out
✗ Model updates invalidate fine-tunes (need to retrain)
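Before any training run, the examples above need to be serialized — most fine-tuning APIs (OpenAI's included) expect one JSON object per line (JSONL). A hedged sketch of serialization plus a basic sanity check; the validation rule here is a minimal assumption, not any provider's full spec:

```typescript
type ChatMessage = { role: "system" | "user" | "assistant"; content: string };
type TrainingExample = { messages: ChatMessage[] };

// Serialize examples to JSONL, rejecting obviously broken ones first.
// Minimal check: every example must contain an assistant turn to learn from.
function toJsonl(examples: TrainingExample[]): string {
  for (const [i, ex] of examples.entries()) {
    if (!ex.messages.some(m => m.role === "assistant")) {
      throw new Error(`example ${i}: no assistant message to learn from`);
    }
  }
  return examples.map(ex => JSON.stringify(ex)).join("\n");
}
```

Cheap validation like this pays off: a single malformed example can fail an entire (paid) training run.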
The Hybrid Approach
For production systems, the best approach often combines both:
// Hybrid: fine-tuned model for style + RAG for knowledge
async function hybridAnswer(query: string) {
  // RAG retrieval
  const context = await retrieveContext(query);

  // Fine-tuned model for generation (trained on brand voice)
  return await fineTunedModel.chat({
    messages: [
      { role: "system", content: "Answer using the provided context. Stay in brand voice." },
      { role: "user", content: `Context: ${context}\n\nQuestion: ${query}` },
    ],
  });
}

Cost Comparison
Scenario: 10,000 queries/day customer support
RAG Only:
Embedding: 10K × $0.0001 = $1/day
Vector search: 10K × $0.0002 = $2/day
Generation: 10K × $0.003 = $30/day
Total: ~$33/day ($990/month)
Fine-Tuned Only:
Training: $500 per run (monthly retrain)
Generation: 10K × $0.004 = $40/day
Total: ~$40/day + $500/month ($1,700/month)
Hybrid:
RAG retrieval: $3/day
Fine-tuned gen: $40/day
Training: $500/month
Total: ~$43/day + $500/month ($1,790/month)
The cost difference is often smaller than expected. Choose based on capabilities needed, not just price.
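The figures above reduce to simple arithmetic, so a small helper makes it easy to re-run the comparison with your own volumes. The rates below are this article's example numbers, not provider quotes:

```typescript
type CostInputs = {
  queriesPerDay: number;
  perQueryCost: number;          // retrieval + generation, $ per query
  monthlyTrainingCost?: number;  // recurring fine-tuning runs, $ per month
};

// Monthly cost = daily query cost × 30 days + any recurring training cost.
function monthlyCost({ queriesPerDay, perQueryCost, monthlyTrainingCost = 0 }: CostInputs): number {
  return queriesPerDay * perQueryCost * 30 + monthlyTrainingCost;
}

// RAG-only scenario: 10K queries/day at ~$0.0033/query ≈ $990/month
// Hybrid scenario: 10K queries/day at ~$0.0043/query + $500 training ≈ $1,790/month
```

Plugging in real per-query rates from your provider's pricing page is the only way to get numbers you can defend in a budget review.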
The Decision Checklist
□ Does your knowledge change weekly or more? → RAG
□ Do you need source citations? → RAG
□ Is consistent brand voice critical? → Fine-tuning
□ Do you need to deploy in < 1 week? → RAG
□ Is the task highly specialized? → Fine-tuning
□ Do you need both fresh knowledge and consistent style? → Hybrid
□ Is your training data < 500 quality examples? → RAG (not enough data to fine-tune well)
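The checklist can be encoded directly. This sketch just mirrors the rules above — the field names and the 500-example threshold gate are my own framing of them:

```typescript
type Needs = {
  knowledgeChangesWeekly: boolean;
  needsCitations: boolean;
  brandVoiceCritical: boolean;
  deployUnderOneWeek: boolean;
  highlySpecialized: boolean;
  trainingExamples: number; // quality examples available
};

function recommend(n: Needs): "rag" | "fine-tuning" | "hybrid" {
  const wantsRag = n.knowledgeChangesWeekly || n.needsCitations || n.deployUnderOneWeek;
  // Fine-tuning only makes sense with enough quality data (~500+ examples)
  const wantsFt = (n.brandVoiceCritical || n.highlySpecialized) && n.trainingExamples >= 500;
  if (wantsRag && wantsFt) return "hybrid";
  if (wantsFt) return "fine-tuning";
  return "rag"; // default: start with RAG
}
```

Note the default branch: when no strong fine-tuning signal exists, the function falls back to RAG, matching the advice below.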
Start with RAG. It's faster to build, easier to debug, and simpler to maintain. Only add fine-tuning when RAG alone can't achieve the behavior you need — and you have the data and budget to do it right.