
RAG vs Fine-Tuning: When to Use What in Production

February 6, 2026 · ScaledByDesign · ai · rag · fine-tuning · llm

Everyone Asks the Wrong Question

"Should we use RAG or fine-tuning?" is like asking "should we use a database or an API?" The answer is: it depends on what you're building, and sometimes you need both.

Here's the decision framework we use with every client.

RAG: When Your Data Changes

Retrieval-Augmented Generation is the right choice when:

  • Your knowledge base updates frequently (docs, products, policies)
  • You need citations and source attribution
  • Accuracy on specific facts matters more than style
  • You want to control exactly what the model can reference

How RAG Actually Works in Production

// embed, vectorStore, reranker, and llm are stand-ins for your embedding
// client, vector database, reranker, and LLM client of choice.

interface AgentResponse {
  answer: string;
  sources: { title: string; url: string; score: number }[];
}

async function ragPipeline(query: string): Promise<AgentResponse> {
  // 1. Embed the query
  const queryEmbedding = await embed(query);

  // 2. Retrieve relevant chunks (topK and minScore are tuned per corpus)
  const chunks = await vectorStore.search(queryEmbedding, {
    topK: 5,
    minScore: 0.75,
    filter: { status: "published" },
  });

  // 3. Rerank for relevance, then keep only the strongest matches
  const reranked = await reranker.rank(query, chunks);
  const context = reranked.slice(0, 3);

  // 4. Generate with context
  const response = await llm.generate({
    system: SYSTEM_PROMPT,
    context: context.map((c) => c.text).join("\n\n"),
    query,
    temperature: 0.1, // Low temp for factual responses
  });

  // 5. Return with sources
  return {
    answer: response.text,
    sources: context.map((c) => ({
      title: c.metadata.title,
      url: c.metadata.url,
      score: c.score,
    })),
  };
}
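
Calling it from an async context is a one-liner; here's a quick usage sketch (the query string is illustrative):

const result = await ragPipeline("What's the refund window for annual plans?");

console.log(result.answer);
for (const source of result.sources) {
  console.log(`- ${source.title} (${source.url}), score ${source.score.toFixed(2)}`);
}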

RAG Pitfalls We See Constantly

  1. Chunking too aggressively — splitting mid-paragraph destroys context (see the chunking sketch after this list)
  2. No reranking step — vector similarity alone isn't enough
  3. Stale embeddings — your vectors need to update when your docs do
  4. Ignoring retrieval quality — if you retrieve garbage, the LLM generates garbage with confidence
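
On the chunking point, a minimal paragraph-aware chunker is often enough to avoid mid-thought splits. A sketch; the 1,500-character target and one-paragraph overlap are illustrative defaults, not recommendations:

function chunkByParagraphs(doc: string, maxChars = 1500): string[] {
  // Split on blank lines so chunk boundaries fall between paragraphs,
  // never in the middle of one.
  const paragraphs = doc
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = [];
  let length = 0;

  for (const para of paragraphs) {
    if (length + para.length > maxChars && current.length > 0) {
      chunks.push(current.join("\n\n"));
      // Carry the last paragraph forward so adjacent chunks overlap slightly
      current = [current[current.length - 1]];
      length = current[0].length;
    }
    current.push(para);
    length += para.length;
  }

  if (current.length > 0) chunks.push(current.join("\n\n"));
  return chunks;
}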

Fine-Tuning: When You Need a Different Model

Fine-tuning is the right choice when:

  • You need the model to behave differently (tone, format, reasoning style)
  • You have a specific task with consistent input/output patterns
  • Latency matters and you can't afford the retrieval step
  • You need the model to learn patterns, not just facts

When Fine-Tuning Actually Makes Sense

# Good fine-tuning use case: structured extraction
# Input: messy customer emails
# Output: structured JSON with intent, urgency, entities
 
training_examples = [
    {
        "input": "Hey, my order #4521 hasn't arrived and it's been 2 weeks. This is ridiculous.",
        "output": {
            "intent": "order_status_complaint",
            "urgency": "high",
            "order_id": "4521",
            "sentiment": "frustrated",
            "days_waiting": 14
        }
    },
    # ... hundreds more examples
]

Fine-Tuning Pitfalls

  1. Not enough data — you need hundreds of high-quality examples minimum (see the data-prep sketch after this list)
  2. Overfitting to training data — the model memorizes instead of generalizing
  3. Stale models — fine-tuned models don't update when your business changes
  4. Cost of iteration — every change means retraining
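
Two cheap safeguards help with pitfalls 1 and 2: serialize your examples into your provider's chat-format JSONL, and hold out a validation split before training so you can measure overfitting instead of guessing at it. A sketch in TypeScript, assuming an OpenAI-style messages schema (check your provider's fine-tuning docs for the exact format):

import { writeFileSync } from "node:fs";

interface Example {
  input: string;
  output: Record<string, unknown>;
}

// One JSONL line per example, in the common chat fine-tuning shape.
function toJsonl(examples: Example[]): string {
  return examples
    .map((ex) =>
      JSON.stringify({
        messages: [
          { role: "system", content: "Extract intent, urgency, and entities as JSON." },
          { role: "user", content: ex.input },
          { role: "assistant", content: JSON.stringify(ex.output) },
        ],
      })
    )
    .join("\n");
}

// Fisher-Yates shuffle, then split so validation data never leaks into training.
function writeSplits(examples: Example[], validationFraction = 0.1): void {
  const shuffled = [...examples];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * (1 - validationFraction));
  writeFileSync("train.jsonl", toJsonl(shuffled.slice(0, cut)));
  writeFileSync("validation.jsonl", toJsonl(shuffled.slice(cut)));
}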

The Decision Framework

Factor                           | RAG                         | Fine-Tuning
Data changes frequently          | ✅ Best choice              | ❌ Requires retraining
Need source citations            | ✅ Built-in                 | ❌ Not possible
Consistent output format         | ⚠️ Possible with prompting  | ✅ Best choice
Custom tone/personality          | ⚠️ Prompt engineering       | ✅ Best choice
Latency-sensitive                | ⚠️ Retrieval adds ~200ms    | ✅ No retrieval step
Small dataset (< 100 examples)   | ✅ Works with any amount    | ❌ Not enough data
Need to reason over private data | ✅ Best choice              | ⚠️ Risk of data leakage
Budget-constrained               | ✅ Use existing models      | ⚠️ Training costs add up

The Hybrid Approach (What We Usually Recommend)

Most production systems benefit from both:

User Query
    ↓
[Fine-tuned model for intent classification] → Fast, consistent
    ↓
[RAG pipeline for knowledge retrieval] → Accurate, up-to-date
    ↓
[Base model for response generation] → Flexible, contextual
    ↓
Response with sources

The fine-tuned model handles the how (format, tone, routing). RAG handles the what (facts, data, citations). The base model ties it together.
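
In code, the hybrid is just sequential composition. A minimal sketch reusing ragPipeline from earlier; classifyIntent and baseModel are hypothetical stand-ins for your fine-tuned classifier and base-model client:

async function hybridAnswer(query: string): Promise<AgentResponse> {
  // 1. Fine-tuned model: fast, consistent routing
  const intent = await classifyIntent(query); // e.g. "billing" | "support" | "sales"

  // 2. RAG pipeline: grounded facts with citations
  const rag = await ragPipeline(query);

  // 3. Base model: reshape the grounded draft for this intent without adding facts
  const final = await baseModel.generate({
    system: `Rewrite the draft for a ${intent} conversation. Keep every fact as-is.`,
    context: rag.answer,
    query,
  });

  return { answer: final.text, sources: rag.sources };
}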

What This Means for Your Project

Before you commit to either approach:

  1. Define your success metric — is it accuracy? Speed? Consistency?
  2. Audit your data — do you have enough quality examples for fine-tuning?
  3. Map your update frequency — how often does your knowledge change?
  4. Set a latency budget — can you afford the retrieval step?
  5. Start with RAG — it's faster to prototype and easier to debug

The companies that get AI right don't pick a technique first. They define the problem, then pick the tool. The ones that fail start with "we should use RAG" and work backwards.
