RAG vs Fine-Tuning: When to Use What in Production
Everyone Asks the Wrong Question
"Should we use RAG or fine-tuning?" is like asking "should we use a database or an API?" The answer is: it depends on what you're building, and sometimes you need both.
Here's the decision framework we use with every client.
RAG: When Your Data Changes
Retrieval-Augmented Generation is the right choice when:
- Your knowledge base updates frequently (docs, products, policies)
- You need citations and source attribution
- Accuracy on specific facts matters more than style
- You want to control exactly what the model can reference
How RAG Actually Works in Production
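Here's a simplified version of the pipeline. `embed`, `vectorStore`, `reranker`, and `llm` are stand-ins for your embedding model, vector database, reranker, and LLM client; `AgentResponse` and `SYSTEM_PROMPT` are assumed to be defined elsewhere in your codebase.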
```ts
async function ragPipeline(query: string): Promise<AgentResponse> {
  // 1. Embed the query
  const queryEmbedding = await embed(query);

  // 2. Retrieve relevant chunks
  const chunks = await vectorStore.search(queryEmbedding, {
    topK: 5,
    minScore: 0.75,
    filter: { status: "published" },
  });

  // 3. Rerank for relevance
  const reranked = await reranker.rank(query, chunks);
  const context = reranked.slice(0, 3);

  // 4. Generate with context
  const response = await llm.generate({
    system: SYSTEM_PROMPT,
    context: context.map((c) => c.text).join("\n\n"),
    query,
    temperature: 0.1, // Low temp for factual responses
  });

  // 5. Return with sources
  return {
    answer: response.text,
    sources: context.map((c) => ({
      title: c.metadata.title,
      url: c.metadata.url,
      score: c.score,
    })),
  };
}
```

RAG Pitfalls We See Constantly
- Chunking too aggressively — splitting mid-paragraph destroys context (see the chunker sketch after this list)
- No reranking step — vector similarity alone isn't enough
- Stale embeddings — your vectors need to update when your docs do
- Ignoring retrieval quality — if you retrieve garbage, the LLM generates garbage with confidence
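To make the chunking point concrete, here's a minimal paragraph-aware chunker. This is an illustrative sketch, not our production code; the `maxChars` budget and the blank-line splitting heuristic are assumptions you'd tune for your own documents.

```ts
// Minimal paragraph-aware chunker (illustrative sketch).
// Splits on blank lines and packs whole paragraphs into chunks,
// so no chunk ever starts or ends mid-paragraph.
function chunkByParagraph(text: string, maxChars = 1500): string[] {
  const paragraphs = text
    .split(/\n\s*\n/)
    .map((p) => p.trim())
    .filter((p) => p.length > 0);

  const chunks: string[] = [];
  let current = "";

  for (const para of paragraphs) {
    // Start a new chunk when adding this paragraph would blow the budget
    if (current && current.length + para.length + 2 > maxChars) {
      chunks.push(current);
      current = para;
    } else {
      current = current ? `${current}\n\n${para}` : para;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```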
Fine-Tuning: When You Need a Different Model
Fine-tuning is the right choice when:
- You need the model to behave differently (tone, format, reasoning style)
- You have a specific task with consistent input/output patterns
- Latency matters and you can't afford the retrieval step
- You need the model to learn patterns, not just facts
When Fine-Tuning Actually Makes Sense
```python
# Good fine-tuning use case: structured extraction
# Input: messy customer emails
# Output: structured JSON with intent, urgency, entities
training_examples = [
    {
        "input": "Hey, my order #4521 hasn't arrived and it's been 2 weeks. This is ridiculous.",
        "output": {
            "intent": "order_status_complaint",
            "urgency": "high",
            "order_id": "4521",
            "sentiment": "frustrated",
            "days_waiting": 14
        }
    },
    # ... hundreds more examples
]
```

Fine-Tuning Pitfalls
- Not enough data — you need hundreds of high-quality examples minimum
- Overfitting to training data — the model memorizes instead of generalizing (the held-out check sketched after this list catches this early)
- Stale models — fine-tuned models don't update when your business changes
- Cost of iteration — every change means retraining
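One cheap guard against overfitting: hold out a slice of your examples before training and evaluate on both sets afterward. A minimal sketch, assuming a generic `Example` shape; the 10% holdout ratio is a convention, not a rule.

```ts
interface Example {
  input: string;
  output: unknown;
}

// Hold out ~10% of examples for validation (illustrative sketch).
// If training accuracy climbs while validation accuracy stalls or
// drops, the model is memorizing rather than generalizing.
function splitDataset(examples: Example[], holdout = 0.1) {
  // Fisher-Yates shuffle so the holdout slice is a random sample
  const shuffled = [...examples];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const cut = Math.floor(shuffled.length * (1 - holdout));
  return {
    train: shuffled.slice(0, cut),
    validation: shuffled.slice(cut),
  };
}
```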
The Decision Framework
| Factor | RAG | Fine-Tuning |
|---|---|---|
| Data changes frequently | ✅ Best choice | ❌ Requires retraining |
| Need source citations | ✅ Built-in | ❌ Not possible |
| Consistent output format | ⚠️ Possible with prompting | ✅ Best choice |
| Custom tone/personality | ⚠️ Prompt engineering | ✅ Best choice |
| Latency-sensitive | ⚠️ Retrieval adds ~200ms | ✅ No retrieval step |
| Small dataset (< 100 examples) | ✅ Works with any amount | ❌ Not enough data |
| Need to reason over private data | ✅ Best choice | ⚠️ Risk of data leakage |
| Budget-constrained | ✅ Use existing models | ⚠️ Training costs add up |
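The table collapses into a small routing function. This is purely illustrative; the factor names and the 100-example threshold are our shorthand, not a standard API or a hard rule.

```ts
interface ProjectFactors {
  dataChangesFrequently: boolean;
  needsCitations: boolean;
  needsCustomToneOrFormat: boolean;
  latencySensitive: boolean;
  exampleCount: number;
}

// Encodes the decision table above (illustrative sketch).
function chooseApproach(f: ProjectFactors): "rag" | "fine-tune" | "hybrid" {
  const ragSignals = f.dataChangesFrequently || f.needsCitations;
  const ftSignals =
    (f.needsCustomToneOrFormat || f.latencySensitive) && f.exampleCount >= 100;
  if (ragSignals && ftSignals) return "hybrid";
  if (ftSignals) return "fine-tune";
  return "rag"; // default: faster to prototype, easier to debug
}
```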
The Hybrid Approach (What We Usually Recommend)
Most production systems benefit from both:
```
User Query
    ↓
[Fine-tuned model for intent classification] → Fast, consistent
    ↓
[RAG pipeline for knowledge retrieval]       → Accurate, up-to-date
    ↓
[Base model for response generation]         → Flexible, contextual
    ↓
Response with sources
```
The fine-tuned model handles the how (format, tone, routing). RAG handles the what (facts, data, citations). The base model ties it together.
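Wired together, the hybrid flow looks roughly like this. A sketch under assumptions: `intentClassifier` and `baseModel` are hypothetical clients, and `retrieveContext` stands in for the retrieval half (embed, search, rerank) of the `ragPipeline` shown earlier.

```ts
// Hybrid pipeline sketch. All three clients are hypothetical
// stand-ins; adapt the calls to whatever SDKs you actually use.
async function handleQuery(query: string) {
  // 1. Fine-tuned model: the "how" (routing, tone, format)
  const intent = await intentClassifier.classify(query);

  // 2. RAG: the "what" (facts, data, citations)
  const context = await retrieveContext(query); // embed -> search -> rerank

  // 3. Base model ties it together
  const response = await baseModel.generate({
    system: `${SYSTEM_PROMPT}\nDetected intent: ${intent}`,
    context: context.map((c) => c.text).join("\n\n"),
    query,
  });

  return {
    answer: response.text,
    sources: context.map((c) => ({
      title: c.metadata.title,
      url: c.metadata.url,
    })),
  };
}
```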
What This Means for Your Project
Before you commit to either approach:
- Define your success metric — is it accuracy? Speed? Consistency?
- Audit your data — do you have enough quality examples for fine-tuning?
- Map your update frequency — how often does your knowledge change?
- Set a latency budget — can you afford the retrieval step?
- Start with RAG — it's faster to prototype and easier to debug
The companies that get AI right don't pick a technique first. They define the problem, then pick the tool. The ones that fail start with "we should use RAG" and work backwards.