RAG Pipeline Optimization — From 8s to 400ms
The 8-Second RAG Response
A client built a customer support chatbot powered by RAG. It worked — answers were accurate, grounded in their documentation. But each response took 8 seconds. Customers abandoned the chat after 4 seconds. The bot was smart but useless because it was slow.
We optimized the pipeline to 400ms average response time. Same answer quality. Here's every optimization we applied.
The Original Pipeline (8.2s)
Step 1: Receive user query ~10ms
Step 2: Embed query with OpenAI ada-002 ~350ms
Step 3: Search Pinecone for top-20 chunks ~120ms
Step 4: Re-rank all 20 chunks with a cross-encoder ~800ms
Step 5: Format prompt with all 20 chunks ~5ms
Step 6: Call GPT-4o with 8K token context ~6,900ms
Total: ~8,185ms
The bottleneck was obvious: Step 6 (LLM generation) was 84% of latency. But every step had room for improvement.
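Concretely, the original implementation awaited every step before starting the next. A minimal sketch of that shape (helper names like embedQuery and searchPinecone are illustrative stand-ins, not the client's actual code):

```typescript
// Sketch of the original, fully sequential pipeline (illustrative helpers)
async function originalPipeline(query: string): Promise<string> {
  const embedding = await embedQuery(query);                   // ~350ms (OpenAI ada-002)
  const candidates = await searchPinecone(embedding, 20);      // ~120ms, top-20 chunks
  const ranked = await crossEncoderRerank(query, candidates);  // ~800ms, all 20 chunks
  const prompt = buildPrompt(query, ranked);                   // ~5ms, 8K token context
  return await callGPT4o(prompt); // ~6,900ms; nothing reaches the user until here
}
```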
Optimization 1: Streaming (Perceived Latency → 0)
The single biggest UX improvement — start showing the response immediately:
```typescript
// Stream the response instead of waiting for completion
import OpenAI from "openai";

const openai = new OpenAI();

async function* streamRAGResponse(query: string) {
  const context = await getRelevantContext(query); // Still takes ~1.2s
  const stream = await openai.chat.completions.create({
    model: "gpt-4o-mini", // Optimization 2: cheaper model
    messages: [
      { role: "system", content: buildSystemPrompt(context) },
      { role: "user", content: query },
    ],
    stream: true,
    max_tokens: 500,
  });

  for await (const chunk of stream) {
    const content = chunk.choices[0]?.delta?.content;
    if (content) yield content;
  }
}

// Time to first token: ~1.4s (context retrieval + LLM warmup)
// Perceived latency: user sees text appearing after 1.4s
// Total completion: ~3s (but user is reading, not waiting)
```

Streaming doesn't reduce total time, but it transforms the experience. Users start reading at 1.4s instead of staring at a spinner for 8s.
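Hooking the generator up to a client is the last mile. A minimal sketch using Server-Sent Events with Express (the endpoint and wiring are our assumptions; the article doesn't show the client's server code):

```typescript
import express from "express";

const app = express();

// Stream tokens to the browser as Server-Sent Events
app.get("/chat", async (req, res) => {
  res.setHeader("Content-Type", "text/event-stream");
  res.setHeader("Cache-Control", "no-cache");

  for await (const token of streamRAGResponse(String(req.query.q ?? ""))) {
    res.write(`data: ${JSON.stringify(token)}\n\n`); // one event per token
  }
  res.write("data: [DONE]\n\n");
  res.end();
});

app.listen(3000);
```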
Optimization 2: Model Selection
GPT-4o was overkill for 80% of customer support queries:
```typescript
// Route queries to appropriate models
interface ModelConfig {
  model: string;
  maxTokens: number;
}

function selectModel(query: string, complexity: number): ModelConfig {
  // Simple FAQ-type queries → fastest, cheapest model
  if (complexity < 0.3) {
    // Generation time: ~800ms vs ~4,500ms for gpt-4o
    return { model: "gpt-4o-mini", maxTokens: 300 };
  }
  // Multi-step reasoning, comparison queries → full model
  if (complexity > 0.7) {
    return { model: "gpt-4o", maxTokens: 800 };
  }
  // Default: mid-tier
  return { model: "gpt-4o-mini", maxTokens: 500 };
}

// Complexity scoring (cheap, fast classification)
async function scoreComplexity(query: string): Promise<number> {
  // Simple heuristics first (no API call needed)
  const wordCount = query.split(" ").length;
  const hasComparison = /compare|versus|difference|better/i.test(query);
  const hasMultiStep = /and also|then|after that|steps/i.test(query);

  let score = 0;
  if (wordCount > 30) score += 0.3;
  if (hasComparison) score += 0.3;
  if (hasMultiStep) score += 0.3;
  return Math.min(score, 1);
}
```

Impact: 70% of queries now use gpt-4o-mini. Average generation time: 1.2s → 600ms.
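To see the routing in action, trace two queries through the heuristics above (worked by hand from the scoring rules):

```typescript
// "How do I reset my password?"
//   6 words, no comparison, no multi-step → score 0.0
//   → { model: "gpt-4o-mini", maxTokens: 300 } (fast path)

// "Compare Pro versus Enterprise, then walk me through the migration steps"
//   comparison (+0.3) + multi-step (+0.3) → score 0.6
//   → { model: "gpt-4o-mini", maxTokens: 500 } (default tier)

async function route(query: string): Promise<ModelConfig> {
  return selectModel(query, await scoreComplexity(query));
}
```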
Optimization 3: Smarter Retrieval
Retrieving 20 chunks and re-ranking all of them was wasteful:
```typescript
// Before: retrieve 20, re-rank all 20, use top 5
// After: retrieve 8, re-rank top 8, use top 3
async function getRelevantContext(query: string): Promise<string> {
  // 1. Embed query (use a faster, local model)
  const embedding = await localEmbed(query); // ~15ms vs 350ms for OpenAI

  // 2. Vector search — only retrieve 8 (not 20)
  const candidates = await vectorDB.search(embedding, {
    topK: 8, // Reduced from 20
    minScore: 0.75, // Skip low-relevance results entirely
  });

  // 3. Re-rank only if we have > 3 candidates
  let topChunks;
  if (candidates.length > 3) {
    topChunks = await rerank(query, candidates, { topK: 3 });
  } else {
    topChunks = candidates;
  }

  // 4. Return concatenated context (smaller = faster LLM generation)
  return topChunks.map(c => c.text).join("\n\n---\n\n");
}
```

Impact: Retrieval (search + re-rank): 920ms → 180ms. Context size: 6K tokens → 2K tokens (smaller context = faster generation).
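The article doesn't show localEmbed. One way to get ~15ms CPU embeddings in Node is a small ONNX model via Transformers.js; a sketch assuming the Xenova/all-MiniLM-L6-v2 checkpoint (not necessarily what the client ran):

```typescript
import { pipeline } from "@xenova/transformers";

// Load the model once at startup, not per request
const extractorPromise = pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

async function localEmbed(text: string): Promise<number[]> {
  const extractor = await extractorPromise;
  // Mean-pool token vectors and L2-normalize, the usual retrieval setup
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array);
}
```

One caveat worth stating: switching embedding models means re-embedding the entire document index, since vectors from different models live in different spaces and aren't comparable.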
Optimization 4: Embedding Cache
Many queries repeat, verbatim or nearly so. Caching embeddings lets those queries skip the embedding step entirely:

```typescript
// Embedding cache, keyed on a hash of the normalized query text
import { createHash } from "node:crypto";
import { LRUCache } from "lru-cache";

const embeddingCache = new LRUCache<string, number[]>({ max: 10000 });

function hash(text: string): string {
  return createHash("sha256").update(text).digest("hex");
}

async function getEmbedding(text: string): Promise<number[]> {
  // Normalize query for cache hits
  const normalized = text.toLowerCase().trim().replace(/\s+/g, " ");
  const cacheKey = hash(normalized);

  const cached = embeddingCache.get(cacheKey);
  if (cached) return cached; // Cache hit: 0ms vs 15-350ms

  const embedding = await localEmbed(normalized);
  embeddingCache.set(cacheKey, embedding);
  return embedding;
}
```

Impact: 30% cache hit rate. Average embedding time: 15ms → 10ms.
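A hit rate like 30% only shows up if you measure it. A minimal instrumentation sketch (our addition, not the client's code):

```typescript
// Count lookups so the hit rate can be reported alongside latency metrics
const cacheStats = { hits: 0, misses: 0 };

function lookupEmbedding(cacheKey: string): number[] | undefined {
  const cached = embeddingCache.get(cacheKey);
  if (cached) cacheStats.hits++;
  else cacheStats.misses++;
  return cached;
}

// Hit rate = hits / (hits + misses); ~0.30 for this workload per the numbers above
```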
Optimization 5: Parallel Execution
Don't run steps sequentially when they can overlap:
```typescript
async function ragPipeline(query: string) {
  // Run embedding and complexity scoring in parallel
  const [embedding, complexity] = await Promise.all([
    getEmbedding(query),
    scoreComplexity(query),
  ]);

  // Model selection is synchronous, so it's effectively free
  const modelConfig = selectModel(query, complexity);

  // Vector search with the tightened retrieval settings
  const candidates = await vectorDB.search(embedding, { topK: 8, minScore: 0.75 });

  // Re-rank (only if needed), then concatenate chunks into a context string
  const topChunks = candidates.length > 3
    ? await rerank(query, candidates, { topK: 3 })
    : candidates;
  const context = topChunks.map(c => c.text).join("\n\n---\n\n");

  // Stream response
  return streamResponse(query, context, modelConfig);
}
```

The Optimized Pipeline (400ms to first token)
Step 1: Receive query ~10ms
Step 2: Embed query (local model + cache) ~10ms (was 350ms)
Score complexity (parallel) ~5ms
Step 3: Vector search (top-8, min score 0.75) ~80ms (was 120ms)
Step 4: Re-rank top 3 (if needed) ~100ms (was 800ms)
Step 5: Format prompt (2K tokens, not 8K) ~2ms
Step 6: Stream first token (gpt-4o-mini) ~200ms (was 6,900ms wait)
Time to first token: ~407ms
Full response: ~1.8s (was 8.2s)
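Those first-token numbers are easy to spot-check. A quick harness (assuming streamResponse yields tokens like streamRAGResponse above):

```typescript
// Measure time to first token for a single query (illustrative)
async function measureTimeToFirstToken(query: string): Promise<number> {
  const start = performance.now();
  for await (const _token of await ragPipeline(query)) {
    break; // first token received; breaking also closes the stream
  }
  return performance.now() - start;
}
```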
Quality Validation
Speed means nothing if answers get worse. We validated with a test suite:
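A sketch of the grading loop, assuming an LLM-as-judge setup (the rubric, prompt, and TestCase shape here are placeholders, not the client's actual suite):

```typescript
import OpenAI from "openai";

const judge = new OpenAI();

interface TestCase {
  query: string;
  reference: string; // known-good answer from the docs
}

// Ask a strong model to grade a candidate answer against the reference
async function gradeAnswer(tc: TestCase, answer: string) {
  const res = await judge.chat.completions.create({
    model: "gpt-4o",
    response_format: { type: "json_object" },
    messages: [{
      role: "user",
      content:
        `Reference answer:\n${tc.reference}\n\nCandidate answer:\n${answer}\n\n` +
        `Score the candidate from 0 to 1 on accuracy, relevance, and completeness. ` +
        `Return JSON: {"accuracy": n, "relevance": n, "completeness": n}`,
    }],
  });
  return JSON.parse(res.choices[0].message.content ?? "{}");
}
```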
Answer quality metrics (500 test queries):

| Metric | Before | After |
| --- | --- | --- |
| Accuracy | 92% | 91% (-1%, within margin) |
| Relevance | 88% | 90% (+2%, better retrieval filtering) |
| Completeness | 85% | 82% (-3%, shorter context) |
| Avg response time | 8.2s | 1.8s (-78%) |
| Time to first token | ~8.2s (no streaming) | 0.4s (-95%) |
The 3% drop in completeness was a trade-off we accepted: answers were slightly less detailed but still accurate. And users actually read them now, because they no longer abandon the chat while waiting.
RAG performance isn't about picking the fastest LLM. It's about optimizing the entire pipeline — embedding, retrieval, re-ranking, generation — and running as much as possible in parallel. The LLM is often not even the biggest bottleneck once you look at the full picture.