
LLM Cost Optimization — How We Cut a Client's AI Bill by 73%

April 8, 2026 · ScaledByDesign
Tags: ai · llm · cost-optimization · production · openai

The $45K/Month AI Bill

A SaaS client integrated AI features across their platform: document summarization, customer support chat, content generation, and data extraction. Six months later, their monthly OpenAI bill was $45K — and growing 20% month-over-month. The AI features generated $28K/month in attributable revenue. They were losing money on every AI interaction.

We cut their bill to $12K/month while maintaining the same output quality. Here's how.

Technique 1: Model Routing (The Biggest Win)

Not every request needs GPT-4o. Most requests work fine with cheaper, faster models:

// Route requests to the cheapest model that meets quality requirements
interface AIRequest {
  type: "classification" | "extraction" | "chat" | "analysis";
  requiresReasoning?: boolean;
}

interface ModelConfig {
  model: string;
  maxTokens: number;
}

function selectModel(request: AIRequest): ModelConfig {
  // Simple classification/extraction → use a small model
  if (request.type === "classification" || request.type === "extraction") {
    return { model: "gpt-4o-mini", maxTokens: 500 };
    // Cost: ~$0.0003 per request vs $0.015 for gpt-4o
  }

  // Customer-facing chat → use mid-tier
  if (request.type === "chat" && !request.requiresReasoning) {
    return { model: "gpt-4o-mini", maxTokens: 1000 };
  }

  // Complex reasoning, code generation, analysis → use top-tier
  if (request.type === "analysis" || request.requiresReasoning) {
    return { model: "gpt-4o", maxTokens: 2000 };
  }

  // Default to cheapest
  return { model: "gpt-4o-mini", maxTokens: 500 };
}

Impact: 60% of the client's requests were classification or extraction tasks. Routing these to GPT-4o-mini reduced costs for those requests by 95%. Overall bill impact: -40%.
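A quick sanity check on those figures (the percentages come from the text above; the arithmetic is ours): a 95% per-request cost cut that moves the total bill by 40% implies the rerouted requests carried roughly 42% of spend, which is why 60% of requests translates into less than 60% of savings.

```typescript
// Back-of-envelope: share of total spend held by the rerouted requests.
const perRequestCut = 0.95; // cost reduction on each rerouted request
const overallCut = 0.4; // observed reduction in the total bill
const shareOfSpend = overallCut / perRequestCut;
console.log(shareOfSpend.toFixed(2)); // ≈ 0.42
```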

Technique 2: Prompt Optimization

Most prompts are 3-5x longer than they need to be:

❌ Before (890 tokens):
"You are an expert customer service agent for a premium skincare brand.
Your role is to help customers with their orders, products, and account.
You should be friendly, professional, and helpful. Always greet the customer
warmly. If you don't know the answer, say so politely. Never make up 
information. Always refer to our return policy which states..."
(continues for 600 more tokens of instructions)

✅ After (180 tokens):
"Skincare brand CS agent. Be helpful, concise. Reference provided context only.
If unsure, say 'Let me check with our team.' 
Return policy: 30-day no-questions. Free return shipping.
Customer context: {{customer_data}}
Order context: {{order_data}}"

The shorter prompt produces output of equivalent quality. Every token in the system prompt is billed on every single request, so a 700-token reduction at 10K requests/day eliminates seven million input tokens per day.
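The arithmetic behind that claim, assuming an illustrative gpt-4o input price of roughly $2.50 per million tokens (check current rates; `promptSavingsPerDay` is our helper, not a real API):

```typescript
// Estimate daily savings from trimming a system prompt.
// Assumes a flat input-token price; real bills also include output tokens.
function promptSavingsPerDay(
  tokensSaved: number,
  requestsPerDay: number,
  dollarsPerMillionTokens: number
): number {
  return (tokensSaved * requestsPerDay * dollarsPerMillionTokens) / 1_000_000;
}

const daily = promptSavingsPerDay(700, 10_000, 2.5); // $17.50/day
const monthly = daily * 30; // $525/month, from one prompt on one feature
```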

Impact: Prompt optimization across all features: -15% total cost.

Technique 3: Response Caching

Many AI requests are identical or near-identical:

import OpenAI from "openai";
import Redis from "ioredis";

const openai = new OpenAI();
const redis = new Redis();

interface AIOptions {
  model: string;
}

async function cachedAICall(prompt: string, options: AIOptions) {
  // Hash the prompt for the cache key (`hash` is any stable digest, e.g. SHA-256)
  const cacheKey = `ai:${hash(prompt)}:${options.model}`;

  const cached = await redis.get(cacheKey);
  if (cached) return JSON.parse(cached);

  const response = await openai.chat.completions.create({
    model: options.model,
    messages: [{ role: "user", content: prompt }],
  });

  // Cache for 1 hour (adjust based on data freshness needs)
  await redis.set(cacheKey, JSON.stringify(response), "EX", 3600);

  return response;
}

// For semantic caching (similar but not identical prompts).
// `getEmbedding`, `vectorDB`, and `callLLM` stand in for your embedding
// model, vector store, and LLM client.
async function semanticCache(prompt: string) {
  const embedding = await getEmbedding(prompt);
  const similar = await vectorDB.search(embedding, { threshold: 0.95 });

  if (similar.length > 0) {
    return similar[0].cachedResponse; // Serve cached response
  }

  const response = await callLLM(prompt);
  await vectorDB.insert({ embedding, cachedResponse: response });
  return response;
}

Impact: 25% of requests were cache hits (exact or semantic). Cost reduction: -18%.
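The gap between the 25% hit rate and the 18% cost reduction is consistent with cached requests being cheaper than average. A back-of-envelope model (our assumption, not measured client data):

```typescript
// Expected bill reduction from caching: hit rate times the relative cost of
// the requests that get cached (1.0 = an average-cost request).
function cacheSavings(hitRate: number, relativeCost: number): number {
  return hitRate * relativeCost;
}

// A 25% hit rate on requests averaging ~72% of typical cost lands at -18%:
console.log(cacheSavings(0.25, 0.72)); // 0.18
```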

Technique 4: Batch Processing

Don't process items one-by-one when you can batch:

// ❌ Before: One API call per document (100 documents = 100 calls)
for (const doc of documents) {
  await summarize(doc); // $0.01 per call × 100 = $1.00
}

// ✅ After: Batch multiple documents in one call
// (`chunk` and `summarizeBatch` are app-level helpers)
const batches = chunk(documents, 5); // Groups of 5
for (const batch of batches) {
  await summarizeBatch(batch); // $0.03 per call × 20 = $0.60
}

// Or use OpenAI's Batch API for non-real-time processing:
// 50% discount, results within 24 hours
const batchJob = await openai.batches.create({
  input_file_id: uploadedFile.id,
  endpoint: "/v1/chat/completions",
  completion_window: "24h", // 50% cost reduction
});

Impact: Batch processing for non-real-time features: -12% total cost.
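Working the illustrative per-call prices from the comments above in integer cents (to avoid float drift), the batched path is 40% cheaper on this workload:

```typescript
// 100 one-document calls at 1¢ each vs 20 five-document calls at 3¢ each.
const singleCallCents = 100 * 1; // 100¢
const batchedCallCents = Math.ceil(100 / 5) * 3; // 60¢
const savings = 1 - batchedCallCents / singleCallCents; // 0.4 → 40% cheaper
```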

The Cost Dashboard

Track these metrics to prevent runaway costs:

Daily Monitoring:
  Total API spend (by model, by feature)
  Cost per user interaction
  Cache hit rate
  Average tokens per request (input + output)
  
Weekly Review:
  Cost trend (is it growing faster than usage?)
  Top 10 most expensive prompts
  Model distribution (what % of requests use each model?)
  Features ranked by cost-per-value
  
Alerts:
  Daily spend > 150% of 7-day average → Slack alert
  Any single request > $0.50 → Log and review
  Cache hit rate drops below 20% → Investigate
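The first alert rule is simple enough to sketch in a few lines (the Slack delivery is left out; `spendAlert` only decides whether to fire):

```typescript
// Fire when today's spend exceeds 150% of the trailing 7-day average.
function spendAlert(last7DaysSpend: number[], todaySpend: number): boolean {
  const avg =
    last7DaysSpend.reduce((sum, day) => sum + day, 0) / last7DaysSpend.length;
  return todaySpend > avg * 1.5;
}

spendAlert([300, 310, 295, 305, 290, 315, 300], 480); // → true (avg ≈ 302)
```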

The Result

Before optimization:
  Monthly LLM cost:       $45,000
  Cost per AI interaction: $0.18
  Revenue per interaction: $0.11
  Net: LOSING money on AI

After optimization:
  Monthly LLM cost:       $12,200 (-73%)
  Cost per AI interaction: $0.048
  Revenue per interaction: $0.11
  Net: 2.3x ROI on AI features

Breakdown of savings:
  Model routing:     -40% ($18,000)
  Prompt optimization: -15% ($6,750)
  Response caching:  -18% ($8,100)
  Batch processing:  -12% ($5,400)
  Per-technique figures sum to 85%; because the techniques overlap
  (a cached request also avoids routing cost), the net total is -73%.

AI is only valuable if the economics work. Before adding more AI features, optimize the ones you have. The cheapest API call is the one you don't make — and the second cheapest is the one served from cache.
