How to Cut Your LLM Costs by 70% Without Losing Quality
The $50k/Month Wake-Up Call
A client came to us spending $47,000/month on OpenAI API calls. Their AI features were popular — but the unit economics were underwater. Every customer interaction cost $0.12 in inference alone.
Eight weeks later, they were spending $14,000/month with better output quality. Here's exactly how.
Strategy 1: Model Tiering
Not every request needs GPT-4. Most don't.
```ts
// Assumed request shape — the article doesn't define AIRequest, so this is illustrative.
interface AIRequest {
  complexity: "low" | "medium" | "high";
  type: string; // e.g. "classification", "generation", "analysis"
}

type ModelTier = "fast" | "standard" | "premium";

function selectModel(request: AIRequest): ModelTier {
  // Simple classification, FAQ, routing → fast model
  if (request.complexity === "low" || request.type === "classification") {
    return "fast"; // GPT-4o-mini, Claude Haiku, Gemini Flash
  }
  // Standard generation, summarization → mid-tier
  if (request.complexity === "medium") {
    return "standard"; // GPT-4o, Claude Sonnet
  }
  // Complex reasoning, code generation, analysis → premium
  return "premium"; // GPT-4.5, Claude Opus, o3
}

const MODEL_MAP = {
  fast: { model: "gpt-4o-mini", costPer1kTokens: 0.00015 },
  standard: { model: "gpt-4o", costPer1kTokens: 0.0025 },
  premium: { model: "gpt-4.5-preview", costPer1kTokens: 0.075 },
};
```
Impact: 60-70% of requests can use the fast tier. That alone cuts costs by 40%.
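Actual savings depend on your baseline and your traffic mix (and on output-token pricing, which the map above ignores), so it pays to estimate before migrating. A minimal sketch using the MODEL_MAP prices above; blendedCostPer1k and the 65/35 split are illustrative, not numbers from the client engagement:
```ts
// Blended input cost per 1k tokens for a given tier mix (fractions should sum to 1).
function blendedCostPer1k(mix: Record<ModelTier, number>): number {
  return (Object.keys(mix) as ModelTier[]).reduce(
    (sum, tier) => sum + mix[tier] * MODEL_MAP[tier].costPer1kTokens,
    0
  );
}

// Hypothetical comparison: everything on the standard tier vs. 65% routed to fast.
const current = blendedCostPer1k({ fast: 0, standard: 1, premium: 0 });
const tiered = blendedCostPer1k({ fast: 0.65, standard: 0.35, premium: 0 });
console.log(`Estimated savings: ~${Math.round((1 - tiered / current) * 100)}%`);
```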
Strategy 2: Semantic Caching
If someone asks the same question twice, don't call the API twice:
```ts
// Assumes an embedding client (embed), a vector-backed cache (cache),
// an LLM client (llm), and a metrics client (metrics) are available in scope.
async function cachedInference(prompt: string): Promise<string> {
  // Check semantic cache (not just exact match)
  const embedding = await embed(prompt);
  const cached = await cache.findSimilar(embedding, {
    threshold: 0.95,
    maxAge: "24h",
  });
  if (cached) {
    metrics.increment("cache_hit");
    return cached.response;
  }
  // Cache miss — call the model
  const response = await llm.generate(prompt);
  await cache.store(embedding, response, { ttl: "24h" });
  metrics.increment("cache_miss");
  return response;
}
```
Impact: 20-40% cache hit rate is typical. Higher for support and FAQ use cases.
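The cache.findSimilar and cache.store calls above assume a vector-backed cache. Here is a minimal in-memory sketch of that interface, using cosine similarity over stored embeddings with TTLs simplified to milliseconds; production systems typically use a vector store such as Redis or pgvector, and the "24h" string parsing is omitted:
```ts
interface CacheEntry {
  embedding: number[];
  response: string;
  createdAt: number;
  expiresAt: number;
}

const entries: CacheEntry[] = [];

function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const cache = {
  // Return the closest non-expired entry within maxAgeMs that clears the threshold.
  async findSimilar(embedding: number[], opts: { threshold: number; maxAgeMs: number }) {
    const now = Date.now();
    let best: { response: string; score: number } | null = null;
    for (const entry of entries) {
      if (now > entry.expiresAt || now - entry.createdAt > opts.maxAgeMs) continue;
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score >= opts.threshold && (best === null || score > best.score)) {
        best = { response: entry.response, score };
      }
    }
    return best; // null on a miss
  },
  async store(embedding: number[], response: string, opts: { ttlMs: number }) {
    const now = Date.now();
    entries.push({ embedding, response, createdAt: now, expiresAt: now + opts.ttlMs });
  },
};
```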
Strategy 3: Prompt Optimization
Shorter prompts = fewer tokens = lower cost. But most teams write prompts like novels:
❌ BEFORE (847 tokens):
```
"You are a helpful customer service assistant for Acme Corp.
Acme Corp was founded in 2019 and sells premium kitchen appliances.
Our values are quality, innovation, and customer satisfaction.
When responding to customers, always be polite, professional, and helpful.
If you don't know the answer, say so honestly.
Always try to resolve the issue in one interaction.
Here is the customer's message: {message}
Please provide a helpful response."
```
✅ AFTER (156 tokens):
```
"Acme Corp support agent. Kitchen appliances.
Rules: Be concise. Resolve in one reply. Say 'I don't know' when unsure.
Customer: {message}"
```
Impact: 50-80% reduction in prompt tokens. The model performs the same or better with concise instructions.
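It's worth measuring the reduction rather than eyeballing it. A rough sketch using the common ~4-characters-per-token heuristic; approxTokens is an illustrative helper, and a real tokenizer (for example the tiktoken package) gives exact counts:
```ts
// Very rough token estimate: ~4 characters per token for English text.
function approxTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Plug in your real prompts; these placeholders stand in for the examples above.
const originalPrompt = "You are a helpful customer service assistant for Acme Corp. ...";
const optimizedPrompt = "Acme Corp support agent. Kitchen appliances. ...";
const reduction = 1 - approxTokens(optimizedPrompt) / approxTokens(originalPrompt);
console.log(`Prompt tokens cut by ~${Math.round(reduction * 100)}%`);
```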
Strategy 4: Token Budgets Per Feature
Set hard limits on what each feature can spend:
```ts
const TOKEN_BUDGETS = {
  chat_response: { maxInput: 2000, maxOutput: 500 },
  email_draft: { maxInput: 3000, maxOutput: 1000 },
  document_summary: { maxInput: 8000, maxOutput: 500 },
  code_review: { maxInput: 4000, maxOutput: 2000 },
};
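// summarizeToFit (used below) isn't defined in the article. A minimal sketch,
// assuming the same llm and countTokens helpers shown elsewhere: compress the
// input with the model, then hard-truncate as a last resort if it is still over budget.
async function summarizeToFit(input: string, maxTokens: number): Promise<string> {
  const summary = await llm.generate(
    `Summarize the following in under ${maxTokens} tokens, keeping key facts:\n\n${input}`,
    { maxTokens }
  );
  // ~4 characters per token is a crude fallback bound, not an exact truncation
  return countTokens(summary) <= maxTokens ? summary : summary.slice(0, maxTokens * 4);
}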
async function enforceTokenBudget(
  feature: keyof typeof TOKEN_BUDGETS,
  input: string
): Promise<string> {
  const budget = TOKEN_BUDGETS[feature];
  const inputTokens = countTokens(input);
  if (inputTokens > budget.maxInput) {
    // Truncate intelligently, not just cut off
    input = await summarizeToFit(input, budget.maxInput);
  }
  return llm.generate(input, { maxTokens: budget.maxOutput });
}
```
Strategy 5: Batch Processing
Real-time isn't always necessary. Batch what you can:
```ts
// Instead of processing each email individually...
// Batch them and process together.
// Email and Classification are app-specific types (not shown); chunk comes from lodash.
import { chunk } from "lodash";

async function batchClassify(emails: Email[]): Promise<Classification[]> {
  // Group into batches of 20
  const batches = chunk(emails, 20);
  const responses = await Promise.all(
    batches.map((batch) =>
      llm.generate({
        prompt: `Classify these ${batch.length} emails. Return JSON array.`,
        input: batch.map((e, i) => `[${i}] ${e.subject}: ${e.body}`).join("\n"),
      })
    )
  );
  // Each response is a JSON array for one batch; parse and flatten back into one list
  return responses.flatMap((r) => JSON.parse(r) as Classification[]);
}
// 100 emails: 100 API calls → 5 API calls. Same results.
```
Impact: 80-95% reduction in API calls for batch-eligible workloads.
The Cost Dashboard You Need
You can't optimize what you don't measure:
| Metric (track daily) | How to track it |
|---|---|
| Total spend | By model, by feature |
| Cost per interaction | By user segment |
| Cache hit rate | Target > 30% |
| Token efficiency | Output quality per dollar |
| Model tier distribution | % fast vs standard vs premium |
| Waste | Requests that generated unused output |
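A lightweight way to populate that dashboard is to wrap every model call and emit cost and latency metrics tagged by feature and model. A minimal sketch reusing MODEL_MAP and the llm, countTokens, and metrics helpers assumed earlier; the metrics.histogram calls, the model option, and the single per-token rate are simplifying assumptions, not the article's implementation:
```ts
// Wrap each call so spend, latency, and tier usage land on the dashboard.
async function trackedGenerate(
  feature: string,
  tier: ModelTier,
  prompt: string
): Promise<string> {
  const { model, costPer1kTokens } = MODEL_MAP[tier];
  const start = Date.now();
  const response = await llm.generate(prompt, { model }); // option name assumed
  // Simplification: charges input and output tokens at the single MODEL_MAP rate.
  const tokens = countTokens(prompt) + countTokens(response);
  metrics.histogram("llm.cost_usd", (tokens / 1000) * costPer1kTokens, { feature, model });
  metrics.histogram("llm.latency_ms", Date.now() - start, { feature, model });
  return response;
}
```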
The Results
After implementing all five strategies for our client:
| Metric | Before | After |
|---|---|---|
| Monthly spend | $47,000 | $14,000 |
| Avg cost per interaction | $0.12 | $0.03 |
| Response quality (human eval) | 4.1/5 | 4.3/5 |
| P95 latency | 3.2s | 1.8s |
| Cache hit rate | 0% | 34% |
Quality went up because the tiering system matched the right model to each task, and prompt optimization removed noise that was confusing the models.
Stop Burning Money
Your LLM costs should scale sub-linearly with usage. If they're scaling linearly — or worse, super-linearly — you're leaving money on the table. The five strategies above aren't theoretical. They're what we implement in every AI engagement.
The model is rarely the bottleneck. The engineering around it is.