The Rate Limiting Strategy That Saved Our Client's API
Three Outages in One Month
Our client's API served 200 merchants processing payments. Three times in one month, a single merchant's batch processing job brought down the entire API. Every merchant's transactions failed. Every merchant's customers saw errors.
The total cost:
- Revenue lost during outages: $340K (94 min, 67 min, 112 min downtime)
- Customer credits issued: $85K
- Three enterprise accounts ($420K ARR) threatened to leave
- Emergency weekend work: $12K in contractor costs
- Brand damage: 47 merchants explored alternatives
One merchant's bug cost 199 other merchants their revenue. The root cause: no rate limiting. One tenant could consume 100% of API capacity.
The Layered Approach
Rate limiting isn't one thing — it's four layers working together.
Layer 1: Global Rate Limit (DDoS Protection)
At the load balancer / CDN level:
Total API capacity: 10,000 requests/second
Global limit: 8,000 requests/second (20% headroom)
If total traffic exceeds 8,000 rps:
→ Return 503 Service Unavailable
→ Shed load starting with lowest-priority traffic
→ Alert on-call immediately
This is your circuit breaker. It prevents total API collapse
regardless of where the traffic comes from.
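Conceptually, that check is just a shared counter over a one-second window. A minimal in-process sketch of the idea (real deployments enforce this at the load balancer or CDN, and GLOBAL_LIMIT_RPS is an assumed constant):
// Sketch: global load shedding via a fixed one-second window.
// In production this lives at the load balancer/CDN, not in app code.
const GLOBAL_LIMIT_RPS = 8000; // 80% of the 10,000 rps capacity

let windowStart = Date.now();
let windowCount = 0;

function allowGlobally(): boolean {
  const now = Date.now();
  if (now - windowStart >= 1000) {
    windowStart = now; // Roll over to a new one-second window
    windowCount = 0;
  }
  windowCount++;
  return windowCount <= GLOBAL_LIMIT_RPS; // false → respond 503 and page on-call
}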
Layer 2: Per-Tenant Rate Limit
// Each tenant gets a fair share of API capacity
interface TenantRateLimits {
  tier: "free" | "growth" | "enterprise";
  limits: {
    requestsPerMinute: number;
    requestsPerHour: number;
    burstLimit: number; // Max requests in any 1-second window
    concurrentRequests: number;
  };
}
const tierLimits: Record<string, TenantRateLimits["limits"]> = {
  free: { requestsPerMinute: 60, requestsPerHour: 1000, burstLimit: 10, concurrentRequests: 5 },
  growth: { requestsPerMinute: 300, requestsPerHour: 10000, burstLimit: 50, concurrentRequests: 20 },
  enterprise: { requestsPerMinute: 1000, requestsPerHour: 50000, burstLimit: 100, concurrentRequests: 50 },
};
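To enforce these tiers, one sketch is a per-tenant bucket map, using the TokenBucket class defined later in this post (getTenantTier is a hypothetical lookup against your tenant store):
// Sketch: one token bucket per tenant, sized from the tenant's tier.
// getTenantTier() is a hypothetical lookup; TokenBucket is defined below.
declare function getTenantTier(tenantId: string): "free" | "growth" | "enterprise";

const tenantBuckets = new Map<string, TokenBucket>();

function allowTenantRequest(tenantId: string): boolean {
  let bucket = tenantBuckets.get(tenantId);
  if (!bucket) {
    const limits = tierLimits[getTenantTier(tenantId)];
    // Burst limit = bucket capacity; per-minute rate = refill per second
    bucket = new TokenBucket(limits.burstLimit, limits.requestsPerMinute / 60);
    tenantBuckets.set(tenantId, bucket);
  }
  return bucket.tryConsume();
}
Note that this in-memory map only works on a single server; see the distributed section below for the Redis version.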
Layer 3: Per-Endpoint Rate Limit
Not all endpoints are equal. Price them differently (a configuration sketch follows the list):
GET /api/products (read, cacheable):
500 requests/minute — cheap to serve
POST /api/orders (write, triggers webhooks):
50 requests/minute — expensive, has side effects
POST /api/auth/login (security-sensitive):
5 requests/minute per IP — prevents credential stuffing
POST /api/reports/generate (heavy computation):
5 requests/hour — CPU-intensive; queue these jobs rather than just rejecting them
GET /api/search (moderate, hits database):
100 requests/minute — indexing makes this reasonable
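Here's one way these could be encoded as configuration (the keyBy field, scoping the counter to tenant or IP, is an assumption for illustration):
// Sketch: per-endpoint limits as configuration.
interface EndpointLimit {
  requestsPerMinute: number;
  keyBy: "tenant" | "ip"; // What the counter is scoped to
}

const endpointLimits: Record<string, EndpointLimit> = {
  "GET /api/products":    { requestsPerMinute: 500, keyBy: "tenant" },
  "POST /api/orders":     { requestsPerMinute: 50,  keyBy: "tenant" },
  "POST /api/auth/login": { requestsPerMinute: 5,   keyBy: "ip" },
  "GET /api/search":      { requestsPerMinute: 100, keyBy: "tenant" },
  // POST /api/reports/generate (5/hour) is better served by a job
  // queue than a pure rate limit, so it is omitted here.
};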
Layer 4: Adaptive Rate Limiting
// Adjust limits based on current system health
interface SystemHealth {
  cpuUsage: number;     // percent, 0-100
  responseTime: number; // p99 latency in ms
}

function getAdaptiveLimit(
  baseLimit: number,
  systemHealth: SystemHealth
): number {
  // If system is healthy, allow full rate
  if (systemHealth.cpuUsage < 70 && systemHealth.responseTime < 200) {
    return baseLimit;
  }
  // Check the most severe condition first; a milder check placed
  // earlier would shadow this branch so it could never fire.
  // If system is degraded, severely limit
  if (systemHealth.cpuUsage > 95 || systemHealth.responseTime > 1000) {
    return Math.floor(baseLimit * 0.2); // Cut to 20%
  }
  // If system is stressed, reduce limits
  if (systemHealth.cpuUsage > 85 || systemHealth.responseTime > 500) {
    return Math.floor(baseLimit * 0.5); // Cut to 50%
  }
  // Moderate stress — slight reduction
  return Math.floor(baseLimit * 0.75);
}
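Applying the adaptive multiplier is then a one-liner at check time; currentSystemHealth is a hypothetical hook into your metrics pipeline:
// Sketch: scale a tenant's base limit by current system health.
declare function currentSystemHealth(): SystemHealth;

const baseLimit = tierLimits["growth"].requestsPerMinute; // 300
const effectiveLimit = getAdaptiveLimit(baseLimit, currentSystemHealth());
// Under stress this drops to 150 (50%) or 60 (20%) automatically.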
The Token Bucket Algorithm
The most practical rate limiting algorithm for APIs:
class TokenBucket {
  private tokens: number;
  private lastRefill: number;

  constructor(
    private maxTokens: number,  // Bucket capacity
    private refillRate: number, // Tokens added per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }

  tryConsume(count: number = 1): boolean {
    this.refill();
    if (this.tokens >= count) {
      this.tokens -= count;
      return true; // Request allowed
    }
    return false; // Rate limited
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }
}
// Example: 100 requests per minute, burst of 20
// refillRate = 100/60 ≈ 1.67 tokens per second
const limiter = new TokenBucket(20, 1.67);
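A quick way to see the burst-then-refill behavior:
// The first 20 calls drain the burst capacity; after that, requests
// are only admitted as tokens refill (~1.67 per second).
for (let i = 1; i <= 25; i++) {
  console.log(`request ${i}: ${limiter.tryConsume() ? "allowed" : "rate limited"}`);
}
// → requests 1-20 allowed, 21-25 rate limited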
Response Headers
Always communicate rate limit status to clients:
Successful request:
HTTP/1.1 200 OK
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 87
X-RateLimit-Reset: 1706745600
X-RateLimit-Policy: 100;w=60
(the policy value "100;w=60" means 100 requests per 60-second window)
Rate limited request:
HTTP/1.1 429 Too Many Requests
Retry-After: 30
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1706745600
{
"error": {
"code": "RATE_LIMITED",
"message": "Rate limit exceeded. Retry after 30 seconds.",
"retryAfter": 30,
"limit": 100,
"window": "60s"
}
}
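For concreteness, here is a sketch of emitting those headers from an Express-style middleware (Express is assumed; checkRateLimit is a hypothetical wrapper around whichever limiter you use):
import express from "express";

// Hypothetical result shape from your limiter of choice.
interface RateLimitResult {
  allowed: boolean;
  limit: number;
  remaining: number;
  resetAt: number; // Unix timestamp, seconds
}
declare function checkRateLimit(tenantId: string): RateLimitResult;

const app = express();

app.use((req, res, next) => {
  const result = checkRateLimit(req.header("X-Tenant-Id") ?? "anonymous");
  res.set("X-RateLimit-Limit", String(result.limit));
  res.set("X-RateLimit-Remaining", String(result.remaining));
  res.set("X-RateLimit-Reset", String(result.resetAt));
  if (!result.allowed) {
    const retryAfter = Math.max(1, result.resetAt - Math.floor(Date.now() / 1000));
    res.set("Retry-After", String(retryAfter));
    res.status(429).json({
      error: {
        code: "RATE_LIMITED",
        message: `Rate limit exceeded. Retry after ${retryAfter} seconds.`,
        retryAfter,
        limit: result.limit,
        window: "60s",
      },
    });
    return;
  }
  next();
});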
Distributed Rate Limiting
Single-server token buckets don't work when you have multiple API servers:
Option 1: Redis-based (recommended for most)
All API servers share rate limit state via Redis.
Atomic operations prevent race conditions.
Pros: Simple, fast (1ms overhead), battle-tested
Cons: Redis becomes a dependency
Option 2: Sliding window with local + sync
Each server tracks locally, syncs to central store periodically.
Pros: No Redis dependency, works offline
Cons: Slightly less accurate; can over-serve by up to N× (N = number of servers) during the sync gap
For most APIs: Use Redis. The 1ms overhead is negligible,
and the accuracy is worth it.
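A minimal sketch of the Redis option, as a shared fixed-window counter (ioredis is the assumed client; production setups often move this into a Lua-scripted token bucket, but INCR is atomic enough to show the shape):
import Redis from "ioredis"; // Assumed client library

const redis = new Redis(); // Defaults to localhost:6379

// Fixed-window counter shared by every API server. The key rolls
// over each minute, so all servers agree on the current window.
async function allowRequest(
  tenantId: string,
  limitPerMinute: number
): Promise<boolean> {
  const window = Math.floor(Date.now() / 60_000);
  const key = `ratelimit:${tenantId}:${window}`;
  const count = await redis.incr(key); // Atomic across servers
  if (count === 1) {
    await redis.expire(key, 120); // Let old windows expire
  }
  return count <= limitPerMinute;
}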
The Results
After implementing layered rate limiting:
Before (3 months):
Outages: 3 major incidents
P99 response time: 2,400ms (during batch jobs)
Revenue lost to outages: $340K
Customer credits: $85K
Churn risk: $420K ARR
Total financial impact: $845K
After (6 months):
Outages: 0
P99 response time: 180ms (consistent)
Revenue lost: $0
Customer credits: $0
Churn risk: $0
Total financial impact: $0
Implementation cost: 2 weeks of engineering ($16K)
ROI: 52x in first 6 months
Additional benefit: The merchant causing the batch processing
spike? They got 429 responses, implemented proper queuing,
and are now a better-behaved API consumer. Their integration
is more reliable too.
The Dashboard
Rate Limiting Health:
Global:
Current load: 3,200 rps (40% of capacity) ✅
Peak today: 5,100 rps (64% of capacity)
429 responses today: 1,247 (0.02% of total requests)
Per-Tenant (Top 5 by Usage):
Merchant A: 800 rps (80% of enterprise limit) ⚠️
Merchant B: 420 rps (42% of enterprise limit) ✅
Merchant C: 180 rps (60% of growth limit) ✅
Merchant D: 95 rps (32% of growth limit) ✅
Merchant E: 88 rps (29% of growth limit) ✅
Rate Limited Events (Last 24h):
Total 429 responses: 1,247
├── Per-tenant limits: 890 (71%)
├── Per-endpoint limits: 312 (25%)
├── Adaptive limits: 45 (4%)
└── Global limits: 0 (0%) ✅
Top limited tenants:
├── Merchant F (free tier): 523 events → suggest upgrade
└── Merchant A (enterprise): 367 events → review limits
Implementation Priority
Week 1: Per-tenant rate limiting (prevents noisy neighbor)
→ Use Redis + token bucket
→ Set conservative initial limits
→ Add response headers
Week 2: Per-endpoint limits (protects expensive operations)
→ Lower limits on write endpoints
→ Strict limits on auth endpoints
→ Queue-based limits on reports
Week 3: Global limits + adaptive (prevents total outage)
→ Load balancer-level circuit breaker
→ Health-based adaptive adjustment
Week 4: Dashboard + alerting
→ Real-time rate limit monitoring
→ Alert when tenants consistently hit limits
→ Alert when system enters adaptive mode
What Not Having This Costs You
The no-rate-limiting tax:
Single bad actor outage:
- 60-120 minutes downtime
- $150K-400K in lost revenue (depends on GMV)
- $40K-100K in customer credits
- 20-40 hours of emergency engineering
- Churn risk from enterprise customers
Rate limiting prevents ALL of this.
- Implementation: 2-3 weeks
- Cost: $12K-20K in engineering time
- Ongoing cost: $180/month (Redis instance)
One prevented outage pays for 8 years of rate limiting infrastructure.
Rate limiting isn't about saying no to your customers. It's about guaranteeing that every customer gets reliable service, even when one customer misbehaves. Build it before you need it — because when you need it, it's already too late.