
The Rate Limiting Strategy That Saved Our Client's API

December 24, 2025 · ScaledByDesign
rate-limiting · api · reliability · infrastructure

Three Outages in One Month

Our client's API served 200 merchants processing payments. Three times in one month, a single merchant's batch processing job brought down the entire API. Every merchant's transactions failed. Every merchant's customers saw errors.

The total cost:

  • Revenue lost during outages: $340K (94 min, 67 min, 112 min downtime)
  • Customer credits issued: $85K
  • Three enterprise accounts ($420K ARR) threatened to leave
  • Emergency weekend work: $12K in contractor costs
  • Brand damage: 47 merchants explored alternatives

One merchant's bug cost 199 other merchants their revenue. The root cause: no rate limiting. One tenant could consume 100% of API capacity.

The Layered Approach

Rate limiting isn't one thing — it's four layers working together.

Layer 1: Global Rate Limit (DDoS Protection)

At the load balancer / CDN level:

  Total API capacity: 10,000 requests/second
  Global limit: 8,000 requests/second (20% headroom)

  If total traffic exceeds 8,000 rps:
    → Return 503 Service Unavailable
    → Shed load starting with lowest-priority traffic
    → Alert on-call immediately

  This is your circuit breaker. It prevents total API collapse
  regardless of where the traffic comes from.
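In production this check normally lives in the load balancer or CDN configuration rather than in application code, but a minimal in-process sketch of the same idea looks like this; the 8,000 rps threshold comes from the figures above, and the two-level priority field is an assumption:

// Minimal global load-shedding sketch. In production this belongs at the
// load balancer / CDN; it is shown in application code only for illustration.
const GLOBAL_LIMIT_RPS = 8_000; // 20% headroom under 10,000 rps total capacity

let windowStart = Date.now();
let requestsThisSecond = 0;

function allowGlobally(priority: "high" | "low"): boolean {
  const now = Date.now();
  if (now - windowStart >= 1_000) {
    // Start a new 1-second window
    windowStart = now;
    requestsThisSecond = 0;
  }
  requestsThisSecond++;

  if (requestsThisSecond <= GLOBAL_LIMIT_RPS) {
    return true;
  }

  // Over the global limit: shed low-priority traffic first and let
  // high-priority requests use a small slice of the remaining headroom.
  return priority === "high" && requestsThisSecond <= GLOBAL_LIMIT_RPS * 1.1;
}

// When this returns false, respond 503 Service Unavailable and alert on-call.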

Layer 2: Per-Tenant Rate Limit

// Each tenant gets a fair share of API capacity
interface TenantRateLimits {
  tier: "free" | "growth" | "enterprise";
  limits: {
    requestsPerMinute: number;
    requestsPerHour: number;
    burstLimit: number;      // Max requests in any 1-second window
    concurrentRequests: number;
  };
}
 
const tierLimits: Record<string, TenantRateLimits["limits"]> = {
  free:       { requestsPerMinute: 60,   requestsPerHour: 1000,   burstLimit: 10,  concurrentRequests: 5 },
  growth:     { requestsPerMinute: 300,  requestsPerHour: 10000,  burstLimit: 50,  concurrentRequests: 20 },
  enterprise: { requestsPerMinute: 1000, requestsPerHour: 50000,  burstLimit: 100, concurrentRequests: 50 },
};
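A sketch of enforcing these tiers per request, assuming one in-memory bucket per tenant (the TokenBucket class is the one shown under "The Token Bucket Algorithm" below; the distributed section later replaces the Map with a shared store):

// Hypothetical enforcement sketch: one token bucket per tenant, keyed by tenant ID.
const tenantBuckets = new Map<string, TokenBucket>();

function checkTenantLimit(tenantId: string, tier: TenantRateLimits["tier"]): boolean {
  const { requestsPerMinute, burstLimit } = tierLimits[tier];

  let bucket = tenantBuckets.get(tenantId);
  if (!bucket) {
    // Allow bursts up to burstLimit, refilling at the per-minute rate.
    bucket = new TokenBucket(burstLimit, requestsPerMinute / 60);
    tenantBuckets.set(tenantId, bucket);
  }

  return bucket.tryConsume(); // false → respond with 429
}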

Layer 3: Per-Endpoint Rate Limit

Not all endpoints are equal. Price them differently:

  GET /api/products (read, cacheable):
    500 requests/minute — cheap to serve

  POST /api/orders (write, triggers webhooks):
    50 requests/minute — expensive, has side effects

  POST /api/auth/login (security-sensitive):
    5 requests/minute per IP — prevents credential stuffing

  POST /api/reports/generate (heavy computation):
    5 requests/hour — CPU-intensive, queue instead of rate limit

  GET /api/search (moderate, hits database):
    100 requests/minute — indexing makes this reasonable
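One way to express those numbers as configuration so they can be tuned without touching handler code; the route keys mirror the list above, and the perIp and queue flags are assumptions:

// Per-endpoint limits as data. `perIp` scopes the limit per client IP;
// `queue` marks endpoints that should be queued instead of rejected with 429.
interface EndpointLimit {
  perMinute?: number;
  perHour?: number;
  perIp?: boolean;
  queue?: boolean;
}

const endpointLimits: Record<string, EndpointLimit> = {
  "GET /api/products":          { perMinute: 500 },
  "POST /api/orders":           { perMinute: 50 },
  "POST /api/auth/login":       { perMinute: 5, perIp: true },
  "POST /api/reports/generate": { perHour: 5, queue: true },
  "GET /api/search":            { perMinute: 100 },
};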

Layer 4: Adaptive Rate Limiting

// Minimal shape of the health signal used below
interface SystemHealth {
  cpuUsage: number;       // Percent, 0–100
  responseTime: number;   // p99 latency in milliseconds
}

// Adjust limits based on current system health
function getAdaptiveLimit(
  baseLimit: number,
  systemHealth: SystemHealth
): number {
  // If system is healthy, allow full rate
  if (systemHealth.cpuUsage < 70 && systemHealth.responseTime < 200) {
    return baseLimit;
  }

  // Check the most severe condition first so it isn't shadowed by the broader check below.
  // If system is degraded, severely limit
  if (systemHealth.cpuUsage > 95 || systemHealth.responseTime > 1000) {
    return Math.floor(baseLimit * 0.2); // Cut to 20%
  }

  // If system is stressed, reduce limits
  if (systemHealth.cpuUsage > 85 || systemHealth.responseTime > 500) {
    return Math.floor(baseLimit * 0.5); // Cut to 50%
  }

  // Moderate stress: slight reduction
  return Math.floor(baseLimit * 0.75);
}
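This composes naturally with the per-tenant limits from Layer 2: the effective limit is the tier limit, scaled by current health. getSystemHealth() is an assumed metrics source, declared here only for the sketch:

// Assumed metrics source (e.g. backed by your APM or a /metrics endpoint).
declare function getSystemHealth(): Promise<SystemHealth>;

// Effective per-tenant limit = tier limit from Layer 2, scaled by Layer 4.
async function effectiveTenantLimit(tier: TenantRateLimits["tier"]): Promise<number> {
  const health = await getSystemHealth();
  const base = tierLimits[tier].requestsPerMinute;
  return getAdaptiveLimit(base, health);
}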

The Token Bucket Algorithm

The most practical rate limiting algorithm for APIs:

class TokenBucket {
  private tokens: number;
  private lastRefill: number;
 
  constructor(
    private maxTokens: number,      // Bucket capacity
    private refillRate: number,     // Tokens added per second
  ) {
    this.tokens = maxTokens;
    this.lastRefill = Date.now();
  }
 
  tryConsume(count: number = 1): boolean {
    this.refill();
 
    if (this.tokens >= count) {
      this.tokens -= count;
      return true;  // Request allowed
    }
 
    return false;   // Rate limited
  }
 
  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    this.tokens = Math.min(
      this.maxTokens,
      this.tokens + elapsed * this.refillRate
    );
    this.lastRefill = now;
  }
}
 
// Example: 100 requests per minute, burst of 20
// refillRate = 100/60 = 1.67 tokens per second
const limiter = new TokenBucket(20, 1.67);

Response Headers

Always communicate rate limit status to clients:

Successful request:
  HTTP/1.1 200 OK
  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 87
  X-RateLimit-Reset: 1706745600
  X-RateLimit-Policy: 100;w=60   ← 100 requests per 60-second window

Rate limited request:
  HTTP/1.1 429 Too Many Requests
  Retry-After: 30
  X-RateLimit-Limit: 100
  X-RateLimit-Remaining: 0
  X-RateLimit-Reset: 1706745600

  {
    "error": {
      "code": "RATE_LIMITED",
      "message": "Rate limit exceeded. Retry after 30 seconds.",
      "retryAfter": 30,
      "limit": 100,
      "window": "60s"
    }
  }
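A sketch of attaching these headers in a handler; the minimal res shape and the RateLimitResult fields (whatever your limiter reports) are assumptions:

// Attach standard rate limit headers to every response, plus Retry-After on 429s.
interface RateLimitResult {
  limit: number;          // Requests allowed per window
  remaining: number;      // Requests left in the window (negative once limited)
  resetAt: number;        // Unix epoch seconds when the window resets
  windowSeconds: number;  // Window size, e.g. 60
}

function setRateLimitHeaders(
  res: { setHeader(name: string, value: string | number): void },
  result: RateLimitResult
): void {
  res.setHeader("X-RateLimit-Limit", result.limit);
  res.setHeader("X-RateLimit-Remaining", Math.max(0, result.remaining));
  res.setHeader("X-RateLimit-Reset", result.resetAt);
  res.setHeader("X-RateLimit-Policy", `${result.limit};w=${result.windowSeconds}`);

  if (result.remaining < 0) {
    const retryAfter = Math.max(1, result.resetAt - Math.floor(Date.now() / 1000));
    res.setHeader("Retry-After", retryAfter);
  }
}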

Distributed Rate Limiting

Single-server token buckets don't work when you have multiple API servers:

Option 1: Redis-based (recommended for most)

  All API servers share rate limit state via Redis.
  Atomic operations prevent race conditions.

  Pros: Simple, fast (1ms overhead), battle-tested
  Cons: Redis becomes a dependency

Option 2: Sliding window with local + sync

  Each server tracks locally, syncs to central store periodically.

  Pros: No Redis dependency; keeps limiting even if the central store is briefly unreachable
  Cons: Slightly less accurate; with N servers, can over-serve by up to N× during the sync gap

For most APIs: Use Redis. The 1ms overhead is negligible,
and the accuracy is worth it.
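A minimal sketch of the Redis option, assuming ioredis and a small Lua script so the read-refill-consume step stays atomic across servers; the key naming and the two-minute expiry are arbitrary choices for illustration:

import Redis from "ioredis";

const redis = new Redis();

// KEYS[1] = bucket key; ARGV = [maxTokens, refillRatePerSecond, nowMs, cost]
const TOKEN_BUCKET_LUA = `
local max    = tonumber(ARGV[1])
local rate   = tonumber(ARGV[2])
local now    = tonumber(ARGV[3])
local cost   = tonumber(ARGV[4])
local tokens = tonumber(redis.call('HGET', KEYS[1], 'tokens') or max)
local last   = tonumber(redis.call('HGET', KEYS[1], 'last') or now)
tokens = math.min(max, tokens + ((now - last) / 1000) * rate)
local allowed = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
end
redis.call('HSET', KEYS[1], 'tokens', tostring(tokens), 'last', tostring(now))
redis.call('PEXPIRE', KEYS[1], 120000)
return allowed
`;

// Same semantics as the in-process TokenBucket, but the state is shared by
// every API server, so one tenant can't exceed its limit by fanning
// requests across instances.
async function tryConsumeShared(
  tenantId: string,
  maxTokens: number,
  refillRatePerSecond: number,
  cost = 1
): Promise<boolean> {
  const allowed = await redis.eval(
    TOKEN_BUCKET_LUA,
    1,                        // number of KEYS
    `ratelimit:${tenantId}`,  // KEYS[1]
    maxTokens,
    refillRatePerSecond,
    Date.now(),
    cost
  );
  return Number(allowed) === 1;
}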

The Results

After implementing layered rate limiting:

Before (3 months):
  Outages: 3 major incidents
  P99 response time: 2,400ms (during batch jobs)
  Revenue lost to outages: $340K
  Customer credits: $85K
  Churn risk: $420K ARR
  Total financial impact: $845K

After (6 months):
  Outages: 0
  P99 response time: 180ms (consistent)
  Revenue lost: $0
  Customer credits: $0
  Churn risk: $0
  Total financial impact: $0

Implementation cost: 2 weeks of engineering ($16K)
ROI: 52x in first 6 months

Additional benefit: The merchant causing the batch processing
spike? They got 429 responses, implemented proper queuing,
and are now a better-behaved API consumer. Their integration
is more reliable too.

The Dashboard

Rate Limiting Health:

Global:
  Current load: 3,200 rps (40% of capacity) ✅
  Peak today: 5,100 rps (64% of capacity)
  429 responses today: 1,247 (0.02% of total requests)

Per-Tenant (Top 5 by Usage):
  Merchant A:  800 rps (80% of enterprise limit)  ⚠️
  Merchant B:  420 rps (42% of enterprise limit)   ✅
  Merchant C:  180 rps (60% of growth limit)       ✅
  Merchant D:  95 rps (32% of growth limit)        ✅
  Merchant E:  88 rps (29% of growth limit)        ✅

Rate Limited Events (Last 24h):
  Total 429 responses: 1,247
  ├── Per-tenant limits: 890 (71%)
  ├── Per-endpoint limits: 312 (25%)
  ├── Adaptive limits: 45 (4%)
  └── Global limits: 0 (0%) ✅

  Top limited tenants:
  ├── Merchant F (free tier): 523 events → suggest upgrade
  └── Merchant A (enterprise): 367 events → review limits
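Feeding a dashboard like this is mostly a matter of tagging every 429. A hedged sketch, where metrics.increment stands in for whatever metrics client you already run (StatsD, a Prometheus client, etc.):

// Hypothetical metrics hook: emit one counter per 429 so the dashboard
// can be sliced by layer, tenant, and endpoint.
type LimitLayer = "global" | "tenant" | "endpoint" | "adaptive";

// Assumed metrics client interface; not a specific library's API.
declare const metrics: {
  increment(name: string, tags: Record<string, string>): void;
};

function recordRateLimited(layer: LimitLayer, tenantId: string, route: string): void {
  metrics.increment("api.rate_limited", { layer, tenant: tenantId, route });
}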

Implementation Priority

Week 1: Per-tenant rate limiting (prevents noisy neighbor)
  → Use Redis + token bucket
  → Set conservative initial limits
  → Add response headers

Week 2: Per-endpoint limits (protects expensive operations)
  → Lower limits on write endpoints
  → Strict limits on auth endpoints
  → Queue-based limits on reports

Week 3: Global limits + adaptive (prevents total outage)
  → Load balancer-level circuit breaker
  → Health-based adaptive adjustment

Week 4: Dashboard + alerting
  → Real-time rate limit monitoring
  → Alert when tenants consistently hit limits
  → Alert when system enters adaptive mode

Why This Costs You

The no-rate-limiting tax:

Single bad actor outage:
  - 60-120 minutes downtime
  - $150K-400K in lost revenue (depends on GMV)
  - $40K-100K in customer credits
  - 20-40 hours of emergency engineering
  - Churn risk from enterprise customers

Rate limiting prevents ALL of this.
  - Implementation: 2-3 weeks
  - Cost: $12K-20K in engineering time
  - Ongoing cost: $180/month (Redis instance)

One prevented outage pays for 8 years of rate limiting infrastructure.

Rate limiting isn't about saying no to your customers. It's about guaranteeing that every customer gets reliable service, even when one customer misbehaves. Build it before you need it — because when you need it, it's already too late.
