The On-Call Rotation That Doesn't Burn Out Your Team
On-Call Is Broken at Most Companies
The standard on-call setup: one engineer carries a pager for a week, gets woken up three times, spends the next week exhausted, and quietly starts interviewing. Repeat that cycle for twelve months and on-call becomes your biggest source of attrition.
On-call doesn't have to be this way. The goal isn't to endure incidents — it's to eliminate them.
The Sustainable On-Call Framework
Structure: The Two-Tier Model
Tier 1: First Responder (Primary On-Call)
→ Responds to all pages within 15 minutes
→ Follows runbook for known issues
→ Escalates to Tier 2 if not resolved in 30 minutes
→ Rotates weekly
Tier 2: Subject Matter Expert (Secondary On-Call)
→ Only engaged if Tier 1 can't resolve
→ Deep expertise in specific systems
→ Available within 30 minutes
→ Rotates monthly (less disruptive)
Why two tiers: Most pages (70-80%) are known issues with documented fixes. Tier 1 handles these with runbooks. Only novel problems reach the expert — which means experts are rarely woken up.
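To keep the escalation mechanical rather than ad hoc, the policy can be written down as data. Here is a minimal Python sketch of the two-tier rule; the names (Tier, owning_tier, the rotation labels) are illustrative and not any paging tool's API:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class Tier:
    name: str
    rotation: str              # which rotation group answers this tier
    respond_within: timedelta  # how quickly this tier must acknowledge

POLICY = [
    Tier("primary", "weekly-rotation", respond_within=timedelta(minutes=15)),
    Tier("secondary", "monthly-sme-rotation", respond_within=timedelta(minutes=30)),
]

ESCALATE_AFTER = timedelta(minutes=30)  # Tier 1 hands off if still unresolved

def owning_tier(elapsed_since_page: timedelta) -> Tier:
    """Which tier owns the page, given how long it has gone unresolved."""
    return POLICY[0] if elapsed_since_page < ESCALATE_AFTER else POLICY[1]
```

For example, `owning_tier(timedelta(minutes=40))` returns the secondary tier, matching the 30-minute escalation rule above.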
Rotation Rules
1. Minimum 4 people in the rotation
→ Anything less = on-call every 3 weeks or more often = burnout
2. No back-to-back weeks
→ Minimum 3 weeks between on-call shifts
3. Voluntary swap system
→ Easy to trade shifts (Slack bot, PagerDuty)
→ No questions asked
4. Protected recovery time
→ If paged after midnight: late start next day
→ If paged 3+ times in one night: next day off
→ Non-negotiable
5. On-call compensation
→ Stipend for being on-call ($200-500/week)
→ Additional per-page compensation for after-hours
→ Or equivalent comp time
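Rules 1 and 2 are easy to enforce when the schedule is generated rather than hand-built. A rough Python sketch, assuming a simple round-robin and ignoring PTO and swaps:

```python
from datetime import date, timedelta

def build_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Weekly round-robin primary schedule; rejects rosters that violate rule 1."""
    if len(engineers) < 4:
        raise ValueError("need at least 4 people in the rotation")
    schedule = []
    for week in range(weeks):
        # Round-robin gives each person len(engineers) - 1 weeks between shifts,
        # which satisfies rule 2 (minimum 3 weeks apart) whenever the roster is 4+.
        schedule.append((start + timedelta(weeks=week), engineers[week % len(engineers)]))
    return schedule
```

With four or more engineers, the round-robin spacing automatically yields at least three weeks between any one person's shifts.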
The Alert Budget
This is the single most important concept. Set a maximum number of acceptable pages per week:
Alert Budget: 5 pages per on-call week
Week 1: 3 pages (under budget ✅)
Week 2: 7 pages (over budget ❌)
→ Mandatory incident review
→ Team dedicates 20% of next sprint to reducing alerts
Week 3: 4 pages (under budget ✅)
Week 4: 8 pages (over budget ❌)
→ Engineering leadership involved
→ Root cause analysis for every page
→ Systemic fix required before next sprint work
The rule: If on-call consistently exceeds the alert budget, it becomes the team's #1 priority — above features, above tech debt, above everything. The alert budget forces the team to fix the systems, not just endure them.
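One way to make the budget concrete is a small check run at every handoff. A sketch mirroring the thresholds in the example above; treating any repeat breach within the recent window as the escalation trigger is an assumption, not part of the framework:

```python
ALERT_BUDGET = 5  # pages per on-call week

def review_week(pages_this_week: int, recent_weekly_pages: list[int]) -> str:
    """Decide the follow-up for a finished on-call week.

    recent_weekly_pages: page counts from the last few weeks (rolling window).
    """
    if pages_this_week <= ALERT_BUDGET:
        return "under budget: no action"
    if any(p > ALERT_BUDGET for p in recent_weekly_pages):
        # A repeat breach, like Week 4 in the example above.
        return "repeat breach: leadership involved, RCA for every page, systemic fix first"
    return "over budget: mandatory incident review, 20% of next sprint on alert reduction"
```

For instance, `review_week(8, [3, 7, 4])` flags the repeat breach, matching Week 4 in the example.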
Reducing Incident Volume
The Post-Incident Review (Blameless)
Every page gets a 15-minute review:
## Incident: [Title]
**Date:** [When] **Duration:** [How long] **Severity:** [1-3]
### What happened
[2-3 sentences]
### Timeline
- HH:MM Alert fired
- HH:MM On-call acknowledged
- HH:MM Root cause identified
- HH:MM Fix deployed
- HH:MM Verified resolved
### Root cause
[Technical explanation]
### Action items
1. [Prevent recurrence] — Owner: [Name] — Due: [Date]
2. [Improve detection] — Owner: [Name] — Due: [Date]
### Classification
- [ ] Known issue (runbook exists but didn't work)
- [ ] New issue (needs new runbook)
- [ ] False alarm (alert needs tuning)
- [ ] Customer-caused (consider rate limiting)
The Incident Categories
Track where your pages come from:
| Category | Target % | If Over Target |
|---|---|---|
| False alarms | < 10% | Fix alert thresholds |
| Known issues with runbooks | < 30% | Automate the fix |
| Infrastructure (DB, cache, DNS) | < 20% | Invest in reliability |
| Application bugs | < 20% | Improve testing |
| External dependencies | < 20% | Add fallbacks |
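A short script can compare the actual mix against these targets each month. A sketch, assuming every page has already been tagged with one of the categories above:

```python
from collections import Counter

TARGETS = {  # mirrors the table above
    "false_alarm": 0.10,
    "known_issue": 0.30,
    "infrastructure": 0.20,
    "application_bug": 0.20,
    "external_dependency": 0.20,
}

def categories_over_target(page_categories: list[str]) -> dict[str, float]:
    """Return each category whose share of this month's pages exceeds its target."""
    counts = Counter(page_categories)
    total = max(len(page_categories), 1)
    return {
        category: counts[category] / total
        for category, target in TARGETS.items()
        if counts[category] / total > target
    }
```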
The Automation Ladder
For every recurring incident, climb this ladder:
Level 0: Page a human who follows a runbook
Level 1: Page a human + auto-collect diagnostics
Level 2: Auto-remediate + notify human after
Level 3: Auto-remediate + no notification (logged only)
Level 4: Prevent the condition entirely
Goal: Move every incident type up at least one level per quarter
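As an illustration of climbing one rung, here is a sketch that wraps an existing runbook step in an automatic attempt before paging anyone (Level 1 to Level 2). The diagnostics, remediation, notification, and paging hooks are placeholders for your own tooling, passed in as callables:

```python
from typing import Callable

def handle_alert(
    alert: dict,
    collect_diagnostics: Callable[[dict], dict],  # Level 1: gather context automatically
    remediate: Callable[[dict], bool],            # Level 2: the documented fix, automated
    notify: Callable[[str], None],                # post to a channel, no page
    page_human: Callable[[dict, dict], None],     # fall back to the old behavior
) -> None:
    diagnostics = collect_diagnostics(alert)
    if remediate(alert):
        # Success: notify or log only; no one is woken up.
        notify(f"Auto-remediated {alert.get('name', 'alert')}; diagnostics: {diagnostics}")
    else:
        # The automated fix didn't take; a human is paged, with context attached.
        page_human(alert, diagnostics)
```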
The On-Call Handoff
The handoff between on-call shifts should be a 15-minute meeting:
Outgoing on-call shares:
1. How many pages this week? (vs alert budget)
2. Any ongoing issues to watch?
3. Any new runbooks created or updated?
4. Any alerts that need tuning?
5. Anything unusual about the current system state?
Incoming on-call confirms:
1. PagerDuty/OpsGenie is configured correctly
2. VPN/access to all systems is working
3. Runbook index is bookmarked
4. Escalation contacts are current
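If you already run a Slack bot for shift swaps, the same bot can post the outgoing summary before the meeting. A sketch of the formatting helper, with the field names assumed:

```python
def handoff_summary(pages: int, budget: int, ongoing: list[str],
                    runbook_changes: list[str], alerts_to_tune: list[str]) -> str:
    """Format the outgoing on-call's answers as a message a bot could post."""
    def fmt(items: list[str]) -> str:
        return ", ".join(items) if items else "none"

    return "\n".join([
        f"Pages this week: {pages} (budget: {budget})",
        f"Ongoing issues to watch: {fmt(ongoing)}",
        f"Runbooks created/updated: {fmt(runbook_changes)}",
        f"Alerts needing tuning: {fmt(alerts_to_tune)}",
    ])
```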
The Metrics That Matter
Track monthly and share with the team:
| Metric | Healthy | Needs Work | Critical |
|---|---|---|---|
| Pages per week | < 5 | 5-10 | > 10 |
| After-hours pages | < 2/week | 2-4/week | > 4/week |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Mean time to resolve | < 30 min | 30-60 min | > 60 min |
| False alarm rate | < 10% | 10-25% | > 25% |
| Repeat incidents | < 20% | 20-40% | > 40% |
| On-call satisfaction | > 7/10 | 5-7/10 | < 5/10 |
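These roll up easily from incident records. A sketch, assuming each record carries acknowledgement and resolve times plus flags for after-hours, false alarm, and repeat, and that the month had at least one page:

```python
from statistics import mean

def monthly_metrics(incidents: list[dict], weeks: int = 4) -> dict[str, float]:
    """Roll up one month of incident records into the table's metrics."""
    return {
        "pages_per_week": len(incidents) / weeks,
        "after_hours_per_week": sum(i["after_hours"] for i in incidents) / weeks,
        "mtta_minutes": mean(i["ack_minutes"] for i in incidents),
        "mttr_minutes": mean(i["resolve_minutes"] for i in incidents),
        "false_alarm_rate": sum(i["false_alarm"] for i in incidents) / len(incidents),
        "repeat_rate": sum(i["repeat"] for i in incidents) / len(incidents),
    }
```

On-call satisfaction is the one metric here you can't compute; a short anonymous survey each month covers it.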
The Cultural Shift
On-call quality is a leading indicator of engineering culture:
- Good culture: On-call is a shared responsibility. Incidents drive improvements. The team celebrates reducing alert volume.
- Bad culture: On-call is a punishment. The same incidents recur monthly. Senior engineers find ways to avoid the rotation.
The difference isn't tooling. It's whether leadership treats on-call as something to endure or something to improve.
Make on-call better, and you'll make the product better, the team happier, and retention easier. It's one of the highest-leverage investments an engineering organization can make.