The On-Call Rotation That Doesn't Burn Out Your Team
On-Call Is Broken at Most Companies
The standard on-call setup: one engineer carries a pager for a week, gets woken up three times, spends the next week exhausted, and quietly starts interviewing. Repeat that cycle for twelve months and on-call becomes your biggest source of attrition.
On-call doesn't have to be this way. The goal isn't to endure incidents — it's to eliminate them.
The Sustainable On-Call Framework
Structure: The Two-Tier Model
Tier 1: First Responder (Primary On-Call)
→ Responds to all pages within 15 minutes
→ Follows runbook for known issues
→ Escalates to Tier 2 if not resolved in 30 minutes
→ Rotates weekly
Tier 2: Subject Matter Expert (Secondary On-Call)
→ Only engaged if Tier 1 can't resolve
→ Deep expertise in specific systems
→ Available within 30 minutes
→ Rotates monthly (less disruptive)
Why two tiers: Most pages (70-80%) are known issues with documented fixes. Tier 1 handles these with runbooks. Only novel problems reach the expert — which means experts are rarely woken up.
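To keep the escalation mechanical rather than ad hoc, the policy can be written down as data. Here is a minimal Python sketch of the two-tier rule; the names (Tier, owning_tier, the rotation labels) are illustrative and not any paging tool's API:

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class Tier:
    name: str
    rotation: str              # which rotation group answers this tier
    respond_within: timedelta  # how quickly this tier must acknowledge

POLICY = [
    Tier("primary", "weekly-rotation", respond_within=timedelta(minutes=15)),
    Tier("secondary", "monthly-sme-rotation", respond_within=timedelta(minutes=30)),
]

ESCALATE_AFTER = timedelta(minutes=30)  # Tier 1 hands off if still unresolved

def owning_tier(elapsed_since_page: timedelta) -> Tier:
    """Which tier owns the page, given how long it has gone unresolved."""
    return POLICY[0] if elapsed_since_page < ESCALATE_AFTER else POLICY[1]
```

For example, `owning_tier(timedelta(minutes=40))` returns the secondary tier, matching the 30-minute escalation rule above.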
Rotation Rules
1. Minimum 4 people in the rotation
→ Anything less = on-call every 3 weeks or more often = burnout
2. No back-to-back weeks
→ Minimum 3 weeks between on-call shifts
3. Voluntary swap system
→ Easy to trade shifts (Slack bot, PagerDuty)
→ No questions asked
4. Protected recovery time
→ If paged after midnight: late start next day
→ If paged 3+ times in one night: next day off
→ Non-negotiable
5. On-call compensation
→ Stipend for being on-call ($200-500/week)
→ Additional per-page compensation for after-hours
→ Or equivalent comp time
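Rules 1 and 2 are easy to enforce when the schedule is generated rather than hand-built. A rough Python sketch, assuming a simple round-robin and ignoring PTO and swaps:

```python
from datetime import date, timedelta

def build_rotation(engineers: list[str], start: date, weeks: int) -> list[tuple[date, str]]:
    """Weekly round-robin primary schedule; rejects rosters that violate rule 1."""
    if len(engineers) < 4:
        raise ValueError("need at least 4 people in the rotation")
    schedule = []
    for week in range(weeks):
        # Round-robin gives each person len(engineers) - 1 weeks between shifts,
        # which satisfies rule 2 (minimum 3 weeks apart) whenever the roster is 4+.
        schedule.append((start + timedelta(weeks=week), engineers[week % len(engineers)]))
    return schedule
```

With four or more engineers, the round-robin spacing automatically yields at least three weeks between any one person's shifts.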
The Alert Budget
This is the single most important concept. Set a maximum number of acceptable pages per week:
Alert Budget: 5 pages per on-call week
Week 1: 3 pages (under budget ✅)
Week 2: 7 pages (over budget ❌)
→ Mandatory incident review
→ Team dedicates 20% of next sprint to reducing alerts
Week 3: 4 pages (under budget ✅)
Week 4: 8 pages (over budget ❌)
→ Engineering leadership involved
→ Root cause analysis for every page
→ Systemic fix required before next sprint work
The rule: If on-call consistently exceeds the alert budget, it becomes the team's #1 priority — above features, above tech debt, above everything. The alert budget forces the team to fix the systems, not just endure them.
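One way to make the budget concrete is a small check run at every handoff. A sketch mirroring the thresholds in the example above; treating any repeat breach within the recent window as the escalation trigger is an assumption, not part of the framework:

```python
ALERT_BUDGET = 5  # pages per on-call week

def review_week(pages_this_week: int, recent_weekly_pages: list[int]) -> str:
    """Decide the follow-up for a finished on-call week.

    recent_weekly_pages: page counts from the last few weeks (rolling window).
    """
    if pages_this_week <= ALERT_BUDGET:
        return "under budget: no action"
    if any(p > ALERT_BUDGET for p in recent_weekly_pages):
        # A repeat breach, like Week 4 in the example above.
        return "repeat breach: leadership involved, RCA for every page, systemic fix first"
    return "over budget: mandatory incident review, 20% of next sprint on alert reduction"
```

For instance, `review_week(8, [3, 7, 4])` flags the repeat breach, matching Week 4 in the example.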
Reducing Incident Volume
The Post-Incident Review (Blameless)
Every page gets a 15-minute review:
## Incident: [Title]
**Date:** [When] **Duration:** [How long] **Severity:** [1-3]
### What happened
[2-3 sentences]
### Timeline
- HH:MM Alert fired
- HH:MM On-call acknowledged
- HH:MM Root cause identified
- HH:MM Fix deployed
- HH:MM Verified resolved
### Root cause
[Technical explanation]
### Action items
1. [Prevent recurrence] — Owner: [Name] — Due: [Date]
2. [Improve detection] — Owner: [Name] — Due: [Date]
### Classification
- [ ] Known issue (runbook exists but didn't work)
- [ ] New issue (needs new runbook)
- [ ] False alarm (alert needs tuning)
- [ ] Customer-caused (consider rate limiting)
The Incident Categories
Track where your pages come from:
| Category | Target % | If Over Target |
|---|---|---|
| False alarms | < 10% | Fix alert thresholds |
| Known issues with runbooks | < 30% | Automate the fix |
| Infrastructure (DB, cache, DNS) | < 20% | Invest in reliability |
| Application bugs | < 20% | Improve testing |
| External dependencies | < 20% | Add fallbacks |
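A short script can compare the actual mix against these targets each month. A sketch, assuming every page has already been tagged with one of the categories above:

```python
from collections import Counter

TARGETS = {  # mirrors the table above
    "false_alarm": 0.10,
    "known_issue": 0.30,
    "infrastructure": 0.20,
    "application_bug": 0.20,
    "external_dependency": 0.20,
}

def categories_over_target(page_categories: list[str]) -> dict[str, float]:
    """Return each category whose share of this month's pages exceeds its target."""
    counts = Counter(page_categories)
    total = max(len(page_categories), 1)
    return {
        category: counts[category] / total
        for category, target in TARGETS.items()
        if counts[category] / total > target
    }
```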
The Automation Ladder
For every recurring incident, climb this ladder:
Level 0: Page a human who follows a runbook
Level 1: Page a human + auto-collect diagnostics
Level 2: Auto-remediate + notify human after
Level 3: Auto-remediate + no notification (logged only)
Level 4: Prevent the condition entirely
Goal: Move every incident type up at least one level per quarter
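As an illustration of climbing one rung, here is a sketch that wraps an existing runbook step in an automatic attempt before paging anyone (Level 1 to Level 2). The diagnostics, remediation, notification, and paging hooks are placeholders for your own tooling, passed in as callables:

```python
from typing import Callable

def handle_alert(
    alert: dict,
    collect_diagnostics: Callable[[dict], dict],  # Level 1: gather context automatically
    remediate: Callable[[dict], bool],            # Level 2: the documented fix, automated
    notify: Callable[[str], None],                # post to a channel, no page
    page_human: Callable[[dict, dict], None],     # fall back to the old behavior
) -> None:
    diagnostics = collect_diagnostics(alert)
    if remediate(alert):
        # Success: notify or log only; no one is woken up.
        notify(f"Auto-remediated {alert.get('name', 'alert')}; diagnostics: {diagnostics}")
    else:
        # The automated fix didn't take; a human is paged, with context attached.
        page_human(alert, diagnostics)
```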
The On-Call Handoff
The handoff between on-call shifts should be a 15-minute meeting:
Outgoing on-call shares:
1. How many pages this week? (vs alert budget)
2. Any ongoing issues to watch?
3. Any new runbooks created or updated?
4. Any alerts that need tuning?
5. Anything unusual about the current system state?
Incoming on-call confirms:
1. PagerDuty/OpsGenie is configured correctly
2. VPN/access to all systems is working
3. Runbook index is bookmarked
4. Escalation contacts are current
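If you already run a Slack bot for shift swaps, the same bot can post the outgoing summary before the meeting. A sketch of the formatting helper, with the field names assumed:

```python
def handoff_summary(pages: int, budget: int, ongoing: list[str],
                    runbook_changes: list[str], alerts_to_tune: list[str]) -> str:
    """Format the outgoing on-call's answers as a message a bot could post."""
    def fmt(items: list[str]) -> str:
        return ", ".join(items) if items else "none"

    return "\n".join([
        f"Pages this week: {pages} (budget: {budget})",
        f"Ongoing issues to watch: {fmt(ongoing)}",
        f"Runbooks created/updated: {fmt(runbook_changes)}",
        f"Alerts needing tuning: {fmt(alerts_to_tune)}",
    ])
```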
The Metrics That Matter
Track monthly and share with the team:
| Metric | Healthy | Needs Work | Critical |
|---|---|---|---|
| Pages per week | < 5 | 5-10 | > 10 |
| After-hours pages | < 2/week | 2-4/week | > 4/week |
| Mean time to acknowledge | < 5 min | 5-15 min | > 15 min |
| Mean time to resolve | < 30 min | 30-60 min | > 60 min |
| False alarm rate | < 10% | 10-25% | > 25% |
| Repeat incidents | < 20% | 20-40% | > 40% |
| On-call satisfaction | > 7/10 | 5-7/10 | < 5/10 |
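These roll up easily from incident records. A sketch, assuming each record carries acknowledgement and resolve times plus flags for after-hours, false alarm, and repeat, and that the month had at least one page:

```python
from statistics import mean

def monthly_metrics(incidents: list[dict], weeks: int = 4) -> dict[str, float]:
    """Roll up one month of incident records into the table's metrics."""
    return {
        "pages_per_week": len(incidents) / weeks,
        "after_hours_per_week": sum(i["after_hours"] for i in incidents) / weeks,
        "mtta_minutes": mean(i["ack_minutes"] for i in incidents),
        "mttr_minutes": mean(i["resolve_minutes"] for i in incidents),
        "false_alarm_rate": sum(i["false_alarm"] for i in incidents) / len(incidents),
        "repeat_rate": sum(i["repeat"] for i in incidents) / len(incidents),
    }
```

On-call satisfaction is the one metric here you can't compute; a short anonymous survey each month covers it.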
The Cultural Shift
On-call quality is a leading indicator of engineering culture:
- Good culture: On-call is a shared responsibility. Incidents drive improvements. The team celebrates reducing alert volume.
- Bad culture: On-call is a punishment. The same incidents recur monthly. Senior engineers find ways to avoid the rotation.
The difference isn't tooling. It's whether leadership treats on-call as something to endure or something to improve.
Make on-call better, and you'll make the product better, the team happier, and retention easier. It's one of the highest-leverage investments an engineering organization can make.