Database Migrations Without Downtime
The Migration That Took Down Production
It's always the same story: a developer runs ALTER TABLE orders ADD COLUMN discount_code VARCHAR(255) on a table with 50 million rows. The ALTER needs an ACCESS EXCLUSIVE lock, queues behind a long-running query, and every query that arrives after it queues behind the ALTER. Writes block, reads pile up, the connection pool exhausts, and the site goes down for 20 minutes while the migration waits for its lock and everything else waits behind it.
The real cost: One client ran this exact migration during business hours on a table with 80M rows. Site down for 23 minutes. Orders lost: 1,247. Revenue impact: $94K. Customer support tickets: 600+. Three enterprise deals stalled because prospects saw the downtime. Total cost: $220K+ for a migration that should have taken zero downtime.
This is preventable. Every time.
Why Migrations Are Dangerous
PostgreSQL (like most databases) needs locks to modify a schema. The danger level depends on the operation:
| Operation | Lock Type | Danger Level |
|---|---|---|
| ADD COLUMN (nullable, no default) | ACCESS EXCLUSIVE (brief) | Low ✅ |
| ADD COLUMN with DEFAULT (PG 11+) | ACCESS EXCLUSIVE (brief) | Low ✅ |
| ADD COLUMN with DEFAULT (PG < 11) | ACCESS EXCLUSIVE (rewrites table) | Critical ❌ |
| DROP COLUMN | ACCESS EXCLUSIVE (brief) | Medium ⚠️ |
| ALTER COLUMN TYPE | ACCESS EXCLUSIVE (rewrites table) | Critical ❌ |
| ADD INDEX | SHARE (blocks writes) | High ⚠️ |
| ADD INDEX CONCURRENTLY | SHARE UPDATE EXCLUSIVE (doesn't block reads or writes) | Low ✅ |
| ADD NOT NULL constraint | ACCESS EXCLUSIVE (scans table) | High ⚠️ |
| ADD FOREIGN KEY | SHARE ROW EXCLUSIVE (on both tables) | High ⚠️ |
| RENAME COLUMN | ACCESS EXCLUSIVE (brief) | Medium ⚠️ |
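Even the "brief" ACCESS EXCLUSIVE operations can cause an outage, because the ALTER queues behind any long-running transaction and every query that arrives afterwards queues behind the ALTER. A simple guard, sketched here with illustrative values, is to cap how long DDL is allowed to wait and retry on failure rather than letting the queue build:
-- Fail fast instead of stalling every query behind the ALTER;
-- re-run the migration if it times out
SET lock_timeout = '2s';
SET statement_timeout = '15s';
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(255);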
The Safe Migration Playbook
Pattern 1: Add Column (The Right Way)
-- ✅ SAFE: Nullable column, no default
ALTER TABLE orders ADD COLUMN discount_code VARCHAR(255);
-- ✅ SAFE on PG 11+: With default (metadata-only change)
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'pending';
-- ❌ DANGEROUS on PG < 11: Rewrites entire table
ALTER TABLE orders ADD COLUMN status VARCHAR(50) DEFAULT 'pending';
Pattern 2: Create Index (Always Concurrently)
-- ❌ DANGEROUS: Blocks all writes until complete
CREATE INDEX idx_orders_email ON orders (email);
-- ✅ SAFE: Builds index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_email ON orders (email);
-- Note: CONCURRENTLY can't run inside a transaction
-- Your migration tool needs to support this
Important: If a concurrent index build fails, it leaves an invalid index. Always check:
SELECT indexrelid::regclass, indisvalid
FROM pg_index
WHERE NOT indisvalid;
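If that query returns anything, drop the invalid index and rebuild it concurrently; a minimal sketch, reusing the example index from Pattern 2:
DROP INDEX CONCURRENTLY IF EXISTS idx_orders_email;
CREATE INDEX CONCURRENTLY idx_orders_email ON orders (email);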
-- Drop invalid indexes and retry
Pattern 3: Change Column Type (The Expand-Contract Pattern)
Never run ALTER COLUMN TYPE in place on a large table. Use the expand-contract pattern instead.
Why this matters: ALTER COLUMN TYPE on a 30M row table rewrites
the entire table while holding an ACCESS EXCLUSIVE lock. On production
hardware, that's 8-15 minutes of total downtime. At $8K/minute in lost
GMV, that's $64K-120K in lost revenue. The expand-contract pattern costs
zero downtime; the price is a few extra deploys spread over a week.
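Before deciding, it helps to know how much data an in-place rewrite would have to copy. A quick sizing check (a sketch; reltuples is the planner's row estimate, so it is approximate):
SELECT pg_size_pretty(pg_total_relation_size('orders')) AS total_size,
       (SELECT reltuples::bigint FROM pg_class WHERE relname = 'orders') AS approx_rows;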
Step 1: Add new column
ALTER TABLE orders ADD COLUMN amount_v2 BIGINT;
Step 2: Backfill (in batches)
UPDATE orders SET amount_v2 = amount
WHERE id BETWEEN 1 AND 10000;
-- Repeat in batches of 10k
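If the id space has big gaps (deleted rows, old imports), fixed id ranges waste iterations on empty stretches. An alternative batch shape, sketched here assuming amount is NOT NULL, grabs the next 10k rows that still need the backfill:
-- Repeat until this updates zero rows
UPDATE orders
SET amount_v2 = amount
WHERE id IN (
  SELECT id
  FROM orders
  WHERE amount_v2 IS NULL
    AND amount IS NOT NULL
  ORDER BY id
  LIMIT 10000
);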
Step 3: Deploy code that writes to BOTH columns
-- Application writes to amount AND amount_v2
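If touching every write path in a single deploy is impractical, a trigger can keep the new column in sync during the transition. A sketch, with illustrative names and PG 11+ trigger syntax (use EXECUTE PROCEDURE on older versions):
CREATE OR REPLACE FUNCTION orders_copy_amount_to_v2() RETURNS trigger AS $$
BEGIN
  -- Old column stays the source of truth while the migration is in flight
  NEW.amount_v2 := NEW.amount;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER orders_copy_amount_to_v2
BEFORE INSERT OR UPDATE ON orders
FOR EACH ROW EXECUTE FUNCTION orders_copy_amount_to_v2();
Drop the trigger before Step 6; otherwise it would overwrite amount_v2 with the stale old value once the application stops writing amount.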
Step 4: Verify data consistency
SELECT COUNT(*) FROM orders
WHERE amount_v2 IS NULL AND created_at < NOW() - INTERVAL '1 hour';
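A stricter check compares values instead of just counting NULLs; IS DISTINCT FROM handles NULLs sanely, so a zero count means the two columns agree:
SELECT COUNT(*) AS mismatched
FROM orders
WHERE amount_v2 IS DISTINCT FROM amount;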
Step 5: Switch reads to new column
-- Application reads from amount_v2
Step 6: Stop writing to old column
-- Application only writes to amount_v2
Step 7: Drop old column (in a later migration)
ALTER TABLE orders DROP COLUMN amount;
Step 8: Rename new column (optional)
ALTER TABLE orders RENAME COLUMN amount_v2 TO amount;
Yes, it's more steps. No, your site doesn't go down.
Pattern 4: Add NOT NULL Constraint
-- ❌ DANGEROUS: Scans entire table while holding lock
ALTER TABLE orders ALTER COLUMN email SET NOT NULL;
-- ✅ SAFE: Add as a CHECK constraint with NOT VALID
ALTER TABLE orders ADD CONSTRAINT orders_email_not_null
CHECK (email IS NOT NULL) NOT VALID;
-- Then validate separately (no exclusive lock needed)
ALTER TABLE orders VALIDATE CONSTRAINT orders_email_not_null;
Pattern 5: Add Foreign Key
-- ❌ DANGEROUS: Validates all existing rows while holding lock
ALTER TABLE orders ADD FOREIGN KEY (customer_id)
REFERENCES customers (id);
-- ✅ SAFE: Add without validation, then validate separately
ALTER TABLE orders ADD CONSTRAINT fk_orders_customer
FOREIGN KEY (customer_id) REFERENCES customers (id)
NOT VALID;
ALTER TABLE orders VALIDATE CONSTRAINT fk_orders_customer;
The Backfill Strategy
Large data backfills are as dangerous as schema changes if done wrong:
// ❌ DANGEROUS: One massive UPDATE
await db.query(`UPDATE orders SET discount_code = 'none' WHERE discount_code IS NULL`);
// Locks millions of rows, generates massive WAL, kills replication lag
Real failure: A team ran a one-shot UPDATE on 45M rows to backfill a new
field. The query ran for 18 minutes, generated 80GB of WAL, and pushed
replication lag to 22 minutes. Read replicas served stale data. Inventory
showed products in stock that were actually sold out. Oversold 340 items.
Cost: $28K in customer credits + support burden.
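Replication lag is visible from the primary, and a query like this could sit behind the getReplicationLag() helper used below (a sketch; replay_lag exists in pg_stat_replication on PG 10+):
SELECT application_name,
       EXTRACT(EPOCH FROM replay_lag) * 1000 AS replay_lag_ms
FROM pg_stat_replication;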
// ✅ SAFE: Batched updates with pauses
async function backfill() {
  const BATCH_SIZE = 5000;
  // Walk the id range up to the current max so a batch that happens to
  // match zero rows doesn't end the loop early
  const { rows } = await db.query(`SELECT COALESCE(MAX(id), 0) AS max_id FROM orders`);
  const maxId = Number(rows[0].max_id);
  let lastId = 0;
  while (lastId < maxId) {
    await db.query(`
      UPDATE orders
      SET discount_code = 'none'
      WHERE id > $1
        AND id <= $2
        AND discount_code IS NULL
    `, [lastId, lastId + BATCH_SIZE]);
    lastId += BATCH_SIZE;
    // Pause to let replication catch up
    await sleep(100);
    // Check replication lag
    const lag = await getReplicationLag();
    if (lag > 5000) {
      console.log(`Replication lag ${lag}ms, pausing...`);
      await sleep(5000);
    }
  }
}
The Pre-Migration Checklist
Run before every production migration:
Before:
[ ] Tested on staging with production-size data
[ ] Checked lock type for each operation
[ ] Verified no long-running queries on target tables (see the query after this checklist)
[ ] Backfill strategy documented for data migrations
[ ] Rollback plan written and tested
[ ] Team notified of migration window
During:
[ ] Monitor: active queries, lock waits, replication lag
[ ] Set statement_timeout to prevent runaway locks
[ ] Run during low-traffic window if possible
After:
[ ] Verify schema changes applied correctly
[ ] Check for invalid indexes
[ ] Monitor application error rates for 30 minutes
[ ] Verify replication is caught up
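For the long-running-query check in the "Before" list, a direct look at pg_stat_activity is enough (a sketch; adjust the threshold to your workload):
SELECT pid, now() - query_start AS runtime, state, query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '1 minute'
ORDER BY runtime DESC;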
Tooling That Helps
- pg_stat_activity — see active queries and locks in real time
- pg_locks — understand what's waiting for what
- pgroll — automated zero-downtime schema migrations
- reshape — another zero-downtime migration tool
- strong_migrations (Ruby) / safe-pg-migrations — catch dangerous migrations in CI
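During a migration, the most useful combination of pg_stat_activity and pg_locks is "who is blocking whom" (a sketch; pg_blocking_pids() is available from PG 9.6):
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       now() - query_start AS waiting_for,
       query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;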
The Rule
If a migration takes longer than 1 second on your largest table, it needs the expand-contract pattern. No exceptions. The extra hour or two of development time is always worth avoiding the 20 minutes of downtime and the incident retro that follows.
The math is simple:
Unsafe migration:
- Development time: 15 minutes (write the ALTER statement)
- Downtime: 15-25 minutes (table locked during migration)
- Revenue lost at $6K/min: $90K-150K
- Customer impact: 100% of users
- Total cost: $90K-150K + reputation damage
Safe migration (expand-contract):
- Development time: 90 minutes (7 steps over 3 deploys)
- Downtime: 0 minutes
- Revenue lost: $0
- Customer impact: 0%
- Total cost: 2 hours of engineering time ($200)
ROI: Spend $200 to save $90K-150K. Yet teams skip it to "move fast."
Schema changes are a solved problem. The solution is just slower and more deliberate than most teams want to be. Be deliberate anyway.