
Zero-Downtime Database Migrations — The Patterns That Actually Work

March 18, 2026 · ScaledByDesign
database · migrations · postgresql · devops · zero-downtime

The 4 AM Migration Horror Story

A client's team scheduled a database migration for 4 AM on a Saturday. "Add a NOT NULL column to the orders table. Should take 5 minutes." The orders table had 47 million rows. The ALTER TABLE locked the entire table. The migration ran for 3 hours and 22 minutes. During that time, every order, every checkout, every dashboard query — blocked. $180K in lost revenue.

Zero-downtime migrations aren't a luxury. They're a requirement for any production system processing real transactions.

The Rules of Zero-Downtime Migrations

Rule 1: Never lock a table with active traffic
Rule 2: Every migration must be backward-compatible
Rule 3: Deploy in phases — code first, then schema, then cleanup
Rule 4: Always have a rollback plan that doesn't require another migration
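A cheap safety net for Rule 1 is to cap how long any DDL statement may wait for its lock. A sketch in Postgres (the 2-second value is an arbitrary choice; tune it to your traffic):

```sql
-- Fail fast instead of queueing behind traffic: while a DDL statement
-- waits for its ACCESS EXCLUSIVE lock, every query behind it queues too.
-- With a lock_timeout, the ALTER aborts cleanly and you retry later.
SET lock_timeout = '2s';
ALTER TABLE customers ADD COLUMN full_name TEXT;
```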

Pattern 1: The Expand-Contract Pattern

The safest approach for schema changes. Three phases:

Phase 1 — EXPAND: Add new column/table (nullable, no constraints)
  → Old code still works, new column is just ignored
  
Phase 2 — MIGRATE: Backfill data, deploy code that writes to both
  → Both old and new schemas work simultaneously
  
Phase 3 — CONTRACT: Remove old column/table, add constraints
  → Only after all code uses the new schema

Example: Renaming a column from name to full_name:

-- Phase 1: EXPAND (deploy this migration)
ALTER TABLE customers ADD COLUMN full_name TEXT;
 
-- Phase 2: MIGRATE (deploy code change + backfill)
-- Code now writes to BOTH columns
UPDATE customers SET full_name = name WHERE full_name IS NULL;
-- Run in batches for large tables (see Pattern 3)
 
-- Phase 3: CONTRACT (after all code uses full_name)
ALTER TABLE customers DROP COLUMN name;
ALTER TABLE customers ALTER COLUMN full_name SET NOT NULL;

Each phase is a separate deployment. If anything goes wrong, you roll back to the previous phase — no data loss, no downtime.
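Phase 2's dual-write is an application-code change, not a migration. A hypothetical sketch of what it can look like in TypeScript (the `CustomerRow` shape and helper names are illustrative, not from a specific codebase):

```typescript
interface CustomerRow {
  name: string;
  full_name: string | null;
}

// Write path: every insert/update populates BOTH columns until the
// contract phase removes the old one
function dualWriteValues(displayName: string): CustomerRow {
  return { name: displayName, full_name: displayName };
}

// Read path: prefer the new column, fall back to the old one until
// the backfill has finished
function readFullName(row: CustomerRow): string {
  return row.full_name ?? row.name;
}
```

Because reads fall back to the old column, this code is safe to deploy before, during, and after the backfill.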

Pattern 2: Adding NOT NULL Columns Safely

The migration that caused the 4 AM outage. Here's how to do it safely:

-- ❌ DANGEROUS: on Postgres 10 and earlier this rewrites the whole table
-- under an ACCESS EXCLUSIVE lock while it backfills the default
-- (Postgres 11+ skips the rewrite for a constant default, but the
-- lock-queue risk remains if the ALTER has to wait for its lock)
ALTER TABLE orders ADD COLUMN status TEXT NOT NULL DEFAULT 'pending';
 
-- ✅ SAFE: Three-step approach
-- Step 1: Add nullable column (instant, no lock)
ALTER TABLE orders ADD COLUMN status TEXT;
 
-- Step 2: Set default for new rows (instant, no lock)
ALTER TABLE orders ALTER COLUMN status SET DEFAULT 'pending';
 
-- Step 3: Backfill existing rows in batches
-- (run outside an explicit transaction block; COMMIT inside DO needs
-- Postgres 11+ and releases each batch's row locks as you go)
DO $$
DECLARE
  batch_size INT := 10000;
  total_updated INT := 0;
BEGIN
  LOOP
    UPDATE orders
    SET status = 'pending'
    WHERE id IN (
      SELECT id FROM orders
      WHERE status IS NULL
      LIMIT batch_size
      FOR UPDATE SKIP LOCKED
    );

    GET DIAGNOSTICS total_updated = ROW_COUNT;
    EXIT WHEN total_updated = 0;

    COMMIT;  -- release this batch's row locks before the next one
    RAISE NOTICE 'Updated % rows', total_updated;
    PERFORM pg_sleep(0.1);  -- Brief pause to reduce load
  END LOOP;
END $$;
 
-- Step 4: Add NOT NULL constraint (after backfill completes)
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;

The FOR UPDATE SKIP LOCKED clause is critical — instead of waiting on rows that concurrent transactions have locked, the backfill skips them, so neither side queues behind the other. Skipped rows are still NULL and get picked up on a later pass of the loop.
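One more refinement: Step 4's SET NOT NULL itself scans the whole table under an ACCESS EXCLUSIVE lock. On Postgres 12+ you can split that cost with a validated CHECK constraint, a sketch:

```sql
-- Add the constraint without checking existing rows (brief lock only)
ALTER TABLE orders ADD CONSTRAINT orders_status_not_null
  CHECK (status IS NOT NULL) NOT VALID;

-- Validate separately: this scans the table, but only takes a
-- SHARE UPDATE EXCLUSIVE lock, so reads and writes continue
ALTER TABLE orders VALIDATE CONSTRAINT orders_status_not_null;

-- Postgres 12+ accepts the validated constraint as proof, so this
-- becomes a metadata-only change instead of a full-table scan
ALTER TABLE orders ALTER COLUMN status SET NOT NULL;
ALTER TABLE orders DROP CONSTRAINT orders_status_not_null;
```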

Pattern 3: Large Table Backfills

For tables with millions of rows, batch processing is non-negotiable:

// Assumes `db` is a connected node-postgres Pool (or similar client).
// Identifiers are interpolated directly into the SQL, so pass only
// trusted, hard-coded table/clause strings — never user input.
const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function backfillInBatches(options: {
  table: string;
  setClause: string;
  whereClause: string;
  batchSize: number;
  delayMs: number;
}) {
  let totalUpdated = 0;
  let batchCount = 0;

  while (true) {
    const result = await db.query(`
      WITH batch AS (
        SELECT id FROM ${options.table}
        WHERE ${options.whereClause}
        LIMIT ${options.batchSize}
        FOR UPDATE SKIP LOCKED
      )
      UPDATE ${options.table}
      SET ${options.setClause}
      WHERE id IN (SELECT id FROM batch)
    `);

    const updated = result.rowCount ?? 0;  // rowCount can be null in pg typings
    totalUpdated += updated;
    batchCount++;

    console.log(`Batch ${batchCount}: ${updated} rows (${totalUpdated} total)`);

    if (updated === 0) break;
    await sleep(options.delayMs);  // Reduce database load
  }

  return { totalUpdated, batchCount };
}
 
// Usage
await backfillInBatches({
  table: "orders",
  setClause: "status = 'pending'",
  whereClause: "status IS NULL",
  batchSize: 5000,
  delayMs: 100,  // 100ms pause between batches
});

Pattern 4: Creating Indexes Without Locking

Standard CREATE INDEX locks the table for writes. Use CONCURRENTLY:

-- ❌ LOCKS the table during index creation
CREATE INDEX idx_orders_status ON orders (status);
 
-- ✅ Builds index without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);

The CONCURRENTLY option takes longer but doesn't block any queries. One caveat: it can't run inside a transaction block, and if it fails, you need to drop the invalid index and retry.
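If a CONCURRENTLY build does fail, the half-built index stays behind marked invalid. Finding and removing it looks something like this (the index name matches the example above):

```sql
-- List invalid indexes left behind by failed CONCURRENTLY builds
SELECT c.relname AS index_name
FROM pg_class c
JOIN pg_index i ON i.indexrelid = c.oid
WHERE NOT i.indisvalid;

-- Drop it (also without blocking writes) and retry the build
DROP INDEX CONCURRENTLY idx_orders_status;
CREATE INDEX CONCURRENTLY idx_orders_status ON orders (status);
```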

The Migration Checklist

Before running any migration on production:

Pre-Migration:
  □ Tested on a production-sized dataset (not just dev with 100 rows)
  □ Measured lock time — any operation > 1 second needs the safe pattern
  □ Backfill script tested and idempotent (safe to run multiple times)
  □ Rollback plan documented and tested
  □ Monitoring alerts set up for query latency and lock waits

During Migration:
  □ Run during lowest-traffic window (but don't depend on low traffic)
  □ Monitor active locks: SELECT * FROM pg_stat_activity WHERE wait_event_type = 'Lock'
  □ Monitor replication lag if using replicas
  □ Have a kill switch ready for long-running queries
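For the lock monitoring and the kill switch, queries along these lines work on Postgres 9.6+ (the pid 12345 is a placeholder):

```sql
-- Who is blocked, and by whom
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       left(query, 80) AS query
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;

-- Kill switch: cancel the current query, or terminate the backend
SELECT pg_cancel_backend(12345);     -- gentle
SELECT pg_terminate_backend(12345);  -- forceful
```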

Post-Migration:
  □ Verify data integrity (row counts, null checks, constraint violations)
  □ Monitor application error rates for 30 minutes
  □ Run ANALYZE on affected tables to update query planner statistics
  □ Document the migration for the team
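Applied to the Pattern 2 example, the verification items might look like this (a sketch):

```sql
-- No rows missed by the backfill (expect 0)
SELECT count(*) FROM orders WHERE status IS NULL;

-- Constraint actually in place (expect 'NO')
SELECT is_nullable FROM information_schema.columns
WHERE table_name = 'orders' AND column_name = 'status';

-- Refresh planner statistics after the bulk update
ANALYZE orders;
```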

Database migrations don't have to be scary. With the expand-contract pattern, batched backfills, and concurrent index creation, you can modify production schemas with millions of rows while your application keeps serving traffic. The key is patience — three safe deploys beats one dangerous one.
