Terraform State Management Lessons We Learned the Hard Way

March 2, 2026 · ScaledByDesign
Tags: terraform, infrastructure-as-code, devops, aws, state-management

The $50K State File Incident

A client's engineer ran terraform apply on a Friday afternoon. The state file had drifted from reality because someone had made manual changes in the AWS console. Terraform's plan showed "47 resources to destroy and recreate." The engineer, in a hurry, approved it.

Forty-seven resources — including three production RDS instances — were destroyed and recreated. The databases came back empty. Four hours of downtime. $50K in lost revenue. The backups worked (thankfully), but the recovery took until 2am.

All because of a state file that nobody was actively managing.

State Management Rule 1: Remote State, Always

If your Terraform state file lives on someone's laptop, it's not a matter of if you'll lose it — it's when.

# backend.tf — this is non-negotiable
terraform {
  backend "s3" {
    bucket         = "company-terraform-state"
    key            = "prod/networking/terraform.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-state-locks"
  }
}

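The S3 backend with DynamoDB locking gives you three critical things:

  • Shared access: Everyone works with the same state
  • Locking: Two people can't modify state simultaneously
  • Encryption: State can hold secrets in plaintext (RDS passwords, API keys), so encrypt = true keeps it encrypted at rest

Recent Terraform releases (1.10 and later) can also lock natively in S3 with use_lockfile = true, which removes the need for the DynamoDB table.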
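
The backend block assumes the bucket and lock table already exist. A minimal sketch of those supporting resources might look like the following. The bucket and table names come from the backend block above; the resource labels, the KMS setting, and the billing mode are assumptions for illustration. This kind of configuration usually lives in its own small bootstrap project.

```hcl
# Bootstrap sketch: resources the S3 backend expects to exist already.
resource "aws_s3_bucket" "terraform_state" {
  bucket = "company-terraform-state"

  lifecycle {
    prevent_destroy = true # Losing this bucket means losing all state
  }
}

# Encrypt state objects at rest by default.
resource "aws_s3_bucket_server_side_encryption_configuration" "state" {
  bucket = aws_s3_bucket.terraform_state.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "aws:kms" # Assumption: AWS-managed key; "AES256" is also valid
    }
  }
}

# State should never be publicly readable.
resource "aws_s3_bucket_public_access_block" "state" {
  bucket                  = aws_s3_bucket.terraform_state.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# Lock table referenced by dynamodb_table in the backend block.
resource "aws_dynamodb_table" "terraform_locks" {
  name         = "terraform-state-locks"
  billing_mode = "PAY_PER_REQUEST" # Assumption: lock traffic is tiny, so on-demand is enough
  hash_key     = "LockID"          # The S3 backend requires a string key with this exact name

  attribute {
    name = "LockID"
    type = "S"
  }
}
```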

State Management Rule 2: State Isolation

One giant state file for your entire infrastructure means every apply can touch every resource you own. Split state by environment and component:

terraform/
├── modules/              # Reusable modules
│   ├── networking/
│   ├── database/
│   └── compute/
├── environments/
│   ├── prod/
│   │   ├── networking/   # State: prod/networking/terraform.tfstate
│   │   ├── database/     # State: prod/database/terraform.tfstate
│   │   ├── compute/      # State: prod/compute/terraform.tfstate
│   │   └── monitoring/   # State: prod/monitoring/terraform.tfstate
│   ├── staging/
│   │   ├── networking/
│   │   ├── database/
│   │   └── compute/
│   └── dev/
│       └── ...

Each component has its own state file. Benefits:

  • Blast radius: A bad apply in compute can't destroy your database
  • Speed: Small state files mean fast plan/apply cycles
  • Team parallelism: Different teams can work on different components simultaneously
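
Once state is split, components still need to share values such as subnet IDs. One common way to do that is the terraform_remote_state data source, sketched below for the compute component reading from networking. The output name private_subnet_id, the ami_id variable, and the instance type are hypothetical; the networking component would have to declare that output itself.

```hcl
# environments/prod/compute/main.tf (sketch)

variable "ami_id" {
  type        = string
  description = "AMI for the web instance (hypothetical input)"
}

# Read-only view of the networking component's outputs.
data "terraform_remote_state" "networking" {
  backend = "s3"

  config = {
    bucket = "company-terraform-state"
    key    = "prod/networking/terraform.tfstate"
    region = "us-east-1"
  }
}

resource "aws_instance" "web" {
  ami           = var.ami_id
  instance_type = "t3.micro" # Assumption for illustration

  # Hypothetical output; must exist as an `output` block in prod/networking.
  subnet_id = data.terraform_remote_state.networking.outputs.private_subnet_id
}
```

The data source only reads the other state, so a bad apply in compute still cannot change networking resources. It does require read access to the whole networking state object, which is worth factoring into bucket permissions.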

State Management Rule 3: Never Modify State Manually

When state drifts from reality, the temptation is to edit the state file directly. Don't. Use Terraform's built-in state commands:

# Import an existing resource into state
terraform import aws_instance.web i-1234567890abcdef0
 
# Move a resource to a new address (after refactoring)
terraform state mv aws_instance.old aws_instance.new
 
# Remove a resource from state (without destroying it)
terraform state rm aws_instance.legacy
 
# Show current state for a resource
terraform state show aws_instance.web
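
Newer Terraform versions also offer declarative equivalents for the first three commands, which has the advantage that the state change shows up in plan output and goes through code review. The fragment below is a sketch reusing the same resource addresses; it assumes Terraform 1.7 or later (import blocks arrived in 1.5, moved in 1.1, removed in 1.7) and is not a complete configuration by itself.

```hcl
# Import an existing resource into state.
# Requires a matching resource "aws_instance" "web" block in the configuration.
import {
  to = aws_instance.web
  id = "i-1234567890abcdef0"
}

# Record a rename so Terraform updates the address instead of
# planning a destroy and recreate. The configuration should now
# declare aws_instance.new and no longer declare aws_instance.old.
moved {
  from = aws_instance.old
  to   = aws_instance.new
}

# Forget a resource without destroying it.
# The resource "aws_instance" "legacy" block must be deleted from the configuration.
removed {
  from = aws_instance.legacy

  lifecycle {
    destroy = false
  }
}
```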

State Management Rule 4: Prevent Manual Changes

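The $50K incident happened because someone made changes in the AWS console. Terraform alone can't stop console edits, but it can refuse to destroy critical resources and make it obvious which resources it manages: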

# Prevent accidental destruction of critical resources
resource "aws_db_instance" "production" {
  # ... configuration ...
 
  lifecycle {
    prevent_destroy = true  # Terraform will refuse to destroy this
  }
}
 
# Tag everything managed by Terraform
resource "aws_instance" "web" {
  # ... configuration ...
  tags = {
    ManagedBy   = "terraform"
    Environment = var.environment
    Component   = "compute"
    StateFile   = "prod/compute/terraform.tfstate"
  }
}
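
Two related settings are worth considering alongside the snippets above. This is a sketch under stated assumptions: the identifier, engine, sizing, and username are hypothetical placeholders, and the blocks illustrate the idea rather than any particular production setup. default_tags applies the tags at the provider level so individual resources can't forget them, and deletion_protection makes AWS itself reject a delete, whether it comes from Terraform or the console.

```hcl
# Tag every taggable resource this provider creates.
provider "aws" {
  region = "us-east-1"

  default_tags {
    tags = {
      ManagedBy = "terraform"
      StateFile = "prod/database/terraform.tfstate"
    }
  }
}

resource "aws_db_instance" "production" {
  identifier        = "prod-primary" # Hypothetical
  engine            = "postgres"     # Hypothetical
  instance_class    = "db.t3.medium" # Hypothetical
  allocated_storage = 100            # Hypothetical
  username          = "app_admin"    # Hypothetical

  # Let RDS generate and store the password in Secrets Manager,
  # which also keeps it out of the state file.
  manage_master_user_password = true

  # Enforced by AWS: delete calls fail until this is switched off.
  deletion_protection = true

  # If the instance is ever deleted on purpose, keep a final snapshot.
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-primary-final"
  backup_retention_period   = 7

  lifecycle {
    prevent_destroy = true # Enforced by Terraform: plans that destroy this resource error out
  }
}
```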

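Tags and lifecycle rules don't stop console edits by themselves. Use AWS Config rules or CloudTrail-based alerts to flag manual modifications to Terraform-managed resources, and SCPs or IAM policies to block them outright.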
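
As one possible shape for the blocking side, the sketch below defines a service control policy that denies a handful of destructive actions on resources tagged ManagedBy = terraform unless the caller is the deployment role. It assumes AWS Organizations is in use; the role name terraform-deployer, the target_ou_id variable, and the action list are hypothetical. Tag-based conditions are not supported by every action in every service, so each action should be checked against the AWS service authorization reference before relying on this.

```hcl
variable "target_ou_id" {
  type        = string
  description = "Organizational unit to attach the policy to (hypothetical input)"
}

data "aws_iam_policy_document" "protect_terraform_managed" {
  statement {
    sid    = "DenyManualChangesToTerraformManaged"
    effect = "Deny"

    # Illustrative subset; extend per service after verifying tag-condition support.
    actions = [
      "ec2:TerminateInstances",
      "ec2:ModifyInstanceAttribute",
      "rds:ModifyDBInstance",
      "rds:DeleteDBInstance",
    ]
    resources = ["*"]

    # Only resources carrying the Terraform tag are protected.
    condition {
      test     = "StringEquals"
      variable = "aws:ResourceTag/ManagedBy"
      values   = ["terraform"]
    }

    # The deployment role (hypothetical name) is exempt.
    condition {
      test     = "ArnNotLike"
      variable = "aws:PrincipalArn"
      values   = ["arn:aws:iam::*:role/terraform-deployer"]
    }
  }
}

resource "aws_organizations_policy" "protect_terraform_managed" {
  name    = "protect-terraform-managed"
  type    = "SERVICE_CONTROL_POLICY"
  content = data.aws_iam_policy_document.protect_terraform_managed.json
}

resource "aws_organizations_policy_attachment" "protect_terraform_managed" {
  policy_id = aws_organizations_policy.protect_terraform_managed.id
  target_id = var.target_ou_id
}
```

A deny like this also blocks legitimate emergency fixes, so teams that adopt it usually pair it with a documented break-glass role.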

State Management Rule 5: Drift Detection

Don't wait for the next manual terraform plan to reveal drift. Run drift detection on a schedule:

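# .github/workflows/drift-detection.yml
name: Terraform Drift Detection
on:
  schedule:
    - cron: '0 8 * * 1-5'  # Every weekday at 08:00 UTC

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false  # Check every component even if one job fails
      matrix:
        component: [networking, database, compute, monitoring]
    steps:
      - uses: actions/checkout@v4
      # Assumes AWS credentials are configured for the job (not shown)
      - uses: hashicorp/setup-terraform@v3

      - name: Terraform Init
        working-directory: environments/prod/${{ matrix.component }}
        run: terraform init -input=false

      - name: Terraform Plan (Drift Check)
        id: plan  # Lets the alert step read this step's outputs
        working-directory: environments/prod/${{ matrix.component }}
        # -detailed-exitcode: 0 = no changes, 1 = error, 2 = changes (drift)
        run: terraform plan -detailed-exitcode -input=false -out=drift.plan
        continue-on-error: true

      - name: Alert on Drift
        # The setup-terraform wrapper exposes terraform's exit code as an output
        if: steps.plan.outputs.exitcode == '2'
        env:
          SLACK_WEBHOOK: ${{ secrets.SLACK_WEBHOOK }}  # Assumed secret name
        run: |
          # Send Slack alert with drift details
          curl -X POST -H 'Content-Type: application/json' "$SLACK_WEBHOOK" -d "{
            \"text\": \"⚠️ Terraform drift detected in prod/${{ matrix.component }}\"
          }"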

State Management Rule 6: State File Backups

S3 versioning gives you state file history. Enable it on the state bucket, and cap how long old versions are kept:

resource "aws_s3_bucket_versioning" "state" {
  bucket = aws_s3_bucket.terraform_state.id
  versioning_configuration {
    status = "Enabled"
  }
}
 
resource "aws_s3_bucket_lifecycle_configuration" "state" {
  bucket = aws_s3_bucket.terraform_state.id
  rule {
    id     = "state-versions"
    status = "Enabled"
    filter {}  # Empty filter: the rule applies to every object in the bucket
    noncurrent_version_expiration {
      noncurrent_days = 90  # Keep 90 days of state history
    }
  }
}

When something goes wrong (and it will), you can restore an earlier version of the state object from S3, then run terraform plan to confirm it matches reality before applying anything.

The Checklist

Before any terraform apply in production, verify:

  • State is remote with locking enabled
  • You're targeting the correct workspace/environment
  • The plan output matches your expectations (read every line)
  • Critical resources have prevent_destroy lifecycle rules
  • Drift detection is running on a schedule
  • State backups are enabled with versioning

Terraform is a powerful tool. State management is what separates teams that use it successfully from teams that have $50K incidents on Friday afternoons. Treat your state files with the same care you treat your production databases — because they control them.
