
Terraform State Surgery - Splitting, Moving, and Refactoring Without Downtime


Your Terraform state file started small. A VPC here, an RDS instance there. Then someone added the EKS cluster. Then the Lambda functions. Then three more environments. Now terraform plan takes 15 minutes, and you’re terrified to touch anything because 400 resources might get recreated.

Sound familiar? Time for state surgery.

This post covers the real-world techniques for splitting monolithic state files, moving resources between states, and refactoring your Terraform structure without accidentally destroying production.

Code Repository: All code from this post is available at github.com/moabukar/blog-code/terraform-state-surgery

Why Split State?

Before we dive in, let’s be clear about why you’d want to do this:

  1. Plan times - Large states mean slow plans. A 500-resource state can take 10+ minutes to plan.
  2. Blast radius - One bad terraform apply can affect everything in the state.
  3. Team ownership - Different teams want to manage their own infrastructure.
  4. Lifecycle differences - Networking changes rarely; applications change daily.
  5. State locking conflicts - Multiple engineers blocked waiting for the same state lock.

The Golden Rules

Before any state manipulation:

# 1. ALWAYS backup your state first
terraform state pull > state-backup-$(date +%Y%m%d-%H%M%S).json

# 2. ALWAYS run plan after any state change
terraform plan
# Must show: "No changes. Your infrastructure matches the configuration."

# 3. NEVER delete the backup until you've verified everything works

If terraform plan shows any changes after state manipulation, stop. Something went wrong.


Technique 1: Moving Resources Between States

The most common scenario: you have a monolithic state and want to extract resources into a new state file.

The Scenario

You have everything in one state:

aws_vpc.main
aws_subnet.private[0]
aws_subnet.private[1]
aws_subnet.public[0]
aws_subnet.public[1]
aws_eks_cluster.main
aws_eks_node_group.workers
aws_db_instance.database
aws_lambda_function.api

You want to split into:

  • networking/ - VPC, subnets
  • eks/ - Cluster and node groups
  • database/ - RDS
  • application/ - Lambda functions

Step 1: Create the New State Structure

mkdir -p terraform/{networking,eks,database,application}

Step 2: Move the Code

Copy the relevant resource blocks to each new directory. For example, terraform/networking/main.tf:

# terraform/networking/main.tf

terraform {
  backend "s3" {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Needed by the subnet resources below
data "aws_availability_zones" "available" {
  state = "available"
}

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "main-vpc"
  }
}

resource "aws_subnet" "private" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "private-${count.index}"
  }
}

resource "aws_subnet" "public" {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 100)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = "public-${count.index}"
  }
}

# Outputs for other states to consume
output "vpc_id" {
  value = aws_vpc.main.id
}

output "private_subnet_ids" {
  value = aws_subnet.private[*].id
}

output "public_subnet_ids" {
  value = aws_subnet.public[*].id
}

Step 3: Import Into the New State

Here’s where most tutorials fail you. They say “just run terraform import.” But with complex resources, you need the exact import IDs.

cd terraform/networking

# Initialize the new backend
terraform init

# Import each resource
terraform import aws_vpc.main vpc-0abc123def456

terraform import 'aws_subnet.private[0]' subnet-0abc123
terraform import 'aws_subnet.private[1]' subnet-0def456

terraform import 'aws_subnet.public[0]' subnet-0ghi789
terraform import 'aws_subnet.public[1]' subnet-0jkl012

# Verify - THIS MUST SHOW NO CHANGES
terraform plan
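Hunting each ID down in the AWS console is tedious. If you took a backup with terraform state pull, the IDs are already in it. Here's a sketch that extracts address|id pairs with jq; the backup content below is a made-up sample of the v4 state format, and the file path is a placeholder:

```shell
# Sample of what a `terraform state pull` backup looks like (state format v4)
cat > /tmp/state-backup.json <<'EOF'
{
  "version": 4,
  "resources": [
    {
      "mode": "managed",
      "type": "aws_vpc",
      "name": "main",
      "instances": [
        { "attributes": { "id": "vpc-0abc123def456" } }
      ]
    },
    {
      "mode": "managed",
      "type": "aws_subnet",
      "name": "private",
      "instances": [
        { "index_key": 0, "attributes": { "id": "subnet-0abc123" } },
        { "index_key": 1, "attributes": { "id": "subnet-0def456" } }
      ]
    }
  ]
}
EOF

# Print "address|id" for every managed resource instance
jq -r '.resources[]
       | select(.mode == "managed")
       | .type as $t | .name as $n
       | .instances[]
       | "\($t).\($n)\(if has("index_key") then "[\(.index_key)]" else "" end)|\(.attributes.id)"' \
  /tmp/state-backup.json
# aws_vpc.main|vpc-0abc123def456
# aws_subnet.private[0]|subnet-0abc123
# aws_subnet.private[1]|subnet-0def456
```

The pipe-separated pairs feed straight into import commands, or into a migration script.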

Step 4: Remove From the Old State

Only after verifying the import worked:

cd ../legacy  # Your old monolithic directory

# Remove from old state
terraform state rm aws_vpc.main
terraform state rm 'aws_subnet.private[0]'
terraform state rm 'aws_subnet.private[1]'
terraform state rm 'aws_subnet.public[0]'
terraform state rm 'aws_subnet.public[1]'

# Verify old state still works
terraform plan
# Should show no changes (just fewer resources now)

The Script We Use

For large migrations, we script this:

#!/bin/bash
# state-migration.sh

set -e

OLD_DIR="./legacy"
NEW_DIR="./networking"

# Resources to move (format: "resource_address|import_id")
RESOURCES=(
  "aws_vpc.main|vpc-0abc123def456"
  "aws_subnet.private[0]|subnet-0abc123"
  "aws_subnet.private[1]|subnet-0def456"
  "aws_subnet.public[0]|subnet-0ghi789"
  "aws_subnet.public[1]|subnet-0jkl012"
)

echo "=== Backing up states ==="
cd "$OLD_DIR"
terraform state pull > "../backup-old-$(date +%Y%m%d-%H%M%S).json"
cd ..

cd "$NEW_DIR"
terraform init
terraform state pull > "../backup-new-$(date +%Y%m%d-%H%M%S).json" 2>/dev/null || echo "New state is empty (expected)"
cd ..

echo "=== Importing into new state ==="
cd "$NEW_DIR"
for resource in "${RESOURCES[@]}"; do
  addr="${resource%%|*}"
  id="${resource##*|}"
  echo "Importing: $addr ($id)"
  terraform import "$addr" "$id" || { echo "FAILED: $addr"; exit 1; }
done

echo "=== Verifying new state ==="
# -detailed-exitcode returns 2 when the plan contains changes; capture the
# code before `set -e` aborts, and don't test a stale $?
plan_rc=0
terraform plan -detailed-exitcode || plan_rc=$?
if [ "$plan_rc" -eq 0 ]; then
  echo "✓ New state verified - no changes"
else
  echo "✗ ERROR: Plan shows changes (exit $plan_rc)! Aborting."
  exit 1
fi
cd ..

echo "=== Removing from old state ==="
cd "$OLD_DIR"
for resource in "${RESOURCES[@]}"; do
  addr="${resource%%|*}"
  echo "Removing: $addr"
  terraform state rm "$addr" || { echo "FAILED to remove: $addr"; exit 1; }
done

echo "=== Verifying old state ==="
plan_rc=0
terraform plan -detailed-exitcode || plan_rc=$?
if [ "$plan_rc" -eq 0 ]; then
  echo "✓ Old state verified - no changes"
else
  echo "✗ ERROR: Plan shows changes (exit $plan_rc)! Check manually."
  exit 1
fi

echo "=== Migration complete ==="
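A note on the %%|* and ##*| expansions the script relies on - they split each entry at the pipe, with no subshell needed:

```shell
resource="aws_subnet.private[0]|subnet-0abc123"

addr="${resource%%|*}"  # strip from the first "|" to the end -> resource address
id="${resource##*|}"    # strip everything up to the last "|" -> import ID

echo "$addr"  # aws_subnet.private[0]
echo "$id"    # subnet-0abc123
```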

Technique 2: Using terraform state mv

If you’re reorganizing within the same state (renaming resources, moving into modules), use terraform state mv:

Renaming a Resource

# Old: aws_instance.web
# New: aws_instance.application

terraform state mv aws_instance.web aws_instance.application

Moving Into a Module

# Old: aws_instance.web (root module)
# New: module.compute.aws_instance.web

terraform state mv aws_instance.web module.compute.aws_instance.web

Moving Out of a Module

# Old: module.legacy.aws_instance.web
# New: aws_instance.web (root module)

terraform state mv module.legacy.aws_instance.web aws_instance.web

Bulk Moves

# Preview the move first - nothing is written to state
terraform state mv -dry-run module.old_network module.network

# Move all resources from one module to another
terraform state mv module.old_network module.network

Technique 3: Using moved Blocks (Terraform 1.1+)

For refactoring that you want tracked in version control, use moved blocks:

# This tells Terraform the resource was renamed
moved {
  from = aws_instance.web
  to   = aws_instance.application
}

# Moving into a module
moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}

# Renaming a module
moved {
  from = module.old_name
  to   = module.new_name
}

Benefits of moved blocks:

  • Version controlled
  • Works across team members
  • Self-documenting
  • Terraform handles the state update automatically

After applying with moved blocks:

terraform plan
# aws_instance.web has moved to aws_instance.application
# Plan: 0 to add, 0 to change, 0 to destroy.

terraform apply
# State is updated, no infrastructure changes

Important: Keep moved blocks until every state that uses the code has applied them (at least one full release cycle), then remove them. Remove them too early and any stale state will plan a destroy-and-recreate.


Technique 4: The import Block (Terraform 1.5+)

For new states, you can now define imports in config:

# imports.tf

import {
  to = aws_vpc.main
  id = "vpc-0abc123def456"
}

import {
  to = aws_subnet.private[0]
  id = "subnet-0abc123"
}

import {
  to = aws_subnet.private[1]
  id = "subnet-0def456"
}

Then run:

terraform plan
# Shows what will be imported

terraform apply
# Imports all resources

This is cleaner than CLI imports for large migrations.
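You can also generate these blocks instead of hand-writing them, reusing the same address|id pairs from the migration script. A minimal sketch - the output path is a placeholder:

```shell
# Emit one import block per "address|id" pair
RESOURCES=(
  "aws_vpc.main|vpc-0abc123def456"
  "aws_subnet.private[0]|subnet-0abc123"
  "aws_subnet.private[1]|subnet-0def456"
)

for r in "${RESOURCES[@]}"; do
  printf 'import {\n  to = %s\n  id = "%s"\n}\n\n' "${r%%|*}" "${r##*|}"
done > /tmp/imports.tf

cat /tmp/imports.tf
```

From Terraform 1.5 you can also let Terraform write the matching resource blocks for you with terraform plan -generate-config-out=generated.tf.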


Technique 5: Cross-State References with terraform_remote_state

After splitting states, you need to reference resources across states:

# terraform/eks/main.tf

# Reference the networking state
data "terraform_remote_state" "networking" {
  backend = "s3"
  config = {
    bucket = "my-terraform-state"
    key    = "networking/terraform.tfstate"
    region = "eu-west-1"
  }
}

# Use outputs from networking state
resource "aws_eks_cluster" "main" {
  name     = "main-cluster"
  role_arn = aws_iam_role.eks.arn

  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}

Dependency Order

With split states, you need to apply in order:

# 1. Networking first (no dependencies)
cd networking && terraform apply

# 2. Database (depends on networking)
cd ../database && terraform apply

# 3. EKS (depends on networking)
cd ../eks && terraform apply

# 4. Application (depends on everything)
cd ../application && terraform apply

We encode this in CI/CD with explicit job dependencies.


Technique 6: The removed Block (Terraform 1.7+)

When you want to remove a resource from state without destroying it:

# This removes from state but keeps the actual resource
removed {
  from = aws_instance.legacy_server

  lifecycle {
    destroy = false
  }
}

Use cases:

  • Handing off resources to another team
  • Removing resources that will be managed manually
  • Migrating to a different IaC tool

Real-World Migration: Monolith to Multi-State

Here’s the actual migration plan we used for a client with 400+ resources:

Phase 1: Analysis

# List all resources in current state
terraform state list > all-resources.txt

# Count by type (the first dot-separated label of each address)
terraform state list | cut -d'.' -f1 | sort | uniq -c | sort -rn

# Output:
#   45 aws_security_group_rule
#   32 aws_iam_role_policy_attachment
#   28 aws_route53_record
#   15 aws_lambda_function
#   12 aws_s3_bucket
#   ...

Phase 2: Categorization

We grouped resources into logical domains:

networking/     - VPC, subnets, route tables, NAT, IGW
security/       - Security groups, NACLs, WAF
iam/            - Roles, policies, users
dns/            - Route53 zones and records
compute/        - EC2, ASG, Launch templates
eks/            - EKS cluster, node groups, add-ons
rds/            - RDS instances, parameter groups
lambda/         - Lambda functions, layers
storage/        - S3 buckets, EFS
monitoring/     - CloudWatch, SNS, alarms
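A first pass at this grouping can be scripted by bucketing state addresses on their type prefix. The type-to-domain map here is illustrative, not complete:

```shell
# Map resource types to candidate domains, then bucket each address
declare -A DOMAIN=(
  [aws_vpc]=networking [aws_subnet]=networking
  [aws_security_group]=security [aws_security_group_rule]=security
  [aws_iam_role]=iam [aws_lambda_function]=lambda
)

# In real use, feed this loop from `terraform state list`
while read -r addr; do
  type="${addr%%.*}"  # first dot-separated label is the resource type
  echo "${DOMAIN[$type]:-uncategorised}: $addr"
done > /tmp/categorised.txt <<'EOF'
aws_vpc.main
aws_subnet.private[0]
aws_security_group_rule.ingress
aws_lambda_function.api
EOF

sort /tmp/categorised.txt
```

Anything that falls out as uncategorised (module-wrapped resources, for instance) gets triaged by hand.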

Phase 3: Dependency Mapping

networking   (no deps)
security     (networking)
iam          (no deps - can run in parallel with security)
dns          (networking)
rds          (networking, security)
eks          (networking, security, iam)
storage      (iam)
lambda       (iam, networking, security, storage)
monitoring   (everything)

Phase 4: Migration Script

#!/bin/bash
# full-migration.sh

DOMAINS="networking security iam dns rds eks storage lambda monitoring"
OLD_STATE="./legacy"
BACKUP_DIR="./backups/$(date +%Y%m%d-%H%M%S)"

mkdir -p "$BACKUP_DIR"

# Backup everything first
echo "=== Creating backups ==="
cd "$OLD_STATE"
terraform state pull > "$BACKUP_DIR/legacy.json"
cd ..

for domain in $DOMAINS; do
  if [ -d "$domain" ]; then
    cd "$domain"
    terraform state pull > "$BACKUP_DIR/${domain}.json" 2>/dev/null || echo "$domain is new"
    cd ..
  fi
done

# Migrate each domain
for domain in $DOMAINS; do
  echo "=== Migrating: $domain ==="
  
  if [ -f "migrations/${domain}.sh" ]; then
    bash "migrations/${domain}.sh"
    
    # Verify
    cd "$domain"
    if ! terraform plan -detailed-exitcode; then
      echo "ERROR: $domain verification failed!"
      exit 1
    fi
    cd ..
  else
    echo "No migration script for $domain, skipping"
  fi
done

echo "=== All migrations complete ==="

Phase 5: Verification

After each domain migration:

# In the new state directory
terraform plan
# Must show: No changes

# In the old state directory  
terraform plan
# Must show: No changes (just fewer resources)

# Verify actual infrastructure
aws ec2 describe-vpcs --vpc-ids vpc-xxx
aws eks describe-cluster --name main-cluster
# ... spot check critical resources

Common Gotchas

1. Count vs For_Each Index Mismatch

If you’re moving from count to for_each, the state addresses differ:

# count uses numeric index
aws_subnet.private[0]
aws_subnet.private[1]

# for_each uses key
aws_subnet.private["eu-west-1a"]
aws_subnet.private["eu-west-1b"]

You’ll need individual moved blocks:

moved {
  from = aws_subnet.private[0]
  to   = aws_subnet.private["eu-west-1a"]
}

moved {
  from = aws_subnet.private[1]
  to   = aws_subnet.private["eu-west-1b"]
}
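For reference, a for_each version of the private subnets that produces those keyed addresses might look like this - a sketch, with the AZ list assumed:

```hcl
locals {
  azs = ["eu-west-1a", "eu-west-1b"]
}

resource "aws_subnet" "private" {
  # Keyed by AZ name, so addresses become aws_subnet.private["eu-west-1a"]
  for_each          = { for i, az in local.azs : az => i }
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, each.value)
  availability_zone = each.key

  tags = {
    Name = "private-${each.key}"
  }
}
```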

2. Provider Aliases

If the resource uses a provider alias, Terraform resolves it from the provider argument in the resource's configuration, so make sure that block exists before importing. (The old -provider CLI flag is no longer supported in modern Terraform.)

# Resource block must declare: provider = aws.west
terraform import 'aws_instance.west["web"]' i-0abc123

3. Sensitive Values in State

When pulling state for backup, sensitive values are included. Secure your backups:

# Encrypt backup
terraform state pull | gpg --encrypt -r your@email.com > state-backup.json.gpg

4. State Locking During Migration

Disable auto-apply in CI/CD during migration. You don’t want automated applies while manipulating state.

# Force unlock if needed (dangerous - make sure no one else is using it)
terraform force-unlock LOCK_ID

5. Remote State Data Source Timing

If you split networking from EKS, and EKS references networking via terraform_remote_state, you must apply networking first after the split.


The Checklist

## Pre-Migration
- [ ] Backup all state files
- [ ] Document current resource count per state
- [ ] Map dependencies between resources
- [ ] Plan the new state structure
- [ ] Disable CI/CD auto-applies
- [ ] Notify team of migration window

## Per-Domain Migration
- [ ] Create new directory structure
- [ ] Copy resource code to new location
- [ ] Add remote_state data sources where needed
- [ ] Add outputs for cross-state references
- [ ] Run terraform init in new directory
- [ ] Import resources into new state
- [ ] Verify: terraform plan shows no changes
- [ ] Remove resources from old state
- [ ] Verify: old state terraform plan shows no changes
- [ ] Commit changes

## Post-Migration
- [ ] Update CI/CD pipelines for new structure
- [ ] Update documentation
- [ ] Re-enable CI/CD auto-applies
- [ ] Delete old monolithic state (after grace period)
- [ ] Archive backup files securely

Key Takeaways

  1. Always backup state before any manipulation
  2. terraform plan must show no changes after every state operation
  3. Use moved blocks for version-controlled refactoring
  4. Use import blocks (1.5+) for cleaner bulk imports
  5. Use removed blocks (1.7+) to remove without destroying
  6. Map dependencies before splitting - apply order matters
  7. Script large migrations - manual commands are error-prone
  8. Keep backups until you’re 100% confident the migration worked

State surgery is scary, but with the right approach it’s routine. Take it slow, verify everything, and you’ll have clean, maintainable Terraform in no time.


Questions? Find me on LinkedIn or GitHub.
