Terraform Best Practices (Part 2) - Testing, CI/CD, Security, and Team Workflows

In Part 1, we covered Terraform foundations: project structure, state management, and module design. This part focuses on advanced practices that become critical as teams and infrastructure scale.

Testing infrastructure code is different from testing application code. CI/CD for Terraform requires careful thought about plan/apply workflows. Security mistakes in Terraform can expose your entire cloud. And coordinating infrastructure changes across teams needs clear processes.

Let’s dive in.

TL;DR

  • Test modules with Terratest or terraform-compliance
  • Use CI/CD with mandatory plan reviews before apply
  • Never commit secrets - use Vault, AWS Secrets Manager, or environment variables
  • Implement drift detection and reconciliation
  • Document decisions and use PR templates

Testing Terraform

“We don’t test infrastructure code” is a common but dangerous stance. Terraform modules can have bugs just like application code, and the consequences of infrastructure bugs can be severe.

Levels of Testing

┌─────────────────────────────────────────────────────────┐
│                Level 4: End-to-End                      │
│          Deploy full stack, test functionality          │
├─────────────────────────────────────────────────────────┤
│                Level 3: Integration                     │
│         Deploy to real cloud, verify resources          │
├─────────────────────────────────────────────────────────┤
│               Level 2: Contract/Policy                  │
│    Verify plan meets policies (terraform-compliance)    │
├─────────────────────────────────────────────────────────┤
│               Level 1: Static Analysis                  │
│         tflint, checkov, terrascan, validate            │
└─────────────────────────────────────────────────────────┘

Level 1: Static Analysis

Fast checks that don’t require cloud access:

# Terraform validate - syntax and internal consistency
terraform init -backend=false
terraform validate

# tflint - linting and best practices
tflint --init
tflint --recursive

# Checkov - security and compliance scanning
checkov -d .

# Terrascan - policy as code
terrascan scan -d .
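
The checks above can be wrapped in one script so that every tool still runs when an earlier one fails. This is a minimal sketch rather than a definitive implementation: `run_checks` is a hypothetical helper name, and the commented invocation assumes the tools from this section are installed and on your `PATH`.

```shell
#!/usr/bin/env bash
# run_checks: run every command given as an argument, keep going after a
# failure, and return the number of commands that failed (0 = all passed).
# Hypothetical helper for illustration; it is not part of any tool.
run_checks() {
  local failed=0 cmd
  for cmd in "$@"; do
    echo "==> ${cmd}"
    if ! bash -c "${cmd}"; then
      echo "FAILED: ${cmd}" >&2
      failed=$((failed + 1))
    fi
  done
  return "${failed}"
}

# Example invocation with commands used in this article:
# run_checks \
#   "terraform fmt -check -recursive" \
#   "terraform validate" \
#   "tflint --recursive" \
#   "checkov -d ."
```

Collecting all failures in one pass tends to shorten the feedback loop, since a formatting error no longer hides a security finding.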

tflint configuration:

# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_naming_convention" {
  enabled = true
  format  = "snake_case"
}

rule "terraform_documented_variables" {
  enabled = true
}

rule "terraform_documented_outputs" {
  enabled = true
}

rule "aws_instance_invalid_type" {
  enabled = true
}

Checkov example:

# .checkov.yaml
framework:
  - terraform
check:
  - CKV_AWS_18   # Ensure S3 bucket logging is enabled
  - CKV_AWS_19   # Ensure S3 bucket has encryption enabled
  - CKV_AWS_21   # Ensure S3 bucket has versioning enabled
skip-check:
  - CKV_AWS_144  # Skip S3 cross-region replication (not always needed)

Level 2: Policy Testing with terraform-compliance

Test that your Terraform plans meet organisational policies:

# features/s3.feature
Feature: S3 bucket security
  Scenario: S3 buckets must have encryption
    Given I have aws_s3_bucket defined
    Then it must have server_side_encryption_configuration
    And its server_side_encryption_configuration must have rule

  Scenario: S3 buckets must not be public
    Given I have aws_s3_bucket defined
    Then it must have acl
    And its acl must not be public-read
    And its acl must not be public-read-write

Generate a plan and run the policy tests against it:

# Generate plan and test
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
terraform-compliance -f features/ -p plan.json

Level 3: Integration Testing with Terratest

Deploy real infrastructure and verify it works:

// test/vpc_test.go
package test

import (
    "testing"

    "github.com/gruntwork-io/terratest/modules/terraform"
    "github.com/gruntwork-io/terratest/modules/aws"
    "github.com/stretchr/testify/assert"
)

func TestVpcModule(t *testing.T) {
    t.Parallel()

    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../modules/vpc",
        Vars: map[string]interface{}{
            "vpc_cidr":    "10.0.0.0/16",
            "environment": "test",
            "name":        "terratest-vpc",
        },
    })

    // Destroy at the end
    defer terraform.Destroy(t, terraformOptions)

    // Deploy
    terraform.InitAndApply(t, terraformOptions)

    // Get outputs
    vpcId := terraform.Output(t, terraformOptions, "vpc_id")
    publicSubnetIds := terraform.OutputList(t, terraformOptions, "public_subnet_ids")

    // Verify VPC exists
    vpc := aws.GetVpcById(t, vpcId, "eu-west-1")
    // CidrBlock is a *string in current Terratest releases, so compare the dereferenced value
    assert.Equal(t, "10.0.0.0/16", *vpc.CidrBlock)

    // Verify subnets were created
    assert.Equal(t, 3, len(publicSubnetIds))

    // Verify subnets are actually public (have route to IGW)
    for _, subnetId := range publicSubnetIds {
        assert.True(t, aws.IsPublicSubnet(t, subnetId, "eu-west-1"))
    }
}

Run Terratest:

cd test
go test -v -timeout 30m

Level 4: End-to-End Testing

Deploy the full stack and test functionality:

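package test

import (
    "fmt"
    "testing"
    "time"

    http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
    "github.com/gruntwork-io/terratest/modules/terraform"
)

func TestFullStackDeployment(t *testing.T) {
    t.Parallel()

    // Deploy infrastructure
    terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
        TerraformDir: "../environments/test",
    })

    defer terraform.Destroy(t, terraformOptions)
    terraform.InitAndApply(t, terraformOptions)

    // Get the ALB URL
    albUrl := terraform.Output(t, terraformOptions, "alb_url")

    // Wait for the application to be healthy (expects status 200 and body "OK")
    http_helper.HttpGetWithRetry(
        t,
        fmt.Sprintf("http://%s/health", albUrl),
        nil,
        200,
        "OK",
        30,
        10*time.Second,
    )

    // Run application-level tests. Only the status code is checked here:
    // HttpGetWithRetry with an empty expected body would require an empty response.
    http_helper.HttpGetWithRetryWithCustomValidation(
        t,
        fmt.Sprintf("http://%s/api/users", albUrl),
        nil,
        5,
        5*time.Second,
        func(status int, _ string) bool { return status == 200 },
    )
}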

CI/CD for Terraform

Infrastructure CI/CD is different from application CI/CD. You need human review before applying changes, and failed applies can leave infrastructure in partial states.

Pipeline Structure

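# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]

# Required for OIDC role assumption and for posting the plan as a PR comment
permissions:
  id-token: write
  contents: read
  pull-requests: write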

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0
      
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      
      - name: Terraform Init
        run: terraform init -backend=false
      
      - name: Terraform Validate
        run: terraform validate
      
      - name: Set up tflint
        uses: terraform-linters/setup-tflint@v4

      - name: tflint
        run: tflint --init && tflint --recursive
      
      - name: Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
          soft_fail: false

  plan:
    needs: validate
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      
      - uses: hashicorp/setup-terraform@v3
      
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: eu-west-1
      
      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init
      
      - name: Terraform Plan
        working-directory: environments/${{ matrix.environment }}
        run: |
          # Without pipefail, tee's exit code would hide a failed plan
          set -o pipefail
          terraform plan -out=tfplan -no-color | tee plan.txt
      
      - name: Post Plan to PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('environments/${{ matrix.environment }}/plan.txt', 'utf8');
            const output = `#### Terraform Plan for \`${{ matrix.environment }}\`
            <details><summary>Show Plan</summary>

            \`\`\`
            ${plan.substring(0, 65000)}
            \`\`\`

            </details>`;

            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
      
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}/tfplan

  apply:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production  # Requires approval
    strategy:
      matrix:
        environment: [dev, staging, prod]
      max-parallel: 1  # Apply sequentially
    steps:
      - uses: actions/checkout@v4
      
      - uses: hashicorp/setup-terraform@v3
      
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: eu-west-1
      
      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}
      
      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init
      
      - name: Terraform Apply
        working-directory: environments/${{ matrix.environment }}
        run: terraform apply -auto-approve tfplan

Key Principles

1. Always plan before apply:

# Never do this
- run: terraform apply -auto-approve

# Always do this
- run: terraform plan -out=tfplan
- run: terraform apply tfplan  # Apply the exact plan that was reviewed

2. Use saved plan files:

The plan you review must be the plan you apply:

# Plan and save
terraform plan -out=tfplan

# Apply exact plan (no drift between plan and apply)
terraform apply tfplan
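
One way to make "the plan you review is the plan you apply" checkable is to record a checksum when the plan is created and verify it just before the apply. This is a minimal sketch, assuming GNU `sha256sum` is available on the runner. The function names and the `.sha256` suffix are hypothetical choices, not Terraform conventions.

```shell
#!/usr/bin/env bash
# Record a SHA-256 checksum next to the saved plan file.
record_plan_checksum() {
  local plan_file="$1"
  sha256sum "${plan_file}" > "${plan_file}.sha256"
}

# Succeed only if the plan file still matches the recorded checksum.
verify_plan_checksum() {
  local plan_file="$1"
  sha256sum --check --status "${plan_file}.sha256"
}

# Usage (after `terraform plan -out=tfplan`):
#   record_plan_checksum tfplan
#   ... review, approval, artifact transfer ...
#   verify_plan_checksum tfplan && terraform apply tfplan
```

The checksum only shows that the file is unchanged. Terraform itself still decides whether the saved plan is stale relative to the current state.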

3. Require approval for production:

apply-prod:
  environment: production  # GitHub environment with required reviewers

4. Sequential applies for dependent environments:

strategy:
  matrix:
    environment: [dev, staging, prod]
  max-parallel: 1  # One at a time

Handling Plan Drift

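A saved plan can go stale if the infrastructure changes between plan time and apply time. Plan with -detailed-exitcode (0 = no changes, 1 = error, 2 = changes present) and record the result so later steps can decide whether to apply:

- name: Check for Drift
  id: plan
  run: |
    # Capture the exit code without tripping the shell's exit-on-error mode
    EXIT_CODE=0
    terraform plan -detailed-exitcode -out=tfplan || EXIT_CODE=$?
    if [ "$EXIT_CODE" -eq 2 ]; then
      echo "Changes detected"
      echo "has_changes=true" >> "$GITHUB_OUTPUT"
    elif [ "$EXIT_CODE" -eq 0 ]; then
      echo "No changes"
      echo "has_changes=false" >> "$GITHUB_OUTPUT"  # Later steps can skip apply
    else
      echo "Error"
      exit 1
    fi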

Security Best Practices

Terraform can create security holes in your infrastructure or expose secrets. Here’s how to prevent both.

Never Commit Secrets

# NEVER do this
resource "aws_db_instance" "main" {
  password = "super_secret_password"  # Committed to git!
}

# Also bad - .tfvars can be committed accidentally
# terraform.tfvars
db_password = "super_secret_password"
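
Accidental commits of `.tfvars` or state files can be caught before they reach the repository. The sketch below is one possible pre-commit check. `is_sensitive_tf_file` and `check_staged_files` are hypothetical names, and the patterns are an assumption about which files your team treats as sensitive.

```shell
#!/usr/bin/env bash
# Return 0 if the path looks like a file that may contain secrets.
is_sensitive_tf_file() {
  case "$1" in
    *.tfvars | *.tfvars.json | *.tfstate | *.tfstate.*) return 0 ;;
    *) return 1 ;;
  esac
}

# Read file paths on stdin (one per line) and fail if any is sensitive.
check_staged_files() {
  local path status=0
  while IFS= read -r path; do
    if is_sensitive_tf_file "${path}"; then
      echo "Refusing to commit ${path}" >&2
      status=1
    fi
  done
  return "${status}"
}

# In .git/hooks/pre-commit:
#   git diff --cached --name-only | check_staged_files || exit 1
```

A hook like this complements a `.gitignore` entry rather than replacing it, because ignore rules do not apply to files that were force-added.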

Option 1: Environment variables

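# Shell: pass the value through the environment
export TF_VAR_db_password="secret"
terraform apply

# Terraform: declare the variable as sensitive so it is redacted in plan output
variable "db_password" {
  type      = string
  sensitive = true
}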
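
When secrets arrive through the environment, it helps to fail early if one is missing instead of letting Terraform stop to prompt for it. A minimal sketch, assuming bash; `require_tf_vars` is a hypothetical helper, not a Terraform feature.

```shell
#!/usr/bin/env bash
# Fail early if any required TF_VAR_* environment variable is unset or empty.
# Arguments are Terraform variable names, e.g. db_password.
require_tf_vars() {
  local name var missing=0
  for name in "$@"; do
    var="TF_VAR_${name}"
    # ${!var} is bash indirect expansion: the value of the variable named by $var
    if [ -z "${!var:-}" ]; then
      echo "Missing environment variable: ${var}" >&2
      missing=1
    fi
  done
  return "${missing}"
}

# Usage:
#   require_tf_vars db_password && terraform apply
```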

Option 2: Secrets manager reference

data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/password"
}

resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}

Option 3: Generate random passwords

resource "random_password" "db" {
  length  = 32
  special = true
}

resource "aws_db_instance" "main" {
  password = random_password.db.result
}

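# Store in Secrets Manager for applications. The generated value is also kept
# in the Terraform state, so restrict access to the state backend.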
resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}

Use OIDC for CI/CD Authentication

Don’t use long-lived access keys:

# GitHub Actions with OIDC (the job also needs `permissions: id-token: write`)
- name: Configure AWS Credentials
  uses: aws-actions/configure-aws-credentials@v4
  with:
    role-to-assume: arn:aws:iam::123456789012:role/github-actions-terraform
    aws-region: eu-west-1

# IAM role for GitHub Actions
data "aws_caller_identity" "current" {}

resource "aws_iam_role" "github_actions" {
  name = "github-actions-terraform"

  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Federated = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/token.actions.githubusercontent.com"
        }
        Action = "sts:AssumeRoleWithWebIdentity"
        Condition = {
          StringEquals = {
            "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
          }
          StringLike = {
            "token.actions.githubusercontent.com:sub" = "repo:myorg/infrastructure:*"
          }
        }
      }
    ]
  })
}

Principle of Least Privilege

# Bad - full admin access
resource "aws_iam_role_policy_attachment" "terraform" {
  role       = aws_iam_role.terraform.name
  policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}

# Better - only what's needed
resource "aws_iam_role_policy" "terraform" {
  role = aws_iam_role.terraform.name

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:*",
          "rds:*",
          "s3:*"
        ]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:RequestedRegion" = ["eu-west-1"]
          }
        }
      }
    ]
  })
}

Security Scanning

Run security checks in CI:

- name: Checkov Security Scan
  uses: bridgecrewio/checkov-action@v12
  with:
    directory: .
    framework: terraform
    soft_fail: false
    
- name: tfsec
  uses: aquasecurity/tfsec-action@v1.0.3
  with:
    soft_fail: false

Drift Detection

Infrastructure drifts when someone makes changes outside Terraform (console, CLI, other tools). Detecting and reconciling drift is essential for maintaining infrastructure as code integrity.

Detecting Drift

# Refresh state and compare
terraform plan -refresh-only

# Output:
# Note: Objects have changed outside of Terraform
#
# ~ resource "aws_security_group" "web" {
#     ~ ingress {
#         + cidr_blocks = ["0.0.0.0/0"]  # Added manually!
#       }
#   }
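
For local or scripted use, the same check can be reduced to a label. This is a sketch under stated assumptions: `classify_plan_exit` and `detect_drift` are hypothetical helper names, and it assumes a Terraform version in which a refresh-only plan reports detected changes through `-detailed-exitcode`.

```shell
#!/usr/bin/env bash
# Map the exit code of `terraform plan -detailed-exitcode` to a label.
# Documented codes: 0 = no changes, 1 = error, 2 = changes present.
classify_plan_exit() {
  case "$1" in
    0) echo "clean" ;;
    2) echo "drift" ;;
    *) echo "error" ;;
  esac
}

# Run a refresh-only plan in the given directory and print the label.
# The exit code is captured with `||` so the script survives `set -e`.
detect_drift() {
  local dir="$1" code=0
  terraform -chdir="${dir}" plan -refresh-only -detailed-exitcode -no-color \
    > "${dir}/drift.txt" 2>&1 || code=$?
  classify_plan_exit "${code}"
}

# Usage:
#   detect_drift environments/prod   # prints clean, drift, or error
```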

Automated Drift Detection

Schedule drift checks:

# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 8 * * *'  # Daily at 08:00 UTC

# Required for OIDC role assumption
permissions:
  id-token: write
  contents: read

jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      
      - uses: hashicorp/setup-terraform@v3
      
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: eu-west-1
      
      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init
      
      - name: Detect Drift
        id: drift
        working-directory: environments/${{ matrix.environment }}
        run: |
          terraform plan -detailed-exitcode -refresh-only -out=drift.tfplan 2>&1 | tee drift.txt
          # $? would be tee's status; PIPESTATUS[0] is terraform's
          echo "exit_code=${PIPESTATUS[0]}" >> "$GITHUB_OUTPUT"
      
      - name: Alert on Drift
        if: steps.drift.outputs.exit_code == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Drift detected in ${{ matrix.environment }}!",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "Infrastructure drift detected in *${{ matrix.environment }}*. Review and reconcile."
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK  # Needed in v1 for block payloads sent to an incoming webhook

Reconciling Drift

Option 1: Accept the changes (keep manual changes)

# If the manual change was intentional
terraform plan -refresh-only
terraform apply -refresh-only  # Update state to match reality

# Then update code to match

Option 2: Override (revert to Terraform)

# If the manual change was a mistake
terraform apply  # Revert infrastructure to match code

Option 3: Selective import

# Import a manually created resource
terraform import aws_security_group.new sg-12345678

Team Workflows

PR Templates

<!-- .github/pull_request_template.md -->
## Infrastructure Change

### Description
<!-- What are you changing and why? -->

### Environment(s) Affected
- [ ] dev
- [ ] staging
- [ ] prod

### Type of Change
- [ ] New infrastructure
- [ ] Modification to existing infrastructure
- [ ] Destruction of infrastructure
- [ ] Refactoring (no functional change)

### Checklist
- [ ] I have run `terraform fmt`
- [ ] I have run `terraform validate`
- [ ] I have reviewed the plan output
- [ ] I have updated documentation if needed
- [ ] I have considered the blast radius of this change
- [ ] I have a rollback plan if needed

### Terraform Plan Summary
<!-- Paste or screenshot the plan summary -->

Code Owners

# .github/CODEOWNERS
# Platform team owns core infrastructure
/environments/prod/              @myorg/platform-team
/modules/networking/             @myorg/platform-team
/modules/eks/                    @myorg/platform-team

# Data team owns data infrastructure
/environments/*/data-*           @myorg/data-team
/modules/redshift/               @myorg/data-team

# Security team reviews security-sensitive changes
/modules/iam/                    @myorg/security-team
**/security*.tf                  @myorg/security-team

Documentation

Document important decisions:

<!-- docs/decisions/001-state-management.md -->
# ADR 001: State Management Strategy

## Status
Accepted

## Context
We need to decide how to manage Terraform state across multiple teams and environments.

## Decision
We will use S3 with DynamoDB locking, with one state file per component per environment.

## Consequences
- Teams can work independently
- Clear blast radius for each apply
- Need to set up cross-stack data sharing via remote state or SSM parameters

Handling Emergencies

When you need to bypass normal process:

## Emergency Change Process

1. **Communicate** - Alert the team in #infrastructure-alerts
2. **Document** - Create an incident ticket
3. **Make the change** - Use Terraform if possible, console if necessary
4. **Reconcile** - If console change, create PR to update Terraform ASAP
5. **Retrospective** - Document what happened and how to prevent recurrence

Performance Optimisation

Large Terraform configurations can be slow. Here’s how to speed them up.

Parallelism

# Default parallelism is 10
terraform apply -parallelism=20

Target Specific Resources

# Only plan/apply specific resources
terraform plan -target=module.api
terraform apply -target=aws_instance.web

Warning: Use sparingly. Targeted applies can create inconsistent state.

State File Size

Large states are slow. Split into smaller states:

# Instead of one huge state
/infrastructure/terraform.tfstate  # 500+ resources, slow

# Split by component
/infrastructure/networking/terraform.tfstate  # 50 resources
/infrastructure/compute/terraform.tfstate     # 100 resources
/infrastructure/database/terraform.tfstate    # 30 resources

Provider Caching

# Set plugin cache directory (Terraform does not create it for you)
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
mkdir -p "$TF_PLUGIN_CACHE_DIR"

# Providers are downloaded once, reused across projects

Summary

Terraform at scale requires:

  1. Testing - Static analysis, policy testing, integration tests
  2. CI/CD - Automated validation, plan review, controlled applies
  3. Security - No secrets in code, OIDC auth, least privilege
  4. Drift management - Detection, alerting, reconciliation
  5. Team processes - PR templates, code owners, documentation

Infrastructure as code only works if you treat it like code: tested, reviewed, and versioned.

