Terraform Best Practices (Part 2) - Testing, CI/CD, Security, and Team Workflows
In Part 1, we covered Terraform foundations: project structure, state management, and module design. This part focuses on advanced practices that become critical as teams and infrastructure scale.
Testing infrastructure code is different from testing application code. CI/CD for Terraform requires careful thought about plan/apply workflows. Security mistakes in Terraform can expose your entire cloud. And coordinating infrastructure changes across teams needs clear processes.
Let’s dive in.
TL;DR
- Test modules with Terratest or terraform-compliance
- Use CI/CD with mandatory plan reviews before apply
- Never commit secrets - use Vault, AWS Secrets Manager, or environment variables
- Implement drift detection and reconciliation
- Document decisions and use PR templates
Testing Terraform
“We don’t test infrastructure code” is a common but dangerous stance. Terraform modules can have bugs just like application code, and the consequences of infrastructure bugs can be severe.
Levels of Testing
┌─────────────────────────────────────────────────────────┐
│ Level 4: End-to-End │
│ Deploy full stack, test functionality │
├─────────────────────────────────────────────────────────┤
│ Level 3: Integration │
│ Deploy to real cloud, verify resources │
├─────────────────────────────────────────────────────────┤
│ Level 2: Contract/Policy │
│ Verify plan meets policies (terraform-compliance) │
├─────────────────────────────────────────────────────────┤
│ Level 1: Static Analysis │
│ tflint, checkov, terrascan, validate │
└─────────────────────────────────────────────────────────┘
Level 1: Static Analysis
Fast checks that don’t require cloud access:
# Terraform validate - syntax and internal consistency
terraform init -backend=false
terraform validate
# tflint - linting and best practices
tflint --init
tflint --recursive
# Checkov - security and compliance scanning
checkov -d .
# Terrascan - policy as code
terrascan scan -d .
tflint configuration:
# .tflint.hcl
plugin "aws" {
  enabled = true
  version = "0.27.0"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}
rule "terraform_naming_convention" {
  enabled = true
  format  = "snake_case"
}
rule "terraform_documented_variables" {
  enabled = true
}
rule "terraform_documented_outputs" {
  enabled = true
}
rule "aws_instance_invalid_type" {
  enabled = true
}
Checkov example:
# .checkov.yaml
framework:
  - terraform
# When "check" is set, only the listed checks run; omit it to run every check
check:
  - CKV_AWS_18 # Ensure S3 bucket logging is enabled
  - CKV_AWS_19 # Ensure S3 bucket has encryption enabled
  - CKV_AWS_21 # Ensure S3 bucket has versioning enabled
skip-check:
  - CKV_AWS_144 # Skip S3 cross-region replication (not always needed)
Level 2: Policy Testing with terraform-compliance
Test that your Terraform plans meet organisational policies:
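# features/s3.feature
# Assumes encryption and ACLs are set inline on aws_s3_bucket (AWS provider v3 style).
# AWS provider v4+ moves these settings into separate aws_s3_bucket_* resources.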
Feature: S3 bucket security
Scenario: S3 buckets must have encryption
Given I have aws_s3_bucket defined
Then it must have server_side_encryption_configuration
And its server_side_encryption_configuration must have rule
Scenario: S3 buckets must not be public
Given I have aws_s3_bucket defined
Then it must have acl
And its acl must not be public-read
And its acl must not be public-read-write
# Generate plan and test
terraform plan -out=plan.tfplan
terraform show -json plan.tfplan > plan.json
terraform-compliance -f features/ -p plan.json
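Policy tools like this work by inspecting the JSON form of the plan. As a rough illustration of the idea (not how terraform-compliance itself is implemented), the Go sketch below reads the `resource_changes` array from `plan.json` and lists every resource the plan would destroy or replace. The type and function names are hypothetical, and the sample input is hand-written rather than real plan output.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// plan mirrors only the part of `terraform show -json` output used here.
type plan struct {
	ResourceChanges []struct {
		Address string `json:"address"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_changes"`
}

// destroyedAddresses returns the address of every resource whose planned
// actions include "delete". A replacement is reported as delete plus create,
// so replaced resources are included too.
func destroyedAddresses(planJSON []byte) ([]string, error) {
	var p plan
	if err := json.Unmarshal(planJSON, &p); err != nil {
		return nil, fmt.Errorf("parsing plan JSON: %w", err)
	}
	var out []string
	for _, rc := range p.ResourceChanges {
		for _, action := range rc.Change.Actions {
			if action == "delete" {
				out = append(out, rc.Address)
				break
			}
		}
	}
	return out, nil
}

func main() {
	// Hand-written sample, trimmed to the fields the check reads.
	sample := []byte(`{"resource_changes":[
	  {"address":"aws_s3_bucket.logs","change":{"actions":["create"]}},
	  {"address":"aws_db_instance.main","change":{"actions":["delete","create"]}}
	]}`)
	addrs, err := destroyedAddresses(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("resources that would be destroyed:", addrs)
}
```

A check like this could fail the pipeline, or require an extra approval, whenever the returned list is non-empty.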
Level 3: Integration Testing with Terratest
Deploy real infrastructure and verify it works:
// test/vpc_test.go
package test
import (
	"testing"
	"github.com/gruntwork-io/terratest/modules/aws"
	"github.com/gruntwork-io/terratest/modules/terraform"
	"github.com/stretchr/testify/assert"
)
func TestVpcModule(t *testing.T) {
	t.Parallel()
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../modules/vpc",
		Vars: map[string]interface{}{
			"vpc_cidr":    "10.0.0.0/16",
			"environment": "test",
			"name":        "terratest-vpc",
		},
	})
	// Destroy at the end, even if an assertion fails
	defer terraform.Destroy(t, terraformOptions)
	// Deploy
	terraform.InitAndApply(t, terraformOptions)
	// Get outputs
	vpcId := terraform.Output(t, terraformOptions, "vpc_id")
	publicSubnetIds := terraform.OutputList(t, terraformOptions, "public_subnet_ids")
	// Verify VPC exists
	vpc := aws.GetVpcById(t, vpcId, "eu-west-1")
	// CidrBlock is a *string in recent Terratest releases, so dereference it
	assert.Equal(t, "10.0.0.0/16", *vpc.CidrBlock)
	// Verify subnets were created
	assert.Equal(t, 3, len(publicSubnetIds))
	// Verify subnets are actually public (have route to IGW)
	for _, subnetId := range publicSubnetIds {
		assert.True(t, aws.IsPublicSubnet(t, subnetId, "eu-west-1"))
	}
}
Run Terratest:
cd test
go test -v -timeout 30m
Level 4: End-to-End Testing
Deploy the full stack and test functionality:
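package test
import (
	"fmt"
	"testing"
	"time"
	http_helper "github.com/gruntwork-io/terratest/modules/http-helper"
	"github.com/gruntwork-io/terratest/modules/terraform"
)
func TestFullStackDeployment(t *testing.T) {
	t.Parallel()
	// Deploy infrastructure
	terraformOptions := terraform.WithDefaultRetryableErrors(t, &terraform.Options{
		TerraformDir: "../environments/test",
	})
	defer terraform.Destroy(t, terraformOptions)
	terraform.InitAndApply(t, terraformOptions)
	// Get the ALB URL
	albUrl := terraform.Output(t, terraformOptions, "alb_url")
	// Wait for the application to be healthy (status 200 and body exactly "OK")
	http_helper.HttpGetWithRetry(
		t,
		fmt.Sprintf("http://%s/health", albUrl),
		nil,
		200,
		"OK",
		30,
		10*time.Second,
	)
	// Run application-level tests. HttpGetWithRetry compares the body exactly,
	// so use a custom validator to check only the status code here.
	http_helper.HttpGetWithRetryWithCustomValidation(
		t,
		fmt.Sprintf("http://%s/api/users", albUrl),
		nil,
		5,
		5*time.Second,
		func(status int, _ string) bool { return status == 200 },
	)
}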
CI/CD for Terraform
Infrastructure CI/CD is different from application CI/CD. You need human review before applying changes, and failed applies can leave infrastructure in partial states.
Pipeline Structure
# .github/workflows/terraform.yml
name: Terraform
on:
  pull_request:
    branches: [main]
  push:
    branches: [main]
permissions:
  id-token: write      # Required for OIDC role assumption
  contents: read
  pull-requests: write # Required to post the plan as a PR comment
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0
      - name: Terraform Format Check
        run: terraform fmt -check -recursive
      - name: Terraform Init
        run: terraform init -backend=false
      - name: Terraform Validate
        run: terraform validate
      - name: tflint
        uses: terraform-linters/setup-tflint@v4
      - run: tflint --init && tflint --recursive
      - name: Checkov
        uses: bridgecrewio/checkov-action@v12
        with:
          directory: .
          soft_fail: false
  plan:
    needs: validate
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0 # Pin the same version in plan and apply
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: eu-west-1
      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init
      - name: Terraform Plan
        working-directory: environments/${{ matrix.environment }}
        run: |
          set -o pipefail # Without this, tee hides a failed plan
          terraform plan -out=tfplan -no-color | tee plan.txt
      - name: Post Plan to PR
        uses: actions/github-script@v7
        if: github.event_name == 'pull_request'
        with:
          script: |
            const fs = require('fs');
            const plan = fs.readFileSync('environments/${{ matrix.environment }}/plan.txt', 'utf8');
            const output = `#### Terraform Plan for \`${{ matrix.environment }}\`
            <details><summary>Show Plan</summary>

            \`\`\`
            ${plan.substring(0, 65000)}
            \`\`\`

            </details>`;
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: output
            });
      - name: Upload Plan
        uses: actions/upload-artifact@v4
        with:
          name: tfplan-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}/tfplan
  apply:
    needs: plan
    runs-on: ubuntu-latest
    if: github.ref == 'refs/heads/main' && github.event_name == 'push'
    environment: production # Requires approval
    strategy:
      matrix:
        environment: [dev, staging, prod]
      max-parallel: 1 # Apply sequentially
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: 1.6.0
      - name: Configure AWS Credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: eu-west-1
      - name: Download Plan
        uses: actions/download-artifact@v4
        with:
          name: tfplan-${{ matrix.environment }}
          path: environments/${{ matrix.environment }}
      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init
      - name: Terraform Apply
        working-directory: environments/${{ matrix.environment }}
        run: terraform apply tfplan # A saved plan applies without prompting
Key Principles
1. Always plan before apply:
# Never do this: it applies whatever Terraform computes at that moment, unreviewed
- run: terraform apply -auto-approve
# Always do this
- run: terraform plan -out=tfplan
- run: terraform apply tfplan # Applies the exact plan that was reviewed, without prompting
2. Use saved plan files:
The plan you review must be the plan you apply:
# Plan and save
terraform plan -out=tfplan
# Apply exact plan (no drift between plan and apply)
terraform apply tfplan
3. Require approval for production:
apply-prod:
  environment: production # GitHub environment with required reviewers
4. Sequential applies for dependent environments:
strategy:
  matrix:
    environment: [dev, staging, prod]
  max-parallel: 1 # One at a time
Handling Plan Drift
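A saved plan can go stale if infrastructure or state changes between plan time and apply time, and Terraform refuses to apply a plan made against an older state. Use -detailed-exitcode to tell "no changes" from "changes present", and skip the apply when there is nothing to do:
- name: Check for Drift
  id: plan
  run: |
    set +e # Exit code 2 means "changes present"; don't let bash -e abort the step
    terraform plan -detailed-exitcode -out=tfplan
    EXIT_CODE=$?
    set -e
    # Later steps can skip apply with: if: steps.plan.outputs.exit_code == '2'
    echo "exit_code=$EXIT_CODE" >> "$GITHUB_OUTPUT"
    if [ "$EXIT_CODE" -eq 2 ]; then
      echo "Changes detected"
    elif [ "$EXIT_CODE" -eq 0 ]; then
      echo "No changes"
    else
      echo "Error"
      exit 1
    fi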
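If you wrap Terraform in your own tooling instead of shell steps, the same decision can be made in code. The Go sketch below maps the exit code of `terraform plan -detailed-exitcode` (0 for no changes, 1 for an error, 2 for changes present) to an apply or skip decision. The function name `shouldApply` is hypothetical, and the sketch deliberately does not run Terraform itself.

```go
package main

import "fmt"

// shouldApply interprets the exit code of `terraform plan -detailed-exitcode`.
// It returns true when the plan contains changes (exit code 2), false when
// there is nothing to apply (exit code 0), and an error for anything else.
func shouldApply(exitCode int) (bool, error) {
	switch exitCode {
	case 0:
		return false, nil
	case 2:
		return true, nil
	default:
		return false, fmt.Errorf("terraform plan failed with exit code %d", exitCode)
	}
}

func main() {
	for _, code := range []int{0, 1, 2} {
		apply, err := shouldApply(code)
		if err != nil {
			fmt.Printf("exit code %d: error: %v\n", code, err)
			continue
		}
		fmt.Printf("exit code %d: apply=%t\n", code, apply)
	}
}
```

Treating every unexpected code as an error keeps a failed plan from being mistaken for "no changes".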
Security Best Practices
Terraform can create security holes in your infrastructure or expose secrets. Here’s how to prevent both.
Never Commit Secrets
# NEVER do this
resource "aws_db_instance" "main" {
  password = "super_secret_password" # Committed to git!
}
# Also bad - .tfvars can be committed accidentally
# terraform.tfvars
db_password = "super_secret_password"
Option 1: Environment variables
export TF_VAR_db_password="secret"
terraform apply
variable "db_password" {
  type      = string
  sensitive = true
}
Option 2: Secrets manager reference
data "aws_secretsmanager_secret_version" "db_password" {
  secret_id = "prod/database/password"
}
resource "aws_db_instance" "main" {
  password = data.aws_secretsmanager_secret_version.db_password.secret_string
}
Option 3: Generate random passwords
resource "random_password" "db" {
  length  = 32
  special = true
}
resource "aws_db_instance" "main" {
  password = random_password.db.result
}
# Store in secrets manager for applications
resource "aws_secretsmanager_secret_version" "db_password" {
  secret_id     = aws_secretsmanager_secret.db.id
  secret_string = random_password.db.result
}
Whichever option you choose, the password is still written to the Terraform state in plain text, so encrypt the state backend and restrict who can read it.
Use OIDC for CI/CD Authentication
Don’t use long-lived access keys:
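# GitHub Actions with OIDC (job-level settings)
permissions:
  id-token: write # Lets the job request an OIDC token
  contents: read
steps:
  - name: Configure AWS Credentials
    uses: aws-actions/configure-aws-credentials@v4
    with:
      role-to-assume: arn:aws:iam::123456789012:role/github-actions-terraform
      aws-region: eu-west-1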
# IAM role for GitHub Actions
data "aws_caller_identity" "current" {}
resource "aws_iam_role" "github_actions" {
  name = "github-actions-terraform"
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Principal = {
          Federated = "arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/token.actions.githubusercontent.com"
        }
        Action = "sts:AssumeRoleWithWebIdentity"
        Condition = {
          StringEquals = {
            "token.actions.githubusercontent.com:aud" = "sts.amazonaws.com"
          }
          StringLike = {
            "token.actions.githubusercontent.com:sub" = "repo:myorg/infrastructure:*"
          }
        }
      }
    ]
  })
}
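The `sub` condition decides which workflow runs may assume the role, so it helps to see what the trailing wildcard admits. The Go sketch below approximates IAM's `StringLike` matching (`*` for any run of characters, `?` for exactly one) and tests a few subject claims in the format GitHub issues. It is an illustration only, not AWS's implementation, and `stringLike` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// stringLike reports whether value matches pattern, where "*" matches any run
// of characters (including "/" and ":") and "?" matches exactly one character.
// This approximates IAM's StringLike operator for illustration.
func stringLike(pattern, value string) bool {
	expr := regexp.QuoteMeta(pattern)           // escape everything, including * and ?
	expr = strings.ReplaceAll(expr, `\*`, `.*`) // then turn the escaped wildcards
	expr = strings.ReplaceAll(expr, `\?`, `.`)  // back into regex equivalents
	return regexp.MustCompile(`^(?s:` + expr + `)$`).MatchString(value)
}

func main() {
	pattern := "repo:myorg/infrastructure:*"
	subjects := []string{
		"repo:myorg/infrastructure:ref:refs/heads/main",
		"repo:myorg/infrastructure:pull_request",
		"repo:myorg/other-repo:ref:refs/heads/main",
	}
	for _, sub := range subjects {
		fmt.Printf("%-50s allowed=%t\n", sub, stringLike(pattern, sub))
	}
}
```

Because the wildcard also matches pull request runs, a stricter pattern such as `repo:myorg/infrastructure:ref:refs/heads/main` is worth considering for roles that can apply to production.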
Principle of Least Privilege
# Bad - full admin access
resource "aws_iam_role_policy_attachment" "terraform" {
  role       = aws_iam_role.terraform.name
  policy_arn = "arn:aws:iam::aws:policy/AdministratorAccess"
}
# Better - limited to the services and region Terraform manages
resource "aws_iam_role_policy" "terraform" {
  role = aws_iam_role.terraform.name
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Effect = "Allow"
        Action = [
          "ec2:*",
          "rds:*",
          "s3:*"
        ]
        Resource = "*"
        Condition = {
          StringEquals = {
            "aws:RequestedRegion" = ["eu-west-1"]
          }
        }
      }
    ]
  })
}
Security Scanning
Run security checks in CI:
- name: Checkov Security Scan
  uses: bridgecrewio/checkov-action@v12
  with:
    directory: .
    framework: terraform
    soft_fail: false
- name: tfsec
  uses: aquasecurity/tfsec-action@v1.0.3
  with:
    soft_fail: false
Drift Detection
Infrastructure drifts when someone makes changes outside Terraform (console, CLI, other tools). Detecting and reconciling drift is essential for maintaining infrastructure as code integrity.
Detecting Drift
# Refresh state and compare
terraform plan -refresh-only
# Output:
# Note: Objects have changed outside of Terraform
#
# ~ resource "aws_security_group" "web" {
# ~ ingress {
# + cidr_blocks = ["0.0.0.0/0"] # Added manually!
# }
# }
Automated Drift Detection
Schedule drift checks:
# .github/workflows/drift-detection.yml
name: Drift Detection
on:
  schedule:
    - cron: '0 8 * * *' # Daily at 08:00 UTC
permissions:
  id-token: write # Required for OIDC role assumption
  contents: read
jobs:
  detect-drift:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        environment: [dev, staging, prod]
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - name: Configure AWS
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }}
          aws-region: eu-west-1
      - name: Terraform Init
        working-directory: environments/${{ matrix.environment }}
        run: terraform init
      - name: Detect Drift
        id: drift
        working-directory: environments/${{ matrix.environment }}
        run: |
          terraform plan -detailed-exitcode -refresh-only -out=drift.tfplan 2>&1 | tee drift.txt
          # $? would be tee's exit code; PIPESTATUS[0] is terraform's
          code=${PIPESTATUS[0]}
          echo "exit_code=$code" >> "$GITHUB_OUTPUT"
          if [ "$code" -eq 1 ]; then exit 1; fi
      - name: Alert on Drift
        if: steps.drift.outputs.exit_code == '2'
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "⚠️ Drift detected in ${{ matrix.environment }}!",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": "Infrastructure drift detected in *${{ matrix.environment }}*. Review and reconcile."
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_WEBHOOK }}
          SLACK_WEBHOOK_TYPE: INCOMING_WEBHOOK # Needed for block payloads sent to an incoming webhook
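An alert is more useful when it names what drifted. One way to get that, sketched below in Go, is to convert the saved plan with `terraform show -json drift.tfplan` and read the `resource_drift` array from the result. This assumes the plan JSON includes `resource_drift` as described in Terraform's JSON output format. The function name `driftSummary` and the sample input are hypothetical.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// driftReport mirrors only the `resource_drift` part of the plan JSON.
type driftReport struct {
	ResourceDrift []struct {
		Address string `json:"address"`
		Change  struct {
			Actions []string `json:"actions"`
		} `json:"change"`
	} `json:"resource_drift"`
}

// driftSummary returns one line per drifted resource, in the form
// "address (actions)". It returns an empty string when no drift is reported.
func driftSummary(planJSON []byte) (string, error) {
	var r driftReport
	if err := json.Unmarshal(planJSON, &r); err != nil {
		return "", fmt.Errorf("parsing plan JSON: %w", err)
	}
	lines := make([]string, 0, len(r.ResourceDrift))
	for _, d := range r.ResourceDrift {
		actions := strings.Join(d.Change.Actions, ", ")
		lines = append(lines, fmt.Sprintf("%s (%s)", d.Address, actions))
	}
	return strings.Join(lines, "\n"), nil
}

func main() {
	// Hand-written sample, trimmed to the fields the summary reads.
	sample := []byte(`{"resource_drift":[
	  {"address":"aws_security_group.web","change":{"actions":["update"]}}
	]}`)
	summary, err := driftSummary(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println(summary)
}
```

The summary text could then be added to the Slack payload so reviewers see the affected resources without opening the workflow logs.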
Reconciling Drift
Option 1: Accept the changes (keep manual changes)
# If the manual change was intentional
terraform plan -refresh-only
terraform apply -refresh-only # Update state to match reality
# Then update the code to match, or the next normal apply will revert the change
Option 2: Override (revert to Terraform)
# If the manual change was a mistake
terraform apply # Revert infrastructure to match code
Option 3: Import resources created outside Terraform
# Import a manually created resource
terraform import aws_security_group.new sg-12345678
Team Workflows
PR Templates
<!-- .github/pull_request_template.md -->
## Infrastructure Change
### Description
<!-- What are you changing and why? -->
### Environment(s) Affected
- [ ] dev
- [ ] staging
- [ ] prod
### Type of Change
- [ ] New infrastructure
- [ ] Modification to existing infrastructure
- [ ] Destruction of infrastructure
- [ ] Refactoring (no functional change)
### Checklist
- [ ] I have run `terraform fmt`
- [ ] I have run `terraform validate`
- [ ] I have reviewed the plan output
- [ ] I have updated documentation if needed
- [ ] I have considered the blast radius of this change
- [ ] I have a rollback plan if needed
### Terraform Plan Summary
<!-- Paste or screenshot the plan summary -->
Code Owners
# .github/CODEOWNERS
# Platform team owns core infrastructure
/environments/prod/ @myorg/platform-team
/modules/networking/ @myorg/platform-team
/modules/eks/ @myorg/platform-team
# Data team owns data infrastructure
/environments/*/data-* @myorg/data-team
/modules/redshift/ @myorg/data-team
# Security team reviews security-sensitive changes
/modules/iam/ @myorg/security-team
**/security*.tf @myorg/security-team
Documentation
Document important decisions:
<!-- docs/decisions/001-state-management.md -->
# ADR 001: State Management Strategy
## Status
Accepted
## Context
We need to decide how to manage Terraform state across multiple teams and environments.
## Decision
We will use S3 with DynamoDB locking, with one state file per component per environment.
## Consequences
- Teams can work independently
- Clear blast radius for each apply
- Need to set up cross-stack data sharing via remote state or SSM parameters
Handling Emergencies
When you need to bypass normal process:
## Emergency Change Process
1. **Communicate** - Alert the team in #infrastructure-alerts
2. **Document** - Create an incident ticket
3. **Make the change** - Use Terraform if possible, console if necessary
4. **Reconcile** - If console change, create PR to update Terraform ASAP
5. **Retrospective** - Document what happened and how to prevent recurrence
Performance Optimisation
Large Terraform configurations can be slow. Here’s how to speed them up.
Parallelism
# Default parallelism is 10
terraform apply -parallelism=20
Target Specific Resources
# Only plan/apply specific resources
terraform plan -target=module.api
terraform apply -target=aws_instance.web
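Warning: Use sparingly. A targeted apply skips everything outside the target and its dependencies, so the rest of your infrastructure can be left out of sync with the configuration.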
State File Size
Terraform refreshes every resource in a state on each plan, so large states are slow. Split them into smaller states:
# Instead of one huge state
/infrastructure/terraform.tfstate # 500+ resources, slow
# Split by component
/infrastructure/networking/terraform.tfstate # 50 resources
/infrastructure/compute/terraform.tfstate # 100 resources
/infrastructure/database/terraform.tfstate # 30 resources
Provider Caching
# Set plugin cache directory (Terraform does not create it for you)
mkdir -p "$HOME/.terraform.d/plugin-cache"
export TF_PLUGIN_CACHE_DIR="$HOME/.terraform.d/plugin-cache"
# Providers are downloaded once, reused across projects
Summary
Terraform at scale requires:
- Testing - Static analysis, policy testing, integration tests
- CI/CD - Automated validation, plan review, controlled applies
- Security - No secrets in code, OIDC auth, least privilege
- Drift management - Detection, alerting, reconciliation
- Team processes - PR templates, code owners, documentation
Infrastructure as code only works if you treat it like code: tested, reviewed, and versioned.