
FinOps for Engineering Teams - Making Cost Everyone's Problem

DevOps · Culture


“The cloud bill is too high.”

If you’ve heard this from finance but don’t know what your team specifically costs, you’re not alone. Most engineering teams have zero visibility into their cloud spend. They provision resources, ship features, and assume someone else worries about the bill.

That disconnect is expensive. The people making architectural decisions (engineers) are separated from the financial impact of those decisions. Meanwhile, finance sees a massive AWS bill but can’t tell which team or service is responsible.

FinOps bridges that gap. It’s not about cost-cutting - it’s about making informed trade-offs.

TL;DR

  • Engineers make decisions that drive 80%+ of cloud costs
  • Cost visibility must be at the team/service level, not just account level
  • Tagging is the foundation - enforce it ruthlessly
  • Build cost awareness into CI/CD and code review
  • Start with the big wins: right-sizing, unused resources, reserved capacity

Why Engineering Owns Cloud Costs

Finance can negotiate contracts and pay invoices. They can’t:

  • Choose between Lambda and ECS
  • Decide if you need 3 replicas or 10
  • Pick the right instance type for your workload
  • Design efficient data pipelines
  • Avoid the N+1 query that scans terabytes

These are engineering decisions with financial consequences. A single architectural choice can be the difference between $1,000/month and $100,000/month.

The old model - engineering builds, finance pays - doesn’t work in the cloud. You need engineers who understand cost as a feature, not an afterthought.


The Foundation: Tagging Strategy

You can’t optimise what you can’t measure. Tagging is how you measure.

Required Tags

# Minimum viable tagging strategy
tags:
  team: "platform"           # Who owns this?
  service: "api-gateway"     # What is it part of?
  environment: "production"  # prod/staging/dev
  cost-center: "eng-001"     # Finance's identifier
  managed-by: "terraform"    # How was it created?
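The same rule can be checked in scripts or CI before anything reaches Terraform. A minimal sketch (the tag keys mirror the schema above; the function name is illustrative):

```python
# Required tag keys from the minimum viable tagging strategy above
REQUIRED_TAGS = {"team", "service", "environment", "cost-center", "managed-by"}

def missing_tags(resource_tags: dict) -> set:
    """Return the required tag keys that a resource is missing."""
    return REQUIRED_TAGS - resource_tags.keys()
```

Run it over the output of `aws resourcegroupstaggingapi get-resources` for a quick compliance report.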

Enforce Tags with Terraform

# modules/required-tags/main.tf
variable "required_tags" {
  type = map(string)
  validation {
    condition = alltrue([
      contains(keys(var.required_tags), "team"),
      contains(keys(var.required_tags), "service"),
      contains(keys(var.required_tags), "environment"),
    ])
    error_message = "Required tags: team, service, environment"
  }
}

# Use in all resources
resource "aws_instance" "example" {
  # ... config ...
  
  tags = merge(var.required_tags, {
    Name = "my-instance"
  })
}

Enforce Tags with AWS SCPs

Block untagged resource creation:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTags",
      "Effect": "Deny",
      "Action": [
        "ec2:RunInstances",
        "rds:CreateDBInstance",
        "elasticloadbalancing:CreateLoadBalancer"
      ],
      "Resource": "*",
      "Condition": {
        "Null": {
          "aws:RequestTag/team": "true",
          "aws:RequestTag/service": "true"
        }
      }
    }
  ]
}

Visibility: Cost Dashboards

Tags are useless without dashboards. Engineers need to see their costs.

AWS Cost Explorer by Tag

# Get last month's cost by team
aws ce get-cost-and-usage \
  --time-period Start=2026-01-01,End=2026-02-01 \
  --granularity MONTHLY \
  --metrics "UnblendedCost" \
  --group-by Type=TAG,Key=team \
  --output table

Automated Slack Reports

# Lambda function for weekly cost reports
# Note: "requests" is not bundled in the Lambda Python runtime -
# package it with the function or swap in urllib.
import os
import boto3
import requests

SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]

def lambda_handler(event, context):
    ce = boto3.client('ce')

    response = ce.get_cost_and_usage(
        TimePeriod={
            'Start': '2026-01-27',  # hard-coded window for illustration
            'End': '2026-02-03'
        },
        Granularity='DAILY',
        Metrics=['UnblendedCost'],
        GroupBy=[
            {'Type': 'TAG', 'Key': 'team'}
        ]
    )

    # Aggregate the daily results into a per-team total
    costs_by_team = {}
    for result in response['ResultsByTime']:
        for group in result['Groups']:
            team = group['Keys'][0].replace('team$', '') or 'untagged'
            cost = float(group['Metrics']['UnblendedCost']['Amount'])
            costs_by_team[team] = costs_by_team.get(team, 0) + cost

    # Format for Slack, biggest spender first
    message = "📊 *Weekly Cloud Costs by Team*\n"
    for team, cost in sorted(costs_by_team.items(), key=lambda x: -x[1]):
        message += f"• {team}: ${cost:,.2f}\n"

    # Post to Slack
    requests.post(SLACK_WEBHOOK_URL, json={"text": message}, timeout=10)
    return {"statusCode": 200}

Grafana Dashboard

# Prometheus/CloudWatch metrics for real-time cost visibility
# Use AWS Cost and Usage Reports exported to S3/Athena

# Example Athena query for Grafana
SELECT 
  line_item_usage_account_id as account,
  resource_tags_user_team as team,
  resource_tags_user_service as service,
  SUM(line_item_unblended_cost) as cost
FROM cost_and_usage_report
WHERE 
  month = '1' 
  AND year = '2026'
GROUP BY 1, 2, 3
ORDER BY cost DESC

Build Cost into CI/CD

Infracost in Pull Requests

Show cost impact before merging:

# .github/workflows/infracost.yml
name: Infracost
on:
  pull_request:
    paths:
      - '**/*.tf'
      - '**/*.tfvars'

jobs:
  infracost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      
      - name: Generate Infracost diff
        run: |
          infracost diff \
            --path=. \
            --format=json \
            --out-file=/tmp/infracost.json
      
      - name: Post Infracost comment
        uses: infracost/actions/comment@v1
        with:
          path: /tmp/infracost.json
          behavior: update

This posts comments like:

💰 Monthly cost will increase by $1,234 (15%)

| Resource | Before | After | Change |
|----------|--------|-------|--------|
| aws_instance.api | $50 | $200 | +$150 |
| aws_rds_instance.db | $100 | $500 | +$400 |

Cost Budgets as Code

# Terraform budget alerts
resource "aws_budgets_budget" "team_platform" {
  name              = "team-platform-monthly"
  budget_type       = "COST"
  limit_amount      = "5000"
  limit_unit        = "USD"
  time_unit         = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:team$platform"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["platform-team@company.com"]
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["platform-team@company.com", "finance@company.com"]
  }
}

Quick Wins: The 80/20 of Cost Optimisation

1. Right-Size Instances

Most instances are over-provisioned. Check utilisation:

# Find under-utilized EC2 instances
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
  --start-time 2026-01-01T00:00:00Z \
  --end-time 2026-02-01T00:00:00Z \
  --period 86400 \
  --statistics Average \
  --output table

If average CPU is under 20%, you’re over-provisioned.
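That check is easy to script. A sketch of the decision logic only (the 20% threshold is the rule of thumb above; fetch the datapoints with CloudWatch as in the CLI example):

```python
def is_overprovisioned(daily_cpu_averages: list[float], threshold: float = 20.0) -> bool:
    """True if average CPU over the window is below the threshold."""
    if not daily_cpu_averages:
        return False  # no data - don't flag
    return sum(daily_cpu_averages) / len(daily_cpu_averages) < threshold
```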

Automated right-sizing with AWS Compute Optimizer:

# Enable Compute Optimizer
resource "aws_computeoptimizer_enrollment_status" "main" {
  status = "Active"
}

# Query recommendations via CLI
# aws compute-optimizer get-ec2-instance-recommendations

2. Delete Unused Resources

The most expensive resource is one you’re not using:

# Unattached EBS volumes
aws ec2 describe-volumes \
  --filters Name=status,Values=available \
  --query 'Volumes[*].[VolumeId,Size,CreateTime]' \
  --output table

# Old snapshots (> 90 days)
aws ec2 describe-snapshots \
  --owner-ids self \
  --query 'Snapshots[?StartTime<=`2025-11-01`].[SnapshotId,VolumeSize,StartTime]' \
  --output table

# Unused Elastic IPs
aws ec2 describe-addresses \
  --query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]' \
  --output table

# Old AMIs
aws ec2 describe-images \
  --owners self \
  --query 'Images[?CreationDate<=`2025-01-01`].[ImageId,Name,CreationDate]' \
  --output table
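Putting a dollar figure on the findings makes the cleanup case for you. A hypothetical sketch for the unattached-volumes query above (the $/GB-month rate varies by volume type and region; ~$0.08 for gp3 is an assumption, not a quoted price):

```python
# Assumed gp3 rate in USD per GB-month - check your region's pricing
GB_MONTH_RATE = 0.08

def estimated_monthly_waste(volume_sizes_gb: list[int], rate: float = GB_MONTH_RATE) -> float:
    """Estimated monthly spend on EBS volumes attached to nothing."""
    return sum(volume_sizes_gb) * rate
```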

3. Reserved Instances / Savings Plans

If you have stable baseline usage, commit to it:

On-demand m5.xlarge: $0.192/hour = $140/month
1-year reserved (no upfront): $0.122/hour = $89/month
Savings: 36%

3-year reserved (all upfront): $0.076/hour = $55/month
Savings: 60%
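Those percentages fall straight out of the hourly rates. A quick sanity check (730 hours/month is the usual AWS convention):

```python
HOURS_PER_MONTH = 730  # AWS's usual monthly-hours convention

def monthly_cost(hourly_rate: float) -> float:
    """Approximate monthly cost of an always-on instance."""
    return hourly_rate * HOURS_PER_MONTH

def savings_pct(on_demand_hourly: float, committed_hourly: float) -> float:
    """Percentage saved by the committed rate versus on-demand."""
    return (1 - committed_hourly / on_demand_hourly) * 100
```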

When to reserve:

  • Baseline load that’s always running
  • Databases (usually 24/7)
  • Core infrastructure (NAT, bastion, monitoring)

When NOT to reserve:

  • Auto-scaled workloads (use Savings Plans instead)
  • Workloads you might eliminate
  • Anything you’re not sure about

4. Spot Instances for Fault-Tolerant Workloads

# EKS node group with spot instances
resource "aws_eks_node_group" "spot" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "spot-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  capacity_type   = "SPOT"
  
  instance_types = ["m5.large", "m5a.large", "m5n.large", "m4.large"]
  
  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 1
  }
}

Spot can save 60-90% on compute, but instances can be reclaimed with only two minutes' notice. Use it for:

  • CI/CD runners
  • Batch processing
  • Stateless web servers (with proper load balancing)
  • Dev/test environments

5. Data Transfer Costs

Data transfer is the hidden killer:

Same AZ: Free
Cross-AZ: $0.01/GB each way ($0.02 round trip)
To internet: $0.09/GB (first 10TB)
Cross-region: $0.02/GB
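A quick estimator using the rates above makes the impact concrete (typical us-east-1 figures; the internet rate is tiered, so this only models the first 10 TB):

```python
# Per-GB rates from the table above, in USD
RATES_PER_GB = {
    "same_az": 0.00,
    "cross_az": 0.02,       # $0.01 each way
    "internet": 0.09,       # first 10TB tier only
    "cross_region": 0.02,
}

def monthly_transfer_cost(gb_per_month: float, path: str) -> float:
    """Estimated monthly data transfer cost for a given path."""
    return gb_per_month * RATES_PER_GB[path]
```

For example, 10 TB/month of cross-AZ chatter is roughly $200/month; co-locating those services in one AZ makes it free.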

Optimisations:

  • Keep chattier services in the same AZ
  • Use VPC endpoints (avoid NAT for AWS services)
  • Compress data before transfer
  • Cache aggressively (CloudFront, ElastiCache)

Team Cost Reviews

Make cost a regular topic, not a crisis response.

Monthly Cost Review Format

# Platform Team - January 2026 Cost Review

## Summary
- Total spend: $45,231 (+12% from December)
- Budget: $50,000 (90% utilized)
- Forecast: $48,500

## Top 5 Cost Drivers
1. EKS cluster compute: $18,000 (40%)
2. RDS databases: $12,000 (27%)
3. Data transfer: $6,000 (13%)
4. S3 storage: $4,000 (9%)
5. CloudWatch: $2,500 (6%)

## What Changed
- New ML pipeline added $3,000/month
- Scaled API servers for holiday traffic (+$2,000)
- Fixed NAT gateway redundancy (-$1,500)

## Action Items
- [ ] Right-size RDS dev instances (est. savings: $800/month)
- [ ] Enable S3 Intelligent-Tiering (est. savings: $400/month)
- [ ] Investigate CloudWatch costs spike

## Next Month Forecast
- Expecting $48,000 (holiday traffic normalizing)
- New feature launch may add $2,000

Cost Anomaly Detection

Set up automated alerts for unexpected changes:

resource "aws_ce_anomaly_monitor" "service" {
  name              = "service-anomaly-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name      = "cost-anomaly-alerts"
  frequency = "DAILY"
  
  monitor_arn_list = [
    aws_ce_anomaly_monitor.service.arn
  ]
  
  subscriber {
    type    = "EMAIL"
    address = "platform-team@company.com"
  }
  
  subscriber {
    type    = "SNS"
    address = aws_sns_topic.cost_alerts.arn
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"]  # Alert if anomaly > $100
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}

Culture: Making Cost Part of Engineering

Code Review Checklist

Add cost considerations to your PR template:

## Cost Impact
- [ ] No new AWS resources
- [ ] New resources are right-sized
- [ ] Resources have required tags
- [ ] Considered spot/preemptible instances
- [ ] No hardcoded instance types (use variables)
- [ ] Infracost estimate reviewed

Engineering Scorecards

Include cost metrics alongside reliability and velocity:

| Metric | Target | Actual |
|--------|--------|--------|
| Deployment frequency | 10/week | 12/week ✅ |
| Change failure rate | <5% | 3% ✅ |
| Mean time to recovery | <1hr | 45min ✅ |
| Cost efficiency | <$5/1K requests | $4.20/1K |
| Resource utilization | >50% CPU avg | 62% |
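The cost-efficiency metric is just billing data divided by request volume. A sketch with illustrative names (pull the inputs from Cost Explorer and your request metrics):

```python
def cost_per_thousand_requests(monthly_cost_usd: float, monthly_requests: int) -> float:
    """The scorecard's cost-efficiency metric: dollars per 1,000 requests served."""
    return monthly_cost_usd / (monthly_requests / 1000)
```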

Gamification (Use Carefully)

Some teams create friendly competition:

  • Monthly “Cost Cutter” award for biggest optimisation
  • Leaderboard of cost per team (normalised by traffic/value)
  • Share war stories of wasteful resources found

But don’t over-index on cost at the expense of velocity or reliability.


Tools to Consider

| Tool | Purpose | Cost |
|------|---------|------|
| Infracost | Cost estimates in PRs | Free tier available |
| AWS Cost Explorer | Native AWS cost analysis | Free |
| Kubecost | Kubernetes cost allocation | Free tier available |
| CloudHealth | Multi-cloud FinOps platform | Enterprise |
| Spot.io | Automated spot instance management | Percentage of savings |
| AWS Compute Optimizer | Right-sizing recommendations | Free |

Conclusion

FinOps isn’t about spending less - it’s about spending intentionally. Engineers should know:

  1. What their services cost
  2. Why they cost that much
  3. Whether that cost is reasonable for the value delivered

The goal isn’t the cheapest infrastructure. It’s infrastructure where every dollar is a conscious choice, not an accident.

Start with tagging and visibility. Everything else follows.
