FinOps for Engineering Teams - Making Cost Everyone’s Problem
“The cloud bill is too high.”
If you’ve heard this from finance but don’t know what your team specifically costs, you’re not alone. Most engineering teams have zero visibility into their cloud spend. They provision resources, ship features, and assume someone else worries about the bill.
That disconnect is expensive. The people making architectural decisions (engineers) are separated from the financial impact of those decisions. Meanwhile, finance sees a massive AWS bill but can’t tell which team or service is responsible.
FinOps bridges that gap. It’s not about cost-cutting - it’s about making informed trade-offs.
TL;DR
- Engineers make decisions that drive 80%+ of cloud costs
- Cost visibility must be at the team/service level, not just account level
- Tagging is the foundation - enforce it ruthlessly
- Build cost awareness into CI/CD and code review
- Start with the big wins: right-sizing, unused resources, reserved capacity
Why Engineering Owns Cloud Costs
Finance can negotiate contracts and pay invoices. They can’t:
- Choose between Lambda and ECS
- Decide if you need 3 replicas or 10
- Pick the right instance type for your workload
- Design efficient data pipelines
- Avoid the N+1 query, or the unpartitioned scan over terabytes
These are engineering decisions with financial consequences. A single architectural choice can be the difference between $1,000/month and $100,000/month.
The old model - engineering builds, finance pays - doesn’t work in the cloud. You need engineers who understand cost as a feature, not an afterthought.
The Foundation: Tagging Strategy
You can’t optimise what you can’t measure. Tagging is how you measure.
Required Tags
# Minimum viable tagging strategy
tags:
  team: "platform"            # Who owns this?
  service: "api-gateway"      # What is it part of?
  environment: "production"   # prod/staging/dev
  cost-center: "eng-001"      # Finance's identifier
  managed-by: "terraform"     # How was it created?
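The same tag set can also be checked outside Terraform, for example in a script that audits existing resources. The sketch below is a minimal illustration, not part of any AWS tooling: `find_tag_problems`, `REQUIRED_TAGS`, and the allowed environment values are all hypothetical names and assumptions made for this example.

```python
# Hypothetical tag policy mirroring the five tags listed above.
REQUIRED_TAGS = ("team", "service", "environment", "cost-center", "managed-by")

# Assumption: these are the only environment names in use.
ALLOWED_ENVIRONMENTS = {"production", "staging", "dev"}


def find_tag_problems(tags: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means the tags pass."""
    problems = []
    for key in REQUIRED_TAGS:
        # Treat a missing key and a blank value the same way.
        if not tags.get(key, "").strip():
            problems.append(f"missing or empty tag: {key}")

    environment = tags.get("environment", "").strip()
    if environment and environment not in ALLOWED_ENVIRONMENTS:
        problems.append(f"unexpected environment: {environment}")
    return problems


print(find_tag_problems({"team": "platform", "service": "api-gateway"}))
```

Returning a list of problems rather than a boolean makes the output usable directly in a report or a CI failure message.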
Enforce Tags with Terraform
# modules/required-tags/main.tf
variable "required_tags" {
  type = map(string)

  validation {
    condition = alltrue([
      contains(keys(var.required_tags), "team"),
      contains(keys(var.required_tags), "service"),
      contains(keys(var.required_tags), "environment"),
    ])
    error_message = "Required tags: team, service, environment"
  }
}

# Use in all resources
resource "aws_instance" "example" {
  # ... config ...
  tags = merge(var.required_tags, {
    Name = "my-instance"
  })
}
Enforce Tags with AWS SCPs
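Block untagged resource creation. Each required tag gets its own statement, because multiple keys inside one condition block are ANDed and would only deny requests missing both tags. The resources are scoped to the thing being created so that untaggable resources in the same request (subnets, AMIs) don't trigger the deny:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "RequireTeamTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "elasticloadbalancing:CreateLoadBalancer"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:elasticloadbalancing:*:*:loadbalancer/*"],
      "Condition": { "Null": { "aws:RequestTag/team": "true" } }
    },
    {
      "Sid": "RequireServiceTag",
      "Effect": "Deny",
      "Action": ["ec2:RunInstances", "rds:CreateDBInstance", "elasticloadbalancing:CreateLoadBalancer"],
      "Resource": ["arn:aws:ec2:*:*:instance/*", "arn:aws:rds:*:*:db:*", "arn:aws:elasticloadbalancing:*:*:loadbalancer/*"],
      "Condition": { "Null": { "aws:RequestTag/service": "true" } }
    }
  ]
}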
Visibility: Cost Dashboards
Tags are useless without dashboards. Engineers need to see their costs.
AWS Cost Explorer by Tag
# Get last month's cost by team
aws ce get-cost-and-usage \
--time-period Start=2026-01-01,End=2026-02-01 \
--granularity MONTHLY \
--metrics "UnblendedCost" \
--group-by Type=TAG,Key=team \
--output table
Automated Slack Reports
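# Lambda function for weekly cost reports
import json
import os
import urllib.request
from datetime import date, timedelta

import boto3

# Assumed to be set as a Lambda environment variable
SLACK_WEBHOOK_URL = os.environ["SLACK_WEBHOOK_URL"]


def lambda_handler(event, context):
    ce = boto3.client("ce")

    # Cost Explorer treats End as exclusive, so this covers the last 7 full days
    end = date.today()
    start = end - timedelta(days=7)

    response = ce.get_cost_and_usage(
        TimePeriod={"Start": start.isoformat(), "End": end.isoformat()},
        Granularity="DAILY",
        Metrics=["UnblendedCost"],
        GroupBy=[{"Type": "TAG", "Key": "team"}],
    )

    # Sum the daily figures per team; untagged spend comes back as "team$"
    costs_by_team = {}
    for result in response["ResultsByTime"]:
        for group in result["Groups"]:
            team = group["Keys"][0].replace("team$", "") or "untagged"
            cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
            costs_by_team[team] = costs_by_team.get(team, 0) + cost

    # Format for Slack, largest spender first
    message = "📊 *Weekly Cloud Costs by Team*\n"
    for team, cost in sorted(costs_by_team.items(), key=lambda x: -x[1]):
        message += f"• {team}: ${cost:,.2f}\n"

    # Post to Slack with the standard library (requests is not bundled with Lambda)
    request = urllib.request.Request(
        SLACK_WEBHOOK_URL,
        data=json.dumps({"text": message}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(request)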
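One number worth adding to that report is how much of the spend is actually attributed to a team. The sketch below assumes the same `costs_by_team` dictionary the report builds, with unattributed spend stored under the `untagged` key; `tag_coverage` is a hypothetical helper name.

```python
def tag_coverage(costs_by_team: dict[str, float]) -> float:
    """Return the fraction of spend (0.0 to 1.0) attributed to a named team."""
    total = sum(costs_by_team.values())
    if total <= 0:
        # Nothing was spent, so there is nothing left unattributed.
        return 1.0
    untagged = costs_by_team.get("untagged", 0.0)
    return (total - untagged) / total


weekly = {"platform": 750.0, "data": 150.0, "untagged": 100.0}
print(f"Tagged spend: {tag_coverage(weekly):.0%}")
```

Tracking this figure week over week shows whether tag enforcement is working: if coverage drops, some new resource type is slipping past the policy.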
Grafana Dashboard
-- Source: AWS Cost and Usage Reports (CUR) exported to S3 and queried through Athena
-- CUR data is refreshed at least once a day, so this is daily visibility, not real-time
-- Example Athena query for Grafana
SELECT
  line_item_usage_account_id AS account,
  resource_tags_user_team AS team,
  resource_tags_user_service AS service,
  SUM(line_item_unblended_cost) AS cost
FROM cost_and_usage_report
WHERE
  month = '1'
  AND year = '2026'
GROUP BY 1, 2, 3
ORDER BY cost DESC
Build Cost into CI/CD
Infracost in Pull Requests
Show cost impact before merging:
# .github/workflows/infracost.yml
name: Infracost
on:
  pull_request:
    paths:
      - '**/*.tf'
      - '**/*.tfvars'
permissions:
  contents: read
  pull-requests: write  # needed to post the comment
jobs:
  infracost:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Infracost
        uses: infracost/actions/setup@v3
        with:
          api-key: ${{ secrets.INFRACOST_API_KEY }}
      - name: Generate Infracost diff
        run: |
          infracost diff \
            --path=. \
            --format=json \
            --out-file=/tmp/infracost.json
      - name: Post Infracost comment
        run: |
          infracost comment github \
            --path=/tmp/infracost.json \
            --repo=$GITHUB_REPOSITORY \
            --pull-request=${{ github.event.pull_request.number }} \
            --github-token=${{ github.token }} \
            --behavior=update
This posts comments like:
💰 Monthly cost will increase by $550 (15%)
| Resource | Before | After | Change |
|----------|--------|-------|--------|
| aws_instance.api | $50 | $200 | +$150 |
| aws_db_instance.db | $100 | $500 | +$400 |
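A comment is easy to ignore, so some teams also fail the check when the increase crosses a limit. The sketch below is one way to do that, assuming the Infracost JSON report carries the total change as a string under `diffTotalMonthlyCost`; verify that against the report your version produces. The function names and the default $500 limit are hypothetical.

```python
import json
from decimal import Decimal


def monthly_cost_increase(report: dict) -> Decimal:
    """Return the projected monthly cost change from an Infracost JSON report."""
    # Assumption: the total change lives under "diffTotalMonthlyCost".
    # A missing or null value is treated as no change.
    raw = report.get("diffTotalMonthlyCost") or "0"
    return Decimal(str(raw))


def exceeds_limit(report: dict, limit_usd: Decimal) -> bool:
    """True when the projected increase is above the allowed limit."""
    return monthly_cost_increase(report) > limit_usd


def main(argv: list[str]) -> int:
    """Usage: cost_gate.py /tmp/infracost.json [limit_usd]. Returns an exit code."""
    with open(argv[1]) as handle:
        report = json.load(handle)
    limit = Decimal(argv[2]) if len(argv) > 2 else Decimal("500")
    increase = monthly_cost_increase(report)
    if increase > limit:
        print(f"Monthly cost increase ${increase} exceeds the ${limit} limit")
        return 1
    print(f"Monthly cost increase ${increase} is within the ${limit} limit")
    return 0
```

To run it as a workflow step, call `raise SystemExit(main(sys.argv))` from a small entry-point script. A hard gate works best with an override label, otherwise legitimate growth gets blocked along with mistakes.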
Cost Budgets as Code
# Terraform budget alerts
resource "aws_budgets_budget" "team_platform" {
  name         = "team-platform-monthly"
  budget_type  = "COST"
  limit_amount = "5000"
  limit_unit   = "USD"
  time_unit    = "MONTHLY"

  cost_filter {
    name   = "TagKeyValue"
    values = ["user:team$platform"]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 80
    threshold_type             = "PERCENTAGE"
    notification_type          = "ACTUAL"
    subscriber_email_addresses = ["platform-team@company.com"]
    subscriber_sns_topic_arns  = [aws_sns_topic.budget_alerts.arn]
  }

  notification {
    comparison_operator        = "GREATER_THAN"
    threshold                  = 100
    threshold_type             = "PERCENTAGE"
    notification_type          = "FORECASTED"
    subscriber_email_addresses = ["platform-team@company.com", "finance@company.com"]
  }
}
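AWS Budgets evaluates these thresholds for you, but it helps to see the logic spelled out: the first alert compares actual spend against 80% of the limit, the second compares forecasted spend against 100%. The sketch below models that comparison; `Notification` and `triggered` are hypothetical names for this illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Notification:
    threshold_pct: float
    notification_type: str  # "ACTUAL" or "FORECASTED"


def triggered(limit: float, actual: float, forecast: float,
              notifications: list[Notification]) -> list[Notification]:
    """Return the notifications whose percentage threshold has been exceeded."""
    fired = []
    for notification in notifications:
        # ACTUAL alerts look at money already spent, FORECASTED at the projection.
        spend = actual if notification.notification_type == "ACTUAL" else forecast
        if spend / limit * 100 > notification.threshold_pct:
            fired.append(notification)
    return fired


alerts = [Notification(80, "ACTUAL"), Notification(100, "FORECASTED")]
for alert in triggered(limit=5000, actual=4100, forecast=5200, notifications=alerts):
    print(f"{alert.notification_type} spend is over {alert.threshold_pct}% of budget")
```

The forecasted alert is the more useful of the two, because it fires while there is still time in the month to act.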
Quick Wins: The 80/20 of Cost Optimisation
1. Right-Size Instances
Most instances are over-provisioned. Check utilisation:
# Check average daily CPU for one instance over the past month
aws cloudwatch get-metric-statistics \
--namespace AWS/EC2 \
--metric-name CPUUtilization \
--dimensions Name=InstanceId,Value=i-1234567890abcdef0 \
--start-time 2026-01-01T00:00:00Z \
--end-time 2026-02-01T00:00:00Z \
--period 86400 \
--statistics Average \
--output table
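If average CPU stays under 20% across the month, the instance is probably over-provisioned. Check memory and network utilisation before downsizing, since CPU alone can mislead.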
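Applied across many instances, that rule is easier to script than to eyeball. The sketch below assumes datapoints shaped like the `Datapoints` list that `get-metric-statistics` returns, each with an `Average` field. The function name, the 20% threshold default, and the minimum of 14 daily datapoints are assumptions for illustration.

```python
from statistics import mean


def is_underutilised(datapoints: list[dict], threshold_pct: float = 20.0,
                     min_points: int = 14) -> bool:
    """Flag an instance whose average CPU sits below the threshold.

    Instances with too little history are never flagged, so a machine
    launched yesterday is not mistaken for an idle one.
    """
    averages = [point["Average"] for point in datapoints]
    if len(averages) < min_points:
        return False
    return mean(averages) < threshold_pct


# Thirty-one daily datapoints hovering around 8% CPU.
quiet_month = [{"Average": 8.0, "Unit": "Percent"} for _ in range(31)]
print(is_underutilised(quiet_month))
```

An average hides spikes, so a flagged instance is a candidate for review rather than an automatic downsize. Pulling the `Maximum` statistic alongside the average is a reasonable next step.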
Automated right-sizing with AWS Compute Optimizer:
# Enable Compute Optimizer
resource "aws_computeoptimizer_enrollment_status" "main" {
  status = "Active"
}

# Query recommendations via CLI
# aws compute-optimizer get-ec2-instance-recommendations
2. Delete Unused Resources
The most expensive resource is one you’re not using:
# Unattached EBS volumes
aws ec2 describe-volumes \
--filters Name=status,Values=available \
--query 'Volumes[*].[VolumeId,Size,CreateTime]' \
--output table
# Old snapshots (> 90 days)
aws ec2 describe-snapshots \
--owner-ids self \
--query 'Snapshots[?StartTime<=`2025-11-01`].[SnapshotId,VolumeSize,StartTime]' \
--output table
# Unused Elastic IPs
aws ec2 describe-addresses \
--query 'Addresses[?AssociationId==null].[PublicIp,AllocationId]' \
--output table
# Old AMIs
aws ec2 describe-images \
--owners self \
--query 'Images[?CreationDate<=`2025-01-01`].[ImageId,Name,CreationDate]' \
--output table
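To decide what to clean up first, put a rough price on what those commands find. The sketch below totals the monthly cost of unattached volumes, taking rows in the same `[VolumeId, Size, CreateTime]` shape as the first query. The $0.08 per GB-month price is an assumption (roughly gp3 in us-east-1); substitute your own volume types and region. The function name is hypothetical.

```python
def estimated_monthly_waste(volumes: list[list], price_per_gb_month: float = 0.08) -> float:
    """Estimate the monthly cost of unattached volumes.

    Each row is [volume_id, size_gb, create_time], matching the CLI query output.
    """
    total_gb = sum(size_gb for _volume_id, size_gb, _created in volumes)
    return total_gb * price_per_gb_month


orphans = [
    ["vol-0aaa", 100, "2025-03-14T09:12:00Z"],
    ["vol-0bbb", 500, "2024-11-02T17:40:00Z"],
]
print(f"Unattached volumes cost about ${estimated_monthly_waste(orphans):,.2f}/month")
```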
3. Reserved Instances / Savings Plans
If you have stable baseline usage, commit to it:
On-demand m5.xlarge: $0.192/hour = $140/month
1-year reserved (no upfront): $0.122/hour = $89/month
Savings: 36%
3-year reserved (all upfront): $0.076/hour = $55/month
Savings: 60%
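The arithmetic behind those figures is simple enough to rerun with your own rates. The sketch below uses the hourly prices quoted above and 730 hours per month (8,760 hours a year divided by 12); the function names are hypothetical.

```python
HOURS_PER_MONTH = 730  # 8,760 hours per year / 12


def monthly_cost(hourly_rate: float) -> float:
    """Cost of running one instance for a full month at the given hourly rate."""
    return hourly_rate * HOURS_PER_MONTH


def savings_pct(on_demand_hourly: float, committed_hourly: float) -> float:
    """Percentage saved by the committed rate relative to on-demand."""
    return (1 - committed_hourly / on_demand_hourly) * 100


ON_DEMAND = 0.192
for label, rate in [("1-year, no upfront", 0.122), ("3-year, all upfront", 0.076)]:
    print(f"{label}: ${monthly_cost(rate):.0f}/month, "
          f"{savings_pct(ON_DEMAND, rate):.0f}% cheaper than on-demand")
```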
When to reserve:
- Baseline load that’s always running
- Databases (usually 24/7)
- Core infrastructure that runs on instances (bastion, monitoring, NAT instances); managed NAT gateways can't be reserved
When NOT to reserve:
- Auto-scaled workloads (use Savings Plans instead)
- Workloads you might eliminate
- Anything you’re not sure about
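"Not sure" can be turned into a number. A reservation bills for every hour whether or not the instance runs, so it only beats on-demand when the instance is up for more than the ratio of the two rates. The sketch below works that out for the m5.xlarge prices quoted earlier; the function names are hypothetical.

```python
def break_even_utilisation(on_demand_hourly: float, committed_hourly: float) -> float:
    """Fraction of hours an instance must run for the commitment to pay off."""
    return committed_hourly / on_demand_hourly


def cheaper_to_commit(expected_utilisation: float, on_demand_hourly: float,
                      committed_hourly: float) -> bool:
    """True when expected running time is above the break-even point."""
    return expected_utilisation > break_even_utilisation(on_demand_hourly, committed_hourly)


one_year = break_even_utilisation(0.192, 0.122)
three_year = break_even_utilisation(0.192, 0.076)
print(f"1-year pays off above {one_year:.0%} utilisation")
print(f"3-year pays off above {three_year:.0%} utilisation")
```

At these rates a workload that runs only during business hours (about a quarter of the month) stays cheaper on-demand under either term.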
4. Spot Instances for Fault-Tolerant Workloads
# EKS node group with spot instances
resource "aws_eks_node_group" "spot" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "spot-workers"
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = var.private_subnet_ids
  capacity_type   = "SPOT"
  instance_types  = ["m5.large", "m5a.large", "m5n.large", "m4.large"]

  scaling_config {
    desired_size = 3
    max_size     = 10
    min_size     = 1
  }
}
Spot can save 60-90% on compute, but instances can be reclaimed with only a two-minute warning. Use for:
- CI/CD runners
- Batch processing
- Stateless web servers (with proper load balancing)
- Dev/test environments
5. Data Transfer Costs
Data transfer is the hidden killer:
Same AZ: Free (over private IPs)
Cross-AZ: $0.01/GB in each direction ($0.02 per GB moved, since both sides are billed)
To internet: $0.09/GB (first 10TB)
Cross-region: around $0.02/GB (varies by region pair)
Optimisations:
- Keep chatty services in the same AZ
- Use VPC endpoints so traffic to AWS services bypasses NAT (gateway endpoints for S3 and DynamoDB are free)
- Compress data before transfer
- Cache aggressively (CloudFront, ElastiCache)
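Per-GB rates look tiny until they are multiplied by real traffic. The sketch below estimates a monthly transfer bill from the rates listed above, counting cross-AZ traffic at $0.02 per GB moved because both the sending and receiving side are charged $0.01. The rates are illustrative and vary by region; the names are hypothetical.

```python
# Illustrative USD per GB, taken from the rates listed above.
RATES_PER_GB = {
    "same_az": 0.0,
    "cross_az": 0.02,      # $0.01 charged on each side of the transfer
    "internet": 0.09,
    "cross_region": 0.02,
}


def monthly_transfer_cost(gb_by_path: dict[str, float]) -> float:
    """Estimate the transfer bill from GB moved per path in a month."""
    unknown = set(gb_by_path) - set(RATES_PER_GB)
    if unknown:
        raise ValueError(f"unknown transfer path(s): {sorted(unknown)}")
    return sum(RATES_PER_GB[path] * gb for path, gb in gb_by_path.items())


# A chatty service pair split across AZs, plus modest internet egress.
traffic = {"cross_az": 10_000, "internet": 2_000}
print(f"Estimated transfer cost: ${monthly_transfer_cost(traffic):,.2f}/month")
```

In this example the internal cross-AZ chatter costs more than the traffic served to users, which is the pattern that makes transfer a "hidden" cost.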
Team Cost Reviews
Make cost a regular topic, not a crisis response.
Monthly Cost Review Format
# Platform Team - January 2026 Cost Review
## Summary
- Total spend: $45,231 (+12% from December)
- Budget: $50,000 (90% utilized)
- Forecast: $48,500
## Top 5 Cost Drivers
1. EKS cluster compute: $18,000 (40%)
2. RDS databases: $12,000 (27%)
3. Data transfer: $6,000 (13%)
4. S3 storage: $4,000 (9%)
5. CloudWatch: $2,500 (6%)
## What Changed
- New ML pipeline added $3,000/month
- Scaled API servers for holiday traffic (+$2,000)
- Fixed NAT gateway redundancy (-$1,500)
## Action Items
- [ ] Right-size RDS dev instances (est. savings: $800/month)
- [ ] Enable S3 Intelligent-Tiering (est. savings: $400/month)
- [ ] Investigate CloudWatch cost spike
## Next Month Forecast
- Expecting $48,000 (holiday traffic normalizing)
- New feature launch may add $2,000
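The percentages in a review like this are worth generating rather than typing, so they stay consistent with the totals. The sketch below reproduces the shares and budget utilisation from the example figures above; the function names are hypothetical.

```python
def cost_shares(drivers: dict[str, float], total: float) -> dict[str, int]:
    """Each driver's share of total spend, rounded to a whole percent."""
    return {name: round(cost / total * 100) for name, cost in drivers.items()}


def budget_utilisation(total: float, budget: float) -> float:
    """Spend as a percentage of the budget."""
    return total / budget * 100


TOTAL_SPEND = 45_231
drivers = {
    "EKS cluster compute": 18_000,
    "RDS databases": 12_000,
    "Data transfer": 6_000,
    "S3 storage": 4_000,
    "CloudWatch": 2_500,
}

for name, share in cost_shares(drivers, TOTAL_SPEND).items():
    print(f"{name}: ${drivers[name]:,} ({share}%)")
print(f"Budget used: {budget_utilisation(TOTAL_SPEND, 50_000):.0f}%")
```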
Cost Anomaly Detection
Set up automated alerts for unexpected changes:
resource "aws_ce_anomaly_monitor" "service" {
  name              = "service-anomaly-monitor"
  monitor_type      = "DIMENSIONAL"
  monitor_dimension = "SERVICE"
}

resource "aws_ce_anomaly_subscription" "alerts" {
  name      = "cost-anomaly-alerts"
  frequency = "DAILY"

  monitor_arn_list = [
    aws_ce_anomaly_monitor.service.arn
  ]

  # DAILY and WEEKLY summaries are delivered by email only.
  # For SNS delivery, add a second subscription with frequency = "IMMEDIATE".
  subscriber {
    type    = "EMAIL"
    address = "platform-team@company.com"
  }

  threshold_expression {
    dimension {
      key           = "ANOMALY_TOTAL_IMPACT_ABSOLUTE"
      values        = ["100"] # Alert if anomaly >= $100
      match_options = ["GREATER_THAN_OR_EQUAL"]
    }
  }
}
Culture: Making Cost Part of Engineering
Code Review Checklist
Add cost considerations to your PR template:
## Cost Impact
- [ ] No new AWS resources
- [ ] New resources are right-sized
- [ ] Resources have required tags
- [ ] Considered spot/preemptible instances
- [ ] No hardcoded instance types (use variables)
- [ ] Infracost estimate reviewed
Engineering Scorecards
Include cost metrics alongside reliability and velocity:
| Metric | Target | Actual |
|---|---|---|
| Deployment frequency | 10/week | 12/week ✅ |
| Change failure rate | <5% | 3% ✅ |
| Mean time to recovery | <1hr | 45min ✅ |
| Cost efficiency | <$5/1K requests | $4.20/1K ✅ |
| Resource utilization | >50% CPU avg | 62% ✅ |
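The cost efficiency row is a unit cost: spend divided by the work done. The sketch below computes cost per 1,000 requests under the assumption that you can pull a monthly request count from your load balancer or API gateway metrics. The figures and the function name are hypothetical.

```python
def cost_per_thousand_requests(monthly_cost_usd: float, monthly_requests: int) -> float:
    """Unit cost of serving traffic, in USD per 1,000 requests."""
    if monthly_requests <= 0:
        raise ValueError("monthly_requests must be positive")
    return monthly_cost_usd / (monthly_requests / 1000)


# Hypothetical month: $42,000 of spend serving 10 million requests.
unit_cost = cost_per_thousand_requests(42_000, 10_000_000)
print(f"${unit_cost:.2f} per 1K requests")
```

Unit cost is a fairer scorecard metric than total spend, because a team whose bill grows with traffic is not penalised for success.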
Gamification (Use Carefully)
Some teams create friendly competition:
- Monthly “Cost Cutter” award for biggest optimisation
- Leaderboard of cost per team (normalised by traffic/value)
- Share war stories of wasteful resources found
But don’t over-index on cost at the expense of velocity or reliability.
Tools to Consider
| Tool | Purpose | Cost |
|---|---|---|
| Infracost | Cost estimates in PRs | Free tier available |
| AWS Cost Explorer | Native AWS cost analysis | Free |
| Kubecost | Kubernetes cost allocation | Free tier available |
| CloudHealth | Multi-cloud FinOps platform | Enterprise |
| Spot.io | Automated spot instance management | Percentage of savings |
| AWS Compute Optimizer | Right-sizing recommendations | Free |
Conclusion
FinOps isn’t about spending less - it’s about spending intentionally. Engineers should know:
- What their services cost
- Why they cost that much
- Whether that cost is reasonable for the value delivered
The goal isn’t the cheapest infrastructure. It’s infrastructure where every dollar is a conscious choice, not an accident.
Start with tagging and visibility. Everything else follows.