Seven years. Ticketing platforms processing millions of transactions. Open-source protocol infrastructure. IoT security systems. Kubernetes clusters I’ve lost count of. Lambdas that quietly run the world. Terraform state files that haunt my dreams.
This is the post I wish someone had written for me when I started. Every decision below is something I’ve either shipped to production and would do again, or shipped to production and now mass-delete from my muscle memory.
No “it depends” hand-waving without context. No vendor-neutral cowardice. Actual opinions from actual incidents.
AWS
Picking AWS over GCP
🟩 Endorse
GCP has better Kubernetes. GKE’s control plane is superior. The Kubernetes tooling is years ahead.
AWS has better everything else.
Account management. Support that answers the phone. An ecosystem that doesn’t deprecate services you depend on. Backwards compatibility as a religion. A TAM who knows your name.
I’ve run production on both. When GCP support told me to “check the documentation” during an outage, I knew where our future spend was going.
The Kubernetes gap has closed anyway. Karpenter, external-dns, external-secrets, AWS Load Balancer Controller – EKS is now genuinely competitive.
EKS
🟩 Endorse
Running your own control plane is mass-produced serotonin for infrastructure engineers who like etcd quorum problems.
Use EKS. The cost delta versus self-managed is a rounding error compared to engineer time.
Caveat: EKS upgrades are aggressive and non-optional. You will upgrade clusters more often than you want. Automate it or die.
EKS Managed Add-ons
🟧 Regret
Good idea. Poor execution.
The moment you need to customise resource requests, pin an image tag, or modify a ConfigMap, you’re fighting the add-on system. And you will need to customise.
Helm charts managed via Flux/ArgoCD. Full control. Fits existing GitOps. No surprises during cluster upgrades.
RDS
🟩 Endorse
Network outage: downtime, post-mortem, move on.
Data loss: company-ending event, career-ending event, therapy.
The managed database markup is insurance. Automated backups, point-in-time recovery, read replicas, Multi-AZ – this is not where you penny-pinch.
Day one: enable deletion protection, set snapshot retention, test restores. Future you will send a thank-you note.
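The day-one settings in Terraform terms (a sketch – instance sizing and names are illustrative):

```hcl
resource "aws_db_instance" "app" {
  identifier     = "app-prod" # hypothetical
  engine         = "postgres"
  engine_version = "16.4"
  instance_class = "db.r6g.large"

  allocated_storage = 100
  storage_encrypted = true
  multi_az          = true

  # The settings future-you sends the thank-you note for.
  deletion_protection       = true
  backup_retention_period   = 14 # days of point-in-time recovery
  skip_final_snapshot       = false
  final_snapshot_identifier = "app-prod-final"

  username                    = "app"
  manage_master_user_password = true # let RDS + Secrets Manager own the password
}
```

Terraform can't test your restores, though. Put a restore drill on the calendar.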
ElastiCache (Redis)
🟩 Endorse
The Swiss Army knife of “do fast data thing”. Caching, sessions, rate limiting, pub/sub, leaderboards, distributed locks – it handles all of them well enough that you don’t need five separate tools.
Redis versus Valkey licensing drama: AWS will continue supporting something Redis-compatible. Not your problem.
ECR
🟩 Endorse
Ran quay.io. Stability was a disaster. Migrated to ECR. Boring ever since.
Deep IAM integration means EKS nodes pull images without credential rotation. Enable image scanning – it’s free and catches the obvious CVEs.
The registry equivalent of “it just works”.
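The whole setup is a few lines of Terraform (sketch; repo name and retention window are illustrative):

```hcl
resource "aws_ecr_repository" "app" {
  name                 = "app"
  image_tag_mutability = "IMMUTABLE" # no silent retags of :v1.2.3

  image_scanning_configuration {
    scan_on_push = true # basic scanning, catches the obvious CVEs
  }
}

# Keep the bill down: expire untagged images.
resource "aws_ecr_lifecycle_policy" "app" {
  repository = aws_ecr_repository.app.name

  policy = jsonencode({
    rules = [{
      rulePriority = 1
      description  = "expire untagged after 14 days"
      selection = {
        tagStatus   = "untagged"
        countType   = "sinceImagePushed"
        countUnit   = "days"
        countNumber = 14
      }
      action = { type = "expire" }
    }]
  })
}
```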
VPC Endpoints (PrivateLink)
🟩 Endorse
Traffic to AWS services (S3, ECR, Secrets Manager, STS) going over the internet is unnecessary latency, cost, and attack surface. Interface endpoints keep it private.
The gotcha: endpoint policies. By default they’re wide open. Lock them down to specific buckets/resources – otherwise you’ve just created a data exfiltration path.
Gateway endpoints (S3, DynamoDB) are free. Interface endpoints cost money but less than NAT Gateway data processing fees for high-volume services.
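A locked-down gateway endpoint looks something like this (sketch; bucket names are illustrative):

```hcl
resource "aws_vpc_endpoint" "s3" {
  vpc_id            = aws_vpc.main.id
  service_name      = "com.amazonaws.eu-west-1.s3"
  vpc_endpoint_type = "Gateway"
  route_table_ids   = [aws_route_table.private.id]

  # Default endpoint policy is allow-all. Scope it to the buckets
  # this VPC actually needs, or you've built an exfiltration path.
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Principal = "*"
      Action    = ["s3:GetObject", "s3:PutObject", "s3:ListBucket"]
      Resource = [
        "arn:aws:s3:::app-artifacts",
        "arn:aws:s3:::app-artifacts/*"
      ]
    }]
  })
}
```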
Private API Gateway + VPC Link
🟧 Context-dependent
Private API Gateway lets you expose internal services without public endpoints. VPC Link connects it to your NLB/ALB.
The good: WAF integration, throttling, API keys, usage plans – all managed.
The bad: cold starts on private APIs are brutal (seconds, not milliseconds). Fine for internal tooling, painful for latency-sensitive workloads. Also, debugging DNS resolution issues between API Gateway and your VPC will test your patience.
For internal APIs, consider ALB + Lambda authorizers instead – simpler, faster cold starts.
ECS (Fargate and EC2)
🟩 Endorse for specific use cases
Hot take: not everything needs Kubernetes.
ECS Fargate is perfect for: batch jobs, scheduled tasks, simple services that don’t need the K8s ecosystem. No nodes to manage, no cluster upgrades, predictable pricing.
ECS on EC2: useful when you need GPU instances, specific instance types, or want to avoid Fargate’s vCPU pricing at scale.
Where ECS falls down: complex networking policies, service mesh, anything requiring the CNCF ecosystem. If you’re already running EKS, adding ECS creates operational sprawl.
My pattern: EKS for the platform, Fargate for one-off jobs that don’t justify a Helm chart.
Lambda
🟩 Endorse more than I initially did
I was slow to adopt Lambda. “We have Kubernetes, why do we need another compute platform?”
Turns out: event-driven workloads (S3 triggers, SQS consumers, API Gateway backends) are dramatically simpler on Lambda. No scaling config, no pod disruption budgets, no node selectors.
The real win is cost attribution. In Kubernetes, costs hide behind shared nodes. Lambda bills per invocation – you know exactly what each function costs.
Gotchas:
- Cold starts matter for synchronous APIs (use provisioned concurrency or accept the latency)
- 15-minute timeout kills long-running jobs (use Step Functions or ECS)
- VPC-attached Lambdas have their own cold start penalty (ENI attachment)
Pattern that works: Lambda for glue code, event processing, and APIs under 10s response time. ECS/EKS for everything else.
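The SQS-consumer wiring is one resource (sketch; queue and function names are illustrative – the function's role still needs SQS receive/delete permissions):

```hcl
resource "aws_lambda_event_source_mapping" "jobs" {
  event_source_arn = aws_sqs_queue.jobs.arn
  function_name    = aws_lambda_function.worker.arn
  batch_size       = 10

  scaling_config {
    maximum_concurrency = 50 # cap fan-out to protect downstream dependencies
  }
}
```

No scaling config beyond that – Lambda polls the queue and scales the consumers for you.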
Step Functions
🟩 Endorse
State machines as infrastructure. Orchestrate Lambda, ECS tasks, Glue jobs, human approval steps – all with built-in retry, error handling, and observability.
Express workflows for high-volume, short-duration (synchronous). Standard workflows for long-running, complex orchestration.
The visual debugger alone is worth it – seeing exactly where your workflow failed beats parsing CloudWatch logs.
EventBridge
🟩 Endorse
Decoupled event routing without managing Kafka/SQS fan-out yourself. Schema registry, archive/replay, cross-account event buses.
Pattern: services emit events to EventBridge, rules route to targets (Lambda, SQS, Step Functions). Loose coupling, easy to add new consumers without modifying producers.
One trap: event pattern matching syntax is fiddly. Test patterns thoroughly – silent failures when events don’t match are painful to debug.
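A rule for a hypothetical `order.created` event, showing the syntax quirks that bite (sketch; bus and function names are illustrative):

```hcl
resource "aws_cloudwatch_event_rule" "order_created" {
  name           = "order-created"
  event_bus_name = aws_cloudwatch_event_bus.app.name

  # Quirks: the field is "detail-type" (hyphenated, not detailType),
  # and every matched value must be wrapped in an array.
  event_pattern = jsonencode({
    source        = ["app.orders"]
    "detail-type" = ["order.created"]
    detail = {
      amount = [{ numeric = [">", 0] }] # content filtering operator
    }
  })
}

resource "aws_cloudwatch_event_target" "to_lambda" {
  rule           = aws_cloudwatch_event_rule.order_created.name
  event_bus_name = aws_cloudwatch_event_bus.app.name
  arn            = aws_lambda_function.handler.arn
}
```

`aws events test-event-pattern` lets you check a pattern against a sample event before you deploy – use it.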
NAT Gateway vs NAT Instances
🟧 Context-dependent
NAT Gateway: managed, highly available, expensive at scale. NAT Instances: cheap, requires maintenance, single point of failure unless you build HA yourself.
At Trainline, NAT Gateway costs were eye-watering. We built NAT instances with auto-scaling groups and saved significant money. But it’s technical debt you’re taking on. For most companies, NAT Gateway is correct until your bill says otherwise.
AWS Premium Support
🟧 Regret
It costs as much as another engineer. Unless your team genuinely lacks AWS expertise, the ROI isn’t there.
Enterprise support is worth it if you’re spending £500k+/year on AWS and need the TAM relationship for commercial negotiations.
Control Tower Account Factory for Terraform (AFT)
🟩 Endorse
Pre-AFT, AWS Control Tower was a UI-driven nightmare. Spinning up new accounts meant clicking through the console, manually configuring baselines, and praying someone remembered to tag things correctly. Zero automation.
AFT changed everything. Account provisioning became code. New environment? Terraform apply. Done.
The real win isn’t just speed – it’s standardization. We enforce tagging at account creation. Production accounts get tagged with environment=prod, which we then use for routing decisions (VPC peering, network policies, cost allocation).
Tags beat AWS Organizations for this. Organizations force you into a tree structure – but account properties aren’t always hierarchical. An account can be “production”, “fintech-regulated”, and “us-east-1” all at once. Tags handle that. Organization hierarchy doesn’t.
Gotcha: AFT has a learning curve. The account request workflow via Git, the Terraform customizations, the pipeline structure – it’s not plug-and-play. But once it’s wired up, account provisioning goes from hours to minutes.
If you’re running multi-account AWS and not using AFT, you’re doing Control Tower the hard way.
Kubernetes Ecosystem
Karpenter
🟩 Endorse
If you’re on EKS without Karpenter, you’re lighting money on fire.
Cluster Autoscaler: slow, dumb, fights with node groups. Karpenter: fast, smart, provisions exactly what your pods need.
We’ve seen 30-40% cost reduction on compute after migration. Spot instance handling actually works. Consolidation actually consolidates. Bin-packing that isn’t a joke.
The learning curve – NodePools, EC2NodeClasses, weight-based selection – is real. Worth it. This is non-negotiable for EKS in 2025.
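A general-purpose NodePool in the v1 API shape (illustrative – field names have moved between Karpenter versions, so check yours):

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default          # assumed EC2NodeClass with AMI/subnet config
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]   # spot-first, on-demand fallback
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m       # actually consolidates, unlike Cluster Autoscaler
  limits:
    cpu: "200"                 # hard ceiling on what this pool can provision
```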
KEDA (Kubernetes Event-Driven Autoscaling)
🟩 Endorse for event-driven workloads
HPA scales on CPU/memory. KEDA scales on anything: SQS queue depth, Kafka lag, Prometheus metrics, cron schedules.
If you’re running workers that process queues, KEDA is the answer. Scale to zero when idle, scale up based on actual backlog. We’ve cut costs significantly on batch processing workloads that used to run 24/7 “just in case”.
Where it shines:
- SQS consumers (scale on ApproximateNumberOfMessages)
- Kafka consumers (scale on consumer lag)
- Scheduled jobs (cron-based scaling, better than CronJobs for some use cases)
- Prometheus-based scaling (custom metrics your app exposes)
Where it doesn’t:
- Request-based scaling (stick with HPA + Ingress metrics)
- Workloads that can’t handle cold starts
Pattern: KEDA for async workers, HPA for sync APIs. They coexist fine – use ScaledObject for event-driven, HPA for everything else.
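A ScaledObject for the SQS case (sketch; queue URL and deployment name are hypothetical, and it assumes the KEDA operator has IAM access via IRSA):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: jobs-worker
spec:
  scaleTargetRef:
    name: jobs-worker        # the Deployment to scale
  minReplicaCount: 0         # scale to zero when the queue is empty
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.eu-west-1.amazonaws.com/111122223333/jobs
        queueLength: "20"    # target backlog per replica
        awsRegion: eu-west-1
        identityOwner: operator   # use the KEDA operator's IAM role
```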
Flux vs ArgoCD for GitOps
🟩 Endorse (either one, just pick)
Both work. Both are CNCF graduated. The debate is overblown.
Flux:
- Kubernetes-native, feels like CRDs all the way down
- Lighter footprint, less resource overhead
- Better for multi-tenancy with Flux’s tenant model
- Weak observability out of the box – you’ll build tooling to answer “where’s my commit?”
- No UI by default (Weave GitOps existed, but it’s effectively abandoned since Weaveworks shut down)
ArgoCD:
- Beautiful UI, developers love clicking around
- App-of-apps pattern is intuitive
- Better for teams who want visual deployment status
- Heavier footprint, more moving parts
- ApplicationSets for dynamic generation
When to use Flux: Platform teams, multi-cluster, GitOps purists, resource-constrained clusters.
When to use ArgoCD: Developer-facing platforms, teams who want dashboards, orgs where “I need to see it” matters.
We went Flux. It’s worked for years. The core reconciliation model is solid. But I’ve seen ArgoCD deployments work equally well. The real mistake is running both, or spending six months evaluating instead of shipping.
External Secrets Operator
🟩 Endorse
AWS Secrets Manager → Kubernetes Secrets. Developers understand it. Terraform manages secrets upstream. AWS rotation continues working.
Replaced SealedSecrets, which required infrastructure knowledge to update and broke every AWS-native rotation integration. ESO is the correct answer.
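The developer-facing half is one small manifest (sketch; store and secret names are illustrative):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h          # re-sync picks up AWS-side rotation
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-secrets-manager  # assumed store wired to IRSA
  target:
    name: db-credentials       # the Kubernetes Secret that gets created
  data:
    - secretKey: password
      remoteRef:
        key: prod/app/db       # hypothetical Secrets Manager secret name
        property: password
```

Terraform writes the secret upstream; ESO materialises it in the cluster. Nobody touches kubeseal.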
External DNS
🟩 Endorse
Ingress annotation → Route53 record. Four years. Zero problems. Nothing to say. It just works.
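The entire developer experience (sketch; hostname and ingress class are illustrative):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app
  annotations:
    external-dns.alpha.kubernetes.io/hostname: app.example.com
    external-dns.alpha.kubernetes.io/ttl: "60"   # optional record TTL
spec:
  ingressClassName: nginx      # assumed
  rules:
    - host: app.example.com    # external-dns creates the Route53 record
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 80
```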
Cert-Manager
🟩 Endorse
Let’s Encrypt certificates, automated, in Kubernetes. Set up once, forget forever.
The only pain: enterprise customers who don’t trust Let’s Encrypt. Budget for a few DigiCert certs annually for the dinosaurs.
Helm v3
🟩 Endorse
Helm v2 was a security nightmare (Tiller). Helm v3 is tolerable.
Go templating is painful to debug. The ecosystem is enormous. It solves “versioned Kubernetes manifests” adequately. We’ve all accepted this is what we have.
Store charts in OCI registries (ECR works). Avoid the S3 + plugin mess.
Service Mesh (Istio/Linkerd)
🟩 No Regrets (for not using)
Service meshes solve real problems. mTLS. Observability. Traffic management.
Service meshes also add operational complexity that most teams can’t afford.
For most companies: you don’t need one. mTLS? Network policies + application-level encryption. Traffic splitting? Ingress controllers. Observability? You’re already running Prometheus.
If you genuinely need mesh features, Linkerd is simpler than Istio. But ask yourself three times if you actually need it.
Cilium
🟩 Endorse
eBPF-based networking. Replaces kube-proxy. Network policies that work. Hubble for observability. No sidecar overhead.
The migration from VPC CNI requires planning. The benefits – performance, observability, no iptables spaghetti – are worth it.
This is where Kubernetes networking is going. Get ahead of it.
Infrastructure as Code
Terraform over CloudFormation
🟩 Endorse
This shouldn’t even be a debate anymore. HCL is readable. The provider ecosystem is enormous. State management is a solved problem. The hiring pool knows Terraform.
CloudFormation has its place – Service Catalog, certain AWS-native integrations – but as your primary IaC? No.
Terragrunt
🟩 Endorse
Terraform’s missing features: DRY remote state configuration, dependency management between modules, environment promotion without copy-paste.
Terragrunt fills the gaps. root.hcl inheritance keeps your environments consistent (recent versions deprecate a root-level terragrunt.hcl – name the file root.hcl). dependency blocks wire outputs between stacks without scattering data sources everywhere. run-all applies changes across your entire estate.
The learning curve is real – you’re now debugging two tools – but the alternative is bespoke wrapper scripts that do the same thing worse.
Pattern: root.hcl defines remote state and provider config, environment folders have terragrunt.hcl files that inherit and override. One terragrunt run-all plan shows drift across everything.
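In practice the layout looks something like this (sketch; bucket, module source, and paths are hypothetical):

```hcl
# root.hcl (repo root) – shared backend wiring, inherited by every stack
remote_state {
  backend = "s3"
  generate = {
    path      = "backend.tf"
    if_exists = "overwrite"
  }
  config = {
    bucket         = "org-terraform-state"
    key            = "${path_relative_to_include()}/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}

# envs/prod/eks/terragrunt.hcl – one stack, inheriting the root
include "root" {
  path = find_in_parent_folders("root.hcl")
}

terraform {
  source = "git::git@github.com:org/modules.git//eks?ref=v1.4.0"
}

dependency "vpc" {
  config_path = "../vpc"   # wires outputs without data sources
}

inputs = {
  vpc_id = dependency.vpc.outputs.vpc_id
}
```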
Spacelift
🟩 Endorse for teams with budget
If you’re past 5 engineers touching Terraform, Spacelift pays for itself.
Drift detection that actually works. Policy-as-code (OPA) for guardrails. Stack dependencies. Self-service for developers who shouldn’t have AWS console access but need to provision resources.
The killer feature: contexts and mounted files. Inject secrets, provider configs, and shared modules without templating hell.
Downside: it’s not cheap. And you’re adding a dependency on a vendor for your infrastructure provisioning – evaluate that risk.
Atlantis
🟩 Endorse for teams without budget
PR-based Terraform workflow, self-hosted, free.
atlantis plan on PR open, atlantis apply on merge. Locks prevent concurrent modifications. Works.
We ran Atlantis in Kubernetes for years. Minimal operational overhead once configured. The main gap versus Spacelift: no drift detection, no policy engine (you’ll bolt on Conftest or similar), no fancy UI.
For most teams under 10 engineers, Atlantis is correct. Graduate to Spacelift when you outgrow it.
env0 / Terraform Cloud
🟧 Depends on your constraints
Terraform Cloud: HashiCorp’s offering. Works, reasonably priced, tight integration with the Terraform ecosystem. The free tier is generous for small teams.
env0: similar space, more flexibility on policy and workflows, better GitOps model.
My take: if you’re already paying HashiCorp for Terraform Enterprise, Cloud makes sense. Otherwise, Spacelift (if you have budget) or Atlantis (if you don’t).
Not Using CDK/Pulumi
🟩 No Regrets
“I can use real programming constructs!” – and now your IaC has inheritance hierarchies, unit tests that don’t test anything meaningful, and abstractions that make terraform plan output incomprehensible.
Terraform’s constraints are a feature. It’s harder to be clever. Clever kills you at 3am.
If you need abstraction, write Terraform modules. If you need code generation, write a script that outputs Terraform. Don’t make your IaC a software project.
Exception: genuinely complex conditional logic (CDK/Pulumi handle this better). But ask yourself if the complexity is necessary before reaching for more powerful tools.
Terraform Module Strategy
🟩 Endorse with opinions
Monorepo for internal modules. Versioned releases via git tags. Terragrunt or a module registry for consumption.
Don’t: put modules in the same repo as the Terraform that consumes them (circular dependency hell). Don’t: version with branch names (“just use main” guarantees broken applies). Don’t: build modules that do too much (a module that creates VPC + EKS + RDS + everything is unmaintainable).
Do: small, composable modules. One module, one responsibility. Test with Terratest or tftest if you’re feeling fancy, but at minimum terraform validate in CI.
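The minimum-viable CI gate is tiny (a GitHub Actions sketch; adapt to your CI of choice):

```yaml
# .github/workflows/validate.yml
name: terraform-validate
on: pull_request

jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init -backend=false   # no state access needed to validate
      - run: terraform fmt -check -recursive
      - run: terraform validate
```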
State File Hygiene
🟩 Endorse being paranoid
State files are your infrastructure’s source of truth. Treat them accordingly.
S3 bucket: versioned, encrypted (SSE-S3 minimum, KMS if compliance requires), bucket policy denying public access, lifecycle rules to clean up old versions.
DynamoDB: locking table, on-demand capacity (you’re not doing enough applies for provisioned to matter).
Separate account: your CI/CD and state live in a management account, not the accounts containing the infrastructure. When you accidentally terraform destroy the wrong workspace, you don’t lose the state bucket too.
Never: commit state to git. Run Terraform from laptops in production. Share state between unrelated projects.
OpenTofu
🟧 Watching closely
HashiCorp’s license change made OpenTofu happen. It’s production-ready, actively maintained, and has feature parity with Terraform 1.5.
I haven’t migrated production workloads yet – inertia is real – but for greenfield projects, it’s a legitimate choice. Spacelift and Terragrunt both support it.
If you’re worried about HashiCorp’s direction, start evaluating. The migration path is straightforward.
Crossplane
🟥 Regret (for most teams)
The pitch is compelling: manage AWS/GCP/Azure resources using Kubernetes CRDs. GitOps for infrastructure. Developers self-serve without learning Terraform.
The reality: you’re using infrastructure to manage infrastructure. Kubernetes managing the very AWS resources Kubernetes runs on. The recursion gives me a headache just writing it.
The problems:
- Provider maturity varies wildly (AWS provider is decent, others less so)
- Debugging is painful – is it a Crossplane issue, provider issue, or AWS issue?
- You need Terraform anyway for the Kubernetes cluster itself
- Composition complexity rivals Terraform modules but with worse tooling
- Your platform team now maintains Crossplane AND probably Terraform
Where it might work:
- Organisations already deep in Kubernetes who want unified control plane
- Platform teams building developer self-service portals
- Multi-cloud environments where one abstraction helps
For most teams: Terraform/Terragrunt/Spacelift handles infrastructure better. If developers need self-service, build an internal portal that calls Terraform, don’t add another layer of abstraction.
I’ve seen more Crossplane migrations fail than succeed. The teams that make it work have dedicated platform engineers and accept they’re running a complex system.
Backstage
🟩 Endorse (with realistic expectations)
Spotify’s developer portal. Service catalog, documentation, templates for scaffolding new services. The promise: one place for developers to find everything.
What it does well:
- Software catalog (who owns what, where’s the repo, what’s the status)
- TechDocs (docs-as-code, lives with the service)
- Scaffolder templates (spin up new services with standards baked in)
- Plugin ecosystem (Kubernetes, CI/CD, cost, whatever you need)
The honest take:
- It’s a framework, not a product. Budget 2-3 months for initial setup and customization
- Plugin quality varies wildly (some are polished, some are abandonware)
- Keeping the catalog accurate requires discipline (or automation)
- React/TypeScript skills needed to build custom plugins
When it’s worth it:
- 50+ services and developers can’t find anything
- Onboarding takes weeks because tribal knowledge
- You want to standardize service creation
When it’s not:
- Small teams where everyone knows everything
- No one to maintain it post-launch
- Expecting magic without investment
We’ve seen it transform developer experience at scale. We’ve also seen it become shelfware. The difference is commitment to maintaining it as a product, not a one-time project.
Observability
Datadog
🟥 Regret
Great product. Pricing model designed to bankrupt you.
Kubernetes makes it worse: per-host pricing when you’re spinning spot instances up and down constantly. 10 instances running, 20 launched and terminated that hour? You pay for 20.
GPU nodes make it catastrophic: one service per node, full per-host cost. Your ML workloads will subsidise Datadog’s Series C.
We’re migrating to Prometheus + Grafana + Loki. More operational overhead. Dramatically lower cost. No vendor holding your metrics hostage.
Not Using OpenTelemetry Early
🟥 Regret
Instrumented applications directly with Datadog’s SDK. Now we’re locked in. Migration requires touching every service.
OpenTelemetry wasn’t mature four years ago. It is now. Start with it. Tracing is production-ready. Metrics are catching up.
Vendor-agnostic instrumentation isn’t just about cost – it’s about not being held hostage when your observability vendor’s pricing becomes untenable.
Prometheus / Grafana / Loki Stack
🟩 Endorse
Self-hosted observability that scales.
Prometheus for metrics. Loki for logs. Grafana for dashboards. Mimir for long-term metric storage. Tempo for traces.
Yes, you’re running databases. Yes, there’s operational overhead. The cost savings at scale are substantial, and you own your data.
Pattern: Prometheus Operator for Kubernetes-native deployment, ServiceMonitors for autodiscovery, Thanos or Mimir for multi-cluster aggregation.
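Autodiscovery means a service opts into scraping with one manifest (sketch; labels must match your Prometheus instance's selector – the `release` label here is an assumption):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: app
  labels:
    release: kube-prometheus-stack   # assumed serviceMonitorSelector label
spec:
  selector:
    matchLabels:
      app: app                       # matches the Service's labels
  endpoints:
    - port: metrics                  # named port on the Service
      interval: 30s
```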
PagerDuty
🟩 Endorse
It pages you. The pricing is reasonable. The integrations work. Nothing else to say.
Don’t overthink alerting platforms. PagerDuty is fine.
Process & Culture
GitOps Everything
🟩 Endorse
Services. Terraform. Kubernetes manifests. Application config. All in Git. All deployed via reconciliation.
“But I can’t see the pipeline!” – correct. Build deployment status dashboards. Invest in tooling that answers “where is my commit?” The payoff is infrastructure that self-heals and a Git history that tells you exactly what changed when.
Post-Mortems in Notion (not Datadog/PagerDuty)
🟩 Endorse
Both Datadog and PagerDuty have incident management features. Both are inflexible garbage for post-mortems.
Notion (or any wiki) lets you customise the process. Start with PagerDuty’s template, adapt to your team’s culture. The tool that gets used beats the tool with features.
Automating Post-Mortem Process
🟩 Endorse
Nobody wants to be the person chasing people to fill out the post-mortem.
Slack bot: “No update in 1 hour, post a status.” “No calendar invite in 24 hours, schedule the retro.” “Post-mortem doc still empty after 3 days, gentle nudge.”
Make the robot the bad guy. Your relationships with colleagues will thank you.
Regular PagerDuty Review
🟩 Endorse
Alert fatigue is a pipeline:
- No alerts. We need alerts.
- Too many alerts. We ignore alerts.
- We tune alerts. Only critical ones page.
- We ignore non-critical alerts.
- Something in non-critical explodes into an incident.
Two-tier alerting (critical pages, non-critical emails) plus bi-weekly review meetings. For each alert: should it stay critical? Can we automate the fix? Can we tune the threshold?
Non-critical alerts are technical debt. Treat them accordingly.
Monthly Cost Reviews
🟩 Endorse
Finance sees the bill. Finance can’t answer “is this right?”. Engineering can answer. Engineering doesn’t look.
Monthly meeting. Both teams. Every major SaaS bill.
Tag-based cost allocation in AWS. Break down by account, service, team. Spot the anomalies before they compound into “how did we spend £50k on NAT Gateway last month?”
Networking Deep Cuts
Route53 Latency-Based Routing + Health Checks
🟩 Endorse
Multi-region failover without a load balancer in front. Route53 health checks detect failures, latency-based routing sends traffic to the nearest healthy region.
Cheaper than Global Accelerator for most use cases. The main limitation is failover speed: health checks run on a 30-second interval (10 seconds if you pay for fast checks) and resolvers cache your records, so budget minutes for failover, not seconds. If that’s too slow, pay for Global Accelerator.
Pattern: active-active with latency routing for normal operation, automatic failover when health checks fail. Works for anything with a DNS name.
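One record set per region does the job (sketch; hostnames and load balancer references are illustrative):

```hcl
resource "aws_route53_health_check" "eu" {
  fqdn              = "eu.api.example.com"
  type              = "HTTPS"
  resource_path     = "/healthz"
  request_interval  = 10   # fast checks cost extra
  failure_threshold = 3
}

# Route53 answers with the lowest-latency *healthy* record.
resource "aws_route53_record" "api_eu" {
  zone_id         = aws_route53_zone.main.zone_id
  name            = "api.example.com"
  type            = "A"
  set_identifier  = "eu-west-1"
  health_check_id = aws_route53_health_check.eu.id

  latency_routing_policy {
    region = "eu-west-1"
  }

  alias {
    name                   = aws_lb.eu.dns_name
    zone_id                = aws_lb.eu.zone_id
    evaluate_target_health = true
  }
}

# Repeat with set_identifier = "us-east-1" for the other region.
```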
CloudFront + S3 Origin Access Control
🟩 Endorse
OAC replaced OAI (Origin Access Identity) – use it. Cleaner IAM integration, supports SSE-KMS encrypted buckets.
The pattern: S3 bucket is private, CloudFront is the only access path. No public bucket policies, no signed URLs for public content. Invalidation costs add up if you’re deploying frequently – use versioned filenames instead.
For APIs: CloudFront in front of API Gateway or ALB gives you edge caching, WAF integration, and a single domain for static + dynamic content.
Transit Gateway vs VPC Peering
🟧 Context-dependent
VPC Peering: free (data transfer still costs), simple, doesn’t scale past ~125 peerings per VPC.
Transit Gateway: costs money (hourly + per-GB), but gives you hub-and-spoke topology, route tables, multicast, and inter-region peering.
Rule of thumb: 3-5 VPCs? Peering. More than that, or you need centralised egress/ingress? Transit Gateway.
The hidden cost: Transit Gateway data processing fees. High-bandwidth cross-VPC traffic gets expensive fast. Architect to minimise cross-VPC chatter.
DNS Resolution Across Accounts (Route53 Resolver)
🟩 Endorse
Multi-account setups need centralised DNS. Route53 Resolver endpoints let spoke accounts resolve private hosted zones in a central account (and vice versa).
Without this: you’re managing DNS in every account or hacking /etc/hosts. Neither scales.
Pattern: central “networking” account owns private hosted zones, Resolver rules share them to spoke accounts via RAM. Services resolve internal DNS names regardless of which account they’re in.
Data Layer
SQS over Self-Managed Queues
🟩 Endorse
Every time I’ve seen teams run RabbitMQ or ActiveMQ in production, I’ve seen operational pain. Clustering issues, disk space alerts, upgrade nightmares.
SQS: unlimited throughput, no capacity planning, dead-letter queues built in, costs almost nothing at reasonable scale.
FIFO queues when ordering matters (throughput caps at 300 TPS per queue without batching, 3,000 with – design around it, or enable high-throughput FIFO). Standard queues for everything else.
The only valid reason for self-managed: you need AMQP protocol compatibility or complex routing (RabbitMQ exchanges). Even then, consider Amazon MQ first.
DynamoDB
🟩 Endorse with caveats
Single-digit millisecond latency at any scale. No connection pooling, no read replicas to manage, global tables for multi-region.
The caveats:
- Data modelling is hard. You must know your access patterns upfront. No JOINs, no ad-hoc queries.
- On-demand pricing is expensive at sustained load. Provisioned capacity + auto-scaling for predictable workloads.
- Hot partitions will ruin your day. Distribute writes across partition keys.
Pattern: use DynamoDB for high-throughput, simple access patterns (session stores, feature flags, user preferences). Use RDS/Aurora for complex queries and relationships.
Aurora Serverless v2
🟧 Cautious endorsement
Scales compute automatically, bills per ACU-second. Sounds perfect for variable workloads.
Reality: the scaling isn’t instant. Under sudden load, you’ll hit capacity limits before scale-up completes. And unless you’re on an engine version recent enough for 0-ACU auto-pause, the 0.5 ACU floor means idle still costs money – it’s not scale-to-zero.
Use it for: dev/staging environments, workloads with predictable daily patterns, multi-tenant apps where you can’t right-size a single instance.
Don’t use it for: latency-sensitive production workloads where scaling lag matters.
Things I’d Do Differently
Multiple Applications Sharing a Database
🟥 Regret
Nobody decides to share a database. It happens.
Someone adds a table. Another team adds a foreign key. Suddenly everything’s coupled. The database is used by everyone, cared for by no one. And everything owned by no one is owned by infrastructure eventually.
Problems: crud accumulates that nobody can delete. Performance issues require product context that infra doesn’t have. Alerts caused by bad application code page the infrastructure team. One team’s bad query takes down everyone.
One service, one database. Enforce it early. It’s harder to untangle later.
Not Adopting Identity Platform Early
🟥 Regret
Started with Google Workspace for groups and permissions. Too inflexible. Too many manual processes.
Okta (or equivalent) from day one. SCIM provisioning. SSO everywhere. Compliance sorted. Only accept SaaS vendors that integrate.
The security and audit benefits compound. The “we’ll do it properly later” never comes.
Not Using Lambda More
🟧 Regret
“EC2 is cheaper than Lambda at scale” – true for theoretical 100% utilisation. Nobody runs at 100% utilisation.
Lambda: scales to zero, per-request pricing, no infrastructure to manage, easy cost attribution.
I was slow to adopt Lambda because we had Kubernetes. Turns out event-driven workloads are dramatically simpler on Lambda. Stop fighting it.
Renovate over Dependabot
🟩 Endorse
Dependency updates are boring until they’re urgent. Then you’re upgrading five major versions in a crisis.
Renovate: more flexible than Dependabot, more complicated to configure. The regex documentation will test your patience. Still worth it.
Automate dependency updates or accept that your dependencies will become technical debt.
CI/CD Deep Cuts
GitHub Actions Self-Hosted Runners on EKS
🟧 Works, with pain
actions-runner-controller lets you run GitHub Actions on your own Kubernetes cluster. Saves money, keeps builds inside your VPC.
The pain: runner pod scaling is flaky, ephemeral runners occasionally fail to clean up, and debugging why a workflow is stuck waiting for a runner is maddening.
We made it work with aggressive pod lifecycle limits and custom metrics for runner pool sizing. But it’s not set-and-forget.
Alternative: CodeBuild for AWS-native workflows. More expensive per-minute, but zero operational overhead.
OIDC Federation for CI/CD (No Long-Lived Credentials)
🟩 Endorse
GitHub Actions, GitLab CI, CircleCI – all support OIDC. Your CI job assumes an IAM role directly, no access keys stored in secrets.
Pattern: IAM OIDC provider trusts your CI provider, role trust policy scopes to specific repos/branches. Terraform apply only works from main branch of infra repo.
If you’re still rotating CI credentials quarterly, stop. OIDC federation is straightforward to set up and eliminates an entire class of security incidents.
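The GitHub Actions flavour, in Terraform (sketch; the org/repo in the trust condition is hypothetical, and older AWS provider versions also require a `thumbprint_list`):

```hcl
resource "aws_iam_openid_connect_provider" "github" {
  url            = "https://token.actions.githubusercontent.com"
  client_id_list = ["sts.amazonaws.com"]
}

data "aws_iam_policy_document" "ci_trust" {
  statement {
    actions = ["sts:AssumeRoleWithWebIdentity"]

    principals {
      type        = "Federated"
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = "StringEquals"
      variable = "token.actions.githubusercontent.com:aud"
      values   = ["sts.amazonaws.com"]
    }

    # Scope to one repo and one branch: apply only runs from main.
    condition {
      test     = "StringLike"
      variable = "token.actions.githubusercontent.com:sub"
      values   = ["repo:org/infra:ref:refs/heads/main"]
    }
  }
}

resource "aws_iam_role" "ci" {
  name               = "ci-terraform-apply"
  assume_role_policy = data.aws_iam_policy_document.ci_trust.json
}
```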
Terraform State in S3 + DynamoDB Locking
🟩 Endorse
Obvious in retrospect, but: S3 bucket (versioned, encrypted) for state, DynamoDB table for locking. Atlantis or Terraform Cloud for remote execution.
The mistake I’ve seen: state in the same account as the infrastructure. When you accidentally terraform destroy the state bucket… don’t. Separate “management” account for CI/CD and state.
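The backend block itself is boring, which is the point (sketch; bucket and table names are illustrative, and both live in the management account):

```hcl
terraform {
  backend "s3" {
    bucket         = "org-terraform-state"          # management account, not this one
    key            = "prod/network/terraform.tfstate"
    region         = "eu-west-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"
  }
}
```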
Security Patterns
IAM Roles Anywhere (Hybrid Workloads)
🟩 Niche but useful
On-prem or non-AWS workloads that need AWS API access? IAM Roles Anywhere lets you use X.509 certificates to assume IAM roles.
No more long-lived access keys on Jenkins servers. Certificate-based auth with automatic credential rotation.
Setup: Private CA (ACM PCA or your own), trust anchor in IAM, certificates on your on-prem machines. More moving parts than access keys, but dramatically better security posture.
Secrets Manager vs Parameter Store
🟧 It depends
Secrets Manager: automatic rotation, cross-account sharing, costs $0.40/secret/month + API calls.
Parameter Store (SecureString): no rotation built-in, same-account only, free tier covers most usage.
Pattern: Secrets Manager for database credentials (use the rotation Lambda), RDS integration is seamless. Parameter Store for everything else (API keys, config values, feature flags).
Don’t pay for Secrets Manager when Parameter Store does the job.
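The split looks like this in Terraform (names and the rotation Lambda reference are illustrative):

```hcl
# Rotated DB credential: worth the $0.40/month.
resource "aws_secretsmanager_secret" "db" {
  name = "prod/db/credentials"
}

resource "aws_secretsmanager_secret_rotation" "db" {
  secret_id           = aws_secretsmanager_secret.db.id
  rotation_lambda_arn = aws_lambda_function.rotate_db.arn # hypothetical rotation function
  rotation_rules {
    automatically_after_days = 30
  }
}

# Plain config value: free at the standard tier, no rotation needed.
resource "aws_ssm_parameter" "feature_flag" {
  name  = "/prod/app/enable-new-checkout"
  type  = "SecureString"
  value = "false"
}
```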
HashiCorp Vault
🟧 It depends (often overkill)
Vault is powerful: dynamic secrets, PKI, transit encryption, identity-based access. It’s also operationally complex – you’re now running a critical distributed system.
When Vault makes sense:
- Dynamic database credentials (short-lived, per-pod)
- PKI infrastructure at scale
- Multi-cloud secrets management
- Strict compliance requiring audit trails on every secret access
When it’s overkill:
- AWS-only shops (Secrets Manager + IAM roles cover 90% of use cases)
- Teams without dedicated platform engineers to maintain it
- Startups that think “we’ll need it eventually”
If you’re running Vault, run it managed (HCP Vault) or accept you’re staffing a Vault team. Self-hosted Vault clusters have bitten more teams than they’ve helped.
External Secrets Operator + Secrets Manager handles most Kubernetes secrets needs without the Vault overhead.
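For reference, a minimal ExternalSecret manifest (store and secret names are placeholders; assumes a ClusterSecretStore already configured with IRSA):

```yaml
# Syncs a Secrets Manager value into a native Kubernetes Secret.
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager   # pre-existing ClusterSecretStore
    kind: ClusterSecretStore
  target:
    name: db-credentials        # the Kubernetes Secret to create
  data:
    - secretKey: password
      remoteRef:
        key: prod/db/credentials
        property: password
```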
AWS WAF
🟧 Endorse with caveats
WAF in front of ALB/CloudFront blocks obvious attacks: SQL injection, XSS, known bad IPs. AWS Managed Rules cover the basics.
The honest take: WAF is security theater for sophisticated attacks but catches enough script kiddies and scanners to be worth the $5/month base cost. The real protection comes from secure application code, not edge filtering.
What works:
- Rate limiting (actually useful for brute force)
- Geo-blocking if you don’t serve certain regions
- AWS Managed Rules for OWASP top 10
What doesn’t:
- Thinking WAF replaces input validation
- Custom regex rules (maintenance nightmare)
- Blocking legitimate traffic with overly aggressive rules
Pattern: Enable it, use managed rules, set up logging to S3, review blocked requests monthly. Don’t spend weeks tuning rules unless you’re under active attack.
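The “enable it, use managed rules, rate limit” baseline is about 40 lines of Terraform (ACL name and the 2,000-request limit are arbitrary; use scope = "CLOUDFRONT" for CloudFront distributions):

```hcl
resource "aws_wafv2_web_acl" "edge" {
  name  = "edge-acl"
  scope = "REGIONAL" # for an ALB

  default_action {
    allow {}
  }

  # Rate limiting: the genuinely useful part.
  rule {
    name     = "rate-limit"
    priority = 1
    action {
      block {}
    }
    statement {
      rate_based_statement {
        limit              = 2000 # requests per 5 minutes per IP
        aggregate_key_type = "IP"
      }
    }
    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "rate-limit"
      sampled_requests_enabled   = true
    }
  }

  # AWS managed rules for the OWASP basics.
  rule {
    name     = "aws-common"
    priority = 2
    override_action {
      none {}
    }
    statement {
      managed_rule_group_statement {
        name        = "AWSManagedRulesCommonRuleSet"
        vendor_name = "AWS"
      }
    }
    visibility_config {
      cloudwatch_metrics_enabled = true
      metric_name                = "aws-common"
      sampled_requests_enabled   = true
    }
  }

  visibility_config {
    cloudwatch_metrics_enabled = true
    metric_name                = "edge-acl"
    sampled_requests_enabled   = true
  }
}
```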
AWS Config + Security Hub
🟩 Endorse for compliance
Config rules catch drift: “S3 bucket is public”, “EBS volume unencrypted”, “Security group allows 0.0.0.0/0”.
Security Hub aggregates findings from Config, GuardDuty, Inspector, and third-party tools. Single pane of glass for compliance posture.
The gotcha: enabling everything generates thousands of findings. Prioritise ruthlessly – start with CIS benchmarks, suppress noise aggressively.
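Two of the drift checks above as AWS managed Config rules (assumes a Config recorder is already enabled in the account; rule names are arbitrary, the source identifiers are AWS-defined):

```hcl
# Flags any S3 bucket that allows public reads.
resource "aws_config_config_rule" "s3_public_read" {
  name = "s3-bucket-public-read-prohibited"
  source {
    owner             = "AWS"
    source_identifier = "S3_BUCKET_PUBLIC_READ_PROHIBITED"
  }
}

# Flags unencrypted EBS volumes.
resource "aws_config_config_rule" "ebs_encrypted" {
  name = "encrypted-volumes"
  source {
    owner             = "AWS"
    source_identifier = "ENCRYPTED_VOLUMES"
  }
}
```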
SCPs (Service Control Policies)
🟩 Endorse for guardrails
Organisation-level policies that even account admins can’t bypass. “No resources outside eu-west-1/eu-west-2”, “No public S3 buckets”, “No disabling CloudTrail”.
Pattern: deny-list SCPs in the organisation root for hard security boundaries. Allow-list SCPs for sandbox accounts (only specific services enabled).
Test thoroughly – an overly restrictive SCP will break deployments in ways that are hard to debug.
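A sketch of the region-restriction deny-list SCP (the NotAction list exempts global services that would otherwise break; tailor it to what you actually use):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "DenyOutsideUKEU",
      "Effect": "Deny",
      "NotAction": [
        "iam:*",
        "organizations:*",
        "route53:*",
        "cloudfront:*",
        "support:*"
      ],
      "Resource": "*",
      "Condition": {
        "StringNotEquals": {
          "aws:RequestedRegion": ["eu-west-1", "eu-west-2"]
        }
      }
    },
    {
      "Sid": "DenyCloudTrailTampering",
      "Effect": "Deny",
      "Action": ["cloudtrail:StopLogging", "cloudtrail:DeleteTrail"],
      "Resource": "*"
    }
  ]
}
```

Note the deny applies to everyone in the organisation, root account admins included, which is exactly the point.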
The Actual Lessons
Seven years of production incidents, 3am pages, and post-mortems have taught me this:
Non-negotiable: EKS + Karpenter. Flux or ArgoCD (pick one, stop debating). External Secrets Operator. Terraform with Terragrunt or Spacelift. OIDC federation (no long-lived credentials, ever). VPC endpoints for AWS service traffic. Prometheus stack (or accept Datadog’s pricing will eventually force migration anyway).
Avoid at all costs: Datadog at scale (pricing model is hostile to Kubernetes). Shared databases between services. EKS managed add-ons (you’ll customise, then fight them). Service meshes you don’t need. Long-lived CI credentials. Running Terraform from laptops. State files in the same account as infrastructure.
Context-dependent: NAT Gateway vs instances (cost threshold). Aurora Serverless v2 (scaling lag). Private API Gateway (cold start tolerance). Transit Gateway vs peering (VPC count). Secrets Manager vs Parameter Store (rotation needs). Spacelift vs Atlantis (budget).
Niche wins worth knowing: Route53 latency routing for cheap multi-region failover. EventBridge for decoupled event routing. Step Functions for complex orchestration. IAM Roles Anywhere for hybrid workloads. SCPs for guardrails that can’t be bypassed. Lambda for event-driven glue code (stop fighting it).
The meta-lessons:
Boring technology wins. Every time. The clever architecture that impresses in design reviews will wake you up at 3am when it fails in ways nobody anticipated.
Debuggability over elegance. If you can’t figure out why it’s broken in 15 minutes with logs and metrics, your architecture is wrong.
Automation compounds. Every hour spent on operational tooling pays dividends for years. Every hour spent manually doing what should be automated is stolen from your future self.
The fanciest architecture means nothing if you can’t debug it at 3am with half your brain still asleep.
I share infrastructure patterns, debugging deep-dives, and production war stories. Connect on LinkedIn or check out CoderCo for hands-on DevOps education.