OpenTelemetry Changed How I Think About Observability
A practical, opinionated take on OpenTelemetry - why it matters, what it actually solves, and how to instrument across Kubernetes, Lambda, ECS, and EC2 without losing your mind.
Browse posts by topic
A hands-on walkthrough of enabling AWS Control Tower, designing an OU structure, automating account provisioning via Service Catalog, and deploying security baselines - from zero to fully automated account vending in production.
A complete guide to setting up Spacelift for multi-team Terraform automation - from zero to production with spaces, dynamic stacks, OPA security policies in Rego, private module registry, and GitOps-driven infrastructure.
S3 backup/restore, direct connectivity, Parquet exports - none of them worked cleanly. Here's the full war story of migrating a production ClickHouse instance to Cloud, the version mismatch that broke everything, and the dumb-simple approach that actually got the job done.
Platform engineering has become the most misunderstood role in tech. Everyone's building 'platforms' but few understand what actually makes one successful. Here's what I've learned building platforms for teams of 10 to 500.
A practical guide to breaking up monolithic Terraform state files, moving resources between states, and refactoring infrastructure safely. Includes real examples, scripts, and the exact commands we use.
A production-grade setup for Clawdbot on Hetzner Cloud with Terraform provisioning, proper SSH hardening, fail2ban, UFW, unattended-upgrades, and optional Tailscale – the stuff you actually need in prod.
A detailed walkthrough for setting up Clawdbot on a Hetzner VPS from scratch – SSH hardening, firewall configuration, Tailscale, and WhatsApp Business integration using a dedicated number.
A detailed guide on deploying GitLab on AKS using Helm charts, with Azure SQL as the database backend. Covers architecture decisions, configuration, lessons learned, and the gotchas we hit in production.
DORA metrics are the industry standard for measuring DevOps performance. Here's how to implement them properly, avoid common pitfalls, and actually use them to improve your team's delivery.
Every infrastructure decision I'd make again – and the ones I wouldn't – after running production workloads across fintech, open-source, IoT, and beyond.
MLOps is becoming a critical skill for DevOps engineers. Here's what matters: the infrastructure patterns, tooling, and operational practices that make ML systems work in production - from someone who learned the hard way.
How I diagnosed and fixed a Java application that kept crashing under load – from 'cannot create native thread' errors to properly tuned JVM settings, system limits, and right-sized EC2 instances.
Control how pods spread across nodes, zones, and regions. A deep dive into topology spread constraints for high availability and efficient resource utilization.
The complete journey of containerising a Java JAR running on EC2 and deploying it to ECS Fargate – from local testing to Dockerfile, task definitions, networking, secrets management, and achieving production parity.
Combine Kind, LocalStack, and Act for a complete local development environment. Test Kubernetes, AWS services, and CI pipelines without leaving your laptop.
Run AWS services locally for faster development and testing. A practical guide to LocalStack covering S3, Lambda, DynamoDB, SQS, and integration testing patterns.
How to authenticate GitHub Actions to AWS without storing secrets. OIDC federation explained, IAM role setup, and the token claims that control access.
How to build an automated account vending machine using AWS Control Tower Account Factory, Service Catalog, CloudFormation StackSets, and Terraform – from request to fully provisioned account with SSO and IAM roles.
Stop pushing to test your workflows. Act lets you run GitHub Actions locally with instant feedback. Here's how to set it up and use it effectively.
Tagging is the foundation of cloud governance, cost allocation, and automation. Here's how to implement tagging consistently across your infrastructure using context modules, policies, and automation.
A battle-tested playbook for migrating CI/CD pipelines from Jenkins to GitHub Actions at scale. Covers OIDC authentication, parallel running, secrets migration, and the gotchas that will bite you.
Sign and verify container images without managing keys. A hands-on guide to Cosign, keyless signing, and enforcing signatures in Kubernetes.
A comprehensive guide to deploying Spotify's Backstage developer portal on AWS ECS Fargate with PostgreSQL RDS, Cognito authentication, and proper production hardening.
Advanced Terraform practices covering testing strategies, CI/CD pipelines, security hardening, drift detection, and team collaboration patterns for infrastructure as code at scale.
Build a production-ready database backup system using Kubernetes CronJobs, PostgreSQL, and S3. Includes a complete local testing environment with KIND and LocalStack.
A practical guide to building an ETL pipeline that extracts weather data from OpenWeatherMap, transforms it with pandas, and loads it into PostgreSQL. Includes Airflow orchestration with email notifications.
A comprehensive guide to Terraform best practices covering project organisation, state management, module design, and foundational patterns for scalable infrastructure as code.
Set up a Security Operations Center lab environment using Docker. Includes Elasticsearch, Kibana, Cribl Stream for log routing, and simulated log generators for hands-on security analysis practice.
How we used AWS DMS with database views, partitioned replication tasks, and Terraform to migrate event sourcing data from on-prem SQL Server and Oracle to DynamoDB – the architecture, the gotchas, and production Terraform you can reuse.
Build a lightweight Kubernetes cluster on three Raspberry Pi 5 devices. Step-by-step guide covering K3s installation, cluster configuration, and deployment testing.
Your dependencies are an attack vector. Here's how to secure your software supply chain with Sigstore, SLSA frameworks, SBOMs, and admission policies that actually work.
Deploy containerised applications to AWS Lambda or Fargate with a simple YAML config. No infrastructure code required - just define your containers and deploy.
You don't need Google's budget to practice SRE. Here's how to implement Site Reliability Engineering principles with a small team and limited resources.
Cloud cost management isn't just for finance. Here's how engineering teams can build cost awareness into their workflow without slowing down delivery.
Debug distroless and minimal containers in production without redeploying. Ephemeral containers let you attach debugging tools to running pods - here's how to use them effectively.
How to ensure sidecar containers are ready before your main app starts. Covers startupProbe, postStart hooks, and why readinessProbe doesn't do what you think.
Most incident processes are theatre. Here's how to build incident management that reduces downtime, prevents recurrence, and doesn't burn out your team.
Technical guide for upgrading managed Kubernetes clusters across GKE, EKS, and AKS
The questions that separate senior engineers from those who memorised tutorials. Real interview failures, what interviewers are actually looking for, and how to answer with depth.
Comprehensive guide for safely upgrading GKE clusters with minimal downtime and robust rollback procedures
How to use ECS external deployment controllers and task sets for manual blue/green deployments – the setup, the CLI commands, the Terraform, and an honest assessment of when it's worth the complexity.
Two major cluster crashes, migrating from kops to EKS, slashing compute costs with Karpenter, and the observability stack we rebuilt three times.
A practical guide to connecting to PostgreSQL databases in Kubernetes – exec into pods, VPN access, SOCKS5 proxies, pg_dump, kubectl cp and getting data out when you need it.
How a 'safe' AMI upgrade led to traffic drops, zombie log files, and disk exhaustion – and the debugging journey that followed. A real incident from on-call, with technical details and lessons learned.
Most Kubernetes clusters waste 50-70% of their resources. Here's how to measure what you're actually using, fix the worst offenders, and automate the process - without breaking production.
Service meshes promise observability, security, and traffic management. But which one should you choose? A practical comparison based on running all three in production.
Complete guide to building immutable AMIs with Packer in production - CI/CD pipelines, Terraform ASG integration, rollback strategies, maintenance workflows, and security hardening.
A practical guide to building an IDP that developers actually want to use. Covers the build vs buy decision, Backstage implementation, and the organisational changes required for success.
How DNS UDP's 512-byte limit caps responses at ~8 A records, breaking service discovery for scaled ECS/CloudMap workloads – and the sidecar solution to bypass it.
Kubernetes is an incredible technology that solves real problems. But for most startups, it's the wrong tool. Here's how to know when you're ready - and what to use instead.
In the first part of our Container Networking Deep Dive, we explore how to set up a single network namespace inside a VM and connect it to the host using a veth pair.
The complete journey: client-side vs server-side apply, admission controllers, etcd persistence, controller reconciliation, scheduler binding, and kubelet container creation. Every step traced.
Online EBS volume resizing for running instances – the IaC way with Terraform and ASG instance refresh, plus the manual escape hatch when you need it now. No reboot required.
How I built a GitHub Action to manage blue/green and canary deployments by dynamically updating Traefik weighted services – with SigV4 authentication, YAML configuration, and a generator API.
A complete walkthrough of setting up mutual TLS with Traefik and Smallstep CA – from certificate generation to client authentication. Includes local DNS, ACME integration, and a working demo you can deploy.
How to get started in DevOps?
An end-to-end guide for baking a Vault AMI using Packer and deploying a Vault EC2 instance on AWS.
In the first part of our ECS Fargate Deep Dive, we break down what happens behind the scenes when you run a task on Fargate — Firecracker microVMs, ENIs, IAM and the hidden host fleet.
In the second part of our ECS Fargate Deep Dive, we get hands-on with Firecracker — the lightweight VMM that powers Fargate — and simulate task isolation and networking locally.
Deep dive into Helm's --atomic, --wait, and --cleanup-on-fail flags. How they work, when to use them, the CI/CD pipeline trap that catches everyone, and production-ready deployment patterns.
You can't use Terraform to create the S3 bucket that stores Terraform state. Here's how to bootstrap your remote backend properly, plus the philosophical reason this pattern exists everywhere in software.
An end-to-end guide for creating a lab container for DevOps training.
A look at the ever-changing landscape of modern applications
Deep dive into Identity Aware Proxies - what they are, how they work, and how to implement them with GCP IAP, Pomerium, and OAuth2-Proxy. Includes Terraform and Kubernetes examples.
How to calculate true cost-per-tenant in a shared infrastructure environment. Covers EKS with Karpenter, shared databases (Aurora, DynamoDB), and tools like OpenCost, CloudZero, and custom attribution approaches.
Compare Dragonfly and Redis for caching and data storage: Dragonfly's multi-threaded architecture vs Redis's single-threaded model.
Scale MySQL horizontally with Vitess. Automatic sharding, online schema changes, and Kubernetes-native deployment for massive scale.
Deploy NATS JetStream for messaging and streaming. Simpler than Kafka, faster than RabbitMQ, with persistence and exactly-once delivery.
Use Vertical Pod Autoscaler and Horizontal Pod Autoscaler together without conflicts. Includes KEDA integration and best practices.
Implement automated cloud cost optimization with Kubecost and OpenCost. Track costs per team, rightsize resources, and automate savings.
Master AWS Spot Instances in production. Handle interruptions gracefully, use mixed instance groups, and save 60-90% on compute costs.
Master Karpenter for Kubernetes node autoscaling. Replace Cluster Autoscaler with faster, smarter provisioning. Includes cost optimization patterns.
Implement automated canary deployments with Flagger. Metrics-based promotion, automated rollback, and integration with Istio, Linkerd, and Gateway API.
Implement chaos engineering in Kubernetes with LitmusChaos. Run pod failures, network chaos, and stress tests to validate system resilience.
Detailed comparison of Kyverno and OPA Gatekeeper for Kubernetes policy enforcement. Includes real examples, performance considerations, and migration guidance.
Create custom cloud APIs with Crossplane Compositions. Abstract away complexity and give developers self-service infrastructure with guardrails.
Master Gateway API with traffic splitting, header-based routing, cross-namespace references, and TLS passthrough. The future of Kubernetes ingress.
Deploy a service mesh without sidecars using Cilium. Get mTLS, traffic management, and observability powered by eBPF at the kernel level.
Remove secrets from your applications entirely with Secretless Broker. Inject database credentials, API keys, and certificates via sidecar without your app knowing they exist.
Implement admission control policies with OPA Gatekeeper. Enforce security standards, naming conventions, resource limits, and compliance requirements at the cluster level.
Running databases on Kubernetes is controversial. Sometimes it's the right call, sometimes it's a disaster waiting to happen. Here's how to decide, and how to do it properly if you choose to proceed.
Deep dive into eBPF-based security tools - Cilium, Falco, and Tetragon. Learn how to implement runtime security, network policies, and threat detection at the kernel level.
Deep dive into SPIFFE and SPIRE for workload identity. Replace shared secrets with cryptographic identity for service-to-service authentication. Includes Kubernetes deployment and mTLS examples.
Why your Kubernetes cluster is wide open by default, and the single NetworkPolicy that changes everything. Copy, paste, deploy, sleep better.
How to enforce Pod Security Standards using the built-in Pod Security Admission controller. Covers Privileged, Baseline, and Restricted profiles, migration from PSPs, namespace labeling, and exemptions.
How to use External Secrets Operator to sync AWS Secrets Manager secrets to Kubernetes. Covers SecretStore, ExternalSecret, IAM with IRSA, templating, and production patterns.
A deep dive into why external DNS resolution in Kubernetes can be painfully slow, how the default ndots:5 setting causes unnecessary lookups, and practical fixes that actually work.
OpenTelemetry unifies traces, metrics, and logs under one standard. This guide covers how to instrument your applications, set up collectors, and actually make sense of the data.
Gateway API is the successor to Ingress, bringing role-oriented design, native traffic splitting, and cross-namespace routing. This post compares both APIs, when to migrate, and practical migration patterns.
A hands-on guide to implementing GitOps with ArgoCD. Covers installation, application management, sync strategies, secrets handling, and the patterns that actually work in production.
Cilium in Kubernetes
How to set up a private network for your EKS cluster with Twingate
In this blog, we configure mutual TLS (mTLS) using Gateway API on GKE, securing ingress traffic with client certificate validation.
Secure Your Kubernetes with SPIFFE + SPIRE: Zero-Trust Identity for Workloads
DNS spoofing in Kubernetes remains a critical threat, enabling attackers to redirect traffic, intercept data, or disrupt services. This article explores how such attacks occur and outlines strategies to prevent them.
Running Kubernetes clusters privately is a growing best practice. In this blog, I'll walk you through deploying a private AKS cluster on Azure with no public API endpoint, and enabling secure access via Twingate VPN, which provides identity-based access without opening up your network.
In this blog, I'll walk you through setting up a full-featured Apache Pulsar playground using kind (Kubernetes in Docker). Whether you're testing Pulsar for learning or demoing a real pub/sub model with admin tools and monitoring, this setup gives you everything.
AWS Controllers for Kubernetes
How packets actually flow in Kubernetes – from veth pairs to CNI plugins to kube-proxy modes. With AWS/EKS context throughout.
A hands-on article on deploying an application on Kubernetes with Fargate.
Real-world lessons from automating AWS account provisioning with Control Tower, Service Catalog, and Terraform. The silent failures, IAM traps, and StackSet timing issues that cost us days.
AWS doesn't offer vertical autoscaling for Aurora – so we built it. CloudWatch Alarms, SNS, Lambda coordination, and the gotchas we hit in production.
A detailed guide on migrating Terraform from 0.11 to 1.11, covering HCL2 syntax changes, the S3 bucket resource split, state manipulation, and ensuring zero-drift upgrades.
Master AWS PrivateLink for private API access, cross-account connectivity, and SaaS integrations. Includes Terraform examples and multi-region patterns.
AWS offers NAT Gateways as the default, fully managed solution for letting private subnet resources reach the internet. However, NAT Gateways can be pricey: an hourly cost of ~₹3.75/hour (varies by region), plus a data transfer cost of an additional ₹3.75/GB on top of standard data transfer. For small dev/test environments or personal labs, these costs can add up quickly. In contrast, a NAT Instance is just a normal EC2 instance configured to perform IP forwarding and NAT. It’s typically much cheaper to run a small instance (`t3.micro`) than a NAT Gateway, especially if your traffic volume is modest.
NAT Gateways are the silent budget killer in AWS. Here's how to reduce costs with NAT instances, VPC endpoints, IPv6, and architectural changes - with real numbers and trade-offs.
Running out of IP addresses in AWS EKS can be a subtle yet critical issue. It often manifests as pods stuck in a pending state or nodes failing to join the cluster, leading to deployment bottlenecks and potential downtime. Understanding the root cause and implementing effective solutions is essential for maintaining cluster health and scalability. There are many ways to fix this; this post walks through one of them.
How to use VPC Endpoints to access AWS services without internet gateways or NAT. Covers Gateway vs Interface endpoints, PrivateLink, endpoint policies, cost optimization, and production Terraform patterns.
How to use SCPs to set permission guardrails across your AWS Organization. Covers SCP evaluation logic, deny vs allow strategies, common patterns, and production-ready Terraform examples.
How to use AWS Config Rules to detect compliance violations and automatically remediate them using SSM Automation documents. Covers managed rules, custom rules, remediation actions, and complete Terraform examples.
How to use Amazon RDS Proxy to handle database connections from Lambda functions at scale. Covers connection pooling, IAM authentication, Terraform setup, and the gotchas you'll hit in production.
How to use AWS Managed Prefix Lists to eliminate hardcoded CIDR blocks in security groups and route tables. Covers AWS-managed prefixes, customer-managed lists for data centres, and production Terraform patterns.
How we migrated our CDN to AWS CloudFront at Trainline
Extend your private API Gateway with secure access from other VPCs using PrivateLink and enforce IAM-based authentication.
A hands-on guide to configuring AWS Route 53 for latency-based routing across multiple regions, incorporating health checks for automatic failover.
AWS EKS defaults to the VPC CNI plugin, assigning VPC IPs to pods via ENIs. While straightforward, this setup limits pod density per node and consumes VPC IPs rapidly. To overcome these constraints, deploying Calico with IPIP or BGP offers a scalable alternative.
Learn how to deploy a secure, private-only API Gateway inside your VPC using interface endpoints, resource policies, and VPC integration.
A hands-on technical guide to implementing AWS PrivateLink between VPCs using Terraform.
Solving the AWS OIDC Chicken-and-Egg Problem with GitHub Actions
A production-focused deep dive into how BGP actually behaves over AWS Direct Connect – route selection, failover, ASN design, MEDs, prepending, blackholing scenarios, and the real-world issues teams hit at scale.
The questions that separate senior engineers from those who memorised tutorials. Real interview failures, what interviewers are actually looking for, and how to answer with depth.
How to use SCPs to set permission guardrails across your AWS Organization. Covers SCP evaluation logic, deny vs allow strategies, common patterns, and production-ready Terraform examples.
How to use AWS Config Rules to detect compliance violations and automatically remediate them using SSM Automation documents. Covers managed rules, custom rules, remediation actions, and complete Terraform examples.
How to use ECS external deployment controllers and task sets for manual blue/green deployments – the setup, the CLI commands, the Terraform, and an honest assessment of when it's worth the complexity.
How to use Amazon RDS Proxy to handle database connections from Lambda functions at scale. Covers connection pooling, IAM authentication, Terraform setup, and the gotchas you'll hit in production.
Complete guide to building immutable AMIs with Packer in production - CI/CD pipelines, Terraform ASG integration, rollback strategies, maintenance workflows, and security hardening.
How to use AWS Managed Prefix Lists to eliminate hardcoded CIDR blocks in security groups and route tables. Covers AWS-managed prefixes, customer-managed lists for data centres, and production Terraform patterns.
A step-by-step guide to setting up a Kafka cluster on a local Kind cluster using the Strimzi operator, with optional Terraform provisioning.
A hands-on guide to configuring AWS Route 53 for latency-based routing across multiple regions, incorporating health checks for automatic failover.
Online EBS volume resizing for running instances – the IaC way with Terraform and ASG instance refresh, plus the manual escape hatch when you need it now. No reboot required.
Learn how to deploy a secure, private-only API Gateway inside your VPC using interface endpoints, resource policies, and VPC integration.
A hands-on technical guide to implementing AWS PrivateLink between VPCs using Terraform.
You can't use Terraform to create the S3 bucket that stores Terraform state. Here's how to bootstrap your remote backend properly, plus the philosophical reason this pattern exists everywhere in software.
A hands-on walkthrough of enabling AWS Control Tower, designing an OU structure, automating account provisioning via Service Catalog, and deploying security baselines - from zero to fully automated account vending in production.
Deep dive into Identity Aware Proxies - what they are, how they work, and how to implement them with GCP IAP, Pomerium, and OAuth2-Proxy. Includes Terraform and Kubernetes examples.
A production-grade setup for Clawdbot on Hetzner Cloud with Terraform provisioning, proper SSH hardening, fail2ban, UFW, unattended-upgrades, and optional Tailscale – the stuff you actually need in prod.
A detailed walkthrough for setting up Clawdbot on a Hetzner VPS from scratch – SSH hardening, firewall configuration, Tailscale, and WhatsApp Business integration using a dedicated number.
How to authenticate GitHub Actions to AWS without storing secrets. OIDC federation explained, IAM role setup, and the token claims that control access.
Detailed comparison of Kyverno and OPA Gatekeeper for Kubernetes policy enforcement. Includes real examples, performance considerations, and migration guidance.
Master AWS PrivateLink for private API access, cross-account connectivity, and SaaS integrations. Includes Terraform examples and multi-region patterns.
Remove secrets from your applications entirely with Secretless Broker. Inject database credentials, API keys, and certificates via sidecar without your app knowing they exist.
Sign and verify container images without managing keys. A hands-on guide to Cosign, keyless signing, and enforcing signatures in Kubernetes.
Implement admission control policies with OPA Gatekeeper. Enforce security standards, naming conventions, resource limits, and compliance requirements at the cluster level.
Deep dive into eBPF-based security tools - Cilium, Falco, and Tetragon. Learn how to implement runtime security, network policies, and threat detection at the kernel level.
Deep dive into SPIFFE and SPIRE for workload identity. Replace shared secrets with cryptographic identity for service-to-service authentication. Includes Kubernetes deployment and mTLS examples.
Advanced Terraform practices covering testing strategies, CI/CD pipelines, security hardening, drift detection, and team collaboration patterns for infrastructure as code at scale.
Set up a Security Operations Center lab environment using Docker. Includes Elasticsearch, Kibana, Cribl Stream for log routing, and simulated log generators for hands-on security analysis practice.
Why your Kubernetes cluster is wide open by default, and the single NetworkPolicy that changes everything. Copy, paste, deploy, sleep better.
Your dependencies are an attack vector. Here's how to secure your software supply chain with Sigstore, SLSA frameworks, SBOMs, and admission policies that actually work.
How to enforce Pod Security Standards using the built-in Pod Security Admission controller. Covers Privileged, Baseline, and Restricted profiles, migration from PSPs, namespace labeling, and exemptions.
How to use External Secrets Operator to sync AWS Secrets Manager secrets to Kubernetes. Covers SecretStore, ExternalSecret, IAM with IRSA, templating, and production patterns.
How to use VPC Endpoints to access AWS services without internet gateways or NAT. Covers Gateway vs Interface endpoints, PrivateLink, endpoint policies, cost optimization, and production Terraform patterns.
How to use SCPs to set permission guardrails across your AWS Organization. Covers SCP evaluation logic, deny vs allow strategies, common patterns, and production-ready Terraform examples.
How to use AWS Config Rules to detect compliance violations and automatically remediate them using SSM Automation documents. Covers managed rules, custom rules, remediation actions, and complete Terraform examples.
A comprehensive guide to hardening your Clawdbot installation and integrating with Google Workspace, GitHub, and Notion – turning your AI assistant into a productivity powerhouse.
eBPF is transforming how we observe, secure, and network Linux systems. This guide covers the fundamentals, practical use cases beyond Cilium, and how to start writing your own eBPF programs.
Complete guide to building immutable AMIs with Packer in production - CI/CD pipelines, Terraform ASG integration, rollback strategies, maintenance workflows, and security hardening.
How to use AWS Managed Prefix Lists to eliminate hardcoded CIDR blocks in security groups and route tables. Covers AWS-managed prefixes, customer-managed lists for data centres, and production Terraform patterns.
Extend your private API Gateway with secure access from other VPCs using PrivateLink and enforce IAM-based authentication.
In this blog, we configure mutual TLS (mTLS) using Gateway API on GKE, securing ingress traffic with client certificate validation.
DNS spoofing in Kubernetes remains a critical threat, enabling attackers to redirect traffic, intercept data, or disrupt services. This article explores how such attacks occur and outlines strategies to prevent them.
A complete walkthrough of setting up mutual TLS with Traefik and Smallstep CA – from certificate generation to client authentication. Includes local DNS, ACME integration, and a working demo you can deploy.
Learn how to deploy a secure, private-only API Gateway inside your VPC using interface endpoints, resource policies, and VPC integration.
A hands-on technical guide to implementing AWS PrivateLink between VPCs using Terraform.
Every infrastructure decision I'd make again – and the ones I wouldn't – after running production workloads across fintech, open-source, IoT, and beyond.
Master AWS PrivateLink for private API access, cross-account connectivity, and SaaS integrations. Includes Terraform examples and multi-region patterns.
Master Gateway API with traffic splitting, header-based routing, cross-namespace references, and TLS passthrough. The future of Kubernetes ingress.
Deploy Tailscale for secure connectivity across clouds, offices, and Kubernetes clusters. Zero-config VPN mesh with SSO integration and ACLs.
Deploy a service mesh without sidecars using Cilium. Get mTLS, traffic management, and observability powered by eBPF at the kernel level.
Why your Kubernetes cluster is wide open by default, and the single NetworkPolicy that changes everything. Copy, paste, deploy, sleep better.
A deep dive into why external DNS resolution in Kubernetes can be painfully slow, how the default ndots:5 setting causes unnecessary lookups, and practical fixes that actually work.
NAT Gateways are the silent budget killer in AWS. Here's how to reduce costs with NAT instances, VPC endpoints, IPv6, and architectural changes - with real numbers and trade-offs.
Running out of IP addresses in AWS EKS can be a subtle yet critical issue. It often manifests as pods stuck in a pending state or nodes failing to join the cluster, leading to deployment bottlenecks and potential downtime. Understanding the root cause and implementing effective solutions is essential for maintaining cluster health and scalability. Now, there are many ways to fix this, but this is one way.
How to use VPC Endpoints to access AWS services without internet gateways or NAT. Covers Gateway vs Interface endpoints, PrivateLink, endpoint policies, cost optimization, and production Terraform patterns.
Gateway API is the successor to Ingress, bringing role-oriented design, native traffic splitting, and cross-namespace routing. This post compares both APIs, when to migrate, and practical migration patterns.
eBPF is transforming how we observe, secure, and network Linux systems. This guide covers the fundamentals, practical use cases beyond Cilium, and how to start writing your own eBPF programs.
Service meshes promise observability, security, and traffic management. But which one should you choose? A practical comparison based on running all three in production.
How to use AWS Managed Prefix Lists to eliminate hardcoded CIDR blocks in security groups and route tables. Covers AWS-managed prefixes, customer-managed lists for data centres, and production Terraform patterns.
How DNS UDP's 512-byte limit caps responses at ~8 A records, breaking service discovery for scaled ECS/CloudMap workloads – and the sidecar solution to bypass it.
In the first part of our Container Networking Deep Dive, we explore how to set up a single network namespace inside a VM and connect it to the host using a veth pair.
In the second part of our Container Networking Deep Dive, we connect two network namespaces via a bridge on the same Linux host.
Deep Dive into EC2 Networking: ENIs, IP Addressing and Deployment Architectures
AWS EKS defaults to the VPC CNI plugin, assigning VPC IPs to pods via ENIs. While straightforward, this setup limits pod density per node and consumes VPC IPs rapidly. To overcome these constraints, deploying Calico with IPIP or BGP offers a scalable alternative.
Running Kubernetes clusters privately is a growing best practice. In this blog, I'll walk you through deploying a private AKS cluster on Azure with no public API endpoint, and enabling secure access via Twingate VPN, which provides identity-based access without opening up your network.
Networking Tools
Learn how to deploy a secure, private-only API Gateway inside your VPC using interface endpoints, resource policies, and VPC integration.
A hands-on technical guide to implementing AWS PrivateLink between VPCs using Terraform.
How packets actually flow in Kubernetes – from veth pairs to CNI plugins to kube-proxy modes. With AWS/EKS context throughout.
A production-focused deep dive into how BGP actually behaves over AWS Direct Connect – route selection, failover, ASN design, MEDs, prepending, blackholing scenarios, and the real-world issues teams hit at scale.
A practical, opinionated take on OpenTelemetry - why it matters, what it actually solves, and how to instrument across Kubernetes, Lambda, ECS, and EC2 without losing your mind.
Real-world lessons from automating AWS account provisioning with Control Tower, Service Catalog, and Terraform. The silent failures, IAM traps, and StackSet timing issues that cost us days.
A hands-on walkthrough of enabling AWS Control Tower, designing an OU structure, automating account provisioning via Service Catalog, and deploying security baselines - from zero to fully automated account vending in production.
A complete guide to setting up Spacelift for multi-team Terraform automation - from zero to production with spaces, dynamic stacks, OPA security policies in Rego, private module registry, and GitOps-driven infrastructure.
Platform engineering has become the most misunderstood role in tech. Everyone's building 'platforms' but few understand what actually makes one successful. Here's what I've learned building platforms for teams of 10 to 500.
DORA metrics are the industry standard for measuring DevOps performance. Here's how to implement them properly, avoid common pitfalls, and actually use them to improve your team's delivery.
Every infrastructure decision I'd make again – and the ones I wouldn't – after running production workloads across fintech, open-source, IoT, and beyond.
MLOps is becoming a critical skill for DevOps engineers. Here's what matters: the infrastructure patterns, tooling, and operational practices that make ML systems work in production - from someone who learned the hard way.
Explore Port and Kratix for building internal developer platforms. Self-service infrastructure, developer workflows, and platform engineering patterns.
How to build an automated account vending machine using AWS Control Tower Account Factory, Service Catalog, CloudFormation StackSets, and Terraform – from request to fully provisioned account with SSO and IAM roles.
Build custom Backstage plugins for your internal developer portal. Create frontend components, backend APIs, and integrate with your existing tools.
Create custom cloud APIs with Crossplane Compositions. Abstract away complexity and give developers self-service infrastructure with guardrails.
A comprehensive guide to deploying Spotify's Backstage developer portal on AWS ECS Fargate with PostgreSQL RDS, Cognito authentication, and proper production hardening.
How we used AWS DMS with database views, partitioned replication tasks, and Terraform to migrate event sourcing data from on-prem SQL Server and Oracle to DynamoDB – the architecture, the gotchas, and production Terraform you can reuse.
A practical guide to building an IDP that developers actually want to use. Covers the build vs buy decision, Backstage implementation, and the organisational changes required for success.
Most engineers massively undervalue themselves because no one taught them how to negotiate. Here's everything I've learned from negotiating salaries, contracts, titles, and more.
DORA metrics are the industry standard for measuring DevOps performance. Here's how to implement them properly, avoid common pitfalls, and actually use them to improve your team's delivery.
Everyone wants to know the difference between Senior, Staff, and Principal. After holding all three titles, I can tell you the real differences aren't what most people think. It's not about years - it's about scope.
The IC ladder looks appealing until you're at the top. Many senior engineers chase Principal titles without understanding what they're signing up for. Here's what nobody tells you.
After working across all three - tiny startups, hypergrowth scale-ups, and massive enterprises - I can tell you they're completely different jobs. Same title, same tech, completely different experience. Here's what each teaches you.
Everyone claims to have a blameless culture. Few actually do. Here's what real blamelessness looks like and why it's so difficult to achieve.
I've done both. Multiple times. Here are the real trade-offs nobody talks about - the money, the time-off problem, the boredom factor, and why your life situation matters more than you think.
The RTO push isn't about productivity. The data is clear: remote work works. What's really happening is a fight over control, real estate, and management's inability to adapt.
Documentation is often treated as junior work. That's backwards. The most impactful documentation comes from senior engineers, and writing it is a force multiplier for your expertise.
The idea of the 10x engineer has done more harm than good. What actually matters is team multipliers - engineers who make everyone around them better.
Most meetings are information broadcasts disguised as collaboration. Learn when to meet, when to write, and how to save everyone's time.
Certifications have become a checkbox exercise. They don't prove competence, and they often distract from what actually matters: building things and solving real problems.
Daily standups were meant to improve communication. Instead, they've become status meetings that waste time and interrupt deep work. There's a better way.
Most engineers massively undervalue themselves because no one taught them how to negotiate. Here's everything I've learned from negotiating salaries, contracts, titles, and more.
On interview take-home tests that are suspiciously specific, contractors who get ghosted after detailed proposals, and learning to play the game without becoming bitter about it.
Everyone wants to know the difference between Senior, Staff, and Principal. After holding all three titles, I can tell you the real differences aren't what most people think. It's not about years - it's about scope.
The IC ladder looks appealing until you're at the top. Many senior engineers chase Principal titles without understanding what they're signing up for. Here's what nobody tells you.
After working across all three - tiny startups, hypergrowth scale-ups, and massive enterprises - I can tell you they're completely different jobs. Same title, same tech, completely different experience. Here's what each teaches you.
I've done both. Multiple times. Here are the real trade-offs nobody talks about - the money, the time-off problem, the boredom factor, and why your life situation matters more than you think.
The RTO push isn't about productivity. The data is clear: remote work works. What's really happening is a fight over control, real estate, and management's inability to adapt.
Documentation is often treated as junior work. That's backwards. The most impactful documentation comes from senior engineers, and writing it is a force multiplier for your expertise.
The idea of the 10x engineer has done more harm than good. What actually matters is team multipliers - engineers who make everyone around them better.
The questions that separate senior engineers from those who memorised tutorials. Real interview failures, what interviewers are actually looking for, and how to answer with depth.
Certifications have become a checkbox exercise. They don't prove competence, and they often distract from what actually matters: building things and solving real problems.
The complete journey of containerising a Java JAR running on EC2 and deploying it to ECS Fargate – from local testing to Dockerfile, task definitions, networking, secrets management, and achieving production parity.
Sign and verify container images without managing keys. A hands-on guide to Cosign, keyless signing, and enforcing signatures in Kubernetes.
Build a lightweight Kubernetes cluster on three Raspberry Pi 5 devices. Step-by-step guide covering K3s installation, cluster configuration, and deployment testing.
Deploy containerised applications to AWS Lambda or Fargate with a simple YAML config. No infrastructure code required - just define your containers and deploy.
Debug distroless and minimal containers in production without redeploying. Ephemeral containers let you attach debugging tools to running pods - here's how to use them effectively.
How to ensure sidecar containers are ready before your main app starts. Covers startupProbe, postStart hooks, and why readinessProbe doesn't do what you think.
In the first part of our Container Networking Deep Dive, we explore how to set up a single network namespace inside a VM and connect it to the host using a veth pair.
In the second part of our Container Networking Deep Dive, we connect two network namespaces via a bridge on the same Linux host.
In the first part of our ECS Fargate Deep Dive, we break down what happens behind the scenes when you run a task on Fargate — Firecracker microVMs, ENIs, IAM and the hidden host fleet.
In the second part of our ECS Fargate Deep Dive, we get hands-on with Firecracker — the lightweight VMM that powers Fargate — and simulate task isolation and networking locally.
A look at the ever-changing landscape of modern applications
How to calculate true cost-per-tenant in a shared infrastructure environment. Covers EKS with Karpenter, shared databases (Aurora, DynamoDB), and tools like OpenCost, CloudZero, and custom attribution approaches.
Master AWS Spot Instances in production. Handle interruptions gracefully, use mixed instance groups, and save 60-90% on compute costs.
Master Karpenter for Kubernetes node autoscaling. Replace Cluster Autoscaler with faster, smarter provisioning. Includes cost optimization patterns.
Running out of IP addresses in AWS EKS can be a subtle yet critical issue. It often manifests as pods stuck in a pending state or nodes failing to join the cluster, leading to deployment bottlenecks and potential downtime. Understanding the root cause and implementing effective solutions is essential for maintaining cluster health and scalability. Now, there are many ways to fix this, but this is one way.
Technical guide for upgrading managed Kubernetes clusters across GKE, EKS, and AKS
Two major cluster crashes, migrating from kops to EKS, slashing compute costs with Karpenter, and the observability stack we rebuilt three times.
How to set up a private network for your EKS cluster with Twingate
AWS EKS defaults to the VPC CNI plugin, assigning VPC IPs to pods via ENIs. While straightforward, this setup limits pod density per node and consumes VPC IPs rapidly. To overcome these constraints, deploying Calico with IPIP or BGP offers a scalable alternative.
How packets actually flow in Kubernetes – from veth pairs to CNI plugins to kube-proxy modes. With AWS/EKS context throughout.
A hands-on article on deploying an application on Kubernetes with Fargate.
A practical, opinionated take on OpenTelemetry - why it matters, what it actually solves, and how to instrument across Kubernetes, Lambda, ECS, and EC2 without losing your mind.
A comprehensive guide to migrating your Elasticsearch, Logstash, and Kibana stack from version 6.x to 8.x. Covers breaking changes, migration strategies, index compatibility, and zero-downtime approaches.
A comprehensive guide to setting up Elastic Cloud (Elasticsearch Service), including deployment configuration, security setup, index lifecycle management, integrations, and cost optimization.
Implement automated cloud cost optimization with Kubecost and OpenCost. Track costs per team, rightsize resources, and automate savings.
Implement SLO-based alerting with burn rate alerts. Move from noisy threshold alerts to meaningful reliability signals using error budgets.
Master OpenTelemetry Collector configuration. Build pipelines to transform metrics, filter traces, route logs, and reduce telemetry costs.
OpenTelemetry unifies traces, metrics, and logs under one standard. This guide covers how to instrument your applications, set up collectors, and actually make sense of the data.
Two major cluster crashes, migrating from kops to EKS, slashing compute costs with Karpenter, and the observability stack we rebuilt three times.
eBPF is transforming how we observe, secure, and network Linux systems. This guide covers the fundamentals, practical use cases beyond Cilium, and how to start writing your own eBPF programs.
How we automated Dynatrace alerting configuration using custom Ansible roles - covering alert profiles, problem notifications, metric events, and maintenance windows across multiple environments.
The complete journey of containerising a Java JAR running on EC2 and deploying it to ECS Fargate – from local testing to Dockerfile, task definitions, networking, secrets management, and achieving production parity.
Run AWS services locally for faster development and testing. A practical guide to LocalStack covering S3, Lambda, DynamoDB, SQS, and integration testing patterns.
A comprehensive guide to deploying Spotify's Backstage developer portal on AWS ECS Fargate with PostgreSQL RDS, Cognito authentication, and proper production hardening.
A practical guide to building an ETL pipeline that extracts weather data from OpenWeatherMap, transforms it with pandas, and loads it into PostgreSQL. Includes Airflow orchestration with email notifications.
Set up a Security Operations Center lab environment using Docker. Includes Elasticsearch, Kibana, Cribl Stream for log routing, and simulated log generators for hands-on security analysis practice.
Deploy containerised applications to AWS Lambda or Fargate with a simple YAML config. No infrastructure code required - just define your containers and deploy.
An end-to-end guide for creating a lab container for DevOps training.
A look at the ever-changing landscape of modern applications
Every infrastructure decision I'd make again – and the ones I wouldn't – after running production workloads across fintech, open-source, IoT, and beyond.
The complete journey of containerising a Java JAR running on EC2 and deploying it to ECS Fargate – from local testing to Dockerfile, task definitions, networking, secrets management, and achieving production parity.
A comprehensive guide to deploying Spotify's Backstage developer portal on AWS ECS Fargate with PostgreSQL RDS, Cognito authentication, and proper production hardening.
How to use ECS external deployment controllers and task sets for manual blue/green deployments – the setup, the CLI commands, the Terraform, and an honest assessment of when it's worth the complexity.
How DNS UDP's 512-byte limit caps responses at ~8 A records, breaking service discovery for scaled ECS/CloudMap workloads – and the sidecar solution to bypass it.
In the first part of our ECS Fargate Deep Dive, we break down what happens behind the scenes when you run a task on Fargate — Firecracker microVMs, ENIs, IAM and the hidden host fleet.
In the second part of our ECS Fargate Deep Dive, we get hands-on with Firecracker — the lightweight VMM that powers Fargate — and simulate task isolation and networking locally.
S3 backup/restore, direct connectivity, Parquet exports - none of them worked cleanly. Here's the full war story of migrating a production ClickHouse instance to Cloud, the version mismatch that broke everything, and the dumb-simple approach that actually got the job done.
A comprehensive guide to migrating your Elasticsearch, Logstash, and Kibana stack from version 6.x to 8.x. Covers breaking changes, migration strategies, index compatibility, and zero-downtime approaches.
A practical guide to breaking up monolithic Terraform state files, moving resources between states, and refactoring infrastructure safely. Includes real examples, scripts, and the exact commands we use.
A detailed guide on migrating Terraform from 0.11 to 1.11, covering HCL2 syntax changes, the S3 bucket resource split, state manipulation, and ensuring zero-drift upgrades.
Running out of IP addresses in AWS EKS can be a subtle yet critical issue. It often manifests as pods stuck in a pending state or nodes failing to join the cluster, leading to deployment bottlenecks and potential downtime. Understanding the root cause and implementing effective solutions is essential for maintaining cluster health and scalability. There are many ways to fix this; this post covers one of them.
AWS EKS defaults to the VPC CNI plugin, assigning VPC IPs to pods via ENIs. While straightforward, this setup limits pod density per node and consumes VPC IPs rapidly. To overcome these constraints, deploying Calico with IPIP or BGP offers a scalable alternative.
How packets actually flow in Kubernetes – from veth pairs to CNI plugins to kube-proxy modes. With AWS/EKS context throughout.
A comprehensive guide to migrating your Elasticsearch, Logstash, and Kibana stack from version 6.x to 8.x. Covers breaking changes, migration strategies, index compatibility, and zero-downtime approaches.
A comprehensive guide to setting up Elastic Cloud (Elasticsearch Service), including deployment configuration, security setup, index lifecycle management, integrations, and cost optimization.
Set up a Security Operations Center lab environment using Docker. Includes Elasticsearch, Kibana, Cribl Stream for log routing, and simulated log generators for hands-on security analysis practice.
A detailed guide on deploying GitLab on AKS using Helm charts, with Azure SQL as the database backend. Covers architecture decisions, configuration, lessons learned, and the gotchas we hit in production.
In this blog, I'll walk you through setting up a full-featured Apache Pulsar playground using kind (Kubernetes in Docker). Whether you're testing Pulsar for learning or demoing a real pub/sub model with admin tools and monitoring, this setup gives you everything.
Deep dive into Helm's --atomic, --wait, and --cleanup-on-fail flags. How they work, when to use them, the CI/CD pipeline trap that catches everyone, and production-ready deployment patterns.
You don't need Google's budget to practice SRE. Here's how to implement Site Reliability Engineering principles with a small team and limited resources.
Most incident processes are theatre. Here's how to build incident management that reduces downtime, prevents recurrence, and doesn't burn out your team.
How a 'safe' AMI upgrade led to traffic drops, zombie log files, and disk exhaustion – and the debugging journey that followed. A real incident from on-call, with technical details and lessons learned.
How I diagnosed and fixed a Java application that kept crashing under load – from 'cannot create native thread' errors to properly tuned JVM settings, system limits, and right-sized EC2 instances.
Debug distroless and minimal containers in production without redeploying. Ephemeral containers let you attach debugging tools to running pods - here's how to use them effectively.
A deep dive into why external DNS resolution in Kubernetes can be painfully slow, how the default ndots:5 setting causes unnecessary lookups, and practical fixes that actually work.
A practical guide to connecting to PostgreSQL databases in Kubernetes – exec into pods, VPN access, SOCKS5 proxies, pg_dump, kubectl cp and getting data out when you need it.
The complete journey: client-side vs server-side apply, admission controllers, etcd persistence, controller reconciliation, scheduler binding, and kubelet container creation. Every step traced.
A detailed guide on deploying GitLab on AKS using Helm charts, with Azure SQL as the database backend. Covers architecture decisions, configuration, lessons learned, and the gotchas we hit in production.
Technical guide for upgrading managed Kubernetes clusters across GKE, EKS, and AKS
Running Kubernetes clusters privately is a growing best practice. In this blog, I'll walk you through deploying a private AKS cluster on Azure with no public API endpoint, and enabling secure access via Twingate VPN, which provides identity-based access without opening up your network.
AWS doesn't offer vertical autoscaling for Aurora – so we built it. CloudWatch Alarms, SNS, Lambda coordination, and the gotchas we hit in production.
Use Vertical Pod Autoscaler and Horizontal Pod Autoscaler together without conflicts. Includes KEDA integration and best practices.
Master Karpenter for Kubernetes node autoscaling. Replace Cluster Autoscaler with faster, smarter provisioning. Includes cost optimization patterns.
The RTO push isn't about productivity. The data is clear: remote work works. What's really happening is a fight over control, real estate, and management's inability to adapt.
Most meetings are information broadcasts disguised as collaboration. Learn when to meet, when to write, and how to save everyone's time.
Daily standups were meant to improve communication. Instead, they've become status meetings that waste time and interrupt deep work. There's a better way.
Master Gateway API with traffic splitting, header-based routing, cross-namespace references, and TLS passthrough. The future of Kubernetes ingress.
Gateway API is the successor to Ingress, bringing role-oriented design, native traffic splitting, and cross-namespace routing. This post compares both APIs, when to migrate, and practical migration patterns.
In this blog, we configure mutual TLS (mTLS) using Gateway API on GKE, securing ingress traffic with client certificate validation.
A practical, opinionated take on OpenTelemetry - why it matters, what it actually solves, and how to instrument across Kubernetes, Lambda, ECS, and EC2 without losing your mind.
Master OpenTelemetry Collector configuration. Build pipelines to transform metrics, filter traces, route logs, and reduce telemetry costs.
OpenTelemetry unifies traces, metrics, and logs under one standard. This guide covers how to instrument your applications, set up collectors, and actually make sense of the data.
AWS doesn't offer vertical autoscaling for Aurora – so we built it. CloudWatch Alarms, SNS, Lambda coordination, and the gotchas we hit in production.
A comprehensive guide to deploying Spotify's Backstage developer portal on AWS ECS Fargate with PostgreSQL RDS, Cognito authentication, and proper production hardening.
How to use Amazon RDS Proxy to handle database connections from Lambda functions at scale. Covers connection pooling, IAM authentication, Terraform setup, and the gotchas you'll hit in production.
AWS doesn't offer vertical autoscaling for Aurora – so we built it. CloudWatch Alarms, SNS, Lambda coordination, and the gotchas we hit in production.
Deploy containerised applications to AWS Lambda or Fargate with a simple YAML config. No infrastructure code required - just define your containers and deploy.
How to use Amazon RDS Proxy to handle database connections from Lambda functions at scale. Covers connection pooling, IAM authentication, Terraform setup, and the gotchas you'll hit in production.
A complete guide to setting up Spacelift for multi-team Terraform automation - from zero to production with spaces, dynamic stacks, OPA security policies in Rego, private module registry, and GitOps-driven infrastructure.
Detailed comparison of Kyverno and OPA Gatekeeper for Kubernetes policy enforcement. Includes real examples, performance considerations, and migration guidance.
Implement admission control policies with OPA Gatekeeper. Enforce security standards, naming conventions, resource limits, and compliance requirements at the cluster level.
Everyone claims to have a blameless culture. Few actually do. Here's what real blamelessness looks like and why it's so difficult to achieve.
Most incident processes are theatre. Here's how to build incident management that reduces downtime, prevents recurrence, and doesn't burn out your team.
AWS EKS defaults to the VPC CNI plugin, assigning VPC IPs to pods via ENIs. While straightforward, this setup limits pod density per node and consumes VPC IPs rapidly. To overcome these constraints, deploying Calico with IPIP or BGP offers a scalable alternative.
A production-focused deep dive into how BGP actually behaves over AWS Direct Connect – route selection, failover, ASN design, MEDs, prepending, blackholing scenarios, and the real-world issues teams hit at scale.
Deploy Tailscale for secure connectivity across clouds, offices, and Kubernetes clusters. Zero-config VPN mesh with SSO integration and ACLs.
A production-focused deep dive into how BGP actually behaves over AWS Direct Connect – route selection, failover, ASN design, MEDs, prepending, blackholing scenarios, and the real-world issues teams hit at scale.
Master Karpenter for Kubernetes node autoscaling. Replace Cluster Autoscaler with faster, smarter provisioning. Includes cost optimization patterns.
Two major cluster crashes, migrating from kops to EKS, slashing compute costs with Karpenter, and the observability stack we rebuilt three times.
A hands-on walkthrough of enabling AWS Control Tower, designing an OU structure, automating account provisioning via Service Catalog, and deploying security baselines - from zero to fully automated account vending in production.
How to build an automated account vending machine using AWS Control Tower Account Factory, Service Catalog, CloudFormation StackSets, and Terraform – from request to fully provisioned account with SSO and IAM roles.
Platform engineering has become the most misunderstood role in tech. Everyone's building 'platforms' but few understand what actually makes one successful. Here's what I've learned building platforms for teams of 10 to 500.
A practical guide to building an IDP that developers actually want to use. Covers the build vs buy decision, Backstage implementation, and the organisational changes required for success.
A comprehensive guide to hardening your Clawdbot installation and integrating with Google Workspace, GitHub, and Notion – turning your AI assistant into a productivity powerhouse.
Solving the AWS OIDC Chicken-and-Egg Problem with GitHub Actions
A production-grade setup for Clawdbot on Hetzner Cloud with Terraform provisioning, proper SSH hardening, fail2ban, UFW, unattended-upgrades, and optional Tailscale – the stuff you actually need in prod.
A detailed walkthrough for setting up Clawdbot on a Hetzner VPS from scratch – SSH hardening, firewall configuration, Tailscale, and WhatsApp Business integration using a dedicated number.
A comprehensive guide to setting up Elastic Cloud (Elasticsearch Service), including deployment configuration, security setup, index lifecycle management, integrations, and cost optimization.
How to calculate true cost-per-tenant in a shared infrastructure environment. Covers EKS with Karpenter, shared databases (Aurora, DynamoDB), and tools like OpenCost, CloudZero, and custom attribution approaches.
Create custom cloud APIs with Crossplane Compositions. Abstract away complexity and give developers self-service infrastructure with guardrails.
Crossplane + LocalStack on kind: 100% Local AWS Infrastructure-as-Code
Running databases on Kubernetes is controversial. Sometimes it's the right call, sometimes it's a disaster waiting to happen. Here's how to decide, and how to do it properly if you choose to proceed.
Online EBS volume resizing for running instances – the IaC way with Terraform and ASG instance refresh, plus the manual escape hatch when you need it now. No reboot required.
A step-by-step guide to setting up a Kafka cluster on a local Kind cluster using the Strimzi operator, with optional Terraform provisioning.
Pulsar vs Kafka
Cloud cost management isn't just for finance. Here's how engineering teams can build cost awareness into their workflow without slowing down delivery.
How to get started in DevOps?
Implement SLO-based alerting with burn rate alerts. Move from noisy threshold alerts to meaningful reliability signals using error budgets.
How we automated Dynatrace alerting configuration using custom Ansible roles - covering alert profiles, problem notifications, metric events, and maintenance windows across multiple environments.
Kubernetes is an incredible technology that solves real problems. But for most startups, it's the wrong tool. Here's how to know when you're ready - and what to use instead.
Deep Dive into EC2 Networking: ENIs, IP Addressing and Deployment Architectures
How I diagnosed and fixed a Java application that kept crashing under load – from 'cannot create native thread' errors to properly tuned JVM settings, system limits, and right-sized EC2 instances.
The complete journey of containerising a Java JAR running on EC2 and deploying it to ECS Fargate – from local testing to Dockerfile, task definitions, networking, secrets management, and achieving production parity.
How to use ECS external deployment controllers and task sets for manual blue/green deployments – the setup, the CLI commands, the Terraform, and an honest assessment of when it's worth the complexity.
How I built a GitHub Action to manage blue/green and canary deployments by dynamically updating Traefik weighted services – with SigV4 authentication, YAML configuration, and a generator API.
How we used AWS DMS with database views, partitioned replication tasks, and Terraform to migrate event sourcing data from on-prem SQL Server and Oracle to DynamoDB – the architecture, the gotchas, and production Terraform you can reuse.
You can't use Terraform to create the S3 bucket that stores Terraform state. Here's how to bootstrap your remote backend properly, plus the philosophical reason this pattern exists everywhere in software.
A detailed guide on migrating Terraform from 0.11 to 1.11, covering HCL2 syntax changes, the S3 bucket resource split, state manipulation, and ensuring zero-drift upgrades.
How to set up a private network for your EKS cluster with Twingate
Running Kubernetes clusters privately is a growing best practice. In this blog, I'll walk you through deploying a private AKS cluster on Azure with no public API endpoint, and enabling secure access via Twingate VPN, which provides identity-based access without opening up your network.
Using GKE DNS-based endpoints for secure cluster access
AWS EKS defaults to the VPC CNI plugin, assigning VPC IPs to pods via ENIs. While straightforward, this setup limits pod density per node and consumes VPC IPs rapidly. To overcome these constraints, deploying Calico with IPIP or BGP offers a scalable alternative.
How packets actually flow in Kubernetes – from veth pairs to CNI plugins to kube-proxy modes. With AWS/EKS context throughout.
A comprehensive guide to setting up Elastic Cloud (Elasticsearch Service), including deployment configuration, security setup, index lifecycle management, integrations, and cost optimization.
OpenTelemetry unifies traces, metrics, and logs under one standard. This guide covers how to instrument your applications, set up collectors, and actually make sense of the data.
Deep dive into eBPF-based security tools - Cilium, Falco, and Tetragon. Learn how to implement runtime security, network policies, and threat detection at the kernel level.
Falco Kubernetes Lab: Runtime Threat Detection with Prometheus & Grafana
Implement SLO-based alerting with burn rate alerts. Move from noisy threshold alerts to meaningful reliability signals using error budgets.
On interview take-home tests that are suspiciously specific, contractors who get ghosted after detailed proposals, and learning to play the game without becoming bitter about it.
The questions that separate senior engineers from those who memorised tutorials. Real interview failures, what interviewers are actually looking for, and how to answer with depth.
I've done both. Multiple times. Here are the real trade-offs nobody talks about - the money, the time off problem, the boredom factor, and why your life situation matters more than you think.
Technical guide for upgrading managed Kubernetes clusters across GKE, EKS, and AKS
Comprehensive guide for safely upgrading GKE clusters with minimal downtime and robust rollback procedures
A deep dive into why external DNS resolution in Kubernetes can be painfully slow, how the default ndots:5 setting causes unnecessary lookups, and practical fixes that actually work.
DNS spoofing in Kubernetes remains a critical threat, enabling attackers to redirect traffic, intercept data, or disrupt services. This article explores how such attacks occur and outlines strategies to prevent them.
Documentation is often treated as junior work. That's backwards. The most impactful documentation comes from senior engineers, and writing it is a force multiplier for your expertise.
Most meetings are information broadcasts disguised as collaboration. Learn when to meet, when to write, and how to save everyone's time.
AWS offers NAT Gateways as the default, fully managed solution for letting private subnet resources reach the internet. However, NAT Gateways can be pricey: the hourly cost is ~₹3.75/hour (varies by region), and the data transfer cost is an additional ₹3.75/GB on top of standard data transfer. For small dev/test environments or personal labs, these costs can add up quickly. In contrast, a NAT Instance is just a normal EC2 instance configured to perform IP forwarding and NAT. It’s typically much cheaper to run a small instance (`t3.micro`) than a NAT Gateway, especially if your traffic volume is modest.
NAT Gateways are the silent budget killer in AWS. Here's how to reduce costs with NAT instances, VPC endpoints, IPv6, and architectural changes - with real numbers and trade-offs.
Everyone wants to know the difference between Senior, Staff, and Principal. After holding all three titles, I can tell you the real differences aren't what most people think. It's not about years - it's about scope.
The IC ladder looks appealing until you're at the top. Many senior engineers chase Principal titles without understanding what they're signing up for. Here's what nobody tells you.
Extend your private API Gateway with secure access from other VPCs using PrivateLink and enforce IAM-based authentication.
Learn how to deploy a secure, private-only API Gateway inside your VPC using interface endpoints, resource policies, and VPC integration.
A detailed guide on deploying GitLab on AKS using Helm charts, with Azure SQL as the database backend. Covers architecture decisions, configuration, lessons learned, and the gotchas we hit in production.
Running Kubernetes clusters privately is a growing best practice. In this blog, I'll walk you through deploying a private AKS cluster on Azure with no public API endpoint, and enabling secure access via Twingate VPN, which provides identity-based access without opening up your network.
Deploy Tailscale for secure connectivity across clouds, offices, and Kubernetes clusters. Zero-config VPN mesh with SSO integration and ACLs.
Running Kubernetes clusters privately is a growing best practice. In this blog, I'll walk you through deploying a private AKS cluster on Azure with no public API endpoint, and enabling secure access via Twingate VPN, which provides identity-based access without opening up your network.
Deploy NATS JetStream for messaging and streaming. Simpler than Kafka, faster than RabbitMQ, with persistence and exactly-once delivery.
In this blog, I'll walk you through setting up a full-featured Apache Pulsar playground using kind (Kubernetes in Docker). Whether you're testing Pulsar for learning or demoing a real pub/sub model with admin tools and monitoring, this setup gives you everything.
Cloud cost management isn't just for finance. Here's how engineering teams can build cost awareness into their workflow without slowing down delivery.
Most Kubernetes clusters waste 50-70% of their resources. Here's how to measure what you're actually using, fix the worst offenders, and automate the process - without breaking production.
Deploy a service mesh without sidecars using Cilium. Get mTLS, traffic management, and observability powered by eBPF at the kernel level.
Service meshes promise observability, security, and traffic management. But which one should you choose? A practical comparison based on running all three in production.
Deep dive into SPIFFE and SPIRE for workload identity. Replace shared secrets with cryptographic identity for service-to-service authentication. Includes Kubernetes deployment and mTLS examples.
Secure Your Kubernetes with SPIFFE + SPIRE: Zero-Trust Identity for Workloads
Implement automated canary deployments with Flagger. Metrics-based promotion, automated rollback, and integration with Istio, Linkerd, and Gateway API.
How I built a GitHub Action to manage blue/green and canary deployments by dynamically updating Traefik weighted services – with SigV4 authentication, YAML configuration, and a generator API.
Complete guide to building immutable AMIs with Packer in production - CI/CD pipelines, Terraform ASG integration, rollback strategies, maintenance workflows, and security hardening.
An end-to-end guide for baking a Vault AMI using Packer and deploying a Vault EC2 instance on AWS.
After working across all three - tiny startups, hypergrowth scale-ups, and massive enterprises - I can tell you they're completely different jobs. Same title, same tech, completely different experience. Here's what each teaches you.
Kubernetes is an incredible technology that solves real problems. But for most startups, it's the wrong tool. Here's how to know when you're ready - and what to use instead.
Combine Kind, LocalStack, and Act for a complete local development environment. Test Kubernetes, AWS services, and CI pipelines without leaving your laptop.
Stop pushing to test your workflows. Act lets you run GitHub Actions locally with instant feedback. Here's how to set it up and use it effectively.
Sign and verify container images without managing keys. A hands-on guide to Cosign, keyless signing, and enforcing signatures in Kubernetes.
Your dependencies are an attack vector. Here's how to secure your software supply chain with Sigstore, SLSA frameworks, SBOMs, and admission policies that actually work.
Combine Kind, LocalStack, and Act for a complete local development environment. Test Kubernetes, AWS services, and CI pipelines without leaving your laptop.
Run AWS services locally for faster development and testing. A practical guide to LocalStack covering S3, Lambda, DynamoDB, SQS, and integration testing patterns.
Control how pods spread across nodes, zones, and regions. A deep dive into topology spread constraints for high availability and efficient resource utilization.
How to ensure sidecar containers are ready before your main app starts. Covers startupProbe, postStart hooks, and why readinessProbe doesn't do what you think.
Tagging is the foundation of cloud governance, cost allocation, and automation. Here's how to implement tagging consistently across your infrastructure using context modules, policies, and automation.
How to use SCPs to set permission guardrails across your AWS Organization. Covers SCP evaluation logic, deny vs allow strategies, common patterns, and production-ready Terraform examples.
Set up a Security Operations Center lab environment using Docker. Includes Elasticsearch, Kibana, Cribl Stream for log routing, and simulated log generators for hands-on security analysis practice.
Build a lightweight Kubernetes cluster on three Raspberry Pi 5 devices. Step-by-step guide covering K3s installation, cluster configuration, and deployment testing.
Most engineers massively undervalue themselves because no one taught them how to negotiate. Here's everything I've learned from negotiating salaries, contracts, titles, and more.
I've done both. Multiple times. Here are the real trade-offs nobody talks about - the money, the time off problem, the boredom factor, and why your life situation matters more than you think.
Master Gateway API with traffic splitting, header-based routing, cross-namespace references, and TLS passthrough. The future of Kubernetes ingress.
Gateway API is the successor to Ingress, bringing role-oriented design, native traffic splitting, and cross-namespace routing. This post compares both APIs, when to migrate, and practical migration patterns.
A hands-on walkthrough of enabling AWS Control Tower, designing an OU structure, automating account provisioning via Service Catalog, and deploying security baselines - from zero to fully automated account vending in production.
How to use SCPs to set permission guardrails across your AWS Organization. Covers SCP evaluation logic, deny vs allow strategies, common patterns, and production-ready Terraform examples.
Detailed comparison of Kyverno and OPA Gatekeeper for Kubernetes policy enforcement. Includes real examples, performance considerations, and migration guidance.
Implement admission control policies with OPA Gatekeeper. Enforce security standards, naming conventions, resource limits, and compliance requirements at the cluster level.
A complete guide to setting up Spacelift for multi-team Terraform automation - from zero to production with spaces, dynamic stacks, OPA security policies in Rego, private module registry, and GitOps-driven infrastructure.
Implement admission control policies with OPA Gatekeeper. Enforce security standards, naming conventions, resource limits, and compliance requirements at the cluster level.
Real-world lessons from automating AWS account provisioning with Control Tower, Service Catalog, and Terraform. The silent failures, IAM traps, and StackSet timing issues that cost us days.
A hands-on walkthrough of enabling AWS Control Tower, designing an OU structure, automating account provisioning via Service Catalog, and deploying security baselines - from zero to fully automated account vending in production.
A complete guide to setting up Spacelift for multi-team Terraform automation - from zero to production with spaces, dynamic stacks, OPA security policies in Rego, private module registry, and GitOps-driven infrastructure.
The idea of the 10x engineer has done more harm than good. What actually matters is team multipliers - engineers who make everyone around them better.
Everyone claims to have a blameless culture. Few actually do. Here's what real blamelessness looks like and why it's so difficult to achieve.
How we migrated our CDN to AWS CloudFront at Trainline
A production-focused deep dive into how BGP actually behaves over AWS Direct Connect – route selection, failover, ASN design, MEDs, prepending, blackholing scenarios, and the real-world issues teams hit at scale.
AWS Controllers for Kubernetes
How to build an automated account vending machine using AWS Control Tower Account Factory, Service Catalog, CloudFormation StackSets, and Terraform – from request to fully provisioned account with SSO and IAM roles.
MLOps is becoming a critical skill for DevOps engineers. Here's what matters: the infrastructure patterns, tooling, and operational practices that make ML systems work in production - from someone who learned the hard way.
A comprehensive guide to hardening your Clawdbot installation and integrating with Google Workspace, GitHub, and Notion – turning your AI assistant into a productivity powerhouse.
A detailed walkthrough for setting up Clawdbot on a Hetzner VPS from scratch – SSH hardening, firewall configuration, Tailscale, and WhatsApp Business integration using a dedicated number.
How to calculate true cost-per-tenant in a shared infrastructure environment. Covers EKS with Karpenter, shared databases (Aurora, DynamoDB), and tools like OpenCost, CloudZero, and custom attribution approaches.
In the second part of our Container Networking Deep Dive, we connect two network namespaces via a bridge on the same Linux host.
In the first part of our Container Networking Deep Dive, we explore how to set up a single network namespace inside a VM and connect it to the host using a veth pair.
Running databases on Kubernetes is controversial. Sometimes it's the right call, sometimes it's a disaster waiting to happen. Here's how to decide, and how to do it properly if you choose to proceed.
A step-by-step guide to setting up a Kafka cluster on a local Kind cluster using the Strimzi operator, with optional Terraform provisioning.
How to get started in DevOps?
How DNS UDP's 512-byte limit caps responses at ~8 A records, breaking service discovery for scaled ECS/CloudMap workloads – and the sidecar solution to bypass it.
DORA metrics are the industry standard for measuring DevOps performance. Here's how to implement them properly, avoid common pitfalls, and actually use them to improve your team's delivery.
How we automated Dynatrace alerting configuration using custom Ansible roles - covering alert profiles, problem notifications, metric events, and maintenance windows across multiple environments.
eBPF is transforming how we observe, secure, and network Linux systems. This guide covers the fundamentals, practical use cases beyond Cilium, and how to start writing your own eBPF programs.
Deep Dive into EC2 Networking: ENIs, IP Addressing and Deployment Architectures
How to use ECS external deployment controllers and task sets for manual blue/green deployments – the setup, the CLI commands, the Terraform, and an honest assessment of when it's worth the complexity.
How to set up a private network for your EKS cluster with Twingate
Running out of IP addresses in AWS EKS can be a subtle yet critical issue. It often manifests as pods stuck in a pending state or nodes failing to join the cluster, leading to deployment bottlenecks and potential downtime. Understanding the root cause and implementing effective solutions is essential for maintaining cluster health and scalability. Now, there are many ways to fix this, but this is one way.
AWS EKS defaults to the VPC CNI plugin, assigning VPC IPs to pods via ENIs. While straightforward, this setup limits pod density per node and consumes VPC IPs rapidly. To overcome these constraints, deploying Calico with IPIP or BGP offers a scalable alternative.
A comprehensive guide to setting up Elastic Cloud (Elasticsearch Service), including deployment configuration, security setup, index lifecycle management, integrations, and cost optimization.
A comprehensive guide to migrating your Elasticsearch, Logstash, and Kibana stack from version 6.x to 8.x. Covers breaking changes, migration strategies, index compatibility, and zero-downtime approaches.
Falco Kubernetes Lab: Runtime Threat Detection with Prometheus & Grafana
In the second part of our ECS Fargate Deep Dive, we get hands-on with Firecracker — the lightweight VMM that powers Fargate — and simulate task isolation and networking locally.
On interview take-home tests that are suspiciously specific, contractors who get ghosted after detailed proposals, and learning to play the game without becoming bitter about it.
A hands-on guide to implementing GitOps with ArgoCD. Covers installation, application management, sync strategies, secrets handling, and the patterns that actually work in production.
A comprehensive guide to safely upgrading GKE clusters with minimal downtime and robust rollback procedures.
In this deep dive, we set up a secure, production-ready CI/CD pipeline from GitHub Actions to GKE using Workload Identity Federation—no secrets needed.
Deep dive into Helm's --atomic, --wait, and --cleanup-on-fail flags. How they work, when to use them, the CI/CD pipeline trap that catches everyone, and production-ready deployment patterns.
Online EBS volume resizing for running instances – the IaC way with Terraform and ASG instance refresh, plus the manual escape hatch when you need it now. No reboot required.
A battle-tested playbook for migrating CI/CD pipelines from Jenkins to GitHub Actions at scale. Covers OIDC authentication, parallel running, secrets migration, and the gotchas that will bite you.
How I diagnosed and fixed a Java application that kept crashing under load – from 'cannot create native thread' errors to properly tuned JVM settings, system limits, and right-sized EC2 instances.
A practical guide to connecting to PostgreSQL databases in Kubernetes – exec into pods, VPN access, SOCKS5 proxies, pg_dump, kubectl cp and getting data out when you need it.
DNS spoofing in Kubernetes remains a critical threat, enabling attackers to redirect traffic, intercept data, or disrupt services. This article explores how such attacks occur and outlines strategies to prevent them.
Why your Kubernetes cluster is wide open by default, and the single NetworkPolicy that changes everything. Copy, paste, deploy, sleep better.
An end-to-end guide for creating a lab container for DevOps training.
Most meetings are information broadcasts disguised as collaboration. Learn when to meet, when to write, and how to save everyone's time.
AWS offers NAT Gateways as the default, fully managed solution for letting private subnet resources reach the internet. However, NAT Gateways can be pricey: an hourly cost of ~₹3.75/hour (varies by region), plus a data transfer cost of an additional ₹3.75/GB on top of standard data transfer. For small dev/test environments or personal labs, these costs can add up quickly. In contrast, a NAT Instance is just a normal EC2 instance configured to perform IP forwarding and NAT. It’s typically much cheaper to run a small instance (`t3.micro`) than a NAT Gateway, especially if your traffic volume is modest.
Networking Tools
How a 'safe' AMI upgrade led to traffic drops, zombie log files, and disk exhaustion – and the debugging journey that followed. A real incident from on-call, with technical details and lessons learned.
OpenTelemetry unifies traces, metrics, and logs under one standard. This guide covers how to instrument your applications, set up collectors, and actually make sense of the data.
Platform engineering has become the most misunderstood role in tech. Everyone's building 'platforms' but few understand what actually makes one successful. Here's what I've learned building platforms for teams of 10 to 500.
Running Kubernetes clusters privately is a growing best practice. In this blog, I'll walk you through deploying a private AKS cluster on Azure with no public API endpoint, and enabling secure access via Twingate VPN, which provides identity-based access without opening up your network.
In this blog, I'll walk you through setting up a full-featured Apache Pulsar playground using kind (Kubernetes in Docker). Whether you're testing Pulsar for learning or demoing a real pub/sub model with admin tools and monitoring, this setup gives you everything.
Pulsar vs Kafka
The RTO push isn't about productivity. The data is clear: remote work works. What's really happening is a fight over control, real estate, and management inability to adapt.
Most Kubernetes clusters waste 50-70% of their resources. Here's how to measure what you're actually using, fix the worst offenders, and automate the process - without breaking production.
A hands-on guide to configuring AWS Route 53 for latency-based routing across multiple regions, incorporating health checks for automatic failover.
A detailed guide on deploying GitLab on AKS using Helm charts, with Azure SQL as the database backend. Covers architecture decisions, configuration, lessons learned, and the gotchas we hit in production.
Documentation is often treated as junior work. That's backwards. The most impactful documentation comes from senior engineers, and writing it is a force multiplier for your expertise.
Service meshes promise observability, security, and traffic management. But which one should you choose? A practical comparison based on running all three in production.
Master AWS Spot Instances in production. Handle interruptions gracefully, use mixed instance groups, and save 60-90% on compute costs.
How we used AWS DMS with database views, partitioned replication tasks, and Terraform to migrate event sourcing data from on-prem SQL Server and Oracle to DynamoDB – the architecture, the gotchas, and production Terraform you can reuse.
Daily standups were meant to improve communication. Instead, they've become status meetings that waste time and interrupt deep work. There's a better way.
Certifications have become a checkbox exercise. They don't prove competence, and they often distract from what actually matters: building things and solving real problems.
A detailed guide on migrating Terraform from 0.11 to 1.11, covering HCL2 syntax changes, the S3 bucket resource split, state manipulation, and ensuring zero-drift upgrades.
A practical guide to breaking up monolithic Terraform state files, moving resources between states, and refactoring infrastructure safely. Includes real examples, scripts, and the exact commands we use.
How I built a GitHub Action to manage blue/green and canary deployments by dynamically updating Traefik weighted services – with SigV4 authentication, YAML configuration, and a generator API.
A complete walkthrough of setting up mutual TLS with Traefik and Smallstep CA – from certificate generation to client authentication. Includes local DNS, ACME integration, and a working demo you can deploy.
An end-to-end guide for baking a Vault AMI using Packer and deploying a Vault EC2 instance on AWS.
AWS doesn't offer vertical autoscaling for Aurora – so we built it. CloudWatch Alarms, SNS, Lambda coordination, and the gotchas we hit in production.
Kubernetes is an incredible technology that solves real problems. But for most startups, it's the wrong tool. Here's how to know when you're ready - and what to use instead.
The complete journey: client-side vs server-side apply, admission controllers, etcd persistence, controller reconciliation, scheduler binding, and kubelet container creation. Every step traced.
Sign and verify container images without managing keys. A hands-on guide to Cosign, keyless signing, and enforcing signatures in Kubernetes.
Control how pods spread across nodes, zones, and regions. A deep dive into topology spread constraints for high availability and efficient resource utilization.
Your dependencies are an attack vector. Here's how to secure your software supply chain with Sigstore, SLSA frameworks, SBOMs, and admission policies that actually work.
Tagging is the foundation of cloud governance, cost allocation, and automation. Here's how to implement tagging consistently across your infrastructure using context modules, policies, and automation.
A comprehensive guide to Terraform best practices covering project organisation, state management, module design, and foundational patterns for scalable infrastructure as code.
Build a production-ready database backup system using Kubernetes CronJobs, PostgreSQL, and S3. Includes a complete local testing environment with KIND and LocalStack.
A practical guide to building an ETL pipeline that extracts weather data from OpenWeatherMap, transforms it with pandas, and loads it into PostgreSQL. Includes Airflow orchestration with email notifications.
Build a lightweight Kubernetes cluster on three Raspberry Pi 5 devices. Step-by-step guide covering K3s installation, cluster configuration, and deployment testing.
Set up a Security Operations Center lab environment using Docker. Includes Elasticsearch, Kibana, Cribl Stream for log routing, and simulated log generators for hands-on security analysis practice.
A comprehensive guide to deploying Spotify's Backstage developer portal on AWS ECS Fargate with PostgreSQL RDS, Cognito authentication, and proper production hardening.
Most engineers massively undervalue themselves because no one taught them how to negotiate. Here's everything I've learned from negotiating salaries, contracts, titles, and more.
Complete guide to building immutable AMIs with Packer in production - CI/CD pipelines, Terraform ASG integration, rollback strategies, maintenance workflows, and security hardening.
How to use AWS Managed Prefix Lists to eliminate hardcoded CIDR blocks in security groups and route tables. Covers AWS-managed prefixes, customer-managed lists for data centres, and production Terraform patterns.
How to use Amazon RDS Proxy to handle database connections from Lambda functions at scale. Covers connection pooling, IAM authentication, Terraform setup, and the gotchas you'll hit in production.
How to use AWS Config Rules to detect compliance violations and automatically remediate them using SSM Automation documents. Covers managed rules, custom rules, remediation actions, and complete Terraform examples.
How to use VPC Endpoints to access AWS services without internet gateways or NAT. Covers Gateway vs Interface endpoints, PrivateLink, endpoint policies, cost optimization, and production Terraform patterns.
How to use External Secrets Operator to sync AWS Secrets Manager secrets to Kubernetes. Covers SecretStore, ExternalSecret, IAM with IRSA, templating, and production patterns.
How to enforce Pod Security Standards using the built-in Pod Security Admission controller. Covers Privileged, Baseline, and Restricted profiles, migration from PSPs, namespace labeling, and exemptions.
How to ensure sidecar containers are ready before your main app starts. Covers startupProbe, postStart hooks, and why readinessProbe doesn't do what you think.
Deep dive into Identity Aware Proxies - what they are, how they work, and how to implement them with GCP IAP, Pomerium, and OAuth2-Proxy. Includes Terraform and Kubernetes examples.
Deep dive into eBPF-based security tools - Cilium, Falco, and Tetragon. Learn how to implement runtime security, network policies, and threat detection at the kernel level.
Implement admission control policies with OPA Gatekeeper. Enforce security standards, naming conventions, resource limits, and compliance requirements at the cluster level.
Build custom Backstage plugins for your internal developer portal. Create frontend components, backend APIs, and integrate with your existing tools.
Implement chaos engineering in Kubernetes with LitmusChaos. Run pod failures, network chaos, and stress tests to validate system resilience.
Compare Dragonfly and Redis for caching and data storage. Covers Dragonfly's multi-threaded architecture versus Redis's single-threaded model.
Implement automated cloud cost optimization with Kubecost and OpenCost. Track costs per team, rightsize resources, and automate savings.
Explore Port and Kratix for building internal developer platforms. Self-service infrastructure, developer workflows, and platform engineering patterns.
Detailed comparison of Kyverno and OPA Gatekeeper for Kubernetes policy enforcement. Includes real examples, performance considerations, and migration guidance.
Deploy NATS JetStream for messaging and streaming. Simpler than Kafka, faster than RabbitMQ, with persistence and exactly-once delivery.
Master OpenTelemetry Collector configuration. Build pipelines to transform metrics, filter traces, route logs, and reduce telemetry costs.
Implement automated canary deployments with Flagger. Metrics-based promotion, automated rollback, and integration with Istio, Linkerd, and Gateway API.
Remove secrets from your applications entirely with Secretless Broker. Inject database credentials, API keys, and certificates via sidecar without your app knowing they exist.
Implement SLO-based alerting using burn rates. Move from noisy threshold alerts to meaningful reliability signals using error budgets.
Deploy Tailscale for secure connectivity across clouds, offices, and Kubernetes clusters. Zero-config VPN mesh with SSO integration and ACLs.
Scale MySQL horizontally with Vitess. Automatic sharding, online schema changes, and Kubernetes-native deployment for massive scale.
Use Vertical Pod Autoscaler and Horizontal Pod Autoscaler together without conflicts. Includes KEDA integration and best practices.