<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Mo Abukar&apos;s Blog</title><description>Technical deep dives on DevOps, Kubernetes, cloud architecture, and platform engineering.</description><link>https://moabukar.co.uk/</link><language>en-gb</language><managingEditor>devopsbymo@gmail.com (Mo Abukar)</managingEditor><webMaster>devopsbymo@gmail.com (Mo Abukar)</webMaster><copyright>© 2026 Mo Abukar</copyright><category>Technology</category><category>DevOps</category><category>Kubernetes</category><item><title>Building a Production-Grade Homelab with K3s, Vault, and FluxCD</title><link>https://moabukar.co.uk/blog/homelab-k3s-proxmox/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/homelab-k3s-proxmox/</guid><description>How I built a fully GitOps-managed Kubernetes homelab on a single mini PC - from unboxing to production. Proxmox bare metal install, K3s cluster, HashiCorp Vault secrets, full observability, and Cloudflare Tunnel.</description><pubDate>Sat, 07 Mar 2026 00:00:00 GMT</pubDate><content:encoded>Building a Production-Grade Homelab with K3s, Vault, and FluxCD
================================================================

I wanted a homelab that wasn&apos;t just &quot;k3s install and call it a day.&quot; Something I could actually use to trial tooling, break things safely, and run real workloads. The kind of setup where if someone asked &quot;how does your secrets management work?&quot; the answer isn&apos;t &quot;I hardcoded them in a ConfigMap.&quot;

This is that setup. Single mini PC, four VMs, full GitOps pipeline, proper secrets management, and observability that would hold up in a real environment. I&apos;m going to walk through the entire build from unboxing to running workloads - every step, every gotcha.

The repo is public: [github.com/moabukar/homelab](https://github.com/moabukar/homelab)

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/k3s.svg&quot; alt=&quot;K3s logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Single Ace Magician mini PC running Proxmox VE bare metal
- 3-node K3s cluster (1 control plane + 2 workers) + dedicated Vault VM
- FluxCD for GitOps - push to main, everything reconciles
- HashiCorp Vault + External Secrets Operator for secrets (JWT auth, no manual secret management)
- Full observability: Prometheus, Grafana, Jaeger, OTel Collector
- Cloudflare Tunnel for external access without port forwarding
- Gateway API + Traefik for routing (not traditional Ingress)
- Everything as code, everything in Git


Why a Mini PC?
==============

I considered a few options before settling on this approach:

- **Raspberry Pi cluster** - I&apos;ve done this before ([wrote about it here](/blog/k3s-homelab-raspberry-pi)). It works, but ARM architecture means some container images aren&apos;t available and the Pi 5 tops out at 8GB RAM. Not enough for a full observability stack.
- **Old desktop/server** - Loud, power-hungry, takes up space. My wife would not have been happy.
- **Cloud VMs** - Defeats the purpose. I wanted something I own, on my network, that I can break and rebuild without a billing surprise.
- **Mini PC** - Small, quiet, x86_64, decent specs for under 300 quid. Winner.

The mini PC sits on my desk, plugged into the router. It draws about 15W idle. Silent. My wife doesn&apos;t even know it&apos;s there (she does now).


The Hardware
============

Here&apos;s everything I bought:

```
PART                    MODEL / SPEC                   APPROX COST
====                    ============                   ===========
Mini PC                 Ace Magician AM08 Pro           ~£280
CPU (included)          AMD Ryzen 7 7730U (8c/16t)      -
RAM (included)          30GB DDR4                        -
Storage (included)      500GB NVMe SSD                   -
USB drive               Any 8GB+ USB stick               ~£5
Ethernet cable          Cat6, 1m                         ~£3
```

That&apos;s it. Under 300 quid for a machine that runs four VMs comfortably.

The AM08 Pro comes with everything you need - the NVMe and RAM are pre-installed. No need to open it up. It ships with Windows 11, which we&apos;re going to nuke from orbit in about five minutes.

```
SPEC                VALUE
====                =====
Model               Ace Magician AM08 Pro
CPU                 AMD Ryzen 7 7730U (8c/16t, 2.0GHz base, 4.5GHz boost)
RAM                 30GB DDR4-3200
Storage             500GB NVMe PCIe 3.0
Network             Gigabit Ethernet (RJ45) + Wi-Fi 6
Ports               2x HDMI, 4x USB 3.0, 1x USB-C
Power               65W adapter, ~15W idle draw
Size                127 x 128 x 57mm (fits in your palm)
```

You&apos;ll also want a monitor and keyboard for the initial Proxmox install. After that, everything is headless via the Proxmox web UI.


Creating the Proxmox Boot USB
=============================

Proxmox VE is a Debian-based hypervisor. Free, open source, enterprise-grade. It runs on bare metal and gives you a web UI to manage VMs, containers, storage, and networking.

First, download the Proxmox VE ISO from the official site:

```
https://www.proxmox.com/en/downloads/proxmox-virtual-environment/iso
```

I used Proxmox VE 8.x. Download the ISO (about 1.2GB).

Next, flash it to a USB stick. On macOS:

```bash
# Find your USB drive
diskutil list

# Unmount it (replace diskN with your USB disk number)
diskutil unmountDisk /dev/diskN

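# Flash the ISO (use rdiskN for faster writes)
# macOS dd wants a lowercase size suffix, and older versions lack status=progress;
# press Ctrl+T while it runs to see progress instead
sudo dd if=proxmox-ve_8.x-x.iso of=/dev/rdiskN bs=4m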

# Eject when done
diskutil eject /dev/diskN
```

On Linux:

```bash
# Find your USB drive
lsblk

# Flash it
sudo dd if=proxmox-ve_8.x-x.iso of=/dev/sdX bs=4M status=progress conv=fdatasync
```

On Windows, use [Rufus](https://rufus.ie/) or [balenaEtcher](https://etcher.balena.io/) - drag the ISO in, select the USB, click flash. Done.


Installing Proxmox on the Mini PC
==================================

This is the &quot;nuke Windows&quot; step.

1. **Plug the USB into the mini PC.** Also connect a monitor (HDMI) and keyboard.

2. **Boot from USB.** Power on the mini PC and hammer the BIOS key. On the Ace Magician AM08 Pro, `F7` or `Del` gets you into the BIOS, where you move USB to the top of the boot order. Some machines use `F2`, `F10`, or `F12` - check your model.

3. **BIOS settings.** While you&apos;re in BIOS, make sure:
   - Virtualization is enabled (AMD-V / SVM Mode = Enabled). This is critical - Proxmox needs hardware virtualisation.
   - Secure Boot is disabled. Proxmox VE only gained Secure Boot support in 8.1, and turning it off sidesteps a whole class of boot problems.
   - Boot mode is set to UEFI (not Legacy/CSM).

4. **Boot the Proxmox installer.** Save BIOS settings and reboot. The Proxmox installer should load from USB.

5. **Walk through the installer:**
   - Accept the EULA
   - Select the target disk (the 500GB NVMe). This wipes everything - bye bye Windows.
   - Set your country, timezone, and keyboard layout
   - Set the root password and admin email
   - Configure networking:
     - Management interface: the Ethernet port (enp1s0 or similar)
     - Hostname: `pve.local` (or whatever you like)
     - IP: I used a static IP on my LAN - `192.168.1.100/24`
     - Gateway: `192.168.1.1` (your router)
     - DNS: `192.168.1.1` (or `1.1.1.1`)

6. **Install and reboot.** Takes about 5 minutes. Remove the USB when prompted.

7. **Access the web UI.** From any machine on your network, open:
   ```
   https://192.168.1.100:8006
   ```
   Log in with `root` and the password you set. You&apos;ll get a certificate warning - that&apos;s expected, click through it.

You now have a bare-metal hypervisor running on your mini PC. The monitor and keyboard can be disconnected - everything from here is via the web UI or SSH.


Setting Up Tailscale for Remote Access
======================================

I didn&apos;t want to be limited to my home network. Tailscale gives you a private WireGuard mesh network - install it on Proxmox and you can access the web UI from anywhere.

SSH into the Proxmox host (or use the web UI console):

```bash
# Install Tailscale
curl -fsSL https://tailscale.com/install.sh | sh

# Authenticate
tailscale up

# Note the Tailscale IP
tailscale ip -4
# e.g., 100.93.110.7
```

Now you can access Proxmox at `https://100.93.110.7:8006` from any device on your Tailscale network. Coffee shop, office, phone - doesn&apos;t matter.


Creating the Ubuntu Cloud-Init Template
========================================

Instead of installing Ubuntu manually on every VM, I created a cloud-init template once and clone it for each new VM. Cloud-init handles SSH keys, hostname, static IPs, and package installs automatically on first boot.

On the Proxmox host:

```bash
# Download Ubuntu 24.04 cloud image
wget https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img

# Create a new VM (ID 9000) that will become our template
qm create 9000 --name ubuntu-cloud --memory 2048 --cores 2 --net0 virtio,bridge=vmbr0

# Import the cloud image as a disk
qm importdisk 9000 noble-server-cloudimg-amd64.img local-lvm

# Attach the disk as SCSI
qm set 9000 --scsihw virtio-scsi-pci --scsi0 local-lvm:vm-9000-disk-0

# Add a cloud-init drive
qm set 9000 --ide2 local-lvm:cloudinit

# Set boot order to the SCSI disk
qm set 9000 --boot c --bootdisk scsi0

# Add a serial console (needed for cloud-init)
qm set 9000 --serial0 socket --vga serial0

# Configure cloud-init defaults
qm set 9000 --ciuser mo
qm set 9000 --sshkeys ~/.ssh/authorized_keys
qm set 9000 --ipconfig0 ip=dhcp

# Convert to template
qm template 9000
```

This template is the foundation for every VM in the lab. Clone it, customise the resources, set a static IP, and boot. Takes about 30 seconds per VM.


Provisioning the K3s VMs
=========================

Three VMs cloned from the template. Each gets a static IP, specific RAM allocation, and a resized disk.

```bash
# Clone for control plane (VM 101)
qm clone 9000 101 --name cp1 --full
qm set 101 --memory 6144 --cores 2
qm resize 101 scsi0 20G     # absolute size: 20GB total
qm set 101 --ipconfig0 ip=192.168.1.21/24,gw=192.168.1.1
qm start 101

# Clone for worker1 (VM 102)
qm clone 9000 102 --name worker1 --full
qm set 102 --memory 8192 --cores 2
qm resize 102 scsi0 20G     # absolute size: 20GB total
qm set 102 --ipconfig0 ip=192.168.1.22/24,gw=192.168.1.1
qm start 102

# Clone for worker2 (VM 103)
qm clone 9000 103 --name worker2 --full
qm set 103 --memory 8192 --cores 2
qm resize 103 scsi0 20G     # absolute size: 20GB total
qm set 103 --ipconfig0 ip=192.168.1.23/24,gw=192.168.1.1
qm start 103
```

Wait about a minute for cloud-init to finish, then SSH in:

```bash
ssh mo@192.168.1.21   # cp1
ssh mo@192.168.1.22   # worker1
ssh mo@192.168.1.23   # worker2
```

All three VMs boot in under a minute. Cloud-init sets the hostname, creates the user, installs the SSH key. No manual OS install, no clicking through wizards.

```
VM      NAME       IP               ROLE                RAM     DISK
==      ====       ==               ====                ===     ====
101     cp1        192.168.1.21     K3s control plane   6GB     20GB
102     worker1    192.168.1.22     K3s worker          8GB     20GB
103     worker2    192.168.1.23     K3s worker          8GB     20GB
104     vault      192.168.1.104    HashiCorp Vault     4GB     20GB
9000    template   -                Cloud-init base     -       -
```


The Vault VM (Terraform)
=========================

The K3s VMs were created manually with `qm` commands. For the Vault VM, I used Terraform with the `bpg/proxmox` provider. Partly because I wanted the Vault infrastructure to be reproducible, partly because I wanted to test the provider.

```hcl
resource &quot;proxmox_virtual_environment_vm&quot; &quot;vault&quot; {
  name      = &quot;vault&quot;
  node_name = &quot;pve&quot;
  vm_id     = 104

  clone { vm_id = 9000 }

  cpu    { cores = 2 }
  memory { dedicated = 4096 }
  agent  { enabled = false }  # no qemu-guest-agent

  initialization {
    ip_config {
      ipv4 {
        address = &quot;192.168.1.104/24&quot;
        gateway = &quot;192.168.1.1&quot;
      }
    }
    user_data_file_id = proxmox_virtual_environment_file.cloud_config.id
  }
}
```

The cloud-init user data installs Vault automatically on first boot:

```yaml
#cloud-config
packages:
  - gpg
  - wget
runcmd:
  - wget -O- https://apt.releases.hashicorp.com/gpg | gpg --dearmor -o /usr/share/keyrings/hashicorp-archive-keyring.gpg
  - echo &quot;deb [signed-by=/usr/share/keyrings/hashicorp-archive-keyring.gpg] https://apt.releases.hashicorp.com noble main&quot; &gt; /etc/apt/sources.list.d/hashicorp.list
  - apt-get update &amp;&amp; apt-get install -y vault
  - systemctl enable vault &amp;&amp; systemctl start vault
```

Two gotchas I hit:

1. **`agent.enabled = false`** - If you haven&apos;t installed qemu-guest-agent in the VM, you must set this to false. Otherwise Terraform hangs forever waiting for the agent to respond. Cost me an hour staring at a frozen terminal.

2. **`user_data_file_id` overrides SSH keys** - When using a cloud-init file for user data, the `user_account` block&apos;s SSH keys get ignored. Put your SSH key in the cloud-config file instead.
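
For the second gotcha, a minimal sketch of what that can look like in the cloud-config file. The `sudo` rule, the shell, and the key itself are placeholders for illustration, not the repo&apos;s actual file - swap in your own public key:

```yaml
#cloud-config
users:
  - name: mo
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL        # assumed - tighten to taste
    ssh_authorized_keys:
      # placeholder - paste the contents of your own public key file here
      - ssh-ed25519 AAAA...replace-me mo@laptop
```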

Run `terraform apply`, wait two minutes, and you have a Vault VM with Vault pre-installed and running.


K3s Cluster
===========

K3s v1.34.5 with Traefik disabled at install (re-enabled later with custom config via HelmChartConfig). Flannel for CNI. Single control plane node - it&apos;s a homelab, not a bank.

On the control plane (cp1):

```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=&quot;v1.34.5+k3s1&quot; \
  sh -s - server --disable traefik --write-kubeconfig-mode 644
```

Why disable Traefik? Because K3s bundles Traefik with default settings. I wanted to customise it (enable the dashboard, configure tracing, adjust resource limits), which is easier to do by disabling the bundled version and deploying it fresh with a HelmChartConfig.

Grab the join token:

```bash
cat /var/lib/rancher/k3s/server/node-token
```

On each worker:

```bash
curl -sfL https://get.k3s.io | INSTALL_K3S_VERSION=&quot;v1.34.5+k3s1&quot; \
  K3S_URL=https://192.168.1.21:6443 \
  K3S_TOKEN=&quot;&lt;token-from-above&gt;&quot; sh -
```

Workers joined within seconds. Cluster up and running in under 10 minutes.

```bash
$ kubectl get nodes
NAME      STATUS   ROLES                  AGE   VERSION
cp1       Ready    control-plane,master   10m   v1.34.5+k3s1
worker1   Ready    &lt;none&gt;                 8m    v1.34.5+k3s1
worker2   Ready    &lt;none&gt;                 7m    v1.34.5+k3s1
```


GitOps with FluxCD
==================

Everything runs through Flux. No `kubectl apply` in production - push to `main` and Flux reconciles.

Bootstrap Flux:

```bash
flux bootstrap github \
  --owner=moabukar \
  --repository=homelab \
  --branch=main \
  --path=clusters/homelab \
  --personal
```

This creates the repo structure and installs Flux controllers. From here, everything is managed by adding YAML to the repo.

The dependency chain is explicit and critical - get this wrong and things install in the wrong order and break:

```
flux-system
  └── gateway-api-crds
        └── infrastructure-controllers    (MetalLB, cert-manager, ESO, OTel, etc.)
              └── infrastructure-configs  (MetalLB pool, Gateway, ClusterIssuers, Cloudflared)
                    └── apps              (monitoring stack, Authentik, Home Assistant, etc.)
```

Each layer waits for the previous one using Flux&apos;s `dependsOn` and health checks. CRDs install before controllers that need them. Controllers install before configs that reference them. No race conditions, no &quot;apply it again and hope&quot; situations.

One thing I got wrong early on: I had `wait: true` on the controller Kustomizations. This blocks the entire reconciliation chain until every single resource is ready. The problem is some resources (like CRDs) don&apos;t have meaningful ready conditions, so Flux waits forever. Switched to explicit `healthChecks` targeting specific Deployments instead. Much more reliable.
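
To make that concrete, here&apos;s a hedged sketch of what one link in the chain can look like. The Kustomization names match the diagram above, but the `path`, interval, timeout, and the cert-manager Deployment picked as the health check are assumptions for illustration, not copied from the repo:

```yaml
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: infrastructure-controllers
  namespace: flux-system
spec:
  interval: 10m                          # assumed
  timeout: 5m                            # assumed
  path: ./infrastructure/controllers     # hypothetical path
  prune: true
  sourceRef:
    kind: GitRepository
    name: flux-system
  # Don&apos;t start until the Gateway API CRDs layer has reconciled
  dependsOn:
    - name: gateway-api-crds
  # Instead of wait: true, only gate on the Deployments that matter
  healthChecks:
    - apiVersion: apps/v1
      kind: Deployment
      name: cert-manager                 # assumed name and namespace
      namespace: cert-manager
```

Anything that lists `infrastructure-controllers` under its own `dependsOn` then holds off until this health check passes.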


Secrets Management
==================

This is the part I&apos;m most pleased with. Zero secrets in Git. Zero manual secret management.

The flow:

```
Terraform (random_password)
    │
    ▼
Vault KV v2 (secret/apps/*)
    │
    ▼
External Secrets Operator (ClusterSecretStore)
    │
    ▼
Kubernetes Secret
    │
    ▼
Reloader (rolling restart)
```

Terraform generates random passwords and writes them to Vault. ESO syncs them into K8s Secrets. Reloader watches for Secret changes and triggers rolling restarts on dependent pods. Add a new app secret? Add it to Terraform, run `terraform apply`, and ESO picks it up within an hour.


Vault Setup
-----------

Vault runs in dev mode... just kidding. Raft storage, TLS disabled (internal network only), UI enabled.

After the VM boots, initialise Vault:

```bash
vault operator init -key-shares=1 -key-threshold=1
```

This gives you an unseal key and root token. Store them safely (mine are saved to `/opt/vault/init.json` on the VM). A single key share is fine for a homelab - in production you&apos;d split the unseal key into multiple Shamir shares, 3-of-5 or similar.

```bash
vault operator unseal &lt;unseal-key&gt;
vault login &lt;root-token&gt;
```

Enable the KV v2 secrets engine:

```bash
vault secrets enable -path=secret kv-v2
```

The Vault Terraform config (`terraform/vault/config/`) manages everything else - policies, auth backends, and the actual secrets:

```hcl
resource &quot;random_password&quot; &quot;grafana_admin&quot; {
  length  = 32
  special = false
}

resource &quot;vault_kv_secret_v2&quot; &quot;grafana&quot; {
  mount = &quot;secret&quot;
  name  = &quot;apps/grafana&quot;
  data_json = jsonencode({
    admin-password = random_password.grafana_admin.result
    admin-user     = &quot;admin&quot;
  })
}
```

Every app gets a `random_password` resource and a corresponding Vault secret. No more &quot;changeme&quot; passwords.


JWT Auth (The Kubernetes Auth Detour)
-------------------------------------

The Vault authentication story deserves its own section because I wasted significant time on it.

I initially tried the `kubernetes` auth method. This is the &quot;standard&quot; way - Vault calls back to the K8s API to validate ServiceAccount tokens via TokenReview. Spent hours debugging 403 &quot;permission denied&quot; errors. The TokenReview chain between K3s and a Vault running outside the cluster had issues I never fully resolved. The K3s API server, the Vault kubernetes auth config, the CA certificates - something in that chain wasn&apos;t happy.

Switched to `jwt` auth instead. Completely different approach. Vault validates K8s ServiceAccount tokens locally using the cluster&apos;s JWKS public key. No network call back to the K8s API. Pure cryptographic verification.

To set it up:

```bash
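# Extract the JWKS public key from K3s
kubectl get --raw /openid/v1/jwks &gt; jwks.json

# Convert the JWK in jwks.json to PEM and save it as jwks-pem.pem
# (conversion step not shown here - Vault expects PEM, not the raw JWKS JSON)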

# Configure JWT auth in Vault
vault auth enable jwt
vault write auth/jwt/config \
  jwt_validation_pubkeys=&quot;$(cat jwks-pem.pem)&quot;

# Create a policy for ESO
vault policy write eso-read - &lt;&lt;EOF
path &quot;secret/data/*&quot; {
  capabilities = [&quot;read&quot;, &quot;list&quot;]
}
EOF

# Create a role for ESO
vault write auth/jwt/role/eso \
  role_type=&quot;jwt&quot; \
  bound_audiences=&quot;kubernetes.default.svc.cluster.local&quot; \
  user_claim=&quot;sub&quot; \
  policies=&quot;eso-read&quot; \
  ttl=&quot;1h&quot;
```

Then the ClusterSecretStore in K8s:

```yaml
apiVersion: external-secrets.io/v1
kind: ClusterSecretStore
metadata:
  name: vault-backend
spec:
  provider:
    vault:
      server: &quot;http://192.168.1.104:8200&quot;
      path: &quot;secret&quot;
      version: &quot;v2&quot;
      auth:
        jwt:
          path: &quot;jwt&quot;
          role: &quot;eso&quot;
          kubernetesServiceAccountToken:
            serviceAccountRef:
              name: external-secrets-operator
              namespace: external-secrets-system
            audiences:
              - kubernetes.default.svc.cluster.local
```

The key insight: JWKS keys are public. They&apos;re safe to commit to Git. They&apos;re the public half of the signing key - anyone can verify a token, but only the K8s API server can sign one. No chicken-and-egg problem for bootstrapping secrets.


ExternalSecrets in Practice
----------------------------

Each app that needs a secret gets an ExternalSecret manifest:

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: grafana-admin-password
  namespace: monitoring
spec:
  refreshInterval: 1h
  secretStoreRef:
    kind: ClusterSecretStore
    name: vault-backend
  target:
    name: grafana-admin-password
    creationPolicy: Owner
  data:
    - secretKey: admin-password
      remoteRef:
        key: apps/grafana
        property: admin-password
    - secretKey: admin-user
      remoteRef:
        key: apps/grafana
        property: admin-user
```

ESO reads from Vault every hour. If a secret changes in Vault, it updates the K8s Secret. Reloader (Stakater) watches the Secret and triggers a rolling restart on any Deployment with the annotation:

```yaml
metadata:
  annotations:
    reloader.stakater.com/auto: &quot;true&quot;
```

Rotate a password? Change it in Terraform, `terraform apply`, wait for ESO to sync, Reloader restarts the pod. Zero manual steps.


Networking
==========

Gateway API with Traefik as the controller. Not traditional Ingress - Gateway API is the future and Traefik supports it natively in K3s.


MetalLB
-------

K3s doesn&apos;t give you LoadBalancer IPs by default (unlike cloud providers). MetalLB fills that gap. L2 mode, IP pool of `192.168.1.200-250`:

```yaml
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: default-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.1.200-192.168.1.250
```

The Traefik Gateway grabs `192.168.1.200` and all HTTPRoutes hang off it.
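
In L2 mode the pool also needs an `L2Advertisement` before MetalLB will announce any of its addresses on the LAN. A minimal sketch, assuming the pool name above (the advertisement name is made up):

```yaml
apiVersion: metallb.io/v1beta1
kind: L2Advertisement
metadata:
  name: default-l2              # hypothetical name
  namespace: metallb-system
spec:
  ipAddressPools:
    - default-pool              # the IPAddressPool defined above
```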


Gateway API
-----------

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: traefik-gateway
  namespace: kube-system
spec:
  gatewayClassName: traefik
  listeners:
    - name: http
      protocol: HTTP
      port: 80
    - name: https
      protocol: HTTPS
      port: 443
      tls:
        certificateRefs:
          - name: wildcard-cert
```
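
One detail worth calling out: by default a Gateway listener only accepts routes from the Gateway&apos;s own namespace. For HTTPRoutes living in other namespaces (such as `monitoring`) to attach to a Gateway in `kube-system`, the listener needs an `allowedRoutes` stanza. A minimal sketch of the HTTP listener, assuming you&apos;re happy to accept routes from every namespace:

```yaml
spec:
  gatewayClassName: traefik
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: All    # default is Same; use Selector to restrict by namespace label
```

The same stanza applies to the HTTPS listener.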

Each service gets an HTTPRoute:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: grafana
  namespace: monitoring
spec:
  parentRefs:
    - name: traefik-gateway
      namespace: kube-system
  hostnames:
    - grafana.homelab.local
    - grafana.moabukar.co.uk
  rules:
    - backendRefs:
        - name: kube-prometheus-stack-grafana
          port: 80
```

```
ROUTE                        SERVICE
=====                        =======
grafana.homelab.local        Grafana dashboards
prometheus.homelab.local     Prometheus UI
alertmanager.homelab.local   Alertmanager
jaeger.homelab.local         Jaeger trace UI
traefik.homelab.local        Traefik dashboard
ha.homelab.local             Home Assistant
auth.homelab.local           Authentik SSO
it-tools.homelab.local       IT-Tools
```

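For local access, add an entry to `/etc/hosts` for each hostname above, pointing it at `192.168.1.200` - `/etc/hosts` doesn&apos;t do wildcards, so `*.homelab.local` won&apos;t work there. Or run a local DNS server with a wildcard record if you&apos;re fancy.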


External Access (Cloudflare Tunnel)
------------------------------------

I didn&apos;t want to port-forward or expose my home IP. Cloudflare Tunnel creates an outbound-only connection from the cluster to Cloudflare&apos;s edge. External traffic routes through `*.moabukar.co.uk` without any inbound firewall rules.

Cloudflared runs as a Deployment in the cluster. The tunnel token is stored in Vault and synced via ESO - no secrets in Git.

ExternalDNS watches HTTPRoute resources and automatically creates Cloudflare DNS records for any hostname matching `*.moabukar.co.uk`. Add a new HTTPRoute with a `.moabukar.co.uk` hostname, ExternalDNS creates the DNS record, Cloudflare Tunnel routes the traffic. Fully automated.


Observability
=============

Metrics and traces, wired up and talking to each other. Logs are the missing third pillar - Loki is on the to-do list.


Metrics
-------

kube-prometheus-stack provides Prometheus (10Gi storage, 7d retention), Grafana, and Alertmanager. Deployed via Flux HelmRelease.

Custom Grafana dashboards are stored as JSON files in the repo. Kustomize&apos;s `configMapGenerator` creates ConfigMaps with the `grafana_dashboard: &quot;1&quot;` label, and Grafana&apos;s sidecar auto-discovers them. Add a dashboard? Drop a JSON file in the repo, push to main, done.
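
A rough sketch of the Kustomize side, assuming a `dashboards/` folder next to the `kustomization.yaml` and a made-up file name:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
configMapGenerator:
  - name: dashboard-homelab-overview            # hypothetical name
    files:
      - dashboards/homelab-overview.json        # hypothetical path
generatorOptions:
  disableNameSuffixHash: true    # keep ConfigMap names stable across changes
  labels:
    grafana_dashboard: &quot;1&quot;       # the label Grafana&apos;s sidecar looks for
```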

Dashboards deployed:

- Homelab Overview (custom - cluster-wide resource usage at a glance)
- Flux GitOps (custom - reconciliation status, errors, sync times)
- Traefik (from grafana.net - request rates, error rates, latencies)
- Node Exporter Full (from grafana.net - deep-dive per-node metrics)
- K8s Cluster (from grafana.net - namespace and pod-level overview)
- K8s Pods (from grafana.net - per-pod resource usage)

10 custom PrometheusRule alerts covering the things that actually matter:

```
ALERT                          CONDITION
=====                          =========
NodeHighCPU                    &gt; 80% for 5m
NodeHighMemory                 &gt; 85% for 5m
NodeDiskAlmostFull             &gt; 80%
NodeDown                       Unreachable for 5m
PodCrashLooping                &gt; 5 restarts in 15m
PodNotReady                    Not ready for 10m
DeploymentReplicasMismatch     For 10m
FluxReconciliationFailure      Any Kustomization/HR failing
TraefikHighErrorRate           &gt; 5% 5xx for 5m
CertificateExpiringSoon        &lt; 7 days
```
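
As an illustration of how one row translates into a rule, here&apos;s a sketch of `NodeHighCPU`. The expression is the common node-exporter idle-CPU pattern rather than the exact one from the repo, and the resource name, `release` label, and severity are assumptions:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: homelab-node-alerts              # hypothetical name
  namespace: monitoring
  labels:
    release: kube-prometheus-stack       # assumed - must match the operator&apos;s ruleSelector
spec:
  groups:
    - name: node
      rules:
        - alert: NodeHighCPU
          # 100 minus the idle percentage, averaged per node over 5 minutes
          expr: |
            100 - (avg by (instance) (rate(node_cpu_seconds_total{mode=&quot;idle&quot;}[5m])) * 100) &gt; 80
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: &quot;CPU above 80% on {{ $labels.instance }}&quot;
```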

Alertmanager sends notifications to Telegram via a bot. HTML-formatted messages with alert name, severity, namespace, and a direct link to the relevant Grafana dashboard.


Traces
------

OpenTelemetry Collector runs as a DaemonSet on every node. Receives OTLP traces on port 4317 (gRPC) and 4318 (HTTP), enriches with Kubernetes metadata (pod, namespace, node, deployment), and exports to Jaeger.

The OTel Collector pipeline:

```
receivers:
  otlp (gRPC + HTTP)
    │
processors:
  k8sattributes    (add pod/namespace/node labels)
  resource         (add cluster name)
  transform        (derive service.name from k8s metadata)
  batch            (batch before export)
    │
exporters:
  otlp/jaeger      (Jaeger&apos;s OTLP endpoint)
```
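
In actual collector config, that pipeline looks roughly like the sketch below. The cluster name, the Jaeger endpoint, and the OTTL statement are assumptions to show the shape, not the repo&apos;s exact config:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  k8sattributes:
    extract:
      metadata:
        - k8s.pod.name
        - k8s.namespace.name
        - k8s.node.name
        - k8s.deployment.name
  resource:
    attributes:
      - key: k8s.cluster.name
        value: homelab                 # assumed cluster name
        action: upsert
  transform:
    trace_statements:
      - context: resource
        statements:
          # use the Deployment name as service.name when the metadata is present
          - set(attributes[&quot;service.name&quot;], attributes[&quot;k8s.deployment.name&quot;]) where attributes[&quot;k8s.deployment.name&quot;] != nil
  batch: {}

exporters:
  otlp/jaeger:
    endpoint: jaeger-collector.observability.svc.cluster.local:4317   # hypothetical Service
    tls:
      insecure: true                   # plain gRPC inside the cluster

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [k8sattributes, resource, transform, batch]
      exporters: [otlp/jaeger]
```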

One critical thing I missed initially: the DaemonSet had no Service. The pods were running fine, collecting traces from their own nodes, but application pods couldn&apos;t send traces to the collector because there was no Service to resolve. Added a `ClusterIP` Service on ports 4317/4318 and everything clicked.
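
For reference, a minimal sketch of that Service. The name, namespace, and selector label are assumptions - the selector has to match whatever labels your DaemonSet pods actually carry:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: otel-collector                          # hypothetical name
  namespace: observability                      # hypothetical namespace
spec:
  type: ClusterIP
  selector:
    app.kubernetes.io/name: otel-collector      # assumed pod label
  ports:
    - name: otlp-grpc
      port: 4317
      targetPort: 4317
      protocol: TCP
    - name: otlp-http
      port: 4318
      targetPort: 4318
      protocol: TCP
```

With those names, apps would set `OTEL_EXPORTER_OTLP_ENDPOINT` to `http://otel-collector.observability.svc.cluster.local:4317`.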

Jaeger uses Badger persistent storage (5Gi PVC, 72h retention). Another gotcha - Jaeger needs a ClusterIP Service, not headless, for OTel gRPC export. Headless Services don&apos;t work with gRPC load balancing the way you&apos;d expect.

What&apos;s sending traces:

- **Traefik** - Every HTTP request through the gateway gets a trace automatically. This is the backbone - you can trace a request from ingress all the way through.
- **Grafana** - Internal operation traces (dashboard loads, query execution).
- **OTel Demo** - The official OpenTelemetry Astronomy Shop demo app. Generates realistic cross-service traces across Go, Python, Java, Node.js, and more. Great for testing trace visualisation and understanding distributed tracing patterns.


What&apos;s Also Running
===================

Beyond the core platform:

- **Home Assistant** - Home automation. Runs with host networking and privileged access (needs USB/Bluetooth for Zigbee). Pinned to cp1 node via node affinity (PV is local-path, tied to that node).
- **IT-Tools** - Handy web-based developer tools (base64 encode/decode, JWT debugger, cron expression parser, etc.). 2 replicas.
- **Authentik** - SSO provider. PostgreSQL + Redis backend. Currently broken - the PostgreSQL PVC was initialised with an empty password, then Vault generated a new one. Needs a PVC reset to fix the mismatch.
- **Podinfo** - Smoke test app. If podinfo works, the cluster works.
- **cert-manager** - TLS certificates. ClusterIssuers for Let&apos;s Encrypt prod and staging.


What I&apos;d Do Differently
========================

A few things I&apos;d change if I started over:

- **Skip kubernetes auth for Vault entirely.** JWT auth is simpler, more reliable with K3s, and has zero network dependencies. Don&apos;t waste time debugging TokenReview.
- **Set resource limits from the start.** I added them after the fact based on `kubectl top` data. Should have done it from day one. Without limits, one misbehaving pod can starve the entire node.
- **Use Terraform for all VMs, not just Vault.** The K3s VMs were created manually with `qm` commands. Works, but it means rebuilding them requires remembering the exact steps.
- **Pin the Traefik version.** K3s bundles Traefik and upgrades it on K3s updates. Pin it if you care about stability.
- **Start with 4GB RAM for Vault.** I initially gave it 2GB, and it hit 100% memory under load. Rebuilt it with 4GB. Save yourself the rebuild.
- **Don&apos;t use `wait: true` in Flux Kustomizations.** Use explicit `healthChecks` targeting specific Deployments instead. `wait: true` blocks on every resource, including ones that don&apos;t have meaningful ready conditions.


What&apos;s Next
===========

- Authentik SSO wired into Grafana (OAuth2 proxy)
- Backup strategy for etcd and PVs
- Network policies (currently everything is wide open within the cluster)
- Longhorn for replicated storage (currently local-path, no redundancy)
- More Grafana dashboards as I add services
- Loki for log aggregation (Promtail DaemonSet, LogQL in Grafana)

The full repo is at [github.com/moabukar/homelab](https://github.com/moabukar/homelab). Every manifest, every config, every Terraform file. Fork it, break it, make it yours.


========================================
Proxmox + K3s + Vault + FluxCD
========================================
GitOps all the way down.
========================================</content:encoded><category>kubernetes</category><category>k3s</category><category>homelab</category><category>gitops</category><category>fluxcd</category><category>hashicorp-vault</category><author>Mo Abukar</author></item><item><title>OpenTelemetry Changed How I Think About Observability</title><link>https://moabukar.co.uk/blog/opentelemetry-practical-guide/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/opentelemetry-practical-guide/</guid><description>A practical, opinionated take on OpenTelemetry - why it matters, what it actually solves, and how to instrument across Kubernetes, Lambda, ECS, and EC2 without losing your mind.</description><pubDate>Wed, 04 Mar 2026 00:00:00 GMT</pubDate><content:encoded>I&apos;ve spent the last decade watching teams build observability backwards.

They ship a service. It breaks at 2am. Someone spends three hours grepping CloudWatch logs in one tab, checking Kubernetes pod logs in another, and praying the timestamps line up. Then they bolt on monitoring as a &quot;fast follow&quot; that never actually follows.

OpenTelemetry fixes this. Not in a theoretical, conference-talk kind of way. In a &quot;your on-call engineer stops dreading pages&quot; kind of way.

I built an [observability lab](https://github.com/moabukar/otel-demo) that instruments services across Kubernetes, AWS Lambda, ECS Fargate, and EC2 - all running locally with Kind and LocalStack. This post is what I learned, what I think about OTel, and why I believe it&apos;s the most important shift in observability since Prometheus.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/opentelemetry.svg&quot; alt=&quot;OpenTelemetry logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## The Real Problem Nobody Talks About

Most production environments are a mess of compute types. You&apos;ve got containers on Kubernetes, Lambda functions for event-driven work, ECS tasks for batch processing, maybe some EC2 instances running things nobody wants to touch. Each one has its own logging format, its own metrics system, and its own way of not doing tracing.

Here&apos;s what that actually looks like:

- CloudWatch for Lambda and ECS, but with different log formats
- Prometheus for Kubernetes, but no trace correlation
- X-Ray sometimes, but only if someone bothered to instrument it
- Datadog or New Relic agents on EC2, but they don&apos;t talk to the Kubernetes stack
- Three dashboards open, none of them telling the full story

I&apos;ve lived this. Multiple times. At different companies. It&apos;s not a tooling problem - it&apos;s a fragmentation problem. And you can&apos;t solve fragmentation by adding more tools.

## Why OpenTelemetry Actually Matters

OTel isn&apos;t just another monitoring library. It&apos;s a standardisation layer. That distinction matters more than people realise.

### One SDK, Every Compute Type

The Python app running on ECS Fargate uses the same `opentelemetry-sdk` as the Lambda function. The Go service on EC2 uses the same `otel` package as the one in Kubernetes. You learn the instrumentation API once and it works everywhere.

In the lab, I instrumented five services across four compute types. The instrumentation patterns were nearly identical regardless of where the code runs. That&apos;s not a small thing. That&apos;s the difference between &quot;observability is easy&quot; and &quot;observability is another project.&quot;

### Vendor Neutrality (For Real This Time)

I&apos;ve watched teams spend quarters migrating from one observability vendor to another. Datadog to Grafana Cloud. New Relic to Honeycomb. Each time, it means touching every service, changing imports, updating configurations, and hoping nothing breaks.

With OTel, your instrumentation is decoupled from your backend. The Collector sits in the middle. Today it exports to Jaeger and Prometheus. Tomorrow you can swap in Grafana Tempo or Honeycomb without touching a single line of application code. The collector config changes; the app doesn&apos;t.

This isn&apos;t theoretical. I&apos;ve done it. Changing backends is a YAML edit, not a quarter-long migration project.

### Context Propagation Across Everything

Here&apos;s where it gets genuinely powerful. A trace that starts in a Lambda function, hits an ECS task, and finishes in a Kubernetes pod - OTel&apos;s context propagation (W3C TraceContext) carries the trace ID across all of them. One distributed trace spanning three different compute platforms.

Try doing that with CloudWatch alone. I&apos;ll wait.
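
To make the mechanism concrete, here is a minimal sketch of the W3C `traceparent` header that carries the trace ID from one hop to the next. It uses only the Python standard library rather than the OTel SDK, and the helper names (`new_traceparent`, `child_traceparent`, `trace_id_of`) are made up for illustration. In a real service the SDK propagators do this for you.

```python
import secrets


def new_traceparent(sampled=True):
    """Start a new trace: version-traceid-spanid-flags, all lowercase hex."""
    trace_id = secrets.token_hex(16)  # 16 bytes = 32 hex chars
    span_id = secrets.token_hex(8)    # 8 bytes = 16 hex chars
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{span_id}-{flags}"


def child_traceparent(incoming):
    """Continue an incoming trace: keep the trace ID, mint a new span ID."""
    version, trace_id, _parent_span_id, flags = incoming.split("-")
    return f"{version}-{trace_id}-{secrets.token_hex(8)}-{flags}"


def trace_id_of(header):
    return header.split("-")[1]


# Lambda starts the trace, the ECS task and the Kubernetes pod continue it
lambda_header = new_traceparent()
ecs_header = child_traceparent(lambda_header)
k8s_header = child_traceparent(ecs_header)
print(trace_id_of(lambda_header) == trace_id_of(k8s_header))
```

Every hop gets its own span ID, but the trace ID never changes. That shared ID is what lets the backend stitch three platforms into one trace.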

## The Architecture That Works

After a lot of iteration, I landed on a two-tier collector architecture that I think is the right pattern for most teams:

```
Apps on K8s  ──→  DaemonSet Collector (per-node)  ──→  Gateway Collector  ──→  Backends
Lambda       ──→                                       ↑
ECS          ──→  Sidecar Collector                ────┘
EC2          ──→  Direct OTLP                      ────┘
```

**DaemonSet Collectors** run on each Kubernetes node. They receive telemetry from local pods via `hostPort` on 4317. This keeps network hops minimal and gives you per-node processing if you need it.

**Gateway Collector** is the central aggregation point. It receives from the DaemonSet collectors plus external sources (Lambda, ECS, EC2) and fans out to your backends - Jaeger for traces, Prometheus for metrics.

Why two tiers? Because you want to batch and process locally before sending to the gateway. It reduces network traffic, gives you a place to add sampling, and means your external sources (Lambda, ECS) have a single stable endpoint to target.

### The Collector Config

The config is deceptively simple:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 1024
  resource:
    attributes:
      - key: k8s.cluster.name
        value: otel-demo
        action: insert

exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: &quot;0.0.0.0:8889&quot;
    resource_to_telemetry_conversion:
      enabled: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [resource, batch]
      exporters: [prometheus]
```

Receivers, processors, exporters, pipelines. That&apos;s the mental model. Everything else is configuration details.
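
If the mental model helps more as code, here is a toy version of one pipeline in plain Python. It is a sketch of the data flow only, not how the Collector is implemented: the function names are hypothetical, spans are plain dicts, and the batch step only models `send_batch_size` (no timeout).

```python
def insert_resource_attrs(defaults):
    """Like the resource processor insert action: add a key only if it is absent."""
    def process(spans):
        return [{**defaults, **span} for span in spans]
    return process


def batch(size):
    """Group items into lists of at most `size` (size-based batching only)."""
    def process(spans):
        return [spans[i:i + size] for i in range(0, len(spans), size)]
    return process


def run_pipeline(received, processors, exporters):
    data = received
    for process in processors:   # processors run in the order listed
        data = process(data)
    for export in exporters:     # every exporter gets the same processed data
        export(data)
    return data


exported = []
run_pipeline(
    [{"name": "GET /orders"}, {"name": "POST /pay", "k8s.cluster.name": "prod"}],
    processors=[insert_resource_attrs({"k8s.cluster.name": "otel-demo"}), batch(1024)],
    exporters=[exported.append],
)
print(exported)
```

The order of the `processors` list matters in the same way it does in the YAML: attributes are added first, then the enriched spans are batched.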

## Instrumenting Go and Python - What It Actually Looks Like

### Go

The Go instrumentation is clean. You initialise the SDK once, get a tracer and meter, and use them throughout your app:

```go
tracer := otel.Tracer(&quot;demo-go-app&quot;)
meter := otel.Meter(&quot;demo-go-app&quot;)

// Wrap your HTTP server
handler := otelhttp.NewHandler(mux, &quot;go-demo-server&quot;)

// Create manual spans for business logic
func (o *OrderService) ProcessOrder(ctx context.Context, orderID string) error {
    // Keep the ctx returned by Start so child spans nest under this one
    ctx, span := o.tracer.Start(ctx, &quot;order_service.process_order&quot;)
    defer span.End()

    span.SetAttributes(
        attribute.String(&quot;order.id&quot;, orderID),
    )

    // Your business logic here
    o.simulatePaymentProcessing(ctx, order)
    o.simulateInventoryCheck(ctx, order)

    return nil
}
```

The `otelhttp` middleware handles HTTP spans automatically. For business logic - order processing, payment simulation, inventory checks - you create manual spans. Each span becomes a node in the trace waterfall.

What I like about the Go SDK: it&apos;s explicit. You pass context everywhere (which you should be doing anyway), and the span hierarchy falls out naturally.

### Python

Python&apos;s auto-instrumentation is where OTel really shines for getting started fast:

```python
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.requests import RequestsInstrumentor

FlaskInstrumentor().instrument()
RequestsInstrumentor().instrument()
```

Two lines and every Flask endpoint and outbound HTTP call is traced. Then you add manual spans for business logic:

```python
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span(&quot;complex_processing&quot;) as span:
    with tracer.start_as_current_span(&quot;validation&quot;) as validation_span:
        validation_span.set_attribute(&quot;validation.rules_checked&quot;, 5)
        # validate...

    with tracer.start_as_current_span(&quot;external_api_call&quot;) as api_span:
        api_span.set_attribute(&quot;api.service&quot;, &quot;enrichment-service&quot;)
        # call external service...
```

The context manager pattern makes it almost impossible to forget to close a span. The nesting creates parent-child relationships automatically.

## The Lambda Problem (And How OTel Solves It)

Lambda is where observability traditionally falls apart. Short-lived functions, cold starts, CloudWatch being the only native option. X-Ray exists but requires its own SDK and doesn&apos;t talk to your Kubernetes tracing.

With OTel, the Lambda function initialises the SDK on cold start and exports traces via OTLP to the same collector gateway your Kubernetes services use. Same trace format, same backend, same Jaeger UI. A trace that starts with an API Gateway request, triggers a Lambda, and calls an ECS task shows up as one unified trace.

Cold start vs warm invocation? Track it as a span attribute. Now you can actually measure cold start impact across your Lambda fleet instead of guessing.
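
A minimal sketch of the cold start flag, assuming a Python Lambda. State created at import time survives warm invocations of the same sandbox, so the first call sees `True` and every later call sees `False`. The attribute name `faas.coldstart` follows the OTel semantic conventions. `make_handler` is a hypothetical helper, and the dict stands in for attributes you would set on the real invocation span.

```python
def make_handler():
    """Build a handler that remembers whether it has been invoked before."""
    state = {"cold": True}

    def handler(event, context=None):
        attributes = {"faas.coldstart": state["cold"]}
        state["cold"] = False  # every later call in this sandbox is warm
        # With the SDK in place: span.set_attribute(key, value) for each entry
        return attributes

    return handler


handler = make_handler()  # runs once per sandbox, at import time
print(handler({}))
print(handler({}))
```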

## Opinions - The Stuff Nobody Puts in Documentation

### Start With Traces, Not Metrics

Every OTel tutorial starts with &quot;three pillars of observability: traces, metrics, and logs.&quot; That&apos;s technically correct but practically useless for prioritisation.

Start with traces. They give you the most bang for your instrumentation effort. A single distributed trace tells you more about a request failure than a hundred metric dashboards. Once you have traces, you can derive metrics from them (RED metrics from span data). Logs come last - and honestly, structured log attributes attached to spans are more useful than standalone log lines.
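
As a rough illustration of deriving RED metrics from span data, here is a sketch over a list of finished spans for one service. The span shape (`status`, `duration_ms`) and the function name are assumptions for the example. In practice a collector component or your backend would do this aggregation.

```python
from statistics import median


def red_metrics(spans, window_seconds):
    """Rate, Errors, Duration for one service over a time window."""
    total = len(spans)
    errors = sum(1 for span in spans if span["status"] == "ERROR")
    return {
        "rate_per_second": total / window_seconds,
        "error_ratio": errors / total if total else 0.0,
        "median_duration_ms": median(s["duration_ms"] for s in spans) if total else None,
    }


spans = [
    {"status": "OK", "duration_ms": 100},
    {"status": "OK", "duration_ms": 200},
    {"status": "ERROR", "duration_ms": 300},
    {"status": "OK", "duration_ms": 400},
]
print(red_metrics(spans, window_seconds=2))
```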

### Auto-Instrumentation Is Table Stakes, Not The Goal

Auto-instrumentation gets you HTTP spans and database calls for free. That&apos;s great for getting started. But the real value comes from manual spans on your business logic.

Nobody cares that an HTTP POST took 450ms. They care that payment processing took 200ms, inventory check took 150ms, and the remaining 100ms was validation. That level of detail requires manual instrumentation. Don&apos;t skip it.

### The Collector Is Your Best Friend

Run the collector. Always. Don&apos;t export directly from your application to your backend.

The collector gives you:
- Batching (reduces network overhead)
- Retry logic (your app doesn&apos;t block on export failures)
- Sampling (tail sampling at the collector level is powerful)
- Backend routing (send traces to Jaeger and metrics to Prometheus from one collector)
- A buffer between your apps and your backends (backend down? Collector queues)

Exporting directly from the app to the backend is fine for a tutorial. In production, it&apos;s a reliability risk.
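
Tail sampling is the item on that list that cannot live in the app, because the keep-or-drop decision needs the whole trace. A simplified sketch of the idea, assuming traces arrive as a dict of trace ID to finished spans. The 500ms threshold and the span shape are hypothetical, and this is not the Collector implementation.

```python
def tail_sample(traces, slow_ms=500):
    """Keep whole traces that contain an error or a slow span; drop the rest."""
    kept = {}
    for trace_id, spans in traces.items():
        has_error = any(span["status"] == "ERROR" for span in spans)
        is_slow = any(span["duration_ms"] >= slow_ms for span in spans)
        if has_error or is_slow:
            kept[trace_id] = spans
    return kept


traces = {
    "a1": [{"status": "OK", "duration_ms": 40}],
    "b2": [{"status": "OK", "duration_ms": 30}, {"status": "ERROR", "duration_ms": 12}],
    "c3": [{"status": "OK", "duration_ms": 900}],
}
print(sorted(tail_sample(traces)))
```

The boring, fast, successful trace is dropped. The failing one and the slow one survive, and those are the ones you want at 2am.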

### Resource Attributes Are Underrated

Every span and metric should carry resource attributes: `service.name`, `deployment.environment`, `k8s.namespace.name`, `cloud.platform`. These are what make your data filterable and actionable.

When something breaks at 3am, you want to filter traces by environment, service, and namespace without writing complex queries. Resource attributes make that possible. Invest the five minutes to set them up properly during initialisation.

### Semantic Conventions Matter

OTel has [semantic conventions](https://opentelemetry.io/docs/concepts/semantic-conventions/) for common attributes. Use them. `http.method`, `db.system`, `db.operation` - these aren&apos;t suggestions. They&apos;re what makes your telemetry interoperable across services written by different teams in different languages.

When your Go service records `db.system: postgresql` and your Python service records `database_type: postgres`, you&apos;ve lost the ability to query across services. Semantic conventions prevent this.

## What Actually Changes When You Adopt OTel

I&apos;ll be direct about the measurable improvements I&apos;ve seen:

**MTTR drops dramatically.** Before OTel, debugging a cross-service issue meant bouncing between tools and logs for 45-90 minutes. With distributed tracing, you get an alert with a trace ID, open it in Jaeger, see the failing span, read the error. Five to fifteen minutes. That&apos;s not a marginal improvement - it&apos;s a step change.

**On-call gets less painful.** &quot;Something&apos;s broken, start digging&quot; becomes &quot;Span X in service Y failed with error Z, here&apos;s the trace.&quot; Engineers stop dreading pages because they have context to act immediately.

**New services come pre-instrumented.** When OTel is in your service template, instrumentation is part of the first PR, not a &quot;we&apos;ll add monitoring later&quot; ticket that sits in the backlog for six months.

**Vendor migrations become boring.** And boring is exactly what you want for infrastructure changes.

## Try It Yourself

The full lab is at [github.com/moabukar/otel-demo](https://github.com/moabukar/otel-demo). One `make setup` command gives you:

- A Kind cluster with Go and Python services, fully instrumented
- OTel Collector in DaemonSet + Gateway topology
- Jaeger for traces, Prometheus for metrics, Grafana for dashboards
- LocalStack simulating Lambda, ECS, and EC2 workloads
- All sending telemetry to the same pipeline

Run `make traffic` to generate requests and open Jaeger at `localhost:16686`. Click through a few traces. See how spans nest, how context propagates across services, how errors are highlighted.

That ten minutes will teach you more about OpenTelemetry than any conference talk.

## Final Thought

Observability isn&apos;t about having the most dashboards or the fanciest tooling. It&apos;s about answering &quot;what&apos;s broken and why&quot; as fast as possible.

OpenTelemetry doesn&apos;t give you observability. It gives you the foundation to build observability that actually works - across languages, across compute types, across vendors. It&apos;s the unsexy standardisation layer that makes everything else possible.

And in my experience, the unsexy infrastructure decisions are the ones that compound the most over time.

If you&apos;re still running three different monitoring stacks for three different compute types, you&apos;re paying triple - in money, in cognitive overhead, and in MTTR. OpenTelemetry is the way out.

The code is open source. Go break it.</content:encoded><category>opentelemetry</category><category>observability</category><category>kubernetes</category><category>aws</category><category>devops</category><category>platform-engineering</category><category>monitoring</category><author>Mo Abukar</author></item><item><title>AWS Control Tower Account Factory - The Gotchas Nobody Tells You</title><link>https://moabukar.co.uk/blog/aws-control-tower-account-factory-gotchas/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/aws-control-tower-account-factory-gotchas/</guid><description>Real-world lessons from automating AWS account provisioning with Control Tower, Service Catalog, and Terraform. The silent failures, IAM traps, and StackSet timing issues that cost us days.</description><pubDate>Tue, 24 Feb 2026 00:00:00 GMT</pubDate><content:encoded>AWS Control Tower Account Factory - The Gotchas Nobody Tells You
================================================================

AWS Control Tower&apos;s Account Factory sounds straightforward. You define an OU structure, wire up Service Catalog, and Terraform handles the rest. New accounts on demand.

In practice, it&apos;s a minefield of silent failures, IAM permission gaps, and timing issues that aren&apos;t in the documentation. I recently automated account provisioning for a client&apos;s multi-account setup and hit every single one of these.

This post isn&apos;t a setup guide. It&apos;s the list of things that broke, why they broke, and how to fix them - so you don&apos;t waste the same days I did.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/aws.svg&quot; alt=&quot;AWS logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Service Catalog products silently hang if portfolio associations are wrong
- The AWSControlTowerExecution role can get deleted by failed provisioned product terminations
- StackSets have eventual consistency - your automation will race against them
- IAM session duration limits will bite you mid-provisioning
- SSO access isn&apos;t automatic after enrollment - you need to wire it yourself
- Always verify your actual IAM role names against what&apos;s in your templates


Architecture Context
====================

The setup in question:

```
Management Account
├── Control Tower (landing zone)
├── Service Catalog (Account Factory product)
├── CloudFormation StackSets (baseline deployment)
│
├── Platform OU
│   └── DevOps Sandbox (existing)
│
├── Sandbox OU
│   └── New accounts provisioned here
│
├── Staging OU
├── Prod OU
│
└── Security OU
    ├── Log Archive
    └── Audit
```

Terraform provisions new accounts through the `aws_servicecatalog_provisioned_product` resource, which triggers Account Factory under the hood. A CloudFormation StackSet auto-deploys IAM roles into every new account.

Simple on paper. Brutal in practice.


Gotcha 1: Service Catalog Portfolio Associations
=================================================

This one cost the most time because the failure mode is completely silent.

When you provision a product through Service Catalog, the IAM role making the API call must be a principal on the portfolio that contains the product. Not just on the product - on the portfolio.

If the association is missing, Terraform doesn&apos;t throw an error. The `aws_servicecatalog_provisioned_product` resource just... hangs. No timeout. No failure. It sits at `UNDER_CHANGE` until you kill it.

```hcl
# This is the bit most people forget
resource &quot;aws_servicecatalog_principal_portfolio_association&quot; &quot;provisioning_role&quot; {
  portfolio_id   = data.aws_servicecatalog_portfolio.account_factory.id
  principal_arn  = aws_iam_role.provisioning.arn
  principal_type = &quot;IAM&quot;
}
```

You need this association BEFORE any provisioned product resource runs. Put it in a separate module or use `depends_on` explicitly.

```hcl
resource &quot;aws_servicecatalog_provisioned_product&quot; &quot;new_account&quot; {
  name                     = &quot;platform-engineering-sandbox&quot;
  product_id               = data.aws_servicecatalog_product.account_factory.id
  provisioning_artifact_id = data.aws_servicecatalog_product.account_factory.default_provisioning_artifact_id

  provisioning_parameters {
    key   = &quot;AccountName&quot;
    value = &quot;platform-engineering-sandbox&quot;
  }

  provisioning_parameters {
    key   = &quot;AccountEmail&quot;
    value = &quot;aws+pe-sandbox@company.com&quot;
  }

  provisioning_parameters {
    key   = &quot;ManagedOrganizationalUnit&quot;
    value = &quot;Sandbox&quot;
  }

  provisioning_parameters {
    key   = &quot;SSOUserEmail&quot;
    value = &quot;admin@company.com&quot;
  }

  provisioning_parameters {
    key   = &quot;SSOUserFirstName&quot;
    value = &quot;Platform&quot;
  }

  provisioning_parameters {
    key   = &quot;SSOUserLastName&quot;
    value = &quot;Admin&quot;
  }

  # Critical - ensure portfolio association exists first
  depends_on = [
    aws_servicecatalog_principal_portfolio_association.provisioning_role
  ]
}
```

**How to check:** In the AWS Console, go to Service Catalog &gt; Portfolios &gt; your Account Factory portfolio &gt; Access tab. Your provisioning role should be listed there.
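
The same check can be scripted as a pre-flight step. This sketch assumes you have already fetched the portfolio principals, in the shape the Service Catalog `ListPrincipalsForPortfolio` API returns (a list of entries with `PrincipalARN` and `PrincipalType`). The function name and the account ID are placeholders.

```python
def role_can_provision(principals, role_arn):
    """True if role_arn is associated with the portfolio as an IAM principal."""
    return any(
        p.get("PrincipalARN") == role_arn and p.get("PrincipalType") == "IAM"
        for p in principals
    )


principals = [
    {"PrincipalARN": "arn:aws:iam::111111111111:role/terraform-provisioning", "PrincipalType": "IAM"},
]
print(role_can_provision(principals, "arn:aws:iam::111111111111:role/terraform-provisioning"))
print(role_can_provision(principals, "arn:aws:iam::111111111111:role/some-other-role"))
```

Failing the pipeline on a `False` here is much cheaper than waiting on a provisioned product that never leaves `UNDER_CHANGE`.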


Gotcha 2: The AWSControlTowerExecution Role Deletion Trap
==========================================================

When you provision an account through Account Factory, Control Tower creates the `AWSControlTowerExecution` role in the new account. This role is how Control Tower manages the account going forward - baseline deployments, guardrails, drift detection.

Here&apos;s the trap: if a provisioned product enters a failed state and you terminate it, the termination process can delete this role from the account. But the account itself still exists in your organisation. Now you have an account that Control Tower can&apos;t manage.

```
SEQUENCE OF PAIN
================

1. Provision account via Service Catalog        → Account created
2. Provisioning fails halfway (timeout, perms)  → Status: TAINTED/ERROR
3. Terminate the provisioned product             → Role deleted
4. Try to re-provision or enroll the account     → Fails (no execution role)
5. Try to create the role manually               → Needs access to the account
6. Can&apos;t access the account                      → No SSO, no role, nothing
```

The fix is to NEVER terminate a failed provisioned product if the account was actually created. Instead:

1. Check if the account exists in AWS Organizations
2. If it does, enroll it through the Control Tower console (not Terraform)
3. Import the state into Terraform after enrollment

```bash
# Check if the account was actually created
aws organizations list-accounts \
  --query &quot;Accounts[?Name==&apos;platform-engineering-sandbox&apos;]&quot;

# If it exists, enroll via console, then import
terraform import \
  aws_servicecatalog_provisioned_product.new_account \
  pp-xxxxxxxxxxxxx
```
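
The decision itself is small enough to encode, which helps when someone else is on call. A sketch, assuming the two inputs come from the Organizations lookup above and from the provisioned product status. The function name and the returned strings are illustrative only.

```python
def next_action(account_exists, product_status):
    """Decide how to recover from a failed Account Factory provisioning."""
    failed = product_status in {"TAINTED", "ERROR"}
    if not failed:
        return "nothing to recover"
    if account_exists:
        # Terminating now risks deleting AWSControlTowerExecution from a live account
        return "do not terminate: enroll via the Control Tower console, then terraform import"
    return "safe to terminate the provisioned product and retry"


print(next_action(account_exists=True, product_status="TAINTED"))
print(next_action(account_exists=False, product_status="ERROR"))
```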


Gotcha 3: IAM Session Duration Limits
=======================================

Most CI/CD platforms have a maximum session duration. If your pipeline assumes an IAM role to run Terraform, that session has a clock on it.

Account Factory provisioning is not fast. Depending on the complexity of your baseline StackSet, it can take 30-50 minutes end to end. If your session expires before provisioning completes, Terraform&apos;s credentials stop working mid-apply and the provisioned product sits in limbo.

```
TIMING BREAKDOWN
================

Step                              Time
====                              ====
Service Catalog product launch    ~2 min
Account creation in Organizations ~5 min
Control Tower baseline deploy     ~15-30 min
StackSet instance deployment      ~5-10 min
SSO configuration                 ~2-5 min
────────────────────────────────────────
Total                             ~30-50 min
```

The default max session for most IAM roles is 1 hour. That sounds like enough until you add Terraform plan time, state locking, and any other resources in the same run.
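
A quick way to sanity-check the budget is to add up the worst case from the table and compare it with the session length. The step figures are the upper bounds from the timing breakdown. The 10 minutes of overhead for plan, state locking and other resources is an assumption, so adjust it to your pipeline.

```python
WORST_CASE_MINUTES = {
    "service_catalog_launch": 2,
    "account_creation": 5,
    "control_tower_baseline": 30,
    "stackset_instances": 10,
    "sso_configuration": 5,
}


def fits_in_session(step_minutes, session_seconds, overhead_minutes=0):
    """True if the worst-case run finishes before the assumed-role session expires."""
    needed_seconds = (sum(step_minutes.values()) + overhead_minutes) * 60
    return session_seconds >= needed_seconds


print(sum(WORST_CASE_MINUTES.values()))
print(fits_in_session(WORST_CASE_MINUTES, 3600, overhead_minutes=10))
print(fits_in_session(WORST_CASE_MINUTES, 7200, overhead_minutes=10))
```

Under these assumptions, the worst case is 52 minutes of provisioning plus 10 of overhead, which does not fit in a one-hour session.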

**Fix:**

```yaml
# In your IAM role CloudFormation template
ProvisioningRole:
  Type: AWS::IAM::Role
  Properties:
    RoleName: terraform-provisioning
    MaxSessionDuration: 7200  # 2 hours
    AssumeRolePolicyDocument:
      # ... your trust policy
```

And in Terraform:

```hcl
resource &quot;aws_iam_role&quot; &quot;provisioning&quot; {
  name                 = &quot;terraform-provisioning&quot;
  max_session_duration = 7200  # seconds
  assume_role_policy   = data.aws_iam_policy_document.trust.json
}
```

Also set appropriate timeouts on the Terraform resource:

```hcl
resource &quot;aws_servicecatalog_provisioned_product&quot; &quot;new_account&quot; {
  # ... config ...

  timeouts {
    create = &quot;60m&quot;
    update = &quot;60m&quot;
    delete = &quot;60m&quot;
  }
}
```


Gotcha 4: StackSet Eventual Consistency
========================================

CloudFormation StackSets deploy asynchronously. When Control Tower enrolls an account, it triggers StackSet deployments for guardrails and baseline configurations. These don&apos;t complete instantly.

If your Terraform automation tries to interact with the new account immediately after provisioning (create IAM integrations, deploy resources, configure providers), it will race against the StackSet deployments.

Common symptoms:

```
SYMPTOMS OF STACKSET RACES
===========================

- Resources exist momentarily then disappear (StackSet overwrites them)
- IAM roles have different permissions than expected (baseline hasn&apos;t applied yet)
- aws_caller_identity returns the account but STS calls fail
- Random AccessDenied errors on calls that succeed 5 minutes later
```

**Fix:** Add explicit waits or use a two-stage pipeline.

Stage 1 provisions the account. Stage 2 runs separately (triggered by a delay or webhook) and configures the account.

```hcl
# Stage 1: Provision account
resource &quot;aws_servicecatalog_provisioned_product&quot; &quot;account&quot; {
  # ... provisioning config ...
}

# Stage 2: Wait for StackSet baseline (separate Terraform workspace)
data &quot;aws_cloudformation_stack_set&quot; &quot;baseline&quot; {
  name = &quot;AWSControlTowerBP-BASELINE-ROLES&quot;
}

# Verify the stack instance exists in the new account
data &quot;aws_cloudformation_stack_set_instance&quot; &quot;baseline_check&quot; {
  stack_set_name = data.aws_cloudformation_stack_set.baseline.name
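  # outputs is a set of {key, value} objects, so filter by key rather than indexing
  account_id     = one([for o in aws_servicecatalog_provisioned_product.account.outputs : o.value if o.key == &quot;AccountId&quot;])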
  region         = &quot;eu-west-1&quot;
}
```

In practice, I ended up just sleeping for 5 minutes between stages. Ugly but reliable:

```bash
#!/bin/bash
# provision.sh - two-stage account provisioning

echo &quot;Stage 1: Provisioning account...&quot;
cd terraform/account-provisioning
terraform apply -auto-approve

ACCOUNT_ID=$(terraform output -raw account_id)
echo &quot;Account $ACCOUNT_ID created. Waiting for baseline deployment...&quot;

# StackSets need time to propagate
sleep 300

echo &quot;Stage 2: Configuring account...&quot;
cd ../account-configuration
terraform apply -auto-approve -var=&quot;account_id=$ACCOUNT_ID&quot;
```
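
If the fixed sleep bothers you, the usual alternative is to poll with a deadline. A generic sketch: `check` is whatever readiness test you trust (for example, a call that confirms the baseline role exists in the new account), and both that function and the default timings are assumptions, not something Control Tower provides.

```python
import time


def wait_until(check, timeout_s=900, interval_s=30, sleep=time.sleep, clock=time.monotonic):
    """Call check() until it returns True or timeout_s elapses. Returns the final result."""
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Trivial demo: a check that is already satisfied returns immediately
print(wait_until(lambda: True))
```

Stage 2 then starts as soon as the account is actually ready, and fails loudly at the deadline instead of proceeding on hope.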


Gotcha 5: SSO Access Isn&apos;t Automatic
======================================

When Account Factory creates an account, it creates an SSO user and assigns it to the account. But if you&apos;re using IAM Identity Center with an external identity provider (Azure AD, Okta, etc.), or you want to assign permission sets to existing groups, that doesn&apos;t happen automatically.

You need to explicitly create permission set assignments after the account is provisioned.

```hcl
# Assign admin permission set to your platform team group
resource &quot;aws_ssoadmin_account_assignment&quot; &quot;platform_admin&quot; {
  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.admin.arn

  principal_id   = data.aws_identitystore_group.platform_team.group_id
  principal_type = &quot;GROUP&quot;

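  # outputs is a set of {key, value} objects, so filter by key rather than indexing
  target_id   = one([for o in aws_servicecatalog_provisioned_product.account.outputs : o.value if o.key == &quot;AccountId&quot;])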
  target_type = &quot;AWS_ACCOUNT&quot;
}

resource &quot;aws_ssoadmin_account_assignment&quot; &quot;developer_readonly&quot; {
  instance_arn       = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  permission_set_arn = aws_ssoadmin_permission_set.read_only.arn

  principal_id   = data.aws_identitystore_group.developers.group_id
  principal_type = &quot;GROUP&quot;

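  # outputs is a set of {key, value} objects, so filter by key rather than indexing
  target_id   = one([for o in aws_servicecatalog_provisioned_product.account.outputs : o.value if o.key == &quot;AccountId&quot;])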
  target_type = &quot;AWS_ACCOUNT&quot;
}
```

Without this, new accounts are provisioned but nobody can actually log into them through SSO. People find out when they click the account in the AWS access portal and get a blank page.


Gotcha 6: Wrong IAM Role Names in Templates
=============================================

This one sounds obvious but catches everyone at least once.

CloudFormation StackSet templates that deploy IAM roles into member accounts reference a role name. If the actual role used by your CI/CD platform or Terraform runner doesn&apos;t match the name in the template, the StackSet deploys a role that nothing uses, and your automation fails with `AccessDenied`.

```yaml
# What&apos;s in the StackSet template
Resources:
  DeployRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: terraform-deploy  # &lt;-- This name matters

# What your CI/CD is actually assuming
# terraform-provisioning  &lt;-- WRONG NAME
```

The template says `terraform-deploy`. Your CI/CD assumes `terraform-provisioning`. Everything deploys cleanly. Nothing works.

**Fix:** Before writing any automation, verify the ACTUAL role name in two places:

```bash
# 1. What your CI/CD platform is configured to assume
aws sts get-caller-identity
# Returns: arn:aws:sts::XXXX:assumed-role/ACTUAL-ROLE-NAME/session

# 2. What your StackSet template creates
aws cloudformation describe-stack-set \
  --stack-set-name &quot;your-member-account-roles&quot; \
  --query &quot;StackSet.TemplateBody&quot; | jq -r . | grep RoleName
```

These two values must match. If they don&apos;t, fix the template, not the CI/CD config - changing role names in CI/CD has knock-on effects everywhere.
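
The comparison is easy to automate once you have both values. A sketch that pulls the role name out of the caller ARN and checks it against the names found in the template. The function names and the account ID are placeholders.

```python
def role_name_from_arn(arn):
    """Extract the role name from an IAM role ARN or an STS assumed-role ARN."""
    resource = arn.split(":", 5)[5]  # assumed-role/NAME/session or role/[path/]NAME
    parts = resource.split("/")
    if parts[0] == "assumed-role":
        return parts[1]
    if parts[0] == "role":
        return parts[-1]  # skips any path segments
    raise ValueError(f"not a role ARN: {arn}")


def names_match(caller_arn, template_role_names):
    return role_name_from_arn(caller_arn) in template_role_names


caller = "arn:aws:sts::111111111111:assumed-role/terraform-provisioning/ci-session"
print(role_name_from_arn(caller))
print(names_match(caller, {"terraform-deploy"}))
```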


Gotcha 7: The SCP and Permissions Boundary Dance
==================================================

Service Control Policies (SCPs) are organisation-level guardrails. Permissions boundaries cap what an individual IAM role or user can do. When you use both (and you should), the effective permissions are the intersection of all the layers, and that intersection is hard to debug.

```
EFFECTIVE PERMISSIONS
=====================

Identity Policy (what the role CAN do)
     ∩
Permissions Boundary (maximum the role COULD do)
     ∩
SCP (maximum the account COULD do)
     =
What actually works
```

Real example: your StackSet deploys a role with `AdministratorAccess`. Your permissions boundary allows `iam:*`, `s3:*`, `ec2:*`. Your SCP denies `iam:CreateUser`. Result: the role can&apos;t create IAM users even though both the identity policy and boundary allow it.

The debugging nightmare is that CloudTrail shows the denial from the SCP, but the error message to the user just says `AccessDenied` with no indication that an SCP is involved.
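
The example above can be written as a toy evaluator, which is handy for reasoning about a denial before you go digging in CloudTrail. This is a deliberately simplified model: it assumes the account has an allow-all SCP attached alongside the deny, it ignores conditions and resources, and it matches action names case-sensitively, which real IAM does not.

```python
from fnmatch import fnmatchcase


def matches(action, patterns):
    return any(fnmatchcase(action, pattern) for pattern in patterns)


def is_allowed(action, identity_allow, boundary_allow, scp_deny):
    """Allowed only if identity policy and boundary both allow and no SCP denies."""
    if matches(action, scp_deny):
        return False  # an explicit deny always wins
    return matches(action, identity_allow) and matches(action, boundary_allow)


identity = ["*"]                          # AdministratorAccess
boundary = ["iam:*", "s3:*", "ec2:*"]
scp_deny = ["iam:CreateUser"]

for action in ["iam:CreateUser", "iam:ListRoles", "rds:DescribeDBInstances"]:
    print(action, is_allowed(action, identity, boundary, scp_deny))
```

Only `iam:ListRoles` passes. The SCP blocks `iam:CreateUser`, and the boundary blocks the RDS call even though the identity policy is full admin.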

```bash
# List SCPs attached directly to an account (SCPs inherited from parent OUs are not included)
aws organizations list-policies-for-target \
  --target-id &quot;ACCOUNT_ID&quot; \
  --filter &quot;SERVICE_CONTROL_POLICY&quot; \
  --query &quot;Policies[].{Name:Name, Id:Id}&quot;

# Get the policy content
aws organizations describe-policy \
  --policy-id &quot;p-xxxxxxxxxx&quot; \
  --query &quot;Policy.Content&quot; | jq -r . | jq .
```

**Tip:** Always test IAM operations from a new account before handing it to a team. Run a quick smoke test:

```bash
# Smoke test for new accounts: one read-only call per service
CHECKS=(
  &quot;sts get-caller-identity&quot;
  &quot;s3api list-buckets&quot;
  &quot;ec2 describe-regions&quot;
  &quot;iam list-roles&quot;
)

for check in &quot;${CHECKS[@]}&quot;; do
  echo -n &quot;$check: &quot;
  # $check is deliberately unquoted so it splits into service and subcommand
  if out=$(aws $check 2&gt;&amp;1); then echo &quot;OK&quot;; else echo &quot;FAILED: $(echo &quot;$out&quot; | grep -m1 .)&quot;; fi
done
```


Gotcha 8: Control Tower Drift and Manual Console Changes
==========================================================

Control Tower has a concept called &quot;drift&quot; - when the actual state of your landing zone diverges from what Control Tower expects. Making manual changes in the console (even well-intentioned ones) can trigger drift detection and block further operations.

Common drift triggers:

```
THINGS THAT CAUSE CONTROL TOWER DRIFT
======================================

- Moving accounts between OUs via the Organizations console (not CT)
- Deleting or modifying the AWSControlTowerExecution role
- Changing SCP attachments directly in Organizations
- Modifying Control Tower managed StackSet instances
- Deleting CloudTrail trails in member accounts
- Removing Config rules deployed by guardrails
```

When drift is detected, Account Factory stops working entirely. You can&apos;t provision new accounts, update existing ones, or change OU assignments until drift is resolved.

```bash
# Check landing zone drift status (DRIFTED or IN_SYNC)
aws controltower get-landing-zone \
  --landing-zone-identifier &quot;arn:aws:controltower:eu-west-1:XXXX:landingzone/XXXX&quot; \
  --query &quot;landingZone.driftStatus.status&quot;

# If you need to resolve drift
aws controltower reset-landing-zone \
  --landing-zone-identifier &quot;arn:aws:controltower:eu-west-1:XXXX:landingzone/XXXX&quot;
```

**Rule:** Never make changes to Control Tower-managed resources through the Organizations console, CloudFormation console, or CLI directly. Always go through Control Tower or Terraform with the appropriate provider resources.


Gotcha 9: Email Addresses Are Forever
=======================================

Every AWS account needs a unique email address. Once used, that email can never be used for another account - even if the original account is closed.

At scale, this becomes an email management problem. Most teams use plus-addressing:

```
ACCOUNT EMAIL STRATEGY
======================

Pattern: aws+{ou}-{name}@company.com

Examples:
  aws+sandbox-dev1@company.com
  aws+prod-api@company.com
  aws+staging-data@company.com
```

But here&apos;s the catch: if you close an account and want to recreate it with the same name, you need a different email. Keep a registry.

```hcl
# Keep a local map of all account emails
locals {
  account_emails = {
    &quot;sandbox-dev1&quot;     = &quot;aws+sandbox-dev1@company.com&quot;
    &quot;sandbox-dev2&quot;     = &quot;aws+sandbox-dev2@company.com&quot;
    &quot;prod-api&quot;         = &quot;aws+prod-api@company.com&quot;
    # Recreated after closure - note the v2 suffix
    &quot;sandbox-testing&quot;  = &quot;aws+sandbox-testing-v2@company.com&quot;
  }
}
```
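
If the registry lives somewhere scriptable, picking the next free address can be automated. A sketch that follows the `aws+{ou}-{name}` pattern and the `-v2` suffix convention from the map above. The function name is hypothetical, and `ever_used` must include the emails of closed accounts too.

```python
def next_account_email(ou, name, ever_used, domain="company.com"):
    """Return the first aws+{ou}-{name}[-vN]@domain address that was never used."""
    candidate = f"aws+{ou}-{name}@{domain}"
    version = 2
    while candidate in ever_used:
        candidate = f"aws+{ou}-{name}-v{version}@{domain}"
        version += 1
    return candidate


ever_used = {"aws+sandbox-dev1@company.com", "aws+sandbox-testing@company.com"}
print(next_account_email("sandbox", "testing", ever_used))
print(next_account_email("prod", "api", ever_used))
```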

Also: the root email for each account can receive important AWS notifications (billing alerts, abuse reports, account recovery). Make sure these go to a monitored mailbox or mailing list, not someone&apos;s personal inbox.


Putting It All Together
========================

Here&apos;s the provisioning flow with all the gotchas accounted for:

```
1. Verify portfolio association exists           (Gotcha 1)
2. Set session duration to 2h+                   (Gotcha 3)
3. Provision account via Service Catalog          
4. Wait for StackSet baseline completion          (Gotcha 4)
5. Verify execution role exists in new account    (Gotcha 2)
6. Assign SSO permission sets                     (Gotcha 5)
7. Validate IAM role names match templates        (Gotcha 6)
8. Run permission smoke test                      (Gotcha 7)
9. Verify no drift detected                       (Gotcha 8)
```

The full module structure:

```
account-provisioning/
├── modules/
│   ├── account/           # Service Catalog provisioned product
│   ├── baseline-check/    # Waits for StackSet completion
│   ├── sso-assignment/    # Permission set assignments
│   └── smoke-test/        # Post-provision validation
├── ou/
│   ├── sandbox/
│   ├── staging/
│   ├── prod/
│   └── platform/
├── bootstrap/
│   ├── member-role.yaml   # StackSet template for member IAM roles
│   └── permissions/       # Boundary policies per OU
└── accounts.tf            # Account definitions
```

Each new account is a single block:

```hcl
module &quot;pe_sandbox&quot; {
  source = &quot;./modules/account&quot;

  name          = &quot;platform-engineering-sandbox&quot;
  email         = local.account_emails[&quot;pe-sandbox&quot;]
  ou            = &quot;Sandbox&quot;
  sso_groups    = [&quot;PlatformTeam&quot;, &quot;Developers&quot;]
  permission_sets = {
    &quot;PlatformTeam&quot; = &quot;AdministratorAccess&quot;
    &quot;Developers&quot;   = &quot;ReadOnlyAccess&quot;
  }
}
```


What I&apos;d Do Differently
========================

1. **Two-stage pipeline from day one.** Don&apos;t try to provision and configure in a single Terraform run. The timing issues aren&apos;t worth fighting.

2. **Test with a throwaway account first.** Don&apos;t learn these lessons in a production OU. Create a sandbox account, break it, delete it, try again.

3. **Keep a manual runbook alongside Terraform.** When Service Catalog hangs or drift blocks you, knowing how to fix it through the console is faster than debugging Terraform state.

4. **Use `moved` blocks aggressively.** When you refactor your module structure (and you will), Terraform `moved` blocks save you from destroying and recreating accounts.

5. **Monitor StackSet operations.** Set up CloudWatch alarms on StackSet failures. Silent StackSet failures mean accounts exist without proper baselines - a security risk you won&apos;t notice until an audit.
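
To illustrate point 4: a `moved` block records a refactor so that Terraform updates state addresses instead of planning a destroy and recreate. A sketch with hypothetical addresses, assuming the standalone `pe_sandbox` module is being folded into a `for_each` module named `accounts`:

```hcl
# Hypothetical refactor: module.pe_sandbox becomes one instance of module.accounts
moved {
  from = module.pe_sandbox
  to   = module.accounts[&quot;pe-sandbox&quot;]
}
```

The plan for such a change should report the resources as moved, with no provisioned product being replaced. If it shows a destroy, stop and check the addresses before applying.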


References
==========

- [AWS Control Tower Account Factory docs](https://docs.aws.amazon.com/controltower/latest/userguide/account-factory.html)
- [Terraform aws_servicecatalog_provisioned_product](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/servicecatalog_provisioned_product)
- [AWS Organizations SCP evaluation logic](https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_scps_evaluation.html)
- [IAM policy evaluation logic](https://docs.aws.amazon.com/IAM/latest/UserGuide/reference_policies_evaluation-logic.html)
- [Control Tower drift detection](https://docs.aws.amazon.com/controltower/latest/userguide/drift.html)
- [CloudFormation StackSets concepts](https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/stacksets-concepts.html)


========================================
Control Tower + Service Catalog + Terraform
========================================
The gotchas they don&apos;t put in the docs.
========================================</content:encoded><category>aws</category><category>control-tower</category><category>terraform</category><category>service-catalog</category><category>iam</category><category>platform-engineering</category><category>multi-account</category><author>Mo Abukar</author></item><item><title>Building an Automated Multi-Account AWS Architecture with Control Tower and Terraform</title><link>https://moabukar.co.uk/blog/aws-control-tower-multi-account-automation/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/aws-control-tower-multi-account-automation/</guid><description>A hands-on walkthrough of enabling AWS Control Tower, designing an OU structure, automating account provisioning via Service Catalog, and deploying security baselines - from zero to fully automated account vending in production.</description><pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate><content:encoded>Most companies start with a single AWS account. One account for everything - dev workloads sitting next to production databases, shared IAM roles with permissions nobody fully understands, and a CloudTrail log that&apos;s a haystack of events from every team.

It works until it doesn&apos;t. And by the time it stops working, the blast radius of any incident is your entire infrastructure.

I recently helped a client migrate from this exact situation - a handful of loosely managed AWS accounts with no guardrails, no centralised logging, and no standardised way to create new accounts - to a fully automated multi-account architecture using AWS Control Tower, Service Catalog, and Terraform.

This post covers everything: the console setup, the problems we hit, the Terraform modules we built, and the lessons learned along the way. This isn&apos;t a theoretical overview - it&apos;s what we actually did, including the bits that didn&apos;t go smoothly.

![AWS Control Tower Multi-Account Architecture](/images/aws-control-tower-architecture.svg)

## Why Multi-Account?

Before diving into the how, it&apos;s worth understanding why AWS themselves recommend a multi-account strategy:

**Blast radius isolation.** If a developer accidentally deletes something in a sandbox account, production doesn&apos;t blink. If credentials leak, the damage is contained to one account.

**Clean billing.** Each account maps to a cost centre, team, or environment. No more parsing thousands of line items to figure out which team spent what.

**Security boundaries.** IAM policies are account-scoped. Service Control Policies (SCPs) let you set hard guardrails per OU. You can&apos;t accidentally give a sandbox account production database access.

**Compliance.** Auditors love seeing separate accounts with clear boundaries. It makes proving segregation of duties much easier.

The tradeoff is complexity. Managing 20+ accounts manually is painful. That&apos;s where automation comes in.

## Step 1: Enabling Control Tower (The Console Part)

Control Tower is one of those AWS services where starting in the console is still the path of least resistance. A landing zone API now exists (the AWS provider exposes it as `aws_controltower_landing_zone`), but for the initial setup we walked through the wizard.

### Prerequisites

Before enabling Control Tower, make sure you have:

- An **AWS Organizations** management account (this becomes your Control Tower management account)
- A **clean email address** for the Audit account (e.g., `aws-audit@yourcompany.com`)
- A **clean email address** for the Log Archive account (e.g., `aws-logs@yourcompany.com`)
- **Admin access** in the management account
- At least **an hour** of patience (the initial setup takes 30-60 minutes)

### The Setup Wizard

Navigate to **AWS Control Tower** in the console and click **Set up landing zone**.

The wizard walks you through several decisions:

**1. Home Region**

Pick the region where Control Tower will operate. This is where your landing zone resources live - CloudFormation stacks, Config rules, the lot. For a European client, we chose `eu-central-1` (Frankfurt).

&gt; **Warning:** You cannot change the home region after setup. Choose carefully.

**2. Region Deny Setting**

Control Tower asks if you want to deny access to non-governed regions. We enabled this. There&apos;s no good reason for workloads to spin up in `ap-southeast-1` if your business operates in Europe.

We allowed four regions:
- `eu-central-1` (Frankfurt) - primary
- `eu-west-2` (London) - secondary
- `eu-west-1` (Ireland) - available for specific services
- `us-east-1` (N. Virginia) - required for global services like IAM, CloudFront, Route53

**3. Foundational OUs**

Control Tower creates two OUs automatically:
- **Security** - houses the Audit and Log Archive accounts
- **Sandbox** - a default OU for experimentation

You can create additional OUs later. We added:
- **Platform** - shared infrastructure (networking, CI/CD, DNS)
- **Prod** - production workloads only
- **Staging** - pre-production environments

**4. Shared Accounts**

Control Tower creates two mandatory shared accounts:

**Audit Account** - centralised security. This is where Security Hub findings aggregate, GuardDuty alerts land, and Config rules are evaluated across the org. Think of it as your security team&apos;s single pane of glass.

**Log Archive Account** - write-once storage. CloudTrail logs from every account in the organisation land here. AWS Config snapshots too. The account has restrictive policies - even admins can&apos;t delete logs.

**5. CloudTrail Configuration**

Control Tower sets up an organisation-wide CloudTrail. Every API call in every account gets logged to the Log Archive account. We enabled:
- CloudTrail log file validation
- CloudTrail log file encryption (KMS)
- S3 access logging on the trail bucket

**6. IAM Identity Center**

If you haven&apos;t set up IAM Identity Center (formerly AWS SSO), Control Tower will configure it. This is how users access accounts - no more IAM users with long-lived credentials.

### The Wait

After clicking **Set up landing zone**, go make a coffee. The initial setup takes 30-60 minutes. Control Tower is:
- Creating the Audit and Log Archive accounts
- Enrolling them in the Security OU
- Deploying baseline CloudFormation StackSets
- Setting up Config rules and CloudTrail
- Configuring guardrails (SCPs)

### The First Problem We Hit

When we enabled Control Tower, the client already had a few existing AWS accounts that weren&apos;t part of any organisation structure. These accounts had resources running in them.

**You cannot retroactively enrol existing accounts into Control Tower without meeting prerequisites.** Each account needs:
- No existing AWS Config recorder (Control Tower creates its own)
- No conflicting CloudTrail trails
- An `AWSControlTowerExecution` role that the management account can assume, with enough permissions to deploy the baseline

We had to go into each existing account, delete the existing Config recorder, and clean up conflicting CloudTrail configurations before enrolling them. This took a full day of careful work.

**Lesson:** Enable Control Tower as early as possible. The longer you wait, the more cleanup you&apos;ll need.

## Step 2: Designing the OU Structure

The OU structure is your organisational blueprint. Get it wrong and you&apos;ll be fighting it forever. Get it right and everything clicks.

Here&apos;s what we landed on:

```
Root
├── Management Account
│
├── Security (Control Tower managed)
│   ├── Audit Account
│   └── Log Archive Account
│
├── Platform
│   ├── Networking (Transit Gateway, VPCs)
│   ├── Shared Services (CI/CD, ECR, Secrets)
│   └── DNS (Route53 zones)
│
├── Prod
│   ├── Service A - Prod
│   ├── Service B - Prod
│   └── Data - Prod
│
├── Staging
│   ├── Service A - Staging
│   ├── Service B - Staging
│   └── Data - Staging
│
├── Sandbox
│   └── Developer sandboxes
│
└── Suspended
    └── Decommissioned accounts
```

### Design Decisions

**One account per service per environment.** Not one account per team, not one account per environment. Per service, per environment. This gives maximum blast radius isolation. If the payment service has an incident, the order service is unaffected.

**Platform OU for shared infrastructure.** Networking (Transit Gateway, VPC peering), shared container registries, centralised secrets - these live in the Platform OU. Application teams consume these services but don&apos;t manage the underlying infrastructure.

**Suspended OU.** When an account is decommissioned, it moves here rather than being deleted immediately. Once an account is closed, AWS keeps it in a suspended state for 90 days before permanently deleting it. This gives you a recovery window.

**Sandbox with strict SCPs.** Sandbox accounts get the tightest guardrails - no leaving the org, no root user, region-locked. Developers can experiment freely within those boundaries.

### Registering OUs with Control Tower

Here&apos;s a gotcha that cost us time: **OUs created via Terraform or the Organizations API are not automatically registered with Control Tower.** You have to register them manually in the Control Tower console.

Go to Control Tower → Organization → click on the OU → **Register OU**.

Registration deploys baseline controls (Config rules, CloudTrail) to all accounts in that OU. Until an OU is registered, accounts in it don&apos;t get Control Tower guardrails.

We created the OUs via Terraform (faster than clicking), then registered each one in the console. It&apos;s a one-time operation per OU.
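
For reference, creating the custom OUs in Terraform can look roughly like this. It is a sketch rather than our exact code: the resource label `custom` and the `ou_ids` local are illustrative names, and it assumes the OUs sit directly under the organisation root.

```hcl
data &quot;aws_organizations_organization&quot; &quot;this&quot; {}

# Custom OUs only - Security and Sandbox are created by Control Tower itself
resource &quot;aws_organizations_organizational_unit&quot; &quot;custom&quot; {
  for_each  = toset([&quot;Platform&quot;, &quot;Prod&quot;, &quot;Staging&quot;])
  name      = each.value
  parent_id = data.aws_organizations_organization.this.roots[0].id
}

# Map of lower-cased OU name to OU ID, e.g. local.ou_ids.prod
locals {
  ou_ids = {
    for name, ou in aws_organizations_organizational_unit.custom :
    lower(name) =&gt; ou.id
  }
}
```

Registration with Control Tower still has to happen after the apply; creating the OU does not enrol it.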

## Step 3: The Account Module (Terraform)

With Control Tower running and OUs registered, we built Terraform modules to automate account creation. The core module uses AWS Service Catalog to trigger the Control Tower Account Factory.

### How Account Factory Works

Under the hood, Control Tower provisions accounts via a Service Catalog product called &quot;AWS Control Tower Account Factory.&quot; When you provision this product with the right parameters, it:

1. Creates the AWS account in Organizations
2. Moves it to the specified OU
3. Enrolls it in Control Tower
4. Deploys baseline StackSets (CloudTrail, Config, IAM roles)
5. Creates an SSO user with access to the account
6. Applies all mandatory guardrails (SCPs) from the OU

All from a single Terraform resource.

### The Account Module

```hcl
# modules/account/main.tf

resource &quot;aws_servicecatalog_provisioned_product&quot; &quot;account&quot; {
  name                       = var.name
  product_name               = &quot;AWS Control Tower Account Factory&quot;
  provisioning_artifact_name = &quot;AWS Control Tower Account Factory&quot;

  provisioning_parameters {
    key   = &quot;AccountName&quot;
    value = var.name
  }

  provisioning_parameters {
    key   = &quot;AccountEmail&quot;
    value = var.email
  }

  provisioning_parameters {
    key   = &quot;ManagedOrganizationalUnit&quot;
    value = var.ou_name
  }

  provisioning_parameters {
    key   = &quot;SSOUserEmail&quot;
    value = coalesce(var.sso_email, var.email)
  }

  provisioning_parameters {
    key   = &quot;SSOUserFirstName&quot;
    value = var.sso_first_name
  }

  provisioning_parameters {
    key   = &quot;SSOUserLastName&quot;
    value = var.sso_last_name
  }

  tags = {
    Name      = var.name
    ManagedBy = &quot;terraform&quot;
    Team      = var.team
  }

  timeouts {
    create = &quot;60m&quot;
    update = &quot;60m&quot;
    delete = &quot;60m&quot;
  }

  lifecycle {
    ignore_changes = [
      provisioning_artifact_name,
    ]
  }
}
```

### Getting the Account ID

One thing that isn&apos;t obvious - Service Catalog doesn&apos;t return the account ID as a top-level resource attribute. You need to pick it out of the provisioned product&apos;s `outputs`:

```hcl
locals {
  # `outputs` is a set of key/value objects exported by the provisioned product
  account_id = try(
    [for o in aws_servicecatalog_provisioned_product.account.outputs :
      o.value if o.key == &quot;AccountId&quot;
    ][0],
    null
  )
}

output &quot;account_id&quot; {
  value = local.account_id
}

output &quot;admin_role_arn&quot; {
  value = local.account_id != null ? (
    &quot;arn:aws:iam::${local.account_id}:role/AWSControlTowerExecution&quot;
  ) : null
}
```

The `AWSControlTowerExecution` role is created automatically by Control Tower in every enrolled account. It&apos;s the role your CI/CD platform uses for cross-account access during provisioning.

### Variables

```hcl
# modules/account/variables.tf

variable &quot;name&quot; {
  type        = string
  description = &quot;Account name (max 50 characters, must be unique)&quot;

  validation {
    condition     = length(var.name) &lt;= 50
    error_message = &quot;Account name must be 50 characters or less.&quot;
  }
}

variable &quot;email&quot; {
  type        = string
  description = &quot;Root account email (must be globally unique across all AWS)&quot;
}

variable &quot;ou_name&quot; {
  type        = string
  description = &quot;Target OU name as shown in Control Tower (e.g., &apos;Sandbox&apos;, &apos;Prod&apos;)&quot;
}

variable &quot;sso_email&quot; {
  type        = string
  description = &quot;SSO user email (defaults to account email)&quot;
  default     = null
}

variable &quot;sso_first_name&quot; {
  type    = string
  default = &quot;Admin&quot;
}

variable &quot;sso_last_name&quot; {
  type    = string
  default = &quot;User&quot;
}

variable &quot;team&quot; {
  type    = string
  default = &quot;platform&quot;
}
```

### The OU Name Gotcha

Notice we use `var.ou_name` (a name like &quot;Sandbox&quot;) rather than an OU ID. This is because Control Tower Account Factory expects the OU name in a specific format, not the raw OU ID.

In earlier versions of Account Factory, the format was `&quot;Custom (ou-xxxx-yyyyyyyy)&quot;`. In newer versions, it just takes the OU name. We initially used the wrong format and got cryptic Service Catalog provisioning failures.

**Lesson:** Check your Control Tower version. The `ManagedOrganizationalUnit` parameter format has changed over time.
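
One way to fail fast here is a validation rule on the module input. The sketch below extends the `ou_name` variable from above and assumes a landing zone version that expects the plain display name. The regular expression is an approximation of the OU ID shape, so adjust or drop it if your version wants the `Name (ou-id)` form instead.

```hcl
variable &quot;ou_name&quot; {
  type        = string
  description = &quot;Target OU name as shown in Control Tower (e.g., &apos;Sandbox&apos;, &apos;Prod&apos;)&quot;

  # Assumption: this landing zone version expects the display name, not an OU ID
  validation {
    condition     = !can(regex(&quot;^ou-[a-z0-9]+-[a-z0-9]+$&quot;, var.ou_name))
    error_message = &quot;Pass the OU display name (e.g. Sandbox), not the raw OU ID.&quot;
  }
}
```

This turns a cryptic provisioning failure 20 minutes into an apply into an error at plan time.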

## Step 4: Using the Account Module

With the module built, creating an account becomes a simple Terraform definition in the appropriate OU folder.

### Single Account

```hcl
# ou/platform/networking/main.tf

module &quot;networking&quot; {
  source = &quot;../../../modules/account&quot;

  name           = &quot;networking&quot;
  email          = &quot;aws-accounts+networking@company.com&quot;
  ou_name        = &quot;Platform&quot;
  sso_first_name = &quot;Platform&quot;
  sso_last_name  = &quot;Networking&quot;
  team           = &quot;platform&quot;
}

output &quot;networking_account_id&quot; {
  value = module.networking.account_id
}
```

### Multiple Environments

For services that need both staging and prod:

```hcl
# ou/workloads/order-service/main.tf

locals {
  service_name = &quot;order-service&quot;
  environments = {
    staging = &quot;Staging&quot;
    prod    = &quot;Prod&quot;
  }
}

module &quot;accounts&quot; {
  source   = &quot;../../../modules/account&quot;
  for_each = local.environments

  name           = &quot;${local.service_name}-${each.key}&quot;
  email          = &quot;aws-accounts+${local.service_name}-${each.key}@company.com&quot;
  ou_name        = each.value
  sso_first_name = local.service_name
  sso_last_name  = each.key
  team           = &quot;product&quot;
}

output &quot;account_ids&quot; {
  value = { for k, v in module.accounts : k =&gt; v.account_id }
}
```

### The Email Problem

AWS requires globally unique email addresses for every account. You can&apos;t reuse emails, even across different organisations.

The solution: `+` addressing. Most email providers (Google Workspace, Microsoft 365) support it:

- `aws-accounts+networking@company.com`
- `aws-accounts+order-service-prod@company.com`
- `aws-accounts+order-service-staging@company.com`

All route to the same `aws-accounts@company.com` mailbox, but AWS sees them as unique. We created a shared mailbox specifically for this purpose.

## Step 5: Security Baseline Module

Control Tower gives you a solid foundation, but we wanted additional security controls deployed to every account automatically. We built a baseline module that gets applied after account creation.

```hcl
# modules/account-baseline/main.tf

# GuardDuty - threat detection
resource &quot;aws_guardduty_detector&quot; &quot;this&quot; {
  count  = var.enable_guardduty ? 1 : 0
  enable = true

  datasources {
    s3_logs {
      enable = true
    }
    kubernetes {
      audit_logs {
        enable = var.enable_eks_protection
      }
    }
    malware_protection {
      scan_ec2_instance_with_findings {
        ebs_volumes {
          enable = true
        }
      }
    }
  }
}

# Security Hub with CIS and AWS best practices
resource &quot;aws_securityhub_account&quot; &quot;this&quot; {
  count                     = var.enable_security_hub ? 1 : 0
  enable_default_standards  = false
  control_finding_generator = &quot;SECURITY_CONTROL&quot;
  auto_enable_controls      = true
}

resource &quot;aws_securityhub_standards_subscription&quot; &quot;cis&quot; {
  count         = var.enable_security_hub ? 1 : 0
  depends_on    = [aws_securityhub_account.this]
  standards_arn = &quot;arn:aws:securityhub:${var.region}::standards/cis-aws-foundations-benchmark/v/1.4.0&quot;
}

resource &quot;aws_securityhub_standards_subscription&quot; &quot;foundational&quot; {
  count         = var.enable_security_hub ? 1 : 0
  depends_on    = [aws_securityhub_account.this]
  standards_arn = &quot;arn:aws:securityhub:${var.region}::standards/aws-foundational-security-best-practices/v/1.0.0&quot;
}

# IAM Access Analyzer - detect external access
resource &quot;aws_accessanalyzer_analyzer&quot; &quot;this&quot; {
  count         = var.enable_access_analyzer ? 1 : 0
  analyzer_name = &quot;${var.account_name}-access-analyzer&quot;
  type          = &quot;ACCOUNT&quot;
}

# EBS encryption by default
resource &quot;aws_ebs_encryption_by_default&quot; &quot;this&quot; {
  count   = var.enable_ebs_encryption ? 1 : 0
  enabled = true
}

# S3 account-level public access block
resource &quot;aws_s3_account_public_access_block&quot; &quot;this&quot; {
  count                   = var.block_public_s3 ? 1 : 0
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}

# IMDSv2 enforcement alerting via Config rule
resource &quot;aws_config_config_rule&quot; &quot;imdsv2&quot; {
  count = var.enable_config ? 1 : 0
  name  = &quot;ec2-imdsv2-check&quot;

  source {
    owner             = &quot;AWS&quot;
    source_identifier = &quot;EC2_IMDSV2_CHECK&quot;
  }
}
```

This gives every account:
- **GuardDuty** with S3, EKS, and malware scanning enabled
- **Security Hub** with CIS and AWS Foundational benchmarks
- **IAM Access Analyzer** to catch external trust policies
- **EBS encryption by default** - no unencrypted volumes
- **S3 public access block** at the account level
- **IMDSv2 compliance** alerting - the Config rule flags instances that still allow instance metadata v1
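
Applying the baseline means running it with credentials for the new account. A minimal sketch of that wiring, assuming it runs as a separate stage after the account exists, so the account ID is a known input rather than a value computed in the same run. `var.networking_account_id`, the provider alias, and the relative module path are illustrative:

```hcl
# Hypothetical second-stage stack: the account already exists
variable &quot;networking_account_id&quot; {
  type        = string
  description = &quot;ID of the already-provisioned networking account&quot;
}

provider &quot;aws&quot; {
  alias  = &quot;networking&quot;
  region = &quot;eu-central-1&quot;

  # Cross-account access via the role Control Tower creates
  assume_role {
    role_arn = &quot;arn:aws:iam::${var.networking_account_id}:role/AWSControlTowerExecution&quot;
  }
}

module &quot;networking_baseline&quot; {
  source = &quot;../../../modules/account-baseline&quot;

  providers = {
    aws = aws.networking
  }

  account_name           = &quot;networking&quot;
  region                 = &quot;eu-central-1&quot;
  enable_guardduty       = true
  enable_eks_protection  = false
  enable_security_hub    = true
  enable_access_analyzer = true
  enable_ebs_encryption  = true
  block_public_s3        = true
  enable_config          = true
}
```

Keeping the provider configuration free of values that are only known after apply avoids a whole class of plan-time errors.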

## Step 6: Service Control Policies

SCPs are your hard guardrails. They operate at the Organizations level and override any IAM permissions. Even an admin in a child account can&apos;t do something an SCP denies.

We started conservative - only attaching SCPs to the Sandbox OU - and planned to expand after testing.

### Deny Root User

```json
{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;DenyRootUser&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;Action&quot;: &quot;*&quot;,
      &quot;Resource&quot;: &quot;*&quot;,
      &quot;Condition&quot;: {
        &quot;StringLike&quot;: {
          &quot;aws:PrincipalArn&quot;: &quot;arn:aws:iam::*:root&quot;
        }
      }
    }
  ]
}
```

Every account has a root user. Nobody should be using it. This SCP ensures they can&apos;t.

### Deny Leaving the Organisation

```json
{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;DenyLeaveOrganization&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;Action&quot;: [&quot;organizations:LeaveOrganization&quot;],
      &quot;Resource&quot;: &quot;*&quot;
    }
  ]
}
```

Simple but critical. Without this, anyone with admin access in a child account could remove it from your organisation.

### Restrict Regions

```json
{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;DenyUnapprovedRegions&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;NotAction&quot;: [
        &quot;iam:*&quot;,
        &quot;sts:*&quot;,
        &quot;s3:*&quot;,
        &quot;cloudfront:*&quot;,
        &quot;route53:*&quot;,
        &quot;support:*&quot;,
        &quot;budgets:*&quot;,
        &quot;organizations:*&quot;,
        &quot;account:*&quot;
      ],
      &quot;Resource&quot;: &quot;*&quot;,
      &quot;Condition&quot;: {
        &quot;StringNotEquals&quot;: {
          &quot;aws:RequestedRegion&quot;: [
            &quot;eu-central-1&quot;,
            &quot;eu-west-2&quot;,
            &quot;eu-west-1&quot;,
            &quot;us-east-1&quot;
          ]
        }
      }
    }
  ]
}
```

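Note the `NotAction` list - global services like IAM, CloudFront, and Route53 (and the global STS endpoint) are exempted because their requests are always served from `us-east-1`, regardless of where you call them from. With `us-east-1` already in the allow list this is belt-and-braces, but it keeps the policy safe if that region is ever removed.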

### Protect Security Baseline

This one prevents anyone from disabling the security tools we deployed:

```json
{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Sid&quot;: &quot;DenyDisableGuardDuty&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;Action&quot;: [
        &quot;guardduty:DeleteDetector&quot;,
        &quot;guardduty:DeleteMembers&quot;,
        &quot;guardduty:DisassociateFromAdministratorAccount&quot;,
        &quot;guardduty:DisassociateFromMasterAccount&quot;
      ],
      &quot;Resource&quot;: &quot;*&quot;
    },
    {
      &quot;Sid&quot;: &quot;DenyDisableSecurityHub&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;Action&quot;: [
        &quot;securityhub:DisableSecurityHub&quot;,
        &quot;securityhub:DeleteMembers&quot;
      ],
      &quot;Resource&quot;: &quot;*&quot;
    },
    {
      &quot;Sid&quot;: &quot;DenyDisableConfig&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;Action&quot;: [
        &quot;config:DeleteConfigurationRecorder&quot;,
        &quot;config:DeleteDeliveryChannel&quot;,
        &quot;config:StopConfigurationRecorder&quot;
      ],
      &quot;Resource&quot;: &quot;*&quot;
    },
    {
      &quot;Sid&quot;: &quot;DenyDisableAccessAnalyzer&quot;,
      &quot;Effect&quot;: &quot;Deny&quot;,
      &quot;Action&quot;: [&quot;access-analyzer:DeleteAnalyzer&quot;],
      &quot;Resource&quot;: &quot;*&quot;
    }
  ]
}
```

### SCP Terraform Module

We built a reusable module for SCPs:

```hcl
# modules/scp/main.tf

resource &quot;aws_organizations_policy&quot; &quot;this&quot; {
  name        = var.name
  description = var.description
  type        = &quot;SERVICE_CONTROL_POLICY&quot;
  content     = var.policy_file != &quot;&quot; ? file(var.policy_file) : var.policy_json
}

resource &quot;aws_organizations_policy_attachment&quot; &quot;targets&quot; {
  for_each  = toset(var.target_ids)
  policy_id = aws_organizations_policy.this.id
  target_id = each.value
}
```

Usage:

```hcl
module &quot;scp_deny_root&quot; {
  source = &quot;../modules/scp&quot;

  name        = &quot;DenyRootUser&quot;
  description = &quot;Deny all actions by root user&quot;
  policy_file = &quot;${path.module}/../scps/deny-root-user.json&quot;

  target_ids = [
    local.ou_ids.sandbox,
    local.ou_ids.prod,
    local.ou_ids.staging,
  ]
}
```

### The 5-SCP Limit

AWS limits each OU to 5 attached SCPs. Control Tower already attaches some of its own guardrail SCPs. We hit this limit on the Sandbox OU and had to consolidate some of our policies.

**Lesson:** Check how many SCPs Control Tower has attached to an OU before adding your own. Use `aws organizations list-policies-for-target --target-id ou-xxxx-yyyyyyyy --filter SERVICE_CONTROL_POLICY` to check.

## Step 7: IAM Identity Center (SSO)

We managed SSO permission sets and account assignments via Terraform. This ensures consistent access patterns across all accounts.

```hcl
# modules/iam-identity-center/main.tf

data &quot;aws_ssoadmin_instances&quot; &quot;this&quot; {}

locals {
  instance_arn      = tolist(data.aws_ssoadmin_instances.this.arns)[0]
  identity_store_id = tolist(data.aws_ssoadmin_instances.this.identity_store_ids)[0]
}

resource &quot;aws_ssoadmin_permission_set&quot; &quot;this&quot; {
  name             = var.name
  description      = var.description
  instance_arn     = local.instance_arn
  session_duration = var.session_duration
}

resource &quot;aws_ssoadmin_managed_policy_attachment&quot; &quot;this&quot; {
  for_each           = toset(var.aws_managed_policies)
  instance_arn       = local.instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.this.arn
  managed_policy_arn = each.value
}

# Look up groups by display name
data &quot;aws_identitystore_group&quot; &quot;groups&quot; {
  for_each          = toset(var.group_names)
  identity_store_id = local.identity_store_id

  alternate_identifier {
    unique_attribute {
      attribute_path  = &quot;DisplayName&quot;
      attribute_value = each.value
    }
  }
}

# Assign groups to accounts
resource &quot;aws_ssoadmin_account_assignment&quot; &quot;this&quot; {
  for_each = { for a in local.assignments : a.key =&gt; a }

  instance_arn       = local.instance_arn
  permission_set_arn = aws_ssoadmin_permission_set.this.arn
  principal_type     = each.value.principal_type
  principal_id       = each.value.principal_id
  target_type        = &quot;AWS_ACCOUNT&quot;
  target_id          = each.value.account_id
}
```

This let us define permission sets like `AdministratorAccess`, `ReadOnlyAccess`, and `DeveloperAccess`, then assign them to IAM Identity Center groups per account. New accounts automatically got the right access patterns based on their OU.
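
The module above references `local.assignments` without showing it. One way it could be built, as a sketch: `var.account_ids` is a hypothetical list of target account IDs, and every group in `var.group_names` is assigned to every account in that list.

```hcl
# Hypothetical input: accounts that should receive this permission set
variable &quot;account_ids&quot; {
  type    = list(string)
  default = []
}

locals {
  # One entry per (group, account) pair, keyed so for_each gets stable addresses
  assignments = flatten([
    for group_name, group in data.aws_identitystore_group.groups : [
      for account_id in var.account_ids : {
        key            = &quot;${group_name}-${account_id}&quot;
        principal_type = &quot;GROUP&quot;
        principal_id   = group.group_id
        account_id     = account_id
      }
    ]
  ])
}
```

The keys are built only from input variables, so they are known at plan time, which `for_each` requires.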

## Step 8: CI/CD Integration (Spacelift)

The final piece was wiring everything into a CI/CD platform. We used Spacelift, but the pattern works with any Terraform automation tool - Terraform Cloud, Atlantis, GitHub Actions with OIDC.

The key design decision: **administrative stacks create child stacks.** The root stack manages OU-level configuration. Each OU has its own stack that manages accounts within it.

```hcl
# modules/spacelift-stack/main.tf

resource &quot;spacelift_stack&quot; &quot;this&quot; {
  name         = var.name
  description  = var.description
  repository   = var.repository
  branch       = &quot;main&quot;
  project_root = var.project_root

  administrative    = var.administrative
  autodeploy        = var.autodeploy
  terraform_version = var.terraform_version
}

resource &quot;spacelift_aws_integration_attachment&quot; &quot;this&quot; {
  stack_id       = spacelift_stack.this.id
  integration_id = var.aws_integration_id
  read           = true
  write          = true
}
```

AWS authentication uses OIDC - no static access keys anywhere. Spacelift assumes a role in the management account, which then uses `AWSControlTowerExecution` for cross-account operations.

### The Provisioning Flow

1. Developer opens a PR adding an account definition
2. Spacelift runs `terraform plan` and posts the result on the PR
3. Team reviews and approves
4. PR is merged to `main`
5. Spacelift runs `terraform apply`
6. Service Catalog triggers Account Factory
7. Account is created, enrolled in Control Tower, baselines deployed
8. SSO access is configured automatically
9. Developer gets an email invitation to access the new account

**Total time from merge to usable account: approximately 30 minutes.** No console clicks, no tickets, full audit trail in Git.

## Problems We Hit (And How We Solved Them)

### 1. Account Factory Timeout on First Run

The first time we ran `terraform apply` with the account module, it timed out. Account Factory can take 25-30 minutes, and our initial timeout was 30 minutes.

**Fix:** Set timeouts to 60 minutes. Account creation usually completes in 25 minutes, but StackSet deployments can add time.

### 2. Existing Accounts Conflicting with Control Tower

The client had existing accounts with their own Config recorders and CloudTrail trails. Control Tower expects to manage these itself.

**Fix:** We wrote a cleanup script that:
- Deleted existing Config recorders
- Deleted existing CloudTrail trails
- Removed conflicting IAM roles
- Then enrolled each account via the Control Tower console

### 3. SCP Limit Per OU

Control Tower attaches its own SCPs to the OUs it manages. We attached ours on top and hit the AWS Organizations limit of five SCPs attached directly to a single OU.

**Fix:** Consolidated multiple deny statements into single SCP documents. Instead of separate policies for &quot;deny root&quot; and &quot;deny leave org,&quot; we combined them into one &quot;baseline-deny&quot; policy.
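
A minimal sketch of what a consolidated policy could look like in Terraform. The two statements and the `ou_id` variable are illustrative, not the client&apos;s actual policy:

```hcl
# Sketch only: statement contents and variable names are assumptions.
variable &quot;ou_id&quot; {
  type        = string
  description = &quot;ID of the OU to attach the policy to&quot;
}

resource &quot;aws_organizations_policy&quot; &quot;baseline_deny&quot; {
  name = &quot;baseline-deny&quot;
  type = &quot;SERVICE_CONTROL_POLICY&quot;

  # Two deny statements in one document, so they consume one SCP slot
  content = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [
      {
        Sid      = &quot;DenyRootUser&quot;
        Effect   = &quot;Deny&quot;
        Action   = &quot;*&quot;
        Resource = &quot;*&quot;
        Condition = {
          StringLike = { &quot;aws:PrincipalArn&quot; = &quot;arn:aws:iam::*:root&quot; }
        }
      },
      {
        Sid      = &quot;DenyLeaveOrganization&quot;
        Effect   = &quot;Deny&quot;
        Action   = &quot;organizations:LeaveOrganization&quot;
        Resource = &quot;*&quot;
      },
    ]
  })
}

resource &quot;aws_organizations_policy_attachment&quot; &quot;baseline_deny&quot; {
  policy_id = aws_organizations_policy.baseline_deny.id
  target_id = var.ou_id
}
```

Consolidation trades policy count for document size, so keep an eye on the per-policy size limit as the combined document grows.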

### 4. OU Name Format Changes

The `ManagedOrganizationalUnit` parameter in Account Factory changed format between Control Tower versions. Older versions expected `&quot;Custom (ou-xxxx-yyyyyyyy)&quot;`, newer versions just want the OU display name like `&quot;Sandbox&quot;`.

**Fix:** Check your Control Tower version. If in doubt, test with a sandbox account first.

### 5. Service Catalog Permissions

The IAM role running Terraform needs specific Service Catalog permissions that aren&apos;t obvious. You need not just `servicecatalog:ProvisionProduct` but also permissions to describe products, list artifacts, and manage provisioned products.

**Fix:** We created a dedicated IAM role for account provisioning with a policy that covers all Service Catalog and Organizations operations needed.
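
For orientation, a policy along these lines is one plausible starting point. The action list is an assumption about what Account Factory provisioning typically needs, not the exact policy we shipped, and the role may also need access to the Account Factory portfolio:

```hcl
# Sketch only: extend or trim the actions to match your own setup.
data &quot;aws_iam_policy_document&quot; &quot;account_provisioning&quot; {
  statement {
    sid    = &quot;ServiceCatalogAccountFactory&quot;
    effect = &quot;Allow&quot;
    actions = [
      &quot;servicecatalog:DescribeProduct&quot;,
      &quot;servicecatalog:DescribeProvisioningParameters&quot;,
      &quot;servicecatalog:ListProvisioningArtifacts&quot;,
      &quot;servicecatalog:ListLaunchPaths&quot;,
      &quot;servicecatalog:ProvisionProduct&quot;,
      &quot;servicecatalog:DescribeProvisionedProduct&quot;,
      &quot;servicecatalog:UpdateProvisionedProduct&quot;,
      &quot;servicecatalog:TerminateProvisionedProduct&quot;,
      &quot;servicecatalog:DescribeRecord&quot;,
    ]
    resources = [&quot;*&quot;]
  }

  statement {
    sid    = &quot;OrganizationsRead&quot;
    effect = &quot;Allow&quot;
    actions = [
      &quot;organizations:DescribeAccount&quot;,
      &quot;organizations:DescribeOrganizationalUnit&quot;,
      &quot;organizations:ListAccounts&quot;,
      &quot;organizations:ListParents&quot;,
    ]
    resources = [&quot;*&quot;]
  }
}

resource &quot;aws_iam_policy&quot; &quot;account_provisioning&quot; {
  name   = &quot;account-provisioning&quot;
  policy = data.aws_iam_policy_document.account_provisioning.json
}
```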

### 6. StackSet Eventual Consistency

After an account is created, StackSet deployments to that account aren&apos;t instant. The StackSet targets the OU, and AWS detects the new account eventually. We saw delays of 5-10 minutes.

**Fix:** Added explicit ordering in Terraform using `depends_on` chains. The baseline module depends on the account module completing, so nothing in the baseline starts until account provisioning has finished, which adds a natural delay.
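
In outline, the dependency looks something like this. The module inputs and the `account_id` output are hypothetical names used for illustration:

```hcl
# Sketch only: inputs and outputs are hypothetical.
module &quot;sandbox_account&quot; {
  source = &quot;../../modules/account&quot;

  account_name  = &quot;sandbox-example&quot;
  account_email = &quot;aws+sandbox-example@example.com&quot;
}

module &quot;sandbox_baseline&quot; {
  source = &quot;../../modules/account-baseline&quot;

  account_id = module.sandbox_account.account_id

  # Every resource in the baseline waits for the whole account module
  depends_on = [module.sandbox_account]
}
```

Referencing `module.sandbox_account.account_id` already creates an implicit dependency on that one output. The explicit `depends_on` makes the baseline wait for everything in the account module, not just the resource behind the output.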

### 7. Protecting Account Emails

Once an account exists, changing its root email is dangerous - it effectively changes who has root access. We needed to prevent accidental email changes.

**Fix:** Used an OPA/Rego policy in our CI platform to detect and block any Terraform plan that modifies the `AccountEmail` parameter of an existing provisioned product.

## Account Deletion Process

Deleting accounts requires a deliberate sequence:

1. **Remove from Terraform** - delete the module block and apply. This terminates the Service Catalog provisioned product but doesn&apos;t delete the AWS account.

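2. **Move to Suspended OU** - `aws organizations move-account --account-id XXXX --source-parent-id ou-current --destination-parent-id ou-suspended` (the CLI requires both the current and the destination parent ID; both values here are placeholders)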

3. **Close the account** - `aws organizations close-account --account-id XXXX`

4. **Wait 90 days** - AWS keeps a closed account in the `SUSPENDED` state for 90 days before permanently deleting it

5. **Clean up SSO** - remove any permission set assignments for the closed account

We built a runbook for this process rather than automating it. Account deletion should be intentional and reviewed.

## What We Ended Up With

After two weeks of work, the client went from:

- 4 loosely managed accounts with no guardrails
- Manual account creation via the console
- No centralised logging or security tooling
- Shared IAM users with long-lived credentials

To:

- A fully automated multi-account architecture
- Account creation via PR (30 minutes, zero console clicks)
- Centralised CloudTrail, Config, GuardDuty, and Security Hub
- SSO with permission sets (no more IAM users)
- SCPs enforcing guardrails across all OUs
- OPA policies preventing dangerous Terraform changes
- Full audit trail in Git for every account ever created

The infrastructure code lives in a single repository with a clear structure:

```
account-provisioning/
├── modules/
│   ├── account/              # Service Catalog + Account Factory
│   ├── account-baseline/     # Security baseline (GuardDuty, etc.)
│   ├── scp/                  # Service Control Policies
│   ├── iam-identity-center/  # SSO permission sets
│   └── spacelift-stack/      # CI/CD stack configuration
├── ou/
│   ├── locals.tf             # Org config (OU IDs, regions)
│   ├── providers.tf          # AWS + CI providers
│   ├── scps.tf               # SCP definitions
│   ├── platform/             # Platform OU accounts
│   ├── prod/                 # Production OU accounts
│   ├── staging/              # Staging OU accounts
│   ├── sandbox/              # Sandbox OU accounts
│   └── security/             # Security OU accounts
├── scps/                     # SCP JSON policy files
├── bootstrap/                # Bootstrap IAM roles
└── docs/                     # Architecture diagrams
```

## Key Takeaways

**Start with Control Tower early.** Retrofitting it onto existing accounts is painful. If you&apos;re building a new AWS setup, enable Control Tower on day one.

**Automate account creation from the start.** Even if you only have 3 accounts today, build the automation. When you need your 10th account, it&apos;ll be a 5-line Terraform change instead of an afternoon of clicking.

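**SCPs are your most powerful security tool.** An account admin can rewrite IAM policies, but nobody inside a member account can override an SCP. Use them for the things that must never happen - root user activity, disabling security services, operating in unapproved regions. Bear in mind that SCPs do not apply to the management account itself.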

**Use SSO, not IAM users.** Every account created by Account Factory gets SSO access automatically. There&apos;s no reason for long-lived IAM credentials in 2026.

**Test with Sandbox first.** Every SCP, every baseline, every module change - test it in the Sandbox OU before applying to production. SCPs that are too restrictive can lock you out of your own accounts.

**Document the deletion process.** Account creation is automated and repeatable. Account deletion is rare and high-risk. Write a runbook, not a Terraform module.

---

*Building a multi-account AWS setup or migrating to Control Tower? Feel free to reach out on [LinkedIn](https://linkedin.com/in/moabukar).*</content:encoded><category>aws</category><category>control-tower</category><category>terraform</category><category>multi-account</category><category>organizations</category><category>service-catalog</category><category>sso</category><category>iam-identity-center</category><category>scps</category><category>platform-engineering</category><category>spacelift</category><category>security</category><category>devops</category><author>Mo Abukar</author></item><item><title>Spacelift from Scratch: Automating Terraform at Scale with Spaces, Stacks, OPA Policies, and a Private Module Registry</title><link>https://moabukar.co.uk/blog/spacelift-terraform-automation-from-scratch/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/spacelift-terraform-automation-from-scratch/</guid><description>A complete guide to setting up Spacelift for multi-team Terraform automation - from zero to production with spaces, dynamic stacks, OPA security policies in Rego, private module registry, and GitOps-driven infrastructure.</description><pubDate>Sat, 14 Feb 2026 00:00:00 GMT</pubDate><content:encoded>If you&apos;ve ever managed Terraform at scale - multiple teams, multiple environments, multiple AWS accounts - you know the pain. GitHub Actions runners with static IAM keys stored in secrets. A pile of bash scripts stitching together `terraform plan` and `terraform apply`. PRs where nobody actually reviews the plan output because it&apos;s buried in a CI log. No guardrails, no approval gates, no shared modules.

I recently built out a complete Spacelift setup for a client - from zero to a fully automated, policy-driven, multi-team Terraform platform. This post covers everything: the architecture decisions, the Terraform code, the OPA policies in Rego, the private module registry, and the lessons learned along the way.

This isn&apos;t a surface-level overview. It&apos;s what we actually built, including the parts that didn&apos;t go smoothly.

![Spacelift Architecture](/images/spacelift-architecture.svg)

## Why Spacelift?

Before Spacelift, the client&apos;s Terraform workflow was the classic setup: GitHub Actions running `terraform plan` on PRs and `terraform apply` on merge. It worked for two engineers managing three environments. It stopped working when the team grew to fifteen engineers across four teams managing thirty-plus environments across multiple AWS accounts.

The problems were predictable:

**No RBAC.** Every engineer could apply to every environment. The payments team could accidentally destroy the data team&apos;s staging infrastructure. There was nothing preventing it except &quot;don&apos;t do that.&quot;

**Static credentials everywhere.** AWS access keys and secret keys stored in GitHub Actions secrets. Rotated manually. Shared across workflows. A security audit waiting to happen.

**No policy enforcement.** No way to enforce tagging standards, prevent public S3 buckets, or require approval for production changes. Everything was trust-based.

**No visibility.** Understanding which Terraform state files existed, what was drifting, and who changed what required digging through GitHub commit history and AWS CloudTrail logs.

### Why Not Terraform Cloud?

Terraform Cloud (now HCP Terraform) is the obvious alternative. We evaluated it. The dealbreakers were:

- **No hierarchical RBAC.** TFC has workspaces and teams, but not the nested spaces model Spacelift offers. We needed platform team &gt; environment &gt; team scoping.
- **OPA is bolted on, not native.** Spacelift treats OPA as a first-class citizen. Policies auto-attach via labels. TFC&apos;s Sentinel is powerful but uses a proprietary language.
- **No admin stacks.** In Spacelift, you can have a stack that creates other stacks. This is the cornerstone of dynamic infrastructure - you drop a config file and a stack appears. TFC doesn&apos;t have this concept natively.
- **Private module registry flexibility.** Spacelift&apos;s module registry integrates with its spaces and policies. TFC&apos;s registry is decent but lacks the triggering behaviour we wanted.

### Why Not Just GitHub Actions?

GitHub Actions is a CI/CD tool. It can run Terraform, but it doesn&apos;t understand Terraform. It doesn&apos;t know about state, drift, dependencies between stacks, or the difference between a plan that adds a tag and one that destroys a database.

Spacelift is purpose-built for infrastructure as code. It understands plans, resources, costs, and change impact. That matters when you&apos;re managing real infrastructure at scale.

## Core Concepts

Before diving into the implementation, let&apos;s establish the vocabulary. Spacelift has a handful of concepts that everything else builds on.

### Stacks

A stack is an isolated unit of Terraform execution. Think of it like a container for a Terraform run. Each stack has:

- Its own **state** (managed by Spacelift or an external backend)
- A **source code** pointer (a Git repo + branch + project root)
- **Environment variables** and **mounted files**
- A **run history** with full plan/apply logs
- **Labels** that determine which policies, contexts, and integrations attach

One stack typically maps to one environment of one service. So `payments-api-dev`, `payments-api-staging`, and `payments-api-prod` would be three separate stacks, all pointing to the same Terraform code but with different variable files and different spaces.

### Spaces

Spaces are Spacelift&apos;s hierarchical RBAC model. Think of them like folders in a file system - they nest, and permissions inherit downward.

Every Spacelift resource (stack, policy, context, module) lives in a space. Users and teams get access at the space level, and that access flows down to child spaces.

This is one of Spacelift&apos;s killer features. In Terraform Cloud, you manage access per-workspace. In Spacelift, you put staging stacks in the staging space and give the staging team access to that space. Done.

### Contexts

Contexts are bundles of environment variables and mounted files that can be attached to stacks. They&apos;re like shared configuration bags.

For example, an `aws-common` context might set `AWS_DEFAULT_REGION=eu-west-1` and `TF_LOG=ERROR`. A `datadog-credentials` context might inject API keys. Contexts attach to stacks either manually or via label-based auto-attach.
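
A minimal sketch of the `aws-common` context using the Spacelift Terraform provider. The `autoattach:aws` label is an assumption about how you would wire auto-attach, and the resource labels are illustrative:

```hcl
# Sketch only: label choice and resource names are assumptions.
resource &quot;spacelift_context&quot; &quot;aws_common&quot; {
  name        = &quot;aws-common&quot;
  description = &quot;Shared AWS settings for all stacks&quot;
  space_id    = &quot;root&quot;

  # Stacks carrying the matching label pick this context up automatically
  labels = [&quot;autoattach:aws&quot;]
}

resource &quot;spacelift_environment_variable&quot; &quot;aws_region&quot; {
  context_id = spacelift_context.aws_common.id
  name       = &quot;AWS_DEFAULT_REGION&quot;
  value      = &quot;eu-west-1&quot;
  write_only = false
}

resource &quot;spacelift_environment_variable&quot; &quot;tf_log&quot; {
  context_id = spacelift_context.aws_common.id
  name       = &quot;TF_LOG&quot;
  value      = &quot;ERROR&quot;
  write_only = false
}
```

Secrets such as the Datadog API keys would use `write_only = true` so the value cannot be read back in the UI.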

### Policies

Policies are OPA (Open Policy Agent) rules written in Rego. They control everything from what resources are allowed in a plan to who can approve a run to which stacks trigger when a module changes.

Spacelift has several policy types:

- **PLAN** - evaluated after `terraform plan`; can deny or warn
- **APPROVAL** - control who approves runs and when approval is required
- **ACCESS** - control who can read/write which stacks
- **TRIGGER** - determine which stacks to trigger when another stack finishes
- **PUSH** - control which Git pushes trigger runs
- **NOTIFICATION** - control notification routing

The key insight: policies auto-attach to stacks via labels. You label a stack with `security:all` and every policy that auto-attaches on `security:all` applies. No manual wiring.
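
As a hedged sketch, a policy registered through the Spacelift provider might carry the auto-attach label like this. The file path and the exact label string are assumptions for illustration:

```hcl
# Sketch only: path and label are illustrative.
resource &quot;spacelift_policy&quot; &quot;enforce_required_tags&quot; {
  name     = &quot;enforce-required-tags&quot;
  type     = &quot;PLAN&quot;
  space_id = &quot;root&quot;
  body     = file(&quot;${path.module}/../../policies/plan/enforce-required-tags.rego&quot;)

  # Attaches to every stack that carries the label after the autoattach: prefix
  labels = [&quot;autoattach:security:all&quot;]
}
```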

### Modules

Spacelift has a private Terraform module registry. You publish modules from Git repos, version them, and consume them from stacks using `source = &quot;spacelift.io/your-org/module-name/provider&quot;`.

The registry supports version constraints, automatic dependency triggering (when a module updates, stacks using it can auto-trigger), and the same spaces/RBAC model as everything else.

## Initial Setup - The Bootstrap Problem

Setting up Spacelift has a chicken-and-egg problem: you need a stack to manage Spacelift resources, but Spacelift resources include stacks. Where do you start?

The answer is a **management stack** (sometimes called an admin stack). You create it manually in the Spacelift UI, and it manages everything else via the Spacelift Terraform provider.

### Step 1: Create the Management Stack

In the Spacelift UI:

1. Create a new stack called `spacelift-management`
2. Point it to your infrastructure repo (e.g., `your-org/infrastructure`)
3. Set the project root to `spacelift/management`
4. Mark it as an **administrative** stack (this gives it permission to manage other Spacelift resources)
5. Set the branch to `main`

### Step 2: AWS OIDC Integration

The first thing the management stack does is set up AWS authentication. Spacelift supports OIDC natively - no static credentials needed.

```hcl
# spacelift/management/aws-integration.tf

resource &quot;spacelift_aws_integration&quot; &quot;main&quot; {
  name = &quot;aws-main&quot;

  # The IAM role Spacelift will assume via OIDC
  role_arn                       = &quot;arn:aws:iam::123456789012:role/spacelift-oidc&quot;
  duration_seconds               = 3600
  generate_credentials_in_worker = false
  space_id                       = &quot;root&quot;

  labels = [&quot;autoattach:aws&quot;]
}
```

On the AWS side, you need a trust policy that allows Spacelift&apos;s OIDC provider to assume the role:

```json
{
  &quot;Version&quot;: &quot;2012-10-17&quot;,
  &quot;Statement&quot;: [
    {
      &quot;Effect&quot;: &quot;Allow&quot;,
      &quot;Principal&quot;: {
        &quot;Federated&quot;: &quot;arn:aws:iam::123456789012:oidc-provider/oidc.spacelift.io&quot;
      },
      &quot;Action&quot;: &quot;sts:AssumeRoleWithWebIdentity&quot;,
      &quot;Condition&quot;: {
        &quot;StringEquals&quot;: {
          &quot;oidc.spacelift.io:aud&quot;: &quot;your-org.app.spacelift.io&quot;
        }
      }
    }
  ]
}
```

This means zero static credentials. Spacelift obtains temporary AWS credentials via OIDC for every run. The credentials expire after an hour. No rotation needed.
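
If you prefer to keep the role in Terraform too, the trust policy above translates roughly as follows. The resource label is illustrative, and because Spacelift cannot assume a role that does not exist yet, this would have to be applied once from a bootstrap configuration with other credentials:

```hcl
# Sketch only: mirrors the JSON trust policy above.
data &quot;aws_caller_identity&quot; &quot;current&quot; {}

resource &quot;aws_iam_role&quot; &quot;spacelift_oidc&quot; {
  name = &quot;spacelift-oidc&quot;

  # Matches duration_seconds on the Spacelift integration
  max_session_duration = 3600

  assume_role_policy = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [{
      Effect = &quot;Allow&quot;
      Principal = {
        Federated = &quot;arn:aws:iam::${data.aws_caller_identity.current.account_id}:oidc-provider/oidc.spacelift.io&quot;
      }
      Action = &quot;sts:AssumeRoleWithWebIdentity&quot;
      Condition = {
        StringEquals = {
          &quot;oidc.spacelift.io:aud&quot; = &quot;your-org.app.spacelift.io&quot;
        }
      }
    }]
  })
}
```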

### Step 3: Provider Configuration

The management stack uses both the Spacelift provider (to manage Spacelift resources) and the AWS provider (for the OIDC integration):

```hcl
# spacelift/management/providers.tf

terraform {
  required_providers {
    spacelift = {
      source  = &quot;spacelift-io/spacelift&quot;
      version = &quot;~&gt; 1.0&quot;
    }
    aws = {
      source  = &quot;hashicorp/aws&quot;
      version = &quot;~&gt; 5.0&quot;
    }
  }
}

provider &quot;spacelift&quot; {}

provider &quot;aws&quot; {
  region = &quot;eu-west-1&quot;

  default_tags {
    tags = {
      ManagedBy   = &quot;spacelift&quot;
      Environment = &quot;management&quot;
      Project     = &quot;spacelift&quot;
    }
  }
}
```

The Spacelift provider authenticates automatically when running inside a Spacelift stack - no API keys needed. It&apos;s one of those nice touches where the platform helps itself.

## Spaces Hierarchy

The spaces hierarchy is the backbone of the entire RBAC model. We designed it to mirror the company&apos;s organisational structure:

```
root
├── platform
├── sandbox
├── staging
├── prod
└── security
    ├── audit
    └── log-archive
```

The logic:

- **platform** - for the platform engineering team&apos;s own infrastructure (EKS clusters, networking, shared services)
- **sandbox** - development environments, relaxed policies, fast iteration
- **staging** - pre-production, stricter policies, mirrors prod
- **prod** - production, strictest policies, approval required
- **security** - security account infrastructure, restricted access
  - **audit** - CloudTrail, Config, GuardDuty aggregation
  - **log-archive** - centralised logging, long-term retention

Here&apos;s the Terraform code:

```hcl
# spacelift/management/spaces.tf

resource &quot;spacelift_space&quot; &quot;platform&quot; {
  name             = &quot;platform&quot;
  parent_space_id  = &quot;root&quot;
  description      = &quot;Platform engineering team infrastructure&quot;
  inherit_entities = true
}

resource &quot;spacelift_space&quot; &quot;sandbox&quot; {
  name             = &quot;sandbox&quot;
  parent_space_id  = &quot;root&quot;
  description      = &quot;Sandbox/development environments&quot;
  inherit_entities = true
}

resource &quot;spacelift_space&quot; &quot;staging&quot; {
  name             = &quot;staging&quot;
  parent_space_id  = &quot;root&quot;
  description      = &quot;Staging environments&quot;
  inherit_entities = true
}

resource &quot;spacelift_space&quot; &quot;prod&quot; {
  name             = &quot;prod&quot;
  parent_space_id  = &quot;root&quot;
  description      = &quot;Production environments&quot;
  inherit_entities = true
}

resource &quot;spacelift_space&quot; &quot;security&quot; {
  name             = &quot;security&quot;
  parent_space_id  = &quot;root&quot;
  description      = &quot;Security accounts infrastructure&quot;
  inherit_entities = true
}

resource &quot;spacelift_space&quot; &quot;audit&quot; {
  name             = &quot;audit&quot;
  parent_space_id  = spacelift_space.security.id
  description      = &quot;Audit account - CloudTrail, Config, GuardDuty&quot;
  inherit_entities = true
}

resource &quot;spacelift_space&quot; &quot;log_archive&quot; {
  name             = &quot;log-archive&quot;
  parent_space_id  = spacelift_space.security.id
  description      = &quot;Log archive account - centralised logging&quot;
  inherit_entities = true
}
```

The `inherit_entities = true` flag is important. It means policies, contexts, and integrations attached to a parent space are automatically available in child spaces. So an AWS integration attached at `root` is available to every space below it.

This cuts down on duplication massively. You define your AWS OIDC integration once at root, and every stack in every space can use it.

## Dynamic Stack Generation

This is where things get interesting. Instead of manually creating a Spacelift stack for every service-environment combination, we built a system where dropping a YAML config file into a directory automatically creates the stack.

### The Config File

Each service-environment combination has a `config.yaml` file:

```yaml
# environments/payments-api-dev/config.yaml
team: payments
project: payments-api
environment: dev
aws_account_id: &quot;111111111111&quot;
terraform_version: &quot;1.7.0&quot;
project_root: &quot;projects/payments-api/dev&quot;
auto_deploy: true
labels:
  - &quot;team:payments&quot;
  - &quot;env:dev&quot;
  - &quot;service:payments-api&quot;
```

```yaml
# environments/payments-api-prod/config.yaml
team: payments
project: payments-api
environment: prod
aws_account_id: &quot;333333333333&quot;
terraform_version: &quot;1.7.0&quot;
project_root: &quot;projects/payments-api/prod&quot;
auto_deploy: false
labels:
  - &quot;team:payments&quot;
  - &quot;env:prod&quot;
  - &quot;service:payments-api&quot;
```

### Reading Config Files Dynamically

The management stack reads all these config files and creates stacks from them:

```hcl
# spacelift/management/stacks.tf

locals {
  # Find all config.yaml files in the environments directory
  config_files = fileset(path.root, &quot;../../environments/*/config.yaml&quot;)

  # Parse each config file
  configs = {
    for f in local.config_files :
    dirname(f) =&gt; yamldecode(file(&quot;${path.root}/${f}&quot;))
  }

  # Map environments to spaces
  space_map = {
    dev     = spacelift_space.sandbox.id
    sandbox = spacelift_space.sandbox.id
    staging = spacelift_space.staging.id
    prod    = spacelift_space.prod.id
  }
}
```

### The Stack Module

We wrapped stack creation in a reusable module:

```hcl
# modules/spacelift-stack/main.tf

variable &quot;name&quot; {
  type        = string
  description = &quot;Stack name&quot;
}

variable &quot;repository&quot; {
  type        = string
  description = &quot;GitHub repository&quot;
  default     = &quot;infrastructure&quot;
}

variable &quot;branch&quot; {
  type        = string
  description = &quot;Git branch&quot;
  default     = &quot;main&quot;
}

variable &quot;project_root&quot; {
  type        = string
  description = &quot;Root directory for Terraform code&quot;
}

variable &quot;space_id&quot; {
  type        = string
  description = &quot;Spacelift space ID&quot;
}

variable &quot;terraform_version&quot; {
  type        = string
  description = &quot;Terraform version&quot;
  default     = &quot;1.7.0&quot;
}

variable &quot;auto_deploy&quot; {
  type        = bool
  description = &quot;Auto-deploy on merge&quot;
  default     = false
}

variable &quot;labels&quot; {
  type        = list(string)
  description = &quot;Stack labels for policy/context auto-attach&quot;
  default     = []
}

variable &quot;aws_integration_id&quot; {
  type        = string
  description = &quot;AWS integration ID&quot;
}

variable &quot;description&quot; {
  type        = string
  description = &quot;Stack description&quot;
  default     = &quot;&quot;
}

resource &quot;spacelift_stack&quot; &quot;this&quot; {
  name        = var.name
  description = var.description

  repository   = var.repository
  branch       = var.branch
  project_root = var.project_root

  space_id          = var.space_id
  terraform_version = var.terraform_version
  autodeploy        = var.auto_deploy

  labels = concat(var.labels, [
    &quot;autoattach:security-policies&quot;,
    &quot;autoattach:aws&quot;,
  ])

  # Enable local plan preview
  enable_local_preview = true

  # GitHub integration
  github_enterprise {
    namespace = &quot;your-org&quot;
  }
}

# Attach AWS integration
resource &quot;spacelift_aws_integration_attachment&quot; &quot;this&quot; {
  integration_id = var.aws_integration_id
  stack_id       = spacelift_stack.this.id
  read           = true
  write          = true
}
```

### Wiring It Together

Back in the management stack, we iterate over the configs to create stacks:

```hcl
# spacelift/management/stacks.tf (continued)

module &quot;stacks&quot; {
  source   = &quot;../../modules/spacelift-stack&quot;
  for_each = local.configs

  name               = &quot;${each.value.project}-${each.value.environment}&quot;
  project_root       = each.value.project_root
  space_id           = lookup(local.space_map, each.value.environment, spacelift_space.sandbox.id)
  terraform_version  = each.value.terraform_version
  auto_deploy        = each.value.auto_deploy
  aws_integration_id = spacelift_aws_integration.main.id
  description        = &quot;Stack for ${each.value.project} in ${each.value.environment} (team: ${each.value.team})&quot;

  labels = concat(
    each.value.labels,
    [
      &quot;team:${each.value.team}&quot;,
      &quot;env:${each.value.environment}&quot;,
      &quot;project:${each.value.project}&quot;,
    ]
  )
}
```

The beauty of this approach: a developer adds a `config.yaml` file, opens a PR, and on merge the management stack runs and creates the new stack automatically. No tickets, no manual clicks in the UI.
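
One thing to watch: the `lookup` call falls back to the sandbox space for any environment that is not in `space_map`, so a typo such as `prd` would quietly create a sandbox stack. A guard like the following is one option. It is a sketch that was not part of the original build, and `terraform_data` requires Terraform 1.4 or later:

```hcl
# Sketch only: optional guard, not part of the original setup.
locals {
  # Config directories whose environment has no entry in space_map
  unmapped_configs = [
    for dir, cfg in local.configs : dir
    if !contains(keys(local.space_map), cfg.environment)
  ]
}

resource &quot;terraform_data&quot; &quot;validate_environments&quot; {
  lifecycle {
    # Fails the run instead of silently defaulting to sandbox
    precondition {
      condition     = length(local.unmapped_configs) == 0
      error_message = &quot;Unknown environment in: ${join(&quot;, &quot;, local.unmapped_configs)}&quot;
    }
  }
}
```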

The `auto_deploy` field is key. For sandbox and staging, it&apos;s `true` - merge and it applies. For production, it&apos;s `false` - merge triggers a plan, but apply requires manual approval (enforced by OPA policy, which we&apos;ll get to).

## OPA Policies in Rego

This is the meat of the Spacelift setup. OPA policies written in Rego give you fine-grained control over what can and can&apos;t happen in your infrastructure. We wrote seven policies. Let me walk through each one.

### 1. Enforce Required Tags (PLAN Policy)

Every resource must have standard tags. No exceptions (well, a few exceptions - more on that).

```rego
# policies/plan/enforce-required-tags.rego

package spacelift

# Required tags that every taggable resource must have
required_tags := {
  &quot;Organisation&quot;,
  &quot;Project&quot;,
  &quot;Environment&quot;,
  &quot;Team&quot;,
  &quot;CostCentre&quot;,
  &quot;ManagedBy&quot;,
}

# Providers that don&apos;t use standard map-based tags
# These use list-style tags or have incompatible tag formats
excluded_providers := {
  &quot;datadog&quot;,
  &quot;pagerduty&quot;,
  &quot;cloudflare&quot;,
  &quot;helm&quot;,
  &quot;kubernetes&quot;,
  &quot;kubectl&quot;,
  &quot;vault&quot;,
  &quot;mongodbatlas&quot;,
}

# Check if a resource&apos;s provider is in the excluded list
is_excluded_provider(resource) {
  provider := split(resource.type, &quot;_&quot;)[0]
  excluded_providers[provider]
}

# Resources that are being created or updated and have tag support
taggable_resources[resource] {
  resource := input.terraform.resource_changes[_]
  resource.change.actions[_] == &quot;create&quot;
  not is_excluded_provider(resource)
  resource.change.after.tags != null
}

taggable_resources[resource] {
  resource := input.terraform.resource_changes[_]
  resource.change.actions[_] == &quot;update&quot;
  not is_excluded_provider(resource)
  resource.change.after.tags != null
}

# Find missing tags for a resource
missing_tags(resource) = missing {
  tags := resource.change.after.tags
  missing := {tag | tag := required_tags[_]; not tags[tag]}
}

# Deny resources missing required tags
deny[msg] {
  resource := taggable_resources[_]
  missing := missing_tags(resource)
  count(missing) &gt; 0
  msg := sprintf(
    &quot;Resource &apos;%s&apos; (%s) is missing required tags: %s&quot;,
    [resource.address, resource.type, concat(&quot;, &quot;, missing)]
  )
}

# Warn about resources where we can&apos;t verify tags
warn[msg] {
  resource := input.terraform.resource_changes[_]
  resource.change.actions[_] == &quot;create&quot;
  not is_excluded_provider(resource)
  resource.change.after.tags == null
  resource.change.after.tags_all != null
  msg := sprintf(
    &quot;Resource &apos;%s&apos; (%s) has tags_all but no explicit tags - verify default_tags are set&quot;,
    [resource.address, resource.type]
  )
}
```

The `excluded_providers` set is a real-world necessity. Datadog&apos;s Terraform provider, for example, uses a list of strings for tags (`[&quot;team:payments&quot;, &quot;env:prod&quot;]`) rather than a map. The Kubernetes and Helm providers have their own label concepts. Trying to enforce AWS-style tags on these providers just creates noise.

#### Test File for Tag Policy

```rego
# policies/plan/enforce-required-tags_test.rego

package spacelift

test_deny_missing_tags {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_s3_bucket.test&quot;,
        &quot;type&quot;: &quot;aws_s3_bucket&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {
            &quot;tags&quot;: {
              &quot;Organisation&quot;: &quot;acme&quot;,
              &quot;Project&quot;: &quot;test&quot;
            }
          }
        }
      }]
    }
  }
  count(result) &gt; 0
}

test_allow_all_tags_present {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_s3_bucket.test&quot;,
        &quot;type&quot;: &quot;aws_s3_bucket&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {
            &quot;tags&quot;: {
              &quot;Organisation&quot;: &quot;acme&quot;,
              &quot;Project&quot;: &quot;test&quot;,
              &quot;Environment&quot;: &quot;dev&quot;,
              &quot;Team&quot;: &quot;platform&quot;,
              &quot;CostCentre&quot;: &quot;engineering&quot;,
              &quot;ManagedBy&quot;: &quot;terraform&quot;
            }
          }
        }
      }]
    }
  }
  count(result) == 0
}

test_excluded_provider_skipped {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;datadog_monitor.test&quot;,
        &quot;type&quot;: &quot;datadog_monitor&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {
            &quot;tags&quot;: null
          }
        }
      }]
    }
  }
  count(result) == 0
}

test_update_also_checked {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_instance.test&quot;,
        &quot;type&quot;: &quot;aws_instance&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;update&quot;],
          &quot;after&quot;: {
            &quot;tags&quot;: {
              &quot;Name&quot;: &quot;test&quot;
            }
          }
        }
      }]
    }
  }
  count(result) &gt; 0
}
```

### 2. No Public RDS (PLAN Policy)

RDS instances must never be publicly accessible. Full stop.

```rego
# policies/plan/no-public-rds.rego

package spacelift

# Deny publicly accessible RDS instances
deny[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_db_instance&quot;
  resource.change.actions[_] == &quot;create&quot;
  resource.change.after.publicly_accessible == true
  msg := sprintf(
    &quot;RDS instance &apos;%s&apos; is set to publicly accessible. This is not allowed.&quot;,
    [resource.address]
  )
}

deny[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_db_instance&quot;
  resource.change.actions[_] == &quot;update&quot;
  resource.change.after.publicly_accessible == true
  msg := sprintf(
    &quot;RDS instance &apos;%s&apos; is being updated to publicly accessible. This is not allowed.&quot;,
    [resource.address]
  )
}

# Deny publicly accessible RDS clusters (Aurora)
deny[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_rds_cluster&quot;
  resource.change.actions[_] == &quot;create&quot;
  resource.change.after.publicly_accessible == true
  msg := sprintf(
    &quot;RDS cluster &apos;%s&apos; is set to publicly accessible. This is not allowed.&quot;,
    [resource.address]
  )
}

# Also check cluster instances
deny[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_rds_cluster_instance&quot;
  resource.change.actions[_] == &quot;create&quot;
  resource.change.after.publicly_accessible == true
  msg := sprintf(
    &quot;RDS cluster instance &apos;%s&apos; is set to publicly accessible. This is not allowed.&quot;,
    [resource.address]
  )
}
```

#### Test File for RDS Policy

```rego
# policies/plan/no-public-rds_test.rego

package spacelift

test_deny_public_rds_instance {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_db_instance.main&quot;,
        &quot;type&quot;: &quot;aws_db_instance&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {
            &quot;publicly_accessible&quot;: true
          }
        }
      }]
    }
  }
  count(result) &gt; 0
}

test_allow_private_rds_instance {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_db_instance.main&quot;,
        &quot;type&quot;: &quot;aws_db_instance&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {
            &quot;publicly_accessible&quot;: false
          }
        }
      }]
    }
  }
  count(result) == 0
}

test_deny_public_aurora_cluster {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_rds_cluster.main&quot;,
        &quot;type&quot;: &quot;aws_rds_cluster&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {
            &quot;publicly_accessible&quot;: true
          }
        }
      }]
    }
  }
  count(result) &gt; 0
}

test_deny_public_cluster_instance {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_rds_cluster_instance.main&quot;,
        &quot;type&quot;: &quot;aws_rds_cluster_instance&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {
            &quot;publicly_accessible&quot;: true
          }
        }
      }]
    }
  }
  count(result) &gt; 0
}
```

### 3. No Public S3 (PLAN Policy)

Every S3 bucket must be created alongside an `aws_s3_bucket_public_access_block` with all four settings enabled.

```rego
# policies/plan/no-public-s3.rego

package spacelift

# Deny S3 buckets without public access block
deny[msg] {
  bucket := input.terraform.resource_changes[_]
  bucket.type == &quot;aws_s3_bucket&quot;
  bucket.change.actions[_] == &quot;create&quot;

  # Check if there&apos;s a matching public access block
  not has_public_access_block(bucket.address)

  msg := sprintf(
    &quot;S3 bucket &apos;%s&apos; does not have an associated aws_s3_bucket_public_access_block. All S3 buckets must block public access.&quot;,
    [bucket.address]
  )
}

# Check for a fully restrictive public access block in the plan.
# NOTE: bucket_address is not used in the body, so this matches ANY fully
# restrictive block in the same plan, not one tied to this specific bucket.
# Linking a block to its bucket is not attempted here.
has_public_access_block(bucket_address) {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_s3_bucket_public_access_block&quot;
  resource.change.actions[_] == &quot;create&quot;
  resource.change.after.block_public_acls == true
  resource.change.after.block_public_policy == true
  resource.change.after.ignore_public_acls == true
  resource.change.after.restrict_public_buckets == true
}

# Deny public access blocks that aren&apos;t fully restrictive
deny[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_s3_bucket_public_access_block&quot;
  resource.change.actions[_] == &quot;create&quot;

  not resource.change.after.block_public_acls == true
  msg := sprintf(
    &quot;S3 public access block &apos;%s&apos; must have block_public_acls = true&quot;,
    [resource.address]
  )
}

deny[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_s3_bucket_public_access_block&quot;
  resource.change.actions[_] == &quot;create&quot;

  not resource.change.after.block_public_policy == true
  msg := sprintf(
    &quot;S3 public access block &apos;%s&apos; must have block_public_policy = true&quot;,
    [resource.address]
  )
}

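# ignore_public_acls is required by has_public_access_block, so give it
# its own explicit message like the other three settings
deny[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_s3_bucket_public_access_block&quot;
  resource.change.actions[_] == &quot;create&quot;

  not resource.change.after.ignore_public_acls == true
  msg := sprintf(
    &quot;S3 public access block &apos;%s&apos; must have ignore_public_acls = true&quot;,
    [resource.address]
  )
}

deny[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_s3_bucket_public_access_block&quot;
  resource.change.actions[_] == &quot;create&quot;

  not resource.change.after.restrict_public_buckets == true
  msg := sprintf(
    &quot;S3 public access block &apos;%s&apos; must have restrict_public_buckets = true&quot;,
    [resource.address]
  )
}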
```
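
The S3 policy can be exercised the same way as the RDS one. What follows is a minimal sketch of a test file, not a definitive suite: the file name, the `fixture_*` helpers, and the `aws_s3_bucket.logs` addresses are illustrative assumptions. The `== 0` assertion also assumes the tests run against this policy file on its own (for example `opa test policies/plan/no-public-s3.rego policies/plan/no-public-s3_test.rego`), since other `deny` rules in the same package would add their own messages.

```rego
# policies/plan/no-public-s3_test.rego (hypothetical file name)

package spacelift

# Illustrative fixture: a bucket being created
fixture_bucket := {
  &quot;address&quot;: &quot;aws_s3_bucket.logs&quot;,
  &quot;type&quot;: &quot;aws_s3_bucket&quot;,
  &quot;change&quot;: {&quot;actions&quot;: [&quot;create&quot;], &quot;after&quot;: {}}
}

# Illustrative fixture: a fully restrictive public access block
fixture_full_block := {
  &quot;address&quot;: &quot;aws_s3_bucket_public_access_block.logs&quot;,
  &quot;type&quot;: &quot;aws_s3_bucket_public_access_block&quot;,
  &quot;change&quot;: {
    &quot;actions&quot;: [&quot;create&quot;],
    &quot;after&quot;: {
      &quot;block_public_acls&quot;: true,
      &quot;block_public_policy&quot;: true,
      &quot;ignore_public_acls&quot;: true,
      &quot;restrict_public_buckets&quot;: true
    }
  }
}

# A bucket with no public access block in the plan is denied
test_deny_bucket_without_block {
  result := deny with input as {
    &quot;terraform&quot;: {&quot;resource_changes&quot;: [fixture_bucket]}
  }
  count(result) &gt; 0
}

# A bucket plus a fully restrictive block produces no denials
test_allow_bucket_with_full_block {
  result := deny with input as {
    &quot;terraform&quot;: {&quot;resource_changes&quot;: [fixture_bucket, fixture_full_block]}
  }
  count(result) == 0
}

# A block that leaves block_public_acls off is denied
test_deny_partially_open_block {
  result := deny with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [fixture_bucket, {
        &quot;address&quot;: &quot;aws_s3_bucket_public_access_block.logs&quot;,
        &quot;type&quot;: &quot;aws_s3_bucket_public_access_block&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {
            &quot;block_public_acls&quot;: false,
            &quot;block_public_policy&quot;: true,
            &quot;ignore_public_acls&quot;: true,
            &quot;restrict_public_buckets&quot;: true
          }
        }
      }]
    }
  }
  count(result) &gt; 0
}
```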

### 4. Cost Limit Warning (PLAN Policy)

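This one doesn&apos;t block - it warns. It doesn&apos;t estimate spend directly; it flags large instance types and unusually large plans, which gave us visibility into expensive changes without adding a hard gate.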

```rego
# policies/plan/cost-limit-warning.rego

package spacelift

# Expensive instance types that should trigger a review
expensive_instance_types := {
  &quot;db.r6g.4xlarge&quot;,
  &quot;db.r6g.8xlarge&quot;,
  &quot;db.r6g.12xlarge&quot;,
  &quot;db.r6g.16xlarge&quot;,
  &quot;db.r6i.4xlarge&quot;,
  &quot;db.r6i.8xlarge&quot;,
  &quot;db.r6i.12xlarge&quot;,
  &quot;db.r6i.16xlarge&quot;,
  &quot;db.r5.4xlarge&quot;,
  &quot;db.r5.8xlarge&quot;,
  &quot;db.r5.12xlarge&quot;,
  &quot;db.r5.16xlarge&quot;,
  &quot;m6i.4xlarge&quot;,
  &quot;m6i.8xlarge&quot;,
  &quot;m6i.12xlarge&quot;,
  &quot;m6i.16xlarge&quot;,
  &quot;c6i.4xlarge&quot;,
  &quot;c6i.8xlarge&quot;,
  &quot;c6i.12xlarge&quot;,
  &quot;c6i.16xlarge&quot;,
  &quot;r6i.4xlarge&quot;,
  &quot;r6i.8xlarge&quot;,
  &quot;r6i.12xlarge&quot;,
  &quot;r6i.16xlarge&quot;,
}

# Count resources being created
creates := count([r |
  r := input.terraform.resource_changes[_]
  r.change.actions[_] == &quot;create&quot;
])

# Count resources being destroyed
destroys := count([r |
  r := input.terraform.resource_changes[_]
  r.change.actions[_] == &quot;delete&quot;
])

# Warn on large number of creates
warn[msg] {
  creates &gt; 20
  msg := sprintf(
    &quot;This plan creates %d resources. Please review carefully before applying.&quot;,
    [creates]
  )
}

# Warn on large number of destroys
warn[msg] {
  destroys &gt; 10
  msg := sprintf(
    &quot;WARNING: This plan destroys %d resources. Please verify this is intentional.&quot;,
    [destroys]
  )
}

# Warn on expensive RDS instance types
warn[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_db_instance&quot;
  resource.change.actions[_] == &quot;create&quot;
  expensive_instance_types[resource.change.after.instance_class]
  msg := sprintf(
    &quot;RDS instance &apos;%s&apos; uses expensive instance type &apos;%s&apos;. Please verify this is justified.&quot;,
    [resource.address, resource.change.after.instance_class]
  )
}

# Warn on expensive EC2 instance types
warn[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_instance&quot;
  resource.change.actions[_] == &quot;create&quot;
  expensive_instance_types[resource.change.after.instance_type]
  msg := sprintf(
    &quot;EC2 instance &apos;%s&apos; uses expensive instance type &apos;%s&apos;. Please verify this is justified.&quot;,
    [resource.address, resource.change.after.instance_type]
  )
}

# Warn on expensive RDS cluster instances (Aurora)
warn[msg] {
  resource := input.terraform.resource_changes[_]
  resource.type == &quot;aws_rds_cluster_instance&quot;
  resource.change.actions[_] == &quot;create&quot;
  expensive_instance_types[resource.change.after.instance_class]
  msg := sprintf(
    &quot;Aurora instance &apos;%s&apos; uses expensive instance type &apos;%s&apos;. Please verify this is justified.&quot;,
    [resource.address, resource.change.after.instance_class]
  )
}
```
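
The warning thresholds are easy to get wrong by one, so a test file helps here too. This is a sketch under assumptions: the file name and resource addresses are made up, and the `== 0` check assumes no other `warn` rules from the same package are loaded alongside it.

```rego
# policies/plan/cost-limit-warning_test.rego (hypothetical file name)

package spacelift

# An RDS instance class from the expensive set triggers a warning
test_warn_on_expensive_rds_instance {
  result := warn with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_db_instance.main&quot;,
        &quot;type&quot;: &quot;aws_db_instance&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {&quot;instance_class&quot;: &quot;db.r6g.4xlarge&quot;}
        }
      }]
    }
  }
  count(result) &gt; 0
}

# A small instance class stays quiet
test_no_warning_for_small_rds_instance {
  result := warn with input as {
    &quot;terraform&quot;: {
      &quot;resource_changes&quot;: [{
        &quot;address&quot;: &quot;aws_db_instance.main&quot;,
        &quot;type&quot;: &quot;aws_db_instance&quot;,
        &quot;change&quot;: {
          &quot;actions&quot;: [&quot;create&quot;],
          &quot;after&quot;: {&quot;instance_class&quot;: &quot;db.t3.medium&quot;}
        }
      }]
    }
  }
  count(result) == 0
}

# 11 deletes is one over the threshold of 10
test_warn_on_many_destroys {
  changes := [c |
    n := numbers.range(1, 11)[_]
    c := {
      &quot;address&quot;: sprintf(&quot;aws_instance.old[%d]&quot;, [n]),
      &quot;type&quot;: &quot;aws_instance&quot;,
      &quot;change&quot;: {&quot;actions&quot;: [&quot;delete&quot;], &quot;after&quot;: null}
    }
  ]
  result := warn with input as {&quot;terraform&quot;: {&quot;resource_changes&quot;: changes}}
  count(result) &gt; 0
}
```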

### 5. Production Requires Approval (APPROVAL Policy)

This is the gate that prevents auto-deploy to production. Even if someone sets `autodeploy = true` on a prod stack, this policy catches it.

```rego
# policies/approval/prod-requires-approval.rego

package spacelift

# Reject auto-deploy for production stacks
reject[msg] {
  input.run.type == &quot;TRACKED&quot;
  is_production
  msg := &quot;Production stacks require manual approval before apply.&quot;
}

# Approve when at least one reviewer approves
approve {
  count(input.reviews.current.approvals) &gt; 0
}

# Check if the stack is in a production space or has production labels
is_production {
  input.stack.labels[_] == &quot;env:prod&quot;
}

is_production {
  contains(input.stack.space.name, &quot;prod&quot;)
}
```

There&apos;s a subtlety here worth calling out. The `reject` rule prevents auto-deployment and requires approval. The `approve` rule defines when enough approvals have been collected. Together they create a manual gate for production.
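
One way to pin down what `is_production` actually matches is a small test file. This is a sketch, not a definitive suite: the file name, the space names `payments` and `non-prod`, and the `author` field on the approval are illustrative assumptions. The third test is the interesting one. Because the space check uses `contains`, a space named `non-prod` also counts as production, which may or may not be what you want.

```rego
# policies/approval/prod-requires-approval_test.rego (hypothetical file name)

package spacelift

# A tracked run on a stack labelled env:prod is rejected
test_reject_stack_with_prod_label {
  result := reject with input as {
    &quot;run&quot;: {&quot;type&quot;: &quot;TRACKED&quot;},
    &quot;stack&quot;: {&quot;labels&quot;: [&quot;env:prod&quot;], &quot;space&quot;: {&quot;name&quot;: &quot;payments&quot;}}
  }
  count(result) &gt; 0
}

# A dev stack in the sandbox space is left alone
test_no_reject_for_dev_stack {
  result := reject with input as {
    &quot;run&quot;: {&quot;type&quot;: &quot;TRACKED&quot;},
    &quot;stack&quot;: {&quot;labels&quot;: [&quot;env:dev&quot;], &quot;space&quot;: {&quot;name&quot;: &quot;sandbox&quot;}}
  }
  count(result) == 0
}

# Substring matching: &quot;non-prod&quot; contains &quot;prod&quot;, so it is treated as production
test_substring_match_on_space_name {
  result := reject with input as {
    &quot;run&quot;: {&quot;type&quot;: &quot;TRACKED&quot;},
    &quot;stack&quot;: {&quot;labels&quot;: [], &quot;space&quot;: {&quot;name&quot;: &quot;non-prod&quot;}}
  }
  count(result) &gt; 0
}

# approve is true with one approval and undefined with none
test_approve_needs_at_least_one_approval {
  approve with input as {
    &quot;reviews&quot;: {&quot;current&quot;: {&quot;approvals&quot;: [{&quot;author&quot;: &quot;alice&quot;}]}}
  }
  not approve with input as {
    &quot;reviews&quot;: {&quot;current&quot;: {&quot;approvals&quot;: []}}
  }
}
```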

### 6. Project Ownership (ACCESS Policy)

This policy controls who can see and manage which stacks based on team labels.

```rego
# policies/access/project-ownership.rego

package spacelift

# Team-to-login mapping
team_logins := {
  &quot;payments&quot;: [&quot;github-payments-team&quot;],
  &quot;data&quot;: [&quot;github-data-team&quot;],
  &quot;platform&quot;: [&quot;github-platform-team&quot;],
  &quot;security&quot;: [&quot;github-security-team&quot;],
}

# Platform team gets read access to everything
read {
  input.session.teams[_] == &quot;github-platform-team&quot;
}

# Platform team gets write access to everything
write {
  input.session.teams[_] == &quot;github-platform-team&quot;
}

# Teams get write access to their own stacks
write {
  team := input.stack.labels[i]
  startswith(team, &quot;team:&quot;)
  team_name := substring(team, 5, -1)
  allowed_logins := team_logins[team_name]
  allowed_login := allowed_logins[_]
  input.session.teams[_] == allowed_login
}

# Teams get read access to their own stacks
read {
  team := input.stack.labels[i]
  startswith(team, &quot;team:&quot;)
  team_name := substring(team, 5, -1)
  allowed_logins := team_logins[team_name]
  allowed_login := allowed_logins[_]
  input.session.teams[_] == allowed_login
}

# Deny write to production for non-platform teams
deny_write[msg] {
  input.stack.labels[_] == &quot;env:prod&quot;
  not input.session.teams[_] == &quot;github-platform-team&quot;
  msg := &quot;Only the platform team can write to production stacks.&quot;
}
```
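
The label-to-team resolution is the part most likely to break silently, so it is worth a few tests. The sketch below only covers the `read` and `write` rules; the file name and the `github-commerce-team` login are hypothetical. The last test shows that a `team:` label with no entry in `team_logins` grants nothing, so a new team needs adding to the mapping before its stacks become visible.

```rego
# policies/access/project-ownership_test.rego (hypothetical file name)

package spacelift

# The payments team can write to a stack labelled team:payments
test_team_can_write_own_stack {
  write with input as {
    &quot;session&quot;: {&quot;teams&quot;: [&quot;github-payments-team&quot;]},
    &quot;stack&quot;: {&quot;labels&quot;: [&quot;team:payments&quot;, &quot;env:dev&quot;]}
  }
}

# The data team gets no write access to a payments stack
test_team_cannot_write_other_teams_stack {
  not write with input as {
    &quot;session&quot;: {&quot;teams&quot;: [&quot;github-data-team&quot;]},
    &quot;stack&quot;: {&quot;labels&quot;: [&quot;team:payments&quot;, &quot;env:dev&quot;]}
  }
}

# The platform team can write regardless of the team label
test_platform_team_can_write_any_stack {
  write with input as {
    &quot;session&quot;: {&quot;teams&quot;: [&quot;github-platform-team&quot;]},
    &quot;stack&quot;: {&quot;labels&quot;: [&quot;team:payments&quot;, &quot;env:prod&quot;]}
  }
}

# A team label missing from team_logins grants no access at all
test_unmapped_team_label_grants_nothing {
  not read with input as {
    &quot;session&quot;: {&quot;teams&quot;: [&quot;github-commerce-team&quot;]},
    &quot;stack&quot;: {&quot;labels&quot;: [&quot;team:commerce&quot;]}
  }
}
```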

### 7. Module Change Trigger (TRIGGER Policy)

When a module in the private registry is updated, this policy automatically triggers runs on stacks that depend on it.

```rego
# policies/trigger/module-change.rego

package spacelift

# Trigger stacks that use the updated module
trigger[stack_id] {
  # The stack that just finished is a module
  input.run.state == &quot;FINISHED&quot;
  input.run.type == &quot;TRACKED&quot;

  # Get the module name from the triggering stack&apos;s labels
  module_label := input.stack.labels[_]
  startswith(module_label, &quot;module:&quot;)
  module_name := substring(module_label, 7, -1)

  # Find stacks that depend on this module
  stack := input.stacks[_]
  dep_label := stack.labels[_]
  dep_label == sprintf(&quot;depends-on:%s&quot;, [module_name])
  stack_id := stack.id
}
```
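
The label matching can be checked with a short test. This is a sketch under assumptions: the file name and the stack IDs `payments-api-dev` and `billing-dev` are illustrative, and the input only includes the fields the rule reads.

```rego
# policies/trigger/module-change_test.rego (hypothetical file name)

package spacelift

# Only the stack labelled depends-on:vpc is triggered by a vpc module run
test_trigger_only_dependent_stacks {
  result := trigger with input as {
    &quot;run&quot;: {&quot;state&quot;: &quot;FINISHED&quot;, &quot;type&quot;: &quot;TRACKED&quot;},
    &quot;stack&quot;: {&quot;labels&quot;: [&quot;module:vpc&quot;]},
    &quot;stacks&quot;: [
      {&quot;id&quot;: &quot;payments-api-dev&quot;, &quot;labels&quot;: [&quot;depends-on:vpc&quot;, &quot;depends-on:ecs&quot;]},
      {&quot;id&quot;: &quot;billing-dev&quot;, &quot;labels&quot;: [&quot;depends-on:rds&quot;]}
    ]
  }
  result == {&quot;payments-api-dev&quot;}
}

# A run that did not finish triggers nothing
test_no_trigger_when_run_failed {
  result := trigger with input as {
    &quot;run&quot;: {&quot;state&quot;: &quot;FAILED&quot;, &quot;type&quot;: &quot;TRACKED&quot;},
    &quot;stack&quot;: {&quot;labels&quot;: [&quot;module:vpc&quot;]},
    &quot;stacks&quot;: [
      {&quot;id&quot;: &quot;payments-api-dev&quot;, &quot;labels&quot;: [&quot;depends-on:vpc&quot;]}
    ]
  }
  count(result) == 0
}
```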

### Registering Policies with Auto-Attach

Policies are created as Spacelift resources and auto-attach to stacks via labels:

```hcl
# spacelift/management/policies.tf

resource &quot;spacelift_policy&quot; &quot;enforce_required_tags&quot; {
  name        = &quot;enforce-required-tags&quot;
  type        = &quot;PLAN&quot;
  body        = file(&quot;${path.module}/../../policies/plan/enforce-required-tags.rego&quot;)
  space_id    = &quot;root&quot;
  description = &quot;Enforce required tags on all taggable resources&quot;

  labels = [&quot;autoattach:security-policies&quot;]
}

resource &quot;spacelift_policy&quot; &quot;no_public_rds&quot; {
  name        = &quot;no-public-rds&quot;
  type        = &quot;PLAN&quot;
  body        = file(&quot;${path.module}/../../policies/plan/no-public-rds.rego&quot;)
  space_id    = &quot;root&quot;
  description = &quot;Prevent publicly accessible RDS instances&quot;

  labels = [&quot;autoattach:security-policies&quot;]
}

resource &quot;spacelift_policy&quot; &quot;no_public_s3&quot; {
  name        = &quot;no-public-s3&quot;
  type        = &quot;PLAN&quot;
  body        = file(&quot;${path.module}/../../policies/plan/no-public-s3.rego&quot;)
  space_id    = &quot;root&quot;
  description = &quot;Ensure S3 buckets have public access blocks&quot;

  labels = [&quot;autoattach:security-policies&quot;]
}

resource &quot;spacelift_policy&quot; &quot;cost_limit_warning&quot; {
  name        = &quot;cost-limit-warning&quot;
  type        = &quot;PLAN&quot;
  body        = file(&quot;${path.module}/../../policies/plan/cost-limit-warning.rego&quot;)
  space_id    = &quot;root&quot;
  description = &quot;Warn on expensive resources and large changes&quot;

  labels = [&quot;autoattach:security-policies&quot;]
}

resource &quot;spacelift_policy&quot; &quot;prod_requires_approval&quot; {
  name        = &quot;prod-requires-approval&quot;
  type        = &quot;APPROVAL&quot;
  body        = file(&quot;${path.module}/../../policies/approval/prod-requires-approval.rego&quot;)
  space_id    = &quot;root&quot;
  description = &quot;Require manual approval for production stacks&quot;

  labels = [&quot;autoattach:security-policies&quot;]
}

resource &quot;spacelift_policy&quot; &quot;project_ownership&quot; {
  name        = &quot;project-ownership&quot;
  type        = &quot;ACCESS&quot;
  body        = file(&quot;${path.module}/../../policies/access/project-ownership.rego&quot;)
  space_id    = &quot;root&quot;
  description = &quot;Team-based stack access control&quot;

  labels = [&quot;autoattach:security-policies&quot;]
}

resource &quot;spacelift_policy&quot; &quot;module_change_trigger&quot; {
  name        = &quot;module-change-trigger&quot;
  type        = &quot;TRIGGER&quot;
  body        = file(&quot;${path.module}/../../policies/trigger/module-change.rego&quot;)
  space_id    = &quot;root&quot;
  description = &quot;Trigger dependent stacks when modules update&quot;

  labels = [&quot;autoattach:security-policies&quot;]
}
```

The `autoattach:security-policies` label is the glue. Every stack we create includes this label, so every policy automatically applies. No manual wiring.

## Private Module Registry

One of the most valuable parts of the Spacelift setup was the private module registry. Instead of teams copy-pasting Terraform code or referencing Git repos with `?ref=v1.2.3`, they consume versioned modules from Spacelift&apos;s registry.

### The Module Wrapper

We created a reusable module for registering modules in Spacelift:

```hcl
# modules/spacelift-module/main.tf

variable &quot;name&quot; {
  type        = string
  description = &quot;Module name&quot;
}

variable &quot;repository&quot; {
  type        = string
  description = &quot;GitHub repository containing the module&quot;
}

variable &quot;branch&quot; {
  type        = string
  description = &quot;Git branch&quot;
  default     = &quot;main&quot;
}

variable &quot;project_root&quot; {
  type        = string
  description = &quot;Root directory in the repo&quot;
  default     = &quot;&quot;
}

variable &quot;space_id&quot; {
  type        = string
  description = &quot;Space ID&quot;
}

variable &quot;description&quot; {
  type        = string
  description = &quot;Module description&quot;
  default     = &quot;&quot;
}

variable &quot;labels&quot; {
  type        = list(string)
  description = &quot;Labels&quot;
  default     = []
}

variable &quot;terraform_provider&quot; {
  type        = string
  description = &quot;Terraform provider name&quot;
  default     = &quot;aws&quot;
}

resource &quot;spacelift_module&quot; &quot;this&quot; {
  name        = var.name
  description = var.description

  repository   = var.repository
  branch       = var.branch
  project_root = var.project_root

  space_id           = var.space_id
  terraform_provider = var.terraform_provider

  labels = concat(var.labels, [
    &quot;module:${var.name}&quot;,
    &quot;autoattach:security-policies&quot;,
  ])

  github_enterprise {
    namespace = &quot;your-org&quot;
  }
}

output &quot;id&quot; {
  value = spacelift_module.this.id
}
```

### Registering Modules

Each internal module gets registered:

```hcl
# spacelift/management/modules.tf

module &quot;module_vpc&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;vpc&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/vpc&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;VPC module with private/public subnets, NAT gateways, and flow logs&quot;

  labels = [&quot;module:vpc&quot;]
}

module &quot;module_ecs&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;ecs&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/ecs&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;ECS cluster and service module with Fargate support&quot;

  labels = [&quot;module:ecs&quot;]
}

module &quot;module_rds&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;rds&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/rds&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;RDS instance module with encryption, backups, and parameter groups&quot;

  labels = [&quot;module:rds&quot;]
}

module &quot;module_aurora&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;aurora&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/aurora&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;Aurora cluster module with vertical autoscaling and read replicas&quot;

  labels = [&quot;module:aurora&quot;]
}

module &quot;module_alb&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;alb&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/alb&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;Application Load Balancer with WAF integration&quot;

  labels = [&quot;module:alb&quot;]
}

module &quot;module_context&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;context&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/context&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;Shared context module for Spacelift contexts&quot;

  labels = [&quot;module:context&quot;]
}

module &quot;module_vault&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;vault&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/vault&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;HashiCorp Vault cluster on ECS&quot;

  labels = [&quot;module:vault&quot;]
}

module &quot;module_nats&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;nats&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/nats&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;NATS messaging cluster module&quot;

  labels = [&quot;module:nats&quot;]
}

module &quot;module_clickhouse&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name         = &quot;clickhouse&quot;
  repository   = &quot;terraform-modules&quot;
  project_root = &quot;modules/clickhouse&quot;
  space_id     = spacelift_space.platform.id
  description  = &quot;ClickHouse analytics database module&quot;

  labels = [&quot;module:clickhouse&quot;]
}

module &quot;module_datadog_monitors&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name               = &quot;datadog-monitors&quot;
  repository         = &quot;terraform-modules&quot;
  project_root       = &quot;modules/datadog-monitors&quot;
  space_id           = spacelift_space.platform.id
  terraform_provider = &quot;datadog&quot;
  description        = &quot;Datadog monitor definitions&quot;

  labels = [&quot;module:datadog-monitors&quot;]
}

module &quot;module_datadog_dashboards&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name               = &quot;datadog-dashboards&quot;
  repository         = &quot;terraform-modules&quot;
  project_root       = &quot;modules/datadog-dashboards&quot;
  space_id           = spacelift_space.platform.id
  terraform_provider = &quot;datadog&quot;
  description        = &quot;Datadog dashboard definitions&quot;

  labels = [&quot;module:datadog-dashboards&quot;]
}

module &quot;module_datadog_synthetics&quot; {
  source = &quot;../../modules/spacelift-module&quot;

  name               = &quot;datadog-synthetics&quot;
  repository         = &quot;terraform-modules&quot;
  project_root       = &quot;modules/datadog-synthetics&quot;
  space_id           = spacelift_space.platform.id
  terraform_provider = &quot;datadog&quot;
  description        = &quot;Datadog synthetic test definitions&quot;

  labels = [&quot;module:datadog-synthetics&quot;]
}
```

### Consuming Modules

Teams consume modules using the Spacelift registry source format:

```hcl
# projects/payments-api/dev/main.tf

module &quot;vpc&quot; {
  source  = &quot;spacelift.io/your-org/vpc/aws&quot;
  version = &quot;~&gt; 2.0&quot;

  name               = &quot;payments-api-dev&quot;
  cidr               = &quot;10.10.0.0/16&quot;
  availability_zones = [&quot;eu-west-1a&quot;, &quot;eu-west-1b&quot;, &quot;eu-west-1c&quot;]
  private_subnets    = [&quot;10.10.1.0/24&quot;, &quot;10.10.2.0/24&quot;, &quot;10.10.3.0/24&quot;]
  public_subnets     = [&quot;10.10.101.0/24&quot;, &quot;10.10.102.0/24&quot;, &quot;10.10.103.0/24&quot;]
  enable_nat_gateway = true
  single_nat_gateway = true  # Cost saving for dev

  tags = {
    Organisation = &quot;acme-corp&quot;
    Project      = &quot;payments-api&quot;
    Environment  = &quot;dev&quot;
    Team         = &quot;payments&quot;
    CostCentre   = &quot;engineering&quot;
    ManagedBy    = &quot;terraform&quot;
  }
}

module &quot;ecs&quot; {
  source  = &quot;spacelift.io/your-org/ecs/aws&quot;
  version = &quot;~&gt; 1.5&quot;

  cluster_name = &quot;payments-api-dev&quot;
  vpc_id       = module.vpc.vpc_id
  subnet_ids   = module.vpc.private_subnet_ids

  tags = {
    Organisation = &quot;acme-corp&quot;
    Project      = &quot;payments-api&quot;
    Environment  = &quot;dev&quot;
    Team         = &quot;payments&quot;
    CostCentre   = &quot;engineering&quot;
    ManagedBy    = &quot;terraform&quot;
  }
}

module &quot;rds&quot; {
  source  = &quot;spacelift.io/your-org/rds/aws&quot;
  version = &quot;~&gt; 3.0&quot;

  identifier     = &quot;payments-api-dev&quot;
  engine         = &quot;postgres&quot;
  engine_version = &quot;15.4&quot;
  instance_class = &quot;db.t3.medium&quot;

  vpc_id     = module.vpc.vpc_id
  subnet_ids = module.vpc.private_subnet_ids

  # Dev settings
  multi_az                = false
  deletion_protection     = false
  backup_retention_period = 1

  tags = {
    Organisation = &quot;acme-corp&quot;
    Project      = &quot;payments-api&quot;
    Environment  = &quot;dev&quot;
    Team         = &quot;payments&quot;
    CostCentre   = &quot;engineering&quot;
    ManagedBy    = &quot;terraform&quot;
  }
}
```

The `~&gt;` (pessimistic) version constraint is key. `~&gt; 2.0` allows any 2.x release but not 3.0, which is equivalent to `&gt;= 2.0, &lt; 3.0`. Teams pick up patch and minor updates automatically while staying protected from breaking major-version changes.

### Auto-Triggering on Module Updates

When the platform team updates the VPC module (say, adding a new output), the module-change trigger policy kicks in. Any stack with a `depends-on:vpc` label automatically gets a new run, which keeps infrastructure up to date with the latest module version its constraint allows.

For this to work, stacks that consume modules need the dependency label:

```hcl
labels = concat(var.labels, [
  &quot;depends-on:vpc&quot;,
  &quot;depends-on:ecs&quot;,
  &quot;depends-on:rds&quot;,
])
```

## Contexts

Contexts solve the problem of shared configuration. Instead of duplicating environment variables across fifty stacks, you define them once and auto-attach.

### AWS Common Context

```hcl
# spacelift/management/contexts.tf

resource &quot;spacelift_context&quot; &quot;aws_common&quot; {
  name        = &quot;aws-common&quot;
  description = &quot;Common AWS configuration shared across all stacks&quot;
  space_id    = &quot;root&quot;

  labels = [&quot;autoattach:aws&quot;]
}

resource &quot;spacelift_environment_variable&quot; &quot;aws_region&quot; {
  context_id = spacelift_context.aws_common.id
  name       = &quot;AWS_DEFAULT_REGION&quot;
  value      = &quot;eu-west-1&quot;
  write_only = false
}

resource &quot;spacelift_environment_variable&quot; &quot;tf_log&quot; {
  context_id = spacelift_context.aws_common.id
  name       = &quot;TF_LOG&quot;
  value      = &quot;ERROR&quot;
  write_only = false
}

resource &quot;spacelift_environment_variable&quot; &quot;tf_input&quot; {
  context_id = spacelift_context.aws_common.id
  name       = &quot;TF_INPUT&quot;
  value      = &quot;false&quot;
  write_only = false
}
```

### Datadog Credentials Context

```hcl
resource &quot;spacelift_context&quot; &quot;datadog_credentials&quot; {
  name        = &quot;datadog-credentials&quot;
  description = &quot;Datadog API credentials (secrets managed in UI)&quot;
  space_id    = &quot;root&quot;

  labels = [&quot;autoattach:datadog&quot;]
}

# Note: The actual API key and APP key values are set manually
# in the Spacelift UI as write-only (secret) variables.
# We only create the context shell here.
#
# Variables managed in UI:
# - DATADOG_API_KEY (write-only)
# - DATADOG_APP_KEY (write-only)
# - DD_API_KEY (write-only, for the Datadog provider)
# - DD_APP_KEY (write-only, for the Datadog provider)
```

This is a deliberate pattern. The context resource is managed in Terraform, but the secret values are set in the UI. This keeps sensitive credentials out of state files while still having the context itself be version-controlled.

### Per-Environment Contexts

```hcl
resource &quot;spacelift_context&quot; &quot;env_sandbox&quot; {
  name        = &quot;env-sandbox&quot;
  description = &quot;Sandbox environment configuration&quot;
  space_id    = spacelift_space.sandbox.id

  labels = [&quot;autoattach:env:sandbox&quot;]
}

resource &quot;spacelift_environment_variable&quot; &quot;sandbox_account_id&quot; {
  context_id = spacelift_context.env_sandbox.id
  name       = &quot;TF_VAR_aws_account_id&quot;
  value      = &quot;111111111111&quot;
  write_only = false
}

resource &quot;spacelift_context&quot; &quot;env_staging&quot; {
  name        = &quot;env-staging&quot;
  description = &quot;Staging environment configuration&quot;
  space_id    = spacelift_space.staging.id

  labels = [&quot;autoattach:env:staging&quot;]
}

resource &quot;spacelift_environment_variable&quot; &quot;staging_account_id&quot; {
  context_id = spacelift_context.env_staging.id
  name       = &quot;TF_VAR_aws_account_id&quot;
  value      = &quot;222222222222&quot;
  write_only = false
}

resource &quot;spacelift_context&quot; &quot;env_prod&quot; {
  name        = &quot;env-prod&quot;
  description = &quot;Production environment configuration&quot;
  space_id    = spacelift_space.prod.id

  labels = [&quot;autoattach:env:prod&quot;]
}

resource &quot;spacelift_environment_variable&quot; &quot;prod_account_id&quot; {
  context_id = spacelift_context.env_prod.id
  name       = &quot;TF_VAR_aws_account_id&quot;
  value      = &quot;333333333333&quot;
  write_only = false
}
```

The auto-attach labels make this seamless. A stack labelled `env:sandbox` automatically gets the sandbox context attached. No manual configuration per stack.

## The Full GitOps Flow

Let&apos;s walk through what happens end-to-end when a developer wants to create infrastructure for a new service.

### Step 1: Developer Creates a Config File

The developer creates a new directory and config file:

```yaml
# environments/order-service-dev/config.yaml
team: commerce
project: order-service
environment: dev
aws_account_id: &quot;111111111111&quot;
terraform_version: &quot;1.7.0&quot;
project_root: &quot;projects/order-service/dev&quot;
auto_deploy: true
labels:
  - &quot;team:commerce&quot;
  - &quot;env:dev&quot;
  - &quot;service:order-service&quot;
  - &quot;depends-on:vpc&quot;
  - &quot;depends-on:ecs&quot;
  - &quot;depends-on:rds&quot;
```

### Step 2: Developer Creates the Terraform Code

```hcl
# projects/order-service/dev/main.tf

terraform {
  required_version = &quot;&gt;= 1.7.0&quot;
}

module &quot;vpc&quot; {
  source  = &quot;spacelift.io/your-org/vpc/aws&quot;
  version = &quot;~&gt; 2.0&quot;

  name               = &quot;order-service-dev&quot;
  cidr               = &quot;10.20.0.0/16&quot;
  availability_zones = [&quot;eu-west-1a&quot;, &quot;eu-west-1b&quot;, &quot;eu-west-1c&quot;]
  private_subnets    = [&quot;10.20.1.0/24&quot;, &quot;10.20.2.0/24&quot;, &quot;10.20.3.0/24&quot;]
  public_subnets     = [&quot;10.20.101.0/24&quot;, &quot;10.20.102.0/24&quot;, &quot;10.20.103.0/24&quot;]

  tags = local.common_tags
}

locals {
  common_tags = {
    Organisation = &quot;acme-corp&quot;
    Project      = &quot;order-service&quot;
    Environment  = &quot;dev&quot;
    Team         = &quot;commerce&quot;
    CostCentre   = &quot;engineering&quot;
    ManagedBy    = &quot;terraform&quot;
  }
}
```

### Step 3: PR Opened

The developer opens a PR. Two things happen:

1. **The management stack runs a plan.** It detects the new `config.yaml` file and shows a plan to create a new Spacelift stack resource.
2. **Reviewers see exactly what will be created** - the stack name, space, labels, and configuration.

### Step 4: PR Merged

On merge to `main`:

1. The management stack applies, creating the new `order-service-dev` stack in Spacelift.
2. The new stack automatically picks up:
   - **AWS integration** via the `autoattach:aws` label
   - **Security policies** via the `autoattach:security-policies` label
   - **AWS common context** via the `autoattach:aws` label
   - **Sandbox environment context** via the `autoattach:env:sandbox` label (dev maps to sandbox space)
3. The new stack triggers its first run, planning the Terraform code in `projects/order-service/dev/`.
4. Since `auto_deploy: true` is set in the dev config, the plan applies automatically.

### Step 5: Infrastructure Exists

Within minutes of merging a PR, the developer has:

- A VPC with private and public subnets
- All resources properly tagged (enforced by OPA)
- No public access on any S3 buckets (enforced by OPA)
- No publicly accessible RDS (enforced by OPA)
- Full audit trail in Spacelift
- OIDC-based AWS auth (no static credentials)

The developer never logged into the Spacelift UI. They never ran `terraform apply` locally. They didn&apos;t need to know how the AWS integration works or what policies exist. The platform handled all of it.

## Problems and Lessons Learned

This wasn&apos;t all smooth sailing. Here are the real issues we hit and how we dealt with them.

### The Approval Policy Loop

This was our most confusing bug. We set up the `prod-requires-approval` policy with `autoattach:security-policies`, which means it attaches to every stack with that label. Including the management stack itself.

The management stack creates production stacks. So when someone added a prod service config, the management stack planned the change, and then... needed approval. Because the management stack had the prod approval policy attached. Even though the management stack isn&apos;t a production stack - it&apos;s the admin stack that manages everything.

**The fix:** We added an exclusion to the approval policy:

```rego
# Don&apos;t require approval for the admin/management stack
reject[msg] {
  input.run.type == &quot;TRACKED&quot;
  is_production
  not is_admin_stack
  msg := &quot;Production stacks require manual approval before apply.&quot;
}

is_admin_stack {
  input.stack.administrative == true
}
```

This is the kind of thing that makes sense in hindsight but takes an hour of confused debugging to figure out the first time.
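
A regression test would have caught this earlier. The sketch below assumes the updated `reject` rule replaces the original one and that `is_production` is still defined as before; the file name and the input values are illustrative.

```rego
# policies/approval/admin-stack-exclusion_test.rego (hypothetical file name)

package spacelift

# An administrative stack is not rejected even when it looks like production
test_admin_stack_is_not_rejected {
  result := reject with input as {
    &quot;run&quot;: {&quot;type&quot;: &quot;TRACKED&quot;},
    &quot;stack&quot;: {
      &quot;administrative&quot;: true,
      &quot;labels&quot;: [&quot;env:prod&quot;],
      &quot;space&quot;: {&quot;name&quot;: &quot;root&quot;}
    }
  }
  count(result) == 0
}

# A regular production stack is still rejected
test_regular_prod_stack_is_still_rejected {
  result := reject with input as {
    &quot;run&quot;: {&quot;type&quot;: &quot;TRACKED&quot;},
    &quot;stack&quot;: {
      &quot;administrative&quot;: false,
      &quot;labels&quot;: [&quot;env:prod&quot;],
      &quot;space&quot;: {&quot;name&quot;: &quot;prod&quot;}
    }
  }
  count(result) &gt; 0
}
```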

### Drift Detection Requires Private Workers

Spacelift has built-in drift detection - it can periodically run `terraform plan` on your stacks and alert you if the actual infrastructure has drifted from the Terraform state. Brilliant feature.

Except it requires private workers. On the free tier and even some paid plans, you&apos;re using Spacelift&apos;s shared workers, which don&apos;t support scheduled drift detection. We had to set up private workers running in our own ECS cluster before we could enable it.

Not a dealbreaker, but it&apos;s worth knowing upfront. If drift detection is important to you (and it should be), factor in the private worker setup cost.

### Datadog Provider Tag Format

Our tag enforcement policy initially denied every Datadog resource. The Datadog Terraform provider doesn&apos;t use maps for tags - it uses a list of `key:value` strings:

```hcl
# AWS style (map)
tags = {
  Environment = &quot;prod&quot;
  Team        = &quot;payments&quot;
}

# Datadog style (list of strings)
tags = [&quot;env:prod&quot;, &quot;team:payments&quot;]
```

Our Rego rule expected `tags` to be a map, so a list of strings never matched and every Datadog resource looked untagged. Our fix was the `excluded_providers` set in the tag policy. We still enforce Datadog tags, but through a separate policy specific to the Datadog tag format. The main tag policy just skips Datadog resources entirely.
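
To make the format difference concrete, here is a minimal Python sketch of the normalisation a Datadog-specific check has to do before it can look for required keys. It is an illustration only, not the Rego policy we shipped; the `REQUIRED_TAGS` set and both function names are assumptions.

```python
# Hypothetical required keys, chosen to match the Datadog example above.
REQUIRED_TAGS = {"env", "team"}


def datadog_tags_to_map(tags):
    """Convert ["env:prod", "team:payments"] into a key/value dict."""
    result = {}
    for tag in tags:
        key, sep, value = tag.partition(":")
        if sep:  # ignore bare tags that have no key:value shape
            result[key.strip().lower()] = value.strip()
    return result


def missing_datadog_tags(tags):
    """Return the required keys that are absent from a Datadog tag list."""
    present = set(datadog_tags_to_map(tags))
    return sorted(REQUIRED_TAGS - present)


print(missing_datadog_tags(["env:prod", "team:payments"]))
print(missing_datadog_tags(["env:prod", "critical"]))
```

Once the list is a map, the same "are all required keys present" logic applies to both providers.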

### Label-Based Auto-Attach Debugging

Labels are powerful. Auto-attach via labels is even more powerful. But when something isn&apos;t working, figuring out why a policy did or didn&apos;t attach to a specific stack requires checking:

1. The stack&apos;s labels
2. The policy&apos;s auto-attach labels
3. The space hierarchy (policies in parent spaces can affect child spaces)
4. Whether `inherit_entities` is true or false at each level

We ended up creating a simple bash script that queries the Spacelift API and lists all policies attached to a given stack, which made debugging much faster.

```bash
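#!/bin/bash
# scripts/list-stack-policies.sh
# Usage: list-stack-policies.sh STACK_ID
set -euo pipefail

STACK_ID=&quot;${1:?usage: $0 STACK_ID}&quot;

# Print type, name, and auto-attach status for each attached policy
spacectl stack policies list --id &quot;$STACK_ID&quot; \
  | jq -r &apos;.[] | &quot;\(.type)\t\(.name)\t\(.autoattach)&quot;&apos;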
```

### Space Inheritance Gotchas

`inherit_entities = true` means entities (policies, contexts, integrations) from the parent space are available in the child space. This is usually what you want. But it can surprise you.

We had a case where a policy intended only for the security space was accidentally inheriting into the audit and log-archive child spaces. The audit stacks were getting denied because a security-specific policy was checking for controls that only applied to the parent security account.

**The lesson:** Be intentional about what lives at each level. If a policy should only apply to stacks directly in a space (not its children), you need to filter by space name in the Rego code, or place it more carefully in the hierarchy.
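
A toy Python model of the inheritance walk shows why the audit stacks were affected. It is a simplified sketch, not how Spacelift resolves entities internally; the `security-controls` policy name and the data layout are assumptions made for illustration.

```python
# Hypothetical space tree mirroring the security / audit / log-archive case.
SPACES = {
    "root": {"parent": None, "inherit_entities": False},
    "security": {"parent": "root", "inherit_entities": True},
    "audit": {"parent": "security", "inherit_entities": True},
    "log-archive": {"parent": "security", "inherit_entities": True},
}

# Policies keyed by the space they are defined in (names are illustrative).
POLICIES = {
    "root": ["enforce-required-tags"],
    "security": ["security-controls"],
}


def visible_policies(space, spaces, policies):
    """Collect policies defined in a space plus those inherited from ancestors."""
    found = list(policies.get(space, []))
    current = space
    # Keep climbing while the current space opts in to inheritance.
    while spaces[current]["inherit_entities"] and spaces[current]["parent"]:
        current = spaces[current]["parent"]
        found.extend(policies.get(current, []))
    return found


print(visible_policies("audit", SPACES, POLICIES))
```

In this model the audit space sees `security-controls` even though nobody attached it there, which is exactly the surprise described above.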

### Module Versioning Challenges

The `~&gt;` constraint is a double-edged sword. `~&gt; 2.0` allows `2.1`, `2.5`, `2.99` - any `2.x`. If the platform team accidentally pushes a breaking change as a minor version, it cascades to every stack.

We adopted a policy: breaking changes always get a major version bump. Minor versions add features or fix bugs. Patch versions are documentation or internal refactors. Semantic versioning isn&apos;t just a guideline - it&apos;s a contract between the platform team and the consuming teams.

We also added a `CHANGELOG.md` to every module repository and a Slack notification when new versions are published. Communication matters as much as automation.
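
For anyone unsure what the pessimistic operator actually admits, this small Python sketch reproduces the rule: drop the rightmost component of the constraint and bump the one before it to get an exclusive upper bound. The function names are made up for this example, and pre-release versions are deliberately not handled.

```python
def parse(version):
    """Turn "2.5.1" into (2, 5, 1). Pre-release suffixes are not handled."""
    return tuple(int(part) for part in version.split("."))


def pessimistic_bounds(base):
    """Return (lower, upper) for "~> base"; the upper bound is exclusive."""
    lower = parse(base)
    if len(lower) == 1:
        raise ValueError("this sketch expects at least MAJOR.MINOR")
    head = lower[:-1]                    # drop the rightmost component
    upper = head[:-1] + (head[-1] + 1,)  # bump the new rightmost component
    return lower, upper


def satisfies(base, candidate):
    """True when candidate falls inside the range implied by "~> base"."""
    lower, upper = pessimistic_bounds(base)
    version = parse(candidate)
    return version >= lower and upper > version


for candidate in ["2.1", "2.99", "3.0"]:
    print(candidate, satisfies("2.0", candidate))
```

Pinning to three components (for example `2.0.1`) narrows the range to patch releases only, which is one way to limit the blast radius of a mislabelled minor release.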

## Repository Structure

Here&apos;s the final layout of the infrastructure repository:

```
infrastructure/
├── spacelift/
│   └── management/
│       ├── providers.tf
│       ├── aws-integration.tf
│       ├── spaces.tf
│       ├── stacks.tf
│       ├── policies.tf
│       ├── modules.tf
│       ├── contexts.tf
│       └── outputs.tf
│
├── policies/
│   ├── plan/
│   │   ├── enforce-required-tags.rego
│   │   ├── enforce-required-tags_test.rego
│   │   ├── no-public-rds.rego
│   │   ├── no-public-rds_test.rego
│   │   ├── no-public-s3.rego
│   │   └── cost-limit-warning.rego
│   ├── approval/
│   │   └── prod-requires-approval.rego
│   ├── access/
│   │   └── project-ownership.rego
│   └── trigger/
│       └── module-change.rego
│
├── environments/
│   ├── payments-api-dev/
│   │   └── config.yaml
│   ├── payments-api-staging/
│   │   └── config.yaml
│   ├── payments-api-prod/
│   │   └── config.yaml
│   ├── order-service-dev/
│   │   └── config.yaml
│   ├── order-service-staging/
│   │   └── config.yaml
│   └── order-service-prod/
│       └── config.yaml
│
├── projects/
│   ├── payments-api/
│   │   ├── dev/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── outputs.tf
│   │   ├── staging/
│   │   │   ├── main.tf
│   │   │   ├── variables.tf
│   │   │   └── outputs.tf
│   │   └── prod/
│   │       ├── main.tf
│   │       ├── variables.tf
│   │       └── outputs.tf
│   └── order-service/
│       ├── dev/
│       │   ├── main.tf
│       │   ├── variables.tf
│       │   └── outputs.tf
│       ├── staging/
│       │   └── ...
│       └── prod/
│           └── ...
│
├── modules/
│   ├── spacelift-stack/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   └── outputs.tf
│   └── spacelift-module/
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
│
└── scripts/
    └── list-stack-policies.sh
```

And the separate modules repository:

```
terraform-modules/
├── modules/
│   ├── vpc/
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── CHANGELOG.md
│   ├── ecs/
│   │   └── ...
│   ├── rds/
│   │   └── ...
│   ├── aurora/
│   │   └── ...
│   ├── alb/
│   │   └── ...
│   ├── vault/
│   │   └── ...
│   ├── nats/
│   │   └── ...
│   ├── clickhouse/
│   │   └── ...
│   ├── datadog-monitors/
│   │   └── ...
│   ├── datadog-dashboards/
│   │   └── ...
│   └── datadog-synthetics/
│       └── ...
└── README.md
```

## What We Ended Up With

After about three weeks of work, here&apos;s what the client had:

**40+ stacks** across sandbox, staging, and production environments - all dynamically created from config files. No manual stack creation.

**7 OPA policies** covering tag enforcement, security guardrails, cost warnings, production approvals, team-based access control, and module dependency triggers. All auto-attached via labels.

**12 private modules** in the Spacelift registry covering everything from VPCs and ECS clusters to Datadog monitors. All versioned, all consumable with a one-liner.

**Zero static credentials.** AWS authentication via OIDC. Datadog credentials in Spacelift&apos;s encrypted context store. Nothing in GitHub secrets.

**Full RBAC.** The payments team can only see and modify payments stacks. The data team can only see data stacks. The platform team has god mode. All enforced by spaces and OPA.

**GitOps from end to end.** Adding a new service environment means creating a `config.yaml` file and opening a PR. The platform takes care of the rest.

### The Numbers

- **Time to onboard a new service:** ~10 minutes (create config, write Terraform, open PR)
- **Time to add a new environment:** ~5 minutes (copy and modify config)
- **Policy violations caught in first month:** 47 (mostly missing tags, 3 public RDS attempts)
- **Production incidents from Terraform:** 0 (approval policy doing its job)

### What I&apos;d Do Differently

If I were starting from scratch again:

1. **Set up private workers from day one.** We wasted time on shared workers only to need private workers for drift detection. Just start with private workers.

2. **Invest more in the module CHANGELOG process.** Automated changelogs from commit messages would have saved us several &quot;what changed?&quot; conversations.

3. **Build a custom Spacelift dashboard.** The UI is good but not great for a bird&apos;s-eye view of 40+ stacks. A custom dashboard showing stack health, recent failures, and drift status would help.

4. **Test OPA policies in CI before deploying.** We wrote Rego tests but didn&apos;t run them in CI initially. Broken policies get deployed silently and then deny legitimate changes. Test them like you&apos;d test application code.

## Wrapping Up

Spacelift isn&apos;t perfect. The UI can be sluggish. The documentation has gaps (especially around policy debugging). Private workers add operational overhead. And the pricing model means costs grow with your infrastructure.

But for multi-team Terraform at scale, it&apos;s the best tool I&apos;ve used. The combination of hierarchical spaces, native OPA, the admin stack pattern for dynamic stack creation, and OIDC authentication creates a platform that&apos;s genuinely self-service.

The real measure of a platform is whether teams can use it without filing tickets. With this setup, they can. A developer creates a config file, writes their Terraform, and opens a PR. The platform handles RBAC, policy enforcement, secret injection, AWS authentication, and deployment. That&apos;s the goal.

If you&apos;re managing more than a handful of Terraform workspaces and finding that GitHub Actions plus bash scripts isn&apos;t cutting it anymore, Spacelift is worth evaluating. Start with the management stack, spaces, and one or two policies. The rest builds naturally from there.

---

*Have questions about any of this? Find me on [LinkedIn](https://linkedin.com/in/intrapreneurmd) or [GitHub](https://github.com/moabukar). The code examples in this post are simplified from a real implementation - happy to discuss specifics.*</content:encoded><category>spacelift</category><category>terraform</category><category>opa</category><category>rego</category><category>iac</category><category>gitops</category><category>platform-engineering</category><category>devops</category><category>modules</category><category>policy-as-code</category><author>Mo Abukar</author></item><item><title>Migrating ClickHouse From EC2 to ClickHouse Cloud - Every Approach We Tried and Why Most Failed</title><link>https://moabukar.co.uk/blog/clickhouse-cloud-migration/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/clickhouse-cloud-migration/</guid><description>S3 backup/restore, direct connectivity, Parquet exports - none of them worked cleanly. Here&apos;s the full war story of migrating a production ClickHouse instance to Cloud, the version mismatch that broke everything, and the dumb-simple approach that actually got the job done.</description><pubDate>Mon, 09 Feb 2026 00:00:00 GMT</pubDate><content:encoded>&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/clickhouse.svg&quot; alt=&quot;ClickHouse logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## TL;DR

- Tried 4 different approaches to migrate ClickHouse from EC2 to ClickHouse Cloud
- `BACKUP`/`RESTORE` via S3 failed due to version mismatch (v25.12 → v25.8) and `SharedMergeTree` engine requirements
- Direct exports OOM&apos;d on a memory-constrained `t3.medium`
- The approach that worked: SSM port-forward + pipe `SELECT FORMAT Native | INSERT FORMAT Native` through a laptop, with the larger tables split by partition or column value
- ClickHouse Cloud&apos;s version lag and engine restrictions are the biggest gotchas nobody warns you about

---

## The Setup

Production ClickHouse running on a single `t3.medium` EC2 instance in `eu-west-2`. Private subnet, no public IP, no NAT gateway. About 500 MB of data across 7 tables - a mix of time-series data, pre-aggregated rollups, and a large replica table with 13M rows.

The target: ClickHouse Cloud, same region, `SharedMergeTree` engine under the hood.

Should be straightforward, right? ~500 MB of data. A few tables. Same region. How hard can it be?

---

## Attempt 1: Direct Connectivity

First instinct - connect Cloud to EC2 and use `remoteSecure()` to pull data directly.

```bash
curl -s --max-time 5 https://&lt;cloud-host&gt;:8443/ping; echo $?
# 35
```

Exit code 35: TLS handshake failure. The EC2 is in a private subnet with no NAT gateway. There&apos;s *some* route out (it didn&apos;t time out), but something - a proxy, firewall, or security group - is breaking the TLS handshake on non-standard ports.

**Lesson:** Don&apos;t assume private subnet EC2 instances can reach ClickHouse Cloud. You need either a NAT gateway, a VPC endpoint (PrivateLink), or a different approach entirely.

We&apos;d later set up a VPC Interface Endpoint (PrivateLink) for post-migration production traffic, but that wasn&apos;t ready yet.

---

## Attempt 2: S3 BACKUP/RESTORE

The EC2 had an IAM role with S3 access to a dedicated backup bucket. ClickHouse v25.12 supports native `BACKUP TO S3()`. This felt like the clean path.

### Problem 1: Missing IAM Permission

```sql
BACKUP TABLE db.table1, TABLE db.table2, ...
TO S3(&apos;https://s3.eu-west-2.amazonaws.com/my-backup-bucket/migration/&apos;)
```

Failed with:

```
s3:DeleteObject on resource &quot;.../.lock&quot; because no identity-based policy allows the s3:DeleteObject action
```

The IAM policy only had `PutObject`, `GetObject`, `ListBucket`. ClickHouse&apos;s backup process creates a `.lock` file and tries to delete it on completion.

**You need `s3:DeleteObject` in your IAM policy for ClickHouse S3 backups.** This isn&apos;t documented clearly anywhere.

Fixed the policy, backup succeeded.

### Problem 2: Version Mismatch

```sql
-- On ClickHouse Cloud (v25.8)
RESTORE TABLE ... FROM S3(&apos;...&apos;)
```

```
Code: 246. DB::Exception: Unknown version of serialization infos (1). Should be less or equal than 0
```

The EC2 was running **v25.12**. Cloud was on **v25.8**. The backup&apos;s internal `serialization.json` format changed between versions and isn&apos;t backwards-compatible.

**`BACKUP`/`RESTORE` does not work when the source runs a newer ClickHouse release than the target.** The backup format is tightly coupled to the server version. There&apos;s no migration path flag that fully downgrades the format.

### Problem 3: SharedMergeTree

Even after trying `SETTINGS compatibility=&apos;25.8.1&apos;` on the backup:

```
Code: 36. DB::Exception: Tables in a Shared database must use engines that do not store data on disk. Attempted to create a table with engine &apos;MergeTree&apos;, which stores data on disk.
```

ClickHouse Cloud requires `SharedMergeTree`. You can pre-create tables (Cloud silently converts `MergeTree` → `SharedMergeTree`), but the backup&apos;s part-level format was still incompatible.

**Three separate failures in the BACKUP/RESTORE path:**

1. IAM permissions (fixable)
2. Version mismatch (not fixable without upgrading Cloud)
3. Engine restriction (not fixable without pre-creating tables AND having a compatible backup format)

---

## Attempt 3: Export to S3 as Parquet/Native/CSV

Fine. No backup/restore. Just `INSERT INTO FUNCTION s3(...)` from the EC2.

```sql
INSERT INTO FUNCTION s3(
  &apos;https://s3.eu-west-2.amazonaws.com/my-bucket/export/table.parquet&apos;,
  &apos;Parquet&apos;
)
SELECT * FROM db.my_table
```

This worked for the small tables. But the largest table (13M rows, 310 MB) OOM&apos;d every time:

```
Code: 241. DB::Exception: (total) memory limit exceeded: would use 3.37 GiB, current RSS: 2.12 GiB, maximum: 3.37 GiB
```

A `t3.medium` has 4 GB RAM. The ClickHouse server process was already using ~1.9 GB (NATS engine tables consuming memory for live streaming ingestion). That left barely enough for the export.

**What we tried:**

- Smaller chunk sizes with `LIMIT/OFFSET` → OOM&apos;d on OFFSET scan
- `--max_block_size=1024` → still OOM&apos;d (server-level memory, not per-query)
- `--max_threads=1` → still OOM&apos;d
- Streaming to stdout with `FORMAT CSVWithNames | gzip` → still OOM&apos;d
- `clickhouse-local` to bypass the server → directory locked by running server

**What we couldn&apos;t do:**

- Restart the server to free memory - this is production
- Detach the NATS tables - they&apos;re actively ingesting live data
- Drop OS caches - tried `echo 3 &gt; /proc/sys/vm/drop_caches`, didn&apos;t help enough

The fundamental problem: on a memory-constrained instance with a live workload, you can&apos;t export large tables through the ClickHouse server without competing for memory.

---

## Attempt 4: The Approach That Actually Worked

Dumb simple. Pipe data through a laptop.

### Setup

**Terminal 1:** SSM port-forward to make EC2 ClickHouse available on localhost:

```bash
aws ssm start-session \
  --target i-xxxxxxxxxxxx \
  --region eu-west-2 \
  --document-name AWS-StartPortForwardingSession \
  --parameters &apos;{&quot;portNumber&quot;:[&quot;9000&quot;],&quot;localPortNumber&quot;:[&quot;9000&quot;]}&apos;
```

**Terminal 2:** Pipe data from source to target:

```bash
clickhouse client --host 127.0.0.1 --port 9000 \
  --user default --password &apos;***&apos; \
  --query &quot;SELECT * FROM db.my_table FORMAT Native&quot; \
| clickhouse client --host &lt;cloud-host&gt; --secure \
  --user default --password &apos;***&apos; \
  --query &quot;INSERT INTO db.my_table FORMAT Native&quot;
```

Your laptop acts as a dumb pipe. Data streams from EC2 → SSM tunnel → your machine → native protocol over TLS (`--secure`) → ClickHouse Cloud. No disk buffering, no S3 intermediary.

### Handling the Memory-Constrained Tables

Small tables piped directly - no issues.

For the larger tables that OOM&apos;d on the EC2, we split by partition or column value:

```bash
# Split by partition (for partitioned tables)
for part in 202001 202002 202003 ... 202602; do
  clickhouse client --host 127.0.0.1 --port 9000 \
    --user default --password &apos;***&apos; \
    --max_threads=1 --max_block_size=65536 \
    --query &quot;SELECT * FROM db.rollups WHERE toYYYYMM(ts_minute) = ${part} FORMAT Native&quot; \
  | clickhouse client --host &lt;cloud-host&gt; --secure \
    --user default --password &apos;***&apos; \
    --query &quot;INSERT INTO db.rollups FORMAT Native&quot;
done

# Split by column value (for unpartitioned tables)
for sym in value_a value_b value_c ...; do
  clickhouse client --host 127.0.0.1 --port 9000 \
    --user default --password &apos;***&apos; \
    --max_threads=1 --max_block_size=65536 \
    --query &quot;SELECT * FROM db.large_table WHERE category = &apos;${sym}&apos; FORMAT Native&quot; \
  | clickhouse client --host &lt;cloud-host&gt; --secure \
    --user default --password &apos;***&apos; \
    --query &quot;INSERT INTO db.large_table FORMAT Native&quot;
done
```

Each query only reads a slice of data, keeping EC2 memory usage within bounds.
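
If you would rather generate the partition list than type it out, a short Python helper can do it. This is a sketch that assumes contiguous monthly partitions between the two endpoints; the helper names are made up for this example, and the query text mirrors the loop above.

```python
def month_partitions(start, end):
    """Yield YYYYMM integers from start to end inclusive."""
    year, month = divmod(start, 100)
    end_year, end_month = divmod(end, 100)
    while (end_year, end_month) >= (year, month):
        yield year * 100 + month
        month += 1
        if month > 12:  # roll over into January of the next year
            year, month = year + 1, 1


def export_query(table, time_column, partition):
    """Build the per-partition SELECT used on the source side of the pipe."""
    return (
        f"SELECT * FROM {table} "
        f"WHERE toYYYYMM({time_column}) = {partition} FORMAT Native"
    )


partitions = list(month_partitions(202001, 202602))
print(len(partitions), partitions[0], partitions[-1])
print(export_query("db.rollups", "ts_minute", partitions[0]))
```

Feed each generated query to the same `clickhouse client` pipe shown above, one partition at a time.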

**This worked.** All tables migrated.

---

## Things Nobody Tells You About ClickHouse Cloud Migration

### 1. Version Mismatch Kills BACKUP/RESTORE

ClickHouse Cloud manages its own version and you can&apos;t control it. If your self-hosted version is newer than Cloud&apos;s version, `BACKUP`/`RESTORE` simply won&apos;t work. There&apos;s no compatibility layer that fully handles this.

**Check versions before you plan anything:**

```sql
-- Source
SELECT version() -- e.g. 25.12.1

-- Target (Cloud)
SELECT version() -- e.g. 25.8.1
```
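
Comparing the two results by eye is error-prone because `25.12` sorts before `25.8` as a string. A minimal Python sketch of a numeric comparison follows; the function names are illustrative, and a passing check is only a necessary condition, since the engine restrictions described below still apply.

```python
def parse_version(version):
    """Turn "25.12.1" into (25, 12, 1) so components compare numerically."""
    return tuple(int(part) for part in version.split("."))


def restore_is_plausible(source_version, target_version):
    """True only when the target is at least as new as the source."""
    return parse_version(target_version) >= parse_version(source_version)


print(restore_is_plausible("25.12.1", "25.8.1"))  # the situation in this post
print("25.8.1" > "25.12.1")  # naive string comparison gets the order wrong
```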

### 2. SharedMergeTree Changes Everything

Cloud uses `SharedMergeTree` internally. You can write `CREATE TABLE ... ENGINE = MergeTree` in DDL and Cloud converts it, but the on-disk part format is different. Backup files contain raw parts with the original engine&apos;s format - Cloud can&apos;t ingest them.

### 3. NATS Engine Doesn&apos;t Exist on Cloud

If you&apos;re using ClickHouse&apos;s built-in NATS engine for streaming ingestion, there&apos;s no equivalent on Cloud. You need an external consumer that subscribes to NATS and inserts into Cloud via HTTPS.

The materialized view chain still works - if your MVs trigger on `INSERT` to a base table, they&apos;ll fire regardless of whether the insert came from NATS or an HTTP client. You just need to replace the source.

### 4. IAM Needs DeleteObject for S3 Backups

ClickHouse creates and deletes a `.lock` file during backup. Your IAM policy needs `s3:PutObject`, `s3:GetObject`, `s3:ListBucket`, **and** `s3:DeleteObject`.
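
As a starting point, here is a Python sketch that prints a policy document with those four actions. The bucket name comes from the example earlier in this post; splitting object-level and bucket-level statements is standard IAM practice, but treat the exact scoping as an assumption to adapt to your own setup.

```python
import json


def backup_bucket_policy(bucket):
    """Build an IAM policy document covering ClickHouse S3 backups."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Object-level actions, including the delete for the .lock file
                "Effect": "Allow",
                "Action": ["s3:PutObject", "s3:GetObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/*",
            },
            {
                # ListBucket applies to the bucket itself, not its objects
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": f"arn:aws:s3:::{bucket}",
            },
        ],
    }


print(json.dumps(backup_bucket_policy("my-backup-bucket"), indent=2))
```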

### 5. Memory-Constrained Instances Can&apos;t Export Large Tables

On a `t3.medium` (4 GB), if the server is already using 2 GB for live workloads, you don&apos;t have headroom for exporting tables that need to decompress columns into memory. Even streaming to stdout OOMs because the *server* buffers the read, not the client.

Partition your exports. Or use a bigger instance for the migration window.

### 6. PrivateLink Is Required, Not Optional

If your EC2 is in a private subnet (no NAT), you need a VPC Interface Endpoint (PrivateLink) to reach ClickHouse Cloud without traversing the public internet. Don&apos;t confuse this with the &quot;reverse private endpoint&quot; you&apos;ll see referenced in ClickHouse docs - that one runs in the opposite direction, letting ClickPipes in Cloud reach data sources inside your VPC.

Set this up **before** the migration, not during.

---

## The Migration Checklist I Wish I Had

Before starting any ClickHouse → Cloud migration:

1. **Compare versions** - if source &gt; target, `BACKUP`/`RESTORE` won&apos;t work
2. **Check table engines** - NATS, Kafka, MySQL engines won&apos;t migrate to Cloud
3. **Check instance memory** - can it handle concurrent reads during export?
4. **Set up PrivateLink first** - you&apos;ll need it for migration AND production
5. **IAM policy** - ensure `s3:DeleteObject` if using S3 as intermediary
6. **Pre-create tables on Cloud** - Cloud auto-converts to `SharedMergeTree`
7. **Plan MV recreation order** - create target tables first, then MVs
8. **Have a pipe-through-laptop fallback** - it&apos;s ugly but it works

---

## Final Thoughts

We tried 4 different approaches. Three failed due to version mismatches, engine restrictions, and memory constraints. The one that worked was the simplest: pipe data through a laptop using SSM port forwarding and `FORMAT Native`.

For ~500 MB of data, the whole migration took about an hour of actual data transfer (most of the time was spent figuring out *why* the &quot;proper&quot; approaches didn&apos;t work).

If ClickHouse Cloud let you control the version, or if `BACKUP`/`RESTORE` had a real cross-version compatibility mode, this would&apos;ve been a 10-minute job. Instead, it was a full afternoon of debugging.

The takeaway: always check versions first. And keep a simple fallback plan - sometimes the &quot;wrong&quot; approach is the only one that works.</content:encoded><category>clickhouse</category><category>aws</category><category>migration</category><category>database</category><category>devops</category><category>production</category><author>Mo Abukar</author></item><item><title>Identity Aware Proxy: Zero Trust Access for Internal Applications</title><link>https://moabukar.co.uk/blog/identity-aware-proxy-deep-dive/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/identity-aware-proxy-deep-dive/</guid><description>Deep dive into Identity Aware Proxies - what they are, how they work, and how to implement them with GCP IAP, Pomerium, and OAuth2-Proxy. Includes Terraform and Kubernetes examples.</description><pubDate>Fri, 06 Feb 2026 00:00:00 GMT</pubDate><content:encoded>Identity Aware Proxy: Zero Trust Access for Internal Applications
==================================================================

VPNs are dead. Well, not dead - but they&apos;re the wrong tool for
application-level access control. Identity Aware Proxies (IAP)
provide a better model: authenticate users at the application
layer, not the network layer.

This guide covers what IAP is, why it matters, and how to implement
it using GCP IAP, Pomerium, and OAuth2-Proxy.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/googlecloud.svg&quot; alt=&quot;Google Cloud logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- IAP authenticates users before they reach your application
- No VPN required - works over public internet
- Integrates with your existing IdP (Google, Okta, Azure AD)
- Per-application access policies
- Full Terraform + Kubernetes examples included


What is an Identity Aware Proxy?
================================

An Identity Aware Proxy sits in front of your application and
handles authentication before any request reaches your backend.
Users authenticate via OAuth2/OIDC, and the proxy validates their
identity and authorization before forwarding requests.

```
                    ┌─────────────────────────────────────────┐
                    │           Identity Provider             │
                    │        (Google, Okta, Azure AD)         │
                    └─────────────────────────────────────────┘
                                        │
                                        │ OAuth2/OIDC
                                        ▼
┌──────────┐     ┌─────────────────────────────────────┐     ┌─────────────┐
│   User   │────▶│        Identity Aware Proxy         │────▶│    App      │
│ Browser  │     │   (Validates identity + policy)     │     │  Backend    │
└──────────┘     └─────────────────────────────────────┘     └─────────────┘
                            │
                            ▼
                    X-Forwarded-User: user@company.com
                    X-Forwarded-Email: user@company.com
                    X-Forwarded-Groups: engineering,admin
```


Why Not Just Use a VPN?
-----------------------

VPNs provide network-level access. Once you&apos;re on the network,
you can access everything. This violates zero trust principles.

```
APPROACH        SCOPE           GRANULARITY     VISIBILITY
========        =====           ===========     ==========
VPN             Network         Broad           Limited
IAP             Application     Per-app         Full audit
```

With IAP:

- Each application has its own access policy
- Users only access what they&apos;re authorized for
- Every request is logged with user identity
- No network-level access required


IAP Solutions Compared
======================

```
SOLUTION          TYPE            COST            BEST FOR
========          ====            ====            ========
GCP IAP           Managed         Per-user        GCP workloads
AWS Cognito+ALB   Managed         Per-MAU         AWS workloads
Pomerium          Self-hosted     Free/Enterprise Multi-cloud, K8s
OAuth2-Proxy      Self-hosted     Free            Simple setups
Cloudflare Access Managed         Per-seat        Edge-first
```


Architecture Deep Dive
======================

The authentication flow follows standard OAuth2/OIDC:

```
1. User requests protected resource
   Browser ──▶ IAP ──▶ &quot;Not authenticated&quot;

2. IAP redirects to IdP login
   Browser ──▶ IdP ──▶ &quot;Login page&quot;

3. User authenticates with IdP
   Browser ──▶ IdP ──▶ &quot;Success, here&apos;s auth code&quot;

4. IAP exchanges code for tokens
   IAP ──▶ IdP ──▶ &quot;Here&apos;s ID token + access token&quot;

5. IAP validates tokens and checks policy
   IAP ──▶ Policy Engine ──▶ &quot;User authorized&quot;

6. Request forwarded with identity headers
   IAP ──▶ Backend ──▶ &quot;Here&apos;s the request + X-Forwarded-User&quot;
```


Headers Injected by IAP
-----------------------

Most IAP solutions inject these headers:

```
HEADER                      VALUE
======                      =====
X-Forwarded-User            user@company.com
X-Forwarded-Email           user@company.com
X-Forwarded-Groups          engineering,platform
X-Forwarded-Access-Token    eyJhbGciOiJSUzI1...
X-Auth-Request-User         user@company.com
```

Your application can trust these headers only if every request
reaches it through the proxy, so block direct access to the
backend. The proxy strips any incoming headers with these names
to prevent spoofing.
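
The stripping step can be sketched in a few lines of Python. This
is an illustration of the idea, not the code of any particular
proxy; the function name and the shape of the `identity` dict are
assumptions.

```python
# Header names taken from the table above, lower-cased for matching.
IDENTITY_HEADERS = {
    "x-forwarded-user",
    "x-forwarded-email",
    "x-forwarded-groups",
    "x-forwarded-access-token",
    "x-auth-request-user",
}


def forward_headers(incoming, identity):
    """Drop client-supplied identity headers, then set verified values."""
    cleaned = {
        name: value
        for name, value in incoming.items()
        if name.lower() not in IDENTITY_HEADERS
    }
    cleaned["X-Forwarded-User"] = identity["email"]
    cleaned["X-Forwarded-Email"] = identity["email"]
    cleaned["X-Forwarded-Groups"] = ",".join(identity["groups"])
    return cleaned


spoofed = {"Accept": "text/html", "x-forwarded-user": "ceo@company.com"}
verified = {"email": "user@company.com", "groups": ["engineering"]}
print(forward_headers(spoofed, verified))
```

The spoofed value never reaches the backend because it is removed
before the verified identity is written.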


GCP Identity Aware Proxy
========================

GCP IAP is the most mature managed solution. It integrates with
Cloud Load Balancing and provides per-resource access policies.


Terraform Configuration
-----------------------

```hcl
# Enable IAP API
resource &quot;google_project_service&quot; &quot;iap&quot; {
  service = &quot;iap.googleapis.com&quot;
}

# OAuth consent screen
resource &quot;google_iap_brand&quot; &quot;project_brand&quot; {
  support_email     = &quot;admin@company.com&quot;
  application_title = &quot;Internal Apps&quot;
  project           = var.project_id
}

# OAuth client for IAP
resource &quot;google_iap_client&quot; &quot;project_client&quot; {
  display_name = &quot;IAP Client&quot;
  brand        = google_iap_brand.project_brand.name
}

# Backend service with IAP enabled
resource &quot;google_compute_backend_service&quot; &quot;app&quot; {
  name        = &quot;app-backend&quot;
  protocol    = &quot;HTTP&quot;
  timeout_sec = 30

  backend {
    group = google_compute_instance_group_manager.app.instance_group
  }

  iap {
    oauth2_client_id     = google_iap_client.project_client.client_id
    oauth2_client_secret = google_iap_client.project_client.secret
  }

  health_checks = [google_compute_health_check.app.id]
}

# IAP access policy - allow specific users
resource &quot;google_iap_web_backend_service_iam_member&quot; &quot;access&quot; {
  project             = var.project_id
  web_backend_service = google_compute_backend_service.app.name
  role                = &quot;roles/iap.httpsResourceAccessor&quot;
  member              = &quot;user:developer@company.com&quot;
}

# Allow entire group
resource &quot;google_iap_web_backend_service_iam_member&quot; &quot;group_access&quot; {
  project             = var.project_id
  web_backend_service = google_compute_backend_service.app.name
  role                = &quot;roles/iap.httpsResourceAccessor&quot;
  member              = &quot;group:engineering@company.com&quot;
}
```


Verifying IAP Headers in Your App
---------------------------------

GCP IAP uses a signed JWT. Verify it in your application:

```python
from flask import Flask, request
from google.auth.transport import requests
from google.oauth2 import id_token

app = Flask(__name__)

# The audience depends on the resource type. For a backend service
# (as in the Terraform above) it has this shape; App Engine uses
# /projects/PROJECT_NUM/apps/PROJECT_ID instead.
EXPECTED_AUDIENCE = &apos;/projects/PROJECT_NUM/global/backendServices/SERVICE_ID&apos;

def verify_iap_jwt(iap_jwt, expected_audience):
    &quot;&quot;&quot;Verify the IAP JWT and return the user&apos;s email.&quot;&quot;&quot;
    try:
        decoded_jwt = id_token.verify_token(
            iap_jwt,
            requests.Request(),
            audience=expected_audience,
            certs_url=&quot;https://www.gstatic.com/iap/verify/public_key&quot;
        )
        return decoded_jwt[&apos;email&apos;]
    except Exception as e:
        print(f&quot;JWT verification failed: {e}&quot;)
        return None

# In your Flask app
@app.route(&apos;/api/data&apos;)
def get_data():
    iap_jwt = request.headers.get(&apos;X-Goog-IAP-JWT-Assertion&apos;)
    if not iap_jwt:
        return &quot;Unauthorized&quot;, 401

    email = verify_iap_jwt(iap_jwt, EXPECTED_AUDIENCE)
    if not email:
        return &quot;Unauthorized&quot;, 401

    return f&quot;Hello, {email}&quot;
```


Pomerium: Self-Hosted IAP for Kubernetes
========================================

Pomerium is the best self-hosted option. It&apos;s designed for
Kubernetes and supports advanced policies with OPA.


Architecture with Pomerium
--------------------------

```
                    ┌──────────────────┐
                    │   IdP (Okta)     │
                    └────────┬─────────┘
                             │
┌──────────┐     ┌───────────▼──────────┐     ┌─────────────┐
│  User    │────▶│      Pomerium        │────▶│   Backend   │
│          │     │   (Authenticate +    │     │   Service   │
└──────────┘     │    Authorize +       │     └─────────────┘
                 │    Proxy)            │
                 └──────────────────────┘
                             │
                 ┌───────────▼──────────┐
                 │    Policy Engine     │
                 │  (Who can access     │
                 │   what routes)       │
                 └──────────────────────┘
```


Kubernetes Deployment
---------------------

```yaml
# pomerium-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: pomerium-config
  namespace: pomerium
data:
  config.yaml: |
    # Identity Provider configuration
    idp_provider: google
    idp_client_id: ${IDP_CLIENT_ID}
    idp_client_secret: ${IDP_CLIENT_SECRET}
    
    # Authenticate service URL
    authenticate_service_url: https://authenticate.company.com
    
    # Cookie settings
    cookie_secret: ${COOKIE_SECRET}
    cookie_domain: company.com
    
    # Routes and policies
    routes:
      - from: https://grafana.company.com
        to: http://grafana.monitoring.svc.cluster.local:3000
        policy:
          - allow:
              or:
                - email:
                    is: admin@company.com
                - groups:
                    has: platform-team
        
      - from: https://argocd.company.com
        to: https://argocd-server.argocd.svc.cluster.local:443
        tls_skip_verify: true
        policy:
          - allow:
              or:
                - groups:
                    has: engineering
                    
      - from: https://kibana.company.com
        to: http://kibana.logging.svc.cluster.local:5601
        policy:
          - allow:
              or:
                - groups:
                    has: sre
                - groups:
                    has: engineering
        # Preserve original host header
        preserve_host_header: true
```


Helm Deployment
---------------

```yaml
# values.yaml
config:
  rootDomain: company.com
  generateTLS: false
  existingSecret: pomerium-secrets

authenticate:
  idp:
    provider: google
    clientID: your-client-id.apps.googleusercontent.com
    clientSecret: your-client-secret
    serviceAccount: |
      {
        &quot;type&quot;: &quot;service_account&quot;,
        ...
      }

ingress:
  enabled: true
  className: nginx
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
  hosts:
    - authenticate.company.com
    - grafana.company.com
    - argocd.company.com
  tls:
    - secretName: pomerium-tls
      hosts:
        - &quot;*.company.com&quot;
```

```bash
helm repo add pomerium https://helm.pomerium.io
helm upgrade --install pomerium pomerium/pomerium \
  -n pomerium --create-namespace \
  -f values.yaml
```


Pomerium Policy Language
------------------------

Pomerium policies are written in Pomerium Policy Language (PPL), which combines criteria with `and`, `or`, `not`, and `nor`:

```yaml
routes:
  # Simple email-based access
  - from: https://admin.company.com
    to: http://admin-backend:8080
    policy:
      - allow:
          or:
            - email:
                is: cto@company.com
            - email:
                is: vp-engineering@company.com

  # Group-based with domain restriction
  - from: https://internal.company.com
    to: http://internal-api:8080
    policy:
      - allow:
          and:
            - domain:
                is: company.com
            - groups:
                has: employees

  # Date-window access (the date criterion compares absolute timestamps,
  # so this allows one fixed window rather than recurring business hours)
  - from: tcp+https://production-db.company.com:5432
    to: tcp://db-proxy:5432
    policy:
      - allow:
          and:
            - groups:
                has: dba
            - date:
                after: &quot;2024-01-01T09:00:00Z&quot;
                before: &quot;2024-01-01T18:00:00Z&quot;

  # Claims-based (custom IdP attributes)
  - from: https://contractor-portal.company.com
    to: http://contractor-api:8080
    policy:
      - allow:
          and:
            - claim/contract_status:
                is: active
            - claim/department:
                is: engineering
```
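
To make the `and` / `or` semantics concrete, here is a minimal Python sketch of how such a rule could be evaluated. It is a simplified model for illustration, not the actual Pomerium engine: the `matches` and `evaluate_allow` functions and the shape of the `user` dict are assumptions, and only the `is` and `has` matchers are handled.

```python
def matches(criterion, user):
    """Check one criterion such as {"email": {"is": "cto@company.com"}}."""
    (name, matcher), = criterion.items()
    value = user.get(name)
    if "is" in matcher:
        return value == matcher["is"]
    if "has" in matcher:
        return matcher["has"] in (value or [])
    return False  # unknown matchers never grant access in this sketch


def evaluate_allow(rule, user):
    """Evaluate an allow rule of the form {"and": [...]} or {"or": [...]}."""
    (operator, criteria), = rule.items()
    results = [matches(c, user) for c in criteria]
    if operator == "and":
        return all(results)
    if operator == "or":
        return any(results)
    return False


# Mirrors the internal.company.com route: domain AND group must both match
internal_rule = {"and": [
    {"domain": {"is": "company.com"}},
    {"groups": {"has": "employees"}},
]}

employee = {"email": "dev@company.com", "domain": "company.com", "groups": ["employees"]}
contractor = {"email": "ext@vendor.io", "domain": "vendor.io", "groups": ["employees"]}

print(evaluate_allow(internal_rule, employee))    # True
print(evaluate_allow(internal_rule, contractor))  # False
```

With `and`, a single failing criterion denies the request, which is why the contractor is rejected despite being in the right group. Swapping the operator to `or` would let either criterion grant access.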


OAuth2-Proxy: Simple and Lightweight
====================================

For simpler setups, OAuth2-Proxy is a lightweight alternative.
It&apos;s a single binary that handles OAuth2 authentication.


Kubernetes Deployment
---------------------

```yaml
# oauth2-proxy-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: oauth2-proxy
  namespace: auth
spec:
  replicas: 2
  selector:
    matchLabels:
      app: oauth2-proxy
  template:
    metadata:
      labels:
        app: oauth2-proxy
    spec:
      containers:
        - name: oauth2-proxy
          image: quay.io/oauth2-proxy/oauth2-proxy:v7.6.0
          args:
            - --provider=google
            - --email-domain=company.com
            - --upstream=file:///dev/null
            - --http-address=0.0.0.0:4180
            - --cookie-secure=true
            - --cookie-domain=.company.com
            - --whitelist-domain=.company.com
            - --set-xauthrequest=true
            - --pass-access-token=true
            - --pass-user-headers=true
            - --set-authorization-header=true
          env:
            - name: OAUTH2_PROXY_CLIENT_ID
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: client-id
            - name: OAUTH2_PROXY_CLIENT_SECRET
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: client-secret
            - name: OAUTH2_PROXY_COOKIE_SECRET
              valueFrom:
                secretKeyRef:
                  name: oauth2-proxy-secrets
                  key: cookie-secret
          ports:
            - containerPort: 4180
          readinessProbe:
            httpGet:
              path: /ping
              port: 4180
            initialDelaySeconds: 5
            periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: oauth2-proxy
  namespace: auth
spec:
  selector:
    app: oauth2-proxy
  ports:
    - port: 4180
      targetPort: 4180
```


NGINX Ingress Integration
-------------------------

OAuth2-Proxy integrates with NGINX Ingress via annotations:

```yaml
# ingress-with-oauth2.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: protected-app
  namespace: default
  annotations:
    nginx.ingress.kubernetes.io/auth-url: &quot;https://oauth2.company.com/oauth2/auth&quot;
    nginx.ingress.kubernetes.io/auth-signin: &quot;https://oauth2.company.com/oauth2/start?rd=$escaped_request_uri&quot;
    nginx.ingress.kubernetes.io/auth-response-headers: &quot;X-Auth-Request-User,X-Auth-Request-Email,X-Auth-Request-Groups&quot;
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - app.company.com
      secretName: app-tls
  rules:
    - host: app.company.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app
                port:
                  number: 8080
```
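
The annotations above make NGINX send a subrequest to the `auth-url` before every request reaches the app. The following Python sketch models that decision without any real HTTP. The `fake_auth` and `handle_request` functions, the session value, and the cookie name are all hypothetical stand-ins, so treat it as a picture of the flow rather than how ingress-nginx is implemented.

```python
from urllib.parse import quote

# Header names copied from the auth-response-headers annotation above
AUTH_RESPONSE_HEADERS = ["X-Auth-Request-User", "X-Auth-Request-Email", "X-Auth-Request-Groups"]
SIGNIN_URL = "https://oauth2.company.com/oauth2/start?rd="


def fake_auth(cookies):
    """Hypothetical stand-in for GET /oauth2/auth: 202 plus identity headers, or 401."""
    if cookies.get("_oauth2_proxy") == "valid-session":
        return 202, {"X-Auth-Request-Email": "dev@company.com",
                     "X-Auth-Request-Groups": "engineering"}
    return 401, {}


def handle_request(path, cookies):
    """Model the ingress decision for one request."""
    status, auth_headers = fake_auth(cookies)
    if status == 401:
        # Not logged in: redirect to the sign-in URL with the escaped original path
        return 302, {"Location": SIGNIN_URL + quote(path, safe="")}
    if status not in (200, 202):
        return 500, {}
    # Logged in: copy only the allow-listed identity headers to the upstream request
    upstream = {h: auth_headers[h] for h in AUTH_RESPONSE_HEADERS if h in auth_headers}
    return 200, upstream


print(handle_request("/dashboard", {}))
print(handle_request("/dashboard", {"_oauth2_proxy": "valid-session"}))
```

The first call yields a redirect to the sign-in URL; the second forwards the request with the identity headers attached. Headers that are not listed in `auth-response-headers` never reach the backend.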


AWS: ALB with Cognito Authentication
====================================

AWS doesn&apos;t have a direct IAP equivalent, but you can achieve
similar functionality using ALB with Cognito authentication.


Terraform Configuration
-----------------------

```hcl
# Cognito User Pool
resource &quot;aws_cognito_user_pool&quot; &quot;main&quot; {
  name = &quot;internal-apps&quot;

  password_policy {
    minimum_length    = 12
    require_lowercase = true
    require_numbers   = true
    require_symbols   = true
    require_uppercase = true
  }

  # Require an email attribute for every user
  schema {
    name                = &quot;email&quot;
    attribute_data_type = &quot;String&quot;
    required            = true
  }
}

# User Pool Domain
resource &quot;aws_cognito_user_pool_domain&quot; &quot;main&quot; {
  domain       = &quot;internal-apps-${data.aws_caller_identity.current.account_id}&quot;
  user_pool_id = aws_cognito_user_pool.main.id
}

# App Client
resource &quot;aws_cognito_user_pool_client&quot; &quot;alb&quot; {
  name         = &quot;alb-client&quot;
  user_pool_id = aws_cognito_user_pool.main.id

  generate_secret = true

  allowed_oauth_flows                  = [&quot;code&quot;]
  allowed_oauth_flows_user_pool_client = true
  allowed_oauth_scopes                 = [&quot;openid&quot;, &quot;email&quot;, &quot;profile&quot;]

  callback_urls = [
    &quot;https://app.company.com/oauth2/idpresponse&quot;
  ]

  supported_identity_providers = [&quot;COGNITO&quot;]
}

# ALB with Authentication
resource &quot;aws_lb_listener_rule&quot; &quot;authenticated&quot; {
  listener_arn = aws_lb_listener.https.arn
  priority     = 100

  action {
    type = &quot;authenticate-cognito&quot;
    authenticate_cognito {
      user_pool_arn       = aws_cognito_user_pool.main.arn
      user_pool_client_id = aws_cognito_user_pool_client.alb.id
      user_pool_domain    = aws_cognito_user_pool_domain.main.domain

      on_unauthenticated_request = &quot;authenticate&quot;
      session_timeout            = 3600
    }
  }

  action {
    type             = &quot;forward&quot;
    target_group_arn = aws_lb_target_group.app.arn
  }

  condition {
    host_header {
      values = [&quot;app.company.com&quot;]
    }
  }
}
```


Federate with Corporate IdP (Okta)
----------------------------------

```hcl
# SAML Identity Provider
resource &quot;aws_cognito_identity_provider&quot; &quot;okta&quot; {
  user_pool_id  = aws_cognito_user_pool.main.id
  provider_name = &quot;Okta&quot;
  provider_type = &quot;SAML&quot;

  provider_details = {
    MetadataURL             = &quot;https://company.okta.com/app/xxx/sso/saml/metadata&quot;
    IDPSignout              = &quot;true&quot;
    RequestSigningAlgorithm = &quot;rsa-sha256&quot;
  }

  attribute_mapping = {
    email    = &quot;http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress&quot;
    name     = &quot;http://schemas.xmlsoap.org/ws/2005/05/identity/claims/name&quot;
    username = &quot;http://schemas.xmlsoap.org/ws/2005/05/identity/claims/emailaddress&quot;
  }
}

# Update client to use Okta
resource &quot;aws_cognito_user_pool_client&quot; &quot;alb_federated&quot; {
  name         = &quot;alb-client-federated&quot;
  user_pool_id = aws_cognito_user_pool.main.id

  generate_secret = true

  allowed_oauth_flows                  = [&quot;code&quot;]
  allowed_oauth_flows_user_pool_client = true
  allowed_oauth_scopes                 = [&quot;openid&quot;, &quot;email&quot;, &quot;profile&quot;]

  callback_urls = [
    &quot;https://app.company.com/oauth2/idpresponse&quot;
  ]

  supported_identity_providers = [&quot;Okta&quot;]
}
```


Handling IAP Headers in Your Application
========================================

Your backend needs to trust and parse the identity headers.


Go Example
----------

```go
package main

import (
    &quot;log&quot;
    &quot;net/http&quot;
    &quot;strings&quot;
)

type User struct {
    Email  string
    Groups []string
}

func getUserFromHeaders(r *http.Request) *User {
    email := r.Header.Get(&quot;X-Forwarded-Email&quot;)
    if email == &quot;&quot; {
        email = r.Header.Get(&quot;X-Auth-Request-Email&quot;)
    }
    
    if email == &quot;&quot; {
        return nil
    }

    groups := r.Header.Get(&quot;X-Forwarded-Groups&quot;)
    if groups == &quot;&quot; {
        groups = r.Header.Get(&quot;X-Auth-Request-Groups&quot;)
    }

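    user := &amp;User{Email: email}
    // strings.Split(&quot;&quot;, &quot;,&quot;) returns [&quot;&quot;], so only split a non-empty header
    if groups != &quot;&quot; {
        user.Groups = strings.Split(groups, &quot;,&quot;)
    }
    return user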
}

func requireAuth(next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        user := getUserFromHeaders(r)
        if user == nil {
            http.Error(w, &quot;Unauthorized&quot;, http.StatusUnauthorized)
            return
        }
        
        log.Printf(&quot;Request from user: %s, groups: %v&quot;, user.Email, user.Groups)
        next(w, r)
    }
}

func requireGroup(group string, next http.HandlerFunc) http.HandlerFunc {
    return func(w http.ResponseWriter, r *http.Request) {
        user := getUserFromHeaders(r)
        if user == nil {
            http.Error(w, &quot;Unauthorized&quot;, http.StatusUnauthorized)
            return
        }

        for _, g := range user.Groups {
            if g == group {
                next(w, r)
                return
            }
        }

        http.Error(w, &quot;Forbidden: requires group &quot;+group, http.StatusForbidden)
    }
}

func main() {
    http.HandleFunc(&quot;/api/public&quot;, func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(&quot;Public endpoint&quot;))
    })

    http.HandleFunc(&quot;/api/user&quot;, requireAuth(func(w http.ResponseWriter, r *http.Request) {
        user := getUserFromHeaders(r)
        w.Write([]byte(&quot;Hello, &quot; + user.Email))
    }))

    http.HandleFunc(&quot;/api/admin&quot;, requireGroup(&quot;admin&quot;, func(w http.ResponseWriter, r *http.Request) {
        w.Write([]byte(&quot;Admin-only endpoint&quot;))
    }))

    log.Fatal(http.ListenAndServe(&quot;:8080&quot;, nil))
}
```


Express.js Example
------------------

```javascript
const express = require(&apos;express&apos;);
const app = express();

// Middleware to extract user from IAP headers
const iapAuth = (req, res, next) =&gt; {
    const email = req.headers[&apos;x-forwarded-email&apos;] || 
                  req.headers[&apos;x-auth-request-email&apos;];
    
    if (!email) {
        return res.status(401).json({ error: &apos;Unauthorized&apos; });
    }

    const groups = (req.headers[&apos;x-forwarded-groups&apos;] || 
                    req.headers[&apos;x-auth-request-groups&apos;] || &apos;&apos;)
                    .split(&apos;,&apos;)
                    .filter(Boolean);

    req.user = { email, groups };
    next();
};

// Middleware to require specific group
const requireGroup = (group) =&gt; (req, res, next) =&gt; {
    if (!req.user.groups.includes(group)) {
        return res.status(403).json({ 
            error: `Forbidden: requires group ${group}` 
        });
    }
    next();
};

app.get(&apos;/api/user&apos;, iapAuth, (req, res) =&gt; {
    res.json({ 
        message: `Hello, ${req.user.email}`,
        groups: req.user.groups 
    });
});

app.get(&apos;/api/admin&apos;, iapAuth, requireGroup(&apos;admin&apos;), (req, res) =&gt; {
    res.json({ message: &apos;Admin endpoint&apos; });
});

app.listen(8080, () =&gt; console.log(&apos;Server running on port 8080&apos;));
```


Security Considerations
=======================

Header Spoofing Prevention
--------------------------

Your IAP must strip any incoming headers that match the injected
header names. Otherwise, attackers could spoof identity by sending:

```bash
curl -H &quot;X-Forwarded-Email: admin@company.com&quot; https://app.company.com
```

Most IAP solutions handle this automatically. Verify by testing:

```bash
# This should NOT result in admin access
curl -H &quot;X-Forwarded-Email: admin@company.com&quot; \
     -H &quot;X-Forwarded-Groups: admin&quot; \
     https://app.company.com/api/admin
```
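
What stripping means at the proxy layer can be sketched in a few lines of Python. This is an illustration only: `sanitize_headers` and the `verified_identity` dict are hypothetical names, and a real proxy does this inside its request pipeline.

```python
# Identity headers the backend trusts; compared case-insensitively
IDENTITY_HEADERS = {"x-forwarded-email", "x-forwarded-groups",
                    "x-auth-request-email", "x-auth-request-groups"}


def sanitize_headers(client_headers, verified_identity):
    """Drop client-supplied identity headers, then set them from the verified session."""
    clean = {k: v for k, v in client_headers.items() if k.lower() not in IDENTITY_HEADERS}
    clean["X-Forwarded-Email"] = verified_identity["email"]
    clean["X-Forwarded-Groups"] = ",".join(verified_identity["groups"])
    return clean


spoofed = {"X-Forwarded-Email": "admin@company.com", "X-Forwarded-Groups": "admin", "Accept": "*/*"}
session = {"email": "dev@company.com", "groups": ["engineering"]}
print(sanitize_headers(spoofed, session))
```

The spoofed `admin@company.com` value never reaches the backend; the identity headers always come from the authenticated session.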


Network Security
----------------

If your backend is directly accessible (bypassing IAP), attackers
can inject headers directly. Ensure:

1. Backend is not publicly accessible
2. Backend only accepts traffic from IAP
3. Use network policies in Kubernetes

```yaml
# network-policy.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: backend-only-from-iap
  namespace: default
spec:
  podSelector:
    matchLabels:
      app: backend
  policyTypes:
    - Ingress
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              # Label that Kubernetes sets automatically on every namespace
              kubernetes.io/metadata.name: pomerium
          podSelector:
            matchLabels:
              app: pomerium
```


JWT Verification (Recommended)
------------------------------

For maximum security, verify the JWT signature instead of trusting
headers. GCP IAP provides signed JWTs in `X-Goog-IAP-JWT-Assertion`.

Pomerium can also forward a signed JWT to the upstream in the `X-Pomerium-Jwt-Assertion` header:

```yaml
routes:
  - from: https://api.company.com
    to: http://api-backend:8080
    policy:
      - allow:
          and:
            - groups:
                has: engineering
    # Forward the signed X-Pomerium-Jwt-Assertion header to the upstream
    pass_identity_headers: true
```
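
To show what verifying the signature involves, here is a standard-library Python sketch. It uses HS256 with a shared secret purely so the example is self-contained, which is an assumption made for illustration: GCP IAP signs with ES256 and publishes public keys, so production code should use a maintained JWT library with the public keys of the provider. The `sign` and `verify` helpers are hypothetical, and expiry and issuer checks are left out for brevity.

```python
import base64
import hashlib
import hmac
import json


def b64url_encode(raw):
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def b64url_decode(text):
    padding = "=" * (-len(text) % 4)
    return base64.urlsafe_b64decode(text + padding)


def sign(claims, secret):
    """Build a compact HS256 token (only used here to produce sample input)."""
    header = b64url_encode(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url_encode(json.dumps(claims).encode())
    mac = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url_encode(mac)}"


def verify(token, secret, expected_audience):
    """Return the claims if signature, algorithm and audience check out, else None."""
    try:
        header, payload, signature = token.split(".")
        expected = hmac.new(secret, f"{header}.{payload}".encode(), hashlib.sha256).digest()
        # Constant-time comparison; reject before parsing any untrusted claims
        if not hmac.compare_digest(expected, b64url_decode(signature)):
            return None
        if json.loads(b64url_decode(header)).get("alg") != "HS256":
            return None
        claims = json.loads(b64url_decode(payload))
    except ValueError:
        return None
    if claims.get("aud") != expected_audience:
        return None
    return claims


secret = b"demo-secret"
token = sign({"email": "dev@company.com", "aud": "api.company.com"}, secret)
print(verify(token, secret, "api.company.com"))        # the claims dict
print(verify(token, b"wrong-secret", "api.company.com"))  # None
```

Unlike a plain header, the email claim here cannot be forged by anyone who lacks the signing key, even if they reach the backend directly.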


Troubleshooting
===============

**Redirect loop after login:**

Check that the callback URL registered with your IdP exactly matches the
one the proxy sends, including any trailing slash.

```
Expected: https://app.company.com/oauth2/callback
Got:      https://app.company.com/oauth2/callback/
```


**&quot;Access Denied&quot; after successful login:**

User authenticated but failed authorization. Check:
- User email in allowed list
- User groups match policy
- Domain restriction (e.g., `email_domain: company.com`)


**Headers not reaching backend:**

Verify header passthrough in ingress:

```yaml
annotations:
  nginx.ingress.kubernetes.io/auth-response-headers: &quot;X-Auth-Request-User,X-Auth-Request-Email&quot;
```


**Session expired too quickly:**

Increase session timeout:
- GCP IAP: Cannot be changed (1 hour)
- Pomerium: `cookie_expire: 24h`
- OAuth2-Proxy: `--cookie-expire=168h`


References
==========

- GCP IAP Docs: https://cloud.google.com/iap/docs
- Pomerium Docs: https://www.pomerium.com/docs
- OAuth2-Proxy: https://oauth2-proxy.github.io/oauth2-proxy
- BeyondCorp Whitepaper: https://research.google/pubs/pub43231/
- Zero Trust Architecture (NIST): https://csrc.nist.gov/publications/detail/sp/800-207/final


========================================
Identity Aware Proxy + Zero Trust
========================================
Authenticate at the edge. Trust nothing.
========================================</content:encoded><category>identity-aware-proxy</category><category>zero-trust</category><category>security</category><category>kubernetes</category><category>terraform</category><category>oauth2</category><author>Mo Abukar</author></item><item><title>10 Rules for Negotiating Your Job Offer (From 7 Years of Engineering)</title><link>https://moabukar.co.uk/blog/negotiation-rules-engineers/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/negotiation-rules-engineers/</guid><description>Most engineers massively undervalue themselves because no one taught them how to negotiate. Here&apos;s everything I&apos;ve learned from negotiating salaries, contracts, titles, and more.</description><pubDate>Wed, 04 Feb 2026 00:00:00 GMT</pubDate><content:encoded>10 Rules for Negotiating Your Job Offer
=======================================

Over the past 7-8 years, I&apos;ve worked across startups, scale-ups, consultancies and high-pressure engineering teams. I specialise in cloud, DevOps, Kubernetes, automation and platform engineering - but one thing I learned very early is this:

**Your technical skills get you the interview. Your negotiation skills decide your career.**

I&apos;ve personally negotiated:

- Salary jumps like £80k to £100k, £100k to £120k+
- Contract day rates from £500 to £650
- Title upgrades from Senior to Principal
- Senior to Staff level transitions
- Remote flexibility, reduced on-call, sign-on bonuses, contract tweaks and more

I&apos;ve coached engineers at every level - juniors getting their first break, seniors stepping into leadership and principals trying to align compensation with responsibility.

Across all these cases, one thing stays true:

**Most engineers massively undervalue themselves because no one ever taught them how to negotiate.**

Here&apos;s the reality: negotiation is an art of its own. It has tactics, unspoken rules, leverage patterns and psychology behind it. Some of it seems like &quot;common sense&quot; only after someone tells you. Most of what I teach here comes from experience - and honestly, mistakes I made early on.

---

![10 Rules for Negotiating Your Job Offer (From 7 Years of Engineering)](/images/negotiation-rules-engineers.jpg)


Why Most Advice is Useless
==========================

Most negotiation advice is vague rubbish. &quot;Make sure you negotiate.&quot; &quot;Never say the first number.&quot; Beyond those two morsels, you&apos;re on your own.

I think people believe negotiation is some mystical skill that some people have and others don&apos;t. That&apos;s nonsense. Negotiation is learnable. It&apos;s not magic. It&apos;s patterns and psychology.

Three caveats before we start:

**One:** I&apos;m not a professional negotiator. When my advice contradicts actual experts, assume I&apos;m wrong.

**Two:** Negotiation is intertwined with social dynamics. The appropriate advice for a white male in London might not be appropriate for someone else in a different context. Be aware of this. But don&apos;t let fear of discrimination stop you from negotiating - that&apos;s often just as damaging.

**Three:** Negotiation is stupid. It&apos;s a practice that rewards those who are good at it, regardless of actual merit. But it&apos;s the system we have. Might as well get good at it.

---

The Ten Rules
=============

```
RULE    PRINCIPLE
====    =========
1       Get everything in writing
2       Always keep the door open
3       Information is power
4       Always be positive
5       Don&apos;t be the decision maker
6       Have alternatives
7       Proclaim reasons for everything
8       Be motivated by more than just money
9       Understand what they value
10      Be winnable
```

Let me walk through each one.

---

Rule 1: Get Everything in Writing
=================================

When you receive an offer, write everything down. Salary, equity, bonus, start date, title, benefits, WFH policy - all of it.

Even if they say they&apos;ll send a written version later, write it down yourself. Even non-monetary things: &quot;we&apos;re migrating to Kubernetes next quarter&quot; - write it down. &quot;The team is 8 people&quot; - write it down.

You&apos;ll forget. And this information will inform your decision.

From this point on, everything significant gets a paper trail. Confirm details in follow-up emails. Companies often don&apos;t send official offer letters until the deal is done, so it&apos;s on you to document.

---

Rule 2: Always Keep the Door Open
=================================

After they give you the offer details, they&apos;ll ask: &quot;So what do you think?&quot;

This is a trap. Not malicious, but it&apos;s designed to get you to commit.

If you say &quot;Yes, sounds amazing, when do I start?&quot; - you&apos;ve accepted. Door closed.

If you say &quot;Can you do £95k instead of £90k?&quot; - you&apos;ve also closed the door. You&apos;ve told them exactly what it takes to sign you. They&apos;ll offer £92k and you&apos;ll probably accept.

Never give up negotiating power until you&apos;re ready to make a final, informed decision.

Instead, say something like:

&quot;Thanks so much - I&apos;m really excited about the opportunity. Right now I&apos;m wrapping up conversations with a few other companies, so I can&apos;t commit to specifics yet. But I&apos;m confident we can find something that works for both of us. I&apos;d love to be part of the team.&quot;

You&apos;ve said nothing. You&apos;ve committed to nothing. You&apos;ve kept all your power.

---

Rule 3: Information is Power
============================

The company doesn&apos;t tell you their budget. They don&apos;t tell you what they paid the last person in this role. They don&apos;t tell you how desperate they are to fill the position.

They want all your information while protecting theirs.

Don&apos;t play that game.

When you say &quot;Can you do £95k instead of £90k?&quot; you&apos;ve revealed your hand. They now know exactly where the ceiling is. They&apos;ll bid £92k and close.

But what if you&apos;re the kind of person who wouldn&apos;t consider anything below £110k? Or £120k? If you were, you wouldn&apos;t be asking for £95k.

By staying silent, they don&apos;t know which kind of person you are. That uncertainty is your leverage.

**Corollary:** Don&apos;t reveal your current salary if you can avoid it. If you must, be liberal in calculating total compensation - include bonuses, stock, benefits, pension, on-call pay, everything. And frame it as &quot;This is what I&apos;m making now, and I&apos;m looking for a step up.&quot;

---

Rule 4: Always Be Positive
==========================

Even if the offer is rubbish, stay positive and excited about the company.

Why? Because your excitement is an asset. The company is investing in you because they think you&apos;ll work hard and stay. If you seem less excited, you become a riskier investment. You&apos;re literally worth less to them.

So regardless of how negotiations are going, signal that:
1. You still like the company
2. You&apos;re still excited to work there
3. You want to make this work

Reiterate that you love the mission, the team, the problem space. Keep the energy up.

---

Rule 5: Don&apos;t Be the Decision Maker
===================================

End the offer conversation like this:

&quot;I&apos;ll look over the details and discuss with my partner/family/advisor. I&apos;ll reach out if I have questions. Thanks for sharing the good news!&quot;

See what happened? You&apos;ve introduced external decision-makers. The recruiter can&apos;t pressure you because the &quot;real&quot; decision-maker is beyond their reach.

This is a classic technique. Customer support does it: &quot;It&apos;s not my decision, I&apos;m just doing my job.&quot; It defuses tension and gives you control.

Even if you don&apos;t actually care what your partner thinks about your job offer, mentioning them gives you breathing room.

---

Rule 6: Have Alternatives
=========================

This is the most important rule. Having other offers is the single strongest lever you have.

Here&apos;s why: companies know their own interview process is noisy. They know most interview processes are noisy. But a candidate with multiple offers has multiple weak signals in their favour. Combined, those converge into a much stronger signal.

It&apos;s like a student with a strong SAT score AND strong GPA AND scholarships. Could still be a dunce, but much less likely.

So tell companies you have other offers. It&apos;s not tacky. It&apos;s the oldest method in history to galvanise a marketplace: show that supply is limited.

When you get an offer, immediately email every other company you&apos;re talking to:

```
Hi [NAME],

Quick update on my process - I&apos;ve just received an offer from [COMPANY] 
which is quite strong. That said, I&apos;m really excited about [YOUR COMPANY] 
and want to see if we can make it work. 

Since my timeline is now compressed, is there anything you can do to 
expedite the process?
```

Send this to everyone. Even companies you think are long shots. Demand breeds demand.

---

Rule 7: Proclaim Reasons for Everything
=======================================

When you ask for something, always give a reason. Even if the reason is weak, having one makes requests more palatable.

Bad: &quot;I want £110k.&quot;

Good: &quot;I&apos;m looking for £110k because that&apos;s what it would take to make this move make sense financially - I&apos;d be leaving unvested equity and my current role has a strong growth trajectory.&quot;

The reason doesn&apos;t have to be ironclad. It just has to exist. People are wired to respond better to requests that have justification, even flimsy justification.

---

Rule 8: Be Motivated by More Than Just Money
============================================

Don&apos;t approach negotiation as pure compensation extraction. Think about what you actually want:

- Base salary
- Equity/bonus
- Title
- Remote flexibility
- Team/project placement
- Learning opportunities
- Reduced on-call
- Sign-on bonus
- Start date
- Hardware/equipment

Some of these are easier for companies to give than others. Equity might come from a fixed pool. But a title upgrade? Remote days? Those often cost the company nothing.

Negotiate across multiple dimensions. Sometimes you&apos;ll get more value by asking for things that don&apos;t cost them much.

---

Rule 9: Understand What They Value
==================================

Try to understand the company&apos;s position. What do they actually care about?

- Are they desperate to fill this role?
- Is headcount tight?
- Do they have competing candidates?
- Is the hiring manager fighting for budget?
- What&apos;s their timeline?

The more you understand their constraints, the better you can craft asks that work for both sides.

If they&apos;re desperate, you have more leverage than you think. If they have five other candidates, you have less. Read the situation.

---

Rule 10: Be Winnable
====================

Here&apos;s the counterbalance to everything above: the company needs to believe they can actually close you.

If you seem like you&apos;re just collecting offers with no intention of joining, they&apos;ll stop investing energy. If you seem impossible to satisfy, they&apos;ll move on.

Signal that you&apos;re genuinely interested. That if the right package comes together, you&apos;ll sign. That you&apos;re not just wasting their time.

The best negotiating position is: &quot;I really want to join. Help me make it work.&quot;

---

Dealing with Exploding Offers
=============================

Exploding offers - offers that expire in 24-72 hours - are increasingly common, especially at startups.

They&apos;re designed to limit your ability to get counteroffers. Companies know exactly what they&apos;re doing.

Don&apos;t feel guilty about pushing back. Needing more than 48 hours to make a life decision isn&apos;t a character flaw.

Here&apos;s how to handle it:

```
&quot;I have one concern. You mentioned this offer expires in 48 hours. 
That doesn&apos;t work for me - there&apos;s no way I can make an informed 
decision in that window. I&apos;m wrapping up conversations with other 
companies, which will take another week or so. I&apos;ll need more time.&quot;
```

If they push back:

```
&quot;That&apos;s unfortunate. I really like [COMPANY] and was excited about 
the team, but 48 hours is too short for a decision this significant. 
I take my commitments seriously and need to consult with [PARTNER/ADVISOR]. 
I can&apos;t make a decision I&apos;m comfortable with in this timeframe.&quot;
```

Almost every company will relent. If they don&apos;t, walk away. They&apos;ll usually grab you before you reach the door.

Every exploding offer I&apos;ve ever received was extended when I pushed back. Sometimes by weeks.

---

The Mindset
===========

Don&apos;t value companies on a single dimension. Salary matters, but so does:

- Cultural fit
- Challenge of the work
- Learning potential
- Career trajectory
- Quality of life
- Growth potential
- Overall happiness

Anyone who says &quot;just choose where you&apos;ll be happiest&quot; is being as simplistic as someone who says &quot;just choose the highest salary.&quot;

Also remember: different companies value you differently. Your specific skills might be worth more at Company A than Company B. The more companies you talk to, the more likely you are to find one where you&apos;re unusually valuable.

Keep an open mind about which company that turns out to be.

---

Final Thoughts
==============

Negotiation feels uncomfortable. It feels like you&apos;re being greedy or difficult.

Get over it.

Companies expect negotiation. They&apos;ve budgeted for it. The offer they give you has room built in. Not negotiating leaves money on the table that was allocated for you.

In all my years, I&apos;ve never seen an offer rescinded because someone negotiated professionally. It basically doesn&apos;t happen. And when it does, the candidate was being unreasonable or the company was looking for an excuse.

The worst they can say is no. And even then, you&apos;ve signalled that you know your worth.

```
========================================
Your skills got you the interview.
Your negotiation decides your career.
========================================
```</content:encoded><category>career</category><category>negotiation</category><category>salary</category><category>engineering-culture</category><category>advice</category><author>Mo Abukar</author></item><item><title>ELK Stack Migration: From 6.x to 8.x - The Complete Guide</title><link>https://moabukar.co.uk/blog/elk-6-to-8-migration/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/elk-6-to-8-migration/</guid><description>A comprehensive guide to migrating your Elasticsearch, Logstash, and Kibana stack from version 6.x to 8.x. Covers breaking changes, migration strategies, index compatibility, and zero-downtime approaches.</description><pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate><content:encoded># ELK Stack Migration: From 6.x to 8.x - The Complete Guide

Migrating an ELK stack from 6.x to 8.x isn&apos;t a simple version bump. It&apos;s a multi-step journey with breaking changes at every major version, index compatibility requirements, and fundamental architectural shifts - especially around security.

I recently completed this migration for a client running a 15-node production cluster with 50TB of logs. This post documents the complete process: the required upgrade path, breaking changes, migration strategies, and the exact steps we followed.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/elastic.svg&quot; alt=&quot;Elastic logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## The Upgrade Path: You Can&apos;t Skip Versions

**Critical:** You cannot directly upgrade from Elasticsearch 6.x to 8.x. The supported upgrade path is:

```
6.x → 6.8 (latest) → 7.17 (latest 7.x) → 8.x
```

Why? Each major version can only read indices from the previous major version:

| Elasticsearch Version | Can Read Indices Created In |
|-----------------------|-----------------------------|
| 6.x | 5.x, 6.x |
| 7.x | 6.x, 7.x |
| 8.x | 7.x, 8.x |

This means:
- **ES 7.x cannot read indices created in ES 5.x** - must reindex first
- **ES 8.x cannot read indices created in ES 6.x** - must go through 7.x first

If you have any indices created in 5.x or earlier, you must reindex them before upgrading to 7.x.
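
To sanity-check the hops for a given starting version, here is a minimal Python sketch of the path above. The `plan_upgrade_path` helper and the `STEPPING_STONES` table are my own illustration, not part of any Elastic tooling, and they only cover 6.x and 7.x starting points.

```python
# Hypothetical helper: the stepping-stone versions come from the upgrade path above.
STEPPING_STONES = {6: "6.8", 7: "7.17"}


def plan_upgrade_path(current, target_major=8):
    """Return the ordered list of versions to pass through."""
    major, minor = (int(part) for part in current.split(".")[:2])
    if major not in STEPPING_STONES:
        raise ValueError(f"unsupported starting version: {current}")
    path = []
    for step_major in range(major, target_major):
        stone = STEPPING_STONES[step_major]
        # Skip the stepping stone if the cluster is already on it.
        if f"{major}.{minor}" != stone:
            path.append(stone)
    path.append(f"{target_major}.x")
    return path


print(plan_upgrade_path("6.4.2"))    # ['6.8', '7.17', '8.x']
print(plan_upgrade_path("7.17.18"))  # ['8.x']
```

A 5.x starting version raises an error on purpose: that case needs a reindex first, not just another hop.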

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/elk-6-to-8-migration](https://github.com/moabukar/blog-code/tree/main/elk-6-to-8-migration)

---

## Phase 1: Preparation and Assessment

### Step 1: Inventory Your Cluster

Before touching anything, document your current state:

```bash
# Get cluster health
curl -X GET &quot;localhost:9200/_cluster/health?pretty&quot;

# List all indices with creation date, shard counts, and size
curl -X GET &quot;localhost:9200/_cat/indices?v&amp;h=index,creation.date.string,pri,rep,docs.count,store.size&quot;

# Check index settings for compatibility issues
curl -X GET &quot;localhost:9200/_settings?pretty&quot; | jq &apos;to_entries[] | {index: .key, created: .value.settings.index.version.created}&apos;

# Get current version
curl -X GET &quot;localhost:9200/&quot;
```

### Step 2: Identify Indices Created in 5.x

ES 7.x cannot read 5.x indices. Check for them:

```bash
# version.created is a numeric ID (e.g. 5060399 = 5.6.3); IDs starting with 5, 2, or 1 predate 6.x
curl -s &quot;localhost:9200/_settings&quot; | jq -r &apos;
  to_entries[] | 
  select(.value.settings.index.version.created | startswith(&quot;5&quot;) or startswith(&quot;2&quot;) or startswith(&quot;1&quot;)) | 
  .key&apos;
```
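
The same check can be done on the parsed `_settings` response. This is a rough Python sketch, assuming the ID layout used by 5.x to 7.x (major, minor, and patch packed as two-digit groups, followed by a two-digit build suffix). The function names are mine.

```python
def decode_version_id(version_id):
    """Split an index.version.created ID into (major, minor, patch)."""
    value = int(version_id)
    major = value // 1_000_000
    minor = (value // 10_000) % 100
    patch = (value // 100) % 100
    return major, minor, patch


def needs_reindex(settings, target_major=7):
    """Return index names created before the oldest major that target_major can read."""
    oldest_readable = target_major - 1
    stale = []
    for name, body in settings.items():
        created = body["settings"]["index"]["version"]["created"]
        major = decode_version_id(created)[0]
        if major not in range(oldest_readable, target_major + 1):
            stale.append(name)
    return sorted(stale)


# Shape mirrors the GET _settings response: index name, then settings.index.version.created
sample = {
    "logs-2017": {"settings": {"index": {"version": {"created": "5060399"}}}},
    "logs-2019": {"settings": {"index": {"version": {"created": "6082399"}}}},
}
print(needs_reindex(sample))  # ['logs-2017']
```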

If you have 5.x indices, you must reindex them while still on 6.x:

```bash
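# Reindex the old index into a new one.
# _reindex does not copy settings or mappings, so create the destination first if you need them.
curl -X POST &quot;localhost:9200/_reindex?pretty&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;source&quot;: { &quot;index&quot;: &quot;old-5x-index&quot; },
  &quot;dest&quot;: { &quot;index&quot;: &quot;old-5x-index-reindexed&quot; }
}&apos;

# Compare document counts before deleting anything
curl -X GET &quot;localhost:9200/_cat/indices/old-5x-index*?v&amp;h=index,docs.count&quot;

# Delete the old index
curl -X DELETE &quot;localhost:9200/old-5x-index&quot;

# Optionally add an alias so the old name keeps working
curl -X POST &quot;localhost:9200/_aliases&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;actions&quot;: [
    { &quot;add&quot;: { &quot;index&quot;: &quot;old-5x-index-reindexed&quot;, &quot;alias&quot;: &quot;old-5x-index&quot; }}
  ]
}&apos;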
```

### Step 3: Run the Upgrade Assistant

The Upgrade Assistant is built into Kibana 6.7+, so there is nothing to install:

1. Open Kibana
2. Go to **Management → Upgrade Assistant** (under **Stack Management** in 7.x)
3. Review all deprecation warnings
4. Fix all critical issues before proceeding

The Upgrade Assistant identifies:
- Deprecated index settings
- Deprecated cluster settings
- Mappings that need updating
- Deprecated API usage in Kibana saved objects

### Step 4: Backup Everything

```bash
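# Register a snapshot repository (skip if it already exists).
# The location must be listed under path.repo in elasticsearch.yml on every node.
curl -X PUT &quot;localhost:9200/_snapshot/migration_backup&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;type&quot;: &quot;fs&quot;,
  &quot;settings&quot;: {
    &quot;location&quot;: &quot;/mnt/backups/es-migration&quot;
  }
}&apos;

# Take a full snapshot
curl -X PUT &quot;localhost:9200/_snapshot/migration_backup/pre-migration-snapshot?wait_for_completion=true&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;indices&quot;: &quot;*&quot;,
  &quot;include_global_state&quot;: true
}&apos;

# Verify the snapshot
curl -X GET &quot;localhost:9200/_snapshot/migration_backup/pre-migration-snapshot?pretty&quot;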
```
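
Eyeballing the verify response is easy to get wrong at 2am, so here is a small Python sketch that checks it for you. It assumes the usual shape of the get-snapshot response (a `snapshots` list with `state`, `shards`, and `failures`). The `snapshot_is_usable` name is mine.

```python
def snapshot_is_usable(response, name):
    """True when the named snapshot finished with every shard captured."""
    for snap in response.get("snapshots", []):
        if snap.get("snapshot") != name:
            continue
        shards = snap.get("shards", {})
        return (
            snap.get("state") == "SUCCESS"
            and shards.get("failed", 1) == 0
            and not snap.get("failures")
        )
    # Snapshot not present in the response at all.
    return False


good = {"snapshots": [{
    "snapshot": "pre-migration-snapshot",
    "state": "SUCCESS",
    "shards": {"total": 120, "failed": 0, "successful": 120},
    "failures": [],
}]}
print(snapshot_is_usable(good, "pre-migration-snapshot"))  # True
```

A `PARTIAL` state means at least one shard was not captured. Treat that as a failed backup and do not start the upgrade.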

**Also backup:**
- Kibana saved objects (export from Management → Saved Objects)
- Logstash pipelines
- All configuration files (elasticsearch.yml, kibana.yml, logstash.yml)
- Any custom scripts or integrations

---

## Phase 2: Upgrade to Latest 6.8.x

Always upgrade to the latest minor version before a major upgrade:

```bash
# On each node (rolling restart):
# 1. Disable shard allocation
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;persistent&quot;: {
    &quot;cluster.routing.allocation.enable&quot;: &quot;primaries&quot;
  }
}&apos;

# 2. Stop non-essential indexing and perform a synced flush
curl -X POST &quot;localhost:9200/_flush/synced&quot;

# 3. Stop Elasticsearch
sudo systemctl stop elasticsearch

# 4. Upgrade the package
# Debian/Ubuntu
sudo apt-get update &amp;&amp; sudo apt-get install elasticsearch=6.8.23

# RHEL/CentOS
sudo yum install elasticsearch-6.8.23

# 5. Start Elasticsearch
sudo systemctl start elasticsearch

# 6. Wait for node to join
curl -X GET &quot;localhost:9200/_cat/nodes&quot;

# 7. Re-enable shard allocation
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;persistent&quot;: {
    &quot;cluster.routing.allocation.enable&quot;: null
  }
}&apos;

# 8. Wait for green status before proceeding to next node
curl -X GET &quot;localhost:9200/_cluster/health?wait_for_status=green&amp;timeout=5m&quot;
```

Repeat for all nodes, one at a time.
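
If you script the loop, it helps to generate the full plan up front and review it before touching a node. Below is a minimal Python sketch. The step names mirror the numbered comments above, the node names are placeholders, and putting master-eligible nodes last follows Elastic&apos;s general rolling-upgrade guidance rather than anything specific to this cluster.

```python
STEPS = (
    "disable replica allocation",
    "synced flush",
    "stop elasticsearch",
    "install package {version}",
    "start elasticsearch",
    "wait for node to rejoin",
    "re-enable allocation",
    "wait for green",
)


def rolling_plan(nodes, version, masters=()):
    """Yield (node, step) pairs, one node at a time, master-eligible nodes last."""
    ordered = [n for n in nodes if n not in masters]
    ordered += [n for n in nodes if n in masters]
    for node in ordered:
        for step in STEPS:
            yield node, step.format(version=version)


# Placeholder node names for illustration only.
for node, step in rolling_plan(["master-1", "data-1", "data-2"], "6.8.23", masters=["master-1"]):
    print(f"{node}: {step}")
```

The plan is deliberately just data. Wire each step to your own automation (Ansible, SSH, whatever you already use) and stop on the first failure.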

---

## Phase 3: Upgrade 6.8 to 7.17

This is the most significant upgrade - many breaking changes occur here.

### Breaking Changes in 7.0

#### 1. Mapping Types Removed

ES 7.x deprecates mapping types and makes the APIs typeless: `_doc` becomes the only endpoint name, and types are removed completely in 8.0.

**6.x:**
```json
PUT my_index
{
  &quot;mappings&quot;: {
    &quot;my_type&quot;: {
      &quot;properties&quot;: {
        &quot;title&quot;: { &quot;type&quot;: &quot;text&quot; }
      }
    }
  }
}

PUT my_index/my_type/1
{
  &quot;title&quot;: &quot;Hello&quot;
}
```

**7.x:**
```json
PUT my_index
{
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;title&quot;: { &quot;type&quot;: &quot;text&quot; }
    }
  }
}

PUT my_index/_doc/1
{
  &quot;title&quot;: &quot;Hello&quot;
}
```

Indices created in 6.x with custom types will still work in 7.x (compatibility mode), but you should plan to migrate them.
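
If you keep index definitions in source control, the conversion is mechanical. Here is a minimal Python sketch that rewrites a 6.x create-index body into the typeless form. `strip_mapping_type` is a name I made up, and it assumes the single-type layout shown above.

```python
def strip_mapping_type(body):
    """Convert a 6.x create-index body with one custom type to the typeless form."""
    mappings = body.get("mappings", {})
    # Already typeless: "properties" sits directly under "mappings".
    if not mappings or "properties" in mappings:
        return dict(body)
    if len(mappings) != 1:
        raise ValueError("expected exactly one mapping type")
    type_body = next(iter(mappings.values()))
    converted = dict(body)
    converted["mappings"] = type_body
    return converted


old = {"mappings": {"my_type": {"properties": {"title": {"type": "text"}}}}}
print(strip_mapping_type(old))
# {'mappings': {'properties': {'title': {'type': 'text'}}}}
```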

#### 2. Discovery Configuration Changed

The `discovery.zen.*` settings are deprecated in 7.x and removed in 8.0. Use the new settings:

**6.x (old):**
```yaml
discovery.zen.ping.unicast.hosts: [&quot;host1&quot;, &quot;host2&quot;]
discovery.zen.minimum_master_nodes: 2
```

**7.x (new):**
```yaml
discovery.seed_hosts: [&quot;host1&quot;, &quot;host2&quot;]
cluster.initial_master_nodes: [&quot;node-1&quot;, &quot;node-2&quot;, &quot;node-3&quot;]
```

**Important:** `cluster.initial_master_nodes` is only needed for the first cluster formation. Remove it after the cluster is running.
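
For a fleet of nodes, the rename is worth scripting. This rough Python sketch works on settings already loaded into a flat dict. It only renames the unicast hosts key and reports every other `discovery.zen.*` key for manual review rather than guessing at a replacement. `migrate_discovery` is a hypothetical helper.

```python
def migrate_discovery(settings):
    """Rewrite 6.x zen discovery keys to their 7.x names.

    Returns (migrated_settings, keys_needing_review).
    """
    migrated = {}
    review = []
    for key, value in settings.items():
        if key == "discovery.zen.ping.unicast.hosts":
            migrated["discovery.seed_hosts"] = value
        elif key.startswith("discovery.zen."):
            # e.g. minimum_master_nodes: 7.x manages the quorum itself.
            review.append(key)
        else:
            migrated[key] = value
    return migrated, review


old = {
    "cluster.name": "mycluster",
    "discovery.zen.ping.unicast.hosts": ["host1", "host2"],
    "discovery.zen.minimum_master_nodes": 2,
}
new, review = migrate_discovery(old)
print(new)
print(review)  # ['discovery.zen.minimum_master_nodes']
```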

#### 3. Default Shards Changed

- Primary shards default changed from 5 to 1
- This only affects new indices

#### 4. Lucene 8 Upgrade

ES 7 uses Lucene 8, which brings:
- Better query performance
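- Faster top-k queries (block-max WAND), which is why `hits.total` is only exact up to 10,000 unless you set `track_total_hits`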
- Some queries may behave differently

#### 5. Java 11+ Required

ES 7.x requires Java 11. ES 6.x could run on Java 8.

### The 6.8 → 7.17 Upgrade Process

```bash
# Ensure you&apos;re on 6.8.x latest
curl -X GET &quot;localhost:9200/&quot;

# Run Upgrade Assistant one more time
# Fix any remaining deprecation warnings

# Take another snapshot
curl -X PUT &quot;localhost:9200/_snapshot/migration_backup/pre-7x-snapshot?wait_for_completion=true&quot;

# For each node (rolling upgrade):

# 1. Disable shard allocation
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;persistent&quot;: {
    &quot;cluster.routing.allocation.enable&quot;: &quot;primaries&quot;
  }
}&apos;

# 2. Stop indexing and flush
curl -X POST &quot;localhost:9200/_flush/synced&quot;

# 3. Stop ES
sudo systemctl stop elasticsearch

# 4. Update configuration (elasticsearch.yml)
# - Replace discovery.zen.* with discovery.seed_hosts
# - Add cluster.initial_master_nodes (first time only)
# - Remove any deprecated settings flagged by Upgrade Assistant

# 5. Install 7.17
sudo apt-get install elasticsearch=7.17.18

# 6. Start ES
sudo systemctl start elasticsearch

# 7. Verify node joined
curl -X GET &quot;localhost:9200/_cat/nodes?v&quot;

# 8. Re-enable allocation
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;persistent&quot;: {
    &quot;cluster.routing.allocation.enable&quot;: null
  }
}&apos;

# 9. Wait for green
curl -X GET &quot;localhost:9200/_cluster/health?wait_for_status=green&amp;timeout=5m&quot;

# Proceed to next node
```

After all nodes are upgraded:

```bash
# Remove cluster.initial_master_nodes from elasticsearch.yml
# It&apos;s only needed for initial cluster bootstrap

# Verify cluster
curl -X GET &quot;localhost:9200/_cluster/health?pretty&quot;
curl -X GET &quot;localhost:9200/_cat/indices?v&quot;
```

### Upgrade Kibana 6.8 → 7.17

Kibana must run the same version as Elasticsearch, so upgrade it only once every Elasticsearch node is on 7.17.

```bash
# Stop Kibana
sudo systemctl stop kibana

# Update kibana.yml if needed
# - elasticsearch.url is now elasticsearch.hosts

# Install 7.17
sudo apt-get install kibana=7.17.18

# Start Kibana
sudo systemctl start kibana

# Kibana will migrate saved objects automatically
# Check logs for migration status
sudo journalctl -u kibana -f
```

### Upgrade Logstash 6.8 → 7.17

```bash
# Stop Logstash
sudo systemctl stop logstash

# Review pipeline configurations
# - Update any deprecated plugin options
# - document_type is no longer needed

# Install 7.17
sudo apt-get install logstash=7.17.18

# Start Logstash
sudo systemctl start logstash
```

---

## Phase 4: Upgrade 7.17 to 8.x

The 7.x to 8.x upgrade is significant because **security is enabled by default** in ES 8.

### Breaking Changes in 8.0

#### 1. Security Enabled by Default

ES 8 enables security automatically:
- TLS for HTTP and transport layers
- Built-in users (elastic, kibana_system, etc.)
- API key authentication

If you weren&apos;t using security before, this is a major change.

#### 2. Discovery Settings Finalized

`cluster.initial_master_nodes` is deprecated for clusters that have already formed. Remove it.

#### 3. Many Deprecated Settings Removed

All settings deprecated in 7.x are removed in 8.0:
- `discovery.zen.*` - completely removed
- `node.max_local_storage_nodes` - removed
- `http.tcp_no_delay` - use `http.tcp.no_delay`
- Many more
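
Before the 8.x upgrade, it is worth grepping every node&apos;s config for these. Here is a minimal Python sketch that scans `elasticsearch.yml` text for the three settings named above. The list is deliberately incomplete, and the scan assumes flat dotted keys rather than nested YAML. `find_removed_settings` is my own name.

```python
# Only the settings named in this post: extend from the Upgrade Assistant output.
REMOVED_IN_8 = ("discovery.zen.", "node.max_local_storage_nodes", "http.tcp_no_delay")


def find_removed_settings(yaml_text):
    """Return (line_number, key) pairs for settings that 8.0 no longer accepts."""
    hits = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            continue
        key = stripped.split(":", 1)[0].strip()
        if key.startswith(REMOVED_IN_8):
            hits.append((number, key))
    return hits


config = """cluster.name: mycluster
# discovery.zen.ping_timeout: 3s
discovery.zen.minimum_master_nodes: 2
http.tcp_no_delay: true
"""
print(find_removed_settings(config))
# [(3, 'discovery.zen.minimum_master_nodes'), (4, 'http.tcp_no_delay')]
```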

#### 4. Java 17+ Recommended

ES 8 bundles its own JDK, but if you provide your own, use Java 17+.

#### 5. REST API Changes

- The `_type` path element is removed
- Content-Type header is always required
- Some query DSL changes

### Preparing for Security

If your 7.x cluster didn&apos;t have security enabled, you need to prepare:

**Option A: Enable Security on 7.x First (Recommended)**

```yaml
# elasticsearch.yml on 7.17
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.http.ssl.enabled: true

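# Shell commands, run from the Elasticsearch home directory (not YAML):
# Generate certificates
#   bin/elasticsearch-certutil ca
#   bin/elasticsearch-certutil cert --ca elastic-stack-ca.p12
# Then point the xpack.security.*.ssl keystore and truststore paths at the generated .p12 files

# Set passwords
#   bin/elasticsearch-setup-passwords interactive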
```

**Option B: Let ES 8 Configure Security Automatically**

When you start ES 8 for the first time, it will:
- Generate certificates
- Create the elastic superuser password
- Configure TLS

But this requires cluster downtime and reconfiguration of all clients.

### The 7.17 → 8.x Upgrade Process

```bash
# Take a snapshot
curl -X PUT &quot;localhost:9200/_snapshot/migration_backup/pre-8x-snapshot?wait_for_completion=true&quot;

# Upgrade each node (rolling upgrade):

# 1. Disable allocation
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;persistent&quot;: {
    &quot;cluster.routing.allocation.enable&quot;: &quot;primaries&quot;
  }
}&apos;

# 2. Flush
curl -X POST &quot;localhost:9200/_flush&quot;

# 3. Stop ES
sudo systemctl stop elasticsearch

# 4. Update elasticsearch.yml
# - Remove cluster.initial_master_nodes
# - Remove any deprecated settings
# - Configure security settings

# 5. Install ES 8
sudo apt-get install elasticsearch=8.12.0

# 6. Start ES
sudo systemctl start elasticsearch

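# On a fresh 8.x install, note:
# - Auto-generated password for &apos;elastic&apos; user
# - Enrollment tokens for other nodes
# Nodes upgraded in place with existing data skip this auto-configuration
# and keep the security settings from elasticsearch.yml.
# Check: /var/log/elasticsearch/elasticsearch.log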

# 7. Re-enable allocation (now with auth)
curl -X PUT &quot;https://localhost:9200/_cluster/settings&quot; \
  -u elastic:YOUR_PASSWORD \
  --cacert /etc/elasticsearch/certs/http_ca.crt \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;persistent&quot;: {
    &quot;cluster.routing.allocation.enable&quot;: null
  }
}&apos;

# 8. Wait for green
curl -X GET &quot;https://localhost:9200/_cluster/health?wait_for_status=green&quot; \
  -u elastic:YOUR_PASSWORD \
  --cacert /etc/elasticsearch/certs/http_ca.crt
```

### Upgrade Kibana to 8.x

```bash
# Stop Kibana
sudo systemctl stop kibana

# Update kibana.yml
# - elasticsearch.hosts with https://
# - elasticsearch.username: kibana_system
# - elasticsearch.password: [generated password]
# - elasticsearch.ssl.certificateAuthorities

# Install Kibana 8
sudo apt-get install kibana=8.12.0

# Reset kibana_system password
curl -X POST &quot;https://localhost:9200/_security/user/kibana_system/_password&quot; \
  -u elastic:YOUR_PASSWORD \
  --cacert /etc/elasticsearch/certs/http_ca.crt \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;password&quot;: &quot;your_new_kibana_password&quot;
}&apos;

# Start Kibana
sudo systemctl start kibana
```

### Upgrade Logstash to 8.x

Update your Logstash output configurations for HTTPS and authentication:

```ruby
output {
  elasticsearch {
    hosts =&gt; [&quot;https://es-node1:9200&quot;, &quot;https://es-node2:9200&quot;]
    user =&gt; &quot;logstash_writer&quot;
    password =&gt; &quot;your_password&quot;
    ssl_certificate_authorities =&gt; &quot;/etc/logstash/certs/http_ca.crt&quot;
  }
}
```

Create a dedicated Logstash user:

```bash
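# The logstash_writer role is not built in: create it first with
# POST _security/role/logstash_writer, granting write access to your log indices.
curl -X POST &quot;https://localhost:9200/_security/user/logstash_writer&quot; \
  -u elastic:YOUR_PASSWORD \
  --cacert /etc/elasticsearch/certs/http_ca.crt \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;password&quot;: &quot;your_password&quot;,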
  &quot;roles&quot;: [&quot;logstash_writer&quot;],
  &quot;full_name&quot;: &quot;Logstash Writer&quot;
}&apos;
```

---

## Alternative: Zero-Downtime Migration with Cluster Expansion

For production clusters where you can&apos;t afford downtime, use the &quot;expand then contract&quot; method:

### The Concept

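Instead of in-place upgrades, you:
1. Create new nodes on the next major version (7.17 first)
2. Join them to the existing cluster temporarily
3. Migrate data via shard reallocation
4. Remove old nodes

A cluster can only mix nodes from adjacent major versions, so one round takes you from 6.8 to 7.17. For 7.17 → 8.x, you&apos;d do a second round with new ES 8 nodes. Shards only move forward: once a shard lands on a newer node, it cannot be relocated back to an older one.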

### Step-by-Step

```bash
# 1. Configure new ES 7 nodes to join existing ES 6.8 cluster
# In new node&apos;s elasticsearch.yml:
#   cluster.name: mycluster
#   discovery.seed_hosts: [&quot;old-node1&quot;, &quot;old-node2&quot;, &quot;old-node3&quot;]

# 2. Start new nodes - they join the cluster
# Verify
curl -X GET &quot;localhost:9200/_cat/nodes?v&quot;
# Should show both old and new nodes

# 3. Disable rebalancing temporarily
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;transient&quot;: {
    &quot;cluster.routing.rebalance.enable&quot;: &quot;none&quot;
  }
}&apos;

# 4. Set migration rate limits (based on your benchmark)
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;transient&quot;: {
    &quot;cluster.routing.allocation.node_concurrent_recoveries&quot;: 10,
    &quot;indices.recovery.max_bytes_per_sec&quot;: &quot;100mb&quot;
  }
}&apos;

# 5. Exclude old nodes one by one (starts shard migration)
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;transient&quot;: {
    &quot;cluster.routing.allocation.exclude._name&quot;: &quot;old-node1&quot;
  }
}&apos;

# 6. Wait for shards to migrate off the node
watch -n 5 &apos;curl -s localhost:9200/_cat/shards | grep -w old-node1 | wc -l&apos;
# Wait until count reaches 0

# 7. Shut down old-node1
# Repeat steps 5-7 for each old node

# 8. Reset cluster settings
curl -X PUT &quot;localhost:9200/_cluster/settings&quot; -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;transient&quot;: {
    &quot;cluster.routing.allocation.exclude._name&quot;: null,
    &quot;cluster.routing.rebalance.enable&quot;: null,
    &quot;cluster.routing.allocation.node_concurrent_recoveries&quot;: null,
    &quot;indices.recovery.max_bytes_per_sec&quot;: null
  }
}&apos;
```
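
The drain check in step 6 is the part most worth automating, because shutting a node down too early costs you replicas. Here is a minimal Python sketch that counts shards on a node from raw `_cat/shards` text. It assumes the default column order (node name in the eighth column) and node names without spaces. `shards_on_node` is a hypothetical helper.

```python
def shards_on_node(cat_shards_output, node_name):
    """Count shard rows whose node column is exactly node_name."""
    count = 0
    for line in cat_shards_output.splitlines():
        columns = line.split()
        # Default columns: index shard prirep state docs store ip node
        # Unassigned rows are shorter, so the slice comes back empty.
        if columns[7:8] == [node_name]:
            count += 1
    return count


output = """logs-2026.01 0 p STARTED 1200 1.1mb 10.0.0.1 old-node1
logs-2026.01 0 r STARTED 1200 1.1mb 10.0.0.5 new-node1
logs-2026.02 0 p STARTED 900 800kb 10.0.0.9 old-node10
logs-2026.02 0 r UNASSIGNED
"""
print(shards_on_node(output, "old-node1"))  # 1
```

The match is on the whole column, so `old-node1` does not also count shards on `old-node10`.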

---

## Index Template Migration

Composable index templates, introduced in 7.8, are the standard in ES 8. Migrate your legacy templates:

### Legacy Template (6.x/7.x style)

```json
PUT _template/logs_template
{
  &quot;index_patterns&quot;: [&quot;logs-*&quot;],
  &quot;settings&quot;: {
    &quot;number_of_shards&quot;: 3
  },
  &quot;mappings&quot;: {
    &quot;properties&quot;: {
      &quot;@timestamp&quot;: { &quot;type&quot;: &quot;date&quot; },
      &quot;message&quot;: { &quot;type&quot;: &quot;text&quot; }
    }
  }
}
```

### Composable Template (8.x style)

```json
# Component template for settings
PUT _component_template/logs_settings
{
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;number_of_shards&quot;: 3
    }
  }
}

# Component template for mappings
PUT _component_template/logs_mappings
{
  &quot;template&quot;: {
    &quot;mappings&quot;: {
      &quot;properties&quot;: {
        &quot;@timestamp&quot;: { &quot;type&quot;: &quot;date&quot; },
        &quot;message&quot;: { &quot;type&quot;: &quot;text&quot; }
      }
    }
  }
}

# Composable index template
PUT _index_template/logs_template
{
  &quot;index_patterns&quot;: [&quot;logs-*&quot;],
  &quot;composed_of&quot;: [&quot;logs_settings&quot;, &quot;logs_mappings&quot;],
  &quot;priority&quot;: 200
}
```

Legacy templates still work in ES 8 but are deprecated.
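
With more than a handful of templates, I&apos;d script the split. This rough Python sketch turns one legacy template body into the component templates and the index template shown above. The naming scheme (`name_settings`, `name_mappings`) and the default priority of 200 are assumptions carried over from the example, not a convention Elasticsearch enforces.

```python
def to_composable(name, legacy, priority=200):
    """Split a legacy template body into component templates plus an index template."""
    components = {}
    for section in ("settings", "mappings", "aliases"):
        if section in legacy:
            components[f"{name}_{section}"] = {"template": {section: legacy[section]}}
    index_template = {
        "index_patterns": list(legacy["index_patterns"]),
        "composed_of": list(components),
        "priority": priority,
    }
    return components, index_template


legacy = {
    "index_patterns": ["logs-*"],
    "settings": {"number_of_shards": 3},
    "mappings": {"properties": {"message": {"type": "text"}}},
}
components, index_template = to_composable("logs", legacy)
print(sorted(components))             # ['logs_mappings', 'logs_settings']
print(index_template["composed_of"])  # ['logs_settings', 'logs_mappings']
```

The legacy `order` field is dropped on purpose. Composable templates use `priority` instead, so set it explicitly for each template rather than copying the old number across.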

---

## Post-Migration Verification

After completing the migration:

```bash
# 1. Verify cluster health
curl -X GET &quot;https://localhost:9200/_cluster/health?pretty&quot; -u elastic:password --cacert ca.crt

# 2. Check all nodes
curl -X GET &quot;https://localhost:9200/_cat/nodes?v&quot; -u elastic:password --cacert ca.crt

# 3. List any indices that are not green
curl -s -X GET &quot;https://localhost:9200/_cat/indices?v&quot; -u elastic:password --cacert ca.crt | grep -v &quot;^green&quot;

# 4. Test searches
curl -X GET &quot;https://localhost:9200/your-index/_search?size=1&quot; -u elastic:password --cacert ca.crt

# 5. Verify Kibana dashboards work

# 6. Verify Logstash is ingesting
curl -X GET &quot;https://localhost:9200/_cat/indices?v&amp;s=index:desc&quot; -u elastic:password --cacert ca.crt | head -10
```
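
The health check is easy to turn into a pass/fail gate. Here is a minimal Python sketch that works on the parsed `_cluster/health` response. The field names are the standard ones from that API. `health_problems` and the expected node count are placeholders of mine.

```python
def health_problems(health, expected_nodes):
    """Return a list of human-readable problems; an empty list means the check passed."""
    problems = []
    if health.get("status") != "green":
        problems.append(f"status is {health.get('status')}")
    if health.get("number_of_nodes") != expected_nodes:
        problems.append(
            f"expected {expected_nodes} nodes, found {health.get('number_of_nodes')}"
        )
    for field in ("unassigned_shards", "initializing_shards", "relocating_shards"):
        if health.get(field, 0) != 0:
            problems.append(f"{field} = {health[field]}")
    return problems


health = {
    "status": "yellow",
    "number_of_nodes": 14,
    "unassigned_shards": 6,
    "initializing_shards": 0,
    "relocating_shards": 0,
}
for problem in health_problems(health, expected_nodes=15):
    print(problem)
```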

---

## Rollback Plan

If something goes wrong:

### Rollback from 7.x to 6.8

```bash
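# In-place downgrades are not supported: once a node has started on 7.x,
# 6.8 can no longer read its data directory.
# 1. Stop all 7.x nodes
# 2. Reinstall the 6.8 packages
sudo apt-get install elasticsearch=6.8.23
# 3. Restore elasticsearch.yml from backup
# 4. Start the cluster with empty data directories
# 5. Restore the pre-7x snapshot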
```

### Rollback from 8.x to 7.17

```bash
# In-place downgrades are not supported: restoring the pre-8x snapshot is the rollback.
# 1. Stop all 8.x nodes
# 2. Reinstall the 7.17 packages
sudo apt-get install elasticsearch=7.17.18
# 3. Restore elasticsearch.yml (especially security settings)
# 4. If security was newly enabled in 8, disable it or restore 7.x certs
# 5. Start the cluster with empty data directories
# 6. Restore the pre-8x snapshot
```

**Critical:** You cannot restore an 8.x snapshot to a 7.x cluster. Always keep 7.x snapshots until you&apos;re confident in the 8.x cluster.

---

## Timeline Estimate

For a 10-node cluster with 30TB of data:

| Phase | Duration |
|-------|----------|
| Preparation &amp; Assessment | 2-4 hours |
| Backup | 2-6 hours (depends on data size) |
| Upgrade to 6.8 (rolling) | 2-3 hours |
| Upgrade to 7.17 (rolling) | 3-4 hours |
| Upgrade to 8.x (rolling) | 3-4 hours |
| Kibana/Logstash upgrades | 1-2 hours |
| Verification | 2-3 hours |
| **Total** | **15-26 hours** |

For zero-downtime cluster expansion method, add 4-8 hours for shard migration per major version.

---

## Checklist

```markdown
## Pre-Migration
- [ ] Document current cluster state
- [ ] Check for 5.x indices (must reindex)
- [ ] Run Upgrade Assistant, fix all critical issues
- [ ] Backup all data (snapshot)
- [ ] Backup all config files
- [ ] Export Kibana saved objects
- [ ] Test upgrade process in non-prod environment

## 6.8 Upgrade
- [ ] Upgrade to latest 6.8.x
- [ ] Verify cluster health
- [ ] Re-run Upgrade Assistant

## 7.x Upgrade
- [ ] Update elasticsearch.yml (discovery settings)
- [ ] Rolling upgrade all nodes
- [ ] Upgrade Kibana
- [ ] Upgrade Logstash
- [ ] Verify cluster health
- [ ] Remove cluster.initial_master_nodes

## 8.x Upgrade
- [ ] Plan security configuration
- [ ] Update elasticsearch.yml (remove deprecated settings)
- [ ] Rolling upgrade all nodes
- [ ] Note auto-generated passwords
- [ ] Update Kibana configuration for HTTPS/auth
- [ ] Upgrade Kibana
- [ ] Update Logstash outputs for HTTPS/auth
- [ ] Upgrade Logstash
- [ ] Create service accounts for integrations
- [ ] Verify all dashboards and pipelines work

## Post-Migration
- [ ] Verify cluster health is green
- [ ] Verify all indices accessible
- [ ] Verify Kibana dashboards work
- [ ] Verify data ingestion working
- [ ] Migrate legacy index templates
- [ ] Update documentation
- [ ] Remove old snapshots (after grace period)
```

---

## Key Takeaways

1. **You must upgrade through 7.x** - no direct 6→8 path exists
2. **Reindex 5.x indices before upgrading to 7.x** - they won&apos;t be readable
3. **Security is on by default in ES 8** - plan for HTTPS and authentication
4. **Take snapshots before each major upgrade** - your rollback lifeline
5. **Use the Upgrade Assistant** - it catches issues you&apos;ll miss
6. **Test in non-prod first** - always
7. **Rolling upgrades minimize downtime** - but require patience
8. **Update all clients** - Kibana, Logstash, Beats, application code

The ELK 6 to 8 migration is a significant undertaking, but with proper planning and methodical execution, it&apos;s entirely manageable. Take your time, verify at each step, and keep those backups handy.

---

*Questions or war stories from your own ELK migrations? Find me on [LinkedIn](https://linkedin.com/in/moabukar) or [GitHub](https://github.com/moabukar).*</content:encoded><category>elasticsearch</category><category>elk</category><category>kibana</category><category>logstash</category><category>migration</category><category>observability</category><author>Mo Abukar</author></item><item><title>Platform Engineering in 2026 - It&apos;s About the Discipline, Not the Tools</title><link>https://moabukar.co.uk/blog/platform-engineering-2026/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/platform-engineering-2026/</guid><description>Platform engineering has become the most misunderstood role in tech. Everyone&apos;s building &apos;platforms&apos; but few understand what actually makes one successful. Here&apos;s what I&apos;ve learned building platforms for teams of 10 to 500.</description><pubDate>Tue, 03 Feb 2026 00:00:00 GMT</pubDate><content:encoded>Everyone&apos;s hiring platform engineers now. Job postings are everywhere. But talk to most of them and you&apos;ll hear the same story: they&apos;re building Kubernetes clusters, setting up Terraform modules, and wondering why developers still complain about shipping speed.

That&apos;s because we&apos;ve confused the tools with the discipline.

Platform engineering isn&apos;t about Kubernetes. It isn&apos;t about Backstage. It isn&apos;t about whatever shiny internal developer portal someone&apos;s pitching this week. Those are implementation details. The discipline is something else entirely.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/backstage.svg&quot; alt=&quot;Backstage logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## What Platform Engineering Actually Is

Platform engineering is product management for infrastructure.

Read that again. It&apos;s not &quot;DevOps but with a platform team.&quot; It&apos;s not &quot;SRE but we build things.&quot; It&apos;s treating your internal infrastructure as a product, with your developers as customers.

This means:
- You do user research (talking to developers about their pain points)
- You prioritise features (not everything gets built)
- You measure success (adoption, satisfaction, time-to-deploy)
- You iterate based on feedback (not based on what&apos;s technically interesting)

Most platform teams skip all of this. They build what they think is cool, ship it, and then wonder why adoption is 20%.

## The Golden Path Misconception

Everyone talks about &quot;golden paths&quot; now. The idea is simple: provide a paved road that makes the right thing the easy thing. Developers follow the path, they get security, observability, and compliance for free.

Sounds great in theory. In practice, most golden paths fail because they&apos;re actually golden cages.

The difference is autonomy. A golden path says &quot;here&apos;s a great way to deploy a service, use it if you want.&quot; A golden cage says &quot;here&apos;s the only way to deploy a service, deal with it.&quot;

The moment your platform feels like a cage, developers will find workarounds. They&apos;ll deploy to that one account you don&apos;t control. They&apos;ll spin up that VM that&apos;s &quot;just for testing.&quot; They&apos;ll do whatever it takes to ship, because shipping is their job.

The best platforms I&apos;ve seen follow an 80/20 rule: 80% of use cases should be trivially easy with the golden path. The remaining 20% should still be possible, just with less hand-holding.

## Why Most Platform Teams Fail

I&apos;ve watched platform teams fail in three predictable ways.

**Failure mode 1: Building for yourself**

Platform engineers are usually senior. They&apos;ve seen things. They have opinions about the &quot;right&quot; way to do infrastructure. So they build platforms that would&apos;ve solved their problems from five years ago.

But your developers aren&apos;t you. They don&apos;t care about your elegant Terraform abstraction. They want to ship a feature by Friday.

The fix: Talk to your users. Not once at the start of the project. Continuously. Weekly user interviews should be non-negotiable.

**Failure mode 2: Over-engineering**

A team of four platform engineers does not need to build a multi-cluster, multi-region, active-active Kubernetes platform on day one. They need to solve the problems they actually have.

I&apos;ve seen platform teams spend 18 months building infrastructure that would be appropriate for Netflix, for a company with 30 developers. By the time they shipped, half the engineering org had quit from frustration.

The fix: Start embarrassingly simple. Single cluster. Single region. Manual processes where automation doesn&apos;t pay off yet. Iterate based on real pain points.

**Failure mode 3: No product ownership**

Platform teams without product ownership build features. Platform teams with product ownership build outcomes.

Features: &quot;We shipped a service mesh.&quot;
Outcomes: &quot;Developers can now do canary deployments in 2 clicks instead of 2 days.&quot;

If you can&apos;t articulate the outcome, you probably shouldn&apos;t build the feature.

## What Good Looks Like

The best platform teams I&apos;ve worked with share some characteristics.

**They measure developer experience.**

Not just uptime. Not just deployment frequency. Actual developer satisfaction. How long does it take a new engineer to ship their first change? How many support tickets does the platform team get per week? Would developers recommend the platform to a colleague?

These are soft metrics, but they&apos;re the ones that matter.

**They have strong opinions, weakly held.**

Good platforms are opinionated. They make choices for you. But good platform teams know when to bend. If three different teams need the same escape hatch, that&apos;s not an edge case - that&apos;s a missing feature.

**They deprecate ruthlessly.**

Every platform accumulates cruft. Old deployment methods. Legacy clusters. That one custom solution from 2019. The best teams deprecate aggressively, with clear timelines and migration support. The worst teams let options proliferate until nobody knows what to use.

**They write documentation like it&apos;s code.**

Because for developers, docs are the interface. If your platform requires a 45-minute walkthrough to use, your platform has a bug. The fix isn&apos;t better training - it&apos;s a simpler platform.

## The Technology Is the Easy Part

Let me tell you a secret: the technology choices barely matter.

Kubernetes vs ECS vs Lambda? Doesn&apos;t matter. ArgoCD vs Flux vs whatever? Doesn&apos;t matter. Backstage vs Port vs custom? Doesn&apos;t matter.

What matters is whether developers can ship with confidence. Whether they trust the platform. Whether using it feels like an acceleration, not a tax.

I&apos;ve seen teams build great platforms on &quot;boring&quot; tech stacks. I&apos;ve seen teams build unusable platforms on cutting-edge infrastructure. The technology is not the differentiator.

The differentiator is whether you&apos;re solving real problems, getting feedback, and iterating. That&apos;s it. That&apos;s the whole discipline.

## Where Platform Engineering Goes From Here

Platform engineering is maturing as a discipline. Here&apos;s what I think the next few years look like.

**AI-assisted development will change everything.** When AI can scaffold entire services, the platform becomes the guardrails. Your golden path becomes less about templates and more about policies, security boundaries, and compliance automation.

**Developer experience will become a competitive advantage.** Companies will compete for talent based on how fast developers can ship. Platform quality directly impacts recruiting and retention.

**Platform teams will get smaller, not bigger.** Better tooling means fewer people can do more. The teams that survive will be the ones that can do more with less.

**The &quot;platform&quot; will become invisible.** The end state isn&apos;t developers loving your platform. It&apos;s developers not thinking about infrastructure at all. They push code, it runs, it scales, it&apos;s secure. Magic.

## Getting Started

If you&apos;re building a platform team from scratch, here&apos;s what I&apos;d do:

1. **Interview 10 developers this week.** Ask them what&apos;s painful. Write it down. Don&apos;t argue.

2. **Identify the one thing that would make the biggest difference.** Not three things. One thing.

3. **Build the simplest possible solution.** Ship it. Get feedback.

4. **Iterate.** Repeat forever.

That&apos;s it. No complex framework. No multi-year roadmap. Just solve problems, get feedback, iterate.

Platform engineering isn&apos;t about building the perfect infrastructure. It&apos;s about continuously making developers&apos; lives better. The teams that understand this build great platforms. The teams that don&apos;t build elaborate systems that nobody uses.

Which one are you building?</content:encoded><category>platform-engineering</category><category>devops</category><category>developer-experience</category><category>internal-platforms</category><category>idp</category><author>Mo Abukar</author></item><item><title>Implementing Vertical Autoscaling for Aurora Databases Using Lambda Functions</title><link>https://moabukar.co.uk/blog/vertical-scaling-aurora/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/vertical-scaling-aurora/</guid><description>AWS doesn&apos;t offer vertical autoscaling for Aurora – so we built it. CloudWatch Alarms, SNS, Lambda coordination, and the gotchas we hit in production.</description><pubDate>Mon, 02 Feb 2026 00:00:00 GMT</pubDate><content:encoded># Implementing Vertical Autoscaling for Aurora Databases Using Lambda Functions

AWS provides horizontal scaling for Aurora out of the box – add read replicas, distribute load, done. Vertical scaling? You&apos;re on your own. Aurora PostgreSQL supports a single writer instance, so when that writer needs more horsepower, you can&apos;t just throw more nodes at it.

This guide covers a production-ready implementation of vertical autoscaling for Aurora using Lambda functions, CloudWatch Alarms, SNS, and RDS Event Subscriptions. The approach minimises downtime through coordinated reader-first scaling and automated failover.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/aws.svg&quot; alt=&quot;AWS logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                         Aurora Vertical Autoscaling                         │
└─────────────────────────────────────────────────────────────────────────────┘

┌──────────────┐     ┌──────────────┐     ┌──────────────────────────────────┐
│  CloudWatch  │────▶│     SNS      │────▶│        Alarm Lambda              │
│    Alarm     │     │    Topic     │     │  • Validates cluster state       │
│ (CPU &gt; 80%)  │     │              │     │  • Tags instance as &apos;modifying&apos;  │
└──────────────┘     └──────────────┘     │  • Initiates modify-db-instance  │
                                          └──────────────────────────────────┘
                                                         │
                                                         ▼
                                          ┌──────────────────────────────────┐
                                          │      Aurora Cluster              │
                                          │  ┌────────┐  ┌────────┐          │
                                          │  │ Writer │  │ Reader │          │
                                          │  │db.r6g. │  │db.r6g. │          │
                                          │  │xlarge  │  │xlarge  │ ◀─ Scale │
                                          │  └────────┘  └────────┘          │
                                          └──────────────────────────────────┘
                                                         │
                                                         │ RDS Event
                                                         │ (RDS-EVENT-0014)
                                                         ▼
┌──────────────────────────────────────────────────────────────────────────────┐
│                        RDS Event Subscription                                │
│                    (filters: modification complete)                          │
└──────────────────────────────────────────────────────────────────────────────┘
                                                         │
                                                         ▼
                                          ┌──────────────────────────────────┐
                                          │        Event Lambda              │
                                          │  • Removes &apos;modifying&apos; tag       │
                                          │  • Checks for size parity        │
                                          │  • Scales next smallest instance │
                                          │  • Triggers failover if needed   │
                                          └──────────────────────────────────┘
```

## Prerequisites

- Aurora PostgreSQL or MySQL cluster with at least one reader
- IAM permissions for Lambda to modify RDS instances and manage tags
- SNS topic for alarm notifications
- Terraform (or CloudFormation if you must)

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/vertical-scaling-aurora](https://github.com/moabukar/blog-code/tree/main/vertical-scaling-aurora)

## Repository Structure

```
aurora-vertical-autoscaling/
├── terraform/
│   ├── main.tf
│   ├── variables.tf
│   ├── outputs.tf
│   ├── lambda.tf
│   ├── cloudwatch.tf
│   ├── sns.tf
│   ├── iam.tf
│   └── rds_events.tf
├── lambda/
│   ├── alarm_handler/
│   │   ├── handler.py
│   │   └── requirements.txt
│   └── event_handler/
│       ├── handler.py
│       └── requirements.txt
├── scripts/
│   └── package_lambda.sh
└── README.md
```

## IAM Configuration

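The Lambda functions need granular RDS permissions. Avoid `rds:*` – specify exactly what&apos;s required. Note that the instance ARN pattern assumes instance identifiers start with the cluster identifier, and that `data.aws_caller_identity.current` is declared elsewhere (for example in `main.tf`).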

```hcl
# terraform/iam.tf

data &quot;aws_iam_policy_document&quot; &quot;lambda_assume_role&quot; {
  statement {
    effect = &quot;Allow&quot;
    principals {
      type        = &quot;Service&quot;
      identifiers = [&quot;lambda.amazonaws.com&quot;]
    }
    actions = [&quot;sts:AssumeRole&quot;]
  }
}

resource &quot;aws_iam_role&quot; &quot;aurora_autoscaler&quot; {
  name               = &quot;aurora-vertical-autoscaler-${var.environment}&quot;
  assume_role_policy = data.aws_iam_policy_document.lambda_assume_role.json
}

data &quot;aws_iam_policy_document&quot; &quot;aurora_autoscaler&quot; {
  # RDS describe permissions
  statement {
    effect = &quot;Allow&quot;
    actions = [
      &quot;rds:DescribeDBClusters&quot;,
      &quot;rds:DescribeDBInstances&quot;,
      &quot;rds:ListTagsForResource&quot;
    ]
    resources = [&quot;*&quot;]
  }

  # RDS modify permissions – scoped to specific cluster
  statement {
    effect = &quot;Allow&quot;
    actions = [
      &quot;rds:ModifyDBInstance&quot;,
      &quot;rds:FailoverDBCluster&quot;,
      &quot;rds:AddTagsToResource&quot;,
      &quot;rds:RemoveTagsFromResource&quot;
    ]
    resources = [
      &quot;arn:aws:rds:${var.region}:${data.aws_caller_identity.current.account_id}:cluster:${var.cluster_identifier}&quot;,
      &quot;arn:aws:rds:${var.region}:${data.aws_caller_identity.current.account_id}:db:${var.cluster_identifier}-*&quot;
    ]
  }

  # CloudWatch Logs
  statement {
    effect = &quot;Allow&quot;
    actions = [
      &quot;logs:CreateLogGroup&quot;,
      &quot;logs:CreateLogStream&quot;,
      &quot;logs:PutLogEvents&quot;
    ]
    resources = [&quot;arn:aws:logs:${var.region}:${data.aws_caller_identity.current.account_id}:*&quot;]
  }

  # SNS publish for notifications
  statement {
    effect    = &quot;Allow&quot;
    actions   = [&quot;sns:Publish&quot;]
    resources = [aws_sns_topic.scaling_notifications.arn]
  }
}

resource &quot;aws_iam_role_policy&quot; &quot;aurora_autoscaler&quot; {
  name   = &quot;aurora-autoscaler-policy&quot;
  role   = aws_iam_role.aurora_autoscaler.id
  policy = data.aws_iam_policy_document.aurora_autoscaler.json
}
```
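
The `db:${var.cluster_identifier}-*` resource pattern only covers instances whose identifiers start with the cluster identifier. As a quick sanity check, the sketch below approximates IAM wildcard matching with Python `fnmatch` against a few made-up names. The region, account ID, and instance names are hypothetical, and `fnmatch` is only a stand-in for how IAM evaluates `*`.

```python
from fnmatch import fnmatchcase

# Hypothetical values standing in for var.region, the account ID and var.cluster_identifier
REGION = "eu-west-2"
ACCOUNT_ID = "123456789012"
CLUSTER_ID = "orders-prod"

INSTANCE_PATTERN = f"arn:aws:rds:{REGION}:{ACCOUNT_ID}:db:{CLUSTER_ID}-*"


def covered_by_policy(instance_id):
    """Return True if the instance ARN matches the scoped resource pattern."""
    arn = f"arn:aws:rds:{REGION}:{ACCOUNT_ID}:db:{instance_id}"
    return fnmatchcase(arn, INSTANCE_PATTERN)


if __name__ == "__main__":
    for name in ("orders-prod-1", "orders-prod-reader-2", "orders-reader-legacy"):
        print(name, covered_by_policy(name))
```

An instance such as `orders-reader-legacy` would fall outside the policy, so the modify call would be denied for it. If your naming differs, list the instance ARNs explicitly instead of relying on the prefix.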

## CloudWatch Alarm Configuration

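CPU utilisation is the trigger here. You could substitute any CloudWatch metric – `DatabaseConnections`, `FreeableMemory`, `ReadIOPS`, etc. For metrics where low values signal pressure, such as `FreeableMemory`, switch `comparison_operator` to `LessThanThreshold`.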

```hcl
# terraform/cloudwatch.tf

resource &quot;aws_cloudwatch_metric_alarm&quot; &quot;aurora_cpu_high&quot; {
  alarm_name          = &quot;aurora-${var.cluster_identifier}-cpu-high&quot;
  comparison_operator = &quot;GreaterThanThreshold&quot;
  evaluation_periods  = 3
  metric_name         = &quot;CPUUtilization&quot;
  namespace           = &quot;AWS/RDS&quot;
  period              = 60
  statistic           = &quot;Average&quot;
  threshold           = var.cpu_threshold  # Default: 80
  alarm_description   = &quot;CPU utilisation exceeded ${var.cpu_threshold}% for 3 consecutive minutes&quot;

  dimensions = {
    DBClusterIdentifier = var.cluster_identifier
  }

  alarm_actions = [aws_sns_topic.scaling_trigger.arn]
  ok_actions    = []  # Optional: notify when alarm clears

  treat_missing_data = &quot;notBreaching&quot;

  tags = var.tags
}
```

**Why 3 evaluation periods?** Single spikes shouldn&apos;t trigger scaling. Sustained load over 3 minutes indicates genuine capacity pressure. Adjust based on your workload characteristics.
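
To make the evaluation window concrete, here is a rough sketch of the rule the alarm above applies: it fires only when the last three one-minute averages are all above the threshold. The function name and sample datapoints are illustrative, and the real CloudWatch evaluation has more nuance (missing data handling, for instance).

```python
def would_alarm(datapoints, threshold=80.0, evaluation_periods=3):
    """Return True when the most recent `evaluation_periods` datapoints all exceed the threshold."""
    recent = datapoints[-evaluation_periods:]
    if len(recent) != evaluation_periods:
        return False  # not enough data yet, mirrors treat_missing_data = notBreaching
    return all(value > threshold for value in recent)


if __name__ == "__main__":
    spike = [40.0, 95.0, 42.0, 45.0]       # one hot minute, then back to normal
    sustained = [60.0, 85.0, 91.0, 88.0]   # three hot minutes in a row
    print("spike:", would_alarm(spike))
    print("sustained:", would_alarm(sustained))
```

A single spike never trips the alarm, while three consecutive breaching minutes do. Raising `evaluation_periods` trades slower reaction for fewer false triggers.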

## SNS Topics

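Two topics: one for triggering the alarm Lambda, one for operational notifications. A third topic, which carries RDS events to the Event Lambda, is defined with the event subscription in the next section.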

```hcl
# terraform/sns.tf

resource &quot;aws_sns_topic&quot; &quot;scaling_trigger&quot; {
  name = &quot;aurora-scaling-trigger-${var.environment}&quot;
}

resource &quot;aws_sns_topic&quot; &quot;scaling_notifications&quot; {
  name = &quot;aurora-scaling-notifications-${var.environment}&quot;
}

resource &quot;aws_sns_topic_subscription&quot; &quot;alarm_lambda&quot; {
  topic_arn = aws_sns_topic.scaling_trigger.arn
  protocol  = &quot;lambda&quot;
  endpoint  = aws_lambda_function.alarm_handler.arn
}

# Optional: email notifications for scaling events
resource &quot;aws_sns_topic_subscription&quot; &quot;email&quot; {
  count     = var.notification_email != &quot;&quot; ? 1 : 0
  topic_arn = aws_sns_topic.scaling_notifications.arn
  protocol  = &quot;email&quot;
  endpoint  = var.notification_email
}
```

## RDS Event Subscription

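This triggers the Event Lambda when an instance modification completes. The `configuration change` category also covers other instance changes, so the Lambda filters for the completion event itself. `data.aws_rds_cluster.target` is assumed to be declared elsewhere (for example in `main.tf`).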

```hcl
# terraform/rds_events.tf

resource &quot;aws_db_event_subscription&quot; &quot;modification_complete&quot; {
  name      = &quot;aurora-modification-complete-${var.environment}&quot;
  sns_topic = aws_sns_topic.event_trigger.arn

  source_type = &quot;db-instance&quot;
  source_ids  = data.aws_rds_cluster.target.cluster_members

  event_categories = [&quot;configuration change&quot;]

  tags = var.tags
}

resource &quot;aws_sns_topic&quot; &quot;event_trigger&quot; {
  name = &quot;aurora-event-trigger-${var.environment}&quot;
}

resource &quot;aws_sns_topic_subscription&quot; &quot;event_lambda&quot; {
  topic_arn = aws_sns_topic.event_trigger.arn
  protocol  = &quot;lambda&quot;
  endpoint  = aws_lambda_function.event_handler.arn
}
```
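
For orientation, the SNS message that RDS publishes for an instance event is a JSON document with space-separated keys such as `Source ID` and `Event Message`. The payload below is an illustrative sample, not a captured one: the instance name and timestamp are made up, the URL form of `Event ID` is an assumption, and the exact message wording should be checked against real events in your account.

```python
import json

# Illustrative payload only; every value here is invented for the example
SAMPLE_RDS_EVENT = json.dumps({
    "Event Source": "db-instance",
    "Event Time": "2026-02-02 10:15:00.000",
    "Source ID": "orders-prod-reader-1",
    "Event ID": "http://docs.amazonwebservices.com/AmazonRDS/latest/UserGuide/USER_Events.html#RDS-EVENT-0014",
    "Event Message": "Finished applying modification to DB instance class",
})


def parse_rds_event(sns_event):
    """Pull the instance ID, event code and message out of an SNS-wrapped RDS event."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    event_id = message.get("Event ID", "")
    return {
        "instance_id": message.get("Source ID"),
        "event_code": event_id.rsplit("#", 1)[-1],  # text after the final '#'
        "event_message": message.get("Event Message", ""),
    }


if __name__ == "__main__":
    wrapped = {"Records": [{"Sns": {"Message": SAMPLE_RDS_EVENT}}]}
    print(parse_rds_event(wrapped))
```

Logging the raw message from a test modification is the quickest way to confirm which fields and wording your account actually receives.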

## Lambda Functions

### Alarm Handler

This function receives the CloudWatch Alarm, validates cluster state, and initiates scaling.

```python
# lambda/alarm_handler/handler.py

import boto3
import json
import os
import random
from datetime import datetime, timezone, timedelta
from typing import Optional

rds = boto3.client(&apos;rds&apos;)
sns = boto3.client(&apos;sns&apos;)

# Instance size ordering for comparison
INSTANCE_SIZE_ORDER = {
    &apos;small&apos;: 1, &apos;medium&apos;: 2, &apos;large&apos;: 3, &apos;xlarge&apos;: 4,
    &apos;2xlarge&apos;: 5, &apos;4xlarge&apos;: 6, &apos;8xlarge&apos;: 7, &apos;12xlarge&apos;: 8,
    &apos;16xlarge&apos;: 9, &apos;24xlarge&apos;: 10
}

# Allowed instance families for scaling (configure per cluster)
ALLOWED_FAMILIES = os.environ.get(&apos;ALLOWED_FAMILIES&apos;, &apos;db.r6g,db.r7g&apos;).split(&apos;,&apos;)
MAX_INSTANCE_CLASS = os.environ.get(&apos;MAX_INSTANCE_CLASS&apos;, &apos;db.r6g.4xlarge&apos;)
COOLDOWN_MINUTES = int(os.environ.get(&apos;COOLDOWN_MINUTES&apos;, &apos;15&apos;))
NOTIFICATION_TOPIC = os.environ[&apos;NOTIFICATION_TOPIC_ARN&apos;]


def handler(event, context):
    &quot;&quot;&quot;
    Handles CloudWatch Alarm via SNS.
    Validates cluster state and initiates vertical scaling if conditions are met.
    &quot;&quot;&quot;
    try:
        # Parse SNS message
        sns_message = json.loads(event[&apos;Records&apos;][0][&apos;Sns&apos;][&apos;Message&apos;])
        alarm_name = sns_message.get(&apos;AlarmName&apos;, &apos;&apos;)
        
        # Extract cluster identifier from alarm dimensions
        cluster_id = extract_cluster_id(sns_message)
        if not cluster_id:
            return {&apos;statusCode&apos;: 400, &apos;body&apos;: &apos;Could not determine cluster ID&apos;}
        
        print(f&quot;Processing alarm {alarm_name!r} for cluster: {cluster_id}&quot;)
        
        # Get cluster details
        cluster = get_cluster_details(cluster_id)
        if not cluster:
            return {&apos;statusCode&apos;: 404, &apos;body&apos;: f&apos;Cluster {cluster_id} not found&apos;}
        
        # Validation checks
        validation_result = validate_cluster_state(cluster)
        if not validation_result[&apos;can_scale&apos;]:
            print(f&quot;Scaling blocked: {validation_result[&apos;reason&apos;]}&quot;)
            return {&apos;statusCode&apos;: 200, &apos;body&apos;: validation_result[&apos;reason&apos;]}
        
        # Execute scaling
        result = execute_scaling(cluster)
        
        # Send notification
        notify(result)
        
        return {&apos;statusCode&apos;: 200, &apos;body&apos;: json.dumps(result)}
        
    except Exception as e:
        print(f&quot;Error: {str(e)}&quot;)
        notify({&apos;status&apos;: &apos;error&apos;, &apos;message&apos;: str(e)})
        raise


def extract_cluster_id(alarm_message: dict) -&gt; Optional[str]:
    &quot;&quot;&quot;Extract cluster identifier from alarm dimensions.&quot;&quot;&quot;
    trigger = alarm_message.get(&apos;Trigger&apos;, {})
    dimensions = trigger.get(&apos;Dimensions&apos;, [])
    
    for dim in dimensions:
        if dim.get(&apos;name&apos;) == &apos;DBClusterIdentifier&apos;:
            return dim.get(&apos;value&apos;)
    return None


def get_cluster_details(cluster_id: str) -&gt; Optional[dict]:
    &quot;&quot;&quot;Fetch cluster and instance details from RDS.&quot;&quot;&quot;
    try:
        cluster_resp = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
        cluster = cluster_resp[&apos;DBClusters&apos;][0]
        
        # Get instance details
        instances = []
        for member in cluster[&apos;DBClusterMembers&apos;]:
            instance_resp = rds.describe_db_instances(
                DBInstanceIdentifier=member[&apos;DBInstanceIdentifier&apos;]
            )
            instance = instance_resp[&apos;DBInstances&apos;][0]
            
            # Get tags
            tags_resp = rds.list_tags_for_resource(
                ResourceName=instance[&apos;DBInstanceArn&apos;]
            )
            instance[&apos;Tags&apos;] = {t[&apos;Key&apos;]: t[&apos;Value&apos;] for t in tags_resp[&apos;TagList&apos;]}
            instance[&apos;IsWriter&apos;] = member[&apos;IsClusterWriter&apos;]
            instances.append(instance)
        
        cluster[&apos;Instances&apos;] = instances
        return cluster
        
    except rds.exceptions.DBClusterNotFoundFault:
        return None


def validate_cluster_state(cluster: dict) -&gt; dict:
    &quot;&quot;&quot;
    Check if scaling is permitted:
    1. No instances currently being modified
    2. No instances tagged as &apos;modifying&apos;
    3. Cooldown period has elapsed
    &quot;&quot;&quot;
    instances = cluster[&apos;Instances&apos;]
    
    # Check for active modifications
    for instance in instances:
        if instance[&apos;DBInstanceStatus&apos;] != &apos;available&apos;:
            return {
                &apos;can_scale&apos;: False,
                &apos;reason&apos;: f&quot;Instance {instance[&apos;DBInstanceIdentifier&apos;]} is {instance[&apos;DBInstanceStatus&apos;]}&quot;
            }
    
    # Check for modifying tag
    for instance in instances:
        if instance[&apos;Tags&apos;].get(&apos;aurora-autoscaler-modifying&apos;) == &apos;true&apos;:
            return {
                &apos;can_scale&apos;: False,
                &apos;reason&apos;: f&quot;Instance {instance[&apos;DBInstanceIdentifier&apos;]} has modifying tag&quot;
            }
    
    # Check cooldown period
    latest_modification = get_latest_modification_timestamp(instances)
    if latest_modification:
        cooldown_end = latest_modification + timedelta(minutes=COOLDOWN_MINUTES)
        if datetime.now(timezone.utc) &lt; cooldown_end:
            return {
                &apos;can_scale&apos;: False,
                &apos;reason&apos;: f&quot;Cooldown period active until {cooldown_end.isoformat()}&quot;
            }
    
    return {&apos;can_scale&apos;: True, &apos;reason&apos;: None}


def get_latest_modification_timestamp(instances: list) -&gt; Optional[datetime]:
    &quot;&quot;&quot;Get the most recent modification timestamp from instance tags.&quot;&quot;&quot;
    timestamps = []
    for instance in instances:
        ts_str = instance[&apos;Tags&apos;].get(&apos;aurora-autoscaler-modification-timestamp&apos;)
        if ts_str:
            try:
                timestamps.append(datetime.fromisoformat(ts_str.replace(&apos;Z&apos;, &apos;+00:00&apos;)))
            except ValueError:
                pass
    return max(timestamps) if timestamps else None


def execute_scaling(cluster: dict) -&gt; dict:
    &quot;&quot;&quot;
    Scaling algorithm:
    1. Find smallest reader instances
    2. Scale one reader to match largest instance
    3. If all instances same size, scale to next tier
    4. If writer is smallest, scale writer (triggers failover)
    &quot;&quot;&quot;
    instances = cluster[&apos;Instances&apos;]
    readers = [i for i in instances if not i[&apos;IsWriter&apos;]]
    writer = next(i for i in instances if i[&apos;IsWriter&apos;])
    
    # Parse instance classes
    for instance in instances:
        instance[&apos;_parsed&apos;] = parse_instance_class(instance[&apos;DBInstanceClass&apos;])
    
    # Sort by size
    instances_by_size = sorted(instances, key=lambda x: get_size_rank(x[&apos;_parsed&apos;]))
    smallest_size = get_size_rank(instances_by_size[0][&apos;_parsed&apos;])
    largest_size = get_size_rank(instances_by_size[-1][&apos;_parsed&apos;])
    
    # Check if at maximum
    max_parsed = parse_instance_class(MAX_INSTANCE_CLASS)
    if smallest_size &gt;= get_size_rank(max_parsed):
        return notify_max_reached(cluster[&apos;DBClusterIdentifier&apos;])
    
    # Determine target instance and size
    if smallest_size &lt; largest_size:
        # Scale smallest to match largest
        target_class = instances_by_size[-1][&apos;DBInstanceClass&apos;]
        smallest_readers = [r for r in readers if get_size_rank(r[&apos;_parsed&apos;]) == smallest_size]
        
        if smallest_readers:
            target_instance = random.choice(smallest_readers)
        else:
            # Writer is smallest – scale it
            target_instance = writer
    else:
        # All same size – scale to next tier
        target_class = get_next_instance_class(instances_by_size[0][&apos;DBInstanceClass&apos;])
        if not target_class:
            return notify_max_reached(cluster[&apos;DBClusterIdentifier&apos;])
        
        if readers:
            target_instance = random.choice(readers)
        else:
            target_instance = writer
    
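    # Tag and modify; roll the tag back if the modify call fails so a
    # stale &apos;modifying&apos; tag cannot block future scaling
    tag_instance_as_modifying(target_instance[&apos;DBInstanceArn&apos;])
    
    try:
        rds.modify_db_instance(
            DBInstanceIdentifier=target_instance[&apos;DBInstanceIdentifier&apos;],
            DBInstanceClass=target_class,
            ApplyImmediately=True
        )
    except Exception:
        rds.remove_tags_from_resource(
            ResourceName=target_instance[&apos;DBInstanceArn&apos;],
            TagKeys=[&apos;aurora-autoscaler-modifying&apos;]
        )
        raise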
    
    return {
        &apos;status&apos;: &apos;scaling_initiated&apos;,
        &apos;cluster&apos;: cluster[&apos;DBClusterIdentifier&apos;],
        &apos;instance&apos;: target_instance[&apos;DBInstanceIdentifier&apos;],
        &apos;from_class&apos;: target_instance[&apos;DBInstanceClass&apos;],
        &apos;to_class&apos;: target_class
    }


def parse_instance_class(instance_class: str) -&gt; dict:
    &quot;&quot;&quot;Parse db.r6g.xlarge into components.&quot;&quot;&quot;
    parts = instance_class.split(&apos;.&apos;)
    return {
        &apos;prefix&apos;: parts[0],
        &apos;family&apos;: parts[1],
        &apos;size&apos;: parts[2] if len(parts) &gt; 2 else &apos;medium&apos;
    }


def get_size_rank(parsed: dict) -&gt; int:
    &quot;&quot;&quot;Get numeric rank for instance size.&quot;&quot;&quot;
    return INSTANCE_SIZE_ORDER.get(parsed[&apos;size&apos;], 0)


def get_next_instance_class(current_class: str) -&gt; Optional[str]:
    &quot;&quot;&quot;Get the next larger instance class.&quot;&quot;&quot;
    parsed = parse_instance_class(current_class)
    sizes = list(INSTANCE_SIZE_ORDER.keys())
    current_idx = sizes.index(parsed[&apos;size&apos;])
    
    if current_idx &gt;= len(sizes) - 1:
        return None
    
    next_size = sizes[current_idx + 1]
    next_class = f&quot;{parsed[&apos;prefix&apos;]}.{parsed[&apos;family&apos;]}.{next_size}&quot;
    
    # Validate against max
    max_parsed = parse_instance_class(MAX_INSTANCE_CLASS)
    if get_size_rank({&apos;size&apos;: next_size}) &gt; get_size_rank(max_parsed):
        return None
    
    return next_class


def tag_instance_as_modifying(instance_arn: str):
    &quot;&quot;&quot;Tag instance to prevent concurrent modifications.&quot;&quot;&quot;
    rds.add_tags_to_resource(
        ResourceName=instance_arn,
        Tags=[
            {&apos;Key&apos;: &apos;aurora-autoscaler-modifying&apos;, &apos;Value&apos;: &apos;true&apos;},
            {&apos;Key&apos;: &apos;aurora-autoscaler-modification-timestamp&apos;, 
             &apos;Value&apos;: datetime.now(timezone.utc).isoformat()}
        ]
    )


def notify_max_reached(cluster_id: str) -&gt; dict:
    &quot;&quot;&quot;Build the high-priority max-size message; handler() publishes it via notify().&quot;&quot;&quot;
    # Returning the message without publishing avoids a duplicate
    # notification, because handler() already calls notify(result).
    return {
        &apos;status&apos;: &apos;max_size_reached&apos;,
        &apos;cluster&apos;: cluster_id,
        &apos;message&apos;: f&quot;Cluster {cluster_id} has reached maximum instance size {MAX_INSTANCE_CLASS}&quot;,
        &apos;priority&apos;: &apos;high&apos;
    }


def notify(message: dict):
    &quot;&quot;&quot;Send notification to SNS topic.&quot;&quot;&quot;
    sns.publish(
        TopicArn=NOTIFICATION_TOPIC,
        Subject=f&quot;Aurora Autoscaler: {message.get(&apos;status&apos;, &apos;update&apos;)}&quot;,
        Message=json.dumps(message, indent=2)
    )
```
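
To exercise the parsing path without waiting for a real alarm, a test event can be built by hand. The sketch below wraps a minimal alarm message in an SNS envelope and repeats the dimension lookup from `extract_cluster_id`. The alarm and cluster names are hypothetical, and a real alarm message carries many more fields.

```python
import json


def build_alarm_event(cluster_id, alarm_name="aurora-demo-cpu-high"):
    """Wrap a minimal CloudWatch alarm message in the SNS envelope Lambda receives."""
    alarm_message = {
        "AlarmName": alarm_name,
        "NewStateValue": "ALARM",
        "Trigger": {
            "MetricName": "CPUUtilization",
            "Namespace": "AWS/RDS",
            "Dimensions": [{"name": "DBClusterIdentifier", "value": cluster_id}],
        },
    }
    return {"Records": [{"Sns": {"Message": json.dumps(alarm_message)}}]}


def extract_cluster_id(alarm_message):
    """Same lookup as the handler: find the DBClusterIdentifier dimension."""
    for dim in alarm_message.get("Trigger", {}).get("Dimensions", []):
        if dim.get("name") == "DBClusterIdentifier":
            return dim.get("value")
    return None


if __name__ == "__main__":
    event = build_alarm_event("orders-prod")
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    print(extract_cluster_id(message))
```

The same event dictionary can be saved as JSON and used as a test event in the Lambda console, ideally against a non-production cluster.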

### Event Handler

This function processes RDS modification completion events and continues the scaling chain.

```python
# lambda/event_handler/handler.py

import boto3
import json
import os
import random
from datetime import datetime, timezone

rds = boto3.client(&apos;rds&apos;)
sns = boto3.client(&apos;sns&apos;)

NOTIFICATION_TOPIC = os.environ[&apos;NOTIFICATION_TOPIC_ARN&apos;]


def handler(event, context):
    &quot;&quot;&quot;
    Handles RDS Event Subscription notifications (modification complete).
    Removes modifying tag and continues scaling if needed.
    &quot;&quot;&quot;
    try:
        # Parse SNS message from RDS Event Subscription
        sns_message = json.loads(event[&apos;Records&apos;][0][&apos;Sns&apos;][&apos;Message&apos;])
        
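        # RDS events have different structure
        source_id = sns_message.get(&apos;Source ID&apos;)
        event_id = sns_message.get(&apos;Event ID&apos;, &apos;&apos;)
        event_message = sns_message.get(&apos;Event Message&apos;, &apos;&apos;)
        
        # Only process completion events: match the event code shown in the
        # architecture diagram, falling back to the message text
        is_completion = (
            event_id.endswith(&apos;RDS-EVENT-0014&apos;)
            or &apos;has been modified&apos; in event_message.lower()
        )
        if not is_completion:
            print(f&quot;Ignoring event: {event_message}&quot;)
            return {&apos;statusCode&apos;: 200, &apos;body&apos;: &apos;Ignored non-completion event&apos;}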
        
        print(f&quot;Processing completion for instance: {source_id}&quot;)
        
        # Get instance details
        instance_resp = rds.describe_db_instances(DBInstanceIdentifier=source_id)
        instance = instance_resp[&apos;DBInstances&apos;][0]
        cluster_id = instance[&apos;DBClusterIdentifier&apos;]
        
        # Get cluster details
        cluster = get_cluster_details(cluster_id)
        
        # Remove modifying tag
        remove_modifying_tag(instance)
        
        # Check if more scaling needed
        if should_continue_scaling(cluster):
            result = continue_scaling(cluster)
        else:
            result = {
                &apos;status&apos;: &apos;scaling_complete&apos;,
                &apos;cluster&apos;: cluster_id,
                &apos;message&apos;: &apos;All instances are now the same size&apos;
            }
        
        notify(result)
        return {&apos;statusCode&apos;: 200, &apos;body&apos;: json.dumps(result)}
        
    except Exception as e:
        print(f&quot;Error: {str(e)}&quot;)
        notify({&apos;status&apos;: &apos;error&apos;, &apos;message&apos;: str(e)})
        raise


def get_cluster_details(cluster_id: str) -&gt; dict:
    &quot;&quot;&quot;Fetch cluster and instance details.&quot;&quot;&quot;
    cluster_resp = rds.describe_db_clusters(DBClusterIdentifier=cluster_id)
    cluster = cluster_resp[&apos;DBClusters&apos;][0]
    
    instances = []
    for member in cluster[&apos;DBClusterMembers&apos;]:
        instance_resp = rds.describe_db_instances(
            DBInstanceIdentifier=member[&apos;DBInstanceIdentifier&apos;]
        )
        instance = instance_resp[&apos;DBInstances&apos;][0]
        instance[&apos;IsWriter&apos;] = member[&apos;IsClusterWriter&apos;]
        
        tags_resp = rds.list_tags_for_resource(ResourceName=instance[&apos;DBInstanceArn&apos;])
        instance[&apos;Tags&apos;] = {t[&apos;Key&apos;]: t[&apos;Value&apos;] for t in tags_resp[&apos;TagList&apos;]}
        
        instances.append(instance)
    
    cluster[&apos;Instances&apos;] = instances
    return cluster


def remove_modifying_tag(instance: dict):
    &quot;&quot;&quot;Remove the modifying tag from an instance.&quot;&quot;&quot;
    rds.remove_tags_from_resource(
        ResourceName=instance[&apos;DBInstanceArn&apos;],
        TagKeys=[&apos;aurora-autoscaler-modifying&apos;]
    )
    print(f&quot;Removed modifying tag from {instance[&apos;DBInstanceIdentifier&apos;]}&quot;)


def should_continue_scaling(cluster: dict) -&gt; bool:
    &quot;&quot;&quot;Check if instances still need equalisation.&quot;&quot;&quot;
    classes = set(i[&apos;DBInstanceClass&apos;] for i in cluster[&apos;Instances&apos;])
    return len(classes) &gt; 1


def continue_scaling(cluster: dict) -&gt; dict:
    &quot;&quot;&quot;Scale the next smallest reader to match the largest instance.&quot;&quot;&quot;
    instances = cluster[&apos;Instances&apos;]
    readers = [i for i in instances if not i[&apos;IsWriter&apos;]]
    writer = next(i for i in instances if i[&apos;IsWriter&apos;])
    
    # Find smallest and largest
    instances_by_size = sorted(instances, key=lambda x: get_instance_rank(x[&apos;DBInstanceClass&apos;]))
    smallest = instances_by_size[0]
    largest = instances_by_size[-1]
    
    # Prefer readers for scaling
    smallest_class = smallest[&apos;DBInstanceClass&apos;]
    smallest_readers = [r for r in readers if r[&apos;DBInstanceClass&apos;] == smallest_class]
    
    if smallest_readers:
        target = random.choice(smallest_readers)
    else:
        # Writer is smallest
        target = writer
    
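    # Tag and modify; roll the tag back if the modify call fails so a
    # stale &apos;modifying&apos; tag cannot block future scaling
    tag_instance_as_modifying(target[&apos;DBInstanceArn&apos;])
    
    try:
        rds.modify_db_instance(
            DBInstanceIdentifier=target[&apos;DBInstanceIdentifier&apos;],
            DBInstanceClass=largest[&apos;DBInstanceClass&apos;],
            ApplyImmediately=True
        )
    except Exception:
        rds.remove_tags_from_resource(
            ResourceName=target[&apos;DBInstanceArn&apos;],
            TagKeys=[&apos;aurora-autoscaler-modifying&apos;]
        )
        raise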
    
    return {
        &apos;status&apos;: &apos;scaling_continued&apos;,
        &apos;cluster&apos;: cluster[&apos;DBClusterIdentifier&apos;],
        &apos;instance&apos;: target[&apos;DBInstanceIdentifier&apos;],
        &apos;from_class&apos;: target[&apos;DBInstanceClass&apos;],
        &apos;to_class&apos;: largest[&apos;DBInstanceClass&apos;]
    }


def get_instance_rank(instance_class: str) -&gt; int:
    &quot;&quot;&quot;Get numeric rank for sorting.&quot;&quot;&quot;
    size_order = {
        &apos;small&apos;: 1, &apos;medium&apos;: 2, &apos;large&apos;: 3, &apos;xlarge&apos;: 4,
        &apos;2xlarge&apos;: 5, &apos;4xlarge&apos;: 6, &apos;8xlarge&apos;: 7, &apos;12xlarge&apos;: 8,
        &apos;16xlarge&apos;: 9, &apos;24xlarge&apos;: 10
    }
    size = instance_class.split(&apos;.&apos;)[-1]
    return size_order.get(size, 0)


def tag_instance_as_modifying(instance_arn: str):
    &quot;&quot;&quot;Tag instance to prevent concurrent modifications.&quot;&quot;&quot;
    rds.add_tags_to_resource(
        ResourceName=instance_arn,
        Tags=[
            {&apos;Key&apos;: &apos;aurora-autoscaler-modifying&apos;, &apos;Value&apos;: &apos;true&apos;},
            {&apos;Key&apos;: &apos;aurora-autoscaler-modification-timestamp&apos;,
             &apos;Value&apos;: datetime.now(timezone.utc).isoformat()}
        ]
    )


def notify(message: dict):
    &quot;&quot;&quot;Send notification to SNS topic.&quot;&quot;&quot;
    sns.publish(
        TopicArn=NOTIFICATION_TOPIC,
        Subject=f&quot;Aurora Autoscaler: {message.get(&apos;status&apos;, &apos;update&apos;)}&quot;,
        Message=json.dumps(message, indent=2)
    )
```
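
One failure mode worth guarding against: if a completion event is missed, the `aurora-autoscaler-modifying` tag is never removed and `validate_cluster_state` blocks every later alarm. A possible mitigation, sketched here, is to treat the tag as stale once the recorded timestamp is older than some limit. The `STALE_AFTER_MINUTES` value and the `is_modifying_tag_stale` helper are hypothetical and not part of the handlers above.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER_MINUTES = 60  # hypothetical limit, well above the expected modification time


def is_modifying_tag_stale(tags, now=None):
    """Return True if the instance is tagged as modifying but the timestamp is too old or unreadable."""
    if tags.get("aurora-autoscaler-modifying") != "true":
        return False  # not locked at all, so nothing can be stale
    now = now or datetime.now(timezone.utc)
    raw = tags.get("aurora-autoscaler-modification-timestamp", "")
    try:
        started = datetime.fromisoformat(raw.replace("Z", "+00:00"))
    except ValueError:
        return True  # no usable timestamp, so the lock cannot be trusted
    if started.tzinfo is None:
        started = started.replace(tzinfo=timezone.utc)  # assume UTC for naive values
    return now - started > timedelta(minutes=STALE_AFTER_MINUTES)


if __name__ == "__main__":
    example = {
        "aurora-autoscaler-modifying": "true",
        "aurora-autoscaler-modification-timestamp": "2026-02-02T10:00:00+00:00",
    }
    print(is_modifying_tag_stale(example))
```

A validation step could skip instances for which this returns `True`, or clear the tag and send a notification so that someone checks why the completion event never arrived.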

## Lambda Terraform Configuration

```hcl
# terraform/lambda.tf

data &quot;archive_file&quot; &quot;alarm_handler&quot; {
  type        = &quot;zip&quot;
  source_dir  = &quot;${path.module}/../lambda/alarm_handler&quot;
  output_path = &quot;${path.module}/../.build/alarm_handler.zip&quot;
}

data &quot;archive_file&quot; &quot;event_handler&quot; {
  type        = &quot;zip&quot;
  source_dir  = &quot;${path.module}/../lambda/event_handler&quot;
  output_path = &quot;${path.module}/../.build/event_handler.zip&quot;
}

resource &quot;aws_lambda_function&quot; &quot;alarm_handler&quot; {
  function_name    = &quot;aurora-autoscaler-alarm-${var.environment}&quot;
  filename         = data.archive_file.alarm_handler.output_path
  source_code_hash = data.archive_file.alarm_handler.output_base64sha256
  handler          = &quot;handler.handler&quot;
  runtime          = &quot;python3.11&quot;
  timeout          = 30
  memory_size      = 256

  role = aws_iam_role.aurora_autoscaler.arn

  environment {
    variables = {
      NOTIFICATION_TOPIC_ARN = aws_sns_topic.scaling_notifications.arn
      ALLOWED_FAMILIES       = join(&quot;,&quot;, var.allowed_instance_families)
      MAX_INSTANCE_CLASS     = var.max_instance_class
      COOLDOWN_MINUTES       = tostring(var.cooldown_minutes)
    }
  }

  tags = var.tags
}

resource &quot;aws_lambda_function&quot; &quot;event_handler&quot; {
  function_name    = &quot;aurora-autoscaler-event-${var.environment}&quot;
  filename         = data.archive_file.event_handler.output_path
  source_code_hash = data.archive_file.event_handler.output_base64sha256
  handler          = &quot;handler.handler&quot;
  runtime          = &quot;python3.11&quot;
  timeout          = 30
  memory_size      = 256

  role = aws_iam_role.aurora_autoscaler.arn

  environment {
    variables = {
      NOTIFICATION_TOPIC_ARN = aws_sns_topic.scaling_notifications.arn
    }
  }

  tags = var.tags
}

# Lambda permissions for SNS invocation
resource &quot;aws_lambda_permission&quot; &quot;alarm_sns&quot; {
  statement_id  = &quot;AllowSNSInvoke&quot;
  action        = &quot;lambda:InvokeFunction&quot;
  function_name = aws_lambda_function.alarm_handler.function_name
  principal     = &quot;sns.amazonaws.com&quot;
  source_arn    = aws_sns_topic.scaling_trigger.arn
}

resource &quot;aws_lambda_permission&quot; &quot;event_sns&quot; {
  statement_id  = &quot;AllowSNSInvoke&quot;
  action        = &quot;lambda:InvokeFunction&quot;
  function_name = aws_lambda_function.event_handler.function_name
  principal     = &quot;sns.amazonaws.com&quot;
  source_arn    = aws_sns_topic.event_trigger.arn
}
```

## Variables

```hcl
# terraform/variables.tf

variable &quot;environment&quot; {
  type        = string
  description = &quot;Environment name (dev, staging, prod)&quot;
}

variable &quot;region&quot; {
  type        = string
  description = &quot;AWS region&quot;
}

variable &quot;cluster_identifier&quot; {
  type        = string
  description = &quot;Aurora cluster identifier&quot;
}

variable &quot;cpu_threshold&quot; {
  type        = number
  default     = 80
  description = &quot;CPU utilisation percentage to trigger scaling&quot;
}

variable &quot;cooldown_minutes&quot; {
  type        = number
  default     = 15
  description = &quot;Minutes to wait between scaling operations&quot;
}

variable &quot;allowed_instance_families&quot; {
  type        = list(string)
  default     = [&quot;db.r6g&quot;, &quot;db.r7g&quot;]
  description = &quot;Allowed instance families for scaling&quot;
}

variable &quot;max_instance_class&quot; {
  type        = string
  default     = &quot;db.r6g.4xlarge&quot;
  description = &quot;Maximum instance class to scale to&quot;
}

variable &quot;notification_email&quot; {
  type        = string
  default     = &quot;&quot;
  description = &quot;Email address for scaling notifications&quot;
}

variable &quot;tags&quot; {
  type        = map(string)
  default     = {}
  description = &quot;Tags to apply to resources&quot;
}
```

## Scaling Behaviour Summary

| Scenario | Action |
|----------|--------|
| CPU alarm fires, all instances same size | Scale random reader to next tier |
| CPU alarm fires, instances different sizes | Scale smallest reader to match largest |
| Writer is smallest instance | Scale writer (triggers automatic failover) |
| All instances at max size | High-priority notification, no scaling |
| Instance modification in progress | Skip, wait for completion |
| Within cooldown period | Skip, wait for cooldown |

## Downtime Characteristics

**Reader scaling**: Zero downtime. Reader becomes unavailable briefly during modification (~2–5 minutes depending on size), but connections route to other readers.

**Writer scaling**: Requires failover. When the writer needs scaling:
1. A reader is scaled first
2. Failover promotes the scaled reader to writer (~10–30 seconds)
3. Original writer (now reader) is scaled

With RDS Proxy in front of the cluster, observed downtime drops to 1–3 seconds for the failover.
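
The writer-scaling sequence above can be sketched with the AWS CLI. This is an illustrative outline only, not the Lambda from this post: the function name `scale_writer_via_failover` and its arguments are hypothetical, and it assumes the chosen reader is a valid failover target.

```bash
# Sketch: scale the writer by scaling a reader first, then failing over.
# Hypothetical helper; all identifiers are supplied by the caller.
scale_writer_via_failover() {
  local cluster="$1" reader="$2" writer="$3" target_class="$4"

  # 1. Scale the reader, then wait for it to become available again.
  #    Caveat: the status can take a moment to leave "available", so a
  #    production version should confirm the modification actually started.
  aws rds modify-db-instance \
    --db-instance-identifier "$reader" \
    --db-instance-class "$target_class" \
    --apply-immediately || return 1
  aws rds wait db-instance-available --db-instance-identifier "$reader" || return 1

  # 2. Promote the scaled reader to writer.
  aws rds failover-db-cluster \
    --db-cluster-identifier "$cluster" \
    --target-db-instance-identifier "$reader" || return 1

  # 3. Scale the original writer, which is now a reader.
  aws rds modify-db-instance \
    --db-instance-identifier "$writer" \
    --db-instance-class "$target_class" \
    --apply-immediately
}

# Usage:
#   scale_writer_via_failover my-cluster my-reader-1 my-writer db.r6g.2xlarge
```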

## Trade-offs

**Pros**:
- No external dependencies beyond AWS services
- Automatic coordination prevents concurrent modifications
- Scales readers first to minimise writer disruption
- Configurable cooldown prevents thrashing

**Cons**:
- No downscaling – once scaled up, instances stay large
- RDS modification times can be unpredictable (5–15 minutes)
- Failover still causes brief connection drops
- CloudWatch Alarm delays add latency to scaling response

## Gotchas

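1. **RDS Proxy connection limits**: If using RDS Proxy, check that the proxy&apos;s connection pool limit (`MaxConnectionsPercent`, a percentage of the target&apos;s `max_connections`) still suits the scaled instance. The percentage itself doesn&apos;t auto-adjust.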

2. **Parameter groups**: Scaling to a different instance family might require a compatible parameter group. Aurora usually handles this, but verify memory-related parameters.

3. **Reserved instances**: Scaling to larger instances may exceed your reserved instance coverage. Monitor RI utilisation.

4. **Multi-AZ considerations**: Ensure your VPC subnets in each AZ can accommodate the larger instance types.

5. **Performance Insights**: Scaling resets Performance Insights history. Export metrics before scaling if you need them.

## Observability

Add CloudWatch dashboards and alerts:

```hcl
resource &quot;aws_cloudwatch_dashboard&quot; &quot;aurora_scaling&quot; {
  dashboard_name = &quot;aurora-autoscaling-${var.cluster_identifier}&quot;

  dashboard_body = jsonencode({
    widgets = [
      {
        type   = &quot;metric&quot;
        width  = 12
        height = 6
        properties = {
          title  = &quot;CPU Utilisation&quot;
          region = var.region
          metrics = [
            [&quot;AWS/RDS&quot;, &quot;CPUUtilization&quot;, &quot;DBClusterIdentifier&quot;, var.cluster_identifier]
          ]
          annotations = {
            horizontal = [{
              value = var.cpu_threshold
              label = &quot;Scaling threshold&quot;
              color = &quot;#ff7f0e&quot;
            }]
          }
        }
      },
      {
        type   = &quot;metric&quot;
        width  = 12
        height = 6
        properties = {
          title  = &quot;Lambda Invocations&quot;
          region = var.region
          metrics = [
            [&quot;AWS/Lambda&quot;, &quot;Invocations&quot;, &quot;FunctionName&quot;, aws_lambda_function.alarm_handler.function_name],
            [&quot;AWS/Lambda&quot;, &quot;Invocations&quot;, &quot;FunctionName&quot;, aws_lambda_function.event_handler.function_name]
          ]
        }
      }
    ]
  })
}
```

## Downscaling (Future Work)

The current implementation only scales up. For FinOps-conscious environments, consider:

1. **Scheduled downscaling**: Lambda triggered by EventBridge schedule during known low-traffic periods
2. **Metric-based downscaling**: Separate alarm for sustained low CPU (&lt;20% for 30+ minutes)
3. **Manual approval gate**: SNS → approval workflow → Lambda execution

Downscaling is riskier – you need to ensure the smaller instance can handle baseline load before committing.
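
As a starting point for the metric-based option, the low-CPU alarm could be created like this. It is a sketch under assumptions: `create_low_cpu_alarm` and the alarm name are hypothetical, and six 5-minute periods below 20% is just one way to express sustained low CPU for 30 minutes.

```bash
# Sketch: create a low-CPU alarm to act as the downscaling trigger.
# Hypothetical helper; the SNS topic ARN is supplied by the caller.
create_low_cpu_alarm() {
  local cluster="$1" topic_arn="$2"
  aws cloudwatch put-metric-alarm \
    --alarm-name "aurora-low-cpu-$cluster" \
    --namespace "AWS/RDS" \
    --metric-name "CPUUtilization" \
    --dimensions "Name=DBClusterIdentifier,Value=$cluster" \
    --statistic Average \
    --period 300 \
    --evaluation-periods 6 \
    --threshold 20 \
    --comparison-operator LessThanThreshold \
    --alarm-actions "$topic_arn"
}
```

The alarm only signals that downscaling might be safe; the handler behind the topic still has to check baseline load before acting.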

## Conclusion

This approach leverages native AWS primitives (CloudWatch, SNS, Lambda, RDS Events) to implement vertical autoscaling without third-party dependencies. The coordination logic using tags and cooldown periods prevents race conditions and thrashing.

For workloads with predictable scaling patterns, consider pairing this reactive approach with proactive scheduled scaling. And if you&apos;re hitting the maximum instance size regularly, it&apos;s time to evaluate Aurora Serverless v2 or architectural changes to reduce write pressure.

Source code coming soon at: [github.com/moabukar/aurora-vertical-autoscaling](https://github.com/moabukar/aurora-vertical-autoscaling)

---</content:encoded><category>aurora</category><category>rds</category><category>aws</category><category>lambda</category><category>autoscaling</category><category>terraform</category><category>serverless</category><author>Mo Abukar</author></item><item><title>Terraform State Surgery - Splitting, Moving, and Refactoring Without Downtime</title><link>https://moabukar.co.uk/blog/terraform-state-surgery/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/terraform-state-surgery/</guid><description>A practical guide to breaking up monolithic Terraform state files, moving resources between states, and refactoring infrastructure safely. Includes real examples, scripts, and the exact commands we use.</description><pubDate>Sun, 01 Feb 2026 00:00:00 GMT</pubDate><content:encoded># Terraform State Surgery - Splitting, Moving, and Refactoring Without Downtime

Your Terraform state file started small. A VPC here, an RDS instance there. Then someone added the EKS cluster. Then the Lambda functions. Then three more environments. Now `terraform plan` takes 15 minutes, and you&apos;re terrified to touch anything because 400 resources might get recreated.

Sound familiar? Time for state surgery.

This post covers the real-world techniques for splitting monolithic state files, moving resources between states, and refactoring your Terraform structure without accidentally destroying production.

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/terraform-state-surgery](https://github.com/moabukar/blog-code/tree/main/terraform-state-surgery)

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/terraform.svg&quot; alt=&quot;Terraform logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## Why Split State?

Before we dive in, let&apos;s be clear about why you&apos;d want to do this:

1. **Plan times** - Large states mean slow plans. A 500-resource state can take 10+ minutes to plan.
2. **Blast radius** - One bad `terraform apply` can affect everything in the state.
3. **Team ownership** - Different teams want to manage their own infrastructure.
4. **Lifecycle differences** - Networking changes rarely; applications change daily.
5. **State locking conflicts** - Multiple engineers blocked waiting for the same state lock.

## The Golden Rules

Before any state manipulation:

```bash
# 1. ALWAYS backup your state first
terraform state pull &gt; state-backup-$(date +%Y%m%d-%H%M%S).json

# 2. ALWAYS run plan after any state change
terraform plan
# Must show: &quot;No changes. Your infrastructure matches the configuration.&quot;

# 3. NEVER delete the backup until you&apos;ve verified everything works
```

If `terraform plan` shows any changes after state manipulation, **stop**. Something went wrong.
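
To make that rule enforceable in scripts, the plan exit code can be checked rather than the output being read by eye. A minimal sketch, assuming Terraform 0.14+ for `-chdir`; `assert_no_changes` is a hypothetical helper name.

```bash
# Sketch: turn the "plan must show no changes" rule into a hard gate.
# terraform plan -detailed-exitcode returns:
#   0 = no changes, 1 = error, 2 = changes present
assert_no_changes() {
  local dir="${1:-.}" rc=0
  terraform -chdir="$dir" plan -detailed-exitcode -input=false || rc=$?
  case "$rc" in
    0) echo "OK: no changes in $dir" ;;
    2) echo "STOP: plan shows changes in $dir"; return 2 ;;
    *) echo "ERROR: terraform plan failed in $dir (exit $rc)"; return 1 ;;
  esac
}

# Usage:
#   assert_no_changes ./networking || exit 1
```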

---

## Technique 1: Moving Resources Between States

The most common scenario: you have a monolithic state and want to extract resources into a new state file.

### The Scenario

You have everything in one state:
```
aws_vpc.main
aws_subnet.private[0]
aws_subnet.private[1]
aws_subnet.public[0]
aws_subnet.public[1]
aws_eks_cluster.main
aws_eks_node_group.workers
aws_db_instance.database
aws_lambda_function.api
```

You want to split into:
- `networking/` - VPC, subnets
- `eks/` - Cluster and node groups
- `database/` - RDS
- `application/` - Lambda functions

### Step 1: Create the New State Structure

```bash
mkdir -p terraform/{networking,eks,database,application}
```

### Step 2: Move the Code

Copy the relevant resource blocks to each new directory. For example, `terraform/networking/main.tf`:

```hcl
# terraform/networking/main.tf

terraform {
  backend &quot;s3&quot; {
    bucket = &quot;my-terraform-state&quot;
    key    = &quot;networking/terraform.tfstate&quot;
    region = &quot;eu-west-1&quot;
  }
}

resource &quot;aws_vpc&quot; &quot;main&quot; {
  cidr_block = &quot;10.0.0.0/16&quot;
  
  tags = {
    Name = &quot;main-vpc&quot;
  }
}

data &quot;aws_availability_zones&quot; &quot;available&quot; {
  state = &quot;available&quot;
}

resource &quot;aws_subnet&quot; &quot;private&quot; {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = &quot;private-${count.index}&quot;
  }
}

resource &quot;aws_subnet&quot; &quot;public&quot; {
  count             = 2
  vpc_id            = aws_vpc.main.id
  cidr_block        = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index + 100)
  availability_zone = data.aws_availability_zones.available.names[count.index]
  
  tags = {
    Name = &quot;public-${count.index}&quot;
  }
}

# Outputs for other states to consume
output &quot;vpc_id&quot; {
  value = aws_vpc.main.id
}

output &quot;private_subnet_ids&quot; {
  value = aws_subnet.private[*].id
}

output &quot;public_subnet_ids&quot; {
  value = aws_subnet.public[*].id
}
```

### Step 3: Import Into the New State

Here&apos;s where most tutorials fail you. They say &quot;just run terraform import.&quot; But with complex resources, you need the exact import IDs.

```bash
cd terraform/networking

# Initialize the new backend
terraform init

# Import each resource
terraform import aws_vpc.main vpc-0abc123def456

terraform import &apos;aws_subnet.private[0]&apos; subnet-0abc123
terraform import &apos;aws_subnet.private[1]&apos; subnet-0def456

terraform import &apos;aws_subnet.public[0]&apos; subnet-0ghi789
terraform import &apos;aws_subnet.public[1]&apos; subnet-0jkl012

# Verify - THIS MUST SHOW NO CHANGES
terraform plan
```
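
If you do not have the IDs to hand, the old state already knows them. The sketch below reads the `id` attribute from `terraform state show` output. `extract_state_id` is a hypothetical helper, and it assumes the import ID equals the top-level `id` attribute, which holds for VPCs and subnets but not for every resource type.

```bash
# Sketch: read a resource id out of `terraform state show` output on stdin.
# Takes the first `id = "..."` line it finds.
extract_state_id() {
  local esc
  esc="$(printf '\033')"
  # Strip ANSI colour codes in case the output is colourised.
  sed "s/${esc}\[[0-9;]*m//g" |
    awk '$1 == "id" { if ($2 == "=") { gsub(/"/, "", $3); print $3; exit } }'
}

# Usage (run from the old monolithic directory):
#   terraform state show 'aws_subnet.private[0]' | extract_state_id
```

For resource types with composite import IDs, check the provider documentation instead of relying on `id`.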

### Step 4: Remove From the Old State

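Only after verifying the import worked, delete the moved resource blocks from the legacy configuration (otherwise the next plan will try to recreate them) and remove them from the old state: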

```bash
cd ../legacy  # Your old monolithic directory

# Remove from old state
terraform state rm aws_vpc.main
terraform state rm &apos;aws_subnet.private[0]&apos;
terraform state rm &apos;aws_subnet.private[1]&apos;
terraform state rm &apos;aws_subnet.public[0]&apos;
terraform state rm &apos;aws_subnet.public[1]&apos;

# Verify old state still works
terraform plan
# Should show no changes (just fewer resources now)
```

### The Script We Use

For large migrations, we script this:

```bash
#!/bin/bash
# state-migration.sh

set -e

OLD_DIR=&quot;./legacy&quot;
NEW_DIR=&quot;./networking&quot;

# Resources to move (format: &quot;resource_address|import_id&quot;)
RESOURCES=(
  &quot;aws_vpc.main|vpc-0abc123def456&quot;
  &quot;aws_subnet.private[0]|subnet-0abc123&quot;
  &quot;aws_subnet.private[1]|subnet-0def456&quot;
  &quot;aws_subnet.public[0]|subnet-0ghi789&quot;
  &quot;aws_subnet.public[1]|subnet-0jkl012&quot;
)

echo &quot;=== Backing up states ===&quot;
cd &quot;$OLD_DIR&quot;
terraform state pull &gt; &quot;../backup-old-$(date +%Y%m%d-%H%M%S).json&quot;
cd ..

cd &quot;$NEW_DIR&quot;
terraform init
terraform state pull &gt; &quot;../backup-new-$(date +%Y%m%d-%H%M%S).json&quot; 2&gt;/dev/null || echo &quot;New state is empty (expected)&quot;
cd ..

echo &quot;=== Importing into new state ===&quot;
cd &quot;$NEW_DIR&quot;
for resource in &quot;${RESOURCES[@]}&quot;; do
  addr=&quot;${resource%%|*}&quot;
  id=&quot;${resource##*|}&quot;
  echo &quot;Importing: $addr ($id)&quot;
  terraform import &quot;$addr&quot; &quot;$id&quot; || { echo &quot;FAILED: $addr&quot;; exit 1; }
done

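echo &quot;=== Verifying new state ===&quot;
# Capture the exit code: with `set -e`, a bare exit code 2 (changes present)
# would abort the script before the check below ever ran.
plan_rc=0
terraform plan -detailed-exitcode || plan_rc=$?
if [ &quot;$plan_rc&quot; -eq 0 ]; then
  echo &quot;✓ New state verified - no changes&quot;
elif [ &quot;$plan_rc&quot; -eq 2 ]; then
  echo &quot;✗ ERROR: Plan shows changes! Aborting.&quot;
  exit 1
else
  echo &quot;✗ ERROR: terraform plan failed (exit $plan_rc). Aborting.&quot;
  exit 1
fi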
cd ..

echo &quot;=== Removing from old state ===&quot;
cd &quot;$OLD_DIR&quot;
for resource in &quot;${RESOURCES[@]}&quot;; do
  addr=&quot;${resource%%|*}&quot;
  echo &quot;Removing: $addr&quot;
  terraform state rm &quot;$addr&quot; || { echo &quot;FAILED to remove: $addr&quot;; exit 1; }
done

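echo &quot;=== Verifying old state ===&quot;
# Capture the exit code: with `set -e`, a bare exit code 2 (changes present)
# would abort the script before the check below ever ran.
plan_rc=0
terraform plan -detailed-exitcode || plan_rc=$?
if [ &quot;$plan_rc&quot; -eq 0 ]; then
  echo &quot;✓ Old state verified - no changes&quot;
elif [ &quot;$plan_rc&quot; -eq 2 ]; then
  echo &quot;✗ ERROR: Plan shows changes! Check manually.&quot;
  exit 1
else
  echo &quot;✗ ERROR: terraform plan failed (exit $plan_rc). Check manually.&quot;
  exit 1
fi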

echo &quot;=== Migration complete ===&quot;
```

---

## Technique 2: Using `terraform state mv`

If you&apos;re reorganizing within the same state (renaming resources, moving into modules), use `terraform state mv`:

### Renaming a Resource

```bash
# Old: aws_instance.web
# New: aws_instance.application

terraform state mv aws_instance.web aws_instance.application
```

### Moving Into a Module

```bash
# Old: aws_instance.web (root module)
# New: module.compute.aws_instance.web

terraform state mv aws_instance.web module.compute.aws_instance.web
```

### Moving Out of a Module

```bash
# Old: module.legacy.aws_instance.web
# New: aws_instance.web (root module)

terraform state mv module.legacy.aws_instance.web aws_instance.web
```

### Bulk Moves

```bash
# Move all resources from one module to another
terraform state mv module.old_network module.network
```
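
Before running a batch of moves for real, `terraform state mv` accepts `-dry-run`, which reports what would be moved without writing state. A small sketch; `preview_moves` is a hypothetical helper, and each argument is an `old_address=new_address` pair (so addresses whose keys contain `=` would need different handling).

```bash
# Sketch: preview a batch of renames with -dry-run before touching state.
preview_moves() {
  local pair src dst
  for pair in "$@"; do
    src="${pair%%=*}"
    dst="${pair#*=}"
    echo "Previewing: $src to $dst"
    terraform state mv -dry-run "$src" "$dst" || return 1
  done
}

# Usage:
#   preview_moves \
#     "aws_instance.web=module.compute.aws_instance.web" \
#     "module.old_network=module.network"
```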

---

## Technique 3: Using `moved` Blocks (Terraform 1.1+)

For refactoring that you want tracked in version control, use `moved` blocks:

```hcl
# This tells Terraform the resource was renamed
moved {
  from = aws_instance.web
  to   = aws_instance.application
}

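# Moving into a module (an alternative to the rename above: a given
# address can appear as `from` in only one moved block)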
moved {
  from = aws_instance.web
  to   = module.compute.aws_instance.web
}

# Renaming a module
moved {
  from = module.old_name
  to   = module.new_name
}
```

Benefits of `moved` blocks:
- Version controlled
- Works across team members
- Self-documenting
- Terraform handles the state update automatically

After adding `moved` blocks, `terraform plan` reports the move rather than a destroy and create:
```bash
terraform plan
# Shows: &quot;Terraform will perform the following actions:&quot;
# aws_instance.web has moved to aws_instance.application

terraform apply
# State is updated, no infrastructure changes
```

**Important:** Keep `moved` blocks for at least one full release cycle, then remove them.

---

## Technique 4: The `import` Block (Terraform 1.5+)

For new states, you can now define imports in config:

```hcl
# imports.tf

import {
  to = aws_vpc.main
  id = &quot;vpc-0abc123def456&quot;
}

import {
  to = aws_subnet.private[0]
  id = &quot;subnet-0abc123&quot;
}

import {
  to = aws_subnet.private[1]
  id = &quot;subnet-0def456&quot;
}
```

Then run:
```bash
terraform plan
# Shows what will be imported

terraform apply
# Imports all resources
```

This is cleaner than CLI imports for large migrations.
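
For a long list, the blocks themselves can be generated. The sketch below prints one `import` block per `address|id` pair, the same pair format used by the RESOURCES array in `state-migration.sh`. `emit_import_blocks` is a hypothetical helper name.

```bash
# Sketch: print one import block per "address|id" pair.
emit_import_blocks() {
  local pair
  for pair in "$@"; do
    printf 'import {\n  to = %s\n  id = "%s"\n}\n\n' "${pair%%|*}" "${pair##*|}"
  done
}

# Usage:
#   emit_import_blocks \
#     "aws_vpc.main|vpc-0abc123def456" \
#     "aws_subnet.private[0]|subnet-0abc123" > imports.tf
```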

---

## Technique 5: Cross-State References with `terraform_remote_state`

After splitting states, you need to reference resources across states:

```hcl
# terraform/eks/main.tf

# Reference the networking state
data &quot;terraform_remote_state&quot; &quot;networking&quot; {
  backend = &quot;s3&quot;
  config = {
    bucket = &quot;my-terraform-state&quot;
    key    = &quot;networking/terraform.tfstate&quot;
    region = &quot;eu-west-1&quot;
  }
}

# Use outputs from networking state
resource &quot;aws_eks_cluster&quot; &quot;main&quot; {
  name     = &quot;main-cluster&quot;
  role_arn = aws_iam_role.eks.arn

  vpc_config {
    subnet_ids = data.terraform_remote_state.networking.outputs.private_subnet_ids
  }
}
```

### Dependency Order

With split states, you need to apply in order:

```bash
# 1. Networking first (no dependencies)
cd networking &amp;&amp; terraform apply

# 2. Database (depends on networking)
cd ../database &amp;&amp; terraform apply

# 3. EKS (depends on networking)
cd ../eks &amp;&amp; terraform apply

# 4. Application (depends on everything)
cd ../application &amp;&amp; terraform apply
```

We encode this in CI/CD with explicit job dependencies.
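
Outside CI, the same ordering can be expressed as a loop that stops at the first failure. A minimal sketch, assuming Terraform 0.14+ for `-chdir`; `apply_in_order` is a hypothetical helper and the directory names come from the split above.

```bash
# Sketch: apply each state directory in dependency order.
# Pass directories in the order they must run; the loop stops on the first failure.
apply_in_order() {
  local dir
  for dir in "$@"; do
    echo "=== Applying: $dir ==="
    terraform -chdir="$dir" init -input=false || return 1
    terraform -chdir="$dir" apply -input=false || return 1
  done
}

# Usage:
#   apply_in_order networking database eks application
```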

---

## Technique 6: The `removed` Block (Terraform 1.7+)

When you want to remove a resource from state without destroying it:

```hcl
# This removes from state but keeps the actual resource
removed {
  from = aws_instance.legacy_server

  lifecycle {
    destroy = false
  }
}
```

Use cases:
- Handing off resources to another team
- Removing resources that will be managed manually
- Migrating to a different IaC tool

---

## Real-World Migration: Monolith to Multi-State

Here&apos;s the actual migration plan we used for a client with 400+ resources:

### Phase 1: Analysis

```bash
# List all resources in current state
terraform state list &gt; all-resources.txt

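# Count by resource type (strip index keys, then take the type segment so
# module-nested and data resources are counted too)
terraform state list | sed -E &apos;s/\[[^]]*\]//g&apos; | awk -F. &apos;{ print $(NF-1) }&apos; | sort | uniq -c | sort -rn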

# Output:
#   45 aws_security_group_rule
#   32 aws_iam_role_policy_attachment
#   28 aws_route53_record
#   15 aws_lambda_function
#   12 aws_s3_bucket
#   ...
```

### Phase 2: Categorization

We grouped resources into logical domains:

```
networking/     - VPC, subnets, route tables, NAT, IGW
security/       - Security groups, NACLs, WAF
iam/            - Roles, policies, users
dns/            - Route53 zones and records
compute/        - EC2, ASG, Launch templates
eks/            - EKS cluster, node groups, add-ons
rds/            - RDS instances, parameter groups
lambda/         - Lambda functions, layers
storage/        - S3 buckets, EFS
monitoring/     - CloudWatch, SNS, alarms
```
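
A first pass at the grouping can be automated from the resource type prefix. This is a sketch only: `classify_resource` and its patterns are illustrative assumptions based on the domains above, and anything it cannot place is reported as `unclassified` for manual review.

```bash
# Sketch: map a resource address to a domain by its type prefix.
classify_resource() {
  local addr="$1" type
  # Strip any leading module path, then keep only the resource type.
  type="$(printf '%s\n' "$addr" | sed -E 's/^(module\.[^.]+\.)*//' | cut -d. -f1)"
  case "$type" in
    aws_vpc|aws_subnet|aws_route|aws_route_table*|aws_nat_gateway|aws_internet_gateway) echo networking ;;
    aws_security_group*|aws_network_acl*|aws_wafv2_*) echo security ;;
    aws_iam_*) echo iam ;;
    aws_route53_*) echo dns ;;
    aws_instance|aws_autoscaling_*|aws_launch_template) echo compute ;;
    aws_eks_*) echo eks ;;
    aws_db_*|aws_rds_*) echo rds ;;
    aws_lambda_*) echo lambda ;;
    aws_s3_*|aws_efs_*) echo storage ;;
    aws_cloudwatch_*|aws_sns_*) echo monitoring ;;
    *) echo unclassified ;;
  esac
}

# Usage:
#   terraform state list | while read -r addr; do
#     printf '%s\t%s\n' "$(classify_resource "$addr")" "$addr"
#   done | sort
```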

### Phase 3: Dependency Mapping

```
networking (0 deps)
    ↓
security (networking)
    ↓
iam (0 deps - can run in parallel with security)
    ↓
dns (networking)
    ↓
rds (networking, security)
    ↓
eks (networking, security, iam)
    ↓
storage (iam)
    ↓
lambda (iam, networking, security, storage)
    ↓
monitoring (everything)
```
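
Once the map is written down as pairs, the standard `tsort` utility can produce a valid order. A sketch: `apply_order` is a hypothetical helper, each pair reads as prerequisite then dependent, and the pairs are transcribed from the map above.

```bash
# Sketch: derive a valid migration order from the dependency map with tsort(1).
apply_order() {
  printf '%s\n' \
    "networking security" \
    "networking dns" \
    "networking rds" "security rds" \
    "networking eks" "security eks" "iam eks" \
    "iam storage" \
    "iam lambda" "networking lambda" "security lambda" "storage lambda" \
    "networking monitoring" "security monitoring" "iam monitoring" \
    "dns monitoring" "rds monitoring" "eks monitoring" \
    "storage monitoring" "lambda monitoring" |
    tsort
}

# Usage:
#   apply_order    # prints one domain per line, prerequisites first
```

More than one order can satisfy the same map, so treat the output as one valid sequence rather than the only one. `tsort` also reports cycles, which is a useful check on the map itself.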

### Phase 4: Migration Script

```bash
#!/bin/bash
# full-migration.sh

# Abort on the first failed command, unset variable, or failed pipeline stage
set -euo pipefail

DOMAINS=&quot;networking security iam dns rds eks storage lambda monitoring&quot;
OLD_STATE=&quot;./legacy&quot;
BACKUP_DIR=&quot;./backups/$(date +%Y%m%d-%H%M%S)&quot;

mkdir -p &quot;$BACKUP_DIR&quot;

# Backup everything first
echo &quot;=== Creating backups ===&quot;
cd &quot;$OLD_STATE&quot;
terraform state pull &gt; &quot;$BACKUP_DIR/legacy.json&quot;
cd ..

for domain in $DOMAINS; do
  if [ -d &quot;$domain&quot; ]; then
    cd &quot;$domain&quot;
    terraform state pull &gt; &quot;$BACKUP_DIR/${domain}.json&quot; 2&gt;/dev/null || echo &quot;$domain is new&quot;
    cd ..
  fi
done

# Migrate each domain
for domain in $DOMAINS; do
  echo &quot;=== Migrating: $domain ===&quot;
  
  if [ -f &quot;migrations/${domain}.sh&quot; ]; then
    bash &quot;migrations/${domain}.sh&quot;
    
    # Verify
    cd &quot;$domain&quot;
    if ! terraform plan -detailed-exitcode; then
      echo &quot;ERROR: $domain verification failed!&quot;
      exit 1
    fi
    cd ..
  else
    echo &quot;No migration script for $domain, skipping&quot;
  fi
done

echo &quot;=== All migrations complete ===&quot;
```

### Phase 5: Verification

After each domain migration:

```bash
# In the new state directory
terraform plan
# Must show: No changes

# In the old state directory  
terraform plan
# Must show: No changes (just fewer resources)

# Verify actual infrastructure
aws ec2 describe-vpcs --vpc-ids vpc-xxx
aws eks describe-cluster --name main-cluster
# ... spot check critical resources
```

---

## Common Gotchas

### 1. Count vs For_Each Index Mismatch

If you&apos;re moving from `count` to `for_each`, the state addresses differ:

```hcl
# count uses numeric index
aws_subnet.private[0]
aws_subnet.private[1]

# for_each uses key
aws_subnet.private[&quot;eu-west-1a&quot;]
aws_subnet.private[&quot;eu-west-1b&quot;]
```

You&apos;ll need individual `moved` blocks:

```hcl
moved {
  from = aws_subnet.private[0]
  to   = aws_subnet.private[&quot;eu-west-1a&quot;]
}

moved {
  from = aws_subnet.private[1]
  to   = aws_subnet.private[&quot;eu-west-1b&quot;]
}
```
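
With more than a handful of instances, writing these by hand gets error-prone. The sketch below prints one `moved` block per index. `emit_moved_blocks` is a hypothetical helper, and it assumes the keys are passed in the same order as the original `count` indexes.

```bash
# Sketch: print one moved block per index, mapping count indexes to for_each keys.
emit_moved_blocks() {
  local resource="$1" index=0 key
  shift
  for key in "$@"; do
    printf 'moved {\n  from = %s[%d]\n  to   = %s["%s"]\n}\n\n' \
      "$resource" "$index" "$resource" "$key"
    index=$((index + 1))
  done
}

# Usage:
#   emit_moved_blocks aws_subnet.private eu-west-1a eu-west-1b >> moves.tf
```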

### 2. Provider Aliases

Terraform picks the provider configuration from the target resource block, so make sure the block in the new directory sets the alias before you import:

```bash
# The resource block must already declare: provider = aws.west
terraform import &apos;aws_instance.west[&quot;web&quot;]&apos; i-0abc123
```

### 3. Sensitive Values in State

When pulling state for backup, sensitive values are included. Secure your backups:

```bash
# Encrypt backup
terraform state pull | gpg --encrypt -r your@email.com &gt; state-backup.json.gpg
```
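
For short-lived working copies that stay unencrypted during the migration, it helps at least to restrict file permissions and to confirm the backup is not empty. A sketch; `backup_state` is a hypothetical helper.

```bash
# Sketch: write a working backup readable only by the current user,
# and fail if the backup came out empty.
backup_state() {
  local out="$1"
  (
    umask 077                      # new files are created with mode 600
    terraform state pull > "$out"
  ) || return 1
  if [ ! -s "$out" ]; then
    echo "ERROR: backup $out is empty"
    return 1
  fi
  echo "Backup written to $out"
}

# Usage:
#   backup_state "state-backup-$(date +%Y%m%d-%H%M%S).json"
```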

### 4. State Locking During Migration

Disable auto-apply in CI/CD during migration. You don&apos;t want automated applies while manipulating state.

```bash
# Force unlock if needed (dangerous - make sure no one else is using it)
terraform force-unlock LOCK_ID
```

### 5. Remote State Data Source Timing

If you split networking from EKS, and EKS references networking via `terraform_remote_state`, you must apply networking first after the split.

---

## The Checklist

```markdown
## Pre-Migration
- [ ] Backup all state files
- [ ] Document current resource count per state
- [ ] Map dependencies between resources
- [ ] Plan the new state structure
- [ ] Disable CI/CD auto-applies
- [ ] Notify team of migration window

## Per-Domain Migration
- [ ] Create new directory structure
- [ ] Copy resource code to new location
- [ ] Add remote_state data sources where needed
- [ ] Add outputs for cross-state references
- [ ] Run terraform init in new directory
- [ ] Import resources into new state
- [ ] Verify: terraform plan shows no changes
- [ ] Remove resources from old state
- [ ] Verify: old state terraform plan shows no changes
- [ ] Commit changes

## Post-Migration
- [ ] Update CI/CD pipelines for new structure
- [ ] Update documentation
- [ ] Re-enable CI/CD auto-applies
- [ ] Delete old monolithic state (after grace period)
- [ ] Archive backup files securely
```

---

## Key Takeaways

1. **Always backup state before any manipulation**
2. **`terraform plan` must show no changes after every state operation**
3. **Use `moved` blocks for version-controlled refactoring**
4. **Use `import` blocks (1.5+) for cleaner bulk imports**
5. **Use `removed` blocks (1.7+) to remove without destroying**
6. **Map dependencies before splitting - apply order matters**
7. **Script large migrations - manual commands are error-prone**
8. **Keep backups until you&apos;re 100% confident the migration worked**

State surgery is scary, but with the right approach it&apos;s routine. Take it slow, verify everything, and you&apos;ll have clean, maintainable Terraform in no time.

---

*Questions? Find me on [LinkedIn](https://linkedin.com/in/moabukar) or [GitHub](https://github.com/moabukar).*</content:encoded><category>terraform</category><category>state</category><category>migration</category><category>refactoring</category><category>iac</category><category>devops</category><author>Mo Abukar</author></item><item><title>Terraform 0.11 to 1.11 Migration - The Full Journey</title><link>https://moabukar.co.uk/blog/terraform-011-to-111-migration/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/terraform-011-to-111-migration/</guid><description>A detailed guide on migrating Terraform from 0.11 to 1.11, covering HCL2 syntax changes, the S3 bucket resource split, state manipulation, and ensuring zero-drift upgrades.</description><pubDate>Fri, 30 Jan 2026 00:00:00 GMT</pubDate><content:encoded># Terraform 0.11 to 1.11 Migration - The Full Journey

Last year I helped a client migrate their Terraform codebase from 0.11 all the way to 1.11. Their infrastructure had been running on 0.11 for years - nobody wanted to touch it because &quot;it works, don&apos;t break it.&quot; Sound familiar?

This post documents the entire journey: the syntax changes, the resource splits, the state surgery, and most importantly - how to verify nothing breaks at each step.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/terraform.svg&quot; alt=&quot;Terraform logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## The Golden Rule

Before we dive in, here&apos;s the rule that guided every step of this migration:

**After each upgrade, `terraform plan` must show no changes.**

If plan shows changes, you&apos;ve broken something. Stop, fix it, then continue. This is non-negotiable.

## The Upgrade Path

You can&apos;t jump directly from 0.11 to 1.11. Terraform versions have breaking changes that require stepping stones:

```
0.11 → 0.12 → 0.13 → 0.14 → 0.15 → 1.0 → 1.1+ → 1.11
```

Each jump has its own gotchas. Here&apos;s what we hit at each stage.

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/terraform-migration](https://github.com/moabukar/blog-code/tree/main/terraform-migration)

---

## Phase 1: 0.11 to 0.12 - The Big Syntax Change

This is the hardest upgrade. Terraform 0.12 introduced HCL2, which changed almost everything about how you write Terraform.

### Before: 0.11 Syntax

```hcl
# 0.11 - String interpolation everywhere
resource &quot;aws_instance&quot; &quot;web&quot; {
  ami           = &quot;${var.ami_id}&quot;
  instance_type = &quot;${var.instance_type}&quot;
  
  tags {
    Name = &quot;${var.environment}-web&quot;
  }
}

# 0.11 - Conditional wrapped in interpolation syntax
resource &quot;aws_eip&quot; &quot;web&quot; {
  count    = &quot;${var.create_eip ? 1 : 0}&quot;
  instance = &quot;${aws_instance.web.id}&quot;
}

# 0.11 - Element function for list access
output &quot;first_subnet&quot; {
  value = &quot;${element(var.subnet_ids, 0)}&quot;
}
```

### After: 0.12 Syntax

```hcl
# 0.12 - No interpolation needed for simple references
resource &quot;aws_instance&quot; &quot;web&quot; {
  ami           = var.ami_id
  instance_type = var.instance_type
  
  tags = {
    Name = &quot;${var.environment}-web&quot;
  }
}

# 0.12 - Proper boolean conditionals
resource &quot;aws_eip&quot; &quot;web&quot; {
  count    = var.create_eip ? 1 : 0
  instance = aws_instance.web.id
}

# 0.12 - Native list indexing
output &quot;first_subnet&quot; {
  value = var.subnet_ids[0]
}
```

### The 0.12upgrade Tool

Terraform 0.12 shipped with a built-in upgrade tool:

```bash
# First, make sure you&apos;re on the latest 0.11
terraform-0.11 init
terraform-0.11 plan  # Should show no changes

# Run the upgrade tool
terraform-0.12 0.12upgrade

# Review the changes
git diff

# Test the upgrade
terraform-0.12 init
terraform-0.12 plan  # MUST show no changes
```

### What the Tool Doesn&apos;t Fix

The upgrade tool handles most syntax changes, but it can&apos;t fix everything:

**1. Quoted Type Constraints**

```hcl
# 0.11
variable &quot;instance_count&quot; {
  type = &quot;string&quot;  # Quotes around type
}

# 0.12
variable &quot;instance_count&quot; {
  type = string  # No quotes - tool usually fixes this
}
```

**2. Computed Maps in Resources**

```hcl
# 0.11 - This worked
resource &quot;aws_instance&quot; &quot;web&quot; {
  tags = &quot;${merge(var.common_tags, map(&quot;Name&quot;, &quot;web&quot;))}&quot;
}

# 0.12 - Need to update
resource &quot;aws_instance&quot; &quot;web&quot; {
  tags = merge(var.common_tags, { Name = &quot;web&quot; })
}
```

**3. Count on Modules**

```hcl
# 0.11 - count on modules didn&apos;t exist
# If you hacked it with null_resource, you need to refactor

# 0.12 - Still no count on modules (that comes in 0.13)
```

### Verification

After the upgrade tool runs:

```bash
terraform init
terraform plan -out=plan.out

# The output MUST say:
# No changes. Infrastructure is up-to-date.
```

If you see any planned changes, **stop**. Something went wrong. Common issues:

- State file version incompatibility (run `terraform state pull &gt; state.json` and check the version)
- Provider version changes (pin your providers!)
- Syntax the tool missed
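
The first check in that list can be scripted. The sketch below prints the Terraform version that last wrote a pulled state file. `state_written_by` is a hypothetical helper; it relies on the top-level `terraform_version` field in the state JSON and uses grep and sed rather than a JSON parser, so treat it as a quick check only.

```bash
# Sketch: print the Terraform version recorded in a pulled state file.
state_written_by() {
  grep -o '"terraform_version": *"[^"]*"' "$1" |
    head -n 1 |
    sed -E 's/.*"([^"]*)"$/\1/'
}

# Usage:
#   terraform state pull > state.json
#   state_written_by state.json
```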

---

## Phase 2: 0.12 to 0.13 - Provider Requirements

Terraform 0.13 introduced explicit provider source addresses in `required_providers` and added `count`/`for_each` on modules.

### New Required Providers Block

```hcl
# 0.12 - Provider declared implicitly or with version constraint
provider &quot;aws&quot; {
  version = &quot;~&gt; 3.0&quot;
  region  = &quot;eu-west-1&quot;
}

# 0.13 - Explicit required_providers block
terraform {
  required_version = &quot;&gt;= 0.13&quot;
  
  required_providers {
    aws = {
      source  = &quot;hashicorp/aws&quot;
      version = &quot;~&gt; 3.0&quot;
    }
  }
}

provider &quot;aws&quot; {
  region = &quot;eu-west-1&quot;
}
```

### The 0.13upgrade Tool

```bash
terraform-0.13 0.13upgrade

# This adds the required_providers block automatically
# Review and test
terraform init
terraform plan  # Must show no changes
```

### Module Count/For_Each

If you had workarounds for conditional modules, now you can do it properly:

```hcl
# 0.13 - count on modules finally works
module &quot;monitoring&quot; {
  source = &quot;./modules/monitoring&quot;
  count  = var.enable_monitoring ? 1 : 0
}
```

---

## Phase 3: 0.13 to 0.14 - Provider Lock Files

Terraform 0.14 introduced the `.terraform.lock.hcl` file.

```bash
terraform init
# Creates .terraform.lock.hcl

# Commit this file!
git add .terraform.lock.hcl
git commit -m &quot;Add Terraform provider lock file&quot;
```

The lock file pins exact provider versions and checksums. This prevents &quot;works on my machine&quot; issues.
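
If laptops and CI runners use different operating systems or architectures, the lock file needs checksums for each of them. `terraform providers lock` accepts repeated `-platform` flags for this. A sketch; `lock_for_platforms` is a hypothetical helper, and the platform list in the usage line is an assumption about your environment.

```bash
# Sketch: record provider checksums for every platform that runs Terraform.
lock_for_platforms() {
  local args=() platform
  for platform in "$@"; do
    args+=("-platform=$platform")
  done
  terraform providers lock "${args[@]}"
}

# Usage:
#   lock_for_platforms linux_amd64 darwin_arm64
```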

### Sensitive Variables

0.14 also introduced the `sensitive` argument:

```hcl
variable &quot;db_password&quot; {
  type      = string
  sensitive = true  # Won&apos;t show in plan output
}
```

---

## Phase 4: 0.14 to 0.15 - Deprecation Warnings

0.15 removed a lot of deprecated syntax and prepared for 1.0.

Key changes:
- `terraform state mv` behavior changed
- Provider source addresses are now required
- Deprecated interpolation-only expressions removed

```bash
terraform init
terraform plan

# Address any deprecation warnings before moving to 1.0
```

---

## Phase 5: 0.15 to 1.0 - The Stability Release

Terraform 1.0 was mostly a stability release. If you got through 0.15 cleanly, 1.0 should be painless.

```bash
terraform init
terraform plan  # Should show no changes
```

---

## Phase 6: 1.0 to 1.1+ - The S3 Bucket Split

**This is where things get interesting.**

Starting in AWS Provider 4.0 (which you&apos;ll likely adopt when moving through Terraform 1.x), the `aws_s3_bucket` resource was broken up into multiple resources.

### The Old Way (AWS Provider 3.x)

```hcl
resource &quot;aws_s3_bucket&quot; &quot;data&quot; {
  bucket = &quot;my-data-bucket&quot;
  acl    = &quot;private&quot;

  versioning {
    enabled = true
  }

  server_side_encryption_configuration {
    rule {
      apply_server_side_encryption_by_default {
        sse_algorithm     = &quot;aws:kms&quot;
        kms_master_key_id = aws_kms_key.bucket_key.arn
      }
    }
  }

  lifecycle_rule {
    id      = &quot;archive&quot;
    enabled = true

    transition {
      days          = 90
      storage_class = &quot;GLACIER&quot;
    }

    expiration {
      days = 365
    }
  }

  logging {
    target_bucket = aws_s3_bucket.logs.id
    target_prefix = &quot;data-bucket/&quot;
  }

  cors_rule {
    allowed_headers = [&quot;*&quot;]
    allowed_methods = [&quot;GET&quot;, &quot;PUT&quot;]
    allowed_origins = [&quot;https://example.com&quot;]
    max_age_seconds = 3000
  }

  website {
    index_document = &quot;index.html&quot;
    error_document = &quot;error.html&quot;
  }

  tags = {
    Environment = &quot;production&quot;
  }
}
```

One massive resource block with everything crammed in.

### The New Way (AWS Provider 4.0+)

```hcl
resource &quot;aws_s3_bucket&quot; &quot;data&quot; {
  bucket = &quot;my-data-bucket&quot;

  tags = {
    Environment = &quot;production&quot;
  }
}

resource &quot;aws_s3_bucket_versioning&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id

  versioning_configuration {
    status = &quot;Enabled&quot;
  }
}

resource &quot;aws_s3_bucket_server_side_encryption_configuration&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm     = &quot;aws:kms&quot;
      kms_master_key_id = aws_kms_key.bucket_key.arn
    }
  }
}

resource &quot;aws_s3_bucket_lifecycle_configuration&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id

  rule {
    id     = &quot;archive&quot;
    status = &quot;Enabled&quot;

    transition {
      days          = 90
      storage_class = &quot;GLACIER&quot;
    }

    expiration {
      days = 365
    }
  }
}

resource &quot;aws_s3_bucket_logging&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id

  target_bucket = aws_s3_bucket.logs.id
  target_prefix = &quot;data-bucket/&quot;
}

resource &quot;aws_s3_bucket_cors_configuration&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id

  cors_rule {
    allowed_headers = [&quot;*&quot;]
    allowed_methods = [&quot;GET&quot;, &quot;PUT&quot;]
    allowed_origins = [&quot;https://example.com&quot;]
    max_age_seconds = 3000
  }
}

resource &quot;aws_s3_bucket_website_configuration&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id

  index_document {
    suffix = &quot;index.html&quot;
  }

  error_document {
    key = &quot;error.html&quot;
  }
}

resource &quot;aws_s3_bucket_acl&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id
  acl    = &quot;private&quot;
}

resource &quot;aws_s3_bucket_public_access_block&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id

  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```

Yes, one resource became nine. But here&apos;s why this is actually better:

1. **Granular state management** - You can import/move individual settings
2. **Cleaner diffs** - Changing versioning doesn&apos;t show the entire bucket in the plan
3. **Independent lifecycle** - Each setting can be managed separately
4. **Better module composition** - Modules can manage specific aspects

### The Migration Strategy

Here&apos;s the critical part. You have three options:

#### Option A: Let Terraform Recreate (DON&apos;T DO THIS IN PROD)

If you just upgrade the provider and update your code, Terraform will want to:
1. Remove the old inline configuration
2. Create new standalone resources

This might work for non-critical buckets, but for production data? Absolutely not.

#### Option B: State Surgery (The Safe Way)

```bash
# 1. First, upgrade your code to the new format
# 2. Then import the existing configuration into the new resources

# Import versioning
terraform import aws_s3_bucket_versioning.data my-data-bucket

# Import encryption
terraform import aws_s3_bucket_server_side_encryption_configuration.data my-data-bucket

# Import lifecycle rules
terraform import aws_s3_bucket_lifecycle_configuration.data my-data-bucket

# Import logging
terraform import aws_s3_bucket_logging.data my-data-bucket

# Continue for each resource...
```

#### Option C: Use moved Blocks (Terraform 1.1+)

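Terraform 1.1 introduced `moved` blocks. They can&apos;t split one resource into several resources of a different type, so they don&apos;t replace the imports in Option B. They do help when you rename or restructure your own resources as part of the same refactor:

```hcl
# Hypothetical rename for illustration: &quot;data&quot; becomes &quot;app_data&quot;.
# Terraform updates the state address instead of destroying and recreating.
moved {
  from = aws_s3_bucket.data
  to   = aws_s3_bucket.app_data
}

# The split-out child resources (versioning, encryption, ...) still need imports
```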

### The Import Script We Used

For the client, we wrote a script to handle all their S3 buckets:

```bash
#!/bin/bash
# s3-migration-import.sh

set -e

BUCKETS=$(terraform state list | grep &quot;aws_s3_bucket\.&quot; | grep -v &quot;aws_s3_bucket_&quot;)

for bucket_resource in $BUCKETS; do
  bucket_name=$(terraform state show &quot;$bucket_resource&quot; | grep &quot;bucket &quot; | head -1 | awk -F&apos;&quot;&apos; &apos;{print $2}&apos;)
  base_name=$(echo &quot;$bucket_resource&quot; | sed &apos;s/aws_s3_bucket\.//&apos;)
  
  echo &quot;Processing: $bucket_name ($base_name)&quot;
  
  # Check if versioning exists
  if aws s3api get-bucket-versioning --bucket &quot;$bucket_name&quot; --query &apos;Status&apos; --output text | grep -q &quot;Enabled\|Suspended&quot;; then
    echo &quot;  Importing versioning...&quot;
    terraform import &quot;aws_s3_bucket_versioning.${base_name}&quot; &quot;$bucket_name&quot; || true
  fi
  
  # Check if encryption exists
  if aws s3api get-bucket-encryption --bucket &quot;$bucket_name&quot; &gt;/dev/null 2&gt;&amp;1; then
    echo &quot;  Importing encryption...&quot;
    terraform import &quot;aws_s3_bucket_server_side_encryption_configuration.${base_name}&quot; &quot;$bucket_name&quot; || true
  fi
  
  # Check if lifecycle rules exist
  if aws s3api get-bucket-lifecycle-configuration --bucket &quot;$bucket_name&quot; &gt;/dev/null 2&gt;&amp;1; then
    echo &quot;  Importing lifecycle...&quot;
    terraform import &quot;aws_s3_bucket_lifecycle_configuration.${base_name}&quot; &quot;$bucket_name&quot; || true
  fi
  
  # Check if logging exists
  if aws s3api get-bucket-logging --bucket &quot;$bucket_name&quot; --query &apos;LoggingEnabled&apos; --output text | grep -qv &quot;None&quot;; then
    echo &quot;  Importing logging...&quot;
    terraform import &quot;aws_s3_bucket_logging.${base_name}&quot; &quot;$bucket_name&quot; || true
  fi
  
  # Always import public access block (should exist on all buckets)
  echo &quot;  Importing public access block...&quot;
  terraform import &quot;aws_s3_bucket_public_access_block.${base_name}&quot; &quot;$bucket_name&quot; || true
  
done

echo &quot;Done. Run &apos;terraform plan&apos; to verify.&quot;
```

### Verification After S3 Migration

```bash
terraform plan

# You should see:
# No changes. Your infrastructure matches the configuration.

# If you see changes, common issues:
# - Lifecycle rule IDs don&apos;t match (AWS auto-generates if not specified)
# - ACL differences (check if bucket-owner-full-control vs private)
# - Public access block settings differ from defaults
```

---

## Phase 7: 1.1+ to 1.11 - Incremental Updates

After surviving the S3 split, the remaining upgrades are gentler.

### Notable Changes by Version

**Terraform 1.2:**
- `precondition` and `postcondition` blocks
- `replace_triggered_by` lifecycle argument
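
As a rough sketch of what a `precondition` looks like in practice (the `environment` variable and its allowed values are assumptions for illustration, not taken from the client codebase):

```hcl
variable &quot;environment&quot; {
  type = string
}

resource &quot;aws_s3_bucket&quot; &quot;data&quot; {
  bucket = &quot;my-data-bucket&quot;

  lifecycle {
    # Evaluated before the bucket is created or updated; fails the run if false
    precondition {
      condition     = contains([&quot;dev&quot;, &quot;staging&quot;, &quot;production&quot;], var.environment)
      error_message = &quot;environment must be one of: dev, staging, production.&quot;
    }
  }
}
```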

**Terraform 1.3:**
- `optional()` modifier for object type attributes, with per-attribute defaults

```hcl
variable &quot;config&quot; {
  type = object({
    name    = string
    enabled = optional(bool, true)  # Default value!
    retries = optional(number, 3)
  })
}
```

**Terraform 1.4:**
- `terraform_data` resource (replaces `null_resource`)
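
A minimal sketch of `terraform_data` standing in for a `null_resource` (the `app_version` variable and the echo command are placeholders I made up for illustration):

```hcl
variable &quot;app_version&quot; {
  type    = string
  default = &quot;1.0.0&quot;
}

# Built into Terraform itself, so no extra provider is needed
resource &quot;terraform_data&quot; &quot;release_marker&quot; {
  # The resource is replaced, and the provisioner re-run, when this value changes
  triggers_replace = [var.app_version]

  provisioner &quot;local-exec&quot; {
    command = &quot;echo Deployed version ${var.app_version}&quot;
  }
}
```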

**Terraform 1.5:**
- `import` blocks for config-driven imports
- `check` blocks for continuous validation

```hcl
# 1.5 style import - no more CLI imports!
import {
  to = aws_s3_bucket.data
  id = &quot;my-data-bucket&quot;
}

# Continuous validation
# Asserts against the managed versioning resource from Phase 6
check &quot;bucket_versioning_enabled&quot; {
  assert {
    condition     = aws_s3_bucket_versioning.data.versioning_configuration[0].status == &quot;Enabled&quot;
    error_message = &quot;Bucket versioning must be enabled&quot;
  }
}
```
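
The same mechanism covers the split-out S3 resources from Phase 6, so the imports can go through code review instead of living in someone&apos;s shell history. A sketch, assuming the resource addresses and bucket name used earlier in this post:

```hcl
import {
  to = aws_s3_bucket_versioning.data
  id = &quot;my-data-bucket&quot;
}

import {
  to = aws_s3_bucket_server_side_encryption_configuration.data
  id = &quot;my-data-bucket&quot;
}

import {
  to = aws_s3_bucket_lifecycle_configuration.data
  id = &quot;my-data-bucket&quot;
}

import {
  to = aws_s3_bucket_logging.data
  id = &quot;my-data-bucket&quot;
}
```

`terraform plan` previews the imports, and once they&apos;ve been applied the blocks can be deleted.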

**Terraform 1.6:**
- `terraform test` framework
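
Test files are plain HCL with a `.tftest.hcl` extension. A minimal sketch, assuming the `aws_s3_bucket.data` resource from earlier (the file name and run label are made up for illustration):

```hcl
# tests/bucket.tftest.hcl (hypothetical file name)

run &quot;bucket_name_is_expected&quot; {
  # Plan only: nothing is created in AWS
  command = plan

  assert {
    condition     = aws_s3_bucket.data.bucket == &quot;my-data-bucket&quot;
    error_message = &quot;Bucket name did not match the expected value&quot;
  }
}
```

Even with `command = plan`, the AWS provider still needs working credentials to run this.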

**Terraform 1.7:**
- `removed` blocks for safe resource removal from state

```hcl
# Instead of terraform state rm, use this
removed {
  from = aws_instance.old_server

  lifecycle {
    destroy = false  # Don&apos;t destroy the actual resource
  }
}
```

**Terraform 1.8-1.11:**
- Provider-defined functions
- Various performance improvements
- Better error messages
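
Provider-defined functions are called with a `provider::&lt;name&gt;::&lt;function&gt;` prefix. A small sketch using the AWS provider&apos;s `arn_parse`; the version constraints are my assumption of the minimums, so check them against the provider changelog:

```hcl
terraform {
  required_version = &quot;&gt;= 1.8.0&quot;

  required_providers {
    aws = {
      source  = &quot;hashicorp/aws&quot;
      version = &quot;&gt;= 5.40.0&quot; # Assumed minimum for provider functions
    }
  }
}

locals {
  # Splits an ARN into partition, service, region, account_id and resource
  bucket_arn = provider::aws::arn_parse(&quot;arn:aws:s3:::my-data-bucket&quot;)
}

output &quot;bucket_resource&quot; {
  value = local.bucket_arn.resource
}
```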

### The Final Verification

After reaching 1.11:

```bash
terraform init -upgrade
terraform plan

# Must show:
# No changes. Your infrastructure matches the configuration.

# Run a full validate too
terraform validate
```

---

## Common Issues and Fixes

### Issue: State Version Mismatch

```
Error: state snapshot was created by Terraform v0.14.0, which is newer than current v0.13.0
```

**Fix:** You can&apos;t downgrade state. Always move forward.
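
One way to stop this happening by accident is a `required_version` constraint, so an older binary refuses to run against the configuration at all. A sketch, assuming you&apos;re mid-migration on 0.14 (bump the constraint at each phase):

```hcl
terraform {
  # Any 0.14.x release is accepted; 0.13 and 0.15+ binaries error out early
  required_version = &quot;~&gt; 0.14.0&quot;
}
```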

### Issue: Provider Version Conflict

```
Error: Failed to query available provider packages
```

**Fix:** Pin your provider versions before upgrading Terraform:

```hcl
terraform {
  required_providers {
    aws = {
      source  = &quot;hashicorp/aws&quot;
      version = &quot;~&gt; 3.75.0&quot;  # Pin before upgrade
    }
  }
}
```

### Issue: Module Source Changed

```
Error: Module not installed
```

**Fix:** Run `terraform init -upgrade` after each Terraform version upgrade.

### Issue: Deprecated Interpolation

```
Warning: Interpolation-only expressions are deprecated
```

**Fix:** Remove unnecessary `${}`:

```hcl
# Bad
name = &quot;${var.name}&quot;

# Good
name = var.name
```

### Issue: S3 Bucket ACL Conflicts

```
Error: error putting S3 Bucket ACL: AccessControlListNotSupported
```

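**Fix:** Buckets with object ownership set to `BucketOwnerEnforced` (the default for new buckets since April 2023) have ACLs disabled, so you can&apos;t manage an ACL on them: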

```hcl
# If you have this:
resource &quot;aws_s3_bucket_ownership_controls&quot; &quot;data&quot; {
  bucket = aws_s3_bucket.data.id
  rule {
    object_ownership = &quot;BucketOwnerEnforced&quot;
  }
}

# Then you can&apos;t have this:
# resource &quot;aws_s3_bucket_acl&quot; &quot;data&quot; { ... }  # REMOVE THIS
```

---

## The Migration Checklist

Here&apos;s the checklist we used for each environment:

```markdown
## Pre-Migration
- [ ] Backup state file: `terraform state pull &gt; state-backup-$(date +%Y%m%d).json`
- [ ] Document current Terraform version
- [ ] Document current provider versions
- [ ] Run `terraform plan` - confirm no changes
- [ ] Commit all code changes

## Per Version Upgrade
- [ ] Install new Terraform version
- [ ] Run upgrade tool if available (0.12upgrade, 0.13upgrade)
- [ ] Run `terraform init -upgrade`
- [ ] Run `terraform plan`
- [ ] Verify: &quot;No changes&quot;
- [ ] Commit changes with version in message

## S3 Migration (Provider 3.x → 4.x)
- [ ] Update code to use separate resources
- [ ] Run import script for all buckets
- [ ] Run `terraform plan` - verify no changes
- [ ] Test in dev/staging first
- [ ] Commit and document

## Post-Migration
- [ ] Update CI/CD pipelines with new Terraform version
- [ ] Update documentation
- [ ] Train team on new syntax/features
- [ ] Remove old Terraform binaries
```

---

## Timeline

For reference, here&apos;s how long this took for a ~200 resource codebase:

| Phase | Duration | Notes |
|-------|----------|-------|
| 0.11 → 0.12 | 2 days | Most syntax changes |
| 0.12 → 0.13 | 4 hours | Mostly automated |
| 0.13 → 0.14 | 2 hours | Lock file setup |
| 0.14 → 0.15 | 2 hours | Deprecation fixes |
| 0.15 → 1.0 | 1 hour | Smooth |
| 1.0 → 1.1+ (S3 split) | 3 days | The big one |
| 1.1+ → 1.11 | 4 hours | Incremental |

**Total: ~1 week of focused work**

---

## Key Takeaways

1. **Never skip versions** - Follow the upgrade path
2. **Plan must show no changes** - After every upgrade
3. **Backup state** - Before every upgrade
4. **Pin provider versions** - Upgrade Terraform and providers separately
5. **Test in non-prod first** - Always
6. **The S3 split is the hard part** - Budget time for it
7. **Document everything** - Future you will thank present you

The Terraform ecosystem moves fast. What was bleeding edge in 0.11 is now ancient history. But if you follow this guide methodically, you&apos;ll get there without losing any infrastructure along the way.

---

*Questions? Find me on [LinkedIn](https://linkedin.com/in/moabukar) or [GitHub](https://github.com/moabukar).*</content:encoded><category>terraform</category><category>iac</category><category>migration</category><category>aws</category><category>s3</category><category>state-management</category><category>hcl2</category><author>Mo Abukar</author></item><item><title>Running Clawdbot 24/7 on a Hetzner VPS – Terraform, Security Hardening, and the Bits the Docs Miss</title><link>https://moabukar.co.uk/blog/clawdbot-automated/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/clawdbot-automated/</guid><description>A production-grade setup for Clawdbot on Hetzner Cloud with Terraform provisioning, proper SSH hardening, fail2ban, UFW, unattended-upgrades, and optional Tailscale – the stuff you actually need in prod.</description><pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate><content:encoded>Clawdbot has been everywhere in January 2026. The development velocity is mental – new features landing daily, skills ecosystem expanding, and the community building integrations faster than I can keep up.

The official docs are solid, but they assume you&apos;re clicking through a web console and SSHing in with a password. That&apos;s not how we do things.

This is a production-grade walkthrough: Terraform-provisioned Hetzner VPS, proper security hardening, and the gotchas I hit getting Clawdbot running 24/7.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/hetzner.svg&quot; alt=&quot;Hetzner logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## Infrastructure as Code - Hetzner VPS with Terraform

No clicking around in consoles. Here&apos;s the Terraform to spin up a VPS with SSH keys, firewall rules, and optional Tailscale bootstrap.

### Repo Structure

```
.
├── main.tf
├── variables.tf
├── outputs.tf
├── data.tf
├── terraform.tfvars.example
└── scripts/
    └── cloud-init.sh
```

### main.tf

```hcl
# SSH Key
resource &quot;hcloud_ssh_key&quot; &quot;default&quot; {
  name       = &quot;${var.server_name}-ssh-key&quot;
  public_key = var.ssh_public_key
}

# Cloud-init script
locals {
  user_data = templatefile(&quot;${path.module}/scripts/cloud-init.sh&quot;, {
    tailscale_auth_key = var.tailscale_auth_key
    username           = var.username
    ssh_public_key     = var.ssh_public_key
  })
}

# Server
resource &quot;hcloud_server&quot; &quot;vps&quot; {
  name        = var.server_name
  image       = var.image
  server_type = data.hcloud_server_type.selected.name
  location    = data.hcloud_location.selected.name
  ssh_keys    = concat([hcloud_ssh_key.default.id], var.ssh_keys)
  user_data   = local.user_data

  labels = {
    managed-by  = &quot;terraform&quot;
    environment = var.environment
    purpose     = &quot;clawdbot&quot;
  }
}

# Firewall – locked down by default
resource &quot;hcloud_firewall&quot; &quot;vps&quot; {
  name = &quot;${var.server_name}-firewall&quot;

  # SSH: Tailscale CGNAT range + explicit allowed IPs
  rule {
    direction   = &quot;in&quot;
    protocol    = &quot;tcp&quot;
    port        = &quot;22&quot;
    source_ips  = var.tailscale_auth_key != &quot;&quot; ? concat([&quot;100.64.0.0/10&quot;], var.allowed_ssh_ips) : var.allowed_ssh_ips
    description = &quot;SSH access&quot;
  }

  # ICMP for diagnostics
  rule {
    direction   = &quot;in&quot;
    protocol    = &quot;icmp&quot;
    source_ips  = [&quot;0.0.0.0/0&quot;, &quot;::/0&quot;]
    description = &quot;ICMP (ping)&quot;
  }

  # Egress – allow all (Hetzner default, but explicit is better)
  rule {
    direction       = &quot;out&quot;
    protocol        = &quot;tcp&quot;
    port            = &quot;1-65535&quot;
    destination_ips = [&quot;0.0.0.0/0&quot;, &quot;::/0&quot;]
    description     = &quot;All TCP outbound&quot;
  }

  rule {
    direction       = &quot;out&quot;
    protocol        = &quot;udp&quot;
    port            = &quot;1-65535&quot;
    destination_ips = [&quot;0.0.0.0/0&quot;, &quot;::/0&quot;]
    description     = &quot;All UDP outbound&quot;
  }

  rule {
    direction       = &quot;out&quot;
    protocol        = &quot;icmp&quot;
    destination_ips = [&quot;0.0.0.0/0&quot;, &quot;::/0&quot;]
    description     = &quot;ICMP outbound&quot;
  }
}

resource &quot;hcloud_firewall_attachment&quot; &quot;vps&quot; {
  firewall_id = hcloud_firewall.vps.id
  server_ids  = [hcloud_server.vps.id]
}
```

### variables.tf

```hcl
variable &quot;hcloud_token&quot; {
  description = &quot;Hetzner Cloud API token&quot;
  type        = string
  sensitive   = true
}

variable &quot;server_name&quot; {
  description = &quot;Server hostname&quot;
  type        = string
  default     = &quot;clawdbot&quot;
}

variable &quot;server_type&quot; {
  description = &quot;Hetzner server type (cx22 = 2 vCPU, 4GB RAM)&quot;
  type        = string
  default     = &quot;cx22&quot;
}

variable &quot;image&quot; {
  description = &quot;OS image&quot;
  type        = string
  default     = &quot;ubuntu-24.04&quot;
}

variable &quot;location&quot; {
  description = &quot;Hetzner datacenter&quot;
  type        = string
  default     = &quot;nbg1&quot;  # Nuremberg, DE
}

variable &quot;ssh_public_key&quot; {
  description = &quot;SSH public key for access&quot;
  type        = string
}

variable &quot;ssh_keys&quot; {
  description = &quot;Additional SSH key IDs&quot;
  type        = list(string)
  default     = []
}

variable &quot;username&quot; {
  description = &quot;Non-root user to create&quot;
  type        = string
  default     = &quot;clawdbot&quot;
}

variable &quot;tailscale_auth_key&quot; {
  description = &quot;Tailscale auth key (optional)&quot;
  type        = string
  default     = &quot;&quot;
  sensitive   = true
}

variable &quot;allowed_ssh_ips&quot; {
  description = &quot;IPs allowed to SSH (use your static IP or VPN range)&quot;
  type        = list(string)
  default     = []  # Empty = SSH only via Tailscale if enabled
}

variable &quot;environment&quot; {
  description = &quot;Environment label&quot;
  type        = string
  default     = &quot;production&quot;
}
```
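
Since `allowed_ssh_ips` feeds straight into the firewall rule, it&apos;s worth rejecting malformed input at plan time. A sketch of the same variable with a `validation` block added (the validation is my addition, not part of the repo):

```hcl
variable &quot;allowed_ssh_ips&quot; {
  description = &quot;IPs allowed to SSH (use your static IP or VPN range)&quot;
  type        = list(string)
  default     = []

  validation {
    # cidrhost() fails on anything that isn&apos;t valid CIDR notation; can() turns that into false
    condition     = alltrue([for ip in var.allowed_ssh_ips : can(cidrhost(ip, 0))])
    error_message = &quot;Each entry must be valid CIDR notation, e.g. 203.0.113.10/32.&quot;
  }
}
```

If you use it, replace the existing definition rather than adding a second one.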

### data.tf

```hcl
terraform {
  required_providers {
    hcloud = {
      source  = &quot;hetznercloud/hcloud&quot;
      version = &quot;~&gt; 1.45&quot;
    }
  }
}

provider &quot;hcloud&quot; {
  token = var.hcloud_token
}

data &quot;hcloud_server_type&quot; &quot;selected&quot; {
  name = var.server_type
}

data &quot;hcloud_location&quot; &quot;selected&quot; {
  name = var.location
}
```

### outputs.tf

```hcl
output &quot;server_ip&quot; {
  description = &quot;Public IPv4 address&quot;
  value       = hcloud_server.vps.ipv4_address
}

output &quot;server_ipv6&quot; {
  description = &quot;Public IPv6 address&quot;
  value       = hcloud_server.vps.ipv6_address
}

output &quot;ssh_command&quot; {
  description = &quot;SSH connection string&quot;
  value       = &quot;ssh ${var.username}@${hcloud_server.vps.ipv4_address}&quot;
}
```

### scripts/cloud-init.sh

This is where the security hardening happens. Cloud-init runs on first boot – no manual SSH required.

```bash
#!/bin/bash
set -euo pipefail

# Variables from Terraform
TAILSCALE_AUTH_KEY=&quot;${tailscale_auth_key}&quot;
USERNAME=&quot;${username}&quot;
SSH_PUBLIC_KEY=&quot;${ssh_public_key}&quot;

# Logging
exec &gt; &gt;(tee /var/log/cloud-init-custom.log) 2&gt;&amp;1
echo &quot;=== Cloud-init started at $(date) ===&quot;

# System updates
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get upgrade -y

# Install essentials
DEBIAN_FRONTEND=noninteractive apt-get install -y \
  curl \
  git \
  vim \
  htop \
  fail2ban \
  ufw \
  unattended-upgrades \
  apt-listchanges

# Create non-root user
if ! id &quot;$USERNAME&quot; &amp;&gt;/dev/null; then
  useradd -m -s /bin/bash -G sudo &quot;$USERNAME&quot;
  echo &quot;$USERNAME ALL=(ALL) NOPASSWD:ALL&quot; &gt; &quot;/etc/sudoers.d/$USERNAME&quot;
  chmod 0440 &quot;/etc/sudoers.d/$USERNAME&quot;
fi

# SSH key for user
USER_HOME=&quot;/home/$USERNAME&quot;
mkdir -p &quot;$USER_HOME/.ssh&quot;
echo &quot;$SSH_PUBLIC_KEY&quot; &gt; &quot;$USER_HOME/.ssh/authorized_keys&quot;
chmod 700 &quot;$USER_HOME/.ssh&quot;
chmod 600 &quot;$USER_HOME/.ssh/authorized_keys&quot;
chown -R &quot;$USERNAME:$USERNAME&quot; &quot;$USER_HOME/.ssh&quot;

# SSH hardening
cat &gt; /etc/ssh/sshd_config.d/hardening.conf &lt;&lt; &apos;EOF&apos;
# Disable password authentication
PasswordAuthentication no
KbdInteractiveAuthentication no
UsePAM yes

# Disable root login
PermitRootLogin no

# Key-based auth only
PubkeyAuthentication yes
AuthorizedKeysFile .ssh/authorized_keys

# Timeouts and limits
ClientAliveInterval 300
ClientAliveCountMax 2
MaxAuthTries 3
MaxSessions 3
LoginGraceTime 30

# Disable unused auth methods
HostbasedAuthentication no
PermitEmptyPasswords no
KerberosAuthentication no
GSSAPIAuthentication no

# Logging
LogLevel VERBOSE
EOF

# Restart SSH
systemctl restart ssh

# fail2ban configuration
cat &gt; /etc/fail2ban/jail.local &lt;&lt; &apos;EOF&apos;
[DEFAULT]
bantime = 1h
findtime = 10m
maxretry = 5
banaction = ufw

[sshd]
enabled = true
port = ssh
logpath = /var/log/auth.log
maxretry = 3
bantime = 24h
EOF

systemctl enable fail2ban
systemctl restart fail2ban

# UFW firewall
ufw default deny incoming
ufw default allow outgoing
ufw allow ssh
ufw --force enable

# Unattended upgrades – security patches only
# The doubled dollar signs below are templatefile() escapes; the rendered file gets a single one
cat &gt; /etc/apt/apt.conf.d/50unattended-upgrades &lt;&lt; &apos;EOF&apos;
Unattended-Upgrade::Allowed-Origins {
    &quot;$${distro_id}:$${distro_codename}-security&quot;;
};
Unattended-Upgrade::AutoFixInterruptedDpkg &quot;true&quot;;
Unattended-Upgrade::MinimalSteps &quot;true&quot;;
Unattended-Upgrade::Remove-Unused-Kernel-Packages &quot;true&quot;;
Unattended-Upgrade::Remove-Unused-Dependencies &quot;true&quot;;
Unattended-Upgrade::Automatic-Reboot &quot;false&quot;;
EOF

cat &gt; /etc/apt/apt.conf.d/20auto-upgrades &lt;&lt; &apos;EOF&apos;
APT::Periodic::Update-Package-Lists &quot;1&quot;;
APT::Periodic::Unattended-Upgrade &quot;1&quot;;
APT::Periodic::AutocleanInterval &quot;7&quot;;
EOF

systemctl enable unattended-upgrades

# Kernel hardening via sysctl
cat &gt; /etc/sysctl.d/99-security.conf &lt;&lt; &apos;EOF&apos;
# IP Spoofing protection
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1

# Ignore ICMP redirects
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0

# Ignore source routed packets
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0

# Log Martian packets
net.ipv4.conf.all.log_martians = 1

# Ignore broadcast pings
net.ipv4.icmp_echo_ignore_broadcasts = 1

# Disable IPv6 if not needed (optional)
# net.ipv6.conf.all.disable_ipv6 = 1
EOF

sysctl -p /etc/sysctl.d/99-security.conf

# Tailscale (optional)
if [ -n &quot;$TAILSCALE_AUTH_KEY&quot; ]; then
  curl -fsSL https://tailscale.com/install.sh | sh
  tailscale up --authkey=&quot;$TAILSCALE_AUTH_KEY&quot; --ssh
  echo &quot;Tailscale installed and connected&quot;
fi

echo &quot;=== Cloud-init completed at $(date) ===&quot;
```

### terraform.tfvars.example

```hcl
hcloud_token   = &quot;your-hetzner-api-token&quot;
server_name    = &quot;clawdbot-prod&quot;
server_type    = &quot;cx22&quot;
image          = &quot;ubuntu-24.04&quot;
location       = &quot;nbg1&quot;
ssh_public_key = &quot;ssh-ed25519 AAAA... you@machine&quot;

# Security: restrict SSH to your IP or VPN
allowed_ssh_ips = [&quot;YOUR_IP/32&quot;]

# Optional: Tailscale for zero-trust access
# tailscale_auth_key = &quot;tskey-auth-xxxxx&quot;
```

## Deployment

```bash
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values

terraform init
terraform plan
terraform apply
```

Wait ~2 minutes for cloud-init to complete. Check progress:

```bash
ssh clawdbot@$(terraform output -raw server_ip) &apos;tail -f /var/log/cloud-init-custom.log&apos;
```

## Tailscale Setup – Getting Your Auth Key

If you want zero-trust access to your Clawdbot gateway (and you should), you&apos;ll need a Tailscale auth key before running Terraform.

### Create a Tailscale Account

1. Head to [tailscale.com](https://tailscale.com) and sign up (free tier is plenty)
2. Install Tailscale on your local machine – this is how you&apos;ll access the VPS securely

### Generate an Auth Key

1. Go to [Tailscale Admin Console](https://login.tailscale.com/admin/settings/keys)
2. Click **Generate auth key**
3. Settings I use:
   - **Reusable**: No (one-time use is more secure)
   - **Ephemeral**: No (we want the node to persist)
   - **Pre-approved**: Yes (skips manual approval)
   - **Tags**: Optional, but useful if you have ACLs (`tag:servers`)
   - **Expiry**: 1 hour is fine – it&apos;s only used during cloud-init

4. Copy the key – it looks like `tskey-auth-kXYZ123CNTRL-abc123...`

This key goes into your `terraform.tfvars`:

```hcl
tailscale_auth_key = &quot;tskey-auth-kXYZ123CNTRL-abc123...&quot;
```

### Why Tailscale?

The VPS binds Clawdbot to `127.0.0.1` – it&apos;s not exposed to the public internet. Tailscale creates a private mesh network between your devices. You access the dashboard via `https://clawdbot.tail1234.ts.net` (private HTTPS, no port forwarding, no firewall holes).

If you skip Tailscale, you&apos;ll need to either:
- SSH tunnel every time (`ssh -L 18789:localhost:18789 clawdbot@server`)
- Expose the gateway to `0.0.0.0` with token auth (less secure)

## Networking and Security Hardening

The cloud-init script handles the heavy lifting, but here&apos;s what&apos;s actually happening:

### Defence in Depth

```
Internet → Hetzner Firewall → UFW → Application
              (hypervisor)    (kernel)   (userspace)
```

**Hetzner Firewall** – Filters at the hypervisor level. Traffic is dropped before it reaches your VM. Can&apos;t be disabled from inside the VM (good for preventing compromise escalation).

**UFW** – Linux kernel firewall (iptables frontend). Second layer of filtering. Useful for per-application rules and logging.

**fail2ban** – Monitors `/var/log/auth.log`, bans IPs after 3 failed SSH attempts for 24 hours. Integrates with UFW for automatic blocking.

### Kernel Hardening

The sysctl settings prevent common network attacks:

| Setting | What it does |
|---------|--------------|
| `rp_filter = 1` | Drops packets with spoofed source IPs |
| `accept_redirects = 0` | Ignores ICMP redirects (prevents MitM) |
| `accept_source_route = 0` | Blocks source-routed packets |
| `log_martians = 1` | Logs packets with impossible addresses |
| `icmp_echo_ignore_broadcasts = 1` | Prevents Smurf attacks |

### SSH Hardening

The `sshd_config.d` drop-in shuts out the bulk of automated attacks:

- **No passwords** – Key-only auth takes password brute forcing off the table
- **No root login** – Attackers need a valid username as well as its private key
- **3 max auth tries** – Limits guesses per connection
- **30s login grace** – Closes hanging connections fast
- **Verbose logging** – Forensics if something goes wrong

### Automatic Security Updates

`unattended-upgrades` applies security patches daily without intervention. Only security repos are enabled – no surprise feature changes breaking your setup.

Check what&apos;s pending:

```bash
sudo unattended-upgrades --dry-run -v
```

## Installing Clawdbot

SSH in as the `clawdbot` user:

```bash
ssh clawdbot@$(terraform output -raw server_ip)
```

### Node.js via nvm

```bash
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash
source ~/.bashrc
nvm install 24
node -v  # v24.x
```

### Homebrew (required for some skills)

```bash
/bin/bash -c &quot;$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/HEAD/install.sh)&quot;
echo &apos;eval &quot;$(/home/linuxbrew/.linuxbrew/bin/brew shellenv)&quot;&apos; &gt;&gt; ~/.bashrc
source ~/.bashrc
```

### Clawdbot Installation

```bash
npm i -g clawdbot
```

This takes a minute or two. Once done, you&apos;re ready to onboard.

## Onboarding Walkthrough

Run `clawdbot onboard` and follow the interactive wizard. Clawdbot is moving fast, so options might change – but here&apos;s what worked for me with annotations:

```
◆  I understand this is powerful and inherently risky. Continue?
│  Yes
│
◆  Onboarding mode
│  ○ QuickStart
│  ● Manual (Configure port, network, Tailscale, and auth options.)
│  # Use manual mode for more control
│
◆  What do you want to set up?
│  ● Local gateway (this machine) (Gateway reachable (ws://127.0.0.1:18789))
│  ○ Remote gateway (info-only)
│  # Counterintuitive, but &quot;local&quot; means the gateway runs on this VPS
│
◆  Workspace directory
│  /home/clawdbot/clawd
│
◆  Model/auth provider
│  ● OpenAI (Codex OAuth + API key)
│  ○ Anthropic
│  ○ MiniMax
│  ○ Qwen
│  ○ Synthetic
│  ○ Google
│  ○ Copilot
│  ...
│  # I use Anthropic with my Claude Pro subscription. Pick your provider.
│
◆  Gateway port
│  18789
│
◆  Gateway bind
│  ● Loopback (127.0.0.1)
│  ○ LAN (0.0.0.0)
│  ○ Tailnet (Tailscale IP)
│  ○ Auto (Loopback → LAN)
│  ○ Custom IP
│  # Loopback – only accessible via Tailscale or SSH tunnel
│
◆  Gateway auth
│  ○ Off (loopback only)
│  ● Token (Recommended default (local + remote))
│  ○ Password
│  # Token auth – you&apos;ll get a token for dashboard access
│
◆  Tailscale exposure
│  ○ Off
│  ● Serve (Private HTTPS for your tailnet (devices on Tailscale))
│  ○ Funnel
│  # Serve = private HTTPS within your tailnet. Funnel = public internet (avoid)
│
◆  Reset Tailscale serve/funnel on exit?
│  ○ Yes / ● No
│  # No – keeps the endpoint alive when gateway restarts
│
◆  Configure chat channels now?
│  ● Yes / ○ No
│  # Yes – this is how you&apos;ll interact with Clawdbot day-to-day
│
◇  Skills status ────────────╮
│                            │
│  Eligible: 13              │
│  Missing requirements: 38  │
│  Blocked by allowlist: 0   │
│                            │
├────────────────────────────╯
│
◆  Configure skills now? (recommended)
│  ● Yes / ○ No
│  # Yes – use Spacebar to select skills, Enter to confirm
│  # If unsure, skip for now – you can add skills later
│
◆  Preferred node manager for skill installs
│  ● npm
│  ○ pnpm
│  ○ bun
│
◆  Set GOOGLE_PLACES_API_KEY for goplaces?
│  ○ Yes / ● No
│  # Skip API key prompts unless you have them ready
│
◆  Enable hooks?
│  ◼ Skip for now
│  ◻ 🚀 boot-md
│  ◻ 📝 command-logger
│  ◻ 💾 session-memory
│  # Skip unless you&apos;ve read the docs on hooks
│
◆  Install Gateway service (recommended)
│  ● Yes / ○ No
│  # Yes – installs a systemd unit for auto-start
│
◆  Gateway service runtime
│  ● Node (recommended) (Required for WhatsApp + Telegram. Bun can corrupt memory on reconnect.)
│  # Node – required for WhatsApp integration
│
◆  How do you want to hatch your bot?
│  ● Hatch in TUI (recommended)
│  ○ Open the Web UI
│  ○ Do this later
│  # TUI drops you into an interactive terminal to finish setup
```

Once complete, the gateway runs as a systemd service:

```bash
systemctl status clawdbot-gateway
journalctl -u clawdbot-gateway -f
```

## WhatsApp Integration

I wanted Clawdbot as a proper personal assistant – something I can message from my phone without opening a laptop. WhatsApp Business works perfectly for this.

### Don&apos;t Use Your Real Number

Clawdbot links to a WhatsApp Business account by scanning a QR code, and that account requires phone verification. Don&apos;t use your personal number – if something goes wrong, you don&apos;t want your main WhatsApp account locked.

**Get a cheap SIM for SMS verification:**

- [giffgaff](https://www.giffgaff.com/) – £10 gets you a SIM with a UK number, pay-as-you-go
- Any budget MVNO works – you only need it for the initial SMS verification
- Once verified, the SIM can sit in a drawer

### Setup Steps

1. **Install WhatsApp Business** on a spare phone (or use an Android emulator)
2. **Verify with your temporary number**
3. During Clawdbot onboarding, select **WhatsApp** as your chat channel
4. Clawdbot generates a QR code – scan it with WhatsApp Business to link
5. Message your bot with `/start` to pair the session

Now you can message Clawdbot from your main phone by adding the business number as a contact. It&apos;s your 24/7 personal assistant – responds in seconds, runs automations, and doesn&apos;t judge you for asking questions at 3am.

## Security Checklist

What we&apos;ve covered:

- [x] SSH key-only auth (password disabled)
- [x] Root login disabled
- [x] Non-root user with sudo
- [x] fail2ban with 24h bans for SSH brute force
- [x] UFW firewall (SSH only inbound)
- [x] Hetzner firewall (defence in depth)
- [x] Unattended security upgrades
- [x] Kernel hardening (IP spoofing, redirects, source routing)
- [x] Optional Tailscale for zero-trust access

What you should also consider:

- [ ] SSH on non-standard port (security through obscurity, but reduces log noise)
- [ ] Monitoring/alerting (Prometheus node_exporter, or just `uptime-kuma`)
- [ ] Backup strategy for `/home/clawdbot/clawd`
- [ ] Rate limiting at application level if exposing any HTTP endpoints

## Gotchas

**cloud-init timing** – Terraform reports success before cloud-init finishes. The server is up, but hardening might still be in progress. Check `/var/log/cloud-init-custom.log`.
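
A minimal sketch of gating on that, assuming the usual `cloud-init status` output format (`status: running`, `status: done`, `status: error`). The helper names are hypothetical and not part of the Terraform setup:

```python
def cloud_init_state(status_output: str) -> str:
    """Extract the state from `cloud-init status` output, e.g. 'status: done'."""
    for line in status_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip()
    return "unknown"


def hardening_finished(status_output: str) -> bool:
    # Only 'done' means every cloud-init stage, custom scripts included, has completed.
    return cloud_init_state(status_output) == "done"


print(hardening_finished("status: running"))  # False
print(hardening_finished("status: done"))     # True
```

On the server itself, `cloud-init status --wait` blocks until the run completes, which is usually the simpler option.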

**Tailscale SSH** – If you enable `tailscale up --ssh`, Tailscale handles SSH auth separately. Your `~/.ssh/authorized_keys` still works, but Tailscale ACLs take precedence for tailnet connections.

**UFW vs Hetzner firewall** – Both are active. Hetzner firewall filters at the hypervisor level (faster, can&apos;t be bypassed from inside the VM). UFW runs inside the VM. Defence in depth – keep both.

**npm global installs** – If you hit EACCES errors, don&apos;t use `sudo npm`. Fix npm&apos;s directory:

```bash
mkdir -p ~/.npm-global
npm config set prefix &apos;~/.npm-global&apos;
echo &apos;export PATH=~/.npm-global/bin:$PATH&apos; &gt;&gt; ~/.bashrc
source ~/.bashrc
```

## Using Clawdbot

The dashboard is nice, but the chat interface is where it shines. Connect via Telegram (or your chosen channel), then just describe what you want.

I&apos;ve set up:
- Daily digest of bookmarked tweets via the `bird` skill
- RSS feed monitoring with summaries pushed to a private channel
- Automated Git repo health checks

The key insight: don&apos;t configure workflows via the UI. Describe the outcome you want in natural language. Clawdbot figures out the skill configuration, proposes a plan, and executes it.

---

The full Terraform setup is on [GitHub](https://github.com/moabukar/vps-clawdbot). 

Questions? Find me on [LinkedIn](https://linkedin.com/in/moabukar) or drop a comment below.</content:encoded><category>clawdbot</category><category>hetzner</category><category>terraform</category><category>vps</category><category>security</category><category>devops</category><category>automation</category><author>Mo Abukar</author></item><item><title>Elastic Cloud Setup Guide - From Zero to Production</title><link>https://moabukar.co.uk/blog/elastic-cloud-setup-guide/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/elastic-cloud-setup-guide/</guid><description>A comprehensive guide to setting up Elastic Cloud (Elasticsearch Service), including deployment configuration, security setup, index lifecycle management, integrations, and cost optimization.</description><pubDate>Wed, 28 Jan 2026 00:00:00 GMT</pubDate><content:encoded># Elastic Cloud Setup Guide - From Zero to Production

Running your own Elasticsearch cluster is powerful but operationally heavy. Upgrades, security patches, scaling, backups - it adds up. Elastic Cloud handles all of that, letting you focus on using the stack rather than managing it.

This guide walks through setting up Elastic Cloud properly - not just clicking through the wizard, but configuring it for real production use with proper security, lifecycle management, and cost optimization.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/elastic.svg&quot; alt=&quot;Elastic logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## Why Elastic Cloud?

Before diving in, here&apos;s why you might choose Elastic Cloud over self-managed:

**Pros:**
- Fully managed upgrades (one-click)
- Automated backups and snapshots
- Built-in security (TLS, RBAC, SSO)
- Cross-cloud deployment (AWS, GCP, Azure)
- Autoscaling options
- Elastic&apos;s support team
- Always latest features

**Cons:**
- Higher cost than self-managed (roughly 2-3x)
- Less control over infrastructure
- Data residency concerns (though many regions available)
- Vendor lock-in

For most teams, the operational overhead savings outweigh the cost difference.

---

## Step 1: Create Your Elastic Cloud Account

1. Go to [cloud.elastic.co](https://cloud.elastic.co)
2. Sign up (email or SSO with Google/Microsoft)
3. Verify your email
4. You get a 14-day free trial with $400 credit

---

## Step 2: Create Your First Deployment

### Choosing a Deployment Template

Elastic Cloud offers several pre-configured templates:

| Template | Best For | Components |
|----------|----------|------------|
| **General Purpose** | Most workloads | ES + Kibana balanced |
| **Observability** | Logs, metrics, APM | Optimized for time-series |
| **Security** | SIEM, threat detection | Elastic Security features |
| **Vector Search** | AI/ML, embeddings | ML nodes included |
| **Enterprise Search** | Web/app search | App Search + Workplace Search |

For this guide, we&apos;ll use **Observability** - the most common use case.

### Deployment Configuration

Click **Create deployment** and configure:

**1. Name:** Choose something meaningful
```
prod-logs-eu-west-1
staging-observability
```

**2. Cloud Provider &amp; Region:**
- Choose based on where your data sources are
- Lower latency = better ingestion performance
- Consider data residency requirements

**3. Hardware Profile:**

For a production observability deployment, I recommend starting with:

```
Elasticsearch:
  - Hot tier: 2 zones × 4GB RAM (8GB total)
  - Warm tier: 2 zones × 2GB RAM (4GB total) - optional initially
  - Cold tier: None initially

Kibana:
  - 1 zone × 1GB RAM

Integrations Server (APM + Fleet):
  - 1 zone × 1GB RAM
```

You can scale up later - Elastic Cloud makes this easy.

**4. Version:**
- Always choose the latest stable version (8.x)
- Avoid pre-release versions for production

### Advanced Settings

Expand **Advanced settings** for more control:

**Snapshot Repository:**
- Enabled by default (the `found-snapshots` repository)
- Snapshots every 30 minutes
- Retained for 100 snapshots or ~2 days
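
The two retention figures are the same limit expressed two ways. A quick check, using only the interval and count listed above:

```python
from datetime import timedelta

SNAPSHOT_INTERVAL = timedelta(minutes=30)  # default interval listed above
MAX_SNAPSHOTS = 100                        # default retention count listed above

retention_window = SNAPSHOT_INTERVAL * MAX_SNAPSHOTS
print(retention_window)                         # 2 days, 2:00:00
print(retention_window.total_seconds() / 3600)  # 50.0
```

So the oldest restorable point is roughly 50 hours back. Anything older needs a custom repository or a longer-lived snapshot policy.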

**Plugins:**
- Most plugins are pre-installed
- Custom plugins require support ticket

Click **Create deployment** and wait 5-10 minutes.

---

## Step 3: Save Your Credentials

When deployment completes, you&apos;ll see:

```
Elasticsearch endpoint: https://my-deployment.es.eu-west-1.aws.found.io:9243
Kibana endpoint: https://my-deployment.kb.eu-west-1.aws.found.io:9243

Username: elastic
Password: &lt;generated-password&gt;
```

**Save these immediately** - the password is only shown once.

If you lose it:
1. Go to deployment → Security
2. Reset the elastic user password

---

## Step 4: Initial Kibana Setup

### Access Kibana

1. Click the Kibana link or navigate to your Kibana endpoint
2. Log in with the elastic superuser
3. You&apos;ll see the Kibana home page

### Create Your First Space

Spaces let you organize dashboards and access by team:

1. Go to **Stack Management → Kibana → Spaces**
2. Create spaces like:
   - `production-logs`
   - `security-team`
   - `platform-team`

### Set Up Index Patterns (Data Views)

Before you can visualize data, you need data views:

1. Go to **Stack Management → Kibana → Data Views**
2. Click **Create data view**
3. For logs: `logs-*` or `filebeat-*`
4. For metrics: `metrics-*` or `metricbeat-*`
5. Select `@timestamp` as the time field

---

## Step 5: Security Configuration

### Create Service Accounts

Never use the `elastic` superuser for applications. Create dedicated accounts:

**Via Kibana:**
1. **Stack Management → Security → Users**
2. Create users for each service:

```
Username: logstash_writer
Role: logstash_writer (custom role - create it first)
Password: &lt;strong-password&gt;

Username: beats_writer
Role: beats_writer (custom role - create it first)
Password: &lt;strong-password&gt;

Username: apm_writer
Role: apm_user (built-in; deprecated since 7.13)
Password: &lt;strong-password&gt;
```

**Via API (for automation):**

```bash
# Create a custom role
curl -X PUT &quot;https://your-deployment.es.region.aws.found.io:9243/_security/role/logs_writer&quot; \
  -u elastic:password \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;cluster&quot;: [&quot;monitor&quot;, &quot;manage_index_templates&quot;, &quot;manage_ilm&quot;],
  &quot;indices&quot;: [
    {
      &quot;names&quot;: [&quot;logs-*&quot;, &quot;filebeat-*&quot;],
      &quot;privileges&quot;: [&quot;create_index&quot;, &quot;write&quot;, &quot;create&quot;, &quot;auto_configure&quot;]
    }
  ]
}&apos;

# Create a user with that role
curl -X PUT &quot;https://your-deployment.es.region.aws.found.io:9243/_security/user/logs_writer&quot; \
  -u elastic:password \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;password&quot;: &quot;your-secure-password&quot;,
  &quot;roles&quot;: [&quot;logs_writer&quot;],
  &quot;full_name&quot;: &quot;Logs Writer Service Account&quot;
}&apos;
```

### API Keys (Recommended)

For machine-to-machine auth, API keys are better than passwords:

```bash
# Create an API key for Filebeat
curl -X POST &quot;https://your-deployment.es.region.aws.found.io:9243/_security/api_key&quot; \
  -u elastic:password \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;name&quot;: &quot;filebeat-prod-servers&quot;,
  &quot;role_descriptors&quot;: {
    &quot;filebeat_writer&quot;: {
      &quot;cluster&quot;: [&quot;monitor&quot;, &quot;read_ilm&quot;],
      &quot;indices&quot;: [
        {
          &quot;names&quot;: [&quot;filebeat-*&quot;, &quot;logs-*&quot;],
          &quot;privileges&quot;: [&quot;create_index&quot;, &quot;create_doc&quot;, &quot;auto_configure&quot;]
        }
      ]
    }
  },
  &quot;expiration&quot;: &quot;365d&quot;
}&apos;
```

Response:
```json
{
  &quot;id&quot;: &quot;abc123&quot;,
  &quot;name&quot;: &quot;filebeat-prod-servers&quot;,
  &quot;api_key&quot;: &quot;xyz789...&quot;,
  &quot;encoded&quot;: &quot;YWJjMTIzOnh5ejc4OS4u&quot;  // Base64(id:api_key) - for the Authorization: ApiKey header
}
```

Beats expects the key as `id:api_key`, not the Base64 `encoded` value (that one goes in an `Authorization: ApiKey` HTTP header):

```yaml
output.elasticsearch:
  hosts: [&quot;https://your-deployment.es.region.aws.found.io:9243&quot;]
  api_key: &quot;abc123:xyz789...&quot;
```
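
For clients that talk to Elasticsearch over plain HTTP, the `encoded` field is simply Base64 of `id:api_key`. A small sketch of deriving it and building the header, using a made-up id and key:

```python
import base64


def encode_api_key(key_id: str, api_key: str) -> str:
    """Return Base64("id:api_key"), the same value the API returns as `encoded`."""
    raw = f"{key_id}:{api_key}".encode("utf-8")
    return base64.b64encode(raw).decode("ascii")


def auth_header(key_id: str, api_key: str) -> dict:
    # Elasticsearch accepts API keys via the `Authorization: ApiKey ...` header.
    return {"Authorization": f"ApiKey {encode_api_key(key_id, api_key)}"}


# Hypothetical id and key, for illustration only.
print(auth_header("abc123", "xyz789"))
# {'Authorization': 'ApiKey YWJjMTIzOnh5ejc4OQ=='}
```

Store the `id` and `api_key` when the key is created. The secret part cannot be retrieved again afterwards.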

### Enable SSO (Optional but Recommended)

For team access, configure SAML or OIDC:

1. **Deployment → Security → User authentication**
2. Configure your identity provider (Okta, Azure AD, Google)
3. Map groups to Elastic roles

---

## Step 6: Index Lifecycle Management (ILM)

ILM automatically manages index rollover, tiering, and deletion. **This is critical for cost control.**

### Understanding Data Tiers

```
Hot Tier  →  Warm Tier  →  Cold Tier  →  Frozen Tier  →  Delete
(fast SSD)   (cheaper)    (cheapest)    (S3 backed)
 0-7 days    7-30 days    30-90 days    90-365 days    365+ days
```

### Create an ILM Policy

**Via Kibana:**
1. **Stack Management → Index Lifecycle Policies**
2. Click **Create policy**

**Via API:**

```bash
curl -X PUT &quot;https://your-deployment.es.region.aws.found.io:9243/_ilm/policy/logs-policy&quot; \
  -u elastic:password \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;policy&quot;: {
    &quot;phases&quot;: {
      &quot;hot&quot;: {
        &quot;min_age&quot;: &quot;0ms&quot;,
        &quot;actions&quot;: {
          &quot;rollover&quot;: {
            &quot;max_size&quot;: &quot;50gb&quot;,
            &quot;max_age&quot;: &quot;1d&quot;,
            &quot;max_docs&quot;: 100000000
          },
          &quot;set_priority&quot;: {
            &quot;priority&quot;: 100
          }
        }
      },
      &quot;warm&quot;: {
        &quot;min_age&quot;: &quot;7d&quot;,
        &quot;actions&quot;: {
          &quot;set_priority&quot;: {
            &quot;priority&quot;: 50
          },
          &quot;shrink&quot;: {
            &quot;number_of_shards&quot;: 1
          },
          &quot;forcemerge&quot;: {
            &quot;max_num_segments&quot;: 1
          },
          &quot;allocate&quot;: {
            &quot;number_of_replicas&quot;: 1
          }
        }
      },
      &quot;cold&quot;: {
        &quot;min_age&quot;: &quot;30d&quot;,
        &quot;actions&quot;: {
          &quot;set_priority&quot;: {
            &quot;priority&quot;: 0
          },
          &quot;allocate&quot;: {
            &quot;number_of_replicas&quot;: 0
          }
        }
      },
      &quot;delete&quot;: {
        &quot;min_age&quot;: &quot;90d&quot;,
        &quot;actions&quot;: {
          &quot;delete&quot;: {}
        }
      }
    }
  }
}&apos;
```
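
To sanity-check a policy like this before applying it, it can help to model which phase an index should be in at a given age. A rough sketch, assuming age is counted from rollover (which is how ILM measures `min_age` for rolled-over indices). The function name is hypothetical:

```python
# (phase, min_age in days) copied from the logs-policy example above.
PHASES = [("hot", 0), ("warm", 7), ("cold", 30), ("delete", 90)]


def phase_for_age(days_since_rollover: float) -> str:
    """Return the last phase whose min_age has been reached."""
    current = PHASES[0][0]
    for name, min_age_days in PHASES:
        if days_since_rollover >= min_age_days:
            current = name
    return current


for age in (0, 7, 29, 30, 120):
    print(age, phase_for_age(age))
# 0 hot
# 7 warm
# 29 warm
# 30 cold
# 120 delete
```

This only models the `min_age` thresholds. Real transitions also wait for the actions in the previous phase (shrink, force merge) to finish.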

### Apply ILM to Index Templates

Create an index template that uses your ILM policy. Because the template declares `data_stream`, rollover is handled by the data stream itself, so no `index.lifecycle.rollover_alias` is needed:

```bash
curl -X PUT &quot;https://your-deployment.es.region.aws.found.io:9243/_index_template/logs-template&quot; \
  -u elastic:password \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;index_patterns&quot;: [&quot;logs-*&quot;],
  &quot;template&quot;: {
    &quot;settings&quot;: {
      &quot;number_of_shards&quot;: 3,
      &quot;number_of_replicas&quot;: 1,
      &quot;index.lifecycle.name&quot;: &quot;logs-policy&quot;
    }
  },
  &quot;composed_of&quot;: [],
  &quot;priority&quot;: 200,
  &quot;data_stream&quot;: {}
}&apos;
```

---

## Step 7: Data Ingestion Setup

### Option A: Elastic Agent (Recommended)

Elastic Agent is the unified way to collect all data types:

1. **Kibana → Fleet → Add agent**
2. Create an agent policy (e.g., &quot;Production Servers&quot;)
3. Add integrations:
   - System (CPU, memory, disk)
   - Custom logs
   - Docker/Kubernetes
   - Cloud provider metrics

4. Install on your servers:

```bash
# Download
curl -L -O https://artifacts.elastic.co/downloads/beats/elastic-agent/elastic-agent-8.12.0-linux-x86_64.tar.gz
tar xzvf elastic-agent-8.12.0-linux-x86_64.tar.gz
cd elastic-agent-8.12.0-linux-x86_64

# Enroll (Fleet URL and token from Kibana)
sudo ./elastic-agent install \
  --url=https://your-deployment.fleet.region.aws.found.io:443 \
  --enrollment-token=YOUR_ENROLLMENT_TOKEN
```

### Option B: Filebeat (Logs Only)

For simpler log collection:

```yaml
# filebeat.yml
filebeat.inputs:
  - type: filestream          # replaces the deprecated `log` input
    id: nginx-logs            # each filestream input needs a unique id
    enabled: true
    paths:
      - /var/log/nginx/*.log
    fields:
      environment: production
      service: nginx

output.elasticsearch:
  hosts: [&quot;https://your-deployment.es.region.aws.found.io:9243&quot;]
  api_key: &quot;your-api-key&quot;   # Beats format: id:api_key
  # A custom `index` also requires setup.template.name and setup.template.pattern,
  # so the default filebeat-* data stream is used here.

setup.ilm.enabled: true
setup.ilm.policy_name: &quot;logs-policy&quot;
```

### Option C: Logstash (Complex Processing)

For advanced transformations:

```ruby
# logstash.conf
input {
  beats {
    port =&gt; 5044
  }
}

filter {
  grok {
    match =&gt; { &quot;message&quot; =&gt; &quot;%{COMBINEDAPACHELOG}&quot; }
  }
  
  geoip {
    source =&gt; &quot;clientip&quot;
  }
  
  date {
    match =&gt; [&quot;timestamp&quot;, &quot;dd/MMM/yyyy:HH:mm:ss Z&quot;]
  }
}

output {
  elasticsearch {
    hosts =&gt; [&quot;https://your-deployment.es.region.aws.found.io:9243&quot;]
    api_key =&gt; &quot;your-api-key&quot;  # format: id:api_key
    data_stream =&gt; true
    data_stream_type =&gt; &quot;logs&quot;
    data_stream_dataset =&gt; &quot;nginx&quot;
    data_stream_namespace =&gt; &quot;production&quot;
  }
}
```

---

## Step 8: Monitoring Your Deployment

### Deployment Metrics

In Elastic Cloud Console:
1. Click your deployment
2. Go to **Monitoring**
3. View:
   - CPU/Memory usage
   - Disk usage
   - Request rate
   - Search/Index latency

### Stack Monitoring in Kibana

For deeper insights:
1. **Kibana → Stack Monitoring**
2. Enable self-monitoring if prompted
3. View:
   - Cluster health
   - Node metrics
   - Index stats
   - Logstash pipeline metrics

### Set Up Alerts

**Via Kibana:**
1. **Stack Management → Rules and Connectors**
2. Create rules for:

```
- Cluster health is not green
- Disk usage &gt; 80%
- CPU usage &gt; 90% for 5 minutes
- No data received in 10 minutes
- Search latency &gt; 500ms
```

**Notification channels:**
- Email
- Slack
- PagerDuty
- Webhook

---

## Step 9: Cost Optimization

### Right-Sizing Your Deployment

Start small and scale up. Monitor for 2 weeks, then adjust:

```
If CPU consistently &lt; 30%: Scale down
If CPU consistently &gt; 70%: Scale up
If memory pressure high: Add more RAM
If disk &gt; 80%: Add storage or review ILM
```
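
Those rules of thumb are easy to turn into a check you can run against exported metrics. A sketch with the thresholds taken from the list above; the function name and inputs are hypothetical:

```python
def sizing_advice(avg_cpu_pct: float, disk_used_pct: float, memory_pressure_high: bool) -> list:
    """Apply the rules of thumb above to roughly two weeks of averaged metrics."""
    advice = []
    if avg_cpu_pct > 70:
        advice.append("scale up")
    elif 30 > avg_cpu_pct:
        advice.append("scale down")
    if memory_pressure_high:
        advice.append("add more RAM")
    if disk_used_pct > 80:
        advice.append("add storage or review ILM")
    return advice or ["no change"]


print(sizing_advice(22, 85, False))  # ['scale down', 'add storage or review ILM']
print(sizing_advice(50, 40, False))  # ['no change']
```

Treat conflicting advice (low CPU but full disks, for example) as a sign that retention, not compute, is the thing to fix.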

### Autoscaling (Recommended)

Enable autoscaling to handle traffic spikes:

1. **Deployment → Edit**
2. Enable autoscaling for hot tier
3. Set min/max bounds

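Example:
```
Hot tier:
  Current size: 4GB
  Maximum size: 32GB
```

Elastic Cloud scales data tiers up based on storage usage, up to the maximum you set. Data tiers are not scaled back down automatically, so keep the maximum at a cost you are happy to pay.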

### Data Tiering Strategy

Move old data to cheaper tiers:

| Age | Tier | Approximate Cost |
|-----|------|------------------|
| 0-7 days | Hot | $$$$ |
| 7-30 days | Warm | $$$ |
| 30-90 days | Cold | $$ |
| 90+ days | Frozen | $ |

Frozen tier uses searchable snapshots - data lives in object storage (S3/GCS) but remains searchable.

### Reserved Capacity

If your usage is predictable, commit to reserved capacity for discounts:
- 1-year: ~30% discount
- 3-year: ~50% discount

---

## Step 10: Backup and Disaster Recovery

### Automated Snapshots

Elastic Cloud takes automatic snapshots:
- Every 30 minutes
- Stored in Elastic&apos;s secure repository
- Retained based on your plan

### Cross-Cluster Replication (CCR)

For true DR, replicate to another region:

1. Create a secondary deployment in another region
2. **Stack Management → Remote Clusters**
3. Add your primary cluster as remote
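4. Set up an auto-follow pattern (the follow API takes a single leader index, so wildcards go here instead; it only applies to indices created after the pattern exists):

```bash
curl -X PUT &quot;https://secondary.es.region.aws.found.io:9243/_ccr/auto_follow/logs-pattern&quot; \
  -u elastic:password \
  -H &apos;Content-Type: application/json&apos; -d&apos;
{
  &quot;remote_cluster&quot;: &quot;primary-cluster&quot;,
  &quot;leader_index_patterns&quot;: [&quot;logs-*&quot;]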
}&apos;
```

### Manual Snapshots

For compliance or long-term retention:

```bash
# Run these in Kibana Dev Tools (Console), not a shell
# Create a custom repository (requires support to enable)
PUT _snapshot/my-s3-repo
{
  &quot;type&quot;: &quot;s3&quot;,
  &quot;settings&quot;: {
    &quot;bucket&quot;: &quot;my-elasticsearch-backups&quot;,
    &quot;region&quot;: &quot;eu-west-1&quot;
  }
}

# Take a snapshot
PUT _snapshot/my-s3-repo/snapshot-2024-01?wait_for_completion=true
{
  &quot;indices&quot;: &quot;logs-*&quot;,
  &quot;include_global_state&quot;: false
}
```

---

## Step 11: Common Integrations

### AWS Integration

Collect CloudWatch logs and metrics:

1. **Kibana → Integrations → AWS**
2. Configure:
   - Access Key / Secret Key (or IAM role)
   - Regions to monitor
   - Services: CloudWatch, S3, ELB, EC2, etc.

### Kubernetes Integration

For K8s observability:

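1. Install the ECK operator (CRDs first), which can then run Elastic Agent as a DaemonSet:
```bash
kubectl create -f https://download.elastic.co/downloads/eck/2.11.0/crds.yaml
kubectl apply -f https://download.elastic.co/downloads/eck/2.11.0/operator.yaml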
```

2. Or use Helm:
```bash
helm repo add elastic https://helm.elastic.co
helm install elastic-agent elastic/elastic-agent \
  --set kubernetes.enabled=true \
  --set outputs.default.type=elasticsearch \
  --set outputs.default.hosts=&apos;[&quot;https://your-deployment.es.region.aws.found.io:9243&quot;]&apos; \
  --set outputs.default.api_key=&apos;your-api-key&apos;
```

### APM (Application Performance Monitoring)

1. **Kibana → APM → Add agent**
2. Install agent for your language:

**Node.js:**
```javascript
const apm = require(&apos;elastic-apm-node&apos;).start({
  serviceName: &apos;my-api&apos;,
  serverUrl: &apos;https://your-apm.apm.region.aws.found.io:443&apos;,
  secretToken: &apos;your-secret-token&apos;,
  environment: &apos;production&apos;
});
```

**Python:**
```python
from flask import Flask
from elasticapm.contrib.flask import ElasticAPM

app = Flask(__name__)
apm = ElasticAPM(app,
    service_name=&apos;my-api&apos;,
    server_url=&apos;https://your-apm.apm.region.aws.found.io:443&apos;,
    secret_token=&apos;your-secret-token&apos;,
    environment=&apos;production&apos;
)
```

---

## Troubleshooting

### Deployment Won&apos;t Start

1. Check deployment activity log
2. Common causes:
   - Invalid configuration
   - Quota exceeded
   - Region capacity issues

### Can&apos;t Connect

```bash
# Test connectivity
curl -v https://your-deployment.es.region.aws.found.io:9243

# Test auth
curl -u elastic:password https://your-deployment.es.region.aws.found.io:9243/_cluster/health
```

Common issues:
- Wrong credentials
- IP allowlist blocking you
- Network/firewall issues
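
Once the auth test returns JSON, the `status` field of the `_cluster/health` response tells you whether the problem is connectivity or the cluster itself. A minimal sketch of reading it, with a trimmed, made-up response body:

```python
import json


def interpret_health(body: str) -> str:
    """Summarise a `_cluster/health` JSON response in one line."""
    health = json.loads(body)
    status = health.get("status", "unknown")
    unassigned = health.get("unassigned_shards", 0)
    if status == "green":
        return "green: all primary and replica shards are assigned"
    if status == "yellow":
        return f"yellow: primaries assigned, {unassigned} shard(s) still unassigned"
    if status == "red":
        return f"red: at least one primary shard is unassigned ({unassigned} total)"
    return f"unexpected status: {status}"


# Trimmed, hypothetical response body for illustration.
sample = '{"cluster_name": "demo", "status": "yellow", "unassigned_shards": 3}'
print(interpret_health(sample))
# yellow: primaries assigned, 3 shard(s) still unassigned
```

A yellow cluster still serves reads and writes, so it rarely explains a failed connection. Red is the state worth paging on.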

### Slow Queries

1. Check **Stack Monitoring → Indices**
2. Look for:
   - Large shards (&gt;50GB)
   - Many small shards
   - Missing replicas

Fixes:
- Add more hot tier capacity
- Optimize ILM for faster rollover
- Review query patterns

### High Costs

1. **Deployment → Usage**
2. Identify cost drivers:
   - Over-provisioned tiers
   - Too many replicas
   - Data not moving to cheaper tiers
   - Retaining data too long

---

## Production Checklist

```markdown
## Initial Setup
- [ ] Create deployment with appropriate template
- [ ] Save elastic password securely
- [ ] Enable 2FA on Elastic Cloud account

## Security
- [ ] Create service accounts (don&apos;t use elastic user)
- [ ] Generate API keys for applications
- [ ] Configure SSO for team access
- [ ] Set up IP allowlist if needed
- [ ] Review and restrict default roles

## Data Management
- [ ] Configure ILM policies
- [ ] Set appropriate retention periods
- [ ] Enable data tiering (warm/cold/frozen)
- [ ] Test index rollover

## Ingestion
- [ ] Set up Elastic Agent or Beats
- [ ] Verify data is flowing
- [ ] Check index patterns/data views

## Monitoring
- [ ] Enable Stack Monitoring
- [ ] Set up alerting rules
- [ ] Configure notification channels

## Backup/DR
- [ ] Verify automated snapshots
- [ ] Test restore process
- [ ] Consider CCR for critical data

## Cost
- [ ] Enable autoscaling with reasonable bounds
- [ ] Review ILM to move data to cheaper tiers
- [ ] Consider reserved capacity for stable workloads
```

---

## Key Takeaways

1. **Start small, scale up** - Elastic Cloud makes scaling easy
2. **Use API keys, not passwords** - More secure, easier to rotate
3. **ILM is critical** - Without it, costs spiral and performance degrades
4. **Data tiering saves money** - Hot data is expensive, archive aggressively
5. **Monitor your monitoring** - Set up alerts for your Elastic deployment itself
6. **Autoscaling is your friend** - Handles spikes without over-provisioning

Elastic Cloud removes the operational burden of running Elasticsearch, but you still need to configure it properly. Get security, ILM, and tiering right from the start, and you&apos;ll have a production-ready observability platform.

---

*Questions about Elastic Cloud? Find me on [LinkedIn](https://linkedin.com/in/moabukar) or [GitHub](https://github.com/moabukar).*</content:encoded><category>elasticsearch</category><category>elastic-cloud</category><category>observability</category><category>logging</category><category>saas</category><category>managed-services</category><author>Mo Abukar</author></item><item><title>Clawdbot Manual Setup – Step-by-Step VPS Configuration with WhatsApp Integration</title><link>https://moabukar.co.uk/blog/clawdbot-manual/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/clawdbot-manual/</guid><description>A detailed walkthrough for setting up Clawdbot on a Hetzner VPS from scratch – SSH hardening, firewall configuration, Tailscale, and WhatsApp Business integration using a dedicated number.</description><pubDate>Tue, 27 Jan 2026 00:00:00 GMT</pubDate><content:encoded>Clawdbot has taken off in January 2026. The pace of development is relentless – new skills, integrations, and features landing daily.

The docs are decent, but I hit enough edge cases that I wanted to document the entire process from zero to working WhatsApp assistant. This is the manual setup – no Terraform, no automation – just SSH and a terminal.

If you want the IaC approach, check out my [Terraform-based setup](/blog/clawdbot-automated). This guide is for those who want to understand every step.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/whatsapp.svg&quot; alt=&quot;WhatsApp logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## Table of Contents

1. [Create VPS](#1-create-vps)
2. [Basic VPS Setup](#2-basic-vps-setup)
3. [Firewall (UFW)](#3-firewall-ufw)
4. [Prepare Your WhatsApp Number](#4-prepare-your-whatsapp-number)
5. [Installing Prerequisites](#5-installing-prerequisites)
6. [Tailscale Setup](#6-tailscale-setup)
7. [Clawdbot Installation](#7-clawdbot-installation)
8. [Use Cases and Configuration](#8-use-cases-and-configuration)
9. [Hardening Your VPS](#9-hardening-your-vps)

---

## 1. Create VPS

Log into [Hetzner Cloud](https://console.hetzner.cloud/) and create a new server.

**Recommended specs:**
- **Type**: CX22 (2 vCPU, 4GB RAM) – €4.51/month
- **Image**: Ubuntu 24.04
- **Location**: Pick closest to you (I use `nbg1` – Nuremberg)
- **SSH Key**: Add your public key here if you have one

The cheapest option (CX22) is more than enough. Clawdbot is lightweight – the gateway idles at ~100MB RAM.

Once created, note your server&apos;s IP address.

---

## 2. Basic VPS Setup

SSH into the server as root:

```bash
ssh root@your-server-ip
```

If you didn&apos;t add an SSH key during creation, you&apos;ll receive a root password via email. Use it to log in.

### Update the System

```bash
apt update &amp;&amp; apt upgrade -y
```

If prompted to reboot for kernel updates:

```bash
reboot
```

Wait 30 seconds, then reconnect.

---

## 3. Firewall (UFW)

Before opening ports for anything, set up a basic firewall. UFW is a simple frontend for iptables:

```bash
apt install ufw -y
```

**Critical**: Allow SSH before enabling the firewall:

```bash
ufw allow ssh
```

Set default policies:

```bash
ufw default deny incoming
ufw default allow outgoing
```

Enable the firewall:

```bash
ufw enable
```

Type `y` to confirm.

Check status:

```bash
ufw status
```

Output should show SSH allowed:

```
Status: active

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
22/tcp (v6)                ALLOW       Anywhere (v6)
```
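
If you script this step, it is worth checking that output before you close your only SSH session. A small sketch that parses the `ufw status` text shown above; the function name is hypothetical, and it assumes the rule is displayed as `22/tcp` rather than an application profile name:

```python
def ssh_allowed(ufw_status: str) -> bool:
    """Return True if `ufw status` output is active and has an ALLOW rule for 22/tcp."""
    lines = [line.split() for line in ufw_status.splitlines() if line.strip()]
    active = ["Status:", "active"] in lines
    # Rule rows look like: ['22/tcp', 'ALLOW', 'Anywhere']
    has_rule = any(parts[0] == "22/tcp" and "ALLOW" in parts[1:] for parts in lines)
    return active and has_rule


SAMPLE = """Status: active

To                         Action      From
--                         ------      ----
22/tcp                     ALLOW       Anywhere
22/tcp (v6)                ALLOW       Anywhere (v6)
"""

print(ssh_allowed(SAMPLE))              # True
print(ssh_allowed("Status: inactive"))  # False
```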

---

## 4. Prepare Your WhatsApp Number

Do this **before** installing Clawdbot - you&apos;ll need a working WhatsApp Business account ready when the onboarding wizard asks you to scan a QR code.

### Get a Dedicated Phone Number

**Do not use your personal WhatsApp number.** If something goes wrong, you risk your main account getting flagged.

Buy a cheap eSIM or SIM for SMS verification:

- **Lyca Mobile eSIM** - cheapest option, works without a physical SIM slot. Order from the Lyca app or website, activate it on your phone, and you&apos;ll have a number within minutes
- [giffgaff](https://www.giffgaff.com/) - £10 gets you a UK number, pay-as-you-go, no contract
- [Lebara](https://www.lebara.co.uk/) - similar pricing
- Any budget MVNO works - you only need it for one SMS verification

You only need the number to receive one SMS. After that, the SIM can live in a drawer.

### Set Up WhatsApp Business

1. **Install WhatsApp Business** (not regular WhatsApp) from the Play Store or App Store on your phone

2. **Register with your new eSIM/SIM number**:
   - Make sure the eSIM is active on your phone
   - Open WhatsApp Business and register with the new number
   - Verify via SMS - the code should arrive on the eSIM
   - Complete the business profile setup (name it something like &quot;Atlas&quot; or &quot;My Assistant&quot;)

3. **Keep WhatsApp Business installed** - you&apos;ll need it to scan the QR code during Clawdbot setup

&gt; **Tip**: If you have a dual-SIM phone, you can run both your personal WhatsApp (on your main number) and WhatsApp Business (on your eSIM) on the same device. They&apos;re separate apps.

---

## 5. Installing Prerequisites

### Node.js via nvm

Clawdbot requires Node.js. Use nvm for easy version management:

```bash
curl -o- https://raw.githubusercontent.com/nvm-sh/nvm/v0.40.3/install.sh | bash
```

Reload your shell:

```bash
source ~/.bashrc
```

Install Node 22:

```bash
nvm install 22
```

Verify:

```bash
node -v  # Should print v22.x.x
npm -v   # Should print 10.x.x
```

---

## 6. Tailscale Setup

Tailscale creates a private mesh network between your devices. This lets you access the Clawdbot dashboard securely without exposing ports to the internet.

### Create a Tailscale Account

1. Go to [tailscale.com](https://tailscale.com) and sign up (free tier is plenty)
2. Install Tailscale on your **local machine** too – this is how you&apos;ll access the VPS

### Install Tailscale on the VPS

```bash
curl -fsSL https://tailscale.com/install.sh | sh
```

Connect to your tailnet:

```bash
tailscale up
```

This prints a URL – open it in your browser and authenticate.

Once connected, check your Tailscale IP:

```bash
tailscale ip -4
```

You&apos;ll get an IP like `100.x.y.z`. This is your VPS&apos;s private Tailscale address.
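
Tailscale hands out IPv4 addresses from the shared CGNAT block `100.64.0.0/10`, which makes it easy to tell a tailnet address from a public one in scripts. A short sketch using only the standard library (the example addresses are made up):

```python
import ipaddress

# Tailscale assigns IPv4 addresses from the CGNAT block 100.64.0.0/10.
TAILNET_RANGE = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_ip(address: str) -> bool:
    """Return True if the address falls inside the Tailscale IPv4 range."""
    return ipaddress.ip_address(address) in TAILNET_RANGE


print(is_tailscale_ip("100.101.102.103"))  # True
print(is_tailscale_ip("203.0.113.10"))     # False
```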

### Verify Connectivity

From your **local machine** (with Tailscale installed and running):

```bash
ping 100.x.y.z  # Your VPS&apos;s Tailscale IP
```

If it responds, you&apos;re connected. You can now SSH via Tailscale:

```bash
ssh root@100.x.y.z
```

This works even if you later remove the public SSH rule from UFW.

---

## 7. Clawdbot Installation

### Install the Claude CLI

Clawdbot uses the Claude CLI under the hood. Install it first:

```bash
npm i -g @anthropic-ai/claude-code
```

Now get your Anthropic auth token. Open a **second terminal tab** (keep your main SSH session open) and run:

```bash
claude setup-token
```

This generates a token you can use on the VPS. Copy the token it gives you.

Back in your **main terminal**, set the token:

```bash
export ANTHROPIC_AUTH_TOKEN=&lt;your-token-here&gt;
```

Add it to your shell profile so it persists across sessions:

```bash
echo &apos;export ANTHROPIC_AUTH_TOKEN=&lt;your-token-here&gt;&apos; &gt;&gt; ~/.bashrc
```

&gt; **Note**: If you have a Claude Max subscription ($200/month), the usage limits work well for a 24/7 assistant. Alternatively, you can use a pay-as-you-go API key with `ANTHROPIC_API_KEY` instead.

### Install OpenClaw (Clawdbot)

```bash
npm i -g openclaw
```

This takes a minute or two. Once complete, start the onboarding wizard:

```bash
openclaw onboard
```

### Onboarding Walkthrough

The wizard is interactive. Here&apos;s the QuickStart flow:

```
◆  I understand this is powerful and inherently risky. Continue?
│  Yes
```

Accept the warning.

```
◆  Onboarding mode
│  ● QuickStart
│  ○ Manual (Configure port, network, Tailscale, and auth options.)
```

**QuickStart** - handles sensible defaults for you. You can always tweak settings later with `openclaw config`.

```
◆  Model/auth provider
│  ○ OpenAI (Codex OAuth + API key)
│  ● Anthropic
│  ○ MiniMax
│  ○ Qwen
│  ○ Synthetic
│  ○ Google
│  ○ Copilot
```

Select **Anthropic** (since we set up the Claude CLI token above).

```
◆  Configure chat channels now?
│  ● Yes / ○ No
```

**Yes** - select WhatsApp when prompted.

### WhatsApp QR Pairing

When you select WhatsApp, the wizard displays a QR code in the terminal. This is where your prepared WhatsApp Business account comes in.

On your phone:

1. Open **WhatsApp Business** (the app you set up in Step 4)
2. Go to **Settings** → **Linked Devices**
3. Tap **Link a Device**
4. Scan the QR code shown in your terminal

Once scanned, the terminal confirms the connection and the wizard continues.

```
◆  Install Gateway service (recommended)
│  ● Yes / ○ No
```

**Yes** - installs a systemd unit so the gateway starts automatically on boot.

```
◆  How do you want to hatch your bot?
│  ● Hatch in TUI (recommended)
│  ○ Open the Web UI
│  ○ Do this later
```

**Hatch in TUI** - drops you into an interactive terminal to complete setup and start chatting.

### Verify the Gateway

After hatching, the gateway should be running. Check:

```bash
systemctl status clawdbot-gateway
```

View logs:

```bash
journalctl -u clawdbot-gateway -f
```

### Test WhatsApp

From your **main phone** (your personal WhatsApp):

1. Add the business number as a contact
2. Open a chat with it
3. Send `/start`

Clawdbot should respond with a welcome message and pairing instructions.

### Useful Commands

The `openclaw` CLI is your main interface:

```bash
openclaw status          # Check gateway status
openclaw gateway start   # Start the gateway
openclaw gateway stop    # Stop the gateway
openclaw gateway restart # Restart the gateway
openclaw config          # Open config editor
openclaw help            # Full command list
```

&gt; **Note**: After initial linking, WhatsApp linked devices can operate independently for ~14 days. The phone only needs to come online periodically to keep the session active.

---

## 8. Use Cases and Configuration

You might be tempted to configure workflows via the Dashboard UI. Don&apos;t.

The chat interface is where Clawdbot shines. Describe what you want in natural language – it&apos;ll figure out the rest.

### Examples

**Daily tweet digest:**
&gt; &quot;Send me a summary of my bookmarked tweets every morning at 8am&quot;

Clawdbot walked me through setting up the `bird` skill (X/Twitter CLI), configuring OAuth, and scheduling the digest.

**RSS monitoring:**
&gt; &quot;Watch Hacker News for posts about Kubernetes and message me when something interesting comes up&quot;

It configured the RSS skill, set up keyword filtering, and sent a test notification.

**Git repo health checks:**
&gt; &quot;Every Monday, check my GitHub repos for stale PRs and dependency updates&quot;

Configured GitHub integration, scheduled the check, and formatted the report.

### The Key Insight

Clawdbot doesn&apos;t say &quot;I can&apos;t do that.&quot; It proposes a plan.

If a skill is missing, it&apos;ll suggest installing one. If an API key is needed, it&apos;ll explain where to get it. It&apos;s genuinely collaborative.

---

## 9. Hardening Your VPS

Your VPS is running and Clawdbot is working. Now let&apos;s lock things down properly. Skip this at your peril – bots will find your server within hours.

### Set Up SSH Key Authentication

On your **local machine** (not the server), generate an SSH key if you don&apos;t have one:

```bash
ssh-keygen -t ed25519 -C &quot;your_email@example.com&quot;
```

Press Enter to accept the default location. Set a passphrase if you want extra security.

Copy your public key to the server:

```bash
ssh-copy-id root@your-server-ip
```

**Test it works** – open a new terminal:

```bash
ssh root@your-server-ip
```

You should log in without a password prompt. If you set a passphrase on your SSH key, that&apos;s the only thing you&apos;ll be asked for.

### Disable Password Authentication

Now that key auth works, disable password login entirely. On the **server**:

```bash
vim /etc/ssh/sshd_config
```

Find and update these lines (uncomment if they have `#` in front):

```
PasswordAuthentication no
PubkeyAuthentication yes
```

Save and exit (`:wq` in vim).

Restart SSH:

```bash
systemctl restart ssh
```

&gt; **Warning**: Don&apos;t close your current SSH session until you&apos;ve verified you can connect in a new terminal. If you lock yourself out, you&apos;ll need Hetzner&apos;s console access.
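
To double-check the change before closing your session, you can ask `sshd` for its effective configuration. A minimal sketch, assuming you run it as root on the server; `auth_settings` is a helper name made up for this example. On images that ship drop-in files under `/etc/ssh/sshd_config.d/`, an included file can override the main config, which is why this reads the merged result rather than the file you edited:

```bash
# Hypothetical helper: keep only the two directives changed above
auth_settings() {
  grep -Ei '^(passwordauthentication|pubkeyauthentication) '
}

# Only query the daemon when sshd is installed and we are root
if [ -x "$(command -v sshd)" ]; then
  if [ "$(id -u)" -eq 0 ]; then
    sshd -t                    # syntax check: prints nothing when the config is valid
    sshd -T | auth_settings    # effective values after Include files are merged
  fi
fi
```

If the output still shows `passwordauthentication yes`, look for an overriding file in that directory.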

### Install fail2ban

fail2ban monitors auth logs and bans IPs that repeatedly fail to log in.

```bash
apt install fail2ban -y
```

Create a local config (so updates don&apos;t overwrite your settings):

```bash
cp /etc/fail2ban/jail.conf /etc/fail2ban/jail.local
vim /etc/fail2ban/jail.local
```

Find the `[sshd]` section and update it:

```ini
[sshd]
enabled = true
port = ssh
logpath = /var/log/auth.log
maxretry = 3
bantime = 86400
findtime = 600
```

This bans IPs for 24 hours after 3 failed attempts within 10 minutes.
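
The jail values are plain seconds, so the numbers are easy to sanity-check with shell arithmetic. This is just a quick conversion, nothing fail2ban-specific:

```bash
# Values copied from the [sshd] jail above (bantime and findtime are in seconds)
bantime=86400
findtime=600
maxretry=3

ban_hours=$((bantime / 3600))      # 86400 / 3600 = 24
find_minutes=$((findtime / 60))    # 600 / 60 = 10

echo "Ban for ${ban_hours}h after ${maxretry} failures within ${find_minutes}m"
```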

Start and enable fail2ban:

```bash
systemctl start fail2ban
systemctl enable fail2ban
```

Check it&apos;s working:

```bash
fail2ban-client status sshd
```

### Enable Automatic Security Updates

Install unattended-upgrades to automatically apply security patches:

```bash
apt install unattended-upgrades apt-listchanges -y
```

Configure it:

```bash
vim /etc/apt/apt.conf.d/50unattended-upgrades
```

Ensure these lines are uncommented:

```
Unattended-Upgrade::Allowed-Origins {
    &quot;${distro_id}:${distro_codename}-security&quot;;
};
```

Enable automatic updates:

```bash
vim /etc/apt/apt.conf.d/20auto-upgrades
```

Add:

```
APT::Periodic::Update-Package-Lists &quot;1&quot;;
APT::Periodic::Unattended-Upgrade &quot;1&quot;;
APT::Periodic::AutocleanInterval &quot;7&quot;;
```

### Kernel Hardening (Optional but Recommended)

Add sysctl settings to prevent common network attacks:

```bash
vim /etc/sysctl.d/99-security.conf
```

Add:

```ini
# IP Spoofing protection
net.ipv4.conf.all.rp_filter = 1
net.ipv4.conf.default.rp_filter = 1

# Ignore ICMP redirects
net.ipv4.conf.all.accept_redirects = 0
net.ipv6.conf.all.accept_redirects = 0
net.ipv4.conf.all.send_redirects = 0

# Ignore source routed packets
net.ipv4.conf.all.accept_source_route = 0
net.ipv6.conf.all.accept_source_route = 0

# Log Martian packets
net.ipv4.conf.all.log_martians = 1

# Ignore broadcast pings
net.ipv4.icmp_echo_ignore_broadcasts = 1
```

Apply:

```bash
sysctl -p /etc/sysctl.d/99-security.conf
```
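
To confirm the kernel actually picked up the values, you can read each key back. A rough sketch: `sysctl_keys` is a hypothetical helper that pulls the key names out of the file, and the path is the one used above.

```bash
# Hypothetical helper: list the keys set in a sysctl config file,
# skipping comments and blank lines
sysctl_keys() {
  grep -Ev '^[[:space:]]*(#|$)' "$1" | cut -d= -f1 | tr -d '[:blank:]'
}

conf=/etc/sysctl.d/99-security.conf
if [ -r "$conf" ]; then
  # Ask the kernel for the live value of every key in the file
  sysctl_keys "$conf" | xargs sysctl
fi
```

Each line of output should match what you put in the file.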

### Security Checklist

Before calling it done, verify:

- [x] SSH key-only authentication (password disabled)
- [x] fail2ban active with 24h bans
- [x] UFW firewall enabled (SSH only)
- [x] Automatic security updates configured
- [x] Kernel hardening applied
- [x] Clawdbot bound to loopback only
- [x] Tailscale Serve for private HTTPS access
- [x] Dedicated WhatsApp number (not personal)

---

## Troubleshooting

**Can&apos;t connect via SSH after disabling password auth:**
Use Hetzner&apos;s web console to access the server and fix `/etc/ssh/sshd_config`.

**Clawdbot gateway won&apos;t start:**
```bash
journalctl -u clawdbot-gateway -n 50 --no-pager
```
Check for port conflicts or missing dependencies.

**WhatsApp disconnects frequently:**
Ensure the spare phone stays online. Check battery optimisation settings. Consider running WhatsApp in an emulator for better reliability.

**Tailscale Serve not working:**

```bash
tailscale serve status
```
Verify the gateway is actually running on the configured port.

---

Questions? Find me on [LinkedIn](https://linkedin.com/in/moabukar) or drop a comment below.</content:encoded><category>clawdbot</category><category>hetzner</category><category>vps</category><category>whatsapp</category><category>devops</category><category>security</category><category>tutorial</category><author>Mo Abukar</author></item><item><title>Self-Hosted GitLab on Kubernetes - A Startup&apos;s Journey</title><link>https://moabukar.co.uk/blog/self-hosted-gitlab-kubernetes/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/self-hosted-gitlab-kubernetes/</guid><description>A detailed guide on deploying GitLab on AKS using Helm charts, with Azure SQL as the database backend. Covers architecture decisions, configuration, lessons learned, and the gotchas we hit in production.</description><pubDate>Sun, 25 Jan 2026 00:00:00 GMT</pubDate><content:encoded># Self-Hosted GitLab on Kubernetes - A Startup&apos;s Journey

When we hit 50 engineers at the startup, GitLab.com&apos;s pricing started to sting. The Premium tier at $29/user/month meant we were looking at $17,400/year just for source control and CI/CD. For a startup watching every pound, that&apos;s significant.

We decided to self-host GitLab on our existing AKS clusters. This post documents the complete setup - the architecture decisions, Helm configuration, Azure SQL integration, and the lessons we learned along the way.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/gitlab.svg&quot; alt=&quot;GitLab logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## Why Self-Host?

**The numbers:**
- GitLab Premium (50 users): ~$17,400/year
- Self-hosted on existing K8s: ~$3,600/year (compute + storage)
- **Savings: ~$13,800/year**
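
The arithmetic behind those figures, as a quick shell check. The variable names are illustrative; the numbers are the ones quoted above.

```bash
users=50
price_per_user_month=29     # GitLab Premium price in USD, as quoted above
self_hosted_year=3600       # our compute + storage estimate in USD

saas_year=$((users * price_per_user_month * 12))   # 50 * 29 * 12 = 17400
savings_year=$((saas_year - self_hosted_year))     # 17400 - 3600 = 13800

echo "GitLab Premium: \$${saas_year}/year, savings: \$${savings_year}/year"
```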

**Other benefits:**
- Full control over data (compliance requirement for us)
- No rate limits on CI/CD
- Custom runners on our own infrastructure
- Integration with internal services

**The trade-offs:**
- Operational overhead (upgrades, backups, monitoring)
- Need K8s expertise
- You own the uptime

For a startup with a competent platform team, the trade-off made sense.

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/self-hosted-gitlab-kubernetes](https://github.com/moabukar/blog-code/tree/main/self-hosted-gitlab-kubernetes)

---

## Architecture Overview

```
┌─────────────────────────────────────────────────────────────────┐
│                         AKS Cluster                              │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    gitlab namespace                      │    │
│  │                                                          │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐              │    │
│  │  │ Webservice│  │ Sidekiq  │  │  Gitaly  │              │    │
│  │  │ (Rails)   │  │ (Jobs)   │  │ (Git RPC)│              │    │
│  │  └────┬─────┘  └────┬─────┘  └────┬─────┘              │    │
│  │       │              │              │                    │    │
│  │  ┌────┴──────────────┴──────────────┴────┐              │    │
│  │  │              Redis Cluster             │              │    │
│  │  │         (Azure Cache for Redis)        │              │    │
│  │  └───────────────────────────────────────┘              │    │
│  │                                                          │    │
│  │  ┌──────────┐  ┌──────────┐  ┌──────────┐              │    │
│  │  │ Registry │  │   Shell  │  │  Toolbox │              │    │
│  │  │ (Images) │  │  (SSH)   │  │  (Rails) │              │    │
│  │  └──────────┘  └──────────┘  └──────────┘              │    │
│  │                                                          │    │
│  └─────────────────────────────────────────────────────────┘    │
│                                                                  │
│  ┌──────────────────┐     ┌──────────────────────────────┐     │
│  │ Ingress (NGINX)  │     │   Azure Files (Persistent)   │     │
│  │ + Cert-Manager   │     │   - Git repos (Gitaly)       │     │
│  └──────────────────┘     │   - LFS objects              │     │
│                            │   - Uploads                  │     │
└────────────────────────────┴──────────────────────────────┴─────┘
              │                              │
              │                              │
┌─────────────┴───────────┐    ┌────────────┴─────────────┐
│      Azure SQL          │    │   Azure Blob Storage     │
│   (PostgreSQL Flexible) │    │   - Backups              │
│   - gitlab_production   │    │   - CI artifacts         │
│   - gitlab_geo (if DR)  │    │   - Terraform state      │
└─────────────────────────┘    └──────────────────────────┘
```

### Key Decisions

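1. **External PostgreSQL (Azure SQL)** - GitLab&apos;s bundled PostgreSQL is fine for small installs, but for production we wanted managed backups, HA, and point-in-time recovery. (&quot;Azure SQL&quot; here is shorthand for Azure Database for PostgreSQL Flexible Server, not the SQL Server-based Azure SQL Database.)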

2. **External Redis (Azure Cache)** - Same reasoning. Plus, Redis is critical for GitLab - Sidekiq jobs, caching, sessions.

3. **Azure Files for Gitaly** - Git repositories need persistent storage. Azure Files Premium (NFS) gave us the IOPS we needed.

4. **Azure Blob for artifacts** - CI artifacts and LFS objects go to blob storage. It&apos;s cheaper, and capacity grows without pre-provisioning.

---

## Prerequisites

Before starting:

```bash
# AKS cluster running (we used 1.28)
# kubectl configured
# Helm 3.x installed
# Domain name ready (gitlab.yourcompany.com)
# SSL certificate (we used cert-manager with Let&apos;s Encrypt)

# Create namespace
kubectl create namespace gitlab

# Add GitLab Helm repo
helm repo add gitlab https://charts.gitlab.io/
helm repo update
```

---

## Step 1: Set Up Azure SQL (PostgreSQL)

### Create the Database Server

```bash
# Create resource group (if not exists)
az group create --name rg-gitlab-prod --location uksouth

# Create PostgreSQL Flexible Server
az postgres flexible-server create \
  --resource-group rg-gitlab-prod \
  --name gitlab-postgres-prod \
  --location uksouth \
  --admin-user gitlabadmin \
  --admin-password &apos;YourSecurePassword123!&apos; \
  --sku-name Standard_D4s_v3 \
  --tier GeneralPurpose \
  --storage-size 256 \
  --version 14 \
  --high-availability ZoneRedundant \
  --public-access None  # We&apos;ll use private endpoint
```

### Configure Private Endpoint

```bash
# Create private endpoint for PostgreSQL
az network private-endpoint create \
  --resource-group rg-gitlab-prod \
  --name gitlab-postgres-pe \
  --vnet-name aks-vnet \
  --subnet aks-subnet \
  --private-connection-resource-id $(az postgres flexible-server show \
    --resource-group rg-gitlab-prod \
    --name gitlab-postgres-prod \
    --query id -o tsv) \
  --group-id postgresqlServer \
  --connection-name gitlab-postgres-connection
```

### Create the Database

```bash
# Connect to PostgreSQL
az postgres flexible-server connect \
  --name gitlab-postgres-prod \
  --resource-group rg-gitlab-prod \
  --admin-user gitlabadmin \
  --admin-password &apos;YourSecurePassword123!&apos;

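# The statements below are SQL - run them inside the database session

-- Create database
CREATE DATABASE gitlab_production;

-- Create GitLab user with limited privileges
CREATE USER gitlab WITH ENCRYPTED PASSWORD &apos;GitLabUserPassword123!&apos;;
GRANT ALL PRIVILEGES ON DATABASE gitlab_production TO gitlab;

-- Required extensions (on Flexible Server, allow-list them via the azure.extensions parameter first)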
\c gitlab_production
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE EXTENSION IF NOT EXISTS plpgsql;
```

### PostgreSQL Configuration

GitLab needs specific PostgreSQL settings:

```bash
# Via Azure CLI
az postgres flexible-server parameter set \
  --resource-group rg-gitlab-prod \
  --server-name gitlab-postgres-prod \
  --name shared_preload_libraries \
  --value pg_stat_statements

az postgres flexible-server parameter set \
  --resource-group rg-gitlab-prod \
  --server-name gitlab-postgres-prod \
  --name max_connections \
  --value 200
```

---

## Step 2: Set Up Azure Cache for Redis

```bash
# Create Redis Cache (Premium for clustering)
az redis create \
  --resource-group rg-gitlab-prod \
  --name gitlab-redis-prod \
  --location uksouth \
  --sku Premium \
  --vm-size P1 \
  --enable-non-ssl-port false \
  --minimum-tls-version 1.2

# Get connection details
az redis show \
  --resource-group rg-gitlab-prod \
  --name gitlab-redis-prod \
  --query &quot;[hostName, sslPort, accessKeys.primaryKey]&quot; -o tsv
```

---

## Step 3: Set Up Azure Blob Storage

```bash
# Create storage account
az storage account create \
  --resource-group rg-gitlab-prod \
  --name gitlabstorageprod \
  --location uksouth \
  --sku Standard_ZRS \
  --kind StorageV2

# Create containers
az storage container create --name gitlab-artifacts --account-name gitlabstorageprod
az storage container create --name gitlab-uploads --account-name gitlabstorageprod
az storage container create --name gitlab-lfs --account-name gitlabstorageprod
az storage container create --name gitlab-packages --account-name gitlabstorageprod
az storage container create --name gitlab-backups --account-name gitlabstorageprod
az storage container create --name gitlab-registry --account-name gitlabstorageprod

# Get connection string
az storage account show-connection-string \
  --resource-group rg-gitlab-prod \
  --name gitlabstorageprod \
  --query connectionString -o tsv
```

---

## Step 4: Create Kubernetes Secrets

```bash
# PostgreSQL password
kubectl create secret generic gitlab-postgresql-password \
  --namespace gitlab \
  --from-literal=postgresql-password=&apos;GitLabUserPassword123!&apos;

# Redis password
kubectl create secret generic gitlab-redis-password \
  --namespace gitlab \
  --from-literal=redis-password=&apos;YourRedisPassword&apos;

# Azure Storage credentials
kubectl create secret generic gitlab-azure-storage \
  --namespace gitlab \
  --from-literal=connection=&apos;DefaultEndpointsProtocol=https;AccountName=gitlabstorageprod;AccountKey=YOUR_KEY;EndpointSuffix=core.windows.net&apos;

# GitLab Rails secret (generate a random one)
kubectl create secret generic gitlab-rails-secret \
  --namespace gitlab \
  --from-literal=secret=&quot;$(openssl rand -hex 64)&quot;

# Initial root password
kubectl create secret generic gitlab-initial-root-password \
  --namespace gitlab \
  --from-literal=password=&apos;YourGitLabRootPassword123!&apos;
```
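
When a secret is generated inline with `$(...)`, the quoting decides what ends up in Kubernetes. A small illustration of the shell rule, using a stand-in command rather than `openssl`:

```bash
# Single quotes: the text is stored exactly as typed, nothing is executed
literal='$(printf generated-value)'

# Double quotes: the command runs and its output is stored
expanded="$(printf generated-value)"

echo "single: ${literal}"
echo "double: ${expanded}"
```

So a `--from-literal` value that should contain generated output needs double quotes around the `$(...)`.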

---

## Step 5: The Helm Values File

This is the critical part. Here&apos;s our production `values-production.yaml`:

```yaml
# values-production.yaml

global:
  # Domain configuration
  hosts:
    domain: yourcompany.com
    gitlab:
      name: gitlab.yourcompany.com
      https: true
    registry:
      name: registry.yourcompany.com
      https: true
    minio:
      enabled: false  # We use Azure Blob instead
    
  # Ingress configuration
  ingress:
    class: nginx
    annotations:
      kubernetes.io/tls-acme: &quot;true&quot;
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/proxy-body-size: &quot;0&quot;
      nginx.ingress.kubernetes.io/proxy-read-timeout: &quot;900&quot;
      nginx.ingress.kubernetes.io/proxy-connect-timeout: &quot;900&quot;
    configureCertmanager: false  # We manage cert-manager separately
    tls:
      enabled: true
      secretName: gitlab-tls

  # Time zone
  time_zone: Europe/London

  # Email configuration
  email:
    from: gitlab@yourcompany.com
    display_name: GitLab
    reply_to: noreply@yourcompany.com

  smtp:
    enabled: true
    address: smtp.sendgrid.net
    port: 587
    authentication: plain
    user_name: apikey
    password:
      secret: gitlab-smtp-password
      key: password
    starttls_auto: true

  # ============================================
  # EXTERNAL POSTGRESQL (Azure SQL)
  # ============================================
  psql:
    host: gitlab-postgres-prod.postgres.database.azure.com
    port: 5432
    database: gitlab_production
    username: gitlab
    password:
      secret: gitlab-postgresql-password
      key: postgresql-password
    ssl:
      enabled: true
      # Azure requires SSL
    
  # ============================================
  # EXTERNAL REDIS (Azure Cache)
  # ============================================
  redis:
    host: gitlab-redis-prod.redis.cache.windows.net
    port: 6380
    password:
      enabled: true
      secret: gitlab-redis-password
      key: redis-password
    scheme: rediss  # SSL

  # ============================================
  # GITALY (Git repository storage)
  # ============================================
  gitaly:
    enabled: true
    authToken:
      secret: gitlab-gitaly-secret
      key: token
    internal:
      names:
        - default
    external: []

  # ============================================
  # OBJECT STORAGE (Azure Blob)
  # ============================================
  minio:
    enabled: false  # Disable bundled MinIO

  appConfig:
    # LFS
    lfs:
      enabled: true
      proxy_download: true
      bucket: gitlab-lfs
      connection:
        secret: gitlab-rails-storage
        key: connection

    # Artifacts  
    artifacts:
      enabled: true
      proxy_download: true
      bucket: gitlab-artifacts
      connection:
        secret: gitlab-rails-storage
        key: connection

    # Uploads
    uploads:
      enabled: true
      proxy_download: true
      bucket: gitlab-uploads
      connection:
        secret: gitlab-rails-storage
        key: connection

    # Packages
    packages:
      enabled: true
      proxy_download: true
      bucket: gitlab-packages
      connection:
        secret: gitlab-rails-storage
        key: connection

    # Backups
    backups:
      bucket: gitlab-backups
      tmpBucket: gitlab-backups-tmp

    # Object storage connection template (Azure)
    object_store:
      enabled: true
      proxy_download: true
      storage_options: {}
      connection:
        secret: gitlab-rails-storage
        key: connection

  # ============================================
  # REGISTRY
  # ============================================
  registry:
    enabled: true
    bucket: gitlab-registry
    storage:
      secret: gitlab-registry-storage
      key: config

# ============================================
# DISABLE BUNDLED COMPONENTS
# ============================================
postgresql:
  install: false  # Using Azure SQL

redis:
  install: false  # Using Azure Cache

minio:
  install: false  # Using Azure Blob

# ============================================
# CERTMANAGER (we manage separately)
# ============================================
certmanager:
  install: false

# ============================================
# NGINX INGRESS (we manage separately)
# ============================================
nginx-ingress:
  enabled: false

# ============================================
# PROMETHEUS (optional - we use Azure Monitor)
# ============================================
prometheus:
  install: false

# ============================================
# GITLAB COMPONENTS
# ============================================

# Webservice (Rails application)
gitlab:
  webservice:
    replicaCount: 2
    minReplicas: 2
    maxReplicas: 10
    resources:
      requests:
        cpu: 900m
        memory: 2.5G
      limits:
        cpu: 2
        memory: 4G
    workerProcesses: 2
    workhorse:
      resources:
        requests:
          cpu: 100m
          memory: 100M
        limits:
          cpu: 500m
          memory: 500M
    hpa:
      targetAverageValue: 400m

  # Sidekiq (background jobs)
  sidekiq:
    replicas: 2
    minReplicas: 2
    maxReplicas: 10
    resources:
      requests:
        cpu: 500m
        memory: 2G
      limits:
        cpu: 2
        memory: 4G
    hpa:
      targetAverageValue: 350m
    pods:
      - name: all-in-1
        concurrency: 25
        queues: 

  # Gitaly (Git operations)
  gitaly:
    persistence:
      enabled: true
      size: 500Gi
      storageClass: azurefile-premium  # Azure Files Premium
    resources:
      requests:
        cpu: 300m
        memory: 1.5G
      limits:
        cpu: 2
        memory: 4G

  # GitLab Shell (SSH)
  gitlab-shell:
    replicaCount: 2
    minReplicas: 2
    maxReplicas: 4
    resources:
      requests:
        cpu: 50m
        memory: 32M
      limits:
        cpu: 500m
        memory: 128M

  # Toolbox (Rails console, backups)
  toolbox:
    enabled: true
    replicas: 1
    backups:
      cron:
        enabled: true
        schedule: &quot;0 2 * * *&quot;  # 2 AM daily
        persistence:
          enabled: true
          size: 100Gi
          storageClass: azurefile-premium
      objectStorage:
        config:
          secret: gitlab-rails-storage
          key: connection

  # Migrations (database migrations)
  migrations:
    enabled: true

  # GitLab Exporter (metrics)
  gitlab-exporter:
    enabled: true
    resources:
      requests:
        cpu: 50m
        memory: 100M
      limits:
        cpu: 200m
        memory: 256M

# Registry
registry:
  enabled: true
  replicas: 2
  hpa:
    minReplicas: 2
    maxReplicas: 5
  storage:
    secret: gitlab-registry-storage
    key: config
  resources:
    requests:
      cpu: 100m
      memory: 128M
    limits:
      cpu: 500m
      memory: 512M

# ============================================
# SHARED SETTINGS
# ============================================
shared-secrets:
  enabled: true
  rbac:
    create: true
```

---

## Step 6: Azure Storage Connection Secret

Create the storage connection file:

```yaml
# gitlab-rails-storage.yaml
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-rails-storage
  namespace: gitlab
type: Opaque
stringData:
  connection: |
    provider: AzureRM
    azure_storage_account_name: gitlabstorageprod
    azure_storage_access_key: YOUR_STORAGE_ACCOUNT_KEY
    azure_storage_domain: blob.core.windows.net
```

For the registry:

```yaml
# gitlab-registry-storage.yaml
apiVersion: v1
kind: Secret
metadata:
  name: gitlab-registry-storage
  namespace: gitlab
type: Opaque
stringData:
  config: |
    azure:
      accountname: gitlabstorageprod
      accountkey: YOUR_STORAGE_ACCOUNT_KEY
      container: gitlab-registry
      rootdirectory: /
```

Apply the secrets:

```bash
kubectl apply -f gitlab-rails-storage.yaml
kubectl apply -f gitlab-registry-storage.yaml
```

---

## Step 7: Deploy GitLab

```bash
# Install GitLab
helm upgrade --install gitlab gitlab/gitlab \
  --namespace gitlab \
  --timeout 600s \
  --values values-production.yaml

# Watch the deployment
kubectl -n gitlab get pods -w

# Check for issues
kubectl -n gitlab get events --sort-by=&apos;.lastTimestamp&apos;
```

The first deployment takes 10-15 minutes. Wait for every pod to report Ready (Job pods such as migrations finish as Completed instead).

---

## Step 8: Post-Installation

### Get the Root Password

```bash
kubectl -n gitlab get secret gitlab-initial-root-password \
  -o jsonpath=&quot;{.data.password}&quot; | base64 -d &amp;&amp; echo
```
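
The `base64 -d` step is there because Kubernetes stores Secret values base64-encoded under `.data`. A round trip with a made-up value shows what the pipe is undoing (GNU coreutils `base64` assumed):

```bash
# Encode a sample value the way it would appear under .data in the Secret
encoded=$(printf '%s' 'example-password' | base64)

# Decode it again, which is what the command above does with the real value
decoded=$(printf '%s' "$encoded" | base64 -d)

echo "stored:  ${encoded}"
echo "decoded: ${decoded}"
```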

### Access GitLab

1. Navigate to `https://gitlab.yourcompany.com`
2. Log in as `root` with the password above
3. **Immediately change the root password**
4. Disable sign-ups (Admin → Settings → General → Sign-up restrictions)

### Create Your First User

```bash
# Via Rails console
kubectl -n gitlab exec -it deploy/gitlab-toolbox -- gitlab-rails console

# In console:
user = User.new(username: &apos;admin&apos;, email: &apos;admin@yourcompany.com&apos;, name: &apos;Admin User&apos;, password: &apos;securepassword&apos;, password_confirmation: &apos;securepassword&apos;)
user.admin = true
user.skip_confirmation!
user.save!
```

---

## Lessons Learned

### 1. Gitaly Storage is Critical

We initially used Azure Files Standard. Big mistake. Git operations were slow, and `git clone` on large repos took forever.

**Fix:** Use Azure Files Premium (NFS) with high IOPS. The cost difference is worth it.

```yaml
# Storage class for Gitaly
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: azurefile-premium
provisioner: file.csi.azure.com
parameters:
  skuName: Premium_LRS
  protocol: nfs
reclaimPolicy: Retain
volumeBindingMode: Immediate
allowVolumeExpansion: true
```

### 2. Sidekiq Queues Matter

We started with a single Sidekiq pod handling all queues. CI jobs were slow because they competed with everything else.

**Fix:** Dedicate Sidekiq pods to specific queue groups:

```yaml
gitlab:
  sidekiq:
    pods:
      - name: urgent
        concurrency: 10
        queues:
          - pipeline_processing
          - pipeline_default
          - pipeline_cache
      - name: default
        concurrency: 25
        queues:
      - name: slow
        concurrency: 5
        queues:
          - cronjob
          - repository_archive_cache
```

### 3. PostgreSQL Connection Limits

GitLab is connection-hungry. We hit Azure SQL&apos;s connection limit during peak hours.

**Fix:** 
- Set `max_connections: 200` on PostgreSQL
- Use PgBouncer (GitLab Helm chart can deploy it):

```yaml
global:
  psql:
    host: gitlab-pgbouncer  # PgBouncer service
    
gitlab:
  pgbouncer:
    enabled: true
    replicas: 2
```

### 4. Registry Garbage Collection

Container registry grows fast. Without cleanup, storage costs explode.

**Fix:** Enable registry garbage collection:

```bash
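# Run GC manually from the registry pod (gitlab-ctl only exists in Omnibus installs;
# the deployment name and config path below assume the chart defaults)
kubectl -n gitlab exec -it deploy/gitlab-registry -- \
  /bin/registry garbage-collect -m /etc/docker/registry/config.yml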

# Or schedule it via cron job
```

### 5. Backup Testing

We set up backups but never tested restores. When we needed to restore a deleted project, we discovered our backup was incomplete.

**Fix:** Test restores monthly:

```bash
# Create backup
kubectl -n gitlab exec -it deploy/gitlab-toolbox -- backup-utility

# Restore (to test environment)
kubectl -n gitlab exec -it deploy/gitlab-toolbox -- backup-utility --restore
```

### 6. Resource Requests Were Wrong

Initial deployment used GitLab&apos;s default resource requests. Pods were constantly OOMKilled.

**Fix:** Monitor actual usage for 2 weeks, then right-size:

```bash
# Get actual resource usage
kubectl -n gitlab top pods

# Check OOMKills
kubectl -n gitlab get events | grep -i oom
```

### 7. Upgrade Path Matters

GitLab doesn&apos;t support skipping required upgrade stops. We tried to go from 15.x to 16.x directly, without passing through the intermediate stops, and broke migrations.

**Fix:** Follow the upgrade path strictly:
- 15.11 → 16.0 → 16.3 → 16.7 → 16.11 → 17.x

Check [GitLab&apos;s upgrade path tool](https://gitlab-com.gitlab.io/support/toolbox/upgrade-path/).

---

## Monitoring

### Key Metrics to Watch

```yaml
# Prometheus rules (if using)
groups:
  - name: gitlab
    rules:
      - alert: GitLabSidekiqQueueBacklog
        expr: sidekiq_queue_size &gt; 1000
        for: 10m
        
      - alert: GitLabGitalyHighLatency
        expr: gitaly_service_client_requests_seconds_bucket{le=&quot;1&quot;} &lt; 0.95
        for: 5m
        
      - alert: GitLabPostgreSQLConnections
        expr: pg_stat_activity_count &gt; 180
        for: 5m
```

### Useful Commands

```bash
# Check GitLab component health
kubectl -n gitlab exec -it deploy/gitlab-toolbox -- gitlab-rake gitlab:check

# Check background jobs
kubectl -n gitlab exec -it deploy/gitlab-toolbox -- gitlab-rake gitlab:sidekiq:check

# Database migrations status
kubectl -n gitlab exec -it deploy/gitlab-toolbox -- gitlab-rake db:migrate:status

# Rails console (for debugging)
kubectl -n gitlab exec -it deploy/gitlab-toolbox -- gitlab-rails console
```

---

## Cost Breakdown

Our monthly costs (50 users, moderate CI usage):

| Component | SKU | Monthly Cost |
|-----------|-----|--------------|
| Azure SQL (PostgreSQL) | D4s_v3, HA | ~£280 |
| Azure Cache (Redis) | P1 | ~£140 |
| Azure Files Premium | 500GB | ~£85 |
| Azure Blob Storage | ~200GB | ~£10 |
| AKS Node Pool (dedicated) | 2x D4s_v3 | ~£240 |
| **Total** | | **~£755/month** |

vs GitLab Premium: ~£1,450/month

**Savings: ~£700/month (~£8,400/year)**
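
A quick check that the table adds up, using the figures above (all approximate, in GBP):

```bash
# Monthly costs in GBP, copied from the table above
total=0
for cost in 280 140 85 10 240; do
  total=$((total + cost))
done

premium_month=1450
savings_month=$((premium_month - total))   # 695, rounded to ~700 in the text

echo "Self-hosted: £${total}/month, savings: £${savings_month}/month"
```

The yearly figure uses the rounded £700: 700 × 12 = £8,400.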

---

## Production Checklist

```markdown
## Pre-Deployment
- [ ] Azure SQL created with HA enabled
- [ ] Redis Cache created (Premium)
- [ ] Blob Storage containers created
- [ ] Private endpoints configured
- [ ] SSL certificates ready
- [ ] DNS records configured

## Helm Configuration
- [ ] External PostgreSQL configured
- [ ] External Redis configured
- [ ] Object storage (Azure) configured
- [ ] Gitaly persistence configured
- [ ] Registry storage configured
- [ ] SMTP configured
- [ ] Resource requests/limits set
- [ ] HPA configured

## Post-Deployment
- [ ] Root password changed
- [ ] Sign-ups disabled
- [ ] First admin user created
- [ ] SSO/LDAP configured (if using)
- [ ] Backup cron job verified
- [ ] Backup restore tested
- [ ] Monitoring alerts configured
- [ ] Runner(s) registered

## Ongoing
- [ ] Monthly backup restore test
- [ ] Registry garbage collection scheduled
- [ ] Upgrade path documented
- [ ] Runbook for common issues
```

---

## Key Takeaways

1. **Use external PostgreSQL and Redis** - The bundled ones are fine for testing, not production
2. **Azure Files Premium for Gitaly** - Don&apos;t skimp on Git storage IOPS
3. **Right-size after observing** - GitLab&apos;s defaults are conservative
4. **Test your backups** - Untested backups aren&apos;t backups
5. **Follow upgrade paths** - GitLab migrations are version-sensitive
6. **Monitor Sidekiq queues** - They&apos;re the first sign of trouble
7. **Budget for the ops time** - Self-hosting isn&apos;t &quot;set and forget&quot;

Self-hosted GitLab on Kubernetes is absolutely viable for startups, but go in with eyes open. The cost savings are real, but so is the operational overhead.

---

*Running GitLab on K8s? Hit any interesting issues? Find me on [LinkedIn](https://linkedin.com/in/moabukar) or [GitHub](https://github.com/moabukar).*</content:encoded><category>gitlab</category><category>kubernetes</category><category>aks</category><category>azure</category><category>helm</category><category>devops</category><category>self-hosted</category><category>startup</category><author>Mo Abukar</author></item><item><title>Cloud Unit Economics for Multi-Tenant SaaS - Cost Per Customer, Not Per Service</title><link>https://moabukar.co.uk/blog/cloud-unit-economics-multi-tenant/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/cloud-unit-economics-multi-tenant/</guid><description>How to calculate true cost-per-tenant in a shared infrastructure environment. Covers EKS with Karpenter, shared databases (Aurora, DynamoDB), and tools like OpenCost, CloudZero, and custom attribution approaches.</description><pubDate>Tue, 20 Jan 2026 00:00:00 GMT</pubDate><content:encoded># Cloud Unit Economics for Multi-Tenant SaaS - Cost Per Customer, Not Per Service

Your AWS bill tells you that EKS costs £50,000/month and Aurora costs £15,000/month. But what does Customer A cost? What about Customer B who does 10x the transactions? Traditional cloud billing shows you spend by service - it doesn&apos;t show you spend by customer, transaction, or business unit.

This is the unit economics problem, and for multi-tenant SaaS platforms, it&apos;s critical. Without it, you can&apos;t answer:
- Which customers are profitable?
- What&apos;s the true margin on each deal?
- Where should we optimise?
- How should we price?

I recently helped a client solve this for their multi-tenant platform running on EKS with shared Aurora, DynamoDB, MSK, and Keyspaces backends. This post covers the approach, the tooling, and the gotchas.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/opencost.svg&quot; alt=&quot;OpenCost logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## The Problem: Shared Infrastructure, Unknown Attribution

Consider this typical multi-tenant architecture:

```
┌─────────────────────────────────────────────────────────────────────┐
│                           Customers                                  │
│  ┌────────┐    ┌────────┐    ┌────────┐                            │
│  │ Cust A │    │ Cust B │    │ Cust C │                            │
│  └───┬────┘    └───┬────┘    └───┬────┘                            │
│      │             │             │                                   │
│      └─────────────┼─────────────┘                                   │
│                    │                                                  │
│                    ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                      CloudFront                              │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                    │                                                  │
│                    ▼                                                  │
│  ┌─────────────────────────────────────────────────────────────┐    │
│  │                    EKS Cluster                               │    │
│  │  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐        │    │
│  │  │ Login   │  │ Orders  │  │ Payment │  │ Common  │        │    │
│  │  │ Service │  │ Service │  │ Service │  │ Services│        │    │
│  │  └─────────┘  └─────────┘  └─────────┘  └─────────┘        │    │
│  └─────────────────────────────────────────────────────────────┘    │
│                    │                                                  │
│      ┌─────────────┼─────────────────────────────┐                   │
│      │             │             │               │                   │
│      ▼             ▼             ▼               ▼                   │
│  ┌────────┐   ┌────────┐   ┌──────────┐   ┌──────────┐             │
│  │ Aurora │   │DynamoDB│   │ Keyspaces│   │   MSK    │             │
│  │(shared)│   │(shared)│   │ (shared) │   │ (shared) │             │
│  └────────┘   └────────┘   └──────────┘   └──────────┘             │
└─────────────────────────────────────────────────────────────────────┘
```

**The challenge:**
- All customers hit the same EKS pods
- All customers share the same Aurora cluster
- All customers write to the same DynamoDB tables
- Tenants are isolated at the data level, not the infrastructure level

AWS Cost Explorer will tell you Aurora costs £15k/month. It won&apos;t tell you that Customer A costs £8k and Customer B costs £2k.

## Unit Economics Defined

**Unit economics** = Cost to serve one unit of business value

Common units:
- **Cost per customer** - Total cost / number of customers
- **Cost per transaction** - Total cost / number of transactions
- **Cost per API call** - Total cost / number of API requests
- **Cost per user** - Total cost / active users
- **Cost per order** - Total cost / orders processed

The &quot;right&quot; unit depends on your business model:
- Per-seat SaaS → Cost per user
- Transaction platform → Cost per transaction
- API business → Cost per 1M requests
- E-commerce → Cost per order
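
As a rough sketch of how one bill translates into different units, the snippet below divides a single period total by each candidate unit. The figures and the `PeriodTotals` and `unit_costs` names are illustrative assumptions, not data from a real platform.

```python
from dataclasses import dataclass


@dataclass
class PeriodTotals:
    """Totals for one billing period. All figures used here are made up."""
    total_cost: float   # total cloud spend for the period
    customers: int
    transactions: int
    api_requests: int


def unit_costs(t: PeriodTotals) -> dict[str, float]:
    """Divide total cost by each candidate unit of business value."""
    return {
        "per_customer": t.total_cost / t.customers,
        "per_1k_transactions": t.total_cost / t.transactions * 1_000,
        "per_1m_requests": t.total_cost / t.api_requests * 1_000_000,
    }


totals = PeriodTotals(
    total_cost=85_000,
    customers=150,
    transactions=6_800_000,
    api_requests=400_000_000,
)
costs = unit_costs(totals)
print({unit: round(value, 2) for unit, value in costs.items()})
```

The same total produces very different headline numbers depending on the unit, which is why the unit should follow the pricing model rather than convenience.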

## The Solution: Multi-Dimensional Cost Attribution

To solve this, we need to:

1. **Tag everything possible** at the AWS level
2. **Instrument applications** to emit tenant context
3. **Collect resource usage** at the tenant level
4. **Allocate shared costs** proportionally
5. **Build a cost model** that combines direct and allocated costs

### Step 1: AWS Tagging Strategy

Start with consistent tagging. Every resource needs:

```
tenant_id: customer-123        # Direct tenant if applicable
service: orders-api            # Which service
environment: production        # Environment
cost_center: platform          # Business allocation
```

For shared resources, tag with:
```
allocation_type: shared
allocation_basis: request_count  # How to split the cost
```

**The problem:** Most shared resources can&apos;t be tagged per-tenant because multiple tenants use them simultaneously.

### Step 2: Kubernetes Cost Attribution with OpenCost

[OpenCost](https://opencost.io) is the CNCF project for Kubernetes cost monitoring. It allocates cluster costs to namespaces, deployments, and labels.

**Install OpenCost:**

```bash
helm repo add opencost https://opencost.github.io/opencost-helm-chart
helm install opencost opencost/opencost \
  --namespace opencost \
  --create-namespace \
  --set opencost.prometheus.internal.enabled=true \
  --set opencost.ui.enabled=true
```

**Configure for tenant attribution:**

The key is labelling your pods with tenant information when possible, or tracking tenant metrics separately.

For shared pods (most multi-tenant setups), OpenCost gives you cost-per-pod, but you need application-level metrics to split by tenant.

```yaml
# Example: Pod with tenant label (for tenant-dedicated resources)
apiVersion: v1
kind: Pod
metadata:
  labels:
    app: orders-api
    tenant: customer-123  # Only works for tenant-dedicated pods
```

For shared pods serving multiple tenants, you need a different approach.
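
One possible approach, sketched here under assumptions: take the cost OpenCost reports for a shared workload and split it by each tenant's share of requests. The `split_pod_cost` helper, the tenant names, and the numbers are hypothetical; in practice the request counts would come from tenant-tagged application metrics.

```python
def split_pod_cost(pod_cost: float, requests_by_tenant: dict[str, float]) -> dict[str, float]:
    """Split the cost of one shared workload across tenants by request share."""
    total_requests = sum(requests_by_tenant.values())
    if total_requests == 0:
        # No traffic recorded: leave the cost unallocated rather than guess
        return {tenant: 0.0 for tenant in requests_by_tenant}
    return {
        tenant: pod_cost * count / total_requests
        for tenant, count in requests_by_tenant.items()
    }


# Hypothetical month: the shared orders-api pods cost 1200 in total
orders_api_cost = 1200.0
requests = {"customer-a": 600_000, "customer-b": 300_000, "customer-c": 100_000}
split = split_pod_cost(orders_api_cost, requests)
print({tenant: round(cost, 2) for tenant, cost in split.items()})
```

Request share is a crude proxy for CPU and memory use, but it is cheap to collect and usually good enough for a first pass.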

### Step 3: Application-Level Tenant Metrics

This is where most cost attribution projects fail. You need your application to emit tenant-tagged metrics.

**Instrument your services:**

```python
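# Python example with Prometheus metrics (Flask is assumed for the route decorator;
# get_tenant_from_request and process_order are placeholders for your own code)
from flask import Flask
from prometheus_client import Counter, Histogram

app = Flask(__name__)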

# Request counter by tenant
requests_total = Counter(
    &apos;http_requests_total&apos;,
    &apos;Total HTTP requests&apos;,
    [&apos;service&apos;, &apos;tenant_id&apos;, &apos;endpoint&apos;]
)

# Request duration by tenant
request_duration = Histogram(
    &apos;http_request_duration_seconds&apos;,
    &apos;Request duration&apos;,
    [&apos;service&apos;, &apos;tenant_id&apos;, &apos;endpoint&apos;]
)

# In your request handler
@app.route(&apos;/api/orders&apos;)
def handle_orders():
    tenant_id = get_tenant_from_request()  # Extract from JWT, header, etc.
    
    with request_duration.labels(
        service=&apos;orders-api&apos;,
        tenant_id=tenant_id,
        endpoint=&apos;/api/orders&apos;
    ).time():
        # Process request
        result = process_order()
    
    requests_total.labels(
        service=&apos;orders-api&apos;,
        tenant_id=tenant_id,
        endpoint=&apos;/api/orders&apos;
    ).inc()
    
    return result
```

**Key metrics to collect per tenant:**
- Request count
- CPU time consumed
- Memory high-water mark
- Database queries executed
- Storage bytes read/written
- Kafka messages produced/consumed

### Step 4: Database Cost Attribution

Shared databases are the hardest to attribute. Tenants are isolated at the row/table level, not the instance level.

#### Aurora/RDS Attribution

Aurora costs have multiple components:
- Instance hours (compute)
- Storage (GB-months)
- I/O requests
- Backup storage

**Attribution approach:**

```sql
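-- Approximate storage per tenant for one row-level multi-tenant table.
-- your_data_table is a placeholder; repeat per tenant-scoped table.
-- Excludes indexes and scans the whole table, so run it off-peak or on a replica.
SELECT
    tenant_id,
    SUM(pg_column_size(t.*)) AS approx_bytes
FROM your_data_table AS t
GROUP BY tenant_id;

-- Track query activity per tenant (requires pg_stat_statements).
-- pg_stat_statements has no tenant column, so this assumes an
-- application-maintained tenant_query_log(tenant_id, query_hash) table.
-- If several tenants share a queryid, weight by per-tenant call counts instead.
SELECT
    l.tenant_id,
    SUM(s.total_exec_time) AS query_time_ms,  -- named total_time before PostgreSQL 13
    SUM(s.calls) AS query_count,
    SUM(s.shared_blks_read + s.shared_blks_hit) AS blocks_accessed
FROM pg_stat_statements AS s
JOIN tenant_query_log AS l ON l.query_hash = s.queryid
GROUP BY l.tenant_id;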
```

**For Aurora I/O costs:**
- Track read/write IOPS per tenant via application metrics
- Use CloudWatch `VolumeReadIOPs` and `VolumeWriteIOPs` for total
- Allocate proportionally based on application-tracked I/O
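
To make the proportional allocation concrete, here is a minimal sketch that splits each Aurora bill component by its own usage driver: compute by query time, storage by bytes, I/O by blocks accessed. The component names, the `allocate_aurora` helper, and every number are assumptions for illustration only.

```python
def share(usage: dict[str, float]) -> dict[str, float]:
    """Convert raw usage per tenant into fractions that sum to 1."""
    total = sum(usage.values())
    return {tenant: (value / total if total else 0.0) for tenant, value in usage.items()}


def allocate_aurora(bill: dict[str, float],
                    drivers: dict[str, dict[str, float]]) -> dict[str, float]:
    """Allocate each bill component by its own driver, then sum per tenant."""
    allocated: dict[str, float] = {}
    for component, cost in bill.items():
        for tenant, fraction in share(drivers[component]).items():
            allocated[tenant] = allocated.get(tenant, 0.0) + cost * fraction
    return allocated


# Hypothetical monthly Aurora bill, broken down by component
bill = {"compute": 9_000.0, "storage": 2_000.0, "io": 4_000.0}
drivers = {
    "compute": {"a": 60.0, "b": 30.0, "c": 10.0},    # query time, arbitrary units
    "storage": {"a": 500.0, "b": 100.0, "c": 400.0}, # GB stored
    "io": {"a": 50.0, "b": 25.0, "c": 25.0},         # blocks accessed, millions
}
aurora_by_tenant = allocate_aurora(bill, drivers)
print({tenant: round(cost, 2) for tenant, cost in aurora_by_tenant.items()})
```

Using a separate driver per component stops a storage-heavy but quiet tenant from being billed as if it were compute-heavy.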

#### DynamoDB Attribution

DynamoDB billing is simpler - it&apos;s based on:
- Read Capacity Units (RCU), or read request units in on-demand mode
- Write Capacity Units (WCU), or write request units in on-demand mode
- Storage (GB)

**Enable DynamoDB Contributor Insights:**

```bash
aws dynamodb update-contributor-insights \
    --table-name YourTable \
    --contributor-insights-action ENABLE
```

This shows top partition keys (often tenant IDs) and their access patterns.

**Custom attribution via application:**

```python
# Track DynamoDB operations per tenant
dynamodb_reads = Counter(
    &apos;dynamodb_read_units_total&apos;,
    &apos;DynamoDB consumed read units&apos;,
    [&apos;table&apos;, &apos;tenant_id&apos;]
)

dynamodb_writes = Counter(
    &apos;dynamodb_write_units_total&apos;,
    &apos;DynamoDB consumed write units&apos;,
    [&apos;table&apos;, &apos;tenant_id&apos;]
)

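# After each DynamoDB operation (dynamodb is a boto3 DynamoDB client)
response = dynamodb.query(
    TableName=&apos;Orders&apos;,
    KeyConditionExpression=&apos;tenant_id = :tid&apos;,
    ExpressionAttributeValues={&apos;:tid&apos;: {&apos;S&apos;: tenant_id}},
    ReturnConsumedCapacity=&apos;TOTAL&apos;  # Without this, ConsumedCapacity is not returned
)

# With TOTAL, the units consumed by the query are reported as CapacityUnits
consumed_rcu = response[&apos;ConsumedCapacity&apos;][&apos;CapacityUnits&apos;]
dynamodb_reads.labels(table=&apos;Orders&apos;, tenant_id=tenant_id).inc(consumed_rcu)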
```

### Step 5: The Cost Attribution Pipeline

Now we combine everything into an attribution pipeline:

```
┌──────────────┐    ┌──────────────┐    ┌──────────────┐
│  AWS Cost    │    │  OpenCost    │    │ Application  │
│  &amp; Usage     │    │  (K8s costs) │    │   Metrics    │
│   Report     │    │              │    │  (Prometheus)│
└──────┬───────┘    └──────┬───────┘    └──────┬───────┘
       │                   │                   │
       └───────────────────┼───────────────────┘
                           │
                           ▼
                  ┌────────────────┐
                  │  Cost          │
                  │  Attribution   │
                  │  Engine        │
                  └────────┬───────┘
                           │
                           ▼
                  ┌────────────────┐
                  │  Tenant Cost   │
                  │  Dashboard     │
                  └────────────────┘
```

**Example attribution logic:**

```python
def calculate_tenant_costs(period):
    # 1. Get total AWS costs from Cost &amp; Usage Report
    aws_costs = get_cur_costs(period)  # {&apos;eks&apos;: 50000, &apos;aurora&apos;: 15000, ...}
    
    # 2. Get tenant request counts from Prometheus.
    #    increase() returns the count over the window, so the values are
    #    request counts; period is a Prometheus duration such as &apos;30d&apos;.
    tenant_metrics = query_prometheus(f&apos;&apos;&apos;
        sum by (tenant_id) (
            increase(http_requests_total{{service=~&quot;.+&quot;}}[{period}])
        )
    &apos;&apos;&apos;)
    
    total_requests = sum(tenant_metrics.values())
    
    # 3. Get tenant-specific metrics where available
    tenant_db_usage = get_database_usage_by_tenant(period)
    tenant_storage = get_storage_by_tenant(period)
    total_db_usage = sum(tenant_db_usage.values())
    total_storage = sum(tenant_storage.values())
    
    def ratio(part, whole):
        # Avoid ZeroDivisionError when a period has no recorded usage
        return part / whole if whole else 0.0
    
    # 4. Calculate allocation ratios
    tenant_costs = {}
    for tenant_id, request_count in tenant_metrics.items():
        request_ratio = ratio(request_count, total_requests)
        db_ratio = ratio(tenant_db_usage.get(tenant_id, 0), total_db_usage)
        storage_ratio = ratio(tenant_storage.get(tenant_id, 0), total_storage)
        
        tenant_costs[tenant_id] = {
            # Allocate EKS costs by request ratio
            &apos;eks&apos;: aws_costs[&apos;eks&apos;] * request_ratio,
            
            # Allocate Aurora by DB usage
            &apos;aurora&apos;: aws_costs[&apos;aurora&apos;] * db_ratio,
            
            # Allocate storage by storage ratio
            &apos;s3&apos;: aws_costs[&apos;s3&apos;] * storage_ratio,
            
            # Direct costs (if any tenant-specific resources)
            &apos;direct&apos;: get_direct_tenant_costs(tenant_id, period),
        }
        
        tenant_costs[tenant_id][&apos;total&apos;] = sum(tenant_costs[tenant_id].values())
    
    return tenant_costs
```

## Tools Comparison

Several tools can help with this:

### OpenCost
- **What:** Open-source Kubernetes cost monitoring
- **Good for:** Pod/namespace/label cost allocation
- **Limitation:** Doesn&apos;t handle non-K8s resources, needs app metrics for tenant split
- **Cost:** Free

### CloudZero
- **What:** SaaS unit economics platform
- **Good for:** End-to-end unit cost tracking, pre-built integrations
- **Limitation:** SaaS pricing can be high, less customisable
- **Cost:** $$$

### Kubecost
- **What:** Commercial K8s cost monitoring, built on the OpenCost core that Kubecost originally open-sourced
- **Good for:** K8s-focused with better UI, alerting
- **Limitation:** Still K8s-centric
- **Cost:** Free tier, paid for advanced features

### Attrb.io
- **What:** Cost attribution sensors for K8s
- **Good for:** Works with Karpenter, fine-grained attribution
- **Limitation:** Newer tool, less mature
- **Cost:** Check pricing

### Custom Build
- **What:** Build your own with CUR + Prometheus + custom logic
- **Good for:** Full control, handles edge cases
- **Limitation:** Engineering effort, maintenance burden
- **Cost:** Engineering time

### Our Recommendation

For most multi-tenant platforms:

1. **Start with OpenCost** for K8s visibility
2. **Add application-level tenant metrics** (non-negotiable)
3. **Build a custom attribution layer** for shared resources
4. **Consider CloudZero** if you need quick time-to-value and can afford it

## Implementation Checklist

```markdown
## Tagging
- [ ] Define tenant tagging strategy
- [ ] Tag all AWS resources
- [ ] Label all K8s resources

## Instrumentation
- [ ] Add tenant_id to all application metrics
- [ ] Instrument request counts per tenant
- [ ] Instrument database operations per tenant
- [ ] Instrument storage usage per tenant
- [ ] Instrument queue operations per tenant

## Collection
- [ ] Deploy OpenCost for K8s costs
- [ ] Configure Cost &amp; Usage Report
- [ ] Set up Prometheus for application metrics
- [ ] Enable database monitoring (pg_stat_statements, DynamoDB Contributor Insights)

## Attribution
- [ ] Define cost allocation rules
- [ ] Build attribution pipeline
- [ ] Handle shared resource allocation
- [ ] Handle idle/unattributed costs

## Reporting
- [ ] Build tenant cost dashboard
- [ ] Set up cost anomaly alerting
- [ ] Create margin reports
- [ ] Enable drill-down by service/time/tenant
```

## Common Pitfalls

### 1. Ignoring Idle Costs

Not all costs map to tenant activity. Idle EKS nodes, standby Aurora replicas, unused reserved capacity - these need a policy:

- **Spread evenly:** Divide among all tenants
- **Spread by usage:** Allocate proportionally to active tenants
- **Keep separate:** Track as &quot;platform overhead&quot;
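
The three policies can be sketched as a single function. This is an illustration under assumptions: `allocate_idle`, the policy names, and the figures are hypothetical, and `attributed` stands for costs already assigned to tenants by usage.

```python
def allocate_idle(idle_cost: float, attributed: dict[str, float], policy: str) -> dict[str, float]:
    """Apply one idle-cost policy on top of already attributed tenant costs."""
    if policy == "even":
        per_tenant = idle_cost / len(attributed)
        return {tenant: cost + per_tenant for tenant, cost in attributed.items()}
    if policy == "by_usage":
        total = sum(attributed.values())
        return {tenant: cost + idle_cost * cost / total for tenant, cost in attributed.items()}
    if policy == "separate":
        # Tenants keep their attributed cost; idle spend gets its own line
        return {**attributed, "platform_overhead": idle_cost}
    raise ValueError(f"unknown policy: {policy}")


attributed = {"a": 6_000.0, "b": 3_000.0, "c": 1_000.0}
idle = 3_000.0
for policy in ("even", "by_usage", "separate"):
    result = allocate_idle(idle, attributed, policy)
    print(policy, {tenant: round(cost, 2) for tenant, cost in result.items()})
```

Whichever policy you pick, the totals reconcile to the bill; what changes is how much of the overhead small tenants carry.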

### 2. Point-in-Time vs. Averaged

Tenant usage varies. A tenant might spike to 50% of capacity for an hour, then drop to 5%. 

- **Don&apos;t:** Take a single measurement
- **Do:** Average over the billing period, or use peak-based allocation for reserved capacity
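
A small sketch of the difference, assuming you already sample each tenant's share of capacity at regular intervals. The `usage_shares` helper and the sample values are hypothetical.

```python
from statistics import mean


def usage_shares(samples_by_tenant: dict[str, list[float]], mode: str = "average") -> dict[str, float]:
    """Turn per-interval usage samples into allocation shares that sum to 1."""
    summarise = mean if mode == "average" else max
    summary = {tenant: summarise(samples) for tenant, samples in samples_by_tenant.items()}
    total = sum(summary.values())
    return {tenant: value / total for tenant, value in summary.items()}


# Tenant a spikes to 50% of capacity for one interval, then drops to 5%
samples = {
    "a": [0.50, 0.05, 0.05, 0.05],
    "b": [0.50, 0.95, 0.95, 0.95],
}
average_shares = usage_shares(samples, mode="average")
peak_shares = usage_shares(samples, mode="peak")
print({tenant: round(value, 4) for tenant, value in average_shares.items()})
print({tenant: round(value, 4) for tenant, value in peak_shares.items()})
```

Averaging suits usage-billed resources, while the peak view reflects who forces you to provision reserved capacity.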

### 3. Forgetting Support and People Costs

Cloud costs aren&apos;t the full picture:
- Support tickets per tenant
- Engineering time per tenant
- Onboarding costs
- Account management

For true unit economics, you need these too.

### 4. Over-Engineering Early

Start simple:
1. Track total costs
2. Track tenant request counts
3. Allocate by request ratio

Add complexity (DB-level, storage-level, network-level) only when the simple model is insufficient.

## Example Dashboard

A good unit economics dashboard shows:

```
┌─────────────────────────────────────────────────────────────────┐
│                    Unit Economics Dashboard                      │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  SUMMARY                           TREND (Last 6 Months)        │
│  ─────────────────────────         ────────────────────────     │
│  Total Cost:     £85,000           [Line chart showing          │
│  Customers:      150                cost per customer trend]    │
│  Avg Cost/Cust:  £567                                           │
│  Cost/1K Trans:  £12.40                                         │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  TOP 10 CUSTOMERS BY COST          COST BREAKDOWN BY SERVICE    │
│  ─────────────────────────         ────────────────────────     │
│  1. BigCorp Inc     £12,400        EKS:        58%             │
│  2. MegaTech Ltd    £8,200         Aurora:     18%             │
│  3. StartupXYZ      £6,100         DynamoDB:   12%             │
│  4. Enterprise Co   £5,800         MSK:         7%             │
│  5. Growth Inc      £4,200         Other:       5%             │
│  ...                                                            │
│                                                                  │
├─────────────────────────────────────────────────────────────────┤
│                                                                  │
│  MARGIN ANALYSIS                                                 │
│  ─────────────────                                              │
│  Customer     Revenue    Cost    Margin    Margin %             │
│  BigCorp      £25,000    £12,400  £12,600    50.4%             │
│  MegaTech     £10,000    £8,200   £1,800     18.0%  ⚠️         │
│  StartupXYZ   £15,000    £6,100   £8,900     59.3%             │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```
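
The margin table is simple arithmetic once per-tenant costs exist. A minimal sketch using the figures from the mock dashboard above; the `margin_report` helper and the 20% warning threshold are assumptions.

```python
def margin_report(rows, warn_below_pct=20.0):
    """Compute margin and margin % per customer and flag thin margins."""
    report = []
    for customer, revenue, cost in rows:
        margin = revenue - cost
        margin_pct = margin / revenue * 100
        report.append({
            "customer": customer,
            "margin": margin,
            "margin_pct": round(margin_pct, 1),
            "warn": warn_below_pct > margin_pct,
        })
    return report


rows = [
    ("BigCorp", 25_000, 12_400),
    ("MegaTech", 10_000, 8_200),
    ("StartupXYZ", 15_000, 6_100),
]
report = margin_report(rows)
for line in report:
    print(line)
```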

## Key Takeaways

1. **AWS billing ≠ business visibility** - You need tenant-level attribution
2. **Tag everything** - But know that tagging alone isn&apos;t enough for shared resources
3. **Instrument applications** - Tenant-aware metrics are essential
4. **Start simple** - Request-based allocation is a good first step
5. **Handle shared costs explicitly** - Define allocation rules upfront
6. **Include non-cloud costs** - Support, engineering, sales for true unit economics
7. **Iterate** - Your first model will be wrong; refine based on learnings

Unit economics turns your cloud bill from a mystery into a business tool. You&apos;ll finally know which customers are profitable, where to optimise, and how to price your product.

---

*Building unit economics for your platform? Questions about the approach? Find me on [LinkedIn](https://linkedin.com/in/moabukar) or [GitHub](https://github.com/moabukar).*</content:encoded><category>finops</category><category>cloud-costs</category><category>kubernetes</category><category>eks</category><category>multi-tenant</category><category>saas</category><category>unit-economics</category><category>aws</category><author>Mo Abukar</author></item><item><title>DORA Metrics Implementation - Measuring What Matters</title><link>https://moabukar.co.uk/blog/dora-metrics-implementation/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/dora-metrics-implementation/</guid><description>DORA metrics are the industry standard for measuring DevOps performance. Here&apos;s how to implement them properly, avoid common pitfalls, and actually use them to improve your team&apos;s delivery.</description><pubDate>Thu, 15 Jan 2026 00:00:00 GMT</pubDate><content:encoded>DORA metrics have become the standard for measuring DevOps performance. Every platform engineering talk mentions them. Every engineering leader wants them.

But most implementations fail. Teams collect the numbers without understanding what they mean. Dashboards get built but never improve anything. Metrics become targets that get gamed.

This guide covers how to implement DORA metrics in a way that actually drives improvement.

![DORA Metrics Implementation - Measuring What Matters](/images/dora-metrics.png)


## The Four Metrics

DORA research identified four key metrics that predict software delivery performance:

**Deployment Frequency** - How often you deploy to production. Elite performers deploy on-demand, multiple times per day.

**Lead Time for Changes** - Time from code commit to code running in production. Elite performers do this in less than an hour.

**Change Failure Rate** - Percentage of deployments that cause failures requiring remediation. Elite performers stay under 15%.

**Time to Restore Service** - How long it takes to recover from failures. Elite performers restore in under an hour.

These metrics are correlated. Teams that deploy frequently also have lower failure rates. Fast lead times correlate with faster recovery. This isn&apos;t coincidence - the same practices that enable speed also enable quality.

## Why These Metrics Matter

Traditional metrics measure activity: lines of code, story points, velocity. These metrics are easy to game and don&apos;t correlate with business outcomes.

DORA metrics measure capability: can you deliver changes safely and quickly? This directly connects to business value.

A team that deploys daily with low failure rates can:
- Respond to customer feedback quickly
- Fix bugs before they compound
- Ship features while they&apos;re still relevant
- Recover from incidents without panic

A team that deploys monthly with high failure rates cannot do these things, no matter how many story points they complete.

## Measuring Deployment Frequency

Deployment frequency sounds simple but has nuance.

**What counts as a deployment?** I define it as any change that reaches production, including:
- Feature releases
- Bug fixes
- Configuration changes
- Infrastructure updates

**Where to measure?** Pull from your deployment tool. If you&apos;re using ArgoCD, query ArgoCD. If you&apos;re using GitHub Actions, query GitHub. Don&apos;t make humans log deployments manually.

A simple query for GitHub Actions:

```bash
gh api repos/org/repo/actions/runs \
  --jq &apos;[.workflow_runs[] | select(.conclusion == &quot;success&quot; and .name == &quot;Deploy&quot;)] | length&apos;
```

For ArgoCD:

```bash
argocd app get my-app --output json | jq &apos;.status.history | length&apos;
```

**Aggregation.** Calculate daily, weekly, and monthly frequencies. The trend matters more than any single number.
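
A minimal sketch of the aggregation step, assuming you already have one timestamp per successful production deploy. The helper names and the sample timestamps are illustrative.

```python
from collections import Counter
from datetime import date, datetime


def deploys_per_day(timestamps):
    """Count deployments per calendar day."""
    return Counter(ts.date() for ts in timestamps)


def deploys_per_week(timestamps):
    """Count deployments per ISO week, keyed by (ISO year, ISO week number)."""
    return Counter(tuple(ts.isocalendar())[:2] for ts in timestamps)


deploy_times = [
    datetime(2026, 1, 5, 9, 0),
    datetime(2026, 1, 5, 15, 30),
    datetime(2026, 1, 7, 11, 0),
    datetime(2026, 1, 12, 10, 0),
]
per_day = deploys_per_day(deploy_times)
per_week = deploys_per_week(deploy_times)
print(per_day[date(2026, 1, 5)])
print(dict(per_week))
```

Keeping the raw timestamps and aggregating on read makes it easy to add a per-team grouping later without re-collecting anything.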

**Per team vs system-wide.** Track both. Some teams might deploy frequently while others are blocked. You need visibility into both.

## Measuring Lead Time

Lead time is the most technically challenging metric to collect.

**Definition.** Time from first commit in a change to that change running in production.

**The tricky part.** A deployment might include multiple commits, PRs, and merges. You need to trace back to the first commit.

If you&apos;re using conventional commits or PR-based workflows, you can trace from deployment to PR to commits.

For GitHub-based workflows:

```python
def calculate_lead_time(deployment_sha, deployment_time, repo):
    # get_prs_for_commit and get_first_commit_in_pr are placeholder helpers
    # that wrap your Git host&apos;s API
    
    # Find the PR(s) that introduced the deployed commit
    prs = get_prs_for_commit(repo, deployment_sha)
    if not prs:
        return None  # Direct push with no PR; handle separately
    
    # Get the first commit in the PR
    first_commit = get_first_commit_in_pr(prs[0])
    
    # Calculate time difference
    return deployment_time - first_commit.timestamp
```

**Simplification.** If full tracing is hard, approximate. Measure from PR open time to deployment time. It&apos;s not exact but captures most of the delay.
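
The approximation can be sketched as below, assuming each change record carries the PR open time and the deploy time. The field names and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median


def approximate_lead_times(changes):
    """Lead time per change, measured from PR open to production deploy."""
    return [change["deployed_at"] - change["pr_opened_at"] for change in changes]


changes = [
    {"pr_opened_at": datetime(2026, 1, 5, 9, 0), "deployed_at": datetime(2026, 1, 5, 11, 0)},
    {"pr_opened_at": datetime(2026, 1, 5, 10, 0), "deployed_at": datetime(2026, 1, 6, 12, 0)},
    {"pr_opened_at": datetime(2026, 1, 7, 8, 0), "deployed_at": datetime(2026, 1, 7, 13, 0)},
]
lead_times = approximate_lead_times(changes)
print("median:", median(lead_times))
print("worst:", max(lead_times))
```

Reporting the worst case next to the median keeps long-lived PRs visible instead of letting them disappear into the average.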

**Exclude outliers thoughtfully.** A PR that sat for three months before being deployed shouldn&apos;t be excluded just because it&apos;s inconvenient. That&apos;s signal, not noise.

## Measuring Change Failure Rate

Change failure rate requires defining what a failure is.

**What counts as a failure?**
- Rollbacks
- Hotfixes deployed within X hours
- Incidents triggered by deployments
- Feature flags immediately disabled

**What doesn&apos;t count?**
- Bugs found in staging
- Bugs found days later (hard to attribute)
- Performance regressions that don&apos;t trigger incidents

The key is consistency. Pick a definition and stick with it.

**Data sources.** Cross-reference deployments with:
- Rollback events
- Incident management systems (PagerDuty, Opsgenie)
- Hotfix deployments (often tagged differently)

```python
from datetime import timedelta

def calculate_cfr(deployments, incidents, window_hours=24):
    if not deployments:
        return 0.0  # No deployments in the period, so nothing could fail
    
    window = timedelta(hours=window_hours)
    failures = 0
    
    for deployment in deployments:
        # Check for incidents within window after deployment
        related_incidents = [i for i in incidents 
                           if deployment.time &lt; i.trigger_time &lt; deployment.time + window]
        
        if related_incidents:
            failures += 1
    
    return failures / len(deployments)
```

## Measuring Time to Restore

Time to restore measures incident recovery capability.

**Definition.** Time from incident start to incident resolution.

**Data source.** Your incident management system. PagerDuty, Opsgenie, and most tools provide API access to incident timelines.

**Considerations:**
- Use time to mitigate, not time to root cause
- Exclude incidents that weren&apos;t actually service-impacting
- Track by severity - P1 recovery time matters more than P4

```python
from statistics import median

def calculate_mttr(incidents):
    restore_times = []
    
    for incident in incidents:
        if incident.severity &lt;= 2:  # P1 and P2 only
            restore_time = incident.resolved_at - incident.triggered_at
            restore_times.append(restore_time)
    
    if not restore_times:
        return None  # No qualifying incidents in the period
    
    return median(restore_times)  # Median is more robust than mean
```

## Building the Dashboard

Once you&apos;re collecting data, visualise it usefully.

**Show trends, not just current values.** A 5% failure rate means nothing without context. Is it improving or degrading?
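
A tiny sketch of turning a series into a trend, assuming one value per period. The `period_deltas` and `direction` helpers and the weekly figures are illustrative.

```python
def period_deltas(values):
    """Change between each pair of consecutive periods."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


def direction(values, lower_is_better=True):
    """Classify the overall movement from the first to the last period."""
    change = values[-1] - values[0]
    if change == 0:
        return "flat"
    improved = (0 > change) if lower_is_better else (change > 0)
    return "improving" if improved else "degrading"


weekly_cfr = [0.12, 0.10, 0.08, 0.05]  # change failure rate per week
print([round(delta, 2) for delta in period_deltas(weekly_cfr)])
print(direction(weekly_cfr))
```

For deployment frequency, pass `lower_is_better=False` so that a rising series reads as improving.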

**Compare to benchmarks.** The DORA State of DevOps report publishes benchmarks:
- Elite: Deploy on-demand, &lt;1 hour lead time, &lt;15% CFR, &lt;1 hour restore
- High: Deploys between weekly and monthly, &lt;1 week lead time, 16-30% CFR, &lt;1 day restore
- Medium: Deploys between monthly and every 6 months, 1-6 months lead time, 31-45% CFR, &lt;1 week restore
- Low: Deploys less often than every 6 months, &gt;6 months lead time, &gt;45% CFR, &gt;6 months restore

**Show by team.** Aggregated metrics hide team-level problems. Let teams see their own performance.

**Avoid vanity displays.** A giant number showing &quot;500 deployments this month&quot; looks impressive but doesn&apos;t help improvement. Show metrics that prompt action.

## Common Implementation Mistakes

**Measuring too precisely.** Don&apos;t spend months building perfect measurement. Start with approximations and refine. Some data is better than no data.

**Ignoring context.** Raw numbers without context mislead. A team that deploys 10x daily but only has 2 services isn&apos;t necessarily high-performing.

**Making metrics targets.** The moment you tie DORA metrics to performance reviews, people game them. Deploy empty commits to boost frequency. Classify incidents as non-failures.

**Forgetting the why.** DORA metrics are a means, not an end. The goal is delivering value to customers, not optimising numbers.

**Not acting on insights.** Dashboards are useless without action. If lead time is high, do something about it. Otherwise, don&apos;t bother measuring.

## Using Metrics to Drive Improvement

Metrics should prompt questions, not supply answers.

**If deployment frequency is low:** What&apos;s blocking more frequent deploys? Manual testing? Change approval processes? Fear of breaking things?

**If lead time is high:** Where does time go? Waiting for code review? Waiting for CI? Waiting for deployment windows?

**If change failure rate is high:** Are we testing effectively? Are we deploying too much at once? Is production observability lacking?

**If time to restore is high:** Do we have runbooks? Can we roll back quickly? Do we know how to diagnose issues?

Each question leads to specific improvements. Metrics don&apos;t tell you what to do - they tell you where to look.

## Tooling Options

Several tools can help collect DORA metrics:

**Sleuth** - Purpose-built for DORA. Integrates with common tools, provides dashboards out of the box.

**LinearB** - Broader engineering metrics including DORA. Good if you want more than just deployment metrics.

**Faros** - Open source option. More setup required but full control.

**Custom.** If you have platform engineers, building custom collection isn&apos;t hard. Prometheus + Grafana with some Python scripts can work.

My recommendation: start custom, move to tooling if you need polish. Understanding how the data flows helps you trust it.

## Starting Point

If you&apos;re starting from zero:

**Week 1:** Instrument deployment frequency. Just count deploys per day. Put it on a visible dashboard.

**Week 2:** Add lead time tracking. Start with PR-open-to-deploy time.

**Week 3:** Add change failure rate. Cross-reference deploys with incidents.

**Week 4:** Add time to restore. Pull from your incident management tool.

**Ongoing:** Review metrics weekly with the team. Ask what they tell you. Make one improvement based on what you learn.

DORA metrics aren&apos;t magic. They&apos;re a starting point for continuous improvement. Implement them, use them to ask questions, and act on what you learn.</content:encoded><category>dora</category><category>devops</category><category>metrics</category><category>engineering-culture</category><category>cicd</category><category>platform-engineering</category><author>Mo Abukar</author></item><item><title>7 Years of Infrastructure Decisions: What I&apos;d Do Again and What I Regret</title><link>https://moabukar.co.uk/blog/infra-decisions/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/infra-decisions/</guid><description>Every infrastructure decision I&apos;d make again – and the ones I wouldn&apos;t – after running production workloads across fintech, open-source, IoT, and beyond.</description><pubDate>Thu, 15 Jan 2026 00:00:00 GMT</pubDate><content:encoded>Seven years. Ticketing platforms processing millions of transactions. Open-source protocol infrastructure. IoT security systems. Kubernetes clusters I&apos;ve lost count of. Lambdas that quietly run the world. Terraform state files that haunt my dreams.

This is the post I wish someone had written for me when I started. Every decision below is something I&apos;ve either shipped to production and would do again, or shipped to production and now mass-delete from my muscle memory.

No &quot;it depends&quot; hand-waving without context. No vendor-neutral cowardice. Actual opinions from actual incidents.

![7 Years of Infrastructure Decisions: What I&apos;d Do Again and What I Regret](/images/infra-decisions.jpg)


## AWS

### Picking AWS over GCP

🟩 *Endorse*

GCP has better Kubernetes. GKE&apos;s control plane is superior. The Kubernetes tooling is years ahead.

AWS has better everything else.

Account management. Support that answers the phone. An ecosystem that doesn&apos;t deprecate services you depend on. Backwards compatibility as a religion. A TAM who knows your name.

I&apos;ve run production on both. When GCP support told me to &quot;check the documentation&quot; during an outage, I knew where our future spend was going.

The Kubernetes gap has closed anyway. Karpenter, external-dns, external-secrets, AWS Load Balancer Controller – EKS is now genuinely competitive.

### EKS

🟩 *Endorse*

Running your own control plane is mass-produced serotonin for infrastructure engineers who like etcd quorum problems.

Use EKS. The cost delta versus self-managed is a rounding error compared to engineer time.

Caveat: EKS upgrades are aggressive and non-optional. You will upgrade clusters more often than you want. Automate it or die.

### EKS Managed Add-ons

🟧 *Regret*

Good idea. Poor execution.

The moment you need to customise resource requests, pin an image tag, or modify a ConfigMap, you&apos;re fighting the add-on system. And you will need to customise.

What I&apos;d do instead: Helm charts managed via Flux/ArgoCD. Full control. Fits existing GitOps. No surprises during cluster upgrades.

### RDS

🟩 *Endorse*

Network outage: downtime, post-mortem, move on.

Data loss: company-ending event, career-ending event, therapy.

The managed database markup is insurance. Automated backups, point-in-time recovery, read replicas, Multi-AZ – this is not where you penny-pinch.

Day one: enable deletion protection, set snapshot retention, test restores. Future you will send a thank-you note.

### ElastiCache (Redis)

🟩 *Endorse*

The Swiss Army knife of &quot;do fast data thing&quot;. Caching, sessions, rate limiting, pub/sub, leaderboards, distributed locks – it handles all of them well enough that you don&apos;t need five separate tools.

Redis versus Valkey licensing drama: AWS will continue supporting something Redis-compatible. Not your problem.

### ECR

🟩 *Endorse*

Ran quay.io. Stability was a disaster. Migrated to ECR. Boring ever since.

Deep IAM integration means EKS nodes pull images without credential rotation. Enable basic image scanning – it&apos;s free and catches the obvious CVEs (enhanced scanning via Amazon Inspector is billed separately).

The registry equivalent of &quot;it just works&quot;.

### VPC Endpoints (PrivateLink)

🟩 *Endorse*

Traffic to AWS services (S3, ECR, Secrets Manager, STS) going over the internet is unnecessary latency, cost, and attack surface. Interface endpoints keep it private.

The gotcha: endpoint policies. By default they&apos;re wide open. Lock them down to specific buckets/resources – otherwise you&apos;ve just created a data exfiltration path.
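
A minimal sketch of what locking it down can look like, assuming a hypothetical helper `build_s3_endpoint_policy` and placeholder bucket names. It only generates the policy document; attaching it to the endpoint is left to your IaC.

```python
import json


def build_s3_endpoint_policy(bucket_names):
    """Return an S3 VPC endpoint policy limited to the named buckets."""
    resources = []
    for name in bucket_names:
        resources.append(f"arn:aws:s3:::{name}")    # bucket-level actions
        resources.append(f"arn:aws:s3:::{name}/*")  # object-level actions
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowOnlyKnownBuckets",
                "Effect": "Allow",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                "Resource": resources,
            }
        ],
    }


if __name__ == "__main__":
    # Bucket name is a placeholder for illustration only.
    print(json.dumps(build_s3_endpoint_policy(["example-app-assets"]), indent=2))
```

One thing to watch if you go this tight: ECR serves image layers from an AWS-owned S3 bucket, so nodes pulling images through the same endpoint need that bucket allowed as well.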

Gateway endpoints (S3, DynamoDB) are free. Interface endpoints cost money but less than NAT Gateway data processing fees for high-volume services.

### Private API Gateway + VPC Link

🟧 *Context-dependent*

Private API Gateway lets you expose internal services without public endpoints. VPC Link connects it to your NLB/ALB.

The good: WAF integration, throttling, API keys, usage plans – all managed.

The bad: cold starts on private APIs are brutal (seconds, not milliseconds). Fine for internal tooling, painful for latency-sensitive workloads. Also, debugging DNS resolution issues between API Gateway and your VPC will test your patience.

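For internal APIs, consider an internal ALB instead, with authentication handled by the ALB&apos;s built-in OIDC/Cognito support or by the service itself – simpler, faster cold starts. (Lambda authorizers are an API Gateway feature; ALB has no direct equivalent.)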

### ECS (Fargate and EC2)

🟩 *Endorse for specific use cases*

Hot take: not everything needs Kubernetes.

ECS Fargate is perfect for: batch jobs, scheduled tasks, simple services that don&apos;t need the K8s ecosystem. No nodes to manage, no cluster upgrades, predictable pricing.

ECS on EC2: useful when you need GPU instances, specific instance types, or want to avoid Fargate&apos;s vCPU pricing at scale.

Where ECS falls down: complex networking policies, service mesh, anything requiring the CNCF ecosystem. If you&apos;re already running EKS, adding ECS creates operational sprawl.

My pattern: EKS for the platform, Fargate for one-off jobs that don&apos;t justify a Helm chart.

### Lambda

🟩 *Endorse more than I initially did*

I was slow to adopt Lambda. &quot;We have Kubernetes, why do we need another compute platform?&quot;

Turns out: event-driven workloads (S3 triggers, SQS consumers, API Gateway backends) are dramatically simpler on Lambda. No scaling config, no pod disruption budgets, no node selectors.

The real win is cost attribution. In Kubernetes, costs hide behind shared nodes. Lambda bills per invocation – you know exactly what each function costs.

Gotchas:
- Cold starts matter for synchronous APIs (use provisioned concurrency or accept the latency)
- 15-minute timeout kills long-running jobs (use Step Functions or ECS)
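- VPC-attached Lambdas used to pay a heavy cold start penalty for ENI attachment; shared Hyperplane ENIs (2019 onwards) removed most of it, but creating or reconfiguring a VPC function can still be slow while the ENIs are set up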

Pattern that works: Lambda for glue code, event processing, and APIs under 10s response time. ECS/EKS for everything else.

### Step Functions

🟩 *Endorse*

State machines as infrastructure. Orchestrate Lambda, ECS tasks, Glue jobs, human approval steps – all with built-in retry, error handling, and observability.

Express workflows for high-volume, short-duration (synchronous). Standard workflows for long-running, complex orchestration.

The visual debugger alone is worth it – seeing exactly where your workflow failed beats parsing CloudWatch logs.
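
The built-in retry and error handling can be sketched as an Amazon States Language definition. This is an illustration only: the state names (`ProcessOrder`, `NotifyFailure`) and the helper `build_state_machine` are made up, and the ARNs are whatever you pass in.

```python
import json


def build_state_machine(task_arn, failure_handler_arn):
    """Build an Amazon States Language definition with retry and catch."""
    return {
        "Comment": "Sketch: one task with retries and a failure handler",
        "StartAt": "ProcessOrder",
        "States": {
            "ProcessOrder": {
                "Type": "Task",
                "Resource": task_arn,
                # Retry transient failures with exponential backoff.
                "Retry": [
                    {
                        "ErrorEquals": ["States.TaskFailed", "States.Timeout"],
                        "IntervalSeconds": 2,
                        "MaxAttempts": 3,
                        "BackoffRate": 2.0,
                    }
                ],
                # Anything still failing after the retries goes to the handler.
                "Catch": [{"ErrorEquals": ["States.ALL"], "Next": "NotifyFailure"}],
                "End": True,
            },
            "NotifyFailure": {
                "Type": "Task",
                "Resource": failure_handler_arn,
                "End": True,
            },
        },
    }


if __name__ == "__main__":
    # Placeholder ARNs; substitute real Lambda or ECS task ARNs.
    print(json.dumps(build_state_machine("arn:task", "arn:handler"), indent=2))
```

The point of keeping retries in the state machine rather than in the function is that the retry policy becomes visible, versioned config instead of code buried in a handler.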

### EventBridge

🟩 *Endorse*

Decoupled event routing without managing Kafka/SQS fan-out yourself. Schema registry, archive/replay, cross-account event buses.

Pattern: services emit events to EventBridge, rules route to targets (Lambda, SQS, Step Functions). Loose coupling, easy to add new consumers without modifying producers.

One trap: event pattern matching syntax is fiddly. Test patterns thoroughly – silent failures when events don&apos;t match are painful to debug.
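
One way to catch those silent mismatches before deploying is a local unit test. The matcher below is a deliberately simplified sketch that handles exact-value matching only (no prefix, numeric, or anything-but operators), so treat it as a teaching aid, not a reimplementation of EventBridge. The event source and field names are invented.

```python
def matches(pattern, event):
    """Return True if the event satisfies an exact-match-only pattern."""
    for key, expected in pattern.items():
        if key not in event:
            return False  # a missing field is a silent non-match
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: descend into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        else:
            # Leaf patterns are lists of allowed values.
            values = actual if isinstance(actual, list) else [actual]
            if not any(value in expected for value in values):
                return False
    return True


PATTERN = {
    "source": ["orders.service"],
    "detail-type": ["OrderPlaced"],
    "detail": {"status": ["PAID", "SHIPPED"]},
}

good_event = {
    "source": "orders.service",
    "detail-type": "OrderPlaced",
    "detail": {"status": "PAID", "order_id": "123"},
}

# Same event with a typo in detail-type: no error, it just never matches.
bad_event = {**good_event, "detail-type": "OrderPlacd"}

if __name__ == "__main__":
    print(matches(PATTERN, good_event), matches(PATTERN, bad_event))
```

For the real operator set, validate against the service itself; this kind of local check mostly catches typos and wrong nesting early.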

### NAT Gateway vs NAT Instances

🟧 *Context-dependent*

NAT Gateway: managed, highly available, expensive at scale.

NAT Instances: cheap, requires maintenance, single point of failure unless you build HA yourself.

At Trainline, NAT Gateway costs were eye-watering. We built NAT instances with auto-scaling groups and saved significant money. But it&apos;s technical debt you&apos;re taking on. For most companies, NAT Gateway is correct until your bill says otherwise.

### AWS Premium Support

🟧 *Regret*

It costs as much as another engineer. Unless your team genuinely lacks AWS expertise, the ROI isn&apos;t there.

Enterprise support is worth it if you&apos;re spending £500k+/year on AWS and need the TAM relationship for commercial negotiations.

### Control Tower Account Factory for Terraform (AFT)

🟩 *Endorse*

Pre-AFT, AWS Control Tower was a UI-driven nightmare. Spinning up new accounts meant clicking through the console, manually configuring baselines, and praying someone remembered to tag things correctly. Zero automation.

AFT changed everything. Account provisioning became code. New environment? Terraform apply. Done.

The real win isn&apos;t just speed – it&apos;s standardisation. We enforce tagging at account creation. Production accounts get tagged with `environment=prod`, which we then use for routing decisions (VPC peering, network policies, cost allocation).

Tags beat AWS Organizations for this. Organizations force you into a tree structure – but account properties aren&apos;t always hierarchical. An account can be &quot;production&quot;, &quot;fintech-regulated&quot;, and &quot;us-east-1&quot; all at once. Tags handle that. Organization hierarchy doesn&apos;t.
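
To make the tags-versus-tree point concrete, here is a small sketch. The account names and tag values are invented; the idea is only that one account can satisfy several independent selectors at once, which a single parent OU cannot express.

```python
ACCOUNTS = [
    {
        "name": "payments-prod",
        "tags": {"environment": "prod", "compliance": "fintech-regulated", "region": "us-east-1"},
    },
    {
        "name": "payments-dev",
        "tags": {"environment": "dev", "compliance": "fintech-regulated", "region": "us-east-1"},
    },
    {
        "name": "marketing-prod",
        "tags": {"environment": "prod", "compliance": "none", "region": "eu-west-2"},
    },
]


def select_accounts(accounts, **required_tags):
    """Return names of accounts whose tags include every required key/value."""
    return [
        account["name"]
        for account in accounts
        if all(account["tags"].get(key) == value for key, value in required_tags.items())
    ]


if __name__ == "__main__":
    print(select_accounts(ACCOUNTS, environment="prod"))
    print(select_accounts(ACCOUNTS, compliance="fintech-regulated"))
```

The same account shows up in the production set and in the regulated set without anyone having to decide which dimension owns the hierarchy.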

Gotcha: AFT has a learning curve. The account request workflow via Git, the Terraform customizations, the pipeline structure – it&apos;s not plug-and-play. But once it&apos;s wired up, account provisioning goes from hours to minutes.

If you&apos;re running multi-account AWS and not using AFT, you&apos;re doing Control Tower the hard way.

## Kubernetes Ecosystem

### Karpenter

🟩 *Endorse*

If you&apos;re on EKS without Karpenter, you&apos;re lighting money on fire.

Cluster Autoscaler: slow, dumb, fights with node groups. Karpenter: fast, smart, provisions exactly what your pods need.

We&apos;ve seen 30-40% cost reduction on compute after migration. Spot instance handling actually works. Consolidation actually consolidates. Bin-packing that isn&apos;t a joke.

The learning curve – NodePools, EC2NodeClasses, weight-based selection – is real. Worth it. This is non-negotiable for EKS in 2025.

### KEDA (Kubernetes Event-Driven Autoscaling)

🟩 *Endorse for event-driven workloads*

HPA scales on CPU/memory. KEDA scales on anything: SQS queue depth, Kafka lag, Prometheus metrics, cron schedules.

If you&apos;re running workers that process queues, KEDA is the answer. Scale to zero when idle, scale up based on actual backlog. We&apos;ve cut costs significantly on batch processing workloads that used to run 24/7 &quot;just in case&quot;.

**Where it shines:**
- SQS consumers (scale on ApproximateNumberOfMessages)
- Kafka consumers (scale on consumer lag)
- Scheduled jobs (cron-based scaling, better than CronJobs for some use cases)
- Prometheus-based scaling (custom metrics your app exposes)

**Where it doesn&apos;t:**
- Request-based scaling (stick with HPA + Ingress metrics)
- Workloads that can&apos;t handle cold starts

Pattern: KEDA for async workers, HPA for sync APIs. They coexist fine – use ScaledObject for event-driven, HPA for everything else.
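
The backlog-based scaling idea boils down to simple arithmetic. The function below is a rough mental model, not the actual KEDA implementation (KEDA handles activation from zero and delegates the rest to an HPA); the parameter names and defaults are assumptions for illustration.

```python
import math


def desired_replicas(queue_depth, target_per_replica, min_replicas=0, max_replicas=20):
    """Replica count for a queue worker, clamped to the configured bounds."""
    if queue_depth == 0:
        return min_replicas  # idle: allow scale to zero
    wanted = math.ceil(queue_depth / target_per_replica)
    return max(min_replicas, min(wanted, max_replicas))


if __name__ == "__main__":
    for depth in (0, 30, 1000):
        print(depth, desired_replicas(depth, target_per_replica=10))
```

So 30 messages with a target of 10 per replica asks for 3 workers, an empty queue drops to the floor, and a flood is capped at the ceiling.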

### Flux vs ArgoCD for GitOps

🟩 *Endorse (either one, just pick)*

Both work. Both are CNCF graduated. The debate is overblown.

**Flux:**
- Kubernetes-native, feels like CRDs all the way down
- Lighter footprint, less resource overhead
- Better for multi-tenancy with Flux&apos;s tenant model
- Weak observability out of the box – you&apos;ll build tooling to answer &quot;where&apos;s my commit?&quot;
- No UI by default (Weave GitOps exists but meh)

**ArgoCD:**
- Beautiful UI, developers love clicking around
- App-of-apps pattern is intuitive
- Better for teams who want visual deployment status
- Heavier footprint, more moving parts
- ApplicationSets for dynamic generation

**When to use Flux:** Platform teams, multi-cluster, GitOps purists, resource-constrained clusters.

**When to use ArgoCD:** Developer-facing platforms, teams who want dashboards, orgs where &quot;I need to see it&quot; matters.

We went Flux. It&apos;s worked for years. The core reconciliation model is solid. But I&apos;ve seen ArgoCD deployments work equally well. The real mistake is running both, or spending six months evaluating instead of shipping.

### External Secrets Operator

🟩 *Endorse*

AWS Secrets Manager → Kubernetes Secrets. Developers understand it. Terraform manages secrets upstream. AWS rotation continues working.

Replaced SealedSecrets, which required infrastructure knowledge to update and broke every AWS-native rotation integration. ESO is the correct answer.

### External DNS

🟩 *Endorse*

Ingress annotation → Route53 record. Four years. Zero problems. Nothing to say. It just works.

### Cert-Manager

🟩 *Endorse*

Let&apos;s Encrypt certificates, automated, in Kubernetes. Set up once, forget forever.

The only pain: enterprise customers who don&apos;t trust Let&apos;s Encrypt. Budget for a few DigiCert certs annually for the dinosaurs.

### Helm v3

🟩 *Endorse*

Helm v2 was a security nightmare (Tiller). Helm v3 is tolerable.

Go templating is painful to debug. The ecosystem is enormous. It solves &quot;versioned Kubernetes manifests&quot; adequately. We&apos;ve all accepted this is what we have.

Store charts in OCI registries (ECR works). Avoid the S3 + plugin mess.

### Service Mesh (Istio/Linkerd)

🟩 *No Regrets (for not using)*

Service meshes solve real problems. mTLS. Observability. Traffic management.

Service meshes also add operational complexity that most teams can&apos;t afford.

For most companies: you don&apos;t need one. mTLS? Network policies + application-level encryption. Traffic splitting? Ingress controllers. Observability? You&apos;re already running Prometheus.

If you genuinely need mesh features, Linkerd is simpler than Istio. But ask yourself three times if you actually need it.

### Cilium

🟩 *Endorse*

eBPF-based networking. Replaces kube-proxy. Network policies that work. Hubble for observability. No sidecar overhead.

The migration from VPC CNI requires planning. The benefits – performance, observability, no iptables spaghetti – are worth it.

This is where Kubernetes networking is going. Get ahead of it.

## Infrastructure as Code

### Terraform over CloudFormation

🟩 *Endorse*

This shouldn&apos;t even be a debate anymore. HCL is readable. The provider ecosystem is enormous. State management is a solved problem. The hiring pool knows Terraform.

CloudFormation has its place – Service Catalog, certain AWS-native integrations – but as your primary IaC? No.

### Terragrunt

🟩 *Endorse*

Terraform&apos;s missing features: DRY remote state configuration, dependency management between modules, environment promotion without copy-paste.

Terragrunt fills the gaps. `root.hcl` inheritance keeps your environments consistent (note: `terragrunt.hcl` at root level is deprecated - use `root.hcl` now). `dependency` blocks wire outputs between stacks without data sources everywhere. `run-all` applies changes across your entire estate.

The learning curve is real – you&apos;re now debugging two tools – but the alternative is bespoke wrapper scripts that do the same thing worse.

Pattern: `root.hcl` defines remote state and provider config, environment folders have `terragrunt.hcl` files that inherit and override. One `terragrunt run-all plan` shows drift across everything.

### Spacelift

🟩 *Endorse for teams with budget*

If you&apos;re past 5 engineers touching Terraform, Spacelift pays for itself.

Drift detection that actually works. Policy-as-code (OPA) for guardrails. Stack dependencies. Self-service for developers who shouldn&apos;t have AWS console access but need to provision resources.

The killer feature: contexts and mounted files. Inject secrets, provider configs, and shared modules without templating hell.

Downside: it&apos;s not cheap. And you&apos;re adding a dependency on a vendor for your infrastructure provisioning – evaluate that risk.

### Atlantis

🟩 *Endorse for teams without budget*

PR-based Terraform workflow, self-hosted, free.

`atlantis plan` on PR open, `atlantis apply` on merge. Locks prevent concurrent modifications. Works.

We ran Atlantis in Kubernetes for years. Minimal operational overhead once configured. The main gap versus Spacelift: no drift detection, no policy engine (you&apos;ll bolt on Conftest or similar), no fancy UI.

For most teams under 10 engineers, Atlantis is correct. Graduate to Spacelift when you outgrow it.

### env0 / Terraform Cloud

🟧 *Depends on your constraints*

Terraform Cloud: HashiCorp&apos;s offering. Works, reasonably priced, tight integration with the Terraform ecosystem. The free tier is generous for small teams.

env0: similar space, more flexibility on policy and workflows, better GitOps model.

My take: if you&apos;re already paying HashiCorp for Terraform Enterprise, Cloud makes sense. Otherwise, Spacelift (if you have budget) or Atlantis (if you don&apos;t).

### Not Using CDK/Pulumi

🟩 *No Regrets*

&quot;I can use real programming constructs!&quot; – and now your IaC has inheritance hierarchies, unit tests that don&apos;t test anything meaningful, and abstractions that make `terraform plan` output incomprehensible.

Terraform&apos;s constraints are a feature. It&apos;s harder to be clever. Clever kills you at 3am.

If you need abstraction, write Terraform modules. If you need code generation, write a script that outputs Terraform. Don&apos;t make your IaC a software project.

Exception: genuinely complex conditional logic (CDK/Pulumi handle this better). But ask yourself if the complexity is necessary before reaching for more powerful tools.

### Terraform Module Strategy

🟩 *Endorse with opinions*

Monorepo for internal modules. Versioned releases via git tags. Terragrunt or a module registry for consumption.

Don&apos;t: put modules in the same repo as the Terraform that consumes them (circular dependency hell). Don&apos;t: version with branch names (&quot;just use main&quot; guarantees broken applies). Don&apos;t: build modules that do too much (a module that creates VPC + EKS + RDS + everything is unmaintainable).

Do: small, composable modules. One module, one responsibility. Test with Terratest or tftest if you&apos;re feeling fancy, but at minimum `terraform validate` in CI.
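
A cheap CI guard for the &quot;no branch names&quot; rule might look like the sketch below. It assumes modules are consumed via git sources with a `?ref=` query and that releases are tagged `vX.Y.Z`; the repository URLs are placeholders, and extracting the source strings from your `.tf` files is left out.

```python
import re

# Matches a git module source pinned to a semver tag, e.g. ...?ref=v1.4.2
PINNED_REF = re.compile(r"\?ref=v\d+\.\d+\.\d+$")


def unpinned_sources(sources):
    """Return the module sources that are not pinned to a vX.Y.Z tag."""
    return [source for source in sources if not PINNED_REF.search(source)]


SOURCES = [
    "git::https://example.com/acme/terraform-modules.git//vpc?ref=v1.4.2",
    "git::https://example.com/acme/terraform-modules.git//eks?ref=main",
    "git::https://example.com/acme/terraform-modules.git//rds",
]

if __name__ == "__main__":
    for source in unpinned_sources(SOURCES):
        print("unpinned module source:", source)
```

Fail the build when the list is non-empty and &quot;just use main&quot; stops being an option.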

### State File Hygiene

🟩 *Endorse being paranoid*

State files are your infrastructure&apos;s source of truth. Treat them accordingly.

S3 bucket: versioned, encrypted (SSE-S3 minimum, KMS if compliance requires), bucket policy denying public access, lifecycle rules to clean up old versions.

DynamoDB: locking table, on-demand capacity (you&apos;re not doing enough applies for provisioned to matter).

Separate account: your CI/CD and state live in a management account, not the accounts containing the infrastructure. When you accidentally `terraform destroy` the wrong workspace, you don&apos;t lose the state bucket too.

Never: commit state to git. Run Terraform from laptops in production. Share state between unrelated projects.

### OpenTofu

🟧 *Watching closely*

HashiCorp&apos;s license change made OpenTofu happen. It&apos;s production-ready, actively maintained, and has feature parity with Terraform 1.5.

I haven&apos;t migrated production workloads yet – inertia is real – but for greenfield projects, it&apos;s a legitimate choice. Spacelift and Terragrunt both support it.

If you&apos;re worried about HashiCorp&apos;s direction, start evaluating. The migration path is straightforward.

### Crossplane

🟥 *Regret (for most teams)*

The pitch is compelling: manage AWS/GCP/Azure resources using Kubernetes CRDs. GitOps for infrastructure. Developers self-serve without learning Terraform.

The reality: you&apos;re using infrastructure to manage infrastructure. Kubernetes managing the very AWS resources Kubernetes runs on. The recursion gives me a headache just writing it.

**The problems:**
- Provider maturity varies wildly (AWS provider is decent, others less so)
- Debugging is painful – is it a Crossplane issue, provider issue, or AWS issue?
- You need Terraform anyway for the Kubernetes cluster itself
- Composition complexity rivals Terraform modules but with worse tooling
- Your platform team now maintains Crossplane AND probably Terraform

**Where it might work:**
- Organisations already deep in Kubernetes who want unified control plane
- Platform teams building developer self-service portals
- Multi-cloud environments where one abstraction helps

For most teams: Terraform/Terragrunt/Spacelift handles infrastructure better. If developers need self-service, build an internal portal that calls Terraform, don&apos;t add another layer of abstraction.

I&apos;ve seen more Crossplane migrations fail than succeed. The teams that make it work have dedicated platform engineers and accept they&apos;re running a complex system.

### Backstage

🟩 *Endorse (with realistic expectations)*

Spotify&apos;s developer portal. Service catalog, documentation, templates for scaffolding new services. The promise: one place for developers to find everything.

**What it does well:**
- Software catalog (who owns what, where&apos;s the repo, what&apos;s the status)
- TechDocs (docs-as-code, lives with the service)
- Scaffolder templates (spin up new services with standards baked in)
- Plugin ecosystem (Kubernetes, CI/CD, cost, whatever you need)

**The honest take:**
- It&apos;s a framework, not a product. Budget 2-3 months for initial setup and customization
- Plugin quality varies wildly (some are polished, some are abandonware)
- Keeping the catalog accurate requires discipline (or automation)
- React/TypeScript skills needed to build custom plugins

**When it&apos;s worth it:**
- 50+ services and developers can&apos;t find anything
- Onboarding takes weeks because everything is tribal knowledge
- You want to standardize service creation

**When it&apos;s not:**
- Small teams where everyone knows everything
- No one to maintain it post-launch
- Expecting magic without investment

We&apos;ve seen it transform developer experience at scale. We&apos;ve also seen it become shelfware. The difference is commitment to maintaining it as a product, not a one-time project.

### Atlantis for Terraform

🟩 *Endorse*

PR-based Terraform workflow. Plan runs on PR, apply on merge. State locking prevents conflicts.

Free, self-hosted, works. We run it in Kubernetes with minimal operational overhead.

## Observability

### Datadog

🟥 *Regret*

Great product. Pricing model designed to bankrupt you.

Kubernetes makes it worse: per-host pricing when you&apos;re spinning spot instances up and down constantly. 10 instances running, 20 launched and terminated that hour? You pay for 20.
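
The gap between &quot;hosts running&quot; and &quot;hosts seen&quot; is easy to put numbers on. This sketch models only the scenario described above, not the vendor&apos;s actual billing rules; host lifetimes are `(start, end)` pairs in minutes and are invented.

```python
def hosts_seen(lifetimes, window_start, window_end):
    """Count hosts alive at any point inside the window (minutes)."""
    return sum(
        1
        for start, end in lifetimes
        if end > window_start and window_end > start
    )


def peak_concurrent(lifetimes):
    """Largest number of hosts running at the same moment."""
    events = []
    for start, end in lifetimes:
        events.append((start, 1))   # host launched
        events.append((end, -1))    # host terminated
    running = peak = 0
    # Sorting puts terminations (-1) before launches (+1) at the same minute.
    for _, delta in sorted(events):
        running += delta
        peak = max(peak, running)
    return peak


if __name__ == "__main__":
    # Ten slots, each host replaced once halfway through the hour.
    lifetimes = [(0, 30)] * 10 + [(30, 60)] * 10
    print(hosts_seen(lifetimes, 0, 60), peak_concurrent(lifetimes))
```

With ten slots each replaced once mid-hour, `hosts_seen` returns 20 while `peak_concurrent` returns 10. Spot churn widens that gap further.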

GPU nodes make it catastrophic: one service per node, full per-host cost. Your ML workloads will subsidise Datadog&apos;s Series C.

We&apos;re migrating to Prometheus + Grafana + Loki. More operational overhead. Dramatically lower cost. No vendor holding your metrics hostage.

### Not Using OpenTelemetry Early

🟥 *Regret*

Instrumented applications directly with Datadog&apos;s SDK. Now we&apos;re locked in. Migration requires touching every service.

OpenTelemetry wasn&apos;t mature four years ago. It is now. Start with it. Tracing is production-ready. Metrics are catching up.

Vendor-agnostic instrumentation isn&apos;t just about cost – it&apos;s about not being held hostage when your observability vendor&apos;s pricing becomes untenable.

### Prometheus / Grafana / Loki Stack

🟩 *Endorse*

Self-hosted observability that scales.

Prometheus for metrics. Loki for logs. Grafana for dashboards. Mimir for long-term metric storage. Tempo for traces.

Yes, you&apos;re running databases. Yes, there&apos;s operational overhead. The cost savings at scale are substantial, and you own your data.

Pattern: Prometheus Operator for Kubernetes-native deployment, ServiceMonitors for autodiscovery, Thanos or Mimir for multi-cluster aggregation.

### PagerDuty

🟩 *Endorse*

It pages you. The pricing is reasonable. The integrations work. Nothing else to say.

Don&apos;t overthink alerting platforms. PagerDuty is fine.

## Process &amp; Culture

### GitOps Everything

🟩 *Endorse*

Services. Terraform. Kubernetes manifests. Application config. All in Git. All deployed via reconciliation.

&quot;But I can&apos;t see the pipeline!&quot; – correct. Build deployment status dashboards. Invest in tooling that answers &quot;where is my commit?&quot; The payoff is infrastructure that self-heals and a Git history that tells you exactly what changed when.

### Post-Mortems in Notion (not Datadog/PagerDuty)

🟩 *Endorse*

Both Datadog and PagerDuty have incident management features. Both are inflexible garbage for post-mortems.

Notion (or any wiki) lets you customise the process. Start with PagerDuty&apos;s template, adapt to your team&apos;s culture. The tool that gets used beats the tool with features.

### Automating Post-Mortem Process

🟩 *Endorse*

Nobody wants to be the person chasing people to fill out the post-mortem.

Slack bot: &quot;No update in 1 hour, post a status.&quot; &quot;No calendar invite in 24 hours, schedule the retro.&quot; &quot;Post-mortem doc still empty after 3 days, gentle nudge.&quot;
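
The nudge logic itself is tiny. Here is a sketch of the decision function only, using the thresholds from the messages above; the function name and arguments are hypothetical, and posting to Slack is deliberately left out.

```python
from datetime import datetime, timedelta


def pending_nudges(now, last_update, incident_start, retro_scheduled, doc_is_empty):
    """Return the reminder messages the bot should post right now."""
    nudges = []
    if now - last_update > timedelta(hours=1):
        nudges.append("No update in 1 hour, post a status.")
    if not retro_scheduled and now - incident_start > timedelta(hours=24):
        nudges.append("No calendar invite in 24 hours, schedule the retro.")
    if doc_is_empty and now - incident_start > timedelta(days=3):
        nudges.append("Post-mortem doc still empty after 3 days, gentle nudge.")
    return nudges


if __name__ == "__main__":
    start = datetime(2026, 1, 1, 9, 0)
    now = start + timedelta(days=4)
    print(pending_nudges(now, now - timedelta(minutes=10), start, True, True))
```

Run it on a schedule per open incident and post whatever comes back. Keeping the rules in one pure function makes them easy to tune when the team complains.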

Make the robot the bad guy. Your relationships with colleagues will thank you.

### Regular PagerDuty Review

🟩 *Endorse*

Alert fatigue is a pipeline:

1. No alerts. We need alerts.
2. Too many alerts. We ignore alerts.
3. We tune alerts. Only critical ones page.
4. We ignore non-critical alerts.
5. Something in non-critical explodes into an incident.

Two-tier alerting (critical pages, non-critical emails) plus bi-weekly review meetings. For each alert: should it stay critical? Can we automate the fix? Can we tune the threshold?

Non-critical alerts are technical debt. Treat them accordingly.

### Monthly Cost Reviews

🟩 *Endorse*

Finance sees the bill. Finance can&apos;t answer &quot;is this right?&quot;.
Engineering can answer. Engineering doesn&apos;t look.

Monthly meeting. Both teams. Every major SaaS bill.

Tag-based cost allocation in AWS. Break down by account, service, team. Spot the anomalies before they compound into &quot;how did we spend £50k on NAT Gateway last month?&quot;
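
A sketch of the kind of script that feeds this meeting, assuming billing line items have already been exported into dictionaries with a `cost` and a `tags` map. The field names, the `team` tag, and the 25% growth threshold are all assumptions.

```python
from collections import defaultdict


def spend_by_tag(line_items, tag_key):
    """Sum cost per value of one tag; untagged spend is grouped separately."""
    totals = defaultdict(float)
    for item in line_items:
        totals[item["tags"].get(tag_key, "untagged")] += item["cost"]
    return dict(totals)


def anomalies(previous, current, growth_threshold=0.25):
    """Return tag values whose spend is new or grew by more than the threshold."""
    flagged = []
    for owner, cost in current.items():
        baseline = previous.get(owner, 0.0)
        if baseline == 0.0 or (cost - baseline) / baseline > growth_threshold:
            flagged.append(owner)
    return sorted(flagged)


if __name__ == "__main__":
    last_month = {"payments": 1000.0, "search": 400.0}
    items = [
        {"cost": 1100.0, "tags": {"team": "payments"}},
        {"cost": 700.0, "tags": {"team": "search"}},
        {"cost": 50.0, "tags": {}},
    ]
    print(anomalies(last_month, spend_by_tag(items, "team")))
```

The untagged bucket is the useful part: spend nobody owns is usually where the surprises live.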

## Networking Deep Cuts

### Route53 Latency-Based Routing + Health Checks

🟩 *Endorse*

Multi-region failover without a load balancer in front. Route53 health checks detect failures, latency-based routing sends traffic to the nearest healthy region.

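Cheaper than Global Accelerator for most use cases. The main limitation is failover speed: health checks run every 30 seconds by default (10 seconds if you pay for fast checks), a failure threshold has to be crossed, and resolvers keep serving the old answer until the record TTL expires – if you need faster failover, pay for Global Accelerator.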

Pattern: active-active with latency routing for normal operation, automatic failover when health checks fail. Works for anything with a DNS name.
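
As a mental model (not how Route53 is implemented), the routing decision and the failover delay can be sketched like this. Region names and latencies are made-up inputs, and the failover estimate is a rough upper bound that ignores resolver quirks.

```python
def pick_region(latency_ms, healthy):
    """Lowest-latency region that is currently passing health checks."""
    candidates = {
        region: ms for region, ms in latency_ms.items() if healthy.get(region)
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


def worst_case_failover_seconds(check_interval, failure_threshold, record_ttl):
    """Rough upper bound: time to mark unhealthy plus cached DNS answers."""
    return check_interval * failure_threshold + record_ttl


if __name__ == "__main__":
    latency = {"eu-west-2": 20, "us-east-1": 90}
    print(pick_region(latency, {"eu-west-2": True, "us-east-1": True}))
    print(pick_region(latency, {"eu-west-2": False, "us-east-1": True}))
```

Plug your own health check interval, failure threshold and record TTL into the second function before promising anyone a recovery time.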

### CloudFront + S3 Origin Access Control

🟩 *Endorse*

OAC replaced OAI (Origin Access Identity) – use it. Cleaner IAM integration, supports SSE-KMS encrypted buckets.

The pattern: S3 bucket is private, CloudFront is the only access path. No public bucket policies, no signed URLs for public content. Invalidation costs add up if you&apos;re deploying frequently – use versioned filenames instead.
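
Versioned filenames are usually content hashes. A minimal sketch of the naming step, assuming a hypothetical `versioned_name` helper in your build script (most bundlers already do this for you):

```python
import hashlib
from pathlib import PurePosixPath


def versioned_name(path, content, digest_length=8):
    """Insert a short content hash before the file extension."""
    digest = hashlib.sha256(content).hexdigest()[:digest_length]
    p = PurePosixPath(path)
    return str(p.with_name(f"{p.stem}.{digest}{p.suffix}"))


if __name__ == "__main__":
    # Changing the bytes changes the name, so no invalidation is needed.
    print(versioned_name("static/app.js", b"console.log(1)"))
    print(versioned_name("static/app.js", b"console.log(2)"))
```

New content gets a new object key, old keys can be cached for a long time, and only the small HTML entry point needs a short TTL.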

For APIs: CloudFront in front of API Gateway or ALB gives you edge caching, WAF integration, and a single domain for static + dynamic content.

### Transit Gateway vs VPC Peering

🟧 *Context-dependent*

VPC Peering: free (data transfer still costs), simple, doesn&apos;t scale past ~125 peerings per VPC.

Transit Gateway: costs money (hourly + per-GB), but gives you hub-and-spoke topology, route tables, multicast, and inter-region peering.

Rule of thumb: 3-5 VPCs? Peering. More than that, or you need centralised egress/ingress? Transit Gateway.
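
The rule of thumb falls out of the arithmetic: a full mesh of peerings grows quadratically, hub-and-spoke grows linearly. A quick sketch (function names are mine):

```python
def full_mesh_peerings(vpc_count):
    """Peering connections needed so every VPC can reach every other VPC."""
    return vpc_count * (vpc_count - 1) // 2


def transit_gateway_attachments(vpc_count):
    """Hub-and-spoke: one attachment per VPC."""
    return vpc_count


if __name__ == "__main__":
    for count in (4, 10, 20):
        print(count, full_mesh_peerings(count), transit_gateway_attachments(count))
```

`full_mesh_peerings(4)` is 6, which is manageable. `full_mesh_peerings(20)` is 190, against 20 attachments, and every one of those peerings needs route table entries on both sides.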

The hidden cost: Transit Gateway data processing fees. High-bandwidth cross-VPC traffic gets expensive fast. Architect to minimise cross-VPC chatter.

### DNS Resolution Across Accounts (Route53 Resolver)

🟩 *Endorse*

Multi-account setups need centralised DNS. Route53 Resolver endpoints let spoke accounts resolve private hosted zones in a central account (and vice versa).

Without this: you&apos;re managing DNS in every account or hacking /etc/hosts. Neither scales.

Pattern: central &quot;networking&quot; account owns private hosted zones, Resolver rules share them to spoke accounts via RAM. Services resolve internal DNS names regardless of which account they&apos;re in.

## Data Layer

### SQS over Self-Managed Queues

🟩 *Endorse*

Every time I&apos;ve seen teams run RabbitMQ or ActiveMQ in production, I&apos;ve seen operational pain. Clustering issues, disk space alerts, upgrade nightmares.

SQS: unlimited throughput, no capacity planning, dead-letter queues built in, costs almost nothing at reasonable scale.

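FIFO queues when ordering matters (300 API calls per second per queue by default, 3,000 messages per second with batching, more with high-throughput mode; ordering is per message group, so spread work across groups – design around it). Standard queues for everything else.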

The only valid reason for self-managed: you need AMQP protocol compatibility or complex routing (RabbitMQ exchanges). Even then, consider Amazon MQ first.

### DynamoDB

🟩 *Endorse with caveats*

Single-digit millisecond latency at any scale. No connection pooling, no read replicas to manage, global tables for multi-region.

The caveats:
- Data modelling is hard. You must know your access patterns upfront. No JOINs, no ad-hoc queries.
- On-demand pricing is expensive at sustained load. Provisioned capacity + auto-scaling for predictable workloads.
- Hot partitions will ruin your day. Distribute writes across partition keys.
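
One common way to distribute writes is to add a calculated suffix to a hot partition key. The sketch below assumes a hypothetical key layout (`base#shard`) and a fixed shard count; the trade-off is that readers have to query every shard and merge the results.

```python
import hashlib


def sharded_partition_key(base_key, item_id, shard_count=10):
    """Append a deterministic shard suffix so writes spread across partitions."""
    digest = hashlib.sha256(item_id.encode("utf-8")).digest()
    shard = int.from_bytes(digest[:4], "big") % shard_count
    return f"{base_key}#{shard}"


def all_shard_keys(base_key, shard_count=10):
    """Keys a reader must query (and merge) to see every item for base_key."""
    return [f"{base_key}#{shard}" for shard in range(shard_count)]


if __name__ == "__main__":
    # A date is a classic hot key: every write today lands on one partition.
    print(sharded_partition_key("2026-01-15", "order-1234"))
    print(all_shard_keys("2026-01-15", shard_count=4))
```

Because the suffix is derived from the item ID, a lookup for a known item still hits exactly one key. Only the &quot;give me everything for this date&quot; query pays the fan-out cost.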

Pattern: use DynamoDB for high-throughput, simple access patterns (session stores, feature flags, user preferences). Use RDS/Aurora for complex queries and relationships.

### Aurora Serverless v2

🟧 *Cautious endorsement*

Scales compute automatically, bills per ACU-second. Sounds perfect for variable workloads.

Reality: the scaling isn&apos;t instant. Under sudden load, you&apos;ll hit capacity limits before scale-up completes. The classic minimum floor (0.5 ACU) still costs money; newer engine versions can auto-pause to 0 ACUs, but resuming adds a delay, so it&apos;s not a free scale-to-zero either.

Use it for: dev/staging environments, workloads with predictable daily patterns, multi-tenant apps where you can&apos;t right-size a single instance.

Don&apos;t use it for: latency-sensitive production workloads where scaling lag matters.

## Things I&apos;d Do Differently

### Multiple Applications Sharing a Database

🟥 *Regret*

Nobody decides to share a database. It happens.

Someone adds a table. Another team adds a foreign key. Suddenly everything&apos;s coupled. The database is used by everyone, cared for by no one. And everything owned by no one is owned by infrastructure eventually.

Problems: crud accumulates that nobody can delete. Performance issues require product context infra doesn&apos;t have. Bad application code alerts the infrastructure team. One team&apos;s bad query takes down everyone.

One service, one database. Enforce it early. It&apos;s harder to untangle later.

### Not Adopting Identity Platform Early

🟥 *Regret*

Started with Google Workspace for groups and permissions. Too inflexible. Too many manual processes.

Okta (or equivalent) from day one. SCIM provisioning. SSO everywhere. Compliance sorted. Only accept SaaS vendors that integrate.

The security and audit benefits compound. The &quot;we&apos;ll do it properly later&quot; never comes.

### Not Using Lambda More

🟧 *Regret*

&quot;EC2 is cheaper than Lambda at scale&quot; – true for theoretical 100% utilisation. Nobody runs at 100% utilisation.

Lambda: scales to zero, per-request pricing, no infrastructure to manage, easy cost attribution.
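
The utilisation argument is one division. A sketch with no real prices in it, since the hourly rate is whatever your instance actually costs:

```python
def effective_hourly_cost(instance_price_per_hour, utilisation):
    """Price paid per hour of useful work at a given utilisation (0 to 1)."""
    if 0 >= utilisation or utilisation > 1:
        raise ValueError("utilisation must be greater than 0 and at most 1")
    return instance_price_per_hour / utilisation


if __name__ == "__main__":
    # Hypothetical rate of 0.10 per hour at different utilisation levels.
    for utilisation in (1.0, 0.5, 0.25):
        print(utilisation, effective_hourly_cost(0.10, utilisation))
```

At 25% utilisation every useful hour costs four times the sticker price. That multiplied figure is the number to compare against per-request pricing.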

I was slow to adopt Lambda because we had Kubernetes. Turns out event-driven workloads are dramatically simpler on Lambda. Stop fighting it.

### Renovate over Dependabot

🟩 *Endorse*

Dependency updates are boring until they&apos;re urgent. Then you&apos;re upgrading five major versions in a crisis.

Renovate: more flexible than Dependabot, more complicated to configure. The regex documentation will test your patience. Still worth it.

Automate dependency updates or accept that your dependencies will become technical debt.

## CI/CD Deep Cuts

### GitHub Actions Self-Hosted Runners on EKS

🟧 *Works, with pain*

actions-runner-controller lets you run GitHub Actions on your own Kubernetes cluster. Saves money, keeps builds inside your VPC.

The pain: runner pod scaling is flaky, ephemeral runners occasionally fail to clean up, and debugging why a workflow is stuck waiting for a runner is maddening.

We made it work with aggressive pod lifecycle limits and custom metrics for runner pool sizing. But it&apos;s not set-and-forget.

Alternative: CodeBuild for AWS-native workflows. More expensive per-minute, but zero operational overhead.

### OIDC Federation for CI/CD (No Long-Lived Credentials)

🟩 *Endorse*

GitHub Actions, GitLab CI, CircleCI – all support OIDC. Your CI job assumes an IAM role directly, no access keys stored in secrets.

Pattern: IAM OIDC provider trusts your CI provider, role trust policy scopes to specific repos/branches. Terraform apply only works from the `main` branch of the `infra` repo.
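
For GitHub Actions, the trust policy for that pattern looks roughly like the sketch below. The account ID, organisation and repo names are placeholders, and `github_trust_policy` is a hypothetical helper; in practice this document would live in your Terraform.

```python
import json

GITHUB_OIDC_HOST = "token.actions.githubusercontent.com"


def github_trust_policy(account_id, org, repo, branch="main"):
    """IAM trust policy letting one repo/branch assume a role via OIDC."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{GITHUB_OIDC_HOST}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        f"{GITHUB_OIDC_HOST}:aud": "sts.amazonaws.com",
                        # Only workflows running on this exact branch match.
                        f"{GITHUB_OIDC_HOST}:sub": f"repo:{org}/{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


if __name__ == "__main__":
    # Placeholder account ID and repo names.
    print(json.dumps(github_trust_policy("123456789012", "example-org", "infra"), indent=2))
```

The `sub` condition is the part that matters. Without it, any repository on the CI provider that knows your role ARN could assume the role.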

If you&apos;re still rotating CI credentials quarterly, stop. OIDC federation is straightforward to set up and eliminates an entire class of security incidents.
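
Here is a rough sketch of what that trust policy looks like, assuming GitHub Actions as the CI provider. The account ID, org name, and the `github_oidc_trust_policy` helper are placeholders of mine; the `aud` and `sub` claim keys are the ones GitHub puts in its OIDC tokens.

```python
import json


def github_oidc_trust_policy(account_id, org, repo, branch="main"):
    """Build an IAM trust policy that only one repo and branch can assume."""
    provider = "token.actions.githubusercontent.com"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "Federated": f"arn:aws:iam::{account_id}:oidc-provider/{provider}"
                },
                "Action": "sts:AssumeRoleWithWebIdentity",
                "Condition": {
                    "StringEquals": {
                        # Audience set by the official AWS credentials action
                        f"{provider}:aud": "sts.amazonaws.com",
                        # Pin the role to a single repo and branch
                        f"{provider}:sub": f"repo:{org}/{repo}:ref:refs/heads/{branch}",
                    }
                },
            }
        ],
    }


# Placeholder account ID and names, for illustration only
policy = github_oidc_trust_policy("123456789012", "example-org", "infra")
print(json.dumps(policy, indent=2))
```

A workflow running from any other repo or branch presents a different `sub` claim, so the `AssumeRoleWithWebIdentity` call is denied before it gets anywhere near your infrastructure.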

### Terraform State in S3 + DynamoDB Locking

🟩 *Endorse*

Obvious in retrospect, but: S3 bucket (versioned, encrypted) for state, DynamoDB table for locking. Atlantis or Terraform Cloud for remote execution.

The mistake I&apos;ve seen: state in the same account as the infrastructure. When you accidentally terraform destroy the state bucket... don&apos;t. Separate &quot;management&quot; account for CI/CD and state.

## Security Patterns

### IAM Roles Anywhere (Hybrid Workloads)

🟩 *Niche but useful*

On-prem or non-AWS workloads that need AWS API access? IAM Roles Anywhere lets you use X.509 certificates to assume IAM roles.

No more long-lived access keys on Jenkins servers. Certificate-based auth with automatic credential rotation.

Setup: Private CA (ACM PCA or your own), trust anchor in IAM, certificates on your on-prem machines. More moving parts than access keys, but dramatically better security posture.

### Secrets Manager vs Parameter Store

🟧 *It depends*

Secrets Manager: automatic rotation, cross-account sharing, costs $0.40/secret/month + API calls.

Parameter Store (SecureString): no rotation built-in, same-account only on the standard tier, free tier covers most usage.

Pattern: Secrets Manager for database credentials (use the rotation Lambda), RDS integration is seamless. Parameter Store for everything else (API keys, config values, feature flags).

Don&apos;t pay for Secrets Manager when Parameter Store does the job.

### HashiCorp Vault

🟧 *It depends (often overkill)*

Vault is powerful: dynamic secrets, PKI, transit encryption, identity-based access. It&apos;s also operationally complex – you&apos;re now running a critical distributed system.

**When Vault makes sense:**
- Dynamic database credentials (short-lived, per-pod)
- PKI infrastructure at scale
- Multi-cloud secrets management
- Strict compliance requiring audit trails on every secret access

**When it&apos;s overkill:**
- AWS-only shops (Secrets Manager + IAM roles cover 90% of use cases)
- Teams without dedicated platform engineers to maintain it
- Startups who think &quot;we&apos;ll need it eventually&quot;

If you&apos;re running Vault, run it managed (HCP Vault) or accept you&apos;re staffing a Vault team. Self-hosted Vault clusters have bitten more teams than they&apos;ve helped.

External Secrets Operator + Secrets Manager handles most Kubernetes secrets needs without the Vault overhead.

### AWS WAF

🟧 *Endorse with caveats*

WAF in front of ALB/CloudFront blocks obvious attacks: SQL injection, XSS, known bad IPs. AWS Managed Rules cover the basics.

The honest take: WAF is security theater for sophisticated attacks but catches enough script kiddies and scanners to be worth the $5/month base cost. The real protection comes from secure application code, not edge filtering.

**What works:**
- Rate limiting (actually useful for brute force)
- Geo-blocking if you don&apos;t serve certain regions
- AWS Managed Rules for OWASP top 10

**What doesn&apos;t:**
- Thinking WAF replaces input validation
- Custom regex rules (maintenance nightmare)
- Blocking legitimate traffic with overly aggressive rules

Pattern: Enable it, use managed rules, set up logging to S3, review blocked requests monthly. Don&apos;t spend weeks tuning rules unless you&apos;re under active attack.

### AWS Config + Security Hub

🟩 *Endorse for compliance*

Config rules catch drift: &quot;S3 bucket is public&quot;, &quot;EBS volume unencrypted&quot;, &quot;Security group allows 0.0.0.0/0&quot;.

Security Hub aggregates findings from Config, GuardDuty, Inspector, and third-party tools. Single pane of glass for compliance posture.

The gotcha: enabling everything generates thousands of findings. Prioritise ruthlessly – start with CIS benchmarks, suppress noise aggressively.

### SCPs (Service Control Policies)

🟩 *Endorse for guardrails*

Organisation-level policies that even account admins can&apos;t bypass. &quot;No resources outside eu-west-1/eu-west-2&quot;, &quot;No public S3 buckets&quot;, &quot;No disabling CloudTrail&quot;.

Pattern: deny-list SCPs in the organisation root for hard security boundaries. Allow-list SCPs for sandbox accounts (only specific services enabled).

Test thoroughly – an overly restrictive SCP will break deployments in ways that are hard to debug.
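
The region guardrail is the one I would start with. Below is a minimal sketch of the deny-list version, built in Python so it can be unit tested before it goes near the organisation root. The exempt action list is an assumption and deliberately short; global services need to be exempted or they break.

```python
import json

ALLOWED_REGIONS = ["eu-west-1", "eu-west-2"]

# Global services that must stay exempt from the region check.
# Illustrative only, not an exhaustive list.
GLOBAL_SERVICE_ACTIONS = [
    "iam:*",
    "organizations:*",
    "route53:*",
    "cloudfront:*",
    "sts:*",
    "support:*",
]


def region_deny_scp(allowed_regions, exempt_actions):
    """Deny every non-exempt action requested outside the allowed regions."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyOutsideAllowedRegions",
                "Effect": "Deny",
                "NotAction": exempt_actions,
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {"aws:RequestedRegion": allowed_regions}
                },
            }
        ],
    }


scp = region_deny_scp(ALLOWED_REGIONS, GLOBAL_SERVICE_ACTIONS)
print(json.dumps(scp, indent=2))
```

Attach it to a single sandbox OU first and watch CloudTrail for unexpected `AccessDenied` events before moving it up to the root.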

## The Actual Lessons

Seven years of production incidents, 3am pages, and post-mortems have taught me this:

**Non-negotiable**: EKS + Karpenter. Flux or ArgoCD (pick one, stop debating). External Secrets Operator. Terraform with Terragrunt or Spacelift. OIDC federation (no long-lived credentials, ever). VPC endpoints for AWS service traffic. Prometheus stack (or accept Datadog&apos;s pricing will eventually force migration anyway).

**Avoid at all costs**: Datadog at scale (pricing model is hostile to Kubernetes). Shared databases between services. EKS managed add-ons (you&apos;ll customise, then fight them). Service meshes you don&apos;t need. Long-lived CI credentials. Running Terraform from laptops. State files in the same account as infrastructure.

**Context-dependent**: NAT Gateway vs instances (cost threshold). Aurora Serverless v2 (scaling lag). Private API Gateway (cold start tolerance). Transit Gateway vs peering (VPC count). Secrets Manager vs Parameter Store (rotation needs). Spacelift vs Atlantis (budget).

**Niche wins worth knowing**: Route53 latency routing for cheap multi-region failover. EventBridge for decoupled event routing. Step Functions for complex orchestration. IAM Roles Anywhere for hybrid workloads. SCPs for guardrails that can&apos;t be bypassed. Lambda for event-driven glue code (stop fighting it).

**The meta-lessons**:

Boring technology wins. Every time. The clever architecture that impresses in design reviews will wake you up at 3am when it fails in ways nobody anticipated.

Debuggability over elegance. If you can&apos;t figure out why it&apos;s broken in 15 minutes with logs and metrics, your architecture is wrong.

Automation compounds. Every hour spent on operational tooling pays dividends for years. Every hour spent manually doing what should be automated is stolen from your future self.

The fanciest architecture means nothing if you can&apos;t debug it at 3am with half your brain still asleep.

---

*I share infrastructure patterns, debugging deep-dives, and production war stories. Connect on [LinkedIn](https://linkedin.com/in/moabukar) or check out [CoderCo](https://coderco.io) for hands-on DevOps education.*</content:encoded><category>kubernetes</category><category>aws</category><category>infrastructure</category><category>devops</category><category>platform-engineering</category><category>lambda</category><category>ecs</category><category>terraform</category><category>networking</category><author>Mo Abukar</author></item><item><title>MLOps for DevOps Engineers - What You Actually Need to Know</title><link>https://moabukar.co.uk/blog/ai-ml-ops-for-devops/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/ai-ml-ops-for-devops/</guid><description>MLOps is becoming a critical skill for DevOps engineers. Here&apos;s what matters: the infrastructure patterns, tooling, and operational practices that make ML systems work in production - from someone who learned the hard way.</description><pubDate>Sat, 10 Jan 2026 00:00:00 GMT</pubDate><content:encoded>Last year I got pulled into an ML project as &quot;the Kubernetes guy.&quot; The data science team had trained a fraud detection model. It worked great in their notebooks. Now they needed it in production.

&quot;Should be easy,&quot; they said. &quot;Just deploy it.&quot;

Six weeks later, we had a working system. But those six weeks taught me that ML deployment is a completely different beast. The model was the easy part. Everything around it - the data pipelines, the serving infrastructure, the monitoring - that&apos;s where the real work lives.

If you&apos;re a DevOps engineer being asked to support ML workloads, this is what I wish someone had told me before I started.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/kubeflow.svg&quot; alt=&quot;Kubeflow logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## Why ML Systems Are Different

Traditional applications are predictable. You deploy code, it behaves the same way every time. Same input, same output. If something breaks, you check the logs, find the error, fix it.

ML systems don&apos;t work like that.

The model is just a mathematical function that learned patterns from data. But data changes. Customer behavior shifts. New fraud patterns emerge. A model that worked brilliantly last month might be making terrible predictions today - and it won&apos;t throw a single error.

This is the fundamental difference: **ML systems can &quot;work&quot; while being completely wrong.**

Your monitoring won&apos;t catch it unless you&apos;ve specifically built for it. The API returns 200 OK. Latency looks fine. But the model is predicting 0.5 for everything because the input data distribution changed.

The other major difference is dependencies. Traditional apps depend on code and maybe a database. ML systems depend on:

- Training data (the original dataset)
- Feature pipelines (transformations applied to raw data)
- Model artifacts (the serialized model file)
- Inference data (live data coming in)
- External APIs (if you&apos;re enriching features)

Change any of these, and behavior changes. Often in unpredictable ways.

## The ML Pipeline - What You&apos;re Actually Operating

Before diving into tools, you need to understand what you&apos;re operating. Here&apos;s the lifecycle:

```
Data Collection → Feature Engineering → Training → Validation → Deployment → Monitoring
       ↑                                                                          |
       └──────────────────────────────────────────────────────────────────────────┘
                                    (Retraining Loop)
```

**Data Collection** is where most of the cost lives. Data lakes, streaming pipelines, storage. This is familiar territory for DevOps - just bigger datasets than you&apos;re used to.

**Feature Engineering** transforms raw data into model inputs. If the raw data is &quot;user clicked product X at time T,&quot; the features might be &quot;number of clicks in last hour&quot; and &quot;average time between clicks.&quot; This often runs on Spark or similar batch processing systems.

**Training** is the expensive part. GPU clusters, hours or days of compute, massive memory requirements. But it&apos;s also bursty - you train occasionally, not continuously.

**Validation** is where teams cut corners and pay for it later. Does the model meet quality thresholds? Does it perform equally across different user segments? Is it faster than the model it&apos;s replacing?

**Deployment** is model serving - getting predictions with low latency at scale.

**Monitoring** closes the loop. Detect when the model degrades, trigger retraining.

## Training Infrastructure

Training jobs need GPUs. Lots of them. Here&apos;s how to set it up on Kubernetes.

First, you need the NVIDIA device plugin. It exposes GPUs as a schedulable resource.

We&apos;re going to create a DaemonSet that runs on all GPU nodes:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  template:
    metadata:
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - image: nvcr.io/nvidia/k8s-device-plugin:v0.14.1
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: [&quot;ALL&quot;]
        volumeMounts:
        - name: device-plugin
          mountPath: /var/lib/kubelet/device-plugins
      volumes:
      - name: device-plugin
        hostPath:
          path: /var/lib/kubelet/device-plugins
```

Now training jobs can request GPUs:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: model-training
spec:
  template:
    spec:
      containers:
      - name: trainer
        image: my-training-image:v1
        resources:
          limits:
            nvidia.com/gpu: 4
        env:
        - name: WANDB_API_KEY
          valueFrom:
            secretKeyRef:
              name: ml-secrets
              key: wandb-key
      nodeSelector:
        gpu-type: a100
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      restartPolicy: Never
```

**My take on GPU node pools:** Create dedicated node pools for GPU workloads with taints. This prevents regular workloads from accidentally scheduling there and blocking expensive GPU capacity. The `tolerations` in the training job spec allow it to run on tainted nodes.

**Spot instances for training** are a no-brainer. Training jobs can checkpoint progress and resume after interruption. You&apos;ll save 60-90% on GPU costs. The key is implementing checkpointing properly - save model state every N steps to S3 or GCS, and have your training script resume from the latest checkpoint on startup.
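
The checkpoint-and-resume loop is simple enough to sketch. This version writes to a local directory as a stand-in for S3 or GCS, and the training step is a dummy; the function names and file layout are illustrative, not taken from any framework.

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Return (step, state) from the newest checkpoint, or a fresh start."""
    files = sorted(f for f in os.listdir(ckpt_dir) if f.endswith(".json"))
    if not files:
        return 0, {"loss": None}
    with open(os.path.join(ckpt_dir, files[-1])) as fh:
        saved = json.load(fh)
    return saved["step"], saved["state"]


def save_checkpoint(ckpt_dir, step, state):
    # Write to a temp file and rename, so an interruption mid-write
    # never leaves a half-written checkpoint behind
    path = os.path.join(ckpt_dir, f"step-{step:08d}.json")
    tmp = path + ".tmp"
    with open(tmp, "w") as fh:
        json.dump({"step": step, "state": state}, fh)
    os.replace(tmp, path)


def train(ckpt_dir, total_steps, every_n, stop_after=None):
    """Resume from the latest checkpoint and save every `every_n` steps."""
    start, state = latest_checkpoint(ckpt_dir)
    step = start
    for step in range(start + 1, total_steps + 1):
        state = {"loss": 1.0 / step}  # stand-in for a real training step
        if step % every_n == 0:
            save_checkpoint(ckpt_dir, step, state)
        if step == stop_after:
            return step  # simulate the spot instance being reclaimed
    return step


with tempfile.TemporaryDirectory() as ckpt_dir:
    print(train(ckpt_dir, total_steps=100, every_n=10, stop_after=37))  # interrupted at 37
    print(latest_checkpoint(ckpt_dir)[0])                               # resumes from 30
    print(train(ckpt_dir, total_steps=100, every_n=10))                 # finishes at 100
```

The gap between step 30 and step 37 is the work you lose on interruption, which is the trade-off when picking N: checkpoint too rarely and every reclaim is expensive, too often and you spend the job writing to object storage.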

## Model Serving - The Production Bit

Training happens occasionally. Serving happens constantly. This is where latency and reliability matter.

You have a few options for serving. Let me walk you through what I&apos;ve seen work.

### Option 1: BYO Flask/FastAPI

The simple approach. Wrap your model in a REST API:

```python
from fastapi import FastAPI
import joblib

app = FastAPI()
model = joblib.load(&quot;model.pkl&quot;)

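# Column order the model was trained with (placeholder names)
FEATURE_ORDER = [&quot;feature_a&quot;, &quot;feature_b&quot;, &quot;feature_c&quot;]

@app.post(&quot;/predict&quot;)
async def predict(features: dict):
    # Build the row in training order instead of trusting JSON key order
    row = [features[name] for name in FEATURE_ORDER]
    prediction = model.predict([row])
    return {&quot;prediction&quot;: float(prediction[0])}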
```

This works for simple cases. But you&apos;re reinventing the wheel on:
- Batching (grouping multiple requests for GPU efficiency)
- Model versioning
- Canary deployments
- Auto-scaling
- Health checks

### Option 2: KServe (My Recommendation)

KServe (formerly KFServing) handles all of that out of the box. It&apos;s become the standard for model serving on Kubernetes.

Let&apos;s deploy a scikit-learn model:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    minReplicas: 1
    maxReplicas: 10
    scaleTarget: 70
    scaleMetric: concurrency
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v2
      resources:
        requests:
          cpu: 500m
          memory: 1Gi
        limits:
          cpu: 1
          memory: 2Gi
```

KServe handles:
- Downloading the model from S3
- Creating the serving container
- Auto-scaling based on request concurrency
- Canary deployments (deploy v3 to 10% of traffic)
- A/B testing
- Standardized prediction protocol

For canary deployments, which you&apos;ll want when replacing models:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: fraud-detector
spec:
  predictor:
    canaryTrafficPercent: 10
    minReplicas: 1
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://models/fraud-detector/v3
```

This sends 10% of traffic to v3, keeping 90% on the previous version. Gradually increase if metrics look good.

## Experiment Tracking - MLflow

Here&apos;s a lesson I learned the hard way: data scientists will train hundreds of model variants. Without tracking, nobody knows which one is in production or why it was chosen.

MLflow is the standard tool. Let&apos;s set it up on Kubernetes.

First, we need a PostgreSQL database for metadata and S3 for artifacts. Then deploy the tracking server:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlflow-tracking
spec:
  replicas: 1
  selector:
    matchLabels:
      app: mlflow
  template:
    metadata:
      labels:
        app: mlflow
    spec:
      containers:
      - name: mlflow
        image: ghcr.io/mlflow/mlflow:v2.10.0
        command:
        - mlflow
        - server
        - --host=0.0.0.0
        - --port=5000
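        # Kubernetes expands $(MLFLOW_DB_PASSWORD) from the env var below
        - --backend-store-uri=postgresql://mlflow:$(MLFLOW_DB_PASSWORD)@postgres:5432/mlflow
        - --default-artifact-root=s3://mlflow-artifacts/
        ports:
        - containerPort: 5000
        env:
        - name: MLFLOW_DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: mlflow-db   # assumed Secret name
              key: password
        - name: AWS_ACCESS_KEY_ID
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: access-key
        - name: AWS_SECRET_ACCESS_KEY
          valueFrom:
            secretKeyRef:
              name: aws-credentials
              key: secret-key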
```

Data scientists integrate with a few lines of code:

```python
import mlflow

mlflow.set_tracking_uri(&quot;http://mlflow-tracking:5000&quot;)
mlflow.set_experiment(&quot;fraud-detection&quot;)

with mlflow.start_run():
    mlflow.log_param(&quot;learning_rate&quot;, 0.01)
    mlflow.log_param(&quot;max_depth&quot;, 5)
    
    # Training happens here...
    
    mlflow.log_metric(&quot;accuracy&quot;, 0.94)
    mlflow.log_metric(&quot;f1_score&quot;, 0.91)
    mlflow.sklearn.log_model(model, &quot;model&quot;)
```

Now every experiment is tracked: what parameters were used, what metrics were achieved, and the model artifact itself. When something goes wrong in production, you can trace back to exactly what was deployed.

## Monitoring ML Systems - The Hard Part

Standard application monitoring (latency, error rate, throughput) still applies. But it misses the ML-specific failures.

### What to Monitor

**Prediction distribution.** If your fraud model normally predicts between 0.1 and 0.9, and suddenly everything is 0.5, something&apos;s wrong. Track the mean, standard deviation, and percentiles of predictions.

**Feature drift.** Input data changing from the training distribution. If the model was trained on users aged 18-65 and suddenly you&apos;re getting users aged 70+, predictions might be unreliable.

**Concept drift.** The relationship between features and labels changing. Fraud patterns evolve. What indicated fraud last year might be normal behavior now.

**Data quality.** Missing values, null features, unexpected types. Garbage in, garbage out.

### Implementing Drift Detection

Here&apos;s a simple approach using Prometheus. First, instrument your serving code:

```python
from prometheus_client import Histogram, Counter

prediction_histogram = Histogram(
    &apos;model_prediction_value&apos;,
    &apos;Distribution of model predictions&apos;,
    buckets=[0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
)

feature_missing_counter = Counter(
    &apos;feature_missing_total&apos;,
    &apos;Count of missing features&apos;,
    [&apos;feature_name&apos;]
)

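EXPECTED_FEATURES = [&apos;feature_a&apos;, &apos;feature_b&apos;, &apos;feature_c&apos;]

@app.post(&quot;/predict&quot;)
async def predict(features: dict):
    # Check for missing features
    for expected in EXPECTED_FEATURES:
        if expected not in features:
            feature_missing_counter.labels(feature_name=expected).inc()

    # Fixed column order; a missing feature becomes NaN, so the model
    # (or an imputer in front of it) has to tolerate NaN inputs
    row = [features.get(name, float(&apos;nan&apos;)) for name in EXPECTED_FEATURES]
    prediction = model.predict([row])
    prediction_histogram.observe(float(prediction[0]))

    return {&quot;prediction&quot;: float(prediction[0])}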
```

Then alert on drift:

```yaml
groups:
- name: ml-monitoring
  rules:
  - alert: PredictionDistributionShift
    # _sum and _count are cumulative counters, so rate() gives the mean per window
    expr: |
      abs(
        rate(model_prediction_value_sum[1h]) / rate(model_prediction_value_count[1h])
        -
        rate(model_prediction_value_sum[1h] offset 7d) / rate(model_prediction_value_count[1h] offset 7d)
      ) &gt; 0.1
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: &quot;Model prediction distribution has shifted significantly&quot;

  - alert: HighMissingFeatureRate
    expr: rate(feature_missing_total[5m]) &gt; 0.01
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: &quot;High rate of missing features in model input&quot;
```
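
A note on the arithmetic behind that first alert: a Prometheus histogram exposes `_sum` and `_count` as cumulative counters, so the mean prediction over a window is the increase in the sum divided by the increase in the count, which is what `rate(sum) / rate(count)` works out to. A tiny sketch with made-up scrape values:

```python
def windowed_mean(sum_start, count_start, sum_end, count_end):
    """Mean of the observations recorded between two scrapes of a histogram."""
    observed = count_end - count_start
    if observed == 0:
        return None  # no predictions landed in the window
    return (sum_end - sum_start) / observed


# Hypothetical scrapes: 1,000 predictions averaging 0.30 so far,
# then 100 more in the last hour that all came out at 0.50
recent = windowed_mean(300.0, 1000, 350.0, 1100)
lifetime = 350.0 / 1100

print(recent)    # mean over the last window only
print(lifetime)  # lifetime mean, which barely moves
```

The lifetime mean hides the shift almost completely, which is why the alert needs the per-window figure on both sides of the comparison.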

For more sophisticated drift detection, look at Evidently AI or WhyLabs. They provide statistical tests (Kolmogorov-Smirnov, Population Stability Index) and dashboards specifically designed for ML monitoring.
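
PSI is simple enough to compute by hand if you want to see what those tools are doing. Here is a minimal pure-Python sketch, assuming fixed bin edges shared by the baseline and live samples. The sample scores are made up, and the 0.25 cut-off mentioned in the comments is a commonly quoted rule of thumb, not a standard.

```python
import bisect
import math


def psi(expected, actual, edges, eps=1e-6):
    """Population Stability Index between two samples over shared bin edges."""
    interior = edges[1:-1]

    def proportions(values):
        counts = [0] * (len(edges) - 1)
        for v in values:
            # Values outside the outer edges fall into the first or last bin
            counts[bisect.bisect_right(interior, v)] += 1
        # Floor at eps so an empty bin does not produce log(0)
        return [max(c / len(values), eps) for c in counts]

    exp_p, act_p = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for e, a in zip(exp_p, act_p))


edges = [0.0, 0.25, 0.5, 0.75, 1.0]
baseline = [0.1, 0.15, 0.2, 0.3, 0.6, 0.7, 0.8, 0.9]  # made-up training-time scores
collapsed = [0.5] * 8                                  # the "everything is 0.5" failure

print(psi(baseline, baseline, edges))   # identical samples give 0.0
print(psi(baseline, collapsed, edges))  # far above the 0.25 rule of thumb
```

In practice you would compute the baseline proportions once at training time, store them next to the model artifact, and compare a rolling window of live predictions against them.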

## The Retraining Pipeline

Models degrade. You need automated retraining. Here&apos;s how I set it up with Argo Workflows:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: fraud-model-retrain
spec:
  schedule: &quot;0 2 * * 0&quot;  # Weekly, Sunday 2am
  workflowSpec:
    entrypoint: retrain-pipeline
    templates:
    - name: retrain-pipeline
      dag:
        tasks:
        - name: extract-data
          template: extract-training-data
        - name: train
          dependencies: [extract-data]
          template: train-model
        - name: validate
          dependencies: [train]
          template: validate-model
        - name: deploy
          dependencies: [validate]
          template: deploy-if-better
          when: &quot;{{tasks.validate.outputs.parameters.passed}} == true&quot;

    - name: extract-training-data
      container:
        image: data-pipeline:v1
        command: [python, extract.py]
        args: [&quot;--output&quot;, &quot;/data/training.parquet&quot;]

    - name: train-model
      container:
        image: training:v1
        resources:
          limits:
            nvidia.com/gpu: 2
        command: [python, train.py]
        args: [&quot;--data&quot;, &quot;/data/training.parquet&quot;]

    - name: validate-model
      container:
        image: validation:v1
        command: [python, validate.py]
      outputs:
        parameters:
        - name: passed
          valueFrom:
            path: /tmp/validation_passed

    - name: deploy-if-better
      container:
        image: deployer:v1
        command: [python, deploy.py]
```

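The key insight: **the validation step gates deployment.** Never auto-deploy a model that hasn&apos;t been validated against quality thresholds. Compare accuracy, latency, and fairness metrics against the current production model. One caveat on the manifest: each task runs in its own pod, so `/data/training.parquet` has to sit on a shared volume, or be passed as an Argo artifact, before `train-model` can read it.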
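
For a sense of what the gate itself might contain, here is a sketch of the decision logic inside a `validate.py`. The thresholds and metric names are hypothetical, and the demo writes to a temp file rather than the `/tmp/validation_passed` path the workflow reads.

```python
import os
import tempfile

# Hypothetical thresholds; tune these for your own model
MIN_ACCURACY = 0.90
MAX_P95_LATENCY_MS = 50.0


def passes_gate(candidate, production):
    """Candidate must clear absolute floors and not regress against production."""
    checks = [
        candidate["accuracy"] >= MIN_ACCURACY,
        candidate["accuracy"] >= production["accuracy"],
        MAX_P95_LATENCY_MS >= candidate["p95_latency_ms"],
    ]
    return all(checks)


def write_result(passed, path):
    # Argo reads this file as the `passed` output parameter, and the
    # `when` clause compares its contents to the literal string "true"
    with open(path, "w") as fh:
        fh.write("true" if passed else "false")


candidate = {"accuracy": 0.94, "p95_latency_ms": 20.0}
production = {"accuracy": 0.92, "p95_latency_ms": 25.0}

result_path = os.path.join(tempfile.mkdtemp(), "validation_passed")
write_result(passes_gate(candidate, production), result_path)
print(open(result_path).read())
```

Writing `false` rather than exiting non-zero keeps the workflow green when a candidate is simply not good enough, and reserves failures for the pipeline actually breaking.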

## Cost Management

ML infrastructure is expensive. Here&apos;s how to keep it under control:

**Spot instances for training.** I mentioned this, but it bears repeating. Checkpointing + spot = 60-90% off your GPU bill.

**Right-size GPU instances.** A100s are overkill for most inference. T4s often work fine at a fraction of the cost. Profile your model&apos;s actual memory and compute requirements.

**Scale to zero.** KServe can scale to zero replicas when there&apos;s no traffic (serverless mode, with `minReplicas: 0`). You only pay when the model is being used, at the price of a cold start on the first request.

**Monitor GPU utilization.** I&apos;ve seen teams running GPUs at 10% utilization because they&apos;re processing one request at a time. Enable request batching to improve throughput.

**Lifecycle policies for model artifacts.** Old model versions accumulate in S3. Set up lifecycle rules to archive or delete after 90 days.

## Getting Started

If you&apos;re new to MLOps, don&apos;t try to adopt everything at once. Here&apos;s the order I&apos;d recommend:

1. **Containerize models.** Get them out of notebooks and into Docker images with pinned dependencies. This alone solves half the &quot;works on my machine&quot; problems.

2. **Set up MLflow.** Experiment tracking is low effort, high value. You&apos;ll thank yourself when someone asks &quot;what&apos;s in production?&quot;

3. **Deploy with KServe.** Don&apos;t build your own serving infrastructure. KServe handles the hard parts.

4. **Add Prometheus metrics.** Start tracking prediction distributions from day one. You need baseline data before you can detect drift.

5. **Automate retraining.** Once you have monitoring, add scheduled retraining with validation gates.

ML systems are harder to operate than traditional applications. But the patterns are learnable, the tools are maturing, and frankly, this is where infrastructure is heading. Every company is becoming an ML company, whether they realise it or not.

The DevOps engineers who understand this stack will be in high demand. Start learning it now.</content:encoded><category>mlops</category><category>devops</category><category>kubernetes</category><category>machine-learning</category><category>platform-engineering</category><category>infrastructure</category><author>Mo Abukar</author></item><item><title>Debugging JVM Thread Exhaustion on EC2: A Contractor War Story</title><link>https://moabukar.co.uk/blog/jvm-server-issues/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/jvm-server-issues/</guid><description>How I diagnosed and fixed a Java application that kept crashing under load – from &apos;cannot create native thread&apos; errors to properly tuned JVM settings, system limits, and right-sized EC2 instances.</description><pubDate>Sat, 10 Jan 2026 00:00:00 GMT</pubDate><content:encoded># Debugging JVM Thread Exhaustion on EC2: A Contractor War Story

I got called in as a contractor to help a client whose Java application kept dying under load. The staging environment would work fine with a handful of users, but the moment they ran load tests simulating real traffic, the JVM would crash with cryptic errors about threads and memory.

The symptoms were classic resource exhaustion, but the root causes were multiple – and finding them required digging through JVM settings, Linux system limits, and EC2 instance sizing. This post walks through the debugging process and the fixes that got them to production.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/openjdk.svg&quot; alt=&quot;OpenJDK logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## The Symptoms

The application was a REST API running on EC2, serving requests like:

```bash
curl -vvk https://api.example.com/rest/getVersionDetails/web
```

Under light load: fine. Under load testing (simulating ~500 concurrent users): crashes within minutes.

The errors in the logs varied:

```
java.lang.OutOfMemoryError: unable to create native thread
```

```
java.lang.OutOfMemoryError: Java heap space
```

```
Cannot allocate memory (errno=12)
```

The application would sometimes hang completely, other times crash and restart via systemd, only to crash again.

## Initial Assessment

First, I SSH&apos;d into the staging server during a load test to see what was happening in real-time.

### Check System Resources

```bash
# Memory usage
free -h
              total        used        free      shared  buff/cache   available
Mem:          983Mi       812Mi        62Mi       0.0Ki       108Mi        74Mi
Swap:            0B          0B          0B

# CPU and load
uptime
 14:23:45 up 2 days,  3:42,  1 user,  load average: 4.82, 3.21, 1.89

# Process count
ps aux | wc -l
847
```

The server was a `t2.micro` with 1GB RAM. It was completely maxed out – 812MB used, only 74MB available, and no swap configured. The load average was nearly 5 on a single vCPU.

### Check Thread Count

```bash
# Threads for the Java process
ps -eo pid,nlwp,cmd | grep java
12847  523  /usr/bin/java -jar /opt/app/api.jar

# System-wide thread limit
cat /proc/sys/kernel/threads-max
7732

# Max processes/threads for the current user
ulimit -u
unlimited
```

The Java process had **523 threads** running. That&apos;s a lot for a t2.micro.

### Check systemd TasksMax

This was a key finding:

```bash
systemctl show --property DefaultTasksMax
DefaultTasksMax=18446744073709551615
```

That absurdly large number meant the system default was essentially unlimited – but the per-service limit might be different:

```bash
systemctl show myapp.service --property TasksMax
TasksMax=512
```

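**There it was.** The systemd service had a TasksMax of 512, and the Java process was running 500+ threads, right at that ceiling. systemd doesn&apos;t kill threads at the limit; the kernel&apos;s pids cgroup controller refuses to create new ones, and the JVM reports that failure as `unable to create native thread`.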

## The Problems (There Were Several)

### Problem 1: TasksMax Limit

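systemd&apos;s TasksMax setting limits how many tasks (processes and threads combined) a service can have at once. The default varies by distribution and systemd version: older releases defaulted to 512, newer ones to 15% of the kernel-wide limit. A busy Java application can easily exceed a 512 cap.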
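
A quick way to see how close a process is to that cap is to read the `Threads:` line from `/proc/PID/status` and compare it with the service limit. A minimal sketch, using a trimmed sample of the file and figures borrowed from this incident; the helper name is mine. Bear in mind TasksMax counts every task in the service cgroup, not just the JVM, so `systemctl show myapp.service --property TasksCurrent` is the authoritative number.

```python
def thread_headroom(status_text, tasks_max):
    """Return (threads, remaining) from the text of /proc/PID/status."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            threads = int(line.split()[1])
            return threads, tasks_max - threads
    raise ValueError("no Threads: line found")


# Trimmed sample of /proc/12847/status; on a real box, read the file instead
sample = "Name:\tjava\nPid:\t12847\nThreads:\t487\n"

threads, remaining = thread_headroom(sample, tasks_max=512)
print(f"{threads} threads, {remaining} left before TasksMax")
```

Anything in the low double digits of headroom under load means the next traffic spike ends in `unable to create native thread`.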

### Problem 2: Undersized Instance

A t2.micro has:
- 1 vCPU (burstable)
- 1GB RAM
- No swap by default

Running a JVM that spawns hundreds of threads on this is asking for trouble. The JVM alone needs memory for:
- Heap (application objects)
- Metaspace (class metadata)
- Thread stacks (1MB default per thread × 500 threads = up to 500MB reserved for stacks)
- Native memory (JIT compiler, GC, etc.)
- The OS itself

On a 1GB instance, there simply wasn&apos;t enough memory.

### Problem 3: No JVM Tuning

The application was running with default JVM settings:

```bash
ps aux | grep java
# Showed no -Xmx, -Xms, or -Xss flags
```

The JVM was auto-sizing based on available memory, but its choices weren&apos;t appropriate for this workload.

### Problem 4: No Swap Space

When physical RAM runs out, Linux normally uses swap. But EC2 instances don&apos;t have swap by default, so the OOM killer would just terminate processes.

### Problem 5: Thread Leak

Looking at thread dumps over time, the thread count kept growing:

```bash
# Take thread dumps 30 seconds apart
jstack 12847 &gt; /tmp/threads1.txt
sleep 30
jstack 12847 &gt; /tmp/threads2.txt

# Compare thread counts
grep &quot;java.lang.Thread.State&quot; /tmp/threads1.txt | wc -l
487
grep &quot;java.lang.Thread.State&quot; /tmp/threads2.txt | wc -l
512
```

Threads were being created but not cleaned up – a classic thread leak, likely from connection pools or async handlers not being properly closed.
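
Counting threads tells you there is a leak; grouping them by name tells you where. Here is a small sketch that diffs two dumps by thread-name prefix. The sample dump lines are invented for illustration, though the quoted-name header format is what `jstack` prints.

```python
import re
from collections import Counter

# jstack starts each thread entry with the thread name in double quotes
THREAD_HEADER = re.compile(r'^"([^"]+)"')


def pool_counts(dump_text):
    """Count threads per name prefix, with the trailing thread number stripped."""
    counts = Counter()
    for line in dump_text.splitlines():
        match = THREAD_HEADER.match(line)
        if match:
            prefix = re.sub(r"[-#\s]*\d+$", "", match.group(1))
            counts[prefix] += 1
    return counts


def growth(before, after):
    """Return the prefixes whose thread count grew between two dumps."""
    # Counter subtraction keeps only the positive differences
    return dict(pool_counts(after) - pool_counts(before))


# Invented sample dumps; in practice, read /tmp/threads1.txt and /tmp/threads2.txt
before = (
    '"pool-3-thread-1" #21 prio=5\n'
    '   java.lang.Thread.State: WAITING (parking)\n'
    '"http-nio-8080-exec-1" #30 daemon prio=5\n'
    '   java.lang.Thread.State: RUNNABLE\n'
)
after = before + (
    '"pool-3-thread-2" #41 prio=5\n'
    '"pool-3-thread-3" #42 prio=5\n'
)

print(growth(before, after))
```

Whichever prefix keeps growing across dumps points at the executor or client that is leaking, which narrows the code search considerably.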

## The Fixes

### Fix 1: Increase TasksMax

Create a drop-in override for the service:

```bash
sudo systemctl edit myapp.service
```

Add:

```ini
[Service]
TasksMax=4096
```

Then reload:

```bash
sudo systemctl daemon-reload
sudo systemctl restart myapp.service
```

Verify:

```bash
systemctl show myapp.service --property TasksMax
TasksMax=4096
```

This was the immediate fix that stopped the crashes, but it only masked the underlying problems.

### Fix 2: Right-Size the EC2 Instance

I recommended upgrading from t2.micro to at least t3.medium (2 vCPU, 4GB RAM) for staging, and t3.large (2 vCPU, 8GB RAM) for production.

The memory calculation:

| Component | Memory |
|-----------|--------|
| JVM Heap | 2GB |
| Metaspace | 256MB |
| Thread stacks (500 threads × 512KB) | 250MB |
| Native/JIT/GC | ~500MB |
| OS + buffer cache | ~1GB |
| **Total** | **~4GB minimum** |

A t2.micro was never going to work. We moved to t3.medium for staging and t3.large for production.
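
The table is just addition, but it is worth scripting so you can replay it with different thread counts and stack sizes. Here is a small sketch using the figures above; the function name is mine, and the native and OS numbers are the same ballpark estimates as in the table, not measurements.

```python
def jvm_memory_budget_mb(heap_mb, metaspace_mb, threads, stack_kb, native_mb, os_mb):
    """Return a per-component breakdown in MB and the overall total."""
    parts = {
        "heap": heap_mb,
        "metaspace": metaspace_mb,
        "thread_stacks": threads * stack_kb / 1024,
        "native_jit_gc": native_mb,
        "os_and_cache": os_mb,
    }
    return parts, sum(parts.values())


# Figures from the table: 2GB heap, 256MB metaspace, 500 threads at 512KB
parts, total = jvm_memory_budget_mb(2048, 256, 500, 512, 500, 1024)
print(parts["thread_stacks"], total)

# Same workload with the default 1MB stacks
_, total_default_stacks = jvm_memory_budget_mb(2048, 256, 500, 1024, 500, 1024)
print(total_default_stacks)
```

With 512KB stacks the total lands just under 4GB, which is why t3.medium is the floor rather than a comfortable fit.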

### Fix 3: Tune JVM Settings

I added explicit JVM flags to the startup script:

```bash
#!/bin/bash
# /opt/app/start.sh

JAVA_OPTS=&quot;-server&quot;
JAVA_OPTS=&quot;$JAVA_OPTS -Xms1g -Xmx2g&quot;           # Heap: 1GB initial, 2GB max
JAVA_OPTS=&quot;$JAVA_OPTS -Xss512k&quot;                 # Thread stack: 512KB (down from 1MB default)
JAVA_OPTS=&quot;$JAVA_OPTS -XX:MaxMetaspaceSize=256m&quot;
JAVA_OPTS=&quot;$JAVA_OPTS -XX:+UseG1GC&quot;             # G1 garbage collector
JAVA_OPTS=&quot;$JAVA_OPTS -XX:MaxGCPauseMillis=200&quot;
JAVA_OPTS=&quot;$JAVA_OPTS -XX:+HeapDumpOnOutOfMemoryError&quot;
JAVA_OPTS=&quot;$JAVA_OPTS -XX:HeapDumpPath=/var/log/app/heapdump.hprof&quot;

exec java $JAVA_OPTS -jar /opt/app/api.jar
```

Key settings explained:

| Flag | Purpose |
|------|---------|
| `-Xms1g -Xmx2g` | Set initial (1GB) and max (2GB) heap. Setting them equal would avoid resize overhead, at the cost of committing the full heap up front. |
| `-Xss512k` | Reduce thread stack size from 1MB to 512KB. Saves memory with many threads. |
| `-XX:MaxMetaspaceSize=256m` | Cap metaspace to prevent unbounded growth. |
| `-XX:+UseG1GC` | G1 is better for larger heaps and lower pause times. |
| `-XX:+HeapDumpOnOutOfMemoryError` | Automatically dump heap on OOM for post-mortem analysis. |

### Fix 4: Add Swap Space

Even with proper sizing, swap provides a safety net:

```bash
# Create 2GB swap file
sudo fallocate -l 2G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

# Make permanent
echo &apos;/swapfile none swap sw 0 0&apos; | sudo tee -a /etc/fstab

# Reduce swappiness (prefer RAM, use swap only when necessary)
echo &apos;vm.swappiness=10&apos; | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
```

Verify:

```bash
free -h
              total        used        free      shared  buff/cache   available
Mem:          3.8Gi       1.2Gi       1.9Gi       0.0Ki       712Mi       2.4Gi
Swap:         2.0Gi          0B       2.0Gi
```

### Fix 5: Fix the Thread Leak

This required code changes from the development team. The issues were:

1. **HTTP connection pool not configured with max connections**:

```java
// Before: default client, pool limits never set explicitly
CloseableHttpClient client = HttpClients.createDefault();

// After: bounded pool
PoolingHttpClientConnectionManager cm = new PoolingHttpClientConnectionManager();
cm.setMaxTotal(100);
cm.setDefaultMaxPerRoute(20);
CloseableHttpClient client = HttpClients.custom()
    .setConnectionManager(cm)
    .build();
```

2. **Async handlers not completing**:

```java
// Before: CompletableFuture without timeout
CompletableFuture.supplyAsync(() -&gt; fetchData());

// After: with timeout
CompletableFuture.supplyAsync(() -&gt; fetchData())
    .orTimeout(30, TimeUnit.SECONDS)
    .exceptionally(ex -&gt; {
        logger.error(&quot;Async operation failed or timed out&quot;, ex);
        return fallbackValue;
    });
```

3. **ExecutorService not bounded**:

```java
// Before: cached thread pool (unbounded)
ExecutorService executor = Executors.newCachedThreadPool();

// After: fixed thread pool
ExecutorService executor = Executors.newFixedThreadPool(50);
```

### Fix 6: Add Monitoring

I set up CloudWatch alarms to catch these issues before they caused outages:

```bash
# Install CloudWatch agent
sudo yum install -y amazon-cloudwatch-agent

# Configure to collect memory and process metrics
sudo tee /opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json &gt; /dev/null &lt;&lt; &apos;EOF&apos;
{
  &quot;metrics&quot;: {
    &quot;namespace&quot;: &quot;MyApp&quot;,
    &quot;metrics_collected&quot;: {
      &quot;mem&quot;: {
        &quot;measurement&quot;: [&quot;mem_used_percent&quot;, &quot;mem_available&quot;]
      },
      &quot;processes&quot;: {
        &quot;measurement&quot;: [&quot;running&quot;, &quot;blocked&quot;, &quot;zombies&quot;]
      }
    }
  }
}
EOF

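# Load the config and (re)start the agent
sudo /opt/aws/amazon-cloudwatch-agent/bin/amazon-cloudwatch-agent-ctl \
    -a fetch-config -m ec2 -s \
    -c file:/opt/aws/amazon-cloudwatch-agent/etc/amazon-cloudwatch-agent.json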
```

And a custom metric for JVM thread count:

```bash
#!/bin/bash
# /opt/scripts/jvm-metrics.sh
# Run via cron every minute

PID=$(pgrep -f &quot;api.jar&quot; | head -n 1)   # first match only, in case several processes match
if [ -n &quot;$PID&quot; ]; then
    THREAD_COUNT=$(awk &apos;/^Threads:/ {print $2}&apos; &quot;/proc/$PID/status&quot;)

    aws cloudwatch put-metric-data \
        --namespace &quot;MyApp&quot; \
        --metric-name &quot;JVMThreadCount&quot; \
        --value &quot;$THREAD_COUNT&quot; \
        --unit Count
fi
```
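
The same `Threads:` field can be read from application-side tooling. Here is a minimal sketch in Python, assuming a Linux host; the function names and the sample text are mine, not part of the original setup.

```python
from pathlib import Path


def parse_thread_count(status_text):
    """Return the value of the Threads: field from /proc/PID/status text, or None."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split()[1])
    return None


def thread_count(pid):
    """Read the live thread count for a process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


# Hypothetical excerpt of a status file, for illustration.
sample = "Name:\tjava\nPid:\t4242\nThreads:\t187\n"
print(parse_thread_count(sample))
```

Keeping the parsing separate from the file read makes the logic testable on a machine without `/proc`.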

CloudWatch alarm:

```bash
aws cloudwatch put-metric-alarm \
    --alarm-name &quot;jvm-thread-count-high&quot; \
    --metric-name &quot;JVMThreadCount&quot; \
    --namespace &quot;MyApp&quot; \
    --statistic &quot;Average&quot; \
    --period 300 \
    --threshold 400 \
    --comparison-operator &quot;GreaterThanThreshold&quot; \
    --evaluation-periods 2 \
    --alarm-actions &quot;arn:aws:sns:eu-west-1:123456789012:alerts&quot;
```

## Verification

After all fixes were applied, I ran the load test again:

```bash
# Before fixes
# 500 concurrent users → crash within 5 minutes

# After fixes
# 500 concurrent users → stable for 2 hours
# Memory: 2.1GB / 4GB
# Threads: stable at ~180 (down from 500+)
# Response time: P99 &lt; 200ms
```

The thread leak fix was the biggest improvement – thread count dropped from 500+ to ~180 and stayed stable.

## Debugging Commands Reference

For the next time you&apos;re debugging JVM issues on Linux:

```bash
# System memory
free -h
cat /proc/meminfo

# Process memory
ps aux --sort=-%mem | head
pmap -x &lt;pid&gt;

# Thread count for a process
cat /proc/&lt;pid&gt;/status | grep Threads
ps -eo pid,nlwp,cmd | grep java

# System thread limits
cat /proc/sys/kernel/threads-max
ulimit -u

# systemd TasksMax
systemctl show --property DefaultTasksMax
systemctl show &lt;service&gt; --property TasksMax

# JVM thread dump
jstack &lt;pid&gt; &gt; threads.txt

# JVM heap dump
jmap -dump:format=b,file=heap.hprof &lt;pid&gt;

# JVM flags in use
jcmd &lt;pid&gt; VM.flags

# Watch thread count over time
watch -n 1 &quot;cat /proc/&lt;pid&gt;/status | grep Threads&quot;

# Check for OOM killer activity
dmesg | grep -i &quot;killed process&quot;
journalctl -k | grep -i &quot;out of memory&quot;
```

## Lessons Learned

### 1. t2.micro Is Not for Production JVMs

A JVM with any meaningful workload needs at least 2GB RAM available, preferably 4GB+. t2.micro is for testing and tiny workloads only.

### 2. Always Set Explicit JVM Heap Sizes

Don&apos;t rely on JVM auto-tuning. Set `-Xms` and `-Xmx` explicitly based on your instance size and workload.

### 3. Reduce Thread Stack Size

The default 1MB per thread is often excessive. `-Xss512k` or even `-Xss256k` works for most applications and saves significant memory with many threads.

### 4. Check systemd TasksMax

This catches many people off guard. A default of 512 tasks is easily exceeded by JVM applications.

### 5. Always Have Swap

Even if you&apos;ve sized everything correctly, swap provides a buffer against unexpected memory spikes. It&apos;s better to slow down than to crash.

### 6. Monitor Thread Count

Thread leaks are common in async Java applications. Monitor thread count as a first-class metric alongside CPU and memory.

### 7. Bound Your Thread Pools

Never use `Executors.newCachedThreadPool()` in production. Always use bounded pools with explicit maximums.

## Summary

The client&apos;s JVM crashes were caused by a combination of:
- systemd TasksMax limit (512 threads)
- Undersized EC2 instance (t2.micro with 1GB RAM)
- No JVM tuning (defaults for heap and thread stack)
- No swap space
- Thread leak in application code

The fixes:
- Increased TasksMax to 4096
- Upgraded to t3.medium (4GB RAM) for staging and t3.large (8GB RAM) for production
- Tuned JVM with explicit heap sizes and reduced thread stack
- Added 2GB swap
- Fixed thread leak in connection pools and async handlers
- Added monitoring for threads and memory

Total time to diagnose and fix: about 2 days. The application has been stable in production for months since.

---

*Debugging JVM performance issues or have questions about EC2 sizing for Java? Find me on [LinkedIn](https://linkedin.com/in/moabukar).*</content:encoded><category>java</category><category>jvm</category><category>ec2</category><category>debugging</category><category>performance</category><category>memory</category><category>threads</category><category>linux</category><category>devops</category><author>Mo Abukar</author></item><item><title>That Time I Gave Away £50k Worth of Consulting for Free (And What It Taught Me About the Industry)</title><link>https://moabukar.co.uk/blog/free-consulting-lessons/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/free-consulting-lessons/</guid><description>On interview take-home tests that are suspiciously specific, contractors who get ghosted after detailed proposals, and learning to play the game without becoming bitter about it.</description><pubDate>Mon, 05 Jan 2026 00:00:00 GMT</pubDate><content:encoded># That Time I Gave Away £50k Worth of Consulting for Free (And What It Taught Me About the Industry)

I was naive. Let me just get that out of the way.

Early in my contracting career, I took a short engagement with a company - let&apos;s call them Acme Corp. The work was straightforward: review their AWS infrastructure, identify issues, help stabilise their systems. A month or two of work, decent rate, seemed like a good fit.

Then the budget &quot;ran out.&quot;

Fair enough. Contracts end. But before I left, they asked if I could put together a proposal outlining everything that needed fixing and how long it would take. You know, just so they had a roadmap for when budget freed up.

So I did. I spent hours writing a detailed document covering:

- Server stability issues (JVM memory problems from running Tomcat, Apache and the application on the same box)
- Load handling problems (a specific user uploading 60 images would crash the upload server)
- Unnecessary costs from unused public IPs
- Email deliverability issues (SPF, DKIM, MX records needing audit)
- Security concerns (open security groups, no NACLs, shared access keys)
- Recommendations for SSM access, containerisation, Infrastructure as Code, immutable AMIs, proper IAM and SSO

I estimated timelines. Quick wins in 10-12 days. Long-term improvements over 3-4 months. I gave them a genuine, honest assessment of their environment and a roadmap to fix it.

Then I sent it.

And got ghosted.

No response. No &quot;thanks but we&apos;re going a different direction.&quot; Nothing.

They almost certainly took that document, handed it to someone cheaper (or did it themselves), and implemented everything I&apos;d outlined. For free. My consulting, my expertise, my time – all given away because I was too eager to be helpful.

![That Time I Gave Away £50k Worth of Consulting for Free (And What It Taught Me About the Industry)](/images/free-consulting.webp)


## The Interview Version of This

Here&apos;s the thing: this doesn&apos;t just happen to contractors. It happens to candidates in interviews all the time.

You know the pattern:

*&quot;For the technical assessment, we&apos;d like you to design a system for [suspiciously specific business problem].&quot;*

*&quot;Please write a solution for handling [exact scenario our production system faces].&quot;*

*&quot;Create a proof-of-concept for [feature we&apos;ve been meaning to build].&quot;*

Sometimes these are legitimate assessments. Often, they&apos;re not.

I&apos;ve seen take-home tests that:

- Ask candidates to design the exact architecture the company is currently struggling with
- Request working code for features that mysteriously appear in the product weeks later
- Demand detailed proposals for solving problems that read like internal tickets

The candidate spends 8-20 hours on a &quot;test,&quot; submits it, gets rejected (or ghosted), and never knows that their work is now being discussed in sprint planning.

Is it illegal? Probably not, in most cases. Is it ethical? Absolutely not.

## Why Companies Do This

Let&apos;s be honest about the incentives:

**It&apos;s free.** Consulting rates for senior engineers are £500-1000/day. A detailed architecture proposal might cost £5-10k if you hired someone properly. A &quot;take-home test&quot; costs nothing.

**There&apos;s no accountability.** Candidates are desperate for jobs. They&apos;ll do the work hoping it leads somewhere. When it doesn&apos;t, they blame themselves, not the company.

**It&apos;s easy to justify internally.** &quot;We&apos;re just assessing their skills.&quot; &quot;It&apos;s based on real problems so we can see how they think.&quot; &quot;Everyone does take-home tests.&quot;

**The power dynamic is completely one-sided.** The company holds all the cards. The candidate needs the job. Even if they suspect something&apos;s off, what are they going to do – refuse and lose the opportunity?

## How to Spot It

Some red flags:

**The problem is too specific.** Generic assessments ask you to design &quot;a URL shortener&quot; or &quot;a rate limiter.&quot; Fishing expeditions ask you to design &quot;a ticket routing system that handles peak loads on Friday afternoons in the travel industry.&quot;

**They want production-ready code.** Real assessments want to see your thinking. Exploitation wants working features.

**The scope keeps expanding.** &quot;Could you also add...&quot; is a sign they&apos;re treating you as unpaid labour.

**You never meet the team.** If no engineers are involved in evaluating your work, it&apos;s probably going straight into their backlog.

**The feedback is suspiciously vague.** &quot;We decided to go with another candidate&quot; with no technical feedback usually means your solution was useful but you weren&apos;t.

## What I Do Now

I still do take-home tests when required. But I&apos;ve changed my approach:

**Time-box ruthlessly.** If they say 4 hours, I spend 4 hours or even less. Not 8. Not 12. If the test can&apos;t be completed in the stated time, that&apos;s their problem, not mine. Or I&apos;ll decline the test and move on (unless it&apos;s a role I want badly).

**Keep it conceptual.** Architecture diagrams, pseudocode, bullet points. Not production-ready implementations. If they want working code, that&apos;s what the job is for.

**Ask questions first.** &quot;Is this based on a real problem you&apos;re facing?&quot; Sometimes the honesty of their answer tells you everything.

**Protect your IP.** I&apos;ve started adding a simple header to documents: &quot;This document is provided for interview evaluation purposes only. Redistribution or commercial use without written consent is prohibited.&quot; Does it have legal teeth? Probably not. Does it make a point? Yes.

**Walk away from obvious exploitation.** A company once asked me to build a fully functional microservice as a &quot;test.&quot; I declined. They were annoyed. I don&apos;t care.

## The Contractor Lesson

Back to my Acme Corp story.

I was naive, but I&apos;m not bitter, because that experience taught me something important: **your expertise has value and you should never give it away for free unless it&apos;s a deliberate choice.**

When I send proposals now, they&apos;re paid engagements. Discovery sessions have day rates. Detailed roadmaps come after contracts are signed. I&apos;ve learned to separate &quot;being helpful&quot; from &quot;being taken advantage of.&quot;

And look – Acme Corp probably did implement my suggestions. They probably saved themselves months of fumbling by using my roadmap. Good for them, genuinely. But I won&apos;t make that mistake again.

## The Politics Reality

Here&apos;s the part nobody wants to say out loud: **the tech industry has politics and pretending otherwise is a fast track to getting exploited.**

&quot;Just do good work and you&apos;ll be recognised&quot; is a lie told by people who&apos;ve never had their work stolen.

&quot;Meritocracy&quot; is a fairy tale companies tell candidates while they&apos;re extracting free labour through interview tests.

You don&apos;t have to become cynical. But you do have to be aware. Protect yourself. Value your time. Understand that companies will take whatever you give them, so be intentional about what you give.

Don&apos;t hate the player, hate the game. Or better yet – understand the game well enough to play it on your terms.

## Practical Takeaways

If you&apos;re interviewing:

1. **Time-box take-home tests.** Respect the stated limit, not the implicit expectation.
2. **Ask if the problem is real.** Their answer will be revealing.
3. **Keep solutions conceptual.** Show your thinking, not a finished product.
4. **Trust your gut.** If something feels exploitative, it probably is.
5. **It&apos;s okay to decline.** Companies that won&apos;t hire you without free labour aren&apos;t companies you want to work for.

If you&apos;re contracting:

1. **Never send detailed proposals before contracts are signed.** High-level summaries only.
2. **Discovery is paid work.** If they want you to audit their systems, that&apos;s a billable engagement.
3. **Get everything in writing.** Verbal agreements mean nothing when budget &quot;runs out.&quot;
4. **Maintain relationships, but protect yourself.** Being helpful and being a pushover are different things.

## Final Thought

I don&apos;t regret the Acme Corp experience. It was tuition. Expensive tuition, but I learned.

Now when someone asks for detailed proposals before signing a contract, I politely decline and explain why. Some understand. Some get offended. The ones who get offended are exactly the ones who would have exploited me anyway.

Your expertise took years to build. Your time is finite. Your work has value.

Act accordingly.

---

*Had similar experiences with interviews or contracting? I&apos;d like to hear your stories – find me on [LinkedIn](https://linkedin.com/in/moabukar).*</content:encoded><category>career</category><category>consulting</category><category>interviews</category><category>contracting</category><category>tech-industry</category><category>lessons-learned</category><author>Mo Abukar</author></item><item><title>Dragonfly vs Redis: Modern In-Memory Store Comparison</title><link>https://moabukar.co.uk/blog/dragonfly-vs-redis/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/dragonfly-vs-redis/</guid><description>Compare Dragonfly and Redis for caching and data storage. Dragonfly&apos;s multi-threaded architecture vs Redis single-threaded model.</description><pubDate>Wed, 31 Dec 2025 00:00:00 GMT</pubDate><content:encoded>Dragonfly vs Redis: Modern In-Memory Store Comparison
======================================================

Redis executes commands on a single thread. Dragonfly is multi-threaded and
claims up to 25x the throughput. Is it ready for production? This guide
compares both with benchmarks and deployment patterns.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/redis.svg&quot; alt=&quot;Redis logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Dragonfly = multi-threaded Redis alternative
- 25x higher throughput claims
- Drop-in Redis replacement (RESP protocol)
- Better memory efficiency
- Production-ready since 2023


Feature Comparison
==================

```
FEATURE                 DRAGONFLY           REDIS
=======                 =========           =====
Threading               Multi-threaded      Single-threaded
Throughput              4M+ ops/sec         100K+ ops/sec
Memory efficiency       Better              Good
Clustering              Built-in            Redis Cluster
Persistence             Yes (snapshots)     Yes (RDB/AOF)
Lua scripting           Yes                 Yes
Modules                 Limited             Extensive
Maturity                2023+               2009+
Community               Growing             Massive
Enterprise support      DragonflyDB Inc     Redis Ltd
```


Benchmark Results
=================

```
OPERATION       DRAGONFLY       REDIS 7         SPEEDUP
=========       =========       =======         =======
SET             4.2M ops/sec    180K ops/sec    23x
GET             4.5M ops/sec    200K ops/sec    22x
INCR            3.8M ops/sec    170K ops/sec    22x
LPUSH           3.5M ops/sec    150K ops/sec    23x
HSET            3.2M ops/sec    140K ops/sec    23x

Test: 64 cores, 256GB RAM, 100 concurrent connections
```
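
The SPEEDUP column is simply the ratio of the two throughput figures. A quick sketch to reproduce it, in Python, with the numbers copied from the table above (the variable names are mine):

```python
# Throughput figures from the benchmark table, in ops/sec: (Dragonfly, Redis 7).
results = {
    "SET": (4_200_000, 180_000),
    "GET": (4_500_000, 200_000),
    "INCR": (3_800_000, 170_000),
    "LPUSH": (3_500_000, 150_000),
    "HSET": (3_200_000, 140_000),
}

# Speedup = Dragonfly throughput divided by Redis throughput.
speedups = {op: dragonfly / redis for op, (dragonfly, redis) in results.items()}

for op, ratio in speedups.items():
    print(f"{op:6} {ratio:.1f}x")
```

Every ratio lands between 22x and 24x, so the headline "25x" is best read as an upper bound rather than a typical result.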


Deploy Dragonfly on Kubernetes
==============================

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dragonfly
spec:
  serviceName: dragonfly
  replicas: 1
  selector:
    matchLabels:
      app: dragonfly
  template:
    metadata:
      labels:
        app: dragonfly
    spec:
      containers:
        - name: dragonfly
          image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.14.0
          args:
            - --logtostderr
            - --cache_mode  # LRU eviction
            - --maxmemory=8G
            - --proactor_threads=8
            - --admin_port=9999  # serve HTTP (including /metrics) on the metrics port
          ports:
            - containerPort: 6379
              name: redis
            - containerPort: 9999
              name: metrics
          resources:
            requests:
              cpu: &quot;4&quot;
              memory: 10Gi
            limits:
              memory: 12Gi
          volumeMounts:
            - name: data
              mountPath: /data
          livenessProbe:
            exec:
              command: [&quot;redis-cli&quot;, &quot;ping&quot;]
            initialDelaySeconds: 10
          readinessProbe:
            exec:
              command: [&quot;redis-cli&quot;, &quot;ping&quot;]
            initialDelaySeconds: 5
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [&quot;ReadWriteOnce&quot;]
        storageClassName: gp3
        resources:
          requests:
            storage: 100Gi
---
apiVersion: v1
kind: Service
metadata:
  name: dragonfly
  labels:
    app: dragonfly  # matched by the ServiceMonitor selector
spec:
  ports:
    - port: 6379
      name: redis
    - port: 9999
      name: metrics
  selector:
    app: dragonfly
```


Helm Deployment
---------------

```bash
helm repo add dragonfly https://dragonflydb.github.io/helm-charts
helm upgrade --install dragonfly dragonfly/dragonfly \
  --namespace cache --create-namespace \
  --set resources.requests.cpu=4 \
  --set resources.requests.memory=8Gi \
  --set extraArgs=&quot;{--cache_mode,--maxmemory=6G}&quot;
```


Deploy Redis (for comparison)
=============================

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: redis
spec:
  serviceName: redis
  replicas: 1
  selector:
    matchLabels:
      app: redis
  template:
    metadata:
      labels:
        app: redis
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          command:
            - redis-server
            - --maxmemory 6gb
            - --maxmemory-policy allkeys-lru
            - --appendonly yes
          ports:
            - containerPort: 6379
          resources:
            requests:
              cpu: &quot;2&quot;
              memory: 8Gi
          volumeMounts:
            - name: data
              mountPath: /data
  volumeClaimTemplates:
    - metadata:
        name: data
      spec:
        accessModes: [&quot;ReadWriteOnce&quot;]
        resources:
          requests:
            storage: 50Gi
```


Application Connection
======================

Both use the same Redis protocol:

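```go
package main

import (
    &quot;context&quot;
    &quot;fmt&quot;
    &quot;log&quot;
    &quot;time&quot;

    &quot;github.com/redis/go-redis/v9&quot;
)

func main() {
    // Works with both Redis and Dragonfly
    client := redis.NewClient(&amp;redis.Options{
        Addr: &quot;dragonfly.cache:6379&quot;, // or redis.cache:6379
    })
    defer client.Close()
    ctx := context.Background()

    // Same commands work against either server
    if err := client.Set(ctx, &quot;key&quot;, &quot;value&quot;, time.Hour).Err(); err != nil {
        log.Fatal(err)
    }
    val, err := client.Get(ctx, &quot;key&quot;).Result()
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(val)
}
```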


High Availability
=================

Dragonfly HA (Master-Replica)
-----------------------------

```yaml
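# Each pod starts as an independent instance: keep one as master and attach
# the others with REPLICAOF, or let the Dragonfly operator manage replication.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: dragonfly
spec:
  serviceName: dragonfly
  replicas: 3
  selector:
    matchLabels:
      app: dragonfly
  template:
    metadata:
      labels:
        app: dragonfly
    spec: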
      containers:
        - name: dragonfly
          image: docker.dragonflydb.io/dragonflydb/dragonfly:v1.14.0
          args:
            - --logtostderr
            - --cluster_mode=emulated
            - --cluster_announce_ip=$(POD_IP)
          env:
            - name: POD_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.podIP
```


Redis Sentinel
--------------

```bash
# Use Bitnami Redis chart for HA
helm upgrade --install redis bitnami/redis \
  --set sentinel.enabled=true \
  --set sentinel.quorum=2 \
  --set replica.replicaCount=2
```


When to Use Which
=================

**Use Dragonfly when:**
- High throughput is critical (millions of ops/sec)
- You have multi-core machines to utilize
- You want simpler scaling (no cluster sharding)
- Memory efficiency is important
- Starting fresh (no Redis modules needed)

**Use Redis when:**
- You need Redis modules (RediSearch, RedisJSON, etc.)
- You&apos;re already running Redis in production
- You need the larger ecosystem and community
- Enterprise support is important
- Stability over raw performance


Migration Strategy
==================

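```python
# 1. Deploy Dragonfly alongside Redis
# 2. Use Dragonfly for reads (shadow traffic)
# 3. Compare results
# 4. Switch writes to Dragonfly
# 5. Decommission Redis

# Shadow traffic example (redis-py; hostnames are placeholders)
import logging

import redis

log = logging.getLogger(&quot;migration&quot;)
redis_client = redis.Redis(host=&quot;redis.cache&quot;, port=6379)
dragonfly = redis.Redis(host=&quot;dragonfly.cache&quot;, port=6379)


def shadow_get(key, dragonfly_enabled=True):
    redis_result = redis_client.get(key)  # Redis stays the source of truth
    if dragonfly_enabled:
        result = dragonfly.get(key)
        if result != redis_result:  # Compare
            log.warning(&quot;Mismatch for key=%s&quot;, key)
    return redis_result
```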


Monitoring
==========

```yaml
# Dragonfly serves Prometheus metrics natively; Redis needs a separate exporter such as redis_exporter
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: dragonfly
spec:
  selector:
    matchLabels:
      app: dragonfly
  endpoints:
    - port: metrics
      path: /metrics
```

Key metrics:
- `dragonfly_connected_clients`
- `dragonfly_used_memory_bytes`
- `dragonfly_commands_processed_total`
- `dragonfly_keyspace_hits_total`
- `dragonfly_keyspace_misses_total`


References
==========

- Dragonfly: https://dragonflydb.io
- Dragonfly Docs: https://www.dragonflydb.io/docs
- Redis: https://redis.io
- Benchmark: https://www.dragonflydb.io/blog/dragonfly-1-0-benchmark


========================================
Dragonfly vs Redis
========================================
Multi-threaded speed. Redis compatibility.
========================================</content:encoded><category>dragonfly</category><category>redis</category><category>caching</category><category>database</category><category>kubernetes</category><category>performance</category><author>Mo Abukar</author></item><item><title>Vitess for MySQL: Horizontal Sharding Done Right</title><link>https://moabukar.co.uk/blog/vitess-mysql-sharding/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/vitess-mysql-sharding/</guid><description>Scale MySQL horizontally with Vitess. Automatic sharding, online schema changes, and Kubernetes-native deployment for massive scale.</description><pubDate>Sun, 28 Dec 2025 00:00:00 GMT</pubDate><content:encoded>Vitess for MySQL: Horizontal Sharding Done Right
=================================================

MySQL has no built-in horizontal sharding. Vitess adds it. Born
at YouTube to handle billions of rows, it&apos;s now a CNCF graduated project
powering Slack, GitHub, and many others.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/vitess.svg&quot; alt=&quot;Vitess logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Vitess = MySQL horizontal sharding layer
- Automatic shard routing
- Online schema migrations
- Connection pooling and query rewriting
- Kubernetes operator included


Architecture
============

```
┌─────────────────────────────────────────────────────────────────┐
│                         Application                              │
│                     (MySQL protocol)                             │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                          VTGate                                  │
│              (Query router, connection pooler)                   │
└─────────────────────────────────────────────────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                    ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│     VTTablet     │  │     VTTablet     │  │     VTTablet     │
│   (Shard -80)    │  │   (Shard 80-)    │  │    (Replica)     │
│   ┌──────────┐   │  │   ┌──────────┐   │  │   ┌──────────┐   │
│   │  MySQL   │   │  │   │  MySQL   │   │  │   │  MySQL   │   │
│   └──────────┘   │  │   └──────────┘   │  │   └──────────┘   │
└──────────────────┘  └──────────────────┘  └──────────────────┘
```


Install Vitess Operator
=======================

```bash
# Install operator
kubectl apply -f https://github.com/vitessio/vitess/releases/download/v18.0.0/operator.yaml

# Wait for operator
kubectl wait --for=condition=Available deployment/vitess-operator -n vitess
```


Deploy Cluster
==============

```yaml
apiVersion: planetscale.com/v2
kind: VitessCluster
metadata:
  name: example
spec:
  images:
    vtgate: vitess/lite:v18.0.0
    vttablet: vitess/lite:v18.0.0
    vtbackup: vitess/lite:v18.0.0
    mysqld:
      mysql80Compatible: vitess/lite:v18.0.0
    mysqldExporter: prom/mysqld-exporter:v0.14.0

  cells:
    - name: zone1
      gateway:
        replicas: 2
        resources:
          requests:
            cpu: 500m
            memory: 512Mi

  keyspaces:
    - name: commerce
      turndownPolicy: Immediate
      partitionings:
        - equal:
            parts: 2
            shardTemplate:
              databaseInitScriptSecret:
                name: commerce-init
                key: init.sql
              replication:
                enforceSemiSync: true
              tabletPools:
                - cell: zone1
                  type: replica
                  replicas: 2
                  vttablet:
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                  mysqld:
                    resources:
                      requests:
                        cpu: 500m
                        memory: 1Gi
                  dataVolumeClaimTemplate:
                    accessModes: [&quot;ReadWriteOnce&quot;]
                    resources:
                      requests:
                        storage: 100Gi
                    storageClassName: gp3
```


VSchema (Sharding Config)
=========================

```json
{
  &quot;sharded&quot;: true,
  &quot;vindexes&quot;: {
    &quot;hash&quot;: {
      &quot;type&quot;: &quot;hash&quot;
    },
    &quot;customer_keyspace_id&quot;: {
      &quot;type&quot;: &quot;hash&quot;
    }
  },
  &quot;tables&quot;: {
    &quot;customer&quot;: {
      &quot;column_vindexes&quot;: [
        {
          &quot;column&quot;: &quot;customer_id&quot;,
          &quot;name&quot;: &quot;hash&quot;
        }
      ]
    },
    &quot;orders&quot;: {
      &quot;column_vindexes&quot;: [
        {
          &quot;column&quot;: &quot;customer_id&quot;,
          &quot;name&quot;: &quot;customer_keyspace_id&quot;
        }
      ]
    },
    &quot;products&quot;: {
      &quot;type&quot;: &quot;reference&quot;
    }
  }
}
```


Apply VSchema
-------------

```bash
vtctldclient ApplyVSchema --vschema-file vschema.json commerce
```


Application Connection
======================

```go
package main

import (
    &quot;database/sql&quot;
    &quot;log&quot;

    _ &quot;github.com/go-sql-driver/mysql&quot;
)

func main() {
    // Connect to VTGate (MySQL protocol)
    db, err := sql.Open(&quot;mysql&quot;, &quot;user:password@tcp(vtgate.vitess:3306)/commerce&quot;)
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()

    // Queries are automatically routed to the correct shard
    rows, err := db.Query(&quot;SELECT * FROM customer WHERE customer_id = ?&quot;, 123)
    if err != nil {
        log.Fatal(err)
    }
    rows.Close()

    // Queries without a shard key fan out: VTGate scatters them and merges the results
    rows, err = db.Query(&quot;SELECT c.name, o.total FROM customer c JOIN orders o ON c.customer_id = o.customer_id&quot;)
    if err != nil {
        log.Fatal(err)
    }
    defer rows.Close() // iterate with rows.Next() and rows.Scan(...)
}
```


Online Schema Change
====================

```bash
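# Online ALTER TABLE across shards
vtctldclient ApplySchema \
  --ddl-strategy &quot;vitess&quot; \
  --sql &quot;ALTER TABLE customer ADD COLUMN email VARCHAR(255)&quot; \
  commerce
```

With `--ddl-strategy` set, Vitess runs the change as a managed, non-blocking migration, using its native VReplication-based strategy (`vitess`) or, alternatively, gh-ost or pt-osc. Without it, the DDL is applied directly.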


Resharding
==========

Split shards when they get too big:

```bash
# Split shard -80 into -40 and 40-80
vtctldclient Reshard --workflow reshard1 --target-keyspace commerce create \
  --source-shards &quot;-80&quot; \
  --target-shards &quot;-40,40-80&quot;

# Monitor progress
vtctldclient Reshard --workflow reshard1 --target-keyspace commerce show

# Complete when ready
vtctldclient Reshard --workflow reshard1 --target-keyspace commerce switchtraffic
vtctldclient Reshard --workflow reshard1 --target-keyspace commerce complete
```
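
Shard names such as `-80`, `-40` and `40-80` are hex key ranges: start inclusive, end exclusive, with an empty side meaning "open". The sketch below (Python) shows how a keyspace ID maps to a shard before and after the split. The helper names are mine, and `toy_keyspace_id` is a stand-in hash for illustration only, not the function the Vitess `hash` vindex actually uses.

```python
import hashlib


def parse_shard(name):
    """Split a shard name such as '-80' or '40-80' into (start, end) byte strings."""
    start_hex, end_hex = name.split("-")
    return bytes.fromhex(start_hex), bytes.fromhex(end_hex)


def shard_for(keyspace_id, shards):
    """Return the shard whose key range contains keyspace_id."""
    for name in shards:
        start, end = parse_shard(name)
        # An empty start or end means the range is open on that side.
        if keyspace_id >= start and (not end or end > keyspace_id):
            return name
    raise ValueError("no shard covers this keyspace id")


def toy_keyspace_id(customer_id):
    """Stand-in hash for illustration only; NOT the real Vitess hash vindex."""
    return hashlib.sha256(str(customer_id).encode()).digest()[:8]


before = ["-80", "80-"]
after = ["-40", "40-80", "80-"]

kid = bytes.fromhex("3a00000000000000")
print(shard_for(kid, before), shard_for(kid, after))
```

Only rows whose keyspace ID falls inside the old `-80` range move during the split; everything in `80-` stays where it is.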


Backup and Restore
==================

```yaml
apiVersion: planetscale.com/v2
kind: VitessBackupSchedule
metadata:
  name: daily-backup
spec:
  backup:
    storage:
      s3:
        bucket: vitess-backups
        region: eu-west-2
        authSecret:
          name: s3-credentials
  schedule: &quot;0 2 * * *&quot;
  keyspace: commerce
```


Monitoring
==========

```yaml
# ServiceMonitor for Prometheus
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vitess
spec:
  selector:
    matchLabels:
      app: vitess
  endpoints:
    - port: web
      path: /metrics  # Prometheus format; /debug/vars serves expvar JSON
```


References
==========

- Vitess Docs: https://vitess.io/docs
- Operator: https://github.com/planetscale/vitess-operator
- VSchema: https://vitess.io/docs/reference/features/vschema


========================================
Vitess + MySQL + Kubernetes
========================================
Scale MySQL. Shard automatically.
========================================</content:encoded><category>vitess</category><category>mysql</category><category>database</category><category>sharding</category><category>kubernetes</category><category>scaling</category><author>Mo Abukar</author></item><item><title>NATS JetStream: Lightweight Alternative to Kafka</title><link>https://moabukar.co.uk/blog/nats-jetstream/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/nats-jetstream/</guid><description>Deploy NATS JetStream for messaging and streaming. Simpler than Kafka, faster than RabbitMQ, with persistence and exactly-once delivery.</description><pubDate>Wed, 24 Dec 2025 00:00:00 GMT</pubDate><content:encoded>NATS JetStream: Lightweight Alternative to Kafka
=================================================

Kafka is powerful but complex. NATS JetStream provides similar
persistence and streaming capabilities with far simpler operations:
a single binary, no ZooKeeper, no JVM.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/nats.svg&quot; alt=&quot;NATS logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- NATS = ultra-fast messaging (5M+ msg/sec)
- JetStream = persistent streaming (Kafka-like)
- Single binary, ~20MB memory footprint
- Exactly-once semantics (publish deduplication plus double acks), replay, consumer groups
- Official Helm chart for Kubernetes


NATS vs Kafka
=============

```
FEATURE                 NATS JETSTREAM      KAFKA
=======                 ==============      =====
Latency                 &lt;1ms                1-5ms
Throughput              5M+ msg/sec         1M+ msg/sec
Memory footprint        ~20MB               ~1GB+
Dependencies            None                ZooKeeper/KRaft
Operations              Simple              Complex
Persistence             JetStream           Kafka Logs
Exactly-once            Yes                 Yes
Consumer groups         Yes                 Yes
Learning curve          Low                 High
```


Install NATS
============

```bash
# Helm
helm repo add nats https://nats-io.github.io/k8s/helm/charts/
helm upgrade --install nats nats/nats \
  --namespace nats --create-namespace \
  --set config.jetstream.enabled=true \
  --set config.cluster.enabled=true \
  --set config.cluster.replicas=3
```


Values for Production
---------------------

```yaml
# nats-values.yaml
config:
  cluster:
    enabled: true
    replicas: 3
  
  jetstream:
    enabled: true
    fileStore:
      pvc:
        size: 50Gi
        storageClassName: gp3
    memoryStore:
      enabled: true
      maxSize: 1Gi
  
  # Monitoring
  monitor:
    enabled: true
    port: 8222

natsBox:
  enabled: true  # Debug container

# Prometheus metrics
promExporter:
  enabled: true
  podMonitor:
    enabled: true
```


Core Concepts
=============

```
STREAMS     = Persistent message storage (like Kafka topics)
CONSUMERS   = Read position trackers (like Kafka consumer groups)
SUBJECTS    = Message addressing (like Kafka topic names, but hierarchical with wildcards)
```

```
Producer ──▶ Subject ──▶ Stream ──▶ Consumer ──▶ App
                  │
                  └──▶ Stream ──▶ Consumer ──▶ Other App
```
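
Subjects are dot-separated tokens, and a stream or consumer filter can use two wildcards: `*` matches exactly one token, `>` matches one or more trailing tokens. A small Python sketch of that matching rule (`subject_matches` is an illustrative helper, not part of any NATS client):

```python
def subject_matches(pattern: str, subject: str) -> bool:
    """Check a subject against a pattern using NATS-style '*' and '>' wildcards."""
    p_tokens = pattern.split(".")
    s_tokens = subject.split(".")
    for i, tok in enumerate(p_tokens):
        if tok == ">":
            # '>' must be the last token and needs at least one token to match
            return i == len(p_tokens) - 1 and len(s_tokens) > i
        if i >= len(s_tokens):
            return False
        if tok != "*" and tok != s_tokens[i]:
            return False
    return len(p_tokens) == len(s_tokens)


print(subject_matches("orders.*", "orders.created"))     # True
print(subject_matches("orders.*", "orders.eu.created"))  # False
print(subject_matches("orders.>", "orders.eu.created"))  # True
print(subject_matches("orders.>", "orders"))             # False
```

This is why the choice between `orders.*` and `orders.>` on a stream matters: only the second captures nested subjects such as `orders.eu.created`.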


Create Stream
=============

```bash
# Using NATS CLI
nats stream add ORDERS \
  --subjects &quot;orders.&gt;&quot; \
  --storage file \
  --replicas 3 \
  --retention limits \
  --max-msgs-per-subject 1000000 \
  --max-age 7d \
  --max-bytes 10GB

# Or from a JSON config file
nats stream add --config stream.json
```

```json
{
  &quot;name&quot;: &quot;ORDERS&quot;,
  &quot;subjects&quot;: [&quot;orders.&gt;&quot;],
  &quot;retention&quot;: &quot;limits&quot;,
  &quot;storage&quot;: &quot;file&quot;,
  &quot;max_msgs&quot;: 10000000,
  &quot;max_bytes&quot;: 10737418240,
  &quot;max_age&quot;: 604800000000000,
  &quot;max_msg_size&quot;: 1048576,
  &quot;num_replicas&quot;: 3,
  &quot;discard&quot;: &quot;old&quot;
}
```
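
The raw numbers in that file are easier to read once you know the units: `max_age` is in nanoseconds and `max_bytes` is in bytes. A short Python sketch that derives them from human-friendly inputs (`stream_config` is a hypothetical helper, and only a subset of the fields is shown):

```python
import json

NANOS_PER_SECOND = 1_000_000_000


def stream_config(name: str, subjects: list[str], max_age_days: int, max_gib: int) -> dict:
    """Build a stream config dict with limits in the units used by the JSON file."""
    return {
        "name": name,
        "subjects": subjects,
        "retention": "limits",
        "storage": "file",
        "max_bytes": max_gib * 1024**3,                          # bytes
        "max_age": max_age_days * 24 * 3600 * NANOS_PER_SECOND,  # nanoseconds
        "discard": "old",
    }


cfg = stream_config("ORDERS", ["orders.>"], max_age_days=7, max_gib=10)
print(json.dumps(cfg, indent=2))
```

Seven days works out to 604800000000000 ns and 10 GiB to 10737418240 bytes, matching the values above.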


Create Consumer
===============

```bash
# Durable consumer (survives restarts)
nats consumer add ORDERS order-processor \
  --ack explicit \
  --deliver all \
  --max-deliver 5 \
  --filter &quot;orders.created&quot; \
  --pull
```


Go Producer
===========

```go
package main

import (
    &quot;encoding/json&quot;
    &quot;log&quot;
    &quot;time&quot;

    &quot;github.com/nats-io/nats.go&quot;
)

type Order struct {
    ID        string    `json:&quot;id&quot;`
    Customer  string    `json:&quot;customer&quot;`
    Amount    float64   `json:&quot;amount&quot;`
    CreatedAt time.Time `json:&quot;created_at&quot;`
}

func main() {
    nc, err := nats.Connect(&quot;nats://nats.nats:4222&quot;)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    order := Order{
        ID:        &quot;ord-123&quot;,
        Customer:  &quot;cust-456&quot;,
        Amount:    99.99,
        CreatedAt: time.Now(),
    }

    data, err := json.Marshal(order)
    if err != nil {
        log.Fatal(err)
    }

    // Publish with acknowledgment
    ack, err := js.Publish(&quot;orders.created&quot;, data)
    if err != nil {
        log.Fatal(err)
    }

    log.Printf(&quot;Published to stream %s, seq %d&quot;, ack.Stream, ack.Sequence)
}
```


Go Consumer
===========

```go
package main

import (
    &quot;encoding/json&quot;
    &quot;log&quot;

    &quot;github.com/nats-io/nats.go&quot;
)

type Order struct {
    ID        string  `json:&quot;id&quot;`
    Customer  string  `json:&quot;customer&quot;`
    Amount    float64 `json:&quot;amount&quot;`
}

func main() {
    nc, err := nats.Connect(&quot;nats://nats.nats:4222&quot;)
    if err != nil {
        log.Fatal(err)
    }
    defer nc.Close()

    js, err := nc.JetStream()
    if err != nil {
        log.Fatal(err)
    }

    // Pull-based consumer
    sub, err := js.PullSubscribe(&quot;orders.created&quot;, &quot;order-processor&quot;)
    if err != nil {
        log.Fatal(err)
    }

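    for {
        msgs, err := sub.Fetch(10) // Batch of up to 10
        if err == nats.ErrTimeout {
            continue // No messages within the fetch timeout; poll again
        }
        if err != nil {
            log.Println(&quot;Fetch error:&quot;, err)
            continue
        }

        for _, msg := range msgs {
            var order Order
            if err := json.Unmarshal(msg.Data, &amp;order); err != nil {
                log.Println(&quot;Bad payload, terminating message:&quot;, err)
                msg.Term() // Do not redeliver a message that can never be parsed
                continue
            }

            log.Printf(&quot;Processing order: %s, amount: %.2f&quot;, order.ID, order.Amount)

            // Process order...

            // Acknowledge so JetStream does not redeliver
            msg.Ack()
        }
    }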
}
```


Kubernetes Deployment
=====================

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-processor
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-processor
  template:
    metadata:
      labels:
        app: order-processor
    spec:
      containers:
        - name: processor
          image: order-processor:latest
          env:
            - name: NATS_URL
              value: &quot;nats://nats.nats:4222&quot;
            - name: NATS_STREAM
              value: &quot;ORDERS&quot;
            - name: NATS_CONSUMER
              value: &quot;order-processor&quot;
```


Key-Value Store
===============

JetStream includes a built-in KV store:

```go
// Create bucket
kv, err := js.CreateKeyValue(&amp;nats.KeyValueConfig{
    Bucket:   &quot;config&quot;,
    Replicas: 3,
    TTL:      24 * time.Hour,
})

// Put
_, err = kv.Put(&quot;api.rate_limit&quot;, []byte(&quot;1000&quot;))

// Get
entry, err := kv.Get(&quot;api.rate_limit&quot;)
log.Println(string(entry.Value()))

// Watch for changes
watcher, _ := kv.Watch(&quot;api.*&quot;)
for update := range watcher.Updates() {
    if update != nil {
        log.Printf(&quot;Key %s changed to %s&quot;, update.Key(), update.Value())
    }
}
```


Object Store
============

Store large files:

```go
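// Create bucket (named store to avoid shadowing the os package)
store, err := js.CreateObjectStore(&amp;nats.ObjectStoreConfig{
    Bucket:   &quot;files&quot;,
    Replicas: 3,
})
if err != nil {
    log.Fatal(err)
}

// Put file (PutFile uses the file path as the object name)
_, err = store.PutFile(&quot;/path/to/report.pdf&quot;)

// Get file: object name first, then the local destination path
err = store.GetFile(&quot;/path/to/report.pdf&quot;, &quot;/output/report.pdf&quot;)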
```


Monitoring
==========

```yaml
# Prometheus rules (metric names vary by exporter version; check them against the exporter&apos;s /metrics output)
groups:
  - name: nats
    rules:
      - alert: NATSHighLatency
        expr: nats_server_route_latency_ms &gt; 100
        labels:
          severity: warning

      - alert: NATSSlowConsumers
        expr: nats_server_slow_consumers &gt; 0
        labels:
          severity: warning

      - alert: NATSStreamFull
        expr: nats_jetstream_stream_bytes / nats_jetstream_stream_max_bytes &gt; 0.9
        labels:
          severity: warning
```


References
==========

- NATS Docs: https://docs.nats.io
- JetStream: https://docs.nats.io/nats-concepts/jetstream
- Go Client: https://github.com/nats-io/nats.go


========================================
NATS JetStream + Kubernetes
========================================
Simple messaging. Persistent streaming.
========================================</content:encoded><category>nats</category><category>jetstream</category><category>messaging</category><category>streaming</category><category>kubernetes</category><category>microservices</category><author>Mo Abukar</author></item><item><title>VPA + HPA Together: The Right Way to Autoscale Both</title><link>https://moabukar.co.uk/blog/vpa-hpa-together/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/vpa-hpa-together/</guid><description>Use Vertical Pod Autoscaler and Horizontal Pod Autoscaler together without conflicts. Includes KEDA integration and best practices.</description><pubDate>Sat, 20 Dec 2025 00:00:00 GMT</pubDate><content:encoded>VPA + HPA Together: The Right Way to Autoscale Both
===================================================

VPA adjusts pod resources (CPU/memory). HPA adjusts pod count.
Using them together is tricky - both can fight over CPU. Here&apos;s
how to make them work together.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/kubernetes.svg&quot; alt=&quot;Kubernetes logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- VPA = vertical scaling (resource requests/limits)
- HPA = horizontal scaling (replica count)
- Don&apos;t let both scale on CPU
- VPA for memory, HPA for CPU (recommended)
- Or use VPA in recommendation-only mode


The Problem
===========

```
HPA: &quot;CPU is high, add more replicas&quot;
VPA: &quot;CPU is high, increase CPU requests&quot;

Both trigger → pods get more CPU AND more replicas
→ Massive over-provisioning
```
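
The feedback loop is easier to see with numbers. HPA computes desired replicas as ceil(currentReplicas * currentUtilization / targetUtilization), and utilization is usage divided by the CPU request, which is exactly the value VPA changes. A minimal Python sketch with made-up workload figures, ignoring the HPA tolerance band and stabilization windows:

```python
import math


def desired_replicas(current_replicas: int, usage_m: float, request_m: float, target_util: float) -> int:
    """HPA formula: ceil(currentReplicas * currentUtilization / targetUtilization)."""
    current_util = usage_m / request_m
    return math.ceil(current_replicas * current_util / target_util)


# 4 replicas, each using 200m CPU against a 250m request, HPA target 70%
print(desired_replicas(4, usage_m=200, request_m=250, target_util=0.70))  # 5

# Same load, but VPA has raised the request to 400m: utilization drops to 50%
print(desired_replicas(4, usage_m=200, request_m=400, target_util=0.70))  # 3
```

Same traffic, two different answers, depending on which autoscaler acted first. That is the conflict the solutions below avoid.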


Solution 1: Split by Metric
===========================

VPA scales memory, HPA scales on CPU:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: [&quot;memory&quot;]  # Only memory
        minAllowed:
          memory: 128Mi
        maxAllowed:
          memory: 4Gi

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu  # Only CPU
        target:
          type: Utilization
          averageUtilization: 70
```


Solution 2: VPA Recommendations Only
====================================

VPA in &quot;Off&quot; mode provides recommendations without acting:

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: &quot;Off&quot;  # Recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: api
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

Apply VPA recommendations during maintenance windows:

```bash
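#!/bin/bash
# apply-vpa-recommendations.sh &lt;vpa-name&gt; &lt;namespace&gt;
set -euo pipefail

VPA_NAME=$1
NAMESPACE=$2

# The VPA and its target Deployment usually have different names, so look the target up
DEPLOYMENT=$(kubectl get vpa &quot;$VPA_NAME&quot; -n &quot;$NAMESPACE&quot; -o jsonpath=&apos;{.spec.targetRef.name}&apos;)

# Pull the recommended target for the first container as JSON
REC=$(kubectl get vpa &quot;$VPA_NAME&quot; -n &quot;$NAMESPACE&quot; -o json | jq -c &apos;.status.recommendation.containerRecommendations[0].target&apos;)

CPU=$(echo &quot;$REC&quot; | jq -r &apos;.cpu&apos;)
MEMORY=$(echo &quot;$REC&quot; | jq -r &apos;.memory&apos;)

echo &quot;Recommended for $DEPLOYMENT: cpu=$CPU, memory=$MEMORY&quot;

kubectl patch deployment &quot;$DEPLOYMENT&quot; -n &quot;$NAMESPACE&quot; --type=json -p=&quot;[
  {\&quot;op\&quot;: \&quot;replace\&quot;, \&quot;path\&quot;: \&quot;/spec/template/spec/containers/0/resources/requests/cpu\&quot;, \&quot;value\&quot;: \&quot;$CPU\&quot;},
  {\&quot;op\&quot;: \&quot;replace\&quot;, \&quot;path\&quot;: \&quot;/spec/template/spec/containers/0/resources/requests/memory\&quot;, \&quot;value\&quot;: \&quot;$MEMORY\&quot;}
]&quot;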
```


Solution 3: KEDA with Custom Metrics
====================================

Use KEDA for event-driven scaling, VPA for resources:

```yaml
# VPA for resources
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: worker-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  updatePolicy:
    updateMode: Auto

---
# KEDA for queue-based scaling
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: worker-scaler
spec:
  scaleTargetRef:
    name: queue-worker
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: rabbitmq
      metadata:
        host: amqp://rabbitmq.default.svc:5672
        queueName: jobs
        mode: QueueLength  # Scale on queue depth
        value: &quot;10&quot;        # Target messages per replica
```
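
Because KEDA scales on queue depth rather than CPU, it never competes with VPA for the same signal. Roughly, the replica count follows the backlog divided by the per-replica target, clamped to the min and max. A simplified Python sketch of that arithmetic (not KEDA's actual code, and it ignores cooldown and polling intervals):

```python
import math


def replicas_for_queue(queue_length: int, target_per_replica: int, min_replicas: int, max_replicas: int) -> int:
    """Replicas needed so each one handles about target_per_replica messages."""
    wanted = math.ceil(queue_length / target_per_replica)
    return max(min_replicas, min(max_replicas, wanted))


# Using the ScaledObject above: target 10 per replica, 1 to 50 replicas
print(replicas_for_queue(95, 10, 1, 50))    # 10
print(replicas_for_queue(0, 10, 1, 50))     # 1
print(replicas_for_queue(1000, 10, 1, 50))  # 50
```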


Solution 4: Goldilocks
======================

Goldilocks runs VPA in recommendation mode and provides a dashboard:

```bash
helm repo add fairwinds-stable https://charts.fairwinds.com/stable
helm upgrade --install goldilocks fairwinds-stable/goldilocks \
  --namespace goldilocks --create-namespace

# Label namespace to enable
kubectl label ns production goldilocks.fairwinds.com/enabled=true
```


Best Practices
==============

```
WORKLOAD TYPE           RECOMMENDATION
=============           ==============
Stateless API           HPA on CPU, VPA on memory
Batch/Workers           KEDA on queue depth, VPA on both
Memory-intensive        VPA on memory, HPA on custom metric
GPU                     VPA Off (fixed resources)
```


Configuration Matrix
--------------------

```yaml
# CPU-bound (web servers)
VPA: memory only, Auto mode
HPA: cpu at 70%

# Memory-bound (caches, JVM)
VPA: memory only, Auto mode
HPA: custom metric (requests/sec)

# Queue workers
VPA: both, Auto mode
KEDA: queue length

# Mixed workloads
VPA: Off mode (recommendations)
HPA: cpu at 70%
Apply VPA recommendations weekly
```


Monitoring
==========

```yaml
# Prometheus rules
groups:
  - name: autoscaling
    rules:
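      - alert: VPARecommendationDrift
        # The two metrics carry different label sets, so match them on namespace and container.
        # Assumes one VPA per (namespace, container) and that your kube-state-metrics exposes VPA metrics.
        expr: |
          abs(
            kube_pod_container_resource_requests{resource=&quot;cpu&quot;}
            - on(namespace, container) group_left()
            kube_verticalpodautoscaler_status_recommendation_containerrecommendations_target{resource=&quot;cpu&quot;}
          ) / kube_pod_container_resource_requests{resource=&quot;cpu&quot;} &gt; 0.5
        for: 24h
        labels:
          severity: info
        annotations:
          summary: &quot;VPA recommendation differs &gt;50% from current requests&quot;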

      - alert: HPAAtMaxReplicas
        expr: kube_horizontalpodautoscaler_status_current_replicas == kube_horizontalpodautoscaler_spec_max_replicas
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: &quot;{{ $labels.horizontalpodautoscaler }} at max replicas&quot;
```
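
The drift alert is just a relative difference between the VPA target and the current request. The same check as a Python sketch, handy for a quick audit script. The quantity parser only handles plain cores and the `m` suffix, and both helpers are illustrative:

```python
def cpu_millis(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity ('250m' or '2') to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def cpu_drift(recommended: str, requested: str) -> float:
    """Relative difference between the VPA target and the current request."""
    rec, req = cpu_millis(recommended), cpu_millis(requested)
    return abs(rec - req) / req


print(cpu_drift("250m", "100m"))  # 1.5, above the 0.5 threshold
print(cpu_drift("120m", "100m"))  # 0.2, within the threshold
```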


Full Example
============

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
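spec:
  replicas: 3  # Will be managed by HPA
  selector:
    matchLabels:
      app: api-server
  template:
    metadata:
      labels:
        app: api-server
    spec:
      containers:
        - name: api
          image: api-server:latest  # Placeholder image
          resources:
            requests:
              cpu: 100m     # Fixed: HPA utilization is measured against this request
              memory: 256Mi # Will be managed by VPA
            limits:
              memory: 512Mi # Scaled by VPA in proportion to the request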

---
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: Auto
  resourcePolicy:
    containerPolicies:
      - containerName: api
        controlledResources: [&quot;memory&quot;]
        minAllowed:
          memory: 128Mi
        maxAllowed:
          memory: 2Gi

---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 30
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
    scaleUp:
      stabilizationWindowSeconds: 0
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```


References
==========

- VPA Docs: https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler
- HPA Docs: https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale
- KEDA: https://keda.sh
- Goldilocks: https://goldilocks.docs.fairwinds.com


========================================
VPA + HPA + KEDA
========================================
Right-size resources. Scale replicas. Together.
========================================</content:encoded><category>kubernetes</category><category>autoscaling</category><category>vpa</category><category>hpa</category><category>keda</category><category>performance</category><author>Mo Abukar</author></item><item><title>Pod Topology Spread Constraints - Distributing Workloads Intelligently</title><link>https://moabukar.co.uk/blog/pod-topology-spread-constraints/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/pod-topology-spread-constraints/</guid><description>Control how pods spread across nodes, zones, and regions. A deep dive into topology spread constraints for high availability and efficient resource utilization.</description><pubDate>Thu, 18 Dec 2025 00:00:00 GMT</pubDate><content:encoded># Pod Topology Spread Constraints - Distributing Workloads Intelligently

You have 6 replicas. 3 nodes. The scheduler packs most of them, possibly all 6, onto node-1 because it has the most free resources. Then node-1 dies. You&apos;re down.

Pod anti-affinity rules help, but they&apos;re blunt instruments. They say &quot;don&apos;t put two of these pods on the same node&quot; but don&apos;t guarantee even distribution.

Topology Spread Constraints give you precise control over pod distribution across zones, nodes, or any topology you define. They&apos;re the difference between &quot;hopefully spread out&quot; and &quot;guaranteed spread.&quot;

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/kubernetes.svg&quot; alt=&quot;Kubernetes logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## TL;DR

- Topology Spread Constraints control pod distribution across failure domains
- Use `maxSkew` to define acceptable imbalance
- Choose `whenUnsatisfiable`: DoNotSchedule (strict) or ScheduleAnyway (soft)
- Spread across zones for availability, nodes for resource efficiency
- Combine with pod anti-affinity for complete scheduling control

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/pod-topology-spread-constraints](https://github.com/moabukar/blog-code/tree/main/pod-topology-spread-constraints)

---

## The Problem

Default scheduling optimizes for resource efficiency, not availability:

```yaml
# 6 replicas, 3 nodes
# Default scheduling might produce:
Node-1 (lots of resources): pod-1, pod-2, pod-3, pod-4
Node-2 (some resources):    pod-5
Node-3 (some resources):    pod-6

# If Node-1 fails: 4 of 6 pods gone = degraded service
```

What we want:

```yaml
Node-1: pod-1, pod-2
Node-2: pod-3, pod-4
Node-3: pod-5, pod-6

# If any node fails: only 2 of 6 pods affected = service continues
```

---

## Basic Syntax

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: nginx
```

### Key Fields

| Field | Description |
|-------|-------------|
| `maxSkew` | Maximum allowed difference in matching pod count between any two topology domains |
| `topologyKey` | Node label to group by (hostname, zone, region) |
| `whenUnsatisfiable` | What to do if constraint can&apos;t be met |
| `labelSelector` | Which pods to count for distribution |

---

## Understanding maxSkew

`maxSkew: 1` means the difference between the most and least populated topology domains can be at most 1.

```yaml
# 6 pods, 3 nodes, maxSkew: 1

# Valid distributions:
Node-1: 2 pods
Node-2: 2 pods  
Node-3: 2 pods
# Skew = 2 - 2 = 0 ✓

Node-1: 3 pods
Node-2: 2 pods
Node-3: 1 pod
# Skew = 3 - 1 = 2 ✗ (exceeds maxSkew: 1)

Node-1: 2 pods
Node-2: 2 pods
Node-3: 1 pod
# Skew = 2 - 1 = 1 ✓ (5 pods placed so far)
```
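
The skew check above is simple enough to express in a few lines. A Python sketch, using plain dictionaries of pod counts per domain (the helper names are illustrative):

```python
def skew(pods_per_domain: dict[str, int]) -> int:
    """Skew = pods in the fullest domain minus pods in the emptiest one."""
    counts = pods_per_domain.values()
    return max(counts) - min(counts)


def satisfies(pods_per_domain: dict[str, int], max_skew: int) -> bool:
    """True when the distribution stays within max_skew."""
    return max_skew >= skew(pods_per_domain)


print(satisfies({"node-1": 2, "node-2": 2, "node-3": 2}, max_skew=1))  # True
print(satisfies({"node-1": 3, "node-2": 2, "node-3": 1}, max_skew=1))  # False
print(satisfies({"node-1": 2, "node-2": 2, "node-3": 1}, max_skew=1))  # True
```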

### maxSkew Values

- `maxSkew: 1` - As even as the pod count allows (recommended for HA)
- `maxSkew: 2` - Some imbalance allowed
- `maxSkew: N` - More flexibility, less guarantee

---

## whenUnsatisfiable Modes

### DoNotSchedule (Strict)

```yaml
whenUnsatisfiable: DoNotSchedule
```

If placing a pod would violate the constraint, don&apos;t schedule it. Pod stays Pending.

**Use when:** Availability is critical. Better to have fewer pods than violate the spread.

### ScheduleAnyway (Soft)

```yaml
whenUnsatisfiable: ScheduleAnyway
```

Try to satisfy the constraint, but schedule anyway if impossible. Scheduler still prefers compliant placements.

**Use when:** You want best-effort spreading but can&apos;t guarantee topology (e.g., autoscaling might create uneven node counts).
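
To see how the two modes differ, here is a simplified Python model of one placement decision. It assumes a domain is acceptable when its count plus one, minus the global minimum, stays within `maxSkew`; the `schedulable` set stands in for every other scheduler filter (free resources, taints). The real scheduler scores nodes rather than picking the least-loaded domain, so treat this as a sketch:

```python
from typing import Optional


def place(pods_per_domain: dict[str, int], schedulable: set[str],
          max_skew: int, when_unsatisfiable: str) -> Optional[str]:
    """Pick a domain for the next pod, or None if it must stay Pending."""
    global_min = min(pods_per_domain.values())
    candidates = sorted(d for d in pods_per_domain if d in schedulable)
    within_skew = [d for d in candidates
                   if max_skew >= pods_per_domain[d] + 1 - global_min]
    if within_skew:
        return min(within_skew, key=lambda d: pods_per_domain[d])
    if when_unsatisfiable == "ScheduleAnyway" and candidates:
        # Soft mode: accept the violation, still prefer the emptiest domain
        return min(candidates, key=lambda d: pods_per_domain[d])
    return None  # DoNotSchedule: the pod stays Pending


counts = {"zone-a": 2, "zone-b": 2, "zone-c": 1}
print(place(counts, {"zone-a", "zone-b", "zone-c"}, 1, "DoNotSchedule"))  # zone-c
print(place(counts, {"zone-a", "zone-b"}, 1, "DoNotSchedule"))            # None
print(place(counts, {"zone-a", "zone-b"}, 1, "ScheduleAnyway"))           # zone-a
```

The second call is the classic Pending case: zone-c has no room, yet it still counts toward the minimum, so no other zone can take the pod under strict mode.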

---

## Common Topology Keys

### By Node

```yaml
topologyKey: kubernetes.io/hostname
```

Spread across individual nodes. Good for node failure tolerance.

### By Zone

```yaml
topologyKey: topology.kubernetes.io/zone
```

Spread across availability zones. Essential for zone failure tolerance.

### By Region

```yaml
topologyKey: topology.kubernetes.io/region
```

Spread across regions. For multi-region deployments.

### Custom Labels

```yaml
# Nodes labeled with: rack=rack-1, rack=rack-2, etc.
topologyKey: rack
```

Spread across custom failure domains like racks, power zones, etc.

---

## Real-World Examples

### High Availability Web Service

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-api
spec:
  replicas: 6
  selector:
    matchLabels:
      app: web-api
  template:
    metadata:
      labels:
        app: web-api
    spec:
      topologySpreadConstraints:
        # Spread across zones (primary)
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web-api
        # Also spread across nodes within zones (secondary)
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web-api
      containers:
        - name: web-api
          image: myapp:latest
          resources:
            requests:
              cpu: 100m
              memory: 128Mi
```

**Result:**
- Strictly spread across zones (DoNotSchedule)
- Best-effort spread across nodes within zones (ScheduleAnyway)
- With three zones, a zone failure takes out at most ~33% of pods
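
The blast radius depends on both replica count and zone count. With `maxSkew: 1` honoured, the fullest zone holds at most ceil(replicas / zones) pods. A quick Python sketch of the worst case (an illustrative helper that assumes every zone is schedulable):

```python
import math


def worst_case_loss(replicas: int, zones: int) -> float:
    """Fraction of pods lost if the fullest zone fails, assuming maxSkew: 1 was honoured."""
    fullest_zone = math.ceil(replicas / zones)
    return fullest_zone / replicas


print(f"{worst_case_loss(6, 3):.0%}")  # 33%
print(f"{worst_case_loss(4, 3):.0%}")  # 50%
```

Replica counts that are a multiple of the zone count give the smallest worst case.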

### Database Replicas

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: postgres
spec:
  serviceName: postgres  # Headless Service name; required for a StatefulSet
  replicas: 3
  selector:
    matchLabels:
      app: postgres
  template:
    metadata:
      labels:
        app: postgres
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: postgres
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            - labelSelector:
                matchLabels:
                  app: postgres
              topologyKey: kubernetes.io/hostname
      containers:
        - name: postgres
          image: postgres:15
```

**Result:**
- One replica per zone in a three-zone cluster (topologySpreadConstraints)
- No two replicas on the same node (podAntiAffinity)
- Maximum fault tolerance for stateful workload

### Mixed Criticality

```yaml
# Critical pods: strict spreading (selector, pod labels, and containers omitted for brevity)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 4
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: payment-service
---
# Less critical: soft spreading
apiVersion: apps/v1
kind: Deployment
metadata:
  name: metrics-collector
spec:
  replicas: 4
  template:
    spec:
      topologySpreadConstraints:
        - maxSkew: 2
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: metrics-collector
```

---

## Combining with Pod Anti-Affinity

Topology Spread Constraints and Pod Anti-Affinity serve different purposes:

| Feature | Purpose |
|---------|---------|
| Topology Spread | Even distribution across domains |
| Pod Anti-Affinity | Keep specific pods apart |

### When to Use Each

**Topology Spread alone:**
```yaml
# 6 replicas across 3 zones
# Allows: zone-a: 2, zone-b: 2, zone-c: 2
# Allows: zone-a: 2, zone-b: 3, zone-c: 1 (if maxSkew: 2)
```

**Anti-Affinity alone:**
```yaml
# No two pods on same node
# Could result in: zone-a: 4, zone-b: 1, zone-c: 1
```

**Both together:**
```yaml
# Spread across zones AND no two on same node
# Best of both worlds
```

### Complete Example

```yaml
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule
      labelSelector:
        matchLabels:
          app: myapp
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: myapp
            topologyKey: kubernetes.io/hostname
```

---

## Debugging

### Check Current Distribution

```bash
# See which nodes pods are on
kubectl get pods -l app=web-api -o wide

# Count pods per node
kubectl get pods -l app=web-api -o jsonpath=&apos;{range .items[*]}{.spec.nodeName}{&quot;\n&quot;}{end}&apos; | sort | uniq -c

# Count pods per zone
kubectl get pods -l app=web-api -o jsonpath=&apos;{range .items[*]}{.spec.nodeName}{&quot;\n&quot;}{end}&apos; | \
  xargs -I{} kubectl get node {} -o jsonpath=&apos;{.metadata.labels.topology\.kubernetes\.io/zone}{&quot;\n&quot;}&apos; | \
  sort | uniq -c
```

### Why Is My Pod Pending?

```bash
kubectl describe pod &lt;pod-name&gt;

# Look for:
# Warning  FailedScheduling  default-scheduler  
#   0/6 nodes are available: 2 node(s) didn&apos;t match pod topology spread constraints
```

### Common Issues

**1. No matching nodes**
```
0/3 nodes are available: 3 node(s) didn&apos;t match pod topology spread constraints
```
- maxSkew too strict for current topology
- Solution: Increase maxSkew or add nodes

**2. Label selector mismatch**
```yaml
# Constraint counts pods with app=web
labelSelector:
  matchLabels:
    app: web

# But deployment has app=web-api
# Constraint sees 0 pods, doesn&apos;t work as expected
```

**3. Node not labeled**
```bash
# Check node labels
kubectl get nodes --show-labels | grep topology.kubernetes.io/zone

# Add missing labels
kubectl label node node-1 topology.kubernetes.io/zone=zone-a
```

---

## Cluster-Level Defaults

Set default constraints for pods that don&apos;t define their own `topologySpreadConstraints`:

```yaml
# kube-scheduler-config.yaml
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: default-scheduler
    pluginConfig:
      - name: PodTopologySpread
        args:
          defaultConstraints:
            - maxSkew: 1
              topologyKey: topology.kubernetes.io/zone
              whenUnsatisfiable: ScheduleAnyway
          defaultingType: List
```

---

## Best Practices

### 1. Start with Zones

```yaml
# Zone spreading is usually more important than node spreading
topologyKey: topology.kubernetes.io/zone
```

### 2. Use Strict for Critical Services

```yaml
# Payment service can&apos;t afford zone imbalance
whenUnsatisfiable: DoNotSchedule
```

### 3. Use Soft for Best-Effort

```yaml
# Logging can handle some imbalance
whenUnsatisfiable: ScheduleAnyway
maxSkew: 2
```

### 4. Match Label Selectors Carefully

```yaml
# Must match the pods you want to spread
labelSelector:
  matchLabels:
    app: web-api
    # Don&apos;t include version if you want all versions spread together
```

### 5. Consider Scale-Down Behavior

Constraints are only evaluated at scheduling time, so Kubernetes doesn&apos;t rebalance when you scale down. You may end up with:

```
# After scaling 6 → 3 pods
zone-a: 2 pods
zone-b: 1 pod
zone-c: 0 pods
```

Use the [Descheduler](https://github.com/kubernetes-sigs/descheduler) to rebalance.

---

## Quick Reference

```yaml
topologySpreadConstraints:
  # Spread across zones
  - maxSkew: 1                              # Max imbalance
    topologyKey: topology.kubernetes.io/zone # Group by zone
    whenUnsatisfiable: DoNotSchedule        # Strict
    labelSelector:
      matchLabels:
        app: myapp
  
  # Also spread across nodes (soft)
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway       # Best-effort
    labelSelector:
      matchLabels:
        app: myapp
```

---

## Conclusion

Topology Spread Constraints give you precise control over workload distribution:

1. **Zone spreading** - Survive zone failures
2. **Node spreading** - Survive node failures
3. **Custom topologies** - Match your infrastructure

Don&apos;t rely on luck for availability. Define your spreading requirements explicitly, and Kubernetes will enforce them at scheduling time.

---

## References

- [Kubernetes Pod Topology Spread Constraints](https://kubernetes.io/docs/concepts/scheduling-eviction/topology-spread-constraints/)
- [KEP-895: Pod Topology Spread](https://github.com/kubernetes/enhancements/tree/master/keps/sig-scheduling/895-pod-topology-spread)
- [Descheduler](https://github.com/kubernetes-sigs/descheduler)
- [Scheduler Configuration](https://kubernetes.io/docs/reference/scheduling/config/)</content:encoded><category>kubernetes</category><category>scheduling</category><category>high-availability</category><category>pods</category><category>devops</category><author>Mo Abukar</author></item><item><title>FinOps Automation: Kubecost, OpenCost, and Automated Rightsizing</title><link>https://moabukar.co.uk/blog/finops-automation/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/finops-automation/</guid><description>Implement automated cloud cost optimization with Kubecost and OpenCost. Track costs per team, rightsize resources, and automate savings.</description><pubDate>Tue, 16 Dec 2025 00:00:00 GMT</pubDate><content:encoded>FinOps Automation: Kubecost, OpenCost, and Rightsizing
======================================================

Left unowned, cloud costs can grow faster than revenue. FinOps
brings financial accountability to engineering. This guide covers
automated cost tracking, allocation, and rightsizing on Kubernetes.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/opencost.svg&quot; alt=&quot;OpenCost logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- OpenCost/Kubecost = cost allocation per namespace/team
- Automatic rightsizing recommendations
- Showback/chargeback by team
- Slack alerts for cost anomalies
- Terraform/GitOps integration


Install OpenCost
================

```bash
helm repo add opencost https://opencost.github.io/opencost-helm-chart
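helm upgrade --install opencost opencost/opencost \
  --namespace opencost --create-namespace \
  --set opencost.prometheus.internal.enabled=false \
  --set opencost.prometheus.external.enabled=true \
  --set opencost.prometheus.external.url=http://prometheus.monitoring:9090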
```


Install Kubecost
================

```bash
# Reuse the existing Prometheus: with global.prometheus.enabled=false,
# Kubecost skips its bundled Prometheus and queries global.prometheus.fqdn.
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set prometheus.server.enabled=false \
  --set prometheus.kube-state-metrics.enabled=false \
  --set prometheus.nodeExporter.enabled=false \
  --set global.prometheus.enabled=false \
  --set global.prometheus.fqdn=http://prometheus.monitoring:9090
```


Cost Allocation Labels
======================

```yaml
# Require cost allocation labels
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-cost-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-and-env
      match:
        resources:
          kinds:
            - Deployment
            - StatefulSet
      validate:
        message: &quot;Labels &apos;team&apos; and &apos;environment&apos; are required for cost allocation&quot;
        pattern:
          metadata:
            labels:
              team: &quot;?*&quot;
              environment: &quot;?*&quot;
```
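
The same rule can be checked before a manifest ever reaches the cluster, for example in CI. The sketch below is a minimal Python illustration of what the `?*` pattern requires (a non-empty value); the function name and the sample manifest are assumptions made for the example.

```python
REQUIRED_LABELS = ("team", "environment")


def missing_cost_labels(manifest):
    """Return the required labels that are absent or empty on a manifest dict."""
    labels = manifest.get("metadata", {}).get("labels") or {}
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


# Hypothetical Deployment that forgot the environment label
deployment = {
    "kind": "Deployment",
    "metadata": {"name": "api", "labels": {"team": "payments"}},
}
print(missing_cost_labels(deployment))  # ['environment']
```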


API Usage
=========

```bash
# Assumes the install above: service kubecost-cost-analyzer, namespace kubecost, port 9090

# Namespace costs (last 7 days)
curl -s &quot;http://kubecost-cost-analyzer.kubecost:9090/model/allocation?window=7d&amp;aggregate=namespace&quot; | jq

# Team costs
curl -s &quot;http://kubecost-cost-analyzer.kubecost:9090/model/allocation?window=30d&amp;aggregate=label:team&quot; | jq

# Idle costs
curl -s &quot;http://kubecost-cost-analyzer.kubecost:9090/model/allocation?window=7d&amp;aggregate=namespace&amp;idle=true&quot; | jq

# Savings recommendations
curl -s &quot;http://kubecost-cost-analyzer.kubecost:9090/model/savings&quot; | jq
```
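
When these queries move from the shell into scripts, building the query string programmatically avoids quoting mistakes. A minimal sketch, assuming Kubecost is reachable on `localhost:9090` (for example through a port-forward); the helper name is hypothetical and only the parameters used in the curl examples are covered.

```python
from urllib.parse import urlencode


def allocation_url(base, window, aggregate, idle=None):
    """Build an allocation query URL from the parameters used in the curl examples."""
    params = {"window": window, "aggregate": aggregate}
    if idle is not None:
        params["idle"] = "true" if idle else "false"
    return f"{base}/model/allocation?{urlencode(params)}"


# Assumed address: a local port-forward to the Kubecost service
print(allocation_url("http://localhost:9090", "30d", "label:team"))
```

`urlencode` percent-encodes the colon in `label:team`; the server decodes it back, so the request is equivalent to the curl form.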


Slack Alerts
============

```yaml
# Kubecost alert configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: kubecost-alerts
  namespace: kubecost
data:
  alerts.yaml: |
    alerts:
      - name: daily-spend-anomaly
        type: budget
        threshold: 500  # $500/day
        window: 1d
        aggregation: namespace
        filter: namespace!~&quot;kube-system|monitoring&quot;
        slackWebhookUrl: https://hooks.slack.com/services/xxx
        
      - name: efficiency-alert
        type: efficiency
        threshold: 0.5  # Alert if &lt;50% efficient
        window: 7d
        aggregation: namespace
        slackWebhookUrl: https://hooks.slack.com/services/xxx
        
      - name: cluster-spend
        type: budget
        threshold: 10000  # $10k/month
        window: 30d
        aggregation: cluster
        slackWebhookUrl: https://hooks.slack.com/services/xxx
```
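
To show what the first rule is meant to express, here is a small Python sketch of the same logic: skip excluded namespaces, then flag any namespace whose daily spend exceeds the threshold. It is not how Kubecost implements alerting, and the cost figures are invented for the example.

```python
import re


def namespaces_over_budget(daily_costs, threshold, exclude_pattern):
    """Return namespaces whose daily cost exceeds threshold, skipping excluded ones."""
    excluded = re.compile(exclude_pattern)
    return sorted(
        ns
        for ns, cost in daily_costs.items()
        if not excluded.fullmatch(ns) and cost > threshold
    )


# Hypothetical daily spend in dollars per namespace
costs = {"payments": 620.0, "search": 180.5, "monitoring": 900.0}
print(namespaces_over_budget(costs, 500, "kube-system|monitoring"))  # ['payments']
```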


Rightsizing Automation
======================

```yaml
# VPA recommendations
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: &quot;Off&quot;  # Recommendations only
  resourcePolicy:
    containerPolicies:
      - containerName: &apos;*&apos;
        minAllowed:
          cpu: 50m
          memory: 64Mi
        maxAllowed:
          cpu: 4
          memory: 8Gi
```
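
The `minAllowed` and `maxAllowed` values act as a floor and a ceiling on whatever the recommender computes. A minimal Python sketch of that clamping for CPU, with hypothetical helper names and the bounds from the policy above as defaults:

```python
def cpu_to_millicores(quantity):
    """Convert a Kubernetes CPU quantity such as '50m' or '4' to millicores."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1])
    return int(round(float(text) * 1000))


def clamp_cpu(recommended, min_allowed="50m", max_allowed="4"):
    """Bound a CPU recommendation by minAllowed and maxAllowed, in millicores."""
    low = cpu_to_millicores(min_allowed)
    high = cpu_to_millicores(max_allowed)
    return max(low, min(cpu_to_millicores(recommended), high))


print(clamp_cpu("12m"))   # 50   (raised to the floor)
print(clamp_cpu("850m"))  # 850  (within bounds)
print(clamp_cpu("6"))     # 4000 (capped at the ceiling)
```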


Apply Rightsizing
-----------------

```bash
#!/bin/bash
# rightsizing-report.sh

# Print the recommended target for the first container of every VPA
for vpa in $(kubectl get vpa -A -o jsonpath=&apos;{range .items[*]}{.metadata.namespace}/{.metadata.name} {end}&apos;); do
  NS=&quot;${vpa%%/*}&quot;
  NAME=&quot;${vpa##*/}&quot;

  TARGET=$(kubectl get vpa -n &quot;$NS&quot; &quot;$NAME&quot; -o jsonpath=&apos;{.status.recommendation.containerRecommendations[0].target}&apos;)

  echo &quot;VPA: $NS/$NAME&quot;
  echo &quot;Recommended: $TARGET&quot;
  echo &quot;---&quot;
done
```
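
The shell loop reads only the first container of each VPA. For pods with sidecars, a Python variant that walks every container recommendation may be more useful. This is a sketch that parses the JSON from `kubectl get vpa -A -o json`; the sample document is trimmed and hypothetical.

```python
import json


def vpa_targets(vpa_list):
    """Yield (namespace, vpa name, container, target) for every recommendation."""
    for item in vpa_list.get("items", []):
        meta = item["metadata"]
        recs = (
            item.get("status", {})
            .get("recommendation", {})
            .get("containerRecommendations", [])
        )
        for rec in recs:
            yield meta["namespace"], meta["name"], rec["containerName"], rec["target"]


# Trimmed, hypothetical sample of `kubectl get vpa -A -o json`
sample = json.loads("""
{"items": [{"metadata": {"namespace": "prod", "name": "api-server-vpa"},
            "status": {"recommendation": {"containerRecommendations": [
              {"containerName": "api", "target": {"cpu": "250m", "memory": "512Mi"}}
            ]}}}]}
""")
for row in vpa_targets(sample):
    print(row)
```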


Grafana Dashboards
==================

```json
{
  &quot;panels&quot;: [
    {
      &quot;title&quot;: &quot;Cost by Namespace (30d)&quot;,
      &quot;type&quot;: &quot;piechart&quot;,
      &quot;targets&quot;: [
        {
          &quot;expr&quot;: &quot;sum(kubecost_allocation_cost{window=\&quot;30d\&quot;}) by (namespace)&quot;,
          &quot;legendFormat&quot;: &quot;{{ namespace }}&quot;
        }
      ]
    },
    {
      &quot;title&quot;: &quot;Cost by Team (30d)&quot;,
      &quot;type&quot;: &quot;piechart&quot;,
      &quot;targets&quot;: [
        {
          &quot;expr&quot;: &quot;sum(kubecost_allocation_cost{window=\&quot;30d\&quot;}) by (team)&quot;,
          &quot;legendFormat&quot;: &quot;{{ team }}&quot;
        }
      ]
    },
    {
      &quot;title&quot;: &quot;Daily Spend Trend&quot;,
      &quot;type&quot;: &quot;timeseries&quot;,
      &quot;targets&quot;: [
        {
          &quot;expr&quot;: &quot;sum(kubecost_allocation_cost{window=\&quot;1d\&quot;})&quot;,
          &quot;legendFormat&quot;: &quot;Daily Cost&quot;
        }
      ]
    },
    {
      &quot;title&quot;: &quot;Idle Resources (%)&quot;,
      &quot;type&quot;: &quot;gauge&quot;,
      &quot;targets&quot;: [
        {
          &quot;expr&quot;: &quot;sum(kubecost_allocation_cpu_idle_cost) / sum(kubecost_allocation_cpu_cost) * 100&quot;,
          &quot;legendFormat&quot;: &quot;CPU Idle %&quot;
        }
      ]
    }
  ]
}
```


Terraform Integration
=====================

```hcl
# Enforce resource requests/limits
resource &quot;kubectl_manifest&quot; &quot;cost_policy&quot; {
  yaml_body = &lt;&lt;EOF
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-requests-limits
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-resources
      match:
        resources:
          kinds:
            - Pod
      validate:
        message: &quot;Resource requests and limits are required&quot;
        pattern:
          spec:
            containers:
              - resources:
                  requests:
                    cpu: &quot;?*&quot;
                    memory: &quot;?*&quot;
                  limits:
                    memory: &quot;?*&quot;
EOF
}

# Tag resources for allocation
resource &quot;aws_resourcegroups_group&quot; &quot;team_resources&quot; {
  for_each = toset(var.teams)
  
  name = &quot;team-${each.key}&quot;
  
  resource_query {
    query = jsonencode({
      ResourceTypeFilters = [&quot;AWS::AllSupported&quot;]
      TagFilters = [
        {
          Key    = &quot;team&quot;
          Values = [each.key]
        }
      ]
    })
  }
}
```


Weekly Cost Report
==================

```python
#!/usr/bin/env python3
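import requests
from datetime import datetime

# Assumes Kubecost was installed as release &quot;kubecost&quot; in namespace &quot;kubecost&quot;
KUBECOST_URL = &quot;http://kubecost-cost-analyzer.kubecost:9090&quot;
SLACK_WEBHOOK = &quot;https://hooks.slack.com/services/xxx&quot;

def get_costs():
    # Let requests build the query string; fail fast on HTTP errors and hangs
    resp = requests.get(
        f&quot;{KUBECOST_URL}/model/allocation&quot;,
        params={&quot;window&quot;: &quot;7d&quot;, &quot;aggregate&quot;: &quot;namespace&quot;},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

def get_recommendations():
    resp = requests.get(f&quot;{KUBECOST_URL}/model/savings&quot;, timeout=30)
    resp.raise_for_status()
    return resp.json()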

def send_slack_report(costs, recommendations):
    # The allocation API returns &quot;data&quot; as a list of allocation sets (one per
    # time step), each mapping a namespace name to its allocation.
    by_namespace = {}
    for allocation_set in costs.get(&apos;data&apos;, []):
        for name, allocation in (allocation_set or {}).items():
            by_namespace[name] = by_namespace.get(name, 0) + allocation.get(&apos;totalCost&apos;, 0)

    total = sum(by_namespace.values())
    potential_savings = recommendations.get(&apos;totalSavings&apos;, 0)

    blocks = [
        {
            &quot;type&quot;: &quot;header&quot;,
            &quot;text&quot;: {&quot;type&quot;: &quot;plain_text&quot;, &quot;text&quot;: f&quot;📊 Weekly Cost Report - {datetime.now().strftime(&apos;%Y-%m-%d&apos;)}&quot;}
        },
        {
            &quot;type&quot;: &quot;section&quot;,
            &quot;fields&quot;: [
                {&quot;type&quot;: &quot;mrkdwn&quot;, &quot;text&quot;: f&quot;*Total Spend (7d):* ${total:.2f}&quot;},
                {&quot;type&quot;: &quot;mrkdwn&quot;, &quot;text&quot;: f&quot;*Potential Savings:* ${potential_savings:.2f}&quot;}
            ]
        }
    ]

    # Top 5 namespaces
    top_ns = sorted(by_namespace.items(), key=lambda item: item[1], reverse=True)[:5]
    ns_text = &quot;\n&quot;.join([f&quot;• {name}: ${cost:.2f}&quot; for name, cost in top_ns])
    
    blocks.append({
        &quot;type&quot;: &quot;section&quot;,
        &quot;text&quot;: {&quot;type&quot;: &quot;mrkdwn&quot;, &quot;text&quot;: f&quot;*Top 5 Namespaces:*\n{ns_text}&quot;}
    })
    
    requests.post(SLACK_WEBHOOK, json={&quot;blocks&quot;: blocks})

if __name__ == &quot;__main__&quot;:
    costs = get_costs()
    recs = get_recommendations()
    send_slack_report(costs, recs)
```


References
==========

- OpenCost: https://www.opencost.io
- Kubecost: https://docs.kubecost.com
- FinOps Foundation: https://www.finops.org


========================================
FinOps + Kubecost + Automation
========================================
Track costs. Optimize automatically.
========================================</content:encoded><category>finops</category><category>kubecost</category><category>opencost</category><category>kubernetes</category><category>cost-optimization</category><category>observability</category><author>Mo Abukar</author></item><item><title>Migrating a Java Application from EC2 to ECS Fargate: A Step-by-Step Guide</title><link>https://moabukar.co.uk/blog/ec2-to-fargate-java-migration/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/ec2-to-fargate-java-migration/</guid><description>The complete journey of containerising a Java JAR running on EC2 and deploying it to ECS Fargate – from local testing to Dockerfile, task definitions, networking, secrets management, and achieving production parity.</description><pubDate>Mon, 15 Dec 2025 00:00:00 GMT</pubDate><content:encoded># Migrating a Java Application from EC2 to ECS Fargate: A Step-by-Step Guide

Running Java applications on EC2 works, but you&apos;re managing instances, patching the OS, handling auto-scaling groups, and dealing with capacity planning. ECS Fargate removes the instance layer – you define your container and AWS runs it.

I&apos;ve migrated dozens of Java applications from EC2 to Fargate. This post walks through the complete process: validating the application locally, building an optimised Docker image, creating the ECS task definition, handling secrets and configuration, setting up networking, and achieving production parity.

By the end, you&apos;ll have a repeatable process for containerising any Java application.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/aws.svg&quot; alt=&quot;AWS logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## Prerequisites

Before starting:

- Java application packaged as a JAR (or WAR)
- Docker installed locally
- AWS CLI configured
- Basic familiarity with ECS concepts
- Terraform (optional, but recommended for infrastructure)

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/ec2-to-fargate-java-migration](https://github.com/moabukar/blog-code/tree/main/ec2-to-fargate-java-migration)

## Step 1: Understand the Existing EC2 Setup

Before containerising, document everything about the current deployment:

```bash
# SSH into the EC2 instance
ssh -i key.pem ec2-user@your-ec2-instance

# Find the Java process
ps aux | grep java
```

Output:
```
ec2-user  1234  5.2 12.3 4567890 123456 ?  Sl   10:00   1:23 
  /usr/bin/java -Xms512m -Xmx2g -Dspring.profiles.active=prod 
  -Dserver.port=8080 -jar /opt/app/myapp.jar
```

Document:

| Item | Value |
|------|-------|
| Java version | `java -version` → OpenJDK 17 |
| JVM flags | `-Xms512m -Xmx2g` |
| Spring profile | `prod` |
| Port | `8080` |
| JAR location | `/opt/app/myapp.jar` |
| Config files | `/opt/app/config/application.yml` |
| Environment variables | Check `/etc/environment` or systemd service |
| Log location | `/var/log/app/` |
| Health check endpoint | `GET /actuator/health` |
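
If there are many instances to inventory, the `ps` output can be parsed rather than read by eye. A minimal Python sketch, assuming the command line looks like the example above; the function name and result layout are illustrative.

```python
import shlex


def parse_java_command(command):
    """Split a java command line into heap flags, system properties and the JAR path."""
    args = shlex.split(command)
    result = {"heap": [], "properties": {}, "jar": None}
    for index, arg in enumerate(args):
        if arg.startswith(("-Xms", "-Xmx")):
            result["heap"].append(arg)
        elif arg.startswith("-D") and "=" in arg:
            key, value = arg[2:].split("=", 1)
            result["properties"][key] = value
        elif arg == "-jar" and len(args) > index + 1:
            result["jar"] = args[index + 1]
    return result


cmd = ("/usr/bin/java -Xms512m -Xmx2g -Dspring.profiles.active=prod "
       "-Dserver.port=8080 -jar /opt/app/myapp.jar")
print(parse_java_command(cmd))
```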

Also check:

```bash
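# Environment variables seen by the running JVM (your SSH shell may not have them)
sudo cat /proc/$(pgrep -f myapp.jar | head -n1)/environ | tr &apos;\0&apos; &apos;\n&apos; | grep -E &quot;(DB_|API_|SECRET_)&quot;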

# Config files
cat /opt/app/config/application.yml

# System resources
free -h
nproc

# Network dependencies
netstat -tlnp | grep java
cat /etc/hosts
```

## Step 2: Run the JAR Locally

Before containerising, verify the application runs locally with the same JAR and JVM flags, swapping only the environment-specific values:

```bash
# Create a working directory
mkdir -p ~/migration-test &amp;&amp; cd ~/migration-test

# Copy the JAR from EC2
scp -i key.pem ec2-user@your-ec2-instance:/opt/app/myapp.jar .

# Copy config files
scp -i key.pem ec2-user@your-ec2-instance:/opt/app/config/application.yml .

# Set the same environment variable names as on EC2, with local values
export DB_HOST=localhost
export DB_PASSWORD=testpassword
export SPRING_PROFILES_ACTIVE=local

# Run with the same JVM flags
java -Xms512m -Xmx2g \
  -Dspring.profiles.active=local \
  -Dserver.port=8080 \
  -jar myapp.jar
```

Test the health endpoint:

```bash
curl http://localhost:8080/actuator/health
# {&quot;status&quot;:&quot;UP&quot;}
```

If it doesn&apos;t work locally, fix it before proceeding. Common issues:
- Missing environment variables
- Database connectivity (use a local DB or mock)
- External service dependencies
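
A Spring Boot application can take a while to start, so a single `curl` may fail simply because it ran too early. The sketch below polls the health endpoint until it reports `UP`. It is a minimal Python example using only the standard library; the retry count and delay are arbitrary assumptions.

```python
import json
import time
import urllib.error
import urllib.request


def is_up(body):
    """Return True when an actuator health payload reports status UP."""
    try:
        return json.loads(body).get("status") == "UP"
    except (ValueError, AttributeError):
        return False


def wait_for_health(url, attempts=10, delay=3.0):
    """Poll the health endpoint until it reports UP or attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if is_up(resp.read().decode("utf-8")):
                    return True
        except (urllib.error.URLError, OSError):
            pass  # Not listening yet, or a non-2xx response; retry after the delay
        time.sleep(delay)
    return False


# Example call: wait_for_health("http://localhost:8080/actuator/health")
```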

## Step 3: Create the Dockerfile

### Basic Dockerfile

Start simple:

```dockerfile
# Dockerfile
FROM eclipse-temurin:17-jre-alpine

WORKDIR /app

COPY myapp.jar app.jar

EXPOSE 8080

ENTRYPOINT [&quot;java&quot;, &quot;-jar&quot;, &quot;app.jar&quot;]
```

### Production-Ready Dockerfile

A real production Dockerfile needs more:

```dockerfile
# Dockerfile
FROM eclipse-temurin:17-jre-alpine AS runtime

# Security: run as non-root user
RUN addgroup -g 1001 appgroup &amp;&amp; \
    adduser -u 1001 -G appgroup -D appuser

WORKDIR /app

# Copy the JAR
COPY --chown=appuser:appgroup myapp.jar app.jar

# Create directories for logs and temp files
RUN mkdir -p /app/logs /app/tmp &amp;&amp; \
    chown -R appuser:appgroup /app

# Switch to non-root user
USER appuser

# Expose the application port
EXPOSE 8080

# Health check (ECS also does health checks, but this is useful for Docker)
HEALTHCHECK --interval=30s --timeout=10s --start-period=60s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:8080/actuator/health || exit 1

# JVM configuration via environment variables
ENV JAVA_OPTS=&quot;-XX:+UseContainerSupport \
  -XX:MaxRAMPercentage=75.0 \
  -XX:InitialRAMPercentage=50.0 \
  -Djava.security.egd=file:/dev/./urandom \
  -Duser.timezone=UTC&quot;

# Application configuration
ENV SERVER_PORT=8080
ENV SPRING_PROFILES_ACTIVE=prod

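# exec replaces the shell so the JVM runs as PID 1 and receives SIGTERM when ECS stops the task
ENTRYPOINT [&quot;sh&quot;, &quot;-c&quot;, &quot;exec java $JAVA_OPTS -jar app.jar&quot;]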
```

### Key Dockerfile Decisions Explained

**Base image: `eclipse-temurin:17-jre-alpine`**
- Eclipse Temurin is the successor to AdoptOpenJDK
- JRE-only (not JDK) – smaller image, no compiler needed at runtime
- Alpine Linux – smallest footprint (~170MB vs ~400MB for Debian)

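**`-XX:+UseContainerSupport`**
- Makes the JVM size itself from the container&apos;s memory and CPU limits rather than the host&apos;s
- Enabled by default since JDK 10 (and 8u191), so on Java 17 the flag is redundant; it is kept here to make the intent explicit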

**`-XX:MaxRAMPercentage=75.0`**
- Use 75% of container memory for heap
- Leaves 25% for metaspace, thread stacks, native memory, and OS

**Non-root user**
- Security best practice – the container shouldn&apos;t run as root
- On Fargate, containers run as root unless the image or the task definition sets a user

**`file:/dev/./urandom`**
- Faster startup – avoids blocking on `/dev/random` for entropy
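
The percentage flags translate into concrete heap sizes once the container memory is known. A quick Python sketch of the arithmetic (approximate, since the JVM aligns sizes internally; the function name is illustrative):

```python
def heap_sizes_mib(container_memory_mib, max_ram_pct=75.0, initial_ram_pct=50.0):
    """Return (initial heap, max heap, non-heap headroom) in MiB for a container."""
    max_heap = container_memory_mib * max_ram_pct / 100
    initial_heap = container_memory_mib * initial_ram_pct / 100
    return initial_heap, max_heap, container_memory_mib - max_heap


print(heap_sizes_mib(1024))  # (512.0, 768.0, 256.0)
print(heap_sizes_mib(2048))  # (1024.0, 1536.0, 512.0)
```

Note that the EC2 process ran with `-Xmx2g`. At 75 percent, matching that heap takes a container of roughly 2.7 GiB, so a 1024 MiB task gives the application a much smaller heap than it had on EC2.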

## Step 4: Build and Test the Docker Image

```bash
# Build the image
docker build -t myapp:latest .

# Run locally with environment variables
docker run -d \
  --name myapp-test \
  -p 8080:8080 \
  -e DB_HOST=host.docker.internal \
  -e DB_PASSWORD=testpassword \
  -e SPRING_PROFILES_ACTIVE=local \
  myapp:latest

# Check logs
docker logs -f myapp-test

# Test health endpoint
curl http://localhost:8080/actuator/health

# Check resource usage
docker stats myapp-test
```

### Test with Memory Limits (Simulating Fargate)

Fargate tasks have specific memory allocations. Test with limits:

```bash
# Simulate a 1GB Fargate task
docker run -d \
  --name myapp-constrained \
  --memory=1g \
  --cpus=0.5 \
  -p 8080:8080 \
  -e SPRING_PROFILES_ACTIVE=local \
  myapp:latest

# Watch memory usage
docker stats myapp-constrained
```

If the container gets OOM-killed, adjust your `MaxRAMPercentage` or increase the memory allocation.

## Step 5: Push to Amazon ECR

```bash
# Create ECR repository
aws ecr create-repository \
  --repository-name myapp \
  --image-scanning-configuration scanOnPush=true

# Get the repository URI
ECR_URI=$(aws ecr describe-repositories \
  --repository-names myapp \
  --query &apos;repositories[0].repositoryUri&apos; \
  --output text)

# Authenticate Docker to ECR
aws ecr get-login-password --region eu-west-1 | \
  docker login --username AWS --password-stdin $ECR_URI

# Tag and push
docker tag myapp:latest $ECR_URI:latest
docker tag myapp:latest $ECR_URI:$(git rev-parse --short HEAD)
docker push $ECR_URI:latest
docker push $ECR_URI:$(git rev-parse --short HEAD)
```

## Step 6: Create the ECS Task Definition

### Using AWS CLI

```bash
# Create task definition JSON
cat &gt; task-definition.json &lt;&lt; &apos;EOF&apos;
{
  &quot;family&quot;: &quot;myapp&quot;,
  &quot;networkMode&quot;: &quot;awsvpc&quot;,
  &quot;requiresCompatibilities&quot;: [&quot;FARGATE&quot;],
  &quot;cpu&quot;: &quot;512&quot;,
  &quot;memory&quot;: &quot;1024&quot;,
  &quot;executionRoleArn&quot;: &quot;arn:aws:iam::123456789012:role/ecsTaskExecutionRole&quot;,
  &quot;taskRoleArn&quot;: &quot;arn:aws:iam::123456789012:role/myapp-task-role&quot;,
  &quot;containerDefinitions&quot;: [
    {
      &quot;name&quot;: &quot;myapp&quot;,
      &quot;image&quot;: &quot;123456789012.dkr.ecr.eu-west-1.amazonaws.com/myapp:latest&quot;,
      &quot;essential&quot;: true,
      &quot;portMappings&quot;: [
        {
          &quot;containerPort&quot;: 8080,
          &quot;protocol&quot;: &quot;tcp&quot;
        }
      ],
      &quot;environment&quot;: [
        {
          &quot;name&quot;: &quot;SPRING_PROFILES_ACTIVE&quot;,
          &quot;value&quot;: &quot;prod&quot;
        },
        {
          &quot;name&quot;: &quot;SERVER_PORT&quot;,
          &quot;value&quot;: &quot;8080&quot;
        }
      ],
      &quot;secrets&quot;: [
        {
          &quot;name&quot;: &quot;DB_PASSWORD&quot;,
          &quot;valueFrom&quot;: &quot;arn:aws:secretsmanager:eu-west-1:123456789012:secret:myapp/db-password&quot;
        },
        {
          &quot;name&quot;: &quot;API_KEY&quot;,
          &quot;valueFrom&quot;: &quot;arn:aws:ssm:eu-west-1:123456789012:parameter/myapp/api-key&quot;
        }
      ],
      &quot;logConfiguration&quot;: {
        &quot;logDriver&quot;: &quot;awslogs&quot;,
        &quot;options&quot;: {
          &quot;awslogs-group&quot;: &quot;/ecs/myapp&quot;,
          &quot;awslogs-region&quot;: &quot;eu-west-1&quot;,
          &quot;awslogs-stream-prefix&quot;: &quot;ecs&quot;
        }
      },
      &quot;healthCheck&quot;: {
        &quot;command&quot;: [&quot;CMD-SHELL&quot;, &quot;wget --no-verbose --tries=1 --spider http://localhost:8080/actuator/health || exit 1&quot;],
        &quot;interval&quot;: 30,
        &quot;timeout&quot;: 10,
        &quot;retries&quot;: 3,
        &quot;startPeriod&quot;: 60
      }
    }
  ]
}
EOF

# Register the task definition
aws ecs register-task-definition --cli-input-json file://task-definition.json
```
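
Fargate only accepts certain `cpu` and `memory` pairs, and an invalid pair is rejected when the task definition is registered. The Python sketch below encodes the Linux combinations as documented by AWS at the time of writing; treat the table as an assumption and check the current Fargate documentation before relying on it.

```python
# CPU units mapped to the memory values (MiB) Fargate accepts for Linux tasks.
FARGATE_MEMORY = {
    256: [512, 1024, 2048],
    512: list(range(1024, 4096 + 1, 1024)),
    1024: list(range(2048, 8192 + 1, 1024)),
    2048: list(range(4096, 16384 + 1, 1024)),
    4096: list(range(8192, 30720 + 1, 1024)),
    8192: list(range(16384, 61440 + 1, 4096)),
    16384: list(range(32768, 122880 + 1, 8192)),
}


def valid_fargate_size(cpu, memory):
    """Return True if the cpu/memory pair (as ints or numeric strings) is accepted."""
    return int(memory) in FARGATE_MEMORY.get(int(cpu), [])


print(valid_fargate_size("512", "1024"))  # True: the pair used above
print(valid_fargate_size("256", "4096"))  # False: 256 CPU units allow at most 2048 MiB
```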

### Using Terraform (Recommended)

```hcl
# ecr.tf
resource &quot;aws_ecr_repository&quot; &quot;myapp&quot; {
  name                 = &quot;myapp&quot;
  image_tag_mutability = &quot;MUTABLE&quot;

  image_scanning_configuration {
    scan_on_push = true
  }

  encryption_configuration {
    encryption_type = &quot;AES256&quot;
  }
}

resource &quot;aws_ecr_lifecycle_policy&quot; &quot;myapp&quot; {
  repository = aws_ecr_repository.myapp.name

  policy = jsonencode({
    rules = [
      {
        rulePriority = 1
        description  = &quot;Keep last 10 images&quot;
        selection = {
          tagStatus   = &quot;any&quot;
          countType   = &quot;imageCountMoreThan&quot;
          countNumber = 10
        }
        action = {
          type = &quot;expire&quot;
        }
      }
    ]
  })
}
```

```hcl
# iam.tf
# Task execution role (for ECS to pull images and write logs)
resource &quot;aws_iam_role&quot; &quot;ecs_task_execution&quot; {
  name = &quot;myapp-ecs-task-execution&quot;

  assume_role_policy = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [
      {
        Action = &quot;sts:AssumeRole&quot;
        Effect = &quot;Allow&quot;
        Principal = {
          Service = &quot;ecs-tasks.amazonaws.com&quot;
        }
      }
    ]
  })
}

resource &quot;aws_iam_role_policy_attachment&quot; &quot;ecs_task_execution&quot; {
  role       = aws_iam_role.ecs_task_execution.name
  policy_arn = &quot;arn:aws:iam::aws:policy/service-role/AmazonECSTaskExecutionRolePolicy&quot;
}

# Allow reading secrets
resource &quot;aws_iam_role_policy&quot; &quot;ecs_task_execution_secrets&quot; {
  name = &quot;secrets-access&quot;
  role = aws_iam_role.ecs_task_execution.id

  policy = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [
      {
        Effect = &quot;Allow&quot;
        Action = [
          &quot;secretsmanager:GetSecretValue&quot;
        ]
        Resource = [
          &quot;arn:aws:secretsmanager:eu-west-1:*:secret:myapp/*&quot;
        ]
      },
      {
        Effect = &quot;Allow&quot;
        Action = [
          &quot;ssm:GetParameters&quot;
        ]
        Resource = [
          &quot;arn:aws:ssm:eu-west-1:*:parameter/myapp/*&quot;
        ]
      }
    ]
  })
}

# Task role (for the application to access AWS services)
resource &quot;aws_iam_role&quot; &quot;ecs_task&quot; {
  name = &quot;myapp-ecs-task&quot;

  assume_role_policy = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [
      {
        Action = &quot;sts:AssumeRole&quot;
        Effect = &quot;Allow&quot;
        Principal = {
          Service = &quot;ecs-tasks.amazonaws.com&quot;
        }
      }
    ]
  })
}

# Add policies for S3, SQS, etc. as needed
resource &quot;aws_iam_role_policy&quot; &quot;ecs_task_s3&quot; {
  name = &quot;s3-access&quot;
  role = aws_iam_role.ecs_task.id

  policy = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [
      {
        Effect = &quot;Allow&quot;
        Action = [
          &quot;s3:GetObject&quot;,
          &quot;s3:PutObject&quot;
        ]
        Resource = [
          &quot;arn:aws:s3:::myapp-data/*&quot;
        ]
      }
    ]
  })
}
```

```hcl
# logs.tf
resource &quot;aws_cloudwatch_log_group&quot; &quot;myapp&quot; {
  name              = &quot;/ecs/myapp&quot;
  retention_in_days = 30
}
```

```hcl
# task-definition.tf
resource &quot;aws_ecs_task_definition&quot; &quot;myapp&quot; {
  family                   = &quot;myapp&quot;
  network_mode             = &quot;awsvpc&quot;
  requires_compatibilities = [&quot;FARGATE&quot;]
  cpu                      = &quot;512&quot;
  memory                   = &quot;1024&quot;
  execution_role_arn       = aws_iam_role.ecs_task_execution.arn
  task_role_arn            = aws_iam_role.ecs_task.arn

  container_definitions = jsonencode([
    {
      name      = &quot;myapp&quot;
      image     = &quot;${aws_ecr_repository.myapp.repository_url}:latest&quot;
      essential = true

      portMappings = [
        {
          containerPort = 8080
          protocol      = &quot;tcp&quot;
        }
      ]

      environment = [
        {
          name  = &quot;SPRING_PROFILES_ACTIVE&quot;
          value = var.environment
        },
        {
          name  = &quot;SERVER_PORT&quot;
          value = &quot;8080&quot;
        },
        {
          name  = &quot;JAVA_OPTS&quot;
          value = &quot;-XX:+UseContainerSupport -XX:MaxRAMPercentage=75.0 -XX:InitialRAMPercentage=50.0&quot;
        }
      ]

      secrets = [
        {
          name      = &quot;DB_PASSWORD&quot;
          valueFrom = aws_secretsmanager_secret.db_password.arn
        },
        {
          name      = &quot;DB_HOST&quot;
          valueFrom = aws_ssm_parameter.db_host.arn
        }
      ]

      logConfiguration = {
        logDriver = &quot;awslogs&quot;
        options = {
          &quot;awslogs-group&quot;         = aws_cloudwatch_log_group.myapp.name
          &quot;awslogs-region&quot;        = &quot;eu-west-1&quot;
          &quot;awslogs-stream-prefix&quot; = &quot;ecs&quot;
        }
      }

      healthCheck = {
        command     = [&quot;CMD-SHELL&quot;, &quot;wget --no-verbose --tries=1 --spider http://localhost:8080/actuator/health || exit 1&quot;]
        interval    = 30
        timeout     = 10
        retries     = 3
        startPeriod = 60
      }
    }
  ])

  tags = {
    Name        = &quot;myapp&quot;
    Environment = var.environment
  }
}
```

## Step 7: Set Up Secrets

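Never put secrets in the plain-text `environment` block of the task definition, where anyone who can read the task definition can see them. Store them in Secrets Manager or Parameter Store and reference them from the `secrets` block: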

```bash
# Create secret in Secrets Manager
aws secretsmanager create-secret \
  --name myapp/db-password \
  --secret-string &quot;your-secure-password&quot;

# Or use Parameter Store (cheaper, simpler)
aws ssm put-parameter \
  --name /myapp/db-host \
  --value &quot;mydb.cluster-xxx.eu-west-1.rds.amazonaws.com&quot; \
  --type SecureString
```

In Terraform:

```hcl
resource &quot;aws_secretsmanager_secret&quot; &quot;db_password&quot; {
  name = &quot;myapp/db-password&quot;
}

resource &quot;aws_secretsmanager_secret_version&quot; &quot;db_password&quot; {
  secret_id = aws_secretsmanager_secret.db_password.id
  # Don&apos;t put the actual secret in Terraform - bootstrap manually
  secret_string = &quot;PLACEHOLDER&quot;

  lifecycle {
    ignore_changes = [secret_string]
  }
}

resource &quot;aws_ssm_parameter&quot; &quot;db_host&quot; {
  name  = &quot;/myapp/db-host&quot;
  type  = &quot;SecureString&quot;
  value = var.db_host
}
```
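
A simple guard against regressions is to scan a task definition for secret-looking names in `environment` before registering it. The Python sketch below is a heuristic only: the marker list and function name are assumptions, and a clean result does not prove that no secret is present.

```python
SUSPICIOUS = ("PASSWORD", "SECRET", "TOKEN", "API_KEY")


def plaintext_secret_names(task_definition):
    """Return names in `environment` blocks that look like they hold secrets."""
    flagged = []
    for container in task_definition.get("containerDefinitions", []):
        for entry in container.get("environment", []):
            name = entry.get("name", "")
            if any(marker in name.upper() for marker in SUSPICIOUS):
                flagged.append(name)
    return flagged


# Hypothetical task definition that leaks a password in plain text
task_def = {
    "containerDefinitions": [
        {
            "name": "myapp",
            "environment": [
                {"name": "SERVER_PORT", "value": "8080"},
                {"name": "DB_PASSWORD", "value": "hunter2"},
            ],
        }
    ]
}
print(plaintext_secret_names(task_def))  # ['DB_PASSWORD']
```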

## Step 8: Create the ECS Service

```hcl
# ecs-cluster.tf
resource &quot;aws_ecs_cluster&quot; &quot;main&quot; {
  name = &quot;myapp-cluster&quot;

  setting {
    name  = &quot;containerInsights&quot;
    value = &quot;enabled&quot;
  }
}

resource &quot;aws_ecs_cluster_capacity_providers&quot; &quot;main&quot; {
  cluster_name = aws_ecs_cluster.main.name

  capacity_providers = [&quot;FARGATE&quot;, &quot;FARGATE_SPOT&quot;]

  default_capacity_provider_strategy {
    base              = 1
    weight            = 1
    capacity_provider = &quot;FARGATE&quot;
  }
}
```

```hcl
# ecs-service.tf
resource &quot;aws_ecs_service&quot; &quot;myapp&quot; {
  name            = &quot;myapp&quot;
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.myapp.arn
  desired_count   = 2
  launch_type     = &quot;FARGATE&quot;

  network_configuration {
    subnets          = var.private_subnet_ids
    security_groups  = [aws_security_group.ecs_tasks.id]
    assign_public_ip = false
  }

  load_balancer {
    target_group_arn = aws_lb_target_group.myapp.arn
    container_name   = &quot;myapp&quot;
    container_port   = 8080
  }

  deployment_minimum_healthy_percent = 50
  deployment_maximum_percent         = 200

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # Allow time for health checks during deployment
  health_check_grace_period_seconds = 120

  lifecycle {
    ignore_changes = [desired_count]  # Allow auto-scaling to manage
  }

  depends_on = [aws_lb_listener.https]
}
```
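
The 50 and 200 percent settings define how many tasks may run during a rolling deployment. ECS rounds the lower bound up and the upper bound down; a short Python sketch of that arithmetic (the function name is illustrative):

```python
def deployment_bounds(desired_count, minimum_healthy_percent=50, maximum_percent=200):
    """Return (min running tasks, max running tasks) during a rolling deployment."""
    # Lower bound rounds up (ceiling division), upper bound rounds down.
    min_tasks = -(-desired_count * minimum_healthy_percent // 100)
    max_tasks = desired_count * maximum_percent // 100
    return min_tasks, max_tasks


print(deployment_bounds(2))  # (1, 4)
print(deployment_bounds(3))  # (2, 6)
```

With a desired count of 2, ECS may drop to one healthy task while it starts replacements and may run up to four tasks at once.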

```hcl
# security-groups.tf
resource &quot;aws_security_group&quot; &quot;ecs_tasks&quot; {
  name        = &quot;myapp-ecs-tasks&quot;
  description = &quot;Allow inbound from ALB&quot;
  vpc_id      = var.vpc_id

  ingress {
    description     = &quot;HTTP from ALB&quot;
    from_port       = 8080
    to_port         = 8080
    protocol        = &quot;tcp&quot;
    security_groups = [aws_security_group.alb.id]
  }

  egress {
    description = &quot;All outbound&quot;
    from_port   = 0
    to_port     = 0
    protocol    = &quot;-1&quot;
    cidr_blocks = [&quot;0.0.0.0/0&quot;]
  }
}
```

## Step 9: Set Up Load Balancer

```hcl
# alb.tf
resource &quot;aws_lb&quot; &quot;main&quot; {
  name               = &quot;myapp-alb&quot;
  internal           = false
  load_balancer_type = &quot;application&quot;
  security_groups    = [aws_security_group.alb.id]
  subnets            = var.public_subnet_ids

  enable_deletion_protection = true
}

resource &quot;aws_lb_target_group&quot; &quot;myapp&quot; {
  name        = &quot;myapp-tg&quot;
  port        = 8080
  protocol    = &quot;HTTP&quot;
  vpc_id      = var.vpc_id
  target_type = &quot;ip&quot;  # Required for Fargate

  health_check {
    enabled             = true
    healthy_threshold   = 2
    unhealthy_threshold = 3
    timeout             = 10
    interval            = 30
    path                = &quot;/actuator/health&quot;
    port                = &quot;traffic-port&quot;
    protocol            = &quot;HTTP&quot;
    matcher             = &quot;200&quot;
  }

  deregistration_delay = 30

  stickiness {
    type            = &quot;lb_cookie&quot;
    cookie_duration = 86400
    enabled         = false
  }
}

resource &quot;aws_lb_listener&quot; &quot;https&quot; {
  load_balancer_arn = aws_lb.main.arn
  port              = &quot;443&quot;
  protocol          = &quot;HTTPS&quot;
  ssl_policy        = &quot;ELBSecurityPolicy-TLS13-1-2-2021-06&quot;
  certificate_arn   = var.certificate_arn

  default_action {
    type             = &quot;forward&quot;
    target_group_arn = aws_lb_target_group.myapp.arn
  }
}

resource &quot;aws_lb_listener&quot; &quot;http_redirect&quot; {
  load_balancer_arn = aws_lb.main.arn
  port              = &quot;80&quot;
  protocol          = &quot;HTTP&quot;

  default_action {
    type = &quot;redirect&quot;
    redirect {
      port        = &quot;443&quot;
      protocol    = &quot;HTTPS&quot;
      status_code = &quot;HTTP_301&quot;
    }
  }
}

resource &quot;aws_security_group&quot; &quot;alb&quot; {
  name        = &quot;myapp-alb&quot;
  description = &quot;Allow HTTPS inbound&quot;
  vpc_id      = var.vpc_id

  ingress {
    description = &quot;HTTPS&quot;
    from_port   = 443
    to_port     = 443
    protocol    = &quot;tcp&quot;
    cidr_blocks = [&quot;0.0.0.0/0&quot;]
  }

  ingress {
    description = &quot;HTTP (redirect)&quot;
    from_port   = 80
    to_port     = 80
    protocol    = &quot;tcp&quot;
    cidr_blocks = [&quot;0.0.0.0/0&quot;]
  }

  egress {
    from_port   = 0
    to_port     = 0
    protocol    = &quot;-1&quot;
    cidr_blocks = [&quot;0.0.0.0/0&quot;]
  }
}
```
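
The target group health check settings determine how quickly the ALB reacts to a task coming up or failing. A rough Python sketch of the timing, ignoring request latency and timeouts (so real figures will be slightly higher):

```python
def health_check_timings(interval, healthy_threshold, unhealthy_threshold):
    """Approximate seconds for a target to be marked healthy or unhealthy."""
    return {
        "to_healthy": interval * healthy_threshold,
        "to_unhealthy": interval * unhealthy_threshold,
    }


timings = health_check_timings(interval=30, healthy_threshold=2, unhealthy_threshold=3)
print(timings)  # {'to_healthy': 60, 'to_unhealthy': 90}
```

A new task needs about 60 seconds of passing checks after the application starts answering, which is worth comparing against the 120 second `health_check_grace_period_seconds` on the service when the JVM is slow to boot.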

## Step 10: Set Up Auto-Scaling

```hcl
# autoscaling.tf
resource &quot;aws_appautoscaling_target&quot; &quot;myapp&quot; {
  max_capacity       = 10
  min_capacity       = 2
  resource_id        = &quot;service/${aws_ecs_cluster.main.name}/${aws_ecs_service.myapp.name}&quot;
  scalable_dimension = &quot;ecs:service:DesiredCount&quot;
  service_namespace  = &quot;ecs&quot;
}

# Scale based on CPU
resource &quot;aws_appautoscaling_policy&quot; &quot;cpu&quot; {
  name               = &quot;myapp-cpu-scaling&quot;
  policy_type        = &quot;TargetTrackingScaling&quot;
  resource_id        = aws_appautoscaling_target.myapp.resource_id
  scalable_dimension = aws_appautoscaling_target.myapp.scalable_dimension
  service_namespace  = aws_appautoscaling_target.myapp.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = &quot;ECSServiceAverageCPUUtilization&quot;
    }
    target_value       = 70.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

# Scale based on memory
resource &quot;aws_appautoscaling_policy&quot; &quot;memory&quot; {
  name               = &quot;myapp-memory-scaling&quot;
  policy_type        = &quot;TargetTrackingScaling&quot;
  resource_id        = aws_appautoscaling_target.myapp.resource_id
  scalable_dimension = aws_appautoscaling_target.myapp.scalable_dimension
  service_namespace  = aws_appautoscaling_target.myapp.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = &quot;ECSServiceAverageMemoryUtilization&quot;
    }
    target_value       = 80.0
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}

# Scale based on ALB request count
resource &quot;aws_appautoscaling_policy&quot; &quot;requests&quot; {
  name               = &quot;myapp-requests-scaling&quot;
  policy_type        = &quot;TargetTrackingScaling&quot;
  resource_id        = aws_appautoscaling_target.myapp.resource_id
  scalable_dimension = aws_appautoscaling_target.myapp.scalable_dimension
  service_namespace  = aws_appautoscaling_target.myapp.service_namespace

  target_tracking_scaling_policy_configuration {
    predefined_metric_specification {
      predefined_metric_type = &quot;ALBRequestCountPerTarget&quot;
      resource_label         = &quot;${aws_lb.main.arn_suffix}/${aws_lb_target_group.myapp.arn_suffix}&quot;
    }
    target_value       = 1000
    scale_in_cooldown  = 300
    scale_out_cooldown = 60
  }
}
```

## Step 11: Deploy and Verify

```bash
# Apply Terraform
terraform init
terraform plan
terraform apply

# Check ECS service status
aws ecs describe-services \
  --cluster myapp-cluster \
  --services myapp \
  --query &apos;services[0].{Status:status,Running:runningCount,Desired:desiredCount}&apos;

# Check task status
aws ecs list-tasks --cluster myapp-cluster --service-name myapp
aws ecs describe-tasks \
  --cluster myapp-cluster \
  --tasks $(aws ecs list-tasks --cluster myapp-cluster --service-name myapp --query &apos;taskArns[0]&apos; --output text)

# View logs
aws logs tail /ecs/myapp --follow

# Test the endpoint
curl https://myapp.example.com/actuator/health
```

## Step 12: Achieve Production Parity

### Compare EC2 vs Fargate

Create a checklist:

| Aspect | EC2 | Fargate | Status |
|--------|-----|---------|--------|
| Java version | OpenJDK 17 | eclipse-temurin:17 | ✅ |
| Heap size | 2GB | 75% of 1024MB = 768MB | ⚠️ Increase task memory |
| Spring profile | prod | prod | ✅ |
| DB connectivity | Via VPC | Via VPC | ✅ |
| Secrets | Env vars | Secrets Manager | ✅ (improved) |
| Logging | /var/log | CloudWatch | ✅ |
| Health check | None | /actuator/health | ✅ (improved) |
| Auto-scaling | ASG | ECS auto-scaling | ✅ |
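
The heap row is worth a quick calculation. A minimal sketch in Python, assuming the JVM in the container is sized with `-XX:MaxRAMPercentage=75` as in the table; the function names are hypothetical and only there for illustration:

```python
import math

# Heap the JVM claims when sized with -XX:MaxRAMPercentage.
def heap_mib(task_memory_mib, max_ram_percentage=75.0):
    return task_memory_mib * max_ram_percentage / 100.0

# Smallest task memory (MiB) that still yields the target heap.
def task_memory_for_heap(target_heap_mib, max_ram_percentage=75.0):
    return math.ceil(target_heap_mib * 100.0 / max_ram_percentage)

print(heap_mib(1024))              # 768.0, the Fargate column above
print(task_memory_for_heap(2048))  # 2731, what the old 2GB heap needs
```

Fargate only accepts certain CPU/memory pairs, so in practice round 2731 up to the next supported size (3072 MiB works with 0.5 or 1 vCPU), or accept a smaller heap if the app does not actually need 2GB.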

### Load Testing

Run the same load test against both:

```bash
# Install hey (HTTP load generator)
brew install hey

# Test EC2
hey -n 10000 -c 100 https://ec2-app.example.com/api/test

# Test Fargate
hey -n 10000 -c 100 https://fargate-app.example.com/api/test
```

Compare:
- Response times (P50, P95, P99)
- Error rates
- Throughput (requests/second)
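
`hey` already prints a latency distribution, but if you export raw latencies from another tool the comparison takes a few lines to compute. A rough sketch using the nearest-rank method; the function names are hypothetical and the sample numbers are made up for illustration, not measurements from either environment:

```python
import math

# Nearest-rank percentile: the smallest sample with at least p% of
# all samples at or below it.
def percentile(samples, p):
    if not samples:
        raise ValueError
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100.0 * len(ordered)))
    return ordered[rank - 1]

# Returns (p50, p95, p99, error_rate, requests_per_second).
def summarise(latencies_ms, errors, duration_s):
    total = len(latencies_ms) + errors
    return (
        percentile(latencies_ms, 50),
        percentile(latencies_ms, 95),
        percentile(latencies_ms, 99),
        errors / total,
        total / duration_s,
    )

# Hypothetical latencies in milliseconds, not real results.
sample = [11, 12, 12, 13, 13, 14, 15, 16, 40, 90]
print(summarise(sample, errors=0, duration_s=2.0))
```

Run it once per environment and compare the tuples side by side. The tail (P95/P99) is usually where EC2 and Fargate differ, so do not stop at the median.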

### Monitoring Parity

Ensure you have equivalent monitoring:

```hcl
# cloudwatch-alarms.tf
resource &quot;aws_cloudwatch_metric_alarm&quot; &quot;cpu_high&quot; {
  alarm_name          = &quot;myapp-cpu-high&quot;
  comparison_operator = &quot;GreaterThanThreshold&quot;
  evaluation_periods  = 2
  metric_name         = &quot;CPUUtilization&quot;
  namespace           = &quot;AWS/ECS&quot;
  period              = 300
  statistic           = &quot;Average&quot;
  threshold           = 80
  alarm_description   = &quot;ECS CPU utilisation is high&quot;

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.myapp.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

resource &quot;aws_cloudwatch_metric_alarm&quot; &quot;memory_high&quot; {
  alarm_name          = &quot;myapp-memory-high&quot;
  comparison_operator = &quot;GreaterThanThreshold&quot;
  evaluation_periods  = 2
  metric_name         = &quot;MemoryUtilization&quot;
  namespace           = &quot;AWS/ECS&quot;
  period              = 300
  statistic           = &quot;Average&quot;
  threshold           = 85
  alarm_description   = &quot;ECS memory utilisation is high&quot;

  dimensions = {
    ClusterName = aws_ecs_cluster.main.name
    ServiceName = aws_ecs_service.myapp.name
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}

resource &quot;aws_cloudwatch_metric_alarm&quot; &quot;healthy_hosts&quot; {
  alarm_name          = &quot;myapp-unhealthy-hosts&quot;
  comparison_operator = &quot;LessThanThreshold&quot;
  evaluation_periods  = 2
  metric_name         = &quot;HealthyHostCount&quot;
  namespace           = &quot;AWS/ApplicationELB&quot;
  period              = 60
  statistic           = &quot;Average&quot;
  threshold           = 1
  alarm_description   = &quot;No healthy hosts in target group&quot;

  dimensions = {
    TargetGroup  = aws_lb_target_group.myapp.arn_suffix
    LoadBalancer = aws_lb.main.arn_suffix
  }

  alarm_actions = [aws_sns_topic.alerts.arn]
}
```

## Common Issues and Fixes

### Container Keeps Restarting

Check logs first:
```bash
aws logs tail /ecs/myapp --since 1h
```

Common causes:
- Health check failing (increase `startPeriod`)
- OOM (increase task memory or reduce `MaxRAMPercentage`)
- Missing secrets (check execution role permissions)

### Health Check Failing

```bash
# Exec into the container (requires ECS Exec enabled)
aws ecs execute-command \
  --cluster myapp-cluster \
  --task &lt;task-id&gt; \
  --container myapp \
  --interactive \
  --command &quot;/bin/sh&quot;

# Test health endpoint from inside
wget -qO- http://localhost:8080/actuator/health
```

### Slow Startup

Java applications can be slow to start. Increase the health check `startPeriod`:

```hcl
healthCheck = {
  startPeriod = 120  # 2 minutes before health checks start
}
```

Also ensure the ALB health check is aligned:
```hcl
health_check {
  interval = 30
  timeout  = 10
  # Give the app time to start before marking unhealthy
}
```

And set a `health_check_grace_period_seconds` on the service:
```hcl
health_check_grace_period_seconds = 120
```

### Database Connectivity

Fargate tasks need:
- Security group allowing outbound to the database
- Database security group allowing inbound from ECS tasks security group
- Correct VPC/subnet configuration (private subnets with NAT gateway for outbound)

## Cutover Strategy

### Blue/Green with Route 53

1. Deploy Fargate service alongside EC2
2. Use weighted routing in Route 53:
   ```hcl
   resource &quot;aws_route53_record&quot; &quot;app&quot; {
     zone_id = var.zone_id
     name    = &quot;api.example.com&quot;
     type    = &quot;A&quot;

     weighted_routing_policy {
       weight = 90
     }
     set_identifier = &quot;ec2&quot;

     alias {
       name                   = aws_lb.ec2.dns_name
       zone_id                = aws_lb.ec2.zone_id
       evaluate_target_health = true
     }
   }

   resource &quot;aws_route53_record&quot; &quot;app_fargate&quot; {
     zone_id = var.zone_id
     name    = &quot;api.example.com&quot;
     type    = &quot;A&quot;

     weighted_routing_policy {
       weight = 10
     }
     set_identifier = &quot;fargate&quot;

     alias {
       name                   = aws_lb.fargate.dns_name
       zone_id                = aws_lb.fargate.zone_id
       evaluate_target_health = true
     }
   }
   ```
3. Gradually shift weight: 90/10 → 50/50 → 10/90 → 0/100
4. Monitor errors and latency at each step
5. Decommission EC2 once Fargate is 100%
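
It helps to know what a weight actually means before you start shifting. Route 53 gives each record a share equal to its weight divided by the sum of all weights in the set, and if every weight is 0 it treats the records equally. A small sketch of that arithmetic (the function name is hypothetical):

```python
# Route 53 sends each record a share of weight / sum(all weights).
# If every weight is 0, it falls back to equal shares.
def traffic_share(ec2_weight, fargate_weight):
    total = ec2_weight + fargate_weight
    if total == 0:
        return (0.5, 0.5)
    return (ec2_weight / total, fargate_weight / total)

# The rollout steps from the list above: (ec2, fargate) weights.
for ec2, fargate in [(90, 10), (50, 50), (10, 90), (0, 100)]:
    print(ec2, fargate, traffic_share(ec2, fargate))
```

Resolver caching means real traffic only approximates these shares, so give each step time to settle before reading the dashboards.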

## Summary

Migrating from EC2 to Fargate:

1. **Document the EC2 setup** – JVM flags, environment variables, config files
2. **Test locally** – run the JAR with the same configuration
3. **Build a production Docker image** – non-root user, container-aware JVM settings
4. **Create the task definition** – proper memory/CPU, secrets from Secrets Manager
5. **Set up networking** – ALB, security groups, health checks
6. **Deploy and verify** – logs, health checks, load testing
7. **Achieve parity** – compare performance, monitoring, alerting
8. **Cutover gradually** – weighted routing, monitor, shift traffic

The result: no more EC2 instances to patch, automatic scaling, pay-per-second billing, and a cleaner deployment model.

---

*Migrating Java apps to Fargate or have questions about container sizing? Find me on [LinkedIn](https://linkedin.com/in/moabukar).*</content:encoded><category>java</category><category>ecs</category><category>fargate</category><category>docker</category><category>aws</category><category>containers</category><category>migration</category><category>terraform</category><category>devops</category><author>Mo Abukar</author></item><item><title>Spot Instance Patterns: Graceful Handling and Cost Savings</title><link>https://moabukar.co.uk/blog/spot-instances-done-right/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/spot-instances-done-right/</guid><description>Master AWS Spot Instances in production. Handle interruptions gracefully, use mixed instance groups, and save 60-90% on compute costs.</description><pubDate>Fri, 12 Dec 2025 00:00:00 GMT</pubDate><content:encoded>Spot Instance Patterns: Graceful Handling and Cost Savings
==========================================================

Spot Instances offer 60-90% savings but can be interrupted with
a 2-minute warning. This guide covers production patterns for
handling interruptions gracefully.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/aws.svg&quot; alt=&quot;AWS logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Spot = unused EC2 capacity at steep discounts
- 2-minute interruption warning
- Diversify instance types/AZs
- Handle SIGTERM gracefully
- Mix spot + on-demand for reliability


Spot Basics
===========

```
PRICING                     RELIABILITY
=======                     ===========
On-Demand: $0.10/hr         99.99%
Spot: $0.02/hr (80% off)    ~95-98% (varies)
```

Interruption causes:
- Spot price rises above your max price (if you set one)
- AWS needs the capacity back for on-demand
- The spot pool for that instance type and AZ is depleted
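
To see what mixing buys you, here is a back-of-the-envelope sketch using the illustrative prices from the table above. The node counts (2 on-demand, 5 spot) and the function names are assumptions for the example, and real spot prices move around by instance type and AZ:

```python
# Hourly cost of a fleet split between on-demand and spot nodes.
def hourly_cost(on_demand_nodes, spot_nodes, on_demand_price=0.10, spot_price=0.02):
    return on_demand_nodes * on_demand_price + spot_nodes * spot_price

# Fraction saved versus running every node on-demand.
def savings(on_demand_nodes, spot_nodes, on_demand_price=0.10, spot_price=0.02):
    mixed = hourly_cost(on_demand_nodes, spot_nodes, on_demand_price, spot_price)
    baseline = hourly_cost(on_demand_nodes + spot_nodes, 0, on_demand_price, spot_price)
    return 1 - mixed / baseline

print(round(hourly_cost(2, 5), 2))  # 0.3 dollars per hour
print(round(savings(2, 5), 3))      # 0.571 with an on-demand baseline
print(round(savings(0, 7), 3))      # 0.8 on pure spot
```

Keeping an on-demand baseline costs some of the discount, which is the price of not losing the whole fleet to one reclaim.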


Kubernetes Integration
======================

Node Termination Handler
------------------------

```bash
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-node-termination-handler eks/aws-node-termination-handler \
  --namespace kube-system \
  --set enableSpotInterruptionDraining=true \
  --set enableScheduledEventDraining=true \
  --set enableRebalanceMonitoring=true \
  --set enableRebalanceDraining=true
```


Karpenter Spot Configuration
----------------------------

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-diversified
spec:
  template:
    spec:
      requirements:
        # Wide variety of instance types for diversification
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [&quot;c&quot;, &quot;m&quot;, &quot;r&quot;]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: [&quot;large&quot;, &quot;xlarge&quot;, &quot;2xlarge&quot;]
        - key: kubernetes.io/arch
          operator: In
          values: [&quot;amd64&quot;, &quot;arm64&quot;]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [&quot;spot&quot;]
        - key: topology.kubernetes.io/zone
          operator: In
          values: [&quot;eu-west-2a&quot;, &quot;eu-west-2b&quot;, &quot;eu-west-2c&quot;]
      
      nodeClassRef:
        name: default
  
  disruption:
    # v1beta1 only accepts consolidateAfter with WhenEmpty, so it is omitted here
    consolidationPolicy: WhenUnderutilized
```


Mixed Instance Groups (EKS)
===========================

```hcl
# Terraform
resource &quot;aws_eks_node_group&quot; &quot;mixed&quot; {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = &quot;mixed-spot-ondemand&quot;
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  capacity_type = &quot;SPOT&quot;

  instance_types = [
    &quot;m5.large&quot;, &quot;m5a.large&quot;, &quot;m5n.large&quot;,
    &quot;m6i.large&quot;, &quot;m6a.large&quot;,
    &quot;c5.large&quot;, &quot;c5a.large&quot;, &quot;c6i.large&quot;,
    &quot;r5.large&quot;, &quot;r5a.large&quot;, &quot;r6i.large&quot;
  ]

  scaling_config {
    desired_size = 5
    max_size     = 20
    min_size     = 2
  }

  labels = {
    &quot;node-type&quot; = &quot;spot&quot;
  }

  taint {
    key    = &quot;spot&quot;
    value  = &quot;true&quot;
    effect = &quot;NO_SCHEDULE&quot;
  }
}

# On-demand baseline
resource &quot;aws_eks_node_group&quot; &quot;ondemand&quot; {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = &quot;ondemand-baseline&quot;
  node_role_arn   = aws_iam_role.node.arn
  subnet_ids      = aws_subnet.private[*].id

  capacity_type = &quot;ON_DEMAND&quot;
  instance_types = [&quot;m6i.large&quot;]

  scaling_config {
    desired_size = 2
    max_size     = 5
    min_size     = 2
  }

  labels = {
    &quot;node-type&quot; = &quot;ondemand&quot;
  }
}
```


Graceful Shutdown
=================

Application Side
----------------

```go
package main

import (
    &quot;context&quot;
    &quot;log&quot;
    &quot;net/http&quot;
    &quot;os&quot;
    &quot;os/signal&quot;
    &quot;syscall&quot;
    &quot;time&quot;
)

func main() {
    srv := &amp;http.Server{Addr: &quot;:8080&quot;}

    go func() {
        if err := srv.ListenAndServe(); err != http.ErrServerClosed {
            log.Fatal(err)
        }
    }()

    // Wait for SIGTERM (from spot interruption)
    quit := make(chan os.Signal, 1)
    signal.Notify(quit, syscall.SIGTERM, syscall.SIGINT)
    &lt;-quit

    log.Println(&quot;Shutting down gracefully...&quot;)

    // 25 seconds to drain: fits the pod&apos;s 30s termination grace period, well inside the 2-min spot notice
    ctx, cancel := context.WithTimeout(context.Background(), 25*time.Second)
    defer cancel()

    // Stop accepting new requests
    if err := srv.Shutdown(ctx); err != nil {
        log.Printf(&quot;Shutdown error: %v&quot;, err)
    }

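    // Cleanup: flush buffers, close connections
    cleanup()

    log.Println(&quot;Shutdown complete&quot;)
}

// cleanup is a placeholder for app-specific teardown
// (flush buffers, close database connections).
func cleanup() {}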
```


Pod Disruption Budget
---------------------

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-server-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: api-server
```


Preemption Settings
-------------------

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      
      containers:
        - name: api
          lifecycle:
            preStop:
              exec:
                command:
                  - /bin/sh
                  - -c
                  - &quot;sleep 5 &amp;&amp; /app/drain.sh&quot;
```
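
One thing to check: the preStop hook and the app share the same grace period. The kubelet runs preStop first, then sends SIGTERM, and kills the container once `terminationGracePeriodSeconds` is up. A quick sketch of that budget, using the 25 second shutdown timeout from the Go example; the function name is hypothetical and the 3 second `drain.sh` runtime is an assumption:

```python
# Seconds of slack left inside terminationGracePeriodSeconds once the
# preStop hook and the app drain timeout have both been spent.
# Zero means no margin; negative means the app can be killed mid-drain.
def shutdown_slack(grace_period_s, pre_stop_s, app_timeout_s):
    return grace_period_s - (pre_stop_s + app_timeout_s)

print(shutdown_slack(30, 5, 25))      # 0: sleep 5 alone uses up the margin
print(shutdown_slack(30, 5 + 3, 25))  # -3: a 3s drain.sh (assumed) overruns
print(shutdown_slack(45, 5 + 3, 25))  # 12: a longer grace period restores it
```

If the slack comes out at zero or below, either raise `terminationGracePeriodSeconds` or shorten the app timeout.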


Workload Patterns
=================

Pattern 1: Spot-Tolerant Stateless
----------------------------------

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  replicas: 6
  template:
    spec:
      # Tolerate spot taint
      tolerations:
        - key: spot
          operator: Equal
          value: &quot;true&quot;
          effect: NoSchedule
      
      # Spread across zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfied: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      
      # Prefer spot, fallback to on-demand
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: node-type
                    operator: In
                    values: [&quot;spot&quot;]
```


Pattern 2: Critical on On-Demand
--------------------------------

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: payment-service
spec:
  replicas: 3
  template:
    spec:
      # Force on-demand
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: [&quot;ondemand&quot;]
      
      # Do NOT tolerate spot taint
      tolerations: []
```


Pattern 3: Hybrid
-----------------

```yaml
# 2 replicas on on-demand (baseline)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-ondemand
spec:
  replicas: 2
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: [&quot;ondemand&quot;]

---
# Remaining on spot
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server-spot
spec:
  replicas: 4
  template:
    spec:
      tolerations:
        - key: spot
          operator: Equal
          value: &quot;true&quot;
          effect: NoSchedule
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: node-type
                    operator: In
                    values: [&quot;spot&quot;]
```


Monitoring Spot
===============

```yaml
# Prometheus alerts
groups:
  - name: spot-instances
    rules:
      - alert: SpotInterruptionWarning
        expr: increase(aws_node_termination_handler_actions_total{action=&quot;drain&quot;}[5m]) &gt; 0
        labels:
          severity: warning
        annotations:
          summary: Spot instance interruption detected

      - alert: HighSpotInterruptionRate
        expr: increase(aws_node_termination_handler_actions_total{action=&quot;drain&quot;}[1h]) &gt; 2
        labels:
          severity: warning
        annotations:
          summary: High rate of spot interruptions
```


References
==========

- Spot Best Practices: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/spot-best-practices.html
- Node Termination Handler: https://github.com/aws/aws-node-termination-handler
- Karpenter Spot: https://karpenter.sh/docs/concepts/scheduling/#capacity-type


========================================
AWS Spot + Kubernetes + Cost Savings
========================================
60-90% savings. Graceful interruption handling.
========================================</content:encoded><category>aws</category><category>spot-instances</category><category>kubernetes</category><category>cost-optimization</category><category>eks</category><category>reliability</category><author>Mo Abukar</author></item><item><title>The Real Difference Between Senior, Staff, and Principal Engineer</title><link>https://moabukar.co.uk/blog/senior-staff-principal-difference/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/senior-staff-principal-difference/</guid><description>Everyone wants to know the difference between Senior, Staff, and Principal. After holding all three titles, I can tell you the real differences aren&apos;t what most people think. It&apos;s not about years - it&apos;s about scope.</description><pubDate>Wed, 10 Dec 2025 00:00:00 GMT</pubDate><content:encoded>The Real Difference Between Senior, Staff, and Principal Engineer
=================================================================

Everyone wants to know the difference between Senior, Staff, and Principal Engineer. The titles get thrown around constantly, and every company defines them differently. But after 7+ years in this industry - and having held all three titles - I can tell you the real differences aren&apos;t what most people think.

**It&apos;s not about years of experience. It&apos;s not about technical depth. It&apos;s about scope of impact.**

Let me break down what actually changes as you move up the IC ladder.

---

![The Real Difference Between Senior, Staff, and Principal Engineer](/images/senior-staff-principal-difference.jpg)


The Quick Summary
=================

```
LEVEL       OWNS                SCOPE               TIME HORIZON
=====       ====                =====               ============
Senior      Your work           Team                Weeks/Sprints
Staff       The system          Multiple teams      Quarters
Principal   The direction       Organisation        Years
```

---

Senior Engineer: Own Your Domain
================================

A Senior Engineer owns their work end-to-end. You&apos;re given a problem, you solve it, you ship it. You don&apos;t need hand-holding. You can estimate work, break it down, and deliver without someone checking on you every day.

**What Senior looks like:**

- You own features or components
- You make technical decisions within your team&apos;s scope
- You mentor juniors and mid-levels
- You&apos;re trusted to work independently
- You push back on bad requirements
- You write code that others can maintain

**The key shift from Mid to Senior:** You stop needing someone to tell you *how* to do things. You figure it out. You&apos;re accountable for outcomes, not just tasks.

Most engineers plateau here - and that&apos;s completely fine. Being a strong Senior is a great career. The money is good, the work is interesting, and you&apos;re not drowning in meetings. Many of the best engineers I know are Seniors who&apos;ve been doing it for 15+ years.

But if you want to go further, the game changes.

---

Staff Engineer: Own the System
==============================

Staff Engineer is where things get blurry. Some companies use it as &quot;Senior+&quot;, others use it as a legitimate leadership role. Here&apos;s what it *should* mean:

**A Staff Engineer owns systems, not just components.**

You&apos;re no longer thinking about your feature. You&apos;re thinking about how your feature interacts with everything else. You&apos;re thinking about the architecture of the whole service, or multiple services. You&apos;re the person who spots problems that span team boundaries.

**What Staff looks like:**

- You influence technical direction across multiple teams
- You design systems that other teams will build on
- You&apos;re the go-to person for cross-cutting concerns
- You represent engineering in discussions with product and leadership
- You mentor Seniors
- You write fewer PRs but review more
- You write design docs that shape how work gets done

**The key shift from Senior to Staff:** You stop optimising for your own output. You start optimising for team output. If you can make five engineers 20% more effective, that&apos;s worth more than any code you could write yourself.

This is the hardest transition for most engineers. You&apos;ve spent your whole career being rewarded for individual contribution. Now you&apos;re being asked to step back and multiply others.

---

Principal Engineer: Own the Direction
=====================================

Principal is where you start thinking in years, not quarters. You&apos;re not just solving today&apos;s problems - you&apos;re positioning the organisation for problems they don&apos;t even know they have yet.

**A Principal Engineer owns technical direction across the organisation.**

You&apos;re setting standards. You&apos;re making build-vs-buy decisions. You&apos;re deciding which technologies the company bets on. When something goes catastrophically wrong, you&apos;re in the room. When the company is making a strategic shift, engineering leadership wants your input.

**What Principal looks like:**

- You define technical strategy across the company
- You influence hiring standards and engineering culture
- You&apos;re involved in decisions that affect the whole engineering org
- You represent engineering to the rest of the business
- You spend a lot of time in documents, meetings, and 1:1s
- You might go weeks without writing production code
- You&apos;re expected to have opinions on everything technical

**The key shift from Staff to Principal:** You stop thinking about systems and start thinking about the organisation. How do we structure teams? What should we build in-house vs buy? How do we scale engineering from 50 to 200 people without everything falling apart?

This is where the IC track starts to feel a lot like management - but without direct reports.

---

The Uncomfortable Truth
=======================

Here&apos;s what nobody tells you:

**The higher you go, the less you code.**

```
LEVEL       CODING TIME     REST OF TIME
=====       ===========     ============
Senior      70-80%          Reviews, meetings, mentoring
Staff       40-50%          Design docs, reviews, alignment
Principal   10-20%          Strategy, influence, decisions
```

At Senior, you&apos;re still an individual contributor in the traditional sense. You write code, you ship features, you debug production issues.

At Staff, maybe 50% of your time is code. The rest is design docs, reviews, meetings, and unblocking others.

At Principal, you might be lucky to write code 20% of the time. Most of your impact comes from decisions, documents, and influence.

If you love coding and hate meetings, Staff and Principal might not be for you. That&apos;s not a failure - it&apos;s self-awareness.

---

How to Actually Get Promoted
============================

I&apos;ve seen people get stuck at Senior for years because they keep doing Senior work really well. That&apos;s not how promotions work at this level.

**To get to Staff:**

- Start solving problems outside your immediate team
- Write design docs that influence multiple teams
- Mentor Seniors, not just juniors
- Be the person who gets pulled into cross-team discussions
- Stop waiting to be assigned work - find the important problems yourself

**To get to Principal:**

- Have a track record of successful Staff-level impact
- Be known across the engineering org, not just your corner
- Have strong opinions on how engineering should work at scale
- Be able to communicate technical decisions to non-technical people
- Build relationships with engineering leadership

The common thread: **your scope of impact has to expand before your title does.** You don&apos;t get promoted and then start doing the bigger work. You do the bigger work and then get recognised for it.

---

Titles Are Fake (But Also Real)
===============================

Every company defines these differently. A Staff Engineer at a 50-person startup is doing different work than a Staff Engineer at Google. A Principal at one company might be equivalent to a Senior at another.

```
COMPANY SIZE    &quot;STAFF&quot; ACTUALLY MEANS
============    ======================
&lt; 50            Senior who&apos;s been there longest
50-200          Cross-team technical leader
200-1000        Architecture/platform owner
1000+           Org-wide technical influence
```

Don&apos;t get too hung up on the specific title. Focus on the scope of your impact. Are you solving bigger problems than you were a year ago? Are you influencing more people? Are you trusted with more important decisions?

If yes, you&apos;re growing - regardless of what your title says.

---

Final Thought
=============

The IC ladder isn&apos;t a ladder everyone should climb. Each rung involves trade-offs. More scope means more ambiguity. More influence means more politics. More impact means less hands-on work.

Know what you actually want. Some of the happiest engineers I know are Seniors who&apos;ve been at it for 20 years, writing great code and mentoring others. Some of the most stressed are Principals who miss building things.

The best career is the one that fits you - not the one with the fanciest title.

```
========================================
Senior: own your work
Staff: own the system
Principal: own the direction
========================================
Choose your level wisely.
========================================
```</content:encoded><category>career</category><category>engineering-culture</category><category>leadership</category><category>principal-engineer</category><category>advice</category><author>Mo Abukar</author></item><item><title>Karpenter Deep Dive: Node Provisioning That Actually Works</title><link>https://moabukar.co.uk/blog/karpenter-deep-dive/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/karpenter-deep-dive/</guid><description>Master Karpenter for Kubernetes node autoscaling. Replace Cluster Autoscaler with faster, smarter provisioning. Includes cost optimization patterns.</description><pubDate>Mon, 08 Dec 2025 00:00:00 GMT</pubDate><content:encoded>Karpenter Deep Dive: Node Provisioning That Actually Works
==========================================================

Cluster Autoscaler is slow. It works through node groups and takes
minutes to scale. Karpenter launches nodes in about a minute, picks
the right instance types, and consolidates aggressively.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/kubernetes.svg&quot; alt=&quot;Kubernetes logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Karpenter = fast, flexible node provisioning
- Provisions in ~60 seconds (vs 3-5 min for CA)
- Automatic instance type selection
- Built-in consolidation and spot handling
- Works with EKS, coming to other clouds


Install Karpenter
=================

```bash
# Set variables
export KARPENTER_VERSION=v0.33.0
export CLUSTER_NAME=production
export AWS_REGION=eu-west-2
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)

# Create IAM resources
aws cloudformation deploy \
  --stack-name Karpenter-${CLUSTER_NAME} \
  --template-file karpenter-cloudformation.yaml \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides ClusterName=${CLUSTER_NAME}

# Install Karpenter
helm upgrade --install karpenter oci://public.ecr.aws/karpenter/karpenter \
  --version ${KARPENTER_VERSION} \
  --namespace karpenter --create-namespace \
  --set settings.clusterName=${CLUSTER_NAME} \
  --set settings.clusterEndpoint=$(aws eks describe-cluster --name ${CLUSTER_NAME} --query &quot;cluster.endpoint&quot; --output text) \
  --set serviceAccount.annotations.&quot;eks\.amazonaws\.com/role-arn&quot;=arn:aws:iam::${AWS_ACCOUNT_ID}:role/KarpenterControllerRole-${CLUSTER_NAME}
```


NodePool Configuration
======================

```yaml
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Instance categories
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [&quot;c&quot;, &quot;m&quot;, &quot;r&quot;]
        
        # Instance sizes
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: [&quot;medium&quot;, &quot;large&quot;, &quot;xlarge&quot;, &quot;2xlarge&quot;]
        
        # Architectures
        - key: kubernetes.io/arch
          operator: In
          values: [&quot;amd64&quot;, &quot;arm64&quot;]
        
        # Capacity types (spot + on-demand)
        - key: karpenter.sh/capacity-type
          operator: In
          values: [&quot;spot&quot;, &quot;on-demand&quot;]
        
        # Availability zones
        - key: topology.kubernetes.io/zone
          operator: In
          values: [&quot;eu-west-2a&quot;, &quot;eu-west-2b&quot;, &quot;eu-west-2c&quot;]
      
      nodeClassRef:
        name: default

  # Limits
  limits:
    cpu: 1000
    memory: 2000Gi
  
  # Disruption settings
  disruption:
    # consolidateAfter is only accepted with WhenEmpty in v1beta1, so it is omitted here
    consolidationPolicy: WhenUnderutilized
    budgets:
      - nodes: &quot;10%&quot;
```


EC2NodeClass
============

```yaml
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # AMI selection
  amiFamily: AL2
  
  # Or specific AMI
  # amiSelectorTerms:
  #   - id: ami-0123456789abcdef0
  
  # Subnets
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  
  # Security groups
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: ${CLUSTER_NAME}
  
  # Instance profile
  instanceProfile: KarpenterNodeInstanceProfile-${CLUSTER_NAME}
  
  # Block device mappings
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 100Gi
        volumeType: gp3
        iops: 3000
        throughput: 125
        encrypted: true
  
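  # User data (optional)
  # For the AL2 family, Karpenter merges this with its own generated
  # bootstrap.sh call, so keep only pre-bootstrap setup here
  userData: |
    #!/bin/bash
    echo &quot;Running custom user data before node bootstrap&quot;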
  
  # Tags for instances
  tags:
    Environment: production
    ManagedBy: karpenter
  
  # Metadata options
  metadataOptions:
    httpEndpoint: enabled
    httpProtocolIPv6: disabled
    httpPutResponseHopLimit: 2
    httpTokens: required
```


Workload-Specific NodePools
===========================

```yaml
# GPU workloads
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu
spec:
  template:
    metadata:
      labels:
        node-type: gpu
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: [&quot;g4dn&quot;, &quot;g5&quot;]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [&quot;on-demand&quot;]  # GPUs usually on-demand
      
      taints:
        - key: nvidia.com/gpu
          value: &quot;true&quot;
          effect: NoSchedule
      
      nodeClassRef:
        name: gpu

---
# Spot-only for batch
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: batch
spec:
  template:
    metadata:
      labels:
        node-type: batch
    spec:
      requirements:
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: [&quot;c&quot;, &quot;m&quot;, &quot;r&quot;]
        - key: karpenter.sh/capacity-type
          operator: In
          values: [&quot;spot&quot;]
      
      taints:
        - key: workload-type
          value: batch
          effect: NoSchedule
      
      nodeClassRef:
        name: default

  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 0s  # Aggressive consolidation for batch
```


Pod Scheduling
==============

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-server
spec:
  template:
    spec:
      # Spread across zones
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfied: DoNotSchedule
          labelSelector:
            matchLabels:
              app: api-server
      
      # Prefer arm64 (cheaper)
      affinity:
        nodeAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              preference:
                matchExpressions:
                  - key: kubernetes.io/arch
                    operator: In
                    values: [&quot;arm64&quot;]
      
      # Resource requests drive instance selection
      containers:
        - name: api
          resources:
            requests:
              cpu: &quot;500m&quot;
              memory: &quot;512Mi&quot;
            limits:
              memory: &quot;1Gi&quot;
```
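
The requests are what Karpenter has to fit onto a node, so it helps to total them before guessing which instance sizes will appear. A rough Python sketch, assuming a hypothetical replica count of 10 for the Deployment above (the manifest does not set one) and handling only the `m` and `Ki`/`Mi`/`Gi` suffixes:

```python
MEMORY_UNITS = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as "500m" or "2" into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Parse a memory quantity such as "512Mi" into bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def total_requests(replicas: int, cpu: str, memory: str) -> tuple[int, int]:
    """Total (millicores, bytes) requested by all replicas of one container."""
    return replicas * cpu_millicores(cpu), replicas * memory_bytes(memory)


if __name__ == "__main__":
    # Hypothetical: 10 replicas of the api container above.
    millicores, mem = total_requests(10, "500m", "512Mi")
    print(f"{millicores / 1000} vCPU, {mem / 1024**3} GiB")
```

Ten replicas come to 5 vCPU and 5 GiB before daemonsets and system overhead, which is more than a single `large` or `xlarge` instance in the c, m, or r families offers.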


Consolidation
=============

Karpenter automatically consolidates nodes:

```yaml
disruption:
  # Consolidate underutilized nodes
  consolidationPolicy: WhenUnderutilized
  
  # Or only when empty (v1beta1 accepts consolidateAfter only with WhenEmpty)
  # consolidationPolicy: WhenEmpty
  # consolidateAfter: 0s
  
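  # Budgets limit how many nodes can be disrupted at once;
  # when several are active, the most restrictive one applies
  budgets:
    - nodes: &quot;10%&quot;
    - nodes: &quot;0&quot;
      schedule: &quot;0 9 * * 1-5&quot;  # Window opens 09:00 UTC, Mon-Fri
      duration: 8h              # schedule needs a duration: no disruption during business hours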
```
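
To make the budget arithmetic concrete, here is a minimal Python sketch of how the two budgets above could combine. It assumes, per my reading of the Karpenter docs, that percentage budgets round up against the NodePool node count and that the most restrictive active budget wins. `budget_to_nodes` and `allowed_disruptions` are hypothetical names for illustration, not Karpenter APIs.

```python
def budget_to_nodes(budget: str, total_nodes: int) -> int:
    """Convert a budget value such as "10%" or "3" into a node count."""
    if budget.endswith("%"):
        percent = int(budget[:-1])
        # Assumption: percentages round up, so 10% of 25 nodes allows 3.
        return (total_nodes * percent + 99) // 100
    return int(budget)


def allowed_disruptions(total_nodes: int, active_budgets: list[str], disrupting: int = 0) -> int:
    """Nodes that may still be disrupted now; expects at least one active budget."""
    # The most restrictive active budget applies.
    limit = min(budget_to_nodes(b, total_nodes) for b in active_budgets)
    # Nodes already being disrupted count against the limit.
    return max(0, limit - disrupting)


if __name__ == "__main__":
    # Outside business hours only the 10% budget is active.
    print(allowed_disruptions(25, ["10%"]))
    # During the scheduled window the "0" budget is active as well.
    print(allowed_disruptions(25, ["10%", "0"]))
```

During the scheduled window the `0` budget always wins, which is what blocks disruption during business hours.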


Cost Optimization
=================

```yaml
# Prioritize spot and arm64
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: [&quot;spot&quot;, &quot;on-demand&quot;]
        - key: kubernetes.io/arch
          operator: In
          values: [&quot;arm64&quot;, &quot;amd64&quot;]
  
  # NodePool weight sets priority between pools, not between instance types
  weight: 100  # Higher weight = this pool is tried first

# Separate on-demand pool for critical workloads
---
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: critical
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: [&quot;on-demand&quot;]
      
      taints:
        - key: critical
          value: &quot;true&quot;
          effect: NoSchedule
      
      nodeClassRef:
        name: default
  
  weight: 10  # Lower priority; the taint keeps out pods that do not tolerate it
```


Monitoring
==========

```yaml
# Prometheus rules
groups:
  - name: karpenter
    rules:
      - alert: KarpenterProvisioningFailed
        expr: increase(karpenter_provisioner_scheduling_duration_seconds_count{result=&quot;error&quot;}[5m]) &gt; 0
        labels:
          severity: warning
        annotations:
          summary: Karpenter provisioning failures
      
      - alert: KarpenterNodeNotReady
        expr: karpenter_nodes_created_total - karpenter_nodes_terminated_total - count(kube_node_status_condition{condition=&quot;Ready&quot;,status=&quot;true&quot;}) &gt; 0
        for: 5m
        labels:
          severity: warning
```

```bash
# Check Karpenter decisions
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter -f

# Node provisioning
kubectl get nodeclaims

# Current nodes
kubectl get nodes -L karpenter.sh/capacity-type,node.kubernetes.io/instance-type
```


References
==========

- Karpenter Docs: https://karpenter.sh
- Best Practices: https://karpenter.sh/docs/concepts/best-practices
- Instance Types: https://aws.amazon.com/ec2/instance-types


========================================
Karpenter + EKS + Cost Optimization
========================================
Right-size nodes. Automatically. Fast.
========================================</content:encoded><category>karpenter</category><category>kubernetes</category><category>autoscaling</category><category>aws</category><category>eks</category><category>cost-optimization</category><author>Mo Abukar</author></item><item><title>The Fast Feedback Loop - Local Development with Kind, LocalStack, and Act</title><link>https://moabukar.co.uk/blog/fast-feedback-loop-local-tools/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/fast-feedback-loop-local-tools/</guid><description>Combine Kind, LocalStack, and Act for a complete local development environment. Test Kubernetes, AWS services, and CI pipelines without leaving your laptop.</description><pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate><content:encoded># The Fast Feedback Loop - Local Development with Kind, LocalStack, and Act

The best engineers I know have one thing in common: tight feedback loops. They see results in seconds, not minutes. They iterate dozens of times before pushing code.

The worst development experiences? Push-to-test cycles. Change code, commit, push, wait for CI, watch it fail, repeat. Each iteration costs minutes. Multiply by dozens of iterations per feature, and you&apos;ve lost hours.

This post shows you how to build a complete local development environment using three tools:
- **Kind** - Kubernetes on your laptop
- **LocalStack** - AWS services locally
- **Act** - GitHub Actions without pushing

Together, they give you the entire cloud stack running locally with instant feedback.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/kubernetes.svg&quot; alt=&quot;Kubernetes logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## TL;DR

- Kind runs real Kubernetes clusters in Docker
- LocalStack emulates AWS services locally
- Act runs GitHub Actions on your machine
- Combined: test infrastructure, cloud services, and CI locally
- Feedback in seconds, not minutes

---

## The Stack

```
┌─────────────────────────────────────────────────────────────┐
│                     Your Laptop                              │
│                                                              │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │    Kind     │  │  LocalStack │  │        Act          │  │
│  │ (Kubernetes)│  │   (AWS)     │  │  (GitHub Actions)   │  │
│  │             │  │             │  │                     │  │
│  │ - Pods      │  │ - S3        │  │ - Build workflows   │  │
│  │ - Services  │  │ - Lambda    │  │ - Test workflows    │  │
│  │ - Ingress   │  │ - DynamoDB  │  │ - Deploy workflows  │  │
│  │ - Helm      │  │ - SQS/SNS   │  │                     │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
│                                                              │
│                    ↓ All running in Docker ↓                 │
└─────────────────────────────────────────────────────────────┘
```

---

## Setting Up the Environment

### Prerequisites

```bash
# Install Docker (required for all three tools)
# https://docs.docker.com/get-docker/

# Install kubectl
brew install kubectl

# Install Kind
brew install kind

# Install LocalStack and the awslocal wrapper used in the commands below
pip install localstack awscli-local

# Install Act
brew install act
```

### Docker Compose for Everything

```yaml
# docker-compose.yml
version: &apos;3.8&apos;

services:
  localstack:
    image: localstack/localstack:latest
    ports:
      - &quot;4566:4566&quot;
    environment:
      - DEBUG=1
      - PERSISTENCE=1
    volumes:
      - &quot;./localstack-data:/var/lib/localstack&quot;
      - &quot;/var/run/docker.sock:/var/run/docker.sock&quot;
      - &quot;./init-localstack.sh:/etc/localstack/init/ready.d/init.sh&quot;
    healthcheck:
      test: [&quot;CMD&quot;, &quot;curl&quot;, &quot;-f&quot;, &quot;http://localhost:4566/_localstack/health&quot;]
      interval: 10s
      timeout: 5s
      retries: 3

networks:
  default:
    name: local-dev
```

### Kind Cluster Configuration

```yaml
# kind-config.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: local-dev
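nodes:
  - role: control-plane
    # The ingress-nginx manifest for Kind schedules onto nodes with this label
    labels:
      ingress-ready: &quot;true&quot;
    # ingress-nginx binds ports 80/443 on the node; expose them on the laptop
    extraPortMappings:
      - containerPort: 80
        hostPort: 8080
        protocol: TCP
      - containerPort: 443
        hostPort: 8443
        protocol: TCP
  - role: worker
  - role: worker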
```

### Initialization Script

```bash
#!/bin/bash
# init-localstack.sh

# Create S3 buckets
awslocal s3 mb s3://app-artifacts
awslocal s3 mb s3://app-uploads

# Create DynamoDB tables
awslocal dynamodb create-table \
  --table-name AppData \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

# Create SQS queues
awslocal sqs create-queue --queue-name app-events
awslocal sqs create-queue --queue-name app-events-dlq

# Create secrets
awslocal secretsmanager create-secret \
  --name app/database \
  --secret-string &apos;{&quot;host&quot;:&quot;postgres&quot;,&quot;port&quot;:5432,&quot;username&quot;:&quot;app&quot;,&quot;password&quot;:&quot;secret&quot;}&apos;

echo &quot;LocalStack initialized!&quot;
```

### Makefile for Everything

```makefile
# Makefile
.PHONY: up down kind-up kind-down localstack-up localstack-down test-ci ci-build ci-deploy deploy-local clean

# Start everything
up: localstack-up kind-up
	@echo &quot;✓ Local environment ready&quot;
	@echo &quot;  Kubernetes: kubectl get nodes&quot;
	@echo &quot;  LocalStack: http://localhost:4566&quot;

# Stop everything
down: kind-down localstack-down
	@echo &quot;✓ Local environment stopped&quot;

# Kind cluster
kind-up:
	@kind get clusters | grep -q local-dev || kind create cluster --config kind-config.yaml
	@kubectl wait --for=condition=ready node --all --timeout=60s
	@echo &quot;✓ Kind cluster ready&quot;

kind-down:
	@kind delete cluster --name local-dev 2&gt;/dev/null || true

# LocalStack
localstack-up:
	@docker-compose up -d localstack
	@echo &quot;Waiting for LocalStack...&quot;
	@until curl -s http://localhost:4566/_localstack/health | grep -q &apos;&quot;s3&quot;: &quot;running&quot;&apos;; do sleep 2; done
	@echo &quot;✓ LocalStack ready&quot;

localstack-down:
	@docker-compose down

# Run CI locally
test-ci:
	@act -j test

# Run specific workflow
ci-build:
	@act -j build

ci-deploy:
	@act -j deploy --secret-file .secrets

# Deploy to local Kind cluster
deploy-local:
	@kubectl apply -k k8s/overlays/local

# Clean everything
clean: down
	@rm -rf localstack-data
	@docker volume prune -f
	@echo &quot;✓ Cleaned&quot;
```
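
The `localstack-up` target waits by grepping the health endpoint for S3. If you need to wait on several services, the same check is easier to express against the parsed JSON. A minimal Python sketch, assuming the health response has a top-level `services` map of name to status, as the grep above implies. `services_ready` and `wait_for_localstack` are hypothetical helper names.

```python
import json
import time
import urllib.request

HEALTH_URL = "http://localhost:4566/_localstack/health"


def services_ready(health: dict, required: list[str]) -> bool:
    """True when every required service reports the "running" status."""
    services = health.get("services", {})
    return all(services.get(name) == "running" for name in required)


def wait_for_localstack(required: list[str], timeout: float = 60.0) -> bool:
    """Poll the health endpoint until the services run or the timeout passes."""
    deadline = time.monotonic() + timeout
    while deadline > time.monotonic():
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as response:
                if services_ready(json.load(response), required):
                    return True
        except OSError:
            pass  # LocalStack is not accepting connections yet
        time.sleep(2)
    return False


if __name__ == "__main__":
    print(wait_for_localstack(["s3", "dynamodb", "sqs"]))
```

Unlike the `until` loop in the Makefile, this gives up after a timeout instead of hanging when a service never starts.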

---

## Kind: Local Kubernetes

### Create Cluster

```bash
# Create cluster
kind create cluster --config kind-config.yaml

# Verify
kubectl get nodes
# NAME                      STATUS   ROLES           AGE   VERSION
# local-dev-control-plane   Ready    control-plane   1m    v1.28.0
# local-dev-worker          Ready    &lt;none&gt;          1m    v1.28.0
# local-dev-worker2         Ready    &lt;none&gt;          1m    v1.28.0
```

### Load Local Images

```bash
# Build your app
docker build -t myapp:dev .

# Load into Kind (no registry needed)
kind load docker-image myapp:dev --name local-dev

# Deploy
kubectl run myapp --image=myapp:dev --image-pull-policy=Never
```

### Local Ingress

```bash
# Install Nginx Ingress
kubectl apply -f https://raw.githubusercontent.com/kubernetes/ingress-nginx/main/deploy/static/provider/kind/deploy.yaml

# Wait for it
kubectl wait --namespace ingress-nginx \
  --for=condition=ready pod \
  --selector=app.kubernetes.io/component=controller \
  --timeout=90s
```

```yaml
# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  ingressClassName: nginx
  rules:
    - host: myapp.local
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
```

```bash
# Add to /etc/hosts
echo &quot;127.0.0.1 myapp.local&quot; | sudo tee -a /etc/hosts

# Access via browser
open http://myapp.local:8080
```

---

## Connecting Kind and LocalStack

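Your app in Kubernetes needs to talk to LocalStack. Both run in Docker, so pods can reach LocalStack through the host. The examples below use `host.docker.internal`, which resolves on Docker Desktop (macOS and Windows). On Linux that name is usually not available inside Kind nodes; one option is to attach the LocalStack container to the `kind` Docker network and use the container name instead.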

### Configure AWS SDK in Pods

```yaml
# k8s/overlays/local/deployment-patch.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  template:
    spec:
      containers:
        - name: myapp
          env:
            - name: AWS_ENDPOINT_URL
              value: &quot;http://host.docker.internal:4566&quot;
            - name: AWS_ACCESS_KEY_ID
              value: &quot;test&quot;
            - name: AWS_SECRET_ACCESS_KEY
              value: &quot;test&quot;
            - name: AWS_DEFAULT_REGION
              value: &quot;us-east-1&quot;
```

### Or Use ExternalName Service

```yaml
# k8s/base/localstack-service.yaml
apiVersion: v1
kind: Service
metadata:
  name: localstack
spec:
  type: ExternalName
  externalName: host.docker.internal
  ports:
    - port: 4566
```

Now pods can reach LocalStack at `http://localstack:4566`.
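
On the application side, the only change is where the AWS client points. A minimal Python sketch, assuming the app uses boto3 (the post does not say which SDK the app uses) and reads the variables from the Deployment patch above. `aws_client_kwargs` is a hypothetical helper.

```python
import os


def aws_client_kwargs(env: dict) -> dict:
    """Build keyword arguments for boto3.client() from environment variables."""
    kwargs = {"region_name": env.get("AWS_DEFAULT_REGION", "us-east-1")}
    endpoint = env.get("AWS_ENDPOINT_URL")
    if endpoint:
        # Only set locally; in real AWS the variable is absent and
        # boto3 falls back to the normal service endpoints.
        kwargs["endpoint_url"] = endpoint
    return kwargs


if __name__ == "__main__":
    kwargs = aws_client_kwargs(dict(os.environ))
    print(kwargs)
    # With boto3 installed, the client would be created like this:
    #   import boto3
    #   s3 = boto3.client("s3", **kwargs)
    #   s3.list_buckets()
```

As far as I know, recent boto3 releases also read `AWS_ENDPOINT_URL` on their own, so the explicit argument mainly matters for older SDK versions.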

---

## Act: Local CI/CD

### GitHub Actions Workflow

```yaml
# .github/workflows/ci.yml
name: CI
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: &apos;20&apos;
      - run: npm ci
      - run: npm test

  build:
    runs-on: ubuntu-latest
    needs: test
    steps:
      - uses: actions/checkout@v4
      - name: Build Docker image
        run: docker build -t myapp:${{ github.sha }} .

  deploy:
    runs-on: ubuntu-latest
    needs: build
    if: github.ref == &apos;refs/heads/main&apos;
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to Kubernetes
        run: |
          kubectl apply -k k8s/overlays/production
```

### Run Locally with Act

```bash
# Run all jobs
act

# Run specific job
act -j test
act -j build

# With secrets
act --secret-file .secrets

# Run test and build, skip deploy (it needs a real cluster)
# -j selects one job; act also runs the jobs it needs, so test runs first
act -j build
```
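
Job selection is easier to reason about once you see how `needs` turns into an execution order. The Python sketch below illustrates dependency resolution for the workflow above; it is not act's actual planner, and `job_order` is a hypothetical function.

```python
# Job name -> jobs it needs, copied from the workflow above.
WORKFLOW = {"test": [], "build": ["test"], "deploy": ["build"]}


def job_order(workflow: dict[str, list[str]], target: str) -> list[str]:
    """Return the target job preceded by everything it transitively needs."""
    order: list[str] = []

    def visit(job: str) -> None:
        if job in order:
            return
        for dependency in workflow[job]:
            visit(dependency)
        order.append(job)

    visit(target)
    return order


if __name__ == "__main__":
    print(job_order(WORKFLOW, "build"))   # ['test', 'build']
    print(job_order(WORKFLOW, "deploy"))  # ['test', 'build', 'deploy']
```

Selecting `build` pulls in `test`, while `deploy` stays out of the plan unless you ask for it.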

### Act Configuration

```bash
# .actrc
-P ubuntu-latest=catthehacker/ubuntu:act-22.04
--secret-file .secrets
--env-file .env.local
--container-daemon-socket /var/run/docker.sock
```

---

## Complete Development Workflow

### 1. Start Environment

```bash
make up
# ✓ LocalStack ready
# ✓ Kind cluster ready
```

### 2. Develop Locally

```bash
# Run app directly (fastest feedback)
npm run dev

# Or in Docker
docker-compose up app
```

### 3. Test Against Local Services

```bash
# App talks to LocalStack S3
curl http://localhost:3000/upload -F &quot;file=@test.txt&quot;

# Verify in LocalStack
awslocal s3 ls s3://app-uploads/
```

### 4. Test CI Locally

```bash
# Before pushing, verify CI will pass
make test-ci

# Fix any issues locally
# Iterate in seconds, not minutes
```

### 5. Test Kubernetes Deployment

```bash
# Build and load image
docker build -t myapp:dev .
kind load docker-image myapp:dev --name local-dev

# Deploy to local cluster
make deploy-local

# Test it
curl http://myapp.local:8080/health
```

### 6. Push with Confidence

```bash
# Everything works locally, now push
git add .
git commit -m &quot;Feature complete&quot;
git push

# CI should pass because you already ran the same workflow locally
```

---

## Debugging Tips

### Kind: Access Node

```bash
# Shell into node
docker exec -it local-dev-worker /bin/bash

# Check container runtime
crictl ps
```

### LocalStack: Check Logs

```bash
docker-compose logs -f localstack

# Or check specific service
curl http://localhost:4566/_localstack/health | jq
```

### Act: Verbose Mode

```bash
# See everything
act -v

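# Keep containers between runs so you can inspect them
act --reuse
# Container names vary by act version; look the name up first
docker ps --filter &quot;name=act-&quot;
docker exec -it act-CI-test /bin/bash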
```

---

## Performance Tips

### 1. Pre-pull Images

```bash
# Pull commonly used images once
docker pull catthehacker/ubuntu:act-22.04
docker pull localstack/localstack:latest
docker pull kindest/node:v1.28.0
```

### 2. Use Persistence

```yaml
# LocalStack data persists between restarts
environment:
  - PERSISTENCE=1
volumes:
  - &quot;./localstack-data:/var/lib/localstack&quot;
```

### 3. Reuse Kind Cluster

```bash
# Don&apos;t delete cluster between sessions
# Just restart containers if needed
docker start local-dev-control-plane local-dev-worker local-dev-worker2
```

### 4. Parallel Testing

```bash
# Run different tests in parallel
make test-ci &amp;
make deploy-local &amp;
wait
```

---

## When to Test Where

| What | Local | CI | Staging |
|------|-------|-------|---------|
| Unit tests | ✅ Primary | ✅ Verify | - |
| Integration tests | ✅ Primary | ✅ Verify | - |
| Kubernetes manifests | ✅ Kind | ✅ | ✅ Verify |
| AWS integrations | ✅ LocalStack | ✅ | ✅ Real AWS |
| CI workflows | ✅ Act | ✅ Primary | - |
| Performance tests | - | - | ✅ Primary |
| E2E tests | ✅ Optional | ✅ | ✅ Primary |

---

## Quick Reference

```bash
# Start everything
make up

# Stop everything
make down

# Test CI locally
make test-ci

# Deploy to local Kubernetes
docker build -t myapp:dev .
kind load docker-image myapp:dev --name local-dev
kubectl apply -k k8s/overlays/local

# Check LocalStack
awslocal s3 ls
awslocal dynamodb list-tables

# Check Kubernetes
kubectl get pods
kubectl logs -f deployment/myapp

# Clean slate
make clean
```

---

## Conclusion

Fast feedback loops are a superpower. With Kind, LocalStack, and Act, you can:

1. **Test Kubernetes changes** - No waiting for cloud clusters
2. **Test AWS integrations** - No cloud costs or permissions
3. **Test CI pipelines** - No push-and-pray

The investment in local tooling compounds. A 10-second loop lets you try 60 ideas in the time a 10-minute loop allows one, and that difference shows up in every feature you ship.

Set up your local environment today. Your future self will thank you.

---

## References

- [Kind Documentation](https://kind.sigs.k8s.io/)
- [LocalStack Documentation](https://docs.localstack.cloud/)
- [Act Documentation](https://nektosact.com/)
- [Docker Compose](https://docs.docker.com/compose/)</content:encoded><category>devops</category><category>kind</category><category>localstack</category><category>act</category><category>kubernetes</category><category>aws</category><category>development</category><author>Mo Abukar</author></item><item><title>The Principal Engineer Trap</title><link>https://moabukar.co.uk/blog/principal-engineer-trap/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/principal-engineer-trap/</guid><description>The IC ladder looks appealing until you&apos;re at the top. Many senior engineers chase Principal titles without understanding what they&apos;re signing up for. Here&apos;s what nobody tells you.</description><pubDate>Fri, 05 Dec 2025 00:00:00 GMT</pubDate><content:encoded>The tech industry built an Individual Contributor ladder so engineers wouldn&apos;t have to become managers. Senior, Staff, Principal, Distinguished - climb the ladder while writing code, not attending meetings.

Except it&apos;s a trap.

The higher you go on the IC ladder, the less you code and the more you do everything you were trying to avoid. I&apos;ve watched great engineers become miserable Principals because nobody explained what the job actually is.

![The Principal Engineer Trap](/images/principal-engineer-trap.png)


## What Principal Really Means

At senior level, you&apos;re accountable for your work. At staff level, you&apos;re accountable for your team&apos;s work. At principal level, you&apos;re accountable for outcomes you can&apos;t directly control.

Principal engineers:
- Define technical strategy across multiple teams
- Influence without authority over people who don&apos;t report to them
- Navigate organisational politics to get things done
- Communicate constantly with stakeholders at all levels
- Write documents more than code
- Attend many, many meetings

If this sounds like management, that&apos;s because it is. It&apos;s management without direct reports, which is often harder than management with them.

## The Coding Myth

&quot;I&apos;ll stay on the IC track so I can keep coding.&quot;

This is the lie that draws people up the ladder. It&apos;s true at senior level. Increasingly false at staff. Almost entirely false at principal.

Principal engineers code occasionally. Prototypes, proof of concepts, critical architectural pieces. But most of your impact comes through others. If you&apos;re spending significant time coding, you&apos;re not doing your job.

The math is simple: if you have 10 teams in your scope and a decision you influence saves each team one engineer-month, that one decision is worth 10 engineer-months. Your personal coding contribution over the same month is one engineer-month at best. The leverage matters more.

This is frustrating for people who became engineers because they love coding. The job that promoted you out of coding is not the job you loved.

## Influence Without Authority

Managers can direct people. &quot;Please work on this project.&quot; Principals cannot.

Principal engineers influence through:
- Technical credibility (&quot;they know what they&apos;re talking about&quot;)
- Relationship capital (&quot;they&apos;ve helped me before&quot;)
- Compelling arguments (&quot;this approach is clearly better&quot;)
- Organisational navigation (&quot;they know how to get things approved&quot;)

None of this comes automatically with the title. You earn it over time, and you can lose it quickly.

Influencing without authority is exhausting. You can&apos;t just decide things. You have to convince people, build coalitions, and accept that sometimes you&apos;ll lose despite being right.

## The Communication Load

Principals communicate constantly.

**With engineers:** Explaining technical direction, answering questions, providing guidance, code reviews that are more about education than approval.

**With managers:** Aligning on priorities, reporting on technical health, flagging risks, advocating for technical investments.

**With leadership:** Translating technical concepts for non-technical executives, justifying technology decisions, setting expectations.

**With stakeholders:** Product managers, designers, customers, partners. Everyone wants to understand what&apos;s technically possible.

**In writing:** RFCs, architecture docs, decision records, strategy documents. Principal engineers write constantly.

If you don&apos;t enjoy communication - if you&apos;d rather put headphones on and code - principal is the wrong job.

## The Scope Problem

As you grow in scope, your context shrinks.

A senior engineer might understand one codebase deeply. A principal engineer has to have opinions about dozens of systems they&apos;ve never touched.

You become a mile wide and an inch deep. You rely on others for context, form opinions quickly with incomplete information, and accept that you&apos;ll sometimes be wrong.

This is uncomfortable for engineers who pride themselves on deep technical knowledge. The job requires breadth over depth, which can feel like becoming a worse engineer.

## What You Give Up

Taking a principal role means giving up things you might value:

**Flow state.** Uninterrupted coding time becomes rare. Your calendar fills with meetings. Deep work requires aggressive scheduling defence.

**Completion.** You rarely finish things yourself. You start conversations, set direction, and let others complete the work. The satisfaction of shipping disappears.

**Certainty.** At senior level, you know if your code works. At principal level, you&apos;re making bets about the future. You won&apos;t know if you were right for months or years.

**Technical depth.** Staying current in any technology becomes harder. You know a little about everything, a lot about nothing.

**Team camaraderie.** You&apos;re no longer part of a single team. You float between teams, belonging fully to none.

Some people are fine with these trade-offs. Others find them devastating.

## Signs the Trap Is Closing

Watch for warning signs that principal isn&apos;t right for you:

**You resent meetings.** Meetings are the job. If every meeting feels like an interruption, you&apos;re in the wrong role.

**You feel like a fraud.** You&apos;re making decisions about systems you don&apos;t understand deeply. This never goes away - you adapt or struggle.

**You miss shipping.** The last thing you shipped yourself was months ago. You feel disconnected from the work.

**You&apos;re exhausted by people.** Every day is conversations, negotiations, relationship management. If people drain you, this depletes your energy.

**You&apos;re not influencing.** You have the title but not the impact. Your opinions don&apos;t change outcomes. This is a failure state.

## Alternatives

Not everyone should be a principal. There are other paths:

**Stay senior.** Senior is a great job. Deep technical work, clear outcomes, sustainable lifestyle. There&apos;s no shame in not climbing further.

**Specialise.** Some organisations have specialist tracks - security, performance, machine learning. Deep expertise without broad scope.

**Consult.** Independent consulting lets you go deep on problems, then move on. No organisational politics.

**Small companies.** At startups, senior engineers have principal-level impact with hands-on work. The ladder compresses.

**Management.** If you&apos;re going to do the communication and people work anyway, management gives you direct authority to match.

## If You Still Want It

If you&apos;ve read all this and still want principal:

**Build influence before title.** Start doing principal-level work before you have the title. The title should recognise what you already do.

**Invest in communication skills.** Writing, presenting, persuading. These become your primary tools.

**Accept the trade-offs.** You&apos;re choosing leverage over craft. Make the choice consciously.

**Find meaning in the work.** Principal impact is multiplied through others. Learn to feel ownership over outcomes you didn&apos;t directly create.

**Protect what matters.** Some principals maintain small coding projects for sanity. Schedule the time. Protect it.

The principal title is prestigious. The job is hard in ways that aren&apos;t obvious from below. Know what you&apos;re choosing.

The best principals I know love the work. The unhappy ones wanted the title without understanding the job.

Make sure you&apos;re in the first group before you sign up.</content:encoded><category>career</category><category>engineering-culture</category><category>leadership</category><category>principal-engineer</category><author>Mo Abukar</author></item><item><title>Progressive Delivery with Flagger: Automated Canary Deployments</title><link>https://moabukar.co.uk/blog/progressive-delivery-flagger/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/progressive-delivery-flagger/</guid><description>Implement automated canary deployments with Flagger. Metrics-based promotion, automated rollback, and integration with Istio, Linkerd, and Gateway API.</description><pubDate>Thu, 04 Dec 2025 00:00:00 GMT</pubDate><content:encoded>Progressive Delivery with Flagger: Automated Canary Deployments
================================================================

Manual canary deployments are tedious. Flagger automates the
entire process: deploy the canary, shift traffic gradually, analyze
metrics, then promote or roll back automatically.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/argo.svg&quot; alt=&quot;Argo logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Flagger = automated canary/blue-green/A-B testing
- Metrics-based promotion (Prometheus, Datadog, etc.)
- Automatic rollback on failure
- Works with Istio, Linkerd, Nginx, Gateway API
- Full GitOps integration


Install Flagger
===============

```bash
helm repo add flagger https://flagger.app
helm upgrade -i flagger flagger/flagger \
  --namespace flagger-system --create-namespace \
  --set meshProvider=istio \
  --set metricsServer=http://prometheus.monitoring:9090
```


Canary Resource
===============

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  
  progressDeadlineSeconds: 600
  
  service:
    port: 8080
    targetPort: 8080
    gateways:
      - production-gateway
    hosts:
      - api.company.com
  
  analysis:
    # Canary increment
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    
    # Promotion metrics
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
      
      - name: request-duration
        thresholdRange:
          max: 500
        interval: 1m
    
    # Webhooks for custom checks
    webhooks:
      - name: load-test
        type: rollout
        url: http://loadtester.flagger-system/
        metadata:
          cmd: &quot;hey -z 1m -q 10 -c 2 http://api-server-canary.production:8080/&quot;
```
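
With these settings the rollout timeline is predictable. A small Python sketch of the weight steps, assuming every analysis check passes. The timing formulas follow the ones in the Flagger docs (interval times the number of weight steps to promote, interval times threshold to roll back), and the function names are hypothetical.

```python
def canary_weights(step_weight: int, max_weight: int) -> list[int]:
    """Traffic percentages the canary receives when every check passes."""
    return list(range(step_weight, max_weight + 1, step_weight))


def minutes_to_promote(interval_min: int, step_weight: int, max_weight: int) -> int:
    """Minimum analysis time before promotion: one interval per weight step."""
    return interval_min * len(canary_weights(step_weight, max_weight))


def minutes_to_rollback(interval_min: int, threshold: int) -> int:
    """Time until rollback when every check fails: one interval per failed check."""
    return interval_min * threshold


if __name__ == "__main__":
    # Values from the Canary above: interval 1m, stepWeight 10, maxWeight 50, threshold 5.
    print(canary_weights(10, 50))         # [10, 20, 30, 40, 50]
    print(minutes_to_promote(1, 10, 50))  # 5
    print(minutes_to_rollback(1, 5))      # 5
```

So this Canary needs at least five minutes of healthy metrics to promote, and gives up after five failed checks.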


Metrics Templates
=================

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-success-rate
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    100 - (
      sum(
        rate(
          http_requests_total{
            namespace=&quot;{{ namespace }}&quot;,
            service=~&quot;{{ target }}-canary&quot;,
            status=~&quot;5..&quot;
          }[{{ interval }}]
        )
      ) 
      / 
      sum(
        rate(
          http_requests_total{
            namespace=&quot;{{ namespace }}&quot;,
            service=~&quot;{{ target }}-canary&quot;
          }[{{ interval }}]
        )
      ) * 100
    )

---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-duration
  namespace: flagger-system
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    histogram_quantile(0.99,
      sum(
        rate(
          http_request_duration_seconds_bucket{
            namespace=&quot;{{ namespace }}&quot;,
            service=~&quot;{{ target }}-canary&quot;
          }[{{ interval }}]
        )
      ) by (le)
    ) * 1000
```
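
To see how these templates feed the thresholdRange checks, the arithmetic can be reproduced outside Prometheus. This is an illustrative Python sketch with hypothetical function names; the input rates are invented numbers, not real measurements.

```python
# Illustrative only: mirrors the arithmetic of the two MetricTemplate queries.

def success_rate_percent(error_rate: float, total_rate: float) -> float:
    # Same shape as the PromQL: 100 - (5xx rate / total rate * 100).
    if total_rate == 0:
        # With no traffic there is nothing meaningful to compute; this sketch raises.
        raise ValueError("no traffic in the window")
    return 100 - (error_rate / total_rate) * 100


def within_range(value: float, minimum: float = None, maximum: float = None) -> bool:
    # thresholdRange semantics: the check passes while the value stays inside the range.
    if minimum is not None and minimum > value:
        return False
    if maximum is not None and value > maximum:
        return False
    return True


# Invented sample: 0.5 failed requests/s out of 100 requests/s, p99 of 0.42 s.
rate = success_rate_percent(0.5, 100.0)
p99_ms = 0.42 * 1000  # the request-duration query multiplies seconds by 1000
print(rate, within_range(rate, minimum=99))
print(p99_ms, within_range(p99_ms, maximum=500))
```

Both sample values sit inside their ranges, so a canary reporting them would keep advancing.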


Blue-Green Deployment
=====================

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  
  service:
    port: 8080
  
  analysis:
    # Blue-green: 0% or 100%
    interval: 1m
    threshold: 5
    iterations: 10  # Run analysis 10 times before promoting
    
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
    
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://loadtester.flagger-system/
        metadata:
          type: bash
          cmd: &quot;curl -s http://api-server-canary.production:8080/health | grep ok&quot;
      
      - name: load-test
        type: rollout
        url: http://loadtester.flagger-system/
        metadata:
          cmd: &quot;hey -z 2m -q 50 -c 10 http://api-server-canary.production:8080/&quot;
```


A/B Testing
===========

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  
  service:
    port: 8080
  
  analysis:
    interval: 1m
    threshold: 5
    iterations: 100
    
    # A/B testing by header
    match:
      - headers:
          x-user-type:
            exact: beta
    
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
        interval: 1m
```
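
The match block decides per request rather than per percentage. As a minimal sketch of that routing decision (plain Python with a hypothetical function name, not how the mesh implements it): requests carrying the header go to the canary, everything else stays on the primary.

```python
# Hypothetical model of header-based A/B routing; the mesh does this for real traffic.

def route(headers: dict) -> str:
    # HTTP header names are case-insensitive, so normalise them before matching.
    normalised = {name.lower(): value for name, value in headers.items()}
    # exact: beta means the header value must equal the string exactly.
    if normalised.get("x-user-type") == "beta":
        return "canary"
    return "primary"


print(route({"X-User-Type": "beta"}))
print(route({"X-User-Type": "free"}))
print(route({}))
```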


Gateway API Integration
=======================

```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: api-server
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  
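  service:
    port: 8080
    # Route via a Gateway API Gateway instead of an Istio gateway.
    # gatewayRefs belongs under spec.service, and the Canary (or Flagger itself)
    # must use the Gateway API provider for this to take effect.
    gatewayRefs:
      - name: production-gateway
        namespace: gateway-system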
  
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    
    metrics:
      - name: request-success-rate
        templateRef:
          name: request-success-rate
          namespace: flagger-system
        thresholdRange:
          min: 99
```


Alerting
========

```yaml
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
  name: slack
  namespace: flagger-system
spec:
  type: slack
  channel: deployments
  username: flagger
  secretRef:
    name: slack-url

---
apiVersion: v1
kind: Secret
metadata:
  name: slack-url
  namespace: flagger-system
stringData:
  address: https://hooks.slack.com/services/xxx/yyy/zzz
```


GitOps Workflow
===============

```yaml
# Argo CD triggers Flagger
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: api-server
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/company/app
    path: k8s
    targetRevision: HEAD
  destination:
    server: https://kubernetes.default.svc
    namespace: production
  syncPolicy:
    automated:
      prune: true
      selfHeal: true

# Deployment update triggers Flagger canary
# Flagger handles the progressive rollout
```


Monitoring Progress
===================

```bash
# Watch canary status
kubectl get canary -n production -w

# Describe canary
kubectl describe canary api-server -n production

# Check events
kubectl get events -n production --field-selector involvedObject.name=api-server
```


Rollback
========

```bash
# Manual rollback
kubectl annotate canary api-server -n production flagger.app/rollback=true

# Or set status
kubectl patch canary api-server -n production \
  --type=merge -p &apos;{&quot;status&quot;:{&quot;phase&quot;:&quot;Failed&quot;}}&apos;
```


References
==========

- Flagger Docs: https://docs.flagger.app
- Metrics: https://docs.flagger.app/usage/metrics
- Webhooks: https://docs.flagger.app/usage/webhooks


========================================
Flagger + Progressive Delivery
========================================
Deploy with confidence. Rollback automatically.
========================================</content:encoded><category>flagger</category><category>canary</category><category>progressive-delivery</category><category>kubernetes</category><category>gitops</category><category>deployment</category><author>Mo Abukar</author></item><item><title>Startup vs Scale-Up vs Enterprise: Where You&apos;ll Actually Learn the Most</title><link>https://moabukar.co.uk/blog/startup-scaleup-enterprise/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/startup-scaleup-enterprise/</guid><description>After working across all three - tiny startups, hypergrowth scale-ups, and massive enterprises - I can tell you they&apos;re completely different jobs. Same title, same tech, completely different experience. Here&apos;s what each teaches you.</description><pubDate>Tue, 02 Dec 2025 00:00:00 GMT</pubDate><content:encoded>Startup vs Scale-Up vs Enterprise: Where You&apos;ll Actually Learn the Most
=======================================================================

After working across all three - tiny startups, hypergrowth scale-ups, and massive enterprises - I can tell you: they&apos;re completely different jobs.

Same title. Same tech. Completely different experience.

**The honest answer to &quot;which is best for your career&quot; is: all of them, at different times.**

Let me explain what each actually teaches you, because the lessons are genuinely different.

---

![Startup vs Scale-Up vs Enterprise: Where You&apos;ll Actually Learn the Most](/images/startup-scaleup-enterprise.webp)


Quick Comparison
================

```
ENVIRONMENT     SIZE        PACE        DEPTH       CHAOS
===========     ====        ====        =====       =====
Startup         &lt; 50        Fast        Shallow     High
Scale-up        50-500      Relentless  Medium      Very High
Enterprise      500+        Slow        Deep        Low
```

---

Startup (Under 50 People): Wear Every Hat
=========================================

At a startup, you&apos;re not a DevOps Engineer. You&apos;re DevOps, SRE, Platform, Security, and sometimes Backend all rolled into one. Job titles are suggestions. Everyone does whatever needs doing.

**What startups teach you:**

**Breadth over depth.** You&apos;ll touch everything: infrastructure, CI/CD, monitoring, security, networking, maybe even some frontend when things get desperate. You won&apos;t be an expert in any of it, but you&apos;ll understand how it all connects.

**Speed over perfection.** There&apos;s no time for perfect architectures. You ship something that works, learn from it, and iterate. You develop strong intuition for &quot;good enough for now&quot; vs &quot;this will kill us later.&quot;

**Ownership.** No one else is going to fix that alert at 3am. No one else is going to write that runbook. It&apos;s yours. All of it. This forces accountability in a way that larger companies simply can&apos;t.

**Business context.** In a startup, you&apos;re in the room (or at least nearby) when business decisions happen. You understand *why* things are being built, not just *what*. This changes how you think about engineering.

**Scrappiness.** Limited budget means creative solutions. You learn to do a lot with a little. This skill never stops being valuable.

---

The Hard Truth About Startups
-----------------------------

You&apos;ll learn breadth at the expense of depth. You might set up Kubernetes, but you won&apos;t deeply understand its internals because there&apos;s no time. You&apos;ll configure Terraform, but you won&apos;t learn advanced patterns because you&apos;re already fighting the next fire.

Also: no mentorship. If you&apos;re the only infrastructure person, there&apos;s no senior engineer to learn from. You&apos;re Googling, reading docs, and figuring it out alone. This can be empowering or terrifying, depending on your personality.

```
STARTUP ROLE            WHAT YOU&apos;RE ACTUALLY DOING
============            ==========================
DevOps Engineer         DevOps + SRE + Platform + Security
Backend Developer       Backend + Frontend + DBA + QA
Product Manager         PM + Designer + Analyst + Support
CTO                     Architect + IC + Manager + Recruiter
```

**Best for:** Early-career engineers who want exposure, or experienced engineers who want ownership and variety.

---

Scale-Up (50-500 People): Build What Scales
===========================================

Scale-ups are the sweet spot for learning. You have startup energy but actual resources. Things are growing fast, which means you&apos;re constantly solving problems you&apos;ve never faced before.

**What scale-ups teach you:**

**Scaling systems.** The architecture that worked for 10 engineers breaks at 100. The deployment process that worked for 5 services fails at 50. You learn to anticipate scale problems before they hit.

**Process creation.** Startups have no process. Enterprises have too much. Scale-ups are in the middle, figuring out what process is actually needed. You learn to build just enough structure without killing velocity.

**Technical leadership.** You&apos;re experienced enough to lead projects but not drowning in meetings yet. This is where many engineers transition from Senior to Staff - by necessity, not by title.

**Cross-team collaboration.** Multiple teams now exist. They need to work together. You learn how to align technical decisions across groups, which is a skill you&apos;ll use forever.

**Handling growth chaos.** Nothing works the way it&apos;s supposed to. Documentation is outdated. That system &quot;nobody touches&quot; is suddenly critical. You learn to navigate ambiguity at speed.

---

The Hard Truth About Scale-Ups
------------------------------

It&apos;s exhausting. The pace is relentless. You&apos;re constantly firefighting while trying to build for the future. Work-life balance is often poor. Burnout is common.

Also: things change constantly. That project you spent three months on? Deprioritised. That team you joined? Reorged. The roadmap? Rewritten. If you need stability, scale-ups will drive you insane.

**Best for:** Mid-career engineers who want to level up quickly and don&apos;t mind chaos.

---

Enterprise (500+ People): Go Deep
=================================

Enterprises get a bad rap. People think they&apos;re slow, bureaucratic, and boring. That&apos;s sometimes true. But they also teach you things you simply can&apos;t learn elsewhere.

**What enterprises teach you:**

**Depth.** You have time to actually understand things properly. You can spend weeks diving into a technology because nobody expects you to ship five features this sprint. This depth builds expertise that&apos;s hard to get elsewhere.

**Working with legacy.** Real-world systems aren&apos;t greenfield. They&apos;re 10-year-old codebases with undocumented behaviour and zero tests. Learning to work with (and improve) legacy systems is an underrated skill.

**Process and governance.** Change management. Security reviews. Compliance requirements. Architecture review boards. It&apos;s frustrating, but understanding *why* these exist makes you a better engineer. Many startup engineers dismiss process entirely - then build systems that fall over when they scale.

**Organisational complexity.** Getting anything done requires navigating multiple teams, stakeholders, and approval chains. This is annoying but valuable. If you ever want to be a Staff or Principal engineer, you need to understand how large organisations work.

**Specialisation.** Enterprises have dedicated teams for everything. You can go deep on Kubernetes, or networking, or security, or observability. You become genuinely expert in your area.

---

The Hard Truth About Enterprises
--------------------------------

You can stagnate. If you&apos;re not careful, you&apos;ll spend five years doing the same thing, learning nothing new, and becoming institutionalised. The safety is seductive. The golden handcuffs are real.

Also: impact is slow. That initiative you proposed? It&apos;ll take six months just to get approved. Then another year to implement. If you need to see results quickly, enterprises will frustrate you.

**Best for:** Engineers who want depth, stability, or exposure to large-scale systems. Also engineers with families who value predictability.

---

You Need All Three
==================

Here&apos;s what I&apos;ve learned: the best engineers I know have worked across all three environments. Each teaches you something the others can&apos;t.

**Startup experience** gives you scrappiness, ownership, and breadth. You learn to ship fast and take responsibility.

**Scale-up experience** gives you growth skills and technical leadership. You learn to build systems that survive success.

**Enterprise experience** gives you depth, process understanding, and navigation skills. You learn to work within constraints and handle complexity.

```
IF YOU&apos;VE ONLY DONE     YOU PROBABLY           YOU&apos;LL STRUGGLE WITH
====================    =============          ====================
Startups                Over-simplify          Process and governance
Scale-ups               Context-switch well    Deep expertise
Enterprises             Over-engineer          Moving fast
```

If you&apos;ve only ever worked at startups, you probably don&apos;t understand why process exists - and you&apos;ll struggle when your startup becomes a scale-up.

If you&apos;ve only ever worked at enterprises, you probably over-engineer everything and can&apos;t ship without six approvals - and you&apos;ll drown in startup chaos.

**The magic combination:** Do a startup early (learn to ship), do a scale-up mid-career (learn to lead), do an enterprise when you want depth or stability (learn to specialise).

Or mix and match based on what you need. The point is: don&apos;t stay in one lane your entire career.

---

Matching Environment to Life Stage
==================================

Different environments also suit different life stages:

```
CAREER STAGE        BEST FIT            WHY
============        ========            ===
Early (0-3 yrs)     Startup/Scale-up    Need exposure and reps
Mid (3-7 yrs)       Scale-up            Ready to lead, want growth
Later (7+ yrs)      Depends             Depth, ownership, or stability
With family         Often enterprise    Predictable hours, good benefits
```

**Early career (0-3 years):** Startups or scale-ups. You need exposure and reps. Enterprises will let you hide in a corner and never grow.

**Mid career (3-7 years):** Scale-ups are ideal. You have enough experience to lead but still want growth. This is where careers accelerate.

**Later career (7+ years):** Depends on what you want. Enterprises offer stability and depth. Startups offer ownership if you&apos;re senior enough to not drown. Scale-ups offer impact if you have the energy.

**With family responsibilities:** Enterprises often make sense. Predictable hours, good benefits, less chaos. No shame in prioritising life outside work.

---

Compensation Reality
====================

Real talk: compensation varies more by company than by stage, but generally:

```
ENVIRONMENT     BASE        EQUITY              RISK-ADJUSTED
===========     ====        ======              =============
Startup         Lower       High (maybe)        Often lowest
Scale-up        Competitive Meaningful          Often best
Enterprise      High        RSUs (predictable)  Reliable
```

**Startups:** Lower base, potentially meaningful equity (or worthless equity - it&apos;s a gamble). Total comp is often lower unless you hit a unicorn.

**Scale-ups:** Competitive base, meaningful equity that might actually be worth something. Often the best risk-adjusted compensation.

**Enterprises:** High base, good benefits, often RSUs that vest predictably. Lower upside but reliable.

Don&apos;t join a startup purely for the equity unless you genuinely believe in the company. Most startup equity ends up worth nothing.

---

My Advice
=========

If you&apos;re early in your career: try a startup or scale-up first. Get the breadth. Learn to ship. Develop urgency.

If you&apos;re mid-career and feeling stuck: try a different environment. If you&apos;ve only done enterprises, do a scale-up. If you&apos;ve only done startups, try an enterprise. The change in perspective is valuable.

If you&apos;re later in your career: know what you want. Depth? Enterprise. Impact? Scale-up. Ownership? Startup.

**The worst thing you can do is stay in one environment forever and assume it&apos;s the only way engineering works.** It&apos;s not. Go see how other people do it. You&apos;ll come back better.

---

Summary
=======

```
ENVIRONMENT     TEACHES YOU                     TRADE-OFF
===========     ===========                     =========
Startup         Breadth, ownership, speed       No mentorship, shallow
Scale-up        Scaling, leadership, growth     Chaos, burnout risk
Enterprise      Depth, process, specialisation  Stagnation, slow impact
```

**Work in all three if you can. Each teaches you something the others can&apos;t.**

```
========================================
Startup: wear every hat
Scale-up: build what scales
Enterprise: go deep
========================================
Do all three, and you&apos;ll be dangerous.
========================================
```</content:encoded><category>career</category><category>startups</category><category>engineering-culture</category><category>advice</category><category>leadership</category><author>Mo Abukar</author></item><item><title>SLO-Based Alerting: Burn Rate Alerts vs Threshold Alerts</title><link>https://moabukar.co.uk/blog/slo-based-alerting/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/slo-based-alerting/</guid><description>Implement SLO-based alerting with burn rate alerts. Move from noisy threshold alerts to meaningful reliability signals using error budgets.</description><pubDate>Sun, 30 Nov 2025 00:00:00 GMT</pubDate><content:encoded>SLO-Based Alerting: Burn Rate Alerts vs Threshold Alerts
========================================================

Threshold alerts are noisy. &quot;CPU &gt; 80%&quot; fires constantly but
rarely matters. SLO-based alerting focuses on what matters:
are we burning through our error budget too fast?

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/prometheus.svg&quot; alt=&quot;Prometheus logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- SLO = target reliability (e.g., 99.9% availability)
- Error Budget = allowed unreliability (0.1% = 43 min/month)
- Burn Rate = how fast you&apos;re consuming error budget
- Multi-window alerts reduce noise, catch real problems
- Prometheus/Grafana implementation included


Why SLO-Based Alerting?
=======================

```
THRESHOLD ALERTS                SLO-BASED ALERTS
================                ================
&quot;Error rate &gt; 1%&quot;               &quot;Burning 10x error budget&quot;
Fires on any spike              Fires on sustained impact
100s of alerts/week             ~5 alerts/week
Alert fatigue                   Actionable alerts
```


Error Budget Math
=================

```
SLO: 99.9% availability
Error Budget: 0.1% = 1 - 0.999

Monthly error budget (30 days):
30 days × 24 hours × 60 minutes × 0.001 = 43.2 minutes

If you&apos;re at 99.8% for an hour:
- Errors: 0.2% of traffic
- Burn rate: 0.2% / 0.1% = 2× normal
- Budget consumed: 60 min × 0.002 = 0.12 of the 43.2 budget minutes (~0.28%),
  i.e. 2 hours&apos; worth of budget spent in 1 hour
```
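
The same budget arithmetic as a small Python sketch, handy for trying other SLO targets. The function names are hypothetical, and the model assumes steady traffic so that error rate and time can be multiplied directly.

```python
# Illustrative helpers; assumes constant traffic over the whole window.

def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    # Minutes of total outage the SLO tolerates over the window.
    return window_days * 24 * 60 * (1 - slo)


def budget_fraction_used(error_rate: float, duration_min: float,
                         slo: float, window_days: int = 30) -> float:
    # Equivalent outage minutes, as a share of the window budget.
    bad_minutes = duration_min * error_rate
    return bad_minutes / error_budget_minutes(slo, window_days)


print(round(error_budget_minutes(0.999), 1))
print(round(budget_fraction_used(0.002, 60, 0.999) * 100, 2))
```

One hour at 99.8% uses roughly 0.28% of the 30-day budget for a 99.9% SLO.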


Burn Rate Definition
====================

```
Burn Rate = (Actual Error Rate) / (SLO Error Rate)

Example:
- SLO allows 0.1% errors
- Current error rate: 0.5%
- Burn rate: 0.5 / 0.1 = 5×

At 5× burn rate:
- 30-day budget exhausted in 6 days
- 1-day budget exhausted in ~5 hours
```
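
A short Python sketch of the definition, useful for checking the exhaustion times by hand. The function names are hypothetical; the second function assumes the burn rate stays constant.

```python
import math

# Illustrative helpers for the burn rate definition.

def burn_rate(error_rate: float, slo: float) -> float:
    # Observed error rate divided by the error rate the SLO allows.
    return error_rate / (1 - slo)


def days_until_exhausted(rate: float, window_days: float = 30) -> float:
    # A burn rate of 1 spends the budget exactly over the window.
    if rate == 0:
        return math.inf
    return window_days / rate


rate = burn_rate(0.005, 0.999)
print(round(rate, 2), round(days_until_exhausted(rate), 2))
```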


Multi-Window Burn Rate Alerts
=============================

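Single-window burn rate alerts are still noisy: a short window fires on brief spikes, and a long window keeps firing well after recovery. Pair a short window with a long one: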

```
SHORT WINDOW        LONG WINDOW         SEVERITY
============        ===========         ========
5 min               1 hour              Page (critical)
30 min              6 hours             Page (warning)
2 hours             24 hours            Ticket
6 hours             3 days              Ticket
```

Both windows must exceed the burn rate threshold for the alert to fire.
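
A minimal Python sketch of that rule, to show why the pairing cuts noise. The function name and the sample error ratios are invented for illustration; 14.4 is the factor used by the critical alert in this post.

```python
# Illustrative model of a multi-window burn rate check.

def should_fire(short_ratio: float, long_ratio: float,
                allowed: float, factor: float) -> bool:
    # Both windows must sit above factor × allowed error rate.
    limit = factor * allowed
    return short_ratio > limit and long_ratio > limit


ALLOWED = 0.001  # error rate allowed by a 99.9% SLO
FACTOR = 14.4    # burn rate threshold of the critical alert

# Brief spike: the 5m window is hot, the 1h window is still calm.
print(should_fire(0.05, 0.002, ALLOWED, FACTOR))
# Sustained incident: both windows are hot.
print(should_fire(0.05, 0.02, ALLOWED, FACTOR))
# Recovered: the 1h window still remembers the incident, the 5m window is clean.
print(should_fire(0.0005, 0.02, ALLOWED, FACTOR))
```

Only the sustained case fires. The long window provides evidence that real budget is being spent, and the short window lets the alert clear soon after the problem stops.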


Prometheus Recording Rules
==========================

```yaml
# slo-recording-rules.yaml
groups:
  - name: slo-recording
    interval: 30s
    rules:
      # Error ratio over different windows
      - record: slo:http_request_error_ratio:rate5m
        expr: |
          sum(rate(http_requests_total{status=~&quot;5..&quot;}[5m])) by (service)
          /
          sum(rate(http_requests_total[5m])) by (service)
      
      - record: slo:http_request_error_ratio:rate30m
        expr: |
          sum(rate(http_requests_total{status=~&quot;5..&quot;}[30m])) by (service)
          /
          sum(rate(http_requests_total[30m])) by (service)
      
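      - record: slo:http_request_error_ratio:rate1h
        expr: |
          sum(rate(http_requests_total{status=~&quot;5..&quot;}[1h])) by (service)
          /
          sum(rate(http_requests_total[1h])) by (service)
      
      # 2h window: required by the 3× ticket alert (SLOErrorBudgetDegraded)
      - record: slo:http_request_error_ratio:rate2h
        expr: |
          sum(rate(http_requests_total{status=~&quot;5..&quot;}[2h])) by (service)
          /
          sum(rate(http_requests_total[2h])) by (service)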
      
      - record: slo:http_request_error_ratio:rate6h
        expr: |
          sum(rate(http_requests_total{status=~&quot;5..&quot;}[6h])) by (service)
          /
          sum(rate(http_requests_total[6h])) by (service)
      
      - record: slo:http_request_error_ratio:rate24h
        expr: |
          sum(rate(http_requests_total{status=~&quot;5..&quot;}[24h])) by (service)
          /
          sum(rate(http_requests_total[24h])) by (service)
      
      - record: slo:http_request_error_ratio:rate3d
        expr: |
          sum(rate(http_requests_total{status=~&quot;5..&quot;}[3d])) by (service)
          /
          sum(rate(http_requests_total[3d])) by (service)

      # SLO targets (configure per service)
      - record: slo:http_request:objective
        expr: |
          vector(0.001)  # 99.9% = 0.1% error budget
        labels:
          service: api-server
      
      - record: slo:http_request:objective
        expr: |
          vector(0.01)  # 99% = 1% error budget
        labels:
          service: batch-processor
```


Burn Rate Alerts
================

```yaml
# slo-alerting-rules.yaml
groups:
  - name: slo-alerts
    rules:
      # Critical: 14.4× burn rate over 5m AND 1h
      # Burns 2% of the 30-day budget per hour; exhausts it in ~2 days (50 hours)
      - alert: SLOErrorBudgetCritical
        expr: |
          (
            slo:http_request_error_ratio:rate5m &gt; on(service) (14.4 * slo:http_request:objective)
            and
            slo:http_request_error_ratio:rate1h &gt; on(service) (14.4 * slo:http_request:objective)
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: &quot;{{ $labels.service }} burning error budget 14× faster than allowed&quot;
          description: &quot;Error rate is {{ $value | humanizePercentage }}. At this rate, 30-day budget exhausted in ~2 days.&quot;
          runbook_url: https://runbooks.company.com/slo-critical

      # Warning: 6× burn rate over 30m AND 6h
      # Exhausts budget in 5 days
      - alert: SLOErrorBudgetWarning
        expr: |
          (
            slo:http_request_error_ratio:rate30m &gt; on(service) (6 * slo:http_request:objective)
            and
            slo:http_request_error_ratio:rate6h &gt; on(service) (6 * slo:http_request:objective)
          )
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: &quot;{{ $labels.service }} burning error budget 6× faster than allowed&quot;
          description: &quot;At this rate, 30-day budget exhausted in ~5 days.&quot;

      # Ticket: 3× burn rate over 2h AND 24h
      # Exhausts budget in 10 days
      - alert: SLOErrorBudgetDegraded
        expr: |
          (
            slo:http_request_error_ratio:rate2h &gt; on(service) (3 * slo:http_request:objective)
            and
            slo:http_request_error_ratio:rate24h &gt; on(service) (3 * slo:http_request:objective)
          )
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: &quot;{{ $labels.service }} error rate elevated&quot;

      # Slow Burn: 1× burn rate over 6h AND 3d
      # On track to exhaust budget
      - alert: SLOErrorBudgetSlowBurn
        expr: |
          (
            slo:http_request_error_ratio:rate6h &gt; on(service) group_left slo:http_request:objective
            and
            slo:http_request_error_ratio:rate3d &gt; on(service) group_left slo:http_request:objective
          )
        for: 1h
        labels:
          severity: ticket
        annotations:
          summary: &quot;{{ $labels.service }} on track to exhaust error budget&quot;
```
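
The factors 14.4, 6, 3 and 1 are not arbitrary: each one fixes how much of the 30-day budget may be spent within the long window before the alert fires. The Python sketch below (hypothetical function name) multiplies each factor back out; the 2% in 1 hour and 5% in 6 hours pairings are the ones commonly cited from the Google SRE Workbook.

```python
# Illustrative: turn a burn rate and a window into a share of the SLO budget.

def budget_used_in_window(burn_rate: float, window_hours: float,
                          slo_window_days: int = 30) -> float:
    # Share of the SLO-window budget spent if burn_rate holds for window_hours.
    return burn_rate * window_hours / (slo_window_days * 24)


# (burn rate, long window in hours) for the four alerts in the rules file.
for factor, hours in [(14.4, 1), (6, 6), (3, 24), (1, 72)]:
    print(factor, hours, round(budget_used_in_window(factor, hours) * 100, 1))
```

The four alerts correspond to 2%, 5%, 10% and 10% of the monthly budget. A sustained 14.4× burn would use the whole budget in 720 / 14.4 = 50 hours.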


Latency SLOs
============

```yaml
groups:
  - name: latency-slo-recording
    rules:
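      # Share of requests served in under 500ms (the &quot;good&quot; ratio)
      - record: slo:http_request_latency_ratio:rate5m
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le=&quot;0.5&quot;}[5m])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[5m])) by (service)
      
      # Same ratio over 1h: the long window used by SLOLatencyBudgetCritical
      - record: slo:http_request_latency_ratio:rate1h
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le=&quot;0.5&quot;}[1h])) by (service)
          /
          sum(rate(http_request_duration_seconds_count[1h])) by (service)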
      
      # Target: 99% of requests &lt; 500ms
      - record: slo:http_request_latency:objective
        expr: vector(0.99)
        labels:
          service: api-server

  - name: latency-slo-alerts
    rules:
      - alert: SLOLatencyBudgetCritical
        expr: |
          (
            slo:http_request_latency_ratio:rate5m &lt; on(service) (1 - 14.4 * (1 - slo:http_request_latency:objective))
            and
            slo:http_request_latency_ratio:rate1h &lt; on(service) (1 - 14.4 * (1 - slo:http_request_latency:objective))
          )
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: &quot;{{ $labels.service }} latency SLO breach&quot;
```
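
The threshold inside the latency alert is easier to reason about as a number. A small Python sketch (hypothetical function name) that evaluates the same formula, 1 - factor × (1 - objective):

```python
# Illustrative: the lowest acceptable share of fast requests for a latency SLO.

def good_ratio_floor(objective: float, factor: float) -> float:
    # Below this good-request ratio, the budget burns faster than factor allows.
    return 1 - factor * (1 - objective)


print(round(good_ratio_floor(0.99, 14.4), 3))
print(round(good_ratio_floor(0.90, 14.4), 3))
```

For the 99% objective the alert fires once fewer than 85.6% of requests finish within 500ms. For a looser objective such as 90% the floor goes negative, which means a 14.4× alert could never fire; a smaller factor would be needed there.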


Grafana Dashboard
=================

```json
{
  &quot;panels&quot;: [
    {
      &quot;title&quot;: &quot;Error Budget Remaining (30d)&quot;,
      &quot;type&quot;: &quot;gauge&quot;,
      &quot;targets&quot;: [
        {
          &quot;expr&quot;: &quot;1 - (sum_over_time(slo:http_request_error_ratio:rate5m{service=\&quot;api-server\&quot;}[30d]) / count_over_time(slo:http_request_error_ratio:rate5m{service=\&quot;api-server\&quot;}[30d])) / 0.001&quot;,
          &quot;legendFormat&quot;: &quot;Budget Remaining&quot;
        }
      ],
      &quot;options&quot;: {
        &quot;reduceOptions&quot;: {
          &quot;calcs&quot;: [&quot;lastNotNull&quot;]
        }
      },
      &quot;fieldConfig&quot;: {
        &quot;defaults&quot;: {
          &quot;unit&quot;: &quot;percentunit&quot;,
          &quot;min&quot;: 0,
          &quot;max&quot;: 1,
          &quot;thresholds&quot;: {
            &quot;steps&quot;: [
              {&quot;color&quot;: &quot;red&quot;, &quot;value&quot;: 0},
              {&quot;color&quot;: &quot;yellow&quot;, &quot;value&quot;: 0.25},
              {&quot;color&quot;: &quot;green&quot;, &quot;value&quot;: 0.5}
            ]
          }
        }
      }
    },
    {
      &quot;title&quot;: &quot;Current Burn Rate&quot;,
      &quot;type&quot;: &quot;stat&quot;,
      &quot;targets&quot;: [
        {
          &quot;expr&quot;: &quot;slo:http_request_error_ratio:rate1h{service=\&quot;api-server\&quot;} / 0.001&quot;,
          &quot;legendFormat&quot;: &quot;Burn Rate&quot;
        }
      ],
      &quot;fieldConfig&quot;: {
        &quot;defaults&quot;: {
          &quot;unit&quot;: &quot;x&quot;,
          &quot;thresholds&quot;: {
            &quot;steps&quot;: [
              {&quot;color&quot;: &quot;green&quot;, &quot;value&quot;: 0},
              {&quot;color&quot;: &quot;yellow&quot;, &quot;value&quot;: 1},
              {&quot;color&quot;: &quot;red&quot;, &quot;value&quot;: 6}
            ]
          }
        }
      }
    },
    {
      &quot;title&quot;: &quot;Time Until Budget Exhausted&quot;,
      &quot;type&quot;: &quot;stat&quot;,
      &quot;targets&quot;: [
        {
          &quot;expr&quot;: &quot;(1 - (sum_over_time(slo:http_request_error_ratio:rate5m{service=\&quot;api-server\&quot;}[30d]) / count_over_time(slo:http_request_error_ratio:rate5m{service=\&quot;api-server\&quot;}[30d])) / 0.001) * 30 * 24 / (slo:http_request_error_ratio:rate1h{service=\&quot;api-server\&quot;} / 0.001)&quot;,
          &quot;legendFormat&quot;: &quot;Hours Remaining&quot;
        }
      ],
      &quot;fieldConfig&quot;: {
        &quot;defaults&quot;: {
          &quot;unit&quot;: &quot;h&quot;
        }
      }
    }
  ]
}
```
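
The panel expressions are dense, so here is the same arithmetic as a Python sketch. The function names and sample inputs are invented for illustration; `allowed` is the 0.001 hard-coded in the queries.

```python
import math

# Illustrative mirror of the gauge and the time-remaining stat panel.

def budget_remaining(avg_error_ratio: float, allowed: float) -> float:
    # 1.0 = untouched budget, 0.0 = fully spent, negative = overspent.
    return 1 - avg_error_ratio / allowed


def hours_until_exhausted(remaining: float, burn_rate: float,
                          window_days: int = 30) -> float:
    # Remaining share of the window, stretched or shrunk by the current burn rate.
    if burn_rate == 0:
        return math.inf  # nothing is burning right now
    return remaining * window_days * 24 / burn_rate


remaining = budget_remaining(0.0004, 0.001)
print(round(remaining, 2), round(hours_until_exhausted(remaining, 2.0), 1))
```

With 60% of the budget left and a current burn rate of 2×, the budget lasts about 216 more hours. The gauge sets min to 0, so an overspent budget simply shows as empty.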


Sloth: SLO Generator
====================

```yaml
# sloth-slo.yaml
version: prometheus/v1
service: api-server
slos:
  - name: requests-availability
    objective: 99.9
    description: 99.9% of requests succeed
    sli:
      events:
        error_query: sum(rate(http_requests_total{service=&quot;api-server&quot;,status=~&quot;5..&quot;}[{{.window}}]))
        total_query: sum(rate(http_requests_total{service=&quot;api-server&quot;}[{{.window}}]))
    alerting:
      name: APIServerAvailability
      page_alert:
        labels:
          severity: critical
      ticket_alert:
        labels:
          severity: warning
```

```bash
sloth generate -i sloth-slo.yaml -o prometheus-rules.yaml
```


References
==========

- Google SRE Book: https://sre.google/sre-book/service-level-objectives/
- Sloth: https://sloth.dev
- OpenSLO: https://openslo.com


========================================
SLOs + Burn Rate Alerts + Prometheus
========================================
Alert on impact. Not on causes.
========================================</content:encoded><category>slo</category><category>sre</category><category>alerting</category><category>prometheus</category><category>observability</category><category>reliability</category><author>Mo Abukar</author></item><item><title>OpenTelemetry Collector Pipelines: Transform, Filter, Route Telemetry</title><link>https://moabukar.co.uk/blog/opentelemetry-collector-pipelines/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/opentelemetry-collector-pipelines/</guid><description>Master OpenTelemetry Collector configuration. Build pipelines to transform metrics, filter traces, route logs, and reduce telemetry costs.</description><pubDate>Wed, 26 Nov 2025 00:00:00 GMT</pubDate><content:encoded>OpenTelemetry Collector Pipelines: Transform, Filter, Route
============================================================

The OpenTelemetry Collector is the Swiss Army knife of telemetry.
It receives, processes, and exports traces, metrics, and logs.
This guide covers building production pipelines.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/opentelemetry.svg&quot; alt=&quot;OpenTelemetry logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Collector = vendor-agnostic telemetry pipeline
- Receivers = ingest data (OTLP, Prometheus, etc.)
- Processors = transform, filter, batch, sample
- Exporters = send to backends (Prometheus, Jaeger, etc.)
- Connectors = route between pipelines


Architecture
============

```
┌─────────────────────────────────────────────────────────────────┐
│                    OpenTelemetry Collector                       │
│                                                                  │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────────────┐ │
│  │  Receivers   │──▶│  Processors  │──▶│     Exporters        │ │
│  │              │   │              │   │                      │ │
│  │ - OTLP       │   │ - batch      │   │ - otlp (Tempo)       │ │
│  │ - prometheus │   │ - filter     │   │ - prometheus         │ │
│  │ - filelog    │   │ - transform  │   │ - loki               │ │
│  │ - jaeger     │   │ - tail_sample│   │ - datadog            │ │
│  └──────────────┘   └──────────────┘   └──────────────────────┘ │
│                                                                  │
└─────────────────────────────────────────────────────────────────┘
```


Basic Configuration
===================

```yaml
# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318

processors:
  batch:
    timeout: 5s
    send_batch_size: 1000

exporters:
  otlp:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```


Metrics Pipeline
================

Prometheus + Remote Write
-------------------------

```yaml
receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod
          relabel_configs:
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
              action: keep
              regex: true
            - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
              action: replace
              target_label: __metrics_path__
              regex: (.+)
            - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
              action: replace
              regex: ([^:]+)(?::\d+)?;(\d+)
              replacement: $$1:$$2
              target_label: __address__

processors:
  batch:
    timeout: 10s
  
  # Add cluster label
  resource:
    attributes:
      - key: cluster
        value: production
        action: upsert
  
  # Filter out high-cardinality metrics
  filter:
    metrics:
      exclude:
        match_type: regexp
        metric_names:
          - &quot;.*_bucket&quot;  # Exclude histogram buckets
          - &quot;go_.*&quot;      # Exclude Go runtime metrics

exporters:
  prometheusremotewrite:
    endpoint: https://prometheus.company.com/api/v1/write
    headers:
      Authorization: Bearer ${PROM_TOKEN}

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [resource, filter, batch]  # batch last, after filtering
      exporters: [prometheusremotewrite]
```


Traces Pipeline
===============

Tail Sampling
-------------

Sample traces intelligently based on content:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
  
  # Memory limiter to prevent OOM
  memory_limiter:
    check_interval: 1s
    limit_mib: 1000
    spike_limit_mib: 200
  
  # Tail-based sampling
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 1000
    policies:
      # Always sample errors
      - name: errors
        type: status_code
        status_code:
          status_codes: [ERROR]
      
      # Always sample slow traces
      - name: slow-traces
        type: latency
        latency:
          threshold_ms: 1000
      
      # Sample 10% of everything else
      - name: probabilistic
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
      
      # Always sample specific operations
      - name: important-operations
        type: string_attribute
        string_attribute:
          key: http.route
          values:
            - /api/payments
            - /api/checkout
          enabled_regex_matching: false

exporters:
  otlp:
    endpoint: tempo.monitoring:4317
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, tail_sampling, batch]
      exporters: [otlp]
```
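
The policies above are evaluated independently, and a trace is kept as soon as any one of them decides to sample it. To require several conditions at once, the tail_sampling processor also offers an `and` policy type. A minimal sketch, assuming a hypothetical 500 ms threshold for checkout requests (policy names are illustrative, not from the original setup):

```yaml
processors:
  tail_sampling:
    decision_wait: 10s
    policies:
      # Keep a trace only when BOTH sub-policies match
      - name: slow-checkout            # hypothetical policy name
        type: and
        and:
          and_sub_policy:
            - name: checkout-route
              type: string_attribute
              string_attribute:
                key: http.route
                values:
                  - /api/checkout
            - name: over-500ms
              type: latency
              latency:
                threshold_ms: 500      # assumed threshold
```

Because policies are OR-ed, the 10% probabilistic policy acts as a floor: errors, slow traces, and the listed routes are kept on top of it.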


Logs Pipeline
=============

File to Loki
------------

```yaml
receivers:
  filelog:
    include:
      - /var/log/pods/*/*/*.log
    include_file_path: true
    operators:
      # Parse container runtime format
      - type: regex_parser
        regex: &apos;^(?P&lt;time&gt;\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+Z) (?P&lt;stream&gt;stdout|stderr) (?P&lt;logtag&gt;[^ ]*) (?P&lt;log&gt;.*)$&apos;
        timestamp:
          parse_from: attributes.time
          layout: &apos;%Y-%m-%dT%H:%M:%S.%LZ&apos;
      
      # Parse JSON logs
      - type: json_parser
        parse_from: attributes.log
        if: &apos;attributes.log matches &quot;^\\{&quot;&apos;
      
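      # Extract Kubernetes metadata into resource attributes
      # (the resource processor reads from_attribute from the resource, not the log record)
      - type: regex_parser
        regex: &apos;^/var/log/pods/(?P&lt;namespace&gt;[^_]+)_(?P&lt;pod&gt;[^_]+)_[^/]+/(?P&lt;container&gt;[^/]+)/&apos;
        parse_from: attributes[&quot;log.file.path&quot;]
        parse_to: resource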

processors:
  batch:
    timeout: 5s
  
  # Add resource attributes
  resource:
    attributes:
      - key: service.name
        from_attribute: container
        action: upsert
      - key: k8s.namespace.name
        from_attribute: namespace
        action: upsert
  
  # Filter out noisy logs
  filter:
    logs:
      exclude:
        match_type: regexp
        bodies:
          - &quot;.*health.*check.*&quot;
          - &quot;.*readiness.*probe.*&quot;

exporters:
  loki:
    endpoint: http://loki.monitoring:3100/loki/api/v1/push
    labels:
      resource:
        service.name: service
        k8s.namespace.name: namespace
      attributes:
        level: level

service:
  pipelines:
    logs:
      receivers: [filelog]
      processors: [resource, filter, batch]  # batch last, after filtering
      exporters: [loki]
```


Multi-Destination Routing
=========================

```yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 5s
  
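  # Route by the tenant resource attribute
  # (without attribute_source, the processor looks in request context metadata)
  routing:
    from_attribute: tenant
    attribute_source: resource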
    default_exporters: [otlp/default]
    table:
      - value: tenant-a
        exporters: [otlp/tenant-a]
      - value: tenant-b
        exporters: [otlp/tenant-b]

exporters:
  otlp/default:
    endpoint: tempo-default.monitoring:4317
  otlp/tenant-a:
    endpoint: tempo-a.tenant-a:4317
  otlp/tenant-b:
    endpoint: tempo-b.tenant-b:4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch, routing]
      exporters: [otlp/default, otlp/tenant-a, otlp/tenant-b]
```
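
Newer contrib releases deprecate the routing processor in favour of the routing connector, which is the &quot;Connectors&quot; bullet from the TL;DR in practice. A rough equivalent of the config above, assuming the same receivers and exporters and a `tenant` resource attribute (the pipeline names are illustrative):

```yaml
connectors:
  routing:
    # Pipelines that receive data when no statement matches
    default_pipelines: [traces/default]
    table:
      # Statements are OTTL and are evaluated against resource attributes
      - statement: route() where attributes[&quot;tenant&quot;] == &quot;tenant-a&quot;
        pipelines: [traces/tenant-a]
      - statement: route() where attributes[&quot;tenant&quot;] == &quot;tenant-b&quot;
        pipelines: [traces/tenant-b]

service:
  pipelines:
    # Ingest pipeline: the connector acts as its exporter
    traces/in:
      receivers: [otlp]
      processors: [batch]
      exporters: [routing]
    # Per-tenant pipelines: the connector acts as their receiver
    traces/default:
      receivers: [routing]
      exporters: [otlp/default]
    traces/tenant-a:
      receivers: [routing]
      exporters: [otlp/tenant-a]
    traces/tenant-b:
      receivers: [routing]
      exporters: [otlp/tenant-b]
```

Each tenant pipeline can then carry its own processors, which the single-pipeline processor approach does not allow.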


Transform Processor
===================

```yaml
processors:
  transform:
    trace_statements:
      - context: span
        statements:
          # Rename attribute
          - set(attributes[&quot;http.method&quot;], attributes[&quot;http.request.method&quot;])
          - delete_key(attributes, &quot;http.request.method&quot;)
          
          # Truncate long values
          - truncate_all(attributes, 256)
          
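          # Hash sensitive data (skip spans that lack the attribute)
          - set(attributes[&quot;user.id&quot;], SHA256(attributes[&quot;user.id&quot;])) where attributes[&quot;user.id&quot;] != nil

          # Add derived attribute (OTTL only allows comparisons in where clauses)
          - set(attributes[&quot;is_error&quot;], true) where status.code == STATUS_CODE_ERROR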
    
    metric_statements:
      - context: datapoint
        statements:
          # Convert units
          - set(attributes[&quot;duration_seconds&quot;], attributes[&quot;duration_ms&quot;] / 1000.0)
    
    log_statements:
      - context: log
        statements:
          # Parse severity
          - set(severity_number, SEVERITY_NUMBER_ERROR) where IsMatch(body, &quot;(?i)error&quot;)
          - set(severity_number, SEVERITY_NUMBER_WARN) where IsMatch(body, &quot;(?i)warn&quot;)
```


Kubernetes Deployment
=====================

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: otel-collector-agent
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: otel-collector-agent
  template:
    metadata:
      labels:
        app: otel-collector-agent
    spec:
      serviceAccountName: otel-collector
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.91.0
          args:
            - --config=/etc/otel/config.yaml
          ports:
            - containerPort: 4317
              hostPort: 4317
            - containerPort: 4318
              hostPort: 4318
          volumeMounts:
            - name: config
              mountPath: /etc/otel
            - name: varlog
              mountPath: /var/log
              readOnly: true
          resources:
            requests:
              cpu: 100m
              memory: 256Mi
            limits:
              cpu: 500m
              memory: 512Mi
      volumes:
        - name: config
          configMap:
            name: otel-collector-config
        - name: varlog
          hostPath:
            path: /var/log
```
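
The DaemonSet mounts a ConfigMap named `otel-collector-config` at `/etc/otel` and starts the collector with `--config=/etc/otel/config.yaml`, so the ConfigMap needs a `config.yaml` key. A minimal sketch, assuming the agent only forwards traces to the gateway over plaintext gRPC (the pipeline contents are an assumption, not the original config):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config   # matches the configMap volume in the DaemonSet
  namespace: monitoring
data:
  # The key becomes the file name: /etc/otel/config.yaml
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
    exporters:
      otlp:
        endpoint: otel-gateway.monitoring:4317
        tls:
          insecure: true   # assumes plaintext traffic inside the cluster
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [batch]
          exporters: [otlp]
```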


Gateway Pattern
===============

```yaml
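# Agent (DaemonSet) -&gt; Gateway (Deployment) -&gt; Backends
# Two separate config files, shown together for brevity

# agent-config.yaml: forward everything to the gateway
exporters:
  otlp:
    endpoint: otel-gateway.monitoring:4317

# gateway-config.yaml: sample, batch, and fan out to backends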
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

processors:
  batch:
    timeout: 10s
    send_batch_size: 10000
  
  tail_sampling:
    # Sampling config here

exporters:
  otlp/tempo:
    endpoint: tempo.monitoring:4317
  prometheusremotewrite:
    endpoint: https://prometheus.company.com/api/v1/write
  loki:
    endpoint: http://loki.monitoring:3100/loki/api/v1/push
```
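
One caveat with tail sampling on the gateway: the sampling decision needs every span of a trace to arrive at the same collector instance. With more than one gateway replica, agents can use the loadbalancing exporter to keep traces together. A sketch, assuming a hypothetical headless Service called `otel-gateway-headless` in front of the gateway pods:

```yaml
# Agent config: send all spans of a trace to the same gateway replica
exporters:
  loadbalancing:
    routing_key: traceID
    protocol:
      otlp:
        tls:
          insecure: true   # assumes plaintext traffic inside the cluster
    resolver:
      dns:
        # Hypothetical headless Service that resolves to each gateway pod IP
        hostname: otel-gateway-headless.monitoring.svc.cluster.local
        port: 4317

service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [loadbalancing]
```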


References
==========

- OTel Collector Docs: https://opentelemetry.io/docs/collector
- Contrib Receivers: https://github.com/open-telemetry/opentelemetry-collector-contrib
- Configuration: https://opentelemetry.io/docs/collector/configuration


========================================
OpenTelemetry Collector + Pipelines
========================================
Receive. Transform. Export. Observe.
========================================</content:encoded><category>opentelemetry</category><category>observability</category><category>metrics</category><category>traces</category><category>logs</category><category>collector</category><author>Mo Abukar</author></item><item><title>Blameless Culture is Harder Than You Think</title><link>https://moabukar.co.uk/blog/blameless-culture/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/blameless-culture/</guid><description>Everyone claims to have a blameless culture. Few actually do. Here&apos;s what real blamelessness looks like and why it&apos;s so difficult to achieve.</description><pubDate>Sat, 22 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Every tech company claims to have a blameless culture. It&apos;s in the values deck. It&apos;s mentioned in interviews. Post-mortems are labelled &quot;blameless&quot; by default.

And yet, when something breaks, people get blamed. Not officially. Not in writing. But in subtle ways that everyone notices.

True blamelessness is rare because it&apos;s genuinely hard. It requires fighting human instincts, changing organisational incentives, and maintaining discipline when emotions run high.

![Blameless Culture is Harder Than You Think](/images/blameless-culture.png)


## What Blame Actually Looks Like

Blame isn&apos;t always obvious. It hides in language and behaviour:

**The pointed question.** &quot;Why didn&apos;t you test this before deploying?&quot; The question has an answer - there&apos;s a reason. But the tone implies the person should have known better.

**The disappointed sigh.** No words needed. Body language does the work.

**The &quot;learning opportunity.&quot;** &quot;This is a learning opportunity for the team&quot; often means &quot;someone screwed up and we&apos;re being polite about it.&quot;

**The reassignment.** After an incident, someone quietly gets moved off the project. No explicit blame, but everyone knows.

**The repeated story.** &quot;Remember when that deployment took down production?&quot; becomes organisational folklore, forever associated with whoever made the change.

**The hiring filter.** &quot;We need someone more senior for this system&quot; after an incident. The current engineer was fine last week.

These aren&apos;t firings or formal reprimands. They&apos;re subtle signals that shape behaviour far more than any official policy.

## Why Blame Feels Right

Blame is instinctive. Something broke. Someone did something. Cause and effect. Holding people accountable seems reasonable.

But this instinct is wrong for complex systems.

Complex systems fail for complex reasons. The engineer who pushed the bad config didn&apos;t create the system that allowed bad configs to be pushed. They didn&apos;t write the inadequate tests, design the missing guardrails, or create the time pressure that led to skipping review.

Blaming the individual ignores the system. And if the system doesn&apos;t change, the same failure will happen again, just with a different person.

The question isn&apos;t &quot;who screwed up?&quot; It&apos;s &quot;what allowed this to happen, and how do we prevent it?&quot;

## The Cost of Blame

Blame cultures pay hidden costs:

**People hide problems.** If reporting an issue gets you blamed, you learn to stay quiet. Small problems become big problems because nobody wants to be the messenger.

**Risk aversion kills velocity.** Every change is a potential career threat. People deploy less, experiment less, and move slower. &quot;Don&apos;t break anything&quot; becomes the unspoken priority.

**Post-mortems become useless.** When blame is possible, people protect themselves. They minimise their involvement, blame external factors, and avoid saying anything that could be used against them. You learn nothing.

**Good people leave.** Talented engineers have options. They don&apos;t stay where mistakes end careers.

**Learning stops.** Organisations that blame don&apos;t improve. They just get better at hiding failure.

## What Blamelessness Actually Requires

Creating a blameless culture isn&apos;t about declaring it. It&apos;s about building systems and behaviours that make it real.

**Language discipline.** Ban &quot;why didn&apos;t you&quot; questions. Replace with &quot;what made this possible&quot; and &quot;how might we prevent this.&quot; It sounds pedantic, but language shapes thinking.

**Assume competence.** The person who made the mistake was trying to do their job well. If they made a mistake, the system failed to prevent it. Start from this assumption.

**Separate the person from the action.** &quot;The deployment caused an outage&quot; not &quot;John caused an outage.&quot; The action happened. It&apos;s not someone&apos;s identity.

**Leadership models behaviour.** When leaders openly own their mistakes, others do too. When leaders deflect, others learn to deflect. You get the culture you demonstrate.

**Consequences for blaming.** If someone publicly blames a colleague, address it. Blamelessness requires active maintenance.

## Post-Mortems as Practice

Post-mortems are where blamelessness is tested. Every incident is a choice: learn or blame.

Structure post-mortems to make blame hard:

**Focus on timeline, not people.** &quot;At 14:32, the deployment completed&quot; not &quot;At 14:32, Sarah deployed.&quot;

**Ask systemic questions.** &quot;What process allowed this?&quot; &quot;What safeguard was missing?&quot; &quot;What information would have changed the decision?&quot;

**Explore counterfactuals.** &quot;If a different person had been on call, would the outcome differ?&quot; Usually the answer is no - which points to a systemic cause rather than a personal one.

**Name contributing factors, not culprits.** Time pressure, missing documentation, inadequate testing environments. These are systemic issues with systemic solutions.

**Distribute the post-mortem widely.** Transparency signals that this is about learning, not punishment. If you&apos;re hiding the post-mortem, ask why.

## The Accountability Objection

&quot;But people need to be accountable!&quot; This objection comes up every time.

Accountability and blamelessness aren&apos;t opposites. You can hold people to high standards without blaming them when complex systems fail.

Accountability means:
- Clear expectations communicated in advance
- Feedback on performance patterns over time
- Development plans for growth areas
- Consequences for repeated negligence or malice

What it doesn&apos;t mean:
- Punishment for single incidents in complex systems
- Career damage for honest mistakes
- Public shaming after outages

The distinction: patterns versus incidents. Someone who repeatedly ignores warnings, skips reviews, and refuses to learn has an accountability problem. Someone who made a mistake in a system that allowed the mistake isn&apos;t negligent - they&apos;re human.

## When Someone Really Did Screw Up

What about genuine negligence? The person who deployed drunk. Who ignored explicit warnings. Who deliberately bypassed safeguards.

These cases are rare. When they happen, address them directly and privately. Don&apos;t use the post-mortem for discipline.

The post-mortem is still blameless: &quot;The deployment bypassed the standard review process. We need to understand how this was possible and prevent it.&quot;

Separately, HR handles the personnel issue.

Mixing discipline and learning corrupts both.

## Building the Muscle

Blamelessness is a practice. You get better with repetition.

**Start with small incidents.** Practice on low-stakes failures. Build the habit before emotions run high.

**Appoint a blamelessness advocate.** In post-mortems, one person watches for blame language and redirects. Rotate this role.

**Celebrate good post-mortems.** When a post-mortem leads to real improvements, recognise it publicly.

**Review past incidents.** Look back at older post-mortems. Were they blameless? What would you do differently?

**Train new hires.** Explain the culture explicitly. Don&apos;t assume they&apos;ll absorb it.

## The Long Game

Blameless culture takes years to build and moments to destroy. One public blame incident undoes months of trust-building.

It&apos;s worth the effort. Teams with genuine blamelessness:
- Find and fix problems faster
- Experiment and learn more
- Retain better people
- Build more reliable systems

The irony is that blameless cultures have fewer incidents to be blameless about. The learning compounds.

Start today. Review your last post-mortem. Was it truly blameless? What would you change?

The answer tells you where you actually stand.</content:encoded><category>engineering-culture</category><category>post-mortems</category><category>incident-management</category><category>leadership</category><category>psychological-safety</category><author>Mo Abukar</author></item><item><title>Chaos Engineering with Litmus: Controlled Failure Injection</title><link>https://moabukar.co.uk/blog/chaos-engineering-litmus/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/chaos-engineering-litmus/</guid><description>Implement chaos engineering in Kubernetes with LitmusChaos. Run pod failures, network chaos, and stress tests to validate system resilience.</description><pubDate>Sat, 22 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Chaos Engineering with Litmus: Controlled Failure Injection
============================================================

Hope is not a strategy. Chaos engineering tests whether your system
can handle failures before a production incident does it for you.
LitmusChaos is a Kubernetes-native chaos engineering platform.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/litmus.svg&quot; alt=&quot;Litmus logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- LitmusChaos = K8s-native chaos experiments
- ChaosHub = library of pre-built experiments
- Pod, network, node, and application-level chaos
- Integrates with CI/CD for automated resilience testing
- Full examples with GameDay patterns


Install LitmusChaos
===================

```bash
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
helm upgrade --install litmus litmuschaos/litmus \
  --namespace litmus --create-namespace \
  --set portal.frontend.service.type=ClusterIP
```


Pod Chaos Experiments
=====================

Pod Delete
----------

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: pod-delete-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: &quot;60&quot;
            - name: CHAOS_INTERVAL
              value: &quot;10&quot;
            - name: FORCE
              value: &quot;false&quot;
            - name: PODS_AFFECTED_PERC
              value: &quot;50&quot;
```


Container Kill
--------------

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: container-kill-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: container-kill
      spec:
        components:
          env:
            - name: TARGET_CONTAINER
              value: api
            - name: TOTAL_CHAOS_DURATION
              value: &quot;30&quot;
            - name: CHAOS_INTERVAL
              value: &quot;10&quot;
```


Network Chaos
=============

Network Latency
---------------

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-latency-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-latency
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_LATENCY
              value: &quot;200&quot;
            - name: TOTAL_CHAOS_DURATION
              value: &quot;60&quot;
            # TARGET_PODS expects pod names, so select by percentage instead
            - name: PODS_AFFECTED_PERC
              value: &quot;0&quot;  # 0 targets a single pod
            - name: DESTINATION_IPS
              value: &quot;&quot;
            - name: DESTINATION_HOSTS
              value: &quot;postgres.production.svc.cluster.local&quot;
```


Network Loss
------------

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: network-loss-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-network-loss
      spec:
        components:
          env:
            - name: NETWORK_INTERFACE
              value: eth0
            - name: NETWORK_PACKET_LOSS_PERCENTAGE
              value: &quot;30&quot;
            - name: TOTAL_CHAOS_DURATION
              value: &quot;60&quot;
```


DNS Chaos
---------

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: dns-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-dns-error
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: &quot;60&quot;
            # Hostnames are passed as a JSON list
            - name: TARGET_HOSTNAMES
              value: &apos;[&quot;api.external.com&quot;,&quot;database.internal&quot;]&apos;
```


Resource Stress
===============

CPU Stress
----------

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: cpu-stress-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-cpu-hog
      spec:
        components:
          env:
            - name: CPU_CORES
              value: &quot;2&quot;
            - name: TOTAL_CHAOS_DURATION
              value: &quot;60&quot;
            - name: CPU_LOAD
              value: &quot;80&quot;
```


Memory Stress
-------------

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: memory-stress-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-memory-hog
      spec:
        components:
          env:
            - name: MEMORY_CONSUMPTION
              value: &quot;500&quot;
            - name: TOTAL_CHAOS_DURATION
              value: &quot;60&quot;
```


Disk Fill
---------

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: disk-fill-chaos
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: disk-fill
      spec:
        components:
          env:
            - name: FILL_PERCENTAGE
              value: &quot;80&quot;
            - name: TOTAL_CHAOS_DURATION
              value: &quot;60&quot;
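            # Host path where the container runtime stores container data,
            # not a path inside the app (Litmus default shown; adjust for containerd)
            - name: CONTAINER_PATH
              value: &quot;/var/lib/docker/containers&quot;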
```


CI/CD Integration
=================

GitHub Actions
--------------

```yaml
name: Chaos Tests
on:
  schedule:
    - cron: &apos;0 3 * * *&apos;  # Daily at 03:00 UTC
  workflow_dispatch:

jobs:
  chaos-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup kubectl
        uses: azure/setup-kubectl@v3
      
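      - name: Configure kubeconfig
        run: |
          # Write the kubeconfig and expose it to later steps via KUBECONFIG
          echo &quot;${{ secrets.KUBECONFIG }}&quot; | base64 -d &gt; &quot;$RUNNER_TEMP/kubeconfig&quot;
          echo &quot;KUBECONFIG=$RUNNER_TEMP/kubeconfig&quot; &gt;&gt; &quot;$GITHUB_ENV&quot;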
        
      - name: Run Chaos Experiment
        run: |
          kubectl apply -f chaos/pod-delete.yaml
          
          # Wait for the engine to report completion (needs kubectl 1.23+)
          kubectl wait --for=jsonpath=&apos;{.status.engineStatus}&apos;=completed \
            chaosengine/pod-delete-chaos -n production \
            --timeout=300s
      
      - name: Check Result
        run: |
          RESULT=$(kubectl get chaosresult pod-delete-chaos-pod-delete \
            -n production -o jsonpath=&apos;{.status.experimentStatus.verdict}&apos;)
          
          if [ &quot;$RESULT&quot; != &quot;Pass&quot; ]; then
            echo &quot;Chaos experiment failed!&quot;
            exit 1
          fi
      
      - name: Cleanup
        if: always()
        run: kubectl delete chaosengine pod-delete-chaos -n production
```


GameDay Workflow
================

```yaml
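# Litmus runs multi-step scenarios as Argo Workflows; chaos steps create ChaosEngines
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  name: gameday-workflow
  namespace: litmus
spec:
  entrypoint: gameday
  serviceAccountName: argo-chaos  # assumed SA that is allowed to create ChaosEngines
  templates:
    - name: gameday
      dag:
        tasks:
          - name: verify-baseline
            template: verify-baseline

          - name: pod-failure
            template: pod-failure
            dependencies: [verify-baseline]

          # Recovery check reuses the baseline health check
          - name: verify-recovery
            template: verify-baseline
            dependencies: [pod-failure]

          - name: network-chaos
            template: network-chaos
            dependencies: [verify-recovery]

          - name: final-verification
            template: verify-baseline
            dependencies: [network-chaos]

    - name: verify-baseline
      container:
        image: curlimages/curl
        command: [&quot;/bin/sh&quot;, &quot;-c&quot;]
        args:
          - |
            for i in $(seq 1 10); do
              STATUS=$(curl -s -o /dev/null -w &quot;%{http_code}&quot; http://api.production.svc:8080/health)
              if [ &quot;$STATUS&quot; != &quot;200&quot; ]; then
                echo &quot;Health check failed: $STATUS&quot;
                exit 1
              fi
              sleep 2
            done
            echo &quot;Baseline verified&quot;

    - name: pod-failure
      resource:
        action: create
        # Step succeeds once the engine reports completion
        successCondition: status.engineStatus == completed
        manifest: |
          apiVersion: litmuschaos.io/v1alpha1
          kind: ChaosEngine
          metadata:
            name: gameday-pod-delete
            namespace: production
          spec:
            appinfo:
              appns: production
              applabel: app=api-server
              appkind: deployment
            engineState: active
            chaosServiceAccount: litmus-admin
            experiments:
              - name: pod-delete
                spec:
                  components:
                    env:
                      - name: TOTAL_CHAOS_DURATION
                        value: &quot;30&quot;
                      - name: PODS_AFFECTED_PERC
                        value: &quot;100&quot;

    - name: network-chaos
      resource:
        action: create
        successCondition: status.engineStatus == completed
        manifest: |
          apiVersion: litmuschaos.io/v1alpha1
          kind: ChaosEngine
          metadata:
            name: gameday-network-latency
            namespace: production
          spec:
            appinfo:
              appns: production
              applabel: app=api-server
              appkind: deployment
            engineState: active
            chaosServiceAccount: litmus-admin
            experiments:
              - name: pod-network-latency
                spec:
                  components:
                    env:
                      - name: NETWORK_LATENCY
                        value: &quot;500&quot;
                      - name: TOTAL_CHAOS_DURATION
                        value: &quot;60&quot;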


Hypothesis-Driven Testing
=========================

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: hypothesis-test
  namespace: production
  annotations:
    hypothesis: &quot;System maintains 99.9% availability when 50% of pods are killed&quot;
    success-criteria: &quot;Error rate &lt; 0.1% during chaos&quot;
spec:
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  engineState: active
  chaosServiceAccount: litmus-admin
  
  # Probes to validate hypothesis
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: availability-check
            type: httpProbe
            httpProbe/inputs:
              url: http://api.production.svc:8080/health
              insecureSkipVerify: false
              method:
                get:
                  criteria: ==
                  responseCode: &quot;200&quot;
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 2
              retry: 3
              probePollingInterval: 1
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: &quot;120&quot;
            - name: PODS_AFFECTED_PERC
              value: &quot;50&quot;
```
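
The health probe above only checks that `/health` returns 200, while the annotated success criterion is an error rate below 0.1%. A promProbe can test that criterion directly. A sketch that would sit next to `availability-check` in the same probe list, assuming Prometheus is reachable at `prometheus.monitoring:9090` and the service exposes an `http_requests_total` counter with a `status` label:

```yaml
        probe:
          - name: error-rate-check   # hypothetical probe name
            type: promProbe
            promProbe/inputs:
              endpoint: http://prometheus.monitoring:9090
              # Share of requests answered with a 5xx over the last minute
              query: sum(rate(http_requests_total{status=~&quot;5..&quot;}[1m])) / sum(rate(http_requests_total[1m]))
              comparator:
                criteria: &quot;&lt;&quot;
                value: &quot;0.001&quot;  # 0.1%
            mode: Continuous
            runProperties:
              probeTimeout: 5
              interval: 5
              retry: 2
```

With both probes in place, the verdict reflects the stated hypothesis rather than endpoint liveness alone.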


Observability Integration
=========================

```yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: observable-chaos
  namespace: production
spec:
  engineState: active
  chaosServiceAccount: litmus-admin
  monitoring: true
  jobCleanUpPolicy: retain
  appinfo:
    appns: production
    applabel: app=api-server
    appkind: deployment
  experiments:
    - name: pod-delete
      spec:
        probe:
          - name: prometheus-check
            type: promProbe
            promProbe/inputs:
              endpoint: http://prometheus.monitoring:9090
              query: sum(rate(http_requests_total{status=~&quot;5..&quot;}[1m]))
              comparator:
                type: float
                criteria: &quot;&lt;=&quot;
                value: &quot;0.01&quot;
            mode: Edge
            runProperties:
              probeTimeout: 5
              interval: 5
              retry: 2
```


References
==========

- LitmusChaos: https://litmuschaos.io
- ChaosHub: https://hub.litmuschaos.io
- Principles: https://principlesofchaos.org


========================================
LitmusChaos + Kubernetes
========================================
Break it in testing. Not in production.
========================================</content:encoded><category>chaos-engineering</category><category>litmus</category><category>kubernetes</category><category>reliability</category><category>sre</category><category>testing</category><author>Mo Abukar</author></item><item><title>LocalStack Deep Dive - AWS on Your Laptop</title><link>https://moabukar.co.uk/blog/localstack-deep-dive/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/localstack-deep-dive/</guid><description>Run AWS services locally for faster development and testing. A practical guide to LocalStack covering S3, Lambda, DynamoDB, SQS, and integration testing patterns.</description><pubDate>Thu, 20 Nov 2025 00:00:00 GMT</pubDate><content:encoded># LocalStack Deep Dive - AWS on Your Laptop

Developing against AWS is expensive. Not just in cloud costs, but in feedback time. Deploy Lambda, wait, test, fail, redeploy, wait again.

LocalStack emulates AWS services locally. S3, Lambda, DynamoDB, SQS - running on your laptop. Changes take seconds, not minutes. Tests run without hitting real AWS. No credentials needed.

This is how fast cloud development should feel.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/aws.svg&quot; alt=&quot;AWS logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## TL;DR

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/localstack-deep-dive](https://github.com/moabukar/blog-code/tree/main/localstack-deep-dive)

- LocalStack emulates 80+ AWS services locally
- Free tier covers most common services
- Perfect for development and integration testing
- Works with standard AWS SDKs and CLI
- Docker-based, runs anywhere

---

## Getting Started

### Installation

```bash
# Using pip
pip install localstack

# Start LocalStack
localstack start

# Or use Docker directly
docker run -d \
  --name localstack \
  -p 4566:4566 \
  -p 4510-4559:4510-4559 \
  -e DEBUG=1 \
  localstack/localstack
```

### Docker Compose (Recommended)

```yaml
# docker-compose.yml
version: &apos;3.8&apos;

services:
  localstack:
    image: localstack/localstack:latest
    ports:
      - &quot;4566:4566&quot;            # LocalStack Gateway
      - &quot;4510-4559:4510-4559&quot;  # External services port range
    environment:
      - DEBUG=1
      - DOCKER_HOST=unix:///var/run/docker.sock
      - PERSISTENCE=1          # Persist data between restarts
    volumes:
      - &quot;./localstack-data:/var/lib/localstack&quot;
      - &quot;/var/run/docker.sock:/var/run/docker.sock&quot;
```

```bash
docker-compose up -d
```

### Configure AWS CLI

```bash
# Create a LocalStack profile
aws configure --profile localstack
# AWS Access Key ID: test
# AWS Secret Access Key: test
# Default region: us-east-1
# Default output format: json

# Or use environment variables
export AWS_ACCESS_KEY_ID=test
export AWS_SECRET_ACCESS_KEY=test
export AWS_DEFAULT_REGION=us-east-1
export AWS_ENDPOINT_URL=http://localhost:4566
```

---

## S3: Object Storage

```bash
# Create bucket
aws --endpoint-url=http://localhost:4566 s3 mb s3://my-bucket

# Upload file
aws --endpoint-url=http://localhost:4566 s3 cp myfile.txt s3://my-bucket/

# List objects
aws --endpoint-url=http://localhost:4566 s3 ls s3://my-bucket/

# Generate presigned URL
aws --endpoint-url=http://localhost:4566 s3 presign s3://my-bucket/myfile.txt
```

### Python SDK

```python
import boto3

# Create S3 client pointing to LocalStack
s3 = boto3.client(
    &apos;s3&apos;,
    endpoint_url=&apos;http://localhost:4566&apos;,
    aws_access_key_id=&apos;test&apos;,
    aws_secret_access_key=&apos;test&apos;,
    region_name=&apos;us-east-1&apos;
)

# Create bucket
s3.create_bucket(Bucket=&apos;my-bucket&apos;)

# Upload file
s3.upload_file(&apos;local_file.txt&apos;, &apos;my-bucket&apos;, &apos;remote_key.txt&apos;)

# Download file
s3.download_file(&apos;my-bucket&apos;, &apos;remote_key.txt&apos;, &apos;downloaded.txt&apos;)
```
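
The connection arguments above get repeated for every client. A minimal sketch of a helper that centralises them, assuming the dummy credentials and the `AWS_ENDPOINT_URL` variable from the CLI section. The name `localstack_client_kwargs` is hypothetical and not part of boto3:

```python
import os


def localstack_client_kwargs(env=None):
    """Build keyword arguments for boto3.client() or boto3.resource().

    Hypothetical helper: falls back to the LocalStack defaults used in this
    post when the corresponding environment variables are not set.
    """
    env = os.environ if env is None else env
    return {
        "endpoint_url": env.get("AWS_ENDPOINT_URL", "http://localhost:4566"),
        "aws_access_key_id": env.get("AWS_ACCESS_KEY_ID", "test"),
        "aws_secret_access_key": env.get("AWS_SECRET_ACCESS_KEY", "test"),
        "region_name": env.get("AWS_DEFAULT_REGION", "us-east-1"),
    }
```

With boto3 installed, a client would then be created as `boto3.client("s3", **localstack_client_kwargs())`, and the same call works in CI where the endpoint comes from the environment.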

---

## Lambda: Serverless Functions

### Deploy a Lambda

```bash
# Create function code
cat &gt; handler.py &lt;&lt; &apos;EOF&apos;
def lambda_handler(event, context):
    name = event.get(&apos;name&apos;, &apos;World&apos;)
    return {
        &apos;statusCode&apos;: 200,
        &apos;body&apos;: f&apos;Hello, {name}!&apos;
    }
EOF

# Zip it
zip function.zip handler.py

# Create Lambda function
aws --endpoint-url=http://localhost:4566 lambda create-function \
  --function-name hello-function \
  --runtime python3.9 \
  --handler handler.lambda_handler \
  --zip-file fileb://function.zip \
  --role arn:aws:iam::000000000000:role/lambda-role

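# Invoke it (AWS CLI v2 needs --cli-binary-format to accept a raw JSON payload)
aws --endpoint-url=http://localhost:4566 lambda invoke \
  --function-name hello-function \
  --cli-binary-format raw-in-base64-out \
  --payload &apos;{&quot;name&quot;: &quot;LocalStack&quot;}&apos; \
  output.txt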

cat output.txt
# {&quot;statusCode&quot;: 200, &quot;body&quot;: &quot;Hello, LocalStack!&quot;}
```

### Lambda with S3 Trigger

```bash
# Create S3 bucket notification
aws --endpoint-url=http://localhost:4566 s3api put-bucket-notification-configuration \
  --bucket my-bucket \
  --notification-configuration &apos;{
    &quot;LambdaFunctionConfigurations&quot;: [{
      &quot;LambdaFunctionArn&quot;: &quot;arn:aws:lambda:us-east-1:000000000000:function:hello-function&quot;,
      &quot;Events&quot;: [&quot;s3:ObjectCreated:*&quot;]
    }]
  }&apos;

# Now uploading to S3 triggers the Lambda
aws --endpoint-url=http://localhost:4566 s3 cp test.txt s3://my-bucket/
```
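
The `hello-function` handler only reads a `name` field, so it ignores what was actually uploaded. A minimal sketch of a handler that reads the notification payload instead, assuming the standard S3 event shape (`Records[].s3.bucket.name` and `Records[].s3.object.key`):

```python
from urllib.parse import unquote_plus


def lambda_handler(event, context):
    """Return the bucket/key pairs found in an S3 notification event."""
    uploaded = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        uploaded.append({
            "bucket": s3_info["bucket"]["name"],
            # Object keys arrive URL-encoded in S3 events (spaces become "+")
            "key": unquote_plus(s3_info["object"]["key"]),
        })
    return {"statusCode": 200, "uploaded": uploaded}
```

Decoding the key matters as soon as file names contain spaces or special characters; without it, a follow-up `get_object` call would look for the wrong key.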

---

## DynamoDB: NoSQL Database

```bash
# Create table
aws --endpoint-url=http://localhost:4566 dynamodb create-table \
  --table-name Users \
  --attribute-definitions AttributeName=userId,AttributeType=S \
  --key-schema AttributeName=userId,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST

# Insert item
aws --endpoint-url=http://localhost:4566 dynamodb put-item \
  --table-name Users \
  --item &apos;{&quot;userId&quot;: {&quot;S&quot;: &quot;123&quot;}, &quot;name&quot;: {&quot;S&quot;: &quot;Alice&quot;}}&apos;

# Get item by primary key
aws --endpoint-url=http://localhost:4566 dynamodb get-item \
  --table-name Users \
  --key &apos;{&quot;userId&quot;: {&quot;S&quot;: &quot;123&quot;}}&apos;
```

### Python with boto3

```python
import boto3

dynamodb = boto3.resource(
    &apos;dynamodb&apos;,
    endpoint_url=&apos;http://localhost:4566&apos;,
    aws_access_key_id=&apos;test&apos;,
    aws_secret_access_key=&apos;test&apos;,
    region_name=&apos;us-east-1&apos;
)

# Create table
table = dynamodb.create_table(
    TableName=&apos;Users&apos;,
    KeySchema=[{&apos;AttributeName&apos;: &apos;userId&apos;, &apos;KeyType&apos;: &apos;HASH&apos;}],
    AttributeDefinitions=[{&apos;AttributeName&apos;: &apos;userId&apos;, &apos;AttributeType&apos;: &apos;S&apos;}],
    BillingMode=&apos;PAY_PER_REQUEST&apos;
)
table.wait_until_exists()

# Insert
table.put_item(Item={&apos;userId&apos;: &apos;123&apos;, &apos;name&apos;: &apos;Alice&apos;, &apos;email&apos;: &apos;alice@example.com&apos;})

# Get item by primary key
response = table.get_item(Key={&apos;userId&apos;: &apos;123&apos;})
print(response[&apos;Item&apos;])
```
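
The CLI examples use the typed attribute format (`{"S": "123"}`), while the boto3 resource API accepts plain Python values. A rough sketch of that mapping for a handful of types, for illustration only. For real code boto3 ships `boto3.dynamodb.types.TypeSerializer`; the function names below are hypothetical:

```python
def to_dynamodb_json(value):
    """Convert a plain Python value to the typed attribute format."""
    if isinstance(value, bool):  # check bool before int: bool is an int subclass
        return {"BOOL": value}
    if value is None:
        return {"NULL": True}
    if isinstance(value, str):
        return {"S": value}
    if isinstance(value, (int, float)):
        return {"N": str(value)}  # numbers travel as strings
    if isinstance(value, list):
        return {"L": [to_dynamodb_json(v) for v in value]}
    if isinstance(value, dict):
        return {"M": {k: to_dynamodb_json(v) for k, v in value.items()}}
    raise TypeError(f"unsupported type: {type(value).__name__}")


def item_to_dynamodb_json(item):
    """Convert a whole item (a dict of attribute name to value)."""
    return {key: to_dynamodb_json(val) for key, val in item.items()}
```

Calling `item_to_dynamodb_json({"userId": "123", "name": "Alice"})` yields the same structure that the `put-item` command passes via `--item`.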

---

## SQS: Message Queues

```bash
# Create queue
aws --endpoint-url=http://localhost:4566 sqs create-queue \
  --queue-name my-queue

# Send message
aws --endpoint-url=http://localhost:4566 sqs send-message \
  --queue-url http://localhost:4566/000000000000/my-queue \
  --message-body &quot;Hello from LocalStack&quot;

# Receive message
aws --endpoint-url=http://localhost:4566 sqs receive-message \
  --queue-url http://localhost:4566/000000000000/my-queue
```

### SQS + Lambda Integration

```bash
# Create event source mapping
aws --endpoint-url=http://localhost:4566 lambda create-event-source-mapping \
  --function-name hello-function \
  --event-source-arn arn:aws:sqs:us-east-1:000000000000:my-queue \
  --batch-size 10

# Messages sent to SQS now trigger the Lambda
```
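
With the mapping in place, the function receives messages in batches under `Records`. A minimal sketch of a batch handler, assuming message bodies are JSON (the plain-text message sent earlier would be reported as a failure). The `batchItemFailures` return value is only honoured when the mapping is created with `--function-response-types ReportBatchItemFailures`, which the command above does not set:

```python
import json


def lambda_handler(event, context):
    """Process an SQS batch and report which messages failed to parse."""
    processed, failures = [], []
    for record in event.get("Records", []):
        try:
            processed.append(json.loads(record["body"]))
        except json.JSONDecodeError:
            # Report only this message as failed instead of failing the batch
            failures.append({"itemIdentifier": record["messageId"]})
    print(f"processed={len(processed)} failed={len(failures)}")
    return {"batchItemFailures": failures}
```

Reporting failures per message keeps one bad payload from forcing a retry of the other messages in the same batch.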

---

## SNS: Pub/Sub Messaging

```bash
# Create topic
aws --endpoint-url=http://localhost:4566 sns create-topic \
  --name my-topic

# Subscribe SQS queue to topic
aws --endpoint-url=http://localhost:4566 sns subscribe \
  --topic-arn arn:aws:sns:us-east-1:000000000000:my-topic \
  --protocol sqs \
  --notification-endpoint arn:aws:sqs:us-east-1:000000000000:my-queue

# Publish message
aws --endpoint-url=http://localhost:4566 sns publish \
  --topic-arn arn:aws:sns:us-east-1:000000000000:my-topic \
  --message &quot;Broadcast message&quot;
```
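
One detail that tends to surprise consumers: unless raw message delivery is enabled on the subscription, the SQS message body is a JSON envelope from SNS, with the published text under `Message`. A small sketch of unwrapping it; the function name is hypothetical:

```python
import json


def unwrap_sns_message(sqs_body):
    """Return the originally published text from an SQS message body.

    If the body is not an SNS envelope (for example with raw message
    delivery enabled), it is returned unchanged.
    """
    try:
        envelope = json.loads(sqs_body)
    except json.JSONDecodeError:
        return sqs_body
    if isinstance(envelope, dict) and envelope.get("Type") == "Notification":
        return envelope["Message"]
    return sqs_body
```

Applied to the body returned by `sqs receive-message` after the publish above, this should give back `Broadcast message`.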

---

## Secrets Manager

```bash
# Create secret
aws --endpoint-url=http://localhost:4566 secretsmanager create-secret \
  --name my-secret \
  --secret-string &apos;{&quot;username&quot;:&quot;admin&quot;,&quot;password&quot;:&quot;secret123&quot;}&apos;

# Retrieve secret
aws --endpoint-url=http://localhost:4566 secretsmanager get-secret-value \
  --secret-id my-secret
```

---

## Terraform with LocalStack

```hcl
# providers.tf
terraform {
  required_providers {
    aws = {
      source  = &quot;hashicorp/aws&quot;
      version = &quot;~&gt; 5.0&quot;
    }
  }
}

provider &quot;aws&quot; {
  region                      = &quot;us-east-1&quot;
  access_key                  = &quot;test&quot;
  secret_key                  = &quot;test&quot;
  skip_credentials_validation = true
  skip_metadata_api_check     = true
  skip_requesting_account_id  = true

  endpoints {
    s3             = &quot;http://localhost:4566&quot;
    dynamodb       = &quot;http://localhost:4566&quot;
    lambda         = &quot;http://localhost:4566&quot;
    sqs            = &quot;http://localhost:4566&quot;
    sns            = &quot;http://localhost:4566&quot;
    secretsmanager = &quot;http://localhost:4566&quot;
    iam            = &quot;http://localhost:4566&quot;
  }
}

# main.tf
resource &quot;aws_s3_bucket&quot; &quot;app_bucket&quot; {
  bucket = &quot;my-app-bucket&quot;
}

resource &quot;aws_dynamodb_table&quot; &quot;app_table&quot; {
  name         = &quot;AppData&quot;
  billing_mode = &quot;PAY_PER_REQUEST&quot;
  hash_key     = &quot;id&quot;

  attribute {
    name = &quot;id&quot;
    type = &quot;S&quot;
  }
}

resource &quot;aws_sqs_queue&quot; &quot;app_queue&quot; {
  name = &quot;app-processing-queue&quot;
}
```

```bash
# Apply against LocalStack
terraform init
terraform apply
```

---

## Integration Testing Pattern

### pytest with LocalStack

```python
# conftest.py
import pytest
import boto3
import os

@pytest.fixture(scope=&apos;session&apos;)
def localstack_endpoint():
    return os.getenv(&apos;AWS_ENDPOINT_URL&apos;, &apos;http://localhost:4566&apos;)

@pytest.fixture(scope=&apos;session&apos;)
def s3_client(localstack_endpoint):
    return boto3.client(
        &apos;s3&apos;,
        endpoint_url=localstack_endpoint,
        aws_access_key_id=&apos;test&apos;,
        aws_secret_access_key=&apos;test&apos;,
        region_name=&apos;us-east-1&apos;
    )

@pytest.fixture(scope=&apos;function&apos;)
def test_bucket(s3_client):
    bucket_name = &apos;test-bucket&apos;
    s3_client.create_bucket(Bucket=bucket_name)
    yield bucket_name
    # Cleanup
    objects = s3_client.list_objects_v2(Bucket=bucket_name).get(&apos;Contents&apos;, [])
    for obj in objects:
        s3_client.delete_object(Bucket=bucket_name, Key=obj[&apos;Key&apos;])
    s3_client.delete_bucket(Bucket=bucket_name)
```

```python
# test_s3_operations.py
def test_upload_and_download(s3_client, test_bucket):
    # Upload
    s3_client.put_object(
        Bucket=test_bucket,
        Key=&apos;test-file.txt&apos;,
        Body=b&apos;Hello, LocalStack!&apos;
    )
    
    # Download
    response = s3_client.get_object(Bucket=test_bucket, Key=&apos;test-file.txt&apos;)
    content = response[&apos;Body&apos;].read().decode(&apos;utf-8&apos;)
    
    assert content == &apos;Hello, LocalStack!&apos;

def test_list_objects(s3_client, test_bucket):
    # Create multiple objects
    for i in range(5):
        s3_client.put_object(
            Bucket=test_bucket,
            Key=f&apos;file-{i}.txt&apos;,
            Body=f&apos;Content {i}&apos;.encode()
        )
    
    # List
    response = s3_client.list_objects_v2(Bucket=test_bucket)
    
    assert len(response[&apos;Contents&apos;]) == 5
```
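
The `test_bucket` fixture reuses the fixed name `test-bucket`, so a failed cleanup or tests running in parallel can collide. A minimal sketch of a name generator that could replace the hard-coded string; `unique_bucket_name` is a hypothetical helper:

```python
import uuid


def unique_bucket_name(prefix="test-bucket"):
    """Return a bucket name that is unique per call and valid for S3.

    Bucket names are limited to 3-63 characters of lowercase letters,
    digits, dots and hyphens, so the prefix is lowercased and truncated.
    """
    suffix = uuid.uuid4().hex[:12]
    return f"{prefix.lower()[:50]}-{suffix}"
```

In the fixture this would be `bucket_name = unique_bucket_name()`, with the rest of the setup and cleanup unchanged.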

### CI/CD Integration

```yaml
# .github/workflows/test.yml
name: Integration Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    services:
      localstack:
        image: localstack/localstack:latest
        ports:
          - 4566:4566
        env:
          SERVICES: s3,dynamodb,lambda,sqs
    
    steps:
      - uses: actions/checkout@v4
      
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: &apos;3.11&apos;
      
      - name: Install dependencies
        run: pip install -r requirements.txt pytest boto3
      
      - name: Wait for LocalStack
        run: |
          pip install awscli-local
          for i in {1..30}; do
            if awslocal s3 ls 2&gt;/dev/null; then
              echo &quot;LocalStack is ready&quot;
              exit 0
            fi
            echo &quot;Waiting for LocalStack...&quot;
            sleep 2
          done
          echo &quot;LocalStack did not become ready in time&quot; &gt;&amp;2
          exit 1
      
      - name: Run tests
        env:
          AWS_ENDPOINT_URL: http://localhost:4566
          AWS_ACCESS_KEY_ID: test
          AWS_SECRET_ACCESS_KEY: test
          AWS_DEFAULT_REGION: us-east-1
        run: pytest tests/ -v
```
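
The same readiness check can live in the test suite itself, which avoids installing `awscli-local` just to wait. A minimal sketch that polls the health endpoint with the standard library, assuming the response is a JSON document containing a `services` key. The `fetch` and `sleep` parameters exist only so the loop can be exercised without a running container:

```python
import json
import time
import urllib.request


def fetch_health(url):
    """Fetch and decode the LocalStack health document."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.load(response)


def wait_for_localstack(url="http://localhost:4566/_localstack/health",
                        attempts=30, delay=2.0,
                        fetch=fetch_health, sleep=time.sleep):
    """Poll the health endpoint; return True once it answers, else False."""
    for _ in range(attempts):
        try:
            health = fetch(url)
            if isinstance(health, dict) and "services" in health:
                return True
        except (OSError, ValueError):
            pass  # not listening yet, or the body was not valid JSON
        sleep(delay)
    return False
```

A session-scoped pytest fixture could call `wait_for_localstack()` and fail fast with a clear message when it returns `False`.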

---

## awslocal CLI Wrapper

```bash
# Install
pip install awscli-local

# Use without --endpoint-url
awslocal s3 mb s3://my-bucket
awslocal dynamodb list-tables
awslocal lambda list-functions

# It automatically adds the LocalStack endpoint
```

---

## Pro Tips

### 1. Use Initialization Scripts

```yaml
# docker-compose.yml
services:
  localstack:
    image: localstack/localstack:latest
    volumes:
      - &quot;./init-aws.sh:/etc/localstack/init/ready.d/init-aws.sh&quot;
```

```bash
# init-aws.sh
#!/bin/bash
awslocal s3 mb s3://app-bucket
awslocal dynamodb create-table \
  --table-name Users \
  --attribute-definitions AttributeName=id,AttributeType=S \
  --key-schema AttributeName=id,KeyType=HASH \
  --billing-mode PAY_PER_REQUEST
awslocal sqs create-queue --queue-name app-queue
echo &quot;LocalStack initialized!&quot;
```

### 2. Enable Persistence

```yaml
environment:
  - PERSISTENCE=1
volumes:
  - &quot;./localstack-data:/var/lib/localstack&quot;
```

### 3. Debug Lambda Execution

```yaml
environment:
  - DEBUG=1
  - LAMBDA_EXECUTOR=docker  # Legacy setting: run Lambdas in separate containers
  - LAMBDA_REMOTE_DOCKER=0  # Legacy setting; LocalStack 2.0+ always uses separate containers and ignores both
```

### 4. Check Service Status

```bash
# Health check
curl http://localhost:4566/_localstack/health

# Version and session info
curl http://localhost:4566/_localstack/info
```

---

## Supported Services (Free Tier)

| Service | Coverage |
|---------|----------|
| S3 | Full |
| DynamoDB | Full |
| Lambda | Full |
| SQS | Full |
| SNS | Full |
| CloudWatch Logs | Full |
| IAM | Basic |
| Secrets Manager | Full |
| SSM Parameter Store | Full |
| CloudFormation | Most resources |
| API Gateway | Full |
| Kinesis | Full |
| Step Functions | Full |

Pro tier adds: RDS, ECS, EKS, ElastiCache, and more.

---

## When NOT to Use LocalStack

- **Performance testing** - Local != cloud performance
- **IAM policy testing** - IAM simulation is limited
- **Network testing** - VPCs, Transit Gateway, etc.
- **Managed service features** - RDS failover, Aurora Serverless v2
- **Final pre-production testing** - Always test against real AWS

---

## Quick Reference

```bash
# Start
docker-compose up -d

# AWS CLI with endpoint
aws --endpoint-url=http://localhost:4566 s3 ls

# Or use awslocal
awslocal s3 ls

# Check health
curl localhost:4566/_localstack/health

# View logs
docker-compose logs -f localstack

# Reset (delete all data, including the bind-mounted persistence directory)
docker-compose down -v
rm -rf ./localstack-data
docker-compose up -d
```

---

## Conclusion

LocalStack transforms AWS development:

1. **Faster feedback** - Seconds instead of minutes
2. **No cloud costs** - Run everything locally
3. **Offline development** - Work on planes, trains, anywhere
4. **Better testing** - Integration tests without mock complexity
5. **Team consistency** - Everyone runs the same environment

Your AWS bill will thank you. Your iteration speed will skyrocket.

---

## References

- [LocalStack Documentation](https://docs.localstack.cloud/)
- [LocalStack GitHub](https://github.com/localstack/localstack)
- [awscli-local](https://github.com/localstack/awscli-local)
- [Terraform LocalStack Provider](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/custom-service-endpoints)</content:encoded><category>localstack</category><category>aws</category><category>testing</category><category>development</category><category>devops</category><category>docker</category><author>Mo Abukar</author></item><item><title>GitHub Actions OIDC – Ditch the AWS Access Keys Forever</title><link>https://moabukar.co.uk/blog/github-actions-oidc/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/github-actions-oidc/</guid><description>How to authenticate GitHub Actions to AWS without storing secrets. OIDC federation explained, IAM role setup, and the token claims that control access.</description><pubDate>Wed, 19 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Stop storing AWS access keys in GitHub Secrets. There&apos;s a better way.

GitHub Actions supports OIDC (OpenID Connect) federation, which means your workflows can assume IAM roles directly – no long-lived credentials, no rotation headaches, no secrets to leak.

Here&apos;s how it works and how to set it up properly.

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/github-actions-oidc](https://github.com/moabukar/blog-code/tree/main/github-actions-oidc)

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/githubactions.svg&quot; alt=&quot;GitHub Actions logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## The Problem with Access Keys

Traditional CI/CD authentication looks like this:

1. Create IAM user
2. Generate access key
3. Store in GitHub Secrets
4. Hope nobody leaks them
5. Forget to rotate them
6. Get breached

Access keys are:
- **Long-lived** – valid until you delete them
- **Static** – same credentials for every workflow run
- **Broadly scoped** – often over-permissioned &quot;just to make CI work&quot;
- **Hard to audit** – which workflow used these keys when?

## OIDC: The Better Way

With OIDC federation:

1. GitHub Actions requests a short-lived token from GitHub&apos;s OIDC provider
2. Your workflow presents this token to AWS
3. AWS validates the token against GitHub&apos;s public keys
4. AWS issues temporary credentials (1 hour by default; configurable from 15 minutes up to the role&apos;s maximum session duration)
5. Workflow runs with those credentials
6. Credentials automatically expire

No secrets stored. No keys to rotate. Every workflow run gets unique, short-lived credentials.

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  GitHub Actions │────►│ GitHub OIDC     │────►│      AWS        │
│    Workflow     │     │ Provider        │     │   IAM Role      │
└─────────────────┘     └─────────────────┘     └─────────────────┘
        │                       │                       │
        │ 1. Request token      │                       │
        │──────────────────────►│                       │
        │                       │                       │
        │ 2. JWT with claims    │                       │
        │◄──────────────────────│                       │
        │                       │                       │
        │ 3. AssumeRoleWithWebIdentity                  │
        │──────────────────────────────────────────────►│
        │                       │                       │
        │ 4. Temporary credentials                      │
        │◄──────────────────────────────────────────────│
```

## Setting It Up

### Step 1: Create the OIDC Provider in AWS

```bash
aws iam create-open-id-connect-provider \
  --url https://token.actions.githubusercontent.com \
  --client-id-list sts.amazonaws.com \
  --thumbprint-list 6938fd4d98bab03faadb97b34396831e3780aea1
```

Or with Terraform:

```hcl
resource &quot;aws_iam_openid_connect_provider&quot; &quot;github&quot; {
  url             = &quot;https://token.actions.githubusercontent.com&quot;
  client_id_list  = [&quot;sts.amazonaws.com&quot;]
  thumbprint_list = [&quot;6938fd4d98bab03faadb97b34396831e3780aea1&quot;]
}
```

### Step 2: Create the IAM Role

This is where the magic happens. The trust policy controls *which* GitHub repos/branches can assume the role:

```hcl
data &quot;aws_iam_policy_document&quot; &quot;github_actions_assume_role&quot; {
  statement {
    effect  = &quot;Allow&quot;
    actions = [&quot;sts:AssumeRoleWithWebIdentity&quot;]

    principals {
      type        = &quot;Federated&quot;
      identifiers = [aws_iam_openid_connect_provider.github.arn]
    }

    condition {
      test     = &quot;StringEquals&quot;
      variable = &quot;token.actions.githubusercontent.com:aud&quot;
      values   = [&quot;sts.amazonaws.com&quot;]
    }

    condition {
      test     = &quot;StringLike&quot;
      variable = &quot;token.actions.githubusercontent.com:sub&quot;
      values   = [&quot;repo:myorg/myrepo:*&quot;]
    }
  }
}

resource &quot;aws_iam_role&quot; &quot;github_actions&quot; {
  name               = &quot;github-actions-deploy&quot;
  assume_role_policy = data.aws_iam_policy_document.github_actions_assume_role.json
}

resource &quot;aws_iam_role_policy_attachment&quot; &quot;github_actions&quot; {
  role       = aws_iam_role.github_actions.name
  policy_arn = &quot;arn:aws:iam::aws:policy/PowerUserAccess&quot;  # Scope this down!
}
```

### Step 3: Configure Your Workflow

```yaml
name: Deploy
on:
  push:
    branches: [main]

permissions:
  id-token: write   # Required for OIDC
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-deploy
          aws-region: eu-west-1
      
      - name: Deploy
        run: |
          aws s3 sync ./dist s3://my-bucket
```

That&apos;s it. No `AWS_ACCESS_KEY_ID`. No `AWS_SECRET_ACCESS_KEY`. Just the role ARN.

## Token Claims: The Security Controls

The GitHub OIDC token contains claims that you can use in IAM trust policies. This is where you lock things down:

| Claim | Example | Use Case |
|-------|---------|----------|
| `sub` | `repo:myorg/myrepo:ref:refs/heads/main` | Restrict to specific repo/branch |
| `repository` | `myorg/myrepo` | Match repository name |
| `repository_owner` | `myorg` | Match organization |
| `ref` | `refs/heads/main` | Match branch/tag |
| `environment` | `production` | Match GitHub environment |
| `job_workflow_ref` | `myorg/myrepo/.github/workflows/deploy.yml@refs/heads/main` | Match specific workflow file |
| `actor` | `octocat` | Match user who triggered |
| `event_name` | `push` | Match trigger event |

### Restricting by Repository

Only allow a specific repo:

```json
{
  &quot;Condition&quot;: {
    &quot;StringLike&quot;: {
      &quot;token.actions.githubusercontent.com:sub&quot;: &quot;repo:myorg/myrepo:*&quot;
    }
  }
}
```

### Restricting by Branch

Only allow `main` branch:

```json
{
  &quot;Condition&quot;: {
    &quot;StringEquals&quot;: {
      &quot;token.actions.githubusercontent.com:sub&quot;: &quot;repo:myorg/myrepo:ref:refs/heads/main&quot;
    }
  }
}
```

### Restricting by Environment

Only allow the `production` GitHub environment:

```json
{
  &quot;Condition&quot;: {
    &quot;StringEquals&quot;: {
      &quot;token.actions.githubusercontent.com:sub&quot;: &quot;repo:myorg/myrepo:environment:production&quot;
    }
  }
}
```

This is powerful – you can require manual approval in GitHub before the role can be assumed.

### Restricting by Organization

Allow any repo in your org:

```json
{
  &quot;Condition&quot;: {
    &quot;StringLike&quot;: {
      &quot;token.actions.githubusercontent.com:sub&quot;: &quot;repo:myorg/*:*&quot;
    }
  }
}
```

## Common Patterns

### Different Roles per Environment

```hcl
# Production role - only jobs running in the production environment
resource &quot;aws_iam_role&quot; &quot;github_actions_prod&quot; {
  name = &quot;github-actions-prod&quot;
  
  assume_role_policy = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [{
      Effect = &quot;Allow&quot;
      Action = &quot;sts:AssumeRoleWithWebIdentity&quot;
      Principal = {
        Federated = aws_iam_openid_connect_provider.github.arn
      }
      Condition = {
        StringEquals = {
          &quot;token.actions.githubusercontent.com:aud&quot; = &quot;sts.amazonaws.com&quot;
          &quot;token.actions.githubusercontent.com:sub&quot; = &quot;repo:myorg/myrepo:environment:production&quot;
        }
      }
    }]
  })
}

# Staging role - any branch
resource &quot;aws_iam_role&quot; &quot;github_actions_staging&quot; {
  name = &quot;github-actions-staging&quot;
  
  assume_role_policy = jsonencode({
    Version = &quot;2012-10-17&quot;
    Statement = [{
      Effect = &quot;Allow&quot;
      Action = &quot;sts:AssumeRoleWithWebIdentity&quot;
      Principal = {
        Federated = aws_iam_openid_connect_provider.github.arn
      }
      Condition = {
        StringEquals = {
          &quot;token.actions.githubusercontent.com:aud&quot; = &quot;sts.amazonaws.com&quot;
        }
        StringLike = {
          &quot;token.actions.githubusercontent.com:sub&quot; = &quot;repo:myorg/myrepo:*&quot;
        }
      }
    }]
  })
}
```

### Workflow with Environment Gates

```yaml
jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment: staging
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-staging
          aws-region: eu-west-1

  deploy-production:
    needs: deploy-staging
    runs-on: ubuntu-latest
    environment: production  # Requires manual approval
    permissions:
      id-token: write
      contents: read
    steps:
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/github-actions-prod
          aws-region: eu-west-1
```

## Debugging OIDC Issues

### &quot;Not authorized to perform sts:AssumeRoleWithWebIdentity&quot;

Check your trust policy conditions. Print the token claims to see what you&apos;re actually getting:

```yaml
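- name: Debug OIDC token
  run: |
    TOKEN=$(curl -s -H &quot;Authorization: bearer $ACTIONS_ID_TOKEN_REQUEST_TOKEN&quot; \
      &quot;$ACTIONS_ID_TOKEN_REQUEST_URL&amp;audience=sts.amazonaws.com&quot; | jq -r &apos;.value&apos;)
    echo &quot;Token claims:&quot;
    # JWT payloads are base64url without padding: map to standard base64, then pad
    PAYLOAD=$(echo &quot;$TOKEN&quot; | cut -d. -f2 | tr &apos;_-&apos; &apos;/+&apos;)
    while [ $(( ${#PAYLOAD} % 4 )) -ne 0 ]; do PAYLOAD=&quot;${PAYLOAD}=&quot;; done
    echo &quot;$PAYLOAD&quot; | base64 -d | jq .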
```

### &quot;Audience validation failed&quot;

Make sure your IAM trust policy checks for `sts.amazonaws.com`:

```json
{
  &quot;Condition&quot;: {
    &quot;StringEquals&quot;: {
      &quot;token.actions.githubusercontent.com:aud&quot;: &quot;sts.amazonaws.com&quot;
    }
  }
}
```

### &quot;Subject claim mismatch&quot;

The `sub` claim format varies based on the trigger:

- Push: `repo:org/repo:ref:refs/heads/branch`
- PR: `repo:org/repo:pull_request`
- Environment: `repo:org/repo:environment:name`

Use `StringLike` with wildcards if needed.
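
Before changing a trust policy, it can help to check a pattern against the `sub` values it is meant to allow and reject. A rough sketch that approximates `StringLike` matching (`*` matches any run of characters, `?` matches exactly one, comparison is case-sensitive). This is a local approximation for reasoning about patterns, not the IAM evaluation engine:

```python
import re


def string_like(pattern, value):
    """Approximate IAM StringLike matching with * and ? wildcards."""
    regex = re.escape(pattern).replace(r"\*", ".*").replace(r"\?", ".")
    return re.fullmatch(regex, value, flags=re.DOTALL) is not None


SAMPLE_SUBS = {
    "push to main": "repo:myorg/myrepo:ref:refs/heads/main",
    "pull request": "repo:myorg/myrepo:pull_request",
    "production environment": "repo:myorg/myrepo:environment:production",
}

if __name__ == "__main__":
    for pattern in ("repo:myorg/myrepo:*", "repo:myorg/myrepo:ref:refs/heads/*"):
        for trigger, sub in SAMPLE_SUBS.items():
            print(pattern, "|", trigger, "|", string_like(pattern, sub))
```

The second pattern deliberately rejects pull request and environment tokens, which is the usual reason a role that works on push stops working once a job declares an `environment`.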

## Beyond AWS

OIDC works with other clouds too:

### Azure

```yaml
- uses: azure/login@v1
  with:
    client-id: ${{ secrets.AZURE_CLIENT_ID }}
    tenant-id: ${{ secrets.AZURE_TENANT_ID }}
    subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
```

### GCP

```yaml
- uses: google-github-actions/auth@v2
  with:
    workload_identity_provider: projects/123456/locations/global/workloadIdentityPools/github/providers/github
    service_account: github-actions@my-project.iam.gserviceaccount.com
```

### HashiCorp Vault

```yaml
- uses: hashicorp/vault-action@v2
  with:
    url: https://vault.example.com
    method: jwt
    role: github-actions
    jwtGithubAudience: https://vault.example.com
```

## Summary

| Approach | Credentials | Lifetime | Rotation | Blast Radius |
|----------|-------------|----------|----------|--------------|
| Access Keys | Static | Indefinite | Manual | High |
| OIDC | Dynamic | 1 hour by default | Automatic | Low |

OIDC is:
- **More secure** – no long-lived credentials
- **Easier to audit** – every workflow run has unique credentials
- **Granular** – control access by repo, branch, environment
- **Zero maintenance** – no keys to rotate

Stop storing AWS keys in GitHub. OIDC has been stable since 2021. There&apos;s no excuse.

---

*Further reading: [GitHub OIDC docs](https://docs.github.com/en/actions/deployment/security-hardening-your-deployments/about-security-hardening-with-openid-connect) and [AWS federation guide](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html).*</content:encoded><category>github-actions</category><category>oidc</category><category>aws</category><category>iam</category><category>security</category><category>cicd</category><category>devops</category><author>Mo Abukar</author></item><item><title>Contract vs Perm: 4 Years of Both and What I&apos;d Choose Now</title><link>https://moabukar.co.uk/blog/contract-vs-perm/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/contract-vs-perm/</guid><description>I&apos;ve done both. Multiple times. Here&apos;s the real trade-offs nobody talks about - the money, the time off problem, the boredom factor, and why your life situation matters more than you think.</description><pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Contract vs Perm: 4 Years of Both and What I&apos;d Choose Now
=========================================================

I&apos;ve done both. Multiple times. Permanent roles at startups and enterprises. Contract gigs at £500-650/day. I&apos;ve felt the security of a salary and the rush of invoicing five figures a month.

Everyone asks: &quot;Which is better?&quot;

**The answer is: it depends on who you are and what season of life you&apos;re in.**

Let me break down the real trade-offs - not the sanitised version, but what actually matters when you&apos;re making this decision.

---

![Contract vs Perm: 4 Years of Both and What I&apos;d Choose Now](/images/contract-vs-perm.jpg)


The Money Reality
=================

Let&apos;s get the obvious out of the way: contractors earn more per day.

A Senior DevOps Engineer on £90k permanent is roughly £360/day (assuming 250 working days). A contractor doing the same work might charge £550-650/day. That&apos;s a 50-80% premium.

But it&apos;s not that simple.

**What permanent gives you:**

- 25-30 days paid holiday
- Sick pay
- Pension contributions (often 5-10% matched)
- Training budgets
- Stock options at some companies
- Job security (relatively)
- Paid parental leave

**What contracting costs you:**

- No paid holidays - every day off is money lost
- No sick pay - get ill? That&apos;s your problem
- No pension match - you fund your own
- No training budget - certifications come out of your pocket
- IR35 headaches - if you&apos;re inside IR35, you&apos;re paying employee taxes without employee benefits
- Contract ends? Start hunting immediately

When you actually do the maths - factoring in holidays, pension, sick days, and the stress of finding the next gig - the gap narrows.

```
SCENARIO                    PERM £90K       CONTRACT £600/DAY
========                    =========       =================
Gross annual                £90,000         £132,000 (220 days)
Holidays (25 days)          Paid            -£15,000
Sick days (5 avg)           Paid            -£3,000
Pension (8% match)          +£7,200         £0
Gap between contracts       N/A             -£6,000 (avg)
-------                     -------         -------
Effective value             ~£97,200        ~£108,000
```

Still better money. But not the 2x gap people imagine.
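
The table reduces to a few lines of arithmetic, which makes it easy to plug in your own numbers. A small sketch using the figures above, assuming the £6,000 gap between contracts corresponds to 10 unbilled days at £600:

```python
DAY_RATE = 600
BILLED_DAYS = 220

perm_salary = 90_000
perm_pension = perm_salary * 8 // 100       # 8% employer pension match
perm_value = perm_salary + perm_pension

contract_gross = DAY_RATE * BILLED_DAYS
unpaid_holiday = 25 * DAY_RATE              # 25 days of unpaid holiday
unpaid_sick = 5 * DAY_RATE                  # 5 unpaid sick days
contract_gap = 10 * DAY_RATE                # assumed: 10 days between contracts
contract_value = contract_gross - unpaid_holiday - unpaid_sick - contract_gap

premium = contract_value / perm_value - 1
print(perm_value, contract_value, f"{premium:.0%}")
# prints: 97200 108000 11%
```

On these assumptions the contract premium is about 11%, a long way from the headline day-rate difference.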

---

The Time Off Problem
====================

This is the one that catches people.

As a permanent employee, you get paid holidays. You book two weeks off in August, you get paid. Simple.

As a contractor, those two weeks cost you £5,000-6,500 in lost income. Plus the bank holidays. Plus any sick days. Plus the gaps between contracts.

**Real example:** I once took a week off as a contractor. It felt like burning money. I couldn&apos;t relax properly because I kept calculating what that beach holiday was *really* costing me.

As a permanent employee, I take my holidays guilt-free. They&apos;re part of my compensation. I&apos;m not losing anything.

If you&apos;re the kind of person who needs proper downtime to function - and most of us do - this matters more than you think.

---

The Boredom Factor
==================

Here&apos;s where it gets personal.

**Some engineers thrive on variety.** They get bored working on the same codebase for two years. They want new problems, new tech stacks, new teams. Contracting is perfect for this. Every 3-6 months, you&apos;re somewhere new. Fresh challenges. No legacy baggage (that&apos;s someone else&apos;s problem now).

**Other engineers want depth.** They want to see a system through from inception to maturity. They want to build something they&apos;re proud of over years, not months. They want to see the long-term consequences of their decisions. Permanent roles offer this.

Neither is wrong. But you need to know which one you are.

```
TYPE                GETS ENERGY FROM        DRAINED BY
====                ================        ==========
Variety-seeker      New challenges          Same problems
Depth-seeker        Mastering systems       Starting over
```

I&apos;ve worked with brilliant contractors who would be miserable in a permanent role - they need the variety to stay engaged. I&apos;ve also worked with brilliant permanent engineers who would hate contracting - they want to *own* something long-term.

**Ask yourself:** Do you get energised or drained by constantly starting fresh?

---

The Responsibility Question
===========================

This is the big one that nobody talks about.

**If you don&apos;t have major financial responsibilities, contracting is a no-brainer.**

No mortgage. No kids. No dependents. You can handle gaps between contracts. You can take risks. You can say no to bad gigs because you don&apos;t *need* the money next month. You can stack cash when rates are high and take extended breaks when you want.

This is the ideal time to contract. Build your runway. Save aggressively. You have leverage because you can walk away.

**If you have a mortgage and kids, the calculus changes.**

That gap between contracts isn&apos;t just an inconvenience - it&apos;s genuine stress. The inconsistent income makes financial planning harder. The lack of sick pay means a serious illness could be financially devastating. The lack of job security means you&apos;re always one client decision away from scrambling.

I&apos;ve seen contractors with families who make it work. But they all have significant savings - usually 6-12 months of expenses - to buffer the uncertainty. Without that buffer, contracting with dependents is playing with fire.

---

The Stack-Up Strategy
=====================

Here&apos;s the advice I give to engineers in their 20s and early 30s with no major responsibilities:

**Contract. Stack cash. Aggressively.**

You have a window right now where your expenses are low and your earning potential is high. Use it.

- Live below your means
- Save 40-50% of your income
- Build a 6-12 month emergency fund
- Invest the rest

The goal isn&apos;t to contract forever. The goal is to build financial security so that later in life, you have *options*. You can take the interesting permanent role that pays less. You can start something. You can take time off. You can say no to bad opportunities.

**Money buys optionality. Contracting early is how you stack it.**

Then, when life gets more complicated - mortgage, kids, whatever - you can choose permanent work from a position of strength, not desperation.

---

The Career Progression Question
===============================

One argument for permanent: it&apos;s easier to get promoted.

As a contractor, you&apos;re brought in to do a job. You do it, you leave. There&apos;s no promotion path because you&apos;re not on the ladder.

As a permanent employee, you can grow from Senior to Staff to Principal within the same company. You build relationships. You get visibility. You get sponsored for opportunities.

```
PATH                    CONTRACTOR              PERMANENT
====                    ==========              =========
Title progression       Lateral moves           Vertical growth
Relationships           Transactional           Long-term
Visibility              Project-based           Org-wide
Sponsorship             Rare                    Common
```

If your goal is to reach Staff or Principal level, permanent roles are the clearer path. Not impossible as a contractor, but harder.

That said, many contractors don&apos;t want to climb the ladder. They want to stay hands-on, do good work, and get paid well for it. That&apos;s a valid choice. The IC ladder isn&apos;t for everyone, and contracting lets you opt out of the politics entirely.

---

The Learning Angle
==================

Contractors often learn faster - but shallower.

You see more companies, more tech stacks, more ways of doing things. You learn what works and what doesn&apos;t across different contexts. This breadth is genuinely valuable.

But you rarely see anything through long-term. You don&apos;t learn what happens two years after that &quot;perfect&quot; architecture decision. You don&apos;t experience the maintenance burden of choices you made. You miss the deep lessons that only come from living with your decisions.

Permanent employees learn slower - but deeper. You understand the full lifecycle. You see the consequences. You develop intuition that&apos;s hard to get any other way.

**Best of both worlds:** Do a mix. Contract for a few years, then go permanent somewhere interesting. Take what you learned across multiple companies and apply it deeply. Repeat.

---

My Personal Take
================

I&apos;ve done both, and I&apos;ll probably do both again.

**When I contract:** I&apos;m in execution mode. Stack money, gain exposure, stay flexible. I don&apos;t get emotionally invested in the company&apos;s success because I know I&apos;m temporary. I do good work and move on.

**When I go permanent:** I&apos;m in building mode. I want to see something through. I want to influence direction. I want to grow within an organisation. I accept the lower daily rate because I&apos;m getting something else - stability, depth, progression.

Right now, I value building. But I know that might change. And because I stacked cash early, I have the freedom to choose.

---

The Decision Matrix
===================

**Choose contracting if:**

- You have few financial responsibilities
- You want to stack money quickly
- You get bored easily and crave variety
- You&apos;re comfortable with uncertainty
- You have a strong network to find gigs

**Choose permanent if:**

- You have a mortgage, kids, or other dependents
- You want depth over breadth
- You want to progress to Staff/Principal
- You value stability and benefits
- You want to build something long-term

There&apos;s no objectively correct answer. It depends on your situation, your personality, and what phase of life you&apos;re in.

**The real mistake is picking one and never reconsidering.** Your circumstances change. Revisit the decision every few years.

```
========================================
Stack when you can.
Build when you want to.
Keep your options open.
========================================
```</content:encoded><category>career</category><category>contracting</category><category>salary</category><category>advice</category><category>engineering-culture</category><author>Mo Abukar</author></item><item><title>Port and Kratix: Internal Developer Platforms Beyond Backstage</title><link>https://moabukar.co.uk/blog/internal-developer-platforms/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/internal-developer-platforms/</guid><description>Explore Port and Kratix for building internal developer platforms. Self-service infrastructure, developer workflows, and platform engineering patterns.</description><pubDate>Tue, 18 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Port and Kratix: Internal Developer Platforms Beyond Backstage
===============================================================

Backstage is a developer portal. Port and Kratix go further -
they&apos;re platforms for building platforms. Port focuses on the
catalog and self-service actions. Kratix focuses on composable
infrastructure delivery.

This guide covers when to use each and how to implement them.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/backstage.svg&quot; alt=&quot;Backstage logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- **Port**: SaaS developer portal with actions and scorecards
- **Kratix**: Self-hosted platform framework with Promises
- Backstage for catalog + docs, Port for actions + metrics
- Kratix for GitOps-native infrastructure delivery
- All can work together


Port: Self-Service Developer Portal
====================================

Port is a SaaS platform for building developer portals with
self-service capabilities.


Architecture
------------

```
┌─────────────────────────────────────────────────────────────────┐
│                           Port                                   │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────────┐  │
│  │  Software   │  │  Self-Svc   │  │     Scorecards          │  │
│  │  Catalog    │  │  Actions    │  │  (Production Readiness) │  │
│  └─────────────┘  └─────────────┘  └─────────────────────────┘  │
└─────────────────────────────────────────────────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                    ▼
   ┌────────────┐       ┌────────────┐       ┌────────────┐
   │  GitHub    │       │ Kubernetes │       │   Slack    │
   │            │       │            │       │            │
   └────────────┘       └────────────┘       └────────────┘
```


Define Blueprints
-----------------

```json
{
  &quot;identifier&quot;: &quot;service&quot;,
  &quot;title&quot;: &quot;Service&quot;,
  &quot;icon&quot;: &quot;Microservice&quot;,
  &quot;schema&quot;: {
    &quot;properties&quot;: {
      &quot;language&quot;: {
        &quot;type&quot;: &quot;string&quot;,
        &quot;enum&quot;: [&quot;Go&quot;, &quot;Python&quot;, &quot;Node.js&quot;, &quot;Java&quot;]
      },
      &quot;tier&quot;: {
        &quot;type&quot;: &quot;string&quot;,
        &quot;enum&quot;: [&quot;critical&quot;, &quot;standard&quot;, &quot;experimental&quot;]
      },
      &quot;owner&quot;: {
        &quot;type&quot;: &quot;string&quot;
      },
      &quot;repository&quot;: {
        &quot;type&quot;: &quot;string&quot;,
        &quot;format&quot;: &quot;url&quot;
      },
      &quot;slackChannel&quot;: {
        &quot;type&quot;: &quot;string&quot;
      },
      &quot;onCall&quot;: {
        &quot;type&quot;: &quot;string&quot;
      },
      &quot;productionReadiness&quot;: {
        &quot;type&quot;: &quot;number&quot;,
        &quot;minimum&quot;: 0,
        &quot;maximum&quot;: 100
      }
    },
    &quot;required&quot;: [&quot;language&quot;, &quot;tier&quot;, &quot;owner&quot;]
  },
  &quot;relations&quot;: {
    &quot;environment&quot;: {
      &quot;target&quot;: &quot;environment&quot;,
      &quot;many&quot;: true
    },
    &quot;dependencies&quot;: {
      &quot;target&quot;: &quot;service&quot;,
      &quot;many&quot;: true
    }
  }
}
```


Self-Service Actions
--------------------

```json
{
  &quot;identifier&quot;: &quot;create_service&quot;,
  &quot;title&quot;: &quot;Create New Service&quot;,
  &quot;icon&quot;: &quot;Plus&quot;,
  &quot;trigger&quot;: {
    &quot;type&quot;: &quot;self-service&quot;,
    &quot;userInputs&quot;: {
      &quot;properties&quot;: {
        &quot;name&quot;: {
          &quot;type&quot;: &quot;string&quot;,
          &quot;pattern&quot;: &quot;^[a-z][a-z0-9-]*$&quot;
        },
        &quot;language&quot;: {
          &quot;type&quot;: &quot;string&quot;,
          &quot;enum&quot;: [&quot;Go&quot;, &quot;Python&quot;, &quot;Node.js&quot;]
        },
        &quot;tier&quot;: {
          &quot;type&quot;: &quot;string&quot;,
          &quot;enum&quot;: [&quot;critical&quot;, &quot;standard&quot;]
        },
        &quot;includeDatabase&quot;: {
          &quot;type&quot;: &quot;boolean&quot;,
          &quot;default&quot;: false
        }
      },
      &quot;required&quot;: [&quot;name&quot;, &quot;language&quot;, &quot;tier&quot;]
    }
  },
  &quot;invocationMethod&quot;: {
    &quot;type&quot;: &quot;GITHUB&quot;,
    &quot;org&quot;: &quot;company&quot;,
    &quot;repo&quot;: &quot;platform-actions&quot;,
    &quot;workflow&quot;: &quot;create-service.yaml&quot;
  }
}
```
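
The `userInputs` schema above describes the constraints on each field. If you want to test payloads locally before wiring up the action, the same checks can be sketched in a few lines of Python. This is a hand-rolled illustration: `validate_inputs` and its error strings are hypothetical and not part of Port.

```python
import re

# Constraints copied by hand from the userInputs schema above.
NAME_PATTERN = re.compile(r"[a-z][a-z0-9-]*")
LANGUAGES = {"Go", "Python", "Node.js"}
TIERS = {"critical", "standard"}
REQUIRED = ("name", "language", "tier")


def validate_inputs(inputs):
    """Return a list of problems; an empty list means the inputs are valid."""
    problems = []
    for field in REQUIRED:
        if field not in inputs:
            problems.append(f"missing required field: {field}")
    name = inputs.get("name")
    if name is not None and not NAME_PATTERN.fullmatch(name):
        problems.append(f"invalid name: {name!r}")
    language = inputs.get("language")
    if language is not None and language not in LANGUAGES:
        problems.append(f"unsupported language: {language!r}")
    tier = inputs.get("tier")
    if tier is not None and tier not in TIERS:
        problems.append(f"unsupported tier: {tier!r}")
    return problems


good = validate_inputs({"name": "payments-api", "language": "Go", "tier": "critical"})
bad = validate_inputs({"name": "Payments", "language": "Rust"})
print(good)
print(bad)
```

The pattern matters most: it keeps service names safe to reuse as repository names and Kubernetes resource names further down the pipeline.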


GitHub Action Backend
---------------------

```yaml
# .github/workflows/create-service.yaml
name: Create Service
on:
  workflow_dispatch:
    inputs:
      name:
        required: true
      language:
        required: true
      tier:
        required: true
      includeDatabase:
        required: false
        default: &apos;false&apos;
      port_run_id:
        required: true

jobs:
  create:
    runs-on: ubuntu-latest
    steps:
      - name: Notify Port - Running
        uses: port-labs/port-github-action@v1
        with:
          clientId: ${{ secrets.PORT_CLIENT_ID }}
          clientSecret: ${{ secrets.PORT_CLIENT_SECRET }}
          operation: PATCH_RUN
          runId: ${{ inputs.port_run_id }}
          logMessage: &quot;Creating service ${{ inputs.name }}&quot;

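      - name: Create Repository
        uses: actions/github-script@v6
        env:
          # Pass inputs via env so they are never interpolated into the script source
          SERVICE_NAME: ${{ inputs.name }}
          SERVICE_LANGUAGE: ${{ inputs.language }}
        with:
          # Note: the token used here needs permission to create repositories in the org
          script: |
            await github.rest.repos.createUsingTemplate({
              template_owner: &apos;company&apos;,
              template_repo: `${process.env.SERVICE_LANGUAGE}-service-template`,
              name: process.env.SERVICE_NAME,
              owner: &apos;company&apos;,
              private: true
            })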

      - name: Create Database (if requested)
        if: inputs.includeDatabase == &apos;true&apos;
        run: |
          # Trigger Crossplane claim or Terraform
          kubectl apply -f - &lt;&lt;EOF
          apiVersion: database.platform.company.com/v1alpha1
          kind: PostgreSQLInstance
          metadata:
            name: ${{ inputs.name }}-db
            namespace: platform
          spec:
            size: small
          EOF

      - name: Register in Port
        uses: port-labs/port-github-action@v1
        with:
          clientId: ${{ secrets.PORT_CLIENT_ID }}
          clientSecret: ${{ secrets.PORT_CLIENT_SECRET }}
          operation: UPSERT
          blueprint: service
          identifier: ${{ inputs.name }}
          properties: |
            {
              &quot;language&quot;: &quot;${{ inputs.language }}&quot;,
              &quot;tier&quot;: &quot;${{ inputs.tier }}&quot;,
              &quot;repository&quot;: &quot;https://github.com/company/${{ inputs.name }}&quot;
            }

      - name: Notify Port - Complete
        uses: port-labs/port-github-action@v1
        with:
          clientId: ${{ secrets.PORT_CLIENT_ID }}
          clientSecret: ${{ secrets.PORT_CLIENT_SECRET }}
          operation: PATCH_RUN
          runId: ${{ inputs.port_run_id }}
          status: &quot;SUCCESS&quot;
          summary: &quot;Service ${{ inputs.name }} created successfully&quot;
```


Scorecards
----------

```json
{
  &quot;identifier&quot;: &quot;production_readiness&quot;,
  &quot;title&quot;: &quot;Production Readiness&quot;,
  &quot;rules&quot;: [
    {
      &quot;identifier&quot;: &quot;has_readme&quot;,
      &quot;title&quot;: &quot;Has README&quot;,
      &quot;level&quot;: &quot;Bronze&quot;,
      &quot;query&quot;: {
        &quot;combinator&quot;: &quot;and&quot;,
        &quot;conditions&quot;: [
          { &quot;property&quot;: &quot;hasReadme&quot;, &quot;operator&quot;: &quot;=&quot;, &quot;value&quot;: true }
        ]
      }
    },
    {
      &quot;identifier&quot;: &quot;has_monitoring&quot;,
      &quot;title&quot;: &quot;Has Monitoring&quot;,
      &quot;level&quot;: &quot;Silver&quot;,
      &quot;query&quot;: {
        &quot;combinator&quot;: &quot;and&quot;,
        &quot;conditions&quot;: [
          { &quot;property&quot;: &quot;hasMonitoring&quot;, &quot;operator&quot;: &quot;=&quot;, &quot;value&quot;: true }
        ]
      }
    },
    {
      &quot;identifier&quot;: &quot;has_runbook&quot;,
      &quot;title&quot;: &quot;Has Runbook&quot;,
      &quot;level&quot;: &quot;Silver&quot;,
      &quot;query&quot;: {
        &quot;combinator&quot;: &quot;and&quot;,
        &quot;conditions&quot;: [
          { &quot;property&quot;: &quot;runbookUrl&quot;, &quot;operator&quot;: &quot;isNotEmpty&quot; }
        ]
      }
    },
    {
      &quot;identifier&quot;: &quot;slo_defined&quot;,
      &quot;title&quot;: &quot;SLO Defined&quot;,
      &quot;level&quot;: &quot;Gold&quot;,
      &quot;query&quot;: {
        &quot;combinator&quot;: &quot;and&quot;,
        &quot;conditions&quot;: [
          { &quot;property&quot;: &quot;sloAvailability&quot;, &quot;operator&quot;: &quot;&gt;&quot;, &quot;value&quot;: 0 }
        ]
      }
    }
  ]
}
```
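
To make the Bronze/Silver/Gold ladder concrete, here is a rough Python sketch of how a level could be derived from the four rules above. It assumes ladder semantics, where a service only reaches a level once it passes every rule at that level and at all lower levels. That is my reading of how tiered scorecards behave, so treat `achieved_level` as an illustration rather than Port&apos;s implementation.

```python
LEVELS = ["Bronze", "Silver", "Gold"]  # lowest to highest

# Hand-written equivalents of the four scorecard rules above.
RULES = [
    {"id": "has_readme", "level": "Bronze",
     "check": lambda e: e.get("hasReadme") is True},
    {"id": "has_monitoring", "level": "Silver",
     "check": lambda e: e.get("hasMonitoring") is True},
    {"id": "has_runbook", "level": "Silver",
     "check": lambda e: bool(e.get("runbookUrl"))},
    {"id": "slo_defined", "level": "Gold",
     "check": lambda e: (e.get("sloAvailability") or 0) > 0},
]


def achieved_level(entity):
    """Return the highest level whose rules, and all lower levels' rules, pass."""
    achieved = None
    for level in LEVELS:
        level_rules = [r for r in RULES if r["level"] == level]
        if not all(r["check"](entity) for r in level_rules):
            break  # a failed level blocks every level above it
        achieved = level
    return achieved


service = {"hasReadme": True, "hasMonitoring": True, "sloAvailability": 99.9}
print(achieved_level(service))
```

The example service has an SLO but no runbook, so under these assumptions it stays at Bronze: the missing Silver rule blocks Gold.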


Kratix: Composable Platform Framework
=====================================

Kratix lets you define &quot;Promises&quot; - self-service capabilities
that developers can request. It&apos;s GitOps-native and works with
any Kubernetes resources.


Architecture
------------

```
┌─────────────────────────────────────────────────────────────────┐
│                      Platform Cluster                            │
│  ┌─────────────────────────────────────────────────────────────┐│
│  │                    Kratix Controller                         ││
│  │  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  ││
│  │  │   Promise   │  │   Promise   │  │      Promise        │  ││
│  │  │  (Database) │  │  (Logging)  │  │  (Environment)      │  ││
│  │  └─────────────┘  └─────────────┘  └─────────────────────┘  ││
│  └─────────────────────────────────────────────────────────────┘│
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼ GitOps (Flux/Argo)
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                    ▼
   ┌────────────┐       ┌────────────┐       ┌────────────┐
   │  Worker 1  │       │  Worker 2  │       │  Worker 3  │
   │  (Dev)     │       │  (Staging) │       │  (Prod)    │
   └────────────┘       └────────────┘       └────────────┘
```


Define a Promise
----------------

```yaml
# promise-postgresql.yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Promise
metadata:
  name: postgresql
spec:
  # What developers request
  api:
    apiVersion: apiextensions.k8s.io/v1
    kind: CustomResourceDefinition
    metadata:
      name: postgresqls.database.platform.company.com
    spec:
      group: database.platform.company.com
      names:
        kind: PostgreSQL
        plural: postgresqls
        singular: postgresql
      scope: Namespaced
      versions:
        - name: v1
          served: true
          storage: true
          schema:
            openAPIV3Schema:
              type: object
              properties:
                spec:
                  type: object
                  properties:
                    size:
                      type: string
                      enum: [&quot;small&quot;, &quot;medium&quot;, &quot;large&quot;]
                    version:
                      type: string
                      default: &quot;15&quot;
                  required:
                    - size
  
  # Pipeline to process requests
  workflows:
    resource:
      configure:
        - apiVersion: platform.kratix.io/v1alpha1
          kind: Pipeline
          metadata:
            name: configure-postgresql
          spec:
            containers:
              - name: generate-manifests
                image: company/postgresql-pipeline:latest
                command:
                  - /bin/sh
                  - -c
                  - |
                    # Read request
                    SIZE=$(yq &apos;.spec.size&apos; /kratix/input/object.yaml)
                    VERSION=$(yq &apos;.spec.version&apos; /kratix/input/object.yaml)
                    NAME=$(yq &apos;.metadata.name&apos; /kratix/input/object.yaml)
                    NAMESPACE=$(yq &apos;.metadata.namespace&apos; /kratix/input/object.yaml)
                    
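                    # Map size to resources (fail fast on anything unexpected)
                    case &quot;$SIZE&quot; in
                      small)  CPU=500m; MEM=1Gi; STORAGE=10Gi ;;
                      medium) CPU=1;    MEM=2Gi; STORAGE=50Gi ;;
                      large)  CPU=2;    MEM=4Gi; STORAGE=100Gi ;;
                      *) echo &quot;unsupported size: $SIZE&quot; &gt;&amp;2; exit 1 ;;
                    esac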
                    
                    # Generate CloudNativePG cluster
                    cat &gt; /kratix/output/cluster.yaml &lt;&lt;EOF
                    apiVersion: postgresql.cnpg.io/v1
                    kind: Cluster
                    metadata:
                      name: ${NAME}
                      namespace: ${NAMESPACE}
                    spec:
                      instances: 3
                      imageName: ghcr.io/cloudnative-pg/postgresql:${VERSION}
                      storage:
                        size: ${STORAGE}
                      resources:
                        requests:
                          cpu: ${CPU}
                          memory: ${MEM}
                      monitoring:
                        enablePodMonitor: true
                    EOF
```
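
The pipeline logic is easier to unit test outside a container. Below is a minimal Python sketch of the same translation, from a request object to a CloudNativePG `Cluster`. The size presets and defaults are copied from the Promise above; `build_cluster_manifest` is a hypothetical helper, not part of Kratix.

```python
# Size presets copied from the case statement in the pipeline above.
SIZE_PRESETS = {
    "small": {"cpu": "500m", "memory": "1Gi", "storage": "10Gi"},
    "medium": {"cpu": "1", "memory": "2Gi", "storage": "50Gi"},
    "large": {"cpu": "2", "memory": "4Gi", "storage": "100Gi"},
}
DEFAULT_VERSION = "15"  # matches the CRD default


def build_cluster_manifest(request):
    """Translate a PostgreSQL request into a CloudNativePG Cluster dict."""
    spec = request.get("spec", {})
    size = spec.get("size")
    if size not in SIZE_PRESETS:
        raise ValueError(f"unsupported size: {size!r}")
    preset = SIZE_PRESETS[size]
    version = spec.get("version", DEFAULT_VERSION)
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Cluster",
        "metadata": {
            "name": request["metadata"]["name"],
            "namespace": request["metadata"]["namespace"],
        },
        "spec": {
            "instances": 3,
            "imageName": f"ghcr.io/cloudnative-pg/postgresql:{version}",
            "storage": {"size": preset["storage"]},
            "resources": {
                "requests": {"cpu": preset["cpu"], "memory": preset["memory"]},
            },
            "monitoring": {"enablePodMonitor": True},
        },
    }


request = {
    "metadata": {"name": "my-app-db", "namespace": "my-team"},
    "spec": {"size": "medium"},
}
manifest = build_cluster_manifest(request)
print(manifest["spec"]["imageName"], manifest["spec"]["storage"]["size"])
```

Keeping the size mapping in one table makes it obvious what each t-shirt size costs, and gives you one place to change it.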


Developer Usage
---------------

```yaml
# database-request.yaml
apiVersion: database.platform.company.com/v1
kind: PostgreSQL
metadata:
  name: my-app-db
  namespace: my-team
spec:
  size: medium
  version: &quot;15&quot;
```


Multi-Cluster Delivery
----------------------

```yaml
# kratix-destination.yaml
apiVersion: platform.kratix.io/v1alpha1
kind: Destination
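metadata:
  name: production
  labels:
    environment: production  # assumed label for Promises to select on
spec:
  stateStoreRef:
    name: production-gitops
    kind: GitStateStore
  # With strictMatchLabels, only workloads whose selectors match the labels above land here
  strictMatchLabels: true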
  
---
apiVersion: platform.kratix.io/v1alpha1
kind: GitStateStore
metadata:
  name: production-gitops
spec:
  url: https://github.com/company/gitops-production
  branch: main
  secretRef:
    name: github-credentials
```


Comparison
==========

```
FEATURE                 PORT            KRATIX          BACKSTAGE
=======                 ====            ======          =========
Hosting                 SaaS            Self-hosted     Self-hosted
Catalog                 ✅              ❌              ✅
Self-Service Actions    ✅ (built-in)   ✅ (Promises)   ✅ (plugins)
Scorecards              ✅              ❌              ✅ (plugins)
GitOps Native           ✅ (GitHub)     ✅              ❌
Infrastructure          Actions         Native          Plugins
Learning Curve          Low             Medium          High
Customization           Medium          High            Very High
```


When to Use Each
================

**Port:**
- Quick time-to-value
- Non-technical stakeholders need visibility
- Scorecards and compliance tracking
- Existing GitHub/GitLab workflows

**Kratix:**
- GitOps-native infrastructure delivery
- Multi-cluster deployments
- Complex infrastructure composition
- Want to own the platform

**Backstage:**
- Need extensive customization
- Strong frontend development capability
- Want to own the entire portal


References
==========

- Port Docs: https://docs.getport.io
- Kratix Docs: https://kratix.io/docs
- Backstage: https://backstage.io


========================================
Port + Kratix + Platform Engineering
========================================
Self-service infrastructure. Developer happiness.
========================================</content:encoded><category>platform-engineering</category><category>port</category><category>kratix</category><category>developer-experience</category><category>self-service</category><author>Mo Abukar</author></item><item><title>AWS Account Provisioning at Scale with Control Tower, Service Catalog, and Terraform</title><link>https://moabukar.co.uk/blog/aws-account-provisioning/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/aws-account-provisioning/</guid><description>How to build an automated account vending machine using AWS Control Tower Account Factory, Service Catalog, CloudFormation StackSets, and Terraform – from request to fully provisioned account with SSO and IAM roles.</description><pubDate>Sat, 15 Nov 2025 00:00:00 GMT</pubDate><content:encoded># AWS Account Provisioning at Scale with Control Tower, Service Catalog, and Terraform

When you&apos;re running a platform for hundreds of microservices, account sprawl is inevitable. Teams need isolated environments – dev, staging, prod – and you need guardrails, SSO access, networking, and baseline security in every single one.

Doing this manually doesn&apos;t scale. At a previous company, I built an automated account vending machine that could spin up a fully configured AWS account in under 30 minutes: enrolled in Control Tower, SSO access configured, baseline IAM roles deployed, and ready for application workloads.

This post covers the architecture and Terraform implementation – how we used Control Tower Account Factory via Service Catalog, CloudFormation StackSets for cross-account role deployment, and a modular Terraform structure that made provisioning new accounts a single PR.

## The Flow

![Account provisioning architecture](/images/aws-account-provisioning.webp)

The provisioning flow:

1. **Developer requests account** via PR to the account-provisioning repo
2. **Terraform provisions** the account via Service Catalog (Control Tower Account Factory)
3. **Control Tower enrolls** the account, applies guardrails, sets up CloudTrail/Config
4. **StackSet deploys** baseline IAM roles to the new account
5. **SSO user created** automatically with access to the account
6. **Account metadata** written to S3 for billing/tagging systems
7. **CI/CD stack created** with permissions to deploy to the new account

The entire process is GitOps-driven. No console clicks, no manual steps, full audit trail.

## Prerequisites

Before implementing this, you need:

- **AWS Organizations** with a management (billing) account
- **Control Tower** enabled and configured
- **Service Catalog** with the Control Tower Account Factory product
- **At least one registered OU** (Organizational Unit) in Control Tower
- **IAM Identity Center (SSO)** configured
- **A CI/CD platform** with AWS integration (we used a Terraform automation platform)

## Module Structure

```
account-provisioning/
├── modules/
│   ├── account/           # Creates AWS account via Service Catalog
│   │   ├── main.tf
│   │   ├── variables.tf
│   │   ├── outputs.tf
│   │   └── metadata.tf
│   └── stack/             # Creates CI/CD stack for the OU
│       ├── main.tf
│       ├── variables.tf
│       └── outputs.tf
└── ou/
    ├── stacks.tf          # CI/CD stack definitions per OU
    ├── platform/
    │   ├── my-service/
    │   │   └── main.tf    # Account definitions
    │   └── another-service/
    │       └── main.tf
    └── infrastructure/
        ├── dns/
        │   └── main.tf
        └── networking/
            └── main.tf
```

Each directory under `ou/` corresponds to a registered Organizational Unit. Accounts are defined within their respective OU folders.

## The Account Module

This is the core module that provisions accounts via Control Tower Account Factory.

### modules/account/variables.tf

```hcl
variable &quot;name&quot; {
  description = &quot;Account name (must be unique across the organization)&quot;
  type        = string
}

variable &quot;email&quot; {
  description = &quot;Root email for the account (must be unique, use + addressing)&quot;
  type        = string
}

variable &quot;parent_ou_id&quot; {
  description = &quot;Parent OU ID for account lookups&quot;
  type        = string
}

variable &quot;ou_id&quot; {
  description = &quot;Target OU ID where the account will be placed&quot;
  type        = string
}

variable &quot;sso_user_firstname&quot; {
  description = &quot;First name for the SSO user&quot;
  type        = string
}

variable &quot;sso_user_lastname&quot; {
  description = &quot;Last name for the SSO user&quot;
  type        = string
}

variable &quot;alias&quot; {
  description = &quot;Account alias (defaults to name)&quot;
  type        = string
  default     = null
}

variable &quot;account_deployment_type&quot; {
  description = &quot;Deployment type for billing categorization&quot;
  type        = string
  default     = null
}

variable &quot;account_bill_type&quot; {
  description = &quot;Billing type (Prod/NonProd/Data)&quot;
  type        = string
  default     = null
}
```

### modules/account/main.tf

```hcl
locals {
  # After account creation, look it up by name to get the account ID
  account_lookup = [
    for account in data.aws_organizations_organizational_unit_descendant_accounts.accounts.accounts : 
    account if account.name == var.name
  ]
  account_id = length(local.account_lookup) &gt; 0 ? local.account_lookup[0].id : null
}

# Look up all accounts in the parent OU to find our newly created account
data &quot;aws_organizations_organizational_unit_descendant_accounts&quot; &quot;accounts&quot; {
  depends_on = [aws_cloudformation_stack_set_instance.deploy_baseline_roles]
  parent_id  = var.parent_ou_id
}

# This is the magic - Service Catalog provisions the account via Control Tower
resource &quot;aws_servicecatalog_provisioned_product&quot; &quot;account&quot; {
  name                       = var.name
  product_name               = &quot;AWS Control Tower Account Factory&quot;
  provisioning_artifact_name = &quot;AWS Control Tower Account Factory&quot;

  provisioning_parameters {
    key   = &quot;AccountName&quot;
    value = var.name
  }

  provisioning_parameters {
    key   = &quot;AccountEmail&quot;
    value = var.email
  }

  provisioning_parameters {
    key   = &quot;ManagedOrganizationalUnit&quot;
    value = &quot;Custom (${var.ou_id})&quot;
  }

  provisioning_parameters {
    key   = &quot;SSOUserEmail&quot;
    value = var.email
  }

  provisioning_parameters {
    key   = &quot;SSOUserFirstName&quot;
    value = var.sso_user_firstname
  }

  provisioning_parameters {
    key   = &quot;SSOUserLastName&quot;
    value = var.sso_user_lastname
  }

  tags = {
    ManagedBy  = &quot;Terraform&quot;
    OwningTeam = &quot;Platform&quot;
  }

  # Account creation can take 20-30 minutes
  timeouts {
    create = &quot;60m&quot;
    update = &quot;60m&quot;
    delete = &quot;60m&quot;
  }
}

# Deploy baseline IAM roles to the new account via StackSet
resource &quot;aws_cloudformation_stack_set_instance&quot; &quot;deploy_baseline_roles&quot; {
  stack_set_name = &quot;BaselineIAMRoles&quot;  # Pre-created StackSet

  deployment_targets {
    organizational_unit_ids = [var.parent_ou_id]
  }

  retain_stack = false
  region       = &quot;eu-west-1&quot;

  depends_on = [
    aws_servicecatalog_provisioned_product.account
  ]
}
```

### Key Points About the Account Module

**Service Catalog provisioning**: The `aws_servicecatalog_provisioned_product` resource triggers Control Tower Account Factory. This:
- Creates the AWS account
- Enrolls it in Control Tower
- Applies mandatory guardrails (SCPs)
- Sets up CloudTrail and AWS Config
- Creates the SSO user

**The OU parameter format**: Note `&quot;Custom (${var.ou_id})&quot;` – Control Tower expects this value as the OU name followed by the OU ID in parentheses. `Custom` here is the name of the OU the account lands in, so if your OU is called something else, substitute its name.

**StackSet deployment**: After the account exists, we deploy baseline IAM roles via a pre-existing CloudFormation StackSet. This runs automatically across all accounts in the OU.

**Account ID lookup**: We can&apos;t get the account ID directly from Service Catalog, so we look it up after creation using the Organizations API.
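
The lookup itself is just a filter over the accounts in the OU. A small Python sketch of the same idea, with one deliberate difference: the HCL above takes the first match, whereas this version raises if two accounts share a name. `find_account_id` and the sample account IDs are made up for illustration.

```python
def find_account_id(accounts, name):
    """Return the ID of the account called `name`, or None if it is not there yet."""
    matches = [account["id"] for account in accounts if account["name"] == name]
    if len(matches) > 1:
        # The Terraform local would silently use matches[0] here.
        raise ValueError(f"multiple accounts named {name!r}: {matches}")
    return matches[0] if matches else None


# Shape loosely modelled on the data source output: a list of id/name pairs.
accounts = [
    {"id": "111111111111", "name": "my-service-dev"},
    {"id": "222222222222", "name": "my-service-prod"},
]
print(find_account_id(accounts, "my-service-prod"))
print(find_account_id(accounts, "not-created-yet"))
```

Returning `None` for a missing account mirrors the `null` in the Terraform local, which is what lets downstream resources use `count` to wait until the account exists.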

### modules/account/outputs.tf

```hcl
output &quot;account_id&quot; {
  description = &quot;The AWS account ID&quot;
  value       = local.account_id
}

output &quot;account_name&quot; {
  description = &quot;The account name&quot;
  value       = var.name
}

output &quot;deploy_role_arn&quot; {
  description = &quot;ARN of the deployment role in the new account&quot;
  value       = local.account_id != null ? &quot;arn:aws:iam::${local.account_id}:role/DeployRole&quot; : null
}
```

### modules/account/metadata.tf

We write account metadata to S3 for billing systems and asset management:

```hcl
locals {
  # Infer deployment type from account name
  account_deployment_type = can(regex(&quot;data&quot;, var.name)) ? &quot;Data&quot; : &quot;Containers&quot;
  
  # Infer billing type from account name
  account_bill_type = can(regex(&quot;data&quot;, var.name)) ? &quot;Data&quot; : (
    can(regex(&quot;-prod&quot;, var.name)) ? &quot;Prod&quot; : &quot;NonProd&quot;
  )
  
  # Look up parent OU name for metadata
  parent_ou_name = [
    for ou in data.aws_organizations_organizational_units.root.children : 
    ou.name if ou.id == var.parent_ou_id
  ][0]
}

data &quot;aws_organizations_organization&quot; &quot;org&quot; {}

data &quot;aws_organizations_organizational_units&quot; &quot;root&quot; {
  parent_id = data.aws_organizations_organization.org.roots[0].id
}

resource &quot;aws_s3_object&quot; &quot;account_metadata&quot; {
  count = local.account_id != null ? 1 : 0

  bucket = &quot;platform-account-metadata&quot;  # Pre-existing bucket
  key    = &quot;accounts/${var.name}-${local.account_id}.json&quot;
  acl    = &quot;private&quot;
  
  content = jsonencode({
    account_id              = local.account_id
    account_name            = var.name
    alias                   = coalesce(var.alias, var.name)
    account_deployment_type = coalesce(var.account_deployment_type, local.account_deployment_type)
    account_bill_type       = coalesce(var.account_bill_type, local.account_bill_type)
    parent_ou_id            = var.parent_ou_id
    parent_ou_name          = local.parent_ou_name
    created_at              = timestamp()
  })

  lifecycle {
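    # timestamp() changes on every plan, so ignore content to keep the object stable.
    # Note this also means later metadata edits are not written to S3.
    ignore_changes = [content]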
  }
}
```

This metadata feeds into:
- Cost allocation and showback
- Asset inventory
- Compliance reporting
- Automated tagging
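
Because the deployment and billing types are inferred from the account name, it can help to mirror the rules somewhere you can unit-test them. The sketch below restates the same substring checks in TypeScript; the function names are hypothetical and not part of the module:

```typescript
// Hypothetical mirror of the locals in modules/account/metadata.tf.
// Uses the same substring rules as the Terraform regex() calls.

function inferDeploymentType(accountName: string): "Data" | "Containers" {
  return /data/.test(accountName) ? "Data" : "Containers";
}

function inferBillType(accountName: string): "Data" | "Prod" | "NonProd" {
  if (/data/.test(accountName)) return "Data";
  return /-prod/.test(accountName) ? "Prod" : "NonProd";
}

// Print the inferred types for a few sample names
for (const name of ["dns-prod", "dns-nonprod", "datalake-prod", "metadata-api-dev"]) {
  console.log(name, inferDeploymentType(name), inferBillType(name));
}
```

Note that a name like `metadata-api-dev` is classified as `Data` because the match is a plain substring. Setting `account_deployment_type` and `account_bill_type` explicitly sidesteps that.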

## The Stack Module

Each OU needs a CI/CD stack that can provision accounts within it:

### modules/stack/main.tf

```hcl
variable &quot;name&quot; {
  description = &quot;Stack name&quot;
  type        = string
}

variable &quot;repository&quot; {
  description = &quot;Git repository name&quot;
  type        = string
}

variable &quot;path&quot; {
  description = &quot;Path within repository&quot;
  type        = string
}

variable &quot;deploy_role_arn&quot; {
  description = &quot;IAM role ARN for deployments&quot;
  type        = string
}

# Create the CI/CD stack (example using a generic Terraform automation platform)
resource &quot;cicd_stack&quot; &quot;this&quot; {
  name = var.name

  vcs_config {
    branch     = &quot;main&quot;
    repository = var.repository
    path       = var.path
  }

  # Administrative stacks can create other stacks
  administrative = true

  labels = [&quot;folder:platform/accounts&quot;]
}

# Configure AWS credentials for the stack
resource &quot;cicd_aws_integration&quot; &quot;this&quot; {
  name     = var.name
  role_arn = var.deploy_role_arn

  # Account creation takes time, extend session duration
  session_duration_seconds = 3600
}

resource &quot;cicd_aws_integration_attachment&quot; &quot;this&quot; {
  stack_id       = cicd_stack.this.id
  integration_id = cicd_aws_integration.this.id
  read           = true
  write          = true
}

# Pass the role ARN as a Terraform variable
resource &quot;cicd_environment_variable&quot; &quot;deploy_role&quot; {
  stack_id = cicd_stack.this.id
  name     = &quot;TF_VAR_deploy_role_arn&quot;
  value    = var.deploy_role_arn
}

# Attach standard policies
resource &quot;cicd_policy_attachment&quot; &quot;standard_plan&quot; {
  policy_id = &quot;standard-plan-policy&quot;
  stack_id  = cicd_stack.this.id
}

resource &quot;cicd_policy_attachment&quot; &quot;git_push_trigger&quot; {
  policy_id = &quot;git-push-trigger&quot;
  stack_id  = cicd_stack.this.id
}

output &quot;stack_id&quot; {
  value = cicd_stack.this.id
}
```

### ou/stacks.tf

Define a stack for each OU:

```hcl
module &quot;platform_ou&quot; {
  source = &quot;../modules/stack&quot;

  name            = &quot;platform-ou&quot;
  repository      = &quot;account-provisioning&quot;
  path            = &quot;ou/platform&quot;
  deploy_role_arn = &quot;arn:aws:iam::123456789012:role/AccountProvisioningRole&quot;
}

module &quot;infrastructure_ou&quot; {
  source = &quot;../modules/stack&quot;

  name            = &quot;infrastructure-ou&quot;
  repository      = &quot;account-provisioning&quot;
  path            = &quot;ou/infrastructure&quot;
  deploy_role_arn = &quot;arn:aws:iam::123456789012:role/AccountProvisioningRole&quot;
}

module &quot;sandbox_ou&quot; {
  source = &quot;../modules/stack&quot;

  name            = &quot;sandbox-ou&quot;
  repository      = &quot;account-provisioning&quot;
  path            = &quot;ou/sandbox&quot;
  deploy_role_arn = &quot;arn:aws:iam::123456789012:role/AccountProvisioningRole&quot;
}
```

## Creating Accounts

With the modules in place, creating an account is a simple Terraform definition.

### Single Account

```hcl
# ou/infrastructure/dns/main.tf
locals {
  account_name = &quot;dns&quot;
  parent_ou_id = &quot;ou-xxxx-yyyyyyyy&quot;  # Infrastructure OU
  prod_ou      = &quot;ou-xxxx-prodprod&quot;
  nonprod_ou   = &quot;ou-xxxx-nonprodx&quot;
}

module &quot;dns_prod&quot; {
  source = &quot;../../../modules/account&quot;

  name = &quot;${local.account_name}-prod&quot;

  parent_ou_id       = local.parent_ou_id
  ou_id              = local.prod_ou
  email              = &quot;aws-accounts+infra-${local.account_name}-prod@example.com&quot;
  sso_user_firstname = local.account_name
  sso_user_lastname  = &quot;prod&quot;
}

module &quot;dns_nonprod&quot; {
  source = &quot;../../../modules/account&quot;

  name = &quot;${local.account_name}-nonprod&quot;

  parent_ou_id       = local.parent_ou_id
  ou_id              = local.nonprod_ou
  email              = &quot;aws-accounts+infra-${local.account_name}-nonprod@example.com&quot;
  sso_user_firstname = local.account_name
  sso_user_lastname  = &quot;nonprod&quot;
}
```

### Multiple Environments Pattern

For services that need the full environment set:

```hcl
# ou/platform/my-service/main.tf
locals {
  service_name = &quot;order-service&quot;
  parent_ou_id = &quot;ou-xxxx-platform&quot;
  
  environments = {
    dev     = &quot;ou-xxxx-devdevde&quot;
    staging = &quot;ou-xxxx-staging&quot;
    prod    = &quot;ou-xxxx-prodprod&quot;
  }
}

module &quot;accounts&quot; {
  source   = &quot;../../../modules/account&quot;
  for_each = local.environments

  name = &quot;${local.service_name}-${each.key}&quot;

  parent_ou_id       = local.parent_ou_id
  ou_id              = each.value
  email              = &quot;aws-accounts+platform-${local.service_name}-${each.key}@example.com&quot;
  sso_user_firstname = local.service_name
  sso_user_lastname  = each.key
}

output &quot;account_ids&quot; {
  value = { for k, v in module.accounts : k =&gt; v.account_id }
}
```

## The Baseline IAM StackSet

Before accounts can be provisioned, you need a StackSet that deploys baseline roles:

```yaml
# baseline-iam-roles.yaml
AWSTemplateFormatVersion: &apos;2010-09-09&apos;
Description: Baseline IAM roles for all accounts

Resources:
  DeployRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: DeployRole
      AssumeRolePolicyDocument:
        Version: &apos;2012-10-17&apos;
        Statement:
          - Effect: Allow
            Principal:
              AWS: arn:aws:iam::123456789012:root  # Management account
            Action: sts:AssumeRole
            Condition:
              StringEquals:
                sts:ExternalId: !Ref ExternalId
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/AdministratorAccess
      Tags:
        - Key: ManagedBy
          Value: StackSet

  ReadOnlyRole:
    Type: AWS::IAM::Role
    Properties:
      RoleName: ReadOnlyRole
      AssumeRolePolicyDocument:
        Version: &apos;2012-10-17&apos;
        Statement:
          - Effect: Allow
            Principal:
              AWS: arn:aws:iam::123456789012:root
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - arn:aws:iam::aws:policy/ReadOnlyAccess

Parameters:
  ExternalId:
    Type: String
    Default: &quot;platform-deploy-external-id&quot;
    Description: External ID for assume role

Outputs:
  DeployRoleArn:
    Value: !GetAtt DeployRole.Arn
    Export:
      Name: DeployRoleArn
```

Create the StackSet in the management account:

```bash
aws cloudformation create-stack-set \
  --stack-set-name BaselineIAMRoles \
  --template-body file://baseline-iam-roles.yaml \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
  --capabilities CAPABILITY_NAMED_IAM
```

## Account Deletion

Deleting accounts requires careful sequencing:

### 1. Remove from Terraform

Delete the account module from the Terraform code and apply. This removes the Service Catalog provisioned product, which **unenrolls** the account from Control Tower but doesn&apos;t delete it.

### 2. Move to Suspended OU

```bash
aws organizations move-account \
  --account-id 123456789012 \
  --source-parent-id ou-xxxx-current \
  --destination-parent-id ou-xxxx-suspended
```

### 3. Close the Account

Requires admin access in the management account:

```bash
aws organizations close-account --account-id 123456789012
```

### 4. Verify Suspension

```bash
aws organizations describe-account --account-id 123456789012 \
  --query &apos;Account.Status&apos;
# Returns: &quot;SUSPENDED&quot; (or &quot;PENDING_CLOSURE&quot; while the close request is still processing)
```

The account remains in suspended state for 90 days before permanent deletion.

### 5. Clean Up SSO Assignments

Remove any SSO permission set assignments for the deleted account from your SSO configuration.

## Gotchas and Lessons Learned

### 1. OUs Must Be Registered in Control Tower

Terraform-created OUs are not automatically registered with Control Tower. You must register them manually in the Control Tower console first, then reference them in Terraform.

### 2. Email Addresses Must Be Unique

AWS requires unique email addresses for each account. Use `+` addressing:
- `aws-accounts+service-prod@example.com`
- `aws-accounts+service-dev@example.com`

All emails route to the same mailbox but are unique to AWS.
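
A small helper can keep the address format consistent and catch problems before Account Factory does. This is only a sketch: the function names, the `aws-accounts` mailbox, and the 64-character cap (the Organizations `CreateAccount` email limit as I understand it) are assumptions to check against your own setup:

```typescript
// Hypothetical helper: builds a plus-addressed root email per account.
// maxLength defaults to 64, the assumed AWS Organizations email limit.
function accountEmail(
  ouLabel: string,
  service: string,
  env: string,
  domain = "example.com",
  maxLength = 64,
): string {
  const email = `aws-accounts+${ouLabel}-${service}-${env}@${domain}`.toLowerCase();
  if (email.length > maxLength) {
    throw new Error(`Email is ${email.length} characters, limit is ${maxLength}: ${email}`);
  }
  return email;
}

// Reject duplicates across a batch of accounts before running terraform plan.
function assertUnique(emails: string[]): void {
  const duplicates = emails.filter((email, index) => emails.indexOf(email) !== index);
  if (duplicates.length > 0) {
    throw new Error(`Duplicate account emails: ${duplicates.join(", ")}`);
  }
}

const emails = ["dev", "staging", "prod"].map(env =>
  accountEmail("platform", "order-service", env),
);
assertUnique(emails);
console.log(emails);
```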

### 3. Account Creation Takes Time

Service Catalog account provisioning typically takes 20-30 minutes. Set appropriate timeouts:

```hcl
timeouts {
  create = &quot;60m&quot;
  update = &quot;60m&quot;
}
```

### 4. Account ID Lookup Timing

The account ID isn&apos;t available until after creation completes. Use `depends_on` to sequence the lookup:

```hcl
data &quot;aws_organizations_organizational_unit_descendant_accounts&quot; &quot;accounts&quot; {
  depends_on = [aws_servicecatalog_provisioned_product.account]
  parent_id  = var.parent_ou_id
}
```

### 5. StackSet Deployment is Eventually Consistent

New accounts may not immediately receive StackSet deployments. The StackSet targets the OU, and AWS eventually detects new accounts. Allow a few minutes after account creation.

### 6. Protect Account Emails with OPA/Sentinel

Once an account exists, changing its email is dangerous (it changes root access). Use policy-as-code to prevent email modifications:

```rego
# OPA policy to prevent account email changes
deny[msg] {
  # Bind one resource change so every check below applies to the same resource
  rc := input.resource_changes[_]
  rc.type == &quot;aws_servicecatalog_provisioned_product&quot;
  rc.change.actions[_] == &quot;update&quot;

  before := rc.change.before.provisioning_parameters
  after := rc.change.after.provisioning_parameters

  email_before := [p.value | p := before[_]; p.key == &quot;AccountEmail&quot;][0]
  email_after := [p.value | p := after[_]; p.key == &quot;AccountEmail&quot;][0]

  email_before != email_after

  msg := &quot;Account email cannot be changed after creation&quot;
}
```

### 7. Set Up Alerts for Root Login

Even with SSO, the root user still exists. Set up CloudWatch alerts for root login attempts:

```hcl
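# Assumes a CloudWatch Logs metric filter on the CloudTrail log group already
# publishes RootAccountUsage to the CloudTrailMetrics namespace
resource &quot;aws_cloudwatch_metric_alarm&quot; &quot;root_login&quot; {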
  alarm_name          = &quot;root-account-login-${var.name}&quot;
  comparison_operator = &quot;GreaterThanThreshold&quot;
  evaluation_periods  = 1
  metric_name         = &quot;RootAccountUsage&quot;
  namespace           = &quot;CloudTrailMetrics&quot;
  period              = 300
  statistic           = &quot;Sum&quot;
  threshold           = 0
  alarm_description   = &quot;Root account login detected&quot;
  alarm_actions       = [aws_sns_topic.security_alerts.arn]
}
```

## The End-to-End Flow

1. **Developer submits PR** adding account definition to `ou/platform/my-service/main.tf`
2. **CI runs `terraform plan`** showing new account resources
3. **PR approved and merged**
4. **CI runs `terraform apply`**:
   - Service Catalog triggers Account Factory
   - Account created and enrolled in Control Tower
   - StackSet deploys baseline IAM roles
   - SSO user created (invitation email sent)
   - Account metadata written to S3
5. **Developer receives SSO invitation** and can access the new account
6. **Downstream pipelines** can now deploy to the account using the provisioned IAM role

Total time: ~30 minutes from PR merge to usable account.

## Summary

Building an account vending machine with Control Tower and Terraform gives you:

- **Self-service provisioning** – developers request accounts via PR
- **Consistent configuration** – every account gets baseline roles, guardrails, SSO
- **Audit trail** – Git history shows who created what and when
- **Scalable** – same process whether you have 10 or 1,000 accounts
- **Compliant** – Control Tower guardrails enforced automatically

The upfront investment in building this pays off quickly when you&apos;re managing accounts at scale.

---

*Building an account vending machine or have questions about multi-account strategy? Find me on [LinkedIn](https://linkedin.com/in/moabukar).*</content:encoded><category>aws</category><category>control-tower</category><category>account-factory</category><category>terraform</category><category>service-catalog</category><category>organizations</category><category>sso</category><category>platform-engineering</category><category>devops</category><author>Mo Abukar</author></item><item><title>Backstage Plugins: Building Custom Developer Portal Features</title><link>https://moabukar.co.uk/blog/backstage-custom-plugins/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/backstage-custom-plugins/</guid><description>Build custom Backstage plugins for your internal developer portal. Create frontend components, backend APIs, and integrate with your existing tools.</description><pubDate>Fri, 14 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Backstage Plugins: Building Custom Developer Portal Features
=============================================================

Backstage is only as useful as the plugins you add. This guide
covers building custom plugins - frontend components, backend
APIs, and integrations with your existing infrastructure.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/backstage.svg&quot; alt=&quot;Backstage logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Plugins are modular React/Node packages
- Frontend plugin = React components + routes
- Backend plugin = Express routes + services
- Full example: Service health dashboard plugin
- Testing and deployment patterns included


Plugin Architecture
===================

```
backstage/
├── packages/
│   ├── app/                    # Frontend app
│   └── backend/                # Backend app
└── plugins/
    └── my-plugin/
        ├── src/
        │   ├── components/     # React components
        │   ├── api/            # API client
        │   ├── routes.tsx      # Plugin routes
        │   └── plugin.ts       # Plugin definition
        └── package.json
```


Create a Plugin
===============

```bash
# Create new plugin
cd backstage
yarn new --select plugin

# Follow prompts:
# ? Enter the ID of the plugin [required] service-health
# ? Enter the owner(s) of the plugin platform-team
```


Frontend Plugin
===============

Plugin Definition
-----------------

```typescript
// plugins/service-health/src/plugin.ts
import {
  createPlugin,
  createRoutableExtension,
  createApiFactory,
} from &apos;@backstage/core-plugin-api&apos;;
import { serviceHealthApiRef, ServiceHealthClient } from &apos;./api&apos;;
import { rootRouteRef } from &apos;./routes&apos;;

export const serviceHealthPlugin = createPlugin({
  id: &apos;service-health&apos;,
  routes: {
    root: rootRouteRef,
  },
  apis: [
    createApiFactory({
      api: serviceHealthApiRef,
      deps: {},
      factory: () =&gt; new ServiceHealthClient(),
    }),
  ],
});

export const ServiceHealthPage = serviceHealthPlugin.provide(
  createRoutableExtension({
    name: &apos;ServiceHealthPage&apos;,
    component: () =&gt;
      import(&apos;./components/ServiceHealthPage&apos;).then(m =&gt; m.ServiceHealthPage),
    mountPoint: rootRouteRef,
  }),
);
```


API Client
----------

```typescript
// plugins/service-health/src/api/types.ts
import { createApiRef } from &apos;@backstage/core-plugin-api&apos;;

export interface ServiceHealth {
  name: string;
  status: &apos;healthy&apos; | &apos;degraded&apos; | &apos;down&apos;;
  latency: number;
  uptime: number;
  lastChecked: string;
}

export interface ServiceHealthApi {
  getServices(): Promise&lt;ServiceHealth[]&gt;;
  getService(name: string): Promise&lt;ServiceHealth&gt;;
}

export const serviceHealthApiRef = createApiRef&lt;ServiceHealthApi&gt;({
  id: &apos;plugin.service-health&apos;,
});

// plugins/service-health/src/api/client.ts
import { ServiceHealthApi, ServiceHealth } from &apos;./types&apos;;

export class ServiceHealthClient implements ServiceHealthApi {
  private baseUrl = &apos;/api/service-health&apos;;

  async getServices(): Promise&lt;ServiceHealth[]&gt; {
    const response = await fetch(this.baseUrl);
    if (!response.ok) {
      throw new Error(`Failed to fetch services: ${response.statusText}`);
    }
    return response.json();
  }

  async getService(name: string): Promise&lt;ServiceHealth&gt; {
    const response = await fetch(`${this.baseUrl}/${name}`);
    if (!response.ok) {
      throw new Error(`Failed to fetch service: ${response.statusText}`);
    }
    return response.json();
  }
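}

// plugins/service-health/src/api/index.ts
// Barrel file so other modules can import from &apos;./api&apos;
export * from &apos;./types&apos;;
export * from &apos;./client&apos;;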
```
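
The hard-coded `/api/service-health` path assumes the frontend and backend share an origin. In many Backstage deployments the backend URL is resolved through the discovery API, and requests go through the fetch API so auth information is attached. Here is a self-contained sketch of that shape. `exampleDiscovery`, `DiscoveringHealthClient`, and the localhost URL are stand-ins I made up; in a real plugin you would inject `discoveryApiRef` and `fetchApiRef` from `@backstage/core-plugin-api` via `deps`:

```typescript
// Stand-in for Backstage discovery: maps a plugin ID to its backend base URL.
// The localhost URL is an assumption for local development.
const exampleDiscovery = {
  async getBaseUrl(pluginId: string) {
    return `http://localhost:7007/api/${pluginId}`;
  },
};
type DiscoveryLike = typeof exampleDiscovery;

// Stand-in for the Backstage fetch API: an object exposing a fetch method.
type FetchLike = { fetch: typeof fetch };

class DiscoveringHealthClient {
  constructor(
    private readonly discovery: DiscoveryLike,
    private readonly fetchApi: FetchLike,
  ) {}

  // Resolve the base URL on every call instead of hard-coding it.
  private async get(path: string) {
    const baseUrl = await this.discovery.getBaseUrl("service-health");
    const response = await this.fetchApi.fetch(`${baseUrl}${path}`);
    if (!response.ok) {
      throw new Error(`Request failed: ${response.status} ${response.statusText}`);
    }
    return response.json();
  }

  getServices() {
    return this.get("");
  }

  getService(name: string) {
    // Encode the name so slashes or spaces cannot break the URL
    return this.get(`/${encodeURIComponent(name)}`);
  }
}
```

In `plugin.ts` the factory would then look something like `factory: ({ discoveryApi, fetchApi }) => new DiscoveringHealthClient(discoveryApi, fetchApi)`, with matching entries in `deps`.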


React Components
----------------

```tsx
// plugins/service-health/src/components/ServiceHealthPage.tsx
import React from &apos;react&apos;;
import { useAsync } from &apos;react-use&apos;;
import {
  Content,
  ContentHeader,
  Page,
  Progress,
  ResponseErrorPanel,
  Table,
  TableColumn,
} from &apos;@backstage/core-components&apos;;
import { useApi } from &apos;@backstage/core-plugin-api&apos;;
import { serviceHealthApiRef, ServiceHealth } from &apos;../api&apos;;

const columns: TableColumn&lt;ServiceHealth&gt;[] = [
  { title: &apos;Service&apos;, field: &apos;name&apos; },
  {
    title: &apos;Status&apos;,
    field: &apos;status&apos;,
    render: row =&gt; (
      &lt;StatusIndicator status={row.status} /&gt;
    ),
  },
  { title: &apos;Latency&apos;, field: &apos;latency&apos;, render: row =&gt; `${row.latency}ms` },
  { title: &apos;Uptime&apos;, field: &apos;uptime&apos;, render: row =&gt; `${row.uptime}%` },
  { title: &apos;Last Checked&apos;, field: &apos;lastChecked&apos; },
];

export const ServiceHealthPage = () =&gt; {
  const api = useApi(serviceHealthApiRef);
  const { value, loading, error } = useAsync(() =&gt; api.getServices(), []);

  if (loading) return &lt;Progress /&gt;;
  if (error) return &lt;ResponseErrorPanel error={error} /&gt;;

  return (
    &lt;Page themeId=&quot;tool&quot;&gt;
      &lt;Content&gt;
        &lt;ContentHeader title=&quot;Service Health Dashboard&quot; /&gt;
        &lt;Table
          title=&quot;Services&quot;
          columns={columns}
          data={value || []}
          options={{ search: true, paging: true }}
        /&gt;
      &lt;/Content&gt;
    &lt;/Page&gt;
  );
};

const StatusIndicator = ({ status }: { status: string }) =&gt; {
  const colors = {
    healthy: &apos;#4caf50&apos;,
    degraded: &apos;#ff9800&apos;,
    down: &apos;#f44336&apos;,
  };
  
  return (
    &lt;span style={{ 
      color: colors[status as keyof typeof colors],
      fontWeight: &apos;bold&apos; 
    }}&gt;
      {status.toUpperCase()}
    &lt;/span&gt;
  );
};
```


Entity Card Component
---------------------

Add a card to the entity page:

```tsx
// plugins/service-health/src/components/ServiceHealthCard.tsx
import React from &apos;react&apos;;
import { useAsync } from &apos;react-use&apos;;
import {
  InfoCard,
  Progress,
  ResponseErrorPanel,
} from &apos;@backstage/core-components&apos;;
import { useApi } from &apos;@backstage/core-plugin-api&apos;;
import { useEntity } from &apos;@backstage/plugin-catalog-react&apos;;
import { serviceHealthApiRef } from &apos;../api&apos;;

export const ServiceHealthCard = () =&gt; {
  const { entity } = useEntity();
  const api = useApi(serviceHealthApiRef);
  const serviceName = entity.metadata.name;
  
  const { value, loading, error } = useAsync(
    () =&gt; api.getService(serviceName),
    [serviceName]
  );

  if (loading) return &lt;Progress /&gt;;
  if (error) return &lt;ResponseErrorPanel error={error} /&gt;;

  return (
    &lt;InfoCard title=&quot;Service Health&quot;&gt;
      &lt;dl&gt;
        &lt;dt&gt;Status&lt;/dt&gt;
        &lt;dd&gt;{value?.status}&lt;/dd&gt;
        &lt;dt&gt;Latency&lt;/dt&gt;
        &lt;dd&gt;{value?.latency}ms&lt;/dd&gt;
        &lt;dt&gt;Uptime&lt;/dt&gt;
        &lt;dd&gt;{value?.uptime}%&lt;/dd&gt;
      &lt;/dl&gt;
    &lt;/InfoCard&gt;
  );
};

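// plugins/service-health/src/plugin.ts
// Export the card for use on entity pages (add createComponentExtension
// to the imports from &apos;@backstage/core-plugin-api&apos;)
export const ServiceHealthCard = serviceHealthPlugin.provide(
  createComponentExtension({
    name: &apos;ServiceHealthCard&apos;,
    component: {
      lazy: () =&gt; import(&apos;./components/ServiceHealthCard&apos;).then(m =&gt; m.ServiceHealthCard),
    },
  }),
);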
```


Backend Plugin
==============

```typescript
// plugins/service-health-backend/src/plugin.ts
import { coreServices, createBackendPlugin } from &apos;@backstage/backend-plugin-api&apos;;
import { createRouter } from &apos;./router&apos;;

export const serviceHealthPlugin = createBackendPlugin({
  pluginId: &apos;service-health&apos;,
  register(env) {
    env.registerInit({
      deps: {
        httpRouter: coreServices.httpRouter,
        logger: coreServices.logger,
        config: coreServices.rootConfig,
      },
      async init({ httpRouter, logger, config }) {
        httpRouter.use(
          await createRouter({ logger, config }),
        );
      },
    });
  },
});

// plugins/service-health-backend/src/router.ts
import { Router } from &apos;express&apos;;
import { LoggerService } from &apos;@backstage/backend-plugin-api&apos;;
import { Config } from &apos;@backstage/config&apos;;

interface ServiceHealth {
  name: string;
  status: &apos;healthy&apos; | &apos;degraded&apos; | &apos;down&apos;;
  latency: number;
  uptime: number;
  lastChecked: string;
}

export async function createRouter(options: {
  logger: LoggerService;
  config: Config;
}): Promise&lt;Router&gt; {
  const { logger, config } = options;
  const router = Router();

  // Health check endpoint for each service
  const services = config.getConfigArray(&apos;serviceHealth.services&apos;);
  
  router.get(&apos;/&apos;, async (req, res) =&gt; {
    const results: ServiceHealth[] = [];
    
    for (const service of services) {
      const name = service.getString(&apos;name&apos;);
      const url = service.getString(&apos;healthUrl&apos;);
      
      try {
        const start = Date.now();
        const response = await fetch(url);
        const latency = Date.now() - start;
        
        results.push({
          name,
          status: response.ok ? &apos;healthy&apos; : &apos;degraded&apos;,
          latency,
          uptime: 99.9, // Would come from metrics store
          lastChecked: new Date().toISOString(),
        });
      } catch (error) {
        results.push({
          name,
          status: &apos;down&apos;,
          latency: 0,
          uptime: 0,
          lastChecked: new Date().toISOString(),
        });
      }
    }
    
    res.json(results);
  });

  router.get(&apos;/:name&apos;, async (req, res) =&gt; {
    const { name } = req.params;
    const service = services.find(s =&gt; s.getString(&apos;name&apos;) === name);
    
    if (!service) {
      res.status(404).json({ error: &apos;Service not found&apos; });
      return;
    }
    
    // ... check specific service
  });

  return router;
}
```
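
The list route above probes services one at a time with no timeout, so one slow endpoint delays the whole response, and the `/:name` route is left as a stub. One way to tighten this up is a shared helper that both routes call. The sketch below is an illustration under assumptions: `httpProbe`, `checkService`, `checkAll`, and the 5-second timeout are names and values I picked, and `uptime` is omitted because it would come from a metrics store:

```typescript
// Default probe: GET the URL with the global fetch (Node 18+) and a hard timeout.
// Resolves to true for a 2xx response; rejects on network errors or timeout.
const httpProbe = async (url: string, timeoutMs: number) => {
  const response = await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
  return response.ok;
};
type Probe = typeof httpProbe;

// Check one service. Never rejects: failures are reported as status "down".
async function checkService(
  name: string,
  url: string,
  probe: Probe = httpProbe,
  timeoutMs = 5000,
) {
  const start = Date.now();
  const lastChecked = new Date().toISOString();
  try {
    const ok = await probe(url, timeoutMs);
    const status = ok ? "healthy" : "degraded";
    return { name, status, latency: Date.now() - start, lastChecked };
  } catch {
    return { name, status: "down", latency: 0, lastChecked };
  }
}

// Probe all services concurrently; total time is bounded by the slowest probe.
function checkAll(services: { name: string; url: string }[], probe: Probe = httpProbe) {
  return Promise.all(
    services.map(service => checkService(service.name, service.url, probe)),
  );
}
```

The `/:name` handler could then look up the matching service and respond with `res.json(await checkService(name, url))`.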


Configuration
-------------

```yaml
# app-config.yaml
serviceHealth:
  services:
    - name: api-gateway
      healthUrl: https://api.company.com/health
    - name: user-service
      healthUrl: https://users.company.com/health
    - name: payment-service
      healthUrl: https://payments.company.com/health
```


Register Plugins
================

Frontend:

```tsx
// packages/app/src/App.tsx
import { serviceHealthPlugin, ServiceHealthPage } from &apos;@internal/plugin-service-health&apos;;

const routes = (
  &lt;FlatRoutes&gt;
    &lt;Route path=&quot;/service-health&quot; element={&lt;ServiceHealthPage /&gt;} /&gt;
  &lt;/FlatRoutes&gt;
);

// Add to sidebar
// packages/app/src/components/Root/Root.tsx
&lt;SidebarItem icon={HeartIcon} to=&quot;service-health&quot; text=&quot;Service Health&quot; /&gt;
```

Backend:

```typescript
// packages/backend/src/index.ts
import { createBackend } from &apos;@backstage/backend-defaults&apos;;
import { serviceHealthPlugin } from &apos;@internal/plugin-service-health-backend&apos;;

const backend = createBackend();
backend.add(serviceHealthPlugin);
backend.start();
```


Testing
=======

```tsx
// plugins/service-health/src/components/ServiceHealthPage.test.tsx
import React from &apos;react&apos;;
import { screen, waitFor } from &apos;@testing-library/react&apos;;
import { renderInTestApp, TestApiProvider } from &apos;@backstage/test-utils&apos;;
import { ServiceHealthPage } from &apos;./ServiceHealthPage&apos;;
import { serviceHealthApiRef } from &apos;../api&apos;;

const mockApi = {
  getServices: jest.fn().mockResolvedValue([
    { name: &apos;api&apos;, status: &apos;healthy&apos;, latency: 50, uptime: 99.9 },
    { name: &apos;db&apos;, status: &apos;degraded&apos;, latency: 200, uptime: 95.0 },
  ]),
};

describe(&apos;ServiceHealthPage&apos;, () =&gt; {
  it(&apos;renders service list&apos;, async () =&gt; {
    // renderInTestApp supplies the router and theme context that Page expects
    await renderInTestApp(
      &lt;TestApiProvider apis={[[serviceHealthApiRef, mockApi]]}&gt;
        &lt;ServiceHealthPage /&gt;
      &lt;/TestApiProvider&gt;
    );

    await waitFor(() =&gt; {
      expect(screen.getByText(&apos;api&apos;)).toBeInTheDocument();
      expect(screen.getByText(&apos;HEALTHY&apos;)).toBeInTheDocument();
    });
  });
});
```


Publishing
==========

```bash
# Build plugin
cd plugins/service-health
yarn build

# Publish to private registry
yarn publish --registry https://npm.company.com
```


References
==========

- Backstage Docs: https://backstage.io/docs
- Plugin Development: https://backstage.io/docs/plugins
- Storybook: https://backstage.io/storybook


========================================
Backstage + Custom Plugins
========================================
Your portal. Your features. Your way.
========================================</content:encoded><category>backstage</category><category>developer-portal</category><category>platform-engineering</category><category>react</category><category>typescript</category><author>Mo Abukar</author></item><item><title>Kyverno vs OPA: Policy Engines Compared</title><link>https://moabukar.co.uk/blog/kyverno-vs-opa/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/kyverno-vs-opa/</guid><description>Detailed comparison of Kyverno and OPA Gatekeeper for Kubernetes policy enforcement. Includes real examples, performance considerations, and migration guidance.</description><pubDate>Mon, 10 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Kyverno vs OPA: Policy Engines Compared
========================================

Both Kyverno and OPA Gatekeeper enforce policies in Kubernetes.
OPA uses Rego, a purpose-built language. Kyverno uses YAML.
This guide compares them with real examples so you can choose.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/kyverno.svg&quot; alt=&quot;Kyverno logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- **Kyverno**: YAML-based, easier to learn, Kubernetes-native
- **OPA/Gatekeeper**: Rego-based, more powerful, general-purpose
- Kyverno for simpler policies and faster adoption
- OPA for complex logic and non-K8s use cases
- Both production-ready, both well-maintained


Quick Comparison
================

```
FEATURE                 KYVERNO             OPA/GATEKEEPER
=======                 =======             ==============
Policy Language         YAML                Rego
Learning Curve          Low                 Medium-High
Validation              ✅                  ✅
Mutation                ✅                  ✅
Generation              ✅                  ❌
Image Verification      ✅                  ❌ (external)
CLI Testing             ✅ (kyverno test)   ✅ (gator, conftest)
Non-K8s Use             ❌                  ✅
Performance             Good                Good
Community               Growing             Established
```


Same Policy, Different Languages
================================

Block Privileged Containers
---------------------------

**Kyverno:**

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: disallow-privileged
spec:
  validationFailureAction: Enforce
  rules:
    - name: deny-privileged
      match:
        any:
          - resources:
              kinds:
                - Pod
      validate:
        message: &quot;Privileged containers are not allowed&quot;
        pattern:
          spec:
            containers:
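              # =() anchors: only check a field when it is present, so pods
              # without securityContext or initContainers are not rejected
              - =(securityContext):
                  =(privileged): &quot;!true&quot;
            =(initContainers):
              - =(securityContext):
                  =(privileged): &quot;!true&quot;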
```

**OPA/Gatekeeper:**

```yaml
# ConstraintTemplate
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8spspprivileged  # must be the lowercased CRD kind
spec:
  crd:
    spec:
      names:
        kind: K8sPSPPrivileged
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8spspprivileged

        violation[{&quot;msg&quot;: msg}] {
          c := input_containers[_]
          c.securityContext.privileged == true
          msg := sprintf(&quot;Privileged container not allowed: %v&quot;, [c.name])
        }

        input_containers[c] {
          c := input.review.object.spec.containers[_]
        }

        input_containers[c] {
          c := input.review.object.spec.initContainers[_]
        }

---
# Constraint
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sPSPPrivileged
metadata:
  name: deny-privileged
spec:
  match:
    kinds:
      - apiGroups: [&quot;&quot;]
        kinds: [&quot;Pod&quot;]
```

**Verdict**: Kyverno is more concise for this use case.


Required Labels
---------------

**Kyverno:**

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-labels
spec:
  validationFailureAction: Enforce
  rules:
    - name: require-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
                - StatefulSet
      validate:
        message: &quot;Label &apos;team&apos; is required&quot;
        pattern:
          metadata:
            labels:
              team: &quot;?*&quot;
```

**OPA/Gatekeeper:**

```yaml
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata:
  name: k8srequiredlabels
spec:
  crd:
    spec:
      names:
        kind: K8sRequiredLabels
      validation:
        openAPIV3Schema:
          type: object
          properties:
            labels:
              type: array
              items:
                type: string
  targets:
    - target: admission.k8s.gatekeeper.sh
      rego: |
        package k8srequiredlabels

        violation[{&quot;msg&quot;: msg}] {
          provided := {l | input.review.object.metadata.labels[l]}
          required := {l | l := input.parameters.labels[_]}
          missing := required - provided
          count(missing) &gt; 0
          msg := sprintf(&quot;Missing labels: %v&quot;, [missing])
        }

---
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-team-label
spec:
  match:
    kinds:
      - apiGroups: [&quot;apps&quot;]
        kinds: [&quot;Deployment&quot;, &quot;StatefulSet&quot;]
  parameters:
    labels:
      - team
```

**Verdict**: Kyverno wins on simplicity, OPA wins on reusability.
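
To make the reusability point concrete: the same `K8sRequiredLabels` template can back a second Constraint with different parameters, with no new Rego. A minimal sketch, assuming a hypothetical requirement that every Namespace carries an `owner` label (the label name and target kind are illustrative, not from the original setup):

```yaml
# Hypothetical second Constraint reusing the K8sRequiredLabels template above.
# The owner label and the Namespace target are illustrative assumptions.
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: require-owner-label
spec:
  match:
    kinds:
      - apiGroups: [&quot;&quot;]
        kinds: [&quot;Namespace&quot;]
  parameters:
    labels:
      - owner
```

With Kyverno, the equivalent is a second rule or policy that repeats the pattern with a different label.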


Kyverno Unique Features
=======================

Generate Resources
------------------

Automatically create resources when others are created:

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-network-policy
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: &quot;{{request.object.metadata.name}}&quot;
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
```
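
One option worth knowing about here is `synchronize`. The sketch below is the same policy with it enabled, so Kyverno keeps the generated NetworkPolicy aligned with the rule rather than only creating it once. The policy name is hypothetical; adjust it to your own conventions.

```yaml
# Sketch: the generate policy above with synchronize enabled.
# The policy name is a placeholder.
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: generate-network-policy-synced
spec:
  rules:
    - name: generate-default-deny
      match:
        any:
          - resources:
              kinds:
                - Namespace
      generate:
        # Keep the generated object in sync with this rule:
        # manual edits to or deletion of the NetworkPolicy are reverted.
        synchronize: true
        apiVersion: networking.k8s.io/v1
        kind: NetworkPolicy
        name: default-deny
        namespace: &quot;{{request.object.metadata.name}}&quot;
        data:
          spec:
            podSelector: {}
            policyTypes:
              - Ingress
              - Egress
```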


Image Signature Verification
----------------------------

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: verify-image-signatures
spec:
  validationFailureAction: Enforce
  webhookTimeoutSeconds: 30
  rules:
    - name: verify-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - &quot;ghcr.io/company/*&quot;
          attestors:
            - entries:
                - keys:
                    publicKeys: |
                      -----BEGIN PUBLIC KEY-----
                      MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE...
                      -----END PUBLIC KEY-----
```


Mutate with Context
-------------------

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-image-pull-secret
spec:
  rules:
    - name: add-pull-secret
      match:
        any:
          - resources:
              kinds:
                - Pod
      mutate:
        patchStrategicMerge:
          spec:
            imagePullSecrets:
              - name: ghcr-pull-secret
```


OPA Unique Features
===================

Complex Logic with Rego
-----------------------

```rego
package kubernetes.admission

# Deny if total CPU requests exceed namespace quota
deny[msg] {
  input.request.kind.kind == &quot;Pod&quot;
  namespace := input.request.namespace
  
  # Get existing pods in namespace
  existing_pods := data.kubernetes.pods[namespace]
  
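  # Calculate current CPU usage (parse_cpu is a user-defined helper, not shown, that converts values such as &quot;500m&quot; to a number)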
  current_cpu := sum([cpu |
    pod := existing_pods[_]
    container := pod.spec.containers[_]
    cpu := parse_cpu(container.resources.requests.cpu)
  ])
  
  # Calculate requested CPU
  requested_cpu := sum([cpu |
    container := input.request.object.spec.containers[_]
    cpu := parse_cpu(container.resources.requests.cpu)
  ])
  
  # Get quota
  quota := data.quotas[namespace].cpu
  
  # Check if exceeds
  current_cpu + requested_cpu &gt; quota
  
  msg := sprintf(&quot;CPU request exceeds namespace quota: %v + %v &gt; %v&quot;,
    [current_cpu, requested_cpu, quota])
}
```


Cross-Resource Validation
-------------------------

```rego
package kubernetes.admission

# Deny if service selector doesn&apos;t match any deployment
deny[msg] {
  input.request.kind.kind == &quot;Service&quot;
  service := input.request.object
  selector := service.spec.selector
  
  # Check if any deployment matches
  not deployment_exists(input.request.namespace, selector)
  
  msg := sprintf(&quot;Service %v selector doesn&apos;t match any deployment&quot;, 
    [service.metadata.name])
}

deployment_exists(namespace, selector) {
  deployment := data.kubernetes.deployments[namespace][_]
  matches_selector(deployment.spec.template.metadata.labels, selector)
}

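matches_selector(labels, selector) {
  # Collect selector keys whose label is missing or has a different value.
  # object.get returns null for a missing key, so a missing label counts as a mismatch.
  mismatches := {key |
    value := selector[key]
    object.get(labels, key, null) != value
  }
  count(mismatches) == 0
}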
```


Performance Comparison
======================

Testing with 1000 pods:

```
SCENARIO                    KYVERNO     GATEKEEPER
========                    =======     ==========
Simple validation           ~2ms        ~3ms
Complex validation          ~5ms        ~4ms
Mutation                    ~3ms        ~4ms
Memory (idle)               ~200MB      ~300MB
Memory (1000 policies)      ~500MB      ~600MB
```

Both are production-ready. Performance is similar.


Migration: OPA to Kyverno
=========================

Common patterns:

```yaml
# OPA: deny if no limits
# rego: not container.resources.limits.memory

# Kyverno equivalent:
validate:
  pattern:
    spec:
      containers:
        - resources:
            limits:
              memory: &quot;?*&quot;
```

```yaml
# OPA: allow only specific registries
# rego: startswith(image, &quot;gcr.io/company/&quot;)

# Kyverno equivalent:
validate:
  pattern:
    spec:
      containers:
        - image: &quot;gcr.io/company/*&quot;
```


When to Use Which
=================

**Choose Kyverno when:**

- Team prefers YAML over learning Rego
- You need resource generation
- You need image signature verification
- Simpler policies are sufficient
- Faster time-to-value is important

**Choose OPA/Gatekeeper when:**

- You need complex policy logic
- You have existing Rego expertise
- You need policies outside K8s (Terraform, etc.)
- Cross-resource validation is required
- You need the OPA ecosystem (Conftest, etc.)


Install Both? (Hybrid Approach)
===============================

You can run both:

```yaml
# Kyverno for mutations and generation
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: add-defaults
spec:
  rules:
    - name: add-labels
      match:
        any:
          - resources:
              kinds:
                - Deployment  # example target; adjust to your workloads
      mutate:
        patchStrategicMerge:
          metadata:
            labels:
              managed-by: platform

---
# Gatekeeper for complex validation
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sComplexValidation
metadata:
  name: cross-resource-check
```


References
==========

- Kyverno Docs: https://kyverno.io/docs
- OPA Gatekeeper: https://open-policy-agent.github.io/gatekeeper
- Kyverno Policies: https://kyverno.io/policies
- Rego Playground: https://play.openpolicyagent.org


========================================
Kyverno vs OPA Gatekeeper
========================================
Choose your weapon. Enforce your policies.
========================================</content:encoded><category>kyverno</category><category>opa</category><category>gatekeeper</category><category>kubernetes</category><category>policy</category><category>security</category><author>Mo Abukar</author></item><item><title>Test GitHub Actions Locally with Act</title><link>https://moabukar.co.uk/blog/act-locally-test-github-actions/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/act-locally-test-github-actions/</guid><description>Stop pushing to test your workflows. Act lets you run GitHub Actions locally with instant feedback. Here&apos;s how to set it up and use it effectively.</description><pubDate>Sat, 08 Nov 2025 00:00:00 GMT</pubDate><content:encoded># Test GitHub Actions Locally with Act

Push. Wait for runner. Fail on line 47. Fix typo. Push. Wait. Fail on line 52.

Sound familiar? Debugging GitHub Actions by pushing commits is painful. Each iteration takes minutes, clutters your git history, and burns CI minutes.

Act runs GitHub Actions locally on your machine. Change workflow, run `act`, see results in seconds. No push required.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/githubactions.svg&quot; alt=&quot;GitHub Actions logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


## TL;DR

&gt; **Code Repository:** All code from this post is available at [github.com/moabukar/blog-code/act-locally-test-github-actions](https://github.com/moabukar/blog-code/tree/main/act-locally-test-github-actions)

- Act runs GitHub Actions workflows locally using Docker
- Instant feedback loop - seconds instead of minutes
- Works with most GitHub Actions out of the box
- Some features (secrets, artifacts, matrix) need configuration
- Essential for workflow development and debugging

---

## Installing Act

```bash
# macOS
brew install act

# Linux (via GitHub releases)
curl -s https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash

# Windows (via Chocolatey)
choco install act-cli

# Verify
act --version
```

Act requires Docker to be installed and running.

---

## Basic Usage

```bash
# Run the default event (push)
act

# Run a specific event
act pull_request
act workflow_dispatch
act schedule

# Run a specific job
act -j build
act -j test

# Run a specific workflow file
act -W .github/workflows/ci.yml

# Dry run (show what would happen)
act -n
```

---

## Your First Run

Given this workflow:

```yaml
# .github/workflows/ci.yml
name: CI
on: [push]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: |
          echo &quot;Running tests...&quot;
          npm test
```

Run it locally:

```bash
$ act

[CI/test] 🚀  Start image=catthehacker/ubuntu:act-latest
[CI/test]   🐳  docker pull image=catthehacker/ubuntu:act-latest
[CI/test] 🧪  Run actions/checkout@v4
[CI/test]   ✅  Success - actions/checkout@v4
[CI/test] 🧪  Run Run tests
[CI/test]   💬  echo &quot;Running tests...&quot;
Running tests...
[CI/test]   💬  npm test
...
[CI/test]   ✅  Success - Run tests
```

---

## Runner Images

Act uses Docker images to simulate GitHub-hosted runners. On first run it asks which image size to use as the default.

### Image Sizes

```bash
# Micro - Node.js only, smallest download; many actions will not run (tag may vary by act version)
act -P ubuntu-latest=node:16-buster-slim

# Medium - includes the tools most actions need
act -P ubuntu-latest=catthehacker/ubuntu:act-latest

# Large - closest to actual GitHub runners (~20GB)
act -P ubuntu-latest=catthehacker/ubuntu:full-22.04
```

### Configure Default in `.actrc`

```bash
# ~/.actrc or .actrc in repo root
-P ubuntu-latest=catthehacker/ubuntu:act-22.04
-P ubuntu-22.04=catthehacker/ubuntu:act-22.04
-P ubuntu-20.04=catthehacker/ubuntu:act-20.04
```

### Custom Runner Image

```dockerfile
# Dockerfile.act
FROM catthehacker/ubuntu:act-22.04

# Add your specific tools
RUN apt-get update &amp;&amp; apt-get install -y \
    postgresql-client \
    redis-tools

# Pre-install language runtimes
RUN curl -fsSL https://deb.nodesource.com/setup_20.x | bash - \
    &amp;&amp; apt-get install -y nodejs
```

```bash
docker build -f Dockerfile.act -t my-act-runner .
act -P ubuntu-latest=my-act-runner
```

---

## Handling Secrets

### From Environment Variables

```bash
# Pass individual secrets
act -s GITHUB_TOKEN=$GITHUB_TOKEN -s NPM_TOKEN=$NPM_TOKEN

# Or use a secrets file
act --secret-file .secrets
```

```bash
# .secrets (add to .gitignore!)
GITHUB_TOKEN=ghp_xxxxxxxxxxxx
NPM_TOKEN=npm_xxxxxxxxxxxx
AWS_ACCESS_KEY_ID=AKIA...
AWS_SECRET_ACCESS_KEY=xxx
```

### Using 1Password or Other Secret Managers

```bash
# Read the token from 1Password via command substitution
act -s GITHUB_TOKEN=&quot;$(op read &quot;op://Vault/GitHub Token/credential&quot;)&quot;
```

---

## Environment Variables

```bash
# Pass env vars
act --env FOO=bar --env BAZ=qux

# From file
act --env-file .env

# GitHub-provided variables (automatically set)
# GITHUB_SHA, GITHUB_REF, GITHUB_REPOSITORY, etc.
```

```bash
# .env
NODE_ENV=test
DATABASE_URL=postgres://localhost:5432/test
```

---

## Matrix Builds

Act supports matrix strategies:

```yaml
jobs:
  test:
    strategy:
      matrix:
        node: [18, 20, 22]
        os: [ubuntu-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/setup-node@v4
        with:
          node-version: ${{ matrix.node }}
      - run: node --version
```

```bash
# Run all matrix combinations
act

# Run specific matrix combination
act -j test --matrix node:20
```

---

## Services (Docker Compose Style)

GitHub Actions supports service containers. Act handles them:

```yaml
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        options: &gt;-
          --health-cmd pg_isready
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
      redis:
        image: redis:7
        ports:
          - 6379:6379
    steps:
      - name: Test DB connection
        run: |
          pg_isready -h localhost -p 5432
          redis-cli -h localhost ping
```

```bash
# Services start automatically
act -j test
```

---

## Artifacts

Act can handle artifacts with some configuration:

```yaml
- uses: actions/upload-artifact@v4
  with:
    name: test-results
    path: coverage/

- uses: actions/download-artifact@v4
  with:
    name: test-results
```

```bash
# Start act&apos;s built-in artifact server, storing files in a local directory
mkdir -p /tmp/act-artifacts
act --artifact-server-path /tmp/act-artifacts
```

---

## Debugging Workflows

### Verbose Output

```bash
# Show all output
act -v

# Even more verbose
act -vv
```

### Interactive Shell

```bash
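# Keep the job container after the run instead of removing it
act -j build --reuse

# Find the container (act derives its name from the workflow and job), then open a shell
docker ps --filter name=act-
docker exec -it &lt;container-name&gt; /bin/bash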
```

### Isolate a Job

```bash
# act has no step debugger; run one job verbosely and keep its container for inspection
act -j build -v --reuse
```

---

## Common Gotchas

### 1. `GITHUB_TOKEN` Permissions

```yaml
# Workflow uses github.token
- uses: actions/checkout@v4
  with:
    token: ${{ secrets.GITHUB_TOKEN }}
```

```bash
# Create a PAT and pass it
act -s GITHUB_TOKEN=ghp_xxxx
```

### 2. Actions That Require GitHub Context

Some actions only work on GitHub:
- `github-script` (limited support)
- `create-release` (needs GitHub API)
- `deploy-pages` (GitHub-specific)

**Workaround:** Mock the behavior locally or skip:

```yaml
- name: Create Release
  if: ${{ !env.ACT }}  # Skip when running in act
  uses: softprops/action-gh-release@v1
```
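
The snippet above covers the "skip" half of the workaround. For the "mock" half, one possible shape is a stand-in step that runs only under act and reports what the real step would have done. This is a sketch: the step name and message are illustrative, and it relies on act setting the `ACT` environment variable, which the skip condition above also depends on.

```yaml
# Hypothetical local stand-in for the release step; runs only under act.
- name: Create Release (local mock)
  if: ${{ env.ACT }}
  run: |
    echo &quot;Would create a release for ref: $GITHUB_REF&quot;
```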

### 3. Docker-in-Docker

Actions that build Docker images need Docker socket access:

```bash
# Mount Docker socket
act -P ubuntu-latest=catthehacker/ubuntu:act-latest \
    --container-daemon-socket /var/run/docker.sock
```

### 4. GitHub-Hosted Cache Service

Caching to GitHub&apos;s cache service isn&apos;t available locally; `actions/cache` still runs, but only against a local cache:

```yaml
- uses: actions/cache@v4  # Works, but uses local cache only
  with:
    path: ~/.npm
    key: npm-${{ hashFiles(&apos;**/package-lock.json&apos;) }}
```

---

## Practical Workflow

### 1. Create `.actrc` for Your Repo

```bash
# .actrc
-P ubuntu-latest=catthehacker/ubuntu:act-22.04
--secret-file .secrets
--env-file .env
--artifact-server-path /tmp/artifacts
```

### 2. Add `.secrets` to `.gitignore`

```bash
# .gitignore
.secrets
.env.local
```

### 3. Document Required Secrets

```bash
# .secrets.example (commit this)
GITHUB_TOKEN=your-github-token
NPM_TOKEN=your-npm-token
AWS_ACCESS_KEY_ID=your-aws-key
AWS_SECRET_ACCESS_KEY=your-aws-secret
```

### 4. Test Before Push

```bash
# Quick validation
act -n  # Dry run

# Full test
act

# Specific job
act -j deploy
```

---

## Speed Tips

### 1. Use Smaller Images

```bash
# Instead of full (20GB)
act -P ubuntu-latest=catthehacker/ubuntu:act-latest  # ~500MB
```

### 2. Cache Docker Images

```bash
# Pre-pull images
docker pull catthehacker/ubuntu:act-22.04
docker pull node:20
docker pull postgres:15
```

### 3. Reuse Containers

```bash
# Don&apos;t recreate containers between runs
act --reuse
```

### 4. Skip Unnecessary Steps

```yaml
- name: Deploy to Production
  if: github.ref == &apos;refs/heads/main&apos; &amp;&amp; !env.ACT
  run: ./deploy.sh
```

---

## Integration with Make

```makefile
# Makefile
.PHONY: ci ci-pr ci-debug ci-dry ci-test ci-build

# Run CI workflow locally
ci:
	act push

# Run PR workflow
ci-pr:
	act pull_request

# Run with verbose output
ci-debug:
	act -v

# Dry run
ci-dry:
	act -n

# Specific job
ci-test:
	act -j test

ci-build:
	act -j build
```

---

## When Act Isn&apos;t Enough

Act covers most everyday use cases. For the rest:

| Limitation | Alternative |
|------------|-------------|
| GitHub-specific APIs | Test against actual GitHub on a test repo |
| OIDC/Workload Identity | Test on actual GitHub runners; can&apos;t be simulated locally |
| Large runners (16+ CPU) | Use actual GitHub runners |
| macOS/Windows runners | Use actual GitHub runners; act&apos;s container images are Linux-only |
| GitHub Packages auth | Use real GitHub with PAT |

---

## Quick Reference

```bash
# Basic run
act

# Specific event
act pull_request

# Specific job
act -j build

# With secrets
act -s GITHUB_TOKEN=$TOKEN

# Secrets from file
act --secret-file .secrets

# Environment variables
act --env NODE_ENV=test

# Dry run
act -n

# Verbose
act -v

# Use specific image
act -P ubuntu-latest=catthehacker/ubuntu:act-22.04

# Reuse containers
act --reuse

# List available jobs
act -l
```

---

## Conclusion

Act transforms GitHub Actions development from &quot;push and pray&quot; to instant local feedback. For workflow development:

1. Write workflow
2. Run `act`
3. Fix issues
4. Repeat until green
5. Push once, confident it works

Your git history will thank you. Your CI minutes will thank you. Your sanity will thank you.

---

## References

- [Act GitHub Repository](https://github.com/nektos/act)
- [Act Documentation](https://nektosact.com/)
- [Runner Images](https://github.com/catthehacker/docker_images)
- [GitHub Actions Documentation](https://docs.github.com/en/actions)</content:encoded><category>github-actions</category><category>ci-cd</category><category>act</category><category>devops</category><category>testing</category><category>automation</category><author>Mo Abukar</author></item><item><title>Crossplane Compositions: Build Your Own Cloud API</title><link>https://moabukar.co.uk/blog/crossplane-compositions/</link><guid isPermaLink="true">https://moabukar.co.uk/blog/crossplane-compositions/</guid><description>Create custom cloud APIs with Crossplane Compositions. Abstract away complexity and give developers self-service infrastructure with guardrails.</description><pubDate>Thu, 06 Nov 2025 00:00:00 GMT</pubDate><content:encoded>Crossplane Compositions: Build Your Own Cloud API
==================================================

Crossplane lets you define custom Kubernetes APIs for your
infrastructure. Instead of developers writing Terraform or
clicking through consoles, they create a simple YAML and get
a fully configured database, network, or entire environment.

This guide covers building Compositions that abstract complexity
while maintaining security and compliance.

&lt;div align=&quot;center&quot;&gt;
  &lt;img src=&quot;/images/logos/crossplane.svg&quot; alt=&quot;Crossplane logo&quot; width=&quot;120&quot; /&gt;
&lt;/div&gt;


TL;DR
=====

- Crossplane = Kubernetes-native infrastructure management
- Compositions = templates that combine multiple resources
- CompositeResourceDefinitions (XRDs) = your custom API schema
- Claims = what developers use to request infrastructure
- Full examples for databases, networks, and applications


Architecture
============

```
┌─────────────────────────────────────────────────────────────────┐
│                        Developer                                 │
│                 (creates simple Claim)                          │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                    Claim (namespace-scoped)                      │
│        apiVersion: database.platform.company.com/v1alpha1        │
│            kind: PostgreSQLInstance                              │
│            spec:                                                 │
│              size: small                                         │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│               CompositeResource (cluster-scoped)                 │
│          XPostgreSQLInstance (generated from XRD)                │
└─────────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────────┐
│                       Composition                                │
│            (template: what resources to create)                  │
└─────────────────────────────────────────────────────────────────┘
                               │
          ┌────────────────────┼────────────────────┐
          ▼                    ▼                    ▼
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│   RDS Instance   │  │  Security Group  │  │  Parameter Group │
│                  │  │                  │  │                  │
└──────────────────┘  └──────────────────┘  └──────────────────┘
```


Install Crossplane
==================

```bash
# Install Crossplane
helm repo add crossplane-stable https://charts.crossplane.io/stable
helm upgrade --install crossplane crossplane-stable/crossplane \
  --namespace crossplane-system --create-namespace

# Install AWS Provider
cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: pkg.crossplane.io/v1
kind: Provider
metadata:
  name: provider-aws
spec:
  package: xpkg.upbound.io/upbound/provider-aws:v0.47.0
EOF

# Configure credentials
kubectl create secret generic aws-creds \
  -n crossplane-system \
  --from-file=credentials=&quot;$HOME/.aws/credentials&quot;

cat &lt;&lt;EOF | kubectl apply -f -
apiVersion: aws.upbound.io/v1beta1
kind: ProviderConfig
metadata:
  name: default
spec:
  credentials:
    source: Secret
    secretRef:
      namespace: crossplane-system
      name: aws-creds
      key: credentials
EOF
```


Example 1: PostgreSQL Database
==============================

Define the API (XRD)
--------------------

```yaml
# xrd-postgresql.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xpostgresqlinstances.database.platform.company.com
spec:
  group: database.platform.company.com
  names:
    kind: XPostgreSQLInstance
    plural: xpostgresqlinstances
  
  claimNames:
    kind: PostgreSQLInstance
    plural: postgresqlinstances
  
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                size:
                  type: string
                  enum: [&quot;small&quot;, &quot;medium&quot;, &quot;large&quot;]
                  description: &quot;Database size preset&quot;
                version:
                  type: string
                  enum: [&quot;13&quot;, &quot;14&quot;, &quot;15&quot;]
                  default: &quot;15&quot;
                region:
                  type: string
                  default: &quot;eu-west-2&quot;
              required:
                - size
            status:
              type: object
              properties:
                endpoint:
                  type: string
                port:
                  type: integer
                secretName:
                  type: string
                securityGroupId:
                  type: string
```


Create the Composition
----------------------

```yaml
# composition-postgresql.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: postgresql-aws
  labels:
    provider: aws
    database: postgresql
spec:
  compositeTypeRef:
    apiVersion: database.platform.company.com/v1alpha1
    kind: XPostgreSQLInstance
  
  patchSets:
    - name: common-tags
      patches:
        - type: FromCompositeFieldPath
          fromFieldPath: metadata.labels
          toFieldPath: spec.forProvider.tags
          policy:
            mergeOptions:
              keepMapValues: true
  
  resources:
    # Subnet Group
    - name: subnet-group
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: SubnetGroup
        spec:
          forProvider:
            description: &quot;Managed by Crossplane&quot;
            subnetIds:
              - subnet-aaaaaaaa
              - subnet-bbbbbbbb
              - subnet-cccccccc
          providerConfigRef:
            name: default
      patches:
        - type: PatchSet
          patchSetName: common-tags

    # Security Group
    - name: security-group
      base:
        apiVersion: ec2.aws.upbound.io/v1beta1
        kind: SecurityGroup
        spec:
          forProvider:
            vpcId: vpc-xxxxxxxx
            description: &quot;PostgreSQL access&quot;
          providerConfigRef:
            name: default
      patches:
        - type: ToCompositeFieldPath
          fromFieldPath: status.atProvider.id
          toFieldPath: status.securityGroupId

    # Security Group Rule
    - name: sg-rule-ingress
      base:
        apiVersion: ec2.aws.upbound.io/v1beta1
        kind: SecurityGroupRule
        spec:
          forProvider:
            type: ingress
            fromPort: 5432
            toPort: 5432
            protocol: tcp
            cidrBlocks:
              - &quot;10.0.0.0/8&quot;
            securityGroupIdSelector:
              matchControllerRef: true
          providerConfigRef:
            name: default

    # Parameter Group
    - name: parameter-group
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: ParameterGroup
        spec:
          forProvider:
            family: postgres15
            parameter:
              - name: log_statement
                value: &quot;all&quot;
              - name: log_min_duration_statement
                value: &quot;1000&quot;
          providerConfigRef:
            name: default
      patches:
        - fromFieldPath: spec.version
          toFieldPath: spec.forProvider.family
          transforms:
            - type: string
              string:
                fmt: &quot;postgres%s&quot;

    # RDS Instance
    - name: rds-instance
      base:
        apiVersion: rds.aws.upbound.io/v1beta1
        kind: Instance
        spec:
          forProvider:
            engine: postgres
            publiclyAccessible: false
            skipFinalSnapshot: true
            storageEncrypted: true
            storageType: gp3
            autoGeneratePassword: true
            passwordSecretRef:
              namespace: crossplane-system
              key: password
            dbSubnetGroupNameSelector:
              matchControllerRef: true
            vpcSecurityGroupIdSelector:
              matchControllerRef: true
            parameterGroupNameSelector:
              matchControllerRef: true
          providerConfigRef:
            name: default
          writeConnectionSecretToRef:
            namespace: crossplane-system
      patches:
        # Size mapping
        - type: FromCompositeFieldPath
          fromFieldPath: spec.size
          toFieldPath: spec.forProvider.instanceClass
          transforms:
            - type: map
              map:
                small: db.t3.micro
                medium: db.t3.small
                large: db.r6g.large
        
        - type: FromCompositeFieldPath
          fromFieldPath: spec.size
          toFieldPath: spec.forProvider.allocatedStorage
          transforms:
            - type: map
              map:
                small: 20
                medium: 50
                large: 100
        
        - type: FromCompositeFieldPath
          fromFieldPath: spec.version
          toFieldPath: spec.forProvider.engineVersion
        
        - type: FromCompositeFieldPath
          fromFieldPath: spec.region
          toFieldPath: spec.forProvider.region
        
        # Connection secret
        - type: FromCompositeFieldPath
          fromFieldPath: metadata.uid
          toFieldPath: spec.writeConnectionSecretToRef.name
          transforms:
            - type: string
              string:
                fmt: &quot;%s-connection&quot;
        
        # Status outputs
        - type: ToCompositeFieldPath
          fromFieldPath: status.atProvider.endpoint
          toFieldPath: status.endpoint
        
        - type: ToCompositeFieldPath
          fromFieldPath: status.atProvider.port
          toFieldPath: status.port
      
      connectionDetails:
        - name: endpoint
          fromFieldPath: status.atProvider.endpoint
        - name: port
          fromFieldPath: status.atProvider.port
        - name: username
          fromFieldPath: spec.forProvider.username
        - name: password
          fromConnectionSecretKey: password
```
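
The `provider` and `database` labels on the Composition are what a claim or composite resource can select on when more than one Composition targets the same XRD. A minimal sketch, assuming an illustrative claim name and namespace:

```yaml
# Sketch: a claim that picks the AWS PostgreSQL Composition by label.
# The claim name and namespace are illustrative.
apiVersion: database.platform.company.com/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: my-app-db
  namespace: my-team
spec:
  size: small
  compositionSelector:
    matchLabels:
      provider: aws
      database: postgresql
```

This is what makes it possible to add, say, a second Composition for another provider later without changing the API developers use.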


Developer Experience (Claim)
----------------------------

```yaml
# database-claim.yaml
apiVersion: database.platform.company.com/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: my-app-db
  namespace: my-team
spec:
  size: small
  version: &quot;15&quot;
  
  # Connection secret written to this namespace
  writeConnectionSecretToRef:
    name: my-app-db-creds
```

That&apos;s it. A developer applies this short manifest and gets a fully
configured RDS instance with security groups, encryption, and
connection credentials.
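
On the consuming side, a workload can read those credentials like any other Secret. The sketch below assumes the connection secret `my-app-db-creds` ends up with the keys named in the Composition (`endpoint`, `port`, `username`, `password`). The Deployment name, image, and environment variable names are hypothetical.

```yaml
# Hypothetical consumer of the claim connection secret.
# Deployment name, image, and env var names are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  namespace: my-team
spec:
  replicas: 1
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: app
          image: ghcr.io/company/my-app:1.0.0
          env:
            - name: DB_HOST
              valueFrom:
                secretKeyRef:
                  name: my-app-db-creds
                  key: endpoint
            - name: DB_PORT
              valueFrom:
                secretKeyRef:
                  name: my-app-db-creds
                  key: port
            - name: DB_USER
              valueFrom:
                secretKeyRef:
                  name: my-app-db-creds
                  key: username
            - name: DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: my-app-db-creds
                  key: password
```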


Example 2: Application Environment
==================================

Compose an entire environment:

```yaml
# xrd-environment.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: CompositeResourceDefinition
metadata:
  name: xenvironments.platform.company.com
spec:
  group: platform.company.com
  names:
    kind: XEnvironment
    plural: xenvironments
  claimNames:
    kind: Environment
    plural: environments
  versions:
    - name: v1alpha1
      served: true
      referenceable: true
      schema:
        openAPIV3Schema:
          type: object
          properties:
            spec:
              type: object
              properties:
                tier:
                  type: string
                  enum: [&quot;dev&quot;, &quot;staging&quot;, &quot;prod&quot;]
                team:
                  type: string
                enableDatabase:
                  type: boolean
                  default: true
                enableCache:
                  type: boolean
                  default: false
                enableQueue:
                  type: boolean
                  default: false
              required:
                - tier
                - team
```

```yaml
# composition-environment.yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: environment-aws
spec:
  compositeTypeRef:
    apiVersion: platform.company.com/v1alpha1
    kind: XEnvironment
  
  resources:
    # Namespace
    - name: namespace
      base:
        apiVersion: kubernetes.crossplane.io/v1alpha1
        kind: Object
        spec:
          forProvider:
            manifest:
              apiVersion: v1
              kind: Namespace
              metadata:
                labels:
                  managed-by: crossplane
      patches:
        - fromFieldPath: spec.team
          toFieldPath: spec.forProvider.manifest.metadata.name
          transforms:
            - type: string
              string:
                fmt: &quot;%s-env&quot;

    # PostgreSQL (conditional)
    - name: database
      base:
        apiVersion: database.platform.company.com/v1alpha1
        kind: XPostgreSQLInstance
        spec:
          size: small
      patches:
        - fromFieldPath: spec.tier
          toFieldPath: spec.size
          transforms:
            - type: map
              map:
                dev: small
                staging: medium
                prod: large
        - fromFieldPath: spec.enableDatabase
          toFieldPath: metadata.annotations[crossplane.io/paused]
          transforms:
            - type: convert
              convert:
                toType: string
            - type: map
              map:
                &quot;true&quot;: &quot;false&quot;
                &quot;false&quot;: &quot;true&quot;

    # ElastiCache (conditional)
    - name: cache
      base:
        apiVersion: cache.platform.company.com/v1alpha1
        kind: XRedisCluster
        spec:
          size: small
      patches:
        - fromFieldPath: spec.enableCache
          toFieldPath: metadata.annotations[crossplane.io/paused]
          transforms:
            - type: convert
              convert:
                toType: string
            - type: map
              map:
                &quot;true&quot;: &quot;false&quot;
                &quot;false&quot;: &quot;true&quot;
```
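
The XRD also declares `enableQueue`, but the Composition above never reads it. A minimal sketch of a matching entry under `resources:`, following the same pause-annotation pattern. The `queue.platform.company.com` group and the `XQueue` kind are hypothetical and would need their own XRD and Composition:

```yaml
    # Queue (conditional) - hypothetical XQueue composite, not defined in this post
    - name: queue
      base:
        apiVersion: queue.platform.company.com/v1alpha1
        kind: XQueue
        spec:
          size: small
      patches:
        # Pause the XR unless enableQueue is true
        - fromFieldPath: spec.enableQueue
          toFieldPath: metadata.annotations[crossplane.io/paused]
          transforms:
            - type: convert
              convert:
                toType: string
            - type: map
              map:
                &quot;true&quot;: &quot;false&quot;
                &quot;false&quot;: &quot;true&quot;
```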


Developer usage:

```yaml
apiVersion: platform.company.com/v1alpha1
kind: Environment
metadata:
  name: my-app
  namespace: platform
spec:
  tier: dev
  team: payments
  enableDatabase: true
  enableCache: true
```
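
For comparison, a sketch of a prod claim that leans on the XRD defaults instead of setting every flag. The claim name and the `checkout` team are made up for illustration:

```yaml
apiVersion: platform.company.com/v1alpha1
kind: Environment
metadata:
  name: checkout-prod   # hypothetical claim name
  namespace: platform
spec:
  tier: prod            # the tier map resolves the database size to large
  team: checkout        # the namespace patch yields checkout-env
  # enableDatabase defaults to true; enableCache and enableQueue default to false
```

Using a different team per claim matters here: the namespace name is derived from `spec.team` alone, so two environments for the same team would both try to manage the same Namespace.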


Composition Functions
=====================

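Patches and maps cover the simple cases. For logic they cannot express, such as loops or real conditionals, set `mode: Pipeline` and run the Composition as a sequence of Composition Functions: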

```yaml
apiVersion: apiextensions.crossplane.io/v1
kind: Composition
metadata:
  name: postgresql-with-functions
spec:
  compositeTypeRef:
    apiVersion: database.platform.company.com/v1alpha1
    kind: XPostgreSQLInstance
  
  mode: Pipeline
  pipeline:
    - step: patch-and-transform
      functionRef:
        name: function-patch-and-transform
      input:
        apiVersion: pt.fn.crossplane.io/v1beta1
        kind: Resources
        resources:
          - name: rds
            base:
              apiVersion: rds.aws.upbound.io/v1beta1
              kind: Instance
              spec:
                forProvider:
                  engine: postgres
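            patches:
              - type: FromCompositeFieldPath
                fromFieldPath: spec.size
                toFieldPath: spec.forProvider.instanceClass
                # instanceClass needs a real RDS class, so map the abstract
                # size to one (the class names here are example values)
                transforms:
                  - type: map
                    map:
                      small: db.t3.micro
                      medium: db.t3.medium
                      large: db.r5.large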
    
    - step: auto-ready
      functionRef:
        name: function-auto-ready
```
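
Each `functionRef` in the pipeline has to name a Function package that is already installed in the cluster. A minimal sketch of installing the two functions used above. The version tags are illustrative assumptions, so pin whichever release matches your Crossplane version. Depending on that version, the Function kind may be served as `pkg.crossplane.io/v1` rather than `v1beta1`:

```yaml
# functions.yaml
apiVersion: pkg.crossplane.io/v1beta1
kind: Function
metadata:
  name: function-patch-and-transform   # must match functionRef.name
spec:
  package: xpkg.upbound.io/crossplane-contrib/function-patch-and-transform:v0.2.1  # assumed tag
---
apiVersion: pkg.crossplane.io/v1beta1
kind: Function
metadata:
  name: function-auto-ready            # must match functionRef.name
spec:
  package: xpkg.upbound.io/crossplane-contrib/function-auto-ready:v0.2.1  # assumed tag
```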


GitOps with Crossplane
======================

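Keep XRDs and Compositions in Git and let ArgoCD sync them into the cluster. With `prune` and `selfHeal` enabled, Git stays the source of truth: manual edits are reverted and resources deleted from the repo are removed from the cluster.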

```yaml
# argocd-application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-compositions
  namespace: argocd
spec:
  project: platform
  source:
    repoURL: https://github.com/company/platform-compositions
    path: compositions
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
```
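
The Application above references `project: platform`, which has to exist as an AppProject. A minimal sketch, assuming the repo only ships XRDs and Compositions (both cluster-scoped). The whitelist is an assumption, so widen it if the repo also carries Functions or Providers:

```yaml
# argocd-project.yaml
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata:
  name: platform
  namespace: argocd
spec:
  sourceRepos:
    - https://github.com/company/platform-compositions
  destinations:
    - server: https://kubernetes.default.svc
      namespace: &quot;*&quot;
  # Only allow the cluster-scoped Crossplane kinds this repo is assumed to hold
  clusterResourceWhitelist:
    - group: apiextensions.crossplane.io
      kind: CompositeResourceDefinition
    - group: apiextensions.crossplane.io
      kind: Composition
```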


Troubleshooting
===============

```bash
# Check XRD status
kubectl get xrd

# Check composition
kubectl get composition

# Check composite resource
kubectl get xpostgresqlinstances

# Check claim
kubectl get postgresqlinstances -n my-namespace

# Debug - see all managed resources
kubectl get managed

# See events
kubectl describe xpostgresqlinstance my-db
```


References
==========

- Crossplane Docs: https://docs.crossplane.io
- Composition Docs: https://docs.crossplane.io/latest/concepts/compositions/
- Upbound Marketplace: https://marketplace.upbound.io


========================================
Crossplane + Compositions + Kubernetes
========================================
Your cloud. Your API. Your guardrails.
========================================</content:encoded><category>crossplane</category><category>kubernetes</category><category>platform-engineering</category><category>infrastructure</category><category>gitops</category><author>Mo Abukar</author></item></channel></rss>