Progressive Delivery with Flagger: Automated Canary Deployments
Manual canary deployments are tedious. Flagger automates the entire process: deploy canary, shift traffic gradually, analyze metrics, promote or rollback automatically.
TL;DR
- Flagger = automated canary/blue-green/A-B testing
- Metrics-based promotion (Prometheus, Datadog, etc.)
- Automatic rollback on failure
- Works with Istio, Linkerd, Nginx, Gateway API
- Full GitOps integration
Install Flagger
helm repo add flagger https://flagger.app
helm upgrade -i flagger flagger/flagger \
--namespace flagger-system --create-namespace \
--set meshProvider=istio \
--set metricsServer=http://prometheus.monitoring:9090
Canary Resource
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-server
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
progressDeadlineSeconds: 600
service:
port: 8080
targetPort: 8080
gateways:
- production-gateway
hosts:
- api.company.com
analysis:
# Canary increment
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
# Promotion metrics
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
- name: request-duration
thresholdRange:
max: 500
interval: 1m
# Webhooks for custom checks
webhooks:
- name: load-test
type: rollout
url: http://loadtester.flagger-system/
metadata:
cmd: "hey -z 1m -q 10 -c 2 http://api-server-canary.production:8080/"
Metrics Templates
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: request-success-rate
namespace: flagger-system
spec:
provider:
type: prometheus
address: http://prometheus.monitoring:9090
query: |
100 - (
sum(
rate(
http_requests_total{
namespace="{{ namespace }}",
service=~"{{ target }}-canary",
status=~"5.."
}[{{ interval }}]
)
)
/
sum(
rate(
http_requests_total{
namespace="{{ namespace }}",
service=~"{{ target }}-canary"
}[{{ interval }}]
)
) * 100
)
---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
name: request-duration
namespace: flagger-system
spec:
provider:
type: prometheus
address: http://prometheus.monitoring:9090
query: |
histogram_quantile(0.99,
sum(
rate(
http_request_duration_seconds_bucket{
namespace="{{ namespace }}",
service=~"{{ target }}-canary"
}[{{ interval }}]
)
) by (le)
) * 1000
Blue-Green Deployment
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-server
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
service:
port: 8080
analysis:
# Blue-green: 0% or 100%
interval: 1m
threshold: 5
iterations: 10 # Run analysis 10 times before promoting
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
webhooks:
- name: acceptance-test
type: pre-rollout
url: http://loadtester.flagger-system/
metadata:
type: bash
cmd: "curl -s http://api-server-canary.production:8080/health | grep ok"
- name: load-test
type: rollout
url: http://loadtester.flagger-system/
metadata:
cmd: "hey -z 2m -q 50 -c 10 http://api-server-canary.production:8080/"
A/B Testing
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-server
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
service:
port: 8080
analysis:
interval: 1m
threshold: 5
iterations: 100
# A/B testing by header
match:
- headers:
x-user-type:
exact: beta
metrics:
- name: request-success-rate
thresholdRange:
min: 99
interval: 1m
Gateway API Integration
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: api-server
namespace: production
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
service:
port: 8080
# Use Gateway API instead of Istio
gatewayRefs:
- name: production-gateway
namespace: gateway-system
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
templateRef:
name: request-success-rate
namespace: flagger-system
thresholdRange:
min: 99
Alerting
apiVersion: flagger.app/v1beta1
kind: AlertProvider
metadata:
name: slack
namespace: flagger-system
spec:
type: slack
channel: deployments
username: flagger
secretRef:
name: slack-url
---
apiVersion: v1
kind: Secret
metadata:
name: slack-url
namespace: flagger-system
stringData:
address: https://hooks.slack.com/services/xxx/yyy/zzz
GitOps Workflow
# Argo CD triggers Flagger
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
name: api-server
namespace: argocd
spec:
source:
repoURL: https://github.com/company/app
path: k8s
targetRevision: HEAD
destination:
server: https://kubernetes.default.svc
namespace: production
syncPolicy:
automated:
prune: true
selfHeal: true
# Deployment update triggers Flagger canary
# Flagger handles the progressive rollout
Monitoring Progress
# Watch canary status
kubectl get canary -n production -w
# Describe canary
kubectl describe canary api-server -n production
# Check events
kubectl get events -n production --field-selector involvedObject.name=api-server
Rollback
# Manual rollback
kubectl annotate canary api-server -n production flagger.app/rollback=true
# Or set status
kubectl patch canary api-server -n production \
--type=merge -p '{"status":{"phase":"Failed"}}'
References
- Flagger Docs: https://docs.flagger.app
- Metrics: https://docs.flagger.app/usage/metrics
- Webhooks: https://docs.flagger.app/usage/webhooks