SLO-Based Alerting: Burn Rate Alerts vs Threshold Alerts
Threshold alerts are noisy. “CPU > 80%” fires constantly but rarely matters. SLO-based alerting focuses on what matters: are we burning through our error budget too fast?
TL;DR
- SLO = target reliability (e.g., 99.9% availability)
- Error Budget = allowed unreliability (0.1% = 43 min/month)
- Burn Rate = how fast you’re consuming error budget
- Multi-window alerts reduce noise, catch real problems
- Prometheus/Grafana implementation included
Why SLO-Based Alerting?
THRESHOLD ALERTS          SLO-BASED ALERTS
================          ================
"Error rate > 1%"         "Burning 10× error budget"
Fires on any spike        Fires on sustained impact
100s of alerts/week       ~5 alerts/week
Alert fatigue             Actionable alerts
Error Budget Math
SLO: 99.9% availability
Error Budget: 0.1% = 1 - 0.999
Monthly error budget (30 days):
30 days × 24 hours × 60 minutes × 0.001 = 43.2 minutes
If you run at 99.8% availability for one hour:
- Error rate: 0.2% of traffic (2× the allowed 0.1%)
- Burn rate: 2×
- Budget consumed: 60 min × 0.002 = 0.12 min of the 43.2-minute budget (≈0.28%), i.e. two hours' worth of budget spent in one hour
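To sanity-check this math against live data, an ad-hoc query along these lines works — a sketch that assumes the same http_requests_total counter used in the rules later in this post and a 99.9% SLO:
# Fraction of the 30-day error budget consumed so far (1.0 = budget fully spent).
# A 30d range is expensive to evaluate; fine for an occasional ad-hoc check.
(
sum(increase(http_requests_total{status=~"5.."}[30d]))
/
sum(increase(http_requests_total[30d]))
)
/ 0.001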
Burn Rate Definition
Burn Rate = (Actual Error Rate) / (SLO Error Rate)
Example:
- SLO allows 0.1% errors
- Current error rate: 0.5%
- Burn rate: 0.5 / 0.1 = 5×
At 5× burn rate:
- 30-day budget exhausted in 6 days
- 1-day budget exhausted in ~5 hours
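In query form, the burn rate is just the observed error ratio divided by the ratio the SLO allows. A sketch over a 1-hour window, again assuming the http_requests_total counter and a 99.9% SLO:
# Current 1h burn rate; anything above 1 is eating into the budget
(
sum(rate(http_requests_total{status=~"5.."}[1h]))
/
sum(rate(http_requests_total[1h]))
)
/ 0.001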
Multi-Window Burn Rate Alerts
Single-window alerts are still noisy: a short window alone flaps on brief spikes, while a long window alone keeps firing long after recovery. Pair a short and a long window per severity:
BURN RATE   SHORT WINDOW   LONG WINDOW   BUDGET GONE IN   SEVERITY
=========   ============   ===========   ==============   ========
14.4×       5 min          1 hour        ~2 days          Page (critical)
6×          30 min         6 hours       5 days           Page (warning)
3×          2 hours        24 hours      10 days          Ticket
1×          6 hours        3 days        30 days          Ticket
Both windows must exceed the burn-rate threshold for the alert to fire; the short window also lets the alert clear quickly once the error rate recovers.
Prometheus Recording Rules
# slo-recording-rules.yaml
groups:
- name: slo-recording
interval: 30s
rules:
# Error ratio over different windows
- record: slo:http_request_error_ratio:rate5m
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)
/
sum(rate(http_requests_total[5m])) by (service)
- record: slo:http_request_error_ratio:rate30m
expr: |
sum(rate(http_requests_total{status=~"5.."}[30m])) by (service)
/
sum(rate(http_requests_total[30m])) by (service)
- record: slo:http_request_error_ratio:rate1h
expr: |
sum(rate(http_requests_total{status=~"5.."}[1h])) by (service)
/
sum(rate(http_requests_total[1h])) by (service)
- record: slo:http_request_error_ratio:rate2h
expr: |
sum(rate(http_requests_total{status=~"5.."}[2h])) by (service)
/
sum(rate(http_requests_total[2h])) by (service)
- record: slo:http_request_error_ratio:rate6h
expr: |
sum(rate(http_requests_total{status=~"5.."}[6h])) by (service)
/
sum(rate(http_requests_total[6h])) by (service)
- record: slo:http_request_error_ratio:rate24h
expr: |
sum(rate(http_requests_total{status=~"5.."}[24h])) by (service)
/
sum(rate(http_requests_total[24h])) by (service)
- record: slo:http_request_error_ratio:rate3d
expr: |
sum(rate(http_requests_total{status=~"5.."}[3d])) by (service)
/
sum(rate(http_requests_total[3d])) by (service)
# SLO targets (configure per service)
- record: slo:http_request:objective
expr: |
vector(0.001) # 99.9% = 0.1% error budget
labels:
service: api-server
- record: slo:http_request:objective
expr: |
vector(0.01) # 99% = 1% error budget
labels:
service: batch-processor
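Prometheus only evaluates these once the files are listed under rule_files. A minimal config fragment, assuming the file names used in the comments above:
# prometheus.yml (fragment)
rule_files:
  - slo-recording-rules.yaml
  - slo-alerting-rules.yaml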
Burn Rate Alerts
# slo-alerting-rules.yaml
groups:
- name: slo-alerts
rules:
# Critical: 14.4× burn rate over 5m AND 1h
# Exhausts the 30-day budget in ~2 days
- alert: SLOErrorBudgetCritical
expr: |
(
slo:http_request_error_ratio:rate5m > on(service) (14.4 * slo:http_request:objective)
and
slo:http_request_error_ratio:rate1h > on(service) (14.4 * slo:http_request:objective)
)
for: 2m
labels:
severity: critical
annotations:
summary: "{{ $labels.service }} burning error budget 14× faster than allowed"
description: "Error rate is {{ $value | humanizePercentage }}. At this rate, 30-day budget exhausted in ~2 hours."
runbook_url: https://runbooks.company.com/slo-critical
# Warning: 6× burn rate over 30m AND 6h
# Exhausts budget in 5 days
- alert: SLOErrorBudgetWarning
expr: |
(
slo:http_request_error_ratio:rate30m > on(service) (6 * slo:http_request:objective)
and
slo:http_request_error_ratio:rate6h > on(service) (6 * slo:http_request:objective)
)
for: 5m
labels:
severity: warning
annotations:
summary: "{{ $labels.service }} burning error budget 6× faster than allowed"
description: "At this rate, 30-day budget exhausted in ~5 days."
# Ticket: 3× burn rate over 2h AND 24h
# Exhausts budget in 10 days
- alert: SLOErrorBudgetDegraded
expr: |
(
slo:http_request_error_ratio:rate2h > on(service) (3 * slo:http_request:objective)
and
slo:http_request_error_ratio:rate24h > on(service) (3 * slo:http_request:objective)
)
for: 15m
labels:
severity: ticket
annotations:
summary: "{{ $labels.service }} error rate elevated"
# Slow Burn: 1× burn rate over 6h AND 3d
# On track to exhaust budget
- alert: SLOErrorBudgetSlowBurn
expr: |
(
slo:http_request_error_ratio:rate6h > on(service) slo:http_request:objective
and
slo:http_request_error_ratio:rate3d > on(service) slo:http_request:objective
)
for: 1h
labels:
severity: ticket
annotations:
summary: "{{ $labels.service }} on track to exhaust error budget"
Latency SLOs
The same multi-window approach works for latency. The SLI is the fraction of requests served under a threshold (here 500ms), and the burn rate is measured against the budget of slow requests:
groups:
- name: latency-slo-recording
rules:
# Fraction of requests served in under 500ms (the "good events" ratio)
- record: slo:http_request_latency_ratio:rate5m
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m])) by (service)
/
sum(rate(http_request_duration_seconds_count[5m])) by (service)
- record: slo:http_request_latency_ratio:rate1h
expr: |
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[1h])) by (service)
/
sum(rate(http_request_duration_seconds_count[1h])) by (service)
# Target: 99% of requests < 500ms
- record: slo:http_request_latency:objective
expr: vector(0.99)
labels:
service: api-server
- name: latency-slo-alerts
rules:
- alert: SLOLatencyBudgetCritical
expr: |
(
slo:http_request_latency_ratio:rate5m < on(service) (1 - 14.4 * (1 - slo:http_request_latency:objective))
and
slo:http_request_latency_ratio:rate1h < on(service) (1 - 14.4 * (1 - slo:http_request_latency:objective))
)
for: 2m
labels:
severity: critical
annotations:
summary: "{{ $labels.service }} latency SLO breach"
Grafana Dashboard
{
"panels": [
{
"title": "Error Budget Remaining (30d)",
"type": "gauge",
"targets": [
{
"expr": "1 - (sum_over_time(slo:http_request_error_ratio:rate5m{service=\"api-server\"}[30d]) / count_over_time(slo:http_request_error_ratio:rate5m{service=\"api-server\"}[30d])) / 0.001",
"legendFormat": "Budget Remaining"
}
],
"options": {
"reduceOptions": {
"calcs": ["lastNotNull"]
}
},
"fieldConfig": {
"defaults": {
"unit": "percentunit",
"min": 0,
"max": 1,
"thresholds": {
"steps": [
{"color": "red", "value": 0},
{"color": "yellow", "value": 0.25},
{"color": "green", "value": 0.5}
]
}
}
}
},
{
"title": "Current Burn Rate",
"type": "stat",
"targets": [
{
"expr": "slo:http_request_error_ratio:rate1h{service=\"api-server\"} / 0.001",
"legendFormat": "Burn Rate"
}
],
"fieldConfig": {
"defaults": {
"unit": "x",
"thresholds": {
"steps": [
{"color": "green", "value": 0},
{"color": "yellow", "value": 1},
{"color": "red", "value": 6}
]
}
}
}
},
{
"title": "Time Until Budget Exhausted",
"type": "stat",
"targets": [
{
"expr": "(1 - (sum_over_time(slo:http_request_error_ratio:rate5m{service=\"api-server\"}[30d]) / count_over_time(slo:http_request_error_ratio:rate5m{service=\"api-server\"}[30d])) / 0.001) * 30 * 24 / (slo:http_request_error_ratio:rate1h{service=\"api-server\"} / 0.001)",
"legendFormat": "Hours Remaining"
}
],
"fieldConfig": {
"defaults": {
"unit": "h"
}
}
}
]
}
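The panel queries above hard-code the 0.001 objective. If you prefer a single source of truth, you can divide by the recorded objective instead — a sketch for the burn-rate panel, using the recording rules defined earlier:
slo:http_request_error_ratio:rate1h{service="api-server"}
/ on(service)
slo:http_request:objective{service="api-server"}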
Sloth: SLO Generator
Writing these rules by hand gets repetitive. Sloth generates the multi-window recording and alerting rules from a short spec:
# sloth-slo.yaml
version: prometheus/v1
service: api-server
slos:
- name: requests-availability
objective: 99.9
description: 99.9% of requests succeed
sli:
events:
error_query: sum(rate(http_requests_total{service="api-server",status=~"5.."}[{{.window}}]))
total_query: sum(rate(http_requests_total{service="api-server"}[{{.window}}]))
alerting:
name: APIServerAvailability
page_alert:
labels:
severity: critical
ticket_alert:
labels:
severity: warning
sloth generate -i sloth-slo.yaml -o prometheus-rules.yaml
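Before loading the generated file into Prometheus, it's worth validating it, for example:
promtool check rules prometheus-rules.yaml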
References
- Google SRE Book: https://sre.google/sre-book/service-level-objectives/
- Sloth: https://sloth.dev
- OpenSLO: https://openslo.com