Managing Dynatrace Alerts at Scale with Custom Ansible Roles
Dynatrace is powerful, but managing alerts across 50+ applications and 4 environments through the UI is a nightmare. Click here, configure there, copy settings manually - it doesn’t scale, and drift is inevitable.
We solved this by treating Dynatrace alerting configuration as code, managed through custom Ansible roles. This post covers how we built it, the Dynatrace API patterns, and the Ansible structure that let us manage thousands of alert configurations consistently.
The Problem
Our Dynatrace setup had grown organically:
- 200+ alerting profiles - many duplicates, inconsistent thresholds
- No version control - who changed what, when?
- Environment drift - prod alerts different from staging
- Manual onboarding - new services took hours to configure
- No review process - anyone could change alerts without approval
We needed Infrastructure as Code for our alerting.
Why Ansible?
We evaluated several options:
| Tool | Pros | Cons |
|---|---|---|
| Terraform | Declarative, state management | Dynatrace provider was immature (2022) |
| Dynatrace Monaco | Purpose-built for Dynatrace | Another tool to learn, limited flexibility |
| Ansible | Already in our stack, flexible, good API support | Imperative, no state tracking |
| Custom scripts | Full control | Maintenance burden |
We chose Ansible because:
- Team already knew it
- Good HTTP/REST modules
- Could integrate with existing automation
- Jinja2 templating for complex configs
Dynatrace API Fundamentals
Before diving into Ansible, it helps to understand Dynatrace's APIs.
API Versions
Dynatrace has multiple APIs:
Environment API v1: /api/v1/...
Environment API v2: /api/v2/...
Configuration API v1: /api/config/v1/...
For alerting, we primarily use:
- Config API v1 - Alerting profiles, notifications
- Environment API v2 - Metric events, SLOs
Authentication
# API Token with these permissions:
# - Read configuration
# - Write configuration
# - Read metrics
# - Read entities
export DT_API_TOKEN="dt0c01.XXXXXXXX.YYYYYYYY"
export DT_ENVIRONMENT_URL="https://abc12345.live.dynatrace.com"
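Outside Ansible, the same token-header pattern can be exercised in a few lines of Python - a sketch only; `build_request` is an illustrative helper, not part of the role:

```python
import os
import urllib.request

def build_request(path: str) -> urllib.request.Request:
    """Build an authenticated request against the Dynatrace API.

    Reads the DT_ENVIRONMENT_URL and DT_API_TOKEN variables exported above;
    path is any endpoint from the list that follows.
    """
    base = os.environ["DT_ENVIRONMENT_URL"].rstrip("/")
    token = os.environ["DT_API_TOKEN"]
    return urllib.request.Request(
        f"{base}{path}",
        headers={"Authorization": f"Api-Token {token}"},
    )
```

Note the header scheme is `Api-Token <token>`, not `Bearer` - the single most common mistake when scripting against Dynatrace.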
Key Endpoints
# Alerting Profiles
GET/POST/PUT/DELETE /api/config/v1/alertingProfiles
# Problem Notifications (integrations)
GET/POST/PUT/DELETE /api/config/v1/notifications
# Metric Events (custom alerts)
GET/POST/PUT/DELETE /api/config/v1/anomalyDetection/metricEvents
# Maintenance Windows
GET/POST/PUT/DELETE /api/config/v1/maintenanceWindows
# Auto-tags (for filtering)
GET/POST/PUT/DELETE /api/config/v1/autoTags
Ansible Role Structure
Here’s the structure we built:
roles/
└── dynatrace_alerting/
├── defaults/
│ └── main.yml # Default variables
├── tasks/
│ ├── main.yml # Entry point
│ ├── alerting_profiles.yml # Alerting profile management
│ ├── notifications.yml # Notification channels
│ ├── metric_events.yml # Custom metric alerts
│ ├── maintenance.yml # Maintenance windows
│ └── validate.yml # Pre-flight checks
├── templates/
│ ├── alerting_profile.json.j2
│ ├── notification_slack.json.j2
│ ├── notification_pagerduty.json.j2
│ ├── notification_email.json.j2
│ ├── metric_event.json.j2
│ └── maintenance_window.json.j2
├── vars/
│ └── main.yml # Static variables
├── handlers/
│ └── main.yml
└── meta/
└── main.yml
Role Implementation
defaults/main.yml
---
# Dynatrace connection
dynatrace_environment_url: "{{ lookup('env', 'DT_ENVIRONMENT_URL') }}"
dynatrace_api_token: "{{ lookup('env', 'DT_API_TOKEN') }}"
# API endpoints
dynatrace_config_api: "{{ dynatrace_environment_url }}/api/config/v1"
dynatrace_env_api_v2: "{{ dynatrace_environment_url }}/api/v2"
# Default alerting settings
dynatrace_default_alert_delay: 0
dynatrace_default_severity_rules:
- severity: AVAILABILITY
delay_in_minutes: 0
- severity: ERROR
delay_in_minutes: 0
- severity: SLOWDOWN
delay_in_minutes: 5
- severity: RESOURCE_CONTENTION
delay_in_minutes: 10
- severity: CUSTOM_ALERT
delay_in_minutes: 0
# Environment-specific overrides
dynatrace_environments:
production:
alert_delay_multiplier: 1
notify_on_close: true
staging:
alert_delay_multiplier: 2
notify_on_close: false
development:
alert_delay_multiplier: 5
notify_on_close: false
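The multiplier table above feeds the `delayInMinutes` calculation when profiles are rendered: effective delay is simply base delay × environment multiplier. A minimal Python sketch of that arithmetic, with the multipliers copied from the defaults (`effective_delay` is illustrative, not part of the role):

```python
# Mirrors dynatrace_environments in defaults/main.yml.
MULTIPLIERS = {"production": 1, "staging": 2, "development": 5}

def effective_delay(base_minutes: int, environment: str) -> int:
    """Delay (in minutes) actually sent to Dynatrace for a given environment."""
    return int(base_minutes * MULTIPLIERS[environment])
```

A SLOWDOWN rule with a 5-minute base delay therefore fires after 5 minutes in production but only after 25 minutes in development.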
tasks/main.yml
---
- name: Validate Dynatrace connection
include_tasks: validate.yml
tags:
- always
- name: Manage alerting profiles
include_tasks: alerting_profiles.yml
when: dynatrace_alerting_profiles is defined
tags:
- alerting_profiles
- profiles
- name: Manage notification channels
include_tasks: notifications.yml
when: dynatrace_notifications is defined
tags:
- notifications
- name: Manage metric events
include_tasks: metric_events.yml
when: dynatrace_metric_events is defined
tags:
- metric_events
- custom_alerts
- name: Manage maintenance windows
include_tasks: maintenance.yml
when: dynatrace_maintenance_windows is defined
tags:
- maintenance
tasks/validate.yml
---
- name: Verify Dynatrace API token is set
assert:
that:
- dynatrace_api_token is defined
- dynatrace_api_token | length > 0
fail_msg: "DT_API_TOKEN environment variable must be set"
- name: Verify Dynatrace environment URL is set
assert:
that:
- dynatrace_environment_url is defined
- dynatrace_environment_url | length > 0
fail_msg: "DT_ENVIRONMENT_URL environment variable must be set"
- name: Test Dynatrace API connectivity
uri:
url: "{{ dynatrace_config_api }}/alertingProfiles"
method: GET
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
status_code: 200
register: api_test
failed_when: api_test.status != 200
- name: Display API connection status
debug:
msg: "Successfully connected to Dynatrace. Found {{ api_test.json['values'] | length }} existing alerting profiles."
Alerting Profiles
Alerting profiles define WHAT problems trigger alerts and with what delay.
tasks/alerting_profiles.yml
---
- name: Get existing alerting profiles
uri:
url: "{{ dynatrace_config_api }}/alertingProfiles"
method: GET
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
register: existing_profiles
- name: Build existing profiles lookup
set_fact:
# Note: json['values'], not json.values - .values resolves to the dict method in Jinja2
existing_profiles_map: "{{ existing_profiles.json['values'] | items2dict(key_name='name', value_name='id') }}"
- name: Create or update alerting profiles
uri:
url: "{{ dynatrace_config_api }}/alertingProfiles/{{ existing_profiles_map[item.name] | default('') }}"
method: "{{ 'PUT' if item.name in existing_profiles_map else 'POST' }}"
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
Content-Type: "application/json"
body: "{{ lookup('template', 'alerting_profile.json.j2') }}"
body_format: json
status_code: [200, 201, 204]
loop: "{{ dynatrace_alerting_profiles }}"
loop_control:
label: "{{ item.name }}"
register: profile_results
- name: Delete removed alerting profiles
uri:
url: "{{ dynatrace_config_api }}/alertingProfiles/{{ item.value }}"
method: DELETE
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
status_code: [204, 404]
loop: "{{ existing_profiles_map | dict2items }}"
loop_control:
label: "{{ item.key }}"
when:
- dynatrace_alerting_profiles_delete_unmanaged | default(false)
- item.key not in (dynatrace_alerting_profiles | map(attribute='name') | list)
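The create-or-update-or-delete logic above boils down to three rules: PUT when the name already exists, POST when it doesn't, and DELETE (opt-in) for anything unmanaged. A Python sketch of the same plan, under the same assumption that names are unique (`plan_changes` is illustrative, not part of the role):

```python
def plan_changes(desired_names, existing_map, delete_unmanaged=False):
    """Return (action, config_id) pairs mirroring the tasks above.

    desired_names: names declared in dynatrace_alerting_profiles
    existing_map:  name -> id map built from the GET response
    """
    actions = []
    for name in desired_names:
        if name in existing_map:
            actions.append(("PUT", existing_map[name]))   # update in place
        else:
            actions.append(("POST", None))                # create new
    if delete_unmanaged:
        for name, cfg_id in existing_map.items():
            if name not in desired_names:
                actions.append(("DELETE", cfg_id))        # prune drift
    return actions
```

Keeping `delete_unmanaged` off by default is deliberate: it lets you adopt the role incrementally without nuking hand-made profiles on the first run.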
templates/alerting_profile.json.j2
{
"displayName": "{{ item.name }}",
"rules": [
{% for rule in item.severity_rules | default(dynatrace_default_severity_rules) %}
{
"severityLevel": "{{ rule.severity }}",
"tagFilter": {
"includeMode": "{{ rule.tag_include_mode | default('INCLUDE_ANY') }}",
"tagFilters": [
{% for tag in rule.tags | default(item.tags | default([])) %}
{
"context": "{{ tag.context | default('CONTEXTLESS') }}",
"key": "{{ tag.key }}",
"value": "{{ tag.value | default('') }}"
}{{ "," if not loop.last else "" }}
{% endfor %}
]
},
"delayInMinutes": {{ (rule.delay_in_minutes * dynatrace_environments[dynatrace_environment].alert_delay_multiplier) | int }}
}{{ "," if not loop.last else "" }}
{% endfor %}
],
{% if item.management_zone is defined %}
"managementZoneId": "{{ item.management_zone }}",
{% endif %}
"eventTypeFilters": [
{% for event_type in item.event_types | default(['CUSTOM_ALERT', 'CUSTOM_ANNOTATION', 'CUSTOM_CONFIGURATION', 'CUSTOM_DEPLOYMENT', 'ERROR_EVENT', 'MARKED_FOR_TERMINATION', 'PERFORMANCE_EVENT', 'RESOURCE_CONTENTION_EVENT']) %}
{
"predefinedEventFilter": {
"eventType": "{{ event_type }}",
"negate": false
}
}{{ "," if not loop.last else "" }}
{% endfor %}
]
}
Notification Channels
Notifications define WHERE alerts go (Slack, PagerDuty, email, webhooks).
tasks/notifications.yml
---
- name: Get existing notifications
uri:
url: "{{ dynatrace_config_api }}/notifications"
method: GET
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
register: existing_notifications
- name: Build existing notifications lookup
set_fact:
existing_notifications_map: "{{ existing_notifications.json['values'] | items2dict(key_name='name', value_name='id') }}"
- name: Create or update Slack notifications
uri:
url: "{{ dynatrace_config_api }}/notifications/{{ existing_notifications_map[item.name] | default('') }}"
method: "{{ 'PUT' if item.name in existing_notifications_map else 'POST' }}"
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
Content-Type: "application/json"
body: "{{ lookup('template', 'notification_slack.json.j2') }}"
body_format: json
status_code: [200, 201, 204]
loop: "{{ dynatrace_notifications | selectattr('type', 'equalto', 'slack') | list }}"
loop_control:
label: "{{ item.name }}"
- name: Create or update PagerDuty notifications
uri:
url: "{{ dynatrace_config_api }}/notifications/{{ existing_notifications_map[item.name] | default('') }}"
method: "{{ 'PUT' if item.name in existing_notifications_map else 'POST' }}"
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
Content-Type: "application/json"
body: "{{ lookup('template', 'notification_pagerduty.json.j2') }}"
body_format: json
status_code: [200, 201, 204]
loop: "{{ dynatrace_notifications | selectattr('type', 'equalto', 'pagerduty') | list }}"
loop_control:
label: "{{ item.name }}"
- name: Create or update email notifications
uri:
url: "{{ dynatrace_config_api }}/notifications/{{ existing_notifications_map[item.name] | default('') }}"
method: "{{ 'PUT' if item.name in existing_notifications_map else 'POST' }}"
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
Content-Type: "application/json"
body: "{{ lookup('template', 'notification_email.json.j2') }}"
body_format: json
status_code: [200, 201, 204]
loop: "{{ dynatrace_notifications | selectattr('type', 'equalto', 'email') | list }}"
loop_control:
label: "{{ item.name }}"
templates/notification_slack.json.j2
{
"type": "SLACK",
"name": "{{ item.name }}",
"alertingProfile": "{{ item.alerting_profile_id }}",
"active": {{ item.active | default(true) | lower }},
"url": "{{ item.webhook_url }}",
"channel": "{{ item.channel }}",
"title": "{{ item.title | default('{State} {ProblemSeverity} Problem {ProblemID}: {ProblemTitle}') }}"
}
templates/notification_pagerduty.json.j2
{
"type": "PAGER_DUTY",
"name": "{{ item.name }}",
"alertingProfile": "{{ item.alerting_profile_id }}",
"active": {{ item.active | default(true) | lower }},
"account": "{{ item.account }}",
"serviceApiKey": "{{ item.integration_key }}",
"serviceName": "{{ item.service_name }}"
}
templates/notification_email.json.j2
{
"type": "EMAIL",
"name": "{{ item.name }}",
"alertingProfile": "{{ item.alerting_profile_id }}",
"active": {{ item.active | default(true) | lower }},
"subject": "{{ item.subject | default('{State} {ProblemSeverity} Problem {ProblemID}: {ProblemTitle}') }}",
"body": "{{ item.body | default('{ProblemDetailsHTML}') }}",
"receivers": [
{% for email in item.recipients %}
"{{ email }}"{{ "," if not loop.last else "" }}
{% endfor %}
],
"ccReceivers": [
{% for email in item.cc_recipients | default([]) %}
"{{ email }}"{{ "," if not loop.last else "" }}
{% endfor %}
],
"bccReceivers": [
{% for email in item.bcc_recipients | default([]) %}
"{{ email }}"{{ "," if not loop.last else "" }}
{% endfor %}
],
"notifyClosedProblems": {{ dynatrace_environments[dynatrace_environment].notify_on_close | lower }}
}
Custom Metric Events
For alerts on specific metrics (not auto-detected by Davis AI).
tasks/metric_events.yml
---
- name: Get existing metric events
uri:
url: "{{ dynatrace_config_api }}/anomalyDetection/metricEvents"
method: GET
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
register: existing_metric_events
- name: Build existing metric events lookup
set_fact:
existing_metric_events_map: "{{ existing_metric_events.json['values'] | items2dict(key_name='name', value_name='id') }}"
- name: Create or update metric events
uri:
url: "{{ dynatrace_config_api }}/anomalyDetection/metricEvents/{{ existing_metric_events_map[item.name] | default('') }}"
method: "{{ 'PUT' if item.name in existing_metric_events_map else 'POST' }}"
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
Content-Type: "application/json"
body: "{{ lookup('template', 'metric_event.json.j2') }}"
body_format: json
status_code: [200, 201, 204]
loop: "{{ dynatrace_metric_events }}"
loop_control:
label: "{{ item.name }}"
register: metric_event_results
templates/metric_event.json.j2
{
"metadata": {
"configurationVersions": [3],
"clusterVersion": "1.261.0"
},
"name": "{{ item.name }}",
"description": "{{ item.description | default('') }}",
"enabled": {{ item.enabled | default(true) | lower }},
"alertingScope": [
{% for scope in item.scopes | default([]) %}
{
{% if scope.type == 'management_zone' %}
"filterType": "MANAGEMENT_ZONE",
"managementZoneId": "{{ scope.id }}"
{% elif scope.type == 'entity' %}
"filterType": "ENTITY_ID",
"entityId": "{{ scope.id }}"
{% elif scope.type == 'tag' %}
"filterType": "TAG",
"tagFilter": {
"context": "{{ scope.context | default('CONTEXTLESS') }}",
"key": "{{ scope.key }}",
"value": "{{ scope.value | default('') }}"
}
{% elif scope.type == 'name' %}
"filterType": "NAME",
"nameFilter": {
"value": "{{ scope.value }}",
"operator": "{{ scope.operator | default('EQUALS') }}"
}
{% endif %}
}{{ "," if not loop.last else "" }}
{% endfor %}
],
"metricSelector": "{{ item.metric_selector }}",
"monitoringStrategy": {
"type": "{{ item.strategy_type | default('STATIC_THRESHOLD') }}",
{% if item.strategy_type | default('STATIC_THRESHOLD') == 'STATIC_THRESHOLD' %}
"alertCondition": "{{ item.condition | default('ABOVE') }}",
"samples": {{ item.samples | default(5) }},
"violatingSamples": {{ item.violating_samples | default(3) }},
"dealertingSamples": {{ item.dealerting_samples | default(5) }},
"threshold": {{ item.threshold }},
"unit": "{{ item.unit | default('UNSPECIFIED') }}"
{% elif item.strategy_type == 'AUTO_ADAPTIVE_BASELINE' %}
"alertCondition": "{{ item.condition | default('ABOVE') }}",
"samples": {{ item.samples | default(5) }},
"violatingSamples": {{ item.violating_samples | default(3) }},
"dealertingSamples": {{ item.dealerting_samples | default(5) }},
"numberOfSignalFluctuations": {{ item.signal_fluctuations | default(1.0) }}
{% endif %}
},
{% if item.dimensions is defined %}
"dimensions": [
{% for dim in item.dimensions %}
{
"key": "{{ dim.key }}",
"name": "{{ dim.name | default(dim.key) }}",
"filterType": "{{ dim.filter_type | default('ENTITY') }}",
{% if dim.filter_type | default('ENTITY') == 'ENTITY' %}
"entityDimension": {
"entityDimensionKey": "{{ dim.entity_dimension_key }}"
}
{% endif %}
}{{ "," if not loop.last else "" }}
{% endfor %}
],
{% endif %}
"primaryDimensionKey": "{{ item.primary_dimension_key | default('dt.entity.host') }}",
"severity": "{{ item.severity | default('CUSTOM_ALERT') }}",
"warningReason": "{{ item.warning_reason | default('NONE') }}",
"eventTemplate": {
"title": "{{ item.event_title | default(item.name) }}",
"description": "{{ item.event_description | default('Metric threshold exceeded') }}",
"eventType": "{{ item.event_type | default('CUSTOM_ALERT') }}",
"metadata": [
{% for meta in item.metadata | default([]) %}
{
"metadataKey": "{{ meta.key }}",
"metadataValue": "{{ meta.value }}"
}{{ "," if not loop.last else "" }}
{% endfor %}
]
}
}
Maintenance Windows
For suppressing alerts during planned maintenance.
tasks/maintenance.yml
---
- name: Get existing maintenance windows
uri:
url: "{{ dynatrace_config_api }}/maintenanceWindows"
method: GET
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
register: existing_maintenance
- name: Build existing maintenance lookup
set_fact:
existing_maintenance_map: "{{ existing_maintenance.json['values'] | items2dict(key_name='name', value_name='id') }}"
- name: Create or update maintenance windows
uri:
url: "{{ dynatrace_config_api }}/maintenanceWindows/{{ existing_maintenance_map[item.name] | default('') }}"
method: "{{ 'PUT' if item.name in existing_maintenance_map else 'POST' }}"
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
Content-Type: "application/json"
body: "{{ lookup('template', 'maintenance_window.json.j2') }}"
body_format: json
status_code: [200, 201, 204]
loop: "{{ dynatrace_maintenance_windows }}"
loop_control:
label: "{{ item.name }}"
templates/maintenance_window.json.j2
{
"name": "{{ item.name }}",
"description": "{{ item.description | default('') }}",
"type": "{{ item.type | default('PLANNED') }}",
"suppression": "{{ item.suppression | default('DETECT_PROBLEMS_DONT_ALERT') }}",
"scope": {
{% if item.scope.type == 'environment' %}
"entities": [],
"matches": []
{% elif item.scope.type == 'entities' %}
"entities": [
{% for entity in item.scope.entities %}
"{{ entity }}"{{ "," if not loop.last else "" }}
{% endfor %}
],
"matches": []
{% elif item.scope.type == 'tags' %}
"entities": [],
"matches": [
{% for match in item.scope.matches %}
{
"type": "{{ match.type | default('SERVICE') }}",
{% if match.management_zone is defined %}
"mzId": "{{ match.management_zone }}",
{% endif %}
"tags": [
{% for tag in match.tags %}
{
"context": "{{ tag.context | default('CONTEXTLESS') }}",
"key": "{{ tag.key }}",
"value": "{{ tag.value | default('') }}"
}{{ "," if not loop.last else "" }}
{% endfor %}
],
"tagCombination": "{{ match.tag_combination | default('AND') }}"
}{{ "," if not loop.last else "" }}
{% endfor %}
]
{% endif %}
},
"schedule": {
"type": "{{ item.schedule.type | default('ONCE') }}",
{% if item.schedule.type | default('ONCE') == 'ONCE' %}
"start": "{{ item.schedule.start }}",
"end": "{{ item.schedule.end }}",
"zoneId": "{{ item.schedule.timezone | default('Europe/London') }}"
{% elif item.schedule.type == 'DAILY' %}
"recurrenceRange": {
"start": "{{ item.schedule.range_start }}",
"end": "{{ item.schedule.range_end }}"
},
"dailyRecurrence": {
"timeWindow": {
"start": "{{ item.schedule.daily_start }}",
"end": "{{ item.schedule.daily_end }}"
},
"recurrenceRange": {
"start": "{{ item.schedule.range_start }}",
"end": "{{ item.schedule.range_end }}"
}
},
"zoneId": "{{ item.schedule.timezone | default('Europe/London') }}"
{% elif item.schedule.type == 'WEEKLY' %}
"recurrenceRange": {
"start": "{{ item.schedule.range_start }}",
"end": "{{ item.schedule.range_end }}"
},
"weeklyRecurrence": {
"timeWindow": {
"start": "{{ item.schedule.weekly_start }}",
"end": "{{ item.schedule.weekly_end }}"
},
"dayOfWeek": "{{ item.schedule.day_of_week }}",
"recurrenceRange": {
"start": "{{ item.schedule.range_start }}",
"end": "{{ item.schedule.range_end }}"
}
},
"zoneId": "{{ item.schedule.timezone | default('Europe/London') }}"
{% endif %}
}
}
Usage Examples
Playbook: Configure All Alerting
# playbooks/dynatrace-alerting.yml
---
- name: Configure Dynatrace Alerting
hosts: localhost
connection: local
gather_facts: false
vars:
dynatrace_environment: "{{ env | default('production') }}"
vars_files:
- "vars/dynatrace/common.yml"
- "vars/dynatrace/{{ dynatrace_environment }}.yml"
roles:
- dynatrace_alerting
vars/dynatrace/common.yml
---
# Alerting profiles used across all environments
dynatrace_alerting_profiles:
# Critical services - immediate alerting
- name: "Critical Services - P1"
tags:
- key: "criticality"
value: "critical"
severity_rules:
- severity: AVAILABILITY
delay_in_minutes: 0
- severity: ERROR
delay_in_minutes: 0
- severity: SLOWDOWN
delay_in_minutes: 2
- severity: RESOURCE_CONTENTION
delay_in_minutes: 5
# Standard services
- name: "Standard Services - P2"
tags:
- key: "criticality"
value: "standard"
severity_rules:
- severity: AVAILABILITY
delay_in_minutes: 5
- severity: ERROR
delay_in_minutes: 5
- severity: SLOWDOWN
delay_in_minutes: 10
- severity: RESOURCE_CONTENTION
delay_in_minutes: 15
# Non-critical / batch jobs
- name: "Non-Critical - P3"
tags:
- key: "criticality"
value: "low"
severity_rules:
- severity: AVAILABILITY
delay_in_minutes: 15
- severity: ERROR
delay_in_minutes: 15
- severity: SLOWDOWN
delay_in_minutes: 30
- severity: RESOURCE_CONTENTION
delay_in_minutes: 60
# Common metric events (custom alerts)
dynatrace_metric_events:
# High CPU on any host
- name: "High CPU Usage"
description: "CPU usage above 90% for 5 minutes"
metric_selector: "builtin:host.cpu.usage:avg"
strategy_type: STATIC_THRESHOLD
threshold: 90
condition: ABOVE
samples: 5
violating_samples: 3
severity: RESOURCE_CONTENTION
scopes:
- type: tag
key: "environment"
value: "{{ dynatrace_environment }}"
# Disk space low
- name: "Low Disk Space"
description: "Less than 10% disk space remaining"
metric_selector: "builtin:host.disk.avail:avg"
strategy_type: STATIC_THRESHOLD
threshold: 10
condition: BELOW
samples: 3
violating_samples: 2
unit: PERCENT
severity: RESOURCE_CONTENTION
# High memory usage
- name: "High Memory Usage"
description: "Memory usage above 95%"
metric_selector: "builtin:host.mem.usage:avg"
strategy_type: STATIC_THRESHOLD
threshold: 95
condition: ABOVE
samples: 5
violating_samples: 3
severity: RESOURCE_CONTENTION
# Error rate spike
- name: "Service Error Rate High"
description: "Error rate above 5%"
metric_selector: "builtin:service.errors.total.rate:avg"
strategy_type: STATIC_THRESHOLD
threshold: 5
condition: ABOVE
samples: 5
violating_samples: 3
unit: PERCENT
severity: ERROR
scopes:
- type: tag
key: "environment"
value: "{{ dynatrace_environment }}"
# Response time degradation
- name: "Service Response Time Degraded"
description: "P95 response time above 2 seconds"
metric_selector: "builtin:service.response.time:percentile(95)"
strategy_type: STATIC_THRESHOLD
threshold: 2000000 # 2 seconds in microseconds
condition: ABOVE
samples: 10
violating_samples: 6
severity: SLOWDOWN
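Response-time thresholds are easy to get wrong because Dynatrace reports service response time in microseconds, as the comment on the last threshold hints. A tiny helper (illustrative only) makes the conversion explicit:

```python
def seconds_to_micros(seconds: float) -> int:
    """Convert a human-friendly threshold in seconds to the microseconds
    expected by builtin:service.response.time thresholds."""
    return int(seconds * 1_000_000)
```

So the 2-second P95 threshold above is `seconds_to_micros(2)`, i.e. 2000000.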
vars/dynatrace/production.yml
---
dynatrace_environment: production
# Production-specific notifications
dynatrace_notifications:
# Critical alerts to PagerDuty
- name: "Production Critical - PagerDuty"
type: pagerduty
alerting_profile_id: "{{ lookup('dynatrace_profile_id', 'Critical Services - P1') }}"
account: "yourcompany"
integration_key: "{{ vault_pagerduty_integration_key }}"
service_name: "Production Critical Services"
# All production alerts to Slack
- name: "Production Alerts - Slack"
type: slack
alerting_profile_id: "{{ lookup('dynatrace_profile_id', 'Standard Services - P2') }}"
webhook_url: "{{ vault_slack_webhook_url }}"
channel: "#prod-alerts"
# Critical alerts also to email
- name: "Production Critical - Email"
type: email
alerting_profile_id: "{{ lookup('dynatrace_profile_id', 'Critical Services - P1') }}"
recipients:
- oncall@yourcompany.com
- platform-team@yourcompany.com
subject: "[CRITICAL] {ProblemSeverity}: {ProblemTitle}"
# Production maintenance windows
dynatrace_maintenance_windows:
# Weekly maintenance window
- name: "Weekly Platform Maintenance"
description: "Sunday 2-4am maintenance window"
type: PLANNED
suppression: DETECT_PROBLEMS_DONT_ALERT
scope:
type: tags
matches:
- type: HOST
tags:
- key: "maintenance-window"
value: "weekly"
schedule:
type: WEEKLY
day_of_week: SUNDAY
weekly_start: "02:00"
weekly_end: "04:00"
range_start: "2022-01-01"
range_end: "2025-12-31"
timezone: "Europe/London"
Running the Playbook
# Configure production alerting
ansible-playbook playbooks/dynatrace-alerting.yml -e env=production
# Configure staging (with longer delays)
ansible-playbook playbooks/dynatrace-alerting.yml -e env=staging
# Only update alerting profiles
ansible-playbook playbooks/dynatrace-alerting.yml -e env=production --tags alerting_profiles
# Only update metric events
ansible-playbook playbooks/dynatrace-alerting.yml -e env=production --tags metric_events
# Dry run with check mode
ansible-playbook playbooks/dynatrace-alerting.yml -e env=production --check --diff
CI/CD Integration
We integrated this into our GitLab CI pipeline:
# .gitlab-ci.yml
stages:
- validate
- plan
- apply
variables:
ANSIBLE_FORCE_COLOR: "true"
.dynatrace-base:
image: ansible/ansible:latest
before_script:
- pip install jmespath
- ansible-galaxy collection install community.general
validate:
extends: .dynatrace-base
stage: validate
script:
- ansible-playbook playbooks/dynatrace-alerting.yml --syntax-check
- ansible-lint playbooks/dynatrace-alerting.yml roles/dynatrace_alerting/
rules:
- if: $CI_MERGE_REQUEST_ID
plan:
extends: .dynatrace-base
stage: plan
script:
- ansible-playbook playbooks/dynatrace-alerting.yml -e env=production --check --diff
rules:
- if: $CI_MERGE_REQUEST_ID
apply:production:
extends: .dynatrace-base
stage: apply
script:
- ansible-playbook playbooks/dynatrace-alerting.yml -e env=production
rules:
- if: $CI_COMMIT_BRANCH == "main"
environment:
name: production
Lessons Learned
1. API Rate Limits
Dynatrace has API rate limits. When managing hundreds of configs, we hit them.
Fix: Add delays between API calls:
- name: Create alerting profile
uri:
# ...
throttle: 1 # Only 1 concurrent request
- name: Pause between API calls
pause:
seconds: 1
when: profile_results.changed
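The fixed one-second pause works, but the fuller client-side answer is retry with exponential backoff when the API returns HTTP 429. A sketch of that pattern outside Ansible (`request_fn` is a stand-in for whatever actually makes the call):

```python
import time

def call_with_backoff(request_fn, max_retries=5, base_delay=1.0):
    """Retry request_fn while it reports HTTP 429, doubling the wait each time."""
    delay = base_delay
    for _ in range(max_retries):
        status, body = request_fn()
        if status != 429:          # success, or a non-rate-limit error: hand it back
            return status, body
        time.sleep(delay)          # back off before retrying
        delay *= 2
    raise RuntimeError("still rate-limited after retries")
```

In practice we found the simple `throttle`/`pause` combination above sufficient; backoff only became worth it for bulk migrations touching hundreds of configs in one run.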
2. Idempotency with IDs
Dynatrace assigns IDs to configs. To make updates idempotent, we needed to track IDs.
Fix: Query existing configs first, build a lookup map, use PUT for updates.
3. Environment-Specific Delays
What’s critical in prod isn’t critical in dev, and we wasted on-call time responding to non-prod alerts.
Fix: Environment-specific delay multipliers in the role defaults.
4. Secret Management
API tokens and webhook URLs are secrets.
Fix: Use Ansible Vault for sensitive variables:
ansible-vault encrypt vars/dynatrace/secrets.yml
ansible-playbook playbooks/dynatrace-alerting.yml --ask-vault-pass
5. Profile ID Lookups
Notifications need alerting profile IDs, but we define profiles by name.
Fix: Create a custom lookup plugin or query the API in a pre-task:
- name: Get alerting profile ID
uri:
url: "{{ dynatrace_config_api }}/alertingProfiles"
method: GET
headers:
Authorization: "Api-Token {{ dynatrace_api_token }}"
register: profiles
- name: Set profile ID facts
set_fact:
alerting_profile_ids: "{{ profiles.json['values'] | items2dict(key_name='name', value_name='id') }}"
6. Testing Changes
We broke alerting in production by deploying untested changes.
Fix: Deploy to staging first, wait 24 hours, then production. Add --check mode validation to CI.
Key Takeaways
- Treat alerting as code - Version control, review, test, deploy
- Environment-specific configs - Prod alerts ≠ Dev alerts
- Centralize notification channels - Avoid alert sprawl
- Use tags for scoping - Management zones are less flexible
- Automate maintenance windows - Don’t suppress alerts manually
- Test before production - use --check mode and staging environments
- Document your alert strategy - Future you will thank present you
This approach transformed our alerting from a manual, inconsistent mess into a reliable, reviewable, version-controlled system. Changes go through PRs, get reviewed, and deploy consistently across environments.
Managing Dynatrace at scale? Questions about the Ansible integration? Find me on LinkedIn or GitHub.