
Managing Dynatrace Alerts at Scale with Custom Ansible Roles

Observability · DevOps


Dynatrace is powerful, but managing alerts across 50+ applications and 4 environments through the UI is a nightmare. Click here, configure there, copy settings manually - it doesn’t scale, and drift is inevitable.

We solved this by treating Dynatrace alerting configuration as code, managed through custom Ansible roles. This post covers how we built it, the Dynatrace API patterns, and the Ansible structure that let us manage thousands of alert configurations consistently.

The Problem

Our Dynatrace setup had grown organically:

  • 200+ alerting profiles - many duplicates, inconsistent thresholds
  • No version control - who changed what, when?
  • Environment drift - prod alerts different from staging
  • Manual onboarding - new services took hours to configure
  • No review process - anyone could change alerts without approval

We needed Infrastructure as Code for our alerting.

Why Ansible?

We evaluated several options:

| Tool             | Pros                                            | Cons                                       |
|------------------|-------------------------------------------------|--------------------------------------------|
| Terraform        | Declarative, state management                   | Dynatrace provider was immature (2022)     |
| Dynatrace Monaco | Purpose-built for Dynatrace                     | Another tool to learn, limited flexibility |
| Ansible          | Already in our stack, flexible, good API support | Imperative, no state tracking              |
| Custom scripts   | Full control                                    | Maintenance burden                         |

We chose Ansible because:

  1. Team already knew it
  2. Good HTTP/REST modules
  3. Could integrate with existing automation
  4. Jinja2 templating for complex configs

Dynatrace API Fundamentals

Before diving into Ansible, it's worth understanding Dynatrace's APIs.

API Versions

Dynatrace has multiple APIs:

Environment API v1: /api/v1/...
Environment API v2: /api/v2/...
Configuration API v1: /api/config/v1/...

For alerting, we primarily use:

  • Config API v1 - Alerting profiles, notifications
  • Environment API v2 - Metric events, SLOs

Authentication

# API Token with these permissions:
# - Read configuration
# - Write configuration
# - Read metrics
# - Read entities

export DT_API_TOKEN="dt0c01.XXXXXXXX.YYYYYYYY"
export DT_ENVIRONMENT_URL="https://abc12345.live.dynatrace.com"

Key Endpoints

# Alerting Profiles
GET/POST/PUT/DELETE /api/config/v1/alertingProfiles

# Problem Notifications (integrations)
GET/POST/PUT/DELETE /api/config/v1/notifications

# Metric Events (custom alerts)
GET/POST/PUT/DELETE /api/config/v1/anomalyDetection/metricEvents

# Maintenance Windows
GET/POST/PUT/DELETE /api/config/v1/maintenanceWindows

# Auto-tags (for filtering)
GET/POST/PUT/DELETE /api/config/v1/autoTags
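To make the auth and endpoint conventions concrete, here's a minimal Python sketch that builds a request against the Config API. The fallback environment URL and token are placeholders mirroring the examples above; sending the request obviously requires a live Dynatrace environment.

```python
import os
import urllib.request

# Placeholder fallbacks -- in practice these come from DT_ENVIRONMENT_URL / DT_API_TOKEN.
env_url = os.environ.get("DT_ENVIRONMENT_URL", "https://abc12345.live.dynatrace.com")
token = os.environ.get("DT_API_TOKEN", "dt0c01.XXXXXXXX.YYYYYYYY")

def config_api_request(path: str) -> urllib.request.Request:
    """Build a GET request against Config API v1 with Api-Token auth."""
    return urllib.request.Request(
        f"{env_url}/api/config/v1/{path}",
        headers={"Authorization": f"Api-Token {token}"},
    )

req = config_api_request("alertingProfiles")
# urllib.request.urlopen(req) would return the alerting profile list
# on a live environment.
```

The same header and URL pattern applies to every endpoint listed above; only the path changes.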

Ansible Role Structure

Here’s the structure we built:

roles/
└── dynatrace_alerting/
    ├── defaults/
    │   └── main.yml              # Default variables
    ├── tasks/
    │   ├── main.yml              # Entry point
    │   ├── alerting_profiles.yml # Alerting profile management
    │   ├── notifications.yml     # Notification channels
    │   ├── metric_events.yml     # Custom metric alerts
    │   ├── maintenance.yml       # Maintenance windows
    │   └── validate.yml          # Pre-flight checks
    ├── templates/
    │   ├── alerting_profile.json.j2
    │   ├── notification_slack.json.j2
    │   ├── notification_pagerduty.json.j2
    │   ├── notification_email.json.j2
    │   ├── metric_event.json.j2
    │   └── maintenance_window.json.j2
    ├── vars/
    │   └── main.yml              # Static variables
    ├── handlers/
    │   └── main.yml
    └── meta/
        └── main.yml

Role Implementation

defaults/main.yml

---
# Dynatrace connection
dynatrace_environment_url: "{{ lookup('env', 'DT_ENVIRONMENT_URL') }}"
dynatrace_api_token: "{{ lookup('env', 'DT_API_TOKEN') }}"

# API endpoints
dynatrace_config_api: "{{ dynatrace_environment_url }}/api/config/v1"
dynatrace_env_api_v2: "{{ dynatrace_environment_url }}/api/v2"

# Default alerting settings
dynatrace_default_alert_delay: 0
dynatrace_default_severity_rules:
  - severity: AVAILABILITY
    delay_in_minutes: 0
  - severity: ERROR
    delay_in_minutes: 0
  - severity: SLOWDOWN
    delay_in_minutes: 5
  - severity: RESOURCE_CONTENTION
    delay_in_minutes: 10
  - severity: CUSTOM_ALERT
    delay_in_minutes: 0

# Environment-specific overrides
dynatrace_environments:
  production:
    alert_delay_multiplier: 1
    notify_on_close: true
  staging:
    alert_delay_multiplier: 2
    notify_on_close: false
  development:
    alert_delay_multiplier: 5
    notify_on_close: false

tasks/main.yml

---
- name: Validate Dynatrace connection
  include_tasks: validate.yml
  tags:
    - always

- name: Manage alerting profiles
  include_tasks: alerting_profiles.yml
  when: dynatrace_alerting_profiles is defined
  tags:
    - alerting_profiles
    - profiles

- name: Manage notification channels
  include_tasks: notifications.yml
  when: dynatrace_notifications is defined
  tags:
    - notifications

- name: Manage metric events
  include_tasks: metric_events.yml
  when: dynatrace_metric_events is defined
  tags:
    - metric_events
    - custom_alerts

- name: Manage maintenance windows
  include_tasks: maintenance.yml
  when: dynatrace_maintenance_windows is defined
  tags:
    - maintenance

tasks/validate.yml

---
- name: Verify Dynatrace API token is set
  assert:
    that:
      - dynatrace_api_token is defined
      - dynatrace_api_token | length > 0
    fail_msg: "DT_API_TOKEN environment variable must be set"

- name: Verify Dynatrace environment URL is set
  assert:
    that:
      - dynatrace_environment_url is defined
      - dynatrace_environment_url | length > 0
    fail_msg: "DT_ENVIRONMENT_URL environment variable must be set"

- name: Test Dynatrace API connectivity
  uri:
    url: "{{ dynatrace_config_api }}/alertingProfiles"
    method: GET
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
    status_code: 200
  register: api_test
  failed_when: api_test.status != 200

- name: Display API connection status
  debug:
    msg: "Successfully connected to Dynatrace. Found {{ api_test.json['values'] | length }} existing alerting profiles."

Alerting Profiles

Alerting profiles define WHAT problems trigger alerts and with what delay.

tasks/alerting_profiles.yml

---
- name: Get existing alerting profiles
  uri:
    url: "{{ dynatrace_config_api }}/alertingProfiles"
    method: GET
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
  register: existing_profiles

- name: Build existing profiles lookup
  set_fact:
    existing_profiles_map: "{{ existing_profiles.json['values'] | items2dict(key_name='name', value_name='id') }}"

- name: Create or update alerting profiles
  uri:
    url: "{{ dynatrace_config_api }}/alertingProfiles/{{ existing_profiles_map[item.name] | default('') }}"
    method: "{{ 'PUT' if item.name in existing_profiles_map else 'POST' }}"
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
      Content-Type: "application/json"
    body: "{{ lookup('template', 'alerting_profile.json.j2') }}"
    body_format: json
    status_code: [200, 201, 204]
  loop: "{{ dynatrace_alerting_profiles }}"
  loop_control:
    label: "{{ item.name }}"
  register: profile_results

- name: Delete removed alerting profiles
  uri:
    url: "{{ dynatrace_config_api }}/alertingProfiles/{{ item.value }}"
    method: DELETE
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
    status_code: [204, 404]
  loop: "{{ existing_profiles_map | dict2items }}"
  loop_control:
    label: "{{ item.key }}"
  when:
    - dynatrace_alerting_profiles_delete_unmanaged | default(false)
    - item.key not in (dynatrace_alerting_profiles | map(attribute='name') | list)

templates/alerting_profile.json.j2

{
  "displayName": "{{ item.name }}",
  "rules": [
{% for rule in item.severity_rules | default(dynatrace_default_severity_rules) %}
    {
      "severityLevel": "{{ rule.severity }}",
      "tagFilter": {
        "includeMode": "{{ rule.tag_include_mode | default('INCLUDE_ANY') }}",
        "tagFilters": [
{% for tag in rule.tags | default(item.tags | default([])) %}
          {
            "context": "{{ tag.context | default('CONTEXTLESS') }}",
            "key": "{{ tag.key }}",
            "value": "{{ tag.value | default('') }}"
          }{{ "," if not loop.last else "" }}
{% endfor %}
        ]
      },
      "delayInMinutes": {{ (rule.delay_in_minutes * dynatrace_environments[dynatrace_environment].alert_delay_multiplier) | int }}
    }{{ "," if not loop.last else "" }}
{% endfor %}
  ],
{% if item.management_zone is defined %}
  "managementZoneId": "{{ item.management_zone }}",
{% endif %}
  "eventTypeFilters": [
{% for event_type in item.event_types | default(['CUSTOM_ALERT', 'CUSTOM_ANNOTATION', 'CUSTOM_CONFIGURATION', 'CUSTOM_DEPLOYMENT', 'ERROR_EVENT', 'MARKED_FOR_TERMINATION', 'PERFORMANCE_EVENT', 'RESOURCE_CONTENTION_EVENT']) %}
    {
      "predefinedEventFilter": {
        "eventType": "{{ event_type }}",
        "negate": false
      }
    }{{ "," if not loop.last else "" }}
{% endfor %}
  ]
}

Notification Channels

Notifications define WHERE alerts go (Slack, PagerDuty, email, webhooks).

tasks/notifications.yml

---
- name: Get existing notifications
  uri:
    url: "{{ dynatrace_config_api }}/notifications"
    method: GET
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
  register: existing_notifications

- name: Build existing notifications lookup
  set_fact:
    existing_notifications_map: "{{ existing_notifications.json['values'] | items2dict(key_name='name', value_name='id') }}"

- name: Create or update Slack notifications
  uri:
    url: "{{ dynatrace_config_api }}/notifications/{{ existing_notifications_map[item.name] | default('') }}"
    method: "{{ 'PUT' if item.name in existing_notifications_map else 'POST' }}"
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
      Content-Type: "application/json"
    body: "{{ lookup('template', 'notification_slack.json.j2') }}"
    body_format: json
    status_code: [200, 201, 204]
  loop: "{{ dynatrace_notifications | selectattr('type', 'equalto', 'slack') | list }}"
  loop_control:
    label: "{{ item.name }}"

- name: Create or update PagerDuty notifications
  uri:
    url: "{{ dynatrace_config_api }}/notifications/{{ existing_notifications_map[item.name] | default('') }}"
    method: "{{ 'PUT' if item.name in existing_notifications_map else 'POST' }}"
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
      Content-Type: "application/json"
    body: "{{ lookup('template', 'notification_pagerduty.json.j2') }}"
    body_format: json
    status_code: [200, 201, 204]
  loop: "{{ dynatrace_notifications | selectattr('type', 'equalto', 'pagerduty') | list }}"
  loop_control:
    label: "{{ item.name }}"

- name: Create or update email notifications
  uri:
    url: "{{ dynatrace_config_api }}/notifications/{{ existing_notifications_map[item.name] | default('') }}"
    method: "{{ 'PUT' if item.name in existing_notifications_map else 'POST' }}"
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
      Content-Type: "application/json"
    body: "{{ lookup('template', 'notification_email.json.j2') }}"
    body_format: json
    status_code: [200, 201, 204]
  loop: "{{ dynatrace_notifications | selectattr('type', 'equalto', 'email') | list }}"
  loop_control:
    label: "{{ item.name }}"

templates/notification_slack.json.j2

{
  "type": "SLACK",
  "name": "{{ item.name }}",
  "alertingProfile": "{{ item.alerting_profile_id }}",
  "active": {{ item.active | default(true) | lower }},
  "url": "{{ item.webhook_url }}",
  "channel": "{{ item.channel }}",
  "title": "{{ item.title | default('{State} {ProblemSeverity} Problem {ProblemID}: {ProblemTitle}') }}"
}

templates/notification_pagerduty.json.j2

{
  "type": "PAGER_DUTY",
  "name": "{{ item.name }}",
  "alertingProfile": "{{ item.alerting_profile_id }}",
  "active": {{ item.active | default(true) | lower }},
  "account": "{{ item.account }}",
  "serviceApiKey": "{{ item.integration_key }}",
  "serviceName": "{{ item.service_name }}"
}

templates/notification_email.json.j2

{
  "type": "EMAIL",
  "name": "{{ item.name }}",
  "alertingProfile": "{{ item.alerting_profile_id }}",
  "active": {{ item.active | default(true) | lower }},
  "subject": "{{ item.subject | default('{State} {ProblemSeverity} Problem {ProblemID}: {ProblemTitle}') }}",
  "body": "{{ item.body | default('{ProblemDetailsHTML}') }}",
  "receivers": [
{% for email in item.recipients %}
    "{{ email }}"{{ "," if not loop.last else "" }}
{% endfor %}
  ],
  "ccReceivers": [
{% for email in item.cc_recipients | default([]) %}
    "{{ email }}"{{ "," if not loop.last else "" }}
{% endfor %}
  ],
  "bccReceivers": [
{% for email in item.bcc_recipients | default([]) %}
    "{{ email }}"{{ "," if not loop.last else "" }}
{% endfor %}
  ],
  "notifyClosedProblems": {{ dynatrace_environments[dynatrace_environment].notify_on_close | lower }}
}

Custom Metric Events

For alerts on specific metrics (not auto-detected by Davis AI).

tasks/metric_events.yml

---
- name: Get existing metric events
  uri:
    url: "{{ dynatrace_config_api }}/anomalyDetection/metricEvents"
    method: GET
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
  register: existing_metric_events

- name: Build existing metric events lookup
  set_fact:
    existing_metric_events_map: "{{ existing_metric_events.json['values'] | items2dict(key_name='name', value_name='id') }}"

- name: Create or update metric events
  uri:
    url: "{{ dynatrace_config_api }}/anomalyDetection/metricEvents/{{ existing_metric_events_map[item.name] | default('') }}"
    method: "{{ 'PUT' if item.name in existing_metric_events_map else 'POST' }}"
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
      Content-Type: "application/json"
    body: "{{ lookup('template', 'metric_event.json.j2') }}"
    body_format: json
    status_code: [200, 201, 204]
  loop: "{{ dynatrace_metric_events }}"
  loop_control:
    label: "{{ item.name }}"
  register: metric_event_results

templates/metric_event.json.j2

{
  "metadata": {
    "configurationVersions": [3],
    "clusterVersion": "1.261.0"
  },
  "name": "{{ item.name }}",
  "description": "{{ item.description | default('') }}",
  "enabled": {{ item.enabled | default(true) | lower }},
  "alertingScope": [
{% for scope in item.scopes | default([]) %}
    {
{% if scope.type == 'management_zone' %}
      "filterType": "MANAGEMENT_ZONE",
      "managementZoneId": "{{ scope.id }}"
{% elif scope.type == 'entity' %}
      "filterType": "ENTITY_ID",
      "entityId": "{{ scope.id }}"
{% elif scope.type == 'tag' %}
      "filterType": "TAG",
      "tagFilter": {
        "context": "{{ scope.context | default('CONTEXTLESS') }}",
        "key": "{{ scope.key }}",
        "value": "{{ scope.value | default('') }}"
      }
{% elif scope.type == 'name' %}
      "filterType": "NAME",
      "nameFilter": {
        "value": "{{ scope.value }}",
        "operator": "{{ scope.operator | default('EQUALS') }}"
      }
{% endif %}
    }{{ "," if not loop.last else "" }}
{% endfor %}
  ],
  "metricSelector": "{{ item.metric_selector }}",
  "monitoringStrategy": {
    "type": "{{ item.strategy_type | default('STATIC_THRESHOLD') }}",
{% if item.strategy_type | default('STATIC_THRESHOLD') == 'STATIC_THRESHOLD' %}
    "alertCondition": "{{ item.condition | default('ABOVE') }}",
    "samples": {{ item.samples | default(5) }},
    "violatingSamples": {{ item.violating_samples | default(3) }},
    "dealertingSamples": {{ item.dealerting_samples | default(5) }},
    "threshold": {{ item.threshold }},
    "unit": "{{ item.unit | default('UNSPECIFIED') }}"
{% elif item.strategy_type == 'AUTO_ADAPTIVE_BASELINE' %}
    "alertCondition": "{{ item.condition | default('ABOVE') }}",
    "samples": {{ item.samples | default(5) }},
    "violatingSamples": {{ item.violating_samples | default(3) }},
    "dealertingSamples": {{ item.dealerting_samples | default(5) }},
    "numberOfSignalFluctuations": {{ item.signal_fluctuations | default(1.0) }}
{% endif %}
  },
{% if item.dimensions is defined %}
  "dimensions": [
{% for dim in item.dimensions %}
    {
      "key": "{{ dim.key }}",
      "name": "{{ dim.name | default(dim.key) }}",
      "filterType": "{{ dim.filter_type | default('ENTITY') }}",
{% if dim.filter_type | default('ENTITY') == 'ENTITY' %}
      "entityDimension": {
        "entityDimensionKey": "{{ dim.entity_dimension_key }}"
      }
{% endif %}
    }{{ "," if not loop.last else "" }}
{% endfor %}
  ],
{% endif %}
  "primaryDimensionKey": "{{ item.primary_dimension_key | default('dt.entity.host') }}",
  "severity": "{{ item.severity | default('CUSTOM_ALERT') }}",
  "warningReason": "{{ item.warning_reason | default('NONE') }}",
  "eventTemplate": {
    "title": "{{ item.event_title | default(item.name) }}",
    "description": "{{ item.event_description | default('Metric threshold exceeded') }}",
    "eventType": "{{ item.event_type | default('CUSTOM_ALERT') }}",
    "metadata": [
{% for meta in item.metadata | default([]) %}
      {
        "metadataKey": "{{ meta.key }}",
        "metadataValue": "{{ meta.value }}"
      }{{ "," if not loop.last else "" }}
{% endfor %}
    ]
  }
}

Maintenance Windows

For suppressing alerts during planned maintenance.

tasks/maintenance.yml

---
- name: Get existing maintenance windows
  uri:
    url: "{{ dynatrace_config_api }}/maintenanceWindows"
    method: GET
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
  register: existing_maintenance

- name: Build existing maintenance lookup
  set_fact:
    existing_maintenance_map: "{{ existing_maintenance.json['values'] | items2dict(key_name='name', value_name='id') }}"

- name: Create or update maintenance windows
  uri:
    url: "{{ dynatrace_config_api }}/maintenanceWindows/{{ existing_maintenance_map[item.name] | default('') }}"
    method: "{{ 'PUT' if item.name in existing_maintenance_map else 'POST' }}"
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
      Content-Type: "application/json"
    body: "{{ lookup('template', 'maintenance_window.json.j2') }}"
    body_format: json
    status_code: [200, 201, 204]
  loop: "{{ dynatrace_maintenance_windows }}"
  loop_control:
    label: "{{ item.name }}"

templates/maintenance_window.json.j2

{
  "name": "{{ item.name }}",
  "description": "{{ item.description | default('') }}",
  "type": "{{ item.type | default('PLANNED') }}",
  "suppression": "{{ item.suppression | default('DETECT_PROBLEMS_DONT_ALERT') }}",
  "scope": {
{% if item.scope.type == 'environment' %}
    "entities": [],
    "matches": []
{% elif item.scope.type == 'entities' %}
    "entities": [
{% for entity in item.scope.entities %}
      "{{ entity }}"{{ "," if not loop.last else "" }}
{% endfor %}
    ],
    "matches": []
{% elif item.scope.type == 'tags' %}
    "entities": [],
    "matches": [
{% for match in item.scope.matches %}
      {
        "type": "{{ match.type | default('SERVICE') }}",
{% if match.management_zone is defined %}
        "mzId": "{{ match.management_zone }}",
{% endif %}
        "tags": [
{% for tag in match.tags %}
          {
            "context": "{{ tag.context | default('CONTEXTLESS') }}",
            "key": "{{ tag.key }}",
            "value": "{{ tag.value | default('') }}"
          }{{ "," if not loop.last else "" }}
{% endfor %}
        ],
        "tagCombination": "{{ match.tag_combination | default('AND') }}"
      }{{ "," if not loop.last else "" }}
{% endfor %}
    ]
{% endif %}
  },
  "schedule": {
    "type": "{{ item.schedule.type | default('ONCE') }}",
{% if item.schedule.type | default('ONCE') == 'ONCE' %}
    "start": "{{ item.schedule.start }}",
    "end": "{{ item.schedule.end }}",
    "zoneId": "{{ item.schedule.timezone | default('Europe/London') }}"
{% elif item.schedule.type == 'DAILY' %}
    "recurrenceRange": {
      "start": "{{ item.schedule.range_start }}",
      "end": "{{ item.schedule.range_end }}"
    },
    "dailyRecurrence": {
      "timeWindow": {
        "start": "{{ item.schedule.daily_start }}",
        "end": "{{ item.schedule.daily_end }}"
      },
      "recurrenceRange": {
        "start": "{{ item.schedule.range_start }}",
        "end": "{{ item.schedule.range_end }}"
      }
    },
    "zoneId": "{{ item.schedule.timezone | default('Europe/London') }}"
{% elif item.schedule.type == 'WEEKLY' %}
    "recurrenceRange": {
      "start": "{{ item.schedule.range_start }}",
      "end": "{{ item.schedule.range_end }}"
    },
    "weeklyRecurrence": {
      "timeWindow": {
        "start": "{{ item.schedule.weekly_start }}",
        "end": "{{ item.schedule.weekly_end }}"
      },
      "dayOfWeek": "{{ item.schedule.day_of_week }}",
      "recurrenceRange": {
        "start": "{{ item.schedule.range_start }}",
        "end": "{{ item.schedule.range_end }}"
      }
    },
    "zoneId": "{{ item.schedule.timezone | default('Europe/London') }}"
{% endif %}
  }
}

Usage Examples

Playbook: Configure All Alerting

# playbooks/dynatrace-alerting.yml
---
- name: Configure Dynatrace Alerting
  hosts: localhost
  connection: local
  gather_facts: false

  vars:
    dynatrace_environment: "{{ env | default('production') }}"

  vars_files:
    - "vars/dynatrace/common.yml"
    - "vars/dynatrace/{{ dynatrace_environment }}.yml"

  roles:
    - dynatrace_alerting

vars/dynatrace/common.yml

---
# Alerting profiles used across all environments
dynatrace_alerting_profiles:
  # Critical services - immediate alerting
  - name: "Critical Services - P1"
    tags:
      - key: "criticality"
        value: "critical"
    severity_rules:
      - severity: AVAILABILITY
        delay_in_minutes: 0
      - severity: ERROR
        delay_in_minutes: 0
      - severity: SLOWDOWN
        delay_in_minutes: 2
      - severity: RESOURCE_CONTENTION
        delay_in_minutes: 5

  # Standard services
  - name: "Standard Services - P2"
    tags:
      - key: "criticality"
        value: "standard"
    severity_rules:
      - severity: AVAILABILITY
        delay_in_minutes: 5
      - severity: ERROR
        delay_in_minutes: 5
      - severity: SLOWDOWN
        delay_in_minutes: 10
      - severity: RESOURCE_CONTENTION
        delay_in_minutes: 15

  # Non-critical / batch jobs
  - name: "Non-Critical - P3"
    tags:
      - key: "criticality"
        value: "low"
    severity_rules:
      - severity: AVAILABILITY
        delay_in_minutes: 15
      - severity: ERROR
        delay_in_minutes: 15
      - severity: SLOWDOWN
        delay_in_minutes: 30
      - severity: RESOURCE_CONTENTION
        delay_in_minutes: 60

# Common metric events (custom alerts)
dynatrace_metric_events:
  # High CPU on any host
  - name: "High CPU Usage"
    description: "CPU usage above 90% for 5 minutes"
    metric_selector: "builtin:host.cpu.usage:avg"
    strategy_type: STATIC_THRESHOLD
    threshold: 90
    condition: ABOVE
    samples: 5
    violating_samples: 3
    severity: RESOURCE_CONTENTION
    scopes:
      - type: tag
        key: "environment"
        value: "{{ dynatrace_environment }}"

  # Disk space low
  - name: "Low Disk Space"
    description: "Less than 10% disk space remaining"
    metric_selector: "builtin:host.disk.avail:avg"
    strategy_type: STATIC_THRESHOLD
    threshold: 10
    condition: BELOW
    samples: 3
    violating_samples: 2
    unit: PERCENT
    severity: RESOURCE_CONTENTION

  # High memory usage
  - name: "High Memory Usage"
    description: "Memory usage above 95%"
    metric_selector: "builtin:host.mem.usage:avg"
    strategy_type: STATIC_THRESHOLD
    threshold: 95
    condition: ABOVE
    samples: 5
    violating_samples: 3
    severity: RESOURCE_CONTENTION

  # Error rate spike
  - name: "Service Error Rate High"
    description: "Error rate above 5%"
    metric_selector: "builtin:service.errors.total.rate:avg"
    strategy_type: STATIC_THRESHOLD
    threshold: 5
    condition: ABOVE
    samples: 5
    violating_samples: 3
    unit: PERCENT
    severity: ERROR
    scopes:
      - type: tag
        key: "environment"
        value: "{{ dynatrace_environment }}"

  # Response time degradation
  - name: "Service Response Time Degraded"
    description: "P95 response time above 2 seconds"
    metric_selector: "builtin:service.response.time:percentile(95)"
    strategy_type: STATIC_THRESHOLD
    threshold: 2000000  # 2 seconds in microseconds
    condition: ABOVE
    samples: 10
    violating_samples: 6
    severity: SLOWDOWN

vars/dynatrace/production.yml

---
dynatrace_environment: production

# Production-specific notifications
dynatrace_notifications:
  # Critical alerts to PagerDuty
  - name: "Production Critical - PagerDuty"
    type: pagerduty
    alerting_profile_id: "{{ lookup('dynatrace_profile_id', 'Critical Services - P1') }}"
    account: "yourcompany"
    integration_key: "{{ vault_pagerduty_integration_key }}"
    service_name: "Production Critical Services"

  # All production alerts to Slack
  - name: "Production Alerts - Slack"
    type: slack
    alerting_profile_id: "{{ lookup('dynatrace_profile_id', 'Standard Services - P2') }}"
    webhook_url: "{{ vault_slack_webhook_url }}"
    channel: "#prod-alerts"

  # Critical alerts also to email
  - name: "Production Critical - Email"
    type: email
    alerting_profile_id: "{{ lookup('dynatrace_profile_id', 'Critical Services - P1') }}"
    recipients:
      - oncall@yourcompany.com
      - platform-team@yourcompany.com
    subject: "[CRITICAL] {ProblemSeverity}: {ProblemTitle}"

# Production maintenance windows
dynatrace_maintenance_windows:
  # Weekly maintenance window
  - name: "Weekly Platform Maintenance"
    description: "Sunday 2-4am maintenance window"
    type: PLANNED
    suppression: DETECT_PROBLEMS_DONT_ALERT
    scope:
      type: tags
      matches:
        - type: HOST
          tags:
            - key: "maintenance-window"
              value: "weekly"
    schedule:
      type: WEEKLY
      day_of_week: SUNDAY
      weekly_start: "02:00"
      weekly_end: "04:00"
      range_start: "2022-01-01"
      range_end: "2025-12-31"
      timezone: "Europe/London"

Running the Playbook

# Configure production alerting
ansible-playbook playbooks/dynatrace-alerting.yml -e env=production

# Configure staging (with longer delays)
ansible-playbook playbooks/dynatrace-alerting.yml -e env=staging

# Only update alerting profiles
ansible-playbook playbooks/dynatrace-alerting.yml -e env=production --tags alerting_profiles

# Only update metric events
ansible-playbook playbooks/dynatrace-alerting.yml -e env=production --tags metric_events

# Dry run with check mode
ansible-playbook playbooks/dynatrace-alerting.yml -e env=production --check --diff

CI/CD Integration

We integrated this into our GitLab CI pipeline:

# .gitlab-ci.yml
stages:
  - validate
  - plan
  - apply

variables:
  ANSIBLE_FORCE_COLOR: "true"

.dynatrace-base:
  image: ansible/ansible:latest
  before_script:
    - pip install jmespath
    - ansible-galaxy collection install community.general

validate:
  extends: .dynatrace-base
  stage: validate
  script:
    - ansible-playbook playbooks/dynatrace-alerting.yml --syntax-check
    - ansible-lint playbooks/dynatrace-alerting.yml roles/dynatrace_alerting/
  rules:
    - if: $CI_MERGE_REQUEST_ID

plan:
  extends: .dynatrace-base
  stage: plan
  script:
    - ansible-playbook playbooks/dynatrace-alerting.yml -e env=production --check --diff
  rules:
    - if: $CI_MERGE_REQUEST_ID

apply:production:
  extends: .dynatrace-base
  stage: apply
  script:
    - ansible-playbook playbooks/dynatrace-alerting.yml -e env=production
  rules:
    - if: $CI_COMMIT_BRANCH == "main"
  environment:
    name: production

Lessons Learned

1. API Rate Limits

Dynatrace has API rate limits. When managing hundreds of configs, we hit them.

Fix: Add delays between API calls:

- name: Create alerting profile
  uri:
    # ...
  throttle: 1  # Only 1 concurrent request
  
- name: Pause between API calls
  pause:
    seconds: 1
  when: profile_results.changed

2. Idempotency with IDs

Dynatrace assigns IDs to configs. To make updates idempotent, we needed to track IDs.

Fix: Query existing configs first, build a lookup map, use PUT for updates.
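The pattern is easier to see outside of Jinja. Here's a minimal Python sketch of the same logic — the profile IDs (`prof-111`, `prof-222`) are made-up examples, and `plan` stands in for the method/URL decision the Ansible tasks make per item:

```python
# Sketch of the idempotency pattern: map existing config names to IDs,
# then choose PUT (update) or POST (create) per desired config.
existing = [
    {"id": "prof-111", "name": "Critical Services - P1"},
    {"id": "prof-222", "name": "Standard Services - P2"},
]
desired = ["Critical Services - P1", "Non-Critical - P3"]

# Equivalent of Ansible's items2dict(key_name='name', value_name='id')
existing_map = {p["name"]: p["id"] for p in existing}

# (HTTP method, existing ID or None) per desired config
plan = {
    name: ("PUT", existing_map[name]) if name in existing_map else ("POST", None)
    for name in desired
}
```

Configs present in Dynatrace but absent from `desired` are deletion candidates — which is exactly what the `delete_unmanaged` flag in the role gates.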

3. Environment-Specific Delays

What’s critical in prod isn’t critical in dev. We wasted time on non-prod alerts.

Fix: Environment-specific delay multipliers in the role defaults.
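The arithmetic is simple but worth spelling out. This sketch mirrors the multipliers and base delays from defaults/main.yml — a 5-minute SLOWDOWN delay becomes 10 minutes in staging and 25 in development:

```python
# Environment multipliers and base delays as in defaults/main.yml.
multipliers = {"production": 1, "staging": 2, "development": 5}
base_delays = {"SLOWDOWN": 5, "RESOURCE_CONTENTION": 10}

def effective_delay(severity: str, environment: str) -> int:
    """Delay in minutes after applying the environment's multiplier."""
    return base_delays[severity] * multipliers[environment]
```

This is the same calculation the alerting_profile.json.j2 template performs inline via `rule.delay_in_minutes * dynatrace_environments[dynatrace_environment].alert_delay_multiplier`.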

4. Secret Management

API tokens and webhook URLs are secrets.

Fix: Use Ansible Vault for sensitive variables:

ansible-vault encrypt vars/dynatrace/secrets.yml
ansible-playbook playbooks/dynatrace-alerting.yml --ask-vault-pass

5. Profile ID Lookups

Notifications need alerting profile IDs, but we define profiles by name.

Fix: Create a custom lookup plugin or query the API in a pre-task:

- name: Get alerting profile ID
  uri:
    url: "{{ dynatrace_config_api }}/alertingProfiles"
    method: GET
    headers:
      Authorization: "Api-Token {{ dynatrace_api_token }}"
  register: profiles

- name: Set profile ID facts
  set_fact:
    alerting_profile_ids: "{{ profiles.json['values'] | items2dict(key_name='name', value_name='id') }}"

6. Testing Changes

We broke alerting in production by deploying untested changes.

Fix: Deploy to staging first, wait 24 hours, then production. Add --check mode validation to CI.


Key Takeaways

  1. Treat alerting as code - Version control, review, test, deploy
  2. Environment-specific configs - Prod alerts ≠ Dev alerts
  3. Centralize notification channels - Avoid alert sprawl
  4. Use tags for scoping - Management zones are less flexible
  5. Automate maintenance windows - Don’t suppress alerts manually
  6. Test before production - --check mode and staging environments
  7. Document your alert strategy - Future you will thank present you

This approach transformed our alerting from a manual, inconsistent mess into a reliable, reviewable, version-controlled system. Changes go through PRs, get reviewed, and deploy consistently across environments.


Managing Dynatrace at scale? Questions about the Ansible integration? Find me on LinkedIn or GitHub.
