DORA metrics have become the standard for measuring DevOps performance. Every platform engineering talk mentions them. Every engineering leader wants them.
But most implementations fail. Teams collect the numbers without understanding what they mean. Dashboards get built but never improve anything. Metrics become targets that get gamed.
This guide covers how to implement DORA metrics in a way that actually drives improvement.
The Four Metrics
DORA research identified four key metrics that predict software delivery performance:
Deployment Frequency - How often you deploy to production. Elite performers deploy on-demand, multiple times per day.
Lead Time for Changes - Time from code commit to code running in production. Elite performers do this in less than an hour.
Change Failure Rate - Percentage of deployments that cause failures requiring remediation. Elite performers stay under 15%.
Time to Restore Service - How long it takes to recover from failures. Elite performers restore in under an hour.
These metrics are correlated. Teams that deploy frequently also have lower failure rates. Fast lead times correlate with faster recovery. This isn’t coincidence - the same practices that enable speed also enable quality.
Why These Metrics Matter
Traditional metrics measure activity: lines of code, story points, velocity. These metrics are easy to game and don’t correlate with business outcomes.
DORA metrics measure capability: can you deliver changes safely and quickly? This directly connects to business value.
A team that deploys daily with low failure rates can:
- Respond to customer feedback quickly
- Fix bugs before they compound
- Ship features while they’re still relevant
- Recover from incidents without panic
A team that deploys monthly with high failure rates cannot do these things, no matter how many story points they complete.
Measuring Deployment Frequency
Deployment frequency sounds simple but has nuance.
What counts as a deployment? I define it as any change that reaches production, including:
- Feature releases
- Bug fixes
- Configuration changes
- Infrastructure updates
Where to measure? Pull from your deployment tool. If you’re using ArgoCD, query ArgoCD. If you’re using GitHub Actions, query GitHub. Don’t make humans log deployments manually.
A simple query for GitHub Actions:
# counts successful runs of the "Deploy" workflow (first page of results only;
# add --paginate to cover older runs)
gh api repos/org/repo/actions/runs \
  --jq '[.workflow_runs[] | select(.conclusion == "success" and .name == "Deploy")] | length'
For ArgoCD:
argocd app history my-app --output json | jq 'length'
Aggregation. Calculate daily, weekly, and monthly frequencies. The trend matters more than any single number.
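A minimal sketch of the aggregation step, assuming you have already pulled a list of deployment timestamps from your deployment tool (the sample data is hypothetical, stdlib only):

```python
from collections import Counter
from datetime import datetime

def weekly_frequency(deploy_times):
    """Count deployments per ISO week from a list of datetimes."""
    weeks = Counter(t.strftime("%G-W%V") for t in deploy_times)
    return dict(sorted(weeks.items()))

# Hypothetical sample: three deploys across two weeks
deploys = [datetime(2024, 1, 1), datetime(2024, 1, 2), datetime(2024, 1, 10)]
print(weekly_frequency(deploys))  # {'2024-W01': 2, '2024-W02': 1}
```

The same Counter pattern works for daily (`%Y-%m-%d`) and monthly (`%Y-%m`) buckets, which gives you the trend view without any extra infrastructure.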
Per team vs system-wide. Track both. Some teams might deploy frequently while others are blocked. You need visibility into both.
Measuring Lead Time
Lead time is the most technically challenging metric to collect.
Definition. Time from first commit in a change to that change running in production.
The tricky part. A deployment might include multiple commits, PRs, and merges. You need to trace back to the first commit.
If you’re using conventional commits or PR-based workflows, you can trace from deployment to PR to commits.
For GitHub-based workflows:
def calculate_lead_time(deployment_sha, deployment_time, repo):
    # Find the PR whose merge produced the deployed commit
    prs = get_prs_for_commit(repo, deployment_sha)
    # Take the earliest commit on that PR
    first_commit = get_first_commit_in_pr(repo, prs[0])
    # Lead time: first commit to running in production
    return deployment_time - first_commit.timestamp
Simplification. If full tracing is hard, approximate. Measure from PR open time to deployment time. It’s not exact but captures most of the delay.
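A sketch of the approximation, assuming you have PR records with open and deploy timestamps (the field names and sample data are hypothetical):

```python
from datetime import datetime, timedelta
from statistics import median

def approximate_lead_times(prs):
    """Lead time ~= deploy time minus PR open time, for deployed PRs only."""
    return [pr["deployed_at"] - pr["opened_at"] for pr in prs if pr.get("deployed_at")]

# Hypothetical PR records
prs = [
    {"opened_at": datetime(2024, 3, 1, 9), "deployed_at": datetime(2024, 3, 1, 17)},
    {"opened_at": datetime(2024, 3, 2, 9), "deployed_at": datetime(2024, 3, 4, 9)},
    {"opened_at": datetime(2024, 3, 5, 9), "deployed_at": None},  # not yet deployed
]
times = approximate_lead_times(prs)
print(median(times))  # median of 8h and 48h -> 1 day, 4:00:00
```

Report the median rather than the mean, since a handful of long-lived PRs will otherwise dominate the number.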
Exclude outliers thoughtfully. A PR that sat for three months before being deployed shouldn’t be excluded just because it’s inconvenient. That’s signal, not noise.
Measuring Change Failure Rate
Change failure rate requires defining what a failure is.
What counts as a failure?
- Rollbacks
- Hotfixes deployed within X hours
- Incidents triggered by deployments
- Feature flags immediately disabled
What doesn’t count?
- Bugs found in staging
- Bugs found days later (hard to attribute)
- Performance regressions that don’t trigger incidents
The key is consistency. Pick a definition and stick with it.
Data sources. Cross-reference deployments with:
- Rollback events
- Incident management systems (PagerDuty, Opsgenie)
- Hotfix deployments (often tagged differently)
from datetime import timedelta

def calculate_cfr(deployments, incidents, window_hours=24):
    window = timedelta(hours=window_hours)
    failures = 0
    for deployment in deployments:
        # Count the deployment as failed if any incident fired within the window
        related_incidents = [i for i in incidents
                             if deployment.time < i.trigger_time <= deployment.time + window]
        if related_incidents:
            failures += 1
    return failures / len(deployments) if deployments else 0.0
Measuring Time to Restore
Time to restore measures incident recovery capability.
Definition. Time from incident start to incident resolution.
Data source. Your incident management system. PagerDuty, Opsgenie, and most tools provide API access to incident timelines.
Considerations:
- Use time to mitigate, not time to root cause
- Exclude incidents that weren’t actually service-impacting
- Track by severity - P1 recovery time matters more than P4
from statistics import median

def calculate_mttr(incidents):
    restore_times = []
    for incident in incidents:
        if incident.severity <= 2:  # P1 and P2 only
            restore_times.append(incident.resolved_at - incident.triggered_at)
    return median(restore_times)  # median is more robust to outliers than the mean
Building the Dashboard
Once you’re collecting data, visualise it usefully.
Show trends, not just current values. A 5% failure rate means nothing without context. Is it improving or degrading?
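One simple way to surface the trend, assuming a weekly series of values for a lower-is-better metric like change failure rate (the data is hypothetical):

```python
def trend(series, window=4):
    """Compare the mean of the last `window` points to the previous window.

    Assumes lower is better (CFR, lead time, time to restore); flip the
    comparison for deployment frequency, where higher is better.
    """
    if len(series) < 2 * window:
        return "insufficient data"
    recent = sum(series[-window:]) / window
    previous = sum(series[-2 * window:-window]) / window
    if recent == previous:
        return "flat"
    return "improving" if recent < previous else "degrading"

# Weekly change failure rates (hypothetical)
cfr_by_week = [0.22, 0.20, 0.21, 0.19, 0.15, 0.14, 0.12, 0.11]
print(trend(cfr_by_week))  # improving
```

Comparing windowed means smooths out week-to-week noise, which a single current value can't do.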
Compare to benchmarks. The DORA State of DevOps report publishes benchmarks:
- Elite: Deploy on-demand, <1 hour lead time, <15% CFR, <1 hour restore
- High: Weekly to monthly deploys, <1 week lead time, 16-30% CFR, <1 day restore
- Medium: Monthly to once every 6 months, 1-6 months lead time, 31-45% CFR, <1 week restore
- Low: Less often than every 6 months, >6 months lead time, >45% CFR, >6 months restore
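The change failure rate bands above translate directly into a small classifier for the dashboard; a sketch:

```python
def cfr_tier(cfr):
    """Map a change failure rate (0.0-1.0) onto the DORA bands above."""
    if cfr <= 0.15:
        return "Elite"
    if cfr <= 0.30:
        return "High"
    if cfr <= 0.45:
        return "Medium"
    return "Low"

print(cfr_tier(0.05))  # Elite
print(cfr_tier(0.35))  # Medium
```

The same shape works for the other three metrics; just swap in the thresholds from the benchmark table.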
Show by team. Aggregated metrics hide team-level problems. Let teams see their own performance.
Avoid vanity displays. A giant number showing “500 deployments this month” looks impressive but doesn’t help improvement. Show metrics that prompt action.
Common Implementation Mistakes
Measuring too precisely. Don’t spend months building perfect measurement. Start with approximations and refine. Some data is better than no data.
Ignoring context. Raw numbers without context mislead. A team that deploys 10x daily but only has 2 services isn’t necessarily high-performing.
Making metrics targets. The moment you tie DORA metrics to performance reviews, people game them. Deploy empty commits to boost frequency. Classify incidents as non-failures.
Forgetting the why. DORA metrics are a means, not an end. The goal is delivering value to customers, not optimising numbers.
Not acting on insights. Dashboards are useless without action. If lead time is high, do something about it. Otherwise, don’t bother measuring.
Using Metrics to Drive Improvement
Metrics should prompt questions, not answers.
If deployment frequency is low: What’s blocking more frequent deploys? Manual testing? Change approval processes? Fear of breaking things?
If lead time is high: Where does time go? Waiting for code review? Waiting for CI? Waiting for deployment windows?
If change failure rate is high: Are we testing effectively? Are we deploying too much at once? Is production observability lacking?
If time to restore is high: Do we have runbooks? Can we roll back quickly? Do we know how to diagnose issues?
Each question leads to specific improvements. Metrics don’t tell you what to do - they tell you where to look.
Tooling Options
Several tools can help collect DORA metrics:
Sleuth - Purpose-built for DORA. Integrates with common tools, provides dashboards out of the box.
LinearB - Broader engineering metrics including DORA. Good if you want more than just deployment metrics.
Faros - Open source option. More setup required but full control.
Custom. If you have platform engineers, building custom collection isn’t hard. Prometheus + Grafana with some Python scripts can work.
My recommendation: start custom, move to tooling if you need polish. Understanding how the data flows helps you trust it.
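A minimal sketch of the custom route: render counters in Prometheus' text exposition format, which a node_exporter textfile collector or a scrape endpoint can pick up and Grafana can chart. The metric name and team labels are illustrative, not a standard:

```python
def render_prom_metrics(deploy_counts):
    """Emit per-team deployment counts in Prometheus text exposition format."""
    lines = [
        "# HELP deployments_total Deployments to production",
        "# TYPE deployments_total counter",
    ]
    for team, count in sorted(deploy_counts.items()):
        lines.append(f'deployments_total{{team="{team}"}} {count}')
    return "\n".join(lines) + "\n"

# Hypothetical counts pulled from your CI or deployment tool
print(render_prom_metrics({"payments": 42, "search": 17}))
```

A cron job that queries your deployment tool and writes this output to a `.prom` file is often enough to get the first dashboard up.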
Starting Point
If you’re starting from zero:
Week 1: Instrument deployment frequency. Just count deploys per day. Put it on a visible dashboard.
Week 2: Add lead time tracking. Start with PR-open-to-deploy time.
Week 3: Add change failure rate. Cross-reference deploys with incidents.
Week 4: Add time to restore. Pull from your incident management tool.
Ongoing: Review metrics weekly with the team. Ask what they tell you. Make one improvement based on what you learn.
DORA metrics aren’t magic. They’re a starting point for continuous improvement. Implement them, use them to ask questions, and act on what you learn.