
Error Budget

Overview

Direct Answer

An error budget is the maximum amount of unreliability, usually expressed as downtime, that a service may accumulate within a specified measurement window whilst remaining compliant with its Service Level Objective (SLO). It is the complement of the availability target: a service with a 99.9% SLO has an error budget of 0.1% of the window, which is about 43.2 minutes over 30 days.

How It Works

The budget is calculated by multiplying the acceptable unavailability (100% minus the SLO) by the total time in the measurement window. For example, a 99.95% SLO over 30 days (43,200 minutes) permits 0.05% × 43,200 = 21.6 minutes of downtime. Teams track actual downtime against this allocation, enabling informed decisions about when to deploy changes, perform maintenance, or accept operational risk.
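The calculation can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names `error_budget` and `budget_remaining` are hypothetical, and it assumes the SLO is time-based availability expressed as a percentage.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime permitted by an availability SLO over a window.

    Hypothetical helper: assumes a time-based availability SLO in percent.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    # Acceptable unavailability is the complement of the SLO.
    allowed_fraction = (100 - slo_percent) / 100
    return window * allowed_fraction


def budget_remaining(
    slo_percent: float, window: timedelta, downtime: timedelta
) -> timedelta:
    """Return the unused budget; a negative result means the SLO is breached."""
    return error_budget(slo_percent, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)
    budget = error_budget(99.95, window)
    left = budget_remaining(99.95, window, timedelta(minutes=10))
    print(f"budget: {budget.total_seconds() / 60:.1f} min")
    print(f"remaining: {left.total_seconds() / 60:.1f} min")
```

Using `timedelta` keeps the units explicit, so the same function works for a 7-day or 90-day window without conversion factors.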

Why It Matters

Error budgets align incentives between development velocity and reliability. They discourage excessive risk aversion whilst establishing a clear trade-off: teams can deploy more frequently while budget remains, but must prioritise stability once it is exhausted. This framework reduces subjective disputes about acceptable outage frequency and ties reliability decisions to revenue protection and customer retention.
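The trade-off described above can be expressed as a simple release gate. The sketch below is illustrative only: the function name `release_decision`, the intermediate "caution" state, and the 25% threshold are assumptions introduced for the example, not part of any standard policy.

```python
def release_decision(
    budget_minutes: float,
    downtime_minutes: float,
    caution_threshold: float = 0.25,  # assumed value, tune per team
) -> str:
    """Map error budget consumption to a release posture.

    Hypothetical policy: "deploy" while budget remains, "caution" when
    less than `caution_threshold` of the budget is left, and "freeze"
    once the budget is exhausted.
    """
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    remaining_fraction = (budget_minutes - downtime_minutes) / budget_minutes
    if remaining_fraction <= 0:
        return "freeze"
    if remaining_fraction < caution_threshold:
        return "caution"
    return "deploy"


if __name__ == "__main__":
    # A 21.6-minute budget (99.95% SLO over 30 days) at three consumption levels.
    for used in (5.0, 18.0, 21.6):
        print(used, release_decision(21.6, used))
```

Making the thresholds explicit in code or configuration is one way to turn the policy into something teams agree on in advance rather than debate during an incident.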

Common Applications

Cloud infrastructure providers use error budgets to manage scheduled maintenance windows. E-commerce platforms allocate budget consumption across feature releases, infrastructure upgrades, and incident recovery. Financial services organisations establish stricter budgets for payment processing systems whilst allowing higher error margins for non-critical services.

Key Considerations

A single error budget treats every minute of downtime as equally costly, although customer-facing and backend failures often warrant different treatment. Organisations must also set SLOs that are realistic for their infrastructure: targets that are too strict exhaust the budget immediately, while targets that are too loose become irrelevant to actual user experience.
