Overview
Direct Answer
An error budget is the quantified maximum downtime a service may experience within a specified period whilst remaining compliant with its Service Level Objective (SLO). It is the complement of the availability target: a service with a 99.9% SLO has an error budget of 0.1% of the measurement window.
How It Works
The budget is calculated by multiplying the acceptable unavailability percentage by the total time in the measurement window. For example, a 99.95% SLO over 30 days (43,200 minutes) permits 0.05% of that window, or 21.6 minutes of downtime. Teams track actual downtime against this allocation, enabling informed decisions about when to deploy changes, perform maintenance, or accept operational risk.
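The calculation above can be sketched in a few lines of Python. This is a minimal illustration rather than a standard implementation: the function names `error_budget` and `budget_remaining` are hypothetical, and the sketch assumes a purely time-based availability SLO.

```python
from datetime import timedelta


def error_budget(slo_percent: float, window: timedelta) -> timedelta:
    """Return the downtime permitted by an availability SLO over a window.

    Hypothetical helper: assumes a time-based SLO given as a percentage.
    """
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be greater than 0 and at most 100")
    # The budget is the complement of the availability target.
    unavailability = (100 - slo_percent) / 100
    return window * unavailability


def budget_remaining(slo_percent: float, window: timedelta,
                     downtime: timedelta) -> timedelta:
    """Return the unspent budget; a negative value means the SLO is breached."""
    return error_budget(slo_percent, window) - downtime


if __name__ == "__main__":
    window = timedelta(days=30)
    budget = error_budget(99.95, window)
    print(f"Budget: {budget.total_seconds() / 60:.1f} minutes")
    left = budget_remaining(99.95, window, timedelta(minutes=5))
    print(f"Remaining after 5 minutes of downtime: {left.total_seconds() / 60:.1f} minutes")
```

Using `timedelta` keeps the units explicit, so the same function works for a 7-day, 30-day, or quarterly window without manual conversion.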
Why It Matters
Error budgets align incentives between development velocity and reliability. They discourage excessive risk-aversion whilst establishing a clear trade-off: teams can deploy more frequently while budget remains, but must prioritise stability once it is exhausted. This framework reduces subjective disputes about acceptable outage frequency and ties release decisions to revenue protection and customer retention.
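The trade-off described above (deploy while budget remains, prioritise stability once it is exhausted) can be expressed as a simple check. The sketch below is illustrative only: `BudgetStatus` and `assess_budget` are hypothetical names, and real policies often add intermediate thresholds that this entry does not describe.

```python
from dataclasses import dataclass


@dataclass
class BudgetStatus:
    remaining_minutes: float   # negative once the budget is overspent
    consumed_fraction: float   # 1.0 means the budget is fully used
    may_deploy: bool           # True only while some budget remains


def assess_budget(budget_minutes: float, downtime_minutes: float) -> BudgetStatus:
    """Compare observed downtime with the budget (hypothetical helper)."""
    if budget_minutes <= 0:
        raise ValueError("budget_minutes must be positive")
    remaining = budget_minutes - downtime_minutes
    return BudgetStatus(
        remaining_minutes=remaining,
        consumed_fraction=downtime_minutes / budget_minutes,
        may_deploy=remaining > 0,
    )


if __name__ == "__main__":
    # 21.6 minutes is the 30-day budget for a 99.95% SLO.
    for downtime in (5.0, 21.6, 30.0):
        status = assess_budget(21.6, downtime)
        action = "continue releases" if status.may_deploy else "prioritise stability"
        print(f"{downtime:>5.1f} min of downtime: {action}")
```

Returning the consumed fraction alongside the yes/no decision lets a team see how close it is to the limit rather than only whether the limit has been crossed.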
Common Applications
Cloud infrastructure providers use error budgets to manage scheduled maintenance windows. E-commerce platforms allocate budget consumption across feature releases, infrastructure upgrades, and incident recovery. Financial services organisations establish stricter budgets for payment processing systems whilst allowing higher error margins for non-critical services.
Key Considerations
A single error budget treats every outage as having the same business impact, though customer-facing and backend failures warrant different treatment. Organisations must set SLOs that are realistic for their infrastructure capability, avoiding targets so strict that the budget is exhausted immediately or so loose that they become irrelevant to actual user experience.
More in DevOps & Infrastructure
Service Level Indicator
CI/CD: A quantitative measure of some aspect of the level of service being provided.
Incident Management
Site Reliability: The processes and tools for detecting, responding to, resolving, and learning from service disruptions.
Chaos Engineering
Site Reliability: The discipline of experimenting on distributed systems to build confidence in their ability to withstand turbulent conditions.
Immutable Infrastructure
Infrastructure as Code: An approach where infrastructure components are never modified after deployment but replaced entirely with updated versions.
Vertical Scaling
CI/CD: Increasing the resources (CPU, RAM, storage) of an existing machine to handle more load.
Elasticity
CI/CD: The ability of a system to automatically scale resources up or down based on current demand.
Graceful Degradation
CI/CD: A design approach where a system continues to operate with reduced functionality when components fail.
Horizontal Scaling
CI/CD: Adding more machines or nodes to a system to handle increased load.