SLA Error Budget Burn Rate Calculator
Calculate SLA error budgets, track burn rate, and project exhaustion timeline for reliability management.
Worked Examples
Example 1: SaaS Platform Monthly Budget
Problem: SaaS with 99.95% SLO, 50M requests/month. Day 15: 15,000 failed requests. Assess burn rate and project exhaustion.
Solution:
Error Budget Calculation:
- SLO: 99.95% → Error budget: 0.05%
- Total budget: 50M × 0.0005 = 25,000 errors
- Used: 15,000 errors (60%)
- Remaining: 10,000 errors (40%)

Burn Rate Analysis:
- Ideal burn at day 15 of 30: 50%
- Actual burn: 60%
- Burn rate ratio: 60%/50% = 1.2x

Projection:
- Daily burn: 15,000/15 = 1,000 errors/day
- Days until exhaustion: 10,000/1,000 = 10 days
- Projected exhaustion: Day 25 (5 days early)

Reliability to date: 99.94% (15,000 errors in about 25M requests served, assuming even traffic), slightly below the 99.95% target. Measured against the full month's 50M requests the figure is 99.97%, so the monthly SLO is still achievable if the error rate drops.

Action: A burn rate of 1.2x is concerning but not critical. Monitor closely and implement quick reliability wins.
Result: 60% used | 1.2x burn rate | Exhausts day 25 | Status: CAUTION
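The arithmetic in Example 1 can be sketched as a small Python function. The function name `budget_status` and its parameters are illustrative assumptions, and the projection assumes the average error pace so far continues unchanged.

```python
def budget_status(slo, total_requests, failed, day, period_days=30):
    """Summarize error budget usage part-way through a fixed period."""
    budget = total_requests * (1 - slo)         # allowed failures for the period
    used = failed / budget                      # fraction of budget consumed
    burn_rate = used / (day / period_days)      # 1.0 means exactly on pace
    daily_burn = failed / day                   # average failures per day so far
    # Days until the remaining budget is gone at the current pace.
    days_left = (budget - failed) / daily_burn if daily_burn else float("inf")
    return {
        "budget": round(budget),
        "used_pct": round(used * 100, 1),
        "burn_rate": round(burn_rate, 2),
        "exhaustion_day": round(day + days_left, 1),
    }


status = budget_status(slo=0.9995, total_requests=50_000_000, failed=15_000, day=15)
print(status)
```

With the Example 1 inputs this reproduces the figures above: a 25,000-error budget, 60% used, a 1.2x burn rate, and exhaustion on day 25.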
Example 2: Post-Incident Budget Impact
Problem: API service with 99.9% SLO. Day 20: Major outage (2 hours, 100% error rate during incident). 10M requests/day average. Previous: 5,000 errors in 20 days.
Solution:
Pre-Incident State:
- Monthly budget: 300M × 0.001 = 300,000 errors
- Used before incident: 5,000 (1.7%)
- Burn rate: 1.7% used / 66.7% of period elapsed = 0.025x (excellent)

Incident Impact:
- 2 hours = 833,333 requests (10M/24 × 2)
- 100% error = 833,333 errors
- Single incident consumed 278% of total budget!

Post-Incident State:
- Total errors: 5,000 + 833,333 = 838,333
- Budget consumed: 279%
- SLO breached: 99.72% vs 99.9% target (even if no further errors occur this month)

Recovery Timeline:
- Rolling 30-day window: the budget stays overdrawn until the incident ages out, about 30 days, even with zero further errors
- Fixed period: the budget resets at the period boundary

Action Required:
- Declare SLO breach
- Implement error budget policy consequences
- RCA and prevention measures mandatory
Result: Single incident: 278% budget | SLO BREACHED | Mandatory reliability focus
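As a rough check on Example 2, the share of the budget consumed by one outage can be computed directly. This is a minimal sketch: `incident_impact` is a hypothetical helper, and it assumes traffic is spread evenly across the day, as the example does.

```python
def incident_impact(slo, requests_per_day, period_days, incident_hours, error_rate=1.0):
    """Return (failed requests, fraction of the period's budget) for one incident."""
    budget = requests_per_day * period_days * (1 - slo)
    # Requests hitting the service during the incident, assuming even traffic.
    failed = requests_per_day / 24 * incident_hours * error_rate
    return failed, failed / budget


failed, share = incident_impact(0.999, 10_000_000, 30, incident_hours=2)
print(f"{failed:,.0f} errors = {share:.0%} of the monthly budget")
```

Passing a lower `error_rate`, such as 0.25 for a partial outage, shows how quickly even a degraded two-hour window eats into a 99.9% budget.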
Example 3: Multi-Tier SLO Tracking
Problem: E-commerce: web (99.9%), API (99.95%), payments (99.99%). Track each tier's budget status mid-month.
Solution:
Tier Analysis (Day 15 of 30):

Web (99.9% SLO):
- Budget: 0.1% × 20M = 20,000 errors
- Used: 8,000 (40%)
- Burn rate: 0.8x (healthy)
- Status: GREEN

API (99.95% SLO):
- Budget: 0.05% × 100M = 50,000 errors
- Used: 35,000 (70%)
- Burn rate: 1.4x (concerning)
- Status: YELLOW

Payments (99.99% SLO):
- Budget: 0.01% × 5M = 500 errors
- Used: 450 (90%)
- Burn rate: 1.8x (critical)
- Status: RED

Prioritization:
1. Payments: Only 50 errors remaining; freeze changes
2. API: Investigate elevated error rate
3. Web: Continue normal operation

Payments requires immediate attention: the remaining 50 errors allow about 3.3 errors/day over the last 15 days, against a historical 30/day.
Result: Web: GREEN | API: YELLOW | Payments: RED (50 errors left)
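The tier comparison in Example 3 could be automated along these lines. The `Tier` class, the `classify` function, and especially the status thresholds (YELLOW above 1.0x, RED at 1.5x and up) are assumptions chosen to match the example's labels, not thresholds stated by the calculator.

```python
from dataclasses import dataclass


@dataclass
class Tier:
    name: str
    slo: float
    monthly_requests: int
    errors: int


def classify(tier, day, period_days=30, yellow=1.0, red=1.5):
    """Return (burn rate, status) for one tier; thresholds are illustrative."""
    budget = tier.monthly_requests * (1 - tier.slo)
    burn = (tier.errors / budget) / (day / period_days)
    status = "RED" if burn >= red else "YELLOW" if burn > yellow else "GREEN"
    return round(burn, 2), status


tiers = [
    Tier("Web", 0.999, 20_000_000, 8_000),
    Tier("API", 0.9995, 100_000_000, 35_000),
    Tier("Payments", 0.9999, 5_000_000, 450),
]
report = {t.name: classify(t, day=15) for t in tiers}
for name, (burn, status) in report.items():
    print(f"{name}: {burn}x {status}")
```

Stricter tiers have smaller budgets, so the same absolute number of errors moves them toward RED much faster, which is why Payments leads the prioritization.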
Frequently Asked Questions
How do I calculate error budget burn rate?
Burn rate = (Error budget consumed / Total error budget) / (Time elapsed / Period). A burn rate of 1.0 means you are exactly on pace to use the whole budget by the end of the period. Above 1.0 means you are trending toward exhaustion before period end. Below 1.0 means you have slack.
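One consequence of this definition is that, if the rate holds, the budget runs out at Period / Burn rate. A minimal sketch, with hypothetical function names:

```python
def burn_rate(consumed_fraction, elapsed, period):
    """Budget fraction consumed divided by the fraction of the period elapsed."""
    return consumed_fraction / (elapsed / period)


def projected_exhaustion(rate, period):
    """Time from period start at which the budget reaches zero, if the rate holds."""
    return float("inf") if rate <= 0 else period / rate


rate = burn_rate(consumed_fraction=0.60, elapsed=15, period=30)
exhaustion = projected_exhaustion(rate, period=30)
print(f"burn rate {rate:.2f}x, exhaustion around day {exhaustion:.0f}")
```

For 60% consumed at day 15 of 30, the rate is 1.2x and the projected exhaustion is day 25. Any rate at or below 1.0 gives an exhaustion time at or beyond the end of the period.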
What happens when error budget is exhausted?
When the error budget is exhausted, you have breached your SLO. Teams should: (1) freeze feature deployments, (2) prioritize reliability work, (3) investigate root causes, and (4) implement safeguards. Some organizations have formal error budget policies that require these actions.
What is the relationship between SLO and SLA?
SLO (Service Level Objective) is an internal target (e.g., 99.9% uptime). SLA (Service Level Agreement) is an external commitment with consequences (refunds, credits). SLOs should be stricter than SLAs; for example, target 99.95% internally when the SLA promises 99.9%. Error budgets derive from SLOs.
How long should an error budget period be?
Typically 30 days (monthly) or 90 days (quarterly). Shorter periods (weekly) create noise and stress. Longer periods delay feedback. Monthly aligns with most business cycles. Some use rolling windows instead of fixed periods to avoid edge effects.
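A rolling window can be sketched with a fixed-length queue of daily counts, so that old days drop out instead of the budget resetting at a calendar boundary. The `RollingBudget` class and its daily granularity are assumptions for illustration only.

```python
from collections import deque


class RollingBudget:
    """Track budget consumption over the most recent `window_days` days."""

    def __init__(self, slo, window_days=30):
        self.slo = slo
        # deque with maxlen discards the oldest day when a new one is appended.
        self.days = deque(maxlen=window_days)

    def record_day(self, requests, errors):
        self.days.append((requests, errors))

    def consumed(self):
        """Fraction of the window's error budget used (above 1.0 means breached)."""
        requests = sum(r for r, _ in self.days)
        errors = sum(e for _, e in self.days)
        budget = requests * (1 - self.slo)
        return errors / budget if budget else 0.0


tracker = RollingBudget(slo=0.999, window_days=2)
tracker.record_day(requests=1_000_000, errors=2_000)
tracker.record_day(requests=1_000_000, errors=0)
print(round(tracker.consumed(), 2))   # bad day still inside the window
tracker.record_day(requests=1_000_000, errors=0)
print(round(tracker.consumed(), 2))   # bad day has aged out
```

The two-day window is deliberately tiny so the aging-out behaviour is visible. With a 30-day window, a bad day keeps counting against the budget for a full 30 days.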
What is a fast-burn vs slow-burn alert?
Fast-burn alerts trigger when the consumption rate would exhaust the budget within hours or a few days (e.g., 2% of a 30-day budget burned in 1 hour, a burn rate of 14.4x). Slow-burn alerts trigger when consumption is trending toward exhaustion by period end. Fast-burn alerts page engineers; slow-burn alerts create tickets. Both are needed for proactive management.
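The two alert types might be evaluated as below. The 2%-in-1-hour fast-burn threshold comes from the answer above. The slow-burn threshold (10% of the budget in 3 days), the 30-day period, and the function names are assumptions for this sketch.

```python
def budget_burned(error_rate, slo, window_hours, period_hours=720):
    """Fraction of the period's budget consumed during the window."""
    rate = error_rate / (1 - slo)            # burn rate observed over the window
    return rate * window_hours / period_hours


def alert_level(error_rate_1h, error_rate_3d, slo):
    """Return 'page', 'ticket', or 'ok' from two observed error rates."""
    if budget_burned(error_rate_1h, slo, window_hours=1) >= 0.02:
        return "page"      # fast burn: 2% of budget in one hour
    if budget_burned(error_rate_3d, slo, window_hours=72) >= 0.10:
        return "ticket"    # slow burn: assumed threshold of 10% in three days
    return "ok"


print(alert_level(error_rate_1h=0.015, error_rate_3d=0.002, slo=0.999))
print(alert_level(error_rate_1h=0.0005, error_rate_3d=0.002, slo=0.999))
print(alert_level(error_rate_1h=0.0005, error_rate_3d=0.0005, slo=0.999))
```

The fast-burn check is evaluated first, so a severe spike pages even when the longer window also qualifies for a ticket.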
Should error budget include planned maintenance?
Philosophically, users experience downtime regardless of cause. Practically, many organizations exclude planned maintenance from error budget to enable necessary work. Document your policy clearly. Consider: if maintenance hurts users, maybe it should count.