Microservice Dependency Risk Mapper

Map microservice dependencies, calculate cascading failure risk, and analyze effective SLA. Enter values for instant results with step-by-step formulas.

December 2025

Worked Examples

Example 1: E-Commerce Checkout Dependency

Problem: The Checkout service depends on four services: Auth (99.9%), Inventory (99.5%), Payment (99.95%), and Shipping (99.8%). What is the effective checkout SLA, and is it acceptable?

Solution:
Dependency Chain:
1. Auth: 99.9% SLA
2. Inventory: 99.5% SLA
3. Payment: 99.95% SLA
4. Shipping: 99.8% SLA

Effective SLA (Compound):
0.999 × 0.995 × 0.9995 × 0.998 = 0.9915 = 99.15%

Analysis:
- Target checkout SLA: 99.5%
- Actual: 99.15%
- Gap: 0.35 percentage points ≈ 2.5 hours of downtime/month beyond target (0.0035 × ~720 hours)

Impact:
- During 2.5 hours of downtime at 100 orders/hour
- Lost orders: 250/month
- At $80 avg order: $20,000 lost revenue/month

Solutions:
1. Improve weakest dependency (Inventory 99.5% → 99.9%)
 - New effective: 0.999 × 0.999 × 0.9995 × 0.998 = 99.55%
2. Add circuit breakers:
 - If Shipping fails, still complete order (ship later)
 - Remove Shipping from critical path
 - New effective: 0.999 × 0.995 × 0.9995 = 99.35%
3. Cache Inventory:
 - Serve slightly stale inventory data
 - Eventual consistency acceptable
 - Remove Invento

Result: Current: 99.15% (below the 99.5% target) | With breakers + cache (only Auth and Payment on the critical path): 0.999 × 0.9995 ≈ 99.85% | Total downtime: ~6.1 h/month → ~1.1 h/month
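The compound-SLA arithmetic in this example can be reproduced with a short Python sketch. It assumes independent failures, a 720-hour (30-day) month, and that every listed dependency is required for each request; the function and variable names are illustrative, not part of the calculator.

```python
from math import prod

HOURS_PER_MONTH = 720  # assumption: 30-day month


def effective_sla(slas):
    """Compound availability of dependencies that are all required.

    Assumes failures are independent, so availabilities multiply.
    """
    return prod(slas)


def monthly_downtime_hours(sla, hours_per_month=HOURS_PER_MONTH):
    """Expected downtime per month for a given availability."""
    return (1.0 - sla) * hours_per_month


deps = {"Auth": 0.999, "Inventory": 0.995, "Payment": 0.9995, "Shipping": 0.998}

# Current state: all four dependencies sit on the critical path.
current = effective_sla(deps.values())

# Scenario: Shipping (circuit breaker) and Inventory (cache) are moved
# off the critical path, leaving only Auth and Payment.
off_path = {"Shipping", "Inventory"}
critical = {name: sla for name, sla in deps.items() if name not in off_path}
mitigated = effective_sla(critical.values())

for label, sla in [("current", current), ("mitigated", mitigated)]:
    hours = monthly_downtime_hours(sla)
    print(f"{label}: {sla:.4%} effective, ~{hours:.1f} h/month downtime")
```

Changing the contents of `off_path`, or editing a value in `deps`, gives a quick way to compare the other mitigation options on the same basis.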

Frequently Asked Questions

What is microservice dependency risk?

Dependency risk is the probability that a service fails because a service it depends on fails. If Service A needs Services B, C, and D for every request, each has 99% uptime, and their failures are independent, A's effective uptime is at most ~97% (0.99³ ≈ 0.9703). Each additional dependency compounds the failure probability. Managing this risk requires circuit breakers, fallbacks, and looser coupling.
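As an illustration of the circuit breaker and fallback idea, here is a minimal Python sketch. It is not a production implementation: the `CircuitBreaker` class, its thresholds, and the shipping functions are all hypothetical, and real systems typically rely on a maintained resilience library instead.

```python
import time


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        # While open, skip the dependency until the cooldown has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            # Cooldown over: fall through and allow one trial call (half-open).
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # A success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def flaky_shipping_quote():  # hypothetical dependency that is currently down
    raise ConnectionError("shipping service unavailable")


def ship_later():  # hypothetical fallback: accept the order anyway
    return {"status": "order accepted, shipping quote pending"}


breaker = CircuitBreaker(failure_threshold=2, reset_timeout=60.0)
results = [breaker.call(flaky_shipping_quote, ship_later) for _ in range(3)]
print(results[-1], "| circuit open:", breaker.opened_at is not None)
```

The first two calls reach the failing dependency; the third is short-circuited straight to the fallback, which is what keeps a failing dependency from dragging its caller down with it.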

Should I use retries for failed dependency calls?

Yes, with exponential backoff and jitter. An immediate retry often fails again because the underlying issue persists. With exponential backoff, wait 1s, then 2s, then 4s between retries. Jitter adds randomness to those waits so that many clients do not retry in lockstep (the thundering herd problem). Limit total retries (3-5 max) and pair them with circuit breakers: after multiple failures, stop retrying temporarily.
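The retry policy described above might be sketched in Python as follows. The function name, its parameters, and the choice of "full jitter" (sleeping a random fraction of the backoff) are assumptions made for illustration; other jitter strategies exist. The `sleep` and `rng` arguments are injectable only so the example can run without real waiting.

```python
import random
import time


def retry_with_backoff(func, max_retries=3, base_delay=1.0, max_delay=30.0,
                       sleep=time.sleep, rng=random.random):
    """Call func, retrying on any exception with exponential backoff plus jitter."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception:
            if attempt == max_retries:
                raise  # retry budget exhausted: surface the last error
            # Backoff ceiling grows 1s, 2s, 4s, ... and is capped at max_delay.
            backoff = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: wait a random duration in [0, backoff).
            sleep(backoff * rng())


# Demo with a hypothetical dependency that succeeds on the third attempt.
attempts = {"count": 0}


def flaky_inventory_lookup():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("inventory service timed out")
    return {"sku": "ABC-1", "in_stock": True}


waits = []  # record the waits instead of actually sleeping
result = retry_with_backoff(flaky_inventory_lookup, sleep=waits.append)
print(result, "after waits of", [round(w, 2) for w in waits])
```

In practice, retries like this are safest for idempotent calls, and wrapping the call in a circuit breaker keeps a long outage from turning into a steady stream of retries.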

How do I test microservice resilience?

Use chaos engineering: intentionally inject failures (kill services, add latency, corrupt responses) and verify that the system handles them gracefully. Tools include Chaos Monkey (Netflix), Gremlin, and Litmus. Test scenarios in which one dependency fails, several fail at once, and the network partitions. The system should degrade gracefully, not collapse catastrophically.
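At the smallest scale, the same idea can be tried in-process before reaching for a dedicated tool. The Python sketch below does not use the API of Chaos Monkey, Gremlin, or Litmus; the `with_faults` wrapper, the `checkout` function, and the shipping dependency are hypothetical stand-ins showing how an injected failure can be paired with a check for graceful degradation.

```python
import random


class InjectedFault(Exception):
    """Raised by the wrapper to simulate a dependency outage."""


def with_faults(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise InjectedFault."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def get_shipping_quote(order_id):  # hypothetical dependency
    return {"order_id": order_id, "eta_days": 3}


def checkout(order_id, shipping):
    """Hypothetical checkout that should degrade gracefully if shipping is down."""
    try:
        quote = shipping(order_id)
    except Exception:
        quote = None  # fallback: accept the order and quote shipping later
    return {"order_id": order_id, "accepted": True, "shipping": quote}


rng = random.Random(42)  # seeded so the experiment is repeatable
chaotic_shipping = with_faults(get_shipping_quote, failure_rate=0.5, rng=rng)
results = [checkout(order_id, chaotic_shipping) for order_id in range(100)]

accepted = sum(r["accepted"] for r in results)
degraded = sum(r["shipping"] is None for r in results)
print(f"accepted={accepted}/100, degraded (no shipping quote)={degraded}")
```

The property under test is that every order is still accepted while a share of them run in degraded mode; a checkout that raised instead would be the catastrophic collapse the experiment is meant to catch.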