Incident Analysis
AWS East Outage (Oct 20, 2025): Explained for Busy Engineers
It started with a race.
Yes, it started with a race. A control-plane race condition in AWS’s DNS management flow produced an empty DNS record for the US-East region. That tiny hole ripped a seam through dependencies—DNS lookup errors, throttling, unstable health checks—until engineers disabled automatic health checks to stop the thrash. The whole thing reads like a car changing lanes and briefly vanishing from everyone’s mirrors. 🛞
Below, we translate the root cause and the blast radius into plain language, then give you a pragmatic checklist you can implement this week. The lesson isn’t “move clouds.” It’s “design for failure and standardize how you ship.”
A note from the field: Incidents like this punish teams with bespoke pipelines and snowflake deployments. The paved road—centralized, well-documented, and enforced—keeps most orgs upright when the ground moves.
What failed: the race condition, simplified
The short version: two control-plane components (call them the DNS Planner and the DNS Enactor) got out of sync. A newer config existed, but an older one was applied because of version-check and cleanup bugs. The result: an empty DNS record for the region.
Key timeline (simplified)
- New DNS config generated for US-East.
- Old config lingers due to version/GC bug.
- Enactor applies stale config → empty DNS record.
- Lookups fail; downstream systems retry; health checks flap.
- Engineers mitigate by disabling automatic health checks to stop thrashing.
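To make the race concrete, here is a toy model of the Planner/Enactor interaction described above. It is not AWS’s code; the class names and the shape of the “plan” are invented for illustration. The narrower point it shows: an enactor that applies whichever plan it happens to be holding, without comparing versions against what is already live, can let a stale, already-cleaned-up plan land last and leave an empty record set.

```python
import threading
import time

# Toy model of the simplified Planner/Enactor flow above. All names
# (RecordStore, enactor, the plan tuples) are invented for illustration.

class RecordStore:
    """Shared 'live' DNS state for one region."""

    def __init__(self):
        self.version = 0
        self.records = {"region.example.internal": ["10.0.0.1"]}
        self._lock = threading.Lock()

    def apply(self, version, records, check_version=True):
        """Apply a plan; optionally refuse plans older than what is live."""
        with self._lock:
            if check_version and version <= self.version:
                return False          # stale plan: ignore it
            self.version = version
            self.records = records
            return True


def enactor(store, plan, delay, check_version):
    time.sleep(delay)                 # simulates scheduling / cleanup lag
    store.apply(*plan, check_version=check_version)


def run(check_version):
    store = RecordStore()
    # v2 is the newer, correct plan; v1 is an older plan whose records were
    # already cleaned up, so applying it leaves the record set empty.
    new_plan = (2, {"region.example.internal": ["10.0.0.2"]})
    stale_plan = (1, {})
    fast = threading.Thread(target=enactor, args=(store, new_plan, 0.01, check_version))
    slow = threading.Thread(target=enactor, args=(store, stale_plan, 0.05, check_version))
    fast.start(); slow.start(); fast.join(); slow.join()
    return store.records


print("no version check:  ", run(check_version=False))  # {} -> the empty, stale plan wins
print("with version check:", run(check_version=True))   # newer records survive
```

The boring fix is visible in the sketch: make the apply step a compare-on-version operation, so an older plan can never overwrite a newer one.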
Why it hurt so much: cascading effects
DNS sits at the starting line of almost everything. When lookups become inconsistent, you don’t just get “fail.” You get oscillation: components switch between healthy and unhealthy, autoscalers overreact, and control planes keep trying to “help.”
Reality check
Large clouds are enormously reliable, but not perfect. Region-scoped control-plane bugs are rare—and unforgettable. Your job isn’t to predict the exact bug; it’s to make your platform boring when it happens.
What this means for your architecture
- Assume DNS can return “nothing.” Use caching resolvers with sensible TTLs and positive/negative caching.
- Design health checks to avoid flapping. Use slower ramps, hysteresis, and circuit breakers for dependencies (see the hysteresis sketch after this list).
- Avoid self-amplification. Rate-limit autoscalers; don’t let retries create traffic storms.
- Prefer multi-AZ with tested failover. Region-level isolation is ideal, but expensive; start with multi-AZ done well.
- Gate changes. Admission policies should reject regressions: unsigned images, missing health probes, mis-labeled services.
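As promised above, a minimal sketch of the hysteresis idea for health checks, with illustrative thresholds: a target only changes state after several consecutive results agree, so one slow probe or one transient DNS miss doesn’t yank it out of rotation.

```python
from dataclasses import dataclass

@dataclass
class HysteresisHealthCheck:
    """Flap-resistant health state: only flips after a streak of consistent results."""
    fail_threshold: int = 3      # consecutive failures required to mark DOWN
    pass_threshold: int = 5      # consecutive successes required to mark UP again
    healthy: bool = True
    _streak: int = 0

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return the (possibly updated) health state."""
        if probe_ok == self.healthy:
            self._streak = 0                  # result agrees with current state: reset streak
        else:
            self._streak += 1
            needed = self.pass_threshold if not self.healthy else self.fail_threshold
            if self._streak >= needed:
                self.healthy = probe_ok       # enough consistent evidence: flip state
                self._streak = 0
        return self.healthy


# Alternating pass/fail (flapping) never flips the state; sustained failures do.
hc = HysteresisHealthCheck()
print([hc.observe(r) for r in [True, False, True, False, True, False]])  # stays healthy
hc2 = HysteresisHealthCheck()
print([hc2.observe(r) for r in [False, False, False, False]])            # [True, True, False, False]
```

Real probes express the same idea with settings like Kubernetes’ failureThreshold and successThreshold; the sketch just shows why the flapping stops.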
The paved-road playbook (for CTOs & SREs)
When the platform team supplies a paved road—templates, policy, and a single way to ship—resilience improves even when the cloud shakes.
- Centralize DNS patterns. Provide a blessed resolver configuration and a sidecar/agent for service DNS with caching.
- Standardize health checks. Company-wide defaults: timeouts, success thresholds, failure thresholds, dependency probes.
- Ship a canary by default. Argo Rollouts/Flagger with metric gates. If teams do nothing, it’s safe.
- Use GitOps for environments. Diffs are truth; promotions are PRs; rollbacks are one click.
- Automate post-deploy checks. Synthetic probes, trace sampling bursts, and error budget accounting.
- Emit DORA metrics. Lead time, deploy frequency, change failure rate, and MTTR as first-class pipeline outputs (see the sketch after this list).
- Practice failure. Quarterly game days: kill DNS for a service, break health checks, flip regions for an hour.
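To show what “DORA metrics as first-class pipeline outputs” can look like, here is a minimal sketch: one record appended per deploy, and the four metrics derived from that log. The event shape and field names are assumptions, not a standard schema; in a real pipeline this would run as a post-deploy step and push results to your metrics store.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean

@dataclass
class Deploy:
    commit_at: datetime                   # when the change was committed
    deployed_at: datetime                 # when it reached production
    failed: bool = False                  # did this deploy cause an incident/rollback?
    restored_at: datetime | None = None   # when service was restored, if it failed


def dora(deploys: list[Deploy], window_days: int = 30) -> dict:
    """Derive the four DORA metrics from a simple per-deploy log."""
    lead_times = [(d.deployed_at - d.commit_at).total_seconds() / 3600 for d in deploys]
    failures = [d for d in deploys if d.failed]
    restores = [(d.restored_at - d.deployed_at).total_seconds() / 60
                for d in failures if d.restored_at]
    return {
        "deploy_frequency_per_day": len(deploys) / window_days,
        "lead_time_hours": mean(lead_times) if lead_times else 0.0,
        "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mttr_minutes": mean(restores) if restores else 0.0,
    }


now = datetime(2025, 10, 20, 12, 0)
log = [
    Deploy(now - timedelta(hours=30), now - timedelta(hours=26)),
    Deploy(now - timedelta(hours=10), now - timedelta(hours=6),
           failed=True, restored_at=now - timedelta(hours=5)),
]
print(dora(log))
```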
Checklist: reduce blast radius this week
- Introduce an internal caching resolver (CoreDNS/Unbound) with negative caching and sane TTLs (see the caching sketch after this list).
- Make health checks state-aware: back off after failures; require consecutive passes before re-entry.
- Set retry budgets for egress calls and cap concurrency so throttling doesn’t amplify into a storm.
- Turn on readiness gates so traffic never hits unready pods after a control-plane wobble.
- Run a DNS blackout drill against staging with synthetic traffic. Document RTO/RPO results.
- Adopt a release note format per deploy: who, what, change link, image digest, rollback steps.
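To make the negative-caching item concrete, here is a minimal in-process sketch. It is not a replacement for CoreDNS or Unbound; it only illustrates the behavior you want from them: remember failures briefly (so a burst of failing lookups doesn’t hammer the resolver) and successes a bit longer. The TTL values are arbitrary examples.

```python
import socket
import time

# Sketch of positive/negative caching in front of the system resolver.
POSITIVE_TTL = 30     # seconds to keep a successful answer
NEGATIVE_TTL = 5      # seconds to keep a failure ("the record is empty/missing")
_cache = {}           # host -> (expiry, addresses or None for a negative entry)


def resolve(host: str) -> list[str]:
    now = time.monotonic()
    hit = _cache.get(host)
    if hit and now < hit[0]:
        if hit[1] is None:                          # cached negative answer
            raise socket.gaierror(f"{host}: negatively cached")
        return hit[1]                               # cached positive answer
    try:
        infos = socket.getaddrinfo(host, None)
        addrs = sorted({info[4][0] for info in infos})
        _cache[host] = (now + POSITIVE_TTL, addrs)
        return addrs
    except socket.gaierror:
        _cache[host] = (now + NEGATIVE_TTL, None)   # remember the failure briefly
        raise


print(resolve("localhost"))   # resolves and caches the positive answer
```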
Frequently asked by leadership
“Do we need multi-region now?”
Maybe. Start with multi-AZ done properly; prove failover times with drills. If your RTO/RPO or regulatory profile demands region isolation, move there with clear ownership and costs. Multi-region without operational maturity just multiplies pain.
“Why didn’t autoscaling save us?”
Autoscaling recovers capacity, not logic errors. With DNS returning empties, extra pods only fail faster. Build tolerances and backoff into the client side.
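“Tolerances and backoff on the client side” can be as small as this sketch: capped, jittered exponential backoff with a hard attempt limit, so failures get slower and bounded instead of amplifying into a retry storm. The numbers are illustrative.

```python
import random
import time

def call_with_backoff(fn, max_attempts=4, base_delay=0.2, max_delay=5.0):
    """Retry a flaky call with full-jitter exponential backoff and a hard attempt cap."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise                                  # budget exhausted: surface the error
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(random.uniform(0, delay))       # full jitter de-synchronizes clients


# Example: after four failed attempts the error propagates instead of
# turning into an unbounded retry loop.
def flaky():
    raise OSError("simulated lookup failure")

try:
    call_with_backoff(flaky)
except OSError as e:
    print("gave up after retries:", e)
```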
“Should we decentralize pipelines so teams can react faster?”
Decentralization feels empowering, but during incidents it creates twenty kinds of chaos. A centralized, well-documented pipeline gives you consistent mitigations, policy enforcement, and one place to add learnings next time. That’s speed. 🙂
Post-mortem template (steal this)
Title: Short, neutral description
When: YYYY-MM-DD HH:MM TZ
Where: region / AZ / environment
Symptoms: what users saw (error rates, latency, impact window)
Root Cause: one-line cause + contributing factors
Timeline:
- T0: trigger
- T1: first alarm
- T2: mitigation X
- T3: resolved
Mitigation:
- Short-term: steps taken (feature flags, health checks, DNS caching)
- Long-term: architectural and pipeline changes
Prevention:
- Tests/drills to add
- Policy/rules to encode (admission, CI checks)
- Runbook updates & ownership
What good looks like after this outage
- Every service ships with standard DNS settings, health probes, and circuit breakers.
- Deployment flow produces DORA metrics + release notes on every run.
- Quarterly drills include DNS failure scenarios and flapping health checks.
- Teams build on a paved road, with documented exits for rare cases.