Postmortem • Edge & CDN Resilience

Cloudflare Outage — November 18, 2025: What Broke, Why It Cascaded, and How to Lead Through It

By DevOps City Editorial
Figure: config growth → propagation lag → logging storm, a self-reinforcing failure loop.

What happened

On November 18, 2025, Cloudflare experienced a global disruption affecting websites, APIs, and CDN traffic across its edge network. Initial symptoms resembled a coordinated DDoS. The underlying issue was a configuration amplification loop tied to ClickHouse schema changes and Bot Management configuration propagation through the core proxy layer.

Simplified timeline

Time (UTC) | Event
00:15 | DBA hardening begins: fine-grained access control in ClickHouse (explicit grants by shard).
00:25 | Config build picks up a larger-than-expected system-table result (~200+ entries vs. a ~60 baseline).
00:40 | Core proxies fail Bot Management checks on boot with partial configs; errors logged per request.
01:10 | Logging amplification degrades CDN latency; autoscaling brings in more proxies → they fail too.
01:30 | SEV-1 declared. Working hypothesis: coordinated DDoS.
02:00–04:30 | Proxies disabled in waves; propagation path analyzed; config bloat isolated.
06:15 | DB changes reverted; rebuilt config propagated; recovery verified.

The technical anatomy

Where it started: ClickHouse × Bot Management

The access-control hardening in ClickHouse (explicit grants by shard) changed what the Bot Management config build saw when it queried system tables: a result set that normally held around 60 entries ballooned to 200+, and the generated configuration file grew accordingly. Nothing in the build pipeline rejected the oversized output, so it propagated to the edge as usual.
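
A guard at build time can catch this class of regression before it ships. The sketch below is purely illustrative: it assumes a hypothetical build step that receives (name, version, payload) rows from the system-table query; the row shape, names, and thresholds are invented, not Cloudflare's actual pipeline. It collapses per-shard duplicates and refuses to publish a config that grows far past its baseline.

    # Hypothetical guard for a Bot Management-style feature-config build.
    # `rows` stands in for a system-table query result; with per-shard grants,
    # the same feature can show up once per shard.

    BASELINE_FEATURES = 60       # historical entry count (illustrative)
    MAX_GROWTH_FACTOR = 1.5      # refuse to publish past +50% growth

    class ConfigSizeError(RuntimeError):
        pass

    def build_feature_config(rows):
        # Deduplicate by feature name so per-shard duplicates collapse to one entry.
        features = {}
        for name, version, payload in rows:
            current = features.get(name)
            if current is None or version > current["version"]:
                features[name] = {"version": version, "payload": payload}

        # Fail fast here, at build time, instead of letting an oversized file
        # propagate to every proxy and fail there.
        if len(features) > BASELINE_FEATURES * MAX_GROWTH_FACTOR:
            raise ConfigSizeError(
                f"{len(features)} feature entries vs. a baseline of ~{BASELINE_FEATURES}; "
                "refusing to publish"
            )
        return features

    # Duplicated rows from two shards collapse to a single feature entry.
    rows = [
        ("js_fingerprint", 3, {"weight": 0.4}),  # shard 0
        ("js_fingerprint", 3, {"weight": 0.4}),  # shard 1 (duplicate after the grant change)
    ]
    assert len(build_feature_config(rows)) == 1
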
Propagation meets failure

  1. Bigger config files → slower propagation.
  2. Proxies starting with a partial config fail the bot/human classification step (see the boot-check sketch below).
  3. Each failure emits logs → logging storm (disks/telemetry pipelines overloaded).
  4. Autoscaling adds more proxies → they also fail → feedback loop.
Figure: the chain reaction (DB change → config growth → propagation lag → logging storm → autoscale thrash).
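
One way to break the loop at step 2 is to validate the configuration before a proxy reports healthy, so a partial or oversized file keeps the instance out of rotation instead of failing on every request. A minimal sketch under invented assumptions: the config ships as bot_config.json with a sidecar manifest carrying the expected size and SHA-256; none of the file names, fields, or checks below come from Cloudflare's actual proxy.

    import hashlib
    import json
    import sys

    # Hypothetical boot-time check: a proxy refuses to report healthy on a partial
    # or malformed config, so autoscaling cannot keep adding instances that would
    # fail the bot/human classification step on every request.

    def validate_config(config_path, manifest_path):
        manifest = json.load(open(manifest_path))   # expected size + checksum (assumed sidecar file)
        data = open(config_path, "rb").read()

        if len(data) != manifest["size_bytes"]:
            return False, "size mismatch (propagation may still be in flight)"
        if hashlib.sha256(data).hexdigest() != manifest["sha256"]:
            return False, "checksum mismatch (partial or corrupted file)"

        config = json.loads(data)
        if not config.get("bot_features"):
            return False, "bot feature block missing or empty"
        return True, "ok"

    if __name__ == "__main__":
        ok, reason = validate_config("bot_config.json", "bot_config.manifest.json")
        if not ok:
            # One log line at boot and a non-zero exit: the orchestrator holds the
            # instance out of rotation instead of serving failing traffic.
            print(f"config validation failed: {reason}", file=sys.stderr)
            sys.exit(1)

The design choice is to fail closed at boot, where one instance is affected, rather than per request, where the failure multiplies into the logging storm described above.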

The investigation spiral

From the vantage point of SRE dashboards, this looked like a DDoS: request spikes, failing health checks, and increasing POP latency. The confusion deepened when the external status page also went down due to an unrelated issue, reinforcing the DDoS hypothesis. Only after taking proxies offline and tracing the propagation path did engineers pinpoint configuration bloat as the trigger.

Key lesson: the system was effectively attacking itself through log amplification and propagation delays. Treat configuration size and replication time as first-class SLOs.

Recovery & mitigation

Figure: mitigation flow (Isolate proxies → Analyze propagation → Revert schema → Recover with clean configs).

Executive action items

  1. Configuration as load: add SLOs for config file size, diff %, and replication latency. Fail fast if thresholds are exceeded (see the publish-gate sketch after this list).
  2. Cross-team change reviews: database schema/access changes require blast-radius analysis with platform owners.
  3. Log amplification controls: rate-limit error logs; auto-switch to summary mode under burst conditions.
  4. Out-of-band status/telemetry: keep status pages and observability separated from production paths.
  5. Game days: rehearse self-induced DDoS patterns (config storms, log floods, propagation stalls).
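
For item 1, the check can live in the publish pipeline rather than on the proxies. A minimal publish-gate sketch; the thresholds, field names, and ConfigRelease shape are illustrative, not a real Cloudflare interface.

    from dataclasses import dataclass

    # Illustrative "configuration as load" SLOs (action item 1); tune per system.
    MAX_SIZE_DELTA_PCT = 25.0        # block if the new config grows >25% vs. last good
    MAX_REPLICATION_SECONDS = 120.0  # block if the previous rollout took >2 min to converge

    @dataclass
    class ConfigRelease:
        size_bytes: int
        previous_size_bytes: int
        last_replication_seconds: float

    def publish_gate(release):
        """Return a list of SLO violations; an empty list means the release may ship."""
        violations = []
        delta_pct = 100.0 * (release.size_bytes - release.previous_size_bytes) / release.previous_size_bytes
        if delta_pct > MAX_SIZE_DELTA_PCT:
            violations.append(f"config grew {delta_pct:.1f}% (limit {MAX_SIZE_DELTA_PCT}%)")
        if release.last_replication_seconds > MAX_REPLICATION_SECONDS:
            violations.append(
                f"last replication took {release.last_replication_seconds:.0f}s "
                f"(limit {MAX_REPLICATION_SECONDS:.0f}s); fix propagation before growing the file"
            )
        return violations

    # Example: a jump from 2 MB to 6 MB is blocked before it ever propagates.
    print(publish_gate(ConfigRelease(6_000_000, 2_000_000, 45.0)))
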

Metrics to watch

Category | Metric | Why it matters
Config propagation | File size Δ%, replication time | Early signal for propagation stalls
Edge stability | Proxy boot-success rate | Detects partial config rollout
Logging volume | Errors/sec per POP | Prevents amplification storms
Telemetry health | Queue backlog | Signals observability degradation
Change management | Schema diff alerts | Flags unexpected system-table growth
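
The "Errors/sec per POP" row doubles as an enforcement point, not just a dashboard: under burst, switch from per-request error lines to periodic summaries (action item 3). A token-bucket sketch with invented limits:

    import time

    # Illustrative log-amplification control: allow a bounded rate of per-request
    # error lines, then fall back to one summary line per interval.

    class ErrorLogLimiter:
        def __init__(self, max_per_second=50, summary_interval=10.0):
            self.max_per_second = max_per_second
            self.summary_interval = summary_interval
            self.tokens = float(max_per_second)
            self.last_refill = time.monotonic()
            self.suppressed = 0
            self.last_summary = time.monotonic()

        def log_error(self, message, emit=print):
            now = time.monotonic()
            # Refill the token bucket in proportion to elapsed time.
            self.tokens = min(self.max_per_second,
                              self.tokens + (now - self.last_refill) * self.max_per_second)
            self.last_refill = now

            if self.tokens >= 1.0:
                self.tokens -= 1.0
                emit(message)
            else:
                self.suppressed += 1   # summary mode: count instead of logging

            # Periodically emit one line describing what was dropped.
            if self.suppressed and now - self.last_summary >= self.summary_interval:
                emit(f"suppressed {self.suppressed} error lines in the last "
                     f"{self.summary_interval:.0f}s (error burst in progress)")
                self.suppressed = 0
                self.last_summary = now
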

Want a resilience review or a “config-storm” game day designed for your stack?