0:07
Hey, I'm Brian from Tellers Tech, and this is
0:10
the first episode of Ship It Weekly, a quick
0:14
rundown of what this show is supposed to be.
0:16
Most weeks, this is going to be a short practical
0:19
recap of what happened in the DevOps, SRE, and
0:23
platform engineering world. Stuff like big cloud
0:25
changes, notable incident write-ups, useful
0:28
new tools, and the occasional culture or burnout
0:32
topic. Think a couple of main stories, a quick
0:35
tools and releases segment, and maybe one human
0:37
thing at the end. This first episode is going
0:40
to be a little different. The last stretch has
0:43
had a few really big outages from providers we
0:47
all depend on. Cloudflare had a global issue
0:51
that broke a bunch of sites. AWS US East 1
0:54
had a long regional incident and GitHub had a
0:58
major hiccup where core Git operations stopped
1:02
working. So instead of a grab bag of topics,
1:05
this one is more of a focused special on those
1:09
outages and what they say about how we design
1:12
our own systems. Going forward, the default will
1:16
be a more typical news rundown, but I'll probably
1:19
come back to this kind of themed episode whenever
1:22
something big like this happens again. All the
1:25
links and source material will be in the show
1:27
notes if you want the full timelines and technical
1:30
details. Alright, let's start with Cloudflare.
1:34
On November 18th, Cloudflare had a bad morning
1:38
and took a decent chunk of the internet with
1:40
it. If you were online, you probably saw it in
1:44
one form or another. Lots of sites and apps that
1:47
sit behind Cloudflare started returning Cloudflare
1:51
error pages instead of real responses. Big names,
1:55
smaller sites, random government and financial
1:57
pages, all showing variations of "we can't reach
2:00
the origin" or "something went wrong." Cloudflare
2:04
has since shared enough detail to understand
2:07
the shape of it. The issue started with a configuration
2:10
file they generate to manage threat traffic.
2:13
That file grew much larger than they expected,
2:17
and that pushed part of their internal traffic
2:20
management software into a failure mode. Once
2:23
that system went sideways, they couldn't reliably
2:26
process requests for a lot of customers until
2:29
they got it back under control. They were pretty
2:32
clear there's no sign this was an external attack.
2:36
This was normal complex-system-at-scale behavior:
2:39
config and software interacting in a way that
2:43
wasn't caught ahead of time.
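The same shape of failure exists in any pipeline that generates config automatically and publishes it to something that consumes it blindly, and the cheap defense is a sanity check before anything ships. Here's a minimal Python sketch of that idea; the size limit, entry limit, path, and JSON format are all invented for illustration, so treat it as the shape, not a recipe.

```python
# Sketch of a pre-publish guardrail for a generated config file:
# refuse to ship it if it's wildly outside its normal size or entry count,
# so a downstream consumer never has to choke on it.
# Limits, path, and JSON layout are placeholders.
import json
import sys
from pathlib import Path

MAX_BYTES = 5 * 1024 * 1024   # generated file is normally far smaller than this
MAX_ENTRIES = 50_000          # same idea for the number of rules it contains

def validate(path: Path) -> None:
    size = path.stat().st_size
    if size > MAX_BYTES:
        sys.exit(f"refusing to publish {path}: {size} bytes exceeds {MAX_BYTES}")
    entries = json.loads(path.read_text())
    if not isinstance(entries, list) or len(entries) > MAX_ENTRIES:
        sys.exit(f"refusing to publish {path}: unexpected shape or too many entries")

if __name__ == "__main__":
    # e.g. python check_config.py build/generated-rules.json
    validate(Path(sys.argv[1]))
```

A ceiling like that won't catch every bad interaction, but it turns "the file got huge" into a failed publish step instead of a broken edge.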
2:46
From our side, the interesting part isn't that Cloudflare is unreliable,
2:49
because everybody at that scale has incidents.
2:52
The part that matters is how many of us quietly
2:55
treat our CDN and WAF layer as if it can't fail.
2:59
If all of your HTTP traffic goes through one
3:02
provider and you have no way around them, then
3:05
your uptime is effectively pinned to their uptime.
3:09
A couple of questions this should raise for your
3:12
own setup. If your CDN is down for two or three
3:15
hours in the middle of the day, what happens
3:18
to your users? Do you have a way to temporarily
3:20
route some traffic directly to your origin? Is
3:23
that just a theoretical DNS change, or is it
3:26
a documented, tested step you've actually walked
3:29
through? And do you have monitoring that makes
3:32
it obvious the CDN layer is broken but your origin
3:35
is fine, or would you be piecing that together
3:38
from user reports and a status page?
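To make that concrete, here's a rough Python sketch of that kind of check: hit the same health path through the CDN and straight at the origin, then compare. The hostnames are placeholders for whatever you can actually reach directly, and a real version would feed your alerting instead of printing.

```python
# Sketch: probe the same path through the CDN and directly at the origin,
# so an alert can distinguish "edge problem" from "origin problem".
# Hostnames are placeholders, not real endpoints.
import urllib.error
import urllib.request

CDN_URL = "https://www.example.com/healthz"        # goes through the CDN/WAF
ORIGIN_URL = "https://origin.example.com/healthz"  # bypasses the CDN

def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (healthy, detail) for one endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        return False, f"HTTP {exc.code}"
    except Exception as exc:  # DNS failure, timeout, TLS error, connection refused
        return False, type(exc).__name__

if __name__ == "__main__":
    cdn_ok, cdn_detail = probe(CDN_URL)
    origin_ok, origin_detail = probe(ORIGIN_URL)
    if not cdn_ok and origin_ok:
        print(f"CDN layer looks broken ({cdn_detail}); origin is fine ({origin_detail})")
    elif not origin_ok:
        print(f"Origin itself is unhealthy ({origin_detail})")
    else:
        print("Both paths healthy")
```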
3:41
Cloudflare frontends a huge percentage of the web. Incidents
3:44
like this are not rare in the big picture. They're
3:48
good prompts for a realistic "CDN is down" tabletop
3:52
exercise with your team. Let's move from the
3:55
edge of the internet into the cloud itself and
3:57
talk about AWS US East 1. Back on October 20th,
4:02
AWS had a major incident in the US East 1 region.
4:06
Depending on which analysis you read, it impacted
4:09
well over 100 AWS services and lasted somewhere
4:13
in the 14-15 hour range. That's not a brief
4:17
blip, that's most of the workday. A lot of
4:20
well-known companies reported knock-on effects,
4:23
slow or failing requests, backends timing out,
4:26
and internal tools misbehaving. If you look at
4:29
AWS's own summaries and third party breakdowns,
4:33
you see a combination of issues inside key subsystems.
4:37
Services that monitor and manage other services
4:39
had problems. There were DNS resolution issues
4:43
inside the region, and the control plane APIs
4:46
people depend on to manage resources were degraded
4:49
or error-prone for long stretches. The important
4:52
bit for us is how that intersects with the way
4:55
people talk about high availability and disaster
4:58
recovery. A lot of teams quite honestly stop
5:01
at "we're spread across multiple availability
5:04
zones in US East 1, so we're good." That
5:07
helps if a single data center has a power problem
5:10
or a localized failure. It does very little for
5:14
you when the whole region is unhealthy in ways
5:17
that touch both data plane and control plane.
5:21
The second pattern that shows up in postmortems
5:23
and social posts is the backup plan. You see
5:27
some version of "if US East 1 has trouble, we'll
5:30
just redeploy to another region with Terraform
5:33
or CloudFormation." But in an event like this,
5:36
the very APIs those tools rely on are also degraded.
5:40
So your recovery plan assumes the control plane
5:43
is perfectly usable at the exact moment AWS is
5:46
saying "we're having issues with the control
5:48
plane operations in this region." A few questions
5:51
to think about in the context of your own systems.
5:54
Do you have anything running in another region
5:57
right now, even in a scaled-down form, that could
6:00
serve as a fallback, not theory, but actual workloads
6:04
you can point to? Could you provide some kind
6:07
of degraded experience from that secondary region
6:10
without building it on the fly in the middle
6:12
of an incident? And have you ever walked through
6:15
a full failover and failback end to end when
6:20
things were calm so you know what breaks, how
6:23
long it actually takes, and who needs to be involved?
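As one concrete shape that drill can take, here's a hypothetical Python sketch using boto3 to shift weighted DNS records toward a standby region. The hosted zone ID and hostnames are made up, and even this assumes the DNS control plane is reachable when you need it, which is exactly the kind of assumption a calm-weather run will surface.

```python
# Hypothetical sketch for a failover drill: reweight a pair of weighted
# Route 53 records so the scaled-down standby region starts taking traffic.
# Zone ID, record name, and targets are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000PLACEHOLDER"
RECORD_NAME = "app.example.com."

def set_weight(set_identifier: str, target: str, weight: int) -> None:
    """Upsert one weighted CNAME entry for the app record."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "failover drill: shift traffic between regions",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }],
        },
    )

if __name__ == "__main__":
    # Drain the primary region and send everything to the standby.
    set_weight("us-east-1", "app-use1.example.com", 0)
    set_weight("us-west-2", "app-usw2.example.com", 100)
```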
6:26
I'm not saying everyone should be active-active
6:29
everywhere. That's not realistic for most teams.
6:32
But if you call a system mission critical, and
6:35
it only exists in a single AWS region with untested
6:40
"we'll spin it up elsewhere" docs, incidents like
6:43
this are a pretty strong signal that's not a real plan.
6:46
Now, let's shift from user-facing outages and
6:50
regional issues to something closer to home.
6:53
GitHub. On November 18th, the same day as the
6:56
Cloudflare incident, GitHub had its own major
6:59
problem. According to their status updates and
7:02
multiple tracking sites, they started investigating
7:05
failures on all Git operations in the early evening
7:10
UTC. That meant push, pull, and clone over both
7:14
HTTP and SSH were failing or timing out. A bit
7:19
later, they also called out degraded availability
7:22
for Codespaces. It took roughly an hour before
7:25
they reported recovery. From a development and
7:28
operations perspective, that touches a lot of
7:31
things at once. CI systems that do a fresh clone
7:34
from GitHub every run will fail. GitOps tools
7:38
like Argo CD or Flux that continuously sync from
7:42
GitHub will stop updating. Developers trying
7:45
to push a fix for some other outage can't get
7:48
their code up. And if your only CI system is
7:52
GitHub Actions, those workflows are either delayed
7:54
or completely blocked. So while GitHub going
7:58
down doesn't look like a classic production is
8:00
down incident, it absolutely can turn into one
8:03
because it stops you from changing production
8:05
at the exact moment you might need to. Most teams
8:08
don't have a playbook for "GitHub is unavailable"
8:11
in the same way they have a runbook for "service X
8:14
is unhealthy." A few things worth considering.
8:17
Do you have read-only mirrors of your most important
8:21
repositories anywhere else? Another Git provider,
8:24
an internal mirror, anything? Even a simple periodic
8:28
mirror of your infra and core app repos can make a difference.
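For example, something as small as this hypothetical Python job, run on a schedule from somewhere other than GitHub itself, keeps bare mirrors of a few critical repos on another remote. The repo URLs and paths here are placeholders.

```python
# Hypothetical sketch of a periodic read-only mirror job for a handful of
# critical repos. Remote URLs and the working directory are placeholders;
# run it from cron or a scheduler that doesn't live on GitHub Actions.
import subprocess
from pathlib import Path

REPOS = {
    # source on GitHub                 -> mirror remote somewhere else
    "git@github.com:example/infra.git": "git@git.internal.example.com:mirrors/infra.git",
    "git@github.com:example/app.git":   "git@git.internal.example.com:mirrors/app.git",
}
WORKDIR = Path("/var/lib/repo-mirrors")

def mirror(source: str, target: str) -> None:
    """Clone (or update) a bare mirror of `source`, then push it to `target`."""
    local = WORKDIR / source.rsplit("/", 1)[-1]
    if local.exists():
        subprocess.run(["git", "-C", str(local), "remote", "update", "--prune"], check=True)
    else:
        subprocess.run(["git", "clone", "--mirror", source, str(local)], check=True)
    subprocess.run(["git", "-C", str(local), "push", "--mirror", target], check=True)

if __name__ == "__main__":
    WORKDIR.mkdir(parents=True, exist_ok=True)
    for src, dst in REPOS.items():
        mirror(src, dst)
```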
8:32
Can your CI run from a cached copy
8:35
of the repo and existing artifacts for some period
8:38
of time? Or is every pipeline hardwired to always
8:43
pull from GitHub's live API? And if GitHub Actions
8:46
is your only pipeline engine, do you have any
8:49
backup, even if it's slower and more manual,
8:52
or is the default "we simply wait"? None of this
8:55
has to be perfect, but your critical repos and
8:58
pipelines deserve to be treated as part of your
9:01
reliability story the same way your databases
9:04
and load balancers are. Now that we've looked
9:06
at Cloudflare at the edge, AWS in the region,
9:10
and GitHub in the development loop, let's zoom
9:12
out and talk about the pattern. All three of
9:15
these incidents point at the same basic reality.
9:19
We are heavily dependent on a small set of external
9:23
providers that we treat like background infrastructure.
9:26
The CDN or WAF in front of us, the primary cloud
9:30
region we run in, and the platform that hosts
9:33
our code and pipelines. When they have issues,
9:37
they expose assumptions in our designs and in
9:40
our runbooks. The root causes are also pretty
9:43
typical for large complex systems. A configuration
9:47
file grows larger than expected and interacts
9:50
badly with software that wasn't written for that
9:53
case. Internal health or management services
9:56
fail in surprising ways and drag down other components.
10:01
Service operations inside GitHub stumble and
10:04
suddenly Git operations don't reliably work.
10:07
Nothing exotic, just scale and complexity doing
10:10
what they do. For me, the practical move here
10:14
is not to panic about the cloud being fragile.
10:17
It's to get very explicit about where your external
10:21
single points of failure are and then improve
10:24
a couple of them in a concrete way. For each
10:27
major provider you depend on, ask two questions.
10:31
If this provider is impaired for a few hours
10:33
in the middle of the day, what exactly breaks
10:36
for our users and what breaks for our ability
10:38
to respond? And what is the specific set of steps
10:42
we would take with the tooling and people we
10:46
have right now? Then pick one of those areas
10:49
and move it a step forward. That might be putting
10:53
a minimal but real footprint in a second AWS
10:57
region for your most critical services and exercising
11:00
it on a schedule. Documenting and testing a simple
11:04
path to temporarily bypass the CDN for some subset
11:08
of traffic if the edge is misbehaving. Or mirroring
11:12
your key repos and adjusting CI so you're not
11:16
entirely dependent on a single Git provider for
11:20
every build and deployment. You don't need a
11:22
perfect answer for every scenario, but these
11:25
kinds of outages are useful pressure tests and
11:29
they're good leverage when you're trying to justify
11:32
reliability work to people who only see the cloud
11:36
marketing slides. That's it for this first episode
11:39
of Ship It Weekly by Tellers Tech. Today was
11:42
a bit of a special, mostly focused on three related
11:45
outages and what they mean for reliability and
11:49
architecture. Going forward, the normal format
11:52
will be a little more mixed. Most weeks, I'll
11:55
cover a couple of main stories, some quick mentions
11:58
of tools or releases that are worth a look,
12:01
and usually one culture or career topic that's
12:05
relevant to DevOps and SRE work. We'll still
12:08
do focused episodes like this when something
12:10
big happens, but you can expect more variety
12:13
week to week. If this was useful, feel free to
12:16
share it with someone on your team who thinks
12:18
about incidents, DR, and platform design. Show
12:22
notes will have links to the Cloudflare incident
12:25
explanations, the AWS US East 1 analysis, and
12:29
the GitHub outage history, so you can dig into
12:32
the details if you want. I'm Brian. Thanks for
12:34
listening, and I'll see you in the next one.