0:07
Hey, I'm Brian from Tellers Tech, and this is
0:10
the first episode of Ship It Weekly, a quick
0:14
rundown of what this show is supposed to be.
0:16
Most weeks, this is going to be a short practical
0:19
recap of what happened in the DevOps, SRE, and
0:23
platform engineering world. Stuff like big cloud
0:25
changes, notable incident write-ups, useful
0:28
new tools, and the occasional culture or burnout
0:32
topic. Think a couple of main stories, a quick
0:35
tools and releases segment, and maybe one human
0:37
thing at the end. This first episode is going
0:40
to be a little different. The last stretch has
0:43
had a few really big outages from providers we
0:47
all depend on. Cloudflare had a global issue
0:51
that broke a bunch of sites. AWS US East 1
0:54
had a long regional incident and GitHub had a
0:58
major hiccup where core Git operations stopped
1:02
working. So instead of a grab bag of topics,
1:05
this one is more of a focused special on those
1:09
outages and what they say about how we design
1:12
our own systems. Going forward, the default will
1:16
be a more typical news rundown, but I'll probably
1:19
come back to this kind of themed episode whenever
1:22
something big like this happens again. All the
1:25
links and source material will be in the show
1:27
notes if you want the full timelines and technical
1:30
details. Alright, let's start with Cloudflare.
1:34
On November 18th, Cloudflare had a bad morning
1:38
and took a decent chunk of the internet with
1:40
it. If you were online, you probably saw it in
1:44
one form or another. Lots of sites and apps that
1:47
sit behind Cloudflare started returning Cloudflare
1:51
error pages instead of real responses. Big names,
1:55
smaller sites, random government and financial
1:57
pages, all showing variations of "we can't reach
2:00
the origin" or "something went wrong." Cloudflare
2:04
has since shared enough detail to understand
2:07
the shape of it. The issue started with a configuration
2:10
file they generate to manage threat traffic.
2:13
That file grew much larger than they expected,
2:17
and that pushed part of their internal traffic
2:20
management software into a failure mode. Once
2:23
that system went sideways, they couldn't reliably
2:26
process requests for a lot of customers until
2:29
they got it back under control. They were pretty
2:32
clear there's no sign this was an external attack.
2:36
This was normal complex-system-at-scale behavior:
2:39
config and software interacting in a way that
2:43
wasn't caught ahead of time.
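The same shape of failure exists in any pipeline that generates config automatically and publishes it to something that consumes it blindly, and the cheap defense is a sanity check before anything ships. Here's a minimal Python sketch of that idea; the size limit, entry limit, path, and JSON format are all invented for illustration, so treat it as the shape, not a recipe.

```python
# Sketch of a pre-publish guardrail for a generated config file:
# refuse to ship it if it's wildly outside its normal size or entry count,
# so a downstream consumer never has to choke on it.
# Limits, path, and JSON layout are placeholders.
import json
import sys
from pathlib import Path

MAX_BYTES = 5 * 1024 * 1024   # generated file is normally far smaller than this
MAX_ENTRIES = 50_000          # same idea for the number of rules it contains

def validate(path: Path) -> None:
    size = path.stat().st_size
    if size > MAX_BYTES:
        sys.exit(f"refusing to publish {path}: {size} bytes exceeds {MAX_BYTES}")
    entries = json.loads(path.read_text())
    if not isinstance(entries, list) or len(entries) > MAX_ENTRIES:
        sys.exit(f"refusing to publish {path}: unexpected shape or too many entries")

if __name__ == "__main__":
    # e.g. python check_config.py build/generated-rules.json
    validate(Path(sys.argv[1]))
```

A ceiling like that won't catch every bad interaction, but it turns "the file got huge" into a failed publish step instead of a broken edge.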
2:46
From our side, the interesting part isn't that Cloudflare is unreliable,
2:49
because everybody at that scale has incidents.
2:52
The part that matters is how many of us quietly
2:55
treat our CDN and WAF layer as if it can't fail.
2:59
If all of your HTTP traffic goes through one
3:02
provider and you have no way around them, then
3:05
your uptime is effectively pinned to their uptime.
3:09
A couple of questions this should raise for your
3:12
own setup. If your CDN is down for two or three
3:15
hours in the middle of the day, what happens
3:18
to your users? Do you have a way to temporarily
3:20
route some traffic directly to your origin? Is
3:23
that just a theoretical DNS change, or is it
3:26
a documented, tested step you've actually walked
3:29
through? And do you have monitoring that makes
3:32
it obvious the CDN layer is broken but your origin
3:35
is fine, or would you be piecing that together
3:38
from user reports and a status page?
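To make that concrete, here's a rough Python sketch of that kind of check: hit the same health path through the CDN and straight at the origin, then compare. The hostnames are placeholders for whatever you can actually reach directly, and a real version would feed your alerting instead of printing.

```python
# Sketch: probe the same path through the CDN and directly at the origin,
# so an alert can distinguish "edge problem" from "origin problem".
# Hostnames are placeholders, not real endpoints.
import urllib.error
import urllib.request

CDN_URL = "https://www.example.com/healthz"        # goes through the CDN/WAF
ORIGIN_URL = "https://origin.example.com/healthz"  # bypasses the CDN

def probe(url: str, timeout: float = 5.0) -> tuple[bool, str]:
    """Return (healthy, detail) for one endpoint."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200, f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        return False, f"HTTP {exc.code}"
    except Exception as exc:  # DNS failure, timeout, TLS error, connection refused
        return False, type(exc).__name__

if __name__ == "__main__":
    cdn_ok, cdn_detail = probe(CDN_URL)
    origin_ok, origin_detail = probe(ORIGIN_URL)
    if not cdn_ok and origin_ok:
        print(f"CDN layer looks broken ({cdn_detail}); origin is fine ({origin_detail})")
    elif not origin_ok:
        print(f"Origin itself is unhealthy ({origin_detail})")
    else:
        print("Both paths healthy")
```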
3:41
Cloudflare frontends a huge percentage of the web. Incidents
3:44
like this are not rare in the big picture. They're
3:48
good prompts for a realistic "CDN is down" tabletop
3:52
exercise with your team. Let's move from the
3:55
edge of the internet into the cloud itself and
3:57
talk about AWS US East 1. Back on October 20th,
4:02
AWS had a major incident in the US East 1 region.
4:06
Depending on which analysis you read, it impacted
4:09
well over 100 AWS services and lasted somewhere
4:13
in the 14-15 hour range. That's not a brief
4:17
blip, that's most of the workday. A lot of
4:20
well-known companies reported knock-on effects,
4:23
slow or failing requests, backends timing out,
4:26
and internal tools misbehaving. If you look at
4:29
AWS's own summaries and third party breakdowns,
4:33
you see a combination of issues inside key subsystems.
4:37
Services that monitor and manage other services
4:39
had problems. There were DNS resolution issues
4:43
inside the region, and the control plane APIs
4:46
people depend on to manage resources were degraded
4:49
or error-prone for long stretches. The important
4:52
bit for us is how that intersects with the way
4:55
people talk about high availability and disaster
4:58
recovery. A lot of teams quite honestly stop
5:01
at "we're spread across multiple availability
5:04
zones in US East 1, so we're good." That
5:07
helps if a single data center has a power problem
5:10
or a localized failure. It does very little for
5:14
you when the whole region is unhealthy in ways
5:17
that touch both data plane and control plane.
5:21
The second pattern that shows up in postmortems
5:23
and social posts is the backup plan. You see
5:27
some version of "if US East 1 has trouble, we'll
5:30
just redeploy to another region with Terraform
5:33
or CloudFormation." But in an event like this,
5:36
the very APIs those tools rely on are also degraded.
5:40
So your recovery plan assumes the control plane
5:43
is perfectly usable at the exact moment AWS is
5:46
saying "we're having issues with the control
5:48
plane operations in this region." A few questions
5:51
to think about in the context of your own systems.
5:54
Do you have anything running in another region
5:57
right now, even in a scaled-down form, that could
6:00
serve as a fallback, not theory, but actual workloads
6:04
you can point to? Could you provide some kind
6:07
of degraded experience from that secondary region
6:10
without building it on the fly in the middle
6:12
of an incident? And have you ever walked through
6:15
a full failover and failback end to end when
6:20
things were calm so you know what breaks, how
6:23
long it actually takes, and who needs to be involved?
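As one concrete shape that drill can take, here's a hypothetical Python sketch using boto3 to shift weighted DNS records toward a standby region. The hosted zone ID and hostnames are made up, and even this assumes the DNS control plane is reachable when you need it, which is exactly the kind of assumption a calm-weather run will surface.

```python
# Hypothetical sketch for a failover drill: reweight a pair of weighted
# Route 53 records so the scaled-down standby region starts taking traffic.
# Zone ID, record name, and targets are placeholders.
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000PLACEHOLDER"
RECORD_NAME = "app.example.com."

def set_weight(set_identifier: str, target: str, weight: int) -> None:
    """Upsert one weighted CNAME entry for the app record."""
    route53.change_resource_record_sets(
        HostedZoneId=HOSTED_ZONE_ID,
        ChangeBatch={
            "Comment": "failover drill: shift traffic between regions",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD_NAME,
                    "Type": "CNAME",
                    "SetIdentifier": set_identifier,
                    "Weight": weight,
                    "TTL": 60,
                    "ResourceRecords": [{"Value": target}],
                },
            }],
        },
    )

if __name__ == "__main__":
    # Drain the primary region and send everything to the standby.
    set_weight("us-east-1", "app-use1.example.com", 0)
    set_weight("us-west-2", "app-usw2.example.com", 100)
```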
6:26
I'm not saying everyone should be active-active
6:29
everywhere. That's not realistic for most teams.
6:32
But if you call a system mission critical, and
6:35
it only exists in a single AWS region with untested
6:40
"we'll spin it up elsewhere" docs, incidents like
6:43
this are a pretty strong signal that's not a real plan.
6:46
Now, let's shift from user-facing outages and
6:50
regional issues to something closer to home.
6:53
GitHub. On November 18th, the same day as the
6:56
Cloudflare incident, GitHub had its own major
6:59
problem. According to their status updates and
7:02
multiple tracking sites, they started investigating
7:05
failures on all Git operations in the early evening
7:10
UTC. That meant push, pull, and clone over both
7:14
HTTP and SSH were failing or timing out. A bit
7:19
later, they also called out degraded availability
7:22
for Codespaces. It took roughly an hour before
7:25
they reported recovery. From a development and
7:28
operations perspective, that touches a lot of
7:31
things at once. CI systems that do a fresh clone
7:34
from GitHub every run will fail. GitOps tools
7:38
like Argo CD or Flux that continuously sync from
7:42
GitHub will stop updating. Developers trying
7:45
to push a fix for some other outage can't get
7:48
their code up. And if your only CI system is
7:52
GitHub Actions, those workflows are either delayed
7:54
or completely blocked. So while GitHub going
7:58
down doesn't look like a classic production is
8:00
down incident, it absolutely can turn into one
8:03
because it stops you from changing production
8:05
at the exact moment you might need to. Most teams
8:08
don't have a playbook for "GitHub is unavailable"
8:11
in the same way they have a runbook for "service X
8:14
is unhealthy." A few things worth considering.
8:17
Do you have read-only mirrors of your most important
8:21
repositories anywhere else? Another Git provider,
8:24
an internal mirror, anything? Even a simple periodic
8:28
mirror of your infra and core app repos can make a difference.
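For example, something as small as this hypothetical Python job, run on a schedule from somewhere other than GitHub itself, keeps bare mirrors of a few critical repos on another remote. The repo URLs and paths here are placeholders.

```python
# Hypothetical sketch of a periodic read-only mirror job for a handful of
# critical repos. Remote URLs and the working directory are placeholders;
# run it from cron or a scheduler that doesn't live on GitHub Actions.
import subprocess
from pathlib import Path

REPOS = {
    # source on GitHub                 -> mirror remote somewhere else
    "git@github.com:example/infra.git": "git@git.internal.example.com:mirrors/infra.git",
    "git@github.com:example/app.git":   "git@git.internal.example.com:mirrors/app.git",
}
WORKDIR = Path("/var/lib/repo-mirrors")

def mirror(source: str, target: str) -> None:
    """Clone (or update) a bare mirror of `source`, then push it to `target`."""
    local = WORKDIR / source.rsplit("/", 1)[-1]
    if local.exists():
        subprocess.run(["git", "-C", str(local), "remote", "update", "--prune"], check=True)
    else:
        subprocess.run(["git", "clone", "--mirror", source, str(local)], check=True)
    subprocess.run(["git", "-C", str(local), "push", "--mirror", target], check=True)

if __name__ == "__main__":
    WORKDIR.mkdir(parents=True, exist_ok=True)
    for src, dst in REPOS.items():
        mirror(src, dst)
```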
8:32
Can your CI run from a cached copy
8:35
of the repo and existing artifacts for some period
8:38
of time? Or is every pipeline hardwired to always
8:43
pull from GitHub's live API? And if GitHub Actions
8:46
is your only pipeline engine, do you have any
8:49
backup, even if it's slower and more manual,
8:52
or is the default "we simply wait"? None of this
8:55
has to be perfect, but your critical repos and
8:58
pipelines deserve to be treated as part of your
9:01
reliability story the same way your databases
9:04
and load balancers are. Now that we've looked
9:06
at Cloudflare at the edge, AWS in the region,
9:10
and GitHub in the development loop, let's zoom
9:12
out and talk about the pattern. All three of
9:15
these incidents point at the same basic reality.
9:19
We are heavily dependent on a small set of external
9:23
providers that we treat like background infrastructure.
9:26
The CDN or WAF in front of us, the primary cloud
9:30
region we run in, and the platform that hosts
9:33
our code and pipelines. When they have issues,
9:37
they expose assumptions in our designs and in
9:40
our runbooks. The root causes are also pretty
9:43
typical for large complex systems. A configuration
9:47
file grows larger than expected and interacts
9:50
badly with software that wasn't written for that
9:53
case. Internal health or management services
9:56
fail in surprising ways and drag down other components.
10:01
Service operations inside GitHub stumble and
10:04
suddenly Git operations don't reliably work.
10:07
Nothing exotic, just scale and complexity doing
10:10
what they do. For me, the practical move here
10:14
is not to panic about the cloud being fragile.
10:17
It's to get very explicit about where your external
10:21
single points of failure are and then improve
10:24
a couple of them in a concrete way. For each
10:27
major provider you depend on, ask two questions.
10:31
If this provider is impaired for a few hours
10:33
in the middle of the day, what exactly breaks
10:36
for our users and what breaks for our ability
10:38
to respond? And what is the specific set of steps
10:42
we would take with the tooling and people we
10:46
have right now? Then pick one of those areas
10:49
and move it a step forward. That might be putting
10:53
a minimal but real footprint in a second AWS
10:57
region for your most critical services and exercising
11:00
it on a schedule. Documenting and testing a simple
11:04
path to temporarily bypass the CDN for some subset
11:08
of traffic if the edge is misbehaving. Or mirroring
11:12
your key repos and adjusting CI so you're not
11:16
entirely dependent on a single Git provider for
11:20
every build and deployment. You don't need a
11:22
perfect answer for every scenario, but these
11:25
kinds of outages are useful pressure tests and
11:29
they're good leverage when you're trying to justify
11:32
reliability work to people who only see the cloud
11:36
marketing slides. That's it for this first episode
11:39
of Ship It Weekly by Tellers Tech. Today was
11:42
a bit of a special, mostly focused on three related
11:45
outages and what they mean for reliability and
11:49
architecture. Going forward, the normal format
11:52
will be a little more mixed. Most weeks, I'll
11:55
cover a couple of main stories, some quick mentions
11:58
of tools or releases that are worth a look,
12:01
and usually one culture or career topic that's
12:05
relevant to DevOps and SRE work. We'll still
12:08
do focused episodes like this when something
12:10
big happens, but you can expect more variety
12:13
week to week. If this was useful, feel free to
12:16
share it with someone on your team who thinks
12:18
about incidents, DR, and platform design. Show
12:22
notes will have links to the Cloudflare incident
12:25
explanations, the AWS US East 1 analysis, and
12:29
the GitHub outage history, so you can dig into
12:32
the details if you want. I'm Brian. Thanks for
12:34
listening, and I'll see you in the next one.