0:00
This week is basically a masterclass in "the system
0:03
did exactly what we built it to do." Cloudflare
0:06
automated something that touches routing and
0:08
a bug turned into real BGP withdrawals for customer
0:12
prefixes. Clerk got taken out by a query plan
0:15
flip. Nothing crashed. The database was up. It
0:19
was just slow enough to light everything on fire.
0:21
And AWS is in the middle of a new era where internal
0:25
tools can take bigger actions faster. And the
0:28
whole story comes back to permissions and guardrails.
0:31
If you run production, this is your week. Hey.
0:50
I'm Brian from Tellers Tech, and this is Ship
0:53
It Weekly, where we cover outages, releases,
0:55
and incident write-ups, then translate them
0:58
into what it actually means for your systems.
1:01
If you like the show, follow or subscribe wherever
1:04
you are listening. And if you've got an on-call
1:06
buddy, send them this episode. That share does
1:09
more than you'd expect. Also, quick plug, full
1:12
show notes live on shipitweekly.fm. And the
1:16
curated weekly brief is on oncallbrief.com.
1:19
All right, quick overview so you know where we're
1:22
going today. First, Cloudflare's BYOIP outage
1:26
where a cleanup job ended up withdrawing customer
1:29
prefixes. It's a really clean example of how
1:32
background automation can become production blast
1:35
radius if it touches routing or reachability.
1:38
Second, Clerk's outage from a Postgres auto-analyze
1:42
triggering a query plan flip. The database was
1:45
up, but performance tanked and the system started
1:48
shedding load. It's a great case study in why
1:51
degraded is sometimes harder than down. Third,
1:54
the AWS Kiro story and the follow-up response
1:57
from AWS. Regardless of the headline, the useful
2:00
lesson is permissions. If a tool can take actions,
2:03
it's part of your control plane and it needs
2:05
real boundaries and approvals. After that, I'll
2:08
do a quick platform note on the EKS node monitoring
2:12
agent going open source. Then a tight lightning
2:14
round. Then we'll close with a human section
2:17
on why SRE is basically allergic to ticket queues.
2:21
All right, story one. So Cloudflare had an outage
2:28
tied to BYOIP, bring your own IP. If you are
2:32
not familiar, the idea is simple. Customers want
2:34
to use their own IP ranges at the edge. So traffic
2:37
still comes from the IP space they own, even
2:40
though Cloudflare is doing the heavy lifting.
2:43
The scary part is what makes it powerful. It's
2:45
real routing. BGP announcements. Prefixes. Internet
2:49
reachability. Cloudflare's postmortem is one
2:52
of those that's uncomfortable because it's relatable.
2:54
A cleanup subtask was designed to remove BYOIP
2:58
prefixes that should be removed. Basically, automate
3:01
a manual workflow so support and ops don't have
3:04
to do it by hand. But the cleanup job had a bug.
3:08
The query it ran against their own internal API
3:10
returned more prefixes than it should have. And
3:14
the system started withdrawing prefixes that
3:16
were still in use. At peak, they say around
3:19
1,100 prefixes were unintentionally withdrawn
3:23
for a window of time. And then the incident as
3:27
a whole took hours to fully unwind. Because even
3:30
after you stop the bleeding, you are stuck restoring
3:33
state across a big distributed system. This is
3:36
the part I want to linger on because it's the
3:38
part that teams tend to underestimate. When you
3:41
withdraw routes, a rollback isn't just flipping
3:44
a flag. You are waiting on propagation. You are
3:47
dealing with caches. You are dealing with partial
3:49
state. You are dealing with bindings and dependencies
3:52
that get removed or drift. Cloudflare even published
3:56
customer guidance during the incident that some
3:58
customers could self-remediate by going into
4:01
the dashboard and re-advertising their prefixes.
4:04
That right there is the difference between this
4:07
is bad and this is catastrophic. If your customers
4:09
can do something to recover without waiting for
4:12
you, you just bought yourself time. So here's
4:14
the ops lesson. If you have automation that can
4:17
withdraw routes, revoke advertisements, delete
4:20
edge bindings, remove cert associations, change
4:23
DNS at scale, or anything in that reachability
4:26
control plane bucket, treat it like production
4:29
deployment tooling, not like a cron job. This
4:32
is the kind of automation that needs friction
4:34
on purpose. It needs a safety check that says,
4:37
if this job is about to touch more than N prefixes,
4:40
stop. It needs canary behavior. It needs a dry
4:44
run mode that produces a diff humans can look
4:47
at. It needs a circuit breaker that triggers
4:49
on anomaly, not on service down. Because the
4:53
failure mode here is always the same. One helper
4:56
job. One bug. One correlated blast radius. And
5:00
nobody sleeps. Practical Monday take. Ask, what
5:03
systems do we have where a background job can
5:06
make reachability go away? That could be BGP,
5:09
but it could also be DNS automation, certificate
5:12
rotation automation, firewall rule cleanup, CDN
5:16
rule cleanup, or even a Terraform pipeline with
5:20
the ability to destroy and recreate shared infrastructure.
5:23
Then ask one more question. If that job goes
5:26
sideways, what's the fastest human-safe rollback?
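A rollback you can execute under stress usually starts with automation that refuses to dig the hole deeper. The max-N safety check, dry-run diff, and approval gate described earlier might look roughly like this sketch; the cap, names, and functions are all made up for illustration, not Cloudflare's actual tooling:

```python
MAX_WITHDRAWALS = 25  # trip wire: refuse to withdraw more than this in one run

def plan_cleanup(active_prefixes, candidates):
    """Dry run: build a diff a human can review without touching routing."""
    to_withdraw = [p for p in candidates if p in active_prefixes]
    return {"withdraw": to_withdraw, "count": len(to_withdraw)}

def execute_cleanup(plan, approved=False):
    """Refuse to act on an oversized blast radius or an unapproved plan."""
    if plan["count"] > MAX_WITHDRAWALS:
        raise RuntimeError(f"refusing: {plan['count']} exceeds cap {MAX_WITHDRAWALS}")
    if not approved:
        raise RuntimeError("refusing: no human approved this plan")
    # real tooling would call the routing control plane here
    return plan["withdraw"]
```

The useful property is that the dry run produces a reviewable diff and the execute path fails closed, on both blast radius and missing approval, instead of quietly doing whatever the query returned.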
5:29
And does it require tribal knowledge? If the
5:32
answer is we'd figure it out, congrats. You just
5:34
found your next incident. All right, story two
5:41
is one of my favorite kind of postmortems, even
5:44
though it's painful. Clerk had a system outage,
5:47
and the root cause was an inefficient query plan
5:50
caused by Postgres auto-analyze. So nothing
5:53
exotic, no kernel panic, no region failure, no
5:56
"someone deleted prod." Just Postgres doing normal
6:00
Postgres things. Here's the chain. Auto-analyze
6:03
runs. Statistics get updated. That causes a query
6:06
plan flip. Same query, different plan. That plan
6:10
is dramatically worse, which drags database performance
6:13
down, which backs up request handlers, which
6:16
turns into queuing. And then almost all traffic
6:18
starts getting 429'd without being handled. They
6:22
call out that over 95% of traffic was returning
6:25
429 because the system was basically shedding
6:28
load while it was drowning. The fix was pretty
6:30
direct. Manually rerun analyze for the table
6:34
involved, which changed the stats and brought
6:37
the query plan back to the good version. What's
6:39
interesting is the detail they share about why
6:42
the planner got it wrong. It had to estimate
6:44
how many rows would match a condition, and it
6:47
used a statistic that depends on sampling. Their
6:50
data had a column where almost everything was
6:53
null. And because the sample was small, the sample
6:56
ended up being basically all nulls. So the planner
6:59
overconfidently assumed 100% nulls. The query
7:03
planner then expected a certain part of the query
7:05
to return basically zero rows. But in reality,
7:09
it returned something like 17,000 rows. So the
7:12
plan it picked was good for zero rows and terrible
7:15
for 17,000. That mismatch is the kind of thing
7:19
that doesn't show up in unit tests. It shows
7:22
up on a Thursday morning when auto-analyze decides
7:25
it's time. So why does this matter for platform
7:28
and SRE folks? Because a lot of teams still think
7:31
in binary failure states. Database up or down.
7:35
Service up or down. But a huge chunk of production
7:38
incidents live in the gray zone. Database is
7:41
up, but slow. The service is up, but queuing.
7:44
Your health checks pass, but users are screaming.
7:47
They even point out that their automatic failover
7:50
didn't trigger because Postgres was online, just
7:52
degraded. So it didn't match the failover now
7:55
conditions. And this is where the playbook needs
7:58
to evolve. If your failover only triggers on
8:01
dead, you're going to get smoked by limping.
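One way to stop "limping" from slipping past your checks is to make health three-valued and latency-aware, rather than a binary liveness probe. A rough sketch, with an invented latency budget and failover threshold:

```python
import time

LATENCY_BUDGET_S = 0.25  # made-up budget; anything slower counts as degraded

def check_health(probe):
    """Return 'healthy', 'degraded', or 'down' instead of a binary up/down."""
    start = time.monotonic()
    try:
        probe()  # e.g. a cheap canary query against the database
    except Exception:
        return "down"
    elapsed = time.monotonic() - start
    return "degraded" if elapsed > LATENCY_BUDGET_S else "healthy"

def should_fail_over(history, window=5, threshold=3):
    """Fail over when enough recent checks were degraded OR down, not only dead."""
    recent = history[-window:]
    return sum(1 for s in recent if s != "healthy") >= threshold
```

The point isn't these exact numbers; it's that "degraded" is a first-class state the failover logic can count, so an online-but-drowning database still trips the breaker.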
8:04
Clerk's remediation is worth stealing. They talk
8:07
about adding alerting specifically for query
8:09
plan flips because it's sudden and severe. They
8:12
also talk about a mitigation that offloaded session
8:15
token generation outside their core session API
8:18
to reduce backend load and help people stay logged
8:22
in even while other parts of the system were
8:25
unhealthy. That's a classic reliability move.
8:28
Protect the critical path even if the full feature
8:31
set is degraded. And they're also honest about
8:34
communication. They say their updates were too
8:37
infrequent, their initial status severity didn't
8:39
match impact, and their first update was too
8:41
slow. Every team thinks they're good at comms
8:44
until they're in a real outage. So Monday take.
8:47
If you run Postgres, ask yourself, do we have
8:49
any alerting that detects this query suddenly
8:52
got 50 times slower? Or this query changed plan?
8:55
Or do we just wait for CPU graphs to scream?
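You can get a long way toward catching "this query suddenly got 50 times slower" with a rolling per-query baseline and a ratio alert. A hypothetical sketch, the window and 10x ratio are arbitrary choices, and query_id stands in for whatever query fingerprint your telemetry gives you:

```python
from collections import defaultdict, deque
from statistics import median

BASELINE_WINDOW = 50  # latency samples kept per query fingerprint
ALERT_RATIO = 10.0    # arbitrary: flag a sample 10x slower than the baseline

history = defaultdict(lambda: deque(maxlen=BASELINE_WINDOW))

def observe(query_id, latency_ms):
    """Record a latency sample; return True if it looks like a sudden plan flip."""
    samples = history[query_id]
    # only alert once we have a baseline, and compare before recording the spike
    flipped = len(samples) >= 10 and latency_ms > ALERT_RATIO * median(samples)
    samples.append(latency_ms)
    return flipped
```

A plan flip shows up as exactly this shape, a stable baseline followed by a step change, which is why it's so catchable even when the database still reports itself as up.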
8:58
And separately, do we have a degraded mode strategy
9:01
for the handful of flows that absolutely cannot
9:04
be down? Auth, token validation, session refresh,
9:08
payment, whatever it is for your product. Because
9:11
the best incident is the one where users can
9:14
still do the one thing they really need, even
9:16
if the rest is on fire. All right, story three.
9:23
This one's floating around as "AI took down AWS,"
9:26
which is obviously the headline everybody wants.
9:29
But the more useful way to look at it is this
9:32
is a permission story. Reuters reported that
9:34
AWS had a disruption tied to a cost management
9:37
feature, and the reporting connected it to AWS's
9:41
internal tooling called Kiro. AWS responded
9:45
by saying it was limited to a single service,
9:47
not AWS broadly. It was limited to one region,
9:50
and it was user error. Then AWS published their
9:53
own statement, basically saying, the interruption
9:56
was due to misconfigured access controls, not
9:59
AI. They also say they added additional safeguards,
10:02
including mandatory peer review for production
10:05
access. You can believe whichever framing you
10:07
want, but the operational takeaway is identical.
10:10
If you have a tool that can take action, it is
10:13
part of your control plane. Whether it's an agent,
10:16
a bot, a pipeline, Terraform, a chatbot command,
10:20
a script that runs at 2am, or an internal
10:22
self-service portal. The moment it can touch production,
10:25
you need to treat it like production access.
10:28
And the fastest way to get hurt here is letting
10:30
convenience win over boundaries. So what do
10:33
good boundaries look like in real teams? It looks
10:36
like separation between read-only and write.
10:38
It looks like separation between propose a plan
10:41
and execute the plan. It looks like destructive
10:43
actions requiring explicit approvals. Not just
10:46
it ran in automation, so it must be fine. It
10:49
looks like a break glass path for emergencies
10:52
that is auditable and annoying enough that nobody
10:55
uses it casually. And it looks like logging actual
10:58
tool actions, not just chat transcripts. Not
11:01
the bot said it would delete things. I mean,
11:04
who called what API, with what role, against
11:07
what resources, and what changed? Because in
11:10
this new era, the hardest incidents will be the
11:13
ones where everything moved fast. And nobody
11:16
can confidently answer what actually happened.
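The propose-versus-execute split with an action-level audit log can be tiny. A hypothetical sketch, the destructive-action list and approval rule here are illustrative, not any real product's API:

```python
import time

DESTRUCTIVE = {"delete_env", "revoke_access", "withdraw_route"}  # illustrative list
audit_log = []  # real systems want append-only storage, not a process-local list

def propose(actor, action, target):
    """Anything (human, bot, agent) may propose; nothing executes yet."""
    return {"actor": actor, "action": action, "target": target, "approved_by": None}

def execute(proposal):
    """Destructive actions need approval from a second human before running."""
    needs_second = proposal["action"] in DESTRUCTIVE
    if needs_second and proposal["approved_by"] in (None, proposal["actor"]):
        raise PermissionError("destructive action requires a second human's approval")
    # log who called what, against what resource, and when
    audit_log.append({**proposal, "ts": time.time()})
    return True
```

Notice the log records actual tool actions, actor, action, target, timestamp, which is exactly the record you want when the postmortem question is "what actually happened."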
11:19
Monday take. If your org is messing with agents,
11:22
or even just adding more automation, do one simple
11:26
exercise. Pick one destructive action that exists
11:29
in your environment. Like deleting an environment,
11:32
rotating a secret, revoking access, withdrawing
11:35
a route, disabling a control. Now ask, can anything
11:38
do this without a second human being involved?
11:42
If yes, that's your risk. Not because AI is dangerous,
11:45
but because any tool with power plus weak guardrails
11:48
is dangerous. Quick platform note. AWS open sourced
11:56
the EKS node monitoring agent. The big pitch
11:59
is it monitors node-level system, storage, networking,
12:03
and accelerator issues and publishes them as
12:06
node conditions. And EKS can use those conditions
12:09
to drive automatic node repair. If you've ever
12:12
had a weird node that's half dead and you ended
12:16
up SSHing in, tailing kubelet logs, checking
12:19
disk pressure, and basically doing detective
12:22
work while your workloads suffer, that's the
12:24
exact pain that this is aimed at. I like this
12:27
category of tooling because it's not another
12:30
dashboard. It's turn node weirdness into a signal
12:33
that the control plane can act on. If you are
12:36
on EKS and you've had node flakiness incidents,
12:39
it's worth a look. All right, time for the lightning
12:48
round. I'm keeping this tight, four items and
12:51
all high signal. First, Grafana. There's a
12:54
high-severity advisory for cross-dashboard privilege
12:57
escalation via permission management. The short
13:00
version is, if someone has permission management
13:02
rights on one dashboard, under certain conditions,
13:06
they can read and modify permissions on other
13:09
dashboards. If you run Grafana in a shared environment,
13:12
this is one of those check your version and patch
13:15
stories, not a someday story. Second, runc CVEs.
13:20
AWS put out a bulletin for recently disclosed
13:23
runc issues that affect container runtimes when
13:26
launching new containers. I'm not going to pretend
13:29
everyone patches this instantly because the reality
13:32
is it depends on how you get your node OS and
13:35
runtime updates. But this is still a reminder
13:38
to keep node rollouts and runtime patching as
13:41
a normal muscle, not a panic button. Third, GitLab
13:45
patch train. GitLab shipped patch releases that
13:48
include important bug and security fixes, and
13:51
they strongly recommend self-managed installs
13:54
upgrade. If you self -host GitLab, you already
13:57
know the deal. Don't let "we'll do it later" become
13:59
"we got popped because we were busy." Fourth, Atlassian's
14:03
February security bulletin. This is for the enterprise
14:06
crowd still running data center products. They
14:09
are calling out a pile of high severity and critical
14:12
severity vulnerabilities fixed in recent product
14:15
releases. Same story. If you run it, patch it.
14:19
If you don't run it, thank your lucky stars and
14:21
keep scrolling. All right, human closer. There's
14:31
an ACM Queue piece called SRE is anti-transactional,
14:35
and it nails something that every platform team
14:37
eventually runs into. Tickets don't scale. Manual
14:41
work scales linearly. More requests means more
14:44
humans. And that is how you turn a platform team
14:47
into a help desk with pager fatigue. The SRE
14:50
instinct is to build systems that do work for
14:53
you. Not because you hate helping people, but
14:56
because you want the systems to be reliable without
14:59
requiring human glue for every small thing. And
15:02
honestly, this week's stories are all versions
15:04
of that same theme. Cloudflare tried to automate
15:07
a workflow that used to be manual. The idea was
15:10
right, but the guardrails weren't strong enough.
15:13
Clerk got hit by a database behavior that didn't
15:16
trip the usual failover assumptions. And they
15:19
are evolving their system so the critical flows
15:21
can survive partial failure. And AWS is in the
15:25
middle of a bigger shift where tools are doing
15:27
more, faster, and the only thing standing between
15:30
helpful and incident is how you design boundaries
15:33
and approvals. So if you are a platform engineer
15:36
or an SRE listening to this and you feel like
15:38
you are buried in tickets, here's the move. Pick
15:41
one repeated transactional pain this week and
15:44
don't solve it with another runbook. Solve it
15:46
with an API, a self-service workflow, or automation
15:50
with proper guardrails. Okay, time for a quick
15:52
recap before we wrap. Cloudflare is a reminder
15:55
that helper jobs are never just a cron. If automation
15:58
can touch reachability, routing, DNS, certs,
16:02
or anything shared, it needs production-grade
16:05
guardrails and a rollback you can execute under
16:08
stress. Clerk is the reminder that up but slow
16:11
can be worse than down. If your alerting and
16:14
failover only triggers on dead systems, you are
16:17
going to miss the incidents that actually hurt.
16:20
And the AWS Kiro story, no matter how you frame
16:23
it, comes back to permissions. If a tool can
16:26
execute changes, separate propose versus execute.
16:29
Require approvals for destructive actions. And
16:33
log the actual actions taken. Lightning round
16:36
recap. Grafana's permission escalation risk.
16:39
runc runtime CVEs, GitLab patch releases, and
16:43
Atlassian's monthly security bulletin. Links for
16:46
all of these are in the show notes. And the human
16:48
takeaway: SRE is anti-transactional for a reason.
16:51
Tickets don't scale. Build self -service and
16:54
guardrails so humans stop being the interface
16:57
for every little thing. All right, that's it
17:00
for this week. If you want the full receipts
17:02
and links, the full show notes are on shipitweekly.fm.
17:05
And the curated weekly brief is on OnCallBrief.com.
17:09
If you got value out of this, follow or
17:12
subscribe wherever you listen. And subscribe
17:14
on YouTube if you are watching the video version.
17:17
And if you've got an on-call buddy, send them
17:19
this episode. I'm Brian for Ship It Weekly, and
17:21
I'll see you next week.