0:00
This week is basically a masterclass in "the system
0:03
did exactly what we built it to do." Cloudflare
0:06
automated something that touches routing and
0:08
a bug turned into real BGP withdrawals for customer
0:12
prefixes. Clerk got taken out by a query plan
0:15
flip. Nothing crashed. The database was up. It
0:19
was just slow enough to light everything on fire.
0:21
And AWS is in the middle of a new era where internal
0:25
tools can take bigger actions faster. And the
0:28
whole story comes back to permissions and guardrails.
0:31
If you run production, this is your week. Hey.
0:50
I'm Brian from Tellers Tech, and this is Ship
0:53
It Weekly, where we cover outages, releases,
0:55
and incident write-ups, then translate them
0:58
into what it actually means for your systems.
1:01
If you like the show, follow or subscribe wherever
1:04
you are listening. And if you've got an on-call
1:06
buddy, send them this episode. That share does
1:09
more than you'd expect. Also, quick plug, full
1:12
show notes live on shipitweekly.fm. And the
1:16
curated weekly brief is on oncallbrief.com.
1:19
All right, quick overview so you know where we're
1:22
going today. First, Cloudflare's BYOIP outage
1:26
where a cleanup job ended up withdrawing customer
1:29
prefixes. It's a really clean example of how
1:32
background automation can become production blast
1:35
radius if it touches routing or reachability.
1:38
Second, Clerk's outage from a Postgres auto-analyze
1:42
triggering a query plan flip. The database was
1:45
up, but performance tanked and the system started
1:48
shedding load. It's a great case study in why
1:51
degraded is sometimes harder than down. Third,
1:54
the AWS Kiro story and the follow-up response
1:57
from AWS. Regardless of the headline, the useful
2:00
lesson is permissions. If a tool can take actions,
2:03
it's part of your control plane and it needs
2:05
real boundaries and approvals. After that, I'll
2:08
do a quick platform note on the EKS node monitoring
2:12
agent going open source. Then a tight lightning
2:14
round. Then we'll close with a human section
2:17
on why SRE is basically allergic to ticket queues.
2:21
All right, story one. So Cloudflare had an outage
2:28
tied to BYOIP, bring your own IP. If you are
2:32
not familiar, the idea is simple. Customers want
2:34
to use their own IP ranges at the edge. So traffic
2:37
still comes from the IP space they own, even
2:40
though Cloudflare is doing the heavy lifting.
2:43
The scary part is what makes it powerful. It's
2:45
real routing. BGP announcements. Prefixes. Internet
2:49
reachability. Cloudflare's postmortem is one
2:52
of those that's uncomfortable because it's relatable.
2:54
A cleanup subtask was designed to remove BYOIP
2:58
prefixes that should be removed. Basically, automate
3:01
a manual workflow so support and ops don't have
3:04
to do it by hand. But the cleanup job had a bug.
3:08
The query it ran against their own internal API
3:10
returned more prefixes than it should have. And
3:14
the system started withdrawing prefixes that
3:16
were still in use. At peak, they say around 1,100
3:19
prefixes were unintentionally withdrawn
3:23
for a window of time. And then the incident as
3:27
a whole took hours to fully unwind. Because even
3:30
after you stop the bleeding, you are stuck restoring
3:33
state across a big distributed system. This is
3:36
the part I want to linger on because it's the
3:38
part that teams tend to underestimate. When you
3:41
withdraw routes, a rollback isn't just flipping
3:44
a flag. You are waiting on propagation. You are
3:47
dealing with caches. You are dealing with partial
3:49
state. You are dealing with bindings and dependencies
3:52
that get removed or drift. Cloudflare even published
3:56
customer guidance during the incident that some
3:58
customers could self-remediate by going into
4:01
the dashboard and re-advertising their prefixes.
4:04
That right there is the difference between this
4:07
is bad and this is catastrophic. If your customers
4:09
can do something to recover without waiting for
4:12
you, you just bought yourself time. So here's
4:14
the ops lesson. If you have automation that can
4:17
withdraw routes, revoke advertisements, delete
4:20
edge bindings, remove cert associations, change
4:23
DNS at scale, or anything in that reachability
4:26
control plane bucket, treat it like production
4:29
deployment tooling, not like a cron job. This
4:32
is the kind of automation that needs friction
4:34
on purpose. It needs a safety check that says,
4:37
if this job is about to touch more than N prefixes,
4:40
stop. It needs canary behavior. It needs a dry
4:44
run mode that produces a diff humans can look
4:47
at. It needs a circuit breaker that triggers
4:49
on anomaly, not on service down. Because the
4:53
failure mode here is always the same. One helper
4:56
job. One bug. One correlated blast radius. And
5:00
nobody sleeps. Practical Monday take. Ask, what
5:03
systems do we have where a background job can
5:06
make reachability go away? That could be BGP,
5:09
but it could also be DNS automation, certificate
5:12
rotation automation, firewall rule cleanup, CDN
5:16
rule cleanup, or even a Terraform pipeline with
5:20
the ability to destroy and recreate shared infrastructure.
5:23
Then ask one more question. If that job goes
5:26
sideways, what's the fastest human-safe rollback?
5:29
And does it require tribal knowledge? If the
5:32
answer is we'd figure it out, congrats. You just
5:34
found your next incident. All right, story two
5:41
is one of my favorite kinds of postmortems, even
5:44
though it's painful. Clerk had a system outage,
5:47
and the root cause was an inefficient query plan
5:50
caused by Postgres auto-analyze. So nothing
5:53
exotic, no kernel panic, no region failure, no
5:56
someone deleted prod. Just Postgres doing normal
6:00
Postgres things. Here's the chain. Auto-analyze
6:03
runs. Statistics get updated. That causes a query
6:06
plan flip. Same query, different plan. That plan
6:10
is dramatically worse, which drags database performance
6:13
down, which backs up request handlers, which
6:16
turns into queuing. And then almost all traffic
6:18
starts getting 429'd without being handled. They
6:22
call out that over 95% of traffic was returning
6:25
429 because the system was basically shedding
6:28
load while it was drowning. The fix was pretty
6:30
direct. Manually rerun analyze for the table
6:34
involved, which changed the stats and brought
6:37
the query plan back to the good version. What's
6:39
interesting is the detail they share about why
6:42
the planner got it wrong. It had to estimate
6:44
how many rows would match a condition, and it
6:47
used a statistic that depends on sampling. Their
6:50
data had a column where almost everything was
6:53
null. And because the sample was small, the sample
6:56
ended up being basically all nulls. So the planner
6:59
overconfidently assumed 100% nulls. The query
7:03
planner then expected a certain part of the query
7:05
to return basically zero rows. But in reality,
7:09
it returned something like 17,000 rows. So the
7:12
plan it picked was good for zero rows and terrible
7:15
for 17,000. That mismatch is the kind of thing
7:19
that doesn't show up in unit tests. It shows
7:22
up on a Thursday morning when auto-analyze decides
7:25
it's time. So why does this matter for platform
7:28
and SRE folks? Because a lot of teams still think
7:31
in binary failure states. Database up or down.
7:35
Service up or down. But a huge chunk of production
7:38
incidents live in the gray zone. Database is
7:41
up, but slow. The service is up, but queuing.
7:44
Your health checks pass, but users are screaming.
7:47
They even point out that their automatic failover
7:50
didn't trigger because Postgres was online, just
7:52
degraded. So it didn't match the failover now
7:55
conditions. And this is where the playbook needs
7:58
to evolve. If your failover only triggers on
8:01
dead, you're going to get smoked by limping.
8:04
Clerk's remediation is worth stealing. They talk
8:07
about adding alerting specifically for query
8:09
plan flips because it's sudden and severe. They
8:12
also talk about a mitigation that offloaded session
8:15
token generation outside their core session API
8:18
to reduce backend load and help people stay logged
8:22
in even while other parts of the system were
8:25
unhealthy. That's a classic reliability move.
8:28
Protect the critical path even if the full feature
8:31
set is degraded. And they're also honest about
8:34
communication. They say their updates were too
8:37
infrequent, their initial status severity didn't
8:39
match impact, and their first update was too
8:41
slow. Every team thinks they're good at comms
8:44
until they're in a real outage. So Monday take.
8:47
If you run Postgres, ask yourself, do we have
8:49
any alerting that detects this query suddenly
8:52
got 50 times slower? Or this query changed plan?
8:55
Or do we just wait for CPU graphs to scream?
8:58
And separately, do we have a degraded mode strategy
9:01
for the handful of flows that absolutely cannot
9:04
be down? Auth, token validation, session refresh,
9:08
payment, whatever it is for your product. Because
9:11
the best incident is the one where users can
9:14
still do the one thing they really need, even
9:16
if the rest is on fire. All right, story three.
9:23
This one's floating around as AI took down AWS,
9:26
which is obviously the headline everybody wants.
9:29
But the more useful way to look at it is this
9:32
is a permission story. Reuters reported that
9:34
AWS had a disruption tied to a cost management
9:37
feature, and the reporting connected it to AWS's
9:41
internal tooling called Kiro. AWS responded
9:45
by saying it was limited to a single service,
9:47
not AWS broadly. It was limited to one region,
9:50
and it was user error. Then AWS published their
9:53
own statement, basically saying, the interruption
9:56
was due to misconfigured access controls, not
9:59
AI. They also say they added additional safeguards,
10:02
including mandatory peer review for production
10:05
access. You can believe whichever framing you
10:07
want, but the operational takeaway is identical.
10:10
If you have a tool that can take action, it is
10:13
part of your control plane. Whether it's an agent,
10:16
a bot, a pipeline, Terraform, a chatbot command,
10:20
a script that runs at 2am, or an internal self-service
10:22
portal. The moment it can touch production,
10:25
you need to treat it like production access.
10:28
And the fastest way to get hurt here is letting
10:30
convenience win over boundaries. So what do
10:33
good boundaries look like in real teams? It looks
10:36
like separation between read-only and write.
10:38
It looks like separation between propose a plan
10:41
and execute the plan. It looks like destructive
10:43
actions requiring explicit approvals. Not just
10:46
it ran in automation, so it must be fine. It
10:49
looks like a break-glass path for emergencies
10:52
that is auditable and annoying enough that nobody
10:55
uses it casually. And it looks like logging actual
10:58
tool actions, not just chat transcripts. Not
11:01
the bot said it would delete things. I mean,
11:04
who called what API, with what role, against
11:07
what resources, and what changed? Because in
11:10
this new era, the hardest incidents will be the
11:13
ones where everything moved fast. And nobody
11:16
can confidently answer what actually happened.
11:19
Monday take. If your org is messing with agents,
11:22
or even just adding more automation, do one simple
11:26
exercise. Pick one destructive action that exists
11:29
in your environment. Like deleting an environment,
11:32
rotating a secret, revoking access, withdrawing
11:35
a route, disabling a control. Now ask, can anything
11:38
do this without a second human being involved?
11:42
If yes, that's your risk. Not because AI is dangerous,
11:45
but because any tool with power plus weak guardrails
11:48
is dangerous. Quick platform note. AWS open sourced
11:56
the EKS node monitoring agent. The big pitch
11:59
is it monitors node-level system, storage, networking,
12:03
and accelerator issues and publishes them as
12:06
node conditions. And EKS can use those conditions
12:09
to drive automatic node repair. If you've ever
12:12
had a weird node that's half dead and you ended
12:16
up SSHing in, tailing kubelet logs, checking
12:19
disk pressure, and basically doing detective
12:22
work while your workloads suffer, that's the
12:24
exact pain that this is aimed at. I like this
12:27
category of tooling because it's not another
12:30
dashboard. It's turn node weirdness into a signal
12:33
that the control plane can act on. If you are
12:36
on EKS and you've had node flakiness incidents,
12:39
it's worth a look. All right, time for the lightning
12:48
round. I'm keeping this tight, four items and
12:51
all high signal. First, Grafana. There's a high-severity
12:54
advisory for cross-dashboard privilege
12:57
escalation via permission management. The short
13:00
version is, if someone has permission management
13:02
rights on one dashboard, under certain conditions,
13:06
they can read and modify permissions on other
13:09
dashboards. If you run Grafana in a shared environment,
13:12
this is one of those check your version and patch
13:15
stories, not a someday story. Second, runc CVEs.
13:20
AWS put out a bulletin for recently disclosed
13:23
runc issues that affect container runtimes when
13:26
launching new containers. I'm not going to pretend
13:29
everyone patches this instantly because the reality
13:32
is it depends on how you get your node OS and
13:35
runtime updates. But this is still a reminder
13:38
to keep node rollouts and runtime patching as
13:41
a normal muscle, not a panic button. Third, GitLab
13:45
patch train. GitLab shipped patch releases that
13:48
include important bug and security fixes, and
13:51
they strongly recommend self-managed installs
13:54
upgrade. If you self-host GitLab, you already
13:57
know the deal. Don't let we'll do it later become
13:59
we got popped because we were busy. Fourth, Atlassian's
14:03
February security bulletin. This is for the enterprise
14:06
crowd still running data center products. They
14:09
are calling out a pile of high severity and critical
14:12
severity vulnerabilities fixed in recent product
14:15
releases. Same story. If you run it, patch it.
14:19
If you don't run it, thank your lucky stars and
14:21
keep scrolling. All right, human closer. There's
14:31
an ACM Queue piece called SRE Is Anti-Transactional,
14:35
and it nails something that every platform team
14:37
eventually runs into. Tickets don't scale. Manual
14:41
work scales linearly. More requests means more
14:44
humans. And that is how you turn a platform team
14:47
into a help desk with pager fatigue. The SRE
14:50
instinct is to build systems that do work for
14:53
you. Not because you hate helping people, but
14:56
because you want the systems to be reliable without
14:59
requiring human glue for every small thing. And
15:02
honestly, this week's stories are all versions
15:04
of that same theme. Cloudflare tried to automate
15:07
a workflow that used to be manual. The idea was
15:10
right, but the guardrails weren't strong enough.
15:13
Clerk got hit by a database behavior that didn't
15:16
trip the usual failover assumptions. And they
15:19
are evolving their system so the critical flows
15:21
can survive partial failure. And AWS is in the
15:25
middle of a bigger shift where tools are doing
15:27
more, faster, and the only thing standing between
15:30
helpful and incident is how you design boundaries
15:33
and approvals. So if you are a platform engineer
15:36
or an SRE listening to this and you feel like
15:38
you are buried in tickets, here's the move. Pick
15:41
one repeated transactional pain this week and
15:44
don't solve it with another runbook. Solve it
15:46
with an API, a self-service workflow, or automation
15:50
with proper guardrails. Okay, time for a quick
15:52
recap before we wrap. Cloudflare is a reminder
15:55
that helper jobs are never just a cron. If automation
15:58
can touch reachability, routing, DNS, certs,
16:02
or anything shared, it needs production-grade
16:05
guardrails and a rollback you can execute under
16:08
stress. Clerk is the reminder that up but slow
16:11
can be worse than down. If your alerting and
16:14
failover only triggers on dead systems, you are
16:17
going to miss the incidents that actually hurt.
16:20
And the AWS Kiro story, no matter how you frame
16:23
it, comes back to permissions. If a tool can
16:26
execute changes, separate propose versus execute.
16:29
Require approvals for destructive actions. And
16:33
log the actual actions taken. Lightning round
16:36
recap. Grafana's permission escalation risk.
16:39
runc runtime CVEs, GitLab patch releases, and
16:43
Atlassian's monthly security bulletin. Links for
16:46
all of these are in the show notes. And the human
16:48
takeaway: SRE is anti-transactional for a reason.
16:51
Tickets don't scale. Build self-service and
16:54
guardrails so humans stop being the interface
16:57
for every little thing. All right, that's it
17:00
for this week. If you want the full receipts
17:02
and links, the full show notes are on shipitweekly.fm.
17:05
And the curated weekly brief is on oncallbrief.com.
17:09
If you got value out of this, follow or
17:12
subscribe wherever you listen. And subscribe
17:14
on YouTube if you are watching the video version.
17:17
And if you've got an on-call buddy, send them
17:19
this episode. I'm Brian for Ship It Weekly, and
17:21
I'll see you next week.
For this episode, I wanted to anchor on something I think every ops team learns the hard way.
The incidents that hurt the most are rarely the big obvious deploys.
It’s the background systems. The reconcilers. The cleanup jobs. The “this should be safe because it’s routine” automation.
Because those jobs are usually touching shared truth.
Routing state. Prefix state. Permissions. Database statistics. The stuff everything else quietly depends on.
And when that shared truth shifts under you, you don’t just get a bug. You get reachability problems. You get cascading retries. You get queueing. You get “everything is up but nothing works.”
Cloudflare BYOIP is the cleanest example this week.
This wasn’t “somebody fat-fingered BGP.” It was a buggy cleanup sub-task that queried the Addressing API wrong and ended up withdrawing about 1,100 BYOIP prefixes before they could revert the change. Some customers could re-advertise their prefixes from the dashboard, but the real work was restoring prefix configuration state back to normal.
That’s the lesson. If you have automation that can touch reachability, it is production control plane. Treat it like prod deploy tooling, not like “just a job.” Put caps on it. Put canaries on it. Put a circuit breaker on it. And most importantly, build rollback that does not require tribal knowledge at 3am.
Cloudflare outage postmortem
https://blog.cloudflare.com/cloudflare-outage-february-20-2026
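If you want a picture of what "put caps on it" and "dry run first" can look like, here is a minimal Python sketch. To be clear, this is not Cloudflare's code. The names `plan_withdrawals`, `apply_plan`, `BlastRadiusExceeded`, and the `max_changes` cap are all made up for illustration, and the `withdraw` callable stands in for whatever your real routing or DNS API is.

```python
from dataclasses import dataclass


class BlastRadiusExceeded(Exception):
    """Raised when a job wants to change more items than the cap allows."""


@dataclass
class Plan:
    to_withdraw: list  # prefixes the job intends to withdraw
    dry_run: bool      # when True, apply_plan() changes nothing


def plan_withdrawals(candidates, max_changes, dry_run=True):
    """Build a reviewable plan, or refuse outright if it exceeds the cap."""
    unique = sorted(set(candidates))
    if len(unique) > max_changes:
        # Hypothetical guardrail: stop before touching anything.
        raise BlastRadiusExceeded(
            f"refusing to withdraw {len(unique)} prefixes (cap is {max_changes})"
        )
    return Plan(to_withdraw=unique, dry_run=dry_run)


def apply_plan(plan, withdraw):
    """Call `withdraw` for each prefix unless the plan is a dry run."""
    if plan.dry_run:
        return []  # humans review plan.to_withdraw as the diff
    applied = []
    for prefix in plan.to_withdraw:
        withdraw(prefix)
        applied.append(prefix)
    return applied
```

The point of the design is that the default is the safe path: a plan is a dry run unless someone says otherwise, and a query that suddenly returns far more prefixes than expected fails loudly instead of executing.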
Next, Clerk’s postmortem is the same theme, just inside Postgres instead of BGP.
Auto analyze ran, statistics shifted, a query plan flipped into something awful, and suddenly the system is shedding load so hard that most traffic is coming back 429 without even being handled. They fixed it by forcing ANALYZE again, and then they got really explicit about hardening failover so it can trigger on “any failure at origin,” not just “Postgres is down.”
This is why I keep saying “degraded is harder than down.”
Most teams have alarms for dead things. A lot fewer teams have alarms for “same query, different plan” or “latency is spiking but nothing is technically failing.” That gap is where the really ugly incidents live.
Clerk outage postmortem
https://clerk.com/blog/2026-02-19-system-outage-postmortem
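For the "same query got way slower" alarm, the core check is simple enough to sketch in a few lines of Python. This is not what Clerk built. It assumes you already collect a mean latency per query from somewhere (for Postgres, per-query statistics such as those from pg_stat_statements are one possible source), and the function name `find_regressions`, the 50x `factor`, and the `min_baseline_ms` floor are illustrative choices, not recommendations.

```python
def find_regressions(baseline_ms, current_ms, factor=50.0, min_baseline_ms=0.1):
    """Return {query_id: slowdown_ratio} for queries that got `factor` times slower.

    baseline_ms and current_ms map a query identifier to its mean latency
    in milliseconds for two time windows.
    """
    flagged = {}
    for query_id, now in current_ms.items():
        before = baseline_ms.get(query_id)
        if before is None:
            continue  # new query, nothing to compare against
        # Floor the baseline so near-zero values do not produce huge ratios.
        before = max(before, min_baseline_ms)
        ratio = now / before
        if ratio >= factor:
            flagged[query_id] = ratio
    return flagged
```

A check like this fires on relative change per query, which is exactly what a plan flip looks like from the outside, rather than waiting for the whole database to look unhealthy.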
And then you’ve got the AWS Kiro story, which is going to get summarized everywhere as “AI took down AWS.”
AWS’s response is basically: no, it was misconfigured access controls, and they added safeguards like mandatory peer review for production access. Reuters covered the reporting around it, and AWS published their own statement pushing back.
Here’s my take.
Whether it was an agent or a bash script, it’s the same root problem: a tool got permissions it shouldn’t have had.
So the practical move is boring, but it’s the whole game.
Separate propose from execute.
Let tools draft plans, diffs, PRs, and recommendations all day long.
But when it comes to destructive actions, make that path intentionally gated, intentionally scoped, and painfully auditable.
AWS response on Kiro
https://www.aboutamazon.com/news/aws/aws-service-outage-ai-bot-kiro
AWS outage reporting (Reuters)
https://www.reuters.com/business/retail-consumer/amazons-cloud-unit-hit-by-least-two-outages-involving-ai-tools-ft-says-2026-02-20
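Here is a rough Python sketch of "separate propose from execute," with a second-person approval gate on destructive actions and an audit record of what actually ran. Nothing here reflects how AWS or Kiro works. `Proposal`, `approve`, `execute`, and `ApprovalRequired` are hypothetical names, and `run` is a placeholder for whatever actually calls your production API.

```python
from dataclasses import dataclass, field


class ApprovalRequired(Exception):
    """Raised when a destructive proposal has no second-person approval."""


@dataclass
class Proposal:
    action: str
    resource: str
    proposed_by: str
    destructive: bool = False
    approvals: set = field(default_factory=set)


def approve(proposal, reviewer):
    """Record an approval; the proposer cannot approve their own change."""
    if reviewer == proposal.proposed_by:
        raise ValueError("the proposer cannot approve their own change")
    proposal.approvals.add(reviewer)


def execute(proposal, run, audit_log):
    """Run the action only if the gate passes, then log what was done."""
    if proposal.destructive and not proposal.approvals:
        raise ApprovalRequired(
            f"{proposal.action} on {proposal.resource} needs a reviewer"
        )
    run(proposal.action, proposal.resource)
    # Log the action taken, not the conversation that led to it.
    audit_log.append({
        "action": proposal.action,
        "resource": proposal.resource,
        "proposed_by": proposal.proposed_by,
        "approved_by": sorted(proposal.approvals),
    })
```

Tools, agents, and scripts can create `Proposal` objects all day. The only path that touches production is `execute`, and for destructive actions it will not run without a second identity on record.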
Quick platform note before we move on.
AWS open-sourced the EKS Node Monitoring Agent, which is aimed at detecting node-level issues and surfacing them as signals EKS can act on, including automated node repair paths. This is one of those “make the pager quieter” features that I actually like seeing.
EKS Node Monitoring Agent
https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-eks-node-monitoring-agent-open-source
Lightning round quick thoughts.
Grafana has a high severity issue where a user with permission management rights on one dashboard could modify permissions on other dashboards. If you run Grafana in a shared org and you’ve got a lot of teams in there, that’s a “patch it” item.
https://grafana.com/security/security-advisories/cve-2026-21721
AWS published a bulletin on runc CVEs affecting container runtime behavior when launching new containers. The evergreen reminder is still true: containers are not a security boundary, and runtime bugs turn into host risk depending on how you’re running workloads.
https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024
GitLab shipped patch releases 18.6.1, 18.5.3, 18.4.5. If you self-host GitLab, you already know the rule. Staying behind becomes “we’ll do it later” until it becomes a weekend incident.
https://about.gitlab.com/releases/2025/11/26/patch-release-gitlab-18-6-1-released
And Atlassian’s February bulletin is the monthly reminder that on-prem Data Center products are a patch treadmill. They call out a pile of high severity vulns and critical severity ones fixed in newer versions.
https://confluence.atlassian.com/security/security-bulletin-february-17-2026-1722256046.html
Human closer.
ACM Queue ran a piece called “SRE Is Anti-Transactional,” and it’s basically describing the exact emotional arc behind all of these stories.
SRE and platform teams aren’t trying to dodge work.
They’re trying to move the org away from manual, transactional toil, toward systems that do safe work by default, and only involve humans for exceptions.
But this week is a reminder that you don’t get autonomy by giving tools more power.
You get autonomy by engineering the guardrails first, then widening the lane over time.
SRE Is Anti-Transactional
https://queue.acm.org/detail.cfm?id=3773094
That ties the whole episode together.
Cloudflare automated cleanup touching routing state.
Clerk got hit by a “system is up but behavior changed” database failure mode.
AWS is reinforcing that permissions are still the sharp edge, no matter what tool is holding the knife.
Defaults shift. Background systems become dependencies. Guardrails decide whether it’s a story you learn from, or a story you apologize for.
Full show notes are on shipitweekly.fm. The weekly curated brief is on oncallbrief.com.
And if you got value out of this episode, follow or subscribe wherever you listen. Helps a ton.