Cloudflare BYOIP BGP Withdrawals, Clerk’s Postgres Query-Plan Flip Outage, and AWS Kiro Permissions Lessons (Grafana Privesc + runc CVEs)

Transcript

This week is basically a masterclass in "the system

did exactly what we built it to do." Cloudflare

automated something that touches routing and

a bug turned into real BGP withdrawals for customer

prefixes. Clerk got taken out by a query plan

flip. Nothing crashed. The database was up. It

was just slow enough to light everything on fire.

And AWS is in the middle of a new era where internal

tools can take bigger actions faster. And the

whole story comes back to permissions and guardrails.

If you run production, this is your week. Hey.

I'm Brian from Tellers Tech, and this is Ship

It Weekly, where we cover outages, releases,

and incident write-ups, then translate them

into what it actually means for your systems.

If you like the show, follow or subscribe wherever

you are listening. And if you've got an on-call

buddy, send them this episode. That share does

more than you'd expect. Also, quick plug, full

show notes live on shipitweekly.fm. And the

curated weekly brief is on oncallbrief.com.

All right, quick overview so you know where we're

going today. First, Cloudflare's BYOIP outage

where a cleanup job ended up withdrawing customer

prefixes. It's a really clean example of how

background automation can become production blast

radius if it touches routing or reachability.

Second, Clerk's outage from a Postgres auto-analyze

triggering a query plan flip. The database was

up, but performance tanked and the system started

shedding load. It's a great case study in why

degraded is sometimes harder than down. Third,

the AWS Kiro story and the follow-up response

from AWS. Regardless of the headline, the useful

lesson is permissions. If a tool can take actions,

it's part of your control plane and it needs

real boundaries and approvals. After that, I'll

do a quick platform note on the EKS node monitoring

agent going open source. Then a tight lightning

round. Then we'll close with a human section

on why SRE is basically allergic to ticket queues.

All right, story one. So Cloudflare had an outage

tied to BYOIP, bring your own IP. If you are

not familiar, the idea is simple. Customers want

to use their own IP ranges at the edge. So traffic

still comes from the IP space they own, even

though Cloudflare is doing the heavy lifting.

The scary part is what makes it powerful. It's

real routing. BGP announcements. Prefixes. Internet

reachability. Cloudflare's postmortem is one

of those that's uncomfortable because it's relatable.

A cleanup subtask was designed to remove BYOIP

prefixes that should be removed. Basically, automate

a manual workflow so support and ops don't have

to do it by hand. But the cleanup job had a bug.

The query it ran against their own internal API

returned more prefixes than it should have. And

the system started withdrawing prefixes that

were still in use. At peak, they say around

1,100 prefixes were unintentionally withdrawn

for a window of time. And then the incident as

a whole took hours to fully unwind. Because even

after you stop the bleeding, you are stuck restoring

state across a big distributed system. This is

the part I want to linger on because it's the

part that teams tend to underestimate. When you

withdraw routes, a rollback isn't just flipping

a flag. You are waiting on propagation. You are

dealing with caches. You are dealing with partial

state. You are dealing with bindings and dependencies

that get removed or drift. Cloudflare even published

customer guidance during the incident that some

customers could self-remediate by going into

the dashboard and re-advertising their prefixes.

That right there is the difference between "this

is bad" and "this is catastrophic." If your customers

can do something to recover without waiting for

you, you just bought yourself time. So here's

the ops lesson. If you have automation that can

withdraw routes, revoke advertisements, delete

edge bindings, remove cert associations, change

DNS at scale, or anything in that reachability

control plane bucket, treat it like production

deployment tooling, not like a cron job. This

is the kind of automation that needs friction

on purpose. It needs a safety check that says,

if this job is about to touch more than N prefixes,

stop. It needs canary behavior. It needs a dry

run mode that produces a diff humans can look

at. It needs a circuit breaker that triggers

on anomaly, not on service down. Because the

failure mode here is always the same. One helper

job. One bug. One correlated blast radius. And

nobody sleeps. Practical Monday take. Ask, what

systems do we have where a background job can

make reachability go away? That could be BGP,

but it could also be DNS automation, certificate

rotation automation, firewall rule cleanup, CDN

rule cleanup, or even a Terraform pipeline with

the ability to destroy and recreate shared infrastructure.
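To make the "more than N prefixes, stop" check and the dry-run diff concrete, here is a minimal sketch. It is not Cloudflare's implementation. Every name in it (`plan_withdrawals`, `guard`, `run_cleanup`, the `withdraw` callback, and the default limits) is hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


class BlastRadiusExceeded(Exception):
    """Raised when a cleanup run would touch more prefixes than allowed."""


@dataclass(frozen=True)
class WithdrawalPlan:
    to_withdraw: list[str]  # prefixes the job wants to withdraw
    kept: list[str]         # prefixes that stay advertised


def plan_withdrawals(advertised: set[str], candidates: set[str]) -> WithdrawalPlan:
    """Dry run: compute the diff without changing anything."""
    return WithdrawalPlan(
        to_withdraw=sorted(advertised & candidates),
        kept=sorted(advertised - candidates),
    )


def guard(plan: WithdrawalPlan, max_prefixes: int, max_fraction: float) -> None:
    """Refuse to proceed if the plan is bigger than the agreed limits."""
    count = len(plan.to_withdraw)
    total = count + len(plan.kept)
    if count > max_prefixes:
        raise BlastRadiusExceeded(f"{count} prefixes exceeds limit of {max_prefixes}")
    if total and count / total > max_fraction:
        raise BlastRadiusExceeded(f"{count}/{total} prefixes exceeds {max_fraction:.0%}")


def run_cleanup(advertised, candidates, withdraw, *, dry_run=True,
                max_prefixes=5, max_fraction=0.01):
    """Plan, check the limits, and only then (optionally) act.

    `withdraw` is a hypothetical callback that withdraws one prefix.
    The default limits are placeholders, not recommendations.
    """
    plan = plan_withdrawals(advertised, candidates)
    guard(plan, max_prefixes, max_fraction)
    if not dry_run:
        for prefix in plan.to_withdraw:
            withdraw(prefix)
    return plan
```

With `dry_run=True` the returned plan is the diff a human can review. The guard runs in both modes, so an oversized candidate list fails before a single withdrawal happens.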

Then ask one more question. If that job goes

sideways, what's the fastest human-safe rollback?

And does it require tribal knowledge? If the

answer is "we'd figure it out," congrats. You just

found your next incident. All right, story two

is one of my favorite kinds of postmortems, even

though it's painful. Clerk had a system outage,

and the root cause was an inefficient query plan

caused by Postgres auto-analyze. So nothing

exotic, no kernel panic, no region failure, no

"someone deleted prod." Just Postgres doing normal

Postgres things. Here's the chain. Auto-analyze

runs. Statistics get updated. That causes a query

plan flip. Same query, different plan. That plan

is dramatically worse, which drags database performance

down, which backs up request handlers, which

turns into queuing. And then almost all traffic

starts getting 429'd without being handled. They

call out that over 95% of traffic was returning

429 because the system was basically shedding

load while it was drowning. The fix was pretty

direct. Manually rerun ANALYZE for the table

involved, which changed the stats and brought

the query plan back to the good version. What's

interesting is the detail they share about why

the planner got it wrong. It had to estimate

how many rows would match a condition, and it

used a statistic that depends on sampling. Their

data had a column where almost everything was

null. And because the sample was small, the sample

ended up being basically all nulls. So the planner

overconfidently assumed 100% nulls. The query

planner then expected a certain part of the query

to return basically zero rows. But in reality,

it returned something like 17,000 rows. So the

plan it picked was good for zero rows and terrible

for 17,000. That mismatch is the kind of thing

that doesn't show up in unit tests. It shows

up on a Thursday morning when auto-analyze decides

it's time. So why does this matter for platform

and SRE folks? Because a lot of teams still think

in binary failure states. Database up or down.

Service up or down. But a huge chunk of production

incidents live in the gray zone. Database is

up, but slow. The service is up, but queuing.

Your health checks pass, but users are screaming.

They even point out that their automatic failover

didn't trigger because Postgres was online, just

degraded. So it didn't match the "fail over now"

conditions. And this is where the playbook needs

to evolve. If your failover only triggers on

dead, you're going to get smoked by limping.
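One way to picture a check that knows about "limping" is a three-state classifier instead of an up/down boolean. This is a minimal sketch, not Clerk's failover logic. The thresholds, the `classify` and `should_page` names, and the idea of paging after three consecutive bad checks are all assumptions for illustration.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # reachable, but too slow or too error-prone
    DEAD = "dead"          # not reachable at all


def classify(reachable: bool, p95_latency_ms: float, error_rate: float, *,
             latency_budget_ms: float = 250.0,
             error_budget: float = 0.05) -> Health:
    """Map raw signals to a health state. The budgets are placeholder values."""
    if not reachable:
        return Health.DEAD
    if p95_latency_ms > latency_budget_ms or error_rate > error_budget:
        return Health.DEGRADED
    return Health.HEALTHY


def should_page(history: list[Health], sustained: int = 3) -> bool:
    """Page when the last `sustained` checks were all non-healthy."""
    recent = history[-sustained:]
    return len(recent) == sustained and all(h is not Health.HEALTHY for h in recent)
```

The design point is that "reachable but over budget" becomes its own state you can alert or act on, rather than collapsing into "up."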

Clerk's remediation is worth stealing. They talk

about adding alerting specifically for query

plan flips because it's sudden and severe. They

also talk about a mitigation that offloaded session

token generation outside their core session API

to reduce backend load and help people stay logged

in even while other parts of the system were

unhealthy. That's a classic reliability move.

Protect the critical path even if the full feature

set is degraded. And they're also honest about

communication. They say their updates were too

infrequent, their initial status severity didn't

match impact, and their first update was too

slow. Every team thinks they're good at comms

until they're in a real outage. So Monday take.

If you run Postgres, ask yourself, do we have

any alerting that detects "this query suddenly

got 50 times slower"? Or "this query changed plan"?

Or do we just wait for CPU graphs to scream?
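The "suddenly got 50 times slower" check can be sketched as a comparison of recent per-query latency against a baseline window. This is only an illustration under assumptions: the query IDs, the two dictionaries of latency samples, and the `find_regressions` name are hypothetical, and how you collect the timings is left to whatever per-query stats source you already have.

```python
from statistics import median


def find_regressions(baseline_ms: dict[str, list[float]],
                     current_ms: dict[str, list[float]], *,
                     factor: float = 50.0,
                     min_samples: int = 5) -> dict[str, float]:
    """Return {query_id: slowdown_ratio} for queries at least `factor` times slower.

    Medians are compared so that one slow outlier does not trigger an alert.
    Queries with too few samples in either window are skipped.
    """
    regressions: dict[str, float] = {}
    for query_id, current in current_ms.items():
        baseline = baseline_ms.get(query_id, [])
        if len(baseline) < min_samples or len(current) < min_samples:
            continue
        base = median(baseline)
        if base <= 0:
            continue
        ratio = median(current) / base
        if ratio >= factor:
            regressions[query_id] = ratio
    return regressions
```

This watches the symptom (latency) rather than the plan itself, so it would catch a plan flip only indirectly. Detecting the plan change directly needs plan capture, which this sketch does not attempt.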

And separately, do we have a degraded mode strategy

for the handful of flows that absolutely cannot

be down? Auth, token validation, session refresh,

payment, whatever it is for your product. Because

the best incident is the one where users can

still do the one thing they really need, even

if the rest is on fire. All right, story three.

This one's floating around as "AI took down AWS,"

which is obviously the headline everybody wants.

But the more useful way to look at it is this

is a permission story. Reuters reported that

AWS had a disruption tied to a cost management

feature, and the reporting connected it to AWS's

internal AI tooling called Kiro. AWS responded

by saying it was limited to a single service,

not AWS broadly. It was limited to one region,

and it was user error. Then AWS published their

own statement, basically saying, the interruption

was due to misconfigured access controls, not

AI. They also say they added additional safeguards,

including mandatory peer review for production

access. You can believe whichever framing you

want, but the operational takeaway is identical.

If you have a tool that can take action, it is

part of your control plane. Whether it's an agent,

a bot, a pipeline, Terraform, a chatbot command,

a script that runs at 2am, or an internal

self-service portal. The moment it can touch production,

you need to treat it like production access.

And the fastest way to get hurt here is letting

convenience win over boundaries. So what do

good boundaries look like in real teams? It looks

like separation between read-only and write.

It looks like separation between "propose a plan"

and "execute the plan." It looks like destructive

actions requiring explicit approvals. Not just

"it ran in automation, so it must be fine." It

looks like a break-glass path for emergencies

that is auditable and annoying enough that nobody

uses it casually. And it looks like logging actual

tool actions, not just chat transcripts. Not

"the bot said it would delete things." I mean,

who called what API, with what role, against

what resources, and what changed? Because in

this new era, the hardest incidents will be the

ones where everything moved fast. And nobody

can confidently answer what actually happened.
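Those boundaries (propose versus execute, a second person approving destructive actions, and a log of what actually ran) can be sketched in a few lines. This is not AWS's or Kiro's design. `Plan`, `Executor`, `ApprovalRequired`, and the `apply` callback are hypothetical names for illustration, and a real system would persist the audit log somewhere tamper-resistant.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, Optional


class ApprovalRequired(Exception):
    """Raised when a destructive plan lacks a valid second approver."""


@dataclass
class Plan:
    proposer: str      # human or tool that proposed the change
    action: str        # e.g. "delete"
    resource: str      # e.g. "env/staging-7"
    destructive: bool
    approved_by: Optional[str] = None


@dataclass
class Executor:
    audit_log: list[dict] = field(default_factory=list)

    def approve(self, plan: Plan, approver: str) -> None:
        """Record an approval; the proposer cannot approve their own plan."""
        if approver == plan.proposer:
            raise ApprovalRequired("approver must differ from proposer")
        plan.approved_by = approver

    def execute(self, plan: Plan, role: str, apply: Callable[[Plan], str]) -> str:
        """Run `apply` only if approvals are in place, then log what happened."""
        if plan.destructive and plan.approved_by is None:
            raise ApprovalRequired(f"{plan.action} on {plan.resource} needs approval")
        result = apply(plan)
        # Log the real action: who, with what role, against what, and the outcome.
        self.audit_log.append({
            "at": datetime.now(timezone.utc).isoformat(),
            "proposer": plan.proposer,
            "approved_by": plan.approved_by,
            "role": role,
            "action": plan.action,
            "resource": plan.resource,
            "result": result,
        })
        return result
```

The log entry is written from the execution path, not from what the tool said it would do, which is the distinction between logging actions and logging chat transcripts.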

Monday take. If your org is messing with agents,

or even just adding more automation, do one simple

exercise. Pick one destructive action that exists

in your environment. Like deleting an environment,

rotating a secret, revoking access, withdrawing

a route, disabling a control. Now ask, can anything

do this without a second human being involved?

If yes, that's your risk. Not because AI is dangerous,

but because any tool with power plus weak guardrails

is dangerous. Quick platform note. AWS open sourced

the EKS node monitoring agent. The big pitch

is it monitors node-level system, storage, networking,

and accelerator issues and publishes them as

node conditions. And EKS can use those conditions

to drive automatic node repair. If you've ever

had a weird node that's half dead and you ended

up SSHing in, tailing kubelet logs, checking

disk pressure, and basically doing detective

work while your workloads suffer, that's the

exact pain that this is aimed at. I like this

category of tooling because it's not another

dashboard. It's "turn node weirdness into a signal

that the control plane can act on." If you are

on EKS and you've had node flakiness incidents,

it's worth a look. All right, time for the lightning

round. I'm keeping this tight, four items and

all high signal. First, Grafana. There's a

high-severity advisory for cross-dashboard privilege

escalation via permission management. The short

version is, if someone has permission management

rights on one dashboard, under certain conditions,

they can read and modify permissions on other

dashboards. If you run Grafana in a shared environment,

this is one of those "check your version and patch"

stories, not a "someday" story. Second, runc CVEs.

AWS put out a bulletin for recently disclosed

runc issues that affect container runtimes when

launching new containers. I'm not going to pretend

everyone patches this instantly because the reality

is it depends on how you get your node OS and

runtime updates. But this is still a reminder

to keep node rollouts and runtime patching as

a normal muscle, not a panic button. Third, GitLab

patch train. GitLab shipped patch releases that

include important bug and security fixes, and

they strongly recommend self-managed installs

upgrade. If you self-host GitLab, you already

know the deal. Don't let "we'll do it later" become

"we got popped because we were busy." Fourth, Atlassian's

February security bulletin. This is for the enterprise

crowd still running Data Center products. They

are calling out a pile of high-severity and critical-

severity vulnerabilities fixed in recent product

releases. Same story. If you run it, patch it.

If you don't run it, thank your lucky stars and

keep scrolling. All right, human closer. There's

an ACM Queue piece called "SRE Is Anti-Transactional,"

and it nails something that every platform team

eventually runs into. Tickets don't scale. Manual

work scales linearly. More requests means more

humans. And that is how you turn a platform team

into a help desk with pager fatigue. The SRE

instinct is to build systems that do work for

you. Not because you hate helping people, but

because you want the systems to be reliable without

requiring human glue for every small thing. And

honestly, this week's stories are all versions

of that same theme. Cloudflare tried to automate

a workflow that used to be manual. The idea was

right, but the guardrails weren't strong enough.

Clerk got hit by a database behavior that didn't

trip the usual failover assumptions. And they

are evolving their system so the critical flows

can survive partial failure. And AWS is in the

middle of a bigger shift where tools are doing

more, faster, and the only thing standing between

helpful and incident is how you design boundaries

and approvals. So if you are a platform engineer

or an SRE listening to this and you feel like

you are buried in tickets, here's the move. Pick

one repeated transactional pain this week and

don't solve it with another runbook. Solve it

with an API, a self-service workflow, or automation

with proper guardrails. Okay, time for a quick

recap before we wrap. Cloudflare is a reminder

that helper jobs are never just a cron. If automation

can touch reachability, routing, DNS, certs,

or anything shared, it needs production-grade

guardrails and a rollback you can execute under

stress. Clerk is the reminder that up but slow

can be worse than down. If your alerting and

failover only trigger on dead systems, you are

going to miss the incidents that actually hurt.

And the AWS Kiro story, no matter how you frame

it, comes back to permissions. If a tool can

execute changes, separate propose versus execute.

Require approvals for destructive actions. And

log the actual actions taken. Lightning round

recap. Grafana's permission escalation risk.

runc runtime CVEs, GitLab patch releases, and

Atlassian's monthly security bulletin. Links for

all of these are in the show notes. And the human

takeaway: SRE is anti-transactional for a reason.

Tickets don't scale. Build self-service and

guardrails so humans stop being the interface

for every little thing. All right, that's it

for this week. If you want the full receipts

and links, the full show notes are on shipitweekly.fm.

And the curated weekly brief is on oncallbrief.com.

If you got value out of this, follow or

subscribe wherever you listen. And subscribe

on YouTube if you are watching the video version.

And if you've got an on-call buddy, send them

this episode. I'm Brian for Ship It Weekly, and

I'll see you next week.
