Host Commentary

For this episode, I wanted to anchor on something I think every ops team learns the hard way.

The incidents that hurt the most are rarely the big obvious deploys.

It’s the background systems. The reconcilers. The cleanup jobs. The “this should be safe because it’s routine” automation.

Because those jobs are usually touching shared truth.

Routing state. Prefix state. Permissions. Database statistics. The stuff everything else quietly depends on.

And when that shared truth shifts under you, you don’t just get a bug. You get reachability problems. You get cascading retries. You get queueing. You get “everything is up but nothing works.”

Cloudflare BYOIP is the cleanest example this week.

This wasn’t “somebody fat-fingered BGP.” It was a buggy cleanup sub-task that made a bad query against the Addressing API and ended up withdrawing about 1,100 BYOIP prefixes before the change could be reverted. Some customers could re-advertise their prefixes from the dashboard, but the real work was restoring prefix configuration state back to normal.

That’s the lesson. If you have automation that can touch reachability, it is production control plane. Treat it like prod deploy tooling, not like “just a job.” Put caps on it. Put canaries on it. Put a circuit breaker on it. And most importantly, build rollback that does not require tribal knowledge at 3am.
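To make that concrete, here is a minimal sketch of the “caps, canaries, circuit breaker” idea. Nothing here comes from Cloudflare’s actual tooling; the names (guarded_cleanup, withdraw, the thresholds) are all hypothetical, and the pattern is the point.

```python
# Hypothetical guardrails around a destructive cleanup job.
# Caps refuse oversized batches, canaries go first, and a
# breaker halts the run on repeated failures.

class CircuitOpen(Exception):
    """Raised when a guardrail halts the run for human review."""

def guarded_cleanup(candidates, withdraw, max_per_run=25,
                    canary_count=3, max_failures=2):
    """Run `withdraw` over `candidates` with basic safety rails."""
    if len(candidates) > max_per_run:
        # A "routine" job that suddenly wants to touch 1,100 prefixes
        # should stop and page a human, not proceed.
        raise CircuitOpen(
            f"{len(candidates)} candidates exceeds cap of {max_per_run}")
    failures, done = 0, []
    for i, item in enumerate(candidates):
        try:
            withdraw(item)
            done.append(item)
        except Exception:
            failures += 1
            if failures > max_failures:
                raise CircuitOpen(f"{failures} failures; halting run")
        if i + 1 == canary_count and failures:
            # Any canary failure stops the run before the bulk proceeds.
            raise CircuitOpen("canary failure; halting before bulk")
    return done
```

The exact thresholds matter less than the shape: the job can never do more damage per run than the cap allows, and a bad batch trips early instead of finishing.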

Cloudflare outage postmortem

https://blog.cloudflare.com/cloudflare-outage-february-20-2026

Next, Clerk’s postmortem is the same theme, just inside Postgres instead of BGP.

Auto-analyze ran, statistics shifted, a query plan flipped into something awful, and suddenly the system was shedding load so hard that most traffic came back as 429s without ever being handled. They fixed it by forcing ANALYZE again, and then they got really explicit about hardening failover so it can trigger on “any failure at origin,” not just “Postgres is down.”

This is why I keep saying “degraded is harder than down.”

Most teams have alarms for dead things. A lot fewer teams have alarms for “same query, different plan” or “latency is spiking but nothing is technically failing.” That gap is where the really ugly incidents live.
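A cheap version of a “same query, different plan” alarm can be sketched like this. I’m assuming you already capture plan text somewhere (for example via Postgres’s auto_explain module); the fingerprinting below is just a hash of that text, and all the names are illustrative.

```python
import hashlib

# Hypothetical plan-flip detector: remember a baseline fingerprint
# per query and flag when the plan text changes, even though the
# database is technically "up" the whole time.

baselines = {}  # query_id -> plan fingerprint

def fingerprint(plan_text: str) -> str:
    return hashlib.sha256(plan_text.encode()).hexdigest()[:12]

def check_plan(query_id: str, plan_text: str) -> bool:
    """Return True if the plan changed since the recorded baseline."""
    fp = fingerprint(plan_text)
    old = baselines.setdefault(query_id, fp)
    if fp != old:
        baselines[query_id] = fp  # re-baseline after alerting once
        return True  # page a human: same query, different plan
    return False
```

Wired into log ingestion, that one boolean is the alarm most teams don’t have.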

Clerk outage postmortem

https://clerk.com/blog/2026-02-19-system-outage-postmortem

And then you’ve got the AWS Kiro story, which is going to get summarized everywhere as “AI took down AWS.”

AWS’s response is basically: no, it was misconfigured access controls, and they added safeguards like mandatory peer review for production access. Reuters covered the story, and AWS published its own statement pushing back.

Here’s my take.

Whether it was an agent or a bash script, it’s the same root problem: a tool got permissions it shouldn’t have had.

So the practical move is boring, but it’s the whole game.

Separate propose from execute.

Let tools draft plans, diffs, PRs, and recommendations all day long.

But when it comes to destructive actions, make that path intentionally gated, intentionally scoped, and painfully auditable.
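The propose-versus-execute split above can be sketched in a few lines. This is a hypothetical illustration, not any real AWS or Kiro API: tools create proposals freely, only an explicitly approved proposal can reach the destructive path, and every step lands in an audit log.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical propose/execute gate. A real system would tie
# approval to SSO identity, a second reviewer, and scoped
# credentials; these names are illustrative only.

@dataclass
class Proposal:
    action: str
    target: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    approved_by: Optional[str] = None

audit_log = []  # (event, proposal_id, actor) tuples

def approve(proposal: Proposal, human: str) -> Proposal:
    proposal.approved_by = human
    audit_log.append(("approved", proposal.id, human))
    return proposal

def execute(proposal: Proposal) -> str:
    """The only path to the destructive action, gated on approval."""
    if not proposal.approved_by:
        audit_log.append(("denied", proposal.id, None))
        raise PermissionError(f"proposal {proposal.id} has no approver")
    audit_log.append(("executed", proposal.id, proposal.approved_by))
    return f"{proposal.action} {proposal.target}"
```

The tool, whether agent or bash script, only ever holds the power to propose; the execute path belongs to the gate.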

AWS response on Kiro

https://www.aboutamazon.com/news/aws/aws-service-outage-ai-bot-kiro

AWS outage reporting (Reuters)

https://www.reuters.com/business/retail-consumer/amazons-cloud-unit-hit-by-least-two-outages-involving-ai-tools-ft-says-2026-02-20

Quick platform note before we move on.

AWS open-sourced the EKS Node Monitoring Agent, which is aimed at detecting node-level issues and surfacing them as signals EKS can act on, including automated node repair paths. This is one of those “make the pager quieter” features that I actually like seeing. 

EKS Node Monitoring Agent

https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-eks-node-monitoring-agent-open-source

Lightning round quick thoughts.

Grafana has a high-severity issue where a user with permission-management rights on one dashboard could modify permissions on other dashboards. If you run Grafana in a shared org with a lot of teams in it, that’s a “patch it” item.

https://grafana.com/security/security-advisories/cve-2026-21721

AWS published a bulletin on runc CVEs affecting container runtime behavior when launching new containers. The evergreen reminder is still true: containers are not a security boundary, and runtime bugs turn into host risk depending on how you’re running workloads. 

https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024

GitLab shipped patch releases 18.6.1, 18.5.3, and 18.4.5. If you self-host GitLab, you already know the rule: falling behind starts as “we’ll do it later” and ends as a weekend incident.

https://about.gitlab.com/releases/2025/11/26/patch-release-gitlab-18-6-1-released

And Atlassian’s February bulletin is the monthly reminder that on-prem Data Center products are a patch treadmill. They call out a pile of high-severity vulns, plus some critical ones, all fixed in newer versions.

https://confluence.atlassian.com/security/security-bulletin-february-17-2026-1722256046.html

Human closer.

ACM Queue ran a piece called “SRE Is Anti-Transactional,” and it’s basically describing the exact emotional arc behind all of these stories.

SRE and platform teams aren’t trying to dodge work.

They’re trying to move the org away from manual, transactional toil, toward systems that do safe work by default, and only involve humans for exceptions.

But this week is a reminder that you don’t get autonomy by giving tools more power.

You get autonomy by engineering the guardrails first, then widening the lane over time. 

SRE Is Anti-Transactional

https://queue.acm.org/detail.cfm?id=3773094

That ties the whole episode together.

Cloudflare got bitten by automated cleanup touching routing state.

Clerk got hit by a “system is up but behavior changed” database failure mode.

AWS is reinforcing that permissions are still the sharp edge, no matter what tool is holding the knife.

Defaults shift. Background systems become dependencies. Guardrails decide whether it’s a story you learn from, or a story you apologize for.

Full show notes are on shipitweekly.fm. The weekly curated brief is on oncallbrief.com.

And if you got value out of this episode, follow or subscribe wherever you listen. Helps a ton. 

Show Notes

This week on Ship It Weekly, Brian covers three “automation meets reality” stories that every DevOps, SRE, and platform team can learn from.

Cloudflare accidentally withdrew customer BYOIP prefixes due to a buggy cleanup task, Clerk got knocked over by a Postgres auto-analyze query plan flip, and AWS responded to reports about its internal Kiro tooling by framing the incident as misconfigured access controls. Plus: a quick EKS node monitoring update, and a tight security lightning round.

Links

Cloudflare BYOIP outage postmortem https://blog.cloudflare.com/cloudflare-outage-february-20-2026/

Clerk outage postmortem (Feb 19, 2026) https://clerk.com/blog/2026-02-19-system-outage-postmortem

AWS outage report (Reuters) https://www.reuters.com/business/retail-consumer/amazons-cloud-unit-hit-by-least-two-outages-involving-ai-tools-ft-says-2026-02-20/

AWS response on Kiro + access controls https://www.aboutamazon.com/news/aws/aws-service-outage-ai-bot-kiro

EKS Node Monitoring Agent (open source) https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-eks-node-monitoring-agent-open-source/

Grafana CVE-2026-21721 https://grafana.com/security/security-advisories/cve-2026-21721/

runc CVEs (AWS-2025-024) https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024/

GitLab patch releases https://about.gitlab.com/releases/2025/11/26/patch-release-gitlab-18-6-1-released/

Atlassian Feb 2026 security bulletin https://confluence.atlassian.com/security/security-bulletin-february-17-2026-1722256046.html

Human story: SRE Is Anti-Transactional (ACM Queue) https://queue.acm.org/detail.cfm?id=3773094

More episodes and show notes at https://shipitweekly.fm

On Call Briefs at: https://oncallbrief.com