0:00
This week is basically a masterclass in "the system
0:03
did exactly what we built it to do." Cloudflare
0:06
automated something that touches routing and
0:08
a bug turned into real BGP withdrawals for customer
0:12
prefixes. Clerk got taken out by a query plan
0:15
flip. Nothing crashed. The database was up. It
0:19
was just slow enough to light everything on fire.
0:21
And AWS is in the middle of a new era where internal
0:25
tools can take bigger actions faster. And the
0:28
whole story comes back to permissions and guardrails.
0:31
If you run production, this is your week. Hey.
0:50
I'm Brian from Tellers Tech, and this is Ship
0:53
It Weekly, where we cover outages, releases,
0:55
and incident write-ups, then translate them
0:58
into what it actually means for your systems.
1:01
If you like the show, follow or subscribe wherever
1:04
you are listening. And if you've got an on-call
1:06
buddy, send them this episode. That share does
1:09
more than you'd expect. Also, quick plug, full
1:12
show notes live on shipitweekly.fm. And the
1:16
curated weekly brief is on oncallbrief.com.
1:19
All right, quick overview so you know where we're
1:22
going today. First, Cloudflare's BYOIP outage
1:26
where a cleanup job ended up withdrawing customer
1:29
prefixes. It's a really clean example of how
1:32
background automation can become production blast
1:35
radius if it touches routing or reachability.
1:38
Second, Clerk's outage from a Postgres auto-analyze
1:42
triggering a query plan flip. The database was
1:45
up, but performance tanked and the system started
1:48
shedding load. It's a great case study in why
1:51
degraded is sometimes harder than down. Third,
1:54
the AWS Kiro story and the follow-up response
1:57
from AWS. Regardless of the headline, the useful
2:00
lesson is permissions. If a tool can take actions,
2:03
it's part of your control plane and it needs
2:05
real boundaries and approvals. After that, I'll
2:08
do a quick platform note on the EKS node monitoring
2:12
agent going open source. Then a tight lightning
2:14
round. Then we'll close with a human section
2:17
on why SRE is basically allergic to ticket queues.
2:21
All right, story one. So Cloudflare had an outage
2:28
tied to BYOIP, bring your own IP. If you are
2:32
not familiar, the idea is simple. Customers want
2:34
to use their own IP ranges at the edge. So traffic
2:37
still comes from the IP space they own, even
2:40
though Cloudflare is doing the heavy lifting.
2:43
The scary part is what makes it powerful. It's
2:45
real routing. BGP announcements. Prefixes. Internet
2:49
reachability. Cloudflare's postmortem is one
2:52
of those that's uncomfortable because it's relatable.
2:54
A cleanup subtask was designed to remove BYOIP
2:58
prefixes that should be removed. Basically, automate
3:01
a manual workflow so support and ops don't have
3:04
to do it by hand. But the cleanup job had a bug.
3:08
The query it ran against their own internal API
3:10
returned more prefixes than it should have. And
3:14
the system started withdrawing prefixes that
3:16
were still in use. At peak, they say around
3:19
1,100 prefixes were unintentionally withdrawn
3:23
for a window of time. And then the incident as
3:27
a whole took hours to fully unwind. Because even
3:30
after you stop the bleeding, you are stuck restoring
3:33
state across a big distributed system. This is
3:36
the part I want to linger on because it's the
3:38
part that teams tend to underestimate. When you
3:41
withdraw routes, a rollback isn't just flipping
3:44
a flag. You are waiting on propagation. You are
3:47
dealing with caches. You are dealing with partial
3:49
state. You are dealing with bindings and dependencies
3:52
that get removed or drift. Cloudflare even published
3:56
customer guidance during the incident that some
3:58
customers could self-remediate by going into
4:01
the dashboard and re-advertising their prefixes.
4:04
That right there is the difference between this
4:07
is bad and this is catastrophic. If your customers
4:09
can do something to recover without waiting for
4:12
you, you just bought yourself time. So here's
4:14
the ops lesson. If you have automation that can
4:17
withdraw routes, revoke advertisements, delete
4:20
edge bindings, remove cert associations, change
4:23
DNS at scale, or anything in that reachability
4:26
control plane bucket, treat it like production
4:29
deployment tooling, not like a cron job. This
4:32
is the kind of automation that needs friction
4:34
on purpose. It needs a safety check that says,
4:37
if this job is about to touch more than N prefixes,
4:40
stop. It needs canary behavior. It needs a dry
4:44
run mode that produces a diff humans can look
4:47
at. It needs a circuit breaker that triggers
4:49
on anomaly, not on service down. Because the
4:53
failure mode here is always the same. One helper
4:56
job. One bug. One correlated blast radius. And
5:00
nobody sleeps. Practical Monday take. Ask, what
5:03
systems do we have where a background job can
5:06
make reachability go away? That could be BGP,
5:09
but it could also be DNS automation, certificate
5:12
rotation automation, firewall rule cleanup, CDN
5:16
rule cleanup, or even a Terraform pipeline with
5:20
the ability to destroy and recreate shared infrastructure.
5:23
Then ask one more question. If that job goes
5:26
sideways, what's the fastest human-safe rollback?
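A rollback you can execute under stress usually starts with automation that refuses to dig the hole deeper. The max-N safety check, dry-run diff, and approval gate described earlier might look roughly like this sketch; the cap, names, and functions are all made up for illustration, not Cloudflare's actual tooling:

```python
MAX_WITHDRAWALS = 25  # trip wire: refuse to withdraw more than this in one run

def plan_cleanup(active_prefixes, candidates):
    """Dry run: build a diff a human can review without touching routing."""
    to_withdraw = [p for p in candidates if p in active_prefixes]
    return {"withdraw": to_withdraw, "count": len(to_withdraw)}

def execute_cleanup(plan, approved=False):
    """Refuse to act on an oversized blast radius or an unapproved plan."""
    if plan["count"] > MAX_WITHDRAWALS:
        raise RuntimeError(f"refusing: {plan['count']} exceeds cap {MAX_WITHDRAWALS}")
    if not approved:
        raise RuntimeError("refusing: no human approved this plan")
    # real tooling would call the routing control plane here
    return plan["withdraw"]
```

The useful property is that the dry run produces a reviewable diff and the execute path fails closed, on both blast radius and missing approval, instead of quietly doing whatever the query returned.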
5:29
And does it require tribal knowledge? If the
5:32
answer is we'd figure it out, congrats. You just
5:34
found your next incident. All right, story two
5:41
is one of my favorite kind of postmortems, even
5:44
though it's painful. Clerk had a system outage,
5:47
and the root cause was an inefficient query plan
5:50
caused by Postgres auto-analyze. So nothing
5:53
exotic, no kernel panic, no region failure, no
5:56
"someone deleted prod." Just Postgres doing normal
6:00
Postgres things. Here's the chain. Auto-analyze
6:03
runs. Statistics get updated. That causes a query
6:06
plan flip. Same query, different plan. That plan
6:10
is dramatically worse, which drags database performance
6:13
down, which backs up request handlers, which
6:16
turns into queuing. And then almost all traffic
6:18
starts getting 429'd without being handled. They
6:22
call out that over 95% of traffic was returning
6:25
429 because the system was basically shedding
6:28
load while it was drowning. The fix was pretty
6:30
direct. Manually rerun analyze for the table
6:34
involved, which changed the stats and brought
6:37
the query plan back to the good version. What's
6:39
interesting is the detail they share about why
6:42
the planner got it wrong. It had to estimate
6:44
how many rows would match a condition, and it
6:47
used a statistic that depends on sampling. Their
6:50
data had a column where almost everything was
6:53
null. And because the sample was small, the sample
6:56
ended up being basically all nulls. So the planner
6:59
overconfidently assumed 100% nulls. The query
7:03
planner then expected a certain part of the query
7:05
to return basically zero rows. But in reality,
7:09
it returned something like 17,000 rows. So the
7:12
plan it picked was good for zero rows and terrible
7:15
for 17,000. That mismatch is the kind of thing
7:19
that doesn't show up in unit tests. It shows
7:22
up on a Thursday morning when auto-analyze decides
7:25
it's time. So why does this matter for platform
7:28
and SRE folks? Because a lot of teams still think
7:31
in binary failure states. Database up or down.
7:35
Service up or down. But a huge chunk of production
7:38
incidents live in the gray zone. Database is
7:41
up, but slow. The service is up, but queuing.
7:44
Your health checks pass, but users are screaming.
7:47
They even point out that their automatic failover
7:50
didn't trigger because Postgres was online, just
7:52
degraded. So it didn't match the failover now
7:55
conditions. And this is where the playbook needs
7:58
to evolve. If your failover only triggers on
8:01
dead, you're going to get smoked by limping.
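One way to stop "limping" from slipping past your checks is to make health three-valued and latency-aware, rather than a binary liveness probe. A rough sketch, with an invented latency budget and failover threshold:

```python
import time

LATENCY_BUDGET_S = 0.25  # made-up budget; anything slower counts as degraded

def check_health(probe):
    """Return 'healthy', 'degraded', or 'down' instead of a binary up/down."""
    start = time.monotonic()
    try:
        probe()  # e.g. a cheap canary query against the database
    except Exception:
        return "down"
    elapsed = time.monotonic() - start
    return "degraded" if elapsed > LATENCY_BUDGET_S else "healthy"

def should_fail_over(history, window=5, threshold=3):
    """Fail over when enough recent checks were degraded OR down, not only dead."""
    recent = history[-window:]
    return sum(1 for s in recent if s != "healthy") >= threshold
```

The point isn't these exact numbers; it's that "degraded" is a first-class state the failover logic can count, so an online-but-drowning database still trips the breaker.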
8:04
Clerk's remediation is worth stealing. They talk
8:07
about adding alerting specifically for query
8:09
plan flips because it's sudden and severe. They
8:12
also talk about a mitigation that offloaded session
8:15
token generation outside their core session API
8:18
to reduce backend load and help people stay logged
8:22
in even while other parts of the system were
8:25
unhealthy. That's a classic reliability move.
8:28
Protect the critical path even if the full feature
8:31
set is degraded. And they're also honest about
8:34
communication. They say their updates were too
8:37
infrequent, their initial status severity didn't
8:39
match impact, and their first update was too
8:41
slow. Every team thinks they're good at comms
8:44
until they're in a real outage. So Monday take.
8:47
If you run Postgres, ask yourself, do we have
8:49
any alerting that detects this query suddenly
8:52
got 50 times slower? Or this query changed plan?
8:55
Or do we just wait for CPU graphs to scream?
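You can get a long way toward catching "this query suddenly got 50 times slower" with a rolling per-query baseline and a ratio alert. A hypothetical sketch, the window and 10x ratio are arbitrary choices, and query_id stands in for whatever query fingerprint your telemetry gives you:

```python
from collections import defaultdict, deque
from statistics import median

BASELINE_WINDOW = 50  # latency samples kept per query fingerprint
ALERT_RATIO = 10.0    # arbitrary: flag a sample 10x slower than the baseline

history = defaultdict(lambda: deque(maxlen=BASELINE_WINDOW))

def observe(query_id, latency_ms):
    """Record a latency sample; return True if it looks like a sudden plan flip."""
    samples = history[query_id]
    # only alert once we have a baseline, and compare before recording the spike
    flipped = len(samples) >= 10 and latency_ms > ALERT_RATIO * median(samples)
    samples.append(latency_ms)
    return flipped
```

A plan flip shows up as exactly this shape, a stable baseline followed by a step change, which is why it's so catchable even when the database still reports itself as up.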
8:58
And separately, do we have a degraded mode strategy
9:01
for the handful of flows that absolutely cannot
9:04
be down? Auth, token validation, session refresh,
9:08
payment, whatever it is for your product. Because
9:11
the best incident is the one where users can
9:14
still do the one thing they really need, even
9:16
if the rest is on fire. All right, story three.
9:23
This one's floating around as "AI took down AWS,"
9:26
which is obviously the headline everybody wants.
9:29
But the more useful way to look at it is this
9:32
is a permission story. Reuters reported that
9:34
AWS had a disruption tied to a cost management
9:37
feature, and the reporting connected it to AWS's
9:41
internal tooling called Kiro. AWS responded
9:45
by saying it was limited to a single service,
9:47
not AWS broadly. It was limited to one region,
9:50
and it was user error. Then AWS published their
9:53
own statement, basically saying, the interruption
9:56
was due to misconfigured access controls, not
9:59
AI. They also say they added additional safeguards,
10:02
including mandatory peer review for production
10:05
access. You can believe whichever framing you
10:07
want, but the operational takeaway is identical.
10:10
If you have a tool that can take action, it is
10:13
part of your control plane. Whether it's an agent,
10:16
a bot, a pipeline, Terraform, a chatbot command,
10:20
a script that runs at 2am, or an internal
10:22
self-service portal. The moment it can touch production,
10:25
you need to treat it like production access.
10:28
And the fastest way to get hurt here is letting
10:30
convenience win over boundaries. So what do
10:33
good boundaries look like in real teams? It looks
10:36
like separation between read-only and write.
10:38
It looks like separation between propose a plan
10:41
and execute the plan. It looks like destructive
10:43
actions requiring explicit approvals. Not just
10:46
it ran in automation, so it must be fine. It
10:49
looks like a break glass path for emergencies
10:52
that is auditable and annoying enough that nobody
10:55
uses it casually. And it looks like logging actual
10:58
tool actions, not just chat transcripts. Not
11:01
the bot said it would delete things. I mean,
11:04
who called what API, with what role, against
11:07
what resources, and what changed? Because in
11:10
this new era, the hardest incidents will be the
11:13
ones where everything moved fast. And nobody
11:16
can confidently answer what actually happened.
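The propose-versus-execute split with an action-level audit log can be tiny. A hypothetical sketch, the destructive-action list and approval rule here are illustrative, not any real product's API:

```python
import time

DESTRUCTIVE = {"delete_env", "revoke_access", "withdraw_route"}  # illustrative list
audit_log = []  # real systems want append-only storage, not a process-local list

def propose(actor, action, target):
    """Anything (human, bot, agent) may propose; nothing executes yet."""
    return {"actor": actor, "action": action, "target": target, "approved_by": None}

def execute(proposal):
    """Destructive actions need approval from a second human before running."""
    needs_second = proposal["action"] in DESTRUCTIVE
    if needs_second and proposal["approved_by"] in (None, proposal["actor"]):
        raise PermissionError("destructive action requires a second human's approval")
    # log who called what, against what resource, and when
    audit_log.append({**proposal, "ts": time.time()})
    return True
```

Notice the log records actual tool actions, actor, action, target, timestamp, which is exactly the record you want when the postmortem question is "what actually happened."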
11:19
Monday take. If your org is messing with agents,
11:22
or even just adding more automation, do one simple
11:26
exercise. Pick one destructive action that exists
11:29
in your environment. Like deleting an environment,
11:32
rotating a secret, revoking access, withdrawing
11:35
a route, disabling a control. Now ask, can anything
11:38
do this without a second human being involved?
11:42
If yes, that's your risk. Not because AI is dangerous,
11:45
but because any tool with power plus weak guardrails
11:48
is dangerous. Quick platform note. AWS open sourced
11:56
the EKS node monitoring agent. The big pitch
11:59
is it monitors node-level system, storage, networking,
12:03
and accelerator issues and publishes them as
12:06
node conditions. And EKS can use those conditions
12:09
to drive automatic node repair. If you've ever
12:12
had a weird node that's half dead and you ended
12:16
up SSHing in, tailing kubelet logs, checking
12:19
disk pressure, and basically doing detective
12:22
work while your workloads suffer, that's the
12:24
exact pain that this is aimed at. I like this
12:27
category of tooling because it's not another
12:30
dashboard. It's turn node weirdness into a signal
12:33
that the control plane can act on. If you are
12:36
on EKS and you've had node flakiness incidents,
12:39
it's worth a look. All right, time for the lightning
12:48
round. I'm keeping this tight, four items and
12:51
all high signal. First, Grafana. There's a
12:54
high-severity advisory for cross-dashboard privilege
12:57
escalation via permission management. The short
13:00
version is, if someone has permission management
13:02
rights on one dashboard, under certain conditions,
13:06
they can read and modify permissions on other
13:09
dashboards. If you run Grafana in a shared environment,
13:12
this is one of those check your version and patch
13:15
stories, not a someday story. Second, runc CVEs.
13:20
AWS put out a bulletin for recently disclosed
13:23
runc issues that affect container runtimes when
13:26
launching new containers. I'm not going to pretend
13:29
everyone patches this instantly because the reality
13:32
is it depends on how you get your node OS and
13:35
runtime updates. But this is still a reminder
13:38
to keep node rollouts and runtime patching as
13:41
a normal muscle, not a panic button. Third, GitLab
13:45
patch train. GitLab shipped patch releases that
13:48
include important bug and security fixes, and
13:51
they strongly recommend self-managed installs
13:54
upgrade. If you self -host GitLab, you already
13:57
know the deal. Don't let "we'll do it later" become
13:59
"we got popped because we were busy." Fourth, Atlassian's
14:03
February security bulletin. This is for the enterprise
14:06
crowd still running data center products. They
14:09
are calling out a pile of high severity and critical
14:12
severity vulnerabilities fixed in recent product
14:15
releases. Same story. If you run it, patch it.
14:19
If you don't run it, thank your lucky stars and
14:21
keep scrolling. All right, human closer. There's
14:31
an ACM Queue piece called SRE is anti-transactional,
14:35
and it nails something that every platform team
14:37
eventually runs into. Tickets don't scale. Manual
14:41
work scales linearly. More requests means more
14:44
humans. And that is how you turn a platform team
14:47
into a help desk with pager fatigue. The SRE
14:50
instinct is to build systems that do work for
14:53
you. Not because you hate helping people, but
14:56
because you want the systems to be reliable without
14:59
requiring human glue for every small thing. And
15:02
honestly, this week's stories are all versions
15:04
of that same theme. Cloudflare tried to automate
15:07
a workflow that used to be manual. The idea was
15:10
right, but the guardrails weren't strong enough.
15:13
Clerk got hit by a database behavior that didn't
15:16
trip the usual failover assumptions. And they
15:19
are evolving their system so the critical flows
15:21
can survive partial failure. And AWS is in the
15:25
middle of a bigger shift where tools are doing
15:27
more, faster, and the only thing standing between
15:30
helpful and incident is how you design boundaries
15:33
and approvals. So if you are a platform engineer
15:36
or an SRE listening to this and you feel like
15:38
you are buried in tickets, here's the move. Pick
15:41
one repeated transactional pain this week and
15:44
don't solve it with another runbook. Solve it
15:46
with an API, a self-service workflow, or automation
15:50
with proper guardrails. Okay, time for a quick
15:52
recap before we wrap. Cloudflare is a reminder
15:55
that helper jobs are never just a cron. If automation
15:58
can touch reachability, routing, DNS, certs,
16:02
or anything shared, it needs production-grade
16:05
guardrails and a rollback you can execute under
16:08
stress. Clerk is the reminder that up but slow
16:11
can be worse than down. If your alerting and
16:14
failover only triggers on dead systems, you are
16:17
going to miss the incidents that actually hurt.
16:20
And the AWS Kiro story, no matter how you frame
16:23
it, comes back to permissions. If a tool can
16:26
execute changes, separate propose versus execute.
16:29
Require approvals for destructive actions. And
16:33
log the actual actions taken. Lightning round
16:36
recap. Grafana's permission escalation risk.
16:39
runc runtime CVEs, GitLab patch releases, and
16:43
Atlassian's monthly security bulletin. Links for
16:46
all of these are in the show notes. And the human
16:48
takeaway: SRE is anti-transactional for a reason.
16:51
Tickets don't scale. Build self -service and
16:54
guardrails so humans stop being the interface
16:57
for every little thing. All right, that's it
17:00
for this week. If you want the full receipts
17:02
and links, the full show notes are on shipitweekly.fm.
17:05
And the curated weekly brief is on OnCallBrief.com.
17:09
If you got value out of this, follow or
17:12
subscribe wherever you listen. And subscribe
17:14
on YouTube if you are watching the video version.
17:17
And if you've got an on-call buddy, send them
17:19
this episode. I'm Brian for Ship It Weekly, and
17:21
I'll see you next week.