0:00
A lot of the risk in modern infrastructure is
0:02
no longer hiding in the parts we used to fear
0:05
most. Sometimes it is a bad kernel bug. Sometimes
0:08
it is a broken DNS signature at the TLD layer.
0:12
Sometimes it is a GitOps upgrade that changes
0:15
behavior under your feet. And sometimes it is
0:18
an AI agent with way too much authority and not
0:21
nearly enough guardrails. This week has a little
0:24
bit of all of that. And the common thread is
0:26
pretty simple. Control planes are brittle, automation
0:30
is powerful, and identity still decides whether
0:33
a mistake stays small or becomes a crater. Hey,
0:54
I'm Brian Teller. I work in DevOps and SRE, and
0:57
I run Teller's Tech. This is Ship It Weekly,
1:00
where I filter the noise and focus on what actually
1:02
changes how we run infrastructure and own reliability.
1:06
Show notes and links are on shipitweekly.fm.
1:09
If the show's been useful, follow it wherever
1:11
you listen. Ratings help way more than they should.
1:14
And if you're watching on YouTube, subscribe
1:16
there too. We've got six main stories today,
1:20
then the lightning round, and we'll wrap with
1:22
the human closer. We're starting with the
1:24
PocketOS and Cursor incident because it is probably
1:27
the most instantly gripping story in the set
1:30
and also one of the most revealing. Not because
1:33
AI made a mistake, but because an agent got access
1:36
to something it should not have had in the first
1:38
place. Then we're going to the .de DNSSEC outage
1:43
and Cloudflare's response, which is one of those
1:46
reminders that the internet's deepest layers
1:48
are still capable of ruining everybody's afternoon,
1:52
all at once. After that, Bluesky's outage
1:56
post-mortem, which is a really good incident story
1:58
because it is weird, specific, and painfully real.
2:02
Then Argo CD, with version 3.1.16 being the
2:08
last 3.1 release and version 3.4.1 bringing
2:13
a behavior change that people really do need
2:16
to notice. Then the Linux kernel Copy Fail bug,
2:20
CVE-2026-31431. Because active exploitation plus broad
2:27
Linux exposure is not something ops teams get
2:30
to casually defer. And finally, Google Cloud
2:33
Agent Identity and AWS MCP Server GA. Because
2:37
the cloud platforms are starting to treat AI
2:39
agents as first-class actors instead of weird
2:43
sidecars bolted onto existing IAM assumptions.
2:50
Story 1. PocketOS and Cursor is really an
2:54
identity story. Let's start there. This week's
2:57
on-call brief summarized the incident this way.
3:00
On April 25th, a Cursor AI agent reportedly deleted
3:04
PocketOS's production database in under 10 seconds
3:07
after a credential mismatch led it to access
3:11
an API token it should not have had. The result,
3:14
according to the reporting cited in the brief,
3:17
was a full production data loss, including backups.
3:20
This is obviously a wild headline, but I think
3:23
the more important part is not the speed or even
3:26
the AI branding. It is the access path. Because
3:29
if an agent can see a destructive credential,
3:31
then the real problem started before the agent
3:34
ever took action. That is why I think this
3:36
story is more useful than just "look at what AI
3:39
did." It is really about credential sprawl, environment
3:42
separation, and what happens when people start
3:45
granting machine-assisted workflows access that
3:48
was probably too broad even for a human. An AI
3:51
agent did not invent bad blast radius design.
3:55
It just exercised it fast. And that is the practical
3:58
takeaway I'd want teams to hear. If you are adding
4:01
AI coding agents or operational agents into real
4:05
environments, you do not get to treat identity
4:08
as a cleanup task for later. The difference between
4:11
helpful automation and a production incident is
4:14
often just one token, one role, one environment
4:18
boundary, one missing approval step. So yeah,
4:21
this story is dramatic. But the lesson is boring.
4:25
And boring is good. Scope the credential. Separate
4:28
staging from prod. Assume the agent will eventually
4:32
try something dumb, overconfident, or surprising. Because sooner or later, it will.
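To make that boring lesson concrete, here is a minimal sketch of the shape I mean, in Go. Everything here is hypothetical, a pattern rather than anyone's real product: the agent never holds a destructive credential directly, and a broker decides per action and fails closed.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical guardrail sketch. The agent never holds a
// destructive credential itself; every action goes through a
// broker that checks scope first and fails closed.
type AgentAction struct {
	Env         string // "staging" or "prod"
	Destructive bool   // deletes, drops, truncates
	Approved    bool   // explicit human approval on record
}

var errDenied = errors.New("denied: destructive prod action without approval")

func Authorize(a AgentAction) error {
	if a.Env == "prod" && a.Destructive && !a.Approved {
		return errDenied
	}
	return nil
}

func main() {
	// The PocketOS-style scenario: destructive, prod, no approval.
	if err := Authorize(AgentAction{Env: "prod", Destructive: true}); err != nil {
		fmt.Println(err) // the boring outcome you actually want
	}
}
```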
4:35
Story 2.
4:42
The .de DNSSEC outage is a reminder that internet
4:46
plumbing still has global blast radius. Next
4:50
up, the .de outage. Cloudflare says that on May
4:53
5th, at roughly 19:30 UTC, DENIC, the operator
4:59
for the .de TLD, started publishing incorrect DNSSEC
5:04
signatures for the .de zone. Any validating
5:08
resolver that received those records was required
5:11
by DNSSEC rules to reject them and return
5:15
SERVFAIL. Cloudflare says 1.1.1.1 was no exception.
5:20
That means a bad cryptographic state at the TLD
5:24
layer was enough to make huge numbers of .de
5:27
domains effectively unreachable. That is a great
5:31
outage story for ops people because it is one
5:34
of those failures where everything is technically
5:36
working exactly as designed, and that is the problem.
5:40
DNSSEC is there to make sure responses are authentic.
5:43
The signatures were wrong. Resolvers did the
5:46
correct thing and refused to trust the data.
5:48
And suddenly, correct behavior becomes mass breakage.
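One practical trick if you are debugging this kind of thing live: compare a normal query against one with the CD (Checking Disabled) bit set. If validation is the only thing failing, the CD query succeeds. Here is a rough sketch using the miekg/dns Go library; the resolver and domain are just examples.

```go
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

// Rough diagnostic sketch: if a lookup SERVFAILs with validation on
// but succeeds with the Checking Disabled (CD) bit set, the failure
// is almost certainly DNSSEC validation, not the zone being down.
func query(name, resolver string, checkingDisabled bool) (int, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeA)
	m.SetEdns0(4096, true) // DO bit: ask for DNSSEC records
	m.CheckingDisabled = checkingDisabled

	c := new(dns.Client)
	r, _, err := c.Exchange(m, resolver)
	if err != nil {
		return 0, err
	}
	return r.Rcode, nil
}

func main() {
	resolver := "1.1.1.1:53"
	validated, _ := query("example.de", resolver, false)
	bypassed, _ := query("example.de", resolver, true)
	if validated == dns.RcodeServerFailure && bypassed == dns.RcodeSuccess {
		fmt.Println("SERVFAIL only with validation on: DNSSEC problem upstream")
	}
}
```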
5:52
That is such a brutal but important reliability
5:55
lesson. Security mechanisms do not eliminate
5:59
failure. They sometimes concentrate it. Cloudflare's
6:02
write-up is also good because it walks through
6:04
mitigation trade-offs instead of pretending
6:07
there's always a clean answer. When the trust
6:09
anchor above you is wrong, your choices get ugly
6:12
fast. Do you keep strict validation and break
6:16
reachability? Or do you apply a temporary exception
6:19
and accept the risk of bypassing a protection
6:22
that exists for a reason? That is not really
6:25
a DNS-only lesson either. It is a control plane
6:28
lesson. When a higher order authority goes bad,
6:32
the downstream systems that trust it do not fail
6:35
independently. They fail together. So if I'm
6:38
taking one operator lesson from this, it is this.
6:41
Any system built on centralized trust, centralized
6:44
signing, or centralized metadata needs a very
6:48
honest failure mode conversation. Because when
6:51
the thing at the top is wrong, your elegant security
6:53
chain can turn into synchronized outage machinery
6:57
very quickly. Before we get to the next story,
7:00
a quick note from this week's sponsor, Guardsquare.
7:03
If you are building mobile apps, "good enough"
7:05
security is usually a problem waiting to happen.
7:09
Guardsquare focuses on actually protecting your
7:12
code, in addition to scanning it. That means code
7:15
hardening, runtime protection, testing, and visibility
7:19
into what's happening once your app is out in
7:22
the wild. So if you are responsible for shipping
7:24
and securing mobile apps, Android or iOS, definitely
7:28
worth taking a look at guardsquare.com. Alright,
7:32
back to the show. Story 3. Bluesky's outage
7:39
post-mortem is the kind of weird incident story
7:42
worth studying. Now for my favorite kind of reliability
7:45
story. Bluesky published a post-mortem for
7:48
its April outage. And the post says that the
7:50
service was intermittently down for about half
7:52
its users for around eight hours. Jim Calabro's
7:55
write-up says they saw pretty quickly that they
7:58
were exhausting ports. But the harder part was
8:01
figuring out exactly where and why. The root
8:04
issue involved memcached traffic, ephemeral port
8:07
exhaustion, and a cascade where the debugging
8:10
path became part of the failure path. What makes
8:14
this one so good is that it is not just "we ran
8:17
out of ports." The post-mortem explains that logging
8:20
used blocking write syscalls, and under the load
8:24
of trying to log huge volumes of errors while
8:27
still serving traffic, the Go runtime spawned
8:30
far more OS threads than normal. Bluesky says
8:33
that extra thread pressure then hit the garbage
8:36
collector, contributing to the broader failure
8:39
cascade. The write-up also shows a custom dialer
8:42
workaround that randomizes loopback IPs to avoid
8:46
ephemeral port exhaustion on a single address
8:50
after restarts.
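Here is my own rough reconstruction of that idea, not Bluesky's actual code. The trick works because every 127.x.y.z address is loopback, so spreading connections across destination addresses multiplies the usable connection tuple space. It assumes the server is listening on all loopback addresses rather than only 127.0.0.1.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// Sketch of a loopback-randomizing dialer. With a single fixed
// source and destination address, only the ephemeral port range
// distinguishes connections; randomizing the destination across
// 127.0.0.0/8 multiplies the available (src, dst) tuple space.
func dialRandomLoopback(ctx context.Context, network, addr string) (net.Conn, error) {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	// Pick a random host byte in 127.0.0.1 - 127.0.0.254.
	host := fmt.Sprintf("127.0.0.%d", 1+rand.Intn(254))
	d := net.Dialer{Timeout: 2 * time.Second}
	return d.DialContext(ctx, network, net.JoinHostPort(host, port))
}

func main() {
	// Assumes a local server (e.g. memcached) bound to all
	// loopback addresses, not just 127.0.0.1.
	conn, err := dialRandomLoopback(context.Background(), "tcp", "127.0.0.1:11211")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected via", conn.RemoteAddr())
}
```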
8:52
That is such a real incident pattern. The original fault hurts. The observability
8:56
path amplifies it. The recovery path gets noisy.
8:59
And then the system that is supposed to help
9:01
you reason about the incident starts participating
9:03
in the incident. That is not hypothetical. That
9:06
is production. And I also like the honesty in
9:09
the write-up. Bluesky basically says the signal
9:12
was there, just buried in noise,
9:15
and that you need the discipline and the metrics
9:18
to cut through it. That lands because it is true.
9:21
A lot of outages are not invisible. They are
9:24
just obscured by too many symptoms arriving at
9:26
once. So my takeaway here is not just "watch ephemeral
9:29
ports." It is broader than that. Watch where your
9:32
debugging strategy becomes a scaling liability.
9:36
Watch where synchronous behavior hides inside
9:38
paths you think are harmless. And watch for failure
9:42
modes where saturation causes your tooling to
9:45
become part of the blast radius instead of part
9:48
of the recovery. Story 4. Argo CD is giving people
9:56
one quiet end-of-life and one not-so-quiet
9:59
behavior change. Next up, Argo CD. Argo's version
10:03
3.1.16 release is the final release in the
10:07
3.1 series. The release notes are very explicit.
10:11
As of May 6, 2026, 3.1 has reached end of life
10:15
and will no longer receive bug fixes or security
10:18
updates. The same release notes tell operators
10:21
to move to a supported version, meaning 3.2,
10:25
3.3, or 3.4. That alone is worth a mention
10:28
because GitOps tools have a way of becoming background
10:32
furniture. Teams stop thinking about the controller
10:34
version because the controller is just there
10:36
doing its thing. Right up until the day it matters
10:39
a lot. But then there is the second part. Argo
10:41
CD version 3.4.1 is the first release in the
10:46
3.4 series. And the release notes call out an
10:49
important change. Following Helm 3.19.0, Argo
10:53
CD now aligns its interpretation of Kubernetes
10:56
cluster version with Helm's behavior. The on-call brief
11:00
points out that this affects ApplicationSets
11:02
that filter clusters by Kubernetes version. That
11:06
is exactly the kind of change that looks tiny
11:08
in a release note and then quietly breaks assumptions
11:11
in real environments. So for me, this is a classic
11:15
operator story. Not sexy, not viral, very real.
11:19
One branch is dead. One branch changes parsing
11:22
behavior. And if your GitOps setup depends on
11:25
version-based selection logic, you need to actually
11:28
test that logic instead of assuming a minor release means minor consequences.
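For a flavor of how that kind of surprise shows up, here is a sketch using the Masterminds semver library, which Helm uses for version constraints. I am not claiming this is the exact mechanism behind the Argo CD change; it is an illustration of why version-matching semantics deserve an explicit test.

```go
package main

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
)

// A version-filter gotcha in miniature. Suffixed managed-Kubernetes
// versions like "1.30.5-gke.100" parse as prereleases, and in this
// library a prerelease does not satisfy a plain range constraint
// unless the constraint itself opts in with a prerelease marker.
func main() {
	v := semver.MustParse("1.30.5-gke.100")

	plain, _ := semver.NewConstraint(">= 1.29.0")
	withPre, _ := semver.NewConstraint(">= 1.29.0-0")

	fmt.Println(plain.Check(v))   // false: prerelease filtered out
	fmt.Println(withPre.Check(v)) // true: prereleases allowed
}
```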
11:31
The practical lesson
11:34
is simple. Treat controller upgrades like control
11:38
plane changes, not package refreshes. Because
11:41
that is what they are. If the thing deciding
11:44
what gets deployed, where, and when changes how
11:47
it interprets your environment, that is production
11:50
behavior, not housekeeping.
11:58
Story 5. Copy Fail is the kind of kernel bug ops teams do not get to casually defer.
12:01
Now to CVE-2026-31431, also known as Copy Fail.
12:08
NVD shows that this Linux kernel vulnerability
12:11
is in CISA's Known Exploited Vulnerabilities
12:15
catalog. And CISA's entry gives federal agencies
12:19
a remediation due date of May 15th. The flaw
12:23
is an incorrect resource transfer between spheres
12:26
issue in the Linux kernel. And the reporting
12:29
around it says it enables local privilege escalation
12:32
to root on a wide range of Linux distributions.
12:36
AWS's security bulletin says that with the exception
12:40
of specifically listed services, most AWS customers
12:44
are not affected. But it also lists update timelines
12:47
for affected services such as Bottlerocket, ECS
12:52
on EC2, EKS-optimized AMIs, and EMR. So this
12:57
is one of those stories where "not everybody is
12:59
affected" should not turn into "nobody on our side
13:02
is affected." You still have to inventory what
13:05
you run. That is the operational lesson here.
13:07
When a kernel LPE lands in KEV and there is
13:12
active exploitation, this is not the time for
13:15
vague patch queues. This is the time for concrete
13:18
exposure review. Which hosts? Which images? Which
13:21
managed services? Which self-managed nodes?
13:24
Which maintenance windows? Which compensating controls until patched?
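If you want a starting point for that inventory, here is a tiny Linux-specific sketch that reports the running kernel release on a host. Mapping a release string to "patched for this CVE" is distro-specific, so that lookup table is an assumption left to your own tooling.

```go
package main

import (
	"bytes"
	"fmt"

	"golang.org/x/sys/unix"
)

// Tiny exposure-review sketch: report the running kernel so fleet
// tooling can compare it against your distro's patched versions.
func kernelRelease() (string, error) {
	var uts unix.Utsname
	if err := unix.Uname(&uts); err != nil {
		return "", err
	}
	// Release is a fixed-size, NUL-padded byte array.
	return string(bytes.TrimRight(uts.Release[:], "\x00")), nil
}

func main() {
	rel, err := kernelRelease()
	if err != nil {
		fmt.Println("uname failed:", err)
		return
	}
	fmt.Println("running kernel:", rel)
}
```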
13:27
And honestly, the thing
13:31
I always come back to with stories like this
13:33
is that kernel issues collapse abstraction layers
13:36
really fast. You can have immaculate Kubernetes
13:39
policy, good IAM, strong workload boundaries,
13:43
and still get wrecked if the shared kernel underneath
13:46
is vulnerable and unpatched. That is why these
13:50
bugs matter. They turn every higher-level control
13:53
into a best-effort suggestion until the underlying
13:57
system is fixed. Story 6. Google and AWS are
14:05
both telling us agents need first-class infrastructure
14:08
identity now. Last main story. I wanted to put
14:12
Google Cloud Agent Identity and AWS MCP Server
14:15
GA together. Because they are different products,
14:18
but they point in the same direction. Google
14:21
Cloud's new IAM post says that the AI era needs
14:25
a different security and governance model for
14:27
autonomous agents. And it introduces agent identity
14:31
as a new first-class principal type, distinct
14:34
from human identities and generic service accounts.
14:38
Google says these identities are built on SPIFFE,
14:42
are cryptographically protected and strongly
14:45
attested, and can be used to authenticate to
14:48
MCP servers, cloud resources, endpoints, and other agents.
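For a flavor of what SPIFFE-style identity looks like from the workload's side, here is a generic sketch using the go-spiffe library. This is plain SPIFFE Workload API usage, not Google's agent identity API specifically, and the identity in the comment is made up.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

// A workload (or agent) fetches its own attested identity from the
// local SPIFFE Workload API instead of holding a long-lived static
// credential.
func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Connects to the Workload API over its local socket.
	source, err := workloadapi.NewX509Source(ctx)
	if err != nil {
		fmt.Println("no workload API available:", err)
		return
	}
	defer source.Close()

	svid, err := source.GetX509SVID()
	if err != nil {
		fmt.Println("no SVID:", err)
		return
	}
	// e.g. spiffe://example.org/agent/deploy-bot (hypothetical)
	fmt.Println("attested identity:", svid.ID)
}
```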
14:52
It also ties this into agent gateway
14:55
policy enforcement, least privilege controls,
14:58
and runtime defense. AWS's side of the story
15:02
is different in implementation, but similar in
15:05
intent. AWS says that the AWS MCP server is now
15:08
generally available and gives AI agents and coding
15:11
assistants secure, authenticated access to AWS
15:14
services through a small fixed set of tools
15:17
using existing IAM credentials. The official
15:21
announcement frames it directly around the question
15:23
people keep asking. How do you give an agent
15:26
real AWS access without just handing it keys
15:29
to the kingdom? That is why I think these stories
15:31
belong together. Both clouds are basically admitting
15:34
the same thing. Agents are not side features
15:36
anymore. They are becoming infrastructure actors.
15:40
And once that is true, the identity model has
15:42
to change too. Not just more service accounts,
15:45
not just more static tokens, not just vibes and
15:48
trust. Real principals, real boundaries, real
15:52
policy, real auditability. So my read is that
15:56
this is where cloud identity is headed next.
15:58
Human identity was phase one. Workload identity
16:02
was phase two. Agent identity is shaping up to
16:05
be phase three. And if teams do not start treating
16:08
that as real infrastructure design now, they're
16:11
going to rediscover all of the old machine identity
16:14
mistakes with an LLM in the loop and a much faster
16:18
failure path. A few quick ones before we wrap.
16:28
Cilium published lessons from securing CI/CD
16:31
for an open source project. And the reason
16:34
I like it is that it is practical. The on-call
16:37
brief highlights controls like tighter CI/CD security
16:40
practices and lessons learned from operating
16:43
a real open source pipeline. That is worth a
16:45
read for anyone treating GitHub Actions hardening
16:48
like a someday project. Velero version
16:51
1.18.1-rc.2 includes a security fix for
16:57
CVE-2026-27141 by bumping golang.org/x/net
17:03
to version 0.51.0. That is the kind of
17:08
small release note that matters a lot when you
17:10
realize it closes a known vuln in a tool people
17:13
trust for recovery. Google's release notes also
17:16
carry a couple of quieter but real operational
17:19
changes. Google Ads and related measurement APIs
17:23
moved to a 37-month retention policy for granular
17:27
performance stats starting June 1st. And Google
17:31
Distributed Cloud's Kubernetes 1.35 platform
17:34
update now requires cgroups v2 with cgroups v1
17:40
no longer supported for creation or upgrades.
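A cheap preflight for that one: on a unified cgroup v2 host, the cgroup.controllers file exists at the root of /sys/fs/cgroup, so a check like this sketch can flag nodes before an upgrade.

```go
package main

import (
	"fmt"
	"os"
)

// On a unified cgroup v2 host, /sys/fs/cgroup/cgroup.controllers
// exists at the root. On legacy v1 (or hybrid) setups it does not,
// so this is a cheap way to flag nodes ahead of an upgrade that
// requires v2.
func main() {
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("cgroup v2 (unified) detected")
	} else {
		fmt.Println("cgroup v1 or hybrid: plan migration before upgrading")
	}
}
```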
17:44
Those are not flashy changes, but both can absolutely
17:47
break assumptions if nobody notices them. AWS
17:50
also published a cross-region disaster recovery
17:53
walkthrough for Amazon EKS using AWS Backup.
17:58
The post walks through creating backup vaults
18:01
in source and DR regions, running on-demand
18:04
backups, and initiating cross-region copies.
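For teams who want to script that instead of clicking through it, here is a sketch of the cross-region copy step using the AWS SDK for Go v2. The vault names, recovery point ARN, and role ARN are all placeholders.

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/backup"
)

// Starts a cross-region copy of an existing recovery point from a
// source vault to a DR-region vault. All identifiers below are
// placeholders, not values from the AWS post.
func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
	if err != nil {
		panic(err)
	}
	client := backup.NewFromConfig(cfg)

	_, err = client.StartCopyJob(ctx, &backup.StartCopyJobInput{
		RecoveryPointArn:          aws.String("arn:aws:backup:us-east-1:111122223333:recovery-point:EXAMPLE"),
		SourceBackupVaultName:     aws.String("eks-source-vault"),
		DestinationBackupVaultArn: aws.String("arn:aws:backup:us-west-2:111122223333:backup-vault:eks-dr-vault"),
		IamRoleArn:                aws.String("arn:aws:iam::111122223333:role/aws-backup-copy-role"),
	})
	if err != nil {
		fmt.Println("copy job failed to start:", err)
		return
	}
	fmt.Println("cross-region copy job started")
}
```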
18:07
It is not a glamorous story, but it is the kind
18:10
of thing that people say they care about right
18:12
up until they realize they have never actually
18:14
rehearsed it. I think the human thread underneath
18:24
this week's episode is that modern reliability
18:27
work keeps getting squeezed between two kinds of failure. The first kind is old and familiar: kernel bugs, DNS trust chains, brittle control planes.
18:31
The second kind feels newer: AI agents with too much authority, new cloud identity models
18:55
that quietly become control planes, build and
18:58
operational systems that act like sidecars until
19:01
the day they absolutely are not. And what is
19:04
tricky is that these two categories are not separate
19:07
anymore. The old failures now happen inside environments
19:11
shaped by the newer automation. The newer automation
19:14
inherits the blast radius of the old infrastructure.
19:17
And ops teams end up responsible for both at
19:21
the same time. So I think the real takeaway this
19:23
week is pretty simple. Reliability is not just
19:26
about uptime anymore. It is about authority.
19:28
Who and what can act? What can it touch? How
19:32
fast can it fail? And whether your system design
19:34
assumes mistakes will stay local when they almost
19:37
never do. That sounds abstract until you line
19:41
up the stories. A TLD signs bad data and millions
19:45
of lookups fail. A social platform runs into
19:48
port exhaustion and logging makes it worse. A
19:51
GitOps controller hits end of life while version
19:54
parsing changes in the next branch. A kernel
19:57
bug drops into active exploitation. An AI agent
20:01
sees the wrong token and production disappears.
20:04
Different systems, same lesson. Small authority
20:07
problems turn into large reliability problems
20:11
very quickly. All right, that's it for this week
20:14
of Ship It Weekly. Quick recap. The PocketOS
20:17
and Cursor database wipe and why it is really
20:21
an identity story. The .de DNSSEC outage and what
20:25
happens when the trust chain itself goes wrong.
20:28
Bluesky's outage post-mortem and how observability
20:31
can become part of the incident path. Argo CD
20:35
3.1 going end of life while 3.4 changes behavior.
20:40
Copy Fail and why active kernel exploitation
20:42
still cuts through all the higher-level abstractions.
20:46
And Google Cloud Agent Identity plus AWS MCP
20:50
Server GA. Because the cloud providers are starting
20:53
to formalize agents as real infrastructure actors.
20:56
Then in the lightning round, Cilium's CI/CD security
21:00
lessons. Velero's CVE fix. Google's retention
21:04
and cgroups v2 changes, and AWS's cross-region
21:09
EKS disaster recovery. Links and show notes are
21:12
on shipitweekly.fm. You can also find the video
21:15
versions on YouTube. And if you want the source
21:18
stack before the episode lands, check out this
21:21
week's on-call brief. If this episode was useful,
21:24
follow or subscribe wherever you listen. And
21:27
send it to the person on your team who still
21:28
has to explain that reliability problems are
21:32
not just outages anymore. Sometimes they are
21:34
identity problems, tooling problems, and authority
21:37
problems wearing outage clothes. I'm Brian and
21:40
I'll see you next week.