0:00
A lot of the risk in modern infrastructure is
0:02
no longer hiding in the parts we used to fear
0:05
most. Sometimes it is a bad kernel bug. Sometimes
0:08
it is a broken DNS signature at the TLD layer.
0:12
Sometimes it is a GitOps upgrade that changes
0:15
behavior under your feet. And sometimes it is
0:18
an AI agent with way too much authority and not
0:21
nearly enough guardrails. This week has a little
0:24
bit of all of that. And the common thread is
0:26
pretty simple. Control planes are brittle. Automation
0:30
is powerful, and identity still decides whether
0:33
a mistake stays small or becomes a crater. Hey,
0:54
I'm Brian Teller. I work in DevOps and SRE, and
0:57
I run Teller's Tech. This is Ship It Weekly,
1:00
where I filter the noise and focus on what actually
1:02
changes how we run infrastructure and own reliability.
1:06
Show notes and links are on shipitweekly.fm.
1:09
If the show's been useful, follow it wherever
1:11
you listen. Ratings help way more than they should.
1:14
And if you're watching on YouTube, subscribe
1:16
there too. We've got six main stories today,
1:20
then the lightning round, and we'll wrap with
1:22
the human closer. We're starting with the
1:24
PocketOS and Cursor incident because it is probably
1:27
the most instantly gripping story in the set
1:30
and also one of the most revealing. Not because
1:33
AI made a mistake, because an agent got access
1:36
to something it should not have had in the first
1:38
place. Then we're going to the .de DNSSEC outage
1:43
and Cloudflare's response, which is one of those
1:46
reminders that the internet's deepest layers
1:48
are still capable of ruining everybody's afternoon,
1:52
all at once. After that, Bluesky's outage
1:56
post-mortem, which is a really good incident story
1:58
because it is weird, specific, and painfully real.
2:02
Then Argo CD, with version 3.1.16 being the
2:08
last 3.1 release and version 3.4.1 bringing
2:13
a behavior change that people really do need
2:16
to notice. Then the Linux kernel Copy Fail bug,
2:20
CVE-2026-31431. Because active exploitation plus broad
2:27
Linux exposure is not something ops teams get
2:30
to casually defer. And finally, Google Cloud
2:33
Agent Identity and AWS MCP Server GA. Because
2:37
the cloud platforms are starting to treat AI
2:39
agents as first-class actors instead of weird
2:43
sidecars bolted onto existing IAM assumptions.
2:50
Story one. PocketOS and Cursor is really an
2:54
identity story. Let's start there. This week's
2:57
on-call brief summarized the incident this way.
3:00
On April 25th, a Cursor AI agent reportedly deleted
3:04
PocketOS's production database in under 10 seconds
3:07
after a credential mismatch led it to access
3:11
an API token it should not have had. The result,
3:14
according to the reporting cited in the brief,
3:17
was a full production data loss, including backups.
3:20
This is obviously a wild headline, but I think
3:23
the more important part is not the speed or even
3:26
the AI branding. It is the access path. Because
3:29
if an agent can see a destructive credential,
3:31
then the real problem started before the agent
3:34
ever took action. That is why I think that this
3:36
story is more useful than just "look at what AI
3:39
did." It is really about credential sprawl, environment
3:42
separation, and what happens when people start
3:45
granting machine-assisted workflows access that
3:48
was probably too broad even for a human. An AI
3:51
agent did not invent bad blast radius design.
3:55
It just exercised it fast. And that is the practical
3:58
takeaway I'd want teams to hear. If you are adding
4:01
AI coding agents or operational agents into real
4:05
environments, you do not get to treat identity
4:08
as a cleanup task for later. The difference between
4:11
helpful automation and production incident is
4:14
often just one token, one role, one environment
4:18
boundary, one missing approval step. So yeah,
4:21
this story is dramatic. But the lesson is boring.
4:25
And boring is good. Scope the credential. Separate
4:28
staging from prod. Assume the agent will eventually
4:32
try something dumb, overconfident, or surprising.
4:35
Because sooner or later, it will. Story 2.
4:42
The .de DNSSEC outage is a reminder that internet
4:46
plumbing still has global blast radius. Next
4:50
up, the .de outage. Cloudflare says that on May
4:53
5th, at roughly 19:30 UTC, DENIC, the operator
4:59
for the .de TLD, started publishing incorrect DNSSEC
5:04
signatures for the .de zone. Any validating
5:08
resolver that received those records was required
5:11
by DNSSEC rules to reject them and return
5:15
SERVFAIL. Cloudflare says 1.1.1.1 was no exception.
5:20
That means a bad cryptographic state at the TLD
5:24
layer was enough to make huge numbers of .de
5:27
domains effectively unreachable. That is a great
5:31
outage story for ops people because it is one
5:34
of those failures where everything is technically
5:36
working exactly as designed and that is the problem.
5:40
DNSSEC is there to make sure responses are authentic.
5:43
The signatures were wrong. Resolvers did the
5:46
correct thing and refused to trust the data.
5:48
And suddenly, correct behavior becomes mass breakage.
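A toy model of that behavior, written in Python, may make the trade-off easier to see. It is not a real resolver and performs no cryptography: `resolve`, `zone_of`, and `negative_trust_anchors` are illustrative names, and `signature_valid` stands in for the actual DNSSEC check. The exception list is loosely modeled on the negative trust anchor idea from RFC 7646; the episode does not say which mechanism any operator actually used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Answer:
    rcode: str               # "NOERROR" or "SERVFAIL"
    address: Optional[str]   # resolved address, if one is returned
    validated: bool          # True only when the signature check passed


def zone_of(name: str) -> str:
    """Return the TLD of a name, e.g. 'example.de' -> 'de' (toy logic)."""
    return name.rstrip(".").rsplit(".", 1)[-1]


def resolve(name: str, address: str, signature_valid: bool,
            negative_trust_anchors: frozenset = frozenset()) -> Answer:
    """Decide what a validating resolver hands back for one lookup."""
    if signature_valid:
        return Answer("NOERROR", address, validated=True)
    if zone_of(name) in negative_trust_anchors:
        # Temporary exception: the name resolves, but without authentication.
        return Answer("NOERROR", address, validated=False)
    # Strict validation: a bad signature means no answer at all.
    return Answer("SERVFAIL", None, validated=False)
```

With a bad signature and no exception, every name under the zone gets `SERVFAIL`. Adding the zone to the exception set restores reachability, at the cost of serving answers nobody has verified.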
5:52
That is such a brutal but important reliability
5:55
lesson. Security mechanisms do not eliminate
5:59
failure. They sometimes concentrate it. Cloudflare's
6:02
write-up is also good because it walks through
6:04
mitigation trade-offs instead of pretending
6:07
there's always a clean answer. When the trust
6:09
anchor above you is wrong, your choices get ugly
6:12
fast. Do you keep strict validation and break
6:16
reachability? Or do you apply a temporary exception
6:19
and accept the risk of bypassing a protection
6:22
that exists for a reason? That is not really
6:25
a DNS-only lesson either. It is a control plane
6:28
lesson. When a higher-order authority goes bad,
6:32
the downstream systems that trust it do not fail
6:35
independently. They fail together. So if I'm
6:38
taking one operator lesson from this, it is this.
6:41
Any system built on centralized trust, centralized
6:44
signing, or centralized metadata needs a very
6:48
honest failure mode conversation. Because when
6:51
the thing at the top is wrong, your elegant security
6:53
chain can turn into synchronized outage machinery
6:57
very quickly. Before we get to the next story,
7:00
a quick note from this week's sponsor, Guardsquare.
7:03
If you are building mobile apps, good enough
7:05
security is usually a problem waiting to happen.
7:09
Guardsquare focuses on actually protecting your
7:12
code, in addition to scanning it. That means code
7:15
hardening, runtime protection, testing, and visibility
7:19
into what's happening once your app is out in
7:22
the wild. So if you are responsible for shipping
7:24
and securing mobile apps, Android or iOS, definitely
7:28
worth taking a look at guardsquare.com. Alright,
7:32
back to the show. Story 3. Bluesky's outage
7:39
post-mortem is the kind of weird incident story
7:42
worth studying. Now for my favorite kind of reliability
7:45
story. Bluesky published a post-mortem for
7:48
its April outage. And the post says that the
7:50
service was intermittently down for about half
7:52
its users for around eight hours. Jim Calabro's
7:55
write-up says they saw pretty quickly that they
7:58
were exhausting ports. But the harder part was
8:01
figuring out exactly where and why. The root
8:04
issue involved memcached traffic, ephemeral port
8:07
exhaustion, and a cascade where the debugging
8:10
path became part of the failure path. What makes
8:14
this one so good is that it is not just "we ran
8:17
out of ports." The postmortem explains that logging
8:20
used blocking write syscalls, and under the load
8:24
of trying to log huge volumes of errors while
8:27
still serving traffic, the Go runtime spawned
8:30
far more OS threads than normal. Bluesky says
8:33
that extra thread pressure then hit the garbage
8:36
collector, contributing to the broader failure
8:39
cascade. The write-up also shows a custom dialer
8:42
workaround that randomizes loopback IPs to avoid
8:46
ephemeral port exhaustion on a single address
8:50
after restarts. That is such a real incident
8:52
pattern. The original fault hurts. The observability
8:56
path amplifies it. The recovery path gets noisy.
8:59
And then the system that is supposed to help
9:01
you reason about the incident starts participating
9:03
in the incident. That is not hypothetical. That
9:06
is production. And I also like the honesty in
9:09
the write-up. Bluesky basically says the signal
9:12
was there but buried in noise,
9:15
and that you need the discipline and the metrics
9:18
to cut through it. That lands because it is true.
9:21
A lot of outages are not invisible. They are
9:24
just obscured by too many symptoms arriving at
9:26
once. So my takeaway here is not just "watch ephemeral
9:29
ports." It is broader than that. Watch where your
9:32
debugging strategy becomes a scaling liability.
9:36
Watch where synchronous behavior hides inside
9:38
paths you think are harmless. And watch for failure
9:42
modes where saturation causes your tooling to
9:45
become part of the blast radius instead of part
9:48
of the recovery. Story 4. Argo CD is giving people
9:56
one quiet end-of-life and one not-so-quiet
9:59
behavior change. Next up, Argo CD. Argo's version
10:03
3.1.16 release is the final release in the
10:07
3.1 series. The release notes are very explicit.
10:11
As of May 6, 2026, 3.1 has reached end of life
10:15
and will no longer receive bug fixes or security
10:18
updates. The same release notes tell operators
10:21
to move to a supported version, meaning 3.2,
10:25
3.3, or 3.4. That alone is worth a mention
10:28
because GitOps tools have a way of becoming background
10:32
furniture. Teams stop thinking about the controller
10:34
version because the controller is just there
10:36
doing its thing. Right up until the day it matters
10:39
a lot. But then there is the second part. Argo
10:41
CD version 3.4.1 is the first release in the
10:46
3.4 series. And the release notes call out an
10:49
important change. Following Helm 3.19.0, Argo
10:53
CD now aligns its interpretation of Kubernetes
10:56
cluster version with Helm's behavior. On-call brief
11:00
points out that this affects ApplicationSets
11:02
that filter clusters by Kubernetes version. That
11:06
is exactly the kind of change that looks tiny
11:08
in a release note and then quietly breaks assumptions
11:11
in real environments. So for me, this is a classic
11:15
operator story. Not sexy, not viral, very real.
11:19
One branch is dead. One branch changes parsing
11:22
behavior. And if your GitOps setup depends on
11:25
version-based selection logic, you need to actually
11:28
test that logic instead of assuming a minor release
11:31
means minor consequences. The practical lesson
11:34
is simple. Treat controller upgrades like control
11:38
plane changes, not package refreshes. Because
11:41
that is what they are. If the thing deciding
11:44
what gets deployed, where, and when changes how
11:47
it interprets your environment, that is production
11:50
behavior, not housekeeping.
11:58
Story 5. Copy Fail is the kind of kernel bug
12:01
Now to CVE-2026-31431, also known as Copy Fail.
12:08
NVD shows that this Linux kernel vulnerability
12:11
is in CISA's Known Exploited Vulnerabilities
12:15
catalog. And CISA's entry gives federal agencies
12:19
a remediation due date of May 15th. The flaw
12:23
is an incorrect resource transfer between spheres
12:26
issue in the Linux kernel. And the reporting
12:29
around it says it enables local privilege escalation
12:32
to root on a wide range of Linux distributions.
12:36
AWS's security bulletin says that with the exception
12:40
of specifically listed services, most AWS customers
12:44
are not affected. But it also lists update timelines
12:47
for affected services such as Bottlerocket, ECS
12:52
on EC2, EKS optimized AMIs, and EMR. So this
12:57
is one of those stories where "not everybody is
12:59
affected" should not turn into "nobody on our side
13:02
is affected." You still have to inventory what
13:05
you run. That is the operational lesson here.
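As a rough sketch of what that inventory step could look like in Python: the host names, family labels, and threshold releases below are all hypothetical, since the episode gives no fixed-version numbers for this CVE. Distro kernels backport fixes, so a comparison like this only makes sense against the first fixed release published in each vendor's own advisory, never against an upstream version number.

```python
import re


def kernel_tuple(release: str) -> tuple:
    """Turn a release like '6.8.0-45-generic' into (6, 8, 0, 45) for comparison."""
    m = re.match(r"(\d+(?:\.\d+)*)(?:-(\d+(?:\.\d+)*))?", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    parts = m.group(1).split(".")
    if m.group(2):
        parts += m.group(2).split(".")
    return tuple(int(p) for p in parts)


def needs_patch(inventory, fixed_in):
    """List hosts that are below their family's first fixed release.

    inventory: iterable of (host, family, running_release)
    fixed_in:  mapping of family -> first fixed release, taken from the
               vendor advisory (hypothetical values in this sketch)
    """
    exposed = []
    for host, family, release in inventory:
        threshold = fixed_in.get(family)
        if threshold is None:
            # An unknown family is treated as exposed until someone checks it.
            exposed.append((host, "no advisory threshold recorded"))
        elif kernel_tuple(release) < kernel_tuple(threshold):
            exposed.append((host, f"{release} < {threshold}"))
    return exposed
```

Treating hosts with no recorded threshold as exposed is a deliberate choice here: it keeps "we never looked" from reading as "we are fine."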
13:07
When a kernel LPE lands in KEV and there is
13:12
active exploitation, this is not the time for
13:15
vague patch queues. This is the time for concrete
13:18
exposure review. Which hosts? Which images? Which
13:21
managed services? Which self-managed nodes?
13:24
Which maintenance windows? Which compensating
13:27
controls until patched? And honestly, the thing
13:31
I always come back to with stories like this
13:33
is that kernel issues collapse abstraction layers
13:36
really fast. You can have immaculate Kubernetes
13:39
policy, good IAM, strong workload boundaries,
13:43
and still get wrecked if the shared kernel underneath
13:46
is vulnerable and unpatched. That is why these
13:50
bugs matter. They turn every higher-level control
13:53
into a best-effort suggestion until the underlying
13:57
system is fixed. Story 6. Google and AWS are
14:05
both telling us agents need first-class infrastructure
14:08
identity now. Last main story. I wanted to put
14:12
Google Cloud Agent Identity and AWS MCP Server
14:15
GA together. Because they are different products,
14:18
but they point in the same direction. Google
14:21
Cloud's new IAM post says that the AI era needs
14:25
a different security and governance model for
14:27
autonomous agents. And it introduces Agent Identity
14:31
as a new first-class principal type, distinct
14:34
from human identities and generic service accounts.
14:38
Google says these identities are built on SPIFFE,
14:42
are cryptographically protected, and strongly
14:45
attested, and can be used to authenticate to
14:48
MCP servers, cloud resources, endpoints, and
14:52
other agents. It also ties this into agent gateway
14:55
policy enforcement, least privilege controls,
14:58
and runtime defense. AWS's side of the story
15:02
is different in implementation, but similar in
15:05
intent. AWS says that the AWS MCP server is now
15:08
generally available and gives AI agents and coding
15:11
assistants secure, authenticated access to AWS
15:14
services through a small fixed set of tools
15:17
using existing IAM credentials. The official
15:21
announcement frames it directly around the problem
15:23
people keep asking. How do you give an agent
15:26
real AWS access without just handing it keys
15:29
to the kingdom? That is why I think these stories
15:31
belong together. Both clouds are basically admitting
15:34
the same thing. Agents are not side features
15:36
anymore. They are becoming infrastructure actors.
15:40
And once that is true, the identity model has
15:42
to change too. Not just more service accounts,
15:45
not just more static tokens, not just vibes and
15:48
trust. Real principals, real boundaries, real
15:52
policy, real auditability. So my read is that
15:56
this is where cloud identity is headed next.
15:58
Human identity was phase one. Workload identity
16:02
was phase two. Agent identity is shaping up to
16:05
be phase three. And if teams do not start treating
16:08
that as real infrastructure design now, they're
16:11
going to rediscover all of the old machine identity
16:14
mistakes with an LLM in the loop and a much faster
16:18
failure path. A few quick ones before we wrap.
16:28
Cilium published lessons from securing CI/CD
16:31
for an open source project. And the reason that
16:34
I like it is because it is practical. On-call
16:37
brief highlights controls like tighter CI/CD security
16:40
practices and lessons learned from operating
16:43
a real open source pipeline. That is worth a
16:45
read for anyone treating GitHub Actions hardening
16:48
like a someday project. Velero version
16:51
1.18.1-rc.2 includes a security fix for
16:57
CVE-2026-27141 by bumping golang.org/x/net
17:03
to version 0.51.0. That is the kind of
17:08
small release note that matters a lot when you
17:10
realize it closes a known vuln in a tool people
17:13
trust for recovery. Google's release notes also
17:16
carry a couple of quieter but real operational
17:19
changes. Google Ads and related measurement APIs
17:23
moved to a 37-month retention policy for granular
17:27
performance stats starting June 1st. And Google
17:31
Distributed Cloud's Kubernetes 1.35 platform
17:34
update now requires cgroups v2 with cgroups v1
17:40
no longer supported for creation or upgrades.
17:44
Those are not flashy changes, but both can absolutely
17:47
break assumptions if nobody notices them. AWS
17:50
also published a cross-region disaster recovery
17:53
walkthrough for Amazon EKS using AWS Backup.
17:58
The post walks through creating backup vaults
18:01
in source and DR regions, running on-demand
18:04
backups, and initiating cross-region copies.
18:07
It is not a glamorous story, but it is the kind
18:10
of thing that people say they care about right
18:12
up until they realize they have never actually
18:14
rehearsed it. I think the human thread underneath
18:24
this week's episode is that modern reliability
18:27
work keeps getting squeezed between two kinds
18:31
The second kind feels newer: AI agents with too much authority, new cloud identity models
18:55
that quietly become control planes, build and
18:58
operational systems that act like sidecars until
19:01
the day they absolutely are not. And what is
19:04
tricky is that these two categories are not separate
19:07
anymore. The old failures now happen inside environments
19:11
shaped by the newer automation. The newer automation
19:14
inherits the blast radius of the old infrastructure.
19:17
And ops teams end up responsible for both at
19:21
the same time. So I think the real takeaway this
19:23
week is pretty simple. Reliability is not just
19:26
about uptime anymore. It is about authority.
19:28
Who and what can act. What it can touch. How
19:32
fast it can fail. And whether your system design
19:34
assumes mistakes will stay local when they almost
19:37
never do. That sounds abstract until you line
19:41
up the stories. A TLD signs bad data and millions
19:45
of lookups fail. A social platform runs into
19:48
port exhaustion and logging makes it worse. A
19:51
GitOps controller hits end of life while version
19:54
parsing changes in the next branch. A kernel
19:57
bug drops into active exploitation. An AI agent
20:01
sees the wrong token and production disappears.
20:04
Different systems, same lesson. Small authority
20:07
problems turn into large reliability problems
20:11
very quickly. All right, that's it for this week
20:14
of Ship It Weekly. Quick recap. The PocketOS
20:17
and Cursor database wipe and why it is really
20:21
an identity story. The .de DNSSEC outage and what
20:25
happens when the trust chain itself goes wrong.
20:28
Bluesky's outage post-mortem and how observability
20:31
can become part of the incident path. Argo CD
20:35
3.1 going end of life while 3.4 changes behavior.
20:40
Copy Fail and why active kernel exploitation
20:42
still cuts through all the higher-level abstractions.
20:46
And Google Cloud Agent Identity plus AWS MCP
20:50
Server GA. Because the cloud providers are starting
20:53
to formalize agents as real infrastructure actors.
20:56
Then in the lightning round, Cilium's CI/CD security
21:00
lessons. Velero's CVE fix. Google's retention
21:04
and cgroups v2 changes. And AWS's cross-region
21:09
EKS disaster recovery. Links and show notes are
21:12
on shipitweekly.fm. You can also find the video
21:15
versions on YouTube. And if you want the source
21:18
stack before the episode lands, check out this
21:21
week's on-call brief. If this episode was useful,
21:24
follow or subscribe wherever you listen. And
21:27
send it to the person on your team who still
21:28
has to explain that reliability problems are
21:32
not just outages anymore. Sometimes they are
21:34
identity problems, tooling problems, and authority
21:37
problems wearing outage clothes. I'm Brian and
21:40
I'll see you next week.
What stood out to me this week is that the failure modes were all over the stack, but they kept pointing back to the same thing: authority.
The PocketOS and Cursor story is the obvious example. It is easy to frame that one as “AI went rogue,” but that’s not really the useful lesson. The useful lesson is that an agent got access to a token it should not have had, and once it had that authority, the rest happened fast. On the other end of the spectrum, the .de outage was not AI at all. It was classic Internet plumbing: bad DNSSEC signatures at the TLD level, validating resolvers doing exactly what they were supposed to do, and millions of domains effectively disappearing behind SERVFAIL. Different systems, same theme. Give the wrong thing too much trust, or centralize trust in the wrong place, and the blast radius gets big fast. (Teller's Tech)
That’s also why I liked the Bluesky postmortem so much. It is the kind of outage write-up operators actually learn from because it is not clean or elegant. They were exhausting ports, but the debugging path and the logging behavior helped amplify the pain. That is a very real production pattern. The first problem hurts, then the systems you rely on to reason about it start adding load, noise, or contention of their own. A lot of outages are not one bad component failing in isolation. They are a cluster of small, understandable behaviors that turn pathological together. (Pckt)
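One way to keep logging from adding contention during an error storm is to make the request path enqueue records without ever blocking, and to shed them once a bounded queue fills up. The sketch below shows that idea in Python using the standard library's `logging.handlers.QueueHandler`. It is not Bluesky's fix, and it does not reproduce the Go runtime behavior described in the postmortem. `DroppingQueueHandler` and `build_logger` are hypothetical names.

```python
import logging
import logging.handlers
import queue


class DroppingQueueHandler(logging.handlers.QueueHandler):
    """Enqueue records without blocking; count and drop them when the queue is full."""

    def __init__(self, q):
        super().__init__(q)
        self.dropped = 0

    def enqueue(self, record):
        try:
            self.queue.put_nowait(record)
        except queue.Full:
            # Shed load instead of stalling whatever code path is logging.
            self.dropped += 1


def build_logger(name, max_pending):
    """Return (logger, handler, queue) wired so logging never blocks the caller."""
    q = queue.Queue(maxsize=max_pending)
    handler = DroppingQueueHandler(q)
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep records out of the root logger's handlers
    logger.addHandler(handler)
    return logger, handler, q
```

In a real service, a `logging.handlers.QueueListener` on a background thread would drain the queue into the actual sink, and `handler.dropped` would be exported as a metric so the shedding itself stays visible.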
Argo CD and the kernel bug were the quieter stories, but maybe the more familiar ones for day-to-day operators. Argo CD 3.1 hitting end of life while 3.4 changes Kubernetes version interpretation is exactly the kind of thing teams wave off until a controller upgrade lands and selection logic stops behaving the way people assumed. CVE-2026-31431 is the same kind of reminder from a different angle. Kernel bugs do not care how nice your abstractions are. If the shared base layer is vulnerable and actively exploited, your higher-level controls stop feeling very absolute. That’s why the boring work still matters: controller version hygiene, image inventory, maintenance windows, patch review, and all the stuff nobody wants to talk about when there is a shinier story on the page. (GitHub)
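For the Argo CD point, one way to check selection logic before an upgrade is to record the version string each cluster reports on the current release, record it again on a test instance of the target release, and diff what a version selector would pick. The Python sketch below assumes exact string matching and uses made-up version formats purely for illustration. The release notes, not this example, define what actually changes.

```python
def matching(cluster_versions: dict, wanted: str) -> set:
    """Clusters whose reported version string equals the selector value exactly."""
    return {name for name, version in cluster_versions.items() if version == wanted}


def selection_drift(before: dict, after: dict, wanted: str) -> dict:
    """Compare which clusters a version selector picks before and after an upgrade."""
    old, new = matching(before, wanted), matching(after, wanted)
    return {"lost": sorted(old - new), "gained": sorted(new - old)}


# Hypothetical data: the same clusters reporting differently formatted versions.
before = {"prod-eu": "1.30", "prod-us": "1.30", "dev": "1.31"}
after = {"prod-eu": "v1.30.2", "prod-us": "v1.30.2", "dev": "v1.31.0"}
drift = selection_drift(before, after, "1.30")
```

A non-empty `lost` list is the signal that matters: the selector still parses, but it quietly stops targeting clusters it used to reach.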
The other piece I kept coming back to is that the clouds are starting to admit agents are no longer a novelty feature hanging off the side of existing IAM. Google is introducing Agent Identity as a first-class principal type built on SPIFFE, and AWS is pushing MCP access as something that should be secure, authenticated, and bounded through a fixed tool surface. That is a pretty big signal. We are watching cloud identity move from human identity, to workload identity, to agent identity. And if that sounds abstract, it really is not. It just means teams are about to rediscover every old machine-identity mistake they already made once, except now the actor on the other end can move faster and make stranger decisions. (Google Cloud)
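To make "agent as its own principal type with a fixed tool surface" concrete, here is a small Python sketch. It parses an ID shaped like a SPIFFE ID (`spiffe://<trust-domain>/<path>`) and checks a requested tool against an allow-list per principal kind. The `/<kind>/<name>` path layout, the tool names, and `ALLOWED_TOOLS` are assumptions made for this example; they are not Google's or AWS's schema, and nothing here verifies attestation or credentials.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

# Hypothetical fixed tool surface per principal kind.
ALLOWED_TOOLS = {
    "agent": frozenset({"read_docs", "describe_resource"}),
    "workload": frozenset({"read_docs", "describe_resource", "write_object"}),
}


@dataclass(frozen=True)
class Principal:
    trust_domain: str
    kind: str
    name: str


def parse_principal(spiffe_id: str) -> Principal:
    """Parse an ID shaped like spiffe://<trust-domain>/<kind>/<name>."""
    parsed = urlparse(spiffe_id)
    segments = [s for s in parsed.path.split("/") if s]
    if parsed.scheme != "spiffe" or not parsed.netloc or len(segments) != 2:
        raise ValueError(f"not a recognized ID: {spiffe_id!r}")
    return Principal(parsed.netloc, segments[0], segments[1])


def authorize(spiffe_id: str, tool: str, trusted_domain: str) -> bool:
    """Allow a call only for a trusted domain and a tool on that kind's list."""
    principal = parse_principal(spiffe_id)
    if principal.trust_domain != trusted_domain:
        return False
    # Unknown principal kinds get an empty tool surface by default.
    return tool in ALLOWED_TOOLS.get(principal.kind, frozenset())
```

The useful property is the default: a principal kind nobody planned for gets no tools at all, rather than inheriting whatever a generic service account could do.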
So my takeaway from this episode is simple. Reliability is still about uptime, latency, and recovery, sure. But more and more, it is also about who or what is allowed to act, what it can touch, and whether your environment assumes a mistake will stay local when it probably will not. That applies to DNS trust chains, GitOps controllers, kernel exposure, backup design, and AI agents with credentials. Different layers, same question: where does the authority actually live, and how much damage can it do before something stops it? (Teller's Tech)
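That closing question can be written down as a small decision function. The Python sketch below is illustrative only. `Credential`, `Request`, and `decide` are hypothetical names, not any vendor's policy engine. It encodes three of the boundaries the episode keeps returning to: the credential's environment, the credential's scope, and a human approval step for destructive actions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Credential:
    environment: str            # e.g. "staging" or "prod"
    allowed_actions: frozenset  # actions this token is scoped to


@dataclass(frozen=True)
class Request:
    actor: str
    environment: str
    action: str
    destructive: bool
    approved_by: Optional[str] = None


def decide(cred: Credential, req: Request) -> str:
    """Return 'allow', or a 'deny'/'hold' reason, for one requested action."""
    if cred.environment != req.environment:
        return "deny: credential is scoped to another environment"
    if req.action not in cred.allowed_actions:
        return "deny: action is outside the credential's scope"
    if req.destructive and req.approved_by is None:
        return "hold: destructive action needs a human approval"
    return "allow"
```

A staging-scoped token asked to drop a production database is refused at the first check, before the agent's judgment ever matters.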