0:00
A lot of the risk in modern infrastructure is
0:02
no longer hiding in the parts we used to fear
0:05
most. Sometimes it is a bad kernel bug. Sometimes
0:08
it is a broken DNS signature at the TLD layer.
0:12
Sometimes it is a GitOps upgrade that changes
0:15
behavior under your feet. And sometimes it is
0:18
an AI agent with way too much authority and not
0:21
nearly enough guardrails. This week has a little
0:24
bit of all of that. And the common thread is
0:26
pretty simple. Control planes are brittle, automation
0:30
is powerful, and identity still decides whether
0:33
a mistake stays small or becomes a crater. Hey,
0:54
I'm Brian Teller. I work in DevOps and SRE, and
0:57
I run Teller's Tech. This is Ship It Weekly,
1:00
where I filter the noise and focus on what actually
1:02
changes how we run infrastructure and own reliability.
1:06
Show notes and links are on shipitweekly.fm.
1:09
If the show's been useful, follow it wherever
1:11
you listen. Ratings help way more than they should.
1:14
And if you're watching on YouTube, subscribe
1:16
there too. We've got six main stories today,
1:20
then the lightning round, and we'll wrap with
1:22
the human closer. We're starting with the
1:24
PocketOS and Cursor incident because it is probably
1:27
the most instantly gripping story in the set
1:30
and also one of the most revealing. Not because
1:33
AI made a mistake, but because an agent got access
1:36
to something it should not have had in the first
1:38
place. Then we're going to the .de DNSSEC outage
1:43
and Cloudflare's response, which is one of those
1:46
reminders that the internet's deepest layers
1:48
are still capable of ruining everybody's afternoon,
1:52
all at once. After that, Bluesky's outage
1:56
post-mortem, which is a really good incident story
1:58
because it is weird, specific, and painfully real.
2:02
Then Argo CD, with version 3.1.16 being the
2:08
last 3.1 release and version 3.4.1 bringing
2:13
a behavior change that people really do need
2:16
to notice. Then the Linux kernel Copy Fail bug,
2:20
CVE-2026-31431. Because active exploitation plus broad
2:27
Linux exposure is not something ops teams get
2:30
to casually defer. And finally, Google Cloud
2:33
Agent Identity and AWS MCP Server GA. Because
2:37
the cloud platforms are starting to treat AI
2:39
agents as first-class actors instead of weird
2:43
sidecars bolted onto existing IAM assumptions.
2:50
Story 1. PocketOS and Cursor is really an
2:54
identity story. Let's start there. This week's
2:57
on-call brief summarized the incident this way.
3:00
On April 25th, a Cursor AI agent reportedly deleted
3:04
PocketOS's production database in under 10 seconds
3:07
after a credential mismatch led it to access
3:11
an API token it should not have had. The result,
3:14
according to the reporting cited in the brief,
3:17
was a full production data loss, including backups.
3:20
This is obviously a wild headline, but I think
3:23
the more important part is not the speed or even
3:26
the AI branding. It is the access path. Because
3:29
if an agent can see a destructive credential,
3:31
then the real problem started before the agent
3:34
ever took action. That is why I think this
3:36
story is more useful than just "look at what AI
3:39
did." It is really about credential sprawl, environment
3:42
separation, and what happens when people start
3:45
granting machine-assisted workflows access that
3:48
was probably too broad even for a human. An AI
3:51
agent did not invent bad blast radius design.
3:55
It just exercised it fast. And that is the practical
3:58
takeaway I'd want teams to hear. If you are adding
4:01
AI coding agents or operational agents into real
4:05
environments, you do not get to treat identity
4:08
as a cleanup task for later. The difference between
4:11
helpful automation and a production incident is
4:14
often just one token, one role, one environment
4:18
boundary, one missing approval step. So yeah,
4:21
this story is dramatic. But the lesson is boring.
4:25
And boring is good. Scope the credential. Separate
4:28
staging from prod. Assume the agent will eventually
4:32
try something dumb, overconfident, or surprising. Because sooner or later, it will.
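To make that boring lesson concrete, here is a minimal sketch of the shape I mean, in Go. Everything here is hypothetical, a pattern rather than anyone's real product: the agent never holds a destructive credential directly, and a broker decides per action and fails closed.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical guardrail sketch. The agent never holds a
// destructive credential itself; every action goes through a
// broker that checks scope first and fails closed.
type AgentAction struct {
	Env         string // "staging" or "prod"
	Destructive bool   // deletes, drops, truncates
	Approved    bool   // explicit human approval on record
}

var errDenied = errors.New("denied: destructive prod action without approval")

func Authorize(a AgentAction) error {
	if a.Env == "prod" && a.Destructive && !a.Approved {
		return errDenied
	}
	return nil
}

func main() {
	// The PocketOS-style scenario: destructive, prod, no approval.
	if err := Authorize(AgentAction{Env: "prod", Destructive: true}); err != nil {
		fmt.Println(err) // the boring outcome you actually want
	}
}
```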
4:35
Story 2.
4:42
The .de DNSSEC outage is a reminder that internet
4:46
plumbing still has global blast radius. Next
4:50
up, the .de outage. Cloudflare says that on May
4:53
5th, at roughly 19:30 UTC, DENIC, the operator
4:59
for the .de TLD, started publishing incorrect DNSSEC
5:04
signatures for the .de zone. Any validating
5:08
resolver that received those records was required
5:11
by DNSSEC rules to reject them and return
5:15
SERVFAIL. Cloudflare says 1.1.1.1 was no exception.
5:20
That means a bad cryptographic state at the TLD
5:24
layer was enough to make huge numbers of .de
5:27
domains effectively unreachable. That is a great
5:31
outage story for ops people because it is one
5:34
of those failures where everything is technically
5:36
working exactly as designed, and that is the problem.
5:40
DNSSEC is there to make sure responses are authentic.
5:43
The signatures were wrong. Resolvers did the
5:46
correct thing and refused to trust the data.
5:48
And suddenly, correct behavior becomes mass breakage.
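One practical trick if you are debugging this kind of thing live: compare a normal query against one with the CD (Checking Disabled) bit set. If validation is the only thing failing, the CD query succeeds. Here is a rough sketch using the miekg/dns Go library; the resolver and domain are just examples.

```go
package main

import (
	"fmt"

	"github.com/miekg/dns"
)

// Rough diagnostic sketch: if a lookup SERVFAILs with validation on
// but succeeds with the Checking Disabled (CD) bit set, the failure
// is almost certainly DNSSEC validation, not the zone being down.
func query(name, resolver string, checkingDisabled bool) (int, error) {
	m := new(dns.Msg)
	m.SetQuestion(dns.Fqdn(name), dns.TypeA)
	m.SetEdns0(4096, true) // DO bit: ask for DNSSEC records
	m.CheckingDisabled = checkingDisabled

	c := new(dns.Client)
	r, _, err := c.Exchange(m, resolver)
	if err != nil {
		return 0, err
	}
	return r.Rcode, nil
}

func main() {
	resolver := "1.1.1.1:53"
	validated, _ := query("example.de", resolver, false)
	bypassed, _ := query("example.de", resolver, true)
	if validated == dns.RcodeServerFailure && bypassed == dns.RcodeSuccess {
		fmt.Println("SERVFAIL only with validation on: DNSSEC problem upstream")
	}
}
```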
5:52
That is such a brutal but important reliability
5:55
lesson. Security mechanisms do not eliminate
5:59
failure. They sometimes concentrate it. Cloudflare's
6:02
write-up is also good because it walks through
6:04
mitigation trade-offs instead of pretending
6:07
there's always a clean answer. When the trust
6:09
anchor above you is wrong, your choices get ugly
6:12
fast. Do you keep strict validation and break
6:16
reachability? Or do you apply a temporary exception
6:19
and accept the risk of bypassing a protection
6:22
that exists for a reason? That is not really
6:25
a DNS-only lesson either. It is a control plane
6:28
lesson. When a higher order authority goes bad,
6:32
the downstream systems that trust it do not fail
6:35
independently. They fail together. So if I'm
6:38
taking one operator lesson from this, it is this.
6:41
Any system built on centralized trust, centralized
6:44
signing, or centralized metadata needs a very
6:48
honest failure mode conversation. Because when
6:51
the thing at the top is wrong, your elegant security
6:53
chain can turn into synchronized outage machinery
6:57
very quickly. Before we get to the next story,
7:00
a quick note from this week's sponsor, Guardsquare.
7:03
If you are building mobile apps, "good enough"
7:05
security is usually a problem waiting to happen.
7:09
Guardsquare focuses on actually protecting your
7:12
code, in addition to scanning it. That means code
7:15
hardening, runtime protection, testing, and visibility
7:19
into what's happening once your app is out in
7:22
the wild. So if you are responsible for shipping
7:24
and securing mobile apps, Android or iOS, definitely
7:28
worth taking a look at guardsquare.com. Alright,
7:32
back to the show. Story 3. Bluesky's outage
7:39
post-mortem is the kind of weird incident story
7:42
worth studying. Now for my favorite kind of reliability
7:45
story. Bluesky published a post-mortem for
7:48
its April outage. And the post says that the
7:50
service was intermittently down for about half
7:52
its users for around eight hours. Jim Calabro's
7:55
write-up says they saw pretty quickly that they
7:58
were exhausting ports. But the harder part was
8:01
figuring out exactly where and why. The root
8:04
issue involved memcached traffic, ephemeral port
8:07
exhaustion, and a cascade where the debugging
8:10
path became part of the failure path. What makes
8:14
this one so good is that it is not just "we ran
8:17
out of ports." The post-mortem explains that logging
8:20
used blocking write syscalls, and under the load
8:24
of trying to log huge volumes of errors while
8:27
still serving traffic, the Go runtime spawned
8:30
far more OS threads than normal. Bluesky says
8:33
that extra thread pressure then hit the garbage
8:36
collector, contributing to the broader failure
8:39
cascade. The write-up also shows a custom dialer
8:42
workaround that randomizes loopback IPs to avoid
8:46
ephemeral port exhaustion on a single address
8:50
after restarts.
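Here is my own rough reconstruction of that idea, not Bluesky's actual code. The trick works because every 127.x.y.z address is loopback, so spreading connections across destination addresses multiplies the usable connection tuple space. It assumes the server is listening on all loopback addresses rather than only 127.0.0.1.

```go
package main

import (
	"context"
	"fmt"
	"math/rand"
	"net"
	"time"
)

// Sketch of a loopback-randomizing dialer. With a single fixed
// source and destination address, only the ephemeral port range
// distinguishes connections; randomizing the destination across
// 127.0.0.0/8 multiplies the available (src, dst) tuple space.
func dialRandomLoopback(ctx context.Context, network, addr string) (net.Conn, error) {
	_, port, err := net.SplitHostPort(addr)
	if err != nil {
		return nil, err
	}
	// Pick a random host byte in 127.0.0.1 - 127.0.0.254.
	host := fmt.Sprintf("127.0.0.%d", 1+rand.Intn(254))
	d := net.Dialer{Timeout: 2 * time.Second}
	return d.DialContext(ctx, network, net.JoinHostPort(host, port))
}

func main() {
	// Assumes a local server (e.g. memcached) bound to all
	// loopback addresses, not just 127.0.0.1.
	conn, err := dialRandomLoopback(context.Background(), "tcp", "127.0.0.1:11211")
	if err != nil {
		fmt.Println("dial failed:", err)
		return
	}
	defer conn.Close()
	fmt.Println("connected via", conn.RemoteAddr())
}
```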
8:52
That is such a real incident pattern. The original fault hurts. The observability
8:56
path amplifies it. The recovery path gets noisy.
8:59
And then the system that is supposed to help
9:01
you reason about the incident starts participating
9:03
in the incident. That is not hypothetical. That
9:06
is production. And I also like the honesty in
9:09
the write-up. Bluesky basically says the signal
9:12
was there, just buried in noise,
9:15
and that you need the discipline and the metrics
9:18
to cut through it. That lands because it is true.
9:21
A lot of outages are not invisible. They are
9:24
just obscured by too many symptoms arriving at
9:26
once. So my takeaway here is not just "watch ephemeral
9:29
ports." It is broader than that. Watch where your
9:32
debugging strategy becomes a scaling liability.
9:36
Watch where synchronous behavior hides inside
9:38
paths you think are harmless. And watch for failure
9:42
modes where saturation causes your tooling to
9:45
become part of the blast radius instead of part
9:48
of the recovery. Story 4. Argo CD is giving people
9:56
one quiet end-of-life and one not-so-quiet
9:59
behavior change. Next up, Argo CD. Argo's version
10:03
3.1.16 release is the final release in the
10:07
3.1 series. The release notes are very explicit.
10:11
As of May 6, 2026, 3.1 has reached end of life
10:15
and will no longer receive bug fixes or security
10:18
updates. The same release notes tell operators
10:21
to move to a supported version, meaning 3.2,
10:25
3.3, or 3.4. That alone is worth a mention
10:28
because GitOps tools have a way of becoming background
10:32
furniture. Teams stop thinking about the controller
10:34
version because the controller is just there
10:36
doing its thing. Right up until the day it matters
10:39
a lot. But then there is the second part. Argo
10:41
CD version 3.4.1 is the first release in the
10:46
3.4 series. And the release notes call out an
10:49
important change. Following Helm 3.19.0, Argo
10:53
CD now aligns its interpretation of Kubernetes
10:56
cluster version with Helm's behavior. The on-call brief
11:00
points out that this affects ApplicationSets
11:02
that filter clusters by Kubernetes version. That
11:06
is exactly the kind of change that looks tiny
11:08
in a release note and then quietly breaks assumptions
11:11
in real environments. So for me, this is a classic
11:15
operator story. Not sexy, not viral, very real.
11:19
One branch is dead. One branch changes parsing
11:22
behavior. And if your GitOps setup depends on
11:25
version-based selection logic, you need to actually
11:28
test that logic instead of assuming a minor release means minor consequences.
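For a flavor of how that kind of surprise shows up, here is a sketch using the Masterminds semver library, which Helm uses for version constraints. I am not claiming this is the exact mechanism behind the Argo CD change; it is an illustration of why version-matching semantics deserve an explicit test.

```go
package main

import (
	"fmt"

	"github.com/Masterminds/semver/v3"
)

// A version-filter gotcha in miniature. Suffixed managed-Kubernetes
// versions like "1.30.5-gke.100" parse as prereleases, and in this
// library a prerelease does not satisfy a plain range constraint
// unless the constraint itself opts in with a prerelease marker.
func main() {
	v := semver.MustParse("1.30.5-gke.100")

	plain, _ := semver.NewConstraint(">= 1.29.0")
	withPre, _ := semver.NewConstraint(">= 1.29.0-0")

	fmt.Println(plain.Check(v))   // false: prerelease filtered out
	fmt.Println(withPre.Check(v)) // true: prereleases allowed
}
```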
11:31
The practical lesson
11:34
is simple. Treat controller upgrades like control
11:38
plane changes, not package refreshes. Because
11:41
that is what they are. If the thing deciding
11:44
what gets deployed, where, and when changes how
11:47
it interprets your environment, that is production
11:50
behavior, not housekeeping.
11:58
Story 5. Copy Fail is the kind of kernel bug ops teams do not get to casually defer.
12:01
Now to CVE-2026-31431, also known as Copy Fail.
12:08
NVD shows that this Linux kernel vulnerability
12:11
is in CISA's Known Exploited Vulnerabilities
12:15
catalog. And CISA's entry gives federal agencies
12:19
a remediation due date of May 15th. The flaw
12:23
is an incorrect resource transfer between spheres
12:26
issue in the Linux kernel. And the reporting
12:29
around it says it enables local privilege escalation
12:32
to root on a wide range of Linux distributions.
12:36
AWS's security bulletin says that with the exception
12:40
of specifically listed services, most AWS customers
12:44
are not affected. But it also lists update timelines
12:47
for affected services such as Bottlerocket, ECS
12:52
on EC2, EKS-optimized AMIs, and EMR. So this
12:57
is one of those stories where "not everybody is
12:59
affected" should not turn into "nobody on our side
13:02
is affected." You still have to inventory what
13:05
you run. That is the operational lesson here.
13:07
When a kernel LPE lands in KEV and there is
13:12
active exploitation, this is not the time for
13:15
vague patch queues. This is the time for concrete
13:18
exposure review. Which hosts? Which images? Which
13:21
managed services? Which self-managed nodes?
13:24
Which maintenance windows? Which compensating controls until patched?
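If you want a starting point for that inventory, here is a tiny Linux-specific sketch that reports the running kernel release on a host. Mapping a release string to "patched for this CVE" is distro-specific, so that lookup table is an assumption left to your own tooling.

```go
package main

import (
	"bytes"
	"fmt"

	"golang.org/x/sys/unix"
)

// Tiny exposure-review sketch: report the running kernel so fleet
// tooling can compare it against your distro's patched versions.
func kernelRelease() (string, error) {
	var uts unix.Utsname
	if err := unix.Uname(&uts); err != nil {
		return "", err
	}
	// Release is a fixed-size, NUL-padded byte array.
	return string(bytes.TrimRight(uts.Release[:], "\x00")), nil
}

func main() {
	rel, err := kernelRelease()
	if err != nil {
		fmt.Println("uname failed:", err)
		return
	}
	fmt.Println("running kernel:", rel)
}
```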
13:27
And honestly, the thing
13:31
I always come back to with stories like this
13:33
is that kernel issues collapse abstraction layers
13:36
really fast. You can have immaculate Kubernetes
13:39
policy, good IAM, strong workload boundaries,
13:43
and still get wrecked if the shared kernel underneath
13:46
is vulnerable and unpatched. That is why these
13:50
bugs matter. They turn every higher-level control
13:53
into a best-effort suggestion until the underlying
13:57
system is fixed. Story 6. Google and AWS are
14:05
both telling us agents need first-class infrastructure
14:08
identity now. Last main story. I wanted to put
14:12
Google Cloud Agent Identity and AWS MCP Server
14:15
GA together. Because they are different products,
14:18
but they point in the same direction. Google
14:21
Cloud's new IAM post says that the AI era needs
14:25
a different security and governance model for
14:27
autonomous agents. And it introduces agent identity
14:31
as a new first-class principal type, distinct
14:34
from human identities and generic service accounts.
14:38
Google says these identities are built on SPIFFE,
14:42
are cryptographically protected and strongly
14:45
attested, and can be used to authenticate to
14:48
MCP servers, cloud resources, endpoints, and other agents.
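For a flavor of what SPIFFE-style identity looks like from the workload's side, here is a generic sketch using the go-spiffe library. This is plain SPIFFE Workload API usage, not Google's agent identity API specifically, and the identity in the comment is made up.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/spiffe/go-spiffe/v2/workloadapi"
)

// A workload (or agent) fetches its own attested identity from the
// local SPIFFE Workload API instead of holding a long-lived static
// credential.
func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Connects to the Workload API over its local socket.
	source, err := workloadapi.NewX509Source(ctx)
	if err != nil {
		fmt.Println("no workload API available:", err)
		return
	}
	defer source.Close()

	svid, err := source.GetX509SVID()
	if err != nil {
		fmt.Println("no SVID:", err)
		return
	}
	// e.g. spiffe://example.org/agent/deploy-bot (hypothetical)
	fmt.Println("attested identity:", svid.ID)
}
```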
14:52
It also ties this into agent gateway
14:55
policy enforcement, least privilege controls,
14:58
and runtime defense. AWS's side of the story
15:02
is different in implementation, but similar in
15:05
intent. AWS says that the AWS MCP server is now
15:08
generally available and gives AI agents and coding
15:11
assistants secure, authenticated access to AWS
15:14
services through a small fixed set of tools
15:17
using existing IAM credentials. The official
15:21
announcement frames it directly around the question
15:23
people keep asking. How do you give an agent
15:26
real AWS access without just handing it keys
15:29
to the kingdom? That is why I think these stories
15:31
belong together. Both clouds are basically admitting
15:34
the same thing. Agents are not side features
15:36
anymore. They are becoming infrastructure actors.
15:40
And once that is true, the identity model has
15:42
to change too. Not just more service accounts,
15:45
not just more static tokens, not just vibes and
15:48
trust. Real principals, real boundaries, real
15:52
policy, real auditability. So my read is that
15:56
this is where cloud identity is headed next.
15:58
Human identity was phase one. Workload identity
16:02
was phase two. Agent identity is shaping up to
16:05
be phase three. And if teams do not start treating
16:08
that as real infrastructure design now, they're
16:11
going to rediscover all of the old machine identity
16:14
mistakes with an LLM in the loop and a much faster
16:18
failure path. A few quick ones before we wrap.
16:28
Cilium published lessons from securing CI/CD
16:31
for an open source project. And the reason
16:34
I like it is that it is practical. The on-call
16:37
brief highlights controls like tighter CI/CD security
16:40
practices and lessons learned from operating
16:43
a real open source pipeline. That is worth a
16:45
read for anyone treating GitHub Actions hardening
16:48
like a someday project. Velero version
16:51
1.18.1-rc.2 includes a security fix for
16:57
CVE-2026-27141 by bumping golang.org/x/net
17:03
to version 0.51.0. That is the kind of
17:08
small release note that matters a lot when you
17:10
realize it closes a known vuln in a tool people
17:13
trust for recovery. Google's release notes also
17:16
carry a couple of quieter but real operational
17:19
changes. Google Ads and related measurement APIs
17:23
moved to a 37-month retention policy for granular
17:27
performance stats starting June 1st. And Google
17:31
Distributed Cloud's Kubernetes 1.35 platform
17:34
update now requires cgroups v2 with cgroups v1
17:40
no longer supported for creation or upgrades.
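A cheap preflight for that one: on a unified cgroup v2 host, the cgroup.controllers file exists at the root of /sys/fs/cgroup, so a check like this sketch can flag nodes before an upgrade.

```go
package main

import (
	"fmt"
	"os"
)

// On a unified cgroup v2 host, /sys/fs/cgroup/cgroup.controllers
// exists at the root. On legacy v1 (or hybrid) setups it does not,
// so this is a cheap way to flag nodes ahead of an upgrade that
// requires v2.
func main() {
	if _, err := os.Stat("/sys/fs/cgroup/cgroup.controllers"); err == nil {
		fmt.Println("cgroup v2 (unified) detected")
	} else {
		fmt.Println("cgroup v1 or hybrid: plan migration before upgrading")
	}
}
```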
17:44
Those are not flashy changes, but both can absolutely
17:47
break assumptions if nobody notices them. AWS
17:50
also published a cross-region disaster recovery
17:53
walkthrough for Amazon EKS using AWS Backup.
17:58
The post walks through creating backup vaults
18:01
in source and DR regions, running on-demand
18:04
backups, and initiating cross-region copies.
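For teams who want to script that instead of clicking through it, here is a sketch of the cross-region copy step using the AWS SDK for Go v2. The vault names, recovery point ARN, and role ARN are all placeholders.

```go
package main

import (
	"context"
	"fmt"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/backup"
)

// Starts a cross-region copy of an existing recovery point from a
// source vault to a DR-region vault. All identifiers below are
// placeholders, not values from the AWS post.
func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
	if err != nil {
		panic(err)
	}
	client := backup.NewFromConfig(cfg)

	_, err = client.StartCopyJob(ctx, &backup.StartCopyJobInput{
		RecoveryPointArn:          aws.String("arn:aws:backup:us-east-1:111122223333:recovery-point:EXAMPLE"),
		SourceBackupVaultName:     aws.String("eks-source-vault"),
		DestinationBackupVaultArn: aws.String("arn:aws:backup:us-west-2:111122223333:backup-vault:eks-dr-vault"),
		IamRoleArn:                aws.String("arn:aws:iam::111122223333:role/aws-backup-copy-role"),
	})
	if err != nil {
		fmt.Println("copy job failed to start:", err)
		return
	}
	fmt.Println("cross-region copy job started")
}
```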
18:07
It is not a glamorous story, but it is the kind
18:10
of thing that people say they care about right
18:12
up until they realize they have never actually
18:14
rehearsed it. I think the human thread underneath
18:24
this week's episode is that modern reliability
18:27
work keeps getting squeezed between two kinds of failure. The first kind is old and familiar: kernel bugs, DNS trust chains, brittle control planes.
18:31
The second kind feels newer: AI agents with too much authority, new cloud identity models
18:55
that quietly become control planes, build and
18:58
operational systems that act like sidecars until
19:01
the day they absolutely are not. And what is
19:04
tricky is that these two categories are not separate
19:07
anymore. The old failures now happen inside environments
19:11
shaped by the newer automation. The newer automation
19:14
inherits the blast radius of the old infrastructure.
19:17
And ops teams end up responsible for both at
19:21
the same time. So I think the real takeaway this
19:23
week is pretty simple. Reliability is not just
19:26
about uptime anymore. It is about authority.
19:28
Who and what can act? What can it touch? How
19:32
fast can it fail? And whether your system design
19:34
assumes mistakes will stay local when they almost
19:37
never do. That sounds abstract until you line
19:41
up the stories. A TLD signs bad data and millions
19:45
of lookups fail. A social platform runs into
19:48
port exhaustion and logging makes it worse. A
19:51
GitOps controller hits end of life while version
19:54
parsing changes in the next branch. A kernel
19:57
bug drops into active exploitation. An AI agent
20:01
sees the wrong token and production disappears.
20:04
Different systems, same lesson. Small authority
20:07
problems turn into large reliability problems
20:11
very quickly. All right, that's it for this week
20:14
of Ship It Weekly. Quick recap. The PocketOS
20:17
and Cursor database wipe and why it is really
20:21
an identity story. The .de DNSSEC outage and what
20:25
happens when the trust chain itself goes wrong.
20:28
Bluesky's outage post-mortem and how observability
20:31
can become part of the incident path. Argo CD
20:35
3.1 going end of life while 3.4 changes behavior.
20:40
Copy Fail and why active kernel exploitation
20:42
still cuts through all the higher-level abstractions.
20:46
And Google Cloud Agent Identity plus AWS MCP
20:50
Server GA. Because the cloud providers are starting
20:53
to formalize agents as real infrastructure actors.
20:56
Then in the lightning round, Cilium's CI/CD security
21:00
lessons. Velero's CVE fix. Google's retention
21:04
and cgroups v2 changes, and AWS's cross-region
21:09
EKS disaster recovery. Links and show notes are
21:12
on shipitweekly.fm. You can also find the video
21:15
versions on YouTube. And if you want the source
21:18
stack before the episode lands, check out this
21:21
week's on-call brief. If this episode was useful,
21:24
follow or subscribe wherever you listen. And
21:27
send it to the person on your team who still
21:28
has to explain that reliability problems are
21:32
not just outages anymore. Sometimes they are
21:34
identity problems, tooling problems, and authority
21:37
problems wearing outage clothes. I'm Brian and
21:40
I'll see you next week.