Host Commentary

What stood out to me this week is that the failure modes were all over the stack, but they kept pointing back to the same thing: authority.

The PocketOS and Cursor story is the obvious example. It is easy to frame that one as “AI went rogue,” but that’s not really the useful lesson. The useful lesson is that an agent got access to a token it should not have had, and once it had that authority, the rest happened fast. On the other end of the spectrum, the .de outage was not AI at all. It was classic Internet plumbing: bad DNSSEC signatures at the TLD level, validating resolvers doing exactly what they were supposed to do, and millions of domains effectively disappearing behind SERVFAIL. Different systems, same theme. Give the wrong thing too much trust, or centralize trust in the wrong place, and the blast radius gets big fast.

That’s also why I liked the Bluesky postmortem so much. It is the kind of outage write-up operators actually learn from because it is not clean or elegant. They were exhausting ephemeral ports, and the debugging path and the logging behavior amplified the pain. That is a very real production pattern. The first problem hurts, then the systems you rely on to reason about it start adding load, noise, or contention of their own. A lot of outages are not one bad component failing in isolation. They are a cluster of small, understandable behaviors that turn pathological together.
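Port exhaustion of this kind has simple arithmetic behind it. As a back-of-envelope sketch (using generic Linux defaults, not numbers from the Bluesky postmortem): each short-lived outbound connection parks an ephemeral port in TIME_WAIT, so the sustainable connection rate per destination is roughly the port pool divided by the hold time.

```python
# Back-of-envelope sketch of ephemeral port exhaustion. The constants are
# generic Linux defaults (net.ipv4.ip_local_port_range = 32768-60999,
# TIME_WAIT held ~60s), used for illustration only -- not figures from
# the Bluesky postmortem.

EPHEMERAL_PORTS = 60999 - 32768 + 1   # ~28k ports per (src IP, dst IP, dst port)
TIME_WAIT_SECONDS = 60                # typical TCP TIME_WAIT hold on Linux

def max_sustainable_conn_rate(ports: int = EPHEMERAL_PORTS,
                              hold_s: int = TIME_WAIT_SECONDS) -> float:
    """Steady-state new-connections/sec before the port pool drains.

    Each short-lived connection occupies a port for hold_s seconds, so
    the pool empties once rate * hold_s exceeds the available ports.
    """
    return ports / hold_s

if __name__ == "__main__":
    rate = max_sustainable_conn_rate()
    print(f"{EPHEMERAL_PORTS} ports / {TIME_WAIT_SECONDS}s TIME_WAIT "
          f"= about {rate:.0f} new connections/sec per destination")
```

Roughly 470 connections per second per destination tuple sounds like a lot until retries, health checks, and debug tooling all start opening their own connections to the same place, which is exactly the amplification pattern the postmortem describes.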

Argo CD and the kernel bug were the quieter stories, but maybe the more familiar ones for day-to-day operators. Argo CD 3.1 hitting end of life while 3.4 changes Kubernetes version interpretation is exactly the kind of thing teams wave off until a controller upgrade lands and selection logic stops behaving the way people assumed. CVE-2026-31431 is the same kind of reminder from a different angle. Kernel bugs do not care how nice your abstractions are. If the shared base layer is vulnerable and actively exploited, your higher-level controls stop feeling very absolute. That’s why the boring work still matters: controller version hygiene, image inventory, maintenance windows, patch review, and all the stuff nobody wants to talk about when there is a shinier story on the page.

The other piece I kept coming back to is that the clouds are starting to admit agents are no longer a novelty feature hanging off the side of existing IAM. Google is introducing Agent Identity as a first-class principal type built on SPIFFE, and AWS is pushing MCP access as something that should be secure, authenticated, and bounded through a fixed tool surface. That is a pretty big signal. We are watching cloud identity move from human identity, to workload identity, to agent identity. And if that sounds abstract, it really is not. It just means teams are about to rediscover every old machine-identity mistake they already made once, except now the actor on the other end can move faster and make stranger decisions.
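For anyone who has not touched SPIFFE, the identity primitive is small: a SPIFFE ID is a URI of the form `spiffe://<trust-domain>/<workload-path>`, with ports, userinfo, query strings, and fragments forbidden by the spec. A minimal parser sketch, using only the Python standard library (the trust domain and agent path below are hypothetical examples, not anything from Google's announcement):

```python
from urllib.parse import urlparse

def parse_spiffe_id(uri: str) -> tuple[str, str]:
    """Split a SPIFFE ID into (trust_domain, workload_path).

    Per the SPIFFE ID spec, IDs look like spiffe://<trust-domain>/<path>,
    and must not carry a port, userinfo, query string, or fragment.
    """
    parsed = urlparse(uri)
    if parsed.scheme != "spiffe":
        raise ValueError(f"not a SPIFFE ID: {uri!r}")
    if not parsed.netloc:
        raise ValueError("missing trust domain")
    if parsed.query or parsed.fragment or "@" in parsed.netloc or ":" in parsed.netloc:
        raise ValueError("SPIFFE IDs forbid ports, userinfo, queries, and fragments")
    return parsed.netloc, parsed.path

# Hypothetical agent identity in a hypothetical trust domain:
trust_domain, path = parse_spiffe_id("spiffe://prod.example.com/agent/deploy-bot")
```

The interesting part is not the parsing; it is that the trust domain and path give you a stable, non-secret name to hang authorization policy on, which is exactly the property human-style credentials kept failing to provide for machines.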

So my takeaway from this episode is simple. Reliability is still about uptime, latency, and recovery, sure. But more and more, it is also about who or what is allowed to act, what it can touch, and whether your environment assumes a mistake will stay local when it probably will not. That applies to DNS trust chains, GitOps controllers, kernel exposure, backup design, and AI agents with credentials. Different layers, same question: where does the authority actually live, and how much damage can it do before something stops it?

Show Notes

This episode of Ship It Weekly is about modern reliability getting squeezed from both directions. Old-school failures still hit hard, like broken DNSSEC, kernel privilege escalation bugs, and GitOps behavior changes. But newer automation layers add a second kind of risk, where AI agents, machine identity, and cloud control planes can do real damage fast when authority is too broad. Brian covers the Cursor and PocketOS production database wipe, the .de DNSSEC outage and Cloudflare’s response, Bluesky’s April outage postmortem, Argo CD v3.1.16 reaching end of life plus the v3.4.1 behavior change, Linux kernel CVE-2026-31431 under active exploitation, and why Google Cloud Agent Identity and AWS MCP Server GA both point to agents becoming first-class infrastructure actors.

Sponsored by Guardsquare https://hubs.ly/Q04fJgkJ0

Links

Cursor / PocketOS production database wipe https://www.tellerstech.com/on-call-brief/2026-W19/

Cloudflare on the .de DNSSEC outage https://blog.cloudflare.com/de-tld-outage-dnssec/

Bluesky April 2026 outage postmortem https://pckt.blog/b/jcalabro/april-2026-outage-post-mortem-219ebg2

Argo CD releases: v3.1.16 final release and v3.4.1 behavior change https://github.com/argoproj/argo-cd/releases

Linux kernel CVE-2026-31431 https://nvd.nist.gov/vuln/detail/CVE-2026-31431

AWS bulletin for CVE-2026-31431 https://aws.amazon.com/security/security-bulletins/rss/2026-026-aws/

Google Cloud Agent Identity https://cloud.google.com/blog/products/identity-security/whats-new-in-iam-security-governance-and-runtime-defense

AWS MCP Server is now generally available https://aws.amazon.com/blogs/aws/the-aws-mcp-server-is-now-generally-available/

Cross-region disaster recovery for Amazon EKS using AWS Backup https://aws.amazon.com/blogs/containers/cross-region-disaster-recovery-for-amazon-eks-using-aws-backup/

Google Ads new data retention policy starting June 1, 2026 https://ads-developers.googleblog.com/2026/05/new-data-retention-policy-for-google.html

This week’s On Call Brief https://www.tellerstech.com/on-call-brief/2026-W19/

More episodes and show notes https://shipitweekly.fm/

Hosted by
Brian Teller

25 years in production: DevOps, SRE, platform, and cloud. DevOps Institute & ITIL Ambassador.
