containerd CRI Vulnerabilities, Datadog PostgreSQL HA on Kubernetes, AWS DevOps Agent with Datadog MCP Server, EKS Control Plane Egress, and Why Users Feel the Wait

Transcript

0:00 This week, containerd disclosed a stack of CRI

0:04 plugin vulnerabilities in the runtime layer,

0:07 a huge number of Kubernetes nodes trust to start

0:10 your containers. Datadog ran a PostgreSQL

0:14 gameday and learned their database could fail over

0:17 just fine. It just couldn't do it safely. AWS

0:21 DevOps Agent and Datadog's MCP Server are both

0:25 now generally available. And the new AWS integration

0:28 means AI incident response just graduated from

0:33 demo to on-call rotation. And EKS will now route

0:37 your Kubernetes control plane's outbound traffic

0:40 through your own VPC, which is great, right up

0:44 until a stale route table quietly kills your

0:47 admission webhooks. Put those together and the

0:50 shape of the episode is pretty clear. The control

0:53 plane keeps getting wider. Runtimes. Databases.

0:57 Incident agents. API-server egress. Credentials,

1:01 even the cloud console. One by one, they are

1:04 all sliding into your production blast radius.

1:07 And here's the part that matters. Your users

1:10 don't care which control plane failed. They just

1:13 feel the wait. I'm Brian Teller from Teller's

1:16 Tech, and this is Ship It Weekly. Welcome back

1:36 to Ship It Weekly, the show about the DevOps,

1:40 SRE, cloud, platform, and security stories that

1:45 actually matter when you are the person who has

1:47 to keep the thing running at 3 a.m. If you are

1:51 new here, follow or subscribe wherever you are

1:54 watching or listening. And if you want the weekly

1:57 story list and source links, check out OnCallBrief.com.

2:01 For past episodes, full show notes, and

2:05 more from the show, head over to ShipItWeekly.fm.

2:08 We open with the containerd CRI plugin vulnerabilities,

2:13 because your node runtime is the trust boundary

2:16 underneath the trust boundary. Then, Datadog's

2:20 PostgreSQL HA gameday, where the scary discovery

2:24 wasn't that failover was hard, it was that failover

2:28 was unsafe. After that, AWS DevOps Agent and

2:32 Datadog MCP Server going GA, and what it means

2:36 when an AI agent gets a seat near your control

2:40 plane. Then, EKS customer-routed control-plane

2:43 egress, because your API server is now part of

2:47 your network perimeter, whether you plan for

2:50 it or not. In the lightning round, GitHub credential

2:53 revocation, AWS Console private access, Vercel

2:57 Connect, and S3 annotations. And we close with

3:01 Marc Brooker on waiting, on why your customers

3:05 live in the tail of your latency distribution,

3:09 even when your dashboards swear everything's

3:12 fine. Let's get into it. First up, containerd

3:20 has a batch of CRI plugin vulnerabilities. And

3:23 if you run Kubernetes, this one's yours. AWS

3:26 published a security bulletin spanning

3:30 containerd branches 1.7 through 2.3. And the list is

3:35 not a fun read. Image cache poisoning through

3:38 checkpoint image references, host command execution

3:42 through unsanitized image labels, CDI annotation

3:45 handling that can inject devices and host mounts,

3:50 host file reads through symlinked container

3:53 log paths during checkpoint restore, and a denial

3:57 of service from crafted images that exhaust memory.
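
Because the bulletin spans several release branches, a sensible first step is an inventory of which runtime version each node reports. The sketch below is a minimal illustration, not an official tool: it assumes you have saved the output of `kubectl get nodes -o json`, and it reads the `status.nodeInfo.containerRuntimeVersion` field that Kubernetes reports per node. The function names are hypothetical, and the minimum versions are deliberately left for the caller to supply from the advisory itself.

```python
import re

_VERSION_RE = re.compile(r"containerd://(\d+)\.(\d+)\.(\d+)")


def parse_runtime_version(runtime_version):
    """Turn 'containerd://1.7.27' into (1, 7, 27); return None for other runtimes."""
    match = _VERSION_RE.match(runtime_version)
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def nodes_needing_review(node_list, min_patched):
    """Return (node name, reported version) pairs that need a closer look.

    node_list   -- parsed JSON from `kubectl get nodes -o json`
    min_patched -- dict mapping (major, minor) to the lowest fixed
                   (major, minor, patch); the caller fills this in from the
                   security bulletin, nothing here is real advisory data.

    A node is flagged when it is below the minimum for its branch, when its
    branch is missing from min_patched, or when it does not report containerd.
    """
    flagged = []
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        reported = node["status"]["nodeInfo"]["containerRuntimeVersion"]
        version = parse_runtime_version(reported)
        minimum = min_patched.get(version[:2]) if version else None
        if version is None or minimum is None or version < minimum:
            flagged.append((name, reported))
    return flagged
```

Flagging unknown branches and unrecognized runtimes, rather than silently passing them, is a deliberate choice here: an inventory that skips what it cannot parse tends to hide exactly the nodes nobody has looked at.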

4:01 So not exactly a relaxing Patch Tuesday. Here's

4:05 why it matters. containerd sits underneath an

4:08 enormous number of clusters, and we spend almost

4:12 all of our security attention on the layers above

4:15 it. Pod specs, admission control, image scanning,

4:19 RBAC, network policy, runtime classes, all the

4:23 familiar Kubernetes machinery. But eventually,

4:26 something has to actually pull the image, unpack

4:29 it, restore it, wire up devices, handle the logs,

4:34 and start the container. That layer is a trust

4:37 boundary too. And in some ways, it's the more

4:41 dangerous one. Because by the time a workload

4:44 reaches the runtime, the rest of the system has

4:47 already decided this thing is allowed to exist.

4:51 That's why the boring fields turn out to matter.

4:54 Labels, annotations, checkpoint and restore paths,

4:58 CDI, log paths. Every field that feels like

5:02 plumbing can become an input to privileged behavior

5:05 on the node. A malicious image isn't just application

5:09 code. It's metadata, build-time weirdness, and

5:14 a set of assumptions the runtime makes about

5:17 what it can trust. The takeaway is direct. Patch

5:20 containerd. Check your managed node groups,

5:24 your self-managed nodes, your AMIs, your Bottlerocket

5:28 versions, your distro packages, anything

5:31 that controls the runtime. If you lean on checkpoint

5:35 restore, CDI devices, or GPU workloads, look

5:39 harder. And if you don't use any of that, don't

5:43 relax. At least one of these issues doesn't need

5:46 checkpoint and restore turned on at all. Your

5:49 node runtime is the trust boundary under the

5:53 trust boundary. Stop treating it like invisible

5:56 plumbing. Second story. Datadog published a genuinely

6:03 good engineering write-up on running high-availability

6:07 PostgreSQL on Kubernetes. And it's one of those

6:11 pieces that sounds boring until the real problem

6:14 comes into focus. The problem wasn't that the

6:17 database couldn't fail over. It was that it couldn't

6:20 fail over safely. During a gameday, Datadog

6:24 simulated a zonal failure. That added network

6:27 latency, replication lag grew, and when the cluster

6:30 needed a new primary, Patroni couldn't safely

6:34 promote a standby without risking data loss.

6:37 So the system got stuck in the worst possible

6:40 spot. The old primary was unhealthy. The standbys

6:44 weren't safe to promote, and the only correct

6:47 move was to wait. That's the kind of failure

6:50 mode that ages every SRE in the room about three

6:53 years. Because on paper, you have everything.

6:56 Multiple nodes, standbys, Kubernetes, automation,

7:01 failover machinery. And then the actual failure

7:04 arrives and the system says, yes, but not safely.

7:07 Which, honestly, is the right answer. Promoting

7:10 a stale standby might hand you a writable primary

7:14 faster. But if it costs you data loss, split

7:17 brain, or a broken consistency guarantee, you

7:21 haven't fixed the outage. You've traded it for

7:23 a corruption event. That's not an improvement.

7:26 It's just a different postmortem. The real lesson

7:29 is that HA isn't only about whether the service

7:33 comes back. It's about whether the recovery path

7:35 itself is safe. Can you fail over without losing

7:39 writes? Can you prove which standby is safe to

7:42 promote? Can your automation tell the difference

7:45 between available and correct? And does your

7:47 whole team agree on which one it should prefer

7:51 before the incident call is on fire? Datadog's

7:54 answer was to move toward synchronous replication

7:57 and stronger Patroni guardrails. So a promoted

8:01 standby is guaranteed to have the writes it needs.
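
The difference between "available" and "correct" can be made concrete with a small promotion guard. This is a sketch under stated assumptions, not Patroni's actual logic and not Datadog's implementation: the `Standby` type, its fields, and `pick_promotion_candidate` are hypothetical names chosen for illustration. The point it shows is that "wait" is a valid return value.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Standby:
    name: str
    is_synchronous: bool   # primary waited for this standby before acknowledging commits
    replay_lag_bytes: int  # distance behind the last WAL position the primary reported


def pick_promotion_candidate(
    standbys: Sequence[Standby], max_lag_bytes: int = 0
) -> Optional[Standby]:
    """Return the least-lagged standby that is safe to promote, or None.

    None is the "wait" outcome: no standby can be promoted without risking
    writes that the old primary already acknowledged.
    """
    safe = [
        s for s in standbys
        if s.is_synchronous and s.replay_lag_bytes <= max_lag_bytes
    ]
    return min(safe, key=lambda s: s.replay_lag_bytes, default=None)
```

With the default `max_lag_bytes=0`, the guard prefers safety over availability. Raising that threshold is the explicit, reviewable place where a team would choose to trade some data-loss risk for faster recovery.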

8:05 And that's the part that's worth copying. They

8:07 didn't just ask how to recover faster. They asked

8:11 how to recover safely. So test your database

8:14 HA against real constraints, not the easy ones.

8:18 Ask what happens under replication lag. Ask what

8:22 happens during a zone failure. Ask what happens

8:25 when the network is slow instead of cleanly dead.

8:28 Ask what happens when every standby is behind.

8:32 And ask whether your automation prefers safety

8:35 or availability. And whether everyone actually

8:38 agrees with that choice. Because failover is

8:41 useless if the only safe option is waiting.

8:45 But unsafe failover can be a lot worse. Third

8:52 story, AWS DevOps Agent is now generally available

8:56 and Datadog's MCP Server is GA as a standard

9:00 way for AI agents to reach Datadog monitoring

9:04 data. This is one of those announcements where

9:07 the slide says autonomous incident resolution

9:10 and the operator says, cool, but what exactly

9:13 is it allowed to touch? The idea is solid. AWS

9:17 DevOps Agent can work through Datadog MCP Server

9:21 to investigate an incident across logs, metrics,

9:26 traces, deployment events, and AWS infrastructure

9:30 context. Instead of one engineer bouncing between

9:33 CloudWatch, Datadog, deploy history, traces, dashboards,

9:38 and Slack, the agent correlates the signals and

9:42 helps push the incident forward. And nobody wants

9:45 to spend the first 30 minutes of an outage doing

9:48 browser-tab archaeology. If an agent can gather

9:51 context, summarize what changed, flag a suspicious

9:55 deploy, and propose likely causes, that's real

9:59 time saved. But this is also the moment AI incident

10:02 response stops being a chatbot and becomes a

10:06 production workflow. It's an agent reading operational

10:10 telemetry, interpreting signals, recommending

10:13 fixes, and potentially wired into Slack, PagerDuty,

10:18 ServiceNow, your code, your deploys, and your

10:21 runbooks. That puts it right next to the control

10:24 plane. And once something sits next to the control

10:27 plane, the question stops being, is it smart?

10:31 And becomes, what authority does it have? Can

10:34 it only read? Can it write? Can it open tickets?

10:38 Trigger automation? Roll back a deploy? Restart

10:42 a service? Change config? Page a human at 4 a.m.?

10:46 Can it make things worse quickly and very confidently?

10:51 That last one is the whole game. Incident response

10:54 isn't about speed. It's about safe speed. So

10:58 treat AI incident tooling like any other production

11:02 automation. Give it the least privilege that

11:05 still leaves it useful. Log what it sees and

11:09 what it does. Make the human approval boundary

11:11 impossible to miss. And draw a hard line between

11:15 what it can recommend and what it can execute.
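
One way to picture that line between recommending and executing is a small gate that every agent action has to pass through. This is an illustrative sketch only: the action names, the `ActionGate` class, and the `approve` callback are all hypothetical and do not correspond to any AWS or Datadog API.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical action names, chosen only for illustration.
READ_ONLY = {"query_logs", "query_metrics", "summarize_deploys"}
NEEDS_APPROVAL = {"rollback_deploy", "restart_service"}


@dataclass
class ActionGate:
    """Routes each requested action to execute, recommend-only, or deny."""

    approve: Callable[[str], bool]  # asks a human; returns True if approved
    audit_log: List[str] = field(default_factory=list)

    def request(self, action: str) -> str:
        if action in READ_ONLY:
            decision = "executed"
        elif action in NEEDS_APPROVAL:
            # Without approval the action stays a recommendation.
            decision = "executed" if self.approve(action) else "recommended-only"
        else:
            # Anything not explicitly listed is denied by default.
            decision = "denied"
        self.audit_log.append(f"{action}: {decision}")
        return decision
```

The default-deny branch is the least-privilege part, and the audit log records every request, including the ones that were refused.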

11:19 Have rollback rules. Know what happens when it's

11:23 wrong. And don't grade it only on time to answer.

11:26 Grade it on whether the answer was safe, auditable,

11:29 and actually useful under pressure. AI incident

11:33 response is moving from demo to production. That's

11:36 exciting. Production just needs guardrails. Fourth

11:44 story. Amazon EKS now supports customer-routed

11:48 control-plane egress. That's a very AWS phrase.

11:52 So here's the human version. The Kubernetes API

11:55 server sometimes needs to call outward to admission

11:58 webhooks, OIDC providers, aggregated API servers,

12:03 other endpoints that you control. Historically,

12:06 that outbound traffic took AWS managed egress

12:09 paths. Now you can route it through your own

12:12 VPC, which hands platform teams control over

12:16 routing, inspection, firewalls, NAT, private

12:19 connectivity, and compliance boundaries. For

12:22 regulated environments, that's a real win. It

12:25 also makes the control plane feel a lot more

12:28 like part of your network, which of course it

12:30 always was. The difference is that now you own

12:34 the outbound path, and AWS is blunt about what

12:37 that ownership means. In customer-routed mode,

12:40 you are responsible for making sure the control

12:43 plane can reach the endpoints it needs. Wrong

12:46 route table, too-tight security group, a NACL

12:50 that blocks the wrong thing, a broken firewall

12:52 hop, and control plane operations start failing.

12:56 That includes admission webhook calls and OIDC

13:00 authentication. So yes, great feature. But it

13:03 isn't a checkbox. It's a failure mode change.

13:06 If your API server can't reach an admission webhook,

13:10 do pod creates fail? Do deploys hang? Does authentication

13:14 break? Does your incident response now depend

13:18 on a firewall path some other team owns? And

13:21 do you have a metric, a test, and a name on the

13:25 pager for when it breaks? This is a feature you

13:28 bring to a design review. Not because it's risky,

13:31 but because it's powerful. Map the traffic. Map

13:34 the dependencies. Test the webhooks. Test OIDC.

13:38 Test the failure modes. Make the routing visible.
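
A reachability probe is one simple way to make that routing visible. The sketch below assumes it runs from a host that shares the subnets, route tables, and firewall path the control plane egress would use, so it only approximates the control plane's own view. The function names and the example hostname are hypothetical.

```python
import socket
from typing import Callable, Dict, Iterable, Tuple

Endpoint = Tuple[str, int]


def tcp_connect(endpoint: Endpoint, timeout: float) -> None:
    """Open and immediately close a TCP connection; raises OSError on failure."""
    with socket.create_connection(endpoint, timeout=timeout):
        pass


def probe_endpoints(
    endpoints: Iterable[Endpoint],
    timeout: float = 3.0,
    connect: Callable[[Endpoint, float], None] = tcp_connect,
) -> Dict[Endpoint, str]:
    """Try each (host, port) pair and report whether a TCP connect succeeded."""
    results: Dict[Endpoint, str] = {}
    for endpoint in endpoints:
        try:
            connect(endpoint, timeout)
            results[endpoint] = "reachable"
        except OSError as exc:
            # DNS failures, timeouts, and refused connections are all OSError subclasses.
            results[endpoint] = f"unreachable: {exc}"
    return results
```

A call might look like `probe_endpoints([("webhook.example.internal", 443)])`, with the results exported as a metric so a broken path shows up before a deploy hangs. A TCP connect only proves the network path; it says nothing about TLS or whether the webhook answers correctly.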

13:41 And write the runbook before the control plane

13:44 starts failing in creative ways. The Kubernetes

13:47 control plane is becoming part of your network

13:50 perimeter. Treat it like one. Quick lightning

14:00 round. First, GitHub added self-service credential

14:03 revocation for incident response. Enterprise

14:07 owners now get a break-glass capability to revoke

14:11 a compromised user's credentials in one move.

14:14 This matters because credential cleanup should

14:17 never be a scavenger hunt. You do not want to

14:20 be hand-hunting through SSO authorizations,

14:22 personal access tokens, SSH keys, and OAuth grants

14:27 while everyone argues in Slack. Revocation is

14:30 incident response infrastructure. Know who can

14:33 trigger it, know what it kills, know what it

14:35 logs, and put it in the compromised-account runbook.

14:39 Second, AWS Management Console private access

14:42 now works without internet connectivity. Console

14:45 traffic for supported services can flow over

14:48 VPC endpoints instead of the public internet.

14:51 It's a strong story for regulated environments.

14:54 Even the console is getting pulled behind private

14:57 network boundaries. The lesson? Console access

15:00 is part of your control plane too. And PrivateLink,

15:03 endpoint policies, and known-account restrictions

15:06 are becoming cloud operations, not just app networking.

15:10 Third, Vercel shipped Vercel Connect. And the

15:13 idea worth catching is runtime credential exchange.

15:17 Instead of stashing a long-lived provider token

15:20 for an agent, the app proves its identity and

15:23 gets a short-lived task-scoped credential.
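
The general shape of a short-lived, scoped credential can be sketched with nothing but the standard library. To be clear, this is not how Vercel Connect works internally, which the episode does not describe; the function names, the token layout, and the scope strings are all assumptions made for illustration.

```python
import hashlib
import hmac
import time


def issue_credential(signing_key, identity, scope, ttl_seconds, now=None):
    """Mint a signed 'identity|scope|expiry|signature' string valid for ttl_seconds."""
    issued_at = time.time() if now is None else now
    expires_at = int(issued_at + ttl_seconds)
    payload = f"{identity}|{scope}|{expires_at}"
    signature = hmac.new(signing_key, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"


def is_valid(signing_key, credential, required_scope, now=None):
    """Accept only an untampered, unexpired credential with exactly the required scope."""
    try:
        identity, scope, expires_at, signature = credential.split("|")
        expiry = int(expires_at)
    except ValueError:
        return False  # wrong number of fields or a non-numeric expiry
    payload = f"{identity}|{scope}|{expires_at}"
    expected = hmac.new(signing_key, payload.encode(), hashlib.sha256).hexdigest()
    current = time.time() if now is None else now
    return (
        hmac.compare_digest(signature.encode(), expected.encode())
        and scope == required_scope
        and current < expiry
    )
```

The useful property is that a leaked credential stops working on its own once the expiry passes, and it never unlocked more than one scope in the first place.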

15:26 That's the pattern that we've been tracking for

15:28 weeks. Agent credentials moving from "store this

15:31 token forever" to "prove who you are and get scoped

15:35 access when you need it." Short-lived credentials

15:38 don't solve every agent security problem, but

15:41 they beat long-lived secrets sitting around

15:44 waiting to become next quarter's incident. Fourth,

15:48 Amazon S3 annotations are here: mutable, queryable

15:52 context attached directly to S3 objects. Sounds

15:56 dull, but object metadata has driven a lot of

15:59 awkward platform design over the years. Side

16:02 tables, DynamoDB metadata stores, Lambda sync

16:06 jobs, custom catalogs, and constant drift between

16:10 the object and whatever's describing it. If annotations

16:14 shrink that glue layer, that's worth watching.

16:16 Object metadata is quietly becoming a first-class

16:20 platform layer, especially for data, AI, search,

16:24 and agent workflows that need to know what an

16:27 object is, not just where it lives. The human

16:38 closer this week comes from a Marc Brooker post

16:42 about waiting, latency, MTTR, and why averages

16:47 can lie. The point is that your users don't experience

16:50 your averages the way your dashboards report

16:54 them. You measure mean latency, mean time to

16:57 recovery, average outage duration. But people

17:00 are far more likely to land in the long waits

17:03 simply because long waits take up more of the

17:07 time. That's the inspection paradox. A 10-minute

17:10 outage catches a few users. A 10-hour outage

17:15 catches a lot of them. Your incident tracker

17:18 counts both as one outage. Your dashboard says

17:22 MTTR looks fine. Your users say they spent all

17:26 morning waiting. Both are true. And that's the

17:29 whole episode, really. When the system breaks,

17:31 nobody experiences your architecture diagram.

17:34 They experience waiting. Waiting for a request.

17:38 Waiting for recovery. Waiting for a credential

17:40 to get revoked. Waiting for a deploy to stop

17:44 failing. Waiting for the control plane to come

17:46 back. Waiting for someone to find the right context.
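
The inspection paradox from a moment ago can be put into numbers using the episode's own example of a 10-minute and a 10-hour outage. This sketch assumes users arrive at a steady rate, so the chance of being caught in an outage is proportional to how long it lasts; the function names are illustrative.

```python
def incident_mean(durations):
    """Mean duration with every outage counted once (the usual MTTR view)."""
    return sum(durations) / len(durations)


def experienced_mean(durations):
    """Mean duration of the outage an affected user is caught in.

    Each outage is weighted by its own length, because a longer outage
    catches proportionally more users under a steady arrival rate.
    """
    return sum(d * d for d in durations) / sum(durations)


def share_of_affected_users(durations):
    """Fraction of affected users caught by each outage, same assumption."""
    total = sum(durations)
    return [d / total for d in durations]


outages_minutes = [10, 600]  # a 10-minute outage and a 10-hour outage
print(incident_mean(outages_minutes))            # 305.0
print(experienced_mean(outages_minutes))         # about 590.3
print(share_of_affected_users(outages_minutes))  # about [0.016, 0.984]
```

The tracker reports an average outage of about five hours, while roughly 98 percent of affected users were in the ten-hour one. Both numbers come from the same two incidents.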

17:50 So here's the takeaway. Don't only measure the

17:53 system from the server side. Measure it from

17:56 the waiting side. Because your users don't live

18:00 in your average. They live in the tail. And the

18:04 tail is usually where the real reliability story

18:07 is hiding. That's it for this week of Ship It

18:10 Weekly. We covered containerd runtime risk,

18:13 Postgres failover safety, AI incident response,

18:17 EKS control-plane egress, and why your users

18:20 feel the wait more than your dashboards show.

18:24 If this episode was useful, follow or subscribe

18:27 wherever you are watching or listening. If you're

18:29 on YouTube, hit subscribe. If you're in a podcast

18:32 app, follow the show there. And if you know someone

18:36 wrestling with Kubernetes runtime security, database

18:39 failover, AI incident response, or platform control

18:42 planes, send them this one. It genuinely helps

18:45 the show grow, and it helps me keep making this

18:48 for people who actually live with these systems.

18:51 You can find the weekly brief at OnCallBrief.com

18:54 and the full show notes, links, and past

18:57 episodes at ShipItWeekly.fm. I'm Brian Teller

19:01 from Teller's Tech. Thanks for listening. And

19:03 remember, your dashboards measure the average.

19:06 Your users feel the wait.
