0:00
This week, containerd disclosed a stack of CRI
0:04
plugin vulnerabilities in the runtime layer
0:07
a huge number of Kubernetes nodes trust to start
0:10
your containers. Datadog ran a PostgreSQL
0:14
gameday and learned their database could fail over
0:17
just fine. It just couldn't do it safely. AWS
0:21
DevOps Agent and Datadog's MCP Server are both
0:25
now generally available. And the new AWS integration
0:28
means AI incident response just graduated from
0:33
demo to on-call rotation. And EKS will now route
0:37
your Kubernetes control plane's outbound traffic
0:40
through your own VPC, which is great, right up
0:44
until a stale route table quietly kills your
0:47
admission webhooks. Put those together and the
0:50
shape of the episode is pretty clear. The control
0:53
plane keeps getting wider. Runtimes. Databases.
0:57
Incident agents. API-server egress. Credentials,
1:01
even the cloud console. One by one, they are
1:04
all sliding into your production blast radius.
1:07
And here's the part that matters. Your users
1:10
don't care which control plane failed. They just
1:13
feel the wait. I'm Brian Teller from Teller's
1:16
Tech, and this is Ship It Weekly. Welcome back
1:36
to Ship It Weekly, the show about the DevOps,
1:40
SRE, cloud, platform, and security stories that
1:45
actually matter when you are the person who has
1:47
to keep the thing running at 3 a.m. If you are
1:51
new here, follow or subscribe wherever you are
1:54
watching or listening. And if you want the weekly
1:57
story list and source links, check out OnCallBrief.com.
2:01
For past episodes, full show notes, and
2:05
more from the show, head over to ShipItWeekly.fm.
2:08
We open with the containerd CRI plugin vulnerabilities,
2:13
because your node runtime is the trust boundary
2:16
underneath the trust boundary. Then, Datadog's
2:20
PostgreSQL HA gameday, where the scary discovery
2:24
wasn't that failover was hard, it was that failover
2:28
was unsafe. After that, AWS DevOps Agent and
2:32
Datadog MCP Server going GA. And what it means
2:36
when an AI agent gets a seat near your control
2:40
plane. Then, EKS customer-routed control-plane
2:43
egress. Because your API server is now part of
2:47
your network perimeter, whether you plan for
2:50
it or not. In the lightning round, GitHub Credential
2:53
Revocation, AWS Console Private Access, Vercel
2:57
Connect, and S3 annotations. And we close with
3:01
Marc Brooker on waiting, on why your customers
3:05
live in the tail of your latency distribution,
3:09
even when your dashboards swear everything's
3:12
fine. Let's get into it. First up, containerd
3:20
has a batch of CRI plugin vulnerabilities. And
3:23
if you run Kubernetes, this one's yours. AWS
3:26
published a security bulletin spanning
3:30
containerd branches 1.7 through 2.3. And the list is
3:35
not a fun read. Image cache poisoning through
3:38
checkpoint image references. Host command execution
3:42
through unsanitized image labels, CDI annotation
3:45
handling that can inject devices and host mounts,
3:50
host file reads through symlinked container
3:53
log paths during checkpoint restore, and a denial
3:57
of service from crafted images that exhaust memory.
4:01
So not exactly a relaxing Patch Tuesday. Here's
4:05
why it matters. containerd sits underneath an
4:08
enormous number of clusters, and we spend almost
4:12
all of our security attention on the layers above
4:15
it. Pod specs, admission control, image scanning,
4:19
RBAC, network policy, runtime classes, all the
4:23
familiar Kubernetes machinery. But eventually,
4:26
something has to actually pull the image, unpack
4:29
it, restore it, wire up devices, handle the logs,
4:34
and start the container. That layer is a trust
4:37
boundary too. And in some ways, it's the more
4:41
dangerous one. Because by the time a workload
4:44
reaches the runtime, the rest of the system has
4:47
already decided this thing is allowed to exist.
4:51
That's why the boring fields turn out to matter.
4:54
Labels, annotations, checkpoint and restore paths,
4:58
CDI, log paths. Every field that feels like
5:02
plumbing can become an input to privileged behavior
5:05
on the node. A malicious image isn't just application
5:09
code. It's metadata, build-time weirdness, and
5:14
a set of assumptions the runtime makes about
5:17
what it can trust. The takeaway is direct. Patch
5:20
containerd. Check your managed node groups,
5:24
your self-managed nodes, your AMIs, your Bottlerocket
5:28
versions, your distro packages, anything
5:31
that controls the runtime. If you lean on checkpoint
5:35
restore, CDI devices, or GPU workloads, look
5:39
harder. And if you don't use any of that, don't
5:43
relax. At least one of these issues doesn't need
5:46
checkpoint and restore turned on at all. Your
5:49
node runtime is the trust boundary under the
5:53
trust boundary. Stop treating it like invisible
5:56
plumbing. Second story. Datadog published a genuinely
6:03
good engineering write-up on running high-availability
6:07
PostgreSQL on Kubernetes. And it's one of those
6:11
pieces that sounds boring until the real problem
6:14
comes into focus. The problem wasn't that the
6:17
database couldn't fail over. It was that it couldn't
6:20
fail over safely. During a gameday, Datadog
6:24
simulated a zonal failure. That added network
6:27
latency, replication lag grew, and when the cluster
6:30
needed a new primary, Patroni couldn't safely
6:34
promote a standby without risking data loss.
6:37
So the system got stuck in the worst possible
6:40
spot. The old primary was unhealthy. The standbys
6:44
weren't safe to promote, and the only correct
6:47
move was to wait. That's the kind of failure
6:50
mode that ages every SRE in the room about three
6:53
years. Because on paper, you have everything.
6:56
Multiple nodes, standbys, Kubernetes, automation,
7:01
failover machinery. And then the actual failure
7:04
arrives and the system says, yes, but not safely.
7:07
Which, honestly, is the right answer. Promoting
7:10
a stale standby might hand you a writable primary
7:14
faster. But if it costs you data loss, split
7:17
brain, or a broken consistency guarantee, you
7:21
haven't fixed the outage. You've traded it for
7:23
a corruption event. That's not an improvement.
7:26
It's just a different postmortem. The real lesson
7:29
is that HA isn't only about whether the service
7:33
comes back. It's about whether the recovery path
7:35
itself is safe. Can you fail over without losing
7:39
writes? Can you prove which standby is safe to
7:42
promote? Can your automation tell the difference
7:45
between available and correct? And does your
7:47
whole team agree on which one it should prefer
7:51
before the incident call is on fire? Datadog's
7:54
answer was to move toward synchronous replication
7:57
and stronger Patroni guardrails. So a promoted
8:01
standby is guaranteed to have the writes it needs.
8:05
And that's the part that's worth copying. They
8:07
didn't just ask how to recover faster. They asked
8:11
how to recover safely. So test your database
8:14
HA against real constraints, not the easy ones.
8:18
Ask what happens under replication lag. Ask what
8:22
happens during a zone failure. Ask what happens
8:25
when the network is slow instead of cleanly dead.
8:28
Ask what happens when every standby is behind.
8:32
And ask whether your automation prefers safety
8:35
or availability, and whether everyone actually
8:38
agrees with that choice. Because failover is
8:41
useless if the only safe option is waiting.
8:45
But unsafe failover can be a lot worse. Third
8:52
story, AWS DevOps Agent is now generally available
8:56
and Datadog's MCP Server is GA as a standard
9:00
way for AI agents to reach Datadog monitoring
9:04
data. This is one of those announcements where
9:07
the slide says autonomous incident resolution
9:10
and the operator says, cool, but what exactly
9:13
is it allowed to touch? The idea is solid. AWS
9:17
DevOps Agent can work through Datadog MCP Server
9:21
to investigate an incident across logs, metrics,
9:26
traces, deployment events, and AWS infrastructure
9:30
context. Instead of one engineer bouncing between
9:33
CloudWatch, Datadog, deploy history, traces, dashboards,
9:38
and Slack, the agent correlates the signals and
9:42
helps push the incident forward. And nobody wants
9:45
to spend the first 30 minutes of an outage doing
9:48
browser-tab archaeology. If an agent can gather
9:51
context, summarize what changed, flag a suspicious
9:55
deploy, and propose likely causes, that's real
9:59
time saved. But this is also the moment AI incident
10:02
response stops being a chatbot and becomes a
10:06
production workflow. It's an agent reading operational
10:10
telemetry, interpreting signals, recommending
10:13
fixes, and potentially wired into Slack, PagerDuty,
10:18
ServiceNow, your code, your deploys, and your
10:21
runbooks. That puts it right next to the control
10:24
plane. And once something sits next to the control
10:27
plane, the question stops being, is it smart?
10:31
And becomes, what authority does it have? Can
10:34
it only read? Can it write? Can it open tickets?
10:38
Trigger automation? Roll back a deploy? Restart
10:42
a service? Change config? Page a human at 4 a.m.?
10:46
Can it make things worse quickly and very confidently?
10:51
That last one is the whole game. Incident response
10:54
isn't about speed. It's about safe speed. So
10:58
treat AI incident tooling like any other production
11:02
automation. Give it the least privilege that
11:05
still leaves it useful. Log what it sees and
11:09
what it does. Make the human approval boundary
11:11
impossible to miss. And draw a hard line between
11:15
what it can recommend and what it can execute.
11:19
Have rollback rules. Know what happens when it's
11:23
wrong. And don't grade it only on time to answer.
11:26
Grade it on whether the answer was safe, auditable,
11:29
and actually useful under pressure. AI incident
11:33
response is moving from demo to production. That's
11:36
exciting. Production just needs guardrails. Fourth
11:44
story. Amazon EKS now supports customer-routed
11:48
control-plane egress. That's a very AWS phrase.
11:52
So here's the human version. The Kubernetes API
11:55
server sometimes needs to call outward to admission
11:58
webhooks, OIDC providers, aggregated API servers,
12:03
other endpoints that you control. Historically,
12:06
that outbound traffic took AWS-managed egress
12:09
paths. Now you can route it through your own
12:12
VPC, which hands platform teams control over
12:16
routing, inspection, firewalls, NAT, private
12:19
connectivity, and compliance boundaries. For
12:22
regulated environments, that's a real win. It
12:25
also makes the control plane feel a lot more
12:28
like part of your network, which of course it
12:30
always was. The difference is that now you own
12:34
the outbound path, and AWS is blunt about what
12:37
that ownership means. In customer-routed mode,
12:40
you are responsible for making sure the control
12:43
plane can reach the endpoints it needs. Wrong
12:46
route table, too-tight security group, a NACL
12:50
that blocks the wrong thing, a broken firewall
12:52
hop, and control plane operations start failing.
12:56
That includes admission webhook calls and OIDC
13:00
authentication. So yes, great feature. But it
13:03
isn't a checkbox. It's a failure mode change.
13:06
If your API server can't reach an admission webhook,
13:10
do pod creates fail? Do deploys hang? Does authentication
13:14
break? Does your incident response now depend
13:18
on a firewall path some other team owns? And
13:21
do you have a metric, a test, and a name on the
13:25
pager for when it breaks? This is a feature you
13:28
bring to a design review. Not because it's risky,
13:31
but because it's powerful. Map the traffic. Map
13:34
the dependencies. Test the webhooks. Test OIDC.
13:38
Test the failure modes. Make the routing visible.
13:41
And write the runbook before the control plane
13:44
starts failing in creative ways. The Kubernetes
13:47
control plane is becoming part of your network
13:50
perimeter. Treat it like one. Quick lightning
14:00
round. First, GitHub added self-service credential
14:03
revocation for incident response. Enterprise
14:07
owners now get a break-glass capability to revoke
14:11
a compromised user's credentials in one move.
14:14
This matters because credential cleanup should
14:17
never be a scavenger hunt. You do not want to
14:20
be hand-hunting through SSO authorizations,
14:22
personal access tokens, SSH keys, and OAuth grants
14:27
while everyone argues in Slack. Revocation is
14:30
incident response infrastructure. Know who can
14:33
trigger it, know what it kills, know what it
14:35
logs, and put it in the compromised-account runbook.
14:39
Second, AWS Management Console Private Access
14:42
now works without internet connectivity. Console
14:45
traffic for supported services can flow over
14:48
VPC endpoints instead of the public internet.
14:51
It's a strong story for regulated environments.
14:54
Even the console is getting pulled behind private
14:57
network boundaries. The lesson? Console access
15:00
is part of your control plane too. And
15:03
PrivateLink, endpoint policies, and known-account restrictions
15:06
are becoming cloud operations, not just app networking.
15:10
Third, Vercel shipped Vercel Connect. And the
15:13
idea worth catching is runtime credential exchange.
15:17
Instead of stashing a long-lived provider token
15:20
for an agent, the app proves its identity and
15:23
gets a short-lived task-scoped credential.
15:26
That's the pattern that we've been tracking for
15:28
weeks. Agent credentials moving from store this
15:31
token forever to prove who you are and get scoped
15:35
access when you need it. Short-lived credentials
15:38
don't solve every agent security problem, but
15:41
they beat long-lived secrets sitting around
15:44
waiting to become next quarter's incident. Fourth,
15:48
Amazon S3 annotations are here: mutable, queryable
15:52
context attached directly to S3 objects. Sounds
15:56
dull, but object metadata has driven a lot of
15:59
awkward platform design over the years. Side
16:02
tables, DynamoDB metadata stores, Lambda sync
16:06
jobs, custom catalogs, and constant drift between
16:10
the object and whatever's describing it. If annotations
16:14
shrink that glue layer, that's worth watching.
16:16
Object metadata is quietly becoming a first-class
16:20
platform layer, especially for data, AI, search,
16:24
and agent workflows that need to know what an
16:27
object is, not just where it lives. The human
16:38
closer this week comes from a Marc Brooker post
16:42
about waiting, latency, MTTR, and why averages
16:47
can lie. The point is that your users don't experience
16:50
your averages the way your dashboards report
16:54
them. You measure mean latency, mean time to
16:57
recovery, average outage duration. But people
17:00
are far more likely to land in the long waits
17:03
simply because long waits take up more of the
17:07
time. That's the inspection paradox. A 10-minute
17:10
outage catches a few users. A 10-hour outage
17:15
catches a lot of them. Your incident tracker
17:18
counts both as one outage. Your dashboard says
17:22
MTTR looks fine. Your users say they spent all
17:26
morning waiting. Both are true. And that's the
17:29
whole episode, really. When the system breaks,
17:31
nobody experiences your architecture diagram.
17:34
They experience waiting. Waiting for a request.
17:38
Waiting for recovery. Waiting for a credential
17:40
to get revoked. Waiting for a deploy to stop
17:44
failing. Waiting for the control plane to come
17:46
back. Waiting for someone to find the right context.
17:50
So here's the takeaway. Don't only measure the
17:53
system from the server side. Measure it from
17:56
the waiting side. Because your users don't live
18:00
in your average. They live in the tail. And the
18:04
tail is usually where the real reliability story
18:07
is hiding. That's it for this week of Ship It
18:10
Weekly. We covered containerd runtime risk,
18:13
Postgres failover safety, AI incident response,
18:17
EKS control-plane egress, and why your users
18:20
feel the wait more than your dashboards show.
18:24
If this episode was useful, follow or subscribe
18:27
wherever you are watching or listening. If you're
18:29
on YouTube, hit subscribe. If you're in a podcast
18:32
app, follow the show there. And if you know someone
18:36
wrestling with Kubernetes runtime security, database
18:39
failover, AI incident response, or platform control
18:42
planes, send them this one. It genuinely helps
18:45
the show grow, and it helps me keep making this
18:48
for people who actually live with these systems.
18:51
You can find the weekly brief at OnCallBrief.com
18:54
and the full show notes, links, and past
18:57
episodes at ShipItWeekly.fm. I'm Brian Teller
19:01
from Teller's Tech. Thanks for listening. And
19:03
remember, your dashboards measure the average.
19:06
Your users feel the wait.
This episode is really about the control plane getting wider.
That sounds like a platform-engineering phrase, but it is becoming one of the more important ways to think about modern production systems.
A few years ago, when people said “control plane,” they usually meant something fairly specific. Kubernetes API server. Cloud API. CI/CD system. Maybe an internal deployment platform.
Now it is messier than that.
Your container runtime is part of the control plane because it decides how workloads actually start on the node.
Your database failover automation is part of the control plane because it decides whether recovery is safe or reckless.
Your AI incident-response agent is part of the control plane because it can inspect telemetry, summarize what changed, recommend action, and maybe someday trigger work directly.
Your Kubernetes API server egress path is part of the control plane because a stale route table or broken firewall path can stop admission webhooks, OIDC, and aggregated API calls from working.
Your credential revocation tooling is part of the control plane because compromised access has to be cut off fast.
Your cloud console is part of the control plane because operators still need a way to reach the environment during an incident.
Even object metadata starts to matter when data, AI, search, and agent workflows depend on understanding what an object is, not just where it lives.
That is the through-line in this episode.
containerd disclosed a batch of CRI plugin vulnerabilities, and the lesson is that Kubernetes security does not stop at pod specs, RBAC, admission control, or image scanning. Eventually the node runtime has to pull the image, unpack it, restore it, wire up devices, handle logs, and start the container. That runtime layer is not invisible plumbing. It is a trust boundary.
The Datadog PostgreSQL HA story is a different kind of control-plane lesson. Their gameday did not just ask whether PostgreSQL could fail over on Kubernetes. It exposed the harder question: can it fail over safely? If every standby is behind, promotion may be possible, but it may not be correct. And in databases, correct usually matters more than fast.
That is the part I love about the Datadog writeup. It is not the fantasy version of HA where automation magically fixes everything. It is the real version where replication lag, synchronous writes, RPO, RTO, and promotion safety all collide. Failover is only useful if the recovery path does not create a bigger problem.
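To make "available versus correct" a little more concrete, here is a minimal sketch of the promotion decision. It is not Patroni's logic or Datadog's implementation. The `Standby` type, the `pick_promotion_candidate` function, and the lag threshold are all hypothetical names I made up for illustration, and the only rule it encodes is "never promote a standby that is further behind than you have agreed to tolerate."

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Standby:
    name: str
    lag_bytes: int  # how far this standby trails the last known primary position


def pick_promotion_candidate(
    standbys: Sequence[Standby], max_lag_bytes: int = 0
) -> Optional[Standby]:
    """Return the least-lagged standby within the allowed lag, or None.

    None means no standby is safe to promote, so the correct move is to wait
    rather than accept data loss. max_lag_bytes is the amount of loss the team
    has explicitly agreed to tolerate (0 means none).
    """
    safe = [s for s in standbys if s.lag_bytes <= max_lag_bytes]
    if not safe:
        return None
    return min(safe, key=lambda s: s.lag_bytes)


# Hypothetical gameday state: every standby is behind.
all_behind = [Standby("pg-b", 5_000), Standby("pg-c", 12_000)]
candidate = pick_promotion_candidate(all_behind)
print("promote" if candidate else "wait", candidate)
```

The useful part is the `None` branch. If your automation has no way to say "wait," it has already chosen availability over correctness for you.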
The AWS DevOps Agent and Datadog MCP Server story pushes this same theme into AI operations. AI incident response is moving from demo to production workflow. That is exciting, but the question cannot just be “is the agent smart?” The better question is “what authority does it have?”
Can it only read?
Can it write?
Can it open tickets?
Can it trigger automation?
Can it roll back?
Can it page someone?
Can it make things worse quickly and very confidently?
That is the uncomfortable part. AI incident tooling can be genuinely useful, especially during the early chaos of an incident when everyone is jumping between dashboards, traces, logs, deploy history, and Slack threads. But once an agent sits near the operational control plane, it needs the same boring guardrails as any other production automation: least privilege, audit logs, approval boundaries, rollback rules, and a clear line between recommendation and execution.
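One way to picture that line between recommendation and execution is a small policy gate, sketched below. This is not how AWS DevOps Agent or Datadog MCP Server handle permissions. The action names, the `authorize` function, and the in-memory `audit_log` are hypothetical, and the assumption is simply: reads are allowed, listed writes need a named human, and everything else is denied.

```python
from enum import Enum
from typing import Optional


class Decision(Enum):
    ALLOW = "allow"
    NEEDS_APPROVAL = "needs_approval"
    DENY = "deny"


# Hypothetical action names, for illustration only.
READ_ONLY_ACTIONS = {"query_logs", "query_metrics", "summarize_deploys"}
APPROVAL_ACTIONS = {"rollback_deploy", "restart_service"}

# In-memory audit trail; a real system would ship this somewhere durable.
audit_log: list = []


def authorize(action: str, approved_by: Optional[str] = None) -> Decision:
    """Decide whether the agent may run an action, and record the decision."""
    if action in READ_ONLY_ACTIONS:
        decision = Decision.ALLOW
    elif action in APPROVAL_ACTIONS:
        # Write actions run only once a named human has signed off.
        decision = Decision.ALLOW if approved_by else Decision.NEEDS_APPROVAL
    else:
        # Anything not explicitly listed is denied by default.
        decision = Decision.DENY
    audit_log.append(
        {"action": action, "approved_by": approved_by, "decision": decision.value}
    )
    return decision
```

Default-deny is the design choice that matters here: a new capability should have to be added to a list on purpose, not discovered during an incident.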
Then there is EKS customer-routed control-plane egress, which is one of those features that sounds boring until you think through the failure modes. Routing Kubernetes API server outbound traffic through your own VPC is a real win for private and regulated environments. But it also means your route tables, security groups, NACLs, firewalls, and private connectivity can now become control-plane dependencies.
That is powerful.
It is also something you bring to a design review.
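For the "test the webhooks, test OIDC" part, a rough starting point is a plain TCP reachability probe like the sketch below. Big caveat: it runs from wherever you run it, not from the EKS control plane, so it only tells you something if you run it from a host that sits on the same routed path. The function names and the endpoint map are hypothetical, and a TCP connect says nothing about TLS, auth, or whether the webhook actually answers correctly.

```python
import socket


def can_connect(host: str, port: int, timeout_seconds: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_dependencies(endpoints: dict) -> dict:
    """Probe each named (host, port) pair and report which ones are reachable."""
    return {
        name: can_connect(host, port) for name, (host, port) in endpoints.items()
    }
```

You would feed `check_dependencies` a map of your own webhook and OIDC hosts and alert on any `False`, which gives you the metric and the test before the route table changes.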
The lightning round kept hitting the same idea from different angles. GitHub credential revocation is incident-response infrastructure. AWS Console Private Access pulls more operator workflow behind private network boundaries. Vercel Connect points toward short-lived, task-scoped credentials for agents instead of long-lived secrets sitting around forever. S3 annotations make object metadata more directly attached, mutable, and queryable instead of living in another side table that drifts from reality.
Different stories, same shape.
Authority keeps moving.
Trust keeps spreading.
The blast radius keeps expanding.
And that is where Marc Brooker’s post on waiting fits so well as the closer.
Your dashboards may measure averages, but your users do not experience averages. They experience the time they spend waiting. A ten-minute outage and a ten-hour outage might both count as one incident in a tracker, but they do not feel the same to the people stuck inside them. A service with a decent average can still feel terrible if users keep landing in the tail.
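Here is a small worked example of that gap, as a sketch with made-up numbers (the outage durations are illustrative assumptions, not data from Brooker's post). If affected users show up at a steady rate, the chance of landing in a given outage is proportional to how long it lasts, so the outage length a typical affected user sees is the length-weighted mean, not the plain mean.

```python
def mean_duration(durations):
    """Plain average per outage: what an incident tracker reports."""
    return sum(durations) / len(durations)


def experienced_duration(durations):
    """Length-weighted average: the outage a randomly arriving user lands in.

    Assumes users arrive at a steady rate, so each outage is weighted by its
    own length. That works out to sum(d^2) / sum(d).
    """
    return sum(d * d for d in durations) / sum(durations)


# Hypothetical outage lengths in minutes: nine 10-minute outages, one 10-hour outage.
outages = [10] * 9 + [600]

print(f"tracker mean: {mean_duration(outages):.0f} min")
print(f"user-experienced mean: {experienced_duration(outages):.0f} min")
```

With these numbers the tracker mean is 69 minutes, while the length-weighted figure is about 523 minutes. Same incidents, very different story depending on which side you measure from.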
That is the reliability lesson underneath the whole episode.
When the system breaks, users do not experience your architecture diagram. They do not care whether it was the runtime, the database, the AI agent, the credential system, the route table, the cloud console, or the metadata layer.
They experience waiting.
Waiting for a request.
Waiting for recovery.
Waiting for a deploy to stop failing.
Waiting for a credential to get revoked.
Waiting for the control plane to come back.
Waiting for someone to find the right context.
So the practical question is not just “is this system up?”
It is also:
Where is authority hiding?
What has to work before recovery can happen?
Which defaults are trusted?
Which control-plane paths are invisible?
Which tools can make changes?
Which systems can block deploys?
Which dependencies only show up when something breaks?
And most importantly, what does this feel like from the waiting side?
Because your dashboards measure the average.
Your users feel the wait.
Links from this episode:
containerd CRI plugin vulnerabilities / AWS security bulletin
https://aws.amazon.com/security/security-bulletins/2026-046-aws/
Datadog: PostgreSQL high availability on Kubernetes
https://www.datadoghq.com/blog/engineering/postgresql-ha-kubernetes/
AWS DevOps Agent and Datadog MCP Server
https://aws.amazon.com/blogs/devops/production-ready-autonomous-incident-resolution-with-aws-devops-agent-now-ga-and-datadog-mcp-server/
Amazon EKS customer-routed control-plane egress
https://aws.amazon.com/blogs/containers/amazon-eks-now-supports-control-plane-egress-through-your-vpc/
GitHub self-service credential revocation for incident response
https://github.blog/changelog/2026-06-24-self-service-credential-revocation-for-incident-response/
AWS Management Console Private Access
https://aws.amazon.com/about-aws/whats-new/2026/06/aws-management-console-private/
Vercel Connect
https://vercel.com/blog/introducing-vercel-connect
Amazon S3 annotations
https://aws.amazon.com/blogs/aws/amazon-s3-annotations-attach-rich-queryable-context-directly-to-your-objects/
Marc Brooker: Waiting, latency, MTTR, and the inspection paradox
https://brooker.co.za/blog/2026/06/19/waiting.html
This week’s On Call Brief
https://www.tellerstech.com/on-call-brief-news/2026-W26/
More Ship It Weekly episodes
https://shipitweekly.fm/