0:07
Hey, I'm Brian and this is Ship It Weekly by
0:10
Tellers Tech. This week is pretty stacked for
0:14
anyone running Kubernetes or building internal
0:17
platforms. We've got Kubernetes officially retiring
0:20
Ingress NGINX, CNCF tightening up what platform
0:25
engineering actually means and a new Kubernetes
0:29
AI conformance program that lines up with a bunch
0:32
of "SRE in the age of AI" conversations. Then I'll
0:37
hit a few quick links worth bookmarking and we'll
0:40
close with a short piece on fixation during incidents
0:43
and how that messes with our thinking. Let's
0:46
start with the big one if you run clusters: Ingress
0:50
NGINX. Kubernetes maintainers have announced
0:53
that the community Ingress NGINX project is
0:57
being retired. The official Kubernetes blog and
1:00
follow-up posts spell it out like this: Ingress
1:04
NGINX is moving to best-effort maintenance
1:07
until March 2026. After that, there will be
1:11
no new releases, no bug fixes, and no security
1:15
patches. The manifests and code will still be
1:18
there, but you're on your own if any new vulnerability
1:21
or bug shows up. The reason they give is pretty
1:26
straightforward. SIG Network and the Security
1:29
Response Committee want to prioritize the safety
1:32
and security of the ecosystem. Keeping such a
1:36
widely used ingress controller on life support without enough
1:39
dedicated maintainers is a risk, so they're drawing
1:42
a clear line and telling people to move before
1:45
that date. If you're running Kubernetes, you
1:48
probably already know why this matters. Ingress
1:51
NGINX is one of the most common ingress controllers
1:55
out there. It shows up in old blog posts, Helm
1:58
charts, getting started guides, everywhere. For
2:02
a lot of teams, it's just assumed to be the default.
2:06
So what do you do with this? First, don't treat
2:09
March 2026 as "we'll worry about it then." You
2:13
need real runway for this migration. This is
2:16
not just swapping an image tag. You're picking
2:19
a new edge component and threading that change
2:23
through your environments. There are a few broad
2:26
paths you can take. You can move towards Gateway
2:29
API-based solutions and treat this as a chance
2:33
to modernize your traffic routing story. You
2:36
can adopt one of the vendor controllers that
2:39
support the Gateway API properly, or if you're already
2:42
using a cloud provider's ingress or load balancer
2:45
integration, you might lean further into that
2:48
and simplify. The right answer depends on whether
2:52
you want portability, deep integration with your
2:55
cloud, or something more like a full-blown API
2:58
gateway at the edge.
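To make the Gateway API option concrete, here's a rough sketch of how a simple Ingress rule maps onto an HTTPRoute. All the names here, the `web` service, `example.com`, the `external` Gateway, are made up for illustration, and real migrations carry TLS config and controller-specific annotations that won't translate this cleanly.

```python
# Sketch: map a minimal Ingress rule onto the equivalent Gateway API
# HTTPRoute shape. Hypothetical names throughout; annotation-driven
# behavior and TLS need case-by-case handling in a real migration.
import yaml  # pip install pyyaml

# A stripped-down networking.k8s.io/v1 Ingress rule.
ingress_rule = {
    "host": "example.com",
    "path": "/api",
    "service": "web",
    "port": 8080,
}

# The same routing intent as a gateway.networking.k8s.io/v1 HTTPRoute.
http_route = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {"name": "web"},
    "spec": {
        # parentRefs points at a Gateway instead of relying on an
        # ingress class; "external" is a placeholder Gateway name.
        "parentRefs": [{"name": "external"}],
        "hostnames": [ingress_rule["host"]],
        "rules": [
            {
                "matches": [
                    {"path": {"type": "PathPrefix", "value": ingress_rule["path"]}}
                ],
                "backendRefs": [
                    {"name": ingress_rule["service"], "port": ingress_rule["port"]}
                ],
            }
        ],
    },
}

print(yaml.safe_dump(http_route, sort_keys=False))
```

Either way, I'd frame it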
3:01
like this internally. This is a real time-boxed
3:06
infra migration with risk and planning required,
3:10
not a background refactor. You want a plan for
3:14
where you're migrating to, how you're going to
3:17
test behavior and rules, how you'll roll out
3:21
by environment, and how you'll roll back if things
3:24
go sideways. Also, if you have clusters owned
3:28
by different teams or old clusters that nobody
3:31
touches because they just work, those are exactly
3:34
the places that are going to bite you when Ingress
3:38
NGINX is out of maintenance. It's a good excuse
3:42
to inventory where you're using it and pull those
3:46
into the same migration plan.
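For the inventory step, a small script against each cluster context gets you most of the way. This is a minimal sketch assuming the official Kubernetes Python client and the conventional `nginx` class name; adjust for whatever class names your clusters actually use.

```python
# Sketch: flag Ingress resources that look like they're served by
# ingress-nginx, either via ingressClassName or the legacy annotation.
# Run it once per kubeconfig context and aggregate the output.
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # or config.load_incluster_config()
networking = client.NetworkingV1Api()

for ing in networking.list_ingress_for_all_namespaces().items:
    class_name = ing.spec.ingress_class_name
    annotations = ing.metadata.annotations or {}
    legacy_class = annotations.get("kubernetes.io/ingress.class")
    if class_name == "nginx" or legacy_class == "nginx":
        print(f"{ing.metadata.namespace}/{ing.metadata.name} "
              f"(class={class_name or legacy_class})")
```

The bigger lesson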
3:49
here is that Kubernetes is not just the API server
3:54
and Kubelet. The ecosystem around it has a life
3:58
cycle too. You can't treat community projects
4:01
as permanent infrastructure. They age, they lose
4:04
maintainers, and they get retired. All right,
4:08
let's zoom out from Ingress and talk about platform
4:11
engineering in general. The CNCF just published
4:14
a new "What is Platform Engineering" post that's
4:19
worth a read, especially if your team already
4:22
has platform in the name. Their definition is
4:26
pretty close to how a lot of us have been using
4:29
the term, but it's nice to see it formalized.
4:33
They describe platform engineering as a discipline
4:36
focused on building and maintaining software
4:39
development platforms that provide self-service
4:42
for developer teams. In other words, your job
4:46
is to give product teams a coherent way to provision
4:49
infrastructure, deploy, test, observe, and operate
4:54
their apps without each team reinventing that
4:57
stack from scratch. A few things they emphasize.
5:01
Platform teams should be reducing developer cognitive
5:05
load, not adding to it. They talk about internal
5:09
developer platforms, golden paths, paved roads,
5:13
and treating the platform as a product with clear
5:17
users and feedback channels. Compliance and policy
5:22
enforcement are built into the platform, not
5:25
bolted on as a separate gauntlet that devs have
5:29
to run at the end. What I like about this writeup
5:32
is that it gives you language you can point at
5:35
when your platform team is really just a rebranded
5:39
ops team doing tickets and fighting fires. If
5:42
your day-to-day is mostly "people open a Jira,
5:45
we click buttons in the console," that's not what
5:48
CNCF is describing here. They're talking
5:51
about a team that builds and evolves a product:
5:55
internal APIs, templates, pipelines, and tooling
6:00
that developers can use themselves. You can use
6:04
this in a few practical ways. If you're trying
6:07
to justify time to work on self-service, developer
6:10
portals, or opinionated templates, you now have
6:14
a reference that says, this is not a vanity project.
6:18
This is literally what this discipline is meant
6:21
to do. If you're inheriting a mess of unstructured
6:24
Kubernetes, Terraform, and CI builds, you can
6:28
point at the CNCF definition and say, here's
6:32
what platform engineering actually looks like,
6:35
and here's the gap between that and what we have.
6:38
And if leadership wants platform engineering
6:41
because it's a buzzword, this is a nice way to
6:44
align them on what that implies: roadmaps, UX,
6:49
and internal customers, not just more infra people.
6:54
They also recently published a top five hard-
6:57
earned lessons piece from Kubernetes experts
7:00
that lines up nicely with this. It talks about
7:03
life cycle pain, upgrade complexity, and how
7:07
a lack of guardrails and policies burns teams
7:11
over time. That's basically the problem space
7:14
that platform engineering is trying to address.
7:17
Now let's take that platform story and connect
7:20
it to where the industry is clearly heading:
7:23
AI running on all of this. CNCF has now formally
7:28
launched the Certified Kubernetes AI Conformance
7:32
Program. This came out of KubeCon North America
7:35
and has been picked up in places like Forbes
7:38
and other coverage. The idea is pretty simple.
7:41
Everyone is trying to run AI and ML workloads
7:45
on Kubernetes now, but every environment is slightly
7:49
different: different operators, different GPU
7:52
scheduling strategies, different ways of handling
7:55
storage and networking for models. The conformance
7:58
program defines a shared set of capabilities
8:02
and configurations that a platform needs to meet
8:06
to be considered AI conformant on Kubernetes.
8:09
It's similar in spirit to the existing Kubernetes
8:12
software conformance program, but focused on
8:15
AI workloads. The goal is portability and predictability.
8:20
If a vendor or platform is certified, you should
8:23
be able to run common AI frameworks and workloads
8:26
there without a ton of custom glue. For platform
8:30
and SRE teams, there are a couple of implications.
8:34
First, AI and ML workloads are no longer pet
8:38
projects off to the side. They're becoming first
8:41
class citizens on your clusters and in your CI/
8:44
CD. You're going to deal with GPU capacity planning,
8:49
noisy neighbors, model deployment pipelines,
8:52
and data movement as real operational concerns.
8:56
Second, standards like this give you something
8:59
to anchor on. Instead of every team building
9:01
their own ad hoc pattern for running models,
9:04
you can say, "we want our platform to meet this
9:07
conformance," or use the checklist as input to
9:11
your own design. In parallel with that, there's
9:14
a good article on DevOps.com titled "SRE in the
9:18
Age of AI: What Reliability Looks Like When Systems
9:22
Learn." It talks about how SRE is shifting from
9:26
guarding mostly deterministic systems to working
9:30
with adaptive learning systems where behavior
9:33
changes over time. Traditional SRE practices
9:37
like SLOs, incident response, and postmortems
9:40
still matter. But now you have extra dimensions:
9:45
model drift, data quality, and feedback loops.
9:48
You're not just measuring latency and error rates,
9:52
you're worrying about correctness, bias, and
9:55
how often the model does something unexpected.
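To make one of those dimensions concrete, here's a minimal sketch of a drift SLI using the population stability index, comparing a feature's live distribution against its training baseline. This is a generic illustration, not something from the article, and the 0.2 threshold is just a commonly cited rule of thumb you'd tune.

```python
# Sketch: population stability index (PSI) as a drift SLI.
# Bucket a feature's training baseline, compare live traffic against it,
# and alert when the divergence crosses a placeholder threshold.
import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, buckets: int = 10) -> float:
    # Bucket edges come from the training baseline's quantiles; the
    # outer edges are widened so live outliers still land in a bucket.
    edges = np.quantile(baseline, np.linspace(0, 1, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    # A small floor avoids log(0) when a bucket is empty.
    base_frac = np.clip(base_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - base_frac) * np.log(live_frac / base_frac)))

# Hypothetical feature samples: training baseline vs. today's traffic.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
live = rng.normal(0.4, 1.2, 10_000)  # drifted on purpose

score = psi(baseline, live)
print(f"PSI = {score:.3f}")
if score > 0.2:  # placeholder "investigate" threshold
    print("Drift SLI breached: page whoever owns this model's reliability.")
```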
9:58
So if your org is doing AI, I'd be asking, who
10:02
owns the reliability of those workloads? Is it
10:06
the ML team, the platform team, or SRE? How are
10:10
we observing model behavior, not just pod CPU?
10:14
And are we going to align to something like this
10:17
AI conformance program, or are we comfortable
10:20
having a one-off AI setup for each team? Big
10:23
picture, all of this says the platform you build
10:26
now has to support both normal services and AI
10:30
workloads, and the reliability story has to keep
10:34
up with that. All right, that's the big three.
10:37
Let's hit a few quick links worth saving in a
10:43
lightning round. These are things you might want
10:43
to throw into your read later queue or share
10:46
with the team. First one is from the CNCF: top
10:49
five hard-earned lessons from the experts on
10:53
managing Kubernetes. It's a short piece, but
10:56
it reinforces what a lot of us already know.
10:58
Most of the pain isn't in the initial cluster
11:01
build, it's in upgrades, dependency sprawl, and
11:05
lack of guardrails. Good thing to hand to leaders
11:08
who think Kubernetes is "set it and forget it."
11:11
From SRE Weekly, there's a nice database migration
11:14
story from Tines about zero-downtime migrations.
11:18
What I like there is they actually define what
11:21
zero downtime meant for them and admit where
11:23
they accepted some degradation. It's a solid
11:26
template if you're trying to socialize realistic
11:29
expectations around a big DB change. There's
11:33
also a piece about upgrading PostgreSQL with
11:37
minimal downtime at something like 20,000 transactions
11:41
per second. If you own Postgres in production,
11:44
this is a good case study on planning, replication,
11:47
and rollback.
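To give a flavor of what that planning looks like in practice, here's a small sketch that polls replica lag before a cutover. It's a generic illustration rather than anything from the article; the connection string and threshold are placeholders.

```python
# Sketch: poll replica lag on the primary before cutting traffic over.
# Generic illustration; connection string and threshold are placeholders.
import time
import psycopg2  # pip install psycopg2-binary

MAX_LAG_BYTES = 1024 * 1024  # 1 MiB, a placeholder cutover threshold

conn = psycopg2.connect("dbname=app host=primary.example.internal")
conn.autocommit = True

with conn.cursor() as cur:
    while True:
        # pg_wal_lsn_diff gives the byte distance between the primary's
        # current WAL position and what each replica has replayed.
        cur.execute("""
            SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag
            FROM pg_stat_replication
        """)
        rows = cur.fetchall()
        worst = max((lag for _, lag in rows if lag is not None), default=None)
        print(rows)
        if worst is not None and worst < MAX_LAG_BYTES:
            print("Replicas caught up; safe to proceed with cutover.")
            break
        time.sleep(5)
```

And finally, there's a really interesting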
11:50
article on building a distributed priority queue
11:52
on top of Kafka. They talk about supporting
11:57
different SLOs for different events and how they
12:00
added a proxy layer to avoid head-of-line blocking
12:04
within partitions. If you're using Kafka heavily
12:07
and starting to bump into "everything has the
12:10
same priority," it's a worthwhile read.
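For contrast with their design, here's a sketch of the naive pattern most teams start with, one topic per priority tier, with made-up topic names. It keeps a p2 backlog from blocking p0 traffic, but it gives you none of the per-event SLO handling or the proxy layer the article describes.

```python
# Sketch: the naive tiered-topics pattern, one Kafka topic per priority.
# Made-up topic names; this is the baseline the article improves on,
# not their actual proxy-layer design.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

PRIORITY_TOPICS = {0: "events.p0", 1: "events.p1", 2: "events.p2"}

def publish(event: dict, priority: int) -> None:
    # Route by priority so a backlog of p2 events can't block p0;
    # consumers then poll higher tiers first (or weight their reads).
    topic = PRIORITY_TOPICS.get(priority, "events.p2")
    producer.send(topic, value=event)

publish({"type": "alert", "id": 123}, priority=0)
publish({"type": "digest", "id": 456}, priority=2)
producer.flush()
```

I'll drop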
12:13
links to all of those in the show notes. Let's
12:16
finish with something on the human side of incidents.
12:19
In SRE Weekly issue 497, there's a piece by Lorin
12:24
Hochstein titled "Fixation: The Ever-Present
12:26
Risk During Incident Handling." The core idea
12:30
is simple, but important. During incidents, we
12:33
tend to latch onto a single theory, a single
12:36
plan, or a single mental model of what's going
12:40
wrong. Once we're locked in, we ignore or downplay
12:44
signals that don't fit that story. You've probably
12:47
seen this, right? Someone says "it's DNS" or "it's
12:50
the database" early in the incident. And from
12:53
that point on, every log line and every graph
12:56
gets interpreted as evidence for that, even when
12:59
it doesn't really fit, or the first mitigation
13:02
we try becomes the plan and we keep pushing on
13:06
it long after it's clear it's not working. Lorin's
13:09
point is that fixation is a normal human response
13:12
under stress, but it's dangerous for incident
13:15
handling. The way out isn't to pretend we won't
13:18
do it, it's to structure our process so we catch
13:21
it. A couple of simple things you can do: make
13:24
it someone's job in the incident to periodically
13:27
ask, "What else could this be?" and "What evidence
13:31
would change our mind about the current theory?"
13:34
It sounds basic, but actually saying that out
13:37
loud gives people permission to raise alternate
13:40
explanations. Also, be explicit when a hypothesis
13:45
fails. Instead of silently shifting to a new
13:48
theory, call it out: "We tried X, it didn't change
13:51
the symptoms, so we're going to park that and
13:54
consider Y." That keeps the team from unconsciously
13:58
dragging the old theory along. The broader theme,
14:01
which ties back to the outages and platform work
14:04
we talked about earlier, is that incidents are
14:07
not just technical events. They're also snapshots
14:10
of how we think under pressure: how we reason
14:13
about risk, how we handle uncertainty, and
14:17
whether our process helps or gets in the way.
14:19
I'll put a link to that fixation article and
14:22
also the ongoing trade-offs and incidents as
14:25
landmark piece from Fred Hebert in the notes
14:27
if you want to go deeper on this. All right.
14:30
That's it for this episode. We talked about the
14:33
retirement of Ingress NGINX and why that
14:36
should trigger a real migration plan, not a to
14:40
-do buried in Confluence. We walked through CNCF's
14:43
updated framing for platform engineering and
14:46
how it lines up with building actual internal
14:49
platforms instead of just rebranding ops. And
14:52
we looked at the new Kubernetes AI conformance
14:56
program and some of the thinking around SRE in
14:59
the age of AI and where models and data pipelines
15:03
are part of the reliability story. In the show notes,
15:07
I'll include links to the Ingress NGINX retirement
15:09
posts and migration guides, the CNCF platform
15:13
engineering blog, the AI conformance program,
15:17
the DevOps.com SRE in AI article, plus the SRE
15:22
Weekly links we talked about. If you found this
15:25
useful, share it with someone on your team who's
15:27
thinking about platform work or planning cluster
15:30
changes. I'm Brian and this is Ship It Weekly
15:33
by Tellers Tech, and I'll see you in the next
15:36
one.