0:07
Hey, I'm Brian and this is Ship It Weekly by
0:10
Tellers Tech. This week is pretty stacked for
0:14
anyone running Kubernetes or building internal
0:17
platforms. We've got Kubernetes officially retiring
0:20
Ingress NGINX, CNCF tightening up what platform
0:25
engineering actually means and a new Kubernetes
0:29
AI conformance program that lines up with a bunch
0:32
of "SRE in the age of AI" conversations. Then I'll
0:37
hit a few quick links worth bookmarking and we'll
0:40
close with a short piece on fixation during incidents
0:43
and how that messes with our thinking. Let's
0:46
start with the big one if you run clusters: Ingress
0:50
NGINX. Kubernetes maintainers have announced
0:53
that the community Ingress NGINX project is
0:57
being retired. The official Kubernetes blog and
1:00
follow-up posts spell it out like this: Ingress
1:04
NGINX is moving to best-effort maintenance
1:07
until March 2026. After that, there will be
1:11
no new releases, no bug fixes, and no security
1:15
patches. The manifests and code will still be
1:18
there, but you're on your own if any new vulnerability
1:21
or bug shows up. The reason they give is pretty
1:26
straightforward. SIG Network and the Security
1:29
Response Committee want to prioritize the safety
1:32
and security of the ecosystem. Keeping such a
1:36
widely used ingress controller on life support without enough
1:39
dedicated maintainers is a risk, so they're drawing
1:42
a clear line and telling people to move before
1:45
that date. If you're running Kubernetes, you
1:48
probably already know why this matters. Ingress
1:51
NGINX is one of the most common ingress controllers
1:55
out there. It shows up in old blog posts, Helm
1:58
charts, getting started guides, everywhere. For
2:02
a lot of teams, it's just assumed to be the default.
2:06
So what do you do with this? First, don't treat
2:09
March 2026 as "we'll worry about it then." You
2:13
need real runway for this migration. This is
2:16
not just swapping an image tag. You're picking
2:19
a new edge component and threading that change
2:23
through your environments. There are a few broad
2:26
paths you can take. You can move towards Gateway
2:29
API-based solutions and treat this as a chance
2:33
to modernize your traffic routing story. You
2:36
can adopt one of the vendor controllers that
2:39
support the Gateway API properly, or if you're already
2:42
using a cloud provider's ingress or load balancer
2:45
integration, you might lean further into that
2:48
and simplify. The right answer depends on whether
2:52
you want portability, deep integration with your
2:55
cloud, or something more like a full-blown API
2:58
gateway at the edge.
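To make the Gateway API option concrete, here's a rough sketch of how a simple Ingress rule maps onto an HTTPRoute. All the names here, the `web` service, `example.com`, the `external` Gateway, are made up for illustration, and real migrations carry TLS config and controller-specific annotations that won't translate this cleanly.

```python
# Sketch: map a minimal Ingress rule onto the equivalent Gateway API
# HTTPRoute shape. Hypothetical names throughout; annotation-driven
# behavior and TLS need case-by-case handling in a real migration.
import yaml  # pip install pyyaml

# A stripped-down networking.k8s.io/v1 Ingress rule.
ingress_rule = {
    "host": "example.com",
    "path": "/api",
    "service": "web",
    "port": 8080,
}

# The same routing intent as a gateway.networking.k8s.io/v1 HTTPRoute.
http_route = {
    "apiVersion": "gateway.networking.k8s.io/v1",
    "kind": "HTTPRoute",
    "metadata": {"name": "web"},
    "spec": {
        # parentRefs points at a Gateway instead of relying on an
        # ingress class; "external" is a placeholder Gateway name.
        "parentRefs": [{"name": "external"}],
        "hostnames": [ingress_rule["host"]],
        "rules": [
            {
                "matches": [
                    {"path": {"type": "PathPrefix", "value": ingress_rule["path"]}}
                ],
                "backendRefs": [
                    {"name": ingress_rule["service"], "port": ingress_rule["port"]}
                ],
            }
        ],
    },
}

print(yaml.safe_dump(http_route, sort_keys=False))
```

Either way, I'd frame it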
3:01
like this internally. This is a real time-boxed
3:06
infra migration with risk and planning required,
3:10
not a background refactor. You want a plan for
3:14
where you're migrating to, how you're going to
3:17
test behavior and rules, how you'll roll out
3:21
by environment, and how you'll roll back if things
3:24
go sideways. Also, if you have clusters owned
3:28
by different teams or old clusters that nobody
3:31
touches because they just work, those are exactly
3:34
the places that are going to bite you when Ingress
3:38
NGINX is out of maintenance. It's a good excuse
3:42
to inventory where you're using it and pull those
3:46
into the same migration plan.
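For the inventory step, a small script against each cluster context gets you most of the way. This is a minimal sketch assuming the official Kubernetes Python client and the conventional `nginx` class name; adjust for whatever class names your clusters actually use.

```python
# Sketch: flag Ingress resources that look like they're served by
# ingress-nginx, either via ingressClassName or the legacy annotation.
# Run it once per kubeconfig context and aggregate the output.
from kubernetes import client, config  # pip install kubernetes

config.load_kube_config()  # or config.load_incluster_config()
networking = client.NetworkingV1Api()

for ing in networking.list_ingress_for_all_namespaces().items:
    class_name = ing.spec.ingress_class_name
    annotations = ing.metadata.annotations or {}
    legacy_class = annotations.get("kubernetes.io/ingress.class")
    if class_name == "nginx" or legacy_class == "nginx":
        print(f"{ing.metadata.namespace}/{ing.metadata.name} "
              f"(class={class_name or legacy_class})")
```

The bigger lesson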
3:49
here is that Kubernetes is not just the API server
3:54
and Kubelet. The ecosystem around it has a life
3:58
cycle too. You can't treat community projects
4:01
as permanent infrastructure. They age, they lose
4:04
maintainers, and they get retired. All right,
4:08
let's zoom out from Ingress and talk about platform
4:11
engineering in general. The CNCF just published
4:14
a new "What is Platform Engineering" post that's
4:19
worth a read, especially if your team already
4:22
has platform in the name. Their definition is
4:26
pretty close to how a lot of us have been using
4:29
the term, but it's nice to see it formalized.
4:33
They describe platform engineering as a discipline
4:36
focused on building and maintaining software
4:39
development platforms that provide self-service
4:42
for developer teams. In other words, your job
4:46
is to give product teams a coherent way to provision
4:49
infrastructure, deploy, test, observe, and operate
4:54
their apps without each team reinventing that
4:57
stack from scratch. A few things they emphasize.
5:01
Platform teams should be reducing developer cognitive
5:05
load, not adding to it. They talk about internal
5:09
developer platforms, golden paths, paved roads,
5:13
and treating the platform as a product with clear
5:17
users and feedback channels. Compliance and policy
5:22
enforcement are built into the platform, not
5:25
bolted on as a separate gauntlet that devs have
5:29
to run at the end. What I like about this writeup
5:32
is that it gives you language you can point at
5:35
when your platform team is really just a rebranded
5:39
ops team doing tickets and fighting fires. If
5:42
your day-to-day is mostly "people open a Jira,
5:45
we click buttons in the console," that's not what
5:48
CNCF is describing here. They're talking
5:51
about a team that builds and evolves a product:
5:55
internal APIs, templates, pipelines, and tooling
6:00
that developers can use themselves. You can use
6:04
this in a few practical ways. If you're trying
6:07
to justify time to work on self-service, developer
6:10
portals, or opinionated templates, you now have
6:14
a reference that says, this is not a vanity project.
6:18
This is literally what this discipline is meant
6:21
to do. If you're inheriting a mess of unstructured
6:24
Kubernetes, Terraform, and CI builds, you can
6:28
point at the CNCF definition and say, here's
6:32
what platform engineering actually looks like,
6:35
and here's the gap between that and what we have.
6:38
And if leadership wants platform engineering
6:41
because it's a buzzword, this is a nice way to
6:44
align them on what that implies: roadmaps, UX,
6:49
and internal customers, not just more infra people.
6:54
They also recently published a top five hard-
6:57
earned lessons piece from Kubernetes experts
7:00
that lines up nicely with this. It talks about
7:03
life cycle pain, upgrade complexity, and how
7:07
a lack of guardrails and policies burns teams
7:11
over time. That's basically the problem space
7:14
that platform engineering is trying to address.
7:17
Now let's take that platform story and connect
7:20
it to where the industry is clearly heading:
7:23
AI running on all of this. CNCF has now formally
7:28
launched the Certified Kubernetes AI Conformance
7:32
Program. This came out of KubeCon North America
7:35
and has been picked up in places like Forbes
7:38
and other coverage. The idea is pretty simple.
7:41
Everyone is trying to run AI and ML workloads
7:45
on Kubernetes now, but every environment is slightly
7:49
different: different operators, different GPU
7:52
scheduling strategies, different ways of handling
7:55
storage and networking for models. The conformance
7:58
program defines a shared set of capabilities
8:02
and configurations that a platform needs to meet
8:06
to be considered AI conformant on Kubernetes.
8:09
It's similar in spirit to the existing Kubernetes
8:12
software conformance program, but focused on
8:15
AI workloads. The goal is portability and predictability.
8:20
If a vendor or platform is certified, you should
8:23
be able to run common AI frameworks and workloads
8:26
there without a ton of custom glue. For platform
8:30
and SRE teams, there are a couple of implications.
8:34
First, AI and ML workloads are no longer pet
8:38
projects off to the side. They're becoming first
8:41
class citizens on your clusters and in your CI/
8:44
CD. You're going to deal with GPU capacity planning,
8:49
noisy neighbors, model deployment pipelines,
8:52
and data movement as real operational concerns.
8:56
Second, standards like this give you something
8:59
to anchor on. Instead of every team building
9:01
their own ad hoc pattern for running models,
9:04
you can say, "we want our platform to meet this
9:07
conformance," or use the checklist as input to
9:11
your own design. In parallel with that, there's
9:14
a good article on DevOps.com titled "SRE in the
9:18
Age of AI: What Reliability Looks Like When Systems
9:22
Learn." It talks about how SRE is shifting from
9:26
guarding mostly deterministic systems to working
9:30
with adaptive learning systems where behavior
9:33
changes over time. Traditional SRE practices
9:37
like SLOs, incident response, and postmortems
9:40
still matter. But now you have extra dimensions:
9:45
model drift, data quality, and feedback loops.
9:48
You're not just measuring latency and error rates,
9:52
you're worrying about correctness, bias, and
9:55
how often the model does something unexpected.
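To make one of those dimensions concrete, here's a minimal sketch of a drift SLI using the population stability index, comparing a feature's live distribution against its training baseline. This is a generic illustration, not something from the article, and the 0.2 threshold is just a commonly cited rule of thumb you'd tune.

```python
# Sketch: population stability index (PSI) as a drift SLI.
# Bucket a feature's training baseline, compare live traffic against it,
# and alert when the divergence crosses a placeholder threshold.
import numpy as np

def psi(baseline: np.ndarray, live: np.ndarray, buckets: int = 10) -> float:
    # Bucket edges come from the training baseline's quantiles; the
    # outer edges are widened so live outliers still land in a bucket.
    edges = np.quantile(baseline, np.linspace(0, 1, buckets + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    base_frac = np.histogram(baseline, bins=edges)[0] / len(baseline)
    live_frac = np.histogram(live, bins=edges)[0] / len(live)
    # A small floor avoids log(0) when a bucket is empty.
    base_frac = np.clip(base_frac, 1e-6, None)
    live_frac = np.clip(live_frac, 1e-6, None)
    return float(np.sum((live_frac - base_frac) * np.log(live_frac / base_frac)))

# Hypothetical feature samples: training baseline vs. today's traffic.
rng = np.random.default_rng(0)
baseline = rng.normal(0.0, 1.0, 10_000)
live = rng.normal(0.4, 1.2, 10_000)  # drifted on purpose

score = psi(baseline, live)
print(f"PSI = {score:.3f}")
if score > 0.2:  # placeholder "investigate" threshold
    print("Drift SLI breached: page whoever owns this model's reliability.")
```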
9:58
So if your org is doing AI, I'd be asking, who
10:02
owns the reliability of those workloads? Is it
10:06
the ML team, the platform team, or SRE? How are
10:10
we observing model behavior, not just pod CPU?
10:14
And are we going to align to something like this
10:17
AI conformance program, or are we comfortable
10:20
having a one-off AI setup for each team? Big
10:23
picture, all of this says the platform you build
10:26
now has to support both normal services and AI
10:30
workloads, and the reliability story has to keep
10:34
up with that. All right, that's the big three.
10:37
Let's hit a few quick links worth saving in a
10:43
lightning round. These are things you might want
10:43
to throw into your read later queue or share
10:46
with the team. First one is from the CNCF: top
10:49
five hard-earned lessons from the experts on
10:53
managing Kubernetes. It's a short piece, but
10:56
it reinforces what a lot of us already know.
10:58
Most of the pain isn't in the initial cluster
11:01
build, it's in upgrades, dependency sprawl, and
11:05
lack of guardrails. Good thing to hand to leaders
11:08
who think Kubernetes is "set it and forget it."
11:11
From SRE Weekly, there's a nice database migration
11:14
story from Tines about zero-downtime migrations.
11:18
What I like there is they actually define what
11:21
zero downtime meant for them and admit where
11:23
they accepted some degradation. It's a solid
11:26
template if you're trying to socialize realistic
11:29
expectations around a big DB change. There's
11:33
also a piece about upgrading PostgreSQL with
11:37
minimal downtime at something like 20,000 transactions
11:41
per second. If you own Postgres in production,
11:44
this is a good case study on planning, replication,
11:47
and rollback.
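To give a flavor of what that planning looks like in practice, here's a small sketch that polls replica lag before a cutover. It's a generic illustration rather than anything from the article; the connection string and threshold are placeholders.

```python
# Sketch: poll replica lag on the primary before cutting traffic over.
# Generic illustration; connection string and threshold are placeholders.
import time
import psycopg2  # pip install psycopg2-binary

MAX_LAG_BYTES = 1024 * 1024  # 1 MiB, a placeholder cutover threshold

conn = psycopg2.connect("dbname=app host=primary.example.internal")
conn.autocommit = True

with conn.cursor() as cur:
    while True:
        # pg_wal_lsn_diff gives the byte distance between the primary's
        # current WAL position and what each replica has replayed.
        cur.execute("""
            SELECT application_name,
                   pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag
            FROM pg_stat_replication
        """)
        rows = cur.fetchall()
        worst = max((lag for _, lag in rows if lag is not None), default=None)
        print(rows)
        if worst is not None and worst < MAX_LAG_BYTES:
            print("Replicas caught up; safe to proceed with cutover.")
            break
        time.sleep(5)
```

And finally, there's a really interesting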
11:50
article on building a distributed priority queue
11:52
on top of Kafka. They talk about supporting
11:57
different SLOs for different events and how they
12:00
added a proxy layer to avoid head-of-line blocking
12:04
within partitions. If you're using Kafka heavily
12:07
and starting to bump into "everything has the
12:10
same priority," it's a worthwhile read.
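For contrast with their design, here's a sketch of the naive pattern most teams start with, one topic per priority tier, with made-up topic names. It keeps a p2 backlog from blocking p0 traffic, but it gives you none of the per-event SLO handling or the proxy layer the article describes.

```python
# Sketch: the naive tiered-topics pattern, one Kafka topic per priority.
# Made-up topic names; this is the baseline the article improves on,
# not their actual proxy-layer design.
import json
from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

PRIORITY_TOPICS = {0: "events.p0", 1: "events.p1", 2: "events.p2"}

def publish(event: dict, priority: int) -> None:
    # Route by priority so a backlog of p2 events can't block p0;
    # consumers then poll higher tiers first (or weight their reads).
    topic = PRIORITY_TOPICS.get(priority, "events.p2")
    producer.send(topic, value=event)

publish({"type": "alert", "id": 123}, priority=0)
publish({"type": "digest", "id": 456}, priority=2)
producer.flush()
```

I'll drop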
12:13
links to all of those in the show notes. Let's
12:16
finish with something on the human side of incidents.
12:19
In SRE Weekly issue 497, there's a piece by Lorin
12:24
Hochstein titled "Fixation: The Ever-Present
12:26
Risk During Incident Handling." The core idea
12:30
is simple, but important. During incidents, we
12:33
tend to latch onto a single theory, a single
12:36
plan, or a single mental model of what's going
12:40
wrong. Once we're locked in, we ignore or downplay
12:44
signals that don't fit that story. You've probably
12:47
seen this, right? Someone says "it's DNS" or "it's
12:50
the database" early in the incident. And from
12:53
that point on, every log line and every graph
12:56
gets interpreted as evidence for that, even when
12:59
it doesn't really fit, or the first mitigation
13:02
we try becomes the plan and we keep pushing on
13:06
it long after it's clear it's not working. Lorin's
13:09
point is that fixation is a normal human response
13:12
under stress, but it's dangerous for incident
13:15
handling. The way out isn't to pretend we won't
13:18
do it, it's to structure our process so we catch
13:21
it. A couple of simple things you can do: make
13:24
it someone's job in the incident to periodically
13:27
ask, "What else could this be?" and "What evidence
13:31
would change our mind about the current theory?"
13:34
It sounds basic, but actually saying that out
13:37
loud gives people permission to raise alternate
13:40
explanations. Also, be explicit when a hypothesis
13:45
fails. Instead of silently shifting to a new
13:48
theory, call it out: "We tried X, it didn't change
13:51
the symptoms, so we're going to park that and
13:54
consider Y." That keeps the team from unconsciously
13:58
dragging the old theory along. The broader theme,
14:01
which ties back to the outages and platform work
14:04
we talked about earlier, is that incidents are
14:07
not just technical events. They're also snapshots
14:10
of how we think under pressure: how we reason
14:13
about risk, how we handle uncertainty, and
14:17
whether our process helps or gets in the way.
14:19
I'll put a link to that fixation article and
14:22
also the ongoing trade-offs and incidents as
14:25
landmark piece from Fred Hebert in the notes
14:27
if you want to go deeper on this. All right.
14:30
That's it for this episode. We talked about the
14:33
retirement of Ingress NGINX and why that
14:36
should trigger a real migration plan, not a to
14:40
-do buried in Confluence. We walked through CNCF's
14:43
updated framing for platform engineering and
14:46
how it lines up with building actual internal
14:49
platforms instead of just rebranding ops. And
14:52
we looked at the new Kubernetes AI conformance
14:56
program and some of the thinking around SRE in
14:59
the age of AI and where models and data pipelines
15:03
are part of the reliability story. In the show notes,
15:07
I'll include links to the Ingress NGINX retirement
15:09
posts and migration guides, the CNCF platform
15:13
engineering blog, the AI conformance program,
15:17
the DevOps.com SRE in AI article, plus the SRE
15:22
Weekly links we talked about. If you found this
15:25
useful, share it with someone on your team who's
15:27
thinking about platform work or planning cluster
15:30
changes. I'm Brian and this is Ship It Weekly
15:33
by Tellers Tech, and I'll see you in the next
15:36
one.