💬 Host Commentary

Episode 2 is the “Kubernetes is growing up” episode.

It’s three themes that all connect if you’re on a platform team: old defaults are getting retired, platform engineering is turning into an actual discipline instead of a vibe, and AI is starting to become a first-class workload you’ll need to run and support.

First up, Ingress NGINX. Kubernetes is officially retiring the project: it gets best-effort maintenance until March 2026, and after that there will be no further releases, bug fixes, or security updates. If you’re still using it, this isn’t “panic today,” but it is a real clock. You need a plan, you need time for testing, and you need a way to migrate without turning your ingress layer into a random mix of controllers and annotations nobody understands.

The bigger point: core building blocks do get sunset. The earlier you treat “platform dependencies” like real dependencies, the less painful these transitions are.

Second, platform engineering. CNCF has been putting more shape around what it actually means, and I like that the conversation is moving past buzzwords. Platform as a product sounds corny until you realize it’s basically: internal customers, a roadmap, docs that don’t suck, paved paths, and feedback loops. Plus, the Kubernetes lessons-learned piece is full of the usual hard-earned truths… operational consistency beats cleverness, and the clusters that hurt the most are usually the ones that grew “organically” for years without guardrails.

Third, AI on Kubernetes and “AI-native SRE.” CNCF’s new Certified Kubernetes AI Conformance Program is a big signal. AI workloads are not just another stateless web app. They’re heavier, they’re weirder, and they care about things like GPU scheduling, data locality, and reproducibility. And on the SRE side, the “systems learn and drift” angle is real. Reliability isn’t only “is it up.” It’s also “is it behaving the same way it did last week.” If you’re responsible for operating AI-powered systems, you’re going to end up caring about model versions, data changes, and guardrails as much as you care about CPU and memory.

Then in the lightning round we hit a few great reads on zero-downtime database work, Postgres upgrades, and a Kafka priority queue, and we close with the human side of incidents: fixation during response and how incidents become landmarks for the tradeoffs you’ve been making over time.

If episode 1 was “the cloud is a dependency graph,” episode 2 is “your platform is a product, whether you admit it or not.”

Show notes below have all the links if you want to dig into the source posts.

📝 Show Notes

In this episode of Ship It Weekly, Brian digs into three big themes for anyone running Kubernetes or building internal platforms.

First, Kubernetes is officially retiring Ingress NGINX: the project gets best-effort maintenance until March 2026, with no further releases, bug fixes, or security updates after that. We talk about what that actually means if you’re still using it and how to think about choosing and rolling out a replacement ingress controller.

Second, we look at how CNCF is defining platform engineering and what “platform as a product” looks like in practice, plus some hard-earned lessons from running Kubernetes in production.

Third, we talk about AI as a first-class workload on Kubernetes. CNCF’s new Certified Kubernetes AI Conformance Program aims to standardize how AI runs on K8s, and recent writing on SRE in the age of AI looks at what reliability means when systems learn and drift.

In the lightning round, we hit good reads on database migrations, Postgres upgrades, and a distributed priority queue on Kafka. We wrap with the human side of incidents: fixation during incident response and using incidents as landmarks for the tradeoffs you’ve been making over time.

If you’re on a platform team, responsible for SLOs, or the person people ping when “Kubernetes is weird,” this one should give you concrete questions to take back to your roadmap and runbooks.

Links from this episode

https://kubernetes.io/blog/2025/11/11/ingress-nginx-retirement/

https://www.haproxy.com/blog/ingress-nginx-is-retiring

https://www.cncf.io/blog/2025/11/19/what-is-platform-engineering/

https://www.cncf.io/announcements/2025/11/11/cncf-launches-certified-kubernetes-ai-conformance-program-to-standardize-ai-workloads-on-kubernetes/

https://devops.com/sre-in-the-age-of-ai-what-reliability-looks-like-when-systems-learn/

Lightning round

https://www.cncf.io/blog/2025/11/18/top-5-hard-earned-lessons-from-the-experts-on-managing-kubernetes/

https://www.tines.com/blog/zero-downtime-database-migrations-lessons-from-moving-a-live-production

https://palark.com/blog/postgresql-upgrade-no-data-loss-downtime/

https://klaviyo.tech/building-a-distributed-priority-queue-in-kafka-1b2d8063649e

https://sreweekly.com/sre-weekly-issue-497/

https://ferd.ca/ongoing-tradeoffs-and-incidents-as-landmarks.html