💬 Host Commentary

Episode 4 is my “big platforms week” episode.

We start with AWS re:Invent, but not in the usual hypey way. I’m looking at it like a platform team would: what changes the paved roads, what changes the reliability story, and what’s going to show up as a ticket in your queue three months from now.

That includes stuff like regional NAT Gateway availability and the new Route 53 Global Resolver on the networking side, plus new opinionated paths like ECS Express Mode and the “EKS capabilities” direction AWS keeps leaning into. There’s also a clear AI and data signal with things like S3 Vectors going GA and support for S3 objects up to 50 TB. Even if you don’t care about the buzzwords, you should care about what this does to the patterns teams will try to roll into your clusters and accounts.

Then we step out of AWS for a minute and talk about Google’s 130,000-node GKE cluster. It’s obviously an extreme case, but those write-ups are still useful because they show what breaks first: control plane pressure, scheduling behavior, networking limits, and how much operational discipline you need when “it scales” stops being a marketing phrase and becomes a daily reality.

And then we hit the spicy one: “kill staging.”

The argument isn’t “YOLO production.” It’s that staging is often a false sense of safety. The more your staging environment diverges from prod, the more it becomes a place where bugs hide, not where bugs get caught. The real conversation is how you test in production responsibly: feature flags, progressive rollouts, canaries, solid observability, and a rollback path that doesn’t rely on heroics.
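If you want a concrete picture of what "progressive rollout behind a feature flag" can look like, here's a minimal sketch in Python. It isn't taken from the article or from any particular flag service. The function names (`rollout_bucket`, `is_enabled`, `next_step`), the step ladder, and the 1% error threshold are all hypothetical choices made for illustration. The idea is just that each user lands in a stable bucket, you widen the percentage in steps, and a bad error rate sends you back to zero without heroics.

```python
import hashlib


def rollout_bucket(flag_name: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name: str, user_id: str, rollout_percent: int) -> bool:
    """Return True if this user falls inside the current rollout percentage."""
    if rollout_percent <= 0:
        return False
    if rollout_percent >= 100:
        return True
    return rollout_bucket(flag_name, user_id) < rollout_percent


def next_step(current_percent: int, error_rate: float,
              max_error_rate: float = 0.01,
              steps: tuple = (1, 5, 25, 50, 100)) -> int:
    """Pick the next rollout percentage, or 0 to roll back.

    The threshold and the step ladder are illustrative assumptions,
    not recommendations.
    """
    if error_rate > max_error_rate:
        return 0  # roll back: turn the flag off for everyone
    for step in steps:
        if step > current_percent:
            return step
    return 100  # already fully rolled out
```

Because the bucket is derived from a hash rather than a random draw, a user who sees the feature at 5% still sees it at 25%, which keeps the canary population stable while you watch your dashboards.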

The thread tying all of this together is pretty simple: the big cloud providers are making it easier to ship faster, but the only way that’s a win is if your platform has guardrails. Otherwise you just move faster into the wall.

Show notes below have all the links if you want to dig into the re:Invent announcements, the GKE story, and the staging debate.

📝 Show Notes

In this episode of Ship It Weekly, Brian looks at re:Invent through a platform/SRE lens and pulls out the updates that actually change how you design and run systems.

We talk about regional NAT Gateways and Route 53 Global Resolver on the networking side, ECS Express Mode and EKS Capabilities as new paved roads for app teams, S3 Vectors GA and 50 TB S3 objects for AI and data lakes, Aurora PostgreSQL dynamic data masking, CodeCommit’s return to full GA, and IAM Policy Autopilot for AI-assisted IAM policies. This was recorded mid–re:Invent, so consider it a “what matters so far” pass, not a full recap.

Outside AWS, we get into Google’s 130,000-node GKE cluster and what actually applies if you’re running normal-sized clusters, plus the “It’s time to kill staging” argument and what responsible testing in production looks like with feature flags, progressive delivery, and solid observability.

In the lightning round, we hit Zachary Loeber’s Terraform MCP server and terraform-ingest (letting AI tools speak your real Terraform modules), Runs-On’s EC2 instance rankings so you stop picking instance types by vibes, and Airbnb’s adaptive traffic management for their key-value store. We close with Nolan Lawson’s “The fate of small open source” and what it means when your platform quietly depends on one-maintainer libraries.

Links from this episode:

AWS highlights:

https://aws.amazon.com/about-aws/whats-new/2025/11/aws-nat-gateway-regional-availability

https://aws.amazon.com/blogs/aws/introducing-amazon-route-53-global-resolver-for-secure-anycast-dns-resolution-preview

https://aws.amazon.com/about-aws/whats-new/2025/11/announcing-amazon-ecs-express-mode

https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-s3-vectors-generally-available/

Other topics:

https://cloud.google.com/blog/products/containers-kubernetes/how-we-built-a-130000-node-gke-cluster

https://thenewstack.io/its-time-to-kill-staging-the-case-for-testing-in-production/

https://blog.zacharyloeber.com/article/terraform-custom-module-mcp-server/

https://go.runs-on.com/instances/ranking

https://medium.com/airbnb-engineering/from-static-rate-limiting-to-adaptive-traffic-management-in-airbnbs-key-value-store-29362764e5c2

https://nolanlawson.com/2025/11/16/the-fate-of-small-open-source/