0:07
Hey, I'm Brian and this is Ship It Weekly by
0:11
Tellers Tech. It's re:Invent week, which means
0:14
AWS has firehosed us with announcements. Instead
0:18
of trying to read you the keynote, I want to
0:21
pull out the stuff that actually matters if you
0:23
run platforms, Kubernetes, or CI in the real
0:26
world. So here's the plan for this episode. First,
0:29
we're going to hit the AWS updates that I think
0:32
are worth caring about if you own networking,
0:35
clusters, data, or security. That includes things
0:39
like regional NAT gateways, Route 53 global resolver,
0:44
EKS capabilities, ECS express mode, S3 vectors,
0:50
50 terabyte S3 objects, Aurora dynamic masking,
0:55
CodeCommit coming back from the dead, and IAM
0:58
policy autopilot. Then we will step outside AWS.
1:02
We will talk about Google's 130,000-node GKE
1:06
cluster and what lessons from that actually apply
1:10
if you are just trying to keep your 20-node prod
1:13
cluster sane. After that, we will get into a
1:16
piece called It's Time to Kill Staging and talk
1:19
about what testing in production should and should
1:23
not mean. In the lightning round, we will hit
1:25
a Terraform MCP server that lets AI tools speak
1:29
your Terraform modules, a neat EC2 instance ranking
1:33
tool for right sizing, and an SRE story from
1:37
Airbnb on adaptive traffic management. And then
1:40
we will close with a human story about the fate
1:42
of small open source and what that means for
1:46
all the tiny projects your platform probably
1:49
depends on. All right, let's start with AWS.
1:52
I am not going to cover every AI chip and marketing
1:56
bullet from re:Invent. I want to split this into
1:59
four buckets: networking, compute and platform,
2:02
data, and dev tools plus security. Let's start
2:06
with networking because that is where a lot of
2:09
the quiet pain usually lives. AWS announced regional
2:13
availability mode for NAT gateway. Instead of
2:16
spinning up one NAT per AZ and wiring custom
2:20
routes to each one, you can now create a single
2:23
regional NAT that automatically spans all the
2:27
AZs in your VPC and scales with where your workloads
2:30
actually are. Practically, that means simpler
2:34
route tables, fewer moving parts to keep in sync,
2:37
and a more straightforward story when you talk
2:40
about high availability for private subnets.
2:43
You still need to think about cost and IP space,
2:46
but the model is more one service per region
2:49
rather than a little cluster of pets per AZ.
2:52
On the DNS side, AWS introduced Route 53 Global
2:56
Resolver. This is an Anycast DNS service that
3:00
sits in front of both your public and private
3:03
DNS, and it adds some smarts on top of it. It
3:06
can filter queries to suspicious domains and
3:09
uses algorithmic analysis to detect things like
3:13
DNS tunneling and weird domain generation patterns,
3:16
not just checking whether a domain is on a known bad list.
3:19
There is also an accelerated recovery pattern
3:22
in the docs for managing public DNS records faster
3:26
and more safely. The recent US East 1 DNS pain
3:29
is still fresh in a lot of people's minds. So
3:32
this is a good moment to ask yourself, if Route
3:35
53 has a bad day again, how fast can we move?
3:38
And do we actually know where all of our critical
3:42
DNS records live? So if you own networking, here's
3:46
how I would use these. First, look at where you
3:49
are today with NAT. If you have a mix of per
3:52
AZ NATs, some services still hairpinning through
3:55
old instances, and a bunch of legacy route table
3:58
entries, the regional mode might be a nice forcing
4:01
function to clean that up. Use this as a chance
4:05
to revisit how you allocate IPs and whether you
4:08
can make IPAM and prefix lists do more work
4:12
for you instead of hand-curated CIDR spreadsheets.
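For the show notes, here is a rough sketch of that NAT audit, using only standard boto3 EC2 calls that exist today (describe_nat_gateways, describe_subnets, describe_route_tables). The region is an example and pagination is ignored for brevity; treat it as a starting point, not a finished tool.

```python
"""Rough NAT audit: which NAT gateways exist, where they live, who routes to them."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

# Map subnet -> availability zone so we can tell per-AZ NATs apart.
subnets = {
    s["SubnetId"]: s["AvailabilityZone"]
    for s in ec2.describe_subnets()["Subnets"]
}

# Pagination is ignored here for brevity.
nat_gateways = ec2.describe_nat_gateways()["NatGateways"]
route_tables = ec2.describe_route_tables()["RouteTables"]

for nat in nat_gateways:
    nat_id = nat["NatGatewayId"]
    az = subnets.get(nat.get("SubnetId", ""), "unknown-az")
    # Route tables with at least one route pointing at this NAT gateway.
    referencing = [
        rt["RouteTableId"]
        for rt in route_tables
        if any(r.get("NatGatewayId") == nat_id for r in rt.get("Routes", []))
    ]
    print(f"{nat_id}  vpc={nat.get('VpcId')}  az={az}  state={nat.get('State')}")
    print(f"  route tables: {referencing or 'none'}")
```

If that prints one NAT per AZ plus a pile of route tables nobody remembers owning, that is your cleanup list before you consider the regional mode.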
4:15
Second, treat global resolver as part of your
4:18
threat model, not just a neat new service. If
4:21
you have any compliance or data exfiltration
4:24
concerns, ask how do we want DNS to behave for
4:28
protected environments? And what logs do we need
4:31
out of this to actually detect weird behavior?
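And for the "do we actually know where our critical DNS records live" question, here is a small show-notes sketch that dumps every hosted zone and record set through the standard Route 53 APIs into a CSV. The output format is just an example; the point is having an inventory to diff against on a bad day.

```python
"""Dump an inventory of Route 53 hosted zones and record sets to CSV (sketch)."""
import csv
import sys

import boto3

route53 = boto3.client("route53")

writer = csv.writer(sys.stdout)
writer.writerow(["zone", "record_name", "type", "ttl", "values"])

# Paginate through every hosted zone, public and private.
for zone_page in route53.get_paginator("list_hosted_zones").paginate():
    for zone in zone_page["HostedZones"]:
        record_pages = route53.get_paginator("list_resource_record_sets").paginate(
            HostedZoneId=zone["Id"]
        )
        for record_page in record_pages:
            for rs in record_page["ResourceRecordSets"]:
                values = [r["Value"] for r in rs.get("ResourceRecords", [])]
                if "AliasTarget" in rs:
                    values.append("ALIAS " + rs["AliasTarget"]["DNSName"])
                writer.writerow(
                    [zone["Name"], rs["Name"], rs["Type"], rs.get("TTL", ""), " ".join(values)]
                )
```

Commit that CSV somewhere boring, so the next time DNS has a bad day you are diffing against a known inventory instead of guessing.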
4:34
All right, networking rant over. Let's talk compute
4:37
and platform stuff. On the container side, AWS
4:40
launched Amazon ECS express mode. This is basically
4:44
an easy button for ECS. You point it at a container
4:48
image and it wires up an ECS service, cluster,
4:51
application load balancer, Route 53 records, auto
4:54
scaling, the usual plumbing with production ready
4:57
defaults. You still have access to all of the
5:00
underlying resources if you want to tweak them,
5:02
but the entry path for a new service is much
5:05
simpler. On the Kubernetes side, Amazon EKS capabilities
5:09
went GA. Think of this as a fully managed bundle
5:13
of platform features on top of EKS. It gives
5:16
you Kubernetes native components for workload
5:18
deployment, cloud resource management, and resource
5:22
composition. The idea is AWS runs and patches
5:25
a bunch of the core platform bits, and you interact
5:29
with it using familiar Kubernetes APIs. The story
5:32
here is pretty clear. AWS is trying to give you
5:35
paved paths for app teams. If you are a smaller
5:39
shop or you do not have the people to build your
5:42
own golden path, ECS express mode and EKS capabilities
5:46
are an attractive let AWS worry about more of
5:49
the platform option. If you already have a strong
5:52
platform story, these are still worth watching
5:54
as a reference point for what batteries included
5:57
looks like. Layered on top of that, we have Lambda
6:00
durability functions. These let you write long-running,
6:03
stateful workflows directly as Lambda
6:06
functions. They can checkpoint progress, pause
6:09
for up to a year, resume after failures, and
6:13
you do not have to bolt on your own state machine
6:16
or error handling engine. That overlaps a bit with
6:20
what folks use Step Functions or DIY orchestrators
6:24
for today. I would not rip anything out just
6:26
because durability functions is shiny, but if
6:29
you were about to build a workflow system where
6:32
functions need to wait on AI agents or external
6:35
callbacks, I would at least prototype it this
6:37
way and see if it simplifies your life.
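For the show notes, here is the shape of the DIY version a lot of teams build today, which is what a durable-functions-style feature would replace: a handler that checkpoints each step into DynamoDB so a retry or callback resumes where it left off. The table name and step names are made up for illustration; this is the pattern, not the new Lambda API.

```python
"""Sketch of the DIY checkpoint/resume pattern that durable workflows replace.

Each step writes a checkpoint to DynamoDB, so a retried invocation (or one
woken up by an external callback) skips work it already finished.
Table name and step names are hypothetical.
"""
import boto3

TABLE = boto3.resource("dynamodb").Table("workflow-checkpoints")  # hypothetical table
STEPS = ["fetch_input", "call_model", "write_results"]            # hypothetical steps


def load_checkpoint(workflow_id):
    item = TABLE.get_item(Key={"workflow_id": workflow_id}).get("Item")
    return item["completed_steps"] if item else []


def save_checkpoint(workflow_id, completed_steps):
    TABLE.put_item(Item={"workflow_id": workflow_id, "completed_steps": completed_steps})


def run_step(step, event):
    # Your actual work goes here; this is just a placeholder.
    print(f"running {step} for {event['workflow_id']}")


def handler(event, context):
    workflow_id = event["workflow_id"]
    done = load_checkpoint(workflow_id)

    for step in STEPS:
        if step in done:
            continue  # already finished on a previous invocation
        run_step(step, event)
        done.append(step)
        save_checkpoint(workflow_id, done)  # checkpoint after every step

    return {"workflow_id": workflow_id, "status": "complete"}
```

The pitch for durability functions is that AWS keeps that bookkeeping for you; the prototype question is whether their model fits your steps better than a table like this.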
6:40
Now let's talk data and storage. S3 vectors is now generally
6:44
available with scale bumping up to billions of
6:48
vectors per index and trillions per bucket.
6:51
It's the first time one of the big clouds has
6:53
said, yeah, object storage can natively store
6:56
and query vectors rather than forcing you into
6:58
a separate vector database. The marketing line
7:01
is up to 90% lower cost compared to specialized
7:05
vector stores and tighter integration with Bedrock
7:08
knowledge bases and OpenSearch. This is a pretty
7:11
big deal if you are doing RAG or semantic search.
7:15
You no longer have to manage a completely separate
7:18
database just for embeddings. You can treat vectors
7:21
as another dimension of your S3 data lake. There
7:24
are still plenty of reasons to use a dedicated
7:27
vector store for certain use cases, but for a
7:29
lot of internal tooling, this is going to be
7:32
good enough and way easier to operate. S3 also
7:36
quietly increased the maximum object size from
7:39
5 terabytes to 50 terabytes. That changes how
7:42
you think about backups, big media, and AI training
7:46
data. The days of having to shard every giant
7:49
file into dozens of pieces just to fit it into
7:52
S3 limits are mostly gone now.
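For the show notes, here is what uploading one of those very large objects looks like in practice: the boto3 transfer manager already does multipart uploads, parallelism, and retries, and you mostly tune chunk size and concurrency. Bucket, key, path, and the numbers below are placeholders, not recommendations.

```python
"""Upload a very large file to S3 as a single object (sketch)."""
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MiB
    multipart_chunksize=512 * 1024 * 1024,  # 512 MiB parts keeps part counts low
    max_concurrency=16,                     # parallel part uploads
    use_threads=True,
)

s3.upload_file(
    Filename="/data/training-shard-000.tar",  # example path
    Bucket="example-training-data",           # example bucket
    Key="shards/training-shard-000.tar",
    Config=config,
)
```

The old ceiling came from multipart limits, 10,000 parts at up to 5 gigabytes each, which is exactly what forced the sharding dance; check the current part size and count limits before you plan around a single 50 terabyte object.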
7:55
On the database side, Aurora Postgres picked up dynamic data
7:59
masking using the PG column mask extension. You
8:02
can define policies at the column level so certain
8:06
roles see full values, others see masked values,
8:09
and so on, enforced in the database itself. That
8:12
is interesting if you have BI users, contractors,
8:16
or internal tools that should see some shape
8:19
of the data, but not raw PII. It also gives you
8:23
another tool for compliance stories where keeping
8:26
masking close to the data is a plus. Just remember
8:30
masking is not encryption and it does not replace
8:33
good role design or auditing. Finally, dev tools
8:37
and security. AWS officially walked back the
8:40
we are de-emphasizing CodeCommit thing. CodeCommit
8:43
is back to full general availability with
8:46
AWS saying clearly that customers still want
8:50
a fully managed Git service that lives inside
8:53
their AWS estate. If you are in a heavily regulated
8:56
environment or you just like having repos inside
8:59
the same blast radius as everything else, that
9:02
is probably a relief. It also raises some awkward
9:04
questions for teams that did a big migration
9:07
off of CodeCommit after the original deprecation
9:10
plan. There is a meta lesson here about how much
9:14
you want to depend on any vendor's this service
9:17
is here forever statement. And then there is
9:20
IAM policy autopilot. This is a new open source
9:24
MCP server that reads your application code and
9:28
helps generate IAM policies that match what you
9:31
are actually doing, instead of star everything
9:35
in hope. It is designed to plug into AI coding
9:38
tools so they can propose least-privilege policies
9:42
as part of your workflow. On one hand, this is
9:45
fantastic. Writing good IAM is tedious and anything
9:48
that helps teams stop shipping wildly overly
9:52
permissive policies is welcome. On the other
9:55
hand, this is one of those great power, great
9:57
responsibility things. I would absolutely run
10:01
its output through human review and test, and
10:04
I would be very careful about letting a model
10:07
both propose and apply policies without a person
10:11
in the loop.
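For the show notes, one cheap gate to add to that loop: run any generated policy through IAM Access Analyzer's policy validation before a human even reviews it. The policy document below is just an example; in practice it would be whatever the tooling proposed.

```python
"""Lint a generated IAM policy with IAM Access Analyzer before human review (sketch)."""
import json

import boto3

analyzer = boto3.client("accessanalyzer")

# Example policy document -- in practice, whatever your AI tooling proposed.
candidate_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

response = analyzer.validate_policy(
    policyDocument=json.dumps(candidate_policy),
    policyType="IDENTITY_POLICY",
)

for finding in response.get("findings", []):
    # Findings cover errors, security warnings, and general suggestions.
    print(f"{finding.get('findingType')}: {finding.get('findingDetails')}")
```

It does not replace a reviewer, but it catches the obvious nonsense before a person has to look at it.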
10:13
So if you zoom out, what are the big AWS themes for platform folks this year?
10:17
Networking and DNS get simpler and a bit smarter.
10:20
Containers and Kubernetes get more paved roads.
10:24
Data and AI workloads move closer to S3. Dev
10:28
tools get more tightly integrated with IAM and
10:31
AI. The work for you is deciding whether you
10:35
want to lean in and let AWS carry more of the
10:38
platform and where you still want to keep your
10:41
own opinionated stack. There's still more re:Invent
10:44
to go. So if anything huge drops after this recording,
10:48
we'll pick it up in a future episode. All right,
10:50
let's step out of Las Vegas and switch gears
10:53
and talk about Google for a minute. Google published
10:56
a blog about how they built a 130,000-node GKE
11:01
cluster. This is experimental, not a new default
11:04
limit, but it is still a wild number. The officially
11:08
supported limit today is 65,000 nodes per cluster.
11:11
So they basically doubled that for this project.
11:14
The post talks about demand for massive AI and
11:17
batch workloads. Think training or serving large
11:21
models, very large scale simulations, things
11:25
where packing as much work as possible into a
11:28
single control plane has operational benefits.
11:31
They had to do a bunch of architectural work
11:33
to make this even remotely practical. Things
11:36
like sharding control plane traffic and carefully
11:39
tuning API server scaling so you're not just
11:43
DDoSing your own Kubernetes API. Being very deliberate
11:46
about how many objects live in etcd since you
11:49
are easily into the millions of pods and other
11:52
resources. Using job-oriented tooling like Kueue
11:54
to manage scheduling and fairness. So one noisy
11:58
job does not starve everything else. Here's the
12:01
thing though. Most of us are never going to run
12:03
a 130,000-node cluster, and that is fine. The
12:07
real lessons that I think are useful at a normal
12:09
scale are control plane capacity is a thing you
12:12
should care about. Even at 100 nodes, you can
12:15
run into API throttling or controller backlogs
12:18
during deploys or incident storms. Seeing Google
12:21
talk about their control plane SLOs at this scale
12:25
is a nice reminder that we should probably have
12:28
some for our smaller clusters too. Job and workload
12:32
management matters. Whether you have five jobs
12:35
or 5,000, being explicit about priorities, quotas,
12:39
and fairness is the difference between prod is
12:42
fine during big experiments and someone kicked
12:45
off a batch job and our customer traffic died.
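For the show notes, the small-cluster version of that can be as simple as a ResourceQuota on the namespace your batch jobs run in, so one experiment cannot eat the whole cluster. A minimal sketch with the official Kubernetes Python client; the namespace name and numbers are placeholders.

```python
"""Cap what a batch namespace can request so experiments cannot starve prod (sketch)."""
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="batch-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "40",       # placeholder limits
            "requests.memory": "160Gi",
            "pods": "200",
        }
    ),
)

core.create_namespaced_resource_quota(namespace="batch-experiments", body=quota)
```

Pair that with PriorityClasses for your latency-sensitive workloads and you have a budget version of the fairness story Google needs Kueue for at their scale.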
12:48
One cluster is not always better. The temptation
12:51
with fancy scale numbers is maybe we should consolidate
12:55
everything into one giant cluster. For most orgs,
12:58
blast radius, compliance, and team boundaries are
13:01
better served by multiple smaller clusters. Even
13:04
if that means a bit more overhead in the tooling.
13:08
So my recommendation here is not to go chase 130,000
13:11
nodes. It is to steal the thinking: read the
13:14
post, look at how they reason about control plane
13:17
scaling and scheduling, and then ask what the
13:20
equivalent version of that would look like at
13:22
your scale. All right, let's talk about staging
13:25
environments. There is a New Stack article making
13:28
the rounds called It's Time to Kill Staging:
13:31
The Case for Testing in Production. The short
13:34
version is that staging environments are slow,
13:37
expensive, and often lie to you. And more teams
13:40
should lean into testing directly in production
13:43
with the right safety rails. I have mixed feelings,
13:46
which probably means the piece is doing its job.
13:49
On the one hand, a lot of us have worked in places
13:52
where staging is a bottleneck. 50 developers
13:56
all merging into a shared staging cluster that
13:59
does not really look like prod, then sitting
14:01
in a queue waiting for a staging sign-off that
14:05
is mostly vibes. When that staging environment
14:07
inevitably diverges from reality, you waste time
14:11
debugging issues that would never happen in prod,
14:14
and you miss issues that only show up under real
14:17
traffic patterns. On the other hand, test in
14:20
production without guardrails is just break production.
14:23
I think the healthy middle ground looks like
14:25
this. You treat staging as limited and cheap,
14:28
not sacred. Use it for fast feedback on basic
14:32
integration, maybe some performance smoke tests,
14:35
but do not pretend it is a perfect mirror. You
14:38
keep it simple enough that it is not its own
14:41
full-time job to maintain. Then you build serious
14:45
safety mechanisms into production. Feature flags
14:48
so you can roll changes out to 1% of traffic
14:51
or only to internal users or only to specific
14:55
regions. Progressive delivery so you can ramp
14:58
traffic up and down based on real SLOs. Not just
15:02
it seems fine. Shadow traffic or replay so you
15:06
can feed realistic requests into new versions
15:09
without exposing users yet. Good observability
15:13
and alerting so you know if the experiment is
15:16
hurting real people.
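For the show notes, the feature flag piece does not have to start with a vendor: the core of a percentage rollout is a stable hash of some identity plus a dial you can turn. A toy sketch only; flag name, allowlist, and percentages are made up, and a real flag system adds kill switches and audit trails.

```python
"""Toy percentage rollout: stable per-user bucketing plus an internal-users override."""
import hashlib

ROLLOUT_PERCENT = 1  # dial this up as confidence grows
INTERNAL_USERS = {"alice@example.com", "bob@example.com"}  # example allowlist


def bucket(user_id: str, flag: str) -> int:
    """Deterministically map (user, flag) to 0-99 so ramps are sticky per user."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def new_checkout_enabled(user_id: str) -> bool:
    if user_id in INTERNAL_USERS:
        return True  # dogfood internally before any external traffic
    return bucket(user_id, "new-checkout") < ROLLOUT_PERCENT


if __name__ == "__main__":
    for u in ["alice@example.com", "user-123", "user-456", "user-789"]:
        print(u, new_checkout_enabled(u))
```

The important part is the operational wrapper: who can turn the dial, how fast you can turn it to zero, and which SLOs tell you to.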
15:18
The article's core point is good. The only environment that exactly behaves
15:21
like production is production. So if you want
15:24
to be confident, you need to be able to experiment
15:26
there. Just do it intentionally. So if you are
15:29
listening to this and you own a platform, here
15:32
is a question to take back. If someone on your
15:35
team said, we want to turn off staging in six
15:38
months, what would you need in place in production
15:41
to feel safe? List those things. That is probably
15:44
your roadmap, whether or not you actually kill
15:47
staging. All right, let's hit a lightning round.
15:50
First quick one, Zachary Lober wrote a Terraform
15:52
custom module MCP server and released a project
15:56
called Terraform Ingest. It is a CLI and MCP
16:01
server that crawls your Terraform module repos,
16:05
summarizes them, and exposes that to AI tools
16:08
so they can understand your existing modules
16:11
instead of hallucinating new ones. This is exactly
16:14
the direction I expect a lot of teams to go.
16:17
Instead of asking a model, write me some random
16:20
terraform. You point it at your real modules
16:23
and you say, compose with these building blocks.
16:26
If you care about standardization and avoiding
16:29
weird snowflake stacks, this kind of pattern
16:32
is worth watching.
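For the show notes, a toy sketch of the crawling half of that idea: walk a modules directory, pull out variable names and descriptions, and emit a JSON summary you could hand to an AI tool, over MCP or otherwise. This is not the Terraform Ingest project itself, just the pattern, and the regex parsing is deliberately crude.

```python
"""Crude summary of local Terraform modules for AI tooling (sketch).

Walks a modules directory, extracts variable names and descriptions from
*.tf files with a rough regex, and prints a JSON summary.
"""
import json
import pathlib
import re

MODULES_DIR = pathlib.Path("modules")  # example layout: modules/<name>/*.tf

VARIABLE_BLOCK = re.compile(r'variable\s+"([^"]+)"\s*{(.*?)\n}', re.DOTALL)
DESCRIPTION = re.compile(r'description\s*=\s*"([^"]*)"')

summary = {}
for module_dir in sorted(p for p in MODULES_DIR.iterdir() if p.is_dir()):
    variables = {}
    for tf_file in module_dir.glob("*.tf"):
        text = tf_file.read_text()
        for name, body in VARIABLE_BLOCK.findall(text):
            desc = DESCRIPTION.search(body)
            variables[name] = desc.group(1) if desc else ""
    summary[module_dir.name] = {"variables": variables}

print(json.dumps(summary, indent=2))
```

A real implementation would parse HCL properly and include outputs and examples, but even this level of summary beats asking a model to guess what your modules look like.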
16:34
Next, a small but very useful tool: Runzons EC2 Instance Family Ranking. It
16:39
is a page that ranks EC2 families by PassMark
16:43
CPU performance, split across x86 and ARM, and
16:47
it lets you dig into detailed benchmarks and
16:50
even pricing via their API. If you ever picked
16:53
an instance type purely by habit, this is the
16:56
antidote. Before you copy-paste m5.large again,
16:59
you can check where M7g or M8a sit on the performance
17:03
curve and what that means for your workloads.
17:07
It is a nice way to bring a little data into
17:10
those instance choice conversations without running
17:13
your own benchmark suite. And the last lightning
17:15
item, an SRE story from Airbnb. They published
17:19
From Static Rate Limiting to Adaptive Traffic
17:23
Management in Airbnb's Key Value Store. It is
17:26
about their key-value store, Mussel, and how they
17:29
evolved static QPS limits into a more adaptive
17:32
system that looks at short-term latency relative
17:36
to long-term baseline and adjusts limits dynamically.
17:39
The interesting idea here is using ratios like
17:42
current P95 over trailing P95 as a signal that
17:46
the system is under stress, so the rate limiter
17:49
can react before things really fall over. Even if
17:52
you do not copy their exact design, it is a nice
17:55
example of moving beyond fixed per customer limits
17:58
into something that responds to real conditions.
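For the show notes, here is the kernel of that idea in a few lines: compare a short-window p95 to a long-window baseline and scale the allowed rate down as the ratio climbs. Airbnb's real system is far more involved; the thresholds, windows, and floor here are made-up illustrations.

```python
"""Kernel of an adaptive rate limit driven by a latency ratio (sketch)."""
from collections import deque
from statistics import quantiles


class AdaptiveLimit:
    def __init__(self, base_qps=1000, short_window=200, long_window=5000):
        self.base_qps = base_qps
        self.short = deque(maxlen=short_window)  # recent request latencies (ms)
        self.long = deque(maxlen=long_window)    # trailing baseline latencies (ms)

    def record(self, latency_ms: float) -> None:
        self.short.append(latency_ms)
        self.long.append(latency_ms)

    def _p95(self, samples) -> float:
        return quantiles(samples, n=20)[-1]  # 95th percentile cut point

    def current_limit(self) -> float:
        if len(self.short) < 50 or len(self.long) < 500:
            return self.base_qps  # not enough signal yet, fall back to the static limit
        ratio = self._p95(self.short) / max(self._p95(self.long), 1e-6)
        if ratio <= 1.2:
            return self.base_qps  # healthy: short-term p95 is near the baseline
        # Shed load as short-term latency pulls away from baseline,
        # but never drop below 10% of the static limit.
        return max(0.1 * self.base_qps, self.base_qps / ratio)
```

The point is not these exact numbers; it is that the limit reacts to how the store is actually doing instead of waiting for a human to re-tune a static QPS value.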
18:02
All right, let's close it with a human story.
18:04
Nolan Lawson wrote a piece called The Fate of
18:07
Small Open Source. It is about tiny libraries
18:11
and tools and the stuff that might be a few hundred
18:14
lines of code that still get millions of downloads
18:17
and quietly sit at the bottom of everyone's dependency
18:20
graph. He talks about one of his own packages
18:23
that has been around for about a decade, still
18:26
getting 5 million downloads per week and how
18:28
LLMs change the equation. If you can just ask
18:32
a model to spin out a custom helper function,
18:36
do you really need another dependency? And if
18:38
people do keep using these tiny libraries, what
18:41
does that mean for the one maintainer handling
18:44
issues and security reports for free? I think
18:46
this hits home for platform teams in two ways.
18:49
First, look at your own tooling. Terraform providers,
18:52
little CLI helpers, internal scripts, custom
18:56
controllers. A lot of that probably rests on
18:59
one or two small open source projects that someone
19:02
wrote on nights and weekends. Terrascan being
19:05
archived recently is a good reminder that tools
19:07
you depend on can go away when the incentives
19:10
for the maintainers shift. Second, we are starting
19:12
to see companies say, why add a new dependency
19:15
when an LLM can just generate the five lines
19:18
of code we need? That might reduce supply chain
19:21
risk in some places, but it also raises questions
19:24
about how new utility libraries get created and
19:28
maintained in the first place. So what do you
19:31
do with that? Practically, I would make a list
19:33
of small critical dependencies in your platform.
19:36
Things where, if the repo went read-only tomorrow,
19:40
you would be in trouble. Ask what your plan B
19:42
is. Could you fork it, vendor it, or replace it
19:46
if needed? Maybe consider sponsoring a few of
19:48
these maintainers. Even a small amount can make
19:51
a difference and is a good signal to the rest
19:53
of your org that this stuff matters. Think twice
19:57
before outsourcing important functionality to
19:59
a single tiny project without at least acknowledging
20:03
the risk. And when you are tempted to say the
20:06
AI can just generate this, maybe also think about
20:09
whether that code will need to be maintained,
20:12
audited, and shared across teams in the future.
20:15
Sometimes the boring little library with tests
20:17
and a maintainer is still the better choice.
20:20
All right. That is it for this episode of Ship
20:23
It Weekly. We walked through the AWS re:Invent
20:26
updates that actually matter for platform teams,
20:29
from regional NAT gateways and Route 53 global
20:33
resolver to ECS express mode, EKS capabilities,
20:38
S3 vectors, 50 terabyte objects, Aurora dynamic
20:42
masking, CodeCommit's return to GA, and IAM policy
20:47
autopilot. We looked at Google's 130,000-node
20:51
GKE cluster and used it as a lens on control
20:55
plane scaling and cluster design at more normal
20:58
sizes. We dug into the kill-staging, test-in-production
21:02
argument and how to make that safe with feature
21:05
flags and progressive delivery. In the lightning
21:09
round, we talked about Terraform MCP servers
21:12
for module-aware AI, EC2 instance rankings that
21:16
help you right-size with data instead of vibes,
21:20
and Airbnb's adaptive traffic management for
21:24
their key value store. And we wrapped with Nolan
21:27
Lawson's piece on the fate of small open source,
21:30
and what that means for all the tiny projects
21:32
your platform silently leans on. I will put all
21:35
of the links we talked about in the show notes.
21:37
I am Brian. This is Ship It Weekly by Tellers
21:40
Tech. Thanks for hanging out and I'll see you
21:42
in the next one.