0:07
Hey, I'm Brian and this is Ship It Weekly by
0:11
Tellers Tech. It's re:Invent week, which means
0:14
AWS has firehosed us with announcements. Instead
0:18
of trying to read you the keynote, I want to
0:21
pull out the stuff that actually matters if you
0:23
run platforms, Kubernetes, or CI in the real
0:26
world. So here's the plan for this episode. First,
0:29
we're going to hit the AWS updates that I think
0:32
are worth caring about if you own networking,
0:35
clusters, data, or security. That includes things
0:39
like regional NAT gateways, Route 53 global resolver,
0:44
EKS capabilities, ECS express mode, S3 vectors,
0:50
50 terabyte S3 objects, Aurora dynamic masking,
0:55
CodeCommit coming back from the dead, and IAM
0:58
policy autopilot. Then we will step outside AWS.
1:02
We will talk about Google's 130,000-node GKE
1:06
cluster and what lessons from that actually apply
1:10
if you are just trying to keep your 20-node prod
1:13
cluster sane. After that, we will get into a
1:16
piece called It's Time to Kill Staging and talk
1:19
about what testing in production should and should
1:23
not mean. In the lightning round, we will hit
1:25
a Terraform MCP server that lets AI tools speak
1:29
your Terraform modules, a neat EC2 instance ranking
1:33
tool for right sizing, and an SRE story from
1:37
Airbnb on adaptive traffic management. And then
1:40
we will close with a human story about the fate
1:42
of small open source and what that means for
1:46
all the tiny projects your platform probably
1:49
depends on. All right, let's start with AWS.
1:52
I am not going to cover every AI chip and marketing
1:56
bullet from re:Invent. I want to split this into
1:59
four buckets: networking, compute and platform,
2:02
data, and dev tools plus security. Let's start
2:06
with networking because that is where a lot of
2:09
the quiet pain usually lives. AWS announced regional
2:13
availability mode for NAT gateway. Instead of
2:16
spinning up one NAT per AZ and wiring custom
2:20
routes to each one, you can now create a single
2:23
regional NAT that automatically spans all the
2:27
AZs in your VPC and scales with where your workloads
2:30
actually are. Practically, that means simpler
2:34
route tables, fewer moving parts to keep in sync,
2:37
and a more straightforward story when you talk
2:40
about high availability for private subnets.
2:43
You still need to think about cost and IP space,
2:46
but the model is more one service per region
2:49
rather than a little cluster of pets per AZ.
2:52
On the DNS side, AWS introduced Route 53 Global
2:56
Resolver. This is an Anycast DNS service that
3:00
sits in front of both your public and private
3:03
DNS, and it adds some smarts on top of it. It
3:06
can filter queries to suspicious domains and
3:09
uses algorithmic analysis to detect things like
3:13
DNS tunneling and weird domain generation patterns,
3:16
not just checking whether a domain is on a known bad list.
3:19
There is also an accelerated recovery pattern
3:22
in the docs for managing public DNS records faster
3:26
and more safely. The recent US East 1 DNS pain
3:29
is still fresh in a lot of people's minds. So
3:32
this is a good moment to ask yourself, if Route
3:35
53 has a bad day again, how fast can we move?
3:38
And do we actually know where all of our critical
3:42
DNS records live? So if you own networking, here's
3:46
how I would use these. First, look at where you
3:49
are today with NAT. If you have a mix of per
3:52
AZ NATs, some services still hairpinning through
3:55
old instances, and a bunch of legacy route table
3:58
entries, the regional mode might be a nice forcing
4:01
function to clean that up. Use this as a chance
4:05
to revisit how you allocate IPs and whether you
4:08
can make IPAM and prefix lists do more work
4:12
for you instead of hand-curated CIDR spreadsheets.
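For the show notes, here is a rough sketch of that NAT audit, using only standard boto3 EC2 calls that exist today (describe_nat_gateways, describe_subnets, describe_route_tables). The region is an example and pagination is ignored for brevity; treat it as a starting point, not a finished tool.

```python
"""Rough NAT audit: which NAT gateways exist, where they live, who routes to them."""
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # example region

# Map subnet -> availability zone so we can tell per-AZ NATs apart.
subnets = {
    s["SubnetId"]: s["AvailabilityZone"]
    for s in ec2.describe_subnets()["Subnets"]
}

# Pagination is ignored here for brevity.
nat_gateways = ec2.describe_nat_gateways()["NatGateways"]
route_tables = ec2.describe_route_tables()["RouteTables"]

for nat in nat_gateways:
    nat_id = nat["NatGatewayId"]
    az = subnets.get(nat.get("SubnetId", ""), "unknown-az")
    # Route tables with at least one route pointing at this NAT gateway.
    referencing = [
        rt["RouteTableId"]
        for rt in route_tables
        if any(r.get("NatGatewayId") == nat_id for r in rt.get("Routes", []))
    ]
    print(f"{nat_id}  vpc={nat.get('VpcId')}  az={az}  state={nat.get('State')}")
    print(f"  route tables: {referencing or 'none'}")
```

If that prints one NAT per AZ plus a pile of route tables nobody remembers owning, that is your cleanup list before you consider the regional mode.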
4:15
Second, treat global resolver as part of your
4:18
threat model, not just a neat new service. If
4:21
you have any compliance or data exfiltration
4:24
concerns, ask how do we want DNS to behave for
4:28
protected environments? And what logs do we need
4:31
out of this to actually detect weird behavior?
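And for the "do we actually know where our critical DNS records live" question, here is a small show-notes sketch that dumps every hosted zone and record set through the standard Route 53 APIs into a CSV. The output format is just an example; the point is having an inventory to diff against on a bad day.

```python
"""Dump an inventory of Route 53 hosted zones and record sets to CSV (sketch)."""
import csv
import sys

import boto3

route53 = boto3.client("route53")

writer = csv.writer(sys.stdout)
writer.writerow(["zone", "record_name", "type", "ttl", "values"])

# Paginate through every hosted zone, public and private.
for zone_page in route53.get_paginator("list_hosted_zones").paginate():
    for zone in zone_page["HostedZones"]:
        record_pages = route53.get_paginator("list_resource_record_sets").paginate(
            HostedZoneId=zone["Id"]
        )
        for record_page in record_pages:
            for rs in record_page["ResourceRecordSets"]:
                values = [r["Value"] for r in rs.get("ResourceRecords", [])]
                if "AliasTarget" in rs:
                    values.append("ALIAS " + rs["AliasTarget"]["DNSName"])
                writer.writerow(
                    [zone["Name"], rs["Name"], rs["Type"], rs.get("TTL", ""), " ".join(values)]
                )
```

Commit that CSV somewhere boring, so the next time DNS has a bad day you are diffing against a known inventory instead of guessing.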
4:34
All right, networking rant over. Let's talk compute
4:37
and platform stuff. On the container side, AWS
4:40
launched Amazon ECS express mode. This is basically
4:44
an easy button for ECS. You point it at a container
4:48
image and it wires up an ECS service, cluster,
4:51
application load balancer, Route 53 records, auto
4:54
scaling, the usual plumbing with production ready
4:57
defaults. You still have access to all of the
5:00
underlying resources if you want to tweak them,
5:02
but the entry path for a new service is much
5:05
simpler. On the Kubernetes side, Amazon EKS capabilities
5:09
went GA. Think of this as a fully managed bundle
5:13
of platform features on top of EKS. It gives
5:16
you Kubernetes native components for workload
5:18
deployment, cloud resource management, and resource
5:22
composition. The idea is AWS runs and patches
5:25
a bunch of the core platform bits, and you interact
5:29
with it using familiar Kubernetes APIs. The story
5:32
here is pretty clear. AWS is trying to give you
5:35
paved paths for app teams. If you are a smaller
5:39
shop or you do not have the people to build your
5:42
own golden path, ECS express mode and EKS capabilities
5:46
are an attractive let AWS worry about more of
5:49
the platform option. If you already have a strong
5:52
platform story, these are still worth watching
5:54
as a reference point for what batteries included
5:57
looks like. Layered on top of that, we have Lambda
6:00
durability functions. These let you write long-running,
6:03
stateful workflows directly as Lambda
6:06
functions. They can checkpoint progress, pause
6:09
for up to a year, resume after failures, and
6:13
you do not have to bolt on your own state machine
6:16
or error handling engine. That overlaps a bit with
6:20
what folks use Step Functions or DIY orchestrators
6:24
for today. I would not rip anything out just
6:26
because durability functions is shiny, but if
6:29
you were about to build a workflow system where
6:32
functions need to wait on AI agents or external
6:35
callbacks, I would at least prototype it this
6:37
way and see if it simplifies your life.
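For the show notes, here is the shape of the DIY version a lot of teams build today, which is what a durable-functions-style feature would replace: a handler that checkpoints each step into DynamoDB so a retry or callback resumes where it left off. The table name and step names are made up for illustration; this is the pattern, not the new Lambda API.

```python
"""Sketch of the DIY checkpoint/resume pattern that durable workflows replace.

Each step writes a checkpoint to DynamoDB, so a retried invocation (or one
woken up by an external callback) skips work it already finished.
Table name and step names are hypothetical.
"""
import boto3

TABLE = boto3.resource("dynamodb").Table("workflow-checkpoints")  # hypothetical table
STEPS = ["fetch_input", "call_model", "write_results"]            # hypothetical steps


def load_checkpoint(workflow_id):
    item = TABLE.get_item(Key={"workflow_id": workflow_id}).get("Item")
    return item["completed_steps"] if item else []


def save_checkpoint(workflow_id, completed_steps):
    TABLE.put_item(Item={"workflow_id": workflow_id, "completed_steps": completed_steps})


def run_step(step, event):
    # Your actual work goes here; this is just a placeholder.
    print(f"running {step} for {event['workflow_id']}")


def handler(event, context):
    workflow_id = event["workflow_id"]
    done = load_checkpoint(workflow_id)

    for step in STEPS:
        if step in done:
            continue  # already finished on a previous invocation
        run_step(step, event)
        done.append(step)
        save_checkpoint(workflow_id, done)  # checkpoint after every step

    return {"workflow_id": workflow_id, "status": "complete"}
```

The pitch for durability functions is that AWS keeps that bookkeeping for you; the prototype question is whether their model fits your steps better than a table like this.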
6:40
Now let's talk data and storage. S3 vectors is now generally
6:44
available with scale bumping up to billions of
6:48
vectors per index and trillions per bucket.
6:51
It's the first time one of the big clouds has
6:53
said, yeah, object storage can natively store
6:56
and query vectors rather than forcing you into
6:58
a separate vector database. The marketing line
7:01
is up to 90% lower cost compared to specialized
7:05
vector stores and tighter integration with Bedrock
7:08
knowledge bases and OpenSearch. This is a pretty
7:11
big deal if you are doing RAG or semantic search.
7:15
You no longer have to manage a completely separate
7:18
database just for embeddings. You can treat vectors
7:21
as another dimension of your S3 data lake. There
7:24
are still plenty of reasons to use a dedicated
7:27
vector store for certain use cases, but for a
7:29
lot of internal tooling, this is going to be
7:32
good enough and way easier to operate. S3 also
7:36
quietly increased the maximum object size from
7:39
5 terabytes to 50 terabytes. That changes how
7:42
you think about backups, big media, and AI training
7:46
data. The days of having to shard every giant
7:49
file into dozens of pieces just to fit it into
7:52
S3 limits are mostly gone now.
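For the show notes, here is what uploading one of those very large objects looks like in practice: the boto3 transfer manager already does multipart uploads, parallelism, and retries, and you mostly tune chunk size and concurrency. Bucket, key, path, and the numbers below are placeholders, not recommendations.

```python
"""Upload a very large file to S3 as a single object (sketch)."""
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,   # switch to multipart above 64 MiB
    multipart_chunksize=512 * 1024 * 1024,  # 512 MiB parts keeps part counts low
    max_concurrency=16,                     # parallel part uploads
    use_threads=True,
)

s3.upload_file(
    Filename="/data/training-shard-000.tar",  # example path
    Bucket="example-training-data",           # example bucket
    Key="shards/training-shard-000.tar",
    Config=config,
)
```

The old ceiling came from multipart limits, 10,000 parts at up to 5 gigabytes each, which is exactly what forced the sharding dance; check the current part size and count limits before you plan around a single 50 terabyte object.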
7:55
On the database side, Aurora Postgres picked up dynamic data
7:59
masking using the PG column mask extension. You
8:02
can define policies at the column level so certain
8:06
roles see full values, others see masked values,
8:09
and so on, enforced in the database itself. That
8:12
is interesting if you have BI users, contractors,
8:16
or internal tools that should see some shape
8:19
of the data, but not raw PII. It also gives you
8:23
another tool for compliance stories where keeping
8:26
masking close to the data is a plus. Just remember
8:30
masking is not encryption and it does not replace
8:33
good role design or auditing. Finally, dev tools
8:37
and security. AWS officially walked back the
8:40
we are de-emphasizing CodeCommit thing. CodeCommit
8:43
is back to full general availability with
8:46
AWS saying clearly that customers still want
8:50
a fully managed Git service that lives inside
8:53
their AWS estate. If you are in a heavily regulated
8:56
environment or you just like having repos inside
8:59
the same blast radius as everything else, that
9:02
is probably a relief. It also raises some awkward
9:04
questions for teams that did a big migration
9:07
off of CodeCommit after the original deprecation
9:10
plan. There is a meta lesson here about how much
9:14
you want to depend on any vendor's this service
9:17
is here forever statement. And then there is
9:20
IAM policy autopilot. This is a new open source
9:24
MCP server that reads your application code and
9:28
helps generate IAM policies that match what you
9:31
are actually doing, instead of star everything
9:35
in hope. It is designed to plug into AI coding
9:38
tools so they can propose least-privilege policies
9:42
as part of your workflow. On one hand, this is
9:45
fantastic. Writing good IAM is tedious and anything
9:48
that helps teams stop shipping wildly overly
9:52
permissive policies is welcome. On the other
9:55
hand, this is one of those great power, great
9:57
responsibility things. I would absolutely run
10:01
its output through human review and test, and
10:04
I would be very careful about letting a model
10:07
both propose and apply policies without a person
10:11
in the loop.
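For the show notes, one cheap gate to add to that loop: run any generated policy through IAM Access Analyzer's policy validation before a human even reviews it. The policy document below is just an example; in practice it would be whatever the tooling proposed.

```python
"""Lint a generated IAM policy with IAM Access Analyzer before human review (sketch)."""
import json

import boto3

analyzer = boto3.client("accessanalyzer")

# Example policy document -- in practice, whatever your AI tooling proposed.
candidate_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            "Resource": "arn:aws:s3:::example-bucket/*",
        }
    ],
}

response = analyzer.validate_policy(
    policyDocument=json.dumps(candidate_policy),
    policyType="IDENTITY_POLICY",
)

for finding in response.get("findings", []):
    # Findings cover errors, security warnings, and general suggestions.
    print(f"{finding.get('findingType')}: {finding.get('findingDetails')}")
```

It does not replace a reviewer, but it catches the obvious nonsense before a person has to look at it.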
10:13
So if you zoom out, what are the big AWS themes for platform folks this year?
10:17
Networking and DNS get simpler and a bit smarter.
10:20
Containers and Kubernetes get more paved roads.
10:24
Data and AI workloads move closer to S3. Dev
10:28
tools get more tightly integrated with IAM and
10:31
AI. The work for you is deciding whether you
10:35
want to lean in and let AWS carry more of the
10:38
platform and where you still want to keep your
10:41
own opinionated stack. There's still more re:Invent
10:44
to go. So if anything huge drops after this recording,
10:48
we'll pick it up in a future episode. All right,
10:50
let's step out of Las Vegas and switch gears
10:53
and talk about Google for a minute. Google published
10:56
a blog about how they built a 130,000-node GKE
11:01
cluster. This is experimental, not a new default
11:04
limit, but it is still a wild number. The officially
11:08
supported limit today is 65,000 nodes per cluster.
11:11
So they basically doubled that for this project.
11:14
The post talks about demand for massive AI and
11:17
batch workloads. Think training or serving large
11:21
models, very large scale simulations, things
11:25
where packing as much work as possible into a
11:28
single control plane has operational benefits.
11:31
They had to do a bunch of architectural work
11:33
to make this even remotely practical. Things
11:36
like sharding control plane traffic and carefully
11:39
tuning API server scaling so you're not just
11:43
DDoSing your own Kubernetes API. Being very deliberate
11:46
about how many objects live in etcd since you
11:49
are easily into the millions of pods and other
11:52
resources. Using job-oriented tooling like Kueue
11:54
to manage scheduling and fairness. So one noisy
11:58
job does not starve everything else. Here's the
12:01
thing though. Most of us are never going to run
12:03
a 130,000-node cluster, and that is fine. The
12:07
real lessons that I think are useful at a normal
12:09
scale are control plane capacity is a thing you
12:12
should care about. Even at 100 nodes, you can
12:15
run into API throttling or controller backlogs
12:18
during deploys or incident storms. Seeing Google
12:21
talk about their control plane SLOs at this scale
12:25
is a nice reminder that we should probably have
12:28
some for our smaller clusters too. Job and workload
12:32
management matters. Whether you have five jobs
12:35
or 5,000, being explicit about priorities, quotas,
12:39
and fairness is the difference between prod is
12:42
fine during big experiments and someone kicked
12:45
off a batch job and our customer traffic died.
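For the show notes, the small-cluster version of that can be as simple as a ResourceQuota on the namespace your batch jobs run in, so one experiment cannot eat the whole cluster. A minimal sketch with the official Kubernetes Python client; the namespace name and numbers are placeholders.

```python
"""Cap what a batch namespace can request so experiments cannot starve prod (sketch)."""
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running in-cluster
core = client.CoreV1Api()

quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="batch-quota"),
    spec=client.V1ResourceQuotaSpec(
        hard={
            "requests.cpu": "40",       # placeholder limits
            "requests.memory": "160Gi",
            "pods": "200",
        }
    ),
)

core.create_namespaced_resource_quota(namespace="batch-experiments", body=quota)
```

Pair that with PriorityClasses for your latency-sensitive workloads and you have a budget version of the fairness story Google needs Kueue for at their scale.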
12:48
One cluster is not always better. The temptation
12:51
with fancy scale numbers is maybe we should consolidate
12:55
everything into one giant cluster. For most orgs,
12:58
blast radius, compliance, and team boundaries are
13:01
better served by multiple smaller clusters. Even
13:04
if that means a bit more overhead in the tooling.
13:08
So my recommendation here is not to go chase 130,000
13:11
nodes. It is to steal the thinking: read the
13:14
post, look at how they reason about control plane
13:17
scaling and scheduling, and then ask what the
13:20
equivalent version of that would look like at
13:22
your scale. All right, let's talk about staging
13:25
environments. There is a New Stack article making
13:28
the rounds called It's Time to Kill Staging:
13:31
The Case for Testing in Production. The short
13:34
version is that staging environments are slow,
13:37
expensive, and often lie to you. And more teams
13:40
should lean into testing directly in production
13:43
with the right safety rails. I have mixed feelings,
13:46
which probably means the piece is doing its job.
13:49
On the one hand, a lot of us have worked in places
13:52
where staging is a bottleneck. 50 developers
13:56
all merging into a shared staging cluster that
13:59
does not really look like prod, then sitting
14:01
in a queue waiting for a staging sign-off that
14:05
is mostly vibes. When that staging environment
14:07
inevitably diverges from reality, you waste time
14:11
debugging issues that would never happen in prod,
14:14
and you miss issues that only show up under real
14:17
traffic patterns. On the other hand, test in
14:20
production without guardrails is just break production.
14:23
I think the healthy middle ground looks like
14:25
this. You treat staging as limited and cheap,
14:28
not sacred. Use it for fast feedback on basic
14:32
integration, maybe some performance smoke tests,
14:35
but do not pretend it is a perfect mirror. You
14:38
keep it simple enough that it is not its own
14:41
full-time job to maintain. Then you build serious
14:45
safety mechanisms into production. Feature flags
14:48
so you can roll changes out to 1% of traffic
14:51
or only to internal users or only to specific
14:55
regions. Progressive delivery so you can ramp
14:58
traffic up and down based on real SLOs. Not just
15:02
it seems fine. Shadow traffic or replay so you
15:06
can feed realistic requests into new versions
15:09
without exposing users yet. Good observability
15:13
and alerting so you know if the experiment is
15:16
hurting real people.
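For the show notes, the feature flag piece does not have to start with a vendor: the core of a percentage rollout is a stable hash of some identity plus a dial you can turn. A toy sketch only; flag name, allowlist, and percentages are made up, and a real flag system adds kill switches and audit trails.

```python
"""Toy percentage rollout: stable per-user bucketing plus an internal-users override."""
import hashlib

ROLLOUT_PERCENT = 1  # dial this up as confidence grows
INTERNAL_USERS = {"alice@example.com", "bob@example.com"}  # example allowlist


def bucket(user_id: str, flag: str) -> int:
    """Deterministically map (user, flag) to 0-99 so ramps are sticky per user."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def new_checkout_enabled(user_id: str) -> bool:
    if user_id in INTERNAL_USERS:
        return True  # dogfood internally before any external traffic
    return bucket(user_id, "new-checkout") < ROLLOUT_PERCENT


if __name__ == "__main__":
    for u in ["alice@example.com", "user-123", "user-456", "user-789"]:
        print(u, new_checkout_enabled(u))
```

The important part is the operational wrapper: who can turn the dial, how fast you can turn it to zero, and which SLOs tell you to.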
15:18
The article's core point is good. The only environment that exactly behaves
15:21
like production is production. So if you want
15:24
to be confident, you need to be able to experiment
15:26
there. Just do it intentionally. So if you are
15:29
listening to this and you own a platform, here
15:32
is a question to take back. If someone on your
15:35
team said, we want to turn off staging in six
15:38
months, what would you need in place in production
15:41
to feel safe? List those things. That is probably
15:44
your roadmap, whether or not you actually kill
15:47
staging. All right, let's hit a lightning round.
15:50
First quick one, Zachary Lober wrote a Terraform
15:52
custom module MCP server and released a project
15:56
called Terraform Ingest. It is a CLI and MCP
16:01
server that crawls your Terraform module repos,
16:05
summarizes them, and exposes that to AI tools
16:08
so they can understand your existing modules
16:11
instead of hallucinating new ones. This is exactly
16:14
the direction I expect a lot of teams to go.
16:17
Instead of asking a model, write me some random
16:20
terraform. You point it at your real modules
16:23
and you say, compose with these building blocks.
16:26
If you care about standardization and avoiding
16:29
weird snowflake stacks, this kind of pattern
16:32
is worth watching.
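For the show notes, a toy sketch of the crawling half of that idea: walk a modules directory, pull out variable names and descriptions, and emit a JSON summary you could hand to an AI tool, over MCP or otherwise. This is not the Terraform Ingest project itself, just the pattern, and the regex parsing is deliberately crude.

```python
"""Crude summary of local Terraform modules for AI tooling (sketch).

Walks a modules directory, extracts variable names and descriptions from
*.tf files with a rough regex, and prints a JSON summary.
"""
import json
import pathlib
import re

MODULES_DIR = pathlib.Path("modules")  # example layout: modules/<name>/*.tf

VARIABLE_BLOCK = re.compile(r'variable\s+"([^"]+)"\s*{(.*?)\n}', re.DOTALL)
DESCRIPTION = re.compile(r'description\s*=\s*"([^"]*)"')

summary = {}
for module_dir in sorted(p for p in MODULES_DIR.iterdir() if p.is_dir()):
    variables = {}
    for tf_file in module_dir.glob("*.tf"):
        text = tf_file.read_text()
        for name, body in VARIABLE_BLOCK.findall(text):
            desc = DESCRIPTION.search(body)
            variables[name] = desc.group(1) if desc else ""
    summary[module_dir.name] = {"variables": variables}

print(json.dumps(summary, indent=2))
```

A real implementation would parse HCL properly and include outputs and examples, but even this level of summary beats asking a model to guess what your modules look like.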
16:34
Next, a small but very useful tool: Runzons EC2 Instance Family Ranking. It
16:39
is a page that ranks EC2 families by PassMark
16:43
CPU performance, split across x86 and ARM, and
16:47
it lets you dig into detailed benchmarks and
16:50
even pricing via their API. If you ever picked
16:53
an instance type purely by habit, this is the
16:56
antidote. Before you copy-paste m5.large again,
16:59
you can check where M7g or M8a sit on the performance
17:03
curve and what that means for your workloads.
17:07
It is a nice way to bring a little data into
17:10
those instance choice conversations without running
17:13
your own benchmark suite. And the last lightning
17:15
item, an SRE story from Airbnb. They published
17:19
From Static Rate Limiting to Adaptive Traffic
17:23
Management in Airbnb's Key Value Store. It is
17:26
about their key-value store, Mussel, and how they
17:29
evolved static QPS limits into a more adaptive
17:32
system that looks at short-term latency relative
17:36
to long-term baseline and adjusts limits dynamically.
17:39
The interesting idea here is using ratios like
17:42
current P95 over trailing P95 as a signal that
17:46
the system is under stress, so the rate limiter
17:49
can react before things really fall over. Even if
17:52
you do not copy their exact design, it is a nice
17:55
example of moving beyond fixed per customer limits
17:58
into something that responds to real conditions.
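For the show notes, here is the kernel of that idea in a few lines: compare a short-window p95 to a long-window baseline and scale the allowed rate down as the ratio climbs. Airbnb's real system is far more involved; the thresholds, windows, and floor here are made-up illustrations.

```python
"""Kernel of an adaptive rate limit driven by a latency ratio (sketch)."""
from collections import deque
from statistics import quantiles


class AdaptiveLimit:
    def __init__(self, base_qps=1000, short_window=200, long_window=5000):
        self.base_qps = base_qps
        self.short = deque(maxlen=short_window)  # recent request latencies (ms)
        self.long = deque(maxlen=long_window)    # trailing baseline latencies (ms)

    def record(self, latency_ms: float) -> None:
        self.short.append(latency_ms)
        self.long.append(latency_ms)

    def _p95(self, samples) -> float:
        return quantiles(samples, n=20)[-1]  # 95th percentile cut point

    def current_limit(self) -> float:
        if len(self.short) < 50 or len(self.long) < 500:
            return self.base_qps  # not enough signal yet, fall back to the static limit
        ratio = self._p95(self.short) / max(self._p95(self.long), 1e-6)
        if ratio <= 1.2:
            return self.base_qps  # healthy: short-term p95 is near the baseline
        # Shed load as short-term latency pulls away from baseline,
        # but never drop below 10% of the static limit.
        return max(0.1 * self.base_qps, self.base_qps / ratio)
```

The point is not these exact numbers; it is that the limit reacts to how the store is actually doing instead of waiting for a human to re-tune a static QPS value.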
18:02
All right, let's close it with a human story.
18:04
Nolan Lawson wrote a piece called The Fate of
18:07
Small Open Source. It is about tiny libraries
18:11
and tools and the stuff that might be a few hundred
18:14
lines of code that still get millions of downloads
18:17
and quietly sit at the bottom of everyone's dependency
18:20
graph. He talks about one of his own packages
18:23
that has been around for about a decade, still
18:26
getting 5 million downloads per week and how
18:28
LLMs change the equation. If you can just ask
18:32
a model to spin out a custom helper function,
18:36
do you really need another dependency? And if
18:38
people do keep using these tiny libraries, what
18:41
does that mean for the one maintainer handling
18:44
issues and security reports for free? I think
18:46
this hits home for platform teams in two ways.
18:49
First, look at your own tooling. Terraform providers,
18:52
little CLI helpers, internal scripts, custom
18:56
controllers. A lot of that probably rests on
18:59
one or two small open source projects that someone
19:02
wrote on nights and weekends. Terrascan being
19:05
archived recently is a good reminder that tools
19:07
you depend on can go away when the incentives
19:10
for the maintainers shift. Second, we are starting
19:12
to see companies say, why add a new dependency
19:15
when an LLM can just generate the five lines
19:18
of code we need? That might reduce supply chain
19:21
risk in some places, but it also raises questions
19:24
about how new utility libraries get created and
19:28
maintained in the first place. So what do you
19:31
do with that? Practically, I would make a list
19:33
of small critical dependencies in your platform.
19:36
Things where, if the repo went read-only tomorrow,
19:40
you would be in trouble. Ask what your plan B
19:42
is. Could you fork it, vendor it, or replace it
19:46
if needed? Maybe consider sponsoring a few of
19:48
these maintainers. Even a small amount can make
19:51
a difference and is a good signal to the rest
19:53
of your org that this stuff matters. Think twice
19:57
before outsourcing important functionality to
19:59
a single tiny project without at least acknowledging
20:03
the risk. And when you are tempted to say the
20:06
AI can just generate this, maybe also think about
20:09
whether that code will need to be maintained,
20:12
audited, and shared across teams in the future.
20:15
Sometimes the boring little library with tests
20:17
and a maintainer is still the better choice.
20:20
All right. That is it for this episode of Ship
20:23
It Weekly. We walked through the AWS re:Invent
20:26
updates that actually matter for platform teams,
20:29
from regional NAT gateways and Route 53 global
20:33
resolver to ECS express mode, EKS capabilities,
20:38
S3 vectors, 50 terabyte objects, Aurora dynamic
20:42
masking, CodeCommit's return to GA, and IAM policy
20:47
autopilot. We looked at Google's 130,000-node
20:51
GKE cluster and used it as a lens on control
20:55
plane scaling and cluster design at more normal
20:58
sizes. We dug into the kill-staging, test-in-production
21:02
argument and how to make that safe with feature
21:05
flags and progressive delivery. In the lightning
21:09
round, we talked about Terraform MCP servers
21:12
for module-aware AI, EC2 instance rankings that
21:16
help you right-size with data instead of vibes,
21:20
and Airbnb's adaptive traffic management for
21:24
their key value store. And we wrapped with Nolan
21:27
Lawson's piece on the fate of small open source,
21:30
and what that means for all the tiny projects
21:32
your platform silently leans on. I will put all
21:35
of the links we talked about in the show notes.
21:37
I am Brian. This is Ship It Weekly by Tellers
21:40
Tech. Thanks for hanging out and I'll see you
21:42
in the next one.