0:06
Hey, I'm Brian and this is Ship It Weekly by
0:10
Tellers Tech. This episode is very much a
0:14
running-real-systems-in-production one. Kubernetes just
0:17
dropped an official configuration good practices
0:20
guide. AWS is quietly admitting that EKS control
0:25
planes and networking need more love, and GitHub
0:28
is pushing harder into being your platform brain
0:32
with OIDC tweaks and Copilot customization. Then
0:36
we'll hit a lightning round with Terrascan getting
0:38
archived, a ridiculous 15 terabit DDoS against
0:42
Azure, and CloudFront flat-rate pricing. We'll
0:46
then close with a human story about writing incident
0:49
reports for a future AI SRE and why that actually
0:53
makes them better for humans too. Let's start
0:56
with Kubernetes config. Kubernetes posted a new
1:00
configuration good practices blog and it is very
1:03
much a please stop hurting yourselves kind of
1:06
document. The short version is configuration
1:09
seems small until it absolutely is not. The blog
1:12
pulls together a bunch of things most of us learn
1:15
the hard way. Stuff like keep configuration in
1:19
source control instead of hand editing manifests
1:21
on a jump box. Use tools like Kustomize or Helm
1:25
in a consistent way so you do not end up with
1:28
five different templating patterns in the same
1:30
company. Avoid magic defaults. Be deliberate
1:33
about labels and annotations so you can actually
1:37
select and observe your workloads later. There
1:40
is also a theme of prefer simple explicit config
1:44
over clever tricks. Things like avoiding giant
1:48
copy-pasted YAML blobs across environments, keeping
1:52
environment-specific bits in overlays instead
1:55
of forking charts, and validating config early
1:59
with tools instead of discovering issues when
2:02
you hit apply on production.
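To make the overlay idea concrete, here is a minimal Kustomize sketch. The app name, file layout, and patch values are hypothetical; the point is that the base stays shared and only the environment-specific bits live in the overlay.
```yaml
# base/kustomization.yaml -- shared definition used by every environment
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
---
# overlays/prod/kustomization.yaml -- only the prod-specific differences
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: example-app        # hypothetical app name
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
```
Rendering is just kustomize build overlays/prod, which also gives CI something concrete to validate. None of this is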
2:04
shocking, but that is kind of the point. The
2:07
worst Kubernetes incidents I see are almost never
2:10
the kubelet exploded. It is usually a small config
2:14
problem that slipped through. So if you are running
2:17
clusters, here is how I would use this blog in
2:19
real life. First, treat it as a checklist. Take
2:23
one of your core apps and walk through how you
2:26
do config today. Is everything for that app in
2:29
Git, or do you still have a couple of hotfix-only
2:32
manifests that people kubectl apply from
2:35
their laptop? Are you mixing Helm and raw YAML
2:39
and Kustomize in the same namespace? Do you actually
2:42
have a policy or linter step in CI that fails
2:46
on bad patterns, or is it we just trust reviewers
2:50
to catch it?
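If the honest answer is we just trust reviewers, a small validation job is a cheap upgrade. Here is a hedged GitHub Actions-style sketch; the overlay path is hypothetical, and kubeconform is just one schema-validation option that would need to be available on the runner.
```yaml
# .github/workflows/validate-config.yml -- fail the PR on obviously bad manifests
name: validate-config
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Render the prod overlay
        run: kustomize build overlays/prod -o rendered.yaml   # hypothetical path
      - name: Schema-validate the rendered manifests
        # assumes a kubeconform binary is installed on the runner
        run: kubeconform -strict -summary rendered.yaml
```
Second, use it to standardize your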
2:53
approach. Most orgs have grown Kubernetes config
2:56
very organically. One team copied a blog post,
3:00
another team used a different starter repo. Five
3:03
years later, you have four different ways of
3:05
doing environment overlays and nobody really
3:08
knows which is the right one. The Kubernetes
3:11
post gives you a neutral reference you can point
3:14
at when you say, okay, we are going to converge
3:17
on this pattern. Third, treat config as part
3:21
of your platform. If you have a platform or infra
3:24
team, one of your jobs is to make it harder for
3:27
people to write bad config. That might mean shipping
3:30
base Helm charts with sane defaults, adding OPA
3:34
or Kyverno policies, wiring up schema validation,
3:38
or even just having a kubectl diff step before
3:42
applies in CI. Little stuff that stops the oops
3:45
prod moments.
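To make the policy idea concrete, here is a small Kyverno sketch that rejects Deployments without a team label. The policy name and label key are assumptions; swap in whatever your org actually uses for ownership.
```yaml
# require-team-label.yaml -- example Kyverno ClusterPolicy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label     # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Deployments need a team label so workloads can be selected and observed later."
        pattern:
          metadata:
            labels:
              team: "?*"       # any non-empty value; 'team' is a hypothetical key
```
Run it in admission or against rendered manifests in CI and the bad config never reaches a cluster. So if you only skim one thing this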
3:48
week, that config good practices blog is worth
3:51
a read and honestly, worth a little internal
3:54
workshop with your team. All right, let's talk
3:58
EKS. AWS announced something called EKS provisioned
4:02
control plane. This is basically stop guessing
4:05
how much control plane you get. Historically,
4:07
if you spin up an EKS cluster, AWS handles the
4:11
control plane for you. It scales, there is some
4:14
magic, and you mostly only think about worker
4:16
nodes. But at scale or during big traffic spikes,
4:20
you can absolutely run into API throttling and
4:23
control plane saturation. You see weird kube
4:26
API latencies during deploys or controllers getting
4:30
rate limited. With provisioned control plane,
4:33
you can pick from scaling tiers that guarantee
4:36
a certain level of control plane capacity. The
4:39
docs talk about selecting a tier for high, predictable
4:43
performance, regardless of current utilization.
4:46
So instead of hoping AWS auto scales in time,
4:50
you say this cluster needs this much headroom
4:52
and you pay for that. I like this for a few reasons.
4:56
First, it forces people to take control plane
4:59
SLOs seriously. Most teams have metrics and alerts
5:02
for node CPU and pod restarts, but almost nobody
5:06
writes our kube API P99 latency should stay under
5:10
X, or the cluster should be able to handle Y
5:13
QPS of controller traffic during a big deploy.
5:16
Provisioned tiers give you a way to align those
5:19
expectations with something concrete in the platform.
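If you want a starting point for that kind of SLO, an alert on API server request latency is the obvious first rule. This is a hedged sketch built on the standard apiserver_request_duration_seconds histogram; the one-second threshold and the verb filter are placeholders to tune per cluster.
```yaml
# kube-apiserver-slo.rules.yaml -- example Prometheus alerting rule
groups:
  - name: kube-apiserver-slo
    rules:
      - alert: KubeAPIServerHighP99Latency
        # p99 latency of non-watch API requests over the last five minutes
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "kube-apiserver p99 request latency is above 1s (placeholder SLO)."
```
Even if the numbers start as guesses, writing the rule down forces the conversation about how much control plane you actually need.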
5:23
Second, it clarifies multi-tenant cluster design.
5:26
If you are stuffing a ton of teams into a few
5:29
big clusters, you are all sharing that control
5:31
plane. Being able to size it explicitly and maybe
5:35
have a different tier for noisy clusters is a
5:38
nice lever. AWS also dropped enhanced container
5:41
network observability for EKS. This gives you
5:45
granular network metrics for cluster traffic,
5:48
cross-AZ flows, and traffic to AWS services with
5:52
CloudWatch visualizations and deeper flow insight.
5:55
Translated to human, they're making it easier
5:58
to answer the question, what the heck is talking
6:00
to what inside of your clusters? That helps with
6:03
finding cross-AZ chatty services that are quietly
6:06
burning money and adding latency, spotting weird
6:10
egress patterns to the internet or to managed
6:13
services, and debugging this service is slow but
6:16
CPU and memory look fine. If you have ever done
6:19
the dance where an app team swears it is not
6:22
them causing cross-AZ data transfer and you are
6:25
trying to prove it, this kind of visibility is
6:28
what you want. The pattern here is pretty clear.
6:30
EKS is growing features where real production
6:34
pain lives: control plane capacity and networking.
6:38
As a platform team, this is a good time to write
6:41
some SLOs for your control plane. Decide which
6:44
clusters actually need a provisioned tier. Hook
6:48
the network observability metrics into your existing
6:51
dashboards and alerts rather than letting them
6:54
sit in a default CloudWatch view nobody opens.
6:57
All right, let's shift over to GitHub and talk
7:00
about OIDC and Copilot. GitHub shipped a neat
7:04
little security and governance improvement for
7:06
Actions. The OIDC token that Actions uses to
7:10
talk to cloud providers now includes a check
7:12
run ID claim. Before this, you could tell this
7:16
token came from this workflow run, but it was
7:19
harder to tie it to a specific job or check.
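For reference, this is roughly what the workflow side looks like, a minimal sketch of an Actions job requesting an OIDC token and assuming an AWS role. The role ARN and deploy script are placeholders; the new check run ID claim is something you would match on the cloud side, not something you set here.
```yaml
# .github/workflows/deploy.yml -- minimal OIDC-to-AWS sketch
name: deploy
on:
  push:
    branches: [main]
permissions:
  id-token: write        # lets the job request an OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Assume the deploy role via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/example-deploy-role   # placeholder
          aws-region: us-east-1
      - name: Deploy
        run: ./scripts/deploy.sh   # hypothetical script
```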
7:22
Now, if a workflow calls into AWS or an internal
7:26
service using OIDC, you can log and enforce policies
7:30
based on the exact check run that generated the
7:33
token. For platform teams, that means a few things.
7:36
You can write more fine-grained IAM or ABAC
7:41
policies on the cloud side that say only this
7:44
particular job in this repo can assume this role,
7:47
instead of any workflow from this org. That is
7:51
big for least privilege. You can audit access
7:54
better. When someone asks who touched this role
7:57
or what did this job actually do in AWS, there
8:01
is a clear link between the token, the check
8:04
run, and the cloud actions. You can now also
8:07
use it for internal services. If you have an
8:10
internal deploy API or some platform endpoint
8:14
that Actions calls, it can now require a check
8:17
run ID and log it, instead of just trusting that
8:20
anything with a valid token is allowed. On the
8:23
AI side, GitHub published Unlocking the full
8:26
power of Copilot code review: Master your instructions
8:30
files. It is basically a guide to making Copilot
8:33
code review actually useful instead of spammy.
8:36
Instruction files let you tell Copilot what you
8:40
care about in reviews. That could be things like,
8:43
this repo uses a specific architecture. We prefer
8:46
structured logging. Avoid unsafe SQL or this
8:50
service is latency sensitive. Be careful about
8:53
extra network calls. This is very platformy because
8:57
you can standardize expectations across repos.
9:01
Instead of everyone writing their own reviewer
9:03
guidelines in a wiki that nobody reads, you can
9:06
put it in an instructions file that your AI reviewer
9:09
actually uses. Encode security and performance
9:12
concerns as part of the review process. Help
9:15
junior devs by having Copilot nudge them towards
9:18
patterns your platform team wants. GitHub also
9:21
has a customization library for Copilot that
9:24
includes custom agents. These are specialized
9:27
versions of the Copilot coding agent that you
9:30
configure for specific workflows, like implementation
9:34
planner, bug fix teammate, or your own workspace
9:37
specific helpers. You can define their behavior
9:40
in files under .github/agents, and they keep
9:44
that persona across a workflow rather than being
9:47
a one-off prompt. Where this gets interesting
9:50
for platform teams is that you can start to imagine
9:53
internal agents that know your stack, your standard
9:57
modules, your CI patterns, your security
10:00
policies and naming conventions, and help people
10:03
generate Terraform, Kubernetes manifests, or
10:06
CI configs that line up with your platform. Obviously,
10:10
you still need guardrails. You do not want an
10:13
overeager agent writing IAM policies or production
10:16
manifests and merging them without human review.
10:19
But it is pretty clear that in the next couple
10:22
of years, we are going to be building platforms
10:24
for humans and for AI helpers at the same time.
10:28
So tying this segment together, GitHub is not
10:31
just a repo anymore. It is your auth bridge into
10:34
the cloud via OIDC, and it is becoming the place
10:37
where you define how AI participates in your
10:40
code reviews and workflows. All right, let's
10:43
hit a lightning round. Quick hits. Each of these
10:46
could be its own rabbit hole, but I will keep
10:48
it short. Terrascan. The IaC security scanner
10:51
that a lot of folks used for Terraform and Kubernetes
10:54
has now been archived on GitHub. The repo shows
10:58
that it was archived by the owner on November
11:00
20th, 2025, and is now read-only. Tenable also
11:05
ended support for Terrascan in their Nessus release
11:07
notes earlier, recommending their own cloud security
11:10
product instead. So if Terrascan is still in
11:13
your pipeline, this is a nudge to treat it like
11:16
any other deprecated dependency. Design your
11:19
CI so the scanner is a pluggable step, not baked
11:22
into everything. Swapping to Trivy or OPA should
11:25
be a config change, not a rewrite.
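One way to keep that swap cheap is to isolate the scanner behind a single CI job, something like this sketch. Trivy is just the example replacement here, and the job name and paths are made up.
```yaml
# .github/workflows/iac-scan.yml -- scanner isolated in one replaceable job
name: iac-scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan IaC for misconfigurations
        # assumes the chosen scanner binary is installed in an earlier step or baked into the image
        run: trivy config --exit-code 1 ./infrastructure   # hypothetical path
```
If the tool changes again, only this one job changes. Next up, Azure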
11:29
just blocked a record 15.72 terabit-per-second
11:33
DDoS attack sourced from an IoT botnet. The Microsoft
11:38
post and follow-on coverage talk about over
11:41
500,000 IPs, multi-vector attacks, and traffic
11:45
equivalent to millions of Netflix streams hitting
11:48
a single endpoint in Australia. The interesting
11:51
takeaway is not, wow, that is big. It is that
11:54
this was powered by compromised home routers
11:56
and cameras. And Azure's DDoS protections handled
12:00
it without customer impact. For us, it is a reminder
12:03
to actually understand what DDoS protections
12:06
we have on our own providers. Are we just relying
12:10
on defaults or have we validated our mitigation
12:13
tiers, alerting, and runbooks? On the cost side,
12:17
AWS introduced flat-rate pricing plans for website
12:21
delivery. These are CloudFront-based plans in
12:24
Free, Pro, Business, and Premium tiers that bundle
12:28
CDN, some security features, and S3 storage credits
12:32
into a monthly price with no overages. From a
12:35
FinOps and platform perspective, this is interesting
12:38
if you have fairly predictable traffic and you
12:41
want a cap on surprise bills. It is less exciting
12:43
for very spiky workloads or stuff that is still
12:46
experimental. But if you are running a big marketing
12:49
site or a core app with steady usage, this might
12:52
be worth modeling. The key question is, can I
12:55
align teams and cost allocation with this kind
12:58
of plan without creating a mess? All right, let's
13:01
close with the human side. For this episode's
13:04
human story, I want to pull from a post by Lorin
13:06
Hochstein called Two Thought Experiments, which
13:10
showed up in SRE Weekly Issue 498. One of the
13:13
thought experiments is about incident reports
13:16
and AI. He basically asks, what if we assume
13:20
that our incident reports will be consumed by
13:23
an AI SRE tool in the future? What kinds of details
13:27
would be useful to that tool in helping troubleshoot
13:30
future incidents? And if we wrote with that in
13:33
mind, would humans actually get more value out
13:35
of those reports too? I really like this framing
13:38
because it solves a couple of problems at once.
13:41
Right now, a lot of postmortems end up as vague
13:44
timelines, a couple of bullet points about root
13:46
cause, and some action items that may or may
13:49
not get done. There is often not enough detail
13:52
about how people reasoned, what signals were
13:55
confusing, and what trade-offs they made in
13:57
the moment. If you imagine an AI SRE reading that,
14:01
trying to learn how your systems fail and how
14:03
your humans debug them, you realize pretty quickly
14:06
that you need to include more of the work in
14:09
the writeup: not just service X was slow, but
14:12
we first suspected dependency Y because of metric
14:16
Z, tried mitigation A, saw it did nothing, and
14:20
then switched hypothesis because of this specific
14:23
signal. Those same details are exactly what help
14:26
new humans learn from incidents, too. So a practical
14:29
way to bring this into your own team is not to
14:31
say we are writing for robots now, but something
14:34
like, let's pretend an AI assistant is going
14:37
to read this and use it to help the on-call
14:39
next time. What would we need to capture so it
14:42
actually learns something? Hypotheses we tried,
14:46
signals that turned out to be misleading, constraints
14:49
we were under, the parts that were surprising.
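To make that concrete in a template, a few structured fields go a long way. Here is a hedged, YAML-flavored sketch of the sections you might add; the field names are illustrative, not a standard.
```yaml
# incident-report-template.yaml -- illustrative extra sections for postmortems
incident:
  summary: ""             # one paragraph in plain language
  timeline: []            # timestamped events, including actions that did nothing
  hypotheses:
    - suspected: ""       # e.g. "dependency Y", and the signal that pointed there
      evidence: ""        # the metric, log, or dashboard behind the suspicion
      outcome: ""         # confirmed, ruled out, or abandoned, and why
  misleading_signals: []  # things that pointed the wrong way
  constraints: []         # time pressure, missing access, frozen deploys
  surprises: []           # anything that did not behave the way we assumed
```
None of that is AI-specific; it is just the reasoning written down.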
14:52
If you bake that expectation into your templates
14:55
and your review process, you get better incident
14:58
reports today, and you also future-proof yourself
15:01
a bit for when you do introduce more AI into
15:05
your incident response tooling. And the side
15:07
benefit is you shift the focus from blame and
15:11
root cause toward understanding how people
15:14
and systems interact under pressure, which is
15:17
usually where the real learning is. All right,
15:20
that is it for this episode of Ship It Weekly.
15:23
We talked about the new Kubernetes configuration
15:25
good practices and how they map to the boring
15:28
but critical work of keeping your clusters sane.
15:32
We looked at EKS provisioned control plane and
15:35
enhanced network observability and what that
15:38
means if you are running big multi-tenant clusters.
15:41
We dug into GitHub's updates around Actions OIDC,
15:44
Copilot instructions, and custom agents, and
15:48
how that all ties into platform and AI workloads.
15:51
In the lightning round, we touched on Terrascan
15:54
being archived, the massive Azure DDoS that was
15:57
quietly handled, and CloudFront flat-rate plans
16:01
for more predictable costs. And we wrapped with
16:03
Lorin's thought experiment about writing incident
16:06
reports for a future AI SRE, and how that mindset
16:10
actually improves things for the humans reading
16:13
them today. I will put all of the links we talked
16:16
about in the show notes. I am Brian. This is
16:18
Ship It Weekly by Tellers Tech. Thanks for hanging
16:21
out and I'll see you in the next one.