0:06
Hey, I'm Brian and this is Ship It Weekly by
0:10
Tellers Tech. This episode is very much a
0:14
running-real-systems-in-production one. Kubernetes just
0:17
dropped an official configuration good practices
0:20
guide. AWS is quietly admitting that EKS control
0:25
planes and networking need more love, and GitHub
0:28
is pushing harder into being your platform brain
0:32
with OIDC tweaks and Copilot customization. Then
0:36
we'll hit a lightning round with Terrascan getting
0:38
archived, a ridiculous 15 terabit DDoS against
0:42
Azure, and CloudFront flat-rate pricing. We'll
0:46
then close with a human story about writing incident
0:49
reports for a future AI SRE and why that actually
0:53
makes them better for humans too. Let's start
0:56
with Kubernetes config. Kubernetes posted a new
1:00
configuration good practices blog and it is very
1:03
much a please stop hurting yourselves kind of
1:06
document. The short version is configuration
1:09
seems small until it absolutely is not. The blog
1:12
pulls together a bunch of things most of us learn
1:15
the hard way. Stuff like keep configuration in
1:19
source control instead of hand editing manifests
1:21
on a jump box. Use tools like Kustomize or Helm
1:25
in a consistent way so you do not end up with
1:28
five different templating patterns in the same
1:30
company. Avoid magic defaults. Be deliberate
1:33
about labels and annotations so you can actually
1:37
select and observe your workloads later. There
1:40
is also a theme of prefer simple explicit config
1:44
over clever tricks. Things like avoiding giant
1:48
copy-pasted YAML blobs across environments, keeping
1:52
environment-specific bits in overlays instead
1:55
of forking charts, and validating config early
1:59
with tools instead of discovering issues when
2:02
you hit apply on production.
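To make the overlay idea concrete, here is a minimal Kustomize sketch. The app name, file layout, and patch values are hypothetical; the point is that the base stays shared and only the environment-specific bits live in the overlay.
```yaml
# base/kustomization.yaml -- shared definition used by every environment
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
---
# overlays/prod/kustomization.yaml -- only the prod-specific differences
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - ../../base
patches:
  - target:
      kind: Deployment
      name: example-app        # hypothetical app name
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
```
Rendering is just kustomize build overlays/prod, which also gives CI something concrete to validate. None of this is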
2:04
shocking, but that is kind of the point. The
2:07
worst Kubernetes incidents I see are almost never
2:10
the kubelet exploded. It is usually a small config
2:14
problem that slipped through. So if you are running
2:17
clusters, here is how I would use this blog in
2:19
real life. First, treat it as a checklist. Take
2:23
one of your core apps and walk through how you
2:26
do config today. Is everything for that app in
2:29
Git, or do you still have a couple of hotfix-only
2:32
manifests that people kubectl apply from
2:35
their laptop? Are you mixing Helm and raw YAML
2:39
and Kustomize in the same namespace? Do you actually
2:42
have a policy or linter step in CI that fails
2:46
on bad patterns, or is it we just trust reviewers
2:50
to catch it?
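If the honest answer is we just trust reviewers, a small validation job is a cheap upgrade. Here is a hedged GitHub Actions-style sketch; the overlay path is hypothetical, and kubeconform is just one schema-validation option that would need to be available on the runner.
```yaml
# .github/workflows/validate-config.yml -- fail the PR on obviously bad manifests
name: validate-config
on: [pull_request]
jobs:
  validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Render the prod overlay
        run: kustomize build overlays/prod -o rendered.yaml   # hypothetical path
      - name: Schema-validate the rendered manifests
        # assumes a kubeconform binary is installed on the runner
        run: kubeconform -strict -summary rendered.yaml
```
Second, use it to standardize your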
2:53
approach. Most orgs have grown Kubernetes config
2:56
very organically. One team copied a blog post,
3:00
another team used a different starter repo. Five
3:03
years later, you have four different ways of
3:05
doing environment overlays and nobody really
3:08
knows which is the right one. The Kubernetes
3:11
post gives you a neutral reference you can point
3:14
at when you say, okay, we are going to converge
3:17
on this pattern. Third, treat config as part
3:21
of your platform. If you have a platform or infra
3:24
team, one of your jobs is to make it harder for
3:27
people to write bad config. That might mean shipping
3:30
base Helm charts with sane defaults, adding OPA
3:34
or Kyverno policies, wiring up schema validation,
3:38
or even just having a kubectl diff step before
3:42
applies in CI. Little stuff that stops the oops
3:45
prod moments.
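To make the policy idea concrete, here is a small Kyverno sketch that rejects Deployments without a team label. The policy name and label key are assumptions; swap in whatever your org actually uses for ownership.
```yaml
# require-team-label.yaml -- example Kyverno ClusterPolicy
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-team-label     # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
    - name: check-team-label
      match:
        any:
          - resources:
              kinds:
                - Deployment
      validate:
        message: "Deployments need a team label so workloads can be selected and observed later."
        pattern:
          metadata:
            labels:
              team: "?*"       # any non-empty value; 'team' is a hypothetical key
```
Run it in admission or against rendered manifests in CI and the bad config never reaches a cluster. So if you only skim one thing this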
3:48
week, that config good practices blog is worth
3:51
a read and honestly, worth a little internal
3:54
workshop with your team. All right, let's talk
3:58
EKS. AWS announced something called EKS provisioned
4:02
control plane. This is basically stop guessing
4:05
how much control plane you get. Historically,
4:07
if you spin up an EKS cluster, AWS handles the
4:11
control plane for you. It scales, there is some
4:14
magic, and you mostly only think about worker
4:16
nodes. But at scale or during big traffic spikes,
4:20
you can absolutely run into API throttling and
4:23
control plane saturation. You see weird kube
4:26
API latencies during deploys or controllers getting
4:30
rate limited. With provisioned control plane,
4:33
you can pick from scaling tiers that guarantee
4:36
a certain level of control plane capacity. The
4:39
docs talk about selecting a tier for high, predictable
4:43
performance, regardless of current utilization.
4:46
So instead of hoping AWS auto scales in time,
4:50
you say this cluster needs this much headroom
4:52
and you pay for that. I like this for a few reasons.
4:56
First, it forces people to take control plane
4:59
SLOs seriously. Most teams have metrics and alerts
5:02
for node CPU and pod restarts, but almost nobody
5:06
writes our kube API P99 latency should stay under
5:10
X, or the cluster should be able to handle Y
5:13
QPS of controller traffic during a big deploy.
5:16
Provisioned tiers give you a way to align those
5:19
expectations with something concrete in the platform.
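If you want a starting point for that kind of SLO, an alert on API server request latency is the obvious first rule. This is a hedged sketch built on the standard apiserver_request_duration_seconds histogram; the one-second threshold and the verb filter are placeholders to tune per cluster.
```yaml
# kube-apiserver-slo.rules.yaml -- example Prometheus alerting rule
groups:
  - name: kube-apiserver-slo
    rules:
      - alert: KubeAPIServerHighP99Latency
        # p99 latency of non-watch API requests over the last five minutes
        expr: |
          histogram_quantile(0.99,
            sum(rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) by (le)
          ) > 1
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "kube-apiserver p99 request latency is above 1s (placeholder SLO)."
```
Even if the numbers start as guesses, writing the rule down forces the conversation about how much control plane you actually need.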
5:23
Second, it clarifies multi-tenant cluster design.
5:26
If you are stuffing a ton of teams into a few
5:29
big clusters, you are all sharing that control
5:31
plane. Being able to size it explicitly and maybe
5:35
have a different tier for noisy clusters is a
5:38
nice lever. AWS also dropped enhanced container
5:41
network observability for EKS. This gives you
5:45
granular network metrics for cluster traffic,
5:48
cross-AZ flows, and traffic to AWS services with
5:52
CloudWatch visualizations and deeper flow insight.
5:55
Translated to human, they're making it easier
5:58
to answer the question, what the heck is talking
6:00
to what inside of your clusters? That helps with
6:03
finding cross-AZ chatty services that are quietly
6:06
burning money and adding latency, spotting weird
6:10
egress patterns to the internet or to managed
6:13
services, and debugging this service is slow but
6:16
CPU and memory look fine. If you have ever done
6:19
the dance where an app team swears it is not
6:22
them causing cross-AZ data transfer and you are
6:25
trying to prove it, this kind of visibility is
6:28
what you want. The pattern here is pretty clear.
6:30
EKS is growing features where real production
6:34
pain lives: control plane capacity and networking.
6:38
As a platform team, this is a good time to write
6:41
some SLOs for your control plane. Decide which
6:44
clusters actually need a provisioned tier. Hook
6:48
the network observability metrics into your existing
6:51
dashboards and alerts rather than letting them
6:54
sit in a default CloudWatch view nobody opens.
6:57
All right, let's shift over to GitHub and talk
7:00
about OIDC and Copilot. GitHub shipped a neat
7:04
little security and governance improvement for
7:06
Actions. The OIDC token that Actions uses to
7:10
talk to cloud providers now includes a check
7:12
run ID claim. Before this, you could tell this
7:16
token came from this workflow run, but it was
7:19
harder to tie it to a specific job or check.
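For reference, this is roughly what the workflow side looks like, a minimal sketch of an Actions job requesting an OIDC token and assuming an AWS role. The role ARN and deploy script are placeholders; the new check run ID claim is something you would match on the cloud side, not something you set here.
```yaml
# .github/workflows/deploy.yml -- minimal OIDC-to-AWS sketch
name: deploy
on:
  push:
    branches: [main]
permissions:
  id-token: write        # lets the job request an OIDC token
  contents: read
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Assume the deploy role via OIDC
        uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/example-deploy-role   # placeholder
          aws-region: us-east-1
      - name: Deploy
        run: ./scripts/deploy.sh   # hypothetical script
```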
7:22
Now, if a workflow calls into AWS or an internal
7:26
service using OIDC, you can log and enforce policies
7:30
based on the exact check run that generated the
7:33
token. For platform teams, that means a few things.
7:36
You can write more fine-grained IAM or ABAC
7:41
policies on the cloud side that say only this
7:44
particular job in this repo can assume this role,
7:47
instead of any workflow from this org. That is
7:51
big for least privilege. You can audit access
7:54
better. When someone asks who touched this role
7:57
or what did this job actually do in AWS, there
8:01
is a clear link between the token, the check
8:04
run, and the cloud actions. You can now also
8:07
use it for internal services. If you have an
8:10
internal deploy API or some platform endpoint
8:14
that Actions calls, it can now require a check
8:17
run ID and log it, instead of just trusting that
8:20
anything with a valid token is allowed. On the
8:23
AI side, GitHub published Unlocking the full
8:26
power of Copilot code review: Master your instructions
8:30
files. It is basically a guide to making Copilot
8:33
code review actually useful instead of spammy.
8:36
Instruction files let you tell Copilot what you
8:40
care about in reviews. That could be things like,
8:43
this repo uses a specific architecture. We prefer
8:46
structured logging. Avoid unsafe SQL or this
8:50
service is latency sensitive. Be careful about
8:53
extra network calls. This is very platformy because
8:57
you can standardize expectations across repos.
9:01
Instead of everyone writing their own reviewer
9:03
guidelines in a wiki that nobody reads, you can
9:06
put it in an instructions file that your AI reviewer
9:09
actually uses. Encode security and performance
9:12
concerns as part of the review process. Help
9:15
junior devs by having Copilot nudge them towards
9:18
patterns your platform team wants. GitHub also
9:21
has a customization library for Copilot that
9:24
includes custom agents. These are specialized
9:27
versions of the Copilot coding agent that you
9:30
configure for specific workflows, like implementation
9:34
planner, bug fix teammate, or your own workspace
9:37
specific helpers. You can define their behavior
9:40
in files under .github/agents, and they keep
9:44
that persona across a workflow rather than being
9:47
a one-off prompt. Where this gets interesting
9:50
for platform teams is that you can start to imagine
9:53
internal agents that know your stack, your standard
9:57
modules, your CI patterns, your security
10:00
policies and naming conventions, and help people
10:03
generate Terraform, Kubernetes manifests, or
10:06
CI configs that line up with your platform. Obviously,
10:10
you still need guardrails. You do not want an
10:13
overeager agent writing IAM policies or production
10:16
manifests and merging them without human review.
10:19
But it is pretty clear that in the next couple
10:22
of years, we are going to be building platforms
10:24
for humans and for AI helpers at the same time.
10:28
So tying this segment together, GitHub is not
10:31
just a repo anymore. It is your auth bridge into
10:34
the cloud via OIDC, and it is becoming the place
10:37
where you define how AI participates in your
10:40
code reviews and workflows. All right, let's
10:43
hit a lightning round. Quick hits. Each of these
10:46
could be its own rabbit hole, but I will keep
10:48
it short. Terrascan. The IaC security scanner
10:51
that a lot of folks used for Terraform and Kubernetes
10:54
has now been archived on GitHub. The repo shows
10:58
that it was archived by the owner on November
11:00
20th, 2025, and is now read-only. Tenable also
11:05
ended support for Terrascan in their Nessus release
11:07
notes earlier, recommending their own cloud security
11:10
product instead. So if Terrascan is still in
11:13
your pipeline, this is a nudge to treat it like
11:16
any other deprecated dependency. Design your
11:19
CI so the scanner is a pluggable step, not baked
11:22
into everything. Swapping to Trivy or OPA should
11:25
be a config change, not a rewrite.
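One way to keep that swap cheap is to isolate the scanner behind a single CI job, something like this sketch. Trivy is just the example replacement here, and the job name and paths are made up.
```yaml
# .github/workflows/iac-scan.yml -- scanner isolated in one replaceable job
name: iac-scan
on: [pull_request]
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Scan IaC for misconfigurations
        # assumes the chosen scanner binary is installed in an earlier step or baked into the image
        run: trivy config --exit-code 1 ./infrastructure   # hypothetical path
```
If the tool changes again, only this one job changes. Next up, Azure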
11:29
just blocked a record 15.72 terabit-per-second
11:33
DDoS attack sourced from an IoT botnet. The Microsoft
11:38
post and follow-on coverage talk about over
11:41
500,000 IPs, multi-vector attacks, and traffic
11:45
equivalent to millions of Netflix streams hitting
11:48
a single endpoint in Australia. The interesting
11:51
takeaway is not, wow, that is big. It is that
11:54
this was powered by compromised home routers
11:56
and cameras. And Azure's DDoS protections handled
12:00
it without customer impact. For us, it is a reminder
12:03
to actually understand what DDoS protections
12:06
we have on our own providers. Are we just relying
12:10
on defaults or have we validated our mitigation
12:13
tiers, alerting, and runbooks? On the cost side,
12:17
AWS introduced flat-rate pricing plans for website
12:21
delivery. These are CloudFront-based plans in
12:24
Free, Pro, Business, and Premium tiers that bundle
12:28
CDN, some security features, and S3 storage credits
12:32
into a monthly price with no overages. From a
12:35
FinOps and platform perspective, this is interesting
12:38
if you have fairly predictable traffic and you
12:41
want a cap on surprise bills. It is less exciting
12:43
for very spiky workloads or stuff that is still
12:46
experimental. But if you are running a big marketing
12:49
site or a core app with steady usage, this might
12:52
be worth modeling. The key question is, can I
12:55
align teams and cost allocation with this kind
12:58
of plan without creating a mess? All right, let's
13:01
close with the human side. For this episode's
13:04
human story, I want to pull from a post by Lorin
13:06
Hochstein called Two Thought Experiments, which
13:10
showed up in SRE Weekly Issue 498. One of the
13:13
thought experiments is about incident reports
13:16
and AI. He basically asks, what if we assume
13:20
that our incident reports will be consumed by
13:23
an AI SRE tool in the future? What kinds of details
13:27
would be useful to that tool in helping troubleshoot
13:30
future incidents? And if we wrote with that in
13:33
mind, would humans actually get more value out
13:35
of those reports too? I really like this framing
13:38
because it solves a couple of problems at once.
13:41
Right now, a lot of postmortems end up as vague
13:44
timelines, a couple of bullet points about root
13:46
cause, and some action items that may or may
13:49
not get done. There is often not enough detail
13:52
about how people reasoned, what signals were
13:55
confusing, and what trade-offs they made in
13:57
the moment. If you imagine an AI SRE reading that,
14:01
trying to learn how your systems fail and how
14:03
your humans debug them, you realize pretty quickly
14:06
that you need to include more of the work in
14:09
the writeup: not just service X was slow, but
14:12
we first suspected dependency Y because of metric
14:16
Z, tried mitigation A, saw it did nothing, and
14:20
then switched hypothesis because of this specific
14:23
signal. Those same details are exactly what help
14:26
new humans learn from incidents, too. So a practical
14:29
way to bring this into your own team is not to
14:31
say we are writing for robots now, but something
14:34
like, let's pretend an AI assistant is going
14:37
to read this and use it to help the on-call
14:39
next time. What would we need to capture so it
14:42
actually learns something? Hypotheses we tried,
14:46
signals that turned out to be misleading, constraints
14:49
we were under, the parts that were surprising.
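To make that concrete in a template, a few structured fields go a long way. Here is a hedged, YAML-flavored sketch of the sections you might add; the field names are illustrative, not a standard.
```yaml
# incident-report-template.yaml -- illustrative extra sections for postmortems
incident:
  summary: ""             # one paragraph in plain language
  timeline: []            # timestamped events, including actions that did nothing
  hypotheses:
    - suspected: ""       # e.g. "dependency Y", and the signal that pointed there
      evidence: ""        # the metric, log, or dashboard behind the suspicion
      outcome: ""         # confirmed, ruled out, or abandoned, and why
  misleading_signals: []  # things that pointed the wrong way
  constraints: []         # time pressure, missing access, frozen deploys
  surprises: []           # anything that did not behave the way we assumed
```
None of that is AI-specific; it is just the reasoning written down.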
14:52
If you bake that expectation into your templates
14:55
and your review process, you get better incident
14:58
reports today, and you also future-proof yourself
15:01
a bit for when you do introduce more AI into
15:05
your incident response tooling. And the side
15:07
benefit is you shift the focus from blame and
15:11
root cause toward understanding how people
15:14
and systems interact under pressure, which is
15:17
usually where the real learning is. All right,
15:20
that is it for this episode of Ship It Weekly.
15:23
We talked about the new Kubernetes configuration
15:25
good practices and how they map to the boring
15:28
but critical work of keeping your clusters sane.
15:32
We looked at EKS provisioned control plane and
15:35
enhanced network observability and what that
15:38
means if you are running big multi-tenant clusters.
15:41
We dug into GitHub's updates around Actions OIDC,
15:44
Copilot instructions, and custom agents, and
15:48
how that all ties into platform and AI workloads.
15:51
In the lightning round, we touched on Terrascan
15:54
being archived, the massive Azure DDoS that was
15:57
quietly handled, and CloudFront flat-rate plans
16:01
for more predictable costs. And we wrapped with
16:03
Lorin's thought experiment about writing incident
16:06
reports for a future AI SRE, and how that mindset
16:10
actually improves things for the humans reading
16:13
them today. I will put all of the links we talked
16:16
about in the show notes. I am Brian. This is
16:18
Ship It Weekly by Tellers Tech. Thanks for hanging
16:21
out and I'll see you in the next one.