0:00
This week is basically a tour of how the safety system
0:03
becomes the incident. A protection rule that made
0:06
sense once, then quietly started blocking real
0:09
users. A Kubernetes permission that everyone
0:12
treats like read-only, but it absolutely is
0:15
not. A platform example that actually got the
0:18
control plane versus data plane separation right.
0:20
And compliance scope expanding in a way that is
0:24
easy to underestimate until you are drowning
0:27
in evidence requests. So yeah, not a week of
0:30
flashy tech. More like a week of go-check-your-
0:33
assumptions. Hey, I'm Brian, and
0:52
this is Ship It Weekly. If you like the show,
0:55
follow or subscribe wherever you are listening.
0:57
It helps a ton. Everything lives at shipitweekly.fm.
1:00
Also, I'm starting another round of interviews.
1:03
If you'd like to come on and talk about real
1:05
world ops, hit me up at shipitweekly.fm. All
1:09
right, let's get into it. Four main stories for
1:12
today. GitHub reworked layered abuse defenses
1:15
after legacy rules blocked legitimate traffic.
1:18
Kubernetes nodes/proxy get, the telemetry permission
1:22
that can turn into cluster-wide RCE. HCP Vault
1:26
and what actually stayed up during a real aws
1:29
regional disruption. AWS PCI DSS scope expansion
1:33
and the operational reality of compliance scope
1:37
changes. Then a quick lightning round. Then a
1:40
human closer on reasonable assurance turning
1:43
into busywork and what to do about it. Some
1:49
of GitHub's legacy defenses are blocking legitimate
1:52
traffic. So what happened? GitHub had users hitting
1:55
unexpected too-many-requests errors. And it's
1:59
not because everyone suddenly got evil, but because
2:02
old abuse mitigation rules were still active
2:05
long after the original incidents that created
2:08
them. So they went back, traced it, and reworked
2:11
how layered defenses are managed, including how
2:15
those rules get maintained and retired. This
2:18
is a February 2026 story, and it's the kind of
2:21
thing every platform team recognizes instantly.
2:25
So why does it matter? If you ship defensive
2:27
controls and you don't give them a lifecycle,
2:30
they become permanent production dependencies.
2:32
And that's the sneaky part. A security layer
2:35
is not just security. It sits on the request
2:38
path. It can block revenue. It can break signups.
2:41
It can break API clients. And it can create phantom
2:45
outages that look like the app is slow when the
2:48
app is fine. The failure mode is also brutal
2:51
for teams because when the source of the problem
2:54
is an old mitigation rule, the people on call
2:57
might not even know it exists. There's no muscle
3:00
memory, there's no runbook, there are no customer
3:03
complaints, just a bunch of dashboards that lie
3:06
by omission. Old mitigations are like dead code
3:09
until they suddenly run your business. So what
3:12
do you need to do Monday? Do a quick inventory
3:15
of your traffic safety layers. That could be
3:18
CDN rules, WAF rules, bot protections, rate limits,
3:23
edge middleware, app throttles, whatever sits
3:26
between the user and your service. Now you need
3:29
to ask yourself two annoying questions. Who owns
3:32
each layer? Not the team, an actual owner. And
3:35
how do you disable it safely if it starts blocking
3:38
legitimate traffic? If you don't have a kill
3:40
switch plan, you need to make one. Even if it's
3:43
ugly, even if it's "toggle this feature flag and
3:46
accept higher risk for an hour."
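For what that kill switch can look like in code, here's a minimal sketch. The rate limiter, the request object, and the RATE_LIMIT_KILL_SWITCH flag are all hypothetical placeholders for whatever your edge actually uses; the point is that the bypass is explicit, logged, and easy to flip back.

```python
# Minimal sketch of a kill switch around a defensive layer. The
# rate_limiter, request, and downstream app objects are hypothetical;
# wire the flag to whatever feature-flag system you already run.
import logging
import os

log = logging.getLogger("edge")

def rate_limiting_enabled() -> bool:
    # Hypothetical flag: set RATE_LIMIT_KILL_SWITCH=on to bypass the layer.
    return os.environ.get("RATE_LIMIT_KILL_SWITCH", "off") != "on"

def handle_request(request, rate_limiter, app):
    if rate_limiting_enabled():
        if not rate_limiter.allow(request.client_id):
            return {"status": 429, "body": "too many requests"}
    else:
        # Accepting higher risk on purpose; make that loudly visible.
        log.warning("rate limiting bypassed via kill switch for %s",
                    request.client_id)
    return app(request)
```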
3:49
Then add one reliability metric you probably don't track today.
3:52
False positives. Not in a philosophical way,
3:55
in a "how many legitimate requests did we reject"
3:58
way. Because once your defensive layer starts
4:01
rejecting good traffic, it's not a security success.
4:05
It's an availability incident with a security label on it.
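If you want a starting point for that false-positive number, here's a minimal sketch. It assumes your edge or WAF decision logs land somewhere as JSON lines; the action, rule_id, and client_id field names are hypothetical and would need to be swapped for your real log schema.

```python
# Minimal sketch: estimate how often a defensive layer rejects traffic
# you already believe is legitimate. Assumes JSON-lines decision logs
# with hypothetical fields: "action" (allow/block), "rule_id", and
# "client_id"; swap these for whatever your edge actually emits.
import json
from collections import Counter

def rejection_report(log_path, known_good_clients):
    blocks = Counter()          # total blocks per rule
    suspect_blocks = Counter()  # blocks that hit known-good clients
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("action") != "block":
                continue
            rule = event.get("rule_id", "unknown")
            blocks[rule] += 1
            if event.get("client_id") in known_good_clients:
                suspect_blocks[rule] += 1
    for rule, total in blocks.most_common():
        bad = suspect_blocks[rule]
        print(f"{rule}: {total} blocks, {bad} hit known-good clients")

# Example: clients you already trust (paying customers, internal services).
rejection_report("edge-decisions.jsonl",
                 known_good_clients={"internal-ci", "mobile-app-prod"})
```

Even a rough per-rule number tells you which legacy mitigations to look at first.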
4:07
Okay, that's story one.
4:14
For our second main story today, it's Kubernetes nodes/proxy
4:16
get and the read-only permission that
4:19
isn't. So what happened? There is a Kubernetes
4:22
RBAC permission that a lot of orgs grant to monitoring
4:25
and observability tooling. It looks harmless.
4:28
It's get on nodes/proxy. The intent is
4:32
basically let this thing scrape node metrics,
4:34
stats, logs, health endpoints, that kind of
4:38
stuff. But research and write-ups in late January
4:40
2026 show how that permission can be abused to
4:44
reach the kubelet API and turn into arbitrary
4:47
command execution in pods. The punchline is simple.
4:51
A permission that teams treat like read-only
4:53
telemetry can collapse trust boundaries if you
4:57
hand it out broadly. So why does this matter?
4:59
This is exactly how clusters get popped in real
5:02
life. Not always by some fancy zero day. Sometimes
5:06
by a permission that was granted for convenience.
5:10
Because observability stacks are the classic
5:12
"just give it cluster-admin" snowball. You start
5:15
with it just needs to scrape metrics. Then it
5:18
just needs to list pods. Then it needs node stats.
5:21
Then you're granting node proxy access because
5:24
charts and docs tell you to. And the scary part
5:27
is psychological. Permissions like this don't
5:29
trigger your gut. They don't look like exec.
5:32
They don't look like secrets. They look like
5:34
plumbing. So nobody thinks to threat model it.
5:36
Kubernetes RBAC is a minefield because the dangerous
5:40
stuff looks boring. So what do you need to do
5:42
Monday? First, go find it. Search your cluster
5:45
roles for nodes/proxy. Then list every
5:48
subject that binds to those roles. Don't stop
5:51
at Prometheus. Look for logging agents, APM agents,
5:55
cluster UIs, anything installed by Helm charts
5:58
with default RBAC. Second, force the uncomfortable
6:01
conversation. Which tool truly needs this? Is
6:05
it using it or is it just granted because it
6:08
was in the chart? If you truly need it, scope
6:10
it tighter. Separate service accounts per tool.
6:13
Limit namespace access where possible, and document
6:17
the justification so six months from now, somebody
6:20
knows why it exists. Third, add detection where
6:23
it matters. A lot of teams only monitor Kubernetes
6:27
API server audit logs and call it a day. But
6:30
if the abuse path is kubelet-level behavior
6:33
reached through proxying, you want visibility
6:36
into that access pattern too. At minimum, treat
6:40
node proxy access granted as a review trigger.
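For a little of that visibility, here's a minimal sketch that flags nodes/proxy requests in API server audit logs, assuming you ship them somewhere as JSON lines. It won't show you what happened at the kubelet afterwards, but it does show who is actually exercising the permission.

```python
# Minimal sketch: flag nodes/proxy usage in Kubernetes API server audit
# logs (JSON lines). This only shows who reached the proxy subresource
# through the API server, not what the kubelet did afterwards.
import json

EXPECTED_USERS = {
    # Service accounts you have explicitly decided may use nodes/proxy.
    "system:serviceaccount:monitoring:prometheus",
}

def flag_node_proxy_access(audit_log_path):
    with open(audit_log_path) as f:
        for line in f:
            event = json.loads(line)
            ref = event.get("objectRef") or {}
            if ref.get("resource") != "nodes" or ref.get("subresource") != "proxy":
                continue
            user = (event.get("user") or {}).get("username", "unknown")
            marker = "" if user in EXPECTED_USERS else "  <-- REVIEW"
            print(f'{event.get("stageTimestamp")} {user} {event.get("verb")} '
                  f'node={ref.get("name")}{marker}')

flag_node_proxy_access("kube-apiserver-audit.jsonl")
```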
6:44
That permission should not be invisible.
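And if you want to script the "go find it" step from earlier, here's a minimal sketch using the official Kubernetes Python client; the same search works with kubectl plus some JSON filtering.

```python
# Minimal sketch: find ClusterRoles that grant "get" on nodes/proxy and
# list the subjects bound to them. Uses the official kubernetes client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
rbac = client.RbacAuthorizationV1Api()

def grants_node_proxy(rule):
    resources = rule.resources or []
    verbs = rule.verbs or []
    return ("nodes/proxy" in resources or "*" in resources) and \
           ("get" in verbs or "*" in verbs)

risky_roles = {
    role.metadata.name
    for role in rbac.list_cluster_role().items
    if any(grants_node_proxy(rule) for rule in (role.rules or []))
}

for binding in rbac.list_cluster_role_binding().items:
    if binding.role_ref.name in risky_roles:
        for subject in binding.subjects or []:
            ns = getattr(subject, "namespace", None) or "-"
            print(f"{binding.role_ref.name}: {subject.kind} {ns}/{subject.name}")
```

Anything that shows up in that list and isn't on your short list of approved tooling is worth the uncomfortable conversation.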
6:47
That's story two. Story three is HCP Vault resilience
6:54
during a real AWS regional disruption. Okay,
6:58
so what happened? HashiCorp published a write-up
7:00
on how HCP Vault behaved during a real AWS
7:05
us-east-1 disruption. Their control plane experienced
7:08
elevated HTTP 500s and intermittent panics around
7:13
7 a.m. UTC. But they say customer HCP Vault
7:17
Dedicated clusters maintained 100% uptime and
7:21
kept serving workloads. So the management plane
7:24
had issues while the data plane stayed up. That's
7:27
the key. So why does this matter? This is the
7:30
kind of architecture decision that sounds like
7:32
overkill until it saves you. Control plane outages
7:35
are common, not because everyone is bad at engineering,
7:39
but because control planes are complex and they're
7:42
multi-tenant and often exposed to weird edge
7:45
cases. But customers don't care if your admin
7:48
dashboard is having a bad day. They care if secret
7:51
resolution fails and their apps stop booting.
7:55
So that separation matters. If your management
7:58
layer being flaky can break your production usage
8:01
path, you built one shared blast radius. And
8:04
you see this pattern everywhere. "Terraform is
8:07
down so we can't deploy" is annoying. "Terraform
8:09
is down so production can't read config" is unacceptable.
8:13
If admin plane downtime breaks prod reads, you
8:17
don't have separation. Okay, so what are the
8:19
action items? Do this for your top three critical
8:22
systems. Write down what is control plane and
8:25
what is data plane. Then answer one brutally
8:28
honest question. If the control plane disappears
8:31
for two hours, what still works? Can apps still
8:35
authenticate? Can apps still read what they need
8:38
to run? Can you still scale? Can you still recover?
8:41
Also, check your runbooks. A lot of runbooks
8:44
quietly assume the control plane is healthy.
8:47
They tell you to click here, or run this automation,
8:50
or use this UI. If the whole point is resilience,
8:54
write the "control plane is down" path too. Even
8:57
if it's ugly CLI, because the day you need it,
8:59
you will not be in a calm and well-rested mental
9:02
state.
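One way to make that "control plane is down" path concrete is a data-plane smoke test that reads a secret exactly the way an app would, with no management console involved. Here's a minimal sketch against Vault's KV v2 HTTP API; the address, token source, and secret path are placeholders for your real production read path.

```python
# Minimal sketch: a data-plane smoke test that reads a secret straight
# from the Vault cluster's HTTP API, the same way an app would, without
# touching any management UI or control plane. Address, token, and
# secret path below are placeholders.
import os
import sys
import requests

VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.example.internal:8200")
VAULT_TOKEN = os.environ["VAULT_TOKEN"]          # e.g. from your app auth flow
SECRET_PATH = "v1/secret/data/myapp/config"      # KV v2 read path (placeholder)

def data_plane_ok():
    resp = requests.get(
        f"{VAULT_ADDR}/{SECRET_PATH}",
        headers={"X-Vault-Token": VAULT_TOKEN},
        timeout=5,
    )
    if resp.status_code != 200:
        print(f"data plane read failed: HTTP {resp.status_code}")
        return False
    keys = list(resp.json()["data"]["data"].keys())
    print(f"data plane read ok, keys: {keys}")
    return True

if __name__ == "__main__":
    sys.exit(0 if data_plane_ok() else 1)
```

Run it from the same network path your apps use. If it passes while the admin plane is having a bad day, you actually have the separation this story is about.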
9:09
Okay, that's story three. Story four is AWS PCI DSS compliance package expansion and
9:13
what it really means for teams. So AWS announced
9:17
updates to their fall 2025 PCI DSS compliance
9:21
package. They added two services to the scope
9:24
of their PCI DSS certification. AWS Security
9:27
Incident Response, and AWS Transform. They also
9:31
added the Asia Pacific (Taipei) region to the
9:34
PCI DSS scope. On paper, that sounds like a nice
9:38
checkbox update. In practice, scope changes
9:41
have consequences. So why does it matter? Compliance
9:44
scope changes do not usually break production,
9:47
but they absolutely create work. And they create
9:50
it in the most dangerous way. Slow, distributed,
9:54
easy to underestimate, and easy to turn into
9:57
chaos when audit season hits. So here's the trap.
10:00
A scope change lands. Security is happy. Leadership
10:04
is happy. Then six months later, some poor team
10:07
is asked to re-prove a pile of controls, except
10:10
now the region list changed, the service list
10:13
changed, and nobody knows what "in scope" even means
10:17
in your org. That's how reasonable assurance turns
10:20
into hours of evidence churn. And it's why compliance
10:23
can feel like it fights delivery, even when it's
10:26
trying to protect the business. Audits aren't
10:29
the enemy. Recreating proof from scratch is.
10:32
So if this compliance change affects you and
10:34
you're in a PCI environment, do a quick boundary
10:37
check. What accounts are in scope? What regions
10:40
are allowed? What services are approved? Then
10:42
decide how you will generate evidence, not "we'll
10:45
do it when asked." Pick the repeatable pieces
10:48
now. A reusable evidence package is the goal.
10:51
One place that says what the control is, how
10:54
it's implemented, and how you prove it. You can
10:57
point to it, update it, and stop rewriting narratives
11:00
every time a new spreadsheet shows up. If you
11:03
want compliance to be sustainable, you have to
11:06
treat evidence like a product, not like a one-off
11:09
homework assignment.
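Here's a minimal sketch of what treating evidence like a product can look like: one collector per control, one dated artifact you rerun instead of narratives you rewrite. The control names and collector functions are hypothetical examples; aws ec2 describe-regions is just one easy piece of scope evidence to automate.

```python
# Minimal sketch: treat compliance evidence as a rerunnable artifact.
# Each control maps to a collector function that returns the raw proof;
# the script writes one dated JSON bundle you can hand to an auditor.
# Control names and collectors here are hypothetical examples.
import json
import subprocess
from datetime import datetime, timezone

def in_scope_regions():
    # Example collector: which AWS regions this account can see/use.
    out = subprocess.run(
        ["aws", "ec2", "describe-regions", "--output", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

def access_review_status():
    # Placeholder: pull this from your IdP or access-review tooling.
    return {"last_review": "2026-01-31", "reviewer": "platform-oncall"}

CONTROLS = {
    "pci-scope-regions": in_scope_regions,
    "quarterly-access-review": access_review_status,
}

def build_evidence_bundle(path="evidence-bundle.json"):
    bundle = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "controls": {name: collect() for name, collect in CONTROLS.items()},
    }
    with open(path, "w") as f:
        json.dump(bundle, f, indent=2)
    print(f"wrote {path} with {len(CONTROLS)} controls")

if __name__ == "__main__":
    build_evidence_bundle()
```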
11:11
Okay, that's story four. Now it's time for the lightning round.
11:20
Here's some quick hits. GitHub Actions extended
11:23
the timeline for self-hosted runner minimum
11:26
version enforcement. Starting March 16th, 2026,
11:30
older self-hosted runners get blocked if they're
11:33
below the minimum version, with a brownout period
11:36
between February 16th and March 16th to help
11:39
you find the stragglers. This is one of those
11:42
"it won't be urgent until it's urgent" things.
11:45
If you run self-hosted runners, go check what
11:47
versions are actually deployed. Next, Headlamp,
11:50
the Kubernetes UI project, is now officially
11:53
part of the Kubernetes SIG UI, and they posted
11:57
a 2025 highlights recap. The reason I care about
12:00
this is simple. Kubernetes UIs are finally moving
12:03
beyond toy dashboards into useful day-to-day
12:06
tooling. And teams need sane, supported ways
12:09
to visualize cluster state without giving everyone
12:13
kubectl to prod. And last, AWS Network Firewall
12:17
Active Threat Defense. AWS wrote up how it draws
12:20
near real-time intelligence from MadPot, their
12:24
honeypot sensor network, and uses that to detect
12:27
and block threats faster. Even if you never buy
12:30
the feature, the idea is worth stealing. Speed
12:33
matters. Threat intel that shows up three weeks
12:36
later is mostly trivia. Okay, that's the lightning
12:39
round. Time for the human closer. And this week
12:50
we're going to be talking about reasonable assurance
12:51
turning into busywork. There's a thread in r/sre
12:54
that's basically the most relatable
12:57
sentence ever: "At what point does reasonable
13:00
assurance turn into busywork?" And the vibe is
13:03
not "we hate audits." It's more like, why are we
13:06
spending engineer time formatting proof instead
13:08
of reducing risk? That thread pairs perfectly
13:11
with the GitHub story this week. GitHub had legacy
13:14
protections that stuck around. They outlived
13:17
the incident that created them and started blocking
13:19
real users. That's what happens when a control
13:22
doesn't have an owner and a lifecycle. Compliance
13:25
busywork is the same failure mode, just slower.
13:28
A control exists. Maybe it's good. Maybe it's
13:31
necessary. But over time, the evidence requests
13:34
multiply. The templates change. The risk language
13:38
changes. And suddenly, the work is not "prove
13:40
this system is safe." It's "translate what we already
13:44
do into 12 different formats." One comment in
13:47
this thread nails it: "If it's consistent and
13:50
repetitive, automate the documentation." Audits
13:53
matter. Formatting shouldn't. That's the move.
13:56
You don't win by arguing with auditors. You win
13:59
by making your proof boring. One control. One
14:02
source of truth. One evidence artifact that stays
14:06
alive. And then you update it when reality changes,
14:09
not when somebody sends you a spreadsheet. So
14:12
my Monday challenge is simple. Pick one control
14:15
you get asked about constantly. Access reviews,
14:18
backup validation, change management, whatever.
14:20
Now measure how many engineer hours go into evidence
14:24
churn for that one control in one cycle. If it's
14:28
painful, good. You just found your next high-leverage
14:31
reliability project. Because busywork is not
14:34
just annoying. It steals time from actually making
14:37
the system safer. Okay, that's the human story.
14:40
Time for a quick recap. We talked about how GitHub
14:43
reworked layered defenses after legacy mitigations
14:46
started blocking legit traffic. We talked about
14:49
Kubernetes nodes/proxy get and how it's
14:52
a permission you should treat like a real threat
14:54
boundary. We talked about HCP Vault and how it's
14:57
a clean example of control plane pain not taking
15:00
down data plane availability. And we talked about
15:03
AWS expanding their PCI DSS scope. And that's
15:07
your reminder to productize compliance evidence
15:10
before it becomes chaos. And the human story
15:12
was that reasonable assurance turns into busywork
15:16
the moment you are just formatting proof. If
15:19
you like the show, follow or subscribe wherever
15:21
you're listening. And everything, including the
15:23
show notes and links, will be on shipitweekly.fm.
15:26
And if you want to come on for an interview
15:29
round, reach out at shipitweekly.fm. I'll catch
15:32
you next week. Thanks.