0:00
This week is basically a tour of how the safety system
0:03
becomes the incident. A protection rule that made
0:06
sense once, then quietly started blocking real
0:09
users. A Kubernetes permission that everyone
0:12
treats like read-only, but it absolutely is
0:15
not. A platform example that actually got the
0:18
control plane versus data plane separation right.
0:20
And compliance scope expanding in a way that is
0:24
easy to underestimate until you are drowning
0:27
in evidence requests. So yeah, not a week of
0:30
flashy tech. More like a week of go-check-your-
0:33
assumptions. Hey, I'm Brian, and
0:52
this is Ship It Weekly. If you like the show,
0:55
follow or subscribe wherever you are listening.
0:57
It helps a ton. Everything lives at shipitweekly.fm.
1:00
Also, I'm starting another round of interviews.
1:03
If you'd like to come on and talk about real
1:05
world ops, hit me up at shipitweekly.fm. All
1:09
right, let's get into it. Four main stories for
1:12
today. GitHub reworked layered abuse defenses
1:15
after legacy rules blocked legitimate traffic.
1:18
Kubernetes nodes/proxy get, the telemetry permission
1:22
that can turn into cluster-wide RCE. HCP Vault
1:26
and what actually stayed up during a real aws
1:29
regional disruption. AWS PCI DSS scope expansion
1:33
and the operational reality of compliance scope
1:37
changes. Then a quick lightning round. Then a
1:40
human closer on reasonable assurance turning
1:43
into busywork and what to do about it. Some
1:49
of GitHub's legacy defenses are blocking legitimate
1:52
traffic. So what happened? GitHub had users hitting
1:55
unexpected too-many-requests errors. And it's
1:59
not because everyone suddenly got evil, but because
2:02
old abuse mitigation rules were still active
2:05
long after the original incidents that created
2:08
them. So they went back, traced it, and reworked
2:11
how layered defenses are managed, including how
2:15
those rules get maintained and retired. This
2:18
is a February 2026 story, and it's the kind of
2:21
thing every platform team recognizes instantly.
2:25
So why does it matter? If you ship defensive
2:27
controls and you don't give them a lifecycle,
2:30
they become permanent production dependencies.
2:32
And that's the sneaky part. A security layer
2:35
is not just security. It sits on the request
2:38
path. It can block revenue. It can break signups.
2:41
It can break API clients. And it can create phantom
2:45
outages that look like the app is slow when the
2:48
app is fine. The failure mode is also brutal
2:51
for teams because when the source of the problem
2:54
is an old mitigation rule, the people on call
2:57
might not even know it exists. There's no muscle
3:00
memory, there's no runbook, there are no customer
3:03
complaints, just a bunch of dashboards that lie
3:06
by omission. Old mitigations are like dead code
3:09
until they suddenly run your business. So what
3:12
do you need to do Monday? Do a quick inventory
3:15
of your traffic safety layers. That could be
3:18
CDN rules, WAF rules, bot protections, rate limits,
3:23
edge middleware, app throttles, whatever sits
3:26
between the user and your service. Now you need
3:29
to ask yourself two annoying questions. Who owns
3:32
each layer? Not the team, an actual owner. And
3:35
how do you disable it safely if it starts blocking
3:38
legitimate traffic? If you don't have a kill
3:40
switch plan, you need to make one. Even if it's
3:43
ugly, even if it's "toggle this feature flag and
3:46
accept higher risk for an hour."
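For what that kill switch can look like in code, here's a minimal sketch. The rate limiter, the request object, and the RATE_LIMIT_KILL_SWITCH flag are all hypothetical placeholders for whatever your edge actually uses; the point is that the bypass is explicit, logged, and easy to flip back.

```python
# Minimal sketch of a kill switch around a defensive layer. The
# rate_limiter, request, and downstream app objects are hypothetical;
# wire the flag to whatever feature-flag system you already run.
import logging
import os

log = logging.getLogger("edge")

def rate_limiting_enabled() -> bool:
    # Hypothetical flag: set RATE_LIMIT_KILL_SWITCH=on to bypass the layer.
    return os.environ.get("RATE_LIMIT_KILL_SWITCH", "off") != "on"

def handle_request(request, rate_limiter, app):
    if rate_limiting_enabled():
        if not rate_limiter.allow(request.client_id):
            return {"status": 429, "body": "too many requests"}
    else:
        # Accepting higher risk on purpose; make that loudly visible.
        log.warning("rate limiting bypassed via kill switch for %s",
                    request.client_id)
    return app(request)
```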
3:49
Then add one reliability metric you probably don't track today.
3:52
False positives. Not in a philosophical way,
3:55
in a "how many legitimate requests did we reject"
3:58
way. Because once your defensive layer starts
4:01
rejecting good traffic, it's not a security success.
4:05
It's an availability incident with a security label on it.
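If you want a starting point for that false-positive number, here's a minimal sketch. It assumes your edge or WAF decision logs land somewhere as JSON lines; the action, rule_id, and client_id field names are hypothetical and would need to be swapped for your real log schema.

```python
# Minimal sketch: estimate how often a defensive layer rejects traffic
# you already believe is legitimate. Assumes JSON-lines decision logs
# with hypothetical fields: "action" (allow/block), "rule_id", and
# "client_id"; swap these for whatever your edge actually emits.
import json
from collections import Counter

def rejection_report(log_path, known_good_clients):
    blocks = Counter()          # total blocks per rule
    suspect_blocks = Counter()  # blocks that hit known-good clients
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            if event.get("action") != "block":
                continue
            rule = event.get("rule_id", "unknown")
            blocks[rule] += 1
            if event.get("client_id") in known_good_clients:
                suspect_blocks[rule] += 1
    for rule, total in blocks.most_common():
        bad = suspect_blocks[rule]
        print(f"{rule}: {total} blocks, {bad} hit known-good clients")

# Example: clients you already trust (paying customers, internal services).
rejection_report("edge-decisions.jsonl",
                 known_good_clients={"internal-ci", "mobile-app-prod"})
```

Even a rough per-rule number tells you which legacy mitigations to look at first.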
4:07
Okay, that's story one.
4:14
For our second main story today, it's Kubernetes nodes/proxy
4:16
get and the read-only permission that
4:19
isn't. So what happened? There is a Kubernetes
4:22
RBAC permission that a lot of orgs grant to monitoring
4:25
and observability tooling. It looks harmless.
4:28
It's get on nodes/proxy. The intent is
4:32
basically let this thing scrape node metrics,
4:34
stats, logs, health endpoints, that kind of
4:38
stuff. But research and write-ups in late January
4:40
2026 show how that permission can be abused to
4:44
reach the kubelet API and turn into arbitrary
4:47
command execution in pods. The punchline is simple.
4:51
A permission that teams treat like read-only
4:53
telemetry can collapse trust boundaries if you
4:57
hand it out broadly. So why does this matter?
4:59
This is exactly how clusters get popped in real
5:02
life. Not always by some fancy zero day. Sometimes
5:06
by a permission that was granted for convenience.
5:10
Because observability stacks are the classic
5:12
"just give it cluster-admin" snowball. You start
5:15
with it just needs to scrape metrics. Then it
5:18
just needs to list pods. Then it needs node stats.
5:21
Then you're granting node proxy access because
5:24
charts and docs tell you to. And the scary part
5:27
is psychological. Permissions like this don't
5:29
trigger your gut. They don't look like exec.
5:32
They don't look like secrets. They look like
5:34
plumbing. So nobody thinks to threat model it.
5:36
Kubernetes RBAC is a minefield because the dangerous
5:40
stuff looks boring. So what do you need to do
5:42
Monday? First, go find it. Search your cluster
5:45
roles for nodes/proxy. Then list every
5:48
subject that binds to those roles. Don't stop
5:51
at Prometheus. Look for logging agents, APM agents,
5:55
cluster UIs, anything installed by Helm charts
5:58
with default RBAC. Second, force the uncomfortable
6:01
conversation. Which tool truly needs this? Is
6:05
it using it or is it just granted because it
6:08
was in the chart? If you truly need it, scope
6:10
it tighter. Separate service accounts per tool.
6:13
Limit namespace access where possible, and document
6:17
the justification so six months from now, somebody
6:20
knows why it exists. Third, add detection where
6:23
it matters. A lot of teams only monitor Kubernetes
6:27
API server audit logs and call it a day. But
6:30
if the abuse path is kubelet-level behavior
6:33
reached through proxying, you want visibility
6:36
into that access pattern too. At minimum, treat
6:40
node proxy access granted as a review trigger.
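For a little of that visibility, here's a minimal sketch that flags nodes/proxy requests in API server audit logs, assuming you ship them somewhere as JSON lines. It won't show you what happened at the kubelet afterwards, but it does show who is actually exercising the permission.

```python
# Minimal sketch: flag nodes/proxy usage in Kubernetes API server audit
# logs (JSON lines). This only shows who reached the proxy subresource
# through the API server, not what the kubelet did afterwards.
import json

EXPECTED_USERS = {
    # Service accounts you have explicitly decided may use nodes/proxy.
    "system:serviceaccount:monitoring:prometheus",
}

def flag_node_proxy_access(audit_log_path):
    with open(audit_log_path) as f:
        for line in f:
            event = json.loads(line)
            ref = event.get("objectRef") or {}
            if ref.get("resource") != "nodes" or ref.get("subresource") != "proxy":
                continue
            user = (event.get("user") or {}).get("username", "unknown")
            marker = "" if user in EXPECTED_USERS else "  <-- REVIEW"
            print(f'{event.get("stageTimestamp")} {user} {event.get("verb")} '
                  f'node={ref.get("name")}{marker}')

flag_node_proxy_access("kube-apiserver-audit.jsonl")
```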
6:44
That permission should not be invisible.
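And if you want to script the "go find it" step from earlier, here's a minimal sketch using the official Kubernetes Python client; the same search works with kubectl plus some JSON filtering.

```python
# Minimal sketch: find ClusterRoles that grant "get" on nodes/proxy and
# list the subjects bound to them. Uses the official kubernetes client.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod
rbac = client.RbacAuthorizationV1Api()

def grants_node_proxy(rule):
    resources = rule.resources or []
    verbs = rule.verbs or []
    return ("nodes/proxy" in resources or "*" in resources) and \
           ("get" in verbs or "*" in verbs)

risky_roles = {
    role.metadata.name
    for role in rbac.list_cluster_role().items
    if any(grants_node_proxy(rule) for rule in (role.rules or []))
}

for binding in rbac.list_cluster_role_binding().items:
    if binding.role_ref.name in risky_roles:
        for subject in binding.subjects or []:
            ns = getattr(subject, "namespace", None) or "-"
            print(f"{binding.role_ref.name}: {subject.kind} {ns}/{subject.name}")
```

Anything that shows up in that list and isn't on your short list of approved tooling is worth the uncomfortable conversation.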
6:47
That's story two. Story three is HCP Vault resilience
6:54
during a real AWS regional disruption. Okay,
6:58
so what happened? HashiCorp published a write-up
7:00
on how HCP Vault behaved during a real AWS
7:05
us-east-1 disruption. Their control plane experienced
7:08
elevated HTTP 500s and intermittent panics around
7:13
7 a.m. UTC. But they say customer HCP Vault
7:17
Dedicated clusters maintained 100% uptime and
7:21
kept serving workloads. So the management plane
7:24
had issues while the data plane stayed up. That's
7:27
the key. So why does this matter? This is the
7:30
kind of architecture decision that sounds like
7:32
overkill until it saves you. Control plane outages
7:35
are common, not because everyone is bad at engineering,
7:39
but because control planes are complex and they're
7:42
multi-tenant and often exposed to weird edge
7:45
cases. But customers don't care if your admin
7:48
dashboard is having a bad day. They care if secret
7:51
resolution fails and their apps stop booting.
7:55
So that separation matters. If your management
7:58
layer being flaky can break your production usage
8:01
path, you built one shared blast radius. And
8:04
you see this pattern everywhere. "Terraform is
8:07
down so we can't deploy" is annoying. "Terraform
8:09
is down so production can't read config" is unacceptable.
8:13
If admin plane downtime breaks prod reads, you
8:17
don't have separation. Okay, so what are the
8:19
action items? Do this for your top three critical
8:22
systems. Write down what is control plane and
8:25
what is data plane. Then answer one brutally
8:28
honest question. If the control plane disappears
8:31
for two hours, what still works? Can apps still
8:35
authenticate? Can apps still read what they need
8:38
to run? Can you still scale? Can you still recover?
8:41
Also, check your runbooks. A lot of runbooks
8:44
quietly assume the control plane is healthy.
8:47
They tell you to click here, or run this automation,
8:50
or use this UI. If the whole point is resilience,
8:54
write the "control plane is down" path too. Even
8:57
if it's ugly CLI, because the day you need it,
8:59
you will not be in a calm and well-rested mental
9:02
state.
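One way to make that "control plane is down" path concrete is a data-plane smoke test that reads a secret exactly the way an app would, with no management console involved. Here's a minimal sketch against Vault's KV v2 HTTP API; the address, token source, and secret path are placeholders for your real production read path.

```python
# Minimal sketch: a data-plane smoke test that reads a secret straight
# from the Vault cluster's HTTP API, the same way an app would, without
# touching any management UI or control plane. Address, token, and
# secret path below are placeholders.
import os
import sys
import requests

VAULT_ADDR = os.environ.get("VAULT_ADDR", "https://vault.example.internal:8200")
VAULT_TOKEN = os.environ["VAULT_TOKEN"]          # e.g. from your app auth flow
SECRET_PATH = "v1/secret/data/myapp/config"      # KV v2 read path (placeholder)

def data_plane_ok():
    resp = requests.get(
        f"{VAULT_ADDR}/{SECRET_PATH}",
        headers={"X-Vault-Token": VAULT_TOKEN},
        timeout=5,
    )
    if resp.status_code != 200:
        print(f"data plane read failed: HTTP {resp.status_code}")
        return False
    keys = list(resp.json()["data"]["data"].keys())
    print(f"data plane read ok, keys: {keys}")
    return True

if __name__ == "__main__":
    sys.exit(0 if data_plane_ok() else 1)
```

Run it from the same network path your apps use. If it passes while the admin plane is having a bad day, you actually have the separation this story is about.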
9:09
Okay, that's story three. Story four is AWS PCI DSS compliance package expansion and
9:13
what it really means for teams. So AWS announced
9:17
updates to their fall 2025 PCI DSS compliance
9:21
package. They added two services to the scope
9:24
of their PCI DSS certification. AWS Security
9:27
Incident Response, and AWS Transform. They also
9:31
added the Asia Pacific (Taipei) region to the
9:34
PCI DSS scope. On paper, that sounds like a nice
9:38
checkbox update. In practice, scope changes
9:41
have consequences. So why does it matter? Compliance
9:44
scope changes do not usually break production,
9:47
but they absolutely create work. And they create
9:50
it in the most dangerous way. Slow, distributed,
9:54
easy to underestimate, and easy to turn into
9:57
chaos when audit season hits. So here's the trap.
10:00
A scope change lands. Security is happy. Leadership
10:04
is happy. Then six months later, some poor team
10:07
is asked to re-prove a pile of controls, except
10:10
now the region list changed, the service list
10:13
changed, and nobody knows what "in scope" even means
10:17
in your org. That's how reasonable assurance turns
10:20
into hours of evidence churn. And it's why compliance
10:23
can feel like it fights delivery, even when it's
10:26
trying to protect the business. Audits aren't
10:29
the enemy. Recreating proof from scratch is.
10:32
So if this compliance change affects you and
10:34
you're in a PCI environment, do a quick boundary
10:37
check. What accounts are in scope? What regions
10:40
are allowed? What services are approved? Then
10:42
decide how you will generate evidence, not "we'll
10:45
do it when asked." Pick the repeatable pieces
10:48
now. A reusable evidence package is the goal.
10:51
One place that says what the control is, how
10:54
it's implemented, and how you prove it. You can
10:57
point to it, update it, and stop rewriting narratives
11:00
every time a new spreadsheet shows up. If you
11:03
want compliance to be sustainable, you have to
11:06
treat evidence like a product, not like a one-off
11:09
homework assignment.
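Here's a minimal sketch of what treating evidence like a product can look like: one collector per control, one dated artifact you rerun instead of narratives you rewrite. The control names and collector functions are hypothetical examples; aws ec2 describe-regions is just one easy piece of scope evidence to automate.

```python
# Minimal sketch: treat compliance evidence as a rerunnable artifact.
# Each control maps to a collector function that returns the raw proof;
# the script writes one dated JSON bundle you can hand to an auditor.
# Control names and collectors here are hypothetical examples.
import json
import subprocess
from datetime import datetime, timezone

def in_scope_regions():
    # Example collector: which AWS regions this account can see/use.
    out = subprocess.run(
        ["aws", "ec2", "describe-regions", "--output", "json"],
        capture_output=True, text=True, check=True,
    )
    return json.loads(out.stdout)

def access_review_status():
    # Placeholder: pull this from your IdP or access-review tooling.
    return {"last_review": "2026-01-31", "reviewer": "platform-oncall"}

CONTROLS = {
    "pci-scope-regions": in_scope_regions,
    "quarterly-access-review": access_review_status,
}

def build_evidence_bundle(path="evidence-bundle.json"):
    bundle = {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "controls": {name: collect() for name, collect in CONTROLS.items()},
    }
    with open(path, "w") as f:
        json.dump(bundle, f, indent=2)
    print(f"wrote {path} with {len(CONTROLS)} controls")

if __name__ == "__main__":
    build_evidence_bundle()
```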
11:11
Okay, that's story four. Now it's time for the lightning round.
11:20
Here's some quick hits. GitHub Actions extended
11:23
the timeline for self-hosted runner minimum
11:26
version enforcement. Starting March 16th, 2026,
11:30
older self-hosted runners get blocked if they're
11:33
below the minimum version, with a brownout period
11:36
between February 16th and March 16th to help
11:39
you find the stragglers. This is one of those
11:42
"it won't be urgent until it's urgent" things.
11:45
If you run self-hosted runners, go check what
11:47
versions are actually deployed. Next, Headlamp,
11:50
the Kubernetes UI project, is now officially
11:53
part of the Kubernetes SIG UI, and they posted
11:57
a 2025 highlights recap. The reason I care about
12:00
this is simple. Kubernetes UIs are finally moving
12:03
beyond toy dashboards into useful day-to-day
12:06
tooling. And teams need sane, supported ways
12:09
to visualize cluster state without giving everyone
12:13
kubectl to prod. And last, AWS Network Firewall
12:17
Active Threat Defense. AWS wrote up how it draws
12:20
near real-time intelligence from MadPot, their
12:24
honeypot sensor network, and uses that to detect
12:27
and block threats faster. Even if you never buy
12:30
the feature, the idea is worth stealing. Speed
12:33
matters. Threat intel that shows up three weeks
12:36
later is mostly trivia. Okay, that's the lightning
12:39
round. Time for the human closer. And this week
12:50
we're going to be talking about reasonable assurance
12:51
turning into busywork. There's a thread in r/sre
12:54
that's basically the most relatable
12:57
sentence ever: "At what point does reasonable
13:00
assurance turn into busywork?" And the vibe is
13:03
not "we hate audits." It's more like, why are we
13:06
spending engineer time formatting proof instead
13:08
of reducing risk? That thread pairs perfectly
13:11
with the GitHub story this week. GitHub had legacy
13:14
protections that stuck around. They outlived
13:17
the incident that created them and started blocking
13:19
real users. That's what happens when a control
13:22
doesn't have an owner and a lifecycle. Compliance
13:25
busywork is the same failure mode, just slower.
13:28
A control exists. Maybe it's good. Maybe it's
13:31
necessary. But over time, the evidence requests
13:34
multiply. The templates change. The risk language
13:38
changes. And suddenly, the work is not "prove
13:40
this system is safe." It's "translate what we already
13:44
do into 12 different formats." One comment in
13:47
this thread nails it: "If it's consistent and
13:50
repetitive, automate the documentation." Audits
13:53
matter. Formatting shouldn't. That's the move.
13:56
You don't win by arguing with auditors. You win
13:59
by making your proof boring. One control. One
14:02
source of truth. One evidence artifact that stays
14:06
alive. And then you update it when reality changes,
14:09
not when somebody sends you a spreadsheet. So
14:12
my Monday challenge is simple. Pick one control
14:15
you get asked about constantly. Access reviews,
14:18
backup validation, change management, whatever.
14:20
Now measure how many engineer hours go into evidence
14:24
churn for that one control in one cycle. If it's
14:28
painful, good. You just found your next high-leverage
14:31
reliability project. Because busywork is not
14:34
just annoying. It steals time from actually making
14:37
the system safer. Okay, that's the human story.
14:40
Time for a quick recap. We talked about how GitHub
14:43
reworked layered defenses after legacy mitigations
14:46
started blocking legit traffic. We talked about
14:49
Kubernetes nodes/proxy get and how it's
14:52
a permission you should treat like a real threat
14:54
boundary. We talked about HCP Vault and how it's
14:57
a clean example of control plane pain not taking
15:00
down data plane availability. And we talked about
15:03
AWS expanding their PCI DSS scope. And that's
15:07
your reminder to productize compliance evidence
15:10
before it becomes chaos. And the human story
15:12
was that reasonable assurance turns into busywork
15:16
the moment you are just formatting proof. If
15:19
you like the show, follow or subscribe wherever
15:21
you're listening. And everything, including the
15:23
show notes and links, will be on shipitweekly.fm.
15:26
And if you want to come on for an interview
15:29
round, reach out at shipitweekly.fm. I'll catch
15:32
you next week. Thanks.