0:00
This week is a reminder that the stuff we treat
0:02
like glue is now a primary failure domain. CI
0:06
trigger rules, cert renewals, Helm charts, automation
0:09
tools. One tiny assumption goes sideways and
0:13
suddenly you're dealing with a supply chain risk,
0:16
a global build break, or an RCE in the thing
0:19
that holds all your credentials. Hey, I'm Brian.
0:38
I work in DevOps and SRE and I run Tellers Tech.
0:41
This is Ship It Weekly, where I filter the noise
0:44
and pull out what actually matters when you're
0:47
the one running infrastructure and owning reliability.
0:50
If something's hype, I'll call it hype. If it
0:52
changes how you operate, we'll talk about it.
0:54
Quick bit of housekeeping: the show notes and
0:57
links are on shipitweekly.fm. If the show's
1:00
been useful, follow it wherever you listen. Also,
1:03
a rating helps way more than it should. Four
1:06
main stories for today. First, the CodeBuild breach. A
1:08
CI trigger and filtering issue in a small set
1:11
of AWS-managed repos that's a perfect reminder
1:14
that CI glue is part of your security boundary
1:17
now. Second, the Bazel TLS cert expiry incident.
1:21
The kind of failure that is boring, binary, and
1:24
absolutely capable of blocking your entire engineering
1:27
org. Third, Helm chart reliability. Prequel reviewed
1:31
over a hundred popular charts and the results
1:34
are basically a post-mortem template for why
1:36
"it installed fine" is not a reliability guarantee.
1:39
Fourth, n8n: two new high-severity flaws disclosed
1:44
by JFrog that can lead to code execution. We're
1:48
going to treat this one like a mini story because
1:50
workflow automation tools are basically a control
1:53
plane holding your secrets. Then a quick lightning
1:56
round with a few operator-friendly tools and
1:59
takeaways, and a human closer tying the theme
2:02
together. Alright, let's get into it. So, the CodeBuild breach.
2:10
AWS published a security bulletin describing
2:12
a misconfiguration involving unanchored account_id
2:16
webhook filters for CodeBuild,
2:20
used by a small set of AWS-managed open-source
2:23
repos. AWS says they mitigated it quickly, rotated
2:27
credentials, reviewed logs, and added additional
2:31
mitigations and protections around build processes
2:33
with credentials in memory. Wiz's research frames
2:37
the risk clearly. If you can trigger a privileged
2:40
build in a repo that's part of a supply chain,
2:43
you potentially get access to tokens and credentials
2:46
that can be used to push changes or create malicious
2:50
artifacts. That's why it's not just CI, it becomes
2:54
supply chain. Now, the operator lesson is not
2:57
"CodeBuild is bad." The operator lesson is: stop
3:01
treating pipeline trigger logic like it's harmless.
3:04
If an untrusted event can cause a trusted pipeline
3:07
to run, you do not have CI. You have an execution
3:10
environment exposed to the internet. And almost
3:13
every org drifts towards this risk without meaning
3:17
to. Here's the drift pattern. You start with
3:20
PR checks: lint, test, build. Great. Then someone
3:24
adds integration tests. And those tests need
3:27
credentials to hit an environment, or to pull
3:30
from a private registry, or to call a third-party
3:33
API. Then someone adds preview deployments. Those
3:38
need cloud creds, or at least some deploy token.
3:41
Then someone adds artifact publishing because
3:44
it's easier if the PR build produces the image.
3:47
Now your PR pipeline can build and push images.
3:50
Then someone uses those images in staging. And
3:53
now your PR pipeline is part of your release
3:56
path. And at that point, you've accidentally
3:58
created a supply chain path where a PR can influence
4:02
something that runs in your environment. This
4:05
is why the most dangerous sentence in CI is,
4:08
"it only runs in CI." Because CI is usually the
4:12
thing that holds credentials that can touch everything
4:15
else. So the practical question to ask is simple.
4:19
Can untrusted events cause trusted actions? Can
4:22
a forked PR run a job that has secrets available?
4:26
Can a PR comment trigger a workflow that can
4:30
deploy? Can a PR workflow push artifacts that
4:33
are later deployed? Can a PR workflow assume
4:37
cloud roles? If the answer is yes, or even I'm
4:40
not sure, you've got a boundary problem.
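If you want to turn those questions into something you can actually run, here's a minimal sketch. It assumes GitHub Actions-style workflow files and PyYAML; the trigger names and the checks are illustrative, so adapt them to whatever CI system you actually run.

```python
# Minimal CI trust-boundary audit -- a sketch, not a scanner product.
# Assumes GitHub Actions-style workflows under .github/workflows/.
import pathlib
import yaml  # pip install pyyaml

UNTRUSTED_TRIGGERS = {"pull_request", "pull_request_target", "issue_comment"}

def audit(workflow_dir=".github/workflows"):
    findings = []
    for path in pathlib.Path(workflow_dir).glob("*.y*ml"):
        text = path.read_text()
        doc = yaml.safe_load(text) or {}
        # PyYAML parses the bare key `on:` as boolean True, so check both.
        triggers = doc.get("on", doc.get(True, {}))
        if isinstance(triggers, str):
            trigger_names = {triggers}
        elif isinstance(triggers, list):
            trigger_names = set(triggers)
        else:
            trigger_names = set(triggers or {})
        untrusted = trigger_names & UNTRUSTED_TRIGGERS
        if not untrusted:
            continue
        # Crude but useful signals for "untrusted event, trusted action".
        if "secrets." in text:
            findings.append(f"{path.name}: untrusted triggers {sorted(untrusted)} reference secrets")
        if "pull_request_target" in untrusted:
            findings.append(f"{path.name}: pull_request_target runs with the base repo's permissions")
    return findings

if __name__ == "__main__":
    for finding in audit():
        print("REVIEW:", finding)
```

It won't catch everything, but it answers the first question, can a PR-triggered workflow see secrets, in about a minute.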
4:43
Now, I want to give you a do this Monday playbook
4:46
that doesn't require a big platform rewrite.
4:49
Step one: label pipelines mentally as untrusted
4:53
and trusted. Untrusted pipelines are PRs and forks.
4:57
No secrets, no publish, no deploy. It answers:
5:01
Does it compile? Do tests pass? Trusted pipelines
5:05
run on merges to main, tags, or explicit manual
5:09
approvals. That's the one allowed to publish
5:12
and deploy. Step two: pick one repo, your most
5:16
sensitive one, and map the event chain in plain
5:19
English. What events trigger workflows? Which
5:22
workflows can access secrets? Which workflows
5:25
can write artifacts? Which workflows can deploy
5:28
or change infra? Step three: check for the classic
5:32
foot guns. Are secrets injected into your PR
5:35
workflows at all? Are you using conditions like
5:38
actor checks or branch checks as auth? Are you
5:42
relying on "the code won't print secrets" as a
5:45
control? Do you allow PR builds from forks to
5:49
run with privileged tokens? If you find these,
5:52
the fix is not perfect security. The fix is strong
5:55
separation. PR builds can still run, just keep
5:58
them in the untrusted sandbox. If you need integration
6:02
tests, run them in a separate environment with
6:05
a separate low-privilege credential set. If you
6:08
need to build images, build them but don't push
6:11
them to a production registry. Or push to an
6:15
isolated registry that cannot be used for deploy.
6:18
If you need a preview environment, require approval
6:21
before anything privileged runs. And the last
6:24
point: even if you fix the triggers, scope matters.
6:27
Least privilege for CI tokens is not optional.
6:31
Your pipeline credentials should not be able
6:34
to do everything. That's story one. Story two
6:41
is Bazel. On December 26th, 2025, the TLS certificates
6:46
for many *.bazel.build domains expired and
6:50
it caused widespread build breakage. The Bazel
6:54
team's postmortem says the outage lasted around
6:57
13 hours before it was resolved. This is one
7:00
of those incidents that's both boring and terrifying.
7:03
Boring because it's just a cert. Terrifying because
7:07
cert failures are a binary cliff. Everything
7:10
works, then it doesn't. And the blast radius
7:13
is immediate because every client that depends
7:15
on that endpoint fails at the same time. Also,
7:19
auto-renew does not prevent this class of incident.
7:22
Auto-renew is one link in a chain. The full
7:25
chain is issuance, renewal, deployment, reload,
7:29
and verification. A lot of real cert outages
7:32
are renewal succeeded but deployment didn't reload.
7:36
Or new hostname wasn't included. Or monitoring
7:39
checked the wrong endpoint. Or DNS validation
7:42
broke and nobody noticed. So here's the practical
7:46
operator version of this story. You need external
7:49
monitoring for cert expiry from the outside,
7:53
against the actual endpoint users hit. Not an
7:56
internal health check. Not a dashboard in the
7:59
cert system. The real endpoint. You need ownership.
8:03
A named owner. A team. A channel. Someone the
8:07
2am on-call can tag and get traction from. And
8:10
you need runway. Alerts well before expiry. If
8:14
your alert fires 24 hours before expiry, you
8:18
are still basically doing incident response.
8:20
If it fires 30 days before expiry, you can fix
8:24
weird edge cases like DNS changes, migrations,
8:27
or validation issues calmly. Now, the do this
8:31
Monday pass. Pick your top three engineering
8:33
org blockers. These are not always customer-facing.
8:37
Often they're internal systems that block shipping.
8:40
Artifact registry, Git host, CI endpoint, SSO
8:44
login, webhook receiver, package download host,
8:48
any of those. For each one, answer: Do we have
8:51
external monitoring for cert expiry and chain
8:54
validity? Does it alert at least 14 days out,
8:58
ideally 30? Is there an owner written down? Do
9:01
we know where to fix it if renewal breaks? If
9:04
the answer is no, that's a cheap reliability
9:06
win you can fix without rewriting anything.
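To show how cheap, here's a minimal sketch of an external expiry check. The hostnames are placeholders; point it at the real endpoints your builds and logins depend on, and feed the output into whatever alerting you already have.

```python
# Minimal external cert-expiry check -- a sketch, not a monitoring product.
# Checks the endpoint the way a client would: real TLS handshake, real chain.
import socket
import ssl
from datetime import datetime, timezone

ENDPOINTS = ["artifacts.example.internal", "git.example.internal"]  # placeholders
WARN_DAYS = 30

def days_until_expiry(host, port=443):
    ctx = ssl.create_default_context()
    # Note: if the cert is already expired or the chain is broken,
    # the handshake itself raises -- which is also a useful signal.
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ENDPOINTS:
        remaining = days_until_expiry(host)
        status = "OK" if remaining > WARN_DAYS else "ALERT"
        print(f"{status} {host}: {remaining} days until certificate expiry")
```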
9:09
And if you want a meta lesson, certs are one of those
9:13
dependencies where the failure is so preventable
9:15
that it's painful. Which is why teams tend to
9:18
cut corners until they get burned. Don't
9:22
wait for the burn. Story three: Helm chart reliability.
9:29
Prequel reviewed 105 popular open-source Helm
9:33
charts and they found the average reliability
9:35
score was roughly 3.98 out of 10, with a median
9:39
around 4 out of 10. Their point isn't "Helm is
9:42
bad"; it's that many charts ship demo-friendly
9:45
defaults, not reliability-friendly defaults.
9:49
This matters because charts aren't just packaging.
9:52
They encode operational behavior, readiness and
9:55
liveness, resource requests, update strategies,
9:59
disruption behavior, security contexts, sometimes
10:02
topology and scheduling assumptions. So when
10:06
you install a chart, you are adopting a set of
10:08
operational opinions whether you realize it or
10:11
not. Here's how this bites teams. A chart has
10:13
no resource requests. In dev, it looks fine.
10:17
In prod, under pressure, it becomes unpredictable.
10:20
Pods are throttled, get evicted, or get starved.
10:24
Probes are missing or sloppy. Traffic gets routed
10:27
to pods that aren't ready. Or probes are too
10:30
aggressive and under load, they trigger restarts.
10:34
No pod disruption budget, no topology spread.
10:38
And then routine node maintenance becomes a cascading
10:41
outage. Everything was highly available until
10:44
you drained a node and lost a majority of replicas
10:47
in one place. Unsafe update strategy, and rollouts
10:50
turn into brownouts. And the worst version is
10:53
when Kubernetes says everything is green while
10:56
your app is melting. That's where chart defaults
10:59
turn into long incident timelines. So what do
11:02
you do without forking every chart? You create
11:05
a baseline overlay and a checklist. Baseline
11:08
overlay is a thin layer in your GitOps repo or
11:12
Terraform or Helm release config where you enforce
11:15
defaults. Resources required. Probes required.
11:19
Explicit update strategy. PDB when appropriate.
11:23
Spread constraints if the service needs real
11:26
availability. A security context that matches your
11:29
cluster policy. And the checklist is just: does
11:33
this chart behave predictably under rollout,
11:36
under node drain, under load spike?
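If you want that checklist to be mechanical instead of a vibe, here's a minimal sketch that scans rendered manifests, the output of helm template for example, for the most common gaps. The filename and the specific checks are illustrative; a real setup would use a policy engine, but the idea is the same.

```python
# Minimal "did this chart set the basics" check -- a sketch, not a policy engine.
# Render first, e.g.: helm template my-release ./chart > rendered.yaml
import sys
import yaml  # pip install pyyaml

WORKLOAD_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}

def check(rendered_path):
    problems = []
    with open(rendered_path) as f:
        for doc in yaml.safe_load_all(f):
            if not doc or doc.get("kind") not in WORKLOAD_KINDS:
                continue
            name = f'{doc["kind"]}/{doc["metadata"]["name"]}'
            pod_spec = doc.get("spec", {}).get("template", {}).get("spec", {})
            for c in pod_spec.get("containers", []):
                if not c.get("resources", {}).get("requests"):
                    problems.append(f"{name}: container {c['name']} has no resource requests")
                if not c.get("readinessProbe"):
                    problems.append(f"{name}: container {c['name']} has no readiness probe")
            if doc.get("spec", {}).get("strategy", {}).get("type") == "Recreate":
                problems.append(f"{name}: Recreate strategy means downtime on every rollout")
    return problems

if __name__ == "__main__":
    for p in check(sys.argv[1] if len(sys.argv) > 1 else "rendered.yaml"):
        print("CHECK:", p)
```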
11:39
Now, the do this Monday pass: pick one chart you run in
11:43
production that matters. The one that would page
11:45
if it went sideways. Open its values and answer.
11:49
Are resource requests set? Are probes set and
11:52
meaningful? Do we have safe rollout behavior?
11:55
Do we have disruption behavior planned? Will
11:58
replicas spread across failure domains, or can
12:01
they all land on one node? What happens if a
12:04
node gets drained? If you can't answer quickly,
12:08
that's your signal. Add the defaults, document
12:10
them, and move on. You don't need perfect charts.
12:14
You need boring, predictable behavior. Now, n8n.
12:21
And this is new. Hacker News covered two high
12:24
severity vulnerabilities in n8n discovered by
12:27
JFrog. The short version is these flaws can let
12:30
authenticated users escape the sandbox and execute
12:34
code. One is in the expression sandbox and the
12:38
other involves Python code execution in internal
12:41
mode. Here's the important part. People hear
12:44
"authenticated" and they relax. In workflow automation
12:48
platforms, the permission model is often broader
12:51
than you think. Authenticated can include a lot
12:54
of people who can build or edit workflows. And
12:57
in tools like n8n, workflow editing is basically
13:00
code execution. Because workflows can evaluate
13:04
expressions and interact with credentials. So
13:07
this isn't "oh no, an attacker needs an account."
13:10
The real question is, who in your org has an
13:13
account? And what can their workflows touch?
13:16
And what makes this class of bug extra painful
13:18
is these tools often sit in the middle of your
13:21
environment holding keys. Slack, GitHub, Jira,
13:25
AWS keys, database credentials, webhooks, secret
13:30
managers, all of it. So a sandbox escape is not
13:33
just "someone ran code." It's "someone ran code
13:36
where the keys live." That's why we keep coming
13:39
back to n8n on this show. It's not because n8n
13:43
is uniquely bad. It's because the category is
13:46
high leverage. Okay, practical actions. First,
13:50
patch. Don't debate it. If you self-host n8n,
13:54
patch quickly when sandbox escapes drop. The
13:57
blast radius is too high to slow walk it. Second,
14:00
reduce who can author workflows. Don't treat
14:03
"workflow editor" as a casual permission. Treat
14:06
it like "can run code in a privileged environment,"
14:09
because effectively that's what it is. Third,
14:12
reduce exposure. If your n8n UI is public on
14:16
the internet, you are playing on hard mode. Put
14:19
it behind SSO, VPN, IP allow lists, whatever
14:24
fits your org. You want fewer people able to
14:27
even reach the attack surface.
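A cheap sanity check on that, as a sketch: run this from a machine outside your network and see whether the host answers at all. The hostname is a placeholder.

```python
# From an *external* vantage point: is the automation host reachable at all?
import socket

HOST = "n8n.example.com"  # placeholder -- your actual hostname here

for port in (80, 443):
    try:
        with socket.create_connection((HOST, port), timeout=5):
            print(f"{HOST}:{port} is reachable from here -- is that intentional?")
    except OSError:
        print(f"{HOST}:{port} is not reachable from this vantage point")
```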
14:30
Fourth, isolate it. If it's holding keys to a bunch of systems,
14:33
at least give it a narrow runtime permission
14:35
set. Least privilege on the credentials it uses.
14:39
Separate credentials per workflow if you can.
14:41
Don't run it with god mode access to AWS just
14:45
because it was convenient once. And now the tie
14:47
back to our last episodes where we talked about
14:50
n8n CVEs. The theme has been consistent: workflow
14:54
automation tools are basically control planes.
14:57
They need the same operational rigor you'd give
15:00
to an internal platform. Patch fast, lock down
15:03
authorship, reduce exposure, least privilege
15:07
the credentials. If you treat it like just a
15:09
tool, it will eventually treat you like just
15:13
a breach. All right, time for the lightning round.
15:22
First up: Tusk Fence. It's a lightweight sandbox
15:26
for running commands, with network access blocked by default.
15:30
If you are experimenting with agents, runbooks
15:32
that execute, or any workflow where code runs
15:35
on behalf of a user, this is the kind of primitive
15:38
you want. Safe by default beats clever by default.
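To be clear, this is not how Fence itself is implemented; it's just the primitive the category represents: run the command, cut the network off by default. A minimal Linux-only sketch, assuming util-linux's unshare and unprivileged user namespaces are available:

```python
# Run a command in a fresh network namespace so it has no network by default.
# Linux-only; requires util-linux's `unshare` and unprivileged user namespaces.
import subprocess

def run_without_network(cmd):
    # -r: map the current user to root inside a new user namespace
    # -n: new network namespace, so the child sees no usable network
    return subprocess.run(["unshare", "-r", "-n", *cmd],
                          capture_output=True, text=True)

if __name__ == "__main__":
    result = run_without_network(["curl", "-sS", "--max-time", "5", "https://example.com"])
    # Expect a failure: the sandboxed curl has nowhere to go.
    print(result.returncode, result.stderr.strip())
```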
15:42
Next, HashiCorp Agent Skills. This is part of
15:46
a trend I actually like: vendors shipping structured
15:49
reusable skills and guardrails instead of just
15:52
telling you to prompt better and hope. Next,
15:55
Marimo. It's a reactive Python notebook that
15:58
is stored as normal Python files. That sounds small, but
16:02
it matters because Git-friendly notebooks are
16:04
actually useful for incident analysis, runbooks,
16:08
or one -off ops experiments you want to keep
16:11
without the notebook JSON misery. And a quick
16:14
one from The Register: the Ralph Wiggum Claude
16:17
loop story. It's funny, but the real point is
16:20
dead serious. People are already building loops
16:23
that keep agents running until they produce output.
16:26
Without constraints and verification, that becomes
16:29
confident nonsense at scale. The ops lesson is
16:33
the same as alert fatigue. If your system floods
16:36
you with low -quality output, humans stop trusting
16:39
it. Okay, time for the human closer. Every story
16:49
today is a glue failure. CI trigger logic becomes
16:53
a security boundary, and someone implemented it
16:55
like it was just config. Cert renewals were treated
16:58
like solved and then the cliff happened. Charts
17:02
were treated like installers, not operational
17:04
dependencies. And workflow automation tools are
17:07
treated like a convenience layer even though
17:10
they hold the keys. So the takeaway isn't stop
17:13
using tools. It's treat guardrails like product
17:16
work. Make untrusted pipelines truly untrusted.
17:19
Make cert monitoring external and owned. Make
17:23
helm baselines explicit. Make workflow authoring
17:26
privileged. Make credentials least privileged.
17:30
Because if you only build accelerators, you are
17:33
not building a better platform. You are just
17:35
building a faster incident. All right, time for
17:38
a quick recap. We talked about the CodeBuild breach and
17:41
why CI triggers and filters are real security
17:44
boundaries. We talked about Bazel and the cert-cliff
17:47
problem. And we talked about Helm chart reliability
17:50
and why defaults matter more than installs. And
17:54
we talked about the new n8n sandbox-escape flaws
17:58
and why workflow automation needs to be treated
18:01
like a control plane. The lightning round was
18:04
Fence, HashiCorp Agent Skills, Marimo, and the
18:08
agent-loop cautionary tale. Links and show notes
18:11
are on shipitweekly.fm. If you got something
18:14
out of this, follow the show wherever you are
18:17
listening. And if you can, leave a quick rating
18:19
or review. It helps a ton. I'm Brian, and I'll
18:22
see you next week.