0:00
This week is a reminder that the stuff we treat
0:02
like glue is now a primary failure domain. CI
0:06
trigger rules, cert renewals, Helm charts, automation
0:09
tools. One tiny assumption goes sideways and
0:13
suddenly you're dealing with a supply chain risk,
0:16
a global build break, or an RCE in the thing
0:19
that holds all your credentials. Hey, I'm Brian.
0:38
I work in DevOps and SRE and I run Tellers Tech.
0:41
This is Ship It Weekly, where I filter the noise
0:44
and pull out what actually matters when you're
0:47
the one running infrastructure and owning reliability.
0:50
If something's hype, I'll call it hype. If it
0:52
changes how you operate, we'll talk about it.
0:54
Quick bit of housekeeping: the show notes and
0:57
links are on shipitweekly.fm. If the show's
1:00
been useful, follow it wherever you listen. Also,
1:03
a rating helps way more than it should. Four
1:06
main stories for today. First, the CodeBuild breach. A
1:08
CI trigger and filtering issue in a small set
1:11
of AWS-managed repos that's a perfect reminder
1:14
that CI glue is part of your security boundary
1:17
now. Second, the Bazel TLS cert expiry incident.
1:21
The kind of failure that is boring, binary, and
1:24
absolutely capable of blocking your entire engineering
1:27
org. Third, Helm chart reliability. Prequel reviewed
1:31
over a hundred popular charts and the results
1:34
are basically a post-mortem template for why
1:36
"it installed fine" is not a reliability guarantee.
1:39
Fourth, n8n: two new high-severity flaws disclosed
1:44
by JFrog that can lead to code execution. We're
1:48
going to treat this one like a mini story because
1:50
workflow automation tools are basically a control
1:53
plane holding your secrets. Then a quick lightning
1:56
round with a few operator-friendly tools and
1:59
takeaways, and a human closer tying the theme
2:02
together. Alright, let's get into it. So, the CodeBuild breach.
2:10
AWS published a security bulletin describing
2:12
a misconfiguration involving unanchored account_id
2:16
webhook filters for CodeBuild,
2:20
used by a small set of AWS-managed open-source
2:23
repos. AWS says they mitigated it quickly, rotated
2:27
credentials, reviewed logs, and added additional
2:31
mitigations and protections around build processes
2:33
with credentials in memory. Wiz's research frames
2:37
the risk clearly. If you can trigger a privileged
2:40
build in a repo that's part of a supply chain,
2:43
you potentially get access to tokens and credentials
2:46
that can be used to push changes or create malicious
2:50
artifacts. That's why it's not just CI, it becomes
2:54
supply chain. Now, the operator lesson is not
2:57
"CodeBuild is bad." The operator lesson is: stop
3:01
treating pipeline trigger logic like it's harmless.
3:04
If an untrusted event can cause a trusted pipeline
3:07
to run, you do not have CI. You have an execution
3:10
environment exposed to the internet. And almost
3:13
every org drifts towards this risk without meaning
3:17
to. Here's the drift pattern. You start with
3:20
PR checks: lint, test, build. Great. Then someone
3:24
adds integration tests. And those tests need
3:27
credentials to hit an environment, or to pull
3:30
from a private registry, or to call a third-party
3:33
API. Then someone adds preview deployments. Those
3:38
need cloud creds, or at least some deploy token.
3:41
Then someone adds artifact publishing because
3:44
it's easier if the PR build produces the image.
3:47
Now your PR pipeline can build and push images.
3:50
Then someone uses those images in staging. And
3:53
now your PR pipeline is part of your release
3:56
path. And at that point, you've accidentally
3:58
created a supply chain path where a PR can influence
4:02
something that runs in your environment. This
4:05
is why the most dangerous sentence in CI is,
4:08
"it only runs in CI." Because CI is usually the
4:12
thing that holds credentials that can touch everything
4:15
else. So the practical question to ask is simple.
4:19
Can untrusted events cause trusted actions? Can
4:22
a forked PR run a job that has secrets available?
4:26
Can a PR comment trigger a workflow that can
4:30
deploy? Can a PR workflow push artifacts that
4:33
are later deployed? Can a PR workflow assume
4:37
cloud roles? If the answer is yes, or even I'm
4:40
not sure, you've got a boundary problem.
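If you want to turn those questions into something you can actually run, here's a minimal sketch. It assumes GitHub Actions-style workflow files and PyYAML; the trigger names and the checks are illustrative, so adapt them to whatever CI system you actually run.

```python
# Minimal CI trust-boundary audit -- a sketch, not a scanner product.
# Assumes GitHub Actions-style workflows under .github/workflows/.
import pathlib
import yaml  # pip install pyyaml

UNTRUSTED_TRIGGERS = {"pull_request", "pull_request_target", "issue_comment"}

def audit(workflow_dir=".github/workflows"):
    findings = []
    for path in pathlib.Path(workflow_dir).glob("*.y*ml"):
        text = path.read_text()
        doc = yaml.safe_load(text) or {}
        # PyYAML parses the bare key `on:` as boolean True, so check both.
        triggers = doc.get("on", doc.get(True, {}))
        if isinstance(triggers, str):
            trigger_names = {triggers}
        elif isinstance(triggers, list):
            trigger_names = set(triggers)
        else:
            trigger_names = set(triggers or {})
        untrusted = trigger_names & UNTRUSTED_TRIGGERS
        if not untrusted:
            continue
        # Crude but useful signals for "untrusted event, trusted action".
        if "secrets." in text:
            findings.append(f"{path.name}: untrusted triggers {sorted(untrusted)} reference secrets")
        if "pull_request_target" in untrusted:
            findings.append(f"{path.name}: pull_request_target runs with the base repo's permissions")
    return findings

if __name__ == "__main__":
    for finding in audit():
        print("REVIEW:", finding)
```

It won't catch everything, but it answers the first question, can a PR-triggered workflow see secrets, in about a minute.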
4:43
Now, I want to give you a do this Monday playbook
4:46
that doesn't require a big platform rewrite.
4:49
Step one: label pipelines mentally as untrusted
4:53
and trusted. Untrusted pipelines are PRs and forks.
4:57
No secrets, no publish, no deploy. It answers:
5:01
Does it compile? Do tests pass? Trusted pipelines
5:05
run on merges to main, tags, or explicit manual
5:09
approvals. That's the one allowed to publish
5:12
and deploy. Step two: pick one repo, your most
5:16
sensitive one, and map the event chain in plain
5:19
English. What events trigger workflows? Which
5:22
workflows can access secrets? Which workflows
5:25
can write artifacts? Which workflows can deploy
5:28
or change infra? Step three: check for the classic
5:32
foot guns. Are secrets injected into your PR
5:35
workflows at all? Are you using conditions like
5:38
actor checks or branch checks as auth? Are you
5:42
relying on "the code won't print secrets" as a
5:45
control? Do you allow PR builds from forks to
5:49
run with privileged tokens? If you find these,
5:52
the fix is not perfect security. The fix is strong
5:55
separation. PR builds can still run, just keep
5:58
them in the untrusted sandbox. If you need integration
6:02
tests, run them in a separate environment with
6:05
a separate low-privilege credential set. If you
6:08
need to build images, build them but don't push
6:11
them to a production registry. Or push to an
6:15
isolated registry that cannot be used for deploy.
6:18
If you need a preview environment, require approval
6:21
before anything privileged runs. And the last
6:24
point: even if you fix the triggers, scope matters.
6:27
Least privilege for CI tokens is not optional.
6:31
Your pipeline credentials should not be able
6:34
to do everything. That's story one. Story two
6:41
is Bazel. On December 26th, 2025, the TLS certificates
6:46
for many *.bazel.build domains expired and
6:50
it caused widespread build breakage. The Bazel
6:54
team's postmortem says the outage lasted around
6:57
13 hours before it was resolved. This is one
7:00
of those incidents that's both boring and terrifying.
7:03
Boring because it's just a cert. Terrifying because
7:07
cert failures are a binary cliff. Everything
7:10
works, then it doesn't. And the blast radius
7:13
is immediate because every client that depends
7:15
on that endpoint fails at the same time. Also,
7:19
auto-renew does not prevent this class of incident.
7:22
Auto-renew is one link in a chain. The full
7:25
chain is issuance, renewal, deployment, reload,
7:29
and verification. A lot of real cert outages
7:32
are renewal succeeded but deployment didn't reload.
7:36
Or new hostname wasn't included. Or monitoring
7:39
checked the wrong endpoint. Or DNS validation
7:42
broke and nobody noticed. So here's the practical
7:46
operator version of this story. You need external
7:49
monitoring for cert expiry from the outside,
7:53
against the actual endpoint users hit. Not an
7:56
internal health check. Not a dashboard in the
7:59
cert system. The real endpoint. You need ownership.
8:03
A named owner. A team. A channel. Someone the
8:07
2am on-call can tag and get traction from. And
8:10
you need runway. Alerts well before expiry. If
8:14
your alert fires 24 hours before expiry, you
8:18
are still basically doing incident response.
8:20
If it fires 30 days before expiry, you can fix
8:24
weird edge cases like DNS changes, migrations,
8:27
or validation issues calmly. Now, the do this
8:31
Monday pass. Pick your top three engineering
8:33
org blockers. These are not always customer-facing.
8:37
Often they're internal systems that block shipping.
8:40
Artifact registry, Git host, CI endpoint, SSO
8:44
login, webhook receiver, package download host,
8:48
any of those. For each one, answer: Do we have
8:51
external monitoring for cert expiry and chain
8:54
validity? Does it alert at least 14 days out,
8:58
ideally 30? Is there an owner written down? Do
9:01
we know where to fix it if renewal breaks? If
9:04
the answer is no, that's a cheap reliability
9:06
win you can fix without rewriting anything.
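To show how cheap, here's a minimal sketch of an external expiry check. The hostnames are placeholders; point it at the real endpoints your builds and logins depend on, and feed the output into whatever alerting you already have.

```python
# Minimal external cert-expiry check -- a sketch, not a monitoring product.
# Checks the endpoint the way a client would: real TLS handshake, real chain.
import socket
import ssl
from datetime import datetime, timezone

ENDPOINTS = ["artifacts.example.internal", "git.example.internal"]  # placeholders
WARN_DAYS = 30

def days_until_expiry(host, port=443):
    ctx = ssl.create_default_context()
    # Note: if the cert is already expired or the chain is broken,
    # the handshake itself raises -- which is also a useful signal.
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc
    )
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    for host in ENDPOINTS:
        remaining = days_until_expiry(host)
        status = "OK" if remaining > WARN_DAYS else "ALERT"
        print(f"{status} {host}: {remaining} days until certificate expiry")
```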
9:09
And if you want a meta lesson, certs are one of those
9:13
dependencies where the failure is so preventable
9:15
that it's painful. Which is why teams tend to
9:18
cut corners until they get burned. Don't
9:22
wait for the burn. Story three: Helm chart reliability.
9:29
Prequel reviewed 105 popular open-source Helm
9:33
charts and they found the average reliability
9:35
score was roughly 3.98 out of 10, with a median
9:39
around 4 out of 10. Their point isn't "Helm is
9:42
bad"; it's that many charts ship demo-friendly
9:45
defaults, not reliability-friendly defaults.
9:49
This matters because charts aren't just packaging.
9:52
They encode operational behavior, readiness and
9:55
liveness, resource requests, update strategies,
9:59
disruption behavior, security contexts, sometimes
10:02
topology and scheduling assumptions. So when
10:06
you install a chart, you are adopting a set of
10:08
operational opinions whether you realize it or
10:11
not. Here's how this bites teams. A chart has
10:13
no resource requests. In dev, it looks fine.
10:17
In prod, under pressure, it becomes unpredictable.
10:20
Pods are throttled, get evicted, or get starved.
10:24
Probes are missing or sloppy. Traffic gets routed
10:27
to pods that aren't ready. Or probes are too
10:30
aggressive and under load, they trigger restarts.
10:34
No pod disruption budget, no topology spread.
10:38
And then routine node maintenance becomes a cascading
10:41
outage. Everything was highly available until
10:44
you drained a node and lost a majority of replicas
10:47
in one place. Unsafe update strategy, and rollouts
10:50
turn into brownouts. And the worst version is
10:53
when Kubernetes says everything is green while
10:56
your app is melting. That's where chart defaults
10:59
turn into long incident timelines. So what do
11:02
you do without forking every chart? You create
11:05
a baseline overlay and a checklist. Baseline
11:08
overlay is a thin layer in your GitOps repo or
11:12
Terraform or Helm release config where you enforce
11:15
defaults. Resources required. Probes required.
11:19
Explicit update strategy. PDB when appropriate.
11:23
Spread constraints if the service needs real
11:26
availability. A security context that matches your
11:29
cluster policy. And the checklist is just: does
11:33
this chart behave predictably under rollout,
11:36
under node drain, under load spike?
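If you want that checklist to be mechanical instead of a vibe, here's a minimal sketch that scans rendered manifests, the output of helm template for example, for the most common gaps. The filename and the specific checks are illustrative; a real setup would use a policy engine, but the idea is the same.

```python
# Minimal "did this chart set the basics" check -- a sketch, not a policy engine.
# Render first, e.g.: helm template my-release ./chart > rendered.yaml
import sys
import yaml  # pip install pyyaml

WORKLOAD_KINDS = {"Deployment", "StatefulSet", "DaemonSet"}

def check(rendered_path):
    problems = []
    with open(rendered_path) as f:
        for doc in yaml.safe_load_all(f):
            if not doc or doc.get("kind") not in WORKLOAD_KINDS:
                continue
            name = f'{doc["kind"]}/{doc["metadata"]["name"]}'
            pod_spec = doc.get("spec", {}).get("template", {}).get("spec", {})
            for c in pod_spec.get("containers", []):
                if not c.get("resources", {}).get("requests"):
                    problems.append(f"{name}: container {c['name']} has no resource requests")
                if not c.get("readinessProbe"):
                    problems.append(f"{name}: container {c['name']} has no readiness probe")
            if doc.get("spec", {}).get("strategy", {}).get("type") == "Recreate":
                problems.append(f"{name}: Recreate strategy means downtime on every rollout")
    return problems

if __name__ == "__main__":
    for p in check(sys.argv[1] if len(sys.argv) > 1 else "rendered.yaml"):
        print("CHECK:", p)
```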
11:39
Now, the do this Monday pass: pick one chart you run in
11:43
production that matters. The one that would page
11:45
if it went sideways. Open its values and answer.
11:49
Are resource requests set? Are probes set and
11:52
meaningful? Do we have safe rollout behavior?
11:55
Do we have disruption behavior planned? Will
11:58
replicas spread across failure domains, or can
12:01
they all land on one node? What happens if a
12:04
node gets drained? If you can't answer quickly,
12:08
that's your signal. Add the defaults, document
12:10
them, and move on. You don't need perfect charts.
12:14
You need boring, predictable behavior. Now, n8n.
12:21
And this is new. Hacker News covered two high
12:24
severity vulnerabilities in n8n discovered by
12:27
JFrog. The short version is these flaws can let
12:30
authenticated users escape the sandbox and execute
12:34
code. One is in the expression sandbox and the
12:38
other involves Python code execution in internal
12:41
mode. Here's the important part. People hear
12:44
"authenticated" and they relax. In workflow automation
12:48
platforms, the permission model is often broader
12:51
than you think. Authenticated can include a lot
12:54
of people who can build or edit workflows. And
12:57
in tools like n8n, workflow editing is basically
13:00
code execution. Because workflows can evaluate
13:04
expressions and interact with credentials. So
13:07
this isn't "oh no, an attacker needs an account."
13:10
The real question is, who in your org has an
13:13
account? And what can their workflows touch?
13:16
And what makes this class of bug extra painful
13:18
is these tools often sit in the middle of your
13:21
environment holding keys. Slack, GitHub, Jira,
13:25
AWS keys, database credentials, webhooks, secret
13:30
managers, all of it. So a sandbox escape is not
13:33
just "someone ran code." It's "someone ran code
13:36
where the keys live." That's why we keep coming
13:39
back to n8n on this show. It's not because n8n
13:43
is uniquely bad. It's because the category is
13:46
high leverage. Okay, practical actions. First,
13:50
patch. Don't debate it. If you self-host n8n,
13:54
patch quickly when sandbox escapes drop. The
13:57
blast radius is too high to slow walk it. Second,
14:00
reduce who can author workflows. Don't treat
14:03
"workflow editor" as a casual permission. Treat
14:06
it like "can run code in a privileged environment,"
14:09
because effectively that's what it is. Third,
14:12
reduce exposure. If your n8n UI is public on
14:16
the internet, you are playing on hard mode. Put
14:19
it behind SSO, VPN, IP allow lists, whatever
14:24
fits your org. You want fewer people able to
14:27
even reach the attack surface.
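A cheap sanity check on that, as a sketch: run this from a machine outside your network and see whether the host answers at all. The hostname is a placeholder.

```python
# From an *external* vantage point: is the automation host reachable at all?
import socket

HOST = "n8n.example.com"  # placeholder -- your actual hostname here

for port in (80, 443):
    try:
        with socket.create_connection((HOST, port), timeout=5):
            print(f"{HOST}:{port} is reachable from here -- is that intentional?")
    except OSError:
        print(f"{HOST}:{port} is not reachable from this vantage point")
```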
14:30
Fourth, isolate it. If it's holding keys to a bunch of systems,
14:33
at least give it a narrow runtime permission
14:35
set. Least privilege on the credentials it uses.
14:39
Separate credentials per workflow if you can.
14:41
Don't run it with god mode access to AWS just
14:45
because it was convenient once. And now the tie
14:47
back to our last episodes where we talked about
14:50
n8n CVEs. The theme has been consistent: workflow
14:54
automation tools are basically control planes.
14:57
They need the same operational rigor you'd give
15:00
to an internal platform. Patch fast, lock down
15:03
authorship, reduce exposure, least privilege
15:07
the credentials. If you treat it like just a
15:09
tool, it will eventually treat you like just
15:13
a breach. All right, time for the lightning round.
15:22
First up: Tusk Fence. It's a lightweight sandbox
15:26
for running commands, with network access blocked by default.
15:30
If you are experimenting with agents, runbooks
15:32
that execute, or any workflow where code runs
15:35
on behalf of a user, this is the kind of primitive
15:38
you want. Safe by default beats clever by default.
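To be clear, this is not how Fence itself is implemented; it's just the primitive the category represents: run the command, cut the network off by default. A minimal Linux-only sketch, assuming util-linux's unshare and unprivileged user namespaces are available:

```python
# Run a command in a fresh network namespace so it has no network by default.
# Linux-only; requires util-linux's `unshare` and unprivileged user namespaces.
import subprocess

def run_without_network(cmd):
    # -r: map the current user to root inside a new user namespace
    # -n: new network namespace, so the child sees no usable network
    return subprocess.run(["unshare", "-r", "-n", *cmd],
                          capture_output=True, text=True)

if __name__ == "__main__":
    result = run_without_network(["curl", "-sS", "--max-time", "5", "https://example.com"])
    # Expect a failure: the sandboxed curl has nowhere to go.
    print(result.returncode, result.stderr.strip())
```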
15:42
Next, HashiCorp Agent Skills. This is part of
15:46
a trend I actually like: vendors shipping structured
15:49
reusable skills and guardrails instead of just
15:52
telling you to prompt better and hope. Next,
15:55
Marimo. It's a reactive Python notebook that
15:58
is stored as normal Python files. That sounds small, but
16:02
it matters because Git-friendly notebooks are
16:04
actually useful for incident analysis, runbooks,
16:08
or one -off ops experiments you want to keep
16:11
without the notebook JSON misery. And a quick
16:14
one from The Register: the Ralph Wiggum Claude
16:17
loop story. It's funny, but the real point is
16:20
dead serious. People are already building loops
16:23
that keep agents running until they produce output.
16:26
Without constraints and verification, that becomes
16:29
confident nonsense at scale. The ops lesson is
16:33
the same as alert fatigue. If your system floods
16:36
you with low -quality output, humans stop trusting
16:39
it. Okay, time for the human closer. Every story
16:49
today is a glue failure. CI trigger logic becomes
16:53
a security boundary, and someone implemented it
16:55
like it was just config. Cert renewals were treated
16:58
like solved and then the cliff happened. Charts
17:02
were treated like installers, not operational
17:04
dependencies. And workflow automation tools are
17:07
treated like a convenience layer even though
17:10
they hold the keys. So the takeaway isn't stop
17:13
using tools. It's treat guardrails like product
17:16
work. Make untrusted pipelines truly untrusted.
17:19
Make cert monitoring external and owned. Make
17:23
helm baselines explicit. Make workflow authoring
17:26
privileged. Make credentials least privileged.
17:30
Because if you only build accelerators, you are
17:33
not building a better platform. You are just
17:35
building a faster incident. All right, time for
17:38
a quick recap. We talked about the CodeBuild breach and
17:41
why CI triggers and filters are real security
17:44
boundaries. We talked about Bazel and the cert-cliff
17:47
problem. And we talked about Helm chart reliability
17:50
and why defaults matter more than installs. And
17:54
we talked about the new n8n sandbox-escape flaws
17:58
and why workflow automation needs to be treated
18:01
like a control plane. The lightning round was
18:04
Fence, HashiCorp Agent Skills, Marimo, and the
18:08
agent-loop cautionary tale. Links and show notes
18:11
are on shipitweekly.fm. If you got something
18:14
out of this, follow the show wherever you are
18:17
listening. And if you can, leave a quick rating
18:19
or review. It helps a ton. I'm Brian, and I'll
18:22
see you next week.