Hackerbot-Claw Grows, Xygeni Tag Poisoning, GitHub Search HA, Windows SID Failures, and AI Skills Supply Chain

Transcript

0:00 Shortcuts keep turning into trust decisions.

0:03 It's kind of the whole show today. A GitHub workflow

0:06 gets treated like plumbing until it hands over

0:08 a token. A version tag looks stable until somebody

0:12 moves it. A cloned Windows image looks fine until

0:16 the platform finally decides it cares. And a

0:19 skills marketplace looks like discovery right

0:22 up until you remember the code is coming from

0:25 wherever that URL points. That's where we're

0:28 at. Hey, I'm Brian Teller. I work in DevOps and

0:47 SRE, and I run Teller's Tech. This is Ship It

0:51 Weekly, where I filter the noise and focus on

0:53 what actually changes how we run infrastructure

0:56 and own reliability. Show notes and links are

0:59 on shipitweekly.fm. If the show's been useful,

1:03 follow it wherever you listen. Ratings help way

1:05 more than they should. And if you want more signal

1:07 between episodes, check out oncallbrief.com.

1:11 We have five main stories today, then the lightning

1:14 round, and we'll wrap with the human closer. We're

1:16 starting by coming back to the Trivy story, but

1:19 from a different angle. We already covered Trivy

1:21 on the show in episode 24. The reason to revisit

1:24 it now is that it is clearly not just one ugly

1:28 repo incident. It was part of a bigger GitHub

1:30 Actions campaign. After that, GitHub has one of

1:33 the better platform engineering stories I've

1:36 seen lately. They rebuilt search high availability

1:39 in GitHub Enterprise Server because the old shape

1:42 could literally trap people in a bad maintenance

1:45 state. Then a Windows Server 2025 identity mess

1:50 that feels extremely real if you've ever lived

1:53 through cloned image weirdness. And finally,

1:56 agent skills, because apparently we are rebuilding

1:59 package manager history all over again. Except

2:03 this time, the packages can inherit shell access,

2:06 file access, and credentials. Story 1. Trivy

2:13 was part of a bigger GitHub Actions campaign.

2:16 Let's start there. I don't think the Trivy angle

2:19 is really the headline anymore. The bigger story

2:21 is that OpenSSF warned on March 1st about an

2:25 active campaign called Hackerbot-Claw. Their

2:28 advisory said it was actively scanning public

2:31 repositories for weak GitHub Actions workflows

2:34 and exploiting them to execute code and steal

2:38 credentials. InfoQ's follow-up said the campaign

2:41 hit projects tied to Microsoft, Datadog, CNCF,

2:46 and Aqua Security, and that it got remote code

2:50 execution in five of seven targeted repositories.

2:54 That matters because it takes this out of the

2:56 bucket of "one repo got unlucky." This is a real

2:59 pattern now. OpenSSF's write-up called out the

3:03 exact stuff most teams already know they should

3:06 be nervous about: privileged triggers like

3:09 pull_request_target, running untrusted code from forks,

3:12 inline shell, unvalidated input, weak auth checks

3:16 before workflows run. And they were pretty blunt

3:19 about the response too. Review, harden, and monitor

3:22 your workflows now. So for me, the real lesson

3:26 is not "wow, an AI bot did a thing." The real lesson

3:29 is that repo automation has become part of the

3:32 trust boundary. And a lot of teams still treat

3:35 it like background glue. It isn't glue anymore.
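
A rough sketch of what that workflow review could look like in code, assuming you have the workflow file as text. The function name, the list of "untrusted" expressions, and the sample workflow are all illustrative assumptions, not rules taken from the OpenSSF advisory, and a line-based heuristic like this will miss plenty of real cases.

```python
import re

# Hypothetical, deliberately short list of attacker-controllable contexts.
UNTRUSTED = re.compile(
    r"\$\{\{\s*github\.(?:event\.pull_request\.(?:title|body|head\.ref)|head_ref)\s*\}\}"
)
RUN_KEY = re.compile(r"^\s*(?:-\s+)?run:")


def audit_workflow(text: str) -> list[str]:
    """Return rough findings for one workflow file's text."""
    findings = []
    if re.search(r"\bpull_request_target\b", text):
        findings.append("privileged trigger: pull_request_target")

    in_run = False   # are we inside a `run:` step body?
    run_indent = 0   # column where the current `run:` key starts
    for number, line in enumerate(text.splitlines(), start=1):
        stripped = line.strip()
        indent = len(line) - len(line.lstrip())
        # A non-empty line at or left of the `run:` key ends the shell body.
        if in_run and stripped and indent <= run_indent:
            in_run = False
        if RUN_KEY.match(line):
            in_run = True
            run_indent = line.index("run:")
        # Interpolating untrusted context straight into shell is the risky part.
        if in_run and UNTRUSTED.search(line):
            findings.append(f"line {number}: untrusted input in shell: {stripped}")
    return findings


# Made-up workflow: the first step interpolates the PR title into shell,
# the second passes it through an environment variable instead.
SAMPLE = """\
on: pull_request_target
jobs:
  triage:
    runs-on: ubuntu-latest
    steps:
      - run: echo "${{ github.event.pull_request.title }}"
      - run: echo "$TITLE"
        env:
          TITLE: ${{ github.event.pull_request.title }}
"""

for finding in audit_workflow(SAMPLE):
    print(finding)
```

The second step is not flagged because the title reaches the shell as an environment variable rather than being pasted into the script text, which is the usual way to keep that input on the data side of the boundary.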

3:38 If a workflow can write releases, publish artifacts,

3:42 touch environments, or push back to the repo,

3:45 that is real production-adjacent power. Attackers

3:48 know that. We should probably act like we know

3:51 that too. One thing I'd actually do after this

3:54 episode, pick a couple of workflows and ask a

3:57 very boring question. Where exactly does untrusted

4:00 input enter? And where does it land? Branch names,

4:04 PR titles, file names, checked out code. If you

4:08 can't draw the trust boundary cleanly, you probably

4:11 don't have one. Story 2. Xygeni and the problem

4:18 with mutable trust. The Xygeni one is a perfect

4:21 follow-on because it is the same lesson from

4:24 a different direction. StepSecurity reported

4:27 that the official Xygeni action was compromised

4:30 on March 3rd when an attacker using stolen maintainer

4:34 credentials injected a reverse shell and moved

4:37 the mutable v5 tag to a malicious commit. The

4:41 nasty part is that repos using @v5 did not

4:44 need to change their workflow file at all. The

4:47 trust moved underneath them. StepSecurity later

4:50 said Xygeni removed the bad tag on March 10th,

4:53 rotated contributor tokens, enabled release immutability,

4:57 and added tag protection rules. That's what makes

5:00 this one so useful. Nothing had to look different

5:03 in your YAML. No scary PR. No obvious version

5:06 bump. No "why did Dependabot open this?" Just

5:10 the same reference, now pointing somewhere else.
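
One way to at least see which references can move underneath you is to list every `uses:` entry that is not pinned to a full commit SHA. This is a minimal sketch under that assumption; the function name and the `example-org/...` actions in the sample are hypothetical, and the 40-character hex check is only a shape check, not proof that the commit is the one you reviewed.

```python
import re

USES = re.compile(r"^\s*(?:-\s+)?uses:\s*([^\s#]+)", re.MULTILINE)
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def mutable_action_refs(workflow_text: str) -> list[str]:
    """List `uses:` references that are not pinned to a full commit SHA."""
    mutable = []
    for ref in USES.findall(workflow_text):
        if ref.startswith("./") or ref.startswith("docker://"):
            continue  # local actions and container images are out of scope here
        _, _, version = ref.partition("@")
        if not FULL_SHA.match(version):
            mutable.append(ref)  # tag or branch: whoever controls it controls you
    return mutable


# Hypothetical workflow fragment; both actions and the SHA are made up.
SAMPLE = """\
steps:
  - uses: example-org/scanner-action@v5
  - uses: example-org/build-action@0123456789abcdef0123456789abcdef01234567
  - uses: ./.github/actions/local-step
"""

print(mutable_action_refs(SAMPLE))
# → ['example-org/scanner-action@v5']
```

The output is a list of places where the trust decision lives on a tag. Whether that is acceptable is still a call somebody has to make, not something the script decides.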

5:12 And that is why I keep coming back to this idea

5:15 that mutable references are not a minor detail.

5:19 They are a trust decision. Teams talk about major

5:22 version tags like they are basically stable,

5:25 but they are only stable if the permissions,

5:27 protections, and release process behind them

5:29 are stable. If the tag can move, the trust moved

5:32 too. So I would not make the takeaway "pin everything

5:35 or you're reckless." I'd make it a little more

5:37 honest than that. Know what layer your trust

5:40 lives on. If you are depending on a mutable tag,

5:43 say that out loud. If you are okay with that

5:46 trade-off, fine. But make it a real decision.

5:49 Don't let it be an accidental one. And if you

5:52 own internal reusable actions, this is a pretty

5:55 good excuse to tighten those up right now: tag

5:58 protection, signed commits, release immutability,

6:01 SHA pinning guidance, the usual grown-up stuff.

6:05 Story 3. GitHub rebuilt high availability search

6:12 in GitHub Enterprise Server. Now for the one that

6:15 is not mostly security doom. GitHub published

6:17 a nice write-up on rebuilding search high availability

6:20 in GitHub Enterprise Server. The old model clustered

6:24 Elasticsearch across the primary and replica

6:26 nodes. GitHub said that created ugly failure

6:29 modes during upgrades and maintenance. A primary

6:33 shard could move to a replica, then that replica

6:36 could go down for maintenance, and the whole

6:38 thing could get stuck waiting on itself. Their

6:40 new design uses Elasticsearch cross-cluster

6:43 replication and moves each Enterprise Server

6:46 instance to its own single-node Elasticsearch

6:49 cluster. GitHub says support starts in GHES

6:53 3.19.1. I like this story because it feels

6:58 like real platform work. Not "we added an AI button."

7:01 Not "we made this graph go up." More like: this

7:04 architecture had a shape problem, and the shape

7:07 problem kept showing up during normal operations.

7:10 So we changed the shape. That's good engineering.

7:13 And a lot of teams are sitting on something like

7:15 that right now. You know the kind of system I

7:18 mean. It mostly works but everybody knows upgrades

7:21 are touchy. Maintenance has to happen in the

7:23 correct moon phase. Failover is theoretically

7:25 fine but nobody wants to test it on purpose.

7:28 That usually means the system does not need another

7:31 warning label. It probably needs a different

7:34 shape. That's the part worth stealing here. If

7:36 a core platform component keeps demanding perfect

7:38 choreography just to survive normal admin work,

7:41 stop asking how to babysit it better. Start asking

7:44 whether the design is wrong for the job now.

7:50 Story 4. Windows Server 2025 is surfacing old

7:54 image sins. This one is a little less flashy,

7:57 but man, it feels real. Microsoft published

8:00 guidance in March for Windows Server 2025 systems

8:04 built from non-generalized images. The root

8:08 issue is duplicate computer SIDs. Microsoft says

8:12 those can happen if you deploy an image without

8:15 running Sysprep, clone an existing Windows Server

8:18 instance, or reuse a VM snapshot or image the

8:22 wrong way. And the reason this is showing up

8:25 now is that KB5065426 introduced strict checks

8:31 for duplicate SIDs. Microsoft also says PsGetSid

8:35 is the way to verify it. I wanted this in

8:39 the episode because this is such a classic ops

8:42 problem. The bad habit can sit there for a long

8:45 time, maybe years. Everything mostly works. Then

8:48 the platform hardens one layer of identity behavior.

8:51 And suddenly you're not dealing with security

8:54 improvement in the abstract. You're dealing with

8:57 a weird auth failure that sends people digging

9:00 through the wrong part of the stack for hours.
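
If you have already collected machine SIDs from a fleet (for example by running PsGetSid on each host), finding the duplicates is a simple grouping problem. This is a minimal sketch that assumes the collection step is done and the results sit in a hostname-to-SID mapping; the function name, hostnames, and SID values are made up for illustration.

```python
from collections import defaultdict


def duplicate_sids(inventory: dict[str, str]) -> dict[str, list[str]]:
    """Group hostnames by machine SID and keep only SIDs seen more than once."""
    by_sid = defaultdict(list)
    for host, sid in inventory.items():
        # Normalize whitespace and case so copy-paste differences don't hide a match.
        by_sid[sid.strip().upper()].append(host)
    return {sid: sorted(hosts) for sid, hosts in by_sid.items() if len(hosts) > 1}


# Hypothetical inventory: app01 and app02 were cloned from the same image.
INVENTORY = {
    "app01": "S-1-5-21-1111111111-2222222222-3333333333",
    "app02": "S-1-5-21-1111111111-2222222222-3333333333",
    "db01": "S-1-5-21-4444444444-5555555555-6666666666",
}

for sid, hosts in duplicate_sids(INVENTORY).items():
    print(sid, hosts)
```

Any SID that comes back with more than one hostname points at machines that probably share an image history worth investigating.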

9:03 That's the pattern. Hardening doesn't just improve

9:05 security. It also drags old shortcuts into the

9:08 light. So if you are in one of those environments

9:11 with old templates, old images, or cloned Windows

9:14 boxes nobody has wanted to touch, this is a nice

9:17 reminder that "it works" and "it was deployed correctly"

9:20 are not the same thing. One practical move here.

9:24 If you've got any suspicion your Windows fleet

9:26 has image history you don't fully trust, go check

9:30 before the next update checks for you. Story

9:36 5. Agent skills are replaying package ecosystem

9:39 history. The last main story is the one that

9:42 keeps getting more obvious. Socket wrote in February

9:45 that skills.sh had already indexed more than

9:49 60,000 skills across tools like Cursor, Claude

9:53 Code, GitHub Copilot, and Windsurf, and that anyone

9:57 can publish a skill from any GitHub repository.

10:00 Their point was simple. Skills are decentralized

10:03 by design, so they inherit the same supply chain

10:06 problems we already know from places like npm.

10:10 Then Snyk took a harder look at the ecosystem,

10:12 and their numbers were not exactly comforting.

10:16 Their February study scanned 3,984 skills from

10:20 ClawHub and skills.sh and said 13.4% had at

10:26 least one critical issue, 36.82% had at least

10:30 one security flaw of any severity, and 76 malicious

10:34 payloads were confirmed through human review.
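
To put those percentages in absolute terms, here is some back-of-the-envelope arithmetic. The counts below are derived by rounding the quoted percentages against the 3,984 total. They are not figures quoted from the Snyk report.

```python
scanned = 3984      # skills scanned, as quoted in the episode
malicious = 76      # payloads confirmed through human review, as quoted

critical = round(scanned * 0.134)    # 13.4% with at least one critical issue
any_flaw = round(scanned * 0.3682)   # 36.82% with a flaw of any severity
malicious_pct = round(malicious / scanned * 100, 1)

print(critical)       # → 534
print(any_flaw)       # → 1467
print(malicious_pct)  # → 1.9
```

So roughly one skill in fifty in that sample carried a confirmed malicious payload, and more than one in three had some kind of flaw.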

10:38 They also called out the part that really matters

10:40 operationally. These things can inherit shell

10:43 access, file access, credentials, messaging access,

10:47 and persistence through memory depending on the

10:50 agent they extend. That's why this story matters

10:53 to me. Not because "agent skills are risky" is

10:56 some hot take. Of course they are. It matters

10:58 because this is the same old supply chain problem

11:01 wearing a newer outfit. Lightweight distribution.

11:04 Easy publishing. Borrowed trust. Code coming

11:07 from repos you don't control. And a whole lot

11:10 of people treating discovery as if it implies

11:13 review. It doesn't. Popular does not mean safe.

11:17 Installed does not mean verified. And "it's just

11:21 a skill" is probably going to age about as well

11:24 as "it's just a package." So if your team is experimenting

11:28 with skills, sub-agents, MCP servers, whatever

11:31 flavor of extension system is hot this month,

11:35 treat that install path like a real dependency

11:38 path. Because that's what it is. A few quick

11:48 ones before we wrap. GitHub paused enforcement

11:50 of the minimum self-hosted runner version requirement.

11:54 The version is still v2.329.0. Runners below

11:59 that can still register for now. But GitHub also

12:02 said the long-term direction has not changed.

12:06 So this is a pause, not a pardon. GitHub also

12:09 rolled out March secret scanning updates. They

12:12 added 28 new detectors from 15 providers, turned

12:16 on push protection by default for 39 detectors,

12:19 and added more validity checks. I like this one

12:22 because it is the kind of boring improvement

12:24 that actually saves people from themselves. And

12:27 one more little connective tissue point from

12:30 the OpenSSF advisory. Pin third-party GitHub

12:33 Actions by commit SHA. Keep workflow permissions

12:36 minimal. And stop using privileged triggers unless

12:40 you really need them. That advice is not new.

12:43 But the campaign is a nice reminder that "we know

12:46 better" and "we configured it better" are not the

12:49 same thing. I think the cleanest thread through

12:59 all of this is that convenience keeps getting

13:01 mistaken for safety. A workflow is convenient.

13:04 A tag is convenient. A cloned image is convenient.

13:07 A skill directory is convenient. And none of

13:09 those things are neutral once they start making

13:12 trust decisions for you. That's the real pattern

13:14 here. The Trivy follow-up story is really about

13:17 CI being part of the trust boundary now. Xygeni

13:21 is about mutable trust hiding under a

13:24 stable-looking reference. The GitHub search story is

13:26 what it looks like when a platform team admits

13:29 the system shape is the problem and they fix

13:31 it. The Windows story is what happens when the

13:34 platform finally starts enforcing a thing you

13:36 were sloppy about years ago. And the skills ecosystem

13:39 is just the latest reminder that the minute installation

13:42 gets easy, trust gets fuzzy, unless somebody

13:45 does the work to pin it down. So the operator

13:48 version of the episode is pretty simple. Stop

13:50 calling these things helpers if they can change

13:53 outcomes. If it can execute, publish, authenticate,

13:56 install, or inherit credentials, it is not just

13:59 automation. It is part of the control plane now.

14:02 And that means you owe it the same questions

14:04 you always owed production. What is mutable?

14:07 Who can change it? What is isolated? What is

14:10 logged? What fails open? And what are you trusting

14:13 by default that you probably shouldn't be? All

14:16 right, that's it for this episode of Ship It

14:18 Weekly. Quick recap. We came back to Trivy by

14:21 zooming out to the bigger Hackerbot-Claw campaign,

14:24 then Xygeni and tag poisoning, GitHub rebuilding

14:27 high availability search in GitHub Enterprise

14:30 Server, Windows Server 2025 surfacing old duplicate

14:34 SID problems, and agent skills replaying package

14:38 ecosystem history, except with even more access

14:41 and worse defaults. Links and show notes are

14:44 on shipitweekly.fm. You can also find the video

14:48 versions on YouTube. And if you want the signal

14:51 before the episode, check out OnCallBrief.com.

14:54 If this was useful, follow or subscribe wherever

14:57 you listen. And send it to the person on your

14:59 team who still thinks CI, tags, images, and just

15:03 helper tools live outside the real trust boundary.

15:07 I'm Brian, and I'll see you next week.
