AWS Bahrain/UAE Data Center Issues Amid Iran Strikes, ArgoCD vs Flux GitOps Failures, GitHub Actions Hackerbot-Claw Attacks (Trivy), RoguePilot Codespaces Prompt Injection, Block “AI Remake” Layoffs, Claude Code Security

Transcript

0:00 This week is another reminder that the boundary

0:02 of ops keeps expanding. Sometimes the incident

0:05 trigger isn't a bad deploy. It's physical disruption

0:08 in a cloud region. Sometimes it isn't your app.

0:11 It's your GitOps control plane getting stuck.

0:13 Sometimes it isn't a vuln in prod. It's your

0:17 CI getting actively hunted. And sometimes it's

0:20 your company saying "AI remake" while expecting

0:23 the same reliability with fewer humans. All right,

0:26 let's get into it. Hey, I'm Brian Teller. I work

0:45 in DevOps and SRE, and I run Teller's Tech. This

0:49 is Ship It Weekly, where I filter the noise and

0:51 focus on what actually changes how we run infrastructure

0:54 and own reliability. Show notes and links are

0:57 on shipitweekly.fm. If the show's been useful,

1:01 follow it wherever you listen. Also, ratings

1:03 help way more than they should. Six main stories

1:06 today, then the lightning round, and then the

1:08 human closer. Story 1 is AWS flagging issues

1:11 in Bahrain and the UAE data centers amid Iran's

1:15 strikes, and what this means for regional resilience.

1:19 Story 2 is Argo CD to Flux and the specific Argo

1:23 CD failure mode that makes GitOps feel like a

1:26 pager generator. Story 3 is Hackerbot-Claw, an

1:29 automated campaign exploiting GitHub Actions

1:32 and Trivy getting hit as part of it. Story 4 is

1:36 RoguePilot, a GitHub Codespaces Copilot attack

1:39 chain. It's basically prompt injection meets real

1:43 credentials. Story 5 is Block cutting 4,000 jobs,

1:47 framed as an "AI remake," and why that's an ops

1:51 execution story. Story 6 is Anthropic pushing

1:55 frontier cybersecurity capabilities for defenders

1:58 and what that means when these tools move from

2:01 scan to suggest fixes. Then the lightning round,

2:04 then the human closer. All right, story one. So

2:11 Reuters reported AWS flagged power and connectivity

2:15 issues tied to incidents at facilities in the

2:18 UAE and Bahrain amid the regional conflict. The

2:22 practical takeaway is simple. Multi-AZ is not

2:26 multi-region, and cloud outages now include

2:29 physical risk. If the region is degraded for

2:32 hours or days, the question becomes, can you

2:35 operate elsewhere and how fast can you decide?

2:38 A lot of teams say "we're highly available," and

2:41 what they really mean is "we can lose an AZ." That's

2:44 great, you should do that. But it's not the same

2:46 as losing an entire region's capacity, networking,

2:50 or connectivity to the outside world. This is

2:53 where a DR plan stops being a diagram and becomes

2:57 a decision tree. Here's how this bites real orgs.

3:00 Your apps might technically still run, but your

3:03 dependencies don't. Payments provider timeouts.

3:06 Queue backlogs. Outbound traffic gets weird.

3:09 Latency goes from fine to unusable. And suddenly

3:13 you are in the ugly space where nothing is fully

3:16 down, but everything is failing. That's also

3:19 where people make the wrong call. They wait too

3:22 long because it's partially working. Then failover

3:25 gets harder because data divergence grows and

3:29 backlogs pile up. Do this Monday. Pick one region

3:33 you rely on heavily and run this thought experiment.

3:36 Assume it's impaired for 48 hours. Who makes

3:40 the failover call? Not "we would," a name or a

3:43 role. And what signals trigger it? Error rate?

3:46 Latency? Provider status? Customer impact? What's

3:50 the DNS plan? Do you use weighted routing? Failover

3:54 routing? Manual cutover? How long does it take?

3:57 What's the rollback plan? What's the data plan,

4:00 not the app plan? If your database is regional,

4:03 your app is regional. If your queue is regional,

4:06 your app is regional. If your identity system

4:09 is regional, your app is regional. Have you tested

4:12 this in the last year? Even a tabletop, even

4:15 a low-risk exercise, anything besides "we totally

4:19 could if we had to." If you are in edge regions

4:22 or high-risk geos, you don't get to pretend

4:25 this is theoretical. Next up, story two. Story

4:32 two is GitOps pain. There's a great write-up

4:35 on migrating from Argo CD to Flux, and the best

4:39 part is the section literally titled "The Problem

4:42 with Argo CD." The complaint is a specific failure

4:45 mode. A sync fails. Argo CD marks the app sync

4:49 failed, and then it can get stuck retrying the

4:52 failed state instead of progressing to the newer

4:55 commit that fixes it. The CRD ordering example

4:59 is the one everyone runs into at least once.

5:02 You push a commit that creates a custom resource.

5:06 The CRD isn't there yet, so sync fails. You push

5:09 a new commit, adding the CRD. And Argo CD can

5:12 keep banging its head against the old failed

5:15 commit. This is the moment where GitOps stops

5:18 feeling like declarative desired state and starts

5:21 feeling like a controller stuck in a loop that

5:24 needs a human rescue. The fastest way to hate

5:27 GitOps is getting paged by GitOps. And here's

5:30 the thing. Argo CD is good. Lots of teams run

5:33 it successfully. But the operational footguns

5:36 are real. And if you haven't been burned yet,

5:38 you will be. GitOps tools are control planes.

5:41 When the control plane is wrong, it's not one

5:44 service down. It's your deployment mechanism

5:47 becoming the incident. So what do you actually

5:49 do with this story if you're not migrating to

5:52 Flux tomorrow? Do this Monday. If you run Argo

5:55 CD, make sure you have a runbook for "sync failed

5:58 and won't progress." What do you do when it's

6:01 stuck on a bad desired state? What's the break-glass

6:04 path that doesn't involve turning off all

6:07 automation? How do you handle CRDs and ordering

6:11 safely? Do you use sync waves? Do you split CRDs

6:14 into a separate app? Do you pre-install them?

6:17 Whatever your pattern is, write it down and standardize

6:21 it. And make sure the on-call knows the recovery

6:24 moves: manual sync, refresh, hard refresh, prune

6:28 behavior, deleting resources, recreating the app. The

6:31 mechanics matter when you're under pressure. Now,

6:33 on the Flux side, if you are considering Flux,

6:36 don't migrate your fleet first. Pick one low-risk

6:39 service. Run it side by side. Learn the failure

6:43 modes. Don't make this a religious war. Make

6:46 it an operational decision. Alright, story three.

6:53 This story is spicy in the way ops teams should

6:56 care about. Step Security documented an automated

6:59 campaign they call Hackerbot-Claw. It targeted

7:03 GitHub Actions workflows across major repos,

7:06 with remote code execution in several targets and

7:09 token theft, including a token with write permissions.

7:13 Then Trivy maintainers posted their own incident

7:16 report saying Trivy was attacked via GitHub Actions

7:19 as part of the same campaign. And they believe

7:23 the vulnerability came from a specific workflow

7:26 which they fixed. This is the "your pipeline is

7:29 production" story. Attackers are not waiting for

7:32 your app to have a bug. They are going after

7:35 the thing that can publish artifacts, ship releases,

7:38 and mint trust. CI is basically a skeleton key

7:42 that we keep leaving under the doormat. The real

7:45 reason this matters for DevOps is that CI feels

7:49 internal, but it's usually triggered by public

7:52 inputs. PR titles, PR descriptions, issues, forks,

7:57 external contributors, and if any of that can

8:00 reach privileged execution, you have an internet-facing

8:03 execution engine with credentials. Do

8:07 this Monday. Open your workflows and look for

8:10 common footguns. Are you using

8:13 pull_request_target? If yes, do you fully understand which

8:16 code runs and what permissions it has? Are actions

8:20 pinned to commit SHAs or are you trusting tags?

8:23 Tags move, SHAs don't. Is GITHUB_TOKEN overprivileged?
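The first two checks above (pull_request_target usage and tag-pinned actions) can be roughed out with a short script. This is a minimal sketch, not a real scanner: it pattern-matches workflow text instead of parsing YAML, so it can miss or over-report cases, and the function name `audit_workflow` and the sample workflow (including `example-org/example-action`) are made up for illustration.

```python
import re

# Matches steps like "- uses: owner/repo@ref". Local actions ("./path")
# carry no "@ref" and therefore never match.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([^\s@#]+)@([^\s#]+)", re.MULTILINE)
# A full commit SHA is exactly 40 lowercase hex characters.
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def audit_workflow(text: str) -> list[str]:
    """Return human-readable findings for one workflow file's text."""
    findings = []
    # Heuristic: any mention of the trigger, even in a comment, is flagged.
    if re.search(r"\bpull_request_target\b", text):
        findings.append(
            "uses pull_request_target: check which code runs and with what permissions"
        )
    for action, ref in USES_RE.findall(text):
        if action.startswith("docker://"):
            continue  # container references follow different pinning rules
        if not FULL_SHA_RE.fullmatch(ref):
            findings.append(f"{action}@{ref} is not pinned to a commit SHA")
    return findings


# Hypothetical workflow used only to exercise the checks.
SAMPLE = """\
on:
  pull_request_target:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: example-org/example-action@0123456789abcdef0123456789abcdef01234567
"""

for finding in audit_workflow(SAMPLE):
    print(finding)
```

In practice you would run something like this over every file under `.github/workflows` and treat the output as a review queue, not a verdict.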

8:28 Set default permissions to read-only and explicitly

8:31 grant what you need per job. Are secrets accessible

8:35 in contexts influenced by untrusted PRs? If untrusted

8:39 code can run in jobs with secrets, assume those

8:43 secrets are compromised eventually. Bonus hardening

8:47 that actually helps. Use OIDC to cloud providers

8:50 instead of long-lived cloud keys. Use environments

8:53 with required reviewers for deploy jobs. Separate

8:57 untrusted test pipeline from trusted publish and

9:00 deploy pipeline. If you do only one thing this

9:03 week, stop untrusted events from running privileged

9:03 steps. Okay, story four. This one is a perfect

9:14 bridge between "agents are real" and "credentials

9:17 are real." RoguePilot is an attack chain where

9:20 a malicious GitHub issue can embed instructions

9:24 that get processed when a developer launches

9:27 a Codespace from that issue. It's basically passive

9:30 prompt injection. The attacker's content is the

9:34 prompt. The scary part isn't "the model said something

9:36 weird." The scary part is the model can end up

9:40 operating in an environment that has real access

9:43 like a GitHub token, and the attacker's instructions

9:46 can steer what it does. In the worst case, it becomes

9:50 token exfiltration and repo takeover.

9:53 This is exactly why agent boundaries are not

9:56 optional. Untrusted text exists everywhere. Issues,

10:00 PRs, readmes, docs. If an agent reads it and

10:05 then can run commands, you need real trust boundaries.

10:08 Anything that reads issues is reading untrusted

10:11 input. Period. Do this Monday. If your org uses

10:15 Codespaces, Copilot-style agents, or anything

10:19 that can act, treat repo content as untrusted

10:23 input. Separate read context from act context.

10:27 Don't let the same agent both ingest and execute

10:30 without gates. Use least-privilege tokens in

10:34 dev environments. A developer workspace should

10:37 not have a token that can publish releases or

10:40 modify workflows. If you rely on external issues

10:44 and PRs, tighten who can trigger what. And make

10:47 sure you have logging that captures tool actions,

10:50 not just chat. This story will keep repeating

10:54 across tools. It's a category, not a one-off.
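One way to picture "separate read context from act context" is a gate that sits between an agent and its tools. The sketch below is an assumption-heavy illustration, not how Codespaces or Copilot work: the class `GatedSession`, the tool names, and the approval flow are all hypothetical. The idea is that once a session has ingested untrusted text, read-only tools stay available, but anything that acts needs a named human approver, and every decision is logged as a tool action.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical names for tools that only read and never change anything.
READ_TOOLS = {"read_file", "search_code"}


@dataclass
class GatedSession:
    tainted: bool = False                      # True once untrusted text was ingested
    audit_log: list = field(default_factory=list)

    def ingest(self, text: str, trusted: bool) -> None:
        """Record that content entered the agent's context."""
        if not trusted:
            self.tainted = True

    def request(self, tool: str, argument: str,
                approved_by: Optional[str] = None) -> bool:
        """Decide whether a tool call may run, and log the decision."""
        allowed = (
            tool in READ_TOOLS          # reading is always allowed
            or not self.tainted         # acting is fine in a clean session
            or approved_by is not None  # otherwise a human must approve
        )
        self.audit_log.append((tool, argument, approved_by, allowed))
        return allowed


session = GatedSession()
session.ingest("text copied from a public issue", trusted=False)
print(session.request("read_file", "README.md"))
print(session.request("run_command", "make release"))
print(session.request("run_command", "make test", approved_by="on-call reviewer"))
```

The audit log records the tool, argument, approver, and outcome for each request, which is the "log tool actions, not just chat" point from the checklist.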

10:57 All right, story five. So Block. If you're not

11:05 familiar, Block is the fintech company run by

11:08 Jack Dorsey. If the name sounds familiar, it's

11:11 because he's also the co-founder of Twitter

11:13 and one of the more influential figures in modern

11:16 tech platforms. Block used to be called Square.

11:20 It's the company behind things like Square Payments,

11:23 Cash App, and a bunch of fintech infrastructure

11:25 that powers small businesses and consumer payments.

11:29 So this isn't some random startup making headlines.

11:32 This is a large public tech company run by someone

11:36 who's been through multiple platform shifts already.

11:40 And now they're talking about what they're calling

11:42 an "AI remake." Block is cutting around 4,000

11:46 jobs. This is not just an HR story for ops teams.

11:50 It's an execution story. Because the pattern

11:54 we're going to see everywhere is fewer humans,

11:57 higher output volume, same reliability expectations.

12:00 If AI increases code output, your safety net

12:04 has to scale too. That safety net is not hero

12:08 engineers. It's guardrails, ownership, and systems

12:12 that absorb change without constant babysitting.

12:15 "Small teams move fast" only works if your brakes

12:18 work. Here's the practical ops angle. When teams

12:22 shrink, you lose redundancy. On-call rotations

12:25 get thinner. Specialists disappear. And then

12:28 an incident hits and the blast radius of "we lost

12:31 that context" is massive. So if leadership is

12:34 pushing AI productivity, the immediate question

12:37 is, what are we doing to preserve safety? Do

12:41 this Monday. If your org is in AI productivity

12:44 mode, is your main protected with required checks

12:48 that are actually required? Are releases gated

12:51 in a way that matches risk? Do you have a real

12:55 rollback muscle? Is ownership clear when output

12:58 volume increases? Also, watch the human metrics.

13:02 On-call load, mean time to restore, number of

13:05 pages per week. If those go up while headcount

13:08 goes down, your system is signaling that the

13:11 brakes are failing. One more main story before

13:14 the lightning round. Anthropic announced Claude

13:21 Code Security in limited research preview. The

13:24 pitch is: scan codebases for vulnerabilities and

13:27 suggest targeted patches for human review. This

13:30 is the direction the industry is moving. Security

13:33 work becomes part of the normal coding loop.

13:36 And if it works well, it can reduce backlog and

13:39 reduce time to fix. But the operational lesson

13:42 is the same as every agentic tool. Suggestion

13:46 mode is safe. Autonomous change is where you

13:49 need boundaries. If it can open PRs, it needs

13:52 the same rules as a human. So do this Monday.

13:56 If you are adopting AI security tooling, start

13:59 with suggestion mode. No auto-merge. Require

14:02 human approval for changes. Keep audit trails

14:05 of tool actions, not just a chat transcript.

14:08 Also, measure it like a real tool. How many findings

14:12 were real? How many were noise? Did it actually

14:15 reduce time to fix? Did it introduce risky changes?
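Those four questions map onto a handful of numbers you can track per review period. The sketch below assumes you record each finding after human triage; the `Finding` fields and the `summarize` function are illustrative names, not part of any vendor's tooling, and "risky changes" here simply means a suggested fix that someone flagged as causing a regression.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass
class Finding:
    confirmed: bool                  # a human triaged it as a real vulnerability
    hours_to_fix: Optional[float]    # None if not fixed yet (or not real)
    caused_regression: bool = False  # the suggested fix broke something


def summarize(findings: list) -> dict:
    """Summarize one period of tool output after human triage."""
    total = len(findings)
    real = [f for f in findings if f.confirmed]
    fix_times = [f.hours_to_fix for f in real if f.hours_to_fix is not None]
    return {
        "total": total,
        "real": len(real),
        "noise": total - len(real),
        "precision": len(real) / total if total else 0.0,
        "median_hours_to_fix": median(fix_times) if fix_times else None,
        "risky_changes": sum(f.caused_regression for f in findings),
    }


# Made-up sample data for illustration only.
period = [
    Finding(confirmed=True, hours_to_fix=10.0),
    Finding(confirmed=True, hours_to_fix=30.0),
    Finding(confirmed=False, hours_to_fix=None),
    Finding(confirmed=True, hours_to_fix=20.0, caused_regression=True),
]
print(summarize(period))
```

To answer "did it actually reduce time to fix," compare the median against the same number from before the tool was adopted; the sketch does not compute that baseline.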

14:19 Treat it like CI. Scoped access. Clear ownership.

14:23 And guardrails that prevent fixing by breaking

14:27 things. All right, time for the lightning round.

14:36 Two quick bigger-than-they-look stories.

14:39 DeepSeek reportedly withheld early access to

14:42 a new model from U.S. chipmakers while giving

14:45 Chinese firms early access. AI supply chains

14:49 are geopolitical now, and that impacts what you

14:52 can run where, and what best model even means.

14:56 Vercel wrote a great post on security boundaries

14:59 in agentic architectures. The core point? Most

15:02 agents run code with access to secrets, and without

15:06 explicit boundaries that becomes an incident generator.

15:09 Two CVE hits that are very DevOps/SRE relevant.

15:13 CISA added VMware Aria Operations CVE-2026-22719

15:20 to the KEV catalog, meaning active exploitation.

15:25 If you run Aria Ops, treat that as patch now.

15:29 And CVE-2026-27825 and 27826 in MCP Atlassian

15:37 are nasty because they live at the agent tooling

15:40 layer. SSRF-to-RCE style chain. Exactly why agent

15:46 toolchains need the same rigor as any other

15:49 production system. And one more quick hit. Claude

15:52 Cowork scheduled tasks is the "agents in boring

15:56 workflows" story. Recurring automation is where

15:59 little mistakes turn into big messes. So you

16:03 want approvals, scope limits, and auditability.

16:07 All right, human closer. Quick credit where it's

16:17 due. If this episode feels like "everything is

16:20 a control plane now," that's not just my opinion.

16:23 Uwe Friedrichsen has been writing about the ironies

16:26 of automation for a while, and we used that framing

16:29 in a prior Ship It Weekly episode, too. The idea

16:32 is simple. Automation doesn't remove responsibility.

16:35 It concentrates it. And as systems get faster,

16:38 human oversight doesn't magically speed up. You

16:41 can see it across everything we've covered. Physical

16:44 disruption in a region forces human decisions

16:47 under pressure. GitOps getting stuck forces humans

16:51 to rescue the control plane. CI being actively

16:54 exploited is literally "attack the automation."

16:57 Org changes, framed as "AI remake," increase output

17:01 and reduce humans, which makes guardrails the

17:04 safety net. And AI security tooling is only a

17:07 win if boundaries and approvals stay intact.

17:11 So the takeaway is boring, but it's the job.

17:14 Guardrails are product work. You either build

17:17 brakes or you build faster incidents. All right,

17:20 that's it for this week of Ship It Weekly. AWS

17:23 issues in Bahrain and the UAE. Argo CD pain and

17:28 why Flux keeps coming up. GitHub Actions being

17:31 actively hunted and Trivy getting hit.

17:34 RoguePilot and prompt injection meeting real credentials.

17:38 Block's AI-driven layoffs as an execution risk

17:42 story. And Anthropic pushing AI security tooling

17:46 deeper into the dev workflow. Links and show

17:49 notes are on shipitweekly.fm. If this episode

17:53 was useful, hit follow or subscribe wherever

17:55 you listen. And share it with an ops friend who's

17:58 living in CI, GitOps, or AI everywhere right

18:02 now. I'm Brian, and I'll see you next week.
