0:00
This week is another reminder that the boundary
0:02
of ops keeps expanding. Sometimes the incident
0:05
trigger isn't a bad deploy. It's physical disruption
0:08
in a cloud region. Sometimes it isn't your app.
0:11
It's your GitOps control plane getting stuck.
0:13
Sometimes it isn't a vuln in prod. It's your
0:17
CI getting actively hunted. And sometimes it's
0:20
your company saying "AI remake" while expecting
0:23
the same reliability with fewer humans. All right,
0:26
let's get into it. Hey, I'm Brian Teller. I work
0:45
in DevOps and SRE, and I run Teller's Tech. This
0:49
is Ship It Weekly, where I filter the noise and
0:51
focus on what actually changes how we run infrastructure
0:54
and own reliability. Show notes and links are
0:57
on shipitweekly.fm. If the show's been useful,
1:01
follow it wherever you listen. Also, ratings
1:03
help way more than they should. Six main stories
1:06
today, then the lightning round, and then the
1:08
human closer. Story 1 is AWS flagging issues
1:11
in Bahrain and the UAE data centers amid Iran's
1:15
strikes, and what this means for regional resilience.
1:19
Story 2 is Argo CD to Flux and the specific Argo
1:23
CD failure mode that makes GitOps feel like a
1:26
pager generator. Story 3 is HackerBot Claw, an
1:29
automated campaign exploiting GitHub Actions
1:32
and Trivy getting hit as part of it. Story 4 is
1:36
RoguePilot, a GitHub Codespaces Copilot attack
1:39
chain. It's basically prompt injection meets real
1:43
credentials. Story 5 is Block cutting 4,000 jobs,
1:47
framed as an AI remake, and why that's an ops
1:51
execution story. Story 6 is Anthropic pushing
1:55
frontier cybersecurity capabilities for defenders,
1:58
and what that means when these tools move from
2:01
scan to suggest fixes. Then the lightning round,
2:04
then the human closer. All right, story one. So
2:11
Reuters reported AWS flagged power and connectivity
2:15
issues tied to incidents at facilities in the
2:18
UAE and Bahrain amid the regional conflict. The
2:22
practical takeaway is simple. Multi-AZ is not
2:26
multi-region, and cloud outages now include
2:29
physical risk. If the region is degraded for
2:32
hours or days, the question becomes, can you
2:35
operate elsewhere and how fast can you decide?
2:38
A lot of teams say "we're highly available," and
2:41
what they really mean is "we can lose an AZ." That's
2:44
great, you should do that. But it's not the same
2:46
as losing an entire region's capacity, networking,
2:50
or connectivity to the outside world. This is
2:53
where a DR plan stops being a diagram and becomes
2:57
a decision tree. Here's how this bites real orgs.
3:00
Your apps might technically still run, but your
3:03
dependencies don't. Payments provider timeouts.
3:06
Queue backlogs. Outbound traffic gets weird.
3:09
Latency goes from fine to unusable. And suddenly
3:13
you are in the ugly space where nothing is fully
3:16
down, but everything is failing. That's also
3:19
where people make the wrong call. They wait too
3:22
long because it's partially working. Then failover
3:25
gets harder because data divergence grows and
3:29
backlogs pile up. Do this Monday. Pick one region
3:33
you rely on heavily and run this thought experiment.
3:36
Assume it's impaired for 48 hours. Who makes
3:40
the failover call? Not "we would." A name or a
3:43
role. And what signals trigger it? Error rate?
3:46
Latency? Provider status? Customer impact? What's
3:50
the DNS plan? Do you use weighted routing? Failover
3:54
routing? Manual cutover? How long does it take?
3:57
What's the rollback plan? What's the data plan?
4:00
Not the app plan. If your database is regional,
4:03
your app is regional. If your queue is regional,
4:06
your app is regional. If your identity system
4:09
is regional, your app is regional. Have you tested
4:12
this in the last year? Even a tabletop, even
4:15
a low-risk exercise, anything besides "we totally
4:19
could if we had to." If you are in edge regions
4:22
or high-risk geos, you don't get to pretend
4:25
this is theoretical. Next up, story two. Story
4:32
two is GitOps pain. There's a great write-up
4:35
on migrating from Argo CD to Flux, and the best
4:39
part is the section literally titled The Problem
4:42
with Argo CD. The complaint is a specific failure
4:45
mode. A sync fails. Argo CD marks the app sync
4:49
failed, and then it can get stuck retrying the
4:52
failed state instead of progressing to the newer
4:55
commit that fixes it. The CRD ordering example
4:59
is the one everyone runs into at least once.
5:02
You push a commit that creates a custom resource.
5:06
The CRD isn't there yet, so sync fails. You push
5:09
a new commit, adding the CRD. And Argo CD can
5:12
keep banging its head against the old failed
5:15
commit. This is the moment where GitOps stops
5:18
feeling like declarative desired state and starts
5:21
feeling like a controller stuck in a loop that
5:24
needs a human rescue. The fastest way to hate
5:27
GitOps is getting paged by GitOps. And here's
5:30
the thing. Argo CD is good. Lots of teams run
5:33
it successfully. But the operational footguns
5:36
are real. And if you haven't been burned yet,
5:38
you will be. GitOps tools are control planes.
5:41
When the control plane is wrong, it's not one
5:44
service down. It's your deployment mechanism
5:47
becoming the incident. So what do you actually
5:49
do with this story if you're not migrating to
5:52
Flux tomorrow? Do this Monday. If you run Argo
5:55
CD, make sure you have a runbook for "sync failed
5:58
and won't progress." What do you do when it's
6:01
stuck on a bad desired state? What's the break
6:04
glass path that doesn't involve turning off all
6:07
automation? How do you handle CRDs and ordering
6:11
safely? Do you use sync waves? Do you split CRDs
6:14
into a separate app? Do you pre-install them?
6:17
Whatever your pattern is, write it down and standardize
6:21
it. And make sure the on-call knows the recovery
6:24
moves: manual sync, refresh, hard refresh, prune
6:28
behavior, deleting resources, recreating the app. The
6:31
mechanics matter when you're under pressure. Now,
6:33
on the Flux side: if you are considering Flux,
6:36
don't migrate your fleet first. Pick one low-risk
6:39
service. Run it side by side. Learn the failure
6:43
modes. Don't make this a religious war. Make
6:46
it an operational decision. Alright, story three.
6:53
This story is spicy in the way ops teams should
6:56
care about. StepSecurity documented an automated
6:59
campaign they call HackerBot Claw. It targeted
7:03
GitHub Actions workflows across major repos,
7:06
with remote code execution in several targets and
7:09
token theft, including a token with write permissions.
7:13
Then Trivy maintainers posted their own incident
7:16
report saying Trivy was attacked via GitHub Actions
7:19
as part of the same campaign. And they believe
7:23
the vulnerability came from a specific workflow
7:26
which they fixed. This is the "your pipeline is
7:29
production" story. Attackers are not waiting for
7:32
your app to have a bug. They are going after
7:35
the thing that can publish artifacts, ship releases,
7:38
and mint trust. CI is basically a skeleton key
7:42
that we keep leaving under the doormat. The real
7:45
reason this matters for DevOps is that CI feels
7:49
internal, but it's usually triggered by public
7:52
inputs. PR titles, PR descriptions, issues, forks,
7:57
external contributors, and if any of that can
8:00
reach privileged execution, you have an internet-facing
8:03
execution engine with credentials. Do
8:07
this Monday. Open your workflows and look for
8:10
common footguns. Are you using pull_request_target?
8:13
If yes, do you fully understand which
8:16
code runs and what permissions it has? Are actions
8:20
pinned to commit SHAs, or are you trusting tags?
8:23
Tags move, SHAs don't. Is GITHUB_TOKEN overprivileged?
8:28
Set default permissions to read-only and explicitly
8:31
grant what you need per job. Are secrets accessible
8:35
in contexts influenced by untrusted PRs? If untrusted
8:39
code can run in jobs with secrets, assume those
8:43
secrets are compromised eventually. Bonus hardening
8:47
that actually helps. Use OIDC to cloud providers
8:50
instead of long-lived cloud keys. Use environments
8:53
with required reviewers for deploy jobs. Separate
8:57
the untrusted test pipeline from the trusted publish
9:00
and deploy pipeline. If you do only one thing this
9:03
week, stop untrusted events from running privileged
9:06
steps. Okay, story four. This one is a perfect
9:14
bridge between "agents are real" and "credentials
9:17
are real." RoguePilot is an attack chain where
9:20
a malicious GitHub issue can embed instructions
9:24
that get processed when a developer launches
9:27
a Codespace from that issue. It's basically passive
9:30
prompt injection. The attacker's content is the
9:34
prompt. The scary part isn't "the model said something
9:36
weird." The scary part is the model can end up
9:40
operating in an environment that has real access,
9:43
like a GITHUB_TOKEN, and the attacker's instructions
9:46
can steer what it does. In the worst case, it becomes
9:50
token exfiltration and repo takeover.
9:53
This is exactly why agent boundaries are not
9:56
optional Untrusted text exists everywhere. Issues,
10:00
PRs, readmes, docs. If an agent reads it and
10:05
then can run commands, you need real trust boundaries.
10:08
Anything that reads issues is reading untrusted
10:11
input. Period. Do this Monday. If your org uses
10:15
Codespaces, Copilot-style agents, or anything
10:19
that can act, treat repo content as untrusted
10:23
input. Separate read context from act context.
10:27
Don't let the same agent both ingest and execute
10:30
without gates. Use least-privilege tokens in
10:34
dev environments. A developer workspace should
10:37
not have a token that can publish releases or
10:40
modify workflows. If you rely on external issues
10:44
and PRs, tighten who can trigger what. And make
10:47
sure you have logging that captures tool actions,
10:50
not just chat. This story will keep repeating
10:54
across tools. It's a category, not a one-off.
10:57
All right, story five. So Block. If you're not
11:05
familiar, Block is the fintech company run by
11:08
Jack Dorsey. If the name sounds familiar, it's
11:11
because he's also the co-founder of Twitter
11:13
and one of the more influential figures in modern
11:16
tech platforms. Block used to be called Square.
11:20
It's the company behind things like Square Payments,
11:23
Cash App, and a bunch of fintech infrastructure
11:25
that powers small businesses and consumer payments.
11:29
So this isn't some random startup making headlines.
11:32
This is a large public tech company run by someone
11:36
who's been through multiple platform shifts already.
11:40
And now they're talking about what they're calling
11:42
an AI remake. Block is cutting around 4,000
11:46
jobs. This is not just an HR story for ops teams.
11:50
It's an execution story. Because the pattern
11:54
we're going to see everywhere is fewer humans,
11:57
higher output volume, same reliability expectations.
12:00
If AI increases code output, your safety net
12:04
has to scale too. That safety net is not hero
12:08
engineers. It's guardrails, ownership, and systems
12:12
that absorb change without constant babysitting.
12:15
"Small teams move fast" only works if your brakes
12:18
work. Here's the practical ops angle. When teams
12:22
shrink, you lose redundancy. On-call rotations
12:25
get thinner. Specialists disappear. And then
12:28
an incident hits and the blast radius of "we lost
12:31
that context" is massive. So if leadership is
12:34
pushing AI productivity, the immediate question
12:37
is, what are we doing to preserve safety? Do
12:41
this Monday. If your org is in AI productivity
12:44
mode, is your main protected with required checks
12:48
that are actually required? Are releases gated
12:51
in a way that matches risk? Do you have a real
12:55
rollback muscle? Is ownership clear when output
12:58
volume increases? Also, watch the human metrics.
13:02
On-call load, mean time to restore, number of
13:05
pages per week. If those go up while headcount
13:08
goes down, your system is signaling that the
13:11
brakes are failing. One more main story before
13:14
the lightning round. Anthropic announced Claude
13:21
Code Security in limited research preview. The
13:24
pitch is: scan codebases for vulnerabilities and
13:27
suggest targeted patches for human review. This
13:30
is the direction the industry is moving. Security
13:33
work becomes part of the normal coding loop.
13:36
And if it works well, it can reduce backlog and
13:39
reduce time to fix. But the operational lesson
13:42
is the same as every agentic tool. Suggestion
13:46
mode is safe. Autonomous change is where you
13:49
need boundaries. If it can open PRs, it needs
13:52
the same rules as a human. So do this Monday.
13:56
If you are adopting AI security tooling, start
13:59
with suggestion mode. No auto-merge. Require
14:02
human approval for changes. Keep audit trails
14:05
of tool actions, not just a chat transcript.
14:08
Also, measure it like a real tool. How many findings
14:12
were real? How many were noise? Did it actually
14:15
reduce time to fix? Did it introduce risky changes?
14:19
Treat it like CI. Scoped access. Clear ownership.
14:23
And guardrails that prevent fixing by breaking
14:27
things. All right, time for the lightning round.
14:36
Two quick bigger-than-they-look stories.
14:39
DeepSeek reportedly withheld early access to
14:42
a new model from U.S. chipmakers while giving
14:45
Chinese firms early access. AI supply chains
14:49
are geopolitical now, and that impacts what you
14:52
can run where, and what "best model" even means.
14:56
Vercel wrote a great post on security boundaries
14:59
in Agentic Architectures. The core point? Most
15:02
agents run code with access to secrets, and without
15:06
explicit boundaries, that becomes an incident generator.
15:09
Two CVE hits that are very DevOps/SRE-relevant.
15:13
CISA added VMware Aria Operations CVE-2026-22719
15:20
to the KEV catalog, meaning active exploitation.
15:25
If you run Aria Ops, treat that as patch now.
15:29
And CVE-2026-27825 and CVE-2026-27826 in MCP Atlassian
15:37
are nasty because they live at the agent tooling
15:40
layer. SSRF-to-RCE-style chain. Exactly why agent
15:46
toolchains need the same rigor as any other
15:49
production system. And one more quick hit. Claude
15:52
Cowork scheduled tasks is the "agents in boring
15:56
workflows" story. Recurring automation is where
15:59
little mistakes turn into big messes. So you
16:03
want approvals, scope limits, and auditability.
16:07
All right, human closer. Quick credit where it's
16:17
due. If this episode feels like everything is
16:20
a control plane now, that's not just my opinion.
16:23
Uwe Friedrichsen has been writing about the ironies
16:26
of automation for a while, and we used that framing
16:29
in a prior Ship It Weekly episode, too. The idea
16:32
is simple. Automation doesn't remove responsibility.
16:35
It concentrates it. And as systems get faster,
16:38
human oversight doesn't magically speed up. You
16:41
can see it across everything we've covered. Physical
16:44
disruption in a region forces human decisions
16:47
under pressure. GitOps getting stuck forces humans
16:51
to rescue the control plane. CI being actively
16:54
exploited is literally "attack the automation."
16:57
Org changes, framed as an AI remake, increase output
17:01
and reduce humans, which makes guardrails the
17:04
safety net. And AI security tooling is only a
17:07
win if boundaries and approvals stay intact.
17:11
So the takeaway is boring, but it's the job.
17:14
Guardrails are product work. You either build
17:17
brakes or you build faster incidents. All right,
17:20
that's it for this week of Ship It Weekly. AWS
17:23
issues in Bahrain and the UAE. Argo CD pain and
17:28
why Flux keeps coming up. GitHub Actions being
17:31
actively hunted and Trivy getting hit. RoguePilot
17:34
and prompt injection meeting real credentials.
17:38
Block's AI-driven layoffs as an execution risk
17:42
story. And Anthropic pushing AI security tooling
17:46
deeper into the dev workflow. Links and show
17:49
notes are on shipitweekly.fm. If this episode
17:53
was useful, hit follow or subscribe wherever
17:55
you listen. And share it with an ops friend who's
17:58
living in CI, GitOps, or AI everywhere right
18:02
now. I'm Brian, and I'll see you next week.
For this episode, the theme that kept showing up was control planes under pressure.
Not just the obvious ones like Kubernetes or CI/CD.
But the broader set of systems we now depend on to run infrastructure: GitOps controllers, developer workspaces, agent tooling, and even the geopolitical reality behind cloud regions and AI supply chains.
A lot of the stories this week look unrelated on the surface.
An AWS region dealing with infrastructure disruptions in the Middle East.
A GitOps migration story from Argo CD to Flux.
CI pipelines being actively hunted by automated attackers.
Prompt injection turning into token theft in developer environments.
Companies restructuring around “AI productivity.”
And security tooling itself becoming AI-driven.
But if you zoom out a little, they’re all variations of the same underlying shift.
The automation layer has become the real control plane of modern infrastructure.
And once that happens, two things follow very quickly.
First, attackers target the control plane.
Second, organizations try to scale it faster than their guardrails.
You can see the first part clearly in the GitHub Actions attacks that StepSecurity documented.
This wasn’t someone finding a bug in an application.
It was an automated campaign targeting CI workflows across open source repositories.
The attacker isn’t interested in the application logic.
They want the release mechanism.
If you compromise CI, you don’t need a vulnerability in the app.
You can modify the artifacts, steal tokens, publish malicious packages, or pivot into infrastructure.
That’s why the Trivy maintainers’ response was interesting. They quickly published details about the attack vector and the workflow that was responsible.
That’s the right instinct.
CI incidents are supply chain incidents now.
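The workflow checks called out in the episode (pull_request_target usage, actions pinned to tags instead of commit SHAs, no explicit permissions for GITHUB_TOKEN) can be roughed out as a tiny linter.
This is a minimal sketch, not a replacement for a real scanner: it uses regexes rather than a YAML parser, and the function name lint_workflow and the example workflow (including some-org/some-action) are made up for illustration.

```python
import re

# Matches `uses: owner/repo@ref` (optionally with a sub-path). Local (./) and
# docker:// references are skipped because they lack the owner/repo@ref shape.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*([\w.-]+/[\w./-]+)@(\S+)")
FULL_SHA_RE = re.compile(r"^[0-9a-f]{40}$")


def lint_workflow(text: str) -> list:
    """Return human-readable findings for the text of one workflow file."""
    findings = []
    if re.search(r"\bpull_request_target\b", text):
        findings.append("uses pull_request_target: check which code runs and with what token")
    if not re.search(r"^permissions:", text, flags=re.MULTILINE):
        findings.append("no top-level permissions block: GITHUB_TOKEN falls back to the default scope")
    for line in text.splitlines():
        match = USES_RE.match(line)
        if match and not FULL_SHA_RE.match(match.group(2)):
            findings.append(f"action not pinned to a commit SHA: {match.group(1)}@{match.group(2)}")
    return findings


# Hypothetical workflow used only to exercise the checks above.
EXAMPLE = """\
on: pull_request_target
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: some-org/some-action@0123456789abcdef0123456789abcdef01234567
"""

for finding in lint_workflow(EXAMPLE):
    print(finding)
```

A regex pass like this will miss things a real parser would catch (job-level permissions, reusable workflows), so treat it as a starting inventory rather than a gate.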
StepSecurity hackerbot-claw analysis
https://www.stepsecurity.io/blog/hackerbot-claw-github-actions-exploitation
Trivy incident discussion
https://github.com/aquasecurity/trivy/discussions/10265
The second theme was control planes that fail in subtle ways.
The Argo CD story is a good example.
GitOps sounds beautiful in theory.
Git is the source of truth.
The cluster reconciles to match it.
Everything is declarative.
But operationally, the controller itself becomes part of the reliability story.
If it gets stuck on a failed state, it can block the path to recovery.
And the CRD ordering problem mentioned in that migration write-up is something many teams eventually encounter.
It’s not a catastrophic bug.
It’s a behavior mismatch between how engineers expect reconciliation to work and how the system actually behaves.
That’s the dangerous category of failure.
Because it usually shows up during an incident, when you’re trying to deploy the fix.
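Whatever pattern you standardize on for CRDs (sync waves, a separate app, pre-installing), the underlying idea is ordering: definitions land before the resources that depend on them.
Here is a minimal sketch of that idea in isolation. The priority table and the Widget resource are illustrative assumptions, not the ordering any particular GitOps tool actually uses.

```python
# Kinds that other resources depend on are applied first. This table is an
# illustrative assumption, not the ordering of any specific GitOps controller.
KIND_PRIORITY = {"Namespace": 0, "CustomResourceDefinition": 1}
DEFAULT_PRIORITY = 10


def order_manifests(manifests: list) -> list:
    """Stable-sort manifests so CRDs come before the resources that use them."""
    return sorted(manifests, key=lambda m: KIND_PRIORITY.get(m.get("kind"), DEFAULT_PRIORITY))


manifests = [
    {"kind": "Widget", "metadata": {"name": "my-widget"}},  # hypothetical custom resource
    {"kind": "Deployment", "metadata": {"name": "api"}},
    {"kind": "CustomResourceDefinition", "metadata": {"name": "widgets.example.com"}},
]

for m in order_manifests(manifests):
    print(m["kind"], m["metadata"]["name"])
```

Because the sort is stable, everything that is not explicitly prioritized keeps its original relative order, which keeps the behavior predictable when someone reads the runbook at 3 a.m.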
Migration write-up
https://hai.wxs.ro/migrations/argocd-to-flux/
Another version of this control plane expansion is happening in developer environments.
The RoguePilot research is a good example of that.
A malicious GitHub issue can contain instructions that get interpreted when a developer launches a Codespace.
That’s essentially prompt injection as a supply chain vector.
And the problem isn’t just the model.
It’s the environment.
If the agent reading that issue has access to a GITHUB_TOKEN, or can run commands, or can open pull requests, the attacker has a pathway into real operations.
RoguePilot overview
https://thehackernews.com/2026/02/roguepilot-flaw-in-github-codespaces.html
Original research
https://orca.security/resources/blog/roguepilot-github-copilot-vulnerability/
There’s also a bigger conversation happening about agent boundaries.
Vercel published a good write-up on this recently.
The core idea is simple: most agents today run generated code with the same privileges as the developer or system running them.
Which means the real question becomes:
Where is the trust boundary?
Is the agent allowed to read untrusted content?
Is it allowed to execute commands?
Does it have access to secrets?
Those are the same questions we’ve been asking about CI systems for years.
We’re just asking them again for AI tools.
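Those three questions map fairly directly onto a gate between what an agent reads and what it is allowed to do.
The sketch below is one possible shape for that gate, assuming a made-up ToolCall record and hypothetical action names; no real agent framework's API is being described here.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical action names; a real agent framework will define its own.
READ_ONLY_ACTIONS = {"read_file", "search_code"}


@dataclass(frozen=True)
class ToolCall:
    action: str
    context_is_trusted: bool           # False once the agent has ingested issues, PRs, READMEs
    approved_by: Optional[str] = None  # name of the human who approved the call, if any


def authorize(call: ToolCall) -> bool:
    """Allow read-only actions; gate everything else behind trust or human approval."""
    if call.action in READ_ONLY_ACTIONS:
        return True
    return call.context_is_trusted or call.approved_by is not None


def audit_line(call: ToolCall) -> str:
    """Record the tool action itself, not just the chat transcript."""
    verdict = "ALLOW" if authorize(call) else "DENY"
    return f"{verdict} action={call.action} trusted={call.context_is_trusted} approved_by={call.approved_by}"


print(audit_line(ToolCall("read_file", context_is_trusted=False)))
print(audit_line(ToolCall("run_command", context_is_trusted=False)))
print(audit_line(ToolCall("run_command", context_is_trusted=False, approved_by="alice")))
```

Read-only actions are allowed unconditionally here for brevity; in practice, reads that can touch secrets deserve their own scoping.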
Security boundaries in agentic architectures
https://vercel.com/blog/security-boundaries-in-agentic-architectures
The AWS regional disruption story fits into this theme in a different way.
It’s a reminder that cloud infrastructure still exists in the physical world.
Power events, connectivity problems, geopolitical instability — all of these can show up as “cloud issues.”
And that’s why the phrase “multi-AZ is not multi-region” matters.
Spreading across availability zones protects you from localized failures.
Spreading across regions protects you from region-wide ones.
And organizations that treat those as interchangeable eventually discover the difference during a very long outage.
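One way to turn the DR diagram into a decision tree is to write down the failover signals and thresholds before the incident, so the named owner is not improvising.
This is a minimal sketch of that idea. The signal fields and every threshold below are placeholder assumptions; real values would come from your own SLOs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegionSignals:
    error_rate: float            # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float
    provider_reports_issue: bool
    minutes_degraded: int


# Illustrative thresholds only; real values come from your SLOs.
MAX_ERROR_RATE = 0.05
MAX_P99_LATENCY_MS = 2000.0
MAX_MINUTES_DEGRADED = 30


def failover_decision(s: RegionSignals) -> str:
    """Return 'stay', 'prepare', or 'failover' for the named decision owner."""
    degraded = s.error_rate > MAX_ERROR_RATE or s.p99_latency_ms > MAX_P99_LATENCY_MS
    if not degraded:
        return "stay"
    # Partial failure is where teams wait too long, so cap the waiting time.
    if s.provider_reports_issue or s.minutes_degraded >= MAX_MINUTES_DEGRADED:
        return "failover"
    return "prepare"


print(failover_decision(RegionSignals(0.01, 300.0, False, 0)))
print(failover_decision(RegionSignals(0.08, 300.0, False, 10)))
print(failover_decision(RegionSignals(0.08, 300.0, True, 10)))
```

The time cap is the important design choice: it forces a decision while the region is still "partially working," before data divergence and backlogs make the cutover harder.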
Reuters coverage
https://www.reuters.com/world/middle-east/amazon-cloud-unit-flags-issues-bahrain-uae-data-centers-amid-iran-strikes-2026-03-02/
Then there’s the organizational side of all this.
The Block layoffs framed as an “AI remake” are part of a pattern we’re seeing across the industry.
Companies expect automation and AI to increase productivity.
And in many cases, they’re right.
But there’s a hidden constraint.
Automation scales faster than human oversight.
That idea has been explored really well by Uwe Friedrichsen in his Ironies of Automation series.
The key insight is that automation concentrates responsibility rather than eliminating it.
Systems get faster.
Systems get more capable.
But humans do not scale at the same rate.
Which means failures propagate faster than organizations can understand them.
Ironies of Automation series
https://www.ufried.com/blog/ironies_of_automation/
Ironies of AI (Part 2)
https://www.ufried.com/blog/ironies_of_ai_2/
We actually touched on that idea in an earlier Ship It Weekly episode when talking about control planes and automated RCA.
That conversation still applies here.
Earlier episode reference
https://www.tellerstech.com/ship-it-weekly/fail-small-iac-control-planes-and-automated-rca/
One more interesting development this week was Anthropic announcing Claude Code Security.
Tools like this aim to scan codebases for vulnerabilities and propose fixes automatically.
In theory, that’s extremely powerful.
Security teams spend huge amounts of time triaging issues that developers never get around to fixing.
If AI can propose safe patches and reduce that backlog, that’s a real win.
But it also raises the same operational question we’ve been talking about throughout this episode.
Is the tool suggesting changes, or making them autonomously?
Because the moment a system can modify code, open pull requests, or deploy changes, it’s no longer just a scanner.
It’s part of the control plane.
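If you do adopt a tool like this, the episode's advice is to measure it like any other tool: how many findings were real, how many were noise, and whether time to fix actually dropped.
A minimal sketch of that bookkeeping follows. The Finding record and its fields are assumptions for illustration, not anything a specific product exports.

```python
from dataclasses import dataclass
from statistics import median
from typing import Optional


@dataclass(frozen=True)
class Finding:
    confirmed: bool                # did a human agree it was a real issue?
    hours_to_fix: Optional[float]  # None if not fixed yet (or not a real issue)


def summarize(findings: list) -> dict:
    """Share of findings that were real, plus median time to fix for those."""
    confirmed = [f for f in findings if f.confirmed]
    fix_times = [f.hours_to_fix for f in confirmed if f.hours_to_fix is not None]
    return {
        "total": len(findings),
        "precision": len(confirmed) / len(findings) if findings else 0.0,
        "median_hours_to_fix": median(fix_times) if fix_times else None,
    }


# Made-up sample data, only to show the shape of the summary.
sample = [Finding(True, 4.0), Finding(True, 10.0), Finding(False, None), Finding(True, None)]
print(summarize(sample))
```

Tracking the same numbers before and after rollout is what tells you whether the backlog is really shrinking or the tool is just generating more triage work.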
Claude Code Security
https://www.anthropic.com/news/claude-code-security
Finally, a quick note on the AI supply chain angle we mentioned in the lightning round.
DeepSeek reportedly withheld access to a new model from certain U.S. chipmakers while making it available earlier to domestic firms.
This is another reminder that AI infrastructure is now intertwined with geopolitics and hardware supply chains.
Which means “what model can we run” may become just as much a business or regulatory question as a technical one.
DeepSeek coverage
https://www.reuters.com/world/china/deepseek-withholds-latest-ai-model-us-chipmakers-including-nvidia-sources-say-2026-02-25/
If you step back from all of these stories, the through-line becomes pretty clear.
Infrastructure reliability used to be mostly about applications and servers.
Now it’s about the systems that operate the systems.
CI pipelines
GitOps controllers
Developer environments
Agent frameworks
Security automation
Cloud regions
These are the new control planes.
And the work of DevOps and SRE increasingly revolves around making sure those layers are safe, observable, and recoverable when something inevitably goes wrong.
That’s all for this week’s commentary.
If you want the full breakdown of the stories discussed in the episode, check the show notes and episode page.
More episodes are available at
https://shipitweekly.fm
And if this show has been useful, consider sharing it with a teammate or another engineer who’s living in the same automation-heavy world we all are right now.
Thanks for listening. See you next week.