GitHub Agentic Workflows, Gentoo Leaves GitHub, Argo CD 3.3 Upgrade Gotcha, AWS Config Scope Creep

Transcript

This week feels like the platforms are getting

opinionated. Not here's a new feature, more like

here's the new default way you're going to operate.

And if you don't notice the default changed,

you'll feel it later. Usually as surprise work,

surprise risk, or surprise downtime. What's up,

everybody? I'm Brian, and this is Ship It Weekly,

a short weekly show where I filter the noise

and focus on what actually changes how we run

infra. Quick update before we jump in. Ship It

Weekly is officially a video podcast now. Every

episode going forward is available on YouTube

in full video format. If you're audio only, nothing

changes. Same feed, same cadence, and same show.

Also, OnCallBrief.com is live. That's my system

for tracking DevOps and infra news without drowning

in tabs. Briefs start on Sunday, get refined

throughout the week, and Thursday is the final

pass. If you want to see what I'm tracking before

the episode drops, that's the place. And Teller's

Tech now has a Substack too. If you want episodes

and weekly briefs delivered to your inbox, that's

the easiest option. And one last thing, I'm starting

another round of interviews. If you want to come

on and talk about a migration, a postmortem, a weird

outage, or how your team actually runs production,

hit me up on shipitweekly.fm. All right, let's

get into it. Five stories for today. GitHub is

putting agentic workflows directly into Actions,

which changes what CI can do without a human driving.

Gentoo is moving away from GitHub to Codeberg.

It's a reminder that the forge is not neutral

infrastructure anymore. Argo CD upgrades are

forcing server-side apply in certain paths.

This is one of those small-line-in-the-notes,

big-day-in-prod things. AWS Config expanded coverage

again, which sounds boring until you realize

governance scope can move under your feet. And

AWS enabled nested virtualization on virtual EC2

instances, a capability unlock that's going to

change what people attempt to run. Then we'll

do a lightning round and a human closer on the

gap between AI everywhere and incidents still

being painful. First main story, GitHub agentic

workflows in Actions. GitHub dropped agentic

workflows into technical preview. The simple

version is you write the intent in Markdown and

an agent runs inside GitHub Actions to do repo

work. It's not just "run these steps," it's more

like "here's the goal, go handle it." Think issue

triage, basic repo maintenance, investigating

CI failures, and proposing fixes. This is not

the same thing as Agent HQ. Agent HQ was agents

in the GitHub experience. This is agents inside

your automation engine. That's a very different

place to add intelligence. So why does this matter?

Actions is where a lot of secrets and permissions

live. It's also where small automations quietly

become core production workflows. When an agent

can take actions, not just produce suggestions,

you've created a new write path. And write paths

have two rules. One, they need ownership. But

two, they need constraints. Because the failure

mode isn't always malicious. The failure mode

is helpful, fast, and wrong. The scary version

is an agent editing workflow files, because workflow

files are basically the keys to the kingdom. Or

an agent doing cleanup that breaks a dependency

you didn't even remember existed. Or an agent

repeatedly retrying something until it finds

a path that works but violates policy. The first

time an agent edits a workflow, it's a different

game. So do this Monday. First, inventory what

Actions can actually do today. Don't start with

features, start with permissions. Which repos

allow workflows to write back to the repo? Which

repos allow modifying workflow files? Which workflows

can deploy, publish artifacts, or touch environments?
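
A minimal sketch of that inventory step, in Python. It assumes each workflow file has already been parsed into a dict with a YAML library of your choice, and it only inspects the `permissions` key at workflow and job level. The function names and the placeholder strings `<inherits default>` and `<all>` are made up for illustration. A missing `permissions` block is flagged because the token then falls back to whatever default the repo or org has configured.

```python
from typing import Any


def write_scopes(permissions: Any) -> list[str]:
    """Return the scopes that a `permissions` value grants write access to."""
    if permissions is None:
        # No block at all: the repo/org default token permissions apply.
        return ["<inherits default>"]
    if isinstance(permissions, str):
        # Shorthand forms such as `read-all` and `write-all`.
        return ["<all>"] if permissions == "write-all" else []
    return sorted(scope for scope, level in permissions.items() if level == "write")


def audit_workflow(workflow: dict) -> dict[str, list[str]]:
    """Map 'workflow' and each job id to the write scopes it can use."""
    top = workflow.get("permissions")
    report = {"workflow": write_scopes(top)}
    for job_id, job in workflow.get("jobs", {}).items():
        # A job-level block replaces the workflow-level one for that job.
        report[job_id] = write_scopes(job.get("permissions", top))
    return report


# Hypothetical parsed workflow, for illustration only.
example = {
    "permissions": {"contents": "read"},
    "jobs": {
        "triage": {"permissions": {"issues": "write", "contents": "read"}},
        "lint": {},
    },
}
print(audit_workflow(example))
```

Run across every repo, a report like this is a starting list of which workflows hold write scopes, which is the list an agent would inherit.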

Second, separate read-only automation from write

automation. Agents can comment and propose, sure,

but merges should require a human. That's the

clean line that prevents a lot of pain. Third,

lock down permissions in workflows. If you are

still running broad default tokens, you need

to fix that now. Treat write scopes like production

credentials, because that's what they are. Fourth,

add explicit approval gates for anything high

leverage. Secrets, deployments, runners, workflow

file changes. If an agent can do it, a human

should approve it. At least until you trust it

and you've tested the blast radius. Fifth, logging.

You need a record of what the agent changed and

why, not just standard out from the workflow.

Because the post-incident question won't be

"who clicked it," it'll be "what sequence of actions

did the agent take." Okay, that's story one. Story

two is Gentoo moving off of GitHub to Codeberg.

So Gentoo is moving repositories off of GitHub

and towards Codeberg. They are pretty direct

about why. They're uncomfortable with the direction

GitHub is going around Copilot and AI pressure.

This isn't a rage quit. It's an operational migration.

Mirrors, workflow changes, community adjustments,

all of the annoying real-world stuff. So why

does this matter? Because GitHub isn't just a

place where code lives. It's your auth story.

your CI story, your review flow, your issue tracker,

your release automation, your dependency and

security alerts. So when the platform changes

incentives or defaults, it's not a cosmetic change.

It alters how work gets done. And Gentoo is basically

saying, we want leverage and we want options.

Even if you don't care about the politics, the

engineering lesson is real. If a platform becomes

your default for everything, you don't notice

lock-in until you have to change something fast.

And then you realize half your workflow isn't

portable. Not the code, the workflow. Mirrors

aren't boring. Mirrors are exit ramps. So do

this Monday. Pick one repo that matters. Map

the dependencies. What breaks if GitHub is down

for a day? Not "developers are annoyed." What breaks

operationally? Builds, releases, packages, security

workflows, required checks, even "how do we coordinate

changes?" Now, ask a better question. What do

we need to keep shipping if the forge is degraded?

Do we have mirrors? Do we have backups of more

than Git objects? Issues and PR metadata matter.

Release artifacts matter. Actions config and

required checks matter. If you've never tested

a restore, do a small exercise. Pretend the repo

is gone. Restore it and run the minimum path

to ship. You don't need a full migration plan.

You need confidence that you are not trapped.
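
As a rough starting point for the mirror part, here is a Python sketch that builds the Git commands for keeping a bare mirror up to date. The function names, the URL, and the destination path are placeholders. It covers Git objects and refs only. Issues, PR metadata, release artifacts, and required-check settings need their own export, which this does not attempt.

```python
import subprocess
from pathlib import Path


def mirror_commands(source_url: str, dest: Path) -> list[list[str]]:
    """Return the git commands that create or refresh a bare mirror at `dest`."""
    if dest.exists():
        # Existing mirror: fetch all refs and drop ones deleted upstream.
        return [["git", "-C", str(dest), "remote", "update", "--prune"]]
    # First run: a mirror clone copies every ref, not just branches.
    return [["git", "clone", "--mirror", source_url, str(dest)]]


def run_mirror(source_url: str, dest: Path) -> None:
    """Execute the commands, raising if git exits non-zero."""
    for cmd in mirror_commands(source_url, dest):
        subprocess.run(cmd, check=True)


# Placeholder URL and path, for illustration only.
print(mirror_commands("https://example.invalid/org/repo.git", Path("/backups/repo.git")))
```

The restore exercise is then the reverse: clone from the mirror into a clean machine and see how far the minimum path to ship gets without the forge.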

Okay, that's story two. Story three is Argo CD upgrades

and server-side apply requirements. This one

is pure operator reality. Argo CD has an upgrade

path where server-side apply is required in

certain setups, especially when Argo is managing

itself or when you are applying manifests directly.

The reason is a Kubernetes annotation size limitation.

The last-applied blob can get too large, and it

breaks apply behavior in weird ways. So the fix

becomes: use SSA. Let the API server own field

management. So why does this matter? This isn't

just a flag you add to a command. It changes

how ownership works. SSA tracks fields differently.

And that means upgrades can overwrite things

you didn't realize you were relying on. This

is where hidden customizations come back to haunt

you. The little prod-only patches. The tolerations

someone added once. The probe tweak that never

got upstreamed into your real config. Upgrades

are where tribal knowledge gets erased, or worse,

half-erased. And Argo is a special kind of risky

upgrade because it's your deploy system. If Argo

is down, you don't just lose a tool. You lose

the safe path to change. Then everyone starts

doing manual kubectl, and drift shows up immediately.

GitOps is calm until the GitOps system is the

incident. So do this Monday. First, look at how

you deploy Argo today. Is Argo managing itself?

If yes, verify SSA is enabled in the application

sync options before upgrading. Don't wait to

learn this during the upgrade. Second, diff your

live Argo resources against what you think you

apply, and then find the hand edits. Find the

temporary patches, write them down, and formalize

them. Third, build an upgrade lane for Argo,

even if it's small. A rehearsal environment.

Same method, same manifest, same shape. Practice

upgrade and rollback, and validate: can we sync

a known app after upgrade? Fourth, rehearse "Argo

is down" mode. How do you deploy without it? How

do you stop it from fighting you if it's partially

alive? How do you get back to a known good state?

Because when Argo breaks, every minute feels

expensive. And thinking clearly gets harder.
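
The first check above can be sketched in Python. This assumes the Argo CD Application manifest has already been loaded into a dict (from YAML or from the API) and looks for the `ServerSideApply=true` entry under `spec.syncPolicy.syncOptions`. The function name and the sample manifest are illustrative. It checks only the Application-level option, so anything set per resource through annotations is out of scope here.

```python
def has_server_side_apply(application: dict) -> bool:
    """True if the Application enables SSA through its sync options."""
    spec = application.get("spec") or {}
    sync_policy = spec.get("syncPolicy") or {}
    options = sync_policy.get("syncOptions") or []
    return "ServerSideApply=true" in options


# Hypothetical self-managing Application, for illustration only.
argocd_app = {
    "kind": "Application",
    "metadata": {"name": "argocd"},
    "spec": {"syncPolicy": {"syncOptions": ["CreateNamespace=true"]}},
}

if not has_server_side_apply(argocd_app):
    print("argocd: ServerSideApply=true is not set, check before upgrading")
```

Running this against the Application that manages Argo itself, in the rehearsal environment first, turns "verify SSA is enabled" into a yes or no before the upgrade window.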

Okay, that's story three. Story four is AWS Config

adding 30 new resource types. So AWS Config just

added support for 30 additional resource types.

Here's the key detail. If you record all resource

types, Config can start tracking new types automatically.

So scope expands under you without you doing

anything. This is a quiet change. But it affects

inventory, governance, and sometimes cost. So

why does this matter? Most teams don't treat

Config like a living system. They treat it like

a checkbox. Then one day they try to get serious

about governance and compliance, and they realize

Config is actually foundational data. So when

coverage expands, that's good. But it also means

more evaluation surface. More resources showing

up in aggregators. More rules being evaluated.

More non-compliant noise. And the worst failure

mode is not "Config has data." It's "Config has

data nobody owns." New resource types show up.

Rules fire. Nobody knows who should fix it. So

it becomes a platform team problem by default.

And platform teams get buried in triage instead

of improving systems. Inventory expansion is

great until it becomes surprise accountability.

So what I would do Monday: go check your Config

recorder settings. Are you recording all resource

types? If yes, decide if that's intentional or

if it's just "we clicked it once years ago." Then

check your rules. Which rules will start evaluating

these new types? Tagging rules, encryption rules,

public access rules, all the stuff. If new types

will create noise, decide routing. Who owns the

alerts? Who owns remediation? Also, baseline your

Config usage and costs. Not because this change

will wreck your bill, but because it changes the

scope, and it's easier to explain early than late.

Finally, tighten your ownership metadata. If

you can't answer who owns this resource, governance

becomes a scavenger hunt. That's story four.
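
The recorder check from this story can be sketched in Python. It assumes you already have the response of a describe-configuration-recorders call as a dict (for example from boto3 or the AWS CLI's JSON output) and reads only the `allSupported` flag in each recorder's `recordingGroup`. The function name and the sample response are illustrative, and other recording strategies are not handled.

```python
def recorders_tracking_everything(response: dict) -> list[str]:
    """Names of recorders set to record all supported resource types."""
    names = []
    for recorder in response.get("ConfigurationRecorders", []):
        group = recorder.get("recordingGroup") or {}
        if group.get("allSupported"):
            # New resource types will be picked up without a config change.
            names.append(recorder.get("name", "<unnamed>"))
    return names


# Hypothetical response, trimmed to the fields this sketch reads.
sample = {
    "ConfigurationRecorders": [
        {"name": "default", "recordingGroup": {"allSupported": True}},
        {"name": "scoped", "recordingGroup": {"allSupported": False,
                                              "resourceTypes": ["AWS::S3::Bucket"]}},
    ]
}
print(recorders_tracking_everything(sample))
```

Any recorder this returns is one where scope can grow on its own, so that is where to confirm the setting is intentional and that someone owns the resulting findings.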

Story five is EC2 nested virtualization on virtual

instances. AWS now supports nested virtualization

on certain virtual EC2 instances. Historically,

nested virtualization on AWS was usually a bare

metal story. Now it's possible on virtualized

instances for some families. This is a capability

unlock. It's also a behavior unlock because the

moment this exists, teams will attempt things

they couldn't justify before. Full lab environments

inside EC2. VM-heavy testing. Security sandboxes.

"Let's run a hypervisor inside our runner fleet."

So why does this matter? Nested virtualization

sounds niche, but it's really about reproducibility

and isolation. If you've ever wanted a test environment

that looks closer to prod, this helps. And if

you have tooling that expects a hypervisor, this

helps. And if you were paying the bare metal

tax purely for nested virt, this might be a cost

lever, which is going to matter for some orgs.

But it's also a footgun if you combine it with

credentials and loose network controls. Anything

that starts looking like a workstation gets treated

like a workstation. People install random tools.

People store secrets in the wrong places. People

run "just this one thing." And that's why anything

that looks like a workstation eventually gets

treated like one. So do this Monday. If you have

runner fleets, build fleets, or sandbox accounts,

add this to your threat model. Ask what changes

if nested VMs become available. Then set boundaries.

Which accounts allow it? Which VPCs allow it?

Do you want tighter egress controls? And then

document an internal stance. Even a short note

helps. Like, "we support this for these use cases

on these instance families with these guardrails."

Because if you don't write the rules, you will

end up inheriting random experiments. And then

you'll learn about them during an incident or

a bill review. Okay, that's story five. Okay,

time for the lightning round, short and practical.

GitHub updated their status page experience.

There's now a 90-day historical view and better

linking between incident days and availability

trends. And honestly, given GitHub's hiccups

lately, having a status page that's actually

useful is a welcome addition. Open Build Service

published a postmortem on a disruption that

came down to database migration and locking behavior.

It's a good reminder that a migration plan is not

the same as a rollback plan. And a quick reminder

because we covered it already, GitHub Actions

extended the self-hosted runner minimum version

enforcement window. Treat that as runway, not

permission to ignore it. If you have self-hosted

runners, schedule the upgrade work. Another quick

GitHub one, Actions had early February updates

around things like runner controls and settings

that reduce surprise drift across orgs. It's

not headline news, but it's the kind of incremental

improvement that saves platform teams time. AWS

Config expanding coverage is also a reminder

of a bigger pattern. A lot of discover-everything

services expand under you as AWS adds new stuff.

That's not bad. It just means you need ownership

or the tool becomes noise. And if you are experimenting

with agentic workflows, don't skip the boring

part. Permissions, approval gates, and audit

trails. That's the difference between useful

automation and mystery automation. Okay, that's

the lightning round. Time for the human closer.

There's a post called Lots of AI SRE, No AI Incident

Management, and it nails something that feels

obvious once you say it. Most AI tooling in ops

is aimed at producing output faster. Write the

YAML, draft the runbook, summarize the log, generate

the postmortem doc. That's useful, but it's not

the core pain during a real incident. Incidents

aren't mostly writing. Incidents are uncertainty.

What changed? What's real? What's correlated

versus not? And incidents are coordination. Who's

driving? Who's communicating externally? Who's

making the rollback call? And how do we keep

the team aligned when five things are happening

at once? That is still wildly human. And honestly,

that's what makes on-call exhausting. Now tie

this back to today's stories. We are putting

more automation into the workflow. And in some

cases, we are giving it more agency. Agents in

Actions. Deploy systems that can strand you

mid-upgrade. Governance tools that expand scope

automatically. And all of this increases the

number of things happening around incidents.

So if those tools don't reduce uncertainty, they

can increase chaos. The win is not faster output.

The win is less uncertainty for tired humans.

If AI can help, great. But the bar is, does it

help you decide what to do next safely? Does

it tell you what it's unsure about? Does it show

you what it tried and ruled out? Can it give

you an explanation you can trust at 3 a.m.,

not just a confident guess? So my take this week

is simple. When you evaluate tooling, don't judge

it by how clever it sounds. Judge it by whether

it reduces uncertainty when you are on call.

Because that's the moment that matters. That's

where reliability is real. Okay, time for a recap.

Today we talked about GitHub agentic workflows

in Actions, and how it's not just nicer CI,

it's a new write path that needs guardrails.

Gentoo moving towards Codeberg. Forge choice is

supply chain, governance, and leverage, not just

convenience. Argo CD upgrades requiring SSA in

certain paths. Control plane upgrades deserve

their own lane and rehearsals. AWS Config adding

30 new resource types. Great coverage, but scope

can expand under you, so be intentional. EC2

nested virtualization on virtual instances. Capability

unlock, and also a new what-will-teams-attempt-now

moment. The lightning round was around some

GitHub stories and Open Build Service publishing

a postmortem. If you want the video version,

full episodes are now on YouTube going forward.

If you want the weekly briefs, OnCallBrief.com

is live. And if you want everything delivered

by email, Teller's Tech Substack is up. And lastly,

if you want to come on the show for an interview,

reach out at ShipItWeekly.fm. More episodes,

links, and show notes are on ShipItWeekly.fm.

All right, I'm Brian, and I'll catch you next

week.
