GitHub Agentic Workflows, Gentoo Leaves GitHub, Argo CD 3.3 Upgrade Gotcha, AWS Config Scope Creep

Transcript

0:00 This week feels like the platforms are getting

0:02 opinionated. Not "here's a new feature," more like

0:05 "here's the new default way you're going to operate."

0:08 And if you don't notice the default changed,

0:11 you'll feel it later. Usually as surprise work,

0:13 surprise risk, or surprise downtime. What's up,

0:33 everybody? I'm Brian, and this is Ship It Weekly,

0:36 a short weekly show where I filter the noise

0:39 and focus on what actually changes how we run

0:41 infra. Quick update before we jump in. Ship It

0:45 Weekly is officially a video podcast now. Every

0:48 episode going forward is available on YouTube

0:51 in full video format. If you're audio only, nothing

0:54 changes. Same feed, same cadence, and same show.

0:57 Also, OnCallBrief.com is live. That's my system

1:02 for tracking DevOps and infra news without drowning

1:05 in tabs. Briefs start on Sunday, get refined

1:09 throughout the week, and Thursday is the final

1:11 pass. If you want to see what I'm tracking before

1:14 the episode drops, that's the place. And Teller's

1:17 Tech now has a Substack too. If you want episodes

1:20 and weekly briefs delivered to your inbox, that's

1:23 the easiest option. And one last thing, I'm starting

1:26 another round of interviews. If you want to come

1:28 on and talk a migration, a post-mortem, a weird

1:32 outage, or how your team actually runs production,

1:35 hit me up on shipitweekly.fm. All right, let's

1:38 get into it. Five stories for today. GitHub is

1:42 putting agentic workflows directly into Actions.

1:45 That changes what CI can do without a human driving.

1:49 Gentoo is moving away from GitHub to Codeberg.

1:53 It's a reminder that the forge is not neutral

1:56 infrastructure anymore. Argo CD upgrades are

1:59 forcing server-side apply in certain paths.

2:02 This is one of those "small line in the notes,

2:05 big day in prod" things. AWS Config expanded coverage

2:09 again, which sounds boring until you realize

2:13 governance scope can move under your feet. And

2:16 AWS enabled nested virtualization on virtual EC2

2:20 instances, a capability unlock that's going to

2:23 change what people attempt to run. Then we'll

2:26 do a lightning round and a human closer on the

2:28 gap between AI everywhere and incidents still

2:31 being painful. First main story, GitHub agentic

2:38 workflows in Actions. GitHub dropped agentic

2:41 workflows into technical preview. The simple

2:44 version is you write the intent in Markdown and

2:47 an agent runs inside GitHub Actions to do repo

2:50 work. It's not just "run these steps," it's more

2:53 like "here's the goal, go handle it." Think issue

2:56 triage, basic repo maintenance, investigating

2:59 CI failures, and proposing fixes. This is not

3:03 the same thing as Agent HQ. Agent HQ was agents

3:07 in the GitHub experience. This is agents inside

3:10 your automation engine. That's a very different

3:12 place to add intelligence. So why does this matter?

3:16 Actions is where a lot of secrets and permissions

3:18 live. It's also where small automations quietly

3:21 become core production workflows. When an agent

3:24 can take actions, not just produce suggestions,

3:28 you've created a new write path. And write paths

3:31 have two rules. One, they need ownership. But

3:34 two, they need constraints. Because the failure

3:37 mode isn't always malicious. The failure mode

3:40 is helpful, fast, and wrong. The scary version

3:44 is an agent editing workflow files, because workflow

3:47 files are basically the keys to the kingdom. Or

3:50 an agent doing cleanup that breaks a dependency

3:52 you didn't even remember existed. Or an agent

3:56 repeatedly retrying something until it finds

3:59 a path that works but violates policy. The first

4:02 time an agent edits a workflow, it's a different

4:05 game. So do this Monday. First, inventory what

4:09 Actions can actually do today. Don't start with

4:12 features, start with permissions. Which repos

4:15 allow workflows to write back to the repo? Which

4:18 repos allow modifying workflow files? Which workflows

4:22 can deploy, publish artifacts, or touch environments?

4:25 Second, separate read-only automation from write

4:29 automation. Agents can comment and propose, sure,

4:32 but merges should require a human. That's the

4:35 clean line that prevents a lot of pain. Third,

4:39 lock down permissions in workflows. If you are

4:42 still running broad default tokens, you need

4:44 to fix that now. Treat write scopes like production

4:48 credentials, because that's what they are. Fourth,

4:51 add explicit approval gates for anything high

4:53 leverage. Secrets, deployments, runners, workflow

4:57 file changes. If an agent can do it, a human

5:01 should approve it. At least until you trust it

5:03 and you've tested the blast radius. Fifth, logging.

5:07 You need a record of what the agent changed and

5:10 why, not just standard out from the workflow.

5:13 Because the post-incident question won't be

5:15 who clicked it, it'll be what sequence of actions

5:19 did the agent take. Okay, that's story one. Story

5:26 two is Gentoo moves off of GitHub to Codeberg.

5:30 So Gentoo is moving repositories off of GitHub

5:33 and towards Codeberg. They are pretty direct

5:36 about why. They're uncomfortable with the direction

5:39 GitHub is going around Copilot and AI pressure.

5:42 This isn't a rage quit. It's an operational migration.

5:46 Mirrors, workflow changes, community adjustments,

5:49 all of the annoying real-world stuff. So why

5:53 does this matter? Because GitHub isn't just a

5:55 place where code lives. It's your auth story,

5:58 your CI story, your review flow, your issue tracker,

6:03 your release automation, your dependency and

6:05 security alerts. So when the platform changes

6:08 incentives or defaults, it's not a cosmetic change.

6:12 It alters how work gets done. And Gentoo is basically

6:16 saying, we want leverage and we want options.

6:19 Even if you don't care about the politics, the

6:22 engineering lesson is real. If a platform becomes

6:24 your default for everything, you don't notice

6:27 lock-in until you have to change something fast.

6:30 And then you realize half your workflow isn't

6:32 portable. Not the code, the workflow. Mirrors

6:35 aren't boring. Mirrors are exit ramps. So do

6:39 this Monday. Pick one repo that matters. Map

6:42 the dependencies. What breaks if GitHub is down

6:44 for a day? Not "developers are annoyed." What breaks

6:48 operationally? Builds, releases, packages, security

6:52 workflows, required checks, even how do we coordinate

6:56 changes? Now, ask a better question. What do

6:59 we need to keep shipping if the forge is degraded?

7:02 Do we have mirrors? Do we have backups of more

7:05 than Git objects? Issues and PR metadata matter.

7:09 Release artifacts matter. Actions config and

7:12 required checks matter. If you've never tested

7:14 restore, do a small exercise. Pretend the repo

7:17 is gone. Restore it and run the minimum path

7:20 to ship. You don't need a full migration plan.

7:23 You need confidence that you are not trapped.
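
A minimal sketch of the mirror step behind that restore exercise, assuming plain `git` is installed on the machine. The URLs and the `workdir` path are hypothetical placeholders. A Git mirror covers only refs and objects, so issues, PR metadata, and release artifacts would still need their own export.

```python
import subprocess


def mirror_commands(source_url: str, mirror_url: str, workdir: str) -> list[list[str]]:
    """Build the git commands that copy every ref from source to mirror."""
    local = f"{workdir.rstrip('/')}/repo.git"
    return [
        # --mirror clones all refs (branches, tags, notes) into a bare repo.
        ["git", "clone", "--mirror", source_url, local],
        # push --mirror makes the mirror's refs match the local bare repo,
        # which also deletes refs on the mirror that no longer exist upstream.
        ["git", "-C", local, "push", "--mirror", mirror_url],
    ]


def run_mirror(source_url: str, mirror_url: str, workdir: str) -> None:
    """Run the commands in order, stopping at the first failure."""
    for cmd in mirror_commands(source_url, mirror_url, workdir):
        subprocess.run(cmd, check=True)


# Hypothetical URLs, shown only to illustrate the shape of the commands.
for cmd in mirror_commands(
    "https://example.com/team/app.git",
    "https://example.org/team/app.git",
    "/tmp/mirror-test",
):
    print(" ".join(cmd))
```

Building the commands separately from running them keeps the exercise reviewable: the list can be printed and checked before anything touches a remote.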

7:26 Okay, that's story two. Story three is Argo CD upgrades

7:34 and server-side apply requirements. This one

7:37 is pure operator reality. Argo CD has an upgrade

7:41 path where server-side apply is required in

7:44 certain setups, especially when Argo is managing

7:47 itself or when you are applying manifests directly.

7:50 The reason is a Kubernetes annotation size limitation.

7:54 The last-applied blob can get too large and it

7:57 breaks apply behavior in weird ways. So the fix

8:00 becomes "use SSA." Let the API server own field

8:05 management. So why does this matter? This isn't

8:08 just a flag you add to a command. It changes

8:11 how ownership works. SSA tracks fields differently.

8:15 And that means upgrades can overwrite things

8:17 you didn't realize you were relying on. This

8:20 is where hidden customizations come back to haunt

8:22 you. The little prod-only patches. The tolerations

8:26 someone added once. The probe tweak that never

8:29 got upstreamed into your real config. Upgrades

8:32 are where tribal knowledge gets erased, or worse,

8:35 half-erased. And Argo is a special kind of risky

8:39 upgrade because it's your deploy system. If Argo

8:43 is down, you don't just lose a tool. You lose

8:46 the safe path to change. Then everyone starts

8:49 doing manual kubectl, and drift shows up immediately.

8:53 GitOps is calm until the GitOps system is the

8:56 incident. So do this Monday. First, look at how

9:00 you deploy Argo today. Is Argo managing itself?

9:04 If yes, verify SSA is enabled in the application

9:08 sync options before upgrading. Don't wait to

9:11 learn this during the upgrade. Second, diff your

9:14 live Argo resources against what you think you

9:18 apply, and then find the hand edits. Find the

9:21 temporary patches, write them down, and formalize

9:24 them. Third, build an upgrade lane for Argo,

9:27 even if it's small. A rehearsal environment.

9:29 Same method, same manifest, same shape. Practice

9:33 upgrade and rollback, and validate can we sync

9:36 a known app after upgrade. Fourth, rehearse "Argo

9:40 is down" mode. How do you deploy without it? How

9:42 do you stop it from fighting you if it's partially

9:45 alive? How do you get back to a known good state?

9:48 Because when Argo breaks, every minute feels

9:51 expensive. And thinking clearly gets harder.
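
A rough sketch of the first check above: does the Application that manages Argo have server-side apply turned on? It assumes the Application has already been exported to a Python dict (for example from JSON output of the cluster), and that SSA is set through the `ServerSideApply=true` entry in `spec.syncPolicy.syncOptions`. The sample Application here is hypothetical. The second helper measures the last-applied annotation that the size limit is about.

```python
LAST_APPLIED = "kubectl.kubernetes.io/last-applied-configuration"
# Kubernetes caps the combined size of all annotations on one object at 256 KiB.
ANNOTATION_LIMIT_BYTES = 256 * 1024


def has_server_side_apply(app: dict) -> bool:
    """Return True if the Application's sync options enable server-side apply."""
    sync_policy = app.get("spec", {}).get("syncPolicy") or {}
    options = sync_policy.get("syncOptions") or []
    return "ServerSideApply=true" in options


def last_applied_bytes(resource: dict) -> int:
    """Size in bytes of the client-side apply annotation, 0 if it is absent."""
    annotations = resource.get("metadata", {}).get("annotations") or {}
    return len(annotations.get(LAST_APPLIED, "").encode("utf-8"))


# Hypothetical self-managed Argo CD Application, trimmed to the relevant fields.
app = {
    "metadata": {"name": "argocd", "annotations": {}},
    "spec": {"syncPolicy": {"syncOptions": ["CreateNamespace=true"]}},
}

if not has_server_side_apply(app):
    print(f"{app['metadata']['name']}: add ServerSideApply=true before upgrading")
print(f"last-applied annotation: {last_applied_bytes(app)} of {ANNOTATION_LIMIT_BYTES} bytes")
```

Running the same two checks against live resources before the rehearsal upgrade gives a short list of what needs changing, rather than discovering it mid-upgrade.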

9:55 Okay, that's story three. Story four is AWS Config

10:02 adding 30 new resource types. So AWS Config just

10:06 added support for 30 additional resource types.

10:09 Here's the key detail. If you record all resource

10:12 types, Config can start tracking new types automatically.

10:16 So scope expands under you without you doing

10:19 anything. This is a quiet change. But it affects

10:22 inventory, governance, and sometimes cost. So

10:25 why does this matter? Most teams don't treat

10:27 Config like a living system. They treat it like

10:30 a checkbox. Then one day they try to get serious

10:33 about governance and compliance, and they realize

10:36 Config is actually foundational data. So when

10:39 coverage expands, that's good. But it also means

10:41 more evaluation surface. More resources showing

10:44 up in aggregators. More rules being evaluated.

10:48 More non-compliant noise. And the worst failure

10:51 mode is not "Config has data." It's "Config has

10:54 data nobody owns." New resource types show up.

10:57 Rules fire. Nobody knows who should fix it. So

11:00 it becomes a platform team problem by default.

11:03 And platform teams get buried in triage instead

11:06 of improving systems. Inventory expansion is

11:09 great until it becomes surprise accountability.

11:12 So what I would do Monday, go check your Config

11:15 recorder settings. Are you recording all resource

11:17 types? If yes, decide if that's intentional or

11:21 if it's just "we clicked it once years ago." Then

11:24 check your rules. Which rules will start evaluating

11:26 these new types? Tagging rules, encryption rules,

11:29 public access rules, all the stuff. If new types

11:33 will create noise, decide routing. Who owns the

11:35 alerts? Who owns remediation? Also, baseline your

11:39 Config usage and costs. Not because this change

11:43 will wreck your bill, but because it changes the

11:45 scope, and it's easier to explain early than late.

11:49 Finally, tighten your ownership metadata. If

11:52 you can't answer who owns this resource, governance

11:55 becomes a scavenger hunt. That's story four.

12:02 Story five is EC2 nested virtualization on virtual

12:05 instances. AWS now supports nested virtualization

12:09 on certain virtual EC2 instances. Historically,

12:13 nested virtualization on AWS was usually a bare

12:18 metal story. Now it's possible on virtualized

12:21 instances for some families. This is a capability

12:24 unlock. It's also a behavior unlock because the

12:28 moment this exists, teams will attempt things

12:30 they couldn't justify before. Full lab environments

12:34 inside EC2. VM-heavy testing. Security sandboxes.

12:38 "Let's run a hypervisor inside our runner fleet."

12:42 So why does this matter? Nested virtualization

12:45 sounds niche, but it's really about reproducibility

12:48 and isolation. If you've ever wanted a test environment

12:52 that looks closer to prod, this helps. And if

12:55 you have tooling that expects a hypervisor, this

12:58 helps. And if you were paying the bare metal

13:01 tax purely for nested virt, this might be a cost

13:05 lever, which is going to matter for some orgs.

13:08 But it's also a footgun if you combine it with

13:11 credentials and loose network controls. Anything

13:14 that starts looking like a workstation gets treated

13:17 like a workstation. People install random tools.

13:20 People store secrets in the wrong places. People

13:23 run "just this one thing." And that's why anything

13:25 that looks like a workstation eventually gets

13:27 treated like one. So do this Monday. If you have

13:30 runner fleets, build fleets, or sandbox accounts,

13:34 add this to your threat model. Ask what changes

13:37 if nested VMs become available. Then set boundaries.

13:41 Which accounts allow it? Which VPCs allow it?

13:44 Do you want tighter egress controls? And then

13:47 document an internal stance. Even a short note

13:50 helps. Like, we support this for these use cases

13:54 on these instance families with these guardrails.

13:57 Because if you don't write the rules, you will

14:00 end up inheriting random experiments. And then

14:03 you'll learn about them during an incident or

14:05 a bill review. Okay, that's story five. Okay,

14:15 time for the lightning round, short and practical.

14:18 GitHub updated their status page experience.

14:21 There's now a 90-day historical view and better

14:24 linking between incident days and availability

14:26 trends. And honestly, given GitHub's hiccups

14:30 lately, having a status page that's actually

14:32 useful is a welcome addition. Open Build Service

14:35 published a post-mortem on a disruption that

14:39 came down to database migration and locking behavior.

14:42 It's a good reminder that migration plan is not

14:45 the same as rollback plan. And a quick reminder

14:48 because we covered it already, GitHub Actions

14:51 extended the self-hosted runner minimum version

14:53 enforcement window. Treat that as runway, not

14:56 permission to ignore it. If you have self-hosted

14:59 runners, schedule the upgrade work. Another quick

15:02 GitHub one, Actions had early February updates

15:06 around things like runner controls and settings

15:09 that reduce surprise drift across orgs. It's

15:12 not headline news, but it's the kind of incremental

15:15 improvement that saves platform teams time. AWS

15:18 Config expanding coverage is also a reminder

15:21 of a bigger pattern. A lot of "discover everything"

15:25 services expand under you as AWS adds new stuff.

15:28 That's not bad. It just means you need ownership

15:31 or the tool becomes noise. And if you are experimenting

15:35 with agentic workflows, don't skip the boring

15:37 part. Permissions, approval gates, and audit

15:40 trails. That's the difference between useful

15:42 automation and mystery automation. Okay, that's

15:45 the lightning round. Time for the human closer.

15:55 There's a post called Lots of AI SRE, No AI Incident

15:59 Management, and it nails something that feels

16:02 obvious once you say it. Most AI tooling in ops

16:05 is aimed at producing output faster. Write the

16:08 YAML, draft the runbook, summarize the log, generate

16:12 the postmortem doc. That's useful, but it's not

16:15 the core pain during a real incident. Incidents

16:18 aren't mostly writing. Incidents are uncertainty.

16:22 What changed? What's real? What's correlated

16:24 versus not? And incidents are coordination. Who's

16:28 driving? Who's communicating externally? Who's

16:30 making the rollback call? And how do we keep

16:33 the team aligned when five things are happening

16:35 at once? That is still wildly human. And honestly,

16:39 that's what makes on-call exhausting. Now tie

16:42 this back to today's stories. We are putting

16:45 more automation into the workflow. And in some

16:48 cases, we are giving it more agency. Agents in

16:51 Actions. Deploy systems that can strand you

16:54 mid-upgrade. Governance tools that expand scope

16:56 automatically. And all of this increases the

16:59 number of things happening around incidents.

17:02 So if those tools don't reduce uncertainty, they

17:05 can increase chaos. The win is not faster output.

17:09 The win is less uncertainty for tired humans.

17:13 If AI can help, great. But the bar is, does it

17:16 help you decide what to do next safely? Does

17:20 it tell you what it's unsure about? Does it show

17:22 you what it tried and ruled out? Can it give

17:26 you an explanation you can trust at 3 a.m.,

17:29 not just a confident guess? So my take this week

17:33 is simple. When you evaluate tooling, don't judge

17:36 it by how clever it sounds. Judge it by whether

17:39 it reduces uncertainty when you are on call.

17:43 Because that's the moment that matters. That's

17:45 where reliability is real. Okay, time for a recap.

17:49 Today we talked about GitHub agentic workflows

17:51 in Actions, and how it's not just nicer CI,

17:55 it's a new write path that needs guardrails.

17:58 Gentoo moving towards Codeberg. Forge choice is

18:01 supply chain, governance, and leverage, not just

18:05 convenience. Argo CD upgrades requiring SSA in

18:09 certain paths. Control plane upgrades deserve

18:12 their own lane and rehearsals. AWS Config adding

18:16 30 new resource types. Great coverage, but scope

18:20 can expand under you, so be intentional. EC2

18:24 nested virtualization on virtual instances. Capability

18:28 unlock, and also a new "what will teams attempt

18:31 now" moment. The lightning round was around some

18:34 GitHub stories and Open Build Service publishing

18:36 a postmortem. If you want the video version,

18:39 full episodes are now on YouTube going forward.

18:43 If you want the weekly briefs, OnCallBrief.com

18:46 is live. And if you want everything delivered

18:49 by email, Teller's Tech Substack is up. And lastly,

18:52 if you want to come on the show for an interview,

18:55 reach out at ShipItWeekly.fm. More episodes,

18:58 links, and show notes are on ShipItWeekly.fm.

19:02 All right, I'm Brian, and I'll catch you next

19:04 week.
