0:00
This week feels like the platforms are getting
0:02
opinionated. Not here's a new feature, more like
0:05
here's the new default way you're going to operate.
0:08
And if you don't notice the default changed,
0:11
you'll feel it later. Usually as surprise work,
0:13
surprise risk, or surprise downtime. What's up,
0:33
everybody? I'm Brian, and this is Ship It Weekly,
0:36
a short weekly show where I filter the noise
0:39
and focus on what actually changes how we run
0:41
infra. Quick update before we jump in. Ship It
0:45
Weekly is officially a video podcast now. Every
0:48
episode going forward is available on YouTube
0:51
in full video format. If you're audio only, nothing
0:54
changes. Same feed, same cadence, and same show.
0:57
Also, OnCallBrief.com is live. That's my system
1:02
for tracking DevOps and infra news without drowning
1:05
in tabs. Briefs start on Sunday, get refined
1:09
throughout the week, and Thursday is the final
1:11
pass. If you want to see what I'm tracking before
1:14
the episode drops, that's the place. And Teller's
1:17
Tech now has a Substack too. If you want episodes
1:20
and weekly briefs delivered to your inbox, that's
1:23
the easiest option. And one last thing, I'm starting
1:26
another round of interviews. If you want to come
1:28
on and talk about a migration, a post-mortem, a weird
1:32
outage, or how your team actually runs production,
1:35
hit me up on shipitweekly.fm. All right, let's
1:38
get into it. Five stories for today. GitHub is
1:42
putting agentic workflows directly into Actions,
1:45
which changes what CI can do without a human driving.
1:49
Gentoo is moving away from GitHub to Codeberg.
1:53
It's a reminder that the forge is not neutral
1:56
infrastructure anymore. Argo CD upgrades are
1:59
forcing server-side apply in certain paths.
2:02
This is one of those small line in the notes,
2:05
big day in prod things. AWS Config expanded coverage
2:09
again, which sounds boring until you realize
2:13
governance scope can move under your feet. And
2:16
AWS enabled nested virtualization on virtual EC2
2:20
instances, a capability unlock that's going to
2:23
change what people attempt to run. Then we'll
2:26
do a lightning round and a human closer on the
2:28
gap between AI everywhere and incidents still
2:31
being painful. First main story, GitHub agentic
2:38
workflows in Actions. GitHub dropped agentic
2:41
workflows into technical preview. The simple
2:44
version is you write the intent in Markdown and
2:47
an agent runs inside GitHub Actions to do repo
2:50
work. It's not just run these steps, it's more
2:53
like here's the goal, go handle it. Think issue
2:56
triage, basic repo maintenance, investigating
2:59
CI failures, and proposing fixes. This is not
3:03
the same thing as Agent HQ. Agent HQ was agents
3:07
in the GitHub experience. This is agents inside
3:10
your automation engine. That's a very different
3:12
place to add intelligence. So why does this matter?
3:16
Actions is where a lot of secrets and permissions
3:18
live. It's also where small automations quietly
3:21
become core production workflows. When an agent
3:24
can take actions, not just produce suggestions,
3:28
you've created a new write path. And write paths
3:31
have two rules. One, they need ownership. But
3:34
two, they need constraints. Because the failure
3:37
mode isn't always malicious. The failure mode
3:40
is helpful, fast, and wrong. The scary version
3:44
is an agent editing workflow files, because workflow
3:47
files are basically the keys to the kingdom. Or
3:50
an agent doing cleanup that breaks a dependency
3:52
you didn't even remember existed. Or an agent
3:56
repeatedly retrying something until it finds
3:59
a path that works but violates policy. The first
4:02
time an agent edits a workflow, it's a different
4:05
game. So do this Monday. First, inventory what
4:09
Actions can actually do today. Don't start with
4:12
features, start with permissions. Which repos
4:15
allow workflows to write back to the repo? Which
4:18
repos allow modifying workflow files? Which workflows
4:22
can deploy, publish artifacts, or touch environments?
4:25
Second, separate read -only automation from write
4:29
automation. Agents can comment and propose, sure,
4:32
but merges should require a human. That's the
4:35
clean line that prevents a lot of pain. Third,
4:39
lock down permissions in workflows. If you are
4:42
still running broad default tokens, you need
4:44
to fix that now. Treat write scopes like production
4:48
credentials, because that's what they are. Fourth,
4:51
add explicit approval gates for anything high
4:53
leverage. Secrets, deployments, runners, workflow
4:57
file changes. If an agent can do it, a human
5:01
should approve it. At least until you trust it
5:03
and you've tested the blast radius. Fifth, logging.
5:07
You need a record of what the agent changed and
5:10
why, not just standard out from the workflow.
5:13
Because the post -incident question won't be
5:15
who clicked it, it'll be what sequence of actions
5:19
did the agent take. Okay, that's story one. Story
5:26
two is Gentoo moves off of GitHub to Codeberg.
5:30
So Gentoo is moving repositories off of GitHub
5:33
and towards Codeberg. They are pretty direct
5:36
about why. They're uncomfortable with the direction
5:39
GitHub is going around Copilot and AI pressure.
5:42
This isn't a rage quit. It's an operational migration.
5:46
Mirrors, workflow changes, community adjustments,
5:49
all of the annoying real-world stuff. So why
5:53
does this matter? Because GitHub isn't just a
5:55
place where code lives. It's your auth story,
5:58
your CI story, your review flow, your issue tracker,
6:03
your release automation, your dependency and
6:05
security alerts. So when the platform changes
6:08
incentives or defaults, it's not a cosmetic change.
6:12
It alters how work gets done. And Gentoo is basically
6:16
saying, we want leverage and we want options.
6:19
Even if you don't care about the politics, the
6:22
engineering lesson is real. If a platform becomes
6:24
your default for everything, you don't notice
6:27
lock-in until you have to change something fast.
6:30
And then you realize half your workflow isn't
6:32
portable. Not the code, the workflow. Mirrors
6:35
aren't boring. Mirrors are exit ramps. So do
6:39
this Monday. Pick one repo that matters. Map
6:42
the dependencies. What breaks if GitHub is down
6:44
for a day? Not developers are annoyed. What breaks
6:48
operationally? Builds, releases, packages, security
6:52
workflows, required checks, even how do we coordinate
6:56
changes? Now, ask a better question. What do
6:59
we need to keep shipping if the forge is degraded?
7:02
Do we have mirrors? Do we have backups of more
7:05
than Git objects? Issues and PR metadata matter.
7:09
Release artifacts matter. Actions config and
7:12
required checks matter. If you've never tested
7:14
restore, do a small exercise. Pretend the repo
7:17
is gone. Restore it and run the minimum path
7:20
to ship. You don't need a full migration plan.
7:23
You need confidence that you are not trapped.
7:26
Okay, that's story two. Story three is Argo CD upgrades
7:34
and server-side apply requirements. This one
7:37
is pure operator reality. Argo CD has an upgrade
7:41
path where server-side apply is required in
7:44
certain setups, especially when Argo is managing
7:47
itself or when you are applying manifests directly.
7:50
The reason is a Kubernetes annotation size limitation.
7:54
The last applied blob can get too large and it
7:57
breaks apply behavior in weird ways. So the fix
8:00
becomes use SSA. Let the API server own field
8:05
management. So why does this matter? This isn't
8:08
just a flag you add to a command. It changes
8:11
how ownership works. SSA tracks fields differently.
8:15
And that means upgrades can overwrite things
8:17
you didn't realize you were relying on. This
8:20
is where hidden customizations come back to haunt
8:22
you. The little prod-only patches. The tolerations
8:26
someone added once. The probe tweak that never
8:29
got upstreamed into your real config. Upgrades
8:32
are where tribal knowledge gets erased, or worse,
8:35
half-erased. And Argo is a special kind of risky
8:39
upgrade because it's your deploy system. If Argo
8:43
is down, you don't just lose a tool. You lose
8:46
the safe path to change. Then everyone starts
8:49
doing manual kubectl, and drift shows up immediately.
8:53
GitOps is calm until the GitOps system is the
8:56
incident. So do this Monday. First, look at how
9:00
you deploy Argo today. Is Argo managing itself?
9:04
If yes, verify SSA is enabled in the application
9:08
sync options before upgrading. Don't wait to
9:11
learn this during the upgrade. Second, diff your
9:14
live Argo resources against what you think you
9:18
apply, and then find the hand edits. Find the
9:21
temporary patches, write them down, and formalize
9:24
them. Third, build an upgrade lane for Argo,
9:27
even if it's small. A rehearsal environment.
9:29
Same method, same manifest, same shape. Practice
9:33
upgrade and rollback, and validate can we sync
9:36
a known app after upgrade. Fourth, rehearse Argo-is-down
9:40
mode. How do you deploy without it? How
9:42
do you stop it from fighting you if it's partially
9:45
alive? How do you get back to a known good state?
9:48
Because when Argo breaks, every minute feels
9:51
expensive. And thinking clearly gets harder.
9:55
Okay, that's story three. Story four is AWS Config
10:02
adding 30 new resource types. So AWS Config just
10:06
added support for 30 additional resource types.
10:09
Here's the key detail. If you record all resource
10:12
types, Config can start tracking new types automatically.
10:16
So scope expands under you without you doing
10:19
anything. This is a quiet change. But it affects
10:22
inventory, governance, and sometimes cost. So
10:25
why does this matter? Most teams don't treat
10:27
Config like a living system. They treat it like
10:30
a checkbox. Then one day they try to get serious
10:33
about governance and compliance, and they realize
10:36
Config is actually foundational data. So when
10:39
coverage expands, that's good. But it also means
10:41
more evaluation surface. More resources showing
10:44
up in aggregators. More rules being evaluated.
10:48
More non-compliant noise. And the worst failure
10:51
mode is not Config has data. It's Config has
10:54
data nobody owns. New resource types show up.
10:57
Rules fire. Nobody knows who should fix it. So
11:00
it becomes a platform team problem by default.
11:03
And platform teams get buried in triage instead
11:06
of improving systems. Inventory expansion is
11:09
great until it becomes surprise accountability.
11:12
So what I would do Monday: go check your Config
11:15
recorder settings. Are you recording all resource
11:17
types? If yes, decide if that's intentional or
11:21
if it's just "we clicked it once years ago." Then
11:24
check your rules. Which rules will start evaluating
11:26
these new types? Tagging rules, encryption rules,
11:29
public access rules, all the stuff. If new types
11:33
will create noise, decide routing. Who owns the
11:35
alerts? Who owns remediation? Also, baseline your
11:39
Config usage and costs. Not because this change
11:43
will wreck your bill but because it changes the
11:45
scope and it's easier to explain early than late
11:49
Finally, tighten your ownership metadata. If
11:52
you can't answer who owns this resource, governance
11:55
becomes a scavenger hunt. That's story four.
12:02
Story five is EC2 nested virtualization on virtual
12:05
instances. AWS now supports nested virtualization
12:09
on certain virtual EC2 instances. Historically,
12:13
nested virtualization on AWS was usually a bare
12:18
metal story. Now it's possible on virtualized
12:21
instances for some families. This is a capability
12:24
unlock. It's also a behavior unlock because the
12:28
moment this exists, teams will attempt things
12:30
they couldn't justify before. Full lab environments
12:34
inside EC2. VM heavy testing. Security sandboxes.
12:38
Let's run a hypervisor inside our runner fleet.
12:42
So why does this matter? Nested virtualization
12:45
sounds niche, but it's really about reproducibility
12:48
and isolation. If you've ever wanted a test environment
12:52
that looks closer to prod, this helps. And if
12:55
you have tooling that expects a hypervisor, this
12:58
helps. And if you were paying the bare metal
13:01
tax purely for nested virt, this might be a cost
13:05
lever, which is going to matter for some orgs.
13:08
But it's also a foot gun if you combine it with
13:11
credentials and loose network controls. Anything
13:14
that starts looking like a workstation gets treated
13:17
like a workstation. People install random tools.
13:20
People store secrets in the wrong places. People
13:23
run just this one thing. And that's why anything
13:25
that looks like a workstation eventually gets
13:27
treated like one. So do this Monday. If you have
13:30
runner fleets, build fleets, or sandbox accounts,
13:34
add this to your threat model. Ask what changes
13:37
if nested VMs become available. Then set boundaries.
13:41
Which accounts allow it? Which VPCs allow it?
13:44
Do you want tighter egress controls? And then
13:47
document an internal stance. Even a short note
13:50
helps. Like, we support this for these use cases
13:54
on these instance families with these guardrails.
13:57
Because if you don't write the rules, you will
14:00
end up inheriting random experiments. And then
14:03
you'll learn about them during an incident or
14:05
a bill review. Okay, that's story five. Okay,
14:15
time for the lightning round, short and practical.
14:18
GitHub updated their status page experience.
14:21
There's now a 90-day historical view and better
14:24
linking between incident days and availability
14:26
trends. And honestly, given GitHub's hiccups
14:30
lately, having a status page that's actually
14:32
useful is a welcome addition. Open Build Service
14:35
published a post-mortem on a disruption that
14:39
came down to database migration and locking behavior.
14:42
It's a good reminder that migration plan is not
14:45
the same as rollback plan. And a quick reminder
14:48
because we covered it already, GitHub Actions
14:51
extended the self-hosted runner minimum version
14:53
enforcement window. Treat that as runway, not
14:56
permission to ignore it. If you have self-hosted
14:59
runners, schedule the upgrade work. Another quick
15:02
GitHub one, Actions had early February updates
15:06
around things like runner controls and settings
15:09
that reduce surprise drift across orgs. It's
15:12
not headline news, but it's the kind of incremental
15:15
improvement that saves platform teams time. AWS
15:18
Config expanding coverage is also a reminder
15:21
of a bigger pattern. A lot of discover-everything
15:25
services expand under you as AWS adds new stuff.
15:28
That's not bad. It just means you need ownership
15:31
or the tool becomes noise. And if you are experimenting
15:35
with agentic workflows, don't skip the boring
15:37
part. Permissions, approval gates, and audit
15:40
trails. That's the difference between useful
15:42
automation and mystery automation. Okay, that's
15:45
the lightning round. Time for the human closer.
15:55
There's a post called Lots of AI SRE, No AI Incident
15:59
Management, and it nails something that feels
16:02
obvious once you say it. Most AI tooling in ops
16:05
is aimed at producing output faster. Write the
16:08
YAML, draft the runbook, summarize the log, generate
16:12
the postmortem doc. That's useful, but it's not
16:15
the core pain during a real incident. Incidents
16:18
aren't mostly writing. Incidents are uncertainty.
16:22
What changed? What's real? What's correlated
16:24
versus not? And incidents are coordination. Who's
16:28
driving? Who's communicating externally? Who's
16:30
making the rollback call? And how do we keep
16:33
the team aligned when five things are happening
16:35
at once? That is still wildly human. And honestly,
16:39
that's what makes on-call exhausting. Now tie
16:42
this back to today's stories. We are putting
16:45
more automation into the workflow. And in some
16:48
cases, we are giving it more agency. Agents in
16:51
Actions. Deploy systems that can strand you
16:54
mid-upgrade. Governance tools that expand scope
16:56
automatically. And all of this increases the
16:59
number of things happening around incidents.
17:02
So if those tools don't reduce uncertainty, they
17:05
can increase chaos. The win is not faster output.
17:09
The win is less uncertainty for tired humans.
17:13
If AI can help, great. But the bar is, does it
17:16
help you decide what to do next safely? Does
17:20
it tell you what it's unsure about? Does it show
17:22
you what it tried and ruled out? Can it give
17:26
you an explanation you can trust at 3 a.m.,
17:29
not just a confident guess? So my take this week
17:33
is simple. When you evaluate tooling, don't judge
17:36
it by how clever it sounds. Judge it by whether
17:39
it reduces uncertainty when you are on call.
17:43
Because that's the moment that matters. That's
17:45
where reliability is real. Okay, time for a recap.
17:49
Today we talked about GitHub agentic workflows
17:51
in Actions, and how it's not just nicer CI,
17:51
it's a new write path that needs guardrails.
17:55
Gentoo moving towards Codeberg. Forge choice is
18:01
supply chain, governance, and leverage, not just
18:05
convenience. Argo CD upgrades requiring SSA in
18:09
certain paths. Control plane upgrades deserve
18:12
their own lane and rehearsals. AWS Config adding
18:16
30 new resource types. Great coverage, but scope
18:20
can expand under you, so be intentional. EC2
18:24
nested virtualization on virtual instances. Capability
18:28
unlock, and also a new what-will-teams-attempt-now
18:31
moment. The lightning round was around some
18:34
GitHub stories and Open Build Service publishing
18:36
a postmortem. If you want the video version,
18:39
full episodes are now on YouTube going forward.
18:43
If you want the weekly briefs, OnCallBrief.com
18:46
is live. And if you want everything delivered
18:49
by email, Teller's Tech Substack is up. And lastly,
18:52
if you want to come on the show for an interview,
18:55
reach out at ShipItWeekly.fm. More episodes,
18:55
links, and show notes are on ShipItWeekly.fm.
19:02
All right, I'm Brian, and I'll catch you next
19:04
week.
For this episode, I wanted to anchor on something I think a lot of teams miss until it bites them.
The default behavior of the platforms we lean on is shifting.
Not in a “new feature, neat” way.
In a “this is how work happens now unless you intentionally opt out” way.
And ops pain almost always shows up when a default changes quietly, then becomes a dependency.
GitHub Agentic Workflows inside Actions is the clearest example.
It’s not “AI in the UI.” It’s “AI in the automation engine.”
That matters because Actions is where the permissions live, and where small scripts quietly become production processes.
The moment an agent can propose changes, run experiments, open PRs, retry, reroute, and generally keep iterating, you’ve moved from deterministic automation to goal-seeking automation.
That can be awesome, but the guardrails have to shift too.
If you treat it like a nicer YAML syntax, you’ll miss the real question.
“What is this allowed to change, and how do I prove what it changed?”
GitHub Agentic Workflows (preview)
https://github.blog/changelog/2026-02-13-github-agentic-workflows-are-now-in-technical-preview/
My practical take: start with “agents can propose, humans can merge.”
Make that the default until you have a reason to loosen it.
And do a permissions inventory first, not last.
Because if your workflows can write to the repo, publish releases, or touch environments, the blast radius is already there.
You’re just adding a smarter actor to the same set of keys.
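If you want a starting point for that permissions inventory, here is a rough sketch in Python. It is a line-based heuristic, not a YAML parser, and the function name `audit_workflow` and the shape of its report are my own assumptions for illustration. It flags workflow text that has no top-level `permissions:` block (so the token falls back to whatever default the repo or org has configured) or that grants any write scope.

```python
import re

# Matches lines such as "contents: write" or "permissions: write-all".
# Heuristic only: it can also match unrelated keys whose value is "write".
WRITE_LINE = re.compile(r"^\s*([\w-]+):\s*(write|write-all)\s*$")


def audit_workflow(text: str) -> dict:
    """Scan the text of one workflow file and report permission red flags."""
    lines = text.splitlines()

    # A top-level `permissions:` key starts at column 0.
    has_top_level = any(line.startswith("permissions:") for line in lines)

    # Collect the names of keys that are granted write access.
    scopes = set()
    for line in lines:
        match = WRITE_LINE.match(line)
        if match:
            scopes.add(match.group(1))
    write_scopes = sorted(scopes)

    return {
        "has_top_level_permissions": has_top_level,
        "write_scopes": write_scopes,
        # Review anything that relies on defaults or can write.
        "needs_review": (not has_top_level) or bool(write_scopes),
    }
```

Run it over the files under `.github/workflows/` in each repo and sort by `needs_review`. Treat the output as a list of things to look at, not a verdict.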
Next, the Gentoo move to Codeberg.
This story isn’t just open source politics.
It’s a reminder that “the forge” is no longer a neutral place where code happens to live.
It’s now shaping behavior.
Policy decisions, product direction, incentive direction, even just the ambient pressure of “here’s the new recommended workflow.”
When a project like Gentoo moves, they’re basically paying a real cost to buy back optionality.
That’s a thing ops teams should recognize, because we deal with the exact same tradeoff in enterprises.
Convenience becomes dependency.
Dependency becomes lock-in.
Lock-in only becomes visible when the platform is degraded, changes direction, or becomes a risk you can’t explain away.
Gentoo moves to Codeberg
https://www.theregister.com/2026/02/17/gentoo_moves_to_codeberg_amid/
The practical move here is not “everyone should migrate off GitHub.”
It’s “know what you are renting.”
Your git remote is portable.
Your whole workflow often isn’t.
Issues, PR metadata, CI config, release automation, required checks, even your contributor and access model.
If you want leverage, you need at least one exit ramp.
Mirrors, backups, and a tested restore path are the boring version of freedom.
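For the mirror part, a minimal sketch of what an exit ramp can look like, assuming `git` is on the PATH and that you have push access to a second remote. The helper names and the `mirror.git` directory name are hypothetical. Note what this does not cover: it copies Git refs only, so issues, PR metadata, CI config behavior, and required checks still need their own backup story.

```python
import subprocess
from pathlib import Path


def mirror_commands(source_url: str, mirror_url: str, workdir: str) -> list[list[str]]:
    """Build the git commands for a one-shot mirror of every ref."""
    bare = str(Path(workdir) / "mirror.git")
    return [
        # Bare clone that includes all branches, tags, and other refs.
        ["git", "clone", "--mirror", source_url, bare],
        # Push every ref to the second forge, pruning refs deleted upstream.
        ["git", "-C", bare, "push", "--mirror", mirror_url],
    ]


def run_mirror(source_url: str, mirror_url: str, workdir: str) -> None:
    """Execute the commands, stopping at the first failure."""
    for cmd in mirror_commands(source_url, mirror_url, workdir):
        subprocess.run(cmd, check=True)
```

The restore test is the other half: clone from the mirror into a clean directory and see whether you can still build and ship from it.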
Then Argo CD 3.3 and the Server-Side Apply requirement.
This one looks like a technical detail, but it’s actually a reliability story.
Argo is your deployment system.
If you can’t upgrade it safely, you’re going to end up doing manual kubectl during a bad moment.
And the reason this upgrade note matters is it’s one of those “Kubernetes paper cuts” that turns into a real incident when you combine it with self-management patterns.
Annotation size limits are not exciting, but they’re exactly the kind of limit that surfaces at the worst time, and forces you into an emergency upgrade path.
Argo CD upgrade guide: 3.2 to 3.3 (SSA)
https://argo-cd.readthedocs.io/en/latest/operator-manual/upgrading/3.2-3.3/
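To make the annotation limit concrete, here is a rough estimate you can run against a manifest loaded as a Python dict. It assumes the Kubernetes limit of 256 KiB on the total size of an object's annotations, and it approximates the client-side "last applied" blob as the compact JSON of the manifest. Real serialization will differ a little and other annotations also count toward the limit, so treat the number as a warning light. The function names and the 80% warning threshold are my own choices.

```python
import json

# Kubernetes caps the combined size of all annotations on an object.
ANNOTATION_LIMIT_BYTES = 256 * 1024


def last_applied_size(manifest: dict) -> int:
    """Approximate size in bytes of the manifest stored as a JSON annotation."""
    compact = json.dumps(manifest, separators=(",", ":"))
    return len(compact.encode("utf-8"))


def annotation_headroom(manifest: dict, warn_ratio: float = 0.8) -> dict:
    """Report whether client-side apply is getting close to the limit."""
    size = last_applied_size(manifest)
    return {
        "bytes": size,
        "limit": ANNOTATION_LIMIT_BYTES,
        "at_risk": size >= warn_ratio * ANNOTATION_LIMIT_BYTES,
    }
```

Large CRDs are the usual suspects here. Server-side apply avoids the problem because field ownership is tracked by the API server instead of in that annotation.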
My take: GitOps systems deserve an upgrade lane.
Treat Argo upgrades like you treat Kubernetes upgrades.
Rehearse them.
Diff live state vs what you think you apply.
And hunt down hand edits and “temporary overlays” before the upgrade does it for you.
SSA changes ownership semantics, and ownership semantics are where accidental overrides happen.
If you’ve ever said “we only changed one small thing in prod,” this is where that small thing disappears.
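Here is a small sketch of the pre-upgrade checks, working on manifests already loaded as Python dicts (for example from your Git repo and from `kubectl get -o json`). `has_server_side_apply` looks for the `ServerSideApply=true` entry in an Application's sync options, and `find_drift` lists fields that exist only in the live object or differ from what you think you apply. Both function names are hypothetical, and the diff is deliberately naive: strip server-populated fields such as `status` and `metadata.managedFields` before comparing, or they will show up as noise.

```python
def has_server_side_apply(application: dict) -> bool:
    """Check an Argo CD Application dict for the SSA sync option."""
    spec = application.get("spec") or {}
    policy = spec.get("syncPolicy") or {}
    options = policy.get("syncOptions") or []
    return "ServerSideApply=true" in options


def find_drift(desired, live, path: str = "") -> list[str]:
    """List dotted paths where live state departs from desired state."""
    drift = []
    if isinstance(desired, dict) and isinstance(live, dict):
        for key in sorted(live):
            child = f"{path}.{key}" if path else key
            if key not in desired:
                # Candidate hand edit: nobody declared this field.
                drift.append(f"{child} (only in live)")
            else:
                drift.extend(find_drift(desired[key], live[key], child))
    elif desired != live:
        # Lists and scalars are compared as whole values.
        drift.append(f"{path} (value differs)")
    return drift
```

Every line this prints is either something to formalize in Git or something to knowingly let the upgrade overwrite.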
Next, AWS Config adding 30 new resource types.
This is the kind of change that’s easy to ignore because it feels like background.
But it’s exactly how governance scope creeps.
If you record “all resource types,” AWS can expand your inventory without asking.
That’s good coverage, but it can also mean new rule evaluations, new findings, new “noncompliant” noise, and new accountability questions.
And if you don’t have clear ownership, these tools don’t create governance.
They create a backlog.
AWS Config: 30 new resource types
https://aws.amazon.com/about-aws/whats-new/2026/02/aws-config-new-resource-types
My take: treat Config like a dataset you operate, not a checkbox.
Know if you are recording all resource types.
Baseline the rule surface.
And decide where findings route before they start routing to “whoever is awake.”
Also, this is where tagging and ownership metadata pays off.
Inventory is only useful when it’s attributable.
Otherwise, it’s just a bigger pile of “someone should fix this.”
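A quick sketch of both checks. The first function takes the response from the AWS Config `describe_configuration_recorders` API call (for example via `boto3.client("config")`, assuming boto3 and credentials are set up) and names the recorders that have `allSupported` turned on. The second works on a hypothetical inventory shape, a list of dicts with `id` and `tags`, and the `owner` tag key is just an assumption; use whatever your org standardizes on.

```python
def recorders_tracking_everything(response: dict) -> list[str]:
    """Return names of Config recorders set to record all supported types."""
    names = []
    for recorder in response.get("ConfigurationRecorders", []):
        group = recorder.get("recordingGroup") or {}
        if group.get("allSupported"):
            names.append(recorder.get("name", "<unnamed>"))
    return names


def missing_owner(resources: list[dict], tag_key: str = "owner") -> list[str]:
    """Return ids of resources that carry no ownership tag.

    `resources` is a hypothetical inventory: [{"id": ..., "tags": {...}}].
    """
    unowned = []
    for resource in resources:
        tags = resource.get("tags") or {}
        if not tags.get(tag_key):
            unowned.append(resource["id"])
    return unowned
```

If the first list is not empty, new resource types will land in your inventory on their own, so the second list is the one that decides who gets paged.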
Lightning round quick thoughts.
GitHub’s improved status page experience is genuinely nice.
It sounds small, but the best status pages aren’t the ones that look pretty, they’re the ones that answer “is this me or is it them” quickly.
And given GitHub’s hiccups lately, anything that makes the status view more usable is a win.
GitHub status page update
https://github.blog/changelog/2026-02-13-updated-status-experience/
The early-Feb Actions updates and the runner enforcement reminder are in that same category.
Not sexy, but operationally relevant.
The teams that keep things boring win long term.
Actions updates
https://github.blog/changelog/2026-02-05-github-actions-early-february-2026-updates/
Runner enforcement extended
https://github.blog/changelog/2026-02-05-github-actions-self-hosted-runner-minimum-version-enforcement-extended/
And the Open Build Service postmortem is worth reading if you’ve ever done “simple” migrations that turned out not simple.
If your migration plan doesn’t include rollback behavior under lock contention, degraded DB, or partial completion, you don’t have a plan yet.
You have hope.
Open Build Service postmortem
https://openbuildservice.org/2026/02/02/post-mortem/
Human closer.
The Lorin Hochstein post is the cleanest “smart take” I’ve seen lately on AI in ops.
Lots of AI SRE, no AI incident management.
That title is basically the whole point.
We’re getting tools that generate output.
Summaries, runbooks, postmortems, YAML, tickets.
That’s helpful, but it’s not the core pain of incidents.
Incidents are uncertainty and coordination.
What changed.
What’s real.
What’s correlated vs causal.
Who is driving.
What are we telling customers.
What are we rolling back and why.
If “AI for ops” doesn’t reduce uncertainty, it can accidentally increase chaos.
Because you’ll get more activity without more confidence.
You’ll get more suggestions without better verification.
You’ll get a faster loop that still depends on a tired human to decide what’s safe.
So my bar for AI tooling is simple.
Does it help a human make a safer decision faster.
Does it show its work.
Does it admit uncertainty.
Does it track actions taken, not just produce a narrative.
Because at 3am, a confident guess is worse than no guess.
Lots of AI SRE, no AI incident management
https://surfingcomplexity.blog/2026/02/14/lots-of-ai-sre-no-ai-incident-management/
That ties back to the whole episode.
Platforms are shifting defaults in ways that increase agency.
Agents inside CI.
Workflow and policy baked into the forge.
GitOps systems that require more careful ownership semantics.
Governance tools that expand scope automatically.
The work doesn’t go away.
It moves.
And the teams that do best are the ones that notice the default changed early, then operationalize it before it becomes an incident.
More episodes, plus the video playlist, weekly briefs, and Substack are all linked from here:
https://shipitweekly.fm