Host Commentary

For this Ship It Weekly episode, I wanted to zoom in on a pattern I keep seeing: the stuff we call “glue” is now the blast radius.

A few years ago, you could kind of mentally separate things into buckets. There was “the app,” there was “infra,” and then there was a pile of scripts and YAML and CI jobs that felt like supporting cast. Useful, but not the main story. This week’s stories are a reminder that the supporting cast is now running the show. The control plane, CI triggers, agent tooling, metadata… that’s where outages and security incidents are born now.

We opened with Azure because it’s the cleanest example of a control plane incident that doesn’t look like a clean outage. When VM service management ops get degraded, it’s not a single red alert that screams “Azure is down.” It’s this slow-motion failure where everything feels “kind of broken” in a way that’s hard to pin on one thing. Deploys hang. Scaling actions don’t apply. Nodes don’t come back the way they normally do. Rollbacks take longer than they should. And because your product still responds to some traffic, humans argue about whether it’s real, whether it’s your code, whether it’s the cluster, whether it’s just “a temporary hiccup.” That’s the exact zone where time disappears.

And the ugly truth is: even if your application is fine, you can still be in a bad incident if you can’t operate the platform. A lot of teams build resiliency thinking primarily about the data plane: requests, latency, errors, and throughput. But control plane issues break your ability to respond. If you can’t scale out, can’t recreate nodes, can’t change configs, can’t drain traffic the way you normally do, then the incident becomes less about the technical fix and more about human coordination and waiting. That’s how a 15-minute issue becomes a two-hour issue. Not because the original problem was huge, but because you lost the steering wheel.
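
One way to catch that failure mode before an incident forces the question is a control-plane canary: something that exercises a harmless management operation on a schedule and alerts when those operations slow down or fail, independent of data-plane health. Here’s a minimal sketch, assuming a Kubernetes cluster with kubectl on the path; the namespace, deployment name, and threshold are placeholders, and the server-side dry-run means nothing actually changes:

```python
import subprocess
import time

# Placeholder targets -- point this at a real, low-risk deployment you own.
NAMESPACE = "platform-canaries"
DEPLOYMENT = "control-plane-canary"
SLOW_SECONDS = 10  # beyond this, treat management operations as degraded


def probe_control_plane() -> tuple[bool, float]:
    """Exercise a harmless management operation (server-side dry-run scale)
    and report whether it succeeded and how long it took."""
    cmd = [
        "kubectl", "scale", f"deployment/{DEPLOYMENT}",
        "--replicas=1", "--dry-run=server",
        "-n", NAMESPACE, f"--request-timeout={SLOW_SECONDS}s",
    ]
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return result.returncode == 0, elapsed


if __name__ == "__main__":
    ok, elapsed = probe_control_plane()
    if not ok or elapsed > SLOW_SECONDS:
        # Alert on this separately from data-plane SLOs: the app may be fine
        # while your ability to operate it is not.
        print(f"CONTROL PLANE DEGRADED: ok={ok} latency={elapsed:.1f}s")
    else:
        print(f"control plane ok ({elapsed:.1f}s)")
```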

Then we moved into GitHub’s agent push because that’s the other half of the same theme. AI tooling is no longer a side tab. GitHub putting Claude and Codex into Agent HQ alongside Copilot is them saying “this is the workflow now.” And for DevOps and platform folks, that’s not a fun gadget story. That’s a supply chain story.

Because once an agent can open PRs, update workflows, or propose changes in an infra repo, you’ve effectively created a new automation actor in your environment. An actor that makes changes quickly, confidently, and sometimes without the same instinct humans have for “wait… should we touch that?” The interesting thing about the GitHub Actions case() update is that it’s small but points at a bigger trend. CI logic is becoming more expressive and more central. That’s good. It also means the difference between a safe deploy and a “why did this run in prod?” incident is increasingly hidden in workflow logic and permissions. If AI agents are going to play in that space, your guardrails and reviews matter more than the model pick.

There’s a mindset shift here that I think teams are going to have to make this year. CI is not just a build system. It’s a control plane. The workflow files are production code. The runner fleet is infrastructure. The artifacts and tokens are high-value security assets. So if you’re letting agents touch that layer, treat it like you’re granting access to a real teammate, not like you’re enabling autocomplete. Start read-only. Start “suggest and explain.” Force review. Keep write access narrow. Build audit trails. If you don’t, you’re going to get a brand new class of incident: not a broken app, but an agent-initiated change that’s logically plausible and still wrong.
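
To make “treat the agent like an actor” concrete, here’s a rough sketch of a guardrail layer you might build yourself. Everything in it is illustrative, not any vendor’s API: the tool names, the approval flow, and the JSONL audit log are assumptions about how your own wrapper could work.

```python
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical guardrail layer for an agent that can touch CI/infra repos.
READ_ONLY = {"read_file", "list_workflows", "get_ci_logs"}
NEEDS_REVIEW = {"open_pr", "edit_workflow", "change_permissions"}

AUDIT_LOG = "agent-audit.jsonl"


@dataclass
class AgentAction:
    agent: str          # which agent identity proposed this
    tool: str           # what it wants to do
    target: str         # repo / path / resource
    rationale: str      # the "explain" half of suggest-and-explain


def handle(action: AgentAction, approved_by: str | None = None) -> str:
    record = {"ts": time.time(), **asdict(action), "approved_by": approved_by}
    with open(AUDIT_LOG, "a") as f:      # audit trail for every attempt
        f.write(json.dumps(record) + "\n")

    if action.tool in READ_ONLY:
        return "allowed"                  # read-only path: let it run
    if action.tool in NEEDS_REVIEW:
        if approved_by:
            return "allowed-with-review"  # write path only after a human signs off
        return "queued-for-review"        # default mode: suggest, don't apply
    return "denied"                       # anything unknown is denied, not guessed


print(handle(AgentAction("copilot-agent", "edit_workflow",
                         "infra-repo/.github/workflows/deploy.yml",
                         "bump runner image")))   # -> queued-for-review
```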

The DockerDash story is the one I didn’t want to gloss over, because it’s easy to hear “AI vuln” and tune out. The important part isn’t Docker specifically. The important part is the pattern: once agents are wired to tools, “untrusted input” expands. Most of us have mental models for untrusted input. HTTP requests, form fields, user uploads. We’ve spent twenty years building guardrails around those.

But now you’ve got systems where image metadata, descriptions, README text, issue comments, commit messages, or even a cleverly phrased error log can become part of the agent’s context. If the agent is allowed to act on that context, now you’ve got a prompt injection path. And prompt injection isn’t magic. It’s just a new way to trick an automation system into doing something dumb. We’ve had this problem forever. We called it social engineering. We called it command injection. We called it supply chain poisoning. Now it’s “prompt injection,” but the defense mindset is the same. Don’t let untrusted text drive privileged actions. Put hard gates between reading and doing. Scope tools. Use allowlists. Log everything. And assume anything that can be influenced by an external party will eventually be influenced by an external party.
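
Here’s one way to express that “hard gate between reading and doing” idea, as a hedged sketch rather than a real framework: tag every piece of agent context by where it came from, and refuse privileged tool calls whenever untrusted text is in the mix. The tool names and source labels are made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-gate")

# Illustrative gate between "reading" and "doing" when agent context may
# contain untrusted text (image metadata, READMEs, issue comments, logs).
PRIVILEGED_TOOLS = {"run_container", "delete_image", "push_config"}
ALLOWLISTED_FOR_UNTRUSTED = {"summarize", "search_docs"}  # safe even if tricked


def gate_tool_call(tool: str, context_sources: set[str]) -> bool:
    """Return True only if this tool may run given where the context came from."""
    untrusted = bool(context_sources - {"operator_prompt", "internal_runbook"})
    log.info("tool=%s sources=%s untrusted=%s", tool, sorted(context_sources), untrusted)

    if not untrusted:
        return True
    # Untrusted text is in the context: only allowlisted, non-privileged tools
    # may run, and privileged actions go back to a human regardless of what
    # the model "decided".
    if tool in PRIVILEGED_TOOLS:
        log.warning("blocked privileged tool %s driven by untrusted context", tool)
        return False
    return tool in ALLOWLISTED_FOR_UNTRUSTED


# Example: an image description tried to get the agent to run a container.
print(gate_tool_call("run_container", {"operator_prompt", "image_metadata"}))  # False
```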

That leads into MCP, which is the connective tissue story. MCP sounds like “yet another protocol,” and it’s tempting to ignore it. But I think this is the layer that will decide whether the agent wave is useful or chaotic. Because the moment you have a standard way for agents to discover tools and call them, you’ve created a new platform surface area.

Now you need the same boring, essential platform engineering stuff we learned with APIs. Inventory, so you know what exists. Ownership, so somebody is accountable. Auth, so access is scoped. Policy, so you can enforce guardrails centrally. Auditing, so you can reconstruct what happened when something goes sideways. Rate limits, so an agent doesn’t melt your internal systems. And you need “break glass” and “kill switch” thinking, because the worst incidents in automation are the ones where you can’t quickly stop the automation.
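
Here’s a sketch of what that boring discipline could look like in code, under the big assumption that you put your own internal registry in front of MCP-style tools. None of these field names come from the MCP spec; they’re just the inventory/ownership/scoping/rate-limit/kill-switch ideas written down.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToolEntry:
    name: str
    owner: str                      # accountable team
    scopes: set[str]                # what this tool's credentials may do
    rate_limit_per_min: int
    enabled: bool = True            # kill switch
    calls: list[float] = field(default_factory=list)


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolEntry] = {}   # inventory: what exists

    def register(self, entry: ToolEntry) -> None:
        self._tools[entry.name] = entry

    def kill(self, name: str) -> None:
        """Break-glass: stop an automation path immediately."""
        self._tools[name].enabled = False

    def authorize(self, name: str, requested_scope: str) -> bool:
        tool = self._tools.get(name)
        if tool is None or not tool.enabled:
            return False                          # unknown or killed -> deny
        if requested_scope not in tool.scopes:
            return False                          # scoped auth
        now = time.time()
        tool.calls = [t for t in tool.calls if now - t < 60]
        if len(tool.calls) >= tool.rate_limit_per_min:
            return False                          # don't melt internal systems
        tool.calls.append(now)                    # auditable call record
        return True


registry = ToolRegistry()
registry.register(ToolEntry("jira_search", owner="platform-eng",
                            scopes={"read"}, rate_limit_per_min=30))
print(registry.authorize("jira_search", "read"))    # True
print(registry.authorize("jira_search", "write"))   # False: out of scope
```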

And then there’s observability, which ties into all of this too. When companies start treating telemetry as data that belongs in the same universe as everything else, and then layering AI on top, they’re basically saying: “we want systems that can reason about reality faster than humans can.” That sounds good. But it also means your telemetry pipeline is now a governance issue, not just a tooling issue. Who can query what? What’s in those logs? What secrets accidentally end up there? What’s your retention policy? What happens when an AI assistant can search your logs better than your humans can? That’s powerful. It also changes the risk profile of “we keep everything forever.”
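
One small, concrete piece of that governance story is scrubbing telemetry before it lands anywhere broadly searchable. A minimal illustration follows; the patterns are examples only, nowhere near a complete secret-detection ruleset.

```python
import re

# Illustrative scrubber for log lines before they're indexed somewhere an AI
# assistant (or anyone with broad query access) can search them.
PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[REDACTED_GITHUB_TOKEN]"),
    (re.compile(r"(?i)(password|secret|api[_-]?key)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]


def scrub(line: str) -> str:
    """Apply every redaction pattern to a single telemetry line."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(scrub("retrying with api_key=sk-live-12345 for tenant 42"))
# -> retrying with api_key=[REDACTED] for tenant 42
```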

So my throughline for the episode is simple: the control plane is part of the product, and AI is becoming part of the control plane. Treat it that way.

If you’re adopting agents, don’t think of it as “we’re adding a tool.” Think of it as “we’re adding an actor.” And actors need identity, permissions, constraints, and accountability. The success path here isn’t hype. It’s boring. It’s guardrails, approvals, audit logs, and a team-wide understanding of what agents are allowed to touch.

Because the failure mode isn’t that the agent is dumb. The failure mode is that the agent is competent enough that people stop double-checking.

And when that happens, you don’t get a normal outage. You get an automation incident. The kind where everything looks plausible, until you realize the system is drifting in the wrong direction and nobody noticed because everyone trusted the glue.


Quick follow-ups since we covered these recently: ingress-nginx and n8n are both still in the “patch fast, then verify you’re actually patched” bucket. For ingress-nginx, there’s an updated security advisory thread with multiple issues and fixed versions called out, plus Chainguard is publishing updates around keeping ingress-nginx alive as upstream heads toward the March 2026 retirement timeline. For n8n, there’s a fresh security bulletin with upgrade guidance, and it’s a good reminder that workflow automation tools sit right next to your secrets, so “authenticated” vulns still matter a lot if workflow authoring isn’t tightly restricted.

Links
ingress-nginx advisory thread: https://discuss.kubernetes.io/t/security-advisory-multiple-issues-in-ingress-nginx/34115
Chainguard on ingress-nginx: https://www.chainguard.dev/unchained/keeping-ingress-nginx-alive

Past episode where we covered ingress-nginx retirement / March 2026 timeline:
S1E2 (Nov 21, 2025): Kubernetes Shake-ups, Platform Reality, and AI-Native SRE

n8n security bulletin: https://community.n8n.io/t/security-bulletin-february-6-2026/261682

Past episode: n8n “Ni8mare” / CVE-2026-21858
S1E12 (Jan 9, 2026): n8n Critical CVE (CVE-2026-21858), AWS GPU Capacity Blocks Price Hike, Netflix Temporal

Past episode: n8n Auth RCE / CVE-2026-21877
S1E14 (Jan 16, 2026): n8n Auth RCE (CVE-2026-21877), GitHub Artifact Permissions, and AWS DevOps Agent Lessons

We also mentioned Observe getting acquired by Snowflake: https://www.linkedin.com/posts/snowflake-computing_welcome-to-the-team-observe-inc-today-activity-7424199034833301504-icN8

Show Notes

This week on Ship It Weekly, Brian hits four “control plane + trust boundary” stories where the glue layer becomes the incident.

Azure had a platform incident that impacted VM management operations across multiple regions. Your app can be up, but ops is degraded.

GitHub is pushing Agent HQ (Claude + Codex in the repo/CI flow), and Actions added a case() function so workflow logic is less brittle.

DockerDash (Noma’s research on Docker’s Ask Gordon assistant) shows the new trust boundary: once an agent can act on things like image metadata, that metadata becomes a prompt injection path.

MCP is becoming platform plumbing: Miro launched an MCP server and Kong launched an MCP Registry.

Links

Azure status incident (VM service management issues) https://azure.status.microsoft/en-us/status/history/?trackingId=FNJ8-VQZ

GitHub Agent HQ: Claude + Codex https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-codex-on-agent-hq/

GitHub Actions update (case() function) https://github.blog/changelog/2026-01-29-github-actions-smarter-editing-clearer-debugging-and-a-new-case-function/

Claude Opus 4.6 https://www.anthropic.com/news/claude-opus-4-6

How Google SREs use Gemini CLI https://cloud.google.com/blog/topics/developers-practitioners/how-google-sres-use-gemini-cli-to-solve-real-world-outages

Miro MCP server announcement https://www.businesswire.com/news/home/20260202411670/en/Miro-Launches-MCP-Server-to-Connect-Visual-Collaboration-With-AI-Coding-Tools

Kong MCP Registry announcement https://konghq.com/company/press-room/press-release/kong-introduces-mcp-registry

GitHub Actions hosted runners incident thread https://github.com/orgs/community/discussions/186184

DockerDash / Ask Gordon research https://noma.security/blog/dockerdash-two-attack-paths-one-ai-supply-chain-crisis/

Terraform 1.15 alpha https://github.com/hashicorp/terraform/releases/tag/v1.15.0-alpha20260204

Wiz Moltbook write-up https://www.wiz.io/blog/exposed-moltbook-database-reveals-millions-of-api-keys

Chainguard “EmeritOSS” https://www.chainguard.dev/unchained/introducing-chainguard-emeritoss

More episodes + details: https://shipitweekly.fm