Host Commentary

For this Ship It Weekly episode, I wanted to zoom in on a pattern I keep seeing: the stuff we call “glue” is now the blast radius.

A few years ago, you could kind of mentally separate things into buckets. There was “the app,” there was “infra,” and then there was a pile of scripts and YAML and CI jobs that felt like supporting cast. Useful, but not the main story. This week’s stories are a reminder that the supporting cast is now running the show. The control plane, CI triggers, agent tooling, metadata… that’s where outages and security incidents are born now.

We opened with Azure because it’s the cleanest example of a control plane incident that doesn’t look like a clean outage. When VM service management ops get degraded, it’s not a single red alert that screams “Azure is down.” It’s this slow-motion failure where everything feels “kind of broken” in a way that’s hard to pin on one thing. Deploys hang. Scaling actions don’t apply. Nodes don’t come back the way they normally do. Rollbacks take longer than they should. And because your product still responds to some traffic, humans argue about whether it’s real, whether it’s your code, whether it’s the cluster, whether it’s just “a temporary hiccup.” That’s the exact zone where time disappears.

And the ugly truth is: even if your application is fine, you can still be in a bad incident if you can’t operate the platform. A lot of teams build resiliency thinking primarily about the data plane: requests, latency, errors, and throughput. But control plane issues break your ability to respond. If you can’t scale out, can’t recreate nodes, can’t change configs, can’t drain traffic the way you normally do, then the incident becomes less about the technical fix and more about human coordination and waiting. That’s how a 15-minute issue becomes a two-hour issue. Not because the original problem was huge, but because you lost the steering wheel.
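
One way to catch that failure mode before an incident forces the question is a control-plane canary: something that exercises a harmless management operation on a schedule and alerts when those operations slow down or fail, independent of data-plane health. Here’s a minimal sketch, assuming a Kubernetes cluster with kubectl on the path; the namespace, deployment name, and threshold are placeholders, and the server-side dry-run means nothing actually changes:

```python
import subprocess
import time

# Placeholder targets -- point this at a real, low-risk deployment you own.
NAMESPACE = "platform-canaries"
DEPLOYMENT = "control-plane-canary"
SLOW_SECONDS = 10  # beyond this, treat management operations as degraded


def probe_control_plane() -> tuple[bool, float]:
    """Exercise a harmless management operation (server-side dry-run scale)
    and report whether it succeeded and how long it took."""
    cmd = [
        "kubectl", "scale", f"deployment/{DEPLOYMENT}",
        "--replicas=1", "--dry-run=server",
        "-n", NAMESPACE, f"--request-timeout={SLOW_SECONDS}s",
    ]
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True, text=True)
    elapsed = time.monotonic() - start
    return result.returncode == 0, elapsed


if __name__ == "__main__":
    ok, elapsed = probe_control_plane()
    if not ok or elapsed > SLOW_SECONDS:
        # Alert on this separately from data-plane SLOs: the app may be fine
        # while your ability to operate it is not.
        print(f"CONTROL PLANE DEGRADED: ok={ok} latency={elapsed:.1f}s")
    else:
        print(f"control plane ok ({elapsed:.1f}s)")
```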

Then we moved into GitHub’s agent push because that’s the other half of the same theme. AI tooling is no longer a side tab. GitHub putting Claude and Codex into Agent HQ alongside Copilot is them saying “this is the workflow now.” And for DevOps and platform folks, that’s not a fun gadget story. That’s a supply chain story.

Because once an agent can open PRs, update workflows, or propose changes in an infra repo, you’ve effectively created a new automation actor in your environment. An actor that makes changes quickly, confidently, and sometimes without the same instinct humans have for “wait… should we touch that?” The interesting thing about the GitHub Actions case() update is that it’s small but points at a bigger trend. CI logic is becoming more expressive and more central. That’s good. It also means the difference between a safe deploy and a “why did this run in prod?” incident is increasingly hidden in workflow logic and permissions. If AI agents are going to play in that space, your guardrails and reviews matter more than the model pick.

There’s a mindset shift here that I think teams are going to have to make this year. CI is not just a build system. It’s a control plane. The workflow files are production code. The runner fleet is infrastructure. The artifacts and tokens are high-value security assets. So if you’re letting agents touch that layer, treat it like you’re granting access to a real teammate, not like you’re enabling autocomplete. Start read-only. Start “suggest and explain.” Force review. Keep write access narrow. Build audit trails. If you don’t, you’re going to get a brand new class of incident: not a broken app, but an agent-initiated change that’s logically plausible and still wrong.
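
To make “treat the agent like an actor” concrete, here’s a rough sketch of a guardrail layer you might build yourself. Everything in it is illustrative, not any vendor’s API: the tool names, the approval flow, and the JSONL audit log are assumptions about how your own wrapper could work.

```python
import json
import time
from dataclasses import dataclass, asdict

# Hypothetical guardrail layer for an agent that can touch CI/infra repos.
READ_ONLY = {"read_file", "list_workflows", "get_ci_logs"}
NEEDS_REVIEW = {"open_pr", "edit_workflow", "change_permissions"}

AUDIT_LOG = "agent-audit.jsonl"


@dataclass
class AgentAction:
    agent: str          # which agent identity proposed this
    tool: str           # what it wants to do
    target: str         # repo / path / resource
    rationale: str      # the "explain" half of suggest-and-explain


def handle(action: AgentAction, approved_by: str | None = None) -> str:
    record = {"ts": time.time(), **asdict(action), "approved_by": approved_by}
    with open(AUDIT_LOG, "a") as f:      # audit trail for every attempt
        f.write(json.dumps(record) + "\n")

    if action.tool in READ_ONLY:
        return "allowed"                  # read-only path: let it run
    if action.tool in NEEDS_REVIEW:
        if approved_by:
            return "allowed-with-review"  # write path only after a human signs off
        return "queued-for-review"        # default mode: suggest, don't apply
    return "denied"                       # anything unknown is denied, not guessed


print(handle(AgentAction("copilot-agent", "edit_workflow",
                         "infra-repo/.github/workflows/deploy.yml",
                         "bump runner image")))   # -> queued-for-review
```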

The DockerDash story is the one I didn’t want to gloss over, because it’s easy to hear “AI vuln” and tune out. The important part isn’t Docker specifically. The important part is the pattern: once agents are wired to tools, “untrusted input” expands. Most of us have mental models for untrusted input. HTTP requests, form fields, user uploads. We’ve spent twenty years building guardrails around those.

But now you’ve got systems where image metadata, descriptions, README text, issue comments, commit messages, or even a cleverly phrased error log can become part of the agent’s context. If the agent is allowed to act on that context, now you’ve got a prompt injection path. And prompt injection isn’t magic. It’s just a new way to trick an automation system into doing something dumb. We’ve had this problem forever. We called it social engineering. We called it command injection. We called it supply chain poisoning. Now it’s “prompt injection,” but the defense mindset is the same. Don’t let untrusted text drive privileged actions. Put hard gates between reading and doing. Scope tools. Use allowlists. Log everything. And assume anything that can be influenced by an external party will eventually be influenced by an external party.
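
Here’s one way to express that “hard gate between reading and doing” idea, as a hedged sketch rather than a real framework: tag every piece of agent context by where it came from, and refuse privileged tool calls whenever untrusted text is in the mix. The tool names and source labels are made up for illustration.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("agent-gate")

# Illustrative gate between "reading" and "doing" when agent context may
# contain untrusted text (image metadata, READMEs, issue comments, logs).
PRIVILEGED_TOOLS = {"run_container", "delete_image", "push_config"}
ALLOWLISTED_FOR_UNTRUSTED = {"summarize", "search_docs"}  # safe even if tricked


def gate_tool_call(tool: str, context_sources: set[str]) -> bool:
    """Return True only if this tool may run given where the context came from."""
    untrusted = bool(context_sources - {"operator_prompt", "internal_runbook"})
    log.info("tool=%s sources=%s untrusted=%s", tool, sorted(context_sources), untrusted)

    if not untrusted:
        return True
    # Untrusted text is in the context: only allowlisted, non-privileged tools
    # may run, and privileged actions go back to a human regardless of what
    # the model "decided".
    if tool in PRIVILEGED_TOOLS:
        log.warning("blocked privileged tool %s driven by untrusted context", tool)
        return False
    return tool in ALLOWLISTED_FOR_UNTRUSTED


# Example: an image description tried to get the agent to run a container.
print(gate_tool_call("run_container", {"operator_prompt", "image_metadata"}))  # False
```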

That leads into MCP, which is the connective tissue story. MCP sounds like “yet another protocol,” and it’s tempting to ignore it. But I think this is the layer that will decide whether the agent wave is useful or chaotic. Because the moment you have a standard way for agents to discover tools and call them, you’ve created a new platform surface area.

Now you need the same boring, essential platform engineering stuff we learned with APIs. Inventory, so you know what exists. Ownership, so somebody is accountable. Auth, so access is scoped. Policy, so you can enforce guardrails centrally. Auditing, so you can reconstruct what happened when something goes sideways. Rate limits, so an agent doesn’t melt your internal systems. And you need “break glass” and “kill switch” thinking, because the worst incidents in automation are the ones where you can’t quickly stop the automation.
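
Here’s a sketch of what that boring discipline could look like in code, under the big assumption that you put your own internal registry in front of MCP-style tools. None of these field names come from the MCP spec; they’re just the inventory/ownership/scoping/rate-limit/kill-switch ideas written down.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ToolEntry:
    name: str
    owner: str                      # accountable team
    scopes: set[str]                # what this tool's credentials may do
    rate_limit_per_min: int
    enabled: bool = True            # kill switch
    calls: list[float] = field(default_factory=list)


class ToolRegistry:
    def __init__(self) -> None:
        self._tools: dict[str, ToolEntry] = {}   # inventory: what exists

    def register(self, entry: ToolEntry) -> None:
        self._tools[entry.name] = entry

    def kill(self, name: str) -> None:
        """Break-glass: stop an automation path immediately."""
        self._tools[name].enabled = False

    def authorize(self, name: str, requested_scope: str) -> bool:
        tool = self._tools.get(name)
        if tool is None or not tool.enabled:
            return False                          # unknown or killed -> deny
        if requested_scope not in tool.scopes:
            return False                          # scoped auth
        now = time.time()
        tool.calls = [t for t in tool.calls if now - t < 60]
        if len(tool.calls) >= tool.rate_limit_per_min:
            return False                          # don't melt internal systems
        tool.calls.append(now)                    # auditable call record
        return True


registry = ToolRegistry()
registry.register(ToolEntry("jira_search", owner="platform-eng",
                            scopes={"read"}, rate_limit_per_min=30))
print(registry.authorize("jira_search", "read"))    # True
print(registry.authorize("jira_search", "write"))   # False: out of scope
```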

And then there’s observability, which ties into all of this too. When companies start treating telemetry as data that belongs in the same universe as everything else, and then layering AI on top, they’re basically saying: “we want systems that can reason about reality faster than humans can.” That sounds good. But it also means your telemetry pipeline is now a governance issue, not just a tooling issue. Who can query what? What’s in those logs? What secrets accidentally end up there? What’s your retention policy? What happens when an AI assistant can search your logs better than your humans can? That’s powerful. It also changes the risk profile of “we keep everything forever.”
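
One small, concrete piece of that governance story is scrubbing telemetry before it lands anywhere broadly searchable. A minimal illustration follows; the patterns are examples only, nowhere near a complete secret-detection ruleset.

```python
import re

# Illustrative scrubber for log lines before they're indexed somewhere an AI
# assistant (or anyone with broad query access) can search them.
PATTERNS = [
    (re.compile(r"AKIA[0-9A-Z]{16}"), "[REDACTED_AWS_KEY]"),
    (re.compile(r"ghp_[A-Za-z0-9]{36}"), "[REDACTED_GITHUB_TOKEN]"),
    (re.compile(r"(?i)(password|secret|api[_-]?key)\s*[:=]\s*\S+"), r"\1=[REDACTED]"),
]


def scrub(line: str) -> str:
    """Apply every redaction pattern to a single telemetry line."""
    for pattern, replacement in PATTERNS:
        line = pattern.sub(replacement, line)
    return line


print(scrub("retrying with api_key=sk-live-12345 for tenant 42"))
# -> retrying with api_key=[REDACTED] for tenant 42
```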

So my throughline for the episode is simple: the control plane is part of the product, and AI is becoming part of the control plane. Treat it that way.

If you’re adopting agents, don’t think of it as “we’re adding a tool.” Think of it as “we’re adding an actor.” And actors need identity, permissions, constraints, and accountability. The success path here isn’t hype. It’s boring. It’s guardrails, approvals, audit logs, and a team-wide understanding of what agents are allowed to touch.

Because the failure mode isn’t that the agent is dumb. The failure mode is that the agent is competent enough that people stop double-checking.

And when that happens, you don’t get a normal outage. You get an automation incident. The kind where everything looks plausible, until you realize the system is drifting in the wrong direction and nobody noticed because everyone trusted the glue.


Quick follow-ups since we covered these recently: ingress-nginx and n8n are both still in the “patch fast, then verify you’re actually patched” bucket. For ingress-nginx, there’s an updated security advisory thread with multiple issues and fixed versions called out, plus Chainguard is publishing updates around keeping ingress-nginx alive as upstream heads toward the March 2026 retirement timeline. For n8n, there’s a fresh security bulletin with upgrade guidance, and it’s a good reminder that workflow automation tools sit right next to your secrets, so “authenticated” vulns still matter a lot if workflow authoring isn’t tightly restricted.

Links
ingress-nginx advisory thread: https://discuss.kubernetes.io/t/security-advisory-multiple-issues-in-ingress-nginx/34115
Chainguard on ingress-nginx: https://www.chainguard.dev/unchained/keeping-ingress-nginx-alive

Past episode where we covered ingress-nginx retirement / March 2026 timeline:
S1E2 (Nov 21, 2025): Kubernetes Shake-ups, Platform Reality, and AI-Native SRE

n8n security bulletin: https://community.n8n.io/t/security-bulletin-february-6-2026/261682

Past episode: n8n “Ni8mare” / CVE-2026-21858
S1E12 (Jan 9, 2026): n8n Critical CVE (CVE-2026-21858), AWS GPU Capacity Blocks Price Hike, Netflix Temporal

Past episode: n8n Auth RCE / CVE-2026-21877
S1E14 (Jan 16, 2026): n8n Auth RCE (CVE-2026-21877), GitHub Artifact Permissions, and AWS DevOps Agent Lessons

We also mentioned Observe getting acquired by Snowflake: https://www.linkedin.com/posts/snowflake-computing_welcome-to-the-team-observe-inc-today-activity-7424199034833301504-icN8

Show Notes

This week on Ship It Weekly, Brian hits four “control plane + trust boundary” stories where the glue layer becomes the incident.

Azure had a platform incident that impacted VM management operations across multiple regions. Your app can be up, but ops is degraded.

GitHub is pushing Agent HQ (Claude + Codex in the repo/CI flow), and Actions added a case() function so workflow logic is less brittle.

DockerDash (Noma’s research on Docker’s Ask Gordon assistant) shows the new trust boundary: once an agent can act on things like image metadata, that metadata becomes a prompt injection path.

MCP is becoming platform plumbing: Miro launched an MCP server and Kong launched an MCP Registry.

Links

Azure status incident (VM service management issues) https://azure.status.microsoft/en-us/status/history/?trackingId=FNJ8-VQZ

GitHub Agent HQ: Claude + Codex https://github.blog/news-insights/company-news/pick-your-agent-use-claude-and-codex-on-agent-hq/

GitHub Actions update (case() function) https://github.blog/changelog/2026-01-29-github-actions-smarter-editing-clearer-debugging-and-a-new-case-function/

Claude Opus 4.6 https://www.anthropic.com/news/claude-opus-4-6

How Google SREs use Gemini CLI https://cloud.google.com/blog/topics/developers-practitioners/how-google-sres-use-gemini-cli-to-solve-real-world-outages

Miro MCP server announcement https://www.businesswire.com/news/home/20260202411670/en/Miro-Launches-MCP-Server-to-Connect-Visual-Collaboration-With-AI-Coding-Tools

Kong MCP Registry announcement https://konghq.com/company/press-room/press-release/kong-introduces-mcp-registry

GitHub Actions hosted runners incident thread https://github.com/orgs/community/discussions/186184

DockerDash / Ask Gordon research https://noma.security/blog/dockerdash-two-attack-paths-one-ai-supply-chain-crisis/

Terraform 1.15 alpha https://github.com/hashicorp/terraform/releases/tag/v1.15.0-alpha20260204

Wiz Moltbook write-up https://www.wiz.io/blog/exposed-moltbook-database-reveals-millions-of-api-keys

Chainguard “EmeritOSS” https://www.chainguard.dev/unchained/introducing-chainguard-emeritoss

More episodes + details: https://shipitweekly.fm