Host Commentary

For this episode, I wanted to anchor on something I think every ops team learns the hard way.

The incidents that hurt the most are rarely the big obvious deploys.

It’s the background systems. The reconcilers. The cleanup jobs. The “this should be safe because it’s routine” automation.

Because those jobs are usually touching shared truth.

Routing state. Prefix state. Permissions. Database statistics. The stuff everything else quietly depends on.

And when that shared truth shifts under you, you don’t just get a bug. You get reachability problems. You get cascading retries. You get queueing. You get “everything is up but nothing works.”

Cloudflare BYOIP is the cleanest example this week.

This wasn’t “somebody fat-fingered BGP.” It was a buggy cleanup sub-task that made a bad query against the Addressing API and ended up withdrawing about 1,100 BYOIP prefixes before the change could be reverted. Some customers could re-advertise their prefixes from the dashboard, but the real work was restoring prefix configuration state back to normal.

That’s the lesson. If you have automation that can touch reachability, it is production control plane. Treat it like prod deploy tooling, not like “just a job.” Put caps on it. Put canaries on it. Put a circuit breaker on it. And most importantly, build rollback that does not require tribal knowledge at 3am.
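To make that concrete, here is a minimal sketch of the “caps, canaries, circuit breaker” idea. Nothing here comes from Cloudflare’s actual tooling; the names (guarded_cleanup, withdraw, the thresholds) are all hypothetical, and the pattern is the point.

```python
# Hypothetical guardrails around a destructive cleanup job.
# Caps refuse oversized batches, canaries go first, and a
# breaker halts the run on repeated failures.

class CircuitOpen(Exception):
    """Raised when a guardrail halts the run for human review."""

def guarded_cleanup(candidates, withdraw, max_per_run=25,
                    canary_count=3, max_failures=2):
    """Run `withdraw` over `candidates` with basic safety rails."""
    if len(candidates) > max_per_run:
        # A "routine" job that suddenly wants to touch 1,100 prefixes
        # should stop and page a human, not proceed.
        raise CircuitOpen(
            f"{len(candidates)} candidates exceeds cap of {max_per_run}")
    failures, done = 0, []
    for i, item in enumerate(candidates):
        try:
            withdraw(item)
            done.append(item)
        except Exception:
            failures += 1
            if failures > max_failures:
                raise CircuitOpen(f"{failures} failures; halting run")
        if i + 1 == canary_count and failures:
            # Any canary failure stops the run before the bulk proceeds.
            raise CircuitOpen("canary failure; halting before bulk")
    return done
```

The exact thresholds matter less than the shape: the job can never do more damage per run than the cap allows, and a bad batch trips early instead of finishing.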

Cloudflare outage postmortem

https://blog.cloudflare.com/cloudflare-outage-february-20-2026

Next, Clerk’s postmortem is the same theme, just inside Postgres instead of BGP.

Auto-analyze ran, statistics shifted, a query plan flipped into something awful, and suddenly the system was shedding load so hard that most traffic came back as 429s without ever being handled. They fixed it by forcing ANALYZE again, and then they got really explicit about hardening failover so it can trigger on “any failure at origin,” not just “Postgres is down.”

This is why I keep saying “degraded is harder than down.”

Most teams have alarms for dead things. A lot fewer teams have alarms for “same query, different plan” or “latency is spiking but nothing is technically failing.” That gap is where the really ugly incidents live.
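A cheap version of a “same query, different plan” alarm can be sketched like this. I’m assuming you already capture plan text somewhere (for example via Postgres’s auto_explain module); the fingerprinting below is just a hash of that text, and all the names are illustrative.

```python
import hashlib

# Hypothetical plan-flip detector: remember a baseline fingerprint
# per query and flag when the plan text changes, even though the
# database is technically "up" the whole time.

baselines = {}  # query_id -> plan fingerprint

def fingerprint(plan_text: str) -> str:
    return hashlib.sha256(plan_text.encode()).hexdigest()[:12]

def check_plan(query_id: str, plan_text: str) -> bool:
    """Return True if the plan changed since the recorded baseline."""
    fp = fingerprint(plan_text)
    old = baselines.setdefault(query_id, fp)
    if fp != old:
        baselines[query_id] = fp  # re-baseline after alerting once
        return True  # page a human: same query, different plan
    return False
```

Wired into log ingestion, that one boolean is the alarm most teams don’t have.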

Clerk outage postmortem

https://clerk.com/blog/2026-02-19-system-outage-postmortem

And then you’ve got the AWS Kiro story, which is going to get summarized everywhere as “AI took down AWS.”

AWS’s response is basically: no, it was misconfigured access controls, and they added safeguards like mandatory peer review for production access. Reuters covered the story, and AWS published its own statement pushing back.

Here’s my take.

Whether it was an agent or a bash script, it’s the same root problem: a tool got permissions it shouldn’t have had.

So the practical move is boring, but it’s the whole game.

Separate propose from execute.

Let tools draft plans, diffs, PRs, and recommendations all day long.

But when it comes to destructive actions, make that path intentionally gated, intentionally scoped, and painfully auditable.
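The propose-versus-execute split above can be sketched in a few lines. This is a hypothetical illustration, not any real AWS or Kiro API: tools create proposals freely, only an explicitly approved proposal can reach the destructive path, and every step lands in an audit log.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical propose/execute gate. A real system would tie
# approval to SSO identity, a second reviewer, and scoped
# credentials; these names are illustrative only.

@dataclass
class Proposal:
    action: str
    target: str
    id: str = field(default_factory=lambda: uuid.uuid4().hex)
    approved_by: Optional[str] = None

audit_log = []  # (event, proposal_id, actor) tuples

def approve(proposal: Proposal, human: str) -> Proposal:
    proposal.approved_by = human
    audit_log.append(("approved", proposal.id, human))
    return proposal

def execute(proposal: Proposal) -> str:
    """The only path to the destructive action, gated on approval."""
    if not proposal.approved_by:
        audit_log.append(("denied", proposal.id, None))
        raise PermissionError(f"proposal {proposal.id} has no approver")
    audit_log.append(("executed", proposal.id, proposal.approved_by))
    return f"{proposal.action} {proposal.target}"
```

The tool, whether agent or bash script, only ever holds the power to propose; the execute path belongs to the gate.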

AWS response on Kiro

https://www.aboutamazon.com/news/aws/aws-service-outage-ai-bot-kiro

AWS outage reporting (Reuters)

https://www.reuters.com/business/retail-consumer/amazons-cloud-unit-hit-by-least-two-outages-involving-ai-tools-ft-says-2026-02-20

Quick platform note before we move on.

AWS open-sourced the EKS Node Monitoring Agent, which is aimed at detecting node-level issues and surfacing them as signals EKS can act on, including automated node repair paths. This is one of those “make the pager quieter” features that I actually like seeing. 

EKS Node Monitoring Agent

https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-eks-node-monitoring-agent-open-source

Lightning round quick thoughts.

Grafana has a high-severity issue where a user with permission-management rights on one dashboard could modify permissions on other dashboards. If you run Grafana in a shared org with a lot of teams in it, that’s a “patch it” item.

https://grafana.com/security/security-advisories/cve-2026-21721

AWS published a bulletin on runc CVEs affecting container runtime behavior when launching new containers. The evergreen reminder is still true: containers are not a security boundary, and runtime bugs turn into host risk depending on how you’re running workloads. 

https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024

GitLab shipped patch releases 18.6.1, 18.5.3, and 18.4.5. If you self-host GitLab, you already know the rule: falling behind starts as “we’ll do it later” and ends as a weekend incident.

https://about.gitlab.com/releases/2025/11/26/patch-release-gitlab-18-6-1-released

And Atlassian’s February bulletin is the monthly reminder that on-prem Data Center products are a patch treadmill. They call out a pile of high-severity vulns, plus some critical ones, all fixed in newer versions.

https://confluence.atlassian.com/security/security-bulletin-february-17-2026-1722256046.html

Human closer.

ACM Queue ran a piece called “SRE Is Anti-Transactional,” and it’s basically describing the exact emotional arc behind all of these stories.

SRE and platform teams aren’t trying to dodge work.

They’re trying to move the org away from manual, transactional toil, toward systems that do safe work by default, and only involve humans for exceptions.

But this week is a reminder that you don’t get autonomy by giving tools more power.

You get autonomy by engineering the guardrails first, then widening the lane over time. 

SRE Is Anti-Transactional

https://queue.acm.org/detail.cfm?id=3773094

That ties the whole episode together.

Cloudflare got bitten by automated cleanup touching routing state.

Clerk got hit by a “system is up but behavior changed” database failure mode.

AWS is reinforcing that permissions are still the sharp edge, no matter what tool is holding the knife.

Defaults shift. Background systems become dependencies. Guardrails decide whether it’s a story you learn from, or a story you apologize for.

Full show notes are on shipitweekly.fm. The weekly curated brief is on oncallbrief.com.

And if you got value out of this episode, follow or subscribe wherever you listen. Helps a ton. 

Show Notes

This week on Ship It Weekly, Brian covers three “automation meets reality” stories that every DevOps, SRE, and platform team can learn from.

Cloudflare accidentally withdrew customer BYOIP prefixes due to a buggy cleanup task, Clerk got knocked over by a Postgres auto-analyze query plan flip, and AWS responded to reports about its internal Kiro tooling by framing the incident as misconfigured access controls. Plus: a quick EKS node monitoring update, and a tight security lightning round.

Links

Cloudflare BYOIP outage postmortem https://blog.cloudflare.com/cloudflare-outage-february-20-2026/

Clerk outage postmortem (Feb 19, 2026) https://clerk.com/blog/2026-02-19-system-outage-postmortem

AWS outage report (Reuters) https://www.reuters.com/business/retail-consumer/amazons-cloud-unit-hit-by-least-two-outages-involving-ai-tools-ft-says-2026-02-20/

AWS response on Kiro + access controls https://www.aboutamazon.com/news/aws/aws-service-outage-ai-bot-kiro

EKS Node Monitoring Agent (open source) https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-eks-node-monitoring-agent-open-source/

Grafana CVE-2026-21721 https://grafana.com/security/security-advisories/cve-2026-21721/

runc CVEs (AWS-2025-024) https://aws.amazon.com/security/security-bulletins/rss/aws-2025-024/

GitLab patch releases https://about.gitlab.com/releases/2025/11/26/patch-release-gitlab-18-6-1-released/

Atlassian Feb 2026 security bulletin https://confluence.atlassian.com/security/security-bulletin-february-17-2026-1722256046.html

Human story: SRE Is Anti-Transactional (ACM Queue) https://queue.acm.org/detail.cfm?id=3773094

More episodes and show notes at https://shipitweekly.fm

On Call Briefs at: https://oncallbrief.com