💬 Host Commentary

This episode is basically my “welcome back to reality” message for the start of the year.

A lot of teams are coming into January with momentum: new initiatives, new tooling, new automation, more AI agents in workflows, more platform “self-service.” That’s great. But the trap is thinking speed automatically means progress.

Speed without containment just means you fail faster.

That’s why the theme of Episode 10 is brakes and blast radius.

Cloudflare: “Fail small” isn’t a slogan, it’s a design requirement
Cloudflare’s post hit home for me because it’s the type of resilience work that’s invisible when it’s done right and brutally obvious when it’s not.

If you’ve been in ops long enough, you learn that the outages that really hurt are rarely “one machine died.” The painful ones are correlated. The same broken change lands everywhere, the same dependency falls over across regions, the same config causes a cascade across the fleet.

“Fail small” is the mindset that forces you to design systems where problems stay local by default.

You can translate that into normal-company terms pretty easily:

If one bad Terraform module change can break dozens of repos, you don’t have a module problem, you have a blast radius problem.

If one CI permission or runner change can halt every pipeline, you don’t have a GitHub problem, you have a control plane dependency problem.

If one networking change can brick multiple clusters, you don’t have a VPC problem, you have an environment isolation problem.

The answer isn’t “be more careful.” The answer is segmentation and progressive delivery. Make it physically hard for a change to take everything down at once.

If you want a quick “do we fail small?” gut check, it’s this:

Can we roll changes out to a small slice of traffic by default?
Can we stop the rollout quickly?
Can we roll back quickly?
And can we prevent a single change from touching all environments at once?

If any of those are “not really,” you’ve got a reliability project hiding inside your delivery process.
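
Here’s what that gut check looks like as actual machinery, as a minimal sketch. The stage sizes, the error budget, and the `set_traffic_weight` / `error_rate` / `rollback` helpers are all hypothetical stand-ins for whatever your platform really exposes:

```python
import time

# Hypothetical helpers, passed in so the sketch stays tool-agnostic:
#   set_traffic_weight(pct) -> route pct% of traffic to the new version
#   error_rate()            -> current error rate of the new version (0.0-1.0)
#   rollback()              -> send 100% of traffic back to last known-good

STAGES = [1, 5, 25, 100]   # small slice by default, never everywhere at once
ERROR_BUDGET = 0.01        # abort past 1% errors
SOAK_SECONDS = 300         # watch each stage before expanding

def progressive_rollout(set_traffic_weight, error_rate, rollback) -> bool:
    for pct in STAGES:
        set_traffic_weight(pct)              # blast radius capped at pct%
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if error_rate() > ERROR_BUDGET:
                rollback()                   # stopping is cheap and immediate
                return False
            time.sleep(10)
    return True                              # healthy at every stage
```

The numbers are arbitrary. The point is that “small slice first, watch, abort cheaply” is the default path, not a special ceremony you invoke on scary changes.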

Pulumi’s IaC push: the workflow is the product now
Pulumi’s “all IaC, including Terraform and HCL” post is interesting because it’s not really about Terraform vs Pulumi.

It’s about the control plane around infrastructure changes.

The market has been shifting from “pick an IaC language” to “pick a workflow that your org can live inside.” Approvals. Policy enforcement. Auditing. Drift detection. Visibility. Team-wide patterns. This is where platform work actually lives.

The reason Pulumi’s move matters is that it’s trying to lower the rewrite tax. Most shops don’t want to rebuild everything. They want better guardrails without a multi-year refactor.

So if you’re a Terraform shop, the question isn’t “should we switch?” The question is:

Do we have a real, consistent workflow around infrastructure change?
Or are we still depending on hero knowledge and fragile pipelines?

If you’re already happy with your control plane (TFC/TFE, Spacelift, Atlantis, or your own internal setup), cool. This is still worth watching because it’s a sign of where the “platform baseline” expectations are going in 2026: centralized runs, policy-as-code, least privilege, auditable approvals, and safer defaults.
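
To make “policy-as-code” concrete, here’s a minimal sketch of the kind of gate those platforms run on every plan. It assumes the plan was exported with `terraform show -json`, and the `module.prod` / `module.staging` naming convention it keys off is invented for illustration:

```python
import json
import sys

# Assumes the plan was exported with: terraform show -json plan.out > plan.json
# The module.prod / module.staging address convention is hypothetical.

ENVS = ("prod", "staging", "dev")

def check_plan(plan_path: str) -> list[str]:
    with open(plan_path) as f:
        plan = json.load(f)
    violations = []
    touched_envs = set()
    for rc in plan.get("resource_changes", []):
        addr = rc["address"]
        if "delete" in rc["change"]["actions"]:
            violations.append(f"destructive change: {addr}")
        for env in ENVS:
            if f"module.{env}" in addr:
                touched_envs.add(env)
    if len(touched_envs) > 1:
        violations.append(f"one change touches multiple environments: {sorted(touched_envs)}")
    return violations

if __name__ == "__main__":
    problems = check_plan(sys.argv[1])
    for p in problems:
        print(f"POLICY VIOLATION: {p}")
    sys.exit(1 if problems else 0)
```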

Meta DrP: turning incident investigation into software
The DrP story is my favorite one because it’s the purest “SRE as engineering” example in the episode.

Humans do the same investigation steps every time an incident happens:
What changed?
What’s the timeline?
What services are correlated?
What’s the dependency health?
What errors spiked?
What deploys landed?

Meta’s angle is: stop doing this manually. Codify the investigation patterns and run them automatically as analyzers.

Even if you never build anything like DrP, the model is worth stealing:

Pick your top recurring incident types.
For each, identify the first three questions you always ask.
Automate those three questions into a consistent “first 10 minutes” incident response.

That can be a Slack bot.
It can be a runbook template with real links.
It can be a script that generates a timeline.
It can be as simple as “incident channel gets auto-populated with deploys, dashboards, and relevant queries.”
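
To give you a feel for how small this can start, here’s a minimal sketch of the analyzer pattern. `recent_deploys`, `error_spikes`, and `post_to_channel` are hypothetical stubs for your deploy log, your metrics store, and your chat tooling:

```python
from datetime import datetime, timedelta, timezone

# Two analyzers, each answering one question you always ask, plus a runner
# that posts the answers in the same order every time.

LOOKBACK = timedelta(hours=2)

def what_changed(recent_deploys) -> str:
    since = datetime.now(timezone.utc) - LOOKBACK
    deploys = recent_deploys(since)
    if not deploys:
        return "No deploys in the last 2h."
    lines = [f"- {d['service']} @ {d['time']} by {d['author']}" for d in deploys]
    return "Recent deploys:\n" + "\n".join(lines)

def what_spiked(error_spikes) -> str:
    spikes = error_spikes(LOOKBACK)
    if not spikes:
        return "No error spikes in the lookback window."
    lines = [f"- {s['service']}: {s['metric']} up {s['delta_pct']}%" for s in spikes]
    return "Error spikes:\n" + "\n".join(lines)

def run_first_ten_minutes(channel, post_to_channel, recent_deploys, error_spikes):
    # Same questions, same order, every incident -- whoever is on call.
    for answer in (what_changed(recent_deploys), what_spiked(error_spikes)):
        post_to_channel(channel, answer)
```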

The win isn’t AI magic. The win is consistency. It reduces the cognitive load when things are on fire, and it stops your on-call quality from depending on who drew the short straw that week.

Lightning round: quick follow-ups
GitHub Actions is still exhibit A for “control planes matter.” If you caught Episode 6, you already know my take: Actions isn’t just CI anymore, it’s part of the delivery pipeline, the GitOps loop, and sometimes the break-glass path. If that control plane has pricing changes, incidents, or performance problems, it’s not “dev inconvenience,” it’s operational impact.

AWS ECR creating repos on push is one of those features that sounds small until you multiply it across a big org. It’s either a nice automation win or a new sprawl problem, depending on whether you have naming standards and default security controls.
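
If you do turn create-on-push on, one cheap guard is a naming check in CI before anything gets pushed. The standard below is invented for illustration; what matters is that a gate exists at all:

```python
import re

# Hypothetical naming standard for auto-created repos: <team>/<service>,
# lowercase kebab-case, and no throwaway names leaking into the registry.
REPO_PATTERN = re.compile(r"^[a-z0-9-]+/[a-z0-9-]+$")
BANNED_WORDS = ("tmp", "scratch", "sandbox")

def repo_name_ok(name: str) -> bool:
    if not REPO_PATTERN.fullmatch(name):
        return False
    service = name.split("/")[-1]
    return not any(word in service for word in BANNED_WORDS)

assert repo_name_ok("payments/checkout-api")
assert not repo_name_ok("Payments/Checkout_API")
assert not repo_name_ok("payments/tmp-debug")
```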

Metrics and MTTR: I love the reminder that averages can lie. MTTR is often dominated by outliers, and that makes “we improved our MTTR by 10%” a pretty weak claim unless you’re looking at distributions and repeatable process improvements.
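
A ten-second worked example of why averages lie, with made-up recovery times:

```python
from statistics import mean, median

# Hypothetical time-to-recover values, in minutes, for ten incidents.
recoveries = [8, 11, 9, 14, 10, 12, 7, 13, 9, 420]   # one all-nighter

print(f"mean:   {mean(recoveries):.1f} min")    # 51.3 -- dominated by the outlier
print(f"median: {median(recoveries):.1f} min")  # 10.5 -- the typical incident
```

Drop the one outlier and the mean falls from ~51 to ~10 minutes, even though nothing about your process changed. That’s why the claim needs distributions behind it.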

And drift is still drift. If your IaC story doesn’t include drift detection and a plan to respond, “source of truth” is basically a motivational poster.
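
Drift detection doesn’t have to start as a product, either. `terraform plan -detailed-exitcode` exits 0 when live infra matches state, 2 when it doesn’t, and 1 on error, so a scheduled check can be this small (the `notify` hook is a hypothetical stand-in for Slack or your pager):

```python
import subprocess

# Exit codes for `terraform plan -detailed-exitcode`:
#   0 = no changes, 1 = error, 2 = diff between state and live infra.

def check_drift(workdir: str, notify) -> bool:
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    if result.returncode == 2:
        notify(f"Drift detected in {workdir}:\n{result.stdout[-2000:]}")
        return True
    if result.returncode == 1:
        notify(f"terraform plan failed in {workdir}:\n{result.stderr[-2000:]}")
    return False
```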

Human closer: the ironies of automation
This is the part I want to underline.

Automation doesn’t remove responsibility. It moves responsibility.

Faster automation creates a harder oversight problem. If a system can do more things faster, it can also do the wrong thing faster, and humans have less time to notice and react.

So the real platform job is not just “add automation.” It’s “design the control loop”:

Make actions observable.
Contain blast radius.
Slow down when confidence is low.
Make rollback easy.
Design safety rails that don’t depend on perfect humans.
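
As a sketch, that control loop is just a wrapper you put around any automated action. Every hook here (`confidence`, `act`, `verify`, `rollback`, `request_approval`) is a hypothetical stand-in for your own tooling:

```python
import logging

log = logging.getLogger("automation")

CONFIDENCE_FLOOR = 0.8   # below this, a human decides instead of the machine

def guarded(action_name, confidence, act, verify, rollback, request_approval) -> bool:
    log.info("proposed: %s", action_name)       # make actions observable
    if confidence() < CONFIDENCE_FLOOR:
        if not request_approval(action_name):   # slow down when confidence is low
            return False
    act()                                       # ideally against a small slice first
    if not verify():                            # check reality, not intent
        rollback()                              # reversal is routine, not heroic
        return False
    return True
```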

That’s what “fail small” is really about. And it’s why it’s a perfect theme for the first weekly episode of the year.

Links
SRE Weekly #503: https://sreweekly.com/sre-weekly-issue-503/
Pulumi: all IaC, including Terraform and HCL: https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/
Meta DrP: https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/
GitHub Actions direction: https://github.blog/news-insights/product-news/lets-talk-about-github-actions/
AWS ECR create repos on push: https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/
DriftHound: https://drifthound.io/
Superset: https://superset.sh/
Episode 6 (Actions pricing pause backstory): https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

📝 Show Notes

This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.

We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.

Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.

Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10–15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.

In the lightning round: a follow-up on GitHub Actions direction (and a quick callback to Episode 6’s runner pricing pause), AWS ECR creating repos on push, a smarter take on incident metrics, Terraform drift visibility, and parallel “coding agent” workflows.

We wrap with a human reminder about the ironies of automation: automation doesn’t remove responsibility, it moves it. Faster systems require better brakes, better observability, and easier rollback.

Links from this episode

SRE Weekly issue 503 (source roundup - Cloudflare) https://sreweekly.com/sre-weekly-issue-503/

Pulumi: all IaC, including Terraform and HCL https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/

Meta DrP: https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/

GitHub Actions: “Let’s talk about GitHub Actions” https://github.blog/news-insights/product-news/lets-talk-about-github-actions/

Episode 6 (GitHub runner pricing pause, Terraform Cloud limits, AI in CI) https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

AWS ECR: create repositories on push https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/

DriftHound https://drifthound.io/

Superset https://superset.sh/

More episodes + contact info, and more details on this episode can be found on our website: https://shipitweekly.fm