💬 Host Commentary

This episode is basically my “welcome back to reality” message for the start of the year.

A lot of teams are coming into January with momentum: new initiatives, new tooling, new automation, more AI agents in workflows, more platform “self-service.” That’s great. But the trap is thinking speed automatically means progress.

Speed without containment just means you fail faster.

That’s why the theme of Episode 10 is brakes and blast radius.

Cloudflare: “Fail small” isn’t a slogan, it’s a design requirement
Cloudflare’s post hit home for me because it’s the type of resilience work that’s invisible when it’s done right and brutally obvious when it’s not.

If you’ve been in ops long enough, you learn that the outages that really hurt are rarely “one machine died.” The painful ones are correlated. The same broken change lands everywhere, the same dependency falls over across regions, the same config causes a cascade across the fleet.

“Fail small” is the mindset that forces you to design systems where problems stay local by default.

You can translate that into normal-company terms pretty easily:

If one bad Terraform module change can break dozens of repos, you don’t have a module problem, you have a blast radius problem.

If one CI permission or runner change can halt every pipeline, you don’t have a GitHub problem, you have a control plane dependency problem.

If one networking change can brick multiple clusters, you don’t have a VPC problem, you have an environment isolation problem.

The answer isn’t “be more careful.” The answer is segmentation and progressive delivery. Make it physically hard for a change to take everything down at once.

If you want a quick “do we fail small?” gut check, it’s this:

Can we roll changes out to a small slice of traffic by default?
Can we stop the rollout quickly?
Can we roll back quickly?
And can we prevent a single change from touching all environments at once?

If any of those are “not really,” you’ve got a reliability project hiding inside your delivery process.
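
Here’s what that gut check looks like as actual machinery, as a minimal sketch. The stage sizes, the error budget, and the `set_traffic_weight` / `error_rate` / `rollback` helpers are all hypothetical stand-ins for whatever your platform really exposes:

```python
import time

# Hypothetical helpers, passed in so the sketch stays tool-agnostic:
#   set_traffic_weight(pct) -> route pct% of traffic to the new version
#   error_rate()            -> current error rate of the new version (0.0-1.0)
#   rollback()              -> send 100% of traffic back to last known-good

STAGES = [1, 5, 25, 100]   # small slice by default, never everywhere at once
ERROR_BUDGET = 0.01        # abort past 1% errors
SOAK_SECONDS = 300         # watch each stage before expanding

def progressive_rollout(set_traffic_weight, error_rate, rollback) -> bool:
    for pct in STAGES:
        set_traffic_weight(pct)              # blast radius capped at pct%
        deadline = time.time() + SOAK_SECONDS
        while time.time() < deadline:
            if error_rate() > ERROR_BUDGET:
                rollback()                   # stopping is cheap and immediate
                return False
            time.sleep(10)
    return True                              # healthy at every stage
```

The numbers are arbitrary. The point is that “small slice first, watch, abort cheaply” is the default path, not a special ceremony you invoke on scary changes.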

Pulumi’s IaC push: the workflow is the product now
Pulumi’s “all IaC, including Terraform and HCL” post is interesting because it’s not really about Terraform vs Pulumi.

It’s about the control plane around infrastructure changes.

The market has been shifting from “pick an IaC language” to “pick a workflow that your org can live inside.” Approvals. Policy enforcement. Auditing. Drift detection. Visibility. Team-wide patterns. This is where platform work actually lives.

The reason Pulumi’s move matters is that it’s trying to lower the rewrite tax. Most shops don’t want to rebuild everything. They want better guardrails without a multi-year refactor.

So if you’re a Terraform shop, the question isn’t “should we switch?” The question is:

Do we have a real, consistent workflow around infrastructure change?
Or are we still depending on hero knowledge and fragile pipelines?

If you’re already happy with your control plane (TFC/TFE, Spacelift, Atlantis, or your own internal setup), cool. This is still worth watching because it’s a sign of where the “platform baseline” expectations are going in 2026: centralized runs, policy-as-code, least privilege, auditable approvals, and safer defaults.
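
To make “policy-as-code” concrete, here’s a minimal sketch of the kind of gate those platforms run on every plan. It assumes the plan was exported with `terraform show -json`, and the `module.prod` / `module.staging` naming convention it keys off is invented for illustration:

```python
import json
import sys

# Assumes the plan was exported with: terraform show -json plan.out > plan.json
# The module.prod / module.staging address convention is hypothetical.

ENVS = ("prod", "staging", "dev")

def check_plan(plan_path: str) -> list[str]:
    with open(plan_path) as f:
        plan = json.load(f)
    violations = []
    touched_envs = set()
    for rc in plan.get("resource_changes", []):
        addr = rc["address"]
        if "delete" in rc["change"]["actions"]:
            violations.append(f"destructive change: {addr}")
        for env in ENVS:
            if f"module.{env}" in addr:
                touched_envs.add(env)
    if len(touched_envs) > 1:
        violations.append(f"one change touches multiple environments: {sorted(touched_envs)}")
    return violations

if __name__ == "__main__":
    problems = check_plan(sys.argv[1])
    for p in problems:
        print(f"POLICY VIOLATION: {p}")
    sys.exit(1 if problems else 0)
```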

Meta DrP: turning incident investigation into software
The DrP story is my favorite one because it’s the purest “SRE as engineering” example in the episode.

Humans do the same investigation steps every time an incident happens:
What changed?
What’s the timeline?
What services are correlated?
What’s the dependency health?
What errors spiked?
What deploys landed?

Meta’s angle is: stop doing this manually. Codify the investigation patterns and run them automatically as analyzers.

Even if you never build anything like DrP, the model is worth stealing:

Pick your top recurring incident types.
For each, identify the first three questions you always ask.
Automate those three questions into a consistent “first 10 minutes” incident response.

That can be a Slack bot.
It can be a runbook template with real links.
It can be a script that generates a timeline.
It can be as simple as “incident channel gets auto-populated with deploys, dashboards, and relevant queries.”
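
To give you a feel for how small this can start, here’s a minimal sketch of the analyzer pattern. `recent_deploys`, `error_spikes`, and `post_to_channel` are hypothetical stubs for your deploy log, your metrics store, and your chat tooling:

```python
from datetime import datetime, timedelta, timezone

# Two analyzers, each answering one question you always ask, plus a runner
# that posts the answers in the same order every time.

LOOKBACK = timedelta(hours=2)

def what_changed(recent_deploys) -> str:
    since = datetime.now(timezone.utc) - LOOKBACK
    deploys = recent_deploys(since)
    if not deploys:
        return "No deploys in the last 2h."
    lines = [f"- {d['service']} @ {d['time']} by {d['author']}" for d in deploys]
    return "Recent deploys:\n" + "\n".join(lines)

def what_spiked(error_spikes) -> str:
    spikes = error_spikes(LOOKBACK)
    if not spikes:
        return "No error spikes in the lookback window."
    lines = [f"- {s['service']}: {s['metric']} up {s['delta_pct']}%" for s in spikes]
    return "Error spikes:\n" + "\n".join(lines)

def run_first_ten_minutes(channel, post_to_channel, recent_deploys, error_spikes):
    # Same questions, same order, every incident -- whoever is on call.
    for answer in (what_changed(recent_deploys), what_spiked(error_spikes)):
        post_to_channel(channel, answer)
```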

The win isn’t AI magic. The win is consistency. It reduces the cognitive load when things are on fire, and it stops your on-call quality from depending on who drew the short straw that week.

Lightning round: quick follow-ups
GitHub Actions is still exhibit A for “control planes matter.” If you caught Episode 6, you already know my take: Actions isn’t just CI anymore, it’s part of the delivery pipeline, the GitOps loop, and sometimes the break-glass path. If that control plane has pricing changes, incidents, or performance problems, it’s not “dev inconvenience,” it’s operational impact.

AWS ECR creating repos on push is one of those features that sounds small until you multiply it across a big org. It’s either a nice automation win or a new sprawl problem, depending on whether you have naming standards and default security controls.
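
If you do turn create-on-push on, one cheap guard is a naming check in CI before anything gets pushed. The standard below is invented for illustration; what matters is that a gate exists at all:

```python
import re

# Hypothetical naming standard for auto-created repos: <team>/<service>,
# lowercase kebab-case, and no throwaway names leaking into the registry.
REPO_PATTERN = re.compile(r"^[a-z0-9-]+/[a-z0-9-]+$")
BANNED_WORDS = ("tmp", "scratch", "sandbox")

def repo_name_ok(name: str) -> bool:
    if not REPO_PATTERN.fullmatch(name):
        return False
    service = name.split("/")[-1]
    return not any(word in service for word in BANNED_WORDS)

assert repo_name_ok("payments/checkout-api")
assert not repo_name_ok("Payments/Checkout_API")
assert not repo_name_ok("payments/tmp-debug")
```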

Metrics and MTTR: I love the reminder that averages can lie. MTTR is often dominated by outliers, and that makes “we improved our MTTR by 10%” a pretty weak claim unless you’re looking at distributions and repeatable process improvements.
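
A ten-second worked example of why averages lie, with made-up recovery times:

```python
from statistics import mean, median

# Hypothetical time-to-recover values, in minutes, for ten incidents.
recoveries = [8, 11, 9, 14, 10, 12, 7, 13, 9, 420]   # one all-nighter

print(f"mean:   {mean(recoveries):.1f} min")    # 51.3 -- dominated by the outlier
print(f"median: {median(recoveries):.1f} min")  # 10.5 -- the typical incident
```

Drop the one outlier and the mean falls from ~51 to ~10 minutes, even though nothing about your process changed. That’s why the claim needs distributions behind it.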

And drift is still drift. If your IaC story doesn’t include drift detection and a plan to respond, “source of truth” is basically a motivational poster.
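
Drift detection doesn’t have to start as a product, either. `terraform plan -detailed-exitcode` exits 0 when live infra matches state, 2 when it doesn’t, and 1 on error, so a scheduled check can be this small (the `notify` hook is a hypothetical stand-in for Slack or your pager):

```python
import subprocess

# Exit codes for `terraform plan -detailed-exitcode`:
#   0 = no changes, 1 = error, 2 = diff between state and live infra.

def check_drift(workdir: str, notify) -> bool:
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir, capture_output=True, text=True,
    )
    if result.returncode == 2:
        notify(f"Drift detected in {workdir}:\n{result.stdout[-2000:]}")
        return True
    if result.returncode == 1:
        notify(f"terraform plan failed in {workdir}:\n{result.stderr[-2000:]}")
    return False
```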

Human closer: the ironies of automation
This is the part I want to underline.

Automation doesn’t remove responsibility. It moves responsibility.

Faster automation creates a harder oversight problem. If a system can do more things faster, it can also do the wrong thing faster, and humans have less time to notice and react.

So the real platform job is not just “add automation.” It’s “design the control loop”:

Make actions observable.
Contain blast radius.
Slow down when confidence is low.
Make rollback easy.
Design safety rails that don’t depend on perfect humans.
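
As a sketch, that control loop is just a wrapper you put around any automated action. Every hook here (`confidence`, `act`, `verify`, `rollback`, `request_approval`) is a hypothetical stand-in for your own tooling:

```python
import logging

log = logging.getLogger("automation")

CONFIDENCE_FLOOR = 0.8   # below this, a human decides instead of the machine

def guarded(action_name, confidence, act, verify, rollback, request_approval) -> bool:
    log.info("proposed: %s", action_name)       # make actions observable
    if confidence() < CONFIDENCE_FLOOR:
        if not request_approval(action_name):   # slow down when confidence is low
            return False
    act()                                       # ideally against a small slice first
    if not verify():                            # check reality, not intent
        rollback()                              # reversal is routine, not heroic
        return False
    return True
```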

That’s what “fail small” is really about. And it’s why it’s a perfect theme for the first weekly episode of the year.

Links
SRE Weekly #503: https://sreweekly.com/sre-weekly-issue-503/
Pulumi: all IaC, including Terraform and HCL: https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/
Meta DrP: https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/
GitHub Actions direction: https://github.blog/news-insights/product-news/lets-talk-about-github-actions/
AWS ECR create repos on push: https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/
DriftHound: https://drifthound.io/
Superset: https://superset.sh/
Episode 6 (Actions pricing pause backstory): https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

📝 Show Notes

This week on Ship It Weekly, Brian kicks off the new year with one theme: automation is getting faster, and that makes blast radius and oversight matter more than ever.

We start with Cloudflare’s “fail small” mindset. The core idea is simple: big outages usually come from correlated failure, not one box dying. If a bad change lands everywhere at once, you’re toast. “Fail small” is about forcing problems to stay local so you can stop the bleeding before it becomes global.

Next is Pulumi’s push to be the control plane for all your IaC, including Terraform and HCL. The interesting part isn’t syntax wars. It’s the workflow layer: approvals, policy enforcement, audit trails, drift, and how teams standardize without signing up for a multi-year rewrite.

Third is Meta’s DrP, a root cause analysis platform that turns repeated incident investigation steps into software. Even if you’re not Meta, the pattern is worth stealing: automate the first 10–15 minutes of your most common incident types so on-call is consistent no matter who’s holding the pager.

In the lightning round: a follow-up on GitHub Actions direction (and a quick callback to Episode 6’s runner pricing pause), AWS ECR creating repos on push, a smarter take on incident metrics, Terraform drift visibility, and parallel “coding agent” workflows.

We wrap with a human reminder about the ironies of automation: automation doesn’t remove responsibility, it moves it. Faster systems require better brakes, better observability, and easier rollback.

Links from this episode

SRE Weekly issue 503 (source roundup - Cloudflare) https://sreweekly.com/sre-weekly-issue-503/

Pulumi: all IaC, including Terraform and HCL https://www.pulumi.com/blog/all-iac-including-terraform-and-hcl/

Meta DrP: https://engineering.fb.com/2025/12/19/data-infrastructure/drp-metas-root-cause-analysis-platform-at-scale/

GitHub Actions: “Let’s talk about GitHub Actions” https://github.blog/news-insights/product-news/lets-talk-about-github-actions/

Episode 6 (GitHub runner pricing pause, Terraform Cloud limits, AI in CI) https://www.tellerstech.com/ship-it-weekly/github-runner-pricing-pause-terraform-cloud-limits-and-ai-in-ci/

AWS ECR: create repositories on push https://aws.amazon.com/about-aws/whats-new/2025/12/amazon-ecr-creating-repositories-on-push/

DriftHound https://drifthound.io/

Superset https://superset.sh/

More episodes + contact info, and more details on this episode can be found on our website: https://shipitweekly.fm