GitHub Actions Hardening, Airbnb Config Rollouts, Cloudflare Rust Restarts, ECS Managed Daemons, and Terraform Access Controls

Transcript

0:00 A lot of the work that keeps systems safe does

0:02 not look important until the day it is. It's

0:05 a pinned commit hash, a safer config rollout, a

0:08 cleaner restart, a daemon that starts before

0:11 the app does, an IP range nobody can use your

0:14 token from outside of. None of that is sexy.

0:17 All of it matters. Hey, I'm Brian Teller. I work

0:37 in DevOps and SRE, and I run Teller's Tech. This

0:40 is Ship It Weekly, where I filter the noise and

0:42 focus on what actually changes how we run infrastructure

0:45 and own reliability. Show notes and links are

0:49 on shipitweekly.fm. If the show's been useful,

0:52 follow it wherever you listen. Ratings help way

0:54 more than they should. And if you want more signal

0:57 between episodes, check out oncallbrief.com.

1:00 We have five main stories today, then the lightning

1:02 round, and we'll wrap with the human closer.

1:05 We're starting with GitHub Actions and Kubernetes,

1:07 because the helper layer is officially part of

1:10 the trust boundary now. Then Airbnb, with one

1:13 of the better real-world platform stories I've

1:16 seen in a while on shipping config changes safely.

1:19 After that, Cloudflare open sourcing the graceful

1:23 restart plumbing behind zero downtime upgrades

1:26 for Rust services. Then AWS ECS managed daemons,

1:30 which is a nice clean platform engineering story.

1:34 And finally, HCP Terraform getting more serious

1:37 about narrowing access with IP allow lists and

1:41 temporary AWS permission delegation. Story one,

1:48 GitHub Actions is not just CI anymore. Let's

1:51 start there because I think the easy part of

1:53 GitHub Actions is over. On this week's on-call

1:56 brief, we flagged that Kubernetes-related repositories

2:00 are moving towards full 40-character SHA pinning

2:03 for actions, with non-compliant workflows set

2:07 to fail after April 15th. That sounds like a

2:10 tiny implementation detail right up until you

2:12 remember how many teams got burned by mutable

2:15 tags and trusted automation over the last year.
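
To make the pinning rule concrete, here is a minimal sketch of a check that flags any `uses:` reference not pinned to a full 40-character commit SHA. The function name `unpinned_actions` and the line-based parsing are assumptions for illustration; this is not the tooling the Kubernetes repositories use, and the SHA in the usage note is a made-up placeholder.

```python
import re
from typing import List

# Matches a workflow step line such as "- uses: owner/repo@ref" or "uses: owner/repo@ref".
USES_RE = re.compile(r"^\s*-?\s*uses:\s*(\S+)")
# A full-length commit SHA is exactly 40 hexadecimal characters.
FULL_SHA_RE = re.compile(r"^[0-9a-fA-F]{40}$")


def unpinned_actions(workflow_text: str) -> List[str]:
    """Return every `uses:` reference that is not pinned to a full commit SHA."""
    findings = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1).strip("'\"")
        # Local actions and container images are outside the scope of this sketch.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.match(version):
            findings.append(ref)
    return findings
```

Given a workflow containing `uses: actions/checkout@v4` and `uses: actions/setup-go@0123456789abcdef0123456789abcdef01234567`, only the first reference is reported, because a tag like `v4` can be moved to point at different code while a full SHA cannot.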

2:18 And the broader GitHub direction lines up with

2:20 that. GitHub's 2026 Actions security roadmap is

2:24 basically one big admission that CI and workflow

2:27 automation are part of the software supply chain

2:30 now. GitHub says the roadmap includes dependency

2:33 locking for workflows, centralized policy controls,

2:37 an Actions data stream for observability, and

2:40 egress controls for hosted runners. GitHub's

2:43 security team also said this week that recent

2:46 attacks are increasingly about exfiltrating secrets

2:49 and that many of them start by compromising a

2:52 workflow on GitHub Actions. That's the story,

2:55 not "GitHub added a security feature." More like

2:58 the industry has finally stopped pretending build

3:01 automation is some layer off to the side. If

3:04 a workflow can publish artifacts, push code,

3:07 assume cloud roles, or touch secrets, then it

3:10 is part of the trust boundary, period. So the

3:13 practical takeaway is pretty simple. If you are

3:15 still using broad version tags or workflows that

3:19 nobody has really reviewed in a while, the platform

3:21 is telling you where this is going. The old convenience

3:24 model is getting squeezed out. And honestly,

3:27 good. It probably should. Story 2. Airbnb built

3:35 the kind of config platform people say they want.

3:38 Next up, Airbnb. I really like this one because

3:40 it is not hype. It is not "we added AI to configs."

3:43 It is just good platform work. Airbnb wrote about

3:47 its internal dynamic config platform, Sitar.

3:50 The basic idea is to make runtime config changes

3:53 safer without making them painfully slow. Their

3:56 setup uses a Git-based workflow by default,

3:59 schema validation and review before rollout,

4:02 staged rollouts, fast rollback, and a separation

4:05 between the control plane and the data plane.

4:08 On the service side, there's an agent sidecar

4:11 and a local cache. So services can keep running

4:14 on the last known good config, even if the backend

4:17 is degraded. That's the kind of story I want

4:19 more of. Because config is one of those places

4:22 where teams love the flexibility right up until

4:25 it becomes the outage. And the answer is usually

4:27 not "ban dynamic config." The answer is build the

4:30 guardrails and the blast radius controls so the

4:33 flexibility does not turn into roulette. The

4:36 part I liked the most is the last known good

4:39 behavior. This is such a real operator move.
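
The last-known-good idea can be sketched in a few lines. This is a minimal illustration, not how Sitar is implemented: the class name `LastKnownGoodConfig`, the `fetch` and `validate` callables, and the in-memory cache are all assumptions made for the example.

```python
from typing import Callable, Optional


class LastKnownGoodConfig:
    """Serve the most recent valid config, even when the backend is failing."""

    def __init__(
        self,
        fetch: Callable[[], dict],
        validate: Callable[[dict], bool],
    ) -> None:
        self._fetch = fetch          # hypothetical call to the config backend
        self._validate = validate    # hypothetical schema or sanity check
        self._cached: Optional[dict] = None

    def get(self) -> dict:
        try:
            candidate = self._fetch()
            # Only replace the cached copy when the new config passes validation.
            if self._validate(candidate):
                self._cached = candidate
        except Exception:
            # Backend is degraded: fall through and keep serving the cached copy.
            pass
        if self._cached is None:
            raise RuntimeError("no valid config has ever been loaded")
        return self._cached
```

The design choice worth noticing is that both a backend error and an invalid payload leave the cached copy untouched, so a bad day in the control plane does not change what the service is already running with.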

4:42 Not just "our config platform is available," more

4:44 like "what happens to the service if the config

4:47 backend is having a bad day?" That's the right

4:50 question. A lot of systems look fine until their

4:53 control plane gets weird. Airbnb is clearly thinking

4:57 past that. So yeah, for me, the takeaway here

4:59 is that safe speed is real engineering work. It's

5:03 not just vibes. It is staged rollout, rollback,

5:06 validation, and not making every config change

5:10 a big bang. Story 3. Cloudflare open sourced the

5:18 restart plumbing nobody notices when it works.

5:21 Now for one of my favorite kinds of infra stories.

5:23 Cloudflare open sourced a Rust library called

5:26 ecdysis that it says has been in production

5:29 for five years and enables zero downtime upgrades

5:33 across critical Rust services. Their write-up

5:36 says the point is to restart network services

5:38 without dropping live connections or refusing

5:42 new ones, even at Cloudflare scale. The way it

5:45 works is a fork-and-exec handoff, where the

5:47 child process inherits the listening socket,

5:50 signals readiness, and only then does the parent

5:53 stop accepting new work and drain existing connections.
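
The handoff order described here can be sketched in Python on a POSIX system. This is an illustration of the general pattern only, not the ecdysis API: the function name `handoff_demo`, the inline child script, and the "ready" line used as the readiness signal are assumptions, and draining of in-flight connections in the old process is left out.

```python
import socket
import subprocess
import sys

# Hypothetical "new version" of the service: it adopts the inherited listening
# socket, reports readiness, then serves a single connection.
CHILD_CODE = """
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
print("ready", flush=True)
conn, _ = listener.accept()
conn.sendall(b"served by child")
conn.close()
"""


def handoff_demo() -> bytes:
    # The "old" process owns the listening socket.
    parent_listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    parent_listener.bind(("127.0.0.1", 0))
    parent_listener.listen(8)
    port = parent_listener.getsockname()[1]
    fd = parent_listener.fileno()

    # Spawn the new process and let it inherit the listening socket's descriptor.
    with subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(fd)],
        pass_fds=[fd],
        stdout=subprocess.PIPE,
    ) as child:
        # Wait for the child to signal readiness before giving up the socket.
        if child.stdout.readline().strip() != b"ready":
            raise RuntimeError("child never became ready")
        # Only now does the old process stop accepting new connections.
        parent_listener.close()

        # The port never stopped listening, so this client is served by the child.
        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            return client.recv(1024)
```

Because the child holds its own copy of the listening socket, closing the parent's copy does not close the port, and connections that arrive during the swap wait in the kernel's accept queue instead of being refused.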

5:57 Cloudflare says that approach preserves millions

5:59 of requests across its global network on every

6:03 restart. This is exactly the kind of thing people

6:05 outside ops do not think about much. But it matters,

6:08 because "just restart the service" sounds clean

6:11 until the service is live, handling real traffic,

6:14 and the wrong restart behavior turns into a

6:17 customer-facing mess. The boring, invisible reliability

6:20 work is usually where the real maturity is. Not

6:24 in launch day demos, in the stuff that lets you

6:27 patch, upgrade, and change things without making

6:29 users feel it. And I also just like the honesty

6:32 of the post. It walked through why the naive

6:34 restart is bad, why SO_REUSEPORT is not enough

6:38 for graceful restarts, and why the socket handoff

6:41 matters. That's real engineering. You can hear

6:44 the battle scars in it. So if you run long-lived

6:47 services, proxies, gateways, or anything else

6:50 where restart behavior is part of reliability,

6:52 this is a good reminder that restart logic is

6:56 not housekeeping. It is production behavior.

7:02 Story four, Amazon ECS Managed Daemons is a good

7:06 platform team feature. The AWS story this week

7:09 is Amazon ECS Managed Daemons. And honestly,

7:13 I think this is a better story than yet another

7:15 agent announcement. AWS says Managed Daemons

7:18 for ECS Managed Instances lets platform teams

7:22 independently deploy and manage logging, tracing,

7:25 monitoring, and security agents without bundling

7:29 all of that into application deployments. AWS

7:32 says ECS runs exactly one daemon task per managed

7:35 instance, starts the daemon before application

7:38 tasks are placed, drains it last, and supports

7:41 rolling updates with rollback protection. There

7:44 is no extra feature charge beyond the compute

7:46 the daemon uses. That is clean. That is useful.

7:50 And that is one of those features where the value

7:52 shows up immediately if you've ever had to coordinate

7:55 host-level tooling through app teams that really

7:59 do not want to think about your agent lifecycle.

8:01 The part that I like is the separation of concerns.

8:04 App teams own app rollout. Platform teams own

8:08 platform tooling. That should not be controversial,

8:11 but a lot of environments still make those things

8:13 trip over each other. So when AWS gives people

8:16 a more explicit way to keep those concerns separate,

8:19 that is worth paying attention to. And it fits

8:21 the theme of the episode pretty well too. Another

8:24 background layer becoming more explicit, more

8:26 managed, and a little less improvised. Story

8:33 5. HCP Terraform is narrowing access in the places

8:37 people usually leave soft. The last main story

8:40 today, HCP Terraform. HashiCorp announced two

8:43 security features that I think fit together nicely,

8:46 even though they came as separate updates. First,

8:48 HCP Terraform now supports IP allow lists at

8:52 the organization and agent level. HashiCorp says

8:55 that means tokens are only accepted from trusted,

8:58 predefined IP addresses, and that it closes a

9:01 pretty obvious gap where valid credentials could

9:04 previously be used from anywhere by default.
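
As a rough illustration of what an IP allow list does, the check below accepts a request only when its source address falls inside a configured CIDR range. The function name `is_allowed` and the example ranges are hypothetical; this is not HCP Terraform's implementation or API, just the general idea expressed with Python's standard `ipaddress` module.

```python
import ipaddress
from typing import Iterable


def is_allowed(source_ip: str, allowed_cidrs: Iterable[str]) -> bool:
    """Return True only if source_ip falls inside one of the allowed CIDR ranges."""
    addr = ipaddress.ip_address(source_ip)
    # An address is never "in" a network of a different IP version, so mixed
    # IPv4 and IPv6 ranges can share one list.
    return any(addr in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

With an allow list of `["203.0.113.0/24"]` (a documentation range standing in for something like a NAT gateway's egress block), a valid token presented from `203.0.113.10` passes the check and the same token presented from `198.51.100.7` does not.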

9:08 HashiCorp also says agent pool allow lists can

9:11 be scoped separately, which lets teams align

9:14 access with real egress points like NAT gateways

9:18 or trusted VPC egress. Second, AWS permission

9:21 delegation is now generally available in HCP

9:24 Terraform. HashiCorp says this uses AWS temporary

9:28 permission delegation, so customers can grant

9:31 trusted partners short-lived, customer-approved

9:34 access for setup and onboarding tasks instead

9:37 of handing out long-lived permissions. This

9:40 is the kind of tightening I like. Nothing magical,

9:42 nothing flashy, just narrower access and less

9:46 standing privilege. And that is usually where

9:48 a lot of the real security value lives anyway,

9:51 not in adding another dashboard, in making the

9:54 default trust path shorter and more explicit.

9:57 So the practical takeaway here is if your Terraform

10:00 environment still feels a little too open by

10:02 default, HashiCorp is giving teams some better

10:04 primitives now. Network-based restrictions and

10:08 shorter-lived delegated access are both good

10:10 moves. A few quick ones before we wrap. GitHub's

10:21 secret scanning kept moving in March. GitHub

10:23 says it added 28 new detectors from 15 providers,

10:27 enabled push protection by default for 39 detectors,

10:30 and added more validity checks. Then GitHub added

10:33 secret scanning through the GitHub MCP server.

10:36 So AI coding agents can scan changes for exposed

10:39 secrets before you commit or open a pull request.

10:42 GitHub Codespaces is now generally available

10:44 for GitHub Enterprise Cloud with data residency.

10:48 GitHub says it supports Australia, the EU, Japan,

10:52 and the US, but requires enterprise or

10:55 organization-owned Codespaces. User-owned Codespaces are

10:59 not supported in that setup. And Kubernetes version

11:02 1.36 is closing off a couple of old footguns.

11:05 The project says .spec.externalIPs is deprecated

11:10 because of long-standing security risks. And

11:12 the old gitRepo volume driver is now permanently

11:16 disabled because it could allow code execution

11:19 as root on the node. I think the cleanest closer

11:29 for this one is pretty simple. Failure is loud.

11:32 Prevention is quiet. SRE Weekly put it that way

11:35 this week, and I think it lands because it's

11:37 true. Enterprises do not usually underinvest

11:40 in reliability because they hate reliability.

11:43 They underinvest because outages scream and prevention

11:46 whispers. Budgeting systems respond to noise.

11:49 Prevention work often looks boring right up until

11:52 the day it would have saved everybody a lot of

11:55 pain. And that is basically the whole episode.

11:58 Pinned actions are prevention. Safer config rollout

12:01 is prevention. Graceful restarts are prevention.

12:04 Managed daemons are prevention. And narrower

12:07 Terraform access is prevention. None of that

12:10 usually gets celebrated the way incident response

12:12 does. But if you've ever been the person on the

12:15 hook at 2am, you already know which work matters

12:18 more. And I think that is the human side of ops

12:21 that sometimes gets lost. A lot of the best work

12:24 we do is the stuff that nobody notices because

12:27 nothing happened. No page. No rollback scramble.

12:30 No emergency patch. No weird config blast radius.

12:34 No restart turning into an outage. No token working

12:37 from somewhere it never should have. That work

12:39 counts. Even when it is quiet. Especially when

12:43 it is quiet. Alright, that's it for this week

12:45 of Ship It Weekly. Quick recap. We talked about

12:48 GitHub Actions hardening and why CI is part of

12:51 the trust boundary now. Airbnb showing what safe

12:55 config rollout actually looks like. Cloudflare

12:58 open sourcing the restart plumbing behind zero

13:01 downtime Rust upgrades. Amazon ECS managed daemons.

13:05 And HCP Terraform tightening access with IP allow

13:09 lists and temporary AWS permission delegation.

13:12 Links and show notes are on shipitweekly.fm.

13:15 You can also find the video versions on YouTube.

13:18 And if you want more signal before the episode,

13:20 check out oncallbrief.com. If this episode was

13:23 useful, follow or subscribe wherever you listen.

13:26 And send it to the person on your team who keeps

13:29 getting asked to move faster while quietly doing

13:32 all of the work that keeps the background layers

13:34 from becoming the next incident. I'm Brian, and

13:37 I'll see you next week.
