0:00
A lot of the work that keeps systems safe does
0:02
not look important until the day it is. It's
0:05
a pinned commit hash, a safer config rollout, a
0:08
cleaner restart, a daemon that starts before
0:11
the app does, an IP range your token can't be
0:14
used outside of. None of that is sexy.
0:17
All of it matters. Hey, I'm Brian Teller. I work
0:37
in DevOps and SRE, and I run Teller's Tech. This
0:40
is Ship It Weekly, where I filter the noise and
0:42
focus on what actually changes how we run infrastructure
0:45
and own reliability. Show notes and links are
0:49
on shipitweekly.fm. If the show's been useful,
0:52
follow it wherever you listen. Ratings help way
0:54
more than they should. And if you want more signal
0:57
between episodes, check out oncallbrief.com.
1:00
We have five main stories today, then the lightning
1:02
round, and we'll wrap with the human closer.
1:05
We're starting with GitHub Actions and Kubernetes,
1:07
because the helper layer is officially part of
1:10
the trust boundary now. Then Airbnb, with one
1:13
of the better real-world platform stories I've
1:16
seen in a while on shipping config changes safely.
1:19
After that, Cloudflare open sourcing the graceful
1:23
restart plumbing behind zero downtime upgrades
1:26
for Rust services. Then AWS ECS managed daemons,
1:30
which is a nice clean platform engineering story.
1:34
And finally, HCP Terraform getting more serious
1:37
about narrowing access with IP allow lists and
1:41
temporary AWS permission delegation. Story 1.
1:48
GitHub Actions is not just CI anymore. Let's
1:51
start there because I think the easy part of
1:53
GitHub Actions is over. On this week's On-Call
1:56
Brief, we flagged that Kubernetes-related repositories
2:00
are moving towards full 40-character SHA pinning
2:03
for actions, with non-compliant workflows set
2:07
to fail after April 15th. That sounds like a
2:10
tiny implementation detail right up until you
2:12
remember how many teams got burned by mutable
2:15
tags and trusted automation over the last year.
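A quick way to see where a repository stands on pinning is to scan its workflow files for `uses:` references that do not end in a full 40-character commit SHA. The following is a minimal sketch, not the check the Kubernetes project runs: the function name, the regular expressions, and the decision to skip local (`./`) and `docker://` references are all assumptions made for illustration.

```python
import re

# Matches lines such as "- uses: owner/repo@ref" or "uses: owner/repo@ref".
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)")
# A full-length commit SHA: exactly 40 lowercase hex characters.
FULL_SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text: str) -> list[str]:
    """Return action references that are not pinned to a full commit SHA."""
    unpinned = []
    for line in workflow_text.splitlines():
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1)
        # Assumption: local actions and container images are out of scope.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, _, version = ref.partition("@")
        if not FULL_SHA_RE.fullmatch(version):
            unpinned.append(ref)
    return unpinned
```

Fed the text of a workflow file, the function returns references like `actions/checkout@v4` and leaves SHA-pinned ones alone. A real policy check would more likely parse the YAML properly and also cover reusable workflows.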
2:18
And the broader GitHub direction lines up with
2:20
that. GitHub's 2026 Actions security roadmap is
2:24
basically one big admission that CI and workflow
2:27
automation are part of the software supply chain
2:30
now. GitHub says the roadmap includes dependency
2:33
locking for workflows, centralized policy controls,
2:37
an Actions data stream for observability, and
2:40
egress controls for hosted runners. GitHub's
2:43
security team also said this week that recent
2:46
attacks are increasingly about exfiltrating secrets
2:49
and that many of them start by compromising a
2:52
workflow on GitHub Actions. That's the story,
2:55
not "GitHub added a security feature." More like
2:58
the industry has finally stopped pretending build
3:01
automation is some layer off to the side. If
3:04
a workflow can publish artifacts, push code,
3:07
assume cloud roles, or touch secrets, then it
3:10
is part of the trust boundary, period. So the
3:13
practical takeaway is pretty simple. If you are
3:15
still using broad version tags or workflows that
3:19
nobody has really reviewed in a while, the platform
3:21
is telling you where this is going. The old convenience
3:24
model is getting squeezed out. And honestly,
3:27
good. It probably should. Story 2. Airbnb built
3:35
the kind of config platform people say they want.
3:38
Next up, Airbnb. I really like this one because
3:40
it is not hype. It is not "we added AI to configs."
3:43
It is just good platform work. Airbnb wrote about
3:47
its internal dynamic config platform, Sitar.
3:50
The basic idea is to make runtime config changes
3:53
safer without making them painfully slow. Their
3:56
setup uses a Git-based workflow by default,
3:59
schema validation and review before rollout,
4:02
staged rollouts, fast rollback, and a separation
4:05
between the control plane and the data plane.
4:08
On the service side, there's an agent sidecar
4:11
and a local cache. So services can keep running
4:14
on the last known good config, even if the backend
4:17
is degraded. That's the kind of story I want
4:19
more of. Because config is one of those places
4:22
where teams love the flexibility right up until
4:25
it becomes the outage. And the answer is usually
4:27
not "ban dynamic config." The answer is to build the
4:30
guardrails and the blast radius controls so the
4:33
flexibility does not turn into roulette. The
4:36
part I liked the most is the last known good
4:39
behavior. This is such a real operator move.
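The last-known-good idea can be sketched in a few lines: try the backend, remember the most recent config that loaded successfully, and keep serving that copy when the backend fails. This is only an illustration of the pattern, not Airbnb's implementation. The `ConfigClient` class and its `fetch` callable are hypothetical names, and the cache here lives in memory rather than in a sidecar's local store.

```python
from typing import Callable, Optional


class ConfigClient:
    """Serve fresh config when possible, else the last known good copy."""

    def __init__(self, fetch: Callable[[], dict]) -> None:
        self._fetch = fetch  # hypothetical call into the config backend
        self._last_good: Optional[dict] = None

    def get(self) -> dict:
        try:
            self._last_good = self._fetch()
        except Exception:
            # Backend is degraded. Only fail if there has never been
            # a successful fetch to fall back on.
            if self._last_good is None:
                raise
        return self._last_good
```

The design choice worth noticing is that a backend failure is only fatal on a cold start. Once a service has loaded any valid config, a bad day for the control plane does not become a bad day for the data plane.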
4:42
Not just "our config platform is available," more
4:44
like "what happens to the service if the config
4:47
backend is having a bad day?" That's the right
4:50
question. A lot of systems look fine until their
4:53
control plane gets weird. Airbnb is clearly thinking
4:57
past that. So yeah, for me, the takeaway here
4:59
is that safe speed is real engineering work. It's
5:03
not just vibes. It is staged rollout, rollback,
5:06
validation, and not making every config change
5:10
a big bang. Story 3. Cloudflare open sourced the
5:18
restart plumbing nobody notices when it works.
5:21
Now for one of my favorite kinds of infra stories.
5:23
Cloudflare open sourced a Rust library called
5:26
ecdysis that it says has been in production
5:29
for five years and enables zero downtime upgrades
5:33
across critical Rust services. Their write-up
5:36
says the point is to restart network services
5:38
without dropping live connections or refusing
5:42
new ones, even at Cloudflare scale. The way it
5:45
works is a fork-and-exec handoff, where the
5:47
child process inherits the listening socket,
5:50
signals readiness, and only then does the parent
5:53
stop accepting new work and drain existing connections.
5:57
Cloudflare says that approach preserves millions
5:59
of requests across its global network on every
6:03
restart. This is exactly the kind of thing people
6:05
outside ops do not think about much. But it matters,
6:08
because "just restart the service" sounds clean
6:11
until the service is live, handling real traffic,
6:14
and the wrong restart behavior turns into a customer-facing
6:17
mess. The boring, invisible reliability
6:20
work is usually where the real maturity is. Not
6:24
in launch day demos, in the stuff that lets you
6:27
patch, upgrade, and change things without making
6:29
users feel it. And I also just like the honesty
6:32
of the post. It walked through why the naive
6:34
restart is bad, why SO_REUSEPORT is not enough
6:38
for graceful restarts, and why the socket handoff
6:41
matters. That's real engineering. You can hear
6:44
the battle scars in it. So if you run long-lived
6:47
services, proxies, gateways, or anything else
6:50
where restart behavior is part of reliability,
6:52
this is a good reminder that restart logic is
6:56
not housekeeping. It is production behavior.
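The handoff described in this story (the new process inherits the listening socket, signals readiness, and only then does the old process stop accepting) can be demonstrated in miniature. The sketch below is in Python rather than Rust and does not use or imitate the ecdysis API. The `handoff_demo` function, the `CHILD_CODE` string, and the "ready" message are assumptions made for illustration. It is POSIX-only, and it leaves out draining of in-flight connections.

```python
import socket
import subprocess
import sys

# Code run by the replacement process. It rebuilds the listener from the
# inherited file descriptor instead of binding a new socket.
CHILD_CODE = r"""
import socket, sys
listener = socket.socket(fileno=int(sys.argv[1]))
sys.stdout.write("ready\n")
sys.stdout.flush()
conn, _ = listener.accept()
conn.sendall(b"served by new process")
conn.close()
listener.close()
"""


def handoff_demo() -> bytes:
    # Old process: bind and listen once. Port 0 lets the OS pick a free port.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(16)
    port = listener.getsockname()[1]

    # Start the replacement and let it inherit the listening socket's fd.
    child = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, str(listener.fileno())],
        pass_fds=[listener.fileno()],
        stdout=subprocess.PIPE,
        text=True,
    )

    # Wait for the readiness signal before giving up the listener.
    if child.stdout.readline().strip() != "ready":
        child.kill()
        raise RuntimeError("replacement process never signalled readiness")
    listener.close()  # old process stops accepting; the child's copy stays open

    # A client connecting now is served by the new process.
    chunks = []
    with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
        while True:
            data = client.recv(1024)
            if not data:
                break
            chunks.append(data)
    child.stdout.close()
    child.wait(timeout=5)
    return b"".join(chunks)
```

Because the listening socket is never closed in every process at once, there is no window in which the kernel refuses new connections. Calling `handoff_demo()` should return the bytes sent by the replacement process.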
7:02
Story 4. Amazon ECS Managed Daemons is a good
7:06
platform team feature. The AWS story this week
7:09
is Amazon ECS Managed Daemons. And honestly,
7:13
I think this is a better story than yet another
7:15
agent announcement. AWS says Managed Daemons
7:18
for ECS Managed Instances lets platform teams
7:22
independently deploy and manage logging, tracing,
7:25
monitoring, and security agents without bundling
7:29
all of that into application deployments. AWS
7:32
says ECS runs exactly one daemon task per managed
7:35
instance, starts the daemon before application
7:38
tasks are placed, drains it last, and supports
7:41
rolling updates with rollback protection. There
7:44
is no extra feature charge beyond the compute
7:46
the daemon uses. That is clean. That is useful.
7:50
And that is one of those features where the value
7:52
shows up immediately if you've ever had to coordinate
7:55
host-level tooling through app teams that really
7:59
do not want to think about your agent lifecycle.
8:01
The part that I like is the separation of concerns.
8:04
App teams own app rollout. Platform teams own
8:08
platform tooling. That should not be controversial,
8:11
but a lot of environments still make those things
8:13
trip over each other. So when AWS gives people
8:16
a more explicit way to keep those concerns separate,
8:19
that is worth paying attention to. And it fits
8:21
the theme of the episode pretty well too. Another
8:24
background layer becoming more explicit, more
8:26
managed, and a little less improvised. Story
8:33
5. HCP Terraform is narrowing access in the places
8:37
people usually leave soft. The last main story
8:40
today, HCP Terraform. HashiCorp announced two
8:43
security features that I think fit together nicely,
8:46
even though they came as separate updates. First,
8:48
HCP Terraform now supports IP allow lists at
8:52
the organization and agent level. HashiCorp says
8:55
that means tokens are only accepted from trusted,
8:58
predefined IP addresses, and that it closes a
9:01
pretty obvious gap where valid credentials could
9:04
previously be used from anywhere by default.
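Conceptually, an IP allow list adds a second check next to token validity: the request's source address has to fall inside one of the approved ranges. The sketch below shows that check using Python's standard `ipaddress` module. It is not how HCP Terraform implements the feature. The function name is hypothetical, and the CIDR ranges used with it should be treated as documentation-only example addresses.

```python
import ipaddress


def token_request_allowed(source_ip: str, allow_list: list[str]) -> bool:
    """Return True if source_ip falls inside any CIDR range in allow_list."""
    addr = ipaddress.ip_address(source_ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in allow_list)
```

With an allow list such as `["203.0.113.0/24", "198.51.100.7/32"]`, standing in for a NAT gateway range and a single trusted egress address, a request from `203.0.113.42` passes and one from `192.0.2.10` is rejected even if the token itself is valid.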
9:08
HashiCorp also says agent pool allow lists can
9:11
be scoped separately, which lets teams align
9:14
access with real egress points like NAT gateways
9:18
or trusted VPC egress. Second, AWS permission
9:21
delegation is now generally available in HCP
9:24
Terraform. HashiCorp says this uses AWS temporary
9:28
permission delegation, so customers can grant
9:31
trusted partners short-lived, customer-approved
9:34
access for setup and onboarding tasks instead
9:37
of handing out long-lived permissions. This
9:40
is the kind of tightening I like. Nothing magical,
9:42
nothing flashy, just narrower access and less
9:46
standing privilege. And that is usually where
9:48
a lot of the real security value lives anyway,
9:51
not in adding another dashboard, in making the
9:54
default trust path shorter and more explicit.
9:57
So the practical takeaway here is if your Terraform
10:00
environment still feels a little too open by
10:02
default, HashiCorp is giving teams some better
10:04
primitives now. Network-based restrictions and
10:08
shorter-lived delegated access are both good
10:10
moves. A few quick ones before we wrap. GitHub's
10:21
secret scanning kept moving in March. GitHub
10:23
says it added 28 new detectors from 15 providers,
10:27
enabled push protection by default for 39 detectors,
10:30
and added more validity checks. Then GitHub added
10:33
secret scanning through the GitHub MCP server.
10:36
So AI coding agents can scan changes for exposed
10:39
secrets before you commit or open a pull request.
10:42
GitHub Codespaces is now generally available
10:44
for GitHub Enterprise Cloud with data residency.
10:48
GitHub says it supports Australia, the EU, Japan,
10:52
and the US, but requires enterprise- or organization-owned
10:55
Codespaces. User-owned Codespaces are
10:59
not supported in that setup. And Kubernetes version
11:02
1.36 is closing off a couple of old footguns.
11:05
The project says .spec.externalIPs is deprecated
11:10
because of long-standing security risks. And
11:12
the old gitRepo volume driver is now permanently
11:16
disabled because it could allow code execution
11:19
as root on the node. I think the cleanest closer
11:29
for this one is pretty simple. Failure is loud.
11:32
Prevention is quiet. SRE Weekly put it that way
11:35
this week, and I think it lands because it's
11:37
true. Enterprises do not usually underinvest
11:40
in reliability because they hate reliability.
11:43
They underinvest because outages scream and prevention
11:46
whispers. Budgeting systems respond to noise.
11:49
Prevention work often looks boring right up until
11:52
the day it would have saved everybody a lot of
11:55
pain. And that is basically the whole episode.
11:58
Pinned actions are prevention. Safer config rollout
12:01
is prevention. Graceful restarts are prevention.
12:04
Managed daemons are prevention. And narrower
12:07
Terraform access is prevention. None of that
12:10
usually gets celebrated the way incident response
12:12
does. But if you've ever been the person on the
12:15
hook at 2am, you already know which work matters
12:18
more. And I think that is the human side of ops
12:21
that sometimes gets lost. A lot of the best work
12:24
we do is the stuff that nobody notices because
12:27
nothing happened. No page. No rollback scramble.
12:30
No emergency patch. No weird config blast radius.
12:34
No restart turning into an outage. No token working
12:37
from somewhere it never should have. That work
12:39
counts. Even when it is quiet. Especially when
12:43
it is quiet. Alright, that's it for this week
12:45
of Ship It Weekly. Quick recap. We talked about
12:48
GitHub Actions hardening and why CI is part of
12:51
the trust boundary now. Airbnb showing what safe
12:55
config rollout actually looks like. Cloudflare
12:58
open sourcing the restart plumbing behind zero
13:01
downtime Rust upgrades. Amazon ECS Managed Daemons.
13:05
And HCP Terraform tightening access with IP allow
13:09
lists and temporary AWS permission delegation.
13:12
Links and show notes are on shipitweekly.fm.
13:15
You can also find the video versions on YouTube.
13:18
And if you want more signal before the episode,
13:20
check out oncallbrief.com. If this episode was
13:23
useful, follow or subscribe wherever you listen.
13:26
And send it to the person on your team who keeps
13:29
getting asked to move faster while quietly doing
13:32
all of the work that keeps the background layers
13:34
from becoming the next incident. I'm Brian, and
13:37
I'll see you next week.