0:00
Thank you. Hey, I'm Brian Teller. I work in DevOps
0:09
and SRE and I run Teller's Tech. Ship It Weekly
0:13
is where I filter the noise and pull out what
0:15
actually matters when you're the one running
0:18
infrastructure and owning reliability. If something's
0:22
hype, I'll call it hype. If it changes how you
0:25
operate, I'll break it down in plain English.
0:28
Most weeks, this is a quick news recap. In between
0:31
those, I drop interview episodes with folks across
0:34
the DevOps world. Happy holidays. Merry Christmas,
0:38
all that good stuff. It's the day after Christmas,
0:41
so if you're listening while hiding from your
0:43
family, checking one thing real quick, I respect
0:46
it. Quick piece of housekeeping, the new site
0:49
is live at shipitweekly.fm. That's where I'm
0:53
putting links and show notes. Also, I'm looking
0:55
for interview guests. If you're building real
0:58
infra or platform stuff and you want to come
1:01
on for a chill conversation episode, hit the
1:04
email on shipitweekly.fm. If you've got war
1:07
stories, even better. And one quick ask before
1:10
we start. If the show has been useful, hit follow
1:13
or subscribe wherever you are listening. And
1:15
if you've got 10 seconds, a rating or review
1:18
really helps way more than it should. All right,
1:21
three main stories for today. First, Cloudflare
1:24
wrote up how they built an internal maintenance
1:27
scheduler on Workers. This is real platform engineering.
1:31
Memory limits, data modeling, query optimization,
1:35
caching, Parquet for historical analysis. Second,
1:39
AWS databases are now available directly in Vercel
1:43
marketplace. It's a quiet shift, but it's a big
1:46
one. Devs can click-button real AWS databases
1:50
from inside Vercel, but you still have to own
1:53
governance, billing, and the blast radius. Third,
1:56
there's an open source project from AWS called
1:59
TEAM, Temporary Elevated Access Management, and
2:03
it's built around IAM Identity Center. It's approval-based,
2:06
time-bound access. This is one of those
2:09
"everybody wants it, few implement it cleanly"
2:12
problems. Then we'll do a lightning round, and
2:14
we'll close with Marc Brooker's What Now? Handling
2:17
Errors in Large Systems. Let's get into it. Cloudflare
2:21
has a really good post on how they built an internal
2:24
maintenance brain on Workers. The core problem
2:27
is kind of obvious once you hear it. When you
2:29
run infra at their scale, you cannot rely on humans
2:33
to remember every dependency, every weird
2:36
routing rule, and every "if these two things go
2:39
down at the same time, a customer's special setup
2:42
gets wrecked" scenario. So they built a centralized
2:46
scheduler that treats maintenance like a set
2:49
of constraints, like "we must always have at least
2:53
one of these routers active" or "this customer
2:56
pins traffic through these data centers, so don't
3:00
take all of them out at once." The fun part is
3:03
how they got it working within Workers' limits.
3:06
Their first naive approach was basically load
3:09
everything into one worker. All the relationships,
3:12
all the product config, all the metrics, then
3:16
compute constraints. And even in proof of concept,
3:19
they hit out of memory errors. So they took a
3:22
step back and said, okay. Workers have limits.
3:26
We can't treat this like a giant in-memory analytics
3:29
job. We need to only load the data that matters
3:32
for the specific maintenance request. If you
3:35
get a maintenance request for your router in
3:37
Frankfurt, you probably do not need to load Australia.
3:41
You need the dependency neighborhood around that
3:43
router. That pushed them into graph modeling.
3:46
They describe constraints as objects and associations,
3:50
basically vertices and edges. Routers are objects,
3:54
pools are objects, and the dependencies are associations.
3:57
And then they built a typed association interface
4:01
so the constraint logic stays simple, but the
4:04
backing implementation can get smarter over time.
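To make that concrete, here's a tiny TypeScript sketch of the objects-and-associations idea. It's purely illustrative, not Cloudflare's code; the type names, the edge labels, and the canTakeDown constraint are all invented.

```ts
// Purely illustrative: objects are vertices (routers, pools, customers),
// associations are typed edges, and constraint logic only talks to a small
// interface so the backing data fetching can get smarter over time.

type ObjectId = string;

interface AssociationStore {
  // Fetch only the neighbors of one object for one edge type,
  // instead of loading the whole graph into memory.
  neighbors(id: ObjectId, edge: "DEPENDS_ON" | "PINNED_BY"): Promise<ObjectId[]>;
  isActive(id: ObjectId): Promise<boolean>;
}

// Example constraint: a router can go into maintenance only if at least one
// of its redundant peers stays active.
async function canTakeDown(router: ObjectId, graph: AssociationStore): Promise<boolean> {
  const peers = await graph.neighbors(router, "DEPENDS_ON");
  const states = await Promise.all(peers.map((p) => graph.isActive(p)));
  return states.some((active) => active);
}
```

The useful property is that the constraint never sees how the data is fetched, so the store behind the interface can move from naive bulk pulls to targeted, cached lookups without touching the constraint logic.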
4:07
Then they flipped their data-fetching style. Instead
4:10
of pulling down huge responses and filtering
4:13
locally, they started doing targeted requests
4:16
through the graph interface. They claim response
4:19
sizes dropped by 100 times in one spot. That's
4:23
huge. Also, it's the exact kind of win you get
4:26
when you stop shipping your entire dataset into
4:30
your app layer just to throw most of it away.
4:33
Of course, that created the next problem: sub-request
4:36
limits. They traded a few massive requests
4:39
for a ton of tiny requests. And then they started
4:43
breaching sub-request limits. So they built
4:46
a fetch pipeline with request deduping, a small
4:50
LRU cache, edge caching via caches.default,
4:54
and sane retry and backoff. After tuning, they
4:59
were seeing about a 99% cache hit rate on those
5:02
fetches. That's wild. And it's also how
5:05
you make something like this survive at scale
5:08
without turning your internal APIs into a crater.
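For flavor, here's a minimal sketch of that kind of fetch pipeline in a Cloudflare Workers-style runtime. The only real API it leans on is caches.default; the rest (names, limits, the LRU) is illustrative, not their implementation.

```ts
// Minimal sketch, assuming a Workers environment where caches.default exists.
type Json = unknown;

const inflight = new Map<string, Promise<Json>>(); // dedupe identical in-flight requests
const lru = new Map<string, Json>();                // tiny in-memory cache
const LRU_MAX = 256;

async function cachedFetch(url: string, retries = 3): Promise<Json> {
  // 1. In-memory LRU hit
  if (lru.has(url)) {
    const hit = lru.get(url)!;
    lru.delete(url);
    lru.set(url, hit); // refresh recency
    return hit;
  }
  // 2. Dedupe concurrent requests for the same URL
  const pending = inflight.get(url);
  if (pending) return pending;

  const work = (async () => {
    // 3. Edge cache via caches.default
    const cacheKey = new Request(url);
    const edgeHit = await caches.default.match(cacheKey);
    if (edgeHit) return edgeHit.json();

    // 4. Origin fetch with retry and exponential backoff
    for (let attempt = 0; ; attempt++) {
      const res = await fetch(url);
      if (res.ok) {
        await caches.default.put(cacheKey, res.clone());
        return res.json();
      }
      if (attempt >= retries) throw new Error(`fetch failed: ${res.status}`);
      await new Promise((r) => setTimeout(r, 2 ** attempt * 100));
    }
  })();

  inflight.set(url, work);
  try {
    const value = await work;
    lru.set(url, value);
    if (lru.size > LRU_MAX) lru.delete(lru.keys().next().value!); // evict least recently used
    return value;
  } finally {
    inflight.delete(url);
  }
}
```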
5:12
Then there's a super relatable metric story.
5:14
They use Thanos for Prometheus queries, and they
5:18
call out what a lot of teams do by accident.
5:21
Ask for everything, get megabytes back, parse
5:25
JSON in a single-threaded runtime, then filter
5:29
most of it out. That's basically self-inflicted
5:32
pain. Instead, they use the graph to find the
5:35
specific relationships first, then issue much
5:39
more targeted Thanos queries. They say average
5:41
response size went from multiple megabytes to
5:44
about one kilobyte in one case. So the theme of
5:48
this story is: stop dragging huge blobs of data
5:51
into your application just so you can toss 99% of it.
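If you want a picture of what a targeted Thanos query means in practice, here's a rough sketch. The /api/v1/query endpoint is the standard Prometheus query API that Thanos exposes; the metric and label names are made up.

```ts
// Illustrative only: scope the PromQL to one device up front, instead of
// fetching every series and filtering in the app layer.
async function routerUtilization(thanosUrl: string, router: string): Promise<number | null> {
  const promql = `avg(device_utilization{router="${router}"})`;
  const res = await fetch(`${thanosUrl}/api/v1/query?query=${encodeURIComponent(promql)}`);
  if (!res.ok) return null;

  // Standard Prometheus instant-query shape: result[].value = [timestamp, "value"]
  const body = (await res.json()) as {
    data: { result: Array<{ value: [number, string] }> };
  };
  const sample = body.data.result[0];
  return sample ? Number(sample.value[1]) : null;
}
```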
5:54
Then they bring it home with historical
5:57
analysis. Real time is one thing; historical is
6:01
different, because now you're scanning months
6:04
of data to see if your logic is actually safe
6:07
and accurate. They talk about how Prometheus
6:10
TSDB blocks are not really designed for object
6:14
storage access patterns and how that turns into
6:16
a lot of random reads. So they adopt a Parquet
6:19
conversion layer for historical data. Columnar
6:23
format, better stats, and you can fetch what
6:26
you need without slamming object storage with
6:28
random IO. Takeaways you can steal even if you're
6:31
not Cloudflare. If you're building platform brains
6:34
like schedulers, deploy orchestrators, policy
6:38
evaluators, you will hit limits. Memory, CPU,
6:42
API quotas, request fanout. You win by changing
6:47
the shape of the problem, not brute forcing it.
6:50
Graph interfaces are a cheat code for dependency
6:53
-heavy domains. And targeted queries plus caching
6:57
plus backoff is still undefeated. All right,
7:00
let's move from "Cloudflare did adult engineering"
7:03
to "developers can click-button databases." AWS
7:07
announced that AWS databases are now available
7:10
on the Vercel marketplace. The headline is simple.
7:14
From Vercel, you can provision and connect to
7:17
Aurora Postgres, Aurora DSQL, and DynamoDB in
7:22
seconds. And here's the part platform folks should
7:25
not ignore. The onboarding path is basically
7:28
create a new AWS account from Vercel with some
7:32
starter credits. So the dev experience is you're
7:35
already in Vercel. You click a thing, and now
7:38
you have a database, and the app is wired up.
7:41
That is great for velocity. It is also the kind
7:44
of thing that bypasses governance if you don't
7:47
get in front of it. Even if the AWS side is legit,
7:50
you still have to deal with who owns that AWS
7:53
account long term. Is it inside your AWS organization
7:57
or is it an orphan account that exists because
8:00
Vercel made it easy? How do SCPs apply? How do
8:05
guardrails apply? How do you do tagging and cost
8:07
allocation so finance doesn't show up later asking
8:11
why there are mystery accounts? What does networking
8:14
look like? Are public endpoints acceptable? Do
8:17
you need private connectivity? Do you have a
8:20
VPC strategy that fits Vercel -first teams? What's
8:24
your audit baseline? CloudTrail? Config? Detective
8:28
controls? All the boring stuff. Also, region
8:32
selection matters for data residency and latency.
8:36
It's not just pick whatever. Vercel also hinted
8:39
this is evolving. They're talking about coming
8:42
soon support for provisioning into an existing
8:45
AWS account, not just a new one. If that lands
8:48
cleanly, that's the version I'd actually want
8:51
as a platform team because now you can meet developers
8:54
where they are without losing governance. So
8:57
the takeaway, if your dev platform can create
9:00
AWS resources, your governance has to meet it
9:03
there. The database is easy. The ownership model
9:06
is the hard part. All right. Story 3 is about
9:10
access, which matters even more once you have
9:13
more accounts and more surfaces. TEAM stands
9:15
for Temporary Elevated Access Management. It's
9:19
an open-source solution built around AWS IAM
9:22
Identity Center. The pitch is basically approval
9:26
-based, time-bound elevated access to AWS accounts.
9:30
Users request elevated access for a specific
9:32
period of time, with a reason. Approvers approve
9:36
or deny it. If approved, access is granted. When
9:39
time expires, it is automatically removed. That
9:42
automatic removal is the whole point. Because
9:45
most orgs fail here. Someone gets admin just
9:48
for this incident, and then it stays for months.
9:52
Privilege creep becomes the default. TEAM also
9:55
leans into auditing and visibility. Who requested
9:58
what, who approved it, when it expired, plus
10:01
session logging. How I'd frame it: this is not
10:05
your break-glass story. Break glass is "the world
10:08
is on fire and we need access right now." It should
10:11
be rare, noisy, and heavily monitored. TEAM is
10:16
the daily "I need admin for 45 minutes to do
10:19
this legit change" workflow. If you want this
10:21
to actually work culturally, approvals need to
10:24
be fast enough that people don't route around
10:27
it. And the default permission sets need to be
10:30
sane so elevation is actually meaningful, not
10:33
just ceremony. If you are already on IAM Identity
10:37
Center and you've been hand-waving "we should
10:39
do JIT access," TEAM is at least worth a look.
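If you're curious about the plumbing underneath, here's a hedged sketch of the IAM Identity Center calls that make time-bound elevation possible. TEAM itself layers approvals, a UI, scheduling, and auditing on top of this; the ARNs and IDs below are placeholders.

```ts
// Sketch only: grant a permission set on approval, remove the same assignment
// when the approved window expires. Not TEAM's actual code.
import {
  SSOAdminClient,
  CreateAccountAssignmentCommand,
  DeleteAccountAssignmentCommand,
} from "@aws-sdk/client-sso-admin";

const sso = new SSOAdminClient({});

const assignment = {
  InstanceArn: "arn:aws:sso:::instance/ssoins-EXAMPLE",                       // placeholder
  TargetId: "111122223333",                                                   // account being elevated into
  TargetType: "AWS_ACCOUNT" as const,
  PermissionSetArn: "arn:aws:sso:::permissionSet/ssoins-EXAMPLE/ps-EXAMPLE",  // placeholder
  PrincipalType: "USER" as const,
  PrincipalId: "identity-center-user-id",                                     // placeholder
};

// Called when an approver says yes.
export async function grantElevatedAccess(): Promise<void> {
  await sso.send(new CreateAccountAssignmentCommand(assignment));
}

// Called when the approved time window ends (TEAM drives this automatically).
export async function revokeElevatedAccess(): Promise<void> {
  await sso.send(new DeleteAccountAssignmentCommand(assignment));
}
```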
10:43
Even if you don't adopt it, it's a good reference
10:45
for what time -bound elevation can look like
10:48
without building everything from scratch. Alright,
10:51
time for the lightning round. GitHub Actions
10:54
improved performance on the workflows page. Small
10:57
change, but if you live in Actions, you'll feel
10:59
it. Big workflows render better now, lazy loading,
11:03
and you can filter jobs by status so you can
11:06
just see failures or in-progress stuff. During
11:09
an incident, this is a real quality of life upgrade.
11:12
Next, the weird one, Lambda managed instances.
11:16
This is basically run Lambda functions on EC2,
11:19
but AWS manages the lifecycle of those instances
11:22
for you. It's meant for steady state workloads
11:25
and specialized compute needs. It also changes
11:28
concurrency and execution assumptions a bit.
11:31
So you actually need to care about thread safety
11:34
and shared state in ways you might not with regular
11:37
Lambda. Interesting, slightly cursed. But I get
11:40
why it exists. Atmos quick hit. There's a Cloud
11:43
Posse Atmos issue where vendoring a component
11:47
with an invalid URL triggers a weird GitHub username
11:50
prompt. I'm mentioning it less because of the
11:53
bug and more because Atmos is clearly turning
11:57
into a bigger workflow ecosystem now. CLI, dev
12:01
containers, IDE integrations, the whole thing.
12:04
And last, k8sdiagram.fun. It's a free Kubernetes
12:09
diagram builder that can also generate YAML for
12:13
common resources. I would not blindly apply auto-generated
12:17
YAML to prod, but for teaching, prototyping,
12:21
or explaining architecture to humans, it's actually
12:23
super handy. All right, let's close with the
12:27
human story, because this ties into everything
12:29
we just talked about. Marc Brooker wrote a post
12:32
called, What Now? Handling errors in large systems.
12:36
It's basically an interactive error handling
12:39
game. You decide whether a system should crash
12:42
or keep going when something goes wrong. And
12:45
then he explains his take. The key idea is simple.
12:48
And it's something we forget all the time. Error
12:51
handling isn't a local decision. It's a global
12:54
property of the system. We love to argue about
12:57
one line of code. Should this crash? Should this
13:00
retry? Should this be best effort? Mark's point
13:03
is, that decision only makes sense if you understand
13:07
the architecture around it. He asks questions
13:10
like, are failures correlated? If the same bad
13:13
input can hit every node, crashing can amplify
13:17
the blast radius. Can a higher layer handle the
13:20
error? Some architectures are designed to tolerate
13:23
a few crashes. None are designed to tolerate
13:26
a ton of crashes continuously. Is it actually
13:29
safe to keep crashing? Is it actually safe to
13:32
keep running? Sometimes continuing means silent
13:35
corruption, which is worse than a crash. Then
13:38
he ties it to blast radius reduction. Cell-based
13:42
architectures, independent regions, isolating
13:45
failures so you don't have a single mistake become
13:48
a global outage. That connects to today's stories
13:51
perfectly. Cloudflare's scheduler exists because
13:55
humans will guess wrong sometimes, and the systems
13:58
need to prevent correlated failure. The Vercel
14:01
Marketplace DB story is a new surface area. The
14:05
failure modes aren't just technical, they're
14:08
governance and ownership failures. And TEAM is
14:11
literally about reducing the risk of one person
14:13
with standing admin turning a mistake into a
14:17
disaster. So yeah. If you want a good mindset
14:20
going into next year, stop treating error handling
14:23
like a code style preference. Treat it like architecture.
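As a toy illustration of that mindset, here's the same failure handled two different ways depending on the architecture around it. This is not code from Marc Brooker's post, just a sketch of the crash-versus-continue decision.

```ts
// Toy example: the right answer to "should this crash?" depends on whether
// the failure is correlated and on what continuing could silently break.
class PoisonInputError extends Error {}     // the same bad input would crash every replica
class CorruptionRiskError extends Error {}  // continuing could write bad data

function doWork(raw: string): void {
  // Stand-in for real processing.
  if (raw === "") throw new PoisonInputError("empty payload");
}

function handleMessage(raw: string, quarantine: (msg: string) => void): void {
  try {
    doWork(raw);
  } catch (err) {
    if (err instanceof PoisonInputError) {
      // Correlated failure: every node sees this message, so crashing just
      // amplifies the blast radius. Set it aside and keep serving.
      quarantine(raw);
      return;
    }
    if (err instanceof CorruptionRiskError) {
      // Continuing risks silent corruption, which is worse than downtime.
      // Crash loudly and let a higher layer (supervisor, orchestrator) recover.
      throw err;
    }
    throw err; // unknown errors: default to failing fast
  }
}
```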
14:27
All right, that's it for this episode of Ship
14:30
It Weekly. We covered Cloudflare's maintenance
14:32
scheduler on Workers and the "platform limits
14:35
force better design" lessons. AWS databases inside
14:38
the Vercel marketplace and what that means for
14:41
governance and blast radius. And TEAM as a practical
14:45
path to time-bound elevated access with IAM
14:49
Identity Center. If you got something out of
14:52
this, hit follow or subscribe wherever you are
14:55
listening. And if you can, leave a quick rating
14:58
or review. It's annoying how much that helps
15:01
the show. Links and show notes are on shipitweekly.fm.
15:04
And one last reminder, I'm looking for interview
15:08
guests. If you want to come on and talk through
15:11
real DevOps or platform work you're doing, hit
15:14
the email on the site. I'm Brian. Thanks for
15:16
listening. Happy holidays. And I'll see you next
15:19
week, which is technically next year.