0:00
Thank you. Hey, I'm Brian Teller. I work in DevOps
0:09
and SRE and I run Teller's Tech. Ship It Weekly
0:13
is where I filter the noise and pull out what
0:15
actually matters when you're the one running
0:18
infrastructure and owning reliability. If something's
0:22
hype, I'll call it hype. If it changes how you
0:25
operate, I'll break it down in plain English.
0:28
Most weeks, this is a quick news recap. In between
0:31
those, I drop interview episodes with folks across
0:34
the DevOps world. Happy holidays. Merry Christmas,
0:38
all that good stuff. It's the day after Christmas,
0:41
so if you're listening while hiding from your
0:43
family, checking one thing real quick, I respect
0:46
it. Quick piece of housekeeping, the new site
0:49
is live at shipitweekly.fm. That's where I'm
0:53
putting links and show notes. Also, I'm looking
0:55
for interview guests. If you're building real
0:58
infra or platform stuff and you want to come
1:01
on for a chill conversation episode, hit the
1:04
email on shipitweekly.fm. If you've got war
1:07
stories, even better. And one quick ask before
1:10
we start. If the show has been useful, hit follow
1:13
or subscribe wherever you are listening. And
1:15
if you've got 10 seconds, a rating or review
1:18
really helps way more than it should. All right,
1:21
three main stories for today. First, Cloudflare
1:24
wrote up how they built an internal maintenance
1:27
scheduler on Workers. This is real platform engineering.
1:31
Memory limits, data modeling, query optimization,
1:35
caching, Parquet for historical analysis. Second,
1:39
AWS databases are now available directly in Vercel
1:43
marketplace. It's a quiet shift, but it's a big
1:46
one. Devs can click-button real AWS databases
1:50
from inside Vercel, but you still have to own
1:53
governance, billing, and the blast radius. Third,
1:56
there's an open source project from AWS called
1:59
TEAM, Temporary Elevated Access Management, and
2:03
it's built around IAM Identity Center. It's approval-based,
2:06
time-bound access. This is one of those
2:09
"everybody wants it, few implement it cleanly"
2:12
problems. Then we'll do a lightning round, and
2:14
we'll close with Marc Brooker's What Now? Handling
2:17
Errors in Large Systems. Let's get into it. Cloudflare
2:21
has a really good post on how they built an internal
2:24
maintenance brain on Workers. The core problem
2:27
is kind of obvious once you hear it. When you
2:29
run infra at their scale, you cannot rely on humans
2:33
to remember every dependency, every weird
2:36
routing rule, and every "if these two things go
2:39
down at the same time, a customer's special setup
2:42
gets wrecked" scenario. So they built a centralized
2:46
scheduler that treats maintenance like a set
2:49
of constraints, like "we must always have at least
2:53
one of these routers active" or "this customer
2:56
pins traffic through these data centers, so don't
3:00
take all of them out at once." The fun part is
3:03
how they got it working within Workers' limits.
3:06
Their first naive approach was basically load
3:09
everything into one worker. All the relationships,
3:12
all the product config, all the metrics, then
3:16
compute constraints. And even in proof of concept,
3:19
they hit out of memory errors. So they took a
3:22
step back and said, okay. Workers have limits.
3:26
We can't treat this like a giant in-memory analytics
3:29
job. We need to only load the data that matters
3:32
for the specific maintenance request. If you
3:35
get a maintenance request for your router in
3:37
Frankfurt, you probably do not need to load Australia.
3:41
You need the dependency neighborhood around that
3:43
router. That pushed them into graph modeling.
3:46
They describe constraints as objects and associations,
3:50
basically vertices and edges. Routers are objects,
3:54
pools are objects, and the dependencies are associations.
3:57
And then they built a typed association interface
4:01
so the constraint logic stays simple, but the
4:04
backing implementation can get smarter over time.
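To make that concrete, here's a tiny TypeScript sketch of the objects-and-associations idea. It's purely illustrative, not Cloudflare's code; the type names, the edge labels, and the canTakeDown constraint are all invented.

```ts
// Purely illustrative: objects are vertices (routers, pools, customers),
// associations are typed edges, and constraint logic only talks to a small
// interface so the backing data fetching can get smarter over time.

type ObjectId = string;

interface AssociationStore {
  // Fetch only the neighbors of one object for one edge type,
  // instead of loading the whole graph into memory.
  neighbors(id: ObjectId, edge: "DEPENDS_ON" | "PINNED_BY"): Promise<ObjectId[]>;
  isActive(id: ObjectId): Promise<boolean>;
}

// Example constraint: a router can go into maintenance only if at least one
// of its redundant peers stays active.
async function canTakeDown(router: ObjectId, graph: AssociationStore): Promise<boolean> {
  const peers = await graph.neighbors(router, "DEPENDS_ON");
  const states = await Promise.all(peers.map((p) => graph.isActive(p)));
  return states.some((active) => active);
}
```

The useful property is that the constraint never sees how the data is fetched, so the store behind the interface can move from naive bulk pulls to targeted, cached lookups without touching the constraint logic.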
4:07
Then they flipped their data-fetching style. Instead
4:10
of pulling down huge responses and filtering
4:13
locally, they started doing targeted requests
4:16
through the graph interface. They claim response
4:19
sizes dropped by 100 times in one spot. That's
4:23
huge. Also, it's the exact kind of win you get
4:26
when you stop shipping your entire dataset into
4:30
your app layer just to throw most of it away.
4:33
Of course, that created the next problem: sub-request
4:36
limits. They traded a few massive requests
4:39
for a ton of tiny requests. And then they started
4:43
breaching sub-request limits. So they built
4:46
a fetch pipeline with request deduping, a small
4:50
LRU cache, edge caching via caches.default,
4:54
and sane retry and backoff. After tuning, they
4:59
were seeing about a 99% cache hit rate on those
5:02
fetches. That's wild. And it's also how
5:05
you make something like this survive at scale
5:08
without turning your internal APIs into a crater.
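For flavor, here's a minimal sketch of that kind of fetch pipeline in a Cloudflare Workers-style runtime. The only real API it leans on is caches.default; the rest (names, limits, the LRU) is illustrative, not their implementation.

```ts
// Minimal sketch, assuming a Workers environment where caches.default exists.
type Json = unknown;

const inflight = new Map<string, Promise<Json>>(); // dedupe identical in-flight requests
const lru = new Map<string, Json>();                // tiny in-memory cache
const LRU_MAX = 256;

async function cachedFetch(url: string, retries = 3): Promise<Json> {
  // 1. In-memory LRU hit
  if (lru.has(url)) {
    const hit = lru.get(url)!;
    lru.delete(url);
    lru.set(url, hit); // refresh recency
    return hit;
  }
  // 2. Dedupe concurrent requests for the same URL
  const pending = inflight.get(url);
  if (pending) return pending;

  const work = (async () => {
    // 3. Edge cache via caches.default
    const cacheKey = new Request(url);
    const edgeHit = await caches.default.match(cacheKey);
    if (edgeHit) return edgeHit.json();

    // 4. Origin fetch with retry and exponential backoff
    for (let attempt = 0; ; attempt++) {
      const res = await fetch(url);
      if (res.ok) {
        await caches.default.put(cacheKey, res.clone());
        return res.json();
      }
      if (attempt >= retries) throw new Error(`fetch failed: ${res.status}`);
      await new Promise((r) => setTimeout(r, 2 ** attempt * 100));
    }
  })();

  inflight.set(url, work);
  try {
    const value = await work;
    lru.set(url, value);
    if (lru.size > LRU_MAX) lru.delete(lru.keys().next().value!); // evict least recently used
    return value;
  } finally {
    inflight.delete(url);
  }
}
```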
5:12
Then there's a super relatable metric story.
5:14
They use Thanos for Prometheus queries, and they
5:18
call out what a lot of teams do by accident.
5:21
Ask for everything, get megabytes back, parse
5:25
JSON in a single-threaded runtime, then filter
5:29
most of it out. That's basically self-inflicted
5:32
pain. Instead, they use the graph to find the
5:35
specific relationships first, then issue much
5:39
more targeted Thanos queries. They say average
5:41
response size went from multiple megabytes to
5:44
about one kilobyte in one case. So the theme of
5:48
this story is: stop dragging huge blobs of data
5:51
into your application just so you can toss 99% of it.
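If you want a picture of what a targeted Thanos query means in practice, here's a rough sketch. The /api/v1/query endpoint is the standard Prometheus query API that Thanos exposes; the metric and label names are made up.

```ts
// Illustrative only: scope the PromQL to one device up front, instead of
// fetching every series and filtering in the app layer.
async function routerUtilization(thanosUrl: string, router: string): Promise<number | null> {
  const promql = `avg(device_utilization{router="${router}"})`;
  const res = await fetch(`${thanosUrl}/api/v1/query?query=${encodeURIComponent(promql)}`);
  if (!res.ok) return null;

  // Standard Prometheus instant-query shape: result[].value = [timestamp, "value"]
  const body = (await res.json()) as {
    data: { result: Array<{ value: [number, string] }> };
  };
  const sample = body.data.result[0];
  return sample ? Number(sample.value[1]) : null;
}
```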
5:54
Then they bring it home with historical
5:57
analysis. Real time is one thing; historical is
6:01
different, because now you're scanning months
6:04
of data to see if your logic is actually safe
6:07
and accurate. They talk about how Prometheus
6:10
TSDB blocks are not really designed for object
6:14
storage access patterns and how that turns into
6:16
a lot of random reads. So they adopt a Parquet
6:19
conversion layer for historical data. Columnar
6:23
format, better stats, and you can fetch what
6:26
you need without slamming object storage with
6:28
random IO. Takeaways you can steal even if you're
6:31
not Cloudflare. If you're building platform brains
6:34
like schedulers, deploy orchestrators, policy
6:38
evaluators, you will hit limits. Memory, CPU,
6:42
API quotas, request fanout. You win by changing
6:47
the shape of the problem, not brute forcing it.
6:50
Graph interfaces are a cheat code for dependency
6:53
-heavy domains. And targeted queries plus caching
6:57
plus backoff is still undefeated. All right,
7:00
let's move from "Cloudflare did adult engineering"
7:03
to "developers can click-button databases." AWS
7:07
announced that AWS databases are now available
7:10
on the Vercel marketplace. The headline is simple.
7:14
From Vercel, you can provision and connect to
7:17
Aurora Postgres, Aurora DSQL, and DynamoDB in
7:22
seconds. And here's the part platform folks should
7:25
not ignore. The onboarding path is basically
7:28
create a new AWS account from Vercel with some
7:32
starter credits. So the dev experience is you're
7:35
already in Vercel. You click a thing, and now
7:38
you have a database, and the app is wired up.
7:41
That is great for velocity. It is also the kind
7:44
of thing that bypasses governance if you don't
7:47
get in front of it. Even if the AWS side is legit,
7:50
you still have to deal with who owns that AWS
7:53
account long term. Is it inside your AWS organization
7:57
or is it an orphan account that exists because
8:00
Vercel made it easy? How do SCPs apply? How do
8:05
guardrails apply? How do you do tagging and cost
8:07
allocation so finance doesn't show up later asking
8:11
why there are mystery accounts? What does networking
8:14
look like? Are public endpoints acceptable? Do
8:17
you need private connectivity? Do you have a
8:20
VPC strategy that fits Vercel -first teams? What's
8:24
your audit baseline? CloudTrail? Config? Detective
8:28
controls? All the boring stuff. Also, region
8:32
selection matters for data residency and latency.
8:36
It's not just pick whatever. Vercel also hinted
8:39
this is evolving. They're talking about coming
8:42
soon support for provisioning into an existing
8:45
AWS account, not just a new one. If that lands
8:48
cleanly, that's the version I'd actually want
8:51
as a platform team because now you can meet developers
8:54
where they are without losing governance. So
8:57
the takeaway, if your dev platform can create
9:00
AWS resources, your governance has to meet it
9:03
there. The database is easy. The ownership model
9:06
is the hard part. All right. Story 3 is about
9:10
access, which matters even more once you have
9:13
more accounts and more surfaces. TEAM stands
9:15
for Temporary Elevated Access Management. It's
9:19
an open-source solution built around AWS IAM
9:22
Identity Center. The pitch is basically approval
9:26
-based, time-bound elevated access to AWS accounts.
9:30
Users request elevated access for a specific
9:32
period of time, with a reason. Approvers approve
9:36
or deny it. If approved, access is granted. When
9:39
time expires, it is automatically removed. That
9:42
automatic removal is the whole point. Because
9:45
most orgs fail here. Someone gets admin just
9:48
for this incident, and then it stays for months.
9:52
Privilege creep becomes the default. TEAM also
9:55
leans into auditing and visibility. Who requested
9:58
what, who approved it, when it expired, plus
10:01
session logging. How I'd frame it: this is not
10:05
your break-glass story. Break glass is "the world
10:08
is on fire and we need access right now." It should
10:11
be rare, noisy, and heavily monitored. TEAM is
10:16
the daily "I need admin for 45 minutes to do
10:19
this legit change" workflow. If you want this
10:21
to actually work culturally, approvals need to
10:24
be fast enough that people don't route around
10:27
it. And the default permission sets need to be
10:30
sane so elevation is actually meaningful, not
10:33
just ceremony. If you are already on IAM Identity
10:37
Center and you've been hand-waving "we should
10:39
do JIT access," TEAM is at least worth a look.
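If you're curious about the plumbing underneath, here's a hedged sketch of the IAM Identity Center calls that make time-bound elevation possible. TEAM itself layers approvals, a UI, scheduling, and auditing on top of this; the ARNs and IDs below are placeholders.

```ts
// Sketch only: grant a permission set on approval, remove the same assignment
// when the approved window expires. Not TEAM's actual code.
import {
  SSOAdminClient,
  CreateAccountAssignmentCommand,
  DeleteAccountAssignmentCommand,
} from "@aws-sdk/client-sso-admin";

const sso = new SSOAdminClient({});

const assignment = {
  InstanceArn: "arn:aws:sso:::instance/ssoins-EXAMPLE",                       // placeholder
  TargetId: "111122223333",                                                   // account being elevated into
  TargetType: "AWS_ACCOUNT" as const,
  PermissionSetArn: "arn:aws:sso:::permissionSet/ssoins-EXAMPLE/ps-EXAMPLE",  // placeholder
  PrincipalType: "USER" as const,
  PrincipalId: "identity-center-user-id",                                     // placeholder
};

// Called when an approver says yes.
export async function grantElevatedAccess(): Promise<void> {
  await sso.send(new CreateAccountAssignmentCommand(assignment));
}

// Called when the approved time window ends (TEAM drives this automatically).
export async function revokeElevatedAccess(): Promise<void> {
  await sso.send(new DeleteAccountAssignmentCommand(assignment));
}
```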
10:43
Even if you don't adopt it, it's a good reference
10:45
for what time -bound elevation can look like
10:48
without building everything from scratch. Alright,
10:51
time for the lightning round. GitHub Actions
10:54
improved performance on the workflows page. Small
10:57
change, but if you live in Actions, you'll feel
10:59
it. Big workflows render better now, lazy loading,
11:03
and you can filter jobs by status so you can
11:06
just see failures or in-progress stuff. During
11:09
an incident, this is a real quality of life upgrade.
11:12
Next, the weird one, Lambda managed instances.
11:16
This is basically run Lambda functions on EC2,
11:19
but AWS manages the lifecycle of those instances
11:22
for you. It's meant for steady state workloads
11:25
and specialized compute needs. It also changes
11:28
concurrency and execution assumptions a bit.
11:31
So you actually need to care about thread safety
11:34
and shared state in ways you might not with regular
11:37
Lambda. Interesting, slightly cursed. But I get
11:40
why it exists. Atmos quick hit. There's a Cloud
11:43
Posse Atmos issue where vendoring a component
11:47
with an invalid URL triggers a weird GitHub username
11:50
prompt. I'm mentioning it less because of the
11:53
bug and more because Atmos is clearly turning
11:57
into a bigger workflow ecosystem now. CLI, dev
12:01
containers, IDE integrations, the whole thing.
12:04
And last, k8sdiagram.fun. It's a free Kubernetes
12:09
diagram builder that can also generate YAML for
12:13
common resources. I would not blindly apply auto-generated
12:17
YAML to prod, but for teaching, prototyping,
12:21
or explaining architecture to humans, it's actually
12:23
super handy. All right, let's close with the
12:27
human story, because this ties into everything
12:29
we just talked about. Marc Brooker wrote a post
12:32
called, What Now? Handling errors in large systems.
12:36
It's basically an interactive error handling
12:39
game. You decide whether a system should crash
12:42
or keep going when something goes wrong. And
12:45
then he explains his take. The key idea is simple.
12:48
And it's something we forget all the time. Error
12:51
handling isn't a local decision. It's a global
12:54
property of the system. We love to argue about
12:57
one line of code. Should this crash? Should this
13:00
retry? Should this be best effort? Mark's point
13:03
is, that decision only makes sense if you understand
13:07
the architecture around it. He asks questions
13:10
like, are failures correlated? If the same bad
13:13
input can hit every node, crashing can amplify
13:17
the blast radius. Can a higher layer handle the
13:20
error? Some architectures are designed to tolerate
13:23
a few crashes. None are designed to tolerate
13:26
a ton of crashes continuously. Is it actually
13:29
safe to keep crashing? Is it actually safe to
13:32
keep running? Sometimes continuing means silent
13:35
corruption, which is worse than a crash. Then
13:38
he ties it to blast radius reduction. Cell-based
13:42
architectures, independent regions, isolating
13:45
failures so you don't have a single mistake become
13:48
a global outage. That connects to today's stories
13:51
perfectly. Cloudflare's scheduler exists because
13:55
humans will guess wrong sometimes, and the systems
13:58
need to prevent correlated failure. The Vercel
14:01
Marketplace DB story is a new surface area. The
14:05
failure modes aren't just technical, they're
14:08
governance and ownership failures. And TEAM is
14:11
literally about reducing the risk of one person
14:13
with standing admin turning a mistake into a
14:17
disaster. So yeah. If you want a good mindset
14:20
going into next year, stop treating error handling
14:23
like a code style preference. Treat it like architecture.
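As a toy illustration of that mindset, here's the same failure handled two different ways depending on the architecture around it. This is not code from Marc Brooker's post, just a sketch of the crash-versus-continue decision.

```ts
// Toy example: the right answer to "should this crash?" depends on whether
// the failure is correlated and on what continuing could silently break.
class PoisonInputError extends Error {}     // the same bad input would crash every replica
class CorruptionRiskError extends Error {}  // continuing could write bad data

function doWork(raw: string): void {
  // Stand-in for real processing.
  if (raw === "") throw new PoisonInputError("empty payload");
}

function handleMessage(raw: string, quarantine: (msg: string) => void): void {
  try {
    doWork(raw);
  } catch (err) {
    if (err instanceof PoisonInputError) {
      // Correlated failure: every node sees this message, so crashing just
      // amplifies the blast radius. Set it aside and keep serving.
      quarantine(raw);
      return;
    }
    if (err instanceof CorruptionRiskError) {
      // Continuing risks silent corruption, which is worse than downtime.
      // Crash loudly and let a higher layer (supervisor, orchestrator) recover.
      throw err;
    }
    throw err; // unknown errors: default to failing fast
  }
}
```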
14:27
All right, that's it for this episode of Ship
14:30
It Weekly. We covered Cloudflare's maintenance
14:32
scheduler on Workers and the "platform limits
14:35
force better design" lessons. AWS databases inside
14:38
the Vercel marketplace and what that means for
14:41
governance and blast radius. And TEAM as a practical
14:45
path to time-bound elevated access with IAM
14:49
Identity Center. If you got something out of
14:52
this, hit follow or subscribe wherever you are
14:55
listening. And if you can, leave a quick rating
14:58
or review. It's annoying how much that helps
15:01
the show. Links and show notes are on shipitweekly.fm.
15:04
And one last reminder, I'm looking for interview
15:08
guests. If you want to come on and talk through
15:11
real DevOps or platform work you're doing, hit
15:14
the email on the site. I'm Brian. Thanks for
15:16
listening. Happy holidays. And I'll see you next
15:19
week, which is technically next year.