💬 Host Commentary

For Episode 8, I wanted to stay on the “platform reality” side of the internet: scaling constraints, governance you can’t dodge, and the blast radius of “easy buttons.”

First story is Cloudflare’s internal maintenance scheduler on Workers. I like this write-up because it’s not “serverless is magic,” it’s the stuff you actually hit when you build internal platform tooling: memory limits, request fanout limits, and the classic mistake of pulling giant datasets into a runtime just so you can throw most of it away. The part worth stealing is their shift from “load everything and compute” to “query the dependency neighborhood that matters,” with caching and deduping to keep request counts sane. Also, the Parquet angle is underrated: historical analysis tends to rot into slow object storage thrash unless you intentionally design for it.
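The "query the neighborhood" shift can be sketched roughly like this. All names and the stub graph here are made up for illustration (the post's actual data model and APIs will differ): instead of loading the full dependency dataset into the runtime, walk only the subgraph reachable from the node you care about, and memoize fetches so shared dependencies don't multiply request counts across queries.

```python
import functools

# Stand-in for a per-node dependency lookup; in a real system this
# would be a service call, which is exactly why you cache and dedupe it.
GRAPH = {
    "edge-proxy": ["config-svc", "tls-svc"],
    "config-svc": ["kv-store"],
    "tls-svc": ["kv-store"],
    "kv-store": [],
}

@functools.lru_cache(maxsize=None)  # across queries, each node is fetched once
def fetch_deps(node: str) -> tuple[str, ...]:
    return tuple(GRAPH[node])

def neighborhood(root: str) -> set[str]:
    """Walk only the dependency subgraph that matters, not the whole dataset."""
    seen, stack = set(), [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(fetch_deps(node))
    return seen

print(sorted(neighborhood("edge-proxy")))
# ['config-svc', 'edge-proxy', 'kv-store', 'tls-svc']
```

A second query like `neighborhood("tls-svc")` hits the cache instead of re-fetching, which is the "keep request counts sane" half of the trick.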

Cloudflare post:
https://blog.cloudflare.com/building-our-maintenance-scheduler-on-workers/

Second story is AWS databases showing up inside the Vercel Marketplace. This is a quiet shift with loud consequences. The dev experience is great: click-button a real AWS database from the same place you deploy your app. The platform team experience is… now your app platform is also provisioning cloud resources, which means you need a governance story that meets developers where they are.

A few extra things I didn’t go deep on in the episode, but you should think about if you run platform/cloud governance:

Account sprawl: if this creates new AWS accounts (especially outside your AWS Organization), you’ll end up with “unknown unknowns” fast.

Cost ownership: make sure there’s an enforced tagging/cost allocation baseline, budgets, and alarms. Otherwise this becomes the new shadow IT.

Networking posture: are these DBs public by default? Private? Do you want to mandate VPC-only connectivity for prod? What’s the migration path when a “quick dev DB” becomes a real production dependency?

IAM + audit trail: who’s allowed to provision? Who can delete? Do you have CloudTrail/logging/detection baselines in place for these resources?

Data residency/regions: easy UIs tend to hide region decisions. That matters for latency and compliance.
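The "enforced tagging/cost allocation baseline" point can be made concrete with a small sketch. The tag keys and resource records are assumptions for illustration; in practice you'd pull the inventory from something like the Resource Groups Tagging API or AWS Config rather than a hardcoded list.

```python
REQUIRED_TAGS = {"owner", "cost-center", "environment"}  # assumed baseline

# Stand-in for a real resource inventory pull
resources = [
    {"arn": "arn:aws:rds:us-east-1:111122223333:db:app-prod",
     "tags": {"owner": "web-team", "cost-center": "cc-42", "environment": "prod"}},
    {"arn": "arn:aws:rds:us-east-1:111122223333:db:quick-dev-db",
     "tags": {"owner": "alice"}},  # the "quick dev DB" that becomes shadow IT
]

def untagged(resources):
    """Yield (arn, missing_tags) for every resource below the baseline."""
    for r in resources:
        missing = REQUIRED_TAGS - r["tags"].keys()
        if missing:
            yield r["arn"], sorted(missing)

for arn, missing in untagged(resources):
    print(f"{arn} missing: {missing}")
```

Run this on a schedule and page the owning team, and "who pays for this?" stops being a quarterly archaeology project.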

AWS announcement + Vercel changelog:
https://aws.amazon.com/about-aws/whats-new/2025/12/aws-databases-are-available-on-the-vercel/
https://vercel.com/changelog/aws-databases-now-available-on-the-vercel-marketplace

Third story is TEAM (Temporary Elevated Access Management) for IAM Identity Center. This is one of those “everyone says least privilege” areas where teams usually fail in practice, because the workflow is painful. TEAM is basically a reference implementation for what you actually want: requests for elevated access scoped to a specific time window, an approval workflow, auto-expiry, and auditing. Auto-expiry is the difference between “least privilege” and “permanent privilege creep.”

A couple extra thoughts here:

Break-glass vs daily elevation: break-glass should be rare, loud, and scary. Daily elevation should be controlled and boring. Don’t mix those.

Approval speed matters: if approvals are slow, engineers route around it. The process has to be fast enough that people keep using it.

Make the default roles boring: the whole point is that you don’t sit in admin all day. If everyone already has broad power, JIT becomes theater.
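The core idea of time-bound elevation is small enough to sketch. This is an illustrative model, not TEAM's actual API or data shapes: a grant is only valid inside its approved window, so expiry is just the clock running out rather than a revocation step someone has to remember.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class Grant:
    """Minimal model of a TEAM-style time-bound elevation (names illustrative)."""
    principal: str
    permission_set: str
    approved_at: datetime
    duration: timedelta

    def is_active(self, now: datetime) -> bool:
        # Auto-expiry: there is no cleanup to forget; the window simply closes.
        return self.approved_at <= now < self.approved_at + self.duration

t0 = datetime(2025, 12, 1, 9, 0, tzinfo=timezone.utc)
grant = Grant("alice", "AdministratorAccess", t0, timedelta(hours=2))

print(grant.is_active(t0 + timedelta(hours=1)))  # True: inside the window
print(grant.is_active(t0 + timedelta(hours=3)))  # False: expired on its own
```

This is why auto-expiry beats “we'll remember to remove it”: the failure mode of forgetting is no access, not standing admin.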

TEAM docs + repo (and AWS security blog post):
https://aws-samples.github.io/iam-identity-center-team/
https://github.com/aws-samples/iam-identity-center-team
https://aws.amazon.com/blogs/security/temporary-elevated-access-management-with-iam-identity-center/

Lightning round extras:

GitHub Actions workflows page performance improvements (this matters more than it sounds if you’re in Actions during incidents):
https://github.blog/changelog/2025-12-22-improved-performance-for-github-actions-workflows-page/

Lambda Managed Instances (Lambda-on-EC2-ish). The interesting bit is the “steady-state + specialized compute” positioning, plus the concurrency model shift where one execution environment can handle multiple requests. That means thread safety/shared state becomes your problem again.
Docs:
https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html
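The thread-safety point deserves a tiny demonstration. This is a generic sketch of the hazard, not Lambda-specific code: when one execution environment serves concurrent requests, module-level state that used to be effectively private per invocation is suddenly shared, and any read-modify-write on it needs synchronization.

```python
import threading

# Module-level state: harmless under one-request-per-environment,
# a shared-mutation hazard once requests run concurrently in one process.
counter = 0
lock = threading.Lock()

def handler(event, context=None):
    """Toy handler: increments a shared counter safely."""
    global counter
    with lock:  # a bare `counter += 1` is a read-modify-write, not atomic in general
        counter += 1
        return counter

threads = [threading.Thread(target=handler, args=({},)) for _ in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 100 with the lock; without it, updates can be lost
```

Same story for connection pools, caches, and anything else you lazily initialize at module scope: audit it all before flipping the concurrency switch.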

atmos issue list (for #1831 context):
https://github.com/cloudposse/atmos/issues

k8sdiagram.fun:
https://k8sdiagram.fun/

Human closer: Marc Brooker’s “What Now? Handling Errors in Large Systems.” This is the best “read it once and it changes how you think” link of the week. The big lesson is that error handling is architecture. Crashing vs retrying vs continuing only makes sense when you understand correlation, blast radius, and what “safe to continue” means in your system.
https://brooker.co.za/blog/2025/11/20/what-now.html
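One way to internalize "error handling is architecture" is to make the crash/retry/continue decision explicit. This is a hedged sketch of that decision table, not anything from Brooker's post verbatim: classify errors before reacting, cap retries, and jitter the backoff so correlated failures don't turn into synchronized retry storms.

```python
import random

RETRYABLE = {"timeout", "throttled"}  # transient, plausibly worth another attempt
FATAL = {"bad_request", "corrupt"}    # retrying can't help; fail loudly

def backoff(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full-jitter exponential backoff: decorrelates retry timing across callers."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def handle(error: str, attempt: int, max_attempts: int = 4) -> str:
    """Decide what now: crash, retry, or (deliberately not by default) continue."""
    if error in FATAL or attempt >= max_attempts:
        return "crash"  # surface it; "continue" only when provably safe
    if error in RETRYABLE:
        return "retry"
    return "crash"      # unknown errors default to loud, not silent

print(handle("throttled", attempt=1))    # retry
print(handle("bad_request", attempt=1))  # crash
```

The interesting design choice is the last line: unknown errors crash rather than continue, because "safe to continue" is a property you have to establish, not a default you get for free.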

If you want to come on the show for a conversation episode, hit the email on shipitweekly.fm. I’m looking for people doing the work for real (platform, SRE, DevEx, cloud governance, migrations, incident scars… all of it).

📝 Show Notes

This week on Ship It Weekly, Brian looks at real platform engineering in the wild.

We start with Cloudflare’s write-up on building an internal maintenance scheduler on Workers. It’s not marketing fluff. It’s “we hit memory limits, changed the model, and stopped pulling giant datasets into the runtime.”

Next up: AWS databases are now available inside the Vercel Marketplace. This is a quiet shift with loud consequences. Devs can click-button real AWS databases from the same place they deploy apps, and platform teams still own the guardrails: account sprawl, billing/tagging, audit trails, region choices, and networking posture.

Third story: TEAM (Temporary Elevated Access Management) for IAM Identity Center. Time-bound elevation with approvals, automatic expiry, and auditing. We cover how this fits alongside break-glass and why auto-expiry is the difference between least-privilege and privilege creep.

Lightning round: GitHub Actions workflow page performance improvements, Lambda Managed Instances (slightly cursed but interesting), a quick atmos tooling blip, and k8sdiagram.fun for explaining k8s to humans.

We close with Marc Brooker’s “What Now? Handling Errors in Large Systems” and the takeaway: error handling isn’t a local code decision, it’s architecture. Crashing vs retrying vs continuing only makes sense when you understand correlation and blast radius.

shipitweekly.fm has links + the contact email. Want to be a guest? Reach out. And if you’re enjoying the show, follow/subscribe and leave a quick rating or review. It helps a ton.

Links from this episode

Cloudflare
https://blog.cloudflare.com/building-our-maintenance-scheduler-on-workers/

AWS on Vercel
https://aws.amazon.com/about-aws/whats-new/2025/12/aws-databases-are-available-on-the-vercel/
https://vercel.com/changelog/aws-databases-now-available-on-the-vercel-marketplace

TEAM
https://aws-samples.github.io/iam-identity-center-team/
https://github.com/aws-samples/iam-identity-center-team

GitHub Actions
https://github.blog/changelog/2025-12-22-improved-performance-for-github-actions-workflows-page/

Lambda Managed Instances
https://docs.aws.amazon.com/lambda/latest/dg/lambda-managed-instances.html

Atmos
https://github.com/cloudposse/atmos/issues

k8sdiagram.fun
https://k8sdiagram.fun/

Marc Brooker
https://brooker.co.za/blog/2025/11/20/what-now.html