Host Commentary

This episode is about the hidden glue holding production together.

That phrase kept coming up while I was looking at this week’s stories. Not because the stories are all the same, but because the failure pattern underneath them feels familiar.

The thing that breaks production is not always the big obvious service.

Sometimes it is the recovery path.

The support workflow.

The generated file.

The tenant boundary.

The async event path.

The admin console.

The logs.

The thing between two systems that quietly decides what is allowed to happen next.

Coinbase’s May 7 outage is the cleanest SRE example. The starting point was physical: multiple chiller failures in a single AWS us-east-1 data hall, which led to a thermal safety shutdown affecting EC2 and EBS resources in one availability zone.

That part is important, but it is not the whole lesson.

Most teams already know an availability zone can fail. We all have the diagram. Three boxes, three AZs, clean arrows, nice labels, everyone nods.

The real question is whether the system can actually recover when one of those boxes disappears.

Coinbase’s postmortem is useful because it shows how fast the conversation moves from “an AZ failed” to “what was actually coupled to that AZ?” Low-latency systems, stateful dependencies, managed Kafka recovery, ordering, offsets, recovery paths, and failover assumptions all start to matter at once.

That is the part worth taking back to your own environment.

Do not let “multi-AZ” become a comfort phrase.

Find the parts of the system that are still effectively tied to one zone, one stateful dependency, or one recovery path. Especially the low-latency pieces, Kafka pieces, databases, matching engines, and services with ordering or money-adjacent semantics.

Then ask the annoying questions.

What happens if that AZ disappears?

What has to recover in place?

What can actually fail over?

How long does recovery take?

And when was the last time anyone proved it?

That is where resilience becomes real.
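
If you want to turn those questions into something you can run against an inventory, here is a minimal sketch. Everything in it is an assumption for illustration: the `Component` fields, the zone names, the 90-day threshold, and the sample inventory are made up, and none of it comes from Coinbase's postmortem.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    name: str
    zones: set                               # zones where the component currently runs
    stateful: bool                           # holds state that must recover or fail over
    last_failover_test_days: Optional[int]   # None means nobody has ever tested it


def zone_risks(components, failed_zone, max_test_age_days=90):
    """Return (name, reason) pairs worth asking about if failed_zone disappears."""
    findings = []
    for c in components:
        if failed_zone not in c.zones:
            continue  # not touched by this failure
        survivors = c.zones - {failed_zone}
        if not survivors:
            findings.append((c.name, "runs only in the failed zone"))
        elif c.stateful and c.last_failover_test_days is None:
            findings.append((c.name, "stateful, failover never tested"))
        elif c.stateful and c.last_failover_test_days > max_test_age_days:
            findings.append((c.name, "stateful, failover test is stale"))
    return findings


# Hypothetical inventory, for illustration only.
inventory = [
    Component("api", {"az-a", "az-b", "az-c"}, stateful=False, last_failover_test_days=None),
    Component("matching-engine", {"az-a"}, stateful=True, last_failover_test_days=None),
    Component("kafka", {"az-a", "az-b", "az-c"}, stateful=True, last_failover_test_days=400),
]

for name, reason in zone_risks(inventory, "az-a"):
    print(f"{name}: {reason}")
```

The script is not the point. The point is that "multi-AZ" gets checked per component, and that "when did anyone last prove it" is a field somebody has to fill in.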

The Meta AI support story is a different kind of glue, but the same theme. On the surface, it sounds like a consumer social media incident. Instagram accounts, password resets, account recovery.

But the platform angle is much more interesting.

AI support automation became part of an identity recovery control plane.

That matters because account recovery is privileged infrastructure. It can reset passwords, change emails, bypass parts of the normal login path, and help users regain access when the normal flow fails.

That means it has power.

So when AI is added to that workflow, the question is not just whether the answer is helpful. The question is what the workflow can actually do.

Can it only summarize?

Can it route tickets?

Can it recommend next steps?

Can it trigger actions?

Can it change trust?

Can it move an account recovery flow forward?

Those are very different risk profiles.

The takeaway is not “AI support is bad.” The takeaway is that support workflows can be production control planes, especially when they touch identity.

Password resets, MFA resets, email changes, support impersonation, admin ownership changes, and account recovery flows are not just customer service features. They are identity control paths.

If AI can influence them, they need real authorization logic, verification, rate limits, abuse detection, audit logs, escalation paths, and blast radius controls.
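
One way to picture that is as capability tiers with a gate in front of the privileged ones. This is a rough sketch under my own assumptions: the tier names mirror the questions above, while the `authorize` function, the policy constant, and the audit record shape are hypothetical and not how Meta or any vendor implements it.

```python
from enum import IntEnum


class Capability(IntEnum):
    SUMMARIZE = 1   # read-only
    ROUTE = 2       # moves a ticket, changes no account state
    RECOMMEND = 3   # suggests a next step to a human
    TRIGGER = 4     # changes account state: password reset, email change, MFA reset


# Hypothetical policy: the highest tier automation may perform without checks.
MAX_UNATTENDED = Capability.RECOMMEND


def authorize(action, capability, actor, identity_verified, human_approved, audit_log):
    """Decide whether an action may proceed, and record the decision and reason."""
    if capability <= MAX_UNATTENDED:
        allowed, reason = True, "low-risk tier"
    elif not identity_verified:
        allowed, reason = False, "identity not verified"
    elif actor == "ai" and not human_approved:
        allowed, reason = False, "privileged action needs human approval"
    else:
        allowed, reason = True, "verified and approved"
    audit_log.append(
        {"action": action, "actor": actor, "allowed": allowed, "reason": reason}
    )
    return allowed


audit = []
print(authorize("summarize_ticket", Capability.SUMMARIZE, "ai", False, False, audit))
print(authorize("reset_password", Capability.TRIGGER, "ai", True, False, audit))
```

Rate limits, abuse detection, and blast radius controls are left out to keep it short. What matters is that the decision is made by code that knows which tier an action belongs to, not by the model's judgment about whether the request sounds reasonable.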

The scariest AI system in production might not be your coding agent. It might be the support flow that can reset passwords.

The AWS AgentCore CLI story is another version of the same problem. This time the glue is generated code.

The issue was CVE-2026-11393. In short, collaborator metadata from a Bedrock agent import could end up inserted into a generated Python file in a way that allowed code injection.

The important lesson is simple.

Generated code is still code.

That sounds obvious, but teams do not always treat generated code with the same suspicion as handwritten code. Generated files feel like artifacts. Outputs. Something official. Something the tool made, so maybe nobody needs to read it too closely.

But if metadata, instructions, templates, or agent definitions can become executable files, that path needs review.

This is not really a brand-new class of problem. It is the same old family as SQL injection, template injection, shell injection, CI config injection, and YAML templating weirdness.

Different costume, same villain.

The AI angle makes it feel new, but the secure engineering lesson is old: do not turn text into executable context without a hard boundary.

Do not assume metadata is safe because it came from an API.

Do not assume generated code is safe because it came from an official tool.

And do not assume “instructions” are harmless just because they are called instructions.

In agent systems, instructions can become operational inputs. They can shape behavior, move between systems, become config, become prompts, and sometimes become code.
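
To make the general class of bug concrete, here is a small sketch of text becoming code. It is an illustration of the pattern only, not a reproduction of the AgentCore CLI issue: the function names, the `DESCRIPTION` variable, and the payload are all made up.

```python
import ast


def render_unsafe(description: str) -> str:
    # Vulnerable pattern: untrusted text pasted between quotes in generated source.
    return f'DESCRIPTION = "{description}"\n'


def render_safe(description: str) -> str:
    # repr() emits a valid Python string literal, with quotes and newlines escaped.
    return f"DESCRIPTION = {description!r}\n"


def is_single_string_assignment(source: str) -> bool:
    """True only if source parses to exactly one assignment of a string constant."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return False
    if len(tree.body) != 1 or not isinstance(tree.body[0], ast.Assign):
        return False
    value = tree.body[0].value
    return isinstance(value, ast.Constant) and isinstance(value.value, str)


# Metadata that closes the string and smuggles in an extra statement.
payload = 'x"\nimport os  # attacker-controlled statement\nY = "'

print(is_single_string_assignment(render_unsafe(payload)))
print(is_single_string_assignment(render_safe(payload)))
```

The unsafe version turns one metadata field into three statements. The safe version keeps it as data, and the `ast` check is the kind of hard boundary you can put on a generator's output before anything runs it.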

So the practical takeaway is pretty direct. Keep the CLI updated. Inspect generated files before running them in trusted environments. Run agent tooling with limited permissions. And pay attention to what credentials are available if something goes wrong.

Because the blast radius is not just the bug.

The blast radius is what the generated code can access when it runs.
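
A small sketch of what "limited permissions" can look like at the process level: run the generated file in a child process that only inherits an explicit allow list of environment variables. The allow list and function names are my assumptions, and this is not a sandbox. It does nothing about credential files on disk or instance roles.

```python
import os
import subprocess
import sys

# Assumption: a short allow list is enough for the generated script to start.
# On Windows you would likely also need SYSTEMROOT.
ALLOWED_VARS = ("PATH", "HOME", "LANG")


def minimal_env(env):
    """Copy only allow-listed variables, so credentials are dropped by default."""
    return {name: value for name, value in env.items() if name in ALLOWED_VARS}


def run_generated(path, timeout=30):
    """Run a generated Python file in a child process with the reduced environment."""
    return subprocess.run(
        [sys.executable, "-I", path],  # -I is Python's isolated mode
        env=minimal_env(os.environ),
        capture_output=True,
        text=True,
        timeout=timeout,
    )
```

An allow list is the deliberate choice here. A deny list of "known secret" variable names will always miss the one nobody remembered.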

The Apigee story is about tenant isolation. The specific cross-tenant issue comes from Google’s Apigee security bulletins rather than a brand-new disclosure this week, but I kept it in the episode because the lesson is still useful.

Tenant isolation is one of the deepest assumptions in cloud and SaaS systems.

You have your environment. Other customers have theirs. The provider, or the platform team, keeps those boundaries in place.

That assumption is what lets anyone build on shared infrastructure at all.

But cross-tenant bugs often show up outside the obvious request path. They show up in the seams.

Logs.

Analytics.

Dashboards.

Exports.

Admin tools.

Support consoles.

Background jobs.

Search indexes.

The internal paths that still carry customer data.

That is why the Apigee issue is worth talking about. API management can sit close to access logs, analytics, client behavior, API paths, timing, errors, and operational metadata. Cross-tenant access to that kind of data is not just a privacy problem. It can become a reconnaissance problem, a compliance problem, and a customer trust problem.

The takeaway is not “never use managed services.” That would be ridiculous.

The takeaway is that tenant isolation has to be tested beyond the happy path.

Can tenant A see tenant B’s logs?

Can a support view cross the wrong boundary?

Can an analytics query forget tenant context?

Can an async job mix records that should never touch?

Can an export, dashboard, or search index leak data from the wrong tenant?

Those are platform questions.
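
Here is a minimal sketch of what testing beyond the happy path can look like. The `LogStore` class is a toy stand-in I made up, not Apigee's data model. The idea it shows is that every secondary path, here an export, goes through the same tenant-scoped query, and a check plants a marker in one tenant and looks for it from another.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    tenant_id: str
    message: str


class LogStore:
    """Toy in-memory store where every read path is scoped to one tenant."""

    def __init__(self):
        self._records = []

    def write(self, tenant_id, message):
        self._records.append(LogRecord(tenant_id, message))

    def query(self, tenant_id, contains=""):
        # tenant_id is required: there is no "all tenants" default to fall into.
        if not tenant_id:
            raise ValueError("tenant_id is required")
        return [
            r for r in self._records
            if r.tenant_id == tenant_id and contains in r.message
        ]

    def export(self, tenant_id):
        # The secondary path reuses the scoped query instead of reading _records.
        return "\n".join(r.message for r in self.query(tenant_id))


def check_isolation(store, tenant_a, tenant_b):
    """Return the names of paths where tenant_a can see tenant_b's data."""
    marker = f"secret-for-{tenant_b}"
    store.write(tenant_b, marker)
    leaks = []
    if any(marker in r.message for r in store.query(tenant_a)):
        leaks.append("query")
    if marker in store.export(tenant_a):
        leaks.append("export")
    return leaks


store = LogStore()
store.write("tenant-a", "login ok")
print(check_isolation(store, "tenant-a", "tenant-b"))
```

In a real system the list of paths to probe is the interesting part: dashboards, search, support views, and background jobs each need their own entry.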

The lightning round kept the same pattern going.

Cloudflare turning threat intelligence into real-time WAF rules is useful, but once threat intel enters the blocking path, it becomes production change management. Start in visibility mode, watch false positives, stage the rollout, and make sure the logs explain why something was blocked.
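
As a rough sketch of that staging idea: each rule carries a mode, log-only rules record a match without blocking, and every match is logged with its reason. This is generic Python, not Cloudflare's rule syntax or API. The rule fields and IDs are hypothetical, and the addresses come from the documentation-reserved ranges.

```python
import ipaddress


def evaluate(request_ip, rules, log):
    """Return True if the request should be blocked. Every match is logged."""
    ip = ipaddress.ip_address(request_ip)
    blocked = False
    for rule in rules:
        if ip in ipaddress.ip_network(rule["cidr"]):
            log.append({
                "ip": request_ip,
                "rule": rule["id"],
                "mode": rule["mode"],
                "reason": rule["reason"],
            })
            if rule["mode"] == "block":
                blocked = True
    return blocked


# Hypothetical rules: one still in visibility mode, one promoted to blocking.
rules = [
    {"id": "intel-001", "cidr": "203.0.113.0/24", "mode": "log",
     "reason": "new threat feed entry"},
    {"id": "intel-002", "cidr": "198.51.100.0/24", "mode": "block",
     "reason": "confirmed after log-only period"},
]

log = []
print(evaluate("203.0.113.7", rules, log))   # matches a log-only rule
print(evaluate("198.51.100.9", rules, log))  # matches a blocking rule
```

Promoting a rule from `log` to `block` is then a reviewable change with evidence behind it, which is the change management part.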

AWS Lambda tenant isolation with event source mappings is a reminder that tenant context has to survive async paths. Queues and event systems are where context gets lost, default tenants sneak in, retries behave differently, and batches mix things that should not mix.
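
A minimal sketch of carrying tenant context through a batch, assuming each message is a dict with a `tenant_id` field. That shape is my assumption, and this is not the Lambda event source mapping API. The two ideas it shows are that a message without a tenant is rejected rather than given a default, and that a mixed batch gets split before any handler sees it.

```python
from collections import defaultdict


class MissingTenant(Exception):
    pass


def tenant_of(message):
    """Return the tenant for a message, refusing to invent a default."""
    tenant = message.get("tenant_id")
    if not tenant:
        raise MissingTenant(f"message {message.get('id')!r} has no tenant_id")
    return tenant


def split_batch(batch):
    """Group a mixed batch by tenant. Returns (groups, rejected)."""
    groups, rejected = defaultdict(list), []
    for message in batch:
        try:
            groups[tenant_of(message)].append(message)
        except MissingTenant:
            rejected.append(message)  # send to a dead-letter path, never a default tenant
    return dict(groups), rejected


batch = [
    {"id": 1, "tenant_id": "a", "body": "x"},
    {"id": 2, "tenant_id": "b", "body": "y"},
    {"id": 3, "body": "z"},  # lost its tenant context somewhere upstream
    {"id": 4, "tenant_id": "a", "body": "w"},
]

groups, rejected = split_batch(batch)
print(sorted(groups), [m["id"] for m in rejected])
```

Retries deserve the same treatment: a retried message should carry the tenant it started with, not whatever the retry path happens to fill in.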

OpenSearch Serverless scale-to-zero is a FinOps story, but also an operations story. Serverless does not mean operationally invisible. It means the operational questions moved to cold starts, indexing behavior, latency, cost patterns, and traffic spikes.

And GitHub Enterprise Managed Users IP allow list coverage is another reminder that source control is production infrastructure. Repo access is part of your security boundary, and that boundary keeps moving into identity, network policy, device posture, and enterprise governance.

Different stories.

Same pattern.

Modern reliability and security problems keep showing up in the seams.

Not always the app.

Not always the database.

Not always the obvious critical service.

Sometimes it is the thing between systems that nobody fully owns until it fails.

That is where a lot of real platform work lives now.

Not just making the app scale.

Not just adding another region.

Not just buying a managed service.

But asking where the authority actually sits.

Who can recover an account?

Who can generate code?

Who can access another tenant’s data?

Who carries tenant identity through a queue?

Who owns the failover path?

Who tested the recovery plan?

Who gets paged when the glue fails?

That work is not glamorous, but it is where a lot of production risk hides.

So the takeaway from this episode is simple:

Do not just review the big production systems.

Review the seams.

That is usually where the incident is waiting.

Extra links worth including on the episode page:

Coinbase May 7 outage postmortem
https://www.coinbase.com/blog/a-postmortem-of-our-may-7-2026-outage

Meta AI support / Instagram account recovery reporting
https://www.theverge.com/tech/945658/meta-ai-support-chatbot-exploit-instagram-accounts

AWS AgentCore CLI CVE-2026-11393
https://aws.amazon.com/security/security-bulletins/2026-040-aws/

AgentCore CLI GitHub advisory
https://github.com/aws/agentcore-cli/security/advisories/GHSA-m4x6-gwgp-4pm7

Google Apigee security bulletins
https://docs.cloud.google.com/apigee/docs/security-bulletins/security-bulletins

Cloudflare real-time threat intel WAF rules
https://blog.cloudflare.com/realtime-threat-intel-waf-rules/

AWS Lambda tenant isolation with event source mappings
https://aws.amazon.com/blogs/compute/integrating-event-source-mappings-with-aws-lambda-tenant-isolation-mode/

Amazon OpenSearch Serverless next generation
https://aws.amazon.com/about-aws/whats-new/2026/05/amazon-opensearch-serverless-next-generation-generally-available/

GitHub Enterprise Managed Users IP allow list coverage
https://github.blog/changelog/2026-06-08-ip-allow-list-coverage-for-emu-namespaces-in-general-availability/

This week’s On Call Brief
https://www.tellerstech.com/on-call-brief-news/2026-W24/

More Ship It Weekly episodes
https://shipitweekly.fm/

Show Notes

This episode of Ship It Weekly is about the hidden glue holding production together.

Brian covers Coinbase’s May 7 outage postmortem, where an AWS us-east-1 cooling failure exposed the difference between being “multi-AZ” on paper and actually being able to recover when stateful, low-latency systems are tied to a failed zone.

Then he looks at Meta’s AI-assisted Instagram support issue and why account recovery is identity infrastructure, not just customer support. If AI can influence password resets, email changes, MFA resets, or account ownership flows, that workflow needs to be treated like a production control plane.

The episode also covers AWS AgentCore CLI CVE-2026-11393, where collaborator metadata could break out into generated Python code during agent import, and an Apigee cross-tenant issue from Google’s Apigee security bulletins that shows why tenant isolation has to be tested beyond the obvious happy path.

Hosted by Brian Teller

25 years in production: DevOps, SRE, platform, and cloud. DevOps Institute & ITIL Ambassador.
