Coinbase Outage, Meta AI Account Recovery, AWS AgentCore Code Injection, Apigee Tenant Isolation, and the Glue That Breaks Production

Transcript

0:00 An AWS data hall overheats. A single availability

0:03 zone goes dark. And Coinbase spends hours recovering

0:07 systems that were not as failure-independent

0:11 as everyone hoped. Meta's AI support tooling

0:14 becomes part of an Instagram account recovery

0:17 path. And suddenly the chatbot is not just helping

0:21 customers. It is touching identity. AWS has another

0:25 agent tooling bug, this time in AgentCore CLI,

0:29 where imported agent metadata could become generated

0:32 Python code. And Google Cloud disclosed an Apigee

0:36 cross-tenant issue, which is exactly the kind

0:39 of phrase that makes cloud engineers sit up a

0:42 little straighter. The theme this week is simple.

0:45 The risky part of modern infrastructure is not

0:48 always the big obvious production system. Sometimes

0:51 it is the glue, the recovery path, the support

0:54 workflow, the generated file, the tenant boundary.

0:58 The thing everybody depends on, but nobody quite

1:02 treats like production until it breaks. I'm Brian

1:05 Teller from Teller's Tech, and this is Ship It

1:08 Weekly. Welcome back to Ship It Weekly, the show

1:27 where we look at the DevOps, SRE, cloud, platform,

1:32 and security stories that actually matter when

1:35 you are the person who eventually has to keep

1:37 the thing running. This week, we're starting

1:39 with Coinbase's May 7th outage postmortem. Then

1:43 we'll talk about Meta's AI support incident,

1:45 where attackers reportedly hijacked Instagram

1:48 accounts through an AI-assisted recovery flow.

1:52 After that, we'll get into AWS AgentCore CLI

1:55 CVE-2026-11393, where collaborator metadata

2:01 could break out into generated Python code. Then

2:05 we'll talk about Google Cloud Apigee and a

2:08 cross-tenant vulnerability, because tenant isolation

2:11 is one of those cloud promises you only think

2:15 about when it fails. In the lightning round,

2:17 we'll hit Cloudflare threat intelligence in WAF

2:20 rules, AWS Lambda tenant isolation with event

2:25 source mappings, the next generation of

2:28 OpenSearch Serverless, and GitHub Enterprise Managed

2:31 Users IP allow list coverage. So let's get into

2:35 it. First up, Coinbase published a postmortem

2:42 for its May 7th outage. And this one is worth

2:45 reading if you work anywhere near SRE, platform,

2:49 cloud, infrastructure, Kafka, financial systems,

2:52 or any system where people say "we're multi-AZ"

2:56 and then everyone quietly hopes that means what

2:59 they think it means. Coinbase says that trading,

3:02 deposits, withdrawals, and most customer-facing

3:05 services were unavailable or degraded for roughly

3:09 eight hours. Full recovery took even longer.

3:12 The initiating event was physical. Multiple chiller

3:15 units failed in a single AWS us-east-1 data hall,

3:19 specifically availability zone use1-az4. That

3:25 cooling issue triggered a thermal safety shutdown

3:27 of affected racks, taking EC2 instances and EBS

3:32 volumes in that building offline. That is the

3:35 part that looks like a classic cloud outage story.

3:38 Something physical breaks inside a data center

3:41 and cloud resources disappear. But the more useful

3:44 lesson is what happened next. Because most of

3:47 us already know that an availability zone can

3:50 fail. At least in theory. We have all seen the

3:53 diagram. Three boxes, three AZs, nice clean arrows,

3:58 everybody nods. Then a real failure happens,

4:01 and the system has to prove whether those boxes

4:04 were architecture or just decoration. For Coinbase,

4:07 one major issue was that the matching engine

4:10 was tied closely to the failed zone. And this

4:13 is where the nuance matters. Low-latency systems

4:16 often make tradeoffs that normal apps do not.

4:20 If you need extremely tight latency, deterministic

4:23 behavior, and careful state handling, you may

4:27 choose a design that favors performance over

4:30 easy failover. And that is not automatically

4:32 irresponsible. But it does mean the recovery

4:35 path has to be real, not theoretical. Not "we

4:38 documented it once." Real. Tested. Owned. Practiced.

4:43 Known by more than one person. The other big

4:46 lesson is state. Coinbase called out delays around

4:50 managed Kafka recovery. And that is extremely

4:53 relatable. During an outage, stateless compute

4:56 is usually the easy part. You restart it. You

4:59 move it. You scale it. You reroute around it.

5:02 Then you hit the system with state, ordering,

5:05 offsets, replication, partitions, or money-adjacent

5:10 semantics, and everything slows down. That is

5:13 where incidents get complicated. The app might

5:16 be multi-AZ, but what about the broker? The

5:19 database? The cache? The topic, the volume, the

5:23 thing that keeps the system honest. That is the

5:26 real test. The operator takeaway here is simple.

5:29 Do not let multi-AZ become a comfort phrase.

5:33 Find the parts of your system that are still

5:36 effectively tied to one zone, one stateful dependency,

5:40 or one recovery path, especially the low-latency

5:43 pieces, the Kafka pieces, the databases, the

5:46 matching engines, the services with ordering,

5:49 offsets, or money-adjacent semantics. Then ask

5:53 the annoying question, what happens if that AZ

5:56 disappears? What depends on recovering in place?

6:00 What can actually fail over? How long does it

6:03 take? And when was the last time we proved it?
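A minimal sketch of that audit, assuming you can export a simple inventory of where each component can actually serve traffic. The component names and zone labels below are made up for illustration and do not describe Coinbase's architecture. It flags components tied to a single zone and classifies what happens when one zone disappears.

```python
# Hypothetical inventory: component name -> zones where it can actually serve.
INVENTORY = {
    "api-gateway": ["az-a", "az-b", "az-c"],
    "matching-engine": ["az-c"],
    "kafka-brokers": ["az-b", "az-c"],
    "ledger-db-primary": ["az-c"],
    "static-assets": ["az-a", "az-b"],
}


def single_zone_components(inventory):
    """Return the names of components whose placements all fall in one zone."""
    return sorted(
        name for name, zones in inventory.items() if len(set(zones)) == 1
    )


def impact_of_losing(inventory, failed_zone):
    """Classify each component if failed_zone disappears.

    'down'       -> no placement survives
    'degraded'   -> some capacity is lost, some survives
    'unaffected' -> the component never ran in failed_zone
    """
    result = {}
    for name, zones in inventory.items():
        remaining = set(zones) - {failed_zone}
        if not remaining:
            result[name] = "down"
        elif failed_zone in zones:
            result[name] = "degraded"
        else:
            result[name] = "unaffected"
    return result
```

Calling impact_of_losing(INVENTORY, "az-c") on this made-up inventory marks the matching engine and the ledger primary as down, which is the list to take into a failover drill. The sketch only covers placement. It says nothing about how long recovery takes, which still has to be measured by actually practicing it.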

6:06 Because resilience is not where the diagram says

6:09 the boxes are. It is where the system can actually

6:12 recover when one of those boxes disappears. Second

6:20 story, Meta confirmed an incident involving its

6:23 AI-assisted Instagram support flow, where more

6:26 than 20,000 accounts were likely impacted. The

6:30 reported issue was that attackers could use Meta's

6:34 AI support tooling to trigger password reset

6:37 links to an email address that was not actually

6:40 associated with the victim's account. Meta described

6:43 it as a bug where the system did not properly

6:46 verify that the provided email matched the email

6:49 on the Instagram account. So attackers could

6:52 receive reset links for accounts they did not

6:55 own. Now, I know what you might be thinking.

6:58 Instagram accounts do not sound like a normal

7:01 DevOps or SRE story. But the real story is not

7:04 "social media accounts got hacked." The real story

7:07 is this. AI support automation became part of

7:11 an identity recovery control plane. And that

7:14 is absolutely our lane. Account recovery is privileged

7:18 infrastructure. It is the break-glass path for

7:21 identity. It can reset passwords. It can change

7:25 emails. It can bypass parts of the normal login

7:28 path when the user needs help. That means it has

7:31 power. And if AI can influence that workflow, then

7:35 the AI is not just answering support questions

7:38 anymore. It is participating in an identity decision.

7:42 That changes everything. A chatbot that answers

7:45 "how do I update my profile photo" is one thing.

7:49 A chatbot that helps move an account recovery

7:51 flow forward is very different. That system needs

7:55 real authorization logic, real verification,

7:59 rate limits, abuse detection, audit logs, escalation

8:03 paths, blast radius controls, and a very clear

8:07 line between what AI can suggest and what the

8:10 system is allowed to do. Because attackers do

8:14 not care whether your org chart says this belongs

8:17 to support, identity, security, product, or infrastructure.

8:21 They care that the workflow can reset accounts.

8:25 So they attack the workflow. This is also a reminder

8:29 that AI risk is not always a coding agent running

8:33 shell commands. Sometimes it is much more boring.

8:36 A support flow trusts the wrong field. A reset

8:39 link goes to the wrong place. And suddenly, the

8:43 AI part gets the headline. But the real failure

8:47 is the control boundary around the action. The

8:50 takeaway here is not AI support is bad. The takeaway

8:53 is that support workflows can be privileged infrastructure.

8:57 Password resets, MFA resets, email changes, account

9:02 recovery, support impersonation, admin ownership

9:05 changes. Those are not just customer service

9:08 features. Those are identity control paths. So

9:11 when AI gets added to that flow, the question

9:15 is not just, is the answer helpful? The question

9:18 is, what can this workflow actually do? Can it

9:21 suggest? Can it route? Can it trigger actions?

9:25 Can it change trust? And if it can touch identity,

9:29 it needs the same seriousness as any other production

9:32 control plane. Because the scariest AI system

9:36 in production might not be your coding agent.

9:39 It might be the support flow that can reset passwords.
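A minimal sketch of the control boundary described in this story, assuming a reset flow where the caller (a human agent or an AI assistant) can only request a reset and the server decides whether and where to send it. The account store, function names, and the lowercase normalization are illustrative assumptions, not Meta's implementation.

```python
import hmac

# Hypothetical account store: username -> verified email on file.
ACCOUNTS = {"alice": "alice@example.com"}


def _same_email(on_file, claimed):
    """Compare normalized addresses in constant time.

    Lowercasing the whole address is a simplification for this sketch.
    """
    a = on_file.strip().lower().encode()
    b = claimed.strip().lower().encode()
    return hmac.compare_digest(a, b)


def request_password_reset(username, claimed_email, send_link):
    """Send a reset link only if claimed_email matches the account record.

    send_link is a callable taking the destination address. Returns True
    if a link was sent. Unknown users and mismatches get the same answer.
    """
    on_file = ACCOUNTS.get(username)
    if on_file is None or not _same_email(on_file, claimed_email):
        return False
    # The destination comes from the account record, never from the request.
    send_link(on_file)
    return True
```

The design choice worth noticing is that the address supplied in the conversation is only ever compared, never used as a destination. A real flow would add the rate limits, audit logs, and abuse detection mentioned above.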

9:46 Third story. AWS published a security bulletin

9:50 for CVE-2026-11393 in AgentCore CLI. And if

9:57 last week's Kiro bug was "stdin answered

10:00 the approval prompt," this one is more like "agent

10:04 metadata became Python code," which is not the

10:07 kind of sentence you want in a security bulletin.

10:11 The issue involves AgentCore CLI when importing

10:14 a Bedrock supervisor agent with multi-agent collaboration

10:18 enabled. The advisory says the CLI fetched collaborator

10:23 metadata from the Bedrock API and inserted the

10:26 collaborationInstruction field into a

10:30 triple-quoted Python string in the generated main.py

10:34 file. The problem was improper escaping of triple

10:38 quotes. So a crafted instruction could break

10:41 out of that intended string and inject Python

10:44 code into the generated file. Then, if the file

10:48 was run locally or deployed into AgentCore Runtime,

10:52 the injected code could execute with the credentials

10:56 available to that environment. AWS recommends

11:00 upgrading the CLI, removing affected imported

11:03 agents, rerunning the import with the patched

11:06 CLI, and redeploying. If you cannot upgrade right

11:10 away, they recommend manually inspecting generated

11:13 main.py files for suspicious triple-quote sequences

11:17 in collaborator instructions. Now, this is preview

11:21 tooling, and the attack path is specific, but the

11:24 pattern matters. Generated code is code. That sounds

11:28 obvious, but teams do not always treat generated

11:31 code with the same suspicion as handwritten code.

11:35 Generated files feel like outputs, artifacts,

11:38 something the tool made, something official,

11:41 something you probably do not need to read too

11:44 closely. But if untrusted or semi-trusted metadata

11:47 gets inserted into source code, your generator

11:51 becomes a compiler for attacker-controlled input.
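A minimal sketch of this class of bug, assuming a generator that pastes a text field into a triple-quoted Python literal. This is not the AgentCore CLI's actual code or AWS's patch. The function names and the crafted value are hypothetical, and repr() is shown as one generic way to emit a fully escaped string literal.

```python
import ast


def render_unsafe(instruction):
    """Naive generator: pastes text straight into a triple-quoted literal."""
    return 'INSTRUCTION = """' + instruction + '"""\n'


def render_safe(instruction):
    """Safer generator: repr() emits a valid, fully escaped string literal."""
    return "INSTRUCTION = " + repr(instruction) + "\n"


def top_level_statements(source):
    """Count top-level statements by parsing the source without running it."""
    return len(ast.parse(source).body)


# Hypothetical crafted value: closes the string early, adds its own
# statement, then reopens a string so the generated file still parses.
crafted = 'Be helpful."""\nINJECTED = True\nTAIL = """'

unsafe_source = render_unsafe(crafted)  # metadata escapes into code
safe_source = render_safe(crafted)      # metadata stays data
```

Parsing instead of executing is deliberate. Comparing top_level_statements(unsafe_source) with top_level_statements(safe_source) shows extra statements appearing in the unsafe output, and the same kind of check can be run against a generated file before it goes anywhere near real credentials.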

11:55 And we have seen this movie before. SQL injection.

11:58 Template injection. Shell injection. CI config

12:01 injection. YAML templating weirdness. Now it

12:05 is agent metadata turning into Python. Different

12:09 costume, same villain. The AI angle makes it

12:12 feel new, but the secure engineering lesson is

12:15 old. Do not turn text into executable content

12:19 without a hard boundary. Do not assume metadata

12:23 is safe because it came from an API. Do not assume

12:27 generated code is safe because it came from an

12:30 official tool. And definitely do not assume instructions

12:34 are harmless. In agent systems, instructions

12:38 are operational inputs. They can shape behavior.

12:41 They can get passed between systems. They can

12:44 become config. They can become prompts. And sometimes

12:48 they can become code. The takeaway here is short.

12:52 Generated code is still code. If a tool turns

12:55 metadata, instructions, templates, or agent definitions

12:59 into executable files, that path needs review.

13:03 Keep the CLI updated. Inspect generated files

13:07 before running them in trusted environments.

13:10 Run agent tooling with limited permissions. And

13:14 pay attention to what credentials are available

13:16 if something goes wrong. Because the blast radius

13:20 is not just the bug. The blast radius is what

13:23 the generated code can access when it runs. Fourth

13:31 story. Google Cloud disclosed a vulnerability

13:34 in Apigee, CVE-2025-13292. Google's release

13:40 notes say the issue could have allowed a malicious

13:42 actor with administrative or developer-level

13:46 permissions in their own Apigee environment to

13:49 elevate privileges and access cross-tenant data.

13:53 NVD describes it as unauthorized read and write

13:57 access to Apigee analytics data and access logs

14:02 belonging to other customer organizations. Google

14:05 says the issue was patched and no user action

14:08 is required. That is good. But the phrase that

14:11 matters is cross-tenant. Tenant isolation is

14:15 one of the deepest assumptions in cloud computing.

14:18 You have your environment. Other customers have

14:21 their environments. The provider keeps those

14:23 boundaries in place. That is the model. Most

14:27 of the time, you do not have to think about it.

14:29 And you cannot think about it all day. If cloud

14:32 customers had to constantly wonder whether another

14:35 tenant could see their logs, analytics, objects,

14:38 volumes, or traffic, nobody would get anything

14:41 done. So when a managed service has a cross-tenant

14:45 issue, even a patched one, it is worth paying

14:48 attention. And Apigee is not some random side

14:51 tool. API management sits in a sensitive part

14:54 of the stack. It can see traffic patterns, access

14:57 logs, analytics, client behavior, API paths,

15:01 timing, errors, and sometimes more than people

15:05 realize. Access logs are not always harmless.

15:08 They can reveal service names, query parameters,

15:12 client identifiers, internal routing, usage patterns.

15:16 And because logging systems are logging systems,

15:19 sometimes they capture fields everyone later

15:21 wishes had been redacted. So cross-tenant access

15:24 to analytics and logs is not just a privacy issue.

15:28 It can be a reconnaissance issue, a compliance

15:31 issue, a customer trust issue. The takeaway is

15:35 not "never use managed services." That would be

15:38 ridiculous. The takeaway here is that tenant

15:41 isolation is not just a cloud provider promise.

15:44 It is a design requirement for every platform

15:48 team building multi-tenant systems. Logs, analytics,

15:52 dashboards, exports, admin tools, support consoles,

15:57 background jobs, search indexes, all of those

16:00 can become tenant boundary problems. So do not

16:04 only test the obvious request path, test the

16:07 internal paths too. Can tenant A see tenant B's

16:11 logs? Can a support view cross the wrong boundary?

16:15 Can an analytics query forget tenant context?
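One way to make that analytics question harder to get wrong is to bind the tenant once and apply the filter on every read, so no individual query can forget it. This is a minimal in-memory sketch under that assumption. The record shape and class name are hypothetical and have nothing to do with how Apigee is built.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogRecord:
    tenant_id: str
    path: str


class TenantScopedLogs:
    """Read-only view that applies the tenant filter to every query."""

    def __init__(self, records, tenant_id):
        # Refuse to build a view without a tenant rather than default to one.
        if not tenant_id:
            raise ValueError("tenant_id is required")
        self._records = records
        self._tenant_id = tenant_id

    def query(self, path_prefix=""):
        """Return this tenant's records whose path starts with path_prefix."""
        return [
            r for r in self._records
            if r.tenant_id == self._tenant_id and r.path.startswith(path_prefix)
        ]


# Hypothetical mixed-tenant data, as a shared log store would hold it.
RECORDS = [
    LogRecord("tenant-a", "/orders"),
    LogRecord("tenant-b", "/orders"),
    LogRecord("tenant-a", "/users"),
]
```

With this shape, the test for "can tenant A see tenant B's logs" becomes an assertion that every record returned by TenantScopedLogs(RECORDS, "tenant-a").query() carries tenant-a, and the same assertion can be pointed at dashboards, exports, and background jobs.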

16:19 Can an async job mix records that should never

16:22 touch? Cross-tenant bugs usually do not start

16:26 in the happy path. They start in the seams. Now

16:36 let's do a quick lightning round. First, Cloudflare

16:39 is turning threat indicators into real-time

16:42 WAF rules. Cloudflare says Cloudforce One threat

16:47 intelligence can now be used directly inside

16:50 the WAF engine. That means teams can write rules

16:53 based on things like known attacker names, targeted

16:57 industries, source countries, target countries,

17:00 and attack context. I like the direction. It

17:04 moves security from "we know this traffic is

17:07 bad" toward "we can actually enforce on that

17:11 knowledge." But the operator warning is obvious.

17:14 Threat intel in the blocking path is still production

17:17 change management. Start in visibility mode.

17:20 Watch false positives. Stage the rollout. Make

17:24 sure that the logs explain why something was

17:27 blocked. Automation is great until it blocks

17:30 your biggest customer because a field matched

17:32 a little too creatively. Second, AWS published

17:36 guidance for Lambda tenant isolation mode with

17:40 event source mappings. The interesting part is

17:43 that async systems do not naturally carry tenant

17:48 context the same way synchronous API requests

17:51 do. With an API request, you may have headers,

17:55 claims, and request context. With SQS, EventBridge,

18:00 or other event sources, tenant identity may be

18:04 inside the payload or may need to be extracted

18:07 and passed along carefully. That matters because

18:11 multi-tenant bugs love async systems. A message

18:15 loses context. A worker assumes the default tenant.

18:19 A retry uses the wrong metadata. A batch mixes

18:22 records that should never be mixed. Tenant isolation

18:26 is not just about the runtime. It is about context

18:30 propagation, especially when queues and events

18:33 are involved. Third, the next generation of Amazon

18:37 OpenSearch Serverless is generally available.

18:40 AWS says it provisions faster, scales faster,

18:44 supports scale to zero, and can reduce costs

18:47 compared to provisioning for peak load. That

18:50 is a FinOps story, but it is also an operations

18:54 story. If teams start treating search and vector

18:57 infrastructure as more elastic and disposable,

19:01 they still need to understand cold starts, latency,

19:04 indexing behavior, cost patterns, and traffic

19:07 spikes. Serverless does not mean operationally

19:11 invisible. It means the operational questions

19:13 moved. Fourth, GitHub Enterprise Managed Users

19:17 IP Allow List coverage is generally available.

19:21 Enterprises using EMUs can now enforce GitHub's

19:24 native IP Allow List configuration across user

19:28 namespaces. That matters because source control

19:31 is production infrastructure. Repo access is

19:34 part of your security boundary. And more of that

19:37 boundary is moving into identity, network policy,

19:41 device posture, and enterprise governance. IP

19:45 allow lists do not solve everything. But repo

19:48 access is still one of the fastest ways to turn

19:52 a credential problem into a production problem.

20:02 The human closer this week is about the glue.

20:05 Coinbase was not just an AZ failure. It was state,

20:09 latency, Kafka, and recovery paths. Meta was

20:13 not just an AI support bug. It was account recovery

20:17 becoming identity infrastructure. AgentCore

20:20 was not just a Python escaping issue. It was

20:24 metadata turning into executable code. Apigee

20:28 was not just a patched cloud vulnerability. It

20:31 was tenant isolation showing up in logs and analytics,

20:36 not just the main product path. Different stories,

20:39 same pattern. Modern reliability and security

20:42 problems keep showing up in the seams. The support

20:45 flow. The generated file. The recovery path.

20:49 The admin console. The queue. The logs. The thing

20:53 between two systems that quietly decides what

20:57 is allowed to happen next. That is where a lot

21:00 of real platform work lives now. Not just making

21:03 the app scale. Not just adding another region.

21:06 Not just buying a managed service. But asking

21:09 where the authority actually sits. Who can recover

21:13 an account? Who can generate code? Who can access

21:17 another tenant's data? Who owns the failover

21:20 path? Who tested the recovery plan? And who gets

21:23 paged when the glue fails? So the takeaway this

21:26 week is simple. Do not just review the big production

21:29 systems. Review the seams. That is usually where

21:33 the incident is waiting. That's it for this week

21:36 of Ship It Weekly. We covered Coinbase's May

21:39 7th outage postmortem, Meta's AI support and

21:42 Instagram account recovery issue, AWS AgentCore

21:46 CLI CVE-2026-11393, Google Cloud Apigee's

21:51 cross-tenant vulnerability, and a lightning

21:54 round on Cloudflare WAF threat intel, AWS Lambda

21:58 tenant isolation, OpenSearch Serverless, and

22:01 GitHub Enterprise Managed Users IP allow lists.

22:05 If this episode was useful, follow or subscribe

22:08 wherever you are watching or listening. If you

22:11 are on YouTube, hit subscribe. If you are in

22:14 a podcast app, follow the show there. And if

22:17 you know someone dealing with cloud resilience,

22:19 AI support workflows, agent tooling, tenant isolation,

22:23 or repo governance, send this one to them. It

22:27 helps the show grow, and it helps me keep making

22:29 this kind of content for people who actually

22:32 live with these systems. You can find the weekly

22:35 brief at OnCallBrief.com and more episodes and

22:38 this week's show notes at ShipItWeekly.fm. I'm

22:42 Brian Teller from Teller's Tech. Thanks for listening.

22:45 And remember, the system that breaks production

22:47 is not always the big obvious service. Sometimes

22:51 it is the glue everybody forgot was holding the

22:54 whole thing together.
