0:00
Trusted tools are having a rough week. A popular
0:03
VS Code extension reportedly helped expose thousands
0:07
of GitHub internal repositories. Megalodon hit
0:10
thousands of public repos with poisoned commits
0:13
and GitHub Actions workflow abuse. Railway had
0:17
a platform-wide outage after Google Cloud incorrectly
0:20
suspended its production account. Discord dropped
0:24
17% of active sessions during a Kubernetes migration.
0:29
And AWS is changing SDK retry behavior, which
0:33
sounds boring until you remember retries are
0:35
how your app behaves when the world is already
0:38
on fire. The theme this week is simple. The tools
0:42
that we trust most are becoming some of our riskiest
0:45
production dependencies. I'm Brian Teller from
0:48
Teller's Tech, and this is Ship It Weekly. Welcome
1:07
back to Ship It Weekly, the show where we look
1:10
at DevOps, SRE, cloud, platform, and security
1:14
stories that matter when you are the person who
1:17
eventually has to keep the thing running. This
1:19
week, we're starting with GitHub supply chain
1:21
risk from two directions: a compromised VS Code
1:25
extension tied to a GitHub internal repo breach
1:28
and the Megalodon campaign abusing public repos
1:32
and CI/CD workflows. Then we'll talk about Railway's
1:36
GCP account suspension outage, Discord's voice
1:39
outage postmortem, AWS changing SDK retry defaults,
1:44
and a RabbitMQ AWS plugin issue that accidentally
1:48
shipped debug code into production builds. In
1:52
the lightning round, we'll hit OpenTelemetry
1:54
Graduation, Claude Code RCE, GitLab Secrets Manager,
1:59
Google Cloud AI Spend Caps, and a Redshift Python
2:02
driver RCE. And the human closer is about trusted
2:06
tools because the systems we trust most are often
2:10
the ones we forget to threat model. So let's
2:13
get into it. First up, GitHub had a rough supply
2:20
chain week. The developer toolchain got hit from
2:23
multiple directions. The first story is the Nx
2:26
Console VS Code extension compromise. Reporting
2:29
says a malicious version of the Nx Console
2:32
extension was published and that the extension
2:35
was later tied to a breach involving thousands
2:38
of GitHub internal repositories. StepSecurity
2:41
says the compromised version was
2:44
Nx Console 18.95.0 and that the root cause involved a contributor's
2:49
GitHub token being scraped in a prior supply
2:53
chain attack. The Hacker News reported that roughly
2:56
3,800 GitHub internal repositories were exposed
3:00
after a GitHub employee device was compromised
3:03
through the malicious extension. Now, I'm not
3:06
saying every VS Code extension is evil, but every
3:10
VS Code extension is still code running next
3:13
to the repos you clone, the terminal you use,
3:16
and sometimes the tokens you forgot were still
3:20
hanging around, which is not exactly a low-trust
3:23
environment. A lot of developers install extensions
3:25
like browser tabs. That looks useful. That has
3:28
a nice icon. That has a lot of installs. Sure,
3:32
why not? And normally that feels fine until an
3:35
extension update becomes an initial access path.
3:39
That is the real lesson. The developer workstation
3:42
is part of the production attack surface now,
3:45
not because it serves customer traffic, but because
3:47
it touches the things that eventually do. Repos,
3:51
secrets, CI/CD, cloud credentials, deployment
3:55
tooling, package publishing, SSH keys, kubeconfigs,
3:59
all the fun little artifacts we pretend are carefully
4:02
managed until somebody runs `ls ~/.aws` and the room
4:07
gets quiet. Then there's Megalodon. Security
4:10
researchers reported a campaign hitting more
4:13
than 5,500 public repositories with malware-laden
4:17
commits, stealing CI/CD secrets like AWS
4:21
and Google Cloud keys, SSH private keys, and
4:25
Kubernetes configs. That makes this more than
4:28
a GitHub story. It's a reminder that CI/CD is
4:32
not just where code gets tested. It is where
4:35
trust gets converted into artifacts. If a workflow
4:38
has cloud credentials, package publishing tokens,
4:41
signing keys, or deployment authority, then a
4:44
compromised workflow is not just a dev problem.
4:48
It is a release system compromise. The takeaway
4:51
is not ban all extensions or never use GitHub
4:54
Actions. That's not serious. The takeaway is
4:57
to treat developer tooling as production-adjacent
5:00
infrastructure. Review extensions with broad
5:03
file, terminal, or workspace access. Use short-lived
5:07
credentials where you can. Keep cloud
5:10
keys out of CI when OIDC works. Lock down GitHub
5:14
Actions permissions. Require review on workflow
5:18
changes. And please, do not let every workflow
5:21
run with full write access because that was easier
5:25
during the first setup. A trusted extension,
5:28
a trusted repo, and a trusted workflow can all
5:31
become part of the same attack path. That's the
5:34
part that matters. Next up, Railway published
5:41
an incident report about a platform-wide outage
5:45
caused by Google Cloud incorrectly suspending
5:48
Railway's production account. That sentence alone
5:51
is enough to make most cloud engineers sit up
5:55
a little straighter. Railway says the outage
5:57
started on May 19th when Google Cloud incorrectly
6:00
placed their production account into a suspended
6:04
status. That took Railway's API, dashboard, control
6:08
plane, databases, and GCP-hosted compute infrastructure
6:12
offline. And then it got more interesting. Railway
6:15
also runs workloads on Railway Metal and AWS
6:19
burst-cloud environments. And those workloads
6:21
initially stayed up. But Railway's edge proxies
6:25
relied on a GCP-hosted control plane API to
6:29
populate routing tables. So when route caches
6:32
expired, the outage cascaded beyond GCP. Workloads
6:36
that were still technically running became unreachable
6:40
because the network control plane could no longer
6:42
resolve routes to active instances. At peak impact,
6:46
Railway says all workloads across all regions
6:49
were unreachable. And that is the story. Not
6:52
just Google Cloud suspended an account. The real
6:55
story is that multi-cloud did not save the system
6:58
because the control plane dependency was still
7:02
in the hot path. That is where architecture diagrams
7:05
get too optimistic. You can draw AWS over here
7:08
and GCP over there. Metal in another box. Some
7:12
arrows, maybe a nice little mesh diagram. And
7:15
suddenly, everybody feels resilient. But resilience
7:18
is not about how many providers appear on the
7:20
diagram. It is about what has to work during
7:23
a failure. If your data plane in AWS still needs
7:26
a control plane API in GCP to route traffic,
7:30
then GCP is still in the hot path. If your failover
7:33
region needs the primary region's identity system
7:36
to approve failover, then the primary region
7:39
is still in the hot path. If your emergency deploy
7:42
process depends on the same CI/CD platform that
7:46
is currently broken, then congratulations. You
7:49
have invented a circular dependency with branding.
7:52
The strongest thing in Railway's writeup is
7:55
that they owned it. They said that they take
7:57
responsibility for the architectural decisions
8:00
that allowed one upstream provider action to
8:03
cascade into a platform-wide outage. That's
8:07
the right posture. Customers do not care whether
8:09
the thing that broke was technically Google,
8:12
Railway, GitHub, Stripe, AWS, or a squirrel with
8:16
a networking certification. They see your product.
8:19
So when you say multi-cloud, ask what dependency
8:22
is still centralized? What service discovers
8:24
routes? What service holds identity? What API
8:28
does the edge need? What happens when cached
8:31
state expires? And what dependency do you only
8:34
discover when the provider account disappears
8:37
and everyone suddenly becomes very interested
8:40
in architecture diagrams? Multi-cloud is not magic.
8:44
Sometimes it is just single cloud with extra
8:47
invoices. Third story. Discord published a really
8:55
good postmortem on its March 25th voice outage.
8:58
The title is perfect. You've got too much mail.
9:01
Because this outage wasn't just Kubernetes killed
9:04
some pods. It was a chain reaction where a routine
9:08
infrastructure change hit a stateful system,
9:11
dropped a large number of sessions, and downstream
9:14
systems got overwhelmed by the recovery behavior.
9:18
Discord says voice and video suffered major degradation
9:22
for a little over three hours. Users were mostly
9:26
unable to start or join calls and saw an awaiting
9:29
endpoint message. The trigger came during a Kubernetes
9:32
migration for Discord's Elixir services. They
9:36
were tuning session management service resources
9:38
and pod counts. As Kubernetes applied the change,
9:42
it terminated 50% of the pods in one zone. Since
9:46
sessions were balanced across three zones, about
9:49
17% of active sessions were ungracefully stopped.
9:53
That alone is not great. But the cascading part
9:56
is the real lesson. Discord’s systems use Elixir
9:59
GenServer processes, and those processes have
10:03
mailboxes. When all those sessions vanished,
10:05
other processes received a flood of messages
10:08
saying sessions were down. That caused reconnect
10:11
behavior, rate limit pressure, memory spikes,
10:15
gateway problems, and eventually voice and video
10:18
routing issues. This is the kind of postmortem
10:21
that I love because it shows how real outages
10:23
are usually not one failure. They are interaction
10:27
failures. Kubernetes did what it was told. The
10:30
session service had handoff logic. The rate limit
10:33
existed. The downstream services were designed
10:35
for normal load, but the shape of the change
10:38
produced a workload the system was not tuned
10:41
for. That is the part people miss when they
10:43
say, why didn't they just autoscale? Because
10:46
autoscaling is not a magic undo button for we
10:49
just invalidated 17% of active sessions and
10:52
created a reconnect storm. Sometimes the bottleneck
10:56
is not CPU. Sometimes it's mailbox length, back
10:59
pressure, downstream fanout, or one angry queue
11:03
quietly becoming the main character. The practical
11:06
takeaway is migration safety. When you move stateful
11:09
systems into Kubernetes, think beyond pod termination.
11:13
What does the rest of the system do when this
11:16
pod terminates? Who gets notified? Who retries?
11:19
Who reconnects? Who queues messages? Who gets
11:22
overloaded trying to help? Graceful shutdown
11:25
is not just a pod lifecycle feature. It is a
11:28
system behavior. If you are doing Kubernetes
11:31
migrations for stateful services, test the ugly
11:34
cases. Kill more than one pod. Drain a zone.
11:38
Watch downstream queues. Look at reconnect behavior.
11:42
Because production does not ask if the change
11:44
was routine. It asks if the system was ready.
11:51
Fourth story, AWS is changing retry behavior
11:54
across AWS SDKs and tools. And I know that that
11:59
sounds like the kind of story that you would
12:00
normally skip because it has the emotional energy
12:03
of a configuration footnote. But retry defaults
12:07
are invisible infrastructure. They affect latency,
12:10
error rates, load during outages, and how your
12:13
app behaves when AWS services are already struggling.
12:17
AWS says the updated retry behavior is available
12:20
now behind opt-in and will become the default
12:24
in November 2026. The updated behavior changes
12:27
how standard and adaptive retry modes handle
12:30
failures. AWS is making standard mode the default
12:34
for SDKs that previously defaulted to legacy
12:37
mode, adding retry quotas where they didn't exist,
12:41
changing backoff timing and treating transient
12:44
errors differently from throttling errors. One
12:46
big change is that transient error retries cost
12:49
more retry quota than before. The idea is that
12:53
during sustained outages, the SDK fails faster
12:56
instead of endlessly retrying and adding pressure
12:59
to a service that is already unhealthy. That
13:03
is good, but it can still surprise you. Retries
13:06
are one of those things most teams do not think
13:09
about until an incident. Your code says call
13:11
S3. The SDK says it'll handle some retries. Your
13:15
app says, great, I'll pretend that that was one
13:18
request. Then the service starts throwing errors
13:20
and suddenly request latency, thread usage, connection
13:24
pools, client CPU, and downstream load all depend
13:29
on retry behavior you may never have explicitly
13:32
configured. Retries can save you from transient
13:35
failures. Retries can also turn a partial outage
13:39
into a client-side traffic storm wearing a helpful
13:42
little hat. So the takeaway is simple. Do not
13:45
wait until November 2026 to discover how your
13:49
app behaves. Pick a non-production workload.
13:52
Opt in with the new environment flag. Look at
13:55
latency. Look at error surfaces. Look at max attempts.
13:59
Look at throttling behavior. Look at long-polling
14:02
clients like SQS consumers. And figure out whether
14:05
your app depends on old retry behavior without
14:08
anyone realizing it. Because nothing says fun
14:12
on-call rotation like finding out your retry
14:14
strategy was inherited from 2018 and load-tested
14:18
by hope. Fifth story, let's talk about RabbitMQ,
14:26
debug code, secrets, and cloud cost blast radius.
14:30
AWS published a security bulletin for CVE-2026-9133
14:33
in the rabbitmq-aws plugin. The plugin
14:38
resolves AWS ARNs in RabbitMQ's broker configuration
14:43
at startup and can fetch things like TLS certificates,
14:47
private keys, passwords, and other secrets from
14:50
AWS services. The issue is that debug code in
14:53
the plugin's ARN resolver was accidentally shipped
14:56
in production builds. A debug ARN scheme accepted
15:00
by a validation endpoint could allow a remote
15:03
authenticated user to read arbitrary files accessible
15:07
to the RabbitMQ process. That is not good. AWS
15:11
recommends upgrading to rabbitmq-aws 0.2.1,
15:16
patching forked code, and rotating secrets stored
15:19
in files the RabbitMQ process could read. This
15:23
is a specific bug, but the pattern is broad.
15:26
Debug code in production should make everyone
15:29
briefly stop blinking because debug paths often
15:32
bypass the normal shape of the system. They inspect
15:35
the thing directly. They validate a thing in
15:38
a way production code usually does not. And then
15:41
somehow that path makes it into a build where
15:44
a real user or attacker can reach it. This also
15:47
pairs with a separate AWS Bedrock cost story
15:50
from Reddit where a user described attackers
15:53
using exposed access keys from an EC2 instance
15:57
to run about $14,000 of Claude calls in 24 hours.
16:02
That second story is a Reddit report, so I would
16:05
not treat it like a formal incident report, but
16:08
as a pattern, it is extremely believable. Cloud
16:11
credentials plus AI services can become a very
16:15
fast money fire. Security and FinOps are blending
16:18
together. Compromised cloud keys used to mostly
16:21
mean crypto mining, data access, or infrastructure
16:25
abuse. Now they can also mean someone burning
16:27
through model inference or AI API calls at a
16:31
rate that makes finance start typing in all caps.
16:35
The takeaway is not complicated. Scope credentials.
16:38
Use roles instead of long-lived access keys
16:41
where possible. Watch unusual service usage.
16:44
Put budgets and anomaly detection around AI services.
16:49
Rotate secrets when file-read issues appear.
16:52
And remember that authenticated user does not
16:54
mean safe user. Small bugs get expensive when
16:58
the process can read secrets and the secrets
17:01
can spend money. Now let's do a quick lightning
17:10
round. First, OpenTelemetry graduated from the
17:14
CNCF. This is a huge milestone. OpenTelemetry
17:17
is now basically the de facto standard for vendor-neutral
17:21
telemetry across traces, metrics, and
17:24
logs. But the operator warning is still the same.
17:27
The collector is production plumbing. Graduation
17:30
does not mean that every collector upgrade is
17:32
safe, every processor config is harmless, or
17:35
every telemetry pipeline is suddenly boring.
17:38
Standardization helps, but you still need rollout
17:40
strategy, config validation, load testing, and
17:44
a plan for what happens when the thing that reports
17:46
on production becomes the thing breaking production.
17:49
Second, Claude Code had a security issue. I'm
17:52
keeping this short because we have covered Claude
17:55
Code and agent security a lot lately. But the
17:58
pattern matters. AI coding tools are not just
18:01
editors. They can have filesystem access, repo
18:04
context, terminals, deeplinks, commands, and
18:08
workflow integration. So when those tools have
18:11
parsing bugs, deeplink bugs, or command execution
18:14
paths, the risk is not theoretical. The developer
18:17
environment is becoming another agent runtime,
18:20
and agent runtimes need threat models. Third,
18:23
GitLab 19.0 introduced GitLab Secrets Manager
18:27
in public beta. This is a good direction. Secrets
18:31
closer to the pipeline, scoped to jobs, governed
18:34
through the same platform people already use
18:37
for CI/CD. That does not solve every secret manager
18:40
problem, but it does acknowledge reality. A lot
18:43
of secrets risk lives in CI/CD because CI/CD is
18:47
where systems need credentials to do work. Treating
18:50
pipeline secrets as first-class objects is better
18:53
than pretending a masked variable named prod
18:56
token is a strategy. Fourth, Google Cloud is
18:59
rolling out hard spend caps for AI services.
19:02
This is a FinOps story, but it is also a reliability
19:06
story. If a budget cap pauses API traffic when
19:09
spend hits a limit, that can protect you from
19:12
a surprise bill. It can also become an availability
19:14
event if your product depends on that API. So
19:18
hard caps are useful, but they need operational
19:21
design. Who gets alerted before the cap? What
19:24
degrades gracefully? What is customer facing?
19:27
And what do you want more? A hard outage or a
19:30
hard invoice? Sometimes the answer depends on
19:33
the day. Fifth, Amazon Redshift Python driver
19:36
had an RCE issue. AWS reported CVE-2026-8838
19:41
in the Redshift Python driver, where a rogue
19:45
server could execute commands on a user's data
19:49
warehouse client. That is a good reminder that
19:52
database clients are part of your execution boundary
19:54
too. Not every RCE starts on the server. Sometimes
19:58
the client connects to the wrong thing, trusts
20:00
the wrong response, and becomes the thing that
20:03
gets owned. So patch the driver, watch connection
20:06
targets, and remember that it is just a client
20:09
library is usually how the story starts, not
20:13
how it ends. The human closer this week is about
20:23
trusted tools. The riskiest systems are not always
20:27
the mysterious ones. Sometimes they are the familiar
20:30
ones. The extension everyone installs. The workflow
20:33
nobody reviews. The retry behavior nobody configured.
20:37
The plugin that shipped with debug code. The control
20:41
plane API that seemed fine because the cache
20:44
bought you an hour. Trusted tools become dangerous
20:47
when trust turns into invisibility. That does
20:51
not mean that every tool is bad. It means that
20:53
trust should have an expiration date. Every so
20:56
often you need to ask, what does this tool have
20:59
access to? What can it change? What happens if
21:02
it is compromised? What happens if it disappears?
21:05
What happens if it retries differently? What
21:08
happens if the cache expires? That is not paranoia.
21:12
That is being the person who has to answer the
21:14
incident channel when everyone else is asking,
21:17
how could this happen? The staff and principal
21:19
engineer job is often about seeing the shape
21:22
of the dependency. Noticing when a developer
21:25
tool is actually a production path. When a retry
21:28
default is actually outage behavior. When a multi-cloud
21:32
architecture still has one hot dependency.
21:35
When a plugin can read secrets. When the thing
21:39
that everyone trusts has become the thing nobody
21:41
questions. The takeaway is not to stop trusting
21:44
tools. You cannot run modern systems that way.
21:48
The takeaway is to make trust visible. Map the
21:51
permissions. Review the workflows. Scope the
21:54
credentials. Test the failure path. Patch the
21:57
clients. Constrain the plugins. And look at your
22:01
boring dependencies like they might be production
22:04
infrastructure. Because they probably are. That's
22:07
it for this week of Ship It Weekly. We covered
22:09
the GitHub supply chain week with Nx Console
22:12
and Megalodon. Railway's GCP account suspension
22:15
outage. Discord's voice outage postmortem, AWS
22:19
SDK retry behavior changes, the RabbitMQ AWS
22:23
plugin file-read issue, and a lightning round
22:26
on OpenTelemetry, Claude Code, GitLab Secrets
22:30
Manager, Google AI Spend Caps, and Redshift Python
22:34
Driver RCE. If this episode was useful, follow
22:37
or subscribe wherever you are watching or listening.
22:40
If you're on YouTube, hit subscribe. If you are
22:43
in a podcast app, follow the show there. And
22:45
if you know someone on a DevOps, SRE, platform,
22:48
security, or engineering leadership team who is
22:51
dealing with supply chain risk, cloud dependencies,
22:54
retries, or trusted tooling, send this one to
22:58
them. It helps the show grow and it helps me
23:00
keep making this kind of content for people who
23:04
actually live with these systems. You can find
23:06
the weekly brief at OnCallBrief.com and more
23:10
episodes and this week's show notes on ShipItWeekly.fm.
23:13
I'm Brian Teller from Teller's Tech. Thanks
23:16
for listening. And remember, if your trusted
23:18
tool can install code, trigger CI, route traffic,
23:22
retry requests, read secrets, or burn cloud money,
23:26
it is not just a tool anymore. It is part of
23:29
production. So maybe treat it like it.
This episode is really about one idea: trusted tools become risky when they become invisible.
Most production incidents are not caused by some mysterious system nobody has ever heard of. A lot of the time, the scary part is something familiar. A developer extension. A CI workflow. A cloud provider account. A control plane API. An SDK retry default. A plugin. A collector. A database driver. The thing everyone uses, nobody reviews closely, and everyone assumes is fine because it was fine yesterday.
That is what stood out to me this week.
The GitHub supply chain stories are the cleanest example. The Nx Console VS Code extension compromise was not just “an extension went bad.” It was a reminder that developer tooling sits right next to source code, terminals, tokens, cloud credentials, package publishing paths, and CI/CD systems. StepSecurity reported that Nx Console version 18.95.0 included malicious code targeting developer credentials, cloud infrastructure tokens, and CI/CD secrets. The Hacker News also reported GitHub confirmed internal repositories were exfiltrated after an employee device was compromised through the poisoned extension.
That makes the developer workstation part of the production attack surface.
Not because it serves customer traffic. It usually does not. But because it touches nearly everything that eventually becomes production. Source code. Deploy paths. Secrets. Cloud access. Build systems. Package publishing. Kubeconfigs. SSH keys. Internal docs.
A popular extension with auto-update, broad workspace access, and a trusted brand name is not “just an editor add-on” anymore. It is code running inside a high-trust environment.
That does not mean the answer is “never install extensions.” That is not realistic. Modern engineering depends on tooling. The better answer is to stop treating dev tools as casual personal preference once they can reach production-adjacent systems. Extension allowlists, endpoint monitoring, token hygiene, short-lived credentials, and real review around high-trust tools all matter more than they used to.
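Here is a rough sketch of what an extension allowlist check could look like. It assumes you feed it text in the shape that `code --list-extensions --show-versions` prints, one `publisher.name@version` per line. The extension IDs, the allowlist, and the blocked pairing are all made up for illustration. Only the 18.95.0 version number comes from the reporting, and it is attached to a placeholder ID on purpose.

```python
from typing import Dict, List, Set, Tuple


def parse_extensions(listing: str) -> Dict[str, str]:
    """Parse `publisher.name@version` lines into {extension_id: version}."""
    found: Dict[str, str] = {}
    for raw in listing.splitlines():
        line = raw.strip()
        if not line or "@" not in line:
            continue  # skip blank or unexpected lines
        ext_id, version = line.rsplit("@", 1)
        found[ext_id.lower()] = version
    return found


def audit_extensions(
    installed: Dict[str, str],
    allowlist: Set[str],
    blocked_versions: Set[Tuple[str, str]],
) -> List[str]:
    """Return findings for blocked versions and extensions nobody reviewed."""
    findings: List[str] = []
    for ext_id, version in sorted(installed.items()):
        if (ext_id, version) in blocked_versions:
            findings.append(f"BLOCKED {ext_id}@{version}")
        elif ext_id not in allowlist:
            findings.append(f"UNREVIEWED {ext_id}@{version}")
    return findings


# Hypothetical data: placeholder IDs, not real marketplace identifiers.
SAMPLE_LISTING = """\
example.build-console@18.95.0
example.linter@2.4.1
example.theme@1.0.0
"""
ALLOWLIST = {"example.build-console", "example.linter"}
BLOCKED = {("example.build-console", "18.95.0")}

if __name__ == "__main__":
    installed = parse_extensions(SAMPLE_LISTING)
    for finding in audit_extensions(installed, ALLOWLIST, BLOCKED):
        print(finding)
```

The design choice worth noticing is that an allowlisted extension can still be blocked at a specific version. Trusting the name is not the same as trusting every update.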
The Megalodon story is the CI/CD version of the same thing. StepSecurity reported more than 5,500 public repositories were hit with malware-laden commits and GitHub Actions workflow abuse. That is not just GitHub drama. It is a reminder that CI/CD is where trust becomes artifacts.
A workflow with cloud credentials is not just a test runner. A workflow with signing keys is not just automation. A workflow with package publishing rights is a release system. If that workflow can be modified by a poisoned commit, then the release path is part of the attack surface.
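A minimal sketch of a workflow permissions check, assuming you already have the workflow file contents as text. It is a line-based heuristic, not a YAML parser, so treat it as a first-pass filter rather than a real policy engine. The sample workflows are invented for the example.

```python
import re
from typing import List

# A `permissions:` key at column 0 is the workflow-level setting.
TOP_LEVEL_PERMISSIONS = re.compile(r"^permissions:[ \t]*(.*)$", re.MULTILINE)
# `permissions: write-all` at any indentation (workflow level or job level).
ANY_WRITE_ALL = re.compile(
    r"^\s*permissions:[ \t]*write-all[ \t]*(#.*)?$", re.MULTILINE
)


def audit_workflow(text: str) -> List[str]:
    """Flag workflows with no explicit token scope or with write-all."""
    findings: List[str] = []
    if not TOP_LEVEL_PERMISSIONS.search(text):
        findings.append("no top-level permissions block")
    if ANY_WRITE_ALL.search(text):
        findings.append("permissions: write-all")
    return findings


# Hypothetical workflow snippets for illustration only.
RISKY = """\
name: ci
on: [push]
permissions: write-all
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: make test
"""

UNSCOPED = """\
name: ci
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: make test
"""

SCOPED = """\
name: ci
on: [push]
permissions:
  contents: read
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - run: make test
"""

if __name__ == "__main__":
    for name, text in [("risky", RISKY), ("unscoped", UNSCOPED), ("scoped", SCOPED)]:
        print(name, audit_workflow(text))
```

A workflow without an explicit block falls back to whatever default token permissions the repository or organization has set, which is why the sketch flags the absence and not only the obviously broad setting.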
That is the mental model shift I keep coming back to.
Developer tooling is not around production anymore. It is one of the paths into production.
The Railway outage is the architecture version of this. Railway’s incident report said Google Cloud incorrectly suspended its production account, taking out Railway’s API, dashboard, control plane, databases, and GCP-hosted compute infrastructure. Railway also explained that workloads on Railway Metal and AWS initially stayed up, but their edge proxies depended on a GCP-hosted control plane API for routing data. Once route caches expired, workloads outside GCP became unreachable too.
That is the kind of failure that cuts through the comfortable version of multi-cloud.
Multi-cloud on a diagram is not the same thing as multi-cloud resilience.
You can have AWS, GCP, metal, edge proxies, and nice arrows all over the architecture diagram. But if the routing control plane lives behind one provider account, that provider account is still in the hot path. If failover depends on a centralized identity system, that identity system is in the hot path. If emergency deploys depend on the same CI platform that is down, that platform is in the hot path.
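One way to make the hot path concrete is to write down request-time dependencies and walk them. This is a toy model, not Railway's actual topology: the component names and where each one runs are assumptions made up for the example.

```python
from typing import Dict, List, Set

# Hypothetical model: what each component needs in order to serve a request.
REQUEST_TIME_DEPS: Dict[str, List[str]] = {
    "workload-on-aws": ["edge-proxy"],
    "edge-proxy": ["routing-table"],
    "routing-table": ["control-plane-api"],
    "control-plane-api": ["control-plane-db"],
    "control-plane-db": [],
}

# Hypothetical placement of each component.
HOSTED_ON: Dict[str, str] = {
    "workload-on-aws": "aws",
    "edge-proxy": "metal",
    "routing-table": "metal",
    "control-plane-api": "gcp",
    "control-plane-db": "gcp",
}


def hot_path(component: str, deps: Dict[str, List[str]]) -> Set[str]:
    """Return every component reachable through request-time dependencies."""
    seen: Set[str] = set()
    stack = [component]
    while stack:
        current = stack.pop()
        for dep in deps.get(current, []):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    return seen


def providers_in_hot_path(
    component: str, deps: Dict[str, List[str]], hosted_on: Dict[str, str]
) -> Set[str]:
    """Return every provider that must be healthy for this component to serve."""
    needed = hot_path(component, deps) | {component}
    return {hosted_on[name] for name in needed}


if __name__ == "__main__":
    print(sorted(providers_in_hot_path("workload-on-aws", REQUEST_TIME_DEPS, HOSTED_ON)))
```

In this toy model the AWS workload still needs GCP to be healthy, because the dependency walk reaches the control plane. The diagram says three providers. The graph says one shared point of failure.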
The question is not “how many providers do we use?”
The question is “what has to work during failure?”
That is a harder question, but it is the only one that matters.
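To see why cached state only buys time, here is a small simulation of a route cache in front of a control plane that goes away. Everything in it is hypothetical: the class names, the 60-second TTL, and the serve-stale option are illustration, not something Railway's report describes as their design or their fix.

```python
from typing import Callable, Dict, List, Optional, Tuple


class RouteCache:
    """Caches service -> target lookups from a control plane for ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str], str],
        ttl_seconds: float,
        serve_stale_on_error: bool = False,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._serve_stale = serve_stale_on_error
        self._entries: Dict[str, Tuple[str, float]] = {}

    def lookup(self, service: str, now: float) -> Optional[str]:
        entry = self._entries.get(service)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # fresh hit, control plane is not consulted
        try:
            target = self._fetch(service)
        except ConnectionError:
            if self._serve_stale and entry is not None:
                return entry[0]  # expired, but better than nothing
            return None  # workload is running but unreachable
        self._entries[service] = (target, now)
        return target


class FakeControlPlane:
    """Stand-in for a remote routing API that can be switched off."""

    def __init__(self) -> None:
        self.up = True

    def route_for(self, service: str) -> str:
        if not self.up:
            raise ConnectionError("control plane unreachable")
        return "10.0.0.7:8080"


def simulate(serve_stale: bool) -> List[Optional[str]]:
    plane = FakeControlPlane()
    cache = RouteCache(plane.route_for, ttl_seconds=60, serve_stale_on_error=serve_stale)
    results = [cache.lookup("api", now=0)]       # cache populated
    plane.up = False                             # provider account disappears
    results.append(cache.lookup("api", now=30))  # still inside the TTL
    results.append(cache.lookup("api", now=61))  # TTL expired
    return results


if __name__ == "__main__":
    print("strict:", simulate(serve_stale=False))
    print("stale ok:", simulate(serve_stale=True))
```

Serving stale routes is a trade-off, not a free win. A stale entry can point at an instance that no longer exists, so whether it is acceptable depends on how fast your routing data actually changes.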
Discord’s voice outage postmortem is the distributed systems version. I really liked that writeup because it showed the difference between a routine infrastructure change and the system behavior that change produced. Discord described a Kubernetes migration where terminating too many session management pods in one zone dropped about 17 percent of active sessions. That triggered message floods, reconnect behavior, rate limit pressure, memory spikes, gateway issues, and voice/video routing problems.
That is why I like saying real outages are often interaction failures.
Kubernetes did something understandable. Session handoff existed. Rate limits existed. Downstream systems were designed for normal load. But the shape of the change created a workload the system was not ready for.
That is the part “just autoscale it” misses.
Sometimes the bottleneck is not CPU. Sometimes it is mailbox length, fanout, retries, reconnection behavior, a queue, or the helper service that gets buried trying to clean up the mess. Graceful shutdown is not just a pod lifecycle setting. It is a system behavior.
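Two quick pieces of arithmetic help here. The first reproduces the blast radius from the postmortem numbers mentioned in the episode: half the pods in one of three zones. The second is a sketch of why spreading reconnects over a window lowers the peak. The jitter window and client count are assumptions for illustration, and jittered reconnects are a common general technique, not something taken from Discord's writeup.

```python
import random
from collections import Counter


def dropped_fraction(zones: int, fraction_terminated_in_zone: float) -> float:
    """Fraction of all sessions lost when part of one evenly loaded zone dies."""
    return fraction_terminated_in_zone / zones


def peak_reconnects_per_second(clients: int, spread_seconds: float, seed: int = 0) -> int:
    """Peak reconnects in any one-second bucket when clients pick a random delay.

    spread_seconds=0 models every client reconnecting immediately.
    """
    rng = random.Random(seed)
    buckets = Counter(int(rng.uniform(0, spread_seconds)) for _ in range(clients))
    return max(buckets.values())


if __name__ == "__main__":
    fraction = dropped_fraction(zones=3, fraction_terminated_in_zone=0.5)
    print(f"dropped sessions: {fraction:.1%}")

    clients = 10_000  # hypothetical number of dropped sessions
    print("peak, no jitter:", peak_reconnects_per_second(clients, spread_seconds=0))
    print("peak, 30s jitter:", peak_reconnects_per_second(clients, spread_seconds=30))
```

The point of the second function is the shape, not the exact numbers. The same amount of work arrives either way, but downstream queues see a very different peak.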
AWS changing SDK retry behavior is the boring version of the same idea. And boring is not an insult here. Boring is usually where the production risk hides.
AWS is updating retry behavior across SDKs and tools, with opt-in available now and defaults changing in November 2026. The changes affect standard and adaptive retry modes, retry quotas, backoff behavior, throttling behavior, and how transient errors are treated.
That sounds like documentation furniture until you remember retries shape how your app behaves during partial failure.
Your app might think it is “calling S3 once.” The SDK may actually be deciding how long to wait, how many times to retry, how much pressure to apply, and when to fail fast. During a service-side problem, that hidden behavior can affect latency, thread pools, connection usage, downstream load, and customer-visible errors.
Retries are invisible infrastructure.
They can protect you from transient failure, and they can also help create a client-side storm during a partial outage. Both are true. That is why this is worth testing before the default changes.
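Here is a minimal sketch of the retry quota idea: a shared token bucket that retries have to pay into, so a sustained outage drains it and later requests fail fast. The capacity, cost, and refund values are invented for the example and are not the numbers any AWS SDK uses. Backoff sleeps are left out to keep the focus on the quota.

```python
from typing import Callable, Tuple


class RetryQuota:
    """Shared budget for retries across all requests made by one client."""

    def __init__(self, capacity: int, retry_cost: int, success_refund: int) -> None:
        self.capacity = capacity
        self.available = capacity
        self.retry_cost = retry_cost
        self.success_refund = success_refund

    def try_acquire(self) -> bool:
        """Pay for one retry; return False if the budget is exhausted."""
        if self.available < self.retry_cost:
            return False
        self.available -= self.retry_cost
        return True

    def on_success(self) -> None:
        """Successful calls slowly refill the budget, up to capacity."""
        self.available = min(self.capacity, self.available + self.success_refund)


def call_with_retries(
    operation: Callable[[], bool], quota: RetryQuota, max_attempts: int
) -> Tuple[bool, int]:
    """Return (succeeded, attempts_made). operation returns True on success."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        if operation():
            quota.on_success()
            return True, attempts
        # Stop when attempts are used up or the shared quota cannot pay.
        if attempts == max_attempts or not quota.try_acquire():
            break
    return False, attempts


def total_attempts_during_outage(requests: int, quota: RetryQuota, max_attempts: int) -> int:
    """Count calls sent to a service that fails every time."""
    return sum(
        call_with_retries(lambda: False, quota, max_attempts)[1]
        for _ in range(requests)
    )


if __name__ == "__main__":
    small = RetryQuota(capacity=20, retry_cost=5, success_refund=1)
    huge = RetryQuota(capacity=10_000, retry_cost=5, success_refund=1)
    print("with tight quota:", total_attempts_during_outage(5, small, max_attempts=3))
    print("effectively no quota:", total_attempts_during_outage(5, huge, max_attempts=3))
```

With the tight quota, the first couple of requests retry and the rest give up after one attempt. That is the fail-fast behavior described above, and it is also exactly the kind of change that can surprise an app that quietly relied on every request getting its full set of retries.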
The RabbitMQ AWS plugin bug is the plugin version of trusted-tool risk. AWS published CVE-2026-9133 for an arbitrary file read in the rabbitmq-aws plugin, caused by debug code accidentally shipped in production builds. The plugin can fetch things like TLS certificates, private keys, passwords, and other secrets from AWS services, so a file read bug in that process is not just “some plugin issue.” It is a secrets and blast-radius issue.
Debug code in production should always make people stop blinking for a second.
Debug paths often bypass the clean shape of the system. They inspect directly. They read directly. They validate differently. They exist for convenience during development, and convenience is exactly what you do not want exposed to a real user or attacker.
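As a sketch of the defensive side, here is what strict validation of a secret reference might look like: accept only full ARNs for an explicit set of services and reject everything else. This is not the plugin's code. The allowed services and the rejected example strings are assumptions made up for the example, and the bulletin's actual debug scheme is not reproduced here.

```python
# Hypothetical allowlist of AWS services a resolver is willing to read from.
ALLOWED_SERVICES = {"secretsmanager", "ssm", "acm"}
KNOWN_PARTITIONS = {"aws", "aws-cn", "aws-us-gov"}


def validate_secret_reference(ref: str) -> str:
    """Return ref unchanged if it is an ARN for an allowed service, else raise.

    ARN layout: arn:partition:service:region:account-id:resource
    """
    parts = ref.split(":", 5)
    if len(parts) != 6 or parts[0] != "arn":
        raise ValueError("reference must be a full ARN")
    partition, service = parts[1], parts[2]
    if partition not in KNOWN_PARTITIONS:
        raise ValueError(f"unknown partition: {partition!r}")
    if service not in ALLOWED_SERVICES:
        raise ValueError(f"service not allowed: {service!r}")
    return ref


if __name__ == "__main__":
    candidates = [
        "arn:aws:secretsmanager:us-east-1:123456789012:secret:rabbit/tls-key",
        "file:///etc/passwd",            # hypothetical non-ARN reference
        "arn:aws:debug:::/etc/passwd",   # hypothetical debug-style service
    ]
    for ref in candidates:
        try:
            validate_secret_reference(ref)
            print("accepted:", ref)
        except ValueError as exc:
            print("rejected:", ref, "->", exc)
```

The design choice is an allowlist, not a denylist. A debug path that somebody forgot about fails closed because nobody added it to the list.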
The Bedrock Reddit story adds the cost angle. It is not a formal incident report, so I would not treat it the same way as an AWS bulletin. But as a pattern, it is very believable: exposed cloud keys plus AI services can become a fast money fire. A compromised credential used to often mean crypto mining, data access, or infrastructure abuse. Now it can also mean model inference, agent workflows, or API calls burning through money at a rate that makes finance start typing in all caps.
That is where security and FinOps are starting to overlap more directly.
If a key can spend money, it is a financial control too.
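A deliberately simple sketch of spend anomaly detection: compare today's spend for one service against a multiple of its recent median. The history values, the multiplier, and the floor are made up. The $14,000 figure is the one from the Reddit report.

```python
from statistics import median
from typing import Sequence


def is_spend_anomalous(
    history: Sequence[float],
    today: float,
    multiplier: float = 5.0,
    floor: float = 50.0,
) -> bool:
    """Flag today's spend if it exceeds max(floor, multiplier * median(history)).

    The floor avoids paging on tiny absolute amounts for a rarely used service.
    """
    if not history:
        return today > floor
    threshold = max(floor, multiplier * median(history))
    return today > threshold


if __name__ == "__main__":
    recent_daily_spend = [12.0, 9.5, 11.0, 10.5]  # hypothetical baseline in USD
    print(is_spend_anomalous(recent_daily_spend, today=20.0))
    print(is_spend_anomalous(recent_daily_spend, today=14_000.0))
```

Real anomaly detection is more careful than this, but even a crude per-service threshold catches a spike of that size long before the end of the billing day, provided someone is actually looking at the alert.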
The lightning round all fits under that same theme.
OpenTelemetry graduating from the CNCF is a huge milestone, but the collector is still production plumbing. Graduation does not mean every collector upgrade is safe or every telemetry pipeline is boring. The thing that observes production can still break production.
The Claude Code RCE story is another reminder that AI coding tools are not just editors. If they have filesystem access, repo context, terminal access, commands, deeplinks, and workflow integration, then they are part of the developer execution environment. That needs a threat model.
GitLab Secrets Manager moving into public beta is interesting because it brings secrets closer to the CI/CD system where a lot of secrets risk actually lives. It does not solve every secrets problem, but it is directionally right. Pipeline credentials should be treated as first-class production risk, not a pile of masked variables everyone hopes are fine.
Google Cloud AI spend caps are useful, but they are also a reliability design question. A hard cap can prevent a surprise bill. It can also pause API traffic if your application depends on that AI service. That means a spend cap is not just a FinOps control. It can become an availability behavior.
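If a cap can stop traffic, the application should decide ahead of time what it does as spend approaches the limit. Here is a small sketch of that decision on the application side. The thresholds, the function names, and the fallback message are assumptions for illustration and have nothing to do with how Google Cloud implements its caps.

```python
from enum import Enum
from typing import Callable


class SpendAction(Enum):
    ALLOW = "allow"
    ALERT = "alert"      # keep serving, but tell a human
    DEGRADE = "degrade"  # stop calling the paid API, serve a fallback


def spend_action(spend_so_far: float, hard_cap: float, alert_fraction: float = 0.8) -> SpendAction:
    """Decide what to do based on spend relative to the cap."""
    if hard_cap <= 0:
        raise ValueError("hard_cap must be positive")
    if spend_so_far >= hard_cap:
        return SpendAction.DEGRADE
    if spend_so_far >= alert_fraction * hard_cap:
        return SpendAction.ALERT
    return SpendAction.ALLOW


FALLBACK_MESSAGE = "AI summary is temporarily unavailable."


def answer(
    question: str,
    spend_so_far: float,
    hard_cap: float,
    call_model: Callable[[str], str],
) -> str:
    """Call the model unless the cap is reached; otherwise degrade gracefully."""
    if spend_action(spend_so_far, hard_cap) is SpendAction.DEGRADE:
        return FALLBACK_MESSAGE
    return call_model(question)


if __name__ == "__main__":
    fake_model = lambda q: f"model answer to: {q}"  # stand-in for a paid API call
    print(answer("status?", spend_so_far=10.0, hard_cap=100.0, call_model=fake_model))
    print(answer("status?", spend_so_far=100.0, hard_cap=100.0, call_model=fake_model))
```

The alert threshold exists so that the first time anyone hears about the cap is not the moment the feature goes dark.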
The Redshift Python driver RCE is a reminder that clients are part of the execution boundary too. AWS said versions 2.1.13 and earlier could allow a rogue server or man-in-the-middle to execute arbitrary code on the client. That is not “just a driver.” It is code running somewhere important, trusting a remote endpoint.
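A quick sketch for checking whether an environment is on an affected driver version, using the "2.1.13 and earlier" boundary from the bulletin as described above. The distribution name `redshift_connector` is my assumption about how the driver is installed in your environment, and the version parsing is intentionally naive.

```python
import re
from importlib import metadata
from typing import Optional, Tuple

# Last affected version according to the bulletin as summarized in these notes.
LAST_AFFECTED: Tuple[int, int, int] = (2, 1, 13)


def parse_version(version: str) -> Tuple[int, int, int]:
    """Extract the first three numeric components, padding with zeros."""
    numbers = [int(n) for n in re.findall(r"\d+", version)[:3]]
    numbers += [0] * (3 - len(numbers))
    return (numbers[0], numbers[1], numbers[2])


def is_affected(version: str) -> bool:
    """True if the version is at or below the last affected release."""
    return parse_version(version) <= LAST_AFFECTED


def installed_driver_version(distribution: str = "redshift_connector") -> Optional[str]:
    """Return the installed version, or None if the package is not present."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_driver_version()
    if version is None:
        print("driver not installed in this environment")
    else:
        print(version, "affected" if is_affected(version) else "not affected")
```

Running this across build images and notebooks is usually more revealing than checking one service, because client libraries tend to get pinned once and forgotten.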
The common thread is trust.
Modern systems are built out of trusted tools. They have to be. You cannot run everything from scratch, manually inspect every package, manually deploy every change, manually parse every log line, manually route every request, and manually retry every call. That is not engineering. That is punishment with YAML.
But trust needs visibility.
What does the tool have access to?
What can it change?
What happens if it is compromised?
What happens if it disappears?
What happens if it retries differently?
What happens when cached state expires?
What happens if the workflow runs on a poisoned commit?
What happens if the plugin can read files it was never supposed to expose?
That is not paranoia. That is operational hygiene.
The staff and principal engineer job is often about seeing these hidden dependency shapes before they become incident writeups. Noticing when a dev tool is actually a production path. When a retry default is outage behavior. When a multi-cloud architecture still has one hot dependency. When a telemetry collector is availability-sensitive. When a CI workflow is a release system. When a “temporary” credential is now an archaeological artifact with admin rights.
The takeaway is not to stop trusting tools.
The takeaway is to make trust visible.
Map permissions. Review workflows. Scope credentials. Test failure paths. Patch clients. Constrain plugins. Treat CI/CD as a release system. Treat developer workstations as production-adjacent. Treat retry behavior like part of your reliability model. Treat cloud spend controls like they can affect availability.
Because they can.
Trusted tools are not automatically safe.
They are just familiar.
And familiar is exactly why we stop looking closely.
Extra links worth including on the episode page:
GitHub internal repositories breached via malicious Nx Console VS Code extension
https://thehackernews.com/2026/05/github-internal-repositories-breached.html
OpenTelemetry graduates from the CNCF
https://opentelemetry.io/blog/2026/otel-graduates/
Claude Code RCE flaw
https://devops.com/attackers-can-exploit-a-claude-code-rce-flaw-to-take-command-of-system/
GitLab Secrets Manager public beta
https://about.gitlab.com/blog/secrets-manager-in-public-beta/
Google Cloud AI spend caps
https://cloud.google.com/blog/topics/cost-management/introducing-spend-caps-ai-cost-visibility-next26
Redshift Python driver CVE-2026-8838
https://aws.amazon.com/security/security-bulletins/2026-033-aws/
AWS Bedrock cost spike Reddit thread
https://www.reddit.com/r/aws/comments/1tm3ydo/aws_bedrock_cost_spike_14000_usd/