This episode of Ship It Weekly discusses the growing risks of trusted tools in production, highlighted by a GitHub supply chain attack involving a compromised VS Code extension.
Host Commentary
This episode is really about one idea: trusted tools become risky when they become invisible.
Most production incidents are not caused by some mysterious system nobody has ever heard of. A lot of the time, the scary part is something familiar. A developer extension. A CI workflow. A cloud provider account. A control plane API. An SDK retry default. A plugin. A collector. A database driver. The thing everyone uses, nobody reviews closely, and everyone assumes is fine because it was fine yesterday.
That is what stood out to me this week.
The GitHub supply chain stories are the cleanest example. The Nx Console VS Code extension compromise was not just “an extension went bad.” It was a reminder that developer tooling sits right next to source code, terminals, tokens, cloud credentials, package publishing paths, and CI/CD systems. StepSecurity reported that Nx Console version 18.95.0 included malicious code targeting developer credentials, cloud infrastructure tokens, and CI/CD secrets. The Hacker News also reported that GitHub confirmed internal repositories were exfiltrated after an employee device was compromised through the poisoned extension.
That makes the developer workstation part of the production attack surface.
Not because it serves customer traffic. It usually does not. But because it touches nearly everything that eventually becomes production. Source code. Deploy paths. Secrets. Cloud access. Build systems. Package publishing. Kubeconfigs. SSH keys. Internal docs.
A popular extension with auto-update, broad workspace access, and a trusted brand name is not “just an editor add-on” anymore. It is code running inside a high-trust environment.
That does not mean the answer is “never install extensions.” That is not realistic. Modern engineering depends on tooling. The better answer is to stop treating dev tools as casual personal preference once they can reach production-adjacent systems. Extension allowlists, endpoint monitoring, token hygiene, short-lived credentials, and real review around high-trust tools all matter more than they used to.
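As a rough illustration of what an extension allowlist check could look like, here is a minimal sketch. It assumes you can get a listing of installed extensions as one `publisher.name@version` entry per line (for example from `code --list-extensions --show-versions`, which is assumed here to print that shape). The extension IDs, versions, and the `audit` helper are hypothetical placeholders, not recommendations.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical allowlist: extension ID -> approved versions.
# These IDs and versions are placeholders for illustration only.
ALLOWLIST: Dict[str, Set[str]] = {
    "example-publisher.linter": {"1.4.2", "1.4.3"},
    "example-publisher.theme": {"2.0.0"},
}


def parse_installed(listing: str) -> List[Tuple[str, str]]:
    """Parse lines of the assumed form 'publisher.name@version'."""
    installed = []
    for line in listing.splitlines():
        line = line.strip()
        if not line or "@" not in line:
            continue  # skip blanks and anything without a version
        ext_id, version = line.rsplit("@", 1)
        installed.append((ext_id.lower(), version))
    return installed


def audit(listing: str, allowlist: Dict[str, Set[str]] = ALLOWLIST) -> List[str]:
    """Return one finding per extension or version that is not approved."""
    findings = []
    for ext_id, version in parse_installed(listing):
        approved = allowlist.get(ext_id)
        if approved is None:
            findings.append(f"{ext_id}: not on allowlist")
        elif version not in approved:
            findings.append(f"{ext_id}: version {version} not approved")
    return findings
```

Pinning approved versions, not just names, is the part that matters for an auto-updating extension: a trusted name with a new, unreviewed version still shows up as a finding.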
The Megalodon story is the CI/CD version of the same thing. StepSecurity reported more than 5,500 public repositories were hit with malware-laden commits and GitHub Actions workflow abuse. That is not just GitHub drama. It is a reminder that CI/CD is where trust becomes artifacts.
A workflow with cloud credentials is not just a test runner. A workflow with signing keys is not just automation. A workflow with package publishing rights is a release system. If that workflow can be modified by a poisoned commit, then the release path is part of the attack surface.
That is the mental model shift I keep coming back to.
Developer tooling is not around production anymore. It is one of the paths into production.
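One way to make "this workflow is really a release system" visible is to scan workflow files for the things that give them that power. The sketch below is a deliberately naive text scan, assuming GitHub Actions syntax for secret references (`${{ secrets.NAME }}`) and `scope: write` permission lines. The sample workflow and the `release_signals` helper are hypothetical, and a real review would parse the YAML properly.

```python
import re
from typing import List

# Matches GitHub Actions secret references such as ${{ secrets.NPM_TOKEN }}.
SECRET_REF = re.compile(r"\$\{\{\s*secrets\.([A-Za-z_][A-Za-z0-9_]*)\s*\}\}")
# Matches permission lines such as "  id-token: write".
WRITE_PERMISSION = re.compile(r"^[ \t]*([a-z-]+):[ \t]*write[ \t]*$", re.MULTILINE)

# Hypothetical workflow used only to exercise the scan.
SAMPLE_WORKFLOW = """
name: publish
on: push
permissions:
  contents: read
  id-token: write
jobs:
  release:
    runs-on: ubuntu-latest
    steps:
      - run: npm publish
        env:
          NODE_AUTH_TOKEN: ${{ secrets.NPM_TOKEN }}
"""


def release_signals(workflow_text: str) -> List[str]:
    """List the secrets and write permissions a workflow text mentions."""
    signals = []
    for name in sorted(set(SECRET_REF.findall(workflow_text))):
        signals.append(f"uses secret {name}")
    for scope in sorted(set(WRITE_PERMISSION.findall(workflow_text))):
        signals.append(f"requests write permission: {scope}")
    return signals


if __name__ == "__main__":
    for signal in release_signals(SAMPLE_WORKFLOW):
        print(signal)
```

Any workflow that returns a non-empty list is a candidate for the same review you would give a deploy script.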
The Railway outage is the architecture version of this. Railway’s incident report said Google Cloud incorrectly suspended its production account, taking out Railway’s API, dashboard, control plane, databases, and GCP-hosted compute infrastructure. Railway also explained that workloads on Railway Metal and AWS initially stayed up, but their edge proxies depended on a GCP-hosted control plane API for routing data. Once route caches expired, workloads outside GCP became unreachable too.
That is the kind of failure that cuts through the comfortable version of multi-cloud.
Multi-cloud on a diagram is not the same thing as multi-cloud resilience.
You can have AWS, GCP, metal, edge proxies, and nice arrows all over the architecture diagram. But if the routing control plane lives behind one provider account, that provider account is still in the hot path. If failover depends on a centralized identity system, that identity system is in the hot path. If emergency deploys depend on the same CI platform that is down, that platform is in the hot path.
The question is not “how many providers do we use?”
The question is “what has to work during failure?”
That is a harder question, but it is the only one that matters.
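The route cache detail from the Railway report is worth sketching, because it shows how a data plane can look independent right up until a TTL runs out. This is a toy model, not Railway's implementation: the `RouteCache` class, the `serve_stale_on_error` flag, and the `ControlPlaneDown` exception are all hypothetical names used for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Tuple


class ControlPlaneDown(Exception):
    """Raised by the fetch function when the control plane is unreachable."""


@dataclass
class RouteCache:
    fetch: Callable[[str], str]        # asks the control plane for a backend
    ttl_seconds: float
    serve_stale_on_error: bool = False
    _entries: Dict[str, Tuple[str, float]] = field(default_factory=dict)

    def resolve(self, host: str, now: float) -> Optional[str]:
        cached = self._entries.get(host)
        if cached and now < cached[1]:
            return cached[0]           # fresh entry, no control plane call
        try:
            backend = self.fetch(host)
        except ControlPlaneDown:
            if cached and self.serve_stale_on_error:
                return cached[0]       # keep routing on the last known answer
            return None                # workload is now unreachable
        self._entries[host] = (backend, now + self.ttl_seconds)
        return backend
```

With the flag off, a healthy backend becomes unreachable as soon as its entry expires during a control plane outage. Serving stale routes is not free either, since stale data can point at backends that have moved, so it is a design decision to make on purpose rather than discover during an incident.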
Discord’s voice outage postmortem is the distributed systems version. I really liked that writeup because it showed the difference between a routine infrastructure change and the system behavior that change produced. Discord described a Kubernetes migration where terminating too many session management pods in one zone dropped about 17 percent of active sessions. That triggered message floods, reconnect behavior, rate limit pressure, memory spikes, gateway issues, and voice/video routing problems.
That is why I like saying real outages are often interaction failures.
Kubernetes did something understandable. Session handoff existed. Rate limits existed. Downstream systems were designed for normal load. But the shape of the change created a workload the system was not ready for.
That is the part “just autoscale it” misses.
Sometimes the bottleneck is not CPU. Sometimes it is mailbox length, fanout, retries, reconnection behavior, a queue, or the helper service that gets buried trying to clean up the mess. Graceful shutdown is not just a pod lifecycle setting. It is a system behavior.
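A back-of-the-envelope way to reason about the shape of a drain is to compare the reconnect burst each batch creates with what the reconnect path can absorb. The sketch below is simple arithmetic under hypothetical numbers. It assumes sessions are spread evenly across pods and ignores retries, fanout, and queue growth, which are exactly the things that made the real incident worse.

```python
import math
from typing import Dict


def drain_plan(pods: int, sessions_per_pod: int, batch_size: int,
               reconnects_per_second: int) -> Dict[str, float]:
    """Estimate the reconnect burst created by draining pods in batches."""
    batch_size = min(batch_size, pods)
    batches = math.ceil(pods / batch_size)
    # Sessions dropped at the same moment when one batch is terminated.
    peak_dropped = batch_size * sessions_per_pod
    # Time for the reconnect path to absorb one batch at its rate limit.
    seconds_to_absorb = peak_dropped / reconnects_per_second
    return {
        "batches": batches,
        "peak_dropped_sessions": peak_dropped,
        "seconds_to_absorb_each_batch": seconds_to_absorb,
    }


if __name__ == "__main__":
    # Hypothetical zone: 20 pods, 10,000 sessions each, 5,000 reconnects/s.
    print(drain_plan(20, 10_000, batch_size=20, reconnects_per_second=5_000))
    print(drain_plan(20, 10_000, batch_size=2, reconnects_per_second=5_000))
```

With those made-up numbers, draining all 20 pods at once drops 200,000 sessions together and leaves a 40 second reconnect backlog, while batches of two drop 20,000 at a time and clear in about 4 seconds each. Same change, very different workload.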
AWS changing SDK retry behavior is the boring version of the same idea. And boring is not an insult here. Boring is usually where the production risk hides.
AWS is updating retry behavior across SDKs and tools, with opt-in available now and defaults changing in November 2026. The changes affect standard and adaptive retry modes, retry quotas, backoff behavior, throttling behavior, and how transient errors are treated.
That sounds like documentation furniture until you remember retries shape how your app behaves during partial failure.
Your app might think it is “calling S3 once.” The SDK may actually be deciding how long to wait, how many times to retry, how much pressure to apply, and when to fail fast. During a service-side problem, that hidden behavior can affect latency, thread pools, connection usage, downstream load, and customer-visible errors.
Retries are invisible infrastructure.
They can protect you from transient failure, and they can also help create a client-side storm during a partial outage. Both are true. That is why this is worth testing before the default changes.
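To make that hidden behavior concrete, here is a generic sketch of the two ideas that usually sit inside a retry policy: exponential backoff with jitter, and a shared retry quota that stops clients from piling on during a partial outage. This is not the AWS SDK implementation. The class names, token costs, and defaults are hypothetical and chosen only to keep the example small.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure such as a throttle or timeout."""


class RetryQuota:
    """Token bucket shared across calls: retries spend, successes refill."""

    def __init__(self, capacity: int = 10, retry_cost: int = 5, refill: int = 1):
        self.capacity = capacity
        self.tokens = capacity
        self.retry_cost = retry_cost
        self.refill = refill

    def try_acquire(self) -> bool:
        if self.tokens < self.retry_cost:
            return False  # quota exhausted: fail fast instead of retrying
        self.tokens -= self.retry_cost
        return True

    def on_success(self) -> None:
        self.tokens = min(self.capacity, self.tokens + self.refill)


def backoff_delay(attempt: int, base: float = 0.1, cap: float = 5.0) -> float:
    """Full jitter: a random delay between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(op: Callable[[], T], quota: RetryQuota,
                      max_attempts: int = 3,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Run op, retrying transient errors while attempts and quota allow."""
    attempt = 0
    while True:
        try:
            result = op()
        except TransientError:
            attempt += 1
            if attempt >= max_attempts or not quota.try_acquire():
                raise
            sleep(backoff_delay(attempt - 1))
        else:
            quota.on_success()
            return result
```

The quota is the piece that changes outage behavior. With these toy settings, one failing call gets its full three attempts, and the next failing call gets only one because the bucket is empty. That is the kind of difference worth observing in a test environment before a default changes underneath you.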
The RabbitMQ AWS plugin bug is the plugin version of trusted-tool risk. AWS published CVE-2026-9133 for an arbitrary file read in the rabbitmq-aws plugin, caused by debug code accidentally shipped in production builds. The plugin can fetch things like TLS certificates, private keys, passwords, and other secrets from AWS services, so a file read bug in that process is not just “some plugin issue.” It is a secrets and blast-radius issue.
Debug code in production should always make people stop blinking for a second.
Debug paths often bypass the clean shape of the system. They inspect directly. They read directly. They validate differently. They exist for convenience during development, and convenience is exactly what you do not want exposed to a real user or attacker.
The Bedrock Reddit story adds the cost angle. It is not a formal incident report, so I would not treat it the same way as an AWS bulletin. But as a pattern, it is very believable: exposed cloud keys plus AI services can become a fast money fire. A compromised credential often used to mean crypto mining, data access, or infrastructure abuse. Now it can also mean model inference, agent workflows, or API calls burning through money at a rate that makes finance start typing in all caps.
That is where security and FinOps are starting to overlap more directly.
If a key can spend money, it is a financial control too.
The lightning round all fits under that same theme.
OpenTelemetry graduating from the CNCF is a huge milestone, but the collector is still production plumbing. Graduation does not mean every collector upgrade is safe or every telemetry pipeline is boring. The thing that observes production can still break production.
The Claude Code RCE story is another reminder that AI coding tools are not just editors. If they have filesystem access, repo context, terminal access, commands, deeplinks, and workflow integration, then they are part of the developer execution environment. That needs a threat model.
GitLab Secrets Manager moving into public beta is interesting because it brings secrets closer to the CI/CD system where a lot of secrets risk actually lives. It does not solve every secrets problem, but it is directionally right. Pipeline credentials should be treated as first-class production risk, not a pile of masked variables everyone hopes are fine.
Google Cloud AI spend caps are useful, but they are also a reliability design question. A hard cap can prevent a surprise bill. It can also pause API traffic if your application depends on that AI service. That means a spend cap is not just a FinOps control. It can become an availability behavior.
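If a cap can pause traffic, the application needs a planned answer for what happens at the cap. The sketch below is a client-side illustration only, not how Google Cloud spend caps work: the `SpendGuard` class, the per-call cost, and the fallback are hypothetical. Costs are tracked in whole cents to avoid floating point drift.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SpendGuard:
    cap_cents: int
    spent_cents: int = 0

    def call(self, cost_cents: int, ai_call: Callable[[], str],
             fallback: Callable[[], str]) -> str:
        """Run ai_call if it fits under the cap, otherwise degrade."""
        if self.spent_cents + cost_cents > self.cap_cents:
            # Over budget: return a degraded answer instead of an outage.
            return fallback()
        self.spent_cents += cost_cents
        return ai_call()


if __name__ == "__main__":
    guard = SpendGuard(cap_cents=100)
    for _ in range(3):
        print(guard.call(40, lambda: "ai summary", lambda: "summary unavailable"))
```

The first two calls fit under the 100 cent cap and the third falls back. The interesting design work is deciding what that fallback is for each feature before the cap is ever hit.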
The Redshift Python driver RCE is a reminder that clients are part of the execution boundary too. AWS said versions 2.1.13 and earlier could allow a rogue server or man-in-the-middle to execute arbitrary code on the client. That is not “just a driver.” It is code running somewhere important, trusting a remote endpoint.
The common thread is trust.
Modern systems are built out of trusted tools. They have to be. You cannot run everything from scratch, manually inspect every package, manually deploy every change, manually parse every log line, manually route every request, and manually retry every call. That is not engineering. That is punishment with YAML.
But trust needs visibility.
What does the tool have access to?
What can it change?
What happens if it is compromised?
What happens if it disappears?
What happens if it retries differently?
What happens when cached state expires?
What happens if the workflow runs on a poisoned commit?
What happens if the plugin can read files it was never supposed to expose?
That is not paranoia. That is operational hygiene.
The staff and principal engineer job is often about seeing these hidden dependency shapes before they become incident writeups. Noticing when a dev tool is actually a production path. When a retry default is outage behavior. When a multi-cloud architecture still has one hot dependency. When a telemetry collector is availability-sensitive. When a CI workflow is a release system. When a “temporary” credential is now an archaeological artifact with admin rights.
The takeaway is not to stop trusting tools.
The takeaway is to make trust visible.
Map permissions. Review workflows. Scope credentials. Test failure paths. Patch clients. Constrain plugins. Treat CI/CD as a release system. Treat developer workstations as production-adjacent. Treat retry behavior like part of your reliability model. Treat cloud spend controls like they can affect availability.
Because they can.
Trusted tools are not automatically safe.
They are just familiar.
And familiar is exactly why we stop looking closely.
Extra links worth including on the episode page:
GitHub internal repositories breached via malicious Nx Console VS Code extension
https://thehackernews.com/2026/05/github-internal-repositories-breached.html
OpenTelemetry graduates from the CNCF
https://opentelemetry.io/blog/2026/otel-graduates/
Claude Code RCE flaw
https://devops.com/attackers-can-exploit-a-claude-code-rce-flaw-to-take-command-of-system/
GitLab Secrets Manager public beta
https://about.gitlab.com/blog/secrets-manager-in-public-beta/
Google Cloud AI spend caps
https://cloud.google.com/blog/topics/cost-management/introducing-spend-caps-ai-cost-visibility-next26
Redshift Python driver CVE-2026-8838
https://aws.amazon.com/security/security-bulletins/2026-033-aws/
Show Notes
This episode of Ship It Weekly is about trusted tools becoming production dependencies. Brian covers a rough GitHub supply chain week, including the compromised Nx Console VS Code extension tied to exposed GitHub internal repositories and the Megalodon campaign abusing GitHub Actions workflows across thousands of public repos.
The bigger thread this week is that the tools around production are increasingly part of production. Brian also covers Railway’s GCP account suspension outage, Discord’s voice outage during a Kubernetes migration, AWS changing SDK retry behavior, CVE-2026-9133 in the RabbitMQ AWS plugin, and a Reddit story about stolen AWS keys turning into a $14,000 Bedrock bill.
Brian also touches on OpenTelemetry graduating from the CNCF, Claude Code security risk, GitLab Secrets Manager, Google Cloud AI spend caps, and a Redshift Python driver RCE.
Full source list and extra links are available on this episode’s page at shipitweekly.fm.
Links
Nx Console compromise https://www.stepsecurity.io/blog/nx-console-vs-code-extension-compromised
Megalodon GitHub Actions attack https://www.stepsecurity.io/blog/megalodon-mass-github-actions-secret-exfiltration-across-5-500-public-repositories
Railway GCP outage https://blog.railway.com/p/incident-report-may-19-2026-gcp-account-outage
Discord voice outage https://discord.com/blog/behind-the-scenes-of-the-3-25-26-voice-outage
AWS SDK retry changes https://aws.amazon.com/blogs/developer/announcing-updated-retry-behavior-for-aws-sdks-and-tools/
RabbitMQ AWS plugin CVE-2026-9133 https://aws.amazon.com/security/security-bulletins/2026-034-aws/
AWS Bedrock cost spike Reddit thread https://www.reddit.com/r/aws/comments/1tm3ydo/aws_bedrock_cost_spike_14000_usd/
This week’s On Call Brief https://www.tellerstech.com/on-call-brief/2026-W22/
More episodes and show notes https://shipitweekly.fm/