GitHub Supply Chain Attacks, Railway’s GCP Outage, Discord’s Voice Failure, AWS Retry Changes, and Trusted Tool Risk

Transcript

Trusted tools are having a rough week. A popular

VS Code extension reportedly helped expose thousands

of GitHub internal repositories. Megalodon hit

thousands of public repos with poisoned commits

and GitHub Actions workflow abuse. Railway had

a platform-wide outage after Google Cloud incorrectly

suspended its production account. Discord dropped

17% of active sessions during a Kubernetes migration.

And AWS is changing SDK retry behavior, which

sounds boring until you remember retries are

how your app behaves when the world is already

on fire. The theme this week is simple. The tools

that we trust most are becoming some of our riskiest

production dependencies. I'm Brian Teller from

Teller's Tech, and this is Ship It Weekly. Welcome

back to Ship It Weekly, the show where we look

at DevOps, SRE, cloud, platform, and security

stories that matter when you are the person who

eventually has to keep the thing running. This

week, we're starting with GitHub supply chain

risk from two directions: a compromised VS Code

extension tied to a GitHub internal repo breach

and the Megalodon campaign abusing public repos

and CI/CD workflows. Then we'll talk about Railway's

GCP account suspension outage, Discord's voice

outage postmortem, AWS changing SDK retry defaults,

and a RabbitMQ AWS plugin issue that accidentally

shipped debug code into production builds. In

the lightning round, we'll hit OpenTelemetry

Graduation, Claude Code RCE, GitLab Secrets Manager,

Google Cloud AI Spend Caps, and a Redshift Python

driver RCE. And the human closer is about trusted

tools because the systems we trust most are often

the ones we forget to threat model. So let's

get into it. First up, GitHub had a rough supply

chain week. The developer toolchain got hit from

multiple directions. The first story is the Nx

Console VS Code extension compromise. Reporting

says a malicious version of the Nx Console

extension was published and that the extension

was later tied to a breach involving thousands

of GitHub internal repositories. StepSecurity

says the compromised version was

Nx Console 18.95.0 and that the root cause involved a contributor's

GitHub token being scraped in a prior supply

chain attack. The Hacker News reported that roughly

3,800 GitHub internal repositories were exposed

after a GitHub employee device was compromised

through the malicious extension. Now, I'm not

saying every VS Code extension is evil, but every

VS Code extension is still code running next

to the repos you clone, the terminal you use,

and sometimes the tokens you forgot were still

hanging around, which is not exactly a low-trust

environment. A lot of developers install extensions

like browser tabs. That looks useful. That has

a nice icon. That has a lot of installs. Sure,

why not? And normally that feels fine until an

extension update becomes an initial access path.

That is the real lesson. The developer workstation

is part of the production attack surface now,

not because it serves customer traffic, but because

it touches the things that eventually do. Repos,

secrets, CI/CD, cloud credentials, deployment

tooling, package publishing, SSH keys, kubeconfigs,

all the fun little artifacts we pretend are carefully

managed until somebody runs `ls ~/.aws` and the room

gets quiet. Then there's Megalodon. Security

researchers reported a campaign hitting more

than 5,500 public repositories with malware-laden

commits, stealing CI/CD secrets like AWS

and Google Cloud keys, SSH private keys, and

Kubernetes configs. That makes this more than

a GitHub story. It's a reminder that CI/CD is

not just where code gets tested. It is where

trust gets converted into artifacts. If a workflow

has cloud credentials, package publishing tokens,

signing keys, or deployment authority, then a

compromised workflow is not just a dev problem.

It is a release system compromise. The takeaway

is not "ban all extensions" or "never use GitHub

Actions." That's not serious. The takeaway is

to treat developer tooling as production-adjacent

infrastructure. Review extensions with broad

file, terminal, or workspace access. Use short-lived

credentials where you can. Keep cloud

keys out of CI when OIDC works. Lock down GitHub

Actions permissions. Require review on workflow

changes. And please, do not let every workflow

run with full write access because that was easier

during the first setup. A trusted extension,

a trusted repo, and a trusted workflow can all

become part of the same attack path. That's the

part that matters. Next up, Railway published

an incident report about a platform-wide outage

caused by Google Cloud incorrectly suspending

Railway's production account. That sentence alone

is enough to make most cloud engineers sit up

a little straighter. Railway says the outage

started on May 19th when Google Cloud incorrectly

placed their production account into a suspended

status. That took Railway's API, dashboard, control

plane, databases, and GCP-hosted compute infrastructure

offline. And then it got more interesting. Railway

also runs workloads on Railway Metal and AWS

burst-cloud environments. And those workloads

initially stayed up. But Railway's edge proxies

relied on a GCP-hosted control plane API to

populate routing tables. So when route caches

expired, the outage cascaded beyond GCP. Workloads

that were still technically running became unreachable

because the network control plane could no longer

resolve routes to active instances. At peak impact,

Railway says all workloads across all regions

were unreachable. And that is the story. Not

just "Google Cloud suspended an account." The real

story is that multi-cloud did not save the system

because the control plane dependency was still

in the hot path. That is where architecture diagrams

get too optimistic. You can draw AWS over here

and GCP over there. Metal in another box. Some

arrows, maybe a nice little mesh diagram. And

suddenly, everybody feels resilient. But resilience

is not about how many providers appear on the

diagram. It is about what has to work during

a failure. If your data plane in AWS still needs

a control plane API in GCP to route traffic,

then GCP is still in the hot path. If your failover

region needs the primary region's identity system

to approve failover, then the primary region

is still in the hot path. If your emergency deploy

process depends on the same CI/CD platform that

is currently broken, then congratulations. You

have invented a circular dependency with branding.
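To make the cached-state problem concrete, here is a minimal Python sketch of an edge-side route cache that keeps serving its last known good routing table when the control plane cannot be reached. This is not how Railway's proxies work or what their writeup prescribes; `RouteCache`, `fetch`, and `ttl_seconds` are hypothetical names, and serving stale routes is only one possible design choice.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class RouteCache:
    """Caches a routing table and falls back to stale data if a refresh fails."""

    def __init__(
        self,
        fetch: Callable[[], Dict[str, str]],  # hypothetical control plane call
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._routes: Optional[Dict[str, str]] = None
        self._fetched_at = 0.0

    def get(self) -> Tuple[Dict[str, str], bool]:
        """Return (routes, is_stale)."""
        now = self._clock()
        if self._routes is not None and now - self._fetched_at < self._ttl:
            return self._routes, False
        try:
            self._routes = self._fetch()
            self._fetched_at = now
            return self._routes, False
        except Exception:
            if self._routes is None:
                raise  # nothing cached yet, so there is nothing to fall back to
            # Control plane is unreachable: keep routing on last known good state.
            return self._routes, True
```

The tradeoff is that stale routes can point at instances that no longer exist, so a real system would probably also bound how long stale data may be served and alert loudly while it is doing so.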

The strongest thing in Railway's writeup is

that they owned it. They said that they take

responsibility for the architectural decisions

that allowed one upstream provider action to

cascade into a platform-wide outage. That's

the right posture. Customers do not care whether

the thing that broke was technically Google,

Railway, GitHub, Stripe, AWS, or a squirrel with

a networking certification. They see your product.

So when you say multi-cloud, ask what dependency

is still centralized? What service discovers

routes? What service holds identity? What API

does the edge need? What happens when cached

state expires? And what dependency do you only

discover when the provider account disappears

and everyone suddenly becomes very interested

in architecture diagrams? Multi-cloud is not magic.

Sometimes it is just single cloud with extra

invoices. Third story. Discord published a really

good postmortem on its March 25th voice outage.

The title is perfect: "You've got too much mail."

Because this outage wasn't just "Kubernetes killed

some pods." It was a chain reaction where a routine

infrastructure change hit a stateful system,

dropped a large number of sessions, and downstream

systems got overwhelmed by the recovery behavior.

Discord says voice and video suffered major degradation

for a little over three hours. Users were mostly

unable to start or join calls and saw an "Awaiting

Endpoint" message. The trigger came during a Kubernetes

migration for Discord's Elixir services. They

were tuning session management service resources

and pod counts. As Kubernetes applied the change,

it terminated 50% of the pods in one zone. Since

sessions were balanced across three zones, about

17% of active sessions were ungracefully stopped.
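That 17% figure follows from simple arithmetic. A quick sketch, assuming sessions were spread evenly across the three zones and across the pods within each zone (the variable names are only for illustration):

```python
# Fraction of all sessions lost when half the pods in one of three
# evenly loaded zones are terminated.
zones = 3
fraction_of_pods_killed_in_zone = 0.50

# Each zone holds 1/3 of sessions, and half of that zone's pods went away.
fraction_of_sessions_dropped = fraction_of_pods_killed_in_zone / zones

print(f"{fraction_of_sessions_dropped:.1%}")  # 16.7%
```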

That alone is not great. But the cascading part

is the real lesson. Discord’s systems use Elixir

GenServer processes, and those processes have

mailboxes. When all those sessions vanished,

other processes received a flood of messages

saying sessions were down. That caused reconnect

behavior, rate limit pressure, memory spikes,

gateway problems, and eventually voice and video

routing issues. This is the kind of postmortem

that I love because it shows how real outages

are usually not one failure. They are interaction

failures. Kubernetes did what it was told. The

session service had handoff logic. The rate limit

existed. The downstream services were designed

for normal load, but the shape of the change

produced a workload the system was not tuned

for. That is the part people miss when they

say, "Why didn't they just autoscale?" Because

autoscaling is not a magic undo button for "we

just invalidated 17% of active sessions and

created a reconnect storm." Sometimes the bottleneck

is not CPU. Sometimes it's mailbox length, backpressure,

downstream fanout, or one angry queue

quietly becoming the main character. The practical

takeaway is migration safety. When you move stateful

systems into Kubernetes, think beyond pod termination.

What does the rest of the system do when this

pod terminates? Who gets notified? Who retries?

Who reconnects? Who queues messages? Who gets

overloaded trying to help? Graceful shutdown

is not just a pod lifecycle feature. It is a

system behavior. If you are doing Kubernetes

migrations for stateful services, test the ugly

cases. Kill more than one pod. Drain a zone.

Watch downstream queues. Look at reconnect behavior.

Because production does not ask if the change

was routine. It asks if the system was ready.
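One common way to soften a reconnect storm is to have clients back off exponentially with random jitter, so they do not all come back in the same second. The sketch below shows the widely used full-jitter approach. It is not taken from Discord's postmortem, and `reconnect_delay`, `base`, and `cap` are illustrative assumptions.

```python
import random
from typing import Optional


def reconnect_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 60.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Full-jitter backoff: a uniform delay in [0, min(cap, base * 2**attempt)].

    attempt is 0 for the first reconnect, 1 for the second, and so on.
    """
    uniform = rng.uniform if rng is not None else random.uniform
    ceiling = min(cap, base * (2 ** attempt))
    return uniform(0.0, ceiling)


# Example: delays a single client might wait across five reconnect attempts.
for attempt in range(5):
    print(attempt, round(reconnect_delay(attempt), 2))
```

Spreading reconnects over a window like this trades a slightly slower recovery for each client against a much flatter load curve on the gateway and rate limiter.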

Fourth story, AWS is changing retry behavior

across AWS SDKs and tools. And I know that that

sounds like the kind of story that you would

normally skip because it has the emotional energy

of a configuration footnote. But retry defaults

are invisible infrastructure. They affect latency,

error rates, load during outages, and how your

app behaves when AWS services are already struggling.
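A rough way to see the latency effect is to compute how long one logical call can take when every attempt times out. The sketch below assumes plain exponential backoff with a cap; the function name and all of the numbers are made up for illustration and are not the AWS SDK defaults.

```python
def worst_case_seconds(
    max_attempts: int,
    per_attempt_timeout: float,
    base_delay: float,
    max_delay: float,
) -> float:
    """Upper bound on time spent in one logical call if every attempt times out.

    Assumes exponential backoff capped at max_delay with no jitter, which is
    also the worst case for a jittered scheme bounded by the same ceiling.
    """
    total = max_attempts * per_attempt_timeout
    # There is one backoff sleep between each pair of consecutive attempts.
    for retry in range(max_attempts - 1):
        total += min(max_delay, base_delay * (2 ** retry))
    return total


# One attempt with a 5 second timeout versus three attempts with backoff.
print(worst_case_seconds(1, 5.0, 0.5, 20.0))  # 5.0
print(worst_case_seconds(3, 5.0, 0.5, 20.0))  # 16.5
```

The point is that a caller who thinks of this as "one request" may actually be holding a thread or connection for several times the single-attempt timeout.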

AWS says the updated retry behavior is available

now behind opt-in and will become the default

in November 2026. The updated behavior changes

how standard and adaptive retry modes handle

failures. AWS is making standard mode the default

for SDKs that previously defaulted to legacy

mode, adding retry quotas where they didn't exist,

changing backoff timing and treating transient

errors differently from throttling errors. One

big change is that transient error retries cost

more retry quota than before. The idea is that

during sustained outages, the SDK fails faster

instead of endlessly retrying and adding pressure

to a service that is already unhealthy. That

is good, but it can still surprise you. Retries

are one of those things most teams do not think

about until an incident. Your code says call

S3. The SDK says it'll handle some retries. Your

app says, great, I'll pretend that that was one

request. Then the service starts throwing errors

and suddenly request latency, thread usage, connection

pools, client CPU, and downstream load all depend

on retry behavior you may never have explicitly

configured. Retries can save you from transient

failures. Retries can also turn a partial outage

into a client-side traffic storm wearing a helpful

little hat. So the takeaway is simple. Do not

wait until November 2026 to discover how your

app behaves. Pick a non-production workload.

Opt in with the new environment flag. Look at

latency. Look at error surfaces. Look at max attempts.

Look at throttling behavior. Look at long-polling

clients like SQS consumers, and figure out whether

your app depends on old retry behavior without

anyone realizing it. Because nothing says fun

on-call rotation like finding out your retry

strategy was inherited from 2018 and load-tested

by hope. Fifth story, let's talk about RabbitMQ,

debug code, secrets, and cloud cost blast radius.

AWS published a security bulletin for CVE-2026-9133

in the rabbitmq-aws plugin. The plugin

resolves AWS ARNs in RabbitMQ's broker configuration

at startup and can fetch things like TLS certificates,

private keys, passwords, and other secrets from

AWS services. The issue is that debug code in

the plugin's ARN resolver was accidentally shipped

in production builds. A debug ARN scheme accepted

by a validation endpoint could allow a remote

authenticated user to read arbitrary files accessible

to the RabbitMQ process. That is not good. AWS

recommends upgrading to rabbitmq-aws 0.2.1,

patching forked code, and rotating secrets stored

in files the RabbitMQ process could read. This

is a specific bug, but the pattern is broad.

Debug code in production should make everyone

briefly stop blinking because debug paths often

bypass the normal shape of the system. They inspect

the thing directly. They validate a thing in

a way production code usually does not. And then

somehow that path makes it into a build where

a real user or attacker can reach it. This also

pairs with a separate AWS Bedrock cost story

from Reddit where a user described attackers

using exposed access keys from an EC2 instance

to run about $14,000 of Claude calls in 24 hours.

That second story is a Reddit report, so I would

not treat it like a formal incident report, but

as a pattern, it is extremely believable. Cloud

credentials plus AI services can become a very

fast money fire. Security and FinOps are blending

together. Compromised cloud keys used to mostly

mean crypto mining, data access, or infrastructure

abuse. Now they can also mean someone burning

through model inference or AI API calls at a

rate that makes finance start typing in all caps.
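For a sense of scale, $14,000 in 24 hours works out to roughly $583 an hour. Here is a minimal sketch of a burn-rate check, assuming a hypothetical baseline and thresholds. None of these names or numbers come from AWS tooling; a real setup would more likely lean on the provider's own budget and anomaly alerts.

```python
def is_spend_anomalous(
    hourly_spend: float,
    baseline_hourly: float,
    multiplier: float = 5.0,
    floor: float = 10.0,
) -> bool:
    """Flag an hour whose spend exceeds both an absolute floor and a
    multiple of the normal baseline (both thresholds are assumptions)."""
    return hourly_spend > max(floor, baseline_hourly * multiplier)


# Rate implied by the Reddit report: about $14,000 over 24 hours.
reported_hourly = 14_000 / 24
print(round(reported_hourly))  # 583

# Against a hypothetical $2/hour baseline, that rate is flagged immediately.
print(is_spend_anomalous(reported_hourly, baseline_hourly=2.0))  # True
```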

The takeaway is not complicated. Scope credentials.

Use roles instead of long-lived access keys

where possible. Watch unusual service usage.

Put budgets and anomaly detection around AI services.

Rotate secrets when file-read issues appear.

And remember that "authenticated user" does not

mean "safe user." Small bugs get expensive when

the process can read secrets and the secrets

can spend money. Now let's do a quick lightning

round. First, OpenTelemetry reached graduated status in the

CNCF. This is a huge milestone. OpenTelemetry

is now basically the de facto standard for vendor-neutral

telemetry across traces, metrics, and

logs. But the operator warning is still the same.

The collector is production plumbing. Graduation

does not mean that every collector upgrade is

safe, every processor config is harmless, or

every telemetry pipeline is suddenly boring.

Standardization helps, but you still need a rollout

strategy, config validation, load testing, and

a plan for what happens when the thing that reports

on production becomes the thing breaking production.

Second, Claude Code had a security issue. I'm

keeping this short because we have covered Claude

Code and agent security a lot lately. But the

pattern matters. AI coding tools are not just

editors. They can have filesystem access, repo

context, terminals, deeplinks, commands, and

workflow integration. So when those tools have

parsing bugs, deeplink bugs, or command execution

paths, the risk is not theoretical. The developer

environment is becoming another agent runtime,

and agent runtimes need threat models. Third,

GitLab 19.0 introduced GitLab Secrets Manager

in public beta. This is a good direction. Secrets

closer to the pipeline, scoped to jobs, governed

through the same platform people already use

for CI/CD. That does not solve every secret manager

problem, but it does acknowledge reality. A lot

of secrets risk lives in CI/CD because CI/CD is

where systems need credentials to do work. Treating

pipeline secrets as first-class objects is better

than pretending a masked variable named prod

token is a strategy. Fourth, Google Cloud is

rolling out hard spend caps for AI services.

This is a FinOps story, but it is also a reliability

story. If a budget cap pauses API traffic when

spend hits a limit, that can protect you from

a surprise bill. It can also become an availability

event if your product depends on that API. So

hard caps are useful, but they need operational

design. Who gets alerted before the cap? What

degrades gracefully? What is customer facing?

And what do you want more? A hard outage or a

hard invoice? Sometimes the answer depends on

the day. Fifth, the Amazon Redshift Python driver

had an RCE issue. AWS reported CVE-2026-8838

in the Redshift Python driver, where a rogue

server could execute commands on a user's data

warehouse client. That is a good reminder that

database clients are part of your execution boundary

too. Not every RCE starts on the server. Sometimes

the client connects to the wrong thing, trusts

the wrong response, and becomes the thing that

gets owned. So patch the driver, watch connection

targets, and remember that "it is just a client

library" is usually how the story starts, not

how it ends. The human closer this week is about

trusted tools. The riskiest systems are not always

the mysterious ones. Sometimes they are the familiar

ones: the extension everyone installs, the workflow

nobody reviews, the retry behavior nobody configured,

the plugin that shipped with debug code, the control

plane API that seemed fine because the cache

bought you an hour. Trusted tools become dangerous

when trust turns into invisibility. That does

not mean that every tool is bad. It means that

trust should have an expiration date. Every so

often you need to ask, what does this tool have

access to? What can it change? What happens if

it is compromised? What happens if it disappears?

What happens if it retries differently? What

happens if the cache expires? That is not paranoia.

That is being the person who has to answer the

incident channel when everyone else is asking,

how could this happen? The staff and principal

engineer job is often about seeing the shape

of the dependency. Noticing when a developer

tool is actually a production path. When a retry

default is actually outage behavior. When a

multi-cloud architecture still has one hot dependency.

When a plugin can read secrets. When the thing

that everyone trusts has become the thing nobody

questions. The takeaway is not to stop trusting

tools. You cannot run modern systems that way.

The takeaway is to make trust visible. Map the

permissions. Review the workflows. Scope the

credentials. Test the failure path. Patch the

clients. Constrain the plugins. And look at your

boring dependencies like they might be production

infrastructure. Because they probably are. That's

it for this week of Ship It Weekly. We covered

the GitHub supply chain week with Nx Console

and Megalodon, Railway's GCP account suspension

outage, Discord's voice outage postmortem, AWS

SDK retry behavior changes, the RabbitMQ AWS

plugin file-read issue, and a lightning round

on OpenTelemetry, Claude Code, GitLab Secrets

Manager, Google AI Spend Caps, and Redshift Python

Driver RCE. If this episode was useful, follow

or subscribe wherever you are watching or listening.

If you're on YouTube, hit subscribe. If you are

in a podcast app, follow the show there. And

if you know someone on a DevOps, SRE, platform,

security, or engineering leadership team who is

dealing with supply chain risk, cloud dependencies,

retries or trusted tooling, send this one to

them. It helps the show grow and it helps me

keep making this kind of content for people who

actually live with these systems. You can find

the weekly brief at OnCallBrief.com and more

episodes and this week's show notes on

ShipItWeekly.fm. I'm Brian Teller from Teller's Tech. Thanks

for listening. And remember, if your trusted

tool can install code, trigger CI, route traffic,

retry requests, read secrets, or burn cloud money,

it is not just a tool anymore. It is part of

production. So maybe treat it like it.
