On Call Brief – Week of May 24–30, 2026

2026-05-24 to 2026-05-30. Briefing: 2026-05-24. Published May 31, 2026, 6:00 am EDT. 17 min read.

This week's top stories

1. DevOps'ish 310: The Breaches Are Coming From Inside the Extension Store

Category: Community
What happened: According to DevOps'ish 310, GitHub confirmed an internal security breach affecting approximately 3,800 internal repositories after a malicious backdoor was discovered in the Nx Console VS Code extension. The extension has since been removed from the marketplace, and no customer data was reportedly compromised. The breach highlights supply chain risk from editor extensions and other development tools; SRE teams can address it by auditing installed VS Code extensions across development environments and requiring approval for new extension installations. DevOps'ish 310 also reports a critical NGINX vulnerability enabling remote code execution, which requires immediate patching, and a newly disclosed Linux privilege escalation vulnerability that operators should monitor for patches.
Worth reading: The GitHub breach highlights risks from third-party extensions, while the NGINX and PostgreSQL vulnerabilities require immediate attention to prevent exploitation. The NPM compromise indicates potential supply chain risks that could affect many projects.
Source: DevOps'ish
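
The extension audit suggested above could be sketched roughly as follows. This is a minimal sketch, assuming the VS Code CLI (`code --list-extensions --show-versions`) is available on each workstation; the sample listing, the allowlist, and the function names are hypothetical and only illustrate the comparison step.

```python
import subprocess


def parse_extensions(listing: str) -> dict[str, str]:
    """Parse `code --list-extensions --show-versions` output into {id: version}."""
    found = {}
    for line in listing.splitlines():
        line = line.strip()
        if not line:
            continue
        # Each line looks like "publisher.name@1.2.3".
        ext_id, _, version = line.partition("@")
        found[ext_id.lower()] = version
    return found


def unapproved(installed: dict[str, str], allowlist: set[str]) -> list[str]:
    """Return installed extension ids that are not on the allowlist, sorted."""
    approved = {name.lower() for name in allowlist}
    return sorted(ext for ext in installed if ext not in approved)


def list_installed() -> dict[str, str]:
    """Query the local VS Code CLI (requires `code` on PATH)."""
    result = subprocess.run(
        ["code", "--list-extensions", "--show-versions"],
        capture_output=True, text=True, check=True,
    )
    return parse_extensions(result.stdout)


# Hypothetical sample data, for illustration only.
SAMPLE = "ms-python.python@2026.1.0\nexample.unknown-tool@0.0.1\n"
ALLOWLIST = {"ms-python.python"}

if __name__ == "__main__":
    print(unapproved(parse_extensions(SAMPLE), ALLOWLIST))
```

In practice the allowlist would live in version control, so that adding an extension goes through the same review as any other change.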

2. The War Between Wars: How an IRGC Front Runs Destructive OT and IT Attacks Under Cover of a Ceasefire

Category: Community
What happened: The article describes a security incident involving an IRGC-directed front that executed destructive attacks on both operational technology (OT) and information technology (IT) systems under the guise of a ceasefire. The incident began with a temperature anomaly in a food plant, leading to the discovery of manipulated controllers and a disk wiper disguised as a Microsoft update. The attackers demonstrated a deep understanding of the systems involved, highlighting the risks posed by such targeted attacks.
Worth reading: This incident illustrates the potential for sophisticated attacks on critical infrastructure, emphasizing the need for vigilance in monitoring both IT and OT environments. Operators should be aware of the tactics used and consider implementing stronger security measures and incident response protocols.
Source: Reddit r/netsec

3. You’ve Got (Too Much) Mail: Behind the Scenes of the 3/25/26 Voice Outage

Category: Deep Dive
What happened: SRE Weekly highlighted a detailed post-mortem of a major voice service outage that occurred on March 25, 2026, examining the root causes, the technical challenges encountered during incident response, and the monitoring failures that contributed to the extended downtime. The linked analyses also cover AI integration risks in production SRE environments, including documented failure modes and associated costs that organizations should evaluate before deployment. SRE teams should review their monitoring strategies for scale-related blind spots and their incident response procedures, particularly around voice services and any AI-assisted tooling in their infrastructure stack. The article emphasizes reliable monitoring practices and provides actionable guidance for preventing similar cascading failures in distributed voice systems.
Takeaway: Understanding the root causes and response strategies from this outage can help teams improve their incident response plans and system resilience.
Sources: Discord via SRE Weekly, Softwareseni via SRE Weekly, Medium via SRE Weekly

4. GCP Account Suspension Incident Report - May 19, 2026 Service Disruption

Category: Deep Dive
What happened: This incident report, published on the Railway blog, details the suspension of a GCP account on May 19, 2026, and outlines the causes and implications of the resulting service disruption.
Takeaway: Operators using GCP should review account management practices to prevent similar suspensions in the future.
Source: Railway Blog via SRE Weekly

5. Cogent: AI Exploit Developer Threats Outpace Scanner Detection on Critical Vulnerabilities

Category: Community
What happened: Cogent's latest security research reveals that AI-assisted exploit development has dramatically reduced vulnerability weaponization timelines from an average of 125 days down to just half a day, fundamentally disrupting traditional patching assumptions. This acceleration means that the standard industry practice of relying on vulnerability scanners and extended patching windows is no longer adequate for critical security vulnerabilities. SRE and DevOps teams should immediately reassess their vulnerability management processes to prioritize critical patches within hours rather than days or weeks, particularly for internet-facing systems and high-value assets. Organizations must also implement real-time threat intelligence feeds and automated patching workflows to match the speed of AI-driven exploit development. The research underscores the urgent need for shift-left security practices and continuous monitoring rather than periodic vulnerability assessments.
Worth reading: The drastic reduction in exploit weaponization time means that vulnerabilities may be actively exploited before patches can be applied, necessitating a more proactive and agile security posture.
Source: Security Boulevard Newsletters
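
To make the arithmetic concrete: going from 125 days to half a day is a 250x reduction in the weaponization window. A minimal sketch of an hours-based patch SLA check follows; the `Finding` structure, the 12-hour threshold, and the placeholder CVE ids are assumptions for illustration, not anything prescribed by the research.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Finding:
    cve: str                # placeholder identifier
    severity: str           # e.g. "critical", "high"
    internet_facing: bool
    disclosed_at: datetime


def overdue(findings, now, sla=timedelta(hours=12)):
    """Return CVE ids of critical, internet-facing findings open longer than `sla`."""
    return [
        f.cve
        for f in findings
        if f.severity == "critical"
        and f.internet_facing
        and now - f.disclosed_at > sla
    ]


# Weaponization window quoted in the research: 125 days versus half a day.
SPEEDUP = 125 / 0.5

# Hypothetical findings, for illustration only.
NOW = datetime(2026, 5, 30, 12, 0)
FINDINGS = [
    Finding("CVE-0000-0001", "critical", True, NOW - timedelta(hours=30)),
    Finding("CVE-0000-0002", "critical", True, NOW - timedelta(hours=3)),
    Finding("CVE-0000-0003", "high", True, NOW - timedelta(days=9)),
]

if __name__ == "__main__":
    print(SPEEDUP, overdue(FINDINGS, NOW))
```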

6. By Jeffrey Burt • May 25, 2026

Category: Community
What happened: Anthropic's Mythos AI system has identified over 10,000 high and critical severity vulnerabilities across real production codebases, demonstrating significantly faster vulnerability detection capabilities than traditional human-led security reviews according to Security Boulevard reporting on May 25, 2026. The scale and speed of these discoveries highlight a potential gap between AI-powered vulnerability identification and existing organizational patching workflows. SRE teams should evaluate their current vulnerability management processes and consider whether detection tooling and remediation timelines need adjustment to handle potential increases in identified security issues. Organizations using AI-assisted code analysis tools should also review their incident response procedures to ensure they can effectively prioritize and address the higher volume of vulnerabilities that advanced AI systems may surface. This development suggests the industry may need to reassess standard vulnerability disclosure and patching timelines as AI detection capabilities continue to outpace human remediation speeds.
Worth reading: The ability of AI to rapidly identify vulnerabilities could lead to increased pressure on teams to patch quickly, potentially affecting release cycles and security practices.
Source: Security Boulevard Newsletters

7. Iranian-Backed Group Behind Attacks on Transit Systems in LA, South Florida

Category: Community
What happened: Gambit Security has attributed recent cyberattacks on transit systems in Los Angeles and South Florida, along with a Maryland vehicle-tracking firm, to an Iranian state-aligned threat group. These attacks appear to be part of escalating tensions following US-Israeli operations and represent a broader pattern of Iranian groups targeting critical infrastructure. SRE teams operating transit systems or related infrastructure should immediately review access logs for unusual activity, ensure all security monitoring is functioning, and verify that incident response procedures are current and tested. Organizations should also coordinate with local cybersecurity authorities and consider temporarily increasing monitoring of operational technology networks and public-facing systems.
Worth reading: Operators should be aware of the increased risk to critical infrastructure from state-aligned groups, which may lead to heightened security measures and incident response protocols.
Source: Security Boulevard Newsletters

8. Project Glasswing: An initial update

Category: Community
What happened: Project Glasswing is an AI initiative aimed at identifying and securing software vulnerabilities. In its early weeks, it successfully identified over ten thousand critical flaws, shifting the focus from finding bugs to the more labor-intensive process of verifying and patching them. This development encourages the adoption of AI-driven tools to help shorten patch cycles.
Worth reading: The identification of a large number of critical vulnerabilities highlights the need for enhanced verification and patching processes in production environments. Teams may need to adapt their workflows to incorporate AI tools for vulnerability management.
Source: Anthropic via TLDR Dev

9. OpenTelemetry Collector v1.59.0/v0.153.0: Feature gates stabilized, memory fixes

Category: Breaking Change
What happened: The OpenTelemetry Collector has released versions v1.59.0 and v0.153.0, introducing several breaking changes and bug fixes. Key changes include stabilizing various feature gates and fixing a memory corruption issue in the gRPC configuration. Additionally, enhancements in the mdatagen command allow for stricter validation of feature gates and improved config documentation generation.
Do this Monday: Review Collector configurations against the release notes before upgrading: the breaking changes may require config updates, and telemetry collection can become unstable if they are missed. The stabilized feature gates signal a move toward more reliable configurations.
Source: OpenTelemetry Collector releases
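
One pre-upgrade check is to look for feature gates that a deployment explicitly disables, since a gate that has been stabilized is generally not expected to be switchable off any more. The sketch below parses a `--feature-gates` flag value (comma-separated, with `-` to disable and `+` or no prefix to enable). The gate names are hypothetical placeholders; the real list of stabilized gates has to come from the release notes.

```python
def parse_feature_gates(flag_value: str) -> dict[str, bool]:
    """Parse a value such as "+a.b,-c.d,e.f" into {gate: enabled}."""
    gates = {}
    for item in flag_value.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("-"):
            gates[item[1:]] = False
        else:
            gates[item.lstrip("+")] = True
    return gates


def disabled_stable_gates(flag_value: str, stable: set[str]) -> list[str]:
    """Return gates that are explicitly disabled but listed as stabilized."""
    parsed = parse_feature_gates(flag_value)
    return sorted(g for g, enabled in parsed.items() if not enabled and g in stable)


# Hypothetical gate names, for illustration only.
STABLE = {"example.nowStableGate"}
FLAG = "+example.alphaGate,-example.nowStableGate,example.betaGate"

if __name__ == "__main__":
    print(disabled_stable_gates(FLAG, STABLE))
```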

10. Claw Patrol (GitHub Repo)

Category: Community
What happened: Claw Patrol is a GitHub repository for a security firewall designed to monitor and control traffic between AI agents and production environments. It analyzes network data at the wire level and applies user-defined rules in HCL to block harmful commands or require manual approval for sensitive operations.
Worth reading: This tool could enhance security by preventing unauthorized actions from AI agents in production, which is critical for maintaining system integrity.
Source: GitHub via TLDR Dev

CVE & Security

1. CVE-2026-8838 - Remote code execution in amazon-redshift-python-driver via a rogue server

Category: Security / Patch
What happened: CVE-2026-8838 describes a remote code execution vulnerability in the amazon-redshift-python-driver that could allow a rogue server to execute commands on a user's data warehouse client. Users are advised to patch to version 2.1.14 to mitigate this risk.
Do this Monday: Upgrade amazon-redshift-python-driver to version 2.1.14. Until patched, a client that connects to a rogue server could have arbitrary code executed on it, so apply the update immediately.
Source: AWS via Last Week in AWS
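
A quick way to verify the patch level on a host is to compare the installed driver version against 2.1.14. The sketch below assumes the driver is installed under the distribution name `redshift_connector`; adjust that name if your environment packages it differently. The version parser is deliberately simple and only handles dotted numeric versions.

```python
from importlib import metadata


def version_tuple(version: str) -> tuple[int, ...]:
    """Convert "2.1.14" to (2, 1, 14); a non-numeric suffix on a part is ignored."""
    parts = []
    for part in version.split("."):
        digits = ""
        for ch in part:
            if not ch.isdigit():
                break
            digits += ch
        parts.append(int(digits or 0))
    return tuple(parts)


def is_patched(installed: str, minimum: str = "2.1.14") -> bool:
    """True if `installed` is at or above the patched version."""
    return version_tuple(installed) >= version_tuple(minimum)


def installed_driver_version(dist_name: str = "redshift_connector"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_driver_version()
    if found is None:
        print("driver not installed")
    else:
        print(found, "patched" if is_patched(found) else "NEEDS UPGRADE")
```

Comparing tuples of integers avoids the string-comparison trap where "2.1.2" sorts after "2.1.14".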

2. CVE-2026-9133 - Arbitrary file read in rabbitmq-aws plugin - Debug code shipped to production with no kill switch?

Category: Security / Patch
What happened: A critical vulnerability, CVE-2026-9133, has been identified in the rabbitmq-aws plugin: debug code mistakenly shipped to production without proper controls enables arbitrary file reads. Organizations using this plugin should immediately update to version 0.2.1 and rotate all secrets, as the vulnerability may have exposed sensitive data. Separately, an AWS cost incident reported on Reddit r/aws demonstrates the financial risk of overprivileged access keys: an EC2 instance with full Bedrock permissions ran up an unexpected $14,000 charge after a deployed chatbot was exploited or misconfigured. SRE teams should audit their RabbitMQ plugin versions and review AWS IAM policies so that EC2 instances and applications follow least-privilege principles, particularly for high-cost services like Bedrock.
Do this Monday: This vulnerability could lead to unauthorized access to sensitive files, making it critical to apply the patch and rotate secrets immediately to protect production environments.
Sources: AWS via Last Week in AWS, Reddit r/aws
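
The IAM review described above can start with a simple scan for wildcard Bedrock grants. This is a minimal sketch that works on IAM policy documents already loaded as Python dictionaries; the sample policy is hypothetical, and the check covers only `Allow` statements with `Action` entries of `*` or `bedrock:*` (it does not evaluate `NotAction`, conditions, or permission boundaries).

```python
def as_list(value):
    """IAM allows a single item or a list for Statement and Action."""
    return value if isinstance(value, list) else [value]


def broad_bedrock_statements(policy: dict) -> list[dict]:
    """Return Allow statements granting every action, or every Bedrock action."""
    flagged = []
    for stmt in as_list(policy.get("Statement", [])):
        if stmt.get("Effect") != "Allow":
            continue
        actions = [a.lower() for a in as_list(stmt.get("Action", []))]
        if any(a in ("*", "bedrock:*") for a in actions):
            flagged.append(stmt)
    return flagged


# Hypothetical policy attached to an instance role, for illustration only.
SAMPLE_POLICY = {
    "Version": "2012-10-17",
    "Statement": [
        {"Sid": "TooBroad", "Effect": "Allow", "Action": "bedrock:*", "Resource": "*"},
        {"Sid": "Scoped", "Effect": "Allow", "Action": ["bedrock:InvokeModel"], "Resource": "*"},
        {"Sid": "Denied", "Effect": "Deny", "Action": "*", "Resource": "*"},
    ],
}

if __name__ == "__main__":
    for stmt in broad_bedrock_statements(SAMPLE_POLICY):
        print("overly broad:", stmt.get("Sid"))
```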

3. Ghost CMS SQL injection flaw exploited in large-scale ClickFix campaign

Category: Security / Patch
What happened: A critical SQL injection vulnerability in Ghost CMS (CVE-2026-26980) is being exploited in a large-scale campaign to inject malicious JavaScript code, leading to ClickFix attack flows.
Do this Monday: This vulnerability could affect any production systems using Ghost CMS, potentially leading to unauthorized access or data breaches - operators should ensure their installations are patched.
Source: Bleeping Computer

4. Attackers Can Exploit a Claude Code RCE Flaw to Take Command of System

Category: Security / Patch
What happened: A remote code execution vulnerability has been discovered in Anthropic's Claude Code developer tool that allows attackers to take control of victim systems through crafted deeplinks. The flaw enables arbitrary code execution when users interact with malicious deeplink URLs targeting the Claude Code interface. SRE teams should audit any Claude Code integrations in their development workflows and apply strict controls around deeplink handling wherever Claude Code is used in automated systems or CI/CD pipelines. Organizations should also review access policies for AI coding assistants and consider temporarily restricting Claude Code usage until Anthropic releases security updates. According to DevOps.com, the vulnerability illustrates the security risks that AI-powered development tools can introduce.
Do this Monday: This vulnerability could lead to unauthorized system access, emphasizing the need for security measures when using AI tools in development environments.
Sources: DevOps.com (+1 more)

5. KnowledgeDeliver LMS Flaw Exploited to Deploy Godzilla and Cobalt Strike

Category: Security / Patch
What happened: A high-severity security flaw in Digital Knowledge's KnowledgeDeliver LMS was exploited as a zero-day to deploy the Godzilla web shell and Cobalt Strike Beacon. The vulnerability, CVE-2026-5426, is due to hard-coded ASP.NET machine keys and has been patched.
Do this Monday: This vulnerability could have allowed attackers to gain unauthorized access and control over systems using the LMS, potentially affecting any organizations relying on this software.
Source: The Hacker News (security)

6. Security Advisory for Cargo (CVE-2026-5222)

Category: Security / Patch
What happened: The article discusses a security advisory for Cargo, detailing CVE-2026-5222, which outlines a vulnerability that could affect users of the Rust package manager. It provides insights into the nature of the vulnerability and its potential impact on security practices within the Rust ecosystem.
Do this Monday: Operators using Cargo should assess their usage and apply necessary updates to mitigate the risk associated with CVE-2026-5222 - failure to do so may expose systems to vulnerabilities.
Source: Rust Blog via Hacker News (incidents)
Discussion: https://news.ycombinator.com/item?id=48271079

7. Wireshark 4.6.6 Released, (Sun, May 24th)

Category: Security / Patch
What happened: Wireshark version 4.6.6 has been released, addressing one vulnerability and fixing eleven bugs. Additionally, the Npcap component for Windows has been updated to version 1.88.
Do this Monday: Update to Wireshark 4.6.6 to pick up the vulnerability fix and the eleven bug fixes; operators who rely on Wireshark for network analysis should prioritize this update.
Source: SANS ISC

8. CrowdStrike detections on Nessus scan for MINIPLASMA_VULNERABLE

Category: Security / Patch
What happened: Users are reporting that CrowdStrike is detecting and terminating PowerShell executions during Tenable Nessus scans due to a new detection for the Miniplasma zero-day vulnerability. Stopping the scan job resolves the issue.
Do this Monday: This detection may disrupt routine vulnerability scanning processes, potentially leading to missed vulnerabilities if scans are not completed.
Source: Reddit r/sysadmin

Releases

1. GitLab 19.0 trades its string section for a full DevSecOps orchestra

Category: Release
What happened: GitLab 19.0 introduces significant updates aimed at enhancing DevSecOps practices, including a new GitLab Secrets Manager in public beta for Premium and Ultimate users. This tool allows for scoped access to secrets, limiting them to specific jobs and improving security by enabling better audit trails. The release also emphasizes continuous integration pipeline visibility and aims to streamline the process from code writing to deployment, addressing challenges posed by AI-driven code.
Do this Monday: The introduction of scoped secrets management could enhance security and compliance in CI/CD workflows, reducing the risk of credential exposure and simplifying audit processes. This change may require teams to adapt their existing workflows to leverage the new capabilities effectively.
Source: The New Stack

Also this week

Deep dives & postmortems

12. API Errors

Category: Deep Dive
What happened: Cloudflare experienced elevated error rates for API requests, leading to failures when customers attempted to make API calls. The issue was investigated, a fix was implemented, and the incident has since been resolved.
Takeaway: Operators using Cloudflare's API may have faced disruptions during the incident period - be aware of potential impacts on API-dependent services.
Source: Cloudflare Status

13. Networking performance degraded in BOM and SJC

Category: Deep Dive
What happened: Fly.io experienced degraded network performance in the BOM and SJC regions. The issue was investigated and resolved, with performance restored and monitoring ongoing.
Takeaway: This incident may have affected applications hosted in the BOM and SJC regions, potentially leading to latency or connectivity issues during the outage.
Source: Fly.io Status

Community reads

11. Announcing updated retry behavior for AWS SDKs and Tools - Six years into running AWS SDKs in production, and

Category: Community
What happened: AWS is standardizing the retry behavior for its SDKs and Tools after six years of production use. The new opt-in behavior is available now, with the default changes set to take effect in November 2026. This change may impact applications that rely on the previous retry mechanism, which was more lenient.
Worth reading: Ops teams need to prepare for the upcoming default changes in retry behavior, as applications may be affected if they depend on the legacy retry strategy.
Source: AWS via Last Week in AWS
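
One way to prepare for the default change is to find services that never pin a retry configuration and would therefore inherit whatever the SDK default becomes. The sketch below checks only the `AWS_RETRY_MODE` and `AWS_MAX_ATTEMPTS` environment variables; retry settings can also be set in the shared config file or in application code, which this does not inspect. The function names are hypothetical.

```python
import os


def retry_settings(env=None) -> dict[str, str]:
    """Report the retry-related environment settings, or "unset" when absent."""
    env = os.environ if env is None else env
    return {
        "mode": env.get("AWS_RETRY_MODE", "unset"),
        "max_attempts": env.get("AWS_MAX_ATTEMPTS", "unset"),
    }


def relies_on_default(env=None) -> bool:
    """True when no retry mode is pinned via the environment."""
    return retry_settings(env)["mode"] == "unset"


if __name__ == "__main__":
    settings = retry_settings()
    print(settings)
    if relies_on_default():
        print("retry mode not pinned via environment; SDK default applies")
```

Running a check like this in each service's startup path, or across deployment manifests, gives a list of workloads to test against the opt-in behavior before the default flips.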

Lightning links

AWS announces AWS Interconnect - multicloud connectivity with Oracle Cloud Infrastructure in preview (Last Week in AWS) -- AWS Interconnect offers multicloud connectivity with Oracle Cloud, enhancing integration options.
Google Cloud is rolling out actual hard spend caps for AI services that pause API traffic when you hit your budget (Last Week in AWS) -- Google Cloud's new hard spend caps for AI services help manage costs by pausing API traffic.
How Jaeger hit 8.6× compression on 10 million spans with ClickHouse (The New Stack) -- Jaeger's integration with ClickHouse significantly improves telemetry data handling efficiency.
Deno 2.8 introduces new subcommands for automated vulnerability patching (TLDR Dev) -- Deno 2.8 enhances developer experience with automated vulnerability patching and version management.
Bumblebee Goes Open Source (TLDR AI) -- Perplexity's Bumblebee is now open-source, providing a security scanner for risky packages and configurations.
HTTP/3 and QUIC for Web Developers: What Changes and What You Need to Configure (Reddit r/sre) -- Learn how HTTP/3 and QUIC impact web development and the necessary configuration changes.
Show HN: Panorama – Review Code, Faster (Hacker News Show HN) -- Panorama streamlines the code review process, enhancing collaboration and speeding up feedback.
This may sound familiar, because GCP did something very similar almost exactly 2 years ago. (SRE Weekly) -- A recent GCP incident echoes a previous event, highlighting the importance of infrastructure management.
How are you guys handling upgrades for 3rd-party K8s tooling? (Reddit r/kubernetes) -- Community insights on managing upgrades for third-party Kubernetes tools can inform your strategy.

Human Stories

The stories landing on our desks lately paint a sobering picture of how fundamentally the threat landscape has shifted beneath our feet. When AI can compress vulnerability weaponization from four months down to twelve hours, and when trusted developer tools like VS Code extensions become attack vectors that compromise thousands of repositories at once, we're no longer playing the same game we learned our craft in. The Cogent research on AI exploit development and the GitHub extension breach aren't isolated incidents - they're early indicators of a new reality where our traditional security timelines and trust assumptions are becoming dangerously obsolete. What keeps me up at night isn't just the technical sophistication of these attacks, but how they're forcing us to question every assumption about supply chain security, incident response windows, and the very tools we rely on to do our jobs safely. As we head into next week, I find myself wondering less about whether our defenses will be tested by these new capabilities, and more about whether we're even measuring the right things anymore.

Also worth reading

The Pulse: AI load breaks GitHub – why not other vendors? (SRE Weekly)

The article discusses the impact of AI workloads on GitHub's infrastructure, exploring why GitHub experienced issues under heavy AI load while other vendors did not. It analyzes the architectural choices and scaling strategies that may have contributed to GitHub's challenges.

SCCM DO failures on Win11 23H2 (22631) vs 24H2 – clients not resolving DPs? (Reddit r/sysadmin)

The post discusses a high rate of Delivery Optimization (DO) failures in SCCM for Windows 11 version 23H2, with over 70% of devices experiencing issues. The author notes that clients are unable to resolve Distribution Point (DP) locations, resulting in empty LocationRecords and no downloads starting

Moved from Harness to Revolte for delivery automation, what's the difference? (Reddit r/terraform)

The author discusses their transition from Harness to Revolte for delivery automation. They highlight that while Harness required manual release coordination and environment promotions, Revolte uses AI agents to automate these processes. This shift allowed their deployment frequency to increase from
