On Call Brief – Week of May 31–June 6, 2026
This week's top stories
1. The 28-Hour Meltdown: What Happened When AWS US-EAST-1 Overheated
- Category: Community
- What happened: AWS US-EAST-1 experienced a 28-hour outage caused by overheating issues that cascaded across multiple services in the region, according to SRE Weekly's incident analysis. The outage highlighted vulnerabilities in AWS's infrastructure resilience and raised concerns about how the serverless 'pay what you use' pricing model can lead to unexpected cost implications during recovery periods. SRE teams should review their multi-region failover strategies and ensure they have adequate monitoring for both infrastructure health and cost anomalies during incident recovery. Organizations heavily dependent on US-EAST-1 should evaluate their disaster recovery plans and consider implementing cross-region redundancy for critical workloads. The incident serves as a reminder that even major cloud providers can experience extended outages, making business continuity planning essential for production systems.
- Worth reading: An outage in AWS US-EAST-1 can cause downtime or degraded performance for any service hosted in, or dependent on, that region.
- Sources: Builder Aws via SRE Weekly, Feeds Dzone via SRE Weekly
2. Google Cloud Suspends Railway's Production Account
- Category: Community
- What happened: Google Cloud mistakenly suspended Railway's production account on May 19, 2026 due to an automated action, according to Reddit r/aws discussions, highlighting risks in automated account management processes that could impact other cloud customers. Separately, Google Cloud announced the general availability of the Remote Model Context Protocol (MCP) Server for AlloyDB, which enables AI agents to securely access enterprise data in real-time according to the Google Cloud Blog. SRE teams should review their cloud account monitoring and escalation procedures to ensure rapid response to automated suspension events, and evaluate whether the new AlloyDB MCP integration aligns with their AI and data access strategies. Organizations using Google Cloud should verify they have appropriate support channels configured to quickly address automated account actions that could disrupt production services.
- Worth reading: Operators should be aware of the implications of automated actions in cloud environments, as they can lead to unintended service disruptions.
- Sources: Reddit r/aws, Google Cloud Blog
3. AWS Organizations emits CloudTrail events for account membership changes - Only took until 2026 to log when
- Category: Community
- What happened: AWS Organizations will now emit CloudTrail events for account membership changes, allowing organizations to track when accounts join or leave. This feature addresses a significant gap where such changes went unlogged, leading to confusion and potential security concerns.
- Worth reading: This change improves visibility into account management within AWS Organizations, which can enhance security monitoring and incident response capabilities.
- Source: AWS via Last Week in AWS
4. Postmortem for my UK company database startup (2025)
- Category: Deep Dive
- What happened: The article provides a postmortem analysis of a database incident at a UK startup, detailing the causes of the outage, the response efforts, and lessons learned to improve future resilience.
- Takeaway: Understanding the incident response and recovery process can help teams prepare for similar outages. The insights into database management and operational resilience are valuable for improving practices.
- Source: Developerwithacat via Hacker News (incidents)
- Discussion: https://news.ycombinator.com/item?id=48346016
5. Cloudflare: 2 service incidents (Durable Objects and Log Explorer issues, Issues with TLS certificates using Lets
- Category: Deep Dive
- What happened: Cloudflare experienced multiple service disruptions affecting Durable Objects, Log Explorer, and TLS certificate provisioning. The first incident caused intermittent delays in log queries and elevated startup errors for Durable Objects specifically in the Atlanta region. A separate incident affected TLS certificate provisioning for domains using Let's Encrypt Certificate Authority, though specific details about the nature of this disruption were not provided in the available reports. SRE teams using Cloudflare services should verify that their Durable Objects are functioning normally in the Atlanta region and check that TLS certificates are properly provisioned, particularly for domains relying on Let's Encrypt CA. Both incidents have been reported as resolved according to Cloudflare Status updates.
- Takeaway: This incident could have affected customer applications relying on Durable Objects and log data availability, particularly in the Atlanta region - operators should ensure their services are functioning correctly post-incident.
- Source: Cloudflare Status
6. RDP failing after update KB5087537 and KB5087065
- Category: Community
- What happened: A user reports that after installing updates KB5087537 and KB5087065 on a Windows Server 2016 VM, Remote Desktop Protocol (RDP) connections fail. The connection proceeds normally until the password is entered, and the user has not found any documentation linking the updates to RDP issues. They have a snapshot of the VM to roll back to if needed.
- Worth reading: If these updates are causing RDP failures, it could affect remote access to servers, impacting operations and troubleshooting. Monitoring for similar reports may be necessary.
- Source: Reddit r/sysadmin
7. suspicious login popup from polyfill.io on https://parking.calypsotowerspcb.com/customer/login/
- Category: Community
- What happened: A user reports a suspicious login popup from polyfill.io appearing on their site, which does not occur in incognito mode. They are seeking assistance in understanding the issue.
- Worth reading: This could indicate a potential security issue or unwanted script running on the site, which may affect user experience and trust.
- Source: Reddit r/sysadmin
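One way to start investigating a report like this is to list which hosts a page loads scripts from and flag any that match a blocklist. The sketch below is a minimal illustration using only the Python standard library; the function name `flagged_script_hosts` and the default blocklist are assumptions for this example, and it only sees `<script src>` tags present in the static HTML, not scripts injected at runtime or by browser extensions.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class ScriptHostCollector(HTMLParser):
    """Collects the hostnames referenced by <script src="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.hosts: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if not src:
            return  # inline script, nothing to resolve
        host = urlparse(src).hostname  # None for relative URLs such as /app.js
        if host:
            self.hosts.add(host)


def flagged_script_hosts(html: str, blocklist=frozenset({"polyfill.io"})) -> set[str]:
    """Return script hosts that equal, or are subdomains of, a blocklisted domain."""
    collector = ScriptHostCollector()
    collector.feed(html)
    return {
        host
        for host in collector.hosts
        if any(host == bad or host.endswith("." + bad) for bad in blocklist)
    }


if __name__ == "__main__":
    sample = (
        '<script src="https://cdn.polyfill.io/v3/polyfill.min.js"></script>'
        '<script src="/static/app.js"></script>'
    )
    print(flagged_script_hosts(sample))
```

Because the popup in the report does not appear in incognito mode, a clean result from a scan like this would point toward a cached asset or a browser extension rather than the page source.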
CVE & Security
1. CVE-2026-9255 - Tool Execution Without Authorization via Piped Stdin in Kiro CLI - Turns out piping untrusted
- Category: Security / Patch
- What happened: CVE-2026-9255 describes a vulnerability in Kiro CLI where untrusted content piped into the tool can allow attackers to execute arbitrary shell commands. This occurs because the interactive prompt accepts stdin as confirmation. Users are advised to update to version 1.28.0 to mitigate this risk.
- Do this Monday: Update Kiro CLI to 1.28.0. Wherever Kiro CLI runs with untrusted input piped into it, this vulnerability could allow unauthorized command execution, including in production environments.
- Source: AWS via Last Week in AWS
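The underlying class of bug, a confirmation prompt that reads its answer from piped stdin, can be illustrated with a short sketch. This is not Kiro CLI's code or its actual fix; `confirm_tool_execution` is a hypothetical helper showing one common mitigation, which is to refuse confirmation whenever stdin is not an interactive terminal.

```python
import io
import sys
from typing import TextIO


def confirm_tool_execution(command: str, stdin: TextIO = sys.stdin) -> bool:
    """Return True only when a human at a terminal explicitly approves."""
    # A pipe or file on stdin may carry attacker-controlled text, so it must
    # never be interpreted as the answer to the confirmation prompt.
    if not stdin.isatty():
        return False
    print(f"Run `{command}`? [y/N] ", end="", flush=True)
    return stdin.readline().strip().lower() in {"y", "yes"}


if __name__ == "__main__":
    # Simulates `echo y | tool`: the piped "y" is ignored and execution is denied.
    piped = io.StringIO("y\n")
    print(confirm_tool_execution("echo hello", piped))  # prints False
```

Denying by default when no terminal is attached means non-interactive use needs an explicit opt-in (for example a dedicated flag), which keeps the approval decision out of the data stream.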
2. CVE-2026-9291 - Insecure Deserialization in Amazon Braket SDK Job Results Processing - Quantum computing has
- Category: Security / Patch
- What happened: CVE-2026-9291 describes an insecure deserialization vulnerability in the Amazon Braket SDK that affects job results processing. The SDK improperly trusts a JSON field to determine whether to execute `pickle.loads()`, creating a significant security risk. Users are advised to upgrade to version 1.117.0 and review S3 write access permissions.
- Do this Monday: This vulnerability could allow unauthorized code execution if exploited, impacting the security of applications using the Amazon Braket SDK. Immediate upgrade is recommended to mitigate risks.
- Source: AWS via Last Week in AWS
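The pattern described here, letting a field inside untrusted data decide whether `pickle.loads()` runs, can be avoided by never offering pickle as a decoder at all. The sketch below is a generic illustration, not the Braket SDK's code or its patch; the envelope shape (`format` and `data` keys) and the name `load_job_result` are assumptions made for this example.

```python
import json
from typing import Any


class UntrustedFormatError(ValueError):
    """Raised when a payload asks for a decoder this loader refuses to run."""


def load_job_result(raw: bytes) -> Any:
    """Decode a result envelope without ever unpickling its contents."""
    envelope = json.loads(raw)
    # Hypothetical envelope shape: {"format": "<name>", "data": <payload>}
    if not isinstance(envelope, dict):
        raise UntrustedFormatError("result envelope must be a JSON object")
    fmt = envelope.get("format")
    if fmt != "json":
        # The format field arrives in the same untrusted object as the data,
        # so it must not be able to select pickle.loads() or any other
        # decoder that can execute code.
        raise UntrustedFormatError(f"refusing to decode format {fmt!r}")
    return envelope["data"]


if __name__ == "__main__":
    print(load_job_result(b'{"format": "json", "data": [1, 2, 3]}'))
    try:
        load_job_result(b'{"format": "pickle", "data": "gASV..."}')
    except UntrustedFormatError as exc:
        print(f"rejected: {exc}")
```

Restricting who can write to the bucket that holds results, as the advisory recommends, remains necessary either way, since the loader trusts whatever object it is pointed at.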
3. WP Maps Pro bug exploited to create admin accounts on WordPress sites
- Category: Security / Patch
- What happened: Hackers are exploiting a vulnerability in the WP Maps Pro plugin for WordPress that enables the creation of unauthorized administrator accounts without authentication. This poses a significant security risk for affected sites.
- Do this Monday: Websites using the WP Maps Pro plugin are at risk of unauthorized access and compromise. Check affected sites for unexpected administrator accounts.
- Source: Bleeping Computer
4. Toxic Flows: When Your Agent Skill Becomes a Supply Chain Attack
- Category: Security / Patch
- What happened: Snyk's ToxicSkills research examined over 3,000 agent skills and discovered that 36% contained security vulnerabilities, with 13% exhibiting critical flaws including credential theft vectors that could enable supply chain attacks. The study reveals that AI agent skills represent a significant attack surface where malicious actors can inject vulnerabilities into organizational workflows through compromised or intentionally malicious skill packages. SRE teams should implement thorough security reviews and vetting processes before deploying any agent skills in production environments, treating them as potential supply chain risk vectors similar to third-party software dependencies. Organizations should establish approval workflows that include static analysis scanning and behavioral testing of agent skills before integration into existing systems. According to Techstrong Brief reporting on the Snyk research, the prevalence of these vulnerabilities indicates that current agent skill ecosystems lack adequate security controls and require immediate defensive measures.
- Do this Monday: The high percentage of skills with security flaws indicates potential risks in using third-party skills, which could lead to supply chain attacks if not properly vetted.
- Source: Techstrong Brief
5. Unidentified RAT pushes NetSupport RAT, (Mon, Jun 1st)
- Category: Security / Patch
- What happened: This report details an unidentified RAT infection that occurred on May 27, 2026, which was followed by the deployment of a NetSupport Manager RAT. The initial RAT has been generating encoded traffic to a C2 server since April 2026. The report includes indicators of compromise, such as URLs associated with the SmartApeSG ClickFix campaign and the IP addresses of the C2 servers for both the initial RAT and the NetSupport RAT.
- Do this Monday: Operators should be aware of the ongoing RAT infection and the associated C2 traffic, as it may indicate a broader security issue that could affect their systems. Monitoring for the listed indicators of compromise is crucial to prevent potential breaches.
- Source: SANS ISC
6. DevOps'ish 311: Poisoned Repos, Hallucinating Executives, and More
- Category: Security / Patch
- What happened: Docker Engine v29.4.3 has been released with critical security enhancements including AppArmor, SELinux, and seccomp protections to address CVE-2026-31431, a Linux kernel privilege escalation vulnerability that could allow attackers to gain elevated system privileges. The update specifically aims to rectify issues caused by a previous incomplete fix for this vulnerability. Additionally, the Kubernetes Security Response Committee has announced plans to update CVE records for three long-standing unfixed vulnerabilities on June 1, 2026, and to provide remediation strategies for cluster administrators. SRE teams should immediately upgrade Docker Engine installations to v29.4.3 and watch for the upcoming Kubernetes CVE updates so they can prepare remediation measures for affected clusters.
- Do this Monday: This update is critical for maintaining the security of Docker environments, especially for those using 32-bit binaries. Operators should prioritize upgrading to this version to mitigate potential privilege escalation risks.
- Source: DevOps'ish 311
Releases
1. Dutch Authorities Dismantle Botnet Linked to 17 Million Infected Devices
- Category: Release
- What happened: Dutch authorities have dismantled a botnet that controlled at least 17 million infected devices, including computers, tablets, smartphones, and IoT devices, to conduct malicious attacks. The operation involved over 200 servers located in the Netherlands.
- Do this Monday: This takedown may reduce the risk of attacks originating from these devices, but operators should remain vigilant for potential residual effects or retaliatory actions from malicious actors.
- Source: Thehackernews via The Hacker News (security)
2. KEDA: v2.20.0, v2.19.0
- Category: Release
- What happened: KEDA versions 2.19.0 and 2.20.0 have been released with significant new scalers and authentication improvements that require operator attention. Version 2.19.0 introduces the Kubernetes Resource Scaler and file-based authentication support for ClusterTriggerAuthentication, while 2.20.0 adds OpenSearch and Elastic Forecast scalers along with enhanced scaling behavior and metrics capabilities. Critically, KEDA 2.20.0 migrates event recording to the new events.k8s.io API group, which may impact existing monitoring and logging configurations that depend on the previous event format. Operators should test these versions in non-production environments first, particularly validating any existing event-based alerting or monitoring systems, and review authentication configurations if using ClusterTriggerAuthentication. Both releases include various scaler fixes and behavioral improvements that should be evaluated against current autoscaling policies before production deployment.
- Do this Monday: The change to the events.k8s.io API group may affect deployments with custom RBAC settings, necessitating permission updates to ensure event recording functions correctly. This could lead to operational issues if not addressed before upgrading.
- Source: KEDA releases
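Before upgrading, a custom ClusterRole can be checked for rules that cover events in the `events.k8s.io` API group. The sketch below works on RBAC rules expressed as plain dictionaries (the same `apiGroups`, `resources`, `verbs` fields used in Kubernetes manifests). The required verbs (`create`, `patch`) are an assumption for illustration, not taken from the KEDA release notes, so confirm them against the 2.20.0 changelog; the function names are hypothetical.

```python
from typing import Iterable, Mapping, Sequence

Rule = Mapping[str, Sequence[str]]


def _matches(allowed: Sequence[str], wanted: str) -> bool:
    """RBAC lists match on an exact entry or on the "*" wildcard."""
    return "*" in allowed or wanted in allowed


def allows(rules: Iterable[Rule], api_group: str, resource: str, verb: str) -> bool:
    """True if any single rule grants the verb on the resource in the group."""
    return any(
        _matches(rule.get("apiGroups", ()), api_group)
        and _matches(rule.get("resources", ()), resource)
        and _matches(rule.get("verbs", ()), verb)
        for rule in rules
    )


def missing_event_verbs(rules: Sequence[Rule], verbs=("create", "patch")) -> list[str]:
    """List the assumed-required verbs not granted on events.k8s.io/events."""
    return [v for v in verbs if not allows(rules, "events.k8s.io", "events", v)]


if __name__ == "__main__":
    # A role that only covers the legacy core ("") API group for events.
    legacy_rules = [
        {"apiGroups": [""], "resources": ["events"], "verbs": ["create", "patch"]},
    ]
    print(missing_event_verbs(legacy_rules))  # ['create', 'patch']
```

A non-empty result suggests the role needs an additional rule for the `events.k8s.io` group before the upgrade, otherwise event recording may fail with permission errors.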
3. From Kubernetes Dashboard to Headlamp: Understanding the Transition
- Category: Release
- What happened: The Kubernetes Dashboard has been archived, and users are encouraged to transition to Headlamp, which builds on the Dashboard's foundation while adding new features such as multi-cluster visibility and extensibility through plugins. The article provides guidance on how to navigate this transition, highlighting that many familiar workflows remain unchanged while offering improvements in usability and resource management.
- Do this Monday: Operators using Kubernetes Dashboard need to migrate to Headlamp, which may require adjustments in workflows but promises enhanced capabilities and a familiar interface.
- Source: Kubernetes Blog
4. System-wide issues with self-signed certificates under Docker
- Category: Release
- What happened: The Reddit discussion highlights system-wide issues encountered when using self-signed certificates with Docker. Users report various problems related to certificate validation and trust, which can lead to connectivity issues and security concerns in containerized environments.
- Do this Monday: Operators using self-signed certificates in Docker environments may face connectivity issues and potential security vulnerabilities due to improper certificate handling - this could affect service reliability and security posture.
- Source: Reddit r/docker
5. Updates to GitHub Copilot billing and plans
- Category: Release
- What happened: GitHub Copilot has transitioned to usage-based billing effective June 1, where all plans will charge based on GitHub AI Credits consumed. Copilot code review now also uses GitHub Actions minutes. New user-level budget controls allow admins to set spending limits for users, and Copilot Max is introduced for power users with higher usage limits. These changes may impact billing for organizations using Copilot and require adjustments in budget management.
- Do this Monday: Billing changes could affect budget planning and resource allocation for teams using GitHub Copilot, especially with the introduction of usage-based billing and the consumption of Actions minutes.
- Source: GitHub Changelog
6. Multi-Region event-driven failover architecture with Amazon EventBridge and Route 53
- Category: Release
- What happened: This article outlines a multi-region event-driven architecture using Amazon EventBridge, API Gateway, and Route 53 for high availability and disaster recovery. It describes how to implement automatic failover between AWS regions, ensuring that event processing remains independent and efficient. The architecture leverages Route 53 health checks to monitor API Gateway endpoints and reroute traffic to healthy regions, while DynamoDB global tables ensure data availability across regions. This solution is particularly beneficial for organizations with strict availability requirements, supporting both planned and unplanned outages.
- Do this Monday: This architecture can significantly enhance the resilience of applications by ensuring automatic failover and reducing latency through regional independence. It is crucial for teams managing critical applications that require high availability and disaster recovery capabilities.
- Source: AWS Compute Blog
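The routing decision at the heart of this architecture can be modeled in a few lines. The sketch below is a simplified simulation, not the article's implementation and not a call into any AWS API: the class `HealthCheck`, the function `route`, the region names, and the threshold of three consecutive failures are all assumptions for illustration. Real Route 53 health checks also apply probe intervals and other settings that this model leaves out.

```python
from dataclasses import dataclass


@dataclass
class HealthCheck:
    """Tracks consecutive probe failures for one regional endpoint."""

    region: str
    failure_threshold: int = 3  # assumed value for this sketch
    consecutive_failures: int = 0

    def record(self, probe_ok: bool) -> None:
        # A single successful probe resets the counter (a simplification).
        self.consecutive_failures = 0 if probe_ok else self.consecutive_failures + 1

    @property
    def healthy(self) -> bool:
        return self.consecutive_failures < self.failure_threshold


def route(primary: HealthCheck, secondary: HealthCheck) -> str:
    """Pick the region that should receive traffic right now."""
    if primary.healthy:
        return primary.region
    if secondary.healthy:
        return secondary.region
    # Both unhealthy: this sketch falls back to the primary rather than
    # returning nothing, so clients always get an answer.
    return primary.region


if __name__ == "__main__":
    primary = HealthCheck("us-east-1")
    secondary = HealthCheck("us-west-2")
    for _ in range(3):
        primary.record(False)  # three failed probes in a row
    print(route(primary, secondary))  # us-west-2
```

Requiring several consecutive failures before failing over trades a slower reaction for fewer spurious switches caused by a single dropped probe.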
7. etcd: v3.7.0-rc.0, v3.6.12, v3.5.31
- Category: Release
- What happened: Etcd maintainers have released three new versions across different major branches: v3.5.31, v3.6.12, and v3.7.0-rc.0, with all releases including various changes documented in their respective changelogs. Operators should review the upgrade guides for each version before deploying due to potential breaking changes that could impact existing clusters. The v3.7.0 release is currently in release candidate status and should be tested thoroughly in non-production environments before considering production deployment. Installation instructions are available for Linux, macOS, and Docker across all versions, and operators should follow the standard etcd upgrade procedures including backup verification and staged rollouts. Priority should be given to upgrading stable production clusters to v3.5.31 or v3.6.12 depending on your current major version, while evaluating v3.7.0-rc.0 in development environments only.
- Do this Monday: Operators should be aware of the potential breaking changes when upgrading to this release. Following the upgrade guides is crucial to ensure compatibility and stability.
- Source: etcd releases
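The branch-by-branch recommendation above can be written down as a small lookup for use in fleet inventory scripts. This is a minimal sketch: the helper name `patch_target` is hypothetical, the mapping only covers the two stable releases named in this item, and it deliberately returns nothing for other branches (including the v3.7 release candidate) rather than guessing.

```python
from typing import Optional

# Stable patch releases named in this week's etcd announcement.
STABLE_TARGETS = {(3, 5): "v3.5.31", (3, 6): "v3.6.12"}


def patch_target(current: str) -> Optional[str]:
    """Return the stable patch release for the cluster's current minor branch."""
    parts = current.lstrip("v").split(".")
    major, minor = int(parts[0]), int(parts[1])
    # Unknown branches map to None so a human decides the upgrade path.
    return STABLE_TARGETS.get((major, minor))


if __name__ == "__main__":
    for version in ("v3.5.17", "3.6.4", "v3.4.30"):
        print(version, "->", patch_target(version))
```

The lookup does not replace the upgrade guides; it only identifies which changelog and guide apply to a given cluster.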
8. AWS Weekly Roundup: Claude Opus 4.8 on AWS, Aurora MySQL with Kiro Powers, and more (June 1, 2026)
- Category: Release
- What happened: Anthropic's Claude Opus 4.8 has been released on both AWS and Microsoft Azure's Foundry platform, introducing enhanced coding capabilities with autonomous task execution, improved context handling, and the ability to maintain plans across multiple workflow stages. According to AWS What's New, this release includes features specifically designed for complex coding tasks and multi-step enterprise workflows, while Last Week in AWS notes this represents an incremental update from version 4.7 rather than a major milestone release. SRE teams using Claude for automation or development workflows should evaluate the new autonomous execution features and test any existing integrations to ensure compatibility with the updated model behavior. The Azure Blog confirms that similar capabilities are available on Microsoft Foundry, providing cross-cloud availability for organizations using multi-cloud AI strategies.
- Do this Monday: The introduction of Claude Opus 4.8 could significantly enhance coding workflows, while the new Resilience Hub offers SREs a structured approach to manage application resilience, which may lead to improved service reliability and compliance.
- Sources: AWS What's New, AWS via Last Week in AWS, Azure Blog
Lightning links
- The AI-native SDLC is paying off: 19% more PRs and 2–3 hours saved per developer per week (Atlassian Engineering) -- Integrating AI into the SDLC has led to a 19% increase in pull requests and significant time savings.
- Alphabet Raises $80 Billion to Fund the AI Infrastructure Build-Out (Techstrong Brief) -- Alphabet's $80 billion investment aims to enhance AI infrastructure and capabilities.
- Apache Kafka 4.3.0 Release Announcement (Confluent Blog) -- The latest Kafka release introduces 25 new KIPs, enhancing functionality and performance.
- Now Available: Ready-to-Use Policies – Guardrails You Can Activate Instantly (env0 Blog) -- env0's new policies allow for immediate enforcement of security and compliance guardrails.
- How We Cut up to 80% of Engineering “Chores” Using AI Agents in Jira (Atlassian Engineering) -- Atlassian's AI agents automate up to 80% of maintenance tasks, streamlining engineering workflows.
- From alert noise to action: How 24 Hour Fitness transformed IT operations with Jira Service Management (Atlassian Engineering) -- 24 Hour Fitness reduced alert noise and ITSM costs by 37% through improved IT operations.
- Spring 2026 SOC 1, 2, and 3 reports are now available with 188 services in scope (AWS Security Blog) -- AWS has released comprehensive SOC reports covering 188 services, enhancing transparency.
- Stop Pasting Tokens: OAuth2 Login for JetBrains IDE Plugins (JetBrains Blog) -- New OAuth2 login flow for JetBrains IDEs eliminates the need for personal access tokens.
Human Stories
Looking at this week's incidents, I'm struck by how many of them trace back to automated systems making decisions without sufficient human oversight or context. Railway's production account got suspended by Google Cloud's automated processes, while AWS finally, after years, started logging basic organizational changes that should have been tracked from day one. The 28-hour AWS US-EAST-1 meltdown reminds us that even the most sophisticated infrastructure can cascade into failure when environmental controls fail, but it's the smaller stories that really get to me. That suspicious polyfill.io popup and the Windows RDP failures after routine updates highlight how fragile our dependency chains have become: we're often just one automated decision or third-party service away from our users losing trust or access entirely. The real lesson here isn't about building better monitoring or redundancy, though those matter; it's about remembering that behind every automated system should be a human who understands the business impact of the decisions that system makes.
Also worth reading
Gavriel Cohen found his own code inside OpenClaw, so he walked away (The New Stack)
Gavriel Cohen discovered his own code, NanoPDF, included in OpenClaw, which raised concerns about the project's security and code quality. He experienced issues with the tool, including unexpected access to all WhatsApp messages instead of just the intended group. Cohen highlighted the risks of a gr
The DIY platform trap that’s burning out engineering teams (The New Stack)
The article discusses the pitfalls of DIY platform engineering, highlighting how automation can lead to increased complexity rather than reducing it. As teams automate workflows, they create layers of scripts and tools that become difficult to maintain and understand over time. When these automation
From alert noise to action: How 24 Hour Fitness transformed IT operations with Jira Service Management and Rovo Ops (Atlassian Engineering)
24 Hour Fitness transformed its IT operations by implementing Jira Service Management, which helped reduce alert noise and cut ITSM costs by 37%. The previous system was inefficient, lacking automation and integration, leading to significant operational challenges. The new platform allows for better