On Call Brief – Week of April 26–May 2, 2026

2026-04-26 — 2026-05-02 Briefing: 2026-04-26

This week's top stories

1. Microsoft Outlook for iOS still down and out for many after 'service change'

  • Category: Deep Dive
  • What happened: Microsoft Outlook for iOS users are still seeing sign-in failures and unexpected sign-outs more than 24 hours after Microsoft rolled back a configuration change in an attempt to restore service.
  • Takeaway: Affected users may be unable to reach email and calendar on iOS devices, which can disrupt workflows and communication.
  • Source: The Register (Software)
2. GitHub Faces Scaling Issues as AI Development Surges

    • Category: Deep Dive
    • What happened: GitHub is experiencing significant scaling challenges driven by a surge in AI development activity. According to DevOps.com, the resulting service disruptions prompted the company to temporarily pause new sign-ups for its Copilot service and move to usage-based pricing. To improve deployment reliability, GitHub has adopted eBPF to identify and prevent hidden circular dependencies that could hinder recovery during outages, as reported by InfoQ DevOps. SRE teams should monitor their GitHub dependencies closely and prepare contingency plans for service interruptions, particularly for AI-related workflows and deployments. Organizations that rely heavily on GitHub should consider circuit breakers and fallback mechanisms to handle availability issues during this scaling period.
    • Takeaway: GitHub's scaling issues may disrupt teams that rely on its services for AI development and may require manual remediation. The shift to usage-based pricing could also affect budgeting for teams using Copilot.
    • Sources: DevOps.com, InfoQ DevOps
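    The circuit breaker and fallback pattern mentioned in this item can be sketched roughly as follows. This is a minimal illustration, not something taken from the cited articles: the class name `CircuitBreaker`, the thresholds, and the `fetch_from_github` and `use_cached_copy` helpers are all hypothetical.

    ```python
    import time


    class CircuitBreaker:
        """Minimal circuit breaker: after repeated failures, skip the remote call
        for a cool-down period and serve a fallback instead."""

        def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
            self.failure_threshold = failure_threshold  # failures before opening
            self.reset_after = reset_after              # cool-down in seconds
            self.clock = clock                          # injectable for testing
            self.failures = 0
            self.opened_at = None                       # None means the circuit is closed

        def call(self, func, fallback):
            # While open and still cooling down, do not touch the dependency at all.
            if self.opened_at is not None:
                if self.clock() - self.opened_at < self.reset_after:
                    return fallback()
                # Cool-down elapsed: fall through and allow one trial call.
            try:
                result = func()
            except Exception:
                self.failures += 1
                # Open (or re-open after a failed trial) and restart the cool-down.
                if self.failures >= self.failure_threshold or self.opened_at is not None:
                    self.opened_at = self.clock()
                return fallback()
            # Success closes the circuit and clears the failure count.
            self.failures = 0
            self.opened_at = None
            return result


    def fetch_from_github():
        # Hypothetical stand-in for a call to a GitHub-hosted dependency.
        raise ConnectionError("service unavailable")


    def use_cached_copy():
        # Hypothetical fallback, e.g. a mirrored artifact or cached response.
        return "cached artifact"


    breaker = CircuitBreaker()
    print(breaker.call(fetch_from_github, use_cached_copy))
    ```

    The injectable `clock` is a design choice that keeps the cool-down logic testable without sleeping; a production setup would more likely rely on an established resilience library than on a hand-rolled class like this one.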
3. China-Backed Groups Are Using Massive Botnets in Espionage and Intrusion Campaigns

    • Category: Community
    • What happened: Botnets linked to China are increasingly being used in espionage and intrusion campaigns that target critical infrastructure. The article stresses the severity of the threat and the need for heightened vigilance and protective measures.
    • Worth reading: The risk of cyber espionage and attacks on critical infrastructure is rising, and organizations should assess their security posture and readiness against botnet threats.
    • Source: Techstrong Brief
4. Cursor-Opus agent snuffs out startup’s production database

    • Category: Deep Dive
    • What happened: The founder of PocketOS, an automotive SaaS platform, experienced a significant data loss event when the company's AI coding agent, Cursor-Opus, wiped its production database in under 10 seconds. Fortunately, the data was recovered, allowing the team to resume their work.
    • Takeaway: This incident highlights potential risks associated with AI coding agents in production environments, emphasizing the need for robust data recovery and backup strategies.
    • Source: The Register (Software)
5. Presentation: Week-Long Outage: Lifelong Lessons

    • Category: Deep Dive
    • What happened: Molly Struve discusses a six-day outage that severely impacted a company, highlighting technical lessons such as the importance of Failure Mode and Effects Analyses (FMEAs), shadow traffic, and rollback mechanisms. She also emphasizes the human factors in crisis management, including early collaboration and supportive leadership that fosters psychological safety.
    • Takeaway: Understanding both the technical and the human factors in outages can improve incident response and resilience in production environments. Psychological safety and leadership support can improve team dynamics during crises.
    • Source: InfoQ DevOps

CVE & Security

1. CVE-2026-7191 - Arbitrary Code Execution via Sandbox Bypass in QnABot on AWS

    • Category: Security / Patch
    • What happened: CVE-2026-7191 affects QnABot on AWS deployments, allowing authenticated administrators to execute arbitrary code by exploiting improper use of the static-eval npm package through the Content Designer interface. This critical vulnerability enables sandbox bypass, potentially granting attackers access to the underlying AWS infrastructure hosting the QnABot service. SRE teams should immediately audit all QnABot deployments for unauthorized administrative access, restrict Content Designer interface access to essential personnel only, and monitor for any suspicious code execution attempts. AWS has not yet released specific patching guidance, so operators should implement additional network segmentation around QnABot instances and review CloudTrail logs for unusual Content Designer activity until an official fix becomes available.
    • Do this Monday: This vulnerability could lead to unauthorized access to sensitive backend resources and compromise applications that use QnABot on AWS. Operators should identify deployments running affected versions and apply patches as soon as they become available.
    • Sources: AWS Security Bulletins, AWS What's New
2. Malicious Release of elementary-data PyPI Package Steals Cloud Credentials from Data Engineers

    • Category: Security / Patch
    • What happened: A malicious version of the elementary-data Python CLI was published to PyPI with a backdoor designed to steal cloud credentials from data engineers. The attack exploited a vulnerability in a GitHub Actions script.
    • Do this Monday: Data engineers using the compromised package may have had their cloud credentials exposed, potentially leading to unauthorized access to their environments.
    • Source: Snyk Blog
3. Prometheus: 3.11.3, 3.5.3

    • Category: Security / Patch
    • What happened: Prometheus versions 3.11.3 and 3.5.3 were released on April 27, 2026 to address multiple critical security vulnerabilities affecting OAuth authentication and UI components. The fixes include CVE patches for AzureAD OAuth client secrets being exposed in plaintext, vulnerabilities in snappy-compressed requests affecting remote-write and remote-read functionality, and a stored XSS vulnerability in the legacy UI. SRE teams should immediately upgrade all Prometheus instances to these patched versions, prioritizing environments that use AzureAD OAuth integration or have the old UI enabled. Organizations should also rotate any potentially exposed AzureAD OAuth client secrets as a precautionary measure following the upgrade.
    • Do this Monday: This release is critical for users of Prometheus as it resolves significant security vulnerabilities that could lead to unauthorized access and data exposure.
    • Sources: Prometheus releases (3.11.3 and 3.5.3)
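    For teams auditing a fleet, a version check along these lines can flag instances that still need the upgrade. It is only a sketch: the function names are hypothetical, and it assumes that instances on the 3.5 line are fixed by 3.5.3 while every other line needs 3.11.3 or later, which is an interpretation of the release pairing rather than something the release notes are quoted as saying here.

    ```python
    def parse_version(version):
        """Turn 'v3.11.3' or '3.5.3-rc.1' into a (major, minor, patch) tuple."""
        core = version.strip().lstrip("v").split("-")[0].split("+")[0]
        parts = [int(p) for p in core.split(".")[:3]]
        while len(parts) < 3:  # pad '3.5' to (3, 5, 0)
            parts.append(0)
        return tuple(parts)


    def is_patched(version):
        """Assumption: the 3.5 line is fixed in 3.5.3; all other lines need >= 3.11.3."""
        major, minor, patch = parse_version(version)
        if (major, minor) == (3, 5):
            return patch >= 3
        return (major, minor, patch) >= (3, 11, 3)


    # Hypothetical inventory of running instances.
    fleet = {"prom-eu": "v3.5.2", "prom-us": "3.11.3", "prom-lab": "3.10.0"}
    for name, version in fleet.items():
        status = "ok" if is_patched(version) else "needs upgrade"
        print(f"{name}: {version} -> {status}")
    ```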
4. OpenTelemetry Collector v1.57.0/v0.151.0: Breaking changes to Go module paths

    • Category: Security / Patch
    • What happened: The OpenTelemetry Collector v0.151.0 release includes breaking changes such as the use of relative paths in Go module replace statements, which may affect builds that rely on absolute paths. Enhancements include support for declarative schema in service telemetry resource configuration and additional attributes for metrics in the OTLP exporters.
    • Do this Monday: Operators using OpenTelemetry Collector should review the breaking changes to ensure compatibility with their builds, especially if they rely on absolute paths. The new enhancements may improve telemetry configuration and metrics reporting.
    • Source: OpenTelemetry Collector releases
5. Kubernetes v1.36: Mutable Pod Resources for Suspended Jobs (beta)

    • Category: Security / Patch
    • What happened: Kubernetes v1.36 introduces a beta feature that allows container resource requests and limits in the pod template of a suspended Job to be modified. The feature, previously in alpha, enables adjustments to CPU, memory, GPU, and other resources while the Job is suspended, addressing the dynamic resource needs of batch and machine learning workloads. It removes the need to delete and recreate Jobs for resource changes, preserving metadata and history.
    • Do this Monday: This change can improve resource management for suspended Jobs, particularly in environments with fluctuating workloads, enhancing efficiency in resource allocation without losing Job context.
    • Source: Kubernetes Blog
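    As a rough illustration of the workflow this item describes, the sketch below builds a patch body that changes one container's requests and limits in a Job's pod template. It only constructs the dictionary; sending it to the API server (for example as a strategic merge patch through a Kubernetes client) is left out. The function name `build_resource_patch`, the container name `trainer`, and the resource values are hypothetical, and the sketch assumes the Job is suspended (`spec.suspend: true`) on a cluster where the feature is enabled.

    ```python
    import json


    def build_resource_patch(container_name, requests, limits):
        """Return a patch body that updates one container's resources
        inside a Job's pod template (spec.template.spec.containers)."""
        return {
            "spec": {
                "template": {
                    "spec": {
                        "containers": [
                            {
                                # Containers are matched by name when merging.
                                "name": container_name,
                                "resources": {
                                    "requests": dict(requests),
                                    "limits": dict(limits),
                                },
                            }
                        ]
                    }
                }
            }
        }


    # Hypothetical values for a suspended training Job.
    patch = build_resource_patch(
        "trainer",
        requests={"cpu": "4", "memory": "16Gi"},
        limits={"cpu": "8", "memory": "32Gi"},
    )
    print(json.dumps(patch, indent=2))
    ```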
6. VCluster: v0.34.0-rc.4

    • Category: Security / Patch
    • What happened: The vCluster v0.34.0-rc.4 release includes component metrics refactoring, standalone restore functionality, and dependency upgrades that address multiple Snyk-identified CVEs, though the release notes do not list specific CVE numbers. Separately, Deloitte has demonstrated a production optimization pattern using vCluster on Amazon EKS that cuts QA environment provisioning time from 30-45 minutes to approximately 3-5 minutes (about 89% less time), while significantly reducing infrastructure costs by replacing dedicated EKS clusters with lightweight virtual clusters. SRE teams managing Kubernetes testing environments should evaluate upgrading to v0.34.0-rc.4 for the security fixes and consider vCluster for multi-tenant testing workflows where rapid environment provisioning is critical. Teams running older vCluster versions should prioritize the upgrade to pick up the resolved security vulnerabilities, particularly in production or security-sensitive environments.
    • Do this Monday: This release addresses security vulnerabilities by upgrading dependencies, which is crucial for maintaining a secure environment.
    • Sources: VCluster releases, AWS Architecture Blog
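    The 89% figure in this item is consistent with comparing the midpoints of the two provisioning-time ranges. That reading is an assumption about how the number was derived, not something the sources state; the short check below shows the midpoint case alongside the least and most favorable pairings of the range endpoints.

    ```python
    def reduction(before_minutes, after_minutes):
        """Fraction of provisioning time saved when going from before to after."""
        return (before_minutes - after_minutes) / before_minutes


    old_range = (30, 45)  # minutes, dedicated EKS clusters
    new_range = (3, 5)    # minutes, virtual clusters

    midpoint = reduction(sum(old_range) / 2, sum(new_range) / 2)  # 37.5 -> 4.0
    worst = reduction(old_range[0], new_range[1])                 # 30 -> 5
    best = reduction(old_range[1], new_range[0])                  # 45 -> 3

    print(f"midpoint: {midpoint:.0%}, worst case: {worst:.0%}, best case: {best:.0%}")
    # prints "midpoint: 89%, worst case: 83%, best case: 93%"
    ```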
7. Access control with IAM Identity Center session tags

    • Category: Security / Patch
    • What happened: The article discusses how AWS IAM Identity Center can enhance access control by using session tags combined with permission sets. This allows organizations to implement fine-grained access control and optimize resource usage by passing dynamic attributes from external identity providers like Microsoft Entra ID into AWS. The integration supports advanced features such as AWS Glue usage profiles and AWS Systems Manager Session Manager, enabling attribute-based access control while maintaining centralized access management.
    • Do this Monday: Organizations using AWS IAM Identity Center can improve their access management strategies and enhance security by leveraging session tags for dynamic permissions, which may affect how access is configured across AWS accounts.
    • Source: AWS Security Blog
8. OpenChoreo 1.0 Brings AI Agents and GitOps to Kubernetes Developer Platforms

    • Category: Security / Patch
    • What happened: OpenChoreo has released version 1.0 and is now part of the CNCF Sandbox. This open-source platform aims to provide engineering teams with a comprehensive foundation for running workloads on Kubernetes, eliminating the need for custom builds.
    • Do this Monday: This release may simplify Kubernetes workload management for teams, potentially reducing setup time and complexity.
    • Source: InfoQ DevOps

Releases

1. What the March 2026 Threat Technique Catalog update means for your AWS environment

    • Category: Release
    • What happened: The March 2026 update to the AWS Threat Technique Catalog introduces new entries focusing on security threats related to identity, persistence, and infrastructure destruction. Key highlights include the risk of Amazon Cognito refresh token abuse, which can allow unauthorized persistent access if tokens are not rotated or have long lifetimes. Additionally, the update addresses the threat of AMI image deletion, which can hinder recovery efforts during incidents. The article emphasizes the importance of implementing mitigations such as enabling refresh token rotation and understanding AMI management to enhance security posture.
    • Do this Monday: This update highlights critical security risks in AWS environments, particularly around identity management and disaster recovery. Operators should review their token management practices and AMI retention settings to prevent unauthorized access and ensure recovery capabilities.
    • Source: AWS Security Blog
2. GitHub Copilot code review will start consuming GitHub Actions minutes on June 1, 2026

    • Category: Release
    • What happened: GitHub is moving all Copilot plans to usage-based billing effective June 1, 2026, replacing the current flat-rate model with a credit-based system that charges by token consumption rather than premium request units. Each subscription will include a monthly allotment of GitHub AI Credits, with overages charged for additional usage. Under the new billing structure, Copilot code reviews will also begin consuming GitHub Actions minutes for private repositories. SRE teams should audit their current Copilot usage patterns and GitHub Actions minute consumption to forecast potential cost increases, particularly in organizations with extensive code review workflows on private repositories. Organizations should also review their GitHub billing plans and consider upgrading or adjusting usage limits before the June 2026 transition to avoid unexpected charges.
    • Do this Monday: This change will affect billing for organizations using GitHub Copilot for code reviews, particularly those with private repositories. Teams need to manage their GitHub Actions minutes to avoid unexpected costs.
    • Sources: GitHub Changelog, The New Stack, GitHub Blog
3. GitLab: 2 related updates

    • Category: Release
    • What happened: GitLab has announced enhanced integration with Anthropic's Claude AI for enterprise development environments, enabling AI-assisted coding within existing security and compliance frameworks (GitLab Blog). The integration allows organizations to leverage AI capabilities while maintaining governance controls over AI interactions in their development workflows. Additionally, GitLab has published guidance on implementing CI/CD observability at scale, specifically targeting self-managed GitLab instances through containerized observability solutions developed in partnership with financial services organizations (GitLab Blog). SRE teams operating GitLab instances should evaluate the new Claude AI integration for potential adoption while reviewing the observability implementation patterns to improve monitoring of their CI/CD pipelines. Organizations should assess both offerings against their current governance policies and observability requirements to determine implementation feasibility.
    • Do this Monday: This change may affect production by enabling faster and more compliant AI-driven development processes, ensuring that AI actions are subject to the same governance as human developers. Organizations can leverage existing cloud contracts for AI workloads, potentially streamlining operations.
    • Sources: GitLab Blog (two posts)
4. Implement SPIFFE/SPIRE authorization on Amazon EKS

    • Category: Release
    • What happened: The article discusses implementing SPIFFE/SPIRE on Amazon EKS to enhance security in distributed applications. It addresses challenges like secure communication in untrusted networks and workload authentication without relying on network controls. The guide explains how SPIFFE and SPIRE can provide workload identity attestation, deliver short-lived X.509 certificates for mTLS, and generate JWTs for flexible authentication. It outlines the deployment of SPIRE across multiple EKS clusters, focusing on secure service-to-service communication and fine-grained authorization policies.
    • Do this Monday: Implementing SPIFFE/SPIRE can significantly improve security for microservices in EKS by enabling mTLS and flexible authentication, which is crucial for maintaining secure communications in complex, distributed environments.
    • Source: AWS Containers Blog
5. Sentry’s Seer Agent lets developers debug production issues in natural language

    • Category: Release
    • What happened: Sentry has launched Seer Agent, a natural-language debugging tool that allows developers to investigate production issues by querying their observability stack. This tool, available in open beta, aims to address gaps in Sentry's existing AI tools, enabling users to describe symptoms and receive insights without needing predefined issues. Seer Agent complements Sentry's Autofix feature, which requires known issues to function. The tool is designed to provide context and facilitate broader investigations into application performance problems.
    • Do this Monday: Sentry's Seer Agent could enhance debugging processes, allowing teams to address production issues more effectively and potentially reduce downtime. Its natural-language interface may streamline incident response and improve developer efficiency.
    • Source: The New Stack
6. containerd 2.3.0-rc.0

    • Category: Release
    • What happened: The v2.3.0-rc.0 release of containerd is a pre-release focusing on stability and new features. It aligns its release cadence with Kubernetes, introducing an annual LTS release with support for at least two years. Key highlights include new transfer types for container filesystem copy, a shim bootstrap protocol, options for injecting trace IDs into logs, and improvements to the Container Runtime Interface (CRI) and Node Resource Interface (NRI).
    • Do this Monday: This release may affect production environments using containerd, particularly with the introduction of new features and the alignment with Kubernetes release schedules. Operators should review the changes for compatibility and potential benefits.
    • Source: containerd releases

Lightning links

    Human Stories

    When I look at these stories together, what strikes me most is how quickly our carefully constructed systems can unravel when we introduce change without fully understanding the blast radius. Microsoft's Outlook rollback that failed to restore service, GitHub buckling under AI-driven load it didn't anticipate, and that gut-wrenching moment when Cursor-Opus wiped PocketOS's production database in under ten seconds - these aren't just technical failures, they're reminders of how little buffer we actually have between working and broken. Molly Struve's presentation about that six-day outage feels particularly relevant here because it forces us to confront an uncomfortable truth: we often discover our systems' real failure modes only when they're actively destroying value. The common thread isn't that these teams lacked skill or care, but that modern systems are so interconnected and fast-moving that our traditional safety nets - rollbacks, testing, gradual rollouts - sometimes aren't enough to catch the edge cases that matter most.

    Also worth reading

    Why Terraform is green when your cloud is broken (The New Stack)

    The article discusses how Terraform's state file can become out of sync with actual infrastructure changes, leading to unexpected issues, such as an API returning 403 errors due to a manually updated S3 bucket policy that was not reflected in Terraform. It emphasizes that Terraform's state is a snap

    Presentation: Building a Future-Proof Observability Platform to Empower Engineers (InfoQ DevOps)

    Wayne Bell and Dan Gomez Blanco discuss the architectural and cultural shift needed to scale observability at Skyscanner. They explain how adopting OpenTelemetry helped decouple instrumentation from vendors and emphasize the importance of treating the observability platform as a product for engineer