On Call Brief – Week of 2026-04-12
This week's top stories
1. A guide to the breaking changes in GitLab 19.0
- Category: Breaking Change
- What happened: GitLab 19.0 introduces 15 breaking changes, including a significant shift from NGINX Ingress to Gateway API with Envoy Gateway as the default networking component. This change is crucial for GitLab Self-Managed deployments, as NGINX Ingress reached end-of-life in March 2026. Users must plan their migration to the new configuration before upgrading, although NGINX Ingress can still be used temporarily until its removal in GitLab 20.0. The article outlines deployment windows and mitigation steps for these changes.
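- Do this Monday: If you run GitLab Self-Managed, identify deployments that still use NGINX Ingress and plan the migration to Gateway API with Envoy Gateway before upgrading to 19.0. NGINX Ingress can still be used temporarily, but it is removed in GitLab 20.0, and skipping the migration could cause operational issues after the upgrade.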
- Source: GitLab Blog
2. #76 The Cost of Assumptions ⚡
- Category: Deep Dive
- What happened: This newsletter discusses the importance of questioning assumptions in serverless architecture and development. It highlights insights from AWS experts and emphasizes the need for continuous learning and adaptation in the rapidly evolving tech landscape. The issue also features recent AWS service releases and tips for developers.
- Takeaway: Understanding the cost of assumptions can lead to better decision-making in serverless implementations, potentially reducing errors and improving system reliability.
- Source: Serverless Advocate
3. Ashby taught us we have to fight fire with fire
- Category: Deep Dive
- What happened: The article discusses the paradox of using increasingly complex systems, like LLMs, to address problems caused by software complexity. It suggests that improving robustness may require embracing more complexity, raising questions about the effectiveness of this approach.
- Takeaway: This perspective could influence how teams approach system design and incident response, potentially leading to more complex solutions that may not effectively resolve underlying issues.
- Source: SRE Weekly
CVE & Security
1. React and Next.js Remote Code Execution Vulnerability - Google Cloud Mitigation
- Category: Security / Patch
- What happened: A critical remote code execution vulnerability has been identified in the React and Next.js frameworks. Google Cloud users running affected versions may be at risk. Affected versions include React 19.0 to 19.2.0 and Next.js 14.3.0-canary.77 and later. Users are advised to update to fixed versions immediately. Google Cloud has released a Web Application Firewall rule to help mitigate exploitation attempts while users update their workloads.
- Do this Monday: If you run React or Next.js workloads on Google Cloud, update to the fixed versions. Until the update is deployed, enable the Google Cloud WAF rule as a temporary mitigation against exploitation attempts.
- Source: Google Cloud Security Bulletins
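For a quick triage pass, the affected React range quoted in this item can be turned into a small check. This is a rough sketch, not the bulletin's own method. The helper names (`parse_version`, `react_needs_review`, `flag_manifest`) are made up for illustration. It covers React only, since the Next.js canary range needs prerelease-aware comparison, and it treats everything from 19.0.0 through 19.2.0 as needing review. The bulletin remains the authority on exact fixed releases.

```python
import json

# Range as summarized in this brief (assumption: inclusive on both ends).
AFFECTED_MIN = (19, 0, 0)
AFFECTED_MAX = (19, 2, 0)


def parse_version(text):
    """Turn '19.1.0', '^19.1' or 'v19.1.0-rc.1' into a 3-tuple of ints.

    Returns None when the spec is not a plain numeric version (for example
    'latest' or '19.x'), so the caller can decide how to treat it.
    """
    core = text.strip().lstrip("^~>=<v ").split("-")[0].split("+")[0]
    parts = core.split(".")[:3]
    if not all(p.isdigit() for p in parts):
        return None
    numbers = [int(p) for p in parts]
    numbers += [0] * (3 - len(numbers))  # '19.1' becomes (19, 1, 0)
    return tuple(numbers)


def react_needs_review(version_spec):
    """True if the version falls in the quoted range or cannot be parsed."""
    version = parse_version(version_spec)
    if version is None:
        return True  # be conservative about specs we cannot read
    return AFFECTED_MIN <= version <= AFFECTED_MAX


def flag_manifest(package_json_text):
    """Check the 'react' entry of a package.json document, if there is one."""
    manifest = json.loads(package_json_text)
    deps = {**manifest.get("devDependencies", {}), **manifest.get("dependencies", {})}
    spec = deps.get("react")
    return spec is not None and react_needs_review(spec)
```

A `package.json` entry is usually a range rather than a resolved version, so the lockfile is the more reliable input. Anything this check flags should be compared against the fixed versions listed in the bulletin.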
2. Ancient Excel bug comes out of retirement for active attacks
- Category: Security / Patch
- What happened: A 17-year-old critical Excel vulnerability has resurfaced and is currently being exploited, prompting an alert from CISA during Microsoft's Patch Tuesday updates.
- Do this Monday: Apply the relevant Microsoft patches to systems that run Excel. The vulnerability is being actively exploited, so treat this as a priority rather than routine patching.
- Source: The Register (Software)
3. Issues with AWS Research and Engineering Studio (RES)
- Category: Security / Patch
- What happened: AWS has identified multiple security vulnerabilities in the Research and Engineering Studio (RES), including command execution and privilege escalation issues. CVE-2026-5707 allows remote authenticated users to execute arbitrary commands as root, CVE-2026-5708 permits privilege escalation through session creation, and CVE-2026-5709 enables command execution on EC2 instances via unsanitized input in the FileBrowser API. These vulnerabilities affect versions up to 2025.12.01.
- Do this Monday: Operators using AWS Research and Engineering Studio should prioritize patching to mitigate risks associated with these vulnerabilities, as they could lead to unauthorized access and command execution on critical resources.
- Source: AWS Security Bulletins
4. CVE-2026-5190 - AWS C Event Stream Streaming Decoder Stack Buffer Overflow
- Category: Security / Patch
- What happened: CVE-2026-5190 is a critical vulnerability in the AWS Common Runtime library affecting several AWS SDKs. The vulnerability allows for memory corruption and potential arbitrary code execution in client applications that process crafted event-stream messages. Versions prior to 0.6.0 of aws-c-event-stream and specific versions of various AWS IoT device SDKs are impacted.
- Do this Monday: Operators using affected AWS SDKs should prioritize updating to the patched versions to mitigate the risk of arbitrary code execution due to this vulnerability.
- Source: AWS Security Bulletins
5. Issues with Amazon Athena ODBC Driver
- Category: Security / Patch
- What happened: The AWS Security Bulletin identifies several vulnerabilities in the Amazon Athena ODBC driver, including OS command injection and improper authentication controls. CVE-2026-5485 was fixed in version 2.0.5.1 for Linux, while the other vulnerabilities were addressed in version 2.1.0.0 for all platforms.
- Do this Monday: Update the Amazon Athena ODBC driver to the latest version on every system that uses it. Left unpatched, these vulnerabilities could allow unauthorized access to or exploitation of those systems.
- Source: AWS Security Bulletins
6. containerd 2.0.8
- Category: Security / Patch
- What happened: Containerd has released security updates across all supported branches (v1.7.31, v2.0.8, v2.1.7, and v2.2.3) to address CVE-2026-35469, a critical vulnerability in the spdystream library that could lead to credential leakage through unsanitized error messages returned via gRPC. The vulnerability affects all containerd installations using the affected spdystream component, with the fix involving proper error sanitization before gRPC responses. Operators should immediately upgrade to the latest patch version for their respective containerd branch to mitigate the credential exposure risk. Additionally, these releases include various CRI improvements, CNI issue fixes, and container runtime enhancements that improve overall stability and functionality.
- Do this Monday: Upgrade containerd to the patched release for your branch (v1.7.31, v2.0.8, v2.1.7, or v2.2.3) to close the credential exposure from CVE-2026-35469, prioritizing production environments.
- Source: containerd releases
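Because each supported branch has its own patched release, it can help to map an installed version to its upgrade target. The sketch below does that using only the four versions listed in this item. The function name `upgrade_target` and the choice to raise `LookupError` for unlisted branches are illustrative assumptions, not part of containerd's tooling.

```python
import re

# Patched releases per branch, as listed in this brief: (major, minor) -> patch.
FIXED_PATCH = {(1, 7): 31, (2, 0): 8, (2, 1): 7, (2, 2): 3}


def upgrade_target(installed):
    """Return the patched version to move to, or None if already patched.

    `installed` is any string containing a MAJOR.MINOR.PATCH version,
    for example 'v1.7.27' or '2.0.8'. Raises LookupError when the branch
    is not one of the branches listed above.
    """
    match = re.search(r"(\d+)\.(\d+)\.(\d+)", installed)
    if match is None:
        raise ValueError(f"no version found in {installed!r}")
    major, minor, patch = (int(group) for group in match.groups())

    fixed_patch = FIXED_PATCH.get((major, minor))
    if fixed_patch is None:
        raise LookupError(f"branch {major}.{minor} is not in the patched list")
    if patch >= fixed_patch:
        return None
    return f"v{major}.{minor}.{fixed_patch}"
```

Feeding it the version reported by each node gives a short list of hosts that still need the upgrade. A `LookupError` is a prompt to check whether that branch is still supported.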
7. New Rowhammer Attacks on NVIDIA GPUs Enable Full System Takeover
- Category: Security / Patch
- What happened: Researchers have shown that new Rowhammer attacks can exploit vulnerabilities in NVIDIA GPUs, allowing attackers to escalate from memory corruption to full system takeover. This highlights a serious hardware-level security risk that could impact systems using these GPUs.
- Do this Monday: Identify systems that use NVIDIA GPUs and review available hardware-level security measures, since the attack escalates from memory corruption to full system takeover.
- Source: InfoQ DevOps
8. Secure AI agent access patterns to AWS resources using Model Context Protocol
- Category: Security / Patch
- What happened: AWS has published guidance on securing AI agent access to AWS resources through the Model Context Protocol (MCP), highlighting critical security considerations for AI agents that operate at machine speed and can potentially misuse permissions if not properly controlled. The AWS Security Blog emphasizes implementing robust IAM controls specifically designed for the dynamic nature of AI agents, which differs significantly from traditional human user access patterns. Separately, JetBrains released research from their Human-AI Experience team analyzing two years of log data from 800 developers to understand how AI coding assistants impact long-term developer workflows. SRE teams should review their current IAM policies for any AI agents or coding assistants in their environment, ensuring these tools have appropriately scoped permissions and monitoring in place, while also considering how AI-assisted development practices may affect their deployment and operational procedures. Organizations using AI agents for AWS resource management should implement the MCP security patterns outlined by AWS and establish monitoring for automated agent activities that could impact system reliability.
- Do this Monday: Review IAM policies for any AI agents or coding assistants in your environment, confirm that their permissions are appropriately scoped and monitored, and apply the MCP security patterns from the AWS guidance to prevent misconfigurations that could lead to unauthorized access to AWS resources.
- Sources: AWS Security Blog, JetBrains Blog
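One way to start the IAM review described in this item is to scan policy documents attached to agent roles for overly broad grants. The sketch below is a simplified illustration, not the approach from the AWS guidance. The function name `broad_statements` is hypothetical, and the check only looks at `Allow` statements that use a bare `*` or a `service:*` action, or a `*` resource. It ignores `NotAction`, `NotResource`, conditions, and partial wildcards such as `s3:Get*`.

```python
import json


def as_list(value):
    """IAM allows a single value or a list in most fields; normalize to a list."""
    return value if isinstance(value, list) else [value]


def broad_statements(policy_json):
    """Return (index, wildcard_action, wildcard_resource) for risky statements."""
    policy = json.loads(policy_json)
    findings = []
    for index, statement in enumerate(as_list(policy.get("Statement", []))):
        if statement.get("Effect") != "Allow":
            continue  # Deny statements do not widen access
        actions = as_list(statement.get("Action", []))
        resources = as_list(statement.get("Resource", []))
        wildcard_action = any(a == "*" or a.endswith(":*") for a in actions)
        wildcard_resource = "*" in resources
        if wildcard_action or wildcard_resource:
            findings.append((index, wildcard_action, wildcard_resource))
    return findings
```

Statements that come back from this scan are candidates for tightening to specific actions and resource ARNs. A clean result does not prove least privilege; it only rules out the most obvious wildcards.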
9. Why We Chose the Harder Path: Docker Hardened Images, One Year Later
- Category: Security / Patch
- What happened: Docker reported over 500,000 daily pulls of their Docker Hardened Images (DHI) in the first year, with 2,000+ hardened images now available in their security-focused container image program. Separately, Cal.com announced they are transitioning their open-source scheduling platform from open to closed-source licensing due to security concerns related to AI systems' ability to identify and exploit vulnerabilities in publicly available code. SRE teams using Cal.com should prepare for potential licensing changes and evaluate alternative open-source scheduling solutions if needed. Organizations should also consider adopting Docker's hardened images as part of their container security strategy, particularly for production workloads where supply chain security is critical. Both developments reflect the growing emphasis on proactive security measures in response to evolving threat landscapes.
- Do this Monday: Evaluate Docker Hardened Images for production workloads where supply chain security is critical; they could provide a secure baseline without vendor lock-in, and their multi-distro support may simplify integration into existing workflows. Teams that use Cal.com should also review the licensing change and consider alternative open-source scheduling solutions if needed.
- Sources: Docker Blog, The New Stack
Releases
1. Prompt Injection Attack Exposes API Tokens in Anthropic, Google, Microsoft AI Agents
- Category: Release
- What happened: Security researchers demonstrated a new type of prompt injection attack that can hijack AI agents from Anthropic, Google, and Microsoft, allowing them to steal API keys and access tokens. The issue affects agents that integrate with GitHub Actions, and the vendors have not disclosed the vulnerabilities.
- Do this Monday: This vulnerability could lead to unauthorized access to sensitive resources if API keys are compromised - operators using these AI agents should assess their security posture and consider implementing additional safeguards.
- Source: The Register (Software)
2. April Patches for Azure DevOps Server
- Category: Release
- What happened: Azure DevOps Server has released patches addressing a null reference exception during pull request completion, improving sign-out validation to prevent malicious redirects, and fixing issues with creating PAT connections to GitHub Enterprise Server. Users are encouraged to update to the latest version for optimal security and reliability.
- Do this Monday: Apply the April patches to Azure DevOps Server promptly. They fix issues that affect both functionality (pull request completion, PAT connections to GitHub Enterprise Server) and security (sign-out redirect validation).
- Source: Azure DevOps Blog
3. One-click security scanning and org-wide alert triage come to Advanced Security
- Category: Release
- What happened: Azure DevOps introduces two significant features for enhancing application security: a one-click setup for CodeQL code scanning across repositories and a combined alerts experience for security administrators. The CodeQL setup allows organizations to enable scanning without manual pipeline configuration, while the new alerts view consolidates security alerts from all repositories into a single interface, facilitating easier management and remediation efforts.
- Do this Monday: These updates streamline security processes in Azure DevOps, potentially reducing the time and effort required for security teams to manage code scanning and alert triage across multiple repositories.
- Source: Azure DevOps Blog
4. Dissolving the Boundary Between Cloud and Network
- Category: Release
- What happened: AWS has announced the general availability of AWS Interconnect, a managed private connectivity service that enables direct connection between Amazon VPCs and VPCs on other cloud providers through automated provisioning via the AWS Console. The service, developed in partnership with Lumen Technologies, includes a "last mile" connectivity option that simplifies enterprise access to high-speed, private network connections to AWS infrastructure. SRE teams should evaluate this service for multi-cloud architectures where direct private connectivity is required, as it eliminates the complexity of traditional manual interconnect provisioning processes. The service is immediately available through the AWS Console for organizations seeking to establish dedicated network paths between AWS and other cloud environments without internet transit.
- Do this Monday: If you run a multi-cloud architecture that needs private connectivity, evaluate AWS Interconnect against your current interconnect provisioning process. It could significantly reduce the time and complexity of establishing cloud connectivity and may improve performance for workloads that require high-speed connections.
- Sources: AWS Networking Blog, AWS What's New
5. Beyond the VPN: Cloudflare Mesh builds a private network for the age of AI agents
- Category: Release
- What happened: Cloudflare has launched Cloudflare Mesh, a new private networking service designed to create secure multi-cloud networks for both human users and AI agents, according to The New Stack. Simultaneously, Cloudflare announced enhanced security controls for non-human identity management, introducing scannable tokens for credential protection and improved OAuth visibility for monitoring third-party tool access, as detailed on the Cloudflare Blog. These complementary services address the growing need for secure connectivity and identity management in AI-driven infrastructure environments. SRE teams operating multi-cloud environments or managing AI agents should evaluate these services for potential integration into their security and networking architecture, particularly if currently struggling with cross-cloud connectivity or non-human credential management. Organizations already using Cloudflare services should review the new features for alignment with their zero-trust and identity governance requirements.
- Do this Monday: Cloudflare Mesh could significantly improve security and connectivity for teams managing multi-cloud environments, especially as AI agents become more prevalent. This may reduce reliance on traditional VPNs, which can be slow and risky.
- Sources: The New Stack, Cloudflare Blog
6. OpenAI’s Agents SDK separates the harness from the compute
- Category: Release
- What happened: OpenAI has updated its Agents SDK, transforming it into a comprehensive toolbox for deploying agents in production. The new features include controlled workspaces, or sandboxes, for agents to operate securely and durably, separating the agent harness from the compute environment. This allows for better scalability and security, particularly for enterprise applications. Developers can utilize various container infrastructures to create these sandboxes, enhancing the SDK's functionality and integration with existing systems.
- Do this Monday: This update may affect production environments by providing enhanced security and scalability for AI agents, which could lead to more robust applications in enterprise settings. Operators should consider how to integrate these new features into their workflows.
- Source: The New Stack
7. Customers revolt as GitHub Copilot 'fixes' rate limits
- Category: Release
- What happened: GitHub informed Copilot customers that they need to reduce their usage due to a bug that undercounted token usage, leading to rapid exhaustion of subscription allowances. This issue has raised concerns among users regarding the service's pricing model.
- Do this Monday: If your team uses GitHub Copilot, check current usage against your subscription allowance and adjust usage patterns to avoid hitting rate limits, which could disrupt development workflows.
- Source: The Register (Software)
8. Agents are rewriting the rules of security. Here’s what engineering needs to know.
- Category: Release
- What happened: The article discusses the security implications of AI agents in software development, highlighting their capabilities to autonomously manage tasks and the associated risks. It emphasizes the need for engineering leaders to understand how these agents can impact security postures, as they expand the attack surface and introduce novel vulnerabilities. The National Institute of Standards and Technology (NIST) is studying these risks, which include potential hijacking and backdoor attacks, and stresses the importance of bridging the gap between engineering and security teams to ensure safe deployment of AI technologies.
- Do this Monday: Review where AI agents run in your engineering workflows and work with your security team on controls for the risks described, including hijacking and backdoor attacks. AI agents introduce new vulnerabilities that traditional security models may not detect, so security strategies need to adapt.
- Source: The New Stack
Lightning links
- The German Cyber Criminal Überfall: Shifts in Europe's Data Leak Landscape (Google Cloud Blog) -- Germany is now the primary target for cyber extortion in Europe, with a 50% increase in data leaks.
- How exposed is your code? Find out in minutes - for free (GitHub Blog) -- GitHub's new free tool scans up to 20 repositories for vulnerabilities, providing a comprehensive risk assessment.
- Broadcom Adds AI Agent Runtime to Tanzu PaaS Environment (DevOps.com) -- Broadcom's new runtime enables secure building and deployment of AI agents within the Tanzu PaaS.
- Navigating enterprise networking challenges with Amazon EKS Auto Mode (AWS Containers Blog) -- Amazon EKS Auto Mode automates networking configurations, simplifying enterprise Kubernetes deployments.
- Airbnb Migrates High-Volume Metrics Pipeline to OpenTelemetry (InfoQ DevOps) -- Airbnb shares insights on their successful migration to an open-source metrics stack using OpenTelemetry.
- Troubleshooting environment with AI analysis in AWS Elastic Beanstalk (AWS DevOps Blog) -- AWS Elastic Beanstalk introduces AI Analysis to enhance troubleshooting of environment health issues.
- OpenTelemetry Declarative Configuration Reaches Stability Milestone (InfoQ DevOps) -- Key portions of OpenTelemetry's declarative configuration are now stable, offering a vendor-neutral setup.
- GitHub Introduces Stacked PRs to Ease Review Bottlenecks (DevOps.com) -- GitHub's Stacked Pull Requests feature allows developers to break large updates into manageable pieces.
- How To Measure the ROI of Developer Tools (CNCF Blog) -- Understanding the ROI of developer tools is crucial for justifying investments in cloud-native technologies.
- Rovo Dev in Frontend Platform Engineering – AI for small tasks, AI for big tasks (Atlassian Engineering) -- Rovo Dev automates platform engineering tasks, aiding in large-scale migrations effectively.
Human Stories
Looking at this week's stories, I keep coming back to a fundamental truth about our work - the solutions we reach for often mirror the complexity of the problems we're trying to solve. GitLab's shift from NGINX Ingress to Gateway API with Envoy Gateway isn't just about keeping up with technology; it's about accepting that modern networking challenges demand modern complexity. The Ashby piece really drives this home with its observation that we're fighting fire with fire, using sophisticated tools like LLMs to wrangle the very complexity that software has created. What strikes me most is how the serverless cost assumptions story fits into this pattern - even when we think we're simplifying with serverless, we discover new layers of complexity in understanding costs and performance trade-offs. As SREs, we're constantly walking this tightrope between embracing necessary complexity and fighting unnecessary complication, and this week's stories remind us that sometimes the path forward isn't simpler, just different.
Also worth reading
Why “good enough” cloud databases are becoming a business risk (The New Stack)
Research indicates that many technology leaders are complacent about their cloud databases, with 38% expressing concerns that their current solutions may not meet future demands, particularly for AI/ML workloads. Despite this awareness, they often delay action until a significant event forces a change.
Presentation: Platform Engineering: Lessons from the Rise and Fall of eBay Velocity (InfoQ DevOps)
Randy Shoup discusses eBay's Velocity Initiative, which aimed to double engineering productivity and modernize DORA metrics. He outlines the technical strategies used to scale 4,500 services while highlighting the limitations of elite engineering in the face of waterfall planning, risk aversion, and
What engineering leaders get wrong about data stack consolidation (The New Stack)
The article discusses the implications of data stack consolidation, particularly following IBM's acquisition of Confluent. It highlights the shift from open-source neutrality to vendor control, which can lead to architectural debt as tools become integrated into proprietary platforms. This consolida