On Call Brief – Week of 2026-03-22
This week's top stories
1. Another One Bites the Dust: What the CDKTF Deprecation Means for You
- Category: Breaking Change
- What happened: HashiCorp has deprecated CDKTF as of December 10. The article discusses migration options to OpenTofu or Pulumi and emphasizes how env zero can help prevent vendor lock-in in the future.
- Do this Monday: If you run CDKTF, inventory the stacks that depend on it and start evaluating a migration path (the article covers OpenTofu and Pulumi) so the deprecation does not disrupt your infrastructure management.
- Source: env0 Blog
2. Windows Server 2025 SMB SID hardening is beachballing legacy clients
- Category: Community
- What happened: Windows Server 2025 introduces SMB SID hardening that can cause issues for legacy clients, resulting in 'incorrect username or password' errors. This is due to stricter checks on machine identities that terminate sessions if duplicate SIDs are detected, often arising from un-sysprepped VM clones. The problem is exacerbated in automated environments where identity can be ambiguous. Users are advised to check their nodes with 'psgetsid' and generalize if duplicates are found.
- Worth reading: This change can break file sharing for legacy systems and automated environments. Operators should verify that every VM clone has a unique SID and, as the thread suggests, consider user-space stacks to avoid these issues.
- Source: Reddit r/sysadmin
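The duplicate-SID check above can be scripted once each node's machine SID has been collected, for example from psgetsid output. The following is a minimal Python sketch, not a tool from the thread: the function name find_duplicate_sids, the host names, and the SID values are all made up for illustration.

```python
from collections import defaultdict


def find_duplicate_sids(sid_by_host):
    """Return {sid: [hosts]} for every SID reported by more than one host."""
    hosts_by_sid = defaultdict(list)
    for host, sid in sid_by_host.items():
        # Normalize case and whitespace so formatting differences do not hide a clash.
        hosts_by_sid[sid.strip().upper()].append(host)
    return {sid: sorted(hosts) for sid, hosts in hosts_by_sid.items() if len(hosts) > 1}


# Hypothetical inventory: these host names and SIDs are illustrative only.
inventory = {
    "fs-01": "S-1-5-21-1111111111-2222222222-3333333333",
    "app-07": "S-1-5-21-1111111111-2222222222-3333333333",
    "app-08": "S-1-5-21-4444444444-5555555555-6666666666",
}

duplicates = find_duplicate_sids(inventory)
for sid, hosts in duplicates.items():
    print(f"{sid} is shared by: {', '.join(hosts)}")
```

Any host that shows up in the result is a candidate for generalizing, as the thread advises.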
3. DevOps'ish 301: Super Micro Arrests, FT PO'd AWS, Show Me the Tokens, and more
- Category: Breaking Change
- What happened: Kubernetes SIG Network has announced the end of support for the Ingress NGINX controller and is directing operators to migrate workloads to Gateway API. The deprecation affects all current Ingress NGINX deployments. Once support ends, security updates stop, so unmigrated deployments may be left vulnerable. Gateway API implementations offer more advanced traffic management and better extensibility, but operators should review the Gateway API documentation and test compatibility with existing workloads before migrating production.
- Do this Monday: Inventory every Ingress NGINX deployment and draft a migration timeline to a Gateway API implementation. Separately, the legal issues surrounding Super Micro could affect supply chains and the availability of Nvidia chips, which matters for production environments that rely on that hardware.
- Sources: DevOps'ish (2 entries)
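To get a feel for what the move from Ingress to Gateway API involves, here is a minimal Python sketch that maps a simple Ingress manifest (as a dict) onto HTTPRoute objects. It is an illustration under stated assumptions, not a migration tool: the function name ingress_to_httproutes, the gateway name shared-gateway, and the example manifest are hypothetical. It handles only host, path, and numeric service port, and it ignores TLS settings and controller-specific annotations, which usually need manual review.

```python
def ingress_to_httproutes(ingress, gateway_name):
    """Build one HTTPRoute dict per Ingress host rule (simple paths and backends only)."""
    # Ingress pathType -> Gateway API path match type.
    path_types = {"Prefix": "PathPrefix", "Exact": "Exact"}
    routes = []
    for index, rule in enumerate(ingress["spec"].get("rules", [])):
        route_rules = []
        for path in rule.get("http", {}).get("paths", []):
            service = path["backend"]["service"]
            route_rules.append({
                "matches": [{
                    "path": {
                        # Assumption: ImplementationSpecific falls back to PathPrefix; review by hand.
                        "type": path_types.get(path.get("pathType"), "PathPrefix"),
                        "value": path.get("path", "/"),
                    }
                }],
                # Assumption: the backend uses a numeric port, not a named one.
                "backendRefs": [{"name": service["name"], "port": service["port"]["number"]}],
            })
        route = {
            "apiVersion": "gateway.networking.k8s.io/v1",
            "kind": "HTTPRoute",
            "metadata": {
                "name": f"{ingress['metadata']['name']}-{index}",
                "namespace": ingress["metadata"].get("namespace", "default"),
            },
            "spec": {"parentRefs": [{"name": gateway_name}], "rules": route_rules},
        }
        if rule.get("host"):
            route["spec"]["hostnames"] = [rule["host"]]
        routes.append(route)
    return routes


# Hypothetical manifest used only to exercise the sketch.
example_ingress = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "Ingress",
    "metadata": {"name": "shop", "namespace": "web"},
    "spec": {
        "rules": [{
            "host": "shop.example.com",
            "http": {"paths": [{
                "path": "/api",
                "pathType": "Prefix",
                "backend": {"service": {"name": "shop-api", "port": {"number": 8080}}},
            }]},
        }],
    },
}

routes = ingress_to_httproutes(example_ingress, gateway_name="shared-gateway")
print(routes[0]["spec"])
```

Running the conversion against real manifests is mostly useful for surfacing what does not translate cleanly, which is where the migration effort tends to go.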
4. HPE’s AI agents cut root cause analysis time in half
- Category: Deep Dive
- What happened: Hewlett Packard Enterprise (HPE) is introducing an enterprise-grade agentic operations system that utilizes AI agents to enhance operational efficiency. This system, currently in beta, has reportedly reduced root cause analysis time by at least 50% for early adopters. The AI agents are designed to assist operations teams by bridging data silos and enabling autonomous actions, while still requiring human oversight to build trust in their recommendations. The full release is expected in 2026.
- Takeaway: The introduction of AI agents could significantly streamline incident response processes, potentially reducing downtime and improving operational efficiency. Teams may need to adapt to new workflows involving AI collaboration, which could change how incidents are managed.
- Source: The New Stack
5. LHR (London) on 2026-03-26
- Category: Deep Dive
- What happened: Cloudflare has scheduled maintenance in the London (LHR) datacenter from March 26, 2026, 23:30 UTC to March 27, 2026, 06:00 UTC. During this time, traffic may be re-routed, potentially increasing latency for users in the region. Customers using PNI/CNI should prepare for possible traffic failover as network interfaces may be temporarily unavailable.
- Takeaway: Operators should anticipate increased latency and possible traffic rerouting during the maintenance window, which could affect user experience and connectivity for services relying on the London datacenter.
- Source: Cloudflare Status
6. LAX (Los Angeles) on 2026-03-27
- Category: Deep Dive
- What happened: Cloudflare has announced scheduled maintenance at the LAX (Los Angeles) datacenter on March 27, 2026, from 07:00 to 15:00 UTC. During this time, traffic may be re-routed, potentially causing slight latency increases for end-users. Customers using PNI/CNI connections should prepare for possible traffic failover as network interfaces may be temporarily unavailable.
- Takeaway: Operators should anticipate possible latency increases and prepare for traffic failover during the maintenance window, which could affect service availability for users in the region.
- Source: Cloudflare Status
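For teams that want alerts or dashboards annotated during the LHR and LAX windows above, a small helper can report which windows are active at a given time. This is a minimal Python sketch: the window times are the ones stated in this brief, while the names WINDOWS and active_windows are illustrative. It expects timezone-aware datetimes.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Maintenance windows as stated in the brief (all times UTC).
WINDOWS = {
    "LHR": (datetime(2026, 3, 26, 23, 30, tzinfo=UTC), datetime(2026, 3, 27, 6, 0, tzinfo=UTC)),
    "LAX": (datetime(2026, 3, 27, 7, 0, tzinfo=UTC), datetime(2026, 3, 27, 15, 0, tzinfo=UTC)),
}


def active_windows(now):
    """Return the sites whose maintenance window contains `now` (timezone-aware)."""
    return sorted(site for site, (start, end) in WINDOWS.items() if start <= now < end)


if __name__ == "__main__":
    current = datetime.now(UTC)
    sites = active_windows(current)
    print(f"{current.isoformat()}: maintenance active at {sites or 'no listed site'}")
```

The two windows do not overlap: LHR ends at 06:00 UTC and LAX starts at 07:00 UTC on March 27.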
7. Cursor ACP: session/load fails with Session "" not found, breaking persistent sessions (acpx/OpenClaw ACP runtime)
- Category: Community
- What happened: The Cursor ACP server is experiencing a bug where persistent sessions fail to load, resulting in an error message stating that the session is not found. This issue affects orchestrators like acpx/OpenClaw, leading to process stalls and non-zero exit codes. One-shot mode functions correctly, but the persistent session feature is currently broken.
- Worth reading: This bug can disrupt workflows that rely on persistent sessions in the Cursor CLI, potentially causing failures in automation and orchestration tasks.
- Source: Cursor Forum
CVE & Security
1. React and Next.js Remote Code Execution Vulnerability - Cloud Armor Mitigation
- Category: Security / Patch
- What happened: A critical remote code execution vulnerability has been identified in the React and Next.js frameworks, affecting specific versions. While Google Cloud itself is not vulnerable, users running these frameworks on Google Cloud may be at risk. Google has released a Cloud Armor WAF rule to help mitigate exploitation attempts. Users are advised to update their frameworks immediately to the patched versions.
- Do this Monday: Operators using React or Next.js on Google Cloud must update to the fixed versions to avoid potential exploitation. The availability of a WAF rule provides a temporary mitigation but should not replace updating dependencies.
- Source: Google Cloud Security Bulletins
2. Argo CD v3.1.13
- Category: Security / Patch
- What happened: Argo CD has released updates across multiple branches: v3.1.13, v3.2.8, v3.3.5, and v3.4.0-rc3. Versions 3.1.13 and 3.2.8 contain mitigations for CVE-2026-33186. According to the release announcements, all four releases also include bug fixes, dependency updates, and improvements to container logic, UI messages, and hook resource creation. Teams running Argo CD should upgrade to the latest version in their branch (3.1.x to 3.1.13, 3.2.x to 3.2.8, 3.3.x to 3.3.5) using the kubectl installation commands provided for non-HA and HA configurations. The releases ship signed container images and provenance attestations, and the release documentation emphasizes verifying signatures during deployment.
- Do this Monday: Upgrade to the latest release in your branch. If you run 3.1.x or 3.2.x, prioritize it: 3.1.13 and 3.2.8 carry the mitigation for CVE-2026-33186.
- Sources: Argo CD releases (4 entries)
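The branch-to-target mapping above is easy to encode if you need to check a fleet of Argo CD installations. A minimal Python sketch follows; the target versions come from this brief, while the names PATCHED, parse_version, and upgrade_target are illustrative. How you obtain each installation's current version is left out.

```python
import re

# Latest release per branch, as listed in the brief.
PATCHED = {(3, 1): "3.1.13", (3, 2): "3.2.8", (3, 3): "3.3.5"}


def parse_version(version):
    """Parse 'v3.1.9' or '3.1.9' into (3, 1, 9); any pre-release suffix is ignored."""
    match = re.match(r"v?(\d+)\.(\d+)\.(\d+)", version)
    if not match:
        raise ValueError(f"unrecognized version: {version!r}")
    return tuple(int(part) for part in match.groups())


def upgrade_target(current):
    """Return the version to upgrade to, or None if `current` is already at or past it."""
    major, minor, patch = parse_version(current)
    target = PATCHED.get((major, minor))
    if target is None:
        # Branches not named in the brief need a manual look at the release notes.
        raise LookupError(f"no target listed for the {major}.{minor}.x branch")
    return target if (major, minor, patch) < parse_version(target) else None


for installed in ["v3.1.9", "3.2.8", "v3.3.4"]:
    print(installed, "->", upgrade_target(installed) or "up to date")
```

Comparing versions as integer tuples avoids the string-comparison trap where "3.1.9" sorts after "3.1.13".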
3. Grafana 12.4.2
- Category: Security / Patch
- What happened: Grafana has released security-focused versions across the 11.x and 12.x branches: 11.6.14+security-01, 12.1.10+security-01, 12.2.8+security-01, 12.3.6+security-01, and 12.4.2. Version 12.4.2 includes accessibility improvements, dashboard display name resolution fixes, and multiple security patches; the CVE numbers are not detailed in the available information. The "+security-01" suffix on the older branches marks those builds as security patch releases. SRE teams should promptly upgrade to the latest security release for their branch and prioritize it in maintenance windows.
- Do this Monday: Apply the security release for your branch; the security fixes are the urgent part. The accessibility and dashboard improvements in 12.4.2 may improve user experience but are less urgent.
- Sources: Grafana releases (5 entries)
4. DRA: A new era of Kubernetes device management with Dynamic Resource Allocation
- Category: Security / Patch
- What happened: Google Cloud has introduced Dynamic Resource Allocation (DRA) for Kubernetes, which allows for more efficient management of GPUs and TPUs by automating hardware resource allocation. This new standard replaces the previous Device Plugin framework, which required manual node pinning and only allowed simple integer resource requests. DRA enables a flexible, request-based model for resource management, improving workload portability and efficiency, especially for AI applications.
- Do this Monday: The adoption of DRA could significantly streamline the deployment of AI workloads on Kubernetes by reducing the complexity of resource management and improving resource utilization. Operators may need to adapt their workflows to leverage this new capability effectively.
- Source: Google Cloud Blog
5. Deploy VPC Block Public Access across AWS Organizations
- Category: Security / Patch
- What happened: The article discusses how to implement VPC Block Public Access (BPA) across multiple AWS accounts using AWS Organizations declarative policies. This approach centralizes security management, reduces manual configuration, and ensures consistent security enforcement across an organization. It highlights the benefits of maintaining a desired configuration automatically, even as new accounts and resources are created, and provides guidance on assessing current environments and managing exceptions.
- Do this Monday: Implementing BPA can significantly reduce operational overhead and improve security compliance across AWS accounts, which is crucial for organizations managing large AWS environments.
- Source: AWS Networking Blog
6. PDX (Portland) on 2026-03-23
- Category: Security / Patch
- What happened: Cloudflare has scheduled maintenance for their PDX (Portland) datacenter on March 23, 2026, from 14:00 to 23:59 UTC, with traffic rerouting expected to cause increased latency for end-users in the affected region. The maintenance window appears to have been adjusted during planning, with one source indicating an end time of 23:45 UTC while another specifies 23:59 UTC. Customers using PNI (Private Network Interconnect) or CNI (Cloudflare Network Interconnect) services should expect potential impact and may need to implement failover procedures or communicate expected latency increases to stakeholders. SRE teams should monitor application performance metrics during this window and verify that load balancing configurations can handle the traffic rerouting effectively.
- Do this Monday: If you use PNI/CNI connections, prepare for failover. Expect increased latency and possible traffic rerouting for users in the region during the maintenance window.
- Sources: Cloudflare Status (2 entries)
Releases
1. Sophisticated Supply Chain Attack Targeting Trivy Expands to Checkmarx, LiteLLM
- Category: Release
- What happened: A sophisticated supply chain attack by the threat group TeamPCP has expanded beyond its initial compromise of Aqua Security's Trivy to now target Checkmarx and LiteLLM, with attackers harvesting credentials from CI runner memory and specifically targeting GitHub personal access tokens. Concurrently, researchers have identified a separate supply chain vulnerability affecting 121 agent skills across 7 repositories in Skills.sh and SkillsDirectory, where GitHub username changes create opportunities for repository hijacking attacks when previous usernames become available for registration. SRE teams should immediately audit their CI/CD pipelines for credential exposure in runner memory, review GitHub personal access tokens for unauthorized usage, and verify that any agent skills or external repositories referenced in their systems are still controlled by legitimate maintainers. Organizations should also implement additional validation checks for external dependencies and consider using commit hash pinning rather than branch references for critical supply chain components. These incidents highlight the critical need for comprehensive supply chain security monitoring, as reported by DevOps.com and academic research from arXiv:2603.16572.
- Do this Monday: This attack poses a significant risk to CI/CD pipelines, as compromised tokens can lead to broader supply chain vulnerabilities. Immediate action is required to rotate credentials and secure workflows.
- Sources: DevOps.com, Reddit r/netsec
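The advice above to pin by commit hash rather than branch reference can be checked mechanically for GitHub Actions workflows. The following is a minimal Python sketch that flags `uses:` references not pinned to a full 40-character commit SHA. It is a line-based heuristic rather than a YAML parser, and the example-org actions and the function name unpinned_actions are hypothetical.

```python
import re

# Matches "uses: owner/repo@ref" lines, with or without a leading list dash or quotes.
USES_LINE = re.compile(r"""^\s*(?:-\s*)?uses:\s*['"]?([^'"\s#]+)""")
FULL_SHA = re.compile(r"^[0-9a-f]{40}$")


def unpinned_actions(workflow_text):
    """Return (line_number, reference) for each action not pinned to a full commit SHA."""
    findings = []
    for line_number, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_LINE.match(line)
        if not match:
            continue
        reference = match.group(1)
        if reference.startswith(("./", "docker://")):
            continue  # local actions and container images are out of scope for this sketch
        _, _, ref = reference.partition("@")
        if not FULL_SHA.match(ref):
            findings.append((line_number, reference))
    return findings


# Hypothetical workflow; the example-org actions do not refer to real repositories.
WORKFLOW = """\
jobs:
  scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: example-org/example-action@0123456789abcdef0123456789abcdef01234567
      - uses: example-org/other-action@main
      - uses: ./local-action
"""

findings = unpinned_actions(WORKFLOW)
for line_number, reference in findings:
    print(f"line {line_number}: {reference} is not pinned to a commit SHA")
```

A tag or branch can be moved to point at different code after you review it; a full commit SHA cannot, which is why the check treats anything else as unpinned.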
2. Authenticate Bitbucket Packages with native Pipelines authentication
- Category: Release
- What happened: Bitbucket has introduced a feature that allows developers to authenticate with the Bitbucket Packages container registry using built-in authentication in Bitbucket Pipelines, eliminating the need for personal tokens. This enhancement improves security by using short-lived tokens that expire automatically, simplifies access by providing necessary variables in every pipeline step, and reduces setup complexity. Developers can now push and pull packages more securely and easily within their repositories.
- Do this Monday: This change may affect production workflows by streamlining package management and enhancing security in CI/CD processes, reducing the risk associated with managing long-lived credentials.
- Source: Bitbucket Blog
3. Announcing Amazon Aurora PostgreSQL serverless database creation in seconds
- Category: Release
- What happened: AWS has announced the general availability of a new express configuration for Amazon Aurora PostgreSQL, allowing users to create serverless databases in seconds with just two clicks. This streamlined experience includes preconfigured defaults, the ability to modify settings during and after creation, and an internet access gateway for secure connections without the need for a VPN. The new configuration supports passwordless IAM authentication by default and enables high availability features such as read replicas and automated failover.
- Do this Monday: The rapid database creation capability could significantly speed up development and deployment processes, making it easier for teams to prototype and scale applications. The removal of VPC requirements and the introduction of an internet access gateway may also simplify network configurations for developers.
- Source: AWS What's New
4. Your Kubernetes isn’t ready for AI workloads, and drift is the reason
- Category: Release
- What happened: The article discusses the challenges of running AI workloads on Kubernetes due to infrastructure drift, which includes mismatched kernels and manual patching. It argues that traditional Kubernetes environments are not built for the determinism required by AI workloads, leading to potential failures. The author suggests moving towards an API-driven, immutable operating system and a unified management approach to eliminate drift and enhance reliability for AI applications.
- Do this Monday: Kubernetes environments may struggle to support AI workloads effectively due to existing infrastructure drift, which could lead to failures and compliance risks. Operators should consider adopting more deterministic infrastructure to ensure reliability.
- Source: The New Stack
5. Higress Joins CNCF: Delivering an enterprise-grade AI gateway and a seamless path from Nginx Ingress
- Category: Release
- What happened: Higress has joined the Cloud Native Computing Foundation (CNCF) as a Sandbox project, offering an AI-native API gateway built on Envoy and Istio. It combines traffic gateway, microservices gateway, and AI gateway functionalities, simplifying operations for cloud-native and AI workloads. Higress serves as a mature Kubernetes Ingress Controller, compatible with Nginx Ingress, and enhances security by replacing legacy configurations with a robust control plane. It supports AI traffic with features like token-based rate limiting and model-aware routing, making it suitable for enterprise AI applications.
- Do this Monday: Higress provides a secure alternative to Nginx Ingress, which is set to retire in 2026. Its capabilities for handling AI traffic and integration with existing Kubernetes environments could significantly impact how organizations manage cloud-native and AI workloads, potentially reducing operational complexity and enhancing security.
- Source: CNCF Blog
6. Akuity Adds Ability to Customize Kargo Pipelines
- Category: Release
- What happened: Akuity has introduced a new feature in Kargo that allows teams to customize the steps for promoting applications into production using a GitOps workflow. This feature, called Custom Steps, enables the definition of promotion logic directly within the pipeline, such as policy checks or security scans, eliminating the need for custom scripts. Each step runs in a Kubernetes Pod and is recorded in the Kargo promotion record. Kargo is designed to work with various platforms and aims to streamline the deployment process for DevOps teams.
- Do this Monday: The addition of Custom Steps in Kargo could significantly enhance the flexibility and efficiency of CI/CD workflows for teams using GitOps, allowing for more automated and standardized promotion processes in production environments.
- Source: DevOps.com
7. AI supply chain attacks don’t even require malware…just post poisoned documentation
- Category: Release
- What happened: A proof-of-concept attack on Context Hub reveals significant vulnerabilities in the service that assists coding agents with API updates, indicating a lack of content sanitization. This could lead to supply chain attacks that do not require traditional malware.
- Do this Monday: The vulnerability in Context Hub could expose systems to supply chain attacks, potentially affecting the integrity of API interactions and overall security posture.
- Source: The Register (Software)
8. Security as Code is Becoming the New Baseline: Continuous Compliance in DevOps
- Category: Release
- What happened: The article discusses the shift from traditional compliance practices to 'security as code', emphasizing the need for continuous compliance in DevOps. It argues that compliance should be integrated into the delivery pipeline rather than treated as a periodic checkpoint. The author highlights the importance of version-controlling security policies and automating their enforcement, drawing parallels to the impact of infrastructure as code. The piece also notes that increasing regulatory pressures are making continuous compliance essential for organizations.
- Do this Monday: This shift to security as code could significantly affect how compliance is managed in production environments, requiring teams to adopt new practices for integrating security into their CI/CD pipelines.
- Source: DevOps.com
9. ControlMonkey Extends Cloud Configuration Disaster Recovery to Observability Platforms
- Category: Release
- What happened: ControlMonkey has expanded its disaster recovery platform to include observability and monitoring platforms, protecting configurations for services like Datadog, New Relic, Dynatrace, Grafana Cloud, and Splunk. The update features automatic daily snapshots of critical observability configurations.
- Do this Monday: This expansion may improve recovery options for observability tools, which can be crucial for maintaining monitoring integrity during incidents. Operators should consider integrating it into their disaster recovery strategy.
- Source: Cloud Native Now
10. How lookup tables turn observability data into business insight
- Category: Release
- What happened: The article discusses how lookup tables in Dynatrace can bridge the gap between raw observability data and business insights. It highlights the challenge of cryptic identifiers in dashboards and how lookup tables can enrich this data by providing human-readable context. Real-world examples illustrate how teams can use lookup tables to enhance their observability efforts, making it easier to understand who is impacted during incidents.
- Do this Monday: Lookup tables can significantly improve the clarity of observability data, reducing the time spent on identifying impacted users during incidents. This could lead to faster incident resolution and better communication with business stakeholders.
- Source: Dynatrace Blog
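The enrichment idea behind lookup tables is the same in any tool: join a cryptic identifier against a table of human-readable context. Here is a minimal Python sketch of that pattern. It is not Dynatrace syntax, and the field names, identifiers, and customer names are invented for illustration.

```python
# Hypothetical lookup table: cryptic identifier -> human-readable context.
CUSTOMER_LOOKUP = {
    "cust-4f9a": {"customer": "Example Retail", "tier": "enterprise"},
    "cust-77b2": {"customer": "Sample Logistics", "tier": "standard"},
}


def enrich(events, lookup, key="customer_id"):
    """Return copies of the events with lookup fields merged in where the key matches."""
    enriched = []
    for event in events:
        context = lookup.get(event.get(key), {})
        # Unknown identifiers pass through unchanged rather than being dropped.
        enriched.append({**event, **context})
    return enriched


# Hypothetical events as they might appear on a dashboard.
events = [
    {"customer_id": "cust-4f9a", "error_rate": 0.12},
    {"customer_id": "cust-0000", "error_rate": 0.03},
]

enriched_events = enrich(events, CUSTOMER_LOOKUP)
for row in enriched_events:
    print(row.get("customer", row["customer_id"]), row["error_rate"])
```

During an incident, the enriched rows answer "who is impacted" directly instead of sending someone off to decode identifiers.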
11. EDB Highlights CloudNativePG 1.29, Previews Kubernetes-Native Data Protection at KubeCon Europe
- Category: Release
- What happened: EnterpriseDB announced the community release of CloudNativePG 1.29, which includes modular extensions, and previewed a new Kubernetes-native data protection feature for its commercial operator at KubeCon + CloudNativeCon Europe 2026.
- Do this Monday: The new features in CloudNativePG could enhance PostgreSQL management on Kubernetes, potentially improving data protection strategies for operators using this technology.
- Source: Cloud Native Now
Lightning links
- 2025 env zero Product Release Highlights (env0 Blog) -- Discover new AI-powered features and enhanced cloud visibility in the latest env zero release.
- Fivetran donates its SQLMesh data transformation framework to the Linux Foundation (The New Stack) -- SQLMesh empowers data teams to streamline SQL-based transformations with community support.
- Ubuntu 26.04 Ends 46 Years of Silent sudo Passwords (Lobsters) -- sudo now shows masked feedback while the password is typed, ending the long-standing silent prompt.
- Show HN: TMA1 – Local-first observability for LLM agents (Hacker News Show HN) -- TMA1 offers open-source observability for LLM agents, focusing on local metrics without cloud reliance.
- Five real-world lessons for building developer workflows in the agentic era (Dynatrace Blog) -- Learn key insights on integrating AI agents into workflows for improved developer efficiency.
- Kubernetes Builds a Sandbox CRD for AI Agents (Cloud Native Now) -- The new Agent Sandbox project in Kubernetes aims to create a lightweight environment for AI agents.
- Drift Under Control: Keep Your Infrastructure Consistent (env0 Blog) -- Use env zero for automated drift detection and remediation to maintain infrastructure consistency.
- Best CI Tools for 2026: What the Data Actually Shows (JetBrains Blog) -- An analysis of CI/CD tool adoption rates reveals insights into the evolving landscape of developer tools.
Human Stories
Looking at this week's collection of outages, deprecations, and breaking changes, I'm struck by how much of our job has become managing the constant churn of dependencies we never chose. HashiCorp's CDKTF deprecation and the end of Ingress NGINX support remind us that the platforms we build on are constantly shifting beneath our feet, often with little regard for the operational burden they create downstream. When Windows Server 2025's SMB hardening breaks legacy clients or Cursor's session management simply stops working, we're reminded that progress and stability exist in perpetual tension - every security improvement, every architectural evolution carries the potential to disrupt the delicate systems we've spent years tuning. The promise of HPE's AI agents cutting root cause analysis time in half feels almost ironic against this backdrop; we're automating our response to complexity that keeps multiplying faster than we can manage it. Perhaps the real skill isn't just building resilient systems, but learning to surf these waves of change while keeping our teams sane and our services running.
Also worth reading
Practical Considerations for AI Incident Reviews (SRE Weekly)
The article discusses the importance of human engagement in AI incident reviews, emphasizing that relying solely on large language models (LLMs) for writing these reviews can lead to missed insights. It highlights that incident reviews are a socio-technical process that benefits from active particip