On Call Brief – Week of 2026-03-29
This week's top stories
1. Trivy Supply Chain Attack Hits Docker Images in TeamPCP Campaign
- Category: Community
- What happened: A supply chain attack on Trivy has expanded to compromised Docker images for versions 0.69.4 through 0.69.6. The attack is part of the TeamPCP campaign, which has also affected LiteLLM and Telnyx. If you run Trivy in CI/CD, read the linked resources. The same issue also covers Sashiko, an AI code-review system developed by Google that identified Linux kernel bugs human reviewers missed, and ongoing legal issues surrounding AI governance and federal contracts.
- Worth reading: The Trivy supply chain attack poses a direct risk to CI/CD pipelines using affected versions, potentially leading to credential theft. Operators should review their usage of Trivy and implement recommended defenses to mitigate risks.
- Source: Devopsish via DevOps'ish
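A quick way to triage pinned image tags against the affected range above is sketched here. This is a minimal sketch, not an official detection tool: the helper names (`parse_tag`, `is_affected`) and the example image references are assumptions for illustration, and only plain `X.Y.Z` tags are understood. Floating tags such as `latest` and digest references cannot be resolved this way and need a manual check.

```python
# Affected range taken from the brief: Trivy Docker images 0.69.4 to 0.69.6.
LOW = (0, 69, 4)
HIGH = (0, 69, 6)


def parse_tag(tag):
    """Parse '0.69.5' or 'v0.69.5' into a tuple; return None for anything else."""
    parts = tag.lstrip("v").split(".")
    if len(parts) != 3 or not all(p.isdigit() for p in parts):
        return None
    return tuple(int(p) for p in parts)


def is_affected(image_ref):
    """Return True if an image reference like 'repo/trivy:0.69.5' is in range."""
    _, sep, tag = image_ref.rpartition(":")
    if not sep:
        return False  # no tag at all, nothing to compare
    version = parse_tag(tag)
    return version is not None and LOW <= version <= HIGH
```

Feeding it the image references found in pipeline definitions gives a first-pass list to investigate; a `False` result for an unpinned tag is not proof of safety.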
2. Building SRE Error Budgets for AI/ML Workloads: A Practical Framework
- Category: Community
- What happened: The article discusses the need for SRE error budgets tailored for AI/ML workloads, emphasizing that these systems degrade gradually rather than fail suddenly. It suggests focusing on metrics like model accuracy, data freshness, and fairness, in addition to traditional uptime metrics.
- Worth reading: This framework could significantly affect how SRE teams manage reliability and performance in AI/ML systems, necessitating adjustments in monitoring and budgeting practices.
- Source: Dzone via SRE Weekly
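To make the idea concrete, here is one way an error budget for a gradually degrading quality metric might be computed. It is a sketch under stated assumptions rather than the article's own method: the function name `budget_consumed`, the per-window accuracy samples, and the thresholds in the usage note are all hypothetical.

```python
def budget_consumed(samples, threshold, slo_target):
    """Return the fraction of the error budget spent (1.0 means fully spent).

    samples:    one quality measurement per evaluation window (e.g. accuracy)
    threshold:  a window counts as "bad" when its value falls below this
    slo_target: fraction of windows that must be good (e.g. 0.95)
    """
    total = len(samples)
    if total == 0:
        return 0.0
    bad = sum(1 for value in samples if value < threshold)
    allowed_bad = (1.0 - slo_target) * total  # the error budget, in windows
    if allowed_bad == 0:
        return float("inf") if bad else 0.0
    return bad / allowed_bad
```

With 100 windows, a 0.95 target, and 2 windows below the accuracy threshold, the budget allows about 5 bad windows, so roughly 40% of it is spent. The same shape works for data freshness or a fairness metric by changing what a sample measures.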
3. LAX (Los Angeles) on 2026-03-30
- Category: Deep Dive
- What happened: Cloudflare has announced scheduled maintenance in the LAX (Los Angeles) datacenter on March 30, 2026, from 07:00 to 15:00 UTC. During this time, traffic may be re-routed, potentially increasing latency for end-users in the region. PNI/CNI customers should prepare for possible traffic failover as network interfaces may be temporarily unavailable.
- Takeaway: Operators should anticipate increased latency and possible traffic rerouting during the maintenance window, which could affect service performance for users in the Los Angeles area - plan for failover if using PNI/CNI connections.
- Source: Cloudflare Status
4. IST (Istanbul) on 2026-03-29
- Category: Deep Dive
- What happened: Cloudflare has scheduled maintenance in the Istanbul datacenter from March 29, 2026, 23:00 UTC to March 30, 2026, 06:00 UTC. During this time, traffic may be rerouted, potentially increasing latency for users in the affected region. Customers using PNI/CNI should prepare for possible traffic failover as network interfaces may be temporarily unavailable.
- Takeaway: Operators should anticipate increased latency and possible traffic rerouting during the maintenance window, which could affect user experience and connectivity for services relying on the Istanbul datacenter. Ensure failover mechanisms are in place before the window opens.
- Source: Cloudflare Status
5. GYE (Guayaquil) on 2026-04-02
- Category: Deep Dive
- What happened: Cloudflare is conducting scheduled maintenance at the GYE (Guayaquil) datacenter on April 2, 2026, from 13:30 to 18:00 UTC. During this time, traffic may be re-routed, potentially increasing latency for users in the region. PNI/CNI customers should prepare for possible traffic failover as network interfaces may be temporarily unavailable.
- Takeaway: Operators should be aware of potential latency increases and prepare for traffic failover during the maintenance window, which could affect service availability for users in the Guayaquil region.
- Source: Cloudflare Status
6. SJC Network Maintenance
- Category: Deep Dive
- What happened: Scheduled network maintenance will occur in the SJC (San Jose, California) region on April 2, 2026, from 12:30 to 14:30 UTC, with an expected brief outage of approximately 5 minutes for each machine.
- Takeaway: Operators should prepare for a brief outage of around 5 minutes per machine during the scheduled maintenance window.
- Source: Fly.io Status
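For on-call tooling, the three Cloudflare windows listed above can be encoded so alerts or dashboards can be annotated while one is active. This is a minimal sketch: the `WINDOWS` table copies the UTC times from the brief, while the function name `active_windows` and the idea of keying by airport code are assumptions for illustration.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Start and end times (UTC) as stated in the brief for each Cloudflare datacenter.
WINDOWS = {
    "IST": (datetime(2026, 3, 29, 23, 0, tzinfo=UTC), datetime(2026, 3, 30, 6, 0, tzinfo=UTC)),
    "LAX": (datetime(2026, 3, 30, 7, 0, tzinfo=UTC), datetime(2026, 3, 30, 15, 0, tzinfo=UTC)),
    "GYE": (datetime(2026, 4, 2, 13, 30, tzinfo=UTC), datetime(2026, 4, 2, 18, 0, tzinfo=UTC)),
}


def active_windows(at):
    """Return the sorted datacenter codes whose maintenance window contains `at`."""
    return sorted(
        code for code, (start, end) in WINDOWS.items() if start <= at < end
    )
```

Calling `active_windows(datetime.now(timezone.utc))` from an alert handler is one way to tell a latency page caused by planned rerouting apart from a genuine incident.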
CVE & Security
1. Week Ending March 29, 2026
- Category: Security / Patch
- What happened: Kubernetes has implemented a new security policy requiring all GitHub Actions workflows to pin actions using full 40-character commit SHAs instead of branch names or tags, with non-compliant workflows set to fail after April 15, 2026. This change addresses growing supply chain security threats targeting GitHub Actions workflows, as highlighted by GitHub Security's recent guidance on open source supply chain attacks. SRE teams maintaining Kubernetes-related repositories should immediately audit their GitHub Actions workflows to ensure all action references use complete commit SHAs rather than version tags or branch references like @main or @v1. Additionally, teams should enable CodeQL security scanning on their repositories to detect potential vulnerabilities in workflow configurations. This policy change coincides with the renaming of the default branch in the kubernetes/community repository, requiring teams to update any hardcoded branch references in their automation and documentation.
- Do this Monday: Audit GitHub Actions workflows and pin every action to a full commit SHA before the April 15, 2026 deadline to avoid workflow failures. The same issue also covers an ingress-nginx vulnerability that poses a risk of configuration injection and code execution, so affected users should upgrade, and the promotion of NodeLogQuery to GA, which simplifies log retrieval for operators.
- Sources: Last Week in Kubernetes Development (LWKD), GitHub Security
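The audit described above can be approximated with a short script that flags `uses:` references not pinned to a 40-character commit SHA. This is a rough sketch, not the Kubernetes project's own checker: `unpinned_actions` is a hypothetical helper, the regular expression is a line-based simplification rather than a YAML parser, and skipping `docker://` references is an assumption because the brief does not say how the policy treats them.

```python
import re

# Matches "uses: owner/repo@ref" with an optional list dash and optional quotes.
USES_RE = re.compile(r"^\s*(?:-\s*)?uses:\s*['\"]?([^'\"\s#]+)")
SHA_RE = re.compile(r"[0-9a-f]{40}")


def unpinned_actions(workflow_text):
    """Return (line_number, reference) pairs not pinned to a full commit SHA."""
    findings = []
    for lineno, line in enumerate(workflow_text.splitlines(), start=1):
        match = USES_RE.match(line)
        if not match:
            continue
        ref = match.group(1)
        # Local actions have no git ref; docker:// handling is an assumption.
        if ref.startswith("./") or ref.startswith("docker://"):
            continue
        _, sep, version = ref.partition("@")
        if not sep or not SHA_RE.fullmatch(version):
            findings.append((lineno, ref))
    return findings
```

Running it over each file under `.github/workflows/` and failing CI on a non-empty result gives an early warning before the enforcement date. Keeping the version as a trailing comment after the SHA keeps pinned references readable.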
2. North Korean Hackers Suspected in Supply Chain Attack on Popular Axios Project
- Category: Security / Patch
- What happened: Hackers suspected of ties to North Korea hijacked the npm account of an axios maintainer and published malicious versions of the popular JavaScript library that included a remote access trojan (RAT). Because axios is so widely used, the attack potentially affects numerous applications and CI/CD pipelines. The RAT could execute commands, exfiltrate data, and establish persistence on infected systems.
- Do this Monday: Check whether axios 1.14.1 or 0.30.4 was installed anywhere in your applications or CI/CD pipelines. If so, malicious code may have executed without your knowledge, and an immediate security assessment and possible remediation are warranted.
- Source: DevOps.com
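One way to start that exposure check is to scan npm lockfiles for the two versions named above. The sketch below assumes a `package-lock.json` with a top-level `packages` map (lockfileVersion 2 or 3); older lockfiles and other package managers need a different approach, and `compromised_axios` is a hypothetical helper name.

```python
import json

# Versions named in the brief as malicious.
COMPROMISED = {"1.14.1", "0.30.4"}


def compromised_axios(lockfile_text):
    """Return sorted (path, version) pairs for compromised axios entries."""
    lock = json.loads(lockfile_text)
    hits = []
    for path, meta in lock.get("packages", {}).items():
        # Top-level and nested (transitive) installs both end in node_modules/axios.
        is_axios = path == "node_modules/axios" or path.endswith("/node_modules/axios")
        if is_axios and meta.get("version") in COMPROMISED:
            hits.append((path, meta["version"]))
    return sorted(hits)
```

A clean lockfile today does not prove that a compromised version was never installed, for example by an unpinned CI job, so treat an empty result as one signal rather than a clearance.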
3. Claude Code bypasses safety rule if given too many commands
- Category: Security / Patch
- What happened: Claude Code has a vulnerability where it bypasses its deny rules if given a long chain of concatenated commands, making it susceptible to prompt injection attacks.
- Do this Monday: This vulnerability could let prompt-injected input trigger actions that the deny rules were meant to block, potentially affecting applications that rely on Claude Code for automation or decision-making.
- Source: The Register (Software)
4. Terraform: 2 related updates
- Category: Security / Patch
- What happened: HashiCorp has released two new security features for HCP Terraform now generally available to all users. The platform now supports IP allow lists that enable organizations to define approved IP address ranges for both organization-level and agent access to Terraform resources, addressing previous gaps in centralized IP restriction capabilities. Additionally, AWS permission delegation is now generally available, allowing organizations to temporarily delegate IAM permissions to trusted partners and simplify AWS service onboarding processes. SRE teams should review their current HCP Terraform security configurations and consider implementing IP allow lists if their compliance requirements include network-based access controls. Organizations using AWS with HCP Terraform should evaluate the new permission delegation feature to streamline IAM management workflows while maintaining security boundaries.
- Do this Monday: This change enhances security for Terraform users by allowing them to restrict access based on IP addresses, which is crucial for organizations in regulated industries. It helps mitigate risks related to token exposure by ensuring that tokens can only be used from predefined IPs.
- Sources: HashiCorp Blog (two posts)
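When drafting an allow list, it helps to test candidate CIDR ranges against known egress addresses before enforcing anything. The sketch below uses only Python's standard `ipaddress` module and does not call any HCP Terraform API; the function name `is_allowed` and the documentation-range addresses in the usage note are illustrative assumptions.

```python
import ipaddress


def is_allowed(client_ip, allowed_cidrs):
    """Return True if client_ip falls inside any of the CIDR ranges."""
    ip = ipaddress.ip_address(client_ip)
    # Membership is False when the address and network differ in IP version.
    return any(ip in ipaddress.ip_network(cidr) for cidr in allowed_cidrs)
```

For example, `is_allowed("203.0.113.10", ["203.0.113.0/24"])` is true, while an address outside every listed range is rejected. Checking CI runner and agent egress addresses this way before turning on enforcement reduces the chance of locking out your own automation.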
5. Run real-time and async inference on the same infrastructure with GKE Inference Gateway
- Category: Security / Patch
- What happened: The article discusses the GKE Inference Gateway, which allows enterprises to run both real-time and async AI inference workloads on a unified infrastructure. Traditionally, these workloads required separate clusters, leading to inefficiencies and resource fragmentation. The Inference Gateway optimizes resource allocation by treating accelerator capacity as a fluid pool, enabling better performance for both latency-sensitive and latency-tolerant tasks. It employs latency-aware scheduling to improve response times for real-time requests and reduces costs associated with maintaining separate infrastructures for async processing.
- Do this Monday: The introduction of GKE Inference Gateway could significantly streamline AI workload management, reducing costs and improving performance for organizations using Kubernetes for AI services. This change may require operators to adapt their infrastructure strategies to leverage the unified resource management capabilities.
- Source: Google Cloud Blog
6. Cilium 1.20.0-pre.1
- Category: Security / Patch
- What happened: Cilium has released version 1.20.0-pre.1, a pre-release that introduces several significant changes including support for dynamic source IP resolution in nodeport configurations, ICMPv6 policy denial capabilities, jitter delay mechanisms for IPsec key rotations, and enhanced traffic distribution improvements. As this is a pre-release version, operators should avoid deploying it in production environments and instead use it only for testing and validation purposes in development or staging clusters. SRE teams currently running Cilium in production should monitor the official release notes for the stable 1.20.0 release and plan testing cycles to evaluate these new features against their specific networking requirements. Organizations should particularly focus testing efforts on the nodeport and IPsec functionality changes if these components are critical to their current Cilium deployments.
- Do this Monday: Keep this pre-release out of production. Test it in development or staging, focusing on the nodeport and IPsec changes, since the release may affect network policy handling and performance in Kubernetes environments running Cilium.
- Source: Cilium releases
Releases
1. GitHub Adds 37 New Secret Detectors in March, Extends Scanning to AI Coding Agents
- Category: Release
- What happened: GitHub has introduced significant updates to its secret scanning capabilities, adding 37 new secret detectors across 22 providers and extending scanning to AI coding agents. This includes push protection for 39 token types and validity checks for various tokens. The integration with AI coding agents allows for scanning of code changes for exposed secrets before commits, addressing the increasing risk of secret leaks as AI-generated code becomes more prevalent. Organizations can also now manage push protection exemptions at the repository level.
- Do this Monday: The new secret scanning features, especially for AI coding agents, could significantly reduce the risk of secret leaks in repositories, which is crucial as AI tools generate more code. Teams should review their secret management practices to leverage these updates effectively.
- Source: DevOps.com
2. Announcing managed daemon support for Amazon ECS Managed Instances
- Category: Release
- What happened: AWS has introduced managed daemon support for Amazon ECS Managed Instances, allowing platform engineers to independently manage software agents like monitoring and logging tools without needing to coordinate with application development teams. This feature enables centralized control over operational tooling, ensuring that required daemons run consistently across instances. It simplifies the lifecycle management of these daemons, allowing for separate resource management and deployment across multiple capacity providers, thus improving operational efficiency.
- Do this Monday: This change allows for more efficient management of monitoring and logging agents, reducing the operational burden on platform engineers and improving reliability. It enables teams to deploy updates to operational tools without redeploying application services, which can significantly streamline workflows in large-scale environments.
- Source: AWS What's New
3. Securely connect AWS DevOps Agent to private services in your VPCs
- Category: Release
- What happened: The article discusses how to securely connect the AWS DevOps Agent to private services within Amazon VPCs. It explains the use of private connections that allow the agent to access internal systems without exposing them to the public internet. The process involves creating a secure network path using Amazon VPC Lattice, which manages the connectivity without requiring users to handle the underlying network infrastructure. The article also highlights the security features of this setup, ensuring that all traffic remains within the AWS network.
- Do this Monday: This feature allows for secure integration of internal services with the AWS DevOps Agent, which can enhance operational efficiency and security for organizations using AWS. It is particularly relevant for teams managing private resources that require connectivity without public exposure.
- Source: AWS DevOps Blog
4. Building PCI DSS-Compliant Architectures on Amazon EKS
- Category: Release
- What happened: The article discusses the challenges and considerations for organizations in regulated environments, particularly those processing payment card data, when implementing PCI DSS-compliant architectures on Amazon EKS. It emphasizes the importance of security, data protection, and compliance controls in Kubernetes environments and explores the implications of shared versus dedicated tenancy for PCI DSS compliance. The post also highlights AWS resources available to assist organizations in deploying PCI workloads on shared infrastructure while maintaining compliance.
- Do this Monday: Organizations using Amazon EKS for payment processing must carefully evaluate their architectural choices to ensure PCI DSS compliance, which could affect operational costs and compliance risks. Missteps in this area could lead to financial penalties and complex infrastructure changes.
- Source: AWS Containers Blog
5. Codespaces is now generally available for GitHub Enterprise with data residency
- Category: Release
- What happened: GitHub Codespaces is now generally available for GitHub Enterprise Cloud with data residency, allowing organizations to create secure cloud development environments while adhering to data residency requirements. This feature is available in regions including Australia, EU, Japan, and the US. User-owned codespaces are not supported under this model, requiring enterprise or organization ownership for compliance.
- Do this Monday: Organizations using GitHub Enterprise Cloud must ensure proper ownership configurations for Codespaces to comply with data residency policies - this could affect how development environments are managed within teams.
- Source: GitHub Changelog
6. etcd: v3.6.10, v3.5.29, v3.4.43
- Category: Release
- What happened: The etcd project has released maintenance versions 3.6.10, 3.5.29, and 3.4.43 across all supported major release branches. These releases include various bug fixes and improvements as documented in their respective changelogs, though no specific CVE numbers or security vulnerabilities were mentioned in the release notes. Operators should review the upgrade guide for each target version before proceeding with upgrades, as there may be breaking changes that could impact cluster operations. Installation packages are available for Linux, macOS, and Docker platforms through the standard etcd release channels. Given the maintenance nature of these releases, operators should plan upgrades during scheduled maintenance windows while following their established etcd cluster upgrade procedures.
- Do this Monday: Operators should be aware of potential breaking changes when upgrading to etcd v3.6.10 and follow the upgrade guide to ensure a smooth transition.
- Sources: etcd releases (three release pages)
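To see which clusters have a patch upgrade waiting, the three releases above can be mapped to their minor branches. This is a small sketch assuming patch upgrades stay within the same minor branch; `upgrade_target` is a hypothetical helper and the version table simply restates the releases in the brief.

```python
# Latest patch release per supported minor branch, as listed in the brief.
LATEST_PATCH = {
    (3, 6): (3, 6, 10),
    (3, 5): (3, 5, 29),
    (3, 4): (3, 4, 43),
}


def upgrade_target(running):
    """Return the patch release to move to, or None if already current.

    Raises ValueError when the running minor branch is not in the table.
    """
    parts = tuple(int(p) for p in running.lstrip("v").split("."))
    latest = LATEST_PATCH.get(parts[:2])
    if latest is None:
        raise ValueError(f"no release listed for branch of {running}")
    if parts >= latest:
        return None
    return "v" + ".".join(str(p) for p in latest)
```

A cluster reporting `v3.5.21` would be pointed at `v3.5.29`. Moving between minor branches is a separate decision that belongs with the upgrade guide, which is why the sketch never suggests one.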
7. Introducing Code Optimizer (beta) – Better and Safer Infrastructure Code, Right Inside Your Git
- Category: Release
- What happened: Code Optimizer is a new tool that scans Infrastructure-as-Code for risks and inconsistencies, providing suggestions for improvements directly within Git workflows.
- Do this Monday: This tool could enhance the safety and quality of infrastructure code, potentially reducing deployment issues.
- Source: env0 Blog
8. Expanding Drift Remediation: Keep Your Code Aligned with Cloud Changes
- Category: Release
- What happened: env zero has expanded its drift remediation capabilities to address changes made directly in the cloud. It can now suggest pull requests that modify your Infrastructure as Code to align with these cloud-originated changes, helping maintain consistency in your infrastructure.
- Do this Monday: This update could streamline the process of keeping Infrastructure as Code in sync with cloud changes, potentially reducing manual intervention and errors - it may improve deployment reliability.
- Source: env0 Blog
9. env zero and CloudQuery Announce Merger to Create the Industry’s First Unified Cloud Intelligence Platform
- Category: Release
- What happened: env zero and CloudQuery have merged to create a unified cloud governance and asset management platform aimed at providing full lifecycle visibility, automation, and control for cloud teams.
- Do this Monday: This merger may affect cloud governance strategies and tooling choices for teams looking for integrated solutions.
- Source: env0 Blog
Lightning links
- Introducing EmDash - the spiritual successor to WordPress that solves plugin security (Cloudflare Blog) -- Cloudflare introduces EmDash, a secure open-source CMS designed to replace WordPress.
- Top Infrastructure and GKE Sessions at Cloud Next '26 (Google Cloud Blog) -- Key insights from upcoming Cloud Next '26 sessions on infrastructure and GKE.
- Cloudflare Launches Dynamic Workers Open Beta: Isolate-Based Sandboxing for AI Agent Code Execution (InfoQ DevOps) -- Cloudflare's Dynamic Workers now offer faster, secure execution for AI-generated code.
- Research, plan, and code with Copilot cloud agent (GitHub Changelog) -- GitHub Copilot's cloud agent now supports code generation without pull requests.
- v1.15.0-beta2 (Terraform releases) -- Terraform v1.15.0-beta2 introduces new features and enhanced AWS backend authentication.
- Docker Inc. Allies with NanoCo to Deploy General-Purpose AI Agent Safely (Cloud Native Now) -- Docker partners with NanoCo to safely deploy AI agents in restricted environments.
- Migrating to Amazon ElastiCache for Valkey: Best practices and a customer success story (AWS Database Blog) -- Best practices for migrating to Amazon ElastiCache for Valkey, illustrated with a customer success story.
- Developers Using Anthropic Claude Code Hit by Token Drain Crisis (DevOps.com) -- Developers face unexpected token drain issues with Anthropic's Claude Code, affecting usage.
- libeatmydata - disable fsync and SAVE (Lobsters) -- libeatmydata allows disabling fsync to improve database performance, with trade-offs.
- GitHub Mobile: Stay in flow with a refreshed Copilot tab and native session logs (GitHub Changelog) -- GitHub Mobile enhances workflow management with a new Copilot tab and session logs.
Human Stories
Looking at this week's mix of threats and planned disruptions, I'm struck by how we're constantly navigating between the chaos we can't predict and the chaos we can. The Trivy supply chain attack shows that even our security tools can become vectors of compromise - a sobering reminder that trust in our toolchain requires constant verification. Meanwhile, the run of Cloudflare maintenance windows across LAX, Istanbul, and Guayaquil shows the choreographed dance of keeping global infrastructure running, each carefully planned outage a testament to the unglamorous work of preventing unplanned ones. What ties these together is the fundamental tension we live with: some disruptions come at us like lightning strikes through compromised Docker images, while others we orchestrate ourselves through maintenance windows, hoping our planning holds up against Murphy's Law. The art of our craft lies not just in responding to surprises, but in making sure our deliberate disruptions don't become accidental catastrophes.
Also worth reading
Ruby Central report reopens wounds over RubyGems repo takeover (The Register (Software))
Ruby Central published an incident report detailing the September 2025 RubyGems fracture, where control of the RubyGems GitHub repository was taken from its maintainers. The report highlights ongoing governance and trust issues within the Ruby community.