On Call Brief – Week of 2026-04-05

Briefing: 2026-04-05

This week's top stories

1. GitHub availability report: March 2026

  • Category: Deep Dive
  • What happened: GitHub experienced four service incidents in March 2026 affecting github.com, GitHub API, GitHub Actions, and GitHub Copilot due to caching mechanism issues and misconfigurations, according to their availability report. Separately, GitHub has moved custom images for hosted runners from public preview to general availability, allowing teams to create VM images based on GitHub-approved base images for their CI/CD workflows. SRE teams using GitHub Actions should evaluate whether custom runner images could improve their build performance and consistency, particularly for workflows with specific tooling requirements. Teams should also review their incident response procedures for GitHub service outages and consider implementing fallback strategies for critical CI/CD pipelines. Organizations heavily dependent on GitHub services should monitor the GitHub status page and consider diversifying their development toolchain to reduce single points of failure.
  • Takeaway: The incidents disrupted developer workflows and caused high error rates on critical services. GitHub's ongoing architectural improvements may bring better resilience, but teams should not wait on them: confirm fallback plans for critical CI/CD pipelines now.
  • Sources: GitHub Blog, InfoQ DevOps
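For teams acting on the advice to monitor the GitHub status page, a minimal sketch follows. It assumes the status page exposes a Statuspage-style `status.json` summary at the URL shown; the function names and the rule that anything other than `none` counts as degraded are illustrative choices, not something taken from the availability report.

```python
import json
import urllib.request

# Assumed endpoint: the Statuspage-style summary for githubstatus.com.
STATUS_URL = "https://www.githubstatus.com/api/v2/status.json"


def is_degraded(payload: dict) -> bool:
    """Return True when the status indicator is anything other than 'none'."""
    indicator = payload.get("status", {}).get("indicator", "unknown")
    return indicator != "none"


def fetch_status(url: str = STATUS_URL, timeout: float = 5.0) -> dict:
    """Fetch and decode the status summary (network call; not run on import)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


# Example payloads in the assumed shape, used here instead of a live call.
healthy = {"status": {"indicator": "none", "description": "All Systems Operational"}}
incident = {"status": {"indicator": "major", "description": "Partial System Outage"}}
```

A missing or unparseable indicator is treated as degraded on purpose, so a change in the payload shape fails loudly rather than silently reporting green. A pipeline could call `is_degraded(fetch_status())` before a critical deploy and switch to its fallback path when it returns True.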
  2. Another One Bites the Dust: What the CDKTF Deprecation Means for You

    • Category: Breaking Change
    • What happened: HashiCorp has deprecated CDKTF as of December 10. The article discusses migration options to OpenTofu or Pulumi and highlights how env zero can help prevent vendor lock-in for infrastructure management.
    • Do this Monday: Operators using CDKTF should start planning a migration to an alternative such as OpenTofu or Pulumi to avoid disruptions in their infrastructure management.
    • Source: env0 Blog
  3. Dutch healthcare software vendor goes dark after ransomware attack

    • Category: Deep Dive
    • What happened: ChipSoft, a Dutch healthcare software vendor, is currently offline due to a ransomware attack. While their website is down, email communications are still operational.
    • Takeaway: This incident highlights how exposed healthcare software vendors are to ransomware attacks, which could affect service availability and data security in healthcare systems. Operators should assess their own security measures.
    • Source: The Register (Software)
  4. STL (St. Louis) on 2026-04-09

    • Category: Deep Dive
    • What happened: Cloudflare has announced scheduled maintenance at the STL (St. Louis) datacenter on April 9, 2026, from 08:00 to 16:00 UTC. During this time, traffic may be re-routed, potentially increasing latency for users in the region. Customers using PNI/CNI should prepare for possible traffic failover as network interfaces may be temporarily unavailable.
    • Takeaway: Operators should anticipate increased latency and possible traffic rerouting during the maintenance window, which could affect service availability for users in the STL region. Plan for failover if you use PNI/CNI connections.
    • Source: Cloudflare Status
  5. BRU (Brussels) on 2026-04-06

    • Category: Deep Dive
    • What happened: Cloudflare is scheduled to perform maintenance in the Brussels (BRU) datacenter from April 6, 2026, 23:00 UTC to April 7, 2026, 05:30 UTC. During this time, traffic may be rerouted, potentially increasing latency for end-users in the affected region. Customers using PNI/CNI connections should prepare for possible traffic failover as network interfaces may be temporarily unavailable.
    • Takeaway: Operators should anticipate increased latency and possible traffic rerouting during the maintenance window, which could affect service performance for users in the Brussels region.
    • Source: Cloudflare Status
  6. POA (Porto Alegre) on 2026-04-06

    • Category: Deep Dive
    • What happened: Cloudflare has scheduled maintenance in the Porto Alegre (POA) datacenter on April 6, 2026, from 08:00 to 12:00 UTC. During this time, traffic may be re-routed, potentially increasing latency for users in the region. PNI/CNI customers should prepare for possible traffic failover as network interfaces may be temporarily unavailable.
    • Takeaway: Operators should anticipate increased latency and possible traffic rerouting during the maintenance window, which may affect service availability for users in the Porto Alegre region. PNI/CNI customers should plan for failover.
    • Source: Cloudflare Status
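To annotate or mute latency alerts during the three Cloudflare windows listed above, something like the following sketch could help. The window times are copied from the notices (all UTC); the dictionary layout and the function name are illustrative assumptions.

```python
from datetime import datetime, timezone

UTC = timezone.utc

# Windows copied from the three Cloudflare Status notices (all times UTC).
MAINTENANCE_WINDOWS = {
    "POA": (datetime(2026, 4, 6, 8, 0, tzinfo=UTC), datetime(2026, 4, 6, 12, 0, tzinfo=UTC)),
    "BRU": (datetime(2026, 4, 6, 23, 0, tzinfo=UTC), datetime(2026, 4, 7, 5, 30, tzinfo=UTC)),
    "STL": (datetime(2026, 4, 9, 8, 0, tzinfo=UTC), datetime(2026, 4, 9, 16, 0, tzinfo=UTC)),
}


def active_maintenance(at: datetime) -> list[str]:
    """Return datacenter codes whose window contains the timezone-aware time `at`.

    The start is inclusive and the end is exclusive.
    """
    return sorted(
        code for code, (start, end) in MAINTENANCE_WINDOWS.items() if start <= at < end
    )
```

Calling `active_maintenance(datetime.now(timezone.utc))` from an alert handler gives the list of datacenters currently under maintenance, which can be attached to the page as context.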

CVE & Security

  1. Fortinet Patches Actively Exploited CVE-2026-35616 in FortiClient EMS

    • Category: Security / Patch
    • What happened: Fortinet has issued urgent patches for CVE-2026-35616, a critical vulnerability in FortiClient EMS that is currently being exploited. This flaw allows for pre-authentication API access bypass, potentially leading to privilege escalation due to improper access control.
    • Do this Monday: Operators using FortiClient EMS should prioritize applying these patches to mitigate the risk of exploitation, as the vulnerability is actively being targeted in the wild.
    • Source: The Hacker News (security)
  2. 36 Malicious npm Packages Exploited Redis, PostgreSQL to Deploy Persistent Implants

    • Category: Security / Patch
    • What happened: Researchers found 36 malicious npm packages posing as Strapi CMS plugins that exploit Redis and PostgreSQL. These packages deploy reverse shells, harvest credentials, and install persistent implants. Each package lacks a description and repository link, indicating potential malicious intent.
    • Do this Monday: Audit your projects for npm packages posing as Strapi CMS plugins, and treat any that lack a description and repository link as suspect. The exploitation of Redis and PostgreSQL could lead to unauthorized access and data breaches, so review those services on any host where a suspect package was installed.
    • Source: The Hacker News (security)
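The report notes that each malicious package lacked a description and a repository link. A rough triage sketch built on that one signal is shown below; it is a heuristic only (plenty of benign packages omit these fields, and it is not a detector for this campaign), and the function names are illustrative.

```python
import json
from pathlib import Path


def lacks_metadata(manifest: dict) -> bool:
    """True when a package.json has neither a description nor a repository field."""
    return not manifest.get("description") and not manifest.get("repository")


def suspicious_packages(node_modules: Path) -> list[str]:
    """List installed package names whose manifest lacks both fields."""
    flagged = []
    for manifest_path in node_modules.glob("**/package.json"):
        try:
            manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            continue  # unreadable or malformed manifest; skip rather than crash
        if isinstance(manifest, dict) and manifest.get("name") and lacks_metadata(manifest):
            flagged.append(manifest["name"])
    return sorted(set(flagged))
```

Running `suspicious_packages(Path("node_modules"))` in a project root gives a shortlist to review by hand, ideally narrowed further to names that look like Strapi plugins.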
  3. Helm v4.1.4

    • Category: Security / Patch
    • What happened: Helm v4.1.4 is a security patch release that addresses several vulnerabilities, including issues with chart extraction, plugin verification, and path traversal in plugin metadata. Users are encouraged to upgrade to this version for improved security.
    • Do this Monday: Upgrade to Helm v4.1.4 to mitigate the identified vulnerabilities, which could otherwise allow unauthorized access to or manipulation of Helm charts and plugins.
    • Source: Helm releases
  4. DevOps'ish 303: Claude Code's Source, Iran's Tech Hit List, Microsoft's rough times, and More

    • Category: Security / Patch
    • What happened: Claude Code's source code was leaked this week due to a faulty build step in the development pipeline, according to DevOps'ish newsletter reporting. Separately, the Cursor IDE is experiencing a critical bug where the application becomes completely unresponsive when attempting to restore Claude Code extension webview tabs during startup, preventing terminal access, extension loading, and settings functionality. SRE teams running Cursor IDE should immediately close all Claude Code tabs before shutting down the application to prevent the startup freeze issue, and consider temporarily disabling the Claude Code extension until a fix is released. Organizations using Claude Code should review their build processes for similar security vulnerabilities that could lead to source code exposure and implement additional safeguards around sensitive code artifacts. Both issues highlight the risks of AI coding tool integrations in development environments and warrant immediate attention from platform engineering teams.
    • Do this Monday: The leak of Claude Code's source code could expose vulnerabilities or proprietary information, impacting security practices. The threats to US tech companies, particularly AWS, may affect service reliability and operational security in affected regions. Layoffs at major companies could influence talent availability in the industry.
    • Sources: DevOps'ish, Cursor Forum
  5. Reclaim Developer Hours through Smarter Vulnerability Prioritization with Docker and Mend.io

    • Category: Security / Patch
    • What happened: Docker has announced an integration with Mend.io that enhances container security by automatically identifying and prioritizing vulnerabilities in Docker Hardened Images (DHI). This integration uses VEX statements to distinguish between exploitable and non-exploitable vulnerabilities, allowing developers to focus on critical risks. Key features include zero-configuration setup, automatic detection of DHI images, visual indicators in the Mend UI, and the ability to suppress non-functional risks in bulk. Additionally, it supports automated governance workflows, including SLA management and pipeline gating to ensure CI/CD processes remain efficient.
    • Do this Monday: This integration could significantly reduce the time developers spend managing vulnerabilities, allowing teams to focus on real risks rather than noise. The automated features and prioritization could lead to improved security posture and faster remediation processes in production environments.
    • Source: Docker Blog
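To make the VEX idea concrete, here is a small sketch of filtering scanner findings by VEX status. It is not how Mend.io or Docker implement the integration; the status values follow the OpenVEX vocabulary, while the `vex_status` field name and the placeholder CVE IDs are made up for illustration.

```python
# VEX status values that mean "no action needed" (OpenVEX vocabulary).
NON_ACTIONABLE = {"not_affected", "fixed"}


def actionable(findings: list[dict]) -> list[dict]:
    """Keep findings whose VEX status does not rule out exploitability.

    Findings with no VEX statement at all are kept, since nothing says they are safe.
    """
    return [f for f in findings if f.get("vex_status") not in NON_ACTIONABLE]


# Hypothetical scanner output; the IDs and the `vex_status` field name are placeholders.
findings = [
    {"id": "CVE-0000-0001", "vex_status": "not_affected"},
    {"id": "CVE-0000-0002", "vex_status": "affected"},
    {"id": "CVE-0000-0003", "vex_status": "under_investigation"},
    {"id": "CVE-0000-0004"},
]
```

Keeping `under_investigation` and unlabeled findings in the queue is a deliberately conservative choice: only an explicit statement removes an item from the list.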
  6. Prioritize GitHub Advanced Security alerts with runtime context from Dynatrace

    • Category: Security / Patch
    • What happened: Dynatrace enhances GitHub Advanced Security by integrating runtime context, allowing teams to visualize, prioritize, and automate security alerts. This integration helps unify security findings across the Software Development Lifecycle and provides valuable insights like public internet exposure and reachable data assets, enabling better management of security alerts directly within GitHub.
    • Do this Monday: This integration can improve the efficiency of security alert management in production environments by providing actionable context, potentially reducing the time and effort needed for remediation.
    • Source: Dynatrace Blog
  7. Pipeline Groups in Dynatrace OpenPipeline: Enterprise-grade governance explained

    • Category: Security / Patch
    • What happened: Dynatrace introduces Pipeline Groups in version 1.332 of its SaaS offering, allowing platform engineering teams to enforce governance across multiple pipelines while enabling individual teams to manage their configurations. This feature addresses the challenges of scaling pipeline governance by separating mandatory global processing from team-level customization, thus preventing configuration drift and bottlenecks in pipeline changes.
    • Do this Monday: The introduction of Pipeline Groups could significantly streamline governance processes for organizations using Dynatrace, reducing delays in pipeline changes and minimizing compliance risks associated with configuration drift.
    • Source: Dynatrace Blog

Releases

  1. Microsoft wants to make service mesh invisible

    • Category: Release
    • What happened: At KubeCon EU 2026, Microsoft introduced the Azure Kubernetes Application Network, a fully managed service built on Istio's ambient mode, aiming to simplify service mesh usage by making it 'invisible' to users. The new service avoids the term 'service mesh' to appeal to customers who only need basic proxy features like mTLS. Ambient mode addresses issues with the traditional sidecar model, allowing for independent upgrades without restarting applications. The service also adapts to the unique demands of AI workloads, which require different network handling compared to standard HTTP requests.
    • Do this Monday: The introduction of Azure Kubernetes Application Network could streamline service mesh adoption and management, particularly for teams hesitant about traditional service mesh complexities. Its focus on mTLS and AI workload optimization may influence how operators configure their Kubernetes environments.
    • Source: The New Stack
  2. New GKE Cloud Storage FUSE Profiles take the guesswork out of configuring AI storage

    • Category: Release
    • What happened: Google Cloud introduces GKE Cloud Storage FUSE Profiles to simplify the configuration of Cloud Storage FUSE for AI/ML workloads. This feature automates performance tuning, allowing users to achieve high performance with minimal operational overhead. The profiles are tailored to specific workload needs, enhancing data access for training and inference tasks.
    • Do this Monday: This change could significantly reduce the complexity of configuring storage for AI/ML applications on GKE, potentially improving performance and reducing operational burden for teams managing these workloads.
    • Source: Google Cloud Blog
  3. Dynatrace to acquire Bindplane to bring control to the telemetry lifecycle

    • Category: Release
    • What happened: Dynatrace has announced its acquisition of Bindplane, a telemetry pipeline that enhances control over observability data management. This acquisition aims to address the challenges posed by increasing telemetry volumes and fragmented data collection in cloud-native and AI-driven environments. Bindplane, built on OpenTelemetry, allows teams to collect, process, and route telemetry data in real-time, improving data quality and reducing costs associated with data ingestion. The integration is expected to enhance AI-powered observability by ensuring that data is relevant and high-quality from the outset.
    • Do this Monday: This acquisition may affect production by providing teams with better tools to manage telemetry data, potentially improving observability and reducing costs associated with data ingestion. Organizations may benefit from enhanced flexibility and control over their telemetry lifecycle, which could lead to more efficient operations and better insights.
    • Source: Dynatrace Blog
  4. Code Security risk assessment available for organizations

    • Category: Release
    • What happened: GitHub has introduced a free Code Security risk assessment tool for organization admins and security managers. This tool allows users to review security vulnerabilities across their organization, summarizing them by severity, rule type, and programming language. It provides remediation guidance and highlights where Copilot Autofix can suggest automatic fixes. The assessment helps identify high-impact repositories for prioritization and speeds up the remediation process. This feature is available in GitHub Enterprise Cloud and GitHub Team, and will be included in GitHub Enterprise Server 3.22.
    • Do this Monday: This new tool can help organizations identify and remediate security vulnerabilities more efficiently, potentially reducing the risk of security incidents in production environments - especially for teams using GitHub for their code repositories.
    • Source: GitHub Changelog
  5. The Atlassian Rovo MCP Server now supports Bitbucket Cloud

    • Category: Release
    • What happened: The Atlassian Rovo Model Context Protocol (MCP) Server now integrates with Bitbucket Cloud, allowing AI clients to perform various repository management tasks such as creating commits, opening pull requests, and checking pipeline results. This integration enhances workflows by enabling AI assistants to interact with Bitbucket Cloud tools for managing workspaces, pull requests, and deployments. However, it currently requires an organization-linked workspace and uses API token authentication, with OAuth support not yet available.
    • Do this Monday: This integration could streamline development workflows by allowing AI tools to assist with repository management and CI/CD processes, potentially improving efficiency in code reviews and deployments. However, the reliance on API token authentication and the need for organization-linked workspaces may limit immediate usability for some teams.
    • Source: Atlassian Engineering
  6. Copilot-reviewed pull request merge metrics now in the usage metrics API

    • Category: Release
    • What happened: GitHub has added new metrics to the Copilot usage metrics API that focus on pull requests reviewed by Copilot. The metrics include the total number of pull requests merged and reviewed by Copilot and the median time to merge for those pull requests. This enhancement allows users to assess the impact of Copilot on code review efficiency and track the adoption of automated reviews across their organizations.
    • Do this Monday: These metrics can help teams evaluate the effectiveness of Copilot in speeding up the code review process, potentially influencing how they integrate automated tools into their workflows.
    • Source: GitHub Changelog
  7. With Claude Managed Agents, Anthropic wants to run your AI agents for you

    • Category: Release
    • What happened: Anthropic has launched Claude Managed Agents, a public beta service that enables businesses to build and deploy cloud-based AI agents on its platform. This service abstracts away the infrastructure needed for running agents, allowing users to define agents in natural language or YAML, and manage their execution with built-in governance tools. The platform promises to speed up the deployment process significantly, although some features are still in limited preview. Pricing is based on token usage and session hours.
    • Do this Monday: The introduction of Claude Managed Agents could streamline the deployment of AI agents in production environments, potentially reducing the time and resources needed for setup. This may influence how enterprises adopt AI solutions, especially with the governance tools provided.
    • Source: The New Stack
  8. Openness without compromises for your Apache Iceberg lakehouse

    • Category: Release
    • What happened: Google Cloud announced the preview of interoperability between BigQuery and Iceberg-compatible engines like Trino and Spark at the Apache Iceberg Summit. This new capability allows users to leverage enterprise-grade storage for their lakehouse while maintaining Iceberg's flexibility. It aims to address challenges faced by data teams, such as price-performance overhead and the need for custom infrastructure for real-time streaming and governance across compute engines. The features will be available in preview for Google-managed Iceberg REST catalog tables and will reach general availability for BigQuery-managed Iceberg tables next month.
    • Do this Monday: This change could significantly enhance data workflows for teams using Apache Iceberg and BigQuery, reducing the complexity and costs associated with managing multiple compute engines and improving performance for lakehouse architectures.
    • Source: Google Cloud Blog
  9. A framework for securely collecting forensic artifacts into S3 buckets

    • Category: Release
    • What happened: The blog post discusses a framework for securely collecting forensic artifacts into Amazon S3 buckets, emphasizing the importance of security during the collection process. It outlines best practices such as implementing least privilege access using AWS IAM policies, utilizing time-limited credentials via AWS STS, ensuring compatibility with third-party forensic tools, and automating credential vending to enhance security and efficiency in forensic investigations.
    • Do this Monday: This framework can help organizations improve their incident response capabilities by securely collecting and storing forensic artifacts, which is critical for identifying root causes and validating remediation efforts after security incidents.
    • Source: AWS Security Blog
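As an illustration of the least-privilege idea in this story, the sketch below builds an upload-only policy scoped to a single case prefix. It is not the framework from the AWS post; the bucket name, case identifier, and function name are hypothetical. A document like this could be supplied as the session `Policy` on an STS AssumeRole call with a short `DurationSeconds`, so the credentials handed to a collection tool are both narrow and time-limited.

```python
import json


def upload_only_policy(bucket: str, case_prefix: str) -> str:
    """Build an IAM policy document allowing only s3:PutObject under one case prefix."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/{case_prefix.strip('/')}/*",
            }
        ],
    }
    return json.dumps(policy)


# Hypothetical bucket and case identifier, for illustration only.
session_policy = upload_only_policy("example-forensics-bucket", "case-1234/")
```

Granting write access without read or list keeps a compromised collection host from browsing or exfiltrating artifacts that other hosts have already uploaded.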
  10. From bytecode to bytes: automated magic packet generation

    • Category: Release
    • What happened: The article discusses the challenges of reverse-engineering Berkeley Packet Filter (BPF) socket programs used by Linux malware, which can remain dormant until triggered by a specific packet. It introduces a tool that utilizes symbolic execution and the Z3 theorem prover to automate the generation of these 'magic' packets, significantly reducing the time required for security researchers to analyze complex BPF instructions. The post highlights the case of BPFDoor, a sophisticated backdoor that leverages BPF for stealthy monitoring of network traffic.
    • Do this Monday: This tool could enhance the efficiency of security operations by automating the packet generation process, potentially improving response times to BPF-based threats in production environments.
    • Source: Cloudflare Blog

Lightning links

    Human Stories

    Looking at this week's stories, I'm struck by how much of our operational reality boils down to dependencies, and what happens when they shift beneath our feet. The GitHub incidents remind us that even the platforms we consider bedrock can stumble over caching issues and misconfigurations, while HashiCorp's CDKTF deprecation forces teams to confront the uncomfortable truth that today's strategic tooling choice might be tomorrow's migration headache. The ChipSoft ransomware attack shows how quickly a vendor can go from operational to dark, leaving their customers scrambling. Even Cloudflare, with its distributed architecture, needs those maintenance windows in St. Louis, Brussels, and Porto Alegre that remind us no infrastructure is truly invisible. As we build and maintain systems, we'd do well to remember that resilience isn't just about redundancy; it's about designing for the inevitable moment when the things we depend on change, break, or simply disappear.

    Also worth reading

    I kept losing 30-45 mins every incident trying to piece together what happened. (Reddit r/sre)

    A DevOps engineer shares their frustration with the time lost during incidents trying to identify the root cause. They describe the manual process of correlating metrics and timelines, which often takes 30-45 minutes. To address this, they are developing a tool that ingests metrics from Prometheus,

    From pilots to productivity: How one AI leader operationalizes enterprise AI (Atlassian Engineering)

    The article discusses the challenges organizations face in operationalizing AI beyond initial pilot programs. It highlights insights from Shivam Khullar of NVIDIA, who emphasizes the importance of distinguishing between adoption metrics and outcome metrics to measure AI's true impact on productivity

    They committed to 60 engineers’ worth of work with 35 engineers (Atlassian Engineering)

    Jensen Fleming from Atlassian discusses the challenges of overcommitting engineering resources, revealing that her team committed to the workload of 60 engineers while only having 35. She highlights the lack of visibility and data in their previous planning methods, which relied on outdated spreadsh