On Call Brief
Your weekly SRE/DevOps briefing. Security patches, postmortems, releases, and community reads — curated for the on-call engineer.
Each brief is dated by its editorial week (not the companion podcast release schedule); when inbox or RSS ingest lags affected sourcing, we say so in the draft.
This week's brief is in draft: it updates daily and publishes Sunday.
On Call Brief – Week of May 10–16, 2026
2026-05-10 — 2026-05-16
This week's top stories
1. EdTech Firm Instructure Pays Ransom as U.S. House Starts Investigation
- Category: Community
- What happened: Instructure, the company behind the Canvas learning management system, paid a ransom to the ShinyHunters ransomware group after suffering a data breach that compromised information from over 30 million Canvas users across more than 8,000 educational institutions. The company made the decision to pay following a second intrusion into their systems, according to Security Boulevard Newsletters. SRE teams running Canvas deployments should immediately review their access logs for any suspicious activity, verify that all security patches are current, and coordinate with their information security teams to assess potential data exposure. The U.S. House of Representatives has initiated an investigation into the incident, indicating the severity and scope of this breach affecting educational infrastructure nationwide.
- Worth reading: This incident highlights the risks associated with data breaches in educational technology and may lead to increased scrutiny and regulatory actions affecting similar organizations.
- Sources: Security Boulevard Newsletters, Security Boulevard Newsletters
2. Instructure Pays Ransom to Canvas Hackers
- Category: Deep Dive
- What happened: Instructure paid a ransom to ShinyHunters after breaches of its Canvas learning management system, affecting around 275 million users and causing service disruptions. While the company received a guarantee of data destruction, experts warn that paying ransoms can encourage future attacks and does not ensure the data won't be leaked.
- Takeaway: This incident highlights the risks associated with ransomware attacks and the potential for data breaches in learning management systems - operators should assess their own security measures and incident response plans.
- Source: Inside Higher Ed via TLDR Dev
3. Cloudflare: 12 Scheduled Maintenance Windows (Network Performance Issues, Los Angeles, Sydney, Phoenix (+8 more))
- Category: Community
- What happened: Cloudflare is conducting planned maintenance across multiple global datacenters from May 11-13, 2026: PHX (May 12, 09:00-11:30 UTC), LAX (May 12, 07:00-15:00 UTC and May 13, 09:00-11:00 UTC), YUL (May 12, 05:00-13:00 UTC), LHR (May 12, 00:30-07:00 UTC and May 13), and SYD (May 11, 15:00 UTC to May 12, 07:00 UTC and May 13, 15:00-20:00 UTC), with additional maintenance scheduled for the MEL, CAN, and SCL datacenters on May 13. During these windows, traffic will be automatically rerouted, which may increase latency for end users in affected regions; PNI/CNI customers may see more significant impact. Separately, according to Cloudflare Status, Cloudflare resolved network performance issues in Chicago and is continuing to monitor. SRE teams should monitor application performance metrics and consider pre-positioning traffic or adding caching during the maintenance windows, particularly if they serve users in the affected regions.
- Worth reading: This incident may have affected users relying on Cloudflare services in the Chicago region, potentially impacting application performance and availability.
- Sources: Cloudflare Status, Cloudflare Status, Cloudflare Status (+9 more)
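To check whether a given moment overlaps one of the windows listed above, a small lookup can help when annotating dashboards or alerts. This is a minimal sketch, not a Cloudflare API: the `WINDOWS` table and the `active_maintenance` helper are hypothetical names, and the table only includes windows for which the brief gives both a start and an end time (LHR on May 13 and the MEL, CAN, and SCL windows are omitted because no times are listed).

```python
from datetime import datetime, timezone

# Windows copied from the brief, all in UTC. The dict layout is an assumption
# made for this sketch; Cloudflare Status remains the source of truth.
WINDOWS = {
    "PHX": [("2026-05-12T09:00", "2026-05-12T11:30")],
    "LAX": [("2026-05-12T07:00", "2026-05-12T15:00"),
            ("2026-05-13T09:00", "2026-05-13T11:00")],
    "YUL": [("2026-05-12T05:00", "2026-05-12T13:00")],
    "LHR": [("2026-05-12T00:30", "2026-05-12T07:00")],
    "SYD": [("2026-05-11T15:00", "2026-05-12T07:00"),
            ("2026-05-13T15:00", "2026-05-13T20:00")],
}


def _utc(ts: str) -> datetime:
    """Parse an ISO timestamp from WINDOWS and mark it as UTC."""
    return datetime.fromisoformat(ts).replace(tzinfo=timezone.utc)


def active_maintenance(at: datetime) -> list[str]:
    """Return datacenter codes whose window contains `at`.

    `at` must be timezone-aware. Start is inclusive, end is exclusive.
    """
    return sorted(
        code
        for code, spans in WINDOWS.items()
        if any(_utc(start) <= at < _utc(end) for start, end in spans)
    )
```

Calling `active_maintenance(datetime.now(timezone.utc))` from an alert annotation hook would tell an on-call engineer whether a latency spike coincides with a scheduled window before they start digging.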
4. Detecting and preventing crypto mining in your AWS environment
- Category: Deep Dive
- What happened: This article explains how to use Amazon GuardDuty to detect and prevent cryptocurrency mining threats in AWS environments. It discusses the security challenges posed by unauthorized crypto mining, including increased costs, performance degradation, and potential security incidents. The article outlines key indicators of crypto mining activity and details how GuardDuty employs machine learning and threat intelligence to identify such activities effectively.
- Takeaway: Organizations using AWS should be aware of the risks associated with crypto mining, including financial losses and performance issues. Implementing GuardDuty can help mitigate these risks by providing advanced detection capabilities.
- Source: AWS Security Blog
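As a rough illustration of triaging GuardDuty output for mining activity, the sketch below filters finding records by type. It assumes findings have already been fetched and are plain dicts with a `Type` key, as in GuardDuty's finding format; the `crypto_mining_findings` helper and the sample records are hypothetical, and the prefix check covers only the `CryptoCurrency:` finding family, not every indicator the article may discuss.

```python
def crypto_mining_findings(findings: list[dict]) -> list[dict]:
    """Keep findings whose type belongs to GuardDuty's CryptoCurrency family."""
    return [
        f for f in findings
        if f.get("Type", "").startswith("CryptoCurrency:")
    ]


# Made-up sample records for illustration; real findings carry many more fields.
sample = [
    {"Id": "a1", "Type": "CryptoCurrency:EC2/BitcoinTool.B!DNS"},
    {"Id": "b2", "Type": "Recon:EC2/PortProbeUnprotectedPort"},
    {"Id": "c3"},  # a record without a Type is skipped rather than raising
]

mining = crypto_mining_findings(sample)
```

A filter like this is only a triage aid; the detection itself is done by GuardDuty.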
5. Helm v4.2.0
- Category: Breaking Change
- What happened: Helm v4.2.0 is a feature release that includes notable changes such as switching to goreleaser for builds, updating Kubernetes client libraries to v1.36, adding a mustToToml template function, and deprecating some unused flags. Users are encouraged to upgrade for the best experience.
- Do this Monday: Upgrading to Helm v4.2.0 is recommended for improved functionality and compatibility with Kubernetes v1.36 - this may affect deployments that rely on Helm for managing Kubernetes applications.
- Source: Helm releases
6. LDAP secrets management now available in IBM Vault Enterprise 2.0
- Category: Community
- What happened: IBM Vault Enterprise 2.0 now includes LDAP secrets management, allowing users to manage secrets more effectively within their LDAP environments. This feature enhances security and simplifies the management of sensitive information.
- Worth reading: This update may affect production environments using IBM Vault for secrets management, particularly those integrating with LDAP, as it offers improved security and management capabilities.
- Source: HashiCorp via TLDR DevOps
7. Amazon Redshift introduces AWS Graviton-based RG instances with an integrated data lake query engine
- Category: Community
- What happened: Amazon Redshift has introduced new RG instances powered by AWS Graviton processors, which include an integrated data lake query engine. This enhancement aims to improve performance and reduce costs for data analytics workloads.
- Worth reading: The introduction of Graviton-based instances may lead to cost savings and performance improvements for data analytics in Redshift, potentially affecting how teams manage their data workloads.
- Source: AWS via TLDR DevOps
8. Securing the Untrusted Agentic Development Layer
- Category: Community
- What happened: The Techstrong Brief article discusses emerging security risks in AI-powered development environments, specifically focusing on autonomous coding assistants and agent-based infrastructures being integrated into production pipelines and supply chains. As enterprises increase their AI technology investments, these untrusted agentic development layers present new attack vectors that require specialized governance and security controls to protect sensitive data and systems. SRE teams should implement robust security measures around AI development tooling, establish governance frameworks for autonomous coding assistants, and review current pipeline security controls to account for AI agent interactions with production systems. The article emphasizes that traditional development security practices may be insufficient for these emerging AI-integrated workflows, requiring updated security strategies specifically designed for agentic development environments.
- Worth reading: As autonomous coding assistants become more prevalent, understanding their security implications is crucial for maintaining the integrity of production environments - operators should consider how these tools interact with their pipelines and the potential risks involved.
- Sources: Techstrong Brief, Techstrong Brief, Techstrong Brief
9. Microsoft Study Warns AI Agents 'Corrupt' Data in Long Workflows
- Category: Community
- What happened: Microsoft Research published a study warning that AI agents can corrupt data during extended autonomous workflows, raising concerns about reliability as these systems are increasingly deployed for unattended tasks. The research suggests that longer-running AI agent workflows have higher risks of data integrity issues, though specific technical details about failure modes or affected AI platforms were not disclosed in the available reporting. SRE teams currently operating AI agents in production should implement additional data validation checkpoints and consider limiting the duration or scope of autonomous AI workflows until more detailed guidance becomes available. Organizations should also establish monitoring for data corruption patterns in existing AI-driven processes and review current AI agent deployment policies. The study emphasizes the need for stricter operational controls around AI agents performing autonomous labor tasks, according to Techstrong.ai reporting.
- Worth reading: This study raises concerns about the reliability of AI agents in production environments, particularly in workflows where data integrity is critical - operators should consider implementing safeguards when integrating AI into their processes.
- Sources: Techstrong.ai, Techstrong.ai
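One way to picture a "data validation checkpoint" around an agent step is to fingerprint the fields a step must not change and compare before and after. This is a minimal sketch under that assumption; it is not drawn from the Microsoft study, and `fingerprint`, `run_with_checkpoint`, and the `protected_fields` parameter are hypothetical names.

```python
import hashlib
import json


def fingerprint(records: list[dict], protected_fields: tuple[str, ...]) -> str:
    """Hash only the protected fields, serialized in a stable key order."""
    view = [{k: r.get(k) for k in protected_fields} for r in records]
    payload = json.dumps(view, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def run_with_checkpoint(step, records, protected_fields):
    """Run one workflow step and raise if any protected field changed.

    The check is order-sensitive: a step that reorders records also fails.
    """
    before = fingerprint(records, protected_fields)
    result = step(records)
    if fingerprint(result, protected_fields) != before:
        raise ValueError("protected fields changed during step")
    return result
```

Wrapping each step of a long workflow this way localizes a corruption to the step that introduced it instead of surfacing it at the end of the run.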
10. Zero-downtime DynamoDB construct migration: from Table to TableV2 with cdk orphan
- Category: Deep Dive
- What happened: The article discusses migrating an Amazon DynamoDB table from the Table construct to TableV2 using the AWS Cloud Development Kit (CDK) without downtime. It highlights the advantages of TableV2, particularly for applications requiring global replicas, as it allows for native CloudFormation management and per-replica configuration. The new cdk orphan command simplifies this migration process, ensuring data integrity and application availability during the transition.
- Takeaway: This migration method can significantly reduce downtime and complexity when transitioning to a more scalable DynamoDB setup, which is crucial for applications expecting to expand globally.
- Source: AWS Database Blog
CVE & Security
1. Google Detects AI-Created Exploit, Thwarts 'Mass Exploitation Operation'
- Category: Security / Patch
- What happened: Google researchers have identified and thwarted what they report as the first known AI-generated zero-day exploit used in a mass-exploitation campaign by threat groups. The security team detected this exploitation attempt before widespread damage occurred, though specific CVE numbers, affected software versions, and technical details of the vulnerability have not been publicly disclosed. This represents a significant escalation in threat actor capabilities, marking the transition from AI tools being used for reconnaissance and social engineering to actively generating functional exploit code for zero-day vulnerabilities. SRE teams should ensure their security monitoring systems are updated with the latest threat intelligence feeds from Google's security research division and review their incident response procedures for AI-assisted attacks. Organizations should also verify that their vulnerability management processes can rapidly respond to novel attack vectors that may not match traditional exploitation patterns.
- Do this Monday: This could lead to an increase in sophisticated attacks leveraging AI-generated exploits, necessitating enhanced security measures and vigilance in monitoring for such threats.
- Sources: Security Boulevard Newsletters, Security Boulevard Newsletters
2. Attackers Use Fake OpenAI Model to Push Credential-Stealing Malware
- Category: Security / Patch
- What happened: Cybercriminals are distributing credential-stealing malware through fake OpenAI models, exploiting the current AI hype to target unsuspecting users according to Techstrong Brief reporting. Meanwhile, legitimate AI security initiatives are expanding as OpenAI launched Daybreak (utilizing GPT-5.5) and Anthropic operates Project Glasswing, both cybersecurity programs showing nearly identical benchmarks and sharing three common industry partners as reported by The New Stack. SRE teams should implement strict controls around AI model downloads and installations, ensuring all OpenAI integrations come from official sources and verified repositories. Organizations should also evaluate whether either legitimate AI security initiative aligns with their vulnerability management programs, while maintaining heightened awareness of social engineering attacks leveraging AI brand recognition.
- Do this Monday: Operators should be aware of the evolving threat landscape as attackers exploit AI-related technologies to deliver malware, which could compromise credentials and sensitive data.
- Sources: Techstrong Brief, The New Stack
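One concrete control for model downloads is to compare a file's SHA-256 digest against a value published by the official source before loading it. The sketch below assumes such a digest is available through a trusted channel, which the reporting does not confirm for any particular model; `sha256_of` and `matches_published_digest` are hypothetical helper names.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in chunks so large model files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: Path, expected_hex: str) -> bool:
    """Return True only if the file's digest equals the published hex digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A digest match only shows the file is the one the publisher described; it says nothing about whether the publisher itself is legitimate, so it complements rather than replaces checking the source.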
3. Security Boulevard Weekly Edition (week of May 11–17, 2026): Google Detects AI-Created Exploit, Instructure Pays Ransom
- Category: Security / Patch
- What happened: Google has detected an AI-created exploit and successfully thwarted a mass exploitation operation. This highlights the evolving threat landscape where AI is being used to create sophisticated attacks. Additionally, an EdTech firm, Instructure, has paid a ransom as investigations begin by the U.S. House.
- Do this Monday: Operators should be aware of the increasing use of AI in cyber threats, which may require updates to security protocols and monitoring strategies. The incident with Instructure may also prompt reviews of incident response and ransomware policies.
- Source: Security Boulevard Newsletters
Releases
1. Kubernetes v1.36: Advancing Workload-Aware Scheduling
- Category: Release
- What happened: Kubernetes v1.36 has been released with major scheduling improvements, including the separation of API concerns between the Workload API and the new PodGroup API, plus new features like volume group snapshots and admission policies according to the Kubernetes Blog. The DevOps'ish newsletter reports that the Antrea Kubernetes project experienced a GitHub-based attack, though specific details about the nature or impact of this attack were not provided. SRE teams should evaluate upgrading to Kubernetes v1.36 for the enhanced workload-aware scheduling capabilities, particularly if they manage complex multi-pod workloads that could benefit from topology-aware scheduling. Teams using the Antrea CNI plugin should monitor the project's security advisories and verify the integrity of their current installations following the reported GitHub attack.
- Do this Monday: This release could affect production workloads that rely on batch processing and AI/ML tasks by improving scheduling efficiency and resource allocation. Operators may need to adapt to the new API structure and features to optimize their Kubernetes deployments.
- Sources: Kubernetes Blog, DevOps'ish
2. PCI PIN and P2PE compliance packages for AWS Payment Cryptography are now available
- Category: Release
- What happened: AWS has announced the completion of PCI PIN and PCI P2PE compliance assessments for its Payment Cryptography service. This includes validations for Key Management and Key Loading components, extending compliance to additional AWS regions. The service allows payment applications to utilize PCI-compliant hardware security modules for secure transaction processing, reducing compliance overhead for customers.
- Do this Monday: This compliance enables AWS customers to handle PIN-based and encrypted credit card transactions more securely and efficiently, potentially affecting how payment processing applications are architected and deployed in AWS.
- Source: AWS Security Blog
3. GitHub Enterprise Server 3.21 release candidate is available
- Category: Release
- What happened: GitHub has released Enterprise Server 3.21 as a release candidate with enhancements in deployment efficiency, monitoring, code security, and policy management, including new organization custom properties for tagging and hierarchy views for GitHub Projects. Additionally, GitHub has launched two new APIs in public preview: the Enterprise Installation API that enables GitHub Apps to determine enterprise installation status and retrieve installation IDs, and a REST API for Copilot Business and Enterprise users to programmatically start Copilot cloud agent tasks such as repository refactoring automation. SRE teams should evaluate the GHES 3.21 release candidate in non-production environments to assess the deployment and monitoring improvements, while Enterprise customers can begin testing the new APIs to determine integration opportunities with existing workflows. These updates collectively focus on improving enterprise automation capabilities and operational visibility across GitHub's platform.
- Do this Monday: The introduction of a new REST API version may require updates to existing integrations. Enhanced monitoring and security features could improve operational efficiency and compliance.
- Sources: GitHub Changelog, GitHub Changelog, GitHub Changelog
4. MinIO’s MemKV promises 95% better GPU utilization by ending AI recompute tax
- Category: Release
- What happened: MinIO has launched MemKV, a context memory store designed to enhance AI model performance by retaining situational data. This new product aims to reduce the 'recompute tax' associated with AI inference workloads, which occurs when GPUs lose context and repeat tasks. MemKV reportedly improves GPU utilization by over 95% and reduces costs per token by around 50%, addressing inefficiencies in AI infrastructure.
- Do this Monday: The introduction of MemKV could significantly enhance the efficiency of AI workloads, potentially impacting production environments that rely on GPU resources for inference tasks. Improved GPU utilization and reduced costs may lead to better performance and resource management.
- Source: The New Stack
5. Hacktron Plans to Build AI Platform to Test Code for Vulnerabilities
- Category: Release
- What happened: Hacktron is developing an AI platform to continuously test code for vulnerabilities, aiming to reduce false positives in security alerts. The platform will analyze every pull request and code change, providing remediation recommendations. Hacktron's team has experience in identifying critical vulnerabilities in open-source projects. The use of AI is expected to transform DevSecOps workflows, allowing for faster identification and remediation of vulnerabilities before deployment, although concerns remain about the security of new code generated by AI.
- Do this Monday: This development could significantly affect how DevOps teams manage application security, potentially reducing the time spent on false positives and improving vulnerability management processes.
- Source: DevOps.com
6. Cimento emerges from stealth to secure the one thing no firewall can protect
- Category: Release
- What happened: Cimento has launched an AI-driven human risk management platform aimed at addressing vulnerabilities caused by human behavior in cybersecurity. The platform creates a living risk profile for each employee by integrating with existing security tools and analyzing behavioral data to generate real-time risk scores. This approach seeks to replace traditional security training methods with continuous monitoring and automated responses to potential threats. Cimento also features a unique multi-turn phishing simulation to better assess and mitigate risks.
- Do this Monday: Cimento's focus on human risk management could change how organizations approach security training and incident response, potentially leading to more effective risk mitigation strategies. This may require adjustments in existing security protocols and tools to integrate with Cimento's platform.
- Source: The New Stack
7. Optimisation Tools for Jira: Reducing Configuration Bloat and Enhancing Performance
- Category: Release
- What happened: Atlassian has introduced optimisation tools for Jira to manage configuration bloat as the platform scales to support larger user bases. These tools help admins understand and identify unused configuration entities, such as fields and work types, and allow for bulk remediation actions to improve performance. The tools include reporting features and integrate with Site Optimiser to enhance the overall Jira experience by enforcing limits on configuration entities.
- Do this Monday: These optimisation tools are crucial for maintaining performance and usability in Jira as it scales, especially for large teams. Admins will need to engage with these tools to ensure their configurations remain efficient and within the new limits.
- Source: Atlassian Engineering
8. AWS cost and observability: 2 related updates (Valkey GLIDE on ElastiCache, CloudWatch metrics to OpenTelemetry)
- Category: Release
- What happened: HotelTrader demonstrated significant infrastructure optimization by migrating to Valkey GLIDE on Amazon ElastiCache, achieving a 95% reduction in inter-availability zone data transfer costs and 49% latency improvement through availability zone-aware routing (AWS Database Blog). Separately, AWS has published an architecture pattern for streaming CloudWatch metrics to VPC-based OpenTelemetry collectors using Lambda functions, enabling organizations to enhance observability while reducing third-party licensing costs through CloudWatch Metric Streams integration (AWS Architecture Blog). SRE teams managing Redis/ElastiCache workloads should evaluate Valkey GLIDE for similar cost and performance benefits, particularly in multi-AZ deployments with high cross-zone traffic. Teams looking to optimize observability costs should consider implementing the Lambda-based CloudWatch to OpenTelemetry streaming pattern to reduce dependency on expensive third-party monitoring solutions.
- Do this Monday: This case study highlights a significant cost-saving strategy and performance improvement that could be relevant for teams managing high-throughput applications on AWS, particularly those using ElastiCache.
- Sources: AWS Database Blog, AWS Architecture Blog
9. Cilium 1.19.4
- Category: Release
- What happened: Cilium released version 1.19.4 with several operational improvements including filtering of EndpointSlices by service proxy name, enhanced iptables masquerading functionality, and configurable SPIRE client support for ztunnel integration. The release includes critical bugfixes addressing egress gateway FIB lookup failures and IPsec trace source issues that could impact network connectivity and observability. SRE teams running Cilium should evaluate upgrading to v1.19.4 to resolve these networking and security issues, particularly if experiencing problems with egress gateways or IPsec tracing. Standard Cilium upgrade procedures should be followed with proper testing in non-production environments before rolling out to production clusters.
- Do this Monday: Operators using Cilium should review the changes to EndpointSlices filtering and iptables rules as they may affect service routing and network policies. The bugfixes could improve stability and observability in network operations.
- Sources: Cilium releases, Cilium releases
Lightning links
- MySQL 9.7: First Major LTS Since 8.4 Brings Enterprise Features to Community Edition (InfoQ DevOps) -- MySQL 9.7 introduces long-term support with new enterprise features for the community.
- Google Named a Leader in the Gartner® Magic Quadrant™ for AI Application Development Platforms (Google Cloud Blog) -- Google Cloud maintains its leadership position in AI application development platforms.
- The Rise of Composable Architectures to Replace Traditional Platforms (DevOps.com) -- Composable architectures are gaining traction as a modern alternative to monolithic platforms.
- 3 ways AI alert grouping is transforming on-call engineering at Atlassian (Atlassian Engineering) -- AI alert grouping in Jira Service Management helps on-call engineers manage alert overload.
- What SRE practice led to more than expected reduction of incidents? (Reddit r/sre) -- Better alert tuning significantly reduced incidents, proving small adjustments can have big impacts.
- Cline releases open-source agent runtime SDK for coding agents (TLDR AI) -- Cline's new SDK enables developers to build customizable agentic applications with ease.
- Cloudflare Launches “Artifacts” Beta, Introducing Git-Like Versioning for AI Agents (TLDR DevOps) -- Cloudflare's Artifacts feature offers Git-like version control for AI agents, enhancing collaboration.
- Amazon CloudWatch Logs Insights supports querying by log group tags (TLDR DevOps) -- New tagging support in CloudWatch Logs Insights simplifies dynamic log analysis and reduces overhead.
Human Stories
Looking at Instructure's massive ransomware incident alongside Cloudflare's coordinated global maintenance and AWS's new security guidance, I'm struck by how much our industry has learned about transparency in the face of operational challenges. A decade ago, Instructure might have tried to downplay the scope of their breach, but instead they're openly acknowledging the impact on 275 million Canvas users while Congress launches an investigation - that's the kind of accountability that builds trust even in crisis. Meanwhile, Cloudflare is telegraphing their maintenance windows across a dozen datacenters with surgical precision, and AWS is proactively publishing crypto mining detection strategies rather than waiting for customers to get burned. There's something hopeful in seeing our collective response to operational risk evolve from secrecy and reactive fixes toward this more mature posture of proactive communication and shared learning, even when the stakes couldn't be higher.
Also worth reading
What SRE practice led to more than expected reduction of incidents? (Reddit r/sre)
The discussion highlights that better alert tuning has led to a more significant reduction in incidents than implementing new monitoring tools. This suggests that small adjustments in reliability practices can have a substantial impact on incident management.