On Call Brief – Week of June 21–27, 2026

2026-06-21 to 2026-06-27 | Briefing: 2026-06-21 | Last updated: Jun 26, 2026 3:07 am EDT

This week's top stories

1. I discovered a large-scale malware distribution on GitHub

Category: Community
What happened: GitHub has moved its Code Quality feature to General Availability with new pricing and capabilities for integrated code quality tools within the platform. Separately, security researchers have identified a large-scale malware distribution campaign affecting approximately 10,000 GitHub repositories that use regularly updated readme files to link to zip archives containing Trojan malware disguised under various names. SRE and DevOps teams should review their dependency sources and scanning tools to detect potential malicious repositories, avoid downloading code from unfamiliar or suspicious GitHub accounts, and implement additional verification steps when pulling code from public repositories. Organizations using GitHub should also evaluate the new Code Quality GA offering to determine if it provides value for their security and code review workflows.
Worth reading: This discovery highlights a significant security risk for organizations using GitHub, as malicious repositories can compromise systems if users inadvertently download and execute the malware. Operators should enhance their monitoring and scanning practices for third-party code.
Sources: Orchidfiles via TLDR Dev, Techstrong Brief
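
As a rough illustration of the scanning advice above, the sketch below flags README links that point at downloadable archives. It is a heuristic under stated assumptions, not the researchers' detection method: the extension list and the `find_archive_links` name are hypothetical, and a match only marks a repository for manual review.

```python
import re

# Matches http(s) URLs that end in an archive extension, optionally followed
# by a query string. The extension list is an illustrative assumption.
ARCHIVE_LINK = re.compile(
    r"https?://[^\s)\"'<>]+\.(?:zip|rar|7z)(?![A-Za-z0-9])(?:\?[^\s)\"'<>]*)?",
    re.IGNORECASE,
)


def find_archive_links(readme_text: str) -> list[str]:
    """Return every archive URL found in the README text, in order of appearance."""
    return ARCHIVE_LINK.findall(readme_text)


if __name__ == "__main__":
    sample = "Download [here](https://example.com/files/Setup.zip) or read https://example.com/docs"
    for url in find_archive_links(sample):
        print("review before downloading:", url)
```

A hit is not proof of malware, since many legitimate projects link to release archives. The check is only useful as a cheap first filter ahead of real scanning.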

2. MSG Breach: Knicks Take the NBA Championship, ShinyHunters Takes the Data

Category: Community
What happened: The ransomware group ShinyHunters leaked 45GB of data from Madison Square Garden after the venue failed to meet their payment deadline, marking MSG's second major cyber breach and highlighting escalating threats against sports organizations. While specific technical details about the attack vector have not been publicly disclosed, operators managing similar large-scale venue infrastructure should review access controls, backup integrity, and incident response procedures given the pattern of targeting in this sector. SRE teams should ensure monitoring systems can detect unusual data exfiltration patterns and verify that segmentation controls prevent lateral movement between operational technology systems and corporate networks. Organizations in the sports and entertainment vertical should evaluate their cyber insurance coverage and ransom payment policies in light of this breach, as the attackers demonstrated willingness to follow through on data leaks when demands were not met.
Worth reading: Organizations in the sports sector should reassess their cybersecurity measures, particularly data protection controls and ransom payment policies, to reduce the risk of similar breaches.
Sources: Security Boulevard Newsletters

3. PostgreSQL: pgAdmin 4 v9.16

Category: Community
What happened: Datadog engineering discovered during a gameday that PostgreSQL failover on Kubernetes can be unsafe when network latency causes replication lag, potentially promoting a standby with stale data and leading to data inconsistency or loss. Separately, pgAdmin 4 version 9.16 has been released addressing seven security vulnerabilities including critical SQL injection and cross-site scripting issues across 64 total bug fixes. PostgreSQL operators running on Kubernetes should review their failover automation to validate replication lag checks before promotion and ensure monitoring for network-induced lag, while pgAdmin users should upgrade to v9.16 immediately to remediate the critical security issues. The Kubernetes failover risks are particularly relevant for operators relying on automated high-availability solutions that may not adequately verify standby readiness before switching over.
Worth reading: This highlights a critical failure point in PostgreSQL high-availability setups on Kubernetes: failover logic needs to verify standby state before promotion, or network-induced replication lag can lead to promoting stale data.
Sources: Datadoghq via TLDR DevOps, PostgreSQL News
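
The replication lag check recommended above can be sketched as a small promotion gate. This is a minimal illustration, not Datadog's fix or any operator's actual logic: the `StandbyStatus` type, the function name, and both thresholds are hypothetical, and the lag figures are assumed to come from your own monitoring (for example, values derived from `pg_stat_replication` on the primary).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StandbyStatus:
    name: str
    replay_lag_bytes: int              # WAL not yet replayed, as last measured
    seconds_since_last_contact: float  # age of that measurement


def pick_promotion_candidate(
    standbys: list[StandbyStatus],
    max_lag_bytes: int = 16 * 1024 * 1024,  # hypothetical threshold
    max_silence_seconds: float = 10.0,      # hypothetical threshold
) -> Optional[StandbyStatus]:
    """Return the least-lagged standby with fresh data, or None to refuse failover."""
    eligible = [
        s for s in standbys
        if s.replay_lag_bytes <= max_lag_bytes
        and s.seconds_since_last_contact <= max_silence_seconds
    ]
    if not eligible:
        return None  # no safe candidate: page a human instead of promoting stale data
    return min(eligible, key=lambda s: s.replay_lag_bytes)
```

Rejecting standbys whose lag measurement is itself stale is the part that matters during a network problem, because a low lag figure recorded before the partition says nothing about the standby's current state.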

4. Amazon EKS now supports customer-routed control plane egress

Category: Community
What happened: Amazon EKS now supports customer-routed control plane egress, which lets compliance teams route control plane traffic through their own VPC. The feature itself carries no additional charge, but traffic that passes through a NAT Gateway is still billed at NAT Gateway rates, which can produce unexpected charges.
Worth reading: This change enables better compliance and control over network traffic for EKS users, but operators should be aware of potential costs associated with NAT Gateway usage.
Source: AWS via Last Week in AWS

5. AWS Management Console Private Access now works without internet connectivity

Category: Community
What happened: AWS Management Console Private Access now allows users to connect without internet connectivity, addressing a long-standing request for air-gapped access. Users will only incur costs for PrivateLink endpoints and data processing, maintaining a per-gigabyte pricing model.
Worth reading: This change enables secure access to the AWS Management Console in environments without internet, which could enhance security for sensitive operations. Operators should consider the cost implications of using PrivateLink endpoints.
Source: AWS via Last Week in AWS

6. Production-Ready Autonomous Incident Resolution with AWS DevOps Agent (now GA) and Datadog MCP Server

Category: Deep Dive
What happened: AWS DevOps Agent and Datadog MCP Server are now generally available, enabling autonomous incident resolution across AWS and multicloud environments. The integration allows AI agents to access observability data from Datadog, facilitating faster incident resolution by correlating monitoring data with infrastructure. The AWS DevOps Agent automates incident triage and response, optimizing application reliability and performance while providing proactive prevention recommendations. It supports integration with tools like Slack and PagerDuty for streamlined communication during incidents.
Takeaway: The general availability of AWS DevOps Agent and Datadog MCP Server may significantly shorten incident response times and reduce manual effort in operations. Teams can use these tools to automate parts of incident triage and resolution, which matters more as applications become more complex and distributed. This could improve the reliability and performance of applications in production.
Source: AWS DevOps Blog

7. Accelerate Incident Resolution with PagerDuty and AWS DevOps Agent

Category: Deep Dive
What happened: The integration of PagerDuty with the AWS DevOps Agent aims to accelerate incident resolution by enabling the DevOps Agent to initiate investigations automatically when a PagerDuty incident is triggered. This integration allows for real-time analysis and correlation of data across various observability tools, reducing the time spent manually gathering information. The AWS DevOps Agent can also provide proactive recommendations to improve observability and infrastructure, ultimately helping teams resolve incidents more efficiently.
Takeaway: This integration could significantly reduce incident resolution times by automating the investigation process and providing immediate insights, which may lead to improved uptime and customer satisfaction. Teams using AWS and PagerDuty should consider implementing this integration to enhance their incident response capabilities.
Source: AWS DevOps Blog

8. F5 Embeds Neural Network in WAF Platform to Continuously Assess Risks

Category: Community
What happened: F5 has integrated a neural network-based risk engine into its Web Application and API Protection (WAAP) solution that continuously learns and scores each request in real-time to identify attack patterns without depending on static signatures. This machine learning capability represents a shift from traditional signature-based WAF detection to adaptive threat identification that evolves with observed traffic patterns. Operators using F5's WAAP platform should evaluate this feature for potential deployment, particularly in environments where zero-day exploits and novel attack vectors are concerns, though testing in non-production environments is recommended before enabling adaptive scoring in production. The neural network approach may reduce false positives while catching previously unknown attacks, but SREs should monitor initial performance impacts and tuning requirements during rollout.
Worth reading: The integration of a neural network into the WAAP solution could improve threat detection capabilities and reduce reliance on traditional signature-based methods, potentially enhancing security posture in production environments.
Sources: Security Boulevard Newsletters

9. Friday Five — June 19, 2026

Category: Community
What happened: Project Lightwell is a collaborative initiative between Red Hat and IBM aimed at securing the software supply chain. This $5 billion project involves a group of financial and critical infrastructure leaders working together to create a secure enterprise clearinghouse for software.
Worth reading: This initiative may influence production environments by enhancing the security of software supply chains, which is critical for enterprises relying on open-source solutions.
Source: Red Hat Blog

10. NYC Sewers Crawling With Rats and Potential Bad Actors

Category: Community
What happened: Surveillance footage has captured unidentified individuals accessing New York City's sewer system, raising concerns about potential threats to critical operational technology infrastructure that controls wastewater management. Security experts warn that NYC's aging sewer monitoring systems, which have been increasingly connected to the internet in recent years, may be vulnerable to physical tampering or cyberattacks that could disrupt essential services. SRE teams managing municipal or critical infrastructure OT systems should review physical access controls to underground facilities, audit network segmentation between IT and OT environments, and ensure monitoring systems for water and wastewater infrastructure have proper authentication and access logging enabled. Organizations operating similar legacy OT systems should assess whether internet-connected monitoring equipment has been deployed without adequate security controls and consider implementing additional physical security measures at infrastructure access points.
Worth reading: This situation highlights vulnerabilities in critical infrastructure security, particularly with legacy systems that may not be adequately protected against physical and cyber threats - operators should assess their own OT systems for similar risks.
Sources: Security Boulevard Newsletters

CVE & Security

1. Issue with containerd CRI Plugin - CVE-2026-50195, CVE-2026-53488, CVE-2026-53492, CVE-2026-53489, CVE-2026-47262

Category: Security / Patch
What happened: The containerd project has released security updates across all supported versions (1.7.33, 2.0.10, 2.1.9, 2.2.5, and 2.3.2) to address five critical vulnerabilities in the CRI plugin that affect versions 1.7 through 2.3. The CVEs include CVE-2026-50195 (checkpoint import issues), CVE-2026-53488 (command execution), CVE-2026-53492 (annotation smuggling), CVE-2026-53489 (arbitrary file read), and CVE-2026-47262 (potential denial of service), as identified by AWS Security Bulletins. Operators running containerd should immediately upgrade to the patched versions corresponding to their release branch, with version 2.3.2 also including a fix for a Windows-specific data race when reading shim logs. The 1.7.33 release additionally updates runc to v1.3.6 and addresses CVE-2026-34986, providing comprehensive security improvements across the container runtime stack.
Do this Monday: Operators using AWS managed container services should prioritize patching containerd to mitigate these vulnerabilities, as they could lead to serious security risks including command execution and denial of service.
Sources: AWS Security Bulletins, containerd releases
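
To triage a fleet against the patched releases listed above, a version check along these lines may help. It is a sketch only: the `PATCHED` table copies the five versions named in this item, while the function names and the choice to ignore pre-release suffixes are assumptions made for illustration.

```python
# Patched release per supported branch, as listed in this brief.
PATCHED = {
    (1, 7): (1, 7, 33),
    (2, 0): (2, 0, 10),
    (2, 1): (2, 1, 9),
    (2, 2): (2, 2, 5),
    (2, 3): (2, 3, 2),
}


def parse_version(raw: str) -> tuple[int, int, int]:
    """Parse 'v1.7.30' or '2.3.1-rc.1' into (major, minor, patch).

    Pre-release and build suffixes are dropped, which is a simplification.
    """
    core = raw.strip().lstrip("v").split("-")[0].split("+")[0]
    major, minor, patch = (int(part) for part in core.split("."))
    return major, minor, patch


def needs_upgrade(raw: str) -> bool:
    """True if the version is older than the patched release on its branch."""
    version = parse_version(raw)
    fixed = PATCHED.get(version[:2])
    if fixed is None:
        raise ValueError(f"no patched release listed for branch {version[0]}.{version[1]}")
    return version < fixed
```

Branches outside the table raise an error rather than returning a guess, so unsupported versions surface for manual review.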

2. DevOps'ish 314: GitHub Ignored the Reports, Norway Didn't, AI Needs More Discipline, and More

Category: Security / Patch
What happened: GitHub has reportedly failed to act on security vulnerability reports that are now being actively exploited by a supply-chain worm, which has compromised numerous packages and developer accounts according to DevOps'ish 314. The incident represents an active supply chain attack affecting the GitHub ecosystem, though specific CVE numbers and package names were not disclosed in the available reporting. Operators should immediately audit their dependencies for compromised packages, review access logs for unusual authentication patterns on developer accounts, and implement additional supply chain security controls such as dependency pinning and signature verification. Organizations relying on GitHub packages should monitor security advisories closely and consider implementing automated scanning tools to detect indicators of compromise in their software supply chain. This case underscores the critical need for platform operators to prioritize and respond promptly to community-reported security issues before they escalate into active exploitation.
Do this Monday: Audit dependencies for compromised packages and review developer account access logs, since exploitation of these vulnerabilities puts organizations that rely on GitHub repositories at significant risk. The newly introduced Mendral could help teams by ensuring dependencies are reviewed for potential threats before integration.
Sources: DevOps'ish
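
Dependency pinning, mentioned above, comes down to comparing a downloaded artifact against a digest recorded ahead of time. The sketch below shows that comparison with Python's standard library; the function names are hypothetical, and it covers hash pinning only, not signature verification.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the lowercase hex SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def matches_pin(artifact: bytes, pinned_sha256: str) -> bool:
    """True if the artifact's digest equals the pinned digest.

    compare_digest keeps the comparison constant-time; the pin is normalized
    so stray whitespace or uppercase hex does not cause a false mismatch.
    """
    return hmac.compare_digest(sha256_hex(artifact), pinned_sha256.strip().lower())


if __name__ == "__main__":
    pin = sha256_hex(b"trusted release bytes")  # recorded when the dependency was vetted
    print(matches_pin(b"trusted release bytes", pin))   # unchanged artifact
    print(matches_pin(b"tampered release bytes", pin))  # modified artifact
```

Package managers already provide this in lockfiles and hash-checking modes. The sketch only shows what those modes verify.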

3. YouTrack Security Update: Upgrade Required for YouTrack Server

Category: Security / Patch
What happened: YouTrack administrators are advised to upgrade to fixed versions due to several identified security vulnerabilities. YouTrack Cloud has already been patched, but YouTrack Server users must upgrade to versions 2024.2 or newer to mitigate risks. The vulnerabilities include potential admin account takeovers and email verification bypasses, with specific CVEs assigned for tracking.
Do this Monday: Failure to upgrade YouTrack Server could expose systems to critical vulnerabilities, including admin account takeovers and email verification bypasses. Immediate action is recommended for affected installations.
Source: JetBrains Blog

4. Claude Fable 5 on Bedrock Requires Sharing Inference Data with Anthropic

Category: Security / Patch
What happened: Using Claude Fable 5 or Mythos 5 on Amazon Bedrock requires opting into data sharing: prompts and outputs are sent to Anthropic, retained for 30 days, and subject to human review. This is a change from previous Bedrock models, which kept inference data within AWS. Shortly after the launch, Anthropic asked AWS to revoke access to both models in order to comply with US export control regulations.
Do this Monday: This change in data handling practices may affect how organizations using Bedrock manage their data privacy and compliance, particularly regarding sensitive information and export controls - operators should review their data sharing policies.
Source: InfoQ DevOps

Releases

1. Upgrading Lambda function runtimes at scale with AWS Transform custom

Category: Release
What happened: AWS has introduced two new developer tools aimed at streamlining cloud operations and Lambda management. AWS Transform custom is a new service designed to help organizations systematically upgrade Lambda function runtimes at scale by identifying risks and ensuring test coverage during the transformation process, particularly useful for teams managing large numbers of functions. Separately, AWS has released the Kiro power for AWS DevOps Agent, which embeds cloud operational intelligence directly into the Kiro AI-powered IDE, enabling developers to troubleshoot production issues without context switching between tools. SRE teams managing multiple Lambda deployments should evaluate AWS Transform custom for planned runtime upgrades, while those using Kiro IDE can integrate the DevOps Agent to reduce mean time to resolution for production incidents.
Do this Monday: Organizations using AWS Lambda need to be aware of upcoming deprecation timelines for runtimes like Node.js and Python to avoid security risks and compliance issues. The introduction of AWS Transform custom can significantly reduce the engineering effort required for large-scale upgrades, allowing teams to focus on feature development while maintaining compliance.
Sources: AWS Compute Blog, AWS DevOps Blog
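
Before planning a runtime upgrade, it helps to know which functions are affected. The sketch below filters an inventory by runtime identifier. It does not show how AWS Transform custom works: the function name is hypothetical, the input is assumed to be a list of dicts with the `FunctionName` and `Runtime` keys that Lambda's ListFunctions response uses, and the runtime set in the example is illustrative rather than an official deprecation list.

```python
from typing import Iterable


def functions_on_runtimes(functions: Iterable[dict], target_runtimes: Iterable[str]) -> list[str]:
    """Return sorted names of functions whose runtime is in target_runtimes.

    Container-image functions have no "Runtime" key and are skipped.
    """
    targets = set(target_runtimes)
    return sorted(fn["FunctionName"] for fn in functions if fn.get("Runtime") in targets)


if __name__ == "__main__":
    inventory = [
        {"FunctionName": "billing-export", "Runtime": "python3.8"},
        {"FunctionName": "thumbnailer", "Runtime": "nodejs16.x"},
        {"FunctionName": "api-handler", "Runtime": "python3.12"},
        {"FunctionName": "ml-inference"},  # container image, no Runtime field
    ]
    print(functions_on_runtimes(inventory, {"python3.8", "nodejs16.x"}))
```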

2. AWS Security Agent announces support for Threat Modeling

Category: Release
What happened: AWS Security Agent has added Threat Modeling support, using AI to identify potential vulnerabilities in user architectures. The feature is free during its preview phase, and the source raises concerns about pricing transparency.
Do this Monday: This new AI-driven threat modeling tool could help teams proactively identify security risks in their AWS architectures, potentially improving security posture and compliance efforts.
Source: AWS via Last Week in AWS

3. Bcachefs exits experimental status in new 'performance release'

Category: Release
What happened: Bcachefs version 1.38.6 has been released, marking the filesystem's exit from experimental status. The release includes performance optimizations and bug fixes, and raises the maximum number of devices in a filesystem to 255. The Reconcile operation is now faster and more parallel. The userspace code has also been converted to Rust, with plans for further integration into the DKMS module. Performance comparisons show bcachefs reaching competitive speeds against XFS, although some benchmarks indicate it is slower in certain scenarios.
Do this Monday: The transition of bcachefs to a non-experimental status and its performance enhancements could influence filesystem choices in production environments, especially for those considering alternatives to traditional filesystems like XFS. The ongoing Rust conversion may also affect future compatibility and performance.
Source: The Register (Software)

Also this week

Community reads

11. Losing Fable made the best case yet for AI models you can run yourself

Category: Community
What happened: The article discusses the recent shutdown of the AI model Fable due to U.S. export-control directives, highlighting the risks of relying on hosted models. It emphasizes the importance of open-weight models, such as Z.ai's GLM-5.2, which users can download and run independently. The situation serves as a cautionary tale for enterprises that built automation on Fable, illustrating the potential consequences of losing access to a hosted service. The piece also notes the growing interest in GLM-5.2, which is being touted as a leading open-weight model for coding tasks.
Worth reading: The shutdown of Fable underscores the risks associated with hosted AI models, prompting a shift towards open-weight models that can be self-hosted. This may influence operational decisions regarding AI model deployment and reliance on third-party services.
Source: The New Stack

12. AWS launches S3 annotations for mutable, queryable object metadata

Category: Community
What happened: AWS has launched S3 annotations, a new feature that enables users to store mutable and queryable metadata directly on S3 objects, providing native metadata management capabilities beyond the existing immutable object metadata. Unlike traditional S3 object metadata which is set at upload time and cannot be changed, annotations allow operators to update and query metadata independently of the object itself, enabling richer contextual information for stored data. This feature is immediately available and requires no migration of existing objects, though operators should evaluate use cases where dynamic metadata tracking (such as classification tags, processing status, or access patterns) could replace external database lookups or custom indexing solutions. According to readysetcloud.io, this enhancement aims to improve S3 usability by offering a native alternative to maintaining separate metadata stores or using object tags with their limited query capabilities.
Worth reading: This feature allows for more flexible data management and querying capabilities directly within S3, which could improve workflows that rely on metadata for object retrieval and processing.
Sources: readysetcloud.io

13. Amazon ECS introduces new high-resolution metrics for faster service auto scaling

Category: Community
What happened: Amazon ECS has added support for 20-second high-resolution CloudWatch metrics for service auto scaling. This enhancement allows for a quicker response to workload changes, achieving up to 76% faster scale-out and 72% faster end-to-end provisioning. The update aims to improve performance, reduce over-provisioning costs, and simplify scaling configurations using target tracking policies across Fargate, EC2, and managed instances.
Worth reading: This change could lead to more efficient resource utilization and cost savings in production environments by enabling faster scaling in response to workload fluctuations. Operators should consider updating their scaling configurations to leverage these new metrics.
Source: AWS via TLDR DevOps

14. Amazon ECS service auto scaling now supports 20-second high-resolution CloudWatch metrics

Category: Community
What happened: Amazon ECS service auto scaling now supports 20-second high-resolution CloudWatch metrics, replacing the previous default of 60-second granularity and enabling faster response to workload changes. This lets ECS services scale up or down more rapidly based on real-time traffic patterns, potentially reducing over-provisioning costs and improving application responsiveness during traffic spikes. SRE teams running containerized workloads on ECS should evaluate updating their auto scaling policies to use the higher-resolution metrics for services with volatile or rapidly changing traffic. The feature is immediately available across all AWS regions where ECS operates and requires no version upgrades, but scaling thresholds and cooldown periods may need adjusting to prevent oscillation at the higher data resolution.
Worth reading: This change could lead to improved scaling performance for ECS services, allowing operators to respond more quickly to workload fluctuations - it may require adjustments in monitoring and scaling policies.
Sources: Chronosphere via TLDR DevOps
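
The oscillation risk mentioned above is easier to reason about with a toy model. The sketch below uses a simplified proportional rule with a cooldown. It is not the algorithm ECS or Application Auto Scaling actually runs, and every name and default in it is an assumption chosen for illustration.

```python
import math


def desired_task_count(current_tasks: int, metric_value: float, target_value: float,
                       min_tasks: int = 1, max_tasks: int = 100) -> int:
    """Scale task count in proportion to metric/target, clamped to the bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_tasks * metric_value / target_value)
    return max(min_tasks, min(max_tasks, raw))


def next_task_count(current_tasks: int, metric_value: float, target_value: float,
                    seconds_since_last_change: float, cooldown_seconds: float = 60.0) -> int:
    """Hold the current count during the cooldown window, otherwise rescale."""
    if seconds_since_last_change < cooldown_seconds:
        return current_tasks  # ignore fresh data points until the last change settles
    return desired_task_count(current_tasks, metric_value, target_value)
```

With data points arriving every 20 seconds instead of every 60, a cooldown shorter than the time new tasks need to start taking load lets the rule react to its own unfinished scale-out. That is why the cooldown is worth revisiting when switching resolution.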

15. AWS Previews Release Management Capabilities Added to DevOps Agent

Category: Community
What happened: AWS has announced a preview of AI-native release management capabilities for its DevOps Agent, intended to streamline deployment workflows and make software releases more efficient to manage.
Worth reading: The introduction of AI-native release management could significantly affect how teams handle deployments, potentially reducing errors and improving speed. Operators should evaluate how these new capabilities can be integrated into their existing workflows.
Source: Techstrong Brief

Lightning links

NVIDIA and AWS Collaborate to Bring AI to Production at Scale (TLDR AI) -- NVIDIA and AWS enhance AI deployment with new GPUs for improved performance.
Vercel Connect Launches to Replace Long-Lived Tokens (TLDR AI) -- Vercel's new Connect feature enhances security with short-lived, task-scoped credentials.
Helm v3.21.2 Released for Kubernetes Compatibility (Helm releases) -- Upgrade to Helm v3.21.2 for better alignment with Kubernetes v1.36.
New Postgres Language Server: postgres-lsp (PostgreSQL News) -- The postgres-lsp enhances PostgreSQL development with Language Server Protocol support.
GitHub Retires Its Free AI Model Playground (DevOps.com) -- Developers need to adapt as GitHub phases out its free AI model access.
Cloudflare Introduces Temporary Accounts for AI Agents (Cloudflare Blog) -- New feature allows AI agents to deploy without prior account setup.
OpenAI Set to Launch GPT-5.6 with Enhanced Features (TLDR AI) -- GPT-5.6 promises a larger context window, impacting AI agent capabilities.
Organizations Confuse Speed with Effective Flow in Agile Practices (DevOps.com) -- Understanding flow can improve value delivery in Agile and DevOps.

Human Stories

Looking at the EKS control plane egress feature and AWS Console Private Access landing in the same week, I'm reminded that the biggest shifts in our industry aren't always the flashy AI announcements - sometimes they're the unglamorous compliance features that teams have been filing tickets about for half a decade. We've spent so much energy routing around platform limitations that when vendors finally close those gaps, it feels almost anticlimactic, yet these are the changes that let us sleep better at night. The autonomous incident resolution tools from AWS and PagerDuty promise to handle the 3am pages for us, which sounds amazing until you remember that PostgreSQL failover story from Datadog's gameday - automation only works when the underlying systems behave predictably under stress. The real thread here is that maturity isn't about having the newest toy; it's about having systems you can trust when things go sideways, whether that's network partitions during failover or compliance auditors asking where your control plane traffic actually flows.

Also worth reading

“Time to clean up human slop”: Why AI now reviews code better than your teammate. (The New Stack)

The article discusses the evolving role of AI in code reviews, suggesting that AI can outperform human reviewers by reducing common human errors, referred to as 'human slop'. It highlights the inefficiencies of current peer review processes, where delays and lack of context can lead to suboptimal outcomes.

Conan O'Brien Deadpans Deepfakes (Security Boulevard Newsletters)

Adaptive Security has collaborated with comedian Conan O'Brien to create engaging training videos focused on deepfake and phishing awareness. This initiative aims to replace traditional training modules with comedic content, addressing the rising threat of AI-driven fraud.

An incident response playbook for satellite operations on AWS (Part-1): Detection and forensic readiness (SRE Weekly)

The article introduces a fictional character, Alice, to discuss the importance of incident response playbooks specifically for satellite operations on AWS. It emphasizes the need for detection and forensic readiness in managing incidents effectively.
