On Call Brief – Week of May 17–23, 2026
This week's top stories
1. CISA Credentials, Sensitive Data Exposed in GitHub Repository
- Category: Community
- What happened: According to Security Boulevard Newsletters, a contractor accidentally left a GitHub repository named "Private-CISA" exposed for several months. It contained sensitive CISA and DHS credentials, including AWS access keys, plaintext passwords, authentication tokens, and system logs. The exposure is a significant security lapse affecting critical government infrastructure systems. SRE teams should audit their own repositories for accidentally committed secrets, enable automated secret scanning with tools such as GitHub secret scanning or Gitleaks, and make credential checks an explicit part of code review before commits reach public or shared repositories. Organizations should also rotate any credentials that may have appeared in earlier commits and confirm that .gitignore rules keep sensitive files out of version control.
- Worth reading: This exposure could lead to unauthorized access to sensitive systems and data, potentially compromising national security and requiring immediate review of access controls and security practices.
- Source: Security Boulevard Newsletters
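The repository audit recommended in this item can be approximated with a small pattern scan. This is a minimal sketch, not a replacement for GitHub secret scanning or Gitleaks: the `AKIA` prefix followed by 16 uppercase alphanumeric characters is the usual shape of a long-term AWS access key ID, while the password pattern, the `find_secrets` name, and the sample text are illustrative assumptions.

```python
import re

# "AKIA" + 16 uppercase alphanumerics is the common shape of a long-term
# AWS access key ID. The password pattern is a rough illustrative heuristic.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\b(AKIA[0-9A-Z]{16})\b"),
    "password_assignment": re.compile(r"(?i)\bpassword\s*[:=]\s*['\"]?([^'\"\s]+)"),
}


def find_secrets(text):
    """Return (line_number, pattern_name) pairs for lines that look like secrets."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.append((lineno, name))
    return hits


# Hypothetical file contents; the key below is AWS's documented example value.
sample = "\n".join([
    "region = us-east-1",
    "aws_access_key_id = AKIAIOSFODNN7EXAMPLE",
    "password = hunter2",
])
print(find_secrets(sample))
```

A scan like this only sees the text it is given. Secrets removed in a later commit still live in Git history, which is why the item also calls for rotating anything that was ever committed.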
2. GitHub Breach Tied to Malicious VS Code Extension Exposes Thousands of Internal Repositories
- Category: Deep Dive
- What happened: A malicious Visual Studio Code extension led to a security breach at GitHub that exposed thousands of internal repositories. The article covers what the incident means for developers and organizations that rely on GitHub for version control.
- Takeaway: This breach could lead to unauthorized access to sensitive code and data, necessitating immediate review of security practices around third-party extensions and repository access controls.
- Source: Techstrong Brief
3. Researcher says Microsoft secretly built a backdoor into BitLocker
- Category: Deep Dive
- What happened: A security researcher claims that Microsoft has secretly integrated a backdoor into BitLocker, which could potentially compromise the security of encrypted data. The discussion surrounding this allegation raises concerns about trust in encryption technologies and the implications for user data privacy.
- Takeaway: If true, this could affect the integrity of data protection mechanisms in production environments that rely on BitLocker for encryption - organizations may need to reassess their encryption strategies.
- Source: Techspot via Lobsters
- Discussion: https://lobste.rs/s/ynxkj6/researcher_says_microsoft_secretly
4. Barracuda Networks Report Identifies CypherLoc Scareware Kit
- Category: Community
- What happened: According to Security Boulevard Newsletters, Barracuda Networks has attributed approximately 2.8 million attacks to the CypherLoc scareware kit, which freezes browsers and redirects users to fraudulent tech support lines using encrypted loaders designed to evade detection. The kit is a threat to enterprise environments because it can bypass traditional security controls, disrupt user productivity, and open the door to social engineering attacks against employees. SRE and DevOps teams should confirm that web filtering and endpoint detection are updated to recognize CypherLoc's encrypted loader patterns, and apply browser security policies that block malicious redirect chains. Organizations should also review incident response procedures for browser-based attacks, consider awareness training on tech support scams, and extend network monitoring to flag unusual browser behavior and suspicious outbound connections that may indicate a CypherLoc infection.
- Worth reading: The rise of such scareware kits could lead to increased phishing attempts and user support issues, necessitating stronger browser security measures and user education.
- Source: Security Boulevard Newsletters
5. Terraform: 2.0
- Category: Community
- What happened: HashiCorp has released Terraform Enterprise 2.0 with significant operational enhancements including Stacks for multi-environment infrastructure management, project-level notifications, SCIM 2.0 automation for user provisioning, and improved governance and diagnostics capabilities according to TLDR DevOps. Concurrently, Pulumi Cloud has entered public preview as a Terraform state backend, enabling teams to store Terraform state files in Pulumi's infrastructure with unified visibility, role-based access control, encrypted state storage, and audit policies as reported by TLDR DevOps. SRE teams should evaluate Terraform Enterprise 2.0's new Stacks feature for simplifying multi-environment deployments and consider Pulumi Cloud as an alternative state backend if currently using multiple infrastructure tools. Organizations using both Terraform and Pulumi should assess the unified management benefits of consolidating state storage in Pulumi Cloud, particularly for teams seeking centralized access controls and audit capabilities across their infrastructure automation tools.
- Worth reading: The new features in Terraform Enterprise 2.0 could significantly enhance the management of infrastructure across multiple environments, improving operational efficiency and governance.
- Sources: HashiCorp via TLDR DevOps, Pulumi via TLDR DevOps
6. Automate root cause analysis across Datadog and Elasticsearch with AWS DevOps Agent
- Category: Deep Dive
- What happened: The article discusses how the AWS DevOps Agent can automate root cause analysis by integrating Datadog and Elasticsearch. It highlights the challenges of manually correlating logs and metrics across different systems in distributed environments, which can be time-consuming and error-prone. By using the AWS DevOps Agent, alerts from Datadog can trigger automated investigations that correlate signals across observability backends, significantly reducing the mean time to identify (MTTI) issues in complex systems.
- Takeaway: This automation can streamline incident response processes and reduce downtime, making it crucial for teams managing distributed systems to consider implementing this solution.
- Source: AWS DevOps Blog
7. The Invisible OOMKill: Why Your Java Pod Keeps Restarting in Kubernetes
- Category: Community
- What happened: This article discusses the issue of Java pods in Kubernetes being unexpectedly restarted due to out-of-memory (OOM) kills. It explains the underlying causes of OOM kills, particularly in Java applications, and offers insights into how to diagnose and mitigate these issues effectively.
- Worth reading: Understanding OOM kills is crucial for maintaining application stability in Kubernetes environments - operators should be aware of memory management and tuning for Java applications.
- Source: DZone via SRE Weekly
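One frequently cited cause of this kind of restart is a heap sized so close to the container memory limit that non-heap memory (metaspace, thread stacks, direct buffers) pushes the process over it. The summary above does not say which causes the article covers, so treat this as a hedged illustration. The arithmetic below assumes a container-aware JVM that sizes its maximum heap from `-XX:MaxRAMPercentage` (default 25); the `jvm_memory_budget` helper is hypothetical.

```python
def jvm_memory_budget(limit_mib, max_ram_percentage=25.0):
    """Split a container memory limit into max heap and what is left for non-heap use.

    Assumes the JVM derives its maximum heap from the container limit via
    -XX:MaxRAMPercentage. Real non-heap usage must be measured, not assumed.
    """
    max_heap = limit_mib * max_ram_percentage / 100.0
    return {
        "max_heap_mib": max_heap,
        "non_heap_headroom_mib": limit_mib - max_heap,
    }


# A 2 GiB limit with the heap set to 75% leaves 512 MiB for everything else.
print(jvm_memory_budget(2048, max_ram_percentage=75.0))
```

If measured non-heap usage approaches the headroom figure, the kernel can kill the container even though the JVM never reports a heap `OutOfMemoryError`, which is what makes the restart look invisible from inside the application.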
8. Reducing MTTR became a “context reconstruction” problem for us, not a monitoring problem.
- Category: Community
- What happened: The article discusses the challenges of reducing Mean Time to Recovery (MTTR) in multi-tenant Platform9/OpenStack environments. It highlights that while existing tools can indicate failures, they often lack the capability to explain operational changes leading to those failures. The author describes their approach of creating a unified operational timeline that aggregates events from various systems to improve incident response. Key challenges included ownership resolution and event deduplication, with the outcome showing that reducing context-switching during incidents significantly improved MTTR.
- Worth reading: This approach could enhance incident response efficiency by providing clearer context during failures, potentially leading to faster recovery times in production environments.
- Source: Reddit r/sre
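The unified timeline described in this item, with its deduplication challenge, can be sketched as a merge over several event feeds. This is a minimal sketch under stated assumptions: the event fields (`timestamp`, `resource`, `action`, `source`) and the choice of deduplication key are hypothetical, not the author's actual schema.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge event lists, drop duplicates, and sort chronologically.

    Two events count as duplicates when resource, action, and timestamp
    match, even if different systems reported them.
    """
    seen = set()
    timeline = []
    for events in sources:
        for event in events:
            key = (event["resource"], event["action"], event["timestamp"])
            if key in seen:
                continue
            seen.add(key)
            timeline.append(event)
    return sorted(timeline, key=lambda e: datetime.fromisoformat(e["timestamp"]))


# Hypothetical feeds: the same resize is reported by two systems.
platform_events = [
    {"timestamp": "2026-05-18T10:05:00+00:00", "resource": "vm-42",
     "action": "resize", "source": "openstack"},
]
other_events = [
    {"timestamp": "2026-05-18T10:07:00+00:00", "resource": "vm-42",
     "action": "alert_fired", "source": "monitoring"},
    {"timestamp": "2026-05-18T10:05:00+00:00", "resource": "vm-42",
     "action": "resize", "source": "audit-log"},
    {"timestamp": "2026-05-18T09:58:00+00:00", "resource": "net-7",
     "action": "config_change", "source": "audit-log"},
]

for event in build_timeline(platform_events, other_events):
    print(event["timestamp"], event["resource"], event["action"])
```

Exact-match keys are the simplest possible deduplication rule. Real feeds often disagree on timestamps by a few seconds, so a production version would likely need a tolerance window and an ownership lookup per resource.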
9. Versa Extends Zero-Trust Reach to Model Context Protocol to Secure AI Agents
- Category: Community
- What happened: Versa is expanding its zero-trust security framework to include the Model Context Protocol (MCP), aiming to secure AI agents before they are deployed at scale. This move comes as part of a competitive landscape where companies like Check Point are also enhancing their security offerings.
- Worth reading: This development may affect production environments that utilize AI agents, as implementing zero-trust principles can enhance security and reduce vulnerabilities in AI deployments.
- Source: Techstrong Brief
10. DevOps'ish 309: Dirty Pages All the Way Down, The Cloud Is Hot, and more
- Category: Community
- What happened: DevOps'ish 309 highlights a rising rate of Linux kernel vulnerability disclosures and the pressure this puts on upgrade safety. Operators should test kernel patches in a pre-upgrade environment and keep automated rollback procedures ready before rolling them out. The issue also covers Kubernetes developments, including PSI (Pressure Stall Information) metrics graduating to General Availability, which gives operators better visibility into resource pressure, and a case study on migrating from ingress-nginx to Envoy. SRE teams should review their kernel patching workflows in light of the faster disclosure pace.
- Worth reading: Operators should be aware of the kernel vulnerabilities and ensure their systems are prepared for frequent updates. The Kubernetes updates may require adjustments in resource management and scheduling strategies.
- Source: DevOps'ish
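For readers new to PSI, the sketch below parses the text format the Linux kernel exposes under `/proc/pressure/` (for example `/proc/pressure/memory`). It shows the underlying kernel data rather than the Kubernetes metrics themselves. The sample values, the `is_under_pressure` helper, and its 10% threshold are illustrative assumptions.

```python
def parse_psi(text):
    """Parse PSI output such as /proc/pressure/memory into nested dicts."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()
        # Each field looks like "avg10=12.50"; every value is numeric.
        result[kind] = {k: float(v) for k, v in (f.split("=") for f in fields)}
    return result


def is_under_pressure(psi, kind="some", window="avg10", threshold=10.0):
    """Return True when the chosen stall percentage meets a hypothetical threshold."""
    return psi[kind][window] >= threshold


# Illustrative sample in the kernel's format; the numbers are made up.
sample = """\
some avg10=12.50 avg60=4.20 avg300=1.00 total=123456
full avg10=3.00 avg60=1.10 avg300=0.25 total=65432
"""

psi = parse_psi(sample)
print(psi["some"]["avg10"], is_under_pressure(psi))
```

The `some` line reports time in which at least one task was stalled on the resource, and `full` reports time in which all non-idle tasks were stalled, so `full` is usually the stronger signal of real trouble.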
CVE & Security
1. New Windows 'MiniPlasma' zero-day exploit gives SYSTEM access, PoC released
- Category: Security / Patch
- What happened: A proof-of-concept exploit for a Windows privilege escalation zero-day named 'MiniPlasma' has been released, allowing attackers to gain SYSTEM privileges on fully patched Windows systems.
- Do this Monday: This exploit poses a significant security risk as it can be used to escalate privileges on Windows systems, potentially affecting production environments if not mitigated.
- Source: Bleeping Computer
2. CVE-2026-46333 in Kubernetes: unset seccomp let pods reach pidfd_getfd, RuntimeDefault blocked it
- Category: Security / Patch
- What happened: CVE-2026-46333 is a vulnerability in Kubernetes related to the Linux __ptrace_may_access() bug. Testing revealed that pods with unset seccomp profiles could exploit the pidfd_getfd primitive, leading to potential file descriptor theft. The tests showed that while RuntimeDefault and PSS Restricted profiles effectively blocked this vulnerability, the PSS Baseline profile allowed it under certain conditions. Recommendations include enforcing seccomp profiles, patching node kernels, and ensuring proper privilege settings to mitigate risks.
- Do this Monday: Operators should ensure that effective seccomp profiles are set for workloads to prevent potential file descriptor theft. The vulnerability can be exploited if seccomp is unset or set to Unconfined, which could lead to security breaches in Kubernetes clusters.
- Source: Reddit r/kubernetes
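The seccomp check recommended above can be sketched against a pod spec loaded as a dictionary. It assumes the standard `securityContext.seccompProfile.type` field at pod and container level, with the container setting taking precedence. The helper names are hypothetical, and node-level defaults are not considered.

```python
def effective_seccomp(pod_spec, container):
    """Return the seccomp profile type that applies to a container, or None if unset."""
    container_profile = (container.get("securityContext") or {}).get("seccompProfile")
    pod_profile = (pod_spec.get("securityContext") or {}).get("seccompProfile")
    profile = container_profile or pod_profile  # container-level setting wins
    return profile.get("type") if profile else None


def risky_containers(pod_spec):
    """List (container_name, reason) pairs where seccomp is unset or Unconfined."""
    risky = []
    for container in pod_spec.get("containers", []):
        profile_type = effective_seccomp(pod_spec, container)
        if profile_type is None or profile_type == "Unconfined":
            risky.append((container["name"], profile_type or "unset"))
    return risky


# Hypothetical pod spec: a safe pod-level default, one container overriding it.
pod_spec = {
    "securityContext": {"seccompProfile": {"type": "RuntimeDefault"}},
    "containers": [
        {"name": "app"},
        {"name": "debug",
         "securityContext": {"seccompProfile": {"type": "Unconfined"}}},
    ],
}
print(risky_containers(pod_spec))
```

Running a check like this over every workload manifest gives a quick inventory of pods to fix, but it complements kernel patching rather than replacing it.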
3. TeamPCP Takes Cover by Releasing Source Code on GitHub, Spurs Copycats
- Category: Security / Patch
- What happened: TeamPCP, the threat actor behind recent supply chain attacks targeting npm and PyPI package repositories, temporarily published their attack toolkit source code on GitHub before removing it, according to Security Boulevard Newsletters. This leak exposes the specific methods and payloads used in their campaigns, providing valuable intelligence for defenders to identify compromised packages and develop detection signatures. However, the public availability of this toolkit also creates risk of copycat attacks using the same techniques against JavaScript and Python package ecosystems. SRE teams should immediately audit their npm and PyPI dependencies for suspicious packages, implement package integrity verification in CI/CD pipelines, and monitor package installation logs for indicators of compromise based on the now-exposed attack patterns. Organizations should also review their software supply chain security controls and consider implementing additional package scanning tools to detect malicious dependencies before they reach production systems.
- Do this Monday: Defenders can analyze the leaked toolkit to improve security measures, but the availability of this toolkit may lead to an increase in similar attacks from malicious actors.
- Source: Security Boulevard Newsletters
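Package integrity verification in a pipeline usually comes down to comparing a downloaded artifact against a digest pinned ahead of time. The sketch below shows that comparison with SHA-256; the function names and the sample bytes are assumptions for illustration. In practice the pinned digest would come from a lockfile or a similar trusted record.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the hex-encoded SHA-256 digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(data: bytes, expected_sha256: str) -> bool:
    """Check downloaded bytes against a pinned digest using a constant-time compare."""
    return hmac.compare_digest(sha256_hex(data), expected_sha256.lower())


# Hypothetical artifact; the pin is recorded when the dependency is first vetted.
artifact = b"example package contents"
pinned = sha256_hex(artifact)

print(verify_artifact(artifact, pinned))                 # unchanged artifact
print(verify_artifact(artifact + b" tampered", pinned))  # modified artifact
```

A digest check only proves the artifact matches what was pinned. It does not help if the malicious version was the one pinned, so it works alongside dependency audits rather than instead of them.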
4. Toxic Flows: When Your Agent Skill Becomes a Supply Chain Attack
- Category: Security / Patch
- What happened: Snyk's ToxicSkills research has identified significant security vulnerabilities in AI agent skills, finding that 36% of inspected skills contain security flaws and 13% have critical issues including credential theft and prompt injection payloads. The research documented three specific attack chains where agent skills become vectors for supply chain attacks through compromised third-party integrations and dependencies. SRE teams should immediately implement zero-trust principles in development environments where AI agents are deployed, audit existing agent skills for security vulnerabilities, and establish secure integration practices for third-party agent dependencies. The findings are particularly urgent given the approaching EU Cyber Resilience Act deadline, which will impose stricter security requirements on software supply chains.
- Do this Monday: The focus on zero-trust security measures is critical as supply chain attacks become more prevalent, potentially affecting production environments that rely on third-party tools and agents.
- Source: Techstrong Brief
Releases
1. Expanded OIDC support for Dependabot and code scanning
- Category: Release
- What happened: GitHub has expanded OIDC authentication support for Dependabot and code scanning to include Cloudsmith and Google Artifact Registry. This allows organization administrators to configure OIDC-based credentials for private registries, enabling dynamic retrieval of short-lived credentials from cloud identity providers. This feature is now generally available on GitHub.com and will be included in GitHub Enterprise Server 3.22.
- Do this Monday: This change enhances security for dependency management and code scanning by allowing more flexible authentication methods for private registries, which could affect how organizations manage their supply chain security.
- Source: GitHub Changelog
2. Governing infrastructure as code using pattern-based policy as code
- Category: Release
- What happened: The article discusses the challenges organizations face in enforcing security and compliance across their cloud infrastructure. It introduces a pattern-based policy as code approach using Open Policy Agent (OPA) to automate checks in CI/CD pipelines, ensuring that infrastructure changes meet predefined standards before deployment. This method aims to simplify policy management and enhance consistency across teams and environments.
- Do this Monday: Implementing pattern-based policy as code can significantly reduce security risks and compliance gaps in infrastructure deployments, making it crucial for teams to adopt this approach for better governance.
- Source: AWS Security Blog
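The article's approach uses Open Policy Agent, whose policies are written in Rego. The sketch below is a Python stand-in for the same idea, not the article's implementation: a set of rules evaluated against planned infrastructure changes before deployment. It assumes input shaped like Terraform's JSON plan output (`resource_changes` entries with `address`, `type`, and `change.after`); the rule ID and the sample plan are hypothetical.

```python
# Each rule names a resource type and a check that must hold for the planned state.
RULES = [
    {
        "id": "no-public-bucket-acl",  # hypothetical rule identifier
        "resource_type": "aws_s3_bucket_acl",
        "check": lambda after: after.get("acl") not in ("public-read", "public-read-write"),
    },
]


def violations(plan, rules=RULES):
    """Return (resource_address, rule_id) pairs for planned changes that fail a rule."""
    found = []
    for rc in plan.get("resource_changes", []):
        after = (rc.get("change") or {}).get("after") or {}  # None when deleting
        for rule in rules:
            if rc.get("type") == rule["resource_type"] and not rule["check"](after):
                found.append((rc["address"], rule["id"]))
    return found


# Hypothetical plan fragment with one compliant and one non-compliant resource.
plan = {
    "resource_changes": [
        {"address": "aws_s3_bucket_acl.logs", "type": "aws_s3_bucket_acl",
         "change": {"after": {"acl": "private"}}},
        {"address": "aws_s3_bucket_acl.site", "type": "aws_s3_bucket_acl",
         "change": {"after": {"acl": "public-read"}}},
    ]
}
print(violations(plan))
```

In a CI/CD pipeline, a non-empty result would fail the job before the apply step, which is the gating behavior the article describes.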
3. Azure hub-and-spoke generally available for HCP Vault Dedicated
- Category: Release
- What happened: Azure hub-and-spoke networking for HCP Vault Dedicated is now generally available, allowing enterprises to integrate Vault into centralized Azure network architectures without custom routing or exceptions. This release enhances support for organizations needing a clear separation of product and infrastructure management, facilitating secure private connectivity and simplifying network management. It enables centralized routing and firewall policy enforcement, reducing operational complexity and improving security.
- Do this Monday: This change could streamline network operations for teams managing HCP Vault, leading to fewer network tickets and quicker incident resolution. It also aligns with cloud security maturity efforts by providing a standardized network architecture.
- Source: HashiCorp Blog
4. Nine Entertainment’s journey: Achieving 98% cost savings with Amazon ElastiCache Serverless for Valkey
- Category: Release
- What happened: Nine Entertainment achieved a 98% cost reduction by migrating to Amazon ElastiCache Serverless, which improved scalability and reduced manual intervention during peak events. The company faced high caching costs that exceeded those of their compute infrastructure. By switching to ElastiCache Serverless, they aligned resource usage with demand, resulting in significant cost savings and improved performance for their streaming platforms.
- Do this Monday: This migration highlights the importance of cost management in cloud infrastructure, particularly for services with fluctuating traffic patterns. Operators should consider serverless options to optimize costs and performance.
- Source: AWS Database Blog
5. Automated JDBC query caching with the AWS Advanced JDBC Wrapper
- Category: Release
- What happened: The AWS Advanced JDBC Wrapper has introduced a Remote Query Cache Plugin that automates JDBC query caching, reducing the need for custom cache implementations. This plugin intercepts JDBC queries, caches results in Amazon ElastiCache, and serves identical queries from the cache with minimal application changes required. It simplifies cache management by handling serialization, cache misses, and expiration automatically, while also providing monitoring through Amazon CloudWatch.
- Do this Monday: This plugin can significantly improve database performance and reduce costs by automating caching, which may lead to less engineering overhead for teams managing database queries.
- Source: AWS Database Blog
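The behavior described here, serving identical queries from a cache and handling misses and expiration automatically, is the cache-aside pattern. The sketch below illustrates that pattern with an in-memory dictionary standing in for Amazon ElastiCache. It is not the plugin's API: the `QueryCache` class, its parameters, and the fake database function are all assumptions.

```python
import time


class QueryCache:
    """Cache query results keyed by SQL text and parameters, with a time-to-live."""

    def __init__(self, run_query, ttl_seconds=60.0, clock=time.monotonic):
        self._run_query = run_query  # callable that actually hits the database
        self._ttl = ttl_seconds
        self._clock = clock
        self._store = {}             # key -> (expires_at, result)

    def execute(self, sql, params=()):
        key = (sql, tuple(params))
        now = self._clock()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                           # hit: skip the database
        result = self._run_query(sql, tuple(params))  # miss or expired entry
        self._store[key] = (now + self._ttl, result)
        return result


# Usage with a fake database function that counts how often it is called.
calls = []


def fake_db(sql, params):
    calls.append((sql, params))
    return [("row", len(calls))]


cache = QueryCache(fake_db, ttl_seconds=30.0)
cache.execute("SELECT * FROM users WHERE id = %s", (1,))
cache.execute("SELECT * FROM users WHERE id = %s", (1,))
print(len(calls))  # the second call was served from the cache
```

A time-based expiry like this serves stale rows until the TTL passes, so it suits read-heavy queries where slightly old data is acceptable.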
6. Barman 3.18.0 Released
- Category: Release
- What happened: Barman 3.18.0 has been released, introducing significant features for PostgreSQL backup and recovery. Key updates include experimental incremental backups for cloud storage, a new local-to-cloud backup method, and a command for direct WAL archiving to cloud storage. The release also enhances compression options and improves compatibility with Python 3.14, along with various bug fixes.
- Do this Monday: The introduction of incremental backups for cloud storage can reduce costs and improve backup efficiency, which may affect how backup strategies are implemented in production environments. The new features streamline cloud backup operations, potentially simplifying disaster recovery processes.
- Source: PostgreSQL News
7. Meet Gordon: Docker’s AI Agent For Your Entire Container Workflow
- Category: Release
- What happened: Docker has launched Gordon, an AI agent designed to assist developers with their entire container workflow. Gordon integrates into Docker Desktop and the CLI, providing context-aware support by reading logs, images, and compose files. It proposes fixes for issues and requires user approval for actions, streamlining the debugging and deployment process.
- Do this Monday: Gordon could significantly reduce the time spent troubleshooting container issues, potentially improving deployment efficiency and developer productivity.
- Source: Docker Blog
Also this week
Deep dives & postmortems
11. Issues with Cloudflare Log Explorer
- Category: Deep Dive
- What happened: Cloudflare experienced an issue with its Log Explorer, causing delays in log visibility for affected customers. The problem was identified and a fix was implemented, resolving the incident.
- Takeaway: Operators relying on Cloudflare Log Explorer for log access may have faced delays, impacting their ability to monitor and troubleshoot effectively during the incident.
- Source: Cloudflare Status
Community reads
12. Datadog's best practices for LLM observability
- Category: Community
- What happened: Datadog has published a comprehensive guide detailing best practices for implementing observability in large language model (LLM) workflows, according to TLDR AI reporting. The guide covers end-to-end monitoring strategies, methods for detecting security risks within LLM systems, and specific mitigation approaches to address identified vulnerabilities. SRE teams operating LLM-based services should review this guide to establish proper monitoring baselines, implement security detection mechanisms, and develop response procedures for LLM-specific reliability issues. The guidance emphasizes building reliable observability frameworks to reduce system downtime and security exposure in AI-powered applications.
- Worth reading: Understanding LLM observability can help improve monitoring and security practices in production environments.
- Source: Datadog via TLDR AI
Lightning links
- OpenAI announces new Guaranteed Capacity offering for customers to secure compute (TLDR AI) -- OpenAI's new offering ensures long-term access to compute resources for AI workflows.
- An important update: Transitioning Gemini CLI to Antigravity CLI (TLDR Dev) -- Google's transition to Antigravity CLI signals important changes for developers using Gemini CLI.
- The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale (SRE Weekly) -- Netflix's approach to building an operations layer highlights the importance of team dynamics in infrastructure.
- plpgsql_wrap v1.0 released (PostgreSQL News) -- This PostgreSQL extension obfuscates PL/pgSQL source code, enhancing security for developers.
- The future of agentic development: Redefining the data practitioner lifecycle with Data Agent Kit (Google Cloud Blog) -- Google's Data Agent Kit integrates data engineering and science into development environments.
- Easily apply Copilot code review feedback with Copilot cloud agent (GitHub Changelog) -- GitHub's Copilot update improves user experience for applying code review suggestions.
- One operational pattern I keep seeing across unrelated incidents (Reddit r/sre) -- A recurring pattern in SRE incidents highlights the challenge of managing system complexity.
- What actually gets lost when on-call rotates isn’t in the runbooks (Reddit r/devops) -- Informal knowledge loss during on-call handoffs emphasizes the need for better communication.
- Show HN: Machine – per-project dev VMs with session-only secrets (Hacker News Show HN) -- The 'machine' CLI tool enhances security by creating isolated development environments for projects.
Human Stories
Looking at this collection of incidents, I keep coming back to how much trust we place in the tools and platforms that form the backbone of our daily work. The GitHub breach through a malicious VS Code extension and CISA's exposed repository both remind us that our development environments - the places where we feel most in control - can become vectors for catastrophic exposure. Even Microsoft's alleged BitLocker backdoor situation speaks to this deeper question of whether the systems we rely on for security actually serve our interests or someone else's. What strikes me most about the Java OOM kills in Kubernetes is how it represents the other side of the same coin - sometimes our platforms fail us not through malice but through the sheer complexity of interactions we can't fully predict or observe. As we head into next week, it's worth remembering that our role isn't just managing systems, but constantly questioning the assumptions built into the tools we've chosen to trust.
Also worth reading
One operational pattern I keep seeing across unrelated incidents (Reddit r/sre)
The author observes a recurring operational pattern in SRE and Kubernetes incidents where systems accumulate complexity faster than teams can maintain a usable understanding of them. This leads to issues such as poor incident reasoning preservation, drift between declared state and runtime reality,
The Human Infrastructure: How Netflix Built the Operations Layer Behind Live at Scale (SRE Weekly)
This article discusses how Netflix developed its operations layer to support live events at scale, focusing on the human aspects of infrastructure and team dynamics. It highlights the importance of collaboration, communication, and the cultural practices that enable effective operations during high-
What actually gets lost when on-call rotates isn’t in the runbooks (Reddit r/devops)
The article discusses the informal knowledge that gets lost during on-call handoffs, which is not captured in runbooks. It highlights the importance of verbal communication during these transitions, such as specific troubleshooting insights that are often shared in conversation but not documented.