On Call Brief – Week of June 28 – July 4, 2026
This week's top stories
1. Hijacked npm and Go Packages Use VS Code Tasks to Deploy Python Infostealer
- Category: Deep Dive
- What happened: Researchers found hijacked npm and Go packages that deploy a Python-based information stealer on Windows, Linux, and macOS. The attack circumvents common npm execution paths to evade security measures introduced in npm v12.
- Takeaway: Compromised packages in the software supply chain put at risk any application that depends on them. Organizations should review their npm and Go dependencies and put controls in place to catch tampered packages.
- Source: Thehackernews via The Hacker News (security)
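As a rough aid for the dependency review suggested in this story, the sketch below walks a directory tree and lists VS Code tasks configured to start automatically. It assumes the malicious task uses the standard `runOptions.runOn: "folderOpen"` setting, which the summary above does not confirm. The function name `find_auto_run_tasks` and the choice of root directory are illustrative, not taken from the research.

```python
import json
import os


def find_auto_run_tasks(root):
    """Return (path, label) pairs for VS Code tasks set to run on folder open."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        # Only look at tasks.json files that live inside a .vscode directory.
        if os.path.basename(dirpath) != ".vscode" or "tasks.json" not in filenames:
            continue
        path = os.path.join(dirpath, "tasks.json")
        with open(path, encoding="utf-8", errors="replace") as fh:
            text = fh.read()
        try:
            tasks = json.loads(text).get("tasks", [])
        except (json.JSONDecodeError, AttributeError):
            # tasks.json may contain comments, which the json module rejects;
            # fall back to a plain text match and flag the file for manual review.
            if "folderOpen" in text:
                hits.append((path, "<unparsed>"))
            continue
        for task in tasks:
            if not isinstance(task, dict):
                continue
            run_options = task.get("runOptions")
            if isinstance(run_options, dict) and run_options.get("runOn") == "folderOpen":
                hits.append((path, task.get("label", "<unlabeled>")))
    return hits
```

Pointing `find_auto_run_tasks` at a project's `node_modules` directory or a local Go module cache would list candidate files. A hit is only a prompt for manual inspection, since legitimate projects also use auto-run tasks.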
2. Apple Fixes WebKit Flaws in iOS and macOS, With Help From AI Tools
- Category: Community
- What happened: Apple has released security updates for iOS, iPadOS, macOS, and Safari to address four WebKit vulnerabilities. The company utilized AI tools to identify these flaws and expedited the fixes due to the rapid development of exploits from known vulnerabilities.
- Worth reading: These updates are critical for maintaining the security of devices running iOS, iPadOS, macOS, and Safari, as the vulnerabilities could be exploited if not patched.
- Source: Securityaffairs via TLDR Dev
3. OnyxC2 Stealer Targets Wide Range of Apps, Evades Detection by Hiding in Oversized NVIDIA Library
- Category: Community
- What happened: OnyxC2, a new malware-as-a-service tool, has been observed targeting credentials from over 210 applications and extensions while evading detection by concealing malicious code within an oversized NVIDIA library file, according to Security Boulevard. This stealth technique exploits the large file size to hide malicious payloads that security tools may skip during scanning due to performance considerations. SRE teams should review endpoint detection rules to ensure scanning of large library files is not disabled, verify that security tooling can inspect files regardless of size, and monitor for unexpected NVIDIA library modifications or unusually large NVIDIA DLL files on systems. Organizations should also audit credential storage practices and consider implementing application allowlisting to prevent execution of tampered libraries in production environments.
- Worth reading: The widespread targeting of applications by OnyxC2 could lead to credential theft and unauthorized access, impacting security protocols and requiring immediate attention to application security measures.
- Sources: Security Boulevard Newsletters
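The advice in this story to watch for unusually large NVIDIA library files could be approximated with a sketch like the one below. The size threshold, the filename hints ("nv", "cuda"), and the function name `find_oversized_libraries` are all assumptions chosen for illustration; the report summarized above does not give specific file names or sizes.

```python
import os

# Hypothetical cutoff; tune it against the library sizes normal for your fleet.
DEFAULT_THRESHOLD_BYTES = 200 * 1024 * 1024


def find_oversized_libraries(root, name_hints=("nv", "cuda"),
                             threshold=DEFAULT_THRESHOLD_BYTES):
    """Return (path, size) pairs for library files larger than threshold bytes."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            lower = name.lower()
            # Restrict the scan to shared-library extensions.
            if not lower.endswith((".dll", ".so")):
                continue
            # Keep only files whose names suggest an NVIDIA component.
            if not any(hint in lower for hint in name_hints):
                continue
            path = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(path)
            except OSError:
                continue  # File vanished or is unreadable; skip it.
            if size > threshold:
                hits.append((path, size))
    # Largest files first, so the most suspicious entries surface at the top.
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

A size check like this only narrows the search. Comparing flagged files against vendor-published hashes or signatures would be the stronger follow-up.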
4. Amazon CloudWatch launches OTel Container Insights for Amazon EKS - Embracing OpenTelemetry while routing every
- Category: Community
- What happened: Amazon CloudWatch has introduced OTel Container Insights for Amazon EKS, integrating OpenTelemetry for enhanced metric collection. However, the billing model based on metrics per region may lead to high costs at scale, raising concerns about the financial implications of using this feature.
- Worth reading: The new billing model for metrics could significantly increase costs for teams using Amazon EKS at scale, necessitating careful monitoring of usage to avoid unexpected charges.
- Source: AWS via Last Week in AWS
5. Run isolated sandboxes with full lifecycle control: AWS Lambda introduces MicroVMs
- Category: Community
- What happened: AWS Lambda has introduced MicroVMs based on Firecracker technology to provide isolated sandboxes with full lifecycle control for function execution, according to Last Week in AWS. This architectural change enhances security and resource management by better isolating untrusted code execution within Lambda functions, though it raises questions about how this impacts the traditional definition of serverless computing. Operators using Lambda do not need to take immediate action as this is an infrastructure improvement handled transparently by AWS, but teams should be aware that their functions now run in Firecracker-based microVMs rather than traditional containers.
- Worth reading: The introduction of MicroVMs may affect how Lambda functions are managed and executed, potentially improving security and performance. Operators should evaluate how this change could optimize their serverless architecture.
- Sources: AWS via Last Week in AWS
6. AI Will Test Identity Infrastructure, Organizations Need More Prep
- Category: Community
- What happened: According to Semperis research reported by Security Boulevard, AI agents are increasingly being granted access to critical enterprise systems without clearly defined security boundaries, creating new attack surfaces in identity infrastructure. The research indicates that AI is actively transforming global identity attack services, making automated credential compromise and privilege escalation more accessible to threat actors. Organizations should review and strengthen their identity and access management (IAM) policies to explicitly define which systems AI agents can access, implement additional monitoring for AI agent authentication patterns, and ensure their identity infrastructure can detect and respond to AI-driven attacks. SRE teams should work with security teams to audit current AI agent permissions and establish guardrails before AI-based identity attacks become more prevalent.
- Worth reading: Organizations may face increased risks to their identity infrastructure as AI agents operate without clear security boundaries - this could lead to new attack vectors that need to be addressed proactively.
- Sources: Security Boulevard Newsletters
7. Lessons learned from scaling to 1 million Lambda functions
- Category: Deep Dive
- What happened: AWS published two architectural guides for Lambda practitioners covering production challenges at scale. The AWS Architecture Blog detailed lessons from scaling a serverless SaaS platform beyond one million Lambda functions, emphasizing the critical need for true scale-to-zero architecture, proactive quota management strategies, and early engagement with AWS service teams when approaching large-scale deployments. Separately, the AWS Compute Blog explored using Lambda durable functions to build fault-tolerant multi-agent AI workflows, specifically demonstrating resilience patterns for healthcare prior authorization processes where traditional workflows face reliability challenges. Operators running high-volume Lambda deployments should review quota limits proactively and consider durable functions for complex stateful workflows requiring fault tolerance, particularly those involving multiple service orchestration or AI agent coordination.
- Takeaway: Scaling to over one million Lambda functions introduces complexities in resource management and deployment strategies. Understanding the importance of scale-to-zero can help optimize costs and resource usage in production environments. The insights on using AWS CloudFormation StackSets for managing infrastructure across multiple accounts can improve deployment efficiency and reduce operational overhead.
- Sources: AWS Architecture Blog, AWS Compute Blog
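The proactive quota management recommended in this story can be pictured as a simple headroom check. The sketch below is a minimal illustration, not the approach described in the AWS posts: the quota names, the 80% warning ratio, and the function `quotas_near_limit` are hypothetical, and the usage and limit figures would have to come from your own monitoring and quota data.

```python
def quotas_near_limit(usage, limits, warn_ratio=0.8):
    """Return quotas whose usage is at or above warn_ratio of the limit.

    usage and limits map a quota name to a number. Each result is a tuple of
    (name, used, limit, ratio), sorted with the tightest quota first.
    """
    flagged = []
    for name, limit in limits.items():
        used = usage.get(name, 0)
        if limit <= 0:
            continue  # Skip unset or unlimited quotas.
        ratio = used / limit
        if ratio >= warn_ratio:
            flagged.append((name, used, limit, round(ratio, 2)))
    return sorted(flagged, key=lambda row: row[3], reverse=True)


# Example with made-up numbers: only the first quota crosses the 80% mark.
example = quotas_near_limit(
    usage={"concurrent_executions": 900, "function_count": 100},
    limits={"concurrent_executions": 1000, "function_count": 1000},
)
```

Running a check like this on a schedule leaves time to request an increase before a deployment hits the ceiling, which is the point of the guidance on early quota planning.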
8. Automate public TLS certificate issuance with ACME support in AWS Certificate Manager
- Category: Community
- What happened: AWS Certificate Manager now supports the ACME protocol, which lets teams automate public TLS certificate issuance. The feature simplifies obtaining and managing TLS certificates, making it easier for users to secure their applications.
- Worth reading: This change could significantly streamline the process of managing TLS certificates in AWS, reducing manual overhead and potential errors in certificate management - it is particularly relevant for teams using AWS for their services.
- Source: AWS via TLDR DevOps
9. Dragonfly v2.5.0 is released
- Category: Community
- What happened: Dragonfly v2.5.0 introduces new features and improvements, including enhanced performance and stability for image distribution. The release focuses on optimizing the image pulling process and reducing latency, which can benefit users managing large container images.
- Worth reading: The performance improvements in image distribution may lead to faster deployments and reduced downtime during updates, which is crucial for production environments relying on containerized applications - especially those using large images.
- Source: CNCF via TLDR DevOps
10. Claude Code Incidents — June 2026: what silently broke, and the one-line fixes
- Category: Deep Dive
- What happened: The article reviews several incidents involving Claude Code that occurred in June 2026, highlighting issues that led to data loss, unexpected costs, and silent failures. Each incident includes a brief description of what went wrong and a suggested one-line fix to mitigate the issue. Examples include folder name collisions causing data loss, unintended file replacements, and commands that inadvertently affected multiple processes. The author emphasizes the importance of careful command usage and monitoring to avoid these pitfalls.
- Takeaway: Operators using Claude Code should be aware of these incidents as they highlight potential risks in data management and command execution that could lead to significant operational issues. Implementing the suggested fixes could help prevent similar problems in production environments.
- Source: dev.to (DevOps tag)
CVE & Security
1. Attackers Exploit SimpleHelp Flaw to Steal Info from AI Coding Assistants, Clouds
- Category: Security / Patch
- What happened: Attackers are exploiting a critical authentication bypass vulnerability in SimpleHelp software, tracked as CVE-2026-48558, to deploy malware that steals sensitive data from AI coding assistants and cloud services. The malware, Djinn Stealer, can access a wide range of developer information, including tokens from AI tools, GitHub CLI data, and Docker authentication, potentially compromising entire development pipelines.
- Do this Monday: If you run SimpleHelp, check whether your instances are exposed to CVE-2026-48558 and apply the vendor's fix where one is available. Because Djinn Stealer targets AI tool tokens, GitHub CLI data, and Docker authentication, a single compromise can spread into the wider development pipeline, so rotate any developer credentials you suspect were exposed.
- Source: DevOps.com
2. CVE-2026-12957 and CVE-2026-12958 - Issues in Language Servers for AWS and Amazon Q Developer Plugins - Your AI
- Category: Security / Patch
- What happened: Amazon Web Services disclosed two critical vulnerabilities, CVE-2026-12957 and CVE-2026-12958, in the Language Servers for AWS and Amazon Q Developer Plugins that permit arbitrary command execution from any workspace with no available workarounds. Operators must immediately upgrade these plugins to patched versions as this is the only remediation path available. Separately, AWS published its June 2026 Threat Technique Catalog update adding five new threat entries that focus on container security risks including EKS workload modification, organization-level trust exploitation, and compute hijacking scenarios. SRE teams should review the new catalog entries to assess whether their AWS environments have adequate controls against these documented attack techniques, particularly if running EKS clusters or using AWS Organizations trust relationships.
- Do this Monday: Upgrade the Language Servers for AWS and Amazon Q Developer plugins to the patched versions. Both CVEs allow arbitrary command execution and no workaround is available, so upgrading is the only remediation.
- Sources: AWS via Last Week in AWS, AWS Security Blog
3. CVE-2026-13762 and CVE-2026-13763 - Issue with HTTP/2 multi-frame request body inspection in AWS WAF
- Category: Security / Patch
- What happened: AWS identified two vulnerabilities, CVE-2026-13762 and CVE-2026-13763, affecting HTTP/2 multi-frame request body inspection in AWS WAF. CVE-2026-13762, which impacts WAF with CloudFront, has been remediated server-side with no action needed from customers. CVE-2026-13763 affects WAF with AWS Application Load Balancer, where a crafted HTTP/2 request could lead to incomplete request body inspection. This issue has been addressed, and customers should configure WAF settings to ensure full protection.
- Do this Monday: Operators using AWS WAF with Application Load Balancer should review their configuration to ensure proper inspection of HTTP/2 request bodies to mitigate the risk associated with CVE-2026-13763. No action is needed for those using WAF with CloudFront.
- Source: AWS Security Bulletins
4. Data breach exposes up to 14.2 million email logins at six ISPs
- Category: Security / Patch
- What happened: KDDI Corporation reported a data breach affecting its email system, which is shared with five other ISPs, potentially exposing up to 14.2 million email logins. The breach highlights vulnerabilities in shared systems among ISPs.
- Do this Monday: This incident raises concerns about the security of shared email systems and the potential for widespread credential theft, which could affect user accounts across multiple ISPs.
- Source: Bleeping Computer
5. Survey Surfaces Rise in IT Incidents Attributable to AI Coding Tools
- Category: Security / Patch
- What happened: A survey of IT decision makers reveals that 93% of organizations have faced infrastructure incidents due to AI coding tools. The findings indicate that AI has increased demands on infrastructure teams, with 40% reporting faster security vulnerabilities and governance challenges. Despite 86% expressing confidence in their AI governance capabilities, only 30% have formal policies in place. Alarmingly, one-third of infrastructure teams apply AI-generated code directly to production without review, raising concerns about security vulnerabilities and misconfigurations.
- Do this Monday: The reliance on AI-generated code without adequate review processes may lead to increased security vulnerabilities and operational incidents. Organizations should consider implementing stronger governance policies and review mechanisms for AI-generated infrastructure code to mitigate risks.
- Source: DevOps.com
Releases
1. Bamboo to Bitbucket Pipelines migration tool is now GA
- Category: Release
- What happened: The Bamboo to Bitbucket Pipelines migration tool has reached general availability, enhancing its capabilities for users transitioning from Bamboo to Bitbucket. Key features include automatic conversion of Bamboo Deployment Projects into custom pipelines, support for custom transformers via Python scripts, instance-level migration for all build plans and deployment projects, and improved audit reporting for unsupported tasks. The migration aims to reduce operational overhead and costs while providing a fully managed CI/CD platform.
- Do this Monday: The availability of this migration tool may affect teams currently using Bamboo by simplifying their transition to a cloud-native CI/CD solution, potentially leading to reduced operational costs and improved productivity.
- Source: Atlassian Engineering
2. Announcing Valkey 9.1 for Amazon ElastiCache
- Category: Release
- What happened: Amazon ElastiCache now supports Valkey 9.1, which introduces enhancements aimed at improving throughput, memory efficiency, and operational workflows for in-memory workloads. Key features include a redesigned I/O threading model that boosts performance, new commands for easier application management, and observability improvements for better engine visibility. The update is particularly beneficial for large-scale deployments, as it can lead to significant cost savings and improved performance metrics.
- Do this Monday: The introduction of Valkey 9.1 could lead to better resource utilization and cost savings for teams using ElastiCache, especially for those managing high-throughput and latency-sensitive applications. The performance improvements may allow for reduced infrastructure costs and delayed scaling events, which could impact budgeting and resource planning.
- Source: AWS Database Blog
3. Okta is the first to bring AI agent governance inside FedRAMP boundaries
- Category: Release
- What happened: Okta has launched its AI agent governance platform for FedRAMP and HIPAA-regulated environments, positioning itself as the first independent identity platform to manage AI agents alongside human and machine identities. This initiative responds to federal mandates for AI adoption and security, emphasizing the need for visibility and control over AI agents to mitigate risks such as compliance violations and security breaches. The platform introduces governance measures that ensure agents are registered, owned, and operate under strict access controls, replacing static credentials with dynamic, scoped tokens.
- Do this Monday: The introduction of Okta's AI agent governance could significantly affect organizations operating under federal regulations by providing a structured approach to managing AI agents. This could help mitigate risks associated with ungoverned agents, such as compliance violations and security breaches, making it crucial for agencies to adopt these governance practices to enhance their security posture.
- Source: The New Stack
4. Highlights from Git 2.55
- Category: Release
- What happened: Git 2.55 has been released, introducing features and bug fixes contributed by over 100 contributors. A key feature is the ability to write incremental multi-pack indexes (MIDX) directly with the `git repack` command, which improves efficiency for large repositories by allowing Git to manage multiple packfiles without needing to rewrite the entire index. This change is part of GitHub's repository maintenance strategy.
- Do this Monday: The introduction of incremental MIDXs can enhance performance for large repositories, potentially reducing the overhead during repository maintenance tasks. This may affect how teams manage their Git repositories, especially those with significant size and complexity.
- Source: GitHub Blog
5. Triton Inference Server: 2.70.0, 2.69.0
- Category: Release
- What happened: NVIDIA Triton Inference Server has released versions 2.69.0 (NGC container 26.05) and 2.70.0 (NGC container 26.06) with notable changes for operators. Version 2.69.0 adds request cancellation support in the gRPC C++ client, improved memory management, and Azure Managed Identity authentication for Azure Storage mounts. Version 2.70.0 introduces breaking changes including the complete removal of deprecated Windows support and a new requirement for BF16 input/output tensors in the Python client, which will require code updates for any workflows using BF16 data types. Operators running Triton on Windows must migrate to Linux-based deployments before upgrading to 2.70.0, and those using the Python client with BF16 tensors should review their integration code for compatibility with the new tensor handling requirements.
- Do this Monday: The removal of Windows support may affect users relying on that platform. The change in BF16 handling requires updates to existing scripts, which could lead to runtime errors if not addressed. The security enhancements are crucial for maintaining a secure inference environment. Users should be aware of the known issues that may impact deployment stability.
- Sources: Triton Inference Server releases
Also this week
Deep dives & postmortems
11. DevOps'ish 315: Sub-Nanometer Chips, Damn You Apple, and the Database Nobody Could Kill
- Category: Deep Dive
- What happened: Netflix has successfully migrated its batch compute infrastructure to Kueue, a Kubernetes-native job queueing system that provides advanced scheduling capabilities including job preemption and fair sharing while maintaining service continuity for end users. Separately, Klue experienced a supply chain security incident where attackers obtained unauthorized access to LastPass customer data through stolen OAuth tokens, demonstrating the risks inherent in third-party vendor integrations. LastPass has stated that their password vaults and core infrastructure were not compromised in this incident. SRE teams should review their own third-party OAuth token management practices and consider implementing additional monitoring for unauthorized token usage, while those running Kubernetes batch workloads may want to evaluate Kueue as a replacement for custom scheduling solutions.
- Takeaway: The shift to Kueue could influence how teams manage batch processing in Kubernetes, potentially leading to improved resource utilization and scheduling efficiency. Operators may want to consider similar tools for their own batch processing needs.
- Sources: DevOps'ish
Lightning links
- How We Brought Enterprise-Managed Authorization to Rovo MCP with XAA and ID-JAG (Atlassian Engineering) -- Atlassian's new centralized access management enhances security for approved clients.
- AWS Weekly Roundup: Agentic CX designer for Amazon Connect Customer, EC2 AMI Watermarks, Open Governance for... (AWS What's New) -- Amazon Connect's no-code tool allows rapid design of AI-powered customer experiences.
- Eliya 25 Brings a JVM-Level Diagnostic Profile to OpenJDK 25 LTS (InfoQ DevOps) -- Eliya 25 enhances Java diagnostics, making it easier to monitor production environments.
- AWS Previews FinOps Agent for Cost Analysis and Optimization (InfoQ DevOps) -- The new AWS FinOps Agent automates cost anomaly detection and integrates with AWS activities.
- Configuration Drift in a Multi-Cloud World (DevOps.com) -- Learn how to detect and address configuration drift in multi-cloud environments effectively.
- Preventing data exfiltration in machine learning environments with Amazon SageMaker AI (AWS Architecture Blog) -- Explore strategies to protect sensitive data in machine learning while enabling data science.
- Kubernetes Efficiency Starts With Better Decisions (dev.to (DevOps tag)) -- Improving Kubernetes efficiency requires better decision-making on resource allocation.
- Claude Opus 4.8 (fast mode) is now in preview for GitHub Copilot (GitHub Changelog) -- The new fast mode for Copilot boosts output speeds while maintaining intelligence.
- 13.0.3 (Grafana releases) -- Grafana 13.0.3 introduces Docker image updates and improved dashboard provisioning.
- OpenAI Previews GPT-5.6 Sol With Restricted Access and Stronger Cyber Safeguards (Security Boulevard (FeedBurner mirror)) -- The limited preview of GPT-5.6 models includes enhanced cyber safeguards for security.
Human Stories
When I look at the hijacked npm packages hiding Python infostealers in VS Code tasks alongside OnyxC2 concealing itself in bloated NVIDIA libraries, I'm reminded that our adversaries understand software supply chains and operational blind spots better than most security teams do. The timing feels pointed too - just as AWS rolls out MicroVMs for stronger isolation and Amazon pushes OpenTelemetry adoption, we're seeing attacks that specifically target the gaps in our observability and the trust we place in dependency ecosystems. The Semperis research on AI agents accessing critical systems without clear security boundaries isn't a future problem anymore; it's happening right now while we're still figuring out how to scale to a million Lambda functions without breaking our mental models of how distributed systems behave. Apple leaning on AI tools to find WebKit vulnerabilities faster shows the technology can help us, but the broader pattern here is that our infrastructure is growing more complex and interconnected faster than our ability to reason about its security posture, and that gap is exactly where the next generation of attacks will live.
Also worth reading
Agentic AI Is Eating Your Engineering Org — And 94% of (dev.to (DevOps tag))
A significant percentage of organizations using AI agents express concerns about increased complexity, technical debt, and security risks. The rapid adoption of agentic AI has led to issues such as agent sprawl, verification bottlenecks, and the need for better governance. Teams are encouraged to tr
I watched someone burn 50 hours on OpenClaw and the fix was embarrassingly simple (dev.to (DevOps tag))
The article discusses a case where someone spent 50 hours trying to automate a complex workflow using OpenClaw, only to find that the solution was to simplify the approach. Instead of relying on a single large agent for the entire process, the author suggests breaking the workflow into stages and us
is hard to dismiss, even if "trust the agents" is exactly what you'd expect a cloud vendor to want you believing. (Last Week in AWS)
The author discusses enhancements made to a single-user localhost job application tracking list, including the addition of scrypt, TOTP, and a hash-chained audit ledger. Despite these security improvements, the author notes that anyone with filesystem access can still copy the entire database, highl