On Call Brief – Week of 2026-03-15

Briefing: 2026-03-15

This week's top stories

1. Major Breach — McKinsey's Lilli AI system compromised in under two hours, exposing millions of confidential client communications

  • Category: Community
  • What happened: McKinsey's Lilli AI system was compromised in under two hours, leading to the exposure of millions of confidential client communications and strategic discussions.
  • Worth reading: This breach raises concerns about the security of AI systems and the potential for similar vulnerabilities in production environments - operators should review their AI security protocols.
  • Source: Harper Carroll AI
2. Perplexity's Personal Computer system and Mistral's zero-exposure training address growing enterprise security concerns.

    • Category: Community
    • What happened: Multiple AI vendors have launched new enterprise security-focused products to address data exposure vulnerabilities highlighted by recent breaches, including McKinsey's Lilli system compromise, which exposed 46.5 million confidential messages and 728,000 client files via CodeWall's AI agent, according to Harper Carroll AI. Mistral has released Forge, an enterprise training platform that enables complete pre-training pipelines on customer-owned servers with zero data exposure to Mistral, specifically targeting the pharmaceutical and biotech sectors, while also launching the open-source Mistral Small 4 model with integrated reasoning, coding, and vision capabilities for enterprise use. Perplexity has deployed a Personal Computer system running on Mac mini hardware that provides local AI agent capabilities with secure file and application access plus activity tracking to minimize enterprise data exposure risks. Enterprise operators should evaluate these on-premises AI training and inference solutions as alternatives to cloud-based AI services, particularly for sensitive data workloads where data sovereignty and zero-exposure requirements are critical security controls.
    • Worth reading: These launches respond directly to the vulnerabilities exposed by recent breaches and could shape how organizations handling sensitive data choose between cloud-based and on-premises AI services.
    • Sources: Harper Carroll AI (+6 more)
3. Google officially closed its acquisition of Wiz, locking in one of the biggest cloud security deals ever.

    • Category: Community
    • What happened: Google has completed its acquisition of Wiz, marking a significant move in cloud security as AI workloads grow. This acquisition highlights the increasing importance of security in multi-tenant and multi-cloud environments.
    • Worth reading: The acquisition could lead to enhanced security features in Google Cloud, which may affect how operators manage security in multi-cloud setups.
    • Source: Cloud Google via Off-by-none
4. KIP-1150: Diskless Topics approved for Apache Kafka

    • Category: Community
    • What happened: Apache Kafka's KIP-1150 proposal for Diskless Topics has been approved, introducing the ability to store messages directly in object storage systems like Amazon S3 or Google Cloud Storage rather than requiring local broker disk storage. This architectural change is designed to enhance Kafka's cloud-native capabilities by making broker disks optional for topic storage. SRE teams running Kafka clusters should begin evaluating how this feature could impact their current storage architecture and capacity planning once it becomes available in future Kafka releases. Organizations heavily invested in cloud infrastructure should consider this development for potential cost optimization and operational simplification in their Kafka deployments.
    • Worth reading: This change could significantly reduce storage costs and complexity for Kafka deployments, particularly in cloud environments; operators should evaluate the implications for their storage architecture. A topic-creation sketch follows below.
    • Source: Aiven via TLDR Dev
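    • Sketch: what opting a topic into diskless storage might look like from the confluent-kafka admin API once the feature ships. The `diskless.enable` config key is an assumption drawn from the KIP discussion, not a released setting; check the Kafka release notes for the final name.

```python
# Hypothetical sketch: create a topic that opts into KIP-1150 diskless
# storage. The "diskless.enable" key is an assumption, not a shipped config.
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "broker-1:9092"})

topic = NewTopic(
    "audit-events",
    num_partitions=12,
    replication_factor=3,  # local replication may matter less once data lives in S3/GCS
    config={
        "diskless.enable": "true",                    # hypothetical KIP-1150 switch
        "retention.ms": str(30 * 24 * 3600 * 1000),   # 30 days, now in object storage
    },
)

# create_topics is asynchronous; wait on the returned futures to surface errors.
for name, future in admin.create_topics([topic]).items():
    try:
        future.result()
        print(f"created {name}")
    except Exception as exc:
        print(f"failed to create {name}: {exc}")
```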
5. Nav Bhasin from AWS Generative AI Innovation Center outlines four criteria for identifying agent-appropriate work

    • Category: Community
    • What happened: AWS has released operational guidance from their Generative AI Innovation Center identifying four criteria for selecting suitable AI agent workloads: clear boundaries, judgment across tools, measurable success metrics, and safe failure modes. Amazon Bedrock has simultaneously introduced enhanced visibility features for first token latency and quota consumption metrics, providing critical operational monitoring capabilities for AI workloads. Cloudflare has launched RFC 9457-compliant error responses that can reduce agent token costs by up to 98% and made their AI Security for Apps generally available for protecting AI-powered applications at the edge. SRE teams should evaluate current AI workloads against AWS's four criteria framework while implementing the new Bedrock monitoring capabilities for latency and quota tracking. Organizations using AI agents should consider adopting Cloudflare's RFC 9457-compliant error handling to significantly reduce operational costs.
    • Worth reading: This change improves observability for AI workloads, making it easier to monitor performance and resource usage, and the RFC 9457 error format gives agents compact, machine-parseable failures instead of verbose error pages; a sketch of that error shape follows below.
    • Sources: Aws Amazon via Off-by-none, Off-by-none (+5 more)
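    • Sketch: an RFC 9457 problem-details error in practice. Only the field names (type, title, status, detail, instance) and the application/problem+json media type come from the RFC; the Flask endpoint and URIs below are hypothetical.

```python
# Minimal sketch of an RFC 9457 (application/problem+json) error response.
# Endpoint, URIs, and error details are illustrative, not Cloudflare's API.
from flask import Flask, jsonify, request

app = Flask(__name__)

def problem(status: int, title: str, detail: str, type_uri: str, instance: str):
    """Build an application/problem+json response per RFC 9457."""
    resp = jsonify({
        "type": type_uri,      # URI identifying the class of error
        "title": title,        # short human-readable summary
        "status": status,
        "detail": detail,      # occurrence-specific explanation
        "instance": instance,  # URI of this specific occurrence
    })
    resp.status_code = status
    resp.content_type = "application/problem+json"
    return resp

@app.post("/v1/agents/run")
def run_agent():
    body = request.get_json(silent=True) or {}
    if "task" not in body:
        # A compact, structured error lets an agent retry or surface the
        # failure without spending tokens parsing free-form HTML bodies.
        return problem(400, "Missing field", "Request body must include 'task'.",
                       "https://example.com/problems/missing-field",  # hypothetical URI
                       request.path)
    return jsonify({"status": "accepted"})
```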
6. AWS GitHub Actions Deploy Express Service automates CI/CD for ECS

    • Category: Community
    • What happened: AWS has launched a GitHub Actions 'Deploy Express Service' action that automates the CI/CD process for Amazon ECS Express Mode. This action builds Docker images, pushes them to Amazon ECR, and deploys updates upon code commits, utilizing OIDC authentication with IAM roles for secure access.
    • Worth reading: This new action simplifies the CI/CD workflow for teams using Amazon ECS, potentially improving deployment efficiency and security through automated processes.
    • Source: Platformengineering via TLDR DevOps
7. The Invisible Rewrite: Modernizing the Kubernetes Image Promoter

    • Category: Community
    • What happened: The Kubernetes team has modernized the kpromo tool, which is responsible for promoting container images to registry.k8s.io. This rewrite has reduced the codebase by 20% and significantly decreased the time required for production promotion jobs from 30 minutes to a range of 2-15 minutes, depending on the phase.
    • Worth reading: This improvement in the image promotion process could enhance deployment efficiency and reduce downtime for Kubernetes users relying on registry.k8s.io.
    • Source: Kubernetes via TLDR DevOps
8. On the developer tooling side, AWS CDK Mixins is now generally available

    • Category: Community
    • What happened: AWS has released CDK Mixins to general availability, enabling infrastructure teams to compose reusable patterns without duplicating stack code, which should reduce maintenance overhead for teams managing complex CDK deployments. Concurrently, AWS Lambda Managed Instances has added Rust runtime support, expanding language options for serverless workloads. SRE teams using CDK should evaluate migrating repeated infrastructure patterns to the new Mixins functionality to improve code maintainability and reduce drift between similar environments. Teams with Rust-based applications can now consider Lambda Managed Instances as a deployment option, though performance characteristics should be benchmarked against existing container or EC2 deployments.
    • Worth reading: CDK Mixins could streamline infrastructure management and reduce code complexity, potentially improving deployment efficiency, while Rust support in Lambda may open performance-optimization options for serverless applications. A sketch of the underlying mixin pattern follows below.
    • Source: Aws Amazon via Off-by-none
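    • Sketch: the underlying idea here is plain mixin composition. The classes below are hypothetical plain-Python stand-ins that show the pattern, not the actual CDK Mixins API; consult the AWS CDK docs for the real constructs.

```python
# Illustrative only: reusable behavior composed onto construct-like classes
# instead of copy-pasted stack code. Not the real CDK Mixins API.
class TaggingMixin:
    """Apply a standard tag set to anything exposing add_tag()."""
    STANDARD_TAGS = {"team": "sre", "managed-by": "cdk"}

    def apply_standard_tags(self) -> None:
        for key, value in self.STANDARD_TAGS.items():
            self.add_tag(key, value)

class EncryptionMixin:
    """Flip on at-rest encryption for anything with an 'encrypted' knob."""
    def enforce_encryption(self) -> None:
        self.encrypted = True

class Bucket:
    def __init__(self, name: str):
        self.name, self.tags, self.encrypted = name, {}, False
    def add_tag(self, key: str, value: str) -> None:
        self.tags[key] = value

# Compose once; every stack that uses HardenedBucket inherits both behaviors,
# which reduces drift between similar environments.
class HardenedBucket(TaggingMixin, EncryptionMixin, Bucket):
    pass

b = HardenedBucket("logs")
b.apply_standard_tags()
b.enforce_encryption()
print(b.tags, b.encrypted)  # {'team': 'sre', 'managed-by': 'cdk'} True
```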
9. Key metrics for monitoring Karpenter

    • Category: Community
    • What happened: The article discusses key metrics for monitoring Karpenter, focusing on Prometheus metrics that reveal autoscaling behavior, provisioning latency, and cloud provider issues. It emphasizes the importance of tracking scheduling, disruption, controller, and cost metrics to diagnose scaling delays, API throttling, and inefficiencies that can impact Kubernetes performance and costs.
    • Worth reading: Understanding these metrics can help operators optimize Karpenter's performance and manage costs effectively in Kubernetes environments; a query sketch follows below.
    • Source: Datadoghq via TLDR DevOps
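    • Sketch: pulling two of the metric families mentioned above from Prometheus's HTTP API. The Prometheus URL is hypothetical, and Karpenter metric names vary by version, so treat the queries as assumptions and confirm against the /metrics endpoint on your Karpenter controller.

```python
# Sketch: query a couple of Karpenter metrics via Prometheus's HTTP API.
# Metric names below are assumptions -- verify them for your Karpenter version.
import requests

PROM = "http://prometheus.monitoring.svc:9090"  # hypothetical in-cluster URL

QUERIES = {
    # p95 time from pod creation to usable capacity (provisioning latency)
    "pod_startup_p95": 'histogram_quantile(0.95, sum(rate(karpenter_pods_startup_duration_seconds_bucket[5m])) by (le))',
    # cloud-provider API errors often show up as throttling before an outage does
    "cloudprovider_errors": 'sum(rate(karpenter_cloudprovider_errors_total[5m])) by (method)',
}

for name, promql in QUERIES.items():
    r = requests.get(f"{PROM}/api/v1/query", params={"query": promql}, timeout=10)
    r.raise_for_status()
    for sample in r.json()["data"]["result"]:
        print(name, sample["metric"], sample["value"])
```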
10. Struct: AI on-call agent that automates root-cause analysis for engineering alerts.

    • Category: Deep Dive
    • What happened: Struct is an AI on-call agent designed to automate root-cause analysis for engineering alerts, potentially streamlining incident response processes.
    • Takeaway: This tool could reduce the time spent on diagnosing alerts, improving response efficiency during incidents - it may affect how teams manage on-call duties.
    • Source: Superhuman – Zain Kahn

  CVE & Security

    1. Why AI Agents Are the Ultimate Identity Challenge: ServiceNow's John Aisien explains how agentic workloads outgrow traditional identity and access management

    • Category: Security / Patch
    • What happened: ServiceNow's John Aisien has identified a critical infrastructure challenge for SRE teams managing AI agent deployments, where traditional identity and access management systems are inadequate for agentic workloads that require ephemeral permissions and real-time privilege escalation. According to Aisien's analysis published in Techstrong Brief, AI agents can significantly amplify user privileges and perform actions that existing audit systems cannot properly track or control. SRE teams should immediately assess their current IAM implementations to determine if they can handle dynamic permission grants and comprehensive audit trails for automated agents. Organizations running AI workloads should implement what Aisien terms an 'access intelligence layer' that provides audit-ready controls specifically designed for ephemeral agent permissions rather than relying on traditional user-based access models. Teams should prioritize establishing proper tracking mechanisms for agent actions before expanding AI agent deployments in production environments.
    • Do this Monday: Check whether your IAM stack can issue short-lived, narrowly scoped permissions to AI agents and log every grant and authorization decision; if it cannot, scope out an access intelligence layer before expanding agent deployments. A minimal sketch of the control follows below.
    • Source: Techstrong Brief
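    • Sketch: the shape of an ephemeral-grant control with an audit trail, in plain Python. All names are hypothetical; this illustrates the idea Aisien describes, not any vendor's product.

```python
# Hypothetical sketch of "ephemeral permissions + audit trail" for agent IAM.
import secrets
import time
from dataclasses import dataclass, field

@dataclass
class AgentGrant:
    agent_id: str
    scopes: frozenset       # e.g. {"tickets:read"} -- never a wildcard
    expires_at: float       # short TTL instead of standing privileges
    token: str = field(default_factory=lambda: secrets.token_urlsafe(32))

AUDIT_LOG: list[dict] = []  # stand-in for an append-only audit store

def grant(agent_id: str, scopes: set[str], ttl_s: int = 300) -> AgentGrant:
    g = AgentGrant(agent_id, frozenset(scopes), time.time() + ttl_s)
    AUDIT_LOG.append({"event": "grant", "agent": agent_id,
                      "scopes": sorted(scopes), "ttl_s": ttl_s})
    return g

def authorize(g: AgentGrant, scope: str) -> bool:
    ok = scope in g.scopes and time.time() < g.expires_at
    AUDIT_LOG.append({"event": "authorize", "agent": g.agent_id,
                      "scope": scope, "allowed": ok})
    return ok

g = grant("triage-agent-7", {"tickets:read"})
assert authorize(g, "tickets:read")
assert not authorize(g, "tickets:delete")  # escalation attempt denied *and* logged
```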
    2. Security Frameworks — Nvidia's NemoClaw provides an open-source security and privacy framework for running OpenClaw agents safely in production

    • Category: Security / Patch
    • What happened: NVIDIA has released NemoClaw, an open-source security and privacy framework designed to help enterprises safely build and deploy OpenClaw AI agents in production environments. According to multiple sources, this follows OpenClaw being highlighted as "the most popular open source project in the history of humanity" during NVIDIA's recent keynote presentation. The NemoClaw stack specifically addresses enterprise adoption barriers around trust and control by providing security frameworks for autonomous AI agents that could function as digital coworkers. SRE teams should evaluate NemoClaw's security controls and integration requirements if planning to deploy OpenClaw agents in their infrastructure, particularly focusing on the framework's privacy protections and enterprise safety features. Organizations already running or considering OpenClaw deployments should assess how NemoClaw's security layer aligns with their existing compliance and operational security requirements.
    • Do this Monday: If you run or plan to run OpenClaw agents, review NemoClaw's security controls and integration requirements and check them against your existing compliance and operational security requirements.
    • Sources: Techstrong.AI, Harper Carroll AI, Superhuman – Zain Kahn (+1 more)
    3. Open SWE: An Open-Source Framework for Internal Coding Agents

    • Category: Security / Patch
    • What happened: Open SWE is an open-source framework designed to support the development of internal coding agents. It offers core architectural components such as isolated cloud sandboxes, curated toolsets, subagent orchestration, and integration with developer workflows. The framework emphasizes customization, allowing users to plug in various components as needed.
    • Do this Monday: If you are building internal coding agents, evaluate Open SWE's sandboxing and orchestration components against your existing workflow tooling; a minimal sandbox sketch follows below.
    • Source: Blog Langchain via TLDR
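    • Sketch: a local stand-in for the isolated-sandbox component. Open SWE's own sandboxes are cloud-based; this only shows the isolation idea (scratch directory, stripped environment, hard timeout).

```python
# Sketch: run an agent-proposed command in a throwaway directory with a
# stripped environment and a hard timeout. Illustrates isolation only.
import subprocess
import tempfile

def run_sandboxed(cmd: list[str], timeout_s: int = 30) -> subprocess.CompletedProcess:
    with tempfile.TemporaryDirectory() as scratch:
        return subprocess.run(
            cmd,
            cwd=scratch,                    # no access to the real working tree
            env={"PATH": "/usr/bin:/bin"},  # minimal PATH, no inherited secrets
            capture_output=True,
            text=True,
            timeout=timeout_s,              # runaway commands get killed
        )

result = run_sandboxed(["python3", "-c", "print('hello from the sandbox')"])
print(result.stdout, end="")
```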

  Also this week

    Community reads

    11. AWS and Cerebras Team Up to Accelerate AI Inference in the Cloud

    • Category: Community
    • What happened: AWS and Cerebras are collaborating to enhance AI inference speed by combining AWS's cloud capabilities with Cerebras's specialized AI hardware. This partnership aims to reduce costs and improve throughput as demand for AI models increases. The focus is on controlling inference costs to gain a competitive edge in the AI landscape.
    • Worth reading: This partnership may affect production by providing more efficient AI inference options, potentially lowering costs and improving performance for AI workloads on AWS.
    • Source: Techstrong Brief
    12. Decoding the White House Cyber Strategy: Why Resilience Matters Now

    • Category: Community
    • What happened: The article discusses the White House's new cyber strategy emphasizing resilience, which involves assuming breaches, limiting lateral movement, and maintaining critical services. It highlights the importance of microsegmentation linked to identity and endpoint telemetry as a means to implement this strategy effectively. The piece argues that while 'Zero Trust' is a popular concept, true resilience is demonstrated by the ability to function even after an attacker has gained access.
    • Worth reading: This strategy shift may influence how organizations approach their cybersecurity frameworks, particularly in implementing microsegmentation and resilience measures, which could affect operational security practices.
    • Source: Techstrong Brief
    13. MiniMax launches M2.7 model on MiniMax Agent and APIs

    • Category: Community
    • What happened: MiniMax has released its M2.7 model, which is accessible through the MiniMax Agent and API Platform. This model is designed to support complex workflows in software engineering, office productivity, and research, featuring capabilities such as autonomous debugging and research agent harnesses. The release indicates a shift towards models that can evolve independently.
    • Worth reading: The introduction of the M2.7 model may enhance productivity and debugging processes in software engineering and research environments, potentially affecting workflows that rely on these capabilities.
    • Source: Testingcatalog via TLDR AI
    14. OpenAI announces GPT-5.4 mini and nano as it focuses on coding and enterprise

    • Category: Community
    • What happened: OpenAI has released GPT-5.4 mini and nano models as smaller, more efficient versions of the flagship GPT-5.4, specifically optimized for coding assistants, multi-agent systems, and enterprise applications. According to Harper Carroll AI and TLDR AI, these models are designed for high-volume workloads with faster speeds and lower costs compared to the full GPT-5.4 model. Superhuman reports that this represents OpenAI's strategic shift toward focusing on coding and enterprise use cases rather than side projects. SRE teams currently using OpenAI APIs should evaluate these new models for cost optimization in high-throughput scenarios, particularly for coding assistance and automated workflow applications. No immediate migration actions are required as these are additional model options rather than replacements for existing services.
    • Worth reading: The focus on coding and enterprise tools may influence how developers integrate AI into their workflows, potentially affecting tool choices and project priorities.
    • Sources: Harper Carroll AI, Wsj via TLDR AI, Superhuman – Zain Kahn (+2 more)
    15. Drata cut QA cycles by 86% with QA Wolf, replacing manual E2E testing

    • Category: Community
    • What happened: Drata's engineering team improved their testing efficiency by using QA Wolf, achieving four times more test cases and reducing QA cycles by 86%.
    • Worth reading: This could lead to faster deployment cycles and fewer bugs in production, impacting overall software quality and release timelines.
    • Source: Qawolf via TLDR Dev
    16. Code at AI speed while testing with production confidence: AI agents generate code in minutes, but without ever seeing real production behavior

    • Category: Community
    • What happened: This appears to be sponsored content about mirrord, a development tool that addresses the disconnect between AI-generated code and real production environments. The tool allows AI agents to access actual API responses, database states, and live service behavior during code generation, rather than working with synthetic or mock data. For SRE teams evaluating AI-assisted development workflows, mirrord claims to bridge the gap between rapid AI code generation and production-ready implementations by providing real-world context during the development process. Teams should evaluate whether integrating such tools into their development pipeline could improve the quality of AI-generated code while maintaining appropriate security boundaries between development and production systems.
    • Worth reading: This tool could improve the efficiency of code generation and testing processes, potentially leading to faster deployment cycles and reduced errors in production.
    • Sources: Github via TLDR Dev, Metalbear via TLDR Dev
    17. Most companies claim they are AI-ready, but new survey data reveals a harsh reality

    • Category: Community
    • What happened: A new survey indicates that while many companies assert they are prepared for AI, issues such as poor data quality and inadequate context engineering are preventing AI projects from progressing to production.
    • Worth reading: This highlights the importance of ensuring data quality and context in AI initiatives, which could affect project timelines and resource allocation in organizations pursuing AI solutions.
    • Source: Techstrong.AI
    18. OpenAI’s Codex now lets you spawn parallel subagents

    • Category: Community
    • What happened: OpenAI has released a significant upgrade to Codex that introduces parallel agent spawning capabilities, allowing developers to deploy multiple specialized subagents that report back to a single coordinator for tasks like code review, bug triage, and multi-step debugging. Each subagent operates with its own instructions, model settings, and tool context, enabling more efficient workflows by handling different parts of a task simultaneously. SRE teams using Codex should evaluate how this parallel processing capability could improve their automated code review and debugging workflows, particularly for complex multi-repository operations. Additionally, Codex Security has shifted their approach to prioritize direct repository analysis over traditional SAST reports, focusing on system architecture and trust boundaries before presenting validated findings to users. Teams currently relying on SAST integration with Codex should assess whether this architectural change affects their security scanning pipelines and consider adjusting their code analysis strategies accordingly.
    • Worth reading: This feature could enhance productivity in software development tasks by allowing more efficient handling of complex workflows, potentially reducing time spent on code reviews and debugging. A sketch of the coordinator/subagent shape follows below.
    • Sources: The Code (+2 more)
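    • Sketch: the coordinator/subagent shape described above, expressed with asyncio. This is a generic pattern, not OpenAI's Codex API; run_subagent is a hypothetical stand-in for dispatching work to a configured agent.

```python
# Generic coordinator/subagent pattern: specialized workers run concurrently
# and report back to a single coordinator. Not the Codex API itself.
import asyncio

async def run_subagent(role: str, instructions: str, payload: str) -> dict:
    # Placeholder for a real model call; each subagent would carry its own
    # instructions, model settings, and tool context.
    await asyncio.sleep(0.1)
    return {"role": role, "finding": f"{role} reviewed {payload!r}"}

async def coordinate(diff: str) -> list[dict]:
    tasks = [
        run_subagent("style-review", "Flag style issues only.", diff),
        run_subagent("security-review", "Look for injection and authz bugs.", diff),
        run_subagent("test-coverage", "Identify untested branches.", diff),
    ]
    # Subagents run in parallel; the coordinator collects their reports.
    return await asyncio.gather(*tasks)

results = asyncio.run(coordinate("add payment webhook handler"))
for r in results:
    print(r["finding"])
```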
    19. cc-safe tool scans project settings to flag dangerous commands and enhance Claude AI security

    • Category: Community
    • What happened: cc-safe is a tool designed to scan project settings and flag potentially dangerous commands that could delete files, run with administrative privileges, or grant excessive access to the AI model Claude. Users are advised to run it on their project folders to enhance security.
    • Worth reading: This tool could help prevent accidental data loss or security breaches by identifying risky commands in project settings; operators should consider integrating it into their workflows. A toy version of the scan follows below.
    • Source: The Code
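    • Sketch: a toy version of what a scanner like cc-safe does. The settings path and JSON key below are assumptions for illustration; the real tool's rules and file layout may differ.

```python
# Toy scanner: walk a project's settings file and flag auto-approved
# commands that look dangerous. Path and JSON shape are assumptions.
import json
import re
from pathlib import Path

DANGEROUS = [
    re.compile(r"\brm\s+-rf?\b"),          # recursive deletes
    re.compile(r"\bsudo\b"),               # privilege escalation
    re.compile(r"\bchmod\s+777\b"),        # world-writable files
    re.compile(r"curl[^|]*\|\s*(ba)?sh"),  # pipe-to-shell installs
]

def scan(settings_path: str) -> list[str]:
    data = json.loads(Path(settings_path).read_text())
    findings = []
    for cmd in data.get("approved_commands", []):  # hypothetical key
        for pattern in DANGEROUS:
            if pattern.search(cmd):
                findings.append(f"{pattern.pattern!r} matched: {cmd}")
    return findings

for finding in scan(".claude/settings.json"):      # assumed location
    print("FLAG:", finding)
```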
    20. Mount Sinai researchers conducted the first independent safety evaluation of ChatGPT Health and found two critical biases

    • Category: Community
    • What happened: Mount Sinai researchers completed the first independent safety evaluation of ChatGPT Health and identified two critical biases in the AI system's medical assessments. The evaluation found that ChatGPT Health consistently downplayed genuine medical emergencies while simultaneously escalating routine, non-urgent cases to higher priority levels. Dr. Ashwin Ramaswamy, involved in the research, highlighted that the chatbot's recommendations lacked logical reasoning patterns expected in medical decision-making systems. Organizations currently using or evaluating ChatGPT Health for patient triage or medical consultations should immediately review their implementation protocols and consider additional human oversight mechanisms until these bias issues are addressed. This finding is particularly concerning for healthcare operators who may be relying on AI-assisted triage systems in production environments.
    • Worth reading: This evaluation raises concerns about the reliability of AI tools in healthcare settings, which could impact decision-making processes if used in production environments.
    • Source: Superhuman – Zain Kahn
    21. Securing AI applications, from infrastructure to the entry points and business logic where users interact with AI

    • Category: Community
    • What happened: TLDR AI published guidance on securing AI applications that covers three critical areas: infrastructure components hosting AI workloads, software dependencies and data pipelines, and user-facing entry points with business logic layers. The guidance emphasizes that AI security requires traditional application security practices applied to AI-specific attack vectors including prompt injection, model poisoning, and data exfiltration through inference APIs. SRE teams should implement input validation and rate limiting on AI API endpoints, secure model artifacts and training data with appropriate access controls, and monitor inference requests for anomalous patterns that could indicate attacks. Organizations running AI services should audit their current security posture against these recommendations and establish baseline security controls before expanding AI deployments to production environments.
    • Worth reading: This guide may help operators understand how to secure AI applications and their interactions, which is increasingly relevant as AI systems move into production; a sketch of two of the controls follows below.
    • Source: Datadoghq via TLDR AI
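    • Sketch: two of the recommended controls, input validation and per-client rate limiting, in front of a hypothetical inference endpoint. The framework choice and limits are illustrative, not the article's implementation.

```python
# Minimal sketch: validate input and rate-limit clients before a request
# reaches the model. Endpoint and limits are illustrative assumptions.
import time
from collections import defaultdict, deque
from flask import Flask, abort, jsonify, request

app = Flask(__name__)

MAX_PROMPT_CHARS = 4_000
WINDOW_S, MAX_REQS = 60, 30
_requests: dict[str, deque] = defaultdict(deque)

def rate_limited(client: str) -> bool:
    """Sliding-window counter per client."""
    now = time.time()
    q = _requests[client]
    while q and now - q[0] > WINDOW_S:
        q.popleft()
    q.append(now)
    return len(q) > MAX_REQS

@app.post("/v1/infer")
def infer():
    if rate_limited(request.remote_addr or "unknown"):
        abort(429)
    prompt = (request.get_json(silent=True) or {}).get("prompt", "")
    # Reject oversized or non-string input before it reaches the model.
    if not isinstance(prompt, str) or not prompt or len(prompt) > MAX_PROMPT_CHARS:
        abort(400)
    return jsonify({"echo": prompt[:80]})  # stand-in for the model call
```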
    22. How a small team scaled AI infrastructure

    • Category: Community
    • What happened: A six-engineer team scaled an AI platform serving millions of businesses by consolidating infrastructure into a single codebase and platform to simplify operations and support rapid growth.
    • Worth reading: This approach may inspire teams to streamline their infrastructure and improve operational efficiency, potentially affecting how AI services are managed.
    • Source: Vercel via TLDR AI

  Lightning links

    Human Stories

    The McKinsey Lilli breach getting cracked in under two hours tells us everything we need to know about where we are right now. Here we have one of the world's most prestigious consulting firms, presumably with serious security resources, watching their AI system fold faster than a bad poker hand, and suddenly everyone's scrambling to announce enterprise security features and close billion-dollar acquisition deals like Google's Wiz purchase. It's the same pattern we've seen before - a high-profile failure creates a market correction where the fundamentals that should have been there all along suddenly become urgent priorities. While teams are rushing to bolt security onto their AI workloads after the fact, others like the Kubernetes folks are quietly doing the invisible work of modernizing their image promoter, understanding that reliability isn't built from panic but from the patient, unglamorous work of making systems actually trustworthy. The real lesson isn't that we need more security theater - it's that we need to build things right from the ground up, because two hours is about how long it takes to destroy years of reputation when your foundations aren't solid.

    Also worth reading

    Each additional layer of review in a process introduces massive latency—often slowing delivery by an order of magnitude. (TLDR DevOps)

    The article discusses how adding more layers of review in processes can significantly increase latency, often slowing down delivery times by a factor of ten. It emphasizes the need to streamline processes to improve efficiency.

    Every time you approve a command in Claude Code, it gets saved to your settings file. (The Code)

    A developer experienced a significant issue with Claude Code when a command they auto-approved led to the deletion of their entire home directory. This incident highlights the risks associated with having a long list of auto-approved commands that users may overlook.