Host Commentary

For this episode, the thing that kept showing up was not really “AI” by itself.

It was responsibility.

More specifically, what happens when companies roll out a new interface, a new abstraction, or a new “easy path,” and then quietly hand platform teams all the responsibility that comes with it.

That’s what tied these stories together for me.

McKinsey had to publicly deal with a vulnerability in Lilli, which is useful not because it turned into some huge apocalyptic breach story, but because it reminds people that internal AI tools are still real systems. They may look friendly. They may be framed like helpers. But once they can touch company knowledge, influence decisions, or sit in the middle of a workflow people trust, they stop being side tools. They become part of the operating surface.

And that means all the old questions come right back.

Who can access it.
What can it read.
What can it write.
What does it trust.
What gets logged.
What happens when somebody uses it in a way nobody really modeled.
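Those questions translate pretty directly into code. Here is a minimal, purely illustrative sketch of what "what can it read, what can it write, what gets logged" looks like as policy, where every tool and resource name is a made-up placeholder, not anything from the actual story:

```python
# Hypothetical read/write scoping for an internal AI tool, with an
# audit trail. Tool and resource names are illustrative assumptions.

AUDIT: list[tuple[str, str, str, str]] = []

POLICY = {
    "internal-assistant": {
        "read": {"public-wiki", "eng-runbooks"},  # explicitly enumerated
        "write": set(),                           # read-only by default
    },
}

def tool_access(tool: str, action: str, resource: str) -> bool:
    """Return whether `tool` may perform `action` on `resource`, and log it."""
    allowed = resource in POLICY.get(tool, {}).get(action, set())
    AUDIT.append((tool, action, resource, "allow" if allowed else "deny"))
    return allowed
```

The point is not the five lines of logic, it is that every one of those old questions has to become an explicit decision somewhere, with a default of deny and a record of what happened.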

That is the part people keep wanting to skip.

Everybody wants the new interface. Very few people want the old responsibilities that come with it.

The Kafka story hit a different version of the same theme.

Diskless topics are interesting because they feel like architecture honesty. Not hype. Not branding. Just a pretty direct acknowledgment that cloud economics eventually force you to revisit assumptions that used to feel settled. If durable local storage and broker-led replication were the obvious center of gravity before, maybe they are not the obvious center of gravity now. That is a much more useful kind of story to me than most “future of AI” noise, because it is really about something deeper: when the environment changes enough, old architecture starts charging rent.

And a lot of teams are probably living that right now, even outside Kafka.

You see it when the old design still technically works, but it works in a way that is more expensive, more awkward, or more fragile than anybody wants to admit. At some point, tuning stops being the answer. The answer becomes rethinking what the system is centered around in the first place.

Then there’s Google closing the Wiz acquisition, which to me reads less like a flashy M&A story and more like an admission about where the cloud fight actually is now.

The fight is not just compute. It is not just managed services. It is not just who has the nicest product page or the most polished launch event. It is posture. Visibility. Exposure. Identity. Policy. Security as part of the actual platform choice.

That feels obvious if you live in this space, but it is still worth saying out loud because companies still act like cloud strategy and security strategy are two separate conversations. I buy that less and less. Not when environments are this messy. Not when AI is adding new surfaces. Not when half the real work is figuring out what is running where, who can touch it, and what your blast radius looks like when somebody gets it wrong.

The AWS Copilot story is kind of the same lesson again, just in a more familiar cloud-ops form.

The paved road moved.

That’s really the story.

And platform teams know exactly what that means. It means the thing that felt like the safe, vendor-approved path now has an off-ramp. It means migration work. Retraining. Documentation churn. Re-explaining choices to teams that thought the answer was already settled. It means carrying the cost of somebody else’s product direction.

That is why I keep coming back to the idea that convenience in cloud is borrowed.

Sometimes borrowing it is absolutely worth it. I am not against paved roads. Most teams should probably use more of them, not less. But the tradeoff is always there. When the road changes, you move too. And if you have not thought about the exit story in advance, the migration always feels more annoying than it should.

Then the Kubernetes AI Gateway Working Group rounds the whole episode out in a way I really like, because it cuts through a lot of the dumbest AI discourse.

The interesting question is not “do you believe in AI” or “what model is best this week.”

The interesting question is what happens when AI traffic becomes normal platform traffic.

Because once that happens, the conversation gets a lot more real. Now it is rate limiting. Access control. Payload inspection. Egress policy. Caching. Guardrails. Prompt injection defenses. Logging. Routing. Normal boring platform words. Which is exactly why I like the story. It is a sign that the industry is moving from novelty into operations.
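Those boring platform words compose into ordinary gateway logic. A minimal sketch, assuming nothing about the working group's actual design: team names, model names, and the policy shape below are all invented for illustration, and real gateways would do this at the proxy layer, not in application Python:

```python
import time
from dataclasses import dataclass, field

@dataclass
class TokenBucket:
    """Per-client rate limiter: `capacity` requests, refilled at `rate`/sec."""
    capacity: float
    rate: float
    tokens: float = 0.0
    last: float = field(default_factory=time.monotonic)

    def __post_init__(self):
        self.tokens = self.capacity

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

class AIGateway:
    """Toy AI gateway: access control, egress allowlist, rate limit, audit log."""
    def __init__(self, allowed_models, authorized_teams, rps=5):
        self.allowed_models = set(allowed_models)
        self.authorized_teams = set(authorized_teams)
        self.buckets = {}
        self.rps = rps
        self.audit_log = []

    def handle(self, team: str, model: str, prompt: str) -> str:
        entry = {"team": team, "model": model, "chars": len(prompt)}
        if team not in self.authorized_teams:
            entry["decision"] = "deny:unauthorized"
        elif model not in self.allowed_models:
            entry["decision"] = "deny:egress"       # model endpoint not allowlisted
        elif not self.buckets.setdefault(team, TokenBucket(self.rps, self.rps)).allow():
            entry["decision"] = "deny:rate-limited"
        else:
            entry["decision"] = "allow"
        self.audit_log.append(entry)                # everything gets logged
        return entry["decision"]
```

None of that is AI-specific. Which is the whole point: once model calls flow through a chokepoint like this, they get the same treatment as any other traffic.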

And that is usually where the truth shows up.

If I had to boil the whole episode down, I think it comes back to this:

The new interface does not remove the old responsibilities.

That is true for internal AI tools.
It is true for cloud architecture.
It is true for security acquisitions.
It is true for vendor paved roads.
And it is definitely true for AI-shaped traffic once it starts touching real systems.

There’s always a shiny version of the story companies want to tell.

Smarter tools.
Faster delivery.
Simpler workflows.
Better leverage.
A more intelligent future.

And sure, some of that is real.

But the operator version of the story always lands a little differently.

What are the trust boundaries.
What gets logged.
What is actually enforced.
How clean is rollback.
What happens when the vendor changes direction.
What assumptions are now too expensive to keep pretending are normal.
Who owns the control plane after all this stuff hardens into a real production dependency.

That’s where this episode lived for me.

Not in the hype.
Not in the demo.
Not in whether the new thing sounds cool.

More in the handoff point where new capability turns into somebody else’s operational burden.

And most of the time, that somebody is us.

Show Notes

This week on Ship It Weekly, Brian looks at what happens when new interfaces create old responsibilities.

McKinsey patched a vulnerability in its internal AI tool Lilli, Kafka contributors are pushing a diskless-topics model that rethinks durability and replication in cloud environments, and Google officially closed its acquisition of Wiz, one of the biggest cloud-security deals to date. Plus: AWS is sunsetting the Copilot CLI, and Kubernetes launches an AI Gateway Working Group.

Links

McKinsey statement on Lilli

https://www.mckinsey.com/about-us/media/statement-on-strengthening-safeguards-within-the-lilli-tool

Kafka diskless topics proposal

https://cwiki.apache.org/confluence/display/KAFKA/The%2BPath%2BForward%2Bfor%2BSaving%2BCross-AZ%2BReplication%2BCosts%2BKIPs

Google completes acquisition of Wiz

https://blog.google/innovation-and-ai/infrastructure-and-cloud/google-cloud/wiz-acquisition/

AWS Copilot CLI end-of-support

https://aws.amazon.com/blogs/containers/announcing-the-end-of-support-for-the-aws-copilot-cli/

Kubernetes AI Gateway Working Group

https://kubernetes.io/blog/2026/03/09/announcing-ai-gateway-wg/

Amazon Bedrock observability for first-token latency and quota consumption

https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-bedrock-observability-ttft-quota/

Cloudflare JSON responses and RFC 9457 support for 1xxx errors

https://developers.cloudflare.com/changelog/post/2026-03-11-json-rfc9457-responses-for-1xxx-errors/

Amazon S3 source-region information in server access logs

https://aws.amazon.com/about-aws/whats-new/2026/02/amazon-s3-source-region-information/

AWS Config adds 30 new resource types

https://aws.amazon.com/about-aws/whats-new/2026/03/aws-config-new-resource-types/

Amazon Bedrock AgentCore Runtime stateful MCP server features

https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-bedrock-agentcore-runtime-stateful-mcp/

More episodes and show notes at

https://shipitweekly.fm

On Call Briefs at

https://oncallbrief.com