0:06
Hey, I'm Brian and this is Ship It Weekly by
0:10
Teller's Tech. Quick heads up before we get into
0:13
it, my voice is a little rough this week. I've
0:15
been fighting a cold, so if I sound like I've
0:18
been yelling at Kubernetes for three days straight,
0:20
that's why. All right, here's what we're talking
0:23
about this episode. First, IBM is buying Confluent.
0:27
That's right after the HashiCorp deal, so we'll
0:30
talk about what that means if you're on Confluent
0:33
Cloud, running Kafka on-prem, or trying to pick
0:37
between Confluent, MSK, and do-it-yourself.
0:40
Second, there's a nasty React vulnerability people
0:44
are calling React2Shell. It's a 10.0 RCE, it's
0:49
been exploited in the wild, and if you run platforms
0:53
for front-end teams, this absolutely involves
0:56
you, even if you've never touched React. Third,
1:00
Netflix wrote up how they consolidated a big
1:03
chunk of relational databases onto Aurora Postgres.
1:07
They saw up to 75% better performance and
1:11
solid cost savings, and they simplified their
1:14
fleet. We'll talk about what's real there and
1:17
what's marketing. In the lightning round, we'll
1:19
hit OpenTofu 1.11, some Terraform tips from
1:24
the trenches, Ghostty going non-profit, and a
1:28
pair of tools around spec-driven development
1:31
with AI. And we'll wrap with a human story from
1:34
Your Brain on Incidents, about what big incidents
1:38
actually do to people and how to make that less
1:42
awful. Let's start with the big acquisition,
1:45
IBM and Confluent. So IBM is buying Confluent
1:49
for a very large pile of money. They already
1:53
grabbed HashiCorp earlier, and now they're adding
1:56
Confluent, which is basically Kafka as a product,
1:59
with managed clusters, connectors, governance,
2:02
all the stuff around the core broker. If you
2:05
zoom out, this looks pretty intentional. HashiCorp
2:08
gives them a control plane story: Terraform,
2:11
Vault, Consul, all the infrastructure lifecycle
2:14
pieces. Confluent gives them a data plane story:
2:19
real-time streams feeding analytics and AI
2:22
systems. From IBM's point of view, that's great.
2:26
From your point of view, this should trigger
2:28
some questions. If you're a Confluent Cloud customer
2:32
today, I'd be asking, what's the realistic time
2:36
window where pricing and packaging stay roughly
2:39
the same? 12 months? 24 months? What's the path
2:43
if that changes and we suddenly don't like the
2:47
new world? Are we comfortable being on HashiCorp
2:50
tools and Confluent and whatever IBM does for
2:53
AI all from the same vendor? If you're not on
2:56
Confluent but you've been evaluating it, this
2:59
changes the comparisons with MSK and self-managed
3:02
Kafka a little bit. On the plus side, IBM has
3:05
deep enterprise relationships and is very good
3:08
at, let's say, long sales cycles. That might
3:12
mean better integration with big company identity,
3:15
governance, on-prem stories, all of that. On
3:18
the minus side, every time a company like this
3:21
gets acquired, there's a non-zero risk of focus
3:25
shifting to "bundle this into everything" instead
3:28
of "make the core service amazing." If you're on
3:31
MSK or self-managed Kafka right now, I don't
3:35
think this is an immediate "you chose wrong" moment,
3:38
but it is a reminder to check your own vendor
3:41
concentration. If the same company controls your
3:44
infra control plane, your secret management,
3:48
and your streaming backbone, you should at least
3:50
have some kind of exit plan on paper. Not because
3:54
you're going to use it tomorrow, but because
3:57
you really don't want to start thinking about
3:59
migrations the week after a pricing email lands.
4:02
So the homework here is simple. If you use Confluent
4:05
anywhere, write down what Plan B looks like.
4:09
If you don't, decide whether this acquisition
4:12
makes you more or less likely to adopt it in
4:15
the next couple of years. And maybe keep an eye
4:18
on what IBM does next in the infra-AI space,
4:22
because they're clearly not done shopping. Alright,
4:25
let's talk about something more urgent: React2Shell.
4:28
React2Shell is one of those vulns where
4:32
the score is 10 out of 10, and unfortunately,
4:35
that's not an exaggeration. Very short version,
4:38
React server components use a protocol called
4:41
Flight to talk between client and server. There's
4:45
a bug that lets an attacker send a malicious
4:47
payload through that protocol and chain it into
4:51
a remote code execution on the server side. So
4:54
this is not "someone can mess up your CSS." This
4:57
is "someone can run arbitrary code on whatever
5:00
is hosting your React server components." Why
5:03
should you care if you're just the platform person?
5:05
Because a lot of modern front-end stacks are
5:08
now a front end plus a server-side piece deployed
5:11
as containers into your Kubernetes clusters or
5:15
your app platform. Those pods are part of your
5:18
blast radius. If they get popped, the attacker
5:21
is one kubectl away from the rest of the cluster,
5:26
or one IMDS hop away from credentials. Patches
5:30
are out in the ecosystem. Next.js and others
5:33
have shipped fixed versions. React has guidance,
5:37
the usual wave of advisories. But there is real
5:40
exploit traffic in the wild already, and cloud
5:44
providers and security vendors are seeing this
5:46
used to drop loaders and pivot deeper. So what
5:50
do you actually do as a platform SRE person?
5:53
First, inventory. Figure out which services
5:57
in your world are using React 19, React server
6:01
components, Next.js with RSC, that kind of thing.
6:05
If you don't know, this is a good week to ask
6:07
app teams some annoying questions. Second, patch
6:11
tracking. Make sure those services are on the
6:14
patched versions of their framework. This is
6:17
one of those "we're doing an emergency patch even
6:20
if the sprint board doesn't like it" weeks. Third,
6:23
guardrails. If you have a WAF, check whether
6:25
your provider has shipped React2Shell rules
6:28
and turn them on in at least monitoring mode,
6:31
ideally block mode for the exposed endpoints.
6:34
Same thing for IDS, IPS, or runtime security
6:38
tools. Fourth, egress and privileges. Double
6:41
check the egress posture of those front-end
6:44
services: if an attacker does get code exec, how
6:47
easy is it for them to phone home, hit IMDS,
6:51
pull other secrets, or talk to internal services
6:54
that they really shouldn't?
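To make that concrete, here's a rough sketch of what locking down egress for one of those front-end workloads could look like, using the Terraform kubernetes provider. The namespace, labels, and allowed destination are invented for illustration, so treat it as a starting point rather than a drop-in policy.

```hcl
# Sketch: default-deny egress for a front-end workload, then explicitly
# allow DNS and the one internal API it actually needs. Names, labels,
# and ports are placeholders.
resource "kubernetes_network_policy" "frontend_egress" {
  metadata {
    name      = "frontend-restrict-egress"
    namespace = "web"
  }

  spec {
    pod_selector {
      match_labels = {
        app = "storefront-frontend"
      }
    }

    policy_types = ["Egress"]

    # Allow DNS lookups.
    egress {
      ports {
        port     = "53"
        protocol = "UDP"
      }
    }

    # Allow HTTPS to the internal API namespace and nothing else -- no
    # arbitrary phone-home destinations, no easy path to IMDS.
    egress {
      to {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "internal-api"
          }
        }
      }
      ports {
        port     = "443"
        protocol = "TCP"
      }
    }
  }
}
```

Even if you never apply it as-is, writing something like this down forces the conversation about what that service actually needs to talk to.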
6:57
You don't need to become a React expert. You just need to treat
7:00
this like what it is: a serious server-side
7:03
RCE that just happens to be riding in on a front
7:07
-end stack. All right, let's move from "everything
7:09
is on fire" to how one of the big kids is evolving
7:13
their database story. Netflix published a case
7:17
study about consolidating a big chunk of their
7:20
relational database fleet onto Amazon Aurora
7:23
Postgres. The headline numbers, they quote, are
7:26
up to 75% better performance and almost 30%
7:31
cost savings for some workloads. Now you should
7:34
always be a little skeptical of round numbers.
7:37
But the pattern is pretty familiar. They had
7:39
a bunch of self-managed Postgres clusters scattered
7:43
around. Each one had its own tuning, its own
7:46
backup setup, its own failover behavior, its
7:49
own on-call expectations. Over time, that turns
7:53
into a huge operational tax. Moving to Aurora
7:56
gave them a few things: managed failover and
8:00
backups instead of writing and maintaining that
8:04
themselves, a more uniform story for observability
8:07
and performance tuning, and the ability to simplify
8:10
sizing and autoscaling in a more consistent
8:13
way. From our side of the fence, the interesting
8:16
question is not "should we be Netflix?" It's "where
8:20
do we get outsized value from letting the cloud
8:23
provider manage more of the boring stuff?" If
8:26
you have a handful of big, weird Postgres clusters
8:29
with very tight latency requirements, very custom
8:33
extensions, or unusual replication topologies,
8:37
self-managed might still be the right call.
8:40
But if you have 30 or 50 small-to-medium Postgres
8:43
instances that all need roughly the same reliability
8:47
story and none of them are super special snowflakes,
8:50
something Aurora-like starts to look pretty
8:53
attractive. The trade-offs are similar to what
8:56
we just talked about with Confluent. You're consolidating
8:59
onto one managed platform. You get reliability
9:03
and lower ops overhead at the cost of more vendor
9:07
lock-in. There's no free lunch there. The takeaway
9:10
I'd want people to get from the Netflix piece
9:12
is not "oh cool, Aurora is magic." It's that you should
9:16
occasionally step back and ask if the way you're
9:19
running your databases is still the right shape
9:22
for the scale you're at now. If you built a fleet
9:25
of hand -tuned clusters back when you had five
9:28
apps, that might not be the right model now that
9:32
you've got 100. All right, let's knock out a
9:35
quick lightning round. First up, OpenTofu 1.11.
9:39
OpenTofu keeps moving quickly, and 1.11
9:43
brings some nice language features. There's an
9:47
enabled meta-argument you can use on resources
9:50
and modules to conditionally include things without
9:53
the old count = 0 hacks. And there's
9:56
support for ephemeral values, so you're not forced
9:59
to jam every intermediate into state forever.
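To give you a flavor, here's a rough sketch of what those two features look like as I understand them from the release notes. The module path and variable names are invented, and the exact syntax is worth double-checking against the 1.11 docs before you lean on it.

```hcl
# Sketch only -- names are invented, syntax per the 1.11 release notes.

variable "enable_debug_tooling" {
  type    = bool
  default = false
}

# The enabled meta-argument: conditionally include a whole module without
# the old count = var.x ? 1 : 0 dance and the [0] indexing it forces.
module "debug_tooling" {
  source  = "./modules/debug-tooling"
  enabled = var.enable_debug_tooling

  cluster_name = "demo"
}

# An ephemeral value: usable during the run, but not persisted into state.
variable "bootstrap_token" {
  type      = string
  sensitive = true
  ephemeral = true
}
```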
10:03
I'm not going to go line by line through the
10:05
changelog. But if you've been Terraform-curious
10:08
about OpenTofu, this is a good excuse to try
10:11
it out on a small non-critical stack and see
10:14
how painful or painless the migration is. At
10:18
minimum, be aware of it so you're not surprised
10:21
when someone on your team says, "hey, can we standardize
10:24
on this instead?" Next, a Terraform tips post
10:27
from Rose Security that I liked. It's one of
10:30
those "small things that add up" articles: stuff
10:33
like using one() instead of [0]
10:38
when you really expect a single value, shaping
10:41
variables as objects with optional attributes,
10:45
so you're not passing around random maps, that
10:47
kind of stuff.
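As a quick illustration of those two tips, with invented names:

```hcl
# Tip 1: one() instead of [0]. If the list unexpectedly holds zero or more
# than one element, one() fails with a clear error instead of silently
# grabbing whatever sits at index zero.
locals {
  # Imagine this came from a filtered data source or a conditional resource.
  matching_subnet_ids = ["subnet-0abc123"]

  primary_subnet_id = one(local.matching_subnet_ids)
}

# Tip 2: shape inputs as an object with optional attributes and defaults
# instead of passing loose maps around and hoping the keys line up.
variable "service" {
  type = object({
    name           = string
    replicas       = optional(number, 2)
    enable_logging = optional(bool, true)
  })
}
```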
10:50
This is the kind of post I'd quietly drop into your Terraform channel or internal
10:53
docs, and then steal ideas from the next time
10:56
you touch a core module. You don't have to refactor
11:00
everything at once, but tightening up the patterns
11:03
over time does pay off. Third lightning item:
11:07
Ghostty going non-profit. Ghostty is a GPU-accelerated
11:11
terminal emulator that's gotten really popular
11:14
and Mitchell Hashimoto announced that it is now
11:17
under a non-profit umbrella instead of being
11:21
a commercial product in waiting. I'm not going
11:23
to pretend that this is a pure DevOps story,
11:26
but I do think it's interesting to see a dev
11:29
tool with this much traction explicitly choose
11:32
a non-profit ownership model. After watching
11:36
things like Terraform's license change and all
11:39
the drama around open-core tools, it's kind of
11:42
refreshing to have a core tool say, "nope, this
11:45
is going to be community governed." And last lightning
11:48
item: spec-driven dev with AI. GitHub released
11:52
SpecKit, and there's also a project called OpenSpec
11:56
from Fission. Both are playing in the same space.
12:00
Instead of just prompting your AI assistant with,
12:03
"hey, write some code," you start with a structured
12:06
spec that says what you're building and how you'll
12:10
know it's correct. And then you let AI generate
12:13
plans, code, and tests anchored to that spec.
12:17
From a platform perspective, I think this is
12:20
the only sane way AI is going to touch infrastructure
12:23
at scale. Imagine a spec for "new service on our
12:27
platform," or "standard Kubernetes app," or "new
12:31
CI pipeline," and then the assistant uses that
12:34
spec to generate Terraform, Helm, and policy
12:37
that fits your patterns. We're not there yet
12:40
in most shops, but watch this space. It's way
12:44
better than "let the bot randomly edit production
12:47
YAML." Alright, let's talk about incidents and
12:50
brains for a minute. For the human bit this week,
12:53
I want to pull from an article called Your Brain
12:56
on Incidents. It's about what major incidents
12:59
actually do to people, not just systems. If you've
13:03
ever been on a multi -hour call where everything
13:07
is breaking and the pager will not shut up, you
13:10
already know the feeling. Tunnel vision, bad
13:12
decisions, snapping at teammates, that sort of
13:15
thing. The article talks about cognitive load
13:18
and stress responses in a pretty approachable
13:22
way. When you're in an incident, your brain is
13:25
juggling a ton of context, logs, dashboards,
13:29
Slack, tickets, leadership asking for updates.
13:33
On top of that, if you don't feel safe saying,
13:36
"I don't know" or "we need to slow down," your brain
13:38
goes into pure defensive mode. That's where blamey
13:42
cultures make everything worse. If every mistake
13:45
gets dissected in the most painful way possible,
13:48
people will hide information during the incident
13:51
and the postmortem. You lose exactly the insight
13:54
you need to get better next time. So what do
13:57
you do with that as an SRE or platform lead?
14:01
One, be explicit about expectations during big
14:04
incidents. It's okay to say we're going to pick
14:07
one hypothesis at a time and we're allowed to
14:10
be wrong. It's okay to say someone needs to be
14:13
the scribe and someone needs to tell leadership
14:15
to wait five minutes for an update. Two, design
14:18
your incident process so it takes some load off
14:21
of the humans. That could be better runbooks,
14:24
better dashboards, or just a clear template for
14:27
how to structure a Slack channel during an event.
14:31
Three, in the review, focus more on how people
14:34
reasoned under pressure and less on who typoed
14:37
the command. You want people to feel safe saying,
14:40
"I was fried and I misread the graph," because
14:43
that's how you find systemic fixes. And maybe
14:46
most importantly, remember that people have a
14:49
limited number of massive incident nights in
14:52
them before they burn out. That's not a moral
14:55
failing, that's just biology. All right, that's
14:58
it for this episode of Ship It Weekly. We talked
15:01
about IBM buying Confluent and what that means
15:04
for streaming and vendor risk. React2Shell
15:08
and why a React vuln absolutely still belongs
15:12
on your platform radar. And Netflix's move to
15:16
Aurora Postgres as an example of rethinking your
15:20
database fleet. In the lightning round, we hit
15:22
OpenTofu 1.11, some Terraform cleanup ideas,
15:27
Ghostty going non-profit, and spec-driven development
15:31
with AI instead of free-form prompt chaos. And
15:35
we closed with Your Brain on Incidents, and a
15:38
reminder that your systems aren't the only thing
15:42
taking damage during a bad outage. I'm Brian,
15:45
this is Ship It Weekly by Teller's Tech, thanks
15:48
for bearing with the cold voice version of me.
15:51
I'll drop the links for this episode in the show
15:54
notes and I'll see you in the next one.