0:06
Hey, I'm Brian and this is Ship It Weekly by
0:10
Teller's Tech. Quick heads up before we get into
0:13
it, my voice is a little rough this week. I've
0:15
been fighting a cold, so if I sound like I've
0:18
been yelling at Kubernetes for three days straight,
0:20
that's why. All right, here's what we're talking
0:23
about this episode. First, IBM is buying Confluent.
0:27
That's right after the HashiCorp deal, so we'll
0:30
talk about what that means if you're on Confluent
0:33
Cloud, running Kafka on-prem, or trying to pick
0:37
between Confluent, MSK, and do-it-yourself.
0:40
Second, there's a nasty React vulnerability people
0:44
are calling React2Shell. It's a 10.0 RCE, it's
0:49
been exploited in the wild, and if you run platforms
0:53
for front-end teams, this absolutely involves
0:56
you, even if you've never touched React. Third,
1:00
Netflix wrote up how they consolidated a big
1:03
chunk of relational databases onto Aurora Postgres.
1:07
They saw up to 75% better performance and
1:11
solid cost savings, and they simplified their
1:14
fleet. We'll talk about what's real there and
1:17
what's marketing. In the lightning round, we'll
1:19
hit OpenTofu 1.11, some Terraform tips from
1:24
the trenches, Ghostty going non-profit, and a
1:28
pair of tools around spec-driven development
1:31
with AI. And we'll wrap with a human story from
1:34
Your Brain on Incidents, about what big incidents
1:38
actually do to people and how to make that less
1:42
awful. Let's start with the big acquisition,
1:45
IBM and Confluent. So IBM is buying Confluent
1:49
for a very large pile of money. They already
1:53
grabbed HashiCorp earlier, and now they're adding
1:56
Confluent, which is basically Kafka as a product,
1:59
with managed clusters, connectors, governance,
2:02
all the stuff around the core broker. If you
2:05
zoom out, this looks pretty intentional. HashiCorp
2:08
gives them a control plane story: Terraform,
2:11
Vault, Consul, all the infrastructure lifecycle
2:14
pieces. Confluent gives them a data plane story:
2:19
real-time streams feeding analytics and AI
2:22
systems. From IBM's point of view, that's great.
2:26
From your point of view, this should trigger
2:28
some questions. If you're a Confluent Cloud customer
2:32
today, I'd be asking, what's the realistic time
2:36
window where pricing and packaging stay roughly
2:39
the same? 12 months? 24 months? What's the path
2:43
if that changes and we suddenly don't like the
2:47
new world? Are we comfortable being on HashiCorp
2:50
tools and Confluent and whatever IBM does for
2:53
AI all from the same vendor? If you're not on
2:56
Confluent but you've been evaluating it, this
2:59
changes the comparisons with MSK and self-managed
3:02
Kafka a little bit. On the plus side, IBM has
3:05
deep enterprise relationships and is very good
3:08
at, let's say, long sales cycles. That might
3:12
mean better integration with big company identity,
3:15
governance, on-prem stories, all of that. On
3:18
the minus side, every time a company like this
3:21
gets acquired, there's a non-zero risk of focus
3:25
shifting to "bundle this into everything" instead
3:28
of "make the core service amazing." If you're on
3:31
MSK or self-managed Kafka right now, I don't
3:35
think this is an immediate "you chose wrong" moment,
3:38
but it is a reminder to check your own vendor
3:41
concentration. If the same company controls your
3:44
infra control plane, your secret management,
3:48
and your streaming backbone, you should at least
3:50
have some kind of exit plan on paper. Not because
3:54
you're going to use it tomorrow, but because
3:57
you really don't want to start thinking about
3:59
migrations the week after a pricing email lands.
4:02
So the homework here is simple. If you use Confluent
4:05
anywhere, write down what Plan B looks like.
4:09
If you don't, decide whether this acquisition
4:12
makes you more or less likely to adopt it in
4:15
the next couple of years. And maybe keep an eye
4:18
on what IBM does next in the infra-AI space,
4:22
because they're clearly not done shopping. Alright,
4:25
let's talk about something more urgent: React2Shell.
4:28
React2Shell is one of those vulns where
4:32
the score is 10 out of 10, and unfortunately,
4:35
that's not an exaggeration. Very short version,
4:38
React server components use a protocol called
4:41
Flight to talk between client and server. There's
4:45
a bug that lets an attacker send a malicious
4:47
payload through that protocol and chain it into
4:51
a remote code execution on the server side. So
4:54
this is not "someone can mess up your CSS." This
4:57
is "someone can run arbitrary code on whatever
5:00
is hosting your React server components." Why
5:03
should you care if you're just the platform person?
5:05
Because a lot of modern front-end stacks are
5:08
now a front end plus a server-side piece deployed
5:11
as containers into your Kubernetes clusters or
5:15
your app platform. Those pods are part of your
5:18
blast radius. If they get popped, the attacker
5:21
is one kubectl away from the rest of the cluster,
5:26
or one IMDS hop away from credentials. Patches
5:30
are out in the ecosystem. Next.js and others
5:33
have shipped fixed versions. React has guidance,
5:37
the usual wave of advisories. But there is real
5:40
exploit traffic in the wild already, and cloud
5:44
providers and security vendors are seeing this
5:46
used to drop loaders and pivot deeper. So what
5:50
do you actually do as a platform SRE person?
5:53
First, inventory. Figure out which services
5:57
in your world are using React 19, React server
6:01
components, Next.js with RSC, that kind of thing.
6:05
If you don't know, this is a good week to ask
6:07
app teams some annoying questions. Second, patch
6:11
tracking. Make sure those services are on the
6:14
patched versions of their framework. This is
6:17
one of those "we're doing an emergency patch even
6:20
if the sprint board doesn't like it" weeks. Third,
6:23
guardrails. If you have a WAF, check whether
6:25
your provider has shipped React2Shell rules
6:28
and turn them on in at least monitoring mode,
6:31
ideally block mode for the exposed endpoints.
6:34
Same thing for IDS, IPS, or runtime security
6:38
tools. Fourth, egress and privileges. Double
6:41
check the egress posture of those front-end
6:44
services: if an attacker does get code exec, how
6:47
easy is it for them to phone home, hit IMDS,
6:51
pull other secrets, or talk to internal services
6:54
that they really shouldn't?
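To make that concrete, here's a rough sketch of what locking down egress for one of those front-end workloads could look like, using the Terraform kubernetes provider. The namespace, labels, and allowed destination are invented for illustration, so treat it as a starting point rather than a drop-in policy.

```hcl
# Sketch: default-deny egress for a front-end workload, then explicitly
# allow DNS and the one internal API it actually needs. Names, labels,
# and ports are placeholders.
resource "kubernetes_network_policy" "frontend_egress" {
  metadata {
    name      = "frontend-restrict-egress"
    namespace = "web"
  }

  spec {
    pod_selector {
      match_labels = {
        app = "storefront-frontend"
      }
    }

    policy_types = ["Egress"]

    # Allow DNS lookups.
    egress {
      ports {
        port     = "53"
        protocol = "UDP"
      }
    }

    # Allow HTTPS to the internal API namespace and nothing else -- no
    # arbitrary phone-home destinations, no easy path to IMDS.
    egress {
      to {
        namespace_selector {
          match_labels = {
            "kubernetes.io/metadata.name" = "internal-api"
          }
        }
      }
      ports {
        port     = "443"
        protocol = "TCP"
      }
    }
  }
}
```

Even if you never apply it as-is, writing something like this down forces the conversation about what that service actually needs to talk to.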
6:57
You don't need to become a React expert. You just need to treat
7:00
this like what it is: a serious server-side
7:03
RCE that just happens to be riding in on a front
7:07
-end stack. All right, let's move from "everything
7:09
is on fire" to how one of the big kids is evolving
7:13
their database story. Netflix published a case
7:17
study about consolidating a big chunk of their
7:20
relational database fleet onto Amazon Aurora
7:23
Postgres. The headline numbers, they quote, are
7:26
up to 75% better performance and almost 30%
7:31
cost savings for some workloads. Now you should
7:34
always be a little skeptical of round numbers.
7:37
But the pattern is pretty familiar. They had
7:39
a bunch of self-managed Postgres clusters scattered
7:43
around. Each one had its own tuning, its own
7:46
backup setup, its own failover behavior, its
7:49
own on-call expectations. Over time, that turns
7:53
into a huge operational tax. Moving to Aurora
7:56
gave them a few things: managed failover and
8:00
backups instead of writing and maintaining that
8:04
themselves, a more uniform story for observability
8:07
and performance tuning, and the ability to simplify
8:10
sizing and autoscaling in a more consistent
8:13
way. From our side of the fence, the interesting
8:16
question is not "should we be Netflix?" It's "where
8:20
do we get outsized value from letting the cloud
8:23
provider manage more of the boring stuff?" If
8:26
you have a handful of big, weird Postgres clusters
8:29
with very tight latency requirements, very custom
8:33
extensions, or unusual replication topologies,
8:37
self-managed might still be the right call.
8:40
But if you have 30 or 50 small-to-medium Postgres
8:43
instances that all need roughly the same reliability
8:47
story and none of them are super special snowflakes,
8:50
something Aurora-like starts to look pretty
8:53
attractive. The trade-offs are similar to what
8:56
we just talked about with Confluent. You're consolidating
8:59
onto one managed platform. You get reliability
9:03
and lower ops overhead at the cost of more vendor
9:07
lock-in. There's no free lunch there. The takeaway
9:10
I'd want people to get from the Netflix piece
9:12
is not "oh cool, Aurora is magic." It's that you should
9:16
occasionally step back and ask if the way you're
9:19
running your databases is still the right shape
9:22
for the scale you're at now. If you built a fleet
9:25
of hand -tuned clusters back when you had five
9:28
apps, that might not be the right model now that
9:32
you've got 100. All right, let's knock out a
9:35
quick lightning round. First up, OpenTofu 1.11.
9:39
OpenTofu keeps moving quickly, and 1.11
9:43
brings some nice language features. There's an
9:47
enabled meta-argument you can use on resources
9:50
and modules to conditionally include things without
9:53
the old count = 0 hacks. And there's
9:56
support for ephemeral values, so you're not forced
9:59
to jam every intermediate into state forever.
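To give you a flavor, here's a rough sketch of what those two features look like as I understand them from the release notes. The module path and variable names are invented, and the exact syntax is worth double-checking against the 1.11 docs before you lean on it.

```hcl
# Sketch only -- names are invented, syntax per the 1.11 release notes.

variable "enable_debug_tooling" {
  type    = bool
  default = false
}

# The enabled meta-argument: conditionally include a whole module without
# the old count = var.x ? 1 : 0 dance and the [0] indexing it forces.
module "debug_tooling" {
  source  = "./modules/debug-tooling"
  enabled = var.enable_debug_tooling

  cluster_name = "demo"
}

# An ephemeral value: usable during the run, but not persisted into state.
variable "bootstrap_token" {
  type      = string
  sensitive = true
  ephemeral = true
}
```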
10:03
I'm not going to go line by line through the
10:05
changelog. But if you've been Terraform-curious
10:08
about OpenTofu, this is a good excuse to try
10:11
it out on a small non-critical stack and see
10:14
how painful or painless the migration is. At
10:18
minimum, be aware of it so you're not surprised
10:21
when someone on your team says, "hey, can we standardize
10:24
on this instead?" Next, a Terraform tips post
10:27
from Rose Security that I liked. It's one of
10:30
those "small things that add up" articles: stuff
10:33
like using one() instead of [0]
10:38
when you really expect a single value, shaping
10:41
variables as objects with optional attributes,
10:45
so you're not passing around random maps, that
10:47
kind of stuff.
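As a quick illustration of those two tips, with invented names:

```hcl
# Tip 1: one() instead of [0]. If the list unexpectedly holds zero or more
# than one element, one() fails with a clear error instead of silently
# grabbing whatever sits at index zero.
locals {
  # Imagine this came from a filtered data source or a conditional resource.
  matching_subnet_ids = ["subnet-0abc123"]

  primary_subnet_id = one(local.matching_subnet_ids)
}

# Tip 2: shape inputs as an object with optional attributes and defaults
# instead of passing loose maps around and hoping the keys line up.
variable "service" {
  type = object({
    name           = string
    replicas       = optional(number, 2)
    enable_logging = optional(bool, true)
  })
}
```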
10:50
This is the kind of post I'd quietly drop into your Terraform channel or internal
10:53
docs, and then steal ideas from the next time
10:56
you touch a core module. You don't have to refactor
11:00
everything at once, but tightening up the patterns
11:03
over time does pay off. Third lightning item:
11:07
Ghostty going non-profit. Ghostty is a GPU-accelerated
11:11
terminal emulator that's gotten really popular
11:14
and Mitchell Hashimoto announced that it is now
11:17
under a non-profit umbrella instead of being
11:21
a commercial product in waiting. I'm not going
11:23
to pretend that this is a pure DevOps story,
11:26
but I do think it's interesting to see a dev
11:29
tool with this much traction explicitly choose
11:32
a non-profit ownership model. After watching
11:36
things like Terraform's license change and all
11:39
the drama around open-core tools, it's kind of
11:42
refreshing to have a core tool say, "nope, this
11:45
is going to be community governed." And last lightning
11:48
item: spec-driven dev with AI. GitHub released
11:52
SpecKit, and there's also a project called OpenSpec
11:56
from Fission. Both are playing in the same space.
12:00
Instead of just prompting your AI assistant with,
12:03
"hey, write some code," you start with a structured
12:06
spec that says what you're building and how you'll
12:10
know it's correct. And then you let AI generate
12:13
plans, code, and tests anchored to that spec.
12:17
From a platform perspective, I think this is
12:20
the only sane way AI is going to touch infrastructure
12:23
at scale. Imagine a spec for "new service on our
12:27
platform," or "standard Kubernetes app," or "new
12:31
CI pipeline," and then the assistant uses that
12:34
spec to generate Terraform, Helm, and policy
12:37
that fits your patterns. We're not there yet
12:40
in most shops, but watch this space. It's way
12:44
better than "let the bot randomly edit production
12:47
YAML." Alright, let's talk about incidents and
12:50
brains for a minute. For the human bit this week,
12:53
I want to pull from an article called Your Brain
12:56
on Incidents. It's about what major incidents
12:59
actually do to people, not just systems. If you've
13:03
ever been on a multi -hour call where everything
13:07
is breaking and the pager will not shut up, you
13:10
already know the feeling. Tunnel vision, bad
13:12
decisions, snapping at teammates, that sort of
13:15
thing. The article talks about cognitive load
13:18
and stress responses in a pretty approachable
13:22
way. When you're in an incident, your brain is
13:25
juggling a ton of context, logs, dashboards,
13:29
Slack, tickets, leadership asking for updates.
13:33
On top of that, if you don't feel safe saying,
13:36
"I don't know" or "we need to slow down," your brain
13:38
goes into pure defensive mode. That's where blamey
13:42
cultures make everything worse. If every mistake
13:45
gets dissected in the most painful way possible,
13:48
people will hide information during the incident
13:51
and the postmortem. You lose exactly the insight
13:54
you need to get better next time. So what do
13:57
you do with that as an SRE or platform lead?
14:01
One, be explicit about expectations during big
14:04
incidents. It's okay to say we're going to pick
14:07
one hypothesis at a time and we're allowed to
14:10
be wrong. It's okay to say someone needs to be
14:13
the scribe and someone needs to tell leadership
14:15
to wait five minutes for an update. Two, design
14:18
your incident process so it takes some load off
14:21
of the humans. That could be better runbooks,
14:24
better dashboards, or just a clear template for
14:27
how to structure a Slack channel during an event.
14:31
Three, in the review, focus more on how people
14:34
reasoned under pressure and less on who typoed
14:37
the command. You want people to feel safe saying,
14:40
"I was fried and I misread the graph," because
14:43
that's how you find systemic fixes. And maybe
14:46
most importantly, remember that people have a
14:49
limited number of massive incident nights in
14:52
them before they burn out. That's not a moral
14:55
failing, that's just biology. All right, that's
14:58
it for this episode of Ship It Weekly. We talked
15:01
about IBM buying Confluent and what that means
15:04
for streaming and vendor risk. React2Shell
15:08
and why a React vuln absolutely still belongs
15:12
on your platform radar. And Netflix's move to
15:16
Aurora Postgres as an example of rethinking your
15:20
database fleet. In the lightning round, we hit
15:22
OpenTofu 1.11, some Terraform cleanup ideas,
15:27
Ghostty going non-profit, and spec-driven development
15:31
with AI instead of free-form prompt chaos. And
15:35
we closed with Your Brain on Incidents, and a
15:38
reminder that your systems aren't the only thing
15:42
taking damage during a bad outage. I'm Brian,
15:45
this is Ship It Weekly by Teller's Tech, thanks
15:48
for bearing with the cold voice version of me.
15:51
I'll drop the links for this episode in the show
15:54
notes and I'll see you in the next one.