0:00
AI is making it easier than ever to create more
0:03
software, more code, more diffs, more experiments,
0:07
more changes moving through systems that were
0:10
already complicated before everybody got a robot
0:13
assistant in the editor. And that sounds great
0:15
until you are the team responsible for production
0:18
because production does not really care that
0:21
the code was generated faster. It still cares
0:24
about latency. It still cares about overload.
0:27
It still cares about dependencies, rollbacks,
0:30
traffic routing, region failures, weird edge
0:34
cases, and whether the people on call actually
0:37
know what to do when the system starts acting
0:40
strange. That is the part of the AI conversation
0:43
that feels under-discussed to me. Not just can
0:46
AI write code, it can. Not just can AI help debug,
0:50
sometimes yes. The better question is what happens
0:54
to reliability when the volume of change goes
0:57
up, the amount of generated code goes up and the
1:00
context behind that code gets thinner. Because
1:03
incidents are not just caused by bad code. They
1:06
are caused by systems behaving in ways people
1:09
did not expect, under conditions they did not
1:13
practice, with dependencies they forgot were
1:16
part of the critical path. That is really what
1:18
this conversation is about. Not just SLOs, not
1:22
just incident reviews, not just AI agents. More
1:25
like, what does reliability actually look like
1:28
when change is accelerating, systems are more
1:32
connected, and recovery matters just as much
1:35
as prevention? I'm Brian Teller from Teller's
1:38
Tech, and this is Ship It Weekly.
1:58
Welcome back to Ship It Weekly, where I filter the noise and
2:00
focus on what actually matters when you are the
2:03
one running infrastructure and owning reliability.
2:06
Most weeks, it's a quick news recap. In between
2:09
those, I do conversation episodes with people
2:12
who are building platforms, running infrastructure,
2:15
organizing events, and thinking through where
2:18
this industry is actually headed. Today is one
2:21
of those conversations. I'm joined by Francois
2:24
Richard, an engineering director at Meta. We're
2:27
talking about reliability at scale, how AI and
2:30
automation are changing production risk, what
2:33
teams actually learn from incidents, and why
2:36
recovery practice matters just as much as prevention.
2:40
And I like this conversation because it gets
2:43
past the neat, clean version of reliability that
2:46
fits nicely in a dashboard. Reliability is not
2:49
just a number. SLOs matter, dashboards matter,
2:53
alerts matter, guardrails matter. But none of
2:56
that means much if the team cannot use it to
2:59
make better decisions, recover from real failures,
3:02
and improve the system after something breaks.
3:05
You start with the obvious goal. Keep the system
3:08
up. Keep the users happy. Avoid incidents. Do
3:12
not page people for nonsense. Do not let every
3:15
deployment feel like a coin flip. Then reality
3:18
shows up. A region has problems. A dependency
3:22
gets slow. A service gets overloaded and cannot
3:26
restart because it is too overloaded to recover
3:29
cleanly. Traffic spikes during a major event.
3:32
A change rolls out faster than the team understands
3:35
it. Or an AI-generated pile of code technically
3:38
works until it fails in a way nobody has enough
3:41
context to explain quickly. And somewhere in
3:45
the middle of that, reliability stops being just
3:48
prevention and starts becoming practice. In this
3:51
conversation, Francois and I talk about how Meta
3:55
thinks about reliability across both reactive
3:58
and proactive sides. Incident response, incident
4:01
reviews, SLOs, guardrails, validation, disaster
4:06
recovery testing and what happens when you actually
4:09
practice taking a region out instead of just
4:12
assuming the failover plan works because the
4:15
diagram looks good. We also get into what teams
4:18
learn during incidents, not just from the
4:20
postmortem afterward but inside the pressure cooker
4:23
itself, where engineers have to make decisions
4:26
quickly, build consensus fast, understand the system
4:30
under stress, and figure out what is actually
4:33
happening while the clock is running. There is
4:36
also a good thread in here around AI agents in
4:39
incident response. Not the fantasy version where
4:42
you hand production to an agent and hope it saves
4:45
the day. More the practical version, where AI helps
4:48
with investigation, telemetry, metrics, logs, relationships
4:52
across services, and narrowing down what might
4:56
be happening faster than a human clicking around
4:59
dashboards alone. And towards the end, we talk
5:02
about recovery practice, known failures versus
5:05
unknown failures, why teams should test the failure
5:08
modes they claim they can survive, how smaller
5:11
teams can learn from Meta-scale reliability patterns,
5:14
and why not every system needs six nines on day
5:19
one. So if you work around DevOps, SRE, platform
5:23
engineering, infrastructure, engineering leadership,
5:26
incident response, or you are just trying to
5:29
figure out what reliability looks like when AI
5:31
is increasing the volume of change, this one
5:34
should be worth your time. All right, let's jump
5:37
in.
5:45
Today, I'm joined by Francois Richard. He is an engineering director at Meta, and we're
5:47
talking about reliability at scale, how AI and
5:50
automation are changing production risk, what
5:53
teams learn from incidents, and why recovery
5:56
practice matters just as much as prevention.
5:59
Francois, thank you for joining me. Thank you,
6:02
thank you. Happy to be here. Okay, so when you
6:04
think about reliability at Meta scale, where does
6:08
the work actually start? I think for me, reliability
6:13
is actually two things. What people kind of like
6:16
know usually it's the reactive side is how do
6:19
we manage incidents? How do we respond to problems?
6:23
And how do we handle things and recover and fall
6:27
back when something goes bad? But at the same
6:31
time, it's also everything that I call like,
6:33
you know, the proactive side of the house. This
6:35
is everything that has to do with, hey, how are
6:40
we going to prevent reliability problems from
6:44
happening? What sort of guardrails do we put
6:47
in place? What sort of framework? What sort of
6:51
tooling do we allow platform or do we give to
6:55
platform services and application developers
6:58
to use? And also, what is the real validation
7:02
that we are doing? There's a lot of papers available
7:07
in the public and so on around like, you know,
7:10
there's something that we call like the Storm
7:12
program. And it's basically our way to test like,
7:15
you know, disaster recovery, where once in a
7:18
while we take an entire region out, out of the
7:23
like, you know, like if there was an earthquake
7:25
or there was like a giant electrical grid failure.
7:29
And we test what happens with the system. How
7:32
do they recover? How do they handle the spike
7:35
in traffic and so on? So it's both like for me,
7:37
reliability is both like, you know, the reactive
7:39
side and also the proactive side. Is that kind
7:43
of like chaos engineering, the concepts? Yes,
7:46
yes, yes, exactly. And, you know, we go as far
7:49
as, you know, disconnecting a complete region from
7:52
the map. Interesting. Okay, so speaking about
7:55
reliability, what makes an SLO useful instead
7:59
of just another dashboard number? Like, how do
8:02
you think about that? Yes, an engineer, like
8:03
a senior engineer that I work with always says
8:06
that the SLO of something, of a system, is actually
8:11
the promise that you're making to your customer.
8:15
That, you know, XYZ should work 99.99% of the
8:19
time and you should expect that amount of latency
8:23
on average at, you know, P90 or something like
8:27
this. So we use this, like, you know, most of
8:31
the platform we operate at Meta and also the
8:34
product services and so on. They will all, the
8:39
big ones, will all have some form of SLO for
8:43
the top APIs or the top services that they are
8:46
doing. We have some alerts that will trigger
8:49
on SLO, but usually this is too late. Like, you
8:52
know, we want to catch problems. The SLO allows
8:55
us in the long run to actually figure out like,
8:59
hey, are we investing enough in reliability?
9:03
Are we keeping the bar high enough? Or do we
9:07
have wiggle room to take some risk, introduce
9:12
new features? Because you all know that when
9:15
you introduce or you do a giant migration, there's
9:18
a period of time where it takes time to stabilize
9:21
things. So the SLOs are helping us make these
9:25
decisions in the long run. And you can clearly
9:28
see when you've been tracking them for a while
9:30
that some systems are trending up and some systems
9:34
are trending down. And for the ones that are
9:37
trending down, it allows you, it gives you the
9:39
argument and the data to start investing more.
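A minimal sketch of how an SLO can feed that kind of investment decision, assuming the SLO is tracked as an error budget over a window of requests. The `SloWindow` shape, the function names, and the numbers in the usage note are hypothetical and are not taken from Meta's tooling.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one reporting window (hypothetical data shape)."""
    total_requests: int
    failed_requests: int


def error_budget_remaining(window: SloWindow, slo_target: float) -> float:
    """Fraction of the error budget left; negative means the SLO was missed."""
    allowed_failures = window.total_requests * (1.0 - slo_target)
    if allowed_failures <= 0:
        raise ValueError("window has no traffic or the target leaves no budget")
    return 1.0 - window.failed_requests / allowed_failures


def budget_trend(remaining_per_window: list[float]) -> str:
    """Compare the oldest and newest windows to label the long-run direction."""
    if len(remaining_per_window) < 2:
        return "flat"
    delta = remaining_per_window[-1] - remaining_per_window[0]
    if delta > 0:
        return "up"
    if delta < 0:
        return "down"
    return "flat"
```

With a 99.99% target and one million requests, roughly 100 failures are allowed, so 50 failures leaves about half the budget. A service whose remaining budget keeps shrinking window after window is the "trending down" case that gives a team the data to argue for more reliability investment.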
9:43
So this is kind of like, you know, our way about
9:46
deciding if we have enough investment in an area
9:50
or if, you know, we are okay and we can take
9:53
more, you know, progressive risk. So you had,
9:57
we talked a little bit about incidents before
10:00
the show, and then you had just mentioned incidents.
10:02
Across the incident lifecycle, where do teams
10:05
actually learn the most? So in terms of learning
10:08
during incidents, I think it happens in two phases.
10:11
First, it happens while handling the incident
10:14
during what I call like the pressure cooker.
10:17
When you actually need to handle the incident,
10:20
you actually need to make decisions quickly,
10:24
build consensus fast with a small group of people
10:28
about what's the next action to take. And then
10:31
what are the bypass, the restart process, which
10:35
system is calling what. This is actually a time
10:40
where a lot of people, like junior engineers
10:43
and senior, they learn a lot. They learn a lot
10:45
about the system, but they also learn a lot about
10:47
team dynamics and themselves during that period.
10:52
Then the next phase happens when kind of like
10:56
you build the... incident review report or we
11:00
call it the SEV report, where you basically,
11:03
you know, once the incident is handled, mitigated,
11:07
you know, a few days after, you basically sit
11:09
down and then we have like a tool to do this.
11:12
And then people sit down and they write a report
11:15
around like, you know, hey, what exactly happened?
11:18
What was the root cause of this? What was the
11:21
exact timeline, which is kind of like forcing
11:24
people to go back into logs and all of this?
11:26
Could the alerts have fired earlier? Did we have
11:29
the right instrumentation? Did we diagnostic,
11:32
like diagnose this thing, like, you know, the
11:35
proper way? And, you know, what are the follow-up
11:37
tasks, like to make sure that this doesn't
11:40
happen again, or that we have some level of automation
11:43
to be able to remediate this type of problem
11:46
automatically. So you build that report. But
11:49
then from there, you go into a series of reviews.
11:54
You review it with your team, with your organization.
11:57
And if the incident was severe enough, we actually
12:01
reviewed them at the company level. And during
12:04
that time, it's a big opportunity to get feedback
12:06
from other teams, from other senior engineers
12:10
that, yeah, we see this pattern. Here's, you
12:12
know, how we handle this type of thing. And you
12:14
actually learn a lot during that process. So
12:17
this is usually like, you know, the learnings
12:19
happen in kind of like two phases. Yeah. And
12:23
so given that learning, what should an incident
12:26
review actually produce? Like once the incident's
12:28
over? The incident review, like we have a report
12:30
type of thing. It's a template. It has like,
12:34
you know, walks you to, for you to get like to
12:37
the root cause of it. But I think the most important
12:41
thing that it produces is also like, you know,
12:43
what are the follow-up actions? The follow-up
12:45
could be like, hey, we need to implement more
12:48
redundancy. We need to implement better diagnostics.
12:50
We need to implement better ways of handling
12:53
these exceptions or problems. We need a different
12:56
deployment process. This is usually like, you
12:59
know, the best outcome we could get because,
13:01
you know, the more you mature your system, the
13:03
more you'll be able to cover a lot of these,
13:06
like, kind of like unknown slash unexpected situations.
13:10
So this is usually like the good outcome of these.
13:12
And what do you find is the difference between
13:14
finding the cause versus improving the system?
13:17
The difference between the cause and improving
13:21
the system. Or is there a difference? There is
13:27
a bit. Because for me, finding the cause overall
13:31
is, do I have the right tools to pinpoint the
13:37
problem very quickly. And then I'll be able to
13:39
find the cause. And then the improvements are more
13:43
about eliminating this exact problem
13:47
or that class of problem. And the improvement
13:50
can be that like you know, like from a code structure
13:53
that this will always happen, but you can work
13:57
around with either deployment strategy, automation
14:01
of recovery and things like this so that it's...
14:05
It's not visible to the end users with strategies
14:08
of retries and things like that. So there's a
14:11
big difference between finding the cause. Like
14:13
if you work in mobile apps, you know that there's
14:16
going to be a couple of cases where like the
14:18
cause is going to be like, you know, the mobile
14:19
networks just drop on you. And the app needs over
14:23
time to come up with ways to mask it. You'll
14:26
never eliminate the cause, but you'll have apps
14:30
that are a lot better at masking this type of
14:33
problem. So it's slightly different. Okay, so
14:34
let's switch gears to AI agents. Are you using
14:37
AI in incident response at all? Yes. In triaging?
14:42
We use it for what I would call like investigation.
14:48
So, you know, there's usually like two classes
14:51
of production incidents. You have one class when
14:54
it happens. You kind of know what this is. Like,
14:57
you know, it's a pattern that you already recognize.
15:00
The alarms or alerts are very clear about what's
15:03
happening. So you kind of know and you can narrow
15:05
down and then you can just go focus on, hey,
15:08
you know, how do I remediate that? There are
15:10
some of the cases like where you kind of see
15:12
like, oh, I'm starting to see transient errors.
15:15
I'm starting to see latency creeped up and where
15:19
it's kind of like, I'm kind of not sure. And
15:23
then usually like, you know, we launch what's
15:25
called like investigations and things like that.
15:29
Really, the AI agents are helping us in investigation
15:33
because they can analyze and gather large swath
15:37
of data very, very quickly and then pinpoint
15:41
relationships between datasets. Like, you know,
15:46
it's probably the same for everybody, but we
15:47
have multiple layers of software on top of each
15:50
other. They call each other and they call back.
15:54
Sometimes finding complex situation requires
15:56
us to analyze a lot of data at the same time.
16:00
Even myself personally, like, you know, we have
16:02
like centralized, like, you know, monitoring
16:05
system. And then if you aggregate a lot of data
16:09
with very small granularity, like, you know,
16:13
at the 10 second or like at the minute level,
16:15
and you have a lot of time series, not the data
16:17
point, it gets a pain to actually like get the
16:21
right query and the right graph for the problem
16:25
that you're investigating at a point in time.
16:27
Like, so the AI agent has kind of like shifted
16:29
this thing where they help us find these things.
16:33
And even, like, you know, in some cases, like, what
16:35
I do is, like, I will dump all the metrics locally
16:38
in the SQLite database and then, like, do all sorts
16:41
of things, because it's local and my SSD can do
16:44
it a lot faster. And then you use also the AI
16:48
agents to help you craft the queries. I don't
16:51
remember on top of my head all the various flavors
16:54
of joins and all of that. So it's helping a
16:57
lot. So with investigation, it helps a lot. It's helping
17:02
a lot. It's not perfect. You still need to guide
17:06
it very well because the topology of the system,
17:11
the request... The request flow between the system
17:16
is still not something that I will call it like
17:20
the various AI agents are actually mastering.
17:23
They master code very well. But these flows are
17:28
a little bit more difficult and understanding
17:29
how they map into the application or the product
17:33
flows is even more complicated. But it does help.
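The local-SQLite workflow mentioned a moment ago could look roughly like the sketch below. The table names, column names, thresholds, and sample rows are all assumptions made for illustration; the conversation does not describe the actual schema or queries used.

```python
import sqlite3


def load_metrics(conn, latency_rows, error_rows):
    """Create two hypothetical metric tables and bulk-insert the dumped rows."""
    conn.execute("CREATE TABLE latency (service TEXT, minute INTEGER, p90_ms REAL)")
    conn.execute("CREATE TABLE errors (service TEXT, minute INTEGER, error_count INTEGER)")
    conn.executemany("INSERT INTO latency VALUES (?, ?, ?)", latency_rows)
    conn.executemany("INSERT INTO errors VALUES (?, ?, ?)", error_rows)


def suspicious_minutes(conn, p90_threshold_ms, error_threshold):
    """Join the two series and keep minutes where latency and errors are both high."""
    query = """
        SELECT l.service, l.minute, l.p90_ms, e.error_count
        FROM latency AS l
        JOIN errors AS e
          ON e.service = l.service AND e.minute = l.minute
        WHERE l.p90_ms > ? AND e.error_count > ?
        ORDER BY l.minute, l.service
    """
    return conn.execute(query, (p90_threshold_ms, error_threshold)).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # a file path would work the same way
    load_metrics(
        conn,
        [("feed", 1, 120.0), ("feed", 2, 900.0), ("auth", 2, 80.0)],
        [("feed", 1, 0), ("feed", 2, 42), ("auth", 2, 1)],
    )
    for row in suspicious_minutes(conn, 500.0, 10):
        print(row)
```

The join is the part an AI assistant can help draft; the human still has to know which series are worth correlating, which matches the point that the agents handle code well but not the request topology.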
17:36
That's the investigation side of it. We're doing
17:40
a lot also into what I call like having a standard
17:45
dataset across the company that kind of like
17:48
represents the state of reliability. Like we
17:51
drive a couple of initiatives globally, like
17:54
we call it like change safety. Do you have SLOs
17:58
for something? Do you have guardrails? And with
18:02
these datasets being kind of official, then it's
18:06
easy for each service team or product team to
18:10
build their own customized kind of dashboard
18:12
like right now the cost of making dashboards is
18:15
almost free and really tailored to their use cases
18:19
or to their product using the standard dataset
18:22
and that's another place where it's been like
18:23
oh, this has been great. And how are you, I'm assuming
18:28
you're still validating that that root cause,
18:30
though, right? If it comes up with a, you know, it
18:32
uses MCP servers, it's reaching out to Argo, Kubernetes,
18:35
whatever, it's gathering all the telemetry from
18:38
Grafana, and then it's putting that together and
18:41
it's saying, hey, this pod was crash loop back
18:44
off. This caused this issue. Are you still validating
18:48
that? And what is that process for validating
18:50
that? Like, is there a standard SOP that you're
18:52
following? No, I don't think we have like a standard
18:55
evaluation. The evaluation, like most of the
18:59
time, the tools that you have in the moment from
19:02
an incident is either you move away from the
19:05
problem or you roll back the problem. So we will
19:08
try, like, you know, first, you know, move away. Let's
19:11
call it like a problem is a backend problem,
19:14
it's isolated to a region. We'll move away from
19:17
that region to kind of like prove that theory,
19:19
and then we'll probably during that time roll
19:23
back the system or the backend or whatever
19:26
it was that introduced that problem. And then
19:29
you can imagine that we can then, now that the
19:32
hypothesis is live, then we can shift a portion
19:35
of traffic back to that problem and see, like,
19:38
oh, is it still happening or not? This is all manual.
19:41
We have some of these process that are automated,
19:45
but they really boxed into very, very well known,
19:48
like, you know, they had like very well established
19:51
runbook that have been like, you know, in place
19:55
for the longest time, that we know that when this
19:57
happened, do this. Like it's almost like if then
20:00
else. But the most like the investigation result,
20:04
the validation is more to roll out the fix, you
20:09
know, driven by human with like, you know, slow
20:11
roll type of techniques. Do you think that AI
20:14
is increasing the speed of change, the volume
20:17
of change or the type of failure that teams need
20:20
to expect? But it's clear that it's increasing
20:23
the volume of change. We have data internally
20:27
that, you know, the number of diffs is through
20:30
the roof. The number of lines of code changed is,
20:34
again, through the roof. The type of failure
20:37
is like one thing that, because there's more
20:41
code and it's going faster, it's the type of
20:45
failure becomes more into the, what I would call
20:48
like pseudo-unexpected. When a product developer,
20:52
application developer, when they... roll out
20:55
a new feature, they do understand the business
20:58
logic. They do understand how it should show
21:01
to the user, what they expect the user to do,
21:04
where's a corner where a user could get stuck.
21:07
They understand that. That's not a problem. But
21:10
if you have to build on modern apps on mobile
21:15
phones or on backend, a lot of these things are
21:19
built on top of frameworks. There's a lot of boilerplate
21:21
stuff, especially to handle like, you know, async
21:24
callbacks and things like this. And since a lot
21:27
of that code gets auto-generated, people are
21:30
losing context around these things. And when
21:33
something doesn't work, we realize that it takes
21:37
us more time to understand why it doesn't work.
21:41
Because there was not that additional context
21:44
that... the developer or that team developing
21:47
that feature took a lot of time to craft it and
21:49
a lot of like internal reviews and things like
21:51
that. So, so it goes faster out, but that context
21:55
on how it's built is actually lost along the
21:57
way. And this is basically what we need to reconstruct
22:01
live. And again, like I said earlier, a lot of our
22:05
tooling is, like, you know, in this case we're just
22:07
like, we'll roll back and you figure it out. Like,
22:09
bring back the system to a stable state, and then
22:12
you can figure it out on your own time without
22:15
being on the pressure cooker. Okay, so speaking
22:17
along the same lines, you had mentioned too that
22:21
recovery comes down to practice. So what does
22:23
good recovery practice actually look like? You
22:26
know, there's a set of standard failures you expect
22:29
your app to have. And it starts at, you know,
22:34
if you're running, if you have data centers or
22:37
region in the United States or in Europe or something
22:41
like that, you get... tons of things like fiber
22:44
cut in hurricanes and, you know, power grid and
22:48
things like that. You need to understand how
22:51
your system will react. Like, you know, a simple
22:54
thing, like if I'm a startup and I run in two
22:57
AWS regions, but I actually never test the fact
23:01
that one of them will go down, I can guarantee
23:04
you, like, as soon as you get a little bit of
23:06
complexity in your system, something will not
23:09
work as expected in there. For us, it's actually
23:13
to do, to validate it, to not just test, like
23:16
we run, I would call it like data simulation
23:19
and analysis in advance to figure out like, okay,
23:23
if we lose that data center, we'll be okay. But
23:26
it is always interesting. And then after we've
23:29
done the validation and vetted that it all worked,
23:33
we actually do the test and we discover a suite
23:35
of other problems that the analysis had never
23:38
found. So exercising the failure for real at
23:42
the lower scale allows you to find a lot of these
23:46
problems. And it's the same thing like, you know,
23:49
if you have very sensitive hotspot in your application
23:53
and your product, injecting failure for real
23:56
at a very low percentage of user will... allow
23:59
you, okay, do I have the right detection? Like, did
24:01
that trigger properly? Did it get to the right
24:04
person? Did this team know how to handle it? Like,
24:08
without this, it's, it all becomes like improv,
24:11
like you improv all the time, versus when you
24:14
have a portion of your infrastructure that's
24:16
always tested in such a way, first you keep on
24:18
hardening and it gets better and it gets better,
24:20
and the muscle of handling it gets better also.
24:23
So I guess that I was going to ask about like
24:26
how, how would teams practice known failures
24:28
versus unknown failures? But it sounds like it
24:31
comes down to just that practicing and pulling
24:33
out the regions. Okay. And there's, there's a
24:36
bunch of types of failure. Like, you know, I,
24:37
I made an allusion about worst-case failures
24:40
versus worst-case failure, which is, um, you
24:45
know, you lose a complete region. And with the
24:48
weather patterns in the U.S., you know, we have
24:50
hurricane season, we have like fire season, tornado,
24:54
like the cuts happen everywhere. So you got to
24:57
be ready. But there's also other things like,
25:01
you know, for Meta, we are often, I would call
25:05
it like we got, we get these traffic spikes.
25:09
It could be because there's something in the
25:12
news related. It could be a big event like the
25:14
Super Bowl, New Year's Eve. The World Cup's coming.
25:17
So we know that every time there's going to be
25:18
a goal, it's going to spike. What we realize
25:21
over time with practice is that some system will
25:25
be overloaded. And in some cases, you'll have
25:28
to restart them. And during overload, the damn
25:32
thing cannot restart. You cannot kill it. Like
25:35
it's all completely dead. Like it's completely
25:38
blocked. So we practice a lot of these cases
25:40
too. And it's something specific to us, we get
25:45
these spikes that are kind of unexpected. Most
25:47
systems have some form of overload protection
25:50
up to a point, but it's something we really had
25:54
to invest in, into like, in the worst load scenario,
25:59
can I restart? Do I have, like, is my process
26:03
smart enough to just accept requests later down
26:06
the road? Or do I have enough control in the
26:10
traffic routing to, like, okay, choke the traffic,
26:12
bring back up, warm up your memory, and then,
26:15
like, gradually. Like, we had to practice that
26:18
a lot. Yeah, so you're dealing with, like, thrashing
26:20
and having, like, no... Yeah. Yes, yes. Resources
26:24
being completely exhausted. Yeah, yeah. Completely
26:26
gone. And is that typically at the control plane
26:32
level? Or... Some of our control plane is pretty good, but,
26:34
you know, there's some backend. Like you can
26:36
imagine like, you know, an Instagram feed backend
26:39
or Facebook feed backend,
26:43
same type of things. And we have a couple of
26:47
academic paper. People can look at it. It's called
26:50
like Taiji, I believe. And then when we discuss
26:53
these things on how we can reroute traffic and
26:57
then control and then remove it from an area
27:00
to allow the system to reload. Okay. So with
27:04
that in mind, I would imagine most listeners
27:07
are not operating at Meta scale, so they're not
27:10
dealing with every time there's a goal at the
27:12
World Cup, you know, their servers go down. What
27:14
ideas can actually transfer for most like SRE
27:17
DevOps that are listening? I think the example
27:20
I gave earlier, like you often see like no blame
27:24
on Amazon, but if US East go down, like, you know,
27:29
we see the world of the internet like go like
27:31
berserk. So I'm like, guys, like, you know, you
27:34
got to start testing these cases. Like you've
27:36
seen it happen, like, you know, at least like,
27:39
you know, three or four times in the last two
27:40
years. You got to start like, you know, having
27:44
your dual regions and really validating it. And
27:47
maybe it's a question about like, you know, every
27:50
Tuesday, I do not run a single thing in US East,
27:56
you know, whatever the name is these days
27:58
to make sure that it's, you know, it's performing.
28:02
And then it's kind of like tackling the problems
28:05
in a way like, you know, when you look at your
28:08
incident inside a small, a smaller company, medium
28:12
sized startup or so on, like I get a list of
28:15
incidents. There's a point where you start having
28:18
enough data that you could say, okay, most of
28:22
them are caused by, it's because we launched
28:25
that feature or because we're doing configuration
28:28
change or because we have a billing problem.
28:31
It's starting to trend your SEV data and
28:36
figure out, hey, what are the top two or three.
28:38
Then we can start attacking and really focusing,
28:41
because, like, you can focus on everything, but
28:44
you'll get nowhere. It's really start bucketizing,
28:47
like, you know, in having the discipline of writing
28:51
it down. Like, it's not that hard to write like
28:53
a SEV report, there's a lot of example in there,
28:56
and having the discipline of going back. And now,
28:59
with LLMs, do the analysis is kind of like trivial.
29:03
You need to, we used to need to allow a lot of
29:05
people to do this and now it's a lot easier.
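The bucketizing step described here can be sketched in a few lines, assuming each SEV report has already been tagged with a single cause category (by hand or with an LLM). The `cause_category` field and the category names in the usage note are hypothetical.

```python
from collections import Counter


def top_causes(sev_reports, n=3):
    """Count reports per cause category and return the n most frequent buckets."""
    counts = Counter(report["cause_category"] for report in sev_reports)
    return counts.most_common(n)
```

Given, say, three reports tagged `feature_launch`, two tagged `config_change`, and one tagged `billing`, `top_causes(reports, 2)` returns the first two buckets with their counts, which is the short list a small team would attack first.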
29:07
So I think that's a big opportunity to actually
29:10
really focus on where reliability matters. Yeah,
29:13
that's fair. You had mentioned too that reliability
29:15
expectations should match the system and product
29:18
life cycle. Yes. Can you talk about that? Like,
29:21
because 100% uptime is not always like the right
29:25
goal. Yes, yes, yes, yes. There is, there's like,
29:29
you know, when I say about the expectation versus
29:32
life cycle, we should always expect, like, if something
29:36
has some level of complexity and some level
29:40
of feature, at the beginning when you roll it
29:42
out, like, you know, getting two nines or even three
29:46
nines of reliability, it's pretty good. Like,
29:50
not a simple system, that's usually
29:52
like you can get it easy, but like if it's early
29:55
in the life cycle of that product or that backend
29:58
at the beginning, it's going to be rocky and
30:01
it should be expected. Like I see a lot of people
30:04
start with the assumption that, Hey, everything
30:07
should be like six nines, six nines guys. Like
30:10
this is like insane. Like, and you invest a lot
30:14
and then you, you kind of like mix, uh, miss
30:17
your product market fit while you're doing that.
30:20
Like there are some like Facebook and Instagram,
30:24
there's a gazillion experiments that runs at
30:26
a given point in time. Some of them are not fully
30:28
reliable, like, and it's okay because we're trying
30:31
to figure out if that feature will be something
30:36
that users enjoy and use or not. But there are
30:40
other part of the app which, like, needs to be
30:42
like rock solid, and this is the place where we
30:45
invest. So depending on the life cycle, if you
30:48
are early or not, because there's a cost at investing.
30:51
And then when, especially like either you're SRE
30:55
or production engineering and you have to also
30:58
convince your PM, your product managers that
31:01
this is the case, having that conversation from
31:04
that standpoint is a lot more easier than just like
31:08
being the one that says, no, I need six nines
31:10
of reliability for everything. That's not happening.
31:13
Yeah. Well, six nines too, that's like what,
31:16
31 seconds a year of downtime? I mean, the amount
31:20
of infrastructure and cost complexity to just
31:23
keep a system at scale. Depending on the system,
31:27
I mean, I assume if it's like a very important
31:29
financial system, maybe it matters. But for the
31:31
majority of systems, like do the users really
31:34
care about 31 seconds of outage per year? I mean,
31:38
are they going to? They do care. You know, we
31:42
can see from our own data that, you know, if
31:47
we have too much problems in a row, we can
31:49
see that, you know, engagement eroding. So they
31:51
do care. They do show. But it tends to stabilize
31:54
when we stabilize things. So there's a correlation
31:58
there. But sometimes the investment to reach
32:01
that extra nine is just so high. And while you're
32:06
doing that extra nine, you're also sacrificing
32:09
not just the speed and velocity, but also your
32:12
mitigation time for future unknowns. Because
32:16
when you have invested so much complexity and
32:19
then you get into an unknown situation and then
32:22
now you have to untangle all of this, your mitigation
32:24
time will be higher. So it's kind of like it's
32:27
a delicate trade-off. Like it's not a win-win
32:30
all along. Yeah, for sure. It's a reality there.
32:34
A real conversation that you have to have and
32:36
a balance that you have to have there. Okay,
32:38
so wrapping up, this conversation also connects
32:41
to At Scale systems and reliability. Why does
32:44
that program feel relevant right now? Yeah. So
32:48
we have this suite of conferences, like it's called
32:51
At Scale. And we have like four conferences a year.
32:54
And the next one that's coming is basically systems
32:57
and reliability together. And what we want to
33:00
target for this specific conference is, hey,
33:04
how are we injecting? What are the systems that
33:09
are like, you know, building reliability for
33:12
AI? And what are the systems that are under the
33:15
hood for AI? Because people tend to talk about,
33:17
hey, we talk about the models, okay? But we never
33:20
talk about like the underlying plumbing that
33:23
is required to train and serve the model and
33:26
the insane amount of data that we need to move
33:29
back and forth. And then on the other angle,
33:32
we will also discuss all the other
33:35
areas where we actually use AI to enhance reliability.
33:39
So it's both like, you know, both cases that
33:42
we want to do. This conference is going to happen
33:45
in person. It's going to be in Bellevue. And
33:48
you can find the detail. The website is at scale
33:51
conference. And we could put that at the show
33:55
notes. Yes. So if someone works in infra SRE
33:59
platform or engineering leadership. Why should
34:03
this be on their radar? And what are you watching
34:07
specifically with the conference? I'm just curious.
34:09
Personally, I'm a big fan of this conference.
34:13
I've been part of it for a while. And one thing
34:18
that I tell the speakers that are coming from
34:21
Meta, and we have speakers from NVIDIA, from
34:23
Microsoft, from Google, the goal here, I want
34:26
to have the real technical discussion. With Meta,
34:31
I'm not in the business of selling cloud services.
34:34
Like I'm not in the business of selling API or things
34:37
like this. So what I want to talk is like, you
34:40
know, Microsoft, you have a Kubernetes cluster with
34:43
one million servers. How the
34:46
hell did you do that? Like, this is like, you
34:48
know, the goal is really like, I want to really
34:50
understand the story behind the system. I want
34:54
to have like the real technical conversation.
34:56
And I want to avoid the sales pitch of like,
34:59
oh, use that service, use that thing. So this
35:03
is really the focus of the conference is to really
35:05
have that technical discussion and also the story
35:09
behind the system. Like, I'm always fascinated.
35:15
I've always been fascinated by, like, okay, a
35:18
team starts something. Like, I've been involved
35:20
in ZooKeeper and all sorts of other things in
35:24
my days at Yahoo back then. You start something,
35:27
it was to solve a problem. Then you open source
35:29
and you're like, oh, that became this? Oh my
35:31
God. So I'm really interested in this and then
35:34
like the ups and downs of the team because you
35:37
always have like, you know, the hype at the beginning
35:39
and then, oh, the reality hits and you're like,
35:41
oh my God, like this will not work as expected.
35:44
And how do you overcome that? Like that, for
35:47
me, that's the most interesting part of all of
35:50
these stories and all of these presentations.
35:52
So, okay, wrapping up, what kind of reliability
35:55
conversations does the industry need more of?
35:58
I think right now the industry does understand
36:02
the use of AI for code generation. Like, I
36:07
think we get that. I think we can get that, we
36:10
can go all out. I think we understand that everybody
36:13
can build a custom app, like, for themselves, and
36:16
they will only use it for them. It's gonna
36:20
be perfect for them, and then it's okay if it
36:23
goes down. But I think the rest of the industry
36:27
has not kept up with that rate of change, and
36:32
there's not enough investment in kind of like
36:36
defense. Like, we're able to generate code. Are
36:39
we able to debug it faster? Are we able to understand
36:41
it faster? Are we able to troubleshoot it faster?
36:45
Like, that has not kind of followed. And
36:49
then I feel that we are catching up now. And
36:53
then the hype seems to be like mostly on, on
36:57
the model. And then there's amazing infrastructure
37:01
that had to be built underneath. And I think
37:04
you know, everybody needs to understand it a
37:06
little bit more. Yeah, for sure. Awesome. So
37:09
I will put links for At Scale Systems and Reliability
37:12
in the show notes. I can put your
37:15
LinkedIn there as well. Are there any other links
37:17
or comments that you'd like to give to the audience
37:19
before we wrap up? No, I think that's it. Don't
37:22
give up. Awesome. Thank you so much, Francois,
37:24
for coming on. Really appreciate your time. Thank
37:26
you so much. Thank you. Bye-bye. All right.
37:28
That was my conversation with Francois Richard
37:31
from Meta. My biggest takeaway from this one
37:34
is that reliability is not just about preventing
37:37
failure. The better question is, what happens
37:40
when prevention fails? Because sometimes the
37:43
answer is a rollback. Sometimes it is moving
37:45
traffic. Sometimes it is draining a region. Sometimes
37:49
it is restarting a service. Sometimes it is realizing
37:53
that the service cannot restart cleanly because
37:56
it is already overloaded, which is the kind of
37:59
fun little production detail that does not usually
38:01
show up in architecture diagrams. That is the
38:05
part that I think is worth paying attention to.
38:07
A lot of teams talk about reliability like it
38:10
is mostly a tooling problem. Get the right dashboards.
38:13
Get the right alerting. Define the SLOs. Add
38:16
some runbooks. Maybe sprinkle in some AI and
38:19
pretend the incident lifecycle is solved. But
38:22
reliability is not just the tools. It is the
38:25
practice around the tools. It is whether the
38:28
SLO actually represents a promise to users. It
38:32
is whether the alert fires early enough to matter.
38:35
It is whether the incident review produces real
38:37
follow-up work instead of just a nicer explanation
38:40
of what broke. It is whether the team has practiced
38:44
the failure mode before production forces them
38:47
to learn it live. And honestly, that is the part
38:49
of the conversation that translates really well.
38:52
Even if you are nowhere near Meta scale. Most
38:55
of us are not dealing with World Cup traffic
38:57
spikes or massive global systems, but a lot of
39:01
us are depending on a cloud region more than
39:04
we want to admit. A lot of us say we are multi-region,
39:06
but have not actually run without the
39:09
primary region on a boring Tuesday. A lot of
39:13
us have runbooks that look reasonable until someone
39:15
has to follow them under pressure. A lot of us
39:18
have services that should recover automatically,
39:21
but only if the failure happens in the exact
39:23
way we imagined. That is where the work is. Practice
39:27
the recovery. Test the boring assumptions. Look
39:30
at your incident data. Bucket the causes. Figure
39:33
out what keeps showing up. Then go after the
39:36
top patterns instead of trying to boil the ocean.
39:39
I also liked Francois' point about AI changing
39:42
the reliability equation. AI can absolutely help
39:46
with investigation. It can look across a lot
39:49
of data quickly. It can help build queries, connect
39:51
patterns, and speed up the part where humans
39:54
are trying to figure out what changed and what
39:57
is related. But AI is also increasing the volume
40:00
of change. More diffs, more generated code, more
40:03
boilerplate, more systems moving faster, and
40:07
sometimes less human context behind the code
40:10
that just went out. That is a weird trade-off.
40:13
Because if code moves faster than understanding,
40:16
reliability teams are going to feel that gap
40:19
during incidents. The system breaks, and now
40:22
someone has to reconstruct not just what changed,
40:26
but why it changed, what the generated code is
40:28
actually doing, what assumptions it made, and
40:32
how to get back to a stable state. That does
40:34
not mean AI is bad. It means the defensive side
40:37
has to catch up. Debugging has to get better.
40:41
Observability has to get better. Incident response
40:43
has to get better. Recovery practice has to get
40:47
better. And humans still need to be in the loop
40:49
for judgment, especially when the system is too
40:52
important to let a guess turn into the next mitigation.
40:56
I also think the lifecycle point matters. Not
40:59
every system needs the same reliability target.
41:02
A brand new experiment probably should not get
41:05
the same investment as a core production path
41:08
that millions or billions of people depend on.
41:12
Six nines sounds impressive until you realize
41:15
what it costs, what complexity it adds, and how
41:18
much slower it can make future changes. But the
41:22
reverse is true too. If a system becomes important
41:24
and the reliability investment never catches
41:27
up, you are just borrowing risk until production
41:31
collects. So maybe the healthier conversation
41:33
is not how do we make everything maximally reliable.
41:37
It is more like what promise are we making? Who
41:40
depends on this system? What happens when it
41:43
fails? And have we practiced the recovery enough
41:46
to believe our own answer? That is probably where
41:49
a lot of reliability conversations are heading.
41:52
Not AI will fix incidents. Not SLOs solve reliability.
41:57
Not just make everything multi-region and call
42:00
it done. More like, what failures should we expect?
42:04
What failures have we practiced? And what are
42:07
we learning every time production teaches us
42:09
something? I'll have links to Francois, Meta's
42:12
At Scale systems and reliability event, and
42:15
anything else we mentioned in the show notes.
42:17
If you enjoyed this conversation, follow or subscribe
42:20
to Ship It Weekly wherever you listen to podcasts.
42:23
It helps the show and it makes sure you get both
42:26
these conversation episodes and the weekly DevOps,
42:29
SRE, platform, cloud, and security news recaps.
42:33
You can also find the show notes and links over
42:36
at shipitweekly.fm. Thanks for listening, and
42:39
I'll see you later this week.
For this Conversations episode, the part that stuck with me is that reliability is not really about whether something fails.
It is about what happens next.
That sounds obvious, but I think a lot of teams still treat reliability like it is mostly a prevention problem. Better deploy checks. Better alerts. Better dashboards. Better SLOs. Better review processes. Better guardrails. All of that matters, obviously.
But production still gets a vote.
A region has issues. A dependency slows down. A rollout behaves differently under real traffic. A service gets overloaded and then cannot restart cleanly because it is already too far gone. The system does something nobody expected because the conditions were never actually tested.
That is where reliability stops being a dashboard and starts becoming a practice.
That was the biggest thread for me in this conversation with Francois Richard from Meta. He framed reliability as both reactive and proactive work, which is probably the right split. You need the incident response muscle. You need people who can stay calm in the pressure cooker, make decisions quickly, build consensus, understand the system under stress, and recover.
But you also need the proactive side. Guardrails, validation, disaster recovery testing, and the uncomfortable work of actually exercising failure modes before they show up on their own.
That is the part most teams agree with in theory and avoid in practice.
Everyone likes the idea of multi-region. Fewer teams like the idea of turning off the primary region on a Tuesday and seeing what actually breaks.
Everyone likes the idea of graceful recovery. Fewer teams have tested whether the overloaded service can restart while it is overloaded.
Everyone likes the idea of runbooks. Fewer teams know whether the runbook still works when the person following it is tired, under pressure, and trying to make sense of five dashboards while Slack is melting.
That is why I liked the way Francois talked about practice. Not just tabletop exercises. Not just theoretical architecture reviews. Real validation. Real failure injection. Real regional testing. Real traffic and overload scenarios, at safe enough scale that you can learn before it becomes a customer-visible disaster.
For most of us, the lesson is not “copy what Meta does.”
That would be silly. Most teams are not dealing with World Cup traffic spikes, global-scale social products, or the same infrastructure footprint.
But the pattern transfers really well.
Take the failure modes you claim you can survive and test them. Take the incident patterns that keep showing up and bucket them. Take the systems that are critical and ask whether the recovery plan is something you have actually practiced, or just something you hope will work because it looks reasonable in a diagram.
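If you want a starting point for the bucketing part, here is a minimal sketch in Python. It is not anything Francois described and not how Meta does it. The incident records and the cause labels are made up for illustration, and it assumes you already tag each incident with a single cause.

```python
from collections import Counter

# Hypothetical incident records. The ids and cause labels are invented
# for illustration; real data would come from your incident tracker.
incidents = [
    {"id": 1, "cause": "config change"},
    {"id": 2, "cause": "dependency timeout"},
    {"id": 3, "cause": "config change"},
    {"id": 4, "cause": "capacity"},
    {"id": 5, "cause": "config change"},
    {"id": 6, "cause": "dependency timeout"},
]


def top_causes(records, n=2):
    """Count incidents per cause and return the n most frequent buckets."""
    counts = Counter(record["cause"] for record in records)
    return counts.most_common(n)


for cause, count in top_causes(incidents):
    print(f"{cause}: {count}")
```

The point is not the code. The point is that a simple count is usually enough to show which two or three patterns deserve attention first.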
The SLO discussion was also useful because it puts reliability in business terms without turning it into corporate fluff.
An SLO is not just a graph. It is a promise.
That is a much better way to think about it. What are we promising users? What are we promising internal customers? What are we promising product teams? And does the reliability investment match that promise?
This is where teams can get weird in both directions.
Sometimes teams underinvest in reliability because the system “mostly works,” until one day it becomes critical and the reliability model never caught up.
Other times, teams overinvest too early and try to make a young experimental system behave like a mature core production path. That can add cost, slow down learning, and introduce complexity before the product even proves it deserves that level of investment.
Six nines sounds great until you ask what it costs, how much complexity it adds, and whether the business actually needs that promise right now.
That does not mean users do not care about reliability. They absolutely do. Francois called that out too. If systems have too many problems in a row, engagement suffers. People notice. Trust erodes.
But the answer is not “make everything maximally reliable.” The better answer is to match the reliability target to the lifecycle, importance, and risk of the system.
A new experiment does not need the same reliability posture as login, feed, payments, messaging, or whatever your real critical path is.
That is a healthier conversation for SRE and platform teams to have with product and engineering leadership. Not “I need six nines for everything.” More like, “Here is the promise this system is making. Here is the cost of that promise. Here is the risk if we miss it. Is that the tradeoff we want?”
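To put rough numbers on that conversation, here is a small Python sketch of what each extra nine allows in downtime per year. The function name is my own, and it assumes a plain 365-day year with availability measured as simple uptime over the whole year, which is a simplification of how most real SLOs are defined.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def downtime_budget_seconds(nines: int) -> float:
    """Allowed downtime per year for an availability target of N nines."""
    unavailability = 10 ** (-nines)  # for example, 6 nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


for n in range(2, 7):
    print(f"{n} nines: {downtime_budget_seconds(n):,.1f} seconds/year")
```

Six nines works out to about 31.5 seconds a year, which matches the figure from the episode. Each extra nine cuts the budget by a factor of ten, which is why the cost conversation gets serious so quickly.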
The AI part of this conversation is where things get more interesting.
Francois talked about AI helping with investigation. That makes sense. Incidents often involve too much data, too many dashboards, too many layers, and too many relationships between services. If AI can help gather telemetry, summarize patterns, generate queries, and point humans toward likely relationships faster, that is useful.
That is not the same thing as handing production to an agent and letting it freestyle the mitigation.
The useful version is more grounded. AI helps humans move faster during investigation. Humans still validate. Humans still decide. Humans still handle the judgment calls, especially when the system is important and the mitigation could make things worse.
But AI is also creating more change.
More diffs. More lines of code. More generated boilerplate. More changes moving through systems faster than before.
That creates a reliability gap that I think a lot of teams are going to feel.
The code can move faster than the understanding.
A product engineer may understand the user behavior and the business logic, but not every generated async callback, framework detail, or edge case buried in the generated implementation. Then when it breaks, the reliability or platform team has to reconstruct what happened under pressure.
What changed?
Why did it change?
What assumption did the generated code make?
What dependency is involved?
Is this a known pattern or something new?
Can we roll back?
Can we move traffic?
Can we isolate the failure?
That is not an anti-AI argument. It is just the operational reality.
If AI speeds up software delivery, the defensive side has to speed up too.
Observability has to improve. Debugging has to improve. Incident investigation has to improve. Recovery practice has to improve. The ability to understand generated code and generated changes has to improve.
Otherwise, teams are going to get more output without enough context, and that cost shows up during incidents.
That is why this episode pairs well with the At Scale Systems & Reliability conversation too. A lot of AI discussion is still stuck at the model layer or the code generation layer. But the infrastructure underneath matters. The systems that train, serve, move data, recover, and keep large AI workloads reliable matter. And the systems that use AI to improve reliability matter too.
That is the conversation I think SRE, DevOps, platform, and infrastructure teams need more of.
Not just “AI can write code.”
Not just “AI can summarize incidents.”
More like, what does the whole production system look like when AI increases the rate of change?
How do we preserve human understanding?
How do we validate what AI suggests?
How do we practice recovery?
How do we make sure the systems behind AI are reliable enough for the expectations being placed on them?
And how do we avoid pretending that a faster delivery loop automatically means a safer one?
The answer probably looks boring in the best way.
Write down the incidents. Review them honestly. Look for patterns. Practice the failures. Test the failover. Validate your assumptions. Make SLOs useful. Match reliability investment to product maturity. Use AI where it helps, but do not confuse investigation assistance with operational judgment.
That is not flashy, but it is the work.
And if there is one practical takeaway from this conversation, it is probably this: do not wait for production to be the first place your recovery plan gets tested.
Production will test it eventually.
The only real question is whether your team has already practiced.