Ship It Conversations: Meta’s Francois Richard on AI Incident Response, SLOs, and Reliability at Scale

Transcript

0:00 AI is making it easier than ever to create more

0:03 software, more code, more diffs, more experiments,

0:07 more changes moving through systems that were

0:10 already complicated before everybody got a robot

0:13 assistant in the editor. And that sounds great

0:15 until you are the team responsible for production

0:18 because production does not really care that

0:21 the code was generated faster. It still cares

0:24 about latency. It still cares about overload.

0:27 It still cares about dependencies, rollbacks,

0:30 traffic routing, region failures, weird edge

0:34 cases, and whether the people on call actually

0:37 know what to do when the system starts acting

0:40 strange. That is the part of the AI conversation

0:43 that feels under-discussed to me. Not just can

0:46 AI write code, it can. Not just can AI help debug,

0:50 sometimes yes. The better question is what happens

0:54 to reliability when the volume of change goes

0:57 up, the amount of generated code goes up and the

1:00 context behind that code gets thinner. Because

1:03 incidents are not just caused by bad code. They

1:06 are caused by systems behaving in ways people

1:09 did not expect, under conditions they did not

1:13 practice, with dependencies they forgot were

1:16 part of the critical path. That is really what

1:18 this conversation is about. Not just SLOs, not

1:22 just incident reviews, not just AI agents. More

1:25 like, what does reliability actually look like

1:28 when change is accelerating? Systems are more

1:32 connected, and recovery matters just as much

1:35 as prevention. I'm Brian Teller from Teller's

1:38 Tech, and this is Ship It Weekly.

1:58 Welcome back to Ship It Weekly, where I filter the noise and

2:00 focus on what actually matters when you are the

2:03 one running infrastructure and owning reliability.

2:06 Most weeks, it's a quick news recap. In between

2:09 those, I do conversation episodes with people

2:12 who are building platforms, running infrastructure,

2:15 organizing events, and thinking through where

2:18 this industry is actually headed. Today is one

2:21 of those conversations. I'm joined by Francois

2:24 Richard, an engineering director at Meta. We're

2:27 talking about reliability at scale, how AI and

2:30 automation are changing production risk, what

2:33 teams actually learn from incidents, and why

2:36 recovery practice matters just as much as prevention.

2:40 And I like this conversation because it gets

2:43 past the neat, clean version of reliability that

2:46 fits nicely in a dashboard. Reliability is not

2:49 just a number. SLOs matter, dashboards matter,

2:53 alerts matter, guardrails matter. But none of

2:56 that means much if the team cannot use it to

2:59 make better decisions, recover from real failures,

3:02 and improve the system after something breaks.

3:05 You start with the obvious goal. Keep the system

3:08 up. Keep the users happy. Avoid incidents. Do

3:12 not page people for nonsense. Do not let every

3:15 deployment feel like a coin flip. Then reality

3:18 shows up. A region has problems. A dependency

3:22 gets slow. A service gets overloaded and cannot

3:26 restart because it is too overloaded to recover

3:29 cleanly. Traffic spikes during a major event.

3:32 A change rolls out faster than the team understands

3:35 it. Or an AI-generated pile of code technically

3:38 works until it fails in a way nobody has enough

3:41 context to explain quickly. And somewhere in

3:45 the middle of that, reliability stops being just

3:48 prevention and starts becoming practice. In this

3:51 conversation, Francois and I talk about how Meta

3:55 thinks about reliability across both reactive

3:58 and proactive sides. Incident response, incident

4:01 reviews, SLOs, guardrails, validation, disaster

4:06 recovery testing and what happens when you actually

4:09 practice taking a region out instead of just

4:12 assuming the failover plan works because the

4:15 diagram looks good. We also get into what teams

4:18 learn during incidents, not just from the

4:20 postmortem afterward, but inside the pressure cooker

4:23 itself, where engineers have to make decisions

4:26 quickly, build consensus fast, understand the system

4:30 under stress, and figure out what is actually

4:33 happening while the clock is running. There is

4:36 also a good thread in here around AI agents in

4:39 incident response. Not the fantasy version where

4:42 you hand production to an agent and hope it saves

4:45 the day. More the practical version, where AI helps

4:48 with investigation: telemetry, metrics, logs, relationships

4:52 across services, and narrowing down what might

4:56 be happening faster than a human clicking around

4:59 dashboards alone. And towards the end, we talk

5:02 about recovery practice, known failures versus

5:05 unknown failures, why teams should test the failure

5:08 modes they claim they can survive, how smaller

5:11 teams can learn from Meta-scale reliability patterns,

5:14 and why not every system needs six nines on day

5:19 one. So if you work around DevOps, SRE, platform

5:23 engineering, infrastructure, engineering leadership,

5:26 incident response, or you are just trying to

5:29 figure out what reliability looks like when AI

5:31 is increasing the volume of change, this one

5:34 should be worth your time. All right, let's jump

5:37 in.

5:45 Today, I'm joined by Francois Richard. He is an engineering director at Meta, and we're

5:47 talking about reliability at scale, how AI and

5:50 automation are changing production risk, what

5:53 teams learn from incidents, and why recovery

5:56 practice matters just as much as prevention.

5:59 Francois, thank you for joining me. Thank you,

6:02 thank you. Happy to be here. Okay, so when you

6:04 think about reliability at Meta scale, where does

6:08 the work actually start? I think for me, reliability

6:13 is actually two things. What people kind of like

6:16 know usually it's the reactive side is how do

6:19 we manage incidents? How do we respond to problems?

6:23 And how do we handle things and recover and fall

6:27 back when something goes bad? But at the same

6:31 time, it's also everything that I call like,

6:33 you know, the proactive side of the house. This

6:35 is everything that has to do with, hey, how are

6:40 we going to prevent reliability problems from

6:44 happening? What sort of guardrails do we put

6:47 in place? What sort of framework? What sort of

6:51 tooling do we allow platform or do we give to

6:55 platform services and application developers

6:58 to use? And also, what is the real validation

7:02 that we are doing? There's a lot of papers available

7:07 in the public and so on around like, you know,

7:10 there's something that we call like the storm

7:12 program. And it's basically our way to test like,

7:15 you know, disaster recovery, where once in a

7:18 while we take an entire region out, out of the

7:23 like, you know, like if there was an earthquake

7:25 or there was like a giant electrical grid failure.

7:29 And we test what happens with the system. How

7:32 do they recover? How do they handle the spike

7:35 in traffic and so on? So it's both like for me,

7:37 reliability is both like, you know, the reactive

7:39 side and also the proactive side. Is that kind

7:43 of like chaos engineering, the concepts? Yes,

7:46 yes, yes, exactly. And, you know, we go as far

7:49 as, you know, disconnecting a complete region from

7:52 the map. Interesting. Okay, so speaking about

7:55 reliability, what makes an SLO useful instead

7:59 of just another dashboard number? Like, how do

8:02 you think about that? Yes, an engineer, like

8:03 a senior engineer that I work with always says

8:06 that the SLO of something, of a system, is actually

8:11 the promise that you're making to your customer.

8:15 That, you know, XYZ should work 99.99% of the

8:19 time and you should expect that amount of latency

8:23 on average at, you know, P90 or something like

8:27 this. So we use this, like, you know, most of

8:31 the platform we operate at Meta and also the

8:34 product services and so on. They will all, the

8:39 big ones, will all have some form of SLO for

8:43 the top APIs or the top services that they are

8:46 doing. We have some alerts that will trigger

8:49 on SLO, but usually this is too late. Like, you

8:52 know, we want to catch problems. The SLO allows

8:55 us in the long run to actually figure out like,

8:59 hey, are we investing enough in reliability?

9:03 Are we keeping the bar high enough? Or do we

9:07 have wiggle room to take some risk, introduce

9:12 new features? Because you all know that when

9:15 you introduce or you do a giant migration, there's

9:18 a period of time where it takes time to stabilize

9:21 things. So the SLOs are helping us make these

9:25 decisions in the long run. And you can clearly

9:28 see when you've been tracking them for a while

9:30 that some systems are trending up and some systems

9:34 are trending down. And for the ones that are

9:37 trending down, it allows you, it gives you the

9:39 argument and the data to start investing more.
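
One common way to turn an SLO into the kind of long-run investment signal described here is an error budget. The following is a minimal sketch, not Meta's tooling: the function names, the request counts, and the 25% threshold are all assumptions chosen for illustration.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_failures = (1.0 - slo_target) * total  # failures the SLO tolerates
    if allowed_failures <= 0:
        raise ValueError("SLO target must be below 1.0 and total must be positive")
    return 1.0 - failed / allowed_failures


def can_take_more_risk(remaining: float, threshold: float = 0.25) -> bool:
    """Hypothetical policy: allow riskier launches while budget is above threshold."""
    return remaining > threshold


# Illustrative numbers: a 99.99% target over 10 million requests tolerates
# about 1,000 failures; 400 observed failures leaves roughly 60% of the budget.
remaining = error_budget_remaining(0.9999, 10_000_000, 400)
print(f"budget remaining: {remaining:.0%}, room for risk: {can_take_more_risk(remaining)}")
```

Tracked over several periods, a budget that keeps shrinking is the "trending down" case that gives a team the data to argue for more reliability investment.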

9:43 So this is kind of like, you know, our way about

9:46 deciding if we have enough investment in an area

9:50 or if, you know, we are okay and we can take

9:53 more, you know, progressive risk. So you had,

9:57 we talked a little bit about incidents before

10:00 the show, and then you had just mentioned incidents.

10:02 Across the incident lifecycle, where do teams

10:05 actually learn the most? So in terms of learning

10:08 during incidents, I think it happens in two phases.

10:11 First, it happens while handling the incident

10:14 during what I call like the pressure cooker.

10:17 When you actually need to handle the incident,

10:20 you actually need to make decisions quickly,

10:24 build consensus fast with a small group of people

10:28 about what's the next action to take. And then

10:31 what are the bypass, the restart process, which

10:35 system is calling what. This is actually a time

10:40 where a lot of people, like junior engineers

10:43 and senior, they learn a lot. They learn a lot

10:45 about the system, but they also learn a lot about

10:47 team dynamics and themselves during that period.

10:52 Then the next phase happens when kind of like

10:56 you build the... incident review report or we

11:00 call it the SEV report, where you basically,

11:03 you know, once the incident is handled, mitigated,

11:07 you know, a few days after, you basically sit

11:09 down and then we have like a tool to do this.

11:12 And then people sit down and they write a report

11:15 around like, you know, hey, what exactly happened?

11:18 What was the root cause of this? What was the

11:21 exact timeline, which is kind of like forcing

11:24 people to go back into logs and all of this?

11:26 Could the alerts have fired earlier? Did we have

11:29 the right instrumentation? Did we diagnostic,

11:32 like diagnose this thing, like, you know, the

11:35 proper way? And, you know, what are the follow-up

11:37 tasks, like to make sure that this doesn't

11:40 happen again, or that we have some level of automation

11:43 to be able to remediate this type of problem

11:46 automatically. So you build that report. But

11:49 then from there, you go into a series of reviews.

11:54 You review it with your team, with your organization.

11:57 And if the incident was severe enough, we actually

12:01 reviewed them at the company level. And during

12:04 that time, it's a big opportunity to get feedback

12:06 from other teams, from other senior engineers

12:10 that, yeah, we see this pattern. Here's, you

12:12 know, how we handle this type of thing. And you

12:14 actually learn a lot during that process. So

12:17 this is usually like, you know, the learnings

12:19 happen in kind of like two phases. Yeah. And

12:23 so given that learning, what should an incident

12:26 review actually produce? Like once the incident's

12:28 over? The incident review, like we have a report

12:30 type of thing. It's a template. It has like,

12:34 you know, walks you through, for you to get, like, to

12:37 the root cause of it. But I think the most important

12:41 thing that it produces is also like, you know,

12:43 what are the follow-up actions? The follow-up

12:45 could be like, hey, we need to implement more

12:48 redundancy. We need to implement better diagnostics.

12:50 We need to implement better ways of handling

12:53 these exceptions or problems. We need a different

12:56 deployment process. This is usually like, you

12:59 know, the best outcome we could get because,

13:01 you know, the more you mature your system, the

13:03 more you'll be able to cover a lot of these,

13:06 like, kind of like unknown slash unexpected situations.

13:10 So this is usually like the good outcome of these.

13:12 And what do you find is the difference between

13:14 finding the cause versus improving the system?

13:17 The difference between the cause and improving

13:21 the system. Or is there a difference? There is

13:27 a bit. Because for me, finding the cause overall

13:31 is, do I have the right tools to pinpoint the

13:37 problem very quickly. And then I'll be able to

13:39 find the cause. And then the improvements are more

13:43 about eliminating this exact problem

13:47 or that class of problem. And the improvement

13:50 can be that like you know, like from a code structure

13:53 that this will always happen, but you can work

13:57 around with either deployment strategy, automation

14:01 of recovery and things like this so that it's...

14:05 It's not visible to the end users with strategies

14:08 of retries and things like that. So there's a

14:11 big difference between finding the cause. Like

14:13 if you work in mobile apps, you know that there's

14:16 going to be a couple of cases where like the

14:18 cause is going to be like, you know, the mobile

14:19 networks just drop on you. And the app needs over

14:23 time to come up with ways to mask it. You'll

14:26 never eliminate the cause, but you'll have apps

14:30 that are a lot better at masking this type of

14:33 problem. So it's slightly different. Okay, so

14:34 let's switch gears to AI agents. Are you using

14:37 AI in incident response at all? Yes. In triaging?

14:42 We use it for what I would call like investigation.

14:48 So, you know, there's usually like two classes

14:51 of production incidents. You have one class when

14:54 it happens. You kind of know what this is. Like,

14:57 you know, it's a pattern that you already recognize.

15:00 The alarms or alerts are very clear about what's

15:03 happening. So you kind of know and you can narrow

15:05 down and then you can just go focus on, hey,

15:08 you know, how do I remediate that? There are

15:10 some of the cases like where you kind of see

15:12 like, oh, I'm starting to see transient errors.

15:15 I'm starting to see latency creep up, and where

15:19 it's kind of like, I'm kind of not sure. And

15:23 then usually like, you know, we launch what's

15:25 called like investigations and things like that.

15:29 Really, the AI agents are helping us in investigation

15:33 because they can analyze and gather large swath

15:37 of data very, very quickly and then pinpoint

15:41 relationships between datasets. Like, you know,

15:46 it's probably the same for everybody, but we

15:47 have multiple layers of software on top of each

15:50 other. They call each other and they call back.

15:54 Sometimes finding complex situation requires

15:56 us to analyze a lot of data at the same time.

16:00 Even myself personally, like, you know, we have

16:02 like centralized, like, you know, monitoring

16:05 system. And then if you aggregate a lot of data

16:09 with very small granularity, like, you know,

16:13 at the 10 second or like at the minute level,

16:15 and you have a lot of time series, not the data

16:17 point, it gets a pain to actually like get the

16:21 right query and the right graph for the problem

16:25 that you're investigating at a point in time.

16:27 Like, so the AI agent has kind of like shifted

16:29 this thing where they help us find these things.

16:33 And even, like, you know, in some cases, like, what

16:35 I do is, like, I will dump all the metrics locally

16:38 in the SQLite database and then, like, do all sorts

16:41 of things, because it's local and my SSD can do

16:44 it a lot faster. And then you use also the AI

16:48 agents to help you craft the queries. I don't

16:51 remember on top of my head all the various flavors

16:54 of joins and all of that. So it's helping a

16:57 lot. So with investigation, it helps a lot. It's helping

17:02 a lot. It's not perfect. You still need to guide

17:06 it very well because the topology of the system,

17:11 the request... The request flow between the system

17:16 is still not something that I will call it like

17:20 the various AI agents are actually mastering.

17:23 They master code very well. But these flows are

17:28 a little bit more difficult and understanding

17:29 how they map into the application or the product

17:33 flows is even more complicated. But it does help.
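
The local SQLite workflow mentioned above can be sketched roughly as follows. This is an illustration only: the table names, columns, sample values, and thresholds are invented, and an in-memory database stands in for the on-disk file so the example is self-contained. The join lines up two metric series by service and timestamp to find the buckets where latency and errors spike together.

```python
import sqlite3

# In-memory database for the sketch; a real dump would use a file path instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE latency (service TEXT, ts INTEGER, p90_ms REAL)")
conn.execute("CREATE TABLE errors (service TEXT, ts INTEGER, error_count INTEGER)")

# Hypothetical metric samples at 10-second granularity.
conn.executemany(
    "INSERT INTO latency VALUES (?, ?, ?)",
    [("feed", 0, 120.0), ("feed", 10, 450.0), ("auth", 0, 80.0), ("auth", 10, 85.0)],
)
conn.executemany(
    "INSERT INTO errors VALUES (?, ?, ?)",
    [("feed", 0, 1), ("feed", 10, 42), ("auth", 0, 0), ("auth", 10, 15)],
)

# Inner join on (service, ts): keep buckets where both signals are elevated.
rows = conn.execute(
    """
    SELECT l.service, l.ts, l.p90_ms, e.error_count
    FROM latency AS l
    JOIN errors AS e ON e.service = l.service AND e.ts = l.ts
    WHERE l.p90_ms > ? AND e.error_count > ?
    ORDER BY l.ts
    """,
    (200.0, 10),
).fetchall()

for service, ts, p90_ms, error_count in rows:
    print(f"{service} t={ts}s p90={p90_ms}ms errors={error_count}")
```

With this sample data only the `feed` service at the 10-second bucket passes both thresholds; `auth` has elevated errors there but normal latency.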

17:36 That's the investigation side of it. We're doing

17:40 a lot also into what I call like having a standard

17:45 dataset across the company that kind of like

17:48 represents the state of reliability. Like we

17:51 drive a couple of initiatives globally, like

17:54 we call it like change safety. Do you have SLOs

17:58 for something? Do you have guardrails? And with

18:02 these datasets being kind of official, then it's

18:06 easy for each service team or product team to

18:10 build their own customized kind of dashboard.

18:12 Like, right now the cost of making dashboards is

18:15 almost free, and really tailored to their use cases

18:19 or to their product using the standard dataset.

18:22 And that's another place where it's been like,

18:23 oh, this has been great. And how are you, I'm assuming

18:28 you're still validating that root cause,

18:30 though, right? If it comes up with a, you know, it

18:32 uses MCP servers, it's reaching out to Argo, Kubernetes,

18:35 whatever, it's gathering all the telemetry from

18:38 Grafana, and then it's putting that together and

18:41 it's saying, hey, this pod was in

18:44 CrashLoopBackOff. This caused this issue. Are you still validating

18:48 that? And what is that process for validating

18:50 that? Like, is there a standard SOP that you're

18:52 following? No, I don't think we have like a standard

18:55 evaluation. The evaluation, like most of the

18:59 time, the tools that you have in the moment from

19:02 an incident is either you move away from the

19:05 problem or you roll back the problem. So we will

19:08 try, like, you know, first, you know, move away. Let's

19:11 call it, like, a problem is a backend problem,

19:14 it's isolated to a region. We'll move away from

19:17 that region to kind of like prove that theory,

19:19 and then we'll probably, during that time, roll

19:23 back the system or the backend or whatever

19:26 it was that introduced that problem. And then

19:29 you can imagine that, now that the

19:32 hypothesis is live, we can shift a portion

19:35 of traffic back to that problem and see, like,

19:38 oh, is it still happening or not? This is all manual.

19:41 We have some of these processes that are automated,

19:45 but they're really boxed into very, very well-known,

19:48 like, you know, they have, like, very well-established

19:51 runbooks that have been, like, you know, in place

19:55 for the longest time, where we know that when this

19:57 happens, do this. Like, it's almost like if then

20:00 else. But the most, like, the investigation result,

20:04 the validation is more to roll out the fix, you

20:09 know, driven by a human with, like, you know, slow

20:11 roll type of techniques. Do you think that AI

20:14 is increasing the speed of change, the volume

20:17 of change or the type of failure that teams need

20:20 to expect? But it's clear that it's increasing

20:23 the volume of change. We have data internally

20:27 that, you know, the number of diffs is through

20:30 the roof. The number of lines of code changed is,

20:34 again, through the roof. The type of failure

20:37 is like one thing that, because there's more

20:41 code and it's going faster, it's the type of

20:45 failure becomes more into the, what I would call

20:48 like pseudo-unexpected. When a product developer,

20:52 application developer, when they... roll out

20:55 a new feature, they do understand the business

20:58 logic. They do understand how it should show

21:01 to the user, what they expect the user to do,

21:04 where's a corner where a user could get stuck.

21:07 They understand that. That's not a problem. But

21:10 if you have to build on modern apps on mobile

21:15 phones or on backend, a lot of these things are

21:19 built on top of frameworks. There's a lot of boilerplate

21:21 stuff, especially to handle like, you know, async

21:24 callbacks and things like this. And since a lot

21:27 of that code gets auto-generated, people are

21:30 losing context around these things. And when

21:33 something doesn't work, we realize that it takes

21:37 us more time to understand why it doesn't work.

21:41 Because there was not that additional context

21:44 that... the developer or that team developing

21:47 that feature took a lot of time to craft it and

21:49 a lot of, like, internal reviews and things like

21:51 that. So it goes out faster, but that context

21:55 on how it's built is actually lost along the

21:57 way. And this is basically what we need to reconstruct

22:01 live. And again, like I said earlier, a lot of our

22:05 tooling is, like, you know, in this case we're just

22:07 like, we'll roll back and you figure it out. Like,

22:09 bring back the system to a stable state, and then

22:12 you can figure it out on your own time without

22:15 being in the pressure cooker. Okay, so speaking

22:17 along the same lines, you had mentioned too that

22:21 recovery comes down to practice. So what does

22:23 good recovery practice actually look like? You

22:26 know, there's a set of standard failures you expect

22:29 your app to have. And it starts at, you know,

22:34 if you're running, if you have data centers or

22:37 region in the United States or in Europe or something

22:41 like that, you get... tons of things like fiber

22:44 cuts and hurricanes and, you know, power grid and

22:48 things like that. You need to understand how

22:51 your system will react. Like, you know, a simple

22:54 thing, like if I'm a startup and I run in two

22:57 AWS regions, but I actually never test the fact

23:01 that one of them will go down, I can guarantee

23:04 you, like, as soon as you get a little bit of

23:06 complexity in your system, something will not

23:09 work as expected in there. For us, it's actually

23:13 to do, to validate it, to not just test, like

23:16 we run, I would call it like data simulation

23:19 and analysis in advance to figure out like, okay,

23:23 if we lose that data center, we'll be okay. But

23:26 it is always interesting. And then after we've

23:29 done the validation and vetted that it all worked.

23:33 We actually do the test and we discover a suite

23:35 of other problems that the analysis had never

23:38 found. So exercising the failure for real at

23:42 the lower scale allows you to find a lot of these

23:46 problems. And it's the same thing like, you know,

23:49 if you have very sensitive hotspot in your application

23:53 and your product, injecting failure for real

23:56 at a very low percentage of users will... allow

23:59 you, okay, do I have the right detection? Like, did

24:01 that trigger properly? Did it get to the right

24:04 person? Did this team know how to handle it? Like,

24:08 without this, it all becomes, like, improv.

24:11 Like, you improv all the time, versus when you

24:14 have a portion of your infrastructure that's

24:16 always tested in such a way, first, you keep on

24:18 hardening, and it gets better and it gets better,

24:20 and the muscle of handling it gets better also.
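
Injecting a failure for a very low percentage of users is often done with deterministic bucketing, so the same users stay in the cohort for the whole exercise. The sketch below shows one generic way to do that; the function name, the experiment label, and the hashing scheme are assumptions, not a description of Meta's systems.

```python
import hashlib


def in_failure_cohort(user_id: str, experiment: str, percent: float) -> bool:
    """Return True if this user falls in the `percent`% cohort for `experiment`.

    Hashing experiment and user together gives each exercise its own stable,
    pseudo-random slice of users.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100  # e.g. percent=0.5 selects buckets 0..49


# Hypothetical exercise: fail a dependency for about 1% of users, then check
# that detection fired and reached the right on-call team.
users = [f"user-{i}" for i in range(20_000)]
cohort = [u for u in users if in_failure_cohort(u, "feed-cache-outage", 1.0)]
print(f"{len(cohort)} of {len(users)} users would see the injected failure")
```

Because the assignment is a pure function of the inputs, the exercise can be stopped by setting the percentage to zero, and repeated later against the same users.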

24:23 So I guess that I was going to ask about, like,

24:26 how, how would teams practice known failures

24:28 versus unknown failures? But it sounds like it

24:31 comes down to just that practicing and pulling

24:33 out the regions. Okay. And there's, there's a

24:36 bunch of types of failure. Like, you know, I,

24:37 I made an allusion about worst-case failures

24:40 versus worst-case failure, which is, um, you

24:45 know, you lose a complete region. And with the

24:48 weather patterns in the U.S., you know, we have

24:50 hurricane season, we have like fire season, tornado,

24:54 like the cuts happen everywhere. So you got to

24:57 be ready. But there's also other things like,

25:01 you know, for Meta, we are often, I would call

25:05 it like we got, we get these traffic spikes.

25:09 It could be because there's something in the

25:12 news related. It could be a big event like the

25:14 Super Bowl, New Year's Eve. The World Cup's coming.

25:17 So we know that every time there's going to be

25:18 a goal, it's going to spike. What we realize

25:21 over time with practice is that some system will

25:25 be overloaded. And in some cases, you'll have

25:28 to restart them. And during overload, the damn

25:32 thing cannot restart. You cannot kill it. Like

25:35 it's all completely dead. Like it's completely

25:38 blocked. So we practice a lot of these cases

25:40 too. And it's something specific to us. We get

25:45 these spikes that are kind of unexpected. Most

25:47 systems have some form of overload protection

25:50 up to a point, but it's something we really had

25:54 to invest in, into like, in the worst load scenario,

25:59 can I restart? Do I have, like, is my process

26:03 smart enough to just accept requests later down

26:06 the road? Or do I have enough control in the

26:10 traffic routing to, like, okay, choke the traffic,

26:12 bring back up, warm up your memory, and then,

26:15 like, gradually. Like, we had to practice that

26:18 a lot. Yeah, so you're dealing with, like, thrashing

26:20 and having, like, no... Yeah. Yes, yes. Resources

26:24 being completely exhausted. Yeah, yeah. Completely

26:26 gone. And is that typically at the control plane

26:32 level? Or... Some of our control plane is pretty good,

26:34 you know, there's some backend. Like you can

26:36 imagine, like, you know,

26:39 an Instagram feed backend or Facebook feed backend,

26:43 same type of things. And we have a couple of

26:47 academic papers. People can look at it. It's called

26:50 like Taiji, I believe. And that's where we discuss

26:53 these things on how we can reroute traffic and

26:57 then control and then remove it from an area

27:00 to allow the system to reload. Okay. So with

27:04 that in mind. I would imagine most listeners

27:07 are not operating at Meta scale, so they're not

27:10 dealing with every time there's a goal at the

27:12 World Cup, you know, their servers go down. What

27:14 ideas can actually transfer for most like SRE

27:17 DevOps that are listening? I think the example

27:20 I gave earlier, like you often see like no blame

27:24 on Amazon, but if US East goes down, like, you know,

27:29 we see the world of the internet like go like

27:31 berserk. So I'm like, guys, like, you know, you

27:34 got to start testing these cases. Like you've

27:36 seen it happen, like, you know, at least like,

27:39 you know, three or four times in the last two

27:40 years. You got to start like, you know, having

27:44 your dual regions and really validating it. And

27:47 maybe it's a question about like, you know, every

27:50 Tuesday, I do not run a single thing in US East,

27:56 you know, whatever the name is these days

27:58 to make sure that it's, you know, it's performing.

28:02 And then it's kind of like tackling the problems

28:05 in a way like, you know, when you look at your

28:08 incident inside a small, a smaller company, medium

28:12 sized startup or so on, like I get a list of

28:15 incidents. There's a point where you start having

28:18 enough data that you could say, okay, most of

28:22 them are caused by, it's because we launched

28:25 that feature or because we're doing configuration

28:28 change or because we have a billing problem.

28:31 It's starting to trend your SEV data and

28:36 figure out, hey, what are the top two or three?

28:38 Then we can start attacking and really focusing,

28:41 because, like, you can focus on everything, but

28:44 you'll get nowhere. It's really start bucketizing,

28:47 like, you know, and having the discipline of writing

28:51 it down. Like, it's not that hard to write, like,

28:53 a SEV report. There's a lot of examples in there.

28:56 And having the discipline of going back and... Now,

28:59 with LLMs, doing the analysis is kind of like trivial.

29:03 We used to need to allot a lot of

29:05 people to do this and now it's a lot easier.
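
The bucketizing step described here can start very small. As a minimal sketch, assuming each written-up incident carries a cause label (the records and labels below are invented for illustration), counting causes surfaces the top two or three to focus on:

```python
from collections import Counter

# Hypothetical SEV records; in practice these would come from written reports.
incidents = [
    {"id": "SEV-101", "cause": "feature launch"},
    {"id": "SEV-102", "cause": "configuration change"},
    {"id": "SEV-103", "cause": "feature launch"},
    {"id": "SEV-104", "cause": "billing"},
    {"id": "SEV-105", "cause": "configuration change"},
    {"id": "SEV-106", "cause": "feature launch"},
]


def top_causes(records, n=3):
    """Return the n most frequent causes as (cause, count) pairs."""
    return Counter(record["cause"] for record in records).most_common(n)


for cause, count in top_causes(incidents, n=2):
    print(f"{cause}: {count} incidents")
```

The hard part is the discipline of writing each incident down with a consistent cause label; once that exists, the trend analysis itself is a few lines.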

29:07 So I think that's a big opportunity to actually

29:10 really focus on where reliability matters. Yeah,

29:13 that's fair. You had mentioned too that reliability

29:15 expectations should match the system and product

29:18 life cycle. Yes. Can you talk about that? Like,

29:21 because 100% uptime is not always like the right

29:25 goal. Yes, yes, yes, yes. There is, there's, like,

29:29 you know, when I talk about the expectation versus

29:32 life cycle, we should always expect, like, if something

29:36 has some level of complexity and some level

29:40 of features, at the beginning when you roll it

29:42 out, like, you know, getting two nines or even three

29:46 nines of reliability, it's pretty good. Like,

29:50 not a simple system, that's usually,

29:52 like, you can get it easy. But, like, if it's early

29:55 in the life cycle of that product or that backend,

29:58 at the beginning, it's going to be rocky and

30:01 it should be expected. Like I see a lot of people

30:04 start with the assumption that, Hey, everything

30:07 should be like six nines, six nines guys. Like

30:10 this is like insane. Like, and you invest a lot

30:14 and then you, you kind of like miss

30:17 your product market fit while you're doing that.

30:20 Like there are some like Facebook and Instagram,

30:24 there's a gazillion experiments that run at

30:26 a given point in time. Some of them are not fully

30:28 reliable, and it's okay, because we're trying

30:31 to figure out if that feature will be something

30:36 that users enjoy and use or not. But there are

30:40 other parts of the app which, like, need to be

30:42 like, rock solid, and this is the place where we

30:45 invest. So it depends on the life cycle, if you

30:48 are early or not, because there's a cost to investing.

30:51 And then, especially when, like, you're either SRE

30:55 or production engineering and you have to also

30:58 convince your PM, your product managers, that

31:01 this is the case, having that conversation from

31:04 that standpoint is a lot easier than just, like,

31:08 being the one that says, no, I need six nines

31:10 of reliability for everything. That's not happening.
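
For reference, the yearly downtime each "nines" target allows follows directly from the arithmetic. This is a small sketch using a 365-day year; the function name is an assumption.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000, ignoring leap years


def allowed_downtime_seconds(nines: int) -> float:
    """Yearly downtime permitted by an availability of `nines` nines (3 means 99.9%)."""
    return SECONDS_PER_YEAR * 10.0 ** (-nines)


for n in range(2, 7):
    secs = allowed_downtime_seconds(n)
    print(f"{n} nines: {secs:,.1f} seconds/year ({secs / 3600:.2f} hours)")
```

Two nines allows about 3.65 days of downtime a year, three nines about 8.76 hours, and six nines about 31.5 seconds, which is why each extra nine costs so much more than the last.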

31:13 Yeah. Well, six nines too, that's like what,

31:16 31 seconds a year of downtime? I mean, the amount

31:20 of infrastructure and cost complexity to just

31:23 keep a system at scale. Depending on the system,

31:27 I mean, I assume if it's like a very important

31:29 financial system, maybe it matters. But for the

31:31 majority of systems, like do the users really

31:34 care about 31 seconds of outage per year? I mean,

31:38 are they going to? They do care. You know, we

31:42 can see from our own data that, you know, if

31:47 we have too many problems in a row, we can

31:49 see, you know, engagement eroding. So they

31:51 do care. They do show. But it tends to stabilize

31:54 when we stabilize things. So there's a correlation

31:58 there. But sometimes the investment to reach

32:01 that extra nine is just so high. And while you're

32:06 doing that extra nine, you're also sacrificing

32:09 not just the speed and velocity, but also your

32:12 mitigation time for future unknowns. Because

32:16 when you have invested so much complexity and

32:19 then you get into an unknown situation and then

32:22 now you have to untangle all of this, your mitigation

32:24 time will be higher. So it's kind of like it's

32:27 a delicate trade-off. Like it's not a win-win

32:30 all along. Yeah, for sure. It's a reality there.

32:34 A real conversation that you have to have and

32:36 a balance that you have to have there. Okay,

32:38 so wrapping up, this conversation also connects

32:41 to At Scale Systems and Reliability. Why does

32:44 that program feel relevant right now? Yeah. So

32:48 we have this suite of conferences, like it's called

32:51 At Scale. And we have like four conferences a year.

32:54 And the next one that's coming is basically systems

32:57 and reliability together. And what we want to

33:00 target for this specific conference is, hey,

33:04 how are we injecting? What are the systems that

33:09 are like, you know, building reliability for

33:12 AI? And what are the systems that are under the

33:15 hood for AI? Because people tend to talk about,

33:17 hey, we talk about the models, okay? But we never

33:20 talk about like the underlying plumbing that

33:23 is required to train and serve the model and

33:26 the insane amount of data that we need to move

33:29 back and forth. And then on the other angle,

33:32 we will also discuss, like, all the other

33:35 areas where we actually use AI to enhance reliability.

33:39 So it's both like, you know, both cases that

33:42 we want to do. This conference is going to happen

33:45 in person. It's going to be in Bellevue. And

33:48 you can find the detail. The website is at scale

33:51 conference. And we could put that at the show

33:55 notes. Yes. So if someone works in infra, SRE,

33:59 platform, or engineering leadership, why should

34:03 this be on their radar? And what are you watching

34:07 specifically with the conference? I'm just curious.

34:09 Personally, I'm a big fan of this conference.

34:13 I've been part of it for a while. And one thing

34:18 that I tell the speakers that are coming from

34:21 Meta, and we have speakers from NVIDIA, from

34:23 Microsoft, from Google, the goal here, I want

34:26 to have the real technical discussion. With Meta,

34:31 I'm not in the business of selling cloud services.

34:34 Like, I'm not in the business of selling APIs or things

34:37 like this. So what I want to talk about is, like, you

34:40 know, Microsoft, you have a Kubernetes cluster with

34:43 one million servers. How the

34:46 hell did you do that? Like, this is like, you

34:48 know, the goal is really like, I want to really

34:50 understand the story behind the system. I want

34:54 to have like the real technical conversation.

34:56 And I want to avoid the sales pitch of like,

34:59 oh, use that service, use that thing. So this

35:03 is really the focus of the conference is to really

35:05 have that technical discussion and also the story

35:09 behind the system. Like, I'm always fascinated.

35:15 I've always been fascinated by, like, okay, a

35:18 team starts something. Like, I've been involved

35:20 in ZooKeeper and all sorts of other things in

35:24 my days at Yahoo back then. You start something,

35:27 it was to solve a problem. Then you open source it

35:29 and you're like, oh, that became this? Oh my

35:31 God. So I'm really interested in this and then

35:34 like the ups and downs of the team because you

35:37 always have like, you know, the hype at the beginning

35:39 and then, oh, the reality hits and you're like,

35:41 oh my God, like this will not work as expected.

35:44 And how do you overcome that? Like that, for

35:47 me, that's the most interesting part of all of

35:50 these stories and all of these presentations.

35:52 So, okay, wrapping up, what kind of reliability

35:55 conversations does the industry need more of?

35:58 I think right now the industry does understand

36:02 the use of AI for code generation. Like, I

36:07 think we get that. I think we can get that. We

36:10 can go all out. I think we understand that everybody

36:13 can build a custom app, like, for themselves, and

36:16 they will only use it for themselves. It's gonna

36:20 be perfect for them, and then it's okay if it

36:23 goes down. But I think the rest of the industry

36:27 has not kept up with that rate of change, and

36:32 there's not enough investment in, kind of, like,

36:36 defense. Like, we're able to generate code. Are

36:39 we able to debug it faster? Are we able to understand

36:41 it faster? Are we able to troubleshoot it faster?

36:45 Like, that has not kind of followed. And

36:49 then I feel that we are catching up now. And

36:53 then the hype seems to be, like, mostly on

36:57 the model. And then there's amazing infrastructure

37:01 that had to be built underneath. And I think,

37:04 you know, everybody needs to understand it a

37:06 little bit more. Yeah, for sure. Awesome. So

37:09 I will put links for At Scale Systems and Reliability

37:12 in the show notes. I can put your

37:15 LinkedIn there as well. Are there any other links

37:17 or comments that you'd like to give to the audience

37:19 before we wrap up? No, I think that's it. Don't

37:22 give up. Awesome. Thank you so much, Francois,

37:24 for coming on. Really appreciate your time. Thank

37:26 you so much. Thank you. Bye-bye. All right.

37:28 That was my conversation with Francois Richard

37:31 from Meta. My biggest takeaway from this one

37:34 is that reliability is not just about preventing

37:37 failure. The better question is, what happens

37:40 when prevention fails? Because sometimes the

37:43 answer is a rollback. Sometimes it is moving

37:45 traffic. Sometimes it is draining a region. Sometimes

37:49 it is restarting a service. Sometimes it is realizing

37:53 that the service cannot restart cleanly because

37:56 it is already overloaded, which is the kind of

37:59 fun little production detail that does not usually

38:01 show up in architecture diagrams. That is the

38:05 part that I think is worth paying attention to.

38:07 A lot of teams talk about reliability like it

38:10 is mostly a tooling problem. Get the right dashboards.

38:13 Get the right alerting. Define the SLOs. Add

38:16 some runbooks. Maybe sprinkle in some AI and

38:19 pretend the incident lifecycle is solved. But

38:22 reliability is not just the tools. It is the

38:25 practice around the tools. It is whether the

38:28 SLO actually represents a promise to users. It

38:32 is whether the alert fires early enough to matter.

38:35 It is whether the incident review produces real

38:37 follow-up work instead of just a nicer explanation

38:40 of what broke. It is whether the team has practiced

38:44 the failure mode before production forces them

38:47 to learn it live. And honestly, that is the part

38:49 of the conversation that translates really well,

38:52 even if you are nowhere near Meta scale. Most

38:55 of us are not dealing with World Cup traffic

38:57 spikes or massive global systems, but a lot of

39:01 us are depending on a cloud region more than

39:04 we want to admit. A lot of us say we are

39:06 multi-region, but have not actually run without the

39:09 primary region on a boring Tuesday. A lot of

39:13 us have runbooks that look reasonable until someone

39:15 has to follow them under pressure. A lot of us

39:18 have services that should recover automatically,

39:21 but only if the failure happens in the exact

39:23 way we imagined. That is where the work is. Practice

39:27 the recovery. Test the boring assumptions. Look

39:30 at your incident data. Bucket the causes. Figure

39:33 out what keeps showing up. Then go after the

39:36 top patterns instead of trying to boil the ocean.
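
The "bucket the causes" step can be sketched in a few lines of Python. This is an illustration only: the incident records, the cause labels, and the `top_causes` helper are hypothetical, and real incident data would come from whatever tracker a team already uses.

```python
from collections import Counter

# Hypothetical incident records; the IDs and cause labels are invented for illustration.
INCIDENTS = [
    {"id": "INC-101", "cause": "config change"},
    {"id": "INC-102", "cause": "dependency failure"},
    {"id": "INC-103", "cause": "config change"},
    {"id": "INC-104", "cause": "capacity"},
    {"id": "INC-105", "cause": "config change"},
    {"id": "INC-106", "cause": "dependency failure"},
    {"id": "INC-107", "cause": "bad deploy"},
]


def top_causes(incidents, n=3):
    """Count incidents per cause and return the n most frequent as (cause, count, share)."""
    counts = Counter(incident["cause"] for incident in incidents)
    total = sum(counts.values())
    return [(cause, count, count / total) for cause, count in counts.most_common(n)]


for cause, count, share in top_causes(INCIDENTS):
    print(f"{cause}: {count} incidents ({share:.0%} of total)")
```

Ranking by count keeps attention on the few causes that recur, which is the point of going after the top patterns first.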

39:39 I also liked Francois' point about AI changing

39:42 the reliability equation. AI can absolutely help

39:46 with investigation. It can look across a lot

39:49 of data quickly. It can help build queries, connect

39:51 patterns, and speed up the part where humans

39:54 are trying to figure out what changed and what

39:57 is related. But AI is also increasing the volume

40:00 of change. More diffs, more generated code, more

40:03 boilerplate, more systems moving faster, and

40:07 sometimes less human context behind the code

40:10 that just went out. That is a weird trade-off.

40:13 Because if code moves faster than understanding,

40:16 reliability teams are going to feel that gap

40:19 during incidents. The system breaks, and now

40:22 someone has to reconstruct not just what changed,

40:26 but why it changed, what the generated code is

40:28 actually doing, what assumptions it made, and

40:32 how to get back to a stable state. That does

40:34 not mean AI is bad. It means the defensive side

40:37 has to catch up. Debugging has to get better.

40:41 Observability has to get better. Incident response

40:43 has to get better. Recovery practice has to get

40:47 better. And humans still need to be in the loop

40:49 for judgment, especially when the system is too

40:52 important to let a guess turn into the next mitigation.

40:56 I also think the lifecycle point matters. Not

40:59 every system needs the same reliability target.

41:02 A brand new experiment probably should not get

41:05 the same investment as a core production path

41:08 that millions or billions of people depend on.

41:12 Six nines sounds impressive until you realize

41:15 what it costs, what complexity it adds, and how

41:18 much slower it can make future changes. But the

41:22 reverse is true too. If a system becomes important

41:24 and the reliability investment never catches

41:27 up, you are just borrowing risk until production

41:31 collects. So maybe the healthier conversation

41:33 is not how do we make everything maximally reliable.

41:37 It is more like what promise are we making? Who

41:40 depends on this system? What happens when it

41:43 fails? And have we practiced the recovery enough

41:46 to believe our own answer? That is probably where

41:49 a lot of reliability conversations are heading.

41:52 Not AI will fix incidents. Not SLOs solve reliability.

41:57 Not just make everything multi-region and call

42:00 it done. More like, what failures should we expect?

42:04 What failures have we practiced? And what are

42:07 we learning every time production teaches us

42:09 something? I'll have links to Francois, Meta's

42:12 At Scale Systems and Reliability event, and

42:15 anything else we mentioned in the show notes.

42:17 If you enjoyed this conversation, follow or subscribe

42:20 to Ship It Weekly wherever you listen to podcasts.

42:23 It helps the show and it makes sure you get both

42:26 these conversation episodes and the weekly DevOps,

42:29 SRE, platform, cloud, and security news recaps.

42:33 You can also find the show notes and links over

42:36 at shipitweekly.fm. Thanks for listening, and

42:39 I'll see you later this week.
