Show Notes

This is a guest conversation episode of Ship It Weekly, separate from the weekly news recaps.

In this Ship It: Conversations episode, I talk with Francois Richard, Engineering Director at Meta, about reliability at scale, how AI is changing production risk, what teams actually learn from incidents, and why recovery practice matters just as much as prevention.

We talk about the proactive and reactive sides of reliability, why SLOs should represent a promise to users instead of just another dashboard number, how incident reviews should drive real system improvements, and how teams can practice recovery before production forces the lesson on them.

The bigger theme here is that reliability is not just about avoiding failure. It is about knowing what happens when prevention fails. That means practicing regional failure, understanding overload behavior, improving incident response, using AI carefully during investigation, and making reliability targets match the actual lifecycle and importance of the system.

Highlights

• Why reliability work starts with both prevention and recovery

• The difference between reactive incident response and proactive reliability engineering

• How Meta thinks about disaster recovery testing and regional failure practice

• Why an SLO should be treated like a promise to users, not just a dashboard metric

• How SLO trends help teams decide when to invest more in reliability or take more product risk

• What engineers actually learn during the “pressure cooker” of an incident

• Why incident reviews should produce follow-up work, not just a nicer explanation of what broke

• The difference between finding the cause of an incident and improving the system

• Where AI agents can help with incident investigation, telemetry, metrics, and query building

• Why AI-generated code can increase change volume while reducing human context

• How faster code generation changes the kinds of reliability problems teams should expect

• Why recovery practice matters, especially for region loss, traffic spikes, overload, and restart behavior

• What smaller DevOps and SRE teams can learn from Meta-scale reliability patterns

• Why not every system needs six nines, especially early in a product lifecycle

• How to think about reliability investment based on user promise, product maturity, and operational risk

• Why At Scale Systems & Reliability is focused on the infrastructure behind AI and the use of AI to operate large-scale systems

Francois’ links

• LinkedIn: https://www.linkedin.com/in/francoisrichard/

At Scale links

• Systems & Reliability 2026: https://bit.ly/4xd2FdG

• At Scale Conferences: https://atscaleconference.com/

Our links

More episodes + show notes + links: https://shipitweekly.fm

On Call Brief: https://oncallbrief.com

Guest

Meta’s Francois Richard

Hosted by

Brian Teller

25 years in production: DevOps, SRE, platform, and cloud. DevOps Institute & ITIL Ambassador.
