Mean time to recovery (MTTR) is defined in Sleuth as the time a project spends in a failure state.

For instance, let's say you make a deploy that increases your database CPU usage by 25%. Assume that Sleuth is tracking this impact and determines that the deploy is Unhealthy. Your team has set up Slack notifications in Sleuth, so your mean time to discovery (MTTD) is effectively zero. Your team jumps into action and initiates a rollback, which takes 25 minutes to complete. Once the rollback is deployed, Sleuth sees that your database CPU has returned to normal and auto-verifies the deploy as Healthy. Your MTTR in this scenario is 25 minutes: the time it took your team to return the project to a healthy state.
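The arithmetic in this scenario can be sketched in a few lines of Python. This is only an illustration of the idea "time spent in a failure state", not Sleuth's actual implementation; the `time_in_failure` function, the event format, and the timestamps are all assumptions made for the example.

```python
from datetime import datetime, timedelta

def time_in_failure(events, period_end):
    """Sum the time spent unhealthy.

    events: list of (timestamp, is_healthy) tuples sorted by timestamp
            (a hypothetical format, not Sleuth's data model).
    period_end: end of the reporting period, used if the project is
                still unhealthy when the period closes.
    """
    total = timedelta(0)
    failure_start = None
    for ts, healthy in events:
        if not healthy and failure_start is None:
            failure_start = ts              # project enters a failure state
        elif healthy and failure_start is not None:
            total += ts - failure_start     # project recovers
            failure_start = None
    if failure_start is not None:
        total += period_end - failure_start  # still failing at period end
    return total

# The scenario above: an unhealthy deploy at 10:00, rollback verified at 10:25.
events = [
    (datetime(2024, 1, 1, 9, 0), True),
    (datetime(2024, 1, 1, 10, 0), False),
    (datetime(2024, 1, 1, 10, 25), True),
]
recovery = time_in_failure(events, datetime(2024, 1, 1, 12, 0))
print(recovery)  # 0:25:00
```

With a single failure in the period, the total time in failure and the time to recovery are the same 25 minutes.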

MTTR breakdowns

The Sleuth project metrics dashboard shows the total time spent in a failure state in the period. We also provide a detailed breakdown of the time spent in each type of failure. Failure types currently supported in Sleuth are:

Feature flags and MTTR

Sleuth supports feature flags as a first-class form of change. Because feature flag changes have just as much power to affect failure and recovery as code changes, they are included in your MTTR calculations. Sleuth's deploy verification applies to flag changes in the same way it applies to code deploys.

Every deployment, feature flags included, has an advanced setting that allows you to exclude it from impact collection. If this setting is enabled for your feature flags, they will not affect your MTTR.
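As a rough illustration of how such an exclusion changes the numbers, the sketch below sums failure time only over changes that are not excluded. The `Change` class, the `excluded_from_impact` field, and the durations are hypothetical names and values chosen for the example; they are not Sleuth's API or data model.

```python
from dataclasses import dataclass
from datetime import timedelta

@dataclass
class Change:
    kind: str                           # "code" or "feature_flag"
    failure_time: timedelta             # time the project was unhealthy after this change
    excluded_from_impact: bool = False  # hypothetical stand-in for the advanced setting

def failure_time_counted(changes):
    """Total failure time, skipping changes excluded from impact collection."""
    return sum(
        (c.failure_time for c in changes if not c.excluded_from_impact),
        timedelta(0),
    )

changes = [
    Change("code", timedelta(minutes=25)),
    Change("feature_flag", timedelta(minutes=10)),
    Change("feature_flag", timedelta(minutes=40), excluded_from_impact=True),
]
counted = failure_time_counted(changes)
print(counted)  # 0:35:00
```

Here the excluded flag change contributes nothing, while the code deploy and the other flag change are both counted.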

Setting up MTTR

Because MTTR is closely tied to change failure rate, see the instructions for setting up change failure rate to configure MTTR.