Mean time to recovery, or MTTR, is defined in Sleuth as the time a project spends in a failure state.
For instance, let's say you make a deploy that increases your database CPU usage by 25%. Assume that Sleuth is tracking this impact and determines that the deploy is Unhealthy. Your team has set up Slack notifications in Sleuth, and as a result your mean time to discovery, or MTTD, is close to zero. Your team jumps into action and initiates a rollback, which takes 25 minutes to complete. Once your rollback is deployed, Sleuth sees that your database CPU has gone back down to normal and auto-verifies the deploy as Healthy. Your MTTR in this scenario would be 25 minutes: the amount of time it took for your team to return your project to a healthy state.
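The arithmetic in this scenario can be sketched in a few lines of Python. This is only an illustration of the calculation, not Sleuth's implementation: the function name `mean_time_to_recovery`, the timestamps, and the representation of a failure as a `(failed_at, recovered_at)` pair are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(failures):
    """Average duration of a list of (failed_at, recovered_at) datetime pairs.

    Hypothetical helper for illustration; returns a zero timedelta
    when there are no failures in the period.
    """
    if not failures:
        return timedelta(0)
    total = sum((recovered - failed for failed, recovered in failures), timedelta(0))
    return total / len(failures)


# Assumed timestamps: the deploy is marked Unhealthy at 12:00 and the
# rollback is auto-verified as Healthy 25 minutes later.
unhealthy_at = datetime(2024, 1, 1, 12, 0)
healthy_at = unhealthy_at + timedelta(minutes=25)

mttr = mean_time_to_recovery([(unhealthy_at, healthy_at)])
print(mttr)  # 0:25:00
```

With a single failure the mean is simply that failure's duration. With several failures in the same period, the same function averages across them.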
The Sleuth project metrics dashboard shows the total time spent in a failure state in the period. We also provide a detailed breakdown of the time spent in each type of failure. Failure types currently supported in Sleuth are:
Incidents - any deploy with a status of Incident. Integrations with PagerDuty, Statuspage, and more are coming soon to automate the discovery of incidents.
Rolled back - any code deploys that were detected to be rolled back
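A per-type breakdown like the one described above amounts to summing failure durations grouped by failure type. The following is a minimal sketch under stated assumptions: the function `time_in_failure_by_type`, the type labels `"incident"` and `"rolled_back"`, and the sample durations are hypothetical and do not reflect Sleuth's data model or API.

```python
from collections import defaultdict
from datetime import timedelta


def time_in_failure_by_type(failures):
    """Sum durations per failure type.

    `failures` is an iterable of (failure_type, duration) pairs, where
    failure_type is a label such as "incident" or "rolled_back"
    (hypothetical labels chosen for this example).
    """
    totals = defaultdict(timedelta)  # missing keys start at a zero duration
    for failure_type, duration in failures:
        totals[failure_type] += duration
    return dict(totals)


# Made-up sample data for one reporting period.
sample = [
    ("rolled_back", timedelta(minutes=25)),
    ("incident", timedelta(minutes=40)),
    ("rolled_back", timedelta(minutes=10)),
]

breakdown = time_in_failure_by_type(sample)
total_time_in_failure = sum(breakdown.values(), timedelta(0))
```

Here `breakdown` maps each failure type to its total duration, and `total_time_in_failure` corresponds to the overall time spent in a failure state for the period.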
Sleuth supports feature flags as a first-class form of change. Because feature flag changes have just as much power to affect failure and recovery as code changes, they are included in your MTTR calculations. Sleuth's deploy verification applies to flag changes in the same way it applies to code deploys.
Because MTTR is closely tied to change failure, please see the guide on setting up change failure rate to configure MTTR.