Mean time to recovery (MTTR) is defined in Sleuth as the time a project spends in a failure state. Along with Change failure rate, MTTR is a measure of the quality, or stability, of your software delivery capability.
When Sleuth detects that an Impact Source is failing (e.g. an incident in PagerDuty or an elevated metric in Datadog), it creates a failure period that tracks the details of that failure along with its start and end times. When calculating the MTTR for a date range, Sleuth takes all of the failure periods that occurred in that range and averages their durations.
For example, suppose three incidents occurred in the date range you're inspecting: one lasting 1 hour, one lasting 2 hours, and one lasting 3 hours. Your MTTR will be (1 + 2 + 3) / 3 = 2 hours.
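The arithmetic above can be sketched in a few lines of Python. This is only an illustration of the averaging step, not Sleuth's implementation; the function name `mean_time_to_recovery` is hypothetical, and returning `None` for a range with no failures is an assumption made for this sketch.

```python
from datetime import timedelta


def mean_time_to_recovery(durations):
    """Average a list of failure durations.

    Returns None when there are no failures (an assumption for this
    sketch, not documented Sleuth behavior).
    """
    if not durations:
        return None
    # sum() needs a timedelta start value to add timedeltas together.
    return sum(durations, timedelta()) / len(durations)


# The three incidents from the example: 1 hour, 2 hours, and 3 hours.
incidents = [timedelta(hours=1), timedelta(hours=2), timedelta(hours=3)]
mttr = mean_time_to_recovery(incidents)
print(mttr)  # 2:00:00
```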
For a real-world example of how Sleuth helps you measure and drive down your MTTR, let's say you make a deploy that adds 25% to your database CPU. Assume that Sleuth is tracking this impact and determines that the deploy is Unhealthy. Your team has set up Slack notifications in Sleuth, and as a result your mean time to discovery (MTTD) is basically zero. Your team jumps into action and initiates a rollback, which takes 25 minutes to complete. Once your rollback is deployed, Sleuth sees that your database CPU has gone back down to normal and auto-verifies the deploy as Healthy. Your MTTR in this scenario would be 25 minutes: the amount of time it took for your team to return your project to a healthy state.
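To make the scenario concrete, here is a minimal sketch that models a failure period by its start and end timestamps and computes the MTTR for a date range. Everything in it is an assumption for illustration: `FailurePeriod` and `mttr_for_range` are hypothetical names rather than Sleuth's data model or API, the dates are made up, and counting a period by the time it started is a simplification, since how periods that straddle the edge of a range are handled is not described here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class FailurePeriod:
    # Hypothetical record; not Sleuth's actual data model.
    started_at: datetime
    ended_at: datetime

    @property
    def duration(self) -> timedelta:
        return self.ended_at - self.started_at


def mttr_for_range(periods, range_start, range_end):
    # Assumption: a period belongs to the range if it started inside it.
    in_range = [p for p in periods if range_start <= p.started_at < range_end]
    if not in_range:
        return None
    total = sum((p.duration for p in in_range), timedelta())
    return total / len(in_range)


# Illustrative timestamps: the deploy is marked Unhealthy, and the rollback
# returns the project to Healthy 25 minutes later.
unhealthy_at = datetime(2024, 1, 15, 10, 0)
healthy_at = unhealthy_at + timedelta(minutes=25)

periods = [FailurePeriod(unhealthy_at, healthy_at)]
result = mttr_for_range(periods, datetime(2024, 1, 1), datetime(2024, 2, 1))
print(result)  # 0:25:00
```

With a single failure period in the range, the average is simply that period's duration, which matches the 25 minutes in the scenario.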
For more on how Sleuth measures MTTR, check out Sleuth CTO Don Brown explaining it in detail in this SleuthTV episode.
- Incidents: any deploy with a status of Incident. Sleuth provides integrations with PagerDuty, Statuspage, and many more, and we're continuously adding new integrations per customer demand. See Integrations for an up-to-date list of those we currently support.
Sleuth supports feature flags as a first-class form of change. Because feature flag changes have just as much power to affect failure and recovery as code changes, they are included in your MTTR calculations. Sleuth's deploy verification applies to flag changes in the same way it applies to code deploys.