Last week we covered Who and Where for cloud companies that “write it down” to pursue goals for The Perfect Team. This issue will get to one of the two remaining questions, When, and next week we will explore Why.
“Mean Time To RCA” can be viewed through several lenses within a learning-focused postmortem culture. While vendors of tooling used by SRE and incident management practitioners have a variety of perspectives on the fastest way or most complete approach to get to RCA, they all trend toward other Mean Time To X metrics as a foundation (Ishikawa diagrams, Kaizen methods, Cause Maps, Postmortem Templates, etc.). That said, marketing teams for tooling vendors may look for a way to, at best, differentiate or, at worst, obfuscate with a thesaurus approach to naming conventions.
- If X = R = Respond, Repair, Recovery, Resolve, or Resolution
- If X = I = Identify, Isolate, or Insights
- If X = F = Failure, Fix, Fidelity, or Facilitate
- If X = A = Acknowledge, Activity, or Action
- If X = D = Determine, Detect, or Diagnose
- If X = V = Verify or Validate
- If X = T = Triage or Telemetry
- If X = C = Confirm, Clarity, or Closure
- If X = RR… 🤣🤣🤣🤣
- and so on
- but it ALL adds up to the time it takes to get to RCA
So, one may wonder if MTTAA is the Mean Time To Another Acronym.🤔
Effectively, Mean Time To RCA (for this series) refers to the time it takes to produce actionable insights from a root cause analysis. The lessons learned will inform, refine, or result in creating KPIs or Objectives and Key Results (OKRs) for the organization as part of a commitment to conspicuous and continuous improvement.
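To make the definition concrete, here is a minimal Python sketch of how one might compute Mean Time To RCA from timestamps. It assumes the clock starts when the outage starts and stops when the RCA is published; a team could just as reasonably start the clock at mitigation. The function name `days_to_rca` and the two incidents are hypothetical and exist only for illustration.

```python
from datetime import datetime
from statistics import mean


def days_to_rca(outage_start: datetime, rca_published: datetime) -> float:
    """Elapsed days from outage start to RCA publication (assumed reference points)."""
    return (rca_published - outage_start).total_seconds() / 86400


# Hypothetical incidents: (outage start, RCA publication). Not real provider data.
incidents = [
    (datetime(2021, 1, 4, 9, 0), datetime(2021, 1, 9, 9, 0)),
    (datetime(2021, 2, 10, 12, 0), datetime(2021, 2, 13, 0, 0)),
]

# Mean Time To RCA is the average of the per-incident elapsed times.
mean_time_to_rca = mean(days_to_rca(start, published) for start, published in incidents)
print(f"Mean Time To RCA: {mean_time_to_rca:.2f} days")
```

Whichever reference points are chosen, they need to be applied consistently, otherwise the resulting averages are not comparable across incidents or providers.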
We know there is an increasingly personalized approach to DevCommsOps among hyperscale public cloud service providers. So, we need to understand the impact on Mean Time To RCA from both general public DevCommsOps and the effect from personalized approaches.
To provide examples, let’s examine where Mean Time To RCA is found within the hyperscale public cloud service providers today using our previous searches for “Root Cause Analyses (RCAs) / Incidents.” Once again, the list is in no particular order or weighting other than shorter names to longer names.
- ~5 days for an outage duration of ~3 days
- ~10 days for an outage duration of ~12 hours
- ~10 days for an outage duration of ~9 hours
- ~10 days for an outage duration of ~6 hours
- ~2 days for an outage duration of ~3 hours
- ~3 days for an outage duration of ~2 hours
- And so on
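As a rough worked example, the approximate figures in the list above can be averaged directly. The Python sketch below copies those six pairs as-is (days to RCA, outage duration converted to hours) and treats every "~" value as exact, which is an assumption; the variable names are illustrative only.

```python
from statistics import mean, median

# (approx. days to RCA, approx. outage duration in hours), copied from the list above.
# The first entry's ~3 day outage is expressed as 72 hours.
examples = [(5, 72), (10, 12), (10, 9), (10, 6), (2, 3), (3, 2)]

days = [d for d, _ in examples]
mean_days = mean(days)      # 40 / 6
median_days = median(days)  # middle of the sorted values 2, 3, 5, 10, 10, 10

# Ratio of time-to-RCA to outage duration, both in hours.
ratios = [d * 24 / hours for d, hours in examples]

print(f"mean: {mean_days:.2f} days, median: {median_days} days")
print("RCA time / outage duration:", [round(r, 1) for r in ratios])
```

For these six examples the mean is about 6.7 days and the median is 7.5 days, while the ratio of RCA time to outage duration ranges from under 2 to 40. In this small sample, the time to publish an RCA does not appear to scale with how long the outage lasted.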
Alibaba Cloud Mean Time to RCA examples
- Unable to find any notices that include outage duration
- Unable to find any links from news coverage of outages
- And so on?
Microsoft Azure Mean Time to RCA examples
- RCA (detailed) can be made available upon request
- Unable to find any notices with an actual publication date
- RCA publishing is organized by the start date of an outage
- Several RCAs reference outages lasting into the following day
- Otherwise, ~1 day for an outage duration of any length (unlikely?)
- And so on?
Amazon Web Services Mean Time to RCA examples
- ~9 days for the April 21, 2011 “disruption” and no duration calculated
- ~5 days for the July 2, 2012 “event” and no duration calculated
- ~5 days for the October 22, 2012 “event” based on Twitter update
- ~5 days for the December 24, 2012 “event” based on Twitter update
- ~3 days for the December 17, 2012 “event”
- ~5 days for the June 13, 2014 “disruption” based on Twitter update
- The August 7, 2014 message URI seems to be recycled from 2011 🤷♂️
- ~3 days for the November 25, 2020 “event”
- And so on
Google Cloud Platform Mean Time to RCA examples
- ~9 days for the October 31, 2019 “incident” duration of ~3 days
- ~14 days for the May 20, 2021 “incident” duration of ~1 hour
- And so on
Oracle Cloud Infrastructure Mean Time to RCA examples
In summary, there are stark variations amongst the hyperscalers in expressing Mean Time To RCA. Further, it is reasonable to expect the market will drive demand for standards that normalize the variations.
At the same time, DevCommsOps mixes public and personalized views that are unique to the customer experience. Further, the drive for personalization will result in a Mean Time To RCA for each customer that is informed by their unique dependency mapping. The Azure and Oracle Cloud approaches will appeal to particular Enterprise customers.
As a reminder, we have established definitions for status dashboards, Engineering SLO, and Mean Time To RCA. We have a baseline that is ready to compare general public dependencies and customer personalized views of the underlying dependencies among hyperscale public cloud service providers.
Our last issue in the series will look at the increasing importance of dependency mapping across hyperscale public cloud service providers. Finally, we will consider business value engineering and customer journeys.