Reliability, robustness and resilience
"No system can escape the constraints of finite resources and changing conditions" - David D. Woods
The topic of failure, its nature, its consequences, and especially how individuals and organizations think about it, has fascinated me for a long time. As we approach Black Friday, Cyber Monday and Christmas - a period when many organizations face increased load and stress on their systems - and with my upcoming discussion of the classic 'How Complex Systems Fail'1 treatise (link to the event here), I thought it was a good time to revisit the concepts of reliability, robustness and resilience.
These terms are often used interchangeably; however, as you probably guessed, they mean very different things, and a clear understanding of them helps us reason about the failure modes and stressors that affect the systems we build and operate2.
Reliability is the probability that a system performs adequately for a given period of time under specified operating conditions. A classic example is the failure rate of hard disks over a given period under well-defined operating conditions.
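To make the "probability over a period of time" framing concrete, here is a minimal sketch assuming a constant failure rate (the exponential failure model); the 1% annual failure rate is an illustrative number, not taken from any particular disk datasheet:

```python
import math

def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability a component is still working after `hours` of operation,
    assuming a constant failure rate (exponential failure model)."""
    return math.exp(-failure_rate_per_hour * hours)

# Illustrative: a disk with a 1% annual failure rate (AFR).
# Convert the AFR to an hourly rate, then ask for one year of operation.
afr = 0.01
hourly_rate = -math.log(1 - afr) / 8760  # 8760 hours in a year
print(reliability(hourly_rate, 8760))    # recovers ~0.99, as expected
```

Real disks do not have a constant failure rate (they follow the well-known "bathtub curve"), but the constant-rate model is the usual starting point for back-of-the-envelope reliability estimates.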
Robustness is the ability of a system to continue operating despite the failure of some of its subcomponents or parts. Staying in the world of storage, keeping redundant copies of the information on different disks provides a degree of robustness against one or more disk failures.
Resilience is a system's ability to adapt to disruptions stemming from fundamental surprises (black swan events, key assumptions that no longer hold true, etc.). The experience during the COVID-19 pandemic is a good example of resilience, with many sectors of society adapting to novel circumstances by leveraging technology to enable remote work and digital service delivery.
A keen reader has probably observed that reliability, robustness and resilience are increasingly difficult properties to achieve: from ensuring that individual components or processes behave according to specification under well-known conditions, to tolerating known failure modes of subcomponents and processes, all the way to adapting to unforeseen events. Resilience in particular is not a property of technical components alone; system operators play a key role in providing adaptive capacity in situations of crisis.
"Everything Fails All the Time" - Werner Vogels
The deployment of increasingly ambitious systems in a hyper-connected world facing multiple stressors (economic pressure, geopolitics, climate) makes it valuable to reason about failure using more precise terminology. A better understanding of failure is also humbling: no matter how good technical solutions are, they operate in a messy environment and failure is almost guaranteed. Allowing a system to "extend its capacity to adapt when surprise events challenge its boundaries"3, i.e. Graceful Extensibility, requires investment in adaptive capacity (e.g. building anticipatory capacity through game days4 and fault injection5, and up-skilling personnel), but is increasingly important in our contemporary world.
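Fault injection, in its simplest form, just means deliberately making a dependency misbehave so you can observe how the rest of the system copes. The toy wrapper below is a hypothetical sketch of the idea; production chaos-engineering tools inject faults at the network or infrastructure level rather than in-process:

```python
import random

def flaky(func, failure_probability=0.2, rng=random.random):
    """Wrap a callable so it sometimes raises, simulating an unreliable
    dependency. The injected ConnectionError stands in for whatever
    fault you want callers to learn to tolerate."""
    def wrapper(*args, **kwargs):
        if rng() < failure_probability:
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)
    return wrapper

# Wrap a dependency, then exercise the caller's retry/fallback logic:
fetch_config = flaky(lambda: {"timeout_ms": 500}, failure_probability=0.5)
```

The value is not in the wrapper itself but in what it surfaces: missing timeouts, absent retries, and alert gaps you would otherwise discover only during a real incident.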
Footnotes
- I'm obviously retreading old ground here, and if this topic interests you, Lorin Hochstein's excellent Github page offers a wealth of great resources to get you started. I cannot recommend it enough. ↩
- Resilience as Graceful Extensibility to Overcome Brittleness ↩
- "A game day simulates a failure or event to test systems, processes and team responses" - AWS Well Architected Framework ↩
- This goes into the topic of Chaos Engineering. You can find a good description of what it is here ↩