3 posts tagged with "papers-in-systems"

Posts related to the links and papers discussed by the papers in systems (https://papersin.systems/) community

Reliability, robustness and resilience

· 3 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

"No system can escape the constraints of finite resources and changing conditions" - David D. Woods

The topic of failure, its nature, consequences and especially how individuals and organizations think about it has fascinated me for a long time. As we approach Black Friday, Cyber Monday and Christmas - a period when many organizations face increased load and stress on their systems - and with my upcoming discussion of the classic 'How Complex Systems Fail'[^1] treatise (link to the event here), I thought it was a good time to revisit the concepts of reliability, robustness and resilience.

These terms are often used interchangeably. As you probably guessed, however, they mean very different things, and a clear understanding of them helps when reasoning about the failure modes and stressors that affect the systems we build and operate[^2].

Reliability is the probability of a system performing adequately for a given period of time when it is used under the specified operating conditions. A classic example of this is the failure rate of hard disks over a given period under well-defined operating conditions.
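To make the definition concrete, here is a minimal sketch of how reliability over a period could be computed. It assumes a constant failure rate (the exponential model), which is a simplification and not something real disks are guaranteed to follow. The 2% annualized failure rate and the function names are hypothetical, chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365


def failure_rate_from_afr(afr: float) -> float:
    """Convert an annualized failure rate into a per-hour failure rate.

    Assumes a constant failure rate (exponential model).
    """
    return -math.log(1.0 - afr) / HOURS_PER_YEAR


def reliability(failure_rate_per_hour: float, hours: float) -> float:
    """Probability that a component is still working after `hours`."""
    return math.exp(-failure_rate_per_hour * hours)


# Hypothetical disk with a 2% annualized failure rate.
rate = failure_rate_from_afr(0.02)
one_year = reliability(rate, HOURS_PER_YEAR)
three_years = reliability(rate, 3 * HOURS_PER_YEAR)

print(f"1 year:  {one_year:.4f}")
print(f"3 years: {three_years:.4f}")
```

Note that the number only means something together with the period and the operating conditions: the same disk in a hotter or more vibration-prone environment would need a different rate.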

Robustness is the ability of a system to continue to operate despite failures of some of its subcomponents or parts. Staying in the world of storage, creating redundant copies of the information in different disks provides a degree of robustness in case of one or more disk failures.
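As a rough illustration of why redundancy helps, the sketch below estimates the probability that at least one copy survives. It assumes disk failures are independent and that failed copies are not repaired or re-replicated during the period; both are simplifications, and correlated failures (same batch, same rack, same power supply) make real systems worse than this. The 0.98 per-disk reliability and the function name are hypothetical.

```python
def survival_with_copies(disk_reliability: float, copies: int) -> float:
    """Probability that at least one of `copies` independent disks survives."""
    if copies < 1:
        raise ValueError("copies must be at least 1")
    # Data is lost only if every copy fails.
    probability_all_fail = (1.0 - disk_reliability) ** copies
    return 1.0 - probability_all_fail


# Hypothetical per-disk reliability of 0.98 over the period of interest.
for copies in (1, 2, 3):
    print(copies, f"{survival_with_copies(0.98, copies):.6f}")
```

Each extra copy covers a failure mode that was anticipated in advance (a disk dying), which is what distinguishes robustness from resilience.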

Resilience is a system's ability to adapt to disruptions stemming from fundamental surprises (black swan events, key assumptions that no longer hold true, etc.). The experience during the COVID-19 pandemic is a good example of resilience, with many sectors of society adapting to novel circumstances by leveraging technology to enable remote work and digital service delivery.

A keen reader has probably observed that reliability, robustness and resilience are increasingly difficult properties to achieve: from ensuring that individual components or processes behave according to specification under well-known conditions, to tolerating known failure modes of sub-components or processes, all the way to adapting to unforeseen events. Resilience in particular is not a function of technical components alone: system operators play a key role in providing adaptive capacity in times of crisis.

"Everything Fails All the Time" - Werner Vogels

The deployment of increasingly ambitious systems in the context of a hyperconnected world facing multiple stressors (economic pressure, geopolitics, climate) makes it valuable to be able to reason about failure using more precise terminology. A better understanding of failure is also humbling: no matter how good technical solutions are, they operate in a messy environment and failure is almost guaranteed. Allowing a system to "extend its capacity to adapt when surprise events challenge its boundaries"[^3], i.e. Graceful Extensibility, requires investments to build up adaptive capacity (e.g. building anticipatory capacity through game days[^4] and fault injection[^5], or up-skilling personnel), but is increasingly important in our contemporary world.


Footnotes

  1. How Complex Systems Fail

  2. I'm obviously retreading old ground here. If this topic interests you, Lorin Hochstein's excellent GitHub page offers a wealth of great resources to get you started. I cannot recommend it enough.

  3. Resilience as Graceful Extensibility to Overcome Brittleness

  4. "A game day simulates a failure or event to test systems, processes and team responses" - AWS Well Architected Framework

  5. This goes into the topic of Chaos Engineering. You can find a good description of what it is here

Macro beats micro

· 5 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

Science, engineering and to some extent management are permeated by the ideas of Cartesian reductionism. Treating phenomena as machines that can be decomposed into their constituent parts laid the foundation for our modern world. The premise is that, once the parts are analyzed and understood, assembling them back together yields a complete understanding of the phenomenon. This is both enticing and arguably successful; however, this approach struggles when faced with emergent properties and behaviors of systems.

Notes on "Programming as theory building"

· 7 min read
Bruno Felix
Digital plumber, organizational archaeologist and occasional pixel pusher

Programming as Theory Building[^1] is an almost 40-year-old paper that remains relevant to this day. In it, the author (Peter Naur, Turing Award winner and the "N" in BNF[^2]) dives into the fundamental question of what programming is, and builds up from that to answer the question of what expectations one can have, if any, about the modification and adaptation of software systems.