Resilience Engineering Perspectives, Volume 1 (Hollnagel, Nemeth, Dekker, 2008)

Remaining Sensitive to the Possibility of Failure

In the resilience engineering approach to safety, failures and successes are seen as two different outcomes of the same underlying process, namely how people and organizations cope with complex, underspecified and therefore partly unpredictable work environments. Therefore safety can no longer be ensured by constraining performance and eliminating risks. Instead, it is necessary to actively manage how people and organizations adjust what they do to meet the current conditions of the workplace, by trading off efficiency and thoroughness and by making sacrificing decisions.

"The Ashgate Studies in Resilience Engineering" series promulgates new methods, principles and experiences that can complement established safety management approaches, providing invaluable insights and guidance for practitioners and researchers alike in all safety-critical domains. While the Studies pertain to all complex systems they are of particular interest to high hazard sectors such as aviation, ground transportation, the military, energy production and distribution, and healthcare. Published periodically within this series will be edited volumes titled "Resilience Engineering Perspectives". The first volume, "Remaining Sensitive to the Possibility of Failure", presents a collection of 20 chapters from international experts. This collection deals with important issues such as measurements and models, the use of procedures to ensure safety, the relation between resilience and robustness, safety management, and the use of risk analysis. The final six chapters utilise the report from a serious medical accident to illustrate more concretely how resilience engineering can make a difference, both to the understanding of how accidents happen and to what an organisation can do to become more resilient.

About the Author
Professor Erik Hollnagel, Industrial Safety Chair, Ecole des Mines de Paris - Pole Cindyniques, France; Christopher P. Nemeth, Research Associate, Department of Anesthesia and Critical Care, The University of Chicago, Chicago, USA; and Sidney Dekker, Professor, Director of Research, Lund University School of Aviation, Sweden.

p.xi-xii Resilience engineering makes it clear that failures and successes are closely related phenomena and not incompatible opposites. Whereas established safety approaches hold that the transition from a safe to an unsafe state is tantamount to the failure of some component or subsystem and therefore focus on what has gone or might go wrong, resilience engineering proposes that:

... an unsafe state may arise because system adjustments are insufficient or inappropriate rather than because something fails... Since both failures and successes are the outcome of normal performance variability, safety cannot be achieved by constraining - or eliminating - that. Instead, it is necessary to study both successes and failures, and to find ways to reinforce the variability that leads to successes as well as dampen the variability that leads to adverse outcomes... effective safety management cannot be based on a reactive approach alone... it is necessary also to make corrections or changes in anticipation of what may happen... a resilient system is defined by its ability effectively to adjust its functioning prior to or following changes and disturbances so that it can continue its functioning after a disruption or a major mishap, and in the presence of continuous stresses.

p.1 The designer, the planner, and the systems operator must always keep in mind that things could go wrong... The "faint signals" that are often the precursors of trouble need to be heard and sent to competent authority for action.

p.4-5 Resilience seems to be closely linked with some sort of insight into the (narrowly defined) system, the (broadly defined) environment in which it exists, and their interactions... Resilience involves anticipation... Deeper understanding allows at least two sources of resilience. One is to know sooner when "things are going wrong" by picking up faint signals of impending dysfunction. The other is to have better knowledge resources that are available in order to develop adaptive resources "on the fly." It follows that the lack of such understanding diminishes resilience. It also follows that resulting choices that lack an understanding of how to create, configure, and operate a system lead to less resilient (more brittle) systems. Resilience can be seen in action, and is made visible through the way that safety and risk information are used. Resilience is an active process that implicitly draws on the way that an organization or society can organize itself. It is more than just a set of resources because it involves adaptation to varying demands and threats. Adaptation and restructuring make it possible for an organization to meet varying, even unanticipated, demands.

p.5-6 resilience... [anticipates] what future events may challenge system performance. More importantly, resilience is about having the generic ability to cope with unforeseen challenges, and having adaptable reserves and flexibility to accommodate those challenges... resilience invests flexibility and the ability to find and use available resources in a system in order to meet the changes that are inherent in a dynamic world... Making changes to systems in anticipation of needs in order to meet future demands is the engineering of resilience.

p.6 To measure something, we must know its essential properties. Resilience of materials must be measured by experiment in order to find how much a material returns to its original shape. The same can be said for systems. The act of measurement is the key for engineers to begin to understand the nature of an unexampled event, and the probability part of Probabilistic Risk Assessment (PRA).

p.29-30 Among the definitions of resilience are an ability to resist disorder (Fiksel, 2003), as well as an ability to retain control, to continue and to rebuild (Hollnagel & Woods, 2006)... it may be possible only to measure [a system's] potential for resilience... The following factors are thought to contribute to resilience (Woods, 2006):

buffering capacity...
flexibility/stiffness...
margin...
tolerance...
cross-scale interactions

p.31 Broadly speaking, measurement may be defined as the "process of linking abstract concepts to empirical indicants" (Carmines & Zeller, 1979)... In other words, a valid measurement is one that is capable of accessing a phenomenon and placing its value along some scale.

p.128-129 intuitively, a robust or resilient system is one which must be able to adapt its behaviour to unforeseen situations, such as perturbations in the environment, or to internal dysfunctions in the organisation of the system, etc. ... a resilient system generally aims to restore the initial functions of the system without fundamentally questioning its internal structure in charge of the regulation... From a system theory point of view, the processes linked to robustness are very different since:

1) they inevitably do not guarantee that the function of the systems will be maintained; new functions can emerge in the system (e.g. a new organisation or new objectives for a company, etc.)

2) it is difficult to disassociate the system from its environment since the two entities can be closely coupled

p.129 for McDonald, resilience represents:

the capacity of an organizational system to anticipate and manage risk effectively, through appropriate adaptation of its actions, systems and processes so as to ensure that its core functions are carried out in a stable and effective relationship with the environment

p.130 Woods defines a resilient system as one which is able to monitor the boundary of its organization capability and which can adapt or adjust its current model... an agent or a structure is able to anticipate unforeseen circumstances in an intelligent way in order to drive back the system to its initial state... Following this complex system point of view, we stress that it is necessary to distinguish between resilient engineering that is concerned with the aim to bring back the system in its initial conditions and robustness engineering which is able to harness the more complex (and hidden) properties of self-organized processes.

p.188 The safety fundamentals for system safety architecture and technology... are: transparency, redundancy, interdependence, functionality, integrity, and maintainability.

p.256 Based on these elements, cross-checking fundamentally consists in being able to question a plan in progress at any given level of the model, comparing elements to expected ones. Expectations might differ from current elements because they are based on a different knowledge of the situation: e.g., the situation has evolved and new events have occurred, the agent cross-checking has a different perspective... The need to detect emerging effects in order to potentially recover from unintended negative outcomes is also essential.

p.260 In analogy with this we may in accident investigation propose a What-You-Look-For-Is-What-You-Find or WYLFIWYF principle. The meaning of this is that the assumptions about possible causes (What-You-Look-For) to a large extent will determine what is actually found (What-You-Find)... a root cause analysis implies that accidents can be explained by finding the root - or real - causes. The assumption is in this case that the accident can be described as a sequence, or tree, of causes and effects.
