Resilience Engineering (Hollnagel, Woods, Leveson, 2006)

Concepts and Precepts

For Resilience Engineering, 'failure' is the result of the adaptations necessary to cope with the complexity of the real world, rather than a breakdown or malfunction. The performance of individuals and organizations must continually adjust to current conditions and, because resources and time are finite, such adjustments are always approximate. This definitive new book explores this groundbreaking new development in safety and risk management, where 'success' is based on the ability of organizations, groups and individuals to anticipate the changing shape of risk before failures and harm occur.

Featuring contributions from many of the world's leading figures in the fields of human factors and safety, "Resilience Engineering" provides provocative insights into system safety as an aggregate of its various components, subsystems, software, organizations, human behaviours, and the way in which they interact. The book provides an introduction to Resilience Engineering of systems, covering both the theoretical and practical aspects. It is written for those responsible for system safety on managerial or operational levels alike, including safety managers and engineers (line and maintenance), security experts, risk and safety consultants, human factors professionals and accident investigators.

About the Author

Erik Hollnagel became Industrial Safety Chair at École des Mines de Paris in 2006, after having been Professor of Human-Machine Interaction at Linköping University, Sweden, since 1999. David D. Woods is Professor at the Institute for Ergonomics, Ohio State University, USA, and Past-President of the Human Factors and Ergonomics Society. Nancy Leveson is Professor of Aeronautics and Astronautics at Massachusetts Institute of Technology, USA.

"Resilience appears to convey the properties of being adapted to the requirements of the environment, or otherwise being able to manage the variability or challenging circumstances the environment throws up... the notion is important of being able to read the environment appropriately and to be able to anticipate, plan and implement appropriate adjustments to address perceived future requirements."

"the basic unit of analysis for assessing resilience is observations of a system's response to varying kinds of disturbances"

JLJ - This book offers insight into the dynamic nature of a complex adaptive system and is directly applicable to a machine attempting to 'understand' what is going on in a game, such as chess.

[Prologue: Resilience Engineering Concepts, David Woods, Erik Hollnagel, p.1-6]

p.2 doing things safely always has been and always will be part of operational practices

p.3 failures occur when multiple contributors - each necessary but only jointly sufficient - combine... The thesis... is that failure... represents the temporary inability to cope effectively with complexity. Success belongs to organisations, groups and individuals who are resilient in the sense that they recognise, adapt to and absorb variations, changes, disturbances, disruptions, and surprises - especially disruptions that fall outside of the set of disturbances the system is designed to handle... safety is created through proactive resilient processes rather than through reactive barriers and defences.

p.5 Safety is, in the words of Karl Weick, a dynamic non-event.

[JLJ - Perhaps in his thinking, but it can also be seen as a reduction - say year over year - of countable unwanted events. We count highway fatalities every year, and when we have a reduction in highway deaths per 100,000 miles driven - we claim an increase in safety.]

p.6 Resilience engineering is a paradigm for safety management that focuses on how to help people cope with complexity under pressure to achieve success... A resilient organisation treats safety as a core value, not a commodity that can be counted. Indeed, safety shows itself only by events that do not happen! ... organisations continue to invest in anticipating the changing potential for failure because they appreciate that their knowledge of the gaps is imperfect and that their environment constantly changes. One measure of resilience is therefore the ability to create foresight - to anticipate the changing shape of risk, before failure and harm occurs

[JLJ - Well, consider airbags combined with seatbelts in automobiles. Regardless of whether they reduce highway fatalities, it is a clever design that you just want to have in your car because of the opportunity it has to protect you or a loved one from injury in an accident.]

[Resilience - the Challenge of the Unstable, Erik Hollnagel, p.9-17]

p.12 Many authors have pointed out that accidents can be seen as due to an unexpected combination or aggregation of conditions or events (e.g., Perrow, 1984). A practical term for this is concurrence, meaning the temporal property of two (or more) things happening at the same time and thereby affecting each other.

p.13 The essence of the systemic view can be expressed by the following four points:

Normal performance... as well as failures are emergent phenomena. Neither can therefore be attributed to or explained by referring to the (mal)functions of specific components or parts...
The outcomes of actions may sometimes differ from what was intended, expected or required...
The adaptability and flexibility of human work is the reason for its efficiency. Normal actions are successful because people adjust to local conditions, to shortcomings or quirks of technology, and to predictable changes in resources and demands...
The adaptability and flexibility of human work is, however, also the reason for the failures that occur, although it is rarely the cause of such failures. Actions and response are almost always based on a limited rather than complete analysis of the current conditions, i.e., a trade-off of thoroughness for efficiency.

p.15 It is, indeed, a consequence of the systemic view that the potential for (complex) accidents cannot be described by a fixed structure such as a tree, graph, or network, but must invoke some way of representing dynamic bindings or couplings... Indeed, the problems of risk assessment may to a large degree arise from a reliance on graphical representations, which... are unable adequately to account for concurrence [JLJ defined, p. 12, the temporal property of two (or more) things happening at the same time and thereby affecting each other] and for how a stable system slowly or abruptly may become unstable.

p.15-16 The real challenge for system safety, and therefore also for resilience engineering, is to recognize that complex systems are dynamic and that a state of dynamic stability sometimes may change into a state of dynamic instability. This change may be either abrupt, as in an accident, or slow, as in a gradual erosion of safety margins. Complex systems must perforce be dynamic since they must be able to adjust their performance to the conditions... Complex systems must, however, be dynamically stable, or constrained, in the sense that the adjustments do not get out of hand but at all times remain under control.

p.16 Dictionaries commonly define resilience as the ability to "recover quickly from illness, change, or misfortune"... it is easier to recover from a potentially destabilising disturbance if it is detected early. The earlier an adjustment is made, the smaller the resulting adjustments are likely to be... the definition of resilience can be modified to be the ability of a system or an organisation to react to and recover from disturbances at an early stage, with minimal effect on the dynamic stability. The challenges to system safety come from instability, and resilience engineering is an expression of the methods and principles that prevent this from taking place.
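The claim that earlier adjustments tend to be smaller can be illustrated with a toy calculation. The sketch below is not from the book: it assumes, purely for illustration, that an undetected deviation grows geometrically at each time step, and the function name and parameters (correction_needed, initial_deviation, growth) are hypothetical.

```python
def correction_needed(detect_step: int,
                      initial_deviation: float = 1.0,
                      growth: float = 1.5) -> float:
    """Size of the deviation at the step where it is detected.

    Assumption (not from the source): an uncorrected deviation grows by a
    constant factor `growth` per step, so the adjustment required equals
    the accumulated deviation at detection time.
    """
    if detect_step < 0:
        raise ValueError("detect_step must be non-negative")
    return initial_deviation * growth ** detect_step


early = correction_needed(2)  # disturbance caught after 2 steps
late = correction_needed(6)   # same disturbance caught after 6 steps
print(early, late)
```

Under these assumed numbers the late correction is several times larger than the early one, which is the qualitative point of the passage; real systems need not follow a geometric growth law.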

p.17 Rather than looking for causes we should look for concurrences, and rather than seeing concurrences as exceptions we should see them as normal and therefore also as inevitable... it is the concurrence of a number of events, just on the border of the ordinary, that constitutes an explanation of the accident or event.

[Essential Characteristics of Resilience, David Woods, p.21-34]

p.22 The focus is on assessing the organization's adaptive capacity relative to challenges to the capacity... resilience engineering devotes effort to make observable the organization's model of how it creates safety, in order to see when the model is in need of revision.

p.23 Monitoring and managing resilience... is concerned with understanding how the system adapts and to what kinds of disturbances in the environment, including properties such as:

buffering capacity: the size or kinds of disruptions the system can absorb or adapt to without a fundamental breakdown in performance or in the system's structure;
flexibility versus stiffness: the system's ability to restructure itself in response to external changes or pressures;
margin: how closely or how precarious the system is currently operating relative to one or another kind of performance boundary;
tolerance: how a system behaves near a boundary - whether the system gracefully degrades as stress/pressure increase or collapses quickly when pressure exceeds adaptive capacity.
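
Two of the properties above, margin and tolerance, lend themselves to a small numeric sketch. Everything here is an illustrative assumption rather than anything Woods specifies: the boundary is treated as a single known number, and the names OperatingPoint, graceful_performance and brittle_performance are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class OperatingPoint:
    """Hypothetical snapshot of a system's load against one performance boundary."""
    load: float      # current demand on the system (arbitrary units)
    boundary: float  # load at which performance breaks down (assumed known)

    def margin(self) -> float:
        """Fraction of the boundary still unused; 0.0 means operating at the boundary."""
        return (self.boundary - self.load) / self.boundary


def graceful_performance(load: float, boundary: float) -> float:
    """Tolerant system: performance falls off linearly once the boundary is passed."""
    if load <= boundary:
        return 1.0
    return max(0.0, 1.0 - (load - boundary) / boundary)


def brittle_performance(load: float, boundary: float) -> float:
    """Brittle system: performance collapses as soon as the boundary is passed."""
    return 1.0 if load <= boundary else 0.0


point = OperatingPoint(load=80.0, boundary=100.0)
print(point.margin())
print(graceful_performance(110.0, 100.0), brittle_performance(110.0, 100.0))
```

The two performance functions behave identically below the boundary and differ only beyond it, which is where tolerance, as described above, becomes visible. Note that p.72 of the book cautions that margin is in practice not a simple static distance to a definitive boundary, so this is a deliberate simplification.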

p.26-27 Even more difficult, the six goals represent a set of interacting and often conflicting pressures so that in adapting to reach for one of these goals it is very easy to undermine or squeeze others. To improve on all simultaneously is quite tricky.

[Defining Resilience, Andrew Hale, Tom Heijer, p.35-40]

p.35 Resilience first conjures up in the mind pictures of bouncing back from adversity... If we were to apply this to organizations, the emphasis would come to fall on responding to disaster... This captures some of the essentials, with an emphasis on flexibility, coping with unexpected and unplanned situations and responding rapidly to events, with excellent communication and mobilisation of resources to intervene at the critical points. However, we would argue that we should extend the definition a little more broadly, in order to encompass also the ability to avert disaster or major upset, using these same characteristics. Resilience then describes also the characteristic of managing the organisation's activities to anticipate and circumvent threats to its existence and primary goals.

p.36-37 Reverting to Rasmussen's model, resilience is the ability to steer the activities of an organisation so that it may sail close to the area where accidents will happen, but always stays out of that dangerous area. This implies a very sensitive awareness of where the organisation is in relation to that danger and a very rapid and an effective response when signals of approaching or actual danger are detected, even unexpected or unknown ones.

p.37 We cannot talk of resilience unless the organisation achieves this feat consistently over a long period of time... resilience is a dynamic process of steering and not a static state of an organisation. It has to be worked at continuously and, like the voyage of the Flying Dutchman, the task is never ended and the resilience can always disappear or be proven ineffective in the face of particular threats.

p.40 In this short note we are pleading for two things: what is interesting for safety is preventing accidents and not just surviving them. If resilience is used with its common meaning of survival in adversity, we do not see it to be of interest to us. If its definition is extended to cover the ability in difficult conditions to stay within the safe envelope and avoid accidents it becomes a useful term... We would enter a plea that we should consider resilience against the background of the size of the risk.

[A Typology of Resilience Situations, Ron Westrum, p.55-65]

p.55 There are basically three aspects to threats:

The predictability of the threat...
The threat's potential to disrupt the system...
The origin of the threat (internal vs. external).

p.56-57 With these basic aspects, let us see what we can do to develop a classification of situations...

Situation I. The Regular Threat... Regular threats are those that occur often enough for the system to develop a standard response. Trouble comes in one of a number of standard configurations, for which an algorithm of response can be formulated....

Situation II. The Irregular Threat... The more challenging situation is the one-off event, for which it is virtually impossible to provide an algorithm, because there are so many similar low-probability but devastating events that might take place; and one cannot prepare for all of them... This kind of emergency tests the organization's ability to self-organise and respond effectively to crisis (Hauser, 2004).

Situation III. The Unexampled Event... Situation III is marked by events that are so awesome or so unexpected that they require more than the improvisation of Situation II. They require a shift in mental framework. It may appear impossible that something like the event could happen.

p.59 Resilience thus has three major meanings.

Resilience is the ability to prevent something bad from happening,
Or the ability to prevent something bad from becoming worse,
Or the ability to recover from something bad once it has happened.

[JLJ - Yes, but missing is the concept of being able to 'go on' towards desired goals, in spite of what has happened, is threatening to happen, or what is happening at the present moment. At every present moment, we are either recovering from a problem, preparing to engage in a maneuver, or correcting a planned maneuver of some kind - we are in the middle of things. Resilience is the ability to configure or reconfigure or 'kludge' a posture or stance, in order to be able to 'go on'. You might be able 'to prevent something bad from happening' by staying inside your house and never going outside. But that is hardly resilience, because it interferes with your ability to be able to 'go on'. Think of an actor on stage during a drama production who temporarily forgets a line. Another actor might say the missing line, or perhaps might prompt the actor by saying, "were you going to tell her about your trip to Europe?" The first actor now recovers and says the line, and the play continues. The play 'goes on' in part due to a 'kludge'. Perhaps an actor breaks their hand backstage before the start of the play and has to go to the emergency room. The stage manager then takes a script and performs the role of the missing actor. Again, something happens to make things 'go on'. An organization decides to downsize senior management or reduce the number of divisions from three to two in order to reduce overhead costs, and become more competitive in winning proposals. Resilience is all about being able to configure or reconfigure or 'kludge' a posture or stance, in order to be able to 'go on', in spite of expected or unexpected events. You are injured - versus killed - in a car accident, and after a few weeks of recovery continue on as before.]

p.59 The ability to anticipate when and how calamity might strike has been called 'requisite imagination' (Adamski & Westrum, 2003).

p.61 often organizations are resilient because they can respond quickly or even redesign themselves in the midst of trouble. They might use 'slack resources' or other devices that help them cope with struggle. The organization's flexibility is often a key factor in organizing to fight the problem. They are thus 'adaptive' rather than 'tough.'

p.65 No question that resilience is important. But what is it? Resilience is a family of related ideas, not a single thing... Organizational resilience is supported by internal processes.

[Incidents - Markers of Resilience or Brittleness?, David Woods, Richard Cook, p.69-76]

p.69 The adaptive capacity of any system is usually assessed by observing how it responds to disruptions or challenges. Adaptive capacity has limits or boundary conditions, and disruptions provide information about where those boundaries lie and how the system behaves when events push it near or over those boundaries. Resilience in particular is concerned with understanding how well the system adapts and to what range or sources of variation. This allows one to detect undesirable drops in adaptive capacity and to intervene to increase aspects of adaptive capacity.

p.71 the value of incidents is in how they mark boundary conditions on the mechanisms/model of adaptiveness built into the system's design. Incidents simultaneously show how the system in question can stretch given disruptions and the limits on that capacity to handle or buffer these challenges.
Assessing resilience requires models of classes of adaptive behavior. These models need to capture the processes that contribute to adaptation when surprising disruptions occur.

p.72 One part of assessing a system's resilience is whether that system knows if it is operating near boundary conditions. Assessing the margin is not a simple static state (the distance of an operating point to a definitive boundary), but a more complex assessment of adaptive responses to different kinds of disturbances. Incidents are valuable because they provide information about what stretches the system and how well the system can stretch.

p.74 Our conjecture is that, inspired directly or indirectly by these very detailed situations of judging adaptive capacity in supervisory control, we can create mechanisms to monitor the adaptive capacity of organisations and anticipate when its adaptive capacity is precarious.

p.75 in safety management, change and production pressure are disturbances that erode or place new demands on adaptive capacity.

p.75 the basic unit of analysis for assessing resilience is observations of a system's response to varying kinds of disturbances.

[JLJ - Bingo. Ding ding ding ding ding. You win a prize. Absolutely correct. Going further, in certain cases we can intelligently simulate our system stretching to adapt, or buckling and failing, and infer or assess an appropriate degree of resilience, based on these interpreted results.]
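
The idea of observing (or simulating) a system's response to disturbances of varying size, and reading off where it stretches and where it buckles, might be sketched as follows. This is a minimal illustration under stated assumptions: the system is reduced to a yes/no response function, and probe_boundary and toy_system are hypothetical names, not anything proposed in the book.

```python
from typing import Callable, Iterable, Optional, Tuple


def probe_boundary(
    respond: Callable[[float], bool],
    disturbances: Iterable[float],
) -> Tuple[Optional[float], Optional[float]]:
    """Bracket a system's boundary from observed responses.

    `respond(size)` returns True if the system absorbed a disturbance of
    that size. The result is (largest absorbed, smallest not absorbed);
    either entry is None if no such observation was made.
    """
    largest_absorbed: Optional[float] = None
    smallest_failed: Optional[float] = None
    for size in disturbances:
        if respond(size):
            if largest_absorbed is None or size > largest_absorbed:
                largest_absorbed = size
        elif smallest_failed is None or size < smallest_failed:
            smallest_failed = size
    return largest_absorbed, smallest_failed


def toy_system(size: float) -> bool:
    """Hypothetical system that recovers from any disturbance up to size 7."""
    return size <= 7.0


print(probe_boundary(toy_system, [1, 3, 6, 8, 12]))
```

The observations only bracket the boundary (somewhere between the largest absorbed and the smallest failed disturbance), which fits the point on p.71 that incidents mark boundary conditions rather than measure them exactly.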

p.75 Measures of brittleness and resilience will emerge when we abstract general patterns from specific cases of challenge and response.

[Resilience Engineering: Chronicling the Emergence of Confused Consensus, Sidney Dekker, p.77-92]

p.77 Detecting drift into failure that happens to seemingly safe systems, before breakdowns occur, is a major role for resilience engineering.

p.78 If charting the distance between operations as imagined and as they really occur is too difficult, then an even broader indicator of resilience could be the extent to which the organization succeeds in keeping discussions of risk alive even when everything looks safe.

p.80 safety and risk in safe systems are emergent properties that arise from a much more complex interaction of all factors that constitute normal work.

p.80-81 accidents require their own set of models if people want to gain predictive leverage.

[Engineering Resilience into Safety-Critical Systems, Nancy Leveson et al., p.95-124]

p.97 While events reflect the effects of dysfunctional interactions and inadequate enforcement of safety constraints... the events are the result of the inadequate control.

p.98-99 Preventing accidents requires designing a control structure... that will enforce the necessary constraints on development and operations.

p.103 System dynamics provides a framework for dealing with dynamic complexity, where cause and effect are not obviously related... System dynamics models are formal and can be executed, like our other models.

p.106 In this chapter, we are concerned with resilience and therefore will concentrate on how system dynamics models can be used to design and analyze resilience

p.107 System behavior in system dynamics is modeled by using feedback (causal) loops, stock and flows (levels and rates), and the non-linearities created by interactions among system components.
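
As a rough illustration of stocks, flows and a feedback loop, the sketch below steps a single stock forward in time. It is not one of the models from the chapter: the stock (a "safety margin"), the constant erosion outflow, the restoring inflow and all parameter values are assumptions chosen only to make the mechanics visible.

```python
from typing import List


def simulate_margin(steps: int,
                    dt: float = 1.0,
                    margin: float = 10.0,
                    target: float = 10.0,
                    erosion_rate: float = 0.5,
                    restore_gain: float = 0.25) -> List[float]:
    """Euler-step one stock with one outflow and one balancing feedback loop.

    Stock:   margin (hypothetical safety margin, arbitrary units)
    Outflow: constant erosion, e.g. from production pressure (assumed)
    Inflow:  restore_gain * (target - margin), i.e. investment that grows
             as the margin falls below its target (balancing feedback)
    """
    history = [margin]
    for _ in range(steps):
        inflow = restore_gain * (target - margin)
        outflow = erosion_rate
        margin += (inflow - outflow) * dt
        history.append(margin)
    return history


with_feedback = simulate_margin(50)
without_feedback = simulate_margin(20, restore_gain=0.0)
print(with_feedback[-1], without_feedback[-1])
```

With the feedback loop active the stock settles where inflow equals outflow (target minus erosion_rate / restore_gain); with restore_gain set to zero the same erosion drains the margin steadily, a simple picture of the gradual erosion of safety margins mentioned on p.15-16.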

[Is Resilience Really Necessary? The Case of Railways, Andrew Hale, Tom Heijer, p.125-148]

p.147 Resilience is only one strategy for achieving very high levels of safety.

[Organizational Resilience and Industrial Risk, Nick McDonald, p.156-180]

p.156 Resilience appears to convey the properties of being adapted to the requirements of the environment, or otherwise being able to manage the variability or challenging circumstances the environment throws up... the notion is important of being able to read the environment appropriately and to be able to anticipate, plan and implement appropriate adjustments to address perceived future requirements.

p.157,158,159,160 Resilience represents the capacity (of an organisational system) to anticipate and manage risk effectively, through appropriate adaptation of its actions, systems and processes, so as to ensure that its core functions are carried out in a stable and effective relationship with the environment... maintaining stability requires the capacity to adjust... On the other hand, resilience seems also to require a certain flexibility and capacity to adapt to circumstances... achieving resilience (whatever that means) is not just a matter of finding the right technical solution to an operational problem, but of constructing a better way of understanding the operational space.

p.164 In organisational analysis and diagnosis often the easy part is to identify the apparent imperfections, deficiencies and inconsistencies to which most organisations are subject. What is less obvious may be what keeps the system going at a high level of safety and reliability despite these endemic problems. In many systems, what is delivering operational resilience (flexibility to meet environmental demands adequately) is the "professionalism" of the front line staff. In this context, professionalism perhaps refers to the ability to use one's knowledge and experience to construct and sustain an adequate response to varying, often unpredictable and occasionally testing demands from the operational environment... improvisational characteristics... are often employed.

p.167-168 At the operational level, therefore, resilience may be a function of the way in which organisations approach and manage the contradictory requirements of, on the one hand, good proceduralisation and good planning, and on the other hand, appropriate flexibility to meet the real demands of the operation as they present on any particular day.

p.173 The concept of resilience would seem to require both the capacity to anticipate and manage risks before they become serious threats to the operation, as well as being able to survive situations in which the operation is compromised, such survival being due to the adequacy of the organisation's response to that challenge.

[Taking Things in One's Stride: Cognitive Features of Two Resilient Performances, Richard Cook, Christopher Nemeth, p.205-221]

p.206 Resilience is a feature of some systems that allows them to respond to sudden, unanticipated demands for performance and then return to their normal operations quickly and with a minimum decrement in their performance.

p.216 We propose the cases [JLJ - previously described in detail] as examples of resilient performances and these resilient performances as evidence of the presence of a resilient system... resilient performances occur in the face of sudden, unanticipated demands. At present, the only strong evidence of resilience that we can identify is the presence of these resilient performances.

[JLJ - For the case of a machine 'playing' a complex game of strategy, we infer the presence of resilience when we can demonstrate through the examination of likely scenarios that resilient performances are likely to result.]

p.220-221 We propose that resilient performance is empirical evidence of resilience... What distinguishes resilient performance is the fact that practitioners are able to move through the goal - means hierarchy to address the threat. In this formulation, cognition is the critical factor in resilient performances and the central feature of what it means for a system to possess the dynamic functional characteristic of resilience.
If our conclusions are correct, then research on resilience will likely be some combination of three themes. The first is research on cognition - including distributed cognition - in demanding situations. The second is research on the explanation of goal - means hierarchies in naturalistic settings. The third is research on the characteristics of sudden demands for resources and the reactions that they evoke.

[Erosion of Managerial Resilience: From Vasa to NASA, Rhona Flin, p.223-233]

p.227-228 Three component skills characterize managerial resilience in relation to safety. The first is Diagnosis - detecting the signs of operational drift towards a safety boundary. For a manager this means noticing changes in risk profile of the current situation and recognising that the tolerance limit is about to be (or has been) breached. This requires knowledge of the organisational environment, as well as risk sensitivity... The second component is Decision-making - having recognised that the risk balance is now unfavourable (or actually dangerous), managers have to select the appropriate action to reduce the diagnosed level of threat to personnel and/or plant safety... In order to accomplish this kind of resilience response, the manager may also acquire Assertiveness skills in order to persuade other personnel (especially more senior) that production has to be halted or costs sacrificed.

[Learning How to Create Resilience in Business Systems, Gunilla Sundstrom, Erik Hollnagel, p.235-252]

p.248 An organisation is resilient if it is able successfully to adjust to the compounded impact of internal and external events over a significant time period.

[Properties of Resilient Organizations: An Initial View, John Wreathall, p.275-285]

p.275 resilience... one of the simplest explanations is contained in the following description...

Resilience is the ability of an organization (system) to keep, or recover quickly to, a stable state, allowing it to continue operations during and after a major mishap or in the presence of continuous significant stresses.

[Auditing Resilience in Risk Control and Safety Management Systems, Andrew Hale, Frank Guldenmund, Louis Goossens, p.289-314]

p.291,292,293 The technical modelling of the plant [JLJ - using the ARAMIS risk assessment and audit tool] identifies all critical equipment and accident scenarios for the plant concerned and analyses what barriers the company claims to use to control these scenarios. It is essential that this identification process is exhaustive for all potential accidents with major consequences, otherwise crucial scenarios may be missed, which may later turn out to be significant.... Inherent hazards... are kept under control by barriers... If these barriers are not present, or kept in good operating state, the hazard will not be effectively controlled and the scenario will move towards the centre event, the loss of control. This is usually defined for chemicals as a loss of containment of the chemical... The management model on which the audit is based is structured around the life cycle of these barriers or barrier elements

p.299 the management of major hazards is inescapably a complex process... Without a clear picture of what is to be controlled, the other elements of resilience have no basis on which to operate.

p.303 Barriers were originally conceived of as largely physical things in the path of energy flows. However, the concept has now been extended to procedural, immaterial and symbolic elements (Hollnagel, 2004), in other words everything that keeps energy (or more recently information) flows and processes from deviating from their desired pathways... Barriers are defined as... behavior, which keeps the critical process within its safe limits. In other words, barriers are seen more as the frontier posts guarding the safe operating envelope... They are also defined as control devices

p.314 An audit tool should, according to our arguments, be able to recognise at least a number of the weak signals indicating that an organisation is not resilient. The question then is how to improve matters. What the audit tool tells the organisation is which of the performance indicators is off-specification... What an audit cannot say is how to achieve that change.

[How to Design a Safety Organization: Test Case for Resilience Engineering, David D. Woods, p.315-325]

p.315 Woods (2005a)... argues that organizational accidents represent breakdowns in the processes that produce resilience.

p.316 How do people detect that problems are emerging or changing when information is subtle, fragmented, incomplete or distributed across the different groups involved in production processes and in safety management?

p.318 Resilience Engineering, if it is a meaningful and practical advance in safety management, should be able to specify the design of safety organizations as a work-a-day part of the organization's activities.

p.321 The safety organization's mission then is to monitor the organization's resilience including the ability to make targeted investments to restore resilience and reduce brittleness.

p.322 The tragedy of the commons is a name for a baseline adaptive dynamic whereby the actors, by acting rationally in the short term to generate a return in a competitive environment, deplete or destroy the common resource on which they depend in the long run.

[States of Resilience, Erik Hollnagel, Gunilla Sundstrom, p.339-346]

p.339 A resilient system, or organisation, is able to withstand the effects of stress and strain and to recover from adverse conditions over long time periods. One way of describing that is to think of a system as being in one of several states...

p.341-342 There must clearly always be a state of normal functioning, where the system provides or produces what it is intended to do in a reliable and, if required, profitable manner. There will usually also be a state of regular reduced functioning... as well as a state of irregular reduced functioning due to a lack of internal resources... The transitions between normal and regular reduced functioning are scheduled, whereas the transitions between normal and irregular reduced functioning usually are unexpected... There will, unfortunately, also always be a state of disturbed functioning, corresponding to the unhealthy or even catastrophic states... it may be the mark of a resilient organization that it has a number of different modes of functioning whenever a disturbance happens.
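
The states named in this passage can be written down as a small lookup, shown here as a hedged sketch. Only the two transition types the excerpt actually describes are encoded; transitions into or out of disturbed functioning are not characterised in the quoted text, so they are deliberately left out. The names Mode and transition_kind are hypothetical.

```python
from enum import Enum
from typing import Optional


class Mode(Enum):
    NORMAL = "normal functioning"
    REGULAR_REDUCED = "regular reduced functioning"
    IRREGULAR_REDUCED = "irregular reduced functioning"
    DISTURBED = "disturbed functioning"


# Transition types as described in the quoted passage. Each pair is
# unordered because the text speaks of transitions "between" the states.
_TRANSITIONS = {
    frozenset({Mode.NORMAL, Mode.REGULAR_REDUCED}): "scheduled",
    frozenset({Mode.NORMAL, Mode.IRREGULAR_REDUCED}): "unexpected",
}


def transition_kind(source: Mode, target: Mode) -> Optional[str]:
    """Return 'scheduled' or 'unexpected', or None if the excerpt does not say."""
    return _TRANSITIONS.get(frozenset({source, target}))


print(transition_kind(Mode.NORMAL, Mode.REGULAR_REDUCED))
print(transition_kind(Mode.IRREGULAR_REDUCED, Mode.NORMAL))
print(transition_kind(Mode.NORMAL, Mode.DISTURBED))
```

A fuller model would need the chapter's own account of how a system enters and leaves disturbed functioning; the None result marks that gap rather than guessing at it.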

[Epilogue: Resilience Engineering Precepts, Erik Hollnagel, David D. Woods, p.347-358]

p.347-348 We can only measure the potential for resilience but not resilience itself... resilience is tantamount to coping with complexity

p.350 A resilient system must have the ability to anticipate, perceive and respond. Resilience engineering must therefore address the principles and methods by which these qualities can be brought about.

p.356 Resilience requires a constant sense of unease that prevents complacency.

p.357 It is fundamental for resilience engineering to monitor and learn from the gap between work as imagined and work as practised.