Monday, July 8, 2019

Review: Engineering A Safer World, by Nancy Leveson

This is a 6-year-old post cross-posted from my woodworking blog (written before I had this blog available). It remains as timely and important as ever. I'm reposting it now, motivated by the discussion of the Boeing 737 MAX, such as at EmbeddedArtistry.com (and mentioned on Embedded.fm).

As a software engineer I've been a dedicated reader of the RISKS Digest for over 20 years. Formally, RISKS is the Forum on Risks to the Public in Computers and Related Systems, ACM Committee on Computers and Public Policy, moderated by Peter G. Neumann (affectionately known to all as PGN).

RISKS is an online news and discussion group covering mishaps and potential mishaps in computer-related systems, everything from data breaches and privacy concerns to catastrophic failures of automated systems that have killed people. It's an extremely valuable resource, exposing people to many concerns they might not otherwise know about.

All back issues are archived and available online. It's fascinating to see the evolution of computer-related risks over time. It's also disheartening to see the same things pop up year after year as sad history repeatedly repeats itself.

Nancy Leveson's work on safety engineering has been mentioned in RISKS ever since volume 1, issue 1. She's currently Professor of Aeronautics and Astronautics and Professor of Engineering Systems at MIT. Her 2011 book Engineering A Safer World: Systems Thinking Applied to Safety was noted in RISKS 26.71, but has not yet been reviewed there. I offer this informal review.

This book should be required reading for anyone who wishes to avoid having their work show up as a RISKS news item. There's no excuse for not reading it: Leveson and MIT Press have made it available as a free downloadable PDF (555 pages), which is how I read it. The download link is available on the book's webpage at http://mitpress.mit.edu/books/engineering-safer-world.

This was my first introduction to formal safety engineering, so yes, I speak with the enthusiasm of the newly evangelized.

The topic is the application of systems theory to the design of safer systems and the analysis of accidents in order to prevent future accidents (not, notably, to assign blame). Systems theory originated in the 1930s and 1940s to cope with the increasing complexity of the systems being built at that time.

This theory holds that systems are designed, built, and operated in a larger sociotechnical context. Control exists at multiple hierarchical levels, with new properties emerging at higher levels ("emergent properties"). Leveson says safety is an emergent property arising not from the individual components, but from the system as a whole. When analyzing an accident, you must identify and examine each level of control to see where it failed to prevent the accident.

So while an operator may have been the person who took the action that caused an accident, you must ask why that action seemed a reasonable one to the operator, why the system allowed the operator to take that action, why the regulatory environment allowed the system to be run that way, etc. Each of these levels may have been an opportunity to prevent the accident. Learning how they failed to do so is an opportunity to prevent future accidents.
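For fellow software engineers, here's a toy sketch of that idea in code. This is my own illustration, not Leveson's notation or anything from the book; the level names and questions are just the examples from the paragraph above, and the point is only that the analysis deliberately walks past the operator and up the hierarchy.

```python
# Toy sketch (my wording, not the book's formalism): an accident analysis
# walks up the hierarchy of control levels, asking at each level why it
# failed to prevent the loss, rather than stopping at the operator.
from dataclasses import dataclass

@dataclass
class ControlLevel:
    name: str        # e.g. "Operator", "System design", "Regulator"
    constraint: str  # the safety constraint this level is supposed to enforce
    question: str    # the "why" question the analyst asks at this level

levels = [
    ControlLevel("Operator",
                 "take only actions that keep the system within safe limits",
                 "Why did that action seem reasonable to the operator at the time?"),
    ControlLevel("System design",
                 "make unsafe operator actions difficult or impossible",
                 "Why did the system allow the operator to take that action?"),
    ControlLevel("Regulator",
                 "constrain how the system may be designed and operated",
                 "Why did the regulatory environment allow the system to be run that way?"),
]

# Walk from the sharp end (operator) up to the blunt end (regulator),
# treating every level as a missed opportunity to prevent the accident.
for level in levels:
    print(f"{level.name}: {level.question}")
```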

Furthermore, systems and their contexts are dynamic, changing over time. What used to be safe may no longer be. Consider that most systems are in use for decades, with many people coming and going over time to maintain and operate them, while much in the world around them changes. Leveson says most systems migrate to states of higher risk over time. If safety is not actively managed to adapt to this change, accidents become inevitable.

Another important point is the distinction between reliability and safety. Every component may operate reliably and exactly as specified, yet the system as a whole can still produce an accident, frequently because of the interactions between components and subsystems.

Much of Leveson's view can be summarized in two salient quotes. First is a brief comment on the human factor: "Depending on humans not to make mistakes is an almost certain way to guarantee that accidents will happen."

The second is more involved: 

"Stopping after identifying inadequate control actions by the lower levels of the safety control structure is common in accident investigation. The result is that the cause is attributed to "operator error," which does not provide enough information to prevent accidents in the future. It also does not overcome the problem of hindsight bias. In hindsight, it is always possible to see that a different behavior would have been safer. But the information necessary to identify that safer behavior is usually only available after the fact. To improve safety, we need to understand the reasons people acted the way they did. Then we can determine if and how to change conditions so that better decisions can be made in the future.

"The analyst should start from the assumption that most people have good intentions and do not purposely cause accidents. The goal then is to understand why people did not or could not act differently. People acted the way they did for very good reasons: we need to understand why the behavior of the people involved made sense to them at the time."

The book is organized into three parts. Part I, "Foundations," covers traditional safety engineering (specifically, why it is inadequate) and introduces systems theory. Part II, "STAMP: An Accident Model Based On Systems Theory," introduces System-Theoretic Accident Model and Processes, covering safety constraints, hierarchical safety control structures, and process models. Part III, "Using STAMP," covers how to apply it, including the STPA (System-Theoretic Process Analysis) approach to hazard analysis and the CAST (Causal Analysis based on STAMP) accident analysis method.
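To give software readers a feel for what an STPA-style analysis produces, here is a minimal sketch of the bookkeeping involved: hazards, the safety constraints derived from them, and the unsafe control actions that could violate them. The record structure and the example entries are my own illustration (a generic terrain-clearance scenario), not the book's formal definitions, which are considerably more careful.

```python
# Minimal sketch, not the book's formalism: the kinds of records an
# STPA-style hazard analysis accumulates. All names and the example
# scenario below are my own illustration.
from dataclasses import dataclass

@dataclass
class Hazard:
    ident: str
    description: str   # a system state that, in a worst-case environment, leads to a loss

@dataclass
class SafetyConstraint:
    ident: str
    prevents: Hazard   # the hazard this constraint exists to prevent
    text: str

@dataclass
class UnsafeControlAction:
    controller: str    # which controller issues the control action
    action: str
    context: str       # the context that makes the action unsafe
    violates: SafetyConstraint

h1 = Hazard("H-1", "Aircraft violates minimum separation from terrain")
sc1 = SafetyConstraint("SC-1", h1,
                       "The flight control system must maintain minimum terrain clearance")
uca1 = UnsafeControlAction("Automated flight control", "commands nose-down trim",
                           "when the aircraft is close to terrain", sc1)

print(f"{uca1.controller} {uca1.action} {uca1.context} -> violates {uca1.violates.ident}")
```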

Throughout, Leveson illustrates her points with accidents from various domains. These cover a military helicopter friendly-fire shootdown, chemical and nuclear plant accidents, pharmaceutical issues, the Challenger and Columbia space shuttle losses, air and rail travel accidents, the loss of a satellite, and contamination of a public water supply. They resulted in deaths, injuries with prolonged suffering, destruction, and significant financial losses. There's also one fictional case used for training purposes.

The satellite loss was an example where there was no death, injury, or ground damage, but an $800 million satellite was lost, along with a $433 million launch vehicle (all due to a single misplaced decimal point in a software configuration file). Financial losses in all cases included secondary costs due to litigation and loss of business. Accidents are expensive in both humanity and money.

Several accidents are examined in great detail to expose the complexity of the event and glean lessons, identifying the levels of control, the system hazards they faced, and the safety constraints they violated. They show that the answer to further prevention is not simply to punish the operator on duty at the time. What's to prevent another accident from occurring under a different operator? What systemic factors exist that increase the likelihood of accidents?

These systems affect us every day. During the time I was reading the book, there was an airliner crash at San Francisco, a fiery oil train derailment in Canada, and a major passenger train derailment in Spain. I started reading it while a passenger on an aircraft model mentioned 14 times in the book, and read the remainder while traveling to and from work on the Boston commuter rail.

The book can be read on several levels. At a minimum, the case studies and analyses are horribly fascinating for the lessons they impart. Fans of The Andromeda Strain will be riveted.

As I read the account of two US Black Hawk helicopters shot down by friendly fire over Iraq, I could visualize a split screen showing the helicopters flying low in the valleys of the no-fly zone to avoid Iraqi air defense radar, the traces going inactive on the AWACS radar scopes, the F-15s picking up unidentified contacts that did not respond to IFF, and the mission controllers back in Turkey, as events ground to their inexorable conclusion. It made my hair stand on end.

All the case studies are equally jaw-dropping, down to the final example of a contaminated water supply in Ontario. Further shades of Andromeda, since that one was a biological accident that resulted in deaths.

They're all examples of systems that migrated to very high-risk states, where they became accidents waiting to happen. It was just a matter of which particular event, out of the many possible, would trigger the accident.

Part of what's so shocking about these cases is the enormously elaborate, multilayered safety systems that were in place. The military goes to great lengths in its air operations control to avoid friendly-fire incidents; the satellite software development process had numerous checkpoints; NASA had a significant safety infrastructure.

Yet it seems that this very elaborateness contributed to a false sense of safety, with uncoordinated control leaving gaps in coverage. In some cases this led to complacency that resulted in scaling back safety programs.

The other shocking cases were at the opposite end of the spectrum: plants operated in a much more fast-and-loose manner.

The one bright spot in the case studies is the description of the US Navy's SUBSAFE program, instituted after the loss of the USS Thresher in 1963. Thresher flooded during deep-dive testing; despite emergency recovery attempts by the crew, the boat was unable to surface. Just pause and think about that for a moment.

SUBSAFE is an example of a tightly focused and rigorously executed safety program. The result is that no submarine has been lost in 50 years, with the exception of the USS Scorpion in 1968, for which the program requirements had been waived. The result of that tragic lesson was that the requirements were never again waived.

The book can be read at an academic level, as a study of the application of systems theory to the creation of safer systems and analysis of accidents. It can be read at an engineering level, as a guide on how to apply the methodology in the development and operation of such systems. It's not a cookbook, but it points you in the right direction. It includes an extensive bibliography for follow-up.

Even those who work on systems that don't present life-safety or property-damage risks can benefit, because any system behaving poorly can make people's lives miserable. Such systems frequently pose significant business risks, bearing on the life or death of a company.

This book, paired with PGN's book Computer-Related Risks, would make an excellent junior- or senior-level college survey course for all engineering fields, along the lines of "with great power comes great responsibility". While some might feel it's a text better suited to a graduate-level practicum, I think it's worth conveying at the undergraduate level to reach a broader audience.