Sunday, November 3, 2019

Review: Clean Agile, by Robert C. Martin, and More Effective Agile, by Steve McConnell

This started out as a review of McConnell's book, but Just-In-Time, my pre-order of Uncle Bob's book arrived Friday. Ah, sweet serendipity! I read it yesterday, and it fits right in.

I have no idea what the two authors think of each other. I don't know if they're friends, enemies, or frenemies. I don't know if they shake their fists at each other or high-five. But as a software developer, I do believe they're both worth listening to.

I've read most of the books in Martin's Clean Code series. I'm a big fan. He was one of the original signatories of the Agile Manifesto.

A recent post by Phillip Johnston, CEO of Embedded Artistry, set me off on a path reading some of Steve McConnell's books and related material. I've become a big fan of his as well.

Week before last, I read McConnell's Software Estimation: Demystifying the Black Art, 2006. Last week, I read his new book More Effective Agile: A Roadmap for Software Leaders, which just came out in August; that's the one I'm reviewing here.

This week, I'm reading his Code Complete: A Practical Handbook of Software Construction, 2nd edition, 2004, and Software Requirements, 3rd edition, 2013, by Karl Wiegers and Joy Beatty. Or maybe over the next few weeks, since they total some 1500 pages. (In the Netflix documentary series "Inside Bill's Brain: Decoding Bill Gates", one of his friends says Gates reads 150 pages an hour. That's a superpower, and I am totally jealous!)

These are areas where software engineering practice has continually run into problems.

The Critical Reading List

Martin's and McConnell's new books are excellent, to the point that I can add them as the other half of this absolutely critical reading list:
  • The Mythical Man-Month: Essays on Software Engineering, Frederick P. Brooks, Jr.
  • Peopleware: Productive Projects and Teams, Tom DeMarco and Timothy Lister
  • Clean Agile: Back to Basics, Robert C. Martin
  • More Effective Agile: A Roadmap for Software Leaders, Steve McConnell
In fact, I would be so bold as to say that not reading these once you know about them constitutes professional negligence, whether you are an engineer, a manager, or an executive. If you deal with software development in any way, producer or consumer, you must read these.

Brooks' first edition outlined the problems in software engineering in 1975. Twenty years later, his second edition showed that we were still making the same mistakes.

There are a few items that are extremely dated and quaint. Read those for their historical perspective. But don't for a moment doubt the timely relevance of the rest of the book.

Brooks is the venerated old man of this. Everybody quotes him, particularly Brooks' Law: Adding human resources to a late software project makes it later.

Every 12 years after Brooks' first edition, DeMarco and Lister addressed the theme from a different perspective in their editions of Peopleware.

Forty-four years later, we are still making the same mistakes, just cloaked in the Agile name. So McConnell's new book addresses those issues in modern, supposedly Agile organizations, with suggestions about what to do about them.

Meanwhile, Martin's book returns us to the roots of Agile, literally back to the basics, to reiterate and re-emphasize them, because many of them have been lost in what Martin Fowler calls "the Agile Industrial Complex," the industry that has grown up around promoting Agile.

The first three books are easy reading. McConnell's is roughly equivalent to two of them put together. It also forms the root of a study tree of additional resources, outlining a very practical and pragmatic approach.

There are clearly some tensions and disagreements between the authors and the way things have developed. Martin goes so far as to include material with dissenting opinions in his book.

Don't just read these once. Re-read them at least once a year. Each time, different details will feel more relevant as you make progress.

Problems

The problems in the industry that have persisted for decades can be summarized as late projects, over budget, and poor software that doesn't do what it's supposed to do or just plain doesn't work.

Tied up in this are many details: poor understanding and management of requirements, woefully underestimated work, poor understanding of hidden complexities, poor testing, and poor people management.

Much of it is the result of applying the Taylor Scientific Management method to software development. Taylorism may work for a predictable production line of well-defined inputs, steps, and outputs, running at a repeatable rate, but it is a terrible model for software management. Software development is not a production line. There are far too many unknowns.

In general, most problems arise because companies practice the IMH software project management method: Insert Miracle Here. With Agile, they have adopted the IAMH variant: Insert Agile Miracle Here.

But as Brooks writes, there are no silver bullets. Relying on miracles is not an effective project management technique. This is a source of no end of misery for all involved with software.

As Sandro Mancuso, author of the Clean Code series book The Software Craftsman: Professionalism, Pragmatism, Pride (Yes! Read it!) writes in chapter 7 of Clean Agile, "Craftsmanship", "the original Agile ideas got distorted and simplified, arriving at companies as the promise of a process to deliver software faster." I.e. miracles.

A Pet Peeve (Insert Rant Here)

One of the areas of disagreement between various authors is the open-plan office. The original Agile concept was co-locating team members so that they could communicate immediately, directly, and informally, at higher bandwidth than through emails or heavy formal documents. It was meant to foster collaboration and remove impediments to effective communication.

Peopleware is extremely critical of the open-plan office, and I couldn't agree more. The prevailing implementation of it is clearly based more on the idea of cutting real-estate and office overhead costs than on encouraging productive communication. The result has all the charm of a cattle concentration feedlot, everyone getting their four square feet to exist in.

Another distortion of the Agile concepts embraced by management at the cost of actual effective development. That might make the CFO happy, but it's a false economy that should horrify the CTO.

Those capex savings can incur significant non-recurring engineering costs and create technical problems that drive further downstream development and support costs. And that just means more opex for the facilities where the engineering gets done, because the project takes longer.

You're paying me all this money to be productive and concentrate on complex problems, then you deliberately destroy my concentration to save on furniture and floorspace? It's like a real-life version of Kurt Vonnegut's short story Harrison Bergeron. What does that do to the product design and quality? What customer problems does it create, with attendant opportunity costs?

I turned down an excellent job offer in 2012 after the on-site interviews because of this. I was bludgeoned by my impression of the office environment: sweatshop. They probably thought of me as a prima donna.

McConnell also recommends against this, referencing the 2018 article It's Official: Open-Plan Offices Are Now the Dumbest Management Fad of All Time, which summarized the findings of a Harvard study on the topic. The practice appears to me to be the office-space equivalent of Taylorism.

Ok, now that I have all that off my chest, on to the actual reviews.

Clean Agile, Robert C. Martin

Martin's premise is that Agile has gotten muddled. He says it has gotten blurred through misinterpretation and usurpation.

His purpose is to set the record straight, "to be as pragmatic as possible, describing Agile without nonsense and in no uncertain terms."

He starts out with the history of Agile, how it came about, and provides an overview of what it does. He then goes on to cover the reasons for using it, the business practices, the team practices, the technical practices, and becoming Agile.

An important concept is the Iron Cross of project management: good, fast, cheap, done: pick any three. He says that in reality, each of these four attributes has a coefficient, and good management is about managing those coefficients rather than demanding they all be at 100%; that is the kind of management Agile strives to enable, by providing data.

The next concept is Ron Jeffries' Circle of Life: the diagram describing the practices of XP (eXtreme Programming). Martin chose XP for this book because he says it is the best defined, the most complete, and the least muddled of the Agile processes. He references Kent Beck's Extreme Programming Explained: Embrace Change (he prefers the original 2000 edition; my copy is due to arrive week after next).

The enumeration and description of the various practices surprised me, reinforcing his point that things have gotten muddled. While I was aware of them, I was not aware of their original meanings and intent.

The most mind-blowing moment was reading about acceptance tests, under the business practices. Acceptance tests have become a real hand-waver, "one of the least understood, least used, and most confused of all the Agile practices."

But as he describes them, they have the power to be amazing:
  • The business analysts specify the happy paths.
  • QA writes the tests for those cases early in the sprint, along with the unhappy paths (QA engineer walks into a bar; orders a beer; orders 9999 beers; orders NaN beers; orders a soda for Little Bobby Tables; etc.). Because you want your QA people to be devious and creative in showing how your code can be abused, so that you can prevent anyone else from doing it. You want Machiavelli running your QA group.
  • The tests define the targets that developers need to hit.
  • Developers work on their code, running the tests repeatedly, until the code passes them.
Holy crap! Holy crap! This ties actual business-defined requirements end-to-end through to the running code. It is a fractal-zoom-out-one-level application of Test Driven Development (and we all thought TDD was just for the developer-written unit tests!).
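To make that concrete, here's a minimal sketch of what such a suite might look like in Python with pytest. Everything in it is my own toy example, not from the book: the order_beers function, the OrderError type, and the inline stub standing in for the real system.

    import pytest

    class OrderError(Exception):
        """Raised when an order can't be fulfilled (hypothetical error type)."""

    def order_beers(quantity, customer="anonymous"):
        """Toy stand-in for the real system under test."""
        if not isinstance(quantity, int) or isinstance(quantity, bool):
            raise OrderError("quantity must be a whole number")
        if not 1 <= quantity <= 10:
            raise OrderError("quantity out of range")
        return {"customer": customer, "poured": quantity}

    # Happy path, specified by the business analysts.
    def test_order_one_beer():
        assert order_beers(1)["poured"] == 1

    # Unhappy paths, specified by QA early in the iteration.
    def test_absurd_quantity_is_rejected():
        with pytest.raises(OrderError):
            order_beers(9999)

    def test_nan_quantity_is_rejected():
        with pytest.raises(OrderError):
            order_beers(float("nan"))

    def test_hostile_customer_name_is_treated_as_plain_data():
        receipt = order_beers(1, customer="Robert'); DROP TABLE Students;--")
        assert receipt["poured"] == 1

The developers then work the code until every one of those tests passes; when they do, the story is done.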

It completely changes the QA model. Then the unit and acceptance tests get incorporated into Continuous Build, under the team practices.

There are other important business practices that I believe are poorly understood, such as splitting and spikes. Splitting means splitting a complex story into smaller stories, as long as you maintain the INVEST guidelines:
  • Independent
  • Negotiable
  • Valuable
  • Estimable
  • Small
  • Testable
Splitting is important when you realize a story is more complex than originally thought, a common problem. Rather than trying to beat it into submission (or be beaten into submission by the attempt), break it apart and expose the complexity in manageable chunks.

I never knew just what a spike was. It's a meta-story, a story for estimating a story. It's called that because it's a long, thin slice through all the layers of the system. When you don't know how to estimate a story, you create a spike for the sole purpose of figuring that out.

Almost as mind-blowing is his discussion of the technical practices. Mind-blowing because much of this whole area has been all but ignored by most Agile implementations. Reintroducing them is one of the strengths of this book.

Martin has been talking about this for a while. He gave the talk in this video, Robert C. Martin - The Land that Scrum Forgot, at a 2011 conference (very watchable at 2x speed). The main gist is that Scrum covered the Agile management practices, but left out the Agile technical practices, yet they are fundamental to making the methodology succeed.

These are the XP practices:
  • Test-Driven Development (TDD), the double-entry bookkeeping of software development.
  • Refactoring.
  • Simple Design.
  • Pair Programming.
Of these, I would say TDD is perhaps the most-practiced. But all of these have been largely relegated to a dismissive labeling as something only the extremos do. Refactoring is seen as something you do separately when things get so bad that you're forced into it. Pair programming in particular is viewed as a non-starter.

I got my Scrum training in a group class taught by Jeff Sutherland, so pretty much from the horse's mouth. That was 5 years ago, so my memory is a bit faded, but I don't remember any of these practices being covered. I learned about sprints and stories and points, but not about these.

As Martin describes them, they are the individual daily practices that developers should incorporate into every story as they do them. Every story makes use of them in real-time, not in some kind of separate step.


Refactoring builds on the TDD cycle, recognizing that writing code that works is a separate dimension from writing code that is clean:
  1. Create a test that fails.
  2. Make the test pass.
  3. Clean up the code.
  4. Return to step 1.
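Here's a minimal sketch of one trip around that loop, in Python (my own toy example, not one of Martin's):

    # Step 1: a failing test comes first (red). The function and its name are
    # hypothetical, purely for illustration.
    def test_leading_zeros_are_stripped_from_part_numbers():
        assert normalize_part_number("007-42") == "7-42"

    # Step 2: write just enough code to make it pass (green). The quick-and-dirty
    # first version might have been:
    #     def normalize_part_number(s):
    #         return str(int(s.split("-")[0])) + "-" + s.split("-")[1]

    # Step 3: clean it up (refactor) while the test stays green.
    def normalize_part_number(part_number: str) -> str:
        prefix, suffix = part_number.split("-")
        return f"{int(prefix)}-{suffix}"

    # Step 4: return to step 1 with the next small test.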
Simple Design means "writing only the code that is required with a structure that keeps it simplest, smallest, and most expressive." It follows Kent Beck's rules:
  1. Pass all the tests.
  2. Reveal the intent (i.e. readability).
  3. Remove duplication.
  4. Decrease elements.
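As a tiny illustration of rules 2 and 3 (my own toy code, not Beck's or Martin's), here is the same behavior written twice: first with duplication and magic numbers, then reworked to reveal intent and remove the duplication while passing the same tests:

    # Before: duplicated logic, intent hidden behind magic numbers.
    def p1(amount):
        return amount + amount * 0.0625

    def p2(amount):
        return amount + amount * 0.0625 + 4.99

    # After: same observable behavior, clearer intent, no duplication.
    SALES_TAX_RATE = 0.0625   # hypothetical rate, purely illustrative
    SHIPPING_FEE = 4.99

    def price_with_tax(amount):
        return amount + amount * SALES_TAX_RATE

    def price_with_tax_and_shipping(amount):
        return price_with_tax(amount) + SHIPPING_FEE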
Pair programming is the one people find most radical and alarming. But as Martin points out, it's not an all-the-time 100% thing. It's an on-demand, as-needed practice that can take a variety of forms as the situation requires.

Who hasn't asked a coworker to look over some code with them to figure something out? Now expand that concept. It's the power of two-heads-are-better-than-one. Maybe trading the keyboard back and forth, maybe one person driving while the other talks. Sharing information, knowledge, and ideas in both directions, as well as reviewing code in real-time. There's some bang for the buck!

The final chapters cover becoming Agile, including some of the danger areas that get in the way, tools, coaching (pro and con), and Mancuso's chapter on craftsmanship, which reminds us that we do this kind of work because we love it. We are constantly striving to be better at it. I am a software developer. I want to be professional about it. This hearkens back to the roots of Agile.

More Effective Agile, Steve McConnell

McConnell has a very direct, pragmatic writing style. He is brutally honest about what works and what doesn't, and the practical realities and difficulties that organizations run into.

His main goal is addressing practical topics that businesses care about, but that are often neglected by Agile purists:
  • Common challenges in Agile implementation.
  • How to implement Agile in only part of the organization (because virtually every company will have parts that simply don't work that way, or will interact with external entities that don't).
  • Agile's support for predictability.
  • Use of Agile on geographically distributed teams.
  • Using Agile in regulated industries.
  • Using Agile on a variety of different types of software projects.
He focuses on techniques that have been proven to work over the past two decades. He generalizes non-Agile approaches as Sequential development, typically in some sort of phased form.

The book contains 23 chapters, organized into these 4 parts:
  • INTRODUCTION TO MORE EFFECTIVE AGILE
  • MORE EFFECTIVE TEAMS
  • MORE EFFECTIVE WORK
  • MORE EFFECTIVE ORGANIZATIONS
It includes a full bibliography and index.

Throughout, he uses the key principle of "Inspect and Adapt": inspect your organization for particular attributes, then adapt your process as necessary to improve those attributes.

Another important concept is that Agile is not one monolithic model that works identically for all organizations. It's not one-size-fits-all, because the full range of software projects covers a variety of situations. So the book covers the various ways organizations can tailor the practices to their needs. Probably to the horror of Agile purists.

Each chapter is organized as follows:
  • Discussion of key principles and details that support them. This includes problem areas and various options for dealing with them.
  • Suggested Leadership Actions
  • Additional Resources
The Suggested Leadership Actions are divided into recommended Inspect and Adapt lists. The Inspect items are specific things to examine in your organization. I suspect they would reveal some rude surprises. The Adapt items cover actions to take based on the issues revealed by inspection.

The Additional Resources sections list further reading if you need to delve deeper into the topics covered.

One of the very useful concepts in the book is the "Agile Boundary". This draws the line between the portion of the organization that uses Agile techniques, and the portion that doesn't. Even if the software process is 100% Agile, the rest of the company may not be.

Misunderstanding the boundary can cause a variety of problems. But understanding it creates opportunities for selecting an appropriate set of practices. This is helpful for ensuring successful Agile implementation across a diverse range of projects.

A significant topic of discussion is the tension between "pure Agile" and the more Sequential methods that might be appropriate for a given organization at a given point in a project.

The Agile Boundary defines the interface where the methods meet, and which methods are appropriate on each side of it under given circumstances. Again, Agile is not a single monolithic method that can be applied identically to every single project. As he says, it's not a matter of "go full Agile or go home".

There's a lot of information to digest here, because it all needs to be taken in the context of your specific environment. The chapters that stand out to me based on my personal experience:
  • More Effective Agile Projects: keeping projects small and sprints short; using velocity-based planning (which means you need accurate velocity measurement), delivering in vertical slices, and managing technical debt; and structuring work to avoid burnout. (See the back-of-the-envelope velocity sketch after this list.)
  • More Effective Agile Quality: minimizing the defect creation gap (i.e. finding and removing defects quickly, before they get out); creating and using a definition of done (DoD); maintaining a releasable level of quality at all times; reducing rework, which is typically not well accounted for.
  • More Effective Agile Testing: using automated tests created by the development team, including unit and acceptance tests, and monitoring code coverage.
  • More Effective Agile Requirements Creation: stories, product backlog, refining the backlog, creating and using a definition of ready (DoR).
  • More Effective Agile Requirements Prioritization: having an effective product owner, classifying stories by combined business value and development cost.
  • More Effective Agile Predictability: strict and loose predictability of cost, schedule, and feature set; dealing with the Agile Boundary.
  • More Effective Agile Adoptions.
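As a back-of-the-envelope illustration of the velocity-based planning mentioned in the first bullet (my own toy numbers, nothing from the book):

    # Velocity-based planning sketch: toy numbers, purely illustrative.
    backlog_points = 240                   # story points remaining in the release
    recent_velocities = [28, 31, 24, 27]   # points completed in recent sprints

    average_velocity = sum(recent_velocities) / len(recent_velocities)  # 27.5
    expected_sprints = backlog_points / average_velocity                # ~8.7

    # Plan against a range, not a single number: bound the forecast with the
    # slowest and fastest recent sprints.
    optimistic = backlog_points / max(recent_velocities)    # ~7.7 sprints
    pessimistic = backlog_points / min(recent_velocities)   # 10.0 sprints
    print(f"Roughly {optimistic:.0f}-{pessimistic:.0f} sprints, "
          f"most likely about {expected_sprints:.0f}.")

Which is also why the velocity measurement has to be honest: garbage velocity in, garbage release plan out.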
Requirements make an interesting area, because that is often a source of problems. The Agile approach is to elicit just enough requirements up front to be able to size a story, then rely on more detailed elicitation and emergent design when working on the story.

But the problem I've seen with that is one of the classic issues in estimation. Management tends to treat those very rough initial estimates as commitments, not taking into account the fact that further refinement has been deferred. So downstream dependent commitments get made based on them.

The risk comes when further examination of the story reveals that there is more work hidden underneath than originally believed. I've seen this repeatedly. Then the whole chain of dependent commitments gets disrupted, creating chaos as the organization tries to cope.

For example, consumer-product embedded systems are very sensitive to this. The downstream dependent commitments involve hardware manufacturing and the retail pipeline, where products need to be pre-positioned to prepare for major sales cycles such as holidays.

The Christmas sales period means consumer products need to be in warehouses by mid-November at the latest. Both the hardware manufacturing facilities (and their supply chains) and the sales channels are Taylor-style systems, relying on bulk delivery and just-in-time techniques. They need predictability. That's your Agile Boundary right there, on two sides of the software project.

IoT products have fallen into the habit of relying on a day-1 OTA update after the consumer unboxes them, but that's risky. If the massive high-scale OTA of all the fielded devices runs into problems, it creates havoc for consumers, who are not going to be happy. That can have significant opportunity costs if it causes stalled revenue or returns, or some horribly expensive solution to work around a failed OTA, not to mention the reputation effect on future sales.

What about commercial/industrial embedded systems? Cars, planes, factory equipment, where sales, installation, and operation are dependent on having the software ready. These can have huge ripple effects.

Online portal rollouts that gate real-world services are also sensitive to it. Martin uses the example of healthcare.gov. People need to have used the system successfully by a certain date in order to access real-world services, with life-safety consequences.

These examples highlight the real-world deadlines that make business sense for picking software schedule dates. As software engineers, we can't just whine about arbitrary, unreasonable dates. There's a whole chain of dependencies that needs to be managed.

Schedule issues need to be surfaced and addressed as soon as possible, just like software bugs. The later in the process a software bug is identified, the more expensive it is to fix, sometimes by orders of magnitude. Dealing with schedule bugs is no different.

In his book on estimation, McConnell talks about the Cone of Uncertainty: the greater uncertainty about estimates early in the project, which narrows to better certainty over time as more information becomes available. Absolute certainty only comes after completion. But everybody behaves as if the certainty is much better much earlier.
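The number I remember from the estimation book for the wide end of the cone is a factor of 4 in either direction: at the initial-concept stage, the eventual reality can plausibly land anywhere from a quarter of the estimate to four times it. Here's a small sketch, in Python, of what that does to a single-point estimate (the later-stage multipliers are placeholders of my own, not his exact published figures):

    # Cone of Uncertainty sketch. The 0.25x-4x initial-concept range is the
    # commonly cited wide end of the cone; the later-stage multipliers below are
    # illustrative placeholders, not McConnell's exact published figures.
    def estimate_range(point_estimate_weeks, low_multiplier, high_multiplier):
        return (point_estimate_weeks * low_multiplier,
                point_estimate_weeks * high_multiplier)

    nominal = 20  # "it'll take about 20 weeks"

    print(estimate_range(nominal, 0.25, 4.0))  # initial concept: (5.0, 80.0)
    print(estimate_range(nominal, 0.67, 1.5))  # requirements roughly done: (13.4, 30.0)
    print(estimate_range(nominal, 1.0, 1.0))   # only at completion: (20.0, 20.0)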

It's clear from the variety of information in this book that Agile is not simply a template that can be laid down across any organization and be successful. It takes work to adapt it to the realities of each organization. There is no simple recipe for success. No silver bullets.

That's why it's necessary to re-read this periodically, because each time you'll be viewing it in the context of your organization's current realities. That's continuing the Inspect and Adapt concept.

Update Nov 10, 2019


My copy of Beck's Extreme Programming Explained arrived yesterday, and I've been reading through it. Here we see the benefits of going back to original sources, in this case on open plan offices. In Chapter 13, "Facilities Strategy", he says:
The best setup is an open bullpen, with little cubbies around the outside of the space. The team members can keep their personal items in these cubbies, go to them to make phone calls, and spend time at them when they don't want to be interrupted. The rest of the team needs to respect the "virtual" privacy of someone sitting in their cubby. Put the biggest, fastest development machines on tables in the middle of the space (cubbies might or might not contain machines).
So it appears that what caught on was the group open bullpen part, and what was left out was the personal space part (and its attendant value).

There's a continuous spectrum on which to interpret Beck's recommendation, with the typical modern open office representing one end (all open space, no private space), and individual offices representing the other (no open space, all private space).

There's a point on the spectrum where I would shift to liking it, if I had a private place to make my own where I could concentrate in relative quiet, with enough space to bring in a pairing partner.

Where I find the open office breaks down is the overall noise level from multiple separate conversations. It can be a near-constant distraction when I'm trying to work (hence the rampant proliferation of headphones in open offices).

Meanwhile, when I need to have a conversation with someone, I want to be able to do it without competing with all those others, and without disturbing those around me.

What seems to me to have the most practical benefit is optimizing space for two-person interactions, acoustically isolated from other two-person interactions. So individual workspaces with room for two to work together. That allows for individual time as well as the pairing method, from simple rubber-duck debugging to full keyboard and mouse back-and-forth.

Those are both high-value, high-quality times. That's the real value proposition for the company.

And in fact, that's precisely the kind of setup Beck says Ward Cunningham told him about.

Given that most developers now work on dedicated individual machines, through which they might be accessing virtualized cloud computing resources, the argument for a centralized bullpen with machines seems less compelling.

The open bullpen space seems to be less optimal, but still useful for times when more than two people might be involved.

This is clearly a philosophical difference from Beck's intent, but I think the costs of open plan offices as he experienced them, tempered by the reality of how they've been adopted, outweigh their benefits.

Meanwhile, his followup discussion in that chapter is fully in harmony with Peopleware's Part II: "The Office Environment".

Monday, July 8, 2019

Review: Engineering A Safer World, by Nancy Leveson

This is a 6-year-old post cross-posted from my woodworking blog (written before I had this blog available). It remains as timely and important as ever. I'm reposting it motivated by the discussion of the Boeing 737 MAX, such as at EmbeddedArtistry.com (and mentioned at Embedded.fm).

As a software engineer I've been a dedicated reader of RISKS DIGEST for over 20 years. Formally, RISKS is the Forum On Risks To The Public In Computers And Related Systems, ACM Committee on Computers and Public Policy, moderated by Peter G. Neumann (affectionately known to all as PGN).

RISKS is an online news and discussion group covering various mishaps and potential mishaps in computer-related systems, everything from data breaches and privacy concerns to catastrophic failures of automated systems that killed people. It's an extremely valuable resource, exposing people to many concerns they might otherwise not know about.

All back issues are archived and available online. It's fascinating to see the evolution of computer-related risks over time. It's also disheartening to see the same things pop up year after year as sad history repeatedly repeats itself.

Nancy Leveson's work on safety engineering has been mentioned in RISKS ever since volume 1, issue 1. She's currently Professor of Aeronautics and Astronautics and Professor of Engineering Systems at MIT. Her 2011 book Engineering A Safer World, Systems Thinking Applied to Safety, was noted in RISKS 26.71, but has not yet been reviewed there. I offer this informal review.

This book should be required reading for anyone who wishes to avoid having their work show up as a RISKS news item. There's no excuse for not reading it: Leveson and MIT Press have made it available as a free downloadable PDF (555 pages), which is how I read it. The download link is available on the book's webpage at http://mitpress.mit.edu/books/engineering-safer-world.

This was my first introduction to formal safety engineering, so yes, I speak with the enthusiasm of the newly evangelized.

The topic is the application of systems theory to the design of safer systems and the analysis of accidents in order to prevent future accidents (not, notably, to assign blame). Systems theory originated in the 1930s and 1940s to cope with the increasing complexity of systems starting to be built at that time.

This theory holds that systems are designed, built, and operated in a larger sociotechnical context. Control exists at multiple hierarchical levels, with new properties emerging at higher levels ("emergent properties"). Leveson says safety is an emergent property arising not from the individual components, but from the system as a whole. When analyzing an accident, you must identify and examine each level of control to see where it failed to prevent the accident.

So while an operator may have been the person who took the action that caused an accident, you must ask why that action seemed a reasonable one to the operator, why the system allowed the operator to take that action, why the regulatory environment allowed the system to be run that way, etc. Each of these levels may have been an opportunity to prevent the accident. Learning how they failed to do so is an opportunity to prevent future accidents.

Furthermore, systems and their contexts are dynamic, changing over time. What used to be safe may no longer be. Consider that most systems are in use for decades, with many people coming and going over time to maintain and operate them, while much in the world around them changes. Leveson says most systems migrate to states of higher risk over time. If safety is not actively managed to adapt to this change, accidents become inevitable.

Another important point is the distinction between reliability and safety. Components may operate reliably at various levels, yet still result in an accident, frequently due to the interactions between components and subsystems.

Much of Leveson's view can be summarized in two salient quotes. First is a brief comment on the human factor: "Depending on humans not to make mistakes is an almost certain way to guarantee that accidents will happen."

The second is more involved: 

"Stopping after identifying inadequate control actions by the lower levels of the safety control structure is common in accident investigation. The result is that the cause is attributed to "operator error," which does not provide enough information to prevent accidents in the future. It also does not overcome the problem of hindsight bias. In hindsight, it is always possible to see that a different behavior would have been safer. But the information necessary to identify that safer behavior is usually only available after the fact. To improve safety, we need to understand the reasons people acted the way they did. Then we can determine if and how to change conditions so that better decisions can be made in the future.

"The analyst should start from the assumption that most people have good intentions and do not purposely cause accidents. The goal then is to understand why people did not or could not act differently. People acted the way they did for very good reasons: we need to understand why the behavior of the people involved made sense to them at the time."

The book is organized into three parts. Part I, "Foundations," covers traditional safety engineering (specifically, why it is inadequate) and introduces systems theory. Part II, "STAMP: An Accident Model Based On Systems Theory," introduces System-Theoretic Accident Model and Processes, covering safety constraints, hierarchical safety control structures, and process models. Part III, "Using STAMP," covers how to apply it, including the STPA (System-Theoretic Process Analysis) approach to hazard analysis and the CAST (Causal Analysis based on STAMP) accident analysis method.

Throughout, Leveson illustrates her points with accidents from various domains. These cover a military helicopter friendly-fire shootdown, chemical and nuclear plant accidents, pharmaceutical issues, the Challenger and Columbia space shuttle losses, air and rail travel accidents, the loss of a satellite, and contamination of a public water supply. They resulted in deaths, injuries with prolonged suffering, destruction, and significant financial losses. There's also one fictional case used for training purposes.

The satellite loss was an example where there was no death, injury, or ground damage, but an $800 million satellite was wasted, along with a $433 million launch vehicle (all due to a single misplaced decimal point in a software configuration file). Financial losses in all cases included secondary costs due to litigation and loss of business. Accidents are expensive in both humanity and money.

Several accidents are examined in great detail to expose the complexity of the event and glean lessons, identifying the levels of control, the system hazards they faced, and the safety constraints they violated. They show that the answer to further prevention is not simply to punish the operator on duty at the time. What's to prevent another accident from occurring under a different operator? What systemic factors exist that increase the likelihood of accidents?

These systems affect us every day. During the time I was reading the book, there was an airline crash at San Francisco, a fiery oil train derailment in Canada, and a major passenger train derailment in Spain. I started reading it while a passenger on an aircraft model mentioned 14 times in the book, and read the remainder while traveling to and from work on the Boston commuter rail.

The book can be read on several levels. At a minimum, the case studies and analyses are horribly fascinating for the lessons they impart. Fans of The Andromeda Strain will be riveted.

As I read the account of two US Black Hawk helicopters shot down by friendly fire in Iraq, I could visualize a split screen showing the helicopters flying low in the valleys of the no-fly zone to avoid Iraqi air defense radar, the traces going inactive on the AWACS radar scopes, the F-15's picking up unidentified contacts that did not respond to IFF, and the mission controllers back in Turkey, as events ground to their inexorable conclusion. It made my hair stand on end.

All the case studies are equally jaw-dropping, down to the final example of a contaminated water supply in Ontario. Further shades of Andromeda, since that was a biological accident resulting in deaths.

They're all examples of systems that migrated to very high risk states, where they became accidents waiting to happen. It was just a matter of which particular event out of the many possible triggered the accident.

Part of what's so shocking about these cases is the enormously elaborate multilayered safety systems that were in place. The military goes to great lengths in its air operations control to avoid friendly fire incidents; the satellite software development process had numerous checkpoints; NASA had a significant safety infrastructure.

Yet it seems that this very elaborateness contributed to a false sense of safety, with uncoordinated control leaving gaps in coverage. In some cases this led to complacency that resulted in scaling back safety programs.

The other shocking cases were at the opposite end of the spectrum, where plants were operated fast and loose.

The one bright spot in the case studies is the description of the US Navy's SUBSAFE program, instituted after the loss of the USS Thresher in 1963. It flooded during deep dive testing; despite emergency recovery attempts by the crew, they were unable to surface. Just pause and think about that for a moment.

SUBSAFE is an example of a tightly focused and rigorously executed safety program. The result is that no submarine has been lost in 50 years, with the single exception of the USS Scorpion in 1968, for which the program requirements had been waived. After that tragic lesson, the requirements were never again waived.

The book can be read at an academic level, as a study of the application of systems theory to the creation of safer systems and analysis of accidents. It can be read at an engineering level, as a guide on how to apply the methodology in the development and operation of such systems. It's not a cookbook, but it points you in the right direction. It includes an extensive bibliography for follow-up.

Even those who work on systems that don't present life safety or property damage risks can benefit, because any system behaving poorly can make people's lives miserable. They frequently pose significant business risks, affecting the life and death of a company.

This book paired with PGN's book Computer-Related Risks would make an excellent junior- or senior-level college survey course for all engineering fields, along the lines of "with great power comes great responsibility". While some might feel it's a text more suited to a graduate-level practicum, I think it's worth conveying at the undergraduate level for broader distribution.