C. Michael Holloway directly relates some lessons that we in software can take from other disasters. How could I pass up a paper whose abstract says, “software engineers can learn lessons from failures in traditional engineering disciplines”?
Specifically, he relates lessons from the Tacoma Narrows bridge failure and the Challenger disaster to software. He reviews each, goes over the lessons, and then offers some specific applications.
I won’t get into the specifics of the disasters so much here, and neither does Holloway; you can read about those elsewhere.
First, he looks at the Tacoma Narrows Bridge, completed in 1940: a suspension bridge meant to be the alternative to taking a ferry across Puget Sound. The bridge was designed by one of the world’s top authorities on bridge design. Mostly what you need to know is that the designer used a theory called “deflection theory” to justify building the bridge with short girders instead of long trusses to stand up to the wind. This led to his design being picked over the Washington Department of Highways’ design, and it cost about $5,000,000 less.
Since it was expected to have light traffic, it was only a two-lane bridge, which was very narrow compared to others at the time, so it ended up being a very flexible bridge. And as I’m sure you can imagine, flexible is not really the adjective you want describing your bridge. It was said to move so much that people got seasick crossing it, and eventually it was nicknamed “Galloping Gertie”. Eventually, 40-mile-an-hour winds were able to break the cables and allow the deck to start twisting. The movement got so bad that the suspender ropes tore and the deck broke apart; not long after, the rest of the deck fell into the water. Fortunately, everyone survived, though sadly a dog was lost.
Both of the accidents that Holloway talks about were investigated. In the case of the bridge, the Federal Works Agency picked three engineers to produce a report (one, Theodore von Kármán, would go on to be one of the founders of JPL).
The report stated:
“the Tacoma Narrows Bridge was well designed and built to resist safely all static forces, including wind, usually considered in the design of similar structures. …It was not realized that the aerodynamic forces which had proven disastrous in the past to much lighter and shorter flexible suspension bridges would affect a structure of such magnitude as the Tacoma Narrows Bridge”
Holloway points out that they had indeed followed the modern techniques; it just happened that those techniques were flawed. One person, however, did see this coming: Theodore L. Condron. He was an engineer who was advising the financing company on whether or not to approve the loan needed to construct the bridge. He was worried about the narrowness of the bridge, which is what ultimately caused the problem. He was so worried about it that he compared it to every other suspension bridge that had been built recently and pointed out that its width-to-length ratio was much narrower than that of any bridge that had successfully been built.
He went to Berkeley to investigate some models and was essentially told that it would be okay. Of course, we now know that the models did not account for deflection in both directions, but because he couldn’t disprove the deflection theory and couldn’t find evidence to support his concerns, he eventually gave in. Even though he gave in, he still suggested widening the bridge to 52 feet, a change that may have prevented the collapse.
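Just to make the flavor of Condron’s check concrete, here is a minimal sketch of that kind of comparison in Python. The bridge names and figures below are made-up placeholders, not Condron’s actual data; the point is the method: compute the width-to-span ratio of the proposed design and see whether it falls inside the envelope of bridges that have already been built successfully.

```python
# Hypothetical sketch: compare a proposed design against prior experience
# instead of relying on theory alone. All names and numbers are illustrative.

# (width_ft, main_span_ft) for previously built suspension bridges -- placeholder values
prior_bridges = {
    "Bridge A": (90, 3500),
    "Bridge B": (74, 2300),
    "Bridge C": (60, 1750),
}

def width_to_span_ratio(width_ft: float, span_ft: float) -> float:
    """Return width divided by main-span length (smaller = more slender)."""
    return width_ft / span_ft

# The most slender ratio anyone has successfully built before.
narrowest_prior = min(width_to_span_ratio(w, s) for w, s in prior_bridges.values())

# Hypothetical proposed design, far narrower than anything in prior experience.
proposed = width_to_span_ratio(39, 2800)

if proposed < narrowest_prior:
    print(f"Proposed ratio 1:{1/proposed:.0f} is outside prior experience "
          f"(narrowest built so far is 1:{1/narrowest_prior:.0f}) -- investigate further.")
```

Nothing fancy, but it captures why his comparison was alarming: the new design wasn’t a small step beyond what had been built before, it was far outside the envelope.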
Holloway goes on to draw some relevant lessons for us:
Lesson 1: Relying heavily on theory, without adequate confirming data, is unwise.
This bridge was the first actual test of deflection theory.
Lesson 2: Going well beyond existing experience is unwise.
He suggests that incremental steps should have been taken instead, specifically in narrowing the width. The idea that small changes are a good thing is likely familiar to you.
Lesson 3: In studying existing experience, more than just the recent past should be included.
It turns out that a professor who was studying suspension bridges, and who narrowly escaped the Tacoma Narrows collapse while he was on the bridge, looked back at other bridge disasters that involved wind. Nine of the ten he found occurred before 1865.
The University of Washington’s Professor Farquharson would later write that the failure “came as such a shock to the engineering profession that it is surprising to most to learn that failure under the action of the wind was not without precedent”.
Lesson 4: When safety is concerned, misgivings on the part of competent engineers should be given strong consideration, even if the engineers can not fully substantiate these misgivings.
This is supported by the fact that Condron’s misgivings turned out to be correct: the bridge design didn’t work.
Next, Holloway goes on to talk about Challenger. Again, I’m not going to get into the details of the disaster here; lots of places document it very well. But as a reminder, the Challenger disaster occurred on January 28th, 1986, when, 73 seconds after liftoff, the shuttle exploded during its 10th flight. This disaster was also investigated, of course.
The report “concluded that the cause of the Challenger accident was the failure of the pressure seal in the aft field joint of the right Solid Rocket Motor.” Holloway explains that the failure was due to a faulty design that was unacceptably sensitive to factors like temperature, physical dimensions, and reusability. He tells us that this reinforces three of the four lessons from the bridge and adds a new one.
Lesson 2: Going well beyond existing experience is unwise: The SRB joints, even though they were initially based on a solid design, deviated very far from that basis.
Lesson 3: In studying existing experience, more than just the recent past should be included.
Holloway specifically compares the attitudes around Challenger to those around the Apollo 1 fire: “The attitude of great confidence in accomplishments and the concern about meeting the planned schedules are especially apparent.”
Lesson 4: When safety is concerned, misgivings on the part of competent engineers should be given strong consideration, even if the engineers can not fully substantiate these misgivings. So much so in this case that, the night before the launch, engineers at the company that made some of the parts argued against launching. And again, these engineers were not able to prove that the launch was unsafe. Holloway points out that the burden of proof on those for and against launching should not be equal.
Finally, he tells us that Challenger teaches us a new lesson. Lesson 5: Relying heavily on data, without an adequate explanatory theory, is unwise.
He specifically cites the joints in the solid rocket booster, which were originally thought to become tighter during launch; in tests, though, for the first few milliseconds right after ignition they actually moved away from each other. Several tests were done, and the data eventually satisfied everyone that this was okay, but there was no real understanding of why these parts behaved differently than the way they were designed to.
To close, we’re left with three applications that he feels are specific to software systems.
Application 1: The verification and validation of a software system should not be based on a single method, or a single style of methods. From lessons 1 and 5. “Every testing method has limitations; every formal method has limitations, too. Testers and formalists should be cooperating friends, not competing foes.”
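To make that idea a bit more concrete, here is a small, hypothetical sketch of my own (not from Holloway’s paper): one tiny function checked three different ways. Example-based tests document intent, an exhaustive check over a small input domain stands in for the spirit of formal analysis, and a randomized property check reaches inputs neither of the others covers. Each method has blind spots the others can catch.

```python
import itertools
import random
import unittest

def clamp(value: int, low: int, high: int) -> int:
    """Clamp value into the inclusive range [low, high]. Assumes low <= high."""
    return max(low, min(value, high))

class ClampTests(unittest.TestCase):
    # Method 1: example-based tests -- good for documenting intent,
    # but only cover the cases someone thought of.
    def test_examples(self):
        self.assertEqual(clamp(5, 0, 10), 5)
        self.assertEqual(clamp(-3, 0, 10), 0)
        self.assertEqual(clamp(42, 0, 10), 10)

    # Method 2: exhaustive check over a small domain -- closer in spirit to
    # formal verification, but only feasible for tiny input spaces.
    def test_exhaustive_small_domain(self):
        values = range(-5, 6)
        for v, low, high in itertools.product(values, values, values):
            if low <= high:
                result = clamp(v, low, high)
                self.assertTrue(low <= result <= high)
                if low <= v <= high:
                    self.assertEqual(result, v)

    # Method 3: randomized property check -- reaches inputs far outside
    # the small domain, at the cost of being non-exhaustive.
    def test_random_properties(self):
        rng = random.Random(0)
        for _ in range(1000):
            low, high = sorted(rng.randint(-10**9, 10**9) for _ in range(2))
            v = rng.randint(-10**9, 10**9)
            self.assertTrue(low <= clamp(v, low, high) <= high)

if __name__ == "__main__":
    unittest.main()
```

Run it with `python -m unittest` (or just execute the file). The particular function doesn’t matter; the point is that no single one of these checks would be enough on its own, which is exactly the cooperation Holloway is arguing for.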
Application 2: The tendency to embrace the latest fad should be overcome. From lesson 2. I’m looking at you, JavaScript (though this applies across languages).
Application 3: The introduction of software control into safety-critical systems should be done cautiously. From lessons 2 and 4. He’s not against using software in safety-critical systems, and he actually finds that it can be successful, but he suggests caution, warning: “Software can be used in safety-critical systems. But its use ought to be guided by successful past experiences, and not by ambitious future dreams.”