C H A L L E N G E R   D I S A S T E R
 
ABSTRACT

In January of 1986, slightly more than a minute after launch, the Space Shuttle Challenger broke apart, causing the deaths of its seven crew members.   President Ronald Reagan appointed the Rogers Commission to investigate the accident.

One of the most famous members of the Rogers Commission was the late Nobel Prize-winning Caltech physicist Richard Feynman. Feynman devoted the latter half of his (second) autobiography, What Do You Care What Other People Think?, to his experiences on the commission. His report, Personal Observations On The Reliability Of The Shuttle, is online.

It is instructive to use Feynman’s report as a tutorial in some basic principles of Systems Engineering. Although Feynman was not a rocket scientist, his report is an excellent example of analyzing a large, complex system using elementary concepts. The report is difficult to read, not because it uses advanced mathematics, but because it relies on advanced engineering principles.

 
 
S A F E T Y   F A C T O R
 
Safety factor was one of the concepts NASA used to evaluate risks to the Shuttle.   We can illustrate this concept with a simple example, probably familiar to almost everyone.   This example also shows how NASA did not apply the concept correctly.

Suppose a bridge is designed to support a 20-ton load. Then if we drive a 20-ton truck across the bridge, the beams in the bridge will not break. They will not crack and they will not permanently deform; in fact, we will be able to drive many 20-ton trucks across the bridge, one at a time, for many years, without any problems.

Now if the bridge has a safety factor of three, this means we can drive a 60-ton load across the bridge (three times the 20-ton load), and the bridge will not fail.   The beams will not crack, they will not permanently deform, and the bridge will not become unsafe.  Reasonable people might disagree on how many times we could expect to drive a 60-ton load across the bridge, but they would agree on two things:
 
  (1) One could drive a 20-ton load across the bridge, one truck at a time, every day for many years, without fear of the bridge collapsing;
  (2) One could drive a 60-ton load across the bridge once without fear of the bridge collapsing.

Now suppose we were to drive a 20-ton load across the bridge, and the bridge were to crack 1/3 of the way through. Would this be a safety factor of three? (After all, the bridge only cracked 1/3 of the way; 2/3 of the way is left.) No, it would not.

In fact, we could not be sure that the bridge wouldn't fail the next time a 20-ton load was driven across it.   Since the bridge was not designed to crack under these circumstances, the cracks constitute a design failure.   Clearly, no one would design a bridge this way.   Yet this was exactly what NASA did when calculating a safety factor for the Shuttle.

During one of the previous Shuttle flights, one of the O-rings eroded by a third of its radius.   Since the O-ring was not designed to erode, this was (or should have been) a clear indication of a design failure.   Yet NASA made no attempt to further understand the O-ring erosion, or to change the O-ring design to avoid erosion during future flights.   NASA simply declared that there was a safety factor of three, and continued on.
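The distinction can be put in numbers. The following is a minimal sketch, not anything taken from NASA or from Feynman's report: the function names are hypothetical, and it assumes a safety factor is defined as the load at which damage first appears divided by the design load. It contrasts that definition with the flawed reasoning that treats the undamaged fraction of a part as margin.

```python
def safety_factor(first_damage_load, design_load):
    """Load at which damage first appears, divided by the design load.

    Assumed definition for this sketch; both loads are in the same units.
    """
    return first_damage_load / design_load


def misapplied_margin(damaged_fraction):
    """Flawed reasoning: 'only 1/3 was damaged, so the factor is 3'.

    This divides the whole part by the damaged part. It says nothing
    about the load at which damage begins, so it is not a safety factor.
    """
    return 1.0 / damaged_fraction


# A sound bridge: designed for 20 tons, undamaged up to 60 tons.
sound_bridge = safety_factor(first_damage_load=60, design_load=20)

# The cracked bridge: damage already appears at the 20-ton design load,
# so the true safety factor is at most 1.
cracked_bridge = safety_factor(first_damage_load=20, design_load=20)

# The number obtained by counting how much material is left.
claimed = misapplied_margin(damaged_fraction=1 / 3)

print(f"sound bridge:   {sound_bridge:.1f}")
print(f"cracked bridge: {cracked_bridge:.1f} (at most)")
print(f"claimed margin: {claimed:.1f}")
```

The cracked bridge and the eroded O-ring both fall into the second case: damage appeared under normal operating conditions, so the real factor is no better than one, whatever the "claimed" figure says.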
 
 
P R O B A B I L I T Y   O F   F A I L U R E
 
Here's an example of how we might calculate the probability of one type of failure.   Suppose that if five tiles peel off of the Shuttle, that is enough to cause it to break up during re-entry.   Let us define the following probabilities:
 
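  $P_X$ the probability that an object will fall off of the fuel tank during liftoff;
  $P_Y$ the probability that the object will strike a tile;
  $P_Z$ the probability that a tile, if it peels off, will cause another tile to peel off;
  $P_W$ the probability that a struck tile is damaged enough to peel off during re-entry.
 
Then, treating these events as independent, the probability of the Shuttle breaking up during re-entry (due to this cause) is roughly the larger of
 
  $P_X P_Y P_W P_Z^4$   (one object strikes a tile, which peels off and pulls off four more, one after another)
and
  $(P_X P_Y P_W)^5$   (five separate objects each strike a tile that then peels off)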
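To see how the two expressions compare, here is a minimal sketch that follows the four definitions above literally: a single strike needs one fall, one hit, one peel (PW) and then four further peels in a chain (PZ each), while five separate strikes need a fall, a hit and a peel five times over. The function names and the numerical values are hypothetical, chosen only for illustration, and the events are assumed to be independent.

```python
def single_strike_chain(p_x, p_y, p_z, p_w, tiles_needed=5):
    """One object falls, strikes a tile, the tile peels off (p_w),
    and each peeled tile pulls off the next (p_z), tiles_needed - 1 times."""
    return p_x * p_y * p_w * p_z ** (tiles_needed - 1)


def independent_strikes(p_x, p_y, p_w, tiles_needed=5):
    """tiles_needed separate objects each fall, strike a tile, and peel it off."""
    return (p_x * p_y * p_w) ** tiles_needed


def breakup_probability(p_x, p_y, p_z, p_w, tiles_needed=5):
    """The larger of the two routes to losing tiles_needed tiles."""
    return max(
        single_strike_chain(p_x, p_y, p_z, p_w, tiles_needed),
        independent_strikes(p_x, p_y, p_w, tiles_needed),
    )


# Hypothetical values, for illustration only.
p_x, p_y, p_z, p_w = 0.5, 0.1, 0.2, 0.3

print(f"single strike, chain of peels: {single_strike_chain(p_x, p_y, p_z, p_w):.3g}")
print(f"five independent strikes:      {independent_strikes(p_x, p_y, p_w):.3g}")
print(f"larger of the two:             {breakup_probability(p_x, p_y, p_z, p_w):.3g}")
```

With these made-up numbers the chain route dominates by several orders of magnitude, which is why taking the larger of the two terms is a reasonable first estimate.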
 
O R D E R   O F   M A G N I T U D E   E S T I M A T E S
 
NASA made an outrageous (and, in retrospect, obvious) error in estimating the probability of catastrophic failure. When Feynman spoke to the working engineers, most estimated this probability to be about one in a hundred. In his second autobiography, Feynman tells an interesting story about asking one of the managers for his estimate. After mentioning that he was originally an engineer, the manager became quite evasive, but eventually told Feynman he believed the probability to be one in a hundred thousand (while the others in the room, engineers rather than managers, gasped).

As Feynman points out (in his autobiography and in his report), if this number is to be believed then one could launch the shuttle every day, week after week, year after year, for three hundred years, and expect there to be only one failure.
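The arithmetic behind this remark is easy to check. The sketch below is only an illustration (the function name is hypothetical): it converts "one failure in N flights" into the expected time between failures, assuming one launch per day.

```python
DAYS_PER_YEAR = 365.25


def years_between_expected_failures(one_in_n, launches_per_day=1.0):
    """Expected years between failures if one flight in `one_in_n` fails."""
    days = one_in_n / launches_per_day
    return days / DAYS_PER_YEAR


# The manager's figure: one in a hundred thousand.
manager = years_between_expected_failures(100_000)

# The working engineers' figure: about one in a hundred.
engineers = years_between_expected_failures(100)

print(f"manager's estimate:   one failure every {manager:.0f} years of daily launches")
print(f"engineers' estimate:  one failure every {engineers * DAYS_PER_YEAR:.0f} days of daily launches")
```

One in a hundred thousand works out to a little under 274 years of daily launches, which is the "three hundred years" of the text in round numbers. The engineers' figure corresponds to one expected failure in about a hundred days of daily launches.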
 
 
A   F A I L U R E   O F   M A N A G E M E N T ?
 
At this point it's tempting to think
 
  NASA must have been run by idiots  
  NASA management is incompetent  

or other thoughts along these lines. This would be a grave mistake. Anyone with more than a few years of experience in the aerospace industry, or in software development, has been on projects with unrealistic budgets or schedules. Virtually all experienced engineers have worked on projects where pressure from the customer, or from upper management, has resulted in requirements creep or in short shrift being given to testing.

And when the project failed to meet its requirements, the requirements were quietly changed, or moved to the next delivery.   In those respects, the Shuttle program was no different.   What was different was this:

The reasons for the disclaimer should now be apparent.
 