13. Design Applications of Risk Management
• The space shuttle uses 5 computers for flight control. The first 4 run a primary flight control system. The fifth computer runs a separate flight control program, and is only used in the most dire emergencies. The 4 redundant systems will operate separately, then compare outputs. These should be identical, but in the event of disagreement, they can vote a conflicting system out.
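The compare-and-vote step described above can be sketched in a few lines. This is only an illustration, not the shuttle's actual flight software: the function name `vote`, the computer ids, and the command strings are all hypothetical, and outputs are assumed to be directly comparable for equality.

```python
from collections import Counter

def vote(outputs):
    """Compare the outputs of redundant computers.

    outputs maps a (hypothetical) computer id to the value it produced.
    Returns (agreed_value, ids_to_vote_out). If no value has a strict
    majority, nothing can be voted out and (None, []) is returned.
    """
    if not outputs:
        return None, []
    counts = Counter(outputs.values())
    value, n = counts.most_common(1)[0]
    if 2 * n <= len(outputs):
        return None, []
    dissenters = sorted(cid for cid, out in outputs.items() if out != value)
    return value, dissenters

# Four primary computers, one of which disagrees with the others.
active = {1: "pitch+2", 2: "pitch+2", 3: "pitch+2", 4: "pitch+3"}
value, voted_out = vote(active)
for cid in voted_out:
    del active[cid]  # the conflicting computer is dropped from the set
```

Note that a 2 against 2 split has no majority, so in this sketch no computer is removed in that case.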
13.1.1 A Mobile Service Robot for the Space Station
• We can see a figure depicting the SPDM for the planned space station.
[Figure: robot arm]
• All discussions in this section are based on the space station manipulator as described in SSP30000.
• The basic functions (at PMC) are classified as,
Category 1: requires tolerance for two consecutive failures in each system: fail safe/fail operational: basically requires 1 prime + 1 redundant + 1 backup
Category 2: requires tolerance for one failure in each system: failure tolerant: typically requires 1 prime + 1 backup
Category 2S: requires tolerance for one failure in the system: fail operational
• Examples of equipment in the different categories are,
Category 1: The orbiter is a time critical system
Category 2S: Safety monitoring and emergency control systems
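The unit counts quoted for the categories follow a simple rule: one prime unit plus one extra unit per failure that must be tolerated. A minimal sketch of that rule is given below. The names `REQUIRED_TOLERANCE`, `units_needed`, and `meets_category` are hypothetical, and the sketch assumes every unit is fully independent, which a real design would have to demonstrate.

```python
# Number of failures each category must tolerate, taken from the
# classification above.
REQUIRED_TOLERANCE = {"1": 2, "2": 1, "2S": 1}

def units_needed(category):
    """Minimum independent units: one prime plus one per tolerated failure."""
    return 1 + REQUIRED_TOLERANCE[category]

def meets_category(category, independent_units):
    """True if a function with this many independent units satisfies the category."""
    return independent_units >= units_needed(category)

# Category 1 needs prime + redundant + backup; a prime + backup pair falls short.
print(units_needed("1"), meets_category("1", 2), meets_category("2", 2))
```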
• Recall the following hazard levels, also consider the control requirements,
• For the manipulator (SSRMS) hazards include,
payload released without command
orbiter stuck to space station via SSRMS
orbiter collides with space station because of failed capture (docking with SSRMS).
no motion in arm in response to command
all functions must be safed within 250 ms of occurrence of fault
can’t implement alternate operation
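The 250 ms safing requirement can be checked against logged event times. The sketch below assumes timestamps are recorded as integer milliseconds and that a function which was never safed is logged as `None`; the function names in the example log are hypothetical.

```python
SAFING_DEADLINE_MS = 250  # from the requirement above

def late_functions(fault_time_ms, safed_times_ms, deadline_ms=SAFING_DEADLINE_MS):
    """Return the names of functions not safed within the deadline.

    safed_times_ms maps a function name to the time it reached its safe
    state, or None if it never did.
    """
    late = []
    for name, t in safed_times_ms.items():
        if t is None or not (0 <= t - fault_time_ms <= deadline_ms):
            late.append(name)
    return sorted(late)

# Hypothetical log: fault detected at t = 1000 ms.
log = {"joint_brakes": 1120, "lee_latch": 1300, "motor_power": None}
violations = late_functions(1000, log)
```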
• isolation: we want to estimate the % failures that are prevented from reaching a specific module. Typically these values are,
5% maximum false error indication rate
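These percentages can be estimated from test or fault-injection counts. The sketch below is one possible formulation, not a definition taken from the source: it assumes isolation coverage is the fraction of injected failures that were stopped before reaching the module, and that the false error indication rate is the fraction of all error indications that turned out to be false. All names are hypothetical.

```python
MAX_FALSE_INDICATION_RATE = 0.05  # the 5% limit quoted above

def isolation_coverage(failures_injected, failures_contained):
    """Fraction of failures prevented from reaching the module (assumed definition)."""
    if failures_injected <= 0:
        raise ValueError("need at least one injected failure")
    return failures_contained / failures_injected

def false_indication_rate(indications, false_indications):
    """Fraction of error indications that were false (assumed definition)."""
    if indications <= 0:
        raise ValueError("need at least one indication")
    return false_indications / indications

coverage = isolation_coverage(200, 190)
rate = false_indication_rate(40, 1)
within_limit = rate <= MAX_FALSE_INDICATION_RATE
```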
• MSS Failure Tolerance Concept [Brimley]
• Failure Detection and Isolation Coverage Scheme [Brimley]
• MSS Failure Management Functional Interfaces [Brimley]
• Layered defense approach for Detection of Sensor Data Failures [Brimley]
provide drive (EVA) for joint and LEE latch mechanisms
alternate data path/transmission
reconfiguration time less than 271 seconds
• The purpose of these measures,
when a failure occurs, the software and hardware engineers must know what their systems are to do. Stating the measures explicitly is the best way to get everyone to agree.
• Responses to an operational failure of computational units may include,
invoking off-line bit checks with error checking algorithms
operator visual inspections via cameras, etc.
analysis of unit memory through data dumps, etc.
ground support failure isolation analysis
exercising equipment with known algorithms
• Note: in the case of SSRMS the operator may use EVA units to move the arm away from contact.
• Operators may always elect to replace failed units, if spares are available.
• A diagram of the MSS Failure Management Concept is shown below, [Brimley]. This depicts a scheme for dealing with faults once they are detected. Some of the acronyms used are,
• There is also a scheme for estimating when a system has erred. This is based on a bottom up approach where the checks for errors are made in the specific modules, and then error reports are propagated up to the high level software/hardware. The diagram below depicts the system used in the SSRMS.
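The bottom up idea can be sketched as a tree of modules, each running its own local check and passing reports upward. This is a generic illustration rather than the SSRMS design: the `Module` class, the module names, and the error text are all hypothetical.

```python
class Module:
    """A node in a (hypothetical) module hierarchy."""

    def __init__(self, name, check=None, children=()):
        self.name = name
        self.check = check            # callable returning an error string or None
        self.children = list(children)

    def collect_errors(self):
        """Gather reports from the children first, then run the local check.

        Each level prefixes its own name, so the top level sees the full
        path to the module where the error was detected.
        """
        reports = []
        for child in self.children:
            reports.extend(child.collect_errors())
        if self.check is not None:
            error = self.check()
            if error:
                reports.append(error)
        return [f"{self.name}/{r}" for r in reports]

joint = Module("joint1", check=lambda: "encoder out of range")
latch = Module("latch", check=lambda: None)
arm = Module("arm", children=[joint, latch])
reports = arm.collect_errors()
```

The detection logic stays in the low level modules; the high level only receives and acts on reports.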
13.1.2 Case Studies In Failure
• Three astronauts died in a fire in their capsule on the launch pad (Apollo 1, 1967)
• General Electric (GE) and other companies were commissioned to develop safety programs.
• These were developed based on combinations of existing programs for risk management, such as those used by the Air Force and the Department of Defense.
• Apollo 13 (1970) is a definitive example of failure management
• the mission objective was to land on the surface of the moon, and return.
• the mission had four major components,
1. The booster (a Saturn V rocket) was used for the initial launch, and was discarded after use.
2. The lunar module (dubbed Aquarius) was to be used for descent and ascent from the lunar surface, and to be discarded after use.
3. The service module was to be used for the trip to and from the moon.
4. The command module was equipped with a heat shield, and would be the only module to return to earth.
• In basic terms the mission had to be aborted as a result of an oxygen tank rupture. The fact that the astronauts managed to return safely is a tribute to the quality of design in the space program.
• The events are chronicled below,
during a test the booster oxygen tank drains (causing fires in nearby automobiles, sparked by ignition systems)
oxygen tanks that supplied breathing air and fuel cells were not draining properly. This was overcome by turning on heat, and redraining the tanks. No action was taken because the performance was adequate. It is believed that a problem arose at this point where insulation on wires inside the tank was worn down because of overheating, and the slower drain of oxygen from the tank.
one of the backup crew (Charles Duke) got German measles, and had exposed one of the prime crew (Thomas Mattingly). Mattingly was replaced by Jack Swigert as the command module pilot. The new crew had not had any experience together in critical situations.
The Saturn V rocket burn is cut off 2 minutes early, requiring additional fuel to be burned to reach earth orbit. Fuel levels were now lower, but not critical.
one of the oxygen tank pressure gauges had gone off the scale and ground control had requested several times that the oxygen be stirred by a small fan in the tank.
a warning light indicates low pressure in a hydrogen tank in the service module. Fans in the tanks were turned on to stir up the hydrogen.
an arc between two wires starts a fire in an oxygen tank fueled by the insulators. The pressure in the oxygen tank builds, and warnings are not sounded because they are overridden by the hydrogen tank warning. The fire spreads through the service module bay.
a bang is heard; the crew assumes it is a noisy valve actuation, and that it might have been done as a practical joke.
the master power alarm sounds as power on a master bus is lost. This was caused by damaged oxygen lines to the fuel cells. It turns out that 2 of the 3 fuel cells were off line.
readings are scrambled, pressures and temperatures appeared erratic. Oxygen pressure in one tank was zero, and dropping in another.
the venting of gases causes the craft to drift off course, the guidance system fails, and the craft begins to wobble, interrupting communications with the ground.
Houston is still taken by surprise. Assuming that the four indicated failures could not be related, and were almost impossible as independent events, they began looking for instrumentation problems.
Houston suggests disconnecting the 3rd fuel cell to check to see if oxygen loss was caused by the fuel cell. This operation could not be reversed.
the goal of landing on the moon is no longer assumed possible, but the lunar module is kept intact for its supply of water and oxygen.
the reentry battery is recharged and the command module is shut down, as the astronauts move to the lunar module. This is done acknowledging that the effects of the cold on the electronics are not known, and may affect earth reentry.
it was decided to continue the flight path, and use the moon's gravity to get back (with no lunar landing) for a total flight time of 100 hours.
the remainder of the trip was punctuated by,
cold (minimal power for heating left temperatures about 10°C)
fatigue (standing room only, or sleeping in the cold command module)
hunger (the food was meant to be mixed with hot water)
the CO2 was normally absorbed by filters. With the extra use the filters in the lunar module were exhausted. The filters from the command module were not the correct shape to fit, requiring a jury-rigged arrangement using duct tape.
a new reentry procedure was devised and had to be relayed verbally to the crew and written on paper (not plentiful). This checklist took 2 hours to transmit verbally.
The astronauts moved back to the command module, powered up the equipment (luckily the effects of the cold did not cause any malfunctions). There was a new concern about whether the heat shield was still intact. No instruments allowed this to be checked. The other modules were jettisoned.
the spacecraft landing began, and finished safely, with one of the fastest recoveries.
• In retrospect the accident was reviewed, and the causes for this accident were believed to be,
budget pressures, morale problems, and schedule pressures encouraged the use of the damaged oxygen tank.
the estimates of failure effects had been incomplete because of assumptions of adequacy.
emergency plans were available but not practiced.
• The Challenger accident could have been prevented had the existing NASA procedures been followed. The explosion of the Challenger was more the result of management failures at NASA and Morton Thiokol than of technical failures.
• In general the important notes are,
1972: contract awarded to Morton Thiokol to design the Solid Rocket Boosters (SRBs)
one of the changes was an o-ring seal along the rocket body. The joint was made longer, and a second ring added to provide a redundant seal.
1977-78: An engineer discovers during tests that under pressure the joints rotated significantly causing the secondary o-ring to become ineffective. This is a result of the elongated joint to hold the secondary o-ring. Morton Thiokol management did not recognize the problem.
1980: The joint is classified on the CIL (Critical Item List) as 1R, indicating that failure would be catastrophic, but there is a redundant o-ring to act as a backup in the event of failure. This was only one of 700 items listed as criticality 1.
1981: the shuttle begins orbital testing
After a few flights, problems with the o-rings were noted, as were other items. The normal procedure was to assign a problem tracking number, and examine the causes. This was not done for the o-ring problem. Eventually the problem was recognized and the rating was changed to 1 on the CIL. It was shown that despite NASA’s reclassification, the system was still listed as 1R in the Morton Thiokol paperwork, as well as a number of other documents. Also, Morton Thiokol disagreed with the criticality change, and went to a referee procedure.
1984: the erosion of the o-rings has become a significant concern, and review procedures are requested for the packing of the o-ring joint with the asbestos filled putty that prevents heating of the rings. Morton Thiokol responds with a letter suggesting that the higher pressures used in testing the joints were resulting in channels in the putty, and increased erosion of the o-rings. Statistics from before and after the change in testing pressure seemed to confirm this. Morton Thiokol recommends continuing the tests to ensure sealing despite the problems, and begins investigating the effects of the testing on the putty.
Jan 1985: A launch of a space shuttle at the coldest temperatures to date leads to the greatest failure of the o-rings to date. The o-rings will deform under pressure to seal the gap, but this is hindered when they are colder, and the material stiffer.
Jan-April 1985: Continued flights and investigations show continued problems with the o-rings, and a relationship to launch temperature. Morton Thiokol acknowledges the problem, and the effects of temperature, but concludes that the second o-ring will ensure safety.
April 1985: the primary o-ring does not seal, and the secondary ring carries the pressure, with some blowby (i.e., the backup was starting to fail). As a result a committee concludes that the shuttle must only be operated in an acceptable flight envelope for the o-ring seal. This report is received by Morton Thiokol, but does not seem to be properly distributed. The problem was also not properly reported within NASA to upper management.
July 1985: A Morton Thiokol engineer recommends that a team be set up to study the o-ring seal problem, citing a potential disaster.
August 1985: Morton Thiokol and NASA managers brief NASA headquarters on the o-ring problems, with a recommendation to continue flights, but step up investigations. A Morton Thiokol task force is set up.
October 1985: The head of the Thiokol task force complains to management about lack of cooperation and support.
December 1985: One Thiokol engineer suggests stopping shipments of SRBs until the problem is fixed. Thiokol writes a memo to NASA suggesting that the problem tracking of the o-rings be discontinued. This led to an erroneous listing of the problem as closed, meaning that it would not be considered as critical during launch.
Jan 1986: The space shuttle Challenger is prepared to launch on Jan. 22. It was originally scheduled for July 1985, then postponed 3 times and scrubbed once. It was rescheduled again to the 23rd, then 25th, then 27th, then 28th. This was a result of weather, equipment, scheduling, and other problems.
Jan. 27th, 1986: The shuttle begins preparation for launch the next day, despite predicted temperatures below freezing (26°F) at launch time. Thiokol engineers express concerns over low temperatures, and suggest NASA managers be notified (this was not done). A minimum launch temperature of 53°F had been suggested to NASA. There was no technical opinion supporting the launch at this point. The NASA representative discussing the launch objected to the Thiokol engineers' opinions, and accused them of changing their opinions. Upper management became involved with the process, and “convinced” the technical staff to withdraw objections to the launch. Management at Thiokol gave the go ahead to launch under pressure from NASA officials (this was the critical decision).
the shuttle is wheeled out to the launch pad. Rain has frozen on the launch pad, and may have gotten into the SRB joints and frozen there also.
Jan. 28th, 1986: The shuttle director gives the OK to launch, without having been informed of the Thiokol concerns. The temperature is 36°F.
11:39 am: The engines are ignited, and a puff of black smoke can be seen blowing from the right SRB. As the shuttle rises the gas can be seen blowing by the o-rings. The vibrations experienced in the first 30 seconds of flight are the worst encountered to date.
11:40 am: A flame jet from the SRB starts to cut into the liquid fuel engine tank, and a support strut.
11:40:15 am: the strut gives way, and the SRB pointed nose cone pierces the liquid fuel tank. The resulting explosion totally destroys the shuttle and crew.
11:40:50 am: the SRBs are destroyed by the range safety officer.
13.1.3 Acronyms
AFSCN Air Force Satellite Control Network
AMU Astronaut Maneuvering Unit
APS Alternate Payload Specialist
ASE Airborne Support Equipment
CCAFS Cape Canaveral Air Force Station
CCMS Checkout, Control and Monitor Subsystem
CCTV Closed Circuit Television
CDMS Command & Data Management Systems Officer
CFES Continuous Flow Electrophoresis System
CIC Crew Interface Coordinator
CIE Communications Interface Equipment
CITE Cargo Integration Test Equipment
DFI Development Flight Instrumentation
DFRF Hugh L. Dryden Flight Research Facility
DMC Data Management Coordinator
DMOS Diffusive Mixing of Organic Solutions
ECLSS Environmental Control & Life Support System
EECOMP Electrical, Environmental & Consumables Systems Engineer
EMU Extravehicular Mobility Unit
ESMC Eastern Space and Missile Center
FAWG Flight Assignment Working Group
FCTS Flight Crew Trainer Simulator
FOD Flight Operations Directorate
FOE Flight Operations Engineer
FOPG Flight Operations Planning Group
FOSO Flight Operations Scheduling Officer
FRCS Forward Reaction Control System
FSE Flight Simulation Engineer
GNC Guidance, Navigation & Control Systems Engineer
GSFC Goddard Space Flight Center
HMF Hypergolic Maintenance Facility
HPPF Horizontal Payloads Processing Facility
HUS Hypergolic Umbilical System
IECM Induced Environment Contamination Monitor
INCO Instrumentation & Communications Officer
IRIG Interrange Instrumentation Group
JSC Lyndon B. Johnson Space Center
KSC John F. Kennedy Space Center
LDEF Long Duration Exposure Facility
LETF Launch Equipment Test Facility
MLR Monodisperse Latex Reactor
MMACS Maintenance, Mechanical Arm & Crew Systems Engineer
MMPSE Multiuse Mission Payload Support Equipment
MMSE Multiuse Mission Support Equipment
MOD Mission Operations Directorate
MPGHM Mobile Payload Ground Handling Mechanism
MPPSE Multipurpose Payload Support Equipment
MSBLS Microwave Scanning Beam Landing System
MSFC George C. Marshall Space Flight Center
MTE Mobile Transporter Element
NASCOM NASA Communications Network
NIP Network Interface Processor
NOCC Network Operations Control Center
NSRS NASA Safety Reporting System
NSTL National Space Technology Laboratories
NSTS National Space Transportation System
OAST Office of Aeronautics & Space Technology
O&C Operations and Checkout (Building)
OFI Operational Flight Instrumentation
OMBUU Orbiter Midbody Umbilical Unit
OMRF Orbiter Maintenance & Refurbishment Facility
OMS Orbital Maneuvering System
OPF Orbiter Processing Facility
OSSA Office of Space Science and Applications
OSTA Office of Space and Terrestrial Applications
PACE Prelaunch Automatic Checkout Equipment
PAYCOM Payload Command Coordinator
PDRS Payload Deployment & Retrieval System
PGHM Payload Ground Handling Mechanism
PLSS Portable Life Support Subsystem
POCC Payload Operations Control Center
POD Payload Operations Director
PRF Parachute Refurbishment Facility
PRSD Power Reactant Storage & Distribution
RSS Rotating Service Structure
SAEF Spacecraft Assembly & Encapsulation Facility
SAIL Shuttle Avionics Integration Laboratory
SCAMMA Station Conferencing & Monitoring Arrangement
SCAPE Self-Contained Atmospheric Protection Ensemble
SID Simulation Interface Device
SMAB Solid Motor Assembly Building
SMCH Standard Mixed Cargo Harness
SPDM Special Purpose Dextrous Manipulator
SPIF Shuttle Payload Integration Facility
SPOC Shuttle Portable On-Board Computer
SRBDF Solid Rocket Booster Disassembly Facility
SRM&QA Safety, Reliability, Maintainability and Quality Assurance
SSC John C. Stennis Space Center
SSCP Small Self-Contained Payload
SSIP Shuttle Student Involvement Project
SSME Space Shuttle Main Engines
SSRMS Space Station Remote Manipulator System
STS Space Transportation System
TAEM Terminal Area Energy Management
TAL Trans-Atlantic Abort Landing
TDRS Tracking and Data Relay Satellite
TPAD Trunnion Pin Acquisition Device
VPF Vertical Processing Facility
WSMC Western Space & Missile Center
13.1.4 References and Bibliography
13.1 American Institute of Chemical Engineers, Guidelines for hazard evaluation procedures: with worked examples, 2nd edition, 1992.
13.2 Brimley, W., “Spacecraft Systems; Safety/Failure Tolerance Failure Management”, part of a set of course notes for a course offered previously at the University of Toronto, 199?.
13.3 Dhillon, B.S., Engineering Design; a modern approach, Irwin, 1996.
13.4 Dorf, R.C. (editor), The Electrical Engineering Handbook, IEEE Press/CRC Press, USA, 1993, pp. 2020-2031.
13.5 Leveson, N., Safeware: system safety and computers, Addison-Wesley Publishing Company Inc., 1995.
13.6 Rasmussen, J., Duncan, K., and Leplat, J., New Technology and Human Error, John Wiley & Sons Ltd., 1987.
13.7 Ullman, D.G., The Mechanical Design Process, McGraw-Hill, 1997.