Engineer On A Disk

13. Design Applications of Risk Management

13.1 The Space Shuttle Orbiter Control Computers

• The space shuttle uses 5 computers for flight control. The first 4 run a primary flight control system. The fifth computer runs a separate flight control program, and is only used in the most dire emergencies. The 4 redundant systems will operate separately, then compare outputs. These should be identical, but in the event of disagreement, they can vote a conflicting system out.
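
• The compare-and-vote step can be sketched as follows. This is only an illustration, not the shuttle's actual flight software: the function name `vote`, the strict-majority rule, and the output strings are all assumptions made for the example.

```python
from collections import Counter

def vote(outputs):
    """Compare outputs from redundant computers (hypothetical helper).

    outputs maps a computer id to the value it produced.
    Returns (majority_value, ids_voted_out).
    Raises ValueError when no value is held by a strict majority.
    """
    counts = Counter(outputs.values())
    value, n = counts.most_common(1)[0]
    if n * 2 <= len(outputs):
        raise ValueError("no majority among redundant computers")
    # Any computer that disagrees with the majority is voted out.
    dissenters = sorted(cid for cid, out in outputs.items() if out != value)
    return value, dissenters

# Assumed example: computer 3 disagrees with the other three.
value, voted_out = vote({1: "pitch+2", 2: "pitch+2", 3: "pitch-1", 4: "pitch+2"})
```

With four computers a 2-2 split gives no majority, so the sketch raises an error rather than guessing which pair is correct.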

13.1.1 A Mobile Service Robot for the Space Station

• We can see a figure depicting the SPDM for the planned space station.

[Figure placeholder: robot arm]

• All discussions in this section are based on the space station manipulator as described in SSP30000.

• The basic functions (at PMC) are classified as,

Category 1: requires tolerance for two consecutive failures in each system: fail safe/fail operational: basically requires 1 prime + 1 redundant + 1 backup

Category 2: requires tolerance for one failure in each system: failure tolerant: typically requires 1 prime + 1 backup

Category 2S: requires tolerance for one failure in the system: fail operational
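
• As a rough illustration of the categories, the number of failures each must tolerate can be tabulated and checked against the number of independent units provided. The names `REQUIRED_TOLERANCE` and `meets_category` are hypothetical, and the check assumes (simplistically) that any surviving unit can take over the function.

```python
# Failures each category must tolerate, taken from the list above.
REQUIRED_TOLERANCE = {"1": 2, "2": 1, "2S": 1}

def meets_category(category, independent_units):
    """True if at least one unit is left working after the required
    number of failures (assumes any unit can perform the function)."""
    return independent_units - REQUIRED_TOLERANCE[category] >= 1

# 1 prime + 1 redundant + 1 backup satisfies Category 1.
cat1_ok = meets_category("1", 3)
# 1 prime + 1 backup does not satisfy Category 1, but does satisfy Category 2.
cat1_short = meets_category("1", 2)
cat2_ok = meets_category("2", 2)
```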

• Examples of equipment in the different categories are,

Category 1: The orbiter is a time critical system

Category 2: MBS

Category 2S: Safety monitoring and emergency control systems

• Recall the following hazard levels, also consider the control requirements,

• For the manipulator (SSRMS) hazards include,

Criticality 1

payload released without command

possible collision

payload cannot be released

orbiter stuck to space station via SSRMS

orbiter collides with space station because of failed capture (docking with SSRMS).

motion of arm without command

possible collisions

no motion in arm in response to command

orbiter stuck to space station via SSRMS.

• Dealing with failures,

Criticality 1

all functions must be safed within 250 ms of occurrence of fault

Criticality 2

report as occurs

side effects are

can’t report critical failure

can’t safe a system

can’t implement alternate operation
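
• The Criticality 1 timing requirement can be expressed as a simple check on logged times. This is a sketch only: the function name and the use of integer millisecond timestamps are assumptions, and only the 250 ms limit comes from the requirement above.

```python
SAFING_DEADLINE_MS = 250  # Criticality 1 limit quoted above

def safing_deadline_met(fault_time_ms, safed_time_ms, deadline_ms=SAFING_DEADLINE_MS):
    """True if the function was safed no later than deadline_ms after the fault."""
    elapsed = safed_time_ms - fault_time_ms
    if elapsed < 0:
        raise ValueError("safing cannot precede the fault")
    return elapsed <= deadline_ms

on_time = safing_deadline_met(10_000, 10_200)   # safed 200 ms after the fault
too_late = safing_deadline_met(10_000, 10_300)  # safed 300 ms after the fault
```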

• isolation: we want to estimate the % failures that are prevented from reaching a specific module. Typically these values are,

95% isolated through ORU

90% isolated by on-line BITs

5% maximum false error indication rate
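
• One way to compare test results against these typical values is sketched below. The interpretation of each percentage as a simple ratio of counts is an assumption, as are the function and parameter names.

```python
def check_isolation(failures, isolated_to_oru, isolated_by_bit,
                    indications, false_indications):
    """Compute isolation rates from counts and compare with the typical targets.

    Returns (rates, ok) where ok is True only if every target is met.
    """
    rates = {
        "oru_isolation": isolated_to_oru / failures,
        "online_bit": isolated_by_bit / failures,
        "false_indication": false_indications / indications,
    }
    ok = (rates["oru_isolation"] >= 0.95        # 95% isolated through ORU
          and rates["online_bit"] >= 0.90       # 90% isolated by on-line BITs
          and rates["false_indication"] <= 0.05)  # 5% maximum false indications
    return rates, ok

# Assumed counts for illustration only.
rates, ok = check_isolation(failures=200, isolated_to_oru=192,
                            isolated_by_bit=181, indications=210,
                            false_indications=10)
```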

• MSS Failure Tolerance Concept [Brimley]

• Failure Detection and Isolation Coverage Scheme [Brimley]

• MSS Failure Management Functional Interfaces [Brimley]

• Layered defense approach for Detection of Sensor Data Failures [Brimley]

• Failure tolerance

fault tolerance

single failure tolerant

two failure tolerant for orbiter

provide drive (EVA) for joint and LEE latch mechanisms

• Reconfigurations

alternate data path/transmission

reconfiguration time less than 271 seconds

• The purpose for these measures

when a failure occurs, the software and hardware engineers must know what their systems are to do. Agreeing on these measures in advance is the best way to get everyone to agree.

• responses to operational failure of computational units may include,

invoking off-line BIT checks with error checking algorithms

operator visual inspections via cameras, etc.

analysis of a unit's memory through data dumps, etc.

ground support failure isolation analysis

exercising equipment with known algorithms

• Note: in the case of SSRMS the operator may use EVA units to move the arm away from contact.

• Operators may always elect to replace failed units, if spares are available.

• A diagram of the MSS Failure Management Concept is shown below, [Brimley]. This depicts a scheme for dealing with faults once they are detected. Some of the acronyms used are,

FD: Failure Detection

FI: Failure Isolation

C&W: Caution & Warning

CRIT: Failure Criticality

EVA: Extra Vehicular Activity

BIT: Built-In-Test

• There is also a scheme for estimating when a system has erred. This is based on a bottom up approach where the checks for errors are made in the specific modules, and then error reports are propagated up to the high level software/hardware. The diagram below depicts the system used in the SSRMS.
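
• The bottom up idea can be sketched in a few lines. This is not the SSRMS implementation: the `Module` class, the module names, and the error message are hypothetical. Each module runs its own check, and reports are passed up to the parent with the path of the originating module attached.

```python
class Module:
    """A node in a hypothetical module hierarchy with a local error check."""

    def __init__(self, name, check=None, children=()):
        self.name = name
        # check() returns an error message string, or None when healthy.
        self.check = check or (lambda: None)
        self.children = list(children)

    def collect_errors(self):
        """Run checks bottom-up and return a list of (path, message) reports."""
        reports = []
        for child in self.children:
            for path, msg in child.collect_errors():
                reports.append((f"{self.name}/{path}", msg))
        own = self.check()
        if own is not None:
            reports.append((self.name, own))
        return reports

# Assumed example: one joint reports a fault, the end effector is healthy.
joint = Module("joint1", check=lambda: "encoder out of range")
lee = Module("LEE")
arm = Module("arm", children=[joint, lee])
reports = arm.collect_errors()
```

The high level software sees one report, tagged `arm/joint1`, and can use the path to decide which unit to isolate.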

13.1.2 Case Studies In Failure

13.1.2.1 - Apollo 204

• Three astronauts were burned to death on the launch pad (1967)

• General Electric (GE) and other companies were commissioned to develop safety programs.

• These were developed based on combinations of existing programs for risk management, such as those used by the Air Force and the Department of Defense.

13.1.2.2 - Apollo 13

• a definitive example of failure management

• the mission objective was to land on the surface of the moon, and return.

• the mission had four major components,

1. The booster (a Saturn V rocket) was used for the initial launch, and was discarded after use.

2. The lunar module (dubbed Aquarius) was to be used for descent and ascent from the lunar surface, and to be discarded after use.

3. The service module was to be used for the trip to and from the moon.

4. The command module was equipped with a heat shield, and would be the only module to return to earth.

• In basic terms the mission had to be aborted as a result of an oxygen tank rupture. The fact that the astronauts managed to return safely is a tribute to the quality of design in the space program.

• The events are chronicled below,

PRELAUNCH

during a test the booster oxygen tank drains (causing fires in nearby automobiles, sparked by ignition systems)

a helium tank on the lunar module is found to have excessive pressure a number of times

oxygen tanks that supplied breathing air and fuel cells were not draining properly. This was overcome by turning on heat, and redraining the tanks. No action was taken because the performance was adequate. It is believed that a problem arose at this point where insulation on wires inside the tank was worn down because of overheating, and the slower drain of oxygen from the tank.

T-5 DAYS

one of the backup crew (Charles Duke) got German measles, and had exposed one of the prime crew (Thomas Mattingly). Mattingly was replaced by Jack Swigert as the command module pilot. This crew had not had any experience together in critical situations.

LAUNCH

The Saturn V rocket burn is cut off 2 minutes early, requiring additional fuel to be burned to reach earth orbit. Fuel levels were now lower, but not critical.

T+0 TO 55 HOURS

one of the oxygen tank pressure gauges had gone off the scale and ground control had requested several times that the oxygen be stirred by a small fan in the tank.

there were also pressure problems in a lunar module helium tank, and a hydrogen tank.

56 HOURS

a warning light indicates low pressure in a hydrogen tank in the service module. Fans in the tanks were turned on to stir up the hydrogen.

56 HOURS + 16 SECONDS

an arc between two wires starts a fire in an oxygen tank, fueled by the wire insulation. The pressure in the oxygen tank builds, and warnings are not sounded because they are overridden by the hydrogen tank warning. The fire spreads through the service module bay.

56 HOURS + 5 MINUTES

a bang is heard, crew assumes it is a noisy valve actuation, and that it might have been done as a practical joke.

the pressure has built enough to blow out an outer panel on the service module bay.

the master power alarm sounds as power on a master bus is lost. This was caused by damaged oxygen lines to the fuel cells. It turns out that 2 of the 3 fuel cells were off line.

hatch to lunar module is closed.

Houston is informed of the problem.

readings are scrambled, pressures and temperatures appeared erratic. Oxygen pressure in one tank was zero, and dropping in another.

a cloud is observed outside the service module

the venting of gases causes the craft to drift off course, the guidance system fails, and the craft begins to wobble, interrupting communications with the ground.

Houston is still taken by surprise, assuming that the four failures indicated could not be related, and were almost impossible independently, and begins looking for instrumentation problems.

a battery for reentry was connected to the power bus, but disconnected to reduce power drain.

Houston suggests disconnecting the 3rd fuel cell to check whether the oxygen loss was caused by the fuel cell. This operation could not be reversed.

the goal of landing on the moon is no longer assumed possible, but the lunar module is kept intact for its supply of water and oxygen.

the reentry battery is recharged and the command module is shut down, as the astronauts move to the lunar module. This is done acknowledging that the effects of the cold on the electronics are not known, and may affect earth reentry.

it was decided to continue the flight path, and use the moon's gravity to get back (with no lunar landing) for a total flight time of 100 hours.

THE NEXT 3 AND A HALF DAYS

the remainder of the trip was punctuated by,

cold (minimal power for heating left temperatures about 10°C)

stress

fatigue (standing room only, or sleeping in the cold command module)

hunger (the food was meant to be mixed with hot water)

thirst (water was generated by the fuel cells)

the CO2 was normally absorbed by filters. With the extra use the filters in the lunar module were exhausted. The filters from the command module were not the correct shape to fit, requiring a jury-rigged arrangement using duct tape.

a new reentry procedure was devised and had to be relayed verbally to the crew and written on paper (not plentiful). This checklist took 2 hours to transmit verbally.

DESCENT

The astronauts moved back to the command module and powered up the equipment (luckily the effects of the cold did not cause any malfunctions). There was a new concern over whether the heat shield was still intact. No instruments allowed this to be checked. The other modules were jettisoned.

the spacecraft landing began, and finished safely, with one of the fastest recoveries.

• In retrospect the accident was reviewed, and the causes for this accident were believed to be,

budget pressures, morale problems, and schedule pressures encouraged the use of the damaged oxygen tank.

the estimates of failure effects had been incomplete because of assumptions of adequacy.

emergency plans were available but not practiced.

13.1.2.3 - The Challenger

• The Challenger accident could have been prevented if the existing NASA procedures had been followed. The explosion of the Challenger is more the result of management failure in NASA and Morton Thiokol than the result of technical failures.

• In general the important notes are,

1972: contract awarded to Morton Thiokol to design the Solid Rocket Boosters (SRBs)

the design is based on a modified Titan III rocket, with significant design changes

one of the changes was an o-ring seal along the rocket body. The joint was made longer, and a second ring added to provide a redundant seal.

1977-78: An engineer discovers during tests that under pressure the joints rotated significantly causing the secondary o-ring to become ineffective. This is a result of the elongated joint to hold the secondary o-ring. Morton Thiokol management did not recognize the problem.

1980: The joint is classified on the CIL (Critical Item List) as 1R, indicating that failure would be catastrophic, but there is a redundant o-ring to act as a backup in the event of failure. This was only one of 700 items listed as criticality 1.

1981: the shuttle begins orbital testing

1982: the space shuttle is declared operational

After a few flights, problems with the o-rings were noted, as were other items. The normal procedure was to assign a problem tracking number, and examine the causes. This was not done for the o-ring problem. Eventually the problem was recognized and the rating was changed to 1 on the CIL. It was shown that despite NASA’s reclassification, the system was still listed as 1R in the Morton Thiokol paperwork, as well as a number of other documents. Also, Morton Thiokol disagreed with the criticality change, and went to a referee procedure.

1984: the erosion of the o-rings has become a significant concern, and review procedures are requested for the packing of the o-ring joint with the asbestos filled putty that prevents heating of the rings. Morton Thiokol responds with a letter suggesting that the higher pressures used in testing the joints were resulting in channels in the putty, and increased erosion of the o-rings. Statistics from before and after the change in testing pressure seemed to confirm this. Morton Thiokol recommends continuing the tests to ensure sealing despite the problems, and begins investigating the effects of the testing on the putty.

Jan 1985: A launch of a space shuttle at the coldest temperatures to date leads to the greatest failure of the o-rings to date. The o-rings will deform under pressure to seal the gap, but this is hindered when they are colder, and the material stiffer.

Jan-April 1985: Continued flights and investigations show continued problems with the o-rings, and a relationship to launch temperature. Morton Thiokol acknowledges the problem, and the effects of temperature, but concludes that the second o-ring will ensure safety.

April 1985: the primary o-ring does not seal, and the secondary ring carries the pressure, with some blowby (i.e., the backup was starting to fail). As a result a committee concludes that the shuttle must only be operated in an acceptable flight envelope for the o-ring seal. This report is received by Morton Thiokol, but does not seem to be properly distributed. The problem was also not properly reported within NASA to upper management.

July 1985: A Morton Thiokol engineer recommends that a team be set up to study the o-ring seal problem, citing a potential disaster.

August 1985: Morton Thiokol and NASA managers brief NASA headquarters on the o-ring problems, with a recommendation to continue flights, but step up investigations. A Morton Thiokol task force is set up.

October 1985: The head of the Thiokol task force complains to management about lack of cooperation and support.

December 1985: One Thiokol engineer suggests stopping shipments of SRBs until the problem is fixed. Thiokol writes a memo to NASA suggesting that the problem tracking of the o-rings be discontinued. This led to an erroneous listing of the problem as closed, meaning that it would not be considered as critical during launch.

Jan 1986: The space shuttle Challenger is prepared to launch on Jan. 22. The launch was originally scheduled for July 1985, and had been postponed 3 times and scrubbed once. It was rescheduled again to the 23rd, then the 25th, then the 27th, then the 28th. This was a result of weather, equipment, scheduling, and other problems.

Jan. 27th, 1986: The shuttle begins preparation for launch the next day, despite predicted temperatures below freezing (26°F) at launch time. Thiokol engineers express concerns over low temperatures, and suggest NASA managers be notified (this was not done). A minimum launch temperature of 53°F had been suggested to NASA. There was no technical opinion supporting the launch at this point. The NASA representative discussing the launch objected to the Thiokol engineers' opinions, and accused them of changing their opinions. Upper management became involved with the process, and “convinced” the technical staff to withdraw objections to the launch. Management at Thiokol gave the go ahead to launch under pressure from NASA officials (this was the critical decision).

the shuttle is wheeled out to the launch pad. Rain has frozen on the launch pad, and may have gotten into the SRB joints and frozen there also.

Jan. 28th, 1986: The shuttle director gives the OK to launch, without having been informed of the Thiokol concerns. The temperature is 36°F.

11:39 am: The engines are ignited, and a puff of black smoke can be seen blowing from the right SRB. As the shuttle rises the gas can be seen blowing by the o-rings. The vibrations experienced in the first 30 seconds of flight are the worst encountered to date.

11:40 am: A flame jet from the SRB starts to cut into the liquid fuel engine tank, and a support strut.

11:40:15 am: the strut gives way, and the SRB pointed nose cone pierces the liquid fuel tank. The resulting explosion totally destroys the shuttle and crew.

11:40:50 am: the SRBs are destroyed by the range safety officer.

13.1.3 Glossary

AFSCN Air Force Satellite Control Network

A/L Approach and Landing

ALT Approach and Landing Test

AMU Astronaut Maneuvering Unit

AOA Abort Once Around

APS Alternate Payload Specialist

APU Auxiliary Power Unit

ASE Airborne Support Equipment

ATE Automatic Test Equipment

ATO Abort to Orbit

BFC Backup Flight Control

BOC Base Operations Contract

CAPCOM Capsule Communicator

CCAFS Cape Canaveral Air Force Station

CCMS Checkout, Control and Monitor Subsystem

CCTV Closed Circuit Television

CDMS Command & Data Management Systems Officer

CDR Commander

CDS Central Data System

CFES Continuous Flow Electrophoresis System

CIC Crew Interface Coordinator

CIE Communications Interface Equipment

CITE Cargo Integration Test Equipment

CTS Call to Stations

DCC Data Computation Complex

DCS Display Control System

DFI Development Flight Instrumentation

DFRF Hugh L. Dryden Flight Research Facility

DIG Digital Image Generation

DMC Data Management Coordinator

DMOS Diffusive Mixing of Organic Solutions

DOD Department of Defense

DOP Diver Operated Plug

DPS Data Processing System

EAFB Edwards Air Force Base

ECLSS Environmental Control & Life Support System

EECOMP Electrical, Environmental & Consumables Systems Engineer

EI Entry Interface

EMU Extravehicular Mobility Unit

ESA European Space Agency

ESMC Eastern Space and Missile Center

ET External Tank

EVA Extravehicular Activity

FAO Flight Activities Officer

FAWG Flight Assignment Working Group

FBSC Fixed Base Crew Stations

F/C Flight Controller

FCT Flight Crew Trainer

FCTS Flight Crew Trainer Simulator

FD Flight Director

FDF Flight Data File

FDO Flight Dynamics Officer

FOD Flight Operations Directorate

FOE Flight Operations Engineer

FOPG Flight Operations Planning Group

FOSO Flight Operations Scheduling Officer

FR Firing Room

FCR Flight Control Room

FRCS Forward Reaction Control System

FRF Flight Readiness Firing

FRR Flight Readiness Review

FSE Flight Simulation Engineer

FSS Fixed Service Structure

GAS Getaway Special

GC Ground Control

GDO Guidance Officer

GLS Ground Launch Sequencer

GN Ground Network

GNC Guidance, Navigation & Control Systems Engineer

GPC General Purpose Computer

GSE Ground Support Equipment

GSFC Goddard Space Flight Center

HAC Heading Alignment Circle

HB High Bay

HMF Hypergolic Maintenance Facility

HPPF Horizontal Payloads Processing Facility

HUS Hypergolic Umbilical System

IECM Induced Environment Contamination Monitor

IG Inertial Guidance

ILS Instrument Landing System

IFM In Flight Maintenance

IMU Inertial Measurement Unit

INCO Instrumentation & Communications Officer

IRIG Interrange Instrumentation Group

ISP Integrated Support Plan

IUS Inertial Upper Stage

IVA Intravehicular Activity

JPL Jet Propulsion Laboratory

JSC Lyndon B. Johnson Space Center

KSC John F. Kennedy Space Center

LC Launch Complex

LCC Launch Control Center

LCS Launch Control System

LDEF Long Duration Exposure Facility

LETF Launch Equipment Test Facility

LOX Liquid Oxygen

LPS Launch Processing System

LSA Launch Services Agreement

LWG Logistics Working Group

MBCS Motion Base Crew Station

MCC Mission Control Center

MD Mission Director

MDD Mate/Demate Device

ME Main Engine

MECO Main Engine Cutoff

MET Mission Elapsed Time

MLP Mobile Launch Platform

MLR Monodisperse Latex Reactor

MLS Microwave Landing System

MMACS Maintenance, Mechanical Arm & Crew Systems Engineer

MMPSE Multiuse Mission Payload Support Equipment

MMSE Multiuse Mission Support Equipment

MMU Manned Maneuvering Unit

MOD Mission Operations Directorate

MOP Mission Operations Plan

MPGHM Mobile Payload Ground Handling Mechanism

MPPSE Multipurpose Payload Support Equipment

MPS Main Propulsion System

MS Mission Specialist

MSBLS Microwave Scanning Beam Landing System

MSC Mobile Servicing Centre

MSCI Mission Scientist

MSFC George C. Marshall Space Flight Center

MSS Mobile Service Structure

MST Mobile Service Tower

MTE Mobile Transporter Element

MUM Mass Memory Unit Manager

NASCOM NASA Communications Network

NBF Neutral Buoyancy Facility

NIP Network Interface Processor

NOCC Network Operations Control Center

NSRS NASA Safety Reporting System

NSTL National Space Technology Laboratories

NSTS National Space Transportation System

OAA Orbiter Access Arm

OAST Office of Aeronautics & Space Technology

OC Operations Coordinator

O&C Operations and Checkout (Building)

OFI Operational Flight Instrumentation

OFT Orbiter Flight Test

OMBUU Orbiter Midbody Umbilical Unit

OMRF Orbiter Maintenance & Refurbishment Facility

OMS Orbital Maneuvering System

OPF Orbiter Processing Facility

OSF Office of Space Flight

OSS Office of Space Science

OSSA Office of Space Science and Applications

OSTA Office of Space and Terrestrial Applications

OV Orbiter Vehicle

PACE Prelaunch Automatic Checkout Equipment

PAM Payload Assist Module

PAYCOM Payload Command Coordinator

PCR Payload Changeout Room

PDRS Payload Deployment & Retrieval System

PGHM Payload Ground Handling Mechanism

PHF Payload Handling Fixture

PIP Payload Integration Plan

PLSS Portable Life Support Subsystem

PLT Pilot

POCC Payload Operations Control Center

POD Payload Operations Director

PRF Parachute Refurbishment Facility

PRSD Power Reactant Storage & Distribution

PS Payload Specialist

R&D Research and Development

RCS Reaction Control System

RMS Remote Manipulator System

RPS Record Playback Subsystem

RSS Rotating Service Structure

RTLS Return to Launch Site

SAEF Spacecraft Assembly & Encapsulation Facility

SAIL Shuttle Avionics Integration Laboratory

SCA Shuttle Carrier Aircraft

SCAMMA Station Conferencing & Monitoring Arrangement

SCAPE Self-Contained Atmospheric Protection Ensemble

SID Simulation Interface Device

SIP Standard Interface Panel

SIT Shuttle Interface Test

SL Spacelab

SLF Shuttle Landing Facility

SMAB Solid Motor Assembly Building

SMCH Standard Mixed Cargo Harness

SMS Shuttle Mission Simulator

SN Space Network

SPDM Special Purpose Dextrous Manipulator

SPIF Shuttle Payload Integration Facility

SPOC Shuttle Portable On-Board Computer

SRB Solid Rocket Booster

SRBDF Solid Rocket Booster Disassembly Facility

SRM&QA Safety, Reliability, Maintainability and Quality Assurance

SSC John C. Stennis Space Center

SSCP Small Self-Contained Payload

SSIP Shuttle Student Involvement Project

SSME Space Shuttle Main Engines

SSP Standard Switch Panel

SSRMS Space Station Remote Manipulator System

SST Single System Trainer

STA Shuttle Training Aircraft

STS Space Transportation System

T Time

TACAN Tactical Air Navigation

TAEM Terminal Area Energy Management

TAL Trans-Atlantic Abort Landing

TDRS Tracking and Data Relay Satellite

TPAD Trunnion Pin Acquisition Device

TPS Thermal Protection System

TSM Tail Service Mast

UHF Ultra High Frequency

UV Ultraviolet

VAB Vehicle Assembly Building

VLF Very Low Frequency

VPF Vertical Processing Facility

WCS Waste Collection System

WSMC Western Space & Missile Center

WSSH White Sands Space Harbor

13.1.4 References and Bibliography

13.1 American Institute of Chemical Engineers, Guidelines for hazard evaluation procedures: with worked examples, 2nd edition, 1992.

13.2 Brimley, W., “Spacecraft Systems; Safety/Failure Tolerance Failure Management”, part of a set of course notes for a course offered previously at the University of Toronto, 199?.

13.3 Dhillon, B.S., Engineering Design; a modern approach, Irwin, 1996.

13.4 Dorf, R.C. (editor), The Electrical Engineering Handbook, IEEE Press/CRC Press, USA, 1993, pp. 2020-2031.

13.5 Leveson, N., Safeware: system safety and computers, Addison-Wesley Publishing Company Inc., 1995.

13.6 Rasmussen, J., Duncan, K., and Leplat, J., New Technology and Human Error, John Wiley & Sons Ltd., 1987.

13.7 Ullman, D.G., The Mechanical Design Process, McGraw-Hill, 1997.