[SystemSafety] "Reliability" culture versus "safety" culture

From: Peter Bernard Ladkin < >
Date: Mon, 29 Jul 2013 14:37:32 +0200

As a few of you know, I have recently been involved in what appears to be a technical-culture clash between "reliability" and "safety" engineers, which leads to organisational problems, for example over the scope of technical standards. Some suspect that such a culture clash is moderately rigid. I would like to identify as many specific technical differences as I can. It is moderately important to me that the expression of such differences attain universal assent (that is, from both cultures as well as any others....)

Here are some I know about already.

  1. Root Cause Analysis. Reliability people set store by methods such as Five Whys and Fishbone Diagrams, which people analysing accidents or serious incidents consider hopelessly inadequate (in Nancy's word, "silly").
  2. Root Cause Analysis. Reliability people often look to identify "the" root cause of a quality problem, and many methods are geared to identifying "the" root cause. Accident analysts are (usually) adamant that there is hardly ever (in the words of many, "never") just *one* cause which can be called root.
  3. FMEA. With today's complex systems, there are considerable questions about how to calculate maintenance cycles. Even a military road vehicle nowadays can be considered a "system of systems", in that the system-subsystem hierarchy is quite deep. Calculating maintenance cycles requires obtaining some idea of the MTBFs of components. Components may be simple, or line-replaceable units, or units that require shop maintenance. Physical components may or may not correspond to functional blocks (there is a widely used notation, Functional Block Diagrams or FBDs). There are ways of calculating MTBFs and maintenance procedures for components hierarchically arranged in FBDs. They may well work well enough, for the purpose of controlling complexity, to determine the requirements for regular maintenance.

However, if functional failures contribute to hazards, these methods, which are approximate, do not appear to work well for assessing the likelihoods of hazards arising. (This is true even for those hazards which arise exclusively as a result of failures.)
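To make the kind of calculation at issue concrete, here is a minimal sketch of a hierarchical MTBF roll-up over a functional block tree. The block names, the failure rates, and the series (sum-of-rates) composition are all illustrative assumptions, not taken from any real system; it is exactly the approximate character of such roll-ups that the point above concerns.

```python
# Sketch: hierarchical MTBF roll-up for a functional block tree,
# assuming constant (exponential) failure rates and series composition.
# All block names and rates below are illustrative, not from any real system.

def failure_rate(block):
    """Return the failure rate (failures/hour) of a block.
    A leaf is (name, rate); an internal node is (name, [children]),
    whose rate is the sum of child rates (series assumption)."""
    name, payload = block
    if isinstance(payload, list):
        return sum(failure_rate(child) for child in payload)
    return payload

vehicle = ("vehicle", [
    ("powertrain",  [("engine", 1e-4), ("transmission", 5e-5)]),
    ("electronics", [("ecu", 2e-5), ("sensors", 8e-5)]),
])

lam = failure_rate(vehicle)   # total failure rate: 2.5e-4 /hour
mtbf = 1.0 / lam              # MTBF in hours
print(round(mtbf, 1))         # 4000.0
```

The series assumption (any component failure is a system failure, rates simply add) is what makes the arithmetic tractable, and also what makes the result an approximation once redundancy or common-cause effects enter.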

4. FMEA. People who work with FMEA for reliability goals are not so concerned with completeness. Indeed, I have had reliability-FMEA experts dismiss the subject when I brought it up, claiming it to be "impossible". However, people who use FMEA for the analysis of failures of safety-relevant systems and their hazards must be very concerned, as a matter of due diligence, that their analyses (their listing of failure modes) as far as possible leave nothing out (in other words, that they are as complete as possible).

5. Testing. Safety people generally know (or can be presumed to know) of the work which tells them that assessing software-based systems for high reliability through testing cannot practically be accomplished, if the desired reliability is higher than one error in 10,000 to 100,000 operational hours (e.g., Littlewood/Strigini, Butler/Finelli, both 1993).
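A back-of-envelope calculation illustrates the scale of the problem. Under the simplifying assumption of a constant failure rate, the failure-free test time needed to support a rate claim at a given confidence follows from the exponential model; the numbers below are illustrative and are not taken directly from the cited papers.

```python
import math

# Back-of-envelope: failure-free test time needed to support a claim
# that the failure rate is no worse than lam (per hour) at confidence
# 1 - alpha, assuming a constant-rate (exponential) failure model.

def required_test_hours(lam, alpha=0.01):
    """Hours of failure-free operation needed so that observing zero
    failures rejects 'rate > lam' at significance alpha:
    exp(-lam * t) <= alpha  =>  t >= -ln(alpha) / lam."""
    return -math.log(alpha) / lam

hours = required_test_hours(1e-4)    # target: 1 failure per 10,000 hours
print(round(hours))                  # 46052 hours, failure-free
print(round(hours / (24 * 365), 1))  # 5.3 -- roughly that many years
```

Even the modest end of the range above demands years of continuous failure-free testing, which is why the cited results regard testing-based assessment of higher reliability levels as impractical.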

Reliability people, by contrast, believe that statistical analysis of testing is practical and worthwhile. For example, from a paper in the 2000 IEEE R&M Symposium:
> Abstract: When large hardware-software systems are run-in or an acceptance testing is made, a
> problem is when to stop the test and deliver/accept the system. The same problem exists when a
> large software program is tested with simulated operations data. Based on two theses from the
> Technical University of Denmark the paper describes and evaluates 7 possible algorithms. Of these
> algorithms the three most promising are tested with simulated data. 27 different systems are
> simulated, and 50 Monte Carlo simulations made on each system. The stop times generated by the
> algorithm is compared with the known perfect stop time. Of the three algorithms two is selected
> as good. These two algorithms are then tested on 10 sets of real data. The algorithms are tested
> with three different levels of confidence. The number of correct and wrong stop decisions are
> counted. The conclusion is that the Weibull algorithm with 90% confidence level takes the right
> decision in every one of the 10 cases.
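The paper's actual algorithms are not reproduced here; purely as a generic illustration of the stop-testing idea (stop when the estimated residual failure intensity drops below a target), here is a sketch with an invented estimator and invented data.

```python
# Generic illustration of a test-stopping rule -- NOT the algorithm from
# the cited paper: estimate the current failure intensity from the most
# recent inter-failure gaps and stop when it falls below a target rate.

def should_stop(failure_times, target_rate, window=5):
    """failure_times: cumulative times (hours) of observed failures.
    Stop if the mean of the last `window` inter-failure gaps implies
    an intensity below target_rate (failures/hour)."""
    if len(failure_times) < window + 1:
        return False
    recent = failure_times[-(window + 1):]
    gaps = [b - a for a, b in zip(recent, recent[1:])]
    intensity = 1.0 / (sum(gaps) / len(gaps))
    return intensity < target_rate

# Illustrative data: failures arriving ever more slowly (reliability growth).
times = [1, 3, 7, 15, 40, 100, 250, 600, 1500]
print(should_stop(times, target_rate=1 / 200))  # True -- intensity has fallen
```

The contrast with item 5 is that such rules can only certify the modest reliability levels reachable within feasible test times; they say nothing about the much lower rates safety arguments typically need.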

6. ... and onwards. I would like to collect as many examples as possible of such differences. Do some of you have other contrasts to contribute? I would like to share them with colleagues, so I do intend to attribute each to its contributor if this is OK. (Examples whose contributors wish to remain anonymous will, as desired, be kept anonymous.)

PBL Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany Tel+msg +49 (0)521 880 7319 www.rvs.uni-bielefeld.de

The System Safety Mailing List
systemsafety_at_xxxxxx Received on Mon Jul 29 2013 - 14:37:41 CEST
