Re: [SystemSafety] "Reliability" culture versus "safety" culture

From: SPRIGGS, John J < >
Date: Mon, 29 Jul 2013 14:00:50 +0000


Many safety people use James Reason's model that involves layers of cheese with holes in them. The cheese represents safety barriers; the idea is that when the holes line up, you can poke grissini through and cause an accident. Many of these same people will quite happily perform an FMEA, which proceeds by assuming that only the hole under consideration exists and that all the others are blocked. Good for assessing local effects, but not good for identifying hazards, let alone assessing their likelihoods...
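
To make the contrast concrete, here is a rough sketch in Python; the barrier probabilities are invented and the barriers are (questionably) assumed independent:

    # Illustrative sketch only (the numbers are invented): contrast the
    # single-fault view of an FMEA with the "holes lining up" view of the
    # Swiss-cheese model, assuming independent barriers.

    barrier_failure_prob = {"alarm": 1e-2, "interlock": 1e-3, "operator": 5e-2}

    # FMEA-style local view: each failure mode considered on its own,
    # with every other barrier implicitly assumed intact (hole blocked).
    for name, p in barrier_failure_prob.items():
        print(f"Single-fault view: {name} fails with probability {p:.0e}")

    # Swiss-cheese view: an accident needs all the holes to line up at once.
    p_accident = 1.0
    for p in barrier_failure_prob.values():
        p_accident *= p
    print(f"All barriers breached together: {p_accident:.0e}")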

John

-----Original Message-----
From: systemsafety-bounces_at_xxxxxx
Sent: 29 July 2013 13:38
To: systemsafety_at_xxxxxx
Subject: [SystemSafety] "Reliability" culture versus "safety" culture

As a few of you know, I have recently been involved in what appears to be a technical-culture clash between "reliability" and "safety" engineers, which has led, and continues to lead, to organisational problems, for example over the scope of technical standards. Some suspect that such a culture clash is moderately rigid. I would like to figure out as many specific technical differences as I can. It is moderately important to me that the expression of such differences attains universal assent (that is, assent from both cultures, as well as any others...).

Here are some I know about already.

  1. Root Cause Analysis. Reliability people set store by methods such as Five Whys and Fishbone Diagrams, which people analysing accidents or serious incidents consider hopelessly inadequate (in Nancy's word, "silly").
  2. Root Cause Analysis. Reliability people often look to identify "the" root cause of a quality problem, and many methods are geared to identifying "the" root cause. Accident analysts are (usually) adamant that there is hardly ever (in the words of many, "never") just *one* cause which can be called root.
  3. FMEA. With today's complex systems, there are considerable questions about how to calculate maintenance cycles. Even a military road vehicle nowadays can be considered a "system of systems", in that the system-subsystem hierarchy is quite deep. Calculating maintenance cycles requires obtaining some idea of the MTBFs of components. Components may be simple, or line-replaceable units, or units that require shop maintenance. Physical components may or may not correspond to functional blocks (there is a notation, Functional Block Diagrams or FBDs, which is widely used). There are ways of calculating MTBFs and maintenance procedures for components hierarchically arranged in FBDs, and they may well work well enough, given the complexity to be controlled, to determine the requirements for regular maintenance.

However, if functional failures contribute to hazards, these methods, which are approximate, do not appear to work well for assessing the likelihoods of hazards arising. (This is true even for those hazards which arise exclusively as a result of failures.)
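
As a rough illustration of that kind of hierarchical roll-up, and of where the approximation creeps in, here is a sketch in Python; the component failure rates are invented, and constant failure rates and independence are assumed:

    # A rough sketch (not any particular standard's method) of rolling up
    # invented component failure rates through a functional-block hierarchy.
    # Constant failure rates and independence are assumed throughout.

    def series_rate(rates):
        # Series blocks: any single failure fails the block, so rates add.
        return sum(rates)

    def parallel_mtbf(l1, l2):
        # Two active-redundant blocks, exponential failures, no repair:
        # MTBF = 1/l1 + 1/l2 - 1/(l1 + l2).
        return 1 / l1 + 1 / l2 - 1 / (l1 + l2)

    # Hypothetical LRU failure rates, in failures per hour.
    lambda_sensor, lambda_ecu, lambda_actuator = 2e-5, 5e-6, 1e-5

    # Sensor and actuator in series with a duplicated ECU. Treating the
    # redundant pair as if it had a constant rate of 1/MTBF is exactly the
    # kind of approximation that may be tolerable for maintenance planning
    # but questionable for estimating the likelihood of a hazard.
    ecu_pair_rate = 1 / parallel_mtbf(lambda_ecu, lambda_ecu)
    system_rate = series_rate([lambda_sensor, ecu_pair_rate, lambda_actuator])
    print(f"Approximate system MTBF: {1 / system_rate:,.0f} hours")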

4. FMEA. People who work with FMEA for reliability goals are not so concerned with completeness. Indeed, I have had reliability-FMEA experts dismiss the subject when I brought it up, claiming it to be "impossible". However, people who use FMEA for the analysis of failures of safety-relevant systems and their hazards must be very concerned, as a matter of due diligence, that their analyses (their listing of failure modes) as far as possible leave nothing out (in other words, that they are as complete as possible).
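
As an illustration of the bookkeeping involved (not a proof of completeness), one can cross-check the failure modes actually analysed against a generic checklist of modes per function; the functions and entries below are invented:

    # A sketch of a bookkeeping aid, not a completeness proof: cross-check
    # the failure modes actually analysed against a generic checklist of
    # modes per function. The functions and entries below are invented.

    GENERIC_MODES = ("loss", "unintended operation", "degraded", "erroneous output")

    functions = ("braking demand", "wheel-speed sensing", "warning indication")

    analysed = {
        ("braking demand", "loss"),
        ("braking demand", "unintended operation"),
        ("wheel-speed sensing", "loss"),
    }

    for f in functions:
        for m in GENERIC_MODES:
            if (f, m) not in analysed:
                print(f"Not yet analysed: {f} / {m}")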

5. Testing. Safety people generally know (or can be presumed to know) of the work which tells them that assessing software-based systems for high reliability through testing cannot practically be accomplished if the desired reliability is better than about one failure in 10,000 to 100,000 operational hours (e.g., Littlewood/Strigini and Butler/Finelli, both 1993).
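
The arithmetic behind that limit is simple. Assuming a constant failure rate and failure-free statistical testing, claiming a rate below some target with confidence (1 - alpha) needs roughly -ln(alpha)/target failure-free operating hours, as a few lines of Python make vivid:

    # Back-of-envelope sketch of why that limit arises, assuming a constant
    # failure rate and failure-free statistical testing. To claim a rate
    # below some target with confidence (1 - alpha), one needs
    # exp(-target * t) <= alpha, i.e. t >= -ln(alpha) / target hours.

    from math import log

    alpha = 0.01  # 99% confidence
    for target in (1e-4, 1e-5):
        hours = -log(alpha) / target
        print(f"Rate < {target:.0e}/h needs about {hours:,.0f} failure-free hours")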

Reliability people, on the other hand, believe that statistical analysis of testing is practical and worthwhile. For example, from a paper in the 2000 IEEE R&M Symposium:
> Abstract: When large hardware-software systems are run-in or an acceptance testing is made, a
> problem is when to stop the test and deliver/accept the system. The same problem exists when a
> large software program is tested with simulated operations data. Based on two theses from the
> Technical University of Denmark the paper describes and evaluates 7 possible algorithms. Of these
> algorithms the three most promising are tested with simulated data. 27 different systems are
> simulated, and 50 Monte Carlo simulations made on each system. The stop times generated by the
> algorithm is compared with the known perfect stop time. Of the three algorithms two is selected
> as good. These two algorithms are then tested on 10 sets of real data. The algorithms are tested
> with three different levels of confidence. The number of correct and wrong stop decisions are
> counted. The conclusion is that the Weibull algorithm with 90% confidence level takes the right
> decision in every one of the 10 cases.
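
The following is not the paper's algorithm, merely a sketch of the general kind of stop rule it evaluates: fit a power-law ("Weibull process", Crow-AMSAA) model to the failure times seen during run-in, and stop when the estimated current failure intensity falls below a target. The paper's confidence-level machinery is omitted and all numbers are invented:

    # Not the paper's algorithm: a sketch of the general kind of stop rule
    # it evaluates, using a Crow-AMSAA (power-law / "Weibull process") fit
    # to failures observed during run-in. All numbers are invented.

    from math import log

    def current_intensity(failure_times, T):
        """Crow-AMSAA point estimate of the failure intensity at test time T."""
        n = len(failure_times)
        beta = n / sum(log(T / t) for t in failure_times)  # shape MLE
        lam = n / T ** beta                                # scale MLE
        return lam * beta * T ** (beta - 1)

    failure_times = [30, 70, 160, 310, 640, 1300]  # hours into the test
    T = 2000.0                                     # total test hours so far
    target = 1e-3                                  # acceptable failures per hour

    rho = current_intensity(failure_times, T)
    print(f"Estimated current intensity: {rho:.2e} per hour")
    print("Stop testing" if rho < target else "Keep testing")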

6 .... and onwards. I would like to collect as many examples as possible of such differences. Do some of you have other contrasts to contribute? I would like to share them with colleagues, so I intend to attribute each example to its contributor, if that is OK. (Examples offered anonymously will, as desired, be kept anonymous.)

PBL

Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany
Tel+msg +49 (0)521 880 7319
www.rvs.uni-bielefeld.de









The System Safety Mailing List
systemsafety_at_xxxxxx

Received on Mon Jul 29 2013 - 16:01:08 CEST
