Re: [SystemSafety] Critical Design Checklist

From: Driscoll, Kevin R < >
Date: Wed, 28 Aug 2013 17:44:35 +0000


> Is "it" (the revised taxonomy) just re-inventing the wheel, or is there something else going on?
As one who abhors re-inventing the wheel (particularly when the result may have some corners on it), we don't do this unless we need to.

There are a number of problems with trying to make such a taxonomy. One is the trade-off between making the fault classes as broad as possible (to make sure we have covered as many faults as possible) and making the fault-class definitions concise and possessed of useful properties (e.g., being able to map appropriate fault-avoidance or fault-tolerance techniques to a particular fault class). Another problem is trying to simplify the high dimensionality of this space. When reducing the dimensionality by aggregating fault classes into supersets, different hierarchies can result. For example, should faults first be divided into Value faults and Timing faults, with each having a subset that is Byzantine? Or should Byzantine be the superior set, with Value and Timing subsets? Whichever way this is done, it seems there is always some lack of orthogonality.
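
[Editorial illustration, not from the original message: the two competing hierarchies could be sketched as nested classifications. All names here are hypothetical; the point is only that the same leaf behaviours group differently, so neither nesting is canonically "the" taxonomy.]

```python
# Hypothetical sketch: two ways to nest the same fault classes,
# illustrating the lack of orthogonality described above.

# Option A: split first on Value vs. Timing; Byzantine appears as a
# subset under each.
taxonomy_a = {
    "value":  {"symmetric", "byzantine"},
    "timing": {"symmetric", "byzantine"},
}

# Option B: Byzantine is the superior set, with Value/Timing subsets.
taxonomy_b = {
    "byzantine":     {"value", "timing"},
    "non_byzantine": {"value", "timing"},
}

def classes(taxonomy):
    """Flatten a two-level taxonomy into (superset, subset) pairs."""
    return {(sup, sub) for sup, subs in taxonomy.items() for sub in subs}

# Both hierarchies yield four leaf classes, but the groupings differ,
# so a fault-handling technique mapped to one superset in A has no
# single counterpart in B.
print(sorted(classes(taxonomy_a)))
print(sorted(classes(taxonomy_b)))
```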

There is no consensus on how a fault taxonomy should be constructed. When a group of people is assembled for some purpose and its members disagree on the taxonomy, some compromise taxonomy usually is created (often specific to the task at hand). There is also a lack of consensus on much of the terminology. For example, I disagree with the use of "arbitrary" as a synonym for, or a description of, "Byzantine" (need to edit the Wikipedia "Byzantine fault tolerance" page someday). I don't think "arbitrary" should be used for a fault set that doesn't include power-source overvoltage, shrapnel from exploding capacitors, common-mode failures due to compiler/linker or synthesizer bugs, ...

Even the basic definitions of fault, failure, and error are not completely agreed upon. I think the definitions created by IFIP WG10.4 are the best published and should be the ones generally used. However, I think the term "error" should apply only to the difference in state for those elements of a device that are intended to hold state. I vehemently disagree with those (including other members of WG10.4) who use "error" for a difference in any state of the device, including a structural state. That is, I would not classify a broken wire as an "error".
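
[Editorial illustration, not from the original message: a minimal sketch of the fault -> error -> failure chain in the IFIP WG10.4 sense, under the narrower reading argued for above, in which "error" applies only to elements intended to hold state. All names are hypothetical.]

```python
# Hypothetical sketch: a fault corrupts intended state (an error),
# and the error propagates to the service interface (a failure).
from dataclasses import dataclass

@dataclass
class Register:
    """An element *intended* to hold state; corruption here is an
    'error'. A broken wire, by contrast, would be a structural fault,
    not an error, under the narrower definition."""
    value: int

def service(reg: Register) -> int:
    # Specified service: deliver double the stored value.
    return 2 * reg.value

reg = Register(value=21)        # correct state; service would deliver 42
reg.value ^= 0x01               # a fault flips a bit: an error in held state
assert service(reg) != 42       # the error reaches the interface: a failure
```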

From: systemsafety-bounces_at_xxxxxx
Sent: Tuesday, August 27, 2013 14:06
To: Peter Bernard Ladkin
Cc: systemsafety_at_xxxxxx
Subject: Re: [SystemSafety] Critical Design Checklist

"It never seems to be exactly what we want."

Is "it" (the revised taxonomy) just re-inventing the wheel, or is there something else going on?



From: Peter Bernard Ladkin <ladkin_at_xxxxxx
Sent: Tuesday, August 27, 2013 1:01 PM
To: Robert Schaefer at 300
Cc: Driscoll, Kevin R; systemsafety_at_xxxxxx
Subject: Re: [SystemSafety] Critical Design Checklist

On 27 Aug 2013, at 18:07, Robert Schaefer at 300 <schaefer_robert_at_xxxxxx wrote:

Would a complete taxonomy even be possible? As the possibility of fault-contexts-in-the-world appears to be infinite or near infinite, wouldn't the number of fault types be near infinite as well?

Since "fault type" is a human classification, it is guaranteed not to be anywhere near infinite, but indeed quite finite. Perrow has a classification he called "DEPOSE". That has just six categories, one for each letter.

Whether it does what one wants it to do is another question, as Kevin points out.

I would also propose that fault is itself a human classification (since you talk about a fault in language, and no matter how precise, your words may have other instances which fulfil them; it is the words/concepts that define what you are talking about), whereas a failure has at least a time/space stamp. Ideally. Unfortunately, in the current state of the (lack of) art, I think failure may often lack objectivity too, if a specification exists and is ambiguous.

PBL Prof. Peter Bernard Ladkin, University of Bielefeld and Causalis Limited



Sent: Tuesday, August 27, 2013 11:28 AM
To: Matthew Squair
Cc: systemsafety_at_xxxxxx
Subject: Re: [SystemSafety] Critical Design Checklist

> such a list should possess orthogonality, decidability, atomicity, criticality and a rationale.
Addressing orthogonality (and completeness), the list should have a proper taxonomy. But, that's hard to do.

Internally, we keep revisiting the creation of a taxonomy for fault types, even though much has been published on the subject. It never seems to be exactly what we want.

Sent: Tuesday, August 27, 2013 04:12
To: martyn_at_xxxxxx
Cc: systemsafety_at_xxxxxx
Subject: Re: [SystemSafety] Critical Design Checklist

Not so much a list but a comment that the items in such a list should possess orthogonality, decidability, atomicity, criticality and a rationale.

The criticality should address Martyn's 'and what then' comment.

On Tuesday, 27 August 2013, Martyn Thomas wrote:

On 26/08/2013 21:37, Driscoll, Kevin R wrote:

For NASA, we are creating a Critical Design Checklist:

*       Objective
-     A checklist for designers to help them determine if a safety-critical design has met its safety requirements

Kevin

For this purpose, I interpret your phrase "safety requirements" for a "safety-critical design" as meaning that any system that can be shown to implement the design correctly will meet the safety requirements for such a system in some required operating conditions.

Here's my initial checklist:

  1. Have you stated the "safety requirements" unambiguously and completely? How do you know? Can you be certain? If not, what is your confidence level and how was it derived?
  2. Have you specified unambiguously and completely the range of operating conditions under which the safety requirements must be met? How do you know? Can you be certain? If not, what is your confidence level and how was it derived?
  3. Do you have scientifically sound evidence that the safety-critical design meets the safety requirements?
  4. Has this evidence been examined by an independent expert and certified to be scientifically sound for this purpose?
  5. Can you name both the individual who will be personally accountable if the design later proves not to meet its safety requirements and the organisation that will be liable for any damages?
  6. Has the individual signed to accept accountability? Has a Director of the organisation signed to accept liability?

Of course, there is a lot of detail concealed within these top-level questions. For example, the specification of operating conditions is likely to contain details of required training for operators, which will also need to be shown to be adequate.

But there's probably no need to go into more detail as you will probably get at least one answer "no" to the top six questions.

What will you do then?

Regards

Martyn

--
Sent from Gmail Mobile
_______________________________________________
The System Safety Mailing List
systemsafety_at_xxxxxx



Received on Wed Aug 28 2013 - 19:44:54 CEST
