Re: [SystemSafety] RE : Qualifying SW as "proven in use"

From: Nancy Leveson
Date: Thu, 27 Jun 2013 10:23:43 -0400

  1. Software is *not*, by itself, unsafe. It is an abstraction without any physical reality. It cannot itself cause physical damage. The safety of software is dependent on
    • the software logic itself,
    • the behavior of the hardware on which the software executes,
    • the state of the system that is being controlled or somehow affected by the outputs of the software (the encompassing "system"), and
    • the state of the environment.

All of these things determine safety, so a change in one can impact the so-called "software safety." For example, the change in the design of the Ariane 5, which gave it a steeper trajectory than the Ariane 4, led to the software contributing to the explosion. The environment does matter. All the usage of that software on the Ariane 4 meant nothing with respect to its use in the Ariane 5.

Any change in the environment, in the controlled system, in the underlying hardware, or in the software invalidates all previous experience unless one can prove that the change will not lead to an accident (and that proof cannot be based on a statistical argument). Does anyone know any non-trivial software, for example, that is not changed in any way over decades of use? Or even years of use? And what about changes in the behavior of human operators, of the system itself, and of the environment?

Someone wrote:
> I've been thinking about Peter's example a good deal, the developer seems to me to have made an implicit assumption that one can use a statistical argument based on successful hours run to justify the safety of the software.
And Peter responded:
>>It is not an assumption. It is a well-rehearsed statistical argument with a few decades of universal acceptance, as well as various successful applications in the assessment of emergency systems in certain English nuclear power plants.

"Well-rehearsed statistical arguments with a few decades of universal acceptance" are not proof. They are only well-rehearsed arguments. Saying something multiple times is not a proof. The fact that nuclear power plants in Britain have not experienced any major accidents (they have had minor incidents by the way) rises only to the level of anecdote, and not proof. And that experience (and well-rehearsed arguments) cannot be carried over to other systems.

I agree with the original commenter about the implicit assumption, which the Ariane 5 case disproves (as well as dozens of others).
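
For readers who have not seen the argument under dispute, here is a minimal sketch of the usual failure-free-hours calculation. It is an illustration, not taken from the thread, and the function name and the example figures are hypothetical. It assumes a constant failure rate and an operational profile that never changes, which is exactly the assumption questioned above.

```python
import math

def failure_free_hours_needed(rate_bound: float, confidence: float) -> float:
    """Failure-free hours needed to claim 'failure rate <= rate_bound'.

    ASSUMPTION (the one disputed in this thread): failures arrive at a
    constant rate under an unchanging environment, so the probability of
    seeing no failure in t hours is exp(-rate * t).
    """
    # If the true rate exceeded rate_bound, a failure-free run of t hours
    # would have probability below exp(-rate_bound * t). Requiring that
    # probability to be at most (1 - confidence) gives the bound below.
    return -math.log(1.0 - confidence) / rate_bound

# Hypothetical target: at most 1e-4 failures per hour, at 99% confidence.
hours = failure_free_hours_needed(1e-4, 0.99)
print(f"{hours:.1f} failure-free hours required")
```

The number is only as good as the model behind it: a change in the hardware, the controlled system, or the environment (the Ariane 4 to Ariane 5 case) is outside anything this calculation represents.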

2. It is not even clear what "failure" of software means when software is merely "design abstracted from its physical realization." How can a "design" fail? It may not satisfy its requirements (when executed on some hardware), but design (equivalent to a blueprint for hardware) does not fail and certainly does not fail "randomly."

Perhaps the reason why software reliability modeling still has pretty poor performance after at least 40 years of very bright people trying to get it to work is that the assumptions underlying it are not true. These assumptions have not been proven (only stated with great certainty) and, in fact, there is evidence showing they are not necessarily true. I tried raising this point a long time ago, but I was met with such a ferocious response (as I am sure I will be here) that I simply ignored the whole field and worked on things that seemed to have more promise. The most common assumption is that the environment is stochastic and that the selection of inputs (from the entire space of inputs) that will trigger a software fault (design error) is random. There is data from NASA (using real aircraft) that is evidence of "error bursts" in the software (ref. Dave Eckhardt). It appeared that these resulted when the aircraft flew a trajectory that was near a "boundary point" in the software and thus set off all the common problems in software related to boundary points. The selection of inputs triggering the problems was not random.
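
The difference between randomly selected inputs and a trajectory passing a boundary point can be illustrated with a toy simulation. This is a sketch under invented assumptions, not a reconstruction of the NASA data: the input space, the fault band, and all names are hypothetical. The same fault covers 1% of the input space in both runs; only the order in which inputs arrive differs.

```python
import random

INPUT_SPACE = 10_000
# Hypothetical fault near a "boundary point": 1% of the input space.
FAULT_BAND = range(5000, 5100)

def fails(x: int) -> bool:
    """True when the (hypothetical) software mishandles input x."""
    return x in FAULT_BAND

def longest_failure_run(inputs) -> int:
    """Length of the longest streak of consecutive failing inputs."""
    longest = current = 0
    for x in inputs:
        current = current + 1 if fails(x) else 0
        longest = max(longest, current)
    return longest

# Case 1: inputs drawn independently and uniformly, as the models assume.
rng = random.Random(0)
random_inputs = [rng.randrange(INPUT_SPACE) for _ in range(INPUT_SPACE)]

# Case 2: a slowly varying trajectory that sweeps once through the space,
# so consecutive inputs are close together.
trajectory_inputs = list(range(INPUT_SPACE))

random_run = longest_failure_run(random_inputs)
trajectory_run = longest_failure_run(trajectory_inputs)
print("longest failure run, random inputs:    ", random_run)
print("longest failure run, trajectory inputs:", trajectory_run)
```

Under random selection the failures are scattered in short streaks; along the trajectory they arrive as a single burst while the inputs stay inside the fault band, even though the fault itself is identical.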

As another example, Ravi Iyer looked at software failures of a widely used operating system in an interesting experiment where he found that a bunch of software errors appeared to be preceding a computer hardware failure. It made no sense that the software could be "causing" the hardware failure. Closer examination showed the problem. Hardware often degrades in its behavior before it actually stops. The strange hardware behavior, if I remember correctly, was exercising the software error handling routines until it got beyond the capability of the software to mitigate the problems. Again, in this case, the software was not "failing" due to randomly selected inputs from the external input space.

When someone wrote:
> I don't think that's true,

Peter Ladkin wrote:
>>You might like to take that up with, for example, the editorial board of IEEE TSE.

[As a past Editor-in-Chief of IEEE TSE, I can assure you that the entire editorial board does not read and vet the papers; in fact, I was lucky if one editor actually read the paper. Are you suggesting that anything that is published should automatically be accepted as truth? That nothing incorrect is ever published?]

Nancy

-- 
Prof. Nancy Leveson
Aeronautics and Astronautics and Engineering Systems
MIT, Room 33-334
77 Massachusetts Ave.
Cambridge, MA 02142

Telephone: 617-258-0505
Email: leveson_at_xxxxxx
URL: http://sunnyday.mit.edu



_______________________________________________ The System Safety Mailing List systemsafety_at_xxxxxx
Received on Thu Jun 27 2013 - 16:23:50 CEST