[SystemSafety] RE : Qualifying SW as "proven in use" - unclassified

From: King, Martin (NNPPI) < >
Date: Fri, 28 Jun 2013 09:13:32 +0100


One thing that seems to be getting overlooked is that safety functions fall into two camps - continuous and on demand. These may each have a different lifetime exposure to the input-range statistics.  

Most people intend that on-demand safety functions are rarely, if ever, invoked. Invocation of an on-demand safety function is usually the result of some other set of failures, which are also intended to be rare. This means that any evidence for correct operation of the on-demand safety function is, by design, rare, and evidence for incorrect operation (i.e. fails on demand) is, hopefully, even rarer.  
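
As a rough illustration of how little rare demand evidence can buy - a minimal sketch, assuming the standard zero-failure binomial bound, with purely illustrative numbers of my own:

import math

def pfd_bound(demands_seen, confidence):
    # Upper bound on the probability of failure on demand that is consistent,
    # at the given confidence, with zero failures having been observed in
    # 'demands_seen' genuine demands.
    return 1.0 - (1.0 - confidence) ** (1.0 / demands_seen)

print(pfd_bound(10, 0.90))    # ~0.21  - ten real demands tell you very little
print(pfd_bound(2300, 0.90))  # ~1e-3  - a 10^-3 claim needs thousands of demands

Ten genuine demands support almost no claim at all; a 10^-3 per-demand claim needs evidence from thousands of them.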

Most continuous safety systems will see most of their use with input conditions away from the region where safety issues may arise, operating in the sweet spot of 'normal' operation. So once again the input-space statistics will tend to show low frequencies around the region of (safety) interest.  

All of this might suggest that, in many cases, safety claims based on 'proven in use' may reflect operational practice better than the performance of the safety functions. There may be more evidence about 'fails safe' than about 'fails dangerous'.    

My opinion - not necessarily that of my employer!

Martin King  


From: systemsafety-bounces_at_xxxxxx [mailto:systemsafety-bounces_at_xxxxxx] On Behalf Of Matthew Squair
Sent: 28 June 2013 08:16
To: Peter Bernard Ladkin; Bielefeld Safety List
Subject: Re: [SystemSafety] RE : Qualifying SW as "proven in use"

Peter,

Yep, it was my poor choice of phrase. My point was that, in terms of evidence, one hour or a thousand hours of data from the old environment should carry the same evidential weight if the new environment is different from the old and we have no idea how.

Yes, I would agree that stochastic inputs can generate stochastic behaviour. So if it's inputs we're talking about, isn't the use of 'hours run' as a unit of exposure essentially a side issue, given that what you're actually doing is exposing the software to a set of inputs that are stochastic in nature? As a consequence, the amount of time you have to allow in order to collect a statistically valid sample is driven by the confidence you wish to obtain, the inherent variability of the input, and how frequently it arrives over some period of time.
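
For instance - a minimal sketch, assuming the simplest possible model (each input independently triggers the latent fault with some probability we want to bound, and inputs arrive at a fixed rate), with purely illustrative numbers of my own - 'hours run' only enters through the arrival rate:

import math

def hours_needed(confidence, p_trigger, arrivals_per_hour):
    # Failure-free inputs needed so that, at the given confidence, the
    # per-input probability of triggering a latent fault is below p_trigger;
    # converted to operating hours via the input arrival rate.
    n = math.log(1.0 - confidence) / math.log(1.0 - p_trigger)
    return math.ceil(n) / arrivals_per_hour

# Same evidential target, very different 'hours run':
print(hours_needed(0.90, 1e-4, arrivals_per_hour=3600))  # ~6.4 hours
print(hours_needed(0.90, 1e-4, arrivals_per_hour=1))     # ~23,000 hours

The same evidential target costs a handful of hours or a few years of operation, depending purely on how fast the inputs arrive.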

Going back to my example: if I 'know' that the variability of the inputs is extremely low, one hour of input data may do; if not, then much more may be needed. If the input data is very complex, no number of hours may be enough. All of which is about establishing how well we understand the environment, rather than the 'reliability' of the software, as I see it.

Picking up your example again, why was the difference between the two environments not detected by the designer? Was an assumption made that the new environment was the same as the old? Presuming that to be the case, isn't the decision to deploy software in that new context really about a different kind of uncertainty?

For example, the original environment had an arrival rate of inputs that could be characterised by some frequency; the new environment also has a frequency, but we are uncertain as to its value. We could have estimated bounds on the possible range of frequencies and run some tests to see what effect differing arrival rates might have, or we could have gone out and gathered field data, but instead (I presume) we elected to assume that the parameters were the same.

So deploying into a new environment carries epistemic uncertainty, and we can reduce this; but if we simply assume that the environment is the same, we are translating that epistemic uncertainty into an ontological one. I infer from your example that we didn't have to wait too long after deployment to find this problem, so I presume that we wouldn't have had to run a trial for very long before we saw the problem input.

As to whether you would or should weight operation in multiple different environments as better or worse: I was thinking of the open-source example, where having many different people looking at the code independently seems to produce very low defect rates; Linux is the example a lot of people use, I believe. So couldn't one argue that operation across a range of different environments would be more likely to expose different systematic errors than operation in one environment for a long time?

On Thu, Jun 27, 2013 at 9:35 PM, Peter Bernard Ladkin <ladkin_at_xxxxxx

        Matthew,         

        Scenarios such as those Bertrand describes are not that far-fetched. Unfortunately, there are in some places senior managers who are in the same state of (lack of) expertise as Bertrand describes. That is a problem of professional qualification, which I would prefer to treat as a separate issue.         

        On 27 Jun 2013, at 09:18, Matthew Squair <mattsquair_at_xxxxxx wrote:

> I've been thinking about Peter's example a good deal; the developer seems to me to have made an implicit assumption that one can use a statistical argument based on successful hours run to justify the safety of the software.

        It is not an assumption. It is a well-rehearsed statistical argument with a few decades of universal acceptance, as well as various successful applications in the assessment of emergency systems in certain English nuclear power plants.         

> I don't think that's true,

        You might like to take that up with, for example, the editorial board of IEEE TSE.         

> in fact I'd go further and say that whether you operate for a thousand hours or a million hours has no bearing on demonstrating software safety, because what we're interested in are systematic failures rather than random ones.

        I presume you would want to argue that the occurrence of a failure caused by a systematic fault is functionally dependent on the inputs, and that is what distinguishes it from what you call "random". However, if your inputs have a stochastic nature, then anything functionally dependent on them will also exhibit stochastic behavior. Failures caused by systematic faults thus exhibit stochastic behavior.         

> Example: I have a piece of software and (despite my best efforts) there's a latent fatal fault within it; however, testing hasn't discovered it, and I'm also in luck in that the operating environment is sufficiently close to the test environment that the fault is not triggered in the operating environment. Now I could run the system for one, one hundred or a thousand years in that operating environment and I wouldn't see a problem. So according to the statistical treatment the software is safe, even with a fatal flaw, isn't it?

        No. According to the statistical treatment, if you have seen 3 x 10^X operational hours without failure, *and* you are guaranteed to have had perfect failure detection, *and* the future operating environment has the exact same statistical properties as the previous (not "similar" but exact, statistically), then you may be 90% confident that you will see failures with a likelihood of not more than 10^(-X) per operating hour. How that might relate to a claim that "the software is safe" is up to you. Also, you didn't express what level of confidence you might need in such a claim.         
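
        Concretely, one standard form of that treatment is the zero-failure Poisson/exponential bound: with zero failures in T operating hours, a constant failure rate is bounded, at confidence C, by -ln(1-C)/T. A minimal sketch, with illustrative numbers of my own:

import math

def rate_bound(hours_failure_free, confidence):
    # Upper bound on a constant per-hour failure rate that is consistent,
    # at the given confidence, with zero failures in the observed hours.
    return -math.log(1.0 - confidence) / hours_failure_free

for X in (4, 5, 6):
    T = 3 * 10**X
    print(T, rate_bound(T, 0.90))  # each bound comes out below 10**(-X)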

> So logically, if the number of hours you run in service in a particular environment has nothing to do with proving the safety of software, why couldn't I say that after one hundred hours the software was 'proven in use' for that specific environment? Why not one hour?

        It is correct that the number of hours ... has nothing to do with proving the safety of software, if by that you mean establishing it beyond a shadow of doubt. Neither does any practical statistical reasoning. Usual levels of confidence in statistical reasoning are 95% - well away from certainty.         

        You can of course say that, after 100 hours of failure-free operation, the SW is "proven in use", whatever that might mean to you. What you cannot do is attribute to that assertion anything other than a very, very low level of confidence. The same goes for one hour, with an appropriately lower level of confidence (= epsilon, indistinguishable from zero, I would hope).         
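
        Running the same zero-failure bound the other way shows how little confidence 100 hours, or one hour, can support for a typical claim - a sketch with an illustrative target of 10^-4 failures per hour:

import math

def confidence_in_claim(rate_claim, hours_failure_free):
    # Confidence one can attach to 'failure rate <= rate_claim' given only
    # zero failures in the observed operating hours.
    return 1.0 - math.exp(-rate_claim * hours_failure_free)

print(confidence_in_claim(1e-4, 100))  # ~0.01   - about 1% confidence
print(confidence_in_claim(1e-4, 1))    # ~0.0001 - epsilon, near zero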

> In Peter's example the number of hours run on the original software version could have been one, or ten million, and there still would have been the same end result, i.e. a failure when put into a new operational context. In other words, one hour of operations has as much weight as one thousand (in the same environment).

        I am not sure what you mean here. To me, "new operational context" and "same environment" are contradictory, so maybe I don't understand the way you are using these terms.         

> Another question: say I have developed a piece of software and it's now running in three quite different operating environments. In terms of evidence of 'safety', would I weight 300 hours of operation in a single environment the same as 100 hours from each of these different environments? If so, why?

        What you have is 100 hours of experience from each of three different distributions. You could superimpose the distributions if you want, but the only reason to do that is if you are thinking of deploying the SW in an environment identical to that superimposition and want to get a clue as to its viability.         

        PBL
        Prof. Peter Bernard Ladkin, University of Bielefeld and Causalis Limited

-- 
Matthew Squair 


Mob: +61 488770655 
Email: MattSquair_at_xxxxxx

The data contained in, or attached to, this e-mail, may contain confidential information. If you have received it in error you should notify the sender immediately by reply e-mail, delete the message from your system and contact +44 (0) 1332 242424 (the Rolls-Royce IT Security Director) if you need assistance. Please do not copy it for any purpose, or disclose its contents to any other person.

An e-mail response to this address may be subject to interception or monitoring for operational reasons or for lawful business practices.

(c) 2012 Rolls-Royce plc

Registered office: 65 Buckingham Gate, London SW1E 6AT Company number: 1003142. Registered in England. 




_______________________________________________
The System Safety Mailing List
systemsafety_at_xxxxxx
Received on Fri Jun 28 2013 - 10:13:48 CEST
