[SystemSafety] Categorising "errors" [was: Stupid Software Errors]

From: Drew Rae < >
Date: Tue, 5 May 2015 09:21:00 +1000


I know some people who would look at this whole discussion and say "For forty years we've had better ways of talking about safety problems in complex systems than simply categorising them as human error. I can't believe people are still making this same basic mistake. I blame this young generation who haven't had a proper education in safety engineering, and are just messing about in organisations without a proper understanding of the fundamentals. What are we teaching engineers these days?"

Does anyone else find it remarkable that a company is publicly releasing information about a potential problem with a system, even though their own understanding of how the system is used suggests that the conditions that would cause the problem are unlikely to exist?

They've recognised that there is variability in the socio-technical system, and that information sharing is important. I can think of similar occasions in the past when this sort of hazard would have been considered "covered" by instructions in the operating procedures that "prevented" the power units from being on for that long, with no explanation of the rationale. I'm optimistic enough to see this as a sign of progress.

Drew

On 05/05/2015, at 8:47 AM, Heath Raftery wrote:

> On 5/05/2015 1:41 AM, Daniel Kästner wrote:

>> some performance figures about an Astrée analysis for a Level A avionics
>> application:
>> - code size > 700,000 lines of C code
>> - analysis duration: 6 hours
>> - hardware: Intel Core2Duo 2.66 GHz, 8GB RAM.
>> - result: 0 alarms
>> I.e. the absence of run-time errors was proven, including arithmetic
>> overflows.

>
> Is the implicit assumption that zero run-time errors is better actually sound? Here's a "run-time error":
>
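> <code>
> /* Needs <stdint.h> and <stdio.h>; wait() and buttonPressed come from the platform. */
> uint16_t buttonPressTime = 0, timeInMilliseconds = 0;
>
> while(1)
> {
>     wait(1);
>
>     timeInMilliseconds++;   /* wraps from 65535 to 0 */
>
>     if(buttonPressed)
>         buttonPressTime = timeInMilliseconds;
>
>     /* The cast keeps the difference modulo 2^16 even where int is wider
>        than 16 bits and both operands are promoted to signed int. */
>     if(buttonPressTime && ((uint16_t)(timeInMilliseconds - buttonPressTime) > 300))
>     {
>         printf("A button was pressed 0.3s ago.\n");
>         buttonPressTime = 0;
>     }
> }
> </code>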
>
> Eventually timeInMilliseconds will wrap: apparently a run-time error. But this code will "work" forever, even after the wrap occurs, provided the subtraction is evaluated modulo 2^16.
>
> Here's a "fix" for the run-time error:
>
> <code>
> uint16_t buttonPressTime = 0, timeInMilliseconds = 0;
>
> while(1)
> {
>     wait(1);
>
>     /* Saturate at the uint16_t limit (65535 ms) instead of wrapping. */
>     if(timeInMilliseconds < UINT16_MAX)
>         timeInMilliseconds++;
>
>     if(buttonPressed)
>         buttonPressTime = timeInMilliseconds;
>
>     if(buttonPressTime && (timeInMilliseconds-buttonPressTime > 300))
>     {
>         printf("A button was pressed 0.3s ago.\n");
>         buttonPressTime = 0;
>     }
> }
> </code>
>
> Tada! No run-time errors! Of course, it stops working after a minute.
>
> Yes, the tools are great, and not using them would take extraordinary justification. But to cry that "integer overflow was fixed 30 years ago!" may be missing the point.
>
> Heath
>
> _______________________________________________
> The System Safety Mailing List
> systemsafety_at_xxxxxx


Received on Tue May 05 2015 - 01:21:18 CEST
