[SystemSafety] Stupid Software Errors [was: Overflow......]

From: Peter Bernard Ladkin < >
Date: Mon, 04 May 2015 08:41:56 +0200



I wrote a version of the following a few days ago to a closed list.

AA has had EFBs crashing on a number of flights. Apparently two copies of the approach chart for Reagan Washington National airport were included in the latest update of the EFB, and the app wasn't able to handle having two files with almost-identical metadata denoted as "favorites". A colleague who flies for a major airline (not AA) which uses EFBs spoke of some colleagues having their EFBs crash early on Jan 1 one year - they fixed it by rolling the date back a day.

On the Boeing 787: think of the 32-bit Unix clock, and the many similar examples. There's even a Wikipedia page: http://en.wikipedia.org/wiki/Time_formatting_and_storage_bugs .
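
For concreteness, here is a minimal sketch of the arithmetic behind that class of bug. It is mine, not code from any of the systems mentioned - the centisecond tick is an assumption, though it matches the roughly 248-day figure in the public reports on the 787 issue:

    #include <stdint.h>
    #include <stdio.h>

    /* A signed 32-bit "uptime" counter incremented every 1/100 s:
       the classic shape of a time-storage bug. */
    int main(void) {
        int32_t centiseconds = INT32_MAX;   /* value just before wrap */
        double days = (double)centiseconds / 100.0 / 86400.0;
        printf("a signed 32-bit centisecond counter wraps after %.1f days\n", days);
        /* ~248.6 days. Incrementing past INT32_MAX is signed overflow,
           undefined behaviour in C, typically observed as a jump to a
           large negative value, after which elapsed-time logic misbehaves. */
        return 0;
    }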

Remember Apple's "goto fail" (CVE-2014-1266) from 2014: a duplicated goto that caused the TLS signature check to be skipped.
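
The published analyses make the point better than any description. The fragment below is my reduced illustration of the pattern, not Apple's source (the affected function was SSLVerifySignedServerKeyExchange in Secure Transport); the helper functions are stand-ins:

    #include <stdio.h>

    /* Reduced illustration of the CVE-2014-1266 pattern. */
    static int hash_update(void)      { return 0; }   /* 0 = success */
    static int verify_signature(void) { return -1; }  /* would reject a bad signature */

    static int check_server_key_exchange(void) {
        int err;

        if ((err = hash_update()) != 0)
            goto fail;
            goto fail;   /* duplicated line: unconditional, despite the indentation */

        if ((err = verify_signature()) != 0)   /* never reached */
            goto fail;

    fail:
        return err;      /* returns 0 ("verified") even though verification was skipped */
    }

    int main(void) {
        printf("result: %d (0 means the exchange is accepted)\n",
               check_server_key_exchange());
        return 0;
    }

The code after the duplicated goto is dead, and that is exactly the sort of thing routine tooling reports: clang's -Wunreachable-code, for instance, flags the skipped verification call in a fragment like this.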

These are simple, known types of error. Forty years ago, it was known how to avoid all these kinds of problems. Twenty years ago, there were industrial-quality engineering tools available (proper languages and coding standards checkers) which enabled companies to avoid such problems without undue development costs.

I don't buy Derek Jones's or Tom Ferrell's versions of the curate's egg. I don't see why anyone else should, either. Are they still going to be saying "well, it depends, it's complicated" in another twenty years when stupid coding errors still make it through into supposedly-dependable software products?

Look at goto fail. That's critical code! How come critical code such as that is not routinely subject to static analysis?

Look at the 787 generator code. A systematic loss of all generators is surely a hazardous event. That should put the requirement at 10^(-7) per flight hour. Oh, but I forgot. Even though correct operation of SW contributes to the 10^(-7), the reliability of the SW itself is not assessed. But surely it gets to be at least DAL B, since the result is a hazardous event? Oh, but I forgot something else. A systematic failure like that would be common cause, and the certification requirements concern single failures, not common-cause failures. So that's all right then. Tom's suggestion that it might have been a design compromise is vitiated by the fact that the phenomenon is subject to an AIRWORTHINESS Directive by the FAA. (Is that sufficient emphasis?)

If people had told me thirty years ago that we'd still be making the same stupid mistakes in the same ways, but this time in code more fundamental to the safe or secure operation of everyday engineered objects, I wouldn't have believed it.

Maybe it's a social thing. Mostly, the people actually writing the code and inspecting it are in their twenties and their bosses maybe at most in their early thirties. The young people have never made *this* mistake before - the previous lot had, of course, but they're all in management now. I'm reminded of Philip Larkin's ode to rediscovery, Annus Mirabilis:

Sexual intercourse began
In nineteen sixty-three
(Which was rather late for me)-
Between the end of the Chatterley ban
And the Beatles' first LP.

The Ensuing Discussion.

There was obviously discussion on the list of why we are making the same old mistakes forty years after it was known how to avoid them. Some discussants suggested it might help to certify software engineers professionally, as PEs. Others referred to the Knight-Leveson study for the ACM a decade ago, in which inserting SE into the current PE scheme was not seen as advantageous. UK discussants pointed out that such certification exists in the UK, as a CEng through the BCS or IET, and that there had been some UK consideration of extra qualification for critical-software engineering.

Such qualification for system safety hasn't (yet) generally caught on anywhere. SaRS (the Safety and Reliability Society) offers it in the UK, for example. It didn't catch on in the US. Over a decade ago, the System Safety Society introduced an option for system safety engineering into the PE exam. They had to pay the NSPE or NCEES (I forget which) lots of money per year to maintain the option - and two people took it over some number of years. So they dropped it. (I was at the board meeting in Ottawa in 2004 when this was decided.)

The UK qualification regime hasn't stopped IT disasters in government procurement. And it hasn't stopped the kind of poor engineering which allows bank ATMs using supposedly pseudo-one-time-pad nonce generation to be subject to replay attacks (see a recent paper from Ross Anderson's group reporting local experiments). I do note, however, that the three examples I mentioned above are all US examples. It's not ruled out that having some degree of formal professional training, as in the UK, encourages software engineers to avoid repeating simple mistakes whose prophylaxis has been well known for decades.
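
To make the ATM point concrete, here is a hypothetical sketch of the failure class - not the code examined in that paper: a "nonce" derived from a coarse clock and a small counter is not a nonce at all, because an attacker can enumerate the candidates in advance.

    #include <stdint.h>
    #include <stdio.h>
    #include <time.h>

    /* Hypothetical sketch: an "unpredictable number" built from a
       seconds-granularity timestamp and a small counter. */
    static uint32_t weak_nonce(uint32_t counter) {
        return ((uint32_t)time(NULL) << 8) | (counter & 0xFF);
    }

    int main(void) {
        /* Anyone who knows the approximate clock and the counter range
           can list the few possible values ahead of time, so a recorded
           transaction can be replayed or pre-computed for a later one. */
        for (uint32_t c = 0; c < 4; c++)
            printf("candidate nonce: %08x\n", (unsigned)weak_nonce(c));
        return 0;
    }

The well-known prophylaxis is to draw such numbers from a cryptographically strong random source rather than from clocks and counters.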

Time was when UK and US cars were not known for their reliability. Kind of like SW, relatively inexpensive cars used to go wrong a lot. However, some very expensive cars, such as those made by Rolls-Royce/Bentley and Wolseley, were reliable. So there was proof of concept. Japanese companies decided it was possible to produce reliable, relatively inexpensive cars and make money, and did it.

There is proof of concept in SE, too. Unlike Rolls-Royce cars, it is not prohibitively expensive. Three out of my four examples involve run-time error. It is feasible to produce SW cost-effectively which is free from run-time error. Just like the Japanese approach to cars, you just have to decide to do it.
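
As one small illustration of what "deciding to do it" can mean at the level of a single statement (the builtin below is a GCC/Clang extension; the serious route is a language subset plus tools that rule out the whole class at compile time):

    #include <stdint.h>
    #include <stdio.h>

    /* Checked arithmetic instead of silent wrap-around: the caller is
       forced to handle the limit case. __builtin_add_overflow is a
       GCC/Clang extension. */
    static int increment_uptime(int32_t *centiseconds) {
        int32_t next;
        if (__builtin_add_overflow(*centiseconds, 1, &next))
            return -1;                       /* report, don't wrap */
        *centiseconds = next;
        return 0;
    }

    int main(void) {
        int32_t t = INT32_MAX - 1;
        for (int i = 0; i < 3; i++) {
            if (increment_uptime(&t) != 0) {
                printf("counter at limit, refusing to wrap (t = %d)\n", (int)t);
                break;
            }
            printf("t = %d\n", (int)t);
        }
        return 0;
    }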

How about the following? We design a document called A Programmer's Pledge. It has thirty or so numbered clauses.

A professional programmer signs it and files it with his/her professional organisation. Quality-control issues in programs (such as the above phenomena) are routinely subject to RCA of sorts. When a programmer is responsible for a piece of code with such an error in it, the company reports it to the professional organisation and the programmer gets "points" attached to the corresponding clause in his/her Pledge. Like with driving (Germans say "points in Flensburg", which is where the office is; what is it in the UK - "points in Cardiff"?). I bet lots of organisations, from companies hiring programmers to professional-insurance companies, will find uses for it.

PBL Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany Je suis Charlie
Tel+msg +49 (0)521 880 7319 www.rvs.uni-bielefeld.de



