Re: [SystemSafety] A comparison of STPA and ARP 4761

From: Peter Bernard Ladkin < >
Date: Tue, 29 Jul 2014 07:42:12 +0200

On 2014-07-29 01:49, Matthew Squair wrote:
> To be absolutely fair, the comparison is between the worked example provided in ARP 4761 and a STAMP
> rerun of the same example.

Redoing a worked example is a standard method of comparison. But I am inclined to be wary of overgeneralising the conclusion (unlike the authors cited by M. Fabre :-) )

There is a real, pressing issue here which the cited conclusion addresses. Put simply, it is that we have to do better than the "traditional" methods of hazard and risk analysis.

We do indeed "... need to create and employ more powerful and inclusive approaches to evaluating safety" than those referred to in the current aerospace standards. FMEA as it is currently performed, FTA and RBD are the three techniques referred to explicitly in 14 CFR 25.1309 for auxiliary kit, as commonly executed in aerospace contexts. They all miss things that need to be identified and mitigated. This is well known to people like us, but there are people out there in industry who swear by (their version of) FMEA, FTA and so on, and to my mind the message needs to be put out much more visibly: there are issues which will cause problems with your kit and which these methods do not identify.

I don't know about the 4761 example, but there are well-known actual cases. The "usual techniques" did not identify the error in the boot-up configuration SW of the Boeing 777 FMS which led to the uncommanded pitch excursions of the Malaysia Airlines Boeing 777 out of Perth in 2005, and it should be clear that they could never do so. Neither did they identify that spiky, misleading output emanating from a sporadically faulty ADIRU could be accepted as veridical by the primary flight control computer SW and lead (also) to uncommanded pitch excursions (and some damaged people this time) in the 2008 Learmonth A330 accident, or to a similar incident a couple of months later on a sister ship. And, again, it should be clear that they could never do so. Neither did they adequately assess the risk of the Byzantine fault in <famous airplane> databus processes, recounted by Driscoll, Hall, Sivencrona and Zumsteg in their classic SAFECOMP 2003 paper, which almost led to the withdrawal of the airworthiness certificate.

It is also evident (at least to those on this list, I hope!) that these traditional methods do not address important issues of how operators work with the systems they are to operate - what is called HF. There are many recent examples: try Turkish in Amsterdam in 2009 or Asiana in San Francisco in 2013. The two differ in that Turkish slaved faulty kit to the autothrottle and left themselves under autothrottle control, whereas nothing was wrong with anything on Asiana; but they are similar in that experienced crews didn't monitor the basics on final approach until it was too late to recover - and we really are talking about stuff that every pilot is taught on his or her first flying lesson and that is emphasised throughout primary training. The puzzle is why not. That is puzzling everyone at the moment. (But I take the likelihood that one could wave some magical method at the problem and have the answer pop out to be just about zero, if not slightly less.)

What happens, constantly in my experience, and likely that of the MIT people also, is that development engineering says "I do this-and-this and this like the guidance says, and it shows me that-and-that and then I fix it. <And, implicitly, anything I didn't see when doing this-and-this doesn't exist/isn't my concern/etc> So, job done and the certification agencies are happy." And that's just not enough. Methods should be used that identify *all* the hazards if possible. Such methods are available and as the paper says STPA is one. Traditional FMEA, FTA and RBD aren't such methods.

But I think people can do much better than they do at the moment with incremental improvements, rather than having to learn an entire sociological-system model (this is presumably where I differ in my view from the authors of the cited article :-) ). One obvious lack in the traditional methods is any kind of reliable self-check. You do an FMEA - how do you know the one you got is right/sufficient/adequate? Same with FTA. The answer is: traditionally you don't - there are no technical correctness/completeness checks. So, devise a valid check, add it on, and you are already doing much better.
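To make the "devise a valid check, add it on" suggestion concrete, here is a minimal sketch (in Python) of one possible self-check for an FT: compute the tree's cut sets and cross-check them against an independently derived list of causal factors, say from a separate HAZOP-style pass. Every event name here is invented for illustration, not from any real analysis; the point is only the shape of the check - a factor on the independent list that appears in no cut set flags the tree as incomplete.

```python
from itertools import product

# Toy fault tree: gates are ("AND"|"OR", [children]); leaves are strings.
# All event names are made up for illustration.
tree = ("OR", [
    ("AND", ["pump_fails", "backup_pump_fails"]),
    "relief_valve_stuck",
])

def cut_sets(node):
    """Return the cut sets of a gate/leaf as a set of frozensets of basic events."""
    if isinstance(node, str):
        return {frozenset([node])}
    op, children = node
    child_sets = [cut_sets(c) for c in children]
    if op == "OR":                       # any one child's cut set suffices
        return set().union(*child_sets)
    # AND: every combination of one cut set per child
    return {frozenset().union(*combo) for combo in product(*child_sets)}

# Independently derived causal factors (e.g. from a separate HAZOP-style pass).
independent_factors = {"pump_fails", "relief_valve_stuck", "sensor_drift"}

covered = set().union(*cut_sets(tree))
missing = independent_factors - covered   # non-empty => tree is incomplete
```

Here the check reports "sensor_drift" as unaccounted for, which is exactly the kind of technical completeness signal the traditional write-up of an FTA never produces.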

I am not the only person to have noticed this, of course. There are myriad papers out there which propose an enhancement, take an example, redo it using the enhancement, and show how much better the results are. But a common response is "sure, every PhD student and his or her mother can enhance a technique to work better on a given specific example". One could instead observe that if there are a ton of papers out there showing by example how a traditional method fails to catch everything that needs catching, then it is very likely indeed that using that method risks missing things you should be catching. But this reasonable conclusion is rarely heard.

So, for example, "everyone" redoes the pressure-vessel example in the Fault Tree Handbook (kudos, BTW, to Bedford/Cooke, who don't, while having good material on FTA): Kammen and Hassenzahl in their 1999 text, and Leveson in her 1995 monograph. We had a go too in 2000. Some simple causal analysis of the control loops in the pressure-vessel system, followed by a syntactic transformation into a tree, yields an FT that is obviously "more complete" than the one in the FTH, than Kammen and Hassenzahl's, or than Leveson's. I've been sitting around for a decade and a half waiting for a paper from some PhD student taking the same example and doing even better than Vesely, Kammen, Hassenzahl, Leveson and Ladkin. I haven't seen one, so either people are bored with playing the game or I've not been keeping up with the literature. Or maybe our fault tree was perfect :-)
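For flavour, here is a toy sketch of the kind of syntactic transformation meant here, assuming the causal analysis has already produced, for each analysed event, a gate type and its contributing events. The event names and gate assignments are invented for illustration and are not the FTH's actual example:

```python
# Output of a (hypothetical) causal analysis of a control loop:
# effect -> (gate type, contributing events); events with no entry are basic.
loop = {
    "tank_rupture": ("AND", ["overpressure", "relief_valve_fails"]),
    "overpressure": ("OR", ["sensor_reads_low", "controller_stuck_on"]),
}

def to_tree(event):
    """Backward traversal: each analysed event becomes a gate over its causes."""
    if event not in loop:
        return event                      # no recorded causes: basic event
    gate, subs = loop[event]
    return (event, gate, [to_tree(s) for s in subs])

tree = to_tree("tank_rupture")
```

The transformation itself is purely mechanical; all the safety content lives in the causal analysis that feeds it, which is rather the point.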

For a more recent example, in 2010 Daniel Jackson picked an example out of Nancy Leveson's recent book and showed how an Alloy analysis picked up some things which STPA hadn't. (See his November 2010 entry "How to Prevent Disasters".) Then Jan Sanders in my group started an OHA and picked up some additional things which Jackson's analysis seemed also to have missed. There followed a discussion of completeness and how to check for it. (I have some blog posts on it, and the mailing-list discussion is in the York archives.)

The main point is that the traditional methods don't work well, better is available, indeed much better is available, and people should be using it. As the citation says, "STPA is one possibility." Even adding some half-decent self-checks to FMEA or FTA would be better than what's currently done.

The question is how we get there, socially. Prominently picking prominent examples and redoing them prominently is helpful, but it is susceptible to the "everybody and his or her mother can do that" response above, usually followed with "and I can't speak for our competitors, but all *our* engineers can do a decent FMEA and we don't get it wrong".

Others have mooted that things will change when the compensation lawsuits start mounting. Having been involved in some of those processes, I am not so sure. As others here with similar experience can testify, mostly only a tiny fraction of any such negotiations concern the technical engineering details, and very few of them get to open court like Bookout/Toyota. Indeed, that case was notable not only for its visibility but also for the fact that it was decided directly on engineering details. A participant in such discussions who has had to disclose a hazard analysis (often an FMEA) and had it successfully trashed by the opposition can choose to fault the engineers who produced it rather than the inadequate method the company required them to use.

One could maybe hope for progress through engineering training - say, in engineering departments at universities. York and MIT do their best, but system safety methods are not taught in any detail in most places. The head of the division responsible for risk analysis at a prominent German transportation safety-assessment company once told me he'd been looking for a new engineering graduate to perform FTAs and ETAs, and could find only one German university at which FTA was taught to a usable level (that was Stuttgart). He's right; I checked. There are lots of people who *mention* FTAs in coursework and have a couple in their lecture slides, including yours truly, but there's nobody except perhaps Stuttgart who can say "oh, XYZ passed our course, so he or she can certainly do a decent FTA for you". These are still, mainly, methods learnt on the job, and if you learn a method on the job because it's company standard, you mostly don't learn where it doesn't work.

The frustration in some quarters is palpable. In a discussion on the FMEA standard a few years ago, Jens Braband told me of the number of people he encounters who think FMEA is, in that wonderful German phrase, the eierlegende Wollmilchsau - the egg-laying wool/milk-sow, the Swiss army knife of farm animals. I've met some of these people. It is true that a well-performed FMEA can solve complex logistics problems that people couldn't solve any other way: you have a very complex piece of kit out in the field, with thousands to millions of components, each with its own MTBF, and you have to figure out a maintenance schedule and parts-supply schedule which keeps it running while minimising maintenance downtime. I've had someone explain to me in detail, over many hours, how FMEA can do that wonderfully. Then I say: what it can't do is reliably identify all the hazards when you're performing a functional system safety analysis. The reply: "OK, but <there follows a further encomium to the wonders of FMEA>".
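For what it's worth, that logistics use is easy to illustrate. A minimal sketch under a constant-failure-rate (exponential) model, with invented component names and MTBF figures; real maintenance planning is of course far more involved than this:

```python
import math

# Illustrative components with MTBF in hours (figures are made up).
mtbf = {"pump_seal": 8000.0, "bearing": 20000.0, "control_board": 50000.0}

def survival(t, m):
    """P(no failure by time t) under a constant failure rate 1/m."""
    return math.exp(-t / m)

def inspection_interval(m, min_survival=0.95):
    """Longest interval keeping the survival probability above the target."""
    return -m * math.log(min_survival)

def expected_spares(t, m, fleet=10):
    """Expected failures across the fleet over t hours (a Poisson mean)."""
    return fleet * t / m

intervals = {c: inspection_interval(m) for c, m in mtbf.items()}
```

This is the sort of calculation the FMEA-as-logistics-tool crowd is (rightly) proud of; none of it says anything about which hazards exist in the first place.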

The cited conclusion is that modern kit is so interactively complex that traditional risk analysis methods don't work and we need something better. That is just so right. It is a sad comment on the current state of engineering affairs that M. Fabre should need to observe "I suspect that this conclusion will generate some controversy."

PBL Prof. Peter Bernard Ladkin, Faculty of Technology, University of Bielefeld, 33594 Bielefeld, Germany Tel+msg +49 (0)521 880 7319

The System Safety Mailing List
systemsafety_at_xxxxxx Received on Tue Jul 29 2014 - 07:42:27 CEST
