Re: [SystemSafety] Critical Design Checklist

From: Les Chambers < >
Date: Fri, 30 Aug 2013 09:55:01 +1000


Hi Kevin

How's it going? I hope you're enjoying the responses from the list. Here are some more contributions:

Configuration audit - boring but necessary

I experienced this on a distributed control systems project with 200 computing nodes.

The hazard was: loss of control due to installation of the wrong version of any or all of three node components: the control logic file, the configuration file or the operating system itself. The hazardous event actually occurred at 2 AM one morning when a gang of commissioning engineers attempted to install the wrong version of all three. Luckily there was no safety incident, but it was highly embarrassing and very expensive in everyone's time. This event turned some laissez-faire systems architects and developers into hard-core, card-carrying configuration audit zealots. I was truly impressed with the speed with which they implemented an automated system to run a real-time audit of the operational baseline. A configuration baseline definition was held in a redundant set of supervisory computers and a check was run every few minutes. We ran a 32-bit CRC over all files and used that as a unique identifier of each system component. In a design review I would ask the question: what measures are you taking to secure the integrity of the operational baseline? You could get some interesting answers.
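For anyone wanting to picture the audit, here is a minimal sketch, not the project's actual implementation. It assumes the baseline is simply a mapping from file name to expected CRC-32; the function names, file names and directory layout are all hypothetical.

```python
import zlib
from pathlib import Path


def crc32_of_file(path, chunk_size=65536):
    """Return the CRC-32 of a file as an unsigned 32-bit integer."""
    crc = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            # Feed the running CRC back in so large files are read in pieces.
            crc = zlib.crc32(chunk, crc)
    return crc & 0xFFFFFFFF


def audit_node(node_dir, baseline):
    """Compare files under node_dir with a baseline {file name: expected CRC}.

    Returns a list of (file name, problem) tuples; an empty list means the
    node matches the baseline. The baseline format is an assumption.
    """
    findings = []
    for name, expected in baseline.items():
        path = Path(node_dir) / name
        if not path.is_file():
            findings.append((name, "missing"))
        elif crc32_of_file(path) != expected:
            findings.append((name, "crc mismatch"))
    return findings
```

A supervisory process could call audit_node for each node every few minutes and raise an alarm on any non-empty result. Note that a 32-bit CRC catches accidental mix-ups well but is not guaranteed unique; a cryptographic hash such as SHA-256 (hashlib in Python) could be swapped in where deliberate tampering is a concern.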

Glue tracking

And then there was the glue incident. I should expand on that as it has configuration management ramifications. The heatsinks in all the control nodes were glued to the main processor chip. A factory in Shanghai used a bad batch of glue. Once the computers heated up in operation the glue melted and the heatsinks fell off, rattling around inside the card cage. The stuff of nightmares for a maintenance engineer. Once again this did not trigger a hazardous event because the control node configurations were highly redundant with much health checking going on. But it did cause some consternation for some time. Somewhat like the sword of Damocles hanging over the heads of the commissioning guys. The core problem was: no one thought to include the glue batch number in the configuration definition of a control computer. It turns out that software engineers are not strong in glue tracking. Who knows, there may not even have been such a thing as a glue batch number. We therefore did not know which of the 200 computing nodes had been assembled with the defective glue. Anyway, at least that bunch of guys are now card-carrying glue trackers. In a design review I would ask the general question: what measures have been taken to avoid common cause failures, and to what degree are components traceable to their manufacture?
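As a rough illustration of what that traceability could look like, here is a minimal sketch that assumes each node's configuration definition also carries manufacturing lot identifiers. The record layout, the part name "heatsink_adhesive" and the lot labels are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class NodeBuildRecord:
    """Hypothetical build record for one control node."""
    node_id: str
    # Maps a part name to the manufacturing lot it came from (assumed format).
    part_lots: dict = field(default_factory=dict)


def nodes_with_lot(records, part, lot):
    """Return the IDs of nodes built with the given lot of the given part."""
    return sorted(r.node_id for r in records if r.part_lots.get(part) == lot)


def untraceable_nodes(records, part):
    """Return the IDs of nodes with no recorded lot for the given part."""
    return sorted(r.node_id for r in records if part not in r.part_lots)
```

With records like these, a bad batch turns into a query rather than a hunt: nodes_with_lot(records, "heatsink_adhesive", "LOT-B") lists the suspect nodes, and untraceable_nodes shows where the record itself has a hole.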

Self-modifying code

I mention this because I see the strategy that calls for system components to modify each other at run time coming back into vogue. It used to be a cool thing to do in assembly language programming when you had limited working memory. It naturally died out when working memory became like air, infinite. Now it's back. It is standard practice in web apps for code to modify web page HTML at run time. In fact I do it myself, I'm ashamed to admit. The thing is you are forced down this path if you want to create dynamic web pages (my excuse is: PHP made me do it!!). My concern is that if someone should use these architectural design components for something serious, such as provisioning or tasking an armed drone, there could be serious consequences. Self-modifying code is difficult to review. You sort of have to imagine what it's going to look like at run time. In a design review I would ask: justify any strategy that involves design components modifying themselves or others at run time.
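To make the review problem concrete, here is a small contrived sketch in Python (not PHP, and not taken from any real system). The class and function names are hypothetical; the point is only that the logic a reviewer reads is not the logic that runs once another component has patched it.

```python
class ValveController:
    """Component as written and reviewed."""

    def open_allowed(self, operator_confirmed):
        # Reviewed rule: the valve may open only after operator confirmation.
        return bool(operator_confirmed)


def install_patch(cls):
    """Hypothetical second component that rewrites the first at run time."""
    # Replaces the reviewed method on the class; existing instances see it too.
    cls.open_allowed = lambda self, operator_confirmed: True


controller = ValveController()
before = controller.open_allowed(False)  # reviewed behaviour
install_patch(ValveController)
after = controller.open_allowed(False)   # behaviour after run-time modification
```

Here before is False and after is True, yet nothing in the text of ValveController has changed. A reviewer looking only at that class would sign off on a rule that is no longer in force.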

Sidebar:

There seems to be some confusion on this list as to how to answer your simple request for design review questions in the context of safety critical systems. I would encourage everyone to get your collective heads out of the meta world and just let it all hang out. Random accounts of things that went wrong because of a design flaw can be extremely useful to other designers. This is an excellent opportunity to build a library of same. Some meta person can sort it out later. How hard can that be?

Good luck with your checklist.

Cheers

Les            

From: systemsafety-bounces_at_xxxxxx [mailto:systemsafety-bounces_at_xxxxxx] On Behalf Of Driscoll, Kevin R
Sent: Tuesday, August 27, 2013 6:38 AM
To: systemsafety_at_xxxxxx
Subject: [SystemSafety] Critical Design Checklist

For NASA, we are creating a Critical Design Checklist:

* Objective
  - Too easy to just check "yes" without doing sufficient work
  - Instead, "What have you done ..."
  - Prove what you have done is sufficient
* We are looking for inputs to include in this checklist
* Do you have any inputs that should be included?
  - Where are the bodies buried?

We are finishing the Checklist by next week and would like to include any good questions you may have that we have overlooked. Realizing this is an imposition on your time, I am hoping some of you would be so kind as to spend just a few minutes to send questions or even question fragments.  

--

P.S.

I am also looking for unusual failure scenarios to add to my collection,
like those I've described in my series of "Murphy was an Optimist"
presentations (e.g.
http://www.rvs.uni-bielefeld.de/publications/DriscollMurphyv19.pdf).

 





_______________________________________________ The System Safety Mailing List systemsafety_at_xxxxxx
Received on Tue Sep 03 2013 - 09:34:30 CEST
