Archive for November, 2009

Does the Climategate code produce reliable output?

November 30th, 2009 No comments

The source of several rather important commercial programs have been made public recently, or to be more exact programs whose output is important (i.e., the Sequoia voting system and code and data from the Climate Research Unit at University of East Anglia the so called ‘Climategate’ leak). While many technical commentators have expressed amazement at how amateurish the programming appears to be, apparently written with little knowledge of good software engineering practices or knowledge of the programming language being used, those who work on commercial projects know that low levels of software engineering/programming competence is the norm.

The emails included in the Climategate leak provide another vivid example, if one were needed, of why scientific data should be made publicly available; scientists are human and are sometimes willing to hide data that does not fit their pet theory or even fails to validate their theory at all.

The Climategate source has only only recently become available and existing technical commentary has been derived from embarassing comments and the usual complaint about not using the right programming language (Fortran is actually a good choice of language for this problem, it is widely used by climatology researchers and a non-professional programmer is probably makes best of their time by using the one language they know tolerably well rather than attempting to use a new language that nobody else in the research group knows).

An important quality indicator of the leaked software was what was not there, test cases (at least I could not find any). How do we know that a program’s output is correct? One way to gain some confidence in a program’s correctness is to process data for which the correct output is known. This blindness to the importance of program level correctness testing is something that I often encounter in people who are subject area experts rather than professional programmers; they believe that if the output has the form they are expecting it must be correct and will sometimes add ‘faults’ to ‘fix’ output that deviates from what they are expecting.

A quick visual scan through the source showed a tale of two worlds, one of single letter identifier names and liberal use of goto, and the other of what looks like meaningful names, structured code and a non-trivial number of comments. The individuals who have contributed to the code base obviously have very different levels of coding ability. Not having written any Fortran in anger for over 15 years my ability to estimate the impact of more subtle coding practices has atrophied.

What kind of faults might a code review look for in these programs? Common coding errors such as using uninitialized variables and incorrect argument passing are obvious choices and their are tools available to check for these kinds of error. A much more insidious kind of error, which requires people with the mathematical expertise to spot, is created by the approximate nature of floating-point arithmetic.

The source is not huge, but not small either, consisting of around 64,000 lines of Fortran and 16,000 lines of IDL (a language designed for interactive data analysis which to my untrained eye looks a lot like MATLAB). There was no obvious support for building the source included within the leaked files (e.g., no makefiles) and my attempt to manually compile using the GNU Fortran compiler failed miserably. So I cannot say anything reliable about the compiler output warnings.

To me the complete lack of test cases implies that the Climategate code does not produce reliable output. Comments in the code such as ***** APPLIES A VERY ARTIFICIAL CORRECTION FOR DECLINE********* suggests that the authors were willing to patch the code to produce output that matched their expectations; this is the mentality of somebody for whom code correctness is not an important issue and if they don’t believe their code is correct then I don’t either.

Source code in itself is rarely that important, although it might have been expensive to create. The real important information in the leaked files is the climate data. Now that this is available others can apply their analysis skills to provide an interpretation to what, if anything statistically reliable, it is telling us.

Software maintenance via genetic programming

November 27th, 2009 No comments

Genetic algorithms have been used to find solution to a wide variety of problems, including compiler optimizations. It was only a matter of time before somebody applied these techniques to fixing faults in source code.

When I first skimmed the paper “A Genetic Programming Approach to Automated Software Repair” I was surprised at how successful the genetic algorithm was, using as it did such a relatively small amount of cpu resources. A more careful reading of the paper located one very useful technique for reducing the size of the search space; the automated software repair system started by profiling the code to find out which parts of it were executed by the test cases and only considered statements that were executed by these tests for mutation operations (they give a much higher weighting to statements only executed by the failing test case than to statements executed by the other tests; I am a bit surprised that this weighting difference is worthwhile). I hate to think of the amount of time I have wasted trying to fix a bug by looking at code that was not executed by the test case I was running.

I learned more about this very interesting system from one of the authors when he gave the keynote at a workshop organized by people associated with a source code analysis group I was a member of.

The search space was further constrained by only performing mutations at the statement level (i.e., expressions and declarations were not touched) and restricting the set of candidate statements for insertion into the code to those statements already contained within the code, such as if (x != NULL) (i.e., new statements were not randomly created and existing statements were not modified in any way). As measurements of existing code show most uses of a construct are covered by a few simple cases and most statements are constructed from a small number of commonly used constructs. It is no surprise that restricting the candidate insertion set to existing code works so well. Of course no fault fix that depends on using a statement not contained within the source will ever be found.

There is ongoing work looking at genetic modifications at the expression level. This
work shares a problem with GA driven test coverage algorithms; how to find ‘magic numbers’ (in the case of test coverage the magic numbers are those that will cause a controlling expression to be true or false). Literals in source code, like those on the web, tend to follow a power’ish law but the fit to Benford’s law is not good.

Once mutated source that correctly processes the previously failing test case, plus continuing to pass the other test cases, has been generated the code is passed to the final phase of the automated software repair system. Many mutations have no effect on program behavior (the DNA term intron is sometimes applied to them) and the final phase removes any of the added statements that have no effect on test suite output (Westley Weimer said that a reduction from 50 statements to 10 statements is common).

Might the ideas behind this very interesting research system end up being used in ‘live’ software? I think so. There are systems that operate 24/7 where faults cost money. One can imagine a fault being encountered late at night, a genetic based system fixing the fault which then updates the live system, the human developers being informed and deciding what to do later. It does not take much imagination to see the cost advantages driving expensive human input out of the loop in some cases.

An on-going research topic is the extent to which a good quality test suite is needed to ensure that mutated fault fixes don’t introduce new faults. Human written software is known to often be remarkably tolerant to the presence of faults. Perhaps ensuring that software has this characteristic is something that should be investigated.

Where are the dead bodies?

November 18th, 2009 3 comments

The possibility of faults in software causing death or serious injury is often talked about and in some cases large amounts of money are invested in work to reduce the possibility of these events occurring (or at least doing things that will support the view that a company took reasonable precautions, should a case end up in court). The Therac-25 accidents are an often quoted example of a software fault that directly resulted in deaths. These accidents occurred over a 19 month period in the mid 1980s and are believed to have resulted in the death of six people. I don’t wish to disrespect the memory of the people who died, but six people 20 years ago; is that it? Less than the number of people killed every day (around 10) in traffic accidents in the UK.

If faults in software really do have a non-trivial impact on human safety then we would expect this fact to be reflected in accident statistics. After searching the accident statistics for the UK I cannot find any whose cause is directly attributed to software. If there are people who have died as a direct result of faults in software, the death rate has not yet reached the minimum level needed to be recorded as such (or are these deaths ‘hidden’ away in ones and twos within other causes?)

The US National Transportation Safety Board carries out a thorough investigation of all US aviation accidents. Searching the Aviation Accident Database on the query “software” between the dates 1 Jan 2000 and 9 Aug 2005 returns 44 matches. Reading these 44 reports I did not find any accident attributed to a software related issue.

If faults in software are not killing or seriously injuring many people why is so much effort invested in reducing the probability of these events occurring? The following are some of the possibilities:

  • The investment actually made is small, but it is talked up.
  • The investment is made for economic reasons (e.g., more reliable products are likely to reduce support costs) and increased ’safety’ is a side effect.
  • In situations where there is a likelihood of death or serious injury the procedures and reliability of non-software items is sufficient to short-circuit the effects of any life threatening faults that may exist in the software used (at least until the fault can be corrected).

As any developer knows, replicating faulty behavior in software can be very difficult, if not impossible. It may be that software faults are not given as the root cause of death or serious injury because the necessary proof is not available. Or perhaps software faults have yet to be the root cause of such events on any non-trivial scale.

Existing practice affects what people are willing to put up with. Many users of Microsoft Windows now accept that it is necessary to reboot the computer they are using on a daily, or even hourly, basis. Users of cars accept that the tool they are using can result in serious injuries or even death (usually rating nothing more than a story in the local town newspaper). Will there be a public hue and cry once software faults start to be recorded as a primary factor in accidental death or serious injury? As this paper shows, it can take a lot of dead bodies before existing practices are changed.

The lack of dead bodies attributed to a software root cause suggests that it is very still early days for the field of high integrity software development.

This material was originally written in 2005 and appeared in an earlier blog of mine which I did not keep up.

sizeof i++

November 11th, 2009 No comments

It is quite common for coding guideline documents to contain at least one guideline recommending against the use of a construct that developers very rarely use, for instance: “The operand of the sizeof operator shall not contain side-effects.”

... sizeof i++; // Is the author expecting i to be incremented?

Why do such recommendations get incorporated into guideline documents? The obvious answer is that their author(s) are unaware of actual developer usage and believe the recommendation has value.

I have heard people claim that such recommendations are harmless, after all the adherence cost is minimal. Besides, the fact that code very likely already adheres to it will increase the “pass” rate for the set of guidelines the first time developers check their code. However, such guidelines unjustifiably increase peoples confidence in software (as measured by the number of guidelines adhered to). They not only fail to add value to a set of coding guidelines, but their presence could result in the probity of the other guidelines being questioned.

I continue to be surprised by the amount of resistance encountered by my attempts to have the “sizeof” guideline removed from, or not included in, a set of coding guidelines. In the case of an existing set of guidelines there is obviously a resistance to change, but I have not yet managed to extract a single promise to consider removing the guideline in a future revision.

People seem unimpressed by the amount of source code I have searched in a vain attempt to find a violation of the “sizeof” guideline, but they often have some vague memory of having seen an instance of this elusive usage. My questions asking after the name of the source file, the name of the program, the name of the project, or simply the name of the company they were working for at the time are greeted with uncertainty. Perhaps the only instance they have seen is in the example underneath the text of the recommendation? Growls and pointed looks.

Another factor is existing practice, if it appears in other guideline documents it must have some benefit. People don’t want to go out on a limb. Besides, basing decisions on measurement of source code, who does that these days?

Whatever else might be said about the “sizeof” guideline, it does make a great example that developers can use to regale management.

Superior tone: Less experienced developers
Earnest voice: don’t always have a complete understanding of C/C++
Shock: They make the mistake of thinking
Talking very fast: that the code sizeof i++
Incredulity: will cause i to be incremented!
Emphasis: This guideline
Relieved voice: ensures that this mistake os warned about.

This article originally appeared in an earlier blog of mine which I did not keep up.

Compiler writing: The career path to World domination

November 7th, 2009 3 comments

Compiler writing is not usually thought of as a career path that leads to becoming Ruler of the World. Perhaps this is because compiler writing is a relatively new profession and us compiler writers are still toiling in obscurity awaiting the new dawn.

What might be a workable plan for a compiler writer to become Ruler of the World? One possibility is to write a compiler for the language in which most of the World’s critical software is written (i.e., C) and for that compiler to become the one that the vendors of this critical software all use (i.e., gcc). This compiler needs to do more that just compile the source code it is feed, it also needs to generate code that creates a backdoor in important programs (e.g., the login program).

But, you say, this cannot happen with gcc because its source is available for everybody to read (and spot any backdoor generator). In his 1984 Turing acceptance lecture Ken Thompson showed how a compiler could contain a backdoor that was not visible in its source. The idea is for the compiler writer to modify a compiler to detect when it is being used to compile itself and to insert the backdoor generating code into its own executable. This modified compiler is then used to compile itself and the resulting executable made the default compiler; the backdoor modifications are then removed from the compiler source, they are no longer needed because the previously compiled compiler will spot when it is being used to compile its own source and generate the code necessary to propagate the backdoor code into the executable it creates.

How would the world counter the appearance of such a modified gcc? Obviously critical programs would need to be recompiled by a version of gcc that did not contain the backdoor. Today there are several companies and many amateur groups that distribute their own Linux distributions which they build from source. It should be relatively easy to obtain a usable executable of gcc from 10 years ago; remember what is needed is a version capable of compiling the latest gcc sources.

The ideal time to create a backdoor’ed version of gcc is while its development was under the control of one person, so early in the development history that all versions available anywhere are very likely to be derived from it. How can we prove that the original author of gcc did not do just this?

It could be argued that the very substantial changes to the gcc sources (most of the source has probably been rewritten several times) mean that the coding patterns searched for by the executable to detect that it is compiling itself have long gone and at some point the backdoor failed to propagate itself to the next executable.

Compilers other than gcc might also include backdoors that propagate themselves. However, the method of propagation is likely to be different. Compiling the gcc sources with a non-gcc compiler creates an executable that should exhibit the same behavior as a gcc-compiled executable. Differences in the behavior of these independently built executables is a cause for concern (one difference might be caused by differences in the conversion of floating-point literals, a recent PhD thesis provides more detail).

The problem with compiling the gcc sources is that they make use of language extensions that few, if any, other compilers support. I know IBM added modified one of their C compilers to support those gcc extensions needed to compile the Linux kernel, but I don’t know if this compiler is capable of compiling the gcc sources. The LLVM project intended to support many gcc extensions but I don’t know if they aim to be able to compile the gcc sources.

Another option is to compare the assembler generated when gcc compiles itself against the corresponding source code. A very expensive task for source code measured in hundreds of thousands of lines. Adding the necessary language extension support to another compiler would probably be cheaper and also create a tool that could be used to check future releases of gcc.