Home > Uncategorized > Estimating the reliability of compiler subcomponent

Estimating the reliability of compiler subcomponent

Compiler stress testing can be used for more than finding bugs in compilers, it can also be used to obtain information about the reliability of individual components of a compiler. A recent blog post by John Regehr, lead investigator for the Csmith project, covered a proposal to improve an often overlooked aspect of automated compiler stress testing (removing non-essential code from a failing test case so it is small enough to be acceptable in a bug report; attaching 500 lines of source to a report in a sure fire way for it to be ignored) triggered this post. I hope that John’s proposal is funded and it would be great if the researchers involved also received funding to investigate component reliability using the data they obtain.

One process for estimating the reliability of the components of a compiler, or any other program, is:

  • divide the compiler into a set of subcomponents. These components might be a collection of source files obtained through cluster analysis of the source, obtained from a functional analysis of the implementation documents or some other means,
  • count the number of times each component executes correctly and incorrectly (this requires associating bugs with components by tracing bug fixes to the changes they induce in source files; obtaining this information will consume the largest amount of the human powered work) while processing lots of source. The ratio of these two numbers, for a given component, is an estimate of the reliability of that component.

How important is one component to the overall reliability of the whole compiler? This question can be answered if the set of components is treated as a Markov chain and the component transition probabilities are obtained using runtime profiling (see Large Empirical Case Study of Architecture–based Software Reliability by Goševa-Popstojanova, Hamill and Perugupalli for a detailed discussion).

Reliability is a important factor in developers’ willingness to enable some optimizations. Information from a component reliability analysis could be used to support an option that only enabled optimization components having a reliability greater than a developer supplied value.

The one big threat to validity of this approach is that stress tests are not representative of typical code. One possibility is to profile the compiler processing lots of source (say of the order of a common Linux distribution) and merge the transition probabilities, probably weighted, to those obtained from stress tests.

  1. August 3rd, 2011 at 14:12 | #1

    We’ve already done a little work in this direction. In our lists of bugs we’ve indicated both the very general part of the compiler that was wrong and also the specific files that were modified to fix the bugs:


    However, not being a software engineer, it’s not clear to me why these data are useful beyond perhaps using them to suggest modules that are ripe for refactoring or other maintenance.

    Due to the strange and not-well-understood shapes of the parts of the program space that are miscompiled and that are occupied by code written by humans, I would put very little stock in any estimate that a compiler will miscompile a given code base.

  2. August 3rd, 2011 at 15:15 | #2

    @John, the information in the list needs to be combined with execution frequencies to get a measure of reliability.

    Human written code tends to have a consistent set of characteristics (e.g., 98% of expressions contain less than two binary operators, excluding assignment) and I would argue that compilers evolve to perform well in this code space. Automatic generators output contains some operators more frequently than human code (e.g., %) have more accesses to deeply nested types (e.g., a.c[3].d) and have expressions containing more operators, etc.

    Previous fault history for a component/module/file is a good predictor of existing faults. The huge cost of collecting enough data to be able to predict with some accuracy the likelihood of encountering a fault when compiling a given code base will probably prevent it happening. However, I think that obtaining estimate for component reliability is within reach of a small research group.

  1. No trackbacks yet.

A question to answer *