Archive

Posts Tagged ‘dynamic analysis’

Programs spent a lot of time repeating themselves

January 18, 2013 No comments

Inexperienced software developers are always surprised that programs used by lots of people can contain many apparently non-trivial faults and yet continue to operate satisfactorily; experienced developers become familiar with this state of affairs and tend to shrug their shoulders. I have previously written about how software is remarkably fault tolerant. I think this fault tolerance is telling us something important about the characteristics of software and while I have some ideas about what it might be I don’t yet have a good handle (or data) on what is going on to lay out my argument.

In this article I’m going to talk about another characteristic of program execution which I think is connected to program fault tolerance and is also very surprising.

Software differs from hardware in that for a given set of inputs a program will always produce the same output, it will not wear out like hardware and eventually do something different (to simplify things I’m ignoring the possible consequences of uninitialized variables and treating any timing dependencies as part of the input set). So for a fault to be observed different input is required (assuming one exists and none appeared for the first input set).

I used to assume that during a program’s execution the basic cpu operations (e.g., binary arithmetic and bitwise operations) processed a huge number of different combinations of input values (e.g., there are 2^16 * 2^16 / 2 combinations of input value for a 16-bit add operation) and was very surprised to find out this is not the case. For many programs around 80% of all executed instructions are repeat instructions, that is a given instruction, such as add, operates on the same combinations of input values that it has previously operated on (while executing the program) to generate an output value that is identical to the one previously generated from these input values. If we count the number of static instructions in the program (i.e., the number of assembly instructions in a listing of the disassembled executable program) then 20% of them account for 90% of the repeated instructions; so a small amount of code (i.e., 20%) is not only responsible for most dynamically executed instructions but around 72% (i.e., 80%*90%) of these instructions repeat previous computations. If a large percentage of a what goes on internally within a program is repetition is it any surprise that once it works for a reasonable set of inputs it will probably work on other inputs?

Hang on you say, perhaps the percentage of repeat instructions is very high for a given set of external input values (e.g., a file to compress, compile or display as a jpeg) but there is a lot of variation in the set of repeat instructions between different external inputs. Measurements suggest this is not the case, with around 20% of dynamic instructions having input values that can be traced to external program input (12-30% come from globally initialized variables and the rest are generated internally).

There is a technical detail that reduces the repeat instruction percentages given about by a factor of two; researchers always like to give the most favorable numbers and for this discussion we need to make a distinction between local repetition which counts one instruction and its inputs/outputs at a particular point in the code and global repetition which counts all instructions of a given kind irrespective of where they occur in the code. A discussion of fault behavior needs to look at local repetition, not global repetition; there is a factor of two difference in the dynamic percentage and some reduction in the percentage of static instructions involved.

Sometimes the term redundant computation is used, as if the cpu should remember what happened last time it executed an instruction with a particular set of inputs and reuse the answer it got last time. Researchers have proposed caching the results of executing an instruction with a given set of input values and speeding things up or saving power by reusing previous results rather than recalculating them (a possible speedup of 13% on SPEC95 is claimed for a reuse buffer containing 4096 entries).

So a small percentage of the instructions in a program account for most of the execution time (a generally known characteristic) and around 30% of the executed instructions operate on input values they have processed before to produce output they have produced before (to the extent that a cache containing a few thousand entries is big enough to hold the a large percentage of the duplicates). If encountering a new fault requires different execution behavior to occur then having a large percentage of a program always doing the same thing (i.e., same input values same output value) will have a significant impact on the likelihood of encountering a fault. Part of the reason programs are fault tolerant is because external input values don’t have a big an impact on program behavior as we might have thought.

Researchers have also investigated repeats involving units larger than one instruction, such as sub-blocks (a sequence of instructions smaller than a basic block) and even complete functions or just the mathematical ones.

The raw data is obtained using cpu simulators to monitor programs as they are executed, logging the values read as input by an instruction and the value generated as output (in most cases the values are read from registers and written to a register). A single study might log billions of instructions from the SPEC benchmark.

An academic programming language paper about R

April 27, 2012 1 comment

The R language has passed another milestone, a paper aimed at the academic programming language community (or at least one section of this community) has been written about it, Evaluating the Design of the R Language by Morandat, Hill, Osvald and Vitek. Hardly earth shattering news, but it may have some impact on how R is viewed by nonusers of the language (the many R users in finance probably don’t care that R seems to have been labeled as the language for doing statistics). The paper is well written and contains some very interesting information as well as a few mistakes, although it will probably read like gobbledygook to anybody not familiar with academic programming language research. What follows has something of the form of an R users guide to reading this paper, plus some commentary.

The paper has roughly three parts, the first gives an overview of R, the second is a formal definition of a subset and the third an initial report of an analysis of R usage. For me and I imagine you dear reader the really interesting stuff is in the third section.

When giving a language overview to people who know other computer languages it makes sense to leverage that knowledge, this is why the discussion has a world view from the perspective of languages rarely associated with R: Scheme, Haskell and CLOS. I found some of the discussion of R constructs to be much more informative and less confusing than that in nearly all R books/tutorials I have read, but then they are written from a detailed operational programming language perspective. One criticism of this overview is that it does not give any hint as to why R has such a large following (saying that users found it more useful than these languages would send the wrong kind of signal ;-).

What is a formal description of a subset of R (i.e., done purely using mathematics) doing in the second part? Well, until recently very little academic software engineering was empirically based and was populated by people I would classify as failed mathematicians without the common sense needed to be engineers. Things are starting to change but research that measures things, particularly people, is still regarded as not being respectable in some quarters. In this case the formal definition is playing the role of a virility symbol showing that the authors are obviously regular guys who happen to be indulging in a bit of empirical research.

A surprising number of papers measuring the usage of real software contain formal definitions of a subset of the language being measured. Subsets are used because handling the complete language is a big project that usually involves one or more people getting a PhD out of the work. The subset chosen have to look plausible to readers who understand the mathematics but not the programming language, broadly handle all the major constructs but not get involved with all the fiddly details that need years of work and many pages to describe.

The third part contains the real research, which is really about one implementation of R and the characteristics of R source in the CRAN and Bioconductor repositories, and contains lots of interesting information. Note: the authors are incorrect to aim nearly all of the criticisms in this subsection at R, these really apply to the current implementation of R and might not apply to a different implementation.

In a previous post I suggested some possibilities for speeding up the execution of R programs that depended on R usage characteristics. The Morandat paper goes a long way towards providing numbers for some of these usage characteristics (e.g., 37% of function parameters are assigned to and 36% of vectors contain a single value).

What do we learn from this first batch of measurements? R users rarely use many of the more complicated features (e.g., object oriented constructs {and this paper has been accepted at the European Conference on Object-Oriented Programming}), a result usually seen for other languages. I was a bit surprised that R programs were only 40% smaller than equivalent C programs. I think part of the reason is that some of the problems used for benchmarking are not the kind that would usually be solved using R and I did not see any ‘typical’ R programs being coded up in C for comparison, another possibility is that the authors were not thinking in R when writing the code.

One big measurement topic the authors missed is comparing their general findings with usage measurements of other languages. I think they will find lots of similar patterns of usage.

The complaint that R has been defined by the successive releases of its only implementation, rather than a written specification, applies to all widely used languages, at least in their early days. Back in the day a major reason for creating language standards for Pascal and then C was so that other implementations could be created; the handful of major languages whose specification was written before the first implementation (e.g., PL/1, Ada) have/are dieing out. Are multiple implementations needed in an Open Source world? The answer seems to be no for Perl and yes for PHP, Ruby etc. The effort needed to create a written specification for the R language might be better invested improving the efficiency of the current implementation so that a better alternative is not needed.

Needless to say the authors suggested committing the fatal programming language research mistake.

The authors have created an interesting set of tools for static and dynamic analysis of R and I look forward to reading more about their findings in future papers.

Go faster R for Google’s summer of code 2012

March 28, 2012 5 comments

The R Foundation has been accepted for Google’s summer of code and I thought I would suggest a few ideas for projects. My interests are in optimization and source code analysis, so obviously the suggestions involve these topics.

There are an infinite number of possible optimizations that can be applied to code (well, at least more than the number of atoms in the known universe). The first job for any optimization project is to find the common characteristics of the code; once these are known the available resources can be concentrated on improving the performance of these common cases (as they evolve optimizers necessarily attack less frequently occurring constructs and in rare cases address a previously unnoticed common pattern of behavior).

What are the common characteristics of R programs? I have no idea and have not seen any published empirical analysis on the subject. Analysing the characteristics of the R source code ecosystem would make a very good summer project. The analysis could be static, based purely on the source, or dynamic, looking at the runtime characteristics. The purpose of analyse is to gain a general understanding of the characteristics of R code and to investigate whether specific kinds of optimizations might be worthwhile. Often optimizations are suggested by the results of the analysis and in some cases optimization possibilities that were thought to be worthwhile turn out to have little benefit. I will stick my neck out and suggest a few optimizations that I think might be worthwhile.

  • Reducing object copying through last usage analysis. In R function arguments are passed using call-by-value, that is a copy of the argument is made and passed to the called function. For large arguments call-by-value is very time consuming and if the value of the argument is not used after the called function returns the copy operation is redundant. I think it would be a worthwhile optimization for the R compiler to replace call-by-value with call-by-reference in those cases where the current argument is not read again and is modified during the call (the R implementation uses copy-on-write so there is overhead minimal overhead if the argument is only ever read); analysis is needed to verify this hunch.
  • Operations on short vectors. Many processors have instructions that simultaneously perform the same operation on a small number of values (e.g., the Intel/AMD SSE instructions). If it is possible to figure out that the two vectors involved in an add/subtract/multiple/etc are short, the same length, do not contain any NA, then a ‘short-operation’ instruction could be generated (when running on processors without the necessary support the R interpreter would implement these the same way as the longer forms). Analysis is needed to find out how often short vector operations occur in practice.
  • Do R programs spend most of their time executing in C/Fortran routines or in R code? If the answer is C/Fortran and there is some set of functions that are called frequently then it may be worthwhile having versions of these that are tuned to the common case (whatever that might be). If the answer is R then what is the distribution pattern of R operations? There is a lot that can be done to speed up the R interpreter, but that project will need a lot more effort than is available in a summer of code and we need to get some idea of what the benefits for the general population might be.

To increase coverage of R usage the measurement tools should be made available for people to download and run on their own R code, and hopefully forwarding the output back some central collection point. For maximum portability this means writing the static analysis tools in R. By their very nature the dynamic analysis measurements have to be made via changes to the R system itself, getting users to download and use prebuilt binaries (or building from source) has always been fraught with problems; it is always hard o get users to buy into helping out with dynamic measurements.

Sophisticated static analysis consumes lots of compute resources. However, R programs tend to be short, so the required resources are unlikely to be that great in R’s case; even writing the analysis in R should not cause the resource requirements to be that excessive.

The only other language likely to share many of R’s language usage characteristics that I can think is APL. There have been a few published papers on APL usage, but these were not that wide ranging and probably not of much use. Perhaps somebody who worked for a now defunct APL compiler company has a copy of in-house performance analysis reports they can make available.