The Shape of Code


Go faster R for Google’s summer of code 2012

March 28, 2012 Derek Jones

The R Foundation has been accepted for Google’s summer of code and I thought I would suggest a few ideas for projects. My interests are in optimization and source code analysis, so obviously the suggestions involve these topics.

There are an infinite number of possible optimizations that can be applied to code (well, at least more than the number of atoms in the known universe). The first job for any optimization project is to find the common characteristics of the code; once these are known, the available resources can be concentrated on improving the performance of these common cases (as they evolve, optimizers necessarily go on to attack less frequently occurring constructs and, in rare cases, address a previously unnoticed common pattern of behavior).

What are the common characteristics of R programs? I have no idea and have not seen any published empirical analysis on the subject. Analysing the characteristics of the R source code ecosystem would make a very good summer project. The analysis could be static, based purely on the source, or dynamic, looking at the runtime characteristics. The purpose of the analysis is to gain a general understanding of the characteristics of R code and to investigate whether specific kinds of optimizations might be worthwhile. Often optimizations are suggested by the results of the analysis, and in some cases optimization possibilities that were thought to be worthwhile turn out to have little benefit. I will stick my neck out and suggest a few optimizations that I think might be worthwhile.

Reducing object copying through last usage analysis. In R, function arguments are passed using call-by-value; that is, a copy of the argument is made and passed to the called function. For large arguments call-by-value is very time-consuming, and if the value of the argument is not used after the called function returns, the copy operation is redundant. I think it would be a worthwhile optimization for the R compiler to replace call-by-value with call-by-reference in those cases where the current argument is not read again and is modified during the call (the R implementation uses copy-on-write, so there is minimal overhead if the argument is only ever read); analysis is needed to verify this hunch.
Operations on short vectors. Many processors have instructions that simultaneously perform the same operation on a small number of values (e.g., the Intel/AMD SSE instructions). If it is possible to figure out that the two vectors involved in an add/subtract/multiply/etc. are short, the same length, and do not contain any NA, then a ‘short-operation’ instruction could be generated (when running on processors without the necessary support, the R interpreter would implement these the same way as the longer forms). Analysis is needed to find out how often short vector operations occur in practice.
Do R programs spend most of their time executing in C/Fortran routines or in R code? If the answer is C/Fortran and there is some set of functions that are called frequently then it may be worthwhile having versions of these that are tuned to the common case (whatever that might be). If the answer is R then what is the distribution pattern of R operations? There is a lot that can be done to speed up the R interpreter, but that project will need a lot more effort than is available in a summer of code and we need to get some idea of what the benefits for the general population might be.
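
The first suggestion can be illustrated with a toy model. The Python sketch below is not how R is implemented; `Buffer`, `write` and `call_modifying` are hypothetical names, and a simple reference count stands in for whatever bookkeeping the interpreter actually uses. It shows only that a copy-on-write store has to duplicate the data when the caller still holds the value, and can skip the copy when the call is known to be the last use.

```python
class Buffer:
    """Toy vector storage with a count of the names that refer to it."""

    copies_made = 0  # total number of duplications, across all buffers

    def __init__(self, data):
        self.data = list(data)
        self.refs = 1


def write(buf, index, value):
    """Copy-on-write store: duplicate first if the storage is shared."""
    if buf.refs > 1:
        buf.refs -= 1              # the writer gives up its share of the original
        buf = Buffer(buf.data)     # ...and works on a private copy
        Buffer.copies_made += 1
    buf.data[index] = value
    return buf


def call_modifying(arg, caller_reads_again):
    """Model a call whose callee modifies element 0 of its argument.

    caller_reads_again is the (assumed) result of last-usage analysis:
    if False, the caller's reference is handed over instead of shared.
    """
    if caller_reads_again:
        arg.refs += 1              # caller and callee both refer to the data
    return write(arg, 0, -1)


x = Buffer(range(5))
y = call_modifying(x, caller_reads_again=True)    # x is still needed: one copy
z = Buffer(range(5))
w = call_modifying(z, caller_reads_again=False)   # last use of z: no copy
print(Buffer.copies_made, x.data[0], y.data[0], w is z)
```

In this model the first call costs one duplication and the second none. Whether real R programs contain enough "modified and never read again" arguments to matter is exactly the question the proposed analysis would have to answer.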

To increase coverage of R usage, the measurement tools should be made available for people to download and run on their own R code, with the output hopefully forwarded back to some central collection point. For maximum portability this means writing the static analysis tools in R. By their very nature the dynamic analysis measurements have to be made via changes to the R system itself. Getting users to download and use prebuilt binaries (or to build from source) has always been fraught with problems; it is always hard to get users to buy into helping out with dynamic measurements.

Sophisticated static analysis consumes lots of compute resources. However, R programs tend to be short, so the required resources are unlikely to be that great in R’s case; even writing the analysis in R should not cause the resource requirements to be that excessive.
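
As an illustration of the kind of static measurement meant here, the sketch below counts how often each function name is called in a piece of R source. It is written in Python only to keep the example short (the argument above is for writing the real tools in R, for portability). `count_calls` and the sample program are hypothetical, and the regular-expression approach is a deliberate simplification that ignores strings, operators and everything else a real parser would handle.

```python
import re
from collections import Counter

# R keywords that can be followed by "(" but are not function calls.
NOT_CALLS = {"if", "for", "while", "function"}

# An identifier (letters, digits, "." and "_") immediately followed by "(".
CALL = re.compile(r"(?<![A-Za-z0-9._])([A-Za-z.][A-Za-z0-9._]*)\s*\(")


def count_calls(source):
    """Return a Counter of called names in R source text (rough heuristic)."""
    # Drop comments; this wrongly trims a "#" that appears inside a string.
    code = re.sub(r"#.*", "", source)
    names = CALL.findall(code)
    return Counter(n for n in names if n not in NOT_CALLS)


sample = """
# trimmed mean of a vector
f <- function(x) {
  y <- mean(x)
  if (length(x) > 1) y <- mean(x[-1])
  sum(y, na.rm = TRUE)
}
"""
print(count_calls(sample).most_common())
```

Run over a large body of R packages, even a crude count like this would start to show which functions dominate, which is the sort of evidence needed before choosing what to optimize.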

The only other language likely to share many of R’s language usage characteristics that I can think of is APL. There have been a few published papers on APL usage, but these were not that wide-ranging and are probably not of much use. Perhaps somebody who worked for a now defunct APL compiler company has a copy of in-house performance analysis reports they can make available.

Categories: Uncategorized Tags: call by reference, call by value, compiler, dynamic analysis, Google, interpreter, object copying, optimizing, R, static analysis, vector length

Comments (5)

Tal Galili

March 29, 2012 07:01 | #1


Hi Derek,
An interesting direction.

I hope you will e-mail the google group soon with your suggestion (coupled with adding it to the wiki), so people will be able to respond to you there.

Cheers,
Tal
Matthew Dowle

March 29, 2012 13:14 | #2


The first bullet point seems very wrong. One of the greatest features of R is the *illusion* of pass-by-copy, but the copy (if any) is only made at the point of copy-on-write. Just reading function arguments inside a function doesn’t make a copy of them, and certainly unused arguments are not copied! How did you arrive at this misunderstanding?
Derek-Jones

March 29, 2012 13:49 | #3


@Matthew Dowle
Thanks, perhaps I had oversimplified the discussion. Added some wording to make it clear that this optimization would only apply if the argument was modified during the function call.
Kevin Wright

March 29, 2012 14:10 | #4


Have you seen this:
http://radfordneal.wordpress.com/category/statistics/statistics-computing/r-programming/
Derek-Jones

March 29, 2012 14:34 | #5


@Kevin Wright
Yes, I followed the whole bracket discussion. The R team seem to have got the major design stuff right but not followed through and sorted out all the minor details. If there is a large interpreter overhead for many users (I’m not convinced of this) then a major redesign to handle modern processor characteristics (e.g., caching and long pipelines) is needed.
