
Does native R usage exist?

Note to R users: users of other languages enjoy spending lots of time discussing the minutiae of the language they use, something R users don’t appear to do (perhaps you spend your minutiae time on statistics, which I don’t yet know well enough to spot when it occurs). There follows a minutiae post that may appear to be navel-gazing to you (interesting problem at the end though).

In various posts written about learning R I have said “I am trying to write R like a native”, which raises the question: what does R written by a native look like? Assuming for a moment that ‘native R’ exists (I give some reasons why it might not below), how would we recognise it?

To help recognise native R it helps to start out by asking what it is not. Let’s start with an everyday analogy: if I listen to a French/German/American person speaking English I can usually tell what country they are from; they have patterns of usage that we in merry England very rarely, if ever, use. The same is true for programming languages. Back in the day, when I spent several hours a day programming in various languages, I could often tell when somebody showing me some Pascal code had previously spent many years writing Fortran, or that although they were now using Fortran they had previously used Algol 60 for many years.

If expert developers can read R source and with high accuracy predict the language that its author previously spent many years using, then the source is not native R.

Having ruled out any code that is obviously (to a suitably knowledgeable person) not native R, is everything that is left native R? No; native language users share common characteristics, and native speakers recognise these characteristics and feel at home. I’m not saying these characteristics are good, bad or indifferent, any more than my southern English accent is better/worse than northern English or American accents; it is just the way people around here speak.

Having specified what I think is native R (I would apply the same rules to any language) it is time to ask whether it actually exists.

I’m sure there are people out there whose first language was R and who, over, say, five years, have spent a lot more time using R than any other language. Such people are unlikely to have picked up any noticeable coding accents from other languages and so can be treated as native.

Now we come to the common characteristics requirement; this is where I think an existence problem arises.

How does one learn to use a language fluently? Taking the non-R languages I am familiar with, the essential ingredients seem to be:

  • spending lots of time using the language, say a couple of hours a day for a few years
  • talking to other, heavy, users of the language on a daily basis (often writing snippets of code when discussing problems they are working on),
  • reading books and articles covering language usage.

I am not saying that these activities create good programmers, just that they result in language usage characteristics that are common across a large percentage of the population. Talking and reading provide the opportunity to learn specific techniques, and writing lots of code provides the opportunity to make these techniques second nature.

Are these activities possible for R?

  • I would guess that most R programs are short, say under 150 lines. This is at least an order of magnitude (if not two or three orders of magnitude) shorter than programs written in Java/C++/C/Fortran/etc. I know there are R users out there who have been spending a couple of hours a day using R over several years, but are they thinking about R coding or about the statistics and what the data analysis really means? I suspect they are spending most of this R-usage thinking time on the statistics and data analysis,
  • I can easily imagine groups of people using R and individuals having the opportunity to interact with other R users (do they talk about R and write snippets of code to describe their problem? I don’t work in an R work environment, so I don’t know the answer),
  • Where are the R books and articles on language usage? They don’t exist, not in the sense of Meyers’s “Effective C++: 55 Specific Ways to Improve Your Programs and Designs” (there must be several dozen books of this kind for C++), Bloch’s “Java Puzzlers: Traps, Pitfalls, and Corner Cases” (probably only a handful for Java) and Koenig’s “C: Traps and Pitfalls” (again a couple of dozen for C). In places Crawley’s “The R Book” has the feel of this kind of book, but Matloff’s “The Art of R Programming” is really an introduction to R for people who already know another language (no discussion of the art of R as such). R users write about statistics and data analysis, with the language being a useful tool.

I suspect that many people are actually writing R for short amounts of time to solve data analysis problems they spend a lot of time thinking about; they don’t discuss R the language much (so little opportunity to learn about the techniques that other people use) and they don’t write much code (so little opportunity to try out many new techniques).

Yes, there may be a few people who do spend a couple of hours a day thinking about R the language and also get to write lots of code, but these people are more like high priests than your average user.

For the last two years I have been following a no-for-loops policy in an attempt to make myself write R how the natives write it. I am beginning to suspect that this view of native R is really just me imposing beliefs from usage of other languages that support whole vector/array operations, e.g., APL.
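A concrete illustration of the two styles (my example, not from the post): summing each row of a matrix, first with an explicit loop, then as a whole-object operation.

```r
m <- matrix(1:12, nrow = 3)

# Loop version: the accent of a C/Java/Fortran programmer.
loop_row_sums <- numeric(nrow(m))
for (i in 1:nrow(m))
   loop_row_sums[i] <- sum(m[i, ])

# Whole-object version: closer to the APL-influenced style discussed above.
vector_row_sums <- rowSums(m)

stopifnot(identical(loop_row_sums, vector_row_sums))
```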

I encountered the following coding problem yesterday. Do you think the non-loop version should be how it is done in R or is the loop version more ‘natural’?

Given a vector of ordered items, the problem is to count the length of each subsequence of identical items:

a,a,a,b,b,a,c,c,c,c,b,c,c

output

a 3
b 2
a 1
c 4
b 1
c 2

Non-looping version (looping version is easy to figure out):

subseq_len = function(feature)
{
   r_shift = c(feature[1], feature)
   l_shift = c(feature, ",,,") # pad with something that will not match

   # Where are the boundaries between subsequences?
   boundary = (l_shift != r_shift)

   sum_matches = cumsum(!boundary)

   # Difference of cumulative sum at boundaries, whose value will
   # be off by 1 and we need to handle 'virtual' start of list at 1.
   t = sum_matches[boundary]

   seq_len = 1+c(t, 0)-c(1, t)

   # Remove spurious value
   return(cbind(feature[boundary[-1]], seq_len[-length(seq_len)]))
}

subseq_len(c("a", "a", "a", "b", "b", "a", "c", "c", "c", "c", "b", "c", "c"))
  1. Ken Knoblauch
    February 22, 2013 13:27 | #1

    I would just use the rle function:

    d <- scan(textConnection("a,a,a,b,b,a,c,c,c,c,b,c,c"), "character", sep = ",")
    rle(d)
    Run Length Encoding
    lengths: int [1:6] 3 2 1 4 1 2
    values : chr [1:6] "a" "b" "a" "c" "b" "c"

  2. Dave
    February 22, 2013 13:32 | #2

    For your example you could also use the rle function in R

  3. February 22, 2013 13:41 | #3

    @Ken Knoblauch
    Groan, yet another example of me not knowing about a function in the base library. The implementation of rle is also much less cluttered than mine. My use of cumsum shows that I am still thinking in terms of loops and counting; using which() is the whole-vector way of thinking.
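    For the record, a sketch of rle applied to the post’s example, reshaped into the same two-column form that subseq_len returns:

```r
x <- c("a", "a", "a", "b", "b", "a", "c", "c", "c", "c", "b", "c", "c")
r <- rle(x)

# Same layout as subseq_len's output: value alongside run length.
cbind(r$values, r$lengths)
```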

  4. Henrik R
    February 22, 2013 13:58 | #4

    Interesting post. I use R every day at work (biostatistician) but have never learned it as a programming language. Therefore I do not really have any framework for discussing R in this respect.

    I have recently become more interested in the general theory of programming languages – I’ve been trying to follow this course at coursera.org that teaches this using SML, Racket and Ruby and I must say it is highly rewarding. It does help for understanding R as well. Maybe this will put an end to my constant use of for-loops!

  5. Ben Bolker
    February 22, 2013 14:25 | #5

    If you don’t want to use rle(), how about

    sapply(split(seq(z),cumsum(c(0,diff(as.numeric(factor(z)))!=0))),length)

    ?

  6. February 22, 2013 14:28 | #6

    One way to learn R (e.g. ensure you haven’t missed any functions in base such as rle) is:
    library(unknownR)
    ?unk
    unk()
    http://unknownr.r-forge.r-project.org/

  7. Joseph
    February 22, 2013 14:31 | #7

    Nice post. R is the first and only language I have learned so you might call me a native user. I work with R at my job on average 3-4 hours a day. I have also wondered if there was a standard style for writing R code since I have never learned any other language. This is one reason I have been hesitant to publish my code.

    In answer to your vector question, this would be my approach:
    > vector <- c("a", "a", "a", "b", "b", "a", "c", "c", "c", "c", "b", "c", "c")
    > table(vector)
    vector
    a b c
    4 3 6

  8. Joseph
    February 22, 2013 14:33 | #8

    vector <- c("a", "a", "a", "b", "b", "a", "c", "c", "c", "c", "b", "c", "c")

  9. February 22, 2013 14:40 | #9

    @Ben Bolker
    Using sapply implies a ‘looping way’ of thinking. I am starting to notice for-loops in other peoples’ code that really ought to be written using sapply, i.e., they have an explicit loop that calculates some value and assigns it to the current element of a vector. The somewhat unintuitive behavior of the apply functions does not help, perhaps beginners should be exclusively pointed at the plyr package.
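    The pattern described here can be shown side by side; a minimal sketch (my example, not from this thread):

```r
x <- c(1, 4, 9, 16)

# Explicit loop assigning into the current element of a vector:
# the style that ought to be written with sapply.
result <- numeric(length(x))
for (i in seq_along(x))
   result[i] <- sqrt(x[i])

# The sapply equivalent (sqrt is already vectorised, so sqrt(x)
# alone would also work; sapply is shown purely for the pattern).
stopifnot(identical(result, sapply(x, sqrt)))
```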

  10. February 22, 2013 14:58 | #10

    I work a lot in R, and didn’t know about rle()
    Yet, a better way to achieve this without looping and without the use of apply (implicit looping) is by doing the following:

    my.rle = function(x) {
       tmp = which(diff(as.numeric(factor(x))) != 0)
       result = diff(c(0, tmp, length(x)))
       names(result) = x[c(tmp, length(x))]
       result
    }

    a real native would probably do all this in one long line 🙂

  11. Stan S
    February 22, 2013 15:59 | #11

    I don’t consider myself a “native” R user as I have no academic programming background, however, I have found that I code in Hadley’s R rather than base R most of the time. While I have found ddply and ggplot to be a bit slow at times, I can really appreciate the workflow behind them. In terms of for loops, I only avoid them when I have a good reason to believe that the code may have to be scaled up at some point. Otherwise, a for loop just saves time…

    As for your example (and other examples of that type), I tend to rely on index math/manipulation:

    Code:

    subseq_len <- function(feature) {

       # Pad with character unlikely to be found normally
       feature <- c('##', feature, '##')

       # Create an index vector for convenience
       index <- 3:length(feature) - 1

       # Get the start and finish indices of the feature
       start.index <- index[feature[index-1] != feature[index]]
       finish.index <- index[feature[index+1] != feature[index]]

       # The features themselves
       unique.feature <- feature[start.index]

       # The counts can be calculated from index differences
       out <- data.frame(feature = unique.feature,
                         count = finish.index - start.index + 1)
       out
    }

    print(subseq_len(c("a","a","a","b","b","a","c","c","c","c","b","c","c")))

    Result:

    feature count
    1 a 3
    2 b 2
    3 a 1
    4 c 4
    5 b 1
    6 c 2

    This approach may take a couple of lines but I think it's quite easy to follow.

  12. February 22, 2013 17:38 | #12
  13. Robert Young
    February 22, 2013 18:26 | #13

    Most of R isn’t written in R (LoC metric): http://librestats.com/2011/08/27/how-much-of-r-is-written-in-r/ and most of R was written by math stats, not coders. Unlike SAS, for example.

    So, if you want to know how to write in R, write in C (or FORTRAN 🙂 ). Even though some claim that R is both Functional and Object Oriented (I never bought the argument), even Chambers ( http://www.amazon.com/gp/product/1441926127/ref=s9_simh_gw_p14_d0_i1?pf_rd_m=ATVPDKIKX0DER&pf_rd_s=center-2&pf_rd_r=06T16R4FNE6JNYA8R18C&pf_rd_t=101&pf_rd_p=1389517282&pf_rd_i=507846 ) admits that this is mostly hand waving. R is data/function structured just like FORTRAN 77; even the Great and Powerful Google admonishes one to use S3 classes (and no more than 80 characters per line, just like a 1950’s punch card), which are merely tagged structs – no methods. I hear tell that the Bioconductor folks build real objects, but I’ve never looked.

    So, there really isn’t native R, if only because there isn’t a BDFL (python/Perl) and more than 4,000 packages in CRAN and 600 at Bioconductor. I suspect that each genre (biostats/econometricians/psychometricians/finance/etc.) develops its own style. I suppose CRAN acceptance might be as close as one gets.

    Finally, there’s R as stat pack command language and R as programming language. I’m still undecided whether pushing both a round peg and a square one into a triangular hole is wisest; SPSS/SAS with command language and macro language just might be the better approach. Most use R as commands, possibly scripted.

    Given how little of R is R, one might argue that R is really C programming and R command languages?

  14. February 22, 2013 18:40 | #14

    @Robert Young
    It doesn’t matter what R is implemented in, that is all behind the scenes stuff.

    Almost every language has strong connections to languages that went before; just as English can claim Germanic descent, this does not prevent native speakers of the language using it in a particular way.

  15. February 22, 2013 18:44 | #15

    @Tal,
    Yes, I saw it when it was published; a bit superficial. There has been lots of analysis of identifier names, unfortunately mostly inconclusive, or showing little can be automated.

  16. Meower68
    February 22, 2013 19:03 | #16

    I follow this blog because I’m interested in learning more about R.

    I have spent plenty of time writing stuff in Scheme and Perl, and both experiences show themselves in my current gig, where I write Java all day.

    I have no fear of using HashMap, which is a bit like Perl’s hashes. I have no fear of using List, which is similar to Perl’s lists. I have no qualms about building multi-tiered combinations of the two, which is DEFINITELY how a lot of stuff gets done in Perl. I have no qualms about returning an Object[] with multiple values in it, to avoid creating a special-purpose class to hold a handful of values. This is very much the Perl way.

    I regularly have function calls which have parentheses nested half a dozen levels deep (functional programming, a la Scheme). Most debuggers have a hard time with that, however. I regularly write methods which take a value, modify it, and return the modified value, making it very easy to chain function calls.

    As such, you could say that I write Java with Perl and Scheme accents. Ergo, I find it interesting that someone else is thinking/talking about “programming with an accent.”

    Lately, I’ve been reading about Flow-Based Programming (indeed, that’s the title of the book). Think of Unix pipes, building asynchronous pipelines of data and processes, on steroids. Plenty of room for parallelization (both through fan-out/in and multiple stages in the pipeline) and largely building on interconnecting standardized program components, with maybe a parameter or two to tell the standardized component which part of the data to look at and/or transform.

    I see R as being somewhat similar to this. Where the book mentions packets of data, flowing between components in a network, y’all use data flowing out of one function and into another, with implicit opportunities for parallelism.

    As such, it’s not hard to see your exercise in FBP light and extrapolate from there. Building a Run-Length Encoding component would be pretty easy to do. As with FBP, the trick is learning all of the components available for use.

  17. February 22, 2013 22:44 | #17

    Hmm, interesting point. As far as general-purpose languages go, it’s hard to tell. But what you can recognize rather easily is people coming from other statistical packages, such as SAS or Stata.

  18. Sergey Goder
    February 23, 2013 00:40 | #18

    > Do you think the non-loop version should be how it is done in R or is the loop version more ‘natural’?

    R is the first programming language I learned well and I have been using it on a daily basis for the last three years. IMO using the apply functions in a vectorized manner is the way it should be done by a “native R coder”; the loop version is probably more natural to someone coming from a background in C or Java. On the other hand, if you are familiar with any functional languages, then the apply functions should feel more natural.

    There are lots of advantages to using the apply functions especially now that R 2.14 has the built in parallel library. Write your code to use lapply and if you need it to run faster you can swap it out with mclapply and things just work (provided you have enough memory). I’ve found that using lapply followed by a do.call is a very fast way to get what you need done.
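    A minimal sketch of the lapply-followed-by-do.call pattern described here, using the built-in mtcars data set (my example, not the commenter’s code):

```r
# Per-group computation: one small data frame per group of cars.
chunks <- split(mtcars$mpg, mtcars$cyl)
stats <- lapply(chunks, function(v) data.frame(mean = mean(v), n = length(v)))

# do.call feeds the list of pieces to rbind as separate arguments,
# producing a single combined data frame.
combined <- do.call("rbind", stats)

# To parallelise, swap lapply for parallel::mclapply (Unix only); nothing else changes.
```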

  19. Richie Cotton
    February 25, 2013 13:57 | #19

    Even native R coders will occasionally use for/while/repeat loops, but nested loops are a definite sign of a background in C/C++/FORTRAN.

    Naming your functions proc_something (even more so if capitalised) indicates a SAS background.

    Preferring lower_under_case to lowerCamelCase or vice versa could hint at a background in Python or Ruby vs. Java/C#. lowercasewithnounderscores is very MATLABy.

  20. March 20, 2013 03:25 | #20

    I am just starting to learn R. I like it better than just using SPSS. Statistics is not complicated enough; I want to learn a new language to express it.
    I seem to have some C-style accents, because I write little functions with nested case and loop structures. I tend to use Python indenting even in languages that don’t need the indent. My excuse is sometimes “readability” but usually just habit.
