
subset vs array indexing: which will cause the least grief in R?

The comments on my post outlining recommended R usage for professional developers were universally scornful, with my proposal recommending subset receiving the greatest wrath. The main argument against using subset appeared to be that it goes against existing practice; one comment linked to Hadley Wickham suggesting that subset is useful in an interactive session (and, by implication, not useful elsewhere).

The commenters appeared to be knowledgeable R users, and I suspect they might have fallen into the trap of thinking that, having invested time acquiring expertise in a language's intricacies, they ought to use those intricacies. Big mistake: the best way to make use of language expertise is to use it to avoid the intricacies, aiming to write simple, easy-to-understand code.

Let’s use Hadley’s example to discuss the pros and cons of subset vs. array indexing (normally I have lots of data to help make my case, but usage data for R is thin on the ground).

Some data to work with, which would normally be read from a file.

sample_df = data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

The following are two of the ways of extracting all rows for which a >= 4:

subset(sample_df, a >= 4)
# has the same external effect as:
sample_df[sample_df$a >= 4, ]

The subset approach has the advantages:

  1. The array name, sample_df, only appears once. If this code is cut-and-pasted or the array name changes, the person editing the code may omit changing the second occurrence.
  2. Omitting the comma in the array access is an easy mistake to make (and it won’t get flagged).
  3. The person writing the code has to remember that in R data is stored in row-column order (it is in column-row order in many languages in common use). This might not be a problem for developers who only code in R, but my target audience is likely to be casual R users.
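A minimal sketch of point 2, using the sample_df defined earlier: a comma-less logical index quietly selects columns instead of rows, and nothing flags the changed meaning.

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

# with the comma: row selection
sample_df[sample_df$a >= 4, ]    # rows 4 and 5, all columns

# without the comma a logical index picks *columns* instead
sample_df[c(TRUE, FALSE, TRUE)]  # columns a and c, all rows
```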

The case for subset is not all positive; there is a use case where it will produce the wrong answer. Let’s say I want all the rows where b has some computed value and I have chosen to store this computed value in a variable called c.

subset(sample_df, b == c)

I get the surprising output:

>   a b c
> 1 1 5 5
> 5 5 1 1

because the code I have written is actually equivalent to:

sample_df[sample_df$b == sample_df$c, ]

The problem is caused by the data containing a column having the same name as the variable used to hold the computed value that is tested.
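A sketch of the clash and the obvious defence, which is to pick a variable name (here wanted, my choice of name) that does not collide with any column:

```r
sample_df <- data.frame(a = 1:5, b = 5:1, c = c(5, 3, 1, 4, 1))

c <- 3                          # clashes with column c
subset(sample_df, b == c)       # column c wins: rows 1 and 5, not row 3

wanted <- 3                     # no clash, found in the calling environment
subset(sample_df, b == wanted)  # row 3, as intended
```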

So both subset and array indexing are a source of potential problems. Which of the two is likely to cause the most grief?

Unless the files being processed each potentially contain many columns having unknown (at time of writing the code) names, I think the subset name clash problem is much less likely to occur than the array indexing problems listed earlier.

It's a shame that assignment to subset is not supported (something to consider for a future release), but reading is the common case and that is what we are interested in.

Yes, subset is restricted to 2-dimensional objects, but most data is 2-dimensional (at least in my world). Again concentrate recommendations on the common case.

When a choice is available, developers should pick the construct that is least likely to cause problems, and trivial mistakes are the most common cause of problems.

Does anybody have a convincing argument why array indexing is to be preferred over subset (“it's not common usage” being the reason of last resort for the desperate)?

  1. Bernhard
    January 4th, 2016 at 13:19 | #1

    First of all, it is not only Hadley Wickham who wrote that subset is for interactive use. It is also on the help page of ?subset, where your readers will read that and be bewildered.
    Second, even if you insist that “reading is the common case”, which I doubt, still every R user will have to do assignment once in a while. Are you trying to leave your readers alone with that task, or are you going to introduce square brackets for assignment but not for reading? Two syntaxes instead of one?
    I don’t think anyone will change the meaning of subset() in a future release. Assignment to the result of a function looks ugly. Obviously you could define a function assign.to.subset. Even more syntax added to a language with too many functions already.
    Third, I don’t remember having made the error of leaving out that comma. So if you “think the subset name clash problem is much less likely to occur than the array indexing problems listed earlier”, are there any indications of that?
    Please don’t get me wrong, there is neither scorn nor wrath. I like the idea behind “JavaScript – the good parts” and would like to see a similar book on R. Square brackets just seem to be an unhappy decision.


  2. January 4th, 2016 at 18:46 | #2

    3. The person writing the code has to remember that in R data is stored in row-column order (it is in column-row order in many languages in common use).

    no idea why this will cause the coder/user/quant any issue. all languages that I know of call out a matrix-like object as x(row, column), just as in maths. some, like FORTRAN, store in column-major order, but all languages (again, with which I’m familiar) store matrices as a vector, with markers for dimension shift. unless there’s some reason to directly manipulate the storage, but we all know pointers are evil.

  3. January 4th, 2016 at 19:11 | #3

    Yes, there is a lot to be said for only teaching array indexing because people have to know it and always using one approach helps it to become more strongly reinforced.

    The reading/writing ratio is certainly greater than one. I would say greater than two, but my own code may not be representative and is certainly a small sample. If developers make use of whole array/vector operations the ratio may be a lot lower than other languages that don’t support such operations.

    You get used to seeing function calls appearing on the left-hand-side of assignments. This usage occurs surprisingly often in object oriented languages returning references.

    “JavaScript – the good parts” is a very useful book. R is a very small language that supports lots of ways of doing the same thing, so the equivalent book in R would not be able to leave much out, but instead concentrate on describing the less error prone ways of doing things.

    I am not aware of any studies on column/variable name usage.

  4. January 5th, 2016 at 00:08 | #4

    Very nice article. Great to see some detailed and brave thinking.
    Agree with point 1 fully. That’s actually built into data.table; see my answer to the 2nd highest voted R question on Stack Overflow: http://stackoverflow.com/a/10758086/403310
    @Bernhard Yes data.table builds := inside [] so it can be used together with subset and grouping in one consistent syntax with no new functions to learn.
    I agree with point 2 too, and it is one reason why data.table made that comma optional. You have to explicitly use double brackets to get a single column extract (i.e. DT[[2]]) which I think is clearer than including a comma or not (the presence of which changes the method dispatched to in base). In other words, I find it confusing that in base, DF[,3] == DF[3] (both return the 3rd column).
    On point 3: some people find everything hard to remember. I find one consistent syntax easier to remember than lots of different functions. Which languages use M[column, row] syntax anyway?

  5. Nathan
    January 5th, 2016 at 00:38 | #5

    While the code is usually written together, it should account for possible unforeseen names. Let’s say you create a function:

    my_fun <- function(x) {
      a <- 1:5
      b <- 5:1
      subset(x, a == b)
    }

    Obviously, this function is meant to return the third row of every five rows of a data frame.

    Now for a data.frame to use with this function:

    my_data <- data.frame(x = c(1, 2, 3, 4, 5), y = c(1, 0, 3, 0, 5))
    # x y
    #3 3 3

    A great benefit to using a function is to make it general, so it can be used with any data frame. And you can't always predict the names of columns in those data frames. Assume a colleague was using the same data as above, but renamed it. He or she would be surprised if the function suddenly returned a different value.

    colnames(my_data) <- c("a", "b")
    # a b
    #1 1 1
    #3 3 3
    #5 5 5

    The point is that R uses nested environments, which can cause major problems when looking up variables. The subset function treats the data.frame as an environment with the columns as variables. Any variables with the same names in any other environment will be ignored, since subset immediately found the variable in the data.frame.

    Expecting other developers to know which columns to rename before passing a data.frame as an argument to your function is a major opportunity for problems.
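    A sketch of one defence (my_fun2 is an illustrative name of my choosing): compute the logical index outside the data frame and use array indexing, so column names cannot intervene in the lookup.

```r
my_fun2 <- function(x) {
  a <- 1:5
  b <- 5:1
  # a and b are plain variables here, never looked up inside x,
  # so renaming x's columns cannot change the result
  x[a == b, , drop = FALSE]
}

my_data <- data.frame(x = c(1, 2, 3, 4, 5), y = c(1, 0, 3, 0, 5))
my_fun2(my_data)                 # row 3
colnames(my_data) <- c("a", "b")
my_fun2(my_data)                 # still only row 3
```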

    PS: Forgive me if there's some way to add code tags in these comments.

  6. Tim
    January 5th, 2016 at 09:39 | #6

    You can assign value to subset(), simply create a function (for data.frame or matrix objects only):

    `subset<-` <- function (x, subset, value) {
      x[with(x, subset), ] <- value
      x
    }
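    A usage sketch, repeating Tim's definition completed with a closing brace and a return value so the snippet stands alone. Note that the condition is evaluated at the call site, so column names must be written out in full (df$a rather than a):

```r
`subset<-` <- function(x, subset, value) {
  x[with(x, subset), ] <- value
  x  # a replacement function must return the modified object
}

df <- data.frame(a = 1:5, b = 5:1)
subset(df, df$a >= 4) <- 0   # rows 4 and 5 of every column become 0
df
```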

  7. Bernhard
    January 5th, 2016 at 11:50 | #7

    @Derek Jones

    I don’t mind the reading/writing ratio, as long as the absolute frequency of having to write is high enough. I do concur though, that whole vector operations should be encouraged. Still unsolved is the problem, that subset allows row selection but not column selection.

    For example:
    questionnaire <- data.frame(ID = 1:8, item1 = c(1,0,0,1,0,0,1,1), item2 = c(1,1,1,1,1,1,1,0),
                                item3 = c(0,1,1,0,1,0,0,1)) # item3 values invented here to complete the truncated line
    This may be some probands’ answers to a yes/no questionnaire with three items. Now, we want the sum scores for each proband. This is a reading operation somewhere along the lines of

    questionnaire$score <- rowSums(questionnaire[2:4])

    Maybe I am too much entangled with frequent square bracket use, but can you do that in a concise way without square brackets?


  8. Roy
    January 8th, 2016 at 10:10 | #8

    Another point that has been overlooked is the speed of the two approaches. The [] approach seems to be around 1.5 times faster than the subset() function. Another point is that the [] approach silently converts a data frame to a vector when a single column is selected, unless drop=FALSE is used.
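    A rough way to check both points (timings vary by machine and data, so treat the 1.5× figure as indicative only):

```r
df <- data.frame(a = runif(1e5), b = runif(1e5))

# both spellings select the same rows
identical(df[df$a >= 0.5, ], subset(df, a >= 0.5))   # TRUE

# relative speed; the numbers depend on the machine
system.time(for (i in 1:100) df[df$a >= 0.5, ])
system.time(for (i in 1:100) subset(df, a >= 0.5))

# the silent drop to a vector, and how drop = FALSE prevents it
is.data.frame(df[df$a >= 0.5, "a"])                  # FALSE: a vector
is.data.frame(df[df$a >= 0.5, "a", drop = FALSE])    # TRUE
```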

  9. January 8th, 2016 at 12:44 | #9

    In most cases the performance differences are not important. When performance is an issue people will spend time looking for the important bottlenecks. We should not allow the small number of cases where performance is an issue drive the behavior for the common case.
