
R recommended usage for professional developers

R is not one of those languages where there is only one way of doing something; the language is blessed/cursed with lots of ways of doing the same thing.

Teaching R to professional developers is easy in the sense that their fluency with other languages will enable them to soak up this small language like a sponge, on the day they learn it. The problems will start a few days after they have been programming in another language and go back to using R; what they learned about R will have become entangled in their general language knowledge and they will be reduced to trial and error to figure out how things work in R (a problem I often have with languages I have not used in a while is remembering whether the if-statement has a then keyword or not).

My Empirical software engineering book uses R and is aimed at professional developers; I have been trying to create a subset of R specifically for professional developers. The aims of this subset are:

  • behave like other languages the developer is likely to know,
  • not require knowing which way round the convention is in R, e.g., are 2-D arrays indexed in row-column or column-row order,
  • reduce the likelihood that developers will play with the language (there is a subset of developers who enjoy exploring the nooks and crannies of a language, creating completely unmaintainable code in the process).

I am running a workshop based on the book in a few weeks and plan to teach them R in 20 minutes (the library will take somewhat longer).

Here are some of the constructs in my subset:

  • Use subset to extract rows meeting some condition. Indexing requires remembering to do it in row-column order and weird things happen when commas accidentally get omitted.
  • Always call read.csv with the argument as.is=TRUE. Computers now have lots of memory and this factor nonsense needs to be banished to history.
  • Try not to use for loops. These will probably contain array/data.frame indexing, which provides ample opportunities for making mistakes; use the *apply or *ply functions instead (which have the added advantage of causing code to die quickly and horribly when a mistake is made, making it easier to track down problems).
  • Use head to remove the last N elements from an object, e.g., head(x, -1) returns x with the last element removed. Indexing with the length minus one is a disaster waiting to happen.
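
As a rough illustration of the constructs above, here is a small sketch using a made-up data frame (the names commits, author and lines_changed are invented for this example and do not come from the book):

```r
# Hypothetical data; in practice this would come from something like
# read.csv("commits.csv", as.is=TRUE), so that strings stay as strings.
commits <- data.frame(author=c("ann", "bob", "ann", "cat"),
                      lines_changed=c(10, 250, 35, 7),
                      stringsAsFactors=FALSE)

# subset: extract the rows meeting a condition, no row/column indexing needed
big_commits <- subset(commits, lines_changed > 20)

# sapply instead of a for loop: total lines changed per author
authors <- unique(commits$author)
total_lines <- sapply(authors,
                      function(a) sum(subset(commits, author == a)$lines_changed))

# head with a negative count: everything except the last element
all_but_last <- head(commits$lines_changed, -1)
```

Nothing in the sketch uses square brackets, which is the point of the subset.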

It’s a shame that R does not have any mechanism for declaring variables. Experience with other languages has shown that requiring variables to be declared before use catches lots of coding errors (this could be an optional feature so that those who want their ‘freedom’ can have it).

We now know that support for case-sensitive identifiers is a language design flaw, but many in my audience will not have used a language that behaves like this and I have no idea how to help them out.

There are languages in common use whose array bounds start at one. I will introduce R as a member of this club. Not much I can do to help out here, except the general suggestion not to do array indexing.
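
For developers coming from zero-based languages, a minimal sketch of what to expect (the vector x is just an illustration):

```r
x <- c(10, 20, 30)

first <- x[1]    # the first element lives at index 1, not 0
zeroth <- x[0]   # no error is raised: index 0 silently selects nothing
```

The second line is the dangerous one, since an off-by-one mistake produces an empty vector rather than a failure.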

Suggestions based on readers’ experiences welcome.

  1. December 30, 2015 14:40 | #1

    On subset, there is a reason it isn’t recommended except in interactive context. Using it in nested calls can lead people to making unnecessary mistakes:
    http://adv-r.had.co.nz/Computing-on-the-language.html#subset

    On read.csv, we tend to use options(stringsAsFactors = FALSE) as part of everyone’s profile to avoid this issue.

    Agree on *apply, but note that using for loops with preallocation should provide the same speed in general cases. Sometimes *apply can use up more RAM than a for loop, especially when working with larger datasets.
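
    A minimal sketch of the two styles being compared, assuming a trivial computation (the names n, squares_loop and squares_apply are made up for illustration; no timing claims are intended):

```r
n <- 1000

# for loop writing into a preallocated vector
squares_loop <- numeric(n)
for (i in seq_len(n)) {
  squares_loop[i] <- i^2
}

# the same computation with vapply, which also checks that every
# result is a single numeric value
squares_apply <- vapply(seq_len(n), function(i) i^2, numeric(1))
```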

    We generally start non-R users on ‘The Art of R Programming’ in my team as for us it’s shown to be the best way for experienced devs to get familiar with the language (Matlab, C/C++, C# and Javascript devs mostly)

    On the variable scoping issue it can most of the time be prevented by building functions into a package and using the available tools (devtools:: etc) to make the required checks.

    A quick alternative is to evaluate a function within an environment that only inherits the base packages and run it through compiler::cmpfun, which will highlight the use of unassigned variables. Note this will require non-base package functions to be referred to as namespace::function. Example:

    > rm(list=ls())
    > someFunction <- function(x) { return(x*someVariable) }
    > invisible(compiler::cmpfun(someFunction))
    Note: no visible binding for global variable 'someVariable'
    # Scoping can be an issue:
    > someVariable <- 1
    > invisible(compiler::cmpfun(someFunction))
    # No more error because of lexical scope, to solve:
    > poorMansLibrary <- new.env()
    > poorMansLibrary$someFunction <- someFunction
    > rm(list=setdiff(ls(),"poorMansLibrary"))
    > for(fn in names(poorMansLibrary)){
          out <- capture.output(invisible(compiler::cmpfun(poorMansLibrary[[fn]])))
          if(length(out)>0) {
              cat("Warnings from function: ",fn,"\n",paste(out,collapse="\n"),"\n",sep="")
          }
      }

    You make some suggestions I disagree with, and which run counter to what is noted in The Art of R Programming and http://adv-r.had.co.nz . Still, it is interesting to see another point of view, although some of the suggestions may result in issues similar to the ones they are meant to prevent, just different ones. Look forward to seeing the book.

  2. December 30, 2015 14:43 | #2

    I cannot agree on all your points. I’m not an expert in R (2.5 years of dev experience) but I have no problem reading the source code of R packages, no problem at all, including code that computes on the language. I’m not a programmer in any other programming language, maybe that helps? Anyway, much more important than the points listed in the post are, IMO, unit tests and how clearly they map to business requirements. Bothering about `head(x, -1)` vs `x[-length(x)]` is pointless; the much more “professional” way, as I see it, is to have a unit test verifying the structure of `x`.
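
    For reference, a small sketch of how these forms behave at the edges, alongside the "length minus one" range form mentioned in the post (the vector x is made up for illustration):

```r
x <- c(5)                    # a vector with a single element

a <- head(x, -1)             # empty vector: the last element is dropped
b <- x[-length(x)]           # empty vector: the negative index drops element 1
c1 <- x[1:(length(x) - 1)]   # 1:0 counts down, giving c(1, 0), so this is x[1]

# Asking head to drop more elements than exist also gives an empty vector
d <- head(c(1, 2, 3), -5)
```

    The first two forms agree here; it is the range form that quietly returns the element it was meant to remove.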

  3. December 30, 2015 15:53 | #3

    @Hansi
    Thanks for your suggestion on checking variable usage.

    The scoping rules of subset are non-standard, but I think they are the sensible choice. Yes, users can make mistakes, but that is not a reason to recommend against using a construct (people make mistakes using binary plus, are we going to recommend against this usage?) Coding rules are often about picking the least worst construct. Cut-and-paste is a big source of coding error and complicated array/data.frame accesses are tailor-made for making this kind of mistake.

    Premature optimization is the root of all evil and I plan to ignore coding efficiency issues. Worry about developer time, not computer time. The computers we have are way fast enough to handle the size of data sets that occur in software engineering (it would be nice to see this change).

    I picked as.is because the usage suggests its name, has no upper-case letters to remember and is shorter than stringsAsFactors.

    Matloff’s R For programmers is very good; the follow-on book really ought to be called “Using R like it was a language you already know”, it is not the art of R at all 🙁

  4. December 30, 2015 17:49 | #4

    Thanks for sharing, though I have to say that I disagree with pretty much all of your recommendations here.

    * Indexing is one of the things that R does well. Your recommendation not to use it flies in the face of 99% of R code written.

    * *apply functions, while different from the typical pattern of most other languages, are used heavily by experienced R programmers. However, so are for loops. Suggesting not to use them because it may lead to array indexing is… let’s say… an argument I’ve not heard before.

    * Using head would depend on what behavior you want. If you want to allow for the possibility that the vector doesn’t have as many elements as you are trying to remove and return an empty vector, then use head. Otherwise use indexing, which will throw an error.

    * Languages with prominent variable declarations are typically statically typed. R, like Python and other dynamically typed languages, does variable declaration on assignment. This is not a weakness.

    tl;dr: These recommendations are very non-standard and are not how the best coders I know write their R code.

  5. Bernhard
    December 30, 2015 18:15 | #5

    Hi there,

    R is crowded with lots of functions and I like the idea of building a useful smaller subset of it. However, one should not program against the language and you should not take from R what makes R special. In my opinion, square brackets are at the heart of what R is. They are easily explained, very powerful and very intuitive once you have realized their power. Apart from subset() not being recommended for programming, I do not see how it could replace square brackets. Not only does subset work on only one of two dimensions, it is also of little use in assignment, where I use square brackets often. See this example:

    # example data
    d <- data.frame(a=1:20, b=rep(0,20), c=runif(20))

    # it is easy to use subset for reading
    d[d$a %in% c(3,4,5),] #equals
    subset(d, a %in% c(3,4,5))

    # but what about assignment?
    d[d$a %in% c(3,4,5),]$b <- TRUE # works as expected
    subset(d, a %in% c(3,4,5))$b <- TRUE # yields an error

    How do you plan to do assignment to individual fields in matrices or data.frames?

    As for for-loops: they are a hallmark of things that are similar in many, many mainstream languages. So I don't think you should necessarily leave them out for your audience.

    Cheers,
    Bernhard

  6. December 30, 2015 19:10 | #6

    @Ian Fellows
    Indexing is certainly very powerful in R and can provide a very concise way of performing whole array operations in a single statement. But using loops for array indexing, that is just using R like a language that does not support whole array operations.

    Yes, for-loops do occur rather frequently in R code written by experienced developers. I suspect this is because they have not sat down and thought about how best to make use of what R has to offer and are approaching the choice of algorithms to use in the same way as they would in other languages (perhaps because that is what they see all around them). When I started using R a lot I made a conscious decision to try and avoid for-loops; R struck me as a language in which for-loops were not natural and apart from about half a dozen cases I have managed to avoid them.

    Some implementations of dynamic languages support optional variable declarations as a helpful debugging option.

    Do the experienced R coders you know use a consistent style? I have my doubts that one exists for R.

  7. December 30, 2015 19:19 | #7

    @Bernhard
    Why recommend against subset? Just because a variable in the argument list may match a column in the file that was read? Not a strong enough reason.

    Use of square brackets invariably introduces dependencies between the two sides of an assignment; dependencies that are prone to errors through cut-and-paste oversights and failure to change array names. The common case is reading, not writing (it would be great if subset supported your assignment use case).

    Why are you assigning to individual elements in a language that supports whole array operations?
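
    One possible whole-column formulation of the assignment example from comment #5, offered as a sketch rather than the only way to do it (ifelse is a swapped-in technique here, not something proposed in the thread):

```r
# same example data as in comment #5
d <- data.frame(a=1:20, b=rep(0,20), c=runif(20))

# recompute the whole column b in one operation: 1 where a is 3, 4 or 5,
# the existing value of b everywhere else
d$b <- ifelse(d$a %in% c(3,4,5), 1, d$b)
```

    The data frame name still appears on both sides of the assignment, so this reduces the indexing but does not remove the cut-and-paste dependency entirely.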

    Professional developers know lots of languages and if they are using for-loops in R I don’t understand why they are not using one of the languages they know a lot better to solve their problem.

  8. December 30, 2015 19:41 | #8

    @jangorecki
    Yes, testing is very important and I assume my target audience of professional developers already know this.

  9. December 30, 2015 20:44 | #9

    @Hansi
    Seems my example got mangled by the parser, fixed here: https://gist.github.com/Hlynsson/8f1fda68075b7af54d50

  10. December 30, 2015 20:49 | #10

    @Hansi
    Thanks for posting this. I have edited the code into your original post.
