Complexity is a source of income in open source ecosystems
I am someone who regularly uses R, and my interest in programming languages means that on a semi-regular basis I spend time reading blog posts about the language. Over the last year or so, I had noticed several patterns of behavior, and after reading a recent blog post things started to make sense (the blog post gets a lot of things wrong, but more on that later).
What are the patterns that have caught my attention?
Some background: Hadley Wickham is the guy behind some very useful R packages. Hadley was an academic, and is now the chief scientist at RStudio, the company behind the R language specific IDE of the same name. As Hadley’s thinking about how to manipulate data has evolved, he has created new packages, and has been very prolific. The term Hadley-verse was coined to describe an approach to data manipulation and program structuring, based around use of packages written by the man.
For the last nine months I have noticed that the term Tidyverse is being used more regularly to describe what had been the Hadley-verse. And???
Another thing that has become very noticeable, over the last six months, is the extent to which a wide range of packages now have dependencies on packages in the Hadley-verse/Tidyverse. And???
A recent post by Norman Matloff complains about the Tidyverse’s complexity (and about the consistency between its packages; which I had always thought was a good design principle), and how RStudio’s promotion of the Tidyverse could result in it becoming the dominant R world view. Matloff has an academic world view and misses what is going on.
RStudio, the company, need to sell their services (their IDE is clunky and will be wiped out if a top of the range product, such as one from JetBrains, adds support for R). If R were simple to use, companies would have less need to hire external experts. A widely used complicated library of packages is a godsend for a company looking to sell R services.
I don’t think Hadley Wickham intentionally made things complicated, any more than the creators of the Microsoft server protocols added interdependencies to make life difficult for competitors.
A complex package ecosystem was probably not part of RStudio’s product vision, at least for many years. But sooner or later, RStudio management will have realised that simplicity and ease of use are not in their interest.
Once a collection of complicated packages exists, it is in RStudio’s interest to get as many other packages using them, as quickly as possible. Infect the host quickly, before anybody notices; all the while telling people how much the company is investing in the community that it cares about (making lots of money from).
Having this package ecosystem known as the Hadley-verse gives too much influence to one person, and makes it difficult to fire him later. Rebranding as the Tidyverse solves these problems.
Matloff accuses RStudio of monopoly behavior; I would have said they are fighting for survival (i.e., creating an environment capable of generating the kind of income a VC funded company is expected to make). Having worked in language environments where multiple, and incompatible, package ecosystems existed, I can see advantages in there being a monopoly. Matloff is also upset about a commercial company swooping in to steal their precious, a common academic complaint (academics swooping in to steal ideas from commercially developed software is, of course, perfectly respectable). Matloff also makes claims about teachability of programming that are not derived from any experimental evidence, but then everybody makes claims about programming languages without there being any experimental evidence.
RStudio management rode in on the data science wave, raising money from VCs. The wave is subsiding and they now need to appear to have a viable business (so they can be sold to a bigger fish), which means there has to be a visible market they can sell into. One way to sell in an open source environment is for things to be so complicated that large companies will pay somebody to handle the complexity.
The tidyverse is certainly complex in some ways, but is it really any worse than some of the rough edges in base R? I definitely remember the days before the tidyverse, where in order to understand some of R’s weird quirks you had to read things like The R Inferno. And are “tidy” workflows based around inputting and outputting dataframes really worse than the multitude of OOP systems that were developed for R?
Even if RStudio benefits from “controlling” the complexity that exists in the tidyverse (I think at least some level is inevitable), aren’t we all better off for having a slightly more standardized way of working with R?
Other than ggplot, I find most of the tidyverse packages (that I have been exposed to) superfluous. I see them as ease of use shortcuts for those who don’t want to learn base R. The problem I encounter now is that when DuckDuckGoing for solutions to algorithmic problems, many of the solutions involve tidyverse packages, forcing me to become acquainted with the tidyverse packages.
@Marius
Yes, there are benefits in having everybody doing things the same way. The path to getting to a standardized way of doing things can be tortuous, as things evolve towards their final resting place, e.g., previously working code breaking as interfaces are modified.
People will argue over better, but this is invariably personal opinion, i.e., no experimental comparisons take place.
The problem with Tidyverse R is that it is a parallel, rather than complementary, way of working in R. Tidyverse functions replace basic data manipulation steps; as a result Tidyverse users are unable to handle these basic data manipulation steps when working with R packages for tasks such as advanced linear modeling or GIS, or even with RStudio’s own Shiny.
@Joe
Interesting. I resisted ggplot2, since my old boss was incredibly exacting about how plots should look. There were too many ggplot2 defaults that she would have demanded be changed; as a result my code would have had a half dozen lines undoing all of Hadley Wickham’s unique work in plot design.
(On a side note, Wickham did a fine job researching ggplot2’s default design choices; she was very much a pointy haired boss type whose own opinions were unchangeable by facts, research, or industry standards.)
There’s so much wrong with R’s programming semantics/syntax that Wickham saw an opportunity. He took it. ggplot2 has a solid basis in theory/design, and is so much better than base R graphics.
Whether there’s any significantly worse learning curve with R vis-a-vis SAS/SPSS when used as a stat command language is a completely different issue. My view is R used as such is more straightforward; just follow the Yellow Brick Road. Bob Muenchen has a book (alas, dated from 2011) devoted to getting the SAS/SPSS crowd to R; again R as stat command pack. Early ggplot2, but no X-verse. Mayhaps he’ll get out a new edition?
Wickham has been taking slings and arrows from long time R folks just because he had the temerity to point out the glaring flaws in R as a programming language. An OO language? Puhhllleese! It’s an ‘everything is a struct’ language. And so on.
As to an RStudio attempt to create a moat around R with the Tidyverse in order to garner commercial or Enterprise clients, that’s not going to create new clients of those types. Any number of obstacles are known to exist:
– SAS owns the ‘programming’ protocol in Enterprise
– SPSS/IBM own the ‘point and click’ protocol in Enterprise
– R is memory limited; whether any of the storage-centric packages get traction is up in the air
– many believe (I haven’t had the energy to research the truth) that GPL restrictions make use of R impossible in commercial contexts – linking and calling and all that mean that an R based ‘product’ can’t be sold for moolah
– all of the semantic/syntax issues mentioned above
I disagree; people use the tidyverse because it makes their life easier, not because RStudio is pushing it.
@Samuel
People would not use the Tidyverse unless it was useful. But there are lots of ways of structuring a useful collection of packages and there are various incentives driving the choices.
I run a data science department and the Tidyverse makes it much easier to teach new data scientists. I much prefer RStudio’s approach to something like Microsoft R Open, which is even more parallel to base R than the Tidyverse.