Should I Use the Tidyverse?

I’m doing a lot of programming in the statistics language R, as I translate an economic model into R from Python. This is a big project, and I’ll blog about it more in later posts, as I share useful bits of code I’ve written. But in this post, I want to mention a kind of add-on to R: not part of the base language, but widely used and respected. This is Hadley Wickham’s Tidyverse.

I’ve had to decide whether to use the Tidyverse or to stick to the base language, so-called “base R”. Why is this a decision at all? Reality is complicated, and in engineering, we evolve alternative sets of tools for manipulating it. For example, computing has “functional programming”, “object-oriented programming”, and “logic programming”. These are different notations for describing reality, and conflicts may occur if we try to think in more than one at the same time. When designing a program that my colleague will also work on, I have to decide whether the benefits of another set of tools justify the conflict he’ll face in learning them.

I hope Hadley Wickham will forgive me for saying that his Tidyverse libraries set up such a conflict. As Bob Muenchen notes in “The Tidyverse Curse”, learners often comment that base R functions and Tidyverse ones feel like two separate languages. Navigating the balance between base R and the Tidyverse can be a challenge.
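To give a flavour of that “two languages” feeling, here is a minimal sketch that does the same small job twice: once in base R and once with dplyr. The task (mean miles per gallon by cylinder count, for cars over 100 horsepower, using R’s built-in mtcars data) is my own illustration, not an example from Bob’s article, and it assumes the dplyr package is installed.

```r
library(dplyr)

# Base R: subset with logical indexing, then aggregate with a formula.
powerful <- mtcars[mtcars$hp > 100, ]
base_result <- aggregate(mpg ~ cyl, data = powerful, FUN = mean)

# dplyr: the same steps written as a pipeline of verbs.
tidy_result <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mpg = mean(mpg))

# Both give one row per cylinder count, with the same means.
print(base_result)
print(tidy_result)
```

The two versions compute the same numbers, but they share almost no vocabulary, which is roughly what learners mean when they say they are juggling two languages.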

But as Bob also notes when discussing dplyr, a package that, together with its relatives, makes up the Tidyverse, learning it is well worth the effort. Conflicts between the Tidyverse and base R are not there for the hell of it, but because of decisions made by R’s original designers. These decisions probably seemed like a good idea at the time, but they conflict with better ways of doing things. The Tidyverse functions are just doing the best they can with the existing architecture.
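One design decision of this kind, chosen by me as an illustration rather than taken from Bob’s article, is what happens when you select a single column with square brackets. A base R data frame silently simplifies the result to a plain vector, whereas a tibble always gives back a tibble. The sketch below assumes the tibble package is installed; the data frame contents are made up.

```r
library(tibble)

# A small made-up data frame, and the same data as a tibble.
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
tb <- as_tibble(df)

# Selecting one column with [ , ] behaves differently.
base_col <- df[, "x"]   # simplified to an integer vector
tidy_col <- tb[, "x"]   # still a one-column tibble

is.data.frame(base_col)  # FALSE
is.data.frame(tidy_col)  # TRUE

# Base R can be told to keep the data frame, but you must ask.
kept <- df[, "x", drop = FALSE]
is.data.frame(kept)      # TRUE
```

Code written for one behaviour can break when handed the other kind of object, which is one concrete way the conflict shows up.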

As for what the Tidyverse contains, Bob talks about some of its features. And there are tutorials scattered around the web: I like monashbioinformaticsplatform.github.io’s “The tidyverse: dplyr, ggplot2, and friends”. Highlights for me so far include: tibbles, a reimplementation of R data frames; pipe notation, which makes it easy to write sequences of data transformations; and various functions for rearranging data, including spread, gather, nest, and unnest.
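The features just listed can be sketched in a few lines. This is a minimal illustration with made-up data (the country names and figures are hypothetical), assuming dplyr, tidyr (version 1.0 or later, for the cols argument of unnest), and tibble are installed. Note that the tidyr documentation now marks gather and spread as superseded by pivot_longer and pivot_wider, though the older functions still work.

```r
library(dplyr)
library(tidyr)
library(tibble)

# A tibble in "wide" form: one column per year (made-up figures).
wide <- tibble(
  country = c("A", "B"),
  `2019`  = c(1.0, 2.0),
  `2020`  = c(1.5, 2.5)
)

# gather: collapse the year columns into key/value pairs ("long" form).
long <- wide %>%
  gather(key = "year", value = "gdp", -country)

# spread: the inverse, turning the key column back into separate columns.
wide_again <- long %>%
  spread(key = year, value = gdp)

# nest: one row per country, with that country's rows in a list column.
nested <- long %>%
  group_by(country) %>%
  nest()

# unnest: expand the list column back into ordinary rows.
unnested <- nested %>%
  unnest(cols = data)

print(long)
print(nested)
```

Each step reads left to right through the pipe, so the sequence of transformations is visible in the order it happens.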
