Thoughts on nest()

I’ve been experimenting with the Tidyverse’s nest function, because it may be useful when, for example, using households together with benefit units. Below are some thoughts that I first posted as a comment to Hadley Wickham’s blog entry “tidyr 0.4.0”. More on this in future posts.

First, this is likely to be very useful to me. I’m translating a microeconomic model into R. Its input is a set of British households, where each household record contains data on income and expenditure. The model uses these to predict how their incomes will change if you change tax (e.g. by increasing income tax) or benefits (e.g. by increasing pensions or child benefit).

Our data splits households into “benefit units”. A benefit unit ( ) is defined as an adult plus their spouse if they have one, plus any dependent children they are living with. So for example, mum and dad plus 10-year old Johnnie would be one benefit unit. But if Johnnie is over 18, he becomes an adult who just happens to live with his parents, and the household has two benefit units. These are treated more-or-less independently by the benefit system, e.g. if they receive money when out of work.

This matters because each dataset contains one table for household-wide data, and another for benefit-unit-wide data. I’ve been combining these with joins. But it might be cleaner to nest each household’s benefit units inside the household data frame. Not least, because sometimes we have to change data in a household, for example when simulating inflation, and then propagate the changes down to the benefit units.

Second, nest and unnest could be useful elsewhere in our data. Each household’s expenditure data is divided into categories, for example food, rent, and alcohol. We may want to group and ungroup these. For example, I make:

d <- tibble( ID=c( 1, 1, 1,  2, 2,  3, 3,
                   4, 4, 4,  5, 5, 5,  6, 6 ),
             expensetype=c( 'food', 'alcohol', 'rent',  'food', 'rent',  'food', 'rent',
                            'food', 'cigarettes', 'rent',  'food', 'alcohol', 'rent',  'food', 'rent' ),
             expenditure = c( 100, 50, 400,  75, 300,  90, 400,
                              100, 30, 420,  75, 50, 550,  150, 600 )  


d %>% group_by(ID) %>% nest %>% arrange(ID)
gives me a table
1 tibble [3 x 2]
2 tibble [2 x 2]
3 tibble [2 x 2]
4 tibble [3 x 2]
5 tibble [3 x 2]
6 tibble [2 x 2]
where the first column is the ID and the second is a table such as
food      100
alcohol    50
rent      400
So in effect, it makes my original table into a function from ID to ℘ expensetype × expenditure.

Whereas if I do

d %>% group_by(expensetype) %>% nest %>% arrange(expensetype)
I get the table
alcohol    tibble [2 x 2]
cigarettes tibble [1 x 2]
food       tibble [6 x 2]
rent       tibble [6 x 2]
where the first column is expenditure category and the second holds tables of ID versus expenditure. In effect, a function from expensetype to ℘ ID × expenditure. This sort of reformatting may be very useful.

Third. Continuing from the above, I wrote this function:

functionalise <- function( data, cols )
  data %>% group_by_( .dots=cols ) %>% nest %>% arrange_( .dots=cols )
The idea is that functionalise( data, cols ) reformats data into a data frame that represents a function. The columns cols represent the function’s domain, and will never contain more than one occurrence of the same tuple. The remaining column represents the function’s codomain. Thus, functionalise(d,"expensetype") returns the data frame shown in the last example.

Fourth, I note that I can write either

d %>% group_by( expensetype ) %>% nest %>% arrange( expensetype )
nest( d, ID, expenditure )

In the first, I have to name those columns that I want to act as the function’s domain. In the second, I have to name the others. I find the first more natural.

Fifth, nest and unnest, as well as spread and gather make it very easy to generate alternative but logically equivalent representations of a data frame. But every time I change representation in this way, I have to rewrite all the code that accesses the data. It would be really nice if either a representation-independent way of accessing it could be found, or if nest/unnest and spread/gather could be made to operate on the code as well as the data. Paging Douglas Ross and the Uniform Referent Principle…

Leave a Reply

Your email address will not be published. Required fields are marked *