R-Taxben: a Microsimulation Economic Model in R

I’ve just finished a slideshow about a preliminary translation of a simulation of the UK economy from Python into R. The simulation is the one I’ve been blogging about in recent posts. It’s a “microeconomic” model, meaning that it simulates at the level of individual people. In contrast, a “macroeconomic” model (which this is not) would simulate bulk economic variables such as inflation and unemployment.

The model reads data about the expenditure and income of British households. It contains rules describing the tax and benefits system, and a front-end through which the user can change these: for example, by increasing VAT on alcohol or by decreasing child benefit. With these, it calculates how the changes would affect each household’s net expenditure and income.

The model works with various different data sets. For our initial translation, we’re concentrating on the Government’s Family Resources Survey, which surveys 20,000 households each year. See “Introduction to the Family Resources Survey” for an overview of the 2013-2014 survey. This and other documentation, including details of the columns in the data tables, are available at the “Family Resources Survey, 2013-2014” data catalogue page.

Tax-benefit models like ours can be used in various ways. One is evaluation of Government economic policy. If the Government increases income tax, how will this affect net incomes? Will it push a lot of low-income people into poverty? Or will it extract income mainly from the better off? To answer such questions, we need to summarise how change affects the population as a whole, via tools such as graphs of income distribution.

Another use is education: about economics, about policy, and about statistics and data analysis. We believe it’s an important application, where R will be very helpful.

Yet a third is as a “reference standard”. The benefits system is notoriously complicated, making it difficult for claimants to know what they’re entitled to. To help, some organisations have written their own guides to the system. For example, the Child Poverty Action Group publishes its Welfare Benefits and Tax Credits Handbook: “an essential resource for all professional advisers serious about giving the best and most accurate advice to their clients”. We would love R-Taxben to become a reference standard for the benefits system, against which interpretations of the benefits rules could be tested: an electronic Handbook. We think this is feasible. It will, of course, require our code to be very, very readable. Like this, perhaps:

# Returns a family_type. See the
# family_type document.
# age_2 and sex_2 are only defined
# if is_couple.
#
family_type <- function( age_1, sex_1
                       , age_2, sex_2
                       , is_couple
                       , num_children
                       )
{
  ad1_is_pensioner <- of_pensionable_age( age_1, sex_1 )
  ad2_is_pensioner <- of_pensionable_age( age_2, sex_2 )

  is_single <- !is_couple

  case_when(
    is_couple & ( ad1_is_pensioner | ad2_is_pensioner ) ~ 'couple_pensioner'
  , is_single & ad1_is_pensioner                        ~ 'single_pensioner'
  , is_couple & num_children == 0                       ~ 'couple_no_children'
  , is_single & num_children == 0                       ~ 'single_no_children'
  , is_couple & num_children > 0                        ~ 'couple_children'
  , is_single & num_children > 0                        ~ 'single_children'
  )
}
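The helper of_pensionable_age isn’t shown above. A minimal sketch of it, assuming (simplistically) the historical state pension ages of 65 for men and 60 for women (the real rules depend on date of birth), might be:

library( dplyr )   # for case_when()

# A simplified stand-in for the real helper: pension
# ages of 65 for men and 60 for women. The actual
# rules depend on date of birth.
of_pensionable_age <- function( age, sex )
{
  ifelse( sex == 'male', age >= 65, age >= 60 )
}

family_type( age_1=67, sex_1='male'
           , age_2=64, sex_2='female'
           , is_couple=TRUE
           , num_children=0
           )
#   [1] "couple_pensioner"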

I wrote my slideshow for a non-R-speaking colleague, and then decided I should make it public. It presents R a fragment at a time, with screenshots to show what the interaction is like. Having demonstrated some elementary features such as assignment and function calls, the slideshow moves on to more advanced things that I use in the model, or that will be useful for users of its R interface when probing economic data.

I see this as helping the users “reify” economic concepts and data: making them more “thing-like”, more “manipulable”, like toys in a playground.

Indeed, as the slides explain, R-Taxben could be a playground for the mind: an “educational microworld” where students can experiment and learn in a non-threatening and stimulating environment. This idea was promoted by Seymour Papert, co-designer of Logo, the language with which children learned mathematics by programming robot “turtles” to draw squares and other figures. There’s an introduction (from 1983) in Education Week’s “Seymour Papert’s ‘Microworld’: An Educational Utopia” by Charlie Euchner. A survey of Logo and other microworlds can be found in “Microworlds” by Lloyd P. Rieber. To quote:

However, another, much smaller collection of software, known as microworlds, is based on very different principles, those of invention, play, and discovery. Instead of seeking to give students knowledge passed down from one generation to the next as efficiently as possible, the aim is to give students the resources to build and refine their own knowledge in personal and meaningful ways.

Rieber describes their effect on learning thus:

Similar to the idea that the best way to learn Spanish is to go and live in Spain, Papert conjectured that learning mathematics via Logo was similar to having students visit a computerized Mathland where the inhabitants (i.e., the turtle) speak only Logo. And because mathematics is the language of Logo, children would learn mathematics naturally by using it to communicate to the turtle. In Mathland, people do not just study mathematics, according to Papert, they “live” mathematics.

So with R-Taxben, could we build an EconomicsLand?

Returning to R, a lot of the slideshow’s more advanced demonstrations feature the Tidyverse — hat-tip to its inventor Hadley Wickham. This cornucopia of data-manipulation tools can be confusing, because it’s very unlike base R; but as I blogged in “Should I Use the Tidyverse?”, I think the confusion is worth it. I’d also like to give something back to the Tidyverse team, by showing lots of ways it can be useful in such a project.

One of these ways is in analysing and displaying data. The plots below are of FRS household income versus ethnic group.
They were suggested to me by an exercise in “SC504: Statistical Analysis: ASSIGNMENT ONE: Spring 2008”, an assignment for sociology students at the Essex University School of Health and Social Care. I did the plots using the Tidyverse’s ggplot function, with the “ggthemes” package to display them in the style of The Economist. As you can see from the screenshot, I didn’t need to write very much code, and I’ve got a lovely publication-quality plot.
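The kind of call involved can be sketched like this, with made-up data standing in for the FRS columns (the variable names here are illustrative, not the real FRS ones):

library( tidyverse )
library( ggthemes )

# Made-up stand-in for the FRS household table.
households <- tribble( ~ethnic_group, ~weekly_income
                     , 'White',       500
                     , 'White',       650
                     , 'Asian',       480
                     , 'Asian',       620
                     , 'Black',       450
                     , 'Black',       590
                     )

ggplot( households, aes( x=ethnic_group, y=weekly_income ) ) +
  geom_boxplot() +
  theme_economist() +   # ggthemes's Economist house style
  labs( title='Household income by ethnic group' )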

The above also shows that even the raw FRS data, untransformed by our model, can be educational. Another example of analysis and display comes from my slide on “Probing Data: Multi-level Grouping”. This uses the Tidyverse’s summarise function to count and histogram the number of “benefit units” — adult plus spouse if any, plus children — per household.
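The counting itself can be sketched as follows, on made-up data; benunit_id numbers the benefit units within a household, and is my invention, not the real FRS column name:

library( tidyverse )

# Made-up data: one row per benefit unit.
benunits <- tribble( ~hh_id, ~benunit_id
                   ,      1,          1
                   ,      2,          1
                   ,      2,          2
                   ,      3,          1
                   )

counts <- benunits %>%
  group_by( hh_id ) %>%
  summarise( benefit_units = n() )

ggplot( counts, aes( x=benefit_units ) ) +
  geom_bar()   # benefit units per household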

For some reason, the counts almost follow a logarithmic distribution.

R-Taxben is not yet complete: what I have now is a proof of concept. It was funded by Landman Economics, and we’ll be looking for grants to finish the work. Some indication of where the emphasis lies is that my slideshow has nine slides about testing. We want this model to be reliable and accurate, and that means rigorous testing. That’s all the more necessary because R lacks static typing, which is what makes languages like Pascal and Ada safe.

(On the other hand, the freedom to put any kind of value almost anywhere does make prototyping easier. As Michael Clarkson points out in his lecture on static versus dynamic checking, there are trade-offs. But as he also points out, few people now believe that “strong types are for weak minds”. Humans are really bad at avoiding bugs, and we need all the help we can get!)

Not only do we not want surprises when R-Taxben runs, we don’t want users to be surprised by the assumptions built into it. Traditionally, economic models have been opaque, their assumptions known only to the few who implemented them. R programmers can probe R-Taxben’s data using R. But I’ve also implemented a novel web-based interface which “visualises” the model as a network of data nodes connected by functions, so that even non-programmers can peer inside: I think that’s essential.

And, though I agree with Donald Knuth that premature optimisation is the root of all evil, R-Taxben does have to be fast enough to be usable.

One thing we may need is help merging files. There are a lot of FRS files: one for the households themselves, but others for data such as: mortgage payments; accommodation rentals; house purchases; payments into pension schemes; benefits claims; jobs; and share dividends and other non-job income. We need to translate each of these into a form the model can use, then merge with the households. To make my code easy to read and maintain, I’ve written a separate function for translating each file, and I then JOIN the results. But I suspect this won’t be terribly fast, and almost certainly not as fast as the Python code which looped over files in parallel, merging records one by one.
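A sketch of the join-based approach, on made-up miniature tables (the column names are illustrative):

library( tidyverse )

# Tiny stand-ins for the translated FRS files.
households <- tribble( ~hh_id, ~region
                     ,      1, 'London'
                     ,      2, 'Scotland'
                     )
mortgages  <- tribble( ~hh_id, ~monthly_payment
                     ,      1, 900
                     )
pensions   <- tribble( ~hh_id, ~contribution
                     ,      1, 200
                     ,      2, 150
                     )

merged <- households %>%
  left_join( mortgages, by='hh_id' ) %>%
  left_join( pensions,  by='hh_id' )
# Household 2 has no mortgage record, so its
# monthly_payment is NA in 'merged'.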

Another is something I blogged under the cryptic subtitle “Why Douglas T. Ross would hate nest(), unnest(), gather() and spread()”. Douglas Ross was one of the first to propose what he called the Uniform Referent Principle: that code for extracting or changing data should be independent of how the data is stored. Following the principle means you can change storage without affecting the rest of your program; not following it means every change bleeds into the rest of your code, with all the consequent time-wasting updates, typos, and re-testing.

Why? Look at the tables below. They represent four different ways of storing my income data.

Person   Income_Type   Income_Value
Alice    Wages         37000
Alice    Bonuses       0
Alice    Benefits      0
Bob      Wages         14000
Bob      Bonuses       1000
Bob      Benefits      6000

Person   Income_Wages   Income_Bonuses   Income_Benefits
Alice    37000          0                0
Bob      14000          1000             6000

Person   Income
Alice    Type       Value
         Wages      37000
         Bonuses    0
         Benefits   0
Bob      Type       Value
         Wages      14000
         Bonuses    1000
         Benefits   6000

Person   Income
Alice    Wages   Bonuses   Benefits
         37000   0         0
Bob      Wages   Bonuses   Benefits
         14000   1000      6000

Abstractly, the data is the same in each case. But the tables are implemented in very different ways. If you access their elements with $ or an equivalent, and you then change the implementation, you have to reprogram all those accesses. I’ve written some code which hides implementation details, so that I can access the different representations without having to change the interface, but again, it may not be efficient. It may also not work well with vectorisation, the way R implicitly loops over entire vectors. It would be great to have R experts, even R implementors, who were willing to advise on this, and even to collaborate on our grant applications.
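As a sketch of what I mean by hiding the implementation, here are accessors with one interface over two of the representations above. The function name and the dispatch test are mine, for illustration only:

library( tidyverse )

long_form <- tribble( ~Person, ~Income_Type, ~Income_Value
                    , 'Alice', 'Wages',      37000
                    , 'Bob',   'Wages',      14000
                    )

wide_form <- tribble( ~Person, ~Income_Wages
                    , 'Alice', 37000
                    , 'Bob',   14000
                    )

# Same interface, different storage underneath.
wages <- function( t, person )
{
  if ( 'Income_Wages' %in% names(t) )
    t %>% filter( Person == person ) %>% pull( Income_Wages )
  else
    t %>% filter( Person == person, Income_Type == 'Wages' ) %>%
      pull( Income_Value )
}

wages( long_form, 'Alice' )   # 37000
wages( wide_form, 'Alice' )   # 37000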

The slideshow was written in Hakim El Hattab et al.’s reveal.js. This is a JavaScript system for building web-based slideshows. A demonstration can be seen at revealjs.com.

The contents page was implemented with Frederick Feibel’s Presentable plugin.

Two Kinds of Conditional Expression: ifelse(A,B,C) versus if (A) B else C

One of the many confusing things about R is that it has two kinds of conditional expression. There’s ifelse(); and there’s the if statement. It’s important to know which one to use, as I found when trying to write a conditional expression that chose between lists.

The first thing to appreciate is that if can be used as a conditional expression as well as a conditional statement. Probably most programmers use it as a statement, like this:

> greet_or_leave <- 'GREET'
> if ( greet_or_leave == 'GREET' ) cat('HELLO') else cat('GOODBYE')
HELLO> 
But you can equally well use it as an expression:
> greeting <- if ( greet_or_leave == 'GREET' ) 'HELLO' else 'GOODBYE'
> greeting
[1] "HELLO"

The latter is what I’m interested in in this article. How does it compare with ifelse()? For simple uses, they seem to do the same thing:

> ifelse( TRUE, 1, 2 )
[1] 1
> if ( TRUE ) 1 else 2
[1] 1
> ifelse( FALSE, 1, 2 )
[1] 2
> if ( FALSE ) 1 else 2
[1] 2

But this equivalence breaks down when you ask them to return a list rather than a scalar. The ifelse() returns only the first element of the list. To return it all, you have to use if:

> ifelse( TRUE, list(a=1,b=2), list(a=1,b=2) )
[[1]]
[1] 1

> if ( TRUE ) list(a=1,b=2) else list(a=1,b=2)
$a
[1] 1

$b
[1] 2

This bit me when I was using recode() from the Tidyverse. This function takes a vector and translates each element by looking it up in a list of name-replacement pairs formed by the following arguments. Thus, if codes is c( 'a', 'b', 'c' ), the call

recode( codes, a=1, b=2, c=3 )
returns c(1,2,3). I wanted a version of recode which takes all the replacements in one argument. I implemented it by using !!! to splice these into the call, as demonstrated under the “Capturing multiple variables” section of “Programming with dplyr”:
recode_with_list <- function( x, other_args )
{
  recode( x, !!! other_args )
}
So the call
recode_with_list( codes, list( a=1, b=2, c=3 ) )
also returns c(1,2,3).

I used this when translating data about households in our economic model. Each household has a numeric field indicating its region. We need to convert this to a meaningful string, such as “London”, “Scotland”, or “North East”. That’s easy to do with recode_with_list() and a translation list mapping codes to region names. But unfortunately, different data sets use different coding conventions, so I needed conditionals to select between translation lists. Initially, I did this with ifelse(), like this:

translation_list_1 <- 
  list( '1000'='London', '1001'='Scotland', '1002'='North East' )

translation_list_2 <- 
  list( '1'='London', '2'='Scotland', '3'='North East' )

dataset <- tribble( ~id, ~region_codes
                  ,   1,          1000
                  ,   2,          1001
                  ,   3,          1000
                  )

dataset_follows_convention_1 <- TRUE

dataset$regions <-
  recode_with_list( dataset$region_codes
                  , ifelse( dataset_follows_convention_1
                          , translation_list_1
                          , translation_list_2
                          )
                  )

But I found that recode_with_list() complained “Unreplaced values treated as NA as .x is not compatible. Please specify replacements exhaustively”. This must have been because the ifelse() was returning only one list element, and stripping it of its name. After a bit of thought and experimenting, I realised that I could rewrite as:

dataset$regions <-
  recode_with_list( dataset$region_codes
                  , if ( dataset_follows_convention_1 )
                      translation_list_1
                    else
                      translation_list_2
                  )

This worked, but why didn’t ifelse()? The documentation says that ifelse(test, yes, no) returns a value with the same shape as test which is filled with elements selected from either yes or no depending on whether the element of test is TRUE or FALSE. The “same shape” bit is what’s important, because it means that test determines how many elements ifelse(test, yes, no) returns from yes and no. In my case, dataset_follows_convention_1 had only one element (R scalars are, in reality, single-element vectors), which means that ifelse(test, yes, no) only returned one element from translation_list_1 and translation_list_2.

You can see the influence of the “shape” below. As test becomes longer and longer, so do yes and no:

> ifelse( TRUE, translation_list_1, translation_list_2 )
[[1]]
[1] "London"

> ifelse( c(TRUE,FALSE), translation_list_1, translation_list_2 )
[[1]]
[1] "London"

[[2]]
[1] "Scotland"

> ifelse( c(TRUE,FALSE,TRUE), translation_list_1, translation_list_2 )
[[1]]
[1] "London"

[[2]]
[1] "Scotland"

[[3]]
[1] "North East"

I’m not the only person to have been bitten by this, as “Ryogi”‘s Stack Overflow question “if-else vs ifelse with lists” shows. There are probably other things to beware of too. I notice that the results just above have lost the names from my lists. Moreover, the documentation warns that if(test) yes else no is much more efficient and often much preferable to ifelse(test, yes, no) whenever test has length 1. That’s presumably because ifelse() will waste a lot of time selecting and discarding elements. Indeed, there’s also a warning that “Sometimes it is better to use a construction such as

(tmp <- yes; tmp[!test] <- no[!test]; tmp)
, possibly extended to handle missing values in test”.

This is not good. The whole point of a high-level language is to provide notations that enable you to express your problem clearly and concisely. It’s the language’s responsibility to compile them into efficient code, not the programmer’s. R designers, please note.

Beat the Delays: installing R and the BH package on a memory stick

I use R on a range of Windows machines. Often, I’ll only use a machine once, and it won’t already have R, so I want to carry an installation with me. I therefore decided to install R on a memory stick. Installing R itself worked, once I’d changed the folder on the “Select Destination Location” pop-up. But when I then tried installing the Tidyverse package, it seemed to hang. This is just a note for others facing the same problem, which in my case was caused by the BH package.

I tried installing Tidyverse packages one by one, and found that the hang was probably caused by BH. Some Googling led me to “mondano”‘s question https://stackoverflow.com/questions/31272280/installation-of-r-package-bh-not-possible, to which Dirk Eddelbuettel replied that “BH, as a sizeable subset of Boost Headers is big, as in really big”. It’s 111 megabytes, and mondano might simply have run out of patience if Windows was slow writing the files.

So I tried installing BH on its own, and using the Windows file explorer to watch the files. Indeed, I could see a temporary copy of BH slowly taking shape in a subdirectory of R-3.4.2/library called file1afc7c1c1629, with files slowly appearing under its subdirectories. Presumably, once the temporary copy had been made, R would copy it upwards to library where the other packages lived.

But after far too many minutes — memory sticks must be really slow — when the temporary copy was complete, it never did get copied upwards. Instead, I got the error Warning: unable to move temporary installation.

It would be nice to have been told the reason for the failure, but R is not that clever. There was enough space on the stick, so that wasn’t the problem. Time to Google once more. In https://stackoverflow.com/questions/5700505/windows-7-update-packages-problem-unable-to-move-temporary-installation, Tal Galili had the same error with the MASS package. The consensus reply was that this is probably an antivirus program locking a file.

So I installed R, and the Tidyverse, on my hard drive, which always works. I then copied its version of the final BH install to R-3.4.2/library. And the memory-stick Tidyverse then worked beautifully.
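In R terms, the workaround amounts to this. The drive letters are illustrative: whatever your hard drive and stick happen to be.

# Install BH into the hard-drive library, which works...
install.packages( 'BH', lib='C:/R-3.4.2/library' )

# ...then copy the finished package across to the stick.
file.copy( from='C:/R-3.4.2/library/BH'
         , to='E:/R-3.4.2/library'
         , recursive=TRUE
         )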

Experiments with count(), tally(), and summarise(): how to count and sum and list elements of a column in the same call

Most people have a job. Some don’t. And a few have more than one. I’ve mentioned before that our economic model works on data about British households, gathered from surveys such as the Family Resources Survey. Each collection of data is a year long, and contains a file that describes all the adults in the 20,000 or so households covered. There’s another file that describes their jobs. I have to make the model aggregate and merge these.

As an example, assume Pat has three jobs. Her adult data consists of one record, with details such as age and sex. Her jobs data has one record per job, stating how much the job pays. I need to squish these three records down to one which has a field for the number of jobs, a field for the total earnings, and a field with a list of earnings. And I must then merge the result with Pat’s adult record. Now, the jobs file and the adults file are both sorted in order of ID. In our Python version of the model, we looped over these, matching IDs, doing subsidiary loops round records with the same ID to aggregate data, and then combining the results. But in R, I’d like to write code that’s more concise, easier to apprehend, easier to tweak, and that suits R’s programming idioms. As it happens, the Tidyverse can do this easily with summarise(). But it took me a bit of time to get there, and I’m going to show the experiments I did, and why. I’ll precede these by explaining the main points I wanted to investigate, and what I discovered.

First, I’ll show the call to summarise() that worked. Suppose my jobs table t has a column id which identifies the adult, and a column value which gives a job’s earnings. Here’s such a table, obviously not realistic.

t <- tribble( ~id, ~value
            , 1  , 101
            , 1  , 102
            , 2  , 201
            , 3  , 301
            , 3  , 302
            , 3  , 303
            , 4  , 401
            )
From this, I want to make a table whose counts column says how many jobs the adult has, whose sums column is their total earnings, and whose lists column is a list of earnings. This is how:
t %>% 
  group_by( id ) %>% 
  summarise( counts=n(), sums=sum(value), lists=list(value) )

This may look obvious. But I was thrown off course, partly by some typos I made early on, and partly by the questions I was thinking about when I wrote “Second-Guessing R”.
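To finish the Pat example, the summarised jobs then merge onto the adults with a single join. A sketch, using the table t from above and made-up adult records:

adults <- tribble( ~id, ~age
                 ,   1,  42
                 ,   2,  57
                 ,   3,  29
                 ,   4,  61
                 )

adults %>%
  left_join( t %>%
               group_by( id ) %>%
               summarise( counts=n(), sums=sum(value), lists=list(value) )
           , by='id'
           )
# One row per adult, with their job count, total
# earnings, and list of earnings alongside age.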

Second, I’ll mention something that surprised me. In the tables these generate, each element of the lists column turns out to be an atomic vector, not a list. So identical( tsl[[1,4]], list(101,102) ) is FALSE, but identical( tsl[[1,4]], c(101,102) ) is TRUE. I don’t know whether this is something odd about base R or about summarise().

Third, a surprise arising from the above. I can’t call c() instead of list(). If I do, I get an error Error in summarise_impl(.data, dots): Column `lists` must be length 1 (a summary value), not 2. That sort of makes sense, in R’s strange world. Something seems to be “lowering” list() so that it produces atomic vectors when it would normally produce lists. So perhaps any other function also gets “lowered”. Including c(). But c() is already as low as it can get. You can’t have anything lower than an atomic vector. So “lowering” c() would render it useless. Which is what’s happened.

Fourth, I wanted to see whether there was anything special about summarise(). Would the same functions work from mutate()? Indeed they would: they’re not restricted to summarise(). The code above produces one row per adult. Each group (i.e. each collection of rows for an adult’s jobs) has been “collapsed”, as the Tidyverse documentation for tally() puts it. But if I want to calculate counts, sums and lists without collapsing, then I can call mutate() instead of summarise():

 tsl <- t %>% 
  group_by( id ) %>% 
  mutate( counts=n(), sums=sum(value), lists=list(value) )
This adds the count, sum, and list to every row of the original uncollapsed table.

Fifth, another surprise involving c(). This time, I can call it. But it just causes the lists column to duplicate the original values.

Sixth, I wanted to see whether summarise() would work if I didn’t group the table first. And it does. It treats the entire table as one group, and therefore collapses it to a single row.

Seventh, would mutate() also work if I didn’t group the table first? It should, of course, because that’s how it’s normally used. And it does. n() counts the rows in the original table, and sum() and list() get applied to the entire value column.

My eighth and final point is perhaps more general, being about how summarise() and mutate() treat their functions. For more about why I wanted to investigate this, read “Second-Guessing R”. In essence, I’ve found that the name of a column is always replaced by a vector. If the table holding the column is ungrouped, the vector is the entire column, just as it would be in a base-R call such as data_frame$values * 2. If it’s grouped, the vector becomes whatever slice of that column belongs to the first group, then whatever slice of that column belongs to the second group, and so on. This seems to be true whatever the function it’s passed to.

Also, the Tidyverse does not appear to recognise certain functions as special because of their name or other identity. It treats the built-in list() the same as a function with different code or a different name.

With all that out of the way, here are my experiments. Incidentally, there are some that I haven’t mentioned above, mainly with tally() and count().

# try_sum_count_and_list.R
#
# When processing adults data, I
# need to summarise the earnings
# from each adult's jobs. An adult
# may have more than one job, and
# I need to make fields containing
# the number of jobs, the earnings
# summed, and a list of the 
# individual earnings. These are
# experiments in doing this with
# Tidyverse functions.
#
# This led on to some general
# questions about the functions,
# which I've blogged in
# http://www.j-paine.org/blog/2017/11/experiments-with-count-tally-and-summarise-how-to-count-and-sum-and-list-elements-of-a-column-in-the-same-call.html .


library( tidyverse )


t <- tribble( ~id, ~value
            , 1  , 101
            , 1  , 102
            , 2  , 201
            , 3  , 301
            , 3  , 302
            , 3  , 303
            , 4  , 401
            )

t %>% 
  group_by( id ) %>%
  tally()
#
#        id     n
#   1     1     2
#   2     2     1
#   3     3     3
#   4     4     1
#
# The 'n' column gives the
# number of rows in each 
# group. Unlike with add_count()
# below, the result is 
# collapsed so there's only 
# one row per ID.


t %>% 
  count( id )
#
# The same.


t %>% 
  add_count( id )
#
#        id value     n
#   1     1   101     2
#   2     1   102     2
#   3     2   201     1
#   4     3   301     3
#   5     3   302     3
#   6     3   303     3
#   7     4   401     1
#
# The table is not collapsed,
# so each ID has the same 
# number of rows as before. But
# they all get an 'n' column
# giving their count. add_tally()
# would do the same.


t %>% 
  group_by( id ) %>%
  tally( wt=value )
#
#        id     n
#   1     1   203
#   2     2   201
#   3     3   906
#   4     4   401
#
# Here, the 'n' column
# is the sum over 'value'
# within each group.


t %>% 
  count( id, wt=value )
#
# The same.


t %>% 
  add_count( id, wt=value )
#
#        id value     n
#   1     1   101   203
#   2     1   102   203
#   3     2   201   201
#   4     3   301   906
#   5     3   302   906
#   6     3   303   906
#   7     4   401   401
#
# As before with add_count(),
# the table is not collapsed.
# An 'n' cell has been appended
# to each row, giving the
# sum over the value's in
# that row's group. add_tally()
# would do the same.


t %>% 
  add_count( id ) %>%
  filter( n == 1 )
#
#        id value     n
#   1     2   201     1
#   2     4   401     1
#
# As the documentation notes,
# "add_count() is useful for groupwise 
# filtering. E.g.: show only species 
# that have a single member."


t %>% 
  group_by( id ) %>% 
  summarise( n(), sum(value) )
#
#        id `n()` `sum(value)`
#   1     1     2          203
#   2     2     1          201
#   3     3     3          906
#   4     4     1          401
#
# With tally() and count(), I
# can't see a way to count rows and
# sum values in the same call. With
# summarise(), I can.


t %>% 
  group_by( id ) %>% 
  summarise( counts=n(), sums=sum(value) )
#
# As above, but the new columns have
# my names 'counts' and 'sums'.


tsl <- t %>% 
  group_by( id ) %>% 
  summarise( counts=n(), sums=sum(value), lists=list(value) )
# 
#        id counts  sums     lists
#   1     1      2   203 <dbl [2]>
#   2     2      1   201 <dbl [1]>
#   3     3      3   906 <dbl [3]>
#   4     4      1   401 <dbl [1]>
#
# So as well as counting rows and
# summing values, I can aggregate values 
# into a collection. But beware. Each 
# cell of 'lists' becomes an atomic 
# vector, not a list. The code below 
# proves this.
#
# (These are the first and second points
# in my blog post.)

identical( tsl[[1,4]], list(101,102) ) 
#   [1] FALSE

identical( tsl[[1,4]], c(101,102) ) 
#   [1] TRUE



tsc <- t %>% 
  group_by( id ) %>% 
  summarise( counts=n(), sums=sum(value), lists=c(value) )
#
# Gives an error:
#   Error in summarise_impl(.data, dots) : 
#   Column `lists` must be length 1 (a summary value), not 2
#
# So I can't aggregate by calling c().
# I do have to call list().
#
# (This is the third point in my blog post.)


tsl <- t %>% 
  group_by( id ) %>% 
  mutate( counts=n(), sums=sum(value), lists=list(value) )
#
#        id value counts  sums     lists
#   1     1   101      2   203 <dbl [2]>
#   2     1   102      2   203 <dbl [2]>
#   3     2   201      1   201 <dbl [1]>
#   4     3   301      3   906 <dbl [3]>
#   5     3   302      3   906 <dbl [3]>
#   6     3   303      3   906 <dbl [3]>
#   7     4   401      1   401 <dbl [1]>
#
# If I want the same effect as with 
# add_tally() and add_count(), this
# is the way to do it. It adds the
# count, sum, and list to every row
# of the original uncollapsed table.
#
# This also shows that n() and sum()
# are not restricted to use from
# summarise(). You can use them
# on grouped tables from mutate().
# And on ungrouped tables, I suppose,
# but that's probably not useful.
#
# Note: I've not shown it here, but
# the result is grouped, so it's 
# probably best to ungroup it.
#
# (This is the fourth point in my blog 
# post.)


tsl <- t %>% 
  group_by( id ) %>% 
  mutate( counts=n(), sums=sum(value), lists=c(value) )
#
#        id value counts  sums lists
#   1     1   101      2   203   101
#   2     1   102      2   203   102
#   3     2   201      1   201   201
#   4     3   301      3   906   301
#   5     3   302      3   906   302
#   6     3   303      3   906   303
#   7     4   401      1   401   401
#
# Very odd, and not useful. The result
# of c() doesn't get put into each cell
# of 'lists', but treated as a slice of the
# column. Probably related to the 
# behaviour I commented on in
# http://www.j-paine.org/blog/2017/10/experiments-with-summarise-or-when-does-x-sub-1-equal-x.html .
#
# As above, and as expected, the table
# is grouped.
#
# (This is the fifth point in my blog 
# post.)


tsl <- t %>% 
  group_by( id ) %>% 
  transmute( counts=n(), sums=sum(value), lists=list(value) )
#
# This produced the same table as above,
# but without the 'value' column, and 
# accompanied by a warning:
#   Adding missing grouping variables: `id`


tsl <- t %>%  
  summarise( counts=n(), sums=sum(value), lists=list(value) )
#
#     counts  sums     lists
#   1      7  1711 <dbl [7]>
#
# Out of interest, this is what
# happens if I summarise without grouping.
# The table is treated as one group.
#
# (This is the sixth point in my blog 
# post.)


tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=list(value) )
#
#        id value counts  sums     lists
#   1     1   101      7  1711 <dbl [7]>
#   2     1   102      7  1711 <dbl [7]>
#   3     2   201      7  1711 <dbl [7]>
#   4     3   301      7  1711 <dbl [7]>
#   5     3   302      7  1711 <dbl [7]>
#   6     3   303      7  1711 <dbl [7]>
#   7     4   401      7  1711 <dbl [7]>
#
# And this is what happens if I mutate
# without grouping. The aggregating 
# functions n(), sum() and list() get
# applied to the entire column. Their
# result is then appended to each row.
#
# (This is the seventh point in my
# blog post.)


my_list <- list

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# The same table as above.
#
# I wanted to see whether mutate() is 
# treating 'list' specially. For example,
# does it recognise the name and do
# something special because of it?
# This shows that it doesn't, because 
# I'm using a different name but getting
# the same result.
#
# (This and the stuff below make the 
# eighth point in my blog post.)


my_list <- function( x ) list( x )

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# The same table as above. 
#
# mutate() might have been recognising
# the value of list(), i.e. the pointer
# to its code. This shows it isn't,
# because I'm using a different pointer
# but getting the same result.


my_list <- function( x ) str_c( x[1], x[2], x[3], x[4], x[5], x[6], x[7], sep="," )

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# A table like those above but where each
# element of the 'lists' column is a string
# concatenation of all the values in
# the 'value' column. This shows that
# my_list() is receiving all the values
# in one go.


my_list <- function( x ) x[1]

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# A table like those above but where each
# element of the 'lists' column is 101.
# This is consistent with my conclusions
# above.


tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=value[1] )
#
# The same as the last table.
#
# This shows that it doesn't matter whether 
# the subscripting is hidden from mutate()'s
# view. In other words, it doesn't treat
# [] specially.


tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=value*2 )
#
# The same as the tables above, but each element
# of 'lists' is twice the corresponding element
# of 'value'.


my_list <- function(x) { cat("Calling my_list() with argument "); dput(x); x*2 } 

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# The same table as above. Outputs
#   Calling my_list() with argument c(101, 102, 201, 301, 302, 303, 401) .
#
# This confirms that my_list() gets called
# only once and is passed the entire column.
# 
# The point of these last two experiments is
# that when I first used calls such as 
#   mutate( new_value=value*2 )
# I assumed * was getting called once per row,
# and that mutate() substituted that row's
# 'value' as the argument to * . But that's
# not true. * gets the entire column. This
# works because of R's vectorisation.
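As a footnote to this: if per-row evaluation is what's genuinely wanted, dplyr provides rowwise(). Here is a minimal sketch of the contrast, not part of the original experiments; take_first() is a made-up helper, deliberately not vectorised:

```r
library(tidyverse)

t2 <- tribble( ~id, ~value
             ,  1 ,   101
             ,  1 ,   102
             ,  2 ,   201
             )

take_first <- function( x ) x[1]   # hypothetical helper, not vectorised

# Plain mutate(): take_first() sees the whole column once,
# so every row gets the column's first value, 101.
whole_column <- t2 %>% mutate( first = take_first(value) )

# rowwise(): the expression is evaluated once per row, so
# take_first() sees a length-1 slice and 'first' equals 'value'.
per_row <- t2 %>% rowwise() %>%
  mutate( first = take_first(value) ) %>% ungroup()
```

So vectorised expressions like value*2 need no special treatment, but non-vectorised functions behave very differently under the two regimes.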

Experiments with summarise(); or, when does x=x[[1]]?

Here’s another innocent-eye exploration, this time about the Tidyverse’s summarise() function. I’d been combining data tables by nesting and joining them, which gave me a tibble with nested tibbles in. I wanted to check the sizes of these inner tibbles, by mapping nrow() over the columns containing them. The Tidyverse provides several ways to do this, one of which (I thought) would be summarise(). So I tried calling it with the argument s=nrow(tibbles), where tibbles was the column with the tibbles in. It crashed. Why? And how should I make it work?

The insight I got from these experiments is that summarise() passes its summarising functions a slice of a column, not an element. To illustrate the difference in meaning: a slice of a list is a list, possibly with fewer elements; a slice of a table is a table, possibly with fewer rows. But an element of a list is not a list, and an element of a table is not a table. I’d overlooked the distinction because I’m used to table columns being atomic vectors. In these, there’s no difference between an element of the vector and a slice. This is because R is strange, and regards all numbers and other primitive values as one-element vectors.

But when the columns are lists, there is a difference. The summarising functions get passed a list containing the element rather than the element itself, so they have to unwrap it. And, by the way, if they return a result that’s to go into a list column, they must wrap it. That’s important if I want them to return tibbles.
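The slice-versus-element distinction can be demonstrated in base R alone. A minimal sketch, separate from the experiments below:

```r
l <- list(10, 20, 30)

l[2]     # a slice: a one-element list containing 20
l[[2]]   # an element: the number 20 itself

identical( l[2], list(20) )   # TRUE
identical( l[[2]], 20 )       # TRUE

# For atomic vectors the two collapse into one, because a
# "single number" in R is already a length-1 vector:
v <- c(10, 20, 30)
identical( v[2], v[[2]] )     # TRUE
```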

With that as my introduction, here is my code, with comments explaining what I was trying.

# try_summarise.R
#
# Some experiments with summarise(),
# to work out why it didn't seem
# to work when applied to columns
# that are lists of tibbles. 

library( tidyverse )

library( stringr )
# For str_c() .


t <-
  tribble( ~a, ~b   , ~c  ,   ~d   , ~e         , ~f
         , 3 , FALSE, "AA", c(1,11), list(x=1)  , tibble(x=c(1))
         , 1 , TRUE , "B" , c(2,22), list()     , tibble(y=c(1,2))
         , 2 , TRUE , "CC", c(3,33), list(1,2,3), tibble(x=1,y=2)
         )

summarise( t, s=min(a) )
#
#   A tibble: 1 x 1
#         s
#     <dbl>
#   1     1
# So a tibble with one element,
# 1.

summarise( t, s=max(a) )
#
# A tibble with one element,
# 3.  

summarise( t, s=mean(a) )
#
# A tibble with one element,
# 2.  

summarise( t, s=str_c(c) )
#
# Gives an error:
#   Error in summarise_impl(.data, dots) : 
#   Column `s` must be length 1 (a summary value), not 3
 
summarise( t, s=str_c(c,collapse='') )
#
#
# A tibble with one element,
# 'AABCC'. 

summarise( t, s=any(b) )
#
#
# A tibble with one element,
# TRUE.

summarise( t, s=all(b) )
#
#
# A tibble with one element,
# FALSE.

summarise( t, s=show(a) )
#   [1] 3 1 2
# Then gives error:
#   Error in summarise_impl(.data, dots) : 
#   Column `s` is of unsupported type NULL

# The above all show that the
# entire column (t$a or t$b or t$c)
# gets passed to the expression
# after =. If that expression can't
# reduce it to a single atomic
# value, we get an error.


# This was confirmed by the 
#   summarise( t, s=show(a) )
# and also by the two below.

summarise( t, s=identity(a) )
# Gives error:
#   Error in summarise_impl(.data, dots) : 
#   Column `s` must be length 1 (a summary value), not 3

summarise( t, s=a[[1]] )
#
# A tibble with one element,
# 3.

summarise( t, s=a[[3]] )
#
# A tibble with one element,
# 2.

# These are consistent with the
# entire column t$a , which is
#   c(3,1,2)
# being passed.


# What happens if I group by a?

t %>% group_by(a) %>% summarise( s=min(a) )
#
#   A tibble: 3 x 2
#         a     s
#     <dbl> <dbl>
#   1     1     1
#   2     2     2
#   3     3     3
# So now I get a tibble with
# as many rows as t has. 

t %>% group_by(a) %>% summarise( s=show(a) )
#
# Shows 1 and then gives an error.

# So now, t is sliced into three
# groups. There are three calls to
# the expression after =, and in
# each, the appropriate slice of t$a
# is substituted for 'a' in the expression.


# Let's confirm this.

t %>% group_by(a) %>% summarise( s=nchar(c) )
#
# Gives a tibble whose single column
# s is three elements, 1 2 2.
# These are the lengths of the elements
# of t$c .


# Does this work with list columns?
# That's where I had trouble, and is
# what inspired me to try these
# calls.

t %>% group_by(a) %>% summarise( s=length(e) )
#
#   A tibble: 3 x 2
#         a     s
#     <dbl> <int>
#   1     1     1
#   2     2     1
#   3     3     1
# So here I get lengths of 1. But
# I'd expect 0, 3, and 1.

t %>% group_by(a) %>% summarise( s=nrow(f) )
#
# And here, I get an error:
#   Error in summarise_impl(.data, dots) : 
#   Column `s` is of unsupported type NULL


# Why don't these work? The calls
#   summarise( s=length(e) )
#   summarise( s=nrow(f) )
# are surely analogous to
#   summarise( s=nchar(c) ) .

# Epiphany! The reason, I realise, is
# that in each of the expressions after =,
# the column variable is getting
# substituted for by a single-row
# slice of that column. (This is in
# my grouped examples. In the others,
# it gets substituted for by the entire
# column.)

# When the columns are atomic vectors,
# this single-row slice is an atomic
# vector too, with just one element.
# But in R, these get treated like
# single numbers, strings or Booleans.
# So the call to nchar() gets passed
# a vector containing a single string
# element of column c. It works, because
# such single elements are vectors anyway.

# But when the columns are lists,
# the single-row slice is also a list.
# So nrow(f), for example, gets passed
# not a tibble but a list containing
# a tibble. It then crashes. Similarly,
# length(e) gets passed a list containing
# a single-row slice of column e, and so
# always returns 1.

# I'll confirm this by doing these calls.

summarise( t%>%group_by(a), s=nrow(f[[1]]) )
#
# Works! Returns a tibble with the table
# lengths.

summarise( t%>%group_by(a), s=length(e[[1]]) )
#
# Also works. Returns a tibble with the
# list lengths.


# The key to all this is that if v
# is a length-1 atomic vector, v=v[[1]].
#   > v <- c(1)
#   > v
#   [1] 1
#   > v[[1]]
#   [1] 1

# But if l is a length-1 list, l!=l[[1]]:
#   > l <- list(tibble())
#   > l
#   [[1]]
#   A tibble: 0 x 0
# Whereas:
#   > l[[1]]
#   A tibble: 0 x 0
# The latter loses a level of subscripting.


# By the way, I note that summarise()
# works with more than one column name,
# and more than one use of a column name,
# and substitutes them all appropriately.

summarise( t%>%group_by(a), s=a/a )
#
# Returns a tibble with three
# rows, all 1.

summarise( t%>%group_by(a), s=str_c(a,b,c,collapse='') )
#
# Returns a tibble where each
# row is the result of concatenating
# the first three elements in
# the corresponding row of t.


# Finally, I also note that if
# I want to _return_ a tibble
# from a summary function, I have
# to wrap it in a list.

summarise( t%>%group_by(a), s=tibble(x=a) )
#
# Runs, but doesn't do what I wanted.
# (Puts a double into each row.)

summarise( t%>%group_by(a), s=list(tibble(x=a)) )
#
# Runs, and does do what I wanted:
#   A tibble: 3 x 2
#         a                s
#     <dbl>           <list>
#   1     1 <tibble [1 x 1]>
#   2     2 <tibble [1 x 1]>
#   3     3 <tibble [1 x 1]>

# I suppose that this is because 
# the expression in summarise() has
# to return something that is 
# a row of a column, not an element.
# If the column is an atomic vector,
# these are the same, but they aren't
# if it's a list.
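Pulling the threads together, the fix for my original problem, checking the sizes of nested tibbles, can be sketched two ways: unwrap with [[1]] inside summarise(), or map over the list column directly. A sketch, assuming data of the same shape as mine:

```r
library(tidyverse)

nested <- tibble( id = c(1, 2)
                , f  = list( tibble(x = 1:3), tibble(x = 1) )
                )

# Inside summarise(), each group's slice of f is a length-1
# list, so unwrap with [[1]] before calling nrow().
sizes  <- nested %>% group_by(id) %>% summarise( n = nrow(f[[1]]) )

# Or bypass summarise() and map over the list column directly.
sizes2 <- map_int( nested$f, nrow )
```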

My Testing Was Hurt By a List With No Names

I spied a strange thing with my innocent eye. I was testing spread_to_list, the function I wrote about in “How Best to Convert a Names-Values Tibble to a Named List?”. One of my tests passed it a zero-row tibble, expecting that the result would be a zero-element list:

test_that( "Test on zero-element tibble", {
  t <- tribble( ~a, ~b )
  l <- spread_to_list( t )
  expect_that( length( l ), equals( 0 ) )
  expect_that( l, is_identical_to( list() ) )
})
But it wasn’t. Or at least, it was, but with an unsuspected excrescence.

The package I used to run that test is Hadley Wickham’s testthat, explained in his 2011 article “testthat: Get Started with Testing”. Even if you don’t know it, you can predict from his well-chosen function names what two things I expected. The first thing, that the result be length zero, came true. The second thing, that it be an empty list, didn’t.

Output from the test gave me a hint:

Error: Test failed: 'Test on zero-element tibble'
* `l` not identical to list().
Attributes: < Modes: list, NULL >
Attributes: < Lengths: 1, 0 >
Attributes: < names for target but not for current >
Attributes: < current is not list-like >

Displaying l and a list() told me the rest:

> l
named list()
> list()
list()
R evidently thinks a named list is not identical to a list, even when the set of names is empty. I don’t know whether this is reasonable or not. An empty set is different from no set. On the other hand, a dictionary which just happens to have no definitions in it is still a dictionary. The R manual’s section on object attribute lists implies that a list’s names are an attribute. Let’s see how that behaves:
> attributes(l)
$names
character(0)

> attributes(list())
NULL
So a list can either have no names and no box to put them in, or the box with nothing inside.
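The workaround for my test follows from this: strip the empty names attribute before comparing. A minimal sketch:

```r
l <- setNames( list(), character(0) )   # a "named list()" with zero names

identical( l, list() )             # FALSE: the empty names attribute counts
identical( unname( l ), list() )   # TRUE once the attribute is dropped
```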

Multinest: Factoring Out Hierarchical Names by Converting Tibbles to Nested Named Lists

Consider this tibble:

vat_rates <- tribble( ~l1  , ~l2         , ~l3         , ~val
                    , 'VAT', 'Cigarettes', ''          , 'standard'
                    , 'VAT', 'Tobacco'   , ''          , 'standard'
                    , 'VAT', 'Narcotics' , ''          , 0
                    , 'VAT', 'Clothing'  , 'Adult'     , 'standard'
                    , 'VAT', 'Clothing'  , 'Children'  , 0
                    , 'VAT', 'Clothing'  , 'Protective', 0
                    )
It contains three name columns followed by a value. The name columns are hierarchical. The top category is VAT (Value Added Tax). The second column subdivides this into VAT on cigarettes, other tobacco, narcotics, and clothing. And the third column further subdivides clothing into adult, children’s, or protective. This is part of the VAT parameters worksheet that provides our economic model with its knowledge of the British tax system. Here’s the same fragment as a screen shot:
I want to convert this into a tree-structured named list in which I can access parameter values by multiple selection. That is, I want to be able to ask for params$VAT$Tobacco and get 'standard', or for params$VAT$Clothing$Children and get 0. This is how I did so.

I’m going to show my code below. I’ve annotated it in the same way I did in my previous posts, so all the explanation is in the code as comments. What I did was to write a function called multinest, correcting it a step at a time as I found cases it couldn’t handle. I built it up that way because I found it easier to explain. It seemed to be easier for the reader (or at least, for me acting as reader) to visualise than presenting a finished version of the function immediately and explaining how its recursion and base cases worked. I’m not sure why I found the latter harder, but it’s something to do with the fact that nest transforms an entire table in one go. If you then want to perform an inner transformation on the nested tables it inserts, you have to loop or recurse over them. That’s different from how a more traditional algorithm would work. Anyway, here’s my code. The final version of my function is at the end, multinest_4.

# try_multinest_4.R
#
# This is a version of multinest which
# I hope is easier to understand than
# that in try_multinest_3.R . I built
# it by successive enhancement, starting
# with nest() and adding one correction
# at a time.


library( tidyverse )


t <- tribble( ~l1, ~l2, ~val
            , 'A', 'a', 'Aa'
            , 'B', 'a', 'Ba'
            , 'A', 'b', 'Ab'
            , 'B', 'b', 'Bb'
            )
#
# I will try my functions on this
# tibble. I want to write a function
# that factors out the two name
# columns l1 and l2, converting it
# to the nested named list
#   list( A = list( a = "Aa", b = "Ab" )
#       , B = list( a = "Ba" ,b = "Bb" )
#       )
# So if this list is l, l$A$a must be
# "Aa", and l$B$b must be "Bb", etc.
     

# First, let's try nest() interactively.

t_nested <- t %>% group_by( l1 ) %>%
                 nest
#
# Gives a tibble where the second
# and third columns of t have been
# collapsed to inner, nested, tibbles:
#   # A tibble: 2 x 2
#        l1             data
#     <chr>           <list>
#   1     A <tibble [2 x 2]>
#   2     B <tibble [2 x 2]>

t_nested[[1,2]]
#
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# So the data column in t_nested
# is indeed tibbles.


# Now I want to "spread" t_nested, 
# flipping it into a list where the column-1
# elements "A" and "B" become the list's
# names, and the tibbles in column 2
# becomes the values associated with those
# names.

# In my blog post 
# http://www.j-paine.org/blog/2017/10/how-best-to-convert-a-names-values-tibble-to-a-named-list.html 
# I tried functions for doing this. I could
# use spread(). But it turns out I don't need
# to. setNames() can do the job much more 
# efficiently. Here's my function. See
# the post for why it's as it is.

spread_to_list <- function( t )
{
  setNames( as.list( t[[2]] ), t[[1]] ) 
}


# Let's try it.

t_nested_spread <- spread_to_list( t_nested )
#
#   $A
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
#
#   $B
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Ba
#   2     b    Bb
# So that does what I wanted.
# Let's check by displaying one
# of the elements.

t_nested_spread$A
#
#   #  A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# Good.

# So we can convert the outer level of
# a table t to a named list whose keys are
# the names in t[,1].

# I should note that spread_to_list() 
# is safe because nest() will ensure 
# that spread_to_list()'s argument 
# never has a name more than
# once in column 1.


# Let's make this into a function.

multinest <- function( t )
{
  colname_1 <- as.name( names(t)[1] )

  t %>% group_by( !!colname_1 ) %>%
        nest %>%
        spread_to_list    
}

t_nested_spread_ <- multinest( t )
identical( t_nested_spread, t_nested_spread_ )
#
# Which gives the same result as above,
# so multinest() appears OK.


# Now I want to make multinest() recursive so
# that it factors out the inner tables
# in the same way. I'll assume all the name
# columns are at the front of the table. I
# do have to give an extra argument saying
# how many there are.

multinest_2 <- function( t, no_of_name_cols )
{
  colname_1 <- as.name( names(t)[1] )
  #
  # For as.name and !! , see e.g. Henrik's
  # answer to https://stackoverflow.com/questions/26724124/standard-evaluation-in-dplyr-summarise-on-variable-given-as-a-character-string ,
  # "[u]se as.name if you have a character 
  # string that gives a variable name".

  listed <- t %>% group_by( !!colname_1 ) %>%
                  nest %>%
                  spread_to_list    

  if ( no_of_name_cols > 1 ) 
    return( map( listed, multinest_2, no_of_name_cols-1 ) )
  else
    return( listed )
}

identical( multinest_2( t, 1 ), t_nested_spread )
#
# So this function called for just one column
# works as before.


# Let's see what happens with two columns.

t_nested_spread_2 <- multinest_2( t, 2 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') ) 
    cat( t_nested_spread_2[[X]][[Y]] ) 
#
# But ...
#   Error in cat(t_nested_spread_2[[X]][[Y]]) : 
#   argument 1 (type 'list') cannot be handled by 'cat'

t_nested_spread_2$A$a
#   # A tibble: 1 x 1
#       val
#     <chr>
#   1    Aa

# The innermost list elements are tibbles, not
# simple values. That would be OK if I wanted to
# carry more than one value cell with each
# innermost element, but it wastes space when
# I have only one. What's causing this is the
# inner call to nest(), which is converting
# a tibble such as 
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# into
#   # A tibble: 2 x 2
#       l2             data
#    <chr>           <list>
#  1     a <tibble [1 x 1]>
#  2     b <tibble [1 x 1]>


# Let's rewrite to avoid this. If there's
# only one column left, just convert to list
# without calling nest().

multinest_3 <- function( t, no_of_name_cols )
{
  colname_1 <- as.name( names(t)[1] )
 
  if ( no_of_name_cols > 1 )
    listed <- t %>% group_by( !!colname_1 ) %>%
                    nest %>%
                    spread_to_list %>% 
                    map( multinest_3, no_of_name_cols-1 )
  else if ( no_of_name_cols == 1 )
    listed <- t %>% spread_to_list

  listed
}

identical( multinest_3( t, 1 ), t_nested_spread )
#
# This actually returns FALSE. So asked to nest
# just one column, the function doesn't do the same
# as its predecessors. It returns the list
#   list(A = "a", B = "a", A = "b", B = "b")
# But that's reasonable, because it's treating
# the second column as the values. So I'll not
# amend it.


# What happens with two columns?

t_nested_spread_3 <- multinest_3( t, 2 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') ) 
    cat( t_nested_spread_3[[X]][[Y]] ) 
#
# Displays AaAbBaBb.
# So that looks OK. 

dput(t_nested_spread_3)
#
# So does a dput():
#   structure(list(A = structure(list(a = "Aa", b = "Ab"), .Names = c("a", 
#   "b")), B = structure(list(a = "Ba", b = "Bb"), .Names = c("a", 
#   "b"))), .Names = c("A", "B"))


# Now let's try three columns. 
# I'll reassign t for this.

t <- tribble( ~l1, ~l2, ~l3, ~val
            , 'A', 'a', '1', 'Aa1'
            , 'A', 'a', '2', 'Aa2'
            , 'A', 'b', '1', 'Ab1' 
            , 'A', 'b', '2', 'Ab2'
            , 'B', 'a', '1', 'Ba1'
            , 'B', 'a', '2', 'Ba2'
            , 'B', 'b', '1', 'Bb1' 
            , 'B', 'b', '2', 'Bb2'
            )

t_nested_spread_3 <- multinest_3( t, 3 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') )
    for( Z in c('1','2') ) 
      cat( t_nested_spread_3[[X]][[Y]][[Z]] )
#
# Displays Aa1Aa2Ab1Ab2Ba1Ba2Bb1Bb2. Good.


# Now let's try on some real data, a fragment 
# of our VAT parameters.

vat_rates <- tribble( ~l1  , ~l2         , ~l3         , ~val
                    , 'VAT', 'Cigarettes', ''          , 'standard'
                    , 'VAT', 'Tobacco'   , ''          , 'standard'
                    , 'VAT', 'Narcotics' , ''          , 0
                    , 'VAT', 'Clothing'  , 'Adult'     , 'standard'
                    , 'VAT', 'Clothing'  , 'Children'  , 0
                    , 'VAT', 'Clothing'  , 'Protective', 0
                    )

vat_rates_nested_spread <- multinest_3( vat_rates, 3 )
#
# But this goes awry on the first three table rows,
# where it generates results such as 
#   $VAT$Narcotics
#   $VAT$Narcotics[[1]]
#   [1] "0"
# The value is wrapped in an unwanted list.

# This is because I didn't check for names
# where some of the final components are
# empty. R has come across the '' and done its
# own thing. 


# I need to fix this. I'll do so below. If
# the table passed to multinest() has ''
# as its first element, return the final
# element, the corresponding value. I don't
# think such a table can ever have more
# than one row. If it did, it would have
# been part of a table such as
#   Pars VAT ''  20
#   Pars VAT Low 17
# which would have to translate into the list
#   Pars$VAT <- 20
#   Pars$VAT$Low <- 17
# This would mean that Pars$VAT is 
# simultaneously a terminal and a non-terminal,
# which makes no sense.

multinest_4 <- function( t, no_of_name_cols )
{
  if ( t[[1,1]] == '' )
    return( t[[ 1, ncol(t) ]] )
  else {
    colname_1 <- as.name( names(t)[1] )

    if ( no_of_name_cols > 1 )
      listed <- t %>% group_by( !!colname_1 ) %>%
                      nest %>%
                      spread_to_list %>% 
                      map( multinest_4, no_of_name_cols-1 )
    else if ( no_of_name_cols == 1 )
      listed <- t %>% spread_to_list

    return( listed )
  }
}

vat_rates_nested_spread <- multinest_4( vat_rates, 3 )
#
#   $VAT
#   $VAT$Cigarettes
#   [1] "standard"
#
#   $VAT$Tobacco
#   [1] "standard"
#
#   $VAT$Narcotics
#   [1] "0"
#
#   $VAT$Clothing
#   $VAT$Clothing$Adult
#   [1] "standard"
#
#   $VAT$Clothing$Children
#   [1] "0"
#
#   $VAT$Clothing$Protective
#   [1] "0"
#
# Good.

How Best to Convert a Names-Values Tibble to a Named List?

Here, in the spirit of my “Experiments with by_row()” post, are some experiments in writing and timing a function spread_to_list that converts a two-column tibble such as:

x 1
y 2 
z 3
t 4
to a named list:
list( x=1, y=2, z=3, t=4 )
I need this for processing the parameter sheets shown in that by_row post, and I’ll explain why later. In this post, I’m just interested in how best to define spread_to_list. The best implementation looks like either spread_to_list_3 or spread_to_list_4 below.

# try_spread_to_list.R
#
# Consider a tibble t with two columns,
# where each cell in the second column
# represents the value associated with
# the string (assumed to be its name) 
# in the first column.
#
# I want to define a function spread_to_list() 
# which converts t into a named list whose names
# are the names in the first column, and whose
# values are the values in the second column.
#
# For example, if t is:
#        names no_of_days
#    DaysInMay         31
#   DaysInJune         30
# then
#   spread_to_list(t) 
# would be the list
#   list( DaysInMay = 31, DaysInJune = 30 )
#
# The code here tries various ways
# of implementing spread_to_list,
# and benchmarks them. Two are variants
# of one another, using spread() and converting
# its result. Another two take the values
# column as a list, and call setNames() to 
# convert to a named list.


library( tidyverse )
library( microbenchmark )
library( stringr )


t <- tribble( ~a , ~b 
            , 'x',  1
            , 'y',  2 
            , 'z',  3
            , 't',  4
            )
#
# I'm going to try various ways
# of implementing spread_to_list()
# on t.


# First, what happens if I call spread()?

s <- spread( t, a, b )
#
# s becomes a one-row tibble:
#         t     x     y     z
#   1     4     1     2     3


# How can I convert s to a named list?

# An obvious way is to call map().
# Let's see what the function argument to
# map() gets passed if I call map() on s.

map( s, show )
#
# Displays
#  [1] 4
#  [1] 1
#  [1] 2
#  [1] 3
# So it gets passed a column of the
# tibble as an atomic vector.

map( s, function(x)x )
#
# Returns a list of these elements:
#   list(t = 4, x = 1, y = 2, z = 3)
# That's because map() is defined to return 
# lists. So the call above uses it merely
# as a type converter.

map( s, identity )
#
# Does the same. identity() is a built-in
# identity function.


# But maybe I can avoid mapping. In my
# experiments with by_row(), 
# http://www.j-paine.org/blog/2017/10/experiments-with-by_row.html ,
# I discovered that as.list() will convert
# a tibble to a named list.

as.list( s )
#
# Also returns 
#   list(t = 4, x = 1, y = 2, z = 3).


# But can I avoid spread() altogether?
# Browsing the discussion groups gave me
# the idea of trying setNames().
# Let's try that, passing t's names as its
# second argument and t's values as its
# first.

setNames( t[[2]], t[[1]] )
# 
# Gives me an atomic named vector. 

# I need to convert it to a list.
# One way is for the argument to setNames()
# to be a list, because that's specified to
# make it return a list.

setNames( as.list( t[[2]] ), t[[1]] )
#
# Returns the list
#   list(x = 1, y = 2, z = 3, t = 4)

# But I could convert the result instead.

as.list( setNames( t[[2]], t[[1]] ) )
#
# Also returns
#   list(x = 1, y = 2, z = 3, t = 4)


# Let's try these four implementations.
# First, define the functions.

spread_to_list_1 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>% 
    spread( !!as.name(colname1), !!as.name(colname2) ) %>% 
    map( identity )
}

spread_to_list_2 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>% 
    spread( !!as.name(colname1), !!as.name(colname2) ) %>% 
    as.list
}

spread_to_list_3 <- function( t )
{
  setNames( as.list( t[[2]] ), t[[1]] )
}

spread_to_list_4 <- function( t )
{
  as.list( setNames( t[[2]], t[[1]] ) )
}


# Now try them.

s1 <- spread_to_list_1( t )
s2 <- spread_to_list_2( t )
s3 <- spread_to_list_3( t )
s4 <- spread_to_list_4( t )

dput( s1 )
dput( s2 )
dput( s3 )
dput( s4 )

identical( s1, s2 )
identical( s1, s3 )
identical( s1, s4 )


# They all return named lists, but the order
# of elements is different for the spread()-based
# versions than from the as.list()-based ones.
# (That was obvious earlier, actually.)
# So I'll sort the lists, then test that
# they're identical. I'll also microbenchmark
# the functions.

sort_list <- function(l)
{ 
  sort( unlist( l ) )
}

identical( sort_list(s1), sort_list(s2) )%>%show
identical( sort_list(s1), sort_list(s3) )%>%show
identical( sort_list(s1), sort_list(s4) )%>%show

mbres <- microbenchmark( spread_to_list_1( t )
                       , spread_to_list_2( t )
                       , spread_to_list_3( t )
                       , spread_to_list_4( t )
                       )
print( mbres )


# Now let's microbenchmark the functions applied
# to bigger tibbles. I'll generate random name-value
# tibbles of sizes n, where n is defined by the
# vector in the 'for' condition.

for ( n in c(10,30,100,300) ) {

  cat( "Trying ", n, "row tibble\n" )

  names <- replicate( n, str_c(sample(letters,5,replace=FALSE),collapse="") )
  #
  # Generate n random alphabetic strings.
  # From Dirk Eddelbuettel's answer to 
  # https://stackoverflow.com/questions/1439513/creating-a-sequential-list-of-letters-with-r .

  values <- runif( n, 1, 100 )
  #
  # Generate n random values.

  t <- tibble( names=names
             , values=values
             )
  #
  # Use these to make a random tibble with
  # two columns and n rows.

  s1 <- spread_to_list_1( t )
  s2 <- spread_to_list_2( t )
  s3 <- spread_to_list_3( t )
  s4 <- spread_to_list_4( t )
  #
  # Recompute the results on this tibble
  # before comparing them.

  identical( sort_list(s1), sort_list(s2) )%>%show
  identical( sort_list(s1), sort_list(s3) )%>%show
  identical( sort_list(s1), sort_list(s4) )%>%show

  mbres <- microbenchmark( spread_to_list_1( t )
                         , spread_to_list_2( t )
                         , spread_to_list_3( t )
                         , spread_to_list_4( t )
                         )
  print( mbres )
}


# Here are the microbenchmark results for the
# 300-row tibble:
#   Unit: microseconds
#                  expr       min        lq    
#   spread_to_list_1(t) 17738.631 17928.946 
#   spread_to_list_2(t) 14582.521 14805.888 
#   spread_to_list_3(t)    36.223    40.901   
#   spread_to_list_4(t)    35.317    40.146 
#                  expr        mean    median         uq
#   spread_to_list_1(t) 18668.44595 18221.133 19532.8080
#   spread_to_list_2(t) 15386.05835 15050.685 16355.4180
#   spread_to_list_3(t)    46.67244    47.089    51.4655
#   spread_to_list_4(t)    45.77894    45.882    51.6165
#         max neval
#   21477.003   100
#   17314.838   100
#      64.294   100
#      63.087   100

# So the two spread() versions are much slower.
# Converting the spread() result with mapping is
# slower than with as.list(), probably unsurprisingly.
# The two setNames() versions are much faster. 
# It doesn't seem to matter whether we type-convert
# to list by making setNames()'s first argument
# a list, or by making its result one.

Experiments with by_row()

I’ve been experimenting with by_row() from the R purrrlyr package. It’s a function that maps over the rows of a data frame, which I thought might be handy for processing our economic model’s parameter files. The documentation for by_row() didn’t tell me everything I wanted to know, so I made up and ran some examples. As these might be useful to others, I’m blogging them here.

purrrlyr was written mainly by Hadley Wickham. Regular readers of my posts will know that Hadley wrote the suite of programs known as the Tidyverse, which can be installed with the command install.packages("tidyverse"). However, purrrlyr is not quite part of this, and has to be installed separately, via install.packages("purrrlyr").

Why am I interested in by_row()? Our economic model reads what we call parameter files. These are spreadsheets describing the taxes and benefits it has to apply, plus various controls. Each spreadsheet contains several worksheets, most of which start with three columns making up a hierarchical parameter name, followed by one or more values. Here’s a fragment from one such sheet:

        Name1              Name2    Name3 Allowance_Spouse PrivatePension BankAccount NationalSavings
1 IncomeLists      IncomeSupport     <NA>                1              1           0               0               
2 IncomeLists      PensionCredit     Main                1              1           0               0               
3 IncomeLists      PensionCredit  Savings                0              1           0               0               
4 IncomeLists NonContributoryESA   Income                1              1           0               0               
5 IncomeLists NonContributoryESA Earnings                0              0           0               0               
6 IncomeLists              HBCTB     <NA>                1              0           0               0               
And here’s a screen shot of it:
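Sheets like this can be read into a tibble before processing. A minimal sketch, assuming the readxl package; the file and worksheet names here are hypothetical, chosen just for illustration:

```r
library( readxl )

# Read one worksheet of a parameter file into a
# tibble: three name columns making up the
# hierarchical parameter name, then the value
# columns. "parameters.xlsx" and "IncomeLists"
# are hypothetical names.
sheet <- read_excel( "parameters.xlsx", sheet = "IncomeLists" )
```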

I need to process these sheets a row at a time, converting each row into an entry in a lookup table that maps parameter names to their values. Because by_row() maps over rows, it looked worth learning about. So the code below shows the calls I tried. In it, t is the tibble I passed to by_row(), and tbr is the result of the calls. I’ve commented each call explaining what it tells me, and also listed the things I wanted to know. One was what the documentation meant by saying that the .collate argument makes by_row() collate by row or by column. “Collate” is a very general verb, and didn’t tell me exactly how by_row()‘s results were arranged. I was also puzzled by the .labels argument, which, if TRUE, means “the returned data frame is prepended with the labels of the slices (the columns in .d used to define the slices)”.

# try_by_row.R
#
# Some experiments with 
# purrrlyr's by_row() function.
# This may be useful when 
# converting segments of tibble
# rows to lists that I can
# store in a single cell.


library( tidyverse )
library( purrrlyr )


t <- tribble( ~a, ~b, ~c
            ,  1,  2,  3
            ,  4,  5,  6
            ,  7,  8,  9
            , 10, 11, 12
            )
#
# I'll try various calls to by_row on this.

# Let's try sum().

tbr <- by_row( t, sum )
#
# tbr becomes t with a .out column added.
#         a     b     c       .out
#   1     1     2     3 <dbl [1]>
#   2     4     5     6 <dbl [1]>
#   3     7     8     9 <dbl [1]>
#   4    10    11    12 <dbl [1]>
# Each of the <dbl [1]>'s is a list whose
# single element is the sum of a, b, and c.

tbr <- by_row( t, sum, .collate="rows" )
tbr <- by_row( t, sum, .collate="cols" )
#
# In both of these, tbr becomes a table where
# each .out cell holds a sum rather than a
# list containing the sum:
#         a     b     c  .out
#   1     1     2     3     6
#   2     4     5     6    15
#   3     7     8     9    24
#   4    10    11    12    33

# Let's try mean().

tbr <- by_row( t, mean )
# Warning messages:
# 1: In mean.default(.d[[i]], ...) :
#   argument is not numeric or logical: returning NA
# 2: In mean.default(.d[[i]], ...) :
#   argument is not numeric or logical: returning NA
# 3: In mean.default(.d[[i]], ...) :
#   argument is not numeric or logical: returning NA
# 4: In mean.default(.d[[i]], ...) :
#   argument is not numeric or logical: returning NA

# Let's find out why sum() works but mean() doesn't.

tbr <- by_row( t, show )
#
# I did this to see what by_row()'s function
# argument gets passed. The command above 
# displays four one-row tibbles, each a row of t.
# So that's what the function gets passed.

sum( t[1,] )
#
# 6. So sum() can accept tibbles.

mean( t[1,] )
#   Warning message:
#   In mean.default(t[1, ]) : argument is not numeric 
#   or logical: returning NA
# But mean() can't.

# The above three results show why by_row()
# gave warnings with mean() but not sum(). It
# was passing one-row tibbles to both. sum()
# accepts these but mean() doesn't.
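A possible workaround, not one of the original experiments: flatten each one-row tibble before averaging. This sketch continues with the tibble t defined above.

```r
tbr <- by_row( t, function(row) mean( unlist(row) ) )
#
# unlist() turns the one-row tibble into a plain
# named numeric vector, which mean() accepts.
# Each element of tbr$.out is the row mean:
# 2, 5, 8, and 11.
```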
    
# Let's see whether I can create a column 
# containing copies of the original rows.

tbr <- by_row( t, function(row)row )
#
# Shows that tbr$.out is a list where
# each element is one of these one-row tibbles.
# So this copies each original row into 
# the corresponding cell of .out .

tbr <- by_row( t, as.list )
#
# This does as above but converts each row to
# a list. So here, 
#   tbr$.out[[1]] 
# is
#   list(a = 1, b = 2, c = 3) .
# and so on. This will be useful when I want to
# convert rows to look-up tables implemented
# as named lists. And I presume it's more
# efficient than storing an entire tibble.

as.list( t[1,] )
#
# Just to show that conversion of one-row
# tibble to named list is what as.list does,
# and not some quirk of by_row(). You never
# know with R.

as.list( t )
#
# I try this for interest. It converts to a
# named list where each element is a 
# column vector. Thus,  
#   as.list( t )
# is
#   list( a = c(1, 4, 7, 10)
#       , b = c(2, 5, 8, 11)
#       , c = c(3, 6, 9, 12)
#       )

# Let's try the .collate argument, because
# I don't understand exactly what the 
# by_row() documentation means by collation.

tbr <- by_row( t, function(row)c(10,20,30)  )
#
# This makes tbr$.out a list column each of
# whose elements is c(10,20,30). So it just 
# puts the entire result of the function 
# into the .out cells.

tbr <- by_row( t, function(row)c(10,20,30), .collate="rows" )
#
# Makes tbr the table
#         a     b     c  .row  .out
#   1     1     2     3     1    10
#   2     1     2     3     1    20
#   3     1     2     3     1    30
#   4     4     5     6     2    10
#   5     4     5     6     2    20
#   6     4     5     6     2    30
#   7     7     8     9     3    10
# etc. So appends one column. But copies
# each original row as many times as there 
# are components in the function's result. 
# The i'th copy's .out cell holds the i'th 
# component. And we get a .row column holding
# the number of the original row.

tbr <- by_row( t, function(row)c(10,20,30), .collate="cols" )
#
# Makes tbr the table
#         a     b     c .out1 .out2 .out3
#   1     1     2     3    10    20    30
#   2     4     5     6    10    20    30
#   3     7     8     9    10    20    30
#   4    10    11    12    10    20    30
# So appends three columns, where the i'th
# column contains the i'th component of the
# function's result.

# Let's see whether .collate can produce
# "Manhattan arrays", where the rows are 
# different lengths.

tbr <- by_row( t, function(row)1:(sum(row)%%5), .collate="cols" )
#
# This gave an error:
#   Error in by_row(t, function(row) 1:(sum(row)%%5), .collate = "cols") : 
#   .f should return equal length vectors or data frames 
#   for collating on `cols`
# My function was intended to return rows of different
# length. But, as the documentation says, they have to
# be the same length.

tbr <- by_row( t, function(row)1:(sum(row)%%5), .collate="rows" )
#
# However, this one does run, producing differing
# numbers of rows for each original row:
#         a     b     c  .row  .out
#   1     1     2     3     1     1
#   2     4     5     6     2     1
#   3     4     5     6     2     0
#   4     7     8     9     3     1
#   5     7     8     9     3     2
#   6     7     8     9     3     3
#   7     7     8     9     3     4
#   8    10    11    12     4     1
#   9    10    11    12     4     2
#  10    10    11    12     4     3
# Note the second original row: its sum(row) %% 5
# is 0, and 1:0 counts down, giving c(1, 0)
# rather than an empty vector. Hence its two
# .out values, 1 and 0.

# Let's see what happens with .collate when 
# the function returns a data frame, 
# because the documentation doesn't really 
# explain that.

df <- tribble( ~X, ~Y, "$", "%", "^", "&" )

tbr <- by_row( t, function(row)df )
#
# Each cell of tbr's .out column becomes a copy of
# df.

tbr <- by_row( t, function(row)df, .collate="rows" )
# 
# tbr becomes a table 
#         a     b     c  .row     X     Y
#   1     1     2     3     1     $     %
#   2     1     2     3     1     ^     &
#   3     4     5     6     2     $     %
#   4     4     5     6     2     ^     &
# etc. So we get one copy of each original row
# for each row in df. We get one column for
# each column in df, plus again a .row column 
# holding the number of the original row.

tbr <- by_row( t, function(row)df, .collate="cols" )
#
# I wasn't expecting that. df's cells get
# spread into a single line, thus, and its
# column names suffixed accordingly:
#         a     b     c    X1    X2    Y1    Y2
#   1     1     2     3     $     ^     %     &
#   2     4     5     6     $     ^     %     &
#   3     7     8     9     $     ^     %     &
#   4    10    11    12     $     ^     %     &

# Let's see what the .labels argument does. The
# documentation says: if TRUE, the returned data 
# frame is prepended with the labels of the slices 
# (the columns in .d used to define the slices). 
# But I don't know what it means by slices.

tbr <- by_row( t, sum, .labels=FALSE )
#
# We just get 
#          .out
#   1 <dbl [1]>
#   2 <dbl [1]>
#   3 <dbl [1]>
#   4 <dbl [1]>

tbr <- by_row( t, function(row)df, .collate="rows", .labels=FALSE )
#
# tbr becomes
#      .row     X     Y
#   1     1     $     %
#   2     1     ^     &
#   3     2     $     %
#   4     2     ^     &
# etc.

tbr <- by_row( t, function(row)df, .collate="cols", .labels=FALSE )
#
#        X1    X2    Y1    Y2
#   1     $     ^     %     &
#   2     $     ^     %     &
#   3     $     ^     %     &
#   4     $     ^     %     &

# So .labels=FALSE gives me what I think
# of as a traditional mapping function,
# where by_row() returns only the new
# columns. Except that if .collate="rows",
# it prepends a .row column holding the
# original row number.
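To close the loop on the parameter files: here is a sketch of how by_row() and as.list could build the lookup table I described at the start. The sheet below is a cut-down, hypothetical stand-in for the real worksheets, and the "/"-separated key format is my invention for illustration.

```r
library( tidyverse )
library( purrrlyr )

# A cut-down stand-in for a parameter sheet.
sheet <- tribble( ~Name1       , ~Name2         , ~Allowance_Spouse, ~PrivatePension
                , "IncomeLists", "IncomeSupport",                 1,               1
                , "IncomeLists", "PensionCredit",                 1,               1
                )

# Convert the value columns of each row to a
# named list, then key the results on the
# hierarchical parameter name.
values <- sheet %>% select( -Name1, -Name2 )
tbr    <- by_row( values, as.list )
lookup <- setNames( tbr$.out
                  , paste( sheet$Name1, sheet$Name2, sep="/" )
                  )
#
# lookup[["IncomeLists/IncomeSupport"]]$PrivatePension
# is now 1.
```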

Abstract Data Types and the Uniform Referent Principle II: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()

In “Abstract Data Types and the Uniform Referent Principle I: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()”, I explained why the notation for interfacing to a data structure should be independent of that structure’s representation. R programmers honour this principle in the same way that bricks hang in the sky. All published R code that operates on data frames uses column names. Sometimes these follow the $ operator; sometimes the data frame is implicit via attach() or similar. In the Tidyverse, the column names will often be part of a mutate(), the data frame being piped through a sequence of %>% operators. And this is dreadful software engineering.

Why? Look at the tables below. They represent four different ways of storing my income data.

  Person   Income_Type   Income_Value
  Alice    Wages                37000
  Alice    Bonuses                  0
  Alice    Benefits                 0
  Bob      Wages                14000
  Bob      Bonuses               1000
  Bob      Benefits              6000

  Person   Income_Wages   Income_Bonuses   Income_Benefits
  Alice           37000                0                 0
  Bob             14000             1000              6000

  Person   Income
  Alice    Type       Value
           Wages      37000
           Bonuses        0
           Benefits       0
  Bob      Type       Value
           Wages      14000
           Bonuses     1000
           Benefits    6000

  Person   Income
  Alice    Wages   Bonuses   Benefits
           37000         0          0
  Bob      Wages   Bonuses   Benefits
           14000      1000       6000

Abstractly, the data is the same in each case, and if you’re familiar with nest(), unnest(), gather() and spread(), you will easily see how to transform one table into any of the others. But the tables are implemented in very different ways. If you access their elements with $ or an equivalent, and you then change the implementation, you have to rewrite all those accesses. Which is dreadful software engineering.
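To make the point concrete, here is a sketch of the kind of interface Ross would approve of. income() is a hypothetical accessor of my own devising: callers name the person and the income type, never the columns, so only this one function knows the representation.

```r
library( tidyverse )

# The first (long) representation of the income data.
incomes <- tribble( ~Person, ~Income_Type, ~Income_Value
                  , "Alice", "Wages"     ,         37000
                  , "Alice", "Bonuses"   ,             0
                  , "Alice", "Benefits"  ,             0
                  , "Bob"  , "Wages"     ,         14000
                  , "Bob"  , "Bonuses"   ,          1000
                  , "Bob"  , "Benefits"  ,          6000
                  )

# Hypothetical accessor: the only place that
# mentions column names.
income <- function( data, person, type )
  data$Income_Value[ data$Person == person & data$Income_Type == type ]

income( incomes, "Bob", "Wages" )
#
# 14000. If the table were later spread() into
# the wide representation, only income()'s body
# would change, e.g.:
#   income <- function( data, person, type )
#     data[[ paste0("Income_", type) ]][ data$Person == person ]
# Its callers would be untouched: that is the
# Uniform Referent Principle.
```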

[Cartoon: an experimenter peers into the innards of a complicated piece of machinery. His colleague, holding a plug coming out of it labelled INTERFACE: 'GET' 'INCOME' 'WAGES', says: 'Don't worry about how it works. It's the interface that's important.']