Experiments with summarise(); or, when does x=x[[1]]?

Here’s another innocent-eye exploration, this time about the Tidyverse’s summarise() function. I’d been combining data tables by nesting and joining them, which gave me a tibble with nested tibbles in. I wanted to check the sizes of these inner tibbles, by mapping nrow() over the columns containing them. The Tidyverse provides several ways to do this, one of which (I thought) would be summarise(). So I tried calling it with the argument s=nrow(tibbles), where tibbles was the column with the tibbles in. It crashed. Why? And how should I make it work?

The insight I got from these experiments is that summarise() passes its summarising functions a slice of a column, not an element. To illustrate the difference in meaning: a slice of a list is a list, possibly with fewer elements; a slice of a table is a table, possibly with fewer rows. But an element of a list is not a list, and an element of a table is not a table. I’d overlooked the distinction because I’m used to table columns being atomic vectors. In these, there’s no difference between an element of the vector and a slice. This is because R is strange, and regards all numbers and other primitive values as one-element vectors.

But when the columns are lists, there is a difference. The summarising functions get passed a list containing the element rather than the element itself, so they have to unwrap it. And, by the way, if they return a result that’s to go into a list column, they must wrap it. That’s important if I want them to return tibbles.
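Before the code, here is a minimal base-R sketch of that slice-versus-element distinction (my own illustration, not part of the original experiments):

```r
# A slice (single bracket) of a list is still a list;
# an element (double bracket) is not.
l <- list(10, 20, 30)
identical( l[1],  list(10) )   # TRUE: slicing keeps the wrapper
identical( l[[1]], 10 )        # TRUE: [[ ]] unwraps the element

# For atomic vectors the two coincide, because R treats
# a single number as a length-1 vector anyway.
v <- c(10, 20, 30)
identical( v[1], v[[1]] )      # TRUE: both are the length-1 vector 10
```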

With that as my introduction, here is my code, with comments explaining what I was trying.

# try_summarise.R
#
# Some experiments with summarise(),
# to work out why it didn't seem
# to work when applied to columns
# that are lists of tibbles. 

library( tidyverse )

library( stringr )
# For str_c() .


t <-
  tribble( ~a, ~b   , ~c  ,   ~d   , ~e         , ~f
         , 3 , FALSE, "AA", c(1,11), list(x=1)  , tibble(x=c(1))
         , 1 , TRUE , "B" , c(2,22), list()     , tibble(y=c(1,2))
         , 2 , TRUE , "CC", c(3,33), list(1,2,3), tibble(x=1,y=2)
         )

summarise( t, s=min(a) )
#
#   A tibble: 1 x 1
#         s
#     <dbl>
#   1     1
# So a tibble with one element,
# 1.

summarise( t, s=max(a) )
#
# A tibble with one element,
# 3.  

summarise( t, s=mean(a) )
#
# A tibble with one element,
# 2.  

summarise( t, s=str_c(c) )
#
# Gives an error:
#   Error in summarise_impl(.data, dots) : 
#   Column `s` must be length 1 (a summary value), not 3
 
summarise( t, s=str_c(c,collapse='') )
#
#
# A tibble with one element,
# 'AABCC'. 

summarise( t, s=any(b) )
#
#
# A tibble with one element,
# TRUE.

summarise( t, s=all(b) )
#
#
# A tibble with one element,
# FALSE.

summarise( t, s=show(a) )
#   [1] 3 1 2
# Then gives error:
#   Error in summarise_impl(.data, dots) : 
#   Column `s` is of unsupported type NULL

# The above all show that the
# entire column (t$a or t$b or t$c)
# gets passed to the expression
# after =. If that expression can't
# reduce it to a single atomic
# value, we get an error.


# This was confirmed by the 
#   summarise( t, s=show(a) )
# and also by the two below.

summarise( t, s=identity(a) )
# Gives error:
#   Error in summarise_impl(.data, dots) : 
#   Column `s` must be length 1 (a summary value), not 3

summarise( t, s=a[[1]] )
#
# A tibble with one element,
# 3.

summarise( t, s=a[[3]] )
#
# A tibble with one element,
# 2.

# These are consistent with the
# entire column t$a , which is
#   c(3,1,2)
# being passed.


# What happens if I group by a?

t %>% group_by(a) %>% summarise( s=min(a) )
#
#   A tibble: 3 x 2
#         a     s
#     <dbl> <dbl>
#   1     1     1
#   2     2     2
#   3     3     3
# So now I get a tibble with
# as many rows as t has. 

t %>% group_by(a) %>% summarise( s=show(a) )
#
# Shows 1 and then gives an error.

# So now, t is sliced into three
# groups. There are three calls to
# the expression after =, and in
# each, the appropriate slice of t$a
# is substituted for 'a' in the expression.


# Let's confirm this.

t %>% group_by(a) %>% summarise( s=nchar(c) )
#
# Gives a tibble whose single column
# s is three elements, 1 2 2.
# These are the lengths of the elements
# of t$c .


# Does this work with list columns?
# That's where I had trouble, and is
# what inspired me to try these
# calls.

t %>% group_by(a) %>% summarise( s=length(e) )
#
#   A tibble: 3 x 2
#   a     s
#   <dbl> <int>
#   1     1     1
#   2     2     1
#   3     3     1
# So here I get lengths of 1. But
# I'd expect 0, 3, and 1.

t %>% group_by(a) %>% summarise( s=nrow(f) )
#
# And here, I get an error:
#   Error in summarise_impl(.data, dots) : 
#   Column `s` is of unsupported type NULL


# Why don't these work? The calls
#   summarise( s=length(e) )
#   summarise( s=nrow(f) )
# are surely analogous to
#   summarise( s=nchar(c) ) .

# Epiphany! The reason, I realise, is
# that in each of the expressions after =,
# the column variable is getting
# substituted for by a single-row
# slice of that column. (This is in
# my grouped examples. In the others,
# it gets substituted for by the entire
# column.)

# When the columns are atomic vectors,
# this single-row slice is an atomic
# vector too, with just one element.
# But in R, these get treated like
# single numbers, strings or Booleans.
# So the call to nchar() gets passed
# a vector containing a single string
# element of column c. It works, because
# such single elements are vectors anyway.

# But when the columns are lists,
# the single-row slice is also a list.
# So nrow(f), for example, gets passed
# not a tibble but a list containing
# a tibble. It then crashes. Similarly,
# length(e) gets passed a one-element
# list wrapping a single element of
# column e, so it always returns 1.

# I'll confirm this by doing these calls.

summarise( t%>%group_by(a), s=nrow(f[[1]]) )
#
# Works! Returns a tibble with the tables'
# row counts.

summarise( t%>%group_by(a), s=length(e[[1]]) )
#
# Also works. Returns a tibble with the
# list lengths.


# The key to all this is that if v
# is a length-1 atomic vector, v=v[[1]].
#   > v <- c(1)
#   > v
#   [1] 1
#   > v[[1]]
#   [1] 1

# But if l is a length-1 list, l!=l[[1]]:
#   > l <- list(tibble())
#   > l
#   [[1]]
#   A tibble: 0 x 0
# Whereas:
#   > l[[1]]
#   A tibble: 0 x 0
# The latter loses a level of subscripting.


# By the way, I note that summarise()
# works with more than one column name,
# and with more than one use of a column
# name, and substitutes them all appropriately.

summarise( t%>%group_by(a), s=a/a )
#
# Returns a tibble with three
# rows, all 1.

summarise( t%>%group_by(a), s=str_c(a,b,c,collapse='') )
#
# Returns a tibble where each
# row is the result of concatenating
# the first three elements in
# the corresponding row of t.


# Finally, I also note that if
# I want to _return_ a tibble
# from a summary function, I have
# to wrap it in a list.

summarise( t%>%group_by(a), s=tibble(x=a) )
#
# Runs, but doesn't do what I wanted.
# (Puts a double into each row.)

summarise( t%>%group_by(a), s=list(tibble(x=a)) )
#
# Runs, and does do what I wanted:
#   A tibble: 3 x 2
#   a                s
#   <dbl>           <list>
#   1     1 <tibble [1 x 1]>
#   2     2 <tibble [1 x 1]>
#   3     3 <tibble [1 x 1]>

# I suppose that this is because 
# the expression in summarise() has
# to return something that is 
# a row of a column, not an element.
# If the column is an atomic vector,
# these are the same, but they aren't
# if it's a list.

My Testing Was Hurt By a List With No Names

I spied a strange thing with my innocent eye. I was testing spread_to_list, the function I wrote about in “How Best to Convert a Names-Values Tibble to a Named List?”. One of my tests passed it a zero-row tibble, expecting that the result would be a zero-element list:

test_that( "Test on zero-element tibble", {
  t <- tribble( ~a, ~b )
  l <- spread_to_list( t )
  expect_that( length( l ), equals( 0 ) )
  expect_that( l, is_identical_to( list() ) )
})
But it wasn’t. Or at least, it was, but with an unsuspected excrescence.

The package I used to run that test is Hadley Wickham’s testthat, explained in his 2011 article “testthat: Get Started with Testing”. Even if you don’t know it, you can predict from his well-chosen function names what two things I expected. The first thing, that the result be length zero, came true. The second thing, that it be an empty list, didn’t.
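In newer testthat style (my rephrasing, not the original test), the same two expectations read as below; expectations also work outside test_that(), throwing an error on failure:

```r
library( testthat )
library( tibble )

# The function under test, from the earlier post.
spread_to_list <- function( t ) setNames( as.list( t[[2]] ), t[[1]] )

t <- tribble( ~a, ~b )
l <- spread_to_list( t )

expect_length( l, 0 )            # passes: the result has length zero
# expect_identical( l, list() )  # this is the expectation that fails
```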

Output from the test gave me a hint:

Error: Test failed: 'Test on zero-element tibble'
* `l` not identical to list().
Attributes: < Modes: list, NULL >
Attributes: < Lengths: 1, 0 >
Attributes: < names for target but not for current >
Attributes: < current is not list-like >

Displaying l and a list() told me the rest:

> l
named list()
> list()
list()
R evidently thinks a named list is not identical to a list, even when the set of names is empty. I don’t know whether this is reasonable or not. An empty set is different from no set. On the other hand, a dictionary which just happens to have no definitions in it is still a dictionary. The R manual’s section on object attribute lists implies that a list’s names are an attribute. Let’s see how that behaves:
> attributes(l)
$names
character(0)

> attributes(list())
NULL
So a list can either have no names and no box to put them in, or the box with nothing inside.
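So to make the comparison succeed, one can either strip the empty names attribute or compare against a named empty list. A base-R sketch (my own, not from the original test):

```r
# l is what spread_to_list() returns for a zero-row tibble:
# an empty list carrying a zero-length names attribute.
l <- setNames( list(), character(0) )

identical( l, list() )                          # FALSE: the attributes differ
identical( unname(l), list() )                  # TRUE: unname() drops the box
identical( l, setNames(list(), character(0)) )  # TRUE: compare like with like
```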

Multinest: Factoring Out Hierarchical Names by Converting Tibbles to Nested Named Lists

Consider this tibble:

vat_rates <- tribble( ~l1,   ~l2         , ~l3         , ~val
                    , 'VAT', 'Cigarettes', '' 	       , 'standard'
                    , 'VAT', 'Tobacco'   , ''          , 'standard'
                    , 'VAT', 'Narcotics' , ''	       , 0
                    , 'VAT', 'Clothing'  , 'Adult'     , 'standard'
                    , 'VAT', 'Clothing'  , 'Children'  , 0
                    , 'VAT', 'Clothing'  , 'Protective', 0
                    )
It contains three name columns followed by a value. The name columns are hierarchical. The top category is VAT (Value Added Tax). The second column subdivides this into VAT on cigarettes, other tobacco, narcotics, and clothing. And the third column further subdivides clothing into adult, children’s, or protective. This is part of the VAT parameters worksheet that provides our economic model with its knowledge of the British tax system. Here’s the same fragment as a screen shot:
I want to convert this into a tree-structured named list in which I can access parameter values by multiple selection. That is, I want to be able to ask for params$VAT$Tobacco and get 'standard', or for params$VAT$Clothing$Children and get 0. This is how I did so.
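Spelt out by hand, the list I'm after looks like the following (illustrative only; note that tribble() coerces the mixed val column to character, so in the real data the zeros come back as the string "0"):

```r
# The target tree-structured named list, written out literally.
params <- list(
  VAT = list(
    Cigarettes = 'standard',
    Tobacco    = 'standard',
    Narcotics  = '0',
    Clothing   = list( Adult      = 'standard'
                     , Children   = '0'
                     , Protective = '0'
                     )
  )
)

params$VAT$Tobacco             # "standard"
params$VAT$Clothing$Children   # "0"
```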

I’m going to show my code below. I’ve annotated it in the same way I did in my previous posts, so all the explanation is in the code as comments. What I did was to write a function called multinest, correcting it a step at a time as I found cases it couldn’t handle. I built it up that way because I found it easier to explain. It seemed to be easier for the reader (or at least, for me acting as reader) to visualise than presenting a finished version of the function immediately and explaining how its recursion and base cases worked. I’m not sure why I found the latter harder, but it’s something to do with the fact that nest transforms an entire table in one go. If you then want to perform an inner transformation on the nested tables it inserts, you have to loop or recurse over them. That’s different from how a more traditional algorithm would work. Anyway, here’s my code. The final version of my function is at the end, multinest_4.

# try_multinest_4.R
#
# This is a version of multinest which
# I hope is easier to understand than
# that in try_multinest_3.R . I built
# it by successive enhancement, starting
# with nest() and adding one correction
# at a time.


library( tidyverse )


t <- tribble( ~l1, ~l2, ~val
            , 'A', 'a', 'Aa'
            , 'B', 'a', 'Ba'
            , 'A', 'b', 'Ab'
            , 'B', 'b', 'Bb'
            )
#
# I will try my functions on this
# tibble. I want to write a function
# that factors out the two name
# columns l1 and l2, converting it
# to the nested named list
#   list( A = list( a = "Aa", b = "Ab" )
#       , B = list( a = "Ba", b = "Bb" )
#       )
# So if this list is l, l$A$a must be
# "Aa", and l$B$b must be "Bb", etc.
     

# First, let's try nest() interactively.

t_nested <- t %>% group_by( l1 ) %>%
                 nest
#
# Gives a tibble where the second
# and third columns of t have been
# collapsed to inner, nested, tibbles:
#   # A tibble: 2 x 2
#        l1             data
#     <chr>           <list>
#   1     A <tibble [2 x 2]>
#   2     B <tibble [2 x 2]>

t_nested[[1,2]]
#
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# So the data column in t_nested
# is indeed tibbles.


# Now I want to "spread" t_nested, 
# flipping it into a list where the column-1
# elements "A" and "B" become the list's
# names, and the tibbles in column 2
# become the values associated with those
# names.

# In my blog post 
# http://www.j-paine.org/blog/2017/10/how-best-to-convert-a-names-values-tibble-to-a-named-list.html 
# I tried functions for doing this. I could
# use spread(). But it turns out I don't need
# to. setNames() can do the job much more 
# efficiently. Here's my function. See
# the post for why it's as it is.

spread_to_list <- function( t )
{
  setNames( as.list( t[[2]] ), t[[1]] ) 
}


# Let's try it.

t_nested_spread <- spread_to_list( t_nested )
#
#   $A
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
#
#   $B
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Ba
#   2     b    Bb
# So that does what I wanted.
# Let's check by displaying one
# of the elements.

t_nested_spread$A
#
#   #  A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# Good.

# So we can convert the outer level of
# a table t to a named list whose keys are
# the names in t[,1].

# I should note that spread_to_list() 
# is safe because nest() will ensure 
# that spread_to_list()'s argument 
# never has a name more than
# once in column 1.
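That assumption is easy to check interactively: anyDuplicated() returns 0 when a vector has no repeated elements. A quick sketch of mine, not part of the original script:

```r
library( tidyverse )

t <- tribble( ~l1, ~l2, ~val
            , 'A', 'a', 'Aa'
            , 'B', 'a', 'Ba'
            , 'A', 'b', 'Ab'
            , 'B', 'b', 'Bb'
            )

# nest() collapses t to one row per distinct l1 value.
t_nested <- t %>% group_by( l1 ) %>% nest()

anyDuplicated( t_nested[[1]] ) == 0   # TRUE: column 1 has no repeats
```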


# Let's make this into a function.

multinest <- function( t )
{
  colname_1 <- as.name( names(t)[1] )

  t %>% group_by( !!colname_1 ) %>%
        nest %>%
        spread_to_list    
}

t_nested_spread_ <- multinest( t )
identical( t_nested_spread, t_nested_spread_ )
#
# Which gives the same result as above,
# so multinest() appears OK.


# Now I want to make multinest() recursive so
# that it factors out the inner tables
# in the same way. I'll assume all the name
# columns are at the front of the table. I
# do have to give an extra argument saying
# how many there are.

multinest_2 <- function( t, no_of_name_cols )
{
  colname_1 <- as.name( names(t)[1] )
  #
  # For as.name and !! , see e.g. Henrik's
  # answer to https://stackoverflow.com/questions/26724124/standard-evaluation-in-dplyr-summarise-on-variable-given-as-a-character-string ,
  # "[u]se as.name if you have a character 
  # string that gives a variable name".

  listed <- t %>% group_by( !!colname_1 ) %>%
                  nest %>%
                  spread_to_list    

  if ( no_of_name_cols > 1 ) 
    return( map( listed, multinest_2, no_of_name_cols-1 ) )
  else
    return( listed )
}

identical( multinest_2( t, 1 ), t_nested_spread )
#
# So this function called for just one column
# works as before.


# Let's see what happens with two columns.

t_nested_spread_2 <- multinest_2( t, 2 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') ) 
    cat( t_nested_spread_2[[X]][[Y]] ) 
#
# But ...
#   Error in cat(t_nested_spread_2[[X]][[Y]]) : 
#   argument 1 (type 'list') cannot be handled by 'cat'

t_nested_spread_2$A$a
#   # A tibble: 1 x 1
#       val
#     <chr>
#   1    Aa

# The innermost list elements are tibbles, not
# simple values. That would be OK if I wanted to
# carry more than one value cell with each
# innermost element, but it wastes space when
# I have only one. What's causing this is the
# inner call to nest(), which is converting
# a tibble such as 
#   # A tibble: 2 x 2
#        l2   val
#     <chr> <chr>
#   1     a    Aa
#   2     b    Ab
# into
#   # A tibble: 2 x 2
#       l2             data
#    <chr>           <list>
#  1     a <tibble [1 x 1]>
#  2     b <tibble [1 x 1]>


# Let's rewrite to avoid this. If there's
# only one column left, just convert to list
# without calling nest().

multinest_3 <- function( t, no_of_name_cols )
{
  colname_1 <- as.name( names(t)[1] )
 
  if ( no_of_name_cols > 1 )
    listed <- t %>% group_by( !!colname_1 ) %>%
                    nest %>%
                    spread_to_list %>% 
                    map( multinest_3, no_of_name_cols-1 )
  else if ( no_of_name_cols == 1 )
    listed <- t %>% spread_to_list

  listed
}

identical( multinest_3( t, 1 ), t_nested_spread )
#
# This actually returns FALSE. So asked to nest
# just one column, the function doesn't do the same
# as its predecessors. It returns the list
#   list(A = "a", B = "a", A = "b", B = "b")
# But that's reasonable, because it's treating
# the second column as the values. So I'll not
# amend it.


# What happens with two columns?

t_nested_spread_3 <- multinest_3( t, 2 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') ) 
    cat( t_nested_spread_3[[X]][[Y]] ) 
#
# Displays AaAbBaBb.
# So that looks OK. 

dput(t_nested_spread_3)
#
# So does a dput():
#   structure(list(A = structure(list(a = "Aa", b = "Ab"), .Names = c("a", 
#   "b")), B = structure(list(a = "Ba", b = "Bb"), .Names = c("a", 
#   "b"))), .Names = c("A", "B"))


# Now let's try three columns. 
# I'll reassign t for this.

t <- tribble( ~l1, ~l2, ~l3, ~val
            , 'A', 'a', '1', 'Aa1'
            , 'A', 'a', '2', 'Aa2'
            , 'A', 'b', '1', 'Ab1' 
            , 'A', 'b', '2', 'Ab2'
            , 'B', 'a', '1', 'Ba1'
            , 'B', 'a', '2', 'Ba2'
            , 'B', 'b', '1', 'Bb1' 
            , 'B', 'b', '2', 'Bb2'
            )

t_nested_spread_3 <- multinest_3( t, 3 )

for ( X in c('A','B') ) 
  for( Y in c('a','b') )
    for( Z in c('1','2') ) 
      cat( t_nested_spread_3[[X]][[Y]][[Z]] )
#
# Displays Aa1Aa2Ab1Ab2Ba1Ba2Bb1Bb2. Good.


# Now let's try on some real data, a fragment 
# of our VAT parameters.

vat_rates <- tribble( ~l1,   ~l2         , ~l3         , ~val
                    , 'VAT', 'Cigarettes', '' 	       , 'standard'
                    , 'VAT', 'Tobacco'   , ''          , 'standard'
                    , 'VAT', 'Narcotics' , ''	       , 0
                    , 'VAT', 'Clothing'  , 'Adult'     , 'standard'
                    , 'VAT', 'Clothing'  , 'Children'  , 0
                    , 'VAT', 'Clothing'  , 'Protective', 0
                    )

vat_rates_nested_spread <- multinest_3( vat_rates, 3 )
#
# But this goes awry on the first three table rows,
# where it generates results such as 
#   $VAT$Narcotics
#   $VAT$Narcotics[[1]]
#   [1] "0"
# The value is wrapped in an unwanted list.

# This is because I didn't check for names
# where some of the final components are
# empty. R has come across the '' and done its
# own thing. 


# I need to fix this. I'll do so below. If
# the table passed in has '' as its first
# element, return the final element, the
# corresponding value. I don't think such
# a table can
# ever have more than one row. If it did, it 
# would have been part of a table such as
#   Pars VAT ''  20
#   Pars VAT Low 17
# which would have to translate into the list
#   Pars$VAT <- 20
#   Pars$VAT$Low <- 17
# This would mean that Pars$VAT is 
# simultaneously a terminal and a non-terminal,
# which makes no sense.

multinest_4 <- function( t, no_of_name_cols )
{
  if ( t[[1,1]] == '' )
    return( t[[ 1, ncol(t) ]] )
  else {
    colname_1 <- as.name( names(t)[1] )

    if ( no_of_name_cols > 1 )
      listed <- t %>% group_by( !!colname_1 ) %>%
                      nest %>%
                      spread_to_list %>% 
                      map( multinest_4, no_of_name_cols-1 )
    else if ( no_of_name_cols == 1 )
      listed <- t %>% spread_to_list

    return( listed )
  }
}

vat_rates_nested_spread <- multinest_4( vat_rates, 3 )
#
#   $VAT
#   $VAT$Cigarettes
#   [1] "standard"
#
#   $VAT$Tobacco
#   [1] "standard"
#
#   $VAT$Narcotics
#   [1] "0"
#
#   $VAT$Clothing
#   $VAT$Clothing$Adult
#   [1] "standard"
#
#   $VAT$Clothing$Children
#   [1] "0"
#
#   $VAT$Clothing$Protective
#   [1] "0"
#
# Good.

How Best to Convert a Names-Values Tibble to a Named List?

Here, in the spirit of my “Experiments with by_row()” post, are some experiments in writing and timing a function spread_to_list that converts a two-column tibble such as:

x 1
y 2 
z 3
t 4
to a named list:
list( x=1, y=2, z=3, t=4 )
I need this for processing the parameter sheets shown in that by_row post, and I’ll explain why later. In this post, I’m just interested in how best to define spread_to_list. The best implementation looks like either spread_to_list_3 or spread_to_list_4 below.

# try_spread_to_list.R
#
# Consider a tibble t with two columns,
# where each cell in the second column
# represents the value associated with
# the string (assumed to be its name) 
# in the first column.
#
# I want to define a function spread_to_list() 
# which converts t into a named list whose names
# are the names in the first column, and whose
# values are the values in the second column.
#
# For example, if t is:
#        names no_of_days
#    DaysInMay         31
#   DaysInJune         30
# then
#   spread_to_list(t) 
# would be the list
#   list( DaysInMay = 31, DaysInJune = 30 )
#
# The code here tries various ways
# of implementing spread_to_list,
# and benchmarks them. Two are variants
# of one another, using spread() and converting
# its result. Another two take the values
# column as a list, and call setNames() to 
# convert to a named list.


library( tidyverse )
library( microbenchmark )
library( stringr )


t <- tribble( ~a , ~b 
            , 'x',  1
            , 'y',  2 
            , 'z',  3
            , 't',  4
            )
#
# I'm going to try various ways
# of implementing spread_to_list()
# on t.


# First, what happens if I call spread()?

s <- spread( t, a, b )
#
# s becomes a one-row tibble:
#         t     x     y     z
#   1     4     1     2     3


# How can I convert s to a named list?

# An obvious way is to call map().
# Let's see what the function argument to
# map() gets passed if I call map() on s.

map( s, show )
#
# Displays
#  [1] 4
#  [1] 1
#  [1] 2
#  [1] 3
# So it gets passed a column of the
# tibble as an atomic vector.

map( s, function(x)x )
#
# Returns a list of these elements:
#   list(t = 4, x = 1, y = 2, z = 3)
# That's because map() is defined to return 
# lists. So the call above uses it merely
# as a type converter.

map( s, identity )
#
# Does the same. identity() is a built-in
# identity function.


# But maybe I can avoid mapping. In my
# experiments with by_row(), 
# http://www.j-paine.org/blog/2017/10/experiments-with-by_row.html ,
# I discovered that as.list() will convert
# a tibble to a named list.

as.list( s )
#
# Also returns 
#   list(t = 4, x = 1, y = 2, z = 3).


# But can I avoid spread() altogether?
# Browsing the discussion groups gave me
# the idea of trying setNames().
# Let's try that, passing t's names as its
# second argument and t's values as its
# first.

setNames( t[[2]], t[[1]] )
# 
# Gives me an atomic named vector. 

# I need to convert it to a list.
# One way is for the argument to setNames()
# to be a list, because that's specified to
# make it return a list.

setNames( as.list( t[[2]] ), t[[1]] )
#
# Returns the list
#   list(x = 1, y = 2, z = 3, t = 4)

# But I could convert the result instead.

as.list( setNames( t[[2]], t[[1]] ) )
#
# Also returns
#   list(x = 1, y = 2, z = 3, t = 4)


# Let's try these four implementations.
# First, define the functions.

spread_to_list_1 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>% 
    spread( !!as.name(colname1), !!as.name(colname2) ) %>% 
    map( identity )
}

spread_to_list_2 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>% 
    spread( !!as.name(colname1), !!as.name(colname2) ) %>% 
    as.list
}

spread_to_list_3 <- function( t )
{
  setNames( as.list( t[[2]] ), t[[1]] )
}

spread_to_list_4 <- function( t )
{
  as.list( setNames( t[[2]], t[[1]] ) )
}


# Now try them.

s1 <- spread_to_list_1( t )
s2 <- spread_to_list_2( t )
s3 <- spread_to_list_3( t )
s4 <- spread_to_list_4( t )

dput( s1 )
dput( s2 )
dput( s3 )
dput( s4 )

identical( s1, s2 )
identical( s1, s3 )
identical( s1, s4 )


# They all return named lists, but the order
# of elements is different for the spread()-based
# versions than from the as.list()-based ones.
# (That was obvious earlier, actually.)
# So I'll sort the lists, then test that
# they're identical. I'll also microbenchmark
# the functions.

sort_list <- function(l)
{ 
  sort( unlist( l ) )
}

identical( sort_list(s1), sort_list(s2) )%>%show
identical( sort_list(s1), sort_list(s3) )%>%show
identical( sort_list(s1), sort_list(s4) )%>%show

mbres <- microbenchmark( spread_to_list_1( t )
                       , spread_to_list_2( t )
                       , spread_to_list_3( t )
                       , spread_to_list_4( t )
                       )
print( mbres )


# Now let's microbenchmark the functions applied
# to bigger tibbles. I'll generate random name-value
# tibbles of sizes n, where n is defined by the
# vector in the 'for' condition.

for ( n in c(10,30,100,300) ) {

  cat( "Trying ", n, "row tibble\n" )

  names <- replicate( n, str_c(sample(letters,5,replace=FALSE),collapse="") )
  #
  # Generate n random alphabetic strings.
  # From Dirk Eddelbuettel's answer to 
  # https://stackoverflow.com/questions/1439513/creating-a-sequential-list-of-letters-with-r .

  values <- runif( n, 1, 100 )
  #
  # Generate n random values.

  t <- tibble( names=names
             , values=values
             )
  #
  # Use these to make a random tibble with
  # two columns and n rows.

  s1 <- spread_to_list_1( t )
  s2 <- spread_to_list_2( t )
  s3 <- spread_to_list_3( t )
  s4 <- spread_to_list_4( t )

  identical( sort_list(s1), sort_list(s2) )%>%show
  identical( sort_list(s1), sort_list(s3) )%>%show
  identical( sort_list(s1), sort_list(s4) )%>%show

  mbres <- microbenchmark( spread_to_list_1( t )
                         , spread_to_list_2( t )
                         , spread_to_list_3( t )
                         , spread_to_list_4( t )
                         )
  print( mbres )
}


# Here are the microbenchmark results for the
# 300-row tibble:
#   Unit: microseconds
#                  expr       min        lq    
#   spread_to_list_1(t) 17738.631 17928.946 
#   spread_to_list_2(t) 14582.521 14805.888 
#   spread_to_list_3(t)    36.223    40.901   
#   spread_to_list_4(t)    35.317    40.146 
#                  expr        mean    median         uq
#   spread_to_list_1(t) 18668.44595 18221.133 19532.8080
#   spread_to_list_2(t) 15386.05835 15050.685 16355.4180
#   spread_to_list_3(t)    46.67244    47.089    51.4655
#   spread_to_list_4(t)    45.77894    45.882    51.6165
#         max neval
#   21477.003   100
#   17314.838   100
#      64.294   100
#      63.087   100

# So the two spread() versions are much slower.
# Converting the spread() result with mapping is
# slower than with as.list(), probably unsurprisingly.
# The two setNames() versions are much faster. 
# It doesn't seem to matter whether we type-convert
# to list by making setNames()'s first argument
# a list, or by making its result one.

Experiments with by_row()

I’ve been experimenting with by_row() from the R purrrlyr package. It’s a function that maps over the rows of a data frame, which I thought might be handy for processing our economic model’s parameter files. I didn’t find the documentation for by_row() told me everything I wanted to know, so I made up and ran some examples. As these might be useful to others, I’m blogging them here.

purrrlyr was written mainly by Hadley Wickham. Regular readers of my posts will know that Hadley wrote the suite of programs known as the Tidyverse, which can be installed with the command install.packages("tidyverse"). However, purrrlyr is not quite part of this, and has to be installed separately, via install.packages("purrrlyr").

Why am I interested in by_row()? Our economic model reads what we call parameter files. These are spreadsheets describing the taxes and benefits it has to apply, plus various controls. Each spreadsheet contains several worksheets, most of which start with three columns making up a hierarchical parameter name, followed by one or more values. Here's a fragment from one such sheet:

        Name1              Name2    Name3 Allowance_Spouse PrivatePension BankAccount NationalSavings
1 IncomeLists      IncomeSupport     <NA>                1              1           0               0               
2 IncomeLists      PensionCredit     Main                1              1           0               0               
3 IncomeLists      PensionCredit  Savings                0              1           0               0               
4 IncomeLists NonContributoryESA   Income                1              1           0               0               
5 IncomeLists NonContributoryESA Earnings                0              0           0               0               
6 IncomeLists              HBCTB     <NA>                1              0           0               0               
And here’s a screen shot of it:

I need to process these sheets a row at a time, converting each row into an entry in a lookup table that maps parameter names to their values. Because by_row() maps over rows, it looked worth learning about. So the code below shows the calls I tried. In it, t is the tibble I passed to by_row(), and tbr is the result of the calls. I've commented each call explaining what it tells me, and also listed the things I wanted to know. One was what the documentation meant by saying that the .collate argument makes by_row() collate by row or by column. "Collate" is a very general verb, and didn't tell me exactly how by_row()'s results were arranged. I was also puzzled by the .labels argument, in which if TRUE "the returned data frame is prepended with the labels of the slices (the columns in .d used to define the slices)".

# try_by_row.R
#
# Some experiments with 
# purrrlyr's by_row() function.
# This may be useful when 
# converting segments of tibble
# rows to lists that I can
# store in a single cell.


library( tidyverse )
library( purrrlyr )


t <- tribble( ~a, ~b, ~c
            ,  1,  2,  3
            ,  4,  5,  6
            ,  7,  8,  9
            , 10, 11, 12
            )
#
# I'll try various calls to by_row on this.

# Let's try sum().

tbr <- by_row( t, sum )
#
# tbr becomes t with a .out list-column added.
#         a     b     c      .out
#   1     1     2     3 
#   2     4     5     6 
#   3     7     8     9 
#   4    10    11    12 
# Each cell of .out is a list whose
# single element is the sum of a, b, and c.

tbr <- by_row( t, sum, .collate="rows" )
tbr <- by_row( t, sum, .collate="cols" )
#
# In both of these, tbr becomes a table where
# each .out cell holds a sum rather than a
# list containing the sum:
#   a     b     c        .out
#   1     1     2     3     6
#   2     4     5     6    15
#   3     7     8     9    24
#   4    10    11    12    33

# Let's try mean().

tbr <- by_row( t, mean )
# Warning messages:
# 1: In mean.default(.d[[i]], ...) :
#   argument is not numeric or logical: returning NA
# 2: In mean.default(.d[[i]], ...) :
#   argument is not numeric or logical: returning NA
# 3: In mean.default(.d[[i]], ...) :
#   argument is not numeric or logical: returning NA
# 4: In mean.default(.d[[i]], ...) :
#   argument is not numeric or logical: returning NA

# Let's find out why sum() works but mean() doesn't.

tbr <- by_row( t, show )
#
# I did this to see what by_row()'s function
# argument gets passed. The command above 
# displays four one-row tibbles, each a row of t.
# So that's what the function gets passed.

sum( t[1,] )
#
# 6. So sum() can accept tibbles.

mean( t[1,] )
#   Warning message:
#   In mean.default(t[1, ]) : argument is not numeric 
#   or logical: returning NA
# But mean() can't.

# The above three results show why by_row gave
# an error with mean() but not sum(). It was
# passing one-row tibbles to both. sum() accepts
# these but mean() doesn't.             
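```r
# (A hedged aside, not part of the original experiments: mean()
# wants an atomic vector, so flattening the row with unlist()
# should fix it. t2 here is a base-R stand-in for row 1 of t.)

t2 <- data.frame( a=1, b=2, c=3 )
mean( unlist( t2 ) )
#
# 2. So a wrapped call like
#   by_row( t, function(row) mean(unlist(row)) )
# ought to give by_row() a working mean.
```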
    
# Let's see whether I can create a column 
# containing copies of the original rows.

tbr <- by_row( t, function(row)row )
#
# Shows that tbr$.out is a list where
# each element is one of these one-row tibbles.
# So this copies each original row into 
# the corresponding cell of .out .

tbr <- by_row( t, as.list )
#
# This does as above but converts each row to
# a list. So here, 
#   tbr$.out[[1]] 
# is
#   list(a = 1, b = 2, c = 3) .
# and so on. This will be useful when I want to
# convert rows to look-up tables implemented
# as named lists. And I presume it's more
# efficient than storing an entire tibble.

as.list( t[1,] )
#
# Just to show that conversion of one-row
# tibble to named list is what as.list does,
# and not some quirk of by_row(). You never
# know with R.

as.list( t )
#
# I try this for interest. It converts to a
# named list where each element is a 
# column vector. Thus,  
#   as.list( t )
# is
#   list( a = c(1, 4, 7, 10)
#       , b = c(2, 5, 8, 11)
#       , c = c(3, 6, 9, 12)
#       )

# Let's try the .collate argument, because
# I don't understand exactly what the 
# by_row() documentation means by collation.

tbr <- by_row( t, function(row)c(10,20,30)  )
#
# This makes tbr$.out a list column each of
# whose elements is c(10,20,30). So it just 
# puts the entire result of the function 
# into the .out cells.

tbr <- by_row( t, function(row)c(10,20,30), .collate="rows" )
#
# Makes tbr the table
#         a     b     c  .row  .out
#   1     1     2     3     1    10
#   2     1     2     3     1    20
#   3     1     2     3     1    30
#   4     4     5     6     2    10
#   5     4     5     6     2    20
#   6     4     5     6     2    30
#   7     7     8     9     3    10
# etc. So appends one column. But copies
# each original row as many times as there 
# are components in the function's result. 
# The i'th copy's .out cell holds the i'th 
# component. And we get a .row column holding
# the number of the original row.

tbr <- by_row( t, function(row)c(10,20,30), .collate="cols" )
#
# Makes tbr the table
#         a     b     c .out1 .out2 .out3
#   1     1     2     3    10    20    30
#   2     4     5     6    10    20    30
#   3     7     8     9    10    20    30
#   4    10    11    12    10    20    30
# So appends three columns, where the i'th
# column contains the i'th component of the
# function's result.

# Let's see whether .collate can produce
# "Manhattan arrays", where the rows are 
# different lengths.

tbr <- by_row( t, function(row)1:(sum(row)%%5), .collate="cols" )
#
# This gave an error:
#   Error in by_row(t, function(row) 1:(sum(row)%%5), .collate = "cols") : 
#   .f should return equal length vectors or data frames 
#   for collating on `cols`
# My function was intended to return rows of different
# length. But, as the documentation says, they have to
# be the same length.

tbr <- by_row( t, function(row)1:(sum(row)%%5), .collate="rows" )
#
# However, this one does run, producing differing
# numbers of rows for each original row:
#         a     b     c  .row  .out
#   1     1     2     3     1     1
#   2     4     5     6     2     1
#   3     4     5     6     2     0
#   4     7     8     9     3     1
#   5     7     8     9     3     2
#   6     7     8     9     3     3
#   7     7     8     9     3     4
#   8    10    11    12     4     1
#   9    10    11    12     4     2
#  10    10    11    12     4     3

# Let's see what happens with .collate when 
# the function returns a data frame, 
# because the documentation doesn't really 
# explain that.

df <- tribble( ~X, ~Y, "$", "%", "^", "&" )

tbr <- by_row( t, function(row)df )
#
# Each cell of tbr's .out column becomes a copy of
# df.

tbr <- by_row( t, function(row)df, .collate="rows" )
# 
# tbr becomes a table 
#         a     b     c  .row     X     Y
#   1     1     2     3     1     $     %
#   2     1     2     3     1     ^     &
#   3     4     5     6     2     $     %
#   4     4     5     6     2     ^     &
# etc. So we get one copy of each original row
# for each row in df. We get one column for
# each column in df, plus again a .row column 
# holding the number of the original row.

tbr <- by_row( t, function(row)df, .collate="cols" )
#
# I wasn't expecting that. df's cells get
# spread into a single line, thus, and its
# column names suffixed accordingly:
#         a     b     c    X1    X2    Y1    Y2
#   1     1     2     3     $     ^     %     &
#   2     4     5     6     $     ^     %     &
#   3     7     8     9     $     ^     %     &
#   4    10    11    12     $     ^     %     &

# Let's see what the .labels argument does. The
# documentation says: if TRUE, the returned data 
# frame is prepended with the labels of the slices 
# (the columns in .d used to define the slices). 
# But I don't know what it means by slices.

tbr <- by_row( t, sum, .labels=FALSE )
#
# We just get 
#          .out
#   1 <dbl [1]>
#   2 <dbl [1]>
#   3 <dbl [1]>
#   4 <dbl [1]>

tbr <- by_row( t, function(row)df, .collate="rows", .labels=FALSE )
#
# tbr becomes
#      .row     X     Y
#   1     1     $     %
#   2     1     ^     &
#   3     2     $     %
#   4     2     ^     &
# etc.

tbr <- by_row( t, function(row)df, .collate="cols", .labels=FALSE )
#
#        X1    X2    Y1    Y2
#   1     $     ^     %     &
#   2     $     ^     %     &
#   3     $     ^     %     &
#   4     $     ^     %     &

# So .labels=FALSE gives me what I think
# of as a traditional mapping function,
# where by_row() returns only the new
# columns. Except that if .collate="rows",
# it prepends a .row column holding the
# original row number.
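```r
# (A closing sketch, base R only, of the lookup-table use I had
# in mind at the start. The parameter names here are illustrative,
# not from our real parameter sheets.)

params <- data.frame( rate=0.2, threshold=11500, allowance=3000 )
lookup <- as.list( params[1,] )
lookup$threshold
#
# 11500. So a row converted with as.list() behaves as a named
# look-up table from parameter names to values, which is just
# what I want each .out cell to hold.
```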

Abstract Data Types and the Uniform Referent Principle II: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()

In “Abstract Data Types and the Uniform Referent Principle I: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()”, I explained why the notation for interfacing to a data structure should be independent of that structure’s representation. R programmers honour this principle in the same way that bricks hang in the sky. All published R code that operates on data frames uses column names. Sometimes these follow the $ operator; sometimes the data frame is implicit via attach() or similar. In the Tidyverse, the column names will often be part of a mutate(), the data frame being piped through a sequence of %>% operators. And this is dreadful software engineering.

Why? Look at the tables below. They represent four different ways of storing my income data.

Person  Income_Type  Income_Value
Alice   Wages           37000
Alice   Bonuses             0
Alice   Benefits            0
Bob     Wages           14000
Bob     Bonuses          1000
Bob     Benefits         6000

Person  Income_Wages  Income_Bonuses  Income_Benefits
Alice          37000               0                0
Bob            14000            1000             6000

Person  Income
Alice   Type      Value
        Wages     37000
        Bonuses       0
        Benefits      0
Bob     Type      Value
        Wages     14000
        Bonuses    1000
        Benefits   6000

Person  Income
Alice   Wages  Bonuses  Benefits
        37000        0         0
Bob     Wages  Bonuses  Benefits
        14000     1000      6000

Abstractly, the data is the same in each case, and if you’re familiar with nest(), unnest(), gather() and spread(), you will easily see how to transform one table into any of the others. But the tables are implemented in very different ways. If you access their elements with $ or an equivalent, and you then change the implementation, you have to rewrite all those accesses. Which is dreadful software engineering.
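As a sketch of the alternative (mine, not Ross's code, and the accessor name is hypothetical), one could funnel every access through a single function, so that only that function knows the representation. Here the hidden representation is the second (wide) table above:

```r
# Hypothetical accessor: callers ask for an income by person and
# type, and never mention column names themselves. Only income()
# knows that the data is stored wide, one column per income type.
incomes <- data.frame( Person          = c("Alice", "Bob")
                     , Income_Wages    = c(37000, 14000)
                     , Income_Bonuses  = c(0, 1000)
                     , Income_Benefits = c(0, 6000)
                     )

income <- function( d, person, type )
  d[ d$Person == person, paste0("Income_", type) ]

income( incomes, "Bob", "Benefits" )
```

If the data later moved to one of the other three representations, only income() would need rewriting; every caller would stay the same.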

[Cartoon: an experimenter peers into the innards of a complicated piece of machinery. His colleague holds a plug coming out of it labelled INTERFACE: 'GET' 'INCOME' 'WAGES', saying "Don't worry about how it works. It's the interface that's important."]

Abstract Data Types and the Uniform Referent Principle I: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()

“What’s the Uniform Referent Principle?” my colleague asked me on reading my last post. I think I first came across it in Jean Sammet’s famous book Programming Languages: History and Fundamentals. In a description of Douglas Ross’s AED-0 language, she pointed out a feature that she thought particularly noteworthy: the notation for getting information out of a piece of data is the same regardless of how the data is stored. Or as Ross puts it: There must be a single, uniform method of referencing operands regardless of their detailed implementation. He considered this so important that he gave it a name: the Uniform Referent Principle. Unfortunately, it’s a principle universally ignored by R programmers. I’m going to explain why that is bad, and how it affects my coding of the tax and benefit calculations in our economic model.

Sammet doesn’t actually name the principle, but Ross himself does, as is evident from the title of his paper “Uniform Referents: An Essential Property for a Software Engineering Language”, in Software Engineering: Proceedings of the Third Symposium on Computer and Information Sciences held in Miami Beach, Florida, December, 1969 (Volume 1, edited by Julius Tou). It’s worth asking why the principle concerned him. His answer is as relevant now as it was in 1969. Here’s the start of his introduction:

The term software engineering is new and has yet to achieve a well-defined meaning. Most present-day software has not been engineered at all, but instead is a product of the individual experience of its creators and primarily ad hoc choices among various alternatives. The effects of this unstructured approach to software are quite obvious in terms of poor performance, missed schedules, and uncontrolled costs.

Ross continues by saying that although software engineering is one of the most challenging problems facing humanity, we are beginning to systematise it and to recognise the fundamental issues. The most fundamental of these involves:

[…] one basic criterion which a general-purpose programming language must meet if it is to be used as the expressive vehicle for software engineering activities regardless of the style of practice of these activities: There must be a single, uniform method of referencing operands regardless of their detailed implementation.

Why is this so important? Ross relates it to his ideas on how to design programming languages:

Programming language features for software engineering must be carefully selected; not any old programming language features will do. An unstructured blur of assembly language depending in turn upon an ad hoc collection of machine hardware features which just happen to have been brought together for some particular computer design has a low probability of matching the real world in any profound way. Similarly, most “high-level” languages invariably will be either not complete or not concise enough to serve as a basis for software engineering, for they have a similar ad hoc ancestry with roots in scientific computation or business data processing areas, which omit many important aspects of general software construction.

Heaven forbid that R’s ancestry be deemed in any way ad hoc. But let Ross continue:

Careful inspection and experimentation with various software engineering activities discloses a few fundamental features that somehow must be present in a software engineering language. These features can be informally derived directly from high-level considerations of the software engineering design process itself. The purpose of this paper is to indicate such a derivation of one basic principle without attempting an exhaustive enumeration of the consequences.

Ross then goes on to those promised high-level considerations. Quoting an excerpt from his report “Computer-Aided Design: A Statement of Objectives” (Tech. Mem. 8436-TM-4, Defense Documentation Center No. AD252060, M.I.T. Electron. Systems Lab., 1960), he writes that:

The manner of stating problems is also of primary importance. You must be able to state problems from outside in, not inside out. The normal practice of being explicit about a problem is to build a description of the solution procedure in small steps, i.e. to code the problem…. Stating a problem step by step from the inside out in this manner is satisfactory if the problem is well understood to begin with. But there is an alternative procedure which is frequently used among people, but is virtually non-existent as regards computers. In this alternate procedure, which we call stating the problem from the outside in, we state the problem originally in general terms and then explain further and in more and more detail what we mean…. It is quite apparent that this procedure of stating a problem from the outside in by means of further and further refinements is the only feasible way to attack grand problems.

This way of stating problems is also called stepwise refinement. One of the computer scientists who popularised the name was Niklaus Wirth, the designer of Pascal, in his paper “Program development by stepwise refinement” (Communications of the ACM, April 1971, Volume 14, Number 4). In it, he writes:

A guideline in the process of stepwise refinement should be the principle to decompose decisions as much as possible, to untangle aspects which are only seemingly interdependent, and to defer those decisions which concern details of representation as long as possible. This will result in programs which are easier to adapt to different environments (languages and computers), where different representations may be required.

Why is it important to defer decisions which concern details of representation? Because undoing these decisions is costly. One has to rewrite code, retest it, and redocument it. The modern technique for minimising this cost is to use abstract data types. To quote the beginning of the Wikipedia page:

In computer science, an abstract data type (ADT) is a mathematical model for data types, where a data type is defined by its behavior (semantics) from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.

So one centralises these operations inside a module, and makes everything that depends on the concrete representation private, inaccessible from outside. The only things that aren’t inaccessible are the interface functions. Then, no matter how much the representation is changed, the interface won’t be, and so neither will code that calls it.
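In R, one minimal way to get such a module (a sketch of the general idea, not code from our model) is a closure that hides the representation and exports only interface functions:

```r
# make_incomes() hides a long-format data frame inside a closure;
# get_income() is the only way in. Swapping the internal
# representation (say, to a wide table) would change nothing
# outside this function.
make_incomes <- function() {
  data <- data.frame( Person = c("Alice", "Alice", "Bob")
                    , Type   = c("Wages", "Bonuses", "Wages")
                    , Value  = c(37000, 0, 14000)
                    )
  list( get_income = function( person, type )
          data$Value[ data$Person == person & data$Type == type ] )
}

incomes <- make_incomes()
incomes$get_income( "Alice", "Wages" )
```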

Ross’s reasoning is similar:

The crucial feature about outside-in design as the underlying methodology for software engineering is that […] proper treatment of interface mechanisms will allow the higher level to remain unchanged regardless of changes in the details of the inner level. In other words all of the variations can be accommodated in the interface changes, so that the outer level is completely unaffected.

[…]

The criterion is this: A single uniform referent mechanism must be used for all references to operands. […] In order for the programs of an outer level to remain completely unchanged as interface programs change to accommodate various forms of inter-level detail, the manner of referencing these operands must be the same for all possible forms.

There is one difference, however. Ross was writing many years ago, before our modern views on abstract data types. I think it’s fair to say that he sees things in slightly more representational terms than we would today. If we were implementing a data structure as an array, we’d make our interface hide the fact that it’s an array, by exporting functions that do the subscripting. Ross, however, would write the subscripting operations themselves in the interface, but in a notation that looks the same as function calls. Which is why later on in his paper, Ross writes the following table:

The notation A(B) in any syntactic context always means

Property A of thing B

AED-0 declarations allow choices:

A: ARRAY      B: INDEX
A: COMPONENT  B: POINTER
A: FUNCTION   B: ARGUMENT
A: MACRO      B: ITEM-STRING

With the .INSERT STATEMENT:

Program file       Declaration files
BEGIN              7094 version
.INSERT DECL $     360 version
BODY STATEMENTS $  1108 version
END FINI

The program file never changes when any declaration is used

(Continued.)

Thoughts on nest()

I’ve been experimenting with the Tidyverse’s nest function, because it may be useful when, for example, using households together with benefit units. Below are some thoughts that I first posted as a comment to Hadley Wickham’s blog entry “tidyr 0.4.0”. More on this in future posts.

First, this is likely to be very useful to me. I’m translating a microeconomic model into R. Its input is a set of British households, where each household record contains data on income and expenditure. The model uses these to predict how their incomes will change if you change tax (e.g. by increasing income tax) or benefits (e.g. by increasing pensions or child benefit).

Our data splits households into “benefit units”. A benefit unit ( http://www.poverty.org.uk/summary/households.shtml ) is defined as an adult plus their spouse if they have one, plus any dependent children they are living with. So for example, mum and dad plus 10-year-old Johnnie would be one benefit unit. But if Johnnie is over 18, he becomes an adult who just happens to live with his parents, and the household has two benefit units. These are treated more-or-less independently by the benefit system, e.g. if they receive money when out of work.

This matters because each dataset contains one table for household-wide data, and another for benefit-unit-wide data. I’ve been combining these with joins. But it might be cleaner to nest each household’s benefit units inside the household data frame. Not least, because sometimes we have to change data in a household, for example when simulating inflation, and then propagate the changes down to the benefit units.

Second, nest and unnest could be useful elsewhere in our data. Each household’s expenditure data is divided into categories, for example food, rent, and alcohol. We may want to group and ungroup these. For example, I make:

d <- tibble( ID=c( 1, 1, 1,  2, 2,  3, 3,
                   4, 4, 4,  5, 5, 5,  6, 6 ),
             expensetype=c( 'food', 'alcohol', 'rent',  'food', 'rent',  'food', 'rent',
                            'food', 'cigarettes', 'rent',  'food', 'alcohol', 'rent',  'food', 'rent' ),
             expenditure = c( 100, 50, 400,  75, 300,  90, 400,
                              100, 30, 420,  75, 50, 550,  150, 600 )  
           )

Then

d %>% group_by(ID) %>% nest %>% arrange(ID)

gives me a table

  1 <tibble [3 x 2]>
  2 <tibble [2 x 2]>
  3 <tibble [2 x 2]>
  4 <tibble [3 x 2]>
  5 <tibble [3 x 2]>
  6 <tibble [2 x 2]>
where the first column is the ID and the second is a table such as
food      100
alcohol    50
rent      400
So in effect, it makes my original table into a function from ID to ℘ expensetype × expenditure.

Whereas if I do

d %>% group_by(expensetype) %>% nest %>% arrange(expensetype)

I get the table

  alcohol    <tibble [2 x 2]>
  cigarettes <tibble [1 x 2]>
  food       <tibble [6 x 2]>
  rent       <tibble [6 x 2]>
where the first column is expenditure category and the second holds tables of ID versus expenditure. In effect, a function from expensetype to ℘ ID × expenditure. This sort of reformatting may be very useful.
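The same “table as function” view can be had in base R with split(), which I find a useful way to see what nest is doing. A sketch, using a cut-down version of d:

```r
# split() turns the long table into a named list: in effect, a
# function from expensetype to sub-tables of ID and expenditure.
d2 <- data.frame( ID          = c(1, 1, 2)
                , expensetype = c("food", "rent", "food")
                , expenditure = c(100, 400, 75)
                )

by_type <- split( d2[ c("ID", "expenditure") ], d2$expensetype )
by_type$food$expenditure
# The "food" element holds the two food rows' expenditures.
```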

Third. Continuing from the above, I wrote this function:

functionalise <- function( data, cols )
{
  data %>% group_by_( .dots=cols ) %>% nest %>% arrange_( .dots=cols )
}
The idea is that functionalise( data, cols ) reformats data into a data frame that represents a function. The columns cols represent the function’s domain, and will never contain more than one occurrence of the same tuple. The remaining column represents the function’s codomain. Thus, functionalise(d,"expensetype") returns the data frame shown in the last example.

Fourth, I note that I can write either

d %>% group_by( expensetype ) %>% nest %>% arrange( expensetype )
or
nest( d, ID, expenditure )

In the first, I have to name those columns that I want to act as the function’s domain. In the second, I have to name the others. I find the first more natural.

Fifth, nest and unnest, as well as spread and gather, make it very easy to generate alternative but logically equivalent representations of a data frame. But every time I change representation in this way, I have to rewrite all the code that accesses the data. It would be really nice if either a representation-independent way of accessing it could be found, or if nest/unnest and spread/gather could be made to operate on the code as well as the data. Paging Douglas Ross and the Uniform Referent Principle…

Literate Programming, Data-flow Networks, and the Inscrutability of Economic Models

Write programs as if they were mathematical essays, but with code instead of equations. This is “literate programming”, an idea invented by the famous computer scientist Donald Knuth. The quote below is from www.literateprogramming.com , apparently first published by Knuth in “Literate Programming (1984)” in Literate Programming, CSLI, 1992, page 99:

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.

With our model, there are at least three classes of user who need explaining to. One is me, in six months’ time after I’ve forgotten how my code worked. The second is my colleague. And the third is the economists and others who will use the model.

This final class needs a little extra help. My networks can be seen as a kind of literate program where the expository narrative is branched rather than linear, and the essayist is writing around function definitions rather than equations. Traditionally, economic models have been opaque, their assumptions known only to the few who implemented them. But with this kind of display, everyone is free to peer inside.

Viewlets

In my last post, I said that I’d arranged for the nodes in my network diagram to be clickable. Clicking on a data node brings up a display showing its contents plus an explanation thereof; clicking on a transformation node shows the code it runs and an explanation of that. I call these little displays “viewlets”, and will now explain how I implemented them.

The main thing for this prototype was to demonstrate the idea. So I put together two easy-to-use and free pieces of software. The first is datatables.net’s Table plug-in for jQuery. This uses JavaScript to convert an HTML table into something you can scroll through, search, and sort. There are demonstrations here.

The second is Jed Foster’s Readmore.js. This, as the demo on that page shows, reduces large chunks of text to manageable size by equipping them with “Read More” links. Initially, you see only a few lines. Click “Read More”, and you see the full text; click once more, and the extra text goes away, giving just those few lines again.

Here’s a screenshot of a data viewlet:

And here’s a code viewlet:

Browsers have a habit of exploding when asked to display a 20,000-row table. So the data viewer is probably not good for full-scale datasets. Something like base-R View, talking to the server via sockets, might be better. The “Read More” library is also not ideal, because it doesn’t allow the programmer to specify how much of the text should be displayed above the link, and because you have to scroll all the way down to the end to collapse the full text.