Implementing Key-Value Notation for Data Frames without Using Tribbles

There’s a lot to be said for tribbles. As Hadley Wickham says in the “Tibbles” chapter of R for Data Science, his tribble function makes it easy to enter small data tables in a program, because you can type them row by row rather than column by column. Like this:

tribble(
  ~actor   , ~character,
  "Shatner", "Kirk"    ,
  "Nimoy"  , "Spock"   , 
  "Kelley" , "McCoy"     
)
In contrast, traditional data-frame notation makes you do this:
data.frame( actor=c("Shatner","Nimoy","Kelley"),
            character=c("Kirk","Spock","McCoy")
          )
This makes it hard to match up different elements of the same row. In my posts about tribbles for lookup tables, I overcame this by using tribble. But I now want to show a solution that I thought of before discovering tribble. It uses lapply and a binary operator to convert lists of key-value pairs into data frames. This is what the resulting notation looks like. It’s not quite as convenient as tribble notation, because you have to type three characters to separate keys from values, but it’s better than what data.frame permits:
keys_and_values_to_data_frame(  
  1 %:% 'North_East',
  2 %:% 'North_West_and_Merseyside',
  4 %:% 'Yorks_and_Humberside',
  5 %:% 'East_Midlands',
  6 %:% 'West_Midlands',
  7 %:% 'Eastern',
  8 %:% 'London',
  9 %:% 'South_East',
 10 %:% 'South_West',
 11 %:% 'Wales',
 12 %:% 'Scotland',
 13 %:% 'Northern_Ireland'
)

The key (sorry!) to this is that R allows you to define your own operators. I can’t find where this is mentioned in the R language manual, but there’s a good discussion on StackOverflow. An identifier that begins and ends with a percent sign can be assigned a function, and R’s parser will then allow it to be written as an infix operator, i.e. between its arguments. So if I type:

f <- function( x, y )
{
  2 * x + y
}

`%twiceandadd%` <- f

3 %twiceandadd% 5
I get the answer 11, just as if I’d called f( 3, 5 ).

Note that the backtick symbols, ` , are not part of the name, but are there to make the use of the identifier in the second statement valid. The R language manual explains this in the section on quotes.
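
The same backtick quoting also lets you call an operator in ordinary prefix form, which makes it plainer that an operator is just a function with an unusual name. A small sketch, reusing the %twiceandadd% name from above:

```r
# Define the operator directly; the backticks quote the non-standard name.
`%twiceandadd%` <- function( x, y ) 2 * x + y

infix_result  <- 3 %twiceandadd% 5         # infix form
prefix_result <- `%twiceandadd%`( 3, 5 )   # the same call, in prefix form
builtin_sum   <- `+`( 1, 2 )               # built-in operators can be called this way too

identical( infix_result, prefix_result )
```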

What I did was to make the infix operator %:% a synonym for the base-R function list. So the code above does the same as

keys_and_values_to_data_frame( 
  list( 1, 'North_East' ),
  list( 2, 'North_West_and_Merseyside' ),
  ...
  list( 13, 'Northern_Ireland' )
)
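
The operator definition itself isn't shown above, but from that description it needs only one line. A minimal sketch:

```r
# Make %:% another name for base R's list().
`%:%` <- list

pair <- 1 %:% 'North_East'

# The pair is a two-element list, so the key and the value keep their own types.
identical( pair, list( 1, 'North_East' ) )
```

One thing to watch: operators of the form %...% bind more tightly than + and *, so a computed key needs parentheses, as in (1 + 1) %:% 'x'.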

I then defined keys_and_values_to_data_frame as:

keys_and_values_to_data_frame <- function( ... )
{
  keys_and_values_list_to_data_frame( list( ... ) )
}

and keys_and_values_list_to_data_frame as:
keys_and_values_list_to_data_frame <- function( l )
{
  keys <- unlist( lapply( l, function(x) x[[1]] ) )
  values <- unlist( lapply( l, function(x) x[[2]] ) )
  df <- data.frame( key=keys, value=values )
  df
}

So, via the three-dots construct, which I mentioned in my last post, keys_and_values_list_to_data_frame gets passed a list of lists:

list( list( key1, value1 ), list( key2, value2 ), ... , list( keyN, valueN ) )
It then has to slice out all the first elements (the keys) to give the first column of the data frame df, and all the second elements (the values) to give the second column:
df <- data.frame( key=c( key1, key2, ... , keyN ), value=c( value1, value2, ... , valueN ) )

To do this, it uses lapply. The first call selects all the first elements of the sublists, and the second selects all the second elements. As with my last post, I then had to call unlist to flatten the result.
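
To see the whole thing run, here is a condensed, self-contained version of the pieces described in this post, applied to three of the regions. The stringsAsFactors=FALSE argument is an addition for this sketch, so that the value column is a character vector in versions of R before 4.0.0 as well:

```r
# %:% is just another name for list().
`%:%` <- list

keys_and_values_list_to_data_frame <- function( l )
{
  # Pull the first and second element out of every pair, then flatten.
  keys   <- unlist( lapply( l, function(x) x[[1]] ) )
  values <- unlist( lapply( l, function(x) x[[2]] ) )
  data.frame( key=keys, value=values, stringsAsFactors=FALSE )
}

keys_and_values_to_data_frame <- function( ... )
{
  keys_and_values_list_to_data_frame( list( ... ) )
}

regions <- keys_and_values_to_data_frame(
  1 %:% 'North_East',
  2 %:% 'North_West_and_Merseyside',
  4 %:% 'Yorks_and_Humberside'
)

str( regions )
```

The result has one row per pair, with a numeric key column and a character value column.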

If any of that's unclear, it often helps to visualise functions like lapply and mapply in terms of sequences laid alongside one another. It may also be helpful to read Hadley Wickham's very clear explanation in the section on "Functionals" from his book Advanced R.

To finish, two notes. First, I made my inner lists, the ones pairing keys and values, with list rather than c. That's because the keys and values are different types, and c would have coerced them to a common type.
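
A quick way to see the difference between the two is to build the same pair both ways and inspect the types. This is only an illustration; the variable names are made up for it:

```r
with_c    <- c( 1, 'North_East' )      # c() converts both elements to character
with_list <- list( 1, 'North_East' )   # list() leaves each element alone

class( with_c )            # "character": the key 1 has become the string "1"
class( with_list[[1]] )    # "numeric"
class( with_list[[2]] )    # "character"
```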

Second, here's an example of lapply and unlist from the R shell. It also shows something I hadn't realised until I wrote the above code. The subscripting operator [[ is a function, and can be called from lapply and its ilk directly, without having to wrap it inside another function.

> l <- list( list('a','A'), list('b','B') )

> lapply( l, function(e)e[[1]] )
[[1]]
[1] "a"

[[2]]
[1] "b"

> lapply( l, function(e)e[[2]] )
[[1]]
[1] "A"

[[2]]
[1] "B"

> unlist( lapply( l, function(e)e[[1]] ) )
[1] "a" "b"

> unlist( lapply( l, function(e)e[[2]] ) )
[1] "A" "B"

> lapply( l, `[[`, 1 )
[[1]]
[1] "a"

[[2]]
[1] "b"

> unlist( lapply( l, `[[`, 1 ) )
[1] "a" "b"
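
Since `[[` can be passed directly, the lapply and unlist pair can also be replaced by a single call to base R's vapply, which additionally checks that every extracted element has the declared type and length. This is an alternative sketch, not what the code in this post uses:

```r
l <- list( list('a','A'), list('b','B') )

# character(1) declares that each result must be a single string;
# the trailing 1 or 2 is passed on to `[[` as the index.
firsts  <- vapply( l, `[[`, character(1), 1 )
seconds <- vapply( l, `[[`, character(1), 2 )
```

With vapply, a sublist whose element is not a single string raises an error instead of being silently flattened or coerced by unlist.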
