From Python Dictionaries to Tribbles II: How I Implemented Lookup Tables in R for Numeric Data Codes

In my last post, I explained how tribbles make it easy to write data frames as a sequence of key-value pairs. But how can I make these data frames act as lookup tables? By using the base R function match.

This is how it works. First, I’ll make a tibble:

dict <- tribble(
  ~key, ~value,
  'a',  'A',
  'b',  'B',
  'c',  'C'
)
This gives me a two-column table where each key is in the same row as its value:
# A tibble: 3 x 2
    key value
  <chr> <chr>
1     a     A
2     b     B
3     c     C

The values in the second column represent the translations of the keys in the first column.

Now, suppose I want to translate the string 'b'. It’s in row two of column 1. Its translation is in row two of column 2. Generalising, if I want to translate string s, I find out which row r of column 1 it’s in, and then treat row r of column 2 as its translation. I can find its row using match. Here are three examples of match looking up a string in a vector of strings:

> match( 'a', c('a','b','c') )
[1] 1
> match( 'b', c('a','b','c') )
[1] 2
> match( 'c', c('a','b','c') )
[1] 3

Because the columns of tibbles (and data frames) are vectors, I can use match on these. Therefore, I can define my lookup function in this way:

lookup <- function( dict, v )
{
  keys <- dict[[ 1 ]]

  indices <- match( v, keys )

  translations <- dict[[ 2 ]]

  result_col <- translations[ indices ]

  result_col
}

There’s a subtlety here. Many R functions are “vectorised”. To quote from the language definition:

“R deals with entire vectors of data at a time, and most of the elementary operators and basic mathematical functions like log are vectorized (as indicated in the table above). This means that e.g. adding two vectors of the same length will create a vector containing the element-wise sums, implicitly looping over the vector index. This applies also to other operators like -, *, and / as well as to higher dimensional structures.”

One of the built-in functions that’s vectorised is match. So if I pass a vector as its first argument, it will look up each element thereof in its second argument:

> match( c('b','c','a','b'), c('a','b','c') )
[1] 2 3 1 2
This is why I gave my variables plural names. My function operates on whole vectors: it passes v, the vector to be translated, and the entire first column of the lookup table to match.
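
One consequence is worth sketching: match returns NA for anything it cannot find, so a lookup built on it returns NA for unknown keys rather than raising an error. The example below is a minimal illustration, assuming a base-R data frame in place of the tibble (so that it runs without loading any packages) and a condensed version of the lookup function above.

```r
# A base-R data frame standing in for the tibble: an assumption made
# so that the example needs no packages.
dict <- data.frame( key   = c( 'a', 'b', 'c' ),
                    value = c( 'A', 'B', 'C' ),
                    stringsAsFactors = FALSE
                  )

# Condensed form of the lookup function defined above.
lookup <- function( dict, v )
{
  dict[[ 2 ]][ match( v, dict[[ 1 ]] ) ]
}

found <- lookup( dict, c( 'b', 'a' ) )
# Both keys are present, so both get translated.

missing <- lookup( dict, c( 'b', 'z' ) )
# 'z' is not a key: match gives NA for it, and so does lookup.
```

Whether NA is the right answer for an unknown code depends on the data; a stricter version could check for NAs in the indices and stop with an error.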

I’ll finish with a complete listing of my code and a demo. Here’s the listing:

# dictionaries.R


library( tidyverse )


# Returns a dictionary. 
# This is implemented as a tibble with
# 'key' and 'value' columns.
#
dictionary <- function( ... )
{
  tribble( ~key, ~value, ... ) 
}


# Translates vector v by looking up
# each element in dictionary 'dict'. The
# result is a vector whose i'th element
# is a translation of the i'th element of
# v.
#
lookup <- function( dict, v )
{
  keys <- dict[[ 1 ]]

  indices <- match( v, keys )
  #
  # 'indices' will become a vector whose 
  # i'th element is the position p of
  # the i'th element of v in 'keys'. 
  # The corresponding element in
  # 'translations' will be its translation.

  translations <- dict[[ 2 ]]

  result_col <- translations[ indices ]

  result_col
}
The three dots near the top may puzzle some. They denote all the arguments to dictionary, which get passed to tribble. Patrick Burns has some examples in “The three-dots construct in R”.
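
As a small illustration of the three dots (a sketch using hypothetical function names, not part of dictionaries.R): whatever arguments the caller supplies in place of ... are forwarded unchanged, and the wrapper can put arguments of its own in front of them, just as dictionary puts ~key and ~value in front.

```r
# count_args and with_header are hypothetical names, used only
# for this illustration.

# Counts however many arguments it is given.
count_args <- function( ... )
{
  length( list( ... ) )
}

# Adds two arguments of its own, then forwards the caller's.
with_header <- function( ... )
{
  count_args( 'key', 'value', ... )
}

n <- with_header( 'jack', 4098, 'sape', 4139 )
# n is 6: the two added here plus the four passed in.
```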

And here, mimicking the Python with which I began, is a demo using this code.

> tel <- dictionary( 'jack', 4098, 'sape', 4139 )
> tel
# A tibble: 2 x 2
    key value
  <chr> <dbl>
1  jack  4098
2  sape  4139
> lookup( tel, 'jack' )
[1] 4098
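
Since the point of all this is to decode numeric data codes, here is a sketch of the same idea applied to a whole column of region codes. It assumes a base-R data frame as the dictionary, so that it runs without packages, and a made-up households table; the names regions and households and the household data are hypothetical, though the codes and region names are taken from the region table in Part I.

```r
# Dictionary from region code to region name (first three entries only).
regions <- data.frame(
  key   = c( 1, 2, 4 ),
  value = c( 'North_East', 'North_West_and_Merseyside', 'Yorks_and_Humberside' ),
  stringsAsFactors = FALSE
)

# Condensed form of the lookup function listed above.
lookup <- function( dict, v )
{
  dict[[ 2 ]][ match( v, dict[[ 1 ]] ) ]
}

# Hypothetical data: three households with coded regions.
households <- data.frame( ID = c( 1, 2, 3 ), region_code = c( 4, 1, 4 ) )

# Translates the whole column in one call, because match is vectorised.
households$region <- lookup( regions, households$region_code )
```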

From Python Dictionaries to Tribbles I: How I Implemented Lookup Tables in R for Numeric Data Codes

As regular readers will know, I’ve been translating an economic model from Python into R. It reads data about the income and expenditure of British households, from sources such as the Family Resources Survey and Family Expenditure Survey. Much of this data is coded as numbers, and the model has to translate these into something intelligible. The Python version uses a kind of built-in lookup table called a “dictionary”: but these don’t exist in R, and I had to implement an equivalent. It was important that I and my colleague be able to initialise the table by writing it as key-value pairs. So I used tribbles…

I’ll explain what Python does first. Here’s an example taken from python.org’s “Dictionaries” tutorial, run on PythonAnywhere’s interactive interpreter:

In [1]: tel = { 'jack': 4098, 'sape': 4139 }
In [2]: tel
Out[2]: { 'jack': 4098, 'sape': 4139 }
In [3]: tel['guido'] = 4127
In [4]: tel
Out[4]: { 'guido': 4127, 'jack': 4098, 'sape': 4139 }
In [5]: tel['jack']
Out[5]: 4098
The first statement creates a dictionary, using curly brackets around its contents. The third and fifth statements change or look up elements, using indices in square brackets. It’s an easy notation.

Our Python model’s dictionaries look more like the one below, which translates region codes to names, but the idea is the same:

{ 1: 'North_East',
  2: 'North_West_and_Merseyside',
  4: 'Yorks_and_Humberside',
  5: 'East_Midlands',
  6: 'West_Midlands',
  7: 'Eastern',
  8: 'London',
  9: 'South_East',
 10: 'South_West',
 11: 'Wales',
 12: 'Scotland',
 13: 'Northern_Ireland'
}

So I needed a data structure that did the same job in R, and a way to initialise it by writing key-value pairs. But whereas lookup tables are built in to Python, they aren’t in R. There are contributed packages for them such as hashmap and hash. But I decided to implement lookup tables as data frames, as it might give me more control if I needed to do anything odd that these packages didn’t allow.

In fact, I used tibbles instead of ordinary data frames. Tibbles, as Hadley Wickham says in the “Tibbles” chapter of R for Data Science, are data frames, but tweaked to make life a little easier. Importantly for me, “make life easier” includes making it easier to enter small amounts of data in a program by using key-value notation. This is done via the function tribble. This call:

tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
creates a tibble with columns named x, y and z, and the two rows shown under these names just above. R prints it like this:
# A tibble: 2 x 3
      x     y     z
  <chr> <dbl> <dbl>
1     a     2   3.6
2     b     1   8.5

And this call:

tribble(
   ~key, ~value, 
    1  , 'North_East',
    2  , 'North_West_and_Merseyside',
    4  , 'Yorks_and_Humberside',
    5  , 'East_Midlands',
    6  , 'West_Midlands',
    7  , 'Eastern',
    8  , 'London',
    9  , 'South_East',
    10 , 'South_West',
    11 , 'Wales',
    12 , 'Scotland',
    13 , 'Northern_Ireland'
  )
creates a tibble with two columns named key and value, and 12 rows (there is no region code 3). Here’s how R prints this one:
# A tibble: 12 x 2
     key                     value
   <dbl>                      <chr>
 1     1                North_East
 2     2 North_West_and_Merseyside
 3     4      Yorks_and_Humberside
 4     5             East_Midlands
 5     6             West_Midlands
 6     7                   Eastern
 7     8                    London
 8     9                South_East
 9    10                South_West
10    11                     Wales
11    12                  Scotland
12    13          Northern_Ireland

So the Tidyverse has made it easy to enter key-value pairs in Python-dictionary-style notation and turn them into tibbles. How do I make these act as lookup tables? See my next post. By the way, the name “tribble” stands for “transposed tibble”.
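
To make the “transposed” point concrete, here is a sketch that builds the small x, y, z table from earlier in two ways: row by row with tribble, and column by column with tibble. It assumes the tibble package is installed; the variable names are mine.

```r
library( tibble )

# Row-wise: each line of the call is one row of the table.
by_rows <- tribble(
  ~x,  ~y, ~z,
  "a", 2,  3.6,
  "b", 1,  8.5
)

# Column-wise: each argument is one column of the table.
by_columns <- tibble(
  x = c( "a", "b" ),
  y = c( 2, 1 ),
  z = c( 3.6, 8.5 )
)

# Compares the two tables column by column.
same <- identical( as.list( by_rows ), as.list( by_columns ) )
```

The row-wise form is the one that reads like a list of key-value pairs, which is why it suits small hand-written dictionaries.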

Reification

We programmers live our working lives surrounded by data structures and subroutines, entities that become as concrete to us — as “thing-like”, as “manipulable” — as teacups and bricks. The feeling of thingness is strengthened, I think, by interactive shells such as R’s which enable one to call functions and inspect their results, and to store these in variables and pass them around. For our model, those results are either chunks of economic data such as our tables of households, or income-distribution graphs and other such summaries. I hope that being able to touch, probe, and pick up these things with R will make them seem more real.
Cartoon of an experimenter probing the innards of an economic model built as a piece of electronics in a case, and displaying an income distribution on a test meter

The Innocent Eye, the Martian Poet, and the R Inferno

Literature has the concept of the “innocent eye”: that visitor to regions strange who, vision unclouded by familiarity, is able to see and report on how absurd things really are over there. There are also “Martian poets”, who send home postcards about the oddities of their own environment as if visiting it from Mars. As it happens, I’ve tried my hand at both. “Enterro Da Gata ’98” is an innocent-eye piece about Braga in Portugal, written when I was visiting the University of Minho. And “The Processes That Count” is a Martian-poet (or more accurately, a universe-next-door-poet) view of addition. With R, I find myself tempted into both positions.

On the one hand, I’ve done a lot of R programming over the last few years. Some of R’s quirks now seem distressingly natural, and I have to work hard to see them from outside — from Mars. On the other hand, I’m not a statistician, I’ve not explored the whole of the language and its libraries, and I’m nowhere near as expert as, say, the implementors of the Tidyverse. So I’m still a relative innocent, capable of viewing R’s peculiarities from Planet Pascal, Planet Lisp, or any of the other twenty-or-so languages I know.

I think it’s good to retain an innocent’s point of view, especially when teaching. And writing code for my colleague is a kind of teaching, because I have to explain stuff in my comments. Especially the stuff that will trip up the unwary programmer. Luckily, a lot of this stuff is explained in Patrick Burns’s book The R Inferno. I recommend it.
Photo of Satan's head as the entrance to the Dante's Inferno ride in Panama City, Florida
(The Dante’s Inferno ride entrance in Panama City, Florida: by “Marktippin”.)

Which Symbol Should I Use for Assignment?

Perhaps I should add to my post about FreeFormatter. I noted that manual conversion of R code for inclusion in web pages is a pain because of the assignment symbol, <-. But I feel I should say that assignment can also be written as =, though this sometimes clashes with = for named function arguments. Kun Ren gives an example in “Difference between assignment operators in R”. John Mount in “Prefer = for assignment in R” prefers =, saying that if you’re accustomed to typing <-, you might type it by mistake in named function arguments too, causing a bug. But David Smith in “Use = or <- for assignment?” argues for <-. And Bob Rudis in “A Step to the Right in R Assignments” argues for yet another permitted symbol, ->, because it fits better with the Tidyverse’s “pipes” notation, wherein functions are composed from left to right. In engineering, there is never one right answer.
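
The clash with named arguments can be sketched as follows (my own small example, not taken from any of the posts cited): inside a call, = names an argument and creates no variable, whereas <- performs an assignment in the calling environment and then passes the value along.

```r
# Start from a clean slate so that the checks below mean something.
if ( exists( "x", inherits=FALSE ) ) rm( x )

m1 <- mean( x = c( 1, 2, 3 ) )
#
# Here '=' just names mean's argument x.
# No variable called x is created.
x_exists_after_equals <- exists( "x", inherits=FALSE )

m2 <- mean( x <- c( 10, 20, 30 ) )
#
# Here '<-' assigns to x in the calling
# environment, then passes the value to mean.
x_exists_after_arrow <- exists( "x", inherits=FALSE )

c( 4, 5 ) -> y
#
# The rightward form: the same assignment,
# written left to right.
```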

FreeFormatter: Escaping R Code for HTML

My demo of spreading and gathering data, like all my blog posts, is written in HTML. To prevent < and > symbols in my code messing this up, I had to “escape” them by rewriting them as &lt; and &gt;. This is more of a pain in R than in many other languages, because of the assignment symbol, <-. So it’s good that there’s an online tool, FreeFormatter’s HTML Escape / Unescape page, that does this automatically. In fact, a Google search shows other sites that appear to do the same job, but FreeFormatter’s is first. So that’s what I’m using.
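
For anyone who would rather do this inside R, here is a minimal sketch of the same kind of escaping using base R’s gsub. It is my own illustration, not a description of how FreeFormatter works; the function name escape_html is hypothetical, and it handles only &, < and >.

```r
# Hypothetical helper: escapes the three characters that
# most often break HTML when code is pasted into a page.
escape_html <- function( s )
{
  # Ampersands must go first, or the '&' in '&lt;' and
  # '&gt;' would itself get escaped.
  s <- gsub( "&", "&amp;", s, fixed=TRUE )
  s <- gsub( "<", "&lt;", s, fixed=TRUE )
  s <- gsub( ">", "&gt;", s, fixed=TRUE )
  s
}

escaped <- escape_html( "d <- tibble( x = 1 )" )
# escaped is "d &lt;- tibble( x = 1 )"
```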

Demonstrating spread() and gather()

In my last post about R, I said I was translating an economic model from Python into R. It’s a microeconomic model, meaning that it simulates the behaviour of individual people rather than bulk quantities such as inflation and unemployment. The simulator uses data about the income and expenditure of British households, from sources such as the Family Resources Survey and Family Expenditure Survey. I’ve had to think about how to represent expenditures. For example, Bob spends £100 on food and £400 on rent. Do I have one column for his food expenditure and one for rent, or do I have one column for all expenditures with another “key” column indicating the type? Maybe I need both depending on how I’m going to analyse the data, with functions to translate between them. Luckily, Hadley Wickham’s spread and gather will do this. Here are some experiments.

The code below starts by creating a table, d, which has IDs in column 1, expenditure types in column 2, and expenditures in column 3. It then “spreads” this data so that each expenditure type gets its own column; and then “gathers” the result back into the original format. R has a built-in type for data tables, the “data frame”. But here, I’m using Hadley Wickham’s “tibbles” instead. These have several advantages. For example, you can nest one tibble inside another, which is likely to be useful when representing hierarchical data. And the way tibbles are printed is more informative than that for data frames. Here’s my code, with comments showing what it produces:

# test_gather_spread.R
#
# Some experiments with 'gather'
# and 'spread', to see how useful
# they might be.


library( tidyverse )
#
# Loads the Tidyverse libraries. You need to
# have done 
#   install.packages( "tidyverse" )
# first.


d <- tibble( ID=c( 1, 1, 1, 2, 3, 3 ),
             expensetype=c( 'food', 'alcohol', 'rent', 'food', 'food', 'rent' ),
             value = c( 100, 0, 400, 75, 50, 600 )
           )
#
# Makes a simple table with type of expenditure in one
# column and its value in another.
# Gives:
# A tibble: 6 x 3
#     ID expensetype value
#  <dbl>       <chr> <dbl>
#1     1        food   100
#2     1     alcohol     0
#3     1        rent   400
#4     2        food    75
#5     3        food    50
#6     3        rent   600

ds <- spread( d, expensetype, value )
#
# Spreads out the data so that each type of expenditure
# has its own column.
# Gives:
# A tibble: 3 x 4
#     ID alcohol  food  rent
#  <dbl>   <dbl> <dbl> <dbl>
#1     1       0   100   400
#2     2      NA    75    NA
#3     3      NA    50   600

dg <- gather( ds, "TYPE", "SPENT", 2:4 )
#
# Unspreads the data, back to the original arrangement.
# Gives:
# A tibble: 9 x 3
#     ID    TYPE SPENT
#  <dbl>   <chr> <dbl>
#1     1 alcohol     0
#2     2 alcohol    NA
#3     3 alcohol    NA
#4     1    food   100
#5     2    food    75
#6     3    food    50
#7     1    rent   400
#8     2    rent    NA
#9     3    rent   600

arrange( dg, ID )
#
# Sorts on ID.
# Gives:
# A tibble: 9 x 3
#     ID    TYPE SPENT
#  <dbl>   <chr> <dbl>
#1     1 alcohol     0
#2     1    food   100
#3     1    rent   400
#4     2 alcohol    NA
#5     2    food    75
#6     2    rent    NA
#7     3 alcohol    NA
#8     3    food    50
#9     3    rent   600
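
The gathered table has nine rows where the original had six, because spreading filled in NA for every combination of ID and expense type that had no entry. Here is a sketch of one way to drop those rows again, assuming that tidyr’s gather takes its na.rm argument as documented and that the tibble, tidyr and dplyr packages are installed:

```r
library( tibble )
library( tidyr )
library( dplyr )

d <- tibble( ID=c( 1, 1, 1, 2, 3, 3 ),
             expensetype=c( 'food', 'alcohol', 'rent', 'food', 'food', 'rent' ),
             value = c( 100, 0, 400, 75, 50, 600 )
           )

ds <- spread( d, expensetype, value )
#
# As before: one column per type of expenditure,
# with NA where a household has no entry.

dg <- gather( ds, "TYPE", "SPENT", 2:4, na.rm=TRUE )
#
# na.rm=TRUE discards the rows whose value would be NA.

dg <- arrange( dg, ID, TYPE )
#
# Sorts on ID, then on type within each ID.
```

This leaves the six original observations, though under the new column names and with the types in alphabetical order within each ID rather than in the order I first typed them.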

One thing worth noting is that I had to sort the gathered data to restore the original ordering by ID. Anyway, the rest of my code, below, shows how easy it is to plot the data. I’ve adapted these examples from monashbioinformaticsplatform.github.io’s “The tidyverse: dplyr, ggplot2, and friends”.

d <- tibble( ID=c( 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,
                   10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18,
                   19, 19, 20, 20, 21, 21 ),
             expensetype=c( 'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent' ),
             value = c( 100, 400, 75, 350, 50, 300, 
                        100, 500, 40, 300, 120, 450,
                        80, 370, 80, 350, 100, 500, 
                        100, 500, 40, 300, 120, 450,
                        70, 340, 75, 350, 150, 500, 
                        100, 500, 120, 500, 120, 450, 
                        130, 450, 50, 380, 100, 550 )
           )
#
# Like d above, but with more rows.

ds <- spread( d, expensetype, value )
#
# Like ds above, but with more rows.

ggplot( ds, aes( food, rent ) ) + geom_point()
ggsave( "plot1.png" )
#
# Plots food expenditure against rent expenditure.

ggplot( ds, aes( food, rent ) ) + geom_point() +
                                  geom_smooth( method="lm" )
ggsave( "plot2.png" )
#
# Plots food expenditure against rent expenditure
# showing the best-fit line from a linear fit.

ggplot( ds, aes( food ) ) + geom_histogram( binwidth=25 )
ggsave( "plot3.png" )
#
# Plots a histogram of the food expenditures.

ggplot( d, aes(value, fill = expensetype)) + geom_histogram( binwidth=25, position="identity", alpha=0.2 )
ggsave( "plot4.png" ) 
#
# Plots a histogram of the food and rent
# expenditures on top of each other. Unlike above,
# this uses the original data rather than the spread
# version: the plotter relies on the expensetype
# column to decide which histogram to add to.

And here are the plots. The originals were bigger, but I’ve shrunk them to fit the table into a reasonably-sized desktop display.

Should I Use the Tidyverse?

I’m doing a lot of programming in the statistics language R, as I translate an economic model into R from Python. This is a big project, and I’ll blog about it more in later posts, as I share useful bits of code I’ve written. But in this post, I want to mention a kind of add-on to R: not part of the base language, but widely used and respected. This is Hadley Wickham’s Tidyverse.

I’ve had to decide whether to use the Tidyverse or to stick to the base language: to so-called “base R”. Why? Reality is complicated, and in engineering, we evolve alternative sets of tools for manipulating it. For example, computing has “functional programming”, “object-oriented programming”, and “logic programming”. These are different notations for describing reality, and conflicts may occur if we try to think in more than one at the same time. When designing a program that my colleague will also work on, I have to decide whether the benefits of another set of tools justify the conflict he’ll face in learning them.

I hope Hadley Wickham will forgive me for saying that his Tidyverse libraries set up such a conflict. As Bob Muenchen notes in “The Tidyverse Curse”, learners often comment that base R functions and Tidyverse ones feel like two separate languages. Navigating the balance between base R and the Tidyverse can be a challenge.

But as Bob also notes when discussing dplyr, a package that, together with its relatives, makes up the Tidyverse, learning it is well worth the effort. Conflicts between the Tidyverse and base R are not there for the hell of it, but because of decisions made by R’s original designers. These probably seemed like a good idea at the time, but conflict with better ways of doing things. The Tidyverse functions are just doing the best they can with the existing architecture.

As for what the Tidyverse contains, Bob talks about some of its features. And there are tutorials scattered around the web: I like monashbioinformaticsplatform.github.io’s “The tidyverse: dplyr, ggplot2, and friends”. Highlights for me so far include: tibbles, a reimplementation of R data frames; pipes notation, which makes it easy to write sequences of data transformations; and various functions for rearranging data, including spread, gather, nest, and unnest.
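
As a taste of the pipes notation, here is a sketch (my own example with made-up data, assuming the tibble and dplyr packages are installed) that totals each person’s expenditure and sorts the result. Each %>% passes the table on its left to the function on its right, so the steps read from left to right and top to bottom.

```r
library( tibble )
library( dplyr )

# Made-up expenditure data, one row per person and type.
d <- tibble( ID=c( 1, 1, 1, 2, 3, 3 ),
             expensetype=c( 'food', 'alcohol', 'rent', 'food', 'food', 'rent' ),
             value = c( 100, 0, 400, 75, 50, 600 )
           )

totals <- d %>%
  group_by( ID ) %>%
  summarise( total = sum( value ) ) %>%
  arrange( desc( total ) )
#
# Groups the rows by person, adds up each
# person's spending, and puts the biggest
# spender first.
```

Without pipes, the same thing would be written inside out, as arrange( summarise( group_by( d, ID ), total = sum( value ) ), desc( total ) ).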