From Python Dictionaries to Tribbles II: How I Implemented Lookup Tables in R for Numeric Data Codes

In my last post, I explained how tribbles make it easy to write data frames as a sequence of key-value pairs. But how can I make these data frames act as lookup tables? By using the base R function match.

This is how it works. First, I’ll make a tibble:

dict <- tribble(
  ~key, ~value,
  'a',  'A',
  'b',  'B',
  'c',  'C'
)

This gives me a two-column table where each key is in the same row as its value:

# A tibble: 3 x 2
    key value
  <chr> <chr>
1     a     A
2     b     B
3     c     C

The values in the second column represent the translations of the keys in the first column.

Now, suppose I want to translate the string 'b'. It’s in row two of column 1. Its translation is in row two of column 2. Generalising, if I want to translate string s, I find out which row r of column 1 it’s in, and then treat row r of column 2 as its translation. I can find its row using match. Here are three examples of match looking up a string in a vector of strings:

> match( 'a', c('a','b','c') )
[1] 1
> match( 'b', c('a','b','c') )
[1] 2
> match( 'c', c('a','b','c') )
[1] 3
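
Putting those two steps together by hand, here is a minimal sketch of translating 'b' with nothing but base R. The vectors keys and values are stand-ins I've made up for the two columns of the table; they are not part of the code later in this post.

```r
# Stand-ins for the two columns of the lookup table.
keys   <- c('a', 'b', 'c')
values <- c('A', 'B', 'C')

# Step 1: find which row the string is in.
r <- match('b', keys)

# Step 2: take the same row of the other column.
translation <- values[r]

print(r)            # [1] 2
print(translation)  # [1] "B"
```
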

Because the columns of tibbles (and data frames) are vectors, I can use match on these. Therefore, I can define my lookup function in this way:

lookup <- function( dict, v )
{
  keys <- dict[[ 1 ]]

  indices <- match( v, keys )

  translations <- dict[[ 2 ]]

  result_col <- translations[ indices ]

  result_col
}

There’s a subtlety here. Many R functions are “vectorised”. To quote from the language definition:

R deals with entire vectors of data at a time, and most of the elementary operators and basic mathematical functions like log are vectorized (as indicated in the table above). This means that e.g. adding two vectors of the same length will create a vector containing the element-wise sums, implicitly looping over the vector index. This applies also to other operators like -, *, and / as well as to higher dimensional structures.
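
To make the quotation concrete, here is a small sketch of element-wise arithmetic on two vectors of the same length. The variable names are mine, chosen only for illustration.

```r
x <- c(1, 2, 3)
y <- c(10, 20, 30)

# Each operator is applied position by position, with no explicit loop.
sums     <- x + y
products <- x * y

print(sums)      # [1] 11 22 33
print(products)  # [1] 10 40 90

# log is applied to every element in the same way.
logs <- log(x)
print(length(logs))  # [1] 3
```
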

One of the built-in functions that’s vectorised is match. So if I pass a vector as its first argument, it will look up each element of that vector in the second argument:

> match( c('b','c','a','b'), c('a','b','c') )
[1] 2 3 1 2
This is why I gave my variables plural names. My function operates on whole vectors: it passes the vector v of things to translate, together with the entire first column of the lookup table, to match, and gets back a vector of row indices.
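
As a quick check of this, here is a sketch that repeats the lookup function and the three-row table from above and translates several strings in one call. It assumes the tibble package is installed. The key 'z' is one I've added to show what happens when a key isn't in the table: match returns NA for it, so the translation is NA too.

```r
library(tibble)

dict <- tribble(
  ~key, ~value,
  'a',  'A',
  'b',  'B',
  'c',  'C'
)

# Same definition as earlier in the post.
lookup <- function(dict, v) {
  keys <- dict[[1]]
  indices <- match(v, keys)
  translations <- dict[[2]]
  translations[indices]
}

several <- lookup(dict, c('b', 'c', 'a', 'b'))
print(several)  # [1] "B" "C" "A" "B"

# A key that is not in the table has no row, so it translates to NA.
missing_key <- lookup(dict, 'z')
print(missing_key)  # [1] NA
```
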

I’ll finish with a complete listing of my code and a demo. Here’s the listing:

# dictionaries.R


library( tidyverse )


# Returns a dictionary. 
# This is implemented as a tibble with
# 'key' and 'value' columns.
#
dictionary <- function( ... )
{
  tribble( ~key, ~value, ... ) 
}


# Translates vector v by looking up
# each element in dictionary 'dict'. The
# result is a vector whose i'th element
# is a translation of the i'th element of
# v.
#
lookup <- function( dict, v )
{
  keys <- dict[[ 1 ]]

  indices <- match( v, keys )
  #
  # 'indices' is a vector whose i'th
  # element is the position p of the
  # i'th element of v in 'keys'. The
  # p'th element of 'translations'
  # is its translation.

  translations <- dict[[ 2 ]]

  result_col <- translations[ indices ]

  result_col
}
The three dots near the top may puzzle some. They denote all the arguments to dictionary, which get passed to tribble. Patrick Burns has some examples in “The three-dots construct in R”.
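
For anyone who'd like to see the dots on their own, here is a small base R sketch. The function shout is hypothetical, invented for this example; it simply forwards whatever arguments it receives to paste, in the same way that dictionary forwards its arguments to tribble.

```r
# 'shout' is a made-up wrapper: everything passed to it
# goes straight through to paste via the three dots.
shout <- function(...) {
  toupper(paste(...))
}

plain <- shout('hello', 'world')
print(plain)   # [1] "HELLO WORLD"

# Named arguments travel through the dots as well.
dashed <- shout('a', 'b', sep = '-')
print(dashed)  # [1] "A-B"
```
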

And here, mimicking the Python with which I began, is a demo using this code.

> tel <- dictionary( 'jack', 4098, 'sape', 4139 )
> tel
# A tibble: 2 x 2
    key value
  <chr> <dbl>
1  jack  4098
2  sape  4139
> lookup( tel, 'jack' )
[1] 4098
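
Since the point of all this was numeric data codes, here is a sketch of the same two functions translating a whole column of codes in one go. The coding scheme and the responses vector are invented for illustration and don't come from any real dataset; the lookup function is a condensed version of the one in the listing. It assumes the tibble package is installed.

```r
library(tibble)

dictionary <- function(...) {
  tribble(~key, ~value, ...)
}

# Condensed form of the lookup function in the listing.
lookup <- function(dict, v) {
  dict[[2]][match(v, dict[[1]])]
}

# Hypothetical coding scheme, made up for this example.
sex_codes <- dictionary(
  1, 'male',
  2, 'female',
  9, 'not stated'
)

# Hypothetical column of coded survey responses.
responses <- c(2, 1, 9, 2)

labels <- lookup(sex_codes, responses)
print(labels)  # [1] "female"     "male"       "not stated" "female"
```
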
