Viewlets

In my last post, I said that I’d arranged for the nodes in my network diagram to be clickable. Clicking on a data node brings up a display showing its contents plus an explanation thereof; clicking on a transformation node shows the code it runs and an explanation of that. I call these little displays “viewlets”, and will now explain how I implemented them.

The main thing for this prototype was to demonstrate the idea. So I put it together from two free and easy-to-use pieces of software. The first is DataTables, datatables.net’s table plug-in for jQuery. This uses JavaScript to convert an HTML table into something you can scroll through, search, and sort. There are demonstrations on the datatables.net site.
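
As an aside, R users can drive DataTables without writing any JavaScript, through the DT package, which wraps the same library. I didn’t use it for this prototype, but a minimal sketch looks like this:

library( DT )

# Render a data frame as a scrollable, searchable,
# sortable HTML table in the viewer or browser.
datatable( mtcars )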

The second is Jed Foster’s Readmore.js. This, as the demo on that page shows, reduces large chunks of text to manageable size by equipping them with “Read More” links. Initially, you see only a few lines. Click “Read More”, and you see the full text; click once more, and the extra text goes away, giving just those few lines again.

Here’s a screenshot of a data viewlet:

And here’s a code viewlet:

Browsers have a habit of exploding when asked to display a 20,000-row table. So the data viewer is probably not good for full-scale datasets. Something like base-R View, talking to the server via sockets, might be better. The “Read More” library is also not ideal, because it doesn’t allow the programmer to specify how much of the text should be displayed above the link, and because you have to scroll all the way down to the end to collapse the full text.

A Web Interface for Accessing Intermediate Results in R, Using Network Graphs with vis.js

My title says it all. I wanted to give those running our model, myself included, a way to visualise how it works, and to access and check intermediate results. At the top level, the model consists of a few rather large functions sending chunks of data from one to another. These chunks are not conceptually complicated, being rectangular tables of data. Spreadsheet users can think of them as worksheets or CSV files. A natural way to display all this was as a network of data nodes and processing nodes, where the data nodes are the tables. I was able to implement this using the excellent network-drawing JavaScript code written by visjs.org.

Here’s my display, showing how data flows between parts of the model. It consists of circular or diamond-shaped nodes connected by arrows. Each circle represents a chunk of data: usually, a complete collection of households, benefit units, or whatever. Each diamond represents a transformation. Clicking on a circle opens a new “viewlet” web page that explains its data and, below that, shows a data viewer. Clicking on a diamond opens a viewlet that shows and explains the code that carried out the transformation. Using the buttons on the left of the display moves up or down; using those on the right zooms in or out.

Before jumping into my system, it’s worth looking at vis.js’s examples. You need to know JavaScript to understand how to code them. But if you do, the structure is fairly simple, and you can start your experiments by copying the page sources into your own web pages. A good starting point is the Basic Usage page. I’ll explain in a later post how I went beyond that.
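
Another aside for R users: the visNetwork package wraps vis.js, so you can experiment with the same kind of network without leaving R. My display is written in raw JavaScript, but a minimal visNetwork sketch of a data-and-transformation diagram (the node names here are made up for illustration) would be:

library( visNetwork )

# Circles for chunks of data, diamonds for transformations.
nodes <- data.frame( id    = 1:3,
                     label = c( "households", "make_benefit_units", "benefit_units" ),
                     shape = c( "dot", "diamond", "dot" )
                   )

# Arrows show the direction of data flow.
edges <- data.frame( from   = c( 1, 2 ),
                     to     = c( 2, 3 ),
                     arrows = "to"
                   )

visNetwork( nodes, edges )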

Implementing Key-Value Notation for Data Frames without Using Tribbles

There’s a lot to be said for tribbles. As Hadley Wickham says in the “Tibbles” chapter of R for Data Science, his tribble function makes it easy to enter small data tables in a program, because you can type them row by row rather than column by column. Like this:

tribble(
  ~actor   , ~character,
  "Shatner", "Kirk"    ,
  "Nimoy"  , "Spock"   , 
  "Kelley" , "McCoy"     
)
In contrast, traditional data-frame notation makes you do this:
data.frame( actor=c("Shatner","Nimoy","Kelley"),
            character=c("Kirk","Spock","McCoy")
          )
This makes it hard to match up different elements of the same row. In my posts about tribbles for lookup tables, I overcame this by using tribble. But I now want to show a solution that I thought of before discovering it. It uses lapply and a binary operator to convert lists of key-value pairs into data frames. Here’s what the resulting notation looks like. It’s not quite as convenient as tribble notation, because you have to type three characters to separate each key from its value, but it’s better than what data.frame permits:
keys_and_values_to_data_frame(  
  1 %:% 'North_East',
  2 %:% 'North_West_and_Merseyside',
  4 %:% 'Yorks_and_Humberside',
  5 %:% 'East_Midlands',
  6 %:% 'West_Midlands',
  7 %:% 'Eastern',
  8 %:% 'London',
  9 %:% 'South_East',
 10 %:% 'South_West',
 11 %:% 'Wales',
 12 %:% 'Scotland',
 13 %:% 'Northern_Ireland'
)

The key (sorry!) to this is that R allows you to define your own operators. I can’t find where this is mentioned in the R language manual, but there’s a good discussion on Stack Overflow. An identifier which begins and ends with a percent sign can be assigned a function, and R’s parser will then allow it to be written as an infix operator, i.e. between its arguments. So if I type:

f <- function( x, y )
{
  2 * x + y
}

`%twiceandadd%` <- f

3 %twiceandadd% 5
I get the answer 11, just as if I’d called f( 3, 5 ).

Note that the backtick symbols, ` , are not part of the name, but are there to make the use of the identifier in the second statement valid. The R language manual explains this in the section on quotes.
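
Incidentally, the backtick-quoted name can also be called in ordinary prefix form, like any other function:

> `%twiceandadd%`( 3, 5 )
[1] 11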

What I did was to make the infix operator %:% a synonym for the base-R function list. So the code above does the same as

keys_and_values_to_data_frame( 
  list( 1, 'North_East' ),
  list( 2, 'North_West_and_Merseyside' ),
  ...
  list( 13, 'Northern_Ireland' )
)
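
Concretely, the synonym is just one assignment:

`%:%` <- list

After this, 1 %:% 'North_East' evaluates to list( 1, 'North_East' ).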

I then defined keys_and_values_to_data_frame as:

keys_and_values_to_data_frame <- function( ... )
{
  keys_and_values_list_to_data_frame( list( ... ) )
}

and keys_and_values_list_to_data_frame as:
keys_and_values_list_to_data_frame <- function( l )
{
  keys <- unlist( lapply( l, function(x) x[[1]] ) )
  values <- unlist( lapply( l, function(x) x[[2]] ) )
  df <- data.frame( key=keys, value=values )
  df
}
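
Putting the pieces together, here’s a quick check from the R shell, using a toy table of my own:

> keys_and_values_to_data_frame( 1 %:% 'a', 2 %:% 'b' )
  key value
1   1     a
2   2     b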

So, via the three-dots construct, which I mentioned in my last post, keys_and_values_list_to_data_frame gets passed a list of lists:

list( list( key1, value1 ), list( key2, value2 ), ... , list( keyN, valueN ) )
It then has to slice out all the first elements (the keys) to give the first column of the data frame df, and all the second elements (the values) to give the second column:
df <- data.frame( key = c( key1, ... , keyN ), value = c( value1, ... , valueN ) )

To do this, it uses lapply. The first call selects all the first elements of the sublists, and the second selects all the second elements. As with my last post, I then had to call unlist to flatten the result.

If any of that's unclear, it often helps to visualise functions like lapply and mapply in terms of sequences laid alongside one another. It may also help to read Hadley Wickham's very clear explanation in the "Functionals" chapter of his book Advanced R.

To finish, two notes. First, I made my inner lists, the ones pairing keys and values, with list rather than c. That's because the keys and values are different types, but c would have required them to be the same type.

Second, here's an example of lapply and unlist from the R shell. It also shows something I hadn't realised until I wrote the above code. The subscripting operator [[ is a function, and can be called from lapply and its ilk directly, without having to wrap it inside another function.

> l <- list( list('a','A'), list('b','B') )

> lapply( l, function(e)e[[1]] )
[[1]]
[1] "a"

[[2]]
[1] "b"

> lapply( l, function(e)e[[2]] )
[[1]]
[1] "A"

[[2]]
[1] "B"

> unlist( lapply( l, function(e)e[[1]] ) )
[1] "a" "b"

> unlist( lapply( l, function(e)e[[2]] ) )
[1] "A" "B"

> lapply( l, `[[`, 1 )
[[1]]
[1] "a"

[[2]]
[1] "b"

> unlist( lapply( l, `[[`, 1 ) )
[1] "a" "b"

Random Benefit Units for Households II: Generating the Number of Subrows

In my previous post, I assumed my household data would give me the number of children each household has. But suppose I had to generate those numbers too? This is just a note to say that one can do this using the base-R function sample.int.

If I understand its documentation correctly, then the call

sample.int( 4, size=100, replace=TRUE, prob=c(0.1,0.4,0.2,0.1) )
will give me a vector of 100 elements. Each element is an integer between 1 and 4: that’s what the first argument determines. And the probabilities of their occurrence are given by the prob argument. These weights needn’t sum to 1: sample.int normalises them, so 0.1, 0.4, 0.2 and 0.1 become 0.125, 0.5, 0.25 and 0.125, i.e. the ratios 1:4:2:1 that I check below.

This seems to work. Let me generate such a vector (but much bigger to reduce sampling error) and tabulate the frequencies of its elements using table:

x <- sample.int( 4, size=1000000, replace=TRUE, prob=c(0.1,0.4,0.2,0.1) )
t <- table(x)
t/t[1]

Then my first runs give me:

        1         2         3         4 
1.0000000 3.9331806 1.9725140 0.9925756 
1.0000000 3.9757329 1.9855526 0.9899258 
1.0000000 3.9984735 2.0017902 0.9916804 
1.0000000 3.9942205 1.9904766 0.9979963 
1.0000000 3.9952263 2.0040621 0.9968735 

Are these close enough? The second and fourth sets of frequencies are always slightly below what I’d expect. So I may be missing some subtlety. On the other hand, it’s good enough for the testing I’m doing, as this mainly has to certify that my joins and other data-handling operations are correct.
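
If eyeballing ever isn’t enough, a goodness-of-fit test would quantify “close enough”. Here’s a minimal sketch, assuming the same x as above:

# sample.int normalises the prob weights, so the
# expected probabilities are the weights divided
# by their sum, 0.8.
expected <- c( 0.1, 0.4, 0.2, 0.1 ) / 0.8

# A large p-value means the observed counts are
# consistent with these probabilities.
chisq.test( table( x ), p = expected )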

Random Benefit Units for Households I: Generating Random Subrows of a Row

The data for our economic model comes from records representing the income and expenditure of British households. However, the structure isn’t as simple as just one row per household. This is because it’s necessary to split households into “benefit units”: the word “benefit” here referring to the money the State gives you when you’re ill, out of work, or whatever. The “Households, families and benefit units” page on poverty.org explains that whereas a “household” is a group of people living at the same address who share common housekeeping or a living room, a benefit unit is an adult plus their spouse if they have one, plus any dependent children they are living with. So mum and dad plus 10-year-old Johnnie would be one benefit unit. But if Johnnie is over 18, he becomes an adult who just happens to live with his parents, and the household has two benefit units. Some of our data and results are per household, but others have to be per benefit unit. In this post, I’ll explain how, given some households each of which has a field saying how many benefit units it contains, I generated random benefit units to go with them. This is more general than it sounds. Given one kind of data, and another kind that it “contains”, how do you generate instances of the second kind? My answer involves the function mapply.

Because benefit units won’t be familiar to most readers, I’ll talk about children instead. Let’s start by making some sample households:

library( tidyverse )

households <- tribble( ~id, ~num_kids,
                       'A',         2,
                       'B',         3,
                       'C',         1,
                       'D',         2
                     )
This creates four households. The only fields each has are an ID and a number-of-children field. So the first household has ID 'A' and two children; the second has ID 'B' and three children; and so on. In our data, household IDs are numeric. But I’m using non-numeric strings here so that I can’t accidentally mix them up with the num_kids values.

Next, I assign the number of households to a variable, because I’m going to use it more than once later on.

num_households <- nrow( households )

Now I want to think about the information I’ll generate for each child record. There has to be a household ID, so that I can link the children table and the households table. I’m also going to give each child a sequence number within its household. And, for this example, one other piece of data: how much pocket money the child gets a week.

So my children table will have three columns. I’ll now generate the first of these:

kid_ids <- mapply( rep, 
                   households$id, 
                   households$num_kids 
                 )

kid_ids_as_vec <- unlist( kid_ids )

The first statement uses mapply to combine two vectors given as its final two arguments. These are 'A', 'B', 'C', 'D' and 2, 3, 1, 2 respectively, from the columns of households. In effect, mapply zips them together by applying rep to the first element of each, the second element of each, and so on. I’ve drawn this below.

I drew the IDs as colours, because their value doesn’t matter, and it makes the graphic easier to read. What it does show clearly is that the output from mapply is not flat. It’s a list of (vectors of IDs). This is because each call of rep generates a vector, and each of these becomes one element of mapply’s result. Before I can use the result as a column of a data frame, I have to flatten it, which is what the call to unlist does.
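
Here is that nesting, shown in the R shell with the same IDs and counts:

> mapply( rep, c('A','B','C','D'), c(2,3,1,2) )
$A
[1] "A" "A"

$B
[1] "B" "B" "B"

$C
[1] "C"

$D
[1] "D" "D"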

So that’s the first column of my children data frame. It contains the household IDs, but repeated as many times as each household has children. Now I have to make the second column.

kid_nums <- mapply( function(count){1:count}, 
                    households$num_kids 
                  )

kid_nums_as_vec <- unlist( kid_nums )
This works in the same kind of way as the previous mapply, but operates on only one vector. It applies a function which takes a count as argument, and returns a vector of integers running from 1 to that count, made using the built-in : operator. Applied to 2, the first element of households$num_kids, this function returns the vector 1 2. Applied to 3, the second element, it returns 1 2 3. As before, mapply returns a list of vectors, and also as before, I have to flatten it using unlist.

So I now have the first two columns of my children data. I’ll now make the third. This is how much each child gets in weekly pocket money. In real life, I’d have something else: my choice here just demonstrates one way of generating random data, something useful when testing plotting and aggregation functions, for example.

pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )
This generates a vector of as many values as we have rows (given by length(kid_ids_as_vec)), normally distributed about 20, with a standard deviation of 10.

And now I make these three columns into a data frame.

kids <- data_frame( id=kid_ids_as_vec, 
                    kid_num=kid_nums_as_vec,
                    pocket_money=pocket_monies 
                  )

To finish, here’s the complete code.

# random_kids.R
#
# Written for my blog. Illustrates
# the techniques that I used for
# generating random benefit units.


library( tidyverse )


random_kids <- function( households )
{
  kid_ids <- mapply( rep, 
                     households$id, 
                     households$num_kids 
                   )

  kid_ids_as_vec <- unlist( kid_ids )

  kid_nums <- mapply( function(count){1:count}, 
                      households$num_kids 
                    )

  kid_nums_as_vec <- unlist( kid_nums )

  pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )

  kids <- data_frame( id=kid_ids_as_vec, 
                      kid_num=kid_nums_as_vec,
                      pocket_money=pocket_monies 
                    )

  kids
}


households <- tribble( ~id, ~num_kids,
                       'A',         2,
                       'B',         3,
                       'C',         1,
                       'D',         2
                     )

kids <- random_kids( households )

And this is what the above call outputs. You’ll see that I made my code into a function. I do that with everything, because I never know when I’ll want to reuse it. When testing, for instance, I often want to call the same code more than once.

# A tibble: 8 x 3
     id kid_num pocket_money
  <chr>   <int>        <dbl>
1     A       1     17.12020
2     A       2     13.40505
3     B       1     25.81609
4     B       2     20.62115
5     B       3     31.25019
6     C       1     19.93400
7     D       1     30.98877
8     D       2     31.05521

From Python Dictionaries to Tribbles II: How I Implemented Lookup Tables in R for Numeric Data Codes

In my last post, I explained how tribbles make it easy to write data frames as a sequence of key-value pairs. But how can I make these data frames act as lookup tables? By using the base R function match.

This is how it works. First, I’ll make a tibble:

dict <- tribble( ~key, ~value, 'a', 'A', 'b', 'B', 'c', 'C' )
This gives me a two-column table where each key is in the same row as its value:
# A tibble: 3 x 2
    key value
  <chr> <chr>
1     a     A
2     b     B
3     c     C

The values in the second column represent the translations of the keys in the first column.

Now, suppose I want to translate the string 'b'. It’s in row two of column 1. Its translation is in row two of column 2. Generalising, if I want to translate string s, I find out which row r of column 1 it’s in, and then treat row r of column 2 as its translation. I can find its row using match. Here are three examples of match looking up a string in a vector of strings:

> match( 'a', c('a','b','c') )
[1] 1
> match( 'b', c('a','b','c') )
[1] 2
> match( 'c', c('a','b','c') )
[1] 3

Because the columns of tibbles (and data frames) are vectors, I can use match on these. Therefore, I can define my lookup function in this way:

lookup <- function( dict, v )
{
  keys <- dict[[ 1 ]]

  indices <- match( v, keys )

  translations <- dict[[ 2 ]]

  result_col <- translations[ indices ]

  result_col
}

There’s a subtlety here. Many R functions are “vectorised”. To quote from the language definition:

R deals with entire vectors of data at a time, and most of the elementary operators and basic mathematical functions like log are vectorized (as indicated in the table above). This means that e.g. adding two vectors of the same length will create a vector containing the element-wise sums, implicitly looping over the vector index. This applies also to other operators like -, *, and / as well as to higher dimensional structures.

One of the built-in functions that’s vectorised is match. So if I pass a vector as its first argument, it will look up each element thereof in its second argument:

> match( c('b','c','a','b'), c('a','b','c') )
[1] 2 3 1 2
This is why I gave my variables plural names. My function is operating on a vector, the entire first column of a lookup table, and passing that to match.
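
One more behaviour worth noting, which the original demo doesn’t show: match returns NA for a key it can’t find, and subscripting with an NA index yields NA. So unknown keys translate to NA rather than raising an error. With dict as above:

> lookup( dict, c( 'b', 'z' ) )
[1] "B" NA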

I’ll finish with a complete listing of my code and a demo. Here’s the listing:

# dictionaries.R


library( tidyverse )


# Returns a dictionary. 
# This is implemented as a tibble with
# 'key' and 'value' columns.
#
dictionary <- function( ... )
{
  tribble( ~key, ~value, ... ) 
}


# Translates vector v by looking up
# each element in dictionary 'dict'. The
# result is a vector whose i'th element
# is a translation of the i'th element of
# v.
#
lookup <- function( dict, v )
{
  keys <- dict[[ 1 ]]

  indices <- match( v, keys )
  #
  # 'indices' will become a vector whose
  # i'th element is the position of the
  # i'th element of v in 'keys'. The
  # corresponding element of 'translations'
  # will be its translation.

  translations <- dict[[ 2 ]]

  result_col <- translations[ indices ]

  result_col
}
The three dots near the top may puzzle some. They denote all the arguments to dictionary, which get passed to tribble. Patrick Burns has some examples in “The three-dots construct in R”.

And here, mimicking the Python with which I began, is a demo using this code.

> tel <- dictionary( 'jack', 4098, 'sape', 4139 )
> tel
# A tibble: 2 x 2
    key value
  <chr> <dbl>
1  jack  4098
2  sape  4139
> lookup( tel, 'jack' )
[1] 4098
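
And because lookup is vectorised, it will translate several keys in one call:

> lookup( tel, c( 'sape', 'jack' ) )
[1] 4139 4098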

From Python Dictionaries to Tribbles I: How I Implemented Lookup Tables in R for Numeric Data Codes

As regular readers will know, I’ve been translating an economic model from Python into R. It reads data about the income and expenditure of British households, from sources such as the Family Resources Survey and the Family Expenditure Survey. Much of this data is coded as numbers, and the model has to translate these into something intelligible. The Python version uses a kind of built-in lookup table called a “dictionary”: but these don’t exist in R, and I had to implement an equivalent. It was important that my colleague and I be able to initialise the table by writing it as key-value pairs. So I used tribbles…

I’ll explain what Python does first. Here’s an example taken from python.org’s “Dictionaries” tutorial, run on PythonAnywhere’s interactive interpreter:

In [1]: tel = { 'jack': 4098, 'sape': 4139 }
In [2]: tel
Out[2]: { 'jack': 4098, 'sape': 4139 }
In [3]: tel['guido'] = 4127
In [4]: tel
Out[4]: { 'guido': 4127, 'jack': 4098, 'sape': 4139 }
In [5]: tel['jack']
Out[5]: 4098
The first statement creates a dictionary, using curly brackets around its contents. The third and fifth statements change or look up elements, using indices in square brackets. It’s an easy notation.

Our Python model’s dictionaries look more like the one below, which translates region codes to names, but the idea is the same:

{ 1: 'North_East',
  2: 'North_West_and_Merseyside',
  4: 'Yorks_and_Humberside',
  5: 'East_Midlands',
  6: 'West_Midlands',
  7: 'Eastern',
  8: 'London',
  9: 'South_East',
 10: 'South_West',
 11: 'Wales',
 12: 'Scotland',
 13: 'Northern_Ireland'
}

So I needed a data structure that did the same job in R, and a way to initialise it by writing key-value pairs. But whereas lookup tables are built in to Python, they aren’t in R. There are contributed packages for them such as hashmap and hash. But I decided to implement lookup tables as data frames, as it might give me more control if I needed to do anything odd that these packages didn’t allow.

In fact, I used tibbles instead of ordinary data frames. Tibbles, as Hadley Wickham says in the “Tibbles” chapter of R for Data Science, are data frames, but tweaked to make life a little easier. Importantly for me, “make life easier” includes making it easier to enter small amounts of data in a program by using key-value notation. This is done via the function tribble. This call:

tribble(
  ~x, ~y, ~z,
  "a", 2, 3.6,
  "b", 1, 8.5
)
creates a tibble with columns named x, y and z, and the two rows shown under these names just above. R prints it like this:
# A tibble: 2 x 3
      x     y     z
  <chr> <dbl> <dbl>
1     a     2   3.6
2     b     1   8.5

And this call:

tribble(
   ~key, ~value, 
    1  , 'North_East',
    2  , 'North_West_and_Merseyside',
    4  , 'Yorks_and_Humberside',
    5  , 'East_Midlands',
    6  , 'West_Midlands',
    7  , 'Eastern',
    8  , 'London',
    9  , 'South_East',
    10 , 'South_West',
    11 , 'Wales',
    12 , 'Scotland',
    13 , 'Northern_Ireland'
  )
creates a tibble with two columns named key and value, and 12 rows. Here’s how R prints this one:
# A tibble: 12 x 2
     key                     value
   <dbl>                      <chr>
 1     1                North_East
 2     2 North_West_and_Merseyside
 3     4      Yorks_and_Humberside
 4     5             East_Midlands
 5     6             West_Midlands
 6     7                   Eastern
 7     8                    London
 8     9                South_East
 9    10                South_West
10    11                     Wales
11    12                  Scotland
12    13          Northern_Ireland

So the Tidyverse has made it easy to enter key-value pairs in Python-dictionary-style notation and turn them into tibbles. How do I make these act as lookup tables? See my next post. By the way, the name “tribble” stands for “transposed tibble”.

Reification

We programmers live our working lives surrounded by data structures and subroutines, entities that become as concrete to us — as “thing-like”, as “manipulable” — as teacups and bricks. The feeling of thingness is strengthened, I think, by interactive shells such as R’s which enable one to call functions and inspect their results, and to store these in variables and pass them around. For our model, those results are either chunks of economic data such as our tables of households, or income-distribution graphs and other such summaries. I hope that being able to touch, probe, and pick up these things with R will make them seem more real.

The Innocent Eye, the Martian Poet, and the R Inferno

Literature has the concept of the “innocent eye”: that visitor to regions strange who, vision unclouded by familiarity, is able to see and report on how absurd things really are over there. There are also “Martian poets”, who send home postcards about the oddities of their own environment as if visiting it from Mars. As it happens, I’ve tried my hand at both. “Enterro Da Gata’ 98” is an innocent-eye piece about Braga in Portugal, written when I was visiting the University of Minho. And “The Processes That Count” is a Martian-poet — or more accurately, a universe-next-door-poet — view of addition. With R, I find myself tempted into both positions.

On the one hand, I’ve done a lot of R programming over the last few years. Some of R’s quirks now seem distressingly natural, and I have to work hard to see them from outside — from Mars. On the other hand, I’m not a statistician, I’ve not explored the whole of the language and its libraries, and I’m nowhere near as expert as, say, the implementors of the Tidyverse. So I’m still a relative innocent, capable of viewing R’s peculiarities from Planet Pascal, Planet Lisp, or any of the other twenty-or-so languages I know.

I think it’s good to retain an innocent’s point of view, especially when teaching. And writing code for my colleague is a kind of teaching, because I have to explain stuff in my comments. Especially the stuff that will trip up the unwary programmer. Luckily, a lot of this stuff is explained in Patrick Burns’s book The R Inferno. I recommend it.

(Photo of Dante’s Inferno ride entrance in Panama City, Florida: by “Marktippin”.)

Which Symbol Should I Use for Assignment?

Perhaps I should add to my post about FreeFormatter. I noted that manual conversion of R code for inclusion in web pages is a pain because of the assignment symbol, <-. But I feel I should say that assignment can also be written as =, though this sometimes clashes with = for named function arguments. Kun Ren gives an example in “Difference between assignment operators in R”. John Mount in “Prefer = for assignment in R” prefers =, saying that if you’re accustomed to typing <-, you might type it by mistake in named function arguments too, causing a bug. But David Smith in “Use = or <- for assignment?” argues for <-. And Bob Rudis in “A Step to the Right in R Assignments” argues for yet another permitted symbol, ->, because it fits better with the Tidyverse’s “pipes” notation, wherein functions are composed from left to right. In engineering, there is never one right answer.
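
To see the clash concretely, here’s the classic example, run in a fresh R session (my illustration, not taken from the posts above):

> median( x = 1:10 )   # '=' names an argument; no variable x is created
[1] 5.5
> x
Error: object 'x' not found
> median( x <- 1:10 )  # '<-' assigns x, then passes its value
[1] 5.5
> x
 [1]  1  2  3  4  5  6  7  8  9 10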