Abstract Data Types and the Uniform Referent Principle I: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()

“What’s the Uniform Referent Principle?” my colleague asked me on reading my last post. I think I first came across it in Jean Sammet’s famous book Programming Languages: History and Fundamentals. In a description of Douglas Ross’s AED-0 language, she pointed out a feature that she thought particularly noteworthy: the notation for getting information out of a piece of data is the same regardless of how the data is stored. Or as Ross puts it: There must be a single, uniform method of referencing operands regardless of their detailed implementation. He considered this so important that he gave it a name: the Uniform Referent Principle. Unfortunately, it’s a principle universally ignored by R programmers. I’m going to explain why that is bad, and how it affects my coding of the tax and benefit calculations in our economic model.

Sammet doesn’t actually name the principle, but Ross himself does, as is evident from the title of his paper “Uniform Referents: An Essential Property for a Software Engineering Language” by Douglas T. Ross, in Software Engineering: Proceedings of the Third Symposium on Computer and Information Sciences held in Miami Beach, Florida, December, 1969 (Volume 1, edited by Julius Tou). It’s worth asking why it concerned him. His answer is as relevant now as it was in 1969. Here’s the start to his introduction:

The term software engineering is new and has yet to achieve a well-defined meaning. Most present-day software has not been engineered at all, but instead is a product of the individual experience of its creators and primarily ad hoc choices among various alternatives. The effects of this unstructured approach to software are quite obvious in terms of poor performance, missed schedules, and uncontrolled costs.

Ross continues by saying that although software engineering is one of the most challenging problems facing humanity, we are beginning to systematise it and to recognise the fundamental issues. The most fundamental of these involves:

[…] one basic criterion which a general-purpose programming language must meet if it is to be used as the expressive vehicle for software engineering activities regardless of the style of practice of these activities: There must be a single, uniform method of referencing operands regardless of their detailed implementation.

Why is this so important? Ross relates it to his ideas on how to design programming languages:

Programming language features for software engineering must be carefully selected; not any old programming language features will do. An unstructured blur of assembly language depending in turn upon an ad hoc collection of machine hardware features which just happen to have been brought together for some particular computer design has a low probability of matching the real world in any profound way. Similarly, most “high-level” languages invariably will be either not complete or not concise enough to serve as a basis for software engineering, for they have a similar ad hoc ancestry with roots in scientific computation or business data processing areas, which omit many important aspects of general software construction.

Heaven forbid that R’s ancestry be deemed in any way ad hoc. But let Ross continue:

Careful inspection and experimentation with various software engineering activities discloses a few fundamental features that somehow must be present in a software engineering language. These features can be informally derived directly from high-level considerations of the software engineering design process itself. The purpose of this paper is to indicate such a derivation of one basic principle without attempting an exhaustive enumeration of the consequences.

Ross then goes on to those promised high-level considerations. Quoting an excerpt from his report “Computer-Aided Design: A Statement of Objectives” (Tech. Mem. 8436-TM-4, Defense Documentation Center No. AD252060, M.I.T. Electron. Systems Lab., 1960), he writes that:

The manner of stating problems is also of primary importance. You must be able to state problems from outside in, not inside out. The normal practice of being explicit about a problem is to build a description of the solution procedure in small steps, i.e. to code the problem…. Stating a problem step by step from the inside out in this manner is satisfactory if the problem is well understood to begin with. But there is an alternative procedure which is frequently used among people, but is virtually non-existent as regards computers. In this alternate procedure, which we call stating the problem from the outside in, we state the problem originally in general terms and then explain further and in more and more detail what we mean…. It is quite apparent that this procedure of stating a problem from the outside in by means of further and further refinements is the only feasible way to attack grand problems.

This way of stating problems is also called stepwise refinement. One of the computer scientists who popularised the name was Niklaus Wirth, the designer of Pascal, in his paper “Program development by stepwise refinement” (Communications of the ACM, April 1971, Volume 14, Number 4). In it, he writes:

A guideline in the process of stepwise refinement should be the principle to decompose decisions as much as possible, to untangle aspects which are only seemingly interdependent, and to defer those decisions which concern details of representation as long as possible. This will result in programs which are easier to adapt to different environments (languages and computers), where different representations may be required.

Why is it important to defer decisions which concern details of representation? Because undoing these decisions is costly. One has to rewrite code, retest it, and redocument it. The modern technique for minimising this cost is to use abstract data types. To quote the beginning of the Wikipedia page:

In computer science, an abstract data type (ADT) is a mathematical model for data types, where a data type is defined by its behavior (semantics) from the point of view of a user of the data, specifically in terms of possible values, possible operations on data of this type, and the behavior of these operations. This contrasts with data structures, which are concrete representations of data, and are the point of view of an implementer, not a user.

So one centralises these operations inside a module, and makes everything that depends on the concrete representation private, inaccessible from outside. The only things that aren’t inaccessible are the interface functions. Then, no matter how much the representation is changed, the interface won’t be, and so neither will code that calls it.
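As a concrete illustration (mine, not from Ross or the Wikipedia page), here is a minimal abstract data type in base R: a stack whose representation, a list held inside a closure, is private. Only the interface functions are visible, so the representation could later change without touching any calling code.

```r
# A minimal stack ADT in base R. The list 'items' is the private
# representation; push, pop and size are the public interface.
make_stack <- function() {
  items <- list()                          # private representation
  push <- function( x ) {
    items[[ length(items) + 1 ]] <<- x
    invisible( NULL )
  }
  pop <- function() {
    x <- items[[ length(items) ]]
    items[[ length(items) ]] <<- NULL      # remove the top element
    x
  }
  size <- function() length( items )
  list( push=push, pop=pop, size=size )    # the public interface
}

s <- make_stack()
s$push( 'a' ); s$push( 'b' )
s$pop()   # returns 'b'
```

If I later reimplemented the stack with, say, a pre-allocated vector, only the body of make_stack would change; every caller of push, pop and size would be unaffected.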

Ross’s reasoning is similar:

The crucial feature about outside-in design as the underlying methodology for software engineering is that […] proper treatment of interface mechanisms will allow the higher level to remain unchanged regardless of changes in the details of the inner level. In other words all of the variations can be accommodated in the interface changes, so that the outer level is completely unaffected.


The criterion is this: A single uniform referent mechanism must be used for all references to operands. […] In order for the programs of an outer level to remain completely unchanged as interface programs change to accommodate various forms of inter-level detail, the manner of referencing these operands must be the same for all possible forms.

There is one difference, however. Ross was writing many years ago, before our modern views on abstract data types. I think it’s fair to say that he sees things in slightly more representational terms than we would today. If we were implementing a data structure as an array, we’d make our interface hide the fact that it’s an array, by exporting functions that do the subscripting. Ross, however, would write the subscripting operations themselves in the interface, but in a notation that looks the same as function calls. That is why, later in his paper, Ross presents the following table:

The notation A(B) in any syntactic context always means

Property A of thing B

AED-0 declarations allow choices:



Program file          Declaration files
BEGIN                 7094 version
.INSERT DECL $        360 version
BODY STATEMENTS $     1108 version

The program file never changes when any declaration is used
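In R terms (my own illustration, not Ross's notation), the principle amounts to always writing property(thing), whatever lies behind it. A sketch, with hypothetical field names:

```r
# Uniform referent: callers always write income( h ), whether income
# is stored directly or computed from components.

# Today's representation: income computed from two stored fields.
income <- function( h ) h$wages + h$benefits
h <- list( wages=800, benefits=200 )
income( h )   # 1000
```

If the representation later changed to a single stored field, only the accessor's body would change, to something like income <- function( h ) h$income, and every caller of income( h ) would be unaffected.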


Thoughts on nest()

I’ve been experimenting with the Tidyverse’s nest function, because it may be useful when, for example, using households together with benefit units. Below are some thoughts that I first posted as a comment to Hadley Wickham’s blog entry “tidyr 0.4.0”. More on this in future posts.

First, this is likely to be very useful to me. I’m translating a microeconomic model into R. Its input is a set of British households, where each household record contains data on income and expenditure. The model uses these to predict how their incomes will change if you change tax (e.g. by increasing income tax) or benefits (e.g. by increasing pensions or child benefit).

Our data splits households into “benefit units”. A benefit unit ( http://www.poverty.org.uk/summary/households.shtml ) is defined as an adult plus their spouse if they have one, plus any dependent children they are living with. So for example, mum and dad plus 10-year old Johnnie would be one benefit unit. But if Johnnie is over 18, he becomes an adult who just happens to live with his parents, and the household has two benefit units. These are treated more-or-less independently by the benefit system, e.g. if they receive money when out of work.

This matters because each dataset contains one table for household-wide data, and another for benefit-unit-wide data. I’ve been combining these with joins. But it might be cleaner to nest each household’s benefit units inside the household data frame. Not least, because sometimes we have to change data in a household, for example when simulating inflation, and then propagate the changes down to the benefit units.

Second, nest and unnest could be useful elsewhere in our data. Each household’s expenditure data is divided into categories, for example food, rent, and alcohol. We may want to group and ungroup these. For example, I make:

library( tidyverse )

d <- tibble( ID=c( 1, 1, 1,  2, 2,  3, 3,
                   4, 4, 4,  5, 5, 5,  6, 6 ),
             expensetype=c( 'food', 'alcohol', 'rent',  'food', 'rent',  'food', 'rent',
                            'food', 'cigarettes', 'rent',  'food', 'alcohol', 'rent',  'food', 'rent' ),
             expenditure=c( 100, 50, 400,  75, 300,  90, 400,
                            100, 30, 420,  75, 50, 550,  150, 600 ) )


d %>% group_by(ID) %>% nest %>% arrange(ID)
gives me a table
1 tibble [3 x 2]
2 tibble [2 x 2]
3 tibble [2 x 2]
4 tibble [3 x 2]
5 tibble [3 x 2]
6 tibble [2 x 2]
where the first column is the ID and the second is a table such as
food      100
alcohol    50
rent      400
So in effect, it makes my original table into a function from ID to ℘ expensetype × expenditure.

Whereas if I do

d %>% group_by(expensetype) %>% nest %>% arrange(expensetype)
I get the table
alcohol    tibble [2 x 2]
cigarettes tibble [1 x 2]
food       tibble [6 x 2]
rent       tibble [6 x 2]
where the first column is expenditure category and the second holds tables of ID versus expenditure. In effect, a function from expensetype to ℘ ID × expenditure. This sort of reformatting may be very useful.

Third. Continuing from the above, I wrote this function:

functionalise <- function( data, cols )
  data %>% group_by_( .dots=cols ) %>% nest %>% arrange_( .dots=cols )
The idea is that functionalise( data, cols ) reformats data into a data frame that represents a function. The columns cols represent the function’s domain, and will never contain more than one occurrence of the same tuple. The remaining column represents the function’s codomain. Thus, functionalise(d,"expensetype") returns the data frame shown in the last example.
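(An aside for readers on current versions of dplyr: the underscore verbs group_by_ and arrange_ have since been deprecated in favour of tidy evaluation. A sketch of the same function against the newer API, assuming dplyr 1.0 or later:)

```r
library( tidyverse )

# Same idea as functionalise above, but written with across()/all_of()
# instead of the deprecated group_by_/arrange_ underscore verbs.
functionalise <- function( data, cols )
  data %>%
    group_by( across( all_of( cols ) ) ) %>%
    nest() %>%
    arrange( across( all_of( cols ) ) )
```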

Fourth, I note that I can write either

d %>% group_by( expensetype ) %>% nest %>% arrange( expensetype )
nest( d, ID, expenditure )

In the first, I have to name those columns that I want to act as the function’s domain. In the second, I have to name the others. I find the first more natural.

Fifth, nest and unnest, as well as spread and gather make it very easy to generate alternative but logically equivalent representations of a data frame. But every time I change representation in this way, I have to rewrite all the code that accesses the data. It would be really nice if either a representation-independent way of accessing it could be found, or if nest/unnest and spread/gather could be made to operate on the code as well as the data. Paging Douglas Ross and the Uniform Referent Principle…
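One partial workaround, in the spirit of the Uniform Referent Principle, is to route all access through a small accessor that knows about every representation in play. A sketch (my own, with a hypothetical column layout matching the expenditure example above):

```r
# Callers always write expenses_of( d, id ), whether d is the flat
# frame or the nested one; only this accessor knows the difference.
expenses_of <- function( d, id ) {
  if ( "data" %in% names( d ) )                        # nested form
    d$data[[ which( d$ID == id ) ]]
  else                                                 # flat form
    d[ d$ID == id, c( "expensetype", "expenditure" ) ]
}
```

If d is later re-nested or flattened, expenses_of changes, but its callers don't. It's a manual substitute for what a Uniform Referent mechanism would give for free.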

Literate Programming, Data-flow Networks, and the Inscrutability of Economic Models

Write programs as if they were mathematical essays, but with code instead of equations. This is “literate programming”, an idea invented by the famous computer scientist Donald Knuth. The quote below is from www.literateprogramming.com , apparently first published by Knuth in “Literate Programming (1984)” in Literate Programming, CSLI, 1992, page 99:

I believe that the time is ripe for significantly better documentation of programs, and that we can best achieve this by considering programs to be works of literature. Hence, my title: “Literate Programming.”

Let us change our traditional attitude to the construction of programs: Instead of imagining that our main task is to instruct a computer what to do, let us concentrate rather on explaining to human beings what we want a computer to do.

The practitioner of literate programming can be regarded as an essayist, whose main concern is with exposition and excellence of style. Such an author, with thesaurus in hand, chooses the names of variables carefully and explains what each variable means. He or she strives for a program that is comprehensible because its concepts have been introduced in an order that is best for human understanding, using a mixture of formal and informal methods that reinforce each other.

With our model, there are at least three classes of user who need explaining to. One is me, in six months’ time after I’ve forgotten how my code worked. The second is my colleague. And the third is the economists and others who will use the model.

This final class need a little extra help. My networks can be seen as a kind of literate program where the expository narrative is branched rather than linear, and the essayist is writing around function definitions rather than equations. Traditionally, economic models have been opaque, their assumptions known only to the few who implemented them. But with this kind of display, everyone is free to peer inside.


In my last post, I said that I’d arranged for the nodes in my network diagram to be clickable. Clicking on a data node brings up a display showing its contents plus an explanation thereof; clicking on a transformation node shows the code it runs and an explanation of that. I call these little displays “viewlets”, and will now explain how I implemented them.

The main thing for this prototype was to demonstrate the idea. So I put together two easy-to-use and free pieces of software. The first is datatables.net’s Table plug-in for jQuery. This uses JavaScript to convert an HTML table into something you can scroll through, search, and sort. There are demonstrations here.

The second is Jed Foster’s Readmore.js. This, as the demo on that page shows, reduces large chunks of text to manageable size by equipping them with “Read More” links. Initially, you see only a few lines. Click “Read More”, and you see the full text; click once more, and the extra text goes away, giving just those few lines again.

Here’s a screenshot of a data viewlet:

And here’s a code viewlet:

Browsers have a habit of exploding when asked to display a 20,000-row table. So the data viewer is probably not good for full-scale datasets. Something like base-R View, talking to the server via sockets, might be better. The “Read More” library is also not ideal, because it doesn’t allow the programmer to specify how much of the text should be displayed above the link, and because you have to scroll all the way down to the end to collapse the full text.

A Web Interface for Accessing Intermediate Results in R, Using Network Graphs with vis.js

My title says it all. I wanted to give those running our model, myself included, a way to visualise how it works, and to access and check intermediate results. At the top level, the model consists of a few rather large functions sending chunks of data from one to another. These chunks are not conceptually complicated, being rectangular tables of data. Spreadsheet users can think of them as worksheets or CSV files. A natural way to display all this was as a network of data nodes and processing nodes, where the data nodes are the tables. I was able to implement this using the excellent network-drawing JavaScript code written by visjs.org.

Here’s my display, showing how data flows between parts of the model. It consists of circular or diamond-shaped nodes connected by arrows. Each circle represents a chunk of data: usually, a complete collection of households, benefit units, or whatever. Each diamond represents a transformation. Clicking on a circle opens a new “viewlet” web page that explains its data and, below that, shows a data viewer. Clicking on a diamond opens a viewlet that shows and explains the code that carried out the transformation. Using the buttons on the left of the display moves up or down; using those on the right zooms in or out.

Before jumping in to my system, it’s worth looking at vis.js’s examples. You need to know JavaScript to understand how to code them. But if you do, the structure is fairly simple, and you start your experiments by copying the page sources into your own web pages. A good starting point is the Basic Usage page . I’ll explain in a later post how I went beyond that.

Implementing Key-Value Notation for Data Frames without Using Tribbles

There’s a lot to be said for tribbles. As Hadley Wickham says in the “Tibbles” chapter of R for Data Science, his tribble function makes it easy to enter small data tables in a program, because you can type them row by row rather than column by column. Like this:

tribble( ~actor   , ~character,
         "Shatner", "Kirk"    ,
         "Nimoy"  , "Spock"   ,
         "Kelley" , "McCoy"   )
In contrast, traditional data-frame notation makes you do this:
data.frame( actor=c("Shatner","Nimoy","Kelley"),
            character=c("Kirk","Spock","McCoy") )
This makes it hard to match up different elements of the same row. In my posts about tribbles for lookup tables, I overcame this by using tribble. But I now want to show a solution that I thought of before discovering it. This uses lapply and a binary operator to convert lists of key-value pairs into data frames. This is what the resulting notation looked like. It’s not quite as convenient as tribble notation, because of having to type three characters to separate keys from values, but it’s better than data.frame permits:
keys_and_values_to_data_frame(  1 %:% 'North_East',
                                2 %:% 'North_West_and_Merseyside',
                                4 %:% 'Yorks_and_Humberside',
                                5 %:% 'East_Midlands',
                                6 %:% 'West_Midlands',
                                7 %:% 'Eastern',
                                8 %:% 'London',
                                9 %:% 'South_East',
                               10 %:% 'South_West',
                               11 %:% 'Wales',
                               12 %:% 'Scotland',
                               13 %:% 'Northern_Ireland' )

The key (sorry!) to this is that R allows you to define your own operators. I can’t find where this is mentioned in the R language manual, but there’s a good discussion on StackOverflow. An identifier which begins and ends with percent can be assigned a function, and R’s parser will then allow it to be written as an infix operator, i.e. between its arguments. So if I type:

f <- function( x, y )
  2 * x + y

`%twiceandadd%` <- f

3 %twiceandadd% 5
I get the answer 11, just as if I’d called f( 3, 5 ).

Note that the backtick symbols, ` , are not part of the name, but are there to make the use of the identifier in the second statement valid. The R language manual explains this in the section on quotes.

What I did was to make the infix operator %:% a synonym for the base-R function list. So the code above does the same as

  keys_and_values_to_data_frame( list( 1, 'North_East' ),
                                 list( 2, 'North_West_and_Merseyside' ),
                                 ...
                                 list( 13, 'Northern_Ireland' ) )

I then defined keys_and_values_to_data_frame as:

keys_and_values_to_data_frame <- function( ... )
  keys_and_values_list_to_data_frame( list( ... ) )

and keys_and_values_list_to_data_frame as:
keys_and_values_list_to_data_frame <- function( l ) {
  keys <- unlist( lapply( l, function(x) x[[1]] ) )
  values <- unlist( lapply( l, function(x) x[[2]] ) )
  data.frame( key=keys, value=values )
}
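Putting the pieces together, here is a self-contained version of the whole construction, repeated so that it can be run as one chunk:

```r
# Key-value notation for small data frames, without tribble.
`%:%` <- list   # the infix %:% pairs a key with a value

keys_and_values_list_to_data_frame <- function( l ) {
  keys   <- unlist( lapply( l, function(x) x[[1]] ) )
  values <- unlist( lapply( l, function(x) x[[2]] ) )
  data.frame( key=keys, value=values )
}

keys_and_values_to_data_frame <- function( ... )
  keys_and_values_list_to_data_frame( list( ... ) )

regions <- keys_and_values_to_data_frame(  1 %:% 'North_East',
                                           2 %:% 'North_West_and_Merseyside',
                                          13 %:% 'Northern_Ireland' )
```

(Note that on R versions before 4.0, data.frame will turn the value column into a factor unless you pass stringsAsFactors=FALSE.)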

So, via the three-dots construct, which I mentioned in my last post, keys_and_values_list_to_data_frame gets passed a list of lists:

list( list( k1, v1 ), list( k2, v2 ), ... , list( kn, vn ) )
It then has to slice out all the first elements (the keys) to give the first column of the data frame df, and all the second elements (the values) to give the second column:
df <- data.frame( key=c( k1, ... , kn ), value=c( v1, ... , vn ) )

To do this, it uses lapply. The first call selects all the first elements of the sublists, and the second selects all the second elements. As with my last post, I then had to call unlist to flatten the result.

If any of that's unclear, visualising functions like lapply and mapply in terms of sequences laid alongside one another is often helpful. It may also be helpful to read Hadley Wickham's very clear explanation in the section on "Functionals" from his book Advanced R.

To finish, two notes. First, I made my inner lists, the ones pairing keys and values, with list rather than c. That's because the keys and values are different types, and c would have coerced them all to a common type.

Second, here's an example of lapply and unlist from the R shell. It also shows something I hadn't realised until I wrote the above code. The subscripting operator [[ is a function, and can be called from lapply and its ilk directly, without having to wrap it inside another function.

> l <- list( list('a','A'), list('b','B') )

> lapply( l, function(e)e[[1]] )
[[1]]
[1] "a"

[[2]]
[1] "b"

> lapply( l, function(e)e[[2]] )
[[1]]
[1] "A"

[[2]]
[1] "B"

> unlist( lapply( l, function(e)e[[1]] ) )
[1] "a" "b"

> unlist( lapply( l, function(e)e[[2]] ) )
[1] "A" "B"

> lapply( l, `[[`, 1 )
[[1]]
[1] "a"

[[2]]
[1] "b"

> unlist( lapply( l, `[[`, 1 ) )
[1] "a" "b"

Random Benefit Units for Households II: Generating the Number of Subrows

In my previous post, I assumed my household data would give me the number of children each household has. But suppose I had to generate those numbers too? This is just a note to say that one can do this using the base-R function sample.int .

If I understand its documentation correctly, then the call

sample.int( 4, size=100, replace=TRUE, prob=c(0.1,0.4,0.2,0.1) )
will give me a vector of 100 elements. Each element is an integer between 1 and 4: that’s what the first argument determines. And the probabilities of their occurrence are given by the prob argument.
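One detail worth noting: those probabilities sum to 0.8, not 1. That's legal, because sample.int normalises prob internally, so the effective probabilities are c(0.1, 0.4, 0.2, 0.1)/0.8, i.e. 0.125, 0.5, 0.25, 0.125. A quick base-R check:

```r
set.seed( 1 )
x <- sample.int( 4, size=1000000, replace=TRUE, prob=c(0.1, 0.4, 0.2, 0.1) )
props <- table( x ) / length( x )
# Expected proportions after normalisation: 0.125, 0.5, 0.25, 0.125.
round( props, 3 )
```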

This seems to work. Let me generate such a vector (but much bigger to reduce sampling error) and tabulate the frequencies of its elements using table:

x <- sample.int( 4, size=1000000, replace=TRUE, prob=c(0.1,0.4,0.2,0.1) )
t <- table(x)
t / t[1]

Then my first runs give me:

        1         2         3         4 
1.0000000 3.9331806 1.9725140 0.9925756 
1.0000000 3.9757329 1.9855526 0.9899258 
1.0000000 3.9984735 2.0017902 0.9916804 
1.0000000 3.9942205 1.9904766 0.9979963 
1.0000000 3.9952263 2.0040621 0.9968735 

Are these close enough? The second and fourth sets of frequencies are always slightly below what I’d expect. So I may be missing some subtlety. On the other hand, it’s good enough for the testing I’m doing, as this mainly has to certify that my joins and other data-handling operations are correct.

Random Benefit Units for Households I: Generating Random Subrows of a Row

The data for our economic model comes from records representing the income and expenditure of British households. However, the structure isn’t as simple as just one row per household. This is because it’s necessary to split households into “benefit units”: the word “benefit” here referring to the money the State gives you when you’re ill, out of work, or whatever. The “Households, families and benefit units” page on poverty.org.uk explains that whereas a “household” is a group of people living at the same address who share common housekeeping or a living room, a benefit unit is an adult plus their spouse if they have one, plus any dependent children they are living with. So mum and dad plus 10-year old Johnnie would be one benefit unit. But if Johnnie is over 18, he becomes an adult who just happens to live with his parents, and the household has two benefit units. Some of our data and results are per household, but others have to be per benefit unit. In this post, I’ll explain how, given some households each of which had a field saying how many benefit units it has, I generated random benefit units to go with it. This is more general than it sounds. Given one kind of data, and another kind that it “contains”, how do you generate instances of the second kind? My answer involves function mapply.

Because benefit units won’t be familiar to most readers, I’ll talk about children instead. Let’s start by making some sample households:

library( tidyverse )

households <- tribble( ~id, ~num_kids,
                       'A',         2,
                       'B',         3,
                       'C',         1,
                       'D',         2 )
This creates four households. The only fields each has are an ID and a number-of-children field. So the first household has ID 'A' and two children; the second has ID 'B' and three children; and so on. In our data, household IDs are numeric. But I’m using non-numeric strings here because it avoids me accidentally mixing them up with the num_kids values.

Next, I assign the number of households to a variable, because I’m going to use it more than once later on.

num_households <- nrow( households )

Now I want to think about the information I’ll generate for each child record. There has to be a household ID, so that I can link the children table and the households table. I’m also going to give each child a sequence number within its household. And, for this example, one other piece of data: how much pocket money the child gets a week.

So my children table will have three columns. I’ll now generate the first of these:

kid_ids <- mapply( rep, households$id, households$num_kids )

kid_ids_as_vec <- unlist( kid_ids )

The first statement uses mapply to combine two vectors given as its final two arguments. These are 'A', 'B', 'C', 'D' and 2, 3, 1, 2 respectively, from the columns of households. In effect, mapply zips them together by applying rep to the first element of each, the second element of each, and so on. I’ve drawn this below.

[Cartoon: the data values above being zipped by mapply into a list of bags of 2 As, 3 Bs, etc., and then unlisted into a list of 2 As followed by 3 Bs etc.]

I drew the IDs as colours, because their value doesn’t matter, and it makes the graphic easier to read. What it does show clearly is that the output from mapply is not flat. It’s a list of (vectors of IDs). This is because each call of rep generates a vector, and each of these becomes one element of mapply‘s result. Before I can use the result as a column of a data frame, I have to flatten it, which is what the call to unlist does.

So that’s the first column of my children data frame. It contains the household IDs, but repeated as many times as each household has children. Now I have to make the second column.

kid_nums <- mapply( function(count){1:count}, households$num_kids )

kid_nums_as_vec <- unlist( kid_nums )
This works in the same kind of way to the previous mapply, but operates on only one vector. It applies a function which takes a count as argument, and returns a vector made using the built-in : operator of integers running from 1 to that count. Applied to 2, the first element of households$num_kids, this function returns the vector 1 2. Applied to three, the second element, it returns 1 2 3. As before, mapply returns a list of vectors, and also as before, I have to flatten it using unlist.
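As an aside, base R can produce both of these columns without mapply or unlist at all: rep accepts a vector of repeat counts, and sequence concatenates the runs 1:n for each count. A sketch:

```r
households <- data.frame( id=c('A','B','C','D'),
                          num_kids=c(2, 3, 1, 2),
                          stringsAsFactors=FALSE )

# rep with a vector second argument repeats each ID the matching
# number of times, already flattened:
kid_ids_as_vec <- rep( households$id, households$num_kids )
# 'A' 'A' 'B' 'B' 'B' 'C' 'D' 'D'

# sequence concatenates 1:2, 1:3, 1:1, 1:2:
kid_nums_as_vec <- sequence( households$num_kids )
# 1 2 1 2 3 1 1 2
```

I'm keeping the mapply versions in the post, though, because they generalise to functions other than rep and :.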

So I now have the first two columns of my children data. I’ll now make the third. This is how much each child gets in weekly pocket money. In real life, I’d have something else: my choice here just demonstrates one way of generating random data, something useful when testing plotting and aggregation functions, for example.

pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )
This generates a vector of as many values as we have rows (given by length(kid_ids_as_vec)), normally distributed about 20, with a standard deviation of 10.

And now I make these three columns into a data frame.

kids <- data_frame( id=kid_ids_as_vec,
                    kid_num=kid_nums_as_vec,
                    pocket_money=pocket_monies )

To finish, here’s the complete code.

# random_kids.R
# Written for my blog. Illustrates
# the techniques that I used for
# generating random benefit units.

library( tidyverse )

random_kids <- function( households ) {
  kid_ids <- mapply( rep, households$id, households$num_kids )
  kid_ids_as_vec <- unlist( kid_ids )

  kid_nums <- mapply( function(count){1:count}, households$num_kids )
  kid_nums_as_vec <- unlist( kid_nums )

  pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )

  data_frame( id=kid_ids_as_vec,
              kid_num=kid_nums_as_vec,
              pocket_money=pocket_monies )
}


households <- tribble( ~id, ~num_kids,
                       'A',         2,
                       'B',         3,
                       'C',         1,
                       'D',         2 )

kids <- random_kids( households )

And this is what the above call outputs. You’ll see that I made my code into a function. I do that with everything, because I never know when I’ll want to reuse it. When testing, for instance, I often want to call the same code more than once.

# A tibble: 8 x 3
     id kid_num pocket_money
  <chr>   <int>        <dbl>
1     A       1     17.12020
2     A       2     13.40505
3     B       1     25.81609
4     B       2     20.62115
5     B       3     31.25019
6     C       1     19.93400
7     D       1     30.98877
8     D       2     31.05521