Abstract Data Types and the Uniform Referent Principle II: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()

In “Abstract Data Types and the Uniform Referent Principle I: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()”, I explained why the notation for interfacing to a data structure should be independent of that structure’s representation. R programmers honour this principle in the same way that bricks hang in the sky. All published R code that operates on data frames uses column names. Sometimes these follow the $ operator; sometimes the data frame is implicit via attach() or similar. In the Tidyverse, the column names will often be part of a mutate(), the data frame being piped through a sequence of %>% operators. And this is dreadful software engineering.

Why? Look at the tables below. They represent four different ways of storing my income data.

Person  Income_Type  Income_Value
Alice   Wages        37000
Alice   Bonuses      0
Alice   Benefits     0
Bob     Wages        14000
Bob     Bonuses      1000
Bob     Benefits     6000

Person  Income_Wages  Income_Bonuses  Income_Benefits
Alice   37000         0               0
Bob     14000         1000            6000

Person  Income
Alice   Type      Value
        Wages     37000
        Bonuses   0
        Benefits  0
Bob     Type      Value
        Wages     14000
        Bonuses   1000
        Benefits  6000

Person  Income
Alice   Wages  Bonuses  Benefits
        37000  0        0
Bob     Wages  Bonuses  Benefits
        14000  1000     6000

Abstractly, the data is the same in each case, and if you’re familiar with nest(), unnest(), gather() and spread(), you will easily see how to transform one table into any of the others. But the tables are implemented in very different ways. If you access their elements with $ or an equivalent, and you then change the implementation, you have to rewrite all those accesses. Which is dreadful software engineering.
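To make this concrete, here is a minimal sketch in base R of what a representation-independent interface might look like. The accessor income() and the class names income_long and income_wide are hypothetical names chosen for illustration; they are not part of base R or the Tidyverse. The sketch covers only the first two layouts, and uses S3 dispatch as one possible mechanism among several.

```r
# Long layout: one row per (Person, Income_Type) pair.
long <- data.frame(
  Person       = rep(c("Alice", "Bob"), each = 3),
  Income_Type  = rep(c("Wages", "Bonuses", "Benefits"), times = 2),
  Income_Value = c(37000, 0, 0, 14000, 1000, 6000),
  stringsAsFactors = FALSE
)
# Hypothetical class tag so that S3 dispatch can tell the layouts apart.
class(long) <- c("income_long", class(long))

# Wide layout: one row per person, one column per income type.
wide <- data.frame(
  Person          = c("Alice", "Bob"),
  Income_Wages    = c(37000, 14000),
  Income_Bonuses  = c(0, 1000),
  Income_Benefits = c(0, 6000),
  stringsAsFactors = FALSE
)
class(wide) <- c("income_wide", class(wide))

# The single interface that calling code sees (hypothetical generic).
income <- function(x, person, type) UseMethod("income")

# Implementation for the long layout: filter rows on person and type.
income.income_long <- function(x, person, type) {
  x$Income_Value[x$Person == person & x$Income_Type == type]
}

# Implementation for the wide layout: pick the column, then the row.
income.income_wide <- function(x, person, type) {
  x[[paste0("Income_", type)]][x$Person == person]
}

# The calling code is identical whichever layout is in use.
income(long, "Bob", "Bonuses")
income(wide, "Bob", "Bonuses")
```

Both calls return 1000. If the table is later switched from one layout to the other, only the method for that layout has to be written; code that calls income() stays as it is, which is exactly what code littered with $Income_Value or $Income_Bonuses cannot offer.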

Cartoon of experimenter peering into innards of complicated piece of machinery. His colleague is holding a plug coming out of it labelled INTERFACE: 'GET' 'INCOME' 'WAGES' and saying 'Don't worry about how it works. It's the interface that's important.'

1 thought on “Abstract Data Types and the Uniform Referent Principle II: why Douglas T. Ross would hate nest(), unnest(), gather() and spread()”

  1. I suspect that what this example is getting at is more about how to structure tables than about ADTs. The data frame (or tibble) is a handy type with an interface that hides its gnarly implementation details. I suspect Ross would probably be fairly satisfied with the data frame or tibble as a type.

    On the other hand, I think Chris Date would have useful things to say about this example and how it relates to the relational model of data (as would E.F. Codd, Fabian Pascal, and others in that area).

    The question of how to structure tables (relations) to be able to gracefully handle new requirements and database structural additions without breaking pre-existing code that refers to the data is fascinating to me. You might be interested in checking out the set of practices known as “Anchor Modeling”, which relies on structuring tables in 6th normal form and using views for logical data independence.
