Google Babysitter

Good news that Google have announced they’ll do something about the spoof children’s videos on YouTube. But given how long they’ve taken to do so, I think my cartoon below is justified. Artificial Intelligence is not up to rejecting such videos automatically (it’s an AI-complete task), and there are too many for staff to moderate.

Google Babysitter robot, with TV screen in head showing the video of Peppa Pig slashing
her face. Scared toddler gazes at screen. Robot is narrating, 'And Peppa Pig self-harmed, took to drugs, and died. Google made lots of money, and everyone lived happily ever after — except for a few traumatised children'.

Description

A Google Babysitter robot sits on a chair, with a TV screen in its head showing the notorious video of Peppa Pig slashing her face. A scared toddler gazes at the screen. The robot is narrating, “And Peppa Pig self-harmed, took to drugs, and died. Google made lots of money, and everyone lived happily ever after — except for a few traumatised children.”

Suffer, Little Children

Here’s a cartoon I just drew, following the news that YouTube has banned the Toy Freaks channel.
YouTube master control console in Google HQ. Screen on console shows a Toy Freaks video, with anguished child. There's a switch labelled 'Delete' / 'Leave until New York Times notices', set to the latter position. Coins are pouring out of a hopper on the side of the console.

Description

YouTube master control console in Google HQ. A screen on the console shows a Toy Freaks video, with anguished child. To its right is a two-position switch labelled ‘Delete’ / ‘Leave until New York Times notices’. It’s set to the latter position. Coins are pouring out of a hopper on the side of the console.

Beat the Delays: installing R and the BH package on a memory stick

I use R on a range of Windows machines. Often, I’ll use a machine only once, and it won’t already have R, so I want to carry an installation with me. I therefore decided to install R on a memory stick. Installing R itself worked, once I’d changed the folder on the “Select Destination Location” pop-up. But when I then tried installing the Tidyverse package, it seemed to hang. This is just a note for others facing the same problem, which in my case was caused by the BH package.

I tried installing Tidyverse packages one by one, and found that the hang was probably caused by BH. Some Googling led me to “mondano”’s question at https://stackoverflow.com/questions/31272280/installation-of-r-package-bh-not-possible, to which Dirk Eddelbuettel replied that “BH, as a sizeable subset of Boost Headers is big, as in really big”. It’s 111 megabytes, and mondano might simply have run out of patience if Windows was slow writing the files.

So I tried installing BH on its own, and using the Windows file explorer to watch the files. Indeed, I could see a temporary copy of BH slowly taking shape in a subdirectory of R-3.4.2/library called file1afc7c1c1629, with files slowly appearing under its subdirectories. Presumably, once the temporary copy had been made, R would copy it upwards to library where the other packages lived.

But after far too many minutes — memory sticks must be really slow — when the temporary copy was complete, it never did get copied upwards. Instead, I got the error Warning: unable to move temporary installation.

It would be nice to have been told the reason for the failure, but R is not that clever. There was enough space on the stick, so that wasn’t the problem. Time to Google once more. In https://stackoverflow.com/questions/5700505/windows-7-update-packages-problem-unable-to-move-temporary-installation, Tal Galili had the same error with the MASS package. The consensus reply was that this is probably an antivirus program locking a file.

So I installed R, and the Tidyverse, on my hard drive, which always works. I then copied the installed BH folder from that installation’s library to R-3.4.2/library on the memory stick. And the memory-stick Tidyverse then worked beautifully.
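The copying step can also be done from within R rather than with the file explorer. This is only a sketch: the helper name copy_installed_package() and the example paths are invented for illustration, and it assumes that an installed package is simply a folder inside the library directory. It uses only base-R file functions.

```r
# Hypothetical helper: copy an installed package's folder from one
# library directory to another. The name and paths are illustrative.
copy_installed_package <- function( pkg, from_lib, to_lib ) {
  src <- file.path( from_lib, pkg )
  if ( !dir.exists( src ) ) stop( "Package folder not found: ", src )

  # Make sure the destination library exists.
  dir.create( to_lib, recursive=TRUE, showWarnings=FALSE )

  # With recursive=TRUE and an existing destination directory,
  # file.copy() copies the whole folder into it.
  ok <- file.copy( src, to_lib, recursive=TRUE )
  if ( !all( ok ) ) stop( "Could not copy ", src, " to ", to_lib )

  invisible( file.path( to_lib, pkg ) )
}

# Example call, with made-up paths:
# copy_installed_package( "BH"
#                       , "C:/Program Files/R/R-3.4.2/library"
#                       , "E:/R-3.4.2/library"
#                       )
```

Copying a finished install avoids the temporary-folder move that failed on the stick, though the copy itself will still be slow for a package of this size.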

Experiments with count(), tally(), and summarise(): how to count and sum and list elements of a column in the same call

Most people have a job. Some don’t. And a few have more than one. I’ve mentioned before that our economic model works on data about British households, gathered from surveys such as the Family Resources Survey. Each collection of data is a year long, and contains a file that describes all the adults in the 20,000 or so households covered. There’s another file that describes their jobs. I have to make the model aggregate and merge these.

As an example, assume Pat has three jobs. Her adult data consists of one record, with details such as age and sex. Her jobs data has one record per job, stating how much the job pays. I need to squish these three records down to one which has a field for the number of jobs, a field for the total earnings, and a field with a list of earnings. And I must then merge the result with Pat’s adult record. Now, the jobs file and the adults file are both sorted in order of ID. In our Python version of the model, we looped over these, matching IDs, doing subsidiary loops round records with the same ID to aggregate data, and then combining the results. But in R, I’d like to write code that’s more concise, easier to apprehend, easier to tweak, and that suits R’s programming idioms. As it happens, the Tidyverse can do this easily with summarise(). But it took me a bit of time to get there, and I’m going to show the experiments I did, and why. I’ll precede these by explaining the main points I wanted to investigate, and what I discovered.

First, I’ll show the call to summarise() that worked. Suppose my jobs table t has a column id which identifies the adult, and a column value which gives a job’s earnings. Here’s such a table, obviously not realistic.

t <- tribble( ~id, ~value
            , 1  , 101
            , 1  , 102
            , 2  , 201
            , 3  , 301
            , 3  , 302
            , 3  , 303
            , 4  , 401
            )

From this, I want to make a table whose counts column says how many jobs the adult has, whose sums column is their total earnings, and whose lists column is a list of earnings. This is how:

t %>% 
  group_by( id ) %>% 
  summarise( counts=n(), sums=sum(value), lists=list(value) )

This may look obvious. But I was thrown off course, partly by some typos I made early on, and partly by the questions I was thinking about when I wrote “Second-Guessing R”.
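The merge with the adults file is not shown in this post, so here is a minimal sketch of how it might go, assuming left_join() is an acceptable way to do it. The adults table, its age and sex values, and the names jobs, job_summary and adults_with_jobs are all invented for illustration.

```r
library( tidyverse )

jobs <- tribble( ~id, ~value
               , 1  , 101
               , 1  , 102
               , 2  , 201
               , 3  , 301
               , 3  , 302
               , 3  , 303
               , 4  , 401
               )

# Invented adults table: one row per adult. Adult 5 has no job.
adults <- tribble( ~id, ~age, ~sex
                 , 1  , 34  , "F"
                 , 2  , 51  , "M"
                 , 3  , 28  , "F"
                 , 4  , 63  , "M"
                 , 5  , 45  , "F"
                 )

# Collapse the jobs to one row per adult, as in the call above.
job_summary <- jobs %>%
  group_by( id ) %>%
  summarise( counts=n(), sums=sum(value), lists=list(value) )

# Keep every adult, attaching the job summary where there is one.
# Adults without jobs get NA in 'counts' and 'sums'.
adults_with_jobs <- adults %>%
  left_join( job_summary, by="id" )
```

A left join keeps adults who have no jobs record, which an inner join would silently drop.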

Second, I’ll mention something that surprised me. If I assign the result of that call to tsl, each element of the lists column turns out to be an atomic vector, not a list. So identical( tsl[[1,4]], list(101,102) ) is FALSE, but identical( tsl[[1,4]], c(101,102) ) is TRUE. I don’t know whether this is something odd about base R or about summarise().

Third, a surprise arising from the above. I can’t call c() instead of list(). If I do, I get an error Error in summarise_impl(.data, dots): Column `lists` must be length 1 (a summary value), not 2. That sort of makes sense, in R’s strange world. Something seems to be “lowering” list() so that it produces atomic vectors when it would normally produce lists. So perhaps any other function also gets “lowered”. Including c(). But c() is already as low as it can get. You can’t have anything lower than an atomic vector. So “lowering” c() would render it useless. Which is what’s happened.
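One possible reading of both surprises can be checked in base R alone, without the Tidyverse. When list() is given a single atomic vector, it returns a list of length one whose only element is that vector, whereas c() given a single vector returns the vector unchanged. That would be consistent with summarise() accepting list(value) and rejecting c(value), but it is an assumption about what summarise() does internally, not something this sketch proves.

```r
v <- c( 101, 102 )

wrapped <- list( v )
length( wrapped )                      # 1: a list holding one element
identical( wrapped[[1]], v )           # TRUE: that element is the atomic vector
identical( wrapped, list(101,102) )    # FALSE: list(101,102) has two elements

length( c( v ) )                       # 2: longer than a single summary value
identical( c( v ), v )                 # TRUE: c() of one vector returns it unchanged
```

On this reading, the cell of the lists column holds an atomic vector simply because that is what was wrapped.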

Fourth, I wanted to see whether there was anything special about summarise(). Would the same functions work from mutate()? Indeed they would: they’re not restricted to summarise(). The code above produces one row per adult. Each group (i.e. each collection of rows for an adult’s jobs) has been “collapsed”, as the Tidyverse documentation for tally() puts it. But if I want to calculate counts, sums and lists without collapsing, then I can call mutate() instead of summarise():

tsl <- t %>% 
  group_by( id ) %>% 
  mutate( counts=n(), sums=sum(value), lists=list(value) )

This adds the count, sum, and list to every row of the original uncollapsed table.

Fifth, another surprise involving c(). This time, I can call it. But it just causes the lists column to duplicate the original values.

Sixth, I wanted to see whether summarise() would work if I didn’t group the table first. And it does. It treats the entire table as one group, and therefore collapses it to a single row.

Seventh, would mutate() also work if I didn’t group the table first? It should, of course, because that’s how it’s normally used. And it does. n() counts the rows in the original table, and sum() and list() get applied to the entire value column.

My eighth and final point is perhaps more general, being about how summarise() and mutate() treat their functions. For more about why I wanted to investigate this, read “Second-Guessing R”. In essence, I’ve found that the name of a column is always replaced by a vector. If the table holding the column is ungrouped, the vector is the entire column, just as it would be in a base-R call such as data_frame$values * 2. If it’s grouped, the vector becomes whatever slice of that column belongs to the first group, then whatever slice of that column belongs to the second group, and so on. This seems to be true whatever the function it’s passed to.

And the Tidyverse does not appear to recognise certain functions as special because of their name or other identity. It treats the built-in list() the same as a function with different code or a different name.
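The idea of “whatever slice of that column belongs to each group” can be imitated in base R with split(). This is a rough analogue for building intuition, not a description of how the Tidyverse is implemented. The function name describe is invented, standing in for any function one might pass to summarise().

```r
t <- data.frame( id    = c( 1, 1, 2, 3, 3, 3, 4 )
               , value = c( 101, 102, 201, 301, 302, 303, 401 )
               )

# One slice of 'value' per id, held in a named list.
slices <- split( t$value, t$id )

# A stand-in for any function that receives a whole slice at once.
describe <- function( x ) paste( x, collapse="," )

sapply( slices, length )      # counts per id: 2 1 3 1
sapply( slices, sum )         # sums per id: 203 201 906 401
sapply( slices, describe )    # "101,102" "201" "301,302,303" "401"
```

Each function is called once per slice and sees the whole slice, which matches the behaviour the experiments below suggest for grouped tables.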

With all that out of the way, here are my experiments. Incidentally, there are some that I haven’t mentioned above, mainly with tally() and count().

# try_sum_count_and_list.R
#
# When processing adults data, I
# need to summarise the earnings
# from each adult's jobs. An adult
# may have more than one job, and
# I need to make fields containing
# the number of jobs, the earnings
# summed, and a list of the 
# individual earnings. These are
# experiments in doing this with
# Tidyverse functions.
#
# This led on to some general
# questions about the functions,
# which I've blogged in
# http://www.j-paine.org/blog/2017/11/experiments-with-count-tally-and-summarise-how-to-count-and-sum-and-list-elements-of-a-column-in-the-same-call.html .


library( tidyverse )


t <- tribble( ~id, ~value
            , 1  , 101
            , 1  , 102
            , 2  , 201
            , 3  , 301
            , 3  , 302
            , 3  , 303
            , 4  , 401
            )

t %>% 
  group_by( id ) %>%
  tally()
#
#        id     n
#   1     1     2
#   2     2     1
#   3     3     3
#   4     4     1
#
# The 'n' column gives the
# number of rows in each 
# group. Unlike with add_count()
# below, the result is 
# collapsed so there's only 
# one row per ID.


t %>% 
  count( id )
#
# The same.


t %>% 
  add_count( id )
#
#        id value     n
#   1     1   101     2
#   2     1   102     2
#   3     2   201     1
#   4     3   301     3
#   5     3   302     3
#   6     3   303     3
#   7     4   401     1
#
# The table is not collapsed,
# so each ID has the same 
# number of rows as before. But
# they all get an 'n' column
# giving their count. add_tally()
# would do the same.


t %>% 
  group_by( id ) %>%
  tally( wt=value )
#
#        id     n
#   1     1   203
#   2     2   201
#   3     3   906
#   4     4   401
#
# Here, the 'n' column
# is the sum over 'value'
# within each group.


t %>% 
  count( id, wt=value )
#
# The same.


t %>% 
  add_count( id, wt=value )
#
#        id value     n
#   1     1   101   203
#   2     1   102   203
#   3     2   201   201
#   4     3   301   906
#   5     3   302   906
#   6     3   303   906
#   7     4   401   401
#
# As before with add_count(),
# the table is not collapsed.
# An 'n' cell has been appended
# to each row, giving the
# sum over the values in
# that row's group. add_tally()
# would do the same.


t %>% 
  add_count( id ) %>%
  filter( n== 1 )
#
#        id value     n
#   1     2   201     1
#   2     4   401     1
#
# As the documentation notes,
# "add_count() is useful for groupwise 
# filtering. E.g.: show only species 
# that have a single member."


t %>% 
  group_by( id ) %>% 
  summarise( n(), sum(value) )
#
#        id `n()` `sum(value)`
#   1     1     2          203
#   2     2     1          201
#   3     3     3          906
#   4     4     1          401
#
# With tally() and count(), I
# can't see a way to count rows and
# sum values in the same call. With
# summarise(), I can.


t %>% 
  group_by( id ) %>% 
  summarise( counts=n(), sums=sum(value) )
#
# As above, but the new columns have
# my names 'counts' and 'sums'.


tsl <- t %>% 
  group_by( id ) %>% 
  summarise( counts=n(), sums=sum(value), lists=list(value) )
# 
#        id counts  sums     lists
#   1     1      2   203 <dbl [2]>
#   2     2      1   201 <dbl [1]>
#   3     3      3   906 <dbl [3]>
#   4     4      1   401 <dbl [1]>
#
# So as well as counting rows and
# summing values, I can aggregate values 
# into a collection. But beware. Each 
# cell of 'lists' becomes an atomic 
# vector, not a list. The code below 
# proves this.
#
# (These are the first and second points
# in my blog post.)

identical( tsl[[1,4]], list(101,102) ) 
#   [1] FALSE

identical( tsl[[1,4]], c(101,102) ) 
#   [1] TRUE



tsc <- t %>% 
  group_by( id ) %>% 
  summarise( counts=n(), sums=sum(value), lists=c(value) )
#
# Gives an error:
#   Error in summarise_impl(.data, dots) : 
#   Column `lists` must be length 1 (a summary value), not 2
#
# So I can't aggregate by calling c().
# I do have to call list().
#
# (This is the third point in my blog post.)


tsl <- t %>% 
  group_by( id ) %>% 
  mutate( counts=n(), sums=sum(value), lists=list(value) )
#
#        id value counts  sums     lists
#   1     1   101      2   203 <dbl [2]>
#   2     1   102      2   203 <dbl [2]>
#   3     2   201      1   201 <dbl [1]>
#   4     3   301      3   906 <dbl [3]>
#   5     3   302      3   906 <dbl [3]>
#   6     3   303      3   906 <dbl [3]>
#   7     4   401      1   401 <dbl [1]>
#
# If I want the same effect as with 
# add_tally() and add_count(), this
# is the way to do it. It adds the
# count, sum, and list to every row
# of the original uncollapsed table.
#
# This also shows that n() and sum()
# are not restricted to use from
# summarise(). You can use them
# on grouped tables from mutate().
# And on ungrouped tables, I suppose,
# but that's probably not useful.
#
# Note: I've not shown it here, but
# the result is grouped, so it's 
# probably best to ungroup it.
#
# (This is the fourth point in my blog 
# post.)


tsl <- t %>% 
  group_by( id ) %>% 
  mutate( counts=n(), sums=sum(value), lists=c(value) )
#
#        id value counts  sums lists
#   1     1   101      2   203   101
#   2     1   102      2   203   102
#   3     2   201      1   201   201
#   4     3   301      3   906   301
#   5     3   302      3   906   302
#   6     3   303      3   906   303
#   7     4   401      1   401   401
#
# Very odd, and not useful. The result
# of c() doesn't get put into each cell
# of 'lists', but treated as a slice of the
# column. Probably related to the 
# behaviour I commented on in
# http://www.j-paine.org/blog/2017/10/experiments-with-summarise-or-when-does-x-sub-1-equal-x.html .
#
# As above, and as expected, the table
# is grouped.
#
# (This is the fifth point in my blog 
# post.)


tsl <- t %>% 
  group_by( id ) %>% 
  transmute( counts=n(), sums=sum(value), lists=list(value) )
#
# This produced the same table as above,
# but without the 'value' column, and 
# accompanied by a warning:
#   Adding missing grouping variables: `id`


tsl <- t %>%  
  summarise( counts=n(), sums=sum(value), lists=list(value) )
#
#     counts  sums     lists
#   1      7  1711 <dbl [7]>
#
# Out of interest, this is what
# happens if I summarise without grouping.
# The table is treated as one group.
#
# (This is the sixth point in my blog 
# post.)


tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=list(value) )
#
#        id value counts  sums     lists
#   1     1   101      7  1711 <dbl [7]>
#   2     1   102      7  1711 <dbl [7]>
#   3     2   201      7  1711 <dbl [7]>
#   4     3   301      7  1711 <dbl [7]>
#   5     3   302      7  1711 <dbl [7]>
#   6     3   303      7  1711 <dbl [7]>
#   7     4   401      7  1711 <dbl [7]>
#
# And this is what happens if I mutate
# without grouping. The aggregating 
# functions n(), sum() and list() get
# applied to the entire column. Their
# result is then appended to each row.
#
# (This is the seventh point in my
# blog post.)


my_list <- list

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# The same table as above.
#
# I wanted to see whether mutate() is 
# treating 'list' specially. For example,
# does it recognise the name and do
# something special because of it?
# This shows that it doesn't, because 
# I'm using a different name but getting
# the same result.
#
# (This and the stuff below make the 
# eighth point in my blog post.)


my_list <- function( x ) list( x )

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# The same table as above. 
#
# mutate() might have been recognising
# the value of list(), i.e. the pointer
# to its code. This shows it isn't,
# because I'm using a different pointer
# but getting the same result.


my_list <- function( x ) str_c( x[1], x[2], x[3], x[4], x[5], x[6], x[7], sep="," )

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# A table like those above but where each
# element of the 'lists' column is a string
# concatenation of all the values in
# the 'value' column. This shows that
# my_list() is receiving all the values
# in one go.


my_list <- function( x ) x[1]

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# A table like those above but where each
# element of the 'lists' column is 101.
# This is consistent with my conclusions
# above.


tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=value[1] )
#
# The same as the last table.
#
# This shows that it doesn't matter whether 
# the subscripting is hidden from mutate()'s
# view. In other words, it doesn't treat
# [] specially.


tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=value*2 )
#
# The same as the tables above, but each element
# of 'lists' is twice the corresponding element
# of 'value'.


my_list <- function(x) { cat("Calling my_list() with argument "); dput(x); x*2 } 

tsl <- t %>% 
  mutate( counts=n(), sums=sum(value), lists=my_list(value) )
#
# The same table as above. Outputs
#   Calling my_list() with argument c(101, 102, 201, 301, 302, 303, 401) .
#
# This confirms that my_list() gets called
# only once and is passed the entire column.
# 
# The point of these last two experiments is
# that when I first used calls such as 
#   mutate( new_value=value*2 )
# I assumed * was getting called once per row,
# and that mutate() substituted that row's
# 'value' as the argument to * . But that's
# not true. * gets the entire column. This
# works because of R's vectorisation.
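The last experiment uses an ungrouped table. A natural follow-up, sketched here, is to record what the function receives when the table is grouped. The names my_double, calls and tg are invented. I would expect one call per group, each receiving that group's slice of 'value', but the number of calls is left for the reader to inspect rather than asserted, since it depends on dplyr's internals.

```r
library( tidyverse )

t <- tribble( ~id, ~value
            , 1  , 101
            , 1  , 102
            , 2  , 201
            , 3  , 301
            , 3  , 302
            , 3  , 303
            , 4  , 401
            )

# Each argument my_double() receives is appended to this list.
calls <- list()

my_double <- function( x ) {
  calls[[ length(calls) + 1 ]] <<- x    # remember the argument
  x * 2
}

tg <- t %>%
  group_by( id ) %>%
  mutate( doubled=my_double(value) )

length( calls )    # how many times my_double() was called
calls              # the vectors it was given
```

Collecting the arguments in a list, rather than printing them with cat() and dput(), makes it easy to inspect them afterwards.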

Second-Guessing R

I keep doing experiments with R, and with its Tidyverse package, to discover whether these do what I think they’re doing. Am I justified in spending the time?

I’ve said before that the Tidyverse follows rather different conventions from those of base R. This is something Bob Muenchen wrote about in “The Tidyverse Curse”. Dare I add that he first published this when updating an article called “Why R is Hard to Learn”? I’ve decided that it’s worth putting up with these differences. They are outweighed by the Tidyverse’s benefits. But it does mean I have to understand its specification thoroughly. If I don’t, my code might be wrong. But the understanding is difficult, because the documentation sometimes lacks detail.

Moreover, I find myself second-guessing it, because I’m never sure how much clever processing might be being done by non-standard evaluation, or even by base-R assignment and vectorisation.

It doesn’t help that I’ve used over 20 other programming languages, some of which enable you to extend the language by defining macros — that is, functions that run code while your program is being compiled. In Prolog, a language I once taught Artificial Intelligence with, built-in functions named term_expansion and goal_expansion can look over your code and replace it by other code. You can make them read shorthand notations that describe a problem, and expand those notations into sequences of function calls to solve that problem. A well-known application built into most Prologs is “definite clause grammars” or DCGs. With these, you can write rules defining the grammar of — for example — English. These rules get rewritten by term_expansion, ending up as code that parses strings and discovers whether they are grammatically correct. Markus Triska’s “Prolog DCG Primer” shows what DCGs look like, while his “Prolog Macros” explains the general working of such things.

In Poplog, another language I taught with, there is a very sophisticated macro system, as John Gibson shows in “POP-11 COMPILER PROCEDURES”. You can even make the compiler put machine instructions into your code.

Given my experience of these and of compiler-writing, plus the knowledge that R can do weird and wonderful things with non-standard evaluation, it’s not surprising that I ask myself how much the Tidyverse is hacking my code behind the scenes. For instance, the Tidyverse has an n() function, which “can only be used from within summarise(), mutate() and filter()”. It “returns the number of observations (rows) in each group”. That document is actually wrong, because n() can be used from transmute() too. But apart from that, I wonder whether n() really is a function, or whether some clever bit of non-standard evaluation recognises the name and replaces it by the numbers of observations, or instructions to calculate same.
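Part of this question can be probed directly. The sketch below asks whether dplyr::n is an ordinary function object, and what happens when it is called outside summarise() and friends. The variable name outside is invented. This does not settle whether those verbs also recognise the name n and treat it specially.

```r
library( tidyverse )

# Is n an ordinary R function object?
is.function( dplyr::n )

# Call it outside any dplyr verb. tryCatch() turns an error into its
# message, so the script carries on whatever happens.
outside <- tryCatch( dplyr::n()
                   , error=function( e ) conditionMessage( e )
                   )
outside
```

If outside ends up holding an error message, then n() is a real function, but one that depends on context set up by the verb that calls it.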

With n(), that probably doesn’t matter much. It takes no arguments, so I don’t have to worry about how it processes them. But summarise(), mutate() etc. can also take functions such as sum(). Now, the documentation for summarise() has a section named “Useful functions”. This lists a few names: mean(), median(), sd(), min(), max(), and some others. Are these the same as the mean() and min() and max() I know in base R? Or are they more like the functions the Tidyverse describes as Select helpers? Is the Tidyverse recognising the names as special, or the code pointers or whatever identifies a function, so that it treats sum differently to a home-grown function with the same definition? And whatever these functions are, exactly what is the Tidyverse passing as their arguments, and is it hacking their results?
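One way to probe this is to compare the built-in sum() with a home-grown function that has the same definition but a different name and different code. The sketch below does that. The names my_sum, built_in and home_grown are invented. Identical results would suggest that the name is not what matters, though they would not rule out internal optimisation of recognised calls, which dplyr’s own documentation has at times described under the name “hybrid evaluation”.

```r
library( tidyverse )

t <- tribble( ~id, ~value
            , 1  , 101
            , 1  , 102
            , 2  , 201
            , 3  , 301
            , 3  , 302
            , 3  , 303
            , 4  , 401
            )

# Home-grown equivalent of sum(), written as an explicit loop.
my_sum <- function( x ) {
  total <- 0
  for ( v in x ) total <- total + v
  total
}

built_in   <- t %>% group_by( id ) %>% summarise( sums=sum(value) )
home_grown <- t %>% group_by( id ) %>% summarise( sums=my_sum(value) )

# Do the two give the same column?
identical( built_in$sums, home_grown$sums )

# Is the sum visible after loading the Tidyverse still base R's?
identical( sum, base::sum )
```

The same comparison could be repeated for mean(), min(), max() and the other functions listed under “Useful functions”.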

Lest I seem over-cautious, remember again that I want my code to be reliable, and that the semantics of R and the Tidyverse are not stated anywhere as precisely as, say, those of Pascal. Moreover, there’s the strange behaviour of list() and c() referred to in connection with summarise( lists=list(value) ) in my post “Experiments with count(), tally(), and summarise(): how to count and sum and list elements of a column in the same call”.

So in answer to my original question: yes, I believe I am justified.