Here’s another innocent-eye exploration, this time about the Tidyverse’s `summarise()` function. I’d been combining data tables by nesting and joining them, which gave me a tibble with nested tibbles in. I wanted to check the sizes of these inner tibbles, by mapping `nrow()` over the columns containing them. The Tidyverse provides several ways to do this, one of which (I thought) would be `summarise()`. So I tried calling it with the argument `s=nrow(tibbles)`, where `tibbles` was the column with the tibbles in. It crashed. Why? And how should I make it work?

The insight I got from these experiments is that `summarise()` passes its summarising functions a *slice* of a column, not an element. To illustrate the difference in meaning: a slice of a list is a list, possibly with fewer elements; a slice of a table is a table, possibly with fewer rows. But an element of a list is not a list, and an element of a table is not a table. I’d overlooked the distinction because I’m used to table columns being atomic vectors. In these, there’s no difference between an element of the vector and a slice. This is because R is strange, and regards all numbers and other primitive values as one-element vectors.
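
To make the slice/element distinction concrete, here is a minimal sketch in base R only. The names `v`, `l` and the `*_slice` / `*_element` variables are mine, chosen for illustration; single brackets take a slice and double brackets take an element.

```r
# Base R only: compare slicing with `[` against extracting with `[[`.
v <- c(10, 20, 30)          # an atomic vector
l <- list("a", 1:3, TRUE)   # a list

v_slice   <- v[2]     # a length-1 numeric vector
v_element <- v[[2]]   # also a length-1 numeric vector

l_slice   <- l[2]     # a list of length 1 that contains 1:3
l_element <- l[[2]]   # the integer vector 1:3 itself

identical(v_slice, v_element)   # TRUE: no difference for atomic vectors
identical(l_slice, l_element)   # FALSE: the slice is still a list
is.list(l_slice)                # TRUE
is.list(l_element)              # FALSE
length(l_slice)                 # 1
length(l_element)               # 3
```

For the atomic vector the two forms are interchangeable, which is why the distinction is easy to overlook until a list column turns up.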

But when the columns are lists, there is a difference. The summarising functions get passed a list containing the element rather than the element itself, so they have to unwrap it. And, by the way, if they return a result that’s to go into a list column, they must wrap it. That’s important if I want them to return tibbles.
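
The unwrapping and wrapping can be imitated without `summarise()` at all. What follows is only a simulation in base R, assuming one row per group; it is not how dplyr is implemented internally. The list `e` is a stand-in for a list column, and `slices`, `wrapped_lengths`, `unwrapped_lengths`, `flattened` and `kept` are illustrative names of mine.

```r
# A stand-in for a list column, one element per group.
e <- list(list(), list(1, 2, 3), list(x = 1))

# Simulate per-group slicing: each group sees e[i], a length-1 list,
# not e[[i]], the element inside it.
slices <- lapply(seq_along(e), function(i) e[i])

# Applying length() to the slice always gives 1 ...
wrapped_lengths <- sapply(slices, length)

# ... so the function has to unwrap with [[1]] to reach the element.
unwrapped_lengths <- sapply(slices, function(s) length(s[[1]]))

# Going the other way: combining bare results flattens them,
# while results wrapped in list() stay as one element each.
flattened <- c(1:2, 3:5)               # one integer vector of length 5
kept      <- c(list(1:2), list(3:5))   # a list of length 2
```

Here `wrapped_lengths` is all ones, while `unwrapped_lengths` gives the real lengths 0, 3 and 1. The last two lines are an analogy for why a result destined for a list column needs its own `list()` wrapper.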

With that as my introduction, here is my code, with comments explaining what I was trying.

    # try_summarise.R
    #
    # Some experiments with summarise(),
    # to work out why it didn't seem
    # to work when applied to columns
    # that are lists of tibbles.

    library( tidyverse )
    library( stringr )
    # For str_c() .

    t <- tribble( ~a, ~b   , ~c  , ~d     , ~e         , ~f
                ,  3, FALSE, "AA", c(1,11), list(x=1)  , tibble(x=c(1))
                ,  1, TRUE , "B" , c(2,22), list()     , tibble(y=c(1,2))
                ,  2, TRUE , "CC", c(3,33), list(1,2,3), tibble(x=1,y=2)
                )

    summarise( t, s=min(a) )
    #
    # # A tibble: 1 x 1
    #       s
    #   <dbl>
    # 1     1
    # So a tibble with one element,
    # 1.

    summarise( t, s=max(a) )
    #
    # A tibble with one element,
    # 3.

    summarise( t, s=mean(a) )
    #
    # A tibble with one element,
    # 2.

    summarise( t, s=str_c(c) )
    #
    # Gives an error:
    # Error in summarise_impl(.data, dots) :
    # Column `s` must be length 1 (a summary value), not 3

    summarise( t, s=str_c(c,collapse='') )
    #
    # A tibble with one element,
    # 'AABCC'.

    summarise( t, s=any(b) )
    #
    # A tibble with one element,
    # TRUE.

    summarise( t, s=all(b) )
    #
    # A tibble with one element,
    # FALSE.

    summarise( t, s=show(a) )
    # [1] 3 1 2
    # Then gives error:
    # Error in summarise_impl(.data, dots) :
    # Column `s` is of unsupported type NULL

    # The above all show that the
    # entire column (t$a or t$b or t$c)
    # gets passed to the expression
    # after =. If that expression can't
    # reduce it to a single atomic
    # value, we get an error.

    # This was confirmed by the
    #   summarise( t, s=show(a) )
    # and also by the two below.

    summarise( t, s=identity(a) )
    # Gives error:
    # Error in summarise_impl(.data, dots) :
    # Column `s` must be length 1 (a summary value), not 3

    summarise( t, s=a[[1]] )
    #
    # A tibble with one element,
    # 3.

    summarise( t, s=a[[3]] )
    #
    # A tibble with one element,
    # 2.

    # These are consistent with the
    # entire column t$a , which is
    #   c(3,1,2)
    # being passed.

    # What happens if I group by a?

    t %>% group_by(a) %>% summarise( s=min(a) )
    #
    # # A tibble: 3 x 2
    #       a     s
    #   <dbl> <dbl>
    # 1     1     1
    # 2     2     2
    # 3     3     3
    # So now I get a tibble with
    # as many rows as t has.

    t %>% group_by(a) %>% summarise( s=show(a) )
    #
    # Shows 1 and then gives an error.

    # So now, t is sliced into three
    # groups. There are three calls to
    # the expression after =, and in
    # each, the appropriate slice of t$a
    # is substituted for 'a' in the expression.

    # Let's confirm this.

    t %>% group_by(a) %>% summarise( s=nchar(c) )
    #
    # Gives a tibble whose column s
    # has three elements, 1 2 2.
    # These are the lengths of the elements
    # of t$c , in order of a.

    # Does this work with list columns?
    # That's where I had trouble, and is
    # what inspired me to try these
    # calls.

    t %>% group_by(a) %>% summarise( s=length(e) )
    #
    # # A tibble: 3 x 2
    #       a     s
    #   <dbl> <int>
    # 1     1     1
    # 2     2     1
    # 3     3     1
    # So here I get lengths of 1. But
    # I'd expect 0, 3, and 1.

    t %>% group_by(a) %>% summarise( s=nrow(f) )
    #
    # And here, I get an error:
    # Error in summarise_impl(.data, dots) :
    # Column `s` is of unsupported type NULL

    # Why don't these work? The calls
    #   summarise( s=length(e) )
    #   summarise( s=nrow(f) )
    # are surely analogous to
    #   summarise( s=nchar(c) ) .

    # Epiphany! The reason, I realise, is
    # that in each of the expressions after =,
    # the column variable is getting
    # substituted for by a single-row
    # slice of that column. (This is in
    # my grouped examples. In the others,
    # it gets substituted for by the entire
    # column.)

    # When the columns are atomic vectors,
    # this single-row slice is an atomic
    # vector too, with just one element.
    # But in R, these get treated like
    # single numbers, strings or Booleans.
    # So the call to nchar() gets passed
    # a vector containing a single string
    # element of column c. It works, because
    # such single elements are vectors anyway.

    # But when the columns are lists,
    # the single-row slice is also a list.
    # So nrow(f), for example, gets passed
    # not a tibble but a list containing
    # a tibble. It then crashes. Similarly,
    # length(e) gets passed a list containing
    # a single element of column e,
    # and always returns 1.

    # I'll confirm this by doing these calls.

    summarise( t%>%group_by(a), s=nrow(f[[1]]) )
    #
    # Works! Returns a tibble with the table
    # lengths.

    summarise( t%>%group_by(a), s=length(e[[1]]) )
    #
    # Also works. Returns a tibble with the
    # list lengths.

    # The key to all this is that if v
    # is a length-1 atomic vector, v == v[[1]].
    # > v <- c(1)
    # > v
    # [1] 1
    # > v[[1]]
    # [1] 1

    # But if l is a length-1 list, l != l[[1]]:
    # > l <- list(tibble())
    # > l
    # [[1]]
    # A tibble: 0 x 0
    # Whereas:
    # > l[[1]]
    # A tibble: 0 x 0
    # The latter loses a level of subscripting.

    # By the way, I note that summarise()
    # works with more than one column name,
    # and more than one use of a column name,
    # and substitutes them all appropriately.

    summarise( t%>%group_by(a), s=a/a )
    #
    # Returns a tibble with three
    # rows, all 1.

    summarise( t%>%group_by(a), s=str_c(a,b,c,collapse='') )
    #
    # Returns a tibble where each
    # row is the result of concatenating
    # the first three elements in
    # the corresponding row of t.

    # Finally, I also note that if
    # I want to _return_ a tibble
    # from a summary function, I have
    # to wrap it in a list.

    summarise( t%>%group_by(a), s=tibble(x=a) )
    #
    # Runs, but doesn't do what I wanted.
    # (Puts a double into each row.)

    summarise( t%>%group_by(a), s=list(tibble(x=a)) )
    #
    # Runs, and does do what I wanted:
    # A tibble: 3 x 2
    #       a s
    #   <dbl> <list>
    # 1     1 <tibble [1 x 1]>
    # 2     2 <tibble [1 x 1]>
    # 3     3 <tibble [1 x 1]>

    # I suppose that this is because
    # the expression in summarise() has
    # to return something that is
    # a row of a column, not an element.
    # If the column is an atomic vector,
    # these are the same, but they aren't
    # if it's a list.
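
For comparison, here is a sketch of one of the other ways to get the inner sizes, which sidesteps the slice problem by mapping over the elements of the list column directly. It is not part of the experiments above. The tibble `nested` is my own cut-down copy of `t` with only the columns needed, and the column names `e_len` and `f_rows` are hypothetical. `map_int()` comes from purrr and `lengths()` from base R.

```r
library(dplyr)
library(tibble)
library(purrr)

# A cut-down copy of the test table: one atomic column, two list columns.
nested <- tibble(
  a = c(3, 1, 2),
  e = list(list(x = 1), list(), list(1, 2, 3)),
  f = list(tibble(x = 1), tibble(y = c(1, 2)), tibble(x = 1, y = 2))
)

# map_int() and lengths() visit each *element* of the list column,
# so nrow() and length() see the inner tibble or list, not a wrapper.
sizes <- nested %>%
  mutate(
    e_len  = lengths(e),        # length of each inner list
    f_rows = map_int(f, nrow)   # row count of each inner tibble
  ) %>%
  select(a, e_len, f_rows)

sizes
```

Because no grouping is involved, the rows stay in their original order (a = 3, 1, 2) instead of being sorted by the grouping column.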