How Best to Convert a Names-Values Tibble to a Named List?

Here, in the spirit of my “Experiments with by_row()” post, are some experiments in writing and timing a function spread_to_list that converts a two-column tibble such as:

x 1
y 2 
z 3
t 4
to a named list:
list( x=1, y=2, z=3, t=4 )
I need this for processing the parameter sheets shown in that by_row post, and I’ll explain why later. In this post, I’m just interested in how best to define spread_to_list. The best implementation looks like either spread_to_list_3 or spread_to_list_4 below.

# try_spread_to_list.R
#
# Consider a tibble t with two columns,
# where each cell in the second column
# represents the value associated with
# the string (assumed to be its name) 
# in the first column.
#
# I want to define a function spread_to_list() 
# which converts t into a named list whose names
# are the names in the first column, and whose
# values are the values in the second column.
#
# For example, if t is:
#        names no_of_days
#    DaysInMay         31
#   DaysInJune         30
# then
#   spread_to_list(t) 
# would be the list
#   list( DaysInMay = 31, DaysInJune = 30 )
#
# The code here tries various ways
# of implementing spread_to_list,
# and benchmarks them. Two are variants
# of one another, using spread() and converting
# its result. Another two take the values
# column as a list, and call setNames() to 
# convert to a named list.


library( tidyverse )
library( microbenchmark )
library( stringr )


t <- tribble( ~a , ~b 
            , 'x',  1
            , 'y',  2 
            , 'z',  3
            , 't',  4
            )
#
# I'm going to try various ways
# of implementing spread_to_list()
# on t.


# First, what happens if I call spread()?

s <- spread( t, a, b )
#
# s becomes a one-row tibble:
#         t     x     y     z
#   1     4     1     2     3


# How can I convert s to a named list?

# An obvious way is to call map().
# Let's see what the function argument to
# map() gets passed if I call map() on s.

map( s, show )
#
# Displays
#  [1] 4
#  [1] 1
#  [1] 2
#  [1] 3
# So it gets passed a column of the
# tibble as an atomic vector.

map( s, function(x)x )
#
# Returns a list of these elements:
#   list(t = 4, x = 1, y = 2, z = 3)
# That's because map() is defined to return 
# lists. So the call above uses it merely
# as a type converter.

map( s, identity )
#
# Does the same. identity() is a built-in
# identity function.


# But maybe I can avoid mapping. In my
# experiments with by_row(), 
# http://www.j-paine.org/blog/2017/10/experiments-with-by_row.html ,
# I discovered that as.list() will convert
# a tibble to a named list.

as.list( s )
#
# Also returns 
#   list(t = 4, x = 1, y = 2, z = 3).


# But can I avoid spread() altogether?
# Browsing the discussion groups gave me
# the idea of trying setNames().
# Let's try that, passing t's names as its
# second argunent and t's values as its
# first.

setNames( t[[2]], t[[1]] )
# 
# Gives me an atomic named vector. 

# I need to convert it to a list.
# One way is for the argument to setNames()
# to be a list, because that's specified to
# make it return a list.

setNames( as.list( t[[2]] ), t[[1]] )
#
# Returns the list
#   list(x = 1, y = 2, z = 3, t = 4)

# But I could convert the result instead.

as.list( setNames( t[[2]], t[[1]] ) )
#
# Also returns
#   list(x = 1, y = 2, z = 3, t = 4)


# Let's try these four implementations.
# First, define the functions.

spread_to_list_1 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>% 
    spread( !!as.name(colname1), !!as.name(colname2) ) %>% 
    map( identity )
}

spread_to_list_2 <- function( t )
{
  colname1 <- names( t )[[1]]
  colname2 <- names( t )[[2]]
  t %>% 
    spread( !!as.name(colname1), !!as.name(colname2) ) %>% 
    as.list
}

spread_to_list_3 <- function( t )
{
  setNames( as.list( t[[2]] ), t[[1]] )
}

spread_to_list_4 <- function( t )
{
  as.list( setNames( t[[2]], t[[1]] ) )
}


# Now try them.

s1 <- spread_to_list_1( t )
s2 <- spread_to_list_2( t )
s3 <- spread_to_list_3( t )
s4 <- spread_to_list_4( t )

dput( s1 )
dput( s2 )
dput( s3 )
dput( s4 )

identical( s1, s2 )
identical( s1, s3 )
identical( s1, s4 )


# They all return named lists, but the order
# of elements is different for the spread()-based
# versions than from the as.list()-based ones.
# (That was obvious earlier, actually.)
# So I'll sort the lists, then test that
# they're identical. I'll also microbenchmark
# the functions.

sort_list <- function(l)
{ 
  sort( unlist( l ) )
}

identical( sort_list(s1), sort_list(s2) )%>%show
identical( sort_list(s1), sort_list(s3) )%>%show
identical( sort_list(s1), sort_list(s4) )%>%show

mbres <- microbenchmark( spread_to_list_1( t )
                       , spread_to_list_2( t )
                       , spread_to_list_3( t )
                       , spread_to_list_4( t )
                       )
print( mbres )


# Now let's microbenchmark the functions applied
# to bigger tibbles. I'll generate random name-value
# tibbles of sizes n, where n is defined by the
# vector in the 'for' condition.

for ( n in c(10,30,100,300) ) {

  cat( "Trying ", n, "row tibble\n" )

  names <- replicate( n, str_c(sample(letters,5,replace=FALSE),collapse="") )
  #
  # Generate n random alphabetic strings.
  # From Dirk Eddelbuettel's answer to 
  # https://stackoverflow.com/questions/1439513/creating-a-sequential-list-of-letters-with-r .

  values <- runif( n, 1, 100 )
  #
  # Generate n random values.

  t <- tibble( names=names
             , values=values
             )
  #
  # Use these to make a random tibble with
  # two columns and n rows.

  identical( sort_list(s1), sort_list(s2) )%>%show
  identical( sort_list(s1), sort_list(s3) )%>%show
  identical( sort_list(s1), sort_list(s4) )%>%show

  mbres <- microbenchmark( spread_to_list_1( t )
                         , spread_to_list_2( t )
                         , spread_to_list_3( t )
                         , spread_to_list_4( t )
                         )
  print( mbres )
}


# Here are the microbenchmark results for the
# 300-row tibble:
#   Unit: microseconds
#                  expr       min        lq    
#   spread_to_list_1(t) 17738.631 17928.946 
#   spread_to_list_2(t) 14582.521 14805.888 
#   spread_to_list_3(t)    36.223    40.901   
#   spread_to_list_4(t)    35.317    40.146 
#                  expr        mean    median         uq
#   spread_to_list_1(t) 18668.44595 18221.133 19532.8080
#   spread_to_list_2(t) 15386.05835 15050.685 16355.4180
#   spread_to_list_3(t)    46.67244    47.089    51.4655
#   spread_to_list_4(t)    45.77894    45.882    51.6165
#         max neval
#   21477.003   100
#   17314.838   100
#      64.294   100
#      63.087   100

# So the two spread() versions are much slower.
# Converting the spread() result with mapping is
# slower than with as.list(), probably unsurprisingly.
# The two setNames() versions are much faster. 
# It doesn't seem to matter whether we type-convert
# to list by making setNames()'s first argument
# a list, or by making its result one.

Leave a Reply

Your email address will not be published. Required fields are marked *