Demonstrating spread() and gather()

In my last post about R, I said I was translating an economic model from Python into R. It’s a microeconomic model, meaning that it simulates the behaviour of individual people rather than bulk quantities such as inflation and unemployment. The simulator uses data about the income and expenditure of British households, from sources such as the Family Resources Survey and Family Expenditure Survey . I’ve had to think about how to represent expenditures. For example, Bob spends £100 on food and £400 on rent. Do I have one column for his food expenditure and one for rent, or do I have one column for all expenditures with another “key” column indicating the type? Maybe I need both depending on how I’m going to analyse the data, with functions to translate between them. Luckily, Hadley Wickham’s spread and gather will do this. Here are some experiments.

The code below starts by creating a table, d, which has IDs in column 1, expenditure types in column 2, and expenditures in column 3. It then “spreads” this data so that each expenditure type gets its own column; and then “gathers” the result back into the original format. R has a built-in type for data tables, the “data frame”. But here, I’m using Hadley Wickham’s “tibbles” instead. These have several advantages. For example, you can nest one tibble inside another, which is likely to be useful when representing hierarchical data. And the way tibbles are printed is more informative than that for data frames. Here’s my code, with comments showing what it produces:

# test_gather_spread.R
#
# Some experiments with 'gather'
# and 'spread', to see how useful
# they might be.


library( tidyverse )
#
# Loads the Tidyverse libraries. You need to
# have done 
#   install.packages( "tidyverse" )
# first.


d <- tibble( ID=c( 1, 1, 1, 2, 3, 3 ),
             expensetype=c( 'food', 'alcohol', 'rent', 'food', 'food', 'rent' ),
             value = c( 100, 0, 400, 75, 50, 600 )
           )
#
# Makes a simple table with type of expenditure in one
# column and its value in another.
# Gives:
# A tibble: 6 x 3
#     ID expensetype value
#  <dbl>       <chr> <dbl>
#1     1        food   100
#2     1     alcohol     0
#3     1        rent   400
#4     2        food    75
#5     3        food    50
#6     3        rent   600

ds <- spread( d, expensetype, value )
#
# Spreads out the data so that each type of expenditure
# has its own column.
# Gives:
# A tibble: 3 x 4
#     ID alcohol  food  rent
#  <dbl>   <dbl> <dbl> <dbl>
#1     1       0   100   400
#2     2      NA    75    NA
#3     3      NA    50   600

dg <- gather( ds, "TYPE", "SPENT", 2:4 )
#
# Unspreads the data, back to the original arrangement.
# Gives:
# A tibble: 9 x 3
#     ID    TYPE SPENT
#  <dbl>   <chr> <dbl>
#1     1 alcohol     0
#2     2 alcohol    NA
#3     3 alcohol    NA
#4     1    food   100
#5     2    food    75
#6     3    food    50
#7     1    rent   400
#8     2    rent    NA
#9     3    rent   600

arrange( dg, ID )
#
# Sorts on ID.
# Gives:
# A tibble: 9 x 3
#     ID    TYPE SPENT
#  <dbl>   <chr> <dbl>
#1     1 alcohol     0
#2     1    food   100
#3     1    rent   400
#4     2 alcohol    NA
#5     2    food    75
#6     2    rent    NA
#7     3 alcohol    NA
#8     3    food    50
#9     3    rent   600

One thing worth noting is that I had to sort the sort the gathered data to restore the original ordering. Anyway, the rest of my code, below, shows how easy it is to plot the data. I’ve adapted these examples from monashbioinformaticsplatform.github.io’s “The tidyverse: dplyr, ggplot2, and friends”.

d <- tibble( ID=c( 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9,
                   10, 10, 11, 11, 12, 12, 13, 13, 14, 14, 15, 15, 16, 16, 17, 17, 18, 18,
                   19, 19, 20, 20, 21, 21 ),
             expensetype=c( 'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent',
                            'food', 'rent', 'food', 'rent', 'food', 'rent' ),
             value = c( 100, 400, 75, 350, 50, 300, 
                        100, 500, 40, 300, 120, 450,
                        80, 370, 80, 350, 100, 500, 
                        100, 500, 40, 300, 120, 450,
                        70, 340, 75, 350, 150, 500, 
                        100, 500, 120, 500, 120, 450, 
                        130, 450, 50, 380, 100, 550 )
           )
#
# Like d above, but with more rows.

ds <- spread( d, expensetype, value )
#
# Like ds above, but with more rows.

ggplot( ds, aes( food, rent ) ) + geom_point()
ggsave( "plot1.png" )
#
# Plots food expenditure against rent expenditure.

ggplot( ds, aes( food, rent ) ) + geom_point() +
                                  geom_smooth( method="lm" )
ggsave( "plot2.png" )
#
# Plots food expenditure against rent expenditure
# showing the best-fit line from a linear fit.

ggplot( ds, aes( food ) ) + geom_histogram( binwidth=25 )
ggsave( "plot3.png" )
#
# Plots a histogram of the food expenditures.

ggplot( d, aes(value, fill = expensetype)) + geom_histogram( binwidth=25, position="identity", alpha=0.2 )
ggsave( "plot4.png" ) 
#
# Plots a histogram of the food and rent
# expenditures on top of each other. Unlike above,
# this uses the original data rather than the spread
# version: the plotter relies on the expensetype
# column to decide which histogram to add to.

And here are the plots. The originals were bigger, but I’ve shrunk them to fit the table into a reasonably-sized desktop display.

Leave a Reply

Your email address will not be published. Required fields are marked *