Random Benefit Units for Households I: Generating Random Subrows of a Row

The data for our economic model comes from records representing the income and expenditure of British households. However, the structure isn’t as simple as one row per household, because households have to be split into “benefit units”. The word “benefit” here refers to the money the State gives you when you’re ill, out of work, and so on. The “Households, families and benefit units” page on poverty.org explains that whereas a “household” is a group of people living at the same address who share common housekeeping or a living room, a benefit unit is an adult plus their spouse if they have one, plus any dependent children they are living with. So mum and dad plus 10-year-old Johnnie would be one benefit unit. But if Johnnie is over 18, he becomes an adult who just happens to live with his parents, and the household has two benefit units.

Some of our data and results are per household, but others have to be per benefit unit. In this post, I’ll explain how, given a table of households, each with a field saying how many benefit units it has, I generated random benefit units to go with them. This is more general than it sounds. Given one kind of data, and another kind that it “contains”, how do you generate instances of the second kind? My answer involves the function mapply.

Because benefit units won’t be familiar to most readers, I’ll talk about children instead. Let’s start by making some sample households:

library( tidyverse )

households <- tribble( ~id, ~num_kids,
                       'A',         2,
                       'B',         3,
                       'C',         1,
                       'D',         2
                     )

This creates four households. Each has only two fields: an ID and a number of children. So the first household has ID 'A' and two children; the second has ID 'B' and three children; and so on. In our real data, household IDs are numeric, but I’m using strings here so that I can’t accidentally mix them up with the num_kids values.

Next, I assign the number of households to a variable, because I’m going to use it more than once later on.

num_households <- nrow( households )

Now I want to think about the information I’ll generate for each child record. There has to be a household ID, so that I can link the children table and the households table. I’m also going to give each child a sequence number within its household. And, for this example, one other piece of data: how much pocket money the child gets a week.

So my children table will have three columns. I’ll now generate the first of these:

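kid_ids <- mapply( rep, 
                   households$id, 
                   households$num_kids,
                   SIMPLIFY = FALSE
                 )

kid_ids_as_vec <- unlist( kid_ids )

The first statement uses mapply to combine two vectors, given as its second and third arguments. These are 'A', 'B', 'C', 'D' and 2, 3, 1, 2 respectively, from the columns of households. (The last argument, SIMPLIFY = FALSE, makes mapply return a list even when every household has the same number of children; by default it would simplify such a result to a matrix.) In effect, mapply zips the two vectors together by applying rep to the first element of each, the second element of each, and so on. I’ve drawn this below.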

I drew the IDs as colours, because their value doesn’t matter, and it makes the graphic easier to read. What it does show clearly is that the output from mapply is not flat. It’s a list of (vectors of IDs). This is because each call of rep generates a vector, and each of these becomes one element of mapply’s result. Before I can use the result as a column of a data frame, I have to flatten it, which is what the call to unlist does.
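
To make the nested shape and the flattening concrete, here is a small base-R sketch. The names ids, counts, nested, flat, equal_counts and direct are made up for this illustration; ids and counts stand in for the two columns of households. It passes SIMPLIFY = FALSE so that the result is always a list, and then shows what mapply does by default when all the counts happen to be equal.

```r
# Stand-ins for households$id and households$num_kids (illustrative names).
ids    <- c('A', 'B', 'C', 'D')
counts <- c(2, 3, 1, 2)

# One vector of repeated IDs per household, collected in a list.
nested <- mapply(rep, ids, counts, SIMPLIFY = FALSE)
str(nested)

# Flatten the list; use.names = FALSE drops the element names mapply attaches.
flat <- unlist(nested, use.names = FALSE)
print(flat)

# With equal counts and default simplification, mapply returns a matrix.
equal_counts <- mapply(rep, c('A', 'B'), c(2, 2))
print(is.matrix(equal_counts))

# rep is vectorised over 'times', so it can also build the flat column directly.
direct <- rep(ids, times = counts)
print(identical(flat, direct))
```

The last two lines are an alternative rather than the method used in this post: for this particular column, rep with a times vector gives the same result without mapply. The mapply form is the one that generalises to functions that are not already vectorised.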

So that’s the first column of my children data frame. It contains the household IDs, but repeated as many times as each household has children. Now I have to make the second column.

kid_nums <- mapply( function(count){seq_len(count)}, 
                    households$num_kids,
                    SIMPLIFY = FALSE
                  )

kid_nums_as_vec <- unlist( kid_nums )

This works in much the same way as the previous mapply, but operates on only one vector. It applies a function which takes a count as argument and uses the built-in seq_len to return a vector of integers running from 1 to that count. (I use seq_len rather than 1:count because 1:0 counts down, giving 1 0, whereas seq_len(0) is empty, which is what a household with no children needs.) Applied to 2, the first element of households$num_kids, this function returns the vector 1 2. Applied to 3, the second element, it returns 1 2 3. As before, mapply returns a list of vectors, and also as before, I have to flatten it using unlist.
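
Base R’s sequence function builds the same concatenated runs in a single call. Here is a small sketch comparing the two approaches. The vector counts is a made-up stand-in for households$num_kids, and I’ve included a zero in it to check that a household with no children contributes no sequence numbers. The mapply version in the sketch uses seq_len for each run.

```r
# Illustrative counts, including a household with no children.
counts <- c(2, 0, 3)

# The mapply-then-unlist route: one run of 1..count per household.
via_mapply <- unlist(mapply(function(count) seq_len(count),
                            counts,
                            SIMPLIFY = FALSE))

# The same thing in one call.
via_sequence <- sequence(counts)

print(via_mapply)
print(via_sequence)
print(isTRUE(all.equal(via_mapply, via_sequence)))

# The total number of sequence numbers matches the total number of children.
print(length(via_sequence) == sum(counts))
```

Which to use is a matter of taste. I’d treat sequence as a shortcut for this specific column, and mapply as the general pattern for generating several sub-items per row.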

So I now have the first two columns of my children data. I’ll now make the third. This is how much each child gets in weekly pocket money. In real life, I’d have something else: my choice here just demonstrates one way of generating random data, something useful when testing plotting and aggregation functions, for example.

pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )

This generates a vector of as many values as we have rows (given by length(kid_ids_as_vec)), normally distributed about a mean of 20, with a standard deviation of 10.
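
Because the values are random, every run gives different pocket money, which can be awkward when testing. The sketch below shows one way to make the draw repeatable with set.seed, and names the mean and sd arguments so they can’t be swapped by accident. The seed 42, the count n, and the variable names are arbitrary choices of mine. The clamp with pmax is also my own addition: a normal distribution with these parameters will occasionally produce a negative amount.

```r
n <- 8  # illustrative number of children

# Fixing the seed makes the "random" draw repeatable.
set.seed(42)
first_draw <- rnorm(n, mean = 20, sd = 10)

set.seed(42)
second_draw <- rnorm(n, mean = 20, sd = 10)

# Same seed, same values.
print(identical(first_draw, second_draw))

# Optionally replace any negative amounts with zero.
non_negative <- pmax(first_draw, 0)
print(non_negative)
```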

And now I make these three columns into a data frame.

kids <- tibble( id=kid_ids_as_vec, 
                kid_num=kid_nums_as_vec,
                pocket_money=pocket_monies 
              )

To finish, here’s the complete code.

# random_kids.R
#
# Written for my blog. Illustrates
# the techniques that I used for
# generating random benefit units.


library( tidyverse )


random_kids <- function( households )
{
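  # Household ID repeated once per child. SIMPLIFY = FALSE
  # keeps the result a list even if all counts are equal.
  kid_ids <- mapply( rep, 
                     households$id, 
                     households$num_kids,
                     SIMPLIFY = FALSE
                   )

  kid_ids_as_vec <- unlist( kid_ids )

  # Sequence number of each child within its household.
  # seq_len(0) is empty, whereas 1:0 would give 1 0.
  kid_nums <- mapply( function(count){seq_len(count)}, 
                      households$num_kids,
                      SIMPLIFY = FALSE
                    )

  kid_nums_as_vec <- unlist( kid_nums )

  # Weekly pocket money: mean 20, standard deviation 10.
  pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )

  kids <- tibble( id=kid_ids_as_vec, 
                  kid_num=kid_nums_as_vec,
                  pocket_money=pocket_monies 
                )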

  kids
}


households <- tribble( ~id, ~num_kids,
                       'A',         2,
                       'B',         3,
                       'C',         1,
                       'D',         2
                     )

kids <- random_kids( households )

And this is what kids looks like when printed. You’ll see that I made my code into a function. I do that with everything, because I never know when I’ll want to reuse it. When testing, for instance, I often want to call the same code more than once.

# A tibble: 8 x 3
     id kid_num pocket_money
  <chr>   <int>        <dbl>
1     A       1     17.12020
2     A       2     13.40505
3     B       1     25.81609
4     B       2     20.62115
5     B       3     31.25019
6     C       1     19.93400
7     D       1     30.98877
8     D       2     31.05521
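
Since the household ID is there to link the two tables, a quick sanity check is to count the children per household and join the counts back onto households. The sketch below is one way of doing that, not part of the original code. It rebuilds households and kids in a compact form so that it runs on its own, and the names per_household, check, kids_found and total_pocket_money are made up for this illustration.

```r
library(tidyverse)

# Compact stand-ins for the two tables built in this post.
households <- tibble(id = c('A', 'B', 'C', 'D'),
                     num_kids = c(2, 3, 1, 2))

kids <- tibble(id = rep(households$id, times = households$num_kids),
               kid_num = sequence(households$num_kids),
               pocket_money = rnorm(sum(households$num_kids), mean = 20, sd = 10))

# Aggregate the children back up to one row per household.
per_household <- kids %>%
  group_by(id) %>%
  summarise(kids_found = n(),
            total_pocket_money = sum(pocket_money))

# Link the aggregate to the households table by ID.
check <- households %>%
  left_join(per_household, by = "id")

print(check)

# Every household should have exactly as many child rows as num_kids says.
print(all(check$kids_found == check$num_kids))
```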
