The data for our economic model comes from records representing
the income and expenditure of British households. However, the
structure isn’t as simple as just one row per household. This is
because it’s necessary to split households into “benefit units”: the word “benefit”
here referring to the money the State gives
you when you’re ill, out of work, or whatever. The
“Households, families
and benefit units” page on poverty.org explains that whereas
a “household” is a group of people living at the same address who share
common housekeeping or a living room, a benefit unit is an adult
plus their spouse if they have one, plus any dependent children
they are living with. So mum and dad plus 10-year-old
Johnnie would be one benefit unit. But if Johnnie is over 18,
he becomes an adult who just happens to live with his parents, and
the household has two benefit units.
Some of our data and results are per
household, but others have to be per benefit unit.
In this post, I’ll explain how, given some households, each
of which has a field saying how many benefit units it
contains, I generated random benefit units to go with them. This is
more general than it sounds. Given one kind of data, and
another kind that it “contains”, how do you generate instances
of the second kind? My answer involves
R’s `mapply` function.

Because benefit units won’t be familiar to most readers, I’ll talk about children instead. Let’s start by making some sample households:

    library( tidyverse )

    households <- tribble(
      ~id, ~num_kids,
      'A', 2,
      'B', 3,
      'C', 1,
      'D', 2
    )

This creates four households. The only fields each has are an ID and a number-of-children field. So the first household has ID `'A'` and two children; the second has ID `'B'` and three children; and so on. In our data, household IDs are numeric. But I’m using non-numeric strings here because it stops me accidentally mixing them up with the `num_kids` values.

Next, I assign the number of households to a variable, because I’m going to use it more than once later on.

    num_households <- nrow( households )

Now I want to think about the information I’ll generate for each child record. There has to be a household ID, so that I can link the children table and the households table. I’m also going to give each child a sequence number within its household. And, for this example, one other piece of data: how much pocket money the child gets a week.

So my children table will have three columns. I’ll now generate the first of these:

    kid_ids <- mapply( rep, households$id, households$num_kids )
    kid_ids_as_vec <- unlist( kid_ids )

The first statement uses `mapply` to combine the two vectors given as its final two arguments. These are `'A', 'B', 'C', 'D'` and `2,3,1,2` respectively, from `households`’s columns. In effect, `mapply` zips them together by applying `rep` to the first element of each, the second element of each, and so on. I’ve drawn this below.

I drew the IDs as colours, because their value doesn’t matter, and it makes the graphic easier to read. What it does show clearly is that the output from `mapply` is not flat. It’s a list of (vectors of IDs). This is because each call of `rep` generates a vector, and each of these becomes one element of `mapply`’s result. Before I can use the result as a column of a data frame, I have to flatten it, which is what the call to `unlist` does.
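
To make the list-then-flatten step concrete, here is a small base-R sketch. The vectors `ids` and `counts` are stand-ins I've assumed for the two `households` columns, and the `unname` call and the `SIMPLIFY = FALSE` variant are additions of mine, not part of the code above.

```r
# Assumed stand-ins for households$id and households$num_kids.
ids    <- c('A', 'B', 'C', 'D')
counts <- c(2, 3, 1, 2)

# mapply calls rep('A', 2), rep('B', 3), and so on: one call per pair.
kid_ids <- mapply(rep, ids, counts)
str(kid_ids)  # a list holding one character vector per household

# unlist flattens the list into a single vector. unname drops the
# element names that mapply derives from the character IDs.
kid_ids_as_vec <- unname(unlist(kid_ids))
print(kid_ids_as_vec)

# Caveat: when every call returns a vector of the same length,
# mapply simplifies its result to a matrix instead of a list.
same_counts <- mapply(rep, ids, c(2, 2, 2, 2))

# SIMPLIFY = FALSE asks for a list regardless of the lengths.
always_list <- mapply(rep, ids, c(2, 2, 2, 2), SIMPLIFY = FALSE)
```

The caveat only matters if all households happen to have the same number of children, but with randomly generated test data that can occur, so asking for a list explicitly may be the safer habit.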

So that’s the first column of my children data frame. It contains the household IDs, but repeated as many times as each household has children. Now I have to make the second column.

    kid_nums <- mapply( function(count){1:count}, households$num_kids )
    kid_nums_as_vec <- unlist( kid_nums )

This works in much the same way as the previous `mapply`, but operates on only one vector. It applies a function which takes a count as argument and returns a vector of integers running from 1 to that count, made using the built-in `:` operator. Applied to 2, the first element of `households$num_kids`, this function returns the vector `1 2`. Applied to 3, the second element, it returns `1 2 3`. As before, `mapply` returns a list of vectors, and also as before, I have to flatten it using `unlist`.
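
One edge case is worth a quick sketch: a household with no children. The example households here all have at least one child, so the zero-count input below is purely my own assumption, as is the suggestion to use `seq_len` or base R's `sequence` instead of `:`.

```r
counts <- c(2, 3, 1, 2)

# The approach from the post: 1:count for each household.
kid_nums <- unlist(mapply(function(count) { 1:count }, counts))
print(kid_nums)

# Edge case: the : operator counts downwards when the end is below the
# start, so a household with zero children would get two rows, not none.
bad  <- 1:0          # the vector 1 0
good <- seq_len(0)   # an empty integer vector

# A variant using seq_len, which copes with a zero count.
safe_nums <- unlist(lapply(c(2, 0, 1), seq_len))

# Base R's sequence() builds the per-household numbering in one call.
alt <- sequence(counts)
```

If zero counts can never occur in your data, the `1:count` version is fine as it stands.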
So I now have the first two columns of my children data. I’ll now make the third. This is how much each child gets in weekly pocket money. In real life, I’d have something else: my choice here just demonstrates one way of generating random data, something useful when testing plotting and aggregation functions, for example.

    pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )

This generates a vector of as many values as we have rows (given by `length(kid_ids_as_vec)`), normally distributed about 20, with a standard deviation of 10.
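
Since `rnorm` takes its arguments in the order count, mean, standard deviation, naming them can make a call easier to check at a glance. The sketch below does that, and also fixes the random seed so that repeated runs give the same numbers. The seed value, the sample size and the clamping step are all assumptions of mine for illustration.

```r
# An arbitrary seed, assumed here only to make runs repeatable.
set.seed(42)

# A large sample, so the sample statistics sit close to the parameters.
n <- 10000
pocket_monies <- rnorm(n, mean = 20, sd = 10)

print(mean(pocket_monies))  # close to 20
print(sd(pocket_monies))    # close to 10

# Normal draws can be negative. If negative pocket money would upset
# later code, one option is to clamp the values at zero.
pocket_monies_nonneg <- pmax(pocket_monies, 0)
```

Clamping changes the shape of the distribution slightly, which is usually harmless for test data.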
And now I make these three columns into a data frame.

    # tibble() replaces data_frame(), which is deprecated in the tibble package
    kids <- tibble(
      id = kid_ids_as_vec,
      kid_num = kid_nums_as_vec,
      pocket_money = pocket_monies
    )
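
Because the whole point of the `id` column is to link children back to households, a quick consistency check can be useful. The following is a minimal sketch in base R, using `data.frame`, `rep` with `times`, and `sequence` rather than the tidyverse calls above, so it runs without extra packages. The check itself is my own addition.

```r
# Sample households, mirroring the table in this post.
households <- data.frame(
  id = c('A', 'B', 'C', 'D'),
  num_kids = c(2, 3, 1, 2),
  stringsAsFactors = FALSE
)

# Build the two linking columns with vectorised base-R equivalents.
kids <- data.frame(
  id = rep(households$id, times = households$num_kids),
  kid_num = sequence(households$num_kids),
  stringsAsFactors = FALSE
)

# Count the generated children per household ID ...
kids_per_household <- table(kids$id)

# ... and compare with the num_kids each household claims to have.
matches <- as.vector(kids_per_household[households$id]) == households$num_kids
stopifnot(all(matches))

# The shared id column also lets the two tables be joined.
joined <- merge(kids, households, by = 'id')
print(joined)
```

This check assumes every household has at least one child. An ID with no children would not appear in the `table` output at all.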

To finish, here’s the complete code.

    # random_kids.R
    #
    # Written for my blog. Illustrates
    # the techniques that I used for
    # generating random benefit units.

    library( tidyverse )

    random_kids <- function( households )
    {
      kid_ids <- mapply( rep, households$id, households$num_kids )
      kid_ids_as_vec <- unlist( kid_ids )

      kid_nums <- mapply( function(count){1:count}, households$num_kids )
      kid_nums_as_vec <- unlist( kid_nums )

      pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )

      # tibble() replaces data_frame(), which is deprecated
      kids <- tibble(
        id = kid_ids_as_vec,
        kid_num = kid_nums_as_vec,
        pocket_money = pocket_monies
      )

      kids
    }

    households <- tribble(
      ~id, ~num_kids,
      'A', 2,
      'B', 3,
      'C', 1,
      'D', 2
    )

    kids <- random_kids( households )

And this is what `kids` looks like after the above call. You’ll see that I made my code into a function. I do that with everything, because I never know when I’ll want to reuse it. When testing, for instance, I often want to call the same code more than once.

    # A tibble: 8 x 3
      id    kid_num pocket_money
      <chr>   <int>        <dbl>
    1 A           1     17.12020
    2 A           2     13.40505
    3 B           1     25.81609
    4 B           2     20.62115
    5 B           3     31.25019
    6 C           1     19.93400
    7 D           1     30.98877
    8 D           2     31.05521
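
As an example of the kind of aggregation such generated data can be used to test, here is a sketch that totals pocket money per household. It rebuilds the printed values as a base-R data frame so that it runs on its own. The choice of `aggregate` with `sum` is just one option, assumed here for illustration.

```r
# Values copied from the printed output in this post.
kids <- data.frame(
  id = c('A', 'A', 'B', 'B', 'B', 'C', 'D', 'D'),
  kid_num = c(1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L),
  pocket_money = c(17.12020, 13.40505, 25.81609, 20.62115,
                   31.25019, 19.93400, 30.98877, 31.05521),
  stringsAsFactors = FALSE
)

# Total weekly pocket money per household: one row per id.
totals <- aggregate(pocket_money ~ id, data = kids, FUN = sum)
print(totals)
```

The result has one row per household ID, which makes it easy to join back onto the households table.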