The data for our economic model comes from records representing
the income and expenditure of British households. However, the
structure isn’t as simple as just one row per household. This is
because it’s necessary to split households into “benefit units”: the word “benefit”
here referring to the money the State gives
you when you’re ill, out of work, or whatever. The
“Households, families
and benefit units” page on poverty.org explains that whereas
a “household” is a group of people living at the same address who share
common housekeeping or a living room, a benefit unit is an adult
plus their spouse if they have one, plus any dependent children
they are living with. So mum and dad plus 10-year-old
Johnnie would be one benefit unit. But if Johnnie is over 18,
he becomes an adult who just happens to live with his parents, and
the household has two benefit units.
Some of our data and results are per
household, but others have to be per benefit unit.
In this post, I’ll explain how, given some households, each
of which has a field saying how many benefit units it
contains, I generated random benefit units to go with them. This is
more general than it sounds. Given one kind of data, and
another kind that it “contains”, how do you generate instances
of the second kind? My answer involves
the function `mapply`.

Because benefit units won’t be familiar to most readers,
I’ll talk about children instead.
Let’s start by making some sample households:

library( tidyverse )
households <- tribble( ~id, ~num_kids,
'A', 2,
'B', 3,
'C', 1,
'D', 2
)

This creates four households. The only fields each has
are an ID and a number-of-children field.
So the first household has ID `'A'` and two children;
the second has ID `'B'` and three children; and so on. In our data,
household IDs are numeric. But I’m using
non-numeric strings here because it avoids
me accidentally mixing them up with the `num_kids` values.

Next, I assign the number of households to a
variable, because I’m going to use it more
than once later on.

num_households <- nrow( households )

Now I want to think about the information I’ll
generate for each child record. There has to
be a household ID, so that I can link the
children table and the households table. I’m also
going to give each child a sequence number within
its household. And, for this example, one other
piece of data: how much pocket money the child gets
a week.

So my children table will have three
columns. I’ll now generate the first of these:

kid_ids <- mapply( rep,
households$id,
households$num_kids
)
kid_ids_as_vec <- unlist( kid_ids )

The first statement uses `mapply` to combine the two vectors
given as its final two arguments. These are
`'A', 'B', 'C', 'D'` and `2, 3, 1, 2`
respectively, from the columns of `households`.
In effect, `mapply` zips them
together by applying `rep` to
the first element of each, the second element
of each, and so on. I’ve drawn this below.

I drew the IDs as colours, because their
value doesn’t matter, and it
makes the graphic easier to read. What it does
show clearly is that the output from `mapply` is not flat. It’s a
list of vectors of IDs. This is
because each call of `rep` generates a vector,
and each of these becomes one element of
the result of `mapply`. Before
I can use the result as a column of
a data frame, I have to flatten it,
which is what the call to `unlist` does.
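To see that structure without the drawing, the sketch below rebuilds the same call on the sample data, using plain vectors so that it runs without the tidyverse, and inspects the result. The `SIMPLIFY = FALSE` argument is an addition here, not part of the code above: by default `mapply` tries to simplify its result, so if every household had the same number of children it would hand back a matrix rather than a list. The call to `unname` is likewise optional; it just drops the names that `unlist` builds from the household IDs.

```r
# Sample data as plain vectors (same values as the households table).
ids      <- c('A', 'B', 'C', 'D')
num_kids <- c(2, 3, 1, 2)

# SIMPLIFY = FALSE guarantees a list, even if all the counts were equal.
kid_ids <- mapply(rep, ids, num_kids, SIMPLIFY = FALSE)

str(kid_ids)       # a list of four character vectors, one per household
lengths(kid_ids)   # how many IDs each element holds

# Flatten the list into one vector and drop the generated names.
kid_ids_as_vec <- unname(unlist(kid_ids))
kid_ids_as_vec
```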

So that’s the first column of my children
data frame. It contains the household IDs,
but repeated as many times as each household
has children. Now I have to make the second
column.

kid_nums <- mapply( function(count){1:count},
households$num_kids
)
kid_nums_as_vec <- unlist( kid_nums )

This works in much the same way as the
previous `mapply`, but operates
on only one vector. It applies a function which
takes a count as its argument and uses the built-in `:` operator
to return a vector of integers running from 1 to that count. Applied
to 2, the first element of `households$num_kids`,
this function returns the vector `1 2`.
Applied to 3, the second element, it
returns `1 2 3`. As before, `mapply`
returns a list of vectors, and also as before,
I have to flatten it using `unlist`.
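Both columns can also be built without `mapply`. This is an alternative sketch, not the code used in this post: base R’s `rep` accepts a vector of counts through its `times` argument, and `sequence` concatenates the runs from 1 to n for each n it is given. Both return flat vectors, so no `unlist` is needed. A household with zero children also contributes nothing, whereas `1:0` would yield `1 0`.

```r
# Sample data as plain vectors (same values as the households table).
ids      <- c('A', 'B', 'C', 'D')
num_kids <- c(2, 3, 1, 2)

# Repeat each ID as many times as its household has children.
kid_ids_flat <- rep(ids, times = num_kids)

# Number the children 1..n within each household.
kid_nums_flat <- sequence(num_kids)

kid_ids_flat
kid_nums_flat
```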

So I now have the first two columns of my
children data. I’ll now make the third. This
is how much each child gets in weekly pocket money.
In real life, I’d have something else: my choice here just demonstrates
one way of generating random data, something useful
when testing plotting and aggregation functions,
for example.

pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )

This generates a vector of as many values as we have rows
(given by `length(kid_ids_as_vec)`),
normally distributed
about 20, with a standard deviation of 10.
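Because the mean and standard deviation arguments of `rnorm` are easy to transpose, it can help to name them. The sketch below does that, and also calls `set.seed` so that the same values come back on every run. The seed value 42 and the count of 8 are arbitrary choices for illustration, and the clamping step is an optional extra for cases where a negative amount of pocket money would be a problem.

```r
set.seed(42)   # arbitrary seed, only there to make the run repeatable
n_kids <- 8    # stands in for length(kid_ids_as_vec)

# Named arguments make it explicit which number is which.
pocket_monies <- rnorm(n_kids, mean = 20, sd = 10)

# A normal distribution can produce negative values; clamp them at zero.
pocket_monies <- pmax(pocket_monies, 0)
pocket_monies
```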

And now I make these three columns into a
data frame.

kids <- tibble( id=kid_ids_as_vec,
kid_num=kid_nums_as_vec,
pocket_money=pocket_monies
)

To finish, here’s the complete code.

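# random_kids.R
#
# Written for my blog. Illustrates
# the techniques that I used for
# generating random benefit units.

library( tidyverse )

# Given a table of households with columns id and num_kids,
# returns a table with one row per child.
random_kids <- function( households )
{
  # One vector of repeated household IDs per household, then flattened.
  kid_ids <- mapply( rep,
                     households$id,
                     households$num_kids
                   )
  kid_ids_as_vec <- unlist( kid_ids )

  # Sequence numbers 1..num_kids within each household, then flattened.
  kid_nums <- mapply( function(count){1:count},
                      households$num_kids
                    )
  kid_nums_as_vec <- unlist( kid_nums )

  # Random weekly pocket money: mean 20, standard deviation 10.
  pocket_monies <- rnorm( length(kid_ids_as_vec), 20, 10 )

  # tibble() replaces the deprecated data_frame().
  kids <- tibble( id=kid_ids_as_vec,
                  kid_num=kid_nums_as_vec,
                  pocket_money=pocket_monies
                )
  kids
}

households <- tribble( ~id, ~num_kids,
                       'A', 2,
                       'B', 3,
                       'C', 1,
                       'D', 2
                     )
kids <- random_kids( households )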

You’ll see that I made my code into a function. I
do that with everything, because I never know when
I’ll want to reuse it. When testing, for
instance, I often want to call the same
code more than once.

And this is what the above call returned when I printed
`kids`. The pocket money is random, so your numbers will differ.

# A tibble: 8 x 3
  id    kid_num pocket_money
  <chr>   <int>        <dbl>
1 A           1     17.12020
2 A           2     13.40505
3 B           1     25.81609
4 B           2     20.62115
5 B           3     31.25019
6 C           1     19.93400
7 D           1     30.98877
8 D           2     31.05521
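A quick way to check a result like this is to count the generated rows per household and compare the counts with `num_kids`. The sketch below does so in base R on a hand-built copy of the first two columns. The names `households_check` and `kids_check` are made up here, to avoid clobbering the variables above.

```r
# Hand-built copies of the two tables (hypothetical names, sample values).
households_check <- data.frame( id = c('A', 'B', 'C', 'D'),
                                num_kids = c(2, 3, 1, 2),
                                stringsAsFactors = FALSE )
kids_check <- data.frame( id = c('A', 'A', 'B', 'B', 'B', 'C', 'D', 'D'),
                          kid_num = c(1L, 2L, 1L, 2L, 3L, 1L, 1L, 2L),
                          stringsAsFactors = FALSE )

# Count child rows per household ID, in the households' own order.
# Using a factor keeps households that have no child rows, with a count of 0.
counts <- table( factor( kids_check$id, levels = households_check$id ) )

# TRUE when every household has exactly as many child rows as num_kids says.
counts_match <- all( as.vector(counts) == households_check$num_kids )
counts_match
```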