In my previous post, I assumed my household data would give me the number of children each household has. But suppose I had to generate those numbers too? This is just a note to say that one can do this using the base-R function `sample.int`.

If I understand its documentation correctly, then the call

    sample.int( 4, size=100, replace=TRUE, prob=c(0.1,0.4,0.2,0.1) )

will give me a vector of 100 elements. Each element is an integer between 1 and 4: that’s what the first argument determines. And the probabilities of their occurrence are given by the `prob` argument.
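
One detail worth spelling out: the four weights in that call sum to 0.8, not 1. The documentation for `sample` describes `prob` as a vector of weights that need not sum to one, so the proportions to expect are the weights divided by their sum. A minimal sketch of that arithmetic follows; the names `w` and `p` are only illustrative and do not appear in the original call.

```r
w <- c(0.1, 0.4, 0.2, 0.1)  # the weights passed as prob in the call above
sum(w)                      # about 0.8, so not probabilities as they stand

p <- w / sum(w)             # expected proportions once the weights are rescaled
p                           # 0.125 0.500 0.250 0.125
p / p[1]                    # the same proportions relative to outcome 1: 1 4 2 1
```

So the outcomes 1 to 4 should turn up in the ratio 1:4:2:1, which is why dividing a frequency table by its first entry is a convenient check.
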
The call seems to work. Let me generate such a vector (but much bigger, to reduce sampling error) and tabulate the frequencies of its elements using `table`:

    x <- sample.int( 4, size=1000000, replace=TRUE, prob=c(0.1,0.4,0.2,0.1) )
    t <- table(x)
    t/t[1]   # frequencies relative to the count of 1s

Then my first runs give me:

            1         2         3         4
    1.0000000 3.9331806 1.9725140 0.9925756
    1.0000000 3.9757329 1.9855526 0.9899258
    1.0000000 3.9984735 2.0017902 0.9916804
    1.0000000 3.9942205 1.9904766 0.9979963
    1.0000000 3.9952263 2.0040621 0.9968735

Are these close enough? The relative frequencies for 2 and 4 (the second and fourth columns) are always slightly below the 4 and 1 I’d expect. So I may be missing some subtlety. On the other hand, it’s good enough for the testing I’m doing, as this mainly has to check that my joins and other data-handling operations are correct.
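
One way to look at this more directly is to compare the observed proportions with the rescaled weights, rather than with ratios that all share the (itself noisy) count of 1s as denominator. The sketch below does that and then runs a chi-squared goodness-of-fit test with base R's `chisq.test`. The seed and the names `w`, `p`, `counts` and `fit` are assumptions for illustration, and a single run like this cannot prove there is no subtlety, only whether the sample looks consistent with the expected proportions.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

w <- c(0.1, 0.4, 0.2, 0.1)  # the weights used in the post
p <- w / sum(w)             # expected proportions after rescaling

x <- sample.int(4, size = 1000000, replace = TRUE, prob = w)

# tabulate() counts how often each of 1..nbins occurs, keeping zero counts
counts <- tabulate(x, nbins = 4)

# Observed proportions printed next to the expected ones
print(rbind(observed = counts / length(x), expected = p))

# Goodness-of-fit test of the counts against the expected proportions
fit <- chisq.test(counts, p = p)
print(fit$p.value)
```

A p-value that is not small would suggest the deviations are the kind of sampling noise one should expect at this sample size; a consistently tiny one across many seeds would be a reason to look harder.
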