Random Benefit Units for Households II: Generating the Number of Subrows

In my previous post, I assumed my household data would give me the number of children each household has. But suppose I had to generate those numbers too? This is just a note to say that one can do this using the base-R function sample.int.

If I understand its documentation correctly, then the call

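sample.int( 4, size=100, replace=TRUE, prob=c(0.1,0.4,0.2,0.1) )

will give me a vector of 100 elements. Each element is an integer between 1 and 4: that’s what the first argument determines. The prob argument gives the weights with which they occur. These weights need not sum to 1 (here they sum to 0.8), so what matters is their relative size: the values 1 to 4 should turn up in the ratio 1 : 4 : 2 : 1.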
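To see what those weights imply, here is a minimal sketch that rescales them to sum to 1. The names w and p are my own, introduced only for illustration. It relies on the documented behaviour that prob is a vector of weights that need not sum to one.

```r
# Weights from the call above; they sum to 0.8, not 1.
w <- c(0.1, 0.4, 0.2, 0.1)

# Rescaling the weights to sum to 1 gives the implied probabilities.
p <- w / sum(w)
print(p)

# The same probabilities relative to the first one.
print(p / p[1])
```

So the implied probabilities are 0.125, 0.5, 0.25 and 0.125 rather than the raw weights.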

This seems to work. Let me generate such a vector (but much bigger to reduce sampling error) and tabulate the frequencies of its elements using table:

x <- sample.int( 4, size=1000000, replace=TRUE, prob=c(0.1,0.4,0.2,0.1) )
t <- table(x)
t / t[[1]]  # each count relative to the count of 1s

Expressed as ratios to the count of 1s, my first five runs give me:

        1         2         3         4 
1.0000000 3.9331806 1.9725140 0.9925756 
1.0000000 3.9757329 1.9855526 0.9899258 
1.0000000 3.9984735 2.0017902 0.9916804 
1.0000000 3.9942205 1.9904766 0.9979963 
1.0000000 3.9952263 2.0040621 0.9968735 
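A repeatable version of this check might look like the sketch below. The seed is arbitrary, and the names w, counts, observed and expected are my own. It uses tabulate rather than table only because tabulate returns a plain integer vector that is easy to do arithmetic on.

```r
set.seed(1)  # arbitrary seed, only so the run is repeatable

w <- c(0.1, 0.4, 0.2, 0.1)
x <- sample.int(4, size = 1000000, replace = TRUE, prob = w)

# tabulate() counts occurrences of 1..nbins and keeps zero counts.
counts <- tabulate(x, nbins = 4)

observed <- counts / sum(counts)  # observed proportions
expected <- w / sum(w)            # weights rescaled to sum to 1

print(round(rbind(observed, expected), 4))
print(counts / counts[1])         # ratios relative to the count of 1s
```

Comparing proportions rather than ratios has one advantage: the ratios in a row all share the same denominator, so a chance excess or shortfall of 1s moves every ratio in that row in the same direction.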

Are these close enough? The ratios in the second and fourth columns are always slightly below the 4 and 1 I’d expect. So I may be missing some subtlety. On the other hand, it’s good enough for the testing I’m doing, which mainly has to check that my joins and other data-handling operations are correct.
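One way to put a number on how close is close enough might be a chi-squared goodness-of-fit test using base R's chisq.test. This is only a sketch: the seed is arbitrary, the names w, counts and fit are my own, and I have not tried it against the runs shown above.

```r
set.seed(2)  # arbitrary seed, only so the run is repeatable

w <- c(0.1, 0.4, 0.2, 0.1)
x <- sample.int(4, size = 1000000, replace = TRUE, prob = w)
counts <- tabulate(x, nbins = 4)

# rescale.p = TRUE lets chisq.test() rescale the weights to sum to 1.
fit <- chisq.test(counts, p = w, rescale.p = TRUE)
print(fit)
```

A very small p-value would suggest the counts differ from the weights by more than sampling error; anything else is consistent with sample.int doing what its documentation says.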
