Article 4WH67 Random sample overlap

Random sample overlap

by
John
from John D. Cook on (#4WH67)

Here's a simple probability problem with an answer you may find surprising. (The statement of the problem and solution are simple. The proof is not as simple.)

Suppose you draw 1,000 serial numbers at random from a set of 1,000,000. Then you make another random sample of 1,000. How likely is it that no numbers will be the same on both lists?

To make the problem slightly more general, take two samples of size n from a population of size n^2 where n is large.

The probability of no overlap in the two samples is approximately 1/e. That is, in the limit as n approaches infinity, the probability converges to 1/e.

For n = 1,000 the approximation is good to three figures.

Proof

There are Binomial(n^2, n) ways to choose a sample of size n from a population of size n^2. After we've drawn the first sample, there are Binomial(n^2 - n, n) ways to draw a new sample that doesn't overlap with the first one. The probability of such a sample is

sample_proof1.svg

The fractions on the last line are in decreasing order, and so their product is less than the first one raised to the nth power, and greater than the last one raised to the nth power.

By taking logs one can show that

sample_proof2.svg

Since upper and lower bounds on the probability converge to 1/e, the probability converges to 1/e.

Comments

This problem is reminiscent of the birthday problem, but it's a little different because the samples are draw without replacement.

Incidentally, I didn't make up this problem out of thin air. It came up naturally in the course of a consulting project this week.

iYNMSRL3X6I
External Content
Source RSS or Atom Feed
Feed Location http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title John D. Cook
Feed Link https://www.johndcook.com/blog
Reply 0 comments