Random sample overlap
Here's a simple probability problem with an answer you may find surprising. (The statement of the problem and solution are simple. The proof is not as simple.)
Suppose you draw 1,000 serial numbers at random from a set of 1,000,000. Then you make another random sample of 1,000. How likely is it that no numbers will be the same on both lists?
To make the problem slightly more general, take two samples of size n from a population of size n^2 where n is large.
The probability of no overlap in the two samples is approximately 1/e. That is, in the limit as n approaches infinity, the probability converges to 1/e.
For n = 1,000 the approximation is good to three figures.
ProofThere are Binomial(n^2, n) ways to choose a sample of size n from a population of size n^2. After we've drawn the first sample, there are Binomial(n^2 - n, n) ways to draw a new sample that doesn't overlap with the first one. The probability of such a sample is
The fractions on the last line are in decreasing order, and so their product is less than the first one raised to the nth power, and greater than the last one raised to the nth power.
By taking logs one can show that
Since upper and lower bounds on the probability converge to 1/e, the probability converges to 1/e.
CommentsThis problem is reminiscent of the birthday problem, but it's a little different because the samples are draw without replacement.
Incidentally, I didn't make up this problem out of thin air. It came up naturally in the course of a consulting project this week.