Random sample overlap

John

from John D. Cook on 2019-12-13 13:26 (#4WH67)

Here's a simple probability problem with an answer you may find surprising. (The statement of the problem and solution are simple. The proof is not as simple.)

Suppose you draw 1,000 serial numbers at random from a set of 1,000,000. Then you make another random sample of 1,000. How likely is it that no numbers will be the same on both lists?

To make the problem slightly more general, take two samples of size n from a population of size n^2 where n is large.

The probability of no overlap in the two samples is approximately 1/e. That is, in the limit as n approaches infinity, the probability converges to 1/e.

For n = 1,000 the approximation is good to three figures.

Proof

There are Binomial(n^2, n) ways to choose a sample of size n from a population of size n^2. After we've drawn the first sample, there are Binomial(n^2 - n, n) ways to draw a new sample that doesn't overlap with the first one. The probability of such a sample is

The fractions on the last line are in decreasing order, and so their product is less than the first one raised to the nth power, and greater than the last one raised to the nth power.

By taking logs one can show that

Since upper and lower bounds on the probability converge to 1/e, the probability converges to 1/e.

Comments

This problem is reminiscent of the birthday problem, but it's a little different because the samples are draw without replacement.

Incidentally, I didn't make up this problem out of thin air. It came up naturally in the course of a consulting project this week.

Source	RSS or Atom Feed
Feed Location	http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title	John D. Cook
Feed Link	https://www.johndcook.com/blog