Truncated distributions vs clipped distributions

John

from John D. Cook on 2020-04-21 16:42 (#52FCH)

In the previous post, I looked at truncated probability distributions. A truncated normal distribution, for example, lives on some interval and has a density proportional to a normal distribution; the proportionality constant is whatever it has to be to make the density integrate to 1.

Truncated distributions

Suppose you wanted to generate random samples from a normal distribution truncated to [a, b]. How would you go about it? You could generate a sample x from the (full) normal distribution, and if a x b, then return x. Otherwise, try again. Keep trying until you get a sample inside [a, b], and when you get one, return it [1].

This may be the most efficient way to generate samples from a truncated distribution. It's called the accept-reject method. It works well if the probability of a normal sample being in [a, b] is high. But if the interval is small, a lot of samples may be rejected before one is accepted, and there are more efficient algorithms. You might also use a different algorithm if you want consistent run time more than minimal average run time.

Clipped distributions

Now consider a different kind of sampling. As before, we generate a random sample x from the full normal distribution and if a x b we return it. But if x > b we return b. And if x < a we return a. That is, we clip x to be in [a, b]. I'll call this a clipped normal distribution. (I don't know whether this name is standard, or even if there is a standard name for this.)

Clipped distributions are not truncated distributions. They're really a mixture of three distributions. Start with a continuous random variable X on the real line, and let Y be X clipped to [a, b]. Then Y is a mixture of a continuous distribution on [a, b] and two point masses: Y takes on the value a with probability P(X < a) and Y takes on the value b with probability P(X > b).

Clipped distributions are mostly a nuisance. Maybe the values are clipped due to the resolution of some sensor: nobody wants to clip the values, that's just a necessary artifact. Or maybe values are clipped because some form only allows values within a given range; maybe the person designing the form didn't realize the true range of possible values.

It's common to clip extreme data values while deidentifying data to protect privacy. This means that not only is some data deliberately inaccurate, it also means that the distribution changes slightly since it's now a clipped distribution. If that's unacceptable, there are other ways to balance privacy and statistical utility.

[1] Couldn't this take forever? Theoretically, but with probability zero. The number of attempts follows a geometric distribution, and you could use that to find the expected number of attempts.

Source	RSS or Atom Feed
Feed Location	http://feeds.feedburner.com/TheEndeavour?format=xml
Feed Title	John D. Cook
Feed Link	https://www.johndcook.com/blog