Fixing Random, part 15

by
ericlippert
from Fabulous adventures in coding

[Code is here.]

Last time on FAIC we made a correct, efficient implementation of SelectMany to bind a likelihood function and projection onto a prior, and gave a simple example. I deliberately chose "weird" numbers for all the weights; let's do that same example again but with more "nice round number" weights:

var prior = new List<Height>() { Tall, Medium, Short }
  .ToWeighted(60, 30, 10);
[…]
IDiscreteDistribution<Severity> likelihood(Height h)
{
  switch(h)
  {
    case Tall: return severity.ToWeighted(45, 55, 0);
    case Medium: return severity.ToWeighted(0, 70, 30);
    default: return severity.ToWeighted(0, 0, 1);
  }
}
[… projection as before …]
Console.WriteLine(prior
  .SelectMany(likelihood, projection)
  .ShowWeights());

This produces the output:

DoubleDose:270000
NormalDose:540000
HalfDose:190000

which is correct, but notice how multiplying the weights during the SelectMany made for some unnecessarily large weights. If we then did another SelectMany on this thing, they'd get even larger, and we'd be getting into integer overflow territory.
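To see where those big numbers come from (this is my reading of last time's bookkeeping, sketched as plain arithmetic rather than code from the repository): each term is a prior weight times a likelihood weight, scaled so that the three likelihoods, whose total weights are 100, 100 and 1, all sit over the common total 100 * 100 * 1 = 10000:

int doubleDose = 60 * 45 * (10000 / 100);                           // 270000
int normalDose = 60 * 55 * (10000 / 100) + 30 * 70 * (10000 / 100); // 540000
int halfDose = 30 * 30 * (10000 / 100) + 10 * 1 * (10000 / 1);      // 190000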

Integer overflow is always possible in the system I've developed so far in this series, and I am deliberately glossing over this serious problem. A better implementation would either use doubles for weights, which have a much larger range, or arbitrary-precision integers, or arbitrary-precision rationals. I'm using integers to keep it simple, but as with many aspects of the code in this series, that would become problematic in a realistic implementation.

One thing we can do to tame this slightly is to reduce all the weights when possible; plainly in this case we could divide each of them by 10000 and have exactly the same distribution, so let's do that. And just to make sure, I'm going to mitigate the problem in multiple places:

  • In SelectMany we could be taking the least common multiple (LCM) instead of the full product of the weights.
  • In the WeightedInteger factory we could be dividing out all the weights by their greatest common divisor (GCD).

Long-time readers of my blog may recall that I've implemented Euclid's Algorithm before, but this time I'm going to make a much simpler implementation:

public static int GCD(int a, int b) =>
  b == 0 ? a : GCD(b, a % b);

We define the GCD of two non-negative integers a and b as:

  • if both zero, then zero
  • otherwise, if exactly one is zero, then the non-zero one
  • otherwise, the largest integer that divides both.

Exercise: Prove that this recursive implementation meets the above contract.
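A few quick spot checks of that contract (sanity checks only, not the proof the exercise asks for):

Console.WriteLine(GCD(0, 0));   // 0: both zero
Console.WriteLine(GCD(12, 0));  // 12: exactly one is zero
Console.WriteLine(GCD(0, 12));  // 12
Console.WriteLine(GCD(12, 18)); // 6: largest integer dividing both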

The problem we face, though, is that we have many weights and we wish to find the GCD of all of them. Fortunately, we can simply do an aggregation:

public static int GCD(this IEnumerable<int> numbers) =>
  numbers.Aggregate(GCD);
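For instance, the three oversized weights we computed above share a large common factor; this is my own example call, not code from the repository:

Console.WriteLine(new[] { 270000, 540000, 190000 }.GCD()); // 10000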

Similarly we can compute the LCM if we know the GCD:

public static int LCM(int a, int b) =>
  a * b / GCD(a, b);
public static int LCM(this IEnumerable<int> numbers) =>
  numbers.Aggregate(1, LCM);
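In our example the total weights of the three likelihoods are 100, 100 and 1, so the LCM is just 100; again, my own check rather than repository code:

Console.WriteLine(new[] { 100, 100, 1 }.LCM()); // 100

Note that the a * b in LCM can itself overflow before the division; writing it as a / GCD(a, b) * b would postpone that a little, but since we are glossing over overflow anyway, the simple form will do.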

And now we can modify our WeightedInteger factory:

public static IDiscreteDistribution<int> Distribution(
  IEnumerable<int> weights)
{
  List<int> w = weights.ToList();
  int gcd = weights.GCD();
  for (int i = 0; i < w.Count; i += 1)
    w[i] /= gcd;
  […]

And our SelectMany:

int lcm = prior.Support()
  .Select(a => likelihood(a).TotalWeight())
  .LCM();
[… and then use the lcm in the query …]

See the code repository for all the details. If we apply all these changes then our results look much better:

DoubleDose:27
NormalDose:54
HalfDose:19

" and we are at least a little less likely to get into an integer overflow situation.

Aside: Of course we can do the same thing to the Bernoulli class, and normalize its weights as well.

Next time on FAIC: We can use the gear we've created so far to solve problems in Bayesian inference; we'll see how.
