Article 4ABG6 Fixing Random, part 10

Fixing Random, part 10

by
ericlippert
from Fabulous adventures in coding on (#4ABG6)

[Code is here.]

We're going to spend the next while in this series on expressing probabilities that have some sort of extra condition associated with them. We'll start with a simple example: I want to roll a fair six-sided die, but discard the threes. Why? Maybe I really dislike threes. Doesn't matter; we will want to be able to create new distributions by putting conditions on other distributions, so let's figure out how to do so.

Now, of course, we could create the desired distribution easily enough; as we know from last time on FAIC, it is

WeightedInteger.Distribution(0, 1, 1, 0, 1, 1, 1)

But suppose we have a distribution already in hand which we wish to apply a condition to, post hoc. I'm not going to labour the point; we can see how to do that easily enough. It's a variation on our "projected" distribution from a while back:

public sealed class Conditioned<T> :
IDiscreteDistribution<T>
{
private readonly List<T> support;
private readonly IDiscreteDistribution<T> underlying;
private readonly Func<T, bool> predicate;
public static IDiscreteDistribution<T> Distribution(
IDiscreteDistribution<T> underlying,
Func<T, bool> predicate)
{
var d = new Conditioned<T>(underlying, predicate);
if (d.support.Count == 0)
throw new ArgumentException();
if (d.support.Count == 1)
return Singleton<T>.Distribution(d.support[0]);
return d;
}
private Conditioned(
IDiscreteDistribution<T> underlying,
Func<T, bool> predicate)
{
this.underlying = underlying;
this.predicate = predicate;
this.support = underlying.Support()
.Where(predicate)
.ToList();
}
public T Sample()
{
while (true) // Ouch
{
T t = this.underlying.Sample();
if (this.predicate(t))
return t;
}
}
public IEnumerable<T> Support() =>
this.support.Select(x => x);
public int Weight(T t) =>
predicate(t) ? underlying.Weight(t) : 0;
}

Given our discussion a few episodes back about projected distributions and the fact that creating them has the same signature as "Select", I'm sure it will come as no surprise to you that I'm going to make a helper method:

public static IDiscreteDistribution<T> Where<T>(
this IDiscreteDistribution<T> d,
Func<T, bool> predicate) =>
Conditioned<T>.Distribution(d, predicate);

And you know what this means: I get to use comprehension syntax again!

var noThrees = from roll in SDU.Distribution(1, 6)
where roll != 3
select roll;
Console.WriteLine(noThrees.Histogram());

And we get as expected, no threes:

1|***************************************2|***************************************4|***************************************5|***************************************6|***************************************

However, there are some problems here. That possibly-long-running loop is deeply worrisome. In fact, dealing with the existence of that loop will be the major theme of the rest of this series, in one way or another. (This should feel familiar; of course this is just another version of the "rejection sampling" problem from a few episodes ago.)

We'll talk about that loop problem more in the next episode. For the remainder of this episode I want to examine an assumption I made in the code above; it's the same assumption that I made when we discussed projections. We just blew right past it, but this assumption introduces what might be characterized as a serious bug.

Consider the following code, which uses only sequences, no distributions:

int filterOut = 3;
Func<int, bool> predicate = x => x != filterOut;
var range = Enumerable.Range(1, 6).Where(predicate);
Console.WriteLine(range.CommaSeparated());
filterOut = 4;
Console.WriteLine(range.CommaSeparated());

Aside: I got sick of typing string.Join(" blah blah blah") so I made a handful of extension methods to be a little more fluent. See the source code for details.)

If you recall that sequences are computed lazily and lambdas are closed over variables, not values, then this output should be expected:

1,2,4,5,61,2,3,5,6

What if we do the same thing to distributions?

int filterOut = 3;
Func<int, bool> predicate = x => x != filterOut;
var d = SDU.Distribution(1, 6).Where(predicate);
Console.WriteLine(d.Samples().Take(10).CommaSeparated());
filterOut = 4;
Console.WriteLine(d.Samples().Take(10).CommaSeparated());

As we'd expect, we first get no threes, and then no fours:

1,1,4,6,6,5,4,1,2,66,5,6,5,1,5,6,3,2,1

So what's the problem?

Console.WriteLine(d.Support().CommaSeparated());

1,2,4,5,6

Uh oh. We just produced a 3 in a distribution that does not list 3 in its support!

Why? Because I computed the support in the constructor and cached it, but the support changed when the predicate changed its behaviour, and it is now out-of-date. The object is now lying to us.

Obviously I could fix this problem easily enough: do not compute the support in the constructor; compute it dynamically, on demand. Is that the right thing to do?

If we do, we lose the ability to reject "null" distributions - distributions with no support - and therefore it is possible to get into a situation where sampling hangs because the predicate is never true. (We could re-check the support before every sample, and throw if the support is empty, but that seems expensive.)

Furthermore, as we'll see in the next few episodes, we can do some pretty good optimizations if we can assume that the predicate does not change.

I'm therefore going to state right now that the predicate passed to Where (and the projection passed to Select) must be "pure" functions when they are intended to operate on probability distributions.

A "pure" function for our purposes has the following characteristics:

  • It must complete normally; in its regular operation it should not hang and it should not throw. It is permissible for a pure function to throw if its preconditions are violated; for example, a pure function that is documented as not taking negative numbers is permitted to throw an argument exception if passed a negative number. But a pure function should not throw or hang when given a normal, expected input.
  • It must not produce any side effect. For example, it must not write to the console or mutate a global variable or anything like that.
  • Its outputs must depend solely on its inputs; a pure method does not produce one result on its first call and a different result on its second call because something in the world changed; the only reason to produce a different result is if the arguments passed in are different.

Pure functions are nice to work with in two particular ways. First, you can reason about their correctness entirely "locally"; you do not have to consider the state of the world at the time the call is made, and you do not have to worry that the state of the world will change depending on how many times the call happens. Second, you can get performance wins by taking advantage of the fact that you know that two calls with the same arguments will produce the same result.

Unfortunately, in C# we have no good way of detecting a non-pure method at compile time and outlawing them as arguments to Where and Select; we will have to rely on the user being smart enough to not shoot themselves in the foot here.

Next time on FAIC: Now that we are requiring purity in Where and Select, what optimizations can we make to the discrete distributions we've created so far?

External Content
Source RSS or Atom Feed
Feed Location http://ericlippert.com/feed
Feed Title Fabulous adventures in coding
Feed Link https://ericlippert.com/
Reply 0 comments