
Fixing Random, part 2

by ericlippert, from Fabulous adventures in coding

[Code is here.]

Last time on FAIC I complained bitterly about the shortcomings of System.Random. In this episode, I'll take a stab at the sort of thing I'd like to see in a redesigned random library. This is the "fix the trivial problems" episode, and then we'll go deeper from there.

The first thing I want is two static methods, one that gives me an actually-random integer from 0 to Int32.MaxValue, and one that gives me a random double from 0.0 (inclusive) to 1.0 (exclusive). Not pseudo-random, but actually indistinguishable from a true random uniform distribution. Here's an attempt:

using System;
using System.Threading;
using CRNG = System.Security.Cryptography.RandomNumberGenerator;

public static class BetterRandom
{
  // One crypto-strength generator per thread, so no locking is needed.
  private static readonly ThreadLocal<CRNG> crng =
    new ThreadLocal<CRNG>(CRNG.Create);
  // A reusable four-byte buffer per thread, to avoid allocating on every call.
  private static readonly ThreadLocal<byte[]> bytes =
    new ThreadLocal<byte[]>(() => new byte[sizeof(int)]);
  public static int NextInt()
  {
    crng.Value.GetBytes(bytes.Value);
    // Clear the sign bit so the result is in [0, int.MaxValue].
    return BitConverter.ToInt32(bytes.Value, 0) & int.MaxValue;
  }
  public static double NextDouble()
  {
    while (true)
    {
      // Build a random 52-bit integer: 21 high bits, then 31 low bits.
      long x = NextInt() & 0x001FFFFF;
      x <<= 31;
      x |= (long)NextInt();
      double n = x;
      const double d = 1L << 52;
      double q = n / d;
      // Defensively reject 1.0 so the result is always in [0.0, 1.0).
      if (q != 1.0)
        return q;
    }
  }
}

It's pretty straightforward, but let's take a quick look. First, all the state is thread-local, so we trade a small amount of per-thread overhead and per-call indirection for thread safety, which is probably a good tradeoff. We're going to be using the same four-byte buffer over and over again, so I figured it is probably worthwhile to cache it and avoid the hit to collection pressure; however this is just a guess and I would want to verify that with empirical tests.

There are other possible performance problems here; for example, is it worthwhile to generate a few thousand random bytes at once, cache them, and then gradually use them up? Or is the per-call cost of GetBytes sufficiently low that this is not an issue? I'd also want to empirically check that.
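To make that question concrete, here is a rough, untested sketch of what a buffered variant might look like. BufferSize, buffer, position, and BufferedNextInt are hypothetical members imagined to live inside BetterRandom alongside crng; only measurement would tell whether the extra bookkeeping actually beats calling GetBytes four bytes at a time.

  // Hypothetical buffered variant (untested sketch): fetch 4096 crypto-random bytes
  // at a time and hand them out four at a time, refilling the buffer when it runs dry.
  private const int BufferSize = 4096;
  private static readonly ThreadLocal<byte[]> buffer =
    new ThreadLocal<byte[]>(() => new byte[BufferSize]);
  private static readonly ThreadLocal<int> position =
    new ThreadLocal<int>(() => BufferSize); // start "empty" so the first call fills it
  public static int BufferedNextInt()
  {
    if (position.Value > BufferSize - sizeof(int))
    {
      crng.Value.GetBytes(buffer.Value); // refill the whole buffer in one call
      position.Value = 0;
    }
    int result = BitConverter.ToInt32(buffer.Value, position.Value) & int.MaxValue;
    position.Value += sizeof(int);
    return result;
  }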

I want only non-negative integers; clearing the top bit by and-ing with 0x7FFFFFFF (that is, int.MaxValue) does that nicely without changing the distribution.

For a random double, we know that doubles have 52 bits of precision, so we generate a random 52 bit integer and make that the numerator of a fraction. We want to guarantee that 1.0 is never a possible output, so we reject that possibility; one in every few billion calls we'll do some additional calls to NextInt. I can live with that.

Exercise: An alternative, and possibly better, solution would be to build the double from bits directly, rather than doing an expensive division. I have not actually written the code or done the math here; as an exercise, does doing so produce a better distribution in any way?
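For comparison, here is a minimal, unanalyzed sketch of the bit-construction idea; NextDoubleFromBits is a hypothetical addition to BetterRandom, not part of the class above. It fills the 52-bit mantissa with random bits, forces the exponent field to 0x3FF so the value lands in [1.0, 2.0), and then subtracts 1.0. Whether the resulting distribution is any better than the division version is exactly the exercise.

  // Hypothetical sketch: build the double directly from its bit pattern.
  public static double NextDoubleFromBits()
  {
    // 52 random mantissa bits: 21 high bits and 31 low bits, as in NextDouble.
    long mantissa = ((long)(NextInt() & 0x001FFFFF) << 31) | (long)NextInt();
    // Exponent field 0x3FF encodes 2^0, so the raw value is 1.mantissa in [1.0, 2.0).
    long bits = (0x3FFL << 52) | mantissa;
    return BitConverter.Int64BitsToDouble(bits) - 1.0;
  }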

Now that we have a crypto-strength threadsafe random number library, we can build a cheap pseudo-random library out of it:

public static class Pseudorandom
{
  // One pseudo-random generator per thread, seeded from the crypto RNG
  // rather than from the current time.
  private static readonly ThreadLocal<Random> prng =
    new ThreadLocal<Random>(() =>
      new Random(BetterRandom.NextInt()));
  public static int NextInt() => prng.Value.Next();
  public static double NextDouble() => prng.Value.NextDouble();
}

We've solved two of the main problems with Random: historically, new instances were seeded from the current time, so instances created in quick succession produced identical, highly correlated sequences; and it is not thread safe. We now have a single prng per thread, and it is seeded with a crypto-strength random seed, not the current time.
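As a quick illustration (my own example, not from the original post), concurrent callers can now use the static methods directly, with no locks and no shared Random instance:

using System;
using System.Threading.Tasks;

class PseudorandomDemo
{
  static void Main()
  {
    // Each task reads only its own thread-local generator, so no locking is needed.
    Parallel.For(0, 8, i =>
    {
      int n = Pseudorandom.NextInt();
      double d = Pseudorandom.NextDouble();
      Console.WriteLine($"task {i}: {n} {d:F6}");
    });
  }
}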

The problem of the candy-machine interface has gone away because I've eliminated that method entirely! We'll come back to this point in a later episode.

I think this is already a major improvement, in that 99% of the questions about misuses of Random on StackOverflow would not exist if this had been the received randomness library in the first place, and I am very happy to learn that, at long last, these problems have been fixed in the core implementation.

What about performance? The crypto RNG is a lot slower than the pseudo-RNG, and there is definitely overhead for the thread local infrastructure. But: who cares? Most programs would not be observably slowed down by these improvements, and for the ones that are slowed down, you go right ahead and make a cheap, thread-unsafe version if you need it for performance.

We should always make good tradeoffs; if the cost of eliminating a whole class of bugs is a tiny performance hit, we should make that tradeoff. The entire premise of managed languages is that we trade a small amount of performance for improved developer productivity. That's why we have things like a GC, immutable strings, checked array access, and so on; these all trade a small amount of performance for more safety, robustness and productivity. And that's why we have unsafe pointers; if you need to abandon some safety to get performance, C# should let you do that. But we should default to safety!

A commenter noted last time that recent implementations of System.Random solve this problem using some similar techniques; it's instructive to compare my four-line solution here to a solution actually engineered to be performant at scale. As always, remember that the code on my blog is for pedagogy and not intended to be an example of industrial-quality code. You'll notice, for instance, that they make only the rarely-used parts of the algorithm thread-safe. That seems like a reasonable tradeoff between performance and safety.

Already I think we're off to a good start, but we can do a lot better still. Like I said last time, these are the trivial problems. There are a lot more problems we could be solving. Two that come to mind:

  • The received PRNG is simply not very good in terms of its randomness; there are far better algorithms that produce harder-to-predict outputs with similar performance and a similar amount of state.
  • A common use for PRNGs is to generate a random-seeming but actually-deterministic sequence for the purposes of randomizing a game. But the received implementation has no way to say "save the current state of the PRNG to disk and restore it later", or any such thing.

I'm not going to address either of these problems in this series, though both are fascinating problems that deserve posts of their own. I'm going to take a different approach to improving how we deal with randomness; I want to look at the problem at a higher level of abstraction than the low-level details of how the random numbers are generated and what the state of the generator is. To really improve the state of the art of dealing with probability in C#, we'll need some more powerful types.

Next time on FAIC: I'll make some useful classes for common continuous probability distributions.
