Life, part 18
Code for this episode is here.
A couple of episodes back we did a first cut at implementing Stafford's algorithm using the triplet data structure to store the current and upcoming states of three cells and their living neighbour counts, all in 15 bits of a 16-bit short. Unsurprisingly, it was quite a bit slower than the same algorithm that did one byte per cell; any win from parallelism was swamped, I conjectured, by all the extra bit twiddling.
Is there a way to eliminate some of that bit twiddling?
Today we'll look at the inner loop of the first pass, which identifies cells that need to change by setting their next state bits:
```csharp
Triplet c = triplets[x, y];
Triplet t = c;
int lc = t.LeftCount;
int mc = t.MiddleCount;
int rc = t.RightCount;
t = t.SetLeftNext(lc == 3 | t.LeftCurrent & lc == 2);
t = t.SetMiddleNext(mc == 3 | t.MiddleCurrent & mc == 2);
t = t.SetRightNext(rc == 3 | t.RightCurrent & rc == 2);
if (t.Changed)
{
  currentChanges.Add((x, y));
  triplets[x, y] = t;
}
```
What, essentially, are we doing here?
- Compute the new triplet t from the old triplet c.
- If t is different from c, cause two side effects: update an array, and update a list.
There's nothing we can do about the side effects, but we can observe that the computation has the following characteristics:
- We compute the left, middle and right actual neighbour counts; each requires bit twiddling to extract one or two state bits and a three-bit raw count, followed by some addition.
- We then do bit twiddling to again extract the three states...
- and then we do a bunch of comparison logic on all of the above
- The net result is the three "next state" bits, which we again use bit twiddling to get into place in the new triplet.
- And, cherry on top, we do a bunch of bit twiddling to extract the current and next state bits to ask "was there a change?"
- Oh, the pain.
We have a function that consumes 12 bits of a 16-bit short and produces a new 16-bit short and a Boolean saying whether it changed. But given those 12 bits, the short and the Boolean produced are always the same. We can precompute the results for every possible input once and then simply look them up in our inner loop.
That is, we can move the vast majority of the bit twiddling backwards in time to before the algorithm runs the first step.
We will need a lookup key containing the twelve bits we need; fortunately, they are the bottom twelve bits so we can extract them with a minimum of bit twiddling by adding a helper to the triplet struct:
```csharp
public int LookupKey1 => triplet & 0x0fff;
```
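As a reminder of why that mask works: the bits the computation consumes happen to sit contiguously at the bottom of the short. The sketch below records the layout as I recall it from the earlier episode; treat the exact bit positions as an assumption, since the only fact that matters here is that the next-state bits are the only meaningful bits not in the bottom twelve.

```csharp
// Assumed bit layout of the triplet short (my recollection of the earlier
// episode, not gospel):
//
//   bit  15      : unused
//   bits 12 - 14 : next state of the left, middle and right cells
//   bits  9 - 11 : current state of the left, middle and right cells
//   bits  6 -  8 : living neighbours of the left cell, not counting the middle
//   bits  3 -  5 : living neighbours of the middle cell, not counting left or right
//   bits  0 -  2 : living neighbours of the right cell, not counting the middle
//
// Masking with 0x0fff therefore keeps exactly the twelve bits that the
// first-pass computation consumes, and discards the next-state bits.
```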
I'll create two lookups, one for "will there be a change?" and one for "what is the new triplet?" Since the lookups are keyed on a 12-bit integer, we can simply make them both arrays; there's no need for anything fancier:
```csharp
static class TripletLookup
{
  public static Triplet[] lookup;
  public static bool[] changed;
```
and initialize them in a static constructor; we just create every possible triplet based on the bottom 12 bits, and compute what the inner loop of the first pass did in the previous version:
```csharp
static TripletLookup()
{
  lookup = new Triplet[1 << 12];
  changed = new bool[1 << 12];
  for (int left = 0; left < 2; left += 1)
    for (int middle = 0; middle < 2; middle += 1)
      for (int right = 0; right < 2; right += 1)
        for (int lc = 0; lc < 8; lc += 1)
          for (int mc = 0; mc < 7; mc += 1)
            for (int rc = 0; rc < 8; rc += 1)
            {
              Triplet t = new Triplet()
                .SetLeftCurrent(left == 1)
                .SetMiddleCurrent(middle == 1)
                .SetRightCurrent(right == 1)
                .SetLeftCountRaw(lc)
                .SetMiddleCountRaw(mc)
                .SetRightCountRaw(rc)
                .SetLeftNext(
                  (lc + middle == 3) | (left == 1) & (lc + middle == 2))
                .SetMiddleNext(
                  (left + mc + right == 3) | (middle == 1) & (left + mc + right == 2))
                .SetRightNext(
                  (middle + rc == 3) | (right == 1) & (middle + rc == 2));
              lookup[t.LookupKey1] = t;
              changed[t.LookupKey1] = t.Changed;
            }
}
```
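If you want to convince yourself that the table is doing the right thing, a spot check is cheap. This little sketch is mine, not from the episode's code, and it assumes the triplet struct exposes a MiddleNext getter to match the SetMiddleNext setter:

```csharp
// using System.Diagnostics;
//
// Spot check (my sketch; MiddleNext is assumed to be the getter that
// corresponds to SetMiddleNext).
//
// A dead middle cell whose living neighbours are the left cell, the right
// cell, and exactly one cell outside the triplet has three living
// neighbours in total, so it should be born on the next tick.
Triplet probe = new Triplet()
  .SetLeftCurrent(true)
  .SetRightCurrent(true)
  .SetMiddleCountRaw(1);
Triplet after = TripletLookup.lookup[probe.LookupKey1];
Debug.Assert(after.MiddleNext);                        // middle cell is born
Debug.Assert(TripletLookup.changed[probe.LookupKey1]); // and that counts as a change
```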
And that's that.
A few things to note here. First, honestly, how often are you justified in writing a six-deep nested loop? Not very often, so let's take the opportunity while we've got it!
Second, a great many of these lookup conditions are impossible. In Life you cannot get into a situation where the middle cell of a triplet has seven living neighbours that are not left or right, AND left and right both have zero living neighbours other than the middle. Who cares? These two tables take up 12KB of memory, total. We can waste some. And the cost of doing the unnecessary calculations is only paid once, at startup.
Moreover, in Stafford's original implementation he went even further and did not bother to pull out the twelve bits for the key; he just precomputed the entire triplet-to-triplet function and made a table with 65536 entries. Why do any bit twiddling at all?
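Just to make the trade-off concrete, here is roughly what that looks like; this is a sketch of the idea rather than Stafford's actual code, and RawBits is a hypothetical accessor that exposes the whole underlying 16-bit value:

```csharp
// Sketch of the full-table variant (not Stafford's actual code; RawBits is
// a hypothetical accessor for the underlying 16 bit value). The tables are
// sixteen times larger, 65536 entries instead of 4096, but the inner loop
// does no masking at all.
static Triplet[] fullLookup = new Triplet[1 << 16];
static bool[] fullChanged = new bool[1 << 16];

// ... and in the inner loop of the first pass:
int key = triplets[tx, y].RawBits;
if (fullChanged[key])
{
  triplets[tx, y] = fullLookup[key];
  currentChanges.Add((tx, y));
}
```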
Anyway, we can now replace the entire inner loop of the first pass with:
```csharp
int key1 = triplets[tx, y].LookupKey1;
if (TripletLookup.changed[key1])
{
  triplets[tx, y] = TripletLookup.lookup[key1];
  currentChanges.Add((tx, y));
}
```
So much nicer to read, and much faster. Let's once again run 5000 generations of acorn on an 8-quad:
| Algorithm         | time(ms) | ticks | size(quad) | megacells/s |
|-------------------|----------|-------|------------|-------------|
| Naive (Optimized) | 4000     | 5K    | 8          | 82          |
| Abrash            | 550      | 5K    | 8          | 596         |
| Abrash w/ changes | 190      | 5K    | 8          | 1725        |
| Abrash, O(c)      | 240      | 5K    | 8          | 1365        |
| Stafford v1       | 340      | 5K    | 8          | 963         |
| Stafford v2       | 210      | 5K    | 8          | 1560        |
We are back in business! We have a three-cells-per-short O(c) algorithm that beats the one-cell-per-byte O(c) algorithm and is within striking distance of beating the "copy the array every time" version.
Aside: I told a small fib earlier in this episode; did you catch it?
Give that some thought and then scroll down.
.
.
.
.
.
.
.
...we can move the vast majority of the bit twiddling backwards in time to before the algorithm runs the first step.
I put my fib in boldface for added foolery.
When exactly does the lookup initialization happen? I worked on the compiler team and the language design team and I still always have to double-check Jon's page on the subject whenever it comes up.
Since we have a static class with a static constructor, the static constructor will run when a member is accessed for the first time; we are not using the "relaxed" semantics. That means: the cost of precomputing the tables is incurred during the first call to Step, not before. This could make a difference to our performance metrics!
It does not, because when I collected those metrics I ensured that Step had already been called at least once before I started the stopwatch. But I had to know to do that.
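For what it's worth, the warm-up amounts to one extra call before the timing starts. The sketch below uses stand-in names (life, Step) rather than the real benchmark harness, and shows RunClassConstructor as the more explicit alternative:

```csharp
// Warm-up sketch; `life` and `Step` are stand-ins for the real driver,
// not the benchmark code from the episode.
// using System.Diagnostics;

// The first call to Step touches TripletLookup, which runs its static
// constructor and builds the tables:
life.Step();

// ...or force the initialization explicitly:
// RuntimeHelpers.RunClassConstructor(typeof(TripletLookup).TypeHandle);

var stopwatch = Stopwatch.StartNew();
for (int i = 0; i < 5000; i += 1)
  life.Step();
stopwatch.Stop();
Console.WriteLine($"5000 generations in {stopwatch.ElapsedMilliseconds} ms");
```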
Incidentally, the purpose of the "relaxed" semantics is to allow the initialization of the static fields to happen when the first method that accesses those fields is jitted. That way the jitter does not have to emit code to ensure that the fields are initialized; it just initializes them and then it knows they are initialized! I've always found it a little ironic that the "relaxed" semantic means that the jitter gets more eager about calling the initializers early. More eagerness seems like an odd thing to characterize as "relaxed".
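To make the distinction concrete, here is a tiny example of my own: a type with an explicit static constructor gets the precise semantics, whereas a type with only field initializers is marked beforefieldinit and gets the relaxed semantics.

```csharp
// Precise semantics: the explicit static constructor must run exactly at
// the first access to a member of the type, no earlier.
static class PreciseInit
{
  public static readonly int[] Table;
  static PreciseInit() { Table = new int[1 << 12]; }
}

// Relaxed (beforefieldinit) semantics: no explicit static constructor, only
// a field initializer, so the runtime may run the initialization at any
// point before the first field access; for example, eagerly, when a method
// that uses the field is jitted.
static class RelaxedInit
{
  public static readonly int[] Table = new int[1 << 12];
}
```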
Next time on FAIC: Optimizing the first pass was pretty straightforward. Optimizing the second pass is maybe not so easy. We will do so, and finish up our exploration of Stafford's algorithm.
For today's extra content: last time I mentioned that a common technique is to use eaters to constrain unwanted expansion of some pattern in an oscillator; this has been used for a long time, as evidenced by two oscillators discovered in 1972 by Robert Wainwright: dinner table and roteightor. (Images from wiki; see links.)