Life, part 6

by ericlippert, from Fabulous adventures in coding

Code for this episode can be found here. The only interesting change I've made to the client is that if you press "P", it pauses the simulation and runs 5000 steps of the "acorn" pattern, and then dumps the number of milliseconds that took to a text file. This is a quick and dirty performance test, but as we'll see, we can get great results even with this primitive tool.

Last time we demonstrated that a single tick in the naive "array of bools" algorithm is O(n) in the number of cells, which is not surprising - after all, we do a neighbor count of every cell on every tick. This gives us the insight that if we quadrupled the number of cells, we'd probably quadruple the time to compute a step, which is good to know, but we still don't know how bad it is in absolute terms, aside from noting that we're having no problem keeping up with a cadence of two ticks per second, of course!

As I mentioned in the preamble, I wrote the quickest little perf tester you could imagine and ran it on the release version of the app. There is no point in doing performance testing on a debug build, or on a release build while it is running in the debugger; remember, the runtime knows it is being debugged and changes the performance characteristics of the program in order to make it easier to debug.
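
Something like this minimal sketch would do the job; the ILife type and Step method names here are hypothetical stand-ins for whatever the client actually exposes:

// A quick and dirty perf harness; ILife and Step are hypothetical names.
void PerformanceTest(ILife life)
{
    var stopwatch = System.Diagnostics.Stopwatch.StartNew();
    for (int i = 0; i < 5000; i += 1)
        life.Step();
    stopwatch.Stop();
    System.IO.File.WriteAllText("perf.txt", $"{stopwatch.ElapsedMilliseconds} ms");
}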

On my home machine - a decidedly mediocre mid-end PC game box - I got 5000 steps of "acorn" in 17 seconds, which is around 300 steps per second. By 1992 standards, that would be insanely great, but let me tell you, machines are a lot faster today than they were in 1992.

I did not deliberately "sandbag" this implementation - I did not write it to have a deliberately slow bit - so there's no trick to figuring out where we're spending time. I wrote it with an emphasis on clarity and correctness, knowing that we could come back later and find the hot spots; it's a lot easier to optimize code if it is already clear and correct.

The question now is: where would you guess most of the time is being spent in the calculation? (Remember, we are omitting the time to draw to the bitmap.)

If you didn't have a profiler on your machine - and I don't! I have the free version of VS on my home machine - what would your gut tell you? Because my gut is almost always at least partially wrong when it comes to performance.

Let's quickly review the naive algorithm. The major parts are:

Test to see if a point is inside our 8-quad of valid cells:

private bool IsValidPoint(long x, long y) => 0 <= x && x < width && 0 <= y && y < height;

Fetch the value from the array if the point is valid; dead otherwise:

public bool this[long x, long y]
{
    get
    {
        if (IsValidPoint(x, y))
            return cells[x, y];
        return false;
    }
}

Count the neighbors:

private int LivingNeighbors(int x, int y)
{
    int sum = this[x - 1, y - 1] ? 1 : 0;
    sum += this[x - 1, y] ? 1 : 0;
    sum += this[x - 1, y + 1] ? 1 : 0;
    sum += this[x, y - 1] ? 1 : 0;
    sum += this[x, y + 1] ? 1 : 0;
    sum += this[x + 1, y - 1] ? 1 : 0;
    sum += this[x + 1, y] ? 1 : 0;
    sum += this[x + 1, y + 1] ? 1 : 0;
    return sum;
}

And the main loop:

bool[,] newCells = new bool[width, height];
for (int y = 0; y < height; y += 1)
    for (int x = 0; x < width; x += 1)
    {
        int count = LivingNeighbors(x, y);
        newCells[x, y] = count == 3 || (cells[x, y] && count == 2);
    }
cells = newCells;

So, no peeking below: without a profiler to guide you, what would you assume was the low-hanging fruit that could be attacked here to make the whole thing a little faster without changing the basic algorithm or data structures?

I find that I'm dead wrong about a third of the time when I look at a piece of code and guess where the time is being spent, and fortunately this was one of the times I guessed correctly. The biggest cost by far in this loop is that it calls IsValidPoint eight times on every loop iteration. Since there are 65536 loop iterations per tick, that's over two billion validity checks in our performance run.

Moreover, the vast majority of the time the point is valid; the only "invalid" points we generate are when looking for the neighbors of cells on the edges.

How can we improve this? The easiest way is just to note that we have two cases, the interior points where all neighbors are valid, and the edges where not all neighbors are valid. If we are in the former case, which we can check for once, then we can skip all the other validity checks:

if (1 <= x && x < width - 1 && 1 <= y && y < height - 1)
{
    int sum = cells[x - 1, y - 1] ? 1 : 0;
    sum += cells[x - 1, y] ? 1 : 0;
    ...
    return sum;
}
else
{
    // as before:
    int sum = this[x - 1, y - 1] ? 1 : 0;
    ...
}

Making this change takes us from 5000 ticks in 17 seconds to 4.4 seconds. That's nearly a 4x improvement right there!

That was no surprise at all, but my experiments with making further tweaks to the code were quite surprising. Before I get into the details, let me hasten to say: I have not actually looked at the generated assembly to figure out why the jitted code has non-obvious performance characteristics, but I have some educated guesses. Take everything below with a grain of salt, and if anyone cares to look at the generated machine code to confirm or deny my guesses, I'd be happy to make an update.

Here are some things I tried to see if they made an improvement or a regression; I'll list the winners first:

I wondered if we were paying any penalty for all these conditionals:

int sum = cells[x - 1, y - 1] ? 1 : 0;

On the one hand, the Boolean should be represented as a byte that is either zero or one, so the jitter could make the conditional a no-op, right?

On the other hand, there are sneaky ways to make a Boolean contain a non-zero, non-one value, and those should all be treated as true, but that would make the "turn this conditional into a no-op" optimization invalid.

Like I said above, I haven't checked the codegen to see what exactly it does. But when I turned the bool array into a byte array and removed the conditionals, that trimmed an additional 40 microseconds off of each step for a total of 200 milliseconds over 5000 steps.
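
To illustrate, the interior fast path becomes something like this sketch, assuming cells is now a byte[,] that holds only zero or one:

// A sketch of the interior fast path with a byte[,] instead of a bool[,];
// since each cell is already 0 or 1, we can just add, no conditionals needed.
int sum = cells[x - 1, y - 1];
sum += cells[x - 1, y];
sum += cells[x - 1, y + 1];
sum += cells[x, y - 1];
sum += cells[x, y + 1];
sum += cells[x + 1, y - 1];
sum += cells[x + 1, y];
sum += cells[x + 1, y + 1];
return sum;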

Allocating a lot of new arrays has two costs: it has to initialize the new array to zero, and it causes collection pressure which increases the number of GCs. (Fortunately we are sneaking in under the 85K limit for arrays to avoid going on the Large Object Heap, but we're very close, and that would change things as well.)

This algorithm fills in every element in the new array, so there's no need for it to be initialized at all, or allocated and then immediately freed.

If we go to a solution where we cache the temporary array and re-use it, that shaves off another 200 milliseconds over 5000 steps.
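
A sketch of that approach, building on the byte representation; the scratch field name is my invention:

// Re-use a scratch array instead of allocating a fresh one each tick;
// every element is overwritten, so no clearing is needed.
private byte[,] cells;
private byte[,] scratch; // allocated once, same dimensions as cells

private void Step()
{
    for (int y = 0; y < height; y += 1)
        for (int x = 0; x < width; x += 1)
        {
            int count = LivingNeighbors(x, y);
            scratch[x, y] = (count == 3 || (cells[x, y] != 0 && count == 2)) ? (byte)1 : (byte)0;
        }
    // Swap the buffers; this tick's cells array becomes next tick's scratch.
    var temp = cells;
    cells = scratch;
    scratch = temp;
}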

When I was working on the nullable lowering code in the compiler, I learned that if we were generating, say, an addition of two nullable integer values x and y, you always generate the code as if the user wrote:

result = x.HasValue & y.HasValue ? ...

and not

result = x.HasValue && y.HasValue ? ...

Why's that? Because checking whether x has a value and branching if it does not is more expensive than simply reading y.HasValue! Avoiding the work costs more than just doing it. And moreover, every branch creates a new basic block, and basic blocks make the jitter work harder and therefore potentially skip more optimizations; the jitter has to be fast.

I therefore wondered if I could rewrite

 if (1 <= x && x < width - 1 && 1 <= y && y < height - 1)

which after all is just

 if (1 <= x) if (x < width - 1) if (1 <= y) if (y < height - 1)

as instead

 if (1 <= x & x < width - 1 & 1 <= y & y < height - 1)

Turns out, that's actually a small regression. This was surprising, particularly since the conditions are almost always true, so the short-circuiting version hardly ever gets to skip any work anyway.

I'm not sure what's going on here, but I have a guess. Let's look at the other losers and then I'll theorize further at the end.

As we've seen, the special-purpose code for handling the edges is a source of pain because we do extra work to avoid dereferencing outside the array. A standard technique for Life on finite grids is to make a "rectangle of death" along the edge. That is, allocate memory for a 256*256 grid, but only actually do the computations on the 254*254 grid in the interior, and keep the 1000 or so cells on the edges permanently dead. We waste a couple KB of memory, but we save time by not having to constantly check whether we're computing the neighbors of an edge cell, because we just skip them entirely.
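
The change to the main loop is just the bounds; a sketch, using the bool representation from before:

// The "rectangle of death": the loop never visits the border cells, which
// stay permanently dead, so no neighbor read ever leaves the array.
for (int y = 1; y < height - 1; y += 1)
    for (int x = 1; x < width - 1; x += 1)
    {
        int count = LivingNeighbors(x, y);
        newCells[x, y] = count == 3 || (cells[x, y] && count == 2);
    }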

When I restricted the computations to the interior just by changing the loop conditions, of course we were now computing 1000 fewer cells than before, so there was a commensurate speedup. But here's the weird bit. When I removed the now-unnecessary:

if (1 <= x && x < width - 1 && 1 <= y && y < height - 1)

check, the program got slightly but consistently slower.

What the heck? It's doing less work, so how did it get slower?

This code seems to be doing some recomputation:

int sum = this[x - 1, y - 1] ? 1 : 0;
sum += this[x - 1, y] ? 1 : 0;
...

We recompute x-1, x+1, y-1 and y+1 several times.

When I put those into local variables to avoid recomputing them, again I got a small but consistent regression!
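
That is, I tried something like this sketch; the local names are mine:

// Hoist the neighbor coordinates into locals to avoid recomputing them.
int xm1 = x - 1;
int xp1 = x + 1;
int ym1 = y - 1;
int yp1 = y + 1;
int sum = this[xm1, ym1] ? 1 : 0;
sum += this[xm1, y] ? 1 : 0;
sum += this[xm1, yp1] ? 1 : 0;
sum += this[x, ym1] ? 1 : 0;
sum += this[x, yp1] ? 1 : 0;
sum += this[xp1, ym1] ? 1 : 0;
sum += this[xp1, y] ? 1 : 0;
sum += this[xp1, yp1] ? 1 : 0;
return sum;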

Finally, the one that was most surprising to me. It is fairly well-known that in .NET, the jitter does a better job optimizing operations on single-dimensional arrays than multi-dimensional arrays. I tried changing my cell array to a single-dimensional array, and then accessed it as

cells[x + y * width]

throughout. This also caused a small regression!
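
For concreteness, the neighbor count over a single-dimensional array looks something like this sketch, assuming int-sized dimensions and the byte representation:

// The one-dimensional version; cell (x, y) lives at offset x + y * width,
// assuming cells was allocated as new byte[width * height].
private int LivingNeighbors(int x, int y)
{
    int above = x + (y - 1) * width;
    int middle = x + y * width;
    int below = x + (y + 1) * width;
    return cells[above - 1] + cells[above] + cells[above + 1] +
           cells[middle - 1] + cells[middle + 1] +
           cells[below - 1] + cells[below] + cells[below + 1];
}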

I have a hypothesis that explains these unexpected regressions.

Arrays are memory-safe in C#. Accessing a single-dimensional array outside its bounds throws an exception, and accessing a multi-dimensional array outside any of its bounds also throws, even if the offset computed into the underlying memory block would happen to be in range.

The jitter has a lot of smarts in it to avoid doing unnecessary bounds checks. In particular, if you have:

for (int i = 0; i < a.Length; i += 1)
    sum += a[i];

Then the runtime knows that the array access inside the loop will always be in bounds, so it skips emitting a bounds check.

I suspect that what is going on here with all of these unexpected regressions is that the edit causes the jitter to fall out of a "known to be good" path and it starts generating more array bounds checks.

We are smart enough to know that the array indices will be in bounds, but the jitter only has a very small time budget to analyze a method body; as I noted above, it makes no sense to do a work-avoiding optimization if determining whether the optimization applies costs more than the work saved!

This is just a conjecture though; if anyone knows for sure what is going on here, I'd love to see a deeper analysis.

The obvious comment here is "can we avoid the checks entirely by going to unsafe code?" I'll address that point in a later episode, so hold off there for now!

Summing up:

  • Our tweaked algorithm, which you can find in BoolArrayOpt.cs, goes from 17 seconds to 4 seconds for 5000 ticks, or 1250 ticks per second. That seems decent for an early attempt.
  • If we get 1250 TPS with an 8-quad, then, since each quad level quadruples the number of cells, we should get about 80 TPS with a 10-quad and about 20 with an 11-quad; a 1024*1024 grid is about as good as we can get with this implementation if we want to meet our performance goal of animating smoothly. I have not actually tried it though; it would be interesting to see what the scaling factor really is.
  • The moral of the story here is that the performance of managed code can be surprising, so always measure!

Next time on FAIC: there's a (seemingly) completely different flavor of array-based solution to this problem that I'd like to spend an episode or two digging into.
