Article 58VGC Life, part 36

Life, part 36

by
ericlippert
from Fabulous adventures in coding on (#58VGC)

This was supposed to be the last episode of this series, but while writing it I discovered a bug in my implementation of Hensel's QuickLife algorithm. It's a small defect, easily fixed, but I thought it might be interesting to dig into it. (The bug is in the code from episodes 30 through 35; I've fixed it in the episode 35 branch.)

When I'm investigating a defect, there are a lot of stages I go through. Today we'll look at:

  • Make a minimal reproducer
  • Determine the exact code path that causes the defect
  • Understand what went wrong at the level of the code
  • Fix it
  • Do a root cause analysis
  • Sum up the lessons learned

Of course if this were a product there would be more steps in there, such as code review of the fix, writing regression tests, getting the regressions code reviewed, adding assertions that verify the violated invariant elsewhere, reviewing related code for similar defects, determining if customers are affected by the bug or the fix, and so on. Obviously we won't go through those here!

It took me some time to find a minimal repro; I won't go through how I did so because it was tedious trial and error; suffice to say that I found a very complicated reproducer by accident and then made it smaller and smaller until I could go no further.

The minimal repro is very straightforward; we load this pattern into an empty QuickLife instance and step it two ticks. There is only one active Quad4. The even generation of the active Quad4 looks like this:

1kocfizizcr2.jpg?w=277

The first step we are going from even generation zero to odd generation one; recall that in QuickLife, the odd generation is offset by one cell in each direction, so odd generation one should look like this:

1kocfizizcr3.jpg?w=276

That's a still Life, so if we've done things correctly, even generation two should look like:

1kocfizizcr4.jpg?w=277

But we have not done things correctly, and it does not look like this! Instead we observe that generation two is displayed as being the same as generation zero and that moreover, this pattern continues on subsequent even generations. We have a little isolated one-cell blinker, which should be impossible.

(ASIDE: This is the simplest version of this defect; while investigating I also found more complex versions where the spark" part was in an adjacent Quad4 and the block" part was not a still Life. Going through the more complex scenarios would make this long post much longer but both the root cause and the fix are the same, so I will skip that part of the analysis.)

What code path leads to this defect?

First, a quick refresher. Recall that QuickLife gets its speed in several ways. The core logic that we are concerned about for this bug is:

  • A Quad4 represents two generations of a 16*16 fragment of the board; one even, one odd.
  • Each of the eight Quad3s in a Quad4 keeps track of four regions" and whether those regions are active, stable or dead. By stable" we mean exactly the same as two ticks previous", and by dead" we mean stable, and every cell is dead".
  • If a Quad3 in an active Quad4 is stable or dead then we can skip computing its next generation because it will be the same as the previous generation.

The code I wrote to implement this earlier in the series is usually correct, but it is subtly wrong on generations zero and one, which of course can then cause the computation to be wrong on every subsequent generation. Let's trace through it carefully and see where things go pear shaped.

Initial state:

  • The even half of the Quad4 cell state is as given in the first diagram.
  • The odd half of the Quad4 cell state is all dead.
  • All even and odd regions of all Quad3s are set to active".

To step from generation zero to generation one, remember we execute this code:

private void StepEven(){ foreach (Quad4 c in active) { if (!RemoveStableEvenQuad4(c)) { c.StepEven(); MakeOddNeighborsActive(c); } c.StayActiveNextStep = false; }}

The even half of the Quad4 is marked active so we do not remove it, and enter the body of the condition:

public void StepEven(){ StepEvenNW(); StepEvenSW(); StepEvenNE(); StepEvenSE();}

Things are going to go wrong in the NE Quad3, so let's take a look at that.

private void StepEvenNE(){ if (OddNEPossiblyActive()) { Quad3 newOddNE = Step9Quad2ToQuad3Even(...); OddNEState = oddNE.UpdateOddQuad3State(newOddNE, OddNEState); oddNE = newOddNE; } else OddNEState = oddNE.MakeOddStableOrDead(OddNEState);}

The first thing we do is check if the odd cycle of the NE Quad3 could possibly change from its current state; since the even cycle is marked as active, it could. We enter the consequence of the condition and correctly compute that the odd NE Quad3 is all dead. There is only one isolated living cell there, and that's not enough to sustain life.

Here's the first thing that goes terribly wrong. UpdateOddQuad3State compares the new odd NE Quad3 to the previous one and discovers both of them are all dead". Remember our definition of the state bits: in order for a Quad3 region to be marked as dead" it must be (1) all dead, and (2) stable; all dead twice. We therefore mark the odd NE Quad3 as dead" in all four regions.

It seems like everything is working correctly so far. I won't go through the rest of the even-to-odd step in detail; summing up:

  • All four Quad3s on the even cycle are still marked as active in all regions
  • The odd SW Quad3 is marked as active overall and on its east edge, because it is different from how it was in the previous" generation.
  • The odd NE, NW and SE Quad3s are marked as dead" in all regions.

Now let's look at what happens on the generation-one-to-generation-two step, just in the northeast Quad3 of this Quad4:

private void StepOddNE(){ if (EvenNEPossiblyActive()) { Quad3 newEvenNE = Step9Quad2ToQuad3Odd(...); EvenNEState = evenNE.UpdateEvenQuad3State(newEvenNE, EvenNEState); evenNE = newEvenNE; } else EvenNEState = evenNE.MakeEvenStableOrDead(EvenNEState);}

Once again, the first thing we check is whether there is any point in recomputing the even state; if we know it is going to be the same this time as last, then we can skip it... and... wait a minute:

private bool EvenNEPossiblyActive() => OddNortheastOrBorderingActive || N != null && N.OddSouthEdge10EastActive;

We just marked the odd NE, NW and SE Quad3s as stably dead in all regions" so OddNortheastOrBorderingActive is false; there is no Quad4 to the north, but if there were, it would be all dead also. The wrong conclusion we reach is: on the next tick, the northeast corner of this Quad3 must be the same on generation 2 as it was on generation 0 so we can skip computing it.

We therefore enter the alternative (else) branch, call MakeEvenStableOrDead, and mark the even NE Quad3 as stable". This is obviously wrong, and worse, that wrongness then persists forever because the whole Quad4 will soon be marked as stable" and we will have a period-two oscillator between the states illustrated in the first two diagrams above.

The appropriate fix is determined by understanding what has really gone wrong at the level of the code. What invariant did we depend upon that was violated?

If we're stepping from an even generation to an odd generation, the current generation is stored in the even Quad3s, and the previous generation is stored in the odd Quad3s. But there is an important assumption made by these optimizations: we assume that the current generation was itself computed by stepping the previous generation. The reasoning is:

  • Generation -1 was all dead in the NE Quad3
  • Generation 0 was the result of stepping generation -1 - uh oh
  • Stepping generation 0 gives us generation 1 all dead in the NE Quad3
  • Generations -1 and 1 are the same in the NE Quad3; it is stable
  • Therefore generation 2 is also stable
  • Generation 2 is the same as generation 0 in the NE Quad3 so we can skip computing it and mark it as stable.

The second premise is false, so the conclusion does not follow.

There are a lot of ways to fix this. What we're going to do here is keep track of one more piece of board-global state: was the current generation actually computed from the previous generation? Or put another way, is the stored cell state of the previous generation correct? The vast majority of the time it will be, but it is not if we have just loaded a pattern in.

If the previous generation is correct then our algorithm is already correct (I hope!) and we do not need to do anything. What should we do if the previous generation is not correct?

The tempting thing to do is to make that flag a parameter to the step functions. Where did things first go wrong? When we marked the odd NE Quad3 as stably dead". We could pass a flag in that says when stepping even to odd, if the previous odd generation is incorrect then do not compare the new odd cells to the previous odd cells; instead mark them all active." On the next step we would then not skip computing the new even state, and all would be well.

However, there are now eight code paths that we would have to plumb this flag through. An easier, hackier solution - the solution Hensel applied in the original QuickLife source code - is to allow the step to compute the wrong" stability flags and then fix them later:

private void StepEven(){ foreach (Quad4 c in active) { if (!RemoveStableEvenQuad4(c)) { c.StepEven(); MakeOddNeighborsActive(c); } c.StayActiveNextStep = false; if (!previousCorrect) c.SetOddQuad4AllRegionsActive(); } previousCorrect = true;}

And similarly on the odd cycle. That's our fix.

Whenever I fix a defect, I try to spend some time asking myself how this defect came to be in the code in the first place. In this particular case, that's easy. I remember very clearly the faulty thought process.

As I've mentioned before, Hensel's explanation of the basic ideas underlying the original QuickLife implementation is very clear, but the code that implements it is arcane in many ways. There are loop bodies and conditional bodies with hundreds of lines of dense, bit-twiddling code. The bit manipulation is all raw hex numbers. The variable naming conventions are obscure; p means even, q means odd, for instance. It took me a long time to understand the details of the algorithm, and I wanted to simplify my task by ignoring anything that wasn't directly related to the actual optimizations.

One of the nice-to-have features of QuickLife (and several other algorithms we've looked at) that I did not spend any time in this series addressing is: because we have the previous and current states stored, you could easily add a handy step backwards one tick" feature to the UI. And in fact the original Java implementation of QuickLife has this feature.

Now, you have to be careful about this, for two reasons. First, obviously once you have stepped backwards once, you cannot do so again. You need to keep track of whether you've already stepped back once or not and disallow it. But there is a bigger problem which, given the preceding material in this ever-longer post, you have undoubtedly already deduced. If we were on an even generation, then the current state is in the even state and the previous state is in the odd state. If we step back one tick, then the current state is in the odd state, but the even state is not the previous state; it is the next state. We need to make sure that when we do the odd-to-even step, we do not come to the erroneous conclusion that the even state is all stable!

The original code has a very clearly-named state variable; it has a field backcorrect which indicates whether the previous state is correct or not. If set to false, then the algorithm does exactly what I've done in my bug fix: upon stepping, it marks every region of every Quad3 to active", and then finally sets the correctness flag to true.

My mistake was that I wrongly believed upon initially reading the code that I could ignore backcorrect because it was only for implementing the step-backwards feature, which I was not going to add to my UI. I completely missed the fact that Hensel sets this flag to false whenever the user forces a change to cell state, and that setting that flag to false is crucial for ensuring the correctness of the algorithm in that scenario.

I feel a keen sense of irony that after figuring out all the obscure and arcane parts of the algorithm, I missed the importance of this state bit that was more clearly named than all the other state bits.

That was the root cause of the defect, but there were other contributory factors:

  • I made the classic high school math mistake of reasoning inductively without establishing a base case. In this blog series I made arguments for the correctness of this algorithm based on the assumption that the previous generation was the real previous generation of the current generation. But that assumption breaks down on the base case of generation zero.
  • My test regimen was inadequate. I have no unit tests or end-to-end tests; all my testing for this entire series has been completely ad-hoc-try-it-out-in-the-UI-and-see-if-it-looks-right.
  • I have very few assertions in the code.
  • When implementing a complex and easily misunderstood algorithm like QuickLife, I probably should have built a check mode" implementation of the ILife interface that takes two implementations, does all the same operations to both, and verifies that the new board states are the same in both implementations. I could then have written a random soup" test generator and verified my implementations against each other.

I (re)learned some lessons from this bug. From the specific to the general:

  • When you're porting code you didn't write and there is a part you think is unnecessary to port, really dig in and understand whether it can be safely removed.
  • I made good arguments about the correctness of the steady state", but I did not establish the correctness of the code on boundary" conditions such as it's the very first step". Edge cases are not necessarily rare cases; every program starts at an edge case.
  • A lesson I have learned many times in my life as a compiler developer is strongly reinforced by this bug: every time you cache the result of an analysis of a mutable data structure for performance reasons you create an opportunity for a bug should a mutation cause the cached analysis to become incorrect. I've fixed so many compiler bugs like that.
  • The whole point of this series is that you can sometimes find specific aspects of the business domain" of your computation and use those to drive performance optimizations. If those optimizations depend on invariants that can be forced to be violated, the optimization will lead to an incorrect computation!

We took advantage of the rules of Life to get wins; when we force those rules to be violated, computations that depend on those rules become incorrect.

Next time on FAIC: Let's finish this thing up! Are there patterns that grow quadratically?

External Content
Source RSS or Atom Feed
Feed Location http://ericlippert.com/feed
Feed Title Fabulous adventures in coding
Feed Link https://ericlippert.com/
Reply 0 comments