Election 2029: An Impossible Exception

jonskeet

from Jon Skeet's coding blog on 2025-02-07 23:32 (#6V4W3)

I really thought I'd already written a first blog post about my Election 2029 site (https://election2029.uk) but I appear to be further behind on my blogging than I'd thought. This is therefore a little odd first post in the series, but never mind. To some extent it isn't particularly related to the election site, except that a) this is relatively modern C# compared wtih most of my codebases; b) it will explain the two strings in question.

I cannot stress enough: when my code misbehaves, it's almost always my fault. (I've blogged about this before.)

But on Thursday, I saw an exception I can't explain. This post will give a bit of background, show you the code involved and the exception message, and my next (pretty weak) step. The purpose of this post is threefold:

It's a sort of act of humility: if I'm going to post when I'm pleased with myself for diagnosing a tricky problem, I should be equally content to post when I'm stumped.
It's just possible that someone will have some insight into what's going on, and add a comment.
If I ever do work out what happened, it'll be good to have this post written at the time of the original problem to refer to.

When my code misbehaves, it's almost always my fault.

First, a bit of background on how the site stores data.

Election data

I'll go into a lot more detail about the general architecture of the site in future posts, but for the purposes of this post, you only need to know:

The data is stored in Firestore
Normal page requests don't hit the database at all. Instead, it's all held in memory, and occasionally (and atomically, per instance) updated. The type containing all the data (in a relatively raw" form) is called ElectionContext, and is fully immutable.
During development, there's a manual reload current data" page that I use after I know I've updated the database. (I'll explain more about my reloading plans in another post.)
When the ElectionContext is reloaded, it performs some validation - if it's not valid, the reload operation fails and the previously-loaded data continues to be used. (This doesn't help if a new instance is started, of course.) Part of the validation is checking that certain collections are in an expected order.

Two of the collections in the context are data providers and prediction sets. On Thursday I added a new data provider (Focaldata) and their first prediction set.

I always add data first to my local test environment (which uses both Firestore and a file-based version), my staging environment, and finally production. When I update the ElectionContext, I validate it before storing it, then fetch it from scratch twice, validating it both times and then checking that a) the first fetched context is equal to the context that I stored, and that the second fetched context is equal to the first fetched context.

When my code misbehaves, it's almost always my fault.

The code and exception

This is the code used to check the order of a collection within the ElectionContext.Validate() method:

void ValidateOrdering<T>( IEnumerable<T> source, Func<T, string> selector, [CallerArgumentExpression(nameof(source))] string? message = null){ foreach (var (current, next) in source.Zip(source.Skip(1))) { string currentText = selector(current); string nextText = selector(next); if (StringComparer.Ordinal.Compare(currentText, nextText) >= 0) { throw new InvalidOperationException( $"Incorrect ordering for {message}: {currentText} should occur before {nextText}"); } }}

And specifically for prediction sets:

ValidateOrdering(PredictionSets, ps => ps.Id);

The PredictionSets property is an ImmutableList<PredictionSet> and PredictionSet is a record:

public sealed record PredictionSet(string Id, /* many other properties */)

So basically the code is checking that the IDs of all the prediction sets are in order, when sorted with an ordinal string comparison. (My initial thought after seeing the exception was that there was some bizarre culture involved and that I was performing a culture-specific ordering check - but no, it's ordinal.)

When I pushed the new data, both test and staging environments were fine. When I hit the reload page in production, the reload failed, and this exception was in the logs:

System.InvalidOperationException: Incorrect ordering for PredictionSets: mic-01 should occur before focaldata-01 at Election2029.Models.ElectionContext.<Validate>g__ValidateOrdering [...]

Before we look in more detail at the exception itself, the logs show five requests to the reload page:

At 20:15:38, the new data hadn't yet been stored: the reload page found there was no new data, and succeeded
At 20:15:45, 20:15:49 and 20:15:57 the reload operation failed (with the exception above each time)
At 20:26:22, the reload operation succeeded

Interestingly, the storage component of the reload operation looks for new data and returns its cached version if there isn't any; it's the component one level up that performs validation. (This cached version isn't the serving" context - it's just in the storage component. So during this period the site kept serving the previous data with only a single prediction set.) In this case, the logs show:

20:15:38: No new data, so cache returned - then validated successfully
20:15:45: New data loaded, then failed validation
20:15:49: No new data, so cached context returned, then failed validation
20:15:57: No new data, so cached context returned, then failed validation
20:26:22: No new data, so cached context returned - then validated successully

In other words, the same data that failed validation three times was then validated successfully later. The log entries are quite explicit about not loading anything - so it doesn't appear to be a problem in the storage component, where the data was loaded incorrectly first, then loaded correctly later. I say doesn't appear to be" because when my code misbehaves, it's almost always my fault. (And appearances can be deceptive. Basically, I'm suspicious of any deductions until I've got a way of reproducing the problem.)

Let's look at the code and exception again.

What do we know" / suspect?

In the exception, mic-01 and focaldata-01 are the IDs of the two prediction sets. In production, these are the only two prediction sets, and those are the correct IDs. The exception message implies that at the time of the exception, the value of currentText was "focaldata-01", and the value of nextText was "mic-01". Those values make sense in that I would have expected "focaldata-01" to come before "mic-01".

In other words, it looks like the data was right, but the check was wrong. In other words, with the arguments substituted with the actual (apparent) values it looks like this expression evaluated to true:

// How could this possibly be true?StringComparer.Ordinal.Compare("focaldata-01", "mic-01") >= 0

To cover some normal bases, even if they wouldn't easily explain the exception:

The use of source.Zip(source.Skip(1)) might look worrying in terms of the collection changing, but source is an ImmutableList<PredictionSet>.
The type of PredictionSet.Id is string - so that's immutable.
PredictionSet itself is immutable, so Id can't change over time.
Most importantly, currentText and nextText are both local variables in the loop.

When my code misbehaves, it's almost always my fault. But in this case I really, really can't understand how it could be.

I'm left thinking that of two options:

The ordinal string comparer has a bug which makes the result non-deterministic (don't forget this same check passed on the same machine minutes later).
The JIT compiler has a bug which meant that the arguments weren't evaluated properly - either they were passed in the wrong order, or perhaps either the second argument or both arguments evaluated to null for some reason, but then were properly evaluated for the string formatting in the exception.

Neither of these seems particularly likely, to be honest. The second seems a little more likely given that I'm already aware of a JIT compiler issue in .NET 9 which has affected some customers from my Google Cloud client library work. I don't understand the linked issue well enough to judge whether it would explain the exception though.

What's next?

I can't reproduce the problem in any environment. The only aspect I can think of to improve the diagnostics a bit, to rule out non-printable characters, is to log the length of each string as part of the exception:

throw new InvalidOperationException( $"Incorrect ordering for {message}: {currentText} (length {currentText.Length}) should occur before {nextText} (length {nextText.Length})");

It's not much, but it's all I've got at the moment.

My guess is that I'll never know what happened here. That I'll never see the exception again, but also never be able to reproduce it in a before" state in order to know that it's been fixed. All somewhat unsatisfying - but at least interesting. Oh, and I absolutely still have faith that when my code misbehaves, it's almost always my fault.

Update - solved!

Thanks to a smarter colleague, this mystery has now been solved. While she didn't have enough context to know where the problem actually was, she was able to pick up where my reasoning went wrong above.

Source	RSS or Atom Feed
Feed Location	http://codeblog.jonskeet.uk/feed/
Feed Title	Jon Skeet's coding blog
Feed Link	https://codeblog.jonskeet.uk/