
Election 2029: Storage

by jonskeet, from Jon Skeet's coding blog

Since my last post about the data models, I've simplified things very slightly - basically, the improvements I thought about while writing that post have now been implemented. I won't go into the details of the changes, as they're not really important; I mention them only to explain why some examples might look like things are missing.

I have two storage implementations at the moment:

  • Firestore, with separate named databases for the test, staging and production environments
  • JSON files on disk, for development purposes so that I don't need to hit Firestore every time I start the site locally. (The "test" Firestore database doesn't get much use - but it's nice to be able to easily switch to using it for testing any Firestore storage implementation changes before they hit staging.)

As I mentioned before, the data models are immutable in memory. I have an interface (IElectionStorage) used to abstract the storage aspects (at least most of them) so that hardly any code (and no code in the site itself) needs to know or care which implementation is in use. The interface was originally quite fine-grained, with separate methods for different parts of ElectionContext. It's now really simple though:

public interface IElectionStorage
{
    Task StoreElectionContext(ElectionContext context, CancellationToken cancellationToken);
    Task<ElectionContext> LoadElectionContext(CancellationToken cancellationToken);
}

This post will dive into how that's implemented - primarily focusing on the Firestore side as that's rather more interesting.

Storage classes

When choosing to store data models as JSON or XML, I usually end up following one of three broad patterns:

  • Produce and consume the JSON/XML directly (using JObject or XDocument etc) explicitly in code
  • Separate storage" and usage" - introduce a parallel set of types that roughly mirrors the data models used by the rest of the code; these storage-specific classes are only used as a simpler way of performing serialization and deserialization
  • Serialize and deserialize the data model used by the rest of the code directly

I've had success with all three of these approaches - my church A/V system uses the last of them, for example. The first (most explicit) approach is one I sometimes use for XML, but Json.NET (and System.Text.Json) make it so easy to serialize and deserialize types to/from JSON that I rarely use it there. (I'm aware that XML serialization options exist, but I've never found them nearly as easy to use as JSON serialization.)

The middle option - having a separate set of classes just for serialization/deserialization - feels like a sweet spot though. It does involve duplication: when I add a new property to one of the core data models, I have to add it to the storage classes too. But that happens relatively rarely, and the separation makes things significantly more flexible and keeps all storage concerns out of the core data model. It also helps me to resist designing the data model around what would be easy to store.

Note on JSON library choice
For my election site, I'm using Json.NET (aka Newtonsoft.Json). I have no doubt that System.Text.Json has some benefits in terms of performance, but I'm personally still more comfortable with Json.NET, having used it for years. None of the JSON code is performance-sensitive, so I've gone with the familiar.

The fact that the data models are immutable encourages the choice of using separate storage classes, too. While both Json.NET and System.Text.Json support deserializing to records, the deserialization code for Firestore is a little more limited. I could write a custom converter to use constructors (and I have added a few custom converters) but if we separate storage and usage, it's fine to just make the storage classes mutable.
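As a simplified sketch of what this looks like - cut down from the real models, with a hypothetical Party used for illustration:

// Illustrative sketch only - not the site's actual types.
public sealed record Party(string Id, string Name, string Abbreviation);

// Mutable mirror of Party, used only for serialization and deserialization.
public class PartyStorage
{
    public string Id { get; set; } = "";
    public string Name { get; set; } = "";
    public string Abbreviation { get; set; } = "";

    public static PartyStorage FromModel(Party party) =>
        new() { Id = party.Id, Name = party.Name, Abbreviation = party.Abbreviation };

    public Party ToModel() => new(Id, Name, Abbreviation);
}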

Initially - for quite a long time, in fact - I had separate storage classes for JSON (decorated with [JsonProperty] where necessary) and for Firestore (decorated with [FirestoreData] and [FirestoreProperty]). That gave me the flexibility to use different storage representations for files and for Firestore. It became clear after a while that I didn't actually need this flexibility though - so now there's just a single set of types used for both storage implementations. There's still the opportunity to use different types for one particular piece of data should I wish to - and for a while I had a shared representation for everything apart from the postcode mapping.
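This works because Json.NET serializes public properties by name by default and ignores attributes it doesn't recognize, so the Firestore attributes don't get in the way of the JSON file storage. Revisiting the hypothetical PartyStorage sketch, now decorated for both implementations:

// Illustrative: the same storage class serving both Firestore and Json.NET.
[FirestoreData]
public class PartyStorage
{
    [FirestoreProperty] public string Id { get; set; } = "";
    [FirestoreProperty] public string Name { get; set; } = "";
    [FirestoreProperty] public string Abbreviation { get; set; } = "";
}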

Storage representations

File storage

On disk, each collection within the context has its own JSON file - and there's a single timestamp.json file for the timestamp of the whole context. So the files are ballots-2029.json, by-elections.json, candidates.json, constituencies.json, data-providers.json, electoral-commission-parties.json, notional-results-2019.json, parties.json, party-changes.json, polls.json, postcodes.json, projection-sets.json, results-2024.json, results-2029.json and timestamp.json. The total is currently just under 4MB. The test environment on disk is used for some automated testing as well as local development, so it's checked into source control; that's also useful as a history of how the data has changed.
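Given that layout, the file-based implementation is pleasantly boring. A minimal sketch of the load/store helpers (illustrative rather than the actual code, which of course has to map every collection to and from the data models):

// Illustrative sketch of the file-based storage helpers.
using Newtonsoft.Json;

public sealed class FileElectionStorage
{
    private readonly string directory;

    public FileElectionStorage(string directory) => this.directory = directory;

    // Each collection is read from its own file, e.g. "parties.json".
    public T LoadFile<T>(string fileName) =>
        JsonConvert.DeserializeObject<T>(
            File.ReadAllText(Path.Combine(directory, fileName)))!;

    public void StoreFile<T>(string fileName, T value) =>
        File.WriteAllText(
            Path.Combine(directory, fileName),
            JsonConvert.SerializeObject(value, Formatting.Indented));
}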

Firestore storage - initial musings

Firestore is a little more complicated. A Firestore database consists of documents in collections - and a document can contain its own collections too. A path to a Firestore document (relative to the database root) is therefore something along the lines of collection1/document1/collection2/document2. So, how should we go about storing the election data in Firestore?
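As an aside, connecting to one of the named databases mentioned earlier is straightforward with FirestoreDbBuilder; the IDs here are placeholders:

// Illustrative: project and database IDs are placeholders.
using Google.Cloud.Firestore;

FirestoreDb db = new FirestoreDbBuilder
{
    ProjectId = "my-project-id",
    DatabaseId = "staging", // the default database ID is "(default)"
}.Build();

// Paths then address documents within collections:
DocumentReference document = db.Collection("collection1").Document("document1");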

There are four important requirements:

  • It must be reasonably fast to load the entire context on startup
  • It must be fast to load a new context after changes, using whatever in-memory caching we want
  • We must be consistent: when a new context is available, loading it should load everything from the new context. (We don't want to start loading a new context before it's finished being stored, for example.)
  • I'm a cheapskate: it must be cheap to run, even if the site becomes popular

First, let's consider the "chunkiness" of our documents. Firestore pricing is primarily based (at least in our case) on the number of document reads we perform. One option would be to make each object in each collection its own document - so there'd be one document per constituency, one per party, one per candidate, one per result, etc. That would involve reading thousands of documents when starting the server, which could end up being relatively expensive - and I'd expect it to be slower than reading the same amount of data in a much smaller number of documents.

At the other end of the chunkiness spectrum, we could try storing the whole context in a single document. That wouldn't work due to the Firestore storage limits: each document (not including documents in nested collections) has a maximum size of about 1MB. (See the docs for details of how the document size is calculated.)

Handling the data in Firestore roughly the same way as on disk works pretty well - putting each collection in its own document. The only document I'd be nervous about here is the collection of projection sets. Each projection set has details of the projections for up to 650 constituencies (more commonly 632, as few projection sets include projections for Northern Ireland). Right now, that's fine - but if there are a lot of projection sets before the election, we could hit the 1MB document limit. I'm sure there would be ways I could optimize how much data is stored - there's a lot of redundancy in party IDs, field names etc - but I'd prefer not to get into that if I could avoid it. So instead, I've chosen to store each projection set in its own document.

So, how should we store contexts? Initially I just used a static set of document names (typically an all-values document within a Firestore collection named after the collection in the context - so results-2024/all-values, results-2029/all-values etc) but that doesn't satisfy the consistency requirement... at least not in a trivial way. Firestore supports transactions, so in theory I could commit all the data for a new context at the same time, and then read within a transaction as well, to effectively get a consistent snapshot at a point in time. I'm sure that would work, but it would be at least a little more fiddly to use transactions than to not do so.

Another option would be to have a collection per context - so whenever I made a change at all, I'd create a new collection, with a complete set of documents within that collection to represent the context. We'd still need to work out how to avoid starting to read a context while it's being stored, but at least we wouldn't be modifying any documents, so it's easier to reason about. Again, I'm sure that would work - but it feels really wasteful when we consider one really important point about the data:

Almost all the data changes at a glacial pace.

There are rarely new parties, new constituencies, new by-elections etc. The data for the 2019 and 2024 elections isn't going to change at this point, modulo schema changes etc. Existing polls and projection sets are almost never modified (although it can happen, if I had a bug in the processing code or some bad metadata). In other words, the vast majority of the data in a "new" context is the same as in the "old" context.

Manifest-based storage

Thinking about this, I hit on the idea of having a "manifest" for a context, and storing that in one document. Each collection in the context is still a single document, but the name of that document is based on a hash of its content (SHA-256 in the current implementation) - and the manifest records which documents are in the context. When storing a new context, if a document with the same name as the one we would write already exists, we can just skip it. (I'm assuming that there won't be any hash collisions - which feels reasonable.) So, for example, the common operation of "add a poll" only needs to write a new document for "all the polls" (which will admittedly be largely the same as the previous "all the polls" document) and then write a new manifest which refers to that new document. For projection sets, the manifest records a list of documents instead of a single document.
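In code, the storage side of that idea looks something like the following sketch - the Firestore collection names ("documents" here, "manifests" later) and the Manifest shape are illustrative rather than the actual implementation:

// Illustrative shapes and collection names - not the actual implementation.
using System.Security.Cryptography;
using System.Text;
using Google.Cloud.Firestore;

[FirestoreData]
public class Manifest
{
    // Maps each context collection name (e.g. "polls") to the hash-named
    // document(s) holding its data; projection sets have several documents.
    [FirestoreProperty]
    public Dictionary<string, List<string>> Documents { get; set; } = new();

    [FirestoreProperty]
    public Timestamp Created { get; set; }
}

public static class ManifestStorage
{
    // Stores one serialized collection, skipping the write when a document
    // with the same hash-based name already exists.
    public static async Task<string> StoreCollectionAsync(FirestoreDb db, string json)
    {
        string docName = Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(json)));
        DocumentReference docRef = db.Collection("documents").Document(docName);
        DocumentSnapshot snapshot = await docRef.GetSnapshotAsync();
        if (!snapshot.Exists)
        {
            await docRef.SetAsync(new Dictionary<string, object> { ["Json"] = json });
        }
        return docName;
    }
}

Storing a whole context is then essentially a matter of storing each collection, collecting the resulting document names into a new manifest, and writing the manifest document last - so a manifest never refers to documents which don't exist yet.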

Loading a context from scratch requires loading the manifest and then loading all the documents it refers to. But importantly, loading a new context can be really efficient, as sketched below:

  • See whether there's a new manifest. If there isn't, we're done.
  • Otherwise, load the new manifest.
  • For each document in the new manifest, see whether that's the same document that was in the old manifest. If it is, the data hasn't changed so we can use the old data model, adapted to update any cross-references. (More on that below.) If it's actually a new document, we need to deserialize it as normal.
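Here's that sketch for a single collection - again with illustrative names, and ignoring the cross-reference adaptation for now:

// Illustrative sketch: reuse the old models when the new manifest names the
// same documents; otherwise fetch and deserialize the new documents.
using System.Collections.Immutable;
using Google.Cloud.Firestore;
using Newtonsoft.Json;

public static class ContextLoading
{
    public static async Task<ImmutableList<T>> LoadOrReuseAsync<T>(
        FirestoreDb db, string name,
        Manifest? oldManifest, Manifest newManifest,
        ImmutableList<T>? oldModels)
    {
        List<string> newDocs = newManifest.Documents[name];
        if (oldManifest is not null && oldModels is not null &&
            oldManifest.Documents.TryGetValue(name, out var oldDocs) &&
            oldDocs.SequenceEqual(newDocs))
        {
            // Same document names imply identical stored data.
            return oldModels;
        }
        var builder = ImmutableList.CreateBuilder<T>();
        foreach (string docName in newDocs)
        {
            DocumentSnapshot snapshot =
                await db.Collection("documents").Document(docName).GetSnapshotAsync();
            builder.AddRange(JsonConvert.DeserializeObject<List<T>>(
                snapshot.GetValue<string>("Json"))!);
        }
        return builder.ToImmutable();
    }
}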

This leaves garbage" documents in terms of old manifests, and old documents which are only referenced by old manifests. These don't actually do any harm - the amount of storage taken is really pretty minimal - but they do make it slightly more difficult to examine the data in the Firestore console. Fortunately, it's simple to just prune storage periodically: delete any manifests which were created more than 10 minutes ago (just in case any server is still in the process of loading data from recently-created-but-not-latest manifests), and delete any documents which aren't referenced from the remaining manifests. This is the only time that code accessing storage needs to know which implementation it's talking to - because the pruning operation only exists for Firestore.

Cross-referencing

As discussed in the previous post, there are effectively two layers of data models in the context: core and non-core. Core models are loaded first, and don't refer to each other. Non-core models don't refer to each other, but can refer to core models.

If any of the core collections has changed in the manifest, we need to create a new core context. If none of them has changed, we can keep the whole old" core context as it is (i.e. reuse the existing ElectionCoreContext object). In that case, any non-core models which haven't changed in storage can be reused as well - all their references will still be valid.

If the core context has changed - so we've got a new ElectionCoreContext - then any non-core models which haven't changed in storage need to be recreated to refer to the new core context. (For example, if a candidate were to change name, then displaying the 2024 election results should show the new name, even though the results themselves haven't changed.)

This is pretty easy to do in code, and in a generic way, by introducing two interfaces and some extension methods:

public interface ICoreModel<T>
{
    /// <summary>
    /// Returns the equivalent model (e.g. with the same ID) from a different core context.
    /// </summary>
    T InCoreContext(ElectionCoreContext coreContext);
}

public interface INonCoreModel<T>
{
    /// <summary>
    /// Returns a new but equivalent model which references elements from the given core context.
    /// </summary>
    T WithCoreContext(ElectionCoreContext coreContext);
}

public static class ContextModels
{
    [return: NotNullIfNotNull(nameof(list))]
    public static ImmutableList<T> InCoreContext<T>(
        this ImmutableList<T> list, ElectionCoreContext coreContext)
        where T : ICoreModel<T> =>
        [.. list.Select(m => m.InCoreContext(coreContext))];

    [return: NotNullIfNotNull(nameof(list))]
    public static ImmutableList<T> WithCoreContext<T>(
        this ImmutableList<T> list, ElectionCoreContext coreContext)
        where T : INonCoreModel<T> =>
        [.. list.Select(m => m.WithCoreContext(coreContext))];
}

We end up with quite a lot of somewhat "boilerplate" code to implement these interfaces. Result is a simple example of this, implementing WithCoreContext by calling InCoreContext on its direct core context dependencies, and WithCoreContext for each CandidateResult. All standalone data (strings, numbers, dates, timestamps etc) can just be copied directly:

public Result WithCoreContext(ElectionCoreContext coreContext) =>
    new(Constituency.InCoreContext(coreContext),
        Date,
        WinningParty.InCoreContext(coreContext),
        CandidateResults?.WithCoreContext(coreContext),
        SpoiltBallots,
        RegisteredVoters,
        IngestionTime,
        DeclarationTime);
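Putting the pieces together, the decision about whether to reuse, adapt or reload each non-core collection boils down to a three-way choice, roughly like this sketch (the helper and its parameters are illustrative):

// Illustrative helper showing the three-way reuse decision.
using System.Collections.Immutable;

public static class ContextReuse
{
    // storageChanged: the new manifest names different documents for this collection.
    // coreChanged: a new ElectionCoreContext was created.
    public static ImmutableList<T> ReuseOrRebuild<T>(
        bool storageChanged, bool coreChanged,
        ImmutableList<T> oldModels, ElectionCoreContext newCore,
        Func<ImmutableList<T>> deserialize)
        where T : INonCoreModel<T> =>
        storageChanged ? deserialize()                       // genuinely new data
        : coreChanged ? oldModels.WithCoreContext(newCore)   // same data, new references
        : oldModels;                                         // fully reusable as-is
}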

There's an alternative to this, of course. We could recreate the whole ElectionContext from scratch each time we load a new manifest. We could keep a cache of all the documents referenced by the manifest so that we wouldn't need to actually hit storage again, and just deserialize. That would probably involve fewer lines of code - but it would possibly be more fiddly code to get right. It would also probably be somewhat less efficient in terms of memory and CPU, but I'm frankly not overly bothered about that. Even on election night, it's unlikely that we'll create new manifests more than once every 30 seconds.

Conclusion

It's possible that all of this will still change, but at the moment I'm really happy with how the data is stored. There's nothing particularly innovative about it, but the characteristics of the Firestore storage are really nice - even if the site ends up being really popular and Cloud Run spins up several instances, I don't expect to end up paying significant amounts for Firestore across the entire lifetime of the site. To put it another way: the expected cost is so low that any effort in optimizing it further would almost certainly cost more in terms of time than it would achieve in savings. (Admittedly the "cost" of time for a hobby project where I'm enjoying spending the time anyway is hard to quantify.) Of course, I'm keeping an eye on my GCP expenditure over time to make sure that reality matches the theory here.

I believe I could implement the Firestore approach to storage in Google Cloud Storage (blob storage) almost trivially - especially given that serialization to JSON is already implemented for local storage. I could use separate storage buckets for different environments, in the same way that I use separate databases in Firestore. I have no particular reason to do so other than curiosity as to whether it would be as easy as I'm imagining. (I guess it might be interesting to look at any performance differences in terms of I/O as well.) Migrating away from Firestore entirely would mean I only ever had to deal with one serialization format (JSON) which in turn would potentially allow me to reconsider the design decision of having separate storage classes - at least for some types. That's not currently enough of a motivation to move, but who knows what changes the next four years might bring.

Having looked at how we're storing models, the next post will be about records and collections.
