Election 2029: Postcodes

After a pretty practical previous post about records and collections, this post is less likely to give anyone ideas about how they might tackle a problem in their own project, and doesn't have any feature requests for Microsoft either. It's an area I've found really fun though.
An introduction to postcodes in the UK

I realise that most of the readers of this blog post will probably not be in the UK. Most countries have postal codes of some description, but the details vary a lot by country. UK postcodes are quite different in scale to US ZIP codes, for example. A UK postcode is quite fine-grained - often covering just part of a single street - so knowing a house/flat number and a postcode is usually enough to get to a precise address.
Importantly for my election site, every postcode is in a single constituency - and indeed that's what I want to use them for. My constituencies page allows you to start typing a constituency name or postcode, and it filters the results as you type. I suspect that a significant proportion of the UK population knows their postcode but not the name of their constituency (particularly after the boundary changes in 2023) so it's helpful to be able to specify either.
Wikipedia has a lot more information about UK postcodes, but I'll summarize it briefly here. Note that I only care about modern postcodes - the Wikipedia article has details about the history of how things evolved, but my site doesn't need to care about legacy postcodes.
A postcode consists of an outcode (or outward code) followed by an incode (or inward code). The outcode is an area followed by a district, and the incode is a sector followed by a unit. As an example, my postcode is RG30 4TT, which breaks down into:
- Outcode: RG30
- Area: RG
- District: 30
- Incode: 4TT
- Sector: 4
- Unit: TT
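To make that concrete, here's a minimal sketch (not the site's actual code) of splitting a complete postcode into those parts, relying on the incode always being the final three characters:

```csharp
// Minimal sketch, assuming the input has already been validated and
// normalized (uppercase, no space), e.g. "RG304TT".
static (string Area, string District, string Sector, string Unit) SplitPostcode(string postcode)
{
    // The incode is always the final three characters (digit + two letters),
    // so everything before that is the outcode.
    string outcode = postcode[..^3];
    string incode = postcode[^3..];

    // The area is the leading letters of the outcode; the district is the rest.
    int areaLength = outcode[1] is >= 'A' and <= 'Z' ? 2 : 1;
    return (outcode[..areaLength], outcode[areaLength..], incode[..1], incode[1..]);
}
```

The same logic copes with single-letter areas: "W1A0AX" splits into area "W", district "1A", sector "0", unit "AX".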
Incodes are nice and uniform: they're always a digit followed by two letters. (And those two letters are drawn from a 20-character alphabet.) Outcodes are more interesting, as they fall into one of the six formats below, where '9' represents any digit and 'A' represents a letter (although the alphabet varies):
- AA9
- AA99
- A9
- A99
- A9A
- AA9A
Note how the first two formats for outcodes only vary by whether they have one or two digits - and remember that an incode always starts with a digit. This isn't a problem when parsing text that is intended to represent a complete postcode (as you can tell the length of the outcode by assuming the final three characters are the incode) - but when I need to parse an incomplete postcode, it can be ambiguous. For example, "RG31 4FG" and "RG3 1AA" are both valid postcodes, and as I don't want to force users to type the space, "RG31" should display constituencies for both ("Reading West and Mid Berkshire" and "Reading Central" respectively).
Requirements

The requirements for my use of postcodes in the election site are pretty straightforward:
- Ingest data from an external source (the Office for National Statistics was what I found)
- Store everything we need in a compact format, ideally in a single Firestore document (so at most 1MB)
- Keep the data in memory, still in a reasonably compact format
- Provide an API which takes a postcode prefix as input and returns the set of all constituencies covered by postcodes starting with that prefix, sufficiently quickly for it to feel "instant"
Interestingly, Wikipedia refers to "a table of all 1.7 million postcodes", but the file I ingest currently has 2,712,507 entries. I filter out a few, but I still end up with 2,687,933 distinct postcodes. I don't know whether the Wikipedia number is just a typo, or whether there's a genuine discrepancy there.
Spoiler alert: the network latency easily dominates the latency for making a request to the API. Using airport wifi in Munich (which is pretty good for free wifi) I see a roundtrip time of about 60ms, and the page feels like it's updating pretty much instantaneously as I type. But the time within ASP.NET Core is less than a millisecond (and that's including filtering by constituency name as well).
Storage and in-memory formats

Logically, the input data is just a sequence of "postcode, constituency code" entries. A constituency code is 9 characters long. That means a very naive representation of 2,712,507 entries, each taking 16 characters (an average of 7 characters for the postcode, plus 9 characters for the constituency code), would be just under 42MB - and that's before we take into account separating the entries from each other. Surely we can do better.
There's a huge amount of redundancy here - we're bound to be able to do better. For a start, my naive calculation assumed using a whole byte per character, even though every character we need is in the range A-Z, 0-9 (so an alphabet of 36 characters) - and several of the characters within those values are required to just be digits.
Grouping by outcode

But before we start writing complex code to represent the entry string values in fewer bytes, there's a lot more redundancy in the actual data. It's not like postcodes are randomly allocated across all constituencies. Most postcodes within the same sector (e.g. "RG30 4") will be in the same constituency. Even when we group by outcode (e.g. "RG30"), there are relatively few constituencies represented in each group.
At the time of writing, there are 3077 outcodes:
- 851 are all in 1 constituency
- 975 are in 2 constituencies
- 774 are in 3
- 348 are in 4
- 98 are in 5
- 21 are in 6
- 8 are in 7
- 2 are in 8
We also want to differentiate between invalid and valid postcodes - so we can think of "no constituency at all" as an additional constituency in each of the lines above.
In each outcode, there are 4000 possible incodes, which are always the same (0AA, 0AB ... 9ZY, 9ZZ). So for each outcode, we just need an array of 4000 values.
A simple representation with a byte per value would be 4000 bytes per map, with 3077 such maps (one per outcode) - that's 12,308,000 bytes.
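Since there are 10 × 20 × 20 = 4000 possible incodes, each incode maps naturally onto an index into that array. Here's a sketch of that mapping; the exact 20-letter alphabet is my assumption, based on the standard rule that the final two letters of a postcode never use C, I, K, M, O or V:

```csharp
// Assumed 20-letter alphabet for the final two incode characters
// (i.e. A-Z minus C, I, K, M, O and V).
const string IncodeLetters = "ABDEFGHJLNPQRSTUWXYZ";

// Maps a three-character incode (e.g. "4TT") to an index in [0, 4000).
static int IncodeToIndex(string incode)
{
    int digit = incode[0] - '0';
    int first = IncodeLetters.IndexOf(incode[1]);
    int second = IncodeLetters.IndexOf(incode[2]);
    // 10 digits * 20 letters * 20 letters = 4000 possible incodes;
    // "0AA" maps to 0 and "9ZZ" maps to 3999.
    return (digit * 20 + first) * 20 + second;
}
```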
Compressing the maps

Can we do better by using the fact that we don't really need a whole byte for each value? For the 851 outcodes in a single constituency, each of those values would be 0 or 1. For the 975 outcodes in two constituencies, each value would be 0, 1 or 2 - and so on. Even without splitting values across byte boundaries, we could store those maps as:
- 851 maps with a single bit per value (500 bytes per map)
- 975 + 774 maps with two bits per value (1000 bytes per map)
- The remaining (348 + 98 + 21 + 8 + 2) maps with four bits per value (2000 bytes per map)
That's a total of (851*500) + ((975 + 774) * 1000) + ((348 + 98 + 21 + 8 + 2) * 2000) = 3,128,500 bytes. That sounds great - it's saved three quarters of the space needed for the maps! But it's still well over the maximum size of a Firestore document. And, of course, for each outcode we also need the outcode text, and the list of constituencies represented in the map. That shouldn't be much, but it doesn't help an already-blown size budget.
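For illustration, here's a sketch of the two-bit case, packing four values into each byte. This is the general style of packing involved, not the site's actual implementation:

```csharp
// Sketch: pack a 4000-entry map of values in [0, 3] into 1000 bytes,
// storing four two-bit values per byte.
static byte[] PackTwoBit(byte[] map)
{
    var packed = new byte[map.Length / 4];
    for (int i = 0; i < map.Length; i++)
    {
        packed[i / 4] |= (byte) (map[i] << ((i % 4) * 2));
    }
    return packed;
}

// Reads back the two-bit value at the given index.
static byte UnpackTwoBit(byte[] packed, int index) =>
    (byte) ((packed[index / 4] >> ((index % 4) * 2)) & 0b11);
```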
At this point, I tried multiple options, including grouping by sector instead of outcode (as few outcodes actually have all ten possible sectors, and within a sector there are likely to be fewer constituencies, so we'd have more maps which could be represented by one or two bits per value). I tried a hand-rolled run-length encoding and various other schemes. I did manage to get things down quite a lot - but the code became increasingly complicated. Fundamentally, what I was doing was performing compression, and trying to apply some knowledge about the data to achieve a decent level of compression.
Around this point, I decided to stop trying to make the code smarter, and instead try making the code simpler and lean on existing code being smart. The simple 4000-byte maps contain a lot of redundancy. Rather than squeezing that redundancy out manually, I tried general-purpose compression - using DeflateStream for both compression and decompression. The result was beyond what I'd hoped: the total size of the compressed maps is just 500,602 bytes. That leaves plenty of room for the extra information on a per-outcode basis, while staying well under the 1MB document limit.
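The round trip really is short. Here's a sketch of the sort of code involved, assuming each map is a 4000-byte array; this uses the standard System.IO.Compression types rather than being the site's exact implementation:

```csharp
using System.IO;
using System.IO.Compression;

// Compresses a single 4000-byte outcode map with Deflate.
static byte[] Compress(byte[] map)
{
    var output = new MemoryStream();
    using (var deflate = new DeflateStream(output, CompressionLevel.Optimal))
    {
        deflate.Write(map, 0, map.Length);
    }
    return output.ToArray();
}

// Decompresses a map back to its original 4000 bytes.
static byte[] Decompress(byte[] compressed)
{
    using var deflate = new DeflateStream(new MemoryStream(compressed), CompressionMode.Decompress);
    var output = new MemoryStream();
    deflate.CopyTo(output);
    return output.ToArray();
}
```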
I would note that when I wrote about storage options, I mentioned the possibility of using Google Cloud Storage instead of Firestore - at which point there'd really be no need for this compression at all. While it would be nice to remove the compression code, it's a very isolated piece of complexity - which is only actually about 15 lines of code in the end.
Processing a request

So with the storage requirements satisfied, are we done? Pretty much, it turns out. I keep everything uncompressed in memory, with each per-outcode map storing a byte per incode - leading to just over 12MB in memory, as shown earlier. That's fine. Each per-outcode map is stored as an OutcodeMapping.
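As a simplified sketch of the idea (not the exact class):

```csharp
using System.Collections.Generic;

// Simplified sketch of the per-outcode data.
public sealed class OutcodeMapping
{
    public string Outcode { get; }                       // e.g. "RG30"
    public IReadOnlyList<string> Constituencies { get; } // constituency codes present in this outcode

    // 4000 entries, one per possible incode: 0 means "invalid postcode",
    // any other value is a 1-based index into Constituencies.
    private readonly byte[] map;

    public OutcodeMapping(string outcode, IReadOnlyList<string> constituencies, byte[] map)
    {
        Outcode = outcode;
        Constituencies = constituencies;
        this.map = map;
    }

    public byte this[int incodeIndex] => map[incodeIndex];
}
```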
The process of converting a prefix into a set of possible constituencies is fiddly simply due to the ambiguity in the format. I don't return anything for a single character - it's reasonable to require users to type in two characters of a postcode before we start doing anything with it.
It's easy to build a dictionary from "first two characters" to "list of possible OutcodeMapping entries". That list is always reasonably short, and can be pruned down very quickly to only the entries that actually match the supplied prefix. For each matching OutcodeMapping, we then need to find the constituencies that might be represented by the incode part of whatever's left of the supplied prefix.
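As a sketch of that pruning, building on the simplified OutcodeMapping above (prefixes are assumed to be normalized to uppercase with no space, and to be at least two characters long):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Sketch: index the ~3077 OutcodeMappings by their first two characters.
// 'mappings' is the full collection loaded at startup.
Dictionary<string, List<OutcodeMapping>> byFirstTwo = mappings
    .GroupBy(m => m.Outcode[..2])
    .ToDictionary(g => g.Key, g => g.ToList());

// An outcode is a candidate if it continues the prefix, or if the prefix
// continues the outcode (in which case the remainder is an incode prefix).
List<OutcodeMapping> Candidates(string prefix) =>
    byFirstTwo.TryGetValue(prefix[..2], out var list)
        ? list.Where(m => m.Outcode.StartsWith(prefix, StringComparison.Ordinal) ||
                          prefix.StartsWith(m.Outcode, StringComparison.Ordinal))
              .ToList()
        : new List<OutcodeMapping>();
```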
For example, if the user has typed "RG30" then we'll quickly filter down to "RG3" and "RG30" as matching outcodes. From that, we need to ask:
- Which constituencies are represented by incodes for RG3 which start with "0"?
- Which constituencies are represented by incodes for RG30? (We've already consumed the whole of the supplied prefix for that outcode.)
Splitting the prefix into outcode and incode makes life much simpler, because the incode format is so simple. For example, if the user had typed in "RG30 4" then the questions above become:
- Which constituencies are represented by incodes for RG3 which start with "04"?
- There can't be any of those, because the second character of an incode is never a digit.
- Which constituencies are represented by incodes for RG30 which start with "4"?
We always have zero, one, two or three characters to process within an outcode mapping (see the sketch after this list):
- Zero characters: the full set of constituencies for this outcode
- One character: the set of constituencies for that "sector" (which we can cache really easily; there are only ten sectors per outcode)
- Two characters: the fiddliest one - look through all 20 possibilities for that "half unit" in the map
- Three characters: a single postcode, so just look it up in the map; the result will be either nothing (if the postcode is invalid) or a single-entry result
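Putting those four cases together, and reusing IncodeLetters and IncodeToIndex from the earlier sketch, the lookup within a single outcode map might look something like this (again a sketch operating on the raw 4000-byte map, with input assumed to be pre-validated):

```csharp
using System.Collections.Generic;
using System.Linq;

// Sketch: given one outcode's 4000-entry map (0 = invalid postcode,
// otherwise a 1-based constituency index) and the part of the user's
// prefix left over after the outcode, find the matching constituency
// indexes. 'rest' is assumed to be a valid incode prefix.
static IEnumerable<byte> MatchIncode(byte[] map, string rest)
{
    switch (rest.Length)
    {
        case 0: // The full set for this outcode.
            return DistinctValues(map, 0, 4000);
        case 1: // A sector digit: one 400-entry slice (only ten per outcode, so easily cached).
            return DistinctValues(map, (rest[0] - '0') * 400, 400);
        case 2: // A "half unit": the 20 possible final letters.
            return DistinctValues(map, (rest[0] - '0') * 400 + IncodeLetters.IndexOf(rest[1]) * 20, 20);
        case 3: // A complete postcode: zero or one result.
            byte value = map[IncodeToIndex(rest)];
            return value == 0 ? Enumerable.Empty<byte>() : new[] { value };
        default:
            return Enumerable.Empty<byte>();
    }
}

// Distinct non-zero values within map[start, start + count).
static IEnumerable<byte> DistinctValues(byte[] map, int start, int count) =>
    map.Skip(start).Take(count).Where(v => v != 0).Distinct();
```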
I suspect that there's more efficiency that could be squeezed out of my implementation here - but it's already fast enough that I really don't care. At some point I might do a load test to see how many requests I can process at a time while still keeping the latency low... but I'd be very surprised if this caused a problem. (And hey, Cloud Run will just spin up some more instances if necessary.)
Conclusion

There are two things I love about this topic. Firstly, it's so isolated. You don't need to know anything about the rest of the site, election polls, parties etc to understand this small domain. You need to know about postcodes and constituencies, but that's it. Aside from anything else, that makes it a pleasant topic to write about. Secondly, the results seem so wildly impressive, with relatively little effort. In around half a megabyte of data, we're storing information about over 2.7 million postcodes, and the final lookup is dirt cheap.
There's an interesting not-quite-contradiction in the approaches taken in the post, too.
Firstly, even though the data is (sort of) a mapping from a 6- or 7-character input (postcode) to a 9-character output (constituency code), it's very far from random. There are only 650 constituencies, and the constituencies are obviously very "clustered" with respect to the postcodes (so two postcodes which only differ in the fifth or later character are very likely to belong to the same constituency). The approach I've taken uses what we know about the data as a first pass - so that aspect is very domain-specific. Just compressing a file of "postcode constituency-code" lines only gets us down to just over 7MB.
In contrast, when we've got as far as "for each outcode, we have a map of 4000 entries, each mapping to a number between 0 and 8 inclusive" (with awareness that for many outcodes, the numbers are in the range 0-1 or 0-2), trying to be "clever" didn't work out terribly well. The code became increasingly complicated as I tried to pack as much information as possible into each byte... whereas when I applied domain-agnostic compression to each map, the results were great.
I don't think it's particularly rare to need to find that balance between writing code to try to take advantage of known data characteristics, and relying on existing domain-agnostic algorithms. This is just one of the clearest examples of it that I've come across.