News Roundup: Excellent Data Gathering
In a global health crisis, like, say, a pandemic, accurate and complete data about its spread is a "must-have". Which is why, in the UK, there's a great deal of head-scratching over how the government "lost" thousands of cases.
Oops.
Normally, we don't dig too deeply into current events on this site, but the circumstances here are too "WTF" to allow them to pass without comment.
From the BBC, we know that this system was used to notify individuals that they had tested positive for COVID-19, and to notify their close contacts that they had been exposed. That last bit is important. Disease spread can quickly turn exponential, and even though COVID-19 has a relatively low fatality rate, the law of large numbers means that a lot of people will still die on that exponential curve. If you can track exposure, and get exposed individuals tested and isolated before they spread the disease further, you can significantly cut down its spread.
People are rightfully pretty upset about this mistake. Fortunately, the BBC has a followup article discussing the investigation, where an analyst explores what actually happened, and as it turns out, we're looking at an abuse of everyone's favorite data analytics tool: Microsoft Excel.
The companies administering the tests compiled their data as plain-text files, which appear to be CSVs. No real surprise there. Each test created multiple rows within the file. Then the people working for Public Health England imported that data into Excel... as .xls files.
.xls is the old Excel format, dating back to the days before Excel 2007 and retained for backwards compatibility. While a modern .xlsx worksheet can hold a little over a million rows (1,048,576), the older format caps out at 65,536.
So: these clerks imported the CSV file, hit "save as...", made a .xls, and ended up truncating the results. Given that these input datasets had multiple rows per test, "in practice it meant that each template was limited to about 1,400 cases."
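To make the failure mode concrete, here's a minimal sketch of the arithmetic, assuming a hypothetical number of rows per case (the article doesn't say exactly how many rows each test produced):

```python
import csv

XLS_MAX_ROWS = 65_536  # hard row cap of a legacy .xls worksheet

def import_then_save_as_xls(csv_path: str) -> list[list[str]]:
    """Mimic the outcome of stuffing an oversized CSV into a single .xls
    worksheet: rows beyond the cap simply don't make it into the file."""
    with open(csv_path, newline="") as f:
        rows = list(csv.reader(f))
    return rows[:XLS_MAX_ROWS]  # everything past row 65,536 is lost

# If each positive test produced N rows in the feed, the effective
# per-file case limit collapses to XLS_MAX_ROWS // N.
ROWS_PER_CASE = 46                    # hypothetical figure, not from the article
print(XLS_MAX_ROWS // ROWS_PER_CASE)  # 1424 -- roughly the "about 1,400 cases"
```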
Again, "oops".
I've discussed how much users really want to do everything in Excel, and that's clearly what happened here. The users had one tool, Excel, and when the only tool you have is a hammer, every problem looks like a nail. Arcane technical details, like how many rows different versions of Excel may or may not support, aren't things it's fair to expect your average data-entry clerk to know.
On another level, this is a clear failing of IT services. Excel was not the right tool for this job, but in the middle of a pandemic, no one was entirely sure what they needed. Excel becomes a tempting tool, because pretty much any end user can take complicated data and slice/shape/chart/analyze it however they like. There's a good reason why they want to use Excel for everything: it's empowering to the users. When they have an idea for a new report, they don't need to go through six levels of IT management, file requests in triplicate, and run a testing and approval cycle to ensure the report meets the specifications. They just... make it.
There are packaged tools that offer similar, purpose-built functionality while still giving users all the flexibility they could want for slicing data. But they're expensive, and many organizations (especially government offices) are stingy about who gets a license. They may or may not be easy to use. And of course, the time to procure such a thing was in the years before a massive virus outbreak. Excel is already there, on everyone's computer, and does what they need.
Still, they made the mistake, they saw the consequences, and now that they know, they'll definitely start taking steps to correct the problem, right? They know that Excel isn't fit for purpose, so they're switching tools, right?
From the BBC:
To handle the problem, PHE is now breaking down the data into smaller batches to create a larger number of Excel templates in order to make sure none hit their cap.
Oops.
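For what it's worth, the workaround the BBC describes boils down to something like this sketch: a hypothetical batch-splitter, not PHE's actual tooling, that chops one big CSV into files that stay under the cap.

```python
import csv
from pathlib import Path

XLS_MAX_ROWS = 65_536  # the cap is still there; we're just tiptoeing around it

def split_into_batches(csv_path: str, out_dir: str, cap: int = XLS_MAX_ROWS) -> None:
    """Split one big CSV into several smaller ones, each safely under the row cap."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(csv_path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        batch: list[list[str]] = []
        index = 0
        for row in reader:
            batch.append(row)
            if len(batch) >= cap - 1:  # leave room for the header row
                _write_batch(out, index, header, batch)
                batch, index = [], index + 1
        if batch:
            _write_batch(out, index, header, batch)

def _write_batch(out: Path, index: int, header: list[str], rows: list[list[str]]) -> None:
    with open(out / f"batch_{index:03d}.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
```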