The Fallacies of Big Data

in internet on 2014-03-30 19:00 (#3H5)

It's been almost ten years since two Google engineers published a paper describing the architecture of MapReduce, a framework for simplifying the development and deployment of algorithms that process terabytes or petabytes of data across a cluster of commodity servers. The open source community soon responded with Hadoop, a MapReduce work-alike, and in the following years it seems that most large IT organizations, and many startups, have jumped on the bandwagon, pitching the virtues of Big Data, Hadoop and/or NoSQL as a revolutionary set of techniques for capturing actionable trends and correlations from the firehose of real-time data (clickstream, Twitter feeds, Facebook likes, server logs, sensor and surveillance data, mobile call events, and of course, all the stuff the NSA looks at).

Tim Harford of the Financial Times points out that this methodology is subject to various types of sampling bias, even in cases where the more enthusiastic proponents claim to be 'observing the entire population, not just a statistical sample'. First, data collected from social media or smartphone apps is heavily biased by the user profile of those technologies, which is disproportionately young, affluent, and urban or suburban. Harford mentions the famous case of the Literary Digest, a well-established magazine that forecast a landslide victory for Alf Landon in the 1936 US Presidential election, based on a massive poll of one out of five eligible voters, whose contact information was pulled from telephone subscriber lists. (Landon lost the election to Franklin Roosevelt, who carried all but two of the 48 states; Literary Digest ceased publication soon afterwards.)
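To make the Literary Digest problem concrete, here is a minimal simulation sketch. Every number in it (the share of voters with a phone and the support rates in each group) is invented for illustration and does not come from Harford's article or from the 1936 data; the function name simulate_poll is likewise hypothetical. It only shows the mechanism: polling a huge but unrepresentative slice of the population gives a confident wrong answer.

```python
import random


def simulate_poll(population_size=200_000, seed=0):
    """Compare a candidate's true vote share with a phone-owners-only poll."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable

    # All three rates below are invented for illustration only.
    phone_rate = 0.25          # share of voters who appear on phone lists
    support_with_phone = 0.60  # support for candidate A among phone owners
    support_without = 0.30     # support for candidate A among everyone else

    votes_a = 0          # votes for A across the whole electorate
    polled = 0           # voters reachable through the phone lists
    polled_votes_a = 0   # votes for A among the reachable voters

    for _ in range(population_size):
        has_phone = rng.random() < phone_rate
        p = support_with_phone if has_phone else support_without
        votes_for_a = rng.random() < p
        votes_a += votes_for_a
        if has_phone:
            polled += 1
            polled_votes_a += votes_for_a

    # (true share of the electorate, share estimated from phone owners only)
    return votes_a / population_size, polled_votes_a / polled


if __name__ == "__main__":
    true_share, poll_share = simulate_poll()
    print(f"true share for A:       {true_share:.3f}")
    print(f"phone-list poll says:   {poll_share:.3f}")
```

With these made-up rates the expected true share is 0.25 * 0.60 + 0.75 * 0.30 = 0.375, while the poll converges on about 0.60. Making the poll bigger does not help: a larger sample of phone owners just estimates the phone owners' opinion more precisely.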

Second, people adjust their behavior over time with respect to various topics in the news. The sudden increases in flu-related searches that made Google Flu Trends look very prescient five winters ago turned into a debacle four years later, when Google used similar data to warn of a severe flu outbreak: once the curated data from the CDC finally came in, the flu season turned out to be average.

What about the famous anecdote about Target finding out that a teenage customer was pregnant before her dad did? Maybe so, says a researcher quoted in Harford's article, but there's an issue with false positives. The world likely didn't hear about other Target customers who got pregnancy-related marketing materials they wouldn't have any use for.
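The false-positive issue is easy to see with a little arithmetic. The sketch below uses three made-up inputs (how common pregnancy is among the scored customers, how often the model catches a real pregnancy, and how often it flags someone who is not pregnant); none of these figures come from Target or from Harford's article, and the function name is hypothetical.

```python
def flagged_and_correct(prevalence, sensitivity, false_positive_rate):
    """Fraction of flagged customers who really are pregnant (precision)."""
    true_pos = prevalence * sensitivity                 # pregnant and flagged
    false_pos = (1 - prevalence) * false_positive_rate  # not pregnant, flagged anyway
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Invented inputs: 2% of customers pregnant, 80% of them detected,
    # 5% of everyone else flagged by mistake.
    share = flagged_and_correct(prevalence=0.02,
                                sensitivity=0.80,
                                false_positive_rate=0.05)
    print(f"share of flagged customers actually pregnant: {share:.1%}")
```

Under those assumptions the result is 0.016 / (0.016 + 0.049), or roughly one in four. A model can catch most real cases and still be wrong about most of the people it flags, simply because the people it is looking for are rare.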

7 comments

I'm glad to see this kind of criticism (Score: 4, Insightful)

by danieldvorkin@pipedot.org on 2014-03-30 20:12 (#W2)

As a biostatistician working in bioinformatics, I'm well aware of both how powerful and how dangerous "big data" can be--powerful because it can tell you things you couldn't discover any other way, dangerous because it will tell you all kinds of things that aren't true unless you're very, very careful with your analysis. A lot of the people talking about "big data" and "data science" and all the rest of it sound like teenagers who have just got their driver's licenses.

Re: I'm glad to see this kind of criticism (Score: 2, Insightful)

by marqueeblink@pipedot.org on 2014-03-31 03:01 (#W4)

So you're saying BD can be useful for exploratory data analysis to suggest avenues for investigation using more traditional methods. I remember they used to say the same about data mining before Big Data became the buzzword.

Maybe Big Data is just the reincarnation of data mining?

Re: I'm glad to see this kind of criticism (Score: 1)

by danieldvorkin@pipedot.org on 2014-03-31 09:01 (#W7)

"Maybe Big Data is just the reincarnation of data mining?"

Yeah, I think that's pretty much it.

On a Technical Note (Score: 3, Insightful)

by geotti@pipedot.org on 2014-03-31 01:01 (#W3)

There's also Stratosphere, which, besides map and reduce, offers several more second-order functions like join, cross, union, and cogroup, and can do flow iterations.

Re: On a Technical Note (Score: 1)

by marqueeblink@pipedot.org on 2014-03-31 16:03 (#WG)

Thanks, that's now on my long list of things to check out.

It seems that the Apache folks are trying to corner the market on open source Big Data projects...

Good article, but a bit biased on its own (Score: 1)

by quadrox@pipedot.org on 2014-03-31 08:32 (#W5)

I fully agree with the overall message of the article, pointing out that just having a lot of data does not automatically prevent sampling errors, sampling bias, or other fallacies.

That being said, I think they were downplaying what Target seems to have achieved a bit too much. Of course the system is bound to produce some false positives, but given the criteria described it does seem reasonable that they can make quite a good assessment of pregnancy. Granted, without having access to Target's systems we cannot know for sure how well it works, but the article seems to strongly imply that it doesn't work, and indeed cannot work.

Re: Good article, but a bit biased on its own (Score: 2, Interesting)

by zafiro17@pipedot.org on 2014-03-31 10:18 (#W8)

Good point about how the Target article fails to describe how many misses there are. I always get a good laugh out of the ads Google's fancy algorithms think I should see. I am regularly bombarded with ads for things I've already bought, for example.

There's an old quote by Pico Iyer (I can't find it at the moment but probably discovered it through that old website from the late 90s, the Utne Cafe), in which he worries that people are confusing information with knowledge, and knowledge with wisdom, and that although modern technologies provide us tons of information, they don't provide much knowledge, and far less wisdom. I'd kick that one level further to say that before even information, they drown us in data, which is totally worthless unless you know what you're looking for, or are trained in noticing things you were not expecting.