Trending Now: The Fallacies of Big Data

by
in internet on (#2X7)
It's been almost ten years since two Google engineers published a paper describing the architecture of MapReduce, a framework for simplifying the development and deployment of algorithms that process terabytes or petabytes of data across a cluster of commodity servers. The open source community soon responded with Hadoop, a MapReduce work-alike. In the years since, most large IT organizations and many startups seem to have jumped on the bandwagon, pitching Big Data, Hadoop, and/or NoSQL as a revolutionary set of techniques for capturing actionable trends and correlations from the firehose of real-time data (clickstream, Twitter feeds, Facebook likes, server logs, sensor and surveillance data, mobile call events, and of course, all the stuff the NSA looks at).
Tim Harford of the Financial Times points out that this methodology is subject to various types of sampling bias, even in cases where the more enthusiastic proponents claim to be 'observing the entire population, not just a statistical sample'. First, data collected from social media or smartphone apps is heavily biased by the user profile of those technologies, which is disproportionately young, affluent, and urban or suburban. Harford mentions the famous case of the Literary Digest, a well-established magazine that forecast a landslide victory for Alf Landon in the 1936 US presidential election, based on a massive poll of one out of five eligible voters whose contact information was pulled from telephone subscriber lists. Landon lost the election to Franklin Roosevelt, who carried all but two of the 48 states, and the Literary Digest ceased publication soon afterwards.
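The Literary Digest effect is easy to reproduce in a toy simulation. The sketch below is only an illustration: the 40% true support for candidate A and the phone-ownership rates of 60% and 20% are made-up numbers, not figures from Harford's article, and `run_poll` is a hypothetical helper. Polling only phone owners inflates A's apparent support, and a larger sample does nothing to fix it.

```python
import random

def run_poll(n_people=100_000, true_share_a=0.40,
             phone_rate_a=0.60, phone_rate_b=0.20, seed=1936):
    """Return (true share for A, share for A among phone owners only).

    All rates are hypothetical; A's supporters are assumed to be
    three times as likely to appear on a telephone subscriber list.
    """
    rng = random.Random(seed)
    votes_a = 0      # everyone who supports A
    polled = 0       # people reachable through the phone list
    polled_a = 0     # reachable people who support A
    for _ in range(n_people):
        supports_a = rng.random() < true_share_a
        votes_a += supports_a
        phone_rate = phone_rate_a if supports_a else phone_rate_b
        if rng.random() < phone_rate:
            polled += 1
            polled_a += supports_a
    return votes_a / n_people, polled_a / polled

true_share, polled_share = run_poll()
print(f"true support for A:   {true_share:.1%}")
print(f"polled support for A: {polled_share:.1%}")
```

Under these assumptions the expected polled share is 0.24 / 0.36, or about 67%, against a true share of 40%. The poll calls a landslide for the candidate who actually loses, even though it reaches tens of thousands of respondents.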
Second, people adjust their behavior over time with respect to various topics in the news. The sudden increases in flu-related searches that made Google Flu Trends look very prescient five winters ago turned into a debacle four years later, when Google used similar data to warn of a severe flu outbreak, but the flu season turned out to be average once the curated data from the CDC finally came in.
What about the famous anecdote about Target working out that a teenage customer was pregnant before her father did? Maybe so, says a researcher quoted in Harford's article, but there's an issue with false positives. The world likely didn't hear about the other Target customers who received pregnancy-related marketing materials they had no use for.
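The false-positive problem comes down to base-rate arithmetic. As a rough sketch, assume a retailer with 1,000,000 customers of whom 2% are pregnant, and a model that flags 80% of the pregnant customers but also 5% of everyone else. All four numbers are hypothetical, chosen only to show the effect; none come from Target or from Harford's article, and `mailing_outcomes` is an illustrative helper.

```python
def mailing_outcomes(customers, base_rate, sensitivity, false_positive_rate):
    """Return (correct mailers, mistaken mailers, precision).

    precision is the fraction of flagged customers who are
    actually pregnant. All inputs are hypothetical.
    """
    pregnant = customers * base_rate
    not_pregnant = customers - pregnant
    true_positives = pregnant * sensitivity
    false_positives = not_pregnant * false_positive_rate
    precision = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, precision

hits, misses, precision = mailing_outcomes(
    customers=1_000_000,       # assumed customer base
    base_rate=0.02,            # assumed share who are pregnant
    sensitivity=0.80,          # assumed share of pregnancies detected
    false_positive_rate=0.05,  # assumed share of others wrongly flagged
)
print(f"correct mailers:  {hits:,.0f}")
print(f"mistaken mailers: {misses:,.0f}")
print(f"precision:        {precision:.1%}")
```

With these assumptions, about 16,000 mailers reach pregnant customers and about 49,000 reach customers who are not, so roughly three out of four mailers miss. The one striking hit makes the news while the misses go unreported.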

History


2014-03-30 19:00
The Fallacies of Big Data
zafiro17@pipedot.org