A rose by any other name: Data science etc.


from John D. Cook on 2015-10-14 12:48 (#QEX1)

I help people make decisions in the face of uncertainty. Sounds interesting.

I'm a data scientist. Not sure what that means, but it sounds cool.

I study machine learning. Hmm. Maybe interesting, maybe a little ominous.

I'm into big data. Exciting or passé, depending on how many times you've heard the term.

Even though each of these descriptions makes a different impression, they're all essentially the same thing. You could throw in a few more terms too, like artificial intelligence, inferential science, decision theory, or inverse probability.

There are distinctions. These terms don't entirely overlap, but the overlap is huge. They all have to do with taking data and making an inference.

"Decision-making under uncertainty" emphasizes that you never have complete data, and yet you need to make decisions anyway. "Decision theory" emphasizes that the whole point of analyzing data is to do something as a result, and suggests that focusing directly on the decision itself, rather than proxies along the way, is the best way to do this.

"Data science" stresses that there is more to the process of making inferences than what falls under the traditional heading of "statistics." Statistics has never been only about "the grotesque phenomenon generally known as mathematical statistics," as Francis Anscombe described it. Things like data cleaning and visualization have always been part of the practice of statistics, though not the theory of statistics. Data science also emphasizes the role of computation. Some say a data scientist is a statistician who can program. Some say data science is statistics on a Mac.

Despite the hype around the term data science, it's growing on me. It has its drawbacks, but so does every other name.

Machine learning, like decision theory, emphasizes the ultimate goal of doing something with data rather than creating an accurate model of the process that generates the data. If you can create such a model, so much the better. But it may not be necessary to have a great model in order to accomplish what you originally set out to do. "Naive Bayes," for example, is a classification algorithm that is admittedly naive. It knowingly makes a gross simplification, assuming events are independent that we know are certainly not independent, and yet it often works well enough.
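To make the naive Bayes point concrete, here is a minimal sketch of such a classifier, written from scratch rather than taken from any particular library. The function names (`train_naive_bayes`, `predict`), the add-one smoothing, and the toy spam/ham documents are all assumptions made up for illustration. The "naive" step is the inner loop: each word's probability is multiplied in (added, in log space) as if the words were independent given the class, which they certainly are not.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(docs, labels):
    """Count how often each class occurs and how often each word occurs per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in zip(docs, labels):
        word_counts[label].update(words)
        vocab.update(words)
    return class_counts, word_counts, vocab


def predict(model, words):
    """Return the class with the highest (unnormalized) log posterior."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        # Log prior: fraction of training documents in this class.
        score = math.log(n_docs / total_docs)
        # Denominator for add-one (Laplace) smoothing, an illustrative choice.
        denom = sum(word_counts[label].values()) + len(vocab)
        for w in words:
            # The naive assumption: treat words as independent given the class,
            # so their log probabilities simply add.
            score += math.log((word_counts[label][w] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical toy data, invented for this sketch.
docs = [
    ["cheap", "pills", "offer"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "notes"],
    ["project", "meeting", "now"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train_naive_bayes(docs, labels)
print(predict(model, ["cheap", "offer"]))
print(predict(model, ["meeting", "notes"]))
```

Nothing here tries to model how documents are actually written. The classifier only needs the ranking of the class scores to come out right, which is why a grossly simplified model can still be good enough.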

"Big data" is a big can of worms. It is often concerned with data sets that are indeed big, but it also implies other things, such as the way the data become available, as a real-time stream rather than as a complete static set. See Erik Meijer's Big data cube. And that's just when the term "big data" is used in some fairly meaningful way. It's also used so broadly as to be meaningless.

The term "statistics" literally means the mathematics of the interests of states, as in governments, because these were the first applications of statistics. So while "statistics" may be the most established and perhaps most respectable term discussed here, it's not great. As I remarked here, "The term statistics would be equivalent to governmentistics, a historically accurate but otherwise useless term." Statistics emphasizes probability models and mathematical rigor more than other variations on data analysis do. Statisticians criticize machine learning folks for being sloppy. Machine learning folks criticize statisticians for being too conservative, or for being too focused on description and not focused enough on prediction.

Bayesian statistics is much older than what is now sometimes called "classical" statistics. It was essentially dormant during the first half of the 20th century before experiencing a renaissance in the second half of the century. Bayesian statistics was originally called "inverse probability" for good reason. Probability theory takes the probabilities of events as given and makes inferences about possible outcomes. Bayesian statistics does the inverse, taking data as given and inferring the probabilities that lead to the data. All statistics does something like this, but Bayesian statistics is consistent in forming all inference directly as probabilities. Frequentist ("classical") statistics also infers probabilities, but the results, things like p-values and confidence intervals, are not the probabilities of what most people think they are. See Anthony O'Hagan's description here.
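The forward and inverse directions can be sketched with a coin-flip example. Everything in it is an assumption chosen for illustration: the data (7 successes in 10 trials), the uniform prior on the success probability, the grid approximation, and the helper names `binomial_pmf` and `posterior_on_grid`. The forward calculation fixes the probability and asks how likely the data are. The inverse calculation fixes the data and asks which probabilities are plausible.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Forward direction: probability of k successes in n trials, given p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def posterior_on_grid(k, n, grid_size=1001):
    """Inverse direction: posterior over p given the data, on an evenly spaced grid.

    Assumes a uniform prior on p, so the posterior is proportional
    to the likelihood and only needs to be normalized.
    """
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    weights = [binomial_pmf(k, n, p) for p in grid]
    total = sum(weights)
    return grid, [w / total for w in weights]


# Forward: given p = 0.5, how probable is the outcome 7 successes out of 10?
forward = binomial_pmf(7, 10, 0.5)

# Inverse: given 7 successes out of 10, what do we believe about p?
grid, post = posterior_on_grid(7, 10)
post_mean = sum(p * w for p, w in zip(grid, post))
prob_gt_half = sum(w for p, w in zip(grid, post) if p > 0.5)

print(f"P(7 of 10 | p = 0.5)     = {forward:.4f}")
print(f"E[p | 7 of 10]           = {post_mean:.4f}")
print(f"P(p > 0.5 | 7 of 10)     = {prob_gt_half:.4f}")
```

The last number is a direct probability statement about the unknown quantity, which is what most people assume a p-value or confidence interval is giving them. Under the uniform prior assumed here the exact posterior is a Beta(8, 4) distribution, so the grid is only a convenience.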

Data analysis has gone by many names over time, sometimes with meaningful distinctions and sometimes not. Often people make a distinction without a difference.

Source: John D. Cook, https://www.johndcook.com/blog