Where did Data Science come from, and what defines this new field? Here is a bibliography of some seminal, critical opinions from Statistics, Computer Science and early Machine Learning, offering some deep background on a controversial question.

In my view, several fields have been running in parallel as researchers devise better means to glean understanding from data. What we are seeing is a strong convergence between data analysis and the AI fields in Computer Science, as AI pervades areas more typical of Statistics. Statistics has only lately evolved (or, some would say, been dragged under protest) into more general applied computational areas.

It is too soon to give a succinct definition of Data Science, which is why I offer a historical view of where it is today, as a way to get a clue to the next question: whither Data Science? My opinion is that the answer will become clearer by exploring the fundamental philosophical questions of inference and causality that underlie reasoning from data.

To help you form your own opinion, here is a selected set of sources marking some key milestones that every serious researcher in the field should be aware of. There are links to PDF files for each item herein.

Machine Learning before Data Science (and even before Machine Learning was connected to Statistics)

Nils J. Nilsson Learning Machines (1965, McGraw-Hill) Reprinted, with Introduction by Terrence Sejnowski and Halbert White, (1990, Morgan Kaufmann).

The introduction to the new edition reviews the origins in Computer Science of what has become known as neural networks, a seminal numerical approach to learning in Artificial Intelligence.

Empirical Predictions

Leo Breiman Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author), Statistical Science Volume 16, Issue 3 (2001), 199-231.

This article argues for the primacy of predictive, empirical techniques as an emerging defining feature of this new science. The argument gains credibility from Breiman’s work on CART (Classification And Regression Trees), which both communities recognized as a novel contribution, marking an early point of convergence between research in Statistics and Machine Learning.

Massive statistical language success

Halevy, A., Norvig, P. and Pereira, F. (Google) The unreasonable effectiveness of data (2009), IEEE Intelligent Systems, Volume 24(2), pp. 8-12.

In this early reference to Big Data, the authors explain that the title refers back to Eugene Wigner’s historic paper “The Unreasonable Effectiveness of Mathematics in the Natural Sciences”. The paper argues that access to hitherto impossibly large collections of simultaneous translations, made possible by Google’s web scraping, rather than advances in theory, solved the automatic translation problem that had been the goal of linguists for decades. A similar argument can be made for the recent advances made by Deep Neural Networks in computer vision and speech processing.

History of Learning from Data

David Donoho 50 years of Data Science, (draft, version 1.00, September 2015).

Building on Breiman’s ideas, the author attributes the term Data Science to William Cleveland, who used it in 2001 to name a long-standing trend in the field, one that gives the exploration of data and the extraction of information from it equal importance with the field’s conventional methods.

Along the same lines, Joe Blitzstein of Harvard published his own list of other well-known statisticians’ opinions in a post on Quora.

  1. Going back to the origins of Statistics, Bin Yu shows how many of its forebears embodied the broader range of applied and computational skills that define Data Science today: IMS Presidential Address: Let us own Data Science (Bin Yu)

  2. Larry’s statistics blog laments that data scientists pay too little attention to Statistics, while conceding that statisticians have limited computational skills in the face of current problems: Data Science: The End of Statistics? (Larry Wasserman)

  3. This blog post starts with a favorable review of the O’Reilly book Doing Data Science by Rachel Schutt and Cathy O’Neil, to make the point that Statistics is the smaller part of doing data science: Doing Data Science: What’s it all about? - Statistical Modeling, Causal Inference, and Social Science (Andrew Gelman)

  4. Terry’s video lecture, according to the Vimeo description, reports on some reflections on Big Data issues, offers some suggestions for statisticians, and summarizes some theory which, in his opinion, has relevance to the analysis of data, whoever does it: Data Science, Big Data and Statistics – can we all live together? (Terry Speed)

  5. From a larger discussion on Reddit, arguing that the distinction between Machine Learning and Statistics, and similarly between theory and application, is blurred: AMA: Michael I Jordan • /r/MachineLearning (Michael Jordan)