Neil Lawrence (2017)
We define the field of data science to be the challenge of making sense of the large volumes of data that have now become available through the increase in sensors and the large interconnection of the internet, phenomena variously known as 'big data' or 'the internet of things'.
Data science differs from traditional statistics in that this data is not necessarily collected with a purpose or experiment in mind. It is collected by happenstance, and we try and extract value from it later.
It means rethinking traditional data science workflows and embracing new principles that deeply involve the people and communities being studied.
This includes best practices like:
In a nutshell:
Statistics refers to the mathematics and techniques with which we understand data. It is the discipline of analyzing data. Statistics intersects heavily with data science, machine learning and, of course, traditional statistical analysis.
Key activities that define the field:
$$ \text{data} + \text{model} \xrightarrow{\text{compute}} \text{prediction} $$
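As a minimal sketch of this relationship, the snippet below combines a few made-up observations (the data) with a straight-line assumption (the model) and uses a least-squares fit (the compute) to produce a prediction. The numbers and the names `fit_line` and `predict` are illustrative assumptions, not part of the original material.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (the 'compute' step)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(params, x):
    """Apply the fitted straight-line model to a new input."""
    slope, intercept = params
    return slope * x + intercept


# data: hypothetical observations lying exactly on y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]

# model + compute: fit the straight line to the data
params = fit_line(xs, ys)

# prediction: the model's output for an input it has not seen
prediction = predict(params, 5.0)
print(prediction)  # 11.0
```

Swapping in a different model (for example a curve instead of a line) changes only `fit_line` and `predict`; the data and the overall flow stay the same.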
Machine learning | Traditional statistical analyses |
---|---|
Emphasizes predictions | Emphasizes superpopulation inference |
Evaluates results via prediction performance | Focuses on a-priori hypotheses |
Concern for overfitting but not model complexity per se | Simpler models preferred over complex ones (parsimony) |
Emphasis on performance | Emphasis on parameter interpretability |
Generalizability is obtained through performance on novel datasets | Statistical modeling or sampling assumptions |
Concern over performance and robustness | Concern over assumptions and robustness |
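The machine learning column's focus on prediction performance on novel data can be illustrated with a held-out evaluation. The sketch below uses a deliberately overfit model, a nearest-neighbour lookup that memorizes its training points, and compares its error on the training data with its error on data it never saw. All values and the names `fit_nearest` and `mse` are hypothetical and chosen only for illustration.

```python
def fit_nearest(xs, ys):
    """Return a model that memorizes the training data (1-nearest neighbour)."""
    pairs = list(zip(xs, ys))

    def model(x):
        # Predict the target of the closest training input.
        return min(pairs, key=lambda p: abs(p[0] - x))[1]

    return model


def mse(model, xs, ys):
    """Mean squared error of a model's predictions on (xs, ys)."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Hypothetical noisy observations of roughly y = x
train_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
train_y = [0.3, 0.8, 2.4, 2.9, 4.2, 5.1]

# Held-out ("novel") data that played no part in fitting
test_x = [0.4, 2.6, 4.4]
test_y = [0.4, 2.6, 4.4]

model = fit_nearest(train_x, train_y)

train_error = mse(model, train_x, train_y)  # zero: the model memorized these points
test_error = mse(model, test_x, test_y)     # positive: the honest performance estimate

print("train MSE:", train_error)
print("test MSE:", test_error)
```

The training error alone would suggest a perfect model; only the held-out error reflects how it generalizes, which is why overfitting is judged on novel data rather than on model complexity as such.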
STEP 1: Define the goal
STEP 2: Data understanding and preparation
STEP 3: Building your machine learning model
STEP 4: Model deployment
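The four steps can be sketched end to end on a toy problem. This is only an illustration under stated assumptions: the goal, the records, the straight-line model, and the use of a JSON string as the "deployed" artifact are all hypothetical choices, and `fit_line` and `serve` are names invented for this example.

```python
import json

# STEP 1: Define the goal (hypothetical): predict a numeric target y from one feature x.

# STEP 2: Data understanding and preparation: drop records with missing values.
raw = [(1.0, 2.1), (2.0, None), (3.0, 6.2), (4.0, 7.9), (None, 1.0)]
clean = [(x, y) for x, y in raw if x is not None and y is not None]


# STEP 3: Build the model: a least-squares straight line.
def fit_line(pairs):
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    return {"slope": slope, "intercept": mean_y - slope * mean_x}


model = fit_line(clean)

# STEP 4: Model deployment: persist the parameters, then reload them to serve predictions.
saved = json.dumps(model)


def serve(saved_model, x):
    params = json.loads(saved_model)
    return params["slope"] * x + params["intercept"]


print(serve(saved, 2.0))
```

In practice each step is far larger than shown here, but the hand-offs are the same: a stated goal, cleaned data, a fitted model, and a stored artifact that something else can load and query.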
C. M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
Neil Lawrence. What is Machine Learning? 2017.
Vince Buffalo. Bioinformatics Data Skills. O'Reilly Media, Inc., 2015.