
Introduction to Data Science

End to end Data Science

Dina Machuve

12th November, 2018

What is Data Science?

Neil Lawrence (2017)

We define the field of data science to be the challenge of making sense of the large volumes of data that have now become available through the increase in sensors and the large interconnection of the internet. Phenomena variously known as 'big data' or 'the internet of things'.

What is Data Science?

Data science differs from traditional statistics in that this data is not necessarily collected with a purpose or experiment in mind. It is collected by happenstance, and we try and extract value from it later.

What is Machine Learning?

What is Data Science?

[Figure: DS_hype]

Why do we do data science?

  • New features
  • Optimising features
  • Dashboard for decision making
  • Ad-hoc analysis

End to end Data Science

It means rethinking traditional data science workflows and embracing new principles that deeply involve the people and communities being studied.

End to end Data Science practices

This includes best practices like:

  • Involving community members in the selection of study questions
  • Partnering with communities in the data generation and curation process
  • Being intentional when formulating models, assumptions — can we think like engineers rather than scientists?
  • Rigorous, controlled experimentation
  • Transparent communication of results back to the community and stakeholders (e.g. policy makers)

Data Science vs AI

Neil Lawrence (2017)

  • The challenge in artificial intelligence is to recreate “intelligent” behaviour.
  • Data is acquired from humans, and the computer is given the task of reconstructing that data.
  • The role of machine learning techniques is to emulate the data creation process by combining a model with the data.
  • The model incorporates assumptions about the data generating process.

Data Science vs Machine Learning

  • Machine learning takes the approach of observing a system in practice and emulating its behavior with mathematics.
  • A key decision in designing machine learning solutions is where to put the mathematical function.
  • Obtaining complex behavior in the resulting system can require some imagination in the design process.

There is a lot of data!

[Figure: bigdata]

Data Science Information Flow

[Figure: ML]

  • Large amounts of data and high interconnection bandwidth
  • Much of our information about the world around us is received through computers

[Figure: knowledgegap]

DATA SAVES LIVES

Computer Modelers vs Ebola

David Brown, 2017

Goal: Predict the course of the West African Ebola outbreak

  • The project sought to represent, through mathematical equations, the evolution of a deadly epidemic.
  • Ideally, models should be able to predict: how quickly a disease will spread; who is most at risk and where the hot spots will be
  • Findings enable public health authorities to take steps that diminish or shorten the epidemic.

To determine optimal locations for six Ebola treatment centers in Liberia

  • To minimize the distance an infected person anywhere in the six-county area would have to travel for treatment.
  • Dataset: population, household size, daily activities, travel patterns, and the condition and existence of roads
  • Challenge with the dataset: missing, old, or unreliable records
  • Population distributions for the six counties were estimated from LandScan and WorldPop
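Siting treatment centers to minimize travel distance is a facility-location problem. The sketch below is a greedy k-median heuristic on invented coordinates, not the modelers' actual method; the settlement points and distances are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical settlement coordinates (km on a local grid) -- not real data.
settlements = rng.uniform(0, 100, size=(50, 2))

def total_distance(settlements, centers):
    """Sum over settlements of the distance to the nearest chosen center."""
    d = np.linalg.norm(settlements[:, None, :] - centers[None, :, :], axis=2)
    return d.min(axis=1).sum()

def greedy_k_median(settlements, k):
    """Greedily pick k settlement locations as centers, each round adding
    the candidate that most reduces total travel distance."""
    chosen = []
    candidates = list(range(len(settlements)))
    for _ in range(k):
        best = min(candidates,
                   key=lambda c: total_distance(settlements,
                                                settlements[chosen + [c]]))
        chosen.append(best)
        candidates.remove(best)
    return settlements[chosen]

centers = greedy_k_median(settlements, k=6)  # six treatment centers
```

Greedy placement is not optimal in general, but it captures the objective the slide describes: each added center shrinks the worst travel distances the most.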

'Go big, go quick' response

[Figure: model]

Learning from Lofa

[Figure: Lofa-Liberia]

  • Many Liberian counties reported alarming numbers of new cases in late August and September 2014
  • But in Lofa county, the numbers were dropping.
  • Almost nobody noticed.

Worst-case Scenarios

  • Even inaccurate forecasts helped to quantify how interventions could change an epidemic's trajectory:
    • singly or in combination
    • immediate or delayed
  • They helped decision makers establish priorities
  • They helped to inspire and inform the strong international response that may at last be slowing the epidemic

Data Science is a process

  • Formulating a quantitative question that can be answered with data,
  • Collecting and cleaning the data,
  • Analyzing the data and
  • Communicating the answer to the question to a relevant audience.

In a nutshell:

  • Explore: identify patterns
  • Predict: make informed guesses
  • Infer: quantify what you know

Data

  • What is it?
  • Where can we find it?
  • How can we explore it?

Science

  • What does it mean to learn from data?
  • How do we know when we are right or wrong?
  • The keyword in "Data Science" is not Data, it is Science

Skills of Data Scientists

[Figure: Data Scientist vs Data Engineer skills]

Statistics

Statistics refers to the mathematics and techniques with which we understand data. It is the discipline of analyzing data. Statistics intersects heavily with data science, machine learning and, of course, traditional statistical analysis.

Statistics

Key activities that define the field:

  • Descriptive statistics (EDA, quantification, summarization, clustering)
  • Inference (estimation, sampling, variability, defining populations)
  • Prediction (machine learning)
  • Experimental Design (the process of designing experiments)

Descriptive Statistics Example

Data from a cabbage field trial - Head Weight

Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.875 2.550 2.593 3.125 4.300
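The six summary statistics above (R's summary() output) can be reproduced in Python; a minimal sketch with numpy, using illustrative numbers rather than the actual trial data:

```python
import numpy as np

def summary(x):
    """Min, quartiles, median and mean, mirroring R's summary() output."""
    x = np.asarray(x, dtype=float)
    return {
        "Min.": x.min(),
        "1st Qu.": np.percentile(x, 25),  # numpy's default linear method matches R's type-7 quantile
        "Median": np.median(x),
        "Mean": x.mean(),
        "3rd Qu.": np.percentile(x, 75),
        "Max.": x.max(),
    }

head_weights = [1.0, 1.9, 2.5, 2.6, 3.1, 4.3]  # hypothetical head weights (kg)
print(summary(head_weights))
```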

A cabbage field trial - Head Weight

[Figure: histboxHeadWt]

Descriptive Statistics Example

Data from a cabbage field trial - Vitamin C content variable

Min. 1st Qu. Median Mean 3rd Qu. Max.
41.00 50.75 56.00 57.95 66.25 84.00

A cabbage field trial - Vitamin C content

[Figure: histboxVitC]

What is Machine Learning?

Neil Lawrence, 2017

  • The principal technology underpinning the recent advances in Artificial Intelligence.
  • The principal technology behind two emerging domains: Data Science and Artificial Intelligence.
  • The rise of machine learning is coming about through the availability of data and computation,
  • But machine learning methodologies are fundamentally dependent on models.

What is Machine Learning?

$$ data + model \xrightarrow{\text{compute}} prediction $$

  • data: observations, could be actively or passively acquired (meta-data).
  • model: assumptions, based on previous experience (other data! transfer learning etc).
  • prediction: an action to be taken or a categorization or a quality score.
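As a concrete instance of data + model → prediction (a minimal sketch with invented numbers): the data are (x, y) pairs, the model is the assumption that y is linear in x, and compute is a least-squares fit that yields a prediction for an unseen input.

```python
import numpy as np

# data: observed input-output pairs (hypothetical)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# model: assume y ~ a*x + b; compute: least-squares fit of a and b
a, b = np.polyfit(x, y, deg=1)

# prediction: the fitted model applied to a new input
x_new = 5.0
y_pred = a * x_new + b
```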

Machine Learning Approaches

1. Supervised Learning

  • Learn a model from a given set of input-output pairs, in order to predict the output of new inputs.
  • Further grouped into regression and classification problems.

2. Unsupervised Learning

  • Discover patterns and learn the structure of unlabelled data.
  • Examples: distribution modeling and clustering.

3. Reinforcement Learning

  • Learn what actions to take in a given situation, based on rewards and penalties
  • Example: consider teaching a dog a new trick; you cannot tell it what to do, but you can reward or punish it.
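A minimal unsupervised-learning sketch: k-means clustering hand-rolled in numpy on synthetic (made-up) 2-D data. In practice a library implementation such as scikit-learn's KMeans would be used; this version just shows the idea of learning structure without labels.

```python
import numpy as np

rng = np.random.default_rng(42)
# Two synthetic blobs of unlabelled points
data = np.vstack([rng.normal(0, 0.5, size=(30, 2)),
                  rng.normal(5, 0.5, size=(30, 2))])

def kmeans(data, k, n_iter=20):
    """Plain k-means: alternate assigning points to the nearest
    centroid and moving each centroid to the mean of its points."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None] - centroids[None, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = data[labels == j].mean(axis=0)
    return centroids, labels

centroids, labels = kmeans(data, k=2)
```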

Machine Learning vs Traditional Statistical Analyses

  • Machine learning emphasizes predictions; traditional statistical analyses emphasize superpopulation inference.
  • Machine learning evaluates results via prediction performance; traditional analyses focus on a priori hypotheses.
  • Machine learning is concerned with overfitting but not model complexity per se; traditional analyses prefer simpler models over complex ones (parsimony).
  • Machine learning emphasizes performance; traditional analyses emphasize parameter interpretability.
  • Machine learning obtains generalizability through performance on novel datasets; traditional analyses rely on statistical modeling or sampling assumptions.
  • Machine learning worries about performance and robustness; traditional analyses worry about assumptions and robustness.

Example of Prediction

  • The pace of the Olympic marathon gold medalist is predicted using a regression fit.
  • In this case the mathematical function directly predicts the winner's pace as a function of the year of the Olympics.

[Figure: olympic]

Trade-offs on Machine Learning

[Figure: Model Selection tradeoff]
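The model-selection trade-off can be seen numerically: as model complexity (here, polynomial degree) grows, training error can only go down, while held-out error eventually rises. A hedged sketch on synthetic data; all numbers are invented:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, size=30)  # noisy synthetic target

train, val = np.arange(0, 30, 2), np.arange(1, 30, 2)    # interleaved split

def errors(degree):
    """Fit on the training half; report RMS error on both halves."""
    coef = np.polyfit(x[train], y[train], degree)
    pred = np.polyval(coef, x)
    rms = lambda idx: np.sqrt(np.mean((pred[idx] - y[idx]) ** 2))
    return rms(train), rms(val)

for d in (1, 3, 9):
    tr, va = errors(d)
    print(f"degree {d}: train {tr:.3f}  val {va:.3f}")
```

Because the degree-9 family contains every lower-degree fit, its training error is never worse; the interesting question, answered on the validation half, is whether the extra flexibility fits signal or noise.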

Data Science Pipeline (end to end)

[Figure: Pipeline]

End-to-end predictive analytics approach

STEP 1: Define the goal

STEP 2: Data understanding and preparation

  • Importing, cleaning, and manipulating your data
  • Visualizing your data

STEP 3: Building your machine learning model

  • Feature selection
  • Model training
  • Model validation

STEP 4: Model deployment
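The four steps compress into a toy end-to-end script. Everything here is synthetic and simplified, and names like predict are our own, not a standard API:

```python
import numpy as np

rng = np.random.default_rng(0)

# STEP 2: data understanding and preparation (here: simulate, then clean)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0, 1.0, size=200)
keep = ~np.isnan(y)                 # drop bad rows (none here, but the habit matters)
X, y = X[keep], y[keep]

# STEP 3: build the model -- split, train, validate
split = int(0.8 * len(y))
X_tr, X_va, y_tr, y_va = X[:split], X[split:], y[:split], y[split:]
A = np.hstack([X_tr, np.ones((len(X_tr), 1))])        # add an intercept feature
w, *_ = np.linalg.lstsq(A, y_tr, rcond=None)          # train: least squares
A_va = np.hstack([X_va, np.ones((len(X_va), 1))])
val_rmse = np.sqrt(np.mean((A_va @ w - y_va) ** 2))   # validate on held-out rows

# STEP 4: deployment -- wrap the fitted model behind a plain function
def predict(x_new):
    return w[0] * x_new + w[1]
```

STEP 1 (define the goal) has no code: here the goal was fixed in advance as predicting y from a single feature.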

Data Science Products Requisites

  1. Robust: the testing error is consistent with the training error, or performance is stable after adding some noise to the dataset.
  2. Reproducible: reproducibility is the only way to confirm that scientific findings are accurate and not the artifact of a single experiment or analysis.

Recommendations to achieve Robustness

  1. Good experimental design
  2. Write code for humans: simple, clear code makes debugging easier.
  3. Automate tasks, it decreases trivial mistakes by humans.
  4. Make assertions in code and in your methods
  5. Test code
  6. Use existing libraries whenever possible
  7. Let data prove that it's high quality, e.g. using EDA
  8. Treat data as Read-Only
  9. Develop frequently used scripts into tools
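Recommendation 4 (assertions) in a minimal form: a hypothetical loader that refuses obviously bad data instead of silently propagating it. The column name and bounds below are made up for illustration.

```python
import csv
import io

def load_weights(csv_text):
    """Parse a tiny CSV of cabbage head weights, asserting basic sanity.
    Hypothetical example: column name and plausibility bounds are invented."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    assert rows, "empty input"
    weights = [float(r["HeadWt"]) for r in rows]
    assert all(0 < w < 10 for w in weights), "head weight outside plausible kg range"
    return weights

print(load_weights("HeadWt\n2.5\n3.1\n"))
```

Failing loudly at load time is what makes the later steps of the pipeline trustworthy; the same asserts double as executable documentation of what the data is supposed to look like.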

Recommendations for Reproducible Projects

  1. Release your code and data
  2. Document everything
  3. Make figures and statistics the results of scripts
  4. Use code as documentation

Best Practices

Damon Civin (2018)

  1. Solve the right problem
  2. Fail better everyday
  3. Data quality matters
  4. Simplicity is your friend
  5. Debugging makes you a wizard
  6. Fairness and privacy are not dirty words

References

C. M. Bishop, Pattern Recognition and Machine Learning, Springer-Verlag, 2006.

N. Lawrence, "What is Machine Learning?", 2017.

V. Buffalo, Bioinformatics Data Skills, O'Reilly Media, 2015.