Principles of Data Science
Full course description
Data science is at the core of modern society. We collect large amounts of data with the goal of making better decisions. To do so, we need to make sense of the data and use it effectively.
In this course, we will start with where data comes from: controlled experiments and observational studies. We will look at potential biases that can affect the conclusions we draw from data. We will focus on what kinds of causal statements can be made from experimental data versus observational data.
We will then summarize and visualize data using histograms and scatter plots. As we will see, some interesting patterns recur when we summarize data. For example, the distribution of the sample average often follows a bell-shaped curve. We will also consider deviations from the bell-shaped curve in the presence of outliers, and how to deal with real-world, possibly "unclean" data. Scatter plots will help us study the regression line and correlations.
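The bell-shaped pattern for averages can be previewed with a small simulation. The following is a minimal sketch, not course material: the function names, the exponential population, the sample size of 50, and the seeds are all illustrative assumptions. It uses only the Python standard library.

```python
import random
import statistics


def sample_means(population, sample_size, n_samples, seed=0):
    """Draw n_samples random samples (with replacement) and return their averages."""
    rng = random.Random(seed)
    return [
        statistics.fmean(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]


def text_histogram(values, bins=10, width=40):
    """Return (counts, lines): bin counts and a simple text rendering of the histogram."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero if all values are equal
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / span * bins), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = [
        f"{lo + i * span / bins:8.2f} | " + "#" * round(c / peak * width)
        for i, c in enumerate(counts)
    ]
    return counts, lines


# Assumed example population: strongly skewed (exponential), so not bell-shaped itself.
rng = random.Random(1)
population = [rng.expovariate(1.0) for _ in range(10_000)]

# Averages of many samples of size 50 drawn from that skewed population.
means = sample_means(population, sample_size=50, n_samples=2_000)

counts, lines = text_histogram(means)
print("\n".join(lines))
```

Although the population itself is skewed, the histogram of the averages should peak near the population mean and fall off on both sides. Rerunning with a much smaller sample size or a population containing extreme outliers is one way to see the deviations from the bell shape mentioned above.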
This course will build the foundation for subsequent courses: probability and statistics, simulation and statistical analysis, and machine learning. You will learn how to convert data into tables and use them for subsequent analysis and plotting. We will focus on the principles of modern reproducible science, that is, building analysis workflows that others can easily understand and re-run. We will learn how to keep track of analysis decisions and parameter choices. We will summarize all the uncertainties in an accessible way and see that this is crucial for effective decision-making in the modern world.
During the labs, we will learn Python, one of the main programming languages used in data science, and use it to write analysis reports in the literate programming style, which mixes code, plots, and narrative in a single document. We will analyze and visualize real datasets.
Prerequisites
Desired Prior Knowledge: Procedural Programming
Recommended reading
Study material: Statistics (fourth edition, 2007) by Freedman, Pisani, and Purves.
Additional selected material from data science textbooks and other resources.