1.1. The Students of Data 100
The data science lifecycle involves the following general steps:
Question Formulation:
What do we want to know or what problems are we trying to solve?
What are our hypotheses?
What are our metrics of success?
Data Acquisition and Cleaning:
What data do we have and what data do we need?
How will we collect more data?
How do we organize the data for analysis?
Exploratory Data Analysis:
Do we already have relevant data?
What are the biases, anomalies, or other issues with the data?
How do we transform the data to enable effective analysis?
Prediction and Inference:
What does the data say about the world?
Does it answer our questions or accurately solve the problem?
How robust are our conclusions?
We now demonstrate this process applied to a dataset of student first names from a previous offering of Data 100. In this chapter, we proceed quickly in order to give the reader a general sense of a complete iteration through the lifecycle. In later chapters, we expand on each step in this process to develop a repertoire of skills and principles.
1.1.1. Question Formulation
We would like to figure out if the student first names give us additional information about the students themselves. Although this is a vague question to ask, it is enough to get us working with our data and we can make the question more precise as we go.
1.1.2. Data Acquisition and Cleaning
Let’s begin by looking at our data, the roster of student first names that we’ve downloaded from a previous offering of Data 100.
Don’t worry if you don’t understand the code for now; we introduce the libraries in more depth soon. Instead, focus on the process and the charts that we create.
import pandas as pd
students = pd.read_csv('roster.csv')
students
279 rows × 2 columns
We can quickly see that there are some quirks in the data. For example, one of the students’ names is all uppercase letters. In addition, it is not obvious what the Role column is for.
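Quirks like these can also be surfaced with code rather than by eye. The following is a minimal sketch, not the roster itself: the small `toy` table, its names, and its `Role` values are made up for illustration, and only the column names `Name` and `Role` come from the data above.

```python
import pandas as pd

# Hypothetical stand-in for the roster; these names and Role values are invented.
toy = pd.DataFrame({
    'Name': ['Keeley', 'BRYAN', 'Bryan', 'Kaylan'],
    'Role': ['Student', 'Student', 'Student', 'Waitlist Student'],
})

# Select the rows whose name is written entirely in uppercase letters.
all_caps = toy[toy['Name'].str.isupper()]

# Count how often each distinct value appears in the Role column.
role_counts = toy['Role'].value_counts()

print(all_caps)
print(role_counts)
```

Listing the distinct values of a column, as `value_counts` does here, is often a quick first step toward guessing what an unexplained column represents.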
In Data 100, we will study how to identify anomalies in data and apply corrections. The differences in capitalization will cause our programs to think that 'BRYAN' and 'Bryan' are different names when they are identical for our purposes. Let’s convert all names to lower case to avoid this.
students['Name'] = students['Name'].str.lower()
students
279 rows × 2 columns
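To see why capitalization matters, we can count distinct names before and after lowercasing. This is a small sketch on made-up names (the `names` Series is hypothetical and not taken from the roster); it assumes only that string comparison is case-sensitive, which holds in Python and pandas.

```python
import pandas as pd

# Hypothetical names, not taken from the actual roster.
names = pd.Series(['BRYAN', 'Bryan', 'bryan', 'Keeley'])

# String comparison is case-sensitive, so these two spellings are not equal.
same_before = 'BRYAN' == 'Bryan'

# Count distinct names before and after converting to lower case.
distinct_before = names.nunique()
distinct_after = names.str.lower().nunique()

print(same_before, distinct_before, distinct_after)
```

Before lowercasing, each spelling counts as its own name; afterwards, the three spellings collapse into one.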
Now that our data are in a more useful format, we proceed to exploratory data analysis.