1.2. Exploratory Data Analysis
The term Exploratory Data Analysis (EDA for short) refers to the process of discovering traits about our data that inform future analysis.
Consider the students table from the previous page:
279 rows × 2 columns
We are left with a number of questions. How many students are in this roster? What does the Role column mean? We conduct EDA in order to understand our data more thoroughly.
Oftentimes, we explore the data by repeatedly posing questions as we uncover more information.
How many students are in our dataset?
print("There are", len(students), "students on the roster.")
There are 279 students on the roster.
A natural follow-up question: does this dataset contain the complete list of students? In this case, this table contains all students in one semester’s offering of Data 100.
What is the meaning of the Role column?
We often examine a field's data in order to understand the field itself.
We can see here that our data contain not only students enrolled in the class at the time but also the students on the waitlist. The Role column tells us whether each student is enrolled.
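One common way to examine a field like Role is to count how often each distinct value appears. The following is a minimal sketch, assuming the roster is a pandas DataFrame with Name and Role columns; the rows and the role labels below are invented for illustration and are not the actual roster.

```python
import pandas as pd

# Hypothetical stand-in for the roster: these names and role labels
# are made up for illustration, not taken from the real data.
students = pd.DataFrame({
    "Name": ["Keeley", "John", "Bryan", "Kaylan", "Sol"],
    "Role": ["Student", "Student", "Waitlist Student",
             "Student", "Waitlist Student"],
})

# Count how many rows carry each distinct value of the Role field.
role_counts = students["Role"].value_counts()
print(role_counts)
```

A small number of distinct values, as in this sketch, suggests that the field is categorical and that each value deserves a closer look.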
What about the names? How can we summarize this field?
In Data 100 we will work with many different kinds of data, including numerical, categorical, and text data. Each type of data has its own set of tools and techniques.
A quick way to start understanding the names is to examine the lengths of the names.
sns.distplot(students['Name'].str.len(),
             rug=True,
             bins=np.arange(12),
             axlabel="Number of Characters")
plt.xlim(0, 12)
plt.xticks(np.arange(12))
plt.ylabel('Proportion per character');
This visualization shows us that most names are between 3 and 9 characters long. This gives us a chance to check whether our data seem reasonable: if there were many names that were 1 character long, we'd have good reason to re-examine our data.
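The same sanity check can be done without a plot by computing the name lengths directly and pulling out any rows that look suspicious. This is a rough sketch under stated assumptions: the names are invented, and the one-character cutoff is an illustrative threshold rather than a rule from the course.

```python
import pandas as pd

# Hypothetical names; "B" is included on purpose to show a flagged row.
students = pd.DataFrame({"Name": ["Keeley", "John", "B", "Kaylan", "Sol"]})

# Length of each name in characters.
name_lengths = students["Name"].str.len()

# Summary statistics (count, mean, min, max, quartiles) for the lengths.
print(name_lengths.describe())

# Rows whose names are at most one character long deserve a second look.
suspicious = students[name_lengths <= 1]
print(suspicious)
```

If the flagged table is empty, the name lengths give no obvious reason to doubt the data; otherwise, the flagged rows are a natural place to start re-examining it.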
1.2.1. What’s in a Name?
Although this dataset is rather simple, we will soon see that first names alone can reveal quite a bit about our group of students.