9.4. Transformations and Timestamps

At times a feature is not in a form best suited for analysis and so we transform it. There are many reasons a feature might need a transformation. The value codings might not be useful for analysis, we may want to combine two features into one using an arithmetic expression, or we might want to pull information out of a feature to create a new feature. We describe these three basic kinds of transformations: type conversions, mathematical transformations, and extractions.

Type conversion. This kind of transformation occurs when we convert the data from one format to another to make the data more useful for analysis. We might convert information stored as a string to another format. For example, we would want to convert prices reported as strings to numeric (like changing the string "$2.17" to the number 2.17) so that we can compute summary statistics on them. Or, we might want to convert a time stored as a string, such as “1955-10-12”, to a datetime format. Yet another example occurs when we lump categories together, such as reducing the 11 categories for age in DAWN to 5 groupings.

Mathematical transformation. One kind of mathematical transformation is when we change the units of a measurement from, say, pounds to kilograms. We might make unit conversions so statistics on our data can be more easily compared to other statistics. Yet another reason to transform a feature is to make its distribution more symmetric (covered in more detail in Section %s). The most common transformation for this is the logarithm; another widely used transformation is the square root. Lastly, we might want to create new features from an arithmetic operations on others. For example, we can combine heights and weights to create body mass index by calculating \(\text{height} / \text{weight}^2\).

Extraction: Sometimes we will want to create a feature that contains partial information taken from another feature. For example, the inspection violations consists of a string with a description of the violation, and we may be interested in only whether the violation is related to, say vermin. We can create a new feature that is True if the violation contains the word “vermin” in its text description and False otherwise. This conversion of information to logical values (or 0-1 values) is extremely useful in data science. The example in Section 9.6 gives a concrete use case for these binary features.

We cover many other examples of useful transformations in the Exploratory Data Analysis chapter. For the rest of this section, we’ll explain one more transformation in detail: transforming dates and times. Dates and times appear in many kinds of data, so it’s worth learning how to work with these data types.

9.4.1. Transforming Timestamps

A timestamp is a data value that records a specific date and time. For instance, a timestamp could be recorded like Jan 1 2020 2pm, or 2021-01-31 14:00:00, or 2017 Mar 03 05:12:41.211 PDT. As we can see, timestamps come in many different formats. Timestamps are also very useful for analysis, since they let us answer questions like: “What times of day do we have the most website traffic?”. So, we often want to parse timestamps to make them easier to manipulate using code.

Let’s take a look at an example. The insp dataframe records when restaurant inspections happened. We see that the date column looks like a timestamp.

insp
business_id score date type
0 19 94 20160513 routine
1 19 94 20171211 routine
2 24 98 20171101 routine
... ... ... ... ...
14219 94142 100 20171220 routine
14220 94189 96 20171130 routine
14221 94231 85 20171214 routine

14222 rows × 4 columns

By default, however, pandas reads in the date column as an integer:

insp['date']
0        20160513
1        20171211
2        20171101
           ...   
14219    20171220
14220    20171130
14221    20171214
Name: date, Length: 14222, dtype: int64

This storage type makes it hard to answer some useful questions about the data. Let’s say we want to know: do inspections happen more often on the weekends? Or the weekdays? To answer this question, we want to convert the date column to the pandas Timestamp storage type.

The date values appear to come in the format: YYYYMMDD, where YYYY, MM, and DD correspond to the year, month, and day. The pd.to_datetime() method can parse the date strings into objects, and we can pass in the format of the dates as a date format string 1.

# This is the Python representation of the YYYYMMDD format
date_format = '%Y%m%d'

# Converts the date column to datetime objects
insp_dates = pd.to_datetime(insp['date'], format=date_format)
insp_dates
0       2016-05-13
1       2017-12-11
2       2017-11-01
           ...    
14219   2017-12-20
14220   2017-11-30
14221   2017-12-14
Name: date, Length: 14222, dtype: datetime64[ns]

We can see that the insp_dates now has a dtype of datetime64[ns], which means that the values were successfully converted into pd.Timestamp objects 2.

Note

The pd.to_datetime() method tries to automatically infer the timestamp format if we don’t pass in the format= argument. In many cases pandas will parse the timestamps properly. However, sometimes the parsing doesn’t output the correct timestamps (including this case), so we must explicitly specify the format.

pandas has special methods and properties for Series objects that hold timestamps using the .dt accessor. For instance, we can easily pull out the year for each timestamp:

insp_dates.dt.year
0        2016
1        2017
2        2017
         ... 
14219    2017
14220    2017
14221    2017
Name: date, Length: 14222, dtype: int64

The pandas documentation has the complete details on the .dt accessor 3. By looking at the documentation, we see that the .dt.day_of_week attribute gets the day of week for each timestamp (Monday=0, Tuesday=1, …, Sunday=6).

insp_dates.dt.dayofweek
0        4
1        0
2        2
        ..
14219    2
14220    3
14221    3
Name: date, Length: 14222, dtype: int64

So, let’s assign new columns to the insp dataframe containing both the parsed timestamps and the day of week for each timestamp.

insp = insp.assign(timestamp=insp_dates, dow=insp_dates.dt.dayofweek)
insp
business_id score date type timestamp dow
0 19 94 20160513 routine 2016-05-13 4
1 19 94 20171211 routine 2017-12-11 0
2 24 98 20171101 routine 2017-11-01 2
... ... ... ... ... ... ...
14219 94142 100 20171220 routine 2017-12-20 2
14220 94189 96 20171130 routine 2017-11-30 3
14221 94231 85 20171214 routine 2017-12-14 3

14222 rows × 6 columns

Now, we can see whether restaurant inspectors favor a certain day of the week by grouping on the dow column.

sns.countplot(x='dow', data=insp)

# set xticklabels to be the day of the week
plt.gca().set_xticklabels(['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']);
../../_images/wrangling_transformations_22_0.png

So, restaurant owners can generally expect inspections to happen on a weekday rather than a weekend.

9.4.2. Using pipe() For Transformations

In real data analyses, we typically apply many transformations on the data. However, it is easy to introduce bugs when we repeatedly mutate a dataframe because Jupyter notebooks let us run cells in any order we want. As good practice, we recommend putting transformation code into functions with helpful names and using the DataFrame.pipe() method to chain transformations together.

For example, let’s rewrite the timestamp parsing code above into a function. We’ll also add the timestamps back into the dataframe as a new column, and add a second column containing the year for each timestamp.

date_format = '%Y%m%d'

# A transformation function takes in a dataframe as its first input and returns
# the transformed dataframe. We can add optional parameters via keyword
# arguments.
def parse_dates_and_years(df, column='date'):
    dates = pd.to_datetime(df[column], format=date_format)
    years = dates.dt.year
    return df.assign(timestamp=dates, year=years)

Now, we can pipe the insp dataframe through this function using .pipe():

insp = (pd.read_csv("data/inspections.csv")
        .pipe(parse_dates_and_years))
insp
business_id score date type timestamp year
0 19 94 20160513 routine 2016-05-13 2016
1 19 94 20171211 routine 2017-12-11 2017
2 24 98 20171101 routine 2017-11-01 2017
... ... ... ... ... ... ...
14219 94142 100 20171220 routine 2017-12-20 2017
14220 94189 96 20171130 routine 2017-11-30 2017
14221 94231 85 20171214 routine 2017-12-14 2017

14222 rows × 6 columns

We can chain as many .pipe() calls as we want.

def extract_day_of_week(df, col='timestamp'):
    return df.assign(dow=df[col].dt.day_of_week)

insp = (pd.read_csv("data/inspections.csv")
        .pipe(parse_dates_and_years)
        .pipe(extract_day_of_week))
insp
business_id score date type timestamp year dow
0 19 94 20160513 routine 2016-05-13 2016 4
1 19 94 20171211 routine 2017-12-11 2017 0
2 24 98 20171101 routine 2017-11-01 2017 2
... ... ... ... ... ... ... ...
14219 94142 100 20171220 routine 2017-12-20 2017 2
14220 94189 96 20171130 routine 2017-11-30 2017 3
14221 94231 85 20171214 routine 2017-12-14 2017 3

14222 rows × 7 columns

There are several key advantages of using pipe(). First, when there are many transformations on a single dataframe, it’s easier to see what transformations happen since we can simply read the function names. Second, we can reuse transformation functions for different dataframes. For instance, the viol dataframe, which contains restaurant safety violations, also has a date column.

viol
business_id date description
0 19 20171211 Inadequate food safety knowledge or lack of ce...
1 19 20171211 Unapproved or unmaintained equipment or utensils
2 19 20160513 Unapproved or unmaintained equipment or utensi...
... ... ... ...
39039 94231 20171214 High risk vermin infestation [ date violation...
39040 94231 20171214 Moderate risk food holding temperature [ dat...
39041 94231 20171214 Wiping cloths not clean or properly stored or ...

39042 rows × 3 columns

This means we can use .pipe() to reuse the timestamp parsing function without needing to write extra code. Convenient!

# Instead of rewriting the timestamp parsing code, we can simply reuse the
# parse_dates_and_years function we defined earlier.
viol = (pd.read_csv("data/violations.csv")
        .pipe(parse_dates_and_years)) 
viol
business_id date description timestamp year
0 19 20171211 Inadequate food safety knowledge or lack of ce... 2017-12-11 2017
1 19 20171211 Unapproved or unmaintained equipment or utensils 2017-12-11 2017
2 19 20160513 Unapproved or unmaintained equipment or utensi... 2016-05-13 2016
... ... ... ... ... ...
39039 94231 20171214 High risk vermin infestation [ date violation... 2017-12-14 2017
39040 94231 20171214 Moderate risk food holding temperature [ dat... 2017-12-14 2017
39041 94231 20171214 Wiping cloths not clean or properly stored or ... 2017-12-14 2017

39042 rows × 5 columns

9.4.3. Takeaways

In this section, we gave an overview of the most common transformations that we perform when cleaning data. Next, we also introduced working with timestamps in pandas. Finally, we showed how using the .pipe() function is a useful practice for applying multiple transformations. In the next section, we’ll explain another common type of transformation: modifying the structure of a dataframe.


1

For more about date format strings, see the quick reference at https://strftime.org/ or the Python documentation.

2

datetime64[ns] means that pandas uses 64 bits of memory for each value, and that each datetime is accurate to the nanosecond (or ns, for short).

3

https://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#time-date-components