7.5. How are Dataframes Different from Other Data Representations?

Dataframes are just one way to represent data stored in a table. In practice, data scientists encounter many other types of data tables, like spreadsheets, matrices, and relations. In this section, we’ll compare and contrast the dataframe with other representations to explain why dataframes have become so widely used for data analysis. We’ll also point out scenarios where other representations might be more appropriate.

7.5.1. Dataframes and Spreadsheets

Spreadsheets are computer applications where users can enter data in a grid and use formulas to perform calculations. One famous example today is Microsoft Excel, although spreadsheets date back to at least 1979 with VisiCalc [Grad, 2007]. Spreadsheets make it easy to see and directly manipulate data. These properties have made spreadsheets highly popular: by a 2005 estimate, there are over 55 million spreadsheet users compared to 3 million professional programmers in industry [Scaffidi et al., 2005].

Dataframes have several key advantages over spreadsheets. Writing dataframe code in a computational notebook like Jupyter naturally produces a data lineage. Someone who opens the notebook can see the input files for the notebook and how the data were changed. Spreadsheets do not make a data lineage visible; if a person manually edits data values in a cell, it is difficult for future users to see which values were manually edited or how they were edited. Dataframes can also handle larger datasets than spreadsheets, and users can turn to distributed programming tools to work with huge datasets that would be very hard to load into a spreadsheet.
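To make the idea of a data lineage concrete, here is a minimal sketch in `pandas`. The data, the column names, and the `-999` missing-value code are all hypothetical and chosen only for illustration. Each change to the data appears as a line of code that a later reader can inspect and rerun.

```python
import io

import pandas as pd

# Hypothetical raw input; in practice this would be a file on disk.
raw_csv = io.StringIO(
    "city,temp_f\n"
    "Berkeley,68\n"
    "Davis,-999\n"  # -999 is an assumed placeholder for a missing reading
    "Fresno,95\n"
)

# Step 1: load the raw data.
temps = pd.read_csv(raw_csv)

# Step 2: drop rows that hold the placeholder value.
cleaned = temps[temps["temp_f"] != -999]

# Step 3: derive a Celsius column from the Fahrenheit readings.
cleaned = cleaned.assign(temp_c=(cleaned["temp_f"] - 32) * 5 / 9)

print(cleaned)
```

In a spreadsheet, the equivalent of step 2 might be deleting a row by hand, which leaves no record of what was removed or why.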

7.5.2. Dataframes and Matrices

A matrix is a two-dimensional array of data used primarily for linear algebra operations. In the example below, \( \mathbf{X} \) is a matrix with three rows and two columns.

\[\begin{split} \begin{aligned} \mathbf{X} = \begin{bmatrix} 1 & 0 \\ 0 & 4 \\ 0 & 0 \\ \end{bmatrix} \end{aligned} \end{split}\]

Matrices are mathematical objects defined by the operators that they allow. For instance, matrices can be added or multiplied together. Matrices also have a transpose. These operators have very useful properties which data scientists rely on for statistical modeling.
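As a small illustration of these operators, the sketch below represents the matrix \( \mathbf{X} \) from above as a two-dimensional `numpy` array. The variable names are our own choices for this example.

```python
import numpy as np

# The 3-by-2 matrix X from the example above.
X = np.array([[1, 0],
              [0, 4],
              [0, 0]])

X_sum = X + X    # elementwise addition: still 3 rows, 2 columns
X_t = X.T        # transpose: 2 rows, 3 columns
gram = X.T @ X   # matrix multiplication: a 2-by-2 result

print(X_sum)
print(X_t)
print(gram)
```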

One important difference between a matrix and a dataframe: when treated as a mathematical object, matrices can only contain numbers. Dataframes can contain numbers too, but dataframes can also have other types of data like text. This makes dataframes more useful for loading and processing raw data which may contain all kinds of data types. In fact, data scientists often use dataframes to create matrices. In this book, we’ll generally use dataframes for exploratory data analysis and data cleaning, then process the data into matrices for machine learning models.
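The following sketch shows one way this handoff from dataframe to matrix might look in `pandas`. The dataframe and its column names are hypothetical; the point is only that the text column stays behind while the numeric columns become a matrix.

```python
import pandas as pd

# A hypothetical dataframe that mixes text and numeric columns.
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "height": [1.0, 2.0, 3.0],
    "weight": [10.0, 20.0, 30.0],
})

# Select only the numeric columns, then convert them to a numpy array.
features = df[["height", "weight"]].to_numpy()

print(features.shape)
print(features.dtype)
```

The resulting array has one row per record and one column per numeric feature, so it can be used in linear algebra operations.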

Note

Data scientists refer to matrices not only as mathematical objects but also as program objects. For instance, the R programming language has a matrix object, while in Python we could represent a matrix using a two-dimensional NumPy array. Matrices as implemented in Python and R can contain other data types besides numbers, but they lose their mathematical properties when they do. This is yet another example of how domains can refer to different things with the same term.
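A short sketch of this behavior with `numpy` arrays follows. The array contents are made up for illustration. When an array is built from both numbers and text, `numpy` stores every entry as a string, and matrix multiplication is no longer available.

```python
import numpy as np

numeric = np.array([[1, 0], [0, 4]])
mixed = np.array([[1, "a"], [0, "b"]])  # numbers are converted to strings

# A numeric array supports matrix multiplication.
product = numeric @ numeric

# The string array does not: numpy has no matrix product for text.
try:
    mixed @ mixed
    matmul_failed = False
except TypeError:
    matmul_failed = True

print(mixed.dtype, matmul_failed)
```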

7.5.3. Dataframes and Relations

The relation is a data table representation used in database systems, especially SQL systems like SQLite and PostgreSQL. (We cover relations and SQL in the Working with Relations using SQL chapter of this book.) Relations share many similarities with dataframes; both use rows to represent records and columns to represent features. Both have column names, and data within a column have the same type. One difference is that dataframes have an ordering for rows, while relations don’t. This means that data scientists can’t compute the “first” row of a relation, but they can ask for the first row of a dataframe.
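The ordering difference can be sketched with `pandas` and Python's built-in `sqlite3` module. The table name, column names, and rows here are hypothetical. The dataframe has a first row by position, while the SQL query has to say which ordering it means.

```python
import sqlite3

import pandas as pd

# Hypothetical records, used for both the dataframe and the relation.
rows = [("Cal", 31), ("Ana", 25), ("Bo", 40)]

# Row order is part of a dataframe, so position 0 is well defined.
df = pd.DataFrame(rows, columns=["name", "age"])
first_in_df = df.iloc[0]

# Store the same records as a relation in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", rows)

# A relation has no inherent row order, so "first" requires an explicit
# ORDER BY clause that states what the rows are ordered by.
youngest = conn.execute(
    "SELECT name, age FROM people ORDER BY age LIMIT 1"
).fetchone()
conn.close()

print(first_in_df["name"], youngest)
```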

But we think that the more important difference is that dataframes don’t require rows to represent records and columns to represent features. Many times, raw data don’t come in a convenient format that can directly be put into a relation. In these scenarios, data scientists use the dataframe to load and process data since dataframes are more flexible in this regard. As with matrices, data scientists often use dataframes to produce relations. For a more rigorous description of the difference between dataframes and relations, see [Petersohn et al., 2020].
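As one illustration of this flexibility, the sketch below loads a hypothetical table whose column labels are years rather than features, then reshapes it so that each row is a single record. The data and the names `wide` and `long` are assumptions made for this example.

```python
import pandas as pd

# Hypothetical raw table: the year columns hold values of a variable,
# so the columns do not each represent a feature.
wide = pd.DataFrame({
    "city": ["Berkeley", "Davis"],
    "2020": [10, 20],
    "2021": [11, 21],
})

# Reshape so that each row is one (city, year) record.
long = wide.melt(id_vars="city", var_name="year", value_name="value")

print(long)
```

A dataframe can hold the data in either shape, while only the reshaped version follows the record-per-row layout that a relation expects.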