{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# Reference: https://jupyterbook.org/interactive/hiding.html\n",
"# Use {hide, remove}-{input, output, cell} tags to hiding content\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"import ipywidgets as widgets\n",
"from ipywidgets import interact, interactive, fixed, interact_manual\n",
"from IPython.display import display\n",
"\n",
"sns.set()\n",
"sns.set_context('talk')\n",
"np.set_printoptions(threshold=20, precision=2, suppress=True)\n",
"pd.set_option('display.max_rows', 7)\n",
"pd.set_option('display.max_columns', 8)\n",
"pd.set_option('precision', 2)\n",
"# This option stops scientific notation for pandas\n",
"# pd.set_option('display.float_format', '{:.2f}'.format)\n",
"\n",
"def display_df(df, rows=pd.options.display.max_rows,\n",
" cols=pd.options.display.max_columns):\n",
" with pd.option_context('display.max_rows', rows,\n",
" 'display.max_columns', cols):\n",
" display(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# How are Relations Different from Other Data Representations?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Relations are just one way to represent data stored in a table. In practice,\n",
"data scientists encounter many other types of data tables, like spreadsheets,\n",
"matrices, and dataframes. In this section, we'll compare and contrast the\n",
"relations with other representations to explain why relations have become so\n",
"widely used for data analysis. We'll also point out scenarios where other\n",
"representations might be more appropriate."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relations and Spreadsheets"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Spreadsheets are computer applications where users can enter data in a grid and\n",
"use formulas to perform calcuations. One famous example today is Microsoft\n",
"Excel, although spreadsheets date back to at least 1979 with VisiCalc\n",
"{cite}`gradCreation2007`. Spreadsheets make it easy to see and directly\n",
"manipulate data. These properties make spreadsheets highly popular---by a\n",
"2005 estimate, there are over 55 million spreadsheet users compared to 3\n",
"million professional programmers in industry {cite}`scaffidiEstimating2005`.\n",
"\n",
"Relations have several key advantages over spreadsheets. Writing SQL code in a\n",
"computational notebook like Jupyter naturally produces a data lineage. Someone\n",
"who opens the notebook can see the input files for the notebook and how the\n",
"data were changed. Spreadsheets do not make a data lineage visible; if a person\n",
"manually edits data values in a cell, it is difficult for future users to see\n",
"which values were manually edited or how they were edited. Relations can also\n",
"handle larger datasets than spreadsheets; users can use SQL systems to work\n",
"with huge datasets that would be very hard to load into a spreadsheet."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relations and Matrices"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A matrix is a two-dimensional array of data used primarily for linear algebra\n",
"operations. In the example below, $ \\mathbf{X} $ is a matrix with three rows\n",
"and two columns.\n",
"\n",
"$$\n",
"\\begin{aligned}\n",
"\\mathbf{X} = \\begin{bmatrix}\n",
"1 & 0 \\\\\n",
"0 & 4 \\\\\n",
"0 & 0 \\\\\n",
"\\end{bmatrix}\n",
"\\end{aligned}\n",
"$$\n",
"\n",
"Matrices are mathematical objects defined by the operators that they allow. For\n",
"instance, matrices can be added or multiplied together. Matrices also have a\n",
"transpose. These operators have very useful properties which data scientists\n",
"rely on for statistical modeling.\n",
"\n",
"One important difference between a matrix and a relation: when treated as a\n",
"mathematical object, matrices can only contain numbers.\n",
"Relations, on the other hand, can also have other types of data like text.\n",
"This\n",
"makes relations more useful for loading and processing real-world data which\n",
"may contain all kinds of data types."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
":::{note}\n",
"\n",
"Data scientists refer to matrices not only as mathematical objects, but also as\n",
"program objects as well. For instance, the R programming language has a matrix\n",
"object, while in Python we could represent a matrix using a two-dimensional\n",
"`numpy` array. Matrices as implemented in Python and R can contain other data\n",
"types besides numbers, but lose mathematical properties when doing so. This is\n",
"yet another example of how domains can refer to different things with the same\n",
"term. \n",
"\n",
":::"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Relations and Dataframes"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dataframes are one of the most common ways to represent data tables in general\n",
"purpose programming languages like Python and R. (We cover dataframes in the\n",
"{ref}`ch:pandas` chapter of this book.) Dataframes share many similarities with\n",
"relations; both use rows to represent records and columns to represent\n",
"features. Both have column names, and data within a column have the same type.\n",
"\n",
"One key advantage of dataframes is that they don't *require* rows to represent\n",
"records and columns to represent features. Many times, raw data don't come in a\n",
"convenient format that can directly be put into a relation. In these scenarios,\n",
"data scientists use the dataframe to load and process data since dataframes are\n",
"more flexible in this regard. Often, data scientists will load raw data into a\n",
"dataframe, then process the data into a format that can easily stored into a\n",
"relation.\n",
"\n",
"One key advantage that relations have over dataframes is that relations are\n",
"used by relational database systems like PostgreSQL [^psql] that have highly\n",
"useful features for data storage and management. Consider a data scientist at a\n",
"company that runs a large social media website. The database might hold data\n",
"that is far too large to read into a `pandas` dataframe all at once; instead,\n",
"data scientists use SQL queries to subset and aggregate data since database\n",
"systems are more capable of handling large datasets. Also, website users\n",
"constantly make updates to their data by making posts, uploading pictures, and\n",
"editing their profile. Here, database systems let data scientists reuse their\n",
"existing SQL queries to update their analyses with the latest data rather than\n",
"having to repeatedly download large CSV files. \n",
"\n",
"For a more rigourous description of the difference between dataframes and\n",
"relations, see {cite}`petersohnScalable2020`.\n",
"\n",
"[^psql]: https://www.postgresql.org/about/"
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}