{ "cells": [ { "cell_type": "code", "execution_count": 1, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "import sys\n", "import os\n", "if not any(path.endswith('textbook') for path in sys.path):\n", " sys.path.append(os.path.abspath('../../..'))\n", "from textbook_utils import *" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "df = pd.read_csv('data/fake_news.csv', parse_dates=['timestamp'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "(sec:fake_news_exploring)=\n", "# Exploring the Data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The dataset of news articles we're exploring is just one part of the larger FakeNewsNet dataset. As such, the original paper doesn't provide detailed information about our subset of data.\n", "So, to better understand the data, we must explore it ourselves.\n", "\n", "Before starting exploratory data analysis, we apply our standard practice of splitting the data into training and test sets. We perform EDA using only the train set:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "from sklearn.model_selection import train_test_split\n", "\n", "df['label'] = (df['label'] == 'fake').astype(int)\n", "\n", "X_train, X_test, y_train, y_test = train_test_split(\n", " df[['timestamp', 'baseurl', 'content']], df['label'],\n", " test_size=0.25, random_state=42,\n", ")" ] }, { "cell_type": "code", "execution_count": 324, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<div>\n", "<table border=\"1\" class=\"dataframe\">\n", "  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>timestamp</th>\n", "      <th>baseurl</th>\n", "      <th>content</th>\n", "    </tr>\n", "  </thead>\n", "  <tbody>\n", "    <tr>\n", "      <th>164</th>\n", "      <td>2019-01-04 19:25:46</td>\n", "      <td>worldnewsdailyreport.com</td>\n", "      <td>Chinese lunar rover finds no evidence of Ameri...</td>\n", "    </tr>\n", "    <tr>\n", "      <th>28</th>\n", "      <td>2016-01-12 21:02:28</td>\n", "      <td>occupydemocrats.com</td>\n", "      <td>Virginia Republican Wants Schools To Check Chi...</td>\n", "    </tr>\n", "  </tbody>\n", "</table>\n", "</div>" ], "text/plain": [ " timestamp baseurl \n", "164 2019-01-04 19:25:46 worldnewsdailyreport.com \\\n", "28 2016-01-12 21:02:28 occupydemocrats.com \n", "\n", " content \n", "164 Chinese lunar rover finds no evidence of Ameri... \n", "28 Virginia Republican Wants Schools To Check Chi... " ] }, "execution_count": 324, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train.head(2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let's count the number of real and fake articles in the train set. Recall that a label of 1 marks a fake article and 0 marks a real one:" ] }, { "cell_type": "code", "execution_count": 325, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "label\n", "0 320\n", "1 264\n", "Name: count, dtype: int64" ] }, "execution_count": 325, "metadata": {}, "output_type": "execute_result" } ], "source": [ "y_train.value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our train set has 584 articles: 320 labeled `real` and 264 labeled `fake`, so there are 56 more real articles than fake ones. Next, we check for missing values in the three fields:" ] }, { "cell_type": "code", "execution_count": 326, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "<class 'pandas.core.frame.DataFrame'>\n", "Index: 584 entries, 164 to 102\n", "Data columns (total 3 columns):\n", " # Column Non-Null Count Dtype \n", "--- ------ -------------- ----- \n", " 0 timestamp 306 non-null datetime64[ns]\n", " 1 baseurl 584 non-null object \n", " 2 content 584 non-null object \n", "dtypes: datetime64[ns](1), object(2)\n", "memory usage: 18.2+ KB\n" ] } ], "source": [ "X_train.info()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Nearly half of the timestamps are null: only 306 of the 584 articles have one. Using this feature would therefore restrict the analysis to about half of the train set. Next, let's take a closer look at `baseurl`, which represents the website that published the original article." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring the Publishers" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To understand the `baseurl` column, we start by counting the number of articles from each website:" ] }, { "cell_type": "code", "execution_count": 327, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "baseurl\n", "whitehouse.gov 21\n", "abcnews.go.com 20\n", "nytimes.com 17\n", " ..\n", "occupydemocrats.com 1\n", "legis.state.ak.us 1\n", "dailynewsforamericans.com 1\n", "Name: count, Length: 337, dtype: int64" ] }, "execution_count": 327, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X_train['baseurl'].value_counts()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Our train set has 584 articles from 337 unique publishing websites, so each website contributes fewer than two articles on average. This suggests that most publishers appear only a handful of times. A histogram of the number of articles published by each website confirms this:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "alignmentgroup": "True", "bingroup": "x", "hovertemplate": "variable=baseurl
Number of articles published at a URL=%{x}
count=%{y}", "legendgroup": "baseurl", "marker": { "color": "#1F77B4", "pattern": { "shape": "" } }, "name": "baseurl", "offsetgroup": "baseurl", "orientation": "v", "showlegend": true, "type": "histogram", "x": [ 21, 20, 17, 16, 15, 14, 11, 11, 11, 9, 6, 6, 5, 5, 5, 4, 4, 4, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1 ], "xaxis": "x", "yaxis": "y" } ], "layout": { "barmode": "relative", "height": 250, "legend": { "title": { "text": "variable" }, "tracegroupgap": 0 }, "showlegend": false, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, 
"baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", 
"ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermapbox": [ 
{ "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autosize": true, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, 
"#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "height": 250, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "margin": { "b": 10, "l": 10, "r": 10, "t": 10 }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, 
"ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.5, "xanchor": "center" }, "width": 350, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "width": 450, "xaxis": { "anchor": "y", "autorange": true, "domain": [ 0, 1 ], "range": [ 0.5, 21.5 ], "title": { "text": "Number of articles published at a URL" }, "type": "linear" }, "yaxis": { "anchor": "x", "autorange": true, "domain": [ 0, 1 ], "range": [ 0, 274.7368421052632 ], "title": { "text": "count" } } } }, "image/png": "", "image/svg+xml": [ "5101520050100150200250Number of articles published at a URLcount" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig=px.histogram(X_train['baseurl'].value_counts(), width=450, height=250,\n", " labels={\"value\":\"Number of articles published at a URL\"})\n", "\n", "fig.update_layout(showlegend=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This histogram shows that the vast majority (261 out of 337) of websites have only one article in the train set, and only a few 
websites have more than five articles in the train set.\n", "Nonetheless, it can be informative to identify the websites that published the most fake or real articles. First, we find the websites that published the most fake articles:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "alignmentgroup": "True", "hovertemplate": "variable=baseurl
Number of articles published at a URL=%{x}
Base URL=%{y}", "legendgroup": "baseurl", "marker": { "color": "#1F77B4", "pattern": { "shape": "" } }, "name": "baseurl", "offsetgroup": "baseurl", "orientation": "h", "showlegend": true, "textposition": "auto", "type": "bar", "x": [ 3, 3, 3, 3, 3, 3, 4, 4, 5, 15 ], "xaxis": "x", "y": [ "worldnewsdailyreport.com", "newsweek.com", "cnn.com", "washingtonpost.com", "trendolizer.com", "observeronline.news", "thehill.com", "dailyusaupdate.com", "thegatewaypundit.com", "yournewswire.com" ], "yaxis": "y" } ], "layout": { "barmode": "relative", "height": 250, "legend": { "title": { "text": "variable" }, "tracegroupgap": 0 }, "showlegend": false, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], 
"contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, 
"#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 
0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autosize": true, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, 
"subunitcolor": "white" }, "height": 250, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "margin": { "b": 10, "l": 10, "r": 10, "t": 10 }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.5, "xanchor": "center" }, "width": 350, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": 
true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "width": 550, "xaxis": { "anchor": "y", "autorange": true, "domain": [ 0, 1 ], "range": [ 0, 15.789473684210526 ], "title": { "text": "Number of articles published at a URL" }, "type": "linear" }, "yaxis": { "anchor": "x", "autorange": true, "domain": [ 0, 1 ], "range": [ -0.5, 9.5 ], "title": { "text": "Base URL" }, "type": "category" } } }, "image/png": "", "image/svg+xml": [ "051015worldnewsdailyreport.comnewsweek.comcnn.comwashingtonpost.comtrendolizer.comobserveronline.newsthehill.comdailyusaupdate.comthegatewaypundit.comyournewswire.comNumber of articles published at a URLBase URL" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "top_fake_publishers = (\n", " X_train.assign(label=y_train)\n", " .query(\"label == 1\")\n", " [\"baseurl\"]\n", " .value_counts()\n", " .iloc[:10]\n", " .sort_values()\n", ")\n", "\n", "fig = px.bar(\n", " top_fake_publishers,\n", " orientation=\"h\", width=550, height=250,\n", " labels={\"value\": \"Number of articles published at a URL\", \n", " \"index\": \"Base URL\"},\n", ")\n", "fig.update_layout(showlegend=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we list the websites that published the greatest number of real articles:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "alignmentgroup": "True", "hovertemplate": "variable=baseurl
Number of articles published at a URL=%{x}
Base URL=%{y}", "legendgroup": "baseurl", "marker": { "color": "#1F77B4", "pattern": { "shape": "" } }, "name": "baseurl", "offsetgroup": "baseurl", "orientation": "h", "showlegend": true, "textposition": "auto", "type": "bar", "x": [ 5, 8, 9, 10, 11, 11, 15, 16, 20, 21 ], "xaxis": "x", "y": [ "medium.com", "cnn.com", "msnbc.msn.com", "foxnews.com", "cq.com", "washingtonpost.com", "politifact.com", "nytimes.com", "abcnews.go.com", "whitehouse.gov" ], "yaxis": "y" } ], "layout": { "barmode": "relative", "height": 250, "legend": { "title": { "text": "variable" }, "tracegroupgap": 0 }, "showlegend": false, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { 
"outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, 
"#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 
0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autosize": true, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "height": 
250, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "margin": { "b": 10, "l": 10, "r": 10, "t": 10 }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.5, "xanchor": "center" }, "width": 350, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": 
"outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "width": 550, "xaxis": { "anchor": "y", "autorange": true, "domain": [ 0, 1 ], "range": [ 0, 22.105263157894736 ], "title": { "text": "Number of articles published at a URL" }, "type": "linear" }, "yaxis": { "anchor": "x", "autorange": true, "domain": [ 0, 1 ], "range": [ -0.5, 9.5 ], "title": { "text": "Base URL" }, "type": "category" } } }, "image/png": "", "image/svg+xml": [ "05101520medium.comcnn.commsnbc.msn.comfoxnews.comcq.comwashingtonpost.compolitifact.comnytimes.comabcnews.go.comwhitehouse.govNumber of articles published at a URLBase URL" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "top_real_publishers = (\n", " X_train.assign(label=y_train)\n", " .query(\"label == 0\")\n", " [\"baseurl\"]\n", " .value_counts()\n", " .iloc[:10]\n", " .sort_values()\n", ")\n", "\n", "fig = px.bar(\n", " top_real_publishers,\n", " orientation=\"h\", width=550, height=250,\n", " labels={\"value\": \"Number of articles published at a URL\",\n", " \"index\": \"Base URL\"},\n", ")\n", "fig.update_layout(showlegend=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Only `cnn.com` appears on both lists. Even without knowing the total number of articles for these sites, we might expect that an article from `yournewswire.com` is more likely to be labeled as `fake`, while an article from `whitehouse.gov` is more likely to be labeled as `real`. That said, we don't expect that using the publishing website to predict article truthfulness would work very well; there are simply too few articles from most of the websites in the dataset." 
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, let's explore the `timestamp` column, which records the publication date of the news articles." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring Publication Date" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plotting the timestamps on a histogram shows that most articles were published after 2000, although there seems to be at least one article published before 1940:" ] }, { "cell_type": "code", "execution_count": 331, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "alignmentgroup": "True", "bingroup": "x", "hovertemplate": "variable=timestamp
Publication year=%{x}
count=%{y}", "legendgroup": "timestamp", "marker": { "color": "#1F77B4", "pattern": { "shape": "" } }, "name": "timestamp", "offsetgroup": "timestamp", "orientation": "v", "showlegend": true, "type": "histogram", "x": [ "2019-01-04T19:25:46", "2016-01-12T21:02:28", "2017-05-28T07:00:00", "2018-04-18T07:00:00", null, null, "2014-02-25T08:00:00", null, "2008-02-02T08:00:00", "2014-04-06T07:00:00", null, null, "2017-10-17T07:00:00", null, "2017-12-23T11:38:49", "2017-06-29T10:11:23", null, "2011-11-17T08:00:00", "2017-08-11T05:30:55", null, null, "2013-07-24T07:00:00", "2017-08-06T07:00:00", "2008-06-15T07:00:00", "2012-05-05T07:00:00", null, "2013-04-18T07:00:00", "2017-06-23T07:00:00", null, null, null, "2017-05-11T13:20:53", "2018-01-18T13:11:21", "2017-07-13T07:00:00", null, "2017-05-05T07:00:00", null, null, null, "2017-04-10T23:33:37", null, null, null, null, null, "2018-01-15T00:45:10", "2017-12-06T03:56:22", null, null, null, null, "2007-01-30T23:00:00", "2018-04-27T07:00:00", null, null, "2017-08-26T23:55:58", null, "2016-08-24T16:31:21", "2017-08-12T07:00:00", "2017-12-08T16:27:57", null, null, null, null, "2018-01-20T19:55:11", "2017-07-06T23:08:01", "2011-06-10T07:00:00", "2014-08-06T07:00:00", null, null, "2016-12-23T19:20:30", null, "2012-06-17T07:00:00", "2016-03-15T07:00:00", null, "2007-07-20T16:03:00", "2008-05-05T07:00:00", "2018-04-30T07:00:00", null, "2016-11-15T13:49:28", "2017-12-15T08:00:00", "2018-05-14T07:00:00", null, "2008-03-17T07:00:00", null, null, null, null, "2017-05-10T07:00:00", null, "2017-03-02T21:08:00", "2011-08-15T07:00:00", "2022-01-21T20:11:33", "2017-06-21T08:29:37", null, null, null, null, null, null, null, null, null, null, null, null, "2017-04-28T22:43:30", null, null, "2017-10-02T07:00:00", null, "2015-12-22T08:00:00", null, "2022-01-21T20:11:33", null, "2018-05-12T10:31:03", null, "2017-05-13T12:16:07", null, "2015-07-12T07:00:00", null, "2012-08-15T07:00:00", null, null, "2018-05-02T21:43:38", null, 
"2017-06-14T10:54:18", "2017-03-21T17:03:42", "2018-06-26T07:00:00", null, null, null, null, null, null, "2018-05-10T10:27:59", "2012-11-12T08:00:00", "2016-08-15T21:35:17", "2012-06-23T07:00:00", "2017-04-22T16:49:00", "2012-02-17T08:00:00", "2015-11-04T19:03:03", "2018-04-02T07:00:00", "2012-02-08T08:00:00", "2016-09-26T07:00:00", "2022-01-21T20:11:33", null, null, "2017-01-30T08:00:00", null, null, "2017-10-30T03:51:53", "2014-07-28T06:25:34", null, "2016-12-15T17:58:10", null, "2017-05-25T23:45:14", null, "2011-07-11T07:00:00", "2018-01-15T12:47:13", null, "2021-04-05T16:39:51", "2015-12-19T08:00:00", "2009-07-12T07:00:00", null, null, "2017-11-06T14:31:59", null, "2017-09-10T07:00:00", "2017-11-10T15:06:39", "2012-06-02T00:00:00", null, null, "2012-11-06T08:00:00", "2017-10-17T10:00:14", null, "2011-08-26T07:00:00", null, "2012-08-23T07:00:00", "2012-10-22T07:00:00", null, "2015-07-24T07:00:00", null, "2008-06-03T07:00:00", null, "2022-01-21T20:11:33", null, "2017-02-05T00:19:34", null, null, null, "2017-02-14T08:00:00", "2018-01-04T08:00:00", null, "2018-01-17T08:00:00", "2022-01-21T20:11:33", null, null, null, "2017-11-06T16:38:11", null, "2010-12-13T08:00:00", "2009-02-03T08:00:00", null, "2017-09-05T07:00:00", "2017-05-20T13:17:59", null, null, "2013-03-31T07:00:00", "2022-01-21T20:11:33", "2017-07-15T07:00:00", null, null, "2017-12-08T22:25:03", null, null, null, "2012-10-03T07:00:00", null, null, null, "2013-03-08T08:00:00", "2018-04-28T23:36:00", null, null, "2015-10-22T17:38:15", null, null, "2017-05-09T07:00:00", null, "1996-07-13T05:00:00", "2013-05-07T07:00:00", null, null, "2016-01-01T16:30:16", null, null, "2016-11-25T08:00:00", null, "2017-03-01T13:48:02", null, null, "2022-01-21T20:11:33", "2017-11-02T17:25:12", "2017-01-31T14:33:42", "2017-03-10T15:40:16", null, null, null, null, null, "2017-11-15T00:52:37", "2017-02-04T20:49:12", "2012-08-09T07:00:00", "2018-05-16T15:34:00", null, null, null, null, "2018-03-04T22:45:53", "2008-09-19T07:00:00", 
null, null, null, null, "2017-06-01T01:33:02", "2016-11-22T19:46:39", "2017-11-10T14:21:36", "2018-03-13T07:00:00", "2017-02-09T00:43:39", "1935-07-24T05:00:00", null, null, "2018-06-25T07:00:00", null, null, "2007-08-01T07:00:00", "2017-11-09T08:00:00", "2018-02-12T09:00:00", "2016-07-31T16:41:09", "2009-10-21T07:00:00", null, null, null, "2011-08-15T07:00:00", "2011-02-06T08:00:00", "2018-07-02T07:00:00", "2015-01-07T08:00:00", "2018-06-09T21:52:02", null, "2017-04-25T07:00:00", null, "2016-02-23T08:00:00", null, "2006-11-07T08:00:00", "2016-06-13T07:00:00", "2009-09-07T07:00:00", "2017-11-13T14:53:49", null, null, "2017-08-14T02:18:16", null, "2015-07-10T07:00:00", "2016-02-11T08:00:00", null, null, "2017-10-02T14:13:42", "2022-01-21T20:11:33", null, "2017-11-06T07:00:44", "2020-03-25T05:17:00", null, null, null, "2022-01-21T20:11:33", "2016-09-05T07:00:00", "2022-01-21T20:11:33", "2017-06-21T20:42:21", "2008-03-03T08:00:00", null, "1994-06-10T05:00:00", null, "2013-04-17T07:00:00", null, "2017-12-30T08:00:00", null, "2011-02-11T08:00:00", null, "2012-04-03T07:00:00", "2013-10-27T07:00:00", "2013-08-06T07:00:00", null, null, "2017-07-01T11:58:59", null, null, null, null, "2022-01-21T20:11:33", null, "2017-02-16T20:48:25", null, "2016-05-10T23:21:17.233000", "2022-01-21T20:11:33", "2017-07-06T07:00:00", "2022-01-21T20:11:33", null, null, null, "2008-08-15T07:00:00", null, "2018-07-05T10:48:31", null, "2016-04-24T21:04:28", null, "2017-05-12T12:50:32", "2018-07-17T07:00:00", null, null, "2013-04-25T20:33:20", "2015-02-12T12:30:05", null, null, null, "2011-11-29T08:00:00", "2010-10-14T17:04:00", null, null, "2016-11-25T08:00:00", null, null, "2011-03-11T08:00:00", null, "2017-11-26T18:25:52", "1997-10-19T05:00:00", "2017-10-08T23:43:28", "2022-07-13T14:49:56", null, null, "2018-06-27T16:09:14", "2015-07-13T07:00:00", "2016-09-20T07:00:00", "2017-05-08T07:00:00", "2017-07-08T07:00:00", null, null, "2022-05-13T19:50:48", "2017-04-01T10:22:51", "2011-11-30T08:00:00", 
null, null, null, "2017-11-27T15:18:23", null, null, null, null, null, "2012-09-18T03:00:58", "2013-03-07T08:00:00", null, "2007-08-01T07:00:00", null, null, "2008-01-30T08:00:00", "2017-02-11T08:00:00", null, "2016-05-09T07:00:00", null, "2018-01-31T03:12:22", "2016-12-12T19:20:51", "2008-03-17T07:00:00", null, null, "2001-09-11T05:00:00", "2018-07-10T07:00:00", null, "2012-10-12T07:00:00", "2016-03-10T18:11:20.173000", "2017-07-17T13:06:07.680999", null, null, null, "2018-02-15T19:38:00", null, null, "2016-02-07T00:23:19.497999", "2017-06-28T02:30:25", "2016-03-14T07:00:00", null, null, "2009-12-18T08:00:00", null, null, "2013-10-02T07:00:00", "2011-05-24T07:00:00", null, "2018-05-21T07:00:00", "2017-09-07T15:58:27", "2018-06-09T13:46:04", "2013-05-02T07:00:00", "2017-03-16T21:05:45", null, "2017-04-12T23:27:02", "2013-07-24T07:00:00", null, "2008-01-30T08:00:00", "2016-11-07T08:00:00", "2018-03-20T17:56:46", "2018-04-09T07:00:00", null, null, null, null, "2017-08-08T07:00:00", "2009-03-05T08:00:00", "2017-01-05T08:00:00", null, null, "2015-11-25T08:00:00", null, "2016-01-05T08:00:00", null, null, "2017-12-30T08:00:00", null, null, "2014-07-30T13:14:31", null, null, "2017-12-11T05:56:25", null, "2017-03-01T23:25:53", null, "2017-09-13T07:00:00", null, null, "2017-09-14T20:00:00", null, null, "2009-02-17T08:00:00", null, "2016-12-06T00:12:45", null, null, "2016-05-23T11:29:00", "2017-09-30T23:15:05", null, "2008-06-03T07:00:00", null, "2018-05-04T14:02:09", "2018-03-16T07:00:00", "2017-01-02T08:00:00", null, "2016-10-10T04:00:36.789999", "2017-08-20T07:00:00", null, "2018-03-24T18:45:45", null, "2016-12-20T08:00:00", "2018-04-09T07:00:00", "2012-07-17T07:00:00", "2010-02-23T08:00:00", null, "2016-06-21T13:41:37", null, "2017-12-14T08:00:00", null, "2009-02-09T08:00:00", null, null, "2018-01-14T01:29:40", null, "2016-01-02T08:00:00", "2016-01-01T23:17:43", null, "2008-02-08T08:00:00", "2018-07-12T16:28:19", null, null, null, "2016-07-09T07:00:00", 
"2009-09-10T07:00:00", null, "2018-05-13T07:00:00", "2016-11-14T08:00:00", "2008-06-18T07:00:00", "2016-12-23T17:04:56", null, "2017-06-24T07:00:00", "2017-10-12T23:15:41", null, null, null, "2016-04-29T12:25:53", null, "2008-08-23T07:00:00", "2014-12-21T08:00:00", "2016-05-06T20:32:20", "2016-10-10T04:00:36.789999", null, "2012-07-12T07:00:00", null, null, null, null, null, null, null, "2022-01-21T20:11:33", "2016-03-10T08:00:00", null, "2011-12-09T08:00:00", "2017-05-20T07:00:00", "2017-09-01T03:14:21", "2018-04-15T07:00:00", null, "2018-06-23T17:18:57", null, "2017-06-01T07:00:00", null, "2015-04-13T07:00:00", "2012-04-17T07:00:00", null, "2022-01-21T20:11:33", null, "2017-03-14T18:19:08", null, "2017-09-26T07:00:00", null, "2020-07-01T05:10:00", "2011-03-15T07:00:00", "2018-05-13T13:33:50", "2017-10-10T07:00:00", "2010-06-07T13:30:42", "2014-02-28T05:49:15", "2014-01-24T08:00:00", null, "2011-07-06T07:00:00", "2016-01-10T20:56:04", "2017-07-28T22:24:04", null, null, "2017-11-23T08:00:00" ], "xaxis": "x", "yaxis": "y" } ], "layout": { "barmode": "relative", "height": 250, "legend": { "title": { "text": "variable" }, "tracegroupgap": 0 }, "showlegend": false, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, 
"tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" 
], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], "scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": 
"scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autosize": true, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 
0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "height": 250, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "margin": { "b": 10, "l": 10, "r": 10, "t": 10 }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", 
"showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.5, "xanchor": "center" }, "width": 350, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "width": 550, "xaxis": { "anchor": "y", "autorange": true, "domain": [ 0, 1 ], "range": [ "1935-01-01", "2023-01-01" ], "title": { "text": "Publication year" }, "type": "date" }, "yaxis": { "anchor": "x", "autorange": true, "domain": [ 0, 1 ], "range": [ 0, 100 ], "title": { "text": "count" } } } }, "image/png": "", "image/svg+xml": [ "19401960198020002020020406080100Publication yearcount" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = px.histogram(\n", " X_train[\"timestamp\"],\n", " labels={\"value\": \"Publication year\"}, width=550, height=250,\n", ")\n", "fig.update_layout(showlegend=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "When we take a closer look at the news articles published prior to 2000, we find that the timestamps don't match the actual publication dates of the articles. These date issues are most likely related to the web scraper collecting inaccurate information from the web pages. 
We can zoom into the region of the histogram after 2000:" ] }, { "cell_type": "code", "execution_count": 332, "metadata": {}, "outputs": [ { "data": { "application/vnd.plotly.v1+json": { "config": { "plotlyServerURL": "https://plot.ly" }, "data": [ { "alignmentgroup": "True", "bingroup": "x", "hovertemplate": "variable=timestamp
Publication year=%{x}
count=%{y}", "legendgroup": "timestamp", "marker": { "color": "#1F77B4", "pattern": { "shape": "" } }, "name": "timestamp", "offsetgroup": "timestamp", "orientation": "v", "showlegend": true, "type": "histogram", "x": [ "2019-01-04T19:25:46", "2016-01-12T21:02:28", "2017-05-28T07:00:00", "2018-04-18T07:00:00", "2014-02-25T08:00:00", "2008-02-02T08:00:00", "2014-04-06T07:00:00", "2017-10-17T07:00:00", "2017-12-23T11:38:49", "2017-06-29T10:11:23", "2011-11-17T08:00:00", "2017-08-11T05:30:55", "2013-07-24T07:00:00", "2017-08-06T07:00:00", "2008-06-15T07:00:00", "2012-05-05T07:00:00", "2013-04-18T07:00:00", "2017-06-23T07:00:00", "2017-05-11T13:20:53", "2018-01-18T13:11:21", "2017-07-13T07:00:00", "2017-05-05T07:00:00", "2017-04-10T23:33:37", "2018-01-15T00:45:10", "2017-12-06T03:56:22", "2007-01-30T23:00:00", "2018-04-27T07:00:00", "2017-08-26T23:55:58", "2016-08-24T16:31:21", "2017-08-12T07:00:00", "2017-12-08T16:27:57", "2018-01-20T19:55:11", "2017-07-06T23:08:01", "2011-06-10T07:00:00", "2014-08-06T07:00:00", "2016-12-23T19:20:30", "2012-06-17T07:00:00", "2016-03-15T07:00:00", "2007-07-20T16:03:00", "2008-05-05T07:00:00", "2018-04-30T07:00:00", "2016-11-15T13:49:28", "2017-12-15T08:00:00", "2018-05-14T07:00:00", "2008-03-17T07:00:00", "2017-05-10T07:00:00", "2017-03-02T21:08:00", "2011-08-15T07:00:00", "2022-01-21T20:11:33", "2017-06-21T08:29:37", "2017-04-28T22:43:30", "2017-10-02T07:00:00", "2015-12-22T08:00:00", "2022-01-21T20:11:33", "2018-05-12T10:31:03", "2017-05-13T12:16:07", "2015-07-12T07:00:00", "2012-08-15T07:00:00", "2018-05-02T21:43:38", "2017-06-14T10:54:18", "2017-03-21T17:03:42", "2018-06-26T07:00:00", "2018-05-10T10:27:59", "2012-11-12T08:00:00", "2016-08-15T21:35:17", "2012-06-23T07:00:00", "2017-04-22T16:49:00", "2012-02-17T08:00:00", "2015-11-04T19:03:03", "2018-04-02T07:00:00", "2012-02-08T08:00:00", "2016-09-26T07:00:00", "2022-01-21T20:11:33", "2017-01-30T08:00:00", "2017-10-30T03:51:53", "2014-07-28T06:25:34", "2016-12-15T17:58:10", 
"2017-05-25T23:45:14", "2011-07-11T07:00:00", "2018-01-15T12:47:13", "2021-04-05T16:39:51", "2015-12-19T08:00:00", "2009-07-12T07:00:00", "2017-11-06T14:31:59", "2017-09-10T07:00:00", "2017-11-10T15:06:39", "2012-06-02T00:00:00", "2012-11-06T08:00:00", "2017-10-17T10:00:14", "2011-08-26T07:00:00", "2012-08-23T07:00:00", "2012-10-22T07:00:00", "2015-07-24T07:00:00", "2008-06-03T07:00:00", "2022-01-21T20:11:33", "2017-02-05T00:19:34", "2017-02-14T08:00:00", "2018-01-04T08:00:00", "2018-01-17T08:00:00", "2022-01-21T20:11:33", "2017-11-06T16:38:11", "2010-12-13T08:00:00", "2009-02-03T08:00:00", "2017-09-05T07:00:00", "2017-05-20T13:17:59", "2013-03-31T07:00:00", "2022-01-21T20:11:33", "2017-07-15T07:00:00", "2017-12-08T22:25:03", "2012-10-03T07:00:00", "2013-03-08T08:00:00", "2018-04-28T23:36:00", "2015-10-22T17:38:15", "2017-05-09T07:00:00", "2013-05-07T07:00:00", "2016-01-01T16:30:16", "2016-11-25T08:00:00", "2017-03-01T13:48:02", "2022-01-21T20:11:33", "2017-11-02T17:25:12", "2017-01-31T14:33:42", "2017-03-10T15:40:16", "2017-11-15T00:52:37", "2017-02-04T20:49:12", "2012-08-09T07:00:00", "2018-05-16T15:34:00", "2018-03-04T22:45:53", "2008-09-19T07:00:00", "2017-06-01T01:33:02", "2016-11-22T19:46:39", "2017-11-10T14:21:36", "2018-03-13T07:00:00", "2017-02-09T00:43:39", "2018-06-25T07:00:00", "2007-08-01T07:00:00", "2017-11-09T08:00:00", "2018-02-12T09:00:00", "2016-07-31T16:41:09", "2009-10-21T07:00:00", "2011-08-15T07:00:00", "2011-02-06T08:00:00", "2018-07-02T07:00:00", "2015-01-07T08:00:00", "2018-06-09T21:52:02", "2017-04-25T07:00:00", "2016-02-23T08:00:00", "2006-11-07T08:00:00", "2016-06-13T07:00:00", "2009-09-07T07:00:00", "2017-11-13T14:53:49", "2017-08-14T02:18:16", "2015-07-10T07:00:00", "2016-02-11T08:00:00", "2017-10-02T14:13:42", "2022-01-21T20:11:33", "2017-11-06T07:00:44", "2020-03-25T05:17:00", "2022-01-21T20:11:33", "2016-09-05T07:00:00", "2022-01-21T20:11:33", "2017-06-21T20:42:21", "2008-03-03T08:00:00", "2013-04-17T07:00:00", 
"2017-12-30T08:00:00", "2011-02-11T08:00:00", "2012-04-03T07:00:00", "2013-10-27T07:00:00", "2013-08-06T07:00:00", "2017-07-01T11:58:59", "2022-01-21T20:11:33", "2017-02-16T20:48:25", "2016-05-10T23:21:17.233000", "2022-01-21T20:11:33", "2017-07-06T07:00:00", "2022-01-21T20:11:33", "2008-08-15T07:00:00", "2018-07-05T10:48:31", "2016-04-24T21:04:28", "2017-05-12T12:50:32", "2018-07-17T07:00:00", "2013-04-25T20:33:20", "2015-02-12T12:30:05", "2011-11-29T08:00:00", "2010-10-14T17:04:00", "2016-11-25T08:00:00", "2011-03-11T08:00:00", "2017-11-26T18:25:52", "2017-10-08T23:43:28", "2022-07-13T14:49:56", "2018-06-27T16:09:14", "2015-07-13T07:00:00", "2016-09-20T07:00:00", "2017-05-08T07:00:00", "2017-07-08T07:00:00", "2022-05-13T19:50:48", "2017-04-01T10:22:51", "2011-11-30T08:00:00", "2017-11-27T15:18:23", "2012-09-18T03:00:58", "2013-03-07T08:00:00", "2007-08-01T07:00:00", "2008-01-30T08:00:00", "2017-02-11T08:00:00", "2016-05-09T07:00:00", "2018-01-31T03:12:22", "2016-12-12T19:20:51", "2008-03-17T07:00:00", "2001-09-11T05:00:00", "2018-07-10T07:00:00", "2012-10-12T07:00:00", "2016-03-10T18:11:20.173000", "2017-07-17T13:06:07.680999", "2018-02-15T19:38:00", "2016-02-07T00:23:19.497999", "2017-06-28T02:30:25", "2016-03-14T07:00:00", "2009-12-18T08:00:00", "2013-10-02T07:00:00", "2011-05-24T07:00:00", "2018-05-21T07:00:00", "2017-09-07T15:58:27", "2018-06-09T13:46:04", "2013-05-02T07:00:00", "2017-03-16T21:05:45", "2017-04-12T23:27:02", "2013-07-24T07:00:00", "2008-01-30T08:00:00", "2016-11-07T08:00:00", "2018-03-20T17:56:46", "2018-04-09T07:00:00", "2017-08-08T07:00:00", "2009-03-05T08:00:00", "2017-01-05T08:00:00", "2015-11-25T08:00:00", "2016-01-05T08:00:00", "2017-12-30T08:00:00", "2014-07-30T13:14:31", "2017-12-11T05:56:25", "2017-03-01T23:25:53", "2017-09-13T07:00:00", "2017-09-14T20:00:00", "2009-02-17T08:00:00", "2016-12-06T00:12:45", "2016-05-23T11:29:00", "2017-09-30T23:15:05", "2008-06-03T07:00:00", "2018-05-04T14:02:09", "2018-03-16T07:00:00", 
"2017-01-02T08:00:00", "2016-10-10T04:00:36.789999", "2017-08-20T07:00:00", "2018-03-24T18:45:45", "2016-12-20T08:00:00", "2018-04-09T07:00:00", "2012-07-17T07:00:00", "2010-02-23T08:00:00", "2016-06-21T13:41:37", "2017-12-14T08:00:00", "2009-02-09T08:00:00", "2018-01-14T01:29:40", "2016-01-02T08:00:00", "2016-01-01T23:17:43", "2008-02-08T08:00:00", "2018-07-12T16:28:19", "2016-07-09T07:00:00", "2009-09-10T07:00:00", "2018-05-13T07:00:00", "2016-11-14T08:00:00", "2008-06-18T07:00:00", "2016-12-23T17:04:56", "2017-06-24T07:00:00", "2017-10-12T23:15:41", "2016-04-29T12:25:53", "2008-08-23T07:00:00", "2014-12-21T08:00:00", "2016-05-06T20:32:20", "2016-10-10T04:00:36.789999", "2012-07-12T07:00:00", "2022-01-21T20:11:33", "2016-03-10T08:00:00", "2011-12-09T08:00:00", "2017-05-20T07:00:00", "2017-09-01T03:14:21", "2018-04-15T07:00:00", "2018-06-23T17:18:57", "2017-06-01T07:00:00", "2015-04-13T07:00:00", "2012-04-17T07:00:00", "2022-01-21T20:11:33", "2017-03-14T18:19:08", "2017-09-26T07:00:00", "2020-07-01T05:10:00", "2011-03-15T07:00:00", "2018-05-13T13:33:50", "2017-10-10T07:00:00", "2010-06-07T13:30:42", "2014-02-28T05:49:15", "2014-01-24T08:00:00", "2011-07-06T07:00:00", "2016-01-10T20:56:04", "2017-07-28T22:24:04", "2017-11-23T08:00:00" ], "xaxis": "x", "yaxis": "y" } ], "layout": { "barmode": "relative", "height": 250, "legend": { "title": { "text": "variable" }, "tracegroupgap": 0 }, "showlegend": false, "template": { "data": { "bar": [ { "error_x": { "color": "rgb(36,36,36)" }, "error_y": { "color": "rgb(36,36,36)" }, "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "bar" } ], "barpolar": [ { "marker": { "line": { "color": "white", "width": 0.5 }, "pattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 } }, "type": "barpolar" } ], "carpet": [ { "aaxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", 
"startlinecolor": "rgb(36,36,36)" }, "baxis": { "endlinecolor": "rgb(36,36,36)", "gridcolor": "white", "linecolor": "white", "minorgridcolor": "white", "startlinecolor": "rgb(36,36,36)" }, "type": "carpet" } ], "choropleth": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "choropleth" } ], "contour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "contour" } ], "contourcarpet": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "contourcarpet" } ], "heatmap": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmap" } ], "heatmapgl": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "heatmapgl" } ], "histogram": [ { "marker": { "line": { "color": "white", "width": 0.6 } }, "type": "histogram" } ], "histogram2d": [ { "colorbar": { "outlinewidth": 
1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2d" } ], "histogram2dcontour": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "histogram2dcontour" } ], "mesh3d": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "type": "mesh3d" } ], "parcoords": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "parcoords" } ], "pie": [ { "automargin": true, "type": "pie" } ], "scatter": [ { "fillpattern": { "fillmode": "overlay", "size": 10, "solidity": 0.2 }, "type": "scatter" } ], "scatter3d": [ { "line": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatter3d" } ], "scattercarpet": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattercarpet" } ], "scattergeo": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergeo" } ], "scattergl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattergl" } ], 
"scattermapbox": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scattermapbox" } ], "scatterpolar": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolar" } ], "scatterpolargl": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterpolargl" } ], "scatterternary": [ { "marker": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "type": "scatterternary" } ], "surface": [ { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" }, "colorscale": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "type": "surface" } ], "table": [ { "cells": { "fill": { "color": "rgb(237,237,237)" }, "line": { "color": "white" } }, "header": { "fill": { "color": "rgb(217,217,217)" }, "line": { "color": "white" } }, "type": "table" } ] }, "layout": { "annotationdefaults": { "arrowhead": 0, "arrowwidth": 1 }, "autosize": true, "autotypenumbers": "strict", "coloraxis": { "colorbar": { "outlinewidth": 1, "tickcolor": "rgb(36,36,36)", "ticks": "outside" } }, "colorscale": { "diverging": [ [ 0, "rgb(103,0,31)" ], [ 0.1, "rgb(178,24,43)" ], [ 0.2, "rgb(214,96,77)" ], [ 0.3, "rgb(244,165,130)" ], [ 0.4, "rgb(253,219,199)" ], [ 0.5, "rgb(247,247,247)" ], [ 0.6, "rgb(209,229,240)" ], [ 0.7, "rgb(146,197,222)" ], [ 0.8, "rgb(67,147,195)" ], [ 0.9, "rgb(33,102,172)" ], [ 1, "rgb(5,48,97)" ] ], "sequential": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 
0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ], "sequentialminus": [ [ 0, "#440154" ], [ 0.1111111111111111, "#482878" ], [ 0.2222222222222222, "#3e4989" ], [ 0.3333333333333333, "#31688e" ], [ 0.4444444444444444, "#26828e" ], [ 0.5555555555555556, "#1f9e89" ], [ 0.6666666666666666, "#35b779" ], [ 0.7777777777777778, "#6ece58" ], [ 0.8888888888888888, "#b5de2b" ], [ 1, "#fde725" ] ] }, "colorway": [ "#1F77B4", "#FF7F0E", "#2CA02C", "#D62728", "#9467BD", "#8C564B", "#E377C2", "#7F7F7F", "#BCBD22", "#17BECF" ], "font": { "color": "rgb(36,36,36)" }, "geo": { "bgcolor": "white", "lakecolor": "white", "landcolor": "white", "showlakes": true, "showland": true, "subunitcolor": "white" }, "height": 250, "hoverlabel": { "align": "left" }, "hovermode": "closest", "mapbox": { "style": "light" }, "margin": { "b": 10, "l": 10, "r": 10, "t": 10 }, "paper_bgcolor": "white", "plot_bgcolor": "white", "polar": { "angularaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "radialaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "scene": { "xaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, "showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "zaxis": { "backgroundcolor": "white", "gridcolor": "rgb(232,232,232)", "gridwidth": 2, "linecolor": "rgb(36,36,36)", "showbackground": true, "showgrid": false, 
"showline": true, "ticks": "outside", "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } }, "shapedefaults": { "fillcolor": "black", "line": { "width": 0 }, "opacity": 0.3 }, "ternary": { "aaxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "baxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" }, "bgcolor": "white", "caxis": { "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": false, "showline": true, "ticks": "outside" } }, "title": { "x": 0.5, "xanchor": "center" }, "width": 350, "xaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" }, "yaxis": { "automargin": true, "gridcolor": "rgb(232,232,232)", "linecolor": "rgb(36,36,36)", "showgrid": true, "showline": true, "ticks": "outside", "title": { "standoff": 15 }, "zeroline": false, "zerolinecolor": "rgb(36,36,36)" } } }, "width": 550, "xaxis": { "anchor": "y", "autorange": true, "domain": [ 0, 1 ], "range": [ "2001-01-01", "2023-01-01" ], "title": { "text": "Publication year" }, "type": "date" }, "yaxis": { "anchor": "x", "autorange": true, "domain": [ 0, 1 ], "range": [ 0, 100 ], "title": { "text": "count" } } } }, "image/png": "", "image/svg+xml": [ "2005201020152020020406080100Publication yearcount" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fig = px.histogram(\n", " X_train.loc[X_train[\"timestamp\"] > \"2000\", \"timestamp\"],\n", " labels={\"value\": \"Publication year\"}, width=550, height=250, \n", ")\n", "fig.update_layout(showlegend=False)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As expected, most of the articles were published between 2007 (the year Politifact was founded) and 2020 (the year the 
FakeNewsNet repository was published). But we also find that the timestamps are concentrated in the years 2016 to 2018: the year of the controversial 2016 US presidential election and the two years that followed. This concentration is a further caution that our analysis might not carry over to nonelection years. \n", "\n", "Our main aim is to use the text content for classification. We explore some word frequencies next." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Exploring Words in Articles" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We'd like to see whether there's a relationship between the words used in the articles and whether the article was labeled as `fake`. One simple way to do this is to pick an individual word like _military_ and compute the fraction of articles mentioning _military_ that were labeled `fake`. For _military_ to be useful, the articles that mention it should have a much higher or much lower fraction of fake articles than 45% (the proportion of fake articles in the train set: 264/584). 
\n", "\n", "We can use our domain knowledge of political topics to pick out a few candidate words to explore:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "tags": [ "remove-cell" ] }, "outputs": [], "source": [ "word_features = [\n", "    # names of presidential candidates\n", "    'trump', 'clinton',\n", "    # congress words\n", "    'state', 'vote', 'congress', 'shutdown',\n", "\n", "    # other possibly useful words\n", "    'military', 'princ', 'investig', 'antifa',\n", "    'joke', 'homeless', 'swamp', 'cnn', 'the'\n", "]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we define a function that creates a new feature for each word, where the feature contains `True` if the word appeared in the article and `False` if not:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "def make_word_features(df, words):\n", "    # One Boolean column per word: True where the article's content contains it.\n", "    # This is a substring match, so 'princ' also matches 'prince' and 'principle'.\n", "    features = { word: df['content'].str.contains(word) for word in words }\n", "    return pd.DataFrame(features)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is like one-hot encoding for the presence of a word (see {numref}`Chapter %s `). We can use this function to further wrangle our data and create a new data frame with a feature for each of our chosen words:" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "df_words = make_word_features(X_train, word_features)\n", "# Attach the train-set labels; pandas aligns them with df_words by index\n", "df_words[\"label\"] = y_train" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(584, 16)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_words.shape" ] }, { "cell_type": "code", "execution_count": 339, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n" ], "text/plain": [ " trump clinton state vote ... swamp cnn the label\n", "164 False False True False ... False False True 1\n", "28 False False False False ... False False True 1\n", "708 False False True True ... False False True 0\n", "193 False False False False ... False False True 1\n", "\n", "[4 rows x 16 columns]" ] }, "execution_count": 339, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df_words.head(4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can find, for each word, the proportion of articles containing that word that were labeled `fake`. We visualize these calculations in the following plots. In the left plot, we mark the proportion of `fake` articles in the entire train set using a dashed line, which helps us understand how informative each word feature is. A highly informative word will have a point that lies far away from the line:" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "tags": [ "hide-input" ] }, "outputs": [ { "data": { "image/svg+xml": [ "
\n" ], "text/plain": [ "[Figure: two dot plots with one point per word. Left panel: Proportion of articles marked fake. Right panel: Number of articles with word.]" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "fake_props = (make_word_features(X_train, word_features)\n", "              .assign(label=(y_train == 1))\n", "              .melt(id_vars=['label'], var_name='word', value_name='appeared')\n", "              .query('appeared == True')\n", "              .groupby('word')\n", "              ['label']\n", "              .agg(['mean', 'count'])\n", "              .rename(columns={'mean': 'prop_fake'})\n", "              .sort_values('prop_fake', ascending=False)\n", "              .reset_index()\n", "              .melt(id_vars='word')\n", ")\n", "\n", "g = sns.catplot(data=fake_props, x='value', y='word', col='variable',\n", "                s=5, jitter=False, sharex=False, height=3)\n", "\n", "[[prop_ax, _]] = g.axes\n", "# Dashed line: overall proportion of fake articles in the train set (about 0.45)\n", "prop_ax.axvline(y_train.mean(), linestyle='--')\n", "prop_ax.set(xlim=(-0.05, 1.05))\n", "\n", "titles = ['Proportion of articles marked fake', 'Number of articles with word']\n", "\n", "for ax, title in zip(g.axes.flat, titles):\n", "    # Give each panel its own title and drop the axis labels\n", "    ax.set(title=title)\n", "    ax.set(xlabel=None)\n", "    ax.set(ylabel=None)\n", "    ax.yaxis.grid(True);" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This plot reveals a few interesting considerations for modeling.\n", "For example, notice that the word _antifa_ is highly predictive: all articles that mention _antifa_ are labeled `fake`. However, _antifa_ appears in only a few articles. On the other hand, the word _the_ appears in nearly every article but is uninformative for distinguishing between `real` and `fake` articles, because the proportion of articles with _the_ that are fake matches the proportion of fake articles overall. We might instead do better with a word like _vote_, which is predictive and appears in many news articles." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This exploratory analysis gave us an understanding of the time frame in which our news articles were published, the broad range of publishing websites captured in the data, and candidate words to use for prediction. 
Next, we fit models for predicting whether articles are fake or real." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "celltoolbar": "Tags", "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.4" } }, "nbformat": 4, "nbformat_minor": 4 }