{
"cells": [
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"tags": [
"remove-cell"
]
},
"outputs": [],
"source": [
"# Reference: https://jupyterbook.org/interactive/hiding.html\n",
"# Use {hide, remove}-{input, output, cell} tags to hiding content\n",
"\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"%matplotlib inline\n",
"import ipywidgets as widgets\n",
"from ipywidgets import interact, interactive, fixed, interact_manual\n",
"from IPython.display import display\n",
"\n",
"sns.set()\n",
"sns.set_context('talk')\n",
"np.set_printoptions(threshold=20, precision=2, suppress=True)\n",
"pd.set_option('display.max_rows', 7)\n",
"pd.set_option('display.max_columns', 8)\n",
"pd.set_option('precision', 2)\n",
"# This option stops scientific notation for pandas\n",
"# pd.set_option('display.float_format', '{:.2f}'.format)\n",
"\n",
"def display_df(df, rows=pd.options.display.max_rows,\n",
" cols=pd.options.display.max_columns):\n",
" with pd.option_context('display.max_rows', rows,\n",
" 'display.max_columns', cols):\n",
" display(df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exercises\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- In cluster sampling, the population is divided into non-overlapping subgroups, which tend to be smaller than strata. The sampling method is to take a simple random sample of the clusters and include all of the units in the chosen clusters in the sample. Use the urn analogy, to represent cluster sampling. As a simple example, suppose our population of $7$ starship prototypes is placed into $4$ clusters as follows: $\\left(A, B\\right)~ \\left(C, D\\right)~ \\left(E, F\\right) ~ \\left(G\\right).$ And, imagine we take a SRS of $2$ clusters.\n",
" - List all of the possible collections of starships that might result. \n",
" - What is the chance that $A$ is in the sample?\n",
" - What is the chance that $A$, $C$ and $E$ are in the sample? \n",
"\n",
"\n",
"- Cluster sampling has the advantage of making sample collection easier. For example, it is much easier to poll 100 homes of 2-4 people each than to poll 300 individuals. But, since people in a cluster tend to be similar to each other, we need to keep the sampling procedure in mind as we generalize from sample to population. Continue with the starship clusters from the previous exercise, and suppose additionally that prototypes, $A$, $B$, $E$ and $F$ are defective. \n",
" - Using the list of all possible samples that might result that you found in the previous exercise, calculate the fraction of defective prototypes for each sample.\n",
" - Create the sampling distribution of the fraction of defective prototypes. Represent it in a table that contains the possible fractions and the chance of observing each fraction. \n",
" - Compare this sampling distribution to one obtained for a SRS of three prototypes. \n",
" - Notice that the clusters contain similar prototypes (in terms of whether they are defective). How does this impact the shape of the sampling distribution? \n",
"\n",
"\n",
"- Systematic sampling is another popular technique. To start, the population is ordered, and the first unit is selected at random from the first $k$ elements. Then, every $k^{th}$ unit after that is placed in the sample. As an example, suppose our population of $7$ prototypes is ordered alphabetically and we select one from the first three $A$, $B$, $C$ at random, and then every third element after that. \n",
" - List all of the possible samples that might result.\n",
" - What is the chance that $A$ is in the sample?\n",
" - What is the chance that $A$ and $B$ are in the sample? $A$ and $G$? \n",
"\n",
"- An example of an online *intercept survey* is when a popup window asks you to complete a brief questionnaire. If every $k^{th}$ visitor to a website is asked to complete a survey, then this intercept survey behaves like a systematic sample.\n",
" - Describe the population, access frame, and sample for an intercept survey.\n",
" - Consider a stream of visits to a website on one day, $V1$, $V2$, .... Suppose you choose the 20th visit, $V20$, to your site as the first to receive the popup survey. Then, you administer the popup survey to every 20th visit after that. What's the chance that the 17th and 27th visits are in the sample? What's the chance that the 17th and 37th visits are in the sample? \n",
" - Why don't you need to know the size of the population to calculate the above chances?\n",
" - Does it seem reasonable to imagine that this sample doesn't introduce a selection bias in the sampling process? Why or why not?"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"- A recent report found that half of the prisoners released in any given year in the United States will end up back in prison within three years. Yet, another report that follows individual prisoners released from prison over 20 years found that about 2/3 of the prisoners released will never end up back in prison, over their whole lifetime. How is this possible?\n",
" - For each report, carefully describe the population of interest and the access frame.\n",
" - Some have described this apparent paradox as the difference between an \"event-based\" sample and an \"offender-based\" sample. What might this mean? \n",
" - The notion of size-biased sampling and length-biased sampling occur when a unit that is larger/longer is more likely to be part of the sample than another smaller/shorter unit. Explain how a length bias may be responsible for these apparent conflicting findings. "
]
}
],
"metadata": {
"celltoolbar": "Tags",
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.4"
}
},
"nbformat": 4,
"nbformat_minor": 4
}