Distributions: Population, Empirical, Sampling

16.1. Distributions: Population, Empirical, Sampling

The population, sampling, and empirical distributions are distinct concepts that are important to understand when making inferences about a model or predictions for new observations. Figure 16.1 provides a diagram that can help distinguish between them. It is based on the notions of population and access frame from Chapter 2 and the urn model from Chapter 3. On the left is the population that we are studying, represented as marbles in an urn, one for each unit. We have simplified things to the situation where the access frame and the population are the same; that is, we can access every unit in the population through the frame. The arrow from the urn to the sample represents the design, meaning the protocol for selecting the sample from the frame. This selection process uses a chance mechanism, which is why we represent the population as an urn filled with indistinguishable marbles. On the right, the collection of marbles constitutes our sample.

../../_images/SamplingTriptych.png

Fig. 16.1 This diagram of the data generation process shows the three distributions. The population distribution is typically not observed, and the empirical distribution is based on the sample. In the middle, the sampling distribution of the summary statistic is a probability distribution that is determined by the population and the mechanism for selecting the sample.

To keep the diagram simple, we consider the values of just one feature. Below the population on the left in the diagram is the population histogram for the feature. That is, the histogram represents the distribution of values for the feature across the entire population. On the right, the histogram is an empirical one that shows the distribution of the values for the feature only for the sample. Notice that these two distributions are similar in shape. This is because our sampling mechanism produces representative samples.
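
The contrast between the two histograms can be sketched in code. The example below is a minimal illustration, not the analysis behind the figure: the population values, their weights, the population size, and the sample size are all invented, and the design is assumed to be a simple random sample drawn without replacement.

```python
import random
from collections import Counter

random.seed(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: one value of a single feature per unit (marble).
# The values 1-5 and their weights are made up for illustration.
population = random.choices([1, 2, 3, 4, 5], weights=[5, 20, 40, 25, 10], k=10_000)

# Assumed design: a simple random sample drawn without replacement from the urn.
sample = random.sample(population, k=500)

def proportions(values):
    """Return the proportion of units at each distinct value."""
    counts = Counter(values)
    return {v: counts[v] / len(values) for v in sorted(counts)}

pop_dist = proportions(population)  # population distribution (usually unobserved)
emp_dist = proportions(sample)      # empirical distribution (from the sample)

for v in pop_dist:
    print(f"value {v}: population {pop_dist[v]:.3f}, sample {emp_dist.get(v, 0):.3f}")
```

With a chance mechanism like this one, the two columns of proportions should be close but not identical, which mirrors the similar shapes of the two histograms in the diagram.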

We are often interested in a summary of the sample, such as the mean, the median, or the slope from a simple linear model. Typically, this summary statistic is an estimate of a population parameter, such as the population mean or median. The population parameter is shown as \(\theta^*\) on the left of the diagram, and the summary statistic, calculated from the sample, is \(\hat{\theta}\).
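
As a concrete case, take the mean as the summary. The notation here is an assumption introduced only for illustration: the population has \(N\) units with feature values \(x_1, \ldots, x_N\), and \(S\) is the set of \(n\) units selected for the sample. Then the parameter and its estimate are

\[
\theta^* = \frac{1}{N} \sum_{i=1}^{N} x_i,
\qquad
\hat{\theta} = \frac{1}{n} \sum_{i \in S} x_i .
\]

The parameter \(\theta^*\) is a fixed number, while \(\hat{\theta}\) depends on which units happen to land in \(S\).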

The chance mechanism that generates our sample might well produce a different sample if we were to conduct our investigation again. But if the protocols are well designed, we expect the sample to resemble the population, so that we can infer the population parameter from the summary statistic of our sample. The sampling distribution in the middle of the diagram is a probability distribution for the statistic. It shows the possible values that the statistic might take and their chances. Earlier, in Chapter 3, we used simulation to estimate the sampling distribution for several examples. We will revisit these and other examples in this chapter to formalize the analysis.
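
The idea of repeating the investigation can be sketched with a small simulation. This is a rough illustration under stated assumptions rather than one of the Chapter 3 examples: the population is hypothetical (values drawn once from a normal curve and then treated as fixed), the statistic is the sample mean, and the names `sample_mean`, `theta_star`, and `theta_hats` are invented for this sketch.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population (values invented for illustration).
population = [random.gauss(50, 10) for _ in range(10_000)]
theta_star = statistics.mean(population)  # population parameter

def sample_mean(n):
    """Draw one simple random sample of size n and return its mean."""
    return statistics.mean(random.sample(population, k=n))

# Repeat the investigation many times to approximate the sampling distribution.
n, reps = 100, 2_000
theta_hats = [sample_mean(n) for _ in range(reps)]

print(f"population mean: {theta_star:.2f}")
print(f"average of sample means: {statistics.mean(theta_hats):.2f}")
print(f"SD of sample means: {statistics.stdev(theta_hats):.2f}")
```

A histogram of `theta_hats` approximates the sampling distribution in the middle of the diagram. Under these assumptions, the sample means should center near `theta_star`, with a spread of roughly the population standard deviation divided by \(\sqrt{n}\).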

Typically, we don’t know the population distribution or the parameter, so we try to infer the parameter or predict values for unseen units in the population. At times, a conjecture about the population parameter can be tested using the sample. This is the topic of the next section.