16.8. Exercises

  • Here we walk you through a test of the hypothesis related to the zeros observed in the two groups. Specifically, we examine the difference in the proportion of zeros in the groups: \( \hat{p}_{T} - \hat{p}_C,\) where \(\hat{p}_T = \# 0s \textrm{ in treatment} / 100\), and \(\hat{p}_C\) is similarly defined for the control group.

    • What is the Null Hypothesis?

    • Describe an urn that reflects the null. Include in your description the number of draws from the urn, whether draws are with or without replacement, and the statistic you plan to compute.

    • Carry out the simulation with 10,000 trials. Recall from Chapter 3 that there is a named probability distribution that you can use to carry out your simulation.

    • Plot the approximate sampling distribution for the statistic.

    • Compute the approximate \(p\)-value, which is the chance of seeing a difference at least as big as your observed statistic.

    • Describe your conclusions from this hypothesis test. Is the probability a surprise?

  • Let’s again walk through a hypothesis test for the Wikipedia experiment. This time though, the test is based on a comparison of medians, and is an example of a randomization test. The test works conditionally on the observed data, and considers all possible alternative outcomes based on a different, random assignment.

    • The test statistic is the difference: \(median_T - median_C\).

    • What is the Null Hypothesis? This time, our model doesn’t reduce to an urn of marbles labeled zero or one.

    • Describe an urn that reflects the null. Include in your description the number of draws from the urn, whether draws are with or without replacement, and the statistic you plan to compute.

    • Carry out the simulation with 10,000 trials. Simulating this sampling distribution was not practical before today’s world of cheap, powerful computers.

      • To carry out this simulation, set up the urn as an array of productivity data, where the first 100 values are the productivity values from the treatment and the next 100 are from the control group.

      • Shuffle the 200 values in the array. Then the first 100 values will represent a randomly sampled treatment group and the next 100 the control. We write a function to find the difference in medians.

      • Shuffle the marbles in the urn 10,000 times, and compute the difference in medians for each shuffle.

    • Plot the approximate sampling distribution for the statistic.

    • Compute the approximate \(p\)-value, which is the chance of seeing a difference at least as big as your observed statistic.

    • Describe your conclusions from this hypothesis test. Is the probability a surprise?
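The shuffling steps above can be sketched as follows. This is only a minimal illustration: the productivity values here are made-up placeholders drawn from a skewed distribution, and the names `urn`, `median_diff`, and `simulated` are our own choices. With the real data, the array would hold the 100 treatment values followed by the 100 control values.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical stand-in data: skewed "productivity" values for 200 contributors.
# The first 100 play the role of treatment, the next 100 the control group.
urn = np.concatenate([
    rng.exponential(scale=60, size=100),   # placeholder treatment values
    rng.exponential(scale=40, size=100),   # placeholder control values
])

def median_diff(values, n_treat=100):
    """Median of the first n_treat values minus the median of the rest."""
    return np.median(values[:n_treat]) - np.median(values[n_treat:])

observed = median_diff(urn)

# rng.permutation returns a shuffled copy, so urn itself is left unchanged.
n_trials = 10_000
simulated = np.array([median_diff(rng.permutation(urn)) for _ in range(n_trials)])

# One-sided approximate p-value: share of shuffles at least as large as observed.
p_value = np.mean(simulated >= observed)
```

The direction of the comparison in the last line depends on the alternative of interest; if the observed difference is negative, the tail of interest is `simulated <= observed` instead.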


16.8.1. A test for a difference in medians

We saw earlier that the two groups had quite different median productivity. Under the null, we expect the treatment to have no effect: each top contributor’s productivity would be the same whether or not they received an informal award. Any difference in the observed medians for the two groups is simply due to the chance assignment to treatment and control.

Now our statistic is the difference: \(median_T - median_C\). This time, our model doesn’t reduce to an urn of marbles labeled zero or one. Instead, it is an urn with 200 marbles, each labeled with the productivity of one of the contributors. Again, we draw 100 marbles from the urn to find the treatment group, and the marbles remaining in the urn represent the control group.

#urn = wiki[['postproductivity']].to_numpy()

Our statistic is the difference:

#sample_median_dif = np.median(urn[100:]) - np.median(urn[:100])
#sample_median_dif

As we might expect, the sampling distribution of the difference in medians is centered on 0. This reflects our assumptions of the treatment having no effect. Our observed statistic is quite large (in the negative sense). We use this simulated sampling distribution to find the approximate \(p\)-value for observing a difference in medians at least as big as ours.

#len([elem for elem in median_diff_simulation if elem <= -212]) / 10000

This \(p\)-value is an even bigger surprise than the \(p\)-value based on the proportion of zeros in the groups. Only about 3 of the 10,000 simulated samples had a difference this large. This test raises doubts about the null model. Statistical logic has us conclude that the pattern is real. How do we interpret this? The experiment was well designed. The 200 contributors were selected at random from the top 1% and then they were divided at random into two groups. These chance processes say that we can rely on the sample of 200 being representative of top contributors, and the treatment and control groups being similar to each other in every way except for the application of the treatment. Given the careful design, we conclude that informal awards have a positive effect on productivity for top contributors.

A simulation like the one we just performed was not practical before today’s world of cheap, powerful computers. An alternative approach, developed in the 1950s and 1960s, uses rank statistics. The approach is still used today with highly skewed samples and multiple tests. We describe it next.

We wrap up this section by noting that the sampling distributions for all three tests are roughly symmetric and normal in shape. But the sampling distribution for the rank-sum statistic is nearly perfectly normal. This is because ranks maintain order but remove skewness from the data, so the sum of random ranks is approximately normal for samples of as few as 20 to 30 values. Although we simply simulated the approximate sampling distribution of the rank-sum and did not use this normal property, the statistic was originally designed with this useful property in mind. Today, when conducting hundreds or thousands of A/B tests, many of which have skewed empirical distributions, tests based on ranks are still popular.
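To make the claim about the rank-sum concrete, here is a small sketch that simulates the sum of the ranks assigned to a treatment group of 100 out of 200, and compares the simulated center and spread with the standard null mean and standard deviation of the rank-sum. The variable names are ours, and the sketch assumes no ties among the 200 values, which may not hold for the actual data.

```python
import numpy as np

rng = np.random.default_rng(0)

n_total, n_treat = 200, 100
ranks = np.arange(1, n_total + 1)   # ranks 1..200; ties are ignored in this sketch

# Sum of the ranks that land in the treatment group under random assignment.
n_trials = 10_000
rank_sums = np.array([
    rng.choice(ranks, size=n_treat, replace=False).sum()
    for _ in range(n_trials)
])

# Standard null mean and SD of the rank-sum for groups of size n_treat and n_ctrl.
n_ctrl = n_total - n_treat
null_mean = n_treat * (n_total + 1) / 2
null_sd = np.sqrt(n_treat * n_ctrl * (n_total + 1) / 12)

print(rank_sums.mean(), null_mean)
print(rank_sums.std(), null_sd)
```

A histogram of `rank_sums` overlaid with a normal curve that uses `null_mean` and `null_sd` is one way to check the shape by eye.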

The simulations we carried out in this section were possible because our null model completely specified the chance mechanism for the data. That is often not the case, and the next section introduces an approach to approximate the model using the data.

16.8.2. Example: A test for proportions

We encountered a hypothesis test in Chapter 3 without actually calling it that. Recall the randomized clinical trial of the J&J vaccine. There were 43,738 people enrolled in the trial, randomly split into two equal groups. The group that received the vaccine had 117 sick people, and the control group had 351.

Statistic: The statistic used is vaccine efficacy,

\[ \hat{VE} = \frac {\hat{p}_{C} - \hat{p}_T} {\hat{p}_C} \]

And the value of vaccine efficacy in this trial was,

\[ \frac{(351/21869 - 117/21869)} {351/21869} = \frac{351 - 117}{351} = 0.67\]
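As a quick check of this arithmetic, the calculation can be reproduced in a few lines. The variable names are our own; the counts come from the trial description above.

```python
n_per_group = 43_738 // 2       # 21,869 people in each group
sick_treatment = 117
sick_control = 351

# Proportion of sick people in each group.
p_t = sick_treatment / n_per_group
p_c = sick_control / n_per_group

# Efficacy is the relative reduction in the sick rate, with control as the baseline.
vaccine_efficacy = (p_c - p_t) / p_c
print(round(vaccine_efficacy, 2))   # prints 0.67
```

Because the two groups are the same size, the group size cancels and the efficacy depends only on the two counts of sick people.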

Model: The CDC set a standard of 50% for \(VE\), meaning that the efficacy has to exceed 50% to be approved for distribution. The model assumes that the vaccine has no effect on health and any difference observed between the treatment and control groups is due to the chance process in assigning people to groups. We set the Null hypothesis as the status quo that the vaccine doesn’t work; we hope to find a surprise and reject the Null.

In this first test our statistic relates to the zeros observed in the two groups. Specifically, we examine the difference in the proportion of zeros in the groups.

\[ \hat{p}_{T} - \hat{p}_C,\]

where \(\hat{p}_T = \# 0s \textrm{ in treatment} / 100\), and \(\hat{p}_C\) is similarly defined for the control group. The Null Hypothesis assumes that this difference is 0, and it’s simply the luck of the draw that resulted in so few zeros in the treatment group.

Under the null, the proportion of zeros in the treatment and control groups is the same, and any observed difference is due to the random variation in assignment to the treatment or control group. In other words, the 19 contributors who had no productivity in the 90-day period following the treatment would have behaved exactly the same whether they were assigned to treatment or control. We can represent this null model as an urn with 200 marbles, 19 marked \(0\) and 181 marked \(1\), from which we draw 100 marbles without replacement to select the treatment group. In Chapter 3, we encountered a similar situation with a vaccine trial. There we found that draws from a 0-1 urn followed a named distribution called the hypergeometric, and numpy provides a simulator for this model, making it easy to find the p-value.
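One way this simulation might look is sketched below, using numpy's hypergeometric sampler. The urn contents (19 zeros, 181 ones, 100 draws) come from the description above. The value of `observed_zeros_T` is a placeholder we made up for illustration, not the count from the experiment, and should be replaced with the number of zeros actually observed in the treatment group.

```python
import numpy as np

rng = np.random.default_rng(0)

n_zeros, n_ones, n_draws = 19, 181, 100   # urn contents and number of draws
n_trials = 10_000

# Number of 0-marbles landing in the treatment group in each simulated assignment.
zeros_T = rng.hypergeometric(ngood=n_zeros, nbad=n_ones, nsample=n_draws, size=n_trials)

# The control group gets the remaining zeros, so the difference in proportions is:
diff_props = zeros_T / n_draws - (n_zeros - zeros_T) / n_draws

# Placeholder only: substitute the number of zeros actually observed in treatment.
observed_zeros_T = 5
observed_diff = observed_zeros_T / n_draws - (n_zeros - observed_zeros_T) / n_draws

# Approximate p-value: share of simulated differences at least as extreme
# (here, at least as negative) as the observed difference.
p_value = np.mean(diff_props <= observed_diff)
```

Since both groups have 100 members, the difference in proportions is determined entirely by the number of zeros that land in the treatment group, which is why a single hypergeometric draw per trial is enough.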