Biostatistics with R

Distribution of difference between two sample proportions

Suppose we collect random samples of sizes \(\small{n_1}\) and \(\small{n_2}\) from two trials and count the number of successes (favourable events) to be \(\small{Y_1}\) and \(\small{Y_2}\) respectively. From this, we compute corresponding fractions \(\small{\dfrac{Y_1}{n_1} }\) and \(\small{\dfrac{Y_2}{n_2} }\) of successes among the two sets of samples.

Let the proportion (probability) of success in the two populations be \(\small{p_1}\) and \(\small{p_2}\).

We wish to know the distribution followed by the difference \(\small{ \dfrac{Y_1}{n_1} - \dfrac{Y_2}{n_2}}\) in the observed proportions of success among the samples, from which a confidence interval for the difference \(\small{p_1 - p_2 }\) can be computed.

\(\small{Y_1}\) and \(\small{Y_2}\) are independent random variables representing the number of successes among the \(\small{n_1}\) and \(\small{n_2}\) observed events respectively, so they follow binomial distributions with probabilities \(\small{p_1}\) and \(\small{p_2}\) of success per event. Therefore, using the central limit theorem, we can write,

\(\small{\dfrac{Y_1 - n_1p_1}{\sqrt{n_1p_1(1-p_1) }} = \dfrac{\dfrac{Y_1}{n_1} - p_1} {\sqrt{\dfrac{p_1(1-p_1)}{n_1}} } ~~~~~ }\) has an approximate normal distribution \(\small{N(0,1)}\) for large \(\small{n_1}\).

\(\small{\dfrac{Y_2 - n_2p_2}{\sqrt{n_2p_2(1-p_2) }} = \dfrac{\dfrac{Y_2}{n_2} - p_2} {\sqrt{\dfrac{p_2(1-p_2)}{n_2}} } ~~~~~ }\) has an approximate normal distribution \(\small{N(0,1)}\) for large \(\small{n_2}\).

This means the variables \(\small{\dfrac{Y_1}{n_1} }\) and \(\small{\dfrac{Y_2}{n_2} }\) follow an approximate normal distribution with means \(\small{p_1 }\) and \(\small{p_2 }\) and variances \(\small{\dfrac{p_1(1-p_1)}{n_1} }\) and \(\small{\dfrac{p_2(1-p_2)}{n_2} }\) respectively.

If two independent variables follow normal distributions, their difference also follows a normal distribution, with a mean equal to the difference between the individual means and a variance equal to the sum of the individual variances (equivalently, a standard deviation which is the quadratic sum of the individual standard deviations).

Accordingly, the difference \(\small{\dfrac{Y_1}{n_1} - \dfrac{Y_2}{n_2}}\) follows an approximate normal distribution with mean \(\small{p_1-p_2 }\) and variance \(\small{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2} }\). We can therefore say, using the central limit theorem, that

\(\small{ \dfrac{\dfrac{Y_1}{n_1}- \dfrac{Y_2}{n_2}~-~(p_1-p_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2} } }~~~~~ }\) is approximately \(\small{N(0,1)}\).
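
The mean and variance of the difference in sample proportions, and the approximate normality of the standardized quantity, can be checked with a small simulation. The following is only a sketch in base R: the sample sizes, probabilities, seed and number of repetitions are illustrative assumptions and do not come from the text.

```r
# Simulation sketch: check the mean and variance of Y1/n1 - Y2/n2.
# All numeric values below are illustrative assumptions.
set.seed(1)
n1 <- 200; p1 <- 0.30
n2 <- 300; p2 <- 0.20
n_sim <- 100000                 # number of simulated pairs of samples

# Independent binomial counts of successes
y1 <- rbinom(n_sim, size = n1, prob = p1)
y2 <- rbinom(n_sim, size = n2, prob = p2)

d <- y1 / n1 - y2 / n2          # difference of the sample proportions

theo_mean <- p1 - p2
theo_var  <- p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2   # variances add

# Standardized difference, expected to be approximately N(0,1)
z <- (d - theo_mean) / sqrt(theo_var)

cat("mean of d:", mean(d), " theory:", theo_mean, "\n")
cat("var of d :", var(d),  " theory:", theo_var,  "\n")
cat("fraction with |Z| <= 1.96:", mean(abs(z) <= 1.96), "\n")
```

If the normal approximation is adequate for the chosen values, the simulated mean and variance should lie close to the theoretical ones, and the fraction of standardized values within \(\small{\pm 1.96}\) should be near 0.95.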

When the sample sizes \(\small{n_1}\) and \(\small{n_2}\) are large enough, we can replace the probabilities \(\small{p_1}\) and \(\small{p_2}\) by their estimates \(\small{\dfrac{Y_1}{n_1}}\) and \(\small{\dfrac{Y_2}{n_2}}\) respectively in the denominator of the above expression.

Following the same procedure adopted for the confidence interval of a Gaussian distribution, we write a \(\small{100(1-\alpha)\% }\) confidence interval for the difference in the unknown population proportions \(\small{p_1-p_2}\) as

\(\small{ \dfrac{Y_1}{n_1} - \dfrac{Y_2}{n_2}~~\pm~~ Z_{1-\alpha/2} \sqrt{\dfrac{\dfrac{Y_1}{n_1} \left(1-\dfrac{Y_1}{n_1}\right)}{n_1} + \dfrac{\dfrac{Y_2}{n_2}\left(1-\dfrac{Y_2}{n_2}\right)}{n_2} } }\)
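
The interval above can be computed directly in R. The following is a minimal sketch: `diff_prop_ci` is a hypothetical helper name introduced here for illustration, and the counts passed to it are made-up numbers rather than data from the text. As a cross-check, base R's `prop.test()` with `correct = FALSE` (no continuity correction) should report the same interval for two groups.

```r
# Large-sample confidence interval for p1 - p2.
# diff_prop_ci is a hypothetical helper name; the counts used are made up.
diff_prop_ci <- function(y1, n1, y2, n2, conf_level = 0.95) {
  p1_hat <- y1 / n1                       # estimate of p1
  p2_hat <- y2 / n2                       # estimate of p2
  alpha  <- 1 - conf_level
  z      <- qnorm(1 - alpha / 2)          # Z_{1 - alpha/2}
  se     <- sqrt(p1_hat * (1 - p1_hat) / n1 +
                 p2_hat * (1 - p2_hat) / n2)
  est    <- p1_hat - p2_hat
  c(estimate = est, lower = est - z * se, upper = est + z * se)
}

ci <- diff_prop_ci(y1 = 60, n1 = 200, y2 = 45, n2 = 300)
print(ci)

# Cross-check with base R, continuity correction switched off
ref <- prop.test(x = c(60, 45), n = c(200, 300), correct = FALSE)
print(ref$conf.int)
```

Note that `prop.test()` applies a continuity correction by default, so its default interval is slightly wider than the expression given above.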

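We should keep in mind that the above expression for the two-sided confidence interval for the unknown difference in proportions is valid only for sufficiently large sample sizes \(\small{n_1}\) and \(\small{n_2 }\). A commonly quoted rule of thumb asks for at least about 5 to 10 successes and as many failures in each sample.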
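
The dependence on sample size can be explored by estimating how often the interval actually contains the true difference. The sketch below uses a hypothetical helper, `wald_coverage`, with illustrative sample sizes and probabilities that are assumptions made here, not values from the text.

```r
# Coverage sketch: how often does the interval contain the true p1 - p2?
# wald_coverage is a hypothetical helper; all parameter values are illustrative.
wald_coverage <- function(n1, n2, p1, p2, conf_level = 0.95, n_sim = 20000) {
  y1 <- rbinom(n_sim, size = n1, prob = p1)
  y2 <- rbinom(n_sim, size = n2, prob = p2)
  p1_hat <- y1 / n1
  p2_hat <- y2 / n2
  z   <- qnorm(1 - (1 - conf_level) / 2)
  se  <- sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
  est <- p1_hat - p2_hat
  true_diff <- p1 - p2
  # Fraction of simulated intervals that contain the true difference
  mean(est - z * se <= true_diff & true_diff <= est + z * se)
}

set.seed(2)
cov_small <- wald_coverage(n1 = 10,  n2 = 10,  p1 = 0.1, p2 = 0.3)
cov_large <- wald_coverage(n1 = 500, n2 = 500, p1 = 0.1, p2 = 0.3)
cat("estimated coverage, n1 = n2 = 10 :", cov_small, "\n")
cat("estimated coverage, n1 = n2 = 500:", cov_large, "\n")
```

For the larger samples the estimated coverage should be close to the nominal 95%, while for the small samples it may deviate noticeably from it.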