## Distribution of difference between two sample proportions

Suppose we collect random samples of sizes $\small{n_1}$ and $\small{n_2}$ from two populations and count $\small{Y_1}$ and $\small{Y_2}$ successes (favourable events) respectively. From these, we compute the corresponding fractions $\small{\dfrac{Y_1}{n_1} }$ and $\small{\dfrac{Y_2}{n_2} }$ of successes in the two samples.

Let the proportion (probability) of success in the two populations be $\small{p_1}$ and $\small{p_2}$.

We wish to know the distribution followed by the difference $\small{ \dfrac{Y_1}{n_1} - \dfrac{Y_2}{n_2}}$ in the observed sample proportions of success, from which a confidence interval for the difference $\small{p_1 - p_2 }$ can be computed.

Since $\small{Y_1}$ and $\small{Y_2}$ are independent random variables representing the number of successes among $\small{n_1}$ and $\small{n_2}$ observed events respectively, they follow binomial distributions with probabilities $\small{p_1}$ and $\small{p_2}$ of success per event. Therefore, using the central limit theorem, we can write

$\small{\dfrac{Y_1 - n_1p_1}{\sqrt{n_1p_1(1-p_1) }} = \dfrac{\dfrac{Y_1}{n_1} - p_1} {\sqrt{\dfrac{p_1(1-p_1)}{n_1}} } ~~~~~ }$ has an approximate normal distribution $\small{N(0,1)}$ for large $\small{n_1}$.

$\small{\dfrac{Y_2 - n_2p_2}{\sqrt{n_2p_2(1-p_2) }} = \dfrac{\dfrac{Y_2}{n_2} - p_2} {\sqrt{\dfrac{p_2(1-p_2)}{n_2}} } ~~~~~ }$ has an approximate normal distribution $\small{N(0,1)}$ for large $\small{n_2}$.

This means the variables $\small{\dfrac{Y_1}{n_1} }$ and $\small{\dfrac{Y_2}{n_2} }$ follow an approximate normal distribution with means $\small{p_1 }$ and $\small{p_2 }$ and variances $\small{\dfrac{p_1(1-p_1)}{n_1} }$ and $\small{\dfrac{p_2(1-p_2)}{n_2} }$ respectively.

If two independent variables follow normal distributions, their difference also follows a normal distribution. Its mean is the difference between the individual means, and its variance is the sum of the individual variances (so its standard deviation is the quadrature sum of the individual standard deviations).

Accordingly, the difference $\small{\dfrac{Y_1}{n_1} - \dfrac{Y_2}{n_2}}$ follows an approximate normal distribution with mean $\small{p_1-p_2 }$ and variance $\small{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2} }$. Standardizing this difference, we can say that

$\small{ \dfrac{\dfrac{Y_1}{n_1}- \dfrac{Y_2}{n_2}~-~(p_1-p_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2} } }~~~~~ }$ approximately follows $\small{N(0,1)}$.

When the sample sizes $\small{n_1}$ and $\small{n_2}$ are large enough, we can replace the probabilities $\small{p_1}$ and $\small{p_2}$ by their estimates $\small{\dfrac{Y_1}{n_1}}$ and $\small{\dfrac{Y_2}{n_2}}$ respectively in the denominator of the above expression.
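The claim that the standardized difference is approximately $\small{N(0,1)}$, with either the true or the estimated proportions in the denominator, can be checked numerically. The following is a minimal simulation sketch using only the Python standard library. The function name `simulate_z`, the parameter values and the seed are illustrative assumptions, not part of the derivation.

```python
import math
import random
import statistics


def simulate_z(n1, p1, n2, p2, reps=2000, seed=0, plug_in=False):
    """Simulate the standardized difference of two sample proportions.

    Hypothetical helper for illustration. With plug_in=False the true
    p1, p2 are used in the denominator; with plug_in=True they are
    replaced by the sample estimates Y1/n1 and Y2/n2.
    """
    rng = random.Random(seed)
    true_sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    zs = []
    for _ in range(reps):
        # Each Y is a binomial count: a sum of independent Bernoulli trials.
        y1 = sum(rng.random() < p1 for _ in range(n1))
        y2 = sum(rng.random() < p2 for _ in range(n2))
        f1, f2 = y1 / n1, y2 / n2
        if plug_in:
            # Assumes 0 < y1 < n1 or 0 < y2 < n2, so the estimate is nonzero.
            sd = math.sqrt(f1 * (1 - f1) / n1 + f2 * (1 - f2) / n2)
        else:
            sd = true_sd
        zs.append(((f1 - f2) - (p1 - p2)) / sd)
    return zs


z_true = simulate_z(200, 0.6, 300, 0.45)
z_plug = simulate_z(200, 0.6, 300, 0.45, plug_in=True)

# Both samples should have a mean near 0 and a standard deviation near 1.
print(statistics.mean(z_true), statistics.stdev(z_true))
print(statistics.mean(z_plug), statistics.stdev(z_plug))
```

For sample sizes of a few hundred, the two versions should give nearly identical summaries, which is what justifies the substitution for large samples.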

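Following the same procedure used to construct a confidence interval for a Gaussian distribution, and writing $\small{Z_{1-\alpha/2}}$ for the $\small{(1-\alpha/2)}$ quantile of the standard normal distribution, we obtain a $\small{100(1-\alpha)\% }$ confidence interval for the difference in the unknown population proportions $\small{p_1-p_2}$ as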

$\small{ \dfrac{Y_1}{n_1} - \dfrac{Y_2}{n_2}~~\pm~~ Z_{1-\alpha/2} \sqrt{\dfrac{\dfrac{Y_1}{n_1} \left(1-\dfrac{Y_1}{n_1}\right)}{n_1} + \dfrac{\dfrac{Y_2}{n_2}\left(1-\dfrac{Y_2}{n_2}\right)}{n_2} } }$
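As an illustration, the interval above can be evaluated with a few lines of Python. This is a minimal sketch using only the standard library. The function name `two_proportion_ci` and the example counts are assumptions made for illustration, not data from the text.

```python
import math
from statistics import NormalDist


def two_proportion_ci(y1, n1, y2, n2, alpha=0.05):
    """Large-sample 100(1 - alpha)% confidence interval for p1 - p2.

    Hypothetical helper: y1, y2 are success counts and n1, n2 are
    the corresponding sample sizes.
    """
    if n1 <= 0 or n2 <= 0:
        raise ValueError("sample sizes must be positive")
    if not (0 <= y1 <= n1 and 0 <= y2 <= n2):
        raise ValueError("success counts must lie between 0 and n")

    f1, f2 = y1 / n1, y2 / n2              # sample proportions
    z = NormalDist().inv_cdf(1 - alpha / 2)  # Z_{1 - alpha/2}
    # Estimated standard deviation of the difference in proportions.
    se = math.sqrt(f1 * (1 - f1) / n1 + f2 * (1 - f2) / n2)
    diff = f1 - f2
    return diff - z * se, diff + z * se


# Illustrative counts: 60 successes out of 100 versus 45 out of 100.
lower, upper = two_proportion_ci(60, 100, 45, 100)
print(lower, upper)
```

With these illustrative counts the 95% interval is roughly (0.013, 0.287). It excludes zero, but only narrowly.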

We should keep in mind that the above expression for the two-sided confidence interval for the unknown proportion difference relies on the normal approximation, and is therefore valid only for sufficiently large sample sizes $\small{n_1}$ and $\small{n_2 }$.