Suppose we collect random samples of sizes \(\small{n_1}\) and \(\small{n_2}\) from two populations and count
the number of successes (favourable events) in the samples to be \(\small{Y_1}\) and \(\small{Y_2}\) respectively. From this, we compute corresponding fractions \(\small{\dfrac{Y_1}{n_1} }\) and \(\small{\dfrac{Y_2}{n_2} }\) of
successes among the two sets of samples.
Let the proportion (probability) of success in the two populations be \(\small{p_1}\) and \(\small{p_2}\).
We wish to know the distribution of the difference \(\small{ \dfrac{Y_1}{n_1} - \dfrac{Y_2}{n_2}}\) between the observed sample proportions of success, from which a confidence interval for the difference \(\small{p_1 - p_2 }\) in the population proportions can be computed.
Since \(\small{Y_1}\) and \(\small{Y_2}\) are independent random variables representing the number of successes among \(\small{n_1}\) and \(\small{n_2}\) independent trials respectively, they follow binomial distributions with probabilities \(\small{p_1}\) and \(\small{p_2}\) of success per trial.
The variable \(\small{Y_1}\) has a binomial distribution with mean \(\small{n_1p_1}\) and standard deviation \(\small{\sqrt{n_1p_1(1-p_1)}}\).
Similarly, the variable \(\small{Y_2}\) has a binomial distribution with mean \(\small{n_2p_2}\) and standard deviation \(\small{\sqrt{n_2p_2(1-p_2)}}\).
Therefore, using the central limit theorem, we can write
\(\small{\dfrac{Y_1 - n_1p_1}{\sqrt{n_1p_1(1-p_1) }} = \dfrac{\dfrac{Y_1}{n_1} - p_1} {\sqrt{\dfrac{p_1(1-p_1)}{n_1}} } ~~~~~ }\) having an approximate distribution \(\small{N(0,1)}\) for large \(\small{n_1}\).
\(\small{\dfrac{Y_2 - n_2p_2}{\sqrt{n_2p_2(1-p_2) }} = \dfrac{\dfrac{Y_2}{n_2} - p_2} {\sqrt{\dfrac{p_2(1-p_2)}{n_2}} } ~~~~~ }\) having an approximate distribution \(\small{N(0,1)}\) for large \(\small{n_2}\).
This means the variables \(\small{\dfrac{Y_1}{n_1} }\) and \(\small{\dfrac{Y_2}{n_2} }\) follow an approximate normal distribution with means \(\small{p_1 }\) and \(\small{p_2 }\) and variances
\(\small{\dfrac{p_1(1-p_1)}{n_1} }\) and \(\small{\dfrac{p_2(1-p_2)}{n_2} }\) respectively.
If two independent variables are normally distributed, their difference is also normally distributed, with a mean equal to the difference between the individual means and a variance equal to the sum of the individual variances (that is, the standard deviations add in quadrature).
Accordingly, the difference \(\small{\dfrac{Y_1}{n_1} - \dfrac{Y_2}{n_2}}\) follows an approximate normal distribution with mean \(\small{p_1-p_2 }\) and variance
\(\small{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2} }\). We can therefore say, using the central limit theorem, that
\(\small{ \dfrac{\dfrac{Y_1}{n_1}- \dfrac{Y_2}{n_2}~-~(p_1-p_2)}{\sqrt{\dfrac{p_1(1-p_1)}{n_1} + \dfrac{p_2(1-p_2)}{n_2} } }~~~~~ }\) is approximately \(\small{N(0,1)}\).
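This approximation can be checked numerically. The following is a minimal Python sketch and not part of the derivation itself: the function name `simulate_standardized_difference`, the parameter values and the seed are illustrative assumptions. It draws each binomial count as a sum of Bernoulli trials using only the standard library, forms the standardized difference using the true \(\small{p_1}\) and \(\small{p_2}\), and reports the sample mean and variance, which should be close to 0 and 1 if the normal approximation holds.

```python
import math
import random
import statistics


def simulate_standardized_difference(n1, p1, n2, p2, reps=2000, seed=0):
    """Return `reps` simulated values of the standardized difference of proportions."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for reproducibility
    # Standard deviation of Y1/n1 - Y2/n2, computed from the true p1 and p2.
    sd = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    values = []
    for _ in range(reps):
        # Binomial counts drawn as sums of independent Bernoulli trials.
        y1 = sum(rng.random() < p1 for _ in range(n1))
        y2 = sum(rng.random() < p2 for _ in range(n2))
        values.append(((y1 / n1 - y2 / n2) - (p1 - p2)) / sd)
    return values


# Illustrative values only: n1 = 200, p1 = 0.6, n2 = 300, p2 = 0.45.
z = simulate_standardized_difference(n1=200, p1=0.6, n2=300, p2=0.45)
print("mean:", statistics.fmean(z))          # should be near 0
print("variance:", statistics.pvariance(z))  # should be near 1
```

Increasing `n1` and `n2` should bring the simulated values closer to a standard normal distribution, while very small samples or probabilities near 0 or 1 should make the approximation visibly worse.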
When the sample sizes \(\small{n_1}\) and \(\small{n_2}\) are large enough, we can replace the probabilities \(\small{p_1}\) and \(\small{p_2}\) by their estimates \(\small{\dfrac{Y_1}{n_1}}\) and \(\small{\dfrac{Y_2}{n_2}}\) respectively in the denominator of the above expression.
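The substitution just described can be sketched in Python as follows. The function name `plugin_standard_error` and the counts used in the example are hypothetical, chosen only for illustration. The code simply evaluates the denominator of the expression above with \(\small{p_1}\) and \(\small{p_2}\) replaced by the observed fractions \(\small{\dfrac{Y_1}{n_1}}\) and \(\small{\dfrac{Y_2}{n_2}}\).

```python
import math


def plugin_standard_error(y1, n1, y2, n2):
    """Estimated standard deviation of Y1/n1 - Y2/n2, with p1 and p2
    replaced by the observed sample fractions."""
    p1_hat = y1 / n1
    p2_hat = y2 / n2
    return math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)


# Hypothetical counts: 60 successes in 100 trials and 45 successes in 100 trials.
se = plugin_standard_error(60, 100, 45, 100)
print(round(se, 4))  # approximately 0.0698
```

For these made-up counts the estimated variance is \(\small{\dfrac{0.6 \times 0.4}{100} + \dfrac{0.45 \times 0.55}{100} = 0.004875}\), whose square root is about 0.0698.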
Following the same procedure adopted for the confidence interval of a Gaussian distribution, we write a \(\small{100(1-\alpha)\% }\) confidence interval for the difference in the unknown population proportions \(\small{p_1-p_2}\) as