Biostatistics with R

Distribution of sample proportion

In some biological experiments, we sample a population and determine the fraction of the sample with a particular trait or property. For example, from a sample survey of 600 adults in a city, we estimate the fraction of people who practice walking as an exercise to be 0.19 (i.e., \(\small{19\% }\)).

Suppose we could survey the entire population of the city to determine the true fraction p of people having this trait. How close is the estimate of 0.19, based on n random samples, to this true fraction p? In reality, we will rarely be able to survey the entire population to get the value of p. Using our sample fraction, however, we can construct a confidence interval for the true fraction p.

In a random sample of size n from a population, let Y be the number (frequency) of observations that show the trait we are looking for. We can call this "Y successes out of n observations".

This gives the observed fraction \(~~~\small{f~=~\dfrac{Y}{n} }\)

If the random samples are independent of each other and the probability p is a constant, the count Y follows a binomial distribution \(\small{P_b(Y,n,p) }\) with mean and standard deviation given by \(~\small{\mu=np }~\) and \(~\small{\sigma=\sqrt{np(1-p) } }\). The observed fraction f is this count divided by n.
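As a small numerical illustration of these two formulas, the sketch below evaluates the mean and standard deviation for the survey example. It assumes, purely for illustration, that the true p equals 0.19; the function name `binomial_mean_sd` is hypothetical and not part of any library.

```python
import math

def binomial_mean_sd(n, p):
    """Return (mean, sd) of the success count Y ~ Binomial(n, p)."""
    return n * p, math.sqrt(n * p * (1 - p))

# Assumption for illustration only: the true proportion p is known.
n, p = 600, 0.19
mu, sigma = binomial_mean_sd(n, p)

# Dividing the count Y by n rescales both quantities by 1/n,
# giving the mean and sd of the observed fraction Y/n.
mean_f, sd_f = mu / n, sigma / n

print(f"Y:   mean = {mu:.2f}, sd = {sigma:.4f}")
print(f"Y/n: mean = {mean_f:.4f}, sd = {sd_f:.5f}")
```

The standard deviation of the fraction, \(\small{\sqrt{p(1-p)/n} }\), is the quantity that appears in the denominator of the standardized variable used in the rest of this section.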

Using the central limit theorem, we can write

\(\small{ \dfrac{Y - \mu}{\sigma}~=~\dfrac{Y - np}{\sqrt{np(1-p)}}~= \dfrac{\dfrac{Y}{n} - p}{\sqrt{\dfrac{p(1-p)}{n} } } }~~~~\) approximately follows \(\small{N(0,1) }\) provided the sample size n is large.
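The approximate normality can be checked with a quick simulation. This is only a sketch under stated assumptions: p is treated as known (0.19), the sample size and the number of repetitions are arbitrary choices, and the helper name `standardized_fraction` is hypothetical.

```python
import math
import random

def standardized_fraction(n, p, rng):
    """Draw one sample of size n and return (Y/n - p) / sqrt(p(1-p)/n)."""
    y = sum(rng.random() < p for _ in range(n))  # number of successes
    return (y / n - p) / math.sqrt(p * (1 - p) / n)

rng = random.Random(1)            # fixed seed so the run is repeatable
n, p, repeats = 600, 0.19, 4000   # illustrative values, p assumed known
z_values = [standardized_fraction(n, p, rng) for _ in range(repeats)]

mean_z = sum(z_values) / repeats
sd_z = math.sqrt(sum((z - mean_z) ** 2 for z in z_values) / (repeats - 1))
inside = sum(abs(z) <= 1.96 for z in z_values) / repeats

print(f"mean = {mean_z:.3f}, sd = {sd_z:.3f}")
print(f"fraction within +/-1.96 = {inside:.3f}")
```

If the approximation holds, the mean should be close to 0, the standard deviation close to 1, and the fraction of values within ±1.96 close to 0.95.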

Because of this approximate normality, for a given probability \(\small{1-\alpha}\) we can find the two-sided Z values \(\small{+Z_{1-\alpha/2} }\) and \(\small{-Z_{1-\alpha/2} }\) on the unit normal distribution such that,

\(\small{P\left[ -Z_{1-\alpha/2}~ \leq~ \dfrac{\left(\dfrac{Y}{n}\right) - p}{\sqrt{\dfrac{p(1-p)}{n} }}~\leq~Z_{1-\alpha/2} \right] ~~ \approx ~~ 1-\alpha }\)
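The critical value \(\small{Z_{1-\alpha/2} }\) is the \(\small{(1-\alpha/2) }\) quantile of the unit normal distribution. A minimal sketch of how it can be computed, using `NormalDist` from Python's standard `statistics` module (the wrapper name `z_critical` is hypothetical):

```python
from statistics import NormalDist

def z_critical(alpha):
    """Two-sided critical value Z_{1-alpha/2} of the unit normal."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    z = z_critical(alpha)
    # Probability mass between -z and +z; this should recover 1 - alpha.
    coverage = NormalDist().cdf(z) - NormalDist().cdf(-z)
    print(f"alpha = {alpha:.2f}  Z = {z:.4f}  P(-Z <= z <= Z) = {coverage:.4f}")
```

For the commonly used \(\small{\alpha = 0.05 }\), the value is approximately 1.96.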

Rearranging the inequality inside the square brackets of the probability statement, we get a \(\small{100(1-\alpha)\% }\) confidence interval for the population proportion p as,

\(\small{P\left[\dfrac{Y}{n}-Z_{1-\alpha/2} \sqrt{\dfrac{p(1-p)}{n}}~~\leq~~p~~\leq~~ \dfrac{Y}{n}+Z_{1-\alpha/2} \sqrt{\dfrac{p(1-p)}{n}} \right] ~~\approx~~1-\alpha }\)

In the above expression, the unknown probability p of success on each trial still appears in the interval limits. We handle this under two cases: when n is large and when n is small.

Case 1 : Confidence interval for proportion when n is large

When the number of samples n is large, we can take the observed fraction of successes \(\small{\dfrac{Y}{n} }\) to be approximately equal to p. Substituting this ratio for p under the square root in the above inequality, we get the approximate \(\small{100(1-\alpha)\% }\) confidence interval for the population proportion p of successes when the number of observations is large:

\(\small{ \dfrac{Y}{n}~~\pm~~Z_{1-\alpha/2} \sqrt{\dfrac{\left(\dfrac{Y}{n}\right) \left(1 - \dfrac{Y}{n} \right)}{n} } }\)
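Applied to the survey example, the large-sample formula can be sketched as follows. The count Y = 114 is inferred here from 0.19 × 600 and is not stated explicitly in the text; the function name `large_sample_interval` is hypothetical.

```python
import math
from statistics import NormalDist

def large_sample_interval(y, n, alpha=0.05):
    """Approximate 100(1-alpha)% interval for p, with Y/n substituted for p."""
    f = y / n                                    # observed fraction Y/n
    z = NormalDist().inv_cdf(1 - alpha / 2)      # Z_{1-alpha/2}
    half_width = z * math.sqrt(f * (1 - f) / n)
    return f - half_width, f + half_width

# Assumed count: 19% of 600 adults corresponds to Y = 114.
lower, upper = large_sample_interval(114, 600)
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
```

With these inputs the 95% interval comes out at roughly 0.159 to 0.221, so the survey is consistent with a true walking fraction anywhere between about 16% and 22%.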

Case 2 : Confidence interval for proportion when n is small

When the number of samples n is small, we cannot take the ratio \(\small{\dfrac{Y}{n} }\) to be approximately equal to the probability p for observing a success. In this case we need to solve the inequality for p itself.

We start with the inequality we wrote before:

\(\small{ \left| \dfrac{\left(\dfrac{Y}{n}\right) - p}{\sqrt{\dfrac{p(1-p)}{n} }} \right| ~~\leq~~Z_{1-\alpha/2} }\)

Squaring both sides of the above inequality and rearranging, we can write,

\(\small{\left(\dfrac{Y}{n} - p \right)^2 ~-~ \dfrac{Z_{1-\alpha/2}^2 p (1-p)}{n}~ \leq~ 0 }\)

Taking the boundary of this inequality, where equality holds, we can write the equation,

\(\small{\left(\dfrac{Y}{n} - p \right)^2 ~-~ \dfrac{Z_{1-\alpha/2}^2 p (1-p)}{n}~ = ~ 0 }\)

This equation is quadratic in p. Its two solutions give the lower and upper bounds of the confidence interval on p.
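The intermediate algebra is not spelled out in the text, but it can be sketched as follows, writing \(\small{f = Y/n }\) and \(\small{z = Z_{1-\alpha/2} }\) as shorthand introduced here for brevity. Expanding the square and collecting powers of p gives the standard quadratic form

\(\small{ \left(1 + \dfrac{z^2}{n}\right)p^2 ~-~ \left(2f + \dfrac{z^2}{n}\right)p ~+~ f^2 ~=~ 0 }\)

whose discriminant simplifies to

\(\small{ \left(2f + \dfrac{z^2}{n}\right)^2 ~-~ 4\left(1 + \dfrac{z^2}{n}\right)f^2 ~=~ 4z^2\left[\dfrac{f(1-f)}{n} + \dfrac{z^2}{4n^2}\right] }\)

Applying the usual quadratic formula and cancelling the common factor of 2 leads to the two roots.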

Solving the above quadratic equation, we finally get the two solutions as,

\(\small{ \dfrac{ \left( \dfrac{Y}{n}\right) + \dfrac{Z_{1-\alpha/2}^2}{2n} ~~\pm~~ Z_{1-\alpha/2} \sqrt{ \dfrac{\dfrac{Y}{n}\left(1-\dfrac{Y}{n}\right)}{n} + \dfrac{Z_{1-\alpha/2}^2}{4n^2} } } {1 + \dfrac{Z_{1-\alpha/2}^2}{n} } }\)

These two values are the end points of an approximate \(\small{100(1-\alpha)\% }\) confidence interval for the population proportion p of successes, when the number of observations n is small.
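This small-sample result is commonly known as the Wilson score interval. A minimal sketch of it in code follows; the function name `small_sample_interval` and the data (3 successes in 20 observations) are hypothetical, chosen only to illustrate a small n.

```python
import math
from statistics import NormalDist

def small_sample_interval(y, n, alpha=0.05):
    """Approximate 100(1-alpha)% interval for p from the quadratic solution."""
    f = y / n                                   # observed fraction Y/n
    z = NormalDist().inv_cdf(1 - alpha / 2)     # Z_{1-alpha/2}
    centre = f + z**2 / (2 * n)
    spread = z * math.sqrt(f * (1 - f) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (centre - spread) / denom, (centre + spread) / denom

# Hypothetical small sample: 3 successes out of 20 observations.
lower, upper = small_sample_interval(3, 20)
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
```

Unlike the large-sample formula, this interval is not symmetric about \(\small{Y/n }\), and its end points always stay between 0 and 1.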

Note that when n is large enough to neglect the terms containing \(\small{ \dfrac{Z_{1-\alpha/2}^2}{n} }\) and \(\small{ \dfrac{Z_{1-\alpha/2}^2}{n^2} }\) in the above expression, it reduces to the expression in case 1 for large sample size n.
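This convergence can be seen numerically. The sketch below holds the observed fraction fixed at 0.19 (an illustrative assumption, since for small n this need not correspond to a whole-number count) and compares the end points of the two intervals as n grows; the names `case1` and `case2` are hypothetical.

```python
import math
from statistics import NormalDist

Z = NormalDist().inv_cdf(0.975)   # Z_{1-alpha/2} for alpha = 0.05

def case1(f, n, z=Z):
    """Large-sample interval: f +/- z*sqrt(f(1-f)/n)."""
    h = z * math.sqrt(f * (1 - f) / n)
    return f - h, f + h

def case2(f, n, z=Z):
    """Small-sample interval from the quadratic solution."""
    centre = f + z**2 / (2 * n)
    spread = z * math.sqrt(f * (1 - f) / n + z**2 / (4 * n**2))
    denom = 1 + z**2 / n
    return (centre - spread) / denom, (centre + spread) / denom

f = 0.19                          # observed fraction, held fixed
gaps = []
for n in (10, 100, 1000, 100000):
    lo1, hi1 = case1(f, n)
    lo2, hi2 = case2(f, n)
    gap = max(abs(lo1 - lo2), abs(hi1 - hi2))   # largest end-point difference
    gaps.append(gap)
    print(f"n = {n:6d}  case 1: ({lo1:.4f}, {hi1:.4f})  "
          f"case 2: ({lo2:.4f}, {hi2:.4f})  gap = {gap:.5f}")
```

The gap between the two intervals shrinks steadily as n increases. For n = 10, the case 1 lower limit falls below zero, which is one practical reason to prefer the case 2 expression when n is small.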