Biostatistics with R


From the point of view of statistics, the experiments we perform, or the systems we observe, fall into two categories: deterministic and random experiments.

A deterministic system produces predictable outcomes. For example, in the absence of appreciable air resistance, we can predict the trajectory of a projectile thrown at an angle with a known velocity using the equations of Newtonian mechanics. Here the system obeys known laws.

There are experiments that have two or more possible outcomes, and a single trial may result in any one of them in a random manner. These are called stochastic systems. There are no deterministic equations or laws governing such systems, and they can be studied only with the help of statistics. For example, we throw a die and look at the number that shows up. Any one of the six possibilities can occur, and we cannot predict which. We can only say that if the die is fair, all six numbers have an equal chance of showing up! This is what makes the die throw a random experiment, as against the projectile motion, which is deterministic.

Random experiment, outcome space and events

A random experiment results in one of several possible outcomes. The set of all possible outcomes of a random experiment is called its outcome space .

For example, an experiment involving a single coin toss has two possible outcomes, namely Head (H) and Tail (T). We represent the outcome space as a set in set theory notation: \[\small{S = \{H, T\} }\]

Similarly, in a random experiment that involves the throw of a single die, the outcome space consists of the six possible results of a die throw: \[\small{ S = \{1,2,3,4,5,6\} }\]

A subset of the outcomes in an outcome space is called an event. Thus, the set \[ \small{A = \{2,4,6\} }\] is an event (subset) of the outcome space of a die throw, containing the outcomes that are even in value. The set \[ \small{B = \{1,2,3\} }\] is another event (subset) of the die throw whose outcomes have values less than four.

When a random experiment results in an outcome which is a member of a set A, we say that the event A has occurred . Thus, from the above definitions of events A and B, if a die throw results in 4, we say "event A has occurred". If it results in 3, we say "event B has occurred". If the result is 2, we can say "both event A and event B have occurred", since the outcome 2 is a member of both sets.
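These events and membership tests can be written directly in R, where an event is simply a vector of outcomes; the `%in%` operator tests membership and `intersect()` finds outcomes common to both events. This is a small sketch; the variable names are our own:

```r
## outcome space of a die throw and the two events from the text
S <- 1:6
A <- c(2, 4, 6)   # even outcomes
B <- c(1, 2, 3)   # outcomes less than four

## membership tests: did event A (or B) occur for a given outcome?
4 %in% A          # TRUE : outcome 4 means event A has occurred
3 %in% B          # TRUE : outcome 3 means event B has occurred

## outcomes for which both events occur
intersect(A, B)   # 2
```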

The concept of probability

The probability of an event denotes the chance or likelihood of the event occurring, and is denoted by the symbol p(A). The value of p(A) lies in the range 0 to 1. A value of p(A) = 1 denotes that the event A will certainly occur, and p(A) = 0 denotes that the event A will definitely not occur. A value of p(A) = 0.5 tells us that the event A has an equal chance of occurring or not occurring.

There are two ways in which we can approach probability.

The first way involves looking at the outcome space and counting the number of possible ways in which an event A can occur. This is the same as the number of elements in the set A. We define the probability of occurrence of event A as \[ \small{p(A) = \frac{number~of~outcomes~favourable~to~event~A}{total~number~of~possible~outcomes} }\]

Suppose we want to compute the probability of getting a value greater than 4 in a die throw. Out of six possible outcomes, two outcomes, namely 5 and 6, satisfy this event. Therefore, we say \[\small{ p(>4) = \frac{number~of~outcomes~that~result~in~>4}{total~number~of~possible~outcomes} = \frac{2}{6} }\]
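This counting approach can be sketched in R by listing the outcome space as a vector and using `subset()` to pick out the favourable outcomes (the variable names here are our own):

```r
## outcome space of a single die throw
S <- 1:6
## outcomes favourable to the event "value greater than 4"
favourable <- subset(S, S > 4)           # 5 and 6
## probability = favourable outcomes / total outcomes
p_gt4 <- length(favourable) / length(S)  # 2/6, about 0.333
```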

As a second example, we define an experiment in which we toss a pair of coins and note down the result on both of them. If the first coin shows Head and the second shows Tail, we call this outcome "HT". This way, there are four possible outcomes in this random experiment: \[ \small{S = \{HH, HT, TH, TT\} } \]

Now we ask the question: if we toss a pair of coins, what is the probability of getting at least one head in the pair? When we look at the above outcome space, we realize that out of the 4 elements in the space, three of them, namely \[ \small{ \{HH, HT, TH \} }\], result in at least one head. Therefore, we write, for a two-coin toss, \[ \small{ p(at~least~one~head ) = \frac{number~of~outcomes~that~result~in~at~least~one~head}{total~number~of~possible~outcomes} = \frac{3}{4} }\]
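The same enumeration can be done in R with `expand.grid()`, which builds all combinations of the two coins. The column names `first` and `second` are our own labels for the two coins:

```r
## enumerate the outcome space of a two-coin toss: HH, HT, TH, TT
S <- expand.grid(first = c("H", "T"), second = c("H", "T"))
## outcomes in which at least one coin shows Head
at_least_one_head <- subset(S, first == "H" | second == "H")
## probability = favourable outcomes / total outcomes
p <- nrow(at_least_one_head) / nrow(S)   # 3/4 = 0.75
```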

We can also approach probability from the experimental side. Suppose an event A has a probability of occurrence p(A). We perform the experiment once and note down whether it resulted in event A or not. We then repeat this experiment N times (N trials), and count the number of trials n(A) that resulted in the event A. For example, we toss a coin N times and note down the number of times it resulted in a head (H).

The ratio \(\small{ \displaystyle\frac{n(A)}{N} }\) is the relative frequency of occurrence of event A. This ratio is unstable for a small number of trials N, but stabilizes towards a fixed number for very large N.

This number, \( \small{p(A) = \displaystyle\frac{n(A)}{N} } \) for very large N, is called the probability of occurrence of event A.

How many trials do we need before the experimentally observed ratio equals the probability obtained earlier by counting favourable outcomes? That is, how many coin tosses are needed to get p(H) exactly equal to 0.5? It turns out that we need an infinite number of tosses!

In the same manner, if we throw a die a very large number of times, the relative frequency of each one of the 6 outcomes will approach the number \( \small{\frac{1}{6}} \) as the number of throws approaches infinity!
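A minimal sketch of this die-throw experiment in R, using the built-in sample() function to draw faces with equal probability (the seed is fixed here only so that repeated runs give the same numbers):

```r
## simulate die throws; each of the 6 faces has equal chance
set.seed(1)                       # fixed seed, only for reproducibility
for (N in c(100, 10000, 1000000)) {
    throws <- sample(1:6, N, replace = TRUE)
    ## relative frequency of each face; all approach 1/6 ~ 0.1667 as N grows
    print(round(table(throws) / N, 4))
}
```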

R Scripts

Simulating the coin toss experiment in R

We can realistically simulate the coin toss experiment using the runif() function in R, which generates a set of random numbers with a uniform probability between 0 and 1. These random numbers are called "uniform deviates". If we generate a large number of uniform deviates, about half of the numbers generated will be below 0.5 and the remaining half will be above 0.5.

We first generate N deviates. We separate the numbers with values below 0.5 and count them. We call these "Heads". Similarly, we separate the numbers out of the N that are above 0.5 and count them. We call these "Tails". The ratio of the number of heads to the total number N gives p(H), the probability of getting a head. We repeat this for increasing N, like N = 10, 100, 10000, 100000, etc. As N increases, p(H) should approach the expected probability (called the "a priori" probability) of 0.5.

##### script "coin_toss.r" #################################

## Simulating a coin toss using uniform deviates from runif()

## number of trials in 6 experiments
ntrials = c(10, 100, 1000, 10000, 100000, 1000000)

## print the column names
print(paste("trials ", "heads ", "tails ", "P(head)"))

## find p(H) for each experiment with N trials (i.e., N coin tosses)
for(N in ntrials) {
    ## generate N uniform deviates between 0 and 1, returned as a vector
    nx = runif(N)

    ## if the number is less than or equal to 0.5, call it Head (H)
    nheads = length(subset(nx, nx <= 0.5))

    ## if the number is greater than 0.5, call it Tail (T)
    ntails = length(subset(nx, nx > 0.5))

    ## compute the probability of head
    prob_head = nheads/N

    print(paste(nheads+ntails, " ", nheads, " ", ntails, " ", prob_head))
}

Run the above code file in R with source("coin_toss.r") at the command line. Results similar to the following are printed (the numbers will not be the same every time due to randomness):

[1] "trials  heads  tails  P(head)"
[1] "10   4   6   0.4"
[1] "100   49   51   0.49"
[1] "1000   512   488   0.512"
[1] "10000   4949   5051   0.4949"
[1] "100000   50191   49809   0.50191"
[1] "1000000   500186   499814   0.500186"

From the printout above, we see that as N increases, p(H) stabilizes around 0.5, as expected. The value of p(head) will tend to 0.5 as N tends to an infinitely large number of tosses.