Biostatistics with R

Parent populations and samples

In general, the word population refers to a collection of people, animals, birds, plants or other organisms sharing a geographical area. We thus refer to the population of humans living in a country, island or city. We also speak of the population of a particular bird species in a very large forest area, or the population of a certain type of dolphin near an island in the Pacific.


Statistics deals with the measurements of one or more variables. In this context, the term "parent population", or simply the "population", refers to the largest possible set of measurements available on a variable under study.


For example, if a study involves the weight of 10-year-old girls from a district, the population for this study consists of the measured weights of all the 10-year-old girls living in the district during the period considered for the study. If another study involves the number of eggs laid by an adult hen in a poultry farm over a one-year period, the parent population of this study consists of data on the number of eggs laid in one year by a very large number of adult hens in that farm, under the same conditions considered in the study.

Sampling from a population

The fundamental assumption of statistical analysis is this:

For each parameter we measure in an experiment, there is a parent population of its values. An observed data set consisting of n repeated measurements of the parameter is assumed to have been randomly drawn from the reservoir of the parent population, or simply "the population".

The observed data points are random samples from the population and hence constitute the sample data.


We will illustrate the above concept with an example. Suppose we have to study the effect of traffic and pollution on the blood pressure of adult males in the age group 30-35 in a particular city. The blood pressure of all adult males in the same age group 30-35 in that city constitutes our population as far as this blood pressure measurement is concerned. Since we do not have the resources to measure the blood pressure of all the males of this city in the selected age group, we randomly pick 200 males from this population across the city and measure their blood pressure.

These 200 measured values now constitute the sample data for this study. Because the individuals were picked at random, this sample data is assumed to be representative of the population data.
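The blood pressure example can be sketched in R as follows. This is only an illustration under stated assumptions: the population is simulated with rnorm(), and its size, mean and standard deviation are made-up values, not data from any real study. The names population and sample_data are chosen here for the illustration.

```r
## Hypothetical population: systolic blood pressure (mm Hg) of 50000 men.
## The size, mean and standard deviation are made-up illustration values.
set.seed(1)   ## fix the seed so that the run can be repeated
population = rnorm(50000, mean = 125, sd = 12)

## Randomly pick 200 individuals from the population, without replacement
sample_data = sample(population, 200)

## Compare the sample with the population it was drawn from
print(paste("population mean : ", round(mean(population), 2)))
print(paste("sample mean     : ", round(mean(sample_data), 2)))
```

With a random sample of this size, the sample mean is expected to lie close to the population mean, though the two will not be identical.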

A sample data set is a subset of the population data. We can draw more than one sample data set from the same population.

In a biology practical class, suppose an experiment is performed separately by 8 students under very similar conditions. They all measure the value of a quantity Q under the same conditions. If each one of them repeats the experiment to get 5 values of Q, we can say that there are 8 sample data sets drawn randomly from the same population, each consisting of 5 data points.
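The idea of several sample data sets coming from one population can be mimicked with a short R sketch. The population of Q values below is simulated and purely hypothetical (the mean of 50 and standard deviation of 5 are arbitrary), and the names population_Q and samples are introduced only for this illustration.

```r
set.seed(2)   ## fix the seed so that the run can be repeated

## Hypothetical population of Q values (made-up numbers, illustration only)
population_Q = rnorm(10000, mean = 50, sd = 5)

## Draw 8 sample data sets of 5 values each.
## replicate() returns a matrix in which each column is one student's sample.
samples = replicate(8, sample(population_Q, 5))
colnames(samples) = paste("student", 1:8, sep = "_")

## 5 rows (measurements) and 8 columns (students)
print(dim(samples))

## The mean of Q differs from one sample data set to another
print(round(colMeans(samples), 2))
```

The 8 column means generally differ from each other, which shows that each sample data set gives a slightly different estimate of the same population mean.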


Why insist on randomness while sampling?

Randomness in sampling is essential for removing human bias, even when the possibility of such bias seems remote.


Suppose application details of 100 equally qualified students are given to us and we are required to select 6 among them for a scholarship program. Here we are supposed to treat all of them equally during the selection. If we sample by looking at details like name, appearance, income etc., there is a small chance that we may develop a bias towards some of the students, no matter how hard we try to be neutral. The best way to remove this bias is to assign a number from 1 to 100 to the students, and let a computer algorithm choose 6 out of the 100 numbers randomly.
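The scholarship selection described above can be sketched with R's built-in sample() function. The variable names student_ids and selected are chosen here only for illustration.

```r
## Assign the numbers 1 to 100 to the applicants
student_ids = 1:100

## Let the computer choose 6 distinct numbers.
## Without the prob argument, every applicant is equally likely to be chosen.
selected = sample(student_ids, 6)

print("Selected applicant numbers : ")
print(sort(selected))
```

Since the sampling is done without replacement, no applicant number can be chosen twice.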


For the same reason, numbers for winners in a lottery are chosen using devices like "rotating random wheels" for each digit.


Strict randomness in the allotment of patients or medicines to various groups is an important component of clinical trial studies. Developing algorithms and tools for this purpose is a multi-billion-dollar industry.
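A very simple form of random allotment can be sketched in R as below. This is a minimal illustration and not a description of how real clinical trial software works: the 20 patient labels, the two equal groups and the names patient_ids, arms and allocation are all assumptions made for the example.

```r
## Hypothetical trial: 20 patients to be split into two groups of 10
patient_ids = paste("P", 1:20, sep = "")
arms = rep(c("treatment", "control"), each = 10)

## Shuffle the group labels.
## sample() called with only a vector returns a random permutation of it.
allocation = data.frame(patient = patient_ids, arm = sample(arms))

print(allocation)

## Each group still has exactly 10 patients
print(table(allocation$arm))
```

Shuffling a fixed set of labels keeps the group sizes equal, while the decision of who lands in which group is left entirely to the random number generator.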


There are mathematical algorithms that can generate a sequence of numbers in any given range that appear random, i.e., the numbers show no discernible relationship to each other, even though the algorithm that produces them is deterministic. They are called "pseudo random numbers", as against the real random numbers from nature (like coin toss, dice throw, electrical noise etc.). These pseudo random numbers can be used to mimic many random selection processes from a given population of data. R has many useful functions to do this. The script in the following section demonstrates some of them.
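The "pseudo" nature of these numbers can be seen with a small R sketch. It uses the built-in functions set.seed(), sample() and runif(). The seed value 123 and the variable names first_run and second_run are arbitrary choices for this illustration.

```r
## Starting the generator from the same seed gives the same "random" sequence
set.seed(123)
first_run = sample(1:100, 5)

set.seed(123)
second_run = sample(1:100, 5)

print(first_run)
print(second_run)

## TRUE : the two sequences are identical
print(identical(first_run, second_run))

## runif() draws pseudo random real numbers uniformly from a given range
print(runif(3, min = 0, max = 1))
```

Fixing the seed is useful when an analysis has to be repeated exactly. When no seed is set, each run of a script gives a different sequence.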



R scripts

Given a set of objects, we can select n among them randomly. R provides a built-in function called sample() for this purpose. The important parameters of this function are listed here:


       x        ------> a vector of elements to choose from

       size     ------> a non-negative integer that gives the number of objects to choose

       replace  ------> logical value (TRUE or FALSE) that indicates whether
                        the sampling is done with replacement (FALSE by default)

       prob     ------> a vector of probability weights for obtaining the
                        elements of the vector being sampled


The R script below demonstrates the use of this function for sampling with and without replacement, as well as for weighted sampling:


### Random sampling from a list of objects

x = 1:20

## sample without replacement, by default
sampled_data = sample(x, 12)
print("12 samples without replacements : ")
print(sampled_data)

cat("\n")
cat("\n")

## sample with replacement
sampled_data = sample(x, 12, replace=TRUE)
print("12 samples with replacements : ")
print(sampled_data)

cat("\n")
cat("\n")

### Weighted sampling

## we define 4 nucleotides
nucleotides = c('A','T','G','C')

## Probabilities for their sampling. 'G' and 'C' appear 4 times more than 'A' and 'T'
probabilities = c(0.1, 0.1, 0.4, 0.4)

## sample 20 nucleotides randomly with replacement, with given probabilities.
nuc = sample(nucleotides, 20, prob=probabilities, replace=TRUE)
print("20 samples with probability weights (0.1,0.1,0.4,0.4) for (A,T,G,C) : ")
print(nuc)

## Collapse the character vector into a sequence string
## (named dna_seq to avoid masking the built-in function seq())
dna_seq = paste(nuc, collapse="")
cat("\n")
print(paste("letters collapsed into a weighted DNA sequence : ", dna_seq))

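Executing the above script prints results like the following. Since the sampling is random and no seed is set, the sampled values differ from one run to the next: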

[1] "12 samples without replacements : "
[1]  1  7  9  4  5  2 20 18  6 17 15 10


[1] "12 samples with replacements : "
[1] 19  9  1  1  1 19 14  9 13 14  6 19


[1] "20 samples with probability weights (0.1,0.1,0.4,0.4) for (A,T,G,C) : "
 [1] "G" "G" "A" "T" "T" "G" "G" "A" "C" "T" "A" "G" "G" "G" "G" "G" "C" "C" "G"
[20] "G"

[1] "letters collapsed into a weighted DNA sequence : GGATTGGACTAGGGGGCCGG"
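With only 20 draws, the observed counts of the nucleotides can deviate noticeably from the weights. A quick way to check that the prob argument behaves as intended is to draw a much larger sample and tabulate it, as in the sketch below. The sample size of 10000 and the variable name observed are arbitrary choices for this illustration.

```r
nucleotides = c('A','T','G','C')
probabilities = c(0.1, 0.1, 0.4, 0.4)

## Draw a large weighted sample with replacement
nuc = sample(nucleotides, 10000, prob=probabilities, replace=TRUE)

## Observed fraction of each nucleotide, in the order A, T, G, C
observed = table(factor(nuc, levels=nucleotides)) / length(nuc)

print("Observed fractions of (A,T,G,C) : ")
print(round(observed, 3))
```

The observed fractions should come out close to the weights 0.1, 0.1, 0.4 and 0.4, and they get closer as the sample size is increased.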