## Biostatistics with R

#### ( R. Srivatsan )

Biostatistics is the application of statistics to biological data, and it is not appreciably different from the statistics used in other fields such as physics, engineering and economics.

However, data from biological experiments are extremely complex and diverse in nature, and a wide range of statistical algorithms is employed to analyse them. The results of simple laboratory experiments can be analysed with error analysis and basic statistical tests. Data from clinical trials are analysed with frequentist as well as Bayesian methods. Bioinformatics data from DNA/RNA sequencing and microarrays call for very complex algorithms built on frequentist statistics, Bayesian statistics and hidden Markov models. The ever-increasing quantity and quality of data in molecular genetics calls for new statistical methods, which are published and updated every day along with new open-source tools implementing them.

The gateway to this intricate world of data analysis is a foundational course on basic statistics. With a knowledge of basic statistics, even the most difficult statistical methodology can be understood at a block level, if not down to its inner details.

Since statistics is an inherent part of mathematics, students of the life sciences who are not formally trained in the language of mathematics find it difficult to comprehend the finer aspects of the algorithms and formulas used. For example, a given formula for a particular statistical test may be derived on the assumption that "the means of parent distributions are equal". This formula should not be used in cases where they are not equal. A student who goes through the mathematical derivation of the formula understands this naturally, but a life sciences student who skips the derivation and simply uses the formula must be made aware of this fact explicitly, in words. This makes a textbook or tutorial on "biostatistics" more verbose than a similar book on "statistics", even though both say the same thing! The same applies when explaining the properties of statistical distributions, errors and many other concepts.
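
One concrete case of an assumption hidden inside a formula can be sketched in R. The classical two-sample t-test pools the two sample variances, which is valid only if the parent distributions have equal variances. Note that this is a different assumption from the one quoted above; it is chosen here only because R's `t.test` function exposes it directly through the `var.equal` argument. The data are simulated and purely illustrative, and the variable names are hypothetical.

```r
# Simulated samples (illustrative only): same mean, very different spread
set.seed(1)
x <- rnorm(10, mean = 5, sd = 1)
y <- rnorm(40, mean = 5, sd = 4)

# Student's t-test: the formula assumes equal variances in both parents
pooled <- t.test(x, y, var.equal = TRUE)

# Welch's t-test (the default in R): no equal-variance assumption
welch <- t.test(x, y)

pooled$parameter   # degrees of freedom: 48 (n1 + n2 - 2)
welch$parameter    # a smaller, generally non-integer value

# The two tests can report different p-values for the same data
c(pooled = pooled$p.value, welch = welch$p.value)
```

Using the pooled formula on data like these applies it outside the conditions under which it was derived, which is exactly the kind of misuse a reader who skips the derivation needs to be warned about.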

We need software tools to perform the statistical analysis of data. Most commercial as well as open-source tools provide "pipelines" for data analysis. Starting from reading formatted data files (generally in the form of tables), each pipeline performs a set of required statistical analyses and presents the results in the form of output files and plots. The R statistical package provides a powerful environment, language and libraries for statistical analysis. Along with a very large number of Bioconductor libraries for computational biology, the R package has become a dominant analysis tool in the biological sciences.
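
The three stages of such a pipeline (read a table, analyse, write output files and plots) might look like the following minimal sketch in base R. The data, column names and file names are hypothetical; the input file is created in a temporary location only so that the sketch is self-contained.

```r
# Create a small hypothetical input table in a temporary file
infile <- tempfile(fileext = ".csv")
write.csv(data.frame(group = rep(c("control", "treated"), each = 4),
                     value = c(4.1, 3.9, 4.3, 4.0, 5.2, 5.0, 5.5, 5.1)),
          infile, row.names = FALSE)

# 1. Read the formatted data file
dat <- read.csv(infile)

# 2. Perform the analysis: mean of 'value' within each group
means <- aggregate(value ~ group, data = dat, FUN = mean)

# 3. Present the results as an output file and a plot
outfile <- tempfile(fileext = ".csv")
write.csv(means, outfile, row.names = FALSE)

plotfile <- tempfile(fileext = ".pdf")
pdf(plotfile)
boxplot(value ~ group, data = dat, ylab = "value")
dev.off()

print(means)
```

A real pipeline would replace the group means with whatever tests the experiment calls for, but the read, analyse, write structure stays the same.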

This tutorial aims to explain the essential concepts of basic biostatistics in simple mathematical language and to provide easy-to-use scripts in the R language for statistical analysis. To use the R scripts, the reader is expected to have learnt the fundamentals of the R statistical package. The R tutorials on this website will be useful for learning the basics of the R language.

Starting from probability theory, each section explains one or two basic ideas on the topic. Mathematical derivations are kept to the minimum adequate level. In some places, the derivations are separated from the statement of results so that they can be skipped when not needed. R scripts for computing example problems related to each chapter are placed at the end of that chapter. The scripts are mostly self-contained, and can be copied and pasted at the R prompt in the R package for computation.

Your valuable comments and suggestions will help improve the quality and presentation of the contents of the tutorial. Kindly email your comments and suggestions to srivatsan1963@gmail.com. Errors, mistakes and inadequacies in the tutorial pointed out by users will be corrected immediately, and the author will be thankful to them for this.