Biostatistics with R

An overview of R

R is a software environment for statistical analysis, graphics and data handling. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team. It is free software distributed in source code form under the terms of the GNU General Public License of the Free Software Foundation. R can be downloaded from the R home page and installed on operating systems such as Linux, Windows and macOS.

The core of R is written in languages such as C and Fortran.

R is also a programming environment in which one can develop sophisticated programs and pipelines for complex analysis.

We will list the essential features of R under specific titles:

Programming environment in R

R has a programming environment that provides almost all the features of a high-level language like C. Some important features are:

  • Data structures
    • Basic data types like numbers, characters and boolean values
    • String operations supported by string manipulation functions
    • Vectors to handle arrays of objects, supported by library functions for operations like sorting, filtering, growing, merging and statistical computations
    • Data frames to load Excel-type tables, supported by functions to sort, merge, add, filter and many more operations
    • Lists to store an assortment of data types that can be written to and read from binary files
    • Two and multi dimensional arrays
    • Matrices supported by very efficient library functions to perform computations like matrix multiplication, trace, eigenvectors, eigenvalues etc.

  • Logical statements and loops
    • Logical and comparison operators like those in C
    • Control structures such as for loops, while loops, if-else statements and more.

  • Functions and libraries
    • User defined functions that can take any data object of R as argument and return an appropriate object.
    • Internal libraries and packages for specific tasks.
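
The features listed above can be sketched in a few lines of base R. This is a minimal illustration, not a complete tour; all variable names and values are made up for the example.

```r
# Basic data types
n <- 42.5              # numeric
s <- "gene"            # character
b <- TRUE              # logical (boolean)

# String operations
label <- toupper(paste(s, "A", sep = "_"))

# Vector: sorting, filtering and a summary statistic
x <- c(5, 3, 8, 1)
x_sorted <- sort(x)
x_large  <- x[x > 2]
x_mean   <- mean(x)

# List holding mixed types, written to and read from a binary file
info <- list(id = 1L, name = "sample1", values = x)
rds_file <- tempfile(fileext = ".rds")
saveRDS(info, rds_file)
info_back <- readRDS(rds_file)

# Matrix operations on a 2 x 2 diagonal matrix
m  <- matrix(c(2, 0, 0, 3), nrow = 2)
m2 <- m %*% m                # matrix product
tr <- sum(diag(m))           # trace
ev <- eigen(m)$values        # eigenvalues, largest first

# Loop with an if statement: count the even elements of x
n_even <- 0
for (v in x) {
  if (v %% 2 == 0) n_even <- n_even + 1
}

# User-defined function: coefficient of variation
cv <- function(v) sd(v) / mean(v)
cv_x <- cv(x)
```

Note that base R has no built-in trace function, so the sketch sums the diagonal instead.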

Data handling in R

R provides an excellent environment for data manipulation and analysis. Some of the important features are:

    • Tabular data can be read into R as a data frame from various file formats like Excel, CSV and text files.
    • Once the table is read, we can perform operations like row/column filtering, filtering by a condition on row elements, appending and merging tables, and many operations on columns.
    • R data frames can be written to tables in many file formats.
    • Data can be read from files as characters, strings or lines.
    • Formatted reading and writing is supported for data with complex formats.
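
As a minimal sketch of these operations, the base R example below writes a small table to a CSV file, reads it back and manipulates it. The `patients` and `weights` tables are hypothetical. Reading Excel files is not shown because it needs an add-on package (readxl is one option).

```r
# Hypothetical example table
patients <- data.frame(
  id  = c(1, 2, 3, 4),
  age = c(34, 51, 47, 29),
  sex = c("F", "M", "F", "M")
)

# Write to a CSV file and read it back as a data frame
csv_file <- tempfile(fileext = ".csv")
write.csv(patients, csv_file, row.names = FALSE)
tab <- read.csv(csv_file)

# Filter rows by a condition and keep two columns
older <- tab[tab$age > 40, c("id", "age")]

# Merge with a second table on the shared "id" column
weights <- data.frame(id = c(1, 2, 3), weight = c(61.5, 80.2, 70.1))
merged  <- merge(tab, weights, by = "id")

# Append a row and add a derived column
extra    <- data.frame(id = 5, age = 62, sex = "F")
all_rows <- rbind(tab, extra)
all_rows$age_group <- ifelse(all_rows$age >= 45, "45+", "<45")

# Read the same file as plain lines of text
raw_lines <- readLines(csv_file)

# Formatted output
msg <- sprintf("%d patients, oldest is %d", nrow(tab), max(tab$age))
```

By default `merge()` keeps only the ids present in both tables, so the patient without a weight record is dropped from `merged`.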

Statistical analysis and data analytics in R

Among commercial and open source tools, R has one of the largest collections of libraries implementing statistical algorithms, covering the entire span of statistics. These algorithms are extensively used by engineers, statisticians, researchers in life sciences, clinical trial organizations and other branches of science and industry. Though it is difficult to prepare a list of all the existing algorithms, we can list the broad topics for which several algorithms are available in R.

  • Statistical analysis
    • Computing statistical parameters and summary statistics from data
    • Handling missing data and missing data imputation
    • Correlation and Covariance
    • Computations with discrete and continuous distributions and their random deviates. We can generate properties and tables of more than 25 standard distributions like binomial, Poisson, negative binomial, Gaussian, t, F, chi-square, Weibull etc. inside R.
    • A large number of parametric and non-parametric statistical tests like the Z-test, t-test, Chi-square test, ANOVA tests, Mann-Whitney test, Kruskal-Wallis test and so on. A very large number of other tests are also supported by external libraries that can be called from R.
    • Multiple testing corrections for p-values
    • Many internal and external libraries for regression analysis such as linear and non-linear regression, multiple linear regression, logistic regression, multivariate analysis etc. These library functions perform complete analysis including testing and plotting the results.
    • Survival analysis
    • Many more miscellaneous analysis methods supported by the R community
  • Advanced data analytics
    • Factor analysis
    • Principal Component Analysis
    • Clustering analysis with relevant plots
    • Classification analysis with sophisticated methods
    • Artificial Neural Networks (ANN)
    • Many more analysis methods supported by the R community
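
A few of these topics can be illustrated with functions from the base installation. The sketch below uses simulated data (the group means and sizes are arbitrary assumptions) and the built-in `iris` data set; it is only meant to show the flavor of the function calls.

```r
set.seed(1)                                  # reproducible random numbers

# Distributions: density, cumulative probability, quantile, random deviates
p_binom <- dbinom(3, size = 10, prob = 0.5)  # P(X = 3) for Binomial(10, 0.5)
p_norm  <- pnorm(1.96)                       # P(Z <= 1.96), standard normal
q_t     <- qt(0.975, df = 10)                # 97.5% quantile of t with 10 df
ctrl    <- rnorm(20, mean = 10, sd = 2)      # simulated control group
trt     <- rnorm(20, mean = 12, sd = 2)      # simulated treated group

# Summary statistics, missing values, correlation
ctrl_summary <- summary(ctrl)
mean_no_na   <- mean(c(ctrl, NA), na.rm = TRUE)
r            <- cor(ctrl, trt)

# Parametric and non-parametric two-sample tests
tt <- t.test(ctrl, trt)
wt <- wilcox.test(ctrl, trt)                 # Mann-Whitney test

# Multiple testing correction (Benjamini-Hochberg)
p_raw <- c(tt$p.value, wt$p.value)
p_adj <- p.adjust(p_raw, method = "BH")

# Linear regression and principal component analysis
d   <- data.frame(x = 1:20, y = 3 + 2 * (1:20) + rnorm(20))
fit <- lm(y ~ x, data = d)
pca <- prcomp(iris[, 1:4], scale. = TRUE)
```

Calling `summary(fit)` or `summary(pca)` prints the usual coefficient table or the variance explained by each component.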

Numerical computations in R

The R environment has packages for performing numerical computations. Most of the algorithms are implemented in C and are based on well-established methods. Some of the topics for which R has readily available internal and external libraries are:

  • Linear Algebra
    • Matrix operations, solving simultaneous equations, vector algebra, eigenvalues and eigenvectors, creating correlation and covariance matrices, etc.
  • Computing numerical integration and derivatives
  • Numerical solutions to Ordinary Differential Equations (ODE) and Partial Differential Equations (PDE)
  • High quality pseudo-random number generators
  • Randomization methods for a list of objects
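
Several of these computations are available in base R, as the following sketch shows. The matrix and the functions being integrated and differentiated are arbitrary examples. Note that `D()` differentiates symbolically rather than numerically, and that ODE solvers are not shown because they live in add-on packages (deSolve is one).

```r
# Solve the linear system A x = b
A <- matrix(c(2, 1, 1, 3), nrow = 2)
b <- c(3, 5)
x <- solve(A, b)

# Eigenvalues and eigenvectors of the symmetric matrix A
e <- eigen(A)

# Covariance and correlation matrices from the built-in mtcars data
cov_m <- cov(mtcars[, c("mpg", "wt", "hp")])
cor_m <- cor(mtcars[, c("mpg", "wt", "hp")])

# Numerical integration of sin(x) on [0, pi]; the exact value is 2
area <- integrate(sin, lower = 0, upper = pi)$value

# Derivative of t^3 with respect to t, evaluated at t = 2
dfdt  <- D(expression(t^3), "t")
slope <- eval(dfdt, list(t = 2))

# Reproducible random numbers and a random permutation of a set of objects
set.seed(42)
u    <- runif(5)
perm <- sample(c("a", "b", "c", "d"))
```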

Graphics in R

The R software environment provides extensive support for producing high quality graphs and plots. The basic installation itself contains multiple libraries for graphics. In addition, a large number of external libraries have been created for various types of graphics applications. Some of the important features are listed here:

  • Basic functions for creating entire plots from data
    • Functions for plotting scatter plots, line plots, histograms, bar charts, pie charts, tree plots, dendrograms and many more types of complete plots.
    • These functions have parameters to vary almost every property of the graph, like titles, properties of points and lines, size of plot, scaling of the axes, background colors, logarithmic axes, aspect ratio, font size, font type, colors etc.
  • Functions for adding features to the existing plot
    • Once a basic graph is plotted, we can add more graphics elements like arrows, polygons, triangles, circles to the existing plot using additional graphics library functions. These functions can be called only after drawing a basic plot.
    • There are secondary functions that can add more plots on the same graph and can change the properties of the existing plot. Many plots can be created on the same page.
    • Libraries for creating three dimensional plots
  • Several device drivers are available in R for creating publication quality plots. They include
    • On-screen graphics on Windows, Unix, Linux and Macintosh machines
    • Device drivers for creating high quality PostScript, PDF, PNG, JPEG and WMF (for Windows) image files.
  • Plot functions can handle different data structures
    • The graphics functions in R can take data in more than one structure, such as vectors and data frames (tables). The plot functions can also handle data provided in the form of formulas. This is very useful for creating plots of modelled data.
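
The workflow described above (open a device, draw a complete plot, then add elements to it) might look like the following base R sketch. It uses the built-in `mtcars` data set; the titles, colors and arrow coordinates are arbitrary choices for illustration.

```r
# Open a PDF device; plots go to this file instead of the screen
out_file <- tempfile(fileext = ".pdf")
pdf(out_file, width = 7, height = 4)

par(mfrow = c(1, 2))                       # two plots on one page

# High-level function: a complete scatter plot from a formula and a data frame
plot(mpg ~ wt, data = mtcars,
     main = "Mileage vs weight", xlab = "Weight (1000 lb)",
     ylab = "Miles per gallon", pch = 19, col = "steelblue")

# Low-level functions: add elements to the existing plot
fit <- lm(mpg ~ wt, data = mtcars)
abline(fit, col = "red", lwd = 2)          # fitted regression line
arrows(4.5, 30, 5.2, 12, length = 0.1)     # arrow towards the heaviest cars
text(4.5, 31, "heaviest cars")
legend("bottomleft", legend = "linear fit", col = "red", lwd = 2)

# A second complete plot on the same page
hist(mtcars$mpg, main = "Distribution of mpg", xlab = "Miles per gallon")

dev.off()                                  # close the device and write the file
```

Replacing `pdf()` with `png()` or `postscript()` sends the same plots to a different file format.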

Computational biology in R using Bioconductor

Using the standard statistical packages in R, we can process and analyze data from biological experiments in general. However, data from high-throughput experiments such as microarrays and DNA and RNA sequencing require advanced and specialized methods of analysis. For example, we may need annotation files for specific microarray chips from various companies, several normalization algorithms for data from sequencers or microarrays, and complex statistical procedures published in the literature for differential expression analysis of RNA sequencing data.

The Bioconductor project was launched in 2001 with the aim of creating and distributing open source tools for the analysis of high-throughput data from biological experiments. It is overseen by a core team based primarily at the Fred Hutchinson Cancer Center, USA, together with other members from the international community.

The Bioconductor project creates and distributes software tools for analyzing high-throughput data under open source licensing. The packages are based mainly on the R language and can be downloaded from inside R.

Each package includes a user manual in PDF format along with example data, where appropriate.

There are more than 1300 packages spanning a very wide range of high-throughput experiments and meta-analysis methods. These include the analysis of microarray data, DNA and RNA sequencing, ChIP-seq, genetic analyses like linkage disequilibrium studies, proteomics, mass spectrometry, GWAS and more.
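
As a sketch of how Bioconductor packages are obtained from inside R, the helper below wraps the BiocManager package, which is distributed on CRAN for this purpose. The function name `install_bioc` is hypothetical, and limma is used only as an example package. Nothing is installed until the function is called, and calling it needs an internet connection.

```r
# Hypothetical helper: install one or more Bioconductor packages
install_bioc <- function(pkgs) {
  # BiocManager manages Bioconductor installations; fetch it from CRAN if absent
  if (!requireNamespace("BiocManager", quietly = TRUE)) {
    install.packages("BiocManager")
  }
  BiocManager::install(pkgs)
}

# Usage (not run here): install limma, then browse its documentation
# install_bioc("limma")
# browseVignettes("limma")
```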

Interfacing R with databases

R can connect to and fetch data from relational databases such as MySQL, Oracle and SQL Server, which store normalized data. Many external libraries have been created for this. Using simple function calls, we can query the databases and fetch data into R for further analysis. Once the data is inside the R environment, it can be manipulated like any normal R data set.

For example, the RMySQL package provides functions for connecting to MySQL databases and fetching data from them. From an R script, we can perform tasks like connecting to the database, querying tables, creating and dropping tables in MySQL, manipulating data and writing to tables through simple R function calls.
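
A query of this kind might be sketched as follows, using RMySQL through the generic DBI interface. The database name, host, user, table and column names are all hypothetical placeholders, and the function only works when the DBI and RMySQL packages are installed and a MySQL server is reachable.

```r
# Hypothetical helper: fetch patients at or above a given age from MySQL
fetch_patients <- function(min_age) {
  con <- DBI::dbConnect(
    RMySQL::MySQL(),
    dbname   = "clinic",                     # placeholder database name
    host     = "localhost",                  # placeholder host
    user     = "analyst",                    # placeholder user
    password = Sys.getenv("MYSQL_PASSWORD")  # read the password from the environment
  )
  on.exit(DBI::dbDisconnect(con))            # always close the connection

  query <- sprintf("SELECT id, age FROM patients WHERE age >= %d",
                   as.integer(min_age))
  DBI::dbGetQuery(con, query)                # the result is a data frame
}

# Usage (not run here):
# older <- fetch_patients(40)
# summary(older$age)
```

Coercing `min_age` to an integer before building the query keeps arbitrary text out of the SQL string. Writing a data frame back to the database follows the same pattern with `DBI::dbWriteTable()`.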

Interfacing R with programming languages

Sometimes we need to execute code written in other languages from inside the R environment, and vice versa. Many libraries have been created for interfacing R with languages like C, C++, C#, Python, Java and Perl, and with Excel. In addition, R scripts can also be interfaced with Windows and Linux shells.

For example, the Rcpp package provides C++ classes that facilitate interfacing C or C++ code in R packages. The packages rpy2 and PypeR interface R with Python. Similarly, the interface called JRI allows R scripts to run inside a Java environment as a single thread. There are many more packages to connect other languages with R.
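
As a small illustration of the Rcpp route, the sketch below compiles a C++ function from a string and calls it from R. The function `sum_squares` is a made-up example. The code assumes that a C++ compiler is available and is skipped when the Rcpp package is not installed.

```r
# Runs only when the Rcpp package is installed (a C++ compiler is also needed)
if (requireNamespace("Rcpp", quietly = TRUE)) {
  # Compile a small C++ function and expose it to R under the same name
  Rcpp::cppFunction("
    double sum_squares(NumericVector x) {
      double total = 0;
      for (int i = 0; i < x.size(); ++i) total += x[i] * x[i];
      return total;
    }
  ")
  result <- sum_squares(c(1, 2, 3))   # 1 + 4 + 9
}
```

The compiled function behaves like any other R function, so it can be used inside loops, passed to `sapply()`, and so on.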