R Fundamentals

(Author : Srivatsan)

The analysis of data from biological experiments involves steps like data filtering, plotting, statistical analysis and modeling. A variety of software tools, open source as well as commercial, are readily available for this purpose. Commercial tools like Microsoft Excel, Minitab and MATLAB have been in use for a long time, and are quite adequate for the analysis of data from experimental labs and clinical trials. These tools are also used in other fields like engineering, economics and business analysis.


After the year 2000, the data landscape underwent a great transformation. Microarrays and high-throughput sequencing started generating huge quantities of data of enormous complexity. Analyzing the data from these experiments required handling large data sets in memory, sophisticated statistical algorithms, graphics libraries and data mining techniques. New analysis methods were adopted in quick succession, and some of the old methods were abandoned. The situation continues to this day. No single commercial tool could cope with the rapid changes occurring in the field. The community needed an open source framework under which a collection of tools could be created and shared by all.


The R statistical package, which has existed since the early 1990s, came to the rescue. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand. It is both a programming language and a software environment for statistical computing, with excellent libraries for statistical analysis, graphics and data mining. It is part of the GNU project and is freely available for download.


R is designed to be highly extensible through libraries created and contributed by its users. In 2001, the Bioconductor project was launched to provide tools for the analysis of high-throughput genomic data, and it has continued successfully ever since. Most Bioconductor libraries are created and distributed as R packages, and can be downloaded and installed from within R. According to the Bioconductor website, the project is overseen by a core team, based primarily at the Fred Hutchinson Cancer Research Center, and by other members from US and international institutions. It has two releases a year and, as of November 2015, offers 1104 software packages.


The addition of Bioconductor has made R the most important tool for computational biology. Using a single framework, the life sciences researcher can carry out all steps of data analysis and modeling. However, there is one barrier to be crossed. Unlike Microsoft Excel and Minitab, R is not principally GUI driven. For any analysis, we may have to write a few lines of code to manipulate the data structures and call the appropriate library functions. For this, we need to formally learn the constructs of the R language and its data structures.
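

To give a feel for what "a few lines of code" can look like, here is a minimal sketch that uses only base R. The gene names and expression values are made up purely for illustration, and the choice of a paired t-test is an assumption, not a recommended analysis. The sketch builds a data frame, derives a new column, filters rows, and calls one library function.

```r
# Hypothetical expression values for four genes under two conditions
# (illustrative numbers only, not real data)
expr <- data.frame(
  gene      = c("geneA", "geneB", "geneC", "geneD"),
  control   = c(10.2, 8.5, 12.1, 7.9),
  treatment = c(15.8, 8.1, 18.4, 7.5),
  stringsAsFactors = FALSE
)

# Manipulate the data structure: add a log2 fold change column
expr$log2fc <- log2(expr$treatment / expr$control)

# Filter: keep the genes whose log2 fold change exceeds 0.5
up <- expr[expr$log2fc > 0.5, ]

# Call a library function: paired t-test from the built-in 'stats' package
result <- t.test(expr$treatment, expr$control, paired = TRUE)

print(up)
print(result$p.value)
```

Each step here (creating a data frame, indexing it, calling a function and reading a field of its result) relies on the language constructs and data structures that the notes set out to teach.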


The goal of this tutorial set is to help the user acquire a working knowledge of R. Even if the user has no prior programming experience, she can systematically learn R using these notes.