Microarray Data Analysis in R/Bioconductor

( R. Srivatsan )

Introduction to Microarray Technology

Microarray technology is used to study gene expression. For a sample of cells, the expression of thousands of genes can be measured at once with a single hybridization.


DNA Microarrays: Quantify the amount of mRNA transcripts present in a collection of cells.

During gene expression, mRNA molecules carry the genetic information from the nucleus to the cytoplasm for protein synthesis. Whenever a gene is expressed (that is, in an active state), the corresponding mRNA molecules are produced by transcription. These mRNA molecules are then translated into the corresponding protein. Thus, by measuring the level of mRNA, we measure the level of gene expression.

The basic idea behind the measurement of gene expression is that the amount of mRNA transcripts of a gene is an approximate estimate of the level of expression of the gene.

i.e., (amount of mRNA transcribed from a gene) $~\Large \propto~$ (level of gene expression)

Types of Microarrays

Using mRNA expression detection, various regions of the genome can be studied. This gives rise to various types of microarrays. Some of the important ones are:

Gene expression arrays: Measure the expression levels of entire genes.

Exon arrays: Measure the expression levels of individual exons within a gene. Used for detecting alternative splicing and other transcript isoform differences.

ChIP-chip arrays: Microarrays used in combination with chromatin immunoprecipitation (ChIP) to determine the binding sites of transcription factors.

SNP arrays: Microarrays designed to detect Single Nucleotide Polymorphisms (SNPs), which are variations at single nucleotide positions in a genome.

Tiling arrays: In this type of microarray, the entire genome can be scanned at high resolution by short, overlapping probes. Tiling arrays provide a comprehensive view of the entire genome, instead of the specific genomic regions covered by gene expression microarrays.

In this section, we will learn the analysis of data from only the Gene Expression microarrays.

Applications of Microarrays

The microarrays are extensively used for many studies that include

          Gene Expression Studies

          DNA sequence variations

          SNP discovery and genotyping

          DNA Copy Number Variation (CNV)

          DNA Methylation

          Transcription factor binding/chromatin immunoprecipitation

          Gene regulation elements

          Reverse transcripts

          Protein interactions

          miRNA studies

The methodology of gene expression microarrays

We know that deoxyribonucleic acid (DNA) is a double-stranded structure in which the two strands are complementary to each other. The two strands pair by forming hydrogen bonds between complementary bases.

cDNA, or "complementary DNA", is a DNA copy synthesized from an mRNA template using the enzyme reverse transcriptase. It is complementary to the mRNA transcribed from the gene. Thus, cDNA and mRNA strands bind to each other by complementary base pairing when they are brought together.
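The complementary pairing described above can be illustrated with a small base R sketch. The function name `reverse_complement` and the example sequence are made up for illustration, and the sequence is written with DNA letters only, so the U of RNA is not handled.

```r
# Return the reverse complement of a DNA sequence using base R only.
# A pairs with T and G pairs with C; the result is reversed because
# the two strands of a duplex run in opposite directions.
reverse_complement <- function(dna) {
  complement <- chartr("ACGT", "TGCA", toupper(dna))
  paste(rev(strsplit(complement, "")[[1]]), collapse = "")
}

target <- "ATGCGT"                     # hypothetical target fragment
probe  <- reverse_complement(target)   # strand that would pair with it
print(probe)                           # "ACGCAT"
```

Applying the function twice returns the original sequence, which is a quick way to check that a probe and its target are truly complementary.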

Suppose we take the cDNA strands corresponding to a gene we call G1 and fix them on a glass plate (substrate) as a spot (called a "probe"). We then extract the mRNA from a tissue sample under investigation, mix it with a fluorescent dye (we call this the label) and wash it over this glass plate.

The mRNA extracted from the sample contains the mRNA from all the genes of the genome that have been expressed. When the fluorescently labelled mRNA from the sample is washed over the glass plate, the mRNA strands from the expression of gene G1 will bind to the complementary cDNA strands on the spot (probe).

The amount of mRNA from the sample that binds to the cDNA probe on the plate is directly proportional to the expression level of the gene G1 in the sample, according to our assumption. Since the dye is mixed with the mRNA (we say that "mRNA is labelled with the dye"), the amount of dye attached to the spot on the glass will also be proportional to the amount of mRNA bound, and hence to the expression level of gene G1. This dye generates an optical signal that can be measured by a device. This light signal is proportional to the amount of dye bound to the spot, which is proportional to the amount of mRNA bound, which in turn is proportional to the expression of gene G1 in the sample. This is the idea.

In another type of technology, instead of fixing the cDNA of a gene/genomic region on the surface of the plate, a short, single-stranded chain of nucleotides from a genomic region (called an oligonucleotide) is grown directly on the surface of the microarray plate, to which its complementary sequence from the mRNA of the sample can bind. These are called oligonucleotide arrays. If cDNA spots are made on the surface as described above, they are called "cDNA" arrays.


Terminologies to remember:

(i) The strands of polynucleotides from a gene/genomic region immobilized on a solid surface are called a probe. Thus, each probe represents a gene or a genomic region.

(ii) The mRNA transcripts extracted from a sample under study are mixed with a fluorescent dye. We say that the transcripts are labelled with the dye. These labelled polynucleotides are called the target. The targets are mixed in a solution and washed over the microarray dotted with probes.


How is labelling done? :

The nucleic acid fragments from the sample (targets) are labelled by attaching a fluorescent dye to each fragment. These fluorescent dyes have conjugated bonds whose electrons absorb and emit light in the visible range.

The commonly used dyes for labelling nucleic acids are the Cyanine 3 (Cy3) and Cyanine 5 (Cy5) dyes.

Cy3 is a bright, orange fluorescent dye with an absorption peak around 550 nm and an emission peak around 570 nm. It emits in the yellow-green range.

Cy5 has an absorption peak around 650 nm and an emission peak around 670 nm, and emits in the far-red range.

The intensity of a fluorescent label bound to each probe is proportional to the level of expression of the corresponding gene represented by the probe.


How is the intensity of the probe measured? :

In microarray experiments, the spots are excited by visible laser light of a suitable wavelength, and the fluorescence is measured by a photon detector such as a photomultiplier tube (PMT). A confocal microscope focuses the photons emitted by the labelled probe-target spot on the array onto the detector.

The confocal system resolves features much smaller than the probe spots on the microarray.

The microarray surface is divided into a regular grid of pixels whose area is much smaller than that of the probe spots, and the light intensity emitted from each pixel is measured by the photon detector. This results in a high-resolution image file recording the amount of light emitted from each pixel. Image analysis is then performed to compute the light intensity of each probe spot from the intensities of the pixels it covers. This method also enables the computation of the background intensity of the spots.
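The image-analysis step can be sketched in base R as follows. The pixel values, the window size, and the choice of mean foreground minus median background are assumptions made for illustration, not the algorithm of any particular scanner software.

```r
# Hypothetical 6 x 6 pixel window around a single probe spot.
pixels <- matrix(100, nrow = 6, ncol = 6)   # background pixels
pixels[3:4, 3:4] <- 900                     # pixels covered by the spot

# Logical mask marking which pixels belong to the spot.
spot_mask <- matrix(FALSE, nrow = 6, ncol = 6)
spot_mask[3:4, 3:4] <- TRUE

foreground <- mean(pixels[spot_mask])       # mean intensity inside the spot
background <- median(pixels[!spot_mask])    # median intensity around the spot
corrected  <- foreground - background       # background-corrected intensity

print(c(foreground = foreground, background = background, corrected = corrected))
```

In real images the spot mask has to be found by segmentation; here it is simply given, so that the foreground and background calculation stays visible.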


Some idea on the dimensions involved :

Suppose a DNA microarray has a dimension of $1~cm \times 1~cm$.

A typical pixel size is 5 to 10 microns. Let us take it as 10 microns.

Since $1~cm=10^4~microns$, there will be 1000 pixels per cm of length, or a million pixels over the $1~cm \times 1~cm$ area.

Therefore, one can easily fit tens of thousands of probe spots on a $1~cm \times 1~cm$ array.

By reducing the pixel size, we can accommodate more probes. For example, typical Affymetrix arrays with dimensions of the order of $1.2~cm \times 1.2~cm$ can have about 500,000 (half a million) probes on their surface.
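The arithmetic above can be checked with a few lines of base R. The 5 x 5 pixel footprint per spot is an assumption chosen only to show how a pixel count turns into a spot count.

```r
array_side_um <- 1e4     # 1 cm expressed in microns
pixel_size_um <- 10      # pixel size assumed in the text

pixels_per_side <- array_side_um / pixel_size_um   # 1000 pixels per cm
total_pixels    <- pixels_per_side^2               # 1e6 pixels on the array

# Assumption: each probe spot is imaged by a 5 x 5 block of pixels.
pixels_per_spot <- 5 * 5
max_spots <- total_pixels / pixels_per_spot        # 40000 spots

print(c(pixels_per_side = pixels_per_side,
        total_pixels = total_pixels,
        max_spots = max_spots))
```

Under this assumption the array holds 40,000 spots, consistent with the "tens of thousands" estimate.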

About 50 ng of total RNA will be required per chip.

Important microarray technologies

Microarray technologies fall broadly into the following categories, which have endured over the last three decades:

Spotted microarrays :

In spotted microarrays, cDNA or fragments corresponding to specific genes are spotted on the microarray chip. More than one probe per gene can be added. These arrays are generally used for two-channel (two-dye) experiments, in which two cDNA samples are labelled with two different dyes (one with Cy3, the other with Cy5) and hybridized to the same chip. After washing, the spots are read with two different laser beams to measure both samples in the same run. Unlike the oligonucleotide arrays, the genomic regions can be chosen by us, making the arrays highly customizable. We will not be discussing two-dye arrays in this tutorial.


Bead based Oligonucleotide microarrays developed by Illumina Technologies

In this array, tiny silica microbeads are embedded in wells on the surface of the array. The beads are coated with copies of oligonucleotides corresponding to specific genomic regions. When the DNA fragments from samples are passed over the bead chip, each probe binds to its complementary sequence in the sample DNA, and the bound oligonucleotides are measured with a single fluorescent label. On average there are 30 beads per probe, providing good redundancy.

The beads are randomly distributed across the surface of the chip, and a 29-mer address sequence present on each bead is used to map the genes on the array. The probes on the bead chip are 50-mers. Thus, each probe has a 50-mer gene-specific sequence and a 29-mer address sequence.

Apart from the probe beads, Illumina array also has 1000 control beads, which do not correspond to any expressed sequence in the genome. These control beads do not hybridize to any gene in the RNA sample. They are used as negative controls for non-specific bindings or background noise in the experiment.
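The roles of replicate beads and negative controls can be sketched in base R. All intensity values below are invented, and summarizing by the mean and subtracting the mean negative-control signal is a simplifying assumption for illustration, not Illumina's actual processing.

```r
# Hypothetical intensities from replicate beads carrying the same probe
# (a real array has about 30 beads per probe on average).
bead_intensities <- c(510, 495, 505, 490, 500)

# Hypothetical intensities from negative-control beads, which should
# capture only non-specific binding and background noise.
negative_controls <- c(98, 102, 100, 101, 99)

probe_signal <- mean(bead_intensities)      # summary over replicate beads
background   <- mean(negative_controls)     # background estimate
corrected    <- probe_signal - background   # background-corrected signal

print(c(probe_signal = probe_signal, background = background, corrected = corrected))
```

Averaging over replicate beads reduces the effect of any single noisy bead, which is the benefit of the redundancy mentioned above.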

About 50-500 ng of RNA may be required per chip.

Some important arrays of this type from Illumina are : HT-12 V4.0 for human genome, WG-6 V2.0 for Mouse genome, DASL HT for human genome.

Reference:

https://sapac.illumina.com/science/technology/microarray.html

This Illumina BeadChip technology later paved the way for Illumina RNA-seq technology, which dominates the RNA sequencing market, accounting for nearly 90% of the RNA-seq machines used throughout the world.


In-situ synthesized oligonucleotide microarrays developed by Affymetrix Corporation

In this technology, a large number of short DNA sequences called oligonucleotides are synthesized directly on the solid surface (such as a glass slide) of the array. The probes are thus created in situ on the surface, eliminating the need to clone the sample libraries.

In these arrays, each gene is represented by 25-mer "Perfect Match (PM) probes", which are exact replicas of the sequence from a region of the gene. For each perfect match probe, another replica is created with a mismatch at exactly one nucleotide position. This is called a "mismatch (MM) probe". For each gene there are, for example, 16 perfect match and 16 mismatch probes, corresponding to 16 regions from the 3' to 5' end of the gene.
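To make the probe-pair design concrete, the sketch below combines PM and MM intensities for one gene by averaging the PM minus MM differences across its probe pairs. The intensity values are invented and only four probe pairs are shown. This simple average difference is an illustrative assumption, not necessarily the summarization method used in practice.

```r
# Hypothetical intensities for four probe pairs of one gene.
pm <- c(1200, 950, 1100, 1300)   # perfect match probes
mm <- c(300, 250, 400, 350)      # corresponding mismatch probes

# MM is taken here as an estimate of non-specific hybridization,
# so PM - MM approximates the specific signal of each probe pair.
pair_differences <- pm - mm
average_difference <- mean(pair_differences)

print(pair_differences)          # 900 700 700 950
print(average_difference)        # 812.5
```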

The following passage, taken from the internet, describes how the detection mechanism works:

"RNA from a sample is reverse transcribed into cDNA, then in vitro transcribed into cRNA, and labeled with a biotin tag. The labeled cRNA is fragmented and hybridized to the GeneChip. The chip is scanned to detect the fluorescent signal, and the intensities of the PM and MM probes are used to determine the expression levels of the corresponding genes. The chip is washed to remove unbound material, and then stained with a fluorescently labeled molecule that binds to the biotin".

Affymetrix arrays account for a large fraction of the microarray experiments done in the last 25 years.

Some of the legacy array chips are: HG-U133 Plus 2, HG-U95Av2, HG-U133, MOE 430 2.0, ... and many more.

Affymetrix Genechip arrays : Gene 1.0 ST, Gene 2.0 ST, Transcriptome 1.0, Transcriptome 2.0 etc for Human, mouse, rat genomes

Affymetrix GeneChip exon arrays for exon-level gene expression in human, mouse and rat genomes.

The Genome-Wide Human SNP Array 6.0 contains more than 906,600 SNPs and more than 946,000 probes for the detection of CNVs.

Affymetrix GeneChip whole-genome tiling arrays contain probes that cover every base pair of the human genome and are designed to identify novel transcripts and to map sites of protein/DNA interaction in chromatin immunoprecipitation (ChIP-chip) experiments. Probes are tiled at 5-35 base pair resolution.

Many more types of Affymetrix arrays exist.

Affymetrix is now fully owned by Thermo Fisher Scientific.


Oligonucleotide microarrays developed by Agilent Technologies

Agilent makes oligonucleotide microarrays for studying gene expression patterns, exon microarrays for exon-level expression analysis, CGH (comparative genomic hybridization) microarrays for analyzing DNA copy number variations (CNVs), and microarrays for microRNA profiling. Agilent's SurePrint technology, which uses in situ synthesis of oligonucleotides, is a key feature of their microarray manufacturing process.