Biostatistics with R

Two Factor Analysis of Variance (Two way ANOVA)
Single observation per cell

Two factor ANOVA is employed to investigate the effects of two factors on the outcome of an experiment.

Let us take two factors (attributes) called A and B, with $a$ and $b$ levels (ie., categories) respectively. This gives us a total of $n=ab$ possible combinations (cells). For each cell, ie., each pair of levels of A and B, we make a single observation. The observations are tabulated in a row-column format as follows:

$$\small{\begin{array}{c|ccccccc|c}
 & \multicolumn{7}{c|}{\text{Attribute B}} & \\
\text{Attribute A} & 1 & 2 & 3 & \cdots & j & \cdots & b & \text{Row mean} \\
\hline
1 & X_{11} & X_{12} & X_{13} & \cdots & X_{1j} & \cdots & X_{1b} & \overline{X}_{1{\large\cdot}} \\
2 & X_{21} & X_{22} & X_{23} & \cdots & X_{2j} & \cdots & X_{2b} & \overline{X}_{2{\large\cdot}} \\
3 & X_{31} & X_{32} & X_{33} & \cdots & X_{3j} & \cdots & X_{3b} & \overline{X}_{3{\large\cdot}} \\
\vdots & \vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
i & X_{i1} & X_{i2} & X_{i3} & \cdots & X_{ij} & \cdots & X_{ib} & \overline{X}_{i{\large\cdot}} \\
\vdots & \vdots & \vdots & \vdots & & \vdots & & \vdots & \vdots \\
a & X_{a1} & X_{a2} & X_{a3} & \cdots & X_{aj} & \cdots & X_{ab} & \overline{X}_{a{\large\cdot}} \\
\hline
\text{Column mean} & \overline{X}_{{\large\cdot}1} & \overline{X}_{{\large\cdot}2} & \overline{X}_{{\large\cdot}3} & \cdots & \overline{X}_{{\large\cdot}j} & \cdots & \overline{X}_{{\large\cdot}b} & \overline{X}_{\large\cdot\cdot} \\
\end{array}}$$

The mean of all $a\times b$ data points ("grand mean") is denoted by $\overline{X}_{\large\cdot\cdot}$.

In the above table, the $b$ levels of attribute B are labelled by the column index $j=1,2,3,....,b$. Similarly, the $a$ levels of attribute A are labelled by the row index $i=1,2,3,....,a$.

In general, the single observation in the cell at the $i^{th}$ row and $j^{th}$ column is denoted by the double indexing $X_{ij}$.

The following assumptions are made in this analysis :

1. The observations in the $n=ab$ cells are independent of each other.

2. Each data point $X_{ij}$ in a cell (i,j) is a random sample from a Gaussian distribution of mean $\mu_{ij}$ and variance $\sigma^2$, ie., $N(\mu_{ij},\sigma^2)$.

3. It is assumed that the mean value $\mu_{ij}$ of the distribution from which $X_{ij}$ was randomly sampled can be written as a sum of a row effect, a column effect and an overall effect:

\(~~~~~~~~~~~~~~~~~~~~~~~~~~\small{\mu_{ij}~=~\mu~+\alpha_i~+~\beta_j }\)

where,

\(\small{\mu}~~\) is an unknown constant value common to all cells,

\(\small{\alpha_i}~~\) is the contribution from the $i^{th}$ level of attribute A.

\(\small{\beta_j}~~\) is the contribution from the $j^{th}$ level of attribute B.

4. The effects of the $i^{th}$ level of attribute A and the $j^{th}$ level of attribute B are additive, ie, their combined effect is neither greater nor less than the sum of their individual effects. Mathematically, the effects in this model are defined so that they satisfy the following conditions:

\(~~~~~~~~~~~~~~~~~~~~~~~~~\small{\displaystyle{ \sum_{i=1}^{a} \alpha_i}~=~0 ~~~~~}\) and \(~~~~~\small{\displaystyle{ \sum_{j=1}^{b} \beta_j}~=~0 }\)

In general, the above assumption is not violated so long as the individual mean values are not appreciably different from each other.


$~~~~~~~~~~$ Hypotheses to be tested

In this analysis, we test two null hypotheses together:

1. We test the null hypothesis that there is no row effect: $~~~H_0^A~:~\alpha_1=\alpha_2=\alpha_3=....=\alpha_a~=0$.

2. We also test the null hypothesis that there is no column effect: $~H_0^B~:~\beta_1=\beta_2=\beta_3=....=\beta_b~=0$.

In order to do the hypothesis testing, we proceed with the following steps:


$~~~~~~~~~~$ Step 1 : Compute the row means, column means and the mean of the whole data

The mean of $i^{th}$ row is computed as,

\(~~~~~~~~~~~~\small{\overline{X}_{i\large\cdot}~=~\dfrac{1}{b}\displaystyle{\sum_{j=1}^{b}X_{ij}}~~=~~\dfrac{X_{i1}+X_{i2}+X_{i3}+.....+X_{ib}}{b} }\)

The mean of $j^{th}$ column is computed as,

\(~~~~~~~~~~~~\small{\overline{X}_{{\large\cdot}j}~=~\dfrac{1}{a}\displaystyle{\sum_{i=1}^{a}X_{ij}}~~=~~\dfrac{X_{1j}+X_{2j}+X_{3j}+.....+X_{aj}}{a} }\)

The mean of the whole data set ("grand mean") is computed by summing all the data points and dividing by the total number of points:

\(\small{\overline{X}_{\large{\cdot \cdot}}~=~\dfrac{1}{ab}\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}X_{ij}}~=~\dfrac{(X_{11}+X_{12}+....+X_{1b})+(X_{21}+X_{22}+....+X_{2b}) + .....+ (X_{a1}+X_{a2}+....+X_{ab}) }{ab} }\)
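These means can be computed in R with the rowMeans(), colMeans() and mean() functions. Below is a minimal sketch; the matrix X holds, as an illustration, the data of Example-1 further down this page (rows = levels of A, columns = levels of B):

## a x b data matrix: rows = levels of factor A, columns = levels of factor B
## (filled with the data of Example-1 below)
X = matrix(c(52, 47, 44, 51, 42,
             60, 55, 49, 52, 43,
             56, 48, 45, 44, 38), nrow = 3, byrow = TRUE)

rowMeans(X)   ## row means Xbar_i.    : 47.2  51.8  46.2
colMeans(X)   ## column means Xbar_.j : 56  50  46  49  41
mean(X)       ## grand mean Xbar_..   : 48.4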


$~~~~~~~~~~$ Step 2 : Partition the variance of the whole data set into components

The variation in the whole data set is measured by summing the squared deviations of each data point $X_{ij}$ from the grand mean. This sum of squares ("the total sum of squares, SST") is written as,

\(\small{~~~~~~~~~~~~~SST~=~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij}-\overline{X}_{\large\cdot\cdot})^2} }\)

We will now split the above total sum of squares into three parts: the sum of squares among the levels of factor A, the sum of squares among the levels of factor B, and the residual sum of squares. In order to achieve this, we add and subtract the terms \(\small{\overline{X}_{i\large\cdot}}\) and \(\small{\overline{X}_{{\large\cdot}j}}\) in the total sum square expression and manipulate the summation terms:

\(\small{~~~~~~~~~~~~~SST~=~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij}-\overline{X}_{\large\cdot\cdot})^2} }\)

\(\small{~~~~~~~~~~~~~~~~~~~~=~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b} [(\overline{X}_{i\large\cdot}-\overline{X}_{\large\cdot\cdot} ) + (\overline{X}_{{\large\cdot} j}-\overline{X}_{\large\cdot\cdot} ) + (X_{ij} - \overline{X}_{i\large\cdot} - \overline{X}_{{\large\cdot} j} + \overline{X}_{\large\cdot\cdot} ) ]^2 } }\)

$$~~=~ \displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(\overline{X}_{i\large\cdot}-\overline{X}_{\large\cdot\cdot})^2} ~+~ \displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(\overline{X}_{{\large\cdot} j}-\overline{X}_{\large\cdot\cdot} )^2} ~+~ \displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij} - \overline{X}_{i\large\cdot} - \overline{X}_{{\large\cdot} j} + \overline{X}_{\large\cdot\cdot})^2} \\ ~~~~~~~~~~~~~~~~+~ \displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b} 2 (\overline{X}_{i\large\cdot}-\overline{X}_{\large\cdot\cdot}) (\overline{X}_{{\large\cdot} j}-\overline{X}_{\large\cdot\cdot} )} ~+~ \displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}2(\overline{X}_{{\large\cdot} j}-\overline{X}_{\large\cdot\cdot})(X_{ij} - \overline{X}_{i\large\cdot} - \overline{X}_{{\large\cdot} j} + \overline{X}_{\large\cdot\cdot}) } \\ ~+~ \displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}2(\overline{X}_{i\large\cdot}-\overline{X}_{\large\cdot\cdot})(X_{ij} - \overline{X}_{i\large\cdot} - \overline{X}_{{\large\cdot} j} + \overline{X}_{\large\cdot\cdot}) } $$

The first three terms on the right hand side of the above expression are squares of deviation terms. The last three terms are cross product terms. It can be shown that each of these cross terms sums to zero (not done here). Retaining the first three terms, the expression for the total sum of squares SST is written as,

$$SST~~=~ \displaystyle{b\sum_{i=1}^{a}(\overline{X}_{i\large\cdot}-\overline{X}_{\large\cdot\cdot})^2} ~+~ \displaystyle{a\sum_{j=1}^{b}(\overline{X}_{{\large\cdot} j}-\overline{X}_{\large\cdot\cdot} )^2} ~+~ \displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij} - \overline{X}_{i\large\cdot} - \overline{X}_{{\large\cdot} j} + \overline{X}_{\large\cdot\cdot})^2} $$

\(\small{~~~~~~~~~~~~~~SST~=~SSA~+~SSB~+~SSE }\)

where,

\(\small{SSA~~=~\displaystyle{b\sum_{i=1}^{a}(\overline{X}_{i\large\cdot}-\overline{X}_{\large\cdot\cdot})^2}}~~~~\)is the sum of squares among the levels of factor A,

\(\small{SSB~~=~~\displaystyle{a\sum_{j=1}^{b}(\overline{X}_{{\large\cdot} j}-\overline{X}_{\large\cdot\cdot} )^2}}~~~~\)is the sum of squares among the levels of factor B,

\(\small{SSE~~=~~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij} - \overline{X}_{i\large\cdot} - \overline{X}_{{\large\cdot} j} + \overline{X}_{\large\cdot\cdot})^2}~~~~}\)is the error or residual sum of squares.

In order to understand the residual sum of squares, we manipulate the terms as follows:

\(~~~~~~\small{SSE~=~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij} - \overline{X}_{i\large\cdot} - \overline{X}_{{\large\cdot} j} + \overline{X}_{\large\cdot\cdot})^2} }\)

\(~~~~~~~~~~~~\small{~=~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij} - (\overline{X}_{i\large\cdot}~-~\overline{X}_{\large\cdot\cdot})- (\overline{X}_{{\large\cdot} j}~-~ \overline{X}_{\large\cdot\cdot} ) - \overline{X}_{\large\cdot\cdot})^2} }\)

The deviation term inside the bracket that is squared has the form \(\small{X_{ij}-\mu_{ij}~=~X_{ij}~-~(\alpha_i + \beta_j + \mu )}\), with \(\small{(\overline{X}_{i\large\cdot}-\overline{X}_{\large\cdot\cdot})}\) estimating $\alpha_i$, \(\small{(\overline{X}_{{\large\cdot}j}-\overline{X}_{\large\cdot\cdot})}\) estimating $\beta_j$ and \(\small{\overline{X}_{\large\cdot\cdot}}\) estimating $\mu$. This is consistent with the model for $X_{ij}$ assumed in the beginning.
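The partition SST = SSA + SSB + SSE can also be verified numerically. A minimal R sketch, continuing with the matrix X defined in Step 1:

a = nrow(X)
b = ncol(X)

SSA = b * sum( (rowMeans(X) - mean(X))^2 )   ## sum of squares among levels of A
SSB = a * sum( (colMeans(X) - mean(X))^2 )   ## sum of squares among levels of B
SST = sum( (X - mean(X))^2 )                 ## total sum of squares

## residuals X_ij - Xbar_i. - Xbar_.j + Xbar..
resid = X - outer(rowMeans(X), colMeans(X), "+") + mean(X)
SSE = sum(resid^2)

SST - (SSA + SSB + SSE)   ## practically zero: the partition holds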


$~~~~~~~~~~$ Step 3 : Expression for a statistic in terms of observed sums of squares

Under the null hypotheses \(\small{H_0^A:~\alpha_1=\alpha_2=....=\alpha_a~=~0}\) and \(\small{H_0^B:~\beta_1=\beta_2=....=\beta_b~=~0}\), the quantities \(\small{SSA/\sigma^2}\), \(\small{SSB/\sigma^2}\) and \(\small{SSE/\sigma^2}\) are independent chi-square variables with $(a-1)$, $(b-1)$ and $(a-1)(b-1)$ degrees of freedom respectively.

When the null hypothesis $H_0^A$ is true, \(\small{SSA/(a-1)}~\) and \(\small{SSE/[(a-1)(b-1)]}~\) are both unbiased estimators of $\sigma^2$, and their ratio is thus an F statistic:

\(~~~~~~~~~~~~~~\small{F_A~=~\dfrac{SSA/(a-1)}{SSE/[(a-1)(b-1)]}}~~~~\) follows $F(a-1,~(a-1)(b-1))$, an F distribution with $(a-1)$ and $(a-1)(b-1)$ degrees of freedom.

Therefore, the null hypothesis $H_0^A$ (ie, the hypothesis of no row effect) is rejected at a significance level of $\alpha$ if the observed value of the statistic satisfies $F_A \geq F_{1-\alpha}(a-1, (a-1)(b-1))$.


Similarly, when the null hypothesis $H_0^B$ is true, \(\small{SSB/(b-1)}~\) and \(\small{SSE/[(a-1)(b-1)]}~\) are both unbiased estimators of $\sigma^2$, and their ratio is thus an F statistic:

\(~~~~~~~~~~~~~~\small{F_B~=~\dfrac{SSB/(b-1)}{SSE/[(a-1)(b-1)]}}~~~~\) follows $F(b-1,~(a-1)(b-1))$, an F distribution with $(b-1)$ and $(a-1)(b-1)$ degrees of freedom.

Therefore, the null hypothesis $H_0^B$ (ie, the hypothesis of no column effect) is rejected at a significance level of $\alpha$ if the observed value of the statistic satisfies $F_B \geq F_{1-\alpha}(b-1, (a-1)(b-1))$.
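Both tests can be carried out in R with qf() for the critical values and pf() for the p-values. A minimal sketch, continuing with SSA, SSB, SSE, a and b from the previous step:

dfA = a - 1                ## degrees of freedom for factor A
dfB = b - 1                ## degrees of freedom for factor B
dfE = (a - 1) * (b - 1)    ## degrees of freedom for the residual

FA = (SSA / dfA) / (SSE / dfE)   ## F statistic for H_0^A
FB = (SSB / dfB) / (SSE / dfE)   ## F statistic for H_0^B

alpha = 0.05
qf(1 - alpha, dfA, dfE)   ## reject H_0^A if FA >= this critical value
qf(1 - alpha, dfB, dfE)   ## reject H_0^B if FB >= this critical value

1 - pf(FA, dfA, dfE)      ## p-value for H_0^A
1 - pf(FB, dfB, dfE)      ## p-value for H_0^B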

Example-1 : In a quality control experiment on drugs, three treatment methods A1, A2 and A3 were tried across five brands B1, B2, B3, B4 and B5 of a generic drug. For each (treatment, brand) combination, one patient was treated and the measurements are tabulated:

----------------------------------------------------------------------------------
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ Brand

Treatment$~~~~~~~~~~$B1$~~~~~~~~$B2$~~~~~~~~$B3$~~~~~~~~$B4$~~~~~~~~$B5

A1$~~~~~~~~~~~~~~~~~~~~~~$52$~~~~~~~~~$47$~~~~~~~~$44$~~~~~~~~$51$~~~~~~~~$42

A2$~~~~~~~~~~~~~~~~~~~~~~$60$~~~~~~~~~$55$~~~~~~~~$49$~~~~~~~~$52$~~~~~~~~$43

A3$~~~~~~~~~~~~~~~~~~~~~~$56$~~~~~~~~~$48$~~~~~~~~$45$~~~~~~~~$44$~~~~~~~~$38

----------------------------------------------------------------------------------
Using two way ANOVA, test \(\small{H_0^A : \alpha_1=\alpha_2=\alpha_3=0}\) at a significance level of 0.05 against all alternatives.

Also test \(\small{H_0^B : \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 0}\) against all alternatives.


We first compute the row and column means required for the two way ANOVA and fill them into the table as follows:

----------------------------------------------------------------------------------
$~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~$ Brand

Treatment$~~~~~~~~~~$B1$~~~~~~~~$B2$~~~~~~~~$B3$~~~~~~~~$B4$~~~~~~~~$B5$~~~~~~~~~~~$\(\small{\overline{X}_{i\large\cdot}}\)

A1$~~~~~~~~~~~~~~~~~~~~~~$52$~~~~~~~~~$47$~~~~~~~~$44$~~~~~~~~$51$~~~~~~~~$42$~~~~~~~~~~~$47.2

A2$~~~~~~~~~~~~~~~~~~~~~~$60$~~~~~~~~~$55$~~~~~~~~$49$~~~~~~~~$52$~~~~~~~~$43$~~~~~~~~~~~$51.8

A3$~~~~~~~~~~~~~~~~~~~~~~$56$~~~~~~~~~$48$~~~~~~~~$45$~~~~~~~~$44$~~~~~~~~$38$~~~~~~~~~~~$46.2

\(\small{\overline{X}_{{\large\cdot}j}}\)$~~~~~~~~~~~~~~~~~~~~~$56$~~~~~~~~~$50$~~~~~~~~$46$~~~~~~~~$49$~~~~~~~~$41

$~~~~~~~~~~~~~~~~~~~~~~~~~~~~$\(\small{\overline{X}_{\large\cdot\cdot}~=~48.4}\)

----------------------------------------------------------------------------------

We now calculate the sums of squares needed for the F statistics:

\(\small{SSA~~=~\displaystyle{b\sum_{i=1}^{a}(\overline{X}_{i\large\cdot}-\overline{X}_{\large\cdot\cdot})^2}~=~5[(47.2-48.4)^2 + (51.8-48.4)^2 + (46.2-48.4)^2]~=~89.2 }\)

\(\small{ SSB~~=~~\displaystyle{a\sum_{j=1}^{b}(\overline{X}_{{\large\cdot} j}-\overline{X}_{\large\cdot\cdot} )^2} }\)

\(\small{~~~~~~~~~~~=~ 3[(56-48.4)^2 + (50-48.4)^2 + (46-48.4)^2 + (49-48.4)^2 + (41-48.4)^2]~}\)

\(\small{~~~~~~~~~~~=~ 363.6 }\)

\(\small{SSE~~=~~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}(X_{ij} - \overline{X}_{i\large\cdot} - \overline{X}_{{\large\cdot} j} + \overline{X}_{\large\cdot\cdot})^2} }\)

\(\small{~~~~~~=(52-47.2-56+48.4)^2+(47-47.2-50+48.4)^2+(44-47.2-46+48.4)^2 }\)

\(\small{~~~~~~+(51-47.2-49+48.4)^2+(42-47.2-41+48.4)^2+(60-51.8-56+48.4)^2 }\)

\(\small{~~~~~~+(55-51.8-50+48.4)^2+(49-51.8-46+48.4)^2+(52-51.8-49+48.4)^2 }\)

\(\small{~~~~~~+(43-51.8-41+48.4)^2+(56-46.2-56+48.4)^2+(48-46.2-50+48.4)^2 }\)

\(\small{~~~~~~+(45-46.2-46+48.4)^2+(44-46.2-49+48.4)^2+(38-46.2-41+48.4)^2 }\)

\(\small{~~~~=~46.8 }\)


\(\small{ F_A~=~\dfrac{SSA/(a-1)}{SSE/[(a-1)(b-1)]}~=~\dfrac{89.2/(3-1)}{46.8/((3-1)(5-1))}~=~7.624 }\)

\(\small{ F_B~=~\dfrac{SSB/(b-1)}{SSE/[(a-1)(b-1)]}~=~\dfrac{363.6/(5-1)}{46.8/((3-1)(5-1))}~=~15.54 }\)

For an \(\small{\alpha=0.05,~~~~~~~~F_{1-\alpha}(a-1,~(a-1)(b-1))~=~F_{0.95}(3-1,~(3-1)(5-1))~=~4.459}\)

Since our computed \(\small{F_A = 7.624 \geq 4.459}\), we reject the null hypothesis \(\small{H_0^A: \alpha_{1}=\alpha_{2}=\alpha_{3}=0}\) against all alternatives.

Similarly, for an \(\small{\alpha=0.05,~~~~~~~~F_{1-\alpha}(b-1,~(a-1)(b-1))~=~F_{0.95}(5-1,~(3-1)(5-1))~=~3.838}\)

Since our computed \(\small{F_B = 15.54 \geq 3.838}\), we reject the null hypothesis \(\small{H_0^B: \beta_{1}=\beta_{2}=\beta_{3}=\beta_4=\beta_5=0}\) against all alternatives.

R scripts


In R, we can use the aov() function to perform two factor ANOVA.

For this, we have to prepare an input file with three columns: two for the names of the two factors and one for the observed value.

For example, we consider the same data for which we computed the ANOVA F statistics above.

The input data is typed into a text file twoway_anova_data.txt in the following format:


factorA   factorB      value
  A1         B1           52
  A1         B2           47
  A1         B3           44
  A1         B4           51
  A1         B5           42
  A2         B1           60
  A2         B2           55
  A2         B3           49
  A2         B4           52
  A2         B5           43
  A3         B1           56
  A3         B2           48
  A3         B3           45
  A3         B4           44
  A3         B5           38

With this data, the following R script performs the two factor ANOVA:

## Two way anova with one observation per cell

## read the data into a data frame
dat = read.table("twoway_anova_data.txt", header=TRUE)

## box plots of the observations across the factor combinations
boxplot(value ~ factorA * factorB, data = dat)

## call the function aov with the two factors
res = aov(value ~ factorA + factorB, data = dat)

## print the results
summary(res)

Running the above code gives the following output:
             Df Sum Sq Mean Sq F value   Pr(>F)    
factorA       2   89.2   44.60   7.624 0.014023 *  
factorB       4  363.6   90.90  15.538 0.000768 ***
Residuals     8   46.8    5.85                     
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

We see that the p-values for the null hypotheses on factor A and factor B are 0.014 and 0.000768 respectively. Since both are less than 0.05, we reject both null hypotheses $H_0^A$ and $H_0^B$.
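As a follow-up, the estimated row effects \(\small{\hat{\alpha}_i}\) and column effects \(\small{\hat{\beta}_j}\) of the fitted additive model can be inspected with the model.tables() function:

## estimated effects alpha_i and beta_j (deviations from the grand mean)
model.tables(res)

## fitted grand, row and column means
model.tables(res, type = "means")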