Biostatistics with R

Two Factor Analysis of Variance (Two way ANOVA)
Multiple observations per cell

In a two-factor ANOVA, there may be cases in which the two factors A and B interact.

Let us take two factors (attributes) called A and B, with a and b levels (i.e., categories) respectively. This gives us a total of $ab$ possible combinations (cells). For each cell, i.e., each pair of levels of A and B, we make c independent observations. The observations are tabulated in row-by-column format below:

$$\begin{array}{c|cccccc|c}
 & \multicolumn{6}{c|}{\text{Attribute } B} & \\
\text{Attribute } A & 1 & 2 & \cdots & j & \cdots & b & \text{Row mean} \\
\hline
1 & \{X_{11k}\} & \{X_{12k}\} & \cdots & \{X_{1jk}\} & \cdots & \{X_{1bk}\} & \overline{X}_{1\cdot\cdot} \\
2 & \{X_{21k}\} & \{X_{22k}\} & \cdots & \{X_{2jk}\} & \cdots & \{X_{2bk}\} & \overline{X}_{2\cdot\cdot} \\
3 & \{X_{31k}\} & \{X_{32k}\} & \cdots & \{X_{3jk}\} & \cdots & \{X_{3bk}\} & \overline{X}_{3\cdot\cdot} \\
\vdots & & & & & & & \vdots \\
i & \{X_{i1k}\} & \{X_{i2k}\} & \cdots & \{X_{ijk}\} & \cdots & \{X_{ibk}\} & \overline{X}_{i\cdot\cdot} \\
\vdots & & & & & & & \vdots \\
a & \{X_{a1k}\} & \{X_{a2k}\} & \cdots & \{X_{ajk}\} & \cdots & \{X_{abk}\} & \overline{X}_{a\cdot\cdot} \\
\hline
\text{Column mean} & \overline{X}_{\cdot 1 \cdot} & \overline{X}_{\cdot 2 \cdot} & \cdots & \overline{X}_{\cdot j \cdot} & \cdots & \overline{X}_{\cdot b \cdot} & \overline{X}_{\cdot\cdot\cdot}
\end{array}$$

Mean of all $a \times b \times c$ data points $~=~\overline{X}_{\large\cdot\cdot\cdot}$

In the above table, we label the b levels of attribute B along each row by the index $j=1,2,3,....,b$. Similarly, the a levels of attribute A are labelled by the index $i=1,2,3,....,a$ along each column.

A set of c independent observations in the cell at $i^{th}$ row and $j^{th}$ column is denoted by the symbol $\{X_{ijk}\}$ where $k =1,2,3,...,c$.

For example, $\{X_{23k}\}$ represents the set of c observations in the cell (2,3).
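As a concrete illustration of this indexing, here is a short NumPy sketch (Python is used here purely for illustration; the array sizes and random data are our own invention):

```python
import numpy as np

# Hypothetical example: a = 2 levels of A, b = 3 levels of B and
# c = 4 replicate observations per cell, stored as a 3-D array
# X[i, j, k] (0-based indices in code vs. 1-based in the text).
rng = np.random.default_rng(0)
a, b, c = 2, 3, 4
X = rng.normal(size=(a, b, c))

# The set {X_23k} of the text (row 2, column 3) is the slice
# X[1, 2, :] in 0-based NumPy indexing.
cell_23 = X[1, 2, :]
print(cell_23.shape)  # (4,) -- the c observations in cell (2, 3)
```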

The following assumptions are made in this analysis :

1. The n=abc observations are independent of each other.

2. Each data point $X_{ijk}$ in a cell (i,j) is a random sample from a Gaussian distribution with mean $\mu_{ij}$ and a common but unknown variance $\sigma^2$, i.e., $N(\mu_{ij},\sigma^2)$.

3. It is assumed that the mean $\mu_{ij}$ of the distribution from which $X_{ijk}$ was randomly sampled can be written as the sum of an overall effect $\mu$, a row effect $\alpha_i$, a column effect $\beta_j$ and an interaction term $\gamma_{ij}$ that represents the interaction between the two factors in cell (i,j):

\(~~~~~~~~~~~~~~~~~~~~~~~~~~\small{\mu_{ij}~=~\mu~+\alpha_i~+~\beta_j ~ + \gamma_{ij}}\)

where,

\(\small{\mu}~~\) is an unknown constant value across all cells,

\(\small{\alpha_i}~~\) is the contribution from the $i^{th}$ level of attribute A.

\(\small{\beta_j}~~\) is the contribution from the $j^{th}$ level of attribute B.

\(\small{\gamma_{ij}}~~\) is the interaction term associated with the cell (i,j).

4. The following conditions have to be satisfied by these terms:

\(~~~~~~~~~~~~~~~~~~~~~~~~~\small{\displaystyle{ \sum_{i=1}^{a} \alpha_i}~=~0 ~~~~~}\), \(~~~~~\small{\displaystyle{ \sum_{j=1}^{b} \beta_j}~=~0 }~~~~\), \(~~~~~\small{\displaystyle{ \sum_{i=1}^{a} \gamma_{ij}}~=~0 }~~~\) for every j, and \(~~~~~\small{\displaystyle{ \sum_{j=1}^{b} \gamma_{ij}}~=~0 }\) for every i.
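To see that effects satisfying all four side conditions always exist, the sketch below constructs them by centring and then simulates data from the model; every size and parameter value in it is made up for illustration:

```python
import numpy as np

# Sketch: build effects obeying the four side conditions by centring,
# then simulate X_ijk = mu + alpha_i + beta_j + gamma_ij + error.
rng = np.random.default_rng(1)
a, b, c = 3, 4, 5
mu, sigma = 10.0, 2.0

alpha = rng.normal(size=a)
alpha -= alpha.mean()                       # sum_i alpha_i = 0
beta = rng.normal(size=b)
beta -= beta.mean()                         # sum_j beta_j = 0

gamma = rng.normal(size=(a, b))
gamma -= gamma.mean(axis=0, keepdims=True)  # each column of gamma sums to 0
gamma -= gamma.mean(axis=1, keepdims=True)  # each row sums to 0 (columns stay centred)

cell_means = mu + alpha[:, None] + beta[None, :] + gamma   # mu_ij, shape (a, b)
X = cell_means[:, :, None] + rng.normal(scale=sigma, size=(a, b, c))
```

Note that centring the rows of gamma after its columns does not disturb the column constraint, because the column-centred gamma already sums to zero overall.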


$~~~~~~~~~~$ Hypotheses to be tested

In this analysis, we test three null hypotheses together:

1. We test the null hypothesis that there is no row effect: $~~~H_0^A~:~\alpha_1=\alpha_2=\alpha_3=....=\alpha_a~=0$.

2. We test the null hypothesis that there is no column effect: $~H_0^B~:~\beta_1=\beta_2=\beta_3=....=\beta_b~=0$.

3. We test the null hypothesis that there is no interaction: $~~~~~~~H_0^{AB}~:~\gamma_{ij}~=0,~~~~i=1,2,3,...a~~~~and~~~j=1,2,3,...,b$.

In order to do the hypothesis testing, we proceed with the following steps:


$~~~~~~~~~~$ Step 1 : Compute the row, column and cell means and the grand mean of the data

The mean of $i^{th}$ row is computed as,

\(~~~~~~~~~~~~\small{\overline{X}_{i\large\cdot\cdot}~=~\dfrac{1}{bc}\displaystyle{\sum_{j=1}^{b} \sum_{k=1}^{c}X_{ijk}}~~~~~~~}\) mean of all observations in the cells along row i.

The mean of $j^{th}$ column is computed as,

\(~~~~~~~~~~~~\small{\overline{X}_{{\large\cdot}j{\large\cdot}}~=~\dfrac{1}{ac}\displaystyle{\sum_{i=1}^{a}\sum_{k=1}^{c}X_{ijk}}~~~~~~ }\) mean of all observations in the cells along column j.

The mean of c observations in cell (i,j) is computed as,

\(~~~~~~~~~~~~\small{\overline{X}_{ij{\large\cdot}}~=~\dfrac{1}{c}\displaystyle{\sum_{k=1}^{c}X_{ijk}}~~~~~~ }\)

The mean of the whole data set ("grand mean") is computed by summing all the observations in all the cells and dividing by the total number of data points $n=abc~~$:

\(~~~~~~~~~~~~\small{\overline{X}_{\large{\cdot \cdot \cdot}}~=~\dfrac{1}{abc}\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}X_{ijk}} }\)
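The four kinds of means in Step 1 can be computed in a few lines. The following NumPy sketch (Python for illustration; the data cube is simulated with sizes of our own choosing) mirrors the formulas above:

```python
import numpy as np

# Step 1 on a simulated data cube X of shape (a, b, c).
rng = np.random.default_rng(2)
a, b, c = 3, 4, 5
X = rng.normal(loc=10.0, size=(a, b, c))

row_mean = X.mean(axis=(1, 2))    # Xbar_i.. : average over columns and replicates
col_mean = X.mean(axis=(0, 2))    # Xbar_.j. : average over rows and replicates
cell_mean = X.mean(axis=2)        # Xbar_ij. : average of the c values in each cell
grand_mean = X.mean()             # Xbar_... : average of all abc observations
```

Averaging the a row means (or the b column means) reproduces the grand mean, which is a quick sanity check on the computation.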


$~~~~~~~~~~$ Step 2 : Partition the variance of the whole data set into components

The variance associated with the whole data set is obtained from the sum of squared deviations of each data point $X_{ijk}$ from the grand mean. This term ("the total sum of squares (SST)") is written as,

\(\small{~~~~~~~~~~~~~SST~=~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk}-\overline{X}_{\large\cdot\cdot\cdot})^2} }\)

We will now split the above total sum of squares into four parts: the sum of squares among the levels of factor A, the sum of squares among the levels of factor B, the sum of squares due to the interaction between factors A and B, and the residual sum of squares. In order to achieve this, we add and subtract the terms \(\small{\overline{X}_{i\large\cdot\cdot}}\), \(\small{\overline{X}_{{\large\cdot}j{\large\cdot}}}\), \(\small{\overline{X}_{ij{\large\cdot}}}\) and \(\small{\overline{X}_{\large\cdot\cdot\cdot}}\) in the total sum of squares expression and manipulate the summation terms:

\(\small{SST~=~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk}-\overline{X}_{\large\cdot\cdot\cdot})^2} }\)

\(\small{~~~~~~~=~\displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c} [(\overline{X}_{i\large\cdot\cdot}-\overline{X}_{\large\cdot\cdot\cdot} ) + (\overline{X}_{{\large\cdot}j{\large\cdot}}-\overline{X}_{\large\cdot\cdot\cdot} ) + (\overline{X}_{ij\large\cdot} - \overline{X}_{i\large\cdot\cdot} - \overline{X}_{{\large\cdot}j{\large\cdot}} + \overline{X}_{\large\cdot\cdot\cdot} ) + (X_{ijk}-\overline{X}_{ij\large\cdot} ) ]^2 } }\)

$$~~=~ \displaystyle{\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(\overline{X}_{i\large\cdot\cdot}-\overline{X}_{\large\cdot\cdot\cdot})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(\overline{X}_{{\large\cdot}j{\large\cdot}}-\overline{X}_{\large\cdot\cdot\cdot})^2 + \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(\overline{X}_{ij\large\cdot} - \overline{X}_{i\large\cdot\cdot} - \overline{X}_{{\large\cdot}j{\large\cdot}} + \overline{X}_{\large\cdot\cdot\cdot})^2 \\ ~~~~~+ \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk}-\overline{X}_{ij\large\cdot})^2} ~+~ (cross~product~terms) $$

The first four terms on the right-hand side of the above expression are squared deviation terms; the remaining terms are cross products. It can be shown that all the cross-product terms sum to zero (not done here). Retaining the first four terms, the expression for the total sum of squares SST is written as,

$$SST~~=~ \displaystyle{bc\sum_{i=1}^{a}(\overline{X}_{i\large\cdot\cdot}-\overline{X}_{\large\cdot\cdot\cdot})^2 + ac\sum_{j=1}^{b}(\overline{X}_{{\large\cdot}j{\large\cdot}}-\overline{X}_{\large\cdot\cdot\cdot})^2 + c\sum_{i=1}^{a}\sum_{j=1}^{b}(\overline{X}_{ij\large\cdot} - \overline{X}_{i\large\cdot\cdot} - \overline{X}_{{\large\cdot}j{\large\cdot}} + \overline{X}_{\large\cdot\cdot\cdot})^2 \\ ~~~~~+ \sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk}-\overline{X}_{ij\large\cdot})^2} $$

\(\small{~~~~~~~~~~~~~~SST~=~SSA~+~SSB~+~SSAB~+~SSE }\)

where,

\(\small{SSA~~=~\displaystyle{bc\sum_{i=1}^{a}(\overline{X}_{i\large\cdot\cdot}-\overline{X}_{\large\cdot\cdot\cdot})^2}}~~~~\)is the sum of squares among the levels of factor A,

\(\small{SSB~~=~~\displaystyle{ac\sum_{j=1}^{b}(\overline{X}_{{\large\cdot}j{\large\cdot}}-\overline{X}_{\large\cdot\cdot\cdot} )^2}}~~~~\)is the sum of squares among the levels of factor B,

\(\small{SSAB~~=~~\displaystyle{c\sum_{i=1}^{a}\sum_{j=1}^{b}(\overline{X}_{ij{\large\cdot}} - \overline{X}_{i{\large\cdot\cdot}} - \overline{X}_{{\large\cdot}j{\large\cdot}} + \overline{X}_{\large\cdot\cdot\cdot})^2}~~~~}\)is the sum of squares due to interaction between factors A and B,

\(\small{SSE~~=~\sum_{i=1}^{a}\sum_{j=1}^{b}\sum_{k=1}^{c}(X_{ijk}-\overline{X}_{ij\large\cdot})^2~~~~~}\)is the residual sum of squares
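The partition SST = SSA + SSB + SSAB + SSE can be checked numerically. The sketch below (simulated data, purely illustrative) computes each component directly from its definition and confirms that the four pieces add up to the total:

```python
import numpy as np

# Compute SSA, SSB, SSAB and SSE for a simulated (a, b, c) data cube
# and verify the partition of the total sum of squares.
rng = np.random.default_rng(3)
a, b, c = 3, 4, 5
X = rng.normal(loc=10.0, size=(a, b, c))

gm = X.mean()                      # grand mean Xbar_...
row_mean = X.mean(axis=(1, 2))     # Xbar_i..
col_mean = X.mean(axis=(0, 2))     # Xbar_.j.
cell_mean = X.mean(axis=2)         # Xbar_ij.

SST = ((X - gm) ** 2).sum()
SSA = b * c * ((row_mean - gm) ** 2).sum()
SSB = a * c * ((col_mean - gm) ** 2).sum()
SSAB = c * ((cell_mean - row_mean[:, None] - col_mean[None, :] + gm) ** 2).sum()
SSE = ((X - cell_mean[:, :, None]) ** 2).sum()

# the cross-product terms vanish, so the four components sum to SST
print(np.isclose(SST, SSA + SSB + SSAB + SSE))  # True
```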

In order to understand the residual sum of squares, note that the cell mean \(\small{\overline{X}_{ij{\large\cdot}}}\) is the natural estimate of the true cell mean \(\small{\mu_{ij}~=~\mu~+~\alpha_i~+~\beta_j~+~\gamma_{ij}}\) assumed in the beginning.

The deviation \(\small{X_{ijk}~-~\overline{X}_{ij{\large\cdot}}}\) that is squared in SSE therefore measures only the random scatter of the observations within cell (i,j): once the overall, row, column and interaction effects have been removed, what remains is pure error. This is why SSE provides an estimate of the common variance \(\small{\sigma^2}\), whether or not the null hypotheses hold.


$~~~~~~~~~~$ Step 3 : Expressions for test statistics in terms of the observed sums of squares



Under the assumptions of the null hypothesis \(\small{~~H_0^A:~\alpha_1=\alpha_2=....=\alpha_a~=~0~~~}\), \(\small{~~~~~~H_0^B:~\beta_1=\beta_2=....=\beta_b~=~0~~~}\) and \(~~\small{H_0^{AB}:~\gamma_{ij}=0,~~i=1,2,3,...,a,~~~~~j=1,2,3,...,b }~~\) the quantities \(\small{SSA/\sigma^2, SSB/\sigma^2 ~}\),\(\small{SSAB/\sigma^2 }~~\) and \(\small{SSE/\sigma^2 }~~\) are independent chi-square variables as follows:

\(~~~~~~~~~\small{SSA/\sigma^2~~~}\) is a chi-square variable with '(a-1)' degrees of freedom

\(~~~~~~~~~\small{SSB/\sigma^2~~~}\) is a chi-square variable with '(b-1)' degrees of freedom

\(~~~~~~~~~\small{ {SSAB/\sigma^2~~~}}\) is a chi-square variable with '(a-1)(b-1)' degrees of freedom

\(~~~~~~~~~\small{SSE/\sigma^2~~~}\) is a chi-square variable with 'ab(c-1)' degrees of freedom



1. When the null hypothesis $H_0^A$ is true, \(\small{SSA/(a-1)}~\) and \(\small{SSE/[ab(c-1)]}~\) are both unbiased estimators of $\sigma^2$, and their ratio is thus an F statistic:

\(~~~~~~~~~~~~~~\small{F_A~=~\dfrac{SSA/(a-1)}{SSE/[ab(c-1)]}}~~~~\)follows $F(a-1,~ab(c-1))$, an F distribution with $(a-1)$ and $ab(c-1)$ degrees of freedom.

Therefore, the null hypothesis $H_0^A$ (i.e., the hypothesis of no row effect) is rejected at a significance level of $\alpha$ if the observed value of the statistic satisfies $F_A \geq F_\alpha(a-1, ab(c-1))$.


2. When the null hypothesis $H_0^B$ is true, \(\small{SSB/(b-1)}~\) and \(\small{SSE/[ab(c-1)]}~\) are both unbiased estimators of $\sigma^2$, and their ratio is thus an F statistic:

\(~~~~~~~~~~~~~~\small{F_B~=~\dfrac{SSB/(b-1)}{SSE/[ab(c-1)]}}~~~~\)follows $F(b-1,~ab(c-1))$, an F distribution with $(b-1)$ and $ab(c-1)$ degrees of freedom.

Therefore, the null hypothesis $H_0^B$ (i.e., the hypothesis of no column effect) is rejected at a significance level of $\alpha$ if the observed value of the statistic satisfies $F_B \geq F_\alpha(b-1, ab(c-1))$.


3. When the null hypothesis $H_0^{AB}$ is true, \(\small{SSAB/[(a-1)(b-1)]}~\) and \(\small{SSE/[ab(c-1)]}~\) are both unbiased estimators of $\sigma^2$ and their ratio is thus an F statistic:

\(~~~~~~~~~~~~~~\small{F_{AB}~=~\dfrac{SSAB/[(a-1)(b-1)]}{SSE/[ab(c-1)]}}~~~~\)follows $F((a-1)(b-1),~ab(c-1))$, an F distribution with $(a-1)(b-1)$ and $ab(c-1)$ degrees of freedom.

Therefore, the null hypothesis $H_0^{AB}$ (i.e., the hypothesis of no interaction between factors A and B) is rejected at a significance level of $\alpha$ if the observed value of the statistic satisfies $F_{AB} \geq F_\alpha((a-1)(b-1), ab(c-1))$.
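Putting the three tests together, the following sketch computes $F_A$, $F_B$ and $F_{AB}$ with their p-values on simulated data that contains a deliberately large row effect (all numbers are our own; SciPy's F distribution supplies the upper-tail probabilities):

```python
import numpy as np
from scipy.stats import f

# Simulated data: a clear factor-A (row) effect, no B or interaction effect.
rng = np.random.default_rng(4)
a, b, c = 3, 4, 5
row_effect = np.array([-2.0, 0.0, 2.0])            # invented alpha_i values
X = 10.0 + row_effect[:, None, None] + rng.normal(size=(a, b, c))

gm = X.mean()
rm, cm, cell = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=2)
SSA = b * c * ((rm - gm) ** 2).sum()
SSB = a * c * ((cm - gm) ** 2).sum()
SSAB = c * ((cell - rm[:, None] - cm[None, :] + gm) ** 2).sum()
SSE = ((X - cell[:, :, None]) ** 2).sum()

dfA, dfB, dfAB, dfE = a - 1, b - 1, (a - 1) * (b - 1), a * b * (c - 1)
FA = (SSA / dfA) / (SSE / dfE)
FB = (SSB / dfB) / (SSE / dfE)
FAB = (SSAB / dfAB) / (SSE / dfE)

# p-values: upper-tail probabilities of the corresponding F distributions
pA, pB, pAB = f.sf(FA, dfA, dfE), f.sf(FB, dfB, dfE), f.sf(FAB, dfAB, dfE)
print(f"F_A = {FA:.2f} (p = {pA:.2g}), F_B = {FB:.2f}, F_AB = {FAB:.2f}")
```

In R, the same three tests are produced in one step by `summary(aov(y ~ A * B))` on the data in long format.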