Biostatistics with R

Kruskal Wallis Test

The method of one-way ANOVA tests whether more than two data sets have come from Gaussian distributions with equal means, under the assumption that their variances are equal.

The Kruskal Wallis test is a non-parametric test of whether the samples originate from the same distribution.

Like other non-parametric tests, it does not assume that the data sets are drawn from a normal distribution. Instead, it ranks the data points and computes a test statistic from the ranks under the null hypothesis.
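As a small illustration of what ranking means here, the sketch below applies R's built-in rank() function to a hypothetical vector (the values are made up for this example). With the default settings, tied values share the mean of the ranks they span.

```r
# Hypothetical data containing one tie (the two 15s)
x <- c(12, 15, 15, 20, 9)

# Default ties.method = "average": tied values receive the mean rank
r <- rank(x)
print(r)
# [1] 2.0 3.5 3.5 5.0 1.0
```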

The Kruskal Wallis test can be used when the sample sizes are small and we are not sure whether the data are drawn from a normal distribution.

In the Kruskal Wallis test, the following assumptions are made about the data sets:

1. All the data points are random samples from their populations.

2. Within a sample, the data points are independent of each other. The samples themselves are also independent of each other.

3. The measurement scale is at least ordinal, i.e., the data points can be ranked.

The null hypothesis for the Kruskal Wallis test states that the population distributions are identical. The alternative hypothesis is that at least one of the populations tends to exhibit higher values than at least one of the other populations.

The test procedure

Let there be k data sets, identified by indices i = 1, 2, 3, ..., k.

Let \(\small{n_i}\) be the number of data points in data set i and
\(\small{n = n_1 + n_2 + n_3 + \cdots + n_k}\) be the total number of data points.

1. Pool all the data sets and rank the data points from the smallest to the largest value.

2. Now separate the data points back into their data sets. For each data set, sum the ranks of its data points to get a rank sum. Thus each data set i has a rank sum \(\small{R_i}\). When values are tied, each tied data point is given the mean of the ranks the tied points would occupy.

3. The test statistic is defined as

\(\small{ H~=~\dfrac{12}{n(n+1)} \displaystyle\sum_{i=1}^{k}\dfrac{R_i^2}{n_i}~-~3(n+1) }\)

4. If each of the data sets has 5 or more data points, the above statistic approximately follows a chi-square distribution with k-1 degrees of freedom, where k is the number of data sets. When the sample sizes are smaller, a probability table is used for computing the critical values.

5. We reject the null hypothesis of identical population distributions if the test statistic H is greater than the \(\small{\chi^2}\) critical value with k-1 degrees of freedom at the chosen significance level.

6. There is a correction for ties, which is generally negligible when only a few values are tied.
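The steps above can be sketched directly in R. This is a minimal illustration, not the internal code of any library: it uses the four groups from the worked example in this section, the variable names (groups, R_i, H_corrected and so on) are chosen here for illustration, and the tie correction is assumed to be the standard one, which divides H by \(\small{1 - \sum_j (t_j^3 - t_j)/(n^3 - n)}\), where \(\small{t_j}\) is the number of data points sharing the j-th distinct value.

```r
# Data: the four groups from the worked example in this section
groups <- list(
  g1 = c(220, 214, 203, 184, 186, 200, 165),
  g2 = c(262, 193, 225, 200, 164, 266, 179),
  g3 = c(272, 192, 190, 208, 231, 235, 141),
  g4 = c(190, 255, 247, 278, 230, 269, 289)
)

# Pool the data, remembering which group each value came from
value <- unlist(groups, use.names = FALSE)
group <- rep(names(groups), times = lengths(groups))

n <- length(value)    # total number of data points
k <- length(groups)   # number of data sets

# Step 1: rank the pooled data (ties receive the mean rank by default)
r <- rank(value)

# Step 2: rank sum and size of each data set
R_i <- tapply(r, group, sum)
n_i <- tapply(r, group, length)

# Step 3: the test statistic H
H <- 12 / (n * (n + 1)) * sum(R_i^2 / n_i) - 3 * (n + 1)

# Step 6: tie correction (equals 1 when there are no ties)
t_j <- table(value)
tie_factor <- 1 - sum(t_j^3 - t_j) / (n^3 - n)
H_corrected <- H / tie_factor

# Steps 4 and 5: chi-square approximation with k - 1 degrees of freedom
p_value <- pchisq(H_corrected, df = k - 1, lower.tail = FALSE)

print(R_i)
cat("H =", H, " corrected H =", H_corrected, " p-value =", p_value, "\n")
```

For these data the corrected statistic should agree with the value reported by R's kruskal.test() function, which applies the same tie correction.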

The R script below performs the Kruskal Wallis test.


We consider the following 4 data sets to be compared by the Kruskal Wallis test:

Group-1 :  220  214  203  184  186  200  165

Group-2 :  262  193  225  200  164  266  179

Group-3 :  272  192  190  208  231  235  141 

Group-4 :  190  255  247  278  230  269  289

We shorten the names to g1, g2, g3 and g4.

These 4 sets can be written as two-column data, in which the first column holds the value and the second column holds the label of the group it belongs to. See below:

(The two columns can be tab separated (*.txt file) or comma separated (*.csv file).)

value  group
220     g1
214     g1
203     g1
184     g1
186     g1
200     g1
165     g1
262     g2
193     g2
225     g2
200     g2
164     g2
266     g2
179     g2
272     g3
192     g3
190     g3
208     g3
231     g3
235     g3
141     g3
190     g4
255     g4
247     g4
278     g4
230     g4
269     g4
289     g4

The above data is stored as a text file with a txt extension.
We give it a name such as "cholesterol_data.txt".
It can also be stored in comma separated csv format.

This format has the advantage that it can handle data sets of different lengths.
If the groups were instead stored side by side, one column per group, many NA's might be required to make the columns equal in length.
The above format is better.
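As a rough sketch of how such a two-column file could be produced from groups of different lengths, the example below uses two hypothetical groups (the values, the group sizes and the file name "example_groups.txt" are assumptions made for illustration) and writes them to a temporary directory rather than the working directory.

```r
# Hypothetical groups of unequal length (illustrative values only)
groups <- list(
  g1 = c(220, 214, 203),
  g2 = c(262, 193, 225, 200, 164)
)

# Build the two-column (value, group) layout; no NA padding is needed
mydat <- data.frame(
  value = unlist(groups, use.names = FALSE),
  group = rep(names(groups), times = lengths(groups))
)
print(mydat)

# Write a tab-separated text file (hypothetical file name, temporary location)
out_file <- file.path(tempdir(), "example_groups.txt")
write.table(mydat, file = out_file, sep = "\t",
            row.names = FALSE, quote = FALSE)

# Read it back to check that the layout survives the round trip
back <- read.table(out_file, header = TRUE)
print(nrow(back))
```

A comma separated file could be written in the same way with write.csv() and read back with read.csv().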

Once this file "cholesterol_data.txt" is ready, we can write an R script
for performing the Kruskal Wallis test as follows:

## Perform nonparametric Kruskal Wallis test in R

## read data into a frame
mydat <- read.table("cholesterol_data.txt", header=TRUE)

## Function call to kruskal.test()
res <- kruskal.test(value~group, data = mydat)

## print the result
print(res)

Executing the above script in R prints the following output.

Kruskal-Wallis rank sum test

data:  value by group
Kruskal-Wallis chi-squared = 7.0933, df = 3, p-value = 0.06898
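The components of the result can also be read programmatically and compared with the chi-square critical value. The sketch below rebuilds the same data inline so that it runs without the file. The 5% significance level (alpha) and the variable names are assumptions made here for illustration.

```r
# Same data as in the example, built inline instead of read from a file
mydat <- data.frame(
  value = c(220, 214, 203, 184, 186, 200, 165,
            262, 193, 225, 200, 164, 266, 179,
            272, 192, 190, 208, 231, 235, 141,
            190, 255, 247, 278, 230, 269, 289),
  group = rep(c("g1", "g2", "g3", "g4"), each = 7)
)

res <- kruskal.test(value ~ group, data = mydat)

# Extract the pieces of the result object
H  <- unname(res$statistic)   # Kruskal-Wallis chi-squared
df <- unname(res$parameter)   # degrees of freedom (k - 1)
p  <- res$p.value

# Chi-square critical value at an assumed 5% significance level
alpha <- 0.05
crit  <- qchisq(1 - alpha, df = df)

# Decision rule: reject the null hypothesis if H exceeds the critical value
reject <- H > crit

cat("H =", round(H, 4), " df =", df,
    " critical value =", round(crit, 3), "\n")
cat("p-value =", signif(p, 4), " reject null hypothesis:", reject, "\n")
```

With these data H is about 7.09, which is below the critical value of about 7.81 for 3 degrees of freedom, and the p-value of 0.06898 is larger than 0.05. The null hypothesis of identical population distributions is therefore not rejected at the 5% level.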