Mathematical tools for natural sciences

The Kruskal-Wallis test is a non-parametric test of whether the samples originate from the same distribution.

Similar to other non-parametric tests, it does not assume that the data sets are drawn from a normal distribution. Instead, it ranks the data points and computes a test statistic from the ranks under the null hypothesis.

The Kruskal-Wallis test can be used when the sample sizes are small and we are not sure whether the data are drawn from a normal distribution.

In the Kruskal-Wallis test, the following assumptions are made on the data sets:

1. All the data points are randomly sampled from their populations.

2. Within a sample, data points are independent of each other. The samples themselves are also independent of each other.

3. The data measurement is at least ordinal, i.e., the data points can be ranked.

Let \(\small{n_i}\) be the number of data points in the data set i and

\(\small{n = n_1 + n_2 + n_3 + .....+n_k}\) be the total number of data points in the k data sets.

1. Mix all the data sets and rank the data points from the smallest to the largest value.

2. Now separate the data points back into the data sets. For each data set, sum the ranks of its data points to get a rank sum. Thus each data set i has a rank sum \(\small{R_i}\). In case of ties, give each tied data point the mean of the ranks involved.

3. The test statistic is defined as,

\(\small{ H~=~\dfrac{12}{n(n+1)} \displaystyle\sum_{i=1}^{k}\dfrac{R_i^2}{n_i}~-~3(n+1) }\)

4. If each of the data sets has 5 or more data points, the above statistic approximately follows a chi-square distribution with k-1 degrees of freedom, where k is the number of data sets. When the sample sizes are smaller, a probability table is used for computing the critical values.

5. We reject the null hypothesis of identical population distributions if the test statistic H is greater than the \(\small{\chi^2}\) critical value with k-1 degrees of freedom.

6. There is a correction for ties, which is generally negligible.
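The steps above can be sketched in a few lines of R. This is only an illustrative sketch: the helper name `kw_by_hand` and the three small data sets are made up for the example, and the tie correction used (dividing H by 1 - Σ(t³ - t)/(n³ - n), where t is the size of each group of tied values) is the standard one, which the text above does not spell out.

```r
# Kruskal-Wallis H computed step by step (illustrative sketch).
# kw_by_hand is a hypothetical helper name, not a base R function.
kw_by_hand <- function(values, groups) {
  groups <- factor(groups)
  n   <- length(values)               # total number of data points
  k   <- nlevels(groups)              # number of data sets
  r   <- rank(values)                 # step 1: pooled ranks, ties get mean ranks
  R   <- tapply(r, groups, sum)       # step 2: rank sum R_i of each data set
  n_i <- tapply(r, groups, length)    # size n_i of each data set

  # step 3: the test statistic H
  H <- 12 / (n * (n + 1)) * sum(R^2 / n_i) - 3 * (n + 1)

  # step 6: standard tie correction (equals 1 when there are no ties)
  t_j      <- as.vector(table(values))
  tie_corr <- 1 - sum(t_j^3 - t_j) / (n^3 - n)
  H_corr   <- H / tie_corr

  # steps 4-5: chi-square approximation with k - 1 degrees of freedom
  p <- pchisq(H_corr, df = k - 1, lower.tail = FALSE)
  list(H = H, H_corrected = H_corr, df = k - 1, p.value = p)
}

# Made-up example data: three data sets of five points each
values <- c(12, 15, 11, 18, 14,  22, 19, 24, 17, 21,  30, 27, 25, 29, 26)
groups <- rep(c("a", "b", "c"), each = 5)

res <- kw_by_hand(values, groups)
print(res)
```

The tie-corrected value should agree with the statistic reported by R's built-in `kruskal.test()`, which is a convenient way to check a hand computation.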

The R script below performs the Kruskal-Wallis test.

We consider the following 4 data sets to be compared by the Kruskal-Wallis test:

Group-1 : 220 214 203 184 186 200 165
Group-2 : 262 193 225 200 164 266 179
Group-3 : 272 192 190 208 231 235 141
Group-4 : 190 255 247 278 230 269 289

We shorten the names to g1, g2, g3 and g4. These 4 sets can be written as two-column data in which the first column is the value and the second column is its group label, as shown below. (The two columns can be tab separated in a *.txt file or comma separated in a *.csv file.)

value group
220 g1
214 g1
203 g1
184 g1
186 g1
200 g1
165 g1
262 g2
193 g2
225 g2
200 g2
164 g2
266 g2
179 g2
272 g3
192 g3
190 g3
208 g3
231 g3
235 g3
141 g3
190 g4
255 g4
247 g4
278 g4
230 g4
269 g4
289 g4

The above data format is stored as a text file with a txt extension, under a name such as "cholesterol_data.txt". It can also be stored in comma separated csv format. This two-column format has the advantage that it can handle data sets of different lengths. If the data sets were instead stored as separate columns of a data frame, many NA's might be required to make them equal in length, so the above format is better. Once the file "cholesterol_data.txt" is ready, we can write an R script for performing the Kruskal-Wallis test as follows:

## Perform nonparametric Kruskal-Wallis test in R

## read data into a frame
mydat = read.table("cholesterol_data.txt", header=TRUE)

## Function call to kruskal.test()
res = kruskal.test(value~group, data = mydat)

## print the result
print(res)

Executing the above script in R prints the following output.

	Kruskal-Wallis rank sum test

data:  value by group
Kruskal-Wallis chi-squared = 7.0933, df = 3, p-value = 0.06898
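To connect this output with the rejection rule, the sketch below enters the same four groups directly in R (so no input file is needed), reruns the test, and compares the statistic with the chi-square critical value. The significance level of 0.05 and the variable names `alpha`, `crit` and `H` are assumptions made for the illustration; the text above does not fix a level.

```r
# The four groups from the text, entered directly instead of read from a file
g1 <- c(220, 214, 203, 184, 186, 200, 165)
g2 <- c(262, 193, 225, 200, 164, 266, 179)
g3 <- c(272, 192, 190, 208, 231, 235, 141)
g4 <- c(190, 255, 247, 278, 230, 269, 289)

# Two-column (long) format: one row per value with its group label
mydat <- data.frame(
  value = c(g1, g2, g3, g4),
  group = factor(rep(c("g1", "g2", "g3", "g4"), each = 7))
)

res <- kruskal.test(value ~ group, data = mydat)

H     <- unname(res$statistic)        # test statistic reported by R (tie-corrected)
df    <- unname(res$parameter)        # degrees of freedom, k - 1 = 3
alpha <- 0.05                         # significance level (assumed for this example)
crit  <- qchisq(1 - alpha, df = df)   # chi-square critical value

cat("H =", round(H, 4), " critical value =", round(crit, 4), "\n")
if (H > crit) {
  cat("Reject the null hypothesis at level", alpha, "\n")
} else {
  cat("Do not reject the null hypothesis at level", alpha, "\n")
}
```

At the 0.05 level the critical value for 3 degrees of freedom is about 7.81. Since H = 7.0933 is below it (equivalently, the p-value 0.06898 is above 0.05), the null hypothesis of identical distributions is not rejected for these data. The rank sums of the four groups are 69.5, 91.5, 96.5 and 148.5, which give H of about 7.089 before the tie correction and 7.0933 after it.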