Kruskal Wallis Test
The method of one-way ANOVA tests whether more than two data sets in hand have come from Gaussian distributions with equal means, under the assumption that their variances are equal.
The Kruskal Wallis test is a non-parametric test of whether the samples originate from the same distribution.
Like other non-parametric tests, it does not assume that the data sets are drawn from a normal distribution. Instead, it ranks the data points and computes a test statistic from the ranks under the null hypothesis.
The Kruskal Wallis test can be used when the sample sizes are small and we are not sure whether the data are drawn from a normal distribution.
In the Kruskal Wallis test, the following assumptions are made on the data sets:
1. All the data points are random samples from their populations.
2. Within a sample, data points are independent of each other. The samples themselves are also independent of each other.
3. The measurement scale is at least ordinal, i.e., the data points can be ranked.
The null hypothesis for the Kruskal Wallis test states that the population distributions are identical. The alternate hypothesis is that at least one of the populations tends to exhibit higher values than at least one of the other populations.
The test procedure
Let there be k data sets, identified by indices i=1,2,3,....,k.
Let \(\small{n_i}\) be the number of data points in the data set i and
\(\small{n = n_1 + n_2 + n_3 + \cdots + n_k}\) be the total number of data points.
1. Pool all the data sets and rank the data points from the smallest to the largest value.
2. Now separate the data points back into their data sets. For each data set, sum the ranks of its data points
to get a rank sum. Thus each data set i has a rank sum \(\small{R_i}\). In case of ties, give each tied data point the mean of the tied ranks.
3. The test statistic is defined as,
\(\small{ H~=~\dfrac{12}{n(n+1)} \displaystyle\sum_{i=1}^{k}\dfrac{R_i^2}{n_i}~-~3(n+1) }\)
4. If each of the data sets has 5 or more data points, the above statistic approximately follows a chi-square distribution with k-1 degrees of freedom. When the sample sizes are smaller, a probability table is used for computing the critical values.
5. We reject the null hypothesis of identical population distributions if the test statistic H is greater than the \(\small{\chi^2}\) critical value with k-1 degrees of freedom at the chosen significance level.
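6. There is a correction for ties: H is divided by \(\small{1 - \dfrac{\sum_j (t_j^3 - t_j)}{n^3 - n}}\), where \(\small{t_j}\) is the number of data points sharing the j-th tied value. The effect of this correction is generally negligible.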
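The steps above can be sketched by hand in R. This is only a minimal illustration, not a replacement for the built-in function: it uses the four groups analysed in the R section of this page, the variable names are chosen here for illustration, and the tie correction shown (dividing H by 1 - sum(t^3 - t)/(n^3 - n), with t the size of each set of tied values) is the standard one, assumed to be what step 6 refers to.

```r
## Hand computation of the Kruskal Wallis statistic (illustrative sketch)
g1 <- c(220, 214, 203, 184, 186, 200, 165)
g2 <- c(262, 193, 225, 200, 164, 266, 179)
g3 <- c(272, 192, 190, 208, 231, 235, 141)
g4 <- c(190, 255, 247, 278, 230, 269, 289)

## Step 1: pool the data and rank it (rank() gives mean ranks to ties by default)
values <- c(g1, g2, g3, g4)
group  <- factor(rep(c("g1", "g2", "g3", "g4"),
                     times = c(length(g1), length(g2), length(g3), length(g4))))
n <- length(values)      # total number of data points
k <- nlevels(group)      # number of data sets
r <- rank(values)

## Step 2: rank sum R_i and size n_i of each data set
R   <- tapply(r, group, sum)
n_i <- tapply(r, group, length)

## Step 3: test statistic H
H <- 12 / (n * (n + 1)) * sum(R^2 / n_i) - 3 * (n + 1)

## Step 6: correction for ties
ties <- table(values)                         # how often each value occurs
tie_factor  <- 1 - sum(ties^3 - ties) / (n^3 - n)
H_corrected <- H / tie_factor

## Steps 4 and 5: compare with the chi-square distribution, k-1 degrees of freedom
critical <- qchisq(0.95, df = k - 1)          # 5% significance level (assumed)
p_value  <- pchisq(H_corrected, df = k - 1, lower.tail = FALSE)

print(R)
print(c(H = H, H_corrected = H_corrected, critical = critical, p_value = p_value))
```

For these data the rank sums are 69.5, 91.5, 96.5 and 148.5, the uncorrected H is about 7.089, and the tie-corrected value is about 7.093, which shows how small the correction is when only a few values are tied.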
The R script below performs the Kruskal Wallis test.
R-scripts
We consider the following 4 data sets to be compared by the Kruskal Wallis test:
Group-1 : 220 214 203 184 186 200 165
Group-2 : 262 193 225 200 164 266 179
Group-3 : 272 192 190 208 231 235 141
Group-4 : 190 255 247 278 230 269 289
We shorten the names to g1, g2, g3 and g4.
These 4 sets can be written as two-column data in which the first column holds the value and
the second column holds the group label of that value. See below:
(the two columns can be tab separated (*.txt file) or comma separated (*.csv file))
value group
220 g1
214 g1
203 g1
184 g1
186 g1
200 g1
165 g1
262 g2
193 g2
225 g2
200 g2
164 g2
266 g2
179 g2
272 g3
192 g3
190 g3
208 g3
231 g3
235 g3
141 g3
190 g4
255 g4
247 g4
278 g4
230 g4
269 g4
289 g4
The above data format is stored as a text file with a txt extension.
We give it a name such as "cholesterol_data.txt".
It can also be stored in comma separated csv format.
This two-column format has the advantage that we can handle data sets of different lengths.
If you instead store each group as a separate column of a data frame, many NA's may be required to make the columns equal in length.
The above format is better.
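If the groups are already in R as separate vectors, the two-column layout can be built with stack(). The sketch below is one possible way to do it, not the only one: the list name "groups" and the output path are chosen here for illustration, and the file is written to R's temporary directory so that nothing in the working directory is overwritten. The vectors in the list do not need to have the same length.

```r
## Build the two-column (value, group) layout from separate vectors
groups <- list(
  g1 = c(220, 214, 203, 184, 186, 200, 165),
  g2 = c(262, 193, 225, 200, 164, 266, 179),
  g3 = c(272, 192, 190, 208, 231, 235, 141),
  g4 = c(190, 255, 247, 278, 230, 269, 289)
)

## stack() returns a data frame with columns "values" and "ind"
stacked <- stack(groups)
long <- data.frame(value = stacked$values, group = stacked$ind)

## Write a tab separated text file with a header line
## (illustrative path; change it to wherever the file should live)
outfile <- file.path(tempdir(), "cholesterol_data.txt")
write.table(long, file = outfile, sep = "\t", row.names = FALSE, quote = FALSE)

## Read it back to confirm that the layout is what read.table() expects
check <- read.table(outfile, header = TRUE)
print(head(check))
```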
Once this file "cholesterol_data.txt" is ready, we can write an R script
for performing the Kruskal Wallis test as follows:
## Perform the nonparametric Kruskal Wallis test in R
## read data into a frame
mydat = read.table("cholesterol_data.txt", header=TRUE)
## Function call to kruskal.test()
res = kruskal.test(value~group, data = mydat)
## print the result
print(res)
Executing the above script in R prints the following output.
Kruskal-Wallis rank sum test
data: value by group
Kruskal-Wallis chi-squared = 7.0933, df = 3, p-value = 0.06898
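To turn this output into a decision, we can compare the p-value with a significance level, or equivalently compare the statistic with the chi-square critical value. The sketch below assumes a significance level of 0.05, which is a choice made here and is not fixed by the test. It rebuilds the data frame inline so that it runs without the text file, and the names alpha, critical and reject are illustrative.

```r
## Interpreting the result of kruskal.test() (illustrative sketch)
mydat <- data.frame(
  value = c(220, 214, 203, 184, 186, 200, 165,
            262, 193, 225, 200, 164, 266, 179,
            272, 192, 190, 208, 231, 235, 141,
            190, 255, 247, 278, 230, 269, 289),
  group = rep(c("g1", "g2", "g3", "g4"), each = 7)
)

res <- kruskal.test(value ~ group, data = mydat)

alpha    <- 0.05                         # assumed significance level
H        <- unname(res$statistic)        # Kruskal-Wallis chi-squared
df       <- unname(res$parameter)        # degrees of freedom, k - 1
critical <- qchisq(1 - alpha, df = df)   # chi-square critical value

## Both comparisons lead to the same decision
reject <- res$p.value < alpha
cat("H =", round(H, 4), " df =", df, " critical value =", round(critical, 3), "\n")
cat("p-value =", round(res$p.value, 5), " reject null hypothesis:", reject, "\n")
```

Here the p-value of 0.06898 is larger than 0.05, and the statistic 7.0933 is smaller than the critical value of about 7.815 for 3 degrees of freedom, so at the 5% level we do not reject the null hypothesis that the four groups come from identical distributions.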