Biostatistics with R

Skewness and Kurtosis

While studying statistical parameters, we learnt that the mean and the variance are the central measures of the location and the spread of the data, respectively. Similarly, skewness and kurtosis are the two central measures of the shape of the distribution underlying the observed data.

Skewness

Skewness is a measure of the symmetry of the data, i.e., whether the data is symmetrically distributed on both sides of the mean value.
Let \(\small{\overline{x}}\) and \(\small{s}\) be the mean and standard deviation estimated from the n random data points \(\small{x_1,x_2,x_3,....,x_n}\) of a variable X. There are several ways in which the skewness parameter is defined.

The Fisher-Pearson expression \(\small{g_p}\) for the skewness is defined as,

\(\small{g_p~~=~~\dfrac{\dfrac{1}{n} \sum\limits_{i=1}^n (x_i - \overline{x})^3 }{s^3}~~~~~~~~~~~~~~~~~~~where~~~~~ \overline{x}= \dfrac{1}{n}\sum\limits_{i=1}^n x_i~~~~~and~~~~~s = \sqrt{\dfrac{\sum\limits_{i=1}^n(x_i-\overline{x})^2}{n} } }\)

NOTE : In the formula, s is computed with n in the denominator, rather than the usual (n-1)
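
As a minimal sketch, the formula above can be evaluated directly in base R. The function name skew_fp and the small data vector are made up for illustration. Since s must be computed with n in the denominator, the built-in sd() function (which divides by n-1) is deliberately not used.

```r
# Fisher-Pearson skewness g_p, written out term by term from the formula.
# skew_fp is a hypothetical helper name, not part of any package.
skew_fp <- function(x) {
  n    <- length(x)
  xbar <- mean(x)
  s    <- sqrt(sum((x - xbar)^2) / n)   # n in the denominator, not (n - 1)
  (sum((x - xbar)^3) / n) / s^3
}

# illustrative data with one large value on the right
x   <- c(2, 4, 4, 5, 10)
g_p <- skew_fp(x)
print(g_p)   # positive, because the long tail is on the right
```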






We recall that the expression \(\small{ \dfrac{1}{n}\sum\limits_{i=1}^n (x_i - \overline{x})^3 }\) in the numerator of \(\small{g_p}\) is the third central moment of the sample, i.e., the third moment about the mean.

There is also a version of the Fisher-Pearson expression that is adjusted for the sample size n. This adjusted Fisher-Pearson formula is given by,

\(\small{g_{padj}~~=~~\dfrac{\sqrt{n(n-1)}}{n-2}~\dfrac{\dfrac{1}{n} \sum\limits_{i=1}^n (x_i - \overline{x})^3 }{s^3} ~~~~~~~~~~where~~~~~ \overline{x}= \dfrac{1}{n}\sum\limits_{i=1}^n x_i~~~~~and~~~~~s = \sqrt{\dfrac{\sum\limits_{i=1}^n(x_i-\overline{x})^2}{n} } }\)

NOTE : In the formula, s is computed with n in the denominator, rather than the usual (n-1)
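
The adjusted formula differs from the unadjusted one only by the factor \(\small{\sqrt{n(n-1)}/(n-2)}\). The sketch below, in base R, evaluates this factor for a few sample sizes to show that it is larger than 1 and approaches 1 as n grows. The names adj_factor and skew_fp_adj and the chosen values of n are illustrative assumptions.

```r
# correction factor sqrt(n(n-1)) / (n-2) from the adjusted formula (needs n > 2)
adj_factor <- function(n) sqrt(n * (n - 1)) / (n - 2)

# adjusted skewness = correction factor * unadjusted Fisher-Pearson skewness
# (skew_fp_adj is a hypothetical helper name)
skew_fp_adj <- function(x) {
  n    <- length(x)
  xbar <- mean(x)
  s    <- sqrt(sum((x - xbar)^2) / n)   # n in the denominator
  g_p  <- (sum((x - xbar)^3) / n) / s^3
  adj_factor(n) * g_p
}

# the factor shrinks towards 1 as the sample size increases
n_values <- c(5, 16, 100, 1000)
factors  <- adj_factor(n_values)
print(round(factors, 4))

# adjusted skewness of a small illustrative sample
g_adj <- skew_fp_adj(c(2, 4, 4, 5, 10))
print(g_adj)
```

For small samples the adjustment changes the value noticeably, while for large samples the two formulas give nearly the same result.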









The skewness of a Gaussian distribution, or of any symmetric distribution, is zero. The skewness is negative when the distribution has a longer tail to the left of the mean, and positive when it has a longer tail to the right of the mean.
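
The sign convention can be checked with a small base R sketch. The three data vectors are made up for illustration, and skew_fp is a hypothetical helper that implements the unadjusted Fisher-Pearson formula with n in the denominators.

```r
# unadjusted Fisher-Pearson skewness: third central moment / s^3
skew_fp <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tailed <- c(1, 2, 2, 3, 3, 3, 12)   # one large value on the right
left_tailed  <- -right_tailed             # mirror image: tail on the left
symmetric    <- c(1, 2, 3, 4, 5)          # symmetric about its mean

sk_right <- skew_fp(right_tailed)
sk_left  <- skew_fp(left_tailed)
sk_sym   <- skew_fp(symmetric)

# expect: positive, negative (same magnitude), and zero respectively
print(c(right = sk_right, left = sk_left, symmetric = sk_sym))
```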


Kurtosis


Kurtosis is a measure of whether the data has heavier or lighter tails than a normal distribution. Data with heavy tails tend to contain more extreme outliers, while data with light tails tend to contain fewer and less extreme outliers.

For the same set of variables defined above, the formula for the Kurtosis K is given as,

\(\small{K~~=~~\dfrac{\dfrac{1}{n} \sum\limits_{i=1}^n (x_i - \overline{x})^4 }{s^4}~~~~~~~~~~~~~~~~~~~where~~~~~ \overline{x}= \dfrac{1}{n}\sum\limits_{i=1}^n x_i~~~~~and~~~~~s = \sqrt{\dfrac{\sum\limits_{i=1}^n(x_i-\overline{x})^2}{n} } }\)

NOTE : In the formula, s is computed with n in the denominator, rather than the usual (n-1)
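
A minimal base R sketch of this formula is given below. The name kurt_k and the data vector are assumptions made for illustration. As in the formula, s is computed with n in the denominator.

```r
# kurtosis K, written out term by term from the formula.
# kurt_k is a hypothetical helper name, not part of any package.
kurt_k <- function(x) {
  n    <- length(x)
  xbar <- mean(x)
  s    <- sqrt(sum((x - xbar)^2) / n)   # n in the denominator, not (n - 1)
  (sum((x - xbar)^4) / n) / s^4
}

# small illustrative sample
x <- c(2, 4, 4, 5, 10)
K <- kurt_k(x)
print(K)
```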






We recall that the expression \(\small{ \dfrac{1}{n}\sum\limits_{i=1}^n (x_i - \overline{x})^4 }\) in the numerator of \(\small{K}\) is the fourth central moment of the sample, i.e., the fourth moment about the mean.

The normal distribution has a kurtosis of 3. If a distribution has a kurtosis less than 3 (platykurtic), it tends to produce fewer and less extreme outliers than a normal distribution. If a distribution has a kurtosis greater than 3 (leptokurtic), it tends to produce more outliers than the normal distribution.
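
These three cases can be illustrated with simulated data. The sketch below draws large samples from a uniform distribution (theoretical kurtosis 1.8), a normal distribution (3) and a Student t distribution with 5 degrees of freedom (9), and computes the sample kurtosis of each. The choice of distributions, the seed, the sample size and the helper name kurt_k are all assumptions made for illustration, and the sample values only approximate the theoretical ones.

```r
# sample kurtosis with n in the denominators (hypothetical helper name)
kurt_k <- function(x) {
  m <- mean(x)
  mean((x - m)^4) / mean((x - m)^2)^2
}

set.seed(1)    # fixed seed so that the run is reproducible
n <- 100000    # large sample, so estimates are close to theory

k_unif <- kurt_k(runif(n))        # platykurtic: light tails
k_norm <- kurt_k(rnorm(n))        # reference value close to 3
k_t5   <- kurt_k(rt(n, df = 5))   # leptokurtic: heavy tails

print(round(c(uniform = k_unif, normal = k_norm, t5 = k_t5), 2))
```

The estimate for the t distribution fluctuates strongly from run to run, because the sample kurtosis is very sensitive to a few extreme values.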

R-scripts



The moments library of R has the functions skewness() and kurtosis() to compute these two quantities.

These two functions take a vector of data points as argument and return the quantity computed.

We can install the moments library from the R prompt with the following command:

 install.packages("moments")

The script below demonstrates how to use these two functions:

# include the moments library
library(moments)

## define the data vector
x = c(3.02, 3.97, 3.63, 5.65, 5.52, 6.33, 5.40, 4.41, 5.42, 5.70, 4.36, 4.42, 4.93, 5.60, 3.38, 3.75)

## compute skewness and kurtosis
sk = skewness(x)
ku = kurtosis(x)

print(paste("skewness = ", round(sk, digits=3), " kurtosis=", round(ku, digits=3)))

Running the above script in R prints the following output on the screen:
[1] "skewness =  -0.158  kurtosis= 1.826"
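
As a cross-check, the same two numbers can be reproduced in base R, without the moments library, by applying the formulas for \(\small{g_p}\) and \(\small{K}\) directly to the data vector. This is only a sketch; the variable names m2, m3 and m4 for the central moments are chosen here for illustration.

```r
# same data vector as in the script above
x <- c(3.02, 3.97, 3.63, 5.65, 5.52, 6.33, 5.40, 4.41,
       5.42, 5.70, 4.36, 4.42, 4.93, 5.60, 3.38, 3.75)

n  <- length(x)
d  <- x - mean(x)      # deviations from the mean
m2 <- sum(d^2) / n     # second central moment (s^2 with n in the denominator)
m3 <- sum(d^3) / n     # third central moment
m4 <- sum(d^4) / n     # fourth central moment

sk <- m3 / m2^1.5      # g_p = m3 / s^3
ku <- m4 / m2^2        # K   = m4 / s^4

print(round(c(skewness = sk, kurtosis = ku), 3))
```

The rounded values agree with the output of skewness() and kurtosis() shown above, which indicates that these functions use the unadjusted formulas with n in the denominator.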