Biostatistics with R

Non-linear Regression

While studying the theory of linear regression, we learnt that a straight line equation (model) of the form \(\small{y = b_0 + b_1x}\) can be fitted to a set of data points \(\small{(x_i, y_i)}\) by minimizing the sum of squared deviations between the observed \(\small{y_i}\) values and the expected values of y based on the model.

In this section we will learn to regress non-linear functions through data points. By non-linear we mean functions involving powers of x higher or lower than 1, as well as other functions such as exponentials and logarithms.

There are two important families of non-linear functions used in many branches of science.

1. Exponential function : Used in radioactivity, population models and in studying any growth or decay process. This is closely related to the logarithmic function.

2. Polynomial functions : Many non-linear relations can be approximated by a polynomial of suitable degree. Using a polynomial fit, we can pass complicated curves through data points.

In the discussion below, we will learn to do regression using the method of least squares fit for these two families of functions.

Least square fit to an exponential function

We have to fit an exponential function of the form \(\small{~~y = A e^{Bx} }\) to a set of n data points \(\small{(x_i, y_i)}\) to get the best fit coefficients A and B.

We need not struggle with a least squares fit to this function directly. Our life is made simple by the fact that a log transformation converts the above exponential relation into a linear one!

We have, \(~~\small{y = Ae^{Bx}~~}\), where A and B are constants.

Taking the natural logarithm of both sides,

\(\small{ log(y)~=~ log(A)~+~Bx }\)

Let \(\small{Y = log(y),~~~~b_0=log(A),~~~~and~~~b_1=B }\)

With this transformation, the exponential function \(\small{~~y = A e^{Bx} }\) is transformed to a linear function of the form,

\(\small{Y = b_0 + b_1 x }\)

We convert the \(\small{(x_i, y_i)}\) pairs of observations into \(\small{(x_i, Y_i = log(y_i))}\) pairs.

Then, using the normal linear regression procedure we studied before, perform a least squares fit to the straight line \(\small{Y = b_0 + b_1 x }\) to get the coefficients \(\small{b_0}\) and \(\small{b_1}\).

Get back the original coefficients using,

\(\small{A = e^{b_0}~~~}\) and \(~~~\small{B = b_1}\)

Plot the fitted curve \(\small{(x_i, Ae^{Bx_i})}\) as a line on the same plot as the data points \(\small{(x_i, y_i)}\).
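The whole procedure takes only a few lines in R. Below is a minimal sketch using hypothetical data that roughly follow \(\small{y = A e^{Bx}}\); the data values and variable names here are our own illustrative choices:

# hypothetical data roughly following y = A*exp(B*x)
x = c(1, 2, 3, 4, 5, 6)
y = c(5.4, 9.0, 15.1, 24.8, 40.9, 67.2)

# linear least squares fit to the log-transformed data: log(y) = b0 + b1*x
fit = lm(log(y) ~ x)
b0 = coef(fit)[[1]]
b1 = coef(fit)[[2]]

# recover the original coefficients: A = e^b0, B = b1
A = exp(b0)
B = b1

# plot the data points and the fitted exponential curve
plot(x, y, type = "p", col = "blue")
xs = seq(min(x), max(x), length.out = 100)
lines(xs, A * exp(B * xs), type = "l", col = "red")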

Least square fit to a polynomial function

In this section we describe the least squares fit to a polynomial function. The methodology is very similar to the one followed for the least squares fit to the straight line in linear regression.

The expected value of y is written as a polynomial of degree k:

\(\small{y_{exp} = b_0 + b_1x + b_2x^2 + b_3x^3 + .... + b_kx^k }\)

We write down the sum of squared differences between the observed and expected values over the data points (called the "residual sum of squares"):

$$ \chi^2~=\sum_{i=1}^{n} [ y_i~-~(b_0+b_1x_i+b_2x_i^2+....+b_kx_i^k) ]^2 $$
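To make this quantity concrete, here is a small sketch of an R function that evaluates \(\small{\chi^2}\) for a given coefficient vector; the data and coefficients in the example are hypothetical:

# chi-square (residual sum of squares) for polynomial coefficients b = c(b0, b1, ..., bk)
chisq = function(b, x, y) {
  k = length(b) - 1
  yexp = drop(outer(x, 0:k, "^") %*% b)   # b0 + b1*x_i + ... + bk*x_i^k at each x_i
  sum((y - yexp)^2)
}

# example with hypothetical data and a quadratic (k = 2)
chisq(c(0.5, 1.0, 2.0), x = c(1, 2, 3), y = c(3.4, 10.6, 21.3))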

We minimize \(\small{\chi^2}\) by taking its partial derivative with respect to each one of the coefficients \(\small{b_0, b_1, b_2, ...., b_k}\) and equating it to zero.

\(\small{ \dfrac{\partial \chi^2}{\partial b_0}~=~ -2\displaystyle\sum_{i=1}^{n}[ y_i~-~(b_0+b_1x_i+b_2x_i^2+....+b_kx_i^k) ]~=~0 }\)

\(\small{ \dfrac{\partial \chi^2}{\partial b_1}~=~ -2\displaystyle\sum_{i=1}^{n}[ y_i~-~(b_0+b_1x_i+b_2x_i^2+....+b_kx_i^k) ]\,x_i~=~0 }\)

\(\small{ \dfrac{\partial \chi^2}{\partial b_2}~=~ -2\displaystyle\sum_{i=1}^{n}[ y_i~-~(b_0+b_1x_i+b_2x_i^2+....+b_kx_i^k) ]\,x_i^2~=~0 }\)

\(\small{ \dfrac{\partial \chi^2}{\partial b_3}~=~ -2\displaystyle\sum_{i=1}^{n}[ y_i~-~(b_0+b_1x_i+b_2x_i^2+....+b_kx_i^k) ]\,x_i^3~=~0 }\)

\(\small{\vdots}\)

\(\small{ \dfrac{\partial \chi^2}{\partial b_k}~=~ -2\displaystyle\sum_{i=1}^{n}[ y_i~-~(b_0+b_1x_i+b_2x_i^2+....+b_kx_i^k) ]\,x_i^k~=~0 }\)

This leads to a set of simultaneous equations given by,

\(\small{b_0n~+~ b_1\displaystyle\sum_{i=1}^{n}x_i~+~b_2\displaystyle\sum_{i=1}^{n}x_i^2~+~....~+b_k\displaystyle\sum_{i=1}^{n}x_i^k~=~\displaystyle\sum_{i=1}^{n} y_i }\)

\(\small{b_0\displaystyle\sum_{i=1}^{n}x_i~+~ b_1\displaystyle\sum_{i=1}^{n}x_i^2~+~b_2\displaystyle\sum_{i=1}^{n}x_i^3~+~....~+b_k\displaystyle\sum_{i=1}^{n}x_i^{k+1}~=~\displaystyle\sum_{i=1}^{n} x_iy_i }\)

\(\small{\vdots}\)

\(\small{b_0\displaystyle\sum_{i=1}^{n}x_i^k~+~ b_1\displaystyle\sum_{i=1}^{n}x_i^{k+1}~+~b_2\displaystyle\sum_{i=1}^{n}x_i^{k+2}~+~....~+b_k\displaystyle\sum_{i=1}^{n}x_i^{2k}~=~\displaystyle\sum_{i=1}^{n} x_i^ky_i }\)

The above set of k+1 simultaneous equations in the variables \(\small{b_0, b_1, b_2, ..., b_k }\) can be solved by matrix methods to get the best-fit coefficients.
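To make the matrix method concrete, here is a minimal sketch in R that builds these normal equations for a quadratic fit (k = 2) and solves them with solve(). The system above is exactly \(\small{(X^TX)b = X^Ty}\), where X is the matrix whose columns are \(\small{x_i^0, x_i^1, ...., x_i^k}\); the data values below are hypothetical:

# hypothetical data
x = c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
y = c(1.2, 2.1, 3.9, 6.4, 9.8, 14.1)
k = 2                                  # degree of the polynomial

# design matrix with columns x^0, x^1, ..., x^k
X = outer(x, 0:k, "^")

# the k+1 normal equations above are exactly (X'X) b = X'y
b = solve(t(X) %*% X, t(X) %*% y)
print(b)                               # b0, b1, b2

# cross-check against the built-in lm() fit
print(coef(lm(y ~ x + I(x^2))))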

Once the coefficients are obtained, the observed data and the fitted curve can be plotted on the same graph, and the \(\small{r^2}\) value can be computed using methods similar to linear regression.
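Continuing the sketch above (X and b as computed there), the \(\small{r^2}\) value follows from the fitted values exactly as in linear regression:

# fitted values from the solved coefficients
yfit = drop(X %*% b)

# r^2 = 1 - (residual sum of squares) / (total sum of squares)
r2 = 1 - sum((y - yfit)^2) / sum((y - mean(y))^2)
print(r2)

# observed points and fitted curve on the same graph
plot(x, y, type = "p", col = "blue")
lines(x, yfit, type = "l", col = "red")

We now demonstrate how to do non-linear regression using R's built-in functions.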

R scripts


In the example R script below, we demonstrate the lm() and nls() functions of R, which we use to fit the data.

# polynomial fit using lm()
# ----------------------------------------------------------------------------
## define the two data sets
x = c(0.5, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0)
y = c(0.87, 7.12, 14.01, 24.37, 39.058, 58.62, 83.92)

## call lm() with a third degree polynomial of the form y = ax + bx^2 + cx^3
## (the "- 1" drops the intercept so the model matches this form)
result = lm(y ~ x + I(x^2) + I(x^3) - 1)

# accessing coefficient of x
print(result$coefficients[[1]])
# accessing coefficient of x^2
print(result$coefficients[[2]])
# accessing coefficient of x^3
print(result$coefficients[[3]])

# plot original points and the fitted curve
plot(x, y, type="p", col="blue")
yfitted = result$coefficients[[1]]*x + result$coefficients[[2]]*x^2 + result$coefficients[[3]]*x^3
lines(x, yfitted, type="l", col="red")

# non-linear curve fitting with the nls() function
# ----------------------------------------------------------------------------
# non-linear data is fitted with the nls() function, by the method of least squares.
# we fit a power model y = x^power.
xdat = c(1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2,
         2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0)
ydat = c(-0.09, 0.59, 0.42, 2.03, 3.43, 1.84, 5.30, 4.40, 6.67, 7.40, 7.34,
         8.76, 10.25, 10.93, 13.78, 14.21, 17.82, 21.83, 23.04, 24.25, 27.48)

df = data.frame(x = xdat, y = ydat)

# a starting value for the parameter is supplied through the start argument
dat = nls(y ~ I(x^power), data = df, start = list(power = 2))

print(class(dat))
print(summary(dat))
print(summary(dat)$coefficients)

power = round(summary(dat)$coefficients[1], 3)
power.se = round(summary(dat)$coefficients[2], 3)

# plot the data, the fitted power curve (blue) and the known curve y = x^3 (green)
s = seq(1, 3, 0.1)
plot(ydat ~ xdat, main = "Fitted power model", sub = "Blue: fit; green: known")
lines(s, s^power, lty = 1, col = "blue")
lines(s, s^3, lty = 2, col = "green")

# a transformed variable can also be passed directly in a formula:
# here log(x) is regressed on y
data = lm(formula = log(xdat) ~ ydat)
summary(data)