Biostatistics with R

String operations

DNA and RNA sequence strings form an important component of the data from genomics experiments. Any language used for genomics data analysis should support a standard set of operations on strings. R has an excellent set of library functions that support string operations. We will learn them one by one.

String declarations

In R, a string can be declared within double quotes:
> str = "abcacabac"
> str1 = "qqqqqq"
> str2 = " ++++++"
> str1
[1] "qqqqqq"
For assignment, the operator <- (a left angle bracket followed by a hyphen) is also extensively used in R. Thus, the assignments above can also be written as:
> str <- "abcacabac"
> str1 <- "qqqqqq"
> str2 <- " ++++++"

The = operator assigns the value on its right to the variable on its left, and so does the <- operator. The reversed arrow -> assigns in the opposite direction, from the value on its left to the variable on its right. Thus the following two assignments for the string str are equivalent:

> str <- "abcacabac"
> "abcacabac" -> str
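
The three assignment forms can be compared side by side. The following is a minimal sketch; the variable names s1, s2 and s3 are illustrative and do not appear elsewhere in this section.

```r
# Three ways to bind the same string to a variable.
s1 = "abcacabac"       # '=' assigns right to left
s2 <- "abcacabac"      # '<-' assigns right to left
"abcacabac" -> s3      # '->' assigns left to right

# All three variables hold the same value.
identical(s1, s2)
identical(s2, s3)
```

Both calls to identical() should report TRUE, since only the direction of the assignment differs.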

To get the string length

The number of characters in a string (called the string length) is returned by the function nchar(). In the following commands, the number of characters in the string astr, as returned by nchar(), is stored in a variable called slen:

> astr = "ATGCGCTAGACAG"
> slen = nchar(astr)
> slen
[1] 13
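
nchar() also works on a vector of strings, returning one length per element. A small sketch, assuming a made-up vector named seqs that is used only for illustration:

```r
# Hypothetical example sequences, chosen only for illustration.
seqs <- c("ATGCGCTAGACAG", "ATG", "")

# nchar() is vectorized: it returns one length per element.
lens <- nchar(seqs)
lens

# length() counts the elements of the vector, not the characters.
n_seqs <- length(seqs)
n_seqs
```

Note the difference between the two functions: nchar() counts characters within each string, whereas length() counts how many strings the vector holds.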

To concatenate strings

We can concatenate (join) two or more strings using the paste() function.

The paste() function takes two or more strings. By default, it joins the strings with a single space between them:

> str1 = "ATGCTGAG"
> str2 = "XXXXX"
> ps = paste(str1,str2)
> ps
[1] "ATGCTGAG XXXXX"

The paste() function, in addition to the strings, can also take a parameter called "sep" that specifies the separator placed between the strings when they are concatenated.

For example, to concatenate the above mentioned two strings "str1" and "str2" with a separator "---" between them:

> scat <- paste(str1, str2, sep="---")
> scat
[1] "ATGCTGAG---XXXXX"

To concatenate the strings "str1" and "str2" without any gap between them, use an empty string as the separator:

> scat = paste(str1,str2,sep="")
> scat
[1] "ATGCTGAGXXXXX"

We can concatenate more than two strings with the paste() function, as demonstrated below:

> st1 = "AAAAA"
> st2 = "TTTT"
> st3 = "GGGG"
> combstr = paste(st1,st2,st3,sep="_")
> combstr
[1] "AAAAA_TTTT_GGGG"
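
Two related conveniences are worth knowing. The sketch below uses paste0(), which base R provides as shorthand for paste() with an empty separator, and the "collapse" argument of paste(), which joins the elements of a single character vector. The variable names joined, parts and collapsed are illustrative.

```r
st1 <- "AAAAA"
st2 <- "TTTT"
st3 <- "GGGG"

# paste0() behaves like paste(..., sep = "").
joined <- paste0(st1, st2, st3)
joined

# 'collapse' joins the elements of one character vector into one string.
parts <- c(st1, st2, st3)
collapsed <- paste(parts, collapse = "_")
collapsed
```

The distinction is that "sep" separates the arguments passed to paste(), while "collapse" separates the elements within the resulting vector.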

To get substrings

A substring can be extracted by calling the substr() function with the start and stop character positions of the substring in the main string. To extract the substring from position 4 to position 8 of the string "str":

> str = "Mitochondria and Golgi bodies"
> su = substr(str,4,8)
> su
[1] "ochon"

We can also replace a portion of a string with another substring:

> scat <- paste("abcacabac", "qqqqqq", sep="")
> substr(scat,4,8) <- "UUUUU"
> scat
[1] "abcUUUUUcqqqqqq"
In the above code lines, the given string "UUUUU" replaces the characters in positions 4 to 8 of "scat". The = operator can also be used instead of the <- operator in the replacement.
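
One detail of this replacement form is that it never changes the length of the string: if the replacement is shorter than the selected range, only that many characters are overwritten, and if it is longer, the extra characters are dropped. A minimal sketch with illustrative variable names x and y:

```r
# Replacement shorter than the selected range (positions 4 to 8):
# only the first two positions of the range are overwritten.
x <- "abcacabac"
substr(x, 4, 8) <- "UU"
x

# Replacement longer than the selected range (positions 4 to 5):
# the replacement is cut to fit the range.
y <- "abcacabac"
substr(y, 4, 5) <- "UUUUU"
y
```

In both cases the string keeps its original nine characters.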

If we want a substring from a given start position to the end of the original string, we can give an arbitrarily large integer for the end position:

> str3 = ""
> sublg <- substr(str3,4,100000000L)
> sublg
[1] ""

Instead of using a large integer to represent the end of the string, we can pass the result of the nchar() function to substr() as the end position:

> str3 = ""
> sublg <- substr(str3,4,nchar(str3))
> sublg
[1] ""
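
To see the effect on a non-empty string, here is a small sketch. The sequence dna is a made-up value used only for illustration. It also uses base R's substring() function, whose stop position has a default value and can therefore be omitted.

```r
# Hypothetical sequence used only for illustration.
dna <- "ATGCGCTAGACAG"

# From position 4 to the end, using nchar() for the stop position.
tail1 <- substr(dna, 4, nchar(dna))
tail1

# substring() has a default 'last' argument, so the stop can be left out.
tail2 <- substring(dna, 4)
tail2
```

Both calls should return the same ten-character tail of the sequence.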

To truncate (trim) a string

A string can be truncated to a given number of characters, counted from its beginning, with the strtrim() function. For example, we truncate (trim) the string str4 to its first 4 characters:

> str4 <- "AECH9939-ALM"
> strunk <- strtrim(str4, 4)
> strunk
[1] "AECH"
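
strtrim() can also be applied to several strings at once. In this sketch the vector ids holds made-up identifiers used only for illustration; strings that are already shorter than the requested width are returned unchanged.

```r
# Hypothetical identifiers used only for illustration.
ids <- c("AECH9939-ALM", "BXQ7", "ZK")

# Keep at most the first 4 characters of each element.
short <- strtrim(ids, 4)
short
```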

To split a string by particular character(s)

The function strsplit() is used to split a string by a given character. For example, the string "filename_doc" can be split by the character "_" into "filename" and "doc" as follows:

> st = "filename_doc"
> strsplit(st, "_")
[[1]]
[1] "filename" "doc"

The strsplit() function returns a list. The unlist() function flattens that list into a character vector, whose elements can then be accessed by index, as shown below. More on lists later:

> aa <- unlist(strsplit("fname.doc", "\\."))
> aa[1]
[1] "fname"
> aa[2]
[1] "doc"

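The split pattern is treated as a regular expression. To split a string by a 'special character' such as the dot (.), which has a special meaning in regular expressions, we have to place a double backslash before the character inside the quotes:

> ss = "filename.doc"
> strsplit(ss, "\\.")
[[1]]
[1] "filename" "doc"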
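
Two further uses of strsplit() are sketched below. The first uses the "fixed" argument, which tells strsplit() to treat the pattern as a literal string so that no backslashes are needed. The second splits on the empty string to obtain the individual characters, which is handy for walking through a sequence base by base. The names parts and bases are illustrative.

```r
ss <- "filename.doc"

# fixed = TRUE treats "." as a literal character, not a regular expression.
parts <- strsplit(ss, ".", fixed = TRUE)[[1]]
parts

# Splitting on the empty string yields the individual characters.
bases <- strsplit("ATGC", "")[[1]]
bases
```

The [[1]] at the end extracts the first (and here only) element of the list returned by strsplit().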

Letter case conversion

To convert lower case to upper case and vice versa, we use the functions toupper() and tolower():

> str = "THIS IS a sentence"
> toupper(str)
[1] "THIS IS A SENTENCE"
> tolower(str)
[1] "this is a sentence"
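
A common reason to convert case is to compare sequences that were typed with inconsistent capitalization. The sketch below assumes two made-up sequences, seq_a and seq_b, that differ only in letter case.

```r
# Hypothetical sequences that differ only in letter case.
seq_a <- "atgcGCTA"
seq_b <- "ATGCgcta"

# String comparison with == is case-sensitive.
same_raw <- seq_a == seq_b
same_raw

# Converting both to upper case first makes the comparison ignore case.
same_norm <- toupper(seq_a) == toupper(seq_b)
same_norm
```

The first comparison should be FALSE and the second TRUE.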