## Data manipulation with vectors

#### Creating subsets of vectors

We learnt that an operation performed on a vector is applied to each of its elements in turn, and that the result is a new vector.

Similarly, if a logical condition is applied to a vector x, it is applied to each element of x, resulting in a vector of TRUE or FALSE values, one for every element of x.

As an example, if x is a vector of numbers, then the statement x > 12 checks whether each element of x is greater than 12 and generates a vector of the corresponding TRUE or FALSE values. See here:


>  x = c(8,10,12,7,14,16,2,4,9,19,20,3,6)
>
>  x > 12


[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[13] FALSE


If the above vector of TRUE and FALSE values is placed inside the square brackets of vector x, the elements of x corresponding to the TRUE values are selected and returned as a new vector. Carefully note the following code and its output:


>  x = c(8,10,12,7,14,16,2,4,9,19,20,3,6)
>
>  x > 12


[1] FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
[13] FALSE


>  y = x[x>12]
>
>  y

[1] 14 16 19 20
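The same selection can be broken into separate steps by storing the logical vector first. This is a minimal sketch; the names `mask` and `n_selected` are illustrative and not part of the original example:

```r
x <- c(8, 10, 12, 7, 14, 16, 2, 4, 9, 19, 20, 3, 6)

# Store the logical vector produced by the condition
mask <- x > 12

# TRUE counts as 1 and FALSE as 0, so sum() gives the number of matches
n_selected <- sum(mask)

# Indexing with the mask keeps only the elements where mask is TRUE
y <- x[mask]

print(n_selected)
print(y)
```

Keeping the mask in its own variable is convenient when the same condition has to be reused, for instance to count the matches and to extract them.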

The logical statement inside the square brackets can be a compound condition. Thus, if we want to select the elements of vector x whose values are more than 10 and less than 20, we use


>  x[ (x>10) & (x<20) ]

[1] 12 14 16 19
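Conditions can also be combined with the element-wise OR operator `|` and negated with `!`. The sketch below reuses the same vector x; the names `outside` and `not_mid` are illustrative choices:

```r
x <- c(8, 10, 12, 7, 14, 16, 2, 4, 9, 19, 20, 3, 6)

# Elements below 7 OR above 16 (element-wise OR)
outside <- x[(x < 7) | (x > 16)]

# Elements that are NOT in the range 10 to 20 (both ends included)
not_mid <- x[!((x >= 10) & (x <= 20))]

print(outside)
print(not_mid)
```

Note that the single forms `&` and `|` work element by element and are the ones to use for filtering vectors. The doubled forms `&&` and `||` are meant for single TRUE/FALSE values, such as the condition of an if() statement.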

In another example, we create a vector of numbers with some missing values (i.e., NA). We will apply a filter to select the elements that are not NA and at the same time have values below 100, and write them into another vector. In a second operation, we will replace all the NA values in the original vector itself with 0.

The script below achieves this:



tarray <- c(2, 7, 29, 32, 41, 11, 15, NA, NA, 55, 32, NA, 42, 109)

karray <- tarray[ !is.na(tarray) & (tarray < 100) ]

tarray[is.na(tarray)] <- 0

print("Filter with NA's and numbers greater than 100 removed:")
print(karray)

print("Filter with NA's replaced by 0")
print(tarray)



When the above code lines are executed in an R script, the following output is created.


[1] "Filter with NA's and numbers greater than 100 removed:"
[1]  2  7 29 32 41 11 15 55 32 42
[1] "Filter with NA's replaced by 0"
[1]   2   7  29  32  41  11  15   0   0  55  32   0  42 109


In the above script, the statement tarray[ !is.na(tarray) & (tarray < 100) ] selects the elements of vector "tarray" that are not NA and at the same time less than 100. The statement tarray[is.na(tarray)] <- 0 assigns the value 0 to the elements of vector "tarray" that are missing values (NA). After this, all NA's in vector "tarray" are replaced by 0.
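The !is.na() test matters because comparing NA with a number yields NA rather than TRUE or FALSE, and indexing with NA returns NA. The following sketch, with the illustrative names `without_check` and `with_which`, shows what happens when the test is left out, and one alternative based on which():

```r
tarray <- c(2, 7, 29, 32, 41, 11, 15, NA, NA, 55, 32, NA, 42, 109)

# Without the is.na() test, the NA positions come through as NA
without_check <- tarray[tarray < 100]

# which() returns only the positions where the condition is TRUE,
# so NA results are dropped
with_which <- tarray[which(tarray < 100)]

print(without_check)
print(with_which)
```

Here `with_which` contains the same values as "karray" in the script above, while `without_check` still carries the three NA's.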

#### Creating subsets of data frames

A subset of a data frame can be created with the subset() function by applying conditions on one or more of its columns.

For example, suppose a data frame called "datframe" has many columns, one of which is named "npcol". Then the statement,


subdata <- subset(datframe, datframe$npcol > 30.0)

will select all the rows of datframe in which npcol is greater than 30 and create a new data frame called "subdata".

In the example code below, we create a data frame with (imaginary) experimental data. In this data set, there are 7 genes for which measurements are available from 7 experiments. We will use the subset() function to create subsets of this data by filtering on individual column values. The code below demonstrates this. The comments are self-explanatory.

# creating a vector of gene names

genes = c("gene-1","gene-2","gene-3","gene-4","gene-5","gene-6","gene-7")

# creating a vector of gender

gender = c("M", "M", "F", "M", "F", "F", "M")

# creating 7 data vectors with experimental results

result1 = c(12.3, 11.5, 13.6, 15.4, 9.4, 8.1, 10.0)

result2 = c(22.1, 25.7, 32.5, 42.5, 12.6, 15.5, 17.6)

result3 = c(15.5, 13.4, 11.5, 21.7, 14.5, 16.5, 12.1)

result4 = c(14.4, 16.6, 45.0, 11.0, 9.7, 10.0, 12.5)

result51 = c(12.2, 15.5, 17.4, 19.4, 10.2, 9.8, 9.0)

result52 = c(13.3, 14.5, 21.6, 17.9, 15.6, 14.4, 12.0)

result6 = c(11.0, 10.0, 12.2, 14.3, 23.3, 19.8, 13.4)

# creating a data frame with this data.
# genes along rows, results along columns

datframe = data.frame(genes,gender,result1,result2,result3,result4, result51,result52,result6)

# adding column names to data frame

names(datframe) = c("GeneName", "Gender", "expt1", "expt2", "expt3", "expt4", "expt51", "expt52", "expt6")

# creating subset of data with expt2 values above 20

subframe1 = subset(datframe, datframe$expt2 > 20)

# creating a subset of data with only Female gender
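The statement belonging to the last comment is not shown in the text. One possible way to write it is sketched below, using a reduced copy of the data frame (three columns only, with illustrative gene labels). The names `subframe2` and `subframe3` are hypothetical and do not come from the original script:

```r
# Reduced copy of the example data: gene names, gender and one result column
datframe <- data.frame(
  GeneName = c("gene-1", "gene-2", "gene-3", "gene-4", "gene-5", "gene-6", "gene-7"),
  Gender   = c("M", "M", "F", "M", "F", "F", "M"),
  expt2    = c(22.1, 25.7, 32.5, 42.5, 12.6, 15.5, 17.6)
)

# Rows where Gender is "F" (subframe2 is a hypothetical name)
subframe2 <- subset(datframe, Gender == "F")

# Two conditions combined, keeping only selected columns
subframe3 <- subset(datframe, Gender == "M" & expt2 > 20,
                    select = c(GeneName, expt2))

print(subframe2)
print(subframe3)
```

Inside subset(), column names can be written directly (Gender rather than datframe$Gender), and the select argument restricts the columns that are returned.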

#### Getting the index of a vector element

Given a vector, suppose we want to know the index of a particular element. That is, using the value of an element, we want to obtain its position in the vector (third, fourth, etc.). We can get this done with the help of a function called which(). See here:


>  x = c("aaa","bbb","ccc","ddd","eee","fff","ggg","hhh")
>
>  which(x=="ddd")

[1] 4

In the above vector x, the element "ddd" is in the fourth position. Therefore, the above call to which() returns the integer 4.

If a particular element is present in the vector more than once, the which() function returns a vector containing the indices of all the locations of that element in the input vector:


>  dat = c("ATG","TAG","ATG","TTA","TGC","ATT","ATG", "GGG")
>
>  d = which(dat=="ATG")
>
>  d

[1] 1 3 7
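which() accepts any logical condition, not only a test for equality. The sketch below applies it to a numeric vector and also uses the related base function which.max(); the variable names are illustrative:

```r
x <- c(8, 10, 12, 7, 14, 16, 2, 4, 9, 19, 20, 3, 6)

# Positions of all elements greater than 12
pos <- which(x > 12)

# The positions can be used to recover the values themselves
vals <- x[pos]

# Position of the largest element
pos_max <- which.max(x)

print(pos)
print(vals)
print(pos_max)
```

The positions returned by which() are ordinary integers, so they can be stored and used later to index x or another vector of the same length.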

#### Joining data frames

In R, data tables are generally loaded as data frames. We can join (bind) two data frames one below the other (vertical binding) or side by side (horizontal binding), provided their columns or rows match accordingly.

We will create three data frames called frame1, frame2 and frame3 to demonstrate the binding of data sets.



>  index = seq(1:8)
>
>  product = c("wheat","rice","millet","ragi","corn","pulses","meat","sugarCane")
>
>  quantity1 = c(118,179,24,39,32,59,72,84)
>
>  quantity2 = c(128,169,29,35,30,57,67,78)
>
>  sales = c(1200,1400,800,600,400,2900,3000,490 )
>
>  frame1 = data.frame(index = index, product=product, quantity=quantity1)
>
>  frame2 = data.frame(index=index, product=product, quantity=quantity2)
>
>  frame3 = data.frame(index=index, product=product, sales=sales)
>
>
>  frame1


index   product quantity
1     1     wheat       118
2     2      rice       179
3     3    millet        24
4     4      ragi        39
5     5      corn        32
6     6    pulses        59
7     7      meat        72
8     8 sugarCane        84


>  frame2


index   product quantity
1     1     wheat       128
2     2      rice       169
3     3    millet        29
4     4      ragi        35
5     5      corn        30
6     6    pulses        57
7     7      meat        67
8     8 sugarCane        78


>  frame3


index   product sales
1     1     wheat  1200
2     2      rice  1400
3     3    millet   800
4     4      ragi   600
5     5      corn   400
6     6    pulses  2900
7     7      meat  3000
8     8 sugarCane   490


To join two data frames vertically, one below the other (row binding), use the rbind() function. For this, the two data frames must have the same variables (i.e., column names), though these need not be present in the same order:


>  vbframe = rbind(frame1, frame2)
>
>  vbframe


index   product quantity
1      1     wheat      118
2      2      rice      179
3      3    millet       24
4      4      ragi       39
5      5      corn       32
6      6    pulses       59
7      7      meat       72
8      8 sugarCane       84
9      1     wheat      128
10     2      rice      169
11     3    millet       29
12     4      ragi       35
13     5      corn       30
14     6    pulses       57
15     7      meat       67
16     8 sugarCane       78
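The remark about column order can be checked with a small sketch. The frames `frame_a` and `frame_b` below are hypothetical and much shorter than frame1 and frame2:

```r
frame_a <- data.frame(index = 1:2,
                      product = c("wheat", "rice"),
                      quantity = c(118, 179))

# Same column names as frame_a, but listed in a different order
frame_b <- data.frame(quantity = c(24, 39),
                      index = 3:4,
                      product = c("millet", "ragi"))

# rbind() matches the columns of data frames by name, not by position
stacked <- rbind(frame_a, frame_b)

print(stacked)
```

The result takes its column order from the first data frame, and the values of frame_b are placed under the columns with matching names.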


To join two data frames horizontally (column binding), we use the cbind() function. For this, they must have the same number of rows; the variables (column names) can be the same or different:

>  hbframe = cbind(frame1, frame2)
>
>  hbframe


index   product quantity index   product quantity
1     1     wheat      118     1     wheat      128
2     2      rice      179     2      rice      169
3     3    millet       24     3    millet       29
4     4      ragi       39     4      ragi       35
5     5      corn       32     5      corn       30
6     6    pulses       59     6    pulses       57
7     7      meat       72     7      meat       67
8     8 sugarCane       84     8 sugarCane       78
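As the output shows, cbind() keeps every column of both frames, so repeated column names appear twice. When only one extra column is needed, it can be bound on its own under a new name. A minimal sketch with shortened versions of frame1 and frame2 follows; `quantity2` and `hb_small` are illustrative names:

```r
frame1 <- data.frame(index = 1:3,
                     product = c("wheat", "rice", "millet"),
                     quantity = c(118, 179, 24))
frame2 <- data.frame(index = 1:3,
                     product = c("wheat", "rice", "millet"),
                     quantity = c(128, 169, 29))

# Bind only the quantity column of frame2, under a new column name
hb_small <- cbind(frame1, quantity2 = frame2$quantity)

print(hb_small)
```

cbind() pairs the rows purely by position, so this is appropriate only when both frames list their rows in the same order.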


#### Merging data frames

To merge two data frames horizontally, use the merge() function. In most cases, two data frames are joined on one or more common key variables (i.e., an inner join).

We will now merge the frame1 and the frame3 created before. These two frames have two common variables namely "index" and "product".

First we merge them by the variable "product":


>  mrgA = merge(frame1, frame3, by="product")
>
>  mrgA


product index.x quantity index.y sales
1      corn       5       32       5   400
2      meat       7       72       7  3000
3    millet       3       24       3   800
4    pulses       6       59       6  2900
5      ragi       4       39       4   600
6      rice       2      179       2  1400
7 sugarCane       8       84       8   490
8     wheat       1      118       1  1200

Carefully note the following fact in the above merged frame "mrgA": both frames contain a variable called "index" that was not used as the merge key, so merge() keeps both copies and distinguishes them by renaming them to "index.x" (from frame1) and "index.y" (from frame3).
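The ".x" and ".y" endings are the defaults of the suffixes argument of merge() and can be replaced with more descriptive ones. A short sketch with reduced versions of frame1 and frame3 follows; the suffix strings ".qty" and ".sales" are arbitrary choices:

```r
frame1 <- data.frame(index = 1:3,
                     product = c("wheat", "rice", "millet"),
                     quantity = c(118, 179, 24))
frame3 <- data.frame(index = 1:3,
                     product = c("wheat", "rice", "millet"),
                     sales = c(1200, 1400, 800))

# Non-key columns present in both frames receive the given suffixes
mrg <- merge(frame1, frame3, by = "product", suffixes = c(".qty", ".sales"))

print(names(mrg))
```

Only the columns that occur in both frames and are not used as merge keys are renamed; "quantity" and "sales" keep their names.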

We can also merge by more than one variable. For example, we can merge the frame1 and frame3 by the two common variables "index" and "product" as shown below:


>  mrgB = merge(frame1, frame3, by=c("index","product"))
>
>  mrgB


index   product quantity sales
1     1     wheat      118  1200
2     2      rice      179  1400
3     3    millet       24   800
4     4      ragi       39   600
5     5      corn       32   400
6     6    pulses       59  2900
7     7      meat       72  3000
8     8 sugarCane       84   490


Different types of merging like Outer join, Left outer, Right outer and Cross join are demonstrated below with new data frames called df1 and df2 :


>  df1 = data.frame(experimentID = c(1:6), genes=c("g1","g1","g1","g2","g2","g2"))
>
>  df2 = data.frame(experimentID = c(1,3,5), tissues = c("heart","heart","liver"))
>
>  df1


experimentID genes
1            1    g1
2            2    g1
3            3    g1
4            4    g2
5            5    g2
6            6    g2


>
>  df2


experimentID tissues
1            1   heart
2            3   heart
3            5   liver


Outer join :

>  OJ = merge(x = df1, y = df2, by = "experimentID", all = TRUE)
>
>  OJ


experimentID genes tissues
1            1    g1   heart
2            2    g1    <NA>
3            3    g1   heart
4            4    g2    <NA>
5            5    g2   liver
6            6    g2    <NA>


Left Outer :

>  LO = merge(x = df1, y = df2, by = "experimentID", all.x = TRUE)
>
>  LO


experimentID genes tissues
1            1    g1   heart
2            2    g1    <NA>
3            3    g1   heart
4            4    g2    <NA>
5            5    g2   liver
6            6    g2    <NA>


Right Outer :

>  RO = merge(x = df1, y = df2, by = "experimentID", all.y = TRUE)
>
>  RO


experimentID genes tissues
1            1    g1   heart
2            3    g1   heart
3            5    g2   liver


Cross Join :

>  CJ = merge(x = df1, y = df2, by = NULL)
>
>  CJ


experimentID.x genes experimentID.y tissues
1               1    g1              1   heart
2               2    g1              1   heart
3               3    g1              1   heart
4               4    g2              1   heart
5               5    g2              1   heart
6               6    g2              1   heart
7               1    g1              3   heart
8               2    g1              3   heart
9               3    g1              3   heart
10              4    g2              3   heart
11              5    g2              3   heart
12              6    g2              3   heart
13              1    g1              5   liver
14              2    g1              5   liver
15              3    g1              5   liver
16              4    g2              5   liver
17              5    g2              5   liver
18              6    g2              5   liver
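For comparison, the default inner join on the same two data frames keeps only the experimentID values present in both. The sketch below also checks that the cross join has one row for every pairing of a df1 row with a df2 row; `IJ` is an illustrative name:

```r
df1 <- data.frame(experimentID = 1:6,
                  genes = c("g1", "g1", "g1", "g2", "g2", "g2"))
df2 <- data.frame(experimentID = c(1, 3, 5),
                  tissues = c("heart", "heart", "liver"))

# Inner join (the default): only keys found in both data frames
IJ <- merge(x = df1, y = df2, by = "experimentID")

# Cross join: every row of df1 paired with every row of df2
CJ <- merge(x = df1, y = df2, by = NULL)

print(IJ)
print(nrow(CJ) == nrow(df1) * nrow(df2))
```

With this data the inner join gives the same rows as the right outer join shown earlier, because every experimentID in df2 also occurs in df1.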