A data frame is a data structure similar to a matrix, with the special feature that different columns can have different data types. A matrix has elements of a single type, usually numbers, arranged in rows and columns. A data frame is like a table in which each column is allowed to be of a different data type, such as strings, integers or floating point numbers. Data frames are very useful for combining vectors of the same length but different data types into a single data structure. In R, each column of a data frame is treated as a vector whose length is equal to the number of rows in the data frame. As with matrices, all the columns of a data frame must have the same number of rows.
A data frame is made up of individual vectors of the same length placed as columns. We can easily create a data frame from vectors using the data.frame() function:
> data1 <- c("Iron","Sulphur","Calcium", "Magnecium", "Copper")
> data2 <- c(12.5, 32.6, 16.7, 20.6, 7.5)
> data3 <- c(1122, 1123, 1124, 1125, 1126)
> frm1 <- data.frame(data1, data2, data3)
> frm1
      data1 data2 data3
1      Iron  12.5  1122
2   Sulphur  32.6  1123
3   Calcium  16.7  1124
4 Magnecium  20.6  1125
5    Copper   7.5  1126
In the above example, note that the column names of the data frame 'frm1' are just the names of the vector objects themselves. A sequence of indices 1, 2, 3, 4 and 5 has been added as row names by default. We can also give our own row names to the data frame.
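As a quick check of the mixed-type property described above, the following sketch rebuilds the frame and inspects the class of each column. The variable name col_types is introduced here only for illustration, and stringsAsFactors = FALSE is passed explicitly on the assumption that the code may run on an R version older than 4.0, where strings were converted to factors by default.

```r
# Rebuild the example frame from three vectors of equal length
data1 <- c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper")
data2 <- c(12.5, 32.6, 16.7, 20.6, 7.5)
data3 <- c(1122, 1123, 1124, 1125, 1126)
frm1 <- data.frame(data1, data2, data3, stringsAsFactors = FALSE)

# Each column keeps its own data type
col_types <- sapply(frm1, class)
print(col_types)

# dim() gives the number of rows followed by the number of columns
print(dim(frm1))
```

The first column should be reported as character and the other two as numeric, which a matrix could not hold side by side.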
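To get the column names of a data frame, call names():
> names(frm1)
The rownames() and colnames() functions return the row names and the column names, respectively, as vectors of strings:
> rname = rownames(frm1)
> rname
> cname = colnames(frm1)
> cname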
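The calls above can be tried on a small frame to see what they return. This is a minimal sketch on a reduced two-row frame built just for this illustration; it assumes nothing beyond base R.

```r
# A reduced two-row frame, used only for this illustration
frm <- data.frame(data1 = c("Iron", "Sulphur"),
                  data2 = c(12.5, 32.6),
                  stringsAsFactors = FALSE)

cname <- colnames(frm)   # column names as a character vector
rname <- rownames(frm)   # default row names, returned as strings

print(names(frm))        # for a data frame, same result as colnames()
print(cname)
print(rname)
```

Note that the default row names come back as the strings "1" and "2", not as numbers.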
The columns of a data frame can be named explicitly using a vector of strings. For the above frame "frm1", we can set the column names with our own vector of strings.
> colnames(frm1) <- c("Element", "Proportion", "Product_ID")
> frm1
    Element Proportion Product_ID
1      Iron       12.5       1122
2   Sulphur       32.6       1123
3   Calcium       16.7       1124
4 Magnecium       20.6       1125
5    Copper        7.5       1126
In the above example, we can use names(frm1) in place of colnames(frm1) with the same result.
Similarly, the row names can be initialized by a vector of strings:
> rownames(frm1) = c("elmt-1","elmt-2","elmt-3","elmt-4","elmt-5")
> frm1
         Element Proportion Product_ID
elmt-1      Iron       12.5       1122
elmt-2   Sulphur       32.6       1123
elmt-3   Calcium       16.7       1124
elmt-4 Magnecium       20.6       1125
elmt-5    Copper        7.5       1126
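Column and row names can also be supplied when the frame is created, rather than assigned afterwards. The sketch below is one way to do this, using named arguments for the columns and the row.names argument of data.frame(); the frame name frm and the three-row data are chosen only for this illustration.

```r
# Name the columns and rows at creation time
frm <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium"),
  Proportion = c(12.5, 32.6, 16.7),
  row.names  = c("elmt-1", "elmt-2", "elmt-3"),
  stringsAsFactors = FALSE
)
print(frm)

# A row name can be used as a subscript in place of a row index
value <- frm["elmt-2", "Proportion"]
print(value)
```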
The elements of a data frame are accessed using the same subscript convention as matrices.
Thus,
> frm1[1,3]
> frm1[1,]
> frm1[,2]
> frm1[1:3,]
         Element Proportion Product_ID
elmt-1      Iron       12.5       1122
elmt-2   Sulphur       32.6       1123
elmt-3   Calcium       16.7       1124
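The four subscript forms above return different kinds of objects, which the following sketch makes explicit. The variable names are chosen for this illustration, and the drop = FALSE form is an addition not used in the text above; it keeps a single column as a data frame instead of simplifying it to a vector.

```r
frm1 <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper"),
  Proportion = c(12.5, 32.6, 16.7, 20.6, 7.5),
  Product_ID = c(1122, 1123, 1124, 1125, 1126),
  stringsAsFactors = FALSE
)

x    <- frm1[1, 3]     # single element: first row, third column
r    <- frm1[1, ]      # first row: still a data frame with one row
p    <- frm1[, 2]      # second column: simplified to a plain vector
top3 <- frm1[1:3, ]    # first three rows: a data frame

# Keep a single column as a one-column data frame
p_df <- frm1[, 2, drop = FALSE]

print(x)
print(class(r))
print(class(p))
print(class(p_df))
```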
We can also access a column of a data frame by its name, by typing the frame name and the column name separated by a '$' sign. The accessed column is treated as a vector. For example, columns of the data frame 'frm1' can be accessed by their names as shown here:
> frm1$Element
> frm1$Proportion
> frm1$Product_ID
> frm1$Proportion * 1000
> frm1[,2] * 1000
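Since a column accessed this way is an ordinary vector, arithmetic on it is applied element by element. The sketch below checks that access by name and access by index give the same result; the double-bracket form frm1[["Proportion"]] is a further base R alternative not used in the text above.

```r
frm1 <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper"),
  Proportion = c(12.5, 32.6, 16.7, 20.6, 7.5),
  stringsAsFactors = FALSE
)

by_dollar <- frm1$Proportion * 1000   # access by column name
by_index  <- frm1[, 2] * 1000         # access by column index
by_name   <- frm1[["Proportion"]]     # name given as a string

print(by_dollar)
print(identical(by_dollar, by_index))
```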
A new column can be added to an existing data frame by creating a vector and assigning it as a new named column of the frame. This vector should have the same length as the number of rows of the existing frame. For example, a new column called "symbol" is added to the existing frame "frm1":
> frm1$symbol = c("Fe","S","Ca","Mg","Cu")
> frm1
         Element Proportion Product_ID symbol
elmt-1      Iron       12.5       1122     Fe
elmt-2   Sulphur       32.6       1123      S
elmt-3   Calcium       16.7       1124     Ca
elmt-4 Magnecium       20.6       1125     Mg
elmt-5    Copper        7.5       1126     Cu
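The length requirement can be seen in practice with the sketch below, which adds one valid column, one column derived from an existing column, and then attempts to add a vector of the wrong length. The column names Proportion_x1000 and bad, and the variable outcome, are hypothetical names used only here.

```r
frm1 <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper"),
  Proportion = c(12.5, 32.6, 16.7, 20.6, 7.5),
  stringsAsFactors = FALSE
)

# A new column from a vector with one entry per row
frm1$symbol <- c("Fe", "S", "Ca", "Mg", "Cu")

# A new column computed from an existing column
frm1$Proportion_x1000 <- frm1$Proportion * 1000

# A vector of length 2 does not fit 5 rows, so R raises an error
outcome <- tryCatch({
  frm1$bad <- c(1, 2)
  "accepted"
}, error = function(e) "rejected")

print(outcome)
print(names(frm1))
```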
A column can be removed from a data frame by accessing it by name and assigning the value NULL to it. In the following example, we access the column named "Product_ID" of the frame "frm1" and remove it:
> frm1
         Element Proportion Product_ID symbol
elmt-1      Iron       12.5       1122     Fe
elmt-2   Sulphur       32.6       1123      S
elmt-3   Calcium       16.7       1124     Ca
elmt-4 Magnecium       20.6       1125     Mg
elmt-5    Copper        7.5       1126     Cu
> frm1$Product_ID <- NULL
> frm1
         Element Proportion symbol
elmt-1      Iron       12.5     Fe
elmt-2   Sulphur       32.6      S
elmt-3   Calcium       16.7     Ca
elmt-4 Magnecium       20.6     Mg
elmt-5    Copper        7.5     Cu
A given column in a data frame can also be removed by using its column index with a negative sign:
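> frm1[,-3]
         Element Proportion
elmt-1      Iron       12.5
elmt-2   Sulphur       32.6
elmt-3   Calcium       16.7
elmt-4 Magnecium       20.6
elmt-5    Copper        7.5
Note that negative indexing returns a new data frame and leaves frm1 itself unchanged. To make the removal permanent, assign the result back with frm1 <- frm1[,-3].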
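The difference between the two removal methods is sketched below: assigning NULL changes the frame in place, while a negative index returns a copy. The variable names dropped and kept are chosen for this illustration, and the setdiff() form for removing a column by name is an extra base R idiom not shown in the text above.

```r
frm1 <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper"),
  Proportion = c(12.5, 32.6, 16.7, 20.6, 7.5),
  Product_ID = c(1122, 1123, 1124, 1125, 1126),
  symbol     = c("Fe", "S", "Ca", "Mg", "Cu"),
  stringsAsFactors = FALSE
)

# Assigning NULL removes the column from frm1 itself
frm1$Product_ID <- NULL

# A negative index returns a copy without the third column;
# frm1 is not modified by this expression
dropped <- frm1[, -3]

# Removing a column by name rather than by position
kept <- frm1[, setdiff(names(frm1), "symbol")]

print(names(frm1))
print(names(dropped))
```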
We have learnt to access a column of a data frame by giving the frame name and the column name separated by a '$' sign. When there is more than one data frame in memory with the same column name(s), this format distinguishes between them. Suppose we do not have this naming conflict. In that case it is more convenient to access a column by its name alone, dropping the frame name. We use the attach() function for this. The example below shows that the column name alone is not recognized until the frame has been attached:
> frm1
         Element Proportion symbol
elmt-1      Iron       12.5     Fe
elmt-2   Sulphur       32.6      S
elmt-3   Calcium       16.7     Ca
elmt-4 Magnecium       20.6     Mg
elmt-5    Copper        7.5     Cu
> symbol
> attach(frm1)
> symbol
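The attach step can be sketched end to end as follows. The calls to detach() and with() go beyond the text above: detach() undoes the attachment, and with() is a base R alternative that exposes the columns only for the duration of a single expression. The variable names sym_attached and total are used only in this illustration.

```r
frm1 <- data.frame(
  Element    = c("Iron", "Sulphur", "Calcium", "Magnecium", "Copper"),
  Proportion = c(12.5, 32.6, 16.7, 20.6, 7.5),
  symbol     = c("Fe", "S", "Ca", "Mg", "Cu"),
  stringsAsFactors = FALSE
)

# After attach(), columns can be referred to by name alone
attach(frm1)
sym_attached <- symbol
print(sym_attached)

# detach() removes the frame from the search path again
detach(frm1)

# with() evaluates one expression with the columns in scope
total <- with(frm1, sum(Proportion))
print(total)
```

One design point worth keeping in mind: attach() works on a copy of the frame, so later changes to frm1 are not reflected in the attached names until the frame is detached and attached again.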