11  Data Frames

A data frames is a data structure for representing tabular data.

Every column of a data frame may have a different data type. For example, the following data frame would have one character column, two numeric columns, one boolean column and one character column, in order.

Name Height Weight Gym member? City
Cem 1.75 66 T Istanbul
Can 1.70 65 F Ankara
Hande 1.62 61 T Izmir

Just as lists are heterogeneous analogs of vectors, data frames are heterogenous analogs of matrices.

Internally, a data frame is a list of equal-length vectors. This means that each column must be of the same data type.

Creating data frames

Several vectors can be combined into a data frame using the data.frame() function.

heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)
city <- c("Istanbul","Ankara","Izmir")
people <- data.frame(Height=heights, 
                     Weight=weights, 
                     Member=c(Can=TRUE, Cem=FALSE, Hande=TRUE), 
                     City=city)
people
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir

Recycling applies to data frames as well. Suppose we add the "City" data and make it "Istanbul" for all:

data.frame(Height=heights, Weight=weights, City="Istanbul")
      Height Weight     City
Can     1.70     65 Istanbul
Cem     1.75     66 Istanbul
Hande   1.62     61 Istanbul

Here, the element "Istanbul" is repeated until it matches the length of other vectors.

The functions rownames() and colnames() can be used to change labels of rows and columns.

Create a data frame with some unspecified row and column labels:

tempdf <- data.frame(c(1.70, 1.75,1.62),c(65, 66, 61))
tempdf
  c.1.7..1.75..1.62. c.65..66..61.
1               1.70            65
2               1.75            66
3               1.62            61

We can explicitly set row and column names:

rownames(tempdf) <- c("Can","Cem","Hande")
colnames(tempdf) <- c("Height","Weight")
tempdf
      Height Weight
Can     1.70     65
Cem     1.75     66
Hande   1.62     61

Accessing columns of data frames

A data frame is a list of columns; so we can access a column using the list notation we’ve seen before.

people
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir
people[[1]]  # indexing with component number
[1] 1.70 1.75 1.62
people$Weight  # component name
[1] 65 66 61
people[["City"]]
[1] "Istanbul" "Ankara"   "Izmir"   

Accessing elements via matrix-like indexing

A data frame can be indexed as if it is a matrix, using the [row, col] notation.

people
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir
people[,1]  # column 1
[1] 1.70 1.75 1.62
people[2,1] # row 2, column 1
[1] 1.75
people["Cem","Height"]
[1] 1.75
people["Can",]
    Height Weight Member     City
Can    1.7     65   TRUE Istanbul

Selecting rows using indices

We can specify a vector of indices to select rows.

people
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir
people[c(1,3),]
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Hande   1.62     61   TRUE    Izmir
people[c("Can","Hande"),]
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Hande   1.62     61   TRUE    Izmir

A negative index, again, indicates an element that is to be omitted.

people[-2,-3]
      Height Weight     City
Can     1.70     65 Istanbul
Hande   1.62     61    Izmir

Selecting some columns

We can provide a list of column names or numeric indices to get a subframe.

All rows, only Member and City columns:

people[, c("Member","City")]
      Member     City
Can     TRUE Istanbul
Cem    FALSE   Ankara
Hande   TRUE    Izmir

Same, with numeric indices:

people[, 3:4]
      Member     City
Can     TRUE Istanbul
Cem    FALSE   Ankara
Hande   TRUE    Izmir

A subset of rows and a subset of columns:

people[c("Can","Cem"), 1:2]
    Height Weight
Can   1.70     65
Cem   1.75     66

Filtering data frames

The Boolean operators we use for filtering vectors are applicable to data frames as well.

people
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir

Filter for people who are at least 1.70, or in Izmir:

people$Height >= 1.70 | people$City == "Izmir"
[1] TRUE TRUE TRUE

Get the city of people who are at least 1.70:

people[ people$Height>= 1.70, "City"]
[1] "Istanbul" "Ankara"  

Show only members, all columns:

people[ people$Member, ]
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Hande   1.62     61   TRUE    Izmir

Show the height and city of members:

people[ people$Member, c("Height","City")]
      Height     City
Can     1.70 Istanbul
Hande   1.62    Izmir

Adding new rows

As with matrices, we can use rbind() to add a new row to an existing data frame. The new row is usually in the form of a list.

Before:

people
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir

After:

rbind(people, Lale=list(1.71, 64, FALSE, "Bursa"))
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir
Lale    1.71     64  FALSE    Bursa

Concatenate two data frames

rbind() can also be used to extend a dataframe with another.

Let’s generate a new data frame:

newpeople <- data.frame(
    Weight=c(64, 50),
    Member=c(F,T),
    City=c("Bursa","Istanbul"),
    Height=c(Lale=1.71, Ziya=1.45)
)
newpeople
     Weight Member     City Height
Lale     64  FALSE    Bursa   1.71
Ziya     50   TRUE Istanbul   1.45

Combine this new dataframe with the old one:

rbind(people, newpeople)
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir
Lale    1.71     64  FALSE    Bursa
Ziya    1.45     50   TRUE Istanbul

Adding new columns

Suppose we want to add a column for BMI, which we calculate using the existing columns. We can do this using cbind() as follows.

people_bmi <- cbind(people, people$Weight/people$Height^2)
people_bmi
      Height Weight Member     City people$Weight/people$Height^2
Can     1.70     65   TRUE Istanbul                      22.49135
Cem     1.75     66  FALSE   Ankara                      21.55102
Hande   1.62     61   TRUE    Izmir                      23.24341

Note that the name of the new column is automatically set. It’s ugly! We can change this using the names() or colnames() functions.

names(people_bmi)[5] <- "BMI"
people_bmi
      Height Weight Member     City      BMI
Can     1.70     65   TRUE Istanbul 22.49135
Cem     1.75     66  FALSE   Ankara 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341

A more direct way:

people2 <- people
people2$BMI <- people2$Weight/people2$Height^2
people2
      Height Weight Member     City      BMI
Can     1.70     65   TRUE Istanbul 22.49135
Cem     1.75     66  FALSE   Ankara 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341

We can create a new column as we please. For example, add a Boolean column for obesity value.

people2$obese <- people2$BMI>30
people2
      Height Weight Member     City      BMI obese
Can     1.70     65   TRUE Istanbul 22.49135 FALSE
Cem     1.75     66  FALSE   Ankara 21.55102 FALSE
Hande   1.62     61   TRUE    Izmir 23.24341 FALSE

We can remove a column by setting it to NULL.

people2$obese <- NULL
people2
      Height Weight Member     City      BMI
Can     1.70     65   TRUE Istanbul 22.49135
Cem     1.75     66  FALSE   Ankara 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341

Merging data frames

The merge(x,y) function is used to create a new data frame from existing frames x and y, by combining them along a common column.

df1 <- data.frame(Name=c("Can","Cem","Hande"), Phone=c(1234,4345,8492))
df2 <- data.frame(Age=c(25,27,26), Name=c("Cem","Hande","Can"))
df1
   Name Phone
1   Can  1234
2   Cem  4345
3 Hande  8492
df2
  Age  Name
1  25   Cem
2  27 Hande
3  26   Can
merge(df1,df2)
   Name Phone Age
1   Can  1234  26
2   Cem  4345  25
3 Hande  8492  27

The merge() function automatically detected the Name column that is common in both, and merged the data on it.

Even though the order of names are different in the two frames, merge() merged them correctly.

What if the columns we want to merge on have different names in the two dataframes? In that case we use the by.x and by.y arguments to merge().

df2 <- data.frame(Age=c(25,27,26), first_name=c("Cem","Hande","Can"))
df1
   Name Phone
1   Can  1234
2   Cem  4345
3 Hande  8492
df2
  Age first_name
1  25        Cem
2  27      Hande
3  26        Can
merge(df1, df2, by.x="Name", by.y="first_name")
   Name Phone Age
1   Can  1234  26
2   Cem  4345  25
3 Hande  8492  27

To merge by row names, specify"row.names" for both by.x and by.y.

Suppose we have a new data frame holding phone numbers, rows indexed by names.

phonebook <- data.frame(phone=c(Can=1234, Cem=4345, Lale=8492))
phonebook
     phone
Can   1234
Cem   4345
Lale  8492

And our old people dataframe:

people
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir

Note that phonebook does not contain Hande, and people does not contain Lale.

The merge operation takes names that are common in both dataframes:

merge(people, phonebook, by.x="row.names", by.y="row.names")
  Row.names Height Weight Member     City phone
1       Can   1.70     65   TRUE Istanbul  1234
2       Cem   1.75     66  FALSE   Ankara  4345

Inner and outer joins

In the previous example, the merge operation removed Hande and Lale, because they are missing in one or the other data frame. This is called an inner join operation.

In contrast, an outer join operation merges with all available data, leaving some entries NA.

The all=TRUE option of merge() performs an outer join:

merged_df <- merge(people, phonebook,
                   by.x="row.names", by.y="row.names", 
                   all=TRUE)
merged_df
  Row.names Height Weight Member     City phone
1       Can   1.70     65   TRUE Istanbul  1234
2       Cem   1.75     66  FALSE   Ankara  4345
3     Hande   1.62     61   TRUE    Izmir    NA
4      Lale     NA     NA     NA     <NA>  8492

Hande was not in the phonebook data, so the phone entry for her is NA. Similarly, Lale was absent in the people data, so all columns except phone are NA for her.

The merge has converted row names to a new column Row.names. To restore row names as before,assign them using rownames(), and remove the redundant "Row.names" column afterwards.

rownames(merged_df) <- merged_df$Row.names
merged_df
      Row.names Height Weight Member     City phone
Can         Can   1.70     65   TRUE Istanbul  1234
Cem         Cem   1.75     66  FALSE   Ankara  4345
Hande     Hande   1.62     61   TRUE    Izmir    NA
Lale       Lale     NA     NA     NA     <NA>  8492
merged_df$Row.names <- NULL
merged_df
      Height Weight Member     City phone
Can     1.70     65   TRUE Istanbul  1234
Cem     1.75     66  FALSE   Ankara  4345
Hande   1.62     61   TRUE    Izmir    NA
Lale      NA     NA     NA     <NA>  8492

Applications

Analyze the grades in a class

Create a dataframe holding the exam scores of a small class:

grades <- data.frame(
    student = c("Can","Cem","Hande","Lale","Ziya"),
    midterm1 = c(45, 74, 67, 52, 31),
    midterm2 = c(68, 83, 56, 22, 50),
    final = c(59, 91, 62, 49, 65))
grades
  student midterm1 midterm2 final
1     Can       45       68    59
2     Cem       74       83    91
3   Hande       67       56    62
4    Lale       52       22    49
5    Ziya       31       50    65

Get weighted average scores, assuming a weight of 30% for each midterm and 40% for the final.

grades$score <- grades$midterm1*0.3 + grades$midterm2*0.3 + grades$final*0.4
grades
  student midterm1 midterm2 final score
1     Can       45       68    59  57.5
2     Cem       74       83    91  83.5
3   Hande       67       56    62  61.7
4    Lale       52       22    49  41.8
5    Ziya       31       50    65  50.3

Get averages of columns. grades[,-1] drops the name column, then we take the mean along the second dimension (columns).

apply(grades[,-1],2,mean)
midterm1 midterm2    final    score 
   53.80    55.80    65.20    58.96 

The “simple apply” does that more directly:

sapply(grades[,-1],mean)
midterm1 midterm2    final    score 
   53.80    55.80    65.20    58.96 

ALternatively, “list apply” returns the same result as a list.

lapply(grades[,-1],mean)
$midterm1
[1] 53.8

$midterm2
[1] 55.8

$final
[1] 65.2

$score
[1] 58.96

Assign letter grades using scores:

lettergrade <- function(score){
    if (score > 80) "A" else if (score > 70) "B" else if (score>60) "C" else if (score>50) "D" else "F"
}
grades$letter <- sapply(grades$score, lettergrade)
grades
  student midterm1 midterm2 final score letter
1     Can       45       68    59  57.5      D
2     Cem       74       83    91  83.5      A
3   Hande       67       56    62  61.7      C
4    Lale       52       22    49  41.8      F
5    Ziya       31       50    65  50.3      D

Grading multiple-choice exams

Our students have taken a multiple-choice exam. All their answers, as well as the answer key, are recorded as vectors.

key <- c("A","B","C","D","A")
answers <- rbind(
    c("A", "B", "D", "A", "B"),
    c("A", "D", "C", "D", "A"),
    c("B", "B", "C", "D", "B"),
    c("A", "B", "C", "D", "D"),
    c("C", "C", "C", "D", "A")
)

We initialize a separate data frame with the student information:

exam <- data.frame(answers,
                   row.names = c("Can","Cem","Hande","Lale","Ziya"))
exam
      X1 X2 X3 X4 X5
Can    A  B  D  A  B
Cem    A  D  C  D  A
Hande  B  B  C  D  B
Lale   A  B  C  D  D
Ziya   C  C  C  D  A

Now we can process this data frame to get the number of correct answers for each student. For that, we can use the sum(x==y) operation, which gives us the number of equal elements.

key
[1] "A" "B" "C" "D" "A"
exam[1,]==key
      X1   X2    X3    X4    X5
Can TRUE TRUE FALSE FALSE FALSE
sum(exam[1,]==key)
[1] 2

To repeat this for each row, we create a function that returns the number of matching answers.

ncorrect <- function(x){
    sum(x==key)
}
ncorrect(exam[1,])
[1] 2

And we use apply() to apply it to every row.

apply(exam,1,ncorrect)
  Can   Cem Hande  Lale  Ziya 
    2     4     3     4     3 

We can store this result in a new column in the original dataframe itself.

exam$correct <- apply(exam,1,ncorrect)
exam
      X1 X2 X3 X4 X5 correct
Can    A  B  D  A  B       2
Cem    A  D  C  D  A       4
Hande  B  B  C  D  B       3
Lale   A  B  C  D  D       4
Ziya   C  C  C  D  A       3

Store database

Suppose you run a retail store and you keep a data base of your items, their unit price, and the value-added tax (VAT) rate for each item. For example:

items <- data.frame(
    row.names = c("Milk","Meat","Toothpaste","Pencil","Detergent"),
    vat = c(0.05, 0.04, 0.05, 0.06, 0.03),
    unitprice = c(10, 20, 5, 1, 4)
)
items
            vat unitprice
Milk       0.05        10
Meat       0.04        20
Toothpaste 0.05         5
Pencil     0.06         1
Detergent  0.03         4

You get some orders, which your automated system stores with an order ID:

orders <- data.frame(
    row.names = c("1234","5761","1832"), # order ID
    item = c("Milk","Meat","Toothpaste"),
    amount = c(3,1,2))
orders
           item amount
1234       Milk      3
5761       Meat      1
1832 Toothpaste      2

Our task is to add a new column to the orders data frame that holds the total payment for each order, including the VAT.

  item       amount vat  unitprice total
1 Meat       1      0.04 20        20.8 
2 Milk       3      0.05 10        31.5 
3 Toothpaste 2      0.05  5        10.5

Let’s merge orders and items with an inner join, assign the result in a new data frame.

orders2 <- merge(orders,items,by.x="item",by.y="row.names")
orders2
        item amount  vat unitprice
1       Meat      1 0.04        20
2       Milk      3 0.05        10
3 Toothpaste      2 0.05         5

Now that we have the unit price and the VAT information on the same data frame, we can calculate the total due and store it in a new column.

orders2$total <- (orders2$amount*orders2$unitprice)*(1+orders2$vat)
orders2
        item amount  vat unitprice total
1       Meat      1 0.04        20  20.8
2       Milk      3 0.05        10  31.5
3 Toothpaste      2 0.05         5  10.5