A data frame is a data structure for representing tabular data.
Every column of a data frame may have a different data type. For example, the following data frame would have one character column, two numeric columns, one boolean column and one character column, in order.
Name   Height  Weight  Gym member?  City
Cem    1.75    66      T            Istanbul
Can    1.70    65      F            Ankara
Hande  1.62    61      T            Izmir
Just as lists are heterogeneous analogs of vectors, data frames are heterogeneous analogs of matrices.
Internally, a data frame is a list of equal-length vectors. This means that all columns must have the same length, while each column may have its own data type.
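The list-of-vectors structure can be checked directly. This is a minimal sketch: the name df is an assumption, and the values are copied from the table above.

```r
# Illustrative frame built from the table above (the name df is an assumption).
df <- data.frame(
  Name   = c("Cem", "Can", "Hande"),
  Height = c(1.75, 1.70, 1.62),
  Member = c(TRUE, FALSE, TRUE)
)

is.list(df)        # a data frame is a list under the hood
sapply(df, class)  # each column reports its own type
lengths(df)        # every column has the same length
```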
Creating data frames
Several vectors can be combined into a data frame using the data.frame() function.
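The code that built the frames is not shown here, so the following is a sketch under assumptions. The names people and newpeople come from the rbind() call further down, the values are copied from the printed output, and supplying the names through row.names is a guess at how they became row labels.

```r
# First frame: one vector per column, names used as row labels.
people <- data.frame(
  Height = c(1.70, 1.75, 1.62),
  Weight = c(65, 66, 61),
  Member = c(TRUE, FALSE, TRUE),
  City   = c("Istanbul", "Ankara", "Izmir"),
  row.names = c("Can", "Cem", "Hande")
)

# Second frame with the same columns in a different order.
newpeople <- data.frame(
  Weight = c(64, 50),
  Member = c(FALSE, TRUE),
  City   = c("Bursa", "Istanbul"),
  Height = c(1.71, 1.45),
  row.names = c("Lale", "Ziya")
)
newpeople
```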
     Weight Member     City Height
Lale     64  FALSE    Bursa   1.71
Ziya     50   TRUE Istanbul   1.45
Combine this new data frame with the old one using rbind(), which matches columns by name:
rbind(people, newpeople)
      Height Weight Member     City
Can     1.70     65   TRUE Istanbul
Cem     1.75     66  FALSE   Ankara
Hande   1.62     61   TRUE    Izmir
Lale    1.71     64  FALSE    Bursa
Ziya    1.45     50   TRUE Istanbul
Adding new columns
Suppose we want to add a column for BMI, which we calculate using the existing columns. We can do this using cbind() as follows.
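The cbind() call itself is not shown here. A minimal sketch of what it might look like follows; the people frame is rebuilt from the printed values, and the result name people_bmi is taken from the renaming step further down.

```r
# Rebuild the frame from the values printed earlier.
people <- data.frame(
  Height = c(1.70, 1.75, 1.62),
  Weight = c(65, 66, 61),
  Member = c(TRUE, FALSE, TRUE),
  City   = c("Istanbul", "Ankara", "Izmir"),
  row.names = c("Can", "Cem", "Hande")
)

# BMI = weight (kg) / height (m) squared, appended as a fifth column.
people_bmi <- cbind(people, people$Weight / people$Height^2)
people_bmi
```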
      Height Weight Member     City people$Weight/people$Height^2
Can     1.70     65   TRUE Istanbul                      22.49135
Cem     1.75     66  FALSE   Ankara                      21.55102
Hande   1.62     61   TRUE    Izmir                      23.24341
Note that the name of the new column is automatically set. It’s ugly! We can change this using the names() or colnames() functions.
names(people_bmi)[5] <- "BMI"
people_bmi
      Height Weight Member     City      BMI
Can     1.70     65   TRUE Istanbul 22.49135
Cem     1.75     66  FALSE   Ankara 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341
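colnames() can be used in the same way. The sketch below assumes the renamed copy is called people2 (the name used in the following steps) and starts again from the cbind() result; the frame is rebuilt from the printed values.

```r
# Rebuild the frame from the values printed earlier.
people <- data.frame(
  Height = c(1.70, 1.75, 1.62),
  Weight = c(65, 66, 61),
  Member = c(TRUE, FALSE, TRUE),
  City   = c("Istanbul", "Ankara", "Izmir"),
  row.names = c("Can", "Cem", "Hande")
)

# Append the BMI column, then rename it with colnames().
people2 <- cbind(people, people$Weight / people$Height^2)
colnames(people2)[5] <- "BMI"
people2
```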
      Height Weight Member     City      BMI
Can     1.70     65   TRUE Istanbul 22.49135
Cem     1.75     66  FALSE   Ankara 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341
We can create a new column as we please. For example, add a Boolean column that flags obesity (BMI above 30).
people2$obese <- people2$BMI > 30
people2
      Height Weight Member     City      BMI obese
Can     1.70     65   TRUE Istanbul 22.49135 FALSE
Cem     1.75     66  FALSE   Ankara 21.55102 FALSE
Hande   1.62     61   TRUE    Izmir 23.24341 FALSE
We can remove a column by setting it to NULL.
people2$obese <- NULL
people2
      Height Weight Member     City      BMI
Can     1.70     65   TRUE Istanbul 22.49135
Cem     1.75     66  FALSE   Ankara 21.55102
Hande   1.62     61   TRUE    Izmir 23.24341
Merging data frames
The merge(x,y) function is used to create a new data frame from existing frames x and y, by combining them along a common column.
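The merge call is not shown here. A sketch of one that would produce the output below: the frame name phonebook and the choice of by = "row.names" are assumptions, and the phone values are copied from the printed output.

```r
# Rebuild the frame from the values printed earlier.
people <- data.frame(
  Height = c(1.70, 1.75, 1.62),
  Weight = c(65, 66, 61),
  Member = c(TRUE, FALSE, TRUE),
  City   = c("Istanbul", "Ankara", "Izmir"),
  row.names = c("Can", "Cem", "Hande")
)

# Hypothetical phone book, keyed by the same kind of row labels.
phonebook <- data.frame(
  phone = c(1234, 4345, 8492),
  row.names = c("Can", "Cem", "Lale")
)

# by = "row.names" uses the row labels as the common column.
merge(people, phonebook, by = "row.names")
```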
  Row.names Height Weight Member     City phone
1       Can   1.70     65   TRUE Istanbul  1234
2       Cem   1.75     66  FALSE   Ankara  4345
Inner and outer joins
In the previous example, the merge operation removed Hande and Lale, because they are missing in one or the other data frame. This is called an inner join operation.
In contrast, an outer join operation merges with all available data, leaving some entries NA.
The all=TRUE option of merge() performs an outer join:
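A sketch of the outer join, under the same assumptions as before: the frames are rebuilt from the printed values, phonebook is a hypothetical name, and the result name merged_df is taken from the clean-up step further down.

```r
# Rebuild both frames from the values printed earlier.
people <- data.frame(
  Height = c(1.70, 1.75, 1.62),
  Weight = c(65, 66, 61),
  Member = c(TRUE, FALSE, TRUE),
  City   = c("Istanbul", "Ankara", "Izmir"),
  row.names = c("Can", "Cem", "Hande")
)
phonebook <- data.frame(
  phone = c(1234, 4345, 8492),
  row.names = c("Can", "Cem", "Lale")
)

# all = TRUE keeps rows found in either frame and fills the gaps with NA.
merged_df <- merge(people, phonebook, by = "row.names", all = TRUE)
merged_df
```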
  Row.names Height Weight Member     City phone
1       Can   1.70     65   TRUE Istanbul  1234
2       Cem   1.75     66  FALSE   Ankara  4345
3     Hande   1.62     61   TRUE    Izmir    NA
4      Lale     NA     NA     NA     <NA>  8492
Hande was not in the phonebook data, so the phone entry for her is NA. Similarly, Lale was absent in the people data, so all columns except phone are NA for her.
The merge has converted the row names to a new column, Row.names. To restore the row names as before, assign them using rownames(), and remove the redundant Row.names column afterwards.
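The assignment step could look like the sketch below. The merged frame is rebuilt here so the example runs on its own; phonebook is a hypothetical name and the values are copied from the printed output.

```r
# Rebuild the outer-join result from the values printed earlier.
people <- data.frame(
  Height = c(1.70, 1.75, 1.62),
  Weight = c(65, 66, 61),
  Member = c(TRUE, FALSE, TRUE),
  City   = c("Istanbul", "Ankara", "Izmir"),
  row.names = c("Can", "Cem", "Hande")
)
phonebook <- data.frame(
  phone = c(1234, 4345, 8492),
  row.names = c("Can", "Cem", "Lale")
)
merged_df <- merge(people, phonebook, by = "row.names", all = TRUE)

# Copy the Row.names column back into the row labels.
rownames(merged_df) <- as.character(merged_df$Row.names)
merged_df
```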
      Row.names Height Weight Member     City phone
Can         Can   1.70     65   TRUE Istanbul  1234
Cem         Cem   1.75     66  FALSE   Ankara  4345
Hande     Hande   1.62     61   TRUE    Izmir    NA
Lale       Lale     NA     NA     NA     <NA>  8492
merged_df$Row.names <- NULL
merged_df
      Height Weight Member     City phone
Can     1.70     65   TRUE Istanbul  1234
Cem     1.75     66  FALSE   Ankara  4345
Hande   1.62     61   TRUE    Izmir    NA
Lale      NA     NA     NA     <NA>  8492
Applications
Analyze the grades in a class
Create a data frame holding the exam scores of a small class:
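The code for this frame is not shown here, so the following is a sketch under assumptions. The name grades is hypothetical. The weights 0.3, 0.3 and 0.4 are inferred from the printed scores, which they reproduce. The letter cut-offs are not stated anywhere in the text; the ones used here are only one choice that is consistent with the printed letters.

```r
grades <- data.frame(
  student  = c("Can", "Cem", "Hande", "Lale", "Ziya"),
  midterm1 = c(45, 74, 67, 52, 31),
  midterm2 = c(68, 83, 56, 22, 50),
  final    = c(59, 91, 62, 49, 65)
)

# Weighted score: 30% + 30% + 40% (inferred from the printed values).
grades$score <- 0.3 * grades$midterm1 + 0.3 * grades$midterm2 +
  0.4 * grades$final

# Assumed cut-offs: A >= 80, B >= 70, C >= 60, D >= 50, otherwise F.
grades$letter <- cut(grades$score,
                     breaks = c(0, 50, 60, 70, 80, 100),
                     labels = c("F", "D", "C", "B", "A"),
                     right = FALSE, include.lowest = TRUE)
grades
```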
  student midterm1 midterm2 final score letter
1     Can       45       68    59  57.5      D
2     Cem       74       83    91  83.5      A
3   Hande       67       56    62  61.7      C
4    Lale       52       22    49  41.8      F
5    Ziya       31       50    65  50.3      D
Grading multiple-choice exams
Our students have taken a multiple-choice exam. All their answers, as well as the answer key, are recorded as vectors.
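One way to assemble these vectors into a data frame is sketched below. The names key and exam come from the calls further down; the answers are copied from the printed output, and stacking the vectors with rbind() before calling data.frame() is an assumption about how the frame was built.

```r
# Answer key, one entry per question.
key <- c("A", "B", "C", "D", "A")

# One answer vector per student, stacked into a character matrix.
answers <- rbind(
  Can   = c("A", "B", "D", "A", "B"),
  Cem   = c("A", "D", "C", "D", "A"),
  Hande = c("B", "B", "C", "D", "B"),
  Lale  = c("A", "B", "C", "D", "D"),
  Ziya  = c("C", "C", "C", "D", "A")
)

# data.frame() names the unnamed matrix columns X1..X5.
exam <- data.frame(answers)
exam
```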
      X1 X2 X3 X4 X5
Can    A  B  D  A  B
Cem    A  D  C  D  A
Hande  B  B  C  D  B
Lale   A  B  C  D  D
Ziya   C  C  C  D  A
Now we can process this data frame to get the number of correct answers for each student. For that, we can use the sum(x==y) operation, which gives us the number of equal elements.
key
[1] "A" "B" "C" "D" "A"
exam[1,]==key
      X1   X2    X3    X4    X5
Can TRUE TRUE FALSE FALSE FALSE
sum(exam[1,]==key)
[1] 2
To repeat this for each row, we create a function that returns the number of matching answers.
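The function definition is not shown here. A minimal sketch follows: the name ncorrect is taken from the apply() call below, and the frame is rebuilt from the printed answers so the example runs on its own.

```r
# Rebuild the key and the answer frame from the values printed earlier.
key <- c("A", "B", "C", "D", "A")
exam <- data.frame(rbind(
  Can   = c("A", "B", "D", "A", "B"),
  Cem   = c("A", "D", "C", "D", "A"),
  Hande = c("B", "B", "C", "D", "B"),
  Lale  = c("A", "B", "C", "D", "D"),
  Ziya  = c("C", "C", "C", "D", "A")
))

# Count the positions where one student's answers match the key.
ncorrect <- function(x) sum(x == key)

# apply() with margin 1 calls ncorrect once per row.
apply(exam, 1, ncorrect)
```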
We can store this result in a new column in the original data frame itself.
exam$correct <- apply(exam, 1, ncorrect)
exam
      X1 X2 X3 X4 X5 correct
Can    A  B  D  A  B       2
Cem    A  D  C  D  A       4
Hande  B  B  C  D  B       3
Lale   A  B  C  D  D       4
Ziya   C  C  C  D  A       3
Store database
Suppose you run a retail store and you keep a database of your items, their unit prices, and the value-added tax (VAT) rate for each item. For example: