3 Data vectors in R
The fundamental data type in R is the vector, which is an ordered collection of elements of the same type (numbers, strings, etc.).
Single numbers such as 10 or sqrt(25) are actually handled as one-element vectors.
Why vectors?
Often, we store large amounts of related data and process them together. For example, we may have the height and weight values of 100 people, and we may want to calculate their body-mass index, or the mean height. Storing them as separate variables is cumbersome, and it gives us no way of processing them automatically:
height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2
height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2
# ... a lot of lines...
height100 <- 1.68
weight100 <- 70
bmi100 <- weight100 / height100^2
How can we get, e.g. the mean body-mass index of the group? Again, with difficulty:
total <- 0 
height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2
total <- total + bmi1
height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2
total <- total + bmi2
# ... a lot of lines...
height100 <- 1.68
weight100 <- 70
bmi100 <- weight100 / height100^2
total <- total + bmi100
mean_bmi <- total / 100
Imagine doing this for 1000, or one million people.
On the other hand, vectors have a sequential structure. If heights is a vector holding the height values, we can get the first element with heights[1], the second element with heights[2], and so on. Better yet, later we’ll see that we can use a loop to go over every element with a few lines of code, regardless of the length of the vector:
total <- 0
i <- 1
while (i <= 100) {
    total <- total + weights[i]/heights[i]^2
    i <- i + 1  # move on to the next person
}
mean_bmi <- total / 100
An even shorter (and better!) way is calling the mean() function:
mean_bmi <- mean(weights/heights^2)
R is designed for data processing, so vectors are the main object of fast calculations. The function mean() works faster than the loop equivalent, because it is internally optimized. Using an existing vectorized function is almost always more efficient than doing the same thing manually.
Internally, vectors are designed to occupy an unbroken section of memory. Every element occupies the same number of bytes. That way, vector operations can go over the elements quickly, without wasting time jumping around in memory.
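As a quick check of the claim that the loop and mean() compute the same thing, the following sketch builds two made-up vectors and compares the two approaches. The simulated data, the seed, and the names loop_mean and vec_mean are illustrative assumptions, not part of the original data set. Timings depend on the machine, so none are shown.

```r
# Hypothetical data: 100 simulated people (values are made up for illustration)
set.seed(1)
n <- 100
heights <- runif(n, min = 1.50, max = 1.95)   # metres
weights <- runif(n, min = 50, max = 100)      # kilograms

# Explicit loop over all elements
total <- 0
i <- 1
while (i <= n) {
    total <- total + weights[i] / heights[i]^2
    i <- i + 1
}
loop_mean <- total / n

# Vectorized equivalent
vec_mean <- mean(weights / heights^2)

# The two results agree up to floating-point rounding
isTRUE(all.equal(loop_mean, vec_mean))
```

The comparison uses all.equal() rather than == because the two computations may round slightly differently.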
Creating vectors
The most general way to create data vectors is to use the c() function (short for concatenate).
heights <- c(1.70, 1.75, 1.62)
weights <- c(65, 66, 61)
heights
[1] 1.70 1.75 1.62
weights
[1] 65 66 61
Vectors can also be created with the colon operator (:).
x <- 2:10 # assign integers from 2 to 10, inclusive.
x
[1]  2  3  4  5  6  7  8  9 10
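Two details of the colon operator are easy to miss: it counts downwards when the first number is larger, and it binds more tightly than arithmetic operators such as -. A minimal sketch (the variable n is only for illustration):

```r
10:7          # counts down: 10 9 8 7

n <- 5
1:n - 1       # parsed as (1:n) - 1, giving 0 1 2 3 4
1:(n - 1)     # parentheses give 1 2 3 4
```

When the end point is an expression, wrapping it in parentheses avoids the surprise.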
Extending vectors
The function c() can also be used to add new elements to vectors.
Suppose initially we have only two pieces of data:
heights <- c(1.70, 1.75)
heights
[1] 1.70 1.75
Then we get another data point, and we extend the vector.
c(heights, 1.62)
[1] 1.70 1.75 1.62
heights
[1] 1.70 1.75
heights <- c(heights, 1.62)
heights
[1] 1.70 1.75 1.62
heights <- c(1.62, heights)
heights
[1] 1.62 1.70 1.75 1.62
c(heights, heights)
[1] 1.62 1.70 1.75 1.62 1.62 1.70 1.75 1.62
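Because every call to c() builds a new vector, extending a vector one element at a time inside a long loop involves repeated copying. When the final length is known in advance, one common alternative is to create the vector at full size and fill it in. A small sketch, with illustrative names n and squares:

```r
n <- 5
squares <- numeric(n)        # a numeric vector of n zeros
i <- 1
while (i <= n) {
    squares[i] <- i^2        # fill in place instead of extending with c()
    i <- i + 1
}
squares                      # 1 4 9 16 25
```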
Modes
R variable types are called modes. Modes include: “numeric”, “character”, “logical”, “complex”, and so on.
mode(c(1,2))
[1] "numeric"
mode(c("abc","xyz"))
[1] "character"
mode(c(TRUE,FALSE))
[1] "logical"
mode(2+4i)
[1] "complex"
All elements in a vector must be of the same mode. This is required for efficient calculation over vectors. Other data structures, such as the list, can be used to store mixed-type data.
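When values of different modes are combined with c(), R does not raise an error; it silently converts them to a common mode. The sketch below illustrates this, and shows a list keeping the original types. The names mixed and items are only for illustration.

```r
mixed <- c(1, "abc", TRUE)
mixed              # "1" "abc" "TRUE"
mode(mixed)        # "character"

mode(c(1, TRUE))   # "numeric": TRUE becomes 1

# A list keeps each element's own type
items <- list(1, "abc", TRUE)
sapply(items, mode)   # "numeric" "character" "logical"
```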
Vector arithmetic
If you add two vectors, both with the same number of elements, corresponding elements are added:
c(1,4,9) + c(2,16,5)
[1]  3 20 14
The same rule applies to all basic operations:
c(1,4,9) * c(2,16,5)
[1]  2 64 45
c(1,4,9) / c(2,16,5)
[1] 0.50 0.25 1.80
Logical comparisons such as < or > also follow this rule.
3 > 2
[1] TRUE
c(1,4,9) > c(2,16,5)
[1] FALSE FALSE  TRUE
If we add a vector and a single number, the single number is recycled until its length matches the other vector.
c(1,4,9) + c(5)  # converted to: c(1,4,9) + c(5,5,5)
[1]  6  9 14
The same goes for other operations:
c(1,4,9) < 5  # converted to: c(1,4,9) < c(5,5,5)
[1]  TRUE  TRUE FALSE
c(1,4,9)^2   # converted to: c(1,4,9) ^ c(2,2,2)
[1]  1 16 81
If the operation is done with two vectors of different lengths, you might get a warning:
c(1,4,9) + c(2,3)
Warning in c(1, 4, 9) + c(2, 3): longer object length is not a multiple of
shorter object length
[1]  3  7 11
However, the following works without a warning:
c(1,4,9,10) + c(2,3) # converted to c(1,4,9,10) + c(2,3,2,3)
[1]  3  7 11 13
R recycles the shorter vector until its length matches that of the longer vector. With vectors of length 3 and 2 the shorter vector is still recycled, as c(2,3,2), but its last repetition is cut short because 3 is not a multiple of 2, so R issues a warning.
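To see what recycling does, one can spell it out with rep_len(), a base R function that repeats a vector up to a given length. This is only an illustration of the rule, not a description of how R implements it internally; the names long, short and recycled are hypothetical.

```r
long  <- c(1, 4, 9, 10)
short <- c(2, 3)

recycled <- rep_len(short, length(long))   # 2 3 2 3
identical(long + short, long + recycled)   # TRUE

# With lengths 3 and 2 the last repetition is cut short
rep_len(short, 3)                          # 2 3 2
```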
Vectorized functions
These are built-in R functions that operate on vectors as a whole, optimized to process them fast.
sum()
Adds up all elements in a vector and returns the total.
sum(c(1,4,9))
[1] 14
sum(1:1000)
[1] 500500
cumsum()
Adds up the elements step by step and returns a vector of the cumulative totals up to each element.
cumsum(1:10)
 [1]  1  3  6 10 15 21 28 36 45 55
prod(), cumprod()
Similar to sum() and cumsum(), but for products of elements.
prod(c(1,4,9))
[1] 36
prod(1:5)  # 5!
[1] 120
cumprod(1:5)
[1]   1   2   6  24 120
Mathematical functions
The familiar mathematical functions are all designed to apply to vectors elementwise.
The square root function:
sqrt(c(4,9,16))
[1] 2 3 4
The constant pi holds the value of \(\pi\):
x <- c(0, pi/4, pi/2, 3*pi/4, pi)
# alternatively: x <- 0:4 * pi/4
sin(x)
[1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224647e-16
The exponential function \(\mathrm{e}^x\):
exp(1:5)
[1]   2.718282   7.389056  20.085537  54.598150 148.413159
The natural logarithm function (the inverse of exp):
log(exp(1:5))
[1] 1 2 3 4 5
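Note that the last value of sin(x) above is printed as 1.224647e-16 rather than 0, because pi is stored with finite precision. For that reason, results of floating-point calculations are usually compared with a tolerance instead of ==. A minimal sketch (the tolerance 1e-12 is an arbitrary illustrative choice):

```r
sin(pi) == 0                      # FALSE: tiny rounding error
abs(sin(pi)) < 1e-12              # TRUE: zero within a tolerance
isTRUE(all.equal(sin(pi), 0))     # TRUE: all.equal() compares with a tolerance
```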
Representing missing data
- In many data sets, we often have some missing data, i.e., observations for which the values are missing.
- In R, missing values are denoted with NA.
- Any vector can contain missing values.
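Missing values propagate through most calculations: a sum or mean that involves NA is itself NA. Many base R functions accept an na.rm argument to skip missing values, and is.na() locates them. A small sketch with an illustrative vector w:

```r
w <- c(65, NA, 61)      # illustrative weights with one missing value

mean(w)                 # NA
mean(w, na.rm = TRUE)   # 63
is.na(w)                # FALSE TRUE FALSE
sum(is.na(w))           # 1 missing value
```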
 
weights <- c(65, NA, 61)
names <- c("Can","Cem",NA)
Vector element names
For readability, we can assign name labels to the elements of a data vector.
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
weights <- c(Can=65, Cem=66, Hande=61)
weights
  Can   Cem Hande 
   65    66    61 
We can retrieve these names with the names() function.
names(heights)
[1] "Can"   "Cem"   "Hande"
We can assign names to the elements of a vector that already exists.
heights <- c(1.70, 1.75, 1.62)
names(heights) <- c("Can","Cem","Hande")
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
If for some reason we want to remove the names, we use the unname() function.
unname(heights)
[1] 1.70 1.75 1.62
The original vector is not changed by this function call, because we did not assign the result to heights.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
Vector indexing
We can access a single element of a vector by providing the index of the element in square brackets.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights[1]  # first element
Can 
1.7 
heights[3] # third element
Hande 
 1.62 
We can select a slice of the vector by providing a range inside brackets.
heights[1:2]  # select from element 1 to element 2, inclusive.
 Can  Cem 
1.70 1.75 
We can also give a vector consisting of element indices.
heights[c(1,3)]  # select elements 1 and 3.
  Can Hande 
 1.70  1.62 
The indices do not have to be in order:
heights[c(2,1,3)]
  Cem   Can Hande 
 1.75  1.70  1.62 
We can select the same element more than once.
heights[c(1,1,3,2,3)]
  Can   Can Hande   Cem Hande 
 1.70  1.70  1.62  1.75  1.62 
We can provide a Boolean (true/false) vector for indexing. This will select only elements with corresponding TRUE values.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights[c(T,F,F)]  # T is a shorthand for TRUE, F is for FALSE.
Can 
1.7 
We can exclude elements using negative indices.
heights[-1]  # exclude first element.
  Cem Hande 
 1.75  1.62 
heights[c(-1,-3)]  # exclude 1st and 3rd elements
 Cem 
1.75 
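A few edge cases of indexing are worth knowing. The sketch below uses base R only and shows behaviour that can be surprising: an index past the end returns NA, a logical index that is too short is recycled, and positive and negative indices cannot be mixed. The variable result and the string "error" are illustrative.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

heights[5]                 # NA: there is no fifth element
heights[c(TRUE, FALSE)]    # recycled to TRUE FALSE TRUE: Can and Hande

# Mixing positive and negative indices is an error; catch it to show this
result <- tryCatch(heights[c(-1, 2)], error = function(e) "error")
result                     # "error"
```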
Using names to select elements
If the elements are given names consisting of strings, we can use these names in brackets instead of indices.
heights["Can"]
Can 
1.7 
heights[c("Can","Can","Hande","Cem","Hande")]
  Can   Can Hande   Cem Hande 
 1.70  1.70  1.62  1.75  1.62 
Modify element values in a vector
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights[1] <- 1.72
heights
  Can   Cem Hande 
 1.72  1.75  1.62 
heights[1] <- 1.70
Insert values into an existing vector
A vector’s size is determined at its creation, and its elements are stored contiguously (side by side) in memory. Therefore, an element cannot be inserted into or removed from a vector in place. However, we can build a new vector that contains the extra element and reassign it to the same name.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights <- c(heights[1:2], Lale=1.76, heights[3])
heights
  Can   Cem  Lale Hande 
 1.70  1.75  1.76  1.62 
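Base R also provides append(), which inserts values after a given position and returns the new vector. The sketch below is an alternative way to write the same step, not a replacement for the approach shown above:

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

# Insert Lale after the second element; the result is a new vector
heights <- append(heights, c(Lale = 1.76), after = 2)
heights    # Can 1.70, Cem 1.75, Lale 1.76, Hande 1.62
```

As with c(), the result has to be assigned back to the name for the change to persist.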
Delete elements from vector
Again, we cannot directly remove an element from an existing vector, but we can create a new vector without the element we want to delete, and reassign it to the same name.
heights
  Can   Cem  Lale Hande 
 1.70  1.75  1.76  1.62 
heights <- heights[-3]  # exclude element 3
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
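Negative indices work with positions only, not with names. To drop an element by name, one option is to compare against names() and keep the elements that do not match. A sketch:

```r
heights <- c(Can = 1.70, Cem = 1.75, Lale = 1.76, Hande = 1.62)

# Keep every element whose name is not "Lale"
heights <- heights[names(heights) != "Lale"]
heights    # Can 1.70, Cem 1.75, Hande 1.62
```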
Getting the length of a vector
We can get the number of elements in a vector using the length() function.
length(heights)
[1] 3
length(10:17)
[1] 8
Vector filtering
- Apply a Boolean function (e.g., greater than, less than, …) to each element of the vector.
- This returns a Boolean vector holding the result for each element.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights > 1.65
  Can   Cem Hande 
 TRUE  TRUE FALSE 
Using this Boolean vector, we can select data points satisfying the condition.
tall_people <- heights>1.65
tall_people
  Can   Cem Hande 
 TRUE  TRUE FALSE 
heights[tall_people]
 Can  Cem 
1.70 1.75 
This can be done in a single line, too.
heights[heights>1.65]
 Can  Cem 
1.70 1.75 
One can also filter a vector according to another vector’s values.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
weights
  Can   Cem Hande 
   65    66    61 
weights[ heights > 1.65 ]  # weights of people who are taller than 1.65
Can Cem 
 65  66 
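Conditions can be combined with the elementwise operators & (and), | (or) and ! (not). A short sketch using the same data:

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
weights <- c(Can = 65, Cem = 66, Hande = 61)

# Taller than 1.65 AND lighter than 66 kg
heights[heights > 1.65 & weights < 66]     # Can

# Shorter than 1.65 OR heavier than 65 kg
weights[heights < 1.65 | weights > 65]     # Cem, Hande

# NOT taller than 1.65
heights[!(heights > 1.65)]                 # Hande
```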
Modify a vector by filtering
We can use filtering to selectively change only the elements that satisfy a condition.
Example: For people who weigh more than 65 kg, decrease the weight by 1 kg.
weights
  Can   Cem Hande 
   65    66    61 
weights[weights > 65] - 1
Cem 
 65 
weights[weights > 65] <- weights[weights > 65] - 1
weights
  Can   Cem Hande 
   65    65    61 
Get indices of elements that satisfy a condition
The which() function returns the indices (and labels, if available) of elements in a vector for which a Boolean function returns TRUE.
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)
heights > 1.65
  Can   Cem Hande 
 TRUE  TRUE FALSE 
which(heights > 1.65)
Can Cem 
  1   2 
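Two related base R functions, which.max() and which.min(), return the position of the largest and the smallest element. A sketch with the same data:

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
weights <- c(Can = 65, Cem = 66, Hande = 61)

which.max(heights)            # Cem, position 2
which.min(heights)            # Hande, position 3

# Weight of the tallest person
weights[which.max(heights)]   # Cem 66
```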
Using all() and any()
- We use the all() function to check if all elements in a vector are TRUE.
- We use the any() function to check if at least one of the elements in a vector is TRUE.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
all(heights > 1.60) # TRUE
[1] TRUE
all(heights > 1.70) # FALSE
[1] FALSE
any(heights > 1.70) # TRUE
[1] TRUE
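Two corner cases are worth a look: missing values can leave the answer undetermined, and for an empty vector all() is TRUE while any() is FALSE. A sketch:

```r
any(c(TRUE, NA))                 # TRUE: one TRUE is enough
all(c(TRUE, NA))                 # NA: the missing value could be FALSE
all(c(TRUE, NA), na.rm = TRUE)   # TRUE: the NA is ignored

all(logical(0))                  # TRUE for an empty vector
any(logical(0))                  # FALSE for an empty vector
```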
Generating vectors with repeated elements
The rep() function can be used to replicate values or vectors a specified number of times.
rep(3,10)
 [1] 3 3 3 3 3 3 3 3 3 3
rep("abc",5)
[1] "abc" "abc" "abc" "abc" "abc"
rep(c(1,2,3),5)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),length.out=10)
 [1] 1 2 3 1 2 3 1 2 3 1
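rep() has two further arguments that change how the repetition is done: each repeats every element in turn, and times may be a vector giving a separate count for each element. A sketch:

```r
rep(c(1, 2, 3), each = 2)              # 1 1 2 2 3 3
rep(c("a", "b"), times = c(3, 1))      # "a" "a" "a" "b"
```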
Generating sequences with seq()
The seq() function generates a vector of numbers in arithmetic progression. It is a generalization of the colon (:) operator.
seq(4,9)  # same as 4:9
[1] 4 5 6 7 8 9
Unlike the colon operator, you can specify a step size with the by parameter.
seq(from=12, to=29, by=3)
[1] 12 15 18 21 24 27
Alternatively, if you want a fixed number of elements, you can specify the length.out parameter.
seq(from=1.1, to=6, length.out=10)
 [1] 1.100000 1.644444 2.188889 2.733333 3.277778 3.822222 4.366667 4.911111
 [9] 5.455556 6.000000
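seq() also counts downwards when by is negative, and base R has two helpers, seq_len() and seq_along(), for generating index sequences. Unlike 1:n, seq_len(n) gives an empty vector when n is 0. A sketch, with x and n as illustrative names:

```r
seq(from = 10, to = 1, by = -3)   # 10 7 4 1

x <- c(1.70, 1.75, 1.62)
seq_along(x)                      # 1 2 3

n <- 0
1:n                               # 1 0, usually not what is wanted
seq_len(n)                        # integer(0), an empty vector
```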
Sorting a vector
The sort() function returns a vector with elements sorted in increasing order.
sort(heights)
Hande   Can   Cem 
 1.62  1.70  1.75 
If you want reverse (decreasing) order, set the decreasing parameter to TRUE.
sort(heights, decreasing = TRUE)
  Cem   Can Hande 
 1.75  1.70  1.62 
Suppose we want to sort the weights in a special way: the first element is the weight of the shortest person, and the last element is the weight of the tallest person.
In order to do that, we compute an ordering.
order(heights)
[1] 3 1 2
This shows that when heights are sorted, element 3 would be in the first location, element 1 in the second location, and element 2 in the last location.
We can use this ordering with the weights vector to get what we want.
weights[order(heights)]  # return the weights of people ordered by their heights.
Hande   Can   Cem 
   61    65    66 
If you have a named vector, you can sort it by the names:
heights[sort(names(heights))]
  Can   Cem Hande 
 1.70  1.75  1.62
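order() accepts the same decreasing argument as sort(), and it can be applied to names() directly. A sketch with the same data:

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
weights <- c(Can = 65, Cem = 66, Hande = 61)

order(heights, decreasing = TRUE)            # 2 1 3

# Weights from the tallest person to the shortest
weights[order(heights, decreasing = TRUE)]   # Cem 66, Can 65, Hande 61

# Sort by name in reverse alphabetical order
heights[order(names(heights), decreasing = TRUE)]   # Hande, Cem, Can
```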