3 Data vectors in R
The fundamental data type in R is the vector, which is an ordered collection of elements of the same type (numbers, strings, etc.).
Single numbers such as 10 or sqrt(25) are actually handled as one-element vectors.
Why vectors?
Often, we store large amounts of related data and process them together. For example, we may have height and weight values of 100 people, and we may want to calculate their body-mass index, or the mean height. Storing them as separate variables is difficult, and we don’t have a way of processing them automatically:
height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2
height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2
# ... a lot of lines...
height100 <- 1.68
weight100 <- 70
bmi100 <- weight100 / height100^2
How can we get, e.g., the mean body-mass index of the group? Again, with difficulty:
total <- 0
height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2
total <- total + bmi1
height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2
total <- total + bmi2
# ... a lot of lines...
height100 <- 1.68
weight100 <- 70
bmi100 <- weight100 / height100^2
total <- total + bmi100
mean_bmi <- total / 100
Imagine doing this for 1000, or one million people.
On the other hand, vectors have a sequential structure. If heights is a vector holding the height values, we can get the first element with heights[1], the second element with heights[2], and so on. Better yet, later we’ll see that we can use a loop to go over every element with a few lines of code, regardless of the length of the vector:
total <- 0
i <- 1
while (i <= 100) {
  total <- total + weights[i] / heights[i]^2
  i <- i + 1  # move on to the next person
}
mean_bmi <- total / 100
An even shorter (and better!) way is calling the mean() function:
mean_bmi <- mean(weights / heights^2)
R is designed for data processing, so vectors are the main object of fast calculations. The function mean() works faster than the loop equivalent, because it is internally optimized. It is usually more efficient to use an existing function than to do the same thing manually.
Internally, vectors are designed to occupy an unbroken section of memory. Every element occupies the same number of bytes. That way, vector operations can go over elements quickly, without wasting time jumping around in memory.
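To make the comparison concrete, here is a small sketch that computes the mean BMI both ways and checks that the results agree. The three data points are made-up sample values standing in for the 100 people in the text, and the names loop_mean and vec_mean are chosen only for this illustration.

```r
# Hypothetical sample data: three people instead of 100
heights <- c(1.70, 1.75, 1.62)
weights <- c(65, 66, 61)

# Loop version: accumulate each person's BMI one element at a time
total <- 0
i <- 1
while (i <= length(heights)) {
  total <- total + weights[i] / heights[i]^2
  i <- i + 1
}
loop_mean <- total / length(heights)

# Vectorized version: the whole calculation in one expression
vec_mean <- mean(weights / heights^2)

# Both approaches agree up to floating-point rounding
isTRUE(all.equal(loop_mean, vec_mean))
```

The vectorized line does not mention the number of people at all, so it stays the same whether the vectors hold three values or a million.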
Creating vectors
The most general way to create data vectors is to use the c() function (short for concatenate).
heights <- c(1.70, 1.75, 1.62)
weights <- c(65, 66, 61)
heights
[1] 1.70 1.75 1.62
weights
[1] 65 66 61
Vectors can also be created with the colon operator (:).
x <- 2:10 # assign integers from 2 to 10, inclusive.
x
[1]  2  3  4  5  6  7  8  9 10
Extending vectors
The function c() can also be used to add new elements to vectors.
Suppose initially we have only two pieces of data:
heights <- c(1.70, 1.75)
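heights
[1] 1.70 1.75
Then we get another data point, and we extend the vector.
c(heights, 1.62)
[1] 1.70 1.75 1.62
Note that c() returns a new vector; heights itself is unchanged until we assign the result back to it.
heights
[1] 1.70 1.75
heights <- c(heights, 1.62)
heights
[1] 1.70 1.75 1.62
New elements can also be added at the front, and whole vectors can be joined together.
heights <- c(1.62, heights)
heights
[1] 1.62 1.70 1.75 1.62
c(heights, heights)
[1] 1.62 1.70 1.75 1.62 1.62 1.70 1.75 1.62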
Modes
R variable types are called modes. Modes include: “numeric”, “character”, “logical”, “complex”, and so on.
mode(c(1,2))
[1] "numeric"
mode(c("abc","xyz"))
[1] "character"
mode(c(TRUE,FALSE))
[1] "logical"
mode(2+4i)
[1] "complex"
All elements in a vector must be of the same mode. This is required for efficient calculation over vectors. Other data structures, such as the list, can be used to store mixed-type data.
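A short sketch of what the same-mode rule means in practice: when values of different modes are passed to c(), R does not raise an error but silently converts them to one common mode. The variable names mixed, nums and rec are only illustrative.

```r
# Mixing modes in c() does not fail; R converts everything to one common mode
mixed <- c(1, "a", TRUE)
mode(mixed)   # "character"
mixed         # every element is now a string: "1" "a" "TRUE"

# Numbers and logicals together become numeric (TRUE -> 1, FALSE -> 0)
nums <- c(1.5, TRUE, FALSE)
mode(nums)    # "numeric"

# A list keeps each element's own mode
rec <- list(1, "a", TRUE)
sapply(rec, mode)
```

This silent conversion is worth keeping in mind when a numeric vector unexpectedly turns into strings after a single text value is added to it.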
Vector arithmetic
If you add two vectors, both with the same number of elements, corresponding elements are added:
c(1,4,9) + c(2,16,5)
[1]  3 20 14
The same rule applies to all basic operations:
c(1,4,9) * c(2,16,5)
[1]  2 64 45
c(1,4,9) / c(2,16,5)
[1] 0.50 0.25 1.80
Logical comparisons such as < or > also follow this rule.
3 > 2
[1] TRUE
c(1,4,9) > c(2,16,5)
[1] FALSE FALSE  TRUE
If we add a vector and a single number, the single number is recycled until its length matches the other vector.
c(1,4,9) + c(5) # converted to: c(1,4,9) + c(5,5,5)
[1]  6  9 14
The same goes for other operations:
c(1,4,9) < 5 # converted to: c(1,4,9) < c(5,5,5)
[1]  TRUE  TRUE FALSE
c(1,4,9)^2 # converted to: c(1,4,9) ^ c(2,2,2)
[1]  1 16 81
If the operation is done with two vectors of different lengths, you might get a warning:
c(1,4,9) + c(2,3)
Warning in c(1, 4, 9) + c(2, 3): longer object length is not a multiple of
shorter object length
[1]  3  7 11
However, the following works without warning:
c(1,4,9,10) + c(2,3) # converted to c(1,4,9,10) + c(2,3,2,3)
[1]  3  7 11 13
R recycles the shorter vector until its length matches the longer vector. With vectors of length 3 and 2, the shorter one is recycled as c(2,3,2): the second repetition is cut short because 3 is not a multiple of 2, so a warning is issued.
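The recycling rule can be spelled out by hand. The sketch below builds the recycled vector explicitly with rep() and its length.out argument, then confirms that it gives the same sum as the automatic version. The names a, b and b_recycled are only illustrative.

```r
a <- c(1, 4, 9, 10)
b <- c(2, 3)

# Spell out what recycling does: repeat b until it is as long as a
b_recycled <- rep(b, length.out = length(a))   # 2 3 2 3

# The explicit version gives the same result as the automatic one
identical(a + b, a + b_recycled)
```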
Vectorized functions
These are built-in R functions that operate on vectors as a whole, optimized to process them fast.
sum()
Adds up all elements in a vector and returns the total.
sum(c(1,4,9))
[1] 14
sum(1:1000)
[1] 500500
cumsum()
Adds up the elements at every step and returns a vector of cumulative totals up to each element.
cumsum(1:10)
 [1]  1  3  6 10 15 21 28 36 45 55
prod(), cumprod()
Similar to sum() and cumsum(), but for products of elements.
prod(c(1,4,9))
[1] 36
prod(1:5) # 5!
[1] 120
cumprod(1:5)
[1]   1   2   6  24 120
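As one possible use of these functions, the sketch below combines cumsum() with elementwise division to get a running mean, and checks prod(1:5) against base R's factorial(). The data values are made up, and seq_along() (which returns the indices 1, 2, ..., length(x)) is a base R helper not otherwise used in this chapter.

```r
x <- c(4, 8, 6, 2)

# Running mean: cumulative total divided by how many elements were seen so far
running_mean <- cumsum(x) / seq_along(x)   # 4 6 6 5

# prod(1:n) is n!, which agrees with the built-in factorial()
all.equal(prod(1:5), factorial(5))
```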
Mathematical functions
Familiar mathematical functions are all designed to apply to vectors elementwise.
The square root function:
sqrt(c(4,9,16))
[1] 2 3 4
The constant pi holds the value of \(\pi\):
x <- c(0, pi/4, pi/2, 3*pi/4, pi)
# alternatively: x <- 0:4 * pi/4
sin(x)
[1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224647e-16
The exponential function \(\mathrm{e}^x\):
exp(1:5)
[1]   2.718282   7.389056  20.085537  54.598150 148.413159
The natural logarithm function (the inverse of exp):
log(exp(1:5))
[1] 1 2 3 4 5
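In the sin(x) output, the last value is 1.224647e-16 rather than exactly 0, because pi is stored with finite precision. The following sketch shows one way to deal with such rounding error, using base R's all.equal() and round(). The variable name s is only illustrative.

```r
x <- c(0, pi/2, pi)
s <- sin(x)

# Exact comparison: the last element is not exactly zero
s[3] == 0                          # FALSE

# Comparing with a tolerance treats the tiny error as zero
isTRUE(all.equal(s, c(0, 1, 0)))   # TRUE

# Rounding is another way to tidy the printed result
round(s, 10)
```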
Representing missing data
- In many data sets, we often have some missing data, i.e., observations for which the values are missing.
- In R, missing values are denoted with NA.
- Any vector can contain missing values.
weights <- c(65, NA, 61)
names <- c("Can","Cem",NA)
Vector element names
For readability, we can assign name labels to the elements of a data vector.
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
heights
  Can   Cem Hande
 1.70  1.75  1.62
weights <- c(Can=65, Cem=66, Hande=61)
weights
  Can   Cem Hande
   65    66    61
We can retrieve these names with the names() function.
names(heights)
[1] "Can"   "Cem"   "Hande"
We can assign names to the elements of a vector that already exists.
heights <- c(1.70, 1.75, 1.62)
names(heights) <- c("Can","Cem","Hande")
heights
  Can   Cem Hande
 1.70  1.75  1.62
If for some reason we want to remove the names, we use the unname() function.
unname(heights)
[1] 1.70 1.75 1.62
The original vector is not changed with this function call, because we did not assign the result to heights.
heights
  Can   Cem Hande
 1.70  1.75  1.62
Vector indexing
We can access a single element of a vector by providing the index of the element in square brackets.
heights
  Can   Cem Hande
 1.70  1.75  1.62
heights[1] # first element
Can
1.7
heights[3] # third element
Hande
 1.62
We can select a slice of the vector by providing a range inside brackets.
heights[1:2] # select from element 1 to element 2, inclusive.
 Can  Cem
1.70 1.75
We can also give a vector consisting of element indices.
heights[c(1,3)] # select elements 1 and 3.
  Can Hande
 1.70  1.62
The indices do not have to be in order:
heights[c(2,1,3)]
  Cem   Can Hande
 1.75  1.70  1.62
We can select the same element more than once.
heights[c(1,1,3,2,3)]
  Can   Can Hande   Cem Hande
 1.70  1.70  1.62  1.75  1.62
We can provide a Boolean (true/false) vector for indexing. This will select only elements with corresponding TRUE values.
heights
  Can   Cem Hande
 1.70  1.75  1.62
heights[c(T,F,F)] # T is a shorthand for TRUE, F is for FALSE.
Can
1.7
We can exclude elements using negative indices.
heights[-1] # exclude first element.
  Cem Hande
 1.75  1.62
heights[c(-1,-3)] # exclude 1st and 3rd elements
 Cem
1.75
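Two edge cases of indexing are easy to trip over, so here is a brief sketch of them using the same heights vector: an index past the end of the vector, and a Boolean index that is shorter than the vector.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

# An index past the end does not raise an error; it yields NA
heights[4]

# A Boolean index shorter than the vector is recycled:
# c(TRUE, FALSE) is treated as c(TRUE, FALSE, TRUE)
heights[c(TRUE, FALSE)]
```

Positive and negative indices cannot be mixed in the same brackets; an expression such as heights[c(-1, 2)] is an error. Also, T and F are ordinary variables that can be reassigned, so spelling out TRUE and FALSE is the safer habit in scripts.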
Using names to select elements
If the elements are given names consisting of strings, we can use these names in brackets instead of indices.
heights["Can"]
Can
1.7
heights[c("Can","Can","Hande","Cem","Hande")]
  Can   Can Hande   Cem Hande
 1.70  1.70  1.62  1.75  1.62
Modify element values in a vector
We can change an element by indexing it on the left-hand side of an assignment.
heights
  Can   Cem Hande
 1.70  1.75  1.62
heights[1] <- 1.72
heights
  Can   Cem Hande
 1.72  1.75  1.62
heights[1] <- 1.70
Insert values into an existing vector
A vector’s size is determined at its creation, and its elements are stored contiguously (side by side) in memory. Therefore it is not possible to add or remove an element of a vector in place. However, we can build a new vector and reassign the same name to it.
heights
  Can   Cem Hande
 1.70  1.75  1.62
heights <- c(heights[1:2], Lale=1.76, heights[3])
heights
  Can   Cem  Lale Hande
 1.70  1.75  1.76  1.62
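Base R also offers the append() function, which performs the same insertion through its after argument. The sketch below is an alternative to the slicing approach, not a replacement for it. The name heights2 is only illustrative.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

# Insert Lale after the 2nd element; like c(), append() returns a new vector
heights2 <- append(heights, c(Lale = 1.76), after = 2)
names(heights2)   # "Can" "Cem" "Lale" "Hande"

# The original vector is untouched until we reassign it
length(heights)   # 3
```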
Delete elements from vector
Again, we cannot directly remove an element from an existing vector, but we can create a new vector without the element we want to delete, and reassign to the name.
heights
  Can   Cem  Lale Hande
 1.70  1.75  1.76  1.62
heights <- heights[-3] # exclude element 3
heights
  Can   Cem Hande
 1.70  1.75  1.62
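When the position of the element is not known in advance, the element can be dropped by name instead. A minimal sketch, assuming the names are unique; heights_kept is an illustrative variable name.

```r
heights <- c(Can = 1.70, Cem = 1.75, Lale = 1.76, Hande = 1.62)

# Negative indices need a position, so heights[-"Lale"] is an error.
# Instead, keep every element whose name is not "Lale".
heights_kept <- heights[names(heights) != "Lale"]
names(heights_kept)   # "Can" "Cem" "Hande"
```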
Getting the length of a vector
We can get the number of elements in a vector using the length() function.
length(heights)
[1] 3
length(10:17)
[1] 8
Vector filtering
- Apply a Boolean function (e.g., greater than, less than, …) to each element of the vector.
- The result is a Boolean vector holding the outcome for each element.
heights
  Can   Cem Hande
 1.70  1.75  1.62
heights > 1.65
  Can   Cem Hande
 TRUE  TRUE FALSE
Using this Boolean vector, we can select data points satisfying the condition.
tall_people <- heights>1.65
tall_people
  Can   Cem Hande
 TRUE  TRUE FALSE
heights[tall_people]
 Can  Cem
1.70 1.75
This can be done in a single line, too.
heights[heights>1.65]
 Can  Cem
1.70 1.75
One can also filter a vector according to another vector’s values.
heights
  Can   Cem Hande
 1.70  1.75  1.62
weights
  Can   Cem Hande
   65    66    61
weights[ heights > 1.65 ] # weights of people who are taller than 1.65
Can Cem
 65  66
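Conditions can also be combined. The sketch below uses the elementwise operators & ("and") and | ("or") from base R, and filters on a computed quantity. The thresholds and the names tall_and_light and bmi are chosen only for illustration.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
weights <- c(Can = 65, Cem = 66, Hande = 61)

# & means "and", | means "or"; both work element by element
tall_and_light <- heights > 1.65 & weights < 66
heights[tall_and_light]            # only Can

# A computed quantity can drive the filter as well
bmi <- weights / heights^2
names(bmi[bmi > 22])               # "Can" "Hande"
```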
Modify a vector by filtering
We can use filtering to selectively change only the elements that satisfy a condition.
Example: For people who weigh more than 65 kg, decrease the weight by 1 kg.
weights
  Can   Cem Hande
   65    66    61
weights[weights > 65] - 1
Cem
 65
weights[weights > 65] <- weights[weights > 65] - 1
weights
  Can   Cem Hande
   65    65    61
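The same conditional change can be written with base R's ifelse() function, which builds a new vector instead of modifying the existing one. This is an alternative sketch rather than the approach used in the text; adjusted is an illustrative name.

```r
weights <- c(Can = 65, Cem = 66, Hande = 61)

# ifelse(test, yes, no) picks from `yes` where test is TRUE, else from `no`;
# the original vector stays untouched until we assign the result
adjusted <- ifelse(weights > 65, weights - 1, weights)
unname(adjusted)   # 65 65 61
```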
Get indices of elements that satisfy a condition
The which() function takes a Boolean vector and returns the indices (with labels, if available) of the elements that are TRUE.
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)
heights > 1.65
  Can   Cem Hande
 TRUE  TRUE FALSE
which(heights > 1.65)
Can Cem
  1   2
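Two related base R functions, which.max() and which.min(), return the position of the largest and smallest element directly. A short sketch with the same heights vector:

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

which.max(heights)          # position (and name) of the tallest person
which.min(heights)          # position (and name) of the shortest person
names(which.max(heights))   # just the name: "Cem"
```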
Using all() and any()
- We use the all() function to check if all elements in a vector are TRUE.
- We use the any() function to check if at least one of the elements in a vector is TRUE.
heights
  Can   Cem Hande
 1.70  1.75  1.62
all(heights > 1.60) # TRUE
[1] TRUE
all(heights > 1.70) # FALSE
[1] FALSE
any(heights > 1.70) # TRUE
[1] TRUE
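Missing values affect both functions. The sketch below reuses the weights vector with an NA from the missing-data section and shows the na.rm argument, which both all() and any() accept.

```r
weights <- c(65, NA, 61)

any(weights > 64)                # TRUE: one known value already satisfies it
all(weights > 60)                # NA: the missing value could go either way
all(weights > 60, na.rm = TRUE)  # TRUE: missing values are ignored
```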
Generating vectors with repeated elements
The rep() function can be used to replicate values or vectors a specified number of times.
rep(3,10)
 [1] 3 3 3 3 3 3 3 3 3 3
rep("abc",5)
[1] "abc" "abc" "abc" "abc" "abc"
rep(c(1,2,3),5)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),length.out=10)
 [1] 1 2 3 1 2 3 1 2 3 1
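rep() has two further arguments worth knowing, each and times, which control how the repetition is arranged. A brief sketch (r_each and r_times are illustrative names):

```r
# each: repeat every element in place before moving to the next one
r_each <- rep(c(1, 2, 3), each = 2)            # 1 1 2 2 3 3

# times as a vector: a separate repeat count for each element
r_times <- rep(c("a", "b"), times = c(2, 3))   # "a" "a" "b" "b" "b"
```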
Generating sequences with seq()
The seq() function generates a vector of numbers in arithmetic progression. It is a generalization of the colon (:) operator.
seq(4,9) # same as 4:9
[1] 4 5 6 7 8 9
Unlike the colon operator, seq() lets you specify a step size with the by parameter.
seq(from=12, to=29, by=3)
[1] 12 15 18 21 24 27
Alternatively, if you want a fixed number of elements, you can specify the length.out parameter.
seq(from=1.1, to=6, length.out=10)
 [1] 1.100000 1.644444 2.188889 2.733333 3.277778 3.822222 4.366667 4.911111
 [9] 5.455556 6.000000
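For generating index sequences, base R provides the helpers seq_len() and seq_along(). The sketch below shows them next to the colon operator, including the case of an empty vector, where the two behave differently. The variable names are only illustrative.

```r
x <- c(1.70, 1.75, 1.62)
seq_along(x)          # 1 2 3: one index per element of x
seq_len(4)            # 1 2 3 4

# With an empty vector, the colon operator counts down instead of stopping
empty <- numeric(0)
1:length(empty)       # 1 0
seq_along(empty)      # integer(0): no indices at all
```

This difference matters mostly when such a sequence drives a loop: with seq_along() a loop over an empty vector simply does not run.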
Sorting a vector
The sort() function returns a vector with elements sorted in increasing order.
sort(heights)
Hande   Can   Cem
 1.62  1.70  1.75
If you want reverse (decreasing) order, set the decreasing parameter to TRUE.
sort(heights, decreasing = TRUE)
  Cem   Can Hande
 1.75  1.70  1.62
Suppose we want to sort the weights in a special way: the first element is the weight of the shortest person, and the last element is the weight of the tallest person.
In order to do that, we compute an ordering.
order(heights)
[1] 3 1 2
This shows that when heights are sorted, element 3 would be in the first location, element 1 in the second location, and element 2 in the last location.
We can use this ordering with the weights vector to get what we want.
weights[order(heights)] # return the weights of people ordered by their heights.
Hande   Can   Cem
   61    65    66
If you have a named vector, you can sort it by the names:
heights[sort(names(heights))]
  Can   Cem Hande
 1.70  1.75  1.62
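order() accepts the same decreasing argument as sort(), and it works on character vectors such as names as well. A short sketch building on the examples above, assuming unique names; by_height_desc is an illustrative variable name.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
weights <- c(Can = 65, Cem = 66, Hande = 61)

# Weights listed from the tallest person to the shortest
by_height_desc <- weights[order(heights, decreasing = TRUE)]
names(by_height_desc)   # "Cem" "Can" "Hande"

# order() also works on names, so this matches heights[sort(names(heights))]
identical(heights[order(names(heights))], heights[sort(names(heights))])
```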