3  Data vectors in R

Why vectors?

Often, we store large amounts of related data and process them together. For example, we may have the height and weight values of 100 people, and we may want to calculate their body-mass indices, or the mean height. Storing each value in a separate variable is cumbersome, and gives us no way of processing them automatically:

height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2

height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2

# ... a lot of lines...

height100 <- 1.68
weight100 <- 70
bmi100 <- weight100 / height100^2

How can we get, e.g. the mean body-mass index of the group? Again, with difficulty:

total <- 0 
height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2
total <- total + bmi1

height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2
total <- total + bmi2

# ... a lot of lines...

height100 <- 1.68
weight100 <- 70
bmi100 <- weight100 / height100^2
total <- total + bmi100

mean_bmi <- total / 100

Imagine doing this for 1000, or one million people.

On the other hand, vectors have a sequential structure. If heights is a vector holding the height values, we can get the first element with heights[1], the second element with heights[2], and so on. Better yet, later we’ll see that we can use a loop to go over every element with a few lines of code, regardless of the length of the vector:

total <- 0
i <- 1
while (i <= 100) {
    total <- total + weights[i]/heights[i]^2
    i <- i + 1  # move on to the next person; without this the loop never ends
}
mean_bmi <- total / 100

An even shorter (and better!) way is to call the mean() function:

mean_bmi <- mean(weights/heights^2)
Tip

R is designed for data processing, so most of its fast operations work on whole vectors. The function mean() runs faster than the equivalent loop because it is internally optimized. As a rule, it is more efficient to use an existing function than to do the same thing manually.

Internally, a vector occupies an unbroken section of memory, and every element takes up the same number of bytes. That way, vector operations can go over the elements quickly, without wasting time jumping around in memory.
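
As a small check of the comparison above, the following sketch computes the mean body-mass index both ways and confirms that the loop and the vectorized expression agree. It uses only three data points instead of 100, and the names loop_mean and vec_mean are chosen here for illustration.

```r
# Three data points stand in for the 100 people of the example.
heights <- c(1.70, 1.75, 1.62)
weights <- c(65, 66, 61)

# Loop version: accumulate one BMI value at a time.
total <- 0
i <- 1
while (i <= length(heights)) {
    total <- total + weights[i] / heights[i]^2
    i <- i + 1
}
loop_mean <- total / length(heights)

# Vectorized version: one expression over the whole vectors.
vec_mean <- mean(weights / heights^2)

# Compare with a tolerance, since floating-point sums can differ in the last digits.
all.equal(loop_mean, vec_mean)
```

Using length(heights) instead of a hard-coded 100 lets the same loop work for a vector of any size.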

Creating vectors

The most general way to create data vectors is to use the c() function (short for concatenate).

heights <- c(1.70, 1.75, 1.62)
weights <- c(65, 66, 61)
heights
[1] 1.70 1.75 1.62
weights
[1] 65 66 61

Vectors can also be created with the colon operator (:).

x <- 2:10 # assign integers from 2 to 10, inclusive.
x
[1]  2  3  4  5  6  7  8  9 10

Extending vectors

The function c() can also be used to add new elements to vectors.

Suppose initially we have only two pieces of data:

heights <- c(1.70, 1.75)
heights
[1] 1.70 1.75

Then we get another data point, and we extend the vector.

c(heights, 1.62)
[1] 1.70 1.75 1.62
heights
[1] 1.70 1.75
heights <- c(heights, 1.62)
heights
[1] 1.70 1.75 1.62
heights <- c(1.62, heights)
heights
[1] 1.62 1.70 1.75 1.62
c(heights, heights)
[1] 1.62 1.70 1.75 1.62 1.62 1.70 1.75 1.62

Modes

R variable types are called modes. Modes include: “numeric”, “character”, “logical”, “complex”, and so on.

mode(c(1,2))
[1] "numeric"
mode(c("abc","xyz"))
[1] "character"
mode(c(TRUE,FALSE))
[1] "logical"
mode(2+4i)
[1] "complex"

All elements in a vector must be of the same mode. This is required for efficient calculation over vectors. Other data structures, such as the list, can be used to store mixed-type data.
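
To illustrate the rule, here is a short sketch of what happens when values of different modes are combined with c(): R does not raise an error, but silently converts everything to one common mode. The variable names mixed and items are illustrative.

```r
# Mixing modes in c() converts every element to one common mode.
mixed <- c(1, "abc", TRUE)
mode(mixed)          # "character": the number and the logical became strings
mixed                # "1" "abc" "TRUE"

# Logical values mixed with numbers become 1 (TRUE) and 0 (FALSE).
c(10, TRUE, FALSE)   # 10 1 0

# A list keeps each element's own mode.
items <- list(1, "abc", TRUE)
mode(items[[2]])     # "character"
mode(items[[3]])     # "logical"
```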

Vector arithmetic

If you add two vectors, both with the same number of elements, corresponding elements are added:

c(1,4,9) + c(2,16,5)
[1]  3 20 14

The same rule applies to all basic operations:

c(1,4,9) * c(2,16,5)
[1]  2 64 45
c(1,4,9) / c(2,16,5)
[1] 0.50 0.25 1.80

Logical comparisons such as < or > also follow this rule.

3 > 2
[1] TRUE
c(1,4,9) > c(2,16,5)
[1] FALSE FALSE  TRUE

If we add a vector and a single number, the single number is recycled until its length matches the other vector.

c(1,4,9) + c(5)  # converted to: c(1,4,9) + c(5,5,5)
[1]  6  9 14

Same goes for other operations:

c(1,4,9) < 5  # converted to: c(1,4,9) < c(5,5,5)
[1]  TRUE  TRUE FALSE
c(1,4,9)^2   # converted to: c(1,4,9) ^ c(2,2,2)
[1]  1 16 81

If the operation is done with two vectors with different sizes, you might get a warning:

c(1,4,9) + c(2,3)
Warning in c(1, 4, 9) + c(2, 3): longer object length is not a multiple of
shorter object length
[1]  3  7 11

However, the following works without warning:

c(1,4,9,10) + c(2,3) # converted to c(1,4,9,10) + c(2,3,2,3)
[1]  3  7 11 13

R recycles the shorter vector until its length matches that of the longer vector. The same happens with vectors of length 3 and 2, but there the shorter vector does not fit a whole number of times (c(2,3) is cut off as c(2,3,2)), so a warning is issued.
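
The recycling can be made explicit with the base R function rep_len(), which repeats a vector until it reaches a given length. This sketch only illustrates the rule, it is not how R performs the operation internally. The names long, short and recycled are illustrative.

```r
long  <- c(1, 4, 9, 10)
short <- c(2, 3)

# Spell out the recycling that R does implicitly.
recycled <- rep_len(short, length(long))   # 2 3 2 3
long + recycled                            # same values as long + short

# With a target length of 3, the last copy of the shorter vector is cut off.
rep_len(short, 3)                          # 2 3 2
```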

Exercise

A set of temperature measurements is given in the Fahrenheit scale as follows:

temperatures_F <- c(87, 89, 101, 91, 86, 71, 76)

This can be converted to Celsius degrees using the formula \[C = \frac{5}{9}(F-32)\]

Write an R expression that returns a vector of corresponding Celsius values.

Vectorized functions

These are built-in R functions that operate on vectors as a whole, optimized to process them fast.

sum()

Adds up all elements in a vector and returns the total.

sum(c(1,4,9))
[1] 14
sum(1:1000)
[1] 500500
Exercise

Using sum(), calculate the sum of the squares of the first 100 positive integers: \(\sum_{i=1}^{100} i^2 = 1 + 4 + 9 + \cdots + 10000\)

cumsum()

Adds up the elements at every step, returns a vector of cumulative totals up to that element.

cumsum(1:10)
 [1]  1  3  6 10 15 21 28 36 45 55

prod(), cumprod()

Similar to sum() and cumsum(), but for products of elements.

prod(c(1,4,9))
[1] 36
prod(1:5)  # 5!
[1] 120
cumprod(1:5)
[1]   1   2   6  24 120
Exercise

Suppose you have 10 marbles, each of a different color. The number of ways you can select 4 marbles out of them is given by \[\left(\begin{matrix} 10\\4 \end{matrix}\right)=\frac{10!}{4!6!}\] Calculate this number using prod().

Mathematical functions

Familiar mathematical functions are all designed to apply to vectors element by element.

The square root function:

sqrt(c(4,9,16))
[1] 2 3 4

The constant pi holds the value of \(\pi\).

x <- c(0, pi/4, pi/2, 3*pi/4, pi)
# alternatively: x <- 0:4 * pi/4
sin(x)
[1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224647e-16
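
The last value in the output, 1.224647e-16, is not exactly zero because pi is stored with finite precision. A minimal sketch of how such results can be compared safely, using the base R functions all.equal() and round():

```r
sin(pi) == 0                    # FALSE: the result is tiny but not exactly zero
isTRUE(all.equal(sin(pi), 0))   # TRUE: equal within a small tolerance
round(sin(pi), 10)              # 0 after rounding to 10 decimal places
```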

The exponential function \(\mathrm{e}^x\)

exp(1:5)
[1]   2.718282   7.389056  20.085537  54.598150 148.413159

The natural logarithm function (inverse of exp)

log(exp(1:5))
[1] 1 2 3 4 5

Representing missing data

  • Data sets often contain missing data, i.e., observations for which the values are missing.
  • In R, missing values are denoted with NA.
  • Any vector can contain missing values.
weights <- c(65, NA, 61)
names <- c("Can","Cem",NA)
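
Missing values affect calculations. The sketch below shows how an NA propagates through mean(), and how it can be detected with is.na() or skipped with the na.rm argument that many base R summary functions accept.

```r
weights <- c(65, NA, 61)

is.na(weights)                # FALSE TRUE FALSE
mean(weights)                 # NA: the result is unknown if any value is unknown
mean(weights, na.rm = TRUE)   # 63: mean of the two known values
sum(weights, na.rm = TRUE)    # 126
```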

Vector element names

For readability, we can assign name labels to the elements of a data vector.

heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
weights <- c(Can=65, Cem=66, Hande=61)
weights
  Can   Cem Hande 
   65    66    61 

We can retrieve these names with the names() function.

names(heights)
[1] "Can"   "Cem"   "Hande"

We can assign names to the elements of a vector that already exists.

heights <- c(1.70, 1.75, 1.62)
names(heights) <- c("Can","Cem","Hande")
heights
  Can   Cem Hande 
 1.70  1.75  1.62 

If for some reason we want to remove the names, we use the unname() function.

unname(heights)
[1] 1.70 1.75 1.62

The original vector is not changed with this function call, because we did not assign the result to heights.

heights
  Can   Cem Hande 
 1.70  1.75  1.62 

Vector indexing

We can access a single element of a vector by providing the index of the element in square brackets.

heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights[1]  # first element
Can 
1.7 
heights[3] # third element
Hande 
 1.62 

We can select a slice of the vector by providing a range inside brackets.

heights[1:2]  # select from element 1 to element 2, inclusive.
 Can  Cem 
1.70 1.75 

We can also give a vector consisting of element indices.

heights[c(1,3)]  # select elements 1 and 3.
  Can Hande 
 1.70  1.62 

The indices do not have to be in order:

heights[c(2,1,3)]
  Cem   Can Hande 
 1.75  1.70  1.62 

We can select the same element more than once.

heights[c(1,1,3,2,3)]
  Can   Can Hande   Cem Hande 
 1.70  1.70  1.62  1.75  1.62 

We can provide a Boolean (true/false) vector for indexing. This will select only elements with corresponding TRUE values.

heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights[c(T,F,F)]  # T is a shorthand for TRUE, F is for FALSE.
Can 
1.7 

We can exclude elements using negative indices.

heights[-1]  # exclude first element.
  Cem Hande 
 1.75  1.62 
heights[c(-1,-3)]  # exclude 1st and 3rd elements
 Cem 
1.75 
Exercise

Suppose we define a four-element vector

v <- c(3,6,2,-1)

Which of the following CANNOT be used to select the second and third elements of this vector?

  • v[2:3]
  • v[c(2,3)]
  • v[c(6,2)]
  • v[c(F,T,T,F)]
  • v[c(-1,-4)]

Using names to select elements

If the elements are given names consisting of strings, we can use these names in brackets instead of indices.

heights["Can"]
Can 
1.7 
heights[c("Can","Can","Hande","Cem","Hande")]
  Can   Can Hande   Cem Hande 
 1.70  1.70  1.62  1.75  1.62 

Modify element values in a vector

heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights[1] <- 1.72
heights
  Can   Cem Hande 
 1.72  1.75  1.62 
heights[1] <- 1.70

Insert values into an existing vector

A vector’s size is determined at its creation, and its elements are stored contiguously (side by side) in memory. Therefore it is not really possible to add an element to, or remove one from, an existing vector. However, we can build a new vector that contains the extra element and assign it to the same name.

heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights <- c(heights[1:2], Lale=1.76, heights[3])
heights
  Can   Cem  Lale Hande 
 1.70  1.75  1.76  1.62 
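
The same insertion can also be written with the base R function append(), whose after argument gives the position after which the new values go. This is offered as an alternative sketch. Like the c() version, it builds a new vector that is then assigned to the name.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

# Insert a new named element after position 2.
heights <- append(heights, c(Lale = 1.76), after = 2)
heights   # elements in the order Can, Cem, Lale, Hande
```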

Delete elements from a vector

Again, we cannot directly remove an element from an existing vector, but we can create a new vector without the element we want to delete, and assign it to the same name.

heights
  Can   Cem  Lale Hande 
 1.70  1.75  1.76  1.62 
heights <- heights[-3]  # exclude element 3
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
Exercise

Suppose we define a vector with

v <- c(3,4,5)

What is the output of the following commands?

v <- c(5, v, 1:2)
v <- v[-2]
v[2:4]
  • 2 3 4
  • 5 3 4 5 3 4
  • 4 5 3
  • 4 5 1

Getting the length of a vector

We can get the number of elements in a vector using the length() function.

length(heights)
[1] 3
length(10:17)
[1] 8

Vector filtering

  • Apply a Boolean function (e.g., greater than, less than, …) to each element of the vector.
  • Returns a Boolean vector according to the result on each element.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
heights > 1.65
  Can   Cem Hande 
 TRUE  TRUE FALSE 

Using this Boolean vector, we can select data points satisfying the condition.

tall_people <- heights>1.65
tall_people
  Can   Cem Hande 
 TRUE  TRUE FALSE 
heights[tall_people]
 Can  Cem 
1.70 1.75 

Obviously, this can be done in a single line, too.

heights[heights>1.65]
 Can  Cem 
1.70 1.75 

One can also filter a vector according to another vector’s values.

heights
  Can   Cem Hande 
 1.70  1.75  1.62 
weights
  Can   Cem Hande 
   65    66    61 
weights[ heights > 1.65 ]  # weights of people who are taller than 1.65
Can Cem 
 65  66 
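
Conditions can also be combined before filtering. The sketch below uses R's element-wise logical operators & (and), | (or) and ! (not). The thresholds are arbitrary values chosen for illustration.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
weights <- c(Can = 65, Cem = 66, Hande = 61)

# & : both conditions must hold for an element to be selected.
weights[heights > 1.65 & weights > 65]    # Cem only

# | : at least one condition must hold.
weights[heights > 1.72 | weights < 63]    # Cem and Hande

# ! : negate a condition.
heights[!(heights > 1.65)]                # Hande only
```
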
Exercise

Given the vectors with named values:

ages <- c(Ali=18, Hasan=21, Fatma=18, Hande=22, Cem=21)
weights <- c(Ali=75, Hasan=72, Fatma=60, Hande=56, Cem=67)

which of the following commands prints the weights of people who are 18 years old?

  • weights[ages==18]
  • ages[weights]==18
  • weights[names(ages==18)]
  • names(weights[ages==18])

Modify a vector by filtering

We can use filtering to selectively change only the elements that satisfy a condition.

Example: For people who weigh more than 65 kg, decrease the weight by 1 kg.

weights
  Can   Cem Hande 
   65    66    61 
weights[weights > 65] - 1
Cem 
 65 
weights[weights > 65] <- weights[weights > 65] - 1
weights
  Can   Cem Hande 
   65    65    61 

Get indices of elements that satisfy a condition

The which() function returns the indices (and labels, if available) of the elements of a vector for which a condition is TRUE.

heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)
heights > 1.65
  Can   Cem Hande 
 TRUE  TRUE FALSE 
which(heights > 1.65)
Can Cem 
  1   2 
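
A short sketch of how the result of which() can be used. The related base R functions which.max() and which.min() are included as a convenience, and idx is an illustrative name.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

idx <- which(heights > 1.65)
unname(idx)          # 1 2: the bare positions
names(idx)           # "Can" "Cem": the labels
heights[idx]         # same elements as heights[heights > 1.65]

# Position of the largest and of the smallest element.
which.max(heights)   # Cem, at position 2
which.min(heights)   # Hande, at position 3
```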

Using all() and any()

  • We use the all() function to check if all elements in a vector are TRUE.
  • We use the any() function to check if at least one of the elements in a vector is TRUE.
heights
  Can   Cem Hande 
 1.70  1.75  1.62 
all(heights > 1.60) # TRUE
[1] TRUE
all(heights > 1.70) # FALSE
[1] FALSE
any(heights > 1.70) # TRUE
[1] TRUE
Exercise

Suppose a vector named ages holds the ages of a group who want to enter a museum. You want to make sure that there is at least one grownup among them. Which command do you use?

  • any(ages > 18)
  • all(ages > 18)
  • any(ages < 18)
  • all(ages < 18)
Exercise

Suppose a vector named ages holds the ages of a group who want to enter a bar. You want to make sure that everybody is of proper age to drink. Which command do you use?

  • any(ages > 18)
  • all(ages > 18)
  • any(ages < 18)
  • all(ages < 18)

Generating vectors with repeated elements

The rep() function can be used to replicate values or vectors a specified number of times.

rep(3,10)
 [1] 3 3 3 3 3 3 3 3 3 3
rep("abc",5)
[1] "abc" "abc" "abc" "abc" "abc"
rep(c(1,2,3),5)
 [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),length.out=10)
 [1] 1 2 3 1 2 3 1 2 3 1
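
Besides length.out, rep() accepts the arguments each and times, which control how the repetition is arranged. A brief sketch:

```r
# each: repeat every element before moving on to the next one.
rep(c(1, 2, 3), each = 2)           # 1 1 2 2 3 3

# times as a vector: a separate repeat count for each element.
rep(c("a", "b"), times = c(3, 1))   # "a" "a" "a" "b"

# each and times together: each is applied first, then the result is repeated.
rep(c(1, 2), each = 2, times = 2)   # 1 1 2 2 1 1 2 2
```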

Generating sequences with seq()

The seq() function generates a vector of numbers in arithmetic progression. It is a generalization of the colon (:) operator.

seq(4,9)  # same as 4:9
[1] 4 5 6 7 8 9

Unlike the colon operator, you can specify a step size with the by parameter.

seq(from=12, to=29, by=3)
[1] 12 15 18 21 24 27

Alternatively, if you want a fixed number of elements, you can specify the length.out parameter.

seq(from=1.1, to=6, length.out=10)
 [1] 1.100000 1.644444 2.188889 2.733333 3.277778 3.822222 4.366667 4.911111
 [9] 5.455556 6.000000
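
When length.out is given, the step size follows from the end points: it is (to - from) / (length.out - 1). The sketch below checks this for the sequence above, using the base R function diff() to compute the gaps between consecutive elements. The names s and step are illustrative.

```r
s <- seq(from = 1.1, to = 6, length.out = 10)

# Ten points leave nine gaps between the two end points.
step <- (6 - 1.1) / (10 - 1)       # about 0.5444

diff(s)                            # nine gaps, each about 0.5444
all.equal(diff(s), rep(step, 9))   # compare with a tolerance
```
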
Exercise

Create a sequence of values from 5 down to −11 that progresses in steps of 0.3.

Exercise

Create a sequence of length 101 from 0 to \(\pi\).

Sorting a vector

The sort() function returns a vector with elements sorted in increasing order.

sort(heights)
Hande   Can   Cem 
 1.62  1.70  1.75 

If you want reverse (decreasing) order, set the decreasing parameter to TRUE.

sort(heights, decreasing = TRUE)
  Cem   Can Hande 
 1.75  1.70  1.62 

Suppose we want to sort the weights in a special way: the first element is the weight of the shortest person, and the last element is the weight of the tallest person.

In order to do that, we compute an ordering.

order(heights)
[1] 3 1 2

This shows that when heights are sorted, element 3 would be in the first location, element 1 in the second location, and element 2 in the last location.

We can use this ordering with the weights vector to get what we want.

weights[order(heights)]  # return the weights of people ordered by their heights.
Hande   Can   Cem 
   61    65    66 
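
The following sketch ties sort() and order() together: indexing a vector by its own ordering gives the sorted vector. It also shows that order(), like sort(), accepts decreasing = TRUE. The name ord is illustrative.

```r
heights <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
weights <- c(Can = 65, Cem = 66, Hande = 61)

ord <- order(heights)   # 3 1 2

# Indexing a vector by its own ordering gives the sorted vector.
heights[ord]            # Hande, Can, Cem: the same order as sort(heights)

# Weights listed from the tallest person to the shortest.
weights[order(heights, decreasing = TRUE)]   # Cem, Can, Hande
```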

If you have a named vector, you can sort it by the names:

heights[sort(names(heights))]
  Can   Cem Hande 
 1.70  1.75  1.62 
Exercise

Consider the following data

Country         Area (km²)      Population
Russia          17,098,242     142,257,519
United States    9,833,517     326,625,791
China            9,596,960   1,379,302,771
Brazil           8,515,770     207,353,391
Australia        7,741,220      23,232,413
India            3,287,263   1,281,935,911
Turkey             783,562      80,845,215
France             643,801      67,106,161
Japan              377,915     126,451,398
United Kingdom     243,610      65,648,100

  • Create two vectors area and population that hold the data in the respective columns. Label the elements in each vector with the country name.

  • Create a new vector called density that holds the population density of the countries.

  • Print the names of countries sorted by population density, in descending order (from highest to lowest).