3 Data vectors in R
The fundamental data type in R is the vector, which is an ordered collection of elements of the same type (numbers, strings, etc.).
Single numbers such as 10 or sqrt(25) are actually handled as one-element vectors.
Why vectors?
Often, we store large amounts of related data and process them together. For example, we may have height and weight values of 100 people, and we may want to calculate their body-mass index, or the mean height. Storing them as separate variables is cumbersome, and it leaves us no way of processing them automatically:
height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2
height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2
# ... a lot of lines...
height100 <- 1.68
weight100 <- 70
bmi100 <- weight100 / height100^2
How can we get, e.g. the mean body-mass index of the group? Again, with difficulty:
total <- 0
height1 <- 1.70
weight1 <- 65
bmi1 <- weight1 / height1^2
total <- total + bmi1
height2 <- 1.75
weight2 <- 66
bmi2 <- weight2 / height2^2
total <- total + bmi2
# ... a lot of lines...
height100 <- 1.68
weight100 <- 70
bmi100 <- weight100 / height100^2
total <- total + bmi100
mean_bmi <- total / 100
Imagine doing this for 1000, or one million people.
On the other hand, vectors have a sequential structure. If heights is a vector holding the height values, we can get the first element with heights[1], the second element with heights[2], and so on. Better yet, later we’ll see that we can use a loop to go over every element with a few lines of code, regardless of the length of the vector:
total <- 0
i <- 1
while (i <= 100) {
    total <- total + weights[i] / heights[i]^2
    i <- i + 1    # move on to the next person
}
mean_bmi <- total / 100
An even shorter (and better!) way is calling the mean() function:
mean_bmi <- mean(weights/heights^2)
R is designed for data processing, so vectors are the main object of fast calculations. The function mean() works faster than the loop equivalent, because it is internally optimized. It is usually more efficient to use an existing vectorized function than to do the same thing manually.
Internally, vectors are designed to occupy an unbroken section of memory, and every element occupies the same number of bytes. That way, vector operations can go over elements quickly, without wasting time jumping around in memory.
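As a quick check that the loop and the mean() call compute the same quantity, here is a minimal sketch. The five heights and weights are made-up values, and the names sample_heights, sample_weights, loop_mean_bmi and vector_mean_bmi are chosen only for this illustration.

```r
sample_heights <- c(1.70, 1.75, 1.62, 1.68, 1.80)  # metres (made-up values)
sample_weights <- c(65, 66, 61, 70, 82)            # kilograms (made-up values)

# Loop version: accumulate each person's BMI, then divide by the count.
n <- length(sample_heights)
total <- 0
i <- 1
while (i <= n) {
    total <- total + sample_weights[i] / sample_heights[i]^2
    i <- i + 1
}
loop_mean_bmi <- total / n

# Vectorized version: one expression over the whole vectors.
vector_mean_bmi <- mean(sample_weights / sample_heights^2)

# all.equal() compares numbers with a small tolerance for rounding error.
all.equal(loop_mean_bmi, vector_mean_bmi)
```

The two results may differ in the last few decimal places because the additions happen in a different order, which is why the comparison uses all.equal() rather than ==.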
Creating vectors
The most general way to create data vectors is to use the c() function (short for concatenate).
heights <- c(1.70, 1.75, 1.62)
weights <- c(65, 66, 61)
heights
[1] 1.70 1.75 1.62
weights
[1] 65 66 61
Vectors can also be created with the colon operator (:).
x <- 2:10 # assign integers from 2 to 10, inclusive.
x
[1] 2 3 4 5 6 7 8 9 10
Extending vectors
The function c() can also be used to add new elements to vectors.
Suppose initially we have only two pieces of data:
heights <- c(1.70, 1.75)
heights
[1] 1.70 1.75
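Then we get another data point, and we extend the vector. Calling c() on its own returns a new vector and leaves heights unchanged; to keep the result, we assign it back to heights.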
c(heights, 1.62)
[1] 1.70 1.75 1.62
heights
[1] 1.70 1.75
heights <- c(heights, 1.62)
heights
[1] 1.70 1.75 1.62
heights <- c(1.62, heights)
heights
[1] 1.62 1.70 1.75 1.62
c(heights, heights)
[1] 1.62 1.70 1.75 1.62 1.62 1.70 1.75 1.62
Modes
R variable types are called modes. Modes include: “numeric”, “character”, “logical”, “complex”, and so on.
mode(c(1,2))
[1] "numeric"
mode(c("abc","xyz"))
[1] "character"
mode(c(TRUE,FALSE))
[1] "logical"
mode(2+4i)
[1] "complex"
All elements in a vector must be of the same mode. This is required for efficient calculation over vectors. Other data structures, such as the list, can be used to store mixed-type data.
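To see what the single-mode rule means in practice, the following small sketch mixes modes inside c(). R does not raise an error; it converts every element to one common mode. The variable names mixed and flags are only illustrative.

```r
# Mixing modes in c() does not fail; R converts every element to one common mode.
mixed <- c(1, "abc", TRUE)
mixed         # every element has become a string
mode(mixed)   # "character"

# Logical values become numbers when combined with numbers (TRUE -> 1, FALSE -> 0).
flags <- c(10, TRUE, FALSE)
flags
mode(flags)   # "numeric"
```

This silent conversion is convenient, but it also means that one stray string in numeric data turns the whole vector into text.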
Vector arithmetic
If you add two vectors, both with the same number of elements, corresponding elements are added:
c(1,4,9) + c(2,16,5)
[1] 3 20 14
The same rule applies to all basic operations:
c(1,4,9) * c(2,16,5)
[1] 2 64 45
c(1,4,9) / c(2,16,5)
[1] 0.50 0.25 1.80
Logical comparisons such as < or > also follow this rule.
3 > 2
[1] TRUE
c(1,4,9) > c(2,16,5)
[1] FALSE FALSE TRUE
If we add a vector and a single number, the single number is recycled until its length matches the other vector.
c(1,4,9) + c(5) # converted to: c(1,4,9) + c(5,5,5)
[1] 6 9 14
Same goes for other operations:
c(1,4,9) < 5 # converted to: c(1,4,9) < c(5,5,5)
[1] TRUE TRUE FALSE
c(1,4,9)^2 # converted to: c(1,4,9) ^ c(2,2,2)
[1] 1 16 81
If the operation is done with two vectors with different sizes, you might get a warning:
c(1,4,9) + c(2,3)
Warning in c(1, 4, 9) + c(2, 3): longer object length is not a multiple of
shorter object length
[1] 3 7 11
However, the following works without warning:
c(1,4,9,10) + c(2,3) # converted to c(1,4,9,10) + c(2,3,2,3)
[1] 3 7 11 13
R recycles the shorter vector until its length matches the longer vector. With lengths 3 and 2, the longer length is not a multiple of the shorter one: the shorter vector is recycled to c(2,3,2), cutting its last repetition short, so R issues a warning.
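The recycling rule can be spelled out by hand with rep(), which is described later in this chapter. This is only a sketch to make the rule visible; the names long, short and recycled are chosen for the illustration.

```r
long  <- c(1, 4, 9, 10)
short <- c(2, 3)

# Spell out the recycling by hand: repeat `short` until it is as long as `long`.
recycled <- rep(short, length.out = length(long))
recycled

# The automatic and the explicit versions give the same result.
identical(long + short, long + recycled)
```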
Vectorized functions
These are built-in R functions that operate on vectors as a whole, optimized to process them fast.
sum()
Adds up all elements in a vector and returns the total.
sum(c(1,4,9))
[1] 14
sum(1:1000)
[1] 500500
cumsum()
Adds up the elements at every step, returns a vector of cumulative totals up to that element.
cumsum(1:10)
[1] 1 3 6 10 15 21 28 36 45 55
prod(), cumprod()
Similar to sum() and cumsum(), but for products of elements.
prod(c(1,4,9))
[1] 36
prod(1:5) # 5!
[1] 120
cumprod(1:5)
[1] 1 2 6 24 120
Mathematical functions
Familiar mathematical functions are all designed to apply to vectors elementwise.
The square root function:
sqrt(c(4,9,16))
[1] 2 3 4
The constant pi holds the value of \(\pi\).
x <- c(0, pi/4, pi/2, 3*pi/4, pi)
# alternatively: x <- 0:4 * pi/4
sin(x)
[1] 0.000000e+00 7.071068e-01 1.000000e+00 7.071068e-01 1.224647e-16
The last value is not exactly zero because pi is stored with finite precision; a result on the order of 1e-16 is rounding error.
The exponential function \(\mathrm{e}^x\)
exp(1:5)
[1] 2.718282 7.389056 20.085537 54.598150 148.413159
The natural logarithm function (inverse of exp):
log(exp(1:5))
[1] 1 2 3 4 5
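Elementwise functions and summary functions such as sum() combine naturally. As a sketch, the sample standard deviation can be built step by step from the pieces above and compared with sd(), R's built-in function for the same quantity. The data and the intermediate names (deviations, squared, manual_sd) are made up for the illustration.

```r
x <- c(4, 9, 16, 25)   # made-up sample values

# Sample standard deviation, built step by step from vectorized pieces:
deviations <- x - mean(x)           # the mean is recycled across x
squared    <- deviations^2          # elementwise square
manual_sd  <- sqrt(sum(squared) / (length(x) - 1))

# sd() is R's built-in function for the same quantity.
all.equal(manual_sd, sd(x))
```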
Representing missing data
- In many data sets, some data are missing, i.e., there are observations for which the values are not available.
- In R, missing values are denoted with NA.
- Any vector can contain missing values.
weights <- c(65, NA, 61)
names <- c("Can","Cem",NA)
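Missing values propagate through calculations: a total that includes an unknown value is itself unknown. The sketch below shows this, together with the is.na() function and the na.rm argument that sum() and mean() accept. The vector w is a hypothetical stand-in for the weights above.

```r
w <- c(65, NA, 61)   # hypothetical weights with one missing value

is.na(w)                 # flags which elements are missing
sum(w)                   # NA: the total is unknown if any element is unknown
sum(w, na.rm = TRUE)     # drop missing values first, then add
mean(w, na.rm = TRUE)    # same idea for the mean
```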
Vector element names
For readability, we can assign name labels to the elements of a data vector.
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
heights
Can Cem Hande
1.70 1.75 1.62
weights <- c(Can=65, Cem=66, Hande=61)
weights
Can Cem Hande
65 66 61
We can retrieve these names with the names() function.
names(heights)
[1] "Can" "Cem" "Hande"
We can assign names to the elements of a vector that already exists.
heights <- c(1.70, 1.75, 1.62)
names(heights) <- c("Can","Cem","Hande")
heights
Can Cem Hande
1.70 1.75 1.62
If for some reason we want to remove the names, we use the unname() function.
unname(heights)
[1] 1.70 1.75 1.62
The original vector is not changed by this function call, because we did not assign the result to heights.
heights
Can Cem Hande
1.70 1.75 1.62
Vector indexing
We can access a single element of a vector by providing the index of the element in square brackets.
heights
Can Cem Hande
1.70 1.75 1.62
heights[1] # first element
Can
1.7
heights[3] # third element
Hande
1.62
We can select a slice of the vector by providing a range inside brackets.
heights[1:2] # select from element 1 to element 2, inclusive.
Can Cem
1.70 1.75
We can also give a vector consisting of element indices.
heights[c(1,3)] # select elements 1 and 3.
Can Hande
1.70 1.62
The indices do not have to be in order:
heights[c(2,1,3)]
Cem Can Hande
1.75 1.70 1.62
We can select the same element more than once.
heights[c(1,1,3,2,3)]
Can Can Hande Cem Hande
1.70 1.70 1.62 1.75 1.62
We can provide a Boolean (true/false) vector for indexing. This will select only the elements with corresponding TRUE values.
heights
Can Cem Hande
1.70 1.75 1.62
heights[c(T,F,F)] # T is a shorthand for TRUE, F is for FALSE.
Can
1.7
We can exclude elements using negative indices.
heights[-1] # exclude first element.
Cem Hande
1.75 1.62
heights[c(-1,-3)] # exclude 1st and 3rd elements
Cem
1.75
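Indices are often computed rather than typed in. The following sketch uses length(), covered later in this chapter, to reach the last element without knowing the size of the vector in advance, and shows what happens when an index points past the end. The vector v is a hypothetical example.

```r
v <- c(10, 20, 30)   # hypothetical example vector

v[length(v)]     # last element, whatever the length is
v[-length(v)]    # everything except the last element
v[5]             # an index past the end gives NA rather than an error
```

Because an out-of-range index yields NA instead of stopping the program, an off-by-one mistake can go unnoticed until the NA shows up in a later result.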
Using names to select elements
If the elements are given names consisting of strings, we can use these names in brackets instead of indices.
heights["Can"]
Can
1.7
heights[c("Can","Can","Hande","Cem","Hande")]
Can Can Hande Cem Hande
1.70 1.70 1.62 1.75 1.62
Modify element values in a vector
heights
Can Cem Hande
1.70 1.75 1.62
heights[1] <- 1.72
heights
Can Cem Hande
1.72 1.75 1.62
heights[1] <- 1.70 # restore the original value
Insert values to an existing vector
A vector’s size is fixed when it is created, and its elements are stored contiguously (side by side) in memory. Therefore we cannot insert an element into an existing vector in place. However, we can build a new vector that contains the extra element and reassign it to the same name.
heights
Can Cem Hande
1.70 1.75 1.62
heights <- c(heights[1:2], Lale=1.76, heights[3])
heights
Can Cem Lale Hande
1.70 1.75 1.76 1.62
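Base R also provides append(), which does the same slicing and recombining for us through its after argument. The sketch below repeats the insertion with it; the names people and with_lale are chosen only for this illustration.

```r
people <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)   # same values as the example

# append(x, values, after) returns a new vector with `values`
# placed after position `after` of x; `people` itself is not modified.
with_lale <- append(people, c(Lale = 1.76), after = 2)
with_lale
names(with_lale)
```

As with c(), the result has to be assigned to a name if we want to keep it.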
Delete elements from vector
Again, we cannot directly remove an element from an existing vector, but we can create a new vector without the element we want to delete, and reassign it to the same name.
heights
Can Cem Lale Hande
1.70 1.75 1.76 1.62
heights <- heights[-3] # exclude element 3
heights
Can Cem Hande
1.70 1.75 1.62
Getting the length of a vector
We can get the number of elements in a vector using the length() function.
length(heights)
[1] 3
length(10:17)
[1] 8
Vector filtering
- Apply a Boolean function (e.g., greater than, less than, …) to each element of the vector.
- Returns a Boolean vector according to the result on each element.
heights
Can Cem Hande
1.70 1.75 1.62
heights > 1.65
Can Cem Hande
TRUE TRUE FALSE
Using this Boolean vector, we can select data points satisfying the condition.
tall_people <- heights>1.65
tall_people
Can Cem Hande
TRUE TRUE FALSE
heights[tall_people]
Can Cem
1.70 1.75
Obviously, this can be done in a single line, too.
heights[heights>1.65]
Can Cem
1.70 1.75
One can also filter a vector according to another vector’s values.
heights
Can Cem Hande
1.70 1.75 1.62
weights
Can Cem Hande
65 66 61
weights[ heights > 1.65 ] # weights of people who are taller than 1.65
Can Cem
65 66
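Conditions can also be combined before filtering. The sketch below uses the elementwise operators & (and), | (or) and ! (not) on the same data; the short names h and w are used here only to keep the example self-contained.

```r
h <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)   # heights from the example
w <- c(Can = 65,   Cem = 66,   Hande = 61)     # weights from the example

# & is elementwise AND, | is elementwise OR, ! negates each element.
tall_and_light <- h > 1.65 & w < 66
tall_and_light
h[tall_and_light]        # taller than 1.65 AND lighter than 66
h[h < 1.65 | w > 65]     # shorter than 1.65 OR heavier than 65
h[!(h > 1.65)]           # NOT taller than 1.65
```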
Modify a vector by filtering
We can use filtering to selectively change only the elements that satisfy a condition.
Example: For people who weigh more than 65 kg, decrease the weight by 1 kg.
weights
Can Cem Hande
65 66 61
weights[weights > 65] - 1
Cem
65
weights[weights > 65] <- weights[weights > 65] - 1
weights
Can Cem Hande
65 65 61
Get indices of elements that satisfy a condition
The which() function returns the indices (and labels, if available) of elements in a vector for which a Boolean function returns TRUE.
heights <- c(Can=1.70, Cem=1.75, Hande=1.62)
weights <- c(Can=65, Cem=66, Hande=61)
heights > 1.65
Can Cem Hande
TRUE TRUE FALSE
which(heights > 1.65)
Can Cem
1 2
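Two related base R functions, which.max() and which.min(), return the position of the largest and the smallest element directly. A brief sketch, with h standing in for the heights vector:

```r
h <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)

which.max(h)             # position (and label) of the tallest person
which.min(h)             # position (and label) of the shortest person
names(h)[which.max(h)]   # just the label
```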
Using all() and any()
- We use the all() function to check if all elements in a vector are TRUE.
- We use the any() function to check if at least one of the elements in a vector is TRUE.
heights
Can Cem Hande
1.70 1.75 1.62
all(heights > 1.60) # TRUE
[1] TRUE
all(heights > 1.70) # FALSE
[1] FALSE
any(heights > 1.70) # TRUE
[1] TRUE
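Both functions interact with missing values in a way worth knowing. The sketch below uses a hypothetical vector named measured with one NA to show how the answers change, and how the na.rm argument restricts the check to the known values.

```r
measured <- c(1.70, NA, 1.62)   # hypothetical heights with one missing value

any(is.na(measured))                 # is anything missing?
all(measured > 1.60)                 # NA: the missing value could go either way
all(measured > 1.60, na.rm = TRUE)   # check only the known values
any(measured > 1.65)                 # TRUE: one known TRUE is enough
```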
Generating vectors with repeated elements
The rep() function can be used to replicate values or vectors a specified number of times.
rep(3,10)
[1] 3 3 3 3 3 3 3 3 3 3
rep("abc",5)
[1] "abc" "abc" "abc" "abc" "abc"
rep(c(1,2,3),5)
[1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
rep(c(1,2,3),length.out=10)
[1] 1 2 3 1 2 3 1 2 3 1
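rep() has two more arguments that are often useful: each repeats every element in place, and times accepts a vector that gives a separate count for each element. A short sketch:

```r
rep(c(1, 2, 3), each = 2)             # 1 1 2 2 3 3
rep(c("a", "b"), times = c(2, 3))     # "a" "a" "b" "b" "b"
rep(c(1, 2), each = 2, times = 3)     # `each` is applied first, then `times`
```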
Generating sequences with seq()
The seq() function generates a vector of numbers in arithmetic progression. It is a generalization of the colon (:) operator.
seq(4,9) # same as 4:9
[1] 4 5 6 7 8 9
Unlike the colon operator, you can specify a step size with the by parameter.
seq(from=12, to=29, by=3)
[1] 12 15 18 21 24 27
Alternatively, if you want a fixed number of elements, you can specify the length.out parameter.
seq(from=1.1, to=6, length.out=10)
[1] 1.100000 1.644444 2.188889 2.733333 3.277778 3.822222 4.366667 4.911111
[9] 5.455556 6.000000
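When a sequence is meant to serve as the indices of a vector, the helpers seq_along() and seq_len() are a safer choice than the colon operator, because they behave sensibly for empty vectors. A sketch, with v and empty as illustrative names:

```r
v <- c(1.70, 1.75, 1.62)
seq_along(v)      # 1 2 3: one index per element of v
seq_len(5)        # 1 2 3 4 5

empty <- numeric(0)   # a numeric vector with no elements
1:length(empty)       # 1 0: the colon operator counts DOWN from 1 to 0
seq_along(empty)      # integer(0): correctly, no indices at all
```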
Sorting a vector
The sort() function returns a vector with elements sorted in increasing order.
sort(heights)
Hande Can Cem
1.62 1.70 1.75
If you want reverse (decreasing) order, set the decreasing parameter to TRUE.
sort(heights, decreasing = TRUE)
Cem Can Hande
1.75 1.70 1.62
Suppose we want to sort the weights in a special way: the first element is the weight of the shortest person, and the last element is the weight of the tallest person.
In order to do that, we compute an ordering.
order(heights)
[1] 3 1 2
This shows that when heights is sorted, element 3 would be in the first location, element 1 in the second location, and element 2 in the last location.
We can use this ordering with the weights vector to get what we want.
weights[order(heights)] # return the weights of people ordered by their heights.
Hande Can Cem
61 65 66
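Like sort(), order() accepts decreasing = TRUE. The sketch below lists the weights from the tallest person to the shortest; h, w and tallest_first are names chosen for the illustration.

```r
h <- c(Can = 1.70, Cem = 1.75, Hande = 1.62)
w <- c(Can = 65,   Cem = 66,   Hande = 61)

tallest_first <- order(h, decreasing = TRUE)
tallest_first        # positions in h, from tallest to shortest
w[tallest_first]     # weights listed from the tallest person to the shortest
```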
If you have a named vector, you can sort it by the names:
heights[sort(names(heights))]
Can Cem Hande
1.70 1.75 1.62