2.2 Data structures

The data we are going to work with is usually stored as a table or a vector of values. One of the basic operations on tables and vectors is selecting a subset of rows, columns or values. More information about various methods of data processing can be found in Section 2.5.

2.2.1 Vectors

Vectors, or sequences of values of the same type, are the basic data structure in R. We can create sequences of numbers, strings, or logical values. For R, even a single number is a one-element vector.

Vectors can be created from smaller vectors using the c() function.

c(2, 3, 5, 7, 11, 13, 17)
## [1]  2  3  5  7 11 13 17
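
As a small illustration of the points above, the sketch below joins two existing vectors with c() and shows what happens when values of different types are mixed. The variable names (small_primes, more_primes, mixed) are ours and serve only as an example.

```r
# Combine two smaller vectors into one longer vector
small_primes <- c(2, 3, 5)
more_primes  <- c(7, 11)
primes <- c(small_primes, more_primes)
primes
## [1]  2  3  5  7 11

# A single number is a vector of length one
length(42)
## [1] 1

# A vector holds one type only, so mixing numbers, strings and
# logical values converts every element to a character string
mixed <- c(1, "a", TRUE)
class(mixed)
## [1] "character"
```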

One particular group of vectors are sequences of consecutive numbers. The easiest way to create such sequences is by using the : operator.

-3:3
## [1] -3 -2 -1  0  1  2  3

If we need sequences of numbers with a step value other than 1, we may use the seq() function.

seq(from = 0, to = 100, by = 11)
##  [1]  0 11 22 33 44 55 66 77 88 99
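
The seq() function is not limited to increasing sequences with a fixed step. The short sketch below, with variable names chosen only for illustration, shows a negative step and the length.out argument, which asks for a given number of evenly spaced values.

```r
# A negative step counts down
countdown <- seq(from = 10, to = 1, by = -3)
countdown
## [1] 10  7  4  1

# length.out fixes the number of values instead of the step
quarters <- seq(from = 0, to = 1, length.out = 5)
quarters
## [1] 0.00 0.25 0.50 0.75 1.00
```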

Many useful vectors are available in the basic packages, for example the month names and the letters of the alphabet.

month.name
##  [1] "January"   "February"  "March"     "April"     "May"       "June"     
##  [7] "July"      "August"    "September" "October"   "November"  "December"
LETTERS
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"

Indexing vectors

You can use the [] operator to refer to the elements of a vector. Inside the square brackets, provide the vector of indices to read. Vector elements are numbered from 1. We can easily show indexing by selecting a subset of letters from the LETTERS vector.

LETTERS[ 5:10 ]
## [1] "E" "F" "G" "H" "I" "J"
LETTERS[ c(1, 2, 9:14) ]
## [1] "A" "B" "I" "J" "K" "L" "M" "N"

The length() function returns the number of elements in a vector. It is useful, for example, when we want to select every second element of a vector. In the code below, we use seq() to build a sequence of indices that picks every second letter from the LETTERS vector.

every_second <- seq(from = 1, to = length(LETTERS), by = 2)
LETTERS[ every_second ]
##  [1] "A" "C" "E" "G" "I" "K" "M" "O" "Q" "S" "U" "W" "Y"

Indices also accept negative numbers. This way, we can select all values except those with the indices provided.

month.name[ -(5:9) ]
## [1] "January"  "February" "March"    "April"    "October"  "November" "December"
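
Negative indices combine well with length(), for instance to drop the first and the last element. The sketch below also shows that positive and negative indices cannot be mixed in a single call. The names first_five and mixed_attempt are illustrative.

```r
# Drop the first and the last element of a short vector
first_five <- LETTERS[1:5]
trimmed <- first_five[ -c(1, length(first_five)) ]
trimmed
## [1] "B" "C" "D"

# Mixing positive and negative indices is an error;
# try() lets the script continue after the failure
mixed_attempt <- try(first_five[ c(-1, 2) ], silent = TRUE)
inherits(mixed_attempt, "try-error")
## [1] TRUE
```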

Vector elements can be assigned names when constructing a vector with the c() function, or at a later time with the names() function. If a given vector has names, then its elements can be referred to with those names instead of the indices.

value <- c(pawn = 1, knight = 3, bishop = 3,
rook = 5, queen = 9, king = Inf)
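
The same names can be attached after the vector has been created. The sketch below uses a separate vector, piece_value (a name chosen only for this example), to show names() and indexing by name.

```r
# Create an unnamed vector first, then assign names to it
piece_value <- c(1, 3, 3, 5, 9, Inf)
names(piece_value) <- c("pawn", "knight", "bishop", "rook", "queen", "king")

# Refer to a single element by its name
piece_value["queen"]
## queen 
##     9

# Refer to several elements at once
piece_value[ c("rook", "pawn") ]
## rook pawn 
##    5    1
```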

Getting a subvector is not the only aim of indexing. Often enough, we want to change the order of elements in a vector. In the sample code below, we reverse the order of elements by providing a new set of indices.

value[ 6:1 ]
##   king  queen   rook bishop knight   pawn 
##    Inf      9      5      3      3      1
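
Reordering by index is also how sorting works in practice. As a sketch on a made-up vector (heights is not part of the original example), rev() reverses a vector, while order() returns the indices that put the values in increasing order.

```r
heights <- c(ann = 170, bob = 182, cid = 165)

# Reverse the order of elements
rev(heights)
## cid bob ann 
## 165 182 170

# order() gives the indices that sort the values;
# using them as an index returns the sorted vector
sorted_heights <- heights[ order(heights) ]
sorted_heights
## cid ann bob 
## 165 170 182
```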

Logical values are another useful method to select a subsequence of elements from a vector. The code below presents how we can choose all elements of a vector whose values are less than 6.

value[ value < 6 ]
##   pawn knight bishop   rook 
##      1      3      3      5

Changing vector elements

Vector indices are also useful when we only want to change some of the elements. We can choose a subset of vector elements, and then assign a value to it using the <- operator. In the example below, we change the value of the fourth and fifth elements of the value vector.

value[ c(4, 5) ] <- c(6, 7)
value
##   pawn knight bishop   rook  queen   king 
##      1      3      3      6      7    Inf
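
Assignment works with every kind of index, not only with positions. The sketch below, on a made-up scores vector, replaces elements selected by a logical condition and by name.

```r
scores <- c(anna = 12, ben = -3, carl = 7, dora = -1)

# Replace every negative score with zero
scores[ scores < 0 ] <- 0

# Change a single element selected by its name
scores["carl"] <- 10
scores
## anna  ben carl dora 
##   12    0   10    0
```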

In order to add new elements to the vector, we can use the c() function.

(value <- c(value, new_piece = 5))
##      pawn    knight    bishop      rook     queen      king new_piece 
##         1         3         3         6         7       Inf         5
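
The c() function always adds elements at the end (or at the beginning, if the order of arguments is swapped). If the new elements should land in the middle, one option is the base R function append() with its after argument. A minimal sketch on an illustrative vector x:

```r
x <- c(10, 20, 30)

# after = 1 inserts the new values right after the first element
x_extended <- append(x, c(11, 12), after = 1)
x_extended
## [1] 10 11 12 20 30

# The original vector is unchanged unless we assign the result back
length(x)
## [1] 3
```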

2.2.2 Data frames

The most widely used data structure in data analysis is called a data frame. Each data frame is a set of columns (variables), and each column is a vector of equal length. Columns may have values of different types. We will show you how to work with data frames using a small dataset called cats_birds. The seven columns represent various traits of selected cats and birds.

cats_birds <- read.csv("cats_birds.csv", sep = ";", dec = ",")
head(cats_birds)
##    species weight length velocity habitat lifespan team
## 1    Tiger    300    2.5       60    Asia       25  Cat
## 2     Lion    200    2.0       80  Africa       29  Cat
## 3   Jaguar    100    1.7       90 America       15  Cat
## 4     Puma     80    1.7       70 America       13  Cat
## 5 Panthera     70    1.4       85    Asia       21  Cat
## 6  Cheetah     60    1.4      115  Africa       12  Cat

Indexing data frames

Data frames are structured as two-dimensional tables. In order to specify their elements, we must provide both the row and the column. To that end, we can use the [] or $ operators. The latter will be discussed at the end of the current section.

When using the [] operator, we should separately specify the indices of rows and columns, and separate them with "," (commas). If indices of rows or columns are not specified, then all rows/columns will be selected.

The example below shows how to choose three rows from the cats_birds data frame.

cats_birds[ c(1, 3 , 5) , ]
##    species weight length velocity habitat lifespan team
## 1    Tiger    300    2.5       60    Asia       25  Cat
## 3   Jaguar    100    1.7       90 America       15  Cat
## 5 Panthera     70    1.4       85    Asia       21  Cat

We can use indices for both rows and columns at the same time.

cats_birds[ c(1, 3, 5) , 2:5]
##   weight length velocity habitat
## 1    300    2.5       60    Asia
## 3    100    1.7       90 America
## 5     70    1.4       85    Asia

Each dimension can be indexed using the same rules that are applied to vectors. This means that (1) logical conditions can be indices, (2) we can refer to names, and (3) we can use negative indices.

In the example below, we select only the animals that reach a velocity greater than 100 km/h. For each row that satisfies this condition, we show the species, velocity and length of the animal.

cats_birds[ cats_birds[,"velocity"] > 100,
            c("species", "velocity", "length")]
##             species velocity length
## 6           Cheetah      115    1.4
## 8             Swift      170    0.2
## 10     Golden eagle      160    0.9
## 11 Peregrine falcon      110    0.5
## 13        Albatross      120    0.8
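
The three indexing rules can be tried without the CSV file. The sketch below builds a small stand-in data frame by hand, called animals (a hypothetical name), from four of the rows shown earlier, and applies a combined logical condition, a column name, and negative indices.

```r
# A hand-made stand-in for a few rows of cats_birds
animals <- data.frame(
  species  = c("Tiger", "Lion", "Jaguar", "Cheetah"),
  weight   = c(300, 200, 100, 60),
  velocity = c(60, 80, 90, 115)
)

# (1) Logical condition on rows; & requires both parts to hold
fast_and_heavy <- animals[ animals$velocity > 70 & animals$weight > 90, ]
fast_and_heavy
##   species weight velocity
## 2    Lion    200       80
## 3  Jaguar    100       90

# (2) Column selected by name, row selected by a condition
animals[ animals$species == "Cheetah", "velocity" ]
## [1] 115

# (3) Negative indices: drop the first row and the second column
animals[ -1, -2 ]
##   species velocity
## 2    Lion       80
## 3  Jaguar       90
## 4 Cheetah      115
```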

Individual columns of the data frame are vectors. If we refer to a single column, we will get a vector instead of a data frame. This is a very convenient property which makes the code much shorter on many occasions. However, there are situations where this notation leads to mistakes, which is why we need to be careful when selecting a single column.

We can choose a column by providing its number or name. The command below has the same meaning as cats_birds[,4].

cats_birds[, "velocity"]
##  [1]  60  80  90  70  85 115  65 170  70 160 110 100 120
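
One way to avoid the pitfall mentioned above is the drop argument of the [] operator. With drop = FALSE, a single selected column stays a one-column data frame. A minimal sketch on a hand-made data frame (animals is a hypothetical stand-in for cats_birds):

```r
animals <- data.frame(
  species  = c("Tiger", "Lion"),
  velocity = c(60, 80)
)

# Default behaviour: a single column is simplified to a vector
as_vector <- animals[, "velocity"]
is.data.frame(as_vector)
## [1] FALSE

# drop = FALSE keeps the data frame structure
as_frame <- animals[, "velocity", drop = FALSE]
is.data.frame(as_frame)
## [1] TRUE
dim(as_frame)
## [1] 2 1
```

Keeping the data frame is mostly useful in code where the number of selected columns is not known in advance, so that the result always has the same type.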

When we want to select a single column from a data frame, we can also use the $ operator. It shortens the code by 4 characters. At the same time, it becomes much clearer that the result is a vector.

After the $ operator, we can provide a full variable name, or just its beginning. If we provide only the beginning, we will get the column whose name starts with it, as long as exactly one column matches.

cats_birds$velocity
##  [1]  60  80  90  70  85 115  65 170  70 160 110 100 120
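
Partial matching of names can be tried on a small hand-made data frame (animals is a hypothetical stand-in for cats_birds). The sketch also shows that the [[]] operator, unlike $, expects the exact name by default.

```r
animals <- data.frame(
  species  = c("Tiger", "Lion"),
  velocity = c(60, 80)
)

# "vel" is the beginning of exactly one column name
animals$vel
## [1] 60 80

# [[ ]] does not match partial names unless asked to
animals[["vel"]]
## NULL
animals[["vel", exact = FALSE]]
## [1] 60 80
```

Partial names save typing in interactive work, but full names are safer in scripts, because adding a new column can make a short prefix ambiguous.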

The $ operator also comes in handy when we add a new column to a data frame. Assigning to a name that does not exist yet adds a column with that name at the end of the data frame. Below, we convert the velocity from km/h to miles/h. Since 1 mile/h is approximately 1.6 km/h, we divide by 1.6.

cats_birds$velocity_miles <- cats_birds$velocity / 1.6
head(cats_birds, 2)
##   species weight length velocity habitat lifespan team velocity_miles
## 1   Tiger    300    2.5       60    Asia       25  Cat           37.5
## 2    Lion    200    2.0       80  Africa       29  Cat           50.0

Another way to add a column is by using the cbind() or data.frame() function.

The rbind() function binds two data frames row by row, thus allowing us to add new rows. The dplyr package provides faster versions of cbind() and rbind(). Their names are bind_cols() and bind_rows().
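
A minimal sketch of both base R functions, using small hand-made data frames (cats, extra and new_row are illustrative names). Note that cbind() expects the same number of rows in both arguments, and rbind() expects the same column names.

```r
cats  <- data.frame(species = c("Tiger", "Lion"), weight = c(300, 200))
extra <- data.frame(habitat = c("Asia", "Africa"))

# cbind() adds columns; both data frames have two rows
cats_wide <- cbind(cats, extra)
cats_wide
##   species weight habitat
## 1   Tiger    300    Asia
## 2    Lion    200  Africa

# rbind() adds rows; the new row has the same column names
new_row   <- data.frame(species = "Jaguar", weight = 100, habitat = "America")
cats_long <- rbind(cats_wide, new_row)
cats_long
##   species weight habitat
## 1   Tiger    300    Asia
## 2    Lion    200  Africa
## 3  Jaguar    100 America
```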

2.2.3 Lists

The third most popular data structure in R is the list. A list, just like a vector, is a sequence of values. Unlike a vector, however, a list can store elements of different types. A list element may also be a complex structure, such as a data frame or another list.

Lists are created with list(). The example below shows how to create a list that contains a vector of numbers, a vector of letters, and a vector of logical values.

triplet <- list(numbers = 1:5, letters = LETTERS,
                logic = c(TRUE, TRUE, TRUE, FALSE))
triplet
## $numbers
## [1] 1 2 3 4 5
## $letters
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
## $logic
## [1]  TRUE  TRUE  TRUE FALSE

You can access list elements with the [[]] operator. It selects exactly one element at a time.

If the elements of a list have names, they can also be referred to with the $ operator, in a way similar to data frames. Internally, data frames are represented as lists of equally long vectors.

triplet[[2]]
##  [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
triplet$numbers
## [1] 1 2 3 4 5
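
The single-bracket operator [] also works on lists, but it returns a shorter list rather than the element itself, which is how several elements can be selected at once. The sketch below uses a small list of its own (the name pair is illustrative) and also checks the statement that data frames are lists.

```r
pair <- list(numbers = 1:5, logic = c(TRUE, FALSE))

# [[ ]] returns the element itself
class(pair[["numbers"]])
## [1] "integer"

# [ ] returns a list that contains the selected elements
class(pair["numbers"])
## [1] "list"
length(pair[ c("numbers", "logic") ])
## [1] 2

# A data frame is a list of columns
is.list(data.frame(x = 1:3))
## [1] TRUE
```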

Lists are very useful when we need to process multiple data sets or when we want to generate multiple models. Both data sets and models can become list elements.
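
As a sketch of this use, the code below stores two regression models in a named list. It relies on the cars data set and the lm() function that ship with R. The list name models and the two formulas are our own illustrative choices, not part of the original text.

```r
# Two models fitted to the built-in cars data set, kept in one list
models <- list(
  linear    = lm(dist ~ speed, data = cars),
  quadratic = lm(dist ~ speed + I(speed^2), data = cars)
)

# List elements are available by name
class(models$linear)
## [1] "lm"

# The same operation can be applied to every model in the list
n_coef <- sapply(models, function(m) length(coef(m)))
n_coef
##    linear quadratic 
##         2         3
```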

R contains many useful functions to work with lists. We will discuss them in Section 3.9.1.