2.2 Data structures
The data we are going to work with is usually stored as a table or a vector of values. One of the basic operations on tables and vectors is selecting a subset of rows, columns or values. More information about various methods of data processing can be found in Section 2.5.
2.2.1 Vectors
Vectors, or sequences of values of the same type, are the basic data structure in R. We can create sequences of numbers, strings, or logical values. For R, even a single number is a one-element vector.
Vectors can be created from smaller vectors using the c() function.
c(2, 3, 5, 7, 11, 13, 17)
## [1] 2 3 5 7 11 13 17
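As a small sketch of the two points above (the variable names are illustrative and do not come from the text), c() can also join existing vectors, and a single number behaves as a vector of length one:

```r
# Illustrative names, not part of the original example
small_primes <- c(2, 3, 5)
more_primes  <- c(7, 11, 13)

# c() flattens its arguments into one longer vector
primes <- c(small_primes, more_primes, 17)
primes
## [1]  2  3  5  7 11 13 17

# A single number is a vector of length one
length(42)
## [1] 1
```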
One important kind of vector is a sequence of consecutive integers. The easiest way to create such a sequence is with the : operator.
-3:3
## [1] -3 -2 -1 0 1 2 3
If we need sequences of numbers with a step value other than 1, we may use the seq() function.
seq(from = 0, to = 100, by = 11)
## [1] 0 11 22 33 44 55 66 77 88 99
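A few more variations may help here. The sketch below uses only the standard arguments of seq() (from, to, by, length.out); the particular numbers are chosen for illustration.

```r
# A fractional step between 0 and 1
quarters <- seq(from = 0, to = 1, by = 0.25)
quarters
## [1] 0.00 0.25 0.50 0.75 1.00

# Ask for a number of evenly spaced points instead of a step
five_points <- seq(from = 0, to = 100, length.out = 5)
five_points
## [1]   0  25  50  75 100

# A decreasing sequence needs a negative step
countdown <- seq(from = 10, to = 0, by = -5)
countdown
## [1] 10  5  0
```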
Many useful vectors are predefined in base R. Examples include the month names and the letters of the alphabet.
month.name
## [1] "January" "February" "March" "April" "May" "June"
## [7] "July" "August" "September" "October" "November" "December"
LETTERS
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
2.2.1.1 Indexing vectors
You can use the [] operator to refer to the elements of a vector. Inside the square brackets, provide the vector of indices to read. Vector elements are numbered from 1. We can easily show indexing by selecting a subset of letters from the LETTERS vector.
LETTERS[ 5:10 ]
## [1] "E" "F" "G" "H" "I" "J"
LETTERS[ c(1, 2, 9:14) ]
## [1] "A" "B" "I" "J" "K" "L" "M" "N"
The length() function returns the number of elements in a vector. For example, it is useful when we want to select every second element of a vector. In the code below, we use seq() to build a sequence of odd indices, and then use it to select every second letter from the LETTERS vector.
every_second <- seq(from = 1, to = length(LETTERS), by = 2)
LETTERS[ every_second ]
## [1] "A" "C" "E" "G" "I" "K" "M" "O" "Q" "S" "U" "W" "Y"
Indices also accept negative numbers. This way, we can select all values except those with the indices provided.
month.name[ -(5:9) ]
## [1] "January" "February" "March" "April" "October" "November" "December"
Vector elements can be assigned names when constructing a vector with the c() function, or at a later time with the names() function. If a given vector has names, then its elements can be referred to with those names instead of the indices.
value <- c(pawn = 1, knight = 3, bishop = 3,
           rook = 5, queen = 9, king = Inf)
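To make the role of names concrete, here is a minimal sketch. It repeats the definition of value so that it runs on its own; the points vector is a hypothetical second example, not part of the text.

```r
value <- c(pawn = 1, knight = 3, bishop = 3,
           rook = 5, queen = 9, king = Inf)

# Refer to elements by name instead of by position
value[ c("queen", "rook") ]
## queen  rook 
##     9     5 

# names() reads the names of a vector...
names(value)
## [1] "pawn"   "knight" "bishop" "rook"   "queen"  "king"  

# ...and can also assign names to a vector created without them
points <- c(1, 3, 3)
names(points) <- c("pawn", "knight", "bishop")
points
##   pawn knight bishop 
##      1      3      3 
```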
Getting a subvector is not the only aim of indexing. Often enough, we want to change the order of elements in a vector. In the sample code below, we reverse the order of elements by providing a new set of indices.
value[ 6:1 ]
## king queen rook bishop knight pawn
## Inf 9 5 3 3 1
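The new set of indices does not have to be typed by hand. As one possible illustration (an addition to the text, using the base function order()), the indices can be computed, for example to arrange the elements alphabetically by name:

```r
value <- c(pawn = 1, knight = 3, bishop = 3,
           rook = 5, queen = 9, king = Inf)

# order() returns the indices that would sort its argument;
# here the names are sorted, and the indices are used to reorder value
alphabetical <- value[ order(names(value)) ]
names(alphabetical)
## [1] "bishop" "king"   "knight" "pawn"   "queen"  "rook"  
```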
A logical vector offers another way to select elements from a vector: only the elements at positions where the index is TRUE are kept. The code below shows how to choose all elements of a vector whose values are less than 6.
value[ value < 6 ]
## pawn knight bishop rook
## 1 3 3 5
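The following sketch unpacks what happens inside the brackets. It redefines value so that it is self-contained; the combined condition and the call to which() are illustrative additions.

```r
value <- c(pawn = 1, knight = 3, bishop = 3,
           rook = 5, queen = 9, king = Inf)

# The comparison itself yields a logical vector, one entry per element
is_cheap <- value < 6
is_cheap
##   pawn knight bishop   rook  queen   king 
##   TRUE   TRUE   TRUE   TRUE  FALSE  FALSE 

# Conditions can be combined with & (and) and | (or)
middle <- value[ value >= 3 & value < 6 ]
middle
## knight bishop   rook 
##      3      3      5 

# which() converts a logical vector into positions
which(value == 3)
## knight bishop 
##      2      3 
```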
2.2.1.2 Changing vector elements
Vector indices are also useful when we only want to change some of the elements. We can choose a subset of vector elements, and then assign a value to it using the <- operator.
In the example below, we change the value of the fourth and fifth elements of the value vector.
value[ c(4,5) ] <- c(6,7)
value
## pawn knight bishop rook queen king
## 1 3 3 6 7 Inf
In order to add new elements to the vector, we can use the c() function. Wrapping the assignment in parentheses prints the result.
(value <- c(value, new_piece = 5))
## pawn knight bishop rook queen king new_piece
## 1 3 3 6 7 Inf 5
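Assignment works with every kind of index, not only with positions. The sketch below starts again from the original values and is only an illustration; the replacement numbers are arbitrary.

```r
# Start again from the original values
value <- c(pawn = 1, knight = 3, bishop = 3,
           rook = 5, queen = 9, king = Inf)

# Replace one element selected by name
value["rook"] <- 6

# Replace all elements that satisfy a condition
value[ value == 3 ] <- 3.5
value
##   pawn knight bishop   rook  queen   king 
##    1.0    3.5    3.5    6.0    9.0    Inf 
```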
2.2.2 Data frames
The most widely used data structure in data analysis is called a data frame. Each data frame is a set of columns (variables), and each column is a vector of equal length. Columns may have values of different types.
We will show you how to work with data frames using a small dataset called cats_birds. The seven columns represent various traits of selected cats and birds.
library("PogromcyDanych")
cats_birds <- read.csv("cats_birds.csv", sep=";", dec=",")
head(cats_birds)
## species weight length velocity habitat lifespan team
## 1 Tiger 300 2.5 60 Asia 25 Cat
## 2 Lion 200 2.0 80 Africa 29 Cat
## 3 Jaguar 100 1.7 90 America 15 Cat
## 4 Puma 80 1.7 70 America 13 Cat
## 5 Panthera 70 1.4 85 Asia 21 Cat
## 6 Cheetah 60 1.4 115 Africa 12 Cat
2.2.2.1 Indexing data frames
Data frames are structured as two-dimensional tables. In order to specify their elements, we must provide both the row and the column. To that end, we can use the [] or $ operators. The latter will be discussed at the end of the current section.
When using the [] operator, we specify the row indices and the column indices separately, separated by a comma. If the indices of rows or columns are omitted, then all rows or all columns are selected.
The example below shows how to choose three rows from the cats_birds data frame.
cats_birds[ c(1, 3, 5) , ]
## species weight length velocity habitat lifespan team
## 1 Tiger 300 2.5 60 Asia 25 Cat
## 3 Jaguar 100 1.7 90 America 15 Cat
## 5 Panthera 70 1.4 85 Asia 21 Cat
We can use indices for both rows and columns at the same time.
cats_birds[ c(1, 3, 5) , 2:5]
## weight length velocity habitat
## 1 300 2.5 60 Asia
## 3 100 1.7 90 America
## 5 70 1.4 85 Asia
Each dimension can be indexed using the same rules that are applied to vectors. This means that (1) logical conditions can be indices, (2) we can refer to names, and (3) we can use negative indices.
In the example below, we select only the animals that reach a velocity greater than 100 km/h. For each row that satisfies this condition, we show the species, velocity and length of the animal.
cats_birds[ cats_birds[,"velocity"] > 100,
            c("species", "velocity", "length")]
## species velocity length
## 6 Cheetah 115 1.4
## 8 Swift 170 0.2
## 10 Golden eagle 160 0.9
## 11 Peregrine falcon 110 0.5
## 13 Albatross 120 0.8
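The three indexing rules can be tried without the CSV file. The sketch below builds a small stand-in data frame, here called cats (a hypothetical name), from three columns of the six rows printed earlier.

```r
# A small stand-in for cats_birds, built from rows shown earlier
cats <- data.frame(
  species  = c("Tiger", "Lion", "Jaguar", "Puma", "Panthera", "Cheetah"),
  weight   = c(300, 200, 100, 80, 70, 60),
  velocity = c(60, 80, 90, 70, 85, 115)
)

# (1) a logical condition for rows, (2) names for columns
fast <- cats[ cats[, "velocity"] > 80, c("species", "velocity") ]
fast
##    species velocity
## 3   Jaguar       90
## 5 Panthera       85
## 6  Cheetah      115

# (3) negative indices: drop the first row and the second column
rest <- cats[ -1, -2 ]
dim(rest)
## [1] 5 2
```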
Individual columns of the data frame are vectors. If we refer to a single column, we will get a vector instead of a data frame. This is a very convenient property which makes the code much shorter on many occasions. However, code that expects a data frame can fail or give unexpected results when it receives a vector instead, which is why we need to be careful when selecting a single column.
We can choose a column by providing its number or name. The command below has the same meaning as cats_birds[,4].
cats_birds[,"velocity"]
## [1] 60 80 90 70 85 115 65 170 70 160 110 100 120
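When a one-column data frame is needed rather than a vector, base R offers the drop argument of the [] operator. This argument is not mentioned in the text above; the sketch uses a small hypothetical data frame called cats.

```r
cats <- data.frame(
  species  = c("Tiger", "Lion", "Jaguar"),
  velocity = c(60, 80, 90)
)

# Selecting one column returns a plain vector...
as_vector <- cats[ , "velocity"]
is.data.frame(as_vector)
## [1] FALSE

# ...unless drop = FALSE is passed, which keeps the data frame structure
as_frame <- cats[ , "velocity", drop = FALSE]
is.data.frame(as_frame)
## [1] TRUE
```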
When we want to select a single column from a data frame, we can also use the $ operator. It shortens the code by 4 characters. At the same time, it becomes much clearer that the result is a vector.
After the $ operator, we can provide a full variable name, or just the beginning of it. If we only provide the beginning, we will get the column whose name starts with the part we provide, as long as that part identifies the column uniquely.
cats_birds$velocity
## [1] 60 80 90 70 85 115 65 170 70 160 110 100 120
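The partial matching described above can be seen in a short sketch. The data frame cats is hypothetical; the comparison with [[]], an operator covered in the section on lists, is an addition showing that exact matching is its default.

```r
cats <- data.frame(species = c("Tiger", "Lion"), velocity = c(60, 80))

# $ accepts a unique prefix of a column name
by_prefix <- cats$velo
by_prefix
## [1] 60 80

# [[ ]] requires the exact name by default, so the prefix finds nothing
exact_only <- cats[["velo"]]
exact_only
## NULL
```

Spelling out the full name is usually the safer habit, since a prefix may stop being unique once new columns are added.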
The $ operator also comes in handy when we add a new column to our data frame. Assigning to a name that does not exist yet adds a new column with that name at the end of the data frame. Below, we convert the velocity from km/h to miles/h. Since 1 mile/h is approximately 1.6 km/h, we divide by 1.6.
cats_birds$velocity_miles <- cats_birds$velocity / 1.6
head(cats_birds, 2)
## species weight length velocity habitat lifespan team velocity_miles
## 1 Tiger 300 2.5 60 Asia 25 Cat 37.5
## 2 Lion 200 2.0 80 Africa 29 Cat 50.0
Another way to add a column is by using the cbind() or data.frame() function.
The rbind() function binds two data frames row by row, thus allowing us to add new rows. The dplyr package provides faster counterparts of cbind() and rbind(). Their names are bind_cols() and bind_rows().
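A minimal base R sketch of both operations follows. The data frames cats, cats_wide and extra are hypothetical, with values taken from the rows shown earlier; the dplyr functions are left out so that the example needs no extra packages.

```r
cats <- data.frame(species = c("Tiger", "Lion"), weight = c(300, 200))

# cbind() appends columns; the new vector needs one value per row
cats_wide <- cbind(cats, velocity = c(60, 80))

# rbind() appends rows; both data frames must have the same column names
extra <- data.frame(species = "Jaguar", weight = 100, velocity = 90)
cats_all <- rbind(cats_wide, extra)
cats_all
##   species weight velocity
## 1   Tiger    300       60
## 2    Lion    200       80
## 3  Jaguar    100       90
```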
2.2.3 Lists
The third most popular data structure in R is the list. A list, just like a vector, is a sequence of values. Unlike a vector, however, a list can store elements of different types. A list element may be another complex structure, such as a data frame or another list.
Lists are created with list(). The example below shows how to create a list with three elements: a vector of numbers, a vector of letters and a vector of logical values.
triplet <- list(numbers = 1:5, letters = LETTERS, logic = c(TRUE, TRUE, TRUE, FALSE))
triplet
## $numbers
## [1] 1 2 3 4 5
##
## $letters
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
##
## $logic
## [1] TRUE TRUE TRUE FALSE
You can access list elements with [[]]. This operator selects only a single element.
If the elements of a list have names, they can be referred to with the $ operator in a way similar to data frames. Internally, data frames are represented as lists of equally long vectors.
triplet[[2]]
## [1] "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S"
## [20] "T" "U" "V" "W" "X" "Y" "Z"
triplet[["logic"]]
## [1] TRUE TRUE TRUE FALSE
triplet$numbers
## [1] 1 2 3 4 5
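The sketch below contrasts [[]] with the single-bracket operator [], which the text above does not cover for lists. With [] the result is again a list, so several elements can be selected at once. The last line illustrates the remark that a data frame is a list of columns.

```r
triplet <- list(numbers = 1:5, letters = LETTERS,
                logic = c(TRUE, TRUE, TRUE, FALSE))

# [[ ]] extracts the element itself: here an integer vector
class(triplet[["numbers"]])
## [1] "integer"

# [ ] keeps the list wrapper and may select several elements at once
sub <- triplet[ c("numbers", "logic") ]
class(sub)
## [1] "list"
names(sub)
## [1] "numbers" "logic"  

# A data frame is itself a list of columns
is.list(data.frame(x = 1:3))
## [1] TRUE
```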
Lists are very useful when we need to process multiple data sets or when we want to generate multiple models. Both data sets and models can become list elements.
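As a rough illustration of this use, the sketch below stores two invented data sets in a list and processes them with the base functions sapply() and lapply(). The data and the intercept-only model are placeholders chosen only to keep the example short.

```r
# Hypothetical data sets, invented for this illustration
datasets <- list(
  small = data.frame(x = 1:3),
  large = data.frame(x = 1:10)
)

# Apply the same function to every element of the list
row_counts <- sapply(datasets, nrow)
row_counts
## small large 
##     3    10 

# Models can be stored the same way, one per list element
models <- lapply(datasets, function(d) lm(x ~ 1, data = d))
class(models$small)
## [1] "lm"
```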
R contains many useful functions to work with lists. We will discuss them in Section 3.9.1.