2.5 How to process data with the dplyr package
R has thousands (thousands!) of functions to process data. Fortunately, we only need a few of them to start working with data in a convenient way.
The 80/20 principle (i.e. the Pareto principle, named after Wilfred Pareto) applies to data processing. This means that we only need to know a small portion of all available functions to work with the majority of data sets (i.e. 80% of all possible data).
Hadley Wickham prepared two packages, dplyr
and tidyr
, which provide only those really important functions. Each function performs an elementary operation, but various functions can be combined to perform all typical data operations.
Wickham calls functions from those packages verbs, and compares the analysis process to sentence construction. The basic verbs in this data analysis are:
filter()
- choose the selected rows from the dataset,select()
- choose only the selected columns from the dataset,arrange()
- sort the selected rows based on the selected columns,mutate()
- add a new column with data or change an existing column,group_by()
/ ungroup() - group data based on the selected factor / remove grouping,summarise()
- find specific aggregates in each group,gather() / spread()
- transform data between wide and long formats.
Those basic verbs are described in subsequent sections of this chapter. More functions to explore data are presented in a cheat sheet developed by RStudio, available at their website [26] or at http://bit.ly/1LaYWBd.
2.5.1 How to filter rows
Filtering rows that satisfy a given condition or conditions is one of the most frequent data operations.
The filter()
function from the dplyr
package performs filtering. Its first argument is the dataset, and further arguments define logical conditions.
The result of this function contains rows the satisfy all of the conditions specified. When specifying the conditions, we can use column names from the dataset without additional links.
We will present that with an example. Below is an instruction that chooses only those offers where Model == "Corsa"
from the vehicles
dataset:
library("dplyr")
<- filter(vehicles, Model == "Corsa")
CorsaOnly head(CorsaOnly)
## Make Model Price HP Capacity Mileage Fuel Production
## 1 Opel Corsa 13450 70 1248 190000 diesel 2004
## 2 Opel Corsa 25990 75 1300 84000 diesel 2008
## 3 Opel Corsa 17990 80 1229 48606 gasoline 2007
## 4 Opel Corsa 23999 60 998 63000 gasoline 2009
## 5 Opel Corsa 16500 101 1700 118000 diesel 2006
## 6 Opel Corsa 27900 80 1229 73500 gasoline 2007
We can specify multiple conditions simultaneously. The example below finds rows with Corsa vehicles and Diesel engines that were manufactured in 2010.
<- filter(vehicles, Model == "Corsa", Production == 2010,
CorsaOnly == "diesel")
Fuel head(CorsaOnly)
## Make Model Price HP Capacity Mileage Fuel Production
## 1 Opel Corsa 49050.00 75 1300 100 diesel 2010
## 2 Opel Corsa 47202.34 76 1300 53000 diesel 2010
## 3 Opel Corsa 37900.00 95 1248 8300 diesel 2010
## 4 Opel Corsa 12200.00 75 1248 11378 diesel 2010
## 5 Opel Corsa 34900.00 95 1300 24000 diesel 2010
## 6 Opel Corsa 37900.00 75 1248 18500 diesel 2010
2.5.2 How to choose columns
Sometimes, we only need a portion of all columns from a dataset. We can remove the unnecessary columns to make our work quicker.
Another advantage of choosing only those columns that we really need is that it is easier to show data. Instead of showing all columns, even those that are unnecessary, it is better to only show those that are important to us.
The select()
function from the dplyr
package allows us to choose one or more variables from the data source. The first argument is the data source, and subsequent arguments define the columns that we want to choose.
The example below selects only three columns: make, model and price.
<- select(vehicles, Make, Model, Price)
threeColumns head(threeColumns)
## Make Model Price
## 1 Peugeot 206 8799
## 2 Peugeot 206 15500
## 3 Peugeot 206 11900
## 4 Peugeot 206 10999
## 5 Peugeot 206 11900
## 6 Peugeot 206 19900
We can use names to select columns, but also use the negation operator “-” (minus sign), which selects all columns except those specified, or functions matches()
, starts_with()
and ends_with()
, which choose only those rows that satisfy the respective conditions.
In the example below, we select only those columns that start with an M
.
head(select(vehicles, starts_with("M")))
## Make Model Mileage
## 1 Peugeot 206 85000
## 2 Peugeot 206 114000
## 3 Peugeot 206 215000
## 4 Peugeot 206 165000
## 5 Peugeot 206 146000
## 6 Peugeot 206 86400
2.5.3 How to create and transform variables
Data modelling and processing frequently requires creating new variables based on existing ones. Sometimes, based on the price, we create the logarithm of price (a single variable from a single variable). Sometimes, based on the weight and height, we create the BMI (a single variable from multiple variables).
The mutate()
function from the dplyr
package is used to conveniently create additional columns in a dataset by transforming existing columns.
We will show this function using the vehicles
dataset from the Przewodnik
package. It is a dataset with car offers from the otomoto.pl
website from 2012.
We will start the example by specifying the age of the cars offered. Since the offers come from 2012, and the manufacture year is placed in the Production
column, the age of the car can be calculated as 2013 - Production
.
In the next step, we calculate the average mileage in a single year. We will need data from two columns.
<- mutate(vehicles,
carsWithAge Age = 2013 - Production,
MileagePerYear = round(Mileage/Age))
head(select(carsWithAge, Make, Model, Price, Fuel, Age,
MileagePerYear))
## Make Model Price Fuel Age MileagePerYear
## 1 Peugeot 206 8799 gasoline 10 8500
## 2 Peugeot 206 15500 gasoline 8 14250
## 3 Peugeot 206 11900 diesel 10 21500
## 4 Peugeot 206 10999 diesel 10 16500
## 5 Peugeot 206 11900 diesel 8 18250
## 6 Peugeot 206 19900 diesel 7 12343
Column processing can be more complex than arithmetic transformations and can involve any variables, not just numerical ones. In further chapters, we will discuss operations on strings, logical values and factors.
2.5.4 How to sort rows
Sorting data by a given column makes it much easier to analyse values in that column. First of all, we can immediately identify outliers. Sorting rows is therefore useful when exploring data.
We can use the arrange()
function from the dplyr
package to sort by one or more variables. When there are equal values in the first criterion, further criteria influence the order.
In the example below, the data is first sorted by the Model
column. When there are identical values in the Model
column, the order is decided based on the Price
variable.
<- arrange(vehicles, Model, Price)
sortedVehicles head(sortedVehicles)
## Make Model Price HP Capacity Mileage Fuel Production
## 1 Peugeot 206 6330.70 90 1997 90000 diesel 2004
## 2 Peugeot 206 6599.00 70 1398 277000 diesel 2004
## 3 Peugeot 206 6900.00 90 1997 132000 diesel 2004
## 4 Peugeot 206 7900.00 88 1360 114000 gasoline 2004
## 5 Peugeot 206 8469.45 60 1124 77800 gasoline 2005
## 6 Peugeot 206 8500.00 60 1124 117000 gasoline+LPG 2004
To reverse the sorting order, the variable should be surrounded with a desc()
function execution.
<- arrange(vehicles, Model, desc(Price))
sortedVehicles head(sortedVehicles, 2)
## Make Model Price HP Capacity Mileage Fuel Production
## 1 Peugeot 206 36115 60 1100 100 gasoline 2010
## 2 Peugeot 206 33600 65 1100 380 gasoline 2011
2.5.5 How to work with streams
We usually process data in multiple steps. We perform operation A, then B, then C, etc.
When we look at some R code, we should be able to easily grasp what operations are performed. This makes code analysis quicker - be it when we look for errors, or when we want to show our code to other people.
Streams are a mechanism introduced to R recently, but they are quickly gaining supporters. They have three main advantages: make our code shorter, increase its readability, and make it easy to add further processing steps.
Onion problem
To present streams, we will first consider a series of fours instructions:
<- filter(vehicles, Make == "Kia")
KiaOnly <- arrange(KiaOnly, Price)
sorted <- select(sorted, Model, Price, Mileage, Production)
fourColumns head(fourColumns, 4)
## Model Price Mileage Production
## 1 Cee'd 1026.6 28000 2008
## 2 Cee'd 13900.0 129000 2009
## 3 Cee'd 14700.0 34000 2009
## 4 Cee'd 14900.0 158500 2005
In R, the result of one function can be directly passed as an argument to another function. To make the code shorter, the “big onion” syntax is often introduced.
head(
select(
arrange(
filter(vehicles,
== "Kia"),
Make
Price),
Model, Price, Mileage, Production)4) ,
## Model Price Mileage Production
## 1 Cee'd 1026.6 28000 2008
## 2 Cee'd 13900.0 129000 2009
## 3 Cee'd 14700.0 34000 2009
## 4 Cee'd 14900.0 158500 2005
Such an “onion” should be read from the inside. First, we apply filtering, then sorting, then selecting four columns. Finally, we select the first four rows.
The problem is that this syntax does not read easily, particularly when we look at the outer functions. In the example above, the head()
function takes two arguments, but the first argument is six lines of code long, and the other one is only mentioned on line seven. With longer “onions,” it is even more difficult to find out which arguments belong to which functions.
How the stream operator works
To get rid of the “onion,” we can use a special operator for stream processing: %>%
.This operator comes from the magrittr
package, but is also available when we turn on the dplyr
package.
To remember what this operator does, we can use a simple example of a function with two arguments. The following code:
a %>% function(b)
means:
function(a,b)
We can also use this operator with functions that take more than two arguments. The %>%
operator passes the left-hand side as the first operator of the function specified on the right.
If we want to pass a value to the second or a subsequent argument, we can specify this argument with a dot “.” as shown in the example below:
a %>% function(b, data=.)
Streams in action
The “onion” shown above can be rewritten in the following way:
%>%
vehicles filter(Make == "Kia") %>%
arrange(Price) %>%
select(Model, Price, Mileage,Production) ->
sortedhead(sorted)
## Model Price Mileage Production
## 1 Cee'd 1026.60 28000 2008
## 2 Cee'd 13900.00 129000 2009
## 3 Cee'd 14700.00 34000 2009
## 4 Cee'd 14900.00 158500 2005
## 5 Cee'd 16900.00 9900 2010
## 6 Cee'd 18803.89 15000 2010
In the example above, we also used the ->
assignment operator. Thanks to this, we can consequently read the elements of the stream from left to right.
The ->
operator works like <-
, the only difference being that it assigns the result to the variable on the right, not on the left.
The functions from the dplyr
package are defined in such a way that their first argument is always a dataset. Thanks to this, the default behaviour of the %>%
operator makes our code much shorter and more understandable even with big sequences of function calls.
In the case of functions that take datasets as the second or even further arguments, we also need to use the “.” symbol (dot). This symbol denotes where the left side of the %>%
operator should be placed.
Let us take a look at an example with the lm()
function which builds a linear model. This function expects the data to be passed as the second argument. This means we need to use the “.” symbol.
%>%
vehicles lm(Price~Mileage, data = .) %>%
summary()
##
## Call:
## lm(formula = Price ~ Mileage, data = .)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40291 -13437 -3639 5809 3947024
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.388e+04 3.693e+03 11.883 < 2e-16 ***
## Mileage -9.166e-02 2.856e-02 -3.209 0.00135 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 82640 on 2398 degrees of freedom
## Multiple R-squared: 0.004277, Adjusted R-squared: 0.003862
## F-statistic: 10.3 on 1 and 2398 DF, p-value: 0.001348
In the example above, we do not need to specify the argument using data=
, but we did so nevertheless to increase code readability.
Notice that streams are not only easier to read, but they are also easier to comment. A comment line can be placed before each line of a stream to describe what happens in the specific place.
2.5.6 How to compute aggregates/statistics in groups
A frequent operation on datasets, particularly on the big ones, is the calculation of statistics/summaries/aggregates on subsets of data.
To find such aggregates with the dplyr
package, we can use the summarise()
and group_by()
functions. The first function specifies the statistics that we want to calculate, whereas the second function describes the groups we want to create.
We present those functions below one after another.
Aggregates
We can use summarise()
to find aggregates in datasets.
For instance, the instruction below calculates the mean price, median mileage, and mean age in the vehicles
dataset.
%>%
vehicles summarise(meanPrice = mean(Price),
medianMileage = median(Mileage),
meanAge = mean(2013 - Production),
numbers = n())
## meanPrice medianMileage meanAge numbers
## 1 33340.38 122000 6.638333 2400
An aggregate is not always related with value transformation in a column. For instance, the count of rows is a useful statistic which does not depend on the values in the dataset. An aggregate like shown above is set with the n()
function.
Grouping
The group_by()
function is used to define groups for further processing. The function itself does not modify data. It only adds a tag which specifies the grouping variable. Further functions of a stream use this tag to perform processing in groups.
The group_by
function influences other functions from the dplyr
package in an intuitive way.
The
summarise()
function, which computes statistics, will find statistics for each possible combination of grouping variables.The
arrange()
function, which sorts rows, will first sort rows by the grouping variable.The
sample_n()
functions, which filters rows, will sample data independently in each group.The
mutate()
function finds each aggregate within a group, which makes it easy to normalise variables in groups.
The example below calculates four statistics (mean price, median mileage, mean vehicle age, and count of offers) in groups specified by the make.
%>%
vehicles group_by(Make) %>%
summarise(meanPrice = mean(Price),
medianMileage = median(Mileage),
meanAge = mean(2013 - Production),
offerCount = n())
## # A tibble: 6 × 5
## Make meanPrice medianMileage meanAge offerCount
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Audi 59892. 150452. 6.55 200
## 2 Fiat 15772. 75000 6.33 200
## 3 Kia 36983. 42850 3.98 200
## 4 Opel 33591. 120550 6.62 800
## 5 Peugeot 17389. 127750 8.16 400
## 6 Volkswagen 39433. 146448 6.67 600
Aggregates are an ordinary data frame, which means we can perform subsequent operations on them, such as sorting.
%>%
vehicles group_by(Make) %>%
summarise(meanPrice = mean(Price),
medianMileage = median(Mileage),
meanAge = mean(2013 - Production),
offerCount = n()) %>%
arrange(meanPrice)
## # A tibble: 6 × 5
## Make meanPrice medianMileage meanAge offerCount
## <chr> <dbl> <dbl> <dbl> <int>
## 1 Fiat 15772. 75000 6.33 200
## 2 Peugeot 17389. 127750 8.16 400
## 3 Opel 33591. 120550 6.62 800
## 4 Kia 36983. 42850 3.98 200
## 5 Volkswagen 39433. 146448 6.67 600
## 6 Audi 59892. 150452. 6.55 200
The example below modifies the Mileage
variable. It divides Mileage
by the mean mileage for the given make.
%>%
vehicles select(Make, Price, Mileage, Model) %>%
group_by(Make) %>%
mutate(Mileage = Mileage/mean(Mileage, na.rm=TRUE))
## # A tibble: 2,400 × 4
## # Groups: Make [6]
## Make Price Mileage Model
## <chr> <dbl> <dbl> <chr>
## 1 Peugeot 8799 0.671 206
## 2 Peugeot 15500 0.900 206
## 3 Peugeot 11900 1.70 206
## 4 Peugeot 10999 1.30 206
## 5 Peugeot 11900 1.15 206
## 6 Peugeot 19900 0.682 206
## 7 Peugeot 16400 1.18 206
## 8 Peugeot 13900 1.03 206
## 9 Peugeot 13900 1.26 206
## 10 Peugeot 18400 0.450 206
## # … with 2,390 more rows
2.5.7 Wide and long formats
Many functions assume that data is structured as tables in which rows are subsequent observations and columns are variables describing those observations. This data format is expected by packages such as dplyr
, ggplot2
and others.
However, data is sometimes given in other formats. Sometimes, the same variable is described by many columns, or multiple variables are provided a single column. Some functions from the tidyr
package are used to transform various data formats.
We will present those data formats and the functions used to transform them on some sample data from Eurostat. We will get this data using the eurostat
package. Below, we load data from the t2020_rk310
table in which Eurostat gathers information about the popularity of various means of transport in various countries.
library("eurostat")
<- search_eurostat("Modal split of passenger transport", type = "table")$code[1]
id <- get_eurostat(id, time_format = "num") %>% as.data.frame()
tsdtr210 head(tsdtr210, 4)
## unit vehicle geo time values
## 1 PC BUS_TOT AT 1990 8.2
## 2 PC BUS_TOT BE 1990 10.6
## 3 PC BUS_TOT CH 1990 3.7
## 4 PC BUS_TOT DE 1990 9.1
The data is given in the long format. What this means will become clear once we present the wide format. In this example, the geo
column specifies the country, the time
column specifies the year, the vehicle
column specifies the means of transport, and the values
column specifies the popularity of a given means of transport in the given country and year.
Spread onto columns
To transform data from the long format to the wide format, we can use the spread()
function. The name wide format comes from the fact that it contains more columns. New columns in the wide format correspond to the values of a single column in the long format.
The spread()
function expects three arguments. The first argument is, naturally, the dataset. The second argument is the key corresponding to column names, and the last argument are the values to be written into the specific columns. Rows in the new table are chosen as unique values in the other columns.
In the example below, the time
function is changed from the long format to the wide format in which names correspond to subsequent years - values from the time
column. The values in new columns are copies from the values
column in the long data format.
library("tidyr")
<- spread(tsdtr210, time, values)
wideFormat %>% filter(geo == "PL") wideFormat
## unit vehicle geo 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## 1 PC BUS_TOT PL 28.1 25.6 24.4 23.2 20.9 19.9 19.4 17.9 17.4 16.8 25.4 23.5
## 2 PC CAR PL 41.3 49.8 55.3 57.9 62.3 64.6 69.3 71.3 72.1 72.3 64.3 66.9
## 3 PC TRN PL 30.6 24.6 20.3 18.9 16.8 15.5 11.3 10.8 10.5 10.9 10.3 9.5
## 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 1 21.7 21.2 20.4 18.6 17.0 20.8 19.9 17.9 16.8 15.7 16.0 15.2 15.5 14.7 14.2
## 2 69.7 70.8 72.3 74.6 76.6 70.7 71.9 74.7 76.1 77.4 76.8 78.0 78.2 78.5 78.5
## 3 8.6 8.1 7.3 6.8 6.4 8.5 8.2 7.4 7.1 6.9 7.2 6.7 6.3 6.8 7.3
## 2017 2018 2019
## 1 13.8 12.9 12.0
## 2 78.5 79.3 80.7
## 3 7.7 7.9 7.3
The data in the tsdtr210
table describes values in intersections through four dimensions (unit, vehicle type, country, year). Each dimension can be used to create new columns in the wide format.
In the example below, subsequent columns are created based on the geo
column in the long format.
<- spread(tsdtr210, geo, values) wideFormat2
We can show the rows for year 2010 (some columns are omitted).
%>% filter(time == "2010") wideFormat2
## unit vehicle time AT BE BG CH CY CZ DE DK EE EL ES
## 1 PC BUS_TOT 2010 9.3 12.7 16.4 5.1 18.1 19.5 6 10.5 16.6 17.3 12.3
## 2 PC CAR 2010 79.6 79.7 80.0 77.3 81.9 73.0 86 79.7 81.4 81.6 82.3
## 3 PC TRN 2010 11.1 7.6 3.6 17.6 NA 7.5 8 9.8 2.0 1.1 5.4
## EU27_2020 EU28 FI FR HR HU IE IS IT LT LU LV ME MK
## 1 9.8 9.3 9.9 5.3 10.7 21.5 14.5 11.4 12.8 7.6 12.1 17.1 1.9 22.9
## 2 83.0 83.5 84.9 85.5 83.7 68.6 82.6 88.6 81.7 91.7 83.5 78.2 96.0 74.6
## 3 7.1 7.2 5.2 9.3 5.6 10.0 2.9 NA 5.5 0.7 4.5 4.7 2.1 2.5
## MT NL NO PL PT RO RS SE SI SK TR UK
## 1 18.5 2.9 6.9 16.8 6.5 16.3 13.0 7.3 10.8 15.3 38.3 6.2
## 2 81.5 86.9 88.3 76.1 89.1 78.0 85.5 84.0 86.8 78.0 59.3 86.3
## 3 NA 10.2 4.8 7.1 4.4 5.6 1.5 8.7 2.5 6.7 2.4 7.5
Gather into columns
The opposite of spreading a single column into multiple columns is gathering multiple columns into a single column. This can be accomplished with the gather()
function.
This function takes a dataset as the first argument. Further two arguments specify column names with keys and values, while the other arguments point to those columns from the old dataset that should be gathered together in the new dataset. We can use the minus sign notation “-” which means “everything except….”
In the example below, we transform the wideFormat
data frame into the long format in such a way that all columns except for geo
and vehicle
are transformed into a column called value. This is why the result contains the geo
and vehicle
columns (skipped when gathering) as well as year
and value
(results of gathering).
%>%
wideFormat gather(year, value, -geo, -vehicle) %>%
tail()
## vehicle geo year value
## 3436 TRN RS 2019 0.7
## 3437 TRN SE 2019 10.5
## 3438 TRN SI 2019 1.8
## 3439 TRN SK 2019 10.2
## 3440 TRN TR 2019 3.3
## 3441 TRN UK 2019 8.5
To show an example of 4 rows, we used the tail()
function which shows the last 6 rows, because the first six rows are NA
values.
2.5.8 Uniting/separating columns
When we work with data, we often want to unite multiple columns into one, or split a column into multiple columns.
Uniting columns
We can use the unite()
function to merge a few columns into a single column.
The code below unites values from columns geo
and time
into a single string column geo_time
.
unite(tsdtr210, geo_time, geo, time, sep=":") %>%
head(4)
## unit vehicle geo_time values
## 1 PC BUS_TOT AT:1990 8.2
## 2 PC BUS_TOT BE:1990 10.6
## 3 PC BUS_TOT CH:1990 3.7
## 4 PC BUS_TOT DE:1990 9.1
Uniting columns is frequently useful when we want to group rows by multiple variables. We can then unite such columns, and group rows by the new, united column.
Separating columns
The opposite of uniting is separating. Here, we can use the separate()
function.
We will show an example of an artificial dataset with two columns: data and identifier.
In the example below, the separate()
function creates three new columns based on the dates
column. The columns are filled with parts of the old column separated with the - sign.
<- data.frame(dates = c("2004-01-01", "2012-04-15", "2006-10-29",
df "2010-03-03"), id = 1:4)
df
## dates id
## 1 2004-01-01 1
## 2 2012-04-15 2
## 3 2006-10-29 3
## 4 2010-03-03 4
separate(df, col=dates, into=c("year", "month", "day"), sep="-")
## year month day id
## 1 2004 01 01 1
## 2 2012 04 15 2
## 3 2006 10 29 3
## 4 2010 03 03 4
The length of the into
vector specifies the number of columns to be created after separation. If the column that is being split contains too few or too many values (e.g. only two elements separated with a separator), then separate()
will produce warnings by default. Additional extra
and fill
arguments can be used to specify what should happen when some data is missing.