2.5 How to process data with the dplyr package

R has thousands (thousands!) of functions to process data. Fortunately, we only need a few of them to start working with data in a convenient way.

The 80/20 principle (i.e. the Pareto principle, named after Vilfredo Pareto) applies to data processing as well. This means that we only need to know a small portion of all available functions to handle the vast majority (say, 80%) of typical data sets.

Hadley Wickham prepared two packages, dplyr and tidyr, which provide only those really important functions. Each function performs an elementary operation, but various functions can be combined to perform all typical data operations.

Wickham calls functions from those packages verbs, and compares the analysis process to sentence construction. The basic verbs in this data analysis are:

  • filter() - choose the selected rows from the dataset,

  • select() - choose only the selected columns from the dataset,

  • arrange() - sort the selected rows based on the selected columns,

  • mutate() - add a new column with data or change an existing column,

  • group_by() / ungroup() - group data based on the selected factor / remove grouping,

  • summarise() - find specific aggregates in each group,

  • gather() / spread() - transform data between wide and long formats.

Those basic verbs are described in subsequent sections of this chapter. More functions to explore data are presented in a cheat sheet developed by RStudio, available at their website [26] or at http://bit.ly/1LaYWBd.

2.5.1 How to filter rows

Filtering rows that satisfy a given condition or conditions is one of the most frequent data operations.

The filter() function from the dplyr package performs filtering. Its first argument is the dataset, and further arguments define logical conditions.

The result of this function contains only the rows that satisfy all of the specified conditions. When writing the conditions, we can refer to column names directly, without prefixing them with the name of the dataset.

Let us present this with an example. The instruction below chooses from the vehicles dataset only those offers where Model == "Corsa":

library("dplyr")
CorsaOnly <- filter(vehicles, Model == "Corsa")
head(CorsaOnly)
##   Make Model Price  HP Capacity Mileage     Fuel Production
## 1 Opel Corsa 13450  70     1248  190000   diesel       2004
## 2 Opel Corsa 25990  75     1300   84000   diesel       2008
## 3 Opel Corsa 17990  80     1229   48606 gasoline       2007
## 4 Opel Corsa 23999  60      998   63000 gasoline       2009
## 5 Opel Corsa 16500 101     1700  118000   diesel       2006
## 6 Opel Corsa 27900  80     1229   73500 gasoline       2007

We can specify multiple conditions simultaneously; they are combined with a logical AND. The example below finds Corsa vehicles with diesel engines manufactured in 2010.

CorsaOnly <- filter(vehicles, Model == "Corsa", Production == 2010,
                    Fuel == "diesel")
head(CorsaOnly)
##   Make Model    Price HP Capacity Mileage   Fuel Production
## 1 Opel Corsa 49050.00 75     1300     100 diesel       2010
## 2 Opel Corsa 47202.34 76     1300   53000 diesel       2010
## 3 Opel Corsa 37900.00 95     1248    8300 diesel       2010
## 4 Opel Corsa 12200.00 75     1248   11378 diesel       2010
## 5 Opel Corsa 34900.00 95     1300   24000 diesel       2010
## 6 Opel Corsa 37900.00 75     1248   18500 diesel       2010
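If we need an alternative (a logical OR) rather than a conjunction, we can combine the conditions explicitly with the | operator inside a single argument. A minimal sketch, using column values from the vehicles dataset above:

# offers with either a diesel or a gasoline+LPG engine
dieselOrLPG <- filter(vehicles, Fuel == "diesel" | Fuel == "gasoline+LPG")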

2.5.2 How to choose columns

Sometimes, we only need a portion of all columns from a dataset. We can remove the unnecessary columns to make our work quicker.

Another advantage of keeping only the columns we really need is that the data is easier to display: instead of printing every column, including the unnecessary ones, we show only those that matter for the analysis at hand.

The select() function from the dplyr package allows us to choose one or more variables from the data source. The first argument is the data source, and subsequent arguments define the columns that we want to choose.

The example below selects only three columns: Make, Model and Price.

threeColumns <- select(vehicles, Make, Model, Price)
head(threeColumns)
##      Make Model Price
## 1 Peugeot   206  8799
## 2 Peugeot   206 15500
## 3 Peugeot   206 11900
## 4 Peugeot   206 10999
## 5 Peugeot   206 11900
## 6 Peugeot   206 19900

We can select columns by name, but we can also use the negation operator “-” (minus sign), which selects all columns except those specified, or the helper functions matches(), starts_with() and ends_with(), which choose only those columns whose names satisfy the respective condition.

In the example below, we select only those columns that start with an M.

head(select(vehicles, starts_with("M")))
##      Make Model Mileage
## 1 Peugeot   206   85000
## 2 Peugeot   206  114000
## 3 Peugeot   206  215000
## 4 Peugeot   206  165000
## 5 Peugeot   206  146000
## 6 Peugeot   206   86400
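The remaining selection helpers work in the same spirit. A brief sketch, again using the vehicles dataset (the choice of columns is purely illustrative):

# all columns except Price
head(select(vehicles, -Price))
# columns whose names end with "n" (here, only Production)
head(select(vehicles, ends_with("n")))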

2.5.3 How to create and transform variables

Data modelling and processing frequently requires creating new variables based on existing ones. Sometimes, based on the price, we create the logarithm of price (a single variable from a single variable). Sometimes, based on the weight and height, we create the BMI (a single variable from multiple variables).

The mutate() function from the dplyr package is used to conveniently create additional columns in a dataset by transforming existing columns.

We will show this function using the vehicles dataset from the Przewodnik package. It is a dataset with car offers from the otomoto.pl website from 2012.

We will start the example by specifying the age of the cars offered. Since the offers come from 2012, and the manufacture year is placed in the Production column, the age of the car can be calculated as 2013 - Production.

In the next step, we calculate the average mileage per year of use. This calculation needs data from two columns.

carsWithAge <- mutate(vehicles,
                      Age = 2013 - Production,
                      MileagePerYear = round(Mileage/Age))
head(select(carsWithAge, Make, Model, Price, Fuel, Age,
            MileagePerYear))
##      Make Model Price     Fuel Age MileagePerYear
## 1 Peugeot   206  8799 gasoline  10           8500
## 2 Peugeot   206 15500 gasoline   8          14250
## 3 Peugeot   206 11900   diesel  10          21500
## 4 Peugeot   206 10999   diesel  10          16500
## 5 Peugeot   206 11900   diesel   8          18250
## 6 Peugeot   206 19900   diesel   7          12343

Column processing can be more complex than arithmetic transformations and can involve any variables, not just numerical ones. In further chapters, we will discuss operations on strings, logical values and factors.
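As a small foretaste, the sketch below uses mutate() to derive a text label rather than a number; the 50000 threshold is an arbitrary value chosen only for illustration:

# label each offer as "expensive" or "affordable"
labelledVehicles <- mutate(vehicles,
                           PriceCategory = ifelse(Price > 50000,
                                                  "expensive", "affordable"))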

2.5.4 How to sort rows

Sorting data by a given column makes it much easier to analyse values in that column. First of all, we can immediately identify outliers. Sorting rows is therefore useful when exploring data.

We can use the arrange() function from the dplyr package to sort by one or more variables. When there are equal values in the first criterion, further criteria influence the order.

In the example below, the data is first sorted by the Model column. When there are identical values in the Model column, the order is decided based on the Price variable.

sortedVehicles <- arrange(vehicles, Model, Price)
head(sortedVehicles)
##      Make Model   Price HP Capacity Mileage         Fuel Production
## 1 Peugeot   206 6330.70 90     1997   90000       diesel       2004
## 2 Peugeot   206 6599.00 70     1398  277000       diesel       2004
## 3 Peugeot   206 6900.00 90     1997  132000       diesel       2004
## 4 Peugeot   206 7900.00 88     1360  114000     gasoline       2004
## 5 Peugeot   206 8469.45 60     1124   77800     gasoline       2005
## 6 Peugeot   206 8500.00 60     1124  117000 gasoline+LPG       2004

To reverse the sorting order, the variable should be wrapped in the desc() function.

sortedVehicles <- arrange(vehicles, Model, desc(Price))
head(sortedVehicles, 2)
##      Make Model Price HP Capacity Mileage     Fuel Production
## 1 Peugeot   206 36115 60     1100     100 gasoline       2010
## 2 Peugeot   206 33600 65     1100     380 gasoline       2011

2.5.5 How to work with streams

We usually process data in multiple steps. We perform operation A, then B, then C, etc.

When we look at some R code, we should be able to easily grasp what operations are performed. This makes code analysis quicker - be it when we look for errors, or when we want to show our code to other people.

Streams (often called pipes) are a relatively recent addition to R, but they have quickly gained supporters. They have three main advantages: they make our code shorter, they increase its readability, and they make it easy to add further processing steps.

Onion problem

To present streams, we will first consider a series of four instructions:

KiaOnly <- filter(vehicles, Make == "Kia")
sorted <- arrange(KiaOnly, Price)
fourColumns <- select(sorted, Model, Price, Mileage, Production)
head(fourColumns, 4)
##   Model   Price Mileage Production
## 1 Cee'd  1026.6   28000       2008
## 2 Cee'd 13900.0  129000       2009
## 3 Cee'd 14700.0   34000       2009
## 4 Cee'd 14900.0  158500       2005

In R, the result of one function can be directly passed as an argument to another function. To make the code shorter, the “big onion” syntax is often introduced.

head(
  select(
    arrange(
      filter(vehicles,
             Make == "Kia"),
      Price),
    Model, Price, Mileage, Production)
  , 4)
##   Model   Price Mileage Production
## 1 Cee'd  1026.6   28000       2008
## 2 Cee'd 13900.0  129000       2009
## 3 Cee'd 14700.0   34000       2009
## 4 Cee'd 14900.0  158500       2005

Such an “onion” should be read from the inside. First, we apply filtering, then sorting, then selecting four columns. Finally, we select the first four rows.

The problem is that this syntax does not read easily, particularly when we look at the outer functions. In the example above, the head() function takes two arguments, but the first argument spans six lines of code, and the second one appears only at the very end. With longer “onions”, it is even more difficult to work out which arguments belong to which functions.

How the stream operator works

To get rid of the “onion,” we can use a special operator for stream processing: %>%. This operator comes from the magrittr package, but it is also available when we load the dplyr package.

To remember what this operator does, we can use a simple example of a function with two arguments. The following code:

a %>% function(b)

means:

function(a,b)

We can also use this operator with functions that take more than two arguments. The %>% operator passes the left-hand side as the first argument of the function specified on the right.

If we want to pass a value to the second or a subsequent argument, we can specify this argument with a dot “.” as shown in the example below:

a %>% function(b, data=.)
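To make these rules concrete, here is a minimal illustration with base R functions (assuming the dplyr package, and thus the %>% operator, is already loaded):

16 %>% sqrt()           # equivalent to sqrt(16)
c(2, 4, 6) %>% mean()   # equivalent to mean(c(2, 4, 6))
10 %>% seq(1, to = .)   # the dot puts 10 into the to argument: seq(1, to = 10)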

Streams in action

The “onion” shown above can be rewritten in the following way:

vehicles %>%
  filter(Make == "Kia") %>%
  arrange(Price) %>%
  select(Model, Price, Mileage,Production) ->
  sorted
head(sorted)
##   Model    Price Mileage Production
## 1 Cee'd  1026.60   28000       2008
## 2 Cee'd 13900.00  129000       2009
## 3 Cee'd 14700.00   34000       2009
## 4 Cee'd 14900.00  158500       2005
## 5 Cee'd 16900.00    9900       2010
## 6 Cee'd 18803.89   15000       2010

In the example above, we also used the -> assignment operator. Thanks to this, we can consistently read the elements of the stream from left to right.

The -> operator works like <-, the only difference being that it assigns the result to the variable on the right, not on the left.

The functions from the dplyr package are defined in such a way that their first argument is always a dataset. Thanks to this, the default behaviour of the %>% operator makes our code much shorter and more understandable even with big sequences of function calls.

In the case of functions that take datasets as the second or even further arguments, we also need to use the “.” symbol (dot). This symbol denotes where the left side of the %>% operator should be placed. Let us take a look at an example with the lm() function which builds a linear model. This function expects the data to be passed as the second argument. This means we need to use the “.” symbol.

vehicles %>%
  lm(Price~Mileage, data = .) %>%
  summary()
## 
## Call:
## lm(formula = Price ~ Mileage, data = .)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -40291  -13437   -3639    5809 3947024 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4.388e+04  3.693e+03  11.883  < 2e-16 ***
## Mileage     -9.166e-02  2.856e-02  -3.209  0.00135 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 82640 on 2398 degrees of freedom
## Multiple R-squared:  0.004277,   Adjusted R-squared:  0.003862 
## F-statistic:  10.3 on 1 and 2398 DF,  p-value: 0.001348

In the example above, we did not have to name the argument with data= (a bare dot passed as the second argument would suffice), but naming it increases code readability.

Notice that streams are not only easier to read, but they are also easier to comment. A comment line can be placed before each line of a stream to describe what happens in the specific place.
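For example, a commented stream might look as follows (a sketch based on the vehicles dataset):

vehicles %>%
  # keep only the diesel offers
  filter(Fuel == "diesel") %>%
  # sort them from the cheapest to the most expensive
  arrange(Price) %>%
  # show the three cheapest offers
  head(3)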

2.5.6 How to compute aggregates/statistics in groups

A frequent operation on datasets, particularly on the big ones, is the calculation of statistics/summaries/aggregates on subsets of data.

To find such aggregates with the dplyr package, we can use the summarise() and group_by() functions. The first function specifies the statistics that we want to calculate, whereas the second function describes the groups we want to create.

We present those functions below one after another.

Aggregates

We can use summarise() to find aggregates in datasets.

For instance, the instruction below calculates the mean price, the median mileage, the mean age, and the number of offers in the vehicles dataset.

vehicles %>%
  summarise(meanPrice = mean(Price),
            medianMileage = median(Mileage),
            meanAge = mean(2013 - Production),
            numbers = n())
##   meanPrice medianMileage  meanAge numbers
## 1  33340.38        122000 6.638333    2400

An aggregate does not always involve transforming the values of a column. For instance, the number of rows is a useful statistic that does not depend on any particular column. This aggregate is computed with the n() function, as shown above.
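As a side note, for the common case of simply counting rows per group, the dplyr package also offers the count() function as a shortcut. A minimal sketch:

# number of offers for each make
count(vehicles, Make)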

Grouping

The group_by() function is used to define groups for further processing. The function itself does not modify data. It only adds a tag which specifies the grouping variable. Further functions of a stream use this tag to perform processing in groups.

The group_by function influences other functions from the dplyr package in an intuitive way.

  • The summarise() function, which computes statistics, will find statistics for each possible combination of grouping variables.

  • The arrange() function, which sorts rows, will sort first by the grouping variable (in current versions of dplyr this requires setting the .by_group = TRUE argument).

  • The sample_n() function, which samples rows, will sample data independently in each group (a short sketch follows this list).

  • The mutate() function computes transformations within each group, which makes it easy to normalise variables group by group.
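To illustrate the per-group behaviour, the sketch below draws one random offer for every make:

vehicles %>%
  group_by(Make) %>%
  # draw a single random row within each group
  sample_n(1)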

The example below calculates four statistics (mean price, median mileage, mean vehicle age, and count of offers) in groups specified by the make.

vehicles %>%
  group_by(Make) %>%
  summarise(meanPrice = mean(Price),
            medianMileage = median(Mileage),
            meanAge = mean(2013 - Production),
            offerCount = n())
## # A tibble: 6 × 5
##   Make       meanPrice medianMileage meanAge offerCount
##   <chr>          <dbl>         <dbl>   <dbl>      <int>
## 1 Audi          59892.       150452.    6.55        200
## 2 Fiat          15772.        75000     6.33        200
## 3 Kia           36983.        42850     3.98        200
## 4 Opel          33591.       120550     6.62        800
## 5 Peugeot       17389.       127750     8.16        400
## 6 Volkswagen    39433.       146448     6.67        600

The table with aggregates is an ordinary data frame, which means we can perform subsequent operations on it, such as sorting.

vehicles %>%
  group_by(Make) %>%
  summarise(meanPrice = mean(Price),
            medianMileage = median(Mileage),
            meanAge = mean(2013 - Production),
            offerCount = n()) %>%
  arrange(meanPrice)
## # A tibble: 6 × 5
##   Make       meanPrice medianMileage meanAge offerCount
##   <chr>          <dbl>         <dbl>   <dbl>      <int>
## 1 Fiat          15772.        75000     6.33        200
## 2 Peugeot       17389.       127750     8.16        400
## 3 Opel          33591.       120550     6.62        800
## 4 Kia           36983.        42850     3.98        200
## 5 Volkswagen    39433.       146448     6.67        600
## 6 Audi          59892.       150452.    6.55        200

The example below modifies the Mileage variable. It divides Mileage by the mean mileage for the given make.

vehicles %>%
  select(Make, Price, Mileage, Model) %>%
  group_by(Make) %>%
  mutate(Mileage = Mileage/mean(Mileage, na.rm=TRUE))
## # A tibble: 2,400 × 4
## # Groups:   Make [6]
##    Make    Price Mileage Model
##    <chr>   <dbl>   <dbl> <chr>
##  1 Peugeot  8799   0.671 206  
##  2 Peugeot 15500   0.900 206  
##  3 Peugeot 11900   1.70  206  
##  4 Peugeot 10999   1.30  206  
##  5 Peugeot 11900   1.15  206  
##  6 Peugeot 19900   0.682 206  
##  7 Peugeot 16400   1.18  206  
##  8 Peugeot 13900   1.03  206  
##  9 Peugeot 13900   1.26  206  
## 10 Peugeot 18400   0.450 206  
## # … with 2,390 more rows

2.5.7 Wide and long formats

Many functions assume that data is structured as tables in which rows are subsequent observations and columns are variables describing those observations. This data format is expected by packages such as dplyr, ggplot2 and others.

However, data is sometimes given in other formats. Sometimes, the same variable is described by many columns, or multiple variables are provided in a single column. Functions from the tidyr package are used to transform between such data formats.

We will present those data formats, and the functions used to transform between them, using sample data from Eurostat. We will download this data with the eurostat package. Below, we load the t2020_rk310 table, in which Eurostat gathers information about the popularity of various means of transport in various countries.

library("eurostat")
id <- search_eurostat("Modal split of passenger transport", type = "table")$code[1]
tsdtr210 <- get_eurostat(id, time_format = "num") %>% as.data.frame()
head(tsdtr210, 4)
##   unit vehicle geo time values
## 1   PC BUS_TOT  AT 1990    8.2
## 2   PC BUS_TOT  BE 1990   10.6
## 3   PC BUS_TOT  CH 1990    3.7
## 4   PC BUS_TOT  DE 1990    9.1

The data is given in the long format. What this means will become clear once we present the wide format. In this example, the geo column specifies the country, the time column specifies the year, the vehicle column specifies the means of transport, and the values column specifies the popularity of a given means of transport in the given country and year.

Spread onto columns

To transform data from the long format to the wide format, we can use the spread() function. The name wide format comes from the fact that it contains more columns. New columns in the wide format correspond to the values of a single column in the long format.

The spread() function expects three arguments. The first argument is, naturally, the dataset. The second argument is the key column, whose values become the names of the new columns, and the third argument is the column with the values to be written into those new columns. Rows in the new table correspond to unique combinations of values in the remaining columns.

In the example below, the data is transformed from the long format to the wide format, in which column names correspond to subsequent years, i.e. the values of the time column. The values in the new columns are copied from the values column of the long format.

library("tidyr")
wideFormat <- spread(tsdtr210, time, values)
wideFormat %>% filter(geo == "PL")
##   unit vehicle geo 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001
## 1   PC BUS_TOT  PL 28.1 25.6 24.4 23.2 20.9 19.9 19.4 17.9 17.4 16.8 25.4 23.5
## 2   PC     CAR  PL 41.3 49.8 55.3 57.9 62.3 64.6 69.3 71.3 72.1 72.3 64.3 66.9
## 3   PC     TRN  PL 30.6 24.6 20.3 18.9 16.8 15.5 11.3 10.8 10.5 10.9 10.3  9.5
##   2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016
## 1 21.7 21.2 20.4 18.6 17.0 20.8 19.9 17.9 16.8 15.7 16.0 15.2 15.5 14.7 14.2
## 2 69.7 70.8 72.3 74.6 76.6 70.7 71.9 74.7 76.1 77.4 76.8 78.0 78.2 78.5 78.5
## 3  8.6  8.1  7.3  6.8  6.4  8.5  8.2  7.4  7.1  6.9  7.2  6.7  6.3  6.8  7.3
##   2017 2018 2019
## 1 13.8 12.9 12.0
## 2 78.5 79.3 80.7
## 3  7.7  7.9  7.3

The data in the tsdtr210 table describes values in intersections through four dimensions (unit, vehicle type, country, year). Each dimension can be used to create new columns in the wide format.

In the example below, subsequent columns are created based on the geo column in the long format.

wideFormat2 <- spread(tsdtr210, geo, values)

We can show the rows for year 2010 (some columns are omitted).

wideFormat2 %>% filter(time == "2010")
##   unit vehicle time   AT   BE   BG   CH   CY   CZ DE   DK   EE   EL   ES
## 1   PC BUS_TOT 2010  9.3 12.7 16.4  5.1 18.1 19.5  6 10.5 16.6 17.3 12.3
## 2   PC     CAR 2010 79.6 79.7 80.0 77.3 81.9 73.0 86 79.7 81.4 81.6 82.3
## 3   PC     TRN 2010 11.1  7.6  3.6 17.6   NA  7.5  8  9.8  2.0  1.1  5.4
##   EU27_2020 EU28   FI   FR   HR   HU   IE   IS   IT   LT   LU   LV   ME   MK
## 1       9.8  9.3  9.9  5.3 10.7 21.5 14.5 11.4 12.8  7.6 12.1 17.1  1.9 22.9
## 2      83.0 83.5 84.9 85.5 83.7 68.6 82.6 88.6 81.7 91.7 83.5 78.2 96.0 74.6
## 3       7.1  7.2  5.2  9.3  5.6 10.0  2.9   NA  5.5  0.7  4.5  4.7  2.1  2.5
##     MT   NL   NO   PL   PT   RO   RS   SE   SI   SK   TR   UK
## 1 18.5  2.9  6.9 16.8  6.5 16.3 13.0  7.3 10.8 15.3 38.3  6.2
## 2 81.5 86.9 88.3 76.1 89.1 78.0 85.5 84.0 86.8 78.0 59.3 86.3
## 3   NA 10.2  4.8  7.1  4.4  5.6  1.5  8.7  2.5  6.7  2.4  7.5

Gather into columns

The opposite of spreading a single column into multiple columns is gathering multiple columns into a single column. This can be accomplished with the gather() function.

This function takes a dataset as the first argument. The next two arguments specify the names of the new key and value columns, while the remaining arguments point to the columns of the old dataset that should be gathered together. We can use the minus sign notation “-”, which means “everything except….”

In the example below, we transform the wideFormat data frame into the long format in such a way that all columns except for geo and vehicle are transformed into a column called value. This is why the result contains the geo and vehicle columns (skipped when gathering) as well as year and value (results of gathering).

wideFormat %>%
  gather(year, value, -geo, -vehicle) %>%
  tail()
##      vehicle geo year value
## 3436     TRN  RS 2019   0.7
## 3437     TRN  SE 2019  10.5
## 3438     TRN  SI 2019   1.8
## 3439     TRN  SK 2019  10.2
## 3440     TRN  TR 2019   3.3
## 3441     TRN  UK 2019   8.5

To show a few example rows, we used the tail() function, which prints the last six rows, because the first rows of the result contain NA values.

2.5.8 Uniting/separating columns

When we work with data, we often want to unite multiple columns into one, or split a column into multiple columns.

Uniting columns

We can use the unite() function to merge a few columns into a single column.

The code below unites values from columns geo and time into a single string column geo_time.

unite(tsdtr210, geo_time, geo, time, sep=":") %>%
  head(4)
##   unit vehicle geo_time values
## 1   PC BUS_TOT  AT:1990    8.2
## 2   PC BUS_TOT  BE:1990   10.6
## 3   PC BUS_TOT  CH:1990    3.7
## 4   PC BUS_TOT  DE:1990    9.1

Uniting columns is frequently useful when we want to group rows by multiple variables. We can then unite such columns, and group rows by the new, united column.
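A minimal sketch of this pattern, assuming both the tidyr and dplyr packages are loaded:

tsdtr210 %>%
  unite(geo_time, geo, time, sep = ":") %>%
  group_by(geo_time) %>%
  # mean popularity across all means of transport for each country-year pair
  summarise(meanValue = mean(values, na.rm = TRUE))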

Separating columns

The opposite of uniting is separating. Here, we can use the separate() function.

We will show an example on an artificial dataset with two columns: dates (a date) and id (an identifier).

In the example below, the separate() function creates three new columns based on the dates column. The columns are filled with parts of the old column separated with the - sign.

df <- data.frame(dates = c("2004-01-01", "2012-04-15", "2006-10-29",
                           "2010-03-03"), id = 1:4)
df
##        dates id
## 1 2004-01-01  1
## 2 2012-04-15  2
## 3 2006-10-29  3
## 4 2010-03-03  4
separate(df, col=dates, into=c("year", "month", "day"), sep="-")
##   year month day id
## 1 2004    01  01  1
## 2 2012    04  15  2
## 3 2006    10  29  3
## 4 2010    03  03  4

The length of the into vector specifies the number of columns to be created after separation. If a row of the column being split contains too few or too many pieces (e.g. only two elements separated by the separator), then separate() will produce a warning by default. The additional extra and fill arguments specify what should happen when pieces are missing or superfluous.
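A minimal sketch of this behaviour, using a hypothetical df2 data frame in which the second date lacks the day part; fill = "right" pads the missing piece with NA instead of producing a warning:

df2 <- data.frame(dates = c("2004-01-01", "2012-04"), id = 1:2)
separate(df2, col = dates, into = c("year", "month", "day"),
         sep = "-", fill = "right")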