# Chapter 4 Data Sets

We illustrate the methods presented in this book by using three datasets:

* *Sinking of the RMS Titanic*
* *Apartment prices*
* *Hire or fire*

The first dataset will be used to illustrate the application of the techniques in the case of a predictive model for a binary dependent variable. The second one will provide an example for models for a continuous variable. Finally, the third dataset will be used for illustration of models for a categorical dependent variable.

In this chapter, we provide a short description of each of the datasets, together with results of exploratory analyses. We also introduce models that will be used for illustration purposes in subsequent chapters.

## 4.1 Sinking of the RMS Titanic

The sinking of the RMS Titanic is one of the deadliest maritime disasters in history (during peacetime). Over 1500 people died as a consequence of a collision with an iceberg. Projects like *Encyclopedia titanica* (https://www.encyclopedia-titanica.org/) are a source of rich and precise data about Titanic's passengers. The data are available in a dataset included in the `stablelearner` package. The dataset, after some data cleaning and variable transformations, is also available in the `DALEX` package. In particular, the `titanic` data frame contains 2207 observations (for 1317 passengers and 890 crew members) and nine variables:

* *gender*, person's (passenger's or crew member's) gender; a factor (categorical variable) with two levels (categories);
* *age*, person's age in years; a numerical variable; for adults, the age is given in (integer) years; for children younger than one year, the age is given as \(x/12\), where \(x\) is the number of months of the child's age;
* *class*, the class in which the passenger travelled, or the duty class of a crew member; a factor with seven levels;
* *embarked*, the harbor in which the person embarked on the ship; a factor with four levels;
* *country*, person's home country; a factor with 48 levels;
* *fare*, the price of the ticket (only available for passengers; 0 for crew members); a numerical variable;
* *sibsp*, the number of siblings/spouses aboard the ship; a numerical variable;
* *parch*, the number of parents/children aboard the ship; a numerical variable;
* *survived*, a factor with two levels indicating whether the person survived or not.

The R code below provides more info about the contents of the dataset, values of the variables, etc.

```
## gender age class embarked country fare sibsp parch survived
## 1 male 42 3rd Southampton United States 7.11 0 0 no
## 2 male 13 3rd Southampton United States 20.05 0 2 no
```

```
## 'data.frame': 2207 obs. of 9 variables:
## $ gender : Factor w/ 2 levels "female","male": 2 2 2 1 1 2 2 1 2 2 ...
## $ age : num 42 13 16 39 16 25 30 28 27 20 ...
## $ class : Factor w/ 7 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 2 2 3 3 ...
## $ embarked: Factor w/ 4 levels "Belfast","Cherbourg",..: 4 4 4 4 4 4 2 2 2 4 ...
## $ country : Factor w/ 48 levels "Argentina","Australia",..: 44 44 44 15 30 44 17 17 26 16 ...
## $ fare : num 7.11 20.05 20.05 20.05 7.13 ...
## $ sibsp : num 0 0 1 1 0 0 1 1 0 0 ...
## $ parch : num 0 2 1 1 0 0 0 0 0 0 ...
## $ survived: Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 2 2 ...
```

```
## [1] "1st" "2nd" "3rd"
## [4] "deck crew" "engineering crew" "restaurant staff"
## [7] "victualling crew"
```

`## [1] "Belfast" "Cherbourg" "Queenstown" "Southampton"`

Models considered for this dataset will use *survived* as the (binary) dependent variable.

### 4.1.1 Data exploration

It is always advisable to explore data before modelling. However, as this book is focused on model exploration, we will limit the data exploration part.

Before exploring the data, we first do some pre-processing. In particular, the values of the variables *age*, *country*, *sibsp*, *parch*, and *fare* are missing for a limited number of observations (2, 81, 10, 10, and 26, respectively). Analyzing data with missing values is a topic on its own (Little and Rubin 1987; Schafer 1997; Molenberghs and Kenward 2007). An often-used approach is to impute the missing values. Toward this end, multiple imputation should be considered (Schafer 1997; Molenberghs and Kenward 2007; van Buuren 2012). However, given the limited number of missing values and the intended illustrative use of the dataset, we limit ourselves to the, admittedly inferior, single imputation. In particular, we replace the missing *age* values by the mean of the observed ones, i.e., 30. Missing *country* is coded as "X". For *sibsp* and *parch*, we replace the missing values by the most frequently observed value, i.e., 0. Finally, for *fare*, we use the mean fare for a given *class*, i.e., 0 pounds for crew, 89 pounds for the first, 22 pounds for the second, and 13 pounds for the third class. The R code presented below implements these imputation steps.

```
# missing age is replaced by average (30)
titanic$age[is.na(titanic$age)] = 30
# missing country is replaced by "X"
titanic$country <- as.character(titanic$country)
titanic$country[is.na(titanic$country)] = "X"
titanic$country <- factor(titanic$country)
# missing fare is replaced by class average
titanic$fare[is.na(titanic$fare) & titanic$class == "1st"] = 89
titanic$fare[is.na(titanic$fare) & titanic$class == "2nd"] = 22
titanic$fare[is.na(titanic$fare) & titanic$class == "3rd"] = 13
# missing sibsp, parch are replaced by 0
titanic$sibsp[is.na(titanic$sibsp)] = 0
titanic$parch[is.na(titanic$parch)] = 0
```

After imputing the missing values, we investigate the association between survival status and the other variables. Figures 4.1-4.6 graphically present the proportions of non-survivors and survivors for different levels of the other variables. The height of the bars (on the y-axis) reflects the marginal distribution (proportions) of the observed levels of the variable. On the other hand, the width of the bars (on the x-axis) provides information about the proportion of non-survivors and survivors. Note that, to construct the graphs for *age* and *fare*, we categorized the range of the observed values.
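The plotting code is not shown in the text; a plot of this kind can be sketched, for instance, with base R's `mosaicplot()` (the colors and labels below are illustrative assumptions, not the code used for the figures):

```
# a hypothetical sketch of a plot like Figure 4.1: survival status by gender;
# tile sizes reflect the marginal distribution of gender and the
# proportions of non-survivors and survivors within each gender
mosaicplot(table(titanic$gender, titanic$survived),
           color = c("grey", "darkgreen"),
           xlab = "gender", ylab = "survived",
           main = "Survival status by gender")
```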

Figures 4.1 and 4.2 indicate that the proportion of survivors was larger for females and for children below 5 years of age. This is most likely the result of the "women and children first" principle that is often invoked in situations that require evacuation of persons whose life is in danger. The principle can, perhaps, partially explain the trend seen in Figures 4.3 and 4.4, i.e., a higher proportion of survivors among those with 1-3 parents/children and 1-2 siblings/spouses aboard. Figure 4.5 indicates that passengers travelling in the first and second class had a higher chance of survival, perhaps due to the proximity of their cabins to the deck. Interestingly, the proportion of survivors among the deck crew was similar to the proportion among the first-class passengers. Figure 4.6 shows that the proportion of survivors increased with the fare, which is consistent with the fact that the proportion was higher for passengers travelling in the first and second class. Finally, Figures 4.7 and 4.8 do not suggest any noteworthy trends.

### 4.1.2 Logistic regression

The dependent variable of interest, *survived*, is binary. Thus, a natural choice is to build a logistic regression model. We do not consider *country* as an explanatory variable. As there is no reason to expect a linear relationship between age and odds of survival, we use linear tail-restricted cubic splines, available in the `rcs()` function of the `rms` package (Harrell Jr 2018), to model the effect of age. We also do not expect a linear relation for the *fare* variable but, because of its skewness, we do not use splines for this variable. The results of the model are stored in the model-object `titanic_lmr_v6`, which will be used in subsequent chapters.

```
library("rms")
set.seed(1313)
titanic_lmr_v6 <- lrm(survived == "yes" ~ gender + rcs(age) + class + sibsp +
parch + fare + embarked, titanic)
titanic_lmr_v6
```

```
## Logistic Regression Model
##
## lrm(formula = survived == "yes" ~ gender + rcs(age) + class +
## sibsp + parch + fare + embarked, data = titanic)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 2207 LR chi2 752.06 R2 0.404 C 0.817
## FALSE 1496 d.f. 17 g 1.647 Dxy 0.635
## TRUE 711 Pr(> chi2) <0.0001 gr 5.191 gamma 0.636
## max |deriv| 0.0001 gp 0.282 tau-a 0.277
## Brier 0.146
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 4.5746 0.5480 8.35 <0.0001
## gender=male -2.7687 0.1586 -17.45 <0.0001
## age -0.1180 0.0221 -5.35 <0.0001
## age' 0.6313 0.1628 3.88 0.0001
## age'' -2.6583 0.7840 -3.39 0.0007
## age''' 2.8977 1.0130 2.86 0.0042
## class=2nd -1.1390 0.2501 -4.56 <0.0001
## class=3rd -2.0627 0.2490 -8.28 <0.0001
## class=deck crew 1.0672 0.3498 3.05 0.0023
## class=engineering crew -0.9702 0.2648 -3.66 0.0002
## class=restaurant staff -3.1712 0.6583 -4.82 <0.0001
## class=victualling crew -1.0877 0.2596 -4.19 <0.0001
## sibsp -0.4504 0.1006 -4.48 <0.0001
## parch -0.0871 0.0987 -0.88 0.3776
## fare 0.0014 0.0020 0.70 0.4842
## embarked=Cherbourg 0.7881 0.2836 2.78 0.0055
## embarked=Queenstown 0.2745 0.3409 0.80 0.4208
## embarked=Southampton 0.2343 0.2119 1.11 0.2689
##
```

### 4.1.3 Random forest

As an alternative to the logistic regression model, we consider a random forest model. Random forest is known for good predictive performance, the ability to grasp low-level variable interactions, and stability (Breiman 2001). To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name (Liaw and Wiener 2002b).

In the first instance, we fit a model with the same set of explanatory variables as the logistic regression model. The results of the model are stored in the model-object `titanic_rf_v6`.

```
library("randomForest")
set.seed(1313)
titanic_rf_v6 <- randomForest(survived ~ class + gender + age + sibsp + parch + fare + embarked,
data = titanic)
titanic_rf_v6
```

```
##
## Call:
## randomForest(formula = survived ~ class + gender + age + sibsp + parch + fare + embarked, data = titanic)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 18.62%
## Confusion matrix:
## no yes class.error
## no 1393 103 0.06885027
## yes 308 403 0.43319269
```

For comparison purposes, we also consider a model with only three explanatory variables: *class*, *gender*, and *age*. The results of the model are stored in the model-object `titanic_rf_v3`.
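The fitting code for this model is not included in the text; based on the Call shown in the output below, it can be reproduced along these lines (the seed is an assumption; it follows the one used for the other models in this chapter):

```
# sketch of the fit for titanic_rf_v3; the seed is assumed, not taken
# from the original source
set.seed(1313)
titanic_rf_v3 <- randomForest(survived ~ class + gender + age,
                              data = titanic)
titanic_rf_v3
```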

```
##
## Call:
## randomForest(formula = survived ~ class + gender + age, data = titanic)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 1
##
## OOB estimate of error rate: 21.02%
## Confusion matrix:
## no yes class.error
## no 1367 129 0.08622995
## yes 335 376 0.47116737
```

### 4.1.4 Gradient boosting

Finally, we consider a gradient-boosting model (Friedman 2000). The model is known for its ability to accommodate higher-order interactions between variables. We use the same set of explanatory variables as for the logistic regression model. To fit the gradient-boosting model, we use the `gbm()` function from the `gbm` package (Ridgeway 2017). The results of the model are stored in the model-object `titanic_gbm_v6`.

```
library("gbm")
set.seed(1313)
titanic_gbm_v6 <- gbm(survived == "yes" ~ class + gender + age + sibsp + parch + fare + embarked,
data = titanic, n.trees = 15000)
```

`## Distribution not specified, assuming bernoulli ...`

```
## gbm(formula = survived == "yes" ~ class + gender + age + sibsp +
## parch + fare + embarked, data = titanic, n.trees = 15000)
## A gradient boosted model with bernoulli loss function.
## 15000 iterations were performed.
## There were 7 predictors of which 7 had non-zero influence.
```

### 4.1.5 Model predictions

Let us now compare predictions obtained from the three different models. In particular, we will compute the predicted probability of survival for an 8-year-old boy who embarked in Belfast and travelled in the first class with no parents or siblings aboard, and with a ticket costing 72 pounds.

First, we create a data frame `johny_d` that contains the data describing the passenger.

```
johny_d <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew",
                                   "engineering crew", "restaurant staff",
                                   "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8,
  sibsp = 0,
  parch = 0,
  fare = 72,
  embarked = factor("Belfast", levels = c("Belfast", "Cherbourg",
                                          "Queenstown", "Southampton"))
)
```

Subsequently, we use the generic function `predict()` to obtain the predicted probability of survival for the logistic regression model.
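The call producing the output below is not shown in the text; for an `lrm` model-object, a sketch consistent with the output (and with the `type = "fitted"` argument used later for `henry`) is:

```
# predicted probability of survival for johny_d (logistic regression);
# type = "fitted" returns probabilities for an rms::lrm model
predict(titanic_lmr_v6, johny_d, type = "fitted")
```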

```
## 1
## 0.7233413
```

The predicted probability is equal to 0.72.

We do the same for the random forest and gradient boosting models.
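The corresponding calls are, again, not shown in the text; sketches consistent with the outputs below might look as follows (for `gbm`, the number of trees has to be specified, and `type = "response"` returns probabilities for the Bernoulli loss):

```
# predicted class probabilities for johny_d (random forest)
predict(titanic_rf_v6, johny_d, type = "prob")
# predicted probability of survival for johny_d (gradient boosting)
predict(titanic_gbm_v6, johny_d, n.trees = 15000, type = "response")
```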

```
## no yes
## 1 0.626 0.374
## attr(,"class")
## [1] "matrix" "votes"
```

`## [1] 0.6700909`

As a result, we obtain the predicted probabilities of 0.37 and 0.67, respectively.

The models lead to different probabilities. Thus, it might be of interest to understand the reason for the differences, as it could help us to decide which of the predictions we might want to trust.

Note that for some examples we will use another observation (instance) with lower chances of survival. Let’s call this passenger Henry.

```
henry <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew",
                                   "engineering crew", "restaurant staff",
                                   "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 47,
  sibsp = 0,
  parch = 0,
  fare = 25,
  embarked = factor("Cherbourg", levels = c("Belfast", "Cherbourg",
                                            "Queenstown", "Southampton"))
)
round(predict(titanic_lmr_v6, henry, type = "fitted"), 2)
```

```
## 1
## 0.43
```

`## [1] 0.25`

`## [1] 0.42`
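The two one-line outputs above correspond to the random forest and gradient boosting models, respectively. They can be reproduced with calls along these lines (the exact extraction of the "yes"-class probability is an assumption, not the original code):

```
# sketches of the predictions for henry; probability of survival
round(predict(titanic_rf_v6, henry, type = "prob")[, "yes"], 2)
round(predict(titanic_gbm_v6, henry, n.trees = 15000, type = "response"), 2)
```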

### 4.1.6 Explainers

Model-objects created with different libraries may have different internal structures. Thus, we first have to create a wrapper around the model. Toward this end, we use the `explain()` function from the `DALEX` package (Biecek 2018). The function requires five arguments:

* `model`, a model-object;
* `data`, a validation data frame;
* `y`, observed values of the dependent variable for the validation data;
* `predict_function`, a function that returns prediction scores; if not specified, then a default `predict()` function is used;
* `label`, a unique name of the model; if not specified, then it is extracted from `class(model)`.

In the example below, we create explainers for the logistic regression, random forest, and gradient-boosting models created for the Titanic data.

Each explainer wraps all elements needed to create a model explanation, i.e., a suitable `predict()` function, the validation dataset, and the model-object. Thus, in subsequent chapters, we will use the explainers instead of the model-objects to keep the code snippets more concise.

```
explain_titanic_lmr_v6 <- explain(model = titanic_lmr_v6,
                                  data = titanic[, -9],
                                  y = titanic$survived == "yes",
                                  predict_function = function(m, x)
                                    predict(m, x, type = "fitted"),
                                  label = "Logistic Regression v6")
explain_titanic_rf_v6 <- explain(model = titanic_rf_v6,
                                 data = titanic[, -9],
                                 y = titanic$survived == "yes",
                                 label = "Random Forest v6")
explain_titanic_rf_v3 <- explain(model = titanic_rf_v3,
                                 data = titanic[, -9],
                                 y = titanic$survived == "yes",
                                 label = "Random Forest v3")
explain_titanic_gbm_v6 <- explain(model = titanic_gbm_v6,
                                  data = titanic[, -9],
                                  y = titanic$survived == "yes",
                                  label = "Generalized Boosted Regression v6")
```

### 4.1.7 List of objects for the `titanic` example

In the previous sections, we have built several predictive models for the `titanic` data set. The models will be used in the rest of the book to illustrate model-explanation methods and tools.

For ease of reference, we summarize the models in Table 4.1. The binary model-objects can be downloaded by using the indicated `archivist` hooks (Biecek and Kosinski 2017). By calling the function specified in the last column of the table, one can recreate a selected model in a local R environment.

Model name | Model generator | Variables | Archivist hooks |
---|---|---|---|
`titanic_lmr_v6` | `rms::lrm` v.5.1.3 | gender, age, class, sibsp, parch, fare, embarked | Get the model: `archivist::aread("pbiecek/models/ceb40")`. Get the explainer: `archivist::aread("pbiecek/models/2b9b6")` |
`titanic_rf_v6` | `randomForest::randomForest` v.4.6.14 | gender, age, class, sibsp, parch, fare, embarked | Get the model: `archivist::aread("pbiecek/models/1f938")`. Get the explainer: `archivist::aread("pbiecek/models/9b971")` |
`titanic_rf_v3` | `randomForest::randomForest` v.4.6.14 | gender, age, class | Get the model: `archivist::aread("pbiecek/models/855c1")`. Get the explainer: `archivist::aread("pbiecek/models/92754")` |
`titanic_gbm_v6` | `gbm::gbm` v.2.1.5 | gender, age, class, sibsp, parch, fare, embarked | Get the model: `archivist::aread("pbiecek/models/24e72")`. Get the explainer: `archivist::aread("pbiecek/models/84d5f")` |

Table 4.2 summarizes the data frames that will be used in examples in the subsequent chapters.

Description | No. rows | Variables | Link to this object |
---|---|---|---|
`titanic` dataset with imputed missing values | 2207 | gender, age, class, embarked, country, fare, sibsp, parch, survived | `archivist::aread("pbiecek/models/27e5c")` |
`johny_d`, an 8-year-old boy who travelled in the 1st class without parents | 1 | class, gender, age, sibsp, parch, fare, embarked | `archivist::aread("pbiecek/models/e3596")` |
`henry`, a 47-year-old male passenger from the 1st class, who paid 25 pounds for the ticket and embarked at Cherbourg | 1 | class, gender, age, sibsp, parch, fare, embarked | `archivist::aread("pbiecek/models/a6538")` |

## 4.2 Apartment prices

Predicting house prices is a common exercise used in machine-learning courses. Various datasets for house prices are available at websites like Kaggle (https://www.kaggle.com) or UCI Machine Learning Repository (https://archive.ics.uci.edu).

In this book, we will work with an interesting variant of this problem. The `apartments` dataset is an artificial dataset created to match key characteristics of real apartments in Warszawa, the capital of Poland. However, the dataset is created in such a way that two very different models, namely linear regression and random forest, have almost exactly the same accuracy. The natural question is then: which model should we choose? We will show that model-explanation tools provide important insight into key model characteristics and are helpful in model selection.

The dataset is available in the `DALEX` package (Biecek 2018). It contains 1000 observations (apartments) and six variables:

* *m2.price*, the apartment's price per square meter (in EUR), a numerical variable;
* *construction.year*, the year of construction of the block of flats in which the apartment is located, a numerical variable;
* *surface*, the apartment's total surface in square meters, a numerical variable;
* *floor*, the floor on which the apartment is located (ground floor taken to be the first floor), a numerical integer variable with values from 1 to 10;
* *no.rooms*, the total number of rooms, a numerical variable with values from 1 to 6;
* *district*, a factor with 10 levels indicating the district of Warszawa where the apartment is located.

The R code below provides more info about the contents of the dataset, values of the variables, etc.

```
## m2.price construction.year surface floor no.rooms district
## 1 5897 1953 25 3 1 Srodmiescie
## 2 1818 1992 143 9 5 Bielany
```

```
## 'data.frame': 1000 obs. of 6 variables:
## $ m2.price : num 5897 1818 3643 3517 3013 ...
## $ construction.year: num 1953 1992 1937 1995 1992 ...
## $ surface : num 25 143 56 93 144 61 127 105 145 112 ...
## $ floor : int 3 9 1 7 6 6 8 8 6 9 ...
## $ no.rooms : num 1 5 2 3 5 2 5 4 6 4 ...
## $ district : Factor w/ 10 levels "Bemowo","Bielany",..: 6 2 5 4 3 6 3 7 6 6 ...
```

```
##
## 1 2 3 4 5 6 7 8 9 10
## 90 116 87 86 95 104 103 103 108 108
```

```
##
## 1 2 3 4 5 6
## 99 202 231 223 198 47
```

```
## [1] "Bemowo" "Bielany" "Mokotow" "Ochota" "Praga"
## [6] "Srodmiescie" "Ursus" "Ursynow" "Wola" "Zoliborz"
```

Models considered for this dataset will use *m2.price* as the (continuous) dependent variable.

Model predictions will be obtained for a set of six apartments included in the data frame `apartments_test`, also available in the `DALEX` package.

```
## m2.price construction.year surface floor no.rooms district
## 1001 4644 1976 131 3 5 Srodmiescie
## 1002 3082 1978 112 9 4 Mokotow
## 1003 2498 1958 100 7 4 Bielany
## 1004 2735 1951 112 3 5 Wola
## 1005 2781 1978 102 4 4 Bemowo
## 1006 2936 2001 116 7 4 Bemowo
```

### 4.2.1 Data exploration

Note that `apartments` is an artificial dataset, created to illustrate and explain differences between random forest and linear regression. Hence, the structure of the data, the form and strength of the association between variables, the plausibility of distributional assumptions, etc., are better than in a real-life dataset. In fact, all these characteristics of the data are known. Nevertheless, we conduct some data exploration to illustrate the important aspects of the data.

We begin with *m2.price*, the price per square meter. The histogram presented in Figure 4.9 indicates that the distribution of the variable is slightly skewed to the right.

Figure 4.10 suggests (possibly) a nonlinear relation between *construction.year* and *m2.price*.

Figure 4.11 indicates a linear relation between *surface* and *m2.price*.

Relation between *floor* and *m2.price* is also close to linear, as seen in Figure 4.12.

There is a close to linear relation between *no.rooms* and *m2.price*, as suggested by Figure 4.13. It is worth noting that, quite naturally, surface and number of rooms are correlated (see Figure 4.14).

Prices depend on district. Violin plots in Figure 4.15 indicate that the highest prices per meter-squared are observed in Srodmiescie (Downtown).

### 4.2.2 Linear regression

The dependent variable of interest, *m2.price*, is continuous. Thus, a natural choice to build a predictive model is linear regression. We treat all the other variables in the `apartments` data frame as explanatory and include them in the model. The results of the model are stored in the model-object `apartments_lm_v5`.
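The fitting code is not shown in the text; based on the Call in the output below, the model can be reproduced as:

```
# fit reconstructed from the printed Call; all variables other than
# m2.price enter as explanatory variables
apartments_lm_v5 <- lm(m2.price ~ ., data = apartments)
apartments_lm_v5
```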

```
##
## Call:
## lm(formula = m2.price ~ ., data = apartments)
##
## Coefficients:
## (Intercept) construction.year surface
## 5003.248 -0.229 -10.238
## floor no.rooms districtBielany
## -99.482 -37.730 34.106
## districtPraga districtUrsynow districtBemowo
## -20.214 -1.974 16.891
## districtUrsus districtZoliborz districtMokotow
## 46.833 906.865 935.271
## districtOchota districtSrodmiescie
## 943.145 2097.502
```

### 4.2.3 Random forest

As an alternative to linear regression, we consider a random forest model. To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name (Liaw and Wiener 2002b). The results of the model are stored in the model-object `apartments_rf_v5`.

```
library("randomForest")
set.seed(72)
apartments_rf_v5 <- randomForest(m2.price ~ ., data = apartments)
apartments_rf_v5
```

```
##
## Call:
## randomForest(formula = m2.price ~ ., data = apartments)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 79789.39
## % Var explained: 90.28
```

### 4.2.4 Model predictions

By applying the `predict()` function to the model-object `apartments_lm_v5`, with `apartments_test` as the data frame for which predictions are to be computed, we obtain the predicted prices for the testing set of six apartments for the linear regression model. Subsequently, we compute the square-root of the mean of squared differences between the predicted and actual prices for the test apartments. We repeat the same steps for the random forest model.

```
predicted_apartments_lm <- predict(apartments_lm_v5, apartments_test)
rmsd_lm <- sqrt(mean((predicted_apartments_lm - apartments_test$m2.price)^2))
rmsd_lm
```

`## [1] 283.0865`

```
predicted_apartments_rf <- predict(apartments_rf_v5, apartments_test)
rmsd_rf <- sqrt(mean((predicted_apartments_rf - apartments_test$m2.price)^2))
rmsd_rf
```

`## [1] 282.9519`

For the random forest model, the square-root of the mean of squared differences is equal to about 283.0. It is only minimally smaller than the value of 283.1 obtained for the linear regression model. Thus, the question we may face is: should we choose the more complex but flexible random forest model, or the simpler and easier-to-interpret linear regression model? In subsequent chapters, we will try to provide an answer to this question.

### 4.2.5 List of objects for the `apartments` example

In Sections 4.2.2 and 4.2.3, we have built two predictive models for the `apartments` data set. The models will be used in the rest of the book to illustrate model-explanation methods and tools.

For ease of reference, we summarize the models in Table 4.3. The binary model-objects can be downloaded by using the indicated `archivist` hooks (Biecek and Kosinski 2017). By calling the function specified in the last column of the table, one can recreate a selected model in a local R environment.

Model name | Model generator | Variables | Archivist hooks |
---|---|---|---|
`apartments_lm_v5` | `stats::lm` v.3.5.3 | construction.year, surface, floor, no.rooms, district | Get the model: `archivist::aread("pbiecek/models/55f19")` |
`apartments_rf_v5` | `randomForest::randomForest` v.4.6.14 | construction.year, surface, floor, no.rooms, district | Get the model: `archivist::aread("pbiecek/models/fe7a5")` |

## 4.3 Hire or fire

Predictive models can be used to support decisions. For instance, they can be used in a human-resources department to decide whether to promote an employee. An advantage of using a model for this purpose would be the objectivity of the decision, which would not be subject to the personal preferences of a manager. However, in such a situation, one would most likely want to understand what influences the model's prediction.

To illustrate such a situation, we will use the `HR` dataset that is available in the `DALEX` package (Biecek 2018). It is an artificial set of data from a human-resources department of a call center. It contains 7847 observations (employees of the call center) and six variables:

* *gender*, person's gender, a factor with two levels;
* *age*, person's age in years, a numerical variable;
* *hours*, the average number of working hours per week, a numerical variable;
* *evaluation*, the last evaluation score, a numerical variable with values 2 (fail), 3 (satisfactory), 4 (good), and 5 (very good);
* *salary*, the salary level, a numerical variable with values from 0 (lowest) to 5 (highest);
* *status*, a factor with three levels indicating whether the employee was fired, retained ("ok"), or promoted.

The R code below provides more info about the contents of the dataset, values of the variables, etc.

```
## gender age hours evaluation salary status
## 1 male 32.58267 41.88626 3 1 fired
## 2 female 41.21104 36.34339 2 5 fired
## 3 male 37.70516 36.81718 3 0 fired
## 4 female 30.06051 38.96032 3 2 fired
```

```
## 'data.frame': 7847 obs. of 6 variables:
## $ gender : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 1 2 1 1 ...
## $ age : num 32.6 41.2 37.7 30.1 21.1 ...
## $ hours : num 41.9 36.3 36.8 39 62.2 ...
## $ evaluation: num 3 2 3 3 5 2 4 2 2 4 ...
## $ salary : num 1 5 0 2 3 0 0 4 4 4 ...
## $ status : Factor w/ 3 levels "fired","ok","promoted": 1 1 1 1 3 1 3 2 1 3 ...
```

```
##
## 2 3 4 5
## 2371 2272 1661 1543
```

```
##
## 0 1 2 3 4 5
## 1105 1417 1461 1508 1316 1040
```

Models considered for this dataset will use *status* as the (categorical) dependent variable.

### 4.3.1 Data exploration

As it was the case for the `apartments`

dataset (see Section 4.2), the `HR`

data were simulated. Despite the fact that characteristics of the data are known, we conduct some data exploration to illustrate the important aspects of the data.

Figure 4.16 indicates that young females and older males were fired more frequently than older females and younger males.

Figure 4.17 indicates that the proportion of promoted employees was the lowest for the lowest and the highest salary levels. At the same time, the proportion of fired employees was the highest for these two salary levels.

### 4.3.2 Multinomial logistic regression

The dependent variable of interest, *status*, is categorical with three categories. Thus, a simple choice is to consider a multinomial logistic regression model (Venables and Ripley 2010). We fit the model with the help of function `multinom`

from package `nnet`

. The function fits multinomial log-linear models by using the neural-networks approach. As a result, we obtain a complex model that is smoother as compared to a random forest model that relies on binary splits for continuous variables. We treat all variables other than `status`

in the `HR`

data frame as explanatory and include them in the model. The results of the model are stored in model-object `HR_glm_v5`

.

```
library("nnet")
set.seed(1313)
HR_glm_v5 <- multinom(status ~ gender + age + hours + evaluation + salary, data = HR)
```

```
## # weights: 21 (12 variable)
## initial value 8620.810629
## iter 10 value 7002.127738
## iter 20 value 6239.478146
## iter 20 value 6239.478126
## iter 20 value 6239.478124
## final value 6239.478124
## converged
```

```
## Call:
## multinom(formula = status ~ gender + age + hours + evaluation +
## salary, data = HR)
##
## Coefficients:
## (Intercept) gendermale age hours evaluation
## ok -3.199741 0.05185293 0.001003521 0.06628055 -0.03734345
## promoted -12.677639 0.11838037 0.003436872 0.16253343 1.26109093
## salary
## ok 0.01680039
## promoted 0.01507927
##
## Residual Deviance: 12478.96
## AIC: 12502.96
```

### 4.3.3 Random forest

As an alternative to multinomial logistic regression, we consider a random forest model. To fit the model, we apply the `randomForest()` function, with default settings, from the package with the same name (Liaw and Wiener 2002b). The results of the model are stored in the model-object `HR_rf_v5`.

```
library("randomForest")
set.seed(1313)
HR_rf_v5 <- randomForest(status ~ gender + age + hours + evaluation + salary, data = HR)
HR_rf_v5
```

```
##
## Call:
## randomForest(formula = status ~ gender + age + hours + evaluation + salary, data = HR)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 27.46%
## Confusion matrix:
## fired ok promoted class.error
## fired 2270 388 197 0.2049037
## ok 530 1243 448 0.4403422
## promoted 202 390 2179 0.2136413
```

### 4.3.4 Model predictions

Let us now compare predictions obtained from the multinomial regression and random forest models. In particular, we compute the predicted probabilities of being fired, retained, or promoted for Dilbert, a 58-year-old male working around 42 hours per week for a salary at level 2, who got an evaluation score equal to 2. The data frame `dilbert` contains the data describing the employee.

```
dilbert <- data.frame(gender = factor("male", levels = c("male", "female")),
                      age = 57.7,
                      hours = 42.3,
                      evaluation = 2,
                      salary = 2)
```

By applying the `predict()` function to the model-objects `HR_rf_v5` and `HR_glm_v5`, with `dilbert` as the data frame for which predictions are to be computed, and with the argument indicating that probabilities are to be returned (`type = "prob"` for the random forest model and `type = "probs"` for the multinomial one), we obtain the predicted probabilities of being fired, retained, or promoted for Dilbert.
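The calls producing the outputs below are not shown in the text; sketches consistent with them are:

```
# predicted probabilities for Dilbert; note that randomForest and
# nnet::multinom spell the probability argument differently
predict(HR_rf_v5, dilbert, type = "prob")    # random forest
predict(HR_glm_v5, dilbert, type = "probs")  # multinomial regression
```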

```
## fired ok promoted
## 1 0.778 0.218 0.004
## attr(,"class")
## [1] "matrix" "votes"
```

```
## fired ok promoted
## 0.56369601 0.40630786 0.02999612
```

For both models, the predicted probability of promotion is low; it is more likely that Dilbert will be fired. It is of interest to understand why such a prediction is made. Moreover, the random forest model yields a higher probability of being fired (0.78) than the multinomial regression model (0.56). We may want to learn where this difference comes from. We will try to answer these questions in subsequent chapters.

### 4.3.5 List of objects for the `HR` example

In Sections 4.3.2 and 4.3.3, we have built two predictive models for the `HR` data set. The models will be used in the remainder of the book to illustrate model-explanation methods and tools.

For ease of reference, we summarize the models in Table 4.4. The binary model-objects can be downloaded by using the indicated `archivist` hooks (Biecek and Kosinski 2017). By calling the function specified in the last column of the table, one can recreate a selected model in a local R environment.

Model name | Model generator | Variables | Archivist hooks |
---|---|---|---|
`HR_rf_v5` | `randomForest::randomForest` v.4.6.14 | gender, age, hours, evaluation, salary | Get the model: `archivist::aread("pbiecek/models/1ecfd")` |
`HR_glm_v5` | `nnet::multinom` | gender, age, hours, evaluation, salary | Get the model: `archivist::aread("pbiecek/models/f0244")` |

### References

Biecek, Przemyslaw. 2018. *DALEX: Descriptive mAchine Learning Explanations*. https://pbiecek.github.io/DALEX/.

Biecek, Przemyslaw, and Marcin Kosinski. 2017. "archivist: An R Package for Managing, Recording and Restoring Data Analysis Results." *Journal of Statistical Software* 82 (11): 1–28. https://doi.org/10.18637/jss.v082.i11.

Breiman, Leo. 2001. "Random Forests." *Machine Learning* 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Friedman, Jerome H. 2000. "Greedy Function Approximation: A Gradient Boosting Machine." *Annals of Statistics* 29: 1189–1232.

Harrell Jr, Frank E. 2018. *rms: Regression Modeling Strategies*. https://CRAN.R-project.org/package=rms.

Liaw, Andy, and Matthew Wiener. 2002b. "Classification and Regression by randomForest." *R News* 2 (3): 18–22. https://CRAN.R-project.org/doc/Rnews/.

Ridgeway, Greg. 2017. *gbm: Generalized Boosted Regression Models*. https://CRAN.R-project.org/package=gbm.

Venables, W. N., and B. D. Ripley. 2010. *Modern Applied Statistics with S*. Springer.