Chapter 4 Data Sets

We illustrate the methods presented in this book by using three datasets:

  • Sinking of the RMS Titanic
  • Apartment prices
  • Hire or fire

The first dataset will be used to illustrate the application of the techniques in the case of a predictive model for a binary dependent variable. The second one will provide an example for models for a continuous variable. Finally, the third dataset will be used for illustration of models for a categorical dependent variable.

In this chapter, we provide a short description of each of the datasets, together with results of exploratory analyses. We also introduce models that will be used for illustration purposes in subsequent chapters.

4.1 Sinking of the RMS Titanic

Titanic sinking by Willy Stöwer

The sinking of the RMS Titanic is one of the deadliest peacetime maritime disasters in history. Over 1500 people died as a consequence of the collision with an iceberg. Projects like Encyclopedia Titanica (https://www.encyclopedia-titanica.org/) are a source of rich and precise data about Titanic’s passengers. The data are available in a dataset included in the stablelearner package. The dataset, after some data cleaning and variable transformations, is also available in the DALEX package. In particular, the titanic data frame contains 2207 observations (for 1317 passengers and 890 crew members) and nine variables:

  • gender, person’s (passenger’s or crew member’s) gender, a factor (categorical variable) with two levels (categories);
  • age, person’s age in years, a numerical variable; for adults, the age is given in (integer) years; for children younger than one year, the age is given as \(x/12\), where \(x\) is the number of months of child’s age;
  • class, the class in which the passenger travelled, or the duty class of a crew member; a factor with seven levels;
  • embarked, the harbor in which the person embarked on the ship, a factor with four levels;
  • country, person’s home country, a factor with 48 levels;
  • fare, the price of the ticket (only available for passengers; 0 for crew members), a numerical variable;
  • sibsp, the number of siblings/spouses aboard the ship, a numerical variable;
  • parch, the number of parents/children aboard the ship, a numerical variable;
  • survived, a factor with two levels indicating whether the person survived or not.

The R code below provides more info about the contents of the dataset, values of the variables, etc.
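The output shown below can be generated with calls along the following lines (a sketch; it assumes that the DALEX package, which provides the titanic data frame, is attached):

library("DALEX")
head(titanic, 2)          # first two observations
str(titanic)              # structure of the data frame
levels(titanic$class)     # levels of the class factor
levels(titanic$embarked)  # levels of the embarked factor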

##   gender age class    embarked       country  fare sibsp parch survived
## 1   male  42   3rd Southampton United States  7.11     0     0       no
## 2   male  13   3rd Southampton United States 20.05     0     2       no
## 'data.frame':    2207 obs. of  9 variables:
##  $ gender  : Factor w/ 2 levels "female","male": 2 2 2 1 1 2 2 1 2 2 ...
##  $ age     : num  42 13 16 39 16 25 30 28 27 20 ...
##  $ class   : Factor w/ 7 levels "1st","2nd","3rd",..: 3 3 3 3 3 3 2 2 3 3 ...
##  $ embarked: Factor w/ 4 levels "Belfast","Cherbourg",..: 4 4 4 4 4 4 2 2 2 4 ...
##  $ country : Factor w/ 48 levels "Argentina","Australia",..: 44 44 44 15 30 44 17 17 26 16 ...
##  $ fare    : num  7.11 20.05 20.05 20.05 7.13 ...
##  $ sibsp   : num  0 0 1 1 0 0 1 1 0 0 ...
##  $ parch   : num  0 2 1 1 0 0 0 0 0 0 ...
##  $ survived: Factor w/ 2 levels "no","yes": 1 1 1 2 2 2 1 2 2 2 ...
## [1] "1st"              "2nd"              "3rd"             
## [4] "deck crew"        "engineering crew" "restaurant staff"
## [7] "victualling crew"
## [1] "Belfast"     "Cherbourg"   "Queenstown"  "Southampton"

Models considered for this dataset will use survived as the (binary) dependent variable.

4.1.1 Data exploration

It is always advisable to explore data before modelling. However, as this book is focused on model exploration, we will limit the data exploration part.

Before exploring the data, we first do some pre-processing. In particular, the values of the variables age, country, sibsp, parch, and fare are missing for a limited number of observations (2, 81, 10, 10, and 26, respectively). Analyzing data with missing values is a topic on its own (Little and Rubin 1987; Schafer 1997; Molenberghs and Kenward 2007). An often-used approach is to impute the missing values, and toward this end, multiple imputation should be considered (Schafer 1997; Molenberghs and Kenward 2007; van Buuren 2012). However, given the limited number of missing values and the intended illustrative use of the dataset, we limit ourselves to, admittedly inferior, single imputation. In particular, we replace the missing age values by the mean of the observed ones, i.e., 30. Missing country is coded as “X”. For sibsp and parch, we replace the missing values by the most frequently observed value, i.e., 0. Finally, for fare, we use the mean fare for a given class, i.e., 0 pounds for the crew, 89 pounds for the 1st, 22 pounds for the 2nd, and 13 pounds for the 3rd class. The R code presented below implements these imputation steps.
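The snippet is a minimal sketch of the single-imputation steps described above; it assumes that titanic has been loaded from the DALEX package.

# replace missing age values by the mean of the observed ones
titanic$age[is.na(titanic$age)] <- 30
# code missing country as "X" (added as a new factor level)
levels(titanic$country) <- c(levels(titanic$country), "X")
titanic$country[is.na(titanic$country)] <- "X"
# replace missing sibsp and parch by the most frequent value
titanic$sibsp[is.na(titanic$sibsp)] <- 0
titanic$parch[is.na(titanic$parch)] <- 0
# replace missing fare by the class-specific mean fare
titanic$fare[is.na(titanic$fare) & titanic$class == "1st"] <- 89
titanic$fare[is.na(titanic$fare) & titanic$class == "2nd"] <- 22
titanic$fare[is.na(titanic$fare) & titanic$class == "3rd"] <- 13
titanic$fare[is.na(titanic$fare)] <- 0  # crew members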

After imputing the missing values, we investigate the association between survival status and the other variables. Figures 4.1–4.8 graphically present the proportions of non-survivors and survivors for different levels of the other variables. The height of the bars (on the y-axis) reflects the marginal distribution (proportions) of the observed levels of the variable. The width of the bars (on the x-axis), on the other hand, provides information about the proportions of non-survivors and survivors. Note that, to construct the graphs for age and fare, we categorized the range of the observed values.

Figures 4.1 and 4.2 indicate that the proportion of survivors was larger for females and for children below 5 years of age. This is most likely the result of the “women and children first” principle that is often invoked in situations that require evacuation of persons whose life is in danger. The principle can, perhaps, partially explain the trend seen in Figures 4.3 and 4.4, i.e., a higher proportion of survivors among those with 1–3 parents/children and 1–2 siblings/spouses aboard. Figure 4.5 indicates that passengers travelling in the first and second class had a higher chance of survival, perhaps due to the proximity of their cabins to the deck. Interestingly, the proportion of survivors among the deck crew was similar to the proportion among first-class passengers. Figure 4.6 shows that the proportion of survivors increased with the fare, which is consistent with the fact that the proportion was higher for passengers travelling in the first and second class. Finally, Figures 4.7 and 4.8 do not suggest any noteworthy trends.

Figure 4.1: Survival according to gender in the Titanic data.

Figure 4.2: Survival according to age group in the Titanic data.

Figure 4.3: Survival according to the number of parents/children in the Titanic data.

Figure 4.4: Survival according to the number of siblings/spouses in the Titanic data.

Figure 4.5: Survival according to the class in the Titanic data.

Figure 4.6: Survival according to fare in the Titanic data.

Figure 4.7: Survival according to the port of embarking in the Titanic data.

Figure 4.8: Survival according to country in the Titanic data.

4.1.2 Logistic regression

The dependent variable of interest, survived, is binary. Thus, a natural choice to build a predictive model is logistic regression. We do not consider country as an explanatory variable. As there is no reason to expect a linear relationship between age and the odds of survival, we use linear tail-restricted cubic splines, available in the rcs() function of the rms package (Harrell Jr 2018), to model the effect of age. We also do not expect a linear relation for the fare variable, but because of its skewness, we do not use splines for this variable. The results of the model are stored in model-object titanic_lmr_v6, which will be used in subsequent chapters.
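The corresponding call can be sketched as follows; the formula matches the one echoed in the output.

library("rms")
titanic_lmr_v6 <- lrm(survived == "yes" ~ gender + rcs(age) + class +
                        sibsp + parch + fare + embarked, data = titanic)
titanic_lmr_v6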

## Logistic Regression Model
##  
##  lrm(formula = survived == "yes" ~ gender + rcs(age) + class + 
##      sibsp + parch + fare + embarked, data = titanic)
##  
##                         Model Likelihood     Discrimination    Rank Discrim.    
##                            Ratio Test           Indexes           Indexes       
##  Obs           2207    LR chi2     752.06    R2       0.404    C       0.817    
##   FALSE        1496    d.f.            17    g        1.647    Dxy     0.635    
##   TRUE          711    Pr(> chi2) <0.0001    gr       5.191    gamma   0.636    
##  max |deriv| 0.0001                          gp       0.282    tau-a   0.277    
##                                              Brier    0.146                     
##  
##                         Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept               4.5746 0.5480   8.35 <0.0001 
##  gender=male            -2.7687 0.1586 -17.45 <0.0001 
##  age                    -0.1180 0.0221  -5.35 <0.0001 
##  age'                    0.6313 0.1628   3.88 0.0001  
##  age''                  -2.6583 0.7840  -3.39 0.0007  
##  age'''                  2.8977 1.0130   2.86 0.0042  
##  class=2nd              -1.1390 0.2501  -4.56 <0.0001 
##  class=3rd              -2.0627 0.2490  -8.28 <0.0001 
##  class=deck crew         1.0672 0.3498   3.05 0.0023  
##  class=engineering crew -0.9702 0.2648  -3.66 0.0002  
##  class=restaurant staff -3.1712 0.6583  -4.82 <0.0001 
##  class=victualling crew -1.0877 0.2596  -4.19 <0.0001 
##  sibsp                  -0.4504 0.1006  -4.48 <0.0001 
##  parch                  -0.0871 0.0987  -0.88 0.3776  
##  fare                    0.0014 0.0020   0.70 0.4842  
##  embarked=Cherbourg      0.7881 0.2836   2.78 0.0055  
##  embarked=Queenstown     0.2745 0.3409   0.80 0.4208  
##  embarked=Southampton    0.2343 0.2119   1.11 0.2689  
## 

4.1.3 Random forest

As an alternative to the logistic regression model, we consider a random forest model. Random forests are known for good predictive performance, the ability to capture low-order variable interactions, and stability (Breiman 2001). To fit the model, we apply the randomForest() function, with default settings, from the package with the same name (Liaw and Wiener 2002b).

In the first instance, we fit a model with the same set of explanatory variables as the logistic regression model. The results of the model are stored in model-object titanic_rf_v6.
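A sketch of the call is given below; the seed is an assumption, added only for reproducibility.

library("randomForest")
set.seed(1313)  # an assumed seed
titanic_rf_v6 <- randomForest(survived ~ class + gender + age + sibsp +
                                parch + fare + embarked, data = titanic)
titanic_rf_v6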

## 
## Call:
##  randomForest(formula = survived ~ class + gender + age + sibsp +      parch + fare + embarked, data = titanic) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 18.62%
## Confusion matrix:
##       no yes class.error
## no  1393 103  0.06885027
## yes  308 403  0.43319269

For comparison purposes, we also consider a model with only three explanatory variables: class, gender, and age. The results of the model are stored in model-object titanic_rf_v3.
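A sketch of the call for the reduced model:

set.seed(1313)  # the same assumed seed
titanic_rf_v3 <- randomForest(survived ~ class + gender + age,
                              data = titanic)
titanic_rf_v3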

## 
## Call:
##  randomForest(formula = survived ~ class + gender + age, data = titanic) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##         OOB estimate of  error rate: 21.02%
## Confusion matrix:
##       no yes class.error
## no  1367 129  0.08622995
## yes  335 376  0.47116737

4.1.4 Gradient boosting

Finally, we consider a gradient boosting model (Friedman 2000). The model is known for its ability to accommodate higher-order interactions between variables. We use the same set of explanatory variables as for the logistic regression model. To fit the model, we use the gbm() function from the gbm package (Ridgeway 2017). The results of the model are stored in model-object titanic_gbm_v6.
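A sketch of the call; the formula and the value of n.trees match those echoed in the output, while the seed is an assumption.

library("gbm")
set.seed(1313)  # an assumed seed
titanic_gbm_v6 <- gbm(survived == "yes" ~ class + gender + age + sibsp +
                        parch + fare + embarked, data = titanic,
                      n.trees = 15000)
titanic_gbm_v6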

## Distribution not specified, assuming bernoulli ...
## gbm(formula = survived == "yes" ~ class + gender + age + sibsp + 
##     parch + fare + embarked, data = titanic, n.trees = 15000)
## A gradient boosted model with bernoulli loss function.
## 15000 iterations were performed.
## There were 7 predictors of which 7 had non-zero influence.

4.1.5 Model predictions

Let us now compare predictions obtained from the three different models. In particular, we compute the predicted probability of survival for an 8-year-old boy who embarked in Belfast and travelled in the first class with no parents or siblings aboard, with a ticket costing 72 pounds.

First, we create a data frame johny_d that contains the data describing the passenger.
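A minimal sketch of such a data frame is given below; the factor levels are spelled out explicitly so that they match the levels in the titanic data frame.

johny_d <- data.frame(
  class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew",
                                   "engineering crew", "restaurant staff",
                                   "victualling crew")),
  gender = factor("male", levels = c("female", "male")),
  age = 8, sibsp = 0, parch = 0, fare = 72,
  embarked = factor("Belfast", levels = c("Belfast", "Cherbourg",
                                          "Queenstown", "Southampton"))
)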

Subsequently, we use the generic function predict() to get the predicted probability of survival for the logistic regression model.
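For lrm-objects, the argument type = "fitted" returns predicted probabilities:

predict(titanic_lmr_v6, johny_d, type = "fitted")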

##         1 
## 0.7233413

The predicted probability is equal to 0.72.

We do the same for the random forest and gradient boosting models.
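A sketch of the corresponding calls; note that, for gbm-objects, the number of trees has to be indicated explicitly.

predict(titanic_rf_v6, johny_d, type = "prob")
predict(titanic_gbm_v6, johny_d, type = "response", n.trees = 15000)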

##      no   yes
## 1 0.626 0.374
## attr(,"class")
## [1] "matrix" "votes"
## [1] 0.6700909

As a result, we obtain the predicted probabilities of 0.37 and 0.67, respectively.

The models lead to different probabilities. Thus, it might be of interest to understand the reason for the differences, as it could help us to decide which of the predictions we might want to trust.

Note that for some examples we will use another observation (instance) with lower chances of survival. Let’s call this passenger Henry.
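Henry can be defined and scored as sketched below. The values of age, class, fare, and embarked follow Table 4.2; the values of sibsp and parch are assumptions.

henry <- data.frame(
  class = factor("1st", levels = levels(titanic$class)),
  gender = factor("male", levels = levels(titanic$gender)),
  age = 47, sibsp = 0, parch = 0, fare = 25,  # sibsp and parch assumed
  embarked = factor("Cherbourg", levels = levels(titanic$embarked))
)
predict(titanic_lmr_v6, henry, type = "fitted")
predict(titanic_rf_v6, henry, type = "prob")[, "yes"]
predict(titanic_gbm_v6, henry, type = "response", n.trees = 15000)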

##    1 
## 0.43
## [1] 0.25
## [1] 0.42

4.1.6 Explainers

Model-objects created with different libraries may have different internal structures. Thus, we first have to create a wrapper around each model. Toward this end, we use the explain() function from the DALEX package (Biecek 2018). The most important arguments of the function are:

  • model, a model-object;
  • data, a validation data frame;
  • y, observed values of the dependent variable for the validation data;
  • predict_function, a function that returns prediction scores; if not specified, then a default predict() function is used;
  • label, a unique name of the model; if not specified, then it is extracted from class(model).

In the example below, we create explainers for the logistic regression, random forest, and gradient boosting models created for the Titanic data.
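The snippet below is a sketch; the names of the explainer objects and the labels are our choices and are not mandated by the package.

library("DALEX")
titanic_lmr_exp <- explain(model = titanic_lmr_v6,
                           data = titanic[, -9],  # drop survived
                           y = titanic$survived == "yes",
                           label = "Logistic Regression v6")
titanic_rf_exp <- explain(model = titanic_rf_v6,
                          data = titanic[, -9],
                          y = titanic$survived == "yes",
                          label = "Random Forest v6")
titanic_gbm_exp <- explain(model = titanic_gbm_v6,
                           data = titanic[, -9],
                           y = titanic$survived == "yes",
                           label = "Generalized Boosted Regression v6")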

Each explainer wraps all elements needed to create a model explanation, i.e., a suitable predict() function, validation data set, and the model object. Thus, in subsequent chapters we will use the explainers instead of the model objects to keep code snippets more concise.

4.1.7 List of objects for the titanic example

In the previous sections we have built several predictive models for the titanic data set. The models will be used in the rest of the book to illustrate the model explanation methods and tools.

For the ease of reference, we summarize the models in Table 4.1. The binary model-objects can be downloaded by using the indicated archivist hooks (Biecek and Kosinski 2017). By calling a function specified in the last column of the table, one can recreate a selected model in a local R environment.

Table 4.1: Predictive models created for the titanic dataset.
Model name Model generator Variables Archivist hooks
titanic_lmr_v6 rms::lrm v.5.1.3 gender, age, class, sibsp, parch, fare, embarked Get the model: archivist::aread("pbiecek/models/ceb40"). Get the explainer: archivist::aread("pbiecek/models/2b9b6")
titanic_rf_v6 randomForest::randomForest v.4.6.14 gender, age, class, sibsp, parch, fare, embarked Get the model: archivist::aread("pbiecek/models/1f938"). Get the explainer: archivist::aread("pbiecek/models/9b971")
titanic_rf_v3 randomForest::randomForest v.4.6.14 gender, age, class Get the model: archivist::aread("pbiecek/models/855c1"). Get the explainer: archivist::aread("pbiecek/models/92754")
titanic_gbm_v6 gbm::gbm v.2.1.5 gender, age, class, sibsp, parch, fare, embarked Get the model: archivist::aread("pbiecek/models/24e72"). Get the explainer: archivist::aread("pbiecek/models/84d5f")

Table 4.2 summarizes the data frames that will be used in examples in the subsequent chapters.

Table 4.2: Data frames created for the titanic example.
Description No. rows Variables Link to this object
titanic dataset with imputed missing values 2207 gender, age, class, embarked, country, fare, sibsp, parch, survived archivist::aread("pbiecek/models/27e5c")
johny_d 8-year-old boy who travelled in the 1st class without parents or siblings 1 class, gender, age, sibsp, parch, fare, embarked archivist::aread("pbiecek/models/e3596")
henry 47-year-old male passenger from the 1st class, who paid 25 pounds and embarked at Cherbourg 1 class, gender, age, sibsp, parch, fare, embarked archivist::aread("pbiecek/models/a6538")

4.2 Apartment prices

Warsaw skyscrapers by Artur Malinowski (Flickr)

Predicting house prices is a common exercise used in machine-learning courses. Various datasets for house prices are available at websites like Kaggle (https://www.kaggle.com) or UCI Machine Learning Repository (https://archive.ics.uci.edu).

In this book, we will work with an interesting variant of this problem. The apartments dataset is an artificial dataset created to match key characteristics of real apartments in Warszawa, the capital of Poland. However, the dataset was created in such a way that two very different models, namely linear regression and random forest, achieve almost exactly the same accuracy. The natural question is then: which model should we choose? We will show that model-explanation tools provide important insight into key model characteristics and can help in model selection.

The dataset is available in the DALEX package (Biecek 2018). It contains 1000 observations (apartments) and six variables:

  • m2.price, apartment’s price per meter-squared (in EUR), a numerical variable;
  • construction.year, the year of construction of the block of flats in which the apartment is located, a numerical variable;
  • surface, apartment’s total surface in squared meters, a numerical variable;
  • floor, the floor at which the apartment is located (ground floor taken to be the first floor), a numerical integer variable with values from 1 to 10;
  • no.rooms, the total number of rooms, a numerical variable with values from 1 to 6;
  • district, a factor with 10 levels indicating the district of Warszawa where the apartment is located.

The R code below provides more info about the contents of the dataset, values of the variables, etc.
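As before, the output can be generated with calls along the following lines (a sketch):

library("DALEX")
head(apartments, 2)          # first two observations
str(apartments)              # structure of the data frame
table(apartments$floor)      # distribution of the floor variable
table(apartments$no.rooms)   # distribution of the number of rooms
levels(apartments$district)  # levels of the district factor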

##   m2.price construction.year surface floor no.rooms    district
## 1     5897              1953      25     3        1 Srodmiescie
## 2     1818              1992     143     9        5     Bielany
## 'data.frame':    1000 obs. of  6 variables:
##  $ m2.price         : num  5897 1818 3643 3517 3013 ...
##  $ construction.year: num  1953 1992 1937 1995 1992 ...
##  $ surface          : num  25 143 56 93 144 61 127 105 145 112 ...
##  $ floor            : int  3 9 1 7 6 6 8 8 6 9 ...
##  $ no.rooms         : num  1 5 2 3 5 2 5 4 6 4 ...
##  $ district         : Factor w/ 10 levels "Bemowo","Bielany",..: 6 2 5 4 3 6 3 7 6 6 ...
## 
##   1   2   3   4   5   6   7   8   9  10 
##  90 116  87  86  95 104 103 103 108 108
## 
##   1   2   3   4   5   6 
##  99 202 231 223 198  47
##  [1] "Bemowo"      "Bielany"     "Mokotow"     "Ochota"      "Praga"      
##  [6] "Srodmiescie" "Ursus"       "Ursynow"     "Wola"        "Zoliborz"

Models considered for this dataset will use m2.price as the (continuous) dependent variable.

Model predictions will be obtained for a set of six apartments included in data frame apartments_test, also included in the DALEX package.
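A sketch of the call that prints the test apartments shown below:

head(apartments_test)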

##      m2.price construction.year surface floor no.rooms    district
## 1001     4644              1976     131     3        5 Srodmiescie
## 1002     3082              1978     112     9        4     Mokotow
## 1003     2498              1958     100     7        4     Bielany
## 1004     2735              1951     112     3        5        Wola
## 1005     2781              1978     102     4        4      Bemowo
## 1006     2936              2001     116     7        4      Bemowo

4.2.1 Data exploration

Note that apartments is an artificial dataset, created to illustrate and explain differences between random forest and linear regression. Hence, the structure of the data, the form and strength of the associations between variables, the plausibility of distributional assumptions, etc., are better behaved than in a real-life dataset. In fact, all these characteristics of the data are known. Nevertheless, we conduct some data exploration to illustrate the important aspects of the data.

The variable of interest is m2.price, the price per meter-squared. The histogram presented in Figure 4.9 indicates that the distribution of the variable is slightly skewed to the right.

Figure 4.9: Distribution of the price per meter-squared in the apartments data.

Figure 4.10 suggests a possibly nonlinear relation between construction.year and m2.price.

Figure 4.10: Price per meter-squared vs. construction year.

Figure 4.11 indicates a linear relation between surface and m2.price.

Figure 4.11: Price per meter-squared vs. surface.

The relation between floor and m2.price is also close to linear, as seen in Figure 4.12.

Figure 4.12: Price per meter-squared vs. floor.

There is a close-to-linear relation between no.rooms and m2.price, as suggested by Figure 4.13. It is worth noting that, quite naturally, surface and the number of rooms are correlated (see Figure 4.14).

Figure 4.13: Price per meter-squared vs. number of rooms.

Figure 4.14: Surface vs. number of rooms.

Prices depend on district. Violin plots in Figure 4.15 indicate that the highest prices per meter-squared are observed in Srodmiescie (Downtown).

Figure 4.15: Price per meter-squared for different districts.

4.2.2 Linear regression

The dependent variable of interest, m2.price, is continuous. Thus, a natural choice to build a predictive model is linear regression. We treat all the other variables in the apartments data frame as explanatory and include them in the model. The results of the model are stored in model-object apartments_lm_v5.
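A sketch of the call; the formula matches the one echoed in the output.

apartments_lm_v5 <- lm(m2.price ~ ., data = apartments)
apartments_lm_v5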

## 
## Call:
## lm(formula = m2.price ~ ., data = apartments)
## 
## Coefficients:
##         (Intercept)    construction.year              surface  
##            5003.248               -0.229              -10.238  
##               floor             no.rooms      districtBielany  
##             -99.482              -37.730               34.106  
##       districtPraga      districtUrsynow       districtBemowo  
##             -20.214               -1.974               16.891  
##       districtUrsus     districtZoliborz      districtMokotow  
##              46.833              906.865              935.271  
##      districtOchota  districtSrodmiescie  
##             943.145             2097.502

4.2.3 Random forest

As an alternative to linear regression, we consider a random forest model. To fit the model, we apply the randomForest() function, with default settings, from the package with the same name (Liaw and Wiener 2002b).
The results of the model are stored in model-object apartments_rf_v5.
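A sketch of the call (the seed is an assumption):

library("randomForest")
set.seed(72)  # an assumed seed
apartments_rf_v5 <- randomForest(m2.price ~ ., data = apartments)
apartments_rf_v5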

## 
## Call:
##  randomForest(formula = m2.price ~ ., data = apartments) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 79789.39
##                     % Var explained: 90.28

4.2.4 Model predictions

By applying the predict() function to model-object apartments_lm_v5, with apartments_test as the data frame for which predictions are to be computed, we obtain the predicted prices for the six apartments in the testing set for the linear regression model. Subsequently, we compute the square root of the mean of squared differences between the predicted and actual prices of the test apartments. We repeat the same steps for the random forest model.
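A sketch of these computations:

# root of the mean of squared differences for the linear regression model
predicted_lm <- predict(apartments_lm_v5, apartments_test)
sqrt(mean((predicted_lm - apartments_test$m2.price)^2))
# root of the mean of squared differences for the random forest model
predicted_rf <- predict(apartments_rf_v5, apartments_test)
sqrt(mean((predicted_rf - apartments_test$m2.price)^2))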

## [1] 283.0865
## [1] 282.9519

For the random forest model, the square root of the mean squared difference is equal to 283. It is only minimally smaller than the value of 283.1 obtained for the linear regression model. Thus, the question we may face is: should we choose the more complex but flexible random forest model, or the simpler and easier-to-interpret linear regression model? In subsequent chapters we will try to provide an answer to this question.

4.2.5 List of objects for the apartments example

In Sections 4.2.2 and 4.2.3 we have built two predictive models for the apartments data set. The models will be used in the rest of the book to illustrate the model explanation methods and tools.

For the ease of reference, we summarize the models in Table 4.3. The binary model-objects can be downloaded by using the indicated archivist hooks (Biecek and Kosinski 2017). By calling a function specified in the last column of the table, one can recreate a selected model in a local R environment.

Table 4.3: Predictive models created for the apartments dataset.
Model name Model generator Variables Archivist hooks
apartments_lm_v5 stats::lm v.3.5.3 construction.year, surface, floor, no.rooms, district Get the model: archivist::aread("pbiecek/models/55f19")
apartments_rf_v5 randomForest::randomForest v.4.6.14 construction.year, surface, floor, no.rooms, district Get the model: archivist::aread("pbiecek/models/fe7a5")

4.3 Hire or fire

Predictive models can be used to support decisions. For instance, they can be used in a human-resources department to decide whether to promote an employee. An advantage of using a model for this purpose would be the objectivity of the decision, which would not be subject to the personal preferences of a manager. However, in such a situation, one would most likely want to understand what influences the model’s predictions.

To illustrate such a situation, we will use the HR dataset that is available in the DALEX package (Biecek 2018). It is an artificial set of data from a human-resources department of a call center. It contains 7847 observations (employees of the call center) and six variables:

  • gender, person’s gender, a factor with two levels;
  • age, person’s age in years, a numerical variable;
  • hours, average number of working hours per week, a numerical variable;
  • evaluation, the last evaluation score, a numerical variable with values 2 (fail), 3 (satisfactory), 4 (good), and 5 (very good);
  • salary, the salary level, a numerical variable with values from 0 (lowest) to 5 (highest);
  • status, a factor with three levels indicating whether the employee was fired, retained, or promoted.

The R code below provides more info about the contents of the dataset, values of the variables, etc.
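The output below can be generated with calls along the following lines (a sketch):

library("DALEX")
head(HR, 4)           # first four observations
str(HR)               # structure of the data frame
table(HR$evaluation)  # distribution of the evaluation scores
table(HR$salary)      # distribution of the salary levels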

##   gender      age    hours evaluation salary status
## 1   male 32.58267 41.88626          3      1  fired
## 2 female 41.21104 36.34339          2      5  fired
## 3   male 37.70516 36.81718          3      0  fired
## 4 female 30.06051 38.96032          3      2  fired
## 'data.frame':    7847 obs. of  6 variables:
##  $ gender    : Factor w/ 2 levels "female","male": 2 1 2 1 2 2 1 2 1 1 ...
##  $ age       : num  32.6 41.2 37.7 30.1 21.1 ...
##  $ hours     : num  41.9 36.3 36.8 39 62.2 ...
##  $ evaluation: num  3 2 3 3 5 2 4 2 2 4 ...
##  $ salary    : num  1 5 0 2 3 0 0 4 4 4 ...
##  $ status    : Factor w/ 3 levels "fired","ok","promoted": 1 1 1 1 3 1 3 2 1 3 ...
## 
##    2    3    4    5 
## 2371 2272 1661 1543
## 
##    0    1    2    3    4    5 
## 1105 1417 1461 1508 1316 1040

Models considered for this dataset will use status as the (categorical) dependent variable.

4.3.1 Data exploration

As was the case for the apartments dataset (see Section 4.2), the HR data were simulated. Despite the fact that the characteristics of the data are known, we conduct some data exploration to illustrate the important aspects of the data.

Figure 4.16 indicates that young females and older males were fired more frequently than older females and younger males.

Figure 4.16: Employment status for age groups and gender.

Figure 4.17 indicates that the proportion of promoted employees was the lowest for the lowest and the highest salary levels. At the same time, the proportion of fired employees was the highest for these two salary levels.

Figure 4.17: Employment status for different salary levels.

Figure 4.18 indicates that the chance of being fired was larger for evaluation scores equal to 2 or 3. On the other hand, the chance of being promoted substantially increased for scores equal to 4 or 5.

Figure 4.18: Employment status for different evaluation scores.

4.3.2 Multinomial logistic regression

The dependent variable of interest, status, is categorical with three categories. Thus, a simple choice is to consider a multinomial logistic regression model (Venables and Ripley 2010). We fit the model with the help of the multinom() function from the nnet package. The function fits multinomial log-linear models by using a neural-network approach. As a result, we obtain a model that is smoother than a random forest model, which relies on binary splits for continuous variables. We treat all variables other than status in the HR data frame as explanatory and include them in the model. The results of the model are stored in model-object HR_glm_v5.
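A sketch of the call; printing the fitted model-object yields the output shown below.

library("nnet")
HR_glm_v5 <- multinom(status ~ gender + age + hours + evaluation + salary,
                      data = HR)
HR_glm_v5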

## # weights:  21 (12 variable)
## initial  value 8620.810629 
## iter  10 value 7002.127738
## iter  20 value 6239.478146
## iter  20 value 6239.478126
## iter  20 value 6239.478124
## final  value 6239.478124 
## converged
## Call:
## multinom(formula = status ~ gender + age + hours + evaluation + 
##     salary, data = HR)
## 
## Coefficients:
##          (Intercept) gendermale         age      hours  evaluation
## ok         -3.199741 0.05185293 0.001003521 0.06628055 -0.03734345
## promoted  -12.677639 0.11838037 0.003436872 0.16253343  1.26109093
##              salary
## ok       0.01680039
## promoted 0.01507927
## 
## Residual Deviance: 12478.96 
## AIC: 12502.96

4.3.3 Random forest

As an alternative to multinomial logistic regression, we consider a random forest model. To fit the model, we apply the randomForest() function, with default settings, from the package with the same name (Liaw and Wiener 2002b). The results of the model are stored in model-object HR_rf_v5.
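A sketch of the call (the seed is an assumption):

set.seed(1313)  # an assumed seed
HR_rf_v5 <- randomForest(status ~ gender + age + hours + evaluation + salary,
                         data = HR)
HR_rf_v5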

## 
## Call:
##  randomForest(formula = status ~ gender + age + hours + evaluation +      salary, data = HR) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 27.46%
## Confusion matrix:
##          fired   ok promoted class.error
## fired     2270  388      197   0.2049037
## ok         530 1243      448   0.4403422
## promoted   202  390     2179   0.2136413

4.3.4 Model predictions

Let us now compare predictions obtained from the multinomial logistic regression and random forest models. In particular, we compute the predicted probabilities of being fired, retained in service, or promoted for Dilbert, a 58-year-old male working around 42 hours per week for a salary at level 2, who received an evaluation score equal to 2. Data frame dilbert contains the data describing the employee.
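A minimal sketch of such a data frame; the exact values of age and hours are assumptions consistent with the description above.

dilbert <- data.frame(
  gender = factor("male", levels = c("female", "male")),
  age = 58,    # "a 58-year-old male" (assumed exact value)
  hours = 42,  # "around 42 hours per week" (assumed exact value)
  evaluation = 2,
  salary = 2
)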

By applying the predict() function to model-objects HR_rf_v5 and HR_glm_v5, with dilbert as the data frame for which predictions are to be computed and with the argument type="prob" (type="probs" for the multinom-object), we obtain the predicted probabilities of being fired, retained in service, or promoted for Dilbert.
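A sketch of the two calls:

predict(HR_rf_v5, dilbert, type = "prob")    # random forest
predict(HR_glm_v5, dilbert, type = "probs")  # multinomial logistic regression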

##   fired    ok promoted
## 1 0.778 0.218    0.004
## attr(,"class")
## [1] "matrix" "votes"
##      fired         ok   promoted 
## 0.56369601 0.40630786 0.02999612

For both models, the predicted probability of promotion is low; it is more likely that Dilbert will be fired. It is of interest to understand why such a prediction is made. Moreover, the random forest model yields a higher probability of being fired (0.78) than the multinomial logistic regression model (0.56). We may want to learn where this difference comes from. We will try to answer these questions in subsequent chapters.

4.3.5 List of objects for the HR example

In Sections 4.3.2 and 4.3.3 we have built two predictive models for the HR data set. The models will be used in the remainder of the book to illustrate model-explanation methods and tools.

For the ease of reference, we summarize the models in Table 4.4. The binary model-objects can be downloaded by using the indicated archivist hooks (Biecek and Kosinski 2017). By calling a function specified in the last column of the table, one can recreate a selected model in a local R environment.

Table 4.4: Predictive models created for the HR dataset.
Model name Model generator Variables Archivist hooks
HR_rf_v5 randomForest::randomForest v.4.6.14 gender, age, hours, evaluation, salary Get the model: archivist::aread("pbiecek/models/1ecfd")
HR_glm_v5 nnet::multinom gender, age, hours, evaluation, salary Get the model: archivist::aread("pbiecek/models/f0244")

References

Biecek, Przemyslaw. 2018. DALEX: Descriptive mAchine Learning Explanations. https://pbiecek.github.io/DALEX/.

Biecek, Przemyslaw, and Marcin Kosinski. 2017. “archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.” Journal of Statistical Software 82 (11): 1–28. https://doi.org/10.18637/jss.v082.i11.

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45: 5–32. https://doi.org/10.1023/A:1010933404324.

Friedman, Jerome H. 2000. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29: 1189–1232.

Harrell Jr, Frank E. 2018. rms: Regression Modeling Strategies. https://CRAN.R-project.org/package=rms.

Liaw, Andy, and Matthew Wiener. 2002b. “Classification and Regression by randomForest.” R News 2 (3): 18–22. https://CRAN.R-project.org/doc/Rnews/.

Little, Roderick J. A., and Donald B. Rubin. 1987. Statistical Analysis with Missing Data. New York: Wiley.

Molenberghs, Geert, and Michael G. Kenward. 2007. Missing Data in Clinical Studies. Chichester: Wiley.

Ridgeway, Greg. 2017. gbm: Generalized Boosted Regression Models. https://CRAN.R-project.org/package=gbm.

Schafer, Joseph L. 1997. Analysis of Incomplete Multivariate Data. London: Chapman & Hall.

van Buuren, Stef. 2012. Flexible Imputation of Missing Data. Boca Raton: Chapman & Hall/CRC.

Venables, W. N., and B. D. Ripley. 2010. Modern Applied Statistics with S. New York: Springer.