5 Datasets and models

We will illustrate the methods presented in this book by using three datasets related to:

  • predicting probability of survival for passengers of the RMS Titanic;
  • predicting prices of apartments in Warsaw;
  • estimating the value of football players based on the FIFA dataset.

The first dataset will be used to illustrate the application of the techniques in the case of classification predictive models for a binary dependent variable. This dataset is mainly used in the examples in part II of this book. The second one will offer an example of regression models for a continuous variable. This dataset is mainly used in the examples in part III of this book. The third dataset will be introduced in Chapter 22 to summarise all techniques introduced in the book.

In this chapter, we provide a short description of the first two datasets, together with results of exploratory analyses. We also introduce models that will be used for illustration purposes in subsequent chapters.

5.1 Sinking of the RMS Titanic

Titanic sinking by Willy Stöwer

The sinking of the RMS Titanic is one of the deadliest maritime disasters in history (during peacetime). Over 1500 people died as a consequence of a collision with an iceberg. Projects like Encyclopedia Titanica (https://www.encyclopedia-titanica.org/) are a source of rich and precise data about Titanic’s passengers. The stablelearner package in R includes a data frame with information about passengers’ characteristics. The dataset, after some data cleaning and variable transformations, is also available in the DALEX package for R and the dalex library for Python. In particular, the titanic data frame contains 2207 observations (for 1317 passengers and 890 crew members) and nine variables:

  • gender, person’s (passenger’s or crew member’s) gender, a factor (categorical variable) with two levels (categories): “male” (78%) and “female” (22%);
  • age, person’s age in years, a numerical variable; the age is given in (integer) years, in the range of 0–74 years;
  • class, the class in which the passenger travelled, or the duty class of a crew member; a factor with seven levels: “1st” (14.7%), “2nd” (12.9%), “3rd” (32.1%), “deck crew” (3%), “engineering crew” (14.7%), “restaurant staff” (3.1%), and “victualling crew” (19.5%);
  • embarked, the harbor in which the person embarked on the ship, a factor with four levels: “Belfast” (8.9%), “Cherbourg” (12.3%), “Queenstown” (5.6%), and “Southampton” (73.2%);
  • country, person’s home country, a factor with 48 levels; the most common levels are “England” (51%), “United States” (12%), “Ireland” (6.2%), and “Sweden” (4.8%);
  • fare, the price of the ticket (only available for passengers; 0 for crew members), a numerical variable in the range of 0–512;
  • sibsp, the number of siblings/spouses aboard the ship, a numerical variable in the range of 0–8;
  • parch, the number of parents/children aboard the ship, a numerical variable in the range of 0–9;
  • survived, a factor with two levels: “no” (67.8%) and “yes” (32.2%), indicating whether the person survived or not.

The first six rows of this dataset are presented in the table below.

gender age class embarked country fare sibsp parch survived
male 42 3rd Southampton United States 7.11 0 0 no
male 13 3rd Southampton United States 20.05 0 2 no
male 16 3rd Southampton United States 20.05 1 1 no
female 39 3rd Southampton England 20.05 1 1 yes
female 16 3rd Southampton Norway 7.13 0 0 yes
male 25 3rd Southampton United States 7.13 0 0 yes

Models considered for this dataset will use survived as the (binary) dependent variable.

5.1.1 Data exploration

As discussed in Chapter 2, it is always advisable to explore data before modelling. However, as this book is focused on model exploration, we will limit the data exploration part.

Before exploring the data, we first conduct some pre-processing. In particular, the values of the variables age, country, sibsp, parch, and fare are missing for a limited number of observations (2, 81, 10, 10, and 26, respectively). Analyzing data with missing values is a topic on its own (Schafer 1997; Little and Rubin 2002; Molenberghs and Kenward 2007). An often-used approach is to impute the missing values. Toward this end, multiple imputation should be considered (Schafer 1997; Molenberghs and Kenward 2007; Buuren 2012). However, given the limited number of missing values and the intended illustrative use of the dataset, we limit ourselves to single imputation, which is admittedly inferior. In particular, we replace the missing age values by the mean of the observed ones, i.e., 30. Missing values of country are encoded as "X". For sibsp and parch, we replace the missing values by the most frequently observed value, i.e., 0. Finally, for fare, we use the mean fare for a given class, i.e., 0 pounds for the crew, 89 pounds for the 1st, 22 pounds for the 2nd, and 13 pounds for the 3rd class.
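
The code below is a minimal sketch of this single-imputation step in base R; it assumes the titanic data frame (with the missing values described above) loaded from the DALEX package and is meant only as an illustration, not the exact code used to prepare the analysed dataset.

library("DALEX")
# impute age with the mean of the observed ages (30 years)
titanic$age[is.na(titanic$age)] <- 30
# encode missing country as "X"
titanic$country <- as.character(titanic$country)
titanic$country[is.na(titanic$country)] <- "X"
titanic$country <- factor(titanic$country)
# impute sibsp and parch with the most frequently observed value (0)
titanic$sibsp[is.na(titanic$sibsp)] <- 0
titanic$parch[is.na(titanic$parch)] <- 0
# impute fare with the mean fare observed for the person's class
for (cl in levels(titanic$class)) {
  idx <- is.na(titanic$fare) & titanic$class == cl
  titanic$fare[idx] <- mean(titanic$fare[titanic$class == cl], na.rm = TRUE)
}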

After imputing the missing values, we investigate the association between survival status and the other variables. Most variables in the Titanic dataset are categorical, except age and fare. In order to keep the exploration uniform, we first transform these two continuous variables into categorical ones. Figure 5.1 shows histograms for both variables. Age is discretized into five categories by using cutoffs equal to 5, 10, 20, and 30, while fare is discretized by applying cutoffs equal to 1, 10, 25, and 50.
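
A minimal sketch of this discretization with the cut() function is given below; the new variable names (age_cat, fare_cat) are illustrative assumptions and not taken from the text.

# categorized versions of age and fare, using the cutoffs listed above
titanic$age_cat  <- cut(titanic$age,  breaks = c(0, 5, 10, 20, 30, Inf),
                        include.lowest = TRUE)
titanic$fare_cat <- cut(titanic$fare, breaks = c(0, 1, 10, 25, 50, Inf),
                        include.lowest = TRUE)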

Figure 5.1: Histograms for variables age and fare for the Titanic data.

Figures 5.2–5.5 use mosaic plots to present graphically the proportions of non-survivors and survivors for different levels of the other variables. The width of the bars (on the x-axis) reflects the marginal distribution (proportions) of the observed levels of the variable. On the other hand, the height of the bars (on the y-axis) provides information about the proportions of non-survivors and survivors. The graphs for age and fare were constructed by using the categorized versions of the variables.
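
The exact plotting code is not shown in the text; the call below is a minimal base-R sketch (assuming the imputed titanic data frame) of how a single mosaic plot of this kind can be obtained with the mosaicplot() function.

# bar widths reflect the marginal distribution of gender,
# bar heights the proportions of non-survivors and survivors
mosaicplot(~ gender + survived, data = titanic, color = TRUE,
           main = "Survival according to gender")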

Figure 5.2 indicates that the proportion of survivors was larger for females and children below 5 years of age. This is most likely the result of the “women and children first” principle that is often evoked in situations that require the evacuation of persons whose life is in danger.

Figure 5.2: Survival according to gender and age category in the Titanic data.

The principle can, perhaps, partially explain the trend seen in Figure 5.3, i.e., a higher proportion of survivors among those with 1-2 parents/children and 1-2 siblings/spouses aboard.

Figure 5.3: Survival according to the number of parents/children and siblings/spouses in the Titanic data.

Figure 5.4 indicates that passengers travelling in the first and second class had a higher chance of survival, perhaps due to the proximity of their cabins to the deck. Interestingly, the proportion of survivors among the deck crew was similar to that among the first-class passengers. The figure also shows that the proportion of survivors increased with the fare, which is consistent with the fact that the proportion was higher for passengers travelling in the first and second class.

Figure 5.4: Survival according to travel-class and ticket-fare in the Titanic data.

Finally, Figure 5.5 does not suggest any noteworthy trends.

Figure 5.5: Survival according to the embarked harbour and country in the Titanic data.

5.2 R classification models for Titanic

5.2.1 Logistic-regression model

The dependent variable of interest, survived, is binary. Thus, a natural choice is to start the predictive modelling with a logistic-regression model. As there is no reason to expect a linear relationship between age and odds of survival, we use linear tail-restricted cubic splines, available in the rcs() function of the rms package (Harrell Jr 2018), to model the effect of age. We also do not expect linear relation for the fare variable, but because of its skewness (see Figure 5.1), we do not use splines for this variable. The results of the model are stored in model-object titanic_lmr, which will be used in subsequent chapters.

library("rms")
titanic_lmr <- lrm(survived == "yes" ~ gender + rcs(age) + class +
         sibsp + parch + fare + embarked, titanic)
titanic_lmr
## Logistic Regression Model
##  
##  lrm(formula = survived == "yes" ~ gender + rcs(age) + class + 
##      sibsp + parch + fare + embarked, data = titanic)
##  
##                          Model Likelihood    Discrimination    Rank Discrim.    
##                                Ratio Test           Indexes          Indexes    
##  Obs           2207    LR chi2     752.06    R2       0.404    C       0.817    
##   FALSE        1496    d.f.            17    g        1.647    Dxy     0.635    
##   TRUE          711    Pr(> chi2) <0.0001    gr       5.191    gamma   0.636    
##  max |deriv| 0.0001                          gp       0.282    tau-a   0.277    
##                                              Brier    0.146                     
##  
##                         Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept               4.5746 0.5480   8.35 <0.0001 
##  gender=male            -2.7687 0.1586 -17.45 <0.0001 
##  age                    -0.1180 0.0221  -5.35 <0.0001 
##  age'                    0.6313 0.1628   3.88 0.0001  
##  age''                  -2.6583 0.7840  -3.39 0.0007  
##  age'''                  2.8977 1.0130   2.86 0.0042  
##  class=2nd              -1.1390 0.2501  -4.56 <0.0001 
##  class=3rd              -2.0627 0.2490  -8.28 <0.0001 
##  class=deck crew         1.0672 0.3498   3.05 0.0023  
##  class=engineering crew -0.9702 0.2648  -3.66 0.0002  
##  class=restaurant staff -3.1712 0.6583  -4.82 <0.0001 
##  class=victualling crew -1.0877 0.2596  -4.19 <0.0001 
##  sibsp                  -0.4504 0.1006  -4.48 <0.0001 
##  parch                  -0.0871 0.0987  -0.88 0.3776  
##  fare                    0.0014 0.0020   0.70 0.4842  
##  embarked=Cherbourg      0.7881 0.2836   2.78 0.0055  
##  embarked=Queenstown     0.2745 0.3409   0.80 0.4208  
##  embarked=Southampton    0.2343 0.2119   1.11 0.2689  
## 

Note that we are not very much interested in the assessment of the model’s predictive performance, but rather in understanding how the model yields its predictions. This is why we do not split the data into training and testing subsets. Instead, the model is fitted to the entire dataset and will be examined on the same dataset.

5.2.2 Random-forest model

As an alternative to the logistic-regression model we consider a random-forest model. Random-forest modelling is known for good predictive performance, ability to grasp low-order variable interactions, and stability (Leo Breiman 2001a). To fit the model, we apply the randomForest() function, with default settings, from the package with the same name (Liaw and Wiener 2002).

In the first instance, we fit a model with the same set of explanatory variables as the logistic-regression model (see Section 5.2.1). The results of the random-forest model are stored in model-object titanic_rf.

library("randomForest")
set.seed(1313)
titanic_rf <- randomForest(survived ~ class + gender + age + sibsp + parch + fare + embarked,
   data = titanic)
titanic_rf
## 
## Call:
##  randomForest(formula = survived ~ class + gender + age + sibsp +      parch + fare + embarked, data = titanic) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 18.62%
## Confusion matrix:
##       no yes class.error
## no  1393 103  0.06885027
## yes  308 403  0.43319269

5.2.3 Gradient-boosting model

Additionally, we consider the gradient-boosting model (Friedman 2000). Tree-based boosting models are known for being able to accommodate higher-order interactions between variables. We use the same set of seven explanatory variables as for the logistic-regression model (see Section 5.2.1). To fit the gradient-boosting model, we use function gbm() from the gbm package (Ridgeway 2017). The results of the model are stored in model-object titanic_gbm.

library("gbm")
set.seed(1313)
titanic_gbm <- gbm(survived == "yes" ~ class + gender + age + sibsp + 
         parch + fare + embarked, data = titanic, n.trees = 15000, 
         distribution = "bernoulli")
titanic_gbm
## gbm(formula = survived == "yes" ~ class + gender + age + sibsp + 
##     parch + fare + embarked, distribution = "bernoulli", data = titanic, 
##     n.trees = 15000)
## A gradient boosted model with bernoulli loss function.
## 15000 iterations were performed.
## There were 7 predictors of which 7 had non-zero influence.

5.2.4 Support Vector Machine model for Classification

Finally, we also consider a support vector machine model (Cortes and Vapnik 1995). We use the C-classification mode. Again, we fit a model with the same set of explanatory variables as in the logistic-regression model (see Section 5.2.1). To fit the model, we use function svm() from the e1071 package (Meyer et al. 2019). The results of the model are stored in model-object titanic_svm.

library("e1071")
titanic_svm <- svm(survived == "yes" ~ class + gender + age + sibsp +
            parch + fare + embarked, data = titanic, 
            type = "C-classification", probability = TRUE)
titanic_svm
## 
## Call:
## svm(formula = survived == "yes" ~ class + gender + age + sibsp + 
##     parch + fare + embarked, data = titanic, type = "C-classification", 
##     probability = TRUE)
## 
## 
## Parameters:
##    SVM-Type:  C-classification 
##  SVM-Kernel:  radial 
##        cost:  1 
## 
## Number of Support Vectors:  1030

5.2.5 Models’ predictions

Let us now compare predictions that are obtained from the different models. In particular, we compute the predicted probability of survival for Johnny D, an 8-year-old boy who embarked in Southampton and travelled in the first class with no parents or siblings aboard, and with a ticket costing 72 pounds.

First, we create a dataframe johnny_d that contains the data describing the passenger.

johnny_d <- data.frame(
            class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew",
                        "engineering crew", "restaurant staff", "victualling crew")),
            gender = factor("male", levels = c("female", "male")),
            age = 8,
            sibsp = 0,
            parch = 0,
            fare = 72,
            embarked = factor("Southampton", levels = c("Belfast",
                        "Cherbourg","Queenstown","Southampton"))
)

Subsequently, we use the generic function predict() to obtain the predicted probability of survival for the logistic-regression model.

(pred_lmr <- predict(titanic_lmr, johnny_d, type = "fitted"))
##         1 
## 0.7677036

The predicted probability is equal to 0.77.

We do the same for the remaining three models.

(pred_rf <- predict(titanic_rf, johnny_d, type = "prob"))
##      no   yes
## 1 0.578 0.422
## attr(,"class")
## [1] "matrix" "array"  "votes"
(pred_gbm <- predict(titanic_gbm, johnny_d, type = "response", n.trees = 15000))
## [1] 0.6632574
(pred_svm <- predict(titanic_svm, johnny_d, probability = TRUE))
##     1 
## FALSE 
## attr(,"probabilities")
##       FALSE      TRUE
## 1 0.7799685 0.2200315
## Levels: FALSE TRUE

As a result, we obtain the predicted probabilities of 0.42, 0.66, and 0.22 for the random-forest, gradient-boosting, and support vector machine models, respectively. The models lead to different probabilities. Thus, it might be of interest to understand the reason for the differences, as it could help us to decide which of the predictions we might want to trust. We will investigate this issue in the subsequent chapters.

Note that, for some examples later in the book, we will use another observation (instance) with lower chances of survival. We will call this passenger Henry.

henry <- data.frame(
         class = factor("1st", levels = c("1st", "2nd", "3rd", "deck crew", 
                     "engineering crew", "restaurant staff", "victualling crew")),
         gender = factor("male", levels = c("female", "male")),
         age = 47,
         sibsp = 0,
         parch = 0,
         fare = 25,
         embarked = factor("Cherbourg", levels = c("Belfast",
                           "Cherbourg","Queenstown","Southampton"))
)
predict(titanic_lmr, henry, type = "fitted")
##         1 
## 0.4318245
predict(titanic_rf, henry, type = "prob")[,2]
## [1] 0.246
predict(titanic_gbm, henry, type = "response", n.trees = 15000)
## [1] 0.3073358
attr(predict(titanic_svm, henry, probability = TRUE),"probabilities")[,2]
## [1] 0.1767995

5.2.6 Models’ explainers

Model-objects created with different libraries may have different internal structures. Thus, first, we have to create an “explainer,” i.e., an object that provides a uniform interface for different models. Toward this end, we use the explain() function from the DALEX package (Biecek 2018). As it was mentioned in Section 3.2, there is only one argument that is required by the function, i.e., model. The argument is used to specify the model-object with the fitted form of the model. However, the function allows additional arguments that extend its functionalities. In particular, the list of arguments includes the following (an illustrative call follows the list):

  • data, a data frame or matrix providing data to which the model is to be applied; if not provided (data = NULL by default), the data are extracted from the model-object. Note that the data object should not, in principle, contain the dependent variable;
  • y, observed values of the dependent variable corresponding to the data given in the data object; if not provided (y = NULL by default), the values are extracted from the model-object;
  • predict_function, a function that returns prediction scores; if not specified (predict_function = NULL by default), then a default predict() function is used (note that this may lead to errors);
  • residual_function, a function that returns model residuals; if not specified (residual_function = NULL by default), then model residuals defined in equation (2.1) are calculated;
  • verbose, a logical argument (verbose = TRUE by default) indicating whether diagnostic messages are to be printed;
  • precalculate, a logical argument (precalculate = TRUE by default) indicating whether predicted values and residuals are to be calculated when the explainer is created. Note that this will also happen if verbose = TRUE. To skip the calculations, both verbose and precalculate should be set to FALSE;
  • model_info, a named list (with components “package,” “version,” and “type”) providing information about the model; if not specified (model_info = NULL by default), DALEX tries to extract the information on its own;
  • type, information about the type of the model, either “classification” (for a binary dependent variable) or “regression” (for a continuous dependent variable); if not specified (type = NULL by default), then the value of the argument is extracted from model_info;
  • label, a unique name of the model; if not specified (label = NULL by default), then it is extracted from class(model).
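
As an illustration of these arguments, the call below is only a sketch (with an assumed object name titanic_rf_expl and a hand-written predict function) showing how an explainer with an explicit predict_function, type, and label could be created; the explainers actually used in this book are constructed at the end of this section.

library("DALEX")
# an explainer for the random-forest model with an explicit predict function
# that returns the predicted probability of survival
titanic_rf_expl <- explain(
   model = titanic_rf,
   data  = titanic[, -9],
   y     = titanic$survived == "yes",
   predict_function = function(m, d) predict(m, d, type = "prob")[, "yes"],
   type  = "classification",
   label = "Random Forest",
   verbose = FALSE)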

Application of function explain() provides an object of class explainer. It is a list of many components that include:

  • model, the explained model;
  • data, the data to which the model is applied;
  • y, observed values of the dependent variable corresponding to data;
  • y_hat, predictions obtained by applying model to data;
  • residuals, residuals computed based on y and y_hat;
  • predict_function, the function used to obtain the model’s predictions;
  • residual_function, the function used to obtain residuals;
  • class, class/classes of the model;
  • label, label of the model/explainer;
  • model_info, a named list (with components package, version, and type) providing information about the model.

Thus, each explainer-object contains all elements needed to create a model explanation. The code below creates explainers for the models (see Sections 5.2.1–5.2.4) fitted to the Titanic data. Note that, in the data argument, we indicate the titanic data frame without the ninth column, i.e., without the survived variable. The variable is used in the y argument to explicitly define the binary dependent variable equal to 1 for survivors and 0 for passengers who did not survive.

titanic_lmr_exp <- explain(model = titanic_lmr, 
                           data = titanic[, -9],
                           y = titanic$survived == "yes", 
                           label = "Logistic Regression")
titanic_lmr_exp$model_info$type = "classification"

titanic_rf_exp <- explain(model = titanic_rf, 
                           data = titanic[, -9],
                           y = titanic$survived == "yes", 
                           label = "Random Forest")

titanic_gbm_exp <- explain(model = titanic_gbm, 
                           data = titanic[, -9],
                           y = titanic$survived == "yes", 
                           label = "Generalized Boosted Regression")

titanic_svm_exp <- explain(model = titanic_svm, 
                           data = titanic[, -9],
                           y = titanic$survived == "yes", 
                           label = "Support Vector Machine")

5.2.7 List of objects for the Titanic example

In the previous sections, we have built four predictive models for the Titanic dataset. The models will be used in the rest of the book to illustrate model-explanation methods and tools.

For the ease of reference, we summarize the models in Table 5.1. The binary model-objects can be downloaded by using the indicated archivist hooks (Biecek and Kosinski 2017). By calling a function specified in the last column of the table, one can restore a selected model in the local R environment; an example is given below the table.

Table 5.1: Predictive models created for the Titanic dataset.
Model name Model generator Variables Archivist hooks
titanic_lmr rms::lrm v.5.1.3 gender, age, class, sibsp, parch, fare, embarked Get the model: archivist::aread("pbiecek/models/58b24").
titanic_rf randomForest::randomForest v.4.6.14 gender, age, class, sibsp, parch, fare, embarked Get the model: archivist::aread("pbiecek/models/4e0fc").
titanic_gbm gbm::gbm v.2.1.5 gender, age, class, sibsp, parch, fare, embarked Get the model: archivist::aread("pbiecek/models/b7078").
titanic_svm e1071::svm v.1.7.3 gender, age, class, sibsp, parch, fare, embarked Get the model: archivist::aread("pbiecek/models/9c27f").
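
For instance, the fitted random-forest model can be restored (without re-fitting it) by calling the hook given in the table; an internet connection and the archivist package are required.

# restore the fitted random-forest model from the remote repository
titanic_rf <- archivist::aread("pbiecek/models/4e0fc")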

Table 5.2 summarizes the data frames that will be used in examples in the subsequent chapters.

Table 5.2: Data frames created for the Titanic use-case.
Description No. rows Variables Link to this object
titanic dataset with imputed missing values 2207 gender, age, class, embarked, country, fare, sibsp, parch, survived archivist::aread("pbiecek/models/27e5c")
johnny_d 8-year-old boy from the 1st class without parents, paid 72 pounds, embarked in Southampton 1 class, gender, age, sibsp, parch, fare, embarked archivist::aread("pbiecek/models/e3596")
henry 47-year-old male from the 1st class, travelled alone, paid 25 pounds, embarked in Cherbourg 1 class, gender, age, sibsp, parch, fare, embarked archivist::aread("pbiecek/models/a6538")

5.3 Python classification models for Titanic

Examples in Python are based on the titanic dataset, which is available in the dalex library. The survived column is the target variable; the remaining columns will be used to construct the classifier.

The following instructions load the titanic dataset and split it into the target variable y and the explanatory variables X. For the purpose of this example, we do not divide the data into training and testing subsets. How to analyze a model on data other than the training data will be discussed in the next chapter.

import dalex as dx
titanic = dx.datasets.load_titanic()
X = titanic.drop(columns='survived')
y = titanic.survived

Data X contains numerical variables with different ranges (e.g., age and fare) as well as categorical variables. Machine-learning algorithms in scikit-learn require the data to be preprocessed into a numeric form. Therefore, before modelling, we prepare a pipeline that performs the preprocessing: scaling for the continuous variables (age, fare, parch, sibsp) and one-hot encoding for the categorical variables (gender, class, embarked).

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

preprocess = make_column_transformer(
    (StandardScaler(), ['age', 'fare', 'parch', 'sibsp']),
    (OneHotEncoder(), ['gender', 'class', 'embarked']))

5.3.1 Logistic-regression model

The dependent variable of interest, survived, is binary. Thus, a natural choice is to start the predictive modelling with a logistic-regression model. Here we use the LogisticRegression algorithm from the sklearn library. By default, the sklearn implementation applies a ridge-type \(L_2\) penalty to the model coefficients. This is why it was important to scale numerical variables like age and fare.

The result is a model stored in object titanic_lr, which will be used in subsequent chapters.

from sklearn.linear_model import LogisticRegression

titanic_lr = make_pipeline(
    preprocess,
    LogisticRegression(penalty='l2'))
    
titanic_lr.fit(X, y)

Note that we are not very much interested in the assessment of the model’s predictive performance, but rather in understanding how the model yields its predictions. This is why we do not split the data into training and testing subsets. Instead, the model is fitted to the entire dataset and will be examined on the same dataset.

5.3.2 Random-forest model

As an alternative to the logistic-regression model we consider a random-forest model. Random-forest modelling is known for good predictive performance, ability to grasp low-order variable interactions, and stability (Leo Breiman 2001a). To fit the model, we use the RandomForestClassifier algorithm from the sklearn library. We keep the default settings except that each tree is restricted to a depth of at most 3 levels and the number of trees in the random forest is set to 500.

The result is a model stored in object titanic_rf, which will be used in subsequent chapters.

from sklearn.ensemble import RandomForestClassifier

titanic_rf = make_pipeline(
    preprocess,
    RandomForestClassifier(max_depth=3, random_state=0, n_estimators=500))
    
titanic_rf.fit(X, y)

5.3.3 Gradient-boosting model

Additionally, we consider the gradient-boosting model (Friedman 2000). Tree-based boosting models are known for being able to accommodate higher-order interactions between variables. To fit the model, we use the GradientBoostingClassifier algorithm from the sklearn library. We use the default settings, in which the number of trees in the ensemble is set to 100.

The result is a model stored in object titanic_gbc, which will be used in subsequent chapters.

from sklearn.ensemble import GradientBoostingClassifier

titanic_gbc = make_pipeline(
    preprocess,
    GradientBoostingClassifier(n_estimators=100))

titanic_gbc.fit(X, y)

5.3.4 Support Vector Machine model for Classification

Finally, we also consider a support vector machine model (Cortes and Vapnik 1995). We use the C-Support Vector Classification mode. To fit the model, we use the SVC algorithm from the sklearn library, which is based on libsvm.

The result is a model stored in object titanic_svm, which will be used in subsequent chapters.

from sklearn.svm import SVC

titanic_svm = make_pipeline(
    preprocess,
    SVC(probability=True))
    
titanic_svm.fit(X, y)

5.3.5 Models’ predictions

Let us now compare predictions that are obtained from the different models. In particular, we compute the predicted probability of survival for Johnny D, an 8-year-old boy who embarked in Southampton and travelled in the first class with no parents or siblings aboard, and with a ticket costing 72 pounds.

First, we create a dataframe johnny_d that contains the data describing the passenger.

import pandas as pd

johnny_d = pd.DataFrame({'gender': ['male'],
                       'age': [8],
                       'class': ['1st'],
                       'embarked': ['Southampton'],
                       'fare': [72],
                       'sibsp': [0],
                       'parch': [0]},
                      index = ['JohnnyD'])

Subsequently, we use the method predict_proba() to obtain the predicted probability of survival for the logistic-regression model.

titanic_lr.predict_proba(johnny_d)
# array([[0.35884528, 0.64115472]])

We do the same for the remaining three models.

titanic_rf.predict_proba(johnny_d)
# array([[0.63028556, 0.36971444]])

titanic_gbc.predict_proba(johnny_d)
# array([[0.1567194, 0.8432806]])

titanic_svm.predict_proba(johnny_d)
# array([[0.78308146, 0.21691854]])

Note that, for some examples later in the book, we will use another observation (instance) with lower chances of survival. We will call this passenger Henry.

henry = pd.DataFrame({'gender': ['male'],
                       'age': [47],
                       'class': ['1st'],
                       'embarked': ['Southampton'],
                       'fare': [25],
                       'sibsp': [0],
                       'parch': [0]},
                      index = ['Henry'])

titanic_lr.predict_proba(henry)
# array([[0.69547744, 0.30452256]])

titanic_rf.predict_proba(henry)
# array([[0.73060059, 0.26939941]])

titanic_gbc.predict_proba(henry)
# array([[0.1567194, 0.8432806]])

titanic_svm.predict_proba(henry)
# array([[0.82429369, 0.17570631]])

5.3.6 Models’ explainers

The examples that we show in this chapter are based on the sklearn library, which makes it possible to work with models in a uniform way. However, we often also want to work with models built in other libraries. To make it easier to handle models with different structures, the dalex library wraps models in objects of the class Explainer, which provide all the necessary functions of the model in a uniform way.

There is only one argument that is required by the Explainer() constructor, i.e., model. However, the constructor allows additional arguments that extend its functionalities. In particular, the list of arguments includes the following:

  • data, a data frame or matrix providing data to which the model is to be applied.
  • y, observed values of the dependent variable corresponding to the data given in the data object;
  • predict_function, a function that returns prediction scores; if not specified, then dalex will make a guess which function shall be used, predict(), predict_proba() or something else;
  • residual_function, a function that returns model residuals;
  • label, a unique name of the model;
  • model_class, class of actual model;
  • verbose, a logical argument (verbose = True by default) indicating whether diagnostic messages are to be printed;
  • model_type, information about the type of the model, either “classification” (for a binary dependent variable) or “regression” (for a continuous dependent variable);
  • model_info, a dictionary with additional information about the model.

Application of constructor Explainer() provides an object of class “Explainer”. It is an object with many components that include:

  • model, the explained model;
  • data, the data to which the model is applied;
  • y, observed values of the dependent variable corresponding to data;
  • y_hat, predictions obtained by applying model to data;
  • residuals, residuals computed based on y and y_hat;
  • predict_function, the function used to obtain the model’s predictions;
  • residual_function, the function used to obtain residuals;
  • class, class/classes of the model;
  • label, label of the model/Explainer;
  • model_info, a dictionary (with components “package,” “version,” and “type”) providing information about the model.

Thus, each Explainer-object contains all elements needed to create a model explanation. The code below creates explainers for the models (see Sections 5.3.1–5.3.4) fitted to the Titanic data.

titanic_rf_exp = dx.Explainer(titanic_rf, X, y, label = "Titanic RF Pipeline")

titanic_lr_exp = dx.Explainer(titanic_lr, X, y, label = "Titanic LR Pipeline")

titanic_gbc_exp = dx.Explainer(titanic_gbc, X, y, label = "Titanic GBC Pipeline")

titanic_svm_exp = dx.Explainer(titanic_svm, X, y, label = "Titanic SVM Pipeline")

5.4 Apartment prices

Warsaw skyscrapers by Artur Malinowski (Flickr)

Predicting house prices is a common exercise used in machine-learning courses. Various datasets for house prices are available at websites like Kaggle (https://www.kaggle.com) or UCI Machine Learning Repository (https://archive.ics.uci.edu).

In this book, we will work with an interesting variant of this problem. The apartments dataset is an artificial dataset created to match key characteristics of real apartments in Warsaw, the capital of Poland. However, the dataset is created in a way that two very different models, namely linear regression and random forest, have almost exactly the same accuracy. The natural question is then: which model should we choose? We will show that the model-explanation tools provide important insight into the key model characteristics and are helpful in model selection.

The dataset is available in the DALEX package for R and dalex library for Python. It contains 1000 observations (apartments) and six variables:

  • m2.price, apartment’s price per square meter (in EUR), a numerical variable in the range of 1607–6595;
  • construction.year, the year of construction of the block of flats in which the apartment is located, a numerical variable in the range of 1920–2010;
  • surface, apartment’s total surface in square meters, a numerical variable in the range of 20–150;
  • floor, the floor at which the apartment is located (ground floor taken to be the first floor), a numerical integer variable with values ranging from 1 to 10;
  • no.rooms, the total number of rooms, a numerical integer variable with values ranging from 1 to 6;
  • district, a factor with 10 levels indicating the district of Warsaw where the apartment is located.

The first six rows of this dataset are presented in the table below.

m2.price construction.year surface floor no.rooms district
5897 1953 25 3 1 Srodmiescie
1818 1992 143 9 5 Bielany
3643 1937 56 1 2 Praga
3517 1995 93 7 3 Ochota
3013 1992 144 6 5 Mokotow
5795 1926 61 6 2 Srodmiescie

Models considered for this dataset will use m2.price as the (continuous) dependent variable. Models’ predictions will be validated on a set of 9000 apartments included in data frame apartments_test.

Note that, usually, the testing data are smaller than the training data. In this example, however, we deliberately use a small training set so that the model-selection task is not too easy. After all, it is just an illustrative use-case.

5.4.1 Data exploration

Note that apartments is an artificial dataset created to illustrate and explain differences between random forests and linear regression. Hence, the structure of the data, the form and strength of the association between variables, the plausibility of distributional assumptions, etc., are better than in a real-life dataset. In fact, all these characteristics of the data are known. Nevertheless, we present some data exploration below to illustrate the important aspects of the data.

The variable of interest is m2.price, the price per square meter. The histogram presented in Figure 5.6 indicates that the distribution of the variable is slightly skewed to the right.

Figure 5.6: Distribution of the price per square meter in the apartment-prices data.

Figure 5.7 suggests (possibly) a non-linear relationship between construction.year and m2.price and a linear relation between surface and m2.price.

Figure 5.7: Apartment-prices data. Price per square meter vs. year of construction (left-hand-side panel) and vs. surface (right-hand-side panel).

Figure 5.8 indicates that the relationship between floor and m2.price is also close to linear, as is the association between no.rooms and m2.price.

Figure 5.8: Apartment-prices data. Price per square meter vs. floor (left-hand-side panel) and vs. number of rooms (right-hand-side panel).

Figure 5.9 shows that surface and number of rooms are positively associated and that prices depend on the district. In particular, box plots in Figure 5.9 indicate that the highest prices per square meter are observed in Srodmiescie (Downtown).

Figure 5.9: Apartment-prices data. Surface vs. number of rooms (left-hand-side panel) and price per square meter for different districts (right-hand-side panel).

5.5 R regression model for Apartment prices

5.5.1 Linear-regression model

The dependent variable of interest, m2.price, is continuous. Thus, a natural choice to build a predictive model is linear regression. We treat all the other variables in the apartments data frame as explanatory and include them in the model. To fit the model, we apply the lm() function. The results of the model are stored in model-object apartments_lm.

apartments_lm <- lm(m2.price ~ ., data = apartments)
anova(apartments_lm)
## Analysis of Variance Table
## 
## Response: m2.price
##                    Df    Sum Sq   Mean Sq  F value    Pr(>F)    
## construction.year   1   2629802   2629802   33.233 1.093e-08 ***
## surface             1 207840733 207840733 2626.541 < 2.2e-16 ***
## floor               1  79823027  79823027 1008.746 < 2.2e-16 ***
## no.rooms            1    956996    956996   12.094  0.000528 ***
## district            9 451993980  50221553  634.664 < 2.2e-16 ***
## Residuals         986  78023123     79131                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

5.5.2 Random-forest model

As an alternative to linear regression, we consider a random-forest model. Again, we treat all the variables in the apartments data frame other than m2.price as explanatory and include them in the model. To fit the model, we apply the randomForest() function, with default settings, from the package with the same name (Liaw and Wiener 2002). The results of the model are stored in model-object apartments_rf.

library("randomForest")
set.seed(72)
apartments_rf <- randomForest(m2.price ~ ., data = apartments)
apartments_rf
## 
## Call:
##  randomForest(formula = m2.price ~ ., data = apartments) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 79789.39
##                     % Var explained: 90.28

5.5.3 Support Vector Machine model for Regression

Finally, we consider a support vector machine model, with all the variables in the apartments data frame other than m2.price treated as explanatory. To fit the model, we use the svm() function, with default settings, from package e1071 (Meyer et al. 2019). The results of the model are stored in model-object apartments_svm.

library("e1071")
apartments_svm <- svm(m2.price ~ construction.year + surface + floor + 
         no.rooms + district, data = apartments)
apartments_svm
## 
## Call:
## svm(formula = m2.price ~ construction.year + surface + floor + no.rooms + 
##     district, data = apartments)
## 
## 
## Parameters:
##    SVM-Type:  eps-regression 
##  SVM-Kernel:  radial 
##        cost:  1 
##       gamma:  0.07142857 
##     epsilon:  0.1 
## 
## 
## Number of Support Vectors:  536

5.5.4 Models’ predictions

The predict() function calculates predictions for a specific model. In the example below, we use model-objects apartments_lm, apartments_rf, and apartments_svm to calculate predictions for prices of the apartments from the apartments_test data frame. Note that, for the sake of brevity, we compute the predictions only for the first six observations from the data frame.

Actual prices for the first six observations from apartments_test.

apartments_test$m2.price[1:6]
## [1] 4644 3082 2498 2735 2781 2936

Predictions with linear regression.

predict(apartments_lm, apartments_test[1:6,])
##     1001     1002     1003     1004     1005     1006 
## 4820.009 3292.678 2717.910 2922.751 2974.086 2527.043

Predictions with random-forest model.

predict(apartments_rf, apartments_test[1:6,])
##     1001     1002     1003     1004     1005     1006 
## 4214.084 3178.061 2695.787 2744.775 2951.069 2999.450

Predictions with support vector model.

predict(apartments_svm, apartments_test[1:6,])
##     1001     1002     1003     1004     1005     1006 
## 4590.076 3012.044 2369.748 2712.456 2681.777 2750.904

By using the code below, we summarize the predictive performance of the linear-regression and random-forest models by computing the square root of the mean-squared-error (RMSE). For a “perfect” predictive model, which would predict all observations exactly, RMSE should be equal to 0. More information about RMSE can be found in Section 16.3.1.
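
That is, denoting by \(y_i\) the observed price per square meter for the \(i\)-th apartment in the testing data and by \(\widehat{y}_i\) the corresponding model prediction, we compute

\[
\textrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\widehat{y}_i - y_i\right)^2},
\]

where \(n\) is the number of apartments in the testing dataset.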

predicted_apartments_lm <- predict(apartments_lm, apartments_test)
(rmsd_lm <- sqrt(mean((predicted_apartments_lm - apartments_test$m2.price)^2)))
## [1] 283.0865
predicted_apartments_rf <- predict(apartments_rf, apartments_test)
(rmsd_rf <- sqrt(mean((predicted_apartments_rf - apartments_test$m2.price)^2)))
## [1] 282.9519

For the random-forest model, RMSE is equal to 283. It is almost identical to the RMSE for the linear-regression model, which is equal to 283.1. Thus, the question we may face is: should we choose the more complex, but flexible random-forest model, or the simpler and easier to interpret linear-regression model? In the subsequent chapters, we will try to provide an answer to this question. In particular, we will show that a proper model exploration may help to discover weak and strong sides of any of the models and, in consequence, allow the creation of a new model, with better performance than either of the two.

5.5.5 Models’ explainers

The code below creates explainers for the models (see Sections 5.5.1–5.5.3) fitted to the apartment-prices data. Note that we use the apartments_test data frame without the first column, i.e., the m2.price variable, in the data argument. This will be the dataset to which the model will be applied (see Section 5.2.6). The m2.price variable is explicitly specified as the dependent variable in the y argument (see Section 5.2.6).

apartments_lm_exp <- explain(model = apartments_lm, 
                             data = apartments_test[,-1], 
                             y = apartments_test$m2.price, 
                             label = "Linear Regression")

apartments_rf_exp <- explain(model = apartments_rf, 
                             data = apartments_test[,-1], 
                             y = apartments_test$m2.price, 
                             label = "Random Forest")

apartments_svm_exp <- explain(model = apartments_svm, 
                              data = apartments_test[,-1], 
                              y = apartments_test$m2.price, 
                              label = "Support Vector Machine")

5.5.6 List of objects for the Apartment prices example

In Sections 5.5.1–5.5.3, we have built three predictive models for the apartments dataset. The models will be used in the rest of the book to illustrate the model-explanation methods and tools.

For the ease of reference, we summarize the models in Table 5.3. The binary model-objects can be downloaded by using the indicated archivist hooks (Biecek and Kosinski 2017). By calling a function specified in the last column of the table, one can restore a selected model in a local R environment.

Table 5.3: Predictive models created for the apartment-prices dataset.
Model name Model generator Variables Archivist hooks
apartments_lm stats::lm v.3.5.3 construction.year, surface, floor, no.rooms, district Get the model: archivist::aread("pbiecek/models/55f19").
apartments_rf randomForest::randomForest v.4.6.14 construction.year, surface, floor, no.rooms, district Get the model: archivist::aread("pbiecek/models/fe7a5").
apartments_svm e1071::svm v.1.7.3 construction.year, surface, floor, no.rooms, district Get the model: archivist::aread("pbiecek/models/d2ca0").

5.6 Python regression models for Apartment prices

Examples in Python are based on the apartments dataset, which is available in the dalex library. The m2_price column is the target variable; the remaining columns will be used to construct the predictive model.

The following instructions load the apartments dataset and split it into the target variable y and the explanatory variables X.

import dalex as dx
apartments = dx.datasets.load_apartments()
X = apartments.drop(columns='m2_price')
y = apartments.m2_price

Data X contains numerical variables with different ranges (e.g., surface and no_rooms) and a categorical variable (district). Machine-learning algorithms in sklearn require the data to be preprocessed into a numeric form. Therefore, before modelling, we prepare a pipeline that performs the preprocessing: scaling for the continuous variables (construction_year, surface, floor, no_rooms) and one-hot encoding for the categorical variable (district).

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

preprocess = make_column_transformer(
    (StandardScaler(), ['construction_year', 'surface', 'floor', 'no_rooms']),
    (OneHotEncoder(), ['district']))

5.6.1 Linear-regression model

The dependent variable of interest, m2_price, is continuous. Thus, a natural choice to build a predictive model is linear regression. We treat all the other variables in the apartments data frame as explanatory and include them in the model. Here we use the LinearRegression algorithm from the sklearn library.

The result is a model stored in object apartments_lm, which will be used in subsequent chapters.

from sklearn.linear_model import LinearRegression

apartments_lm = make_pipeline(
    preprocess,
    LinearRegression())
    
apartments_lm.fit(X, y)

5.6.2 Random-forest model

As an alternative to linear regression, we consider a random-forest model. Again, we treat all the variables in the apartments data frame other than m2_price as explanatory and include them in the model.

To fit the model, we use the RandomForestRegressor algorithm from the sklearn library. We keep the default settings except that each tree is restricted to a depth of at most 3 levels and the number of trees in the random forest is set to 500.

The result is a model stored in object apartments_rf, which will be used in subsequent chapters.

from sklearn.ensemble import RandomForestRegressor

apartments_rf = make_pipeline(
    preprocess,
    RandomForestRegressor(max_depth=3, random_state=0, n_estimators=500))
    
apartments_rf.fit(X, y)

5.6.3 Support Vector Machine model for Regression

Finally, we also consider a support vector machine model (Cortes and Vapnik 1995). We use support vector regression with the default radial-basis-function kernel. To fit the model, we use the SVR algorithm from the sklearn library.

The result is a model stored in object apartments_svm, which will be used in subsequent chapters.

from sklearn.svm import SVR

apartments_svm = make_pipeline(
    preprocess,
    SVR())
    
apartments_svm.fit(X, y)

5.6.4 Models’ predictions

Let us now compare predictions that are obtained from the different models. We use the method predict() to obtain the predicted price per square meter from the linear-regression model. Here, apartments_test denotes a data frame with the explanatory variables for the test apartments, analogous to the one used in the R example in Section 5.5.4.

apartments_lm.predict(apartments_test)

We do the same for the remaining two models.

apartments_rf.predict(apartments_test)

apartments_svm.predict(apartments_test)

5.6.5 Models’ explainers

The examples that we show in this chapter are based on the sklearn library, which makes it possible to work with models in a uniform way. However, we often also want to work with models built in other libraries. To make it easier to handle models with different structures, the dalex library wraps models in objects of the class Explainer, which provide all the necessary functions of the model in a uniform way. A more detailed description is given in Section 5.3.6.

apartments_lr_exp = dx.Explainer(apartments_lm, X, y, label = "Apartments LR Pipeline")

apartments_rf_exp = dx.Explainer(apartments_rf, X, y, label = "Apartments RF Pipeline")

apartments_svm_exp = dx.Explainer(apartments_svm, X, y, label = "Apartments SVM Pipeline")

References

Biecek, Przemyslaw. 2018. “DALEX: Explainers for Complex Predictive Models in R.” Journal of Machine Learning Research 19. http://jmlr.org/papers/v19/18-416.html.

Biecek, Przemyslaw, and Marcin Kosinski. 2017. “archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.” Journal of Statistical Software 82 (11): 1–28. https://doi.org/10.18637/jss.v082.i11.

Breiman, Leo. 2001a. “Random Forests.” Machine Learning 45: 5–32. https://doi.org/10.1023/a:1010933404324.

Buuren, S. van. 2012. Flexible Imputation of Missing Data. Boca Raton, FL: Chapman & Hall/CRC.

Cortes, Corinna, and Vladimir Vapnik. 1995. “Support-Vector Networks.” Machine Learning 20 (3): 273–97.

Friedman, Jerome H. 2000. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29: 1189–1232.

Harrell Jr, Frank E. 2018. Rms: Regression Modeling Strategies. https://CRAN.R-project.org/package=rms.

Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. http://CRAN.R-project.org/doc/Rnews/.

Little, R. J. A., and D. B. Rubin. 2002. Statistical Analysis with Missing Data (2nd Ed.). Hoboken, NJ: Wiley.

Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2019. E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), Tu Wien. https://CRAN.R-project.org/package=e1071.

Molenberghs, G., and M. G. Kenward. 2007. Missing Data in Clinical Studies. Chichester, England: Wiley.

Ridgeway, Greg. 2017. Gbm: Generalized Boosted Regression Models. https://CRAN.R-project.org/package=gbm.

Schafer, J. L. 1997. Analysis of Incomplete Multivariate Data. Boca Raton, FL: Chapman & Hall/CRC.