# 17 Variable’s Importance

## 17.1 Introduction

In this chapter, we present methods that are useful for evaluating the importance of an explanatory variable. The methods may be applied for several purposes.

- Model simplification: variables that do not influence the model’s predictions may be excluded from the model.
- Model exploration: comparison of a variable’s importance in different models may help in discovering interrelations between the variables. Also, ordering the variables by their importance helps in deciding in what order to perform further model exploration.
- Domain-knowledge-based model validation: identification of the most important variables may be helpful in assessing the validity of the model based on the domain knowledge.
- Knowledge generation: identification of the most important variables may lead to discovery of new factors involved in a particular mechanism.

The methods for assessment of variable importance can be divided, in general, into two groups: model-specific and model-agnostic.

For models like linear models, random forests, and many others, there are methods of assessing variable importance that exploit particular elements of the structure of the model. These are model-specific methods. For instance, for linear models, one can use the value of the normalized regression coefficient or its corresponding p-value as the variable-importance measure. For tree-based ensembles, such a measure may be based on the use of a particular variable in particular trees (see, e.g., `XgboostExplainer` (Foster 2017) for gradient boosting and `RandomForestExplainer` (Paluszynska and Biecek 2017) for random forest).

In this book we focus on model-agnostic methods. These methods do not assume anything about the model structure. Therefore, they can be applied to any predictive model or ensemble of models. Moreover, and perhaps even more importantly, they allow comparing variable importance between models with different structures.

## 17.2 Intuition

We focus on the method described in more detail by Fisher, Rudin, and Dominici (2018). The main idea is to measure how much the model fit decreases if the effect of a selected explanatory variable, or of a group of variables, is removed. The effect is removed by means of perturbations, such as resampling from the empirical distribution or simply permuting the values of the variable.

The idea is in some sense borrowed from the variable-importance measure proposed by Breiman for random forests. If a variable is important, then we expect the model performance to be lower after the values of this variable are permuted. The larger the drop in performance, the more important the variable.

Despite the simplicity of its definition, permutation variable importance is a very powerful model-agnostic tool for model exploration. Values of permutation variable importance may be compared between models with different structures. This property is discussed in detail in Section 17.5.

## 17.3 Method

Consider a set of \(n\) observations for a set of \(p\) explanatory variables. Denote by \(\widetilde{y}=(f(x_1),\ldots,f(x_n))\) the vector of predictions for model \(f()\) for all the observations. Let \(y\) denote the vector of observed values of the dependent variable \(Y\).

Let \(\mathcal L(\widetilde{y}, y)\) be a loss function that quantifies goodness of fit of model \(f()\) based on \(\widetilde{y}\) and \(y\), with smaller values indicating a better fit. For instance, \(\mathcal L\) may be the value of the negative log-likelihood. Consider the following algorithm:

1. Compute \(L = \mathcal L(\widetilde{y}, y)\), i.e., the value of the loss function for the original data.
2. For each explanatory variable \(X^j\) included in the model, do steps 3-6.
3. Replace vector \(x^j\) of observed values of \(X^j\) by vector \(x^{*j}\) of resampled or permuted values.
4. Calculate model predictions \(\widetilde{y}^{*j}\) for the modified data, \(\widetilde{y}^{*j} = f(x^{*j})\).
5. Calculate the value of the loss function for the modified data: \[ L^{*j} = \mathcal L(\widetilde{y}^{*j}, y) \]
6. Quantify the importance of explanatory variable \(X^j\) by calculating \(vip_{Diff}(x^j) = L^{*j} - L\) or \(vip_{Ratio}(x^j) = L^{*j} / L\), where \(L\) is the value of the loss function for the original data.
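The steps above can be sketched in a few lines of base R. This is only a minimal illustration under stated assumptions: the synthetic data, the linear model, and the names `rmse`, `L_perm`, `vip_diff`, and `vip_ratio` are hypothetical and chosen for this example.

```r
set.seed(1)

# Synthetic data (hypothetical): y depends strongly on x1, weakly on x2, not on x3
n  <- 500
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
df$y <- 3 * df$x1 + 0.5 * df$x2 + rnorm(n)

model <- lm(y ~ x1 + x2 + x3, data = df)

# Loss function: root mean square error
rmse <- function(predicted, observed) sqrt(mean((observed - predicted)^2))

# Step 1: loss for the original data
L <- rmse(predict(model, df), df$y)

# Steps 2-5: permute one variable at a time and recompute the loss
vars   <- c("x1", "x2", "x3")
L_perm <- sapply(vars, function(v) {
  df_perm      <- df
  df_perm[[v]] <- sample(df_perm[[v]])   # Step 3: permute the values of X^j
  rmse(predict(model, df_perm), df$y)    # Steps 4-5: predictions and loss
})

# Step 6: importance as a difference or as a ratio
vip_diff  <- L_perm - L
vip_ratio <- L_perm / L

print(sort(vip_diff, decreasing = TRUE))
```

In this toy setting, `x1` should come out as the most important variable, and both measures should order the variables in the same way.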

Note that the use of resampling or permuting data in Step 3 involves randomness. Thus, the results of the procedure may depend on the actual configuration of resampled/permuted values. Hence, it is advisable to repeat the procedure several times. In this way, the uncertainty related to the calculated variable-importance values can be assessed.
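A sketch of such a repetition, again in base R, is given below. The data, the model, the helper `perm_loss`, and the choice of `B = 10` repetitions are assumptions made for illustration only.

```r
set.seed(2)

# Synthetic data (hypothetical) and a simple model
n  <- 300
df <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
df$y <- 2 * df$x1 + df$x2 + rnorm(n)
model <- lm(y ~ x1 + x2, data = df)

rmse <- function(predicted, observed) sqrt(mean((observed - predicted)^2))

# Loss after a single permutation of variable v
perm_loss <- function(v) {
  df_perm      <- df
  df_perm[[v]] <- sample(df_perm[[v]])
  rmse(predict(model, df_perm), df$y)
}

B <- 10  # number of repetitions (arbitrary choice)

# Matrix with one row per repetition and one column per variable
losses <- sapply(c("x1", "x2"), function(v) replicate(B, perm_loss(v)))

# Mean and standard deviation of the loss over the repetitions
summary_tab <- rbind(mean = colMeans(losses), sd = apply(losses, 2, sd))
print(summary_tab)
```

The standard deviations give a rough idea of how much a single-permutation result can vary.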

The calculations in Step 6 “normalize” the value of the variable-importance measure with respect to \(L\). However, given that \(L\) is a constant, the normalization has no effect on the ranking of variables according to \(vip_{Diff}(x^j)\) or \(vip_{Ratio}(x^j)\). Thus, in practice, the values of \(L^{*j}\) are often simply used to quantify a variable’s importance.

## 17.4 Example: Titanic data

In this section, we illustrate the use of the permutation-based variable-importance method by applying it to the random forest model for the Titanic data (see Section 5.1.3). Recall that the goal is to predict survival probability of passengers based on their sex, age, place of embarkment, class in which they travelled, fare, and the number of persons they travelled with.

Figure 17.1 shows the values of the loss function, measured as \(1-AUC^{*j}\), after permuting, in turn, each of the variables included in the model. Additionally, the plot indicates the value of \(L\) by the vertical dashed line at the left-hand side of the plot. Each bar spans the interval between \(L\) and \(L^{*j}=1-AUC^{*j}\), so its length corresponds to the variable importance.
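To make the loss \(1-AUC\) concrete, the sketch below computes the AUC with the rank-based (Mann-Whitney) formula in base R. The function names `auc` and `one_minus_auc` and the toy vectors are hypothetical; they are not the code used to produce Figure 17.1.

```r
# AUC via the rank formula; y is a 0/1 outcome, score is a predicted probability
auc <- function(y, score) {
  r     <- rank(score)   # average ranks are used for tied scores
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Loss function: smaller values indicate better discrimination
one_minus_auc <- function(y, score) 1 - auc(y, score)

# Toy example: three of the four positive-negative pairs are ordered correctly
y     <- c(0, 0, 1, 1)
score <- c(0.1, 0.4, 0.35, 0.8)
one_minus_auc(y, score)  # 0.25
```

A loss of this form can be plugged into the permutation algorithm in place of any other loss function.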

The plot in Figure 17.1 suggests that the most important variable in the model is gender. This agrees with the conclusions drawn in the exploratory analysis presented in Section 5.1.1. The next three most important variables are the travel class (first-class passengers had a higher chance of survival), age (children had a higher chance of survival), and fare (owners of more expensive tickets had a higher chance of survival).

To take into account the uncertainty related to the use of permutations, we can consider computing the average values of \(L^{*j}\) over a set of, say, 10 permutations. The plot in Figure 17.2 presents the average values. The only remarkable difference, as compared to Figure 17.1, is the change in the ordering of the `sibsp` and `parch` variables.

Plots similar to those presented in Figures 17.1 and 17.2 are useful for comparing variable importance in different models. Figure 17.3 presents the single-permutation results for the random forest, gradient boosting (see Section 5.1.4), and logistic regression (see Section 5.1.2) models. The best result, in terms of the smallest value of the goodness-of-fit function \(L\), is obtained for the random forest model. Note, however, that this model includes more variables than the other two. For instance, variable `fare`, which is highly correlated with the travel class, is important neither in the gradient boosting nor in the logistic regression model, but is important in the random forest model.

The plots in Figure 17.3 indicate that `gender` is the most important variable in all three models, followed by `class`.

## 17.5 Pros and cons

Permutation variable importance offers a model-agnostic approach to the assessment of the influence of each variable on model performance. The approach offers several advantages. The plots are easy to understand. They are also compact: all of the most important variables are presented in a single plot.

Permutation variable importance is expressed in terms of model performance and can be compared between models. The same variable may have different importance scores in different models, and comparison of such scores may lead to interesting insights. For example, if variables are correlated, then models like random forest are expected to spread importance across all of them, while in regularized regression models the coefficient of one correlated variable may dominate the coefficients of the others.

The same approach can be used to measure the importance of a single explanatory variable or of a group of variables. The latter is useful for aspects, i.e., groups of variables that are complementary or related to a similar concept. For example, in the Titanic example, the `fare` and `class` variables are linked with the wealth of a passenger. Instead of calculating the effect of each variable independently, we may calculate the joint effect of both variables by permuting them together.
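A joint permutation of a group of variables can be sketched as follows. All names and data here are hypothetical: the numeric columns `fare` and `class` are synthetic stand-ins for two correlated proxies of wealth, not the Titanic variables, and `perm_group_loss` is a helper defined only for this example.

```r
set.seed(3)

# Synthetic data (hypothetical): fare and class are correlated proxies of wealth
n      <- 400
wealth <- rnorm(n)
df <- data.frame(
  fare  = wealth + rnorm(n, sd = 0.3),
  class = wealth + rnorm(n, sd = 0.3),
  age   = rnorm(n)
)
df$y <- df$fare + df$class + 0.5 * df$age + rnorm(n)

model <- lm(y ~ fare + class + age, data = df)
rmse  <- function(predicted, observed) sqrt(mean((observed - predicted)^2))
L     <- rmse(predict(model, df), df$y)

# Permute all variables of a group with the same row order, so that
# the relation between variables within the group is preserved
perm_group_loss <- function(group) {
  idx            <- sample(nrow(df))
  df_perm        <- df
  df_perm[group] <- df[idx, group, drop = FALSE]
  rmse(predict(model, df_perm), df$y)
}

groups    <- list(wealth = c("fare", "class"), age = "age")
vip_group <- sapply(groups, perm_group_loss) - L
print(vip_group)
```

Using one shared row index for the whole group is the design choice that distinguishes this from permuting each variable separately.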

The main disadvantage of this measure comes from the randomness behind permutations: different permutations may give different results. Also, different choices of the model performance measure, like precision, accuracy, or AUC, lead to different numeric values of variable importance. The last disadvantage is related to the data used for the assessment of model performance: different importance values may be obtained for training and testing data.

## 17.6 Code snippets for R

For illustration, we will use the random forest model for the apartment prices data (see Section 5.2.3).

Let’s recover a regression model for prediction of apartment prices.
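A possible way to set this up is sketched below. It assumes that the `DALEX` and `randomForest` packages are installed and that the test data are available as `apartments_test` in `DALEX`; the seed, the object names `model_rf` and `explainer_rf`, and the vector `vars` are arbitrary choices for this example, so the numbers obtained need not match the output shown in this section.

```r
library("DALEX")
library("randomForest")

set.seed(59)  # arbitrary seed, used only to make the sketch reproducible

# Explanatory variables used in the model
vars <- c("construction.year", "surface", "floor", "no.rooms", "district")

# Random forest model for the price per square metre
model_rf <- randomForest(m2.price ~ construction.year + surface + floor +
                           no.rooms + district, data = apartments)

# Explainer: a uniform wrapper around the model, evaluated on test data
explainer_rf <- explain(model_rf,
                        data    = apartments_test[, vars],
                        y       = apartments_test$m2.price,
                        label   = "randomForest",
                        verbose = FALSE)
```

The explainer object bundles the model, the data, and the observed values of the dependent variable, which is all that the permutation algorithm needs.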

A popular loss function for regression models is the root mean square loss.

\[ L(y, \tilde y) = \sqrt{\frac1n \sum_{i=1}^n (y_i - \tilde y_i)^2} \]

It is implemented in the `DALEX` package in the function `loss_root_mean_square`. The initial value of the loss function \(L\) for this model is

`## [1] 792.8346`
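The root mean square loss is simple enough to write out directly. The sketch below is a base R version of the formula above; the function name `rmse` and the toy vectors are illustrative and are not taken from `DALEX`.

```r
# Root mean square loss, following the formula for L(y, y~)
rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

# Toy example: errors are 1, 0, -1, -2, so the loss equals sqrt(6 / 4)
y     <- c(3, 5, 7, 9)
y_hat <- c(2, 5, 8, 11)
rmse(y, y_hat)
```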

Let’s calculate variable importance for root mean square loss with the `model_parts` function.

```
##            variable mean_dropout_loss        label
## 1      _full_model_          796.0100 randomForest
## 2          no.rooms          828.4179 randomForest
## 3 construction.year          842.2287 randomForest
## 4          district          850.4096 randomForest
## 5             floor          858.8663 randomForest
## 6           surface          875.2063 randomForest
## 7        _baseline_         1118.4724 randomForest
```

On a diagnostic plot, it is useful to present variable importance with boxplots that show the results for different permutations.
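A table and a plot of this kind might be produced along the following lines. This is a sketch that assumes the `DALEX` and `randomForest` packages and the `apartments_test` data from `DALEX`; the seed, the object names, and the choice of `B = 10` permutations are arbitrary, so the values obtained will differ from those printed above.

```r
library("DALEX")
library("randomForest")

set.seed(59)  # arbitrary seed

vars <- c("construction.year", "surface", "floor", "no.rooms", "district")

model_rf     <- randomForest(m2.price ~ ., data = apartments)
explainer_rf <- explain(model_rf,
                        data    = apartments_test[, vars],
                        y       = apartments_test$m2.price,
                        label   = "randomForest",
                        verbose = FALSE)

# B permutations per variable; type = "raw" reports the loss L^{*j} itself
vip_rf <- model_parts(explainer_rf,
                      loss_function = loss_root_mean_square,
                      B    = 10,
                      type = "raw")
print(vip_rf)

# Bars show the mean loss; boxplots summarize the spread over permutations
plot(vip_rf)
```

The rows `_full_model_` and `_baseline_` correspond, respectively, to the loss for the original data and to the loss obtained when all variables are permuted.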

### 17.6.1 Models comparison

Variable importance plots are a very useful tool for model comparison. In Section 5.2 we trained three models on the `apartments` dataset. The models have different structures, which makes the comparison more interesting: a random forest model (Breiman et al. 2018) (flexible but biased), a support vector machine model (Meyer et al. 2017) (large variance on boundaries), and a linear model (stable but not very flexible).

Let’s calculate permutation variable importance with root mean square error for these three models.

```
##            variable mean_dropout_loss label
## 1      _full_model_          281.8345    lm
## 2 construction.year          281.7864    lm
## 3          no.rooms          293.7945    lm
## 4             floor          486.0535    lm
## 5           surface          614.4047    lm
## 6          district         1018.8827    lm
## 7        _baseline_         1262.6592    lm
```

```
##            variable mean_dropout_loss        label
## 1      _full_model_          802.9422 randomForest
## 2          no.rooms          834.9660 randomForest
## 3 construction.year          851.9975 randomForest
## 4          district          852.5380 randomForest
## 5             floor          874.3987 randomForest
## 6           surface          880.9620 randomForest
## 7        _baseline_         1110.6190 randomForest
```

```
##            variable mean_dropout_loss label
## 1      _full_model_          984.9034   svm
## 2          district          950.4622   svm
## 3          no.rooms          980.3698   svm
## 4 construction.year         1041.9925   svm
## 5             floor         1072.9481   svm
## 6           surface         1096.7851   svm
## 7        _baseline_         1237.6861   svm
```

Now we can plot variable importance for all three models on a single plot. The intervals start at different values, which correspond to the loss of each full model. In the output above, the loss is the lowest for the linear model.
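A comparison of this kind might be set up as sketched below. The code assumes the `DALEX`, `randomForest`, and `e1071` packages and the `apartments_test` data from `DALEX`; the helper `make_explainer`, the seed, and the default model settings are choices made for this example and may differ from those used in Section 5.2.

```r
library("DALEX")
library("randomForest")
library("e1071")

set.seed(59)  # arbitrary seed

vars <- c("construction.year", "surface", "floor", "no.rooms", "district")

# Three models with different structures, fitted with default settings
model_lm  <- lm(m2.price ~ ., data = apartments)
model_rf  <- randomForest(m2.price ~ ., data = apartments)
model_svm <- svm(m2.price ~ ., data = apartments)

# Hypothetical helper: one explainer per model, all on the same test data
make_explainer <- function(model, label) {
  explain(model,
          data    = apartments_test[, vars],
          y       = apartments_test$m2.price,
          label   = label,
          verbose = FALSE)
}

explainers <- list(lm  = make_explainer(model_lm,  "lm"),
                   rf  = make_explainer(model_rf,  "randomForest"),
                   svm = make_explainer(model_svm, "svm"))

# Permutation importance with the same loss function for every model
vips <- lapply(explainers, model_parts,
               loss_function = loss_root_mean_square)

# One panel per model; bars start at the loss of the full model
plot(vips$lm, vips$rf, vips$svm)
```

Because the same loss function and the same test data are used for every model, the resulting values are directly comparable across the panels.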

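When we compare the variables, `surface` and `floor` are among the most important features in all three models. The `district` variable is by far the most important one in the linear model, while in the output above its importance is smaller in the random forest model and negligible in the SVM model.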

There is an interesting difference between the linear model and the other two in the importance of `construction.year`. For the linear model, this variable is not important, while for the remaining two models it has some importance.

In the next chapter we will see how this is possible.

### References

Breiman, Leo, Adele Cutler, Andy Liaw, and Matthew Wiener. 2018. *RandomForest: Breiman and Cutler’s Random Forests for Classification and Regression*. https://CRAN.R-project.org/package=randomForest.

Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. “Model Class Reliance: Variable Importance Measures for Any Machine Learning Model Class, from the ’Rashomon’ Perspective.” *Journal of Computational and Graphical Statistics*. http://arxiv.org/abs/1801.01489.

Foster, David. 2017. *XgboostExplainer: An R Package That Makes Xgboost Models Fully Interpretable*. https://github.com/AppliedDataSciencePartners/xgboostExplainer/.

Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2017. *E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), Tu Wien*. https://CRAN.R-project.org/package=e1071.

Paluszynska, Aleksandra, and Przemyslaw Biecek. 2017. *RandomForestExplainer: A Set of Tools to Understand What Is Happening Inside a Random Forest*. https://github.com/MI2DataLab/randomForestExplainer.