17 Variable Importance

17.1 Introduction

In this chapter, we present methods that are useful for the evaluation of the importance of an explanatory variable. The methods may be applied for several purposes.

  • Model simplification: variables that do not influence a model’s predictions may be excluded from the model.
  • Model exploration: comparison of a variable’s importance in different models may help in discovering interrelations between the variables. Also, the ordering of variables according to their importance is helpful in deciding the order in which further model exploration should be performed.
  • Domain-knowledge-based model validation: identification of the most important variables may be helpful in assessing the validity of the model based on the domain knowledge.
  • Knowledge generation: identification of the most important variables may lead to discovery of new factors involved in a particular mechanism.

The methods for assessment of variable importance can be divided, in general, into two groups: model-specific and model-agnostic.

For models like linear models, random forest, and many others, there are methods for assessing variable importance that exploit particular elements of the structure of the model. These are model-specific methods. For instance, for linear models, one can use the value of the normalized regression coefficient or its corresponding p-value as the variable-importance measure. For tree-based ensembles, such a measure may be based on the use of a particular variable in particular trees (see, e.g., XgboostExplainer (Foster 2017) for gradient boosting and RandomForestExplainer (Paluszynska and Biecek 2017) for random forest).

In this book we focus on model-agnostic methods. These methods do not assume anything about the model structure. Therefore, they can be applied to any predictive model or ensemble of models. Moreover, and perhaps even more importantly, they allow comparing variable importance between models with different structures.

17.2 Intuition

We focus on the method described in more detail by Fisher, Rudin, and Dominici (2018). The main idea is to measure how much the model fit decreases if the effect of a selected explanatory variable, or of a group of variables, is removed. The effect is removed by means of perturbations, like resampling from an empirical distribution or permutation of the values of the variable.

The idea is, in some sense, borrowed from the variable-importance measure proposed by Breiman (2001) for random forests. If a variable is important, then we expect the model’s performance to worsen after the values of the variable are permuted. The larger the drop in performance, the more important the variable.

Despite the simplicity of the definition, permutation-based variable importance is a very powerful model-agnostic tool for model exploration. Values of permutation-based variable importance may be compared between models with different structures. This property is discussed in detail in the Pros and cons section.

17.3 Method

Consider a set of \(n\) observations for a set of \(p\) explanatory variables. Denote by \(\widetilde{y}=(f(x_1),\ldots,f(x_n))\) the vector of predictions for model \(f()\) for all the observations. Let \(y\) denote the vector of observed values of the dependent variable \(Y\).

Let \(\mathcal L(\widetilde{y}, y)\) be a loss function that quantifies the lack of fit of model \(f()\) based on \(\widetilde{y}\) and \(y\). For instance, \(\mathcal L\) may be the value of the negative log-likelihood. Consider the following algorithm:

  1. Compute \(L = \mathcal L(\widetilde{y}, y)\), i.e., the value of the loss function for the original data.
  2. For each explanatory variable \(X^j\) included in the model, do steps 3-6.
  3. Replace vector \(x^j\) of observed values of \(X^j\) by vector \(x^{*j}\) of resampled or permuted values.
  4. Calculate model predictions \(\widetilde{y}^{*j}\) for the modified data, i.e., the data in which \(x^j\) has been replaced by \(x^{*j}\).
  5. Calculate the value of the model performance for the modified data: \[ L^{*j} = \mathcal L(\widetilde{y}^{*j}, y) \]
  6. Quantify the importance of explanatory variable \(x^j\) by calculating \(vip_{Diff}(x^j) = L^{*j} - L\) or \(vip_{Ratio}(x^j) = L^{*j} / L\), where \(L\) is the value of the loss function for the original data.
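As an illustration, the algorithm above can be sketched in a few lines of base R. The `permutation_importance()` helper and the toy linear-model example below are our own illustrative constructs, not part of any package:

```r
# Sketch of the permutation-based variable-importance algorithm.
# `loss` is any function of (observed, predicted); here we use RMSE.
permutation_importance <- function(model, data, y, loss) {
  # Step 1: value of the loss function for the original data
  L0 <- loss(y, predict(model, data))
  vars <- all.vars(formula(model))[-1]   # explanatory variables
  sapply(vars, function(v) {
    perturbed <- data
    # Step 3: permute the observed values of variable v
    perturbed[[v]] <- sample(perturbed[[v]])
    # Steps 4-5: predictions and loss for the modified data
    Lj <- loss(y, predict(model, perturbed))
    # Step 6: difference-based variable importance
    Lj - L0
  })
}

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Toy example: y depends strongly on x1, only weakly on x2
set.seed(1)
df <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
df$y <- 3 * df$x1 + 0.1 * df$x2 + rnorm(200)
model <- lm(y ~ x1 + x2, data = df)

vi <- permutation_importance(model, df, df$y, rmse)
vi
# x1 should show a much larger increase in loss than x2
```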

Note that the use of resampling or permuting data in Step 3 involves randomness. Thus, the results of the procedure may depend on the actual configuration of resampled/permuted values. Hence, it is advisable to repeat the procedure several times. In this way, the uncertainty related to the calculated variable-importance values can be assessed.

The calculations in Step 6 “normalize” the value of the variable-importance measure with respect to \(L\). However, given that \(L\) is a constant, the normalization has no effect on the ranking of variables according to \(vip_{Diff}(x^j)\) or \(vip_{Ratio}(x^j)\). Thus, in practice, the values of \(L^{*j}\) are often used directly to quantify a variable’s importance.

17.4 Example: Titanic data

In this section, we illustrate the use of the permutation-based variable-importance method by applying it to the random forest model for the Titanic data (see Section 5.1.3). Recall that the goal is to predict the survival probability of passengers based on their sex, age, place of embarkation, the class in which they travelled, the fare, and the number of persons they travelled with.

Figure 17.1 shows the values of the loss function, measured as \(1-AUC^{*j}\), obtained after permuting, in turn, each of the variables included in the model. Additionally, the plot indicates the value of \(L\) by the vertical dashed line at the left-hand side. The bars span from \(L\) to \(L^{*j}=1-AUC^{*j}\); the length of each bar corresponds to the importance of the corresponding variable.

Figure 17.1: Each interval presents the difference between the loss function for the original data (vertical dashed line at the left) and for the data with permuted observation for a particular variable.

The plot in Figure 17.1 suggests that the most important variable in the model is gender. This agrees with the conclusions drawn in the exploratory analysis presented in Section 5.1.1. The next three most important variables are the class of travel (first-class passengers had a higher chance of survival), age (children had a higher chance of survival), and fare (owners of more expensive tickets had a higher chance of survival).

To take into account the uncertainty related to the use of permutations, we can consider computing the average values of \(L^{*j}\) over a set of, say, 10 permutations. The plot in Figure 17.2 presents the average values. The only remarkable difference, as compared to Figure 17.1, is the change in the ordering of the sibsp and parch variables.

Figure 17.2: Average variable importance based on 10 permutations.

Plots similar to those presented in Figures 17.1 and 17.2 are useful for comparisons of variable importance between different models. Figure 17.3 presents the single-permutation results for the random forest, gradient boosting (see Section 5.1.4), and logistic regression (see Section 5.1.2) models. The best result, in terms of the smallest value of the goodness-of-fit function \(L\), is obtained for the random forest model. Note, however, that this model includes more variables than the other two. For instance, the variable fare, which is highly correlated with the travel class, is not important in either the gradient boosting or the logistic regression model, but is important in the random forest model.

The plots in Figure 17.3 indicate that gender is the most important variable in all three models, followed by class.

Figure 17.3: Variable importance for the random forest, gradient boosting, and logistic regression models for the Titanic data.

17.5 Pros and cons

Permutation-based variable importance offers a model-agnostic approach to the assessment of the influence of each variable on model performance. The approach has several advantages. The plots are easy to understand. They are compact: all the most important variables are presented in a single plot.

Permutation-based variable importance is expressed in terms of model performance and can be compared between models. In different models, the same variable may have different importance scores, and comparison of these scores may lead to interesting insights. For example, if variables are correlated, then models like random forest are expected to spread importance across all of them, while in regularized regression models the coefficient for one of the correlated variables may dominate over the coefficients for the others.

The same approach can be used to measure the importance of a single explanatory variable or of a group of variables. The latter is useful for “aspects”, i.e., groups of variables that are complementary or related to a similar concept. For example, in the Titanic data, the fare and class variables are both linked with the wealth of a passenger. Instead of calculating the effect of each variable independently, we may calculate the effect of both variables jointly by permuting them together.
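A minimal base-R sketch of this idea: the `permute_group()` helper below (our own illustrative construct) applies a single permutation of the row indices to all columns in the group, so the variables are permuted jointly:

```r
# Permute a group of columns together: one permutation of row indices
# is applied to all columns in the group, so the within-group
# dependence between the variables is preserved.
permute_group <- function(data, variables) {
  idx <- sample(nrow(data))
  data[variables] <- data[idx, variables]
  data
}

rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))

# Toy example: two highly correlated variables forming one "aspect"
set.seed(4)
n  <- 300
z  <- rnorm(n)
df <- data.frame(a = z + rnorm(n, sd = 0.1), b = z + rnorm(n, sd = 0.1))
df$y <- df$a + df$b + rnorm(n)
model <- lm(y ~ a + b, data = df)

L0 <- rmse(df$y, predict(model, df))
# Importance of the {a, b} aspect: permute both columns jointly
L_group <- rmse(df$y, predict(model, permute_group(df, c("a", "b"))))
L_group - L0
```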

The main disadvantage of the measure stems from the randomness of the permutations: for different permutations we may, in general, get different results. Also, different choices of the model-performance measure, like precision, accuracy, or AUC, lead to different numeric values of variable importance. Finally, the results depend on the data used for the assessment of model performance: different importance values may be obtained for the training and testing data.

17.6 Code snippets for R

For illustration, we will use the random forest model for the apartment-prices data (see Section 5.2.3).

Let’s recover a regression model for prediction of apartment prices.

A popular loss function for regression models is the root-mean-square loss:

\[ L(y, \tilde y) = \sqrt{\frac1n \sum_{i=1}^n (y_i - \tilde y_i)^2} \]
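This loss can be coded directly in R; the `loss_rmse()` helper name below is our own:

```r
# Root-mean-square loss for observed values and model predictions
loss_rmse <- function(observed, predicted) {
  sqrt(mean((observed - predicted)^2))
}

loss_rmse(c(1, 2, 3), c(1, 2, 5))  # sqrt(4/3), approx. 1.1547
```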

It is implemented in the DALEX package in the function loss_root_mean_square. The initial value of the loss function \(L\) for this model is

## [1] 792.8346

Let’s calculate variable importance for root mean square loss with the model_parts function.
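A sketch of how this calculation might look with the DALEX package (assuming DALEX and randomForest are installed; the model is refitted here from scratch, so the resulting numbers will not exactly match the output below):

```r
# A sketch: the apartments and apartments_test datasets ship with DALEX
library("DALEX")
library("randomForest")

set.seed(59)
model_rf <- randomForest(m2.price ~ ., data = apartments)

explainer_rf <- explain(model_rf,
                        data  = apartments_test,
                        y     = apartments_test$m2.price,
                        label = "randomForest")

# Permutation variable importance based on the root-mean-square loss,
# averaged over B = 10 permutations
vip_rf <- model_parts(explainer_rf,
                      loss_function = loss_root_mean_square,
                      B = 10)
vip_rf
plot(vip_rf)
```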

##            variable mean_dropout_loss        label
## 1      _full_model_          796.0100 randomForest
## 2          no.rooms          828.4179 randomForest
## 3 construction.year          842.2287 randomForest
## 4          district          850.4096 randomForest
## 5             floor          858.8663 randomForest
## 6           surface          875.2063 randomForest
## 7        _baseline_         1118.4724 randomForest

On a diagnostic plot, it is useful to present variable importance with boxplots that show the results obtained for different permutations.

Figure 17.4: Permutation variable importance calculated as root mean square loss for random forest model for apartments data.

17.6.1 Models comparison

Variable-importance plots are a very useful tool for model comparison. In Section 5.2, we trained three models on the apartments dataset. These were models with different structures, which makes the comparison more interesting: a random forest model (Breiman et al. 2018) (flexible, but potentially biased), a support vector machine model (Meyer et al. 2017) (large variance on boundaries), and a linear model (stable, but not very flexible).

Let’s calculate permutation variable importance with root mean square error for these three models.
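A sketch of these calculations, assuming the DALEX, randomForest, and e1071 packages are installed (again, the refitted models will not reproduce the numbers below exactly):

```r
library("DALEX")
library("randomForest")
library("e1071")

set.seed(59)
# Three models with different structures (sketch)
model_lm  <- lm(m2.price ~ ., data = apartments)
model_rf  <- randomForest(m2.price ~ ., data = apartments)
model_svm <- svm(m2.price ~ ., data = apartments)

explainers <- list(
  explain(model_lm,  data = apartments_test,
          y = apartments_test$m2.price, label = "lm"),
  explain(model_rf,  data = apartments_test,
          y = apartments_test$m2.price, label = "randomForest"),
  explain(model_svm, data = apartments_test,
          y = apartments_test$m2.price, label = "svm")
)

# Permutation variable importance with the root-mean-square loss
vips <- lapply(explainers, model_parts,
               loss_function = loss_root_mean_square)
vips

# All three sets of importance values in a single plot
do.call(plot, vips)
```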

##            variable mean_dropout_loss label
## 1      _full_model_          281.8345    lm
## 2 construction.year          281.7864    lm
## 3          no.rooms          293.7945    lm
## 4             floor          486.0535    lm
## 5           surface          614.4047    lm
## 6          district         1018.8827    lm
## 7        _baseline_         1262.6592    lm
##            variable mean_dropout_loss        label
## 1      _full_model_          802.9422 randomForest
## 2          no.rooms          834.9660 randomForest
## 3 construction.year          851.9975 randomForest
## 4          district          852.5380 randomForest
## 5             floor          874.3987 randomForest
## 6           surface          880.9620 randomForest
## 7        _baseline_         1110.6190 randomForest
##            variable mean_dropout_loss label
## 1      _full_model_          984.9034   svm
## 2          district          950.4622   svm
## 3          no.rooms          980.3698   svm
## 4 construction.year         1041.9925   svm
## 5             floor         1072.9481   svm
## 6           surface         1096.7851   svm
## 7        _baseline_         1237.6861   svm

Now we can plot the variable-importance measures for all three models in a single plot. The intervals start at different values; from them we can read that the loss \(L\) for the linear model is the lowest.

When we compare the other variables, surface and floor appear important in all three models, while district is, by far, the most important variable in the linear model, but much less so in the other two models.

Figure 17.5: Permutation variable importance on apartments data for Random forest, Support vector model and Linear model.

There is an interesting difference between the linear model and the other two models in the importance of construction.year. For the linear model, this variable is not important, while for the remaining two models it shows some importance.

In the next chapter we will see how this is possible.

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Breiman, Leo, Adele Cutler, Andy Liaw, and Matthew Wiener. 2018. RandomForest: Breiman and Cutler’s Random Forests for Classification and Regression. https://CRAN.R-project.org/package=randomForest.

Fisher, Aaron, Cynthia Rudin, and Francesca Dominici. 2018. “Model Class Reliance: Variable Importance Measures for Any Machine Learning Model Class, from the ’Rashomon’ Perspective.” Journal of Computational and Graphical Statistics. http://arxiv.org/abs/1801.01489.

Foster, David. 2017. XgboostExplainer: An R Package That Makes Xgboost Models Fully Interpretable. https://github.com/AppliedDataSciencePartners/xgboostExplainer/.

Meyer, David, Evgenia Dimitriadou, Kurt Hornik, Andreas Weingessel, and Friedrich Leisch. 2017. E1071: Misc Functions of the Department of Statistics, Probability Theory Group (Formerly: E1071), Tu Wien. https://CRAN.R-project.org/package=e1071.

Paluszynska, Aleksandra, and Przemyslaw Biecek. 2017. RandomForestExplainer: A Set of Tools to Understand What Is Happening Inside a Random Forest. https://github.com/MI2DataLab/randomForestExplainer.