12 Ceteris-paribus Oscillations

12.1 Introduction

Visual examination of ceteris-paribus (CP) profiles, as illustrated in the previous chapter, is insightful. However, for a model with a large number of explanatory variables, we may end up with a large number of plots that may be overwhelming. In such situation, it might be useful to select the most interesting or important profiles. In this chapter, we describe a measure that can be used for such a purpose and that is directly linked to CP profiles. It can be seen as an instance-level variable-importance measure alternative to the measures discussed in the Chapters 710.

12.2 Intuition

To assign importance to CP profiles, we can use the concept of profile oscillations. It is worth noting that the larger influence of an explanatory variable on prediction at a particular instance, the larger the fluctuations along with the corresponding CP profile. For a variable that exercises little or no influence on a model’s prediction, the profile will be flat or will barely change. In other words, the values of the CP profile should be close to the value of the model’s prediction for a particular instance. Consequently, the sum of differences between the profile and the value of the prediction, taken across all possible values of the explanatory variable, should be close to zero. The sum can be graphically depicted by the area between the profile and the horizontal line representing the instance prediction. On the other hand, for an explanatory variable with a large influence on the prediction, the area should be large. Figure 12.1 illustrates the concept based on CP profiles introduced in the previous Chapter in Figure 11.5. The larger the highlighted area in Figure 12.1, the more important is the variable for the particular prediction.

The value of the coloured area summarizes the ceteris-paribus (CP) profile oscillations and provides the mean of the absolute deviations between the CP profile and the instance prediction (here it is Henry). Plots are prepared for the titanic_rf random-forest model for the Titanic data.

Figure 12.1: The value of the coloured area summarizes the ceteris-paribus (CP) profile oscillations and provides the mean of the absolute deviations between the CP profile and the instance prediction (here it is Henry). Plots are prepared for the titanic_rf random-forest model for the Titanic data.

12.3 Method

Let us formalize this concept now. Denote by \(g^j(z)\) the probability density function of the distribution of the \(j\)-th explanatory variable. The summary measure of the variable’s importance for the model’s prediction at \(\underline{x}_*\), \(vip_{CP}^{j}(\underline{x}_*)\), is defined by using the variable’s CP profile (11.1) as follows:

\[\begin{equation} vip_{CP}^j(\underline{x}_*) = \int_{\mathcal R} |h^{j}_{\underline{x}_*}(z) - f(\underline{x}_*)| g^j(z)dz=E_{X^j}\left\{|h^{j}_{\underline{x}_*}(X^j) - f(\underline{x}_*)|\right\}. \tag{12.1} \end{equation}\]

Thus, \(vip_{CP}^j(\underline{x}_*)\) is the expected absolute deviation of the CP profile from the model’s prediction at \(\underline{x}_*\) over the distribution \(g^j(z)\) for the \(j\)-th explanatory variable.

The true distribution of \(j\)-th explanatory variable is, in most cases, unknown. Thus, there are several options on how to calculate (12.1).

One is to calculate just the area under the CP curve, i.e., to assume that \(g^j(z)\) is a uniform distribution over the range of variable \(X^j\). It follows that a straightforward estimator of \(vip_{CP}^{j}(\underline{x}_*)\) is

\[\begin{equation} \widehat{vip}_{CP}^{j,uni}(\underline{x}_*) = \frac 1k \sum_{l=1}^k |h^{j}_{x_*}(z_l) - f(\underline{x}_*)|, \tag{12.2} \end{equation}\]

where \(z_l\) (\(l=1, \ldots, k\)) are the selected values of the \(j\)-th explanatory variable. For instance, one can select use all unique values of \(X^{j}\) in the considered dataset. Alternatively, for a continuous variable, one can use an equidistant grid of values.

Another approach is to use the empirical distribution of \(X^{j}\). This leads to the estimator defined as follows:

\[\begin{equation} \widehat{vip}_{CP}^{j,emp}(\underline{x}_*) = \frac 1n \sum_{i=1}^n |h^{j}_{\underline{x}_*}(x^{j}_i) - f(\underline{x}_*)|, \tag{12.3} \end{equation}\]

where index \(i\) runs through all observations in a dataset.

The use of \(\widehat{vip}_{CP}^{j,emp}(\underline{x}_*)\) is preferred when there are enough data to accurately estimate the empirical distribution and when the distribution is not uniform. On the other hand, \(\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)\) is in most cases quicker to compute and, therefore, it is preferred if we look for fast approximations.

Note that, the local evaluation of the variables’ importance can be very different from the global evaluation. This is well illustrated by such a simple example. Consider a model \[ f(x^1, x^2) = x^1 * x^2, \] where varibles \(x^1\) and \(x^2\) take values in \([0,1]\). Consider prediction for an observation described by vector \(\underline{x}_* = (0,1)\). In that case, the importance of \(X^1\) is larger than \(X^2\). This is because the CP profile \(h^1_{x_*}(z) = z\) while \(h^2_{x_*}(z) = 0\). Thus we observe oscillations for the first variable but no oscillations for the second. Globally, both variables are equally important because the model is symmetrical, but local behavior is different.

12.4 Example: Titanic data

Figure 12.2 shows a barplot for the size of oscillations for explanatory variables for the random-forest model titanic_rf (see Section 5.2.2) for Henry, a 47-year-old man who travelled in the first class (see Section 5.2.5). In the left panel the measures were computed by using estimator \(\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)\), given in (12.2), by using an equidistant grid of values. In the right panel the measures were computed by using estimator \(\widehat{vip}_{CP}^{j,emp}(\underline{x}_*)\), given in (12.3), by using an empirical distribution for explanatory variables. The longer the bar, the larger the CP-profile oscillations for a particular explanatory variable.

Both approaches presented in Figure 12.2 agree that the most important variables for prediction for Henry are gender and age, followed by class. There is an interesting difference between these approaches regarding the sibsp variable. From the 5.3 graph we remember that this variable has a very skewed distribution. A significant mass of this variable is concentrated in zero, but few individuals have a high value of this variable. In such situations, the of empirical density will be very different from a uniform distribution.

Note that, while the variable-importance plot in Figure 12.2 does indicate which explanatory variables are important, it does not describe how the variables influence the prediction. In that respect, the CP profile for age for Henry (see Figure 11.5) suggested that, if Henry were older, this would significantly lower his probability of survival. One the other hand, the CP profile for sibsp (see Figure 11.5) indicated that, were Henry not travelling alone, this would increase his chances. Thus, the variable-importance plots should always be accompanied by plots of the relevant CP profiles.

Variable-importance measures based on ceteris-paribus oscillations estimated by using (left panel) a uniform grid of explanatory-variable values, (right panel) empirical distribution of explanatory-variables for the titanic_rf model and passenger Henry for the Titanic data.

Figure 12.2: Variable-importance measures based on ceteris-paribus oscillations estimated by using (left panel) a uniform grid of explanatory-variable values, (right panel) empirical distribution of explanatory-variables for the titanic_rf model and passenger Henry for the Titanic data.

12.5 Pros and cons

Oscillations of CP profiles are easy to interpret and understand. By using the average of oscillations, it is possible to select the most important variables for an instance prediction. This method can easily be extended to two or more variables. In such cases, one needs to integrate the equation (12.2) over a larger number of variables.

There are several issues related to the use of the CP oscillations, though. For example, the oscillations may not be of help in situations when the use of CP profiles may itself be problematic (e.g., in the case of correlated explanatory variables or interactions – see Section 11.5). An important issue is that the CP-based variable-importance measures (12.1) do not fulfil the local accuracy condition (see Section 9.2), i.e., they do not sum up to the instance prediction for which they are calculated, unlike the break-down attributions (Chapter (7.2)) or Shapley values (Chapter (9.1)).

12.6 Code snippets for R

In this section, we present ceteris paribus oscillations as implemented in the DALEX package for R. For illustration, we use the random-forest model titanic_rf (Section 5.2.2). The model was developed to predict the probability of survival after sinking of Titanic. Instance-level explanations are calculated for Henry, a 47-year-old male passenger that travelled in the first class (see Section 5.2.5).

We first retrieve the titanic_rf model-object and the data frame for Henry via the archivist hooks, as listed in Section 5.2.7. We also retrieve the version of the titanic data with imputed missing values.

titanic_imputed <- archivist::aread("pbiecek/models/27e5c")
titanic_rf <- archivist:: aread("pbiecek/models/4e0fc")
(henry <- archivist::aread("pbiecek/models/a6538"))
  class gender age sibsp parch fare  embarked
1   1st   male  47     0     0   25 Cherbourg

Then we construct the explainer for the model by using the function explain() from the DALEX package (see Section 5.2.6). We also load the randomForest package, as the model was fitted by using function randomForest() from this package (see Section 5.2.2) and it is important to have the corresponding predict() function available. The model’s prediction for Henry is obtained with the help of that function.

library("randomForest")
library("DALEX")
explain_rf <- DALEX::explain(model = titanic_rf,  
                          data = titanic_imputed[, -9],
                             y = titanic_imputed$survived == "yes", 
                         label = "Random Forest")
predict(explain_rf, henry)
[1] 0.246

12.6.1 Basic use of the predict_parts() function

To calculate CP-profile oscillations, we will use again the predict_parts() function, but this time with type="oscillations_uni" for the Equation \(\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)\) or type="oscillations_emp" for the Equation \(\widehat{vip}_{CP}^{j,emp}(\underline{x}_*)\). The function was already introduced in Section 7.6.

In particular, we apply it to explainer explain_rf for the random-forest model titanic_rf and the data frame for the instance of interest, i.e., henry. Additionally, we specify the type="oscillations_uni" argument to indicate that we want to compute CP-profile oscillations and the estimated value of the variable-importance measure as defined in Equation (12.2). Note that, one nesd to specify type="oscillations_emp" to get oscillations estimated with empirical density as defined in Equation (12.2). By default, oscillations are calculated for all explanatory variables. Use the variable argument to specify subset of variables.

oscillations_uniform <- predict_parts(explainer = explain_rf, 
                                new_observation = henry, 
                                           type = "oscillations_uni")
oscillations_uniform
##    _vname_ _ids_ oscillations
## 2   gender     1   0.33700000
## 4    sibsp     1   0.16859406
## 3      age     1   0.16744554
## 1    class     1   0.14257143
## 6     fare     1   0.09942574
## 7 embarked     1   0.02400000
## 5    parch     1   0.01031683

The resulting object is of class “ceteris_paribus_oscillations”, which is a data frame with three variables _vname_, _ids_, and oscillations that provide, respectively, the name of the variable, the value of the identifier of the instance, and the estimated value of the variable-importance measure. Additionally, the object has also got an overloaded plot() function. We can use the latter function to plot the estimated values of the variable-importance measure for the instance of interest. In the code below, before creating the plot, we make the identifier for Henry more explicit. The resulting graph is shown in Figure 12.3.

oscillations_uniform$`_ids_` <- "Henry"
plot(oscillations_uniform) +
    ggtitle("Ceteris-paribus Oscillations", 
            "Expectation over uniform distribution (unique values)")
Variable-importance measures based on ceteris-paribus oscillations estimated by the oscillations_uni method of the predict_parts() function for the titanic_rf model and Henry for the Titanic data.

Figure 12.3: Variable-importance measures based on ceteris-paribus oscillations estimated by the oscillations_uni method of the predict_parts() function for the titanic_rf model and Henry for the Titanic data.

12.6.2 Advanced use of the predict_parts() function

As mentioned in the previous section, the predict_parts() function with type = "oscillations_uni" calculates oscilations accoring to the Equation \(\widehat{vip}_{CP}^{j,uni}(\underline{x}_*)\), and with type = "oscillations_emp" accoring to the Equation \(\widehat{vip}_{CP}^{j,emp}(\underline{x}_*)\).

However, other distributions solutions can also be used. For instance, for a continuous explanatory variable, one could apply the same estimator, but using an predefined grid of values. Toward this aim, we can use the variable_splits argument to explicitly specify values for the density estimation. Its application is illustrated in the code below.

oscillations_equidist <- predict_parts(explain_rf, henry, 
              variable_splits = list(age = seq(0, 65, 0.1),
                                    fare = seq(0, 200, 0.1),
                                  gender = unique(titanic_imputed$gender),
                                   class = unique(titanic_imputed$class)), 
                         type = "oscillations")
oscillations_equidist
##   _vname_ _ids_ oscillations
## 3  gender     1    0.3370000
## 1     age     1    0.1677235
## 4   class     1    0.1425714
## 2    fare     1    0.1040790

We can use the plot() function to create a barplot of the estimated values. In the code below, before creating the plot, we make the identifier for Henry more explicit. The resulting graph is shown in Figure 12.2.

oscillations_equidist$`_ids_` <- "Henry"
plot(oscillations_equidist) + 
    ggtitle("Ceteris-paribus Oscillations", 
            "Expectation over specified grid of points")
Variable-importance measures based on ceteris-paribus oscillations estimated by using a specified grid of points for the titanic_rf model and passenger Henry for the Titanic data.

Figure 12.4: Variable-importance measures based on ceteris-paribus oscillations estimated by using a specified grid of points for the titanic_rf model and passenger Henry for the Titanic data.