5.1 Ceteris Paribus profiles for a single observation

Let \(f_{M}(x): \mathcal R^{d} \rightarrow \mathcal R\) denote a predictive model, i.e. a function that takes a \(d\)-dimensional vector and calculates a numerical score. In sections in which we work with several models, we use the subscript \(M\) to index models. To simplify notation, this subscript is omitted when profiles for only one model are considered.

The symbol \(x \in \mathcal R^d\) refers to a point in the feature space. We use the subscript \(x_i\) to refer to different data points and the superscript \(x^j\) to refer to specific dimensions. Additionally, let \(x^{-j}\) denote all coordinates except the \(j\)-th, and let \(x|^j=z\) denote a data point \(x^*\) with all coordinates equal to those of \(x\) except for coordinate \(j\), which is equal to \(z\). That is, \(\forall_{i \neq {j}} x^{*,i} = x^{i}\) and \(x^{*,j} = z\). In other words, \(x|^j=z\) denotes \(x\) with the \(j\)-th coordinate changed to \(z\).

Now we can define the Ceteris Paribus profile for model \(f\), variable \(j\) and point \(x\) as

\[ CP^{f, j, x}(z) := f(x|^j = z). \]

That is, the CP profile is the model response for observations created from \(x\) by changing the \(j\)-th coordinate while keeping all other coordinates unchanged.
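The definition translates almost directly into code. The following is a minimal base R sketch and not the implementation used by any package. The toy model `f`, the helper `cp_profile()` and the point `x` are hypothetical names introduced only for illustration.

```r
# Hypothetical toy model f: R^2 -> R (an assumption, for illustration only).
f <- function(x) x[1]^2 + 3 * x[2]

# CP^{f, j, x}(z) = f(x|^j = z): replace the j-th coordinate of x with z,
# keep all other coordinates unchanged, and evaluate the model.
cp_profile <- function(f, x, j, z_values) {
  sapply(z_values, function(z) {
    x_star <- x        # copy of the observation of interest
    x_star[j] <- z     # change only coordinate j
    f(x_star)
  })
}

x <- c(2, 1)           # observation of interest
z <- c(-1, 0, 1, 2)    # candidate values for coordinate j = 1
profile <- cp_profile(f, x, j = 1, z_values = z)
```

Because the last candidate value equals the original first coordinate of `x`, the last element of `profile` coincides with the model response `f(x)` for the unchanged observation.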

A convenient alternative name for these plots is What-If plots: CP profiles show what would happen if only a single variable were changed.

Figure 5.1 shows an example of a Ceteris Paribus profile. The black dot stands for the prediction for a single observation. The grey line shows how the model response would change if, in this single observation, the construction.year coordinate were changed to a selected value. From this profile one may read that the model response is non-monotonic. If construction.year for this observation were below 1935, the model response would be higher, but if it were between 1935 and 1995, the model response would be lower.

Figure 5.1: Ceteris Paribus profiles for a single observation


5.1.0.1 How to do this in R?

First, we need to specify an observation. Let’s assume that we have a new apartment with the following attributes.

##   construction.year surface floor no.rooms district
## 1              2000     100     1        4   Bemowo

We are interested in the price predicted for this apartment by the random forest model model_rf.

##        1 
## 3490.541

We also know that the variable construction.year is used in the model. So how would the model response change for different values of the construction.year attribute?

Based on this observation we create \(N\) virtual apartments with construction.year spanning the range from 1920 to 2010. New values for this attribute are selected from the empirical distribution of the apartments$construction.year variable. By default \(N = 101\), so percentiles of that distribution are used as the new values of construction.year.
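The construction of the virtual apartments can be sketched in base R. This is only an illustration under stated assumptions: the made-up vector `years` stands in for apartments$construction.year, and the use of `quantile()` with `type = 1` is our choice for picking percentiles from the empirical distribution, not necessarily what the package does internally.

```r
# Made-up stand-in for apartments$construction.year (an assumption).
set.seed(1)
years <- sample(1920:2010, size = 500, replace = TRUE)

# The observation of interest, with the attributes given in the text.
new_apartment <- data.frame(construction.year = 2000, surface = 100,
                            floor = 1, no.rooms = 4, district = "Bemowo")

# N = 101 grid points: the 0%, 1%, ..., 100% empirical percentiles.
# type = 1 returns values that actually occur in the sample.
N <- 101
grid <- quantile(years, probs = seq(0, 1, length.out = N),
                 type = 1, names = FALSE)

# Copy the observation N times and overwrite only construction.year.
virtual <- new_apartment[rep(1, N), ]
virtual$construction.year <- grid
```

Since percentiles of a discrete variable may repeat, the grid can contain the same year more than once, so some virtual apartments may be identical.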

## Top profiles    : 
##     construction.year surface floor no.rooms district   _yhat_
## 1                1920     100     1        4   Bemowo 3693.729
## 1.1              1921     100     1        4   Bemowo 3733.894
## 1.2              1922     100     1        4   Bemowo 3748.508
## 1.3              1923     100     1        4   Bemowo 3756.184
## 1.4              1923     100     1        4   Bemowo 3756.184
## 1.5              1924     100     1        4   Bemowo 3670.971
##               _vname_ _ids_      _label_
## 1   construction.year     1 randomForest
## 1.1 construction.year     1 randomForest
## 1.2 construction.year     1 randomForest
## 1.3 construction.year     1 randomForest
## 1.4 construction.year     1 randomForest
## 1.5 construction.year     1 randomForest
## 
## 
## Top observations:
##   construction.year surface floor no.rooms district   _yhat_      _label_
## 1              2000     100     1        4   Bemowo 3490.541 randomForest

Also note that the apartments data is available in the model explainer, which is passed as the first parameter of the ceteris_paribus() function.

These artificial apartments constitute a profile of the conditional model response for different values of construction.year. Such profiles may be plotted with the generic plot() function. Note that the ceteris_paribus() function by default calculates profiles for every variable in the dataset (this can be changed with the variables parameter).

We use the selected_variables parameter of the plot() function to limit the plot to a single variable, construction.year.

5.1.0.2 Ceteris Paribus for many variables

There is no reason to limit our perspective to only one variable. The plot() function by default plots profiles for all numerical variables.

Use the selected_variables parameter to limit the number of variables to be presented.
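The idea of computing one profile per variable and then keeping only a subset can be sketched in base R. Everything below is hypothetical and for illustration only: `toy_predict()` is a made-up scoring function rather than model_rf, the grids are arbitrary, and the final `subset()` call merely mimics the effect of selecting variables. It is not the package's plotting code.

```r
# Hypothetical toy scoring function on a data frame (an assumption).
toy_predict <- function(df) {
  4000 - 5 * abs(df$construction.year - 1965) - 2 * df$surface - 20 * df$floor
}

new_apartment <- data.frame(construction.year = 2000, surface = 100,
                            floor = 1, no.rooms = 4, district = "Bemowo")

# Hypothetical grids of candidate values, one per variable to be profiled.
grids <- list(
  construction.year = seq(1920, 2010, by = 10),
  surface           = seq(20, 150, by = 10),
  floor             = 1:10
)

# One profile per variable: vary that variable alone, keep the rest fixed.
profiles <- lapply(names(grids), function(v) {
  virtual <- new_apartment[rep(1, length(grids[[v]])), ]
  virtual[[v]] <- grids[[v]]
  data.frame(vname = v, value = grids[[v]], yhat = toy_predict(virtual))
})
profiles <- do.call(rbind, profiles)

# Keep only the profiles for a chosen subset of variables.
selected <- subset(profiles, vname %in% c("surface", "floor"))
```

Stacking the profiles in one long data frame with a column naming the varied variable makes it straightforward to filter them or to draw one panel per variable.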