
Let \(f_{M}(x): \mathcal R^{d} \rightarrow \mathcal R\) denote a predictive model, i.e. a function that takes a \(d\)-dimensional vector and calculates a numerical score. In sections where we work with several models, we use the subscript \(M\) to index them. To simplify notation, this subscript is omitted when profiles for only one model are considered.

The symbol \(x \in \mathcal R^d\) refers to a point in the feature space. We use the subscript \(x_i\) to refer to different data points and the superscript \(x^j\) to refer to specific dimensions. Additionally, let \(x^{-j}\) denote all coordinates except the \(j\)-th, and let \(x|^j=z\) denote a data point \(x^*\) with all coordinates equal to those of \(x\) except coordinate \(j\), which is equal to \(z\). I.e. \(\forall_{i \neq {j}} x^i = x^{*,i}\) and \(x^{*,j} = z\). In other words, \(x|^j=z\) denotes \(x\) with its \(j\)-th coordinate changed to \(z\).

Now we can define the Ceteris Paribus profile for model \(f\), variable \(j\) and point \(x\) as

\[ CP^{f, j, x}(z) := f(x|^j = z). \]

That is, a CP profile is the model response obtained for observations created from \(x\) by changing the \(j\)-th coordinate while keeping all other coordinates unchanged.
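The definition can be illustrated with a minimal base R sketch. It is not the implementation used by any package: the helper name `cp_profile()`, the toy linear model and the built-in `mtcars` data are assumptions made only for illustration.

```r
# Toy model standing in for f: a linear regression on the built-in mtcars data.
# (Assumption for illustration; any model with a predict() method would do.)
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Hypothetical helper: evaluates CP^{f, j, x}(z) = f(x|^j = z) on a grid of z.
cp_profile <- function(model, x, j, grid) {
  # Repeat the single observation x once per grid value ...
  profile <- x[rep(1, length(grid)), , drop = FALSE]
  # ... and overwrite only coordinate j; all other coordinates stay unchanged.
  profile[[j]] <- grid
  profile[["_yhat_"]] <- as.numeric(predict(model, newdata = profile))
  rownames(profile) <- NULL
  profile
}

x <- mtcars[1, c("wt", "hp", "qsec")]   # the observation of interest
grid <- seq(min(mtcars$wt), max(mtcars$wt), length.out = 5)
cp_wt <- cp_profile(model, x, "wt", grid)
cp_wt
```

Each row of `cp_wt` is a copy of `x` in which only `wt` differs, and the `_yhat_` column holds the corresponding model response. Evaluating the profile at the original value `x$wt` reproduces the prediction for `x` itself.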

A convenient alternative name for this plot is the What-If plot: CP profiles show what would happen to the model response if only a single variable were changed.

Figure 5.1 shows an example of a Ceteris Paribus profile. The black dot stands for the prediction for a single observation. The grey line shows how the model response would change if the coordinate `surface` of this single observation were changed to a selected value. From this profile one may read that the model response is non-monotonic. If `construction.year` for this observation were below 1935, the model response would be higher, but if the construction year were between 1935 and 1995, the model response would be lower.

First, we need to specify an observation. Let's assume that we have a new apartment with the following attributes.

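```
library("DALEX")  # provides the apartments dataset

# Reuse the factor levels of the training data so that the new observation
# is encoded consistently with the data the model was fitted on
aplevels <- levels(apartments$district)
new_apartment <- data.frame(construction.year = 2000,
                            surface = 100,
                            floor = 1L,
                            no.rooms = 4,
                            district = factor("Bemowo", levels = aplevels))
new_apartment
```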

```
## construction.year surface floor no.rooms district
## 1 2000 100 1 4 Bemowo
```

We are interested in the price predicted for this apartment by the random forest model `model_rf`.

```
## 1
## 3490.541
```

We also know that the variable `construction.year` is used in the model. So how would the model response change for different values of the `construction.year` attribute?

Based on this observation we create \(N\) virtual apartments with `construction.year` spanning the range from 1920 to 2010. New values for this attribute are selected from the empirical distribution of the `apartments$construction.year` variable. By default \(N = 101\), so percentiles are used as the new values of `construction.year`.

```
## Top profiles :
## construction.year surface floor no.rooms district _yhat_
## 1 1920 100 1 4 Bemowo 3693.729
## 1.1 1921 100 1 4 Bemowo 3733.894
## 1.2 1922 100 1 4 Bemowo 3748.508
## 1.3 1923 100 1 4 Bemowo 3756.184
## 1.4 1923 100 1 4 Bemowo 3756.184
## 1.5 1924 100 1 4 Bemowo 3670.971
## _vname_ _ids_ _label_
## 1 construction.year 1 randomForest
## 1.1 construction.year 1 randomForest
## 1.2 construction.year 1 randomForest
## 1.3 construction.year 1 randomForest
## 1.4 construction.year 1 randomForest
## 1.5 construction.year 1 randomForest
##
##
## Top observations:
## construction.year surface floor no.rooms district _yhat_ _label_
## 1 2000 100 1 4 Bemowo 3490.541 randomForest
```
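The percentile-based choice of grid values described above can be sketched in base R. The simulated `construction_year` vector is a stand-in for `apartments$construction.year` (an assumption, so that the snippet runs without extra packages), and the quantile rule `type = 1` is likewise an assumed choice; the package may select values differently.

```r
set.seed(1)
# Simulated stand-in for apartments$construction.year (assumption).
construction_year <- sample(1920:2010, size = 1000, replace = TRUE)

N <- 101
# Percentiles 0%, 1%, ..., 100% of the empirical distribution.
# type = 1 returns only values that actually occur in the data (assumed choice).
grid <- quantile(construction_year, probs = seq(0, 1, length.out = N),
                 type = 1, names = FALSE)

length(grid)   # one value per percentile
range(grid)    # spans the observed minimum and maximum
head(grid)     # neighbouring percentiles may coincide
```

Because the grid follows the empirical distribution, neighbouring percentiles can fall on the same value. This is consistent with the year 1923 appearing twice in the output above.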

Also note that the `apartments` data is available in the model explainer specified as the first parameter of the `ceteris_paribus()` function.

These artificial apartments constitute a profile of the conditional model response for different values of `construction.year`. Such profiles may be plotted with the generic `plot()` function. Note that the `ceteris_paribus()` function by default calculates profiles for every variable in the dataset (this can be changed with the `variables` parameter).
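The idea of computing one profile per variable, with an option to restrict the set, can be sketched in base R. This is a hypothetical helper, not the package's code: the name `cp_profiles()`, its `grid_points` argument, the toy linear model and the `mtcars` data are assumptions, and only numeric variables are handled.

```r
# Toy model and observation (assumptions for illustration only).
model <- lm(mpg ~ wt + hp + qsec, data = mtcars)
x <- mtcars[1, c("wt", "hp", "qsec")]

# Hypothetical helper: one CP profile per requested variable, stacked in long
# format with a `_vname_` column naming the variable that was varied.
cp_profiles <- function(model, data, x, variables = names(x), grid_points = 101) {
  probs <- seq(0, 1, length.out = grid_points)
  pieces <- lapply(variables, function(j) {
    # Grid values taken from the empirical distribution of variable j
    grid <- quantile(data[[j]], probs = probs, type = 1, names = FALSE)
    profile <- x[rep(1, length(grid)), , drop = FALSE]
    profile[[j]] <- grid                       # vary only coordinate j
    profile[["_yhat_"]] <- as.numeric(predict(model, newdata = profile))
    profile[["_vname_"]] <- j
    rownames(profile) <- NULL
    profile
  })
  do.call(rbind, pieces)
}

all_vars <- cp_profiles(model, mtcars, x)                    # every variable
only_wt  <- cp_profiles(model, mtcars, x, variables = "wt")  # restricted set
table(all_vars[["_vname_"]])
```

Keeping all profiles in one long data frame makes it straightforward to filter by `_vname_` later, for example when only some variables should be plotted.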

We use the `selected_variables` parameter of the `plot()` function to limit the plot to a single variable, `construction.year`.

There is no reason to limit our perspective to only one variable. By default, the `plot()` function plots profiles for all numerical variables.

Use the `selected_variables` parameter to limit the number of variables presented.