Explainers presented in this section help to better understand the relation between a variable and a model output.

Subsection 3.3.1 presents Partial Dependence Plots (PDP), one of the most popular methods for exploration of a relation between a continuous variable and a model outcome. Subsection 3.3.2 presents Accumulated Local Effects Plots (ALEP), an extension of PDP more suited for highly correlated variables.

Subsection 3.3.3 presents Merging Path Plots, a method for exploration of a relation between a categorical variable and a model outcome.

Partial Dependence Plots (see the `pdp` package (Greenwell, Brandon M. 2017. “Pdp: An R Package for Constructing Partial Dependence Plots.” *The R Journal* 9 (1): 421–36. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html)) for a black box model \(f(x; \theta)\) show the expected model output as a function of a selected variable \(x_i\), averaged over the remaining variables \(x_{-i}\).

\[ p_i(x_i) = E_{x_{-i}}[ f(x_i, x_{-i}; \theta) ]. \]

This expectation cannot be calculated directly, since we fully know neither the distribution of \(x_{-i}\) nor \(f()\) itself. It can, however, be estimated by

\[ \hat p_i(x_i) = \frac{1}{n} \sum_{j=1}^{n} f(x_i, x_{-i}^{j}; \hat \theta), \]

where \(x_{-i}^{j}\) denotes the values of the remaining variables for the \(j\)-th observation and \(x_i\) is held fixed.
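The estimator can be written out by hand in a few lines of base R. The sketch below is illustrative only: the toy data, the linear model and the helper name `pdp_by_hand` are assumptions made for this example and are not part of `DALEX` or `pdp`.

```r
# Toy data and model (illustrative stand-ins, not the apartments example).
set.seed(1)
n <- 200
toy <- data.frame(x1 = runif(n), x2 = runif(n))
toy$y <- 2 * toy$x1 - toy$x2 + rnorm(n, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = toy)

# Hypothetical helper: for each grid value, fix the selected variable at that
# value in every row, predict, and average over the observed other variables.
pdp_by_hand <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    data_fixed <- data
    data_fixed[[variable]] <- value
    mean(predict(model, newdata = data_fixed))
  })
}

grid <- seq(0, 1, length.out = 11)
pd_x1 <- pdp_by_hand(fit, toy, "x1", grid)
```

For an additive linear model the resulting curve is a straight line whose slope equals the fitted coefficient of `x1`. For a flexible model such as a random forest, the same averaging can reveal a non-linear shape.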

Let’s see an example for the model `apartments_rf_model`. Below we use `variable_response()` from `DALEX`, which calls the `pdp::partial` function to calculate the PDP response.

Section 3.2 shows variable importance plots for different models. The variable `construction.year` is interesting as it is important for the random forest model `apartments_rf_model` but not for the linear model `apartments_lm_model`. Let’s see the relation between the variable and the model output.

We can use PDP plots to compare two or more models. Below we plot PDP for the linear model against the random forest model.

```
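# PDP for the random forest, built the same way as for the linear model
sv_rf <- single_variable(explainer_rf, variable = "construction.year", type = "pdp")
sv_lm <- single_variable(explainer_lm, variable = "construction.year", type = "pdp")
plot(sv_rf, sv_lm)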
```

It looks like the random forest captures the non-linear relation that cannot be captured by linear models.

As demonstrated in section 3.3.1, the Partial Dependence Plot presents the expected model response with respect to the marginal distribution of \(x_{-i}\). In some cases, e.g. when regressors are highly correlated, taking the expectation over the marginal distribution may lead to biased or poorly extrapolated model responses.
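The extrapolation issue can be seen on a small synthetic data set. The sketch below assumes two strongly correlated toy predictors `x1` and `x2` (hypothetical names, not variables from the apartments data) and compares the observed data with the artificial rows that a PDP evaluates when `x1` is fixed at a single value.

```r
set.seed(1)
n <- 200
x1 <- runif(n)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.05))

# Largest gap between the two predictors in the observed data.
observed_gap <- max(abs(toy$x1 - toy$x2))

# Rows evaluated by a PDP at x1 = 0.95: x1 is overwritten, x2 keeps its
# observed values, so many rows pair a large x1 with a small x2.
pdp_rows <- toy
pdp_rows$x1 <- 0.95
pdp_gap <- max(abs(pdp_rows$x1 - pdp_rows$x2))

c(observed = observed_gap, pdp = pdp_gap)
```

The artificial rows contain combinations of `x1` and `x2` far from anything in the training data, so the model is asked to predict in a region where it was never fitted.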

Accumulated Local Effects (ALE) plots (see the `ALEPlot` package (Apley, Dan. 2017. *ALEPlot: Accumulated Local Effects (Ale) Plots and Partial Dependence (Pd) Plots*. https://CRAN.R-project.org/package=ALEPlot)) address this problem by using the conditional distribution \(x_{-i}|x_i = x_i^*\). This leads to more stable and reliable estimates, at least when the predictors are highly correlated.
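The idea can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not the `ALEPlot` implementation: the helper name `ale_by_hand`, the quantile-based bins, the centering rule and the toy data are all choices made for this example.

```r
set.seed(1)
n <- 200
x1 <- runif(n)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(n, sd = 0.05))
toy$y <- toy$x1 + toy$x2 + rnorm(n, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = toy)

# Hypothetical helper: first-order ALE curve for one numeric variable.
ale_by_hand <- function(model, data, variable, n_bins = 10) {
  x <- data[[variable]]
  # Bin edges at quantiles, so each bin holds a similar number of rows.
  breaks <- unique(quantile(x, probs = seq(0, 1, length.out = n_bins + 1)))
  n_int <- length(breaks) - 1
  bin <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)
  # Local effect: mean prediction change across a bin, using only rows in it.
  local_effect <- sapply(seq_len(n_int), function(k) {
    rows <- data[bin == k, , drop = FALSE]
    lower <- rows
    upper <- rows
    lower[[variable]] <- breaks[k]
    upper[[variable]] <- breaks[k + 1]
    mean(predict(model, newdata = upper) - predict(model, newdata = lower))
  })
  accumulated <- c(0, cumsum(local_effect))
  # Center the curve so that its count-weighted average over the bins is zero.
  counts <- tabulate(bin, nbins = n_int)
  midpoints <- (accumulated[-1] + accumulated[-length(accumulated)]) / 2
  centered <- accumulated - sum(midpoints * counts) / sum(counts)
  data.frame(x = unname(breaks), effect = centered)
}

ale_x1 <- ale_by_hand(fit, toy, "x1")
```

With a linear model the ALE curve is again a straight line with slope equal to the coefficient of `x1`. The difference from the PDP is that each prediction difference is computed only for rows whose `x1` already lies in the bin, so the model is evaluated close to the observed data.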

Estimation of the main effects for `construction.year` is similar to the PDP curves. Here we use the `DALEX::single_variable` function, which calls the `ALEPlot::ALEPlot` function to calculate the ALE curve for the variable `construction.year`.

```
sva_rf <- single_variable(explainer_rf, variable = "construction.year", type = "ale")
sva_lm <- single_variable(explainer_lm, variable = "construction.year", type = "ale")
plot(sva_rf, sva_lm)
```

Results for PDP and ALEP are very similar except that effects for ALEP are centered around 0.

The package `ICEbox` does not work for factor variables, while the `pdp` package returns plots that are hard to interpret.

An interesting tool that helps to understand what happens with factor variables is the **factorMerger** package (Sitko, Agnieszka, and Przemyslaw Biecek. 2017. *FactorMerger: Hierarchical Algorithm for Post-Hoc Testing*. https://github.com/MI2DataLab/factorMerger).

Below you may see a Merging Path Plot for the factor variable `district`.

```
svd_rf <- single_variable(explainer_rf, variable = "district", type = "factor")
svd_lm <- single_variable(explainer_lm, variable = "district", type = "factor")
plot(svd_rf, svd_lm)
```

The three clusters are: the city center (Srodmiescie), districts well connected to the city center (Ochota, Mokotow, Zoliborz), and the other districts closer to the city boundaries.

Factor variables are handled very differently by the random forest and the linear model, yet despite these differences both models produce very similar plots.
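The grouping of factor levels into clusters can be illustrated with a simplified stand-in technique. The sketch below does not use `factorMerger` or its testing procedure. It assumes a toy factor `g` with six hypothetical levels and simply applies base R hierarchical clustering (`hclust`) to the average model prediction per level.

```r
set.seed(1)
levels_g <- c("A", "B", "C", "D", "E", "F")
# Assumed true level effects: three well separated pairs of levels.
true_effect <- c(A = 10, B = 10.1, C = 5, D = 5.2, E = 0, F = 0.1)
toy <- data.frame(g = factor(rep(levels_g, each = 50), levels = levels_g))
toy$y <- unname(true_effect[as.character(toy$g)]) + rnorm(nrow(toy), sd = 0.1)
fit <- lm(y ~ g, data = toy)

# Average model prediction for each level of the factor.
level_means <- sapply(split(predict(fit), toy$g), mean)

# Merge levels with similar average predictions and cut the tree into 3 groups.
tree <- hclust(dist(level_means))
clusters <- cutree(tree, k = 3)
```

Levels that end up in the same group play the role of the clusters read off a Merging Path Plot, whereas `factorMerger`, as its title suggests, is built around post-hoc testing rather than plain distances.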