3.3 Variable response

The explainers presented in this section help to understand the relation between a single variable and the model output.

Subsection 3.3.1 presents Partial Dependence Plots (PDP), one of the most popular methods for exploring the relation between a continuous variable and the model outcome. Subsection 3.3.2 presents Accumulated Local Effects Plots (ALEP), an extension of PDP better suited to highly correlated variables.

Subsection 3.3.3 presents Merging Path Plots, a method for exploring the relation between a categorical variable and the model outcome.

3.3.1 Partial Dependence Plot

Partial Dependence Plots (see the pdp package: Greenwell, Brandon M. 2017. “pdp: An R Package for Constructing Partial Dependence Plots.” The R Journal 9 (1): 421–36. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html) for a black box model \(f(x; \theta)\) show the expected model output as a function of a selected variable \(x_i\), with the expectation taken over the remaining variables \(x_{-i}\).

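\[ p_i(x_i) = E_{x_{-i}}[ f(x_i, x_{-i}; \theta) ]. \]

Of course, this expectation cannot be calculated directly, as we know neither the full distribution of \(x_{-i}\) nor the function \(f()\) itself. Yet it may be estimated by averaging the predictions of the fitted model over the \(n\) observations, with the selected variable fixed at \(x_i\):

\[ \hat p_i(x_i) = \frac{1}{n} \sum_{j=1}^{n} f(x_i, x_{-i}^{(j)}; \hat \theta), \]

where \(x_{-i}^{(j)}\) denotes the values of the remaining variables for the \(j\)-th observation.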
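The estimator can be read as a simple recipe: fix the selected variable at a grid value in every observation, predict, and average. The following is a minimal, language-neutral sketch in Python, not the code of DALEX or pdp. The function `partial_dependence`, the toy model, and the toy data are hypothetical and serve only to illustrate the computation.

```python
from statistics import mean


def partial_dependence(predict, rows, variable, grid):
    """Estimate the partial dependence curve of `variable`.

    For each grid value, the variable is overwritten in every observation
    and the model predictions are averaged over all observations.
    """
    curve = []
    for value in grid:
        modified = [{**row, variable: value} for row in rows]
        curve.append(mean(predict(row) for row in modified))
    return curve


# Hypothetical toy model (assumption): linear in surface,
# quadratic in construction.year.
def toy_model(row):
    return 3000 + 10 * row["surface"] + 0.5 * (row["construction.year"] - 1960) ** 2


# Hypothetical toy observations (assumption), not the apartments data.
toy_rows = [
    {"surface": 40, "construction.year": 1950},
    {"surface": 60, "construction.year": 1975},
    {"surface": 80, "construction.year": 2000},
]

pdp_curve = partial_dependence(
    toy_model, toy_rows, "construction.year", [1940, 1960, 1980]
)
print(pdp_curve)  # [3800.0, 3600.0, 3800.0]
```

Note that the original values of the selected variable are never used: only the values of the remaining variables enter the average, which is what makes the result a function of the grid value alone.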

Let’s see an example for the model apartments_rf_model. Below we use variable_response() from DALEX, which calls the pdp::partial function to calculate the PDP response.

Section 3.2 shows variable importance plots for different models. The variable construction.year is interesting as it is important for the random forest model apartments_rf_model but not for the linear model apartments_lm_model. Let’s see the relation between the variable and the model output.

Figure 3.6: Relation between output from apartments_rf_model and variable construction.year


We can use PDP plots to compare two or more models. Below we plot PDP for the linear model against the random forest model.

Figure 3.7: Relation between output from models apartments_rf_model and apartments_lm_model against the variable construction.year


It looks like the random forest captures the non-linear relation that cannot be captured by linear models.
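Why a linear model can only produce a straight line here follows from the PDP estimator itself. As a sketch, assume for illustration a linear model without interactions, \(f(x; \hat\theta) = \hat\beta_0 + \sum_k \hat\beta_k x_k\), and write \(\bar x_k\) for the sample mean of variable \(k\) (these symbols are introduced here and are not part of the original notation). Averaging the predictions with \(x_i\) held fixed gives

\[ \hat p_i(x_i) = \hat\beta_0 + \hat\beta_i x_i + \sum_{k \neq i} \hat\beta_k \bar x_k, \]

so under this assumption the PDP curve is a line with slope \(\hat\beta_i\), and the remaining variables only shift it up or down.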

3.3.2 Accumulated Local Effects Plot

As demonstrated in section 3.3.1, the Partial Dependence Plot presents the expected model response with respect to the marginal distribution of \(x_{-i}\). In some cases, e.g. when regressors are highly correlated, averaging over the marginal distribution evaluates the model at combinations of values that rarely or never occur in the data, which may lead to biased or poorly extrapolated model responses.

Accumulated Local Effects (ALE) plots (see the ALEPlot package: Apley, Dan. 2017. ALEPlot: Accumulated Local Effects (ALE) Plots and Partial Dependence (PD) Plots. https://CRAN.R-project.org/package=ALEPlot) solve this problem by using the conditional distribution \(x_{-i}|x_i = x_i^*\). This solution leads to more stable and reliable estimates (at least when the predictors are highly correlated).
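One way to picture the conditional approach: each observation is only moved to the two edges of the interval its own value falls into, so the model is never asked about values far from what was observed. The sketch below is a simplified first-order ALE computation in Python under stated assumptions, not the ALEPlot implementation. The function `accumulated_local_effects`, the hand-picked interval boundaries, the toy model, and the toy data are all hypothetical.

```python
from bisect import bisect_left


def accumulated_local_effects(predict, rows, variable, boundaries):
    """Simplified first-order ALE curve, one centered value per boundary.

    Intervals are (boundaries[k], boundaries[k + 1]]; values outside the
    grid are assigned to the first or last interval.
    """
    n_bins = len(boundaries) - 1
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for row in rows:
        # Index of the interval that contains this observation's value.
        k = min(max(bisect_left(boundaries, row[variable]) - 1, 0), n_bins - 1)
        # Local effect: change in prediction across the observation's own interval.
        upper = predict({**row, variable: boundaries[k + 1]})
        lower = predict({**row, variable: boundaries[k]})
        sums[k] += upper - lower
        counts[k] += 1
    # Accumulate the average local effects (empty intervals contribute 0).
    ale = [0.0]
    for total, count in zip(sums, counts):
        ale.append(ale[-1] + (total / count if count else 0.0))
    # Center the curve so that its count-weighted average is zero.
    n = sum(counts)
    center = sum(c * (ale[k] + ale[k + 1]) / 2 for k, c in enumerate(counts)) / n
    return [value - center for value in ale]


# Hypothetical additive toy model (assumption).
def toy_model(row):
    return 2 * row["a"] + 5 * row["b"]


# Hypothetical toy data (assumption) in which a and b are perfectly correlated.
toy_rows = [{"a": 1, "b": 1}, {"a": 2, "b": 2}, {"a": 3, "b": 3}, {"a": 4, "b": 4}]

ale_curve = accumulated_local_effects(toy_model, toy_rows, "a", [0, 2, 4])
print(ale_curve)  # [-4.0, 0.0, 4.0]
```

In this toy case the curve rises by 4 per interval of width 2, which recovers the slope 2 of variable `a` even though `a` and `b` move together in the data. The centering step is also why ALE curves fluctuate around 0 rather than around the average prediction.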

Estimation of the main effect of construction.year is similar to the PDP curves. Here we use the DALEX::single_variable function, which calls the ALEPlot::ALEPlot function to calculate the ALE curve for the variable construction.year.

Figure 3.8: Relation between output from models apartments_rf_model and apartments_lm_model against the variable construction.year calculated with Accumulated local effects.


Results for PDP and ALEP are very similar, except that the effects for ALEP are centered around 0.

3.3.3 Merging Path Plot

The package ICEbox does not work for factor variables, while the pdp package returns plots that are hard to interpret.

An interesting tool that helps to understand what happens with factor variables is the factorMerger package (Sitko, Agnieszka, and Przemyslaw Biecek. 2017. factorMerger: Hierarchical Algorithm for Post-Hoc Testing. https://github.com/MI2DataLab/factorMerger).

Below you may see a Merging Path Plot for the factor variable district.

Figure 3.9: Merging Path Plot for the district variable. The left panel shows the dendrogram for districts; three clusters are clearly visible. The right panel shows the distribution of predictions for each district.


The three clusters are: the city center (Srodmiescie), districts well connected to the city center (Ochota, Mokotow, Zoliborz), and other districts closer to the city boundaries.
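How such clusters can emerge from merging factor levels step by step may be easier to see in a small sketch. The Python code below is a deliberately simplified stand-in, not the factorMerger algorithm: it greedily joins the two groups whose mean predictions are closest, whereas a statistical criterion would normally decide which groups to merge. The function `merge_levels`, the level names, and the prediction values are hypothetical.

```python
from statistics import mean


def merge_levels(predictions_by_level):
    """Greedily merge factor levels with the closest mean predictions.

    Returns the merges in the order they happen, as tuples
    (group_a, group_b, gap_between_means).
    """
    groups = {(level,): list(values) for level, values in predictions_by_level.items()}
    merges = []
    while len(groups) > 1:
        # After sorting by mean, the closest pair is always a pair of neighbours.
        ordered = sorted(groups, key=lambda g: mean(groups[g]))
        gaps = [
            (mean(groups[b]) - mean(groups[a]), a, b)
            for a, b in zip(ordered, ordered[1:])
        ]
        gap, a, b = min(gaps, key=lambda item: item[0])
        groups[a + b] = groups.pop(a) + groups.pop(b)
        merges.append((a, b, gap))
    return merges


# Hypothetical predictions per level (assumption), not model output.
toy_predictions = {
    "center": [5200, 5300],
    "near_1": [4000, 4100],
    "near_2": [4050, 4150],
    "outer": [3000, 3100],
}

merges = merge_levels(toy_predictions)
for group_a, group_b, gap in merges:
    print(group_a, "+", group_b, "gap:", gap)
```

Read from the first merge to the last, the merge order plays the role of the dendrogram: levels joined early with a small gap behave alike, while a large gap at a late merge separates distinct clusters.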

Factor variables are handled very differently by the random forest and the linear model, yet despite these differences both models result in very similar plots.