18 Partial dependence profiles

18.1 Introduction

In this chapter we focus on partial dependence (PD) plots, sometimes also called PD profiles. They were introduced by Friedman in a paper devoted to Gradient Boosting Machines (GBM) (Friedman 2000). For many years PD profiles went unnoticed in the shadow of GBM. However, in recent years, the profiles have become very popular and are available in many data-science-oriented software packages such as DALEX, iml (Molnar, Bischl, and Casalicchio 2018), and pdp (Greenwell 2017).

The general idea underlying the construction of PD profiles is to show how the expected value of model prediction behaves as a function of a selected explanatory variable. For a single model, one can construct an overall PD profile by using all observations from a dataset, or several profiles for sub-groups of the observations. Comparison of sub-group-specific PD profiles may provide important insight into, for instance, stability of the model predictions.
PD profiles are also useful for comparisons of different models:

  • Agreement between profiles for different models is reassuring. Some models are more flexible than others. If PD profiles for a more flexible and a less flexible model are similar, we can treat it as evidence that the more flexible model is not over-fitting.
  • Disagreement between profiles may suggest a way to improve a model. If a PD profile of a simpler, more interpretable model disagrees with a profile of a flexible model, this may suggest a variable transformation that can be used to improve the interpretable model. For example, if a random-forest model indicates a non-linear relationship between the dependent variable and an explanatory variable, then a suitable transformation of the explanatory variable may improve the fit or performance of a linear regression model.
  • Evaluation of model performance at boundaries. Models are known to behave differently at the boundaries of dependent variables, i.e., for the largest or the smallest values. For instance, random-forest models are known to shrink predictions towards the average, whereas support-vector machines are known to have larger variance at the edges. Comparison of PD profiles may help to understand the differences in models’ behavior at boundaries.

18.2 Intuition

The general idea underlying the construction of PD profiles is to show how the expected value of model prediction behaves as a function of a selected explanatory variable. Toward this aim, the average of a set of individual ceteris-paribus (CP) profiles is used. Recall that a CP profile (see Chapter 11) shows the dependence of an instance-level prediction on an explanatory variable. A PD profile is estimated by the average of the CP profiles for all instances (observations) from a dataset.

Note that, for additive models, CP profiles are parallel. In particular, they have the same shape. Consequently, the average retains the shape, while offering a more precise estimate. However, for models that, for instance, include interactions, CP profiles may not be parallel. In that case, the average may not necessarily correspond to the shape of any particular profile. Nevertheless, it can still offer a summary of how (in general) the model predictions depend on changes in a given explanatory variable.
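
The parallelism of CP profiles for additive models can be made explicit with a short derivation. It is only a sketch under the assumption of a purely additive model; the component functions \(f_k\) and the constants \(c_i\) are notation introduced here for illustration, and \(x_i^{j|=z}\) denotes observation \(x_i\) with the value of the \(j\)-th explanatory variable set to \(z\).

\[ f(x) = \sum_{k=1}^{p} f_k(x^k) \quad \Longrightarrow \quad f(x_i^{j|=z}) = f_j(z) + \sum_{k \neq j} f_k(x_i^k) = f_j(z) + c_i. \]

Under this assumption, every CP profile is the same curve \(f_j(z)\) shifted vertically by an observation-specific constant \(c_i\), so the average over \(N\) observations, \(f_j(z) + \frac{1}{N}\sum_{i=1}^{N} c_i\), has the same shape as each individual profile.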

The left-hand-side panel of Figure 18.1 presents CP profiles for the explanatory variable age for the random-forest model titanic_rf_v6 (see Section ??) for 25 randomly selected instances (observations) from the Titanic dataset (see Section ??). Note that the profiles are not parallel, indicating non-additive effects of explanatory variables. The right-hand-side panel shows the average of the CP profiles, which offers an estimate of the PD profile. Clearly, the shape of the PD profile does not capture, for instance, the shape of the three CP profiles shown at the top of the panel. Nevertheless, it does seem to reflect the fact that the majority of CP profiles suggest a substantial drop in the predicted probability of survival for the ages between 2 and 18.

Figure 18.1: Ceteris-paribus and partial-dependence profiles for the random-forest model for 25 randomly selected observations from the Titanic dataset. Left: CP profiles for age; blue dots indicate the age and corresponding prediction for the selected observations. Right: CP profiles (grey lines) and the corresponding partial-dependence profile (blue line)

18.3 Method

18.3.1 Partial dependence profiles

The value of a PD profile for model \(f()\) and explanatory variable \(X^j\) at \(z\) is defined as follows:

\[\begin{equation} g_{PD}^{f, j}(z) = E_{X^{-j}}[f(X^{j|=z})]. \tag{18.1} \end{equation}\]

Thus, it is the expected value of the model predictions when \(X^j\) is fixed at \(z\) over the (marginal) distribution of \(X^{-j}\), i.e., over the joint distribution of all explanatory variables other than \(X^j\). Or, in other words, it is the expected value of the CP profile introduced in Equation (11.1) for \(X^j\) over the (marginal) distribution of \(X^{-j}\).

Usually, we do not know the true distribution of \(X^{-j}\). We can estimate it, however, by the empirical distribution of \(N\), say, observations available in a training dataset. This leads to the use of the average of CP profiles for \(X^j\) as an estimator of the PD profile:

\[\begin{equation} \hat g_{PD}^{f, j}(z) = \frac{1}{N} \sum_{i=1}^{N} f(x_i^{j|=z}). \tag{18.2} \end{equation}\]
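
The estimator in Equation (18.2) can be sketched in a few lines of base R. The simulated data, the linear model, and the helper pd_profile() are illustrative assumptions introduced here; they are not part of the chapter's Titanic example.

```r
# Simulated data and a linear model; both are illustrative assumptions.
set.seed(1)
n  <- 200
df <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
df$y <- 2 * df$x1 + df$x2 + rnorm(n, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = df)

# Hypothetical helper implementing the estimator in Equation (18.2).
pd_profile <- function(model, data, variable, grid) {
  sapply(grid, function(z) {
    data_z <- data
    data_z[[variable]] <- z                 # set X^j to z for every observation
    mean(predict(model, newdata = data_z))  # average the N CP-profile values at z
  })
}

grid  <- seq(0, 10, by = 2.5)
pd_x1 <- pd_profile(fit, df, "x1", grid)
```

For a linear model such as this one, the estimated profile is a straight line whose slope equals the fitted coefficient of x1, which offers a simple sanity check of the helper.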

18.3.2 Clustered partial dependence profiles

As already mentioned, the average of CP profiles is a good summary if the profiles are parallel. If they are not parallel, the average may not adequately represent the shape of a subset of profiles. To deal with this issue, one can consider clustering the profiles and calculating the average separately for each cluster. To cluster the CP profiles, one may use standard methods like K-means or hierarchical clustering. The similarities between observations can be calculated based on the Euclidean distance between CP profiles.
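
The clustering idea can be sketched in base R as follows. The simulated data (three groups with different slopes), the linear model with an interaction, and all object names are assumptions made only for illustration; hierarchical clustering with Euclidean distances is one of the options mentioned above.

```r
# Simulated data: the effect of x1 differs across three groups (assumption).
set.seed(2)
n  <- 90
df <- data.frame(x1 = runif(n, 0, 10),
                 g  = factor(rep(c("a", "b", "c"), each = n / 3)))
slope <- unname(c(a = -2, b = 0, c = 2)[as.character(df$g)])
df$y  <- slope * df$x1 + rnorm(n, sd = 0.5)
fit   <- lm(y ~ x1 * g, data = df)

# CP matrix: one row per observation, one column per grid value of x1.
grid <- seq(0, 10, length.out = 11)
cp <- sapply(grid, function(z) {
  data_z <- df
  data_z$x1 <- z
  predict(fit, newdata = data_z)
})

# Hierarchical clustering of CP profiles based on Euclidean distances.
clusters <- cutree(hclust(dist(cp)), k = 3)

# Clustered PD profiles: average of the CP profiles within each cluster.
pd_clustered <- apply(cp, 2, function(col) tapply(col, clusters, mean))
```

Each row of pd_clustered is the average profile of one cluster. In this artificial setting, the clusters coincide with the groups that define the interaction, which will not be the case in general.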

Figure 18.2 illustrates an application of that approach to the random-forest model titanic_rf_v6 (see Section 5.1.3) for 100 randomly selected instances (observations) from the Titanic dataset. The CP profiles for age are marked in grey. It can be noted that they could be split into three clusters based on the hclust method: one for a group of passengers with a substantial drop in the predicted survival probability for ages below 18 (with the average represented by the red line), one with an almost linear decrease of the probability with age (with the average represented by the green line), and one with almost constant predicted probability (with the average represented by the blue line). The plot itself does not allow us to identify the variables that may be linked with these clusters, but additional exploratory analysis could be performed for this purpose.

Figure 18.2: Clustered partial-dependence profiles for the random-forest model for 100 randomly selected observations from the Titanic dataset. Grey lines indicate Ceteris-paribus profiles that are clustered into 3 groups with the average profiles indicated by the blue, green, and red lines.

18.3.3 Grouped partial dependence profiles

It may happen that we can identify an explanatory variable that can influence the shape of CP profiles for the explanatory variable of interest. The most obvious situation is when a model includes an interaction between the variable of interest and another one. In that case, a natural approach is to investigate the PD profiles for the variable of interest corresponding to the groups of observations defined by the variable involved in the interaction. Figure 18.3 illustrates an application of the approach to the random-forest model titanic_rf_v6 (see Section 5.1.3) for 100 randomly selected instances (observations) from the Titanic dataset. The CP profiles for age are marked in grey. The red and blue lines present the PD profiles for females and males, respectively. The two PD profiles have different shapes: the predicted survival probability for females is more stable across different ages, as compared to males. Thus, the PD profiles clearly indicate an interaction between age and gender.

Figure 18.3: Partial-dependence profiles for two genders for the random-forest model for 100 randomly selected observations from the Titanic dataset. Grey lines indicate ceteris-paribus profiles for age.
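
Grouped PD profiles amount to averaging CP profiles separately within the levels of the grouping variable. The base-R sketch below uses simulated data and a linear model with an interaction; the data-generating mechanism and all object names are assumptions made for illustration and do not reproduce the Titanic model.

```r
# Simulated data: the outcome is flat in age for one group and
# decreasing in age for the other (an illustrative assumption).
set.seed(3)
n  <- 100
df <- data.frame(age    = runif(n, 1, 70),
                 gender = factor(rep(c("female", "male"), each = n / 2)))
df$y <- ifelse(df$gender == "female", 0.8, 0.9 - 0.01 * df$age) +
  rnorm(n, sd = 0.05)
fit <- lm(y ~ age * gender, data = df)

# CP matrix: rows are observations, columns are grid values of age.
grid <- seq(0, 70, by = 10)
cp <- sapply(grid, function(z) {
  data_z <- df
  data_z$age <- z
  predict(fit, newdata = data_z)
})

# Grouped PD profiles: average the CP profiles within each gender.
pd_grouped <- apply(cp, 2, function(col) tapply(col, df$gender, mean))
```

The rows of pd_grouped correspond to the two groups. Profiles with clearly different shapes, as in this simulated case, point to an interaction between the variable of interest and the grouping variable.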

18.3.4 Contrastive partial dependence profiles

Comparison of clustered or grouped PD profiles for a single model may provide important insight into, for instance, stability of the model predictions. PD profiles can also be compared between different models.

Figure 18.4 presents PD profiles for age for the random-forest model and the logistic regression model with splines for the Titanic data (see Section 5.1.3). The profiles are similar with respect to a general relation between age and the predicted probability of survival (the younger the passenger, the better chance of survival). However, the profile for the random-forest model is flatter. The difference between both models is the largest at the edges of the age scale. This pattern can be treated as expected, because random-forest models, in general, shrink predictions towards the average and they are not very good for extrapolation outside the range of values observed in the training dataset.

Figure 18.4: Partial-dependence profiles for age for the random-forest (green line) and logistic-regression (blue line) models for the Titanic dataset.
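
A comparison of PD profiles across models can also be sketched in base R. In the example below, a linear model and a more flexible model with a quadratic term are fitted to simulated data with a U-shaped relationship; the data, the two models, and the helper pd_profile() are assumptions introduced only for this illustration.

```r
# Simulated data with a U-shaped effect of x1 (illustrative assumption).
set.seed(4)
n  <- 200
df <- data.frame(x1 = runif(n, 0, 10), x2 = rnorm(n))
df$y <- (df$x1 - 5)^2 + df$x2 + rnorm(n, sd = 0.5)

fit_linear   <- lm(y ~ x1 + x2, data = df)           # simpler model
fit_flexible <- lm(y ~ poly(x1, 2) + x2, data = df)  # more flexible model

# Hypothetical helper: PD profile as the average of CP profiles.
pd_profile <- function(model, data, variable, grid) {
  sapply(grid, function(z) {
    data_z <- data
    data_z[[variable]] <- z
    mean(predict(model, newdata = data_z))
  })
}

grid        <- seq(0, 10, by = 1)
pd_linear   <- pd_profile(fit_linear, df, "x1", grid)
pd_flexible <- pd_profile(fit_flexible, df, "x1", grid)

# Absolute difference between the two profiles at each grid point.
gap <- abs(pd_flexible - pd_linear)
```

In this simulated case the two profiles disagree most at the edges of the range of x1, which suggests that a transformation of x1 would improve the simpler model.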

18.4 Example: Apartments data

In this section, we use PD profiles to evaluate performance of the random-forest model apartments_rf_v5 (see Section 5.2.3) for the Apartments dataset (see Section @ref()). Recall that the goal is to predict the price per square-meter of an apartment. In our illustration we focus on two explanatory variables, surface and construction year.

18.4.1 Partial dependence profiles

Figure 18.5 presents CP profiles (green lines) for 25 randomly-selected apartments together with the estimated PD profile (blue line) for surface and construction year.

The PD profile for surface suggests an approximately linear relationship between the explanatory variable and the predicted price. On the other hand, the PD profile for construction year is U-shaped: the predicted price is the highest for the very new and very old apartments. While the data were simulated, they were generated to reflect the effect of a lower quality of building materials used in housing construction after World War II.

Figure 18.5: Ceteris-paribus and partial-dependence profiles for 100 randomly-selected apartments for the Random forest model for the Apartments dataset.

18.4.2 Clustered partial dependence profiles

All CP profiles for construction year, presented in Figure 18.5, seem to be U-shaped. The same shape is observed for the PD profile. One might want to confirm that the shape is, indeed, common for all the observations. The left-hand-side panel of Figure 18.6 presents clustered PD profiles for construction year for three clusters derived from the CP profiles presented in Figure 18.5. The three PD profiles differ slightly in the size of the oscillations at the edges, but they all are U-shaped. Thus, we could conclude that the overall PD profile adequately captures the shape of the CP profiles. Or, put differently, there is little evidence that there might be any strong interaction between construction year and any other variable in the model. Similar conclusions can be drawn for the CP and PD profiles for surface, presented in the right-hand-side panel of Figure 18.6.

Figure 18.6: Ceteris-paribus (grey lines) and partial-dependence profiles (red, green and blue lines) for three clusters for 100 randomly-selected apartments for the random-forest model for the Apartments dataset. Left: profiles for construction year. Right: profiles for surface.

18.4.3 Grouped partial dependence profiles

One of the categorical explanatory variables in the Apartments dataset is district. We may want to investigate whether the relationship between the model predictions and construction year and surface is similar for all districts. Toward this aim, we can use grouped PD profiles, for groups of apartments defined by districts.

Figure 18.7 shows PD profiles for construction year (left-hand-side panel) and surface (right-hand-side panel) for each district. Several observations are worth making. First, profiles for apartments in "Srodmiescie" (Downtown) are clearly much higher than for other districts. Second, the profiles are roughly parallel, indicating that the effects of construction year and surface are similar in each district. Third, the profiles appear to form three clusters, i.e., "Srodmiescie" (Downtown), three districts close to "Srodmiescie" (namely "Mokotow", "Ochota", and "Ursynow"), and the six remaining districts.

Figure 18.7: Partial-dependence profiles for separate districts for the random-forest model for the Apartments dataset. Left: profiles for construction year. Right: profiles for surface.

18.4.4 Contrastive partial dependence profiles

One of the main challenges in predictive modelling is to avoid over-fitting. The issue is particularly important for flexible models, such as random-forest models.

Figure 18.8 presents PD profiles for construction year (left-hand-side panel) and surface (right-hand-side panel) for the linear regression model (see Section @ref()) and the random-forest model. Several observations are worth making. The linear model cannot, of course, accommodate the non-monotonic relationship between the construction year and the price per square-meter. However, for surface, both models support a linear relationship, though the slope of the line resulting from the linear regression is steeper. This may be seen as an expected difference, given that random-forest models yield predictions that are shrunk towards the mean.

Thus, the profiles in Figure 18.8 suggest that both models miss some aspects of the data. In particular, the linear regression model does not capture the U-shaped relationship between the construction year and the apartment price. On the other hand, the effect of the surface on the apartment price seems to be underestimated by the random-forest model. Hence, one could conclude that, by addressing the issues, one could improve either of the models, possibly with an improvement in predictive performance.

Figure 18.8: Partial-dependence profiles for the linear regression and random-forest models for the Apartments dataset. Left: profiles for construction year. Right: profiles for surface.

18.5 Pros and cons

PD profiles, presented in this chapter, offer a simple way to summarize the effect of a particular explanatory variable on the dependent variable. They are easy to explain and intuitive. They can be obtained for sub-groups of observations and compared across different models. For these reasons, they have gained in popularity and have been implemented in various software packages for R and Python.

Given that the PD profiles are averages of CP profiles, they inherit the limitations of the latter. In particular, as CP profiles are problematic for correlated features, PD profiles are also not suitable for that case. (An approach to deal with this issue will be discussed in the next chapter.) For models including interactions, the averages of CP profiles may offer a crude and potentially misleading summarization.

18.6 Code snippets for R

Here we show partial dependence profiles calculated with the DALEX package, which wraps functions from the ingredients package (Biecek et al. 2019). You will also find similar functions in the pdp (Greenwell 2017), ALEPlot (Apley 2018), and iml (Molnar, Bischl, and Casalicchio 2018) packages.

The easiest way to calculate PD profiles is to use the function DALEX::model_profile(). The only required argument is the model explainer. By default, PD profiles are calculated for all explanatory variables. In the code below we use the variables argument to limit the list of variables for which PD profiles are calculated. We store the computed PD profile in object pdp_rf. Subsequently, we apply the plot() function to the object to generate the plot of the PD profile.

For illustration purposes, we use the random-forest model titanic_rf_v6 (see Section ??) for the Titanic data. Recall that it was developed to predict the probability of surviving the sinking of the Titanic. Here we need the profile only for the age variable.
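
A minimal sketch of these steps is given below. It assumes that the DALEX and randomForest packages are installed, and it refits a small random forest on the titanic_imputed data shipped with DALEX as a stand-in for titanic_rf_v6, so the resulting profile need not match Figure 18.9 exactly. The object names titanic, titanic_rf, and explain_titanic_rf, the seed, and the number of trees are choices made here for illustration.

```r
library("DALEX")         # explain(), model_profile(), titanic_imputed data
library("randomForest")

# Assumption: a freshly fitted forest stands in for titanic_rf_v6.
set.seed(1313)
titanic <- titanic_imputed
titanic$survived <- factor(titanic$survived)
titanic_rf <- randomForest(survived ~ gender + age + class + embarked +
                             fare + sibsp + parch,
                           data = titanic, ntree = 100)

# Model explainer: the only required argument of model_profile().
explain_titanic_rf <- explain(
  model   = titanic_rf,
  data    = titanic_imputed[, colnames(titanic_imputed) != "survived"],
  y       = titanic_imputed$survived,
  label   = "Random Forest",
  verbose = FALSE)

# PD profile for age only, followed by the plot.
pdp_rf <- model_profile(explainer = explain_titanic_rf, variables = "age")
plot(pdp_rf)
```

The averaged profile is stored in pdp_rf$agr_profiles, which can be inspected directly if the numerical values are of interest.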

Figure 18.9: Partial dependence profile for age.

PD profiles can be plotted on top of CP profiles. This is a very useful feature if we want to learn how similar the CP profiles are to their average. Toward this aim, we first have to compute and store the CP profiles with the help of the model_profile() function. The argument N sets the number of randomly-selected instances used for the calculation of partial dependence. By default, it is equal to 100.

The argument geom = "profiles" in the plot() function results in the partial-dependence profile being plotted on top of the ceteris-paribus profiles. In the example below we only select the profiles for age.
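
The following sketch illustrates the use of the N and geom arguments. As before, it assumes that DALEX and randomForest are installed and refits a small random forest on titanic_imputed as a stand-in for titanic_rf_v6; the object names and the seed are illustrative choices.

```r
library("DALEX")
library("randomForest")

# Assumption: a freshly fitted forest stands in for titanic_rf_v6.
set.seed(1313)
titanic <- titanic_imputed
titanic$survived <- factor(titanic$survived)
titanic_rf <- randomForest(survived ~ gender + age + class + embarked +
                             fare + sibsp + parch,
                           data = titanic, ntree = 100)
explain_titanic_rf <- explain(
  model   = titanic_rf,
  data    = titanic_imputed[, colnames(titanic_imputed) != "survived"],
  y       = titanic_imputed$survived,
  label   = "Random Forest",
  verbose = FALSE)

# CP profiles for N = 100 randomly selected observations and their average.
pdp_rf <- model_profile(explainer = explain_titanic_rf,
                        variables = "age", N = 100)

# PD profile drawn on top of the individual CP profiles.
plot(pdp_rf, geom = "profiles")
```

The individual CP profiles are kept in pdp_rf$cp_profiles, while the average is stored in pdp_rf$agr_profiles.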

Figure 18.10: Ceteris-paribus and partial-dependence profiles for age.

18.6.1 Clustered partial dependence profiles

To calculate clustered PD profiles, we first have to calculate and store the CP profiles and then apply hclust clustering to the profiles. This can be done with the model_profile() function. The number of clusters is specified with the help of argument k. Additional arguments of the function include center (a logical argument indicating if the profiles should be centered before calculation of distances between them) and variables (a list with the names of the explanatory variables for which the profiles are to be clustered, with the default value NULL indicating all the available variables).

The clustered PD profiles can be plotted on top of the CP profiles by setting the geom = "profiles" argument to the plot() function. Note that in the R code below we perform the calculations only for a randomly-selected set of 100 observations from the titanic data frame. Also, we only select the plots for the profiles for age.
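
A sketch of the corresponding call is shown below, under the same assumptions as before: DALEX and randomForest are installed, and a small random forest refitted on titanic_imputed stands in for titanic_rf_v6. The object name pdp_rf_clust and the choice k = 3 are made here for illustration.

```r
library("DALEX")
library("randomForest")

# Assumption: a freshly fitted forest stands in for titanic_rf_v6.
set.seed(1313)
titanic <- titanic_imputed
titanic$survived <- factor(titanic$survived)
titanic_rf <- randomForest(survived ~ gender + age + class + embarked +
                             fare + sibsp + parch,
                           data = titanic, ntree = 100)
explain_titanic_rf <- explain(
  model   = titanic_rf,
  data    = titanic_imputed[, colnames(titanic_imputed) != "survived"],
  y       = titanic_imputed$survived,
  label   = "Random Forest",
  verbose = FALSE)

# CP profiles for 100 observations, clustered into k = 3 groups.
pdp_rf_clust <- model_profile(explainer = explain_titanic_rf,
                              variables = "age", N = 100, k = 3)

# Cluster-specific PD profiles on top of the CP profiles.
plot(pdp_rf_clust, geom = "profiles")
```

Because the observations are sampled at random and the model is refitted, the clusters obtained in this way may differ from those shown in Figure 18.11.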

Figure 18.11: Clustered Partial dependence profiles.

18.6.2 Grouped partial dependence profiles

The model_profile() function admits the groups argument. If the argument is set to the name of a categorical explanatory variable, PD profiles are constructed for the groups of observations defined by the levels of the variable. In the example below, the argument is applied to obtain PD profiles for age grouped by gender; the result is stored in object pdp_sex_rf. Subsequently, the profiles are plotted on top of the CP profiles for 100 randomly-selected observations from the titanic data frame.
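
The call can be sketched as follows, again assuming that DALEX and randomForest are installed and using a small random forest refitted on titanic_imputed as a stand-in for titanic_rf_v6. The explainer name explain_titanic_rf and the seed are illustrative choices.

```r
library("DALEX")
library("randomForest")

# Assumption: a freshly fitted forest stands in for titanic_rf_v6.
set.seed(1313)
titanic <- titanic_imputed
titanic$survived <- factor(titanic$survived)
titanic_rf <- randomForest(survived ~ gender + age + class + embarked +
                             fare + sibsp + parch,
                           data = titanic, ntree = 100)
explain_titanic_rf <- explain(
  model   = titanic_rf,
  data    = titanic_imputed[, colnames(titanic_imputed) != "survived"],
  y       = titanic_imputed$survived,
  label   = "Random Forest",
  verbose = FALSE)

# PD profiles for age, computed separately for each level of gender.
pdp_sex_rf <- model_profile(explainer = explain_titanic_rf,
                            variables = "age", N = 100, groups = "gender")

# Group-specific PD profiles on top of the CP profiles.
plot(pdp_sex_rf, geom = "profiles")
```

One aggregated profile per level of gender is stored in pdp_sex_rf$agr_profiles.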

Figure 18.12: Grouped Partial dependence profiles.

18.6.3 Contrastive partial dependence profiles

To overlay PD profiles for two or more models in a single plot, one can use the generic plot() function. In the code below, we create PD profiles for age for the random-forest (see Section ??) and logistic regression (see Section ??) models, stored in the explainer-objects explain_titanic_rf and explain_titanic_lmr, respectively. Subsequently, we apply the plot() function to plot the two PD profiles together in a single plot.
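
A sketch of such a comparison is given below. It assumes that DALEX, randomForest, and the base-R splines package are available. Both models are refitted here on titanic_imputed: a small random forest and a plain glm() logistic regression with a natural spline for age. They are stand-ins for the models described in the referenced sections, not the original fits, and the object names other than the two explainers are illustrative.

```r
library("DALEX")
library("randomForest")
library("splines")

predictors <- titanic_imputed[, colnames(titanic_imputed) != "survived"]

# Assumption: refitted stand-ins for the two Titanic models.
set.seed(1313)
titanic <- titanic_imputed
titanic$survived <- factor(titanic$survived)
titanic_rf <- randomForest(survived ~ gender + age + class + embarked +
                             fare + sibsp + parch,
                           data = titanic, ntree = 100)
titanic_lmr <- glm(survived ~ gender + ns(age, df = 4) + class + sibsp,
                   data = titanic_imputed, family = binomial)

explain_titanic_rf <- explain(titanic_rf, data = predictors,
                              y = titanic_imputed$survived,
                              label = "random forest", verbose = FALSE)
explain_titanic_lmr <- explain(titanic_lmr, data = predictors,
                               y = titanic_imputed$survived,
                               label = "logistic regression", verbose = FALSE)

# PD profiles for age for both models.
pdp_rf  <- model_profile(explainer = explain_titanic_rf,  variables = "age")
pdp_lmr <- model_profile(explainer = explain_titanic_lmr, variables = "age")

# Both profiles in a single plot.
plot(pdp_rf, pdp_lmr)
```

The labels given to the explainers are used to distinguish the two profiles in the plot.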

Figure 18.13: Contrastive Partial dependence profiles.

References

Apley, Dan. 2018. ALEPlot: Accumulated Local Effects (ALE) Plots and Partial Dependence (PD) Plots. https://CRAN.R-project.org/package=ALEPlot.

Biecek, Przemyslaw, Hubert Baniecki, Adam Izdebski, and Katarzyna Pekala. 2019. ingredients: Effects and Importances of Model Ingredients.

Friedman, Jerome H. 2000. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics 29: 1189–1232.

Greenwell, Brandon M. 2017. “pdp: An R Package for Constructing Partial Dependence Plots.” The R Journal 9 (1): 421–36. https://journal.r-project.org/archive/2017/RJ-2017-016/index.html.

Molnar, Christoph, Bernd Bischl, and Giuseppe Casalicchio. 2018. “iml: An R package for Interpretable Machine Learning.” Journal of Open Source Software 3 (26): 786. https://doi.org/10.21105/joss.00786.