# 20 Residual Diagnostics

## 20.1 Introduction

In this chapter, we present methods that are useful for detailed examination of both overall and instance-specific model performance. In particular, we focus on graphical methods that use residuals. The methods may be used for several purposes:

• In the first part of the book, we discussed tools for single-instance examination. Residuals can be used to identify potentially problematic instances. The single-instance explainers can then be used in the problematic cases to understand, for instance, which factors contribute most to the errors in prediction.

• For most models, residuals should express a random behavior with certain properties (like, e.g., being concentrated around 0). If we find any systematic deviations from the expected behavior, they may signal an issue with a model (like, e.g., an omitted explanatory variable or wrong functional form of a variable included in the model).

• In Chapter 16 we discussed measures to evaluate the overall performance of a predictive model. Sometimes, however, we may be more interested in cases with the largest errors of prediction, which can be identified with the help of residuals.

Residual diagnostics is a classical topic related to statistical modeling. Literature on the topic is vast – essentially every book on statistical modeling includes some discussion about residuals. Thus, in this chapter, we are not aiming at being exhaustive. Rather, our goal is to present selected concepts that underlie the use of residuals.

## 20.2 Intuition

As we mentioned in Section 2.5, we primarily focus on models describing the expected value of the dependent as a function of explanatory variables. In such case, for a perfect predictive model, the predicted value of the dependent variable should be exactly equal to the actual value of the variable for every observation. Perfect prediction is rarely, if ever, expected. In practice, we want the predictions to be reasonably close to the actual values. This suggests that we can use the difference between the predicted and the actual value of the dependent variable to quantify the quality of predictions obtained from a model. The difference is called a residual.

For a single observation, residual will almost always be different from zero. While a large (absolute) value of a residual may indicate a problem with a prediction for a particular observation, it does not mean that the quality of predictions obtained from a model is unsatisfactory in general. To evaluate the quality, we should investigate the ‘’behavior’’ of residuals for a group of observations. In other words, we should look at the distribution of the values of residuals.

For a ‘’good’’ model, residuals should deviate from zero randomly, i.e., not systematically. Thus, their distribution should be symmetric around zero, implying that their mean (or median) value should be zero. Also, residuals should be close to zero themselves, i.e., they should show low variability.

Usually, to verify these properties, graphical methods are used. For instance, a histogram of can be used to check the symmetry and location of the distribution of the residuals. For linear regression models, a plot of residuals against a continuous covariate can be checked for absence of any patterns that would suggest any systematic error in the predictions obtained for a specific range of the values of the covariate.

Note that a model may imply a concrete distribution for residuals. For instance, in the case of the classical linear regression model, standardized residuals should be normally distributed with mean zero and a constant variance. In such a case, a the distributional assumption can be verified by using a suitable graphical method like, for instance, a quantile-quantile plot. If the assumption is found to be violated, one might want to be careful regarding to predictions obtained from the model.

## 20.3 Method

As it was already mentioned in Chapter 2, for a continuous dependent variable (or a count), residual $$r_i$$ for the $$i$$-th observation in a dataset is the difference between the model prediction and the corresponding value of the variable:

$$$r_i = y_i - f(x_i). \tag{20.1}$$$

A histogram of the estimated residuals can be used to check the symmetry and location of their distribution. An index plot of residuals, i.e., the plot of residuals against the corresponding observation number, may be used to identify observations with large residuals.

For diagnostic purposes the standardized residuals are defined as

$$$\tilde{r}_i = \frac{r_i}{\sqrt{\mbox{Var}(r_i)}}, \tag{20.2}$$$

where $$\mbox{Var}(r_i)$$ is the variance of the residual $$r_i$$. Of course, in practice, the variance of $$r_i$$ is usually unknown. Hence, in Equation (20.2), the estimated value of the variance is used. Residuals defined in this way are often called the Pearson residuals.

For a gaussian linear model the $$\mbox{Var}(r_i)$$ can be estimated from design matrix and the distribution of $$\tilde{r}_i$$ is approximately standard normal. In general case, for complicated models, it is hard to estimate the variance $$\mbox{Var}(r_i)$$ for a single instance so it is often approximated by a constant.

Definition (20.2) can also be applied to a binary dependent variable if the model prediction $$f(x_i)$$ is the probability observing $$y_i$$ and upon coding the two possible values of the variable as 0 and 1. However, in this case, the range of possible values of $$r_i$$ is restricted to $$(-1,1)$$, which limits the usefulness of the residuals. For this reason, more often the Pearson residuals are used. Note that, if the values of the explanatory-variable vectors $$x_i$$ lead to different predicted values $$f(x_i)$$ for different observations in a dataset, the distribution of the Pearson residuals will not be approximated by the standard normal one. Nevertheless, the index plot may still be useful to detect observations with large residuals. The standard-normal approximation is more likely to apply in the situation when vectors $$x_i$$ split data into (a few, perhaps) groups sharing the same predicted value $$f(x_i)$$. In that case, one one can consider averaging residuals $$r_i$$ per group and standardizing them by $$\sqrt{f(x_i)\{1-f(x_i)/k\}}$$, where $$k$$ is the number of observations in a particular group.

For categorical data, residuals can only be defined in terms of residuals of the binary variables indicating the category observed for the $$i$$-th observation.

## 20.4 Example: Apartments data

In this section, we use the linear-regression model apartments_lm_v5 (Section 5.2.2) and the random-forest model apartments_rf_v5 (Section 5.2.3) for the apartment-prices data (Section 5.2). Recall that the dependent variable of interest, the price per square-meter, is continuous. Thus, we can use residuals $$r_i$$, as defined in equation (20.1). It is worth noting that, as it was mentioned in Section 16.4.1, RMSE for both models is very similar. Thus, overall, the two models could be seen as performing similarly.

Figures 20.2 and 20.3 summarize the distribution of residuals for both models. In particular, Figure 20.2 presents histogram of residuals, while Figure 20.3 shows box-and-whisker plots for the absolute value of the residuals.

Despite the similar value of RMSE, the distribution of residuals for both models is different. In particular, Figure 20.2 indicates that the distribution for the linear-regression model is, in fact, split into two separate, normal-like parts, which may suggest omission of a binary explanatory variable in the model. The two components are located around the values of about -200 and 400. As mentioned in the previous chapters, the reason for this behavior of the residuals is the fact that the model does not capture the non-linear relationship between the price and the year of construction.

As seen from Figure 20.2, the distribution for the random-forest model is skewed to the right and multimodal. It seems to be centered at a value closer to zero than the distribution for the linear-regression model, but it shows a larger variation. These conclusions are confirmed by the box-and-whisker plots in Figure 20.3.

The two plots suggest that the residuals for the random-forest model are more frequently smaller than the residuals for the linear-regression model. However, a fraction of the random-forest-model residuals are very large and these large residuals result in the RMSE being comparable for the two models.

In the remainder of the section, we focus on the random-forest model.

Figure 20.4 shows a scatterplot of residuals (y-axis) in function of the observed (x-axis) values of the dependent variable. For a perfect predictive model, we would expect the horizontal line at zero. For a ‘’good’’ model, we would like to see a symmetric scatter of points around the horizontal line at zero, indicating random deviations of predictions from the observed values. The plot in Figure 20.4 shows that, for the large observed values of the dependent variable, the residuals are positive, while for small values they are negative. Thus, the plot suggests that the predictions are shifted (biased) towards the average.

The shift towards the average can also be seen from Figure 20.5 that shows a scatterplot of the predicted (y-axis) and observed (x-axis) values of the dependent variable. For a perfect predictive model we would expect a diagonal line (marked in red). The plot shows that, for large observed values of the dependent variable, the predictions are smaller than the observed values, with an opposite trend for the small observed values of the dependent variable.

Figure 20.6 shows an index plot of residuals, i.e., their scatterplot in function of an (arbitrary) id-number of the observation (x-axis). The plot indicates an asymmetric distribution of residuals around zero, as there is an excess of large positive (larger than 500) residuals without a corresponding fraction of negative values. This can be linked to the right-skewed distribution seen in Figures 20.2 and 20.3 for the random-forest model.

plot(md_rf, variable = "ids", yvariable = "residuals") +
xlab("observation id") + ylab("residuals")

Figure 20.7 shows a scatterplot of residuals (y-axis) in function of the predicted (x-axis) value of the dependent variable. For a ‘’good’’ model, we would like to see a symmetric scatter of points around the horizontal line at zero. The plot in Figure 20.7, as the one in Figure 20.4, the plot suggests that the predictions are shifted (biased) towards the average.

plot(md_rf, variable = "y_hat", yvariable = "residuals") +
xlab("predicted price") 

The random-forest model, as the linear-regression model, assumes that residuals should be homoscedastic, i.e., that they should have a constant variance. Figure 20.8 presents the scale-location plot of residuals, i.e., a scatterplot of the absolute value of residuals in function of the predicted values of the dependent variable. The plot includes a smoothed line capturing the average trend. For homoscedastic residuals, we would expect a symmetric scatter around a horizontal line, for which the smoothed trend should be also horizontal. The plot in Figure 20.8 deviates from the expected pattern and indicates that the variability of the residuals depends on the (predicted) value of the dependent variable.

plot(md_rf, variable = "y_hat", yvariable = "abs_residuals") +
xlab("predicted price") + ylab("|residuals|")

## 20.5 Pros and cons

Diagnostics of the residuals is a very important stage of model exploration. Properly performed diagnostics allows to identify many different types of problems such as:

• Bias in predictions for instances with extremely high values of the target variable.
• The heterogeneous variance of the residuals, suggesting perhaps incorrect specification of the model.
• High values of residual for some ranges of a variable suggesting an incorrect model specification for some subgroup of observations.

However, the problem with diagnostics is that there is lots of diagnostic charts to review. And without quantitative measures of meeting assumption, we can only rely on the organoleptic review of the graph after the graph.

## 20.6 Code snippets for R

In this section, we present the key features of the DALEX package which is a wrapper for functions from auditor package (Gosiewska and Biecek 2018). This package covers all methods presented in this chapter.

First, we load explainers for the linear-regression model apartments_lm_v5 and the random-forest model apartments_rf_v5 created in Section 5.2.6 for the apartments data.

library("DALEX")
library("randomForest")

explainer_apartments_rf <- archivist:: aread("pbiecek/models/569b0")

There are two functions that will be used for exploration of residuals. The DALEX::model_performance() function is useful for exploration of distribution of residuals while DALEX::model_diagnostics() function is useful for looking for relation between residuals and other variables.

Let’s start with distributions of residuals. This can be done with the model_performance() function. The residuals are stored in separate objects that can be used for construction of various plots and summaries.

mr_lr <- DALEX::model_performance(explainer_apartments_lr)
mr_rf <- DALEX::model_performance(explainer_apartments_rf)

The generic plot() function shows different statistics based on specified geom argument. In particular, Figure 20.2 can be constructed by using the following simple code:

plot(mr_lr, mr_rf, geom = "histogram") 

Note that, by including the two objects containing residuals for the linear-regression model and the random-forest model in the function call, we automatically get an overlay of the plots of the histogram of residuals for the two models.

The box-and-whisker plots of the residuals for the two models, shown in Figure 20.3, can be constructed by using the following simple call with geom = "boxplot".

plot(mr_lr, mr_rf, geom = "boxplot")

Function model_diagnostics() calculates residuals for various scatter plots of residuals against some other variables

md_lr <- model_diagnostics(explainer_apartments_lr)
md_rf <- model_diagnostics(explainer_apartments_rf)

The generic plot() function produces a scatterplot of residuals (y-axis) in function of the observed (x-axis) values of the dependent variable, as in Figure 20.4. By using arguments variable and yvariable, one specify which variables will be plotted on OX and OY axis. Apart of variables names, one can use following constants:

• y for true target values,
• y_hat for predicted target values,
• obs for ids of an observation,
• residuals for calculated residual,
• abs_residuals for absolute values of residual.

For example, to reproduce Figure 20.4 one needs to plot target variable on the OX axis and residuals on OY.

plot(md_rf, variable = "y", yvariable = "residuals") 

To produce Figure 20.5 we need to plot predicted target values on OY axis. This can be done with yvariable = "y_hat" argument.

plot(md_rf, variable = "y", yvariable = "y_hat") 

The Figure 20.6 has indexes of observations on OX axis. This can be achieved with variable = "ids" argument.

plot(md_rf, variable = "ids", yvariable = "residuals")

In the Figure 20.8 on OY scale we plotted absolute residuals. This can be done with yvariable = "abs_residuals" argument.

plot(md_rf, variable = "y_hat", yvariable = "abs_residuals")

### References

Gosiewska, Alicja, and Przemyslaw Biecek. 2018. Auditor: Model Audit - Verification, Validation, and Error Analysis. https://CRAN.R-project.org/package=auditor.