A hands-on tutorial on Interactive Explanatory Model Analysis, with examples for classification models
8/6/23
Feel free to interrupt me and ask questions during the workshop!
2:30 Introduction to the process
2:45 Let’s meet some models
3:00 Global XAI - Variable Importance + Partial Dependence Profile
3:30 Do it yourself !!!
4:00 BREAK
4:30 Local XAI - Break-down + SHAP + Ceteris Paribus
5:00 Interactive EMA
5:20 Do it yourself !!!
5:50 Closing remarks
Source: https://incidentdatabase.ai/
IrResponsible Machine Learning:
Responsible Machine Learning:
The purpose of this tutorial is to present techniques for model exploration, visualisation and explanation. To do this we will use some interesting real-world data, train a few models on the data and then use XAI (eXplainable artificial intelligence) techniques to explore these models.
Materials for our workshop

The workshop consists of 1/3 lecture, 1/3 code examples discussed by the tutor and 1/3 computer-based exercises for participants.
It aims to present a set of methods for the exploration of complex predictive models. I assume that participants are familiar with Python and have some basic knowledge of predictive models. In this workshop, we will show how to explore these models.
To make working with models more enjoyable, the materials are based on a true story, which we will tell with the help of comic strips.
The problem
The life cycle of a predictive model begins with a well-defined problem. In this example, we are looking for a model that assesses the risk of death after being diagnosed with Covid. We don’t want to guess who will survive and who won’t. We want to construct a score that allows us to sort patients by risk of death.
Why do we need such a model? It could have many applications! Those at higher risk of death could be given more protection, such as providing them with pulse oximeters or preferentially vaccinating them.
Przemysław Biecek

Model development is an iterative process. In each iteration, new versions of the model are created, the models are evaluated, and conclusions are drawn that guide the next iteration.
Over time, the amount of resources spent on each step changes. In successive iterations, the validation of models becomes more important and exploring the data becomes less important.
In this workshop we will go through the process of building a model in ten steps. We will build several candidate models so that they can be compared later using XAI techniques.
To demonstrate what responsible predictive modelling looks like, we used data obtained in collaboration with the Polish Institute of Hygiene to model mortality after Covid infection. We realize that data on Coronavirus disease can evoke negative feelings. However, it is a good example of how predictive modelling can directly impact our society and how data analysis allows us to deal with complex, important and topical problems.
All the results presented in this book can be independently reproduced using the snippets and instructions presented in this book. If you do not want to retype them, then all the examples, data, and code can be found at https://betaandbit.github.io/RML2/.
Our first model will be based on the statistics collected by the Centers for Disease Control and Prevention (CDC).
Figure 1: Mortality statistics as presented on the CDC website, accessed in May 2021. This table shows rate ratios compared to the group of 5- to 17-year-olds (selected as the reference group because it accounted for the largest cumulative number of COVID-19 cases compared to other age groups).
In this book we follow the convention that a model is an object that has a predict function, which transforms the \(n\times p\) input matrix with \(p\) variables for \(n\) observations into a vector of \(n\) predictions. Below, we define a class cdc_risk that calculates the odds of Covid-related death based on statistics from the CDC website.
import numpy as np
import pandas as pd

class cdc_risk:
    def __init__(self, base_risk=0.00003):
        self.base_risk = base_risk

    def predict(self, x):
        # rate ratios relative to the 5-17 reference group (CDC, May 2021);
        # start with the oldest group and overwrite for younger age groups
        rratio = np.array([7900] * x.shape[0])
        rratio[x.Age < 84.5] = 2800
        rratio[x.Age < 74.5] = 1100
        rratio[x.Age < 64.5] = 400
        rratio[x.Age < 49.5] = 130
        rratio[x.Age < 39.5] = 45
        rratio[x.Age < 29.5] = 15
        rratio[x.Age < 17.5] = 1
        rratio[x.Age < 4.5] = 2
        return rratio * self.base_risk

model_cdc = cdc_risk()
steve = pd.DataFrame({"Age": [25], "Diabetes": ["Yes"]})
model_cdc.predict(steve)
## array([0.00045])
Predictive models may have different structures. To work responsibly with a large number of models, a uniform standardized interface is needed. In this book, we use the abstraction implemented in the dalex package.
The Explainer constructor from this package creates an Explainer, i.e. a wrapper for the model that allows you to work uniformly with objects of very different structures. The first argument is a model. It can be an object of any class, but it is assumed that the object contains a predict function, as in our example. The label argument specifies a unique name that appears in the plots, while other arguments will be introduced in the next sections.
Using the Explainer may seem like an unnecessary complication at the moment, but on the following pages, we show how it simplifies the work.
In machine learning, good usually means built on a large amount of representative data. Unfortunately, collecting representative data is neither easy nor cheap and often requires designing and conducting a specific experiment.
Here we use the data collected through epidemiological interviews. The number of interviewed patients is large, so we treat this data as representative, although, unfortunately, it only covers symptomatic patients who tested positive for SARS-CoV-2. Asymptomatic cases are more likely to be young adults.
The data is divided into two sets: covid_spring and covid_summer. The first set was acquired in spring 2020 and will be used as training data, while the second was acquired in the summer and will be used for validation. In machine learning, model validation is performed on a separate data set called validation data; this controls the risk of overfitting a flexible model to the training data. If we do not have a separate set, then it is generated using cross-validation, out-of-sample, out-of-time or similar data-splitting techniques.
All files needed to replicate the following code are available at https://github.com/BetaAndBit/RML2. Download the CSV files with the data and save them in the working directory. Let’s start our analyses by reading the data from these files.
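A minimal sketch of this step; the exact file names are an assumption, see the repository for the data files.
import pandas as pd

covid_spring = pd.read_csv("covid_spring.csv")   # training data (spring 2020)
covid_summer = pd.read_csv("covid_summer.csv")   # validation data (summer 2020)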
Let’s use the matplotlib package to draw a simple histogram of the Age variable, and statsmodels to draw a mosaic plot of the Diabetes variable.
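A sketch of both plots, assuming the covid_spring data frame loaded above and that Death is a column in it.
import matplotlib.pyplot as plt
from statsmodels.graphics.mosaicplot import mosaic

# histogram of the Age variable
covid_spring.Age.plot.hist(bins=30)
plt.xlabel("Age")
plt.show()

# mosaic plot of Diabetes against Death
mosaic(covid_spring, ["Diabetes", "Death"])
plt.show()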

A handy way to summarise tabular data in groups is the so-called “Table 1”. This is a summary of the main characteristics of each variable broken down into groups defined by the variable of interest (here, binary information about Covid death). The name stems from the fact that this summary of the data is usually the first table to be shown in many medical and related scientific journals.
from tableone import TableOne

columns = ['Gender', 'Age', 'Cardiovascular.Diseases', 'Diabetes',
           'Neurological.Diseases', 'Kidney.Diseases', 'Cancer',
           'Hospitalization', 'Fever', 'Cough', 'Death']
categorical = ['Gender', 'Cardiovascular.Diseases', 'Diabetes',
               'Neurological.Diseases', 'Kidney.Diseases', 'Cancer',
               'Hospitalization', 'Fever', 'Cough']
groupby = ['Death']
TableOne(covid_spring, columns=columns,
         categorical=categorical, groupby=groupby, limit=1)
One of the most important rules to remember when building a predictive model is: do not condition on the future! That is, do not use variables that are not defined at the time the prediction needs to be made. Note that in the discussed case, the variables Hospitalization, Fever and Cough are not good predictors because they are not known before infection. In the following lines (see the sketch below), we remove these invalid variables from both data sets.
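A sketch of this step; the exact composition of selected_vars follows the variables listed above and is an assumption.
# keep only variables known before infection (plus the target)
selected_vars = ['Gender', 'Age', 'Cardiovascular.Diseases', 'Diabetes',
                 'Neurological.Diseases', 'Kidney.Diseases', 'Cancer', 'Death']
covid_spring = covid_spring[selected_vars]
covid_summer = covid_summer[selected_vars]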
Data exploration and cleansing often consume most of the time spent on data analysis. Here we have only touched on exploration, but even this initial analysis helped to determine that Age is an important characteristic (we will confirm this later). We will build further models only on variables from the selected_vars vector.
Note that in the Covid-mortality-risk-assessment problem, we are not interested in the binary prediction survived/dead, but rather in the quality of the ranking of risk scores. Relative risks can be used to perform triage, i.e. to determine which people need a response, such as a vaccine, most quickly.

For such types of problems, instead of a contingency table, one looks at Receiver Operating Characteristic (ROC) curve, which illustrates the trade-off between the true positive rate (sensitivity) and the false positive rate (1-specificity) at different classification thresholds. The curve’s shape and the area under it (AUC-ROC) provide insights into the classifier’s discrimination ability and overall predictive accuracy.

There are many measures for evaluating predictive models; they are implemented in the sklearn.metrics module. Below we show the most common measures, calculated by default by the dalex package.
First, we need an explainer with specified validation data (here covid_summer) and the corresponding response variable.
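A sketch of such an explainer; the 0/1 coding of the Death target is an assumption.
import dalex as dx

X_valid = covid_summer.drop(columns='Death')
y_valid = covid_summer.Death   # assumed to be coded 0/1
explainer_cdc = dx.Explainer(model_cdc, data=X_valid, y=y_valid,
                             label="CDC")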
Model exploration starts with an assessment of how good the model is. The model_performance() function calculates a set of measures for a specified type of task, here classification.
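For example, continuing with the explainer defined above:
mp_cdc = explainer_cdc.model_performance(model_type='classification')
mp_cdc.result   # a data frame with AUC, F1, accuracy, etc.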

Note: The explainer knows whether the model is trained for classification or regression task, so it automatically selects the right performance measures. This can be overridden if needed.
The plot function draws a graphical summary of the model performance. With the geom argument, one can determine the type of chart.
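For instance, to draw the ROC curve (a sketch, assuming geom='roc' is available as in recent dalex versions):
mp_cdc.plot(geom='roc')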

There are hundreds of different methods for training machine learning models available to experienced data scientists. Among the oldest and most popular are tree-based algorithms, first introduced in the book Classification And Regression Trees and commonly called CART. Here is a general description of this class of algorithms.
The most common way to train a decision tree in Python is by using the DecisionTreeClassifier class from the sklearn.tree module. Scikit-learn provides a robust and user-friendly implementation of decision tree algorithms. The training process typically involves two steps: (1) create an instance of the decision tree classifier, specifying any desired hyperparameters such as the tree’s depth, the splitting criterion, and other aspects of the learning process, (2) fit the decision tree model to the training data using the fit method.
A trained model can be plotted with the plot_tree function.
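A sketch of both steps; the max_depth value and a numeric (0/1) coding of the predictors are assumptions.
from sklearn.tree import DecisionTreeClassifier, plot_tree

# training data: predictors and the binary target (assumed coded 0/1)
X_train = covid_spring.drop(columns='Death')
y_train = covid_spring.Death

model_dtc = DecisionTreeClassifier(max_depth=3, random_state=0)
model_dtc.fit(X_train, y_train)

# visualise the trained tree
plot_tree(model_dtc, feature_names=list(X_train.columns), filled=True)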

In order to use the dalex pipeline, we need to wrap the model in an Explainer.
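A sketch, reusing the validation data from the CDC explainer:
explainer_dtc = dx.Explainer(model_dtc, data=X_valid, y=y_valid,
                             label="Decision tree")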
As before, we can calculate the performance of the model with the model_performance method. Here, on the test data, the performance is lower than for the previous model.

Let’s add the ROC curve for the decision tree to the curve obtained from the previous model.
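A sketch; dalex plot methods accept a list of other explanation objects of the same type.
mp_dtc = explainer_dtc.model_performance(model_type='classification')
mp_dtc.plot([mp_cdc], geom='roc')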

In 2001, Leo Breiman proposed a new family of models, called random forests, which aggregate decisions from an ensemble of trees trained on bootstrap samples.
Random forests combine two interesting concepts. First, combining multiple weaker models produces a stronger and more stable model. Second, the more diverse the individual models, the greater the benefit of averaging them. To increase the diversity of models, Breiman used the bootstrap technique. Bootstrap is today a very widespread and powerful statistical procedure.
Figure 8: The key steps: generate a set of B bootstrap copies of the dataset by sampling rows with replacement; train a deep tree on each copy; during prediction, aggregate the results of the individual trees.
The most common way to train a random forest in Python is by using the RandomForestClassifier class from the sklearn.ensemble module. Scikit-learn provides a straightforward implementation of random forest algorithms. As with other scikit-learn models, the training process typically involves two steps: (1) create an instance of the classifier, specifying any desired hyperparameters such as the number of trees, (2) fit the model to the training data using the fit method.
A trained model can be turned into a dalex explainer. Note that we set the label argument to control the name of the model presented in diagnostic plots.
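A sketch of training and wrapping the forest; the number of trees is an assumption.
from sklearn.ensemble import RandomForestClassifier

model_rfc = RandomForestClassifier(n_estimators=100, random_state=0)
model_rfc.fit(X_train, y_train)

explainer_rfc = dx.Explainer(model_rfc, data=X_valid, y=y_valid,
                             label="Random forest")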
We can now check how good this model is. As expected, a random forest model has better performance/AUC than a single tree.
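For example:
mp_rfc = explainer_rfc.model_performance(model_type='classification')
mp_rfc.plot([mp_cdc, mp_dtc], geom='roc')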


Machine Learning algorithms typically have many hyperparameters that specify a model training process. For some model families, like Support Vector Machines (SVM) or Gradient Boosting Machines (GBM), the selection of such hyperparameters has a strong impact on the performance of the final model. The process of finding good hyperparameters is commonly called tuning.
Different model families have different sets of hyperparameters. We don’t always want to optimize all of them simultaneously, so the first step is to define the hyperparameter search space. Once it is specified, tuning is based on two steps executed in a loop: (1) select a set of hyperparameters and (2) evaluate how good this set of hyperparameters is. These steps are repeated until some stopping criterion is met, such as the maximum number of iterations, a desired minimum model performance, or a lack of further improvement in model performance.
Figure 10: The hyperparameter optimization scheme implemented in the mlr3tuning package. Source: https://mlr3book.mlr-org.com/tuning.html
The most common way to perform hyperparameter optimisation in Python is by using the RandomizedSearchCV class from the sklearn.model_selection module.
First, we need to specify the space of hyperparameters to search. Not all hyperparameters are worth optimizing. Let’s focus on three for the random forest algorithm: the number of trees, the maximum depth of a tree and the splitting criterion.
For automatic hyperparameter search, it is necessary to specify: (1) search strategy (below it is the random search), (2) family of models to be tested, (3) definition of space for hyperparameters, (4) a procedure to evaluate the performance of the proposed models (below it is the AUC determined by 5-fold cross-validation), (5) a stopping criterion (below it is 10 evaluations).
Now we are ready to fit parameters in this pipeline. As usual, it can be done with the fit() method. After the tuning, the best identified hyperparameters can be extracted from the best_params_ field.
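A sketch of the whole tuning pipeline; the candidate values in the search space are assumptions.
from sklearn.model_selection import RandomizedSearchCV

# (3) search space for three hyperparameters of the random forest
param_distributions = {
    'n_estimators': [100, 200, 500],
    'max_depth': [2, 5, 10, None],
    'criterion': ['gini', 'entropy'],
}

# (1) random search, (2) random forest, (4) AUC via 5-fold CV, (5) 10 evaluations
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=10, scoring='roc_auc', cv=5, random_state=0)
search.fit(X_train, y_train)
search.best_params_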
There is, of course, no guarantee that the tuner will find better hyperparameters than the default ones. But in this example, the tuned model is better than all the other models we have considered so far. Let’s see by how much. We need a dalex wrapper.
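A sketch; RandomizedSearchCV keeps the refitted best model in the best_estimator_ field.
explainer_rfc_tuned = dx.Explainer(search.best_estimator_,
                                   data=X_valid, y=y_valid,
                                   label="Random forest (tuned)")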
We can calculate and compare the model performance/AUC on validation data and then compare ROC curves for various models.

Let’s plot all ROC curves in a single plot.
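For example:
mp_rfc_tuned = explainer_rfc_tuned.model_performance(model_type='classification')
mp_rfc_tuned.plot([mp_cdc, mp_dtc, mp_rfc], geom='roc')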

Some models have built-in methods for the assessment of Variable importance. For linear models, one can use standardized model coefficients or p-values. For random forest one can use out-of-bag classification error. For tree boosting models, one can use gain statistics.
From: http://www.jmlr.org/papers/volume20/18-760/18-760.pdf
Several common approaches for variable selection, or for describing relationships between variables, do not necessarily capture a variable’s importance. Null hypothesis testing methods may identify a relationship, but do not describe the relationship’s strength. Similarly, checking whether a variable is included by a sparse model-fitting algorithm, such as the Lasso (Hastie et al., 2009), does not describe the extent to which the variable is relied on. Partial dependence plots (Breiman et al., 2001; Hastie et al., 2009) can be difficult to interpret if multiple variables are of interest, or if the prediction model contains interaction effects.
Another common VI procedure is to run a model-fitting algorithm twice, first on all of the data, and then again after removing X1 from the data set. The losses for the two resulting models are then compared to determine the importance, or “necessity,” of X1 (Gevrey et al., 2003). Because this measure is a function of two prediction models rather than one, it does not measure how much either individual model relies on X1.
Some models have built-in methods for the assessment of variable importance. For example, for linear models, one can use standardized model coefficients or p-values. For the random forest, one can use out-of-bag classification error. For tree boosting models, one can use information gain statistics. Yet, the problem with such model-specific techniques is that they cannot be compared between models of different structures. For this and a few other reasons, it is convenient to use model agnostic techniques, such as permutational importance of variables.
The procedure is based on perturbations of a selected variable or group of variables. The intuition is that if a variable is important in a model, then after its random perturbation the model predictions should be less accurate.
The permutation-based variable-importance of a variable \(i\) is the difference (or ratio) between the model performance for the original data and the model performance calculated on data with the permuted variable \(i\). More formally
\[ VI(i) = L(f, X^{perm(i)}, y) - L(f, X, y), \]
where \(L(f, X, y)\) is the value of the loss function or performance measure for the data \(X\), true labels \(y\) and model \(f\), while \(X^{perm(i)}\) is the dataset \(X\) with the \(i\)-th variable permuted.
We use the model_parts method from the dalex package to calculate the importance of variables. The only required argument is the model to be analyzed. With additional arguments, one can also specify how the importance of variables is to be calculated, whether as a difference, ratio or without normalization.

This technique is handy when we want to compare the importance of variables in different models. Let’s see what it looks like in our example. The plot method works for any number of models given as consecutive arguments.
importance_cdc = explainer_cdc.model_parts(
    loss_function="1-auc", type="difference", random_state=0)
importance_dtc = explainer_dtc.model_parts(
    loss_function="1-auc", type="difference", random_state=0)
# importance for the untuned random forest, needed in the plot below
importance_rfc = explainer_rfc.model_parts(
    loss_function="1-auc", type="difference", random_state=0)
importance_rfc_tuned = explainer_rfc_tuned.model_parts(
    loss_function="1-auc", type="difference", random_state=0)
importance_cdc.plot([importance_rfc, importance_dtc,
                     importance_rfc_tuned], show=False)
For the CDC model, the only important variable is Age. For the tree model, the three important variables are Age, Cancer, and Cardiovascular diseases, an observation consistent with Figure 6. For the random forest model and the model after hyperparameter tuning, more variables are taken into account. However, Age is indisputably the most important variable in all models.
The same perturbation technique can be used to analyze the importance of groups of variables. Just use the variable_groups argument (see the sketch below). Grouping variables can be particularly useful if the number of variables is large and groups of variables describe some common aspects. In our case, we could group all diseases together.
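A sketch of grouped importance; the grouping itself is an assumption.
# group all disease-related variables together
disease_groups = {
    'age': ['Age'],
    'gender': ['Gender'],
    'diseases': ['Cardiovascular.Diseases', 'Diabetes',
                 'Neurological.Diseases', 'Kidney.Diseases', 'Cancer'],
}
importance_groups = explainer_rfc_tuned.model_parts(
    variable_groups=disease_groups)
importance_groups.plot()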
import dalex as dx
titanic = dx.datasets.load_titanic()
X = titanic.drop(columns='survived')
y = titanic.survived
Once we know which variables are important, it is usually interesting to determine the relationship between a particular variable and the model prediction. Popular techniques for this type of Explanatory Model Analysis are Partial Dependence (PD) and Accumulated Local Effects (ALE).
PD profiles were initially proposed in 2001 for gradient boosting models but can be used in a model-agnostic fashion. The method is based on the analysis of the average model response after replacing variable \(i\) with the value \(t\).
More formally, Partial Dependence profile for variable \(i\) is a function of \(t\) defined as
\[ PD(i, t) = E\left[ f(x_1, ..., x_{i-1}, t, x_{i+1}, ..., x_p) \right], \]
where the expected value is calculated over the data distribution. The straightforward estimator is
\[ \widehat{PD}(i, t) = \frac 1n \sum_{j=1}^n f(x^j_1, ..., x^j_{i-1}, t, x^j_{i+1}, ..., x^j_p). \]
We use the model_profile method from the dalex package to calculate the variable profile. The only required argument is the model to be analyzed. It is a good idea to specify the names of variables for profile estimation as a second argument; otherwise, profiles are calculated for all variables, which can take some time. One can also specify the exact grid of points at which the profiles are calculated.
The average is calculated for the distribution specified in the data argument in the explainer. Here we calculate the PD profiles for the Age variable for covid_summer data.
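For example:
pd_rfc_tuned = explainer_rfc_tuned.model_profile(variables=['Age'])
pd_rfc_tuned.plot()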

Since we have four models, it is worth comparing how they differ in terms of the model’s response to the Age variable.
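A sketch comparing the PD profiles of the models defined above:
pd_cdc = explainer_cdc.model_profile(variables=['Age'])
pd_dtc = explainer_dtc.model_profile(variables=['Age'])
pd_rfc = explainer_rfc.model_profile(variables=['Age'])
pd_cdc.plot([pd_dtc, pd_rfc, pd_rfc_tuned])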

Models trained on the covid_spring data place the dramatic increase in risk around age 65. The tree model is too shallow to capture the ever-increasing risk in the oldest group. Despite this, the models are quite consistent about the general shape of the relationship. By default, the average is calculated over all observations, but with the groups argument one can specify a grouping variable. PD profiles are then calculated independently for each level of this variable.
Figure 15: Partial Dependence for Age in groups defined by Diabetes variable.
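A sketch of the grouped profiles shown in Figure 15:
pd_diabetes = explainer_rfc_tuned.model_profile(
    variables=['Age'], groups='Diabetes')
pd_diabetes.plot()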
How to divide the reward?
Note that any two parties can form a government. In that case, should the prize for C be equal to or less than that for A?
Students A, B and C carry out a project together. With this payoff table, determine what portion of the award each student should get.
One can define various desirable properties of a fair reward distribution. The following seem natural (or at least they did for Lloyd Shapley).
Efficiency: the attributions sum up to the value of the whole coalition,
\[ \sum_j \phi_j = v(P) \]
Symmetry: players with identical contributions receive identical attributions,
\[ \forall_S v(S \cup \{i\}) = v(S \cup \{j\}) \Rightarrow \phi_i = \phi_j \]
Null player: a player who contributes nothing receives nothing,
\[ \forall_S v(S \cup \{i\}) = v(S) \Rightarrow \phi_i = 0 \]
Additivity: attributions for a sum of two games are sums of attributions,
\[ \forall_S v(S) = v_1(S) + v_2(S) \Rightarrow \phi_i = \phi_{1,i} + \phi_{2,i} \]
The unique attribution satisfying all these properties is the Shapley value:
\[ \phi_j = \frac{1}{|P|!} \sum_{\pi \in \Pi} (v(S_j^\pi \cup \{j\}) - v(S_j^\pi)) \]
where \(\Pi\) is a set of all possible permutations of players \(P\) while \(S_j^\pi\) is a set of players that are before player \(j\) in permutation \(\pi\).
In practice, the sum over all permutations is approximated by an average over a random sample of orderings \(B\):
\[ \hat\phi_j = \frac{1}{|B|} \sum_{\pi \in B} (v(S_j^\pi \cup \{j\}) - v(S_j^\pi)) \]
For tabular data, one of the most commonly used techniques for local variable attribution is Shapley values. The key idea behind this method is to analyze the sequence of conditional expected values. This way, we can trace how the conditional mean moves from the average model response to the model prediction for observation of interest \(x^*\). Let’s consider a sequence of expected values.

By looking at consecutive differences \(\mu_{x_1}-\mu\), \(\mu_{x_1,x_2}-\mu_{x_1}\) and so on, one can calculate the added effects of individual variables, see an example in Figure 16. It sounds like a straightforward solution; however, there are two issues with this approach.
One is that it is not easy to estimate the conditional expected value. In most implementations, it is assumed that features are independent, and then we can estimate \(\mu_{K}\) as an average model response with variables in the set \(K\) replaced by corresponding values from observation \(x^*\). So the crude estimate would be
\[ \widehat{\mu}_{K} = \frac 1n \sum_{i=1}^n f(x_1^o, x_2^o, ..., x_p^o),\text{ where }\left\{ {{x_j^o = x_j^*,\text{ if }j \in K} \atop {x_j^o = x_j^i,\text{ if }j \not\in K.} } \right. \]
The second issue is that these effects may depend on the order of conditioning. How to solve this problem? The Shapley values method calculates attributions as an average of all (or at least a large number of random) orderings, while the Break-down method uses a single ordering determined with a greedy heuristic that prefers variables with the largest attribution at the beginning.



Consecutive conditioning may depend on the order of variables.
The Shapley values method calculates attributions as an average of all (or at least a large number of random) orderings,
while the Break-down method uses a single ordering determined with a greedy heuristic that prefers variables with the largest attribution at the beginning.
Let’s define an observation for which we will examine the model more closely. Let it be a 76-year-old man with hypertension. We show a local model analysis using the random forest model as an example.
The predict_parts function, for a specified model and a specified observation, calculates local variable attributions. The optional argument order forces the use of a specified sequence of variables. If not specified, a greedy heuristic is used to start the conditioning with the most relevant variables.
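A sketch of the break-down attribution; the exact values of this hypothetical patient are assumptions, and their coding must match the training data.
john = pd.DataFrame({'Gender': ['Male'], 'Age': [76],
                     'Cardiovascular.Diseases': ['Yes'], 'Diabetes': ['No'],
                     'Neurological.Diseases': ['No'],
                     'Kidney.Diseases': ['No'], 'Cancer': ['No']},
                    index=['John'])
bd_john = explainer_rfc_tuned.predict_parts(john, type='break_down')
bd_john.plot()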

The alternative is to average over all (or at least many random) orderings of variables. This is how the Shapley values are calculated. The show_boxplots argument highlights the stability of the estimated attributions between different orderings.
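A sketch; B is the number of random orderings used to average the attributions.
sh_john = explainer_rfc_tuned.predict_parts(john, type='shap', B=25)
sh_john.plot()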

Ceteris Paribus (CP) is a Latin phrase for “other things being equal”. It is also a very useful technique for an analysis of model behaviour for a single observation. CP profiles, sometimes called Individual Conditional Expectations (ICE), show how the model response would change for a selected observation if a value for one variable was changed while leaving the other variables unchanged.
While local variable attribution is a convenient technique for answering the question of which variables affect the prediction, the local profile analysis is a good technique for answering the question of how the model response depends on a particular variable. Or answering the question of what if…
Once we know which variables are important, it is usually interesting to determine the relationship between a particular variable and the model prediction. Popular techniques for this type of Explanatory Model Analysis are Partial Dependence (PD) and Accumulated Local Effects (ALE).
Introduced in 2001 in the paper Greedy Function Approximation: A Gradient Boosting Machine by Jerome Friedman, The Annals of Statistics, 2001.
A PD profile is the Ceteris Paribus profile averaged over the marginal distribution of the remaining variables \(X^{-j}\).
\[ PD(i, t) = E\left[ f(x_1, ..., x_{i-1}, t, x_{i+1}, ..., x_p) \right], \]
The estimation is based on the average of the CP profiles.
The computational complexity is \(N \times Z\) model evaluations, where \(N\) is the number of observations and \(Z\) is the number of points at which the CP profile is calculated (how to select these points?).
\[ \widehat{PD}(i, t) = \frac 1n \sum_{j=1}^n f(x^j_1, ..., x^j_{i-1}, t, x^j_{i+1}, ..., x^j_p). \]
Replacing the \(i\)-th variable by the value \(t\) can lead to very strange observations, especially when the \(i\)-th variable is correlated with other variables and we ignore the correlation structure. One solution to this is Accumulated Local Effects profiles, which average over the conditional distribution.
The ALE method was introduced in Visualizing the effects of predictor variables in black box supervised learning models by Daniel Apley and Jingyu Zhu.
The predict_profile() function calculates Ceteris Paribus profiles for a selected model and selected observations. By default, it calculates profiles for all variables, but one can limit this list with the variables argument.
The calculated profiles can be drawn with the generic plot function. As with other explanations in the dalex library, multiple models can be plotted on a single graph, although for technical reasons quantitative and qualitative variables cannot be shown in a single chart. So if you want to show profiles for qualitative variables, you need to plot them separately.
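A sketch for the observation defined earlier; the variable_type argument for the categorical view is an assumption based on recent dalex versions.
cp_john = explainer_rfc_tuned.predict_profile(
    john, variables=['Age', 'Cardiovascular.Diseases'])
cp_john.plot(variables=['Age'])
cp_john.plot(variables=['Cardiovascular.Diseases'],
             variable_type='categorical')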
The figure below shows an example of a CP profile for the continuous variable Age and the categorical variable Cardiovascular.Diseases.

On the left, the CP profile for the continuous variable Age; on the right, for the categorical variable Cardiovascular.Diseases. For categorical variables, one can specify how the CP profiles should be drawn by setting the categorical_type argument. The plot function can combine multiple models, making it easier to see similarities and differences.

Source: www.pinkconcussions.com
Multiple complementary explanations work more effectively than a single explanation
Source: https://iema.drwhy.ai/
We have made the model built for the Covid data, along with the explanations described in this book, available online. Within two months, tens of thousands of people had used it. With the proper tools, the deployment of such a model is not difficult.
To obtain a safe and effective model, it is necessary to perform a detailed Explanatory Model Analysis. However, we often don’t have much time for it. That is why tools that facilitate fast and automated model exploration are so useful.
One such tool is arena. It is a package that transforms an explainer into an HTML page with JavaScript-based interaction. Such an HTML page is easy to save on disk or share by email. The webpage has various explanations pre-calculated, so its generation may be time-consuming, but the model exploration itself is very fast and the feedback loop is tight.
Generating an arena dashboard for an explainer is easy (see the sketch below).
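A minimal sketch of the Arena workflow from the dalex package; the port number and the choice of observations are arbitrary.
# create an empty Arena, add models and observations, then serve the dashboard
arena = dx.Arena()
arena.push_model(explainer_rfc_tuned)
arena.push_model(explainer_rfc)
arena.push_observations(covid_summer.head(10))
arena.run_server(port=9294)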
Why is interpretability important?


Explanation is a process – EuADS Summer School – 08/06/2023