# 2 Model Development

## 2.1 Introduction

In general, we can distinguish between two approaches to statistical modelling: *explanatory* and *predictive* (Leo Breiman 2001b; Shmueli 2010). In *explanatory modelling*, models are applied for inferential purposes, i.e., to test hypotheses resulting from some theoretical considerations related to the investigated phenomenon (for instance, related to an effect of a particular clinical factor on a probability of a disease). In *predictive modelling*, models are used for the purpose of predicting the value of a new or future observation (for instance, whether a person has got or will develop a disease). It is important to know what is the intended purpose of modelling because it has got important consequences for the methods used in the model development process.

In this book, we focus on predictive modelling. Thus, we present mainly the methods relevant for predictive models. Nevertheless, we also show selected methods used in case of explanatory models, in order to discuss, if relevant, substantive differences between the methods applied to the two approaches to modelling.

Predictive models are created for various purposes. For instance, a team of data scientists may spend months on developing a single model that will be used for scoring risks of transactions in a large financial company. In that case, every aspect of the model is important, as the model will be used on a large scale and will have important long-term consequences for the company. Hence, the model-development process may be lengthy and tedious. On the other hand, if a small pizza-delivery chain may want to develop a simple model to roughly predict the demand for deliveries, the development process may be much shorter and less complicated. In that case, the model may be quickly updated or even discarded, without major consequences.

Irrespectively of the goals of modelling, model-development process involves similar steps. In this chapter, we briefly discuss these steps.

## 2.2 Model-development process

Several approaches have been proposed to describe the process of model development. One of the most known general approaches is the Cross-industry Standard Process for Data Mining (CRISP-DM) (Chapman et al. 1999; Wikipedia 2019). Methodologies specific for predictive models have been introduced also by Grolemund and Wickham (2017), Hall (2019), and Biecek (2019).

The common goal of the approaches is to standardize the process. Standardization can help to plan resources needed to develop and maintain a model, and to make sure that no important steps are missed when developing the model.

CRISP-DM is a tool-agnostic procedure. It breaks the model-development process into six phases: *business understanding*, *data understanding*, *data preparation*, *modelling*, *evaluation*, and *deployment*. The phases can be iterated. Note that iterative phases are also considered by Grolemund and Wickham (2017) and Hall (2019).

Figure 2.1 presents a variant of the iterative process, divided into five steps. Data collection and preparation is needed prior to any modelling. One cannot hope for building a model with good performance if the data are not of good quality. Once data have been collected, they have to be explored to understand their structure. Subsequently, a model can be selected and fitted to the data. The constructed model should be validated. The three steps: data understanding, model assembly, and model audit, are often iterated to arrive at a point when, for instance, a model with the best predictive performance is obtained. Once the “best” model has been obtained, it can be “delivered,” i.e., implemented in practice after performing tests and developing the necessary documentation.

The Model-development Process (MDP), proposed by Biecek (2019), has been motivated by Rational Unified Process for Software Development (Kruchten 1998; Jacobson, Booch, and Rumbaugh 1999; Boehm 1988). MDP can be seen as an extension of the scheme presented in Figure 2.1. It recognizes that fact that consecutive iterations are not identical because the knowledge increases during the process and consecutive iterations are performed with different goals in mind. This is why MDP is presented in Figure 2.2 as an untangled version of Figure 2.1. The five phases, present in CRIPSP-DM, are shown in the rows. A single bar in each row represents a number of resources (for instance, a week-worth workload) that can be devoted to the project at a specific time-point (indicated on the horizontal axis). For a particular phase, resources can be used in different amounts depending on the current stage of the process, as indicated by the height of the bars. The stages are indicated at the top of the diagram in Figure 2.2: *problem formulation*, *crisp modelling*, *fine tuning*, and *maintenance and decommissioning*. Problem formulation aims at defining the needs for the model, defining datasets that will be used for training and validation, and deciding which performance measures will be used for the evaluation of the performance of the final model. Crisp modelling focuses on the creation of first versions of the model that may provide an idea about, for instance, how complex may the model have to be to yield the desired solution? Fine-tuning focuses on improving the initial version(s) of the model and selecting the best one according to the pre-defined metrics. Finally, maintenance and decommissioning aims at monitoring the performance of the model after its implementation. Note that, unlike in CRISP-DM, the diagram in Figure 2.2 indicates that the process may start with some resources being spent not on the data-preparation phase, but on the model-audit one. This is because, at the problem formulation stage, we may have to spend some time on defining the goals and model-performance metrics (that will be used in model benchmarking) before any attempt to collect the data.

Figure 2.2 also indicates that there may be several iterations of the different phases within each stage, as indicated at the bottom of the diagram. For instance, in the crisp-modelling stage, several versions of a model may be prepared in subsequent iterations.

Methods presented in this book can be used to get a better understanding the data and the application domain (*exploration*), obtaining insight into model-based predictions (*model explanation*), and evaluation of model’s performance (*model examination*). Thus, referring to MDP in Figure 2.2, the methods are suitable for data understanding, model assembly, and model audit phases.

In the remainder of the chapter, we provide a brief overview of the notation that will be used in the book, and the methods commonly used for data exploration, model fitting, and model validation.

## 2.3 Notation

Methods described in this book have been developed by different authors, who used different mathematical notations. We have made an attempt at keeping the mathematical notation consistent throughout the entire book. In some cases this may result in formulas with a fairly complex system of indices.

In this section, we provide a general overview of the notation we use. Whenever necessary, parts of the notation will be explained again in subsequent chapters.

We use capital letters like \(X\) or \(Y\) to denote (scalar) random variables. Observed values of these variables are denoted by lower-case letters like \(x\) or \(y\). Vectors and matrices are distinguished by underlining the letter. Thus, \(\underline{X}\) and \(\underline{x}\) denote matrix \(X\) and (column) vector \(x\), respectively. Note, however, that in some situations \(\underline{X}\) may indicate a vector of (scalar) random variables. We explicitly mention this when needed. Transposition is indicated by the prime, i.e., \(\underline{x}'\) is the row vector resulting from transposition of a column vector \(\underline{x}\).

We use notation \(E(Y)\) and \(Var(Y)\) to denote the expected (mean) value and variance of random variable \(Y\). If needed, we use a subscript to indicate the distribution used to compute the parameters. Thus, for instance, we use \[E_{Y|X=x}(Y) = E_{Y|x}(Y) = E_{Y}(Y|X=x) \] to indicate the conditional mean of \(Y\) given that random variable \(X\) assumes the value of \(x\).

We assume that the data available for modelling consist of \(n\) observations/instances. For the \(i\)-th observation, we have got an observed value of \(y_i\) of a dependent (random) variable \(Y\). We assume that \(Y\) is a scalar, i.e., a single number. In case of dependent categorical variable, we usually consider \(Y\) to be a binary indicator of observing a particular category.

Additionally, each observation from a dataset is described by \(p\) explanatory variables. We refer to the (column) vector of the explanatory variables, describing the \(i\)-th observation, by \(\underline{x}_i\). We can thus consider observations as points in a \(p\)-dimensional space \(\mathcal X \equiv \mathcal R^p\), with \(\underline{x}_i \in \mathcal X\). We often collect all explanatory-variable data in the \(n\times p\) matrix \(\underline{X}\) that contains, in the \(i\)-th row, vector \(\underline{x}'_i\).

When introducing some of the model-exploration methods, we often consider “an observation of interest,” for which the vector of explanatory variables is denoted by \(x_{*}\). As the observation may not necessarily belong to the analyzed dataset, we use of the asterisk in the subscript. Clearly, \(\underline{x}_{*} \in \mathcal X\).

We refer to the \(j\)-th coordinate of vector \(\underline{x}\) by using \(j\) in superscript. Thus, \(\underline{x}_i = ({x}^1_i, \ldots , {x}^p_i)'\), where \({x}^j_i\) denotes the \(j\)-th coordinate of vector \(\underline{x}_i\) for the \(i\)-th observation from the analyzed dataset. If a power (for instance, a square) of \({x}^j_i\) is needed, it will be denoted by using parentheses, i.e., \(\left({x}^j_i\right)^2\).

If \(\mathcal J\) denotes a subset of indices, then \(\underline{x}^{\mathcal J}\) denotes the vector formed by the coordinates of \(\underline{x}\) corresponding to the indices included in \(\mathcal J\).

We use \(\underline{x}^{-j}\) to refer to a vector that results from removing the \(j\)-th coordinate from vector \(\underline{x}\). By \(\underline{x}^{j|=z}\), we denote a vector in which all coordinates are equal to their values in \(\underline{x}\), except of the \(j\)-th coordinate, which values is set equal to \(z\). Thus, \(\underline{x}^{j|=z} = ({x}^1, \ldots, {x}^{j-1}, z, {x}^{j+1}, \ldots, {x}^p)'\).

Notation \(\underline{X}^{*j}\) is used to denote a matrix with the values as in \(\underline{X}\) except of the \(j\)-th column, for which elements are permuted.

In this book, a model is a function \(f:\mathcal X \rightarrow \mathcal R\) that transforms a point from \(\mathcal X\) into a real number. In most cases, the presented methods can be used directly for multivariate dependent variables; however, we use examples with univariate responses to simplify the notation.

Typically, during the model development, we create many competing models. Formally, we shall index models to refer to a specific version fitted to a dataset. However, for the sake of simplicity, we will omit the index when it is not important. For the same reason we ignore in the notation the fact that, in practice, we never know true model coefficients and use the estimated values.

We use the term “model residual” to indicate the difference between the observed value of the dependent variable \(Y\) for the \(i\)-th observation from a particular dataset and the model’s prediction for the observation:

\[\begin{equation} r_i = y_i - f(x_i) = y_i - \hat y_i, \tag{2.1} \end{equation}\]

where \(\hat y_i\) denotes the predicted (or fitted) value of \(y_i\). More information about residuals is provided in Chapter 19.

## 2.4 Data understanding

As indicated in Figures 2.1 and 2.2, before starting construction of any models, we have got to understand the data. Toward this aim, tools for data exploration such as visualization techniques, tabular summaries, and statistical methods can be used. The choice of the tools depends on the character of variables included in a dataset.

The most known introduction to data exploration is a famous book by Tukey (1977). It introduces the (now classical) tools like, for example, box-and-whisker plots or stem-and-leaf plots. Good overviews of techniques for data exploration can also be found in books by Nolan and Lang (2015) and Wickham and Grolemund (2017).

In this book, we rely on five visualization techniques for data exploration, schematically presented in Figure 2.3. Two of them (histogram and empirical cumulative-distribution (ECD) plot) are used to summarize the distribution of a single random (explanatory or dependent) variable; the remaining three (box plot, mosaic plot, box plot, and scatter plot) are used to explore the relationship between pairs of variables. Note that histogram can be used to explore the distribution of a continuous or a categorical variable, while the ECD and box plots are suitable for continuous variables. A mosaic plot is useful for exploring the relationship between two categorical variables, while a scatter plot can be applied for two continuous variables. It is worth noting that box plots can also be used for evaluating a relation between a categorical variable and a continuous one, as illustrated in Figure 2.3.

Exploration of data for the dependent variable usually focuses on the question related to the distribution of the variable. For instance, for a continuous variable, questions like approximate normality or symmetry of the distribution are most often of interest, because of the availability of many powerful methods and models that use the normality assumption. In the case of asymmetry (skewness), a possibility of a transformation that could make the distribution approximately symmetric or normal is usually investigated. For a categorical dependent variable, an important question is whether the proportion of observations in different categories is balanced or not. This is because, for instance, some methods related to the classification problem do not work well with if there is a substantial imbalance between the categories.

Exploration of data for explanatory variables may also include investigation of their distribution. This is because the results may reveal, for instance, that there is little variability in the observed values of a variable. As a consequence, the variable might be deemed not interesting from a model-construction point of view. Usually, however, the exploration focuses on the relationship between explanatory variables themselves on one hand, and their relationship with the dependent variable on the other hand. The results may have important consequences for model construction. For instance, if an explanatory variable does not appear to be related to the dependent variable, it may be dropped from a model (variable selection/filtering). The exploration results may also suggest, for instance, a need for a transformation of an explanatory variable to make its relationship with the dependent variable linear (variable engineering). Detection of pairs of strongly-correlated explanatory variables is also of interest, as it may help in resolving issues with, for instance, instability of optimization algorithms used for fitting of a model.

## 2.5 Model assembly (fitting)

In statistical modelling, we are interested in the distribution of a dependent variable \(Y\) given \(\underline{x}\), the vector of values of explanatory variables. In the ideal world, we would like to know the entire conditional distribution. In practical applications, however, we usually do not evaluate the entire distribution, but just some of its characteristics, like the expected (mean) value, a quantile, or variance. Without loss of generality we will assume that we model the conditional expected value of \(Y\), i.e., \(E_{Y | \underline{x}}(Y)\).

Assume that we have got model \(f()\), for which \(f(\underline{x})\) is an approximation of \(E_{Y | \underline{x}}(Y)\), i.e., \(E_{Y | \underline{x}}(Y) \approx f(\underline{x})\). Note that, in our book, we do not assume that it is a “good” model, nor that the approximation is precise. We simply assume that we have got a model that is used to estimate the conditional expected value and to form predictions of the values of the dependent variable. Our interest lies in the evaluation of the quality of the predictions. If the model offers a “good” approximation of the conditional expected value, it should be reflected in its satisfactory predictive performance.

Usually, when building a model, the available data are split into two parts. One part, often called a “training set” or “learning data,” is used for estimation of the model coefficients. The other part, called a “testing set” or “validation data,” is used for model validation. The splitting may be done repeatedly, as in k-fold cross-validation. We leave the topic of model validation for Chapter 15.

The process of estimation of model coefficients based on the training data, i.e., “fitting” of the model, differs for different models. In most cases, however, it can be seen as an optimization problem. Let \(\Theta\) be the space of all possible values of model coefficients. Model fitting (or training) is a procedure of selecting a value \(\underline{\hat{\theta}} \in \Theta\) that minimizes some loss function \(L()\):

\[\begin{equation} \underline{\hat{\theta}} = \arg \min_{\underline{\theta} \in \Theta} L\{\underline{y}, f(\underline{\theta};\underline{X})\}, \tag{2.2} \end{equation}\]

where \(\underline{y}\) is the vector of observed values of the dependent variable and \(f(\underline{\theta}; \underline{X})\) is the corresponding vector of the model’s predictions computed for model coefficients \(\underline{\theta}\) and matrix \(\underline{X}\) of values of explanatory variables for the observations from the training dataset. Denote the estimated form of the model by \(f(\underline{\hat{\theta}};\underline{X})\).

Consider predction of a new observation for which the vector of explanatory variables assumes the value of \(\underline{x}_*\), i.e., \(f(\underline{\hat{\theta}};\underline{x}_*)\). Assume that \(E_{Y | \underline{x}_*}(Y) = f(\underline{\theta};\underline{x}_*)\). It can be shown (Hastie, Tibshirani, and Friedman 2009; Shmueli 2010) that the expected squared-error of prediction can be expressed as follows:

\[\begin{align} E_{(Y,\underline{\hat{\theta}})|\underline{x}_*}\{Y-f(\underline{\hat{\theta}};\underline{x}_*)\}^2 &= E_{Y|\underline{x}_*}\{Y-f(\underline{\theta};\underline{x}_*)\}^2 + \nonumber \\ & \ \ \ [f(\underline{\theta};\underline{x}_*)-E_{\underline{\hat{\theta}}|\underline{x}_*}\{{f}(\underline{\hat{\theta}};\underline{x}_*)\}]^2 +\nonumber\\ & \ \ \ E_{\underline{\hat{\theta}}|\underline{x}_*}[{f}(\underline{\hat{\theta}};\underline{x}_*)-E_{\underline{\hat{\theta}}|\underline{x}_*}\{{f}(\underline{\hat{\theta}};\underline{x}_*)\}]^2\nonumber\\ &= Var_{Y|\underline{x}_*}(Y)+Bias^2+Var_{\underline{\hat{\theta}}|\underline{x}_*}\{\hat{f}(\underline{x}_*)\}. \tag{2.3} \end{align}\]

The first term on the right-hand-side of equation (2.3) is the variability of \(Y\) around its conditional expected value \(f(\underline{\theta};\underline{x}_*)\). In general, it cannot be reduced. The second term is the squared difference between the expected value and its estimate, i.e., the squared bias. Bias results from misspecifying the model by, for instance, using a more parsimonious or a simpler model. The third term is the variance of the estimate, due to the fact that we use training data to estimate the model.

The decomposition presented in (2.3) underlines an important difference between explanatory and predictive modelling. In the explanatory modelling, the goal is to minimize the bias, as we are interested in obtaining the most accurate representation of the investigated phenomenon and the related theory. In the predictive modelling, the focus is on minimization of the sum of the (squared) bias and the estimation variance, because we are interested in minimization of the prediction error. Thus, sometimes we can accept a certain amount of bias, if it leads to a substantial gain in precision of estimation and, consequently, in a smaller prediction error (Shmueli 2010).

It follows that the choice of the loss function \(L()\) in equation (2.2) may differ for explanatory and predictive modelling. For the former, it is common to assume some family of probability distributions for the conditional distribution of \(Y\) given \(\underline{x}\). In such case, the loss function \(L()\) may be defined as the negative logarithm of the likelihood function, where the likelihood is the probability of observing \(\underline{y}\), given \(\underline{X}\), treated as a function of \(\underline{\theta}\). The resulting estimate of \(\underline{\theta}\) is usually denoted by \(\underline{\hat{\theta}}\).

In predictive modelling, it is common to add term \(\lambda(\underline{\theta})\) to the loss function that “penalizes” for the use of more complex models:

\[\begin{equation} \underline{\tilde{\theta}} = \arg \min_{\underline{\theta} \in \Theta} \left[L\{\underline{y}, f(\underline{\theta};\underline{X})\} + \lambda(\underline{\theta})\right]. \tag{2.4} \end{equation}\]

For example, in linear regression we assume that the observed vector \(\underline{y}\) follows a multivariate Gaussian distribution:

\[
\underline{y} \sim \mathcal N(\underline{X}' \underline{\beta}, \sigma^2\underline{I}_n),
\]
where \(\underline{\theta}' = (\underline{\beta}', \sigma^2)\) and \(\underline{I}_n\) denotes the \(n \times n\) identity matrix. In this case, equation (2.4) becomes

\[\begin{equation} \underline{\tilde{\theta}} = \arg \min_{\underline{\theta} \in \Theta}\left\{ \frac{1}{n}||\underline{y} - \underline{X}' \underline{\beta}||_{2} + \lambda(\underline{\beta}) \right\}= \arg \min_{\underline{\theta} \in \Theta} \left\{ \frac{1}{n}\sum_{i=1}^n (y_i-\underline{x}'_i\underline{\beta})^2+ \lambda(\underline{\beta}) \right\}. \tag{2.5} \end{equation}\]

For the classcal linear regression, the penalty term \(\lambda(\underline{\beta})\) is equal to \(0\). In that case, the optimal parameters \(\hat{\underline{\beta}}\) and \(\hat{\sigma}^2\), obtained from (2.2), can be expressed in a closed form:

\[\begin{eqnarray*} \hat{\underline{\beta}} &=& (\underline{X}'\underline{X})^{-1}\underline{X}'\underline{y},\\ \hat{\sigma}^2 &=& \frac{1}{n}||\underline{y} - \underline{X}' \hat{\underline{\beta}}||_{2} = \frac{1}{n}\sum_{i=1}^n (y_i-\underline{x}'_i\hat{\underline{\beta}})^2= \frac{1}{n}\sum_{i=1}^n (y_i-\hat{y}_i)^2. \end{eqnarray*}\]

On the other hand, in ridge regression, the penalty function is defined as follows:

\[\begin{equation} \lambda(\underline{\beta}) = \lambda \cdot ||\underline{\beta}||_2 = \lambda \sum_{k=1}^p (\beta^k) ^2. \tag{2.6} \end{equation}\]

In that case, the optimal parameters \(\tilde{\underline{\beta}}\) and \(\tilde{\sigma}^2\), obtained from equation (2.4), can also be expressed in a closed form:

\[\begin{eqnarray*} \tilde{\underline{\beta}} &=& (\underline{X}'\underline{X} + \lambda \underline{I}_n)^{-1}\underline{X}'\underline{y},\\ \tilde{\sigma}^2 &=& \frac{1}{n}||\underline{y} - \underline{X}' \tilde{\underline{\beta}}||_{2}. \end{eqnarray*}\]

Note that ridge regression leads to non-zero squared-bias in equation (2.3), but at the benefit of a reduced estimation variance (Hastie, Tibshirani, and Friedman 2009).

Another possible form of penalty, used in the Least Absolute Shrinkage and Selection Operator (LASSO) regression, is given by

\[\begin{equation} \lambda(\underline{\beta}) = \lambda \cdot ||\underline{\beta}||_1 = \lambda \sum_{k=1}^p |\beta^k|. \tag{2.7} \end{equation}\]

In that case, \(\tilde{\underline{\beta}}\) and \(\tilde{\sigma}^2\) have to be obtained by using a numerical optimization procedure.

For a binary dependent variable, i.e., a classification problem, the natural choice for the distribution of \(Y\) is the Bernoulli distribution. The resulting loss function, based on the logarithm of the Bernoulli likelihood, is

\[\begin{equation} L(\underline{y},\underline{p})=-\frac{1}{n}\sum_{i=1}^n \{y_i\ln{p_i}+(1-y_i)\ln{(1-p_i)}\}, \tag{2.8} \end{equation}\]

where \(y_i\) is equal to 0 or 1 in case of “no response” and “response” (or “failure” and “success”), and \(p_i\) is the probability of \(y_i\) being equal to 1. Function (2.8) is often called “log-loss” or “binary cross-entropy” in machine-learning literature.

A popular model for binary data is logistic regression, for which

\[ \ln{\frac{p_i}{1-p_i}}=\underline{x}_i'\underline{\beta}. \]

In that case, the loss function in equation (2.8) becomes equal to

\[\begin{equation} L\{\underline{y},f(\underline{\beta},\underline{X})\}=-\frac{1}{n}\sum_{i=1}^n [y_i\underline{x}_i'\underline{\beta}-\ln\{1+\exp(\underline{x}_i'\underline{\beta})\}]. \tag{2.9} \end{equation}\]

Optimal values of parameters \(\hat{\underline{\beta}}\), resulting from equation (2.2), have to be found by numerical optimization algorithms.

Of course, one can combine the loss functions in equations (2.8) and (2.9) with penalties (2.6) or (2.7) .

For a categorical dependent variable, i.e., a multilabel classification problem, the natural choice for the distribution of \(Y\) is the multinomial distribution. The resulting loss function, in case of \(K\) categories, is given by

\[\begin{equation} L(\underline{Y},\underline{P})=-\frac{1}{n}\sum_{i=1}^n\sum_{k=1}^K y_{ik}\ln{p_{ik}}, \tag{2.10} \end{equation}\]

where \(y_{ik}=1\) if the \(k\)-th category was noted for the \(i\)-th observation and 0 otherwise, and \(p_{ik}\) is the probability of \(y_{ik}\) being equal to 1. Function (2.10) is often called “categorical cross-entropy” in machine-learning literature. Also in this case, optimal parameters \(\hat{\underline{\beta}}\), resulting from equation (2.2), have to be found by numerical optimization algorithms.

## 2.6 Model audit

As indicated in Figure 2.2, the modelling process starts with some crisp early versions that are fine-tuned in consecutive iterations. To arrive at a final model, we usually have got to evaluate (audit) numerous candidate models that. In this book, we introduce techniques that allow:

- decomposing a model’s predictions into components that can be attributed to particular explanatory variables (Chapters 6–9);
- conducting sensitivity analysis for a model’s predictions (Chapter 10–12);
- summarizing the predictive performance of a model (Chapter 15). In particular, the presented measures are usually used to trace the progress in model development.
- assessing the importance of an explanatory variable (Chapter 16). The techniques can be helpful in reducing the set of explanatory variables to be included in a model in the fine-tuning stage.
- evaluating the effect of an explanatory variable on a model’s predictions (Chapters 17–18).
- detailed examination of both overall and instance-specific model performance (Chapter 19). These are residual-diagnostic tools that can help in identifying potential causes that may lead to issues with model performance.

All those techniques can be used to evaluate the current version of a model and to get suggestions for possible improvements. The improvements may be developed and evaluated in the next crisp-modelling or fine-tuning phase.

### References

Biecek, Przemyslaw. 2019. “Model Development Process.” *CoRR* abs/1907.04461. http://arxiv.org/abs/1907.04461.

Boehm, Barry. 1988. “A Spiral Model of Software Development and Enhancement.” * IEEE Computer, IEEE* 21(5): 61–72.

Breiman, Leo. 2001b. “Statistical Modeling: The Two Cultures.” *Statistical Science* 16 (3): 199–231. https://doi.org/10.1214/ss/1009213726.

Chapman, Pete, Julian Clinton, Randy Kerber, Thomas Khabaza, Thomas Reinartz, Colin Shearer, and Rudiger Wirth. 1999. *The CRISP-DM 1.0 Step-by-step data mining guide*. ftp://ftp.software.ibm.com/software/analytics/spss/support/Modeler/Documentation/14/UserManual/CRISP-DM.pdf.

Grolemund, Garrett, and Hadley Wickham. 2017. *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. O’Reilly Media. http://r4ds.had.co.nz/.

Hall, Patrick. 2019. *On Explainable Machine Learning Misconceptions and A More Human-Centered Machine Learning*. https://github.com/jphall663/xai_manualonceptions/blob/master/xai_misconceptions.pdf.

Hastie, T., R. Tibshirani, and J. Friedman. 2009. *The Elements of Statistical Learning. Data Mining, Inference, and Prediction.* 2nd ed. New York, NY, USA: Springer.

Jacobson, Ivar, Grady Booch, and James Rumbaugh. 1999. *The Unified Software Development Process*. Addison-Wesley.

Kruchten, Philippe. 1998. *The Rational Unified Process*. Addison-Wesley.

Nolan, Deborah, and Duncan Temple Lang. 2015. *Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving*. Chapman & Hall/CRC.

Shmueli, G. 2010. “To explain or to predict?” *Statistical Science* 25: 289–310.

Tukey, John W. 1977. *Exploratory Data Analysis*. Addison-Wesley.

Wickham, Hadley, and Garrett Grolemund. 2017. *R for Data Science: Import, Tidy, Transform, Visualize, and Model Data*. 1st ed. O’Reilly Media, Inc.

Wikipedia. 2019. *CRISP DM: Cross-industry standard process for data mining*. https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining.