3 Do-it-yourself with R
In this book, we introduce various methods for instance-level and dataset-level exploration and explanation of predictive models. Each chapter includes a section with code snippets for R and Python that show how to use a particular method. In this chapter, we briefly describe the steps needed to set up the R environment with the required libraries.
3.1 What to install?
The R software (R Core Team 2018) is required. It is a good idea to use the newest release; version 3.6 or later is recommended. R can be downloaded from the CRAN website, https://cran.r-project.org/.
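A quick way to confirm that the installed R meets this recommendation is to compare the running version against 3.6.0. The following is a minimal sketch that uses only base R; the variable names are illustrative.

```r
# Version of the R interpreter that is currently running
current_version <- getRversion()

# Minimum version recommended in this chapter
minimum_version <- "3.6.0"

# TRUE when the running version meets the recommendation
r_is_recent <- current_version >= minimum_version

if (r_is_recent) {
  message("R ", as.character(current_version),
          " meets the recommended minimum.")
} else {
  message("R ", as.character(current_version), " is older than ",
          minimum_version, "; consider upgrading from CRAN.")
}
```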
A good editor makes working with R much easier. There are plenty of choices but, especially for beginners, it is worth considering the RStudio editor, an open-source and enterprise-ready tool for R. It can be downloaded from https://www.rstudio.com/.
Once R and the editor are available, the required packages should be installed.
The most important one is the DALEX package in version 1.0 or newer. It is the entry point to the solutions introduced in this book. The package can be installed by executing the following command from the R command line:
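Assuming installation from CRAN with the standard install.packages() function, the command is as shown below. The version check on the last line is an optional addition, not part of the installation itself.

```r
# Install DALEX from CRAN (requires an internet connection)
install.packages("DALEX")

# Optional check: TRUE when the installed version is 1.0 or newer
packageVersion("DALEX") >= "1.0"
```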
DALEX will automatically take care of installing its requirements (the packages it depends on), such as the ggplot2 package for data visualization and iBreakDown, with specific methods for model exploration.
3.2 How to work with DALEX?
To conduct model exploration with DALEX, a model first has to be created. Then the model has to be prepared for exploration.
There are many packages in R that can be used to construct a model. Some packages are algorithm-specific, like randomForest for random-forest classification and regression models (Liaw and Wiener 2002), gbm for generalized boosted regression models (Ridgeway 2017), rms with extensions for generalized linear models (Harrell Jr 2018), and many others. There are also a number of packages that can be used for constructing models with different algorithms. These include the h2o package (LeDell et al. 2019), caret (Jed Wing et al. 2016) and its successor parsnip (Kuhn and Vaughan 2019), the very powerful and extensible framework mlr (Bischl et al. 2016), and keras, which is a wrapper for the Python library of the same name (Allaire and Chollet 2019).
While it is great to have such a large choice of tools for constructing models, the disadvantage is that different packages have different interfaces and different arguments. Moreover, model objects created with different packages may have different internal structures. The main goal of the DALEX package is to create a level of abstraction around a model that makes it easier to explore and explain the model.
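The interface problem can be seen even within base R, where two model classes need different predict() arguments to return predictions on the scale of the outcome. The sketch below bundles each model with its own prediction function in a plain list so that both can be queried through the same call. It only illustrates the idea of a uniform wrapper and is not how DALEX is implemented; the helpers wrap_model() and predict_wrapped() are hypothetical, and the built-in mtcars data serve as a stand-in dataset.

```r
# Two models whose predict() methods need different arguments
linear_model   <- lm(mpg ~ wt + hp, data = mtcars)
logistic_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# Hypothetical helper: bundle a model with a function that knows
# how to obtain numeric predictions from it
wrap_model <- function(model, predict_function, label) {
  list(model = model, predict_function = predict_function, label = label)
}

# Hypothetical helper: one uniform way to get predictions
predict_wrapped <- function(wrapped, newdata) {
  wrapped$predict_function(wrapped$model, newdata)
}

wrapped_linear <- wrap_model(
  linear_model,
  function(m, newdata) predict(m, newdata),
  label = "linear regression"
)
wrapped_logistic <- wrap_model(
  logistic_model,
  # predict.glm() returns log-odds unless type = "response" is given
  function(m, newdata) predict(m, newdata, type = "response"),
  label = "logistic regression"
)

# The same call now works for both models
new_cars <- head(mtcars, 3)
pred_linear   <- predict_wrapped(wrapped_linear, new_cars)
pred_logistic <- predict_wrapped(wrapped_logistic, new_cars)
```

The calling code no longer needs to know which arguments each model class expects; that knowledge is stored once, next to the model.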
DALEX::explain() is the function for model wrapping. It has only one required argument, model, which specifies the model object containing the fitted form of the model. However, the function accepts additional arguments that extend its functionality. They will be discussed in Section 5.2.6.
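As a minimal sketch of such a call, the example below fits a linear regression and wraps it with DALEX::explain(). The mtcars data and the model formula are stand-ins chosen for illustration, not examples from this book; data, y, and label are among the optional arguments of explain().

```r
library("DALEX")

# Fit a simple linear regression on the built-in mtcars data
# (a stand-in dataset used only for this illustration)
model <- lm(mpg ~ wt + hp, data = mtcars)

# Wrap the fitted model in an explainer object
explainer <- DALEX::explain(
  model,
  data  = mtcars[, c("wt", "hp")],  # explanatory variables
  y     = mtcars$mpg,               # observed values of the dependent variable
  label = "linear regression"       # name used to identify the model
)

# The result is an object of class "explainer"
class(explainer)
```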
3.3 How to work with archivist?
As we will focus on the exploration of predictive models, we prefer not to spend space or time on repeating the code necessary for model development. This is where the archivist package helps.
The archivist package (Biecek and Kosinski 2017) is designed to store, share, and manage R objects. We will use it to easily access pretrained R models and precalculated explainers. To install the package, the following command should be executed in the R command line:
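Assuming installation from CRAN with the standard install.packages() function, the command is:

```r
# Install archivist from CRAN (requires an internet connection)
install.packages("archivist")
```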
Once the package has been installed, the function aread() can be used to retrieve R objects from any remote repository. For this book, we use the GitHub repository models, hosted at https://github.com/pbiecek/models. For instance, to download a model with the md5 hash ceb40, the following command has to be executed:
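Assuming the object is addressed in the user/repository/hash form accepted by aread(), the call can be written as follows. The variable name model is illustrative.

```r
# Retrieve the object identified by the hash ceb40 from the
# GitHub repository pbiecek/models (requires an internet connection)
model <- archivist::aread("pbiecek/models/ceb40")
```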
Since the md5 hash ceb40 uniquely defines the model, referring to the repository object results in using exactly the same model and the same explanations. Thus, in the subsequent chapters, pre-constructed models will be accessed with archivist hooks. In the following sections, we will also use archivist hooks when referring to datasets.
Allaire, JJ, and François Chollet. 2019. Keras: R Interface to ’Keras’. https://CRAN.R-project.org/package=keras.
Biecek, Przemyslaw, and Marcin Kosinski. 2017. “archivist: An R Package for Managing, Recording and Restoring Data Analysis Results.” Journal of Statistical Software 82 (11): 1–28. https://doi.org/10.18637/jss.v082.i11.
Bischl, Bernd, Michel Lang, Lars Kotthoff, Julia Schiffner, Jakob Richter, Erich Studerus, Giuseppe Casalicchio, and Zachary M. Jones. 2016. “mlr: Machine Learning in R.” Journal of Machine Learning Research 17 (170): 1–5. http://jmlr.org/papers/v17/15-066.html.
Harrell Jr, Frank E. 2018. Rms: Regression Modeling Strategies. https://CRAN.R-project.org/package=rms.
Jed Wing, Max Kuhn. Contributions from, Steve Weston, Andre Williams, Chris Keefer, Allan Engelhardt, Tony Cooper, Zachary Mayer, et al. 2016. Caret: Classification and Regression Training. https://CRAN.R-project.org/package=caret.
Kuhn, Max, and Davis Vaughan. 2019. Parsnip: A Common API to Modeling and Analysis Functions. https://CRAN.R-project.org/package=parsnip.
LeDell, Erin, Navdeep Gill, Spencer Aiello, Anqi Fu, Arno Candel, Cliff Click, Tom Kraljevic, et al. 2019. H2o: R Interface for ’H2o’. https://CRAN.R-project.org/package=h2o.
Liaw, Andy, and Matthew Wiener. 2002. “Classification and Regression by randomForest.” R News 2 (3): 18–22. http://CRAN.R-project.org/doc/Rnews/.
R Core Team. 2018. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.R-project.org/.
Ridgeway, Greg. 2017. Gbm: Generalized Boosted Regression Models. https://CRAN.R-project.org/package=gbm.