To illustrate applications of DALEX to regression problems, we will use the artificial dataset `apartments`, available in the `DALEX` package. Our goal is to predict the price per square meter of an apartment based on selected features: construction year, surface, floor, number of rooms, and district. Four of these variables are continuous, while the fifth (district) is categorical. Prices are given in Euro.

Artificial dataset about apartment prices in Warsaw. The goal here is to predict the price per square meter for a new apartment.

m2.price | construction.year | surface | floor | no.rooms | district |
---|---|---|---|---|---|
5897 | 1953 | 25 | 3 | 1 | Srodmiescie |
1818 | 1992 | 143 | 9 | 5 | Bielany |
3643 | 1937 | 56 | 1 | 2 | Praga |
3517 | 1995 | 93 | 7 | 3 | Ochota |
3013 | 1992 | 144 | 6 | 5 | Mokotow |
5795 | 1926 | 61 | 6 | 2 | Srodmiescie |
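The split between continuous and categorical variables can be checked directly in R. The snippet below is a minimal sketch, assuming the `DALEX` package is installed and exposes `apartments` as a data frame. The variable name `column_classes` is introduced here only for illustration.

```r
library("DALEX")  # assumed to provide the apartments data frame

# Class of every column: numeric/integer columns are the continuous
# variables, a factor (or character) column is the categorical one.
column_classes <- sapply(apartments, class)
print(column_classes)

# Distinct values of the categorical predictor
print(unique(as.character(apartments$district)))
```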

The first model is based on linear regression. It will be a simple model without any feature engineering.

```
library("DALEX")  # provides the apartments dataset
apartments_lm_model <- lm(m2.price ~ construction.year + surface + floor +
                            no.rooms + district, data = apartments)
summary(apartments_lm_model)
```

```
##
## Call:
## lm(formula = m2.price ~ construction.year + surface + floor +
## no.rooms + district, data = apartments)
##
## Residuals:
## Min 1Q Median 3Q Max
## -247.5 -202.8 -172.8 381.4 469.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5020.1391 682.8721 7.352 4.11e-13 ***
## construction.year -0.2290 0.3483 -0.657 0.5110
## surface -10.2378 0.5778 -17.720 < 2e-16 ***
## floor -99.4820 3.0874 -32.222 < 2e-16 ***
## no.rooms -37.7299 15.8440 -2.381 0.0174 *
## districtBielany 17.2144 40.4502 0.426 0.6705
## districtMokotow 918.3802 39.4386 23.286 < 2e-16 ***
## districtOchota 926.2540 40.5279 22.855 < 2e-16 ***
## districtPraga -37.1047 40.8930 -0.907 0.3644
## districtSrodmiescie 2080.6110 40.0149 51.996 < 2e-16 ***
## districtUrsus 29.9419 39.7249 0.754 0.4512
## districtUrsynow -18.8651 39.7565 -0.475 0.6352
## districtWola -16.8912 39.6283 -0.426 0.6700
## districtZoliborz 889.9735 40.4099 22.024 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 281.3 on 986 degrees of freedom
## Multiple R-squared: 0.905, Adjusted R-squared: 0.9037
## F-statistic: 722.5 on 13 and 986 DF, p-value: < 2.2e-16
```

We also have a second dataset, `apartmentsTest`, that can be used for validation of the model. Below we compute the root mean square error (RMSE) on this validation data.

```
predicted_mi2_lm <- predict(apartments_lm_model, apartmentsTest)
sqrt(mean((predicted_mi2_lm - apartmentsTest$m2.price)^2))
```

`## [1] 283.0865`
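The expression used above takes the square root of the mean squared difference between predictions and observed prices. As a sketch, it can be wrapped in a small helper. The function `rmse()` and the toy vectors are hypothetical, introduced here for illustration only, and are not part of `DALEX`.

```r
# Hypothetical helper: root mean square error between observed and
# predicted values (same formula as the expression used above).
rmse <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  sqrt(mean((predicted - observed)^2))
}

# Toy data (made up for illustration): errors are 100, -100 and 200
observed  <- c(3000, 4000, 5000)
predicted <- c(3100, 3900, 5200)
toy_rmse  <- rmse(observed, predicted)
print(toy_rmse)
```

With the models from this chapter, the same helper could be called as `rmse(apartmentsTest$m2.price, predicted_mi2_lm)`.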

To create an explainer for the regression model, it is enough to call the `explain()` function with the `model`, `data`, and `y` parameters. In the next chapter we will show how to use this explainer.
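A call for the linear model might look as follows. This is a sketch that mirrors the random forest explainer created later in this chapter. The object name `explainer_lm` is an assumption, and the column selection `2:6` assumes that the first column of `apartmentsTest` holds `m2.price`.

```r
library("DALEX")

# Refit the linear model so that the snippet is self-contained
apartments_lm_model <- lm(m2.price ~ construction.year + surface + floor +
                            no.rooms + district, data = apartments)

# model: the fitted model; data: validation predictors (columns 2 to 6);
# y: observed prices per square meter for the same rows
explainer_lm <- explain(apartments_lm_model,
                        data = apartmentsTest[, 2:6],
                        y = apartmentsTest$m2.price)
```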

The second model is based on the random forest. It is a very flexible out-of-the-box model.

```
library("randomForest")
set.seed(59)
apartments_rf_model <- randomForest(m2.price ~ construction.year + surface + floor +
                                      no.rooms + district, data = apartments)
apartments_rf_model
```

```
##
## Call:
## randomForest(formula = m2.price ~ construction.year + surface + floor + no.rooms + district, data = apartments)
## Type of random forest: regression
## Number of trees: 500
## No. of variables tried at each split: 1
##
## Mean of squared residuals: 82614.7
## % Var explained: 89.94
```

Below is the root mean square error calculated for the `apartmentsTest` dataset.

```
predicted_mi2_rf <- predict(apartments_rf_model, apartmentsTest)
sqrt(mean((predicted_mi2_rf - apartmentsTest$m2.price)^2))
```

`## [1] 286.5357`

We will also create an explainer for the random forest model. In the next chapter we will show how to use this explainer.

```
explainer_rf <- explain(apartments_rf_model,
                        data = apartmentsTest[, 2:6],
                        y = apartmentsTest$m2.price)
```

**These two models have nearly identical performance!** Their root mean square errors on the test data are about 283 and 287. Which one should be used?