2.2 Use case: Regression. Apartment prices in Warsaw

To illustrate applications of DALEX to regression problems, we will use the artificial dataset apartments, available in the DALEX package. Our goal is to predict the price per square meter of an apartment from five features: construction year, surface, floor, number of rooms, and district. The first four of these variables are numeric, while the fifth, district, is categorical. Prices are given in euros.

Artificial dataset about apartment prices in Warsaw. The goal here is to predict the price per square meter for a new apartment.

m2.price  construction.year  surface  floor  no.rooms  district
    5897               1953       25      3         1  Srodmiescie
    1818               1992      143      9         5  Bielany
    3643               1937       56      1         2  Praga
    3517               1995       93      7         3  Ochota
    3013               1992      144      6         5  Mokotow
    5795               1926       61      6         2  Srodmiescie
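The rows above can be inspected directly in R. This is a minimal sketch, assuming the DALEX package is installed; it only loads the package and prints the first rows and the column types.

```r
# Load DALEX, which ships the artificial `apartments` dataset.
library("DALEX")

# Show the first six rows; the columns should match the table above.
head(apartments)

# Inspect column types; `district` is expected to be categorical.
str(apartments)
```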

2.2.1 Model 1: Linear regression

The first model is based on linear regression. It will be a simple model without any feature engineering.
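A minimal sketch of how such a model might be fitted is given here. The formula is copied from the `Call:` line of the summary that follows; the object name `apartments_lm_model` is an assumption made for this example.

```r
library("DALEX")

# Formula taken from the `Call:` line of the summary output.
# `apartments_lm_model` is a name chosen for this sketch.
apartments_lm_model <- lm(
  m2.price ~ construction.year + surface + floor + no.rooms + district,
  data = apartments
)

# Print coefficients, residual quantiles, and fit statistics.
summary(apartments_lm_model)
```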

## 
## Call:
## lm(formula = m2.price ~ construction.year + surface + floor + 
##     no.rooms + district, data = apartments)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -247.5 -202.8 -172.8  381.4  469.0 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5020.1391   682.8721   7.352 4.11e-13 ***
## construction.year     -0.2290     0.3483  -0.657   0.5110    
## surface              -10.2378     0.5778 -17.720  < 2e-16 ***
## floor                -99.4820     3.0874 -32.222  < 2e-16 ***
## no.rooms             -37.7299    15.8440  -2.381   0.0174 *  
## districtBielany       17.2144    40.4502   0.426   0.6705    
## districtMokotow      918.3802    39.4386  23.286  < 2e-16 ***
## districtOchota       926.2540    40.5279  22.855  < 2e-16 ***
## districtPraga        -37.1047    40.8930  -0.907   0.3644    
## districtSrodmiescie 2080.6110    40.0149  51.996  < 2e-16 ***
## districtUrsus         29.9419    39.7249   0.754   0.4512    
## districtUrsynow      -18.8651    39.7565  -0.475   0.6352    
## districtWola         -16.8912    39.6283  -0.426   0.6700    
## districtZoliborz     889.9735    40.4099  22.024  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 281.3 on 986 degrees of freedom
## Multiple R-squared:  0.905,  Adjusted R-squared:  0.9037 
## F-statistic: 722.5 on 13 and 986 DF,  p-value: < 2.2e-16

We also have a second dataset, apartmentsTest, that can be used to validate the model. The root mean square error computed on this validation data is shown below.

## [1] 283.0865
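The reported value is on the same scale as the residual standard error of the model, so the sketch below computes the square root of the mean squared error on the validation data. The object names are assumptions made for this example, and the dataset name apartmentsTest is taken from the text; it may differ between DALEX releases.

```r
library("DALEX")

# Refit the linear model so that this block is self-contained.
apartments_lm_model <- lm(
  m2.price ~ construction.year + surface + floor + no.rooms + district,
  data = apartments
)

# Predict the price per square meter for the validation apartments.
predicted_lm <- predict(apartments_lm_model, newdata = apartmentsTest)

# Root mean square error: square root of the mean squared difference
# between predictions and observed prices.
rmse_lm <- sqrt(mean((predicted_lm - apartmentsTest$m2.price)^2))
rmse_lm
```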

To create an explainer for the regression model, it is enough to call the explain() function with the model, data, and y arguments. The next chapter shows how to use this explainer.
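A minimal sketch of such a call might look as follows. Building the explainer on the validation data, the object names, and the `label` value are assumptions made for this example, not something the text prescribes.

```r
library("DALEX")

apartments_lm_model <- lm(
  m2.price ~ construction.year + surface + floor + no.rooms + district,
  data = apartments
)

# Feature columns only; the target is passed separately through `y`.
features <- c("construction.year", "surface", "floor", "no.rooms", "district")

# Wrap the model together with the data it should be explained on.
explainer_lm <- explain(
  apartments_lm_model,
  data  = apartmentsTest[, features],
  y     = apartmentsTest$m2.price,
  label = "lm"  # display name chosen for this sketch
)
```

Passing only the feature columns as `data` keeps the target out of the variables that the explainer later perturbs.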

2.2.2 Model 2: Random forest

The second model is based on a random forest. It is a very flexible out-of-the-box model.
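A minimal sketch of fitting such a model with the randomForest package follows. The formula is copied from the `Call:` line of the output; the object name and the seed value are assumptions made for this example, so the printed figures may differ slightly from those shown.

```r
library("DALEX")
library("randomForest")

# Fix the random seed so repeated runs give the same forest.
# The value 1 is arbitrary and chosen only for this sketch.
set.seed(1)

# Same formula as the linear model, with default hyperparameters.
apartments_rf_model <- randomForest(
  m2.price ~ construction.year + surface + floor + no.rooms + district,
  data = apartments
)

# Print the forest type, number of trees, and out-of-bag error summary.
apartments_rf_model
```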

## 
## Call:
##  randomForest(formula = m2.price ~ construction.year + surface +      floor + no.rooms + district, data = apartments) 
##                Type of random forest: regression
##                      Number of trees: 500
## No. of variables tried at each split: 1
## 
##           Mean of squared residuals: 82614.7
##                     % Var explained: 89.94

The root mean square error computed on the apartmentsTest dataset is shown below.

## [1] 286.5357
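The same validation measure can be computed for the random forest. The sketch below wraps the calculation in a small helper; the helper `rmse`, the object names, and the seed are all hypothetical and introduced only for this example.

```r
library("DALEX")
library("randomForest")

set.seed(1)  # arbitrary seed, an assumption of this sketch
apartments_rf_model <- randomForest(
  m2.price ~ construction.year + surface + floor + no.rooms + district,
  data = apartments
)

# Hypothetical helper: root mean square error of a model on new data.
rmse <- function(model, newdata, observed) {
  predicted <- predict(model, newdata = newdata)
  sqrt(mean((predicted - observed)^2))
}

rmse_rf <- rmse(apartments_rf_model, apartmentsTest, apartmentsTest$m2.price)
rmse_rf
```

Because a random forest is fitted with random sampling, the result varies slightly from run to run unless a seed is fixed.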

We will also create an explainer for the random forest model. The next chapter shows how to use it.
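The explainer for the random forest can be sketched in the same way as for the linear model. As before, the use of the validation data, the object names, the seed, and the `label` value are assumptions made for this example.

```r
library("DALEX")
library("randomForest")

set.seed(1)  # arbitrary seed, an assumption of this sketch
apartments_rf_model <- randomForest(
  m2.price ~ construction.year + surface + floor + no.rooms + district,
  data = apartments
)

# Feature columns only; the target is passed separately through `y`.
features <- c("construction.year", "surface", "floor", "no.rooms", "district")

explainer_rf <- explain(
  apartments_rf_model,
  data  = apartmentsTest[, features],
  y     = apartmentsTest$m2.price,
  label = "rf"  # display name chosen for this sketch
)
```

Giving each explainer a distinct label makes it easier to tell the two models apart when their explanations are later compared.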

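These two models have nearly identical performance on the validation data (an error of about 283 for the linear model versus about 287 for the random forest). Which one should be used?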