Here we use the HR churn data (https://www.kaggle.com/ludobenistant/hr-analytics/data) to show how the breakDown package explains predictions from ranger models.

The data is included in the breakDown package as HR_data:

library(breakDown)
head(HR_data, 3)
#>   satisfaction_level last_evaluation number_project average_montly_hours
#> 1               0.38            0.53              2                  157
#> 2               0.80            0.86              5                  262
#> 3               0.11            0.88              7                  272
#>   time_spend_company Work_accident left promotion_last_5years sales salary
#> 1                  3             0    1                     0 sales    low
#> 2                  6             0    1                     0 sales medium
#> 3                  4             0    1                     0 sales medium

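Now let’s create a ranger classification forest for churn, which is recorded in the left variable. Because left is stored as a number, it is converted to a factor first so that ranger fits a classification forest rather than a regression forest.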

library(ranger)
HR_data$left <- factor(HR_data$left)
model <- ranger(left ~ ., data = HR_data, importance = 'impurity', min.node.size = 10)

Impurity-based variable importance, computed across all trees in the forest:

importance(model)
#>    satisfaction_level       last_evaluation        number_project 
#>           1833.409055            616.910745            935.699387 
#>  average_montly_hours    time_spend_company         Work_accident 
#>            755.414201            989.271500             28.037134 
#> promotion_last_5years                 sales                salary 
#>              4.997529             42.910538             30.136999
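
The raw importance scores are easier to compare once they are ranked. The following is a minimal sketch that assumes the values printed above; they are copied by hand into a named vector (imp, imp_sorted and imp_share are illustrative names), and a refitted forest will give somewhat different numbers because the model is randomized.

```r
# Impurity importances copied from the importance(model) output above.
# With the fitted model at hand, imp <- importance(model) gives the same vector.
imp <- c(
  satisfaction_level    = 1833.409055,
  last_evaluation       = 616.910745,
  number_project        = 935.699387,
  average_montly_hours  = 755.414201,
  time_spend_company    = 989.271500,
  Work_accident         = 28.037134,
  promotion_last_5years = 4.997529,
  sales                 = 42.910538,
  salary                = 30.136999
)

# Rank the predictors from most to least important
imp_sorted <- sort(imp, decreasing = TRUE)

# Express each score as a share of the total importance
imp_share <- round(imp_sorted / sum(imp_sorted), 3)
print(imp_share)
```

In this run satisfaction_level ranks first and promotion_last_5years last. These are global scores for the whole forest, so they say nothing about any single prediction.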

But how can we understand which factors drive the prediction for a single observation?

With the breakDown package!

Explanations of the tree votes for a single observation (row 1159):

library(ggplot2)
explain_1 <- broken(model, HR_data[1159,])
explain_1
#>                            contribution
#> time_spend_company = 2            0.045
#> satisfaction_level = 0.57         0.044
#> average_montly_hours = 219        0.039
#> number_project = 4                0.038
#> last_evaluation = 0.85            0.037
#> Work_accident = 1                 0.030
#> sales = sales                     0.019
#> salary = medium                   0.015
#> promotion_last_5years = 0         0.003
#> final_prognosis                   0.270
#> baseline:  0.5
plot(explain_1) + scale_y_continuous(limits = c(0, 1), name = "fraction of trees", expand = c(0, 0))
#> Scale for 'y' is already present. Adding another scale for 'y', which
#> will replace the existing scale.
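
The printed table for row 1159 can be checked by hand. The sketch below assumes that the contributions are additive on the "fraction of trees" scale used for the plot axis, so that the baseline plus the sum of the contributions gives the forest's vote share for this observation. The numbers are copied from the output above; the variable names are illustrative.

```r
# Contributions copied from the printed explain_1 output for HR_data[1159, ]
contrib <- c(
  time_spend_company    = 0.045,
  satisfaction_level    = 0.044,
  average_montly_hours  = 0.039,
  number_project        = 0.038,
  last_evaluation       = 0.037,
  Work_accident         = 0.030,
  sales                 = 0.019,
  salary                = 0.015,
  promotion_last_5years = 0.003
)
baseline <- 0.5  # baseline reported under the table

# The individual contributions add up to the final_prognosis row
final_prognosis <- sum(contrib)

# Assumption: baseline + final_prognosis is the fraction of trees shown in the plot
fraction_of_trees <- baseline + final_prognosis

print(round(c(final_prognosis = final_prognosis,
              fraction_of_trees = fraction_of_trees), 3))
```

Under that assumption the forest's vote share for this observation is about 0.77, with each variable moving it up from the 0.5 baseline by a similar, small amount.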


The same explanation for another observation (row 10099), where every contribution is negative:

explain_1 <- broken(model, HR_data[10099,])
explain_1
#>                            contribution
#> satisfaction_level = 0.73        -0.045
#> time_spend_company = 5           -0.045
#> last_evaluation = 0.83           -0.045
#> average_montly_hours = 266       -0.044
#> number_project = 5               -0.043
#> salary = low                     -0.025
#> Work_accident = 0                -0.021
#> sales = sales                    -0.020
#> promotion_last_5years = 0        -0.012
#> final_prognosis                  -0.300
#> baseline:  0.5
plot(explain_1) + scale_y_continuous(limits = c(0, 1), name = "fraction of trees", expand = c(0, 0))
#> Scale for 'y' is already present. Adding another scale for 'y', which
#> will replace the existing scale.

This is not the right approach.