1 Employee Attrition Prediction¹

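The parsnip package, part of the tidymodels framework, can also be used to implement a super learner, another form of ensemble model. We work toward that step by step, approaching the problem systematically with the purrr and furrr packages.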

2 Dataset²

We reuse the employee attrition data from the earlier post, xwMOOC Models - tidymodels: “caret → parsnip”. The data is preprocessed into a basetable, and a predictive model is then built with parsnip.

Clean the employee attrition data
Split the data into training and test sets
Preprocess the data through feature engineering

library(tidyverse)
library(tidymodels)
library(furrr)
library(tictoc)

library(doParallel)
all_cores <- parallel::detectCores(logical = FALSE)
registerDoParallel(cores = all_cores)

## HR data -----
hr_dat <- read_csv("data/HR_comma_sep.csv") %>% 
  janitor::clean_names()

hr_df <- hr_dat %>% 
  mutate(left = factor(left, levels=c(0,1), labels=c("stay", "left"))) %>%
  mutate(departments = factor(departments),
         work_accident = factor(work_accident),
         salary = factor(salary))

## Training/test split -----

tidy_split <- rsample::initial_split(hr_df, prop = .7, strata = left)
tidy_train <- training(tidy_split)
tidy_test <- testing(tidy_split)
tidy_kfolds <- vfold_cv(tidy_train, v=5)

## Preprocessing -----

tidy_rec <- recipe(left ~ ., data = tidy_train) %>% 
  step_dummy(all_nominal(), -all_outcomes()) %>% 
  step_normalize(all_predictors())
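
To see what the recipe does to the predictors, it can be prepped and baked on a small table. The sketch below is only an illustration: toy_train and its values are hypothetical stand-ins for tidy_train, and only two predictors are included.

```r
library(recipes)

# Hypothetical toy data standing in for tidy_train (values are made up)
toy_train <- tibble::tibble(
  left               = factor(c("stay", "left", "stay", "left", "stay", "stay")),
  salary             = factor(c("low", "medium", "high", "low", "medium", "low")),
  satisfaction_level = c(0.38, 0.80, 0.11, 0.72, 0.37, 0.41)
)

# Same two steps as tidy_rec: dummy-encode nominal predictors, then normalize
toy_rec <- recipe(left ~ ., data = toy_train) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_normalize(all_predictors())

# prep() estimates the statistics; bake(new_data = NULL) returns the processed training data
toy_baked <- toy_rec %>%
  prep() %>%
  bake(new_data = NULL)

names(toy_baked)
```

With the real data the same check would be tidy_rec %>% prep() %>% bake(new_data = NULL). Note that step_normalize(all_predictors()) also rescales the dummy columns created by step_dummy().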

3 Predictive Model

Because the hyperparameters will be chosen by cross-validation on the CV folds, each hyperparameter to be tuned is marked with the tune() function. The model is XGBoost, so the engine is set to xgboost and mode = is set to “classification”.

xgboost_model <- parsnip::boost_tree(
    mode = "classification",
    trees = tune(),
    min_n = tune(),
    tree_depth = tune(),
    learn_rate = tune(),
    loss_reduction = tune()
  ) %>%
    set_engine("xgboost", objective = "binary:logistic")

# Hyperparameter ranges for the grid search
xgboost_params <- dials::parameters(
  trees(),
  min_n(),
  tree_depth(),
  learn_rate(),
  loss_reduction())

xgboost_grid <- dials::grid_max_entropy(
    xgboost_params, 
    size = 100)

knitr::kable(head(xgboost_grid))

trees	min_n	tree_depth	learn_rate	loss_reduction
162	8	6	0.0082793	8.7263165
1190	39	14	0.0000000	4.6640236
137	10	11	0.0000001	0.0365955
1824	23	10	0.0000000	0.0326286
97	39	1	0.0000590	0.0000000
1419	11	9	0.0000000	0.3692713
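
Several learn_rate entries in the table print as 0.0000000. This is most likely a display effect: dials samples learn_rate on a log10 scale, so many candidates are very small and are rounded away at seven decimals. A minimal sketch with hypothetical exponents (not the values drawn above):

```r
# Hypothetical log10 exponents, chosen only for illustration
log10_rates <- c(-9, -7, -4, -2)
rates <- 10^log10_rates

# Fixed notation with 7 decimals hides the smallest values
fixed_display <- formatC(rates, format = "f", digits = 7)

# Scientific notation keeps them distinguishable
sci_display <- formatC(rates, format = "e", digits = 1)

data.frame(fixed_display, sci_display)
```

Printing the grid with scientific notation, or inspecting the column directly, shows that these learning rates are small but not zero.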

4 Workflow Setup

The preprocessing logic was defined in the feature engineering step and the XGBoost model was specified above, so the two are now combined explicitly into a single workflow.

xgboost_wf <- workflows::workflow() %>%
  add_model(xgboost_model) %>% 
  add_recipe(tidy_rec)

5 `XGBoost` Hyperparameter Tuning

The XGBoost hyperparameters declared above are now identified by grid search over the cross-validation folds.

tic()

xgboost_tuned <- tune::tune_grid(
  object = xgboost_wf,
  resamples = tidy_kfolds,
  grid = xgboost_grid,
  metrics = yardstick::metric_set(accuracy, roc_auc),
  control = tune::control_grid(verbose = TRUE)
)

toc()

1103.885 sec elapsed
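
To put the elapsed time in context, the number of model fits follows from the grid size and the number of folds. A rough back-of-the-envelope sketch; the per-fit figure is wall-clock time and assumes nothing about how the work was spread across cores:

```r
n_candidates <- 100      # size = 100 in grid_max_entropy()
n_folds      <- 5        # v = 5 in vfold_cv()
elapsed_sec  <- 1103.885 # wall-clock time reported by toc()

# Each candidate is fitted once per fold
n_fits <- n_candidates * n_folds

# Average wall-clock seconds per fit
sec_per_fit <- elapsed_sec / n_fits

c(n_fits = n_fits, sec_per_fit = round(sec_per_fit, 2))
```

Doubling either the grid size or the number of folds would roughly double the tuning time under the same setup.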

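Once tuning is finished, select_best() picks the hyperparameter combination with the best ROC AUC, and finalize_model() plugs those values into the model specification to produce the final model.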

xgboosted_param <- xgboost_tuned %>% select_best(metric = "roc_auc")

best_xgboost_model <- finalize_model(xgboost_model, xgboosted_param)
best_xgboost_model

Boosted Tree Model Specification (classification)

Main Arguments:
  trees = 1424
  min_n = 6
  tree_depth = 11
  learn_rate = 0.0056149830370537
  loss_reduction = 0.00114667459287856

Engine-Specific Arguments:
  objective = binary:logistic

Computational engine: xgboost

6 Model Evaluation

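The model whose hyperparameters were validated on the cross-validation folds is named best_xgboost_model. It is placed in a workflow() and last_fit() builds the final model: it fits on the training data and predicts on the test data of tidy_split. Performance on the test data is then evaluated.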

production_wf <- workflow() %>% 
  add_model(best_xgboost_model) %>% 
  add_recipe(tidy_rec)

production_result <- last_fit(production_wf, tidy_split)

production_result %>% 
  unnest(.predictions) %>% 
  conf_mat(truth = left, estimate = .pred_class)

          Truth
Prediction stay left
      stay 3407   73
      left   21  998

6.1 Model Deployment

The final workflow is fitted on the full dataset with fit() and saved as an .rds file for deployment.

hr_production_model <- fit(production_wf, hr_df)
saveRDS(hr_production_model, "data/hr_production_model.rds")
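
On the consuming side, the saved object is restored with readRDS() and used for prediction. The sketch below shows only the save/restore round trip; to keep it self-contained, a small base R logistic regression on the built-in mtcars data stands in for the fitted workflow, and the file path is a temporary one.

```r
# Stand-in model (assumption: replaces hr_production_model for illustration only)
stand_in_model <- glm(am ~ wt, data = mtcars, family = binomial())

# Save to a temporary .rds file and restore it, as a deployment target would
model_path <- tempfile(fileext = ".rds")
saveRDS(stand_in_model, model_path)
restored_model <- readRDS(model_path)

# Hypothetical new observations to score
new_obs <- data.frame(wt = c(2.5, 3.5))

pred_original <- predict(stand_in_model, newdata = new_obs, type = "response")
pred_restored <- predict(restored_model, newdata = new_obs, type = "response")

# The restored model should reproduce the original predictions
all.equal(pred_original, pred_restored)
```

For the workflow saved above, the equivalent call would be predict(readRDS("data/hr_production_model.rds"), new_data = ...), with the tidymodels and xgboost packages loaded in the scoring session.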

Written by data scientist Kwangchun Lee (이광춘)

kwangchun.lee.7@gmail.com

xwMOOC Models - `tidymodels`

Employee Attrition Prediction: `tidymodel`

Tidyverse Korea

2020-07-20
