1 From carrot (`caret`) to parsnip (`parsnip`) ¹ ² ³

caret (the "carrot") has now entered maintenance mode. Active development has moved to tidymodels, a framework under which machine learning modeling in R is being extensively taken apart and rebuilt.

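Modular packages used for the machine learning predictive model:

- skimr: exploratory data analysis
- recipes: data preprocessing
- rsample: training/test splits and cross-validation samples
- parsnip: R machine learning API, the counterpart of Python's scikit-learn
- ranger: Random Forest machine learning package
- yardstick: predictive model performance evaluation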

Customer churn model: tidymodels workflow

2 Setup

Load the packages listed above and read in the data.

library(tidyverse)   
library(tidymodels)  
library(skimr)       
library(knitr)

telco <- read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

telco %>% head() %>% kable()

customerID	gender	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	TotalCharges	Churn
7590-VHVEG	Female	Yes	No	1	No	No phone service	DSL	No	Yes	No	No	No	No	Month-to-month	Yes	Electronic check	29.85	29.85	No
5575-GNVDE	Male	No	No	34	Yes	No	DSL	Yes	No	Yes	No	No	No	One year	No	Mailed check	56.95	1889.50	No
3668-QPYBK	Male	No	No	2	Yes	No	DSL	Yes	Yes	No	No	No	No	Month-to-month	Yes	Mailed check	53.85	108.15	Yes
7795-CFOCW	Male	No	No	45	No	No phone service	DSL	Yes	No	Yes	Yes	No	No	One year	No	Bank transfer (automatic)	42.30	1840.75	No
9237-HQITU	Female	No	No	2	Yes	No	Fiber optic	No	No	No	No	No	No	Month-to-month	Yes	Electronic check	70.70	151.65	Yes
9305-CDSKC	Female	No	No	8	Yes	Yes	Fiber optic	No	No	Yes	No	Yes	Yes	Month-to-month	Yes	Electronic check	99.65	820.50	Yes

3 EDA

Exploratory data analysis is carried out with the skimr package.

telco %>% skim()

Skim summary statistics
 n obs: 7043 
 n variables: 21 

── Variable type:character ───────────────────────────────────────────────────────────────────────────
         variable missing complete    n min max empty n_unique
            Churn       0     7043 7043   2   3     0        2
         Contract       0     7043 7043   8  14     0        3
       customerID       0     7043 7043  10  10     0     7043
       Dependents       0     7043 7043   2   3     0        2
 DeviceProtection       0     7043 7043   2  19     0        3
           gender       0     7043 7043   4   6     0        2
  InternetService       0     7043 7043   2  11     0        3
    MultipleLines       0     7043 7043   2  16     0        3
     OnlineBackup       0     7043 7043   2  19     0        3
   OnlineSecurity       0     7043 7043   2  19     0        3
 PaperlessBilling       0     7043 7043   2   3     0        2
          Partner       0     7043 7043   2   3     0        2
    PaymentMethod       0     7043 7043  12  25     0        4
     PhoneService       0     7043 7043   2   3     0        2
  StreamingMovies       0     7043 7043   2  19     0        3
      StreamingTV       0     7043 7043   2  19     0        3
      TechSupport       0     7043 7043   2  19     0        3

── Variable type:numeric ─────────────────────────────────────────────────────────────────────────────
       variable missing complete    n    mean      sd    p0    p25     p50     p75    p100     hist
 MonthlyCharges       0     7043 7043   64.76   30.09 18.25  35.5    70.35   89.85  118.75 ▇▁▃▂▆▅▅▂
  SeniorCitizen       0     7043 7043    0.16    0.37  0      0       0       0       1    ▇▁▁▁▁▁▁▂
         tenure       0     7043 7043   32.37   24.56  0      9      29      55      72    ▇▃▃▂▂▃▃▅
   TotalCharges      11     7032 7043 2283.3  2266.77 18.8  401.45 1397.47 3794.74 8684.8  ▇▃▂▂▁▁▁▁

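4 Data cleaning

customerID is a unique identifier for each row and carries no predictive information, so it is dropped. The few rows with missing values (the 11 missing TotalCharges entries in the skim output) are removed with drop_na().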

telco <- telco %>% 
  select(-customerID) %>% 
  drop_na()
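
To make the effect of drop_na() concrete, here is a minimal sketch on a made-up three-row tibble. The `toy` data and its values are assumptions for illustration, not rows from the Telco file. It drops the identifier column and then every row that contains a missing value.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one row has a missing TotalCharges value
toy <- tibble(
  customerID   = c("A-1", "B-2", "C-3"),
  tenure       = c(1, 0, 34),
  TotalCharges = c(29.85, NA, 1889.50)
)

toy_clean <- toy %>%
  select(-customerID) %>%  # identifiers carry no predictive signal
  drop_na()                # remove rows with any NA

# Number of rows removed because of missing values
nrow(toy) - nrow(toy_clean)
```

Comparing row counts before and after is a quick check that only a small share of the data is lost.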

5 Predictive model development - GLM

5.1 Training/test datasets

The data is split into training and test sets with initial_split() from the rsample package. The training set is then extracted with training() and the test set with testing().

train_test_split <-
    rsample::initial_split(
        data = telco,     
        prop = 0.80   
    ) 

train_tbl <- train_test_split %>% training() 
test_tbl  <- train_test_split %>% testing()
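
initial_split() samples rows at random, so the split changes on every run unless a seed is fixed. The sketch below uses simulated data (the `toy` data frame and the seed value are assumptions). It fixes the seed and also passes a `strata` argument so that the churn rate stays similar in both sets; the code above uses neither.

```r
library(rsample)

# Hypothetical data: 100 customers, 30% of whom churned
toy <- data.frame(
  tenure = 1:100,
  Churn  = factor(rep(c("No", "Yes"), times = c(70, 30)))
)

set.seed(123)  # arbitrary seed, only to make the split reproducible
split_1 <- initial_split(toy, prop = 0.80, strata = "Churn")

set.seed(123)
split_2 <- initial_split(toy, prop = 0.80, strata = "Churn")

train_toy <- training(split_1)
test_toy  <- testing(split_1)

# Same seed, same rows in the training set
identical(train_toy, training(split_2))

# Churn rate in each set
mean(train_toy$Churn == "Yes")
mean(test_toy$Churn == "Yes")
```

Stratifying matters most when the outcome is imbalanced, as it is for churn.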

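5.2 Feature engineering

The recipes package borrows its metaphor from cooking. It covers the preprocessing steps commonly needed when developing machine learning predictive models, such as handling missing values, putting variables on a common scale, creating dummy variables, and removing highly correlated variables.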

churn_recipe <- function(dataset) {
    recipe(Churn ~ ., data = dataset) %>%
        step_string2factor(all_nominal(), -all_outcomes()) %>%
        prep(data = dataset)
}

recipe_prepped <- churn_recipe(dataset = train_tbl)

train_baked <- bake(recipe_prepped, new_data = train_tbl)
test_baked  <- bake(recipe_prepped, new_data = test_tbl)
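
The recipe above only converts strings to factors. As a rough sketch of the other steps mentioned (scaling, dummy variables, dropping highly correlated variables), the example below chains several recipes steps on simulated data. The `toy` data frame, the `YearlyCharges` column, and the 0.9 correlation threshold are assumptions for illustration; this is not the recipe used to produce the results in this post.

```r
library(recipes)

set.seed(1)
n <- 60

# Hypothetical customer data
toy <- data.frame(
  tenure         = runif(n, 1, 72),
  MonthlyCharges = runif(n, 20, 120),
  Contract       = sample(c("Month-to-month", "One year", "Two year"),
                          n, replace = TRUE),
  Churn          = factor(sample(c("No", "Yes"), n, replace = TRUE)),
  stringsAsFactors = FALSE
)
# Nearly a copy of MonthlyCharges, so step_corr() has something to drop
toy$YearlyCharges <- 12 * toy$MonthlyCharges + rnorm(n)

toy_recipe <- recipe(Churn ~ ., data = toy) %>%
  step_string2factor(all_nominal(), -all_outcomes()) %>%       # strings -> factors
  step_normalize(all_numeric(), -all_outcomes()) %>%           # center and scale
  step_dummy(all_nominal(), -all_outcomes()) %>%               # factors -> dummy columns
  step_corr(all_numeric(), -all_outcomes(), threshold = 0.9) %>%  # drop near-duplicates
  prep(training = toy)

toy_baked <- bake(toy_recipe, new_data = toy)
names(toy_baked)
```

The statistics used by each step (means, standard deviations, which columns to drop) are estimated in prep() and then reused by bake(), which is why the test data above is baked with the recipe prepped on the training data.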

5.3 Model fitting

The predictive model is built with parsnip, which plays a role similar to Python's scikit-learn.

logistic_glm <- logistic_reg(mode = "classification") %>%
    set_engine("glm") %>%
    fit(Churn ~ ., data = train_baked)
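
With the "glm" engine, parsnip hands the fitting over to R's own glm() with a binomial family. The sketch below does this directly in base R on simulated data to show where class predictions come from: the model estimates the probability of the second factor level, and a cutoff turns that probability into a label. The data-generating process and the 0.5 cutoff are assumptions made for this illustration.

```r
set.seed(42)
n <- 200

# Simulated customers: churn probability falls as tenure grows
tenure <- runif(n, min = 1, max = 72)
p_true <- plogis(1 - 0.05 * tenure)
toy <- data.frame(
  tenure = tenure,
  Churn  = factor(ifelse(runif(n) < p_true, "Yes", "No"),
                  levels = c("No", "Yes"))
)

# Logistic regression; with a factor outcome glm() models P(second level),
# here P(Churn == "Yes")
fit <- glm(Churn ~ tenure, data = toy, family = binomial)

# Predicted churn probabilities, then class labels at an assumed 0.5 cutoff
prob_yes   <- predict(fit, newdata = toy, type = "response")
pred_class <- factor(ifelse(prob_yes >= 0.5, "Yes", "No"),
                     levels = c("No", "Yes"))

coef(fit)
table(Truth = toy$Churn, Prediction = pred_class)
```

Keeping the probabilities around, rather than only the labels, makes it possible to try cutoffs other than 0.5 when missing a churner is costlier than a false alarm.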

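5.4 Model evaluation

To evaluate the model, predictions are generated for the test data and combined with the actual values in a single data frame so the two can be compared.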

logistic_pred <- logistic_glm %>%
    predict(new_data = test_baked) %>%
    bind_cols(test_baked %>% select(Churn))

logistic_pred %>% head() %>% kable()

.pred_class	Churn
No	Yes
No	No
No	No
No	Yes
No	No
No	No

A confusion matrix is computed and reshaped into an easy-to-read table.

logistic_pred %>%
    conf_mat(Churn, .pred_class) %>%
    pluck(1) %>%
    as_tibble() %>% 
    spread(Prediction, n)

# A tibble: 2 x 3
  Truth    No   Yes
  <chr> <int> <int>
1 No      954    75
2 Yes     187   190

The metrics() function from the yardstick package reports accuracy (together with kappa), which is used to compare predictive performance.

logistic_pred %>%
  yardstick::metrics(Churn, .pred_class)

# A tibble: 2 x 3
  .metric  .estimator .estimate
  <chr>    <chr>          <dbl>
1 accuracy binary         0.814
2 kap      binary         0.476

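In addition, precision (of the customers predicted to be in a class, the share that truly belong to it), recall (of the customers truly in a class, the share the model identifies), and the F1 score are computed. By default yardstick treats the first factor level, here "No", as the event of interest, so the values below describe how well the model identifies customers who stay rather than customers who churn.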

tibble(
  "precision" = precision(logistic_pred, Churn, .pred_class) %>% 
    select(.estimate),
  "recall"    = recall(logistic_pred, Churn, .pred_class) %>% 
  select(.estimate),
  "F1"    = f_meas(logistic_pred, Churn, .pred_class) %>% 
  select(.estimate))

# A tibble: 1 x 3
  precision$.estimate recall$.estimate F1$.estimate
                <dbl>            <dbl>        <dbl>
1               0.836            0.927        0.879
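
The reported metrics can be recomputed by hand from the confusion matrix shown earlier. The base R sketch below does so and makes explicit which class is treated as the event. The helper name `class_metrics` is hypothetical; the counts are copied from the table above.

```r
# Counts copied from the confusion matrix above (rows = truth, columns = prediction)
cm <- matrix(
  c(954,  75,
    187, 190),
  nrow = 2, byrow = TRUE,
  dimnames = list(Truth = c("No", "Yes"), Prediction = c("No", "Yes"))
)

accuracy <- sum(diag(cm)) / sum(cm)

# Hypothetical helper: precision, recall and F1 for one class treated as the event
class_metrics <- function(cm, event) {
  tp        <- cm[event, event]
  precision <- tp / sum(cm[, event])  # among cases predicted as `event`
  recall    <- tp / sum(cm[event, ])  # among cases that truly are `event`
  c(precision = precision,
    recall    = recall,
    f1        = 2 * precision * recall / (precision + recall))
}

accuracy
class_metrics(cm, "No")   # matches the precision, recall and F1 reported above
class_metrics(cm, "Yes")  # the same metrics with churners as the event
```

With "Yes" as the event, precision is 190/265 (about 0.717) and recall is 190/377 (about 0.504), so the model misses roughly half of the customers who actually churn even though overall accuracy is about 0.81.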

6 Random Forest - `ranger`

6.1 Cross-validation samples

Ten-fold cross-validation samples are created from the training data with the vfold_cv() function.

cross_val_tbl <- vfold_cv(train_tbl, v = 10)
cross_val_tbl

#  10-fold cross-validation 
# A tibble: 10 x 2
   splits             id    
   <named list>       <chr> 
 1 <split [5.1K/563]> Fold01
 2 <split [5.1K/563]> Fold02
 3 <split [5.1K/563]> Fold03
 4 <split [5.1K/563]> Fold04
 5 <split [5.1K/563]> Fold05
 6 <split [5.1K/563]> Fold06
 7 <split [5.1K/562]> Fold07
 8 <split [5.1K/562]> Fold08
 9 <split [5.1K/562]> Fold09
10 <split [5.1K/562]> Fold10

cross_val_tbl %>% pluck("splits", 1)

<5063/563/5626>
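
The fold sizes in the output follow from simple arithmetic: 5626 training rows do not divide evenly by 10, so some folds hold one more row than the others. The base R sketch below reproduces the sizes printed above. Giving the extra rows to the first folds is an assumption chosen to match that output.

```r
n_train <- 5626  # rows in the training set, as printed in the split above
v       <- 10    # number of folds

# The first (n_train %% v) folds get one extra row
fold_sizes <- rep(n_train %/% v, v) + (seq_len(v) <= n_train %% v)
fold_sizes

# Rows left for fitting (analysis) when each fold is held out (assessment)
analysis_sizes <- n_train - fold_sizes
analysis_sizes[1]
```

Each split therefore fits on roughly 90% of the training data and is assessed on the remaining 10%.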

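6.2 Fitting the `ranger` model

A fit_ranger_model() function is defined that fits a ranger model on the analysis portion of a fold and returns predictions for the assessment portion.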

fit_ranger_model <- function(split, id, try, tree) {
    
    analysis_set <- split %>% analysis()
    analysis_prepped <- analysis_set %>% churn_recipe()
    analysis_baked <- analysis_prepped %>% bake(new_data = analysis_set)
    
    model_rf <-
        rand_forest(
            mode = "classification",
            mtry = try,
            trees = tree
        ) %>%
        set_engine("ranger",
                   importance = "impurity"
        ) %>%
        fit(Churn ~ ., data = analysis_baked)
    
    # Bake the assessment set with the recipe prepped on the analysis set,
    # so both sets go through identical preprocessing
    assessment_set     <- split %>% assessment()
    assessment_baked   <- analysis_prepped %>% bake(new_data = assessment_set)
    
    tibble(
        "id" = id,
        "truth" = assessment_baked$Churn,
        "prediction" = model_rf %>%
            predict(new_data = assessment_baked) %>%
            unlist()
    )
}

pred_rf <- map2_df(
    .x = cross_val_tbl$splits,
    .y = cross_val_tbl$id,
    ~ fit_ranger_model(split = .x, id = .y, try = 3, tree = 200)
)

head(pred_rf)

# A tibble: 6 x 3
  id     truth prediction
  <chr>  <fct> <fct>     
1 Fold01 No    No        
2 Fold01 No    No        
3 Fold01 Yes   No        
4 Fold01 Yes   Yes       
5 Fold01 No    No        
6 Fold01 No    No

6.3 `ranger` performance evaluation

The ranger predictions are piped through conf_mat() and then summary() to extract the performance metrics.

pred_rf %>%
    conf_mat(truth, prediction) %>%
    summary() %>%
    select(-.estimator) %>%
    filter(.metric %in% c("accuracy", "precision", "recall", "f_meas")) %>%
    kable()

.metric	.estimate
accuracy	0.7948809
precision	0.8293546
recall	0.9075955
f_meas	0.8667129
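
The summary above pools the predictions from all ten folds into one confusion matrix. Cross-validation also makes it possible to see how much performance varies from fold to fold. The sketch below computes accuracy per fold on a small made-up tibble shaped like pred_rf; the `toy_pred` values are assumptions for illustration.

```r
library(dplyr)

lv <- c("No", "Yes")

# Hypothetical predictions for two folds, in the same shape as pred_rf
toy_pred <- tibble(
  id         = rep(c("Fold01", "Fold02"), each = 4),
  truth      = factor(c("No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes"), levels = lv),
  prediction = factor(c("No", "No", "No",  "Yes", "No", "Yes", "No", "Yes"), levels = lv)
)

# Accuracy within each fold
fold_accuracy <- toy_pred %>%
  group_by(id) %>%
  summarise(accuracy = mean(truth == prediction))

fold_accuracy

# Mean and spread of accuracy across folds
fold_accuracy %>%
  summarise(mean_accuracy = mean(accuracy), sd_accuracy = sd(accuracy))
```

Applied to pred_rf, the same grouping would give ten accuracy values, and their spread indicates how stable the estimate is.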

7 Comparing the two models

glm_metrics <- logistic_pred %>%
    conf_mat(Churn, .pred_class) %>%
    summary() %>%
    select(-.estimator) %>%
    filter(.metric %in% c("accuracy", "precision", "recall", "f_meas")) %>% 
    rename(GLM = .estimate)

rf_metrics <- pred_rf %>%
    conf_mat(truth, prediction) %>%
    summary() %>%
    select(-.estimator) %>%
    filter(.metric %in% c("accuracy", "precision", "recall", "f_meas")) %>% 
    rename(RF = .estimate)

inner_join(glm_metrics, rf_metrics, by = ".metric") %>% 
  kable()

.metric	GLM	RF
accuracy	0.8136558	0.7948809
precision	0.8361087	0.8293546
recall	0.9271137	0.9075955
f_meas	0.8792627	0.8667129

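¹ Parsnip: noun, botany. An annual or biennial plant of the family Apiaceae. Native to Eurasia, it was introduced to America in the early 17th century and became naturalized. When sown in spring, it yields a full, starch-rich root by the end of summer. It tastes sweet and is usually cooked and eaten as a vegetable.
² Diego Usai on November 18, 2019, “Customer Churn Modeling using Machine Learning with parsnip”
³ Diego Usai on November 18, 2019, “Customer Churn Modeling using Machine Learning with parsnip”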

xwMOOC Models

Customer churn - `tidymodels`

xwMOOC

2019-11-22
