1 Explaining Customer Churn

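Explaining a well-performing predictive model no longer stops at the model itself. It now extends to explaining the model to the business that uses it and, beyond that, to the customers it is applied to. There are three audiences:

Model builders
Business users of the model
Customers targeted by the model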

2 Cleaning the Customer Churn Data

Load the .csv data with read_csv(), then adjust the variable names and variable types to suit the subsequent analysis.

library(tidyverse)
library(janitor)
library(skimr)

churn_dat <- read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

churn_dat <- churn_dat %>% 
  clean_names()

# skim_to_list() comes from skimr 1.x; skimr 2.x replaces it with partition(skim(...))
churn_list <- skim_to_list(churn_dat)

churn_df <- churn_dat %>% 
  mutate(churn = factor(churn, levels = c("No", "Yes"))) %>% 
  mutate(senior_citizen = factor(senior_citizen)) %>% 
  mutate(multiple_lines    = ifelse(str_detect(multiple_lines, "No"), "No", "Yes"),
         internet_service  = ifelse(str_detect(internet_service, "No"), "No", "Yes"),
         online_security   = ifelse(str_detect(online_security, "No"), "No", "Yes"),
         online_backup     = ifelse(str_detect(online_backup, "No"), "No", "Yes"),
         device_protection = ifelse(str_detect(device_protection, "No"), "No", "Yes"),
         tech_support      = ifelse(str_detect(tech_support, "No"), "No", "Yes"),
         streaming_tv      = ifelse(str_detect(streaming_tv, "No"), "No", "Yes"),
         streaming_movies  = ifelse(str_detect(streaming_movies, "No"), "No", "Yes")) %>% 
  select(-customer_id) %>% 
  mutate_if(is.character, as.factor) %>% 
  filter(complete.cases(.))
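The recoding step above collapses every value containing "No" into a single "No" level. The following is a minimal base-R sketch of that idea on made-up values. The vector `online_security` here is a toy stand-in, not the real column, and `grepl()` is used in place of `str_detect()` only to keep the sketch free of package dependencies.

```r
# Toy values mimicking a service column (illustrative, not the real data).
online_security <- c("Yes", "No", "No internet service", "Yes")

# grepl() is the base-R counterpart of stringr::str_detect():
# any value containing "No" becomes "No", everything else becomes "Yes".
collapsed <- ifelse(grepl("No", online_security), "No", "Yes")

# Cross-tabulate the original values against the collapsed ones.
print(table(original = online_security, collapsed = collapsed))
```

The same rule applied to internet_service turns every value that does not contain "No" into "Yes", so the type of connection is deliberately discarded.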

3 Building the Predictive Models

3.1 Creating Training/Test Data

Use the caret package to split the data into a training set and a test set.

library(caret)

set.seed(777)  # make the random split reproducible
index_train <- createDataPartition(churn_df$churn, p = 0.5, list = FALSE)

train_df <- churn_df[index_train, ]
test_df  <- churn_df[-index_train, ]
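For a factor outcome, createDataPartition() samples within each class so that the class proportions are preserved in both halves. The sketch below illustrates that idea in base R on a hypothetical outcome vector. It is not caret's implementation, and the toy class counts are assumptions chosen for the example.

```r
set.seed(777)

# Hypothetical outcome: 70 "No" and 30 "Yes" (not the real churn data).
y <- factor(rep(c("No", "Yes"), times = c(70, 30)), levels = c("No", "Yes"))

# Sample half of the row indices separately within each class.
idx_by_class <- split(seq_along(y), y)
index_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(length(idx) * 0.5))
}), use.names = FALSE))

train_y <- y[index_train]
test_y  <- y[-index_train]

# Both halves keep the original 70/30 class balance.
print(prop.table(table(train_y)))
print(prop.table(table(test_y)))
```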

## 2.2. Prepare cross-validation folds ------

cv_folds <- createMultiFolds(train_df$churn, k = 5, times = 3)

cv_ctrl <- trainControl(method = "repeatedcv", number = 5,
                         repeats = 3, 
                         index = cv_folds)


## 2.3. Fit the models in parallel ------

library(doSNOW)
# Track elapsed time
start.time <- Sys.time()

cl <- makeCluster(8, type = "SOCK")
registerDoSNOW(cl)


churn_rf  <- train(churn ~ ., data = train_df, 
                   method = "rf",
                   trControl = cv_ctrl, 
                   tuneLength = 15,
                   importance = TRUE)

churn_glm  <- train(churn ~ ., data = train_df, 
                    method = "glm",
                    family="binomial")


stopCluster(cl)
 
total.time <- Sys.time() - start.time
total.time

Time difference of 3.898562 mins

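4 Explaining the Models

4.1 Model Architecture

When building a predictive model, the simpler model is preferable, all else being equal. If a black-box model performs about the same as a generalized linear model, the generalized linear model is the natural choice. The first step is therefore to select a model architecture by comparing model performance.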

library(DALEX)
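# 3. DALEX setup -----
## 3.1. Explainer prerequisites
# Return the predicted probability of the second class ("Yes", i.e. churn)
prob_fun <- function(object, newdata) { 
    predict(object, newdata=newdata, type="prob")[,2]
}

# Code the outcome as 0/1 so it is on the same scale as the predicted probability
# (as.numeric() on the factor alone would yield 1/2 and shift every residual by 1)
test_v <- as.numeric(test_df$churn == "Yes")

## 3.2. Create the explainers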
explainer_glm <- DALEX::explain(churn_glm, label = "GLM", 
                                data = test_df[, !(colnames(test_df) %in% c("churn"))], y = test_v,
                                predict_function = prob_fun)

explainer_rf <- DALEX::explain(churn_rf, label = "RF",
                               data = test_df[, !(colnames(test_df) %in% c("churn"))], y = test_v,
                               predict_function = prob_fun)

# 4. Understanding and explaining the models -----
## 4.1. Model performance
mp_glm <- model_performance(explainer_glm)
mp_rf  <- model_performance(explainer_rf)

plot(mp_rf, mp_glm, geom = "boxplot", show_outliers = 3) +
    theme(legend.position = "top")
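The DALEX plot compares the models through their residual distributions. A single-number comparison can complement it. The sketch below computes AUC, a metric swapped in here for illustration rather than something DALEX reports in the code above, using the rank-sum identity in base R. The labels and the two score vectors are hypothetical.

```r
# AUC via the rank-sum (Mann-Whitney) identity.
# y must be coded 0/1; score is the predicted probability of class 1.
auc <- function(y, score) {
  r <- rank(score)                 # average ranks handle tied scores
  n_pos <- sum(y == 1)
  n_neg <- sum(y == 0)
  (sum(r[y == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Hypothetical outcomes and scores from two models (illustrative only).
y             <- c(0,   0,   1,    0,   1,   1)
score_simple  <- c(0.1, 0.4, 0.35, 0.2, 0.8, 0.7)
score_complex <- c(0.2, 0.3, 0.6,  0.1, 0.9, 0.5)

print(c(simple = auc(y, score_simple), complex = auc(y, score_complex)))
```

If the two values are close on the real test set, the argument above favors the simpler model.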

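4.2 Extracting Important Variables

The variables that drive predictive performance differ from model to model. Extract variable importance for each model architecture, then identify the variables that rank highly in both models.

## 4.2. Variable importance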
vi_glm <- variable_importance(explainer_glm, n_sample = -1, type = "raw")
vi_rf  <- variable_importance(explainer_rf,  n_sample = -1, type = "raw")

plot(vi_glm, vi_rf, max_vars = 6)
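DALEX measures importance by permuting one variable at a time and observing how much the loss degrades. The base-R sketch below shows that mechanism on simulated data. The data, the RMSE loss, and the helper names `rmse_loss()` and `permutation_importance()` are assumptions made for this example, and the sketch reports the increase in loss rather than the raw loss.

```r
set.seed(777)
n <- 500

# Simulated data: x1 drives the outcome, x2 is pure noise.
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(2 * toy$x1))

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Loss: RMSE between the 0/1 outcome and the predicted probability.
rmse_loss <- function(model, data) {
  p <- predict(model, newdata = data, type = "response")
  sqrt(mean((data$y - p)^2))
}

# Importance: how much the loss grows after shuffling one variable.
permutation_importance <- function(model, data, variable) {
  shuffled <- data
  shuffled[[variable]] <- sample(shuffled[[variable]])
  rmse_loss(model, shuffled) - rmse_loss(model, data)
}

imp <- sapply(c("x1", "x2"), function(v) permutation_importance(fit, toy, v))
print(round(imp, 3))
```

Shuffling the informative variable breaks its link to the outcome, so its loss increase is much larger than that of the noise variable.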

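4.3 Relationship to the Response Variable

Take the important variables identified above and examine how each one relates to the response variable.

## 4.3. Variable response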
pdp_glm <- variable_response(explainer_glm, variable = "tenure", type = "pdp")
pdp_rf  <- variable_response(explainer_rf,  variable = "tenure", type = "pdp")

plot(pdp_glm, pdp_rf)
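A partial dependence profile fixes one variable at a grid of values, leaves all other columns as observed, and averages the predictions at each grid value. The following base-R sketch shows the computation on simulated data. The column names, coefficients, and the helper `partial_dependence()` are hypothetical and only loosely mimic the churn setting.

```r
set.seed(777)
n <- 300

# Simulated data in which longer tenure lowers the churn probability.
toy <- data.frame(tenure = runif(n, 0, 72), monthly = runif(n, 20, 120))
toy$churn <- rbinom(n, 1, plogis(1 - 0.06 * toy$tenure + 0.01 * toy$monthly))

fit <- glm(churn ~ tenure + monthly, data = toy, family = binomial)

# For each grid value, overwrite the variable in every row and
# average the resulting predictions.
partial_dependence <- function(model, data, variable, grid) {
  sapply(grid, function(value) {
    modified <- data
    modified[[variable]] <- value
    mean(predict(model, newdata = modified, type = "response"))
  })
}

grid <- seq(0, 72, by = 12)
pd <- partial_dependence(fit, toy, "tenure", grid)
print(data.frame(tenure = grid, mean_predicted_churn = round(pd, 3)))
```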

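5 Explaining Predictions ¹

Identify, for each observation, the variables that contributed to its prediction and their weights. plot_explanations() also visualizes which variables push each observation's prediction up or down.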

library(lime)
set.seed(777)
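predict_obs <- test_df %>% 
  select(-churn) %>%   # explain the features only, not the outcome
  sample_n(6)

# Build the explainer on the training features only
explainer_caret <- lime(select(train_df, -churn), churn_rf, n_bins = 5)

# Call lime::explain() explicitly because DALEX also exports explain()
explanation_caret <- lime::explain(
  x = predict_obs, 
  explainer = explainer_caret, 
  n_permutations = 5000,
  dist_fun = "gower",
  kernel_width = .75,
  n_features = 10, 
  feature_select = "highest_weights",
  labels = "Yes")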

plot_explanations(explanation_caret)

plot_features(explanation_caret)
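The core idea behind LIME is to perturb the data around one observation, score the perturbed rows with the black-box model, weight them by proximity, and fit a simple weighted linear model whose coefficients serve as the local explanation. The base-R sketch below walks through those steps on simulated data. It is a simplification under stated assumptions: it uses Euclidean distance with an exponential kernel and a plain lm(), whereas the lime call above bins numeric features and uses Gower distance, and a glm stands in for the black box.

```r
set.seed(777)
n <- 400

# Simulated data and a stand-in "black box" (a glm, for illustration only).
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- rbinom(n, 1, plogis(3 * toy$x1 - 0.5 * toy$x2))
black_box <- glm(y ~ x1 + x2, data = toy, family = binomial)

# The single observation to explain (hypothetical values).
obs <- data.frame(x1 = 0.5, x2 = -1)

# 1. Perturb: draw random points around the observation.
n_perm <- 2000
perturbed <- data.frame(x1 = obs$x1 + rnorm(n_perm),
                        x2 = obs$x2 + rnorm(n_perm))

# 2. Score the perturbed points with the black-box model.
perturbed$pred <- predict(black_box, newdata = perturbed, type = "response")

# 3. Weight each point by its proximity to the observation.
kernel_width <- 0.75
d <- sqrt((perturbed$x1 - obs$x1)^2 + (perturbed$x2 - obs$x2)^2)
w <- exp(-(d^2) / kernel_width^2)

# 4. Fit a weighted linear surrogate; its coefficients are the local weights.
surrogate <- lm(pred ~ x1 + x2, data = perturbed, weights = w)
print(round(coef(surrogate), 3))
```

A positive coefficient means that increasing the variable near this observation raises the predicted probability, which corresponds to the sign shown for a feature in plot_features().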

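6 Explaining Predictions to the Business

modelplotr recently added support for caret in addition to mlr, which makes it easier to explain a predictive model in business terms such as lift and gains.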

library(modelplotr) # install_github("modelplot/modelplotr")

prepare_scores_and_deciles(datasets=list("train_df","test_df"),
  dataset_labels = list("train data","test data"),
  models = list("churn_glm","churn_rf"),
  model_labels = list("GLM", "Random Forest"),
  target_column="churn")

... scoring caret model "churn_glm" on dataset "train_df".
... scoring caret model "churn_rf" on dataset "train_df".
... scoring caret model "churn_glm" on dataset "test_df".
... scoring caret model "churn_rf" on dataset "test_df".

[1] "Data preparation step 1 succeeded! Dataframe 'scores_and_deciles' created."

plotting_scope(select_model_label = 'Random Forest', select_dataset_label = 'test data')

[1] "deciles_aggregate not available; input_modelevalplots() is run..."
Data preparation step 3 succeeded! Dataframe 'plot_input' created.

No comparison specified, default values are used. 

Single evaluation line will be plotted: Target value "Yes" plotted for dataset "test data" and model "Random Forest.
"
-> To compare models, specify: scope = "compare_models"
-> To compare datasets, specify: scope = "compare_datasets"
-> To compare target classes, specify: scope = "compare_targetclasses"
-> To plot one line, do not specify scope or specify scope = "no_comparison".

plot_cumgains(highlight_decile = 2)

 
Plot annotation:
- When we select 20% with the highest probability according to Random Forest, this selection holds 49% of all Yes cases in test data.

plot_cumlift(highlight_decile = 2)

 
Plot annotation:
- When we select 20% with the highest probability according to model Random Forest in test data, this selection for Yes cases is 2.5 times better than selecting without a model.

plot_response(highlight_decile = 2)

 
Plot annotation:
- When we select decile 2 according to model Random Forest in dataset test data the % of Yes cases in the selection is 55%.
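The three annotations above come from simple arithmetic on customers sorted by predicted probability. The base-R sketch below reproduces that arithmetic on 20 hypothetical customers. The scores and outcomes are made up for illustration, and the code is not modelplotr's implementation.

```r
# Hypothetical predicted probabilities and actual outcomes (1 = churned).
score <- (20:1) / 21
churn <- c(1, 1, 1, 0,  0, 1, 0, 0,  1, 0, 0, 0,  0, 1, 0, 0,  0, 0, 0, 0)

# Sort customers from highest to lowest predicted probability.
churn_sorted <- churn[order(score, decreasing = TRUE)]
n <- length(churn_sorted)

# Cumulative gains: share of all churners captured in the top k customers.
cum_gains <- cumsum(churn_sorted) / sum(churn_sorted)

# Cumulative lift: gains relative to the share of customers selected.
cum_lift <- cum_gains / (seq_len(n) / n)

# Response: churn rate within each decile (not cumulative).
decile <- ceiling(seq_len(n) * 10 / n)
response_by_decile <- tapply(churn_sorted, decile, mean)

top20 <- n * 0.2   # index of the last customer in the top 20%
print(c(gains = cum_gains[top20], lift = cum_lift[top20],
        response_decile_2 = response_by_decile[[2]]))
```

In this toy example the top 20% of customers contains half of all churners, which is 2.5 times what a random selection of the same size would capture.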

¹ Visualizing ML Models with LIME

xwMOOC Models

Cloudera: Customer Churn - LIME

xwMOOC

2018-11-04
