1 고객/HR 이탈 네트워크 데이터

엑셀파일에 노드와 연결선을 정성스럽게 정한 다음 각 쉬트 데이터를 데이터프레임으로 변환한 후에 igraph 팩키지 graph_from_data_frame() 함수를 통해 데이터프레임을 네트워크 객체로 불러온다.

library(tidyverse)
library(readxl)
library(igraph)

node_df <- read_excel("data/churn_network.xlsx", sheet="node")
edge_df <- read_excel("data/churn_network.xlsx", sheet="edge")

net <- graph_from_data_frame(d=edge_df, vertices = node_df, directed = FALSE)

노드(Node)

node_df %>% 
  sample_n(10) %>% 
  DT::datatable()

연결선(Edge)

edge_df %>% 
  sample_n(10) %>% 
  DT::datatable()

고객이탈 시각화를 위해서 이탈고객과 이탈하지 않는 고객을 색상을 달리해서 시각화한다.

V(net)$color <- V(net)$churn

V(net)$color <- str_replace(V(net)$color, "1", "red")
V(net)$color <- str_replace(V(net)$color, "0", "green")
  
plot(net, vertex.label=NA, 
     vertex.size = 3, 
     edge.color = 'black',
     edge.width = 2)

2 고객/HR 이탈 네트워크 데이터

네트워크에서 Feature를 추출하는 작업을 노드(Node)에서 추출하고, 연결선(Edge)에서도 추출한다.

2.1 노드 Feature

노드 관련 측도에 대한 자세한 설명은 xwMOOC 네트워크 - 네트워크 기술통계을 참조한다.

V(net)$degree <- degree(net, normalized=TRUE)
second_degree <- neighborhood.size(net, 2)
V(net)$second_degree <- second_degree / (length(V(net)) - 1)

V(net)$triangles <- count_triangles(net)

V(net)$betweenness <- betweenness(net, normalized=TRUE)
V(net)$closeness <- closeness(net, normalized=TRUE)
V(net)$eigen <- eigen_centrality(net, scale = TRUE)$vector

V(net)$transitivity <- transitivity(net, type="local", isolates='zero')

V(net)$pr <- page.rank(net)$vector

2.2 연결선 Features

2.3 데이터프레임 변환

노드에 대한 다양한 통계량이 준비되면, as_data_frame() 함수를 통해서 네트워크 객체를 데이터프레임으로 변환한다.

network_dat <- as_data_frame(net, what = "vertices") %>% 
  tbl_df()

network_df <- network_dat %>% 
  tbl_df() %>% 
  select(-name, -color) %>% 
  mutate(churn = ifelse(churn == 1, "yes", "no")) %>% 
  mutate(churn = factor(churn, levels=c("yes","no"))) %>% 
  filter(complete.cases(.))

3 예측모형

3.1 예측모형 적합

네트워크 객체를 데이터프레임으로 변환한 후 caret 팩키지 일반적인 예측모형 방법론과 절차에 따라 작업해 나간다.

# 2. 예측모형 -----
## 2.1. 훈련/시험 데이터 분할 ------
library(caret)

network_index <- createDataPartition(network_df$churn, times =1, p=0.7, list=FALSE)

train_df <- network_df[network_index, ]
test_df  <- network_df[-network_index, ]

## 2.2. 모형 개발/검증 데이터셋 준비 ------

cv_folds <- createMultiFolds(train_df$churn, k = 10, times = 3)

cv_cntrl <- trainControl(method = "repeatedcv", number = 10,
                         repeats = 3,
                         sampling = "down",
                         summaryFunction = twoClassSummary,
                         classProbs = TRUE,
                         index = cv_folds)


## 2.2. 모형 개발/검증 데이터셋 준비 ------
library(doSNOW)
# 실행시간
start.time <- Sys.time()

cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)

churn_glm   <- train(churn ~ ., data = train_df, 
                    method = "glm",
                    family = "binomial",
                    metric='Sens',
                    trControl = cv_cntrl, 
                    tuneLength = 7)

churn_rf    <- train(churn ~ ., data = train_df, 
                   method = "rf",
                   trControl = cv_cntrl, 
                   metric='Sens',
                   tuneLength = 7,
                   importance = TRUE)

stopCluster(cl)

total.time <- Sys.time() - start.time
total.time

Time difference of 14.28043 secs

3.2 예측모형 성능

네트워크 Feature만으로 고객 이탈 모형을 생성한 후에 성능을 확인한다.

# 3. 예측모형 성능 -----
## GLM
glm_pred_df <- predict(churn_glm, newdata=test_df, type="prob") %>%
  tbl_df %>% 
  mutate(class = factor(ifelse(yes > no, "yes", "no"), levels = c("yes", "no")),
         prob  = yes)

confusionMatrix(glm_pred_df$class, test_df$churn)

Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   6 117
       no    3 160
                                          
               Accuracy : 0.5804          
                 95% CI : (0.5209, 0.6383)
    No Information Rate : 0.9685          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0343          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.66667         
            Specificity : 0.57762         
         Pos Pred Value : 0.04878         
         Neg Pred Value : 0.98160         
             Prevalence : 0.03147         
         Detection Rate : 0.02098         
   Detection Prevalence : 0.43007         
      Balanced Accuracy : 0.62214         
                                          
       'Positive' Class : yes

## randomForest
rf_pred_df <- predict(churn_rf, newdata=test_df) %>% 
  tbl_df %>% 
  rename(class = value)

confusionMatrix(rf_pred_df$class, test_df$churn)

Confusion Matrix and Statistics

          Reference
Prediction yes  no
       yes   6 138
       no    3 139
                                          
               Accuracy : 0.507           
                 95% CI : (0.4475, 0.5663)
    No Information Rate : 0.9685          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0204          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.66667         
            Specificity : 0.50181         
         Pos Pred Value : 0.04167         
         Neg Pred Value : 0.97887         
             Prevalence : 0.03147         
         Detection Rate : 0.02098         
   Detection Prevalence : 0.50350         
      Balanced Accuracy : 0.58424         
                                          
       'Positive' Class : yes

xwMOOC 모형

예측모형 - 네트워크

xwMOOC

2018-11-22