1 고객이탈 예측모형 ¹

1.1 데이터 전처리

library(tidyverse)

telco <- read_csv("data/WA_Fn-UseC_-Telco-Customer-Churn.csv")

## 범주형 변수
cat_variables <- names(telco[9:14])

telco_df <- telco %>%
  select(-customerID, -TotalCharges) %>% 
  drop_na() %>% 
  mutate_at(.vars = cat_variables,
            .funs = ~recode_factor(., `No internet service`="No")) %>%
  mutate_at(.vars = "MultipleLines",
            .funs = ~recode_factor(., `No phone service`="No")) %>% 
  mutate(tenure = case_when(tenure >= 0 & tenure <= 12 ~ '0-12 Month',
                            tenure > 12 & tenure <= 24 ~ '12-24 Month',
                            tenure > 24 & tenure <= 48 ~ '24-48 Month',
                            tenure > 48 & tenure <= 60~ '48-60 Month', 
                            tenure > 60 ~'> 60 Month')) %>% 
  mutate(SeniorCitizen = if_else(SeniorCitizen == 0, "No", "Yes")) %>% 
  mutate_if(is.character, as.factor)
  

telco_df %>% 
  sample_n(10) %>% 
  knitr::kable()

gender	SeniorCitizen	Partner	Dependents	tenure	PhoneService	MultipleLines	InternetService	OnlineSecurity	OnlineBackup	DeviceProtection	TechSupport	StreamingTV	StreamingMovies	Contract	PaperlessBilling	PaymentMethod	MonthlyCharges	Churn
Female	No	Yes	Yes	12-24 Month	Yes	No	DSL	No	No	Yes	No	No	No	Month-to-month	Yes	Bank transfer (automatic)	49.60	No
Female	No	Yes	No	48-60 Month	Yes	No	DSL	Yes	No	Yes	No	Yes	No	One year	Yes	Mailed check	64.30	No
Male	No	Yes	No	48-60 Month	Yes	Yes	DSL	Yes	Yes	Yes	Yes	No	Yes	One year	Yes	Credit card (automatic)	81.40	No
Male	No	No	No	24-48 Month	Yes	No	No	No	No	No	No	No	No internet service	Two year	No	Bank transfer (automatic)	20.05	No
Female	No	No	Yes	0-12 Month	Yes	No	DSL	Yes	No	No	No	No	Yes	Month-to-month	Yes	Credit card (automatic)	61.20	No
Female	No	Yes	Yes	> 60 Month	Yes	Yes	DSL	Yes	No	No	No	Yes	Yes	Two year	Yes	Electronic check	75.50	No
Male	Yes	No	No	24-48 Month	Yes	Yes	Fiber optic	No	No	No	No	Yes	Yes	Month-to-month	Yes	Credit card (automatic)	96.15	Yes
Male	No	Yes	Yes	24-48 Month	Yes	No	Fiber optic	No	Yes	Yes	Yes	Yes	No	One year	No	Bank transfer (automatic)	92.90	No
Male	No	Yes	No	24-48 Month	Yes	No	No	No	No	No	No	No	No internet service	Two year	No	Mailed check	19.25	No
Female	No	No	No	12-24 Month	Yes	Yes	No	No	No	No	No	No	No internet service	Month-to-month	No	Bank transfer (automatic)	24.70	No

1.2 GLM 모형

가장 일반적인 방식으로 GLM 모형에 stepwise 변수선택 방법으로 최적모형을 구축해보자.

telco_full <- glm(Churn ~ ., 
                family = binomial(link="logit"),
                data = telco_df)

telco_stepwise <- telco_full %>% 
  MASS::stepAIC(trace = FALSE)

print(summary(telco_full))


Call:
glm(formula = Churn ~ ., family = binomial(link = "logit"), data = telco_df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0058  -0.6710  -0.2883   0.6727   3.1218  

Coefficients: (1 not defined because of singularities)
                                      Estimate Std. Error z value Pr(>|z|)
(Intercept)                          -2.787714   0.250947 -11.109  < 2e-16
genderMale                           -0.025249   0.064897  -0.389 0.697234
SeniorCitizenYes                      0.220369   0.084378   2.612 0.009010
PartnerYes                           -0.029174   0.077642  -0.376 0.707097
DependentsYes                        -0.132388   0.089657  -1.477 0.139781
tenure0-12 Month                      1.747872   0.170139  10.273  < 2e-16
tenure12-24 Month                     0.841128   0.167032   5.036 4.76e-07
tenure24-48 Month                     0.482497   0.152572   3.162 0.001565
tenure48-60 Month                     0.247678   0.165731   1.494 0.135054
PhoneServiceYes                       0.062844   0.650024   0.097 0.922981
MultipleLinesYes                      0.414765   0.177215   2.340 0.019259
InternetServiceDSL                    1.513102   0.808142   1.872 0.061162
InternetServiceFiber optic            3.096718   1.596710   1.939 0.052448
OnlineSecurityYes                    -0.246999   0.179066  -1.379 0.167781
OnlineBackupYes                      -0.027829   0.175699  -0.158 0.874150
DeviceProtectionYes                   0.115779   0.176452   0.656 0.511729
TechSupportYes                       -0.190985   0.180853  -1.056 0.290958
StreamingTVYes                        0.532038   0.327168   1.626 0.103908
StreamingMoviesNo internet service          NA         NA      NA       NA
StreamingMoviesYes                    0.544324   0.327354   1.663 0.096353
ContractOne year                     -0.774108   0.106994  -7.235 4.65e-13
ContractTwo year                     -1.677554   0.180913  -9.273  < 2e-16
PaperlessBillingYes                   0.324741   0.074605   4.353 1.34e-05
PaymentMethodCredit card (automatic) -0.073373   0.113693  -0.645 0.518695
PaymentMethodElectronic check         0.326059   0.094305   3.458 0.000545
PaymentMethodMailed check             0.004019   0.114311   0.035 0.971953
MonthlyCharges                       -0.027294   0.031791  -0.859 0.390585
                                        
(Intercept)                          ***
genderMale                              
SeniorCitizenYes                     ** 
PartnerYes                              
DependentsYes                           
tenure0-12 Month                     ***
tenure12-24 Month                    ***
tenure24-48 Month                    ** 
tenure48-60 Month                       
PhoneServiceYes                         
MultipleLinesYes                     *  
InternetServiceDSL                   .  
InternetServiceFiber optic           .  
OnlineSecurityYes                       
OnlineBackupYes                         
DeviceProtectionYes                     
TechSupportYes                          
StreamingTVYes                          
StreamingMoviesNo internet service      
StreamingMoviesYes                   .  
ContractOne year                     ***
ContractTwo year                     ***
PaperlessBillingYes                  ***
PaymentMethodCredit card (automatic)    
PaymentMethodElectronic check        ***
PaymentMethodMailed check               
MonthlyCharges                          
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 8150.1  on 7042  degrees of freedom
Residual deviance: 5849.8  on 7017  degrees of freedom
AIC: 5901.8

Number of Fisher Scoring iterations: 6

1.3 변수선택

GLM 모형을 배포하기에 앞서 각 변수별로 고객이탈(Churn)을 예측하는데 가장 높은 정확도를 보이는 변수를 추출한다. 이를 바탕으로 RESTful API로 모형을 배포한다.

# 독립변수 벡터
telco_variables <- setdiff(names(telco_df), "Churn")

# Accuracy 계산 함수
calculate_accuracy <- function(basetable, variable) {
  
  one_full <- glm(paste("Churn", variable, sep="~"), 
                  family = binomial(link="logit"),
                  data = basetable)
  
  cutoff_prob <- table(telco_df$Churn)[2] / nrow(telco_df)
  pred_telco <- ifelse(predict(one_full, telco_df, type = "response") > cutoff_prob, "Yes", "No")
  
  conf_matrix <- table(telco_df$Churn, pred_telco)
  
  variable_accuracy <- (conf_matrix[1,1] + conf_matrix[2,2]) / sum(conf_matrix)  
  
  return(variable_accuracy)
}

# 변수별 정확도 계산
telco_accuracy <- vector("numeric", 0)

for(telco_variable in telco_variables) {
  telco_accuracy[telco_variable] <- calculate_accuracy(telco_df, telco_variable)
  # cat(telco_variable, "\n")
}

accuracy_df <- tibble(
  varname = telco_variables,
  accuracy = telco_accuracy
)

accuracy_df %>% 
  arrange(desc(accuracy)) %>% 
  DT::datatable() %>% 
  DT::formatSignif("accuracy", digits=3)

1.4 예측모형 수식도출

SeniorCitizen 변수가 가장 높은 예측성을 보이는 변수라 이를 예측모형으로 작성한다. glm 객체로부터 변수를 추출하여 수식을 만들어 낼 수 있다.

prod_glm <- glm(Churn ~ SeniorCitizen, 
                      family = binomial(link="logit"),
                      data = telco_df)

initial_formula <- paste(coef(prod_glm), names(coef(prod_glm)), sep = ' * ', collapse = ' + ')

glm_formula <- str_remove_all(initial_formula, "\\(Intercept\\)") %>% 
  str_remove(., " \\* ")

glm_formula

[1] "-1.17439390137758 + 0.83852208506855 * SeniorCitizenYes"

즉, 앞서 산출한 $\beta X_i$를 $\operatorname{logit}(p_i) = \ln \left(\frac{p_i}{1-p_i}\right) = \beta_0 + \beta_1 x_{1,i} + \cdots + \beta_m x_{m,i}$ 관계에 따라 역로짓변환 시켜면 다음과 같이 수식을 전개할 수 있으며 고객이탈확률을 산출할 수 있게 된다.

\[p = \frac{1}{exp(-\beta X_i)} = \frac{1}{exp(1.17439390137758 - 0.83852208506855 * SeniorCitizen)} \]

SeniorCitizen 이면 “Yes”, 아니면 “No”가 되기 때문에 값은 다음 두가지 이탈확률값을 가지게 된다. 이를 앞서 계산한 table(telco_df$Churn)[2] / nrow(telco_df) 즉, 0.2653699 값보다 높으면 이탈, 낮으면 잔존고객으로 정의한다.

1/(1+exp(1.17439390137758 - 0.83852208506855 * 1))

[1] 0.4168126

1/(1+exp(1.17439390137758 - 0.83852208506855 * 0))

[1] 0.2360617

2 RESTful API 개발

앞서 산출한 수학공식을 고객이탈 - RESTful API 기본기 plumber /sum_two와 동일한 로직으로 고객이탈 확률을 계산하는 API로 제작한다.

http://localhost:8000/churn_probability?senior=Yes
- [0.2361]
http://localhost:8000/churn_probability?senior=No
- [0.4168]

메인호출 함수

telco_RESTful.R는 telco.R에서 작성한 서비스를 기동시키는 역할을 수행한다.

library(plumber)
r <- plumb("telco.R")
r$run(port=8000)

RESTful API 서비스

/healthcheck, /churn_probability 두가지 서비스를 제공한다.

# telco.R
library(tidyverse)

#* Echo back the input
#* @param msg The message to echo
#* @get /healthcheck
function(msg=""){
  list(msg = paste0("We are alive!!!"))
}

#* Return Churn Probability and Class
#* @param senior Is Senior Citizen?
#* @get /churn_probability
function(senior){
  senior_val <- ifelse(senior == "Yes", 1, 0)
  1/(1+exp(1.17439390137758 - 0.83852208506855 * senior_val))
}

이제 RESTful API를 호출하여 고객이탈확률을 예측하여 보자.

library(httr)
library(tidyverse)
GET('http://localhost:8000/churn_probability?senior=Yes') %>% 
  content()

[[1]]
[1] 0.4168

GET('http://localhost:8000/churn_probability?senior=No') %>% 
  content()

[[1]]
[1] 0.2361

3 RESTful API 일반화

수식을 뽑아낸 방법이 아닌 GLM R 객체를 활용하는 방식으로 RESTful API 제작을 일반화할 수 있다. 이를 위해서 먼저 deploy_glm.rds 이름으로 예측모형 객체를 저장시키고 나중에 plumber에서 불러오는 방식으로 이를 활용한다.

telco_api_df <- telco_df %>% 
  select(Churn, SeniorCitizen, MonthlyCharges)

deploy_glm <- glm(Churn ~ SeniorCitizen + MonthlyCharges, 
                      family = binomial(link="logit"),
                      data = telco_api_df)

deploy_glm %>% 
  readr::write_rds("deploy/deploy_glm.rds")

predict(deploy_glm, newdata = telco_api_df %>% sample_n(10), type = "response")

        1         2         3         4         5         6         7 
0.1682675 0.2342056 0.2500732 0.4135918 0.4479143 0.3304257 0.4124269 
        8         9        10 
0.1983202 0.2444518 0.1415450

broom 팩키지를 통해서 중요한 변수명을 추출한다.

broom::tidy(deploy_glm)

# A tibble: 3 x 5
  term             estimate std.error statistic   p.value
  <chr>               <dbl>     <dbl>     <dbl>     <dbl>
1 (Intercept)       -2.07    0.0745      -27.8  1.16e-169
2 SeniorCitizenYes   0.626   0.0695        9.01 1.98e- 19
3 MonthlyCharges     0.0137  0.000990     13.9  9.77e- 44

마지막으로 RESTful API 개발에 앞서 test_api_df 를 만들어 혹시라도 모를 버그를 사전에 방지한다.

test_api_df <- tibble(SeniorCitizen = "Yes",
                      MonthlyCharges = 19.2)

predict(deploy_glm, newdata = test_api_df, type = "response")

        1 
0.2352228

3.1 서비스 추가

메인호출 함수

telco_RESTful.R는 telco.R에서 작성한 서비스를 기동시키는 역할을 수행한다.

library(plumber)
r <- plumb("telco.R")
r$run(port=8000)

RESTful API 서비스

/healthcheck, /churn_probability 두가지 서비스외 추가로 예측모형 객체 prod_glm을 직접 사용해서 예측모형을 개발한다.

# telco.R
library(tidyverse)

deploy_glm <- 
  read_rds("deploy/deploy_glm.rds")

#* Echo back the input
#* @param msg The message to echo
#* @get /healthcheck
function(msg=""){
  list(msg = paste0("We are alive!!!"))
}

#* Return Churn Probability and Class
#* @param senior Is Senior Citizen?
#* @get /churn_probability
function(senior){
  senior_val <- ifelse(senior == "Yes", 1, 0)
  1/(1+exp(1.17439390137758 - 0.83852208506855 * senior_val))
}

#* Return Churn Probability and Class
#* @param senior Is Senior Citizen?
#* @param charge Monthly Charges?
#* @get /predict_churn
function(senior, charge){
  test_api_df <- tibble(SeniorCitizen = senior,
                        MonthlyCharges = as.numeric(charge))
  
  predict(deploy_glm, newdata = test_api_df, type = "response")
}

이제 RESTful API를 직접 호출하여 고객이탈확률을 좀더 정교하게 계산하여 보자.

GET('http://localhost:8000/predict_churn?senior=Yes&charge=18.1') %>% 
  content()

[[1]]
[1] 0.2325

GET('http://localhost:8000/predict_churn?senior=Yes&charge=38.1') %>% 
  content()

[[1]]
[1] 0.285

GET('http://localhost:8000/predict_churn?senior=No&charge=25.1') %>% 
  content()

[[1]]
[1] 0.1513

GET('http://localhost:8000/predict_churn?senior=No&charge=29.1') %>% 
  content()

[[1]]
[1] 0.1585

Susan Li (Nov 16, 2017 ), “Predict Customer Churn with R”, Towards Data Science ↩

xwMOOC 모형

고객이탈 - RESTful API 기본기 `plumber`

xwMOOC

2019-11-24

1 고객이탈 예측모형 ¹

1.1 데이터 전처리

1.2 GLM 모형

1.3 변수선택

1.4 예측모형 수식도출

2 RESTful API 개발

3 RESTful API 일반화

3.1 서비스 추가

xwMOOC 모형

고객이탈 - RESTful API 기본기 plumber

xwMOOC

2019-11-24

1 고객이탈 예측모형 1

1.1 데이터 전처리

1.2 GLM 모형

1.3 변수선택

1.4 예측모형 수식도출

2 RESTful API 개발

3 RESTful API 일반화

3.1 서비스 추가

고객이탈 - RESTful API 기본기 `plumber`

1 고객이탈 예측모형 ¹