1 Machine Learning Model Development in 30 Minutes

Choosing the best model architecture (GLM, NN, DL, SVM, GBM, XGBoost, Random Forest, …) and tuning it into an optimal model is important, but applying a model in a business setting also means honoring a range of constraints and optimizing within them.

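From this perspective, rather than chasing the single best-performing model, it can be more valuable to borrow the idea of "Agile ML development" and, taken to the extreme, build a predictive model that reaches about 90% of the best achievable performance within 30 minutes.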

Following the example in Matt Dancho (August 7, 2018), "Kaggle Competition in 30 Minutes: Predict Home Credit Default Risk with R", Business Science, let us demonstrate building and deploying a machine learning model within 30 minutes.

2 Automation Strategy

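The automation strategy for predictive model development leaves the raw data as it is, implements a transformation strategy suited to each variable, imputes missing values, then builds an \(H_2O\) AutoML model and finishes by deploying it to production.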

AutoML automation strategy pipeline

First, load the raw data (application_train.csv) and, before feeding it to the AutoML engine \(H_2O\), specify how missing values and variable type conversions are to be handled.

2.1 Loading the Data

Download the Kaggle Home Credit Default Risk dataset and use the kable function to check that the raw data was read correctly into a data frame.

# 0. Setup -----
# General 
library(tidyverse)
library(skimr)
library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box()

# Preprocessing
library(recipes)

# Machine Learning
library(h2o)

# 1. Data -----
## 1.1. Load the data 
application_train_tbl <- read_csv("data/application_train.csv")

## 1.2. Glance at the data
application_train_tbl %>%
    slice(1:10) %>%
    kable() %>% 
    kable_styling() %>%
    scroll_box(width = "800px")
SK_ID_CURR TARGET NAME_CONTRACT_TYPE CODE_GENDER FLAG_OWN_CAR FLAG_OWN_REALTY CNT_CHILDREN AMT_INCOME_TOTAL AMT_CREDIT AMT_ANNUITY AMT_GOODS_PRICE NAME_TYPE_SUITE NAME_INCOME_TYPE NAME_EDUCATION_TYPE NAME_FAMILY_STATUS NAME_HOUSING_TYPE REGION_POPULATION_RELATIVE DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH OWN_CAR_AGE FLAG_MOBIL FLAG_EMP_PHONE FLAG_WORK_PHONE FLAG_CONT_MOBILE FLAG_PHONE FLAG_EMAIL OCCUPATION_TYPE CNT_FAM_MEMBERS REGION_RATING_CLIENT REGION_RATING_CLIENT_W_CITY WEEKDAY_APPR_PROCESS_START HOUR_APPR_PROCESS_START REG_REGION_NOT_LIVE_REGION REG_REGION_NOT_WORK_REGION LIVE_REGION_NOT_WORK_REGION REG_CITY_NOT_LIVE_CITY REG_CITY_NOT_WORK_CITY LIVE_CITY_NOT_WORK_CITY ORGANIZATION_TYPE EXT_SOURCE_1 EXT_SOURCE_2 EXT_SOURCE_3 APARTMENTS_AVG BASEMENTAREA_AVG YEARS_BEGINEXPLUATATION_AVG YEARS_BUILD_AVG COMMONAREA_AVG ELEVATORS_AVG ENTRANCES_AVG FLOORSMAX_AVG FLOORSMIN_AVG LANDAREA_AVG LIVINGAPARTMENTS_AVG LIVINGAREA_AVG NONLIVINGAPARTMENTS_AVG NONLIVINGAREA_AVG APARTMENTS_MODE BASEMENTAREA_MODE YEARS_BEGINEXPLUATATION_MODE YEARS_BUILD_MODE COMMONAREA_MODE ELEVATORS_MODE ENTRANCES_MODE FLOORSMAX_MODE FLOORSMIN_MODE LANDAREA_MODE LIVINGAPARTMENTS_MODE LIVINGAREA_MODE NONLIVINGAPARTMENTS_MODE NONLIVINGAREA_MODE APARTMENTS_MEDI BASEMENTAREA_MEDI YEARS_BEGINEXPLUATATION_MEDI YEARS_BUILD_MEDI COMMONAREA_MEDI ELEVATORS_MEDI ENTRANCES_MEDI FLOORSMAX_MEDI FLOORSMIN_MEDI LANDAREA_MEDI LIVINGAPARTMENTS_MEDI LIVINGAREA_MEDI NONLIVINGAPARTMENTS_MEDI NONLIVINGAREA_MEDI FONDKAPREMONT_MODE HOUSETYPE_MODE TOTALAREA_MODE WALLSMATERIAL_MODE EMERGENCYSTATE_MODE OBS_30_CNT_SOCIAL_CIRCLE DEF_30_CNT_SOCIAL_CIRCLE OBS_60_CNT_SOCIAL_CIRCLE DEF_60_CNT_SOCIAL_CIRCLE DAYS_LAST_PHONE_CHANGE FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 
FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21 AMT_REQ_CREDIT_BUREAU_HOUR AMT_REQ_CREDIT_BUREAU_DAY AMT_REQ_CREDIT_BUREAU_WEEK AMT_REQ_CREDIT_BUREAU_MON AMT_REQ_CREDIT_BUREAU_QRT AMT_REQ_CREDIT_BUREAU_YEAR
100002 1 Cash loans M N Y 0 202500 406597.5 24700.5 351000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.018801 -9461 -637 -3648 -2120 NA 1 1 0 1 1 0 Laborers 1 2 2 WEDNESDAY 10 0 0 0 0 0 0 Business Entity Type 3 0.0830370 0.2629486 0.1393758 0.0247 0.0369 0.9722 0.6192 0.0143 0.00 0.0690 0.0833 0.1250 0.0369 0.0202 0.0190 0.0000 0.0000 0.0252 0.0383 0.9722 0.6341 0.0144 0.0000 0.0690 0.0833 0.1250 0.0377 0.022 0.0198 0 0 0.0250 0.0369 0.9722 0.6243 0.0144 0.00 0.0690 0.0833 0.1250 0.0375 0.0205 0.0193 0.0000 0.00 reg oper account block of flats 0.0149 Stone, brick No 2 2 2 2 -1134 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
100003 0 Cash loans F N N 0 270000 1293502.5 35698.5 1129500 Family State servant Higher education Married House / apartment 0.003541 -16765 -1188 -1186 -291 NA 1 1 0 1 1 0 Core staff 2 1 1 MONDAY 11 0 0 0 0 0 0 School 0.3112673 0.6222458 NA 0.0959 0.0529 0.9851 0.7960 0.0605 0.08 0.0345 0.2917 0.3333 0.0130 0.0773 0.0549 0.0039 0.0098 0.0924 0.0538 0.9851 0.8040 0.0497 0.0806 0.0345 0.2917 0.3333 0.0128 0.079 0.0554 0 0 0.0968 0.0529 0.9851 0.7987 0.0608 0.08 0.0345 0.2917 0.3333 0.0132 0.0787 0.0558 0.0039 0.01 reg oper account block of flats 0.0714 Block No 1 0 1 0 -828 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
100004 0 Revolving loans M Y Y 0 67500 135000.0 6750.0 135000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.010032 -19046 -225 -4260 -2531 26 1 1 1 1 1 0 Laborers 1 2 2 MONDAY 9 0 0 0 0 0 0 Government NA 0.5559121 0.7295667 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 -815 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
100006 0 Cash loans F N Y 0 135000 312682.5 29686.5 297000 Unaccompanied Working Secondary / secondary special Civil marriage House / apartment 0.008019 -19005 -3039 -9833 -2437 NA 1 1 0 1 0 0 Laborers 2 2 2 WEDNESDAY 17 0 0 0 0 0 0 Business Entity Type 3 NA 0.6504417 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 0 2 0 -617 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA NA
100007 0 Cash loans M N Y 0 121500 513000.0 21865.5 513000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.028663 -19932 -3038 -4311 -3458 NA 1 1 0 1 0 0 Core staff 1 2 2 THURSDAY 11 0 0 0 0 1 1 Religion NA 0.3227383 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 -1106 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
100008 0 Cash loans M N Y 0 99000 490495.5 27517.5 454500 Spouse, partner State servant Secondary / secondary special Married House / apartment 0.035792 -16941 -1588 -4970 -477 NA 1 1 1 1 1 0 Laborers 2 2 2 WEDNESDAY 16 0 0 0 0 0 0 Other NA 0.3542247 0.6212263 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 0 0 0 0 -2536 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1
100009 0 Cash loans F Y Y 1 171000 1560726.0 41301.0 1395000 Unaccompanied Commercial associate Higher education Married House / apartment 0.035792 -13778 -3130 -1213 -619 17 1 1 0 1 1 0 Accountants 3 2 2 SUNDAY 16 0 0 0 0 0 0 Business Entity Type 3 0.7747614 0.7239999 0.4920601 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 0 1 0 -1562 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1 2
100010 0 Cash loans M Y Y 0 360000 1530000.0 42075.0 1530000 Unaccompanied State servant Higher education Married House / apartment 0.003122 -18850 -449 -4597 -2379 8 1 1 1 1 0 0 Managers 2 3 3 MONDAY 16 0 0 0 0 1 1 Other NA 0.7142793 0.5406545 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 0 2 0 -1070 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
100011 0 Cash loans F N Y 0 112500 1019610.0 33826.5 913500 Children Pensioner Secondary / secondary special Married House / apartment 0.018634 -20099 365243 -7427 -3514 NA 1 0 0 1 0 0 NA 2 2 2 WEDNESDAY 14 0 0 0 0 0 0 XNA 0.5873340 0.2057473 0.7517237 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 1 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
100012 0 Revolving loans M N Y 0 135000 405000.0 20250.0 405000 Unaccompanied Working Secondary / secondary special Single / not married House / apartment 0.019689 -14469 -2019 -14437 -3992 NA 1 1 0 1 0 0 Laborers 1 2 2 THURSDAY 8 0 0 0 0 0 0 Electricity NA 0.7466436 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA 2 0 2 0 -1673 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 NA NA NA NA NA NA
## 1.3. Separate predictors (x) and target (y)
x_train_tbl <- application_train_tbl %>% select(-TARGET)
y_train_tbl <- application_train_tbl %>% select(TARGET)   

# Remove objects that are no longer needed to save memory
rm(application_train_tbl)

2.2 Data Transformation

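Once the raw data is loaded, the next step is to identify the character variables and store their names in the string_2_factor_names vector. Among the numeric variables, those with fewer than 7 unique values are treated as categorical and their names are stored in the num_2_factor_names vector. Finally, missing values are set to be imputed with the mean for numeric variables and with the mode for categorical variables.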

The recipes package is used to record the specification for the variable conversions and the missing value imputation.
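
To make these rules concrete, here is a minimal sketch that applies the same three decisions to a tiny made-up data frame using base R only. The data frame `toy_tbl`, its columns, and the helper `impute_mode()` are hypothetical and are not part of the original workflow; the threshold of 7 is the one described above. As in the workflow's own unique-value count, `NA` counts as a distinct value.

```r
# Toy data (hypothetical): a character column, a 0/1 flag, and a
# continuous numeric column, each small enough to check by eye.
toy_tbl <- data.frame(
    gender = c("M", "F", NA, "F", "F", "M", "F", "M"),
    flag   = c(0, 1, 1, 0, 0, 0, 1, 0),
    income = c(100, NA, 300, 200, 450, 520, 610, 700),
    stringsAsFactors = FALSE
)

factor_limit <- 7

# Rule 1: character columns are converted to factors.
string_cols <- names(toy_tbl)[sapply(toy_tbl, is.character)]

# Rule 2: numeric columns with fewer than `factor_limit` unique values
# are treated as categorical.
numeric_cols      <- names(toy_tbl)[sapply(toy_tbl, is.numeric)]
n_unique          <- sapply(toy_tbl[numeric_cols], function(x) length(unique(x)))
num_2_factor_cols <- names(n_unique)[n_unique < factor_limit]

# Rule 3a: replace missing categorical values with the most frequent level.
impute_mode <- function(x) {
    counts <- table(x)  # table() ignores NA by default
    x[is.na(x)] <- names(counts)[which.max(counts)]
    x
}

processed_tbl <- toy_tbl

# Rule 3b: replace missing values in the remaining numeric columns with the mean.
for (col in setdiff(numeric_cols, num_2_factor_cols)) {
    v <- processed_tbl[[col]]
    v[is.na(v)] <- mean(v, na.rm = TRUE)
    processed_tbl[[col]] <- v
}

# Convert categorical columns to factors after mode imputation.
for (col in c(string_cols, num_2_factor_cols)) {
    v <- as.character(processed_tbl[[col]])
    processed_tbl[[col]] <- factor(impute_mode(v))
}

str(processed_tbl)
```

In this toy example `flag` has only two unique values and becomes a factor, while `income` has eight and stays numeric with its gap filled by the column mean.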

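# 2. Data exploration -----
## 2.1. Inspect data types
## Note: skim_to_list() belongs to skimr 1.x; in skimr 2.x use skim(x_train_tbl) %>% partition()
skim_to_list(x_train_tbl)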
$character
# A tibble: 16 x 8
   variable              missing complete n     min   max   empty n_unique
 * <chr>                 <chr>   <chr>    <chr> <chr> <chr> <chr> <chr>   
 1 CODE_GENDER           0       307511   3075~ 1     3     0     3       
 2 EMERGENCYSTATE_MODE   145755  161756   3075~ 2     3     0     2       
 3 FLAG_OWN_CAR          0       307511   3075~ 1     1     0     2       
 4 FLAG_OWN_REALTY       0       307511   3075~ 1     1     0     2       
 5 FONDKAPREMONT_MODE    210295  97216    3075~ 13    21    0     4       
 6 HOUSETYPE_MODE        154297  153214   3075~ 14    16    0     3       
 7 NAME_CONTRACT_TYPE    0       307511   3075~ 10    15    0     2       
 8 NAME_EDUCATION_TYPE   0       307511   3075~ 15    29    0     5       
 9 NAME_FAMILY_STATUS    0       307511   3075~ 5     20    0     6       
10 NAME_HOUSING_TYPE     0       307511   3075~ 12    19    0     6       
11 NAME_INCOME_TYPE      0       307511   3075~ 7     20    0     8       
12 NAME_TYPE_SUITE       1292    306219   3075~ 6     15    0     7       
13 OCCUPATION_TYPE       96391   211120   3075~ 7     21    0     18      
14 ORGANIZATION_TYPE     0       307511   3075~ 3     22    0     58      
15 WALLSMATERIAL_MODE    156341  151170   3075~ 5     12    0     7       
16 WEEKDAY_APPR_PROCESS~ 0       307511   3075~ 6     9     0     7       

$integer
# A tibble: 40 x 12
   variable  missing complete n      mean   sd     p0    p25   p50   p75  
 * <chr>     <chr>   <chr>    <chr>  <chr>  <chr>  <chr> <chr> <chr> <chr>
 1 CNT_CHIL~ 0       307511   307511 "    ~ "    ~ 0     "   ~ 0     "   ~
 2 DAYS_BIR~ 0       307511   307511 "-160~ "  43~ -252~ "-19~ -157~ "-12~
 3 DAYS_EMP~ 0       307511   307511 " 638~ "1412~ -179~ " -2~ -1213 "  -~
 4 DAYS_ID_~ 0       307511   307511 " -29~ "  15~ -7197 " -4~ -3254 " -1~
 5 FLAG_CON~ 0       307511   307511 "    ~ "    ~ 0     "   ~ 1     "   ~
 6 FLAG_DOC~ 0       307511   307511 "    ~ "    ~ 0     "   ~ 0     "   ~
 7 FLAG_DOC~ 0       307511   307511 "    ~ "    ~ 0     "   ~ 0     "   ~
 8 FLAG_DOC~ 0       307511   307511 "    ~ "    ~ 0     "   ~ 0     "   ~
 9 FLAG_DOC~ 0       307511   307511 "    ~ "    ~ 0     "   ~ 0     "   ~
10 FLAG_DOC~ 0       307511   307511 "    ~ "    ~ 0     "   ~ 0     "   ~
# ... with 30 more rows, and 2 more variables: p100 <chr>, hist <chr>

$numeric
# A tibble: 65 x 12
   variable   missing complete n      mean   sd    p0    p25   p50   p75  
 * <chr>      <chr>   <chr>    <chr>  <chr>  <chr> <chr> <chr> <chr> <chr>
 1 AMT_ANNUI~ 12      307499   307511 " 271~ " 14~ "  1~ " 16~ " 24~ " 34~
 2 AMT_CREDIT 0       307511   307511 " 6e+~ " 4e~ " 45~ "270~ "513~ "808~
 3 AMT_GOODS~ 278     307233   307511 "5383~ "369~ " 40~ "238~ "450~ "679~
 4 AMT_INCOM~ 0       307511   307511 "1687~ "237~ " 25~ "112~ "147~ " 2e~
 5 AMT_REQ_C~ 41519   265992   307511 "    ~ "   ~ "   ~ "   ~ "   ~ "   ~
 6 AMT_REQ_C~ 41519   265992   307511 "    ~ "   ~ "   ~ "   ~ "   ~ "   ~
 7 AMT_REQ_C~ 41519   265992   307511 "    ~ "   ~ "   ~ "   ~ "   ~ "   ~
 8 AMT_REQ_C~ 41519   265992   307511 "    ~ "   ~ "   ~ "   ~ "   ~ "   ~
 9 AMT_REQ_C~ 41519   265992   307511 "    ~ "   ~ "   ~ "   ~ "   ~ "   ~
10 AMT_REQ_C~ 41519   265992   307511 "    ~ "   ~ "   ~ "   ~ "   ~ "   ~
# ... with 55 more rows, and 2 more variables: p100 <chr>, hist <chr>
## 2.2. Preprocessing strategy
### Character variables
string_2_factor_names <- x_train_tbl %>%
    select_if(is.character) %>%
    names()

# string_2_factor_names

### Numeric variables: number of unique values
unique_numeric_values_tbl <- x_train_tbl %>%
    select_if(is.numeric) %>%
    map_df(~ unique(.) %>% length()) %>%
    gather() %>%
    arrange(value) %>%
    mutate(key = as_factor(key))

# unique_numeric_values_tbl

factor_limit <- 7

num_2_factor_names <- unique_numeric_values_tbl %>%
    filter(value < factor_limit) %>%
    arrange(desc(value)) %>%
    pull(key) %>%
    as.character()

# num_2_factor_names

### Missing values
missing_tbl <- x_train_tbl %>%
    summarize_all(.funs = ~ sum(is.na(.)) / length(.)) %>%
    gather() %>%
    arrange(desc(value)) %>%
    filter(value > 0)

# missing_tbl

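## Preprocess the data according to the recipe

# Note: written against the 2018 recipes API. In later releases
# step_meanimpute()/step_modeimpute() were renamed step_impute_mean()/step_impute_mode().
rec_obj <- recipe(~ ., data = x_train_tbl) %>%
    step_string2factor(string_2_factor_names) %>%   # character -> factor
    step_num2factor(num_2_factor_names) %>%         # low-cardinality numeric -> factor
    step_meanimpute(all_numeric()) %>%              # numeric NA -> mean
    step_modeimpute(all_nominal()) %>%              # categorical NA -> mode
    prep(stringsAsFactors = FALSE)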

# rec_obj

x_train_processed_tbl <- bake(rec_obj, x_train_tbl) 

### Convert the Y variable to a factor
y_train_processed_tbl <- y_train_tbl %>%
    mutate(TARGET = TARGET %>% as.character() %>% as.factor())

Before data transformation

### Before data transformation
x_train_tbl %>%
    select(1:30) %>%
    glimpse()
Observations: 307,511
Variables: 30
$ SK_ID_CURR                 <int> 100002, 100003, 100004, 100006, 100...
$ NAME_CONTRACT_TYPE         <chr> "Cash loans", "Cash loans", "Revolv...
$ CODE_GENDER                <chr> "M", "F", "M", "F", "M", "M", "F", ...
$ FLAG_OWN_CAR               <chr> "N", "N", "Y", "N", "N", "N", "Y", ...
$ FLAG_OWN_REALTY            <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", ...
$ CNT_CHILDREN               <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,...
$ AMT_INCOME_TOTAL           <dbl> 202500.00, 270000.00, 67500.00, 135...
$ AMT_CREDIT                 <dbl> 406597.5, 1293502.5, 135000.0, 3126...
$ AMT_ANNUITY                <dbl> 24700.5, 35698.5, 6750.0, 29686.5, ...
$ AMT_GOODS_PRICE            <dbl> 351000, 1129500, 135000, 297000, 51...
$ NAME_TYPE_SUITE            <chr> "Unaccompanied", "Family", "Unaccom...
$ NAME_INCOME_TYPE           <chr> "Working", "State servant", "Workin...
$ NAME_EDUCATION_TYPE        <chr> "Secondary / secondary special", "H...
$ NAME_FAMILY_STATUS         <chr> "Single / not married", "Married", ...
$ NAME_HOUSING_TYPE          <chr> "House / apartment", "House / apart...
$ REGION_POPULATION_RELATIVE <dbl> 0.018801, 0.003541, 0.010032, 0.008...
$ DAYS_BIRTH                 <int> -9461, -16765, -19046, -19005, -199...
$ DAYS_EMPLOYED              <int> -637, -1188, -225, -3039, -3038, -1...
$ DAYS_REGISTRATION          <dbl> -3648, -1186, -4260, -9833, -4311, ...
$ DAYS_ID_PUBLISH            <int> -2120, -291, -2531, -2437, -3458, -...
$ OWN_CAR_AGE                <dbl> NA, NA, 26, NA, NA, NA, 17, 8, NA, ...
$ FLAG_MOBIL                 <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ FLAG_EMP_PHONE             <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,...
$ FLAG_WORK_PHONE            <int> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
$ FLAG_CONT_MOBILE           <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ FLAG_PHONE                 <int> 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,...
$ FLAG_EMAIL                 <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ OCCUPATION_TYPE            <chr> "Laborers", "Core staff", "Laborers...
$ CNT_FAM_MEMBERS            <dbl> 1, 2, 1, 2, 1, 2, 3, 2, 2, 1, 3, 2,...
$ REGION_RATING_CLIENT       <int> 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2,...

After data transformation

### After data transformation
x_train_processed_tbl %>%
    select(1:30) %>%
    glimpse()
Observations: 307,511
Variables: 30
$ SK_ID_CURR                 <int> 100002, 100003, 100004, 100006, 100...
$ NAME_CONTRACT_TYPE         <fct> Cash loans, Cash loans, Revolving l...
$ CODE_GENDER                <fct> M, F, M, F, M, M, F, M, F, M, F, F,...
$ FLAG_OWN_CAR               <fct> N, N, Y, N, N, N, Y, Y, N, N, N, N,...
$ FLAG_OWN_REALTY            <fct> Y, N, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
$ CNT_CHILDREN               <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,...
$ AMT_INCOME_TOTAL           <dbl> 202500.00, 270000.00, 67500.00, 135...
$ AMT_CREDIT                 <dbl> 406597.5, 1293502.5, 135000.0, 3126...
$ AMT_ANNUITY                <dbl> 24700.5, 35698.5, 6750.0, 29686.5, ...
$ AMT_GOODS_PRICE            <dbl> 351000, 1129500, 135000, 297000, 51...
$ NAME_TYPE_SUITE            <fct> Unaccompanied, Family, Unaccompanie...
$ NAME_INCOME_TYPE           <fct> Working, State servant, Working, Wo...
$ NAME_EDUCATION_TYPE        <fct> Secondary / secondary special, High...
$ NAME_FAMILY_STATUS         <fct> Single / not married, Married, Sing...
$ NAME_HOUSING_TYPE          <fct> House / apartment, House / apartmen...
$ REGION_POPULATION_RELATIVE <dbl> 0.018801, 0.003541, 0.010032, 0.008...
$ DAYS_BIRTH                 <int> -9461, -16765, -19046, -19005, -199...
$ DAYS_EMPLOYED              <int> -637, -1188, -225, -3039, -3038, -1...
$ DAYS_REGISTRATION          <dbl> -3648, -1186, -4260, -9833, -4311, ...
$ DAYS_ID_PUBLISH            <int> -2120, -291, -2531, -2437, -3458, -...
$ OWN_CAR_AGE                <dbl> 12.06109, 12.06109, 26.00000, 12.06...
$ FLAG_MOBIL                 <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ FLAG_EMP_PHONE             <fct> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,...
$ FLAG_WORK_PHONE            <fct> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
$ FLAG_CONT_MOBILE           <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ FLAG_PHONE                 <fct> 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,...
$ FLAG_EMAIL                 <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ OCCUPATION_TYPE            <fct> Laborers, Core staff, Laborers, Lab...
$ CNT_FAM_MEMBERS            <dbl> 1, 2, 1, 2, 1, 2, 3, 2, 2, 1, 3, 2,...
$ REGION_RATING_CLIENT       <fct> 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2,...
# Remove objects that are no longer needed to save memory
rm(rec_obj)
rm(x_train_tbl)
rm(y_train_tbl)

2.3 Automated Machine Learning

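Because automated machine learning relies on \(H_2O\), first start an \(H_2O\) cluster with h2o.init(), then split the data into training, validation, and test sets, and finally call h2o.automl() to train candidate models and extract the best one.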
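
The `ratios` argument of `h2o.splitFrame()` gives the fractions for all partitions except the last, and the remainder goes to the final partition, so `ratios = c(0.5, 0.3)` yields roughly 50% / 30% / 20%. The base R sketch below imitates that idea with random assignment. It illustrates the concept and is not H2O's actual implementation, and `split_by_ratios()` is a hypothetical helper introduced only for this example.

```r
# Assign each of n rows to a partition according to `ratios`;
# whatever is left over (1 - sum(ratios)) forms the last partition.
split_by_ratios <- function(n, ratios, seed = 1234) {
    stopifnot(sum(ratios) < 1)
    set.seed(seed)
    breaks <- c(0, cumsum(ratios), 1)
    # Each row draws a uniform number; the interval it falls into
    # decides the partition (1 = train, 2 = validation, 3 = test).
    cut(runif(n), breaks = breaks, labels = FALSE, include.lowest = TRUE)
}

part <- split_by_ratios(n = 10000, ratios = c(0.5, 0.3))

# Observed share of rows per partition; close to, but not exactly,
# 0.5 / 0.3 / 0.2 because the assignment is random.
round(prop.table(table(part)), 2)
```

Because the assignment is probabilistic, partition sizes vary slightly from run to run unless the seed is fixed.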

# 3. Model development ------
## 3.0. H2O setup -----
h2o.init()

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    C:\Users\chongmu\AppData\Local\Temp\RtmpMLycSU/h2o_victorlee_started_from_r.out
    C:\Users\chongmu\AppData\Local\Temp\RtmpMLycSU/h2o_victorlee_started_from_r.err


Starting H2O JVM and connecting:  Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         5 seconds 956 milliseconds 
    H2O cluster timezone:       Asia/Seoul 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.20.0.2 
    H2O cluster version age:    1 month and 23 days  
    H2O cluster name:           H2O_started_from_R_victorlee_muv244 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.54 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    H2O API Extensions:         Algos, AutoML, Core V3, Core V4 
    R Version:                  R version 3.5.1 (2018-07-02) 
h2o.removeAll()
[1] 0
h2o.no_progress()

## 3.1. Training/validation/test data -----
data_h2o <- as.h2o(bind_cols(y_train_processed_tbl, x_train_processed_tbl))

splits_h2o <- h2o.splitFrame(data_h2o, ratios = c(0.5, 0.3), seed = 1234)

train_h2o <- splits_h2o[[1]]
valid_h2o <- splits_h2o[[2]]
test_h2o  <- splits_h2o[[3]]

## 3.2. H2O model -----

y <- "TARGET"
x <- setdiff(names(train_h2o), y)

automl_models_h2o <- h2o.automl(
    x = x,
    y = y,
    training_frame    = train_h2o,
    validation_frame  = valid_h2o,
    leaderboard_frame = test_h2o,
    max_runtime_secs  = 10    # AutoML time budget, in seconds
)

2.4 AutoML Model Performance

Take the leader model from the automl_models_h2o object produced by AutoML and examine its performance on the test data.

## 3.3. H2O model performance evaluation -----
automl_leader <- automl_models_h2o@leader

performance_h2o <- h2o.performance(automl_leader, newdata = test_h2o)

performance_h2o %>%
    h2o.confusionMatrix()

# Confusion Matrix (vertical: actual; across: predicted)  for max f1 @ threshold = 0.106174235650871:
#            0    1    Error         Rate
# 0      36250 6137 0.144785  =6137/42387
# 1       2347 1366 0.632103   =2347/3713
# Totals 38597 7503 0.184035  =8484/46100

performance_h2o %>%
    h2o.auc()

# 0.6962854
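
The confusion matrix is easier to interpret once it is reduced to a few summary rates. The sketch below recomputes them in base R from the four counts printed above. The variable names are illustrative, and the numbers are copied from the output rather than taken from a new model run.

```r
# Counts copied from the confusion matrix above
# (rows: actual class, columns: predicted class).
tn <- 36250; fp <- 6137   # actual 0
fn <- 2347;  tp <- 1366   # actual 1

total      <- tn + fp + fn + tp
error_rate <- (fp + fn) / total        # overall misclassification rate
precision  <- tp / (tp + fp)           # share of predicted defaults that are real
recall     <- tp / (tp + fn)           # share of real defaults that are caught
f1         <- 2 * precision * recall / (precision + recall)
base_rate  <- (tp + fn) / total        # share of defaults in the test data

round(c(error_rate = error_rate, precision = precision,
        recall = recall, f1 = f1, base_rate = base_rate), 4)
```

Only about 8% of the test rows are defaults, so a model that always predicts 0 would have a lower error rate than the 18.4% shown here while catching no defaults at all. That is one reason threshold-free measures such as AUC tend to be more informative than raw accuracy on imbalanced data like this.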