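Choosing the best machine learning architecture (GLM, NN, DL, SVM, GBM, xgBoost, Random Forest, ...) and tuning it into an optimal model is important, but applying a model to a business problem means working within several constraints and optimizing under them.
From this perspective, rather than building the highest-performing model, it can be more valuable to adopt "Agile ML development" and, taken to the extreme, build a predictive model that reaches 90% of the best achievable performance within 30 minutes.
Following the example in Matt Dancho (August 7, 2018), "KAGGLE COMPETITION IN 30 MINUTES: PREDICT HOME CREDIT DEFAULT RISK WITH R", Business Science, let us demonstrate building and deploying a machine learning model within 30 minutes.
The strategy for automating predictive model development is to leave the raw data as is, apply a suitable transformation to each variable, impute missing values, then develop an \(H_2O\) AutoML model and finish by deploying it to production.
The first step is to load the raw data (application_train.csv) and, before feeding it to the AutoML engine \(H_2O\), specify how missing values are handled and how variable types are converted.
Download the Kaggle - Home Credit Default Risk dataset and use the kable function to check that the raw data was read into a data frame correctly.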
# 0. Setup -----
# General
library(tidyverse)
library(skimr)
# Table display: provides kable(), kable_styling(), scroll_box()
library(knitr)
library(kableExtra)
# Preprocessing
library(recipes)
# Machine Learning
library(h2o)
# 1. Data -----
## 1.1. Load the data
application_train_tbl <- read_csv("data/application_train.csv")
## 1.2. Glance at the first 10 rows
application_train_tbl %>%
  slice(1:10) %>%
  kable() %>%
  kable_styling() %>%
  scroll_box(width = "800px")
SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
100002 | 1 | Cash loans | M | N | Y | 0 | 202500 | 406597.5 | 24700.5 | 351000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648 | -2120 | NA | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.0830370 | 0.2629486 | 0.1393758 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0 | 0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2 | 2 | 2 | 2 | -1134 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
100003 | 0 | Cash loans | F | N | N | 0 | 270000 | 1293502.5 | 35698.5 | 1129500 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186 | -291 | NA | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.3112673 | 0.6222458 | NA | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0 | 0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1 | 0 | 1 | 0 | -828 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500 | 135000.0 | 6750.0 | 135000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260 | -2531 | 26 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NA | 0.5559121 | 0.7295667 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 0 | -815 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100006 | 0 | Cash loans | F | N | Y | 0 | 135000 | 312682.5 | 29686.5 | 297000 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833 | -2437 | NA | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NA | 0.6504417 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 0 | 2 | 0 | -617 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA |
100007 | 0 | Cash loans | M | N | Y | 0 | 121500 | 513000.0 | 21865.5 | 513000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311 | -3458 | NA | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NA | 0.3227383 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 0 | -1106 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100008 | 0 | Cash loans | M | N | Y | 0 | 99000 | 490495.5 | 27517.5 | 454500 | Spouse, partner | State servant | Secondary / secondary special | Married | House / apartment | 0.035792 | -16941 | -1588 | -4970 | -477 | NA | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 16 | 0 | 0 | 0 | 0 | 0 | 0 | Other | NA | 0.3542247 | 0.6212263 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 0 | 0 | 0 | 0 | -2536 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
100009 | 0 | Cash loans | F | Y | Y | 1 | 171000 | 1560726.0 | 41301.0 | 1395000 | Unaccompanied | Commercial associate | Higher education | Married | House / apartment | 0.035792 | -13778 | -3130 | -1213 | -619 | 17 | 1 | 1 | 0 | 1 | 1 | 0 | Accountants | 3 | 2 | 2 | SUNDAY | 16 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.7747614 | 0.7239999 | 0.4920601 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 0 | 1 | 0 | -1562 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 2 |
100010 | 0 | Cash loans | M | Y | Y | 0 | 360000 | 1530000.0 | 42075.0 | 1530000 | Unaccompanied | State servant | Higher education | Married | House / apartment | 0.003122 | -18850 | -449 | -4597 | -2379 | 8 | 1 | 1 | 1 | 1 | 0 | 0 | Managers | 2 | 3 | 3 | MONDAY | 16 | 0 | 0 | 0 | 0 | 1 | 1 | Other | NA | 0.7142793 | 0.5406545 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 0 | 2 | 0 | -1070 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
100011 | 0 | Cash loans | F | N | Y | 0 | 112500 | 1019610.0 | 33826.5 | 913500 | Children | Pensioner | Secondary / secondary special | Married | House / apartment | 0.018634 | -20099 | 365243 | -7427 | -3514 | NA | 1 | 0 | 0 | 1 | 0 | 0 | NA | 2 | 2 | 2 | WEDNESDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.5873340 | 0.2057473 | 0.7517237 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
100012 | 0 | Revolving loans | M | N | Y | 0 | 135000 | 405000.0 | 20250.0 | 405000 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.019689 | -14469 | -2019 | -14437 | -3992 | NA | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 1 | 2 | 2 | THURSDAY | 8 | 0 | 0 | 0 | 0 | 0 | 0 | Electricity | NA | 0.7466436 | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | NA | 2 | 0 | 2 | 0 | -1673 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NA | NA | NA | NA | NA | NA |
## 1.3. Separate predictors (x) and target (y)
x_train_tbl <- application_train_tbl %>% select(-TARGET)
y_train_tbl <- application_train_tbl %>% select(TARGET)
# Remove objects that are no longer needed to save memory
rm(application_train_tbl)
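Once the raw data is loaded, the next step is to identify the character variables and store their names in the string_2_factor_names vector. Numeric variables that can be coded as categorical, here those with fewer than 7 unique values, are treated as factors and their names are stored in the num_2_factor_names vector. Finally, missing numeric values are imputed with the mean and missing categorical values with the mode.
The recipes package is used to record the specification of these variable transformations and imputations.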
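The rules described above can be sketched in base R on a small, made-up data frame before applying them to the real data. This is only an illustration under stated assumptions: toy_tbl, impute_mean() and impute_mode() are hypothetical names introduced here, and the recipe used in the tutorial does the same work with recipes steps rather than with these helpers.

```r
# Toy data (made-up values, not taken from the Home Credit dataset)
toy_tbl <- data.frame(
  income  = c(100, 200, NA, 300, 400, 500, 600, 700),
  flag    = c(0, 1, 1, NA, 0, 0, 1, 0),
  housing = c("House", NA, "House", "Rented", "House", "Rented", "House", "House"),
  stringsAsFactors = FALSE
)

factor_limit <- 7  # same threshold as in the text

# Character columns are always converted to factors
string_cols <- names(toy_tbl)[sapply(toy_tbl, is.character)]

# Numeric columns with fewer than factor_limit unique values (NA counts as one)
numeric_cols <- names(toy_tbl)[sapply(toy_tbl, is.numeric)]
n_unique     <- sapply(toy_tbl[numeric_cols], function(x) length(unique(x)))
num_cols     <- numeric_cols[n_unique < factor_limit]

# Hypothetical helpers: mean imputation and mode imputation
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}
impute_mode <- function(x) {
  counts <- table(x)  # table() ignores NA by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

# Convert to factors first, then impute by type (same order as the recipe)
toy_processed <- toy_tbl
factor_cols <- c(string_cols, num_cols)
toy_processed[factor_cols] <- lapply(toy_processed[factor_cols], as.factor)

is_num <- sapply(toy_processed, is.numeric)
toy_processed[is_num]  <- lapply(toy_processed[is_num], impute_mean)
toy_processed[!is_num] <- lapply(toy_processed[!is_num], impute_mode)

str(toy_processed)
```

In this toy example income keeps its numeric type because it has 8 distinct values, while flag has only 3 (0, 1 and NA) and is therefore treated as a factor and imputed with its most frequent level.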
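# 2. Data exploration -----
## 2.1. Inspect data types
# Note: skim_to_list() comes from skimr 1.x; later skimr releases are
# understood to offer skim() %>% partition() for the same purpose.
skim_to_list(x_train_tbl)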
$character
# A tibble: 16 x 8
variable missing complete n min max empty n_unique
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 CODE_GENDER 0 307511 3075~ 1 3 0 3
2 EMERGENCYSTATE_MODE 145755 161756 3075~ 2 3 0 2
3 FLAG_OWN_CAR 0 307511 3075~ 1 1 0 2
4 FLAG_OWN_REALTY 0 307511 3075~ 1 1 0 2
5 FONDKAPREMONT_MODE 210295 97216 3075~ 13 21 0 4
6 HOUSETYPE_MODE 154297 153214 3075~ 14 16 0 3
7 NAME_CONTRACT_TYPE 0 307511 3075~ 10 15 0 2
8 NAME_EDUCATION_TYPE 0 307511 3075~ 15 29 0 5
9 NAME_FAMILY_STATUS 0 307511 3075~ 5 20 0 6
10 NAME_HOUSING_TYPE 0 307511 3075~ 12 19 0 6
11 NAME_INCOME_TYPE 0 307511 3075~ 7 20 0 8
12 NAME_TYPE_SUITE 1292 306219 3075~ 6 15 0 7
13 OCCUPATION_TYPE 96391 211120 3075~ 7 21 0 18
14 ORGANIZATION_TYPE 0 307511 3075~ 3 22 0 58
15 WALLSMATERIAL_MODE 156341 151170 3075~ 5 12 0 7
16 WEEKDAY_APPR_PROCESS~ 0 307511 3075~ 6 9 0 7
$integer
# A tibble: 40 x 12
variable missing complete n mean sd p0 p25 p50 p75
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 CNT_CHIL~ 0 307511 307511 " ~ " ~ 0 " ~ 0 " ~
2 DAYS_BIR~ 0 307511 307511 "-160~ " 43~ -252~ "-19~ -157~ "-12~
3 DAYS_EMP~ 0 307511 307511 " 638~ "1412~ -179~ " -2~ -1213 " -~
4 DAYS_ID_~ 0 307511 307511 " -29~ " 15~ -7197 " -4~ -3254 " -1~
5 FLAG_CON~ 0 307511 307511 " ~ " ~ 0 " ~ 1 " ~
6 FLAG_DOC~ 0 307511 307511 " ~ " ~ 0 " ~ 0 " ~
7 FLAG_DOC~ 0 307511 307511 " ~ " ~ 0 " ~ 0 " ~
8 FLAG_DOC~ 0 307511 307511 " ~ " ~ 0 " ~ 0 " ~
9 FLAG_DOC~ 0 307511 307511 " ~ " ~ 0 " ~ 0 " ~
10 FLAG_DOC~ 0 307511 307511 " ~ " ~ 0 " ~ 0 " ~
# ... with 30 more rows, and 2 more variables: p100 <chr>, hist <chr>
$numeric
# A tibble: 65 x 12
variable missing complete n mean sd p0 p25 p50 p75
* <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 AMT_ANNUI~ 12 307499 307511 " 271~ " 14~ " 1~ " 16~ " 24~ " 34~
2 AMT_CREDIT 0 307511 307511 " 6e+~ " 4e~ " 45~ "270~ "513~ "808~
3 AMT_GOODS~ 278 307233 307511 "5383~ "369~ " 40~ "238~ "450~ "679~
4 AMT_INCOM~ 0 307511 307511 "1687~ "237~ " 25~ "112~ "147~ " 2e~
5 AMT_REQ_C~ 41519 265992 307511 " ~ " ~ " ~ " ~ " ~ " ~
6 AMT_REQ_C~ 41519 265992 307511 " ~ " ~ " ~ " ~ " ~ " ~
7 AMT_REQ_C~ 41519 265992 307511 " ~ " ~ " ~ " ~ " ~ " ~
8 AMT_REQ_C~ 41519 265992 307511 " ~ " ~ " ~ " ~ " ~ " ~
9 AMT_REQ_C~ 41519 265992 307511 " ~ " ~ " ~ " ~ " ~ " ~
10 AMT_REQ_C~ 41519 265992 307511 " ~ " ~ " ~ " ~ " ~ " ~
# ... with 55 more rows, and 2 more variables: p100 <chr>, hist <chr>
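## 2.2. Preprocessing strategy
### Character -> factor
string_2_factor_names <- x_train_tbl %>%
  select_if(is.character) %>%
  names()
# string_2_factor_names
### Numeric -> factor (few unique values)
# Count the unique values (NA included) of every numeric column
unique_numeric_values_tbl <- x_train_tbl %>%
  select_if(is.numeric) %>%
  map_df(~ unique(.) %>% length()) %>%
  gather() %>%
  arrange(value) %>%
  mutate(key = as_factor(key))
# unique_numeric_values_tbl
# Numeric columns with fewer than factor_limit unique values become factors
factor_limit <- 7
num_2_factor_names <- unique_numeric_values_tbl %>%
  filter(value < factor_limit) %>%
  arrange(desc(value)) %>%
  pull(key) %>%
  as.character()
# num_2_factor_names
### Missing values
# Share of missing values per column, keeping only columns with any NA
missing_tbl <- x_train_tbl %>%
  summarize_all(.funs = ~ sum(is.na(.)) / length(.)) %>%
  gather() %>%
  arrange(desc(value)) %>%
  filter(value > 0)
# missing_tbl
## Preprocess the data according to the recipe
# Note: written against the 2018 recipes API. Newer recipes releases rename
# step_meanimpute()/step_modeimpute() to step_impute_mean()/step_impute_mode(),
# rename the prep() argument to strings_as_factors, and require a `levels`
# argument in step_num2factor().
rec_obj <- recipe(~ ., data = x_train_tbl) %>%
  step_string2factor(string_2_factor_names) %>%
  step_num2factor(num_2_factor_names) %>%
  step_meanimpute(all_numeric()) %>%
  step_modeimpute(all_nominal()) %>%
  prep(stringsAsFactors = FALSE)
# rec_obj
x_train_processed_tbl <- bake(rec_obj, x_train_tbl)
### Convert the Y variable to a factor (binary classification target)
y_train_processed_tbl <- y_train_tbl %>%
  mutate(TARGET = TARGET %>% as.character() %>% as.factor())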
Before data transformation
### Before data transformation
x_train_tbl %>%
select(1:30) %>%
glimpse()
Observations: 307,511
Variables: 30
$ SK_ID_CURR <int> 100002, 100003, 100004, 100006, 100...
$ NAME_CONTRACT_TYPE <chr> "Cash loans", "Cash loans", "Revolv...
$ CODE_GENDER <chr> "M", "F", "M", "F", "M", "M", "F", ...
$ FLAG_OWN_CAR <chr> "N", "N", "Y", "N", "N", "N", "Y", ...
$ FLAG_OWN_REALTY <chr> "Y", "N", "Y", "Y", "Y", "Y", "Y", ...
$ CNT_CHILDREN <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,...
$ AMT_INCOME_TOTAL <dbl> 202500.00, 270000.00, 67500.00, 135...
$ AMT_CREDIT <dbl> 406597.5, 1293502.5, 135000.0, 3126...
$ AMT_ANNUITY <dbl> 24700.5, 35698.5, 6750.0, 29686.5, ...
$ AMT_GOODS_PRICE <dbl> 351000, 1129500, 135000, 297000, 51...
$ NAME_TYPE_SUITE <chr> "Unaccompanied", "Family", "Unaccom...
$ NAME_INCOME_TYPE <chr> "Working", "State servant", "Workin...
$ NAME_EDUCATION_TYPE <chr> "Secondary / secondary special", "H...
$ NAME_FAMILY_STATUS <chr> "Single / not married", "Married", ...
$ NAME_HOUSING_TYPE <chr> "House / apartment", "House / apart...
$ REGION_POPULATION_RELATIVE <dbl> 0.018801, 0.003541, 0.010032, 0.008...
$ DAYS_BIRTH <int> -9461, -16765, -19046, -19005, -199...
$ DAYS_EMPLOYED <int> -637, -1188, -225, -3039, -3038, -1...
$ DAYS_REGISTRATION <dbl> -3648, -1186, -4260, -9833, -4311, ...
$ DAYS_ID_PUBLISH <int> -2120, -291, -2531, -2437, -3458, -...
$ OWN_CAR_AGE <dbl> NA, NA, 26, NA, NA, NA, 17, 8, NA, ...
$ FLAG_MOBIL <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ FLAG_EMP_PHONE <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,...
$ FLAG_WORK_PHONE <int> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
$ FLAG_CONT_MOBILE <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ FLAG_PHONE <int> 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,...
$ FLAG_EMAIL <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ OCCUPATION_TYPE <chr> "Laborers", "Core staff", "Laborers...
$ CNT_FAM_MEMBERS <dbl> 1, 2, 1, 2, 1, 2, 3, 2, 2, 1, 3, 2,...
$ REGION_RATING_CLIENT <int> 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2,...
After data transformation
### After data transformation
x_train_processed_tbl %>%
select(1:30) %>%
glimpse()
Observations: 307,511
Variables: 30
$ SK_ID_CURR <int> 100002, 100003, 100004, 100006, 100...
$ NAME_CONTRACT_TYPE <fct> Cash loans, Cash loans, Revolving l...
$ CODE_GENDER <fct> M, F, M, F, M, M, F, M, F, M, F, F,...
$ FLAG_OWN_CAR <fct> N, N, Y, N, N, N, Y, Y, N, N, N, N,...
$ FLAG_OWN_REALTY <fct> Y, N, Y, Y, Y, Y, Y, Y, Y, Y, Y, Y,...
$ CNT_CHILDREN <int> 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0,...
$ AMT_INCOME_TOTAL <dbl> 202500.00, 270000.00, 67500.00, 135...
$ AMT_CREDIT <dbl> 406597.5, 1293502.5, 135000.0, 3126...
$ AMT_ANNUITY <dbl> 24700.5, 35698.5, 6750.0, 29686.5, ...
$ AMT_GOODS_PRICE <dbl> 351000, 1129500, 135000, 297000, 51...
$ NAME_TYPE_SUITE <fct> Unaccompanied, Family, Unaccompanie...
$ NAME_INCOME_TYPE <fct> Working, State servant, Working, Wo...
$ NAME_EDUCATION_TYPE <fct> Secondary / secondary special, High...
$ NAME_FAMILY_STATUS <fct> Single / not married, Married, Sing...
$ NAME_HOUSING_TYPE <fct> House / apartment, House / apartmen...
$ REGION_POPULATION_RELATIVE <dbl> 0.018801, 0.003541, 0.010032, 0.008...
$ DAYS_BIRTH <int> -9461, -16765, -19046, -19005, -199...
$ DAYS_EMPLOYED <int> -637, -1188, -225, -3039, -3038, -1...
$ DAYS_REGISTRATION <dbl> -3648, -1186, -4260, -9833, -4311, ...
$ DAYS_ID_PUBLISH <int> -2120, -291, -2531, -2437, -3458, -...
$ OWN_CAR_AGE <dbl> 12.06109, 12.06109, 26.00000, 12.06...
$ FLAG_MOBIL <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ FLAG_EMP_PHONE <fct> 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,...
$ FLAG_WORK_PHONE <fct> 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,...
$ FLAG_CONT_MOBILE <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
$ FLAG_PHONE <fct> 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1,...
$ FLAG_EMAIL <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
$ OCCUPATION_TYPE <fct> Laborers, Core staff, Laborers, Lab...
$ CNT_FAM_MEMBERS <dbl> 1, 2, 1, 2, 1, 2, 3, 2, 2, 1, 3, 2,...
$ REGION_RATING_CLIENT <fct> 2, 1, 2, 2, 2, 2, 2, 3, 2, 2, 2, 2,...
# Remove objects that are no longer needed to save memory
rm(rec_obj)
rm(x_train_tbl)
rm(y_train_tbl)
Automated machine learning relies on \(H_2O\), so start an \(H_2O\) cluster with h2o.init(), split the data into training, validation, and test sets, and then call the h2o.automl() function to train the candidates and extract the best model.
# 3. Model development ------
## 3.0. H2O setup -----
h2o.init()
H2O is not running yet, starting it now...
Note: In case of errors look at the following log files:
C:\Users\chongmu\AppData\Local\Temp\RtmpMLycSU/h2o_victorlee_started_from_r.out
C:\Users\chongmu\AppData\Local\Temp\RtmpMLycSU/h2o_victorlee_started_from_r.err
Starting H2O JVM and connecting: Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 5 seconds 956 milliseconds
H2O cluster timezone: Asia/Seoul
H2O data parsing timezone: UTC
H2O cluster version: 3.20.0.2
H2O cluster version age: 1 month and 23 days
H2O cluster name: H2O_started_from_R_victorlee_muv244
H2O cluster total nodes: 1
H2O cluster total memory: 3.54 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Algos, AutoML, Core V3, Core V4
R Version: R version 3.5.1 (2018-07-02)
h2o.removeAll()
[1] 0
h2o.no_progress()
## 3.1. Training/validation/test data -----
data_h2o <- as.h2o(bind_cols(y_train_processed_tbl, x_train_processed_tbl))
splits_h2o <- h2o.splitFrame(data_h2o, ratios = c(0.5, 0.3), seed = 1234)
train_h2o <- splits_h2o[[1]]
valid_h2o <- splits_h2o[[2]]
test_h2o <- splits_h2o[[3]]
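## 3.2. H2O model -----
y <- "TARGET"
x <- setdiff(names(train_h2o), y)
# max_runtime_secs = 10 is a very short time budget used for this demo;
# a larger value gives AutoML time to train more candidate models.
automl_models_h2o <- h2o.automl(
  x = x,
  y = y,
  training_frame = train_h2o,
  validation_frame = valid_h2o,
  leaderboard_frame = test_h2o,
  max_runtime_secs = 10
)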
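The ratios c(0.5, 0.3) passed to h2o.splitFrame() above name the shares of the first two frames, and the remaining 0.2 goes to the third. A minimal base R sketch of such a three-way split follows, assuming each row is assigned independently from one uniform random draw, so the resulting shares are approximate rather than exact. It is an illustration of the idea, not H2O's implementation, and n and split_id are hypothetical names.

```r
set.seed(1234)

n      <- 10000        # hypothetical number of rows
ratios <- c(0.5, 0.3)  # same ratios as above; the remainder (0.2) is the third split

# One uniform draw per row, cut at the cumulative ratios 0.5 and 0.8
u        <- runif(n)
breaks   <- c(0, cumsum(ratios), 1)
split_id <- cut(u, breaks = breaks,
                labels = c("train", "valid", "test"),
                include.lowest = TRUE)

# Realised share of each split; close to 0.5 / 0.3 / 0.2 but not exact
split_share <- as.numeric(prop.table(table(split_id)))
print(round(split_share, 3))
```

Because the assignment is random, two runs with different seeds give slightly different split sizes, which is why the seed is fixed in the tutorial code as well.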
Next, examine the performance of the leader model in the automl_models_h2o object produced by AutoML.
## 3.3. H2O model evaluation -----
automl_leader <- automl_models_h2o@leader
performance_h2o <- h2o.performance(automl_leader, newdata = test_h2o)
performance_h2o %>%
h2o.confusionMatrix()
# Confusion Matrix (vertical: actual; across: predicted) for max f1 @ # threshold = 0.106174235650871:
# 0 1 Error Rate
# 0 36250 6137 0.144785 =6137/42387
# 1 2347 1366 0.632103 =2347/3713
# Totals 38597 7503 0.184035 =8484/46100
performance_h2o %>%
h2o.auc()
# 0.6962854
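The confusion matrix printed above can be turned into the usual classification metrics with a few lines of arithmetic. The sketch below re-enters the four counts by hand (cm and the metric names are introduced here for illustration) and derives accuracy, recall, precision and F1 at the max-F1 threshold that H2O reported.

```r
# Counts copied from the confusion matrix above (rows: actual, columns: predicted)
cm <- matrix(c(36250, 6137,
                2347, 1366),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("0", "1"), predicted = c("0", "1")))

tn <- cm["0", "0"]  # non-defaults predicted as non-defaults
fp <- cm["0", "1"]  # non-defaults predicted as defaults
fn <- cm["1", "0"]  # defaults that were missed
tp <- cm["1", "1"]  # defaults that were caught

accuracy  <- (tp + tn) / sum(cm)   # 1 - overall error rate
recall    <- tp / (tp + fn)        # 1 - error rate of class 1
precision <- tp / (tp + fp)
f1        <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, recall = recall,
              precision = precision, f1 = f1), 3))
```

At this threshold the model catches roughly 37% of the actual defaults, and about 18% of the cases it flags are real defaults, which is in line with an AUC near 0.70 obtained under a 10-second training budget.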