1 펭귄 데이터셋 ¹

데이터 과학의 기본 데이터셋으로 인정받던 붓꽃(Iris) 데이터가 이제는 자리를 펭귄 데이터셋에게 자리를 물려주는 듯 싶다. 기초 통계학, 시각화 뿐만 아니라 tidymodels에도 이를 기본 데이터셋으로 활용하는 사례가 심심치 않게 보여주고 있다. 이런 추세는 최근 TidyTuesday 2020년 31번째 데이터셋(2020-07-28)으로 팔머 팽귄(Palmer Penguins)으로 당첨되면서 Iris 데이터를 대체하는 것은 기정사실화 된 듯 싶다.

library(tidyverse)
tuesdata <- tidytuesdayR::tt_load('2020-07-28')


    Downloading file 1 of 2: `penguins.csv`
    Downloading file 2 of 2: `penguins_raw.csv`

penguin <- tuesdata$penguins

penguin

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen           NA            NA                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# … with 334 more rows, and 2 more variables: sex <chr>, year <dbl>

2 성별예측 초모수 모형 ²

초모수(Hyper Parameter)를 활용할 수 있는 예측모형이 더 성능이 높은 것으로 알려져 있어, 펭귄 데이터셋에서 성별(sex)을 예측하는 모형을 구축해본다. 이를 위해서 사용되는 기본적인 팩키지는 기존 단순 훈련/시험 예측모형에 추가하여 tune, dials 팩키지가 추가된다.

훈련/시험 데이터 구분을 위해서 `rsample``
피처 공학을 위해서 recipes
예측모형을 지정하기 위해서 parsnip
예측모형 작업흐름 관리를 위해서 workflows
예측모형 성능평가를 위해서 yardstick
초모수 튜닝: tune, dials

library(tidyverse)
library(magick)

pkgs <- c('rsample', 'recipes', 'parsnip', 'tune', 'dials', 'workflows', 'yardstick')

pkgs_path <- glue::glue("https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/{pkgs}.svg")

pkgs_img <- purrr::map(pkgs_path, magick::image_read) %>% 
  magick::image_join(.)

pkgs_img %>% 
  image_scale('150') %>% 
  image_append(stack = FALSE)

펭귄 데이터셋에서 성별(sex)을 예측하는 모형을 구축해본다.

3 데이터 전처리

나름 잘 정리되어 있어 결측값만 처리하고 연도와 펭귄이 거주하고 있는 섬만 제거한다.

penguin_tbl <- penguin %>%
  filter(!is.na(sex)) %>%
  select(-year, -island) %>% 
  mutate_if(is.character, as.factor)

3.1 훈련/교차검증/시험 데이터셋

initial_split() 함수로 훈련/시험 데이터셋으로 나누고, vfold_cv()함수를 사용해서 hyper parameter 튜닝 등을 통한 최적 모형 선정을 위해서 교차검증 데이터셋을 준비한다.

library(tidymodels)

splits <- initial_split(penguin_tbl, prop = 0.8, strata = sex)

penguin_cv <- vfold_cv(training(splits), v =3, repeats = 1)

3.2 피처 공학

피처 공학(Feature Engineering)을 적용하여 예측모형 정확도 향상에 사용될 모형 행렬(Model Matrix)을 준비한다.

penguin_recipe <- recipe(sex ~ ., data = training(splits)) %>%
  step_corr(all_numeric(), threshold = 0.9) %>% 
  step_normalize(all_numeric()) %>% 
  step_dummy(all_nominal(), -all_outcomes())

penguin_recipe

Recipe

Inputs:

      role #variables
   outcome          1
 predictor          5

Operations:

Correlation filter on all_numeric()
Centering and scaling for all_numeric()
Dummy variables from all_nominal(), -all_outcomes()

4 기본설정 모형

펭귄 성별 예측에 사용될 모형 아키텍처를 선정하는데 우선 가장 많이 사용되는 GLM, Random Forest, MARS 모형을 선택한다. GLM 모형이 가장 좋은 성능을 앞서 보여줬기 때문에 Baseline 모형으로 GLM(일반화 선형 모형)을 선택하고 Hyper Parameter 튜닝을 하여 성능을 좀더 높여보도록 한다. Search parsnip models 웹사이트에 모형에 대한 전반적인 사항을 파악할 수 있다.

4.1 모형 아키텍처

펭귄 성별 예측에 사용될 모형 아키텍처를 선정하는데 우선 가장 많이 사용되는 GLM 모형이 기본 평타를 치는 모형으로 삼고 Random Forest를 그 다음 가장 많이 사용하는 인기 모형이라 이것을 모형 아키텍처로 삼는다. ranger 엔진을 사용하지만, 초모수에 대해서는 아무런 설정도 하지 않는다.

rf_spec <- rand_forest() %>%
  set_mode("classification") %>% 
  set_engine("ranger") 

rf_spec

Random Forest Model Specification (classification)

Computational engine: ranger

4.2 모형 작업흐름

workflows 팩키지를 사용해서 예측 모형 작업흐름을 완성한다. 먼저 workflow()를 설정한다. 다양한 모형조합을 작업흐름에 맞춰 유연하고 수월히 최적 모형 구축을 할 수 있다.

penguin_wf <- workflow() %>%
  add_recipe(penguin_recipe) %>% 
  add_model(rf_spec)

penguin_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_corr()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Computational engine: ranger

4.3 바닐라 모형 적합(fit)

CV 교차검증 표본을 대상으로 각 모형에 적합시킨다. Hyper Parameter는 탐색범위가 정해지지 않았기 때문에 Random Forest에서 설정된 기본값들이 CV 교차검증 표본에 대해서 vfold 를 10 모둠으로 지정했기 때문에 10번 훈련 표본에 대해 적합을 시도한다.

doParallel::registerDoParallel()

rf_vanilla_rs <- penguin_wf  %>% 
  fit_resamples(
    resamples = penguin_cv,
    control = control_resamples(save_pred = TRUE),
    metrics = metric_set(accuracy, kap, roc_auc)
  )

collect_metrics() 또는 show_best() 함수로 최적 Hyper Paramter를 선정할 수 있다.

rf_vanilla_rs %>% 
  show_best("roc_auc")

# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 roc_auc binary     0.959     3  0.0117 Preprocessor1_Model1

이제 최적 모형을 만들어서 기준 roc_auc로 RF 모형을 생성한다.

best_roc_auc <- select_best(rf_vanilla_rs, "roc_auc")
best_roc_auc

# A tibble: 1 × 1
  .config             
  <chr>               
1 Preprocessor1_Model1

rf_vanilla_final <- finalize_workflow(penguin_wf, best_roc_auc)
rf_vanilla_final

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_corr()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Computational engine: ranger

4.4 시험 데이터 성능측정

last_fit()함수를 사용하여 최종 모형을 아주 귀중한 시험 데이터(Test Data)에 적용시켜 최종 예측모형 성능을 평가한다.

rf_vanilla_final_fit <- rf_vanilla_final %>%
  last_fit(splits)

rf_vanilla_final_fit %>% 
  collect_metrics()

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.940 Preprocessor1_Model1
2 roc_auc  binary         0.981 Preprocessor1_Model1

많이 친숙한 confusion matrix를 작성해 마무리한다.

rf_vanilla_final_fit %>% 
  collect_predictions() %>%
  conf_mat(sex, .pred_class)

          Truth
Prediction female male
    female     30    1
    male        3   33

5 Hyper Parameter 찾기

show_model_info() 함수를 사용해서 rand_forest에 대한 Random Forest 모형에 대한 자세한 정보를 얻을 수 있다.

show_model_info("rand_forest")

Information for `rand_forest`
 modes: unknown, classification, regression, censored regression 

 engines: 
   classification: randomForest, ranger¹, spark
   regression:     randomForest, ranger¹, spark

¹The model can use case weights.

 arguments: 
   ranger:       
      mtry  --> mtry
      trees --> num.trees
      min_n --> min.node.size
   randomForest: 
      mtry  --> mtry
      trees --> ntree
      min_n --> nodesize
   spark:        
      mtry  --> feature_subset_strategy
      trees --> num_trees
      min_n --> min_instances_per_node

 fit modules:
         engine           mode
         ranger classification
         ranger     regression
   randomForest classification
   randomForest     regression
          spark classification
          spark     regression

 prediction modules:
             mode       engine                    methods
   classification randomForest           class, prob, raw
   classification       ranger class, conf_int, prob, raw
   classification        spark                class, prob
       regression randomForest               numeric, raw
       regression       ranger     conf_int, numeric, raw
       regression        spark                    numeric

또한, show_engines() 사용해서 Random Forest 모형에 대해 사용가능한 엔진과 분류문제냐 회귀문제냐를 확인할 수 있다.

show_engines('rand_forest')

# A tibble: 6 × 2
  engine       mode          
  <chr>        <chr>         
1 ranger       classification
2 ranger       regression    
3 randomForest classification
4 randomForest regression    
5 spark        classification
6 spark        regression

6 Hyper Parameter 모형: `tune`

6.1 모형 아키텍처

Random Forest 모형을 선정하고 엔진으로 randomForest 대신 ranger 엔진을 사용하고, Hyper Parameter를 tune() 함수로 지정한다.

rf_tune_spec <- rand_forest(mtry  = tune(), 
                       trees = tune(),
                       min_n = tune()) %>%
  set_mode("classification") %>% 
  set_engine("ranger") 

rf_tune_spec

Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger

6.2 모형 작업흐름

penguin_tune_wf <- penguin_wf %>%
  update_model(rf_tune_spec)

penguin_tune_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_corr()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger

6.3 `tune` 모형 적합(fit)

CV 교차검증 표본을 대상으로 각 모형에 적합시킨다. Hyper Parameter를 지정했기 때문에 Random Forest에서 해당 초모수에 대해 모형적합을 실시한다. CV 교차검증 표본에 대해서 vfold 를 10모둠으로 지정했기 때문에 10번 훈련 표본에 대해 적합을 시도한다. fit_resamples() 함수 대신 앞서 지정한 Hyper Parameter를 대상으로 모형적합을 진행해야 되기 때문에 tune_grid() 함수를 사용한다. 이를 위해서 크게 두가지 방식이 존재한다. 하나는 격자를 손수 수작업으로 지정하는 방식이고 다른 하나는 무작위로 탐색갯수를 지정하는 것이다.

수작업으로 지정하는 방식	무작위로 갯수만 지정하는 방식
`tune_grid( resamples = folds_5, grid = expand.grid( mtry = c(1, 3, 5), trees = c(500, 1000, 2000) ) )`	`tune_grid( resamples = folds_5, grid = 5 )`

doParallel::registerDoParallel()

rf_tune_rs <- penguin_tune_wf  %>% 
  tune_grid(
    resamples = penguin_cv,
    grid      = 10,
    control   = control_resamples(save_pred = TRUE),
    metrics   = metric_set(accuracy, kap, roc_auc)
  )

collect_metrics() 또는 show_best() 함수로 최적 Hyper Paramter를 선정할 수 있다.

rf_tune_rs %>% 
  show_best("roc_auc", n = 3)

# A tibble: 3 × 9
   mtry trees min_n .metric .estimator  mean     n std_err .config              
  <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1     1   334     3 roc_auc binary     0.960     3 0.00804 Preprocessor1_Model07
2     2  1680    19 roc_auc binary     0.957     3 0.0143  Preprocessor1_Model06
3     4  1022    10 roc_auc binary     0.955     3 0.0146  Preprocessor1_Model04

이제 최적 모형을 만들어서 기준 roc_auc로 RF 모형을 생성한다. finalize_workflow()를 통해 최적으로 선택된 Hyper Parameter 조합을 살펴볼 수 있다.

best_tune_roc_auc <- select_best(rf_tune_rs, "roc_auc")
best_tune_roc_auc

# A tibble: 1 × 4
   mtry trees min_n .config              
  <int> <int> <int> <chr>                
1     1   334     3 Preprocessor1_Model07

rf_tune_final <- finalize_workflow(penguin_tune_wf, best_tune_roc_auc)
rf_tune_final

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_corr()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 1
  trees = 334
  min_n = 3

Computational engine: ranger

6.4 시험 데이터 성능측정

Random Forest 모형의 Hyper Parameter를 선정하여 last_fit()함수를 사용하여 최종 모형을 아주 귀중한 시험 데이터(Test Data)에 적용시켜 최종 예측모형 성능을 평가한다.

rf_tune_final_fit <- rf_tune_final %>%
  last_fit(splits)

rf_tune_final_fit %>% 
  collect_metrics()

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.910 Preprocessor1_Model1
2 roc_auc  binary         0.980 Preprocessor1_Model1

많이 친숙한 confusion matrix를 작성해 마무리한다.

rf_tune_final_fit %>% 
  collect_predictions() %>%
  conf_mat(sex, .pred_class)

          Truth
Prediction female male
    female     29    2
    male        4   32

7 Hyper Parameter 모형: `dials`

7.1 모형 아키텍처

Random Forest 모형을 선정하고 엔진으로 randomForest 대신 ranger 엔진을 사용하고, Hyper Parameter를 tune() 함수로 지정하는 것 까지는 동일하다.

rf_dials_spec <- rand_forest(mtry  = tune(), 
                       trees = tune(),
                       min_n = tune()) %>%
  set_mode("classification") %>% 
  set_engine("ranger") 

rf_dials_spec

Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger

7.2 모형 작업흐름

penguin_dials_wf <- penguin_wf %>%
  update_model(rf_dials_spec)

penguin_dials_wf

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_corr()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = tune()
  trees = tune()
  min_n = tune()

Computational engine: ranger

7.3 `dials` 초모수 공간

grid_random() 함수를 사용하게 되면 탐색할 초모수 공간에서 size = 5로 특정되어 5개만 만들어 낼 수 있다.

dials_random <- grid_random(
  mtry  = mtry(c(1, ncol(penguin) - 1)),
  trees(),
  min_n(),
  size = 5
)

dials_random

# A tibble: 5 × 3
  trees  mtry min_n
  <int> <int> <int>
1  1577     1    22
2  1630     5     5
3   716     6     5
4   286     6    24
5   996     7    37

grid_regular() 함수를 사용하게 되면 각 초모수에서 2개씩만 뽑아 이를 조합한 \(2 \times 2 \times 2 = 8\) 개 초모수 공간을 만들어낸다.

dials_regular <- grid_regular(
  mtry(c(1, ncol(penguin) - 1)),
  trees(),
  min_n(),
  levels = 2
)
dials_regular

# A tibble: 8 × 3
   mtry trees min_n
  <int> <int> <int>
1     1     1     2
2     7     1     2
3     1  2000     2
4     7  2000     2
5     1     1    40
6     7     1    40
7     1  2000    40
8     7  2000    40

7.4 `dials` 모형 적합(fit)

CV 교차검증 표본을 대상으로 각 모형에 적합시킨다. Hyper Parameter를 지정했기 때문에 Random Forest에서 해당 초모수에 대해 모형적합을 실시한다. CV 교차검증 표본에 대해서 vfold 를 10모둠으로 지정했기 때문에 10번 훈련 표본에 대해 적합을 시도한다. tune()을 사용하는 것과 비교하여 dials를 사용하게 되면 Hyper Parameter를 좀더 유연하게 명시적으로 설정하여 예측모형 개발에 사용할 수 있다.

doParallel::registerDoParallel()

rf_dials_rs <- penguin_dials_wf  %>% 
  tune_grid(
    resamples = penguin_cv,
    grid      = dials_regular,
    control   = control_resamples(save_pred = TRUE),
    metrics   = metric_set(accuracy, kap, roc_auc)
  )
rf_dials_rs

# Tuning results
# 3-fold cross-validation 
# A tibble: 3 × 5
  splits           id    .metrics          .notes           .predictions      
  <list>           <chr> <list>            <list>           <list>            
1 <split [177/89]> Fold1 <tibble [24 × 7]> <tibble [4 × 3]> <tibble [712 × 9]>
2 <split [177/89]> Fold2 <tibble [24 × 7]> <tibble [4 × 3]> <tibble [712 × 9]>
3 <split [178/88]> Fold3 <tibble [24 × 7]> <tibble [4 × 3]> <tibble [704 × 9]>

There were issues with some computations:

  - Warning(s) x12: 7 columns were requested but there were 6 predictors in the data....

Use `collect_notes(object)` for more information.

collect_metrics() 또는 show_best() 함수로 최적 Hyper Paramter를 선정할 수 있다.

rf_dials_rs %>% 
  show_best("roc_auc", n = 3)

# A tibble: 3 × 9
   mtry trees min_n .metric .estimator  mean     n std_err .config             
  <int> <int> <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1     1  2000     2 roc_auc binary     0.960     3  0.0110 Preprocessor1_Model3
2     7  2000     2 roc_auc binary     0.954     3  0.0159 Preprocessor1_Model4
3     1  2000    40 roc_auc binary     0.952     3  0.0155 Preprocessor1_Model7

이제 최적 모형을 만들어서 기준 roc_auc로 RF 모형을 생성한다. finalize_workflow()를 통해 최적으로 선택된 Hyper Parameter 조합을 살펴볼 수 있다.

best_dials_roc_auc <- select_best(rf_dials_rs, "roc_auc")
best_dials_roc_auc

# A tibble: 1 × 4
   mtry trees min_n .config             
  <int> <int> <int> <chr>               
1     1  2000     2 Preprocessor1_Model3

rf_dials_final <- finalize_workflow(penguin_dials_wf, best_dials_roc_auc)
rf_dials_final

══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
3 Recipe Steps

• step_corr()
• step_normalize()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Random Forest Model Specification (classification)

Main Arguments:
  mtry = 1
  trees = 2000
  min_n = 2

Computational engine: ranger

7.5 시험 데이터 성능측정

rf_dials_final_fit <- rf_dials_final %>%
  last_fit(splits)

rf_dials_final_fit %>% 
  collect_metrics()

# A tibble: 2 × 4
  .metric  .estimator .estimate .config             
  <chr>    <chr>          <dbl> <chr>               
1 accuracy binary         0.925 Preprocessor1_Model1
2 roc_auc  binary         0.980 Preprocessor1_Model1

많이 친숙한 confusion matrix를 작성해 마무리한다.

rf_dials_final_fit %>% 
  collect_predictions() %>%
  conf_mat(sex, .pred_class)

          Truth
Prediction female male
    female     29    1
    male        4   33

8 Hyper Parameter 분석

Hyper Parameter를 통해 예측 모형을 만들게 되면 ANOVA 형태 데이터를 갖게 되어 이를 활용한 다양한 분석을 수행할 수 있다. 즉, 다음과 같은 수학적인 모형을 마음속에 상정하고 분석을 수행하게 되면 최적의 Hyper Parameter 조합을 찾는데 도움을 받을 수 있다.

\[\text{정확도, AUC 등} = \text{mtry} + \text{trees} + \text{min_n} + \epsilon\]

hp_tbl <- rf_dials_rs %>% 
  collect_metrics(summarize = FALSE)

hp_tbl

# A tibble: 72 × 8
   id     mtry trees min_n .metric  .estimator .estimate .config             
   <chr> <int> <int> <int> <chr>    <chr>          <dbl> <chr>               
 1 Fold1     1     1     2 accuracy binary         0.798 Preprocessor1_Model1
 2 Fold1     1     1     2 kap      binary         0.594 Preprocessor1_Model1
 3 Fold1     1     1     2 roc_auc  binary         0.795 Preprocessor1_Model1
 4 Fold2     1     1     2 accuracy binary         0.787 Preprocessor1_Model1
 5 Fold2     1     1     2 kap      binary         0.566 Preprocessor1_Model1
 6 Fold2     1     1     2 roc_auc  binary         0.836 Preprocessor1_Model1
 7 Fold3     1     1     2 accuracy binary         0.886 Preprocessor1_Model1
 8 Fold3     1     1     2 kap      binary         0.772 Preprocessor1_Model1
 9 Fold3     1     1     2 roc_auc  binary         0.915 Preprocessor1_Model1
10 Fold1     7     1     2 accuracy binary         0.775 Preprocessor1_Model2
# … with 62 more rows

요약 통계량을 구해보면 다음과 같다.

hp_tbl %>% 
  group_by(.metric) %>%
  summarize(min = min(.estimate),
  median = median(.estimate),
  max = max(.estimate),
  mean = mean(.estimate),
  sd = sd(.estimate))

# A tibble: 3 × 6
  .metric       min median   max  mean     sd
  <chr>       <dbl>  <dbl> <dbl> <dbl>  <dbl>
1 accuracy  0.472    0.864 0.955 0.847 0.0940
2 kap      -0.00480  0.729 0.909 0.695 0.179 
3 roc_auc   0.528    0.920 0.984 0.892 0.0966

9 예측모형 배포

fit() 함수를 사용해서 예측모형 바이너리 파일을 로컬 파일로 저장한다. 제대로 저장이 되었는지 관측점 한개를 만들어서 가상의 펭귄의 성별을 기본 정보를 바탕으로 예측해보자.

penguin_predictvie_model <- fit(rf_dials_final, data = penguin_tbl)
write_rds(penguin_predictvie_model, "data/penguin_predictvie_model.rds")
penguin_predictvie_model_test <- read_rds("data/penguin_predictvie_model.rds")

obs_df <- tibble("species" = "Adelie",
                  "bill_length_mm" =  39.1,
                  "bill_depth_mm" =  18.7,
                  "flipper_length_mm" =  181,
                  "body_mass_g" = 3750)

predict(penguin_predictvie_model_test, obs_df)

# A tibble: 1 × 1
  .pred_class
  <fct>      
1 male

predict(penguin_predictvie_model_test, obs_df, type = "prob")

# A tibble: 1 × 2
  .pred_female .pred_male
         <dbl>      <dbl>
1        0.305      0.695

데이터 과학자 이광춘 저작

kwangchun.lee.7@gmail.com

Tidyverse 모형 - `tidymodels`

펭귄 성별예측모형: `tidymodels` - Hyper Parameter

Tidyverse Korea

2022-06-20

1 펭귄 데이터셋 ¹

2 성별예측 초모수 모형 ²

3 데이터 전처리

3.1 훈련/교차검증/시험 데이터셋

3.2 피처 공학

4 기본설정 모형

4.1 모형 아키텍처

4.2 모형 작업흐름

4.3 바닐라 모형 적합(fit)

4.4 시험 데이터 성능측정

5 Hyper Parameter 찾기

6 Hyper Parameter 모형: `tune`

6.1 모형 아키텍처

6.2 모형 작업흐름

6.3 `tune` 모형 적합(fit)

6.4 시험 데이터 성능측정

7 Hyper Parameter 모형: `dials`

7.1 모형 아키텍처

7.2 모형 작업흐름

7.3 `dials` 초모수 공간

7.4 `dials` 모형 적합(fit)

7.5 시험 데이터 성능측정

8 Hyper Parameter 분석

9 예측모형 배포

Tidyverse 모형 - tidymodels

펭귄 성별예측모형: tidymodels - Hyper Parameter

Tidyverse Korea

2022-06-20

1 펭귄 데이터셋 1

2 성별예측 초모수 모형 2

3 데이터 전처리

3.1 훈련/교차검증/시험 데이터셋

3.2 피처 공학

4 기본설정 모형

4.1 모형 아키텍처

4.2 모형 작업흐름

4.3 바닐라 모형 적합(fit)

4.4 시험 데이터 성능측정

5 Hyper Parameter 찾기

6 Hyper Parameter 모형: tune

6.1 모형 아키텍처

6.2 모형 작업흐름

6.3 tune 모형 적합(fit)

6.4 시험 데이터 성능측정

7 Hyper Parameter 모형: dials

7.1 모형 아키텍처

7.2 모형 작업흐름

7.3 dials 초모수 공간

7.4 dials 모형 적합(fit)

7.5 시험 데이터 성능측정

8 Hyper Parameter 분석

9 예측모형 배포

Tidyverse 모형 - `tidymodels`

펭귄 성별예측모형: `tidymodels` - Hyper Parameter

1 펭귄 데이터셋 ¹

2 성별예측 초모수 모형 ²

6 Hyper Parameter 모형: `tune`

6.3 `tune` 모형 적합(fit)

7 Hyper Parameter 모형: `dials`

7.3 `dials` 초모수 공간

7.4 `dials` 모형 적합(fit)