1. Supervised Learning Prediction Models 1

We survey supervised learning models, where the outcome to be predicted is given, from traditional regression models to machine learning models that have dramatically raised accuracy. Linear regression models communicate well through explanation but are relatively less accurate, while recent predictive models, led by neural networks, are highly accurate but poor at communicating through explanation.

From regression models to H2O

To overcome this trade-off, methodologies such as GAM, MARS, SLIM (supersparse linear integer models), and rule lists have traditionally been proposed; LIME, on the other hand, takes a different approach to achieving the same goal.

Interpretability and accuracy
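Since LIME explains individual predictions of an already fitted model, a minimal sketch with the lime package is shown below. This is only an illustration: it assumes the lime package is installed and borrows the iris_train/iris_test split and the iris_xgboost caret model fitted later in section 3.3.

# Minimal LIME sketch (assumes the iris_train/iris_test split and the iris_xgboost model from section 3.3)
library(lime)

iris_explainer   <- lime(iris_train[, -5], iris_xgboost)      # build an explainer from the training features
iris_explanation <- explain(iris_test[1:3, -5], iris_explainer,
                            n_labels = 1, n_features = 4)     # explain three held-out observations
plot_features(iris_explanation)                               # per-observation feature weights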

2. Iris Data

We fetch the iris data from the UCI data repository and preprocess it. Then we fit everything from a regression model to the advanced predictive models provided by H2O.

# 0. Environment setup --------------------------------

library(tidyverse)   # read_csv, %>%, mutate
library(stringr)     # str_replace

# 1. Import data --------------------------------
# https://archive.ics.uci.edu/ml/datasets/Iris

iris_df <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = FALSE)
Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_character()
)
# 2. Preprocess data --------------------------------

iris_df <- iris_df %>% rename(sepal_length = X1,
                   sepal_width = X2,
                   petal_length = X3,
                   petal_width = X4,
                   class = X5) %>% 
  mutate(class = str_replace(class, "Iris-", "")) %>% 
  mutate(class = factor(class, levels = c("setosa", "versicolor", "virginica")))

DT::datatable(iris_df)
# 3. Save the data frame --------------------------------

#dir.create("data_processed")
#saveRDS(iris_df, "data_processed/iris_df.rds")

3. Prediction Models

3.1. Decision Tree Model 2

First of all, exploratory data analysis makes it visually clear which variables can separate the iris species. In particular, a parallel coordinates plot (parallelplot) shows clearly which variables play an important role in classifying the species.

# 0. Environment setup --------------------------------

library(lattice)   # splom, parallelplot
library(tree)
library(rpart)
library(rattle)    # fancyRpartPlot

# 2. Exploratory data analysis -----------------------------

super.sym <- trellis.par.get("superpose.symbol")

splom( ~ iris_df[1:4], groups = class, data = iris_df,
      panel = panel.superpose,
      key = list(title = "Scatterplot matrix of three iris species",
                 columns = 3, 
                 points = list(pch = super.sym$pch[1:3],
                               col = super.sym$col[1:3]),
                 text = list(c("Setosa", "Versicolor", "Virginica"))))

parallelplot( ~ iris_df[1:4], iris_df, groups = class,
              horizontal.axis = FALSE, scales = list(x = list(rot = 90))) 

Fitting the tree model, only 4 of the 150 observations are misclassified, so the performance can be judged good. Using the formula class ~ . with all variables makes little difference.

# 3. Tree model -----------------------------
## 3.1. `tree` model fit and visualization ----------------
iris_tree <- tree(class ~ petal_length + petal_width, data = iris_df)
summary(iris_tree)

Classification tree:
tree(formula = class ~ petal_length + petal_width, data = iris_df)
Number of terminal nodes:  5 
Residual mean deviance:  0.157 = 22.77 / 145 
Misclassification error rate: 0.02667 = 4 / 150 
plot(iris_tree)
text(iris_tree)

## `tree` model visualization
with(iris_df,
     plot(petal_length, petal_width, pch=19, col=class))

partition.tree(iris_tree, label="class",add=TRUE)

par(xpd=TRUE)
legend(7.5, 2.5,legend=unique(iris_df$class), col=unique(as.numeric(iris_df$class)), pch=19)

## 3.2. `tree` model (all variables) ----------------
iris_full_tree <- tree(class ~ ., data = iris_df)
summary(iris_full_tree)

Classification tree:
tree(formula = class ~ ., data = iris_df)
Variables actually used in tree construction:
[1] "petal_length" "petal_width"  "sepal_length"
Number of terminal nodes:  6 
Residual mean deviance:  0.1253 = 18.05 / 144 
Misclassification error rate: 0.02667 = 4 / 150 

rpart and tree show almost no difference in results. tree was developed long ago mainly to track the original S implementation, whereas rpart is written in C, runs much faster, and offers extended functionality. 3
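The speed difference is hard to see on only 150 rows, but the claim can be checked with a rough timing sketch; this is illustrative only and assumes the microbenchmark package is installed (iris_df comes from section 2).

# Rough timing comparison of tree() and rpart() on the same formula (illustrative only)
library(microbenchmark)

microbenchmark(
  tree  = tree(class ~ ., data = iris_df),
  rpart = rpart(class ~ ., data = iris_df, method = "class"),
  times = 20
)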

# 4. rpart tree model -----------------------------

iris_rpart <- rpart(class ~ ., data=iris_df, method="class")
summary(iris_rpart)
Call:
rpart(formula = class ~ ., data = iris_df, method = "class")
  n= 150 

    CP nsplit rel error xerror       xstd
1 0.50      0      1.00   1.13 0.05279520
2 0.44      1      0.50   0.67 0.06088788
3 0.01      2      0.06   0.09 0.02908608

Variable importance
 petal_width petal_length sepal_length  sepal_width 
          34           31           21           13 

Node number 1: 150 observations,    complexity param=0.5
  predicted class=setosa      expected loss=0.6666667  P(node) =1
    class counts:    50    50    50
   probabilities: 0.333 0.333 0.333 
  left son=2 (50 obs) right son=3 (100 obs)
  Primary splits:
      petal_length < 2.45 to the left,  improve=50.00000, (0 missing)
      petal_width  < 0.8  to the left,  improve=50.00000, (0 missing)
      sepal_length < 5.45 to the left,  improve=34.16405, (0 missing)
      sepal_width  < 3.35 to the right, improve=18.05556, (0 missing)
  Surrogate splits:
      petal_width  < 0.8  to the left,  agree=1.000, adj=1.00, (0 split)
      sepal_length < 5.45 to the left,  agree=0.920, adj=0.76, (0 split)
      sepal_width  < 3.35 to the right, agree=0.827, adj=0.48, (0 split)

Node number 2: 50 observations
  predicted class=setosa      expected loss=0  P(node) =0.3333333
    class counts:    50     0     0
   probabilities: 1.000 0.000 0.000 

Node number 3: 100 observations,    complexity param=0.44
  predicted class=versicolor  expected loss=0.5  P(node) =0.6666667
    class counts:     0    50    50
   probabilities: 0.000 0.500 0.500 
  left son=6 (54 obs) right son=7 (46 obs)
  Primary splits:
      petal_width  < 1.75 to the left,  improve=38.969400, (0 missing)
      petal_length < 4.75 to the left,  improve=37.353540, (0 missing)
      sepal_length < 6.15 to the left,  improve=10.686870, (0 missing)
      sepal_width  < 2.45 to the left,  improve= 3.555556, (0 missing)
  Surrogate splits:
      petal_length < 4.75 to the left,  agree=0.91, adj=0.804, (0 split)
      sepal_length < 6.15 to the left,  agree=0.73, adj=0.413, (0 split)
      sepal_width  < 2.95 to the left,  agree=0.67, adj=0.283, (0 split)

Node number 6: 54 observations
  predicted class=versicolor  expected loss=0.09259259  P(node) =0.36
    class counts:     0    49     5
   probabilities: 0.000 0.907 0.093 

Node number 7: 46 observations
  predicted class=virginica   expected loss=0.02173913  P(node) =0.3066667
    class counts:     0     1    45
   probabilities: 0.000 0.022 0.978 
iris_rpart_pred <- predict(iris_rpart, type="class")
table(iris_rpart_pred, iris_df$class)
               
iris_rpart_pred setosa versicolor virginica
     setosa         50          0         0
     versicolor      0         49         5
     virginica       0          1        45
fancyRpartPlot(iris_rpart, main="Iris", caption="")

Switching to a different data set, the tree must be pruned to prevent overfitting. This requires a criterion: typically the which.min function is used to pick the "CP" value that minimizes "xerror", and the prune function is then used to trim the tree.

(Reference: Quick-R, Tree-Based Models)
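Before pruning, the cross-validated error can also be inspected visually. Below is a minimal sketch using rpart's printcp() and plotcp() on the iris_rpart fit from above; both convey the same xerror-versus-CP information that which.min operates on.

# Cross-validated error (xerror) against the complexity parameter CP
printcp(iris_rpart)   # CP table printed to the console
plotcp(iris_rpart)    # xerror curve with a one-standard-error reference line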

# 5. Robust (pruned) tree model -----------------------------
library(rpart.plot)
data("ptitanic")
titanic_rpart <- rpart(survived ~ ., data = ptitanic, method="class", control = rpart.control(cp = 0.0001))
printcp(titanic_rpart)

Classification tree:
rpart(formula = survived ~ ., data = ptitanic, method = "class", 
    control = rpart.control(cp = 1e-04))

Variables actually used in tree construction:
[1] age    parch  pclass sex    sibsp 

Root node error: 500/1309 = 0.38197

n= 1309 

         CP nsplit rel error xerror     xstd
1 0.4240000      0     1.000  1.000 0.035158
2 0.0210000      1     0.576  0.576 0.029976
3 0.0150000      3     0.534  0.548 0.029438
4 0.0113333      5     0.504  0.534 0.029157
5 0.0025714      9     0.458  0.508 0.028616
6 0.0020000     16     0.440  0.528 0.029035
7 0.0001000     18     0.436  0.524 0.028952
## Robust model that guards against overfitting
bestcp <- titanic_rpart$cptable[which.min(titanic_rpart$cptable[,"xerror"]),"CP"]
titanic_rpart_pruned <- prune(titanic_rpart, cp = bestcp)

## Comparison tables for the two models
titanic_rpart_pred <- predict(titanic_rpart, type="class")
table(titanic_rpart_pred, ptitanic$survived)
                  
titanic_rpart_pred died survived
          died      744      153
          survived   65      347
titanic_rpart_pruned_pred <- predict(titanic_rpart_pruned, type="class")
table(titanic_rpart_pruned_pred, ptitanic$survived)
                         
titanic_rpart_pruned_pred died survived
                 died      749      169
                 survived   60      331
## Visualizing the two models
par(mfrow=c(1,2))
fancyRpartPlot(titanic_rpart, main="Overfit model", caption="", tweak=1.5, uniform=TRUE)
fancyRpartPlot(titanic_rpart_pruned, main="Pruned (robust) model", caption="", tweak=2.0, uniform=TRUE)

3.2. Random Forest Model 4 5

If the emphasis is on raising predictive accuracy even at some cost to model interpretability, a Random Forest model is worth running. Compared with the decision tree model, however, there are a few more decisions to make beyond pruning to prevent overfitting.

  • ntree: number of decision trees
  • mtry: number of variables randomly sampled at each split
  • node_size: size of decision tree nodes (leaves); set to 1 by default

The best_param combination with the highest accuracy is selected and used as the prediction model.

RF tuning
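As an alternative to the manual grid search below, the randomForest package's tuneRF() searches over mtry using the OOB error. This is a minimal sketch assuming the iris_df prepared in section 2; it tunes mtry only, not ntree or node size.

# OOB-error-based search over mtry (alternative to the manual grid search below)
library(randomForest)

set.seed(777)
mtry_search <- tuneRF(x = iris_df[, 1:4], y = iris_df$class,
                      ntreeTry = 100, stepFactor = 1.5, improve = 0.01,
                      trace = TRUE, plot = FALSE)
mtry_search   # OOB error for each mtry value tried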

# 0. Environment setup --------------------------------

library(randomForest)

# 2. Random Forest parameter tuning -------------------------

# Accuracy = sum of the confusion-matrix diagonal / total number of observations
rf_accuracy <- function(actual, predict){
  sum(diag(table(actual, predict))) / sum(table(actual, predict))
}

rf_tuning_df <- data.frame(mtry=numeric(0), 
                           ntree=numeric(0), 
                           node_size=numeric(0), 
                           accuracy=numeric(0))

for(mtry in c(2,3,4)){
  for(ntree in c(10, 50, 100, 150)){
    for(node_size in c(3, 5, 10)){
      iris_rf <- randomForest(class ~ ., ntree=ntree, mtry=mtry, nodesize=node_size, data=iris_df)
      iris_rf_pred <- predict(iris_rf)   # OOB class predictions
      rf_tuning_df[nrow(rf_tuning_df)+1,] = c(mtry, ntree, node_size, rf_accuracy(iris_df$class, iris_rf_pred))
      cat("mtry:", mtry, " | ntree: ", ntree, " | node_size", node_size, "  | accuracy:", rf_accuracy(iris_df$class, iris_rf_pred), "\n")
    }
  }
}
mtry: 2  | ntree:  10  | node_size 3   | accuracy: 0.9315068 
mtry: 2  | ntree:  10  | node_size 5   | accuracy: 0.9183673 
mtry: 2  | ntree:  10  | node_size 10   | accuracy: 0.9527027 
mtry: 2  | ntree:  50  | node_size 3   | accuracy: 0.94 
mtry: 2  | ntree:  50  | node_size 5   | accuracy: 0.9466667 
mtry: 2  | ntree:  50  | node_size 10   | accuracy: 0.9533333 
mtry: 2  | ntree:  100  | node_size 3   | accuracy: 0.9533333 
mtry: 2  | ntree:  100  | node_size 5   | accuracy: 0.9533333 
mtry: 2  | ntree:  100  | node_size 10   | accuracy: 0.9466667 
mtry: 2  | ntree:  150  | node_size 3   | accuracy: 0.9533333 
mtry: 2  | ntree:  150  | node_size 5   | accuracy: 0.9533333 
mtry: 2  | ntree:  150  | node_size 10   | accuracy: 0.9533333 
mtry: 3  | ntree:  10  | node_size 3   | accuracy: 0.9597315 
mtry: 3  | ntree:  10  | node_size 5   | accuracy: 0.9530201 
mtry: 3  | ntree:  10  | node_size 10   | accuracy: 0.9591837 
mtry: 3  | ntree:  50  | node_size 3   | accuracy: 0.9533333 
mtry: 3  | ntree:  50  | node_size 5   | accuracy: 0.9533333 
mtry: 3  | ntree:  50  | node_size 10   | accuracy: 0.9533333 
mtry: 3  | ntree:  100  | node_size 3   | accuracy: 0.9533333 
mtry: 3  | ntree:  100  | node_size 5   | accuracy: 0.9533333 
mtry: 3  | ntree:  100  | node_size 10   | accuracy: 0.9466667 
mtry: 3  | ntree:  150  | node_size 3   | accuracy: 0.96 
mtry: 3  | ntree:  150  | node_size 5   | accuracy: 0.9533333 
mtry: 3  | ntree:  150  | node_size 10   | accuracy: 0.96 
mtry: 4  | ntree:  10  | node_size 3   | accuracy: 0.952381 
mtry: 4  | ntree:  10  | node_size 5   | accuracy: 0.9466667 
mtry: 4  | ntree:  10  | node_size 10   | accuracy: 0.9726027 
mtry: 4  | ntree:  50  | node_size 3   | accuracy: 0.9533333 
mtry: 4  | ntree:  50  | node_size 5   | accuracy: 0.9533333 
mtry: 4  | ntree:  50  | node_size 10   | accuracy: 0.96 
mtry: 4  | ntree:  100  | node_size 3   | accuracy: 0.96 
mtry: 4  | ntree:  100  | node_size 5   | accuracy: 0.96 
mtry: 4  | ntree:  100  | node_size 10   | accuracy: 0.9533333 
mtry: 4  | ntree:  150  | node_size 3   | accuracy: 0.96 
mtry: 4  | ntree:  150  | node_size 5   | accuracy: 0.96 
mtry: 4  | ntree:  150  | node_size 10   | accuracy: 0.96 
best_param <- rf_tuning_df[which.max(rf_tuning_df[,"accuracy"]),]

# 3. Random Forest model -------------------------
## 3.1. Fit the Random Forest model -----------------
iris_tuned_rf <- randomForest(class ~ ., 
                              ntree=best_param$ntree, 
                              mtry=best_param$mtry,
                              nodesize=best_param$node_size, 
                              importance=TRUE,
                              data=iris_df)

## 3.2. Variable importance -----------------

importance(iris_tuned_rf)
               setosa versicolor virginica MeanDecreaseAccuracy
sepal_length 0.000000  -1.054093 -1.054093            -1.579834
sepal_width  0.000000  -1.054093  2.868963             2.393436
petal_length 3.066885   4.073225  3.369511             3.633379
petal_width  3.123322   6.901647  4.000719             4.838549
             MeanDecreaseGini
sepal_length         1.703953
sepal_width          1.914524
petal_length        48.726765
petal_width         47.154758
varImpPlot(iris_tuned_rf)

## 3.3. Model prediction/evaluation -----------------

iris_tuned_rf_pred <- predict(iris_tuned_rf)   # OOB class predictions
table(iris_df$class, iris_tuned_rf_pred)
            iris_tuned_rf_pred
             setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         46         4
  virginica       0          4        46

3.3. caret 6

With the caret package, a variety of predictive model architectures can be compared and the model with the best predictive performance is recommended automatically. Where categorical-data models, simple blends such as (GBM + glmnet)/2, and train/validation techniques were used in the past, recent practice goes beyond xgboost to stacking and keras deep learning techniques for building predictive models.

For fine-tuning xgboost parameters, refer to Owen Zhang's (2015) slides.
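Which hyperparameters caret exposes for a given architecture can be checked with modelLookup() before building the tuning grid; a quick sketch for the xgbTree method used below.

# List the tunable parameters caret exposes for the xgbTree method
library(caret)
modelLookup("xgbTree")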

Separate setup for Windows and Mac operating systems

Identify the operating system with the .Platform$OS.type command, then set up a cluster for parallel processing.

if(.Platform$OS.type =="windows") {
  library(doSNOW)
  working_cores <- parallel::detectCores() -1
  cl <- makeCluster(working_cores, type = "SOCK")
  registerDoSNOW(cl)
} else if(.Platform$OS.type =="unix"){
  library(doMC)
  working_cores <- parallel::detectCores() -1
  registerDoMC(cores = working_cores)
}
# 0. Environment setup --------------------------------

library(caret)

Attaching package: 'caret'
The following object is masked from 'package:purrr':

    lift
library(doSNOW)

# 1. Load data --------------------------------

iris_df <- iris

# 2. Clean data --------------------------------

# glimpse(iris_df)

# 3. Develop prediction models --------------------------------
## 3.1. Train/test split -----------------------------
train_index <- createDataPartition(iris_df$Species, times=1, p=0.7, list=FALSE)

iris_train <-iris_df[ train_index,]
iris_test <- iris_df[-train_index,]

## 3.2. Model formula -----------------------------------

iris_x_var <- setdiff(colnames(iris_train),list('Species'))
iris_formula <- as.formula(paste('Species', paste(iris_x_var, collapse=' + '), sep=' ~ '))

## 3.3. Model control -----------------------------

iris_ctrl <- trainControl(method = "repeatedcv",
                          number = 5,
                          repeats = 1,
                          verboseIter = TRUE,
                          search = "grid")

xgboost_grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
                            nrounds = c(50, 75, 100),
                            max_depth = 6:8,
                            min_child_weight = c(2.0, 2.25, 2.5),
                            colsample_bytree = c(0.3, 0.4, 0.5),
                            gamma = 0,
                            subsample = 1)

## 3.4. xgboost architecture ------------------------

if(.Platform$OS.type =="windows") {
  library(doSNOW)
  working_cores <- parallel::detectCores() -1
  cl <- makeCluster(working_cores, type = "SOCK")
  registerDoSNOW(cl)
} else if(.Platform$OS.type =="unix"){
  library(doMC)
  working_cores <- parallel::detectCores() -1
  registerDoMC(cores = working_cores)
}

iris_glm <- train(iris_formula,  data=iris_train, 
                  method="glmnet", 
                  trControl = iris_ctrl)
Aggregating results
Selecting tuning parameters
Fitting alpha = 0.1, lambda = 0.000865 on full training set
library(gbm)
Loading required package: survival

Attaching package: 'survival'
The following object is masked from 'package:caret':

    cluster
Loading required package: splines
Loading required package: parallel

Attaching package: 'parallel'
The following objects are masked from 'package:snow':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, clusterSplit, makeCluster,
    parApply, parCapply, parLapply, parRapply, parSapply,
    splitIndices, stopCluster
Loaded gbm 2.1.3
iris_gbm <- train(iris_formula, data=iris_train, 
                  method = "gbm", 
                  trControl = iris_ctrl)
Aggregating results
Selecting tuning parameters
Fitting n.trees = 50, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10 on full training set
Iter   TrainDeviance   ValidDeviance   StepSize   Improve
     1        1.0986             nan     0.1000    0.3053
     2        0.8841             nan     0.1000    0.2575
     3        0.7082             nan     0.1000    0.2071
     4        0.5712             nan     0.1000    0.1332
     5        0.4778             nan     0.1000    0.1209
     6        0.4006             nan     0.1000    0.0923
     7        0.3408             nan     0.1000    0.0736
     8        0.2914             nan     0.1000    0.0482
     9        0.2570             nan     0.1000    0.0397
    10        0.2246             nan     0.1000    0.0315
    20        0.0980             nan     0.1000   -0.0032
    40        0.0440             nan     0.1000   -0.0105
    50        0.0288             nan     0.1000   -0.0041
iris_xgboost <- train(iris_formula,  data=iris_train, 
                 method="xgbTree", 
                 trControl = iris_ctrl,
                 tuneGrid = xgboost_grid,
                 num_class = 3)
Aggregating results
Selecting tuning parameters
Fitting nrounds = 75, max_depth = 6, eta = 0.1, gamma = 0, colsample_bytree = 0.3, min_child_weight = 2, subsample = 1 on full training set
Warning in check.booster.params(params, ...): The following parameters were provided multiple times:
    num_class
  Only the last value for each of them will be used.
if(.Platform$OS.type =="windows") {
  stopCluster(cl)
} else {
  next
}

# 4. Model evaluation ---------------------------------------

all_resample <- resamples(list("Logistic" = iris_glm,
                               "GBM" = iris_gbm,
                               "xgboost" = iris_xgboost))

parallelplot(all_resample, metric = "Accuracy")

confusionMatrix(iris_xgboost)
Cross-Validated (5 fold, repeated 1 times) Confusion Matrix 

(entries are percentual average cell counts across resamples)
 
            Reference
Prediction   setosa versicolor virginica
  setosa       33.3        0.0       0.0
  versicolor    0.0       29.5       3.8
  virginica     0.0        3.8      29.5
                            
 Accuracy (average) : 0.9238
iris_pred <- predict(iris_xgboost, newdata = iris_test)
prop.table(table(iris_pred, iris_test$Species))
            
iris_pred       setosa versicolor virginica
  setosa     0.3333333  0.0000000 0.0000000
  versicolor 0.0000000  0.3333333 0.0000000
  virginica  0.0000000  0.0000000 0.3333333
iris_responses <- iris_test %>%
  select(Species) %>%
  mutate(
    predicted_iris = predict(
      iris_xgboost,
      newdata=iris_test
    )
  )

table(iris_responses)
            predicted_iris
Species      setosa versicolor virginica
  setosa         15          0         0
  versicolor      0         15         0
  virginica       0          0        15
ggplot(iris_responses, aes(Species, predicted_iris)) +
  geom_jitter(width = 0.15, height=0.15, aes(colour = Species), alpha=0.3) +
  geom_abline(intercept=0, slope=1)

# 5. Model deployment ---------------------------------------

iris_pred <- predict(iris_xgboost, newdata = iris_test)
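For actual deployment, the fitted object can be serialized and reloaded in the scoring environment. A minimal sketch follows, assuming a hypothetical model/ directory; the path and object names are placeholders.

# Serialize the trained caret model and reload it for scoring (hypothetical path)
# dir.create("model")
saveRDS(iris_xgboost, "model/iris_xgboost.rds")

iris_xgboost_deployed <- readRDS("model/iris_xgboost.rds")
predict(iris_xgboost_deployed, newdata = iris_test)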

  1. Interpretability in conversation with Patrick Hall and Sameer Singh

  2. Building a classification tree in R

  3. Brian Ripley - Difference between “tree” and “rpart”

  4. Tuning the parameters of your Random Forest model

  5. Tune Machine Learning Algorithms in R (random forest case study)

  6. Owen Zhang(2015), Open Source Tools & Data Science Competitions