We look at supervised learning models, in which the outcome to be predicted is given, ranging from traditional regression models to machine learning models that have dramatically improved accuracy. Linear regression models are easy to communicate through explanation but comparatively less accurate, while recent predictive models led by neural networks are highly accurate but poor at communicating through explanation.
To overcome this, methods such as GAM, MARS, SLIM (supersparse linear integer models), and rule lists have traditionally been proposed; LIME, on the other hand, takes a different approach to achieving the same goal.
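LIME is available in R through the lime package. As a rough sketch only (assuming the lime package is installed and a fitted caret model such as the iris_xgboost trained later in this post), a local explanation for a few held-out rows might look like this:
library(lime)
# Build an explainer around a fitted caret model (iris_xgboost is fitted in the caret section below)
iris_explainer <- lime(iris_train[, -5], iris_xgboost)
# Explain three held-out observations: one label, up to four features each
iris_explanation <- explain(iris_test[1:3, -5], iris_explainer, n_labels = 1, n_features = 4)
plot_features(iris_explanation)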
We pull the iris data from the UCI data repository and preprocess it. Then we fit everything from a regression model up to the advanced predictive models provided by H2O.
# 0. Environment setup --------------------------------
library(tidyverse)
library(stringr)
# 1. Import data --------------------------------
# https://archive.ics.uci.edu/ml/datasets/Iris
iris_df <- read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data", col_names = FALSE)
Parsed with column specification:
cols(
X1 = col_double(),
X2 = col_double(),
X3 = col_double(),
X4 = col_double(),
X5 = col_character()
)
# 2. Preprocess data --------------------------------
iris_df <- iris_df %>% rename(sepal_length = X1,
sepal_width = X2,
petal_length = X3,
petal_width = X4,
class = X5) %>%
mutate(class = str_replace(class, "Iris-", "")) %>%
mutate(class = factor(class, levels = c("setosa", "versicolor", "virginica")))
DT::datatable(iris_df)
# 3. Save the data frame --------------------------------
#dir.create("data_processed")
#saveRDS(iris_df, "data_processed/iris_df.rds")
First of all, exploratory data analysis lets us see clearly which variables can be used to classify the iris species. A parallel coordinates plot (parallelplot) in particular makes it easy to spot the variables that play an important role in separating the species.
# 0. Environment setup --------------------------------
library(rpart)
library(lattice)
library(tree)
library(rattle)
# 2. Exploratory data analysis -----------------------------
super.sym <- trellis.par.get("superpose.symbol")
splom( ~ iris_df[1:4], groups = class, data = iris_df,
panel = panel.superpose,
key = list(title = "Scatterplot matrix of the three iris species",
columns = 3,
points = list(pch = super.sym$pch[1:3],
col = super.sym$col[1:3]),
text = list(c("Setosa", "Versicolor", "Virginica"))))
parallelplot( ~ iris_df[1:4], iris_df, groups = class,
horizontal.axis = FALSE, scales = list(x = list(rot = 90)))
Fitting a tree model misclassifies only 4 of the 150 observations, so its performance can be judged to be good. Using the class ~ . formula with all predictors makes little difference.
# 3. Tree model -----------------------------
## 3.1. Fit and visualize a `tree` model ----------------
iris_tree <- tree(class ~ petal_length + petal_width, data = iris_df)
summary(iris_tree)
Classification tree:
tree(formula = class ~ petal_length + petal_width, data = iris_df)
Number of terminal nodes: 5
Residual mean deviance: 0.157 = 22.77 / 145
Misclassification error rate: 0.02667 = 4 / 150
plot(iris_tree)
text(iris_tree)
## Visualize the `tree` partition
with(iris_df,
plot(petal_length, petal_width, pch=19, col=class))
partition.tree(iris_tree, label="class",add=TRUE)
par(xpd=TRUE)
legend(7.5, 2.5,legend=unique(iris_df$class), col=unique(as.numeric(iris_df$class)), pch=19)
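## Cross-tabulating the fitted tree's predictions against the observed classes
## confirms the 4 misclassifications reported by summary() above (quick sanity check).
table(predict(iris_tree, type = "class"), iris_df$class)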
## 3.2. `tree` model with all predictors ----------------
iris_full_tree <- tree(class ~ ., data = iris_df)
summary(iris_full_tree)
Classification tree:
tree(formula = class ~ ., data = iris_df)
Variables actually used in tree construction:
[1] "petal_length" "petal_width" "sepal_length"
Number of terminal nodes: 6
Residual mean deviance: 0.1253 = 18.05 / 144
Misclassification error rate: 0.02667 = 4 / 150
rpart and tree differ very little; tree was developed long ago to track down bugs in the original S implementation, whereas rpart is written in C, is much faster, and has extended functionality.
# 4. rpart tree model -----------------------------
iris_rpart <- rpart(class ~ ., data=iris_df, method="class")
summary(iris_rpart)
Call:
rpart(formula = class ~ ., data = iris_df, method = "class")
n= 150
CP nsplit rel error xerror xstd
1 0.50 0 1.00 1.13 0.05279520
2 0.44 1 0.50 0.67 0.06088788
3 0.01 2 0.06 0.09 0.02908608
Variable importance
petal_width petal_length sepal_length sepal_width
34 31 21 13
Node number 1: 150 observations, complexity param=0.5
predicted class=setosa expected loss=0.6666667 P(node) =1
class counts: 50 50 50
probabilities: 0.333 0.333 0.333
left son=2 (50 obs) right son=3 (100 obs)
Primary splits:
petal_length < 2.45 to the left, improve=50.00000, (0 missing)
petal_width < 0.8 to the left, improve=50.00000, (0 missing)
sepal_length < 5.45 to the left, improve=34.16405, (0 missing)
sepal_width < 3.35 to the right, improve=18.05556, (0 missing)
Surrogate splits:
petal_width < 0.8 to the left, agree=1.000, adj=1.00, (0 split)
sepal_length < 5.45 to the left, agree=0.920, adj=0.76, (0 split)
sepal_width < 3.35 to the right, agree=0.827, adj=0.48, (0 split)
Node number 2: 50 observations
predicted class=setosa expected loss=0 P(node) =0.3333333
class counts: 50 0 0
probabilities: 1.000 0.000 0.000
Node number 3: 100 observations, complexity param=0.44
predicted class=versicolor expected loss=0.5 P(node) =0.6666667
class counts: 0 50 50
probabilities: 0.000 0.500 0.500
left son=6 (54 obs) right son=7 (46 obs)
Primary splits:
petal_width < 1.75 to the left, improve=38.969400, (0 missing)
petal_length < 4.75 to the left, improve=37.353540, (0 missing)
sepal_length < 6.15 to the left, improve=10.686870, (0 missing)
sepal_width < 2.45 to the left, improve= 3.555556, (0 missing)
Surrogate splits:
petal_length < 4.75 to the left, agree=0.91, adj=0.804, (0 split)
sepal_length < 6.15 to the left, agree=0.73, adj=0.413, (0 split)
sepal_width < 2.95 to the left, agree=0.67, adj=0.283, (0 split)
Node number 6: 54 observations
predicted class=versicolor expected loss=0.09259259 P(node) =0.36
class counts: 0 49 5
probabilities: 0.000 0.907 0.093
Node number 7: 46 observations
predicted class=virginica expected loss=0.02173913 P(node) =0.3066667
class counts: 0 1 45
probabilities: 0.000 0.022 0.978
iris_rpart_pred <- predict(iris_rpart, type="class")
table(iris_rpart_pred, iris_df$class)
iris_rpart_pred setosa versicolor virginica
setosa 50 0 0
versicolor 0 49 5
virginica 0 1 45
fancyRpartPlot(iris_rpart, main="Iris", caption="")
Switching to another dataset, the tree must be pruned to prevent overfitting. This requires a criterion: typically the which.min function is used to pick the CP value that minimizes xerror, and the prune function is then used to cut the tree back.
Reference: Quick-R, Tree-Based Models.
# 5. Robust tree model -----------------------------
library(rpart.plot)
data("ptitanic")
titanic_rpart <- rpart(survived ~ ., data = ptitanic, method="class", control = rpart.control(cp = 0.0001))
printcp(titanic_rpart)
Classification tree:
rpart(formula = survived ~ ., data = ptitanic, method = "class",
control = rpart.control(cp = 1e-04))
Variables actually used in tree construction:
[1] age parch pclass sex sibsp
Root node error: 500/1309 = 0.38197
n= 1309
CP nsplit rel error xerror xstd
1 0.4240000 0 1.000 1.000 0.035158
2 0.0210000 1 0.576 0.576 0.029976
3 0.0150000 3 0.534 0.548 0.029438
4 0.0113333 5 0.504 0.534 0.029157
5 0.0025714 9 0.458 0.508 0.028616
6 0.0020000 16 0.440 0.528 0.029035
7 0.0001000 18 0.436 0.524 0.028952
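## Optional visual check: plotcp() from rpart plots xerror against CP,
## which complements the programmatic which.min() selection below.
plotcp(titanic_rpart)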
## Robust model with overfitting prevented by pruning
bestcp <- titanic_rpart$cptable[which.min(titanic_rpart$cptable[,"xerror"]),"CP"]
titanic_rpart_pruned <- prune(titanic_rpart, cp = bestcp)
## Comparison table for the two models
titanic_rpart_pred <- predict(titanic_rpart, type="class")
table(titanic_rpart_pred, ptitanic$survived)
titanic_rpart_pred died survived
died 744 153
survived 65 347
titanic_rpart_pruned_pred <- predict(titanic_rpart_pruned, type="class")
table(titanic_rpart_pruned_pred, ptitanic$survived)
titanic_rpart_pruned_pred died survived
died 749 169
survived 60 331
## Visualize both models
par(mfrow=c(1,2))
fancyRpartPlot(titanic_rpart, main="Overfit model", caption="", tweak=1.5, uniform=TRUE)
fancyRpartPlot(titanic_rpart_pruned, main="Pruned (robust) model", caption="", tweak=2.0, uniform=TRUE)
If the emphasis is on raising predictive accuracy, even at some cost to model interpretability, a Random Forest model can be run. Besides pruning to prevent overfitting in a single decision tree, however, there are a few more settings to decide (see also the tuneRF() sketch right after this list):

- ntree: number of decision trees
- mtry: number of variables sampled at each split
- node_size: size of the tree's terminal nodes (leaves), set to 1 by default

The best_param combination with the highest accuracy is selected and used as the prediction model.
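As an aside, the randomForest package also ships a built-in helper, tuneRF(), which searches over mtry using the out-of-bag error. A minimal sketch is shown here (the seed and the ntreeTry, stepFactor, improve values are illustrative, and iris_df is assumed from the preprocessing step above); it is a complement to, not a replacement for, the grid search below.
library(randomForest)
set.seed(777)
# Search mtry by OOB error; classification outcome must be a factor
tuneRF(x = as.data.frame(iris_df[, 1:4]), y = iris_df$class,
       ntreeTry = 100, stepFactor = 1.5, improve = 0.01, trace = TRUE)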
# 0. Environment setup --------------------------------
library(randomForest)
# 2. Random Forest hyperparameter tuning -------------------------
# Accuracy helper: share of correctly classified observations (diagonal of the confusion matrix)
rf_accuracy <- function(actual, predict){
  conf_mat <- table(actual, predict)
  sum(diag(conf_mat)) / sum(conf_mat)
}
rf_tuning_df <- data.frame(mtry=numeric(0),
ntree=numeric(0),
node_size=numeric(0),
accuracy=numeric(0))
for(mtry in c(2,3,4)){
for(ntree in c(10, 50, 100, 150)){
for(node_size in c(3, 5, 10)){
iris_rf <- randomForest(class ~ ., ntree = ntree, mtry = mtry, nodesize = node_size, data = iris_df)
iris_rf_pred <- predict(iris_rf)  # OOB class predictions
rf_tuning_df[nrow(rf_tuning_df)+1,] = c(mtry, ntree, node_size, rf_accuracy(iris_df$class, iris_rf_pred))
cat("mtry:", mtry, " | ntree: ", ntree, " | node_size", node_size, " | accuracy:", rf_accuracy(iris_df$class, iris_rf_pred), "\n")
}
}
}
mtry: 2 | ntree: 10 | node_size 3 | accuracy: 0.9315068
mtry: 2 | ntree: 10 | node_size 5 | accuracy: 0.9183673
mtry: 2 | ntree: 10 | node_size 10 | accuracy: 0.9527027
mtry: 2 | ntree: 50 | node_size 3 | accuracy: 0.94
mtry: 2 | ntree: 50 | node_size 5 | accuracy: 0.9466667
mtry: 2 | ntree: 50 | node_size 10 | accuracy: 0.9533333
mtry: 2 | ntree: 100 | node_size 3 | accuracy: 0.9533333
mtry: 2 | ntree: 100 | node_size 5 | accuracy: 0.9533333
mtry: 2 | ntree: 100 | node_size 10 | accuracy: 0.9466667
mtry: 2 | ntree: 150 | node_size 3 | accuracy: 0.9533333
mtry: 2 | ntree: 150 | node_size 5 | accuracy: 0.9533333
mtry: 2 | ntree: 150 | node_size 10 | accuracy: 0.9533333
mtry: 3 | ntree: 10 | node_size 3 | accuracy: 0.9597315
mtry: 3 | ntree: 10 | node_size 5 | accuracy: 0.9530201
mtry: 3 | ntree: 10 | node_size 10 | accuracy: 0.9591837
mtry: 3 | ntree: 50 | node_size 3 | accuracy: 0.9533333
mtry: 3 | ntree: 50 | node_size 5 | accuracy: 0.9533333
mtry: 3 | ntree: 50 | node_size 10 | accuracy: 0.9533333
mtry: 3 | ntree: 100 | node_size 3 | accuracy: 0.9533333
mtry: 3 | ntree: 100 | node_size 5 | accuracy: 0.9533333
mtry: 3 | ntree: 100 | node_size 10 | accuracy: 0.9466667
mtry: 3 | ntree: 150 | node_size 3 | accuracy: 0.96
mtry: 3 | ntree: 150 | node_size 5 | accuracy: 0.9533333
mtry: 3 | ntree: 150 | node_size 10 | accuracy: 0.96
mtry: 4 | ntree: 10 | node_size 3 | accuracy: 0.952381
mtry: 4 | ntree: 10 | node_size 5 | accuracy: 0.9466667
mtry: 4 | ntree: 10 | node_size 10 | accuracy: 0.9726027
mtry: 4 | ntree: 50 | node_size 3 | accuracy: 0.9533333
mtry: 4 | ntree: 50 | node_size 5 | accuracy: 0.9533333
mtry: 4 | ntree: 50 | node_size 10 | accuracy: 0.96
mtry: 4 | ntree: 100 | node_size 3 | accuracy: 0.96
mtry: 4 | ntree: 100 | node_size 5 | accuracy: 0.96
mtry: 4 | ntree: 100 | node_size 10 | accuracy: 0.9533333
mtry: 4 | ntree: 150 | node_size 3 | accuracy: 0.96
mtry: 4 | ntree: 150 | node_size 5 | accuracy: 0.96
mtry: 4 | ntree: 150 | node_size 10 | accuracy: 0.96
best_param <- rf_tuning_df[which.max(rf_tuning_df[,"accuracy"]),]
# 3. Random Forest model -------------------------
## 3.1. Fit the Random Forest model -----------------
iris_tuned_rf <- randomForest(class ~ .,
                              ntree = best_param$ntree,
                              mtry = best_param$mtry,
                              nodesize = best_param$node_size,
                              importance = TRUE,
                              data = iris_df)
## 3.2. Variable importance -----------------
importance(iris_tuned_rf)
setosa versicolor virginica MeanDecreaseAccuracy
sepal_length 0.000000 -1.054093 -1.054093 -1.579834
sepal_width 0.000000 -1.054093 2.868963 2.393436
petal_length 3.066885 4.073225 3.369511 3.633379
petal_width 3.123322 6.901647 4.000719 4.838549
MeanDecreaseGini
sepal_length 1.703953
sepal_width 1.914524
petal_length 48.726765
petal_width 47.154758
varImpPlot(iris_tuned_rf)
## 3.3. Model prediction and evaluation -----------------
iris_tuned_rf_pred <- predict(iris_tuned_rf)  # OOB class predictions
table(iris_df$class, iris_tuned_rf_pred)
iris_tuned_rf_pred
setosa versicolor virginica
setosa 50 0 0
versicolor 0 46 4
virginica 0 4 46
Using the caret package, a variety of predictive model architectures can be compared and the one with the best predictive performance picked out automatically. Where categorical encodings, (GBM + glmnet)/2 blends, and train/validation splitting were used in the past, recent work goes beyond xgboost to stacking and keras deep learning techniques for building predictive models.
For fine-tuning xgboost parameters, refer to Owen Zhang's (2015) slides.
Separate operating-system setup on Windows and Mac: check the operating system with .Platform$OS.type, then set up a cluster for parallel processing, as shown in the code chunk below.
# 0. Environment setup --------------------------------
library(caret)
Attaching package: 'caret'
The following object is masked from 'package:purrr':
lift
library(doSNOW)
# 1. Load data --------------------------------
iris_df <- iris
# 2. Clean data --------------------------------
# glimpse(iris_df)
# 3. Build predictive models --------------------------------
## 3.1. Train/test split -----------------------------
train_index <- createDataPartition(iris_df$Species, times=1, p=0.7, list=FALSE)
iris_train <-iris_df[ train_index,]
iris_test <- iris_df[-train_index,]
## 3.2. Model formula -----------------------------------
iris_x_var <- setdiff(colnames(iris_train), "Species")
iris_formula <- as.formula(paste('Species', paste(iris_x_var, collapse=' + '), sep=' ~ '))
## 3.3. Training control -----------------------------
iris_ctrl <- trainControl(method = "repeatedcv",
number = 5,
repeats = 1,
verboseIter = TRUE,
search = "grid")
xgboost_grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
nrounds = c(50, 75, 100),
max_depth = 6:8,
min_child_weight = c(2.0, 2.25, 2.5),
colsample_bytree = c(0.3, 0.4, 0.5),
gamma = 0,
subsample = 1)
## 3.4. Fit the model architectures (glmnet, gbm, xgboost) ------------------------
if(.Platform$OS.type =="windows") {
library(doSNOW)
working_cores <- parallel::detectCores() -1
cl <- makeCluster(working_cores, type = "SOCK")
registerDoSNOW(cl)
} else if(.Platform$OS.type =="unix"){
library(doMC)
working_cores <- parallel::detectCores() -1
registerDoMC(cores = working_cores)
}
iris_glm <- train(iris_formula, data=iris_train,
method="glmnet",
trControl = iris_ctrl)
Aggregating results
Selecting tuning parameters
Fitting alpha = 0.1, lambda = 0.000865 on full training set
library(gbm)
Loading required package: survival
Attaching package: 'survival'
The following object is masked from 'package:caret':
cluster
Loading required package: splines
Loading required package: parallel
Attaching package: 'parallel'
The following objects are masked from 'package:snow':
clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
clusterExport, clusterMap, clusterSplit, makeCluster,
parApply, parCapply, parLapply, parRapply, parSapply,
splitIndices, stopCluster
Loaded gbm 2.1.3
iris_gbm <- train(iris_formula, data=iris_train,
method = "gbm",
trControl = iris_ctrl)
Aggregating results
Selecting tuning parameters
Fitting n.trees = 50, interaction.depth = 3, shrinkage = 0.1, n.minobsinnode = 10 on full training set
Iter TrainDeviance ValidDeviance StepSize Improve
1 1.0986 nan 0.1000 0.3053
2 0.8841 nan 0.1000 0.2575
3 0.7082 nan 0.1000 0.2071
4 0.5712 nan 0.1000 0.1332
5 0.4778 nan 0.1000 0.1209
6 0.4006 nan 0.1000 0.0923
7 0.3408 nan 0.1000 0.0736
8 0.2914 nan 0.1000 0.0482
9 0.2570 nan 0.1000 0.0397
10 0.2246 nan 0.1000 0.0315
20 0.0980 nan 0.1000 -0.0032
40 0.0440 nan 0.1000 -0.0105
50 0.0288 nan 0.1000 -0.0041
iris_xgboost <- train(iris_formula, data=iris_train,
method="xgbTree",
trControl = iris_ctrl,
tuneGrid = xgboost_grid,
num_class = 3)
Aggregating results
Selecting tuning parameters
Fitting nrounds = 75, max_depth = 6, eta = 0.1, gamma = 0, colsample_bytree = 0.3, min_child_weight = 2, subsample = 1 on full training set
Warning in check.booster.params(params, ...): The following parameters were provided multiple times:
num_class
Only the last value for each of them will be used.
# Stop the cluster on Windows; doMC on unix needs no explicit shutdown
if(.Platform$OS.type == "windows") {
  stopCluster(cl)
}
# 4. Model evaluation ---------------------------------------
all_resample <- resamples(list("Logistic" = iris_glm,
"GBM" = iris_gbm,
"xgboost" = iris_xgboost))
parallelplot(all_resample, metric = "Accuracy")
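## Numeric counterpart to the plot above: summary() on the resamples object
## prints the accuracy and kappa distributions for each architecture.
summary(all_resample)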
confusionMatrix(iris_xgboost)
Cross-Validated (5 fold, repeated 1 times) Confusion Matrix
(entries are percentual average cell counts across resamples)
Reference
Prediction setosa versicolor virginica
setosa 33.3 0.0 0.0
versicolor 0.0 29.5 3.8
virginica 0.0 3.8 29.5
Accuracy (average) : 0.9238
iris_pred <- predict(iris_xgboost, newdata = iris_test)
prop.table(table(iris_pred, iris_test$Species))
iris_pred setosa versicolor virginica
setosa 0.3333333 0.0000000 0.0000000
versicolor 0.0000000 0.3333333 0.0000000
virginica 0.0000000 0.0000000 0.3333333
iris_responses <- iris_test %>%
select(Species) %>%
mutate(
predicted_iris = predict(
iris_xgboost,
newdata=iris_test
)
)
table(iris_responses)
predicted_iris
Species setosa versicolor virginica
setosa 15 0 0
versicolor 0 15 0
virginica 0 0 15
ggplot(iris_responses, aes(Species, predicted_iris)) +
geom_jitter(width = 0.15, height=0.15, aes(colour = Species), alpha=0.3) +
geom_abline(intercept=0, slope=1)
# 5. Model deployment ---------------------------------------
iris_pred <- predict(iris_xgboost, newdata = iris_test)
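For deployment, one common pattern (sketched here as an assumption rather than something shown in the original code; the file path is illustrative) is to persist the trained caret object and reload it in the scoring environment:
# Persist the trained caret model
saveRDS(iris_xgboost, "data_processed/iris_xgboost.rds")
# Later, in the scoring environment, reload and predict on new data
iris_xgboost_deployed <- readRDS("data_processed/iris_xgboost.rds")
iris_deployed_pred <- predict(iris_xgboost_deployed, newdata = iris_test)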