1 Dataset

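The tidytuesdayR package gives access to the datasets published by the weekly Tidy Tuesday data science project. Here we download the GDPR Fines data, clean it, and build a model that predicts the size of a fine. The original plan was to load the data with tidytuesdayR, but that ran into the issue "tt_load() invalid multibyte string error #64", so the data is read directly from the website instead.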

library(tidytuesdayR)
library(tidyverse)
library(tidymodels)

fines <- read_tsv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv")
gdpr_text <- read_tsv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_text.tsv")

fines
# A tibble: 250 x 11
      id picture name   price authority date  controller article_violated type 
   <dbl> <chr>   <chr>  <dbl> <chr>     <chr> <chr>      <chr>            <chr>
 1     1 https:… Pola…   9380 Polish N… 10/1… Polish Ma… Art. 28 GDPR     Non-…
 2     2 https:… Roma…   2500 Romanian… 10/1… UTTIS IND… Art. 12 GDPR|Ar… Info…
 3     3 https:… Spain  60000 Spanish … 10/1… Xfera Mov… Art. 5 GDPR|Art… Non-…
 4     4 https:… Spain   8000 Spanish … 10/1… Iberdrola… Art. 31 GDPR     Fail…
 5     5 https:… Roma… 150000 Romanian… 10/0… Raiffeise… Art. 32 GDPR     Fail…
 6     6 https:… Roma…  20000 Romanian… 10/0… Vreau Cre… Art. 32 GDPR|Ar… Fail…
 7     7 https:… Gree… 200000 Hellenic… 10/0… Telecommu… Art. 5 (1) c) G… Fail…
 8     8 https:… Gree… 200000 Hellenic… 10/0… Telecommu… Art. 21 (3) GDP… Fail…
 9     9 https:… Spain  30000 Spanish … 10/0… Vueling A… Art. 5 GDPR|Art… Non-…
10    10 https:… Roma…   9000 Romanian… 09/2… Inteligo … Art. 5 (1) a) G… Non-…
# … with 240 more rows, and 2 more variables: source <chr>, summary <chr>

2 Dataset preparation

For the preprocessing and feature engineering used to build the base table for the predictive model, see "Modeling #TidyTuesday GDPR violations with tidymodels".

gdpr_tidy <- fines %>%
  transmute(id,
    price,
    # country = name,
    article_violated,
    articles = str_extract_all(article_violated, "Art.[:digit:]+|Art. [:digit:]+")
  ) %>%
  mutate(total_articles = map_int(articles, length)) %>%
  unnest(articles) %>%
  add_count(articles) %>%
  filter(n > 10) %>%
  select(-n)
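To make the extraction step easier to follow, here is a minimal sketch of what the regular expression in the pipeline above returns. The sample strings are hypothetical and only imitate the format of `article_violated`; the pattern is copied unchanged from the pipeline.

```r
library(stringr)

# Hypothetical sample strings in the same format as `article_violated`
sample_articles <- c(
  "Art. 28 GDPR",
  "Art. 5 GDPR|Art. 6 GDPR",
  "Art. 5 (1) c) GDPR|Art. 13 GDPR"
)

# Same pattern as in the pipeline above
pattern <- "Art.[:digit:]+|Art. [:digit:]+"

# One character vector of matches per input string
extracted <- str_extract_all(sample_articles, pattern)
extracted
# element 1: "Art. 28"
# element 2: "Art. 5" "Art. 6"
# element 3: "Art. 5" "Art. 13"

# Number of matched articles per fine, which is what `total_articles` stores
lengths(extracted)
# 1 2 2
```

Sub-paragraph markers such as "(1) c)" are dropped, so "Art. 5 (1) c)" and "Art. 5" are counted as the same article.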

gdpr_df <- gdpr_tidy %>%
  mutate(value = 1) %>%
  select(-article_violated) %>%
  pivot_wider(
    names_from = articles, values_from = value,
    values_fn = list(value = max), values_fill = list(value = 0)
  ) %>%
  janitor::clean_names()

gdpr_df
# A tibble: 219 x 8
      id  price total_articles art_13 art_5 art_6 art_32 art_15
   <dbl>  <dbl>          <int>  <dbl> <dbl> <dbl>  <dbl>  <dbl>
 1     2   2500              4      1     1     1      0      0
 2     3  60000              2      0     1     1      0      0
 3     5 150000              1      0     0     0      1      0
 4     6  20000              2      0     0     0      1      0
 5     7 200000              2      0     1     0      0      0
 6     9  30000              2      0     1     1      0      0
 7    10   9000              2      0     1     1      0      0
 8    11 195407              3      0     0     0      0      1
 9    12  10000              1      0     1     0      0      0
10    13 644780              1      0     0     0      1      0
# … with 209 more rows
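The reshaping above turns one row per (fine, article) pair into one indicator column per article. The following sketch shows the same idea on a small, made-up table. The data and the name `toy_long` are assumptions for illustration, and `janitor::clean_names()` is left out to keep the example light, so the column names keep their original spelling.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical long-format data: one row per (fine, article) pair.
# Fine 1 lists "Art. 5" twice to show why an aggregation function is needed.
toy_long <- tribble(
  ~id, ~articles,
  1,   "Art. 5",
  1,   "Art. 5",
  1,   "Art. 6",
  2,   "Art. 32"
)

toy_wide <- toy_long %>%
  mutate(value = 1) %>%
  pivot_wider(
    names_from = articles, values_from = value,
    values_fn = list(value = max),   # duplicates collapse to a single 1
    values_fill = list(value = 0)    # article not cited for this fine -> 0
  )

toy_wide
# Fine 1 has 1 for "Art. 5" and "Art. 6" and 0 for "Art. 32";
# fine 2 has 1 only for "Art. 32".
```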

2.1 Train/test split and preprocessing

Split the data into training and test sets, then use the recipes package for feature engineering.

# Training/test datasets
tidy_split <- initial_split(gdpr_df, prop = 0.8, strata = price)

tidy_train <- training(tidy_split)
tidy_test  <- testing(tidy_split)
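As a rough illustration of what a stratified split does, the sketch below applies the same `initial_split()` call to simulated data. The data frame `toy`, its skewed `price` values, and the seed are assumptions made up for this example, not part of the GDPR data.

```r
library(rsample)

set.seed(123)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical skewed outcome, loosely imitating fine amounts
toy <- data.frame(id = 1:200, price = 10^runif(200, min = 2, max = 6))

toy_split <- initial_split(toy, prop = 0.8, strata = price)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Every row lands in exactly one of the two sets
nrow(toy_train) + nrow(toy_test)

# Stratifying on price should keep the quartiles broadly similar in both sets
quantile(toy_train$price, c(0.25, 0.5, 0.75))
quantile(toy_test$price, c(0.25, 0.5, 0.75))
```

For a numeric `strata` variable, rsample bins the values before sampling, which helps keep very large and very small fines represented in both sets.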

# Feature Engineering

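gdpr_rec <- recipe(price ~ ., data = gdpr_df) %>%
  update_role(id, new_role = "id") %>%
  step_log(price, base = 10, offset = 1, skip = TRUE) %>%
  # step_other(country, threshold = 0.1, other = "Other") %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors())

# Keep gdpr_rec untrained: the workflow below preps it on the training data
# when it is fitted. Prep a copy here only to inspect the processed data.
gdpr_rec %>% prep() %>% juice()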
# A tibble: 219 x 8
      id total_articles art_13 art_5 art_6 art_32 art_15 price
   <dbl>          <int>  <dbl> <dbl> <dbl>  <dbl>  <dbl> <dbl>
 1     2              4      1     1     1      0      0  3.40
 2     3              2      0     1     1      0      0  4.78
 3     5              1      0     0     0      1      0  5.18
 4     6              2      0     0     0      1      0  4.30
 5     7              2      0     1     0      0      0  5.30
 6     9              2      0     1     1      0      0  4.48
 7    10              2      0     1     1      0      0  3.95
 8    11              3      0     0     0      0      1  5.29
 9    12              1      0     1     0      0      0  4.00
10    13              1      0     0     0      1      0  5.81
# … with 209 more rows
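The `price` column in the output above is on the log10 scale produced by `step_log(price, base = 10, offset = 1)`. A minimal base R sketch of that transform and its inverse follows. The helper names `to_log10` and `from_log10` are made up for this example; the three prices are the first three values shown in `gdpr_df`.

```r
# Forward transform equivalent to step_log(price, base = 10, offset = 1)
to_log10 <- function(price) log10(price + 1)

# Inverse transform: from the log10 scale back to the original price scale
from_log10 <- function(x) 10^x - 1

prices <- c(2500, 60000, 150000)  # first three `price` values in gdpr_df
logged <- to_log10(prices)

round(logged, 2)
# 3.40 4.78 5.18, matching the first rows of the processed data

# Applying the inverse recovers the original prices
from_log10(logged)
```

The same inverse can be applied to model predictions when they need to be reported on the original price scale.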

3 Building the predictive model

glm_spec <- linear_reg() %>%
  set_engine("lm") %>% 
  set_mode("regression")

gdpr_wf <- workflow() %>%
  add_recipe(gdpr_rec) %>%
  add_model(glm_spec) 

gdpr_fit <- fit(gdpr_wf, tidy_train) 

4 Evaluating the predictive model

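# step_log() uses skip = TRUE, so .pred is on the log10 scale while tidy_test
# holds raw prices. Put the truth on the same scale before scoring.
gdpr_fit %>%
  predict(tidy_test) %>%
  bind_cols(tidy_test) %>%
  mutate(price = log10(price + 1)) %>%
  metrics(truth = price, estimate = .pred)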
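Because the recipe log-transforms `price` with `skip = TRUE`, the transform is applied when the model is trained but not when new data is processed for prediction. Predictions therefore come back on the log10 scale. The sketch below uses made-up numbers (the data frame `toy_eval` is hypothetical) to illustrate why truth and estimate need to share a scale before metrics are computed.

```r
library(yardstick)

# Hypothetical test rows: raw prices plus predictions on the log10 scale
toy_eval <- data.frame(
  price = c(1000, 10000, 100000),
  .pred = c(3.2, 4.1, 4.8)
)

# Put the truth on the same scale as the predictions
toy_eval$log_price <- log10(toy_eval$price + 1)

# RMSE with matching scales: an interpretable error in log10 units
rmse_log <- rmse_vec(truth = toy_eval$log_price, estimate = toy_eval$.pred)

# RMSE with mismatched scales: dominated by the raw price magnitudes
rmse_raw <- rmse_vec(truth = toy_eval$price, estimate = toy_eval$.pred)

c(rmse_log = rmse_log, rmse_raw = rmse_raw)
```

In this toy case the mismatched version is several orders of magnitude larger, and that size reflects the scale mismatch rather than the quality of the model.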
 

Written by data scientist Kwangchun Lee (이광춘)

kwangchun.lee.7@gmail.com