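The tidytuesdayR package is a data package that collects the datasets from the Tidy Tuesday data science hacking project. Here we download the GDPR Fines data, clean it, and develop a model that predicts fines. The original plan was to load the data with tidytuesdayR, but tt_load() ran into the "invalid multibyte string error" issue (#64), so the data is read directly from the website instead.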
library(tidytuesdayR)
library(tidyverse)
library(tidymodels)
fines <- read_tsv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_violations.tsv")
gdpr_text <- read_tsv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2020/2020-04-21/gdpr_text.tsv")
fines
# A tibble: 250 x 11
id picture name price authority date controller article_violated type
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
1 1 https:… Pola… 9380 Polish N… 10/1… Polish Ma… Art. 28 GDPR Non-…
2 2 https:… Roma… 2500 Romanian… 10/1… UTTIS IND… Art. 12 GDPR|Ar… Info…
3 3 https:… Spain 60000 Spanish … 10/1… Xfera Mov… Art. 5 GDPR|Art… Non-…
4 4 https:… Spain 8000 Spanish … 10/1… Iberdrola… Art. 31 GDPR Fail…
5 5 https:… Roma… 150000 Romanian… 10/0… Raiffeise… Art. 32 GDPR Fail…
6 6 https:… Roma… 20000 Romanian… 10/0… Vreau Cre… Art. 32 GDPR|Ar… Fail…
7 7 https:… Gree… 200000 Hellenic… 10/0… Telecommu… Art. 5 (1) c) G… Fail…
8 8 https:… Gree… 200000 Hellenic… 10/0… Telecommu… Art. 21 (3) GDP… Fail…
9 9 https:… Spain 30000 Spanish … 10/0… Vueling A… Art. 5 GDPR|Art… Non-…
10 10 https:… Roma… 9000 Romanian… 09/2… Inteligo … Art. 5 (1) a) G… Non-…
# … with 240 more rows, and 2 more variables: source <chr>, summary <chr>
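The workaround described above (try the package loader, then fall back to reading the raw file) can be sketched with tryCatch(). This is only an illustration: load_with_fallback(), failing_loader(), and direct_loader() are hypothetical names that do not come from tidytuesdayR, and the stand-in loaders exist only so that the sketch runs offline.

```r
# Hypothetical helper (not part of tidytuesdayR): try a primary loader
# and fall back to a second one if the first raises an error.
load_with_fallback <- function(primary, fallback) {
  tryCatch(
    primary(),
    error = function(e) {
      message("Primary loader failed: ", conditionMessage(e))
      fallback()
    }
  )
}

# Stand-in loaders so the sketch runs without network access.
# In the post, the primary would be the tt_load() call and the fallback
# the read_tsv() call on the raw GitHub URL.
failing_loader <- function() stop("invalid multibyte string")
direct_loader  <- function() data.frame(id = 1:2, price = c(9380, 2500))

fines_demo <- load_with_fallback(failing_loader, direct_loader)
```

In a real session the two arguments would wrap the actual calls, for example `function() read_tsv(url)` as the fallback, where `url` is the gdpr_violations.tsv address used above.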
For the preprocessing and feature engineering used to build the base table for the predictive model, see "Modeling #TidyTuesday GDPR violations with tidymodels".
gdpr_tidy <- fines %>%
  transmute(id,
            price,
            # country = name,
            article_violated,
            articles = str_extract_all(article_violated, "Art.[:digit:]+|Art. [:digit:]+")
  ) %>%
  mutate(total_articles = map_int(articles, length)) %>%
  unnest(articles) %>%
  add_count(articles) %>%
  filter(n > 10) %>%
  select(-n)
gdpr_df <- gdpr_tidy %>%
  mutate(value = 1) %>%
  select(-article_violated) %>%
  pivot_wider(
    names_from = articles, values_from = value,
    values_fn = list(value = max), values_fill = list(value = 0)
  ) %>%
  janitor::clean_names()
gdpr_df
# A tibble: 219 x 8
id price total_articles art_13 art_5 art_6 art_32 art_15
<dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 2500 4 1 1 1 0 0
2 3 60000 2 0 1 1 0 0
3 5 150000 1 0 0 0 1 0
4 6 20000 2 0 0 0 1 0
5 7 200000 2 0 1 0 0 0
6 9 30000 2 0 1 1 0 0
7 10 9000 2 0 1 1 0 0
8 11 195407 3 0 0 0 0 1
9 12 10000 1 0 1 0 0 0
10 13 644780 1 0 0 0 1 0
# … with 209 more rows
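To make the reshaping above easier to follow, here is a small base R sketch of the same idea on toy data: extract the article numbers, count them per fine, and spread them into 0/1 indicator columns. The toy strings are made up for illustration, the pattern escapes the dot (unlike the pattern in the pipeline above), and the `n > 10` frequency filter is left out because the toy data is too small for it.

```r
# Toy input shaped like the article_violated column (values made up for illustration)
toy <- data.frame(
  id = c(1, 2, 3),
  article_violated = c(
    "Art. 28 GDPR",
    "Art. 12 GDPR|Art. 13 GDPR|Art. 5 (1) c) GDPR",
    "Art. 5 GDPR|Art. 6 GDPR"
  ),
  stringsAsFactors = FALSE
)

# Base R counterpart of str_extract_all(): "Art.", an optional space, then digits
pattern  <- "Art\\. ?[0-9]+"
articles <- regmatches(toy$article_violated, gregexpr(pattern, toy$article_violated))

# Number of cited articles per fine (the total_articles column)
total_articles <- lengths(articles)

# Long format: one row per (id, article) pair, similar to unnest(articles)
long <- data.frame(
  id      = rep(toy$id, times = total_articles),
  article = unlist(articles)
)

# Wide 0/1 indicators: cross-tabulate and cap counts at 1, similar to pivot_wider()
wide <- unclass(table(long$id, long$article))
wide[wide > 1] <- 1
```

Each row of `wide` is one fine and each column one article, so `wide["2", "Art. 5"]` is 1 because the second toy fine cites Article 5.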
Split the data into training and test sets, then use the recipes package for feature engineering.
# Training/test datasets
tidy_split <- initial_split(gdpr_df, prop = 0.8, strata = price)
tidy_train <- training(tidy_split)
tidy_test <- testing(tidy_split)
# Feature engineering
# Note: the recipe is defined and prepped on the full gdpr_df, not on tidy_train
gdpr_rec <- recipe(price ~ ., data = gdpr_df) %>%
  update_role(id, new_role = "id") %>%
  # skip = TRUE: the log transform is applied during prep()/juice()
  # but skipped when bake() is called on new data
  step_log(price, base = 10, offset = 1, skip = TRUE) %>%
  # step_other(country, threshold = 0.1, other = "Other") %>%
  step_dummy(all_nominal()) %>%
  step_zv(all_predictors()) %>%
  prep()
gdpr_rec %>% juice()
# A tibble: 219 x 8
id total_articles art_13 art_5 art_6 art_32 art_15 price
<dbl> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 2 4 1 1 1 0 0 3.40
2 3 2 0 1 1 0 0 4.78
3 5 1 0 0 0 1 0 5.18
4 6 2 0 0 0 1 0 4.30
5 7 2 0 1 0 0 0 5.30
6 9 2 0 1 1 0 0 4.48
7 10 2 0 1 1 0 0 3.95
8 11 3 0 0 0 0 1 5.29
9 12 1 0 1 0 0 0 4.00
10 13 1 0 0 0 1 0 5.81
# … with 209 more rows
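The price column in the output above is on a log scale: step_log(price, base = 10, offset = 1) adds the offset and then takes the base-10 logarithm. The following sketch reproduces that arithmetic by hand and shows the inverse needed to get predictions back to the original price scale. The helper names log_price() and unlog_price() are hypothetical and not part of recipes.

```r
# Forward transform corresponding to step_log(price, base = 10, offset = 1)
log_price <- function(price) log10(price + 1)

# Inverse transform, to return to the original price scale
unlog_price <- function(x) 10^x - 1

prices <- c(2500, 60000, 150000)  # first three prices in gdpr_df
logged <- log_price(prices)
round(logged, 2)
# 3.40 4.78 5.18 (matches the first three rows of the juice() output)

restored <- unlog_price(logged)   # back to 2500, 60000, 150000
```

Because the step is declared with skip = TRUE, the same transform would have to be applied manually, as in log_price(), to any new data whose outcome should be compared with model predictions.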
Written by data scientist Kwangchun Lee (이광춘)
kwangchun.lee.7@gmail.com