1 캐글 데이터셋

Women’s e-commerce cloting reviews 데이터를 바탕으로 텍스트 데이터를 예측모형에 Feature로 넣어 예측력을 향상시키는 방안을 살펴보자.

현장에서 자연어 처리(NLP)는 매우 드물지만 텍스트 데이터에 대한 중요도는 지속적으로 증대되고 있다. 이를 현재 시점에서 대응하는 한 방식이 깔끔한 데이터(Tidy Data) 원칙을 단순히 단어를 세는 Count-based 방법론과 결합하면 상당히 실무에서 유용하게 활용할 수 있게 된다.

1.1 데이터 사전

캐글 Women’s e-commerce cloting reviews 데이터는 총 11개 변수로 구성되어 있고 관측점이 23,486개로 구성되어 있다. Recommended IND를 라벨 목표변수로 두고 예측모형을 구축해보자.

Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
Age: Positive Integer variable of the reviewers age.
Title: String variable for the title of the review.
Review Text: String variable for the review body.
Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
Recommended IND: Binary variable stating where the customer recommends the product where 1 is recommended, 0 is not - recommended.
Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
Division Name: Categorical name of the product high level division.
Department Name: Categorical name of the product department name.
Class Name: Categorical name of the product class name.

library(tidyverse)
library(janitor)

cloth_dat <- read_csv("data/Womens Clothing E-Commerce Reviews.csv")

1.2 데이터 전처리

캐글 옷 리뷰 데이터에서 텍스트 관련 변수(Title, Review)를 따로 빼서 직사각형 데이터프레임을 생성한다. 즉, 텍스트는 별도로 빼서 DTM을 만들어 결합시킨다. 텍스트 Feature를 모형설계행렬로 반영한 후 예측모형 알고리즘을 돌려 예측모형 정확도를 향상시킨다.

텍스트

cloth_dat <- cloth_dat %>% 
  clean_names() %>% 
  filter(complete.cases(.)) %>% 
  rename(y = recommended_ind)

cloth_df <- cloth_dat %>% 
  mutate(y = factor(y, levels=c(1,0), labels=c("yes", "no"))) %>% 
  mutate_if(is.character, as.factor) %>% 
  select(y, age, title, review_text, division_name, department_name, class_name) %>% 
  mutate(class = fct_lump(class_name, 9)) %>% 
  mutate(review = str_c(title, " ", review_text)) %>% 
  select(y, age, division = division_name, department = department_name, class, review) %>% 
  mutate(review_id = row_number())

cloth_text_df <- cloth_df %>% 
  select(review_id, review)

cloth_text_df %>% 
  sample_n(10) %>% 
  DT::datatable()

1.3 깔끔한 텍스트 데이터 ¹

원본 텍스트 데이터를 unnest_tokens() 함수로 깔끔한 텍스트 형태로 변환시킨다. 그리고 나서 불용어 사전에서 불용어를 anti_join()으로 제거한다.

library(tidytext)

cloth_tidytext_df <- cloth_text_df %>% 
  unnest_tokens(word, review) %>% 
  anti_join(get_stopwords(source = "smart"))

cloth_tidytext_df %>% 
  sample_n(100) %>% 
  DT::datatable()

2 텍스트 기술통계량

2.1 추천 비추천 감성분석

2.1.1 추천 비추천 고빈도 단어

추천과 비추천에 많이 사용된 단어를 상위 10개 뽑아 비교한다.

cloth_df %>% 
  select(review_id, y) %>% 
  left_join(cloth_tidytext_df) %>% 
    count(word, y, sort=TRUE) %>%
    group_by(y) %>% 
    top_n(10, wt=n) %>% 
    ungroup() %>%
    ggplot(aes(x = fct_reorder(word, n), y=n, fill=y)) +
      geom_col() +
      coord_flip() +
      facet_wrap(~y, scales = "free") +
      labs(x="", y="") +
      scale_y_continuous(labels = scales::comma)

2.1.2 긍부정 감성 고빈도 단어

옷추천과 비추천을 감성분석과 연계하여 추천에 활용된 긍정적인 단어와 비추천에 활용도가 높은 단어 빈도수를 비교한다.

cloth_tidytext_df %>% 
  inner_join(get_sentiments("bing")) %>% 
  count(word, sentiment, sort=TRUE) %>%
  group_by(sentiment) %>% 
  top_n(10, wt=n) %>% 
  ungroup() %>% 
  ggplot(aes(x=reorder(word,n), y=n, fill=sentiment)) +
    geom_col() +
    coord_flip() +
    facet_wrap(~sentiment, scales = "free") +
    labs(x="", y="") +
    scale_y_continuous(labels = scales::comma)

리뷰에 나타난 전체를 파악하는 대신에, 추천과 비추천 리뷰에 동원된 감성어도 각기 나눠 살펴볼 수 있다. 이를 위해서 review_id를 추천여부와 결합하여 감성분석결과와 결합하여 구분필드로 활용한다.

cloth_df %>% 
  select(review_id, y) %>% 
  left_join(cloth_tidytext_df) %>% 
    inner_join(get_sentiments("bing")) %>% 
    count(y, word, sentiment, sort=TRUE) %>%
    group_by(y, sentiment) %>% 
    top_n(10, wt=n) %>% 
    ungroup() %>% 
    ggplot(aes(x=fct_reorder(word, n), y=n, fill=sentiment)) +
      geom_col() +
      coord_flip() +
      facet_wrap(~y+sentiment, scales = "free") +
      labs(x="", y="") +
      scale_y_continuous(labels = scales::comma)

3 리뷰 중요단어 - `tf-idf`

tf-idf는 옷리뷰에 대한 추천과 비추천 분석에 대한 중요지표가 된다. bind_tf_idf() 함수로 추천과 비추천 리뷰에 대한 중요단어를 추출하여 막대그래프로 시각화한다.

## 리뷰 전체 단어숫자
review_words <- cloth_tidytext_df %>%
  count(review_id, word, sort = TRUE)

total_review_words <- review_words %>%
 group_by(review_id) %>%
 summarize(total = sum(n)) 

## 리뷰 TF-IDF
cloth_tidytext_tfidf <- cloth_tidytext_df %>% 
  count(review_id, word) %>% 
  left_join(total_review_words) %>% 
  bind_tf_idf(word, review_id, n) 

cloth_tidytext_tfidf %>% 
  top_n(100, wt=tf_idf) %>% 
  arrange(desc(tf_idf)) %>% 
  DT::datatable() %>% 
    DT::formatRound(c(5:7), digits=3)

cloth_df 에서 추천/비추천 필드(y)를 결합해서 앞서 계산한 tf_idf와 결합하여 해당 리뷰에서는 자주 나타나고, 전체 리뷰에는 적게 나타나는 단어를, 추천/비추천과 결합하여 중요한 단어를 식별해 낸다.

cloth_df %>% 
  select(review_id, y) %>% 
  left_join(cloth_tidytext_tfidf) %>% 
    group_by(y) %>% 
    top_n(10, wt=tf_idf) %>% 
    ungroup() %>% 
  ggplot(aes(fct_reorder(word, tf_idf), tf_idf, fill = y)) +
    geom_col(show.legend = FALSE) +
    coord_flip() +
    facet_wrap(~y, scales = "free") +
    theme_minimal() +
    labs(x="")

4 단어 연관성 - `ngram`

4.1 단어 연관성 - `tf-idf`

token = "ngrams"을 활용하여 n = 2로 지정하여 bigram 두단어 토큰을 생성하고 앞선 unigram 토근을 활용한 것과 동일한 방식으로 리뷰를 분석한다. tf_idf를 계산하여 추천/비추천에 활용된 bigram 두 단어를 식별해보자.

cloth_tidy_bigram <- cloth_text_df %>% 
  unnest_tokens(bigram, review, token = "ngrams", n = 2)

cloth_tidy_bigram_filtered <- cloth_tidy_bigram %>% 
  separate(bigram, c("word1", "word2"), sep=" ") %>% 
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>% 
  unite(bigram, word1, word2, sep = " ")

cloth_tidy_ngram_tfidf <- cloth_tidy_bigram_filtered %>% 
  count(review_id, bigram, sort=TRUE) %>% 
  bind_tf_idf(bigram, review_id, n)

cloth_tidy_ngram_tfidf

# A tibble: 125,109 x 6
   review_id bigram            n    tf   idf tf_idf
       <int> <chr>         <int> <dbl> <dbl>  <dbl>
 1      2219 amp amp          11 0.423  7.78   3.29
 2       502 bathing suit      5 0.417  5.79   2.41
 3     14404 love love         5 0.714  4.27   3.05
 4     10884 amp amp           4 0.222  7.78   1.73
 5     13346 stain remover     4 0.444  8.25   3.67
 6       195 sweater coat      3 0.25   5.93   1.48
 7       211 amp 39            3 0.3    7.22   2.17
 8       407 amp 39            3 0.429  7.22   3.10
 9       520 beach cover       3 0.25   6.68   1.67
10      1205 upper arms        3 0.375  6.06   2.27
# ... with 125,099 more rows

4.2 단어 연관성 - 감성분석

4.3 단어 연관성 - 그래프(`ggraph`)

5 예측모형 ²

텍스트 데이터에 대한 탐색적 데이터분석과 비지도 학습 모형에 대한 검토가 끝났다면 다음 단계로 옷추천에 대한 예측모형을 구축하는 것이다. 이를 위해서 tidytext, quanteda, tidymodels 3가지 도구가 시장에 나와 있다.

library(doSNOW)
library(caret)
# 실행시간
start.time <- Sys.time()

cl <- makeCluster(4, type = "SOCK")
registerDoSNOW(cl)


sparse_words <- cloth_tidytext_df %>% 
  count(y, review_id, word, sort = TRUE) %>% 
  cast_sparse(review_id, word, n)


model <- cv.glmnet(sparse_words, cloth_df$y, family = "binomial", 
                   parallel = TRUE, keep = TRUE)

stopCluster(cl)
 
total.time <- Sys.time() - start.time
total.time

xwMOOC 모형

캐글 - 전자상거래 옷 리뷰

xwMOOC

2018-11-26

1 캐글 데이터셋

1.1 데이터 사전

1.2 데이터 전처리

1.3 깔끔한 텍스트 데이터 ¹

2 텍스트 기술통계량

2.1 추천 비추천 감성분석

2.1.1 추천 비추천 고빈도 단어

2.1.2 긍부정 감성 고빈도 단어

3 리뷰 중요단어 - `tf-idf`

4 단어 연관성 - `ngram`

4.1 단어 연관성 - `tf-idf`

4.2 단어 연관성 - 감성분석

4.3 단어 연관성 - 그래프(`ggraph`)

5 예측모형 ²

6 예측모형 2 ³

xwMOOC 모형

캐글 - 전자상거래 옷 리뷰

xwMOOC

2018-11-26

1 캐글 데이터셋

1.1 데이터 사전

1.2 데이터 전처리

1.3 깔끔한 텍스트 데이터 1

2 텍스트 기술통계량

2.1 추천 비추천 감성분석

2.1.1 추천 비추천 고빈도 단어

2.1.2 긍부정 감성 고빈도 단어

3 리뷰 중요단어 - tf-idf

4 단어 연관성 - ngram

4.1 단어 연관성 - tf-idf

4.2 단어 연관성 - 감성분석

4.3 단어 연관성 - 그래프(ggraph)

5 예측모형 2

6 예측모형 2 3

1.3 깔끔한 텍스트 데이터 ¹

3 리뷰 중요단어 - `tf-idf`

4 단어 연관성 - `ngram`

4.1 단어 연관성 - `tf-idf`

4.3 단어 연관성 - 그래프(`ggraph`)

5 예측모형 ²

6 예측모형 2 ³