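1 Preparing the Text Data

First, load the crude dataset, a set of English-language news articles that ships with the tm package, and convert it into tidy form for use with tidytext.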

library(tidyverse)
library(tm)
library(tidytext)

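# Load the crude corpus that ships with tm
data("crude")

# Convert the corpus to a tibble, keeping oldid (renamed to id) and the body text
crude_tbl <- tidy(crude)
crude_text <- crude_tbl %>% select(id = oldid, text)

# Show the first article
crude_text %>% 
  slice(1)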

# A tibble: 1 x 2
  id    text                                                               
  <chr> <chr>                                                              
1 5670  "Diamond Shamrock Corp said that\neffective today it had cut its c…

2 Bag of Words (BoW) and TF-IDF

2.1 Tokenization

First, tokenize the oil-related news articles into individual words. Then use anti_join() against the stop_words table to remove English stop words.

crude_text %>% 
  unnest_tokens(output = "word", input = text, token = "words") %>%  # one row per word
  anti_join(stop_words, by = "word")                                 # drop stop words

# A tibble: 2,167 x 2
   id    word     
   <chr> <chr>    
 1 5670  diamond  
 2 5670  shamrock 
 3 5670  corp     
 4 5670  effective
 5 5670  cut      
 6 5670  contract 
 7 5670  prices   
 8 5670  crude    
 9 5670  oil      
10 5670  1.50     
# … with 2,157 more rows
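
To make the two steps concrete, here is a minimal base R sketch of tokenization followed by stop-word removal. It uses only the visible start of the first article, and `toy_stop_words` is a hand-picked, hypothetical list rather than the real stop_words table. The regular-expression split is also only an approximation of what unnest_tokens() does internally.

```r
# Visible start of article 5670 (the rest is truncated in the output above)
text <- "Diamond Shamrock Corp said that\neffective today it had cut its"

# Hypothetical, hand-picked stop words (NOT the tidytext stop_words table)
toy_stop_words <- c("said", "that", "today", "it", "had", "its")

# Lowercase, then split on anything that is not a letter or digit
tokens <- strsplit(tolower(text), "[^a-z0-9]+")[[1]]
tokens <- tokens[nzchar(tokens)]

# Keep only tokens that are not stop words (the role anti_join() plays above)
kept <- tokens[!tokens %in% toy_stop_words]
print(kept)
```

The surviving tokens are diamond, shamrock, corp, effective and cut, which matches the first five rows of the tidytext result.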

2.2 Bag of Words → TF-IDF

Next, quantify the words with the bag-of-words approach. count() computes how often each word occurs in each document, and bind_tf_idf() then computes TF-IDF from those counts.

(crude_tfidf <- crude_text %>% 
  unnest_tokens(output = "word", input = text, token = "words") %>% 
  anti_join(stop_words, by = "word") %>% 
  count(id, word, sort = TRUE) %>%   # word counts per document (bag of words)
  bind_tf_idf(word, id, n))          # adds the tf, idf and tf_idf columns

# A tibble: 1,498 x 6
   id    word        n     tf   idf tf_idf
   <chr> <chr>   <int>  <dbl> <dbl>  <dbl>
 1 5687  opec       13 0.065  0.693 0.0451
 2 5687  oil        12 0.06   0     0     
 3 8321  kuwait     10 0.0429 1.39  0.0595
 4 12456 mln         9 0.0409 0.799 0.0327
 5 12887 futures     9 0.0643 3.00  0.193 
 6 8333  oil         9 0.0479 0     0     
 7 8333  prices      9 0.0479 0.288 0.0138
 8 12456 bpd         8 0.0364 1.39  0.0504
 9 8321  opec        8 0.0343 0.693 0.0238
10 8322  report      8 0.0362 3.00  0.108 
# … with 1,488 more rows
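
The columns added by bind_tf_idf() can be checked by hand. The sketch below reproduces the first row (id 5687, word "opec") in base R, assuming tf is the word count divided by the number of counted words in the document and idf is the natural log of the number of documents divided by the number of documents containing the word. The values 200 and 10 are not shown in the output; they are inferred from the printed tf and idf.

```r
n_term      <- 13    # occurrences of "opec" in document 5687 (column n)
n_doc_terms <- 200   # inferred from tf: 13 / 0.065
n_docs      <- 20    # articles in the crude corpus
n_docs_term <- 10    # inferred from idf: 0.693 is ln(20 / 10)

tf     <- n_term / n_doc_terms
idf    <- log(n_docs / n_docs_term)   # log() is the natural logarithm
tf_idf <- tf * idf
print(round(c(tf = tf, idf = idf, tf_idf = tf_idf), 4))

# A word found in every document gets idf = ln(20 / 20) = 0,
# which is why "oil" has a tf_idf of 0 in the output.
idf_oil <- log(n_docs / 20)
print(idf_oil)
```

Rounded to the displayed precision, this gives tf 0.065, idf 0.693 and tf_idf 0.0451, in line with the table.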

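3 Cosine Similarity

The pairwise_similarity() function from the widyr package measures the cosine similarity between news articles, here using each article's TF-IDF weights. Each pair of articles appears twice in the result, once in each order.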

crude_tfidf %>% 
  widyr::pairwise_similarity(id, word, tf_idf) %>% 
  arrange(desc(similarity))

# A tibble: 380 x 3
   item1 item2 similarity
   <chr> <chr>      <dbl>
 1 12672 12685      0.837
 2 12685 12672      0.837
 3 12535 8333       0.583
 4 8333  12535      0.583
 5 12536 8321       0.409
 6 8321  12536      0.409
 7 5737  12726      0.286
 8 12726 5737       0.286
 9 5670  12726      0.284
10 12726 5670       0.284
# … with 370 more rows
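
As a reminder of what the similarity column measures, here is a minimal base R sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The three vectors are made-up toy TF-IDF weights over the same three words, not values from the crude data, and `cosine_similarity()` is a helper defined here for illustration only.

```r
# Cosine similarity: dot product divided by the product of the vector norms
cosine_similarity <- function(a, b) {
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

# Made-up TF-IDF weights for three documents over the same three words
doc_a <- c(0.2, 0.0, 0.1)
doc_b <- c(0.4, 0.0, 0.2)   # same direction as doc_a, only longer
doc_c <- c(0.0, 0.3, 0.0)   # shares no weighted word with doc_a

sim_ab <- cosine_similarity(doc_a, doc_b)   # 1, up to floating-point rounding
sim_ac <- cosine_similarity(doc_a, doc_c)   # 0
print(c(sim_ab = sim_ab, sim_ac = sim_ac))
```

Because the measure depends only on direction, two articles with the same mix of words score close to 1 regardless of their length, while articles with no weighted words in common score 0.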

Let's look at the two articles with the highest similarity.

crude_text %>% 
  filter(id %in% c("12672", "12685"))   # id is a character column

# A tibble: 2 x 2
  id    text                                                               
  <chr> <chr>                                                              
1 12672 "A study group said the United States\nshould increase its strateg…
2 12685 "A study group said the United States\nshould increase its strateg…

xwMOOC Natural Language Processing - Text

BoW and TF-IDF

xwMOOC

2019-09-19
