1 텍스트 작업흐름

tidytext 일반적인 작업 흐름은 텍스트를 단어주머니(Bag of Word)로 넣고 이를 DTM(Document-Term_matrix)로 변환시킨 후에 시각화를 위해서 단어구름 혹은 막대그래프, 토픽모형같은 비지도 학습과 예측모형을 위해서 지도학습을 함께 작업한다.

전통적인 작업흐름에 더해서 불용어(Stop Words)와 감성사전(Sentiment Dictionary)을 추가하여 텍스트에서 의미없는 단어를 제거하고 감성을 더 풍부하게 분석할 수 있다.

자연어 처리 작업흐름에 앞뒤에 광학문자인식(OCR)과 예측모형을 외부에 뽑아 내어 다른 앱이나 사용자가 사용할 수 있도록 확장하는 것도 가능하다.

2 트위터 데이터셋

캐글 Sentiment140 dataset with 1.6 million tweets - Sentiment analysis with tweets 데이터셋은 Sentiment140 으로 불리며, 트위터 API에서 추출된 1,600,000건 트윗이 담겨져 있다. 그런데, 각 트윗은 0=부정(negative), 4=긍정(positive)으로 라벨이 붙어있어 다양한 텍스트 분석을 수행하는데 적합한 데이터셋 중의 하나로 평가된다.

library(tidyverse)
library(tidytext)

tw_dat <- read_csv("data/sentiment140/training.1600000.processed.noemoticon.csv", 
                  col_names = c("target", "ids", "date", "flag", "user", "text"),
                  cols(target = col_integer(),
                       ids    = col_character(),
                       date   = col_character(),
                       flag   = col_character(),
                       user   = col_character(),
                       text   = col_character()))

tw_dat %>% 
    count(target)

# A tibble: 2 x 2
  target      n
   <int>  <int>
1      0 800000
2      4 800000

3 데이터 전처리

3.1 데이터 일부 추출

데이터가 너무 커서 긍부정(target)을 각 1%씩 즉, 8,000개를 뽑아서 텍스트 자연어 처리를 위한 준비를 한다.

set.seed(777)

tw_df <- tw_dat %>% 
  group_by(target) %>% 
  sample_n(8000) %>% 
  ungroup() %>% 
  mutate(target = ifelse(target == 0, 0, 1)) %>% 
  mutate(target = factor(target, levels = c(0,1), labels=c("부정", "긍정")))

3.2 해쉬태그, @, URL 제거

인터넷 정규표현식을 참조하여 해쉬태그, @, URL을 제거한다.

tw_regex_df <- tw_df %>% 
  mutate(text = str_remove_all(text, "\\B(\\#[a-zA-Z]+\\b)(?!;)")) %>%  # 해쉬태그 제거
  mutate(text = str_remove_all(text, "\\B(\\@[a-zA-Z]+\\b)(?!;)")) %>%  # @제거
  mutate(text = str_remove_all(text, "((https?):((//)|(\\\\))+([\\w\\d:#@%/;$()~_?\\+-=\\\\.&](#!)?)*)")) # URL 제거 
tw_regex_df

# A tibble: 16,000 x 6
   target ids     date          flag   user     text                       
   <fct>  <chr>   <chr>         <chr>  <chr>    <chr>                      
 1 부정   220287… Tue Jun 16 2… NO_QU… danielR… My last night on this tour 
 2 부정   205547… Sat Jun 06 0… NO_QU… alraK5   I need a ladder.  This wor…
 3 부정   199095… Mon Jun 01 0… NO_QU… youniqu… I TALKED TO U BUT I DIDNT …
 4 부정   232759… Thu Jun 25 0… NO_QU… jlw072   Way bummed that I don't ge…
 5 부정   220423… Wed Jun 17 0… NO_QU… theital… i miss tuesdays at My Hous…
 6 부정   147023… Tue Apr 07 0… NO_QU… SillySt… Sad day!!!!. Totally thoug…
 7 부정   199093… Mon Jun 01 0… NO_QU… ClareLa… " I couldn't sleep last ni…
 8 부정   188037… Fri May 22 0… NO_QU… tiffymo… " yes i think us lakers fa…
 9 부정   229625… Tue Jun 23 0… NO_QU… oliviaa… Arghh have to sit down the…
10 부정   197141… Sat May 30 0… NO_QU… jfunk414 realized how sad it's goin…
# … with 15,990 more rows

3.3 `tidytext` 변환

unnest_tokens() 함수를 사용하게 되면 tidytext 팩키지에서 텍스트 문장을 깔끔한 자연어 형태로 변환을 시킨다. 그 전에 가장 많이 사용되는 불용어를 별도 사전으로 정의하고 이를 활용하여 트위터 트윗 텍스트에서 불용어를 제거시킨다.

custom_stopwords <- tribble(
  ~"word", ~"lexicon",
  "2", "twitter",
  "im", "twitter"
)

custom_stop_words <- stop_words %>% 
  bind_rows(custom_stopwords)

custom_stop_words %>% 
  tail

# A tibble: 6 x 2
  word     lexicon
  <chr>    <chr>  
1 younger  onix   
2 youngest onix   
3 your     onix   
4 yours    onix   
5 2        twitter
6 im       twitter

tw_tidy_df <- tw_regex_df %>% 
  unnest_tokens(word, text) %>% 
  anti_join(custom_stop_words)

tw_tidy_df %>% 
  count(word, sort=TRUE)

# A tibble: 19,792 x 2
   word         n
   <chr>    <int>
 1 day        858
 2 quot       732
 3 love       645
 4 lol        610
 5 time       535
 6 amp        487
 7 night      467
 8 miss       380
 9 home       366
10 tomorrow   354
# … with 19,782 more rows

4 시각화

깔끔한 텍스트 데이터가 준비되면 그 다음 단계로 ggplot을 활용하여 단어주머니(Bag of Words)를 만들어서 막대그래프와 단어구름(wordcloud)을 제작하여 시각화한다.

tw_tidy_df %>% 
  count(target, word, sort=TRUE) %>% 
  group_by(target) %>% 
  top_n(15, n) %>% 
  ungroup() %>% 
  mutate(word = fct_reorder(word, n)) %>% 
  ggplot(aes(x=word, y=n, fill=target)) +
    geom_col(show.legend = FALSE) +
    coord_flip() +
    facet_wrap( ~target, scales="free") +
    labs(x="", y="", title="트위터 긍부정", subtitle="Sentiment140 데이터셋")

전통적인 wordcloud 대신에 stackoverflow, “Subplot/facets with wordclouds”을 참조하여 단어구름을 ggplot으로 구현한다.

library(ggrepel)

tw_tidy_df %>% 
  count(target, word, sort=TRUE) %>% 
  group_by(target) %>% 
  top_n(50, n) %>% 
  ungroup() %>% 
  mutate(word = fct_reorder(word, n)) %>% 
  ggplot(., aes(x = 1, y = 1, size = n, label = word)) +
    geom_text_repel(segment.size = 0, segment.alpha = 0) +
    scale_size(range = c(2, 15), guide = FALSE) +
    theme_void() +
    theme(panel.border = element_rect(colour = "black", fill=NA, size=1)) +
    facet_wrap(~target) +
      labs(x="", y="", title="트위터 긍부정 단어구름", subtitle="Sentiment140 데이터셋")

5 감성분석

감성분석은 get_sentiments() 함수를 사용해서 감성사전을 가져오는 것부터 시작된다. 과거 3종 감성사전이 있었는데… 영어의 경우… 기존 “afinn”, “bing”, “nrc”, 사전에 “loughran” 신규 사전이 추가되었다.

트위터 트윗에서 라벨링한 것과 감성사전에서 정의한 감성 긍부정을 교차표를 통해 살펴보자. 다소 차이가 나지만, 부정은 negative로 긍정은 positive로 어느정도 두배이상 높은 정확도를 보이고 있다.

get_sentiments(lexicon = "bing") %>% 
  count(sentiment)

# A tibble: 2 x 2
  sentiment     n
  <chr>     <int>
1 negative   4782
2 positive   2006

get_sentiments(lexicon = "bing") %>% 
  inner_join(tw_tidy_df) %>% 
  count(target, sentiment) %>% 
  spread(target, n)

# A tibble: 2 x 3
  sentiment  부정  긍정
  <chr>     <int> <int>
1 negative   5222  1800
2 positive   1864  4024

5.1 시각화

“bing” 감성사전을 트위터 데이터프레임과 결합시켜 각 범주별로 15개 단어를 추출하여 막대그래프로 시각화한다.

get_sentiments(lexicon = "bing") %>% 
  inner_join(tw_tidy_df) %>% 
  count(target, word, sentiment) %>% 
  group_by(target, sentiment) %>%
  top_n(15) %>% 
  ungroup() %>% 
  ggplot(aes(x = fct_reorder(word, n), y=n, fill=target)) +
    geom_col(position="dodge") +
    facet_wrap(~sentiment, scales="free") +
    coord_flip()

6 토픽모형

긍부정 트위터 트윗을 대상으로 토픽을 살펴보고 어떤 트윗에서 긍부정이 높은지 살펴보는 것도 나름 의미가 크다. 이를 위해서 topicmodels 팩키지 LDA() 함수를 사용한다. 그런데 LDA 함수는 입력 자료구조로 DTM(Document-Term_matrix)을 기본으로 한다. 따라서 깔끔한 tidytext를 cast_dtm() 함수로 변경을 해야만 한다.

DTM 내부 구조를 살펴보기 위해서는 as.matrix() 함수로 행렬자료형으로 변환을 한 후에 텍스트 일부를 추출해서 살펴본다.

tw_dtm <- tw_tidy_df %>% 
  count(word, ids) %>% 
  cast_dtm(document = ids, term = word, value = n)

tw_dtm %>% 
  as.matrix() %>% 
  .[7:10, 1000:1005]

            Terms
Docs         akmal ako akon aktually aku akwardness
  2245246056     0   0    0        0   0          0
  1687125577     0   0    0        0   0          0
  1752715674     0   0    0        0   0          0
  2299381234     0   0    0        0   0          0

토픽모형을 LDA() 함수에 토픽갯수(k = 3)를 지정한 후에 실행한 후에 tidy() 함수로 후속 작업을 원할히 할 수 있는 자료구조로 변환을 시킨 후에 시각화한다.

library(topicmodels)

tw_lda <- LDA(tw_dtm, 
              k = 3,
              method = "Gibbs",
              control = list(seed = 777))

tw_lda %>% 
  tidy(matrix =  "beta") %>% 
  mutate(topic = as.factor(topic)) %>% 
  group_by(topic) %>% 
  top_n(15) %>% 
  ungroup() %>% 
  ggplot(aes(x=fct_reorder(term, beta), y=beta, fill=topic)) +
    geom_col(show.legend = FALSE) +
    facet_wrap( ~ topic, scales = "free") +
    coord_flip() +
    labs(x="", y="")

7 긍부정 예측모형 ¹

# 훈련/시험 데이터셋 구분
library(rsample)

tw_tidy_df <- tw_tidy_df %>% 
  filter(str_detect(word, "\\b[a-zA-Z0-0].+\\b")) %>% 
  arrange(word)

set.seed(77777)
tw_split <- tw_df %>%
  select(ids) %>%
  initial_split(0.6)

train_data <- training(tw_split)
test_data <- testing(tw_split)

# 텍스트 feature 전처리 작업
sparse_words <- tw_tidy_df %>%
  count(ids, word) %>%
  inner_join(train_data) %>%
  cast_sparse(ids, word, n)

class(sparse_words)

[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"

dim(sparse_words)

[1]  9395 13658

word_rownames <- rownames(sparse_words)

tw_joined <- data_frame(ids = word_rownames) %>%
  left_join(tw_df) %>%
    select(ids, target)

# 예측모형 적합
library(glmnet)
library(doMC)
registerDoMC(cores = 8)

is_positive <- tw_joined$target == "긍정"

tw_glm <- cv.glmnet(sparse_words, is_positive,
  family = "binomial",
  parallel = TRUE, keep = TRUE)

plot(tw_glm)

7.1 모형해석

모형 해석을 위해서 broom 팩키지를 사용한다.

library(broom)

tw_coefs <- tw_glm$glmnet.fit %>%
  tidy() %>%
  filter(lambda == tw_glm$lambda.1se)

tw_coefs %>%
  group_by(estimate > 0) %>%
  top_n(10, abs(estimate)) %>%
  ungroup() %>%
  ggplot(aes(fct_reorder(term, estimate), estimate, fill = estimate > 0)) +
  geom_col(alpha = 0.8, show.legend = FALSE) +
  coord_flip() +
  labs(
    x = NULL,
    title = "긍정 확률을 가장 높이는 회귀계수",
    subtitle = "단어 whatava가 포함되면 긍정, sad가 포함되면 부정일 가능성이 높다."
  )

7.2 트윗 긍정확률 - 시험데이터

트윗 긍정확률을 앞서 개발된 모형을 바탕으로 추정해보자. 그리고 트윗(text)을 붙여 트윗 감성예측이 제대로 되었는지도 확인해본다.

tw_intercept <- tw_coefs %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

classifications <- tw_tidy_df %>%
  inner_join(test_data) %>%
  inner_join(tw_coefs, by = c("word" = "term")) %>%
  group_by(ids) %>%
  summarize(score = sum(estimate)) %>%
  mutate(probability = plogis(tw_intercept + score)) %>% 
  left_join(tw_df)

classifications %>% 
  select(ids, score, probability, target, text) %>% 
  sample_n(10)

# A tibble: 10 x 5
   ids        score probability target text                                
   <chr>      <dbl>       <dbl> <fct>  <chr>                               
 1 2255489… -0.598        0.384 부정   I wanted to go to the rally today, …
 2 2235202… -0.491        0.409 부정   @SirKicks  i dont remember...       
 3 2182540… -0.127        0.499 부정   at work wondering when i will lose …
 4 1975520… -1.73         0.167 긍정   rawr. feeling sick. ]: i need a pop…
 5 2175104… -0.346        0.444 부정   @JessObsess no..no..it's not ur fau…
 6 1977927… -1.09         0.276 긍정   is officially going to study now    
 7 1692990…  1.70         0.861 긍정   @Octo__Mom Glad to see your post.  …
 8 1963495… -1.76         0.163 부정   man iCant send love on Bebo. Cuz im…
 9 1968386…  0.0849       0.552 부정   @Valv30 Whah?! I didn't comprehend …
10 1468990… -1.20         0.254 부정   finished ma pasta  gotta get on and…

7.3 트윗 긍부정 예측모형 성능

yardstick 팩키지를 사용해서 트윗 긍부정 예측모형 성능을 roc_auc()를 통해 확인해본다.

library(yardstick)

classifications %>% 
  select(ids, score, probability, text, target) %>%
  roc_auc(target, probability)

# A tibble: 1 x 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.783

JULIA SILGE (Dec 24, 2018), “TEXT CLASSIFICATION WITH TIDY DATA PRINCIPLES”↩

트위터 - tidyverse + tidytext