1 문재인 대통령 신년기자회견 ¹ ²

2019 문재인 대통령 신년기자회견 연설문을 웹사이트에서 다운로드받아 연설문 텍스트에서 핵심단어를 추출하고 가장 많이 사용된 단어를 확인하고 이를 ggpage 팩키지를 활용하여 시각화하자.

1.1 연설문 데이터

2019 문재인 대통령 신년기자회견 연설문 웹사이트에서 rvest로 연설문만 뽑아내고 기본적인 전처리 작업만한다. ggplot으로 시각화하는데 데이터프레임 변환시키고 빈도수가 많은 단어를 공백기준으로 나눠 빈도수를 산출한다.

library(tidyverse)
library(ggpage)
library(rvest)
library(tidytext)

speech_url <- "https://www1.president.go.kr/articles/5288"

speech_text <- read_html(speech_url) %>%
  html_nodes(xpath = '//*[@id="cont_view"]/div/div/div[1]/div[3]') %>%
  html_text() %>%
  str_remove_all("\\n") %>% 
  str_remove_all("<U\\+00A0>") %>% 
  str_split(pattern="\\.") %>% 
  unlist()

speech_df <- tibble(
  line = 1: length(speech_text),
  text = speech_text)

speech_df %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort = TRUE)

# A tibble: 1,425 x 2
   word         n
   <chr>    <int>
 1 것입니다    28
 2 수          17
 3 있습니다    14
 4 함께        14
 5 3           13
 6 등          11
 7 만          11
 8 우리        11
 9 우리는      11
10 국민        10
# ... with 1,415 more rows

1.2 연설문 시각화

ggpage팩키지를 사용하여 앞서 빈도수가 많이 나온 단어를 페이지에 표식해보자. 국민일보 , 文대통령 신년 기자회견서 가장 많이 언급된 단어는?에서 상위 5개를 뽑아 ggpage_plot() 함수에 넣어 시각화한다.

library(extrafont)
loadfonts()

speech_df %>%
  ggpage_build(wtl = TRUE, lpp = 30, x_space_pages =10, y_space_pages = 0, ncol = 7) %>%
  mutate(highlight = case_when(word %in% c("국민") ~ "국민",
                               word %in% c("경제") ~ "경제",
                               word %in% c("성장") ~ "성장",
                               word %in% c("혁신") ~ "혁신",
                               TRUE ~ "기타")) %>%
  mutate(highlight = factor(highlight, levels=c("국민", "경제", "성장", "혁신", "기타"))) %>% 
  ggpage_plot(mapping = aes(fill = highlight)) +
  scale_fill_manual(values = c("blue", "red", "green", "cyan", "darkgray")) +
  labs(title = "2019 문재인 대통령 신년기자회견 연설전문",
       caption = "출처: 청와대 홈페이지 https://www1.president.go.kr/articles/5288",
       fill = NULL) +
  theme_void(base_family = "NanumGothic")

2 한걸음 더 들어갑니다.

앞서 띄어쓰기 기준이 아닌 한국어 고유의 형태소 분석 후 의미있는 단어만 분석할 수 있도록 NLP4kec 팩키지를 활용하여 자연어 처리 분석에 한걸음 더 들어갑니다. NLP4kec_1.2.0.zip 파일크기가 1.2GB라서 놀라지 마세요!!!

# install.packages("C:/Users/tidyverse/Downloads/NLP4kec_1.2.0.zip" , repos=NULL)
library(NLP4kec)

speech_bow <- r_parser_r(speech_df$text, language = "ko")

speech_bow_df <- speech_bow %>% 
  tibble(line = 1:length(speech_bow), text=.) %>% 
  mutate(text = as.character(text))

speech_bow_df %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort=TRUE)

# A tibble: 833 x 2
   word      n
   <chr> <int>
 1 것       43
 2 있다     41
 3 경제     35
 4 우리     32
 5 성장     29
 6 국민     24
 7 하다     23
 8 않다     21
 9 혁신     21
10 되다     19
# ... with 823 more rows

사용자 정의 사전(dictionary.txt)

“포용성장”과 같은 단어를 그대로 두게 되면 “포용”과 “성장” 나눠지게 된다. 이런 문제를 사용자 정의 사전(dictionary.txt)에 기록하여 의미가 남도록 추가 작업을 한다.

speech_bow_df %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort=TRUE) %>% 
  filter(word %in% c("포용", "국가", "포용국가"))

# A tibble: 2 x 2
  word      n
  <chr> <int>
1 국가     18
2 포용      9

3 tidytext 불용어 처리

먼저 “포용국가” 등을 반영하기 위해서 korDicPath를 설정하고 형태소 분석에 들어간다. 그리고 나서 데이터프레임으로 변환작업을 하고 자체 불용어 사전 tidyverse를 제작하여 기존 stop_words 사전에 결합시킨다. 연설문에 많이 나오는 불용어(stopwords)를 처리하기 위해서 tidytext 불용어 사전과 anti_join()을 걸어 최종 형태소분석이 깔끔히 완료된 텍스트 데이터를 준비한다.

speech_bow <- r_parser_r(speech_df$text, language = "ko",  korDicPath = "data/dictionary.txt")

speech_bow_df <- speech_bow %>% 
  tibble(line = 1:length(speech_bow), text=.) %>% 
  mutate(text = as.character(text))

tidy_stopwords <- tibble(
  word = c(
    "것",
    "있다",
    "하다",
    "않다",
    "되다",
    "년",
    "수",
    "위하다",
    "등"
  ),
  lexicon = "tidyverse"
)

all_stop_words <- stop_words %>%
  bind_rows(tidy_stopwords)

speech_bow_df %>% 
  unnest_tokens(word, text) %>% 
  anti_join(all_stop_words) %>% 
  count(word, sort=TRUE) %>% 
  DT::datatable()

3.1 tidytext 연설문 시각화

앞서 형태소 분석 결과를 바탕으로 연설전문에 포용, 경제, 북한, 성장, 혁신 등을 주제로 하여 연설문에 표식을 한다.

speech_df %>%
  ggpage_build(wtl = TRUE, lpp = 30, x_space_pages =10, y_space_pages = 0, ncol = 7) %>%
  mutate(highlight = case_when(word %in% c("포용적 성장", "포용국가") ~ "포용",
                               word %in% c("경제") ~ "경제",
                               word %in% c("남북", "통일", "북한", "김정은", "개성", "한반도") ~ "북한",
                               word %in% c("성장") ~ "성장",
                               word %in% c("혁신") ~ "혁신",
                               TRUE ~ "기타")) %>%
  mutate(highlight = factor(highlight, levels=c("포용", "경제", "북한", "성장", "혁신", "기타"))) %>% 
  ggpage_plot(mapping = aes(fill = highlight)) +
  scale_fill_manual(values = c("blue", "red", "green", "cyan", "violet", "darkgray")) +
  labs(title = "2019 문재인 대통령 신년기자회견 연설전문",
       caption = "출처: 청와대 홈페이지 https://www1.president.go.kr/articles/5288",
       fill = NULL) +
  theme_void(base_family = "NanumGothic")

사용자 정의 사전(dictionary.txt)

“포용국가”과 같은 단어를 그대로 두게 되면 “포용”과 “국가” 나눠지게 된다. 이런 문제를 사용자 정의 사전(dictionary.txt)에 기록하여 의미가 남도록 추가 작업을 한다.

speech_bow_df %>% 
  unnest_tokens(word, text) %>% 
  count(word, sort=TRUE) %>% 
  filter(word %in% c("포용", "국가", "포용국가"))

# A tibble: 3 x 2
  word         n
  <chr>    <int>
1 국가        11
2 포용국가     7
3 포용         2

유트브 댓글 - 자연어 처리

깔끔한 텍스트 (Tidytext) - 신년기자회견(2019)

xwMOOC

2019-01-12

1 문재인 대통령 신년기자회견 ¹ ²

1.1 연설문 데이터

1.2 연설문 시각화

2 한걸음 더 들어갑니다.

3 tidytext 불용어 처리

3.1 tidytext 연설문 시각화

유트브 댓글 - 자연어 처리

깔끔한 텍스트 (Tidytext) - 신년기자회견(2019)

xwMOOC

2019-01-12

1 문재인 대통령 신년기자회견 1 2

1.1 연설문 데이터

1.2 연설문 시각화

2 한걸음 더 들어갑니다.

3 tidytext 불용어 처리

3.1 tidytext 연설문 시각화

1 문재인 대통령 신년기자회견 ¹ ²