tm ↔ tidytext
tm data object: VCorpus
The tm package ships with two datasets, acq and crude.
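The code that produced the listing did not survive in this page, so the following is only a minimal sketch of one way to get it, assuming tm is installed. It uses base R's data() with the package argument; the variable name tm_datasets is introduced here for illustration.

```r
# List the datasets bundled with the tm package.
# data(package = ...) returns an object whose $results matrix
# has one row per dataset; the "Item" column holds the dataset names.
tm_datasets <- data(package = "tm")$results[, "Item"]
tm_datasets
```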
[1] "acq" "crude"
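Of these, let's look at crude, which contains oil-related news articles. The crude object is a VCorpus, and for each document the metadata and the body text can be extracted through $meta and $content. The crude VCorpus consists of 20 documents.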
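As a hedged sketch of how the corpus summary, the metadata, and the text of the first document might be displayed (the original code is not shown on this page, so inspecting the first document with crude[[1]] is an assumption):

```r
library(tm)

# Load the crude corpus that ships with tm.
data("crude")

# Printing the corpus shows a short summary, including the document count.
crude

# Each document is a list-like object with a meta and a content element.
first_doc <- crude[[1]]
first_doc$meta      # document metadata (author, heading, id, ...)
first_doc$content   # the article text as a character vector
```

The accessor functions meta(first_doc) and content(first_doc) should return the same information, and may be preferable to reaching into the object with $.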
<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 20
author : character(0)
datetimestamp: 1987-02-26 17:00:56
description :
heading : DIAMOND SHAMROCK (DIA) CUTS CRUDE PRICES
id : 127
language : en
origin : Reuters-21578 XML
topics : YES
lewissplit : TRAIN
cgisplit : TRAINING-SET
oldid : 5670
places : usa
people : character(0)
orgs : character(0)
exchanges : character(0)
[1] "Diamond Shamrock Corp said that\neffective today it had cut its contract prices for crude oil by\n1.50 dlrs a barrel.\n The reduction brings its posted price for West Texas\nIntermediate to 16.00 dlrs a barrel, the copany said.\n \"The price reduction today was made in the light of falling\noil product prices and a weak crude oil market,\" a company\nspokeswoman said.\n Diamond is the latest in a line of U.S. oil companies that\nhave cut its contract, or posted, prices over the last two days\nciting weak oil markets.\n Reuter"
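VCorpus → tidytext object
To convert a tm VCorpus object into a tidytext object (a tibble with one row per document), use the tidytext::tidy() function. The fields in $meta are mapped to columns, and $content is stored in the text column.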
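A minimal sketch of the conversion, assuming both tm and tidytext are installed. The name crude_tbl is taken from the code later on this page; the assignment itself is inferred, since the original chunk is not shown.

```r
library(tm)
library(tidytext)

# Load the corpus and convert it to a tibble with one row per document.
data("crude")
crude_tbl <- tidy(crude)

# Metadata fields become columns; the document body is in the text column.
crude_tbl
```

The exact set of columns may vary with the package versions in use.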
# A tibble: 20 x 16
author datetimestamp description heading id language origin
<chr> <dttm> <chr> <chr> <chr> <chr> <chr>
1 <NA> 1987-02-27 02:00:56 "" DIAMON… 127 en Reute…
2 BY TE… 1987-02-27 02:34:11 "" OPEC M… 144 en Reute…
3 <NA> 1987-02-27 03:18:00 "" TEXACO… 191 en Reute…
4 <NA> 1987-02-27 03:21:01 "" MARATH… 194 en Reute…
5 <NA> 1987-02-27 04:00:57 "" HOUSTO… 211 en Reute…
6 <NA> 1987-03-01 12:25:46 "" KUWAIT… 236 en Reute…
7 By Je… 1987-03-01 12:39:14 "" INDONE… 237 en Reute…
8 <NA> 1987-03-01 14:27:27 "" SAUDI … 242 en Reute…
9 <NA> 1987-03-01 17:22:30 "" QATAR … 246 en Reute…
10 <NA> 1987-03-02 03:31:44 "" SAUDI … 248 en Reute…
11 <NA> 1987-03-02 10:05:49 "" SAUDI … 273 en Reute…
12 <NA> 1987-03-02 16:39:23 "" GULF A… 349 en Reute…
13 <NA> 1987-03-02 16:43:22 "" SAUDI … 352 en Reute…
14 <NA> 1987-03-02 16:43:41 "" KUWAIT… 353 en Reute…
15 <NA> 1987-03-02 17:25:42 "" PHILAD… 368 en Reute…
16 <NA> 1987-03-02 20:20:05 "" STUDY … 489 en Reute…
17 <NA> 1987-03-02 20:28:26 "" STUDY … 502 en Reute…
18 <NA> 1987-03-02 21:13:46 "" UNOCAL… 543 en Reute…
19 By BE… 1987-03-02 23:38:34 "" NYMEX … 704 en Reute…
20 <NA> 1987-03-02 23:49:06 "" ARGENT… 708 en Reute…
# … with 9 more variables: topics <chr>, lewissplit <chr>, cgisplit <chr>,
# oldid <chr>, places <named list>, people <chr>, orgs <chr>,
# exchanges <chr>, text <chr>
tidytext object → VCorpus
Now consider the reverse case: converting a tidytext object back into a VCorpus object.
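# Rebuild a VCorpus from the text column of crude_tbl (the tibble shown above)
crude_tm <- VCorpus(VectorSource(crude_tbl$text))
# Attach selected columns as document-level (indexed) metadata;
# the tags are 기자명 (reporter), 작성일자 (date written), 언어 (language)
meta(crude_tm, "기자명") <- crude_tbl$author
meta(crude_tm, "작성일자") <- crude_tbl$datetimestamp
meta(crude_tm, "언어") <- crude_tbl$language
# Print the indexed metadata as a data frame
meta(crude_tm)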
기자명 작성일자 언어
1 <NA> 1987-02-27 02:00:56 en
2 BY TED D'AFFLISIO, Reuters 1987-02-27 02:34:11 en
3 <NA> 1987-02-27 03:18:00 en
4 <NA> 1987-02-27 03:21:01 en
5 <NA> 1987-02-27 04:00:57 en
6 <NA> 1987-03-01 12:25:46 en
7 By Jeremy Clift, Reuters 1987-03-01 12:39:14 en
8 <NA> 1987-03-01 14:27:27 en
9 <NA> 1987-03-01 17:22:30 en
10 <NA> 1987-03-02 03:31:44 en
11 <NA> 1987-03-02 10:05:49 en
12 <NA> 1987-03-02 16:39:23 en
13 <NA> 1987-03-02 16:43:22 en
14 <NA> 1987-03-02 16:43:41 en
15 <NA> 1987-03-02 17:25:42 en
16 <NA> 1987-03-02 20:20:05 en
17 <NA> 1987-03-02 20:28:26 en
18 <NA> 1987-03-02 21:13:46 en
19 By BERNICE NAPACH, Reuters 1987-03-02 23:38:34 en
20 <NA> 1987-03-02 23:49:06 en