1 Hadley's data processing framework and `dplyr` ¹ ²

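Over several decades, countless statisticians and engineers have generously donated their time so that raw data can be handled freely. As a result, difficult tasks that only experts could perform in the past can now be carried out accurately and with ease. Data are fundamentally built on vectors. A single vector can only express a limited amount of information, however, and a data frame (dataframe) can be seen as a structure that organizes several vectors into one data form. The data frame is therefore the basic data structure for data analysis.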
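As a minimal sketch of this idea, the base R code below binds three vectors of equal length into one data frame. The object names and values are illustrative assumptions, not part of the original material.

```r
# Three vectors of equal length (illustrative values)
country <- c("FR", "DE", "US")
year    <- c(2011L, 2011L, 2011L)
n       <- c(7000, 5800, 15000)

# A data frame binds the vectors together as columns
cases_2011 <- data.frame(country, year, n, stringsAsFactors = FALSE)

str(cases_2011)          # one row per observation, one column per vector
is.vector(cases_2011$n)  # each column is still an ordinary vector
```

Each column can be pulled back out with `$`, so everything that works on a vector also works on a column of a data frame.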

In particular, once you have grasped the concept of data manipulation through SQL and the concept of task automation through the shell, using these packages lets you reach your goal quickly and accurately.

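Work is unified on the tbl, which behaves the same as a data frame; data tidied with tidyr in a preceding step are then processed with dplyr. A single data frame is handled with the five dplyr data verbs, and two data frames are handled in the same way with the *_join verbs.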
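The two-table verbs can be sketched as follows. The `pollution_long` table repeats the pollution values used later in this document, while the lookup table `regions` and its contents are hypothetical and exist only for this illustration.

```r
library(dplyr)

# Pollution values as used later in this document
pollution_long <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# Hypothetical lookup table (an assumption made for this sketch)
regions <- data.frame(
  city      = c("New York", "London", "Paris"),
  continent = c("Americas", "Europe", "Europe"),
  stringsAsFactors = FALSE
)

# left_join keeps every row of the first table; unmatched cities get NA
joined_left <- left_join(pollution_long, regions, by = "city")

# inner_join keeps only the rows whose key appears in both tables
joined_inner <- inner_join(pollution_long, regions, by = "city")

joined_left
joined_inner
```

Here `left_join()` keeps all six pollution rows, whereas `inner_join()` drops the Beijing rows because Beijing has no entry in the lookup table.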

Hadley Wickham proposed a data analysis workflow in which data are tidied with tidyr and transformed with dplyr, visualized according to the grammar of graphics (ggplot for static graphs, ggvis for dynamic graphs), and analyzed with R's many modeling tools. The broom package then makes the results returned by R models reusable.

Goals of dplyr

Make data suitable for working with in software.
Make data easy to look into.

With the dplyr package as the foundation, extensions for parallel processing, text, time series, and other purposes help people process more data more quickly. ³ ⁴ ⁵

2 Example datasets for `dplyr` and `tidyr`

The datasets included in the EDAWR package are used to show how data can be handled freely with the tidyr and dplyr packages.

# install.packages("devtools")
# devtools::install_github("rstudio/EDAWR")
library(tidyverse)
library(EDAWR)
# ?cases
# ?storms
# ?pollution
# ?tb

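3 Human- or machine-oriented data - `gather`, `spread`

3.1 Wide format → Long format ⁶ ⁷

First, the cases data frame, which is laid out in a form that is easy for people to read, is converted to the Long format, which is better suited to machine processing.

Before (Wide format, human-oriented)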

country	2011	2012	2013
FR	7000	6900	7000
DE	5800	6000	6200
US	15000	14000	13000

After (Long format, machine-oriented)

country	year	n
FR	2011	7000
DE	2011	5800
US	2011	15000
FR	2012	6900
DE	2012	6000
US	2012	14000
FR	2013	7000
DE	2013	6200
US	2013	13000

DT::datatable(cases)

gather(cases, "year", "n", 2:4)

  country year     n
1      FR 2011  7000
2      DE 2011  5800
3      US 2011 15000
4      FR 2012  6900
5      DE 2012  6000
6      US 2012 14000
7      FR 2013  7000
8      DE 2013  6200
9      US 2013 13000
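In tidyr 1.0.0 and later, `pivot_longer()` is the recommended successor to `gather()`. The sketch below rebuilds the small `cases` table by hand (so it runs without EDAWR) and performs the same reshaping; the object names `cases_wide` and `cases_long` are assumptions made for this example.

```r
library(tidyr)

# Hand-built copy of the wide table shown above
cases_wide <- data.frame(
  country = c("FR", "DE", "US"),
  `2011`  = c(7000, 5800, 15000),
  `2012`  = c(6900, 6000, 14000),
  `2013`  = c(7000, 6200, 13000),
  check.names = FALSE,       # keep "2011" etc. as column names
  stringsAsFactors = FALSE
)

# Every column except country is stacked into year / n pairs
cases_long <- pivot_longer(cases_wide, cols = -country,
                           names_to = "year", values_to = "n")
cases_long
```

The result holds the same nine observations as the `gather()` output, but ordered row by row (all FR years first), and `year` is a character column because it comes from column names.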

3.2 Long format → Wide format

The pollution data frame, which is in the machine-friendly Long format, is reshaped into a neat Wide-format table that is comfortable for people to read.

Before (Long format, machine-oriented)

city	size	amount
New York	large	23
New York	small	14
London	large	22
London	small	16
Beijing	large	121
Beijing	small	56

After (Wide format, human-oriented)

city	large	small
Beijing	121	56
London	22	16
New York	23	14

DT::datatable(pollution)
spread(pollution, size, amount)
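The same reshaping can be tried without EDAWR by building the pollution table by hand. The sketch below also shows `pivot_wider()`, the successor to `spread()` in tidyr 1.0.0 and later; the object names are assumptions made for this example.

```r
library(tidyr)

# Hand-built copy of the long pollution table shown above
pollution_long <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# spread(): values of `size` become column names, `amount` fills the cells
wide_spread <- spread(pollution_long, size, amount)

# pivot_wider(): the same operation with explicit argument names
wide_pivot <- pivot_wider(pollution_long,
                          names_from = size, values_from = amount)

wide_spread
wide_pivot
```

Both results contain one row per city with `large` and `small` columns; they may differ in row order and in class (data frame versus tibble).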

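4 The `separate` and `unite` verbs

In the storms hurricane data, three variables (year, month, day) are hidden inside the date variable. Use separate() to split a variable into several, and unite() to combine several variables into one.

Before (combined variable)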

storm	wind	pressure	date
Alberto	110	1007	2000-08-03
Alex	45	1009	1998-07-27
Allison	65	1005	1995-06-03
Ana	40	1013	1997-06-30
Arlene	50	1010	1999-06-11
Arthur	45	1010	1996-06-17

After (split variables)

storm	wind	pressure	year	month	day
Alberto	110	1007	2000	08	03
Alex	45	1009	1998	07	27
Allison	65	1005	1995	06	03
Ana	40	1013	1997	06	30
Arlene	50	1010	1999	06	11
Arthur	45	1010	1996	06	17

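# The output below comes from dplyr::storms (columns name, year, month, day, ...),
# not from the six-row EDAWR::storms table shown above, so name the package explicitly.
storms <- dplyr::storms %>% 
  mutate(date = lubridate::make_date(year, month, day)) %>% 
  select(name, date, wind, pressure, year, month, day)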

separate(storms, date, c("year", "month", "day"), sep = "-")

# A tibble: 10,010 x 6
   name  year  month day    wind pressure
   <chr> <chr> <chr> <chr> <int>    <int>
 1 Amy   1975  06    27       25     1013
 2 Amy   1975  06    27       25     1013
 3 Amy   1975  06    27       25     1013
 4 Amy   1975  06    27       25     1013
 5 Amy   1975  06    28       25     1012
 6 Amy   1975  06    28       25     1012
 7 Amy   1975  06    28       25     1011
 8 Amy   1975  06    28       30     1006
 9 Amy   1975  06    29       35     1004
10 Amy   1975  06    29       40     1002
# ... with 10,000 more rows

unite(storms, "kdate", year, month, day, sep = "-")

# A tibble: 10,010 x 5
   name  date        wind pressure kdate    
   <chr> <date>     <int>    <int> <chr>    
 1 Amy   1975-06-27    25     1013 1975-6-27
 2 Amy   1975-06-27    25     1013 1975-6-27
 3 Amy   1975-06-27    25     1013 1975-6-27
 4 Amy   1975-06-27    25     1013 1975-6-27
 5 Amy   1975-06-28    25     1012 1975-6-28
 6 Amy   1975-06-28    25     1012 1975-6-28
 7 Amy   1975-06-28    25     1011 1975-6-28
 8 Amy   1975-06-28    30     1006 1975-6-28
 9 Amy   1975-06-29    35     1004 1975-6-29
10 Amy   1975-06-29    40     1002 1975-6-29
# ... with 10,000 more rows
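The round trip between the two verbs can be checked on a small hand-built table. This is only a sketch; `storms_small` is an assumed name, and its rows are copied from the before/after tables above.

```r
library(tidyr)

storms_small <- data.frame(
  storm = c("Alberto", "Alex", "Allison"),
  date  = c("2000-08-03", "1998-07-27", "1995-06-03"),
  stringsAsFactors = FALSE
)

# Split the date string into three character columns
parts <- separate(storms_small, date, c("year", "month", "day"), sep = "-")

# Glue the pieces back together with the same separator
restored <- unite(parts, "date", year, month, day, sep = "-")

parts
restored
```

Because `separate()` returns character pieces here, leading zeros such as "08" survive the round trip. In the `unite()` output above the zeros are missing because `month` and `day` were numeric columns.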

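5 `dplyr` verbs

The dplyr package is the next-generation plyr package for processing data frames (data.frame); it keeps the functionality while considerably improving on plyr's speed problems. The five functions below, used together with group_by(), are the core functions and closely resemble basic SQL operations. These intuitive, fast, and efficient dplyr functions can therefore raise productivity compared with older ways of processing data.

filter (filter observations) : extracts the rows that meet given criteria.
select (select variables) : extracts specific columns by variable name.
arrange (reorder) : reorders the rows.
mutate (add variables) : adds new variables.
summarise (reduce variables to values) : summarizes a variable to a single value (scalar).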

5.1 Variables: the `select` verb

Use select() to pick variables from the storms hurricane data. The - and : operators can also be used.

Before

storm	wind	pressure	date
Alberto	110	1007	2000-08-03
Alex	45	1009	1998-07-27
Allison	65	1005	1995-06-03
Ana	40	1013	1997-06-30
Arlene	50	1010	1999-06-11
Arthur	45	1010	1996-06-17

After (variables selected)

storm	pressure
Alberto	1007
Alex	1009
Allison	1005
Ana	1013
Arlene	1010
Arthur	1010

storms %>% select(wind, pressure)

# A tibble: 10,010 x 2
    wind pressure
   <int>    <int>
 1    25     1013
 2    25     1013
 3    25     1013
 4    25     1013
 5    25     1012
 6    25     1012
 7    25     1011
 8    30     1006
 9    35     1004
10    40     1002
# ... with 10,000 more rows

select(storms, -name)

# A tibble: 10,010 x 6
   date        wind pressure  year month   day
   <date>     <int>    <int> <dbl> <dbl> <int>
 1 1975-06-27    25     1013  1975     6    27
 2 1975-06-27    25     1013  1975     6    27
 3 1975-06-27    25     1013  1975     6    27
 4 1975-06-27    25     1013  1975     6    27
 5 1975-06-28    25     1012  1975     6    28
 6 1975-06-28    25     1012  1975     6    28
 7 1975-06-28    25     1011  1975     6    28
 8 1975-06-28    30     1006  1975     6    28
 9 1975-06-29    35     1004  1975     6    29
10 1975-06-29    40     1002  1975     6    29
# ... with 10,000 more rows

select(storms, wind:date)

# A tibble: 10,010 x 2
    wind date      
   <int> <date>    
 1    25 1975-06-27
 2    25 1975-06-27
 3    25 1975-06-27
 4    25 1975-06-27
 5    25 1975-06-28
 6    25 1975-06-28
 7    25 1975-06-28
 8    30 1975-06-28
 9    35 1975-06-29
10    40 1975-06-29
# ... with 10,000 more rows

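Useful select() helper functions

Helper	Description
`-`	Selects all columns except the given variable.
`:`	Selects the columns in the given range.
`contains()`	Selects columns whose names contain the given string.
`starts_with()`	Selects columns whose names start with the given string.
`ends_with()`	Selects columns whose names end with the given string.
`everything()`	Selects all columns.
`matches()`	Selects columns whose names match a regular expression.
`num_range()`	Selects columns named with a prefix and a number, such as x1, x2, x3, x4, x5.
`one_of()`	Selects columns whose names appear in a given group of names.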
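The helpers in the table can be tried on a small data frame. The table `df_wide` and its column names are hypothetical, chosen only so that every helper has something to match.

```r
library(dplyr)

# Hypothetical table with patterned column names
df_wide <- data.frame(
  id = 1:2,
  x1 = c(0.1, 0.2), x2 = c(1, 2), x3 = c(10, 20),
  wind_max = c(50, 60), wind_min = c(10, 20)
)

sel_starts  <- select(df_wide, starts_with("wind"))   # wind_max, wind_min
sel_ends    <- select(df_wide, ends_with("max"))      # wind_max
sel_regex   <- select(df_wide, matches("^x[12]$"))    # x1, x2
sel_numbers <- select(df_wide, num_range("x", 2:3))   # x2, x3
sel_reorder <- select(df_wide, wind_max, everything()) # wind_max moved to the front
sel_drop    <- select(df_wide, -(x1:x3))              # everything except x1..x3

names(sel_drop)
```

Combining a named column with `everything()` is a common way to move a column to the front without listing the rest.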

5.2 Observations: the `filter` verb

Before

storm	wind	pressure	date
Alberto	110	1007	2000-08-03
Alex	45	1009	1998-07-27
Allison	65	1005	1995-06-03
Ana	40	1013	1997-06-30
Arlene	50	1010	1999-06-11
Arthur	45	1010	1996-06-17

After (observations selected)

storm	wind	pressure	date
Alberto	110	1007	2000-08-03
Allison	65	1005	1995-06-03
Arlene	50	1010	1999-06-11

Use filter() to filter the observations of the storms hurricane data.

filter(storms, wind >= 50)

# A tibble: 4,861 x 7
   name  date        wind pressure  year month   day
   <chr> <date>     <int>    <int> <dbl> <dbl> <int>
 1 Amy   1975-06-29    50      998  1975     6    29
 2 Amy   1975-06-30    50      998  1975     6    30
 3 Amy   1975-06-30    55      998  1975     6    30
 4 Amy   1975-06-30    60      987  1975     6    30
 5 Amy   1975-06-30    60      987  1975     6    30
 6 Amy   1975-07-01    60      984  1975     7     1
 7 Amy   1975-07-01    60      984  1975     7     1
 8 Amy   1975-07-01    60      984  1975     7     1
 9 Amy   1975-07-01    60      984  1975     7     1
10 Amy   1975-07-02    60      984  1975     7     2
# ... with 4,851 more rows

filter(storms, wind >= 50, name %in% c("Alberto", "Alex", "Allison"))

# A tibble: 118 x 7
   name    date        wind pressure  year month   day
   <chr>   <date>     <int>    <int> <dbl> <dbl> <int>
 1 Alberto 1982-06-03    50      995  1982     6     3
 2 Alberto 1982-06-03    75      985  1982     6     3
 3 Alberto 1982-06-04    65      992  1982     6     4
 4 Alberto 1982-06-04    55      998  1982     6     4
 5 Alberto 1994-07-03    50      997  1994     7     3
 6 Alberto 1994-07-03    55      993  1994     7     3
 7 Alberto 1994-07-03    55      993  1994     7     3
 8 Allison 1995-06-04    50      997  1995     6     4
 9 Allison 1995-06-04    60      995  1995     6     4
10 Allison 1995-06-04    65      987  1995     6     4
# ... with 108 more rows

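R logical operators for filter()

Comparison operator `?Comparison`	Description	Logical operator `?base::Logic`	Description
`<`	less than	`&`	and
`>`	greater than	`|`	or
`==`	equal to	`xor`	exclusive or
`<=`	less than or equal to	`!`	not
`>=`	greater than or equal to	`any`	any value is true
`!=`	not equal to	`all`	all values are true
`%in%`	is contained in
`is.na`	is `NA`
`!is.na`	is not `NA`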
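A short sketch of how these operators combine inside `filter()`, using a hand-built copy of the six-row storms table shown above (`storms_small` is an assumed name):

```r
library(dplyr)

storms_small <- data.frame(
  storm    = c("Alberto", "Alex", "Allison", "Ana", "Arlene", "Arthur"),
  wind     = c(110, 45, 65, 40, 50, 45),
  pressure = c(1007, 1009, 1005, 1013, 1010, 1010),
  stringsAsFactors = FALSE
)

# & : both conditions must hold (same as separating them with a comma)
both   <- filter(storms_small, wind >= 50 & pressure < 1010)

# | : at least one condition must hold
either <- filter(storms_small, wind >= 50 | pressure > 1010)

# xor : exactly one of the two conditions holds
only_one <- filter(storms_small, xor(wind >= 50, pressure >= 1010))

# ! with %in% : exclude a set of names
not_listed <- filter(storms_small, !storm %in% c("Alex", "Ana"))

both$storm
either$storm
only_one$storm
not_listed$storm
```

Conditions separated by commas in `filter()` are combined with `&`, so `filter(df, a, b)` and `filter(df, a & b)` select the same rows.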

5.3 Creating variables: the `mutate` verb

Use mutate() to create new variables in the storms hurricane data.

Before

storm	wind	pressure	date
Alberto	110	1007	2000-08-03
Alex	45	1009	1998-07-27
Allison	65	1005	1995-06-03
Ana	40	1013	1997-06-30
Arlene	50	1010	1999-06-11
Arthur	45	1010	1996-06-17

After (variable created)

storm	wind	pressure	date	ratio
Alberto	110	1007	2000-08-03	9.154545
Alex	45	1009	1998-07-27	22.422222
Allison	65	1005	1995-06-03	15.461538
Ana	40	1013	1997-06-30	25.325000
Arlene	50	1010	1999-06-11	20.200000
Arthur	45	1010	1996-06-17	22.444444

mutate(storms, ratio = pressure / wind)

# A tibble: 10,010 x 8
   name  date        wind pressure  year month   day ratio
   <chr> <date>     <int>    <int> <dbl> <dbl> <int> <dbl>
 1 Amy   1975-06-27    25     1013  1975     6    27  40.5
 2 Amy   1975-06-27    25     1013  1975     6    27  40.5
 3 Amy   1975-06-27    25     1013  1975     6    27  40.5
 4 Amy   1975-06-27    25     1013  1975     6    27  40.5
 5 Amy   1975-06-28    25     1012  1975     6    28  40.5
 6 Amy   1975-06-28    25     1012  1975     6    28  40.5
 7 Amy   1975-06-28    25     1011  1975     6    28  40.4
 8 Amy   1975-06-28    30     1006  1975     6    28  33.5
 9 Amy   1975-06-29    35     1004  1975     6    29  28.7
10 Amy   1975-06-29    40     1002  1975     6    29  25.0
# ... with 10,000 more rows

mutate(storms, ratio = pressure / wind, inverse = ratio^-1)

# A tibble: 10,010 x 9
   name  date        wind pressure  year month   day ratio inverse
   <chr> <date>     <int>    <int> <dbl> <dbl> <int> <dbl>   <dbl>
 1 Amy   1975-06-27    25     1013  1975     6    27  40.5  0.0247
 2 Amy   1975-06-27    25     1013  1975     6    27  40.5  0.0247
 3 Amy   1975-06-27    25     1013  1975     6    27  40.5  0.0247
 4 Amy   1975-06-27    25     1013  1975     6    27  40.5  0.0247
 5 Amy   1975-06-28    25     1012  1975     6    28  40.5  0.0247
 6 Amy   1975-06-28    25     1012  1975     6    28  40.5  0.0247
 7 Amy   1975-06-28    25     1011  1975     6    28  40.4  0.0247
 8 Amy   1975-06-28    30     1006  1975     6    28  33.5  0.0298
 9 Amy   1975-06-29    35     1004  1975     6    29  28.7  0.0349
10 Amy   1975-06-29    40     1002  1975     6    29  25.0  0.0399
# ... with 10,000 more rows

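Useful functions for mutate()

Function	Description
`pmin()`, `pmax()`	element-wise minimum, maximum
`cummin()`, `cummax()`	cumulative minimum, maximum
`cumsum()`, `cumprod()`	cumulative sum, cumulative product
`between()`	whether a value lies between a and b
`cume_dist()`	cumulative distribution value
`cumall()`, `cumany()`	cumulative all, cumulative any (for logical conditions)
`cummean()`	cumulative mean
`lead()`, `lag()`	copy of the values shifted ahead or behind by position
`ntile()`	splits a vector into n bins
`dense_rank()`, `min_rank()`, `percent_rank()`, `row_number()`	various ranking methods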
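Several of these functions can be compared side by side on a tiny vector. The data frame `df_w` and its values are made up for illustration.

```r
library(dplyr)

df_w <- data.frame(value = c(3, 1, 4, 1, 5))

windowed <- mutate(df_w,
  running_total = cumsum(value),          # 3 4 8 9 14
  running_max   = cummax(value),          # 3 3 4 4 5
  previous      = lag(value),             # NA 3 1 4 1
  following     = lead(value),            # 1 4 1 5 NA
  rank_min      = min_rank(value),        # ties share the lowest rank, gaps follow
  rank_dense    = dense_rank(value),      # ties share a rank, no gaps
  in_range      = between(value, 2, 4)    # TRUE where 2 <= value <= 4
)
windowed
```

All of these return a vector as long as their input, which is what makes them suitable inside `mutate()` rather than `summarise()`.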

5.4 Changing the unit of analysis (summarizing): the `summarise` verb

Use summarise() to change the unit of analysis of the pollution data.

Before

city	size	amount
New York	large	23
New York	small	14
London	large	22
London	small	16
Beijing	large	121
Beijing	small	56

After

median	variance
22.5	1731.6

pollution %>% summarise(median = median(amount), variance = var(amount))

  median variance
1   22.5   1731.6

pollution %>% summarise(mean = mean(amount), sum = sum(amount), n = n())

  mean sum n
1   42 252 6

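Useful functions for summarise()

Function	Description
`min()`, `max()`	minimum, maximum
`mean()`	mean
`median()`	median
`sum()`	sum
`var()`, `sd()`	variance, standard deviation
`first()`	first value
`last()`	last value
`nth()`	n-th value
`n()`	number of values in the vector
`n_distinct()`	number of distinct values in the vector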
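The position and counting helpers in the table can be sketched on a hand-built copy of the pollution table (the object names `pollution_long` and `s` are assumptions):

```r
library(dplyr)

pollution_long <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

s <- summarise(pollution_long,
  smallest      = min(amount),
  largest       = max(amount),
  first_city    = first(city),      # value in the first row
  last_city     = last(city),       # value in the last row
  second_amount = nth(amount, 2),   # value in the second row
  rows          = n(),              # number of rows
  cities        = n_distinct(city)  # number of different cities
)
s
```

Each function collapses a whole column to one value, so the result is a single row.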

5.5 Sorting: the `arrange` verb

Use arrange() to sort the rows of the storms hurricane data by one or more columns.

Before

storm	wind	pressure	date
Alberto	110	1007	2000-08-03
Alex	45	1009	1998-07-27
Allison	65	1005	1995-06-03
Ana	40	1013	1997-06-30
Arlene	50	1010	1999-06-11
Arthur	45	1010	1996-06-17

After (rows sorted by wind)

storm	wind	pressure	date
Ana	40	1013	1997-06-30
Alex	45	1009	1998-07-27
Arthur	45	1010	1996-06-17
Arlene	50	1010	1999-06-11
Allison	65	1005	1995-06-03
Alberto	110	1007	2000-08-03

arrange(storms, wind)

# A tibble: 10,010 x 7
   name      date        wind pressure  year month   day
   <chr>     <date>     <int>    <int> <dbl> <dbl> <int>
 1 Bonnie    1986-06-28    10     1013  1986     6    28
 2 Bonnie    1986-06-28    10     1012  1986     6    28
 3 AL031987  1987-08-16    10     1014  1987     8    16
 4 AL031987  1987-08-17    10     1015  1987     8    17
 5 AL031987  1987-08-17    10     1015  1987     8    17
 6 Alberto   1994-07-07    10     1012  1994     7     7
 7 Alberto   1994-07-07    10     1012  1994     7     7
 8 Alberto   1994-07-07    10     1012  1994     7     7
 9 Alberto   1994-07-07    10     1013  1994     7     7
10 Claudette 1979-07-27    15     1007  1979     7    27
# ... with 10,000 more rows

arrange(storms, desc(wind))

# A tibble: 10,010 x 7
   name    date        wind pressure  year month   day
   <chr>   <date>     <int>    <int> <dbl> <dbl> <int>
 1 Gilbert 1988-09-14   160      888  1988     9    14
 2 Wilma   2005-10-19   160      882  2005    10    19
 3 Gilbert 1988-09-14   155      889  1988     9    14
 4 Mitch   1998-10-26   155      905  1998    10    26
 5 Mitch   1998-10-27   155      910  1998    10    27
 6 Rita    2005-09-22   155      895  2005     9    22
 7 Rita    2005-09-22   155      897  2005     9    22
 8 Anita   1977-09-02   150      926  1977     9     2
 9 David   1979-08-30   150      924  1979     8    30
10 David   1979-08-31   150      926  1979     8    31
# ... with 10,000 more rows

arrange(storms, wind, date)

# A tibble: 10,010 x 7
   name      date        wind pressure  year month   day
   <chr>     <date>     <int>    <int> <dbl> <dbl> <int>
 1 Bonnie    1986-06-28    10     1013  1986     6    28
 2 Bonnie    1986-06-28    10     1012  1986     6    28
 3 AL031987  1987-08-16    10     1014  1987     8    16
 4 AL031987  1987-08-17    10     1015  1987     8    17
 5 AL031987  1987-08-17    10     1015  1987     8    17
 6 Alberto   1994-07-07    10     1012  1994     7     7
 7 Alberto   1994-07-07    10     1012  1994     7     7
 8 Alberto   1994-07-07    10     1012  1994     7     7
 9 Alberto   1994-07-07    10     1013  1994     7     7
10 Claudette 1979-07-27    15     1007  1979     7    27
# ... with 10,000 more rows

5.6 (Unit of analysis) the `group_by()` verb

When the analysis is to be carried out separately for each unit of analysis, combine the other verbs with group_by().

Before

city	size	amount
(chr)	(chr)	(dbl)
New York	large	23
New York	small	14
London	large	22
London	small	16
Beijing	large	121
Beijing	small	56

After

city	mean	sum	n
Beijing	88.5	177	2
London	19.0	38	2
New York	18.5	37	2

pollution %>% group_by(city) %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n()) %>% 
  ungroup()

# A tibble: 3 x 4
  city      mean   sum     n
  <chr>    <dbl> <dbl> <int>
1 Beijing   88.5   177     2
2 London    19      38     2
3 New York  18.5    37     2
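Changing only the grouping variable changes the unit of analysis. As a sketch, the same summary grouped by `size` instead of `city`, on a hand-built copy of the pollution table (`pollution_long` and `by_size` are assumed names):

```r
library(dplyr)

pollution_long <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# One output row per particle size instead of one per city
by_size <- pollution_long %>%
  group_by(size) %>%
  summarise(mean = mean(amount), sum = sum(amount), n = n()) %>%
  ungroup()
by_size
```

The summary expressions are identical to the per-city version; only `group_by()` decides what one output row represents.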

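6 `dplyr` verbs compared with Base syntax ⁸

When comparing data.table and dplyr, both perform data transformation, but they need to be examined from the following perspectives. In other words, they provide the same functionality but differ in their quality attributes.

Speed
Memory usage
Syntax
Features

To compare the dplyr package with Base R, create a data frame that has one categorical variable and one numeric variable.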

df <- data.frame( 
  color = c("blue", "black", "blue", "blue", "black"), 
  value = 1:5)

6.1 Selecting specific variables (select)

Traditional R code	`dplyr` R code
`df[, c("var01","var02")]`	`select(df, var01, var02)`

select(df, color)

  color
1  blue
2 black
3  blue
4  blue
5 black

select(df, -color)

6.2 Filtering and selecting observations (filter)

Traditional R code	`dplyr` R code
`df[df$var01==3 & df$var02==7, ]`	`filter(df, var01==3, var02==7)`

filter(df, color == "blue")

  color value
1  blue     1
2  blue     3
3  blue     4

filter(df, value %in% c(1, 4))

  color value
1  blue     1
2  blue     4

6.3 Creating a new variable (mutate)

Traditional R code	`dplyr` R code
`df$new <- df$var01/df$var02`	`df <- mutate(df, new=var01/var02)`

mutate(df, double = 2 * value)

  color value double
1  blue     1      2
2 black     2      4
3  blue     3      6
4  blue     4      8
5 black     5     10

mutate(df, double = 2 * value, quadruple = 2 * double)

  color value double quadruple
1  blue     1      2         4
2 black     2      4         8
3  blue     3      6        12
4  blue     4      8        16
5 black     5     10        20

6.4 Summarizing variables (summarize)

Traditional R code	`dplyr` R code
`aggregate(df$value, list(var01=df$var01), sum)`	`group_by(df, var01) %>% summarize(totalvalue = sum(value))`

summarise(df, total = sum(value))

  total
1    15

by_color <- group_by(df, color) 
summarise(by_color, total = sum(value))

# A tibble: 2 x 2
  color total
  <fct> <int>
1 black     7
2 blue      8

6.5 Reordering (arrange)

Traditional R code	`dplyr` R code
`df[order(df$var01, df$var02), ]`	`arrange(df, var01, var02)`

arrange(df, color)

  color value
1 black     2
2 black     5
3  blue     1
4  blue     3
5  blue     4

arrange(df, desc(color))

  color value
1  blue     1
2  blue     3
3  blue     4
4 black     2
5 black     5
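To check that the two styles really agree, the sketch below runs a base R expression and its dplyr counterpart on the same `df` and keeps both results for comparison. The particular conditions are illustrative assumptions.

```r
library(dplyr)

df <- data.frame(
  color = c("blue", "black", "blue", "blue", "black"),
  value = 1:5,
  stringsAsFactors = FALSE
)

# Filtering: note the trailing comma in the base R row index
base_filter  <- df[df$color == "blue" & df$value > 1, ]
dplyr_filter <- filter(df, color == "blue", value > 1)

# Sorting: color ascending, then value descending
base_sorted  <- df[order(df$color, -df$value), ]
dplyr_sorted <- arrange(df, color, desc(value))

# Grouped sum
base_total  <- aggregate(value ~ color, data = df, FUN = sum)
dplyr_total <- df %>% group_by(color) %>% summarise(total = sum(value))

base_sorted
dplyr_sorted
```

The values match, but base R subsetting keeps the original row names (for example 3 and 4 after filtering) while dplyr renumbers the rows from 1.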

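7 `dplyr` patterns and the `transmute` verb

7.1 `group_by` + `mutate`: proportions within groups

Applying mutate() after the group_by() verb is another frequently used pattern.

For example, using the most recent year (2007) of the gapminder dataset, we can add up the population of each continent to compute total_pop, and then pick out the most populous country and the country with the largest share within each continent. First, let us pick the single most populous country in each continent.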

library(gapminder)

gapminder %>% 
  mutate(pop = pop / 100000000) %>% # unit: 100 million people
  filter(year == max(year)) %>% 
  group_by(continent) %>% 
  mutate(total_pop = sum(pop)) %>% 
  top_n(1, wt=pop)

# A tibble: 5 x 7
# Groups:   continent [5]
  country       continent  year lifeExp    pop gdpPercap total_pop
  <fct>         <fct>     <int>   <dbl>  <dbl>     <dbl>     <dbl>
1 Australia     Oceania    2007    81.2  0.204    34435.     0.245
2 China         Asia       2007    73.0 13.2       4959.    38.1  
3 Germany       Europe     2007    79.4  0.824    32170.     5.86 
4 Nigeria       Africa     2007    46.9  1.35      2014.     9.30 
5 United States Americas   2007    78.2  3.01     42952.     8.99

It is also possible to compute each country's share of its continent's population, fraction, and pick the country with the largest share in each continent.

gapminder %>% 
  mutate(pop = pop / 100000000) %>% # unit: 100 million people
  filter(year == max(year)) %>% 
  group_by(continent) %>% 
  mutate(total_pop = sum(pop),
         fraction = pop / total_pop) %>% 
  top_n(1, wt=fraction)

# A tibble: 5 x 8
# Groups:   continent [5]
  country       continent  year lifeExp    pop gdpPercap total_pop fraction
  <fct>         <fct>     <int>   <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
1 Australia     Oceania    2007    81.2  0.204    34435.     0.245    0.832
2 China         Asia       2007    73.0 13.2       4959.    38.1      0.346
3 Germany       Europe     2007    79.4  0.824    32170.     5.86     0.141
4 Nigeria       Africa     2007    46.9  1.35      2014.     9.30     0.145
5 United States Americas   2007    78.2  3.01     42952.     8.99     0.335
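The key point of this pattern is that `mutate()` after `group_by()` keeps every row while group totals are repeated within each group. A minimal sketch on a hand-built copy of the pollution table (used here instead of gapminder so that it runs without extra packages; the object names are assumptions):

```r
library(dplyr)

pollution_long <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  size   = c("large", "small", "large", "small", "large", "small"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

shares <- pollution_long %>%
  group_by(city) %>%
  mutate(total    = sum(amount),        # city total, repeated on each row
         fraction = amount / total) %>% # share of the city total
  ungroup()
shares
```

Unlike `summarise()`, the result still has six rows, and the fractions within each city add up to 1.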

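7.2 `group_by` + window functions: change within groups

Combining the group_by() verb with window functions (lag, lead, and so on) makes it possible to pick out the points where the change between consecutive observations is largest. Here this makes it easy to identify when and in which countries life expectancy (lifeExp) fell the most.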

gapminder_window <- gapminder %>% 
  group_by(continent, country) %>% 
  arrange(continent, country, year) %>% 
  mutate(lifeExp_lag = lag(lifeExp)) %>% 
  mutate(difference = lifeExp - lifeExp_lag) %>% 
  select(country, continent, year, lifeExp, lifeExp_lag, difference) %>% 
  ungroup()

worst_countries <- gapminder_window %>% 
  top_n(5, wt=-difference) %>% 
  pull(country)

gapminder_window %>% 
  filter(country %in% worst_countries) %>% 
  arrange(difference)

# A tibble: 60 x 6
   country   continent  year lifeExp lifeExp_lag difference
   <fct>     <fct>     <int>   <dbl>       <dbl>      <dbl>
 1 Rwanda    Africa     1992    23.6        44.0     -20.4 
 2 Zimbabwe  Africa     1997    46.8        60.4     -13.6 
 3 Lesotho   Africa     2002    44.6        55.6     -11.0 
 4 Swaziland Africa     2002    43.9        54.3     -10.4 
 5 Botswana  Africa     1997    52.6        62.7     -10.2 
 6 Zimbabwe  Africa     2002    40.0        46.8      -6.82
 7 Botswana  Africa     2002    46.6        52.6      -5.92
 8 Swaziland Africa     2007    39.6        43.9      -4.26
 9 Swaziland Africa     1997    54.3        58.5      -4.18
10 Lesotho   Africa     1997    55.6        59.7      -4.13
# ... with 50 more rows
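The mechanics of the grouped `lag()` are easier to follow on a tiny table. The data frame `series` below and its values are hypothetical.

```r
library(dplyr)

# Hypothetical yearly values for two groups
series <- data.frame(
  group = c("a", "a", "a", "b", "b", "b"),
  year  = c(2000, 2001, 2002, 2000, 2001, 2002),
  value = c(10, 12, 9, 5, 5, 8),
  stringsAsFactors = FALSE
)

changes <- series %>%
  group_by(group) %>%
  arrange(year, .by_group = TRUE) %>%          # sort by year within each group
  mutate(difference = value - lag(value)) %>%  # lag() restarts in every group
  ungroup()
changes
```

The first row of each group has no predecessor, so its `difference` is `NA` rather than a comparison with the last row of the previous group.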

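7.3 The `transmute` verb

The transmute verb combines select and mutate and can make code more concise. For example, GDP is obtained by multiplying population (pop) by GDP per capita (gdpPercap). Doing this with mutate + select makes the code longer, whereas transmute produces the same result cleanly in a single call.

Combination of the select + mutate verbs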

gapminder %>% 
  mutate(GDP = pop * gdpPercap) %>% 
  select(continent, country, GDP)

# A tibble: 1,704 x 3
   continent country              GDP
   <fct>     <fct>              <dbl>
 1 Asia      Afghanistan  6567086330.
 2 Asia      Afghanistan  7585448670.
 3 Asia      Afghanistan  8758855797.
 4 Asia      Afghanistan  9648014150.
 5 Asia      Afghanistan  9678553274.
 6 Asia      Afghanistan 11697659231.
 7 Asia      Afghanistan 12598563401.
 8 Asia      Afghanistan 11820990309.
 9 Asia      Afghanistan 10595901589.
10 Asia      Afghanistan 14121995875.
# ... with 1,694 more rows

The transmute verb

gapminder %>% 
  transmute(continent, country, GDP = pop * gdpPercap)

# A tibble: 1,704 x 3
   continent country              GDP
   <fct>     <fct>              <dbl>
 1 Asia      Afghanistan  6567086330.
 2 Asia      Afghanistan  7585448670.
 3 Asia      Afghanistan  8758855797.
 4 Asia      Afghanistan  9648014150.
 5 Asia      Afghanistan  9678553274.
 6 Asia      Afghanistan 11697659231.
 7 Asia      Afghanistan 12598563401.
 8 Asia      Afghanistan 11820990309.
 9 Asia      Afghanistan 10595901589.
10 Asia      Afghanistan 14121995875.
# ... with 1,694 more rows

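8 Split-Apply-Combine strategy - the `apply` family

One way of using R is to perform an operation on a whole vector with a single call instead of repeating it one element at a time in a loop. In addition, the apply functions perform the same operation over rows, columns, or lists, and reduce extends this toward functional programming.

lapply(list, function) : applies the function to each element of a list and returns a list.
sapply(list, function) : the same as lapply, but returns a vector instead of a list where possible.
mapply(function, list1, list2, ...) : applies the function over several lists in parallel, element by element.
tapply(x, factor, function) : applies the function to x separately for each level of a factor.
vapply(list, function, ...) : a stricter variant of sapply in which the type of the return value is declared in advance, which makes it safer and can make it faster.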
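The differences between the five functions are mostly in what they return. A minimal base R sketch follows; the list `values` is made up, and the amounts and cities repeat the pollution table used earlier.

```r
values <- list(a = 1:3, b = 4:6)

as_list   <- lapply(values, mean)   # list with elements a and b
as_vector <- sapply(values, mean)   # simplified to a named numeric vector

# vapply: same result as sapply, but the return type is declared up front
checked <- vapply(values, mean, FUN.VALUE = numeric(1))

# mapply: walks over two vectors in parallel, element by element
pairwise <- mapply(function(x, y) x + y, 1:3, 4:6)

# tapply: applies sum to amount separately for each city
amount  <- c(23, 14, 22, 16, 121, 56)
city    <- c("New York", "New York", "London", "London", "Beijing", "Beijing")
by_city <- tapply(amount, city, sum)

as_vector
pairwise
by_city
```

If a function passed to `vapply()` ever returns something other than a single number, the call fails immediately instead of silently changing the shape of the result, which is the main reason to prefer it over `sapply()` in non-interactive code.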

Implementing the Split-Apply-Combine strategy in R (including R parallel programming) on the basis of Hadley Wickham's 2011 Journal of Statistical Software article "The Split-Apply-Combine Strategy for Data Analysis" lets data analysts handle complex and difficult data analysis tasks with ease.
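The three steps can be written out explicitly in base R. This is a sketch of the general idea rather than the implementation described in the article; the object names are assumptions, and the data repeat the pollution table used earlier.

```r
pollution_long <- data.frame(
  city   = c("New York", "New York", "London", "London", "Beijing", "Beijing"),
  amount = c(23, 14, 22, 16, 121, 56),
  stringsAsFactors = FALSE
)

# Split: one vector of amounts per city
pieces <- split(pollution_long$amount, pollution_long$city)

# Apply: the same function on every piece
means <- lapply(pieces, mean)

# Combine: put the per-piece results back into one data frame
combined <- data.frame(city = names(means),
                       mean = unlist(means),
                       row.names = NULL,
                       stringsAsFactors = FALSE)
combined
```

The result matches the per-city means produced by `group_by(city) %>% summarise(mean = mean(amount))` in section 5.6, which performs the same three steps in one pipeline.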

Data Science

Data transformation - `dplyr`

xwMOOC

2019-08-18
