1 시작점

Robert Gentleman과 Duncan Temple Lang이 2004년 발표한 논문에서 공식적인 시작점을 찾을 수 있다.¹ ²

We introduce the concept of a compendium as both a container for the different elements that make up the document and its computations (i.e. text, code, data,…), and as a means for distributing, managing and updating the collection. - Gentleman, R. and Temple Lang, D. (2004)

R packages can serve as research compendia (including code, data and outputs) for reproducible data analysis projects

R 팩키지는 재현가능한 데이터 분석 프로젝트를 위한 연구 개요서(research compendia)로 훌륭한 대안으로 역할을 수행할 수 있다. 연구 개요서는 코드, 데이터, 출력결과물 등이 포함된다.

2 논문과 `ropensci` 진행경과

Ben Marwick, Carl Boettiger, Lincoln Mullen (2018), “Packaging Data Analytical Work Reproducibly Using R (and Friends)”, University of Wollongong Research Online

재현가능한 과학연구를 위한 저작 방법

ropensci, “Community Call: Reproducible Research with R”

3 R 팩키지와 개요서 ³ ⁴

R 프로젝트, R 팩키지, 개요서(compendium)가 서로 동일한 목표를 가지고 있지만, 다소 차이가 있는 것도 사실이다. 데이터 사이언스를 하면서 코드, 데이터, 결과물을 하나의 개요서(compendium) 아래 묶어 이를 통해 재현가능한 과학기술 발전을 도모하는 것이 무엇보다 필요하다.

좋은 데이터 사이언스 프로젝트 구성
- 모든 파일은 동일한 디렉토리 아래 정돈되어 있음
- 원본 데이터(raw data)는 별도 폴더에 잘 저장되어 있어야 함.
- 정제된 데이터는 R 스크립트를 통해서 만들어져야 함.
- 함수는 분석 스크립트와 독립되어 저장되어야 함.
- 함수는 문서화가 잘 되어야 하면 (단위) 테스트도 되어 ㅎ함.
- 산출물은 코드와 격리되어야 하며 일회용으로 한번 사용하고 버림.
- Makefile은 적절한 순서로 분석을 실행해야 함.
- README 파일은 프로젝트 개요를 담고 있어야 함.
- git을 사용해서 R 코드와 Rmd 문서 파일은 버전 제어를 해야 함.

project/
|-+ data-raw/   # 원본 데이터
|-+ data/       # (R 스크립트로부터 생성된) 정제된 데이터
|-+ R/          # 함수(Functions)
|-+ man/        # (Roxygen으로 생성된) 함수 문서(Function documentation)
|-+ tests/      # 테스트 (functions, Rmd)
|-+ vignettes/  # (Rmd로 작성된) 분석결과, 원고, 보고서 등
|
|- Makefile    # 자동화 시키는 마스터 스크립트
|- DESCRIPTION # 메타데이터와 의존성
|- README      # 프로젝트 개요서

상기와 같이 개요서를 R 팩키지를 통해 구현하게 되면 어떤 점이 좋은지 살펴보자.

재현가능성: Reproducibility
일관되고, 표준적이며, 물흐르는 듯한 프로젝트 구조: Consistent, standard, streamlined organisation
모듈화, 문서화, 테스트 주도 철학을 증진: Promotes modular, well-documented and tested code
공유하기 쉬움: Easy to share (zip, GitHub repo)
설치와 실행이 쉬움: Easy to install & run (Dependencies)
R 팩키지 제작 기계: Use R package development machinery:
R CMD CHECK
지속 개발/지속 배포: Continuous integration (Travis-CI)
goodpractice로 자동화된 코드리뷰: Automatic code review with goodpractice
pkgdown으로 프로젝트 웹사이트 제작: Easily create project websites with pkgdown

3.1 시작이 반이다

시작이 반이다 - start small

project
|- DESCRIPTION
|- README.md  
|- Metadata.txt
|
|- data/                
|   +- 2014ParasiteSurveyJustBrood.csv
|   +- CedarBPLifeTable2014.csv
|   +- NorthBPLifeTable2013.txt
|   +- NorthBPLifeTable2014.csv|
|- analysis/
|   +- CodeforBPpaper.R

3.2 팩키지 개발

팩키지 개발

project
|- DESCRIPTION
|- NAMESPACE
|- README.md
|- LakeTrophicModelling.Rproj
|
|- R/
|   +- LakeTrophicModelling-package.r
|   +- class_prob_rf.R
|   +- condAccuracy.R
|   +- crossval_rf.R
|   +- density_plot.R
|   +- ecdf_ks_ci.R
|   +- ecor_map.R
|   +- getCyanoAbund.R
|   +- getLakeIDs.R
|   +- importancePlot.R
|
|- man/
|   +- class_prob_rf.Rd
|   +- condAccuracy.Rd
|   +- crossval_rf.Rd
|   +- density_plot.Rd
|   +- ecdf_ks_ci.Rd
|   +- ecor_map.Rd
|   +- getCyanoAbund.Rd
|   +- getLakeIDs.Rd
|   +- importancePlot.Rd
|
|- data/                
|   +- LakeTrophicModelling.rda
|
|- vignettes/
|
|- inst/
|   +- doc/
|      +- manuscript.pdf
|   +- extdata/
|      +- ltmData.csv
|      +- data_def.csv

3.3 CI/CD와 도커

Dockerfile 파일을 추가하여 환경도 재현가능하게 만들 수 있고, .travis.yml을 추가하여 CI/CD 환경도 구축할 수 있다. tests/를 추가하여 테스트 주도 개발(test-driven development, TDD)를 시도할 수 있고, 이를 통해 수작업 검증을 자동화하여 생산성과 품질을 대폭 향상시킬 수도 있다.

project
|- DESCRIPTION          # project metadata and dependencies 
|- README.md            # top-level description of content and guide to users
|- NAMESPACE            # exports R functions in the package for repeated use
|- LICENSE              # specify the conditions of use and reuse of the code, data & text
|- .travis.yml          # continuous integration service hook for auto-testing at each commit
|- Dockerfile           # makes a custom isolated computational environment for the project
|
|- data/                # raw data, not changed once created
|  +- my_data.csv       # data files in open formats such as TXT, CSV, TSV, etc.
|
|- analysis/            # any programmatic code
|  +- my_report.Rmd     # R markdown file with narrative text interwoven with code chunks 
|  +- makefile          # builds a PDF/HTML/DOCX file from the Rmd, code, and data files
|  +- scripts/          # code files (R, shell, etc.) used for data cleaning, analysis and visualisation 
|
|- R/                     
|  +- my_functions.R    # custom R functions that are used more than once throughout the project
|
|- man/
|  +- my_functions.Rd   # documentation for the R functions (auto-generated when using devtools)
|
|- tests/
|  +- testthat.R        # unit tests of R functions to ensure they perform as expected

4 각 사례별 템플릿

5 `rrtoools` 워크샵

“Reproducible Research in R with rrtools”, 31st October, Northwest Universities R Day
- 공통: devtools
- 윈도우즈: Rtools
- 맥: Xcode
- 리눅스: r-devel 혹은 r-base-dev

“Statistical Analyses and Reproducible Research” Bioconductor Project Working Papers↩
“Reproducible Research: A Bioinformatics Case Study” in Statistical Applications in Genetics and Molecular Biology↩
jennybc, “Use of an R package to facilitate reproducible research”↩
Francisco Rodriguez-Sanchez “Structuring data anlaysis projects as R packages”↩

Computational Documents

개요서(Compendium) 시작하며

xwMOOC

2019-12-28

1 시작점

2 논문과 `ropensci` 진행경과

3 R 팩키지와 개요서 ³ ⁴

3.1 시작이 반이다

3.2 팩키지 개발

3.3 CI/CD와 도커

4 각 사례별 템플릿

5 `rrtoools` 워크샵

Computational Documents

개요서(Compendium) 시작하며

xwMOOC

2019-12-28

1 시작점

2 논문과 ropensci 진행경과

3 R 팩키지와 개요서 3 4

3.1 시작이 반이다

3.2 팩키지 개발

3.3 CI/CD와 도커

4 각 사례별 템플릿

5 rrtoools 워크샵

2 논문과 `ropensci` 진행경과

3 R 팩키지와 개요서 ³ ⁴

5 `rrtoools` 워크샵