Introduction to Data Science: Summary of Basic Commands

Key Points

Data Science - March 16
  • Understand what data science is.

  • Identify the limitations of spreadsheets.

  • Survey the current state of data science.

  • Create a personal data science blog.

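Programming Basics and Markdown Reports - March 23
  • Absorb the coding taught in the primary and secondary school curriculum like a sponge.

  • Mix reports written in R Markdown with software written in R.

  • Specify chunk options to control formatting.

  • Use the knitr package to convert documents to PDF and other formats.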

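Finding Data and Community, Seeking Help, Project Setup - March 30
  • Use the help() function to get online help.

  • Use RStudio to create and manage projects in a consistent way.

  • Treat raw data as read-only.

  • Treat automatically generated output as disposable.

  • Separate function definition from function application.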

Data Ingestion - April 6
Data Structures - April 13
  • Use the read.csv function to read tabular data into R.

  • The basic data types in R are double, integer, complex, logical, and character.

  • Use factors to represent categories in R.

Data Frames and SQL - April 20
  • Use cbind() to add a column to a data frame.

  • Use rbind() to add a row to a data frame.

  • Remove rows from a data frame.

  • Use na.omit() to remove rows with NA values from a data frame.

  • Use levels() and as.character() to explore and manipulate factors.

  • Use str(), nrow(), ncol(), dim(), colnames(), rownames(), head(), and typeof() to understand the structure of a data frame.

  • Use read.csv() to read in a CSV file.

  • Understand what length() of a data frame represents.

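Version Control, Collaboration, and Copyright and Licensing - April 27
  • Version control is like an unlimited ‘undo’.

  • Version control also allows many people to work in parallel.

  • Open scientific work is more useful and more highly cited than closed scientific work.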

Midterm Exam - May 4 (special lecture)
  • Your data science energy is half charged!!!

Visualization - May 11
  • Use ggplot2 to create plots.

  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.

  • Implement visualizations as static, interactive, and animated graphics.

Regular Expressions - May 18
  • Data structures should be consistent and predictable.

  • Consider making active use of data identifiers or semantic elements in data directories.

Data Science Programming - May 25
  • Use ifelse to make choices.

  • Use for loops to repeat operations.

  • Understand recursion, a technique that divides a problem in order to conquer it.

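Functional Programming - June 1
  • Understand why, when, and how to write functions.

  • Understand functions from multiple perspectives.

  • Become familiar with the functional programming terminology that appears in advanced data science courses.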

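Data Science Products (Papers, etc.) - June 8
Data Science Authoring - June 15
  • Adhere to the OSMU (One Source Multi Use) principle.

  • Write in a way that incorporates text, equations, figures/tables, statistics, and models.

  • Clearly distinguish the areas where humans write well from the areas where machines write well.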

R Packages - June 22
  • Develop a data-based R package.

  • Develop a function-based R package.

  • Understand the underlying technologies for R package development.

Final Exam - June 22

Summary of Basic Commands

Action         Files    Folders
Inspect        ls       ls
View content   cat      ls
Navigate to             cd
Move           mv       mv
Copy           cp       cp -r
Create         nano     mkdir
Delete         rm       rmdir, rm -r
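
The commands in the table can be tried out safely in a scratch directory. The session below is a minimal sketch, not a definitive recipe: the file and folder names (notes.txt, data, backup, archive) are made up for illustration, and printf with a redirect stands in for nano so that the example runs without an interactive editor.

```shell
# Work inside a throwaway directory so no real files are touched.
workdir="$(mktemp -d)"
cd "$workdir"

mkdir data                      # Create a folder
printf 'hello\n' > notes.txt    # Create a file (non-interactive stand-in for nano)
ls                              # Inspect the current folder
cat notes.txt                   # View the content of a file
ls data                         # View the content of a folder
cp notes.txt copy.txt           # Copy a file
cp -r data backup               # Copy a folder and everything in it
mv copy.txt data/               # Move a file into a folder
mv backup archive               # Move (here: rename) a folder
rm notes.txt                    # Delete a file
rmdir archive                   # Delete a folder; rmdir only works if it is empty
rm -r data                      # Delete a folder together with its contents
```

Note that rm and rm -r delete immediately and do not use a trash folder, which is why the sketch confines itself to a temporary directory.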

Filesystem hierarchy

The following is an overview of a standard Unix filesystem. The exact hierarchy depends on the platform, so you may not see exactly the same files/directories on your computer:

[Figure: Linux filesystem hierarchy]
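
Moving around the hierarchy can be sketched as follows. The directory names project and data are hypothetical and are created in a temporary location only for this example; the listing printed by ls / will differ from one platform to another.

```shell
# List the top-level directories under the root; the result varies by platform.
ls /

# Build a small hypothetical tree in a temporary location.
workdir="$(mktemp -d)"
mkdir -p "$workdir/project/data"

cd "$workdir/project/data"      # absolute path: starts at the root directory
cd ..                           # .. is the parent directory, so this is project
here="$(basename "$(pwd)")"     # name of the current working directory
echo "$here"
cd data                         # relative path: resolved from the current directory
cd ../..                        # up two levels, back to the temporary directory
pwd
```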

Glossary

absolute path
A path that refers to a particular location in a file system. Absolute paths are usually written with respect to the file system’s root directory, and begin with either “/” (on Unix) or “\” (on Microsoft Windows). See also: relative path.
argument
A value given to a function or program when it runs. The term is often used interchangeably (and inconsistently) with parameter.
command shell
See shell.
command-line interface
A user interface based on typing commands, usually at a REPL. See also: graphical user interface.
comment
A remark in a program that is intended to help human readers understand what is going on, but is ignored by the computer. Comments in Python, R, and the Unix shell start with a # character and run to the end of the line; comments in SQL start with --, and other languages have other conventions.
current working directory
The directory that relative paths are calculated from; equivalently, the place where files referenced by name only are searched for. Every process has a current working directory. The current working directory is usually referred to using the shorthand notation . (pronounced “dot”).
file system
A set of files, directories, and I/O devices (such as keyboards and screens). A file system may be spread across many physical devices, or many file systems may be stored on a single physical device; the operating system manages access.
filename extension
The portion of a file’s name that comes after the final “.” character. By convention this identifies the file’s type: .txt means “text file”, .png means “Portable Network Graphics file”, and so on. These conventions are not enforced by most operating systems: it is perfectly possible (but confusing!) to name an MP3 sound file homepage.html. Since many applications use filename extensions to identify the MIME type of the file, misnaming files may cause those applications to fail.
filter
A program that transforms a stream of data. Many Unix command-line tools are written as filters: they read data from standard input, process it, and write the result to standard output.
flag
A terse way to specify an option or setting to a command-line program. By convention Unix applications use a dash followed by a single letter, such as -v, or two dashes followed by a word, such as --verbose, while DOS applications use a slash, such as /V. Depending on the application, a flag may be followed by a single argument, as in -o /tmp/output.txt.
for loop
A loop that is executed once for each value in some kind of set, list, or range. See also: while loop.
graphical user interface
A user interface based on selecting items and actions from a graphical display, usually controlled by using a mouse. See also: command-line interface.
home directory
The default directory associated with an account on a computer system. By convention, all of a user’s files are stored in or below her home directory.
loop
A set of instructions to be executed multiple times. Consists of a loop body and (usually) a condition for exiting the loop. See also: for loop, while loop.
loop body
The set of statements or commands that are repeated inside a for loop or while loop.
MIME type
MIME (Multipurpose Internet Mail Extensions) types describe different file types for exchange on the Internet, for example images, audio, and documents.
operating system
Software that manages interactions between users, hardware, and software processes. Common examples are Linux, OS X, and Windows.
orthogonal
To have meanings or behaviors that are independent of each other. If the concepts or tools in a set are orthogonal, they can be combined in any way.
parameter
A variable named in a function’s declaration that is used to hold a value passed into the call. The term is often used interchangeably (and inconsistently) with argument.
parent directory
The directory that “contains” the one in question. Every directory in a file system except the root directory has a parent. A directory’s parent is usually referred to using the shorthand notation .. (pronounced “dot dot”).
path
A description that specifies the location of a file or directory within a file system. See also: absolute path, relative path.
pipe
A connection from the output of one program to the input of another. When two or more programs are connected in this way, they are called a “pipeline”.
process
A running instance of a program, containing code, variable values, open files and network connections, and so on. Processes are the “actors” that the operating system manages; it typically runs each process for a few milliseconds at a time to give the impression that they are executing simultaneously.
prompt
A character or characters displayed by a REPL to show that it is waiting for its next command.
quoting
(in the shell): Using quotation marks of various kinds to prevent the shell from interpreting special characters. For example, to pass the string *.txt to a program, it is usually necessary to write it as '*.txt' (with single quotes) so that the shell will not try to expand the * wildcard.
read-evaluate-print loop
(REPL): A command-line interface that reads a command from the user, executes it, prints the result, and waits for another command.
redirect
To send a command’s output to a file rather than to the screen or another command, or equivalently to read a command’s input from a file.
regular expression
A pattern that specifies a set of character strings. REs are most often used to find sequences of characters in strings.
relative path
A path that specifies the location of a file or directory with respect to the current working directory. Any path that does not begin with a separator character (“/” or “\”) is a relative path. See also: absolute path.
root directory
The top-most directory in a file system. Its name is “/” on Unix (including Linux and Mac OS X) and “\” on Microsoft Windows.
shell
A command-line interface such as Bash (the Bourne-Again Shell) or the Microsoft Windows DOS shell that allows a user to interact with the operating system.
shell script
A set of shell commands stored in a file for re-use. A shell script is a program executed by the shell; the name “script” is used for historical reasons.
standard input
A process’s default input stream. In interactive command-line applications, it is typically connected to the keyboard; in a pipe, it receives data from the standard output of the preceding process.
standard output
A process’s default output stream. In interactive command-line applications, data sent to standard output is displayed on the screen; in a pipe, it is passed to the standard input of the next process.
sub-directory
A directory contained within another directory.
tab completion
A feature provided by many interactive systems in which pressing the Tab key triggers automatic completion of the current word or command.
variable
A name in a program that is associated with a value or a collection of values.
while loop
A loop that keeps executing as long as some condition is true. See also: for loop.
wildcard
A character used in pattern matching. In the Unix shell, the wildcard * matches zero or more characters, so that *.txt matches all files whose names end in .txt.
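
Several of the glossary terms (wildcard, quoting, pipe, filter, redirect, for loop, variable) can be seen together in one short session. This is only an illustrative sketch: the files one.txt, two.txt, and lengths.out are invented for the example and live in a temporary directory.

```shell
# Set up two small text files in a throwaway directory.
workdir="$(mktemp -d)"
cd "$workdir"
printf 'a\nb\n' > one.txt
printf 'c\n' > two.txt

# Wildcard: the shell expands *.txt to the matching names before wc runs.
wc -l *.txt

# Quoting: single quotes prevent the expansion, so echo gets the literal string.
echo '*.txt'

# Pipe and filter: sort reads the standard output of wc on its standard input.
wc -l *.txt | sort -n

# Redirect: send standard output to a file instead of the screen.
wc -l *.txt > lengths.out

# For loop and variable: the loop body runs once per matching file,
# with the variable filename holding the current name.
count=0
for filename in *.txt
do
    echo "$filename"
    count=$((count + 1))
done
echo "$count"
```

The output file is deliberately given a name that does not end in .txt, so that it cannot be picked up by the *.txt wildcard in later commands.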
