1 DVT 대여 데이터베이스

DVT 대여 데이터베이스를 설치했다면 다음 단계로 다양한 SQL 쿼리문을 던져 뭔가 가치 있는 정보를 추출해야만 한다. 데이터 과학: “postgreSQL - DVD 대여 데이터베이스”에서 데이터베이스 설치와 접속에 대한 사항은 확인한다.

1.1 DB 접속 헬로월드 ¹

먼저 DBI::dbConnect()를 통해 접속하고 SQL 쿼리 헬로월드를 던져보자.

library(RPostgreSQL)
library(DBI)

pgdrv <- dbDriver("PostgreSQL")

con <- dbConnect(pgdrv, dbname="dvd", 
                 port="5432", 
                 user="postgres", 
                 password=postgre_PW, 
                 host="localhost")

actor <- dbGetQuery(con, "SELECT * FROM actor LIMIT 5")

actor

  actor_id first_name    last_name         last_update
1        1   Penelope      Guiness 2013-05-26 14:47:57
2        2       Nick     Wahlberg 2013-05-26 14:47:57
3        3         Ed        Chase 2013-05-26 14:47:57
4        4   Jennifer        Davis 2013-05-26 14:47:57
5        5     Johnny Lollobrigida 2013-05-26 14:47:57

2 이탈/잔존고객 구매금액

customer 테이블에는 active 칼럼을 통해 잔존고객과 이탈고객을 파악할 수 있다. 이를 통해서 잔존고객과 이탈고객이 몇명이고 구매금액을 파악할 수 있다. 먼저 datamodelr 팩키지를 통해 해당 테이블을 뽑아내서 이를 시각화해보자.

library(tidyverse)
library(datamodelr)

payment <- tbl(con, "payment") %>% collect()
customer <- tbl(con, "customer") %>% collect()

payment_customer_model <- dm_from_data_frames(payment, customer)

payment_customer_model <- dm_add_references(
  payment_customer_model,
  customer$customer_id ==  payment$customer_id
)

payment_customer_graph <- dm_create_graph(payment_customer_model, rankdir = "LR", col_attr = c("column", "type"))
dm_render_graph(payment_customer_graph)

con을 통해 DVD 대여 데이터베이스에 접속이 이루어진 상태다. 이탈고객과 잔존고객별로 구매금액에 대한 평균, 최소, 최대, 총합계를 구하려면 두 테이블을 INNER JOIN으로 customer_id를 키값으로 합치고 나서 기술통계를 산출한다.

sql_query <- 
"SELECT active, 
       COUNT(*) AS num_active, 
       MIN(amount) AS min_amt, 
       AVG(amount) AS avg_amt,
       MAX(amount) AS max_amt, 
       SUM(amount) AS total_amt
FROM payment AS p
INNER JOIN customer AS c
  ON p.customer_id = c.customer_id
GROUP BY c.active;"

dbGetQuery(con, sql_query)

  active num_active min_amt  avg_amt max_amt total_amt
1      0        369    0.99 4.092981   11.99   1510.31
2      1      14227    0.00 4.203397   11.99  59801.73

3 쟝르별 평균 대여평점

앞서와 마찬가지로 쟝르별 평균 대여평점을 계산할 수 있는 테이블을 쭉 뽑아본다. 이를 통해서 3개 테이블, 즉 category, film_category, film을 뽑아놓고 각 해당 키값을 사용하여 결합시킨다.

category <- tbl(con, "category") %>% collect()
film_category <- tbl(con, "film_category") %>% collect()
film <- tbl(con, "film") %>% collect()

rental_rating_model <- dm_from_data_frames(category, film_category, film)

rental_rating_model <- dm_add_references(
  rental_rating_model,
  category$category_id == film_category$category_id,
  film_category$film_id == film$film_id
)

rental_rating_graph <- dm_create_graph(rental_rating_model, rankdir = "LR", col_attr = c("column", "type"))
dm_render_graph(rental_rating_graph)

먼저 film_category와 category를 결합시켜 영화(film)가 속한 쟝르(category)를 파악한다.

rate_qry <- 
"SELECT * 
FROM category AS c
INNER JOIN film_category AS fc
  ON c.category_id = fc.category_id
LIMIT 5;"

dbGetQuery(con, rate_qry)

  category_id        name         last_update film_id category_id
1           6 Documentary 2006-02-15 09:46:27       1           6
2          11      Horror 2006-02-15 09:46:27       2          11
3           6 Documentary 2006-02-15 09:46:27       3           6
4          11      Horror 2006-02-15 09:46:27       4          11
5           8      Family 2006-02-15 09:46:27       5           8
          last_update
1 2006-02-15 10:07:09
2 2006-02-15 10:07:09
3 2006-02-15 10:07:09
4 2006-02-15 10:07:09
5 2006-02-15 10:07:09

다음으로 film 테이블을 조인하여 rental_rate를 결합하고 쟝르(category) 별로 평균평점을 구하고 이를 ORDER BY ... DESC를 사용해서 내림차순으로 정렬한다.

rate_qry <- 
"SELECT c.name,
        AVG(rental_rate) AS avg_rental_rate
FROM category AS c
INNER JOIN film_category AS fc
  ON c.category_id = fc.category_id 
INNER JOIN film AS f
  ON fc.film_id = f.film_id
GROUP BY c.category_id
ORDER BY avg_rental_rate DESC;"

dbGetQuery(con, rate_qry)

          name avg_rental_rate
1        Games        3.252295
2       Travel        3.235614
3       Sci-Fi        3.219508
4       Comedy        3.162414
5       Sports        3.125135
6          New        3.116984
7      Foreign        3.099589
8       Horror        3.025714
9        Drama        3.022258
10       Music        2.950784
11    Children        2.890000
12   Animation        2.808182
13      Family        2.758116
14    Classics        2.744386
15 Documentary        2.666471
16      Action        2.646250

4 Top 10 DVD 영화

가장 많이 대여된 Top 10 DVD 영화를 찾아내기 위해서 이에 해당되는 연관 테이블을 검색하여 찾아낸다. film, inventory, rental 테이블을 특정하고 서로 연결시킬 수 있는 키값을 찾아 연결시킨다.

film <- tbl(con, "film") %>% collect()
inventory <- tbl(con, "inventory") %>% collect()
rental <- tbl(con, "rental") %>% collect()

top_10_model <- dm_from_data_frames(film, inventory, rental)

top_10_model <- dm_add_references(
  top_10_model,
  film$film_id == inventory$film_id,
  inventory$inventory_id == rental$inventory_id
)

top_10_graph <- dm_create_graph(top_10_model, rankdir = "LR", col_attr = c("column", "type"))
dm_render_graph(top_10_graph)

film → inventory → rental 테이블을 순차적으로 film_id, inventory_id를 키값으로 삼아 결합시킨다. 그리고 나서 가장 많이 대여된 영화를 찾기 위해서 COUNT() 함수로 개수하고 나서 이를 내림차순 정리한다.

top_query <- 
"SELECT f.title AS movie_title, 
        COUNT(f.title) AS num_rentals
FROM film AS f
INNER JOIN inventory AS i
  ON f.film_id = i.film_id
INNER JOIN rental AS r
  ON i.inventory_id = r.inventory_id
GROUP BY f.title
ORDER BY num_rentals DESC;"

dbGetQuery(con, top_query) %>% 
  DT::datatable()

Okoh Anita (2018-12-20), “How I analyzed DVD rental data with SQL”, freeCodeCamp ↩

데이터 과학

DVD 대여 데이터베이스에서 인사이트 도출

xwMOOC

2019-10-27

1 DVT 대여 데이터베이스

1.1 DB 접속 헬로월드 ¹

2 이탈/잔존고객 구매금액

3 쟝르별 평균 대여평점

4 Top 10 DVD 영화

데이터 과학

DVD 대여 데이터베이스에서 인사이트 도출

xwMOOC

2019-10-27

1 DVT 대여 데이터베이스

1.1 DB 접속 헬로월드 1

2 이탈/잔존고객 구매금액

3 쟝르별 평균 대여평점

4 Top 10 DVD 영화

1.1 DB 접속 헬로월드 ¹