MovieLens data
We start by downloading the MovieLens 1M Dataset and cleaning it. After downloading and unzipping the archive, we work with the movies.dat, ratings.dat, and users.dat files in the ml-1m directory.
> library(here)
> # download.file(url="http://files.grouplens.org/datasets/movielens/ml-1m.zip", destfile="data/ml-1m.zip")
> unzip_file <- here("data", "ml-1m.zip")
>
> unzip(zipfile = unzip_file, exdir="data/raw")
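Cleaning the MovieLens data
The ratings, users, and movies files are read into R data frames one at a time and cleaned. According to the readr issue "Multicharacter separator/deliminator in Read_delim of Tidyverse? #721", readr functions such as read_delim() do not support delimiters longer than one character, so the delimiter :: is first replaced with a tab (\t) and the rewritten file is then read with read_delim().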
> library(tidyverse)
> ## movies.dat --> https://github.com/tidyverse/readr/issues/721
> movie_txt <- read_lines("data/raw/ml-1m/movies.dat")
> movie_txt <- str_replace_all(movie_txt, "::", "\t")
> movie_txt %>% write_lines("data/raw/ml-1m/movies.txt")
> movie_df <- read_delim("data/raw/ml-1m/movies.txt", delim="\t", col_names = FALSE)
>
> ## ratings.dat
> rating_txt <- read_lines("data/raw/ml-1m/ratings.dat")
> rating_txt <- str_replace_all(rating_txt, "::", "\t")
> rating_txt %>% write_lines("data/raw/ml-1m/ratings.txt")
> rating_df <- read_delim("data/raw/ml-1m/ratings.txt", delim="\t", col_names = FALSE)
>
> ## users.dat
> user_txt <- read_lines("data/raw/ml-1m/users.dat")
> user_txt <- str_replace_all(user_txt, "::", "\t")
> user_txt %>% write_lines("data/raw/ml-1m/users.txt")
> user_df <- read_delim("data/raw/ml-1m/users.txt", delim="\t", col_names = FALSE)
>
> rm(list = ls(pattern = "txt$"))
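The delimiter workaround above rewrites each file on disk before reading it. As a rough alternative sketch, the same two-character delimiter can be split in memory with base R's strsplit(). The two sample lines are typed inline for illustration rather than read from the real file, and movie_toy_df is a hypothetical name that does not appear in the original workflow.

```r
# Two sample lines in the "::"-delimited layout of movies.dat
# (illustrative values typed inline, not read from the real file).
lines <- c("1::Toy Story (1995)::Animation|Children's|Comedy",
           "2::Jumanji (1995)::Adventure|Children's|Fantasy")

# Split each line on the literal two-character delimiter.
fields <- strsplit(lines, "::", fixed = TRUE)

# Bind the pieces into a data frame with the column names used in this post.
movie_toy_df <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(movie_toy_df) <- c("movie_id", "movie_nm", "genre")
movie_toy_df$movie_id <- as.integer(movie_toy_df$movie_id)

str(movie_toy_df)
```

This avoids the intermediate .txt files, at the cost of doing the type conversion by hand instead of letting read_delim() guess column types.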
> movie_df <- movie_df %>%
+ rename(movie_id = X1,
+ movie_nm = X2,
+ genre = X3)
>
> rating_df <- rating_df %>%
+ rename(user_id = X1,
+ movie_id = X2,
+ rating = X3,
+ timestamp = X4)
>
> user_df <- user_df %>%
+ rename(user_id = X1,
+ gender = X2,
+ age = X3,
+ occupation = X4,
+ zipcode = X5)
>
> um_rating_df <- user_df %>% select(user_id) %>%
+ left_join(rating_df, by="user_id") %>%
+ select(-timestamp) %>%
+ left_join(movie_df, by="movie_id")
>
> um_rating_df %>%
+ sample_n(100) %>%
+ DT::datatable()
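The two left_join() calls that build um_rating_df can be checked on a toy example. The sketch below uses made-up rows and hypothetical object names (users, ratings, movies, toy_um_rating) that only mirror the shape of user_df, rating_df, and movie_df.

```r
library(dplyr)

# Toy stand-ins for user_df, rating_df, and movie_df (made-up rows).
users   <- tibble(user_id = c(1L, 2L))
ratings <- tibble(user_id  = c(1L, 1L, 2L),
                  movie_id = c(10L, 20L, 10L),
                  rating   = c(5L, 3L, 4L))
movies  <- tibble(movie_id = c(10L, 20L),
                  movie_nm = c("Movie A", "Movie B"),
                  genre    = c("Comedy", "Drama"))

# Same join order as um_rating_df: users -> ratings -> movies.
toy_um_rating <- users %>%
  left_join(ratings, by = "user_id") %>%
  left_join(movies, by = "movie_id")

toy_um_rating
```

The result has one row per rating, with the movie title and genre attached to each row, which is the layout later copied to Spark.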
A taste of sparklyr
Following RStudio's "sparklyr: R interface for Apache Spark", install Java if it is not already installed, then create a Spark cluster through sparklyr to run the analysis.
> library(sparklyr)
> sc <- spark_connect(master = "local")
> spark_version(sc)
[1] '2.1.0'
Convert the R data frame (um_rating_df) into a Spark DataFrame.
> um_rating_sdf <- sdf_copy_to(sc, um_rating_df, "um_rating", overwrite = TRUE)
> src_tbls(sc)
[1] "um_rating"
Run the ALS algorithm to factorize the rating matrix.
> library(dplyr)
>
> partitions <- sdf_partition(um_rating_sdf, training = 0.8, test = 0.2)
>
> movie_als <- ml_als_factorization(partitions$training,
+ rating.column = "rating",
+ user.column = "user_id",
+ item.column = "movie_id",
+ rank = 10L,
+ regularization.parameter = 0.1,
+ implicit.preferences = FALSE,
+ alpha =1,
+ nonnegative = TRUE,
+ iter.max = 10)
> summary(movie_als)
                        Length Class      Mode
uid                     1      -none-     character
param_map               3      -none-     list
rank                    1      -none-     numeric
recommend_for_all_items 1      -none-     function
recommend_for_all_users 1      -none-     function
item_factors            2      tbl_spark  list
user_factors            2      tbl_spark  list
user_col                1      -none-     character
item_col                1      -none-     character
prediction_col          1      -none-     character
.jobj                   2      spark_jobj environment
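To make "factorizing the rating matrix" concrete, here is a minimal base-R sketch of alternating least squares on a toy matrix. It is an illustration under simplifying assumptions, not Spark's implementation: the toy matrix R, the helper names solve_factors and als_loss, and the plain ridge penalty are all choices made for this example, and the non-negativity constraint used in the Spark call is not enforced.

```r
# Toy ALS sketch in base R (illustrative only; not what Spark runs internally).
set.seed(42)

# Small rating matrix: rows are users, columns are movies, NA = not rated.
R <- matrix(c( 5,  3, NA,  1,
               4, NA, NA,  1,
               1,  1, NA,  5,
               1, NA,  5,  4,
              NA,  1,  5,  4),
            nrow = 5, byrow = TRUE)

rank   <- 2    # number of latent factors (the Spark call uses rank = 10L)
lambda <- 0.1  # ridge penalty (the Spark call uses 0.1 as well)
n_iter <- 10   # alternating sweeps (the Spark call uses iter.max = 10)

U <- matrix(runif(nrow(R) * rank), nrow(R), rank)  # user factors
V <- matrix(runif(ncol(R) * rank), ncol(R), rank)  # item factors

# Ridge-regression solve for one user (or item), holding the other side fixed.
solve_factors <- function(ratings, other) {
  seen <- !is.na(ratings)
  A <- other[seen, , drop = FALSE]
  solve(crossprod(A) + lambda * diag(ncol(other)),
        crossprod(A, ratings[seen]))
}

# Squared error on the observed entries plus the L2 penalty.
als_loss <- function(U, V) {
  err <- (R - U %*% t(V))[!is.na(R)]
  sum(err^2) + lambda * (sum(U^2) + sum(V^2))
}

loss <- als_loss(U, V)
for (it in seq_len(n_iter)) {
  for (u in seq_len(nrow(R))) U[u, ] <- solve_factors(R[u, ], V)
  for (i in seq_len(ncol(R))) V[i, ] <- solve_factors(R[, i], U)
  loss <- c(loss, als_loss(U, V))
}

# Reconstructed matrix: the NA cells now hold predicted ratings.
R_hat <- U %*% t(V)
round(R_hat, 2)
```

Each inner update solves its small least-squares problem exactly, so the recorded loss should never increase from one sweep to the next. The rows of U and V play the role of user_factors and item_factors in the model summary above.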
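Use the fitted ALS model to predict ratings on the test set, drop rows that have no prediction, and evaluate the result with RMSE.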
> predictions <- ml_predict(movie_als, partitions$test)
>
> movie_pred_df <- predictions %>% collect() %>%
+ filter(complete.cases(.))
>
> movie_pred_df %>%
+ head()
# A tibble: 6 x 6
  user_id movie_id rating movie_nm                   genre      prediction
    <int>    <int>  <int> <chr>                      <chr>           <dbl>
1     660       12      1 Dracula: Dead and Loving ~ Comedy|Ho~       1.96
2    2235       12      5 Dracula: Dead and Loving ~ Comedy|Ho~       3.02
3    2378       12      1 Dracula: Dead and Loving ~ Comedy|Ho~       2.31
4    5788       12      1 Dracula: Dead and Loving ~ Comedy|Ho~       1.80
5    1112       12      1 Dracula: Dead and Loving ~ Comedy|Ho~       2.00
6    1329       12      3 Dracula: Dead and Loving ~ Comedy|Ho~       2.27
> library(Metrics)
> rmse(movie_pred_df$rating, movie_pred_df$prediction)
[1] 0.8735411
> spark_disconnect(sc)
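The RMSE reported above is the square root of the mean squared difference between actual and predicted ratings. The sketch below spells out that formula in base R and applies it to the six rows printed by head() earlier. The helper name rmse_manual is hypothetical, and the value it returns for these six rows is not the test-set RMSE.

```r
# Root mean squared error written out by hand.
rmse_manual <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

# The six ratings and predictions shown in the head() output above.
actual    <- c(1, 5, 1, 1, 1, 3)
predicted <- c(1.96, 3.02, 2.31, 1.80, 2.00, 2.27)

six_row_rmse <- rmse_manual(actual, predicted)
six_row_rmse
```

Because the differences are squared, swapping the two arguments gives the same value.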