1 Kaggle Dataset

Using the Women’s E-Commerce Clothing Reviews data, let us look at how adding text data as features to a predictive model can improve its predictive performance.

1.1 Data Dictionary

The Kaggle Women’s E-Commerce Clothing Reviews data consists of 11 variables and 23,486 observations. We take Recommended IND as the label (target variable) and build a predictive model for it.

  • Clothing ID: Integer Categorical variable that refers to the specific piece being reviewed.
  • Age: Positive Integer variable of the reviewer’s age.
  • Title: String variable for the title of the review.
  • Review Text: String variable for the review body.
  • Rating: Positive Ordinal Integer variable for the product score granted by the customer from 1 Worst, to 5 Best.
  • Recommended IND: Binary variable stating whether the customer recommends the product, where 1 is recommended and 0 is not recommended.
  • Positive Feedback Count: Positive Integer documenting the number of other customers who found this review positive.
  • Division Name: Categorical name of the product high level division.
  • Department Name: Categorical name of the product department name.
  • Class Name: Categorical name of the product class name.
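
To separate the text columns from the structured ones, it can help to tag each variable in the dictionary above with a role. The following is a minimal Python sketch for illustration only, not the code used to produce the results in this document. The role names (`categorical`, `numeric`, `text`, `label`) and the helper `columns_with_role` are assumptions made for this example, and Rating is tagged as numeric here although it could equally be treated as ordinal.

```python
# Column roles derived from the data dictionary above.
# The role labels themselves are illustrative, not part of the dataset.
COLUMN_ROLES = {
    "Clothing ID": "categorical",
    "Age": "numeric",
    "Title": "text",
    "Review Text": "text",
    "Rating": "numeric",
    "Recommended IND": "label",
    "Positive Feedback Count": "numeric",
    "Division Name": "categorical",
    "Department Name": "categorical",
    "Class Name": "categorical",
}


def columns_with_role(*roles):
    """Return the column names whose role is one of `roles`, in dictionary order."""
    return [name for name, role in COLUMN_ROLES.items() if role in roles]


label_column = columns_with_role("label")[0]
text_columns = columns_with_role("text")
non_text_features = columns_with_role("categorical", "numeric")

print("label:", label_column)
print("text columns:", text_columns)
print("non-text features:", non_text_features)
```

A model that excludes text would draw only on `non_text_features`, while `text_columns` would be the source of any text-derived features added later.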

2 Predictive Model

2.1 Predictive Model Without Text

Time difference of 31.08911 secs
Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes 3881  756
       no  4162 1031
                                          
               Accuracy : 0.4997          
                 95% CI : (0.4898, 0.5096)
    No Information Rate : 0.8182          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0342          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.4825          
            Specificity : 0.5769          
         Pos Pred Value : 0.8370          
         Neg Pred Value : 0.1985          
             Prevalence : 0.8182          
         Detection Rate : 0.3948          
   Detection Prevalence : 0.4717          
      Balanced Accuracy : 0.5297          
                                          
       'Positive' Class : yes             
                                          
Confusion Matrix and Statistics

          Reference
Prediction  yes   no
       yes 3917  794
       no  4126  993
                                          
               Accuracy : 0.4995          
                 95% CI : (0.4896, 0.5094)
    No Information Rate : 0.8182          
    P-Value [Acc > NIR] : 1               
                                          
                  Kappa : 0.0247          
 Mcnemar's Test P-Value : <2e-16          
                                          
            Sensitivity : 0.4870          
            Specificity : 0.5557          
         Pos Pred Value : 0.8315          
         Neg Pred Value : 0.1940          
             Prevalence : 0.8182          
         Detection Rate : 0.3985          
   Detection Prevalence : 0.4792          
      Balanced Accuracy : 0.5213          
                                          
       'Positive' Class : yes
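
Every statistic in the two outputs above follows from the four cell counts of the confusion matrix, with "yes" as the positive class. The sketch below recomputes them in Python as a check. It is not the code that produced the output (the print format appears to be that of R's caret `confusionMatrix`), and the function name `confusion_stats` is hypothetical.

```python
def confusion_stats(tp, fp, fn, tn):
    """Summary statistics for a 2x2 confusion matrix.

    tp: predicted yes, reference yes    fp: predicted yes, reference no
    fn: predicted no,  reference yes    tn: predicted no,  reference no
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Agreement expected by chance, from the row and column totals.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total**2
    return {
        "accuracy": accuracy,
        # No Information Rate: share of the largest reference class.
        "no_information_rate": max(tp + fn, fp + tn) / total,
        "kappa": (accuracy - expected) / (1 - expected),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "pos_pred_value": tp / (tp + fp),
        "neg_pred_value": tn / (tn + fn),
        "prevalence": (tp + fn) / total,
        "detection_rate": tp / total,
        "detection_prevalence": (tp + fp) / total,
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


# Cell counts copied from the two confusion matrices above.
first = confusion_stats(tp=3881, fp=756, fn=4162, tn=1031)
second = confusion_stats(tp=3917, fp=794, fn=4126, tn=993)

for name, stats in (("first", first), ("second", second)):
    print(name)
    for key, value in stats.items():
        print(f"  {key:22s} {value:.4f}")
```

In both outputs the accuracy is about 0.50 while the No Information Rate is 0.8182, so always predicting "yes" would score higher than either model. A Kappa near zero says the same thing: agreement with the reference labels is barely above chance.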