Machine Learning (ML) → Deep Learning (DL)

Instead of relying on the traditional statistics-based predictive models listed below, let's apply recently developed deep learning techniques to the newsgroup (news article) classification task (a minimal sketch of one such traditional baseline follows the list).

  • Naive Bayes: MultinomialNB
  • Logistic Regression: LogisticRegression
  • Linear SVM: SVC
  • SGDClassifier
  • Random Forest: RandomForestClassifier
  • Gradient Boosted Machines: GradientBoostingClassifier
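For reference, these traditional baselines pair a sparse bag-of-words representation with one of the classifiers above. The following is a minimal sketch rather than the author's tuned setup: it assumes the train_corpus / test_corpus and label arrays prepared in the Data Preparation section below, and the TF-IDF parameters are illustrative.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Illustrative TF-IDF + Naive Bayes baseline; parameter values are assumptions, not tuned.
tfidf_nb = Pipeline([
    ('tfidf', TfidfVectorizer(min_df=2, ngram_range=(1, 2))),
    ('clf', MultinomialNB(alpha=1.0)),
])
tfidf_nb.fit(train_corpus, train_label_names)
print('Test Accuracy:', tfidf_nb.score(test_corpus, test_label_names))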

There are several models under the deep learning umbrella that can replace traditional statistics-based predictive models, but word2vec is usually the first one that comes to mind.

Data Preparation

In [1]:
# -*- coding: utf-8 -*-

## Environment setup
import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
import nltk

# Import the author's custom helper functions
import sys
sys.path.insert(0, './code')
import text_normalizer as tn
In [2]:
data_df = pd.read_csv('data/clean_newsgroups.csv')
In [3]:
from sklearn.model_selection import train_test_split

train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names = \
    train_test_split(np.array(data_df['Clean Article']),
                     np.array(data_df['Target Label']),
                     np.array(data_df['Target Name']),
                     test_size=0.33, random_state=42)

train_corpus.shape, test_corpus.shape
Out[3]:
((12261,), (6039,))
In [4]:
tokenized_train = [tn.tokenizer.tokenize(text)
                   for text in train_corpus]
tokenized_test = [tn.tokenizer.tokenize(text)
                   for text in test_corpus]

Prediction Model - word2vec

gensim is the library chosen here because it ships a word2vec implementation. Install the gensim library with the following command:

$ pip install -U gensim
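Note that the Word2Vec call below uses the gensim 3.x parameter names (size, iter); in gensim 4.0 and later these were renamed to vector_size and epochs, and model.wv.index2word became model.wv.index_to_key. A quick way to confirm which version is installed:

import gensim

# Print the installed gensim version; the cells in this section assume the 3.x API.
print(gensim.__version__)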

In [5]:
import gensim
# build word2vec model
w2v_num_features = 100
w2v_model = gensim.models.Word2Vec(tokenized_train, size=w2v_num_features, window=100,
                                   min_count=2, sample=1e-3, sg=1, iter=5, workers=10)
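As a quick sanity check on the trained embeddings, we can look up the nearest neighbours of a token; the query word 'space' is just an illustrative choice and assumes it survived the min_count=2 cutoff.

# Nearest neighbours of an (assumed) in-vocabulary token in the trained word2vec space.
print(w2v_model.wv.most_similar('space', topn=5))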
In [6]:
def document_vectorizer(corpus, model, num_features):
    # Average the word vectors of all in-vocabulary tokens in each document
    # into a single fixed-length feature vector.
    vocabulary = set(model.wv.index2word)  # gensim 3.x; in gensim 4.x use model.wv.index_to_key
    
    def average_word_vectors(words, model, vocabulary, num_features):
        feature_vector = np.zeros((num_features,), dtype="float64")
        nwords = 0.
        
        for word in words:
            if word in vocabulary: 
                nwords = nwords + 1.
                feature_vector = np.add(feature_vector, model.wv[word])
        if nwords:
            feature_vector = np.divide(feature_vector, nwords)

        return feature_vector

    features = [average_word_vectors(tokenized_sentence, model, vocabulary, num_features)
                    for tokenized_sentence in corpus]
    return np.array(features)
In [7]:
# generate averaged word vector features from word2vec model
avg_wv_train_features = document_vectorizer(corpus=tokenized_train, model=w2v_model,
                                                     num_features=w2v_num_features)
avg_wv_test_features = document_vectorizer(corpus=tokenized_test, model=w2v_model,
                                                    num_features=w2v_num_features)
In [8]:
print(f'Word2Vec model:> Train features shape: \t {avg_wv_train_features.shape}\n \
                         Test features shape: \t {avg_wv_test_features.shape}')
Word2Vec model:> Train features shape: 	 (12261, 100)
                          Test features shape: 	 (6039, 100)

Model Performance

In [9]:
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import SGDClassifier

svm = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=500)
svm.fit(avg_wv_train_features, train_label_names)
svm_w2v_cv_scores = cross_val_score(svm, avg_wv_train_features, train_label_names, cv=5)
svm_w2v_cv_mean_score = np.mean(svm_w2v_cv_scores)
print('CV Accuracy (5-fold):', svm_w2v_cv_scores)
print('Mean CV Accuracy:', svm_w2v_cv_mean_score)
svm_w2v_test_score = svm.score(avg_wv_test_features, test_label_names)
print('Test Accuracy:', svm_w2v_test_score)
CV Accuracy (5-fold): [0.71434376 0.69067969 0.73439412 0.70383987 0.700491  ]
Mean CV Accuracy: 0.7087496891738334
Test Accuracy: 0.692167577413479

Prediction Model - GloVe

In [10]:
# feature engineering with GloVe model
train_nlp = [tn.nlp(item) for item in train_corpus]
train_glove_features = np.array([item.vector for item in train_nlp])

test_nlp = [tn.nlp(item) for item in test_corpus]
test_glove_features = np.array([item.vector for item in test_nlp])

print('GloVe model:> Train features shape:', train_glove_features.shape, 
      ' Test features shape:', test_glove_features.shape)
GloVe model:> Train features shape: (12261, 96)  Test features shape: (6039, 96)

Model Performance

In [11]:
svm = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=500)
svm.fit(train_glove_features, train_label_names)
svm_glove_cv_scores = cross_val_score(svm, train_glove_features, train_label_names, cv=5)
svm_glove_cv_mean_score = np.mean(svm_glove_cv_scores)
print('CV Accuracy (5-fold):', svm_glove_cv_scores)
print('Mean CV Accuracy:', svm_glove_cv_mean_score)
svm_glove_test_score = svm.score(test_glove_features, test_label_names)
print('Test Accuracy:', svm_glove_test_score)
CV Accuracy (5-fold): [0.19626168 0.15588116 0.16850265 0.16952614 0.17798691]
Mean CV Accuracy: 0.1736317081208183
Test Accuracy: 0.15283987415134956
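The GloVe-based features do far worse than word2vec here. The 96-dimensional document vectors suggest that tn.nlp is a small spaCy pipeline, and small spaCy models ship without a pretrained static vector table, so doc.vector falls back on context tensors rather than true GloVe vectors; that would explain the weak scores. A quick check, assuming tn.nlp is the spaCy Language object loaded inside text_normalizer:

# Inspect the spaCy pipeline behind tn.nlp (assumption: it is the loaded Language object).
print(tn.nlp.meta['lang'], tn.nlp.meta['name'])  # e.g. 'en', 'core_web_sm'
print(tn.nlp.vocab.vectors.shape)                # (0, 0) would mean no pretrained vector table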

Prediction Model - FastText

In [12]:
from gensim.models.fasttext import FastText

ft_num_features = 1000
# sg decides whether to use the skip-gram model (1) or CBOW (0)
ft_model = FastText(tokenized_train, size=ft_num_features, window=100, 
                    min_count=2, sample=1e-3, sg=1, iter=5, workers=10)
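Unlike word2vec, FastText composes word vectors from character n-grams, so it can still produce an embedding for a word it never saw during training. A small illustration using the gensim 3.x attribute names, with a deliberately misspelled query word:

# 'computre' is (presumably) absent from the trained vocabulary...
print('computre' in ft_model.wv.vocab)
# ...but FastText still assembles a vector for it from its character n-grams.
print(ft_model.wv['computre'].shape)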
In [13]:
# generate averaged word vector features from the FastText model
avg_ft_train_features = document_vectorizer(corpus=tokenized_train, model=ft_model,
                                                     num_features=ft_num_features)
avg_ft_test_features = document_vectorizer(corpus=tokenized_test, model=ft_model,
                                                    num_features=ft_num_features)
In [14]:
print('FastText model:> Train features shape:', avg_ft_train_features.shape, 
      ' Test features shape:', avg_ft_test_features.shape)
FastText model:> Train features shape: (12261, 1000)  Test features shape: (6039, 1000)

Model Performance

In [15]:
svm = SGDClassifier(loss='hinge', penalty='l2', random_state=42, max_iter=500)
svm.fit(avg_ft_train_features, train_label_names)
svm_ft_cv_scores = cross_val_score(svm, avg_ft_train_features, train_label_names, cv=5)
svm_ft_cv_mean_score = np.mean(svm_ft_cv_scores)
print('CV Accuracy (5-fold):', svm_ft_cv_scores)
print('Mean CV Accuracy:', svm_ft_cv_mean_score)
svm_ft_test_score = svm.score(avg_ft_test_features, test_label_names)
print('Test Accuracy:', svm_ft_test_score)
CV Accuracy (5-fold): [0.7220642  0.72364672 0.72909017 0.71078431 0.72176759]
Mean CV Accuracy: 0.7214706000605966
Test Accuracy: 0.726775956284153
In [16]:
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(solver='adam', alpha=1e-5, learning_rate='adaptive', early_stopping=True,
                    activation = 'relu', hidden_layer_sizes=(512, 512), random_state=42)
mlp.fit(avg_ft_train_features, train_label_names)
Out[16]:
MLPClassifier(activation='relu', alpha=1e-05, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=True, epsilon=1e-08,
              hidden_layer_sizes=(512, 512), learning_rate='adaptive',
              learning_rate_init=0.001, max_iter=200, momentum=0.9,
              n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
              random_state=42, shuffle=True, solver='adam', tol=0.0001,
              validation_fraction=0.1, verbose=False, warm_start=False)
In [17]:
mlp_ft_test_score = mlp.score(avg_ft_test_features, test_label_names)
print('Test Accuracy:', mlp_ft_test_score)
Test Accuracy: 0.7189932107964895
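Finally, the scores computed above can be collected into a single comparison table; this is a minimal sketch that only reuses variables already defined in this notebook.

# Gather the mean CV and test accuracies of each feature/classifier combination.
results = pd.DataFrame({
    'Model': ['SGD + word2vec', 'SGD + GloVe', 'SGD + FastText', 'MLP + FastText'],
    'Mean CV Accuracy': [svm_w2v_cv_mean_score, svm_glove_cv_mean_score,
                         svm_ft_cv_mean_score, np.nan],   # no CV was run for the MLP
    'Test Accuracy': [svm_w2v_test_score, svm_glove_test_score,
                      svm_ft_test_score, mlp_ft_test_score]
})
print(results)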