News Article Classification

Let's build a classifier that assigns news articles to one of 20 newsgroups.

Importing the classification toolkit

In [27]:
## Environment setup
import warnings
warnings.filterwarnings('ignore')

# -*- coding: utf-8 -*-

%matplotlib inline

## Import libraries
from sklearn.datasets import fetch_20newsgroups
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Import the author's custom helper functions
import sys
sys.path.insert(0, './code')
import text_normalizer as tn

The newsgroups dataset

The English-language newsgroup dataset is loaded from sklearn.datasets with the fetch_20newsgroups() function. The data_labels_map dictionary is then built from the list of target names; that is, the list is converted to a dictionary as follows:

data.target_names (list) → enumerate(data.target_names) → dict(enumerate(data.target_names)) (dictionary)
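As a minimal illustration of this conversion, using a hypothetical short list in place of data.target_names:

names = ['alt.atheism', 'comp.graphics', 'sci.space']  # toy stand-in for data.target_names
label_map = dict(enumerate(names))
print(label_map)  # {0: 'alt.atheism', 1: 'comp.graphics', 2: 'sci.space'}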

In [2]:
data = fetch_20newsgroups(subset='all', shuffle=True,
                          remove=('headers', 'footers', 'quotes'))
data_labels_map = dict(enumerate(data.target_names))

A list comprehension maps each numeric label to its target name; the raw article text, target labels, and target names are then put into a Python dictionary and stored as a pandas DataFrame.

In [3]:
corpus, target_labels, target_names = (data.data, data.target, [data_labels_map[label] for label in data.target])
data_df = pd.DataFrame({'Article': corpus, 'Target Label': target_labels, 'Target Name': target_names})
print(data_df.shape)
data_df.head(10)
(18846, 3)
Out[3]:
Article Target Label Target Name
0 \n\nI am sure some bashers of Pens fans are pr... 10 rec.sport.hockey
1 My brother is in the market for a high-perform... 3 comp.sys.ibm.pc.hardware
2 \n\n\n\n\tFinally you said what you dream abou... 17 talk.politics.mideast
3 \nThink!\n\nIt's the SCSI card doing the DMA t... 3 comp.sys.ibm.pc.hardware
4 1) I have an old Jasmine drive which I cann... 4 comp.sys.mac.hardware
5 \n\nBack in high school I worked as a lab assi... 12 sci.electronics
6 \n\nAE is in Dallas...try 214/241-6060 or 214/... 4 comp.sys.mac.hardware
7 \n[stuff deleted]\n\nOk, here's the solution t... 10 rec.sport.hockey
8 \n\n\nYeah, it's the second one. And I believ... 10 rec.sport.hockey
9 \nIf a Christian means someone who believes in... 19 talk.religion.misc

Data wrangling

Removing empty documents

In [4]:
total_nulls = data_df[data_df.Article.str.strip() == ''].shape[0]
print(f"깡통 문서: {total_nulls}")

data_df = data_df[~(data_df.Article.str.strip() == '')]
data_df.shape
Empty documents: 515
Out[4]:
(18331, 3)

Data cleaning

The nltk stopword list is loaded, and 'no' and 'not' are taken out of it so that negation is preserved for later bi-gram and sentiment work. The corpus is then cleaned with the custom normalize_corpus() function. Since this takes quite a while, the %%time magic command is used to measure how long the cleaning step takes.

In [5]:
%%time
import nltk
stopword_list = nltk.corpus.stopwords.words('english')
# just to keep negation if any in bi-grams
stopword_list.remove('no')
stopword_list.remove('not')

# normalize our corpus
norm_corpus = tn.normalize_corpus(corpus=data_df['Article'], html_stripping=True, contraction_expansion=True, 
                                  accented_char_removal=True, text_lower_case=True, text_lemmatization=True, 
                                  text_stemming=False, special_char_removal=True, remove_digits=True,
                                  stopword_removal=True, stopwords=stopword_list)
data_df['Clean Article'] = norm_corpus

data_df = data_df[['Article', 'Clean Article', 'Target Label', 'Target Name']]
data_df.head(10)
Out[5]:
Article Clean Article Target Label Target Name
0 \n\nI am sure some bashers of Pens fans are pr... sure basher pens fan pretty confused lack kind... 10 rec.sport.hockey
1 My brother is in the market for a high-perform... brother market high performance video card sup... 3 comp.sys.ibm.pc.hardware
2 \n\n\n\n\tFinally you said what you dream abou... finally say dream mediterranean new area great... 17 talk.politics.mideast
3 \nThink!\n\nIt's the SCSI card doing the DMA t... think scsi card dma transfer not disk scsi car... 3 comp.sys.ibm.pc.hardware
4 1) I have an old Jasmine drive which I cann... old jasmine drive not use new system understan... 4 comp.sys.mac.hardware
5 \n\nBack in high school I worked as a lab assi... back high school work lab assistant bunch expe... 12 sci.electronics
6 \n\nAE is in Dallas...try 214/241-6060 or 214/... ae dallas try tech support may line one get start 4 comp.sys.mac.hardware
7 \n[stuff deleted]\n\nOk, here's the solution t... stuff delete ok solution problem move canada y... 10 rec.sport.hockey
8 \n\n\nYeah, it's the second one. And I believ... yeah second one believe price try get good loo... 10 rec.sport.hockey
9 \nIf a Christian means someone who believes in... christian mean someone believe divinity jesus ... 19 talk.religion.misc
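The normalize_corpus() function comes from the author's text_normalizer module and is not reproduced here. As a rough sketch only (not the actual implementation), a normalization function of this kind typically chains steps such as HTML stripping, lower-casing, special-character and digit removal, lemmatization, and stopword removal:

# Illustrative sketch only -- NOT the author's text_normalizer implementation
import re
from bs4 import BeautifulSoup            # assumes beautifulsoup4 is installed
from nltk.stem import WordNetLemmatizer  # assumes nltk wordnet data is downloaded

lemmatizer = WordNetLemmatizer()

def simple_normalize(doc, stopwords):
    doc = BeautifulSoup(doc, 'html.parser').get_text()  # strip HTML tags
    doc = doc.lower()                                    # lower-case
    doc = re.sub(r'[^a-z\s]', ' ', doc)                  # drop digits and special characters
    tokens = [lemmatizer.lemmatize(tok) for tok in doc.split()]  # lemmatize
    tokens = [tok for tok in tokens if tok not in stopwords]     # remove stopwords
    return ' '.join(tokens)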

Cells matching the regular expression r'^(\s?)+$' (whitespace-only strings) are replaced with NA, and any row containing NA is then dropped.
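For reference, this pattern matches strings that consist only of (possibly zero) whitespace characters:

import re
pattern = r'^(\s?)+$'
print(bool(re.match(pattern, '')))      # True  - empty string
print(bool(re.match(pattern, '   ')))   # True  - whitespace only
print(bool(re.match(pattern, 'text')))  # False - real content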

In [6]:
data_df = data_df.replace(r'^(\s?)+$', np.nan, regex=True)
data_df.info()

data_df = data_df.dropna().reset_index(drop=True)
data_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 18331 entries, 0 to 18845
Data columns (total 4 columns):
Article          18331 non-null object
Clean Article    18300 non-null object
Target Label     18331 non-null int32
Target Name      18331 non-null object
dtypes: int32(1), object(3)
memory usage: 644.4+ KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18300 entries, 0 to 18299
Data columns (total 4 columns):
Article          18300 non-null object
Clean Article    18300 non-null object
Target Label     18300 non-null int32
Target Name      18300 non-null object
dtypes: int32(1), object(3)
memory usage: 500.5+ KB

Saving the cleaned data

Because the cleaning step takes a long time, the cleaned data is saved to a .csv file so it can be reloaded later.

In [7]:
data_df.to_csv('data/clean_newsgroups.csv', index=False, encoding='utf-8')

Train/test split

Split the data into training and test sets with train_test_split().

If an error of the form UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa9 in position 24574: invalid start byte occurs, specify encoding='utf-8' when saving with .to_csv() above and use the same encoding when reading the file back with .read_csv().

In [8]:
data_df = pd.read_csv('data/clean_newsgroups.csv', encoding="utf-8")

from sklearn.model_selection import train_test_split

train_corpus, test_corpus, train_label_nums, test_label_nums, train_label_names, test_label_names =\
                                 train_test_split(np.array(data_df['Clean Article']), np.array(data_df['Target Label']),
                                                       np.array(data_df['Target Name']), test_size=0.33, random_state=42)

train_corpus.shape, test_corpus.shape
Out[8]:
((12261,), (6039,))

Count the class frequencies of the train_label_names ndarray with a Counter object and store the result as a dictionary. Do the same for test_label_names, combine the two counts into a DataFrame, and sort it.

In [9]:
from collections import Counter

trd = dict(Counter(train_label_names))
tsd = dict(Counter(test_label_names))

(pd.DataFrame([[key, trd[key], tsd[key]] for key in trd], 
             columns=['Target Label', 'Train Count', 'Test Count'])
.sort_values(by=['Train Count', 'Test Count'],
             ascending=False))
Out[9]:
Target Label Train Count Test Count
15 sci.crypt 667 295
0 soc.religion.christian 662 312
5 rec.motorcycles 660 309
10 comp.sys.ibm.pc.hardware 654 309
8 comp.windows.x 653 327
11 rec.sport.hockey 651 322
19 sci.space 649 304
7 sci.med 648 312
17 rec.sport.baseball 648 303
4 sci.electronics 647 309
2 comp.graphics 646 307
1 comp.os.ms-windows.misc 642 304
13 misc.forsale 640 319
3 comp.sys.mac.hardware 612 315
18 talk.politics.mideast 606 311
16 rec.autos 590 343
14 talk.politics.guns 571 314
9 alt.atheism 512 267
12 talk.politics.misc 499 256
6 talk.religion.misc 404 201

News text → BoW and TF-IDF features

The news text is converted into a BoW (bag-of-words) representation to build the feature matrix X for article classification; this transformation is applied to both the training and test data. Alternatively, the text can be converted to TF-IDF features and fed to the model instead.

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score

# build BOW features on train articles
cv = CountVectorizer(binary=False, min_df=0.0, max_df=1.0)
cv_train_features = cv.fit_transform(train_corpus)

# transform test articles into features
cv_test_features = cv.transform(test_corpus)

print(f'BOW model:> \n Train features shape: \t {cv_train_features.shape},\n Test features shape: \t {cv_test_features.shape}')
BOW model:> 
 Train features shape: 	 (12261, 66258),
 Test features shape: 	 (6039, 66258)
In [11]:
from sklearn.feature_extraction.text import TfidfVectorizer

# build TFIDF features on train articles
tv = TfidfVectorizer(use_idf=True, min_df=0.0, max_df=1.0)
tv_train_features = tv.fit_transform(train_corpus)

# transform test articles into features
tv_test_features = tv.transform(test_corpus)

print(f'TFIDF model:> \n Train features shape: \t {tv_train_features.shape},\n Test features shape: \t {tv_test_features.shape}')
TFIDF model:> 
 Train features shape: 	 (12261, 66258),
 Test features shape: 	 (6039, 66258)

Model fitting with BoW features

  • Naive Bayes: MultinomialNB
  • Logistic regression: LogisticRegression
  • Linear SVM: LinearSVC
  • Linear SVM (SGD): SGDClassifier
  • Random Forest: RandomForestClassifier
  • Gradient Boosted Machines: GradientBoostingClassifier

Naive Bayes: MultinomialNB

In [12]:
%%time
from sklearn.naive_bayes import MultinomialNB

mnb = MultinomialNB(alpha=1)
mnb.fit(cv_train_features, train_label_names)
mnb_bow_cv_scores = cross_val_score(mnb, cv_train_features, train_label_names, cv=5)
mnb_bow_cv_mean_score = np.mean(mnb_bow_cv_scores)
print('CV Accuracy (5-fold):', mnb_bow_cv_scores)
print('Mean CV Accuracy:', mnb_bow_cv_mean_score)
mnb_bow_test_score = mnb.score(cv_test_features, test_label_names)
print('Test Accuracy:', mnb_bow_test_score)
CV Accuracy (5-fold): [0.68468102 0.67846968 0.6874745  0.68300654 0.6710311 ]
Mean CV Accuracy: 0.680932567031679
Test Accuracy: 0.6891869514820335
Wall time: 903 ms

Logistic regression: LogisticRegression

In [13]:
%%time
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
lr.fit(cv_train_features, train_label_names)
lr_bow_cv_scores = cross_val_score(lr, cv_train_features, train_label_names, cv=5)
lr_bow_cv_mean_score = np.mean(lr_bow_cv_scores)
print('CV Accuracy (5-fold):', lr_bow_cv_scores)
print('Mean CV Accuracy:', lr_bow_cv_mean_score)
lr_bow_test_score = lr.score(cv_test_features, test_label_names)
print('Test Accuracy:', lr_bow_test_score)
CV Accuracy (5-fold): [0.69646485 0.6971917  0.71154631 0.70179739 0.69394435]
Mean CV Accuracy: 0.7001889191294557
Test Accuracy: 0.7019374068554396
Wall time: 3min 31s

Linear SVM: LinearSVC

In [14]:
%%time
from sklearn.svm import LinearSVC

svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(cv_train_features, train_label_names)
svm_bow_cv_scores = cross_val_score(svm, cv_train_features, train_label_names, cv=5)
svm_bow_cv_mean_score = np.mean(svm_bow_cv_scores)
print('CV Accuracy (5-fold):', svm_bow_cv_scores)
print('Mean CV Accuracy:', svm_bow_cv_mean_score)
svm_bow_test_score = svm.score(cv_test_features, test_label_names)
print('Test Accuracy:', svm_bow_test_score)
CV Accuracy (5-fold): [0.63470134 0.64428164 0.64259486 0.65073529 0.64443535]
Mean CV Accuracy: 0.6433496980881808
Test Accuracy: 0.6514323563503891
Wall time: 32.9 s

SGDClassifier

In [15]:
%%time
from sklearn.linear_model import SGDClassifier

svm_sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=5, random_state=42)
svm_sgd.fit(cv_train_features, train_label_names)
svmsgd_bow_cv_scores = cross_val_score(svm_sgd, cv_train_features, train_label_names, cv=5)
svmsgd_bow_cv_mean_score = np.mean(svmsgd_bow_cv_scores)
print('CV Accuracy (5-fold):', svmsgd_bow_cv_scores)
print('Mean CV Accuracy:', svmsgd_bow_cv_mean_score)
svmsgd_bow_test_score = svm_sgd.score(cv_test_features, test_label_names)
print('Test Accuracy:', svmsgd_bow_test_score)
CV Accuracy (5-fold): [0.63835839 0.63085063 0.64789882 0.64460784 0.64238953]
Mean CV Accuracy: 0.6408210414127218
Test Accuracy: 0.6484517304189436
Wall time: 2.46 s

RandomForestClassifier

In [16]:
%%time
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(cv_train_features, train_label_names)
rfc_bow_cv_scores = cross_val_score(rfc, cv_train_features, train_label_names, cv=5)
rfc_bow_cv_mean_score = np.mean(rfc_bow_cv_scores)
print('CV Accuracy (5-fold):', rfc_bow_cv_scores)
print('Mean CV Accuracy:', rfc_bow_cv_mean_score)
rfc_bow_test_score = rfc.score(cv_test_features, test_label_names)
print('Test Accuracy:', rfc_bow_test_score)
CV Accuracy (5-fold): [0.52336449 0.4953195  0.51448388 0.52655229 0.51432079]
Mean CV Accuracy: 0.5148081877217623
Test Accuracy: 0.5363470773306839
Wall time: 35.5 s

GradientBoostingClassifier

In [17]:
%%time
from sklearn.ensemble import GradientBoostingClassifier

gbc = GradientBoostingClassifier(n_estimators=10, random_state=42)
gbc.fit(cv_train_features, train_label_names)
gbc_bow_cv_scores = cross_val_score(gbc, cv_train_features, train_label_names, cv=5)
gbc_bow_cv_mean_score = np.mean(gbc_bow_cv_scores)
print('CV Accuracy (5-fold):', gbc_bow_cv_scores)
print('Mean CV Accuracy:', gbc_bow_cv_mean_score)
gbc_bow_test_score = gbc.score(cv_test_features, test_label_names)
print('Test Accuracy:', gbc_bow_test_score)
CV Accuracy (5-fold): [0.54977651 0.55596256 0.55732354 0.54452614 0.54787234]
Mean CV Accuracy: 0.5510922190405918
Test Accuracy: 0.5548931942374565
Wall time: 11min 34s

Summary of models built on BoW features

In [18]:
pd.DataFrame([['Naive Bayes', mnb_bow_cv_mean_score, mnb_bow_test_score],
              ['Logistic Regression', lr_bow_cv_mean_score, lr_bow_test_score],
              ['Linear SVM', svm_bow_cv_mean_score, svm_bow_test_score],
              ['Linear SVM (SGD)', svmsgd_bow_cv_mean_score, svmsgd_bow_test_score],
              ['Random Forest', rfc_bow_cv_mean_score, rfc_bow_test_score],
              ['Gradient Boosted Machines', gbc_bow_cv_mean_score, gbc_bow_test_score]],
             columns=['Model', 'CV Score (TF)', 'Test Score (TF)'],
             ).T
Out[18]:
0 1 2 3 4 5
Model Naive Bayes Logistic Regression Linear SVM Linear SVM (SGD) Random Forest Gradient Boosted Machines
CV Score (TF) 0.680933 0.700189 0.64335 0.640821 0.514808 0.551092
Test Score (TF) 0.689187 0.701937 0.651432 0.648452 0.536347 0.554893

Model fitting with TF-IDF features

  • Naive Bayes: MultinomialNB
  • Logistic regression: LogisticRegression
  • Linear SVM: LinearSVC
  • Linear SVM (SGD): SGDClassifier
  • Random Forest: RandomForestClassifier
  • Gradient Boosted Machines: GradientBoostingClassifier

Naive Bayes: MultinomialNB

In [19]:
%%time
mnb = MultinomialNB(alpha=1)
mnb.fit(tv_train_features, train_label_names)
mnb_tfidf_cv_scores = cross_val_score(mnb, tv_train_features, train_label_names, cv=5)
mnb_tfidf_cv_mean_score = np.mean(mnb_tfidf_cv_scores)
print('CV Accuracy (5-fold):', mnb_tfidf_cv_scores)
print('Mean CV Accuracy:', mnb_tfidf_cv_mean_score)
mnb_tfidf_test_score = mnb.score(tv_test_features, test_label_names)
print('Test Accuracy:', mnb_tfidf_test_score)
CV Accuracy (5-fold): [0.70418529 0.7049247  0.71358629 0.7001634  0.71808511]
Mean CV Accuracy: 0.7081889583684935
Test Accuracy: 0.7059115747640338
Wall time: 842 ms

Logistic regression: LogisticRegression

In [20]:
%%time
lr = LogisticRegression(penalty='l2', max_iter=100, C=1, random_state=42)
lr.fit(tv_train_features, train_label_names)
lr_tfidf_cv_scores = cross_val_score(lr, tv_train_features, train_label_names, cv=5)
lr_tfidf_cv_mean_score = np.mean(lr_tfidf_cv_scores)
print('CV Accuracy (5-fold):', lr_tfidf_cv_scores)
print('Mean CV Accuracy:', lr_tfidf_cv_mean_score)
lr_tfidf_test_score = lr.score(tv_test_features, test_label_names)
print('Test Accuracy:', lr_tfidf_test_score)
CV Accuracy (5-fold): [0.74197481 0.73992674 0.74785802 0.74223856 0.74590835]
Mean CV Accuracy: 0.7435812946230624
Test Accuracy: 0.7395264116575592
Wall time: 27.4 s

Linear SVM: LinearSVC

In [21]:
%%time
svm = LinearSVC(penalty='l2', C=1, random_state=42)
svm.fit(tv_train_features, train_label_names)
svm_tfidf_cv_scores = cross_val_score(svm, tv_train_features, train_label_names, cv=5)
svm_tfidf_cv_mean_score = np.mean(svm_tfidf_cv_scores)
print('CV Accuracy (5-fold):', svm_tfidf_cv_scores)
print('Mean CV Accuracy:', svm_tfidf_cv_mean_score)
svm_tfidf_test_score = svm.score(tv_test_features, test_label_names)
print('Test Accuracy:', svm_tfidf_test_score)
CV Accuracy (5-fold): [0.7513206  0.75254375 0.76744186 0.75857843 0.75204583]
Mean CV Accuracy: 0.7563860944553763
Test Accuracy: 0.7597284318595794
Wall time: 9.74 s

Linear SVM (SGD): SGDClassifier

In [22]:
%%time
svm_sgd = SGDClassifier(loss='hinge', penalty='l2', max_iter=5, random_state=42)
svm_sgd.fit(tv_train_features, train_label_names)
svmsgd_tfidf_cv_scores = cross_val_score(svm_sgd, tv_train_features, train_label_names, cv=5)
svmsgd_tfidf_cv_mean_score = np.mean(svmsgd_tfidf_cv_scores)
print('CV Accuracy (5-fold):', svmsgd_tfidf_cv_scores)
print('Mean CV Accuracy:', svmsgd_tfidf_cv_mean_score)
svmsgd_tfidf_test_score = svm_sgd.score(tv_test_features, test_label_names)
print('Test Accuracy:', svmsgd_tfidf_test_score)
CV Accuracy (5-fold): [0.75538399 0.75905576 0.76172991 0.76184641 0.75859247]
Mean CV Accuracy: 0.7593217064103127
Test Accuracy: 0.7597284318595794
Wall time: 3.01 s

Random Forest: RandomForestClassifier

In [23]:
%%time
rfc = RandomForestClassifier(n_estimators=10, random_state=42)
rfc.fit(tv_train_features, train_label_names)
rfc_tfidf_cv_scores = cross_val_score(rfc, tv_train_features, train_label_names, cv=5)
rfc_tfidf_cv_mean_score = np.mean(rfc_tfidf_cv_scores)
print('CV Accuracy (5-fold):', rfc_tfidf_cv_scores)
print('Mean CV Accuracy:', rfc_tfidf_cv_mean_score)
rfc_tfidf_test_score = rfc.score(tv_test_features, test_label_names)
print('Test Accuracy:', rfc_tfidf_test_score)
CV Accuracy (5-fold): [0.52580252 0.53072853 0.54385965 0.53349673 0.51636661]
Mean CV Accuracy: 0.5300508086579743
Test Accuracy: 0.5310481867858917
Wall time: 39.8 s

Gradient Boosted Machines: GradientBoostingClassifier

In [24]:
%%time
gbc = GradientBoostingClassifier(n_estimators=10, random_state=42)
gbc.fit(tv_train_features, train_label_names)
gbc_tfidf_cv_scores = cross_val_score(gbc, tv_train_features, train_label_names, cv=5)
gbc_tfidf_cv_mean_score = np.mean(gbc_tfidf_cv_scores)
print('CV Accuracy (5-fold):', gbc_tfidf_cv_scores)
print('Mean CV Accuracy:', gbc_tfidf_cv_mean_score)
gbc_tfidf_test_score = gbc.score(tv_test_features, test_label_names)
print('Test Accuracy:', gbc_tfidf_test_score)
CV Accuracy (5-fold): [0.55343356 0.56613757 0.55609955 0.54207516 0.54746318]
Mean CV Accuracy: 0.5530418038909269
Test Accuracy: 0.5530717006126842
Wall time: 13min 18s

Summary of models built on BoW and TF-IDF features

In [25]:
pd.DataFrame([['Naive Bayes', mnb_bow_cv_mean_score, mnb_bow_test_score, 
               mnb_tfidf_cv_mean_score, mnb_tfidf_test_score],
              ['Logistic Regression', lr_bow_cv_mean_score, lr_bow_test_score, 
               lr_tfidf_cv_mean_score, lr_tfidf_test_score],
              ['Linear SVM', svm_bow_cv_mean_score, svm_bow_test_score, 
               svm_tfidf_cv_mean_score, svm_tfidf_test_score],
              ['Linear SVM (SGD)', svmsgd_bow_cv_mean_score, svmsgd_bow_test_score, 
               svmsgd_tfidf_cv_mean_score, svmsgd_tfidf_test_score],
              ['Random Forest', rfc_bow_cv_mean_score, rfc_bow_test_score, 
               rfc_tfidf_cv_mean_score, rfc_tfidf_test_score],
              ['Gradient Boosted Machines', gbc_bow_cv_mean_score, gbc_bow_test_score, 
               gbc_tfidf_cv_mean_score, gbc_tfidf_test_score]],
             columns=['Model', 'CV Score (TF)', 'Test Score (TF)', 'CV Score (TF-IDF)', 'Test Score (TF-IDF)'],
             ).T
Out[25]:
0 1 2 3 4 5
Model Naive Bayes Logistic Regression Linear SVM Linear SVM (SGD) Random Forest Gradient Boosted Machines
CV Score (TF) 0.680933 0.700189 0.64335 0.640821 0.514808 0.551092
Test Score (TF) 0.689187 0.701937 0.651432 0.648452 0.536347 0.554893
CV Score (TF-IDF) 0.708189 0.743581 0.756386 0.759322 0.530051 0.553042
Test Score (TF-IDF) 0.705912 0.739526 0.759728 0.759728 0.531048 0.553072

Hyperparameter tuning: model optimization

Naive Bayes: MultinomialNB

In [26]:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer

mnb_pipeline = Pipeline([('tfidf', TfidfVectorizer()),
                        ('mnb', MultinomialNB())
                       ])

param_grid = {'tfidf__ngram_range': [(1, 1), (1, 2)],
              'mnb__alpha': [1e-5, 1e-4, 1e-2, 1e-1, 1]
}

gs_mnb = GridSearchCV(mnb_pipeline, param_grid, cv=5, verbose=2)
gs_mnb = gs_mnb.fit(train_corpus, train_label_names)

gs_mnb.best_estimator_.get_params()
Fitting 5 folds for each of 10 candidates, totalling 50 fits
[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.3s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    1.2s remaining:    0.0s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.3s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.4s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 1) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 1), total=   1.3s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 2), total=   6.6s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 2), total=   6.9s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 2), total=   6.4s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 2), total=   6.4s
[CV] mnb__alpha=1e-05, tfidf__ngram_range=(1, 2) .....................
[CV] ...... mnb__alpha=1e-05, tfidf__ngram_range=(1, 2), total=   6.3s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 1) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 1) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 1) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 1) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 1) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 2) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 2), total=   8.5s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 2) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 2), total=   6.4s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 2) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 2), total=   6.0s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 2) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 2), total=   6.2s
[CV] mnb__alpha=0.0001, tfidf__ngram_range=(1, 2) ....................
[CV] ..... mnb__alpha=0.0001, tfidf__ngram_range=(1, 2), total=   5.9s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 1) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 1) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 1) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 1) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 1) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 2) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 2), total=   6.1s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 2) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 2), total=   6.1s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 2) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 2), total=   6.7s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 2) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 2), total=   6.7s
[CV] mnb__alpha=0.01, tfidf__ngram_range=(1, 2) ......................
[CV] ....... mnb__alpha=0.01, tfidf__ngram_range=(1, 2), total=   6.5s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 1) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 1) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 1) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 1) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 1) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 2) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 2), total=   6.2s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 2) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 2), total=   6.7s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 2) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 2), total=   6.5s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 2) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 2), total=   6.5s
[CV] mnb__alpha=0.1, tfidf__ngram_range=(1, 2) .......................
[CV] ........ mnb__alpha=0.1, tfidf__ngram_range=(1, 2), total=   6.8s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 1) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 1) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 1), total=   1.4s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 1) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 1), total=   1.4s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 1) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 1) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 1), total=   1.2s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 2) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 2), total=   6.5s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 2) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 2), total=   6.3s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 2) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 2), total=   6.1s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 2) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 2), total=   6.1s
[CV] mnb__alpha=1, tfidf__ngram_range=(1, 2) .........................
[CV] .......... mnb__alpha=1, tfidf__ngram_range=(1, 2), total=   6.4s
[Parallel(n_jobs=1)]: Done  50 out of  50 | elapsed:  3.2min finished
Out[26]:
{'memory': None,
 'steps': [('tfidf',
   TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                   dtype=<class 'numpy.float64'>, encoding='utf-8',
                   input='content', lowercase=True, max_df=1.0, max_features=None,
                   min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                   smooth_idf=True, stop_words=None, strip_accents=None,
                   sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                   tokenizer=None, use_idf=True, vocabulary=None)),
  ('mnb', MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True))],
 'verbose': False,
 'tfidf': TfidfVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.float64'>, encoding='utf-8',
                 input='content', lowercase=True, max_df=1.0, max_features=None,
                 min_df=1, ngram_range=(1, 1), norm='l2', preprocessor=None,
                 smooth_idf=True, stop_words=None, strip_accents=None,
                 sublinear_tf=False, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=None, use_idf=True, vocabulary=None),
 'mnb': MultinomialNB(alpha=0.01, class_prior=None, fit_prior=True),
 'tfidf__analyzer': 'word',
 'tfidf__binary': False,
 'tfidf__decode_error': 'strict',
 'tfidf__dtype': numpy.float64,
 'tfidf__encoding': 'utf-8',
 'tfidf__input': 'content',
 'tfidf__lowercase': True,
 'tfidf__max_df': 1.0,
 'tfidf__max_features': None,
 'tfidf__min_df': 1,
 'tfidf__ngram_range': (1, 1),
 'tfidf__norm': 'l2',
 'tfidf__preprocessor': None,
 'tfidf__smooth_idf': True,
 'tfidf__stop_words': None,
 'tfidf__strip_accents': None,
 'tfidf__sublinear_tf': False,
 'tfidf__token_pattern': '(?u)\\b\\w\\w+\\b',
 'tfidf__tokenizer': None,
 'tfidf__use_idf': True,
 'tfidf__vocabulary': None,
 'mnb__alpha': 0.01,
 'mnb__class_prior': None,
 'mnb__fit_prior': True}
In [27]:
cv_results = gs_mnb.cv_results_
results_df = pd.DataFrame({'rank': cv_results['rank_test_score'],
                           'params': cv_results['params'], 
                           'cv score (mean)': cv_results['mean_test_score'], 
                           'cv score (std)': cv_results['std_test_score']} 
              )
results_df = results_df.sort_values(by=['rank'], ascending=True)
pd.set_option('display.max_colwidth', 100)
results_df
Out[27]:
cv score (mean) cv score (std) params rank
4 0.771144 0.008799 {'mnb__alpha': 0.01, 'tfidf__ngram_range': (1, 1)} 1
5 0.769595 0.009956 {'mnb__alpha': 0.01, 'tfidf__ngram_range': (1, 2)} 2
6 0.758503 0.006191 {'mnb__alpha': 0.1, 'tfidf__ngram_range': (1, 1)} 3
3 0.751815 0.011238 {'mnb__alpha': 0.0001, 'tfidf__ngram_range': (1, 2)} 4
7 0.751488 0.008862 {'mnb__alpha': 0.1, 'tfidf__ngram_range': (1, 2)} 5
2 0.743822 0.009267 {'mnb__alpha': 0.0001, 'tfidf__ngram_range': (1, 1)} 6
1 0.742354 0.011240 {'mnb__alpha': 1e-05, 'tfidf__ngram_range': (1, 2)} 7
0 0.731670 0.007773 {'mnb__alpha': 1e-05, 'tfidf__ngram_range': (1, 1)} 8
8 0.709078 0.006693 {'mnb__alpha': 1, 'tfidf__ngram_range': (1, 1)} 9
9 0.699617 0.005549 {'mnb__alpha': 1, 'tfidf__ngram_range': (1, 2)} 10
In [28]:
best_mnb_test_score = gs_mnb.score(test_corpus, test_label_names)
print(f'Test Accuracy : {best_mnb_test_score}')
Test Accuracy : 0.7731412485510846

Document classification model wrap-up

Model performance

In [29]:
import model_evaluation_utils as meu
mnb_predictions = gs_mnb.predict(test_corpus)
unique_classes = list(set(test_label_names))
meu.get_metrics(true_labels=test_label_names, predicted_labels=mnb_predictions)
Accuracy: 0.7731
Precision: 0.7774
Recall: 0.7731
F1 Score: 0.7708
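meu.get_metrics() is part of the author's model_evaluation_utils helper module. If that module is not available, roughly equivalent numbers can be computed directly with scikit-learn (a sketch; the helper's rounding and averaging may differ slightly):

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

print('Accuracy:',  np.round(accuracy_score(test_label_names, mnb_predictions), 4))
print('Precision:', np.round(precision_score(test_label_names, mnb_predictions, average='weighted'), 4))
print('Recall:',    np.round(recall_score(test_label_names, mnb_predictions, average='weighted'), 4))
print('F1 Score:',  np.round(f1_score(test_label_names, mnb_predictions, average='weighted'), 4))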
In [30]:
meu.display_classification_report(true_labels=test_label_names, 
                                  predicted_labels=mnb_predictions, classes=unique_classes)
                          precision    recall  f1-score   support

          comp.windows.x       0.87      0.80      0.83       327
                 sci.med       0.88      0.88      0.88       312
           comp.graphics       0.67      0.73      0.70       307
               sci.space       0.82      0.85      0.84       304
            misc.forsale       0.79      0.69      0.74       319
 comp.os.ms-windows.misc       0.73      0.70      0.72       304
      talk.politics.guns       0.71      0.81      0.76       314
      talk.politics.misc       0.65      0.68      0.66       256
         rec.motorcycles       0.77      0.76      0.77       309
        rec.sport.hockey       0.93      0.92      0.93       322
comp.sys.ibm.pc.hardware       0.63      0.76      0.69       309
   comp.sys.mac.hardware       0.79      0.75      0.77       315
             alt.atheism       0.68      0.67      0.67       267
               rec.autos       0.84      0.76      0.80       343
         sci.electronics       0.71      0.72      0.71       309
      rec.sport.baseball       0.91      0.89      0.90       303
      talk.religion.misc       0.77      0.33      0.46       201
  soc.religion.christian       0.70      0.87      0.78       312
               sci.crypt       0.76      0.85      0.80       295
   talk.politics.mideast       0.86      0.86      0.86       311

                accuracy                           0.77      6039
               macro avg       0.77      0.76      0.76      6039
            weighted avg       0.78      0.77      0.77      6039

Confusion matrix

In [31]:
label_data_map = {v:k for k, v in data_labels_map.items()}
label_map_df = pd.DataFrame(list(label_data_map.items()), columns=['Label Name', 'Label Number'])
label_map_df
Out[31]:
Label Name Label Number
0 alt.atheism 0
1 comp.graphics 1
2 comp.os.ms-windows.misc 2
3 comp.sys.ibm.pc.hardware 3
4 comp.sys.mac.hardware 4
5 comp.windows.x 5
6 misc.forsale 6
7 rec.autos 7
8 rec.motorcycles 8
9 rec.sport.baseball 9
10 rec.sport.hockey 10
11 sci.crypt 11
12 sci.electronics 12
13 sci.med 13
14 sci.space 14
15 soc.religion.christian 15
16 talk.politics.guns 16
17 talk.politics.mideast 17
18 talk.politics.misc 18
19 talk.religion.misc 19
In [32]:
unique_class_nums = label_map_df['Label Number'].values
mnb_prediction_class_nums = [label_data_map[item] for item in mnb_predictions]
meu.display_confusion_matrix_pretty(true_labels=test_label_nums, 
                                   predicted_labels=mnb_prediction_class_nums, classes=unique_class_nums)
Out[32]:
Predicted:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
Actual: 0 178 2 0 2 0 1 0 2 4 0 3 4 2 1 2 25 8 12 13 8
1 2 223 11 16 8 14 4 0 1 2 0 9 6 4 4 0 3 0 0 0
2 0 16 212 37 10 11 4 0 0 0 0 3 5 0 3 0 0 1 2 0
3 0 10 27 235 11 2 8 0 0 0 0 2 14 0 0 0 0 0 0 0
4 0 9 10 24 235 2 10 2 0 0 0 7 14 1 1 0 0 0 0 0
5 0 33 9 6 3 263 2 0 2 0 1 2 3 0 1 0 1 1 0 0
6 1 5 7 26 11 2 220 13 3 2 1 7 13 1 3 1 2 0 1 0
7 0 0 2 3 4 1 8 262 28 1 2 1 11 2 3 2 5 1 7 0
8 1 0 0 1 2 1 5 19 236 3 4 3 4 3 2 5 8 2 9 1
9 2 2 1 1 0 1 2 0 4 270 7 2 1 0 1 3 0 3 3 0
10 2 1 1 1 0 1 0 0 1 10 296 0 0 2 1 1 0 1 4 0
11 3 3 1 1 0 1 1 0 3 0 0 251 6 1 3 2 10 4 4 1
12 1 9 4 17 11 0 9 6 3 2 0 12 221 4 7 0 3 0 0 0
13 3 3 1 0 0 1 1 1 4 1 0 2 3 275 8 3 2 1 3 0
14 7 10 2 0 0 1 2 2 2 0 1 3 5 3 258 1 3 2 2 0
15 16 1 0 1 1 1 0 0 0 0 0 2 2 3 1 271 5 0 3 5
16 1 1 1 0 0 0 1 2 2 1 1 8 1 1 2 5 255 5 24 3
17 6 0 0 1 0 0 0 1 4 1 1 6 1 0 1 4 5 269 11 0
18 5 2 0 0 0 0 0 0 5 1 0 5 1 9 9 4 32 8 173 2
19 34 1 0 0 0 1 0 1 4 2 0 1 0 3 3 59 15 4 7 66
In [33]:
unique_classes = label_map_df['Label Name'].values
meu.display_confusion_matrix_pretty(true_labels=test_label_names, 
                                    predicted_labels=mnb_predictions, classes=unique_classes)
Out[33]:
Predicted:
alt.atheism comp.graphics comp.os.ms-windows.misc comp.sys.ibm.pc.hardware comp.sys.mac.hardware comp.windows.x misc.forsale rec.autos rec.motorcycles rec.sport.baseball rec.sport.hockey sci.crypt sci.electronics sci.med sci.space soc.religion.christian talk.politics.guns talk.politics.mideast talk.politics.misc talk.religion.misc
Actual: alt.atheism 178 2 0 2 0 1 0 2 4 0 3 4 2 1 2 25 8 12 13 8
comp.graphics 2 223 11 16 8 14 4 0 1 2 0 9 6 4 4 0 3 0 0 0
comp.os.ms-windows.misc 0 16 212 37 10 11 4 0 0 0 0 3 5 0 3 0 0 1 2 0
comp.sys.ibm.pc.hardware 0 10 27 235 11 2 8 0 0 0 0 2 14 0 0 0 0 0 0 0
comp.sys.mac.hardware 0 9 10 24 235 2 10 2 0 0 0 7 14 1 1 0 0 0 0 0
comp.windows.x 0 33 9 6 3 263 2 0 2 0 1 2 3 0 1 0 1 1 0 0
misc.forsale 1 5 7 26 11 2 220 13 3 2 1 7 13 1 3 1 2 0 1 0
rec.autos 0 0 2 3 4 1 8 262 28 1 2 1 11 2 3 2 5 1 7 0
rec.motorcycles 1 0 0 1 2 1 5 19 236 3 4 3 4 3 2 5 8 2 9 1
rec.sport.baseball 2 2 1 1 0 1 2 0 4 270 7 2 1 0 1 3 0 3 3 0
rec.sport.hockey 2 1 1 1 0 1 0 0 1 10 296 0 0 2 1 1 0 1 4 0
sci.crypt 3 3 1 1 0 1 1 0 3 0 0 251 6 1 3 2 10 4 4 1
sci.electronics 1 9 4 17 11 0 9 6 3 2 0 12 221 4 7 0 3 0 0 0
sci.med 3 3 1 0 0 1 1 1 4 1 0 2 3 275 8 3 2 1 3 0
sci.space 7 10 2 0 0 1 2 2 2 0 1 3 5 3 258 1 3 2 2 0
soc.religion.christian 16 1 0 1 1 1 0 0 0 0 0 2 2 3 1 271 5 0 3 5
talk.politics.guns 1 1 1 0 0 0 1 2 2 1 1 8 1 1 2 5 255 5 24 3
talk.politics.mideast 6 0 0 1 0 0 0 1 4 1 1 6 1 0 1 4 5 269 11 0
talk.politics.misc 5 2 0 0 0 0 0 0 5 1 0 5 1 9 9 4 32 8 173 2
talk.religion.misc 34 1 0 0 0 1 0 1 4 2 0 1 0 3 3 59 15 4 7 66
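display_confusion_matrix_pretty() also comes from the author's helper module; without it, a plain pandas cross-tabulation produces essentially the same confusion matrix (a sketch):

conf_df = pd.crosstab(pd.Series(test_label_names, name='Actual'),
                      pd.Series(mnb_predictions, name='Predicted'))
conf_df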

Detailed prediction results for individual articles

In [34]:
label_map_df[label_map_df['Label Number'].isin([0, 15, 19])]
Out[34]:
Label Name Label Number
0 alt.atheism 0
15 soc.religion.christian 15
19 talk.religion.misc 19
In [35]:
train_idx, test_idx = train_test_split(np.array(range(len(data_df['Article']))), test_size=0.33, random_state=42)
test_idx
Out[35]:
array([ 4097,  8528,  7621, ..., 14979,  4772,  7800])
In [36]:
predict_probas = gs_mnb.predict_proba(test_corpus).max(axis=1)
test_df = data_df.iloc[test_idx]
test_df['Predicted Name'] = mnb_predictions
test_df['Predicted Confidence'] = predict_probas
test_df.head()
Out[36]:
Article Clean Article Target Label Target Name Predicted Name Predicted Confidence
4097 \r\nDid you watch the games????\r\n\r\n watch game 10 rec.sport.hockey rec.sport.hockey 0.529768
8528 I too have been watching the IIsi speedup reports and plan to upgrade in\r\nthe next few weeks. ... watch iisi speedup report plan upgrade next week plan build small board different crystal able s... 4 comp.sys.mac.hardware comp.sys.mac.hardware 0.437909
7621 \r\nI think one (not ideal) solution is to use the\r\ntracing utility (can't remember the name, ... think one not ideal solution use trace utility not remember name sorry include corel draw w pack... 1 comp.graphics comp.graphics 0.980835
4754 \r\n I am curious about knowing which commericial cars today\r\n have v engines.\r\n\r\n ... curious know commericial car today v engine v not know v legend mr mr vw golf passat l vr inline... 7 rec.autos rec.autos 0.999877
15903 DH>>Does anyone out their have a mountain tape backup that I could compare\r\nDH>>notes with, (j... dhdoe anyone mountain tape backup could compare dhnotes jumper setting software ect dhor anyone ... 3 comp.sys.ibm.pc.hardware comp.sys.ibm.pc.hardware 0.382946
In [37]:
pd.set_option('display.max_colwidth', 200)
res_df = (test_df[(test_df['Target Name'] == 'talk.religion.misc') & (test_df['Predicted Name'] == 'soc.religion.christian')]
       .sort_values(by=['Predicted Confidence'], ascending=False).head(5))
res_df
Out[37]:
Article Clean Article Target Label Target Name Predicted Name Predicted Confidence
4237 The Nicene Creed\r\n\r\nWE BELIEVE in one God the Father Almighty, Maker of heaven and earth, and of all things visible and invisible.\r\nAnd in one Lord Jesus Christ, the only-begotten Son of God... nicene creed believe one god father almighty maker heaven earth thing visible invisible one lord jesus christ begotten son god beget father world god god light light god god beget not make one sub... 19 talk.religion.misc soc.religion.christian 0.991741
4304 \r\nOK, here's at least one Christian's answer:\r\n\r\nJesus was a JEW, not a Christian. In this context Matthew 5:14-19 makes\r\nsense. Matt 5:17 "Do not think that I [Jesus] came to abolish th... ok least one christians answer jesus jew not christian context matthew make sense matt not think jesus come abolish law prophets not come abolish fulfill jesus live jewish law however culmination ... 19 talk.religion.misc soc.religion.christian 0.991640
14513 iank@microsoft.com (Ian Kennedy) writes...\r\n\r\n\r\nMore along the lines of Hebrews 12:25-29, I reckon...\r\n\r\n\tSee that you refuse not him that speaks. For if they\r\n\tescaped not who refus... iankmicrosoft com ian kennedy write along line hebrews reckon see refuse not speak escape not refuse spake earth much shall not escape turn away speak heaven whose voice shake earth promise say ye... 19 talk.religion.misc soc.religion.christian 0.991625
16678 \r\nJesus did not say that he was the fulfillment of the Law, and, unless\r\nI'm mistaken, heaven and earth have not yet passed away. Am I mistaken?\r\nAnd, even assuming that one can just gloss o... jesus not say fulfillment law unless mistaken heaven earth not yet pass away mistaken even assume one gloss portion word jesus really think accomplish not jesus say jew annul v say jesus record wo... 19 talk.religion.misc soc.religion.christian 0.987254
13764 : \r\n: I am a Mormon. I believe in Christ, that he is alive. He raised himself\r\n: [Text deleted]\r\n:\r\n: I learned that the concept of the Holy Trinity was never taught by Jesus\r\n: Christ... mormon believe christ alive raise text delete learn concept holy trinity never teach jesus christ agree council clergyman long christ ascended man no authority speak jesus never teach concept trin... 19 talk.religion.misc soc.religion.christian 0.976295
In [38]:
pd.set_option('display.max_colwidth', 200)
res_df = (test_df[(test_df['Target Name'] == 'talk.religion.misc') & (test_df['Predicted Name'] == 'alt.atheism')]
       .sort_values(by=['Predicted Confidence'], ascending=False).head(5))
res_df
Out[38]:
Article Clean Article Target Label Target Name Predicted Name Predicted Confidence
4706 This discussion on "objective" seems to be falling into solipsism (Eg: the\r\nrecent challenge from Frank Dwyer, for someone to prove that he can actually\r\nobserve phenomena). Someones even mad... discussion objective seem fall solipsism eg recent challenge frank dwyer someone prove actually observe phenomenon someones even make statement science subjective even atom subjective get bit sill... 19 talk.religion.misc alt.atheism 0.969545
914 \r\n\r\nAtoms are not objective. They aren't even real. What scientists call\r\nan atom is nothing more than a mathematical model that describes \r\ncertain physical, observable properties of ou... atoms not objective not even real scientist call atom nothing mathematical model describe certain physical observable property surrounding subjective objective though approach scientist take discu... 19 talk.religion.misc alt.atheism 0.942695
11820 \r\nI think that if a theist were truly objective and throws out the notion that\r\nGod definitely exists and starts from scratch to prove to themselves that\r\nthe scriptures are the whole truth ... think theist truly objective throw notion god definitely exist start scratch prove scripture whole truth person would no longer theist miss something people convert non theism theism bring non the... 19 talk.religion.misc alt.atheism 0.821344
6020 \r\n\r\n["it" is Big Bang]\r\n\r\nSince you asked... from the Big Bang to the formation of atoms is about\r\n10E11 seconds. As for the "color": bright. Very very bright. \r\n\r\n\r\nI don't. I be... big bang since ask big bang formation atom e second color bright bright not believe current theory cosmology fairly well support observational evidence not well support say evolution relativity an... 19 talk.religion.misc alt.atheism 0.799565
13231 \r\n\r\nSpeaking as one who knows relativity and quantum mechanics, I say: \r\nBullshit.\r\n\r\n\r\nSpeaking as one who has taken LSD, I say: \r\nBullshit.\r\n\r\n\r\n\r\nHow could striving toward... speak one know relativity quantum mechanic say bullshit speak one take lsd say bullshit could strive toward ideal way useful ideal no objective existence mark pundurs 19 talk.religion.misc alt.atheism 0.748845