textblob - An Object-Oriented NLP Library

textblob is built on top of NLTK and bundles many features that make text processing easy. As the introduction on the textblob website puts it, the motto is "Simplified Text Processing": create a TextBlob object, and its main methods make common text-processing tasks simple.

  • Noun phrase extraction
  • Part-of-speech tagging
  • Sentiment analysis
  • Classification (Naive Bayes, Decision Tree)
  • Language translation and detection powered by Google Translate
  • Tokenization (splitting text into words and sentences)
  • Word and phrase frequencies
  • Parsing
  • n-grams
  • Word inflection (pluralization and singularization) and lemmatization
  • Spelling correction
  • Add new models or languages through extensions
  • WordNet integration

Installation

To use the textblob library, first install the library itself, and then install the NLTK corpora that TextBlob relies on.

With conda, the textblob library can be installed with the following command.

$ conda install -c conda-forge textblob

Install the NLTK corpora as well, using the command shown below. Each corpus serves a role:

  • Brown Corpus: part-of-speech tagging
  • Punkt: English sentence tokenization
  • WordNet: word definitions, synonyms, and antonyms
  • Averaged Perceptron Tagger: part-of-speech tagging
  • conll2000: chunking text into components such as nouns and verbs
  • Movie Reviews: sentiment analysis

$ ipython -m textblob.download_corpora

[nltk_data] Downloading package brown to
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]   Unzipping taggers\averaged_perceptron_tagger.zip.
[nltk_data] Downloading package conll2000 to
[nltk_data]   Package conll2000 is already up-to-date!
[nltk_data] Downloading package movie_reviews to
[nltk_data]   Package movie_reviews is already up-to-date!
Finished.
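If the download script is unavailable, the same corpora can also be fetched programmatically through NLTK itself; a minimal sketch using the package names from the download log above:

```python
import nltk

# the six corpora that textblob.download_corpora installs
CORPORA = ["brown", "punkt", "wordnet",
           "averaged_perceptron_tagger", "conll2000", "movie_reviews"]

for name in CORPORA:
    nltk.download(name, quiet=True)  # skips packages that are already up to date
```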

textblob Hello World

In [1]:
from textblob import TextBlob
import pandas as pd

text = '''
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
'''

blob = TextBlob(text)

for sentence in blob.sentences:
    print(f"- polarity {sentence.sentiment.polarity} : {sentence}")
- polarity 0.06000000000000001 : 
The titular threat of The Blob has always struck me as the ultimate movie
monster: an insatiably hungry, amoeba-like mass able to penetrate
virtually any safeguard, capable of--as a doomed doctor chillingly
describes it--"assimilating flesh on contact.
- polarity -0.34166666666666673 : Snide comparisons to gelatin be damned, it's a concept with the most
devastating of potential consequences, not unlike the grey goo scenario
proposed by technological theorists fearful of
artificial intelligence run rampant.
In [2]:
blob.words
Out[2]:
WordList(['The', 'titular', 'threat', 'of', 'The', 'Blob', 'has', 'always', 'struck', 'me', 'as', 'the', 'ultimate', 'movie', 'monster', 'an', 'insatiably', 'hungry', 'amoeba-like', 'mass', 'able', 'to', 'penetrate', 'virtually', 'any', 'safeguard', 'capable', 'of', 'as', 'a', 'doomed', 'doctor', 'chillingly', 'describes', 'it', 'assimilating', 'flesh', 'on', 'contact', 'Snide', 'comparisons', 'to', 'gelatin', 'be', 'damned', 'it', "'s", 'a', 'concept', 'with', 'the', 'most', 'devastating', 'of', 'potential', 'consequences', 'not', 'unlike', 'the', 'grey', 'goo', 'scenario', 'proposed', 'by', 'technological', 'theorists', 'fearful', 'of', 'artificial', 'intelligence', 'run', 'rampant'])
In [3]:
blob.tags[:5]
# [('The', 'DT'),
#  ('titular', 'JJ'),
#  ('threat', 'NN'),
#  ('of', 'IN'),
#  ('The', 'DT')]
text_df = pd.DataFrame(blob.tags, columns=['word', 'pos'])

text_df.groupby('pos').count()
Out[3]:
word
pos
DT 9
IN 10
JJ 12
NN 16
NNP 1
NNS 3
PRP 3
RB 5
RBS 1
TO 2
VB 3
VBG 1
VBN 3
VBZ 3
In [4]:
blob.noun_phrases
Out[4]:
WordList(['titular threat', 'blob', 'ultimate movie monster', 'amoeba-like mass', 'snide', 'potential consequences', 'grey goo scenario', 'technological theorists fearful', 'artificial intelligence run rampant'])

The WordNet Dictionary

Word Definitions

In [5]:
from textblob import Word
from textblob.wordnet import VERB
word = Word("boy")
word.definitions
Out[5]:
['a youthful male person',
 'a friendly informal reference to a grown man',
 'a male human offspring',
 '(ethnic slur) offensive and disparaging term for Black man']

Synonyms

In [6]:
word.synsets
Out[6]:
[Synset('male_child.n.01'),
 Synset('boy.n.02'),
 Synset('son.n.01'),
 Synset('boy.n.04')]
In [7]:
synonyms = set()
for synset in word.synsets:
    for lemma in synset.lemmas():
        synonyms.add(lemma.name())
        
print(synonyms)        
{'son', 'boy', 'male_child'}

Antonyms

In [8]:
lemmas = word.synsets[0].lemmas()
lemmas
Out[8]:
[Lemma('male_child.n.01.male_child'), Lemma('male_child.n.01.boy')]
In [9]:
lemmas[0].antonyms()
Out[9]:
[Lemma('female_child.n.01.female_child')]

Language Detection and Translation

When text comes in, identifying which language it is written in is an important first step for subsequent NLP work. The language codes returned by detection follow the Google AdWords API language codes, excerpted below. Note that detect_language() and translate() rely on the Google Translate service and have been deprecated in recent TextBlob releases, which recommend calling the official Google Translate API directly.

LanguageName            LanguageCode   CriteriaId
Arabic                  ar             1019
Bulgarian               bg             1020
Catalan                 ca             1038
Chinese (simplified)    zh_CN          1017
Chinese (traditional)   zh_TW          1018
Croatian                hr             1039
Czech                   cs             1021
Danish                  da             1009
Dutch                   nl             1010
English                 en             1000
Korean                  ko             1012

Language Detection

Let's take a sentence from a Business Interpretation & Translation Test (English) Level 3 example, run it through the Google-powered translation, and see how far it diverges from the model translation. First we detect which language the text is written in, then translate it.

I am writing to thank you for your hospitality in my unexpected visit to your house.
[Model translation] 통보 없이 방문했음에도 호의를 보여 주신 것에 감사 드리고자 글을 씁니다.

In [10]:
korean_text = "통보 없이 방문했음에도 호의를 보여 주신 것에 감사 드리고자 글을 씁니다."

ko_blob = TextBlob(korean_text)
ko_blob.detect_language()
Out[10]:
'ko'

Korean-to-English Translation

In [11]:
ko_blob.translate(to='en')
Out[11]:
TextBlob("I would like to thank you for your kindness even though I visited without any notice.")

Spelling Correction

Lately, typos have become noticeable even in news articles. There may be several reasons, but the sheer growth in the volume of articles is probably the biggest one. Whatever the cause, it pays to run a spelling check before any NLP work to raise the quality of the text, so let's try the feature built into TextBlob.

In [12]:
from textblob import Word

typo_sentences = '''
Analytics Vidhya is a gret platfrm to learn data scence. \n
When I grow up, I want to be a technincian! \n
If you think about it, it is orignal. \n
Take one capsule by mouth nightly 3 hours before ded. \n
Violators will be towed and find $50.
'''

for line in typo_sentences.splitlines():
    print(TextBlob(line).correct())    
Analytics Vidhya is a great platform to learn data science. 

When I grow up, I want to be a technincian! 

Of you think about it, it is original. 

Take one capsule by mouth nightly 3 hours before did. 

Violators will be bowed and find $50.

For each word in a sentence, spellcheck() suggests corrections as tuples pairing each candidate word with its probability.

In [13]:
sample_sentence = typo_sentences.splitlines()[1]
# Analytics Vidhya is a gret platfrm to learn data scence.

sample_words = sample_sentence.split()

for word in sample_words:
    word_prob = Word(word).spellcheck()
    print(word_prob)
[('Analytics', 0.0)]
[('Vidhya', 0.0)]
[('is', 1.0)]
[('a', 1.0)]
[('great', 0.5351351351351351), ('get', 0.3162162162162162), ('grew', 0.11216216216216217), ('grey', 0.026351351351351353), ('greet', 0.006081081081081081), ('fret', 0.002702702702702703), ('grit', 0.0006756756756756757), ('cret', 0.0006756756756756757)]
[('platform', 1.0)]
[('to', 1.0)]
[('learn', 1.0)]
[('data', 1.0)]
[('science', 0.41379310344827586), ('scene', 0.33793103448275863), ('sciences', 0.10344827586206896), ('scenes', 0.08275862068965517), ('scented', 0.041379310344827586), ('spence', 0.020689655172413793)]

Text Summarization

One simple text summarization technique is to extract the nouns from the text.

In [14]:
import random

blob = TextBlob('Analytics Vidhya is a thriving community for data driven industry. This platform allows \
people to know more about analytics from its articles, Q&A forum, and learning paths. Also, we help \
professionals & amateurs to sharpen their skillsets by providing a platform to participate in Hackathons.')

nouns = list()
for word, tag in blob.tags:
    if tag == 'NN':
        nouns.append(word.lemmatize())

print ("This text is about...")
for item in random.sample(nouns, 5):
    word = Word(item)
    print (word.pluralize())
This text is about...
communities
industries
platforms
platforms
forums
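Since random.sample varies between runs, counting noun frequencies with collections.Counter gives a deterministic alternative; a sketch with the nouns hard-coded to match those extracted above:

```python
from collections import Counter

# nouns as collected by the NN-filtering loop above (hard-coded for illustration)
nouns = ['community', 'industry', 'platform', 'platform', 'forum']

# the three most frequent nouns, most common first
top_nouns = [word for word, count in Counter(nouns).most_common(3)]
print(top_nouns)  # 'platform' appears twice, so it comes first
```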

Sentence Sentiment Classification

Training/Test Datasets

In [15]:
training = [
    ('Tom Holland is a terrible spiderman.','pos'),
    ('a terrible Javert (Russell Crowe) ruined Les Miserables for me...','pos'),
    ('The Dark Knight Rises is the greatest superhero movie ever!','neg'),
    ('Fantastic Four should have never been made.','pos'),
    ('Wes Anderson is my favorite director!','neg'),
    ('Captain America 2 is pretty awesome.','neg'),
    ('Let\'s pretend "Batman and Robin" never happened..','pos'),
]
testing = [
    ('Superman was never an interesting character.','pos'),
    ('Fantastic Mr Fox is an awesome film!','neg'),
    ('Dragonball Evolution is simply terrible!!','pos')
]

A Machine Learning Classifier

In [16]:
from textblob import classifiers

classifier = classifiers.NaiveBayesClassifier(training)

print(f'Model accuracy: {classifier.accuracy(testing):.2f}')
Model accuracy: 1.00

Informative Features

In [17]:
classifier.show_informative_features(3)
Most Informative Features
            contains(is) = True              neg : pos    =      2.9 : 1.0
      contains(terrible) = False             neg : pos    =      1.8 : 1.0
             contains(a) = False             neg : pos    =      1.8 : 1.0

Prediction

In [18]:
blob = TextBlob('the weather is terrible!', classifier=classifier)
print (blob.classify())
neg