Regular expressions

The first and simplest tool to reach for in named entity recognition is the regular expression. The most obvious heuristic is to treat every word that begins with a capital letter as an entity. A regular expression written to match words with an initial capital letter pulls every such word out of a sentence, in the order in which it appears.

import re

entity_pattern = re.compile("[A-Z]{1}[a-zA-Z]*")

another_sentence = "John is from Atlanta"
entity_pattern.findall(another_sentence)

['John', 'Atlanta']
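
The capital-letter heuristic has blind spots: it also matches an ordinary word at the start of a sentence, and it splits multi-word names into separate tokens. The following is a minimal sketch of both problems, plus one possible refinement that joins a run of capitalized words. The sample sentences and the name `multiword_pattern` are illustrative assumptions, not part of the original example.

```python
import re

entity_pattern = re.compile(r"[A-Z][a-zA-Z]*")

# A sentence-initial word is capitalized too, so it is reported as an entity.
print(entity_pattern.findall("The company moved to Atlanta"))
# ['The', 'Atlanta']

# Multi-word names come back as separate tokens.
print(entity_pattern.findall("John Smith is from New York"))
# ['John', 'Smith', 'New', 'York']

# Possible refinement: treat a run of capitalized words as a single entity.
multiword_pattern = re.compile(r"[A-Z][a-zA-Z]*(?:\s[A-Z][a-zA-Z]*)*")
print(multiword_pattern.findall("John Smith is from New York"))
# ['John Smith', 'New York']
```

The refinement still cannot tell "The" from a real name, which is one reason the later sections rely on part-of-speech tags rather than capitalization alone.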

Regular expressions for entity recognition

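The example above works reasonably well, but consider a different case from chatbot development: recognizing a greeting. Here the chatbot detects the "greeting" intent with hand-written pattern matching in the style of a regular expression ... the upshot is that the "greeting" intent can be detected, but the matcher is brittle and every variant has to be coded by hand, which shows both the potential and the limits of regular expressions.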

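def identify_greeting(string):
    """Return the recognized greeting if the string matches a greeting pattern,
       e.g. 안녕 ("hi")."""
    if string[:2] == '안녕':
        # only the first two characters are compared, so every variant reduces to '안녕'
        if string[:2] in ['안녕', '안녕하세요 ', '안녕ㅎ', '안녕!']:
            return string[:2]
    elif string[:5] == 'Hello':
        # the sixth character must be absent, a space, a comma, or '!'
        if string[:6] in ['Hello', 'Hello ', 'Hello,', 'Hello!']:
            return string[:5]
    elif string[:1] == '방':  # slicing avoids an IndexError on an empty string
        if string[:2] in ['방가', '방가방가 ', '방가워요', '방갑습니다']:
            return string
    return None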

identify_greeting('안녕하세요.')

'안녕'

identify_greeting('방가워요')

'방가워요'

print(identify_greeting('만나서 반갑습니다.'))

None

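identify_greeting() checks only whether the string begins with a hard-coded prefix such as 안녕 or 방가, so a greeting phrased any other way, like the last example, returns None.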
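
For comparison, the same intent can be written as an actual regular expression. This is only a sketch: the function name `find_greeting` and the list of alternatives are assumptions chosen for illustration, and the list is nowhere near exhaustive.

```python
import re

# Hypothetical greeting stems; every new variant still has to be added by hand.
GREETING_RE = re.compile(r"안녕|방가|반갑|hello", re.IGNORECASE)

def find_greeting(text):
    """Return the first greeting stem found anywhere in the text, or None."""
    match = GREETING_RE.search(text)
    return match.group(0) if match else None

print(find_greeting('안녕하세요.'))         # 안녕
print(find_greeting('만나서 반갑습니다.'))   # 반갑
print(find_greeting('Hello, world'))       # Hello
print(find_greeting('좋은 아침'))           # None
```

Using `search` instead of comparing prefixes lets the pattern catch a greeting in the middle of a sentence, but the underlying limitation is unchanged: a greeting that is not listed in the pattern goes undetected.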

Named entity recognition with NLTK

Let's look at how to perform named entity recognition (NER) with the NLTK library.

Named Entity Recognition using NLTK
Susan Li (Aug 17, 2018), "Named Entity Recognition with NLTK and SpaCy - NER is used in many fields in Natural Language Processing (NLP)"

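import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

# one-time downloads; resource names may differ between NLTK versions
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')

sentence = ("European authorities fined Google a record $5.1 billion on Wednesday for abusing its power "
            "in the mobile phone market and ordered the company to alter its practices")
sentence_pos = pos_tag(word_tokenize(sentence))  # tokenize, then tag each token with its part of speech
print(sentence_pos)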

[('European', 'JJ'), ('authorities', 'NNS'), ('fined', 'VBD'), ('Google', 'NNP'), ('a', 'DT'), ('record', 'NN'), ('$', '$'), ('5.1', 'CD'), ('billion', 'CD'), ('on', 'IN'), ('Wednesday', 'NNP'), ('for', 'IN'), ('abusing', 'VBG'), ('its', 'PRP$'), ('power', 'NN'), ('in', 'IN'), ('the', 'DT'), ('mobile', 'JJ'), ('phone', 'NN'), ('market', 'NN'), ('and', 'CC'), ('ordered', 'VBD'), ('the', 'DT'), ('company', 'NN'), ('to', 'TO'), ('alter', 'VB'), ('its', 'PRP$'), ('practices', 'NNS')]

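IOB tags have become the standard way to represent chunk structure in files. To get there, the sentence is first chunked with a pattern that defines a noun phrase (NP) as an optional determiner (DT), any number of adjectives (JJ), and a noun (NN). Parsing the tagged sentence with this pattern gives the following.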

pattern = 'NP: {<DT>?<JJ>*<NN>}'

cp = nltk.RegexpParser(pattern)
cs = cp.parse(sentence_pos)
print(cs)

(S
  European/JJ
  authorities/NNS
  fined/VBD
  Google/NNP
  (NP a/DT record/NN)
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  (NP power/NN)
  in/IN
  (NP the/DT mobile/JJ phone/NN)
  (NP market/NN)
  and/CC
  ordered/VBD
  (NP the/DT company/NN)
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)
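
To see why "the mobile phone" and "market" end up as two separate NP chunks, it helps to read the grammar as a regular expression over the tag sequence. The snippet below is a rough sketch of that idea using only the `re` module; it is not NLTK's implementation, and the short `tagged` list is copied from part of the output above plus the final token.

```python
import re

tagged = [("in", "IN"), ("the", "DT"), ("mobile", "JJ"), ("phone", "NN"),
          ("market", "NN"), ("and", "CC"), ("practices", "NNS")]

# Encode the tag sequence as one string: "<IN><DT><JJ><NN><NN><CC><NNS>"
tag_string = "".join(f"<{tag}>" for _, tag in tagged)

# The same shape as the grammar: optional DT, any number of JJ, exactly one NN.
np_pattern = re.compile(r"(?:<DT>)?(?:<JJ>)*<NN>")

chunks = []
for m in np_pattern.finditer(tag_string):
    start = tag_string[:m.start()].count("<")   # index of the first matched token
    length = m.group(0).count("<")               # number of tokens in the match
    chunks.append(" ".join(word for word, _ in tagged[start:start + length]))

print(chunks)  # ['the mobile phone', 'market']
```

The pattern consumes exactly one `<NN>`, so the second noun starts a new chunk, and `<NN>` does not match `<NNS>`, which is why plural nouns such as "practices" are left outside any chunk.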

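nltk.chunk.tree2conlltags() converts a chunk tree into a sequence of (word, POS tag, IOB tag) triples; conlltags2tree() performs the reverse conversion, from a tag sequence back to a chunk tree.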

from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint

iob_tagged = tree2conlltags(cs)
pprint(iob_tagged)

[('European', 'JJ', 'O'),
 ('authorities', 'NNS', 'O'),
 ('fined', 'VBD', 'O'),
 ('Google', 'NNP', 'O'),
 ('a', 'DT', 'B-NP'),
 ('record', 'NN', 'I-NP'),
 ('$', '$', 'O'),
 ('5.1', 'CD', 'O'),
 ('billion', 'CD', 'O'),
 ('on', 'IN', 'O'),
 ('Wednesday', 'NNP', 'O'),
 ('for', 'IN', 'O'),
 ('abusing', 'VBG', 'O'),
 ('its', 'PRP$', 'O'),
 ('power', 'NN', 'B-NP'),
 ('in', 'IN', 'O'),
 ('the', 'DT', 'B-NP'),
 ('mobile', 'JJ', 'I-NP'),
 ('phone', 'NN', 'I-NP'),
 ('market', 'NN', 'B-NP'),
 ('and', 'CC', 'O'),
 ('ordered', 'VBD', 'O'),
 ('the', 'DT', 'B-NP'),
 ('company', 'NN', 'I-NP'),
 ('to', 'TO', 'O'),
 ('alter', 'VB', 'O'),
 ('its', 'PRP$', 'O'),
 ('practices', 'NNS', 'O')]
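
In this output, B- marks the first token of a chunk, I- marks a token that continues the chunk, and O marks a token outside any chunk. As a minimal sketch of how those tags encode chunk boundaries, the hypothetical helper `iob_to_chunks` below regroups triples into chunks in plain Python; the sample triples are copied from the output above.

```python
def iob_to_chunks(iob_triples):
    """Group (word, pos, iob) triples into (chunk_type, text) pairs."""
    chunks = []
    words, chunk_type = [], None
    for word, pos, iob in iob_triples:
        if iob.startswith("I-") and words and iob[2:] == chunk_type:
            words.append(word)               # continue the open chunk
            continue
        if words:                            # any other tag closes the open chunk
            chunks.append((chunk_type, " ".join(words)))
            words, chunk_type = [], None
        if iob.startswith("B-"):             # start a new chunk
            words, chunk_type = [word], iob[2:]
    if words:                                # flush a chunk that ends the sequence
        chunks.append((chunk_type, " ".join(words)))
    return chunks

sample = [("on", "IN", "O"), ("a", "DT", "B-NP"), ("record", "NN", "I-NP"),
          ("$", "$", "O"), ("the", "DT", "B-NP"), ("mobile", "JJ", "I-NP"),
          ("phone", "NN", "I-NP"), ("market", "NN", "B-NP"), ("and", "CC", "O")]

print(iob_to_chunks(sample))
# [('NP', 'a record'), ('NP', 'the mobile phone'), ('NP', 'market')]
```

The B- tag on "market" is what keeps it apart from "the mobile phone", even though the two chunks are adjacent.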

Extracting entities in IOB form

When NLTK processes a sentence, the result is a Tree object. To convert it to IOB form, use the tree2conlltags() function.

Complete guide to build your own Named Entity Recognizer with Python

sentence_iob_tagged = tree2conlltags(cs)
print(sentence_iob_tagged)

[('European', 'JJ', 'O'), ('authorities', 'NNS', 'O'), ('fined', 'VBD', 'O'), ('Google', 'NNP', 'O'), ('a', 'DT', 'B-NP'), ('record', 'NN', 'I-NP'), ('$', '$', 'O'), ('5.1', 'CD', 'O'), ('billion', 'CD', 'O'), ('on', 'IN', 'O'), ('Wednesday', 'NNP', 'O'), ('for', 'IN', 'O'), ('abusing', 'VBG', 'O'), ('its', 'PRP$', 'O'), ('power', 'NN', 'B-NP'), ('in', 'IN', 'O'), ('the', 'DT', 'B-NP'), ('mobile', 'JJ', 'I-NP'), ('phone', 'NN', 'I-NP'), ('market', 'NN', 'B-NP'), ('and', 'CC', 'O'), ('ordered', 'VBD', 'O'), ('the', 'DT', 'B-NP'), ('company', 'NN', 'I-NP'), ('to', 'TO', 'O'), ('alter', 'VB', 'O'), ('its', 'PRP$', 'O'), ('practices', 'NNS', 'O')]

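The IOB result is a list of tuples, so a list comprehension can pull out the entities of interest. The query below comes back empty because the NP grammar used above only ever assigns NP tags, so no token carries the B-GPE tag.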

query = [word for (word, pos, iob) in sentence_iob_tagged if iob == 'B-GPE']  # compare the whole tag
print(query)

[]
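
When filtering IOB triples it is usually more useful to select by chunk type, regardless of the B- or I- prefix. The sketch below uses a hypothetical helper, `words_with_type`, on a few triples copied from the output above. Note that `in` against a set tests membership, whereas `in` against a string such as 'B-GPE' tests for a substring.

```python
def words_with_type(iob_triples, chunk_type):
    """Return the words whose IOB tag is B-<chunk_type> or I-<chunk_type>."""
    wanted = {f"B-{chunk_type}", f"I-{chunk_type}"}
    return [word for word, pos, iob in iob_triples if iob in wanted]

iob_sample = [("Google", "NNP", "O"), ("a", "DT", "B-NP"),
              ("record", "NN", "I-NP"), ("$", "$", "O"), ("power", "NN", "B-NP")]

print(words_with_type(iob_sample, "NP"))   # ['a', 'record', 'power']
print(words_with_type(iob_sample, "GPE"))  # []
```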

Named entity recognition with `ne_chunk`

The ne_chunk() function in the nltk library recognizes named entities in a POS-tagged sentence.

# nltk.download('maxent_ne_chunker')
# nltk.download('words')
from nltk.chunk import conlltags2tree, tree2conlltags, ne_chunk
from pprint import pprint

sentence_ne_tree = ne_chunk(sentence_pos)
print(sentence_ne_tree) # named entity recognition result

(S
  (GPE European/JJ)
  authorities/NNS
  fined/VBD
  (PERSON Google/NNP)
  a/DT
  record/NN
  $/$
  5.1/CD
  billion/CD
  on/IN
  Wednesday/NNP
  for/IN
  abusing/VBG
  its/PRP$
  power/NN
  in/IN
  the/DT
  mobile/JJ
  phone/NN
  market/NN
  and/CC
  ordered/VBD
  the/DT
  company/NN
  to/TO
  alter/VB
  its/PRP$
  practices/NNS)

Extracting entities from a `Tree` object

The code below extracts only the recognized entities from a sentence. Because nltk.ne_chunk() returns an nltk.tree.Tree object, the code walks the tree and collects the entities that carry the requested label.

-stackoverflow, "How can I extract GPE(location) using NLTK ne_chunk?"

from nltk import word_tokenize, pos_tag, ne_chunk
from nltk import Tree

def get_continuous_chunks(text, label):
    """Return the distinct entities tagged with `label`, merging adjacent subtrees."""
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    # the trailing None sentinel flushes an entity that ends the sentence
    for subtree in list(chunked) + [None]:
        if isinstance(subtree, Tree) and subtree.label() == label:
            current_chunk.append(" ".join(token for token, pos in subtree.leaves()))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            current_chunk = []  # reset even when the entity was a duplicate

    return continuous_chunk

get_continuous_chunks(sentence, 'GPE')

['European']

spaCy NER

spaCy can also perform named entity recognition. Unlike NLTK's ne_chunk(), which labeled Google as PERSON, spaCy correctly labels it as an organization (ORG).

import spacy
from spacy import displacy
import en_core_web_sm

nlp = en_core_web_sm.load()

doc = nlp('European authorities fined Google a record $5.1 billion on Wednesday for abusing its power \
in the mobile phone market and ordered the company to alter its practices')
for entity in doc.ents:
    print(f'{entity.text:12} \t {entity.label_}')
# print([(X.text, X.label_) for X in doc.ents])

European     	 NORP
Google       	 ORG
$5.1 billion 	 MONEY
Wednesday    	 DATE
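
To make the difference between the two libraries easier to see, the labels from the two outputs in this document can be placed side by side. This is only a sketch: the dictionaries are typed in by hand from the printed results rather than produced by running either library, and the variable names are assumptions.

```python
# Labels copied by hand from the ne_chunk and spaCy outputs shown in this document.
nltk_labels = {"European": "GPE", "Google": "PERSON"}
spacy_labels = {"European": "NORP", "Google": "ORG",
                "$5.1 billion": "MONEY", "Wednesday": "DATE"}

for text, spacy_label in spacy_labels.items():
    nltk_label = nltk_labels.get(text, "(not recognized)")
    print(f"{text:12} \t NLTK: {nltk_label:16} \t spaCy: {spacy_label}")

# Entities that only spaCy picked up in this sentence.
only_spacy = [text for text in spacy_labels if text not in nltk_labels]
print(only_spacy)  # ['$5.1 billion', 'Wednesday']
```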