Email Text → CSV

Assume the email data is concatenated end to end in the format shown below. Let's use regular expressions to extract parts of this email text and reshape them into a DataFrame.

Email Dataset

The email data (mbox-short.txt) comes from the website that publishes the Python code and data used in the "Python for Everybody" textbook. We write code that converts this unstructured data into structured data.

# To: source@collab.sakaiproject.org
# From: stephen.marquard@uct.ac.za
# Subject: [sakai] svn commit: r39772 - content/branches/sakai_2-5-x/content-impl/impl/src/java/org/sakaiproject/content/impl
# X-Content-Type-Outer-Envelope: text/plain; charset=UTF-8
# X-Content-Type-Message-Body: text/plain; charset=UTF-8
# Content-Type: text/plain; charset=UTF-8
# X-DSPAM-Result: Innocent
# X-DSPAM-Processed: Sat Jan  5 09:14:16 2008
# X-DSPAM-Confidence: 0.8475
# X-DSPAM-Probability: 0.0000

Loading the Email Data

Read the 'mbox-short.txt' file stored on disk into memory with Python so that it can be processed. The result can be checked with print().

# Directory + file name
working_directory = "data/email/"
file_name = 'mbox-short.txt'
file_path = working_directory + file_name

# Load the file; the with block closes it automatically
with open(file_path, "r") as file_handler:
    email_data = file_handler.read()  # read the entire file contents

print(email_data[:100])

From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008
Return-Path: <postmaster@collab.sakaiprojec
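Because the file is a run of messages joined end to end, it can help to see where each message starts. In the mbox format every message opens with a separator line beginning with `From ` (no colon), as in the output above. The following is a minimal sketch, assuming Python 3.7 or later (where `re.split` accepts zero-width matches) and using a small inline `sample` string, which is hypothetical and not the real file contents.

```python
import re

# Hypothetical two-message sample in mbox style; not the real file contents.
sample = (
    "From stephen.marquard@uct.ac.za Sat Jan  5 09:14:16 2008\n"
    "To: source@collab.sakaiproject.org\n"
    "From: stephen.marquard@uct.ac.za\n"
    "\n"
    "Body of the first message.\n"
    "\n"
    "From louis@media.berkeley.edu Fri Jan  4 18:10:48 2008\n"
    "To: source@collab.sakaiproject.org\n"
    "From: louis@media.berkeley.edu\n"
    "\n"
    "Body of the second message.\n"
)

# Split just before every line that starts with "From " (the separator line).
# The lookahead keeps the separator attached to the message that follows it.
messages = [m for m in re.split(r"(?m)^(?=From )", sample) if m]

print(len(messages))
for message in messages:
    print(message.splitlines()[0])
```

The `From:` header lines are not treated as separators, because the pattern requires a space right after `From`.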

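Unstructured Data → Structured Data

We look for regularities in the plain-text email document and use them to clean the data, relying on regular expressions. Specifically, we first extract the recipient (To:), the sender (From:), the subject (Subject:), the spam processing time (X-DSPAM-Processed), the spam confidence (X-DSPAM-Confidence), and the spam probability (X-DSPAM-Probability).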

To: source@collab.sakaiproject.org
From: stephen.marquard@uct.ac.za
Subject: [sakai] svn commit: r39772 - content/branches/sakai_2-5-x/content-impl/impl/src/java/org/sakaiproject/content/impl
X-DSPAM-Processed: Sat Jan 5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
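As a rough sketch of where this is heading, the six header lines above can be collected into a dictionary with one pattern. The names `header`, `wanted`, `pattern`, and `record` are illustrative choices, not part of the original code.

```python
import re

# Header lines copied from the sample shown above.
header = """To: source@collab.sakaiproject.org
From: stephen.marquard@uct.ac.za
Subject: [sakai] svn commit: r39772 - content/branches/sakai_2-5-x/content-impl/impl/src/java/org/sakaiproject/content/impl
X-DSPAM-Processed: Sat Jan 5 09:14:16 2008
X-DSPAM-Confidence: 0.8475
X-DSPAM-Probability: 0.0000
"""

wanted = ["To", "From", "Subject", "X-DSPAM-Processed",
          "X-DSPAM-Confidence", "X-DSPAM-Probability"]

# Build one alternation from the field names; re.escape guards the hyphens.
# [ \t]* skips the blanks after the colon without crossing a line break.
pattern = re.compile(
    r"^(" + "|".join(re.escape(name) for name in wanted) + r"):[ \t]*(.*)$",
    re.MULTILINE,
)

# Each match is a (field name, value) pair, so it maps directly to a dict.
record = {key: value for key, value in pattern.findall(header)}

print(record["From"])
print(record["X-DSPAM-Confidence"])
```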

Extracting Only the Address from a Single Email Line

First, let's write a simple snippet that extracts just the address from a single email header line.

import re

text = 'From: stephen.marquard@uct.ac.za'
field_regex = r"From:\s(.+)"  # raw string, so \s reaches the regex engine unchanged

matches = re.findall(field_regex, text)

print(matches)

['stephen.marquard@uct.ac.za']
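`re.findall` always returns a list, even for a single hit. When only one value per line is expected, `re.search` with a guard for the no-match case may be more convenient. The helper name `extract_sender` below is hypothetical, chosen only for this sketch.

```python
import re

def extract_sender(line):
    """Return the text after 'From:' or None when the line has no such field."""
    match = re.search(r"From:\s(.+)", line)
    return match.group(1) if match else None

print(extract_sender("From: stephen.marquard@uct.ac.za"))
print(extract_sender("Subject: [sakai] svn commit"))
```

The second call returns `None` instead of raising, which keeps a line-by-line loop simple.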

Extracting Email Addresses and Spam Probability

Let's extract the email addresses from the sender (From) and recipient (To) fields, along with the spam probability (X-DSPAM-Probability).

import pandas as pd
text = '''
       From: stephen.marquard@uct.ac.za
       To: source@collab.sakaiproject.org
       X-DSPAM-Probability: 0.8475
       '''
# field_regex = r"\w+:\s(.+)"
# The field name must match the header in the text (X-DSPAM-Probability)
field_regex = r"(From|To|X-DSPAM-Probability):(.*)"

matches = re.findall(field_regex, text)
print(matches)

[('From', ' stephen.marquard@uct.ac.za'), ('To', ' source@collab.sakaiproject.org'), ('X-DSPAM-Probability', ' 0.8475')]
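One caveat with an unanchored alternation: `To:` also matches inside header names that merely end in `To`, such as `Reply-To:`. The sketch below uses hypothetical header lines (the `Reply-To` line is added only to show the pitfall) and compares the loose pattern with one anchored to the start of a line.

```python
import re

# Hypothetical header lines; Reply-To is included only to show the pitfall.
text = """From: stephen.marquard@uct.ac.za
Reply-To: someone@example.org
To: source@collab.sakaiproject.org
"""

# Unanchored: "To:" is found anywhere in a line, including inside "Reply-To:".
loose = re.findall(r"(From|To):(.*)", text)

# ^ with re.MULTILINE ties the field name to the start of a line.
strict = re.findall(r"^(From|To):(.*)", text, flags=re.MULTILINE)

print(len(loose))   # 3: the Reply-To line is picked up as a 'To' match
print(len(strict))  # 2
```

The anchored form assumes header lines start at column 0. For an indented sample string, `^\s*` in place of `^` would be needed.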

Opening the File and Extracting Email Addresses

Now open the file and use the regular expression defined above to extract the sender (From), the recipient (To), and the spam probability (X-DSPAM-Probability) as fields.

# Regular expression
field_regex = r"(From|To|X-DSPAM-Probability):(.*)"

matches = []

# Open the file and collect matches line by line; the with block closes it
with open(file_path, "r") as file_handler:
    for line in file_handler:
        matches += re.findall(field_regex, line)

print(matches[1:10])  # elements 1 through 9; index 0 is skipped

[('From', ' stephen.marquard@uct.ac.za'), ('X-DSPAM-Probability', ' 0.0000'), ('To', ' source@collab.sakaiproject.org'), ('From', ' louis@media.berkeley.edu'), ('X-DSPAM-Probability', ' 0.0000'), ('To', ' source@collab.sakaiproject.org'), ('From', ' zqian@umich.edu'), ('X-DSPAM-Probability', ' 0.0000'), ('To', ' source@collab.sakaiproject.org')]
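The flat list mixes the fields of every message. A possible next step, sketched here with plain Python, is to group the pairs into one dictionary per message. This assumes each message contributes each field at most once, so a repeated key signals the start of the next message. The `matches` list below is a shortened copy of the printed values, not the full result.

```python
# Flat (key, value) pairs in the shape produced by re.findall above;
# the values are copied from the printed output.
matches = [
    ('To', ' source@collab.sakaiproject.org'),
    ('From', ' stephen.marquard@uct.ac.za'),
    ('X-DSPAM-Probability', ' 0.0000'),
    ('To', ' source@collab.sakaiproject.org'),
    ('From', ' louis@media.berkeley.edu'),
    ('X-DSPAM-Probability', ' 0.0000'),
]

records = []
current = {}
for key, value in matches:
    # Assumption: a repeated key means the next message has started.
    if key in current:
        records.append(current)
        current = {}
    current[key] = value.strip()  # drop the blank left after the colon
if current:
    records.append(current)

print(len(records))
print(records[1]['From'])
```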

List of Tuples → DataFrame

Once the regular expression has extracted the information we want, it is convenient to reshape the data into a DataFrame for later analysis. To do so, convert the list of tuples into a DataFrame.

df = pd.DataFrame(matches, columns=['key', 'value'])

#                    key                            value
# 0                   To   source@collab.sakaiproject.org
# 1                 From       stephen.marquard@uct.ac.za
# 2  X-DSPAM-Probability                           0.0000
# 3                   To   source@collab.sakaiproject.org
# 4                 From         louis@media.berkeley.edu
df.head(9)
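The key/value layout is long-form: one row per field. For a CSV with one row per message, the frame can be pivoted. The sketch below assumes pandas is installed and that every message yields exactly three matches in a fixed order, which holds for the rows shown here but is not guaranteed for arbitrary mail. The column name `message` and the variables `wide` and `csv_text` are illustrative.

```python
import pandas as pd

# Shortened copy of the extracted pairs shown above.
matches = [
    ('To', ' source@collab.sakaiproject.org'),
    ('From', ' stephen.marquard@uct.ac.za'),
    ('X-DSPAM-Probability', ' 0.0000'),
    ('To', ' source@collab.sakaiproject.org'),
    ('From', ' louis@media.berkeley.edu'),
    ('X-DSPAM-Probability', ' 0.0000'),
]

df = pd.DataFrame(matches, columns=['key', 'value'])
df['value'] = df['value'].str.strip()  # remove the blank left after the colon

# Assumption: three rows per message, so index // 3 numbers the messages.
df['message'] = df.index // 3

# One row per message, one column per field.
wide = df.pivot(index='message', columns='key', values='value')
wide['X-DSPAM-Probability'] = wide['X-DSPAM-Probability'].astype(float)

# With no path argument, to_csv returns the CSV text instead of writing a file.
csv_text = wide.to_csv()
print(csv_text)
```

Passing a file path to `to_csv` would write the same content to disk.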

Reference Code

Stack Overflow, "Python: Multiple Text Files to Dataframe"

The project here extracts only the needed fields from plain-text email content and converts them into DataFrame-style data.

import os
import re
import csv

# Read every .txt file in working_directory and write the matched fields to one CSV.
def convert_email(working_directory, output_directory):
    # Fall back to the current working directory when no source directory is given.
    if working_directory == "":
        working_directory = os.getcwd()

    fields = ['Target', 'Attribute', 'Label', 'Time', 'Full Text']  # fields searched for with regex
    csvfield = fields + ['Filename']  # the file name goes in the CSV header but is not found with regex

    with open(os.path.join(output_directory, 'email_sample.csv'), 'w', newline='') as csvfile:
        writer = csv.DictWriter(csvfile, delimiter=',', lineterminator='\n', fieldnames=csvfield)
        writer.writeheader()  # writes the csvfield list as the CSV header row

        i = 0
        for txt in os.listdir(working_directory):  # iterate over the entries in working_directory
            print("Processing File: " + str(txt))
            fileExtension = txt.split(".")[-1]
            if fileExtension == "txt":
                textFilename = os.path.join(working_directory, txt)  # becomes: PATH/example.txt
                with open(textFilename, "r") as f:
                    data = f.read()  # read the whole file

                fieldmatches = {'Filename': txt}
                for field in fields:
                    # Builds e.g. r"\sTarget:(.*)", which captures the rest of the line after "Target:"
                    regex = r"\s" + field + ":(.*)"
                    match = re.search(regex, data)
                    if match:
                        fieldmatches[field] = match.group(1)
                writer.writerow(fieldmatches)  # one row per file; unmatched fields stay empty
                i += 1  # count the processed text files
        print("Successfully read " + str(i) + " files.")


working_directory = "data/email/"  # put the source directory of text files here
output_directory = "data/email/"   # put the directory for the output CSV here

convert_email(working_directory, output_directory)

Processing File: .DS_Store
Processing File: mbox-short.txt
Processing File: email_sample.txt
Processing File: email_sample.csv
Successfully read 2 files.
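The log above shows every directory entry being visited, including `.DS_Store` and the CSV itself, with the extension check applied afterwards. As a sketch of one alternative, `pathlib.Path.glob` can restrict the loop to `*.txt` files up front. The temporary directory and placeholder files below are hypothetical and only mirror the names in the log.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    # Hypothetical directory contents mirroring the log above.
    for name in [".DS_Store", "mbox-short.txt", "email_sample.txt", "email_sample.csv"]:
        (folder / name).write_text("placeholder\n")

    # glob("*.txt") yields only entries ending in .txt; sorted() fixes the order.
    txt_names = sorted(path.name for path in folder.glob("*.txt"))

print(txt_names)
```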

Output of df.head(9):

                   key                           value
0                   To  source@collab.sakaiproject.org
1                 From      stephen.marquard@uct.ac.za
2  X-DSPAM-Probability                          0.0000
3                   To  source@collab.sakaiproject.org
4                 From        louis@media.berkeley.edu
5  X-DSPAM-Probability                          0.0000
6                   To  source@collab.sakaiproject.org
7                 From                 zqian@umich.edu
8  X-DSPAM-Probability                          0.0000