1 OCR이란?

광학문자인식 (OCR, Optical Character Recognition)은 이미지 혹은 사진 속에 담긴 텍스트를 기계가 읽을 수 있는 형태(Machine-Readable Format)로 변환시켜 텍스트를 선택, 편집, 검색 가능하게 만드는 일련의 작업을 지칭한다.

2 OCR 응용분야

OCR 응용분야로 다음을 들 수 있다.

인보이스 데이터 자동 입력: 인보이스와 같은 (반)정형 문서에 포함된 텍스트를 OCR로 읽어 정형데이터를 추출한다.
여권 MRZ 추출: 여권과 같은 증명서에서 MRC (Machine Readable Zone) 영역에서 텍스트를 추출하여 대사작업에 활용한다.
차량번호판 인식: 주차차량 관리 및 주차요금 정산을 위한 차량 번호판 인식
교통 표시판 인식: 자율주행차가 교통법규 준수를 위한 다양한 교통 표시판 인식
고문서 스캔: 절판되거나 도서관에 보관된 책을 스캔하여 검색 및 자연어 처리 업무를 위한 OCR 작업
시각 장애인: OCR 기술을 사용해서 추출된 텍스트 데이터를 TTS(Text-to-Speech)를 사용해서 시각장애인의 정보 접근성을 높일 수 있다.

3 OCR 작업흐름

OCR 작업흐름은 텍스트가 담긴 이미지를 바로 OCR 작업을 돌릴 경우 정확도가 떨어지기 때문에 이미지 전처리 작업(회전, 노이즈 제거 등)을 수행하고 텍스트 추출작업을 수행한다. 그리고 나서 추출된 텍스트를 후처리 작업을 통해서 재구조화시키고 나서 텍스트를 저장하는 과정을 거치게 된다.

4 OCR 알고리즘

텍스트가 포함된 이미지가 있는 상태에서 이미지 전처리(이미지 정규화, 그레이스케일 변환, 이미지 크기 변환, 노이즈 제거, 기울어진 보정 등) 작업을 수행한 후에 이미지 속 텍스트를 탐지하기 위한 알고리즘과 이미지에서 텍스트를 인식하고 추출하는 과정이 핵심적인 과정으로 이에 따라 OCR 성능이 많이 좌우된다.

텍스트 탐지 (Text Detection)
- EAST 알고리즘(Efficient and Accurate Scene Text Detector) [1]
- CTPN 알고리즘(Connectionist Text Proposal Network) [2]
텍스트 인식/추출
- CRNN 알고리즘 (Convolutional Recurrent Neural Network) [3]
- CTC 알고리즘 (Connectionist Temporal Classification) [4]

5 OCR 데이터셋

Total Text Dataset은 다양한 형태 OCR 대상 이미지 데이터를 상황별로 모아놨다.

5.1 텍스트 탐지 데이터셋

5.2 텍스트 인식/추출 데이터셋

5.3 한글 OCR 데이터셋

6 OCR 문자인식 성능 [5]

문자수준 (character recognition accuracy - CRA)과 단어수준 (word recognition accuracy - WRA) 지표를 통해 OCR 성능을 평가할 수 있다.

\[\text{WRA} = \frac{W_r}{W}\] \[\text{WER} = 1 - \text{WRA} = 1 \frac{W_r}{W}\]

그외 NED(Normalized Edit Distance), F-Score, AED(Average Edit Distance) 등이 있다.

6.1 ICDAR 2019

ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction 경진대회를 통해 영수증 등 스캔된 문서의 OCR 성능을 파악할 수 있다.

6.2 텍스트 탐지 알고리즘 순위

CTPN, EAST 등과 같은 전통 알고리즘이 약 90%를 보였다면 TLGAN은 99% 텍스트 탐지 정확도를 보이고 있다.

EAST: 약 90 % 초반
CTPN: 약 85~90 %
TLGAN (Text Localization Generative Adversarial Nets): 99 %

library(tidyverse)
library(rvest)
library(reactable)

ocr_task_1_html <- read_html(x ="https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=1") 

ocr_text_detection <- ocr_task_1_html %>% 
  html_node(xpath = '//*[@id="table-1-1"]') %>% 
  html_table() %>% 
  janitor::clean_names() %>% 
  as_tibble()

ocr_text_detection %>% 
  select(team = method, recall, precision, hmean) %>% 
  mutate_at(vars(recall:hmean), parse_number) %>% 
  mutate_at(vars(recall:hmean), ~ ./100) %>% 
  reactable(columns = list(
    recall    = colDef(format = colFormat(percent = TRUE, digits = 1)),
    precision = colDef(format = colFormat(percent = TRUE, digits = 1)),
    hmean     = colDef(format = colFormat(percent = TRUE, digits = 1))))

6.3 텍스트 인식 알고리즘 순위

CRNN : 약 87% 정확도
CTC: 약 87% 정확도
CNN-RNN + LM + Attention: 97%

ocr_task_2_html <- read_html(x ="https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=2") 

ocr_text_reconition <- ocr_task_2_html %>% 
  html_node(xpath = '//*[@id="table-1-1"]') %>% 
  html_table() %>% 
  janitor::clean_names() %>% 
  as_tibble()

ocr_text_reconition %>% 
  select(team = method, recall, precision, hmean) %>% 
  mutate_at(vars(recall:hmean), parse_number) %>% 
  mutate_at(vars(recall:hmean), ~ ./100) %>% 
  reactable(columns = list(
    recall    = colDef(format = colFormat(percent = TRUE, digits = 1)),
    precision = colDef(format = colFormat(percent = TRUE, digits = 1)),
    hmean     = colDef(format = colFormat(percent = TRUE, digits = 1))))

6.4 중요정보 추출 알고리즘 순위

OCR 된 텍스트로부터 company, address, total, date 등 중요정보 추출하여 알고리즘 성능을 비교합니다.

ocr_task_3_html <- read_html(x ="https://rrc.cvc.uab.es/?ch=13&com=evaluation&task=3") 

ocr_info_extraction <- ocr_task_3_html %>% 
  html_node(xpath = '//*[@id="table-1-1"]') %>% 
  html_table() %>% 
  janitor::clean_names() %>% 
  as_tibble()

ocr_info_extraction %>% 
  select(team = method, recall, precision, hmean) %>% 
  mutate_at(vars(recall:hmean), parse_number) %>% 
  mutate_at(vars(recall:hmean), ~ ./100) %>% 
  reactable(columns = list(
    recall    = colDef(format = colFormat(percent = TRUE, digits = 1)),
    precision = colDef(format = colFormat(percent = TRUE, digits = 1)),
    hmean     = colDef(format = colFormat(percent = TRUE, digits = 1))))

1. Zhou X, Yao C, Wen H, Wang Y, Zhou S, He W, et al. East: An efficient and accurate scene text detector. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017. p. 5551–60.

2. Tian Z, Huang W, He T, He P, Qiao Y. Detecting text in natural image with connectionist text proposal network. In: European conference on computer vision. Springer; 2016. p. 56–72.

3. Shi B, Bai X, Yao C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE transactions on pattern analysis and machine intelligence. 2016;39:2298–304.

4. Graves A, Fernández S, Gomez F, Schmidhuber J. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd international conference on machine learning. 2006. p. 369–76.

5. Hubert I, Arppe A, Lachler J, Santos EA. Training & quality assessment of an optical character recognition model for northern haida. In: Proceedings of the tenth international conference on language resources and evaluation (LREC’16). 2016. p. 3227–34.

데이터 과학자 이광춘 저작

kwangchun.lee.7@gmail.com

PDF를 데이터로 보는 올바른 자세

광학문자인식(OCR, Optical Character Recognition)

이광춘

2021-06-24