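1 OCR target image¹ ² ³ ⁴

Let's look at a case of extracting text from an image that contains white text on a black background. We start with a single image posted in the Stack Overflow question “tesseract in R - read white font on black background”.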

library(tidyverse)
library(magick)
library(tesseract)

char_image <- image_read("fig/white-character.jpeg")

char_image %>% 
  image_resize("2000x")

1.1 Image preprocessing

In many cases it is easier to work with plain numbers than with a magick object.

input <- char_image %>% 
  .[[1]] %>% 
  as.numeric() # numbers are easier to work with

image_read(ifelse(input < .9, 1, 0) )
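To see what this threshold does, here is a minimal sketch on a made-up 2 x 3 matrix (the pixel values are illustrative and not taken from the image). It suggests that the rule does two things at once: it binarizes the values and it inverts them, so bright text on a dark background comes out as dark text on a bright background.

```r
# Toy grayscale "image": values near 1 are bright (text), values near 0 are dark (background).
pixels <- matrix(c(0.02, 0.95, 0.10,
                   0.98, 0.05, 0.93),
                 nrow = 2, byrow = TRUE)

# Same rule as above: anything darker than 0.9 becomes white (1), the rest becomes black (0).
binary <- ifelse(pixels < .9, 1, 0)

binary
```

Because ifelse() keeps the dimensions of its test argument, the result still has the shape of the input, which is why it can be passed straight back to image_read().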

1.2 Text extraction

Use the ocr() function from the tesseract package to extract the text.

ocr_characters <- ifelse(input < .9, 1, 0) %>% 
  image_read() %>% 
  image_resize('500x') %>% # make it smaller to work around the errors
  tesseract::ocr( engine = "eng") %>% 
  str_remove("\n")

ocr_characters

[1] "TLC200 PRO 2019/10/31 17:33:00"

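1.3 Performance evaluation

We compute the string distance (stringdist) between the extracted text and the label (ground truth) and use that distance as the performance metric.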

library(stringdist)

label <- "TLC200 PRO 2019/10/31 17:33:00"
stringdist(label, ocr_characters)

[1] 0
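A raw distance is hard to compare across labels of different lengths. One option, sketched here as an assumption rather than as part of the original workflow, is to divide the distance by the label length, often called the character error rate. The helper name `cer` is hypothetical. Note also that the sketch uses base R's utils::adist(), which computes the Levenshtein distance, whereas stringdist() defaults to the optimal string alignment distance (method = "osa"), which additionally counts a swap of two adjacent characters as a single edit.

```r
# Character error rate: edit distance divided by the number of characters in the label.
# `cer` is a hypothetical helper, not part of stringdist or tesseract.
cer <- function(label, ocr_text) {
  # utils::adist() returns the generalized Levenshtein distance as a matrix.
  distance <- as.vector(utils::adist(label, ocr_text))
  distance / nchar(label)
}

label <- "TLC200 PRO 2019/10/31 17:33:00"

cer(label, "TLC200 PRO 2019/10/31 17:33:00")  # identical strings
cer(label, "TLC200 PR0 2019/10/31 17:33:00")  # one substituted character (made-up example)
```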

2 OCR performance metrics⁵ [1]

Character-level accuracy (character metric)
Word-level accuracy (word metric)

For the character-level metrics, precision and recall are defined as follows.

\(C_\text{precision} = \frac{C_\text{truth}}{C_\text{model}}\)
\(C_\text{recall} = \frac{C_\text{model}}{C_\text{truth}}\)

For the word-level metric, the Levenshtein distance looks like a reasonable choice.
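A word-level Levenshtein distance can be sketched by treating each whitespace-separated token as one symbol. The helper `word_distance` below is hypothetical and written only for illustration. It uses the standard dynamic-programming recurrence and is applied to the label and to one OCR output that appears later in this document.

```r
# Levenshtein distance between two token sequences (words instead of characters).
# `word_distance` is a hypothetical helper written for this illustration.
word_distance <- function(truth, model) {
  a <- strsplit(truth, "\\s+")[[1]]
  b <- strsplit(model, "\\s+")[[1]]
  # d[i + 1, j + 1] = edits needed to turn the first i words of a into the first j words of b
  d <- matrix(0, nrow = length(a) + 1, ncol = length(b) + 1)
  d[, 1] <- 0:length(a)
  d[1, ] <- 0:length(b)
  for (i in seq_along(a)) {
    for (j in seq_along(b)) {
      cost <- if (a[i] == b[j]) 0 else 1
      d[i + 1, j + 1] <- min(d[i, j + 1] + 1,  # delete a word
                             d[i + 1, j] + 1,  # insert a word
                             d[i, j] + cost)   # substitute a word
    }
  }
  d[length(a) + 1, length(b) + 1]
}

word_distance("TLC200 PRO 2019/10/31 17:33:00", "C266 PRO 2819/16/31 17:33:86")
```

A word counts as wrong as soon as one of its characters is wrong, so the word-level score is usually much stricter than the character-level one.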

The following OCR evaluation tools are available.

The ISRI Analytic Tools: ocreval
hOCR tools: hocr-tools
An open-source OCR evaluation tool: ocrevalUAtion (Java)

3 Image binarization⁶

A package is now available that makes it easy to apply image binarization with adaptive thresholding techniques in order to improve OCR (Optical Character Recognition) and HTR (Handwritten Text Recognition) performance. The “image.binarization” package is in fact an R wrapper around the C/C++ code of the Δoxa Binarization Framework. The WebAssembly Demo - Local Adaptive Binarization is worth a look.

The supported algorithms are as follows.

Otsu - “A threshold selection method from gray-level histograms”, 1979.
Bernsen - “Dynamic thresholding of gray-level images”, 1986.
Niblack - “An Introduction to Digital Image Processing”, 1986.
Sauvola - “Adaptive document image binarization”, 1999.
Wolf - “Extraction and Recognition of Artificial Text in Multimedia Documents”, 2003.
Gatos - “Adaptive degraded document image binarization”, 2005. (Partial)
NICK - “Comparison of Niblack inspired Binarization methods for ancient documents”, 2009.
Su - “Binarization of Historical Document Images Using the Local Maximum and Minimum”, 2010.
T.R. Singh - “A New local Adaptive Thresholding Technique in Binarization”, 2011.
Bataineh - “An adaptive local binarization method for document images based on a novel thresholding method and dynamic windows”, 2011. (unreproducible)
ISauvola - “ISauvola: Improved Sauvola’s Algorithm for Document Image Binarization”, 2016.
WAN - “Binarization of Document Image Using Optimum Threshold Modification”, 2018.
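Otsu, the first method in the list, picks one global threshold by maximizing the between-class variance of the gray-level histogram. The following is a minimal base R sketch of that idea, assuming 8-bit gray levels. The helper `otsu_threshold` is hypothetical and is not how image.binarization is implemented (the package delegates to C/C++ code), and the input data are simulated.

```r
# Otsu's threshold for 8-bit gray levels (0..255), written in base R.
# `otsu_threshold` is a hypothetical helper for illustration only.
otsu_threshold <- function(gray) {
  counts <- tabulate(gray + 1L, nbins = 256)  # histogram of levels 0..255
  prob <- counts / sum(counts)
  levels <- 0:255

  omega <- cumsum(prob)           # weight of the dark class for each candidate cut
  mu <- cumsum(prob * levels)     # cumulative mean up to each candidate cut
  mu_total <- mu[256]

  # Between-class variance for every cut; cuts that leave one class empty are skipped.
  sigma_b <- (mu_total * omega - mu)^2 / (omega * (1 - omega))
  sigma_b[!is.finite(sigma_b)] <- -Inf

  levels[which.max(sigma_b)]      # pixels <= this level form the dark class
}

# Simulated bimodal data: a dark cluster around 30 and a bright cluster around 220.
set.seed(1)
gray <- c(round(rnorm(500, mean = 30, sd = 5)), round(rnorm(500, mean = 220, sd = 5)))
gray <- pmin(pmax(gray, 0), 255)

otsu_threshold(gray)
```

A single global cut like this works when the histogram has two clearly separated modes. The local methods in the list (Sauvola, Wolf and others) instead compute a threshold per neighborhood.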

3.1 Image preprocessing

Before running OCR on the text, we preprocess the image with the Otsu algorithm.

library(magick)
# remotes::install_github("DIGI-VUB/image.binarization")
library(image.binarization)
test_img <- image_read("fig/white-character.jpeg") %>%  image_resize('500x')
converted_img <- image_convert(test_img, format = "PGM", colorspace = "Gray")
processed_img <- image_binarization(converted_img, type = "otsu")

processed_img

3.2 Text extraction

Let's extract the text with the tesseract package and compare it with the ground truth.

ocred_text <- tesseract::ocr(processed_img, engine = "eng") %>% 
  stringr::str_remove("\n")

ocred_text

[1] "C266 PRO 2819/16/31 17:33:86"

stringdist::stringdist(label, ocred_text)

[1] 8

3.3 Trying several methods

We try several binarization methods to find one that suits the document.

test_img <- image_read("fig/white-character.jpeg") %>%  image_resize('500x')
converted_img <- image_convert(test_img, format = "PGM", colorspace = "Gray")

methods_list <- c("otsu", "sauvola", "wolf")

for(i in seq_along(methods_list)) {
  
  processed_img <- image_binarization(converted_img, type = methods_list[i])
  
  ocred_text <- tesseract::ocr(processed_img, engine = "eng") %>% 
    stringr::str_remove("\n")

  cat(methods_list[i], "- Error :", stringdist::stringdist(label, ocred_text), "\n",
      "Label:", label, "\n",
      "OCRED:", ocred_text, "\n")
}

otsu - Error : 8 
 Label: TLC200 PRO 2019/10/31 17:33:00 
 OCRED: C266 PRO 2819/16/31 17:33:86 
sauvola - Error : 28 
 Label: TLC200 PRO 2019/10/31 17:33:00 
 OCRED: 12 =) -73 eel. Bae Pee ec 87: J 
wolf - Error : 15 
 Label: TLC200 PRO 2019/10/31 17:33:00 
 OCRED: am 66 PRO 2819/18. 3:33 :0B
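With more than a handful of methods, printed output becomes hard to compare by eye. As a minimal sketch, the results can be collected into a data frame and sorted by error. The OCR strings below are copied from the output above. The distance comes from base R's utils::adist() (Levenshtein), an assumption made to keep the example free of extra packages, so individual values may differ slightly from the stringdist() defaults used earlier.

```r
label <- "TLC200 PRO 2019/10/31 17:33:00"

# OCR outputs copied from the run above, keyed by binarization method.
ocr_results <- c(
  otsu    = "C266 PRO 2819/16/31 17:33:86",
  sauvola = "12 =) -73 eel. Bae Pee ec 87: J",
  wolf    = "am 66 PRO 2819/18. 3:33 :0B"
)

# Levenshtein distance of each output to the label.
errors <- vapply(ocr_results,
                 function(txt) as.vector(utils::adist(label, txt)),
                 numeric(1))

# Sort so that the method with the smallest error comes first.
ranking <- data.frame(method = names(errors), error = unname(errors))
ranking <- ranking[order(ranking$error), ]
ranking
```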

1. Tomaschek M. Evaluation of off-the-shelf OCR technologies. PhD thesis, Masaryk University; 2018.

Written by data scientist Kwangchun Lee

kwangchun.lee.7@gmail.com

Optical Character Recognition (OCR)

Recognizing white text on a black background

Tidyverse Korea

2020-09-10
