Title: OCR Improvements for Images of Multi-page Historical Documents
Authors: Gruber, Ivan
Hrúz, Marek
Ircing, Pavel
Neduchal, Petr
Zítka, Tomáš
Hlaváč, Miroslav
Zajíc, Zbyněk
Švec, Jan
Bulín, Martin
Citation: GRUBER, I., HRÚZ, M., IRCING, P., NEDUCHAL, P., ZÍTKA, T., HLAVÁČ, M., ZAJÍC, Z., ŠVEC, J., BULÍN, M. OCR Improvements for Images of Multi-page Historical Documents. In 23rd International Conference, SPECOM 2021, St. Petersburg, Russia, September 27–30, 2021, Proceedings. Cham: Springer, 2021. pp. 226–237. ISBN: 978-3-030-87801-6, ISSN: 0302-9743
Issue Date: 2021
Publisher: Springer
Document type: conference paper
ConferenceObject
URI: 2-s2.0-85116373386
http://hdl.handle.net/11025/47184
ISBN: 978-3-030-87801-6
ISSN: 0302-9743
Keywords in different language: document digitization; document layout analysis; optical character recognition; image preprocessing
Abstract in different language: This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for the purpose of information retrieval. The pipeline handles images of varying quality, whether they were obtained by a digital scanner or a camera. An image can contain multiple pages in any layout, but an approximately upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. These are then processed by a deskew algorithm to correct the orientation, and finally read by the Tesseract OCR system, which has been retrained on a large set of synthetic images and a small set of annotated real-world documents. By applying the pipeline, we were able to increase the word recall to 60.56%, an absolute gain of 19.19 percentage points over the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.
Rights: The full text is accessible within the university to logged-in users.
© Springer Nature Switzerland AG
Appears in Collections: Články / Articles (NTIS)
Konferenční příspěvky / Conference Papers (KKY)
OBD
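
The abstract above reports word recall as the evaluation measure but this record does not define it. The following is a minimal sketch of one plausible reading, assuming case-insensitive, whitespace-tokenized words matched as a multiset; the function names `word_recall` and `baseline_from_gain` are hypothetical and are not taken from the paper. The second helper only restates the arithmetic implied by the abstract: a final recall of 60.56% with an absolute gain of 19.19 points implies a baseline of 41.37%.

```python
from collections import Counter


def word_recall(reference: str, hypothesis: str) -> float:
    """Fraction of reference words that also occur in the OCR output.

    Assumption (not from the paper): words are lowercased, split on
    whitespace, and matched as a multiset, ignoring word order.
    """
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    matched = sum((ref_counts & hyp_counts).values())
    return matched / total


def baseline_from_gain(final_pct: float, absolute_gain_pct: float) -> float:
    """Baseline recall implied by a final recall and an absolute gain."""
    return round(final_pct - absolute_gain_pct, 2)


if __name__ == "__main__":
    # One misrecognized word out of four lowers recall to 3/4.
    print(word_recall("the quick brown fox", "the quikc brown fox"))
    # Figures from the abstract: 60.56% final, 19.19 points gained.
    print(baseline_from_gain(60.56, 19.19))
```

The actual evaluation protocol in the paper may normalize text or align words differently, so this sketch is illustrative only.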

Files in This Item:
File: Gruber2021_Chapter_OCRImprovementsForImagesOfMult.pdf
Size: 2.27 MB
Format: Adobe PDF


Please use this identifier to cite or link to this item: http://hdl.handle.net/11025/47184

Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.
