OCR Improvements for Images of Multi-page Historical Documents

Gruber, Ivan; Hrúz, Marek; Ircing, Pavel; Neduchal, Petr; Zítka, Tomáš; Hlaváč, Miroslav; Zajíc, Zbyněk; Švec, Jan; Bulín, Martin

Full metadata record

DC field	Value	Language
dc.contributor.author	Gruber, Ivan
dc.contributor.author	Hrúz, Marek
dc.contributor.author	Ircing, Pavel
dc.contributor.author	Neduchal, Petr
dc.contributor.author	Zítka, Tomáš
dc.contributor.author	Hlaváč, Miroslav
dc.contributor.author	Zajíc, Zbyněk
dc.contributor.author	Švec, Jan
dc.contributor.author	Bulín, Martin
dc.date.accessioned	2022-03-21T11:00:17Z	-
dc.date.available	2022-03-21T11:00:17Z	-
dc.date.issued	2021
dc.identifier.citation	GRUBER, I. HRÚZ, M. IRCING, P. NEDUCHAL, P. ZÍTKA, T. HLAVÁČ, M. ZAJÍC, Z. ŠVEC, J. BULÍN, M. OCR Improvements for Images of Multi-page Historical Documents. In 23rd International Conference, SPECOM 2021, St. Petersburg, Russia, September 27–30, 2021, Proceedings. Cham: Springer, 2021. pp. 226-237. ISBN: 978-3-030-87801-6, ISSN: 0302-9743	cs
dc.identifier.isbn	978-3-030-87801-6
dc.identifier.issn	0302-9743
dc.identifier.uri	2-s2.0-85116373386
dc.identifier.uri	http://hdl.handle.net/11025/47184
dc.format	12 pp.	cs
dc.format.mimetype	application/pdf
dc.language.iso	en	en
dc.publisher	Springer	en
dc.relation.ispartofseries	23rd International Conference, SPECOM 2021, St. Petersburg, Russia, September 27–30, 2021, Proceedings	en
dc.rights	The full text is accessible within the university to logged-in users.	cs
dc.rights	© Springer Nature Switzerland AG	en
dc.title	OCR Improvements for Images of Multi-page Historical Documents	en
dc.type	conference paper	cs
dc.type	ConferenceObject	en
dc.rights.access	restrictedAccess	en
dc.type.version	publishedVersion	en
dc.description.abstract-translated	This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for the purpose of information retrieval. The pipeline is able to handle images of various quality, whether they were obtained by a digital scanner or camera. The image can contain multiple pages in any layout, but an approximate upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. These are then processed by a deskew algorithm to correct the orientation, and finally read by the Tesseract OCR system that has been retrained on a large set of synthetic images and a small set of annotated real-world documents. By applying the pipeline, we were able to increase the word recall to 60.56%, an absolute gain of 19.19 percentage points over the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.	en
dc.subject.translated	document digitization	en
dc.subject.translated	document layout analysis	en
dc.subject.translated	optical character recognition	en
dc.subject.translated	image preprocessing	en
dc.identifier.doi	10.1007/978-3-030-87802-3_21
dc.type.status	Peer-reviewed	en
dc.identifier.obd	43933456
dc.project.ID	90042/Large research infrastructure, obligation (J) - CESNET II	cs
dc.project.ID	DG20P02OVV018/Digital archive of NKVD/KGB documents relating to Czechoslovakia	cs
Appears in collections:	Articles (NTIS); Conference Papers (KKY); OBD
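
The abstract reports a word recall of 60.56% and an absolute gain of 19.19 percentage points, which implies a baseline of about 41.37% for plain Tesseract OCR. The record does not define how word recall was computed, so the following is only a minimal sketch under an assumed definition: the fraction of reference words (case-insensitive, whitespace-tokenized, counted with multiplicity) that also appear in the OCR output. The function name `word_recall` and the sample strings are hypothetical and not taken from the paper.

```python
from collections import Counter


def word_recall(reference: str, hypothesis: str) -> float:
    """Fraction of reference words recovered in the OCR hypothesis.

    Assumed definition (not confirmed by the record): bag-of-words
    recall with multiplicity, case-insensitive, whitespace tokens.
    """
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Multiset intersection keeps the minimum count of each shared word.
    matched = sum((ref_counts & hyp_counts).values())
    return matched / total


# Hypothetical example: three of four reference words are recovered.
example_recall = word_recall("the quick brown fox", "the quick brown dog")

# Baseline implied by the figures stated in the abstract.
reported_recall = 60.56  # percent, full pipeline
absolute_gain = 19.19    # percentage points over the baseline
baseline_recall = round(reported_recall - absolute_gain, 2)
```

Under this assumed definition, word order is ignored, so the measure suits retrieval-oriented evaluation more than transcription fidelity.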

Files attached to this record:

File	Size	Format
Gruber2021_Chapter_OCRImprovementsForImagesOfMult.pdf	2.27 MB	Adobe PDF

Use this identifier to cite or link to this record: http://hdl.handle.net/11025/47184

All records in DSpace are protected by copyright, all rights reserved.