
    ICDAR 2019 Competition on Post-OCR Text Correction

    This paper describes the second round of the ICDAR 2019 competition on post-OCR text correction and presents the different methods submitted by the participants. OCR has been an active research field for over 30 years, but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The challenge consists of two tasks: 1) error detection and 2) error correction. An original dataset of 22M OCR-ed symbols along with an aligned ground truth was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. The data was aggregated from different sources and contains newspapers, historical printed documents, manuscripts and shopping receipts, covering 10 European languages (Bulgarian, Czech, Dutch, English, Finnish, French, German, Polish, Spanish and Slovak). Five teams submitted results; error detection scores vary from 41% to 95%, and the best error correction improvement is 44%. This competition, which counted 34 registrations, illustrates the strong interest of the community in improving OCR output, a key issue in any digitization process involving textual data.
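    The error-detection task above amounts to flagging which OCR tokens differ from the aligned ground truth. A minimal sketch of how such flags could be scored, assuming a simple token-aligned format rather than the competition's official evaluation protocol:

```python
# Hedged sketch of scoring OCR error *detection* against aligned ground
# truth. Token-level alignment and the (precision, recall, F1) triple are
# assumptions for illustration, not the official ICDAR 2019 protocol.

def detection_scores(ocr_tokens, gt_tokens, flagged):
    """ocr_tokens/gt_tokens: aligned token lists; flagged: indices a
    system marked as erroneous. Returns (precision, recall, f1)."""
    truly_wrong = {i for i, (o, g) in enumerate(zip(ocr_tokens, gt_tokens))
                   if o != g}
    flagged = set(flagged)
    tp = len(flagged & truly_wrong)
    precision = tp / len(flagged) if flagged else 0.0
    recall = tp / len(truly_wrong) if truly_wrong else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

p, r, f = detection_scores(["thc", "cat", "sat"],
                           ["the", "cat", "sat"], flagged=[0])
print(p, r, f)  # 1.0 1.0 1.0 -- the one true error was flagged, nothing else
```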

    Optimizing the neural network training for OCR error correction of historical Hebrew texts

    Over the past few decades, large archives of paper-based documents such as books and newspapers have been digitized using Optical Character Recognition. This technology is error-prone, especially for historical documents. To correct OCR errors, post-processing algorithms have been proposed based on natural language analysis and machine learning techniques such as neural networks. A disadvantage of neural networks is the vast amount of manually labeled data required for training, which is often unavailable. This paper proposes an innovative method for training a lightweight neural network for Hebrew OCR post-correction using significantly less manually created data. The main research goal is to develop a method for automatically generating language- and task-specific training data to improve the neural network's results for OCR post-correction, and to investigate which type of dataset is the most effective for OCR post-correction of historical documents. To this end, a series of experiments using several datasets was conducted. The evaluation corpus was based on Hebrew newspapers from the JPress project. An analysis of historical OCR-ed newspapers was performed to learn common language- and corpus-specific OCR errors. We found that training the network using the proposed method is more effective than using randomly generated errors. The results also show that the performance of the neural network for OCR post-correction strongly depends on the genre and area of the training data. Moreover, neural networks trained with the proposed method outperform other state-of-the-art neural networks for OCR post-correction, as well as complex spellcheckers. These results may have practical implications for many digital humanities projects.
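    The core idea of generating training data from observed OCR errors, as opposed to random noise, can be sketched as injecting corpus-specific character confusions into clean text. The confusion table below is a toy illustration, not the paper's learned Hebrew error model:

```python
# Hedged sketch: produce (noisy, clean) training pairs by applying
# character confusions of the kind observed in real OCR output.
# CONFUSIONS is an illustrative toy table, not the paper's error model.
import random

CONFUSIONS = {"rn": "m", "l": "1", "O": "0"}  # common OCR confusion shapes

def corrupt(text, rate=1.0, rng=None):
    """Return a noisy copy of `text`, applying confusions left to right
    with probability `rate` at each match."""
    rng = rng or random.Random(0)  # seeded for reproducible pairs
    out, i = [], 0
    while i < len(text):
        for src, dst in CONFUSIONS.items():
            if text.startswith(src, i) and rng.random() < rate:
                out.append(dst)
                i += len(src)
                break
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

print(corrupt("modern Olive l"))  # modem 01ive 1
```

A training pair is then simply `(corrupt(line), line)`, generated in bulk from any clean corpus in the target language and genre.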

    Creating and Using Ground Truth OCR Sample Data for Finnish Historical Newspapers and Journals

    The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.9 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the website digi.kansalliskirjasto.fi. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929; the years 1920–1929 were opened in January 2018. This paper presents the ground truth Optical Character Recognition data of about 500,000 Finnish words that has been compiled at the NLF for development of a new OCR process for the collection. We discuss compilation of the data and show basic results of the new OCR process in comparison to the current OCR, using the ground truth data.
    Peer reviewed
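    The kind of old-vs-new OCR comparison such ground truth enables is typically a word error rate against the reference transcription. A minimal sketch, with an illustrative toy sample rather than the NLF data:

```python
# Hedged sketch: word error rate (WER) of an OCR hypothesis against a
# ground-truth reference, via Levenshtein distance over word tokens.
# The sample strings are invented illustrations, not NLF data.

def word_error_rate(hyp, ref):
    """Edit distance between word sequences, divided by reference length."""
    h, r = hyp.split(), ref.split()
    # DP table: first row/column hold insertion/deletion costs.
    d = [[i + j if i * j == 0 else 0 for j in range(len(r) + 1)]
         for i in range(len(h) + 1)]
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + (h[i - 1] != r[j - 1]))
    return d[len(h)][len(r)] / len(r)

ref = "suomen kansalliskirjasto on digitoinut sanomalehtia"
old = "suomen kansalliskiriasto on digitoinut sanomalebtia"
print(word_error_rate(old, ref))  # 0.4 -- two of five words wrong
```

Running both the current and the new OCR output through the same function against the 500,000-word ground truth gives directly comparable scores.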

    Rebuilding the Story of a Hero: Information Extraction in Ancient Argentinian Texts

    Large amounts of ancient documents concerning Argentinian history have become available in recent years. This makes it possible to find interesting and useful aggregated information. This work proposes the application of Natural Language Processing, Text Mining and Visualization tools to repositories of ancient Argentinian documents. Conceptual maps and entity networks are the first target of this preliminary paper. The first step is the normalization of OCR-acquired books about General Güemes. Exploratory analyses reveal the presence of manifold spelling errors, due to the OCR acquisition process of the volumes. We propose smart automatic ways of overcoming this issue in the normalization process. In addition, a first topic landscape of a subset of volumes is obtained and analysed via Topic Modelling tools.
    Sociedad Argentina de Informática e Investigación Operativa
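    One common automatic normalization strategy of the kind the abstract alludes to is mapping each OCR-damaged word to its nearest lexicon entry by string similarity. A minimal sketch, with a toy lexicon and a threshold chosen for illustration (the paper's actual method is not specified here):

```python
# Hedged sketch: normalize OCR-mangled words by similarity lookup in a
# lexicon, using difflib's ratio-based matching from the standard library.
# The lexicon entries and the 0.7 cutoff are illustrative assumptions.
from difflib import get_close_matches

LEXICON = ["general", "güemes", "historia", "argentina"]  # toy lexicon

def normalize(word, cutoff=0.7):
    """Return the closest lexicon entry, or the word itself if none is near."""
    match = get_close_matches(word.lower(), LEXICON, n=1, cutoff=cutoff)
    return match[0] if match else word

print(normalize("g¨uemes"))   # mangled diacritic  -> güemes
print(normalize("hist0ria"))  # digit confusion    -> historia
```

In practice the lexicon would be built from period-appropriate Spanish vocabulary plus named entities extracted from the corpus itself, so that archaic spellings are not "corrected" away.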