416 research outputs found

OCR and post-correction of historical Finnish texts

Author: Drobac Senka
Kauppinen Pekka Sakari
Linden Bo Krister Johan
Publication venue: 'Linkoping University Electronic Press'
Publication date: 01/01/2017

This paper presents experiments on optical character recognition (OCR) as a combination of Ocropy software and data-driven spelling correction that uses weighted finite-state methods. Both model training and testing were done on Finnish corpora of historical newspaper text, and the best combination of OCR and post-processing models gives 95.21% character recognition accuracy. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Open Source Tesseract in Re-OCR of Finnish Fraktur from 19th and Early 20th Century Newspapers and Journals – Collected Notes on Quality Improvement

Author: Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Publication venue: CEUR-WS.org
Publication date: 06/03/2019

Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish

Author: Duong Quan
Hengchen Simon
Hämäläinen Mika
Publication venue: 'Linkoping University Electronic Press'
Publication date: 03/11/2020

Peer reviewed.

arXiv.org e-Print Archive

NEUROSURGERY ENTHUSIASTIC WOMEN SOCIETY

Helsingin yliopiston digitaalinen arkisto

Optical character recognition with neural networks and post-correction with finite state methods

Author: Drobac Senka
Linden Krister
Publication date: 01/12/2020

The optical character recognition (OCR) quality of the historical part of the Finnish newspaper and journal corpus is too low for reliable search and scientific research on the OCRed data. The estimated character error rate (CER) of the corpus, achieved with commercial software, is between 8 and 13%. There have been earlier attempts to train high-quality OCR models with open-source software, like Ocropy (https://github.com/tmbdev/ocropy) and Tesseract (https://github.com/tesseract-ocr/tesseract), but so far, none of the methods have managed to successfully train a mixed model that recognizes all of the data in the corpus, which would be essential for an efficient re-OCRing of the corpus. The difficulty lies in the fact that the corpus is printed in the two main languages of Finland (Finnish and Swedish) and in two font families (Blackletter and Antiqua). In this paper, we explore the training of a variety of OCR models with deep neural networks (DNN). First, we find an optimal DNN for our data and, with additional training data, successfully train high-quality mixed-language models. Furthermore, we revisit the effect of confidence voting on the OCR results with different model combinations. Finally, we perform post-correction on the new OCR results and perform error analysis. The results show a significant boost in accuracy, resulting in 1.7% CER on the Finnish and 2.7% CER on the Swedish test set. The greatest accomplishment of the study is the successful training of one mixed-language model for the entire corpus and finding a voting setup that further improves the results. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

ICDAR 2019 Competition on Post-OCR Text Correction

Author: Coustaty Mickaël
Doucet Antoine
Moreux Jean-Philippe
Rigaud Christophe
Publication venue: HAL CCSD
Publication date: 20/09/2019

This paper describes the second round of the ICDAR 2019 competition on post-OCR text correction and presents the different methods submitted by the participants. OCR has been an active research field for over 30 years, but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The present challenge consists of two tasks: 1) error detection and 2) error correction. An original dataset of 22M OCR-ed symbols along with an aligned ground truth was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. Different sources were aggregated and contain newspapers, historical printed documents as well as manuscripts and shopping receipts, covering 10 European languages (Bulgarian, Czech, Dutch, English, Finnish, French, German, Polish, Spanish and Slovak). Five teams submitted results; the error detection scores vary from 41 to 95% and the best error correction improvement is 44%. This competition, which counted 34 registrations, illustrates the strong interest of the community in improving OCR output, which is a key issue for any digitization process involving textual data.

Creating and Using Ground Truth OCR Sample Data for Finnish Historical Newspapers and Journals

Author: Kervinen Jukka
Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Publication date: 03/04/2018

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.9 million pages, mainly in Finnish and Swedish. Out of these, about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. The years 1920–1929 were opened in January 2018. This paper presents the ground truth Optical Character Recognition data of about 500 000 Finnish words that has been compiled at the NLF for development of a new OCR process for the collection. We discuss compilation of the data and show basic results of the new OCR process in comparison to current OCR using the ground truth data. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Old Content and Modern Tools : Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771–1910

Author: Kettunen Kimmo
Kuokkala Juha
Löfberg Laura
Mäkelä Eetu
Ruokolainen Teemu
Publication date: 09/11/2016

Named Entity Recognition (NER), search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, the performance of a NER system is genre- and domain-dependent, and the entity categories used also vary [Nadeau and Sekine 2007]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons, and organizations. In this paper we report trials and evaluation of NER with data from a digitized Finnish historical newspaper collection (Digi). Experiments, results, and discussion of this research serve the development of the web collection of historical Finnish newspapers. The Digi collection contains 1,960,921 pages of newspaper material from 1771–1910 in both Finnish and Swedish. We use only material from Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 70–75 % [Kettunen and Pääkkönen 2016]. Our principal NE tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. Three other tools are also evaluated briefly. This paper reports the first large-scale results of NER in a historical Finnish OCRed newspaper collection. Results of this research supplement NER results of other languages with similar noisy data. As the results are also achieved with a small and morphologically rich language, they illuminate the relatively well-researched area of Named Entity Recognition from a new perspective. Peer reviewed.

arXiv.org e-Print Archive

Helsingin yliopiston digitaalinen arkisto

Linguistic change and historical periodization of Old Literary Finnish

Author: Alnajjar Khalid
Hämäläinen Mika
Partanen Niko
Rueter Jack
Publication venue: The Association for Computational Linguistics
Publication date: 01/01/2021

Peer reviewed.

Helsingin yliopiston digitaalinen arkisto

Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland

Author: Kettunen Kimmo Tapio
Koistinen Jani Mika Olavi
Ruokolainen Teemu Petteri
Publication date: 03/04/2018

The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.8 million pages, mainly in Finnish and Swedish. Out of these, about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. This paper presents work that has been carried out in the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto