    Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

    In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First, we create a model for Finnish Fraktur fonts. Second, we test Tesseract with the created Fraktur model and an Antiqua model on single images and on combinations of images with different image preprocessing methods. Compared with the commercial ABBYY FineReader toolkit, our method achieves a 27.48% improvement at the word level over FineReader 7 or 8 and a 9.16% improvement over FineReader 11. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents. Peer reviewed.
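
    The approach above pairs language-specific Tesseract models with image preprocessing such as binarization and noise removal. The following is a minimal sketch of that kind of pipeline using OpenCV and pytesseract; the traineddata names ("fin_frak", "fin") and the specific preprocessing steps are illustrative assumptions, not the exact configuration used in the paper.

        # Sketch: preprocess a scanned page and OCR it with Fraktur and Antiqua models.
        # Assumes Tesseract traineddata files named 'fin_frak' and 'fin' are installed;
        # the paper's actual models and preprocessing parameters may differ.
        import cv2
        import pytesseract

        def preprocess(path):
            gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            denoised = cv2.fastNlMeansDenoising(gray, h=10)              # noise removal
            _, binary = cv2.threshold(denoised, 0, 255,
                                      cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarization
            return binary

        def ocr_both_models(path):
            image = preprocess(path)
            fraktur_text = pytesseract.image_to_string(image, lang="fin_frak")
            antiqua_text = pytesseract.image_to_string(image, lang="fin")
            return fraktur_text, antiqua_text

    In practice the two model outputs (and differently preprocessed variants of the same page) would then be compared or combined, for example by keeping the result with the higher recognition confidence.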

    Creating and Using Ground Truth OCR Sample Data for Finnish Historical Newspapers and Journals

    The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.9 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the website digi.kansalliskirjasto.fi. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929; the years 1920–1929 were opened in January 2018. This paper presents the ground truth Optical Character Recognition (OCR) data of about 500 000 Finnish words that has been compiled at the NLF for the development of a new OCR process for the collection. We discuss the compilation of the data and show basic results of the new OCR process in comparison with the current OCR, using the ground truth data. Peer reviewed.
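
    Comparing a new OCR process with the current one against ground truth data usually comes down to word- and character-level error measures. Below is a minimal sketch of such a comparison; the use of plain whitespace tokenization and positional word matching is a simplifying assumption for illustration, not the NLF's actual evaluation setup.

        # Sketch: word-level accuracy and character error rate against ground truth.
        # Assumes the OCR output and ground truth are aligned page-level strings;
        # real evaluations usually handle hyphenation and line breaks more carefully.

        def levenshtein(a, b):
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,            # deletion
                                    curr[j - 1] + 1,        # insertion
                                    prev[j - 1] + (ca != cb)))  # substitution
                prev = curr
            return prev[-1]

        def word_accuracy(ocr_text, truth_text):
            ocr_words, truth_words = ocr_text.split(), truth_text.split()
            hits = sum(1 for o, t in zip(ocr_words, truth_words) if o == t)
            return hits / max(len(truth_words), 1)

        def char_error_rate(ocr_text, truth_text):
            return levenshtein(ocr_text, truth_text) / max(len(truth_text), 1)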

    Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland

    The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.8 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the website digi.kansalliskirjasto.fi (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. This paper presents work that has been carried out at the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data. Peer reviewed.

    Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

    Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize the Levenshtein edit distance and maximize the number of words correctly identified, with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance. (25 pages, 4 figures)
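
    The tuning loop described above can be reproduced in outline with a multi-objective optimizer: the decision variables are the preprocessing parameters, and the two objectives are the Levenshtein edit distance (minimized) and the number of correctly recognized words (maximized, i.e. its negative minimized). The sketch below uses pymoo's NSGA-II with adaptive thresholding as the single preprocessing step; the parameter ranges, population size, number of generations, and use of pytesseract and the python-Levenshtein package are assumptions for illustration rather than the paper's exact setup.

        # Sketch: tune adaptive-threshold parameters for OCR with NSGA-II (pymoo).
        # Objectives: minimize edit distance to ground truth, maximize correct words
        # (expressed as minimizing its negative). Ranges and settings are illustrative.
        import cv2
        import numpy as np
        import pytesseract
        import Levenshtein                      # python-Levenshtein package (assumed installed)
        from pymoo.core.problem import ElementwiseProblem
        from pymoo.algorithms.moo.nsga2 import NSGA2
        from pymoo.optimize import minimize

        class OCRPreprocessingProblem(ElementwiseProblem):
            def __init__(self, gray_image, ground_truth):
                # x[0]: adaptive-threshold block size, x[1]: constant C
                super().__init__(n_var=2, n_obj=2,
                                 xl=np.array([3.0, 1.0]), xu=np.array([51.0, 20.0]))
                self.gray, self.truth = gray_image, ground_truth

            def _evaluate(self, x, out, *args, **kwargs):
                block = int(x[0]) | 1           # block size must be odd and >= 3
                binary = cv2.adaptiveThreshold(self.gray, 255,
                                               cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                               cv2.THRESH_BINARY, block, float(x[1]))
                text = pytesseract.image_to_string(binary)
                edit = Levenshtein.distance(text, self.truth)
                correct = len(set(text.split()) & set(self.truth.split()))
                out["F"] = [edit, -correct]

        # Usage sketch (paths are placeholders):
        # gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
        # truth = open("page.txt", encoding="utf-8").read()
        # res = minimize(OCRPreprocessingProblem(gray, truth), NSGA2(pop_size=20), ("n_gen", 10))

    The result is a Pareto front of parameter settings trading off the two objectives, from which a setting per document typology could be selected.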

    The reuse of texts in Finnish newspapers and journals, 1771–1920: A digital humanities perspective

    The digital collections of newspapers have given rise to a growing interest in studying them with computational methods. This article contributes to this discussion by presenting a method for detecting text reuse in a large corpus of digitized texts. Empirically, the article is based on the corpus of newspapers and journals from the collection of the National Library of Finland. Often, digitized repositories offer only partial views of what was actually published in printed form. The Finnish collection is unique, however, since it covers all published issues up to the year 1920. This article has a two-fold objective: methodologically, it explores how computational methods can be developed so that text reuse can be effectively identified; empirically, it concentrates on how the circulation of texts developed in Finland from the late eighteenth century to the early twentieth century and what this reveals about the transformation of public discourse in Finland. According to our results, the reuse of texts was an integral part of the press throughout the studied period and was, moreover, part of a wider transnational practice.
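
    Text reuse detection on noisy OCR text is often approached by comparing overlapping character n-grams (shingles) between documents. The sketch below illustrates that general idea with simple shingle overlap; it is a deliberately simplified stand-in and not the BLAST-based method the article itself uses.

        # Sketch: flag candidate text-reuse pairs by character 5-gram (shingle) overlap.
        # Simplified illustration only; the article's detection uses a BLAST-based
        # approach designed to tolerate OCR noise.
        from itertools import combinations

        def shingles(text, n=5):
            text = "".join(text.lower().split())            # crude normalization
            return {text[i:i + n] for i in range(len(text) - n + 1)}

        def jaccard(a, b):
            return len(a & b) / len(a | b) if a | b else 0.0

        def candidate_reuse_pairs(documents, threshold=0.3):
            """documents: dict of id -> text; returns pairs whose shingle sets overlap."""
            sets = {doc_id: shingles(text) for doc_id, text in documents.items()}
            return [(i, j, jaccard(sets[i], sets[j]))
                    for i, j in combinations(sets, 2)
                    if jaccard(sets[i], sets[j]) >= threshold]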

    Tekstien uudelleenkäyttö suomalaisessa sanoma- ja aikakauslehdistössä 1771–1920. Digitaalisten ihmistieteiden näkökulma [The reuse of texts in the Finnish newspaper and periodical press, 1771–1920: a digital humanities perspective]

    This article explores the Finnish newspaper and periodical press between 1771 and 1920 from the perspective of text reuse. Republishing the same text in different contexts is in itself an old and well-known phenomenon, but before the digitization of newspapers and periodicals this feature of the press could not be studied systematically. The primary research material is the digitized OCR corpus of newspapers and periodicals published by the National Library of Finland, from which copying and repetition have been detected with BLAST, a text reuse detection method based on NCBI BLAST and developed in the COMHIS project. For the period 1771–1920, approximately 13.8 million clusters, i.e. longer repeated character strings, were found. The article presents both the BLAST method used for reuse detection and the results of that detection, and discusses what they reveal about the circulation of information in the Finnish press during this period. The study shows that the copying and reuse of texts was a significant part of the Finnish press, and that, as a method, text reuse detection offers a new way to study the movement and routes of information.
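
    Turning pairwise reuse hits into clusters of repeated passages, such as the roughly 13.8 million clusters mentioned above, is essentially a connected-components problem. The sketch below groups pairwise matches with a simple union-find structure; the input format is an assumption for illustration and is unrelated to the project's actual pipeline.

        # Sketch: group pairwise text-reuse matches into clusters with union-find.
        # Input: iterable of (passage_id_a, passage_id_b) pairs flagged as reuse.
        def cluster_matches(pairs):
            parent = {}

            def find(x):
                parent.setdefault(x, x)
                while parent[x] != x:
                    parent[x] = parent[parent[x]]           # path halving
                    x = parent[x]
                return x

            def union(a, b):
                parent[find(a)] = find(b)

            for a, b in pairs:
                union(a, b)

            clusters = {}
            for node in list(parent):
                clusters.setdefault(find(node), []).append(node)
            return list(clusters.values())

        # Example: cluster_matches([("p1", "p2"), ("p2", "p3"), ("p4", "p5")])
        # -> [["p1", "p2", "p3"], ["p4", "p5"]] (ordering may vary)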

    Proceedings of the Research Data And Humanities (RDHUM) 2019 Conference: Data, Methods And Tools

    Analytical bibliography aims to understand the production of books, and systematic methods can be used to build an overall view of publication history. In this paper, we present a state-of-the-art analytical approach to the determination of editions using ESTC metadata. The preliminary results illustrate that metadata cleanup and analysis can provide opportunities for edition determination, which would significantly help projects aiming to do large-scale text mining.
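
    Edition determination from catalogue metadata typically starts by harmonizing noisy fields and grouping records that plausibly describe the same work. The sketch below shows that general idea with pandas; the column names and normalization rules are hypothetical and do not reflect the ESTC schema or the paper's actual method.

        # Sketch: normalize catalogue fields and group records into candidate editions.
        # Column names ('title', 'author', 'publication_year') are hypothetical.
        import re
        import pandas as pd

        def normalize(value):
            value = str(value).lower()
            value = re.sub(r"[^a-z0-9 ]+", " ", value)      # crude punctuation cleanup
            return " ".join(value.split())

        def candidate_editions(records: pd.DataFrame) -> pd.DataFrame:
            df = records.copy()
            df["title_norm"] = df["title"].map(normalize)
            df["author_norm"] = df["author"].map(normalize)
            # Records sharing a normalized title and author are treated as one work;
            # distinct publication years within a group suggest separate editions.
            grouped = (df.groupby(["author_norm", "title_norm"])["publication_year"]
                         .nunique()
                         .reset_index(name="n_candidate_editions"))
            return grouped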