    Improving Optical Character Recognition of Finnish Historical Newspapers with a Combination of Fraktur & Antiqua Models and Image Preprocessing

    In this paper we describe a method for improving the optical character recognition (OCR) toolkit Tesseract for Finnish historical documents. First, we create a model for Finnish Fraktur fonts. Second, we test Tesseract with the created Fraktur model and an Antiqua model on single images and on combinations of images with different image preprocessing methods. Compared with the commercial ABBYY FineReader toolkit, our method achieves a 27.48% improvement at the word level over FineReader 7 or 8 and a 9.16% improvement over FineReader 11. Keywords: Optical Character Recognition, OCR Quality, Digital Image Processing, Binarization, Noise Removal, Tesseract, Finnish, Historical Documents. Peer reviewed.
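
    The approach above pairs language-specific Tesseract models with image preprocessing such as binarization and noise removal. The following is a minimal sketch of that kind of pipeline using OpenCV and pytesseract; the traineddata names ("fin_frak", "fin") and the specific preprocessing steps are illustrative assumptions, not the exact configuration used in the paper.

        # Sketch: preprocess a scanned page and OCR it with Fraktur and Antiqua models.
        # Assumes Tesseract traineddata files named 'fin_frak' and 'fin' are installed;
        # the paper's actual models and preprocessing parameters may differ.
        import cv2
        import pytesseract

        def preprocess(path):
            gray = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            denoised = cv2.fastNlMeansDenoising(gray, h=10)              # noise removal
            _, binary = cv2.threshold(denoised, 0, 255,
                                      cv2.THRESH_BINARY + cv2.THRESH_OTSU)  # binarization
            return binary

        def ocr_both_models(path):
            image = preprocess(path)
            fraktur_text = pytesseract.image_to_string(image, lang="fin_frak")
            antiqua_text = pytesseract.image_to_string(image, lang="fin")
            return fraktur_text, antiqua_text

    In practice the two model outputs (and differently preprocessed variants of the same page) would then be compared or combined, for example by keeping the result with the higher recognition confidence.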

    Creating and Using Ground Truth OCR Sample Data for Finnish Historical Newspapers and Journals

    The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.9 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the website digi.kansalliskirjasto.fi. The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929; the years 1920–1929 were opened in January 2018. This paper presents the ground truth Optical Character Recognition (OCR) data of about 500 000 Finnish words that has been compiled at the NLF for the development of a new OCR process for the collection. We discuss the compilation of the data and show basic results of the new OCR process in comparison with the current OCR, using the ground truth data. Peer reviewed.
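
    Comparing a new OCR process with the current one against ground truth data usually comes down to word- and character-level error measures. Below is a minimal sketch of such a comparison; the use of plain whitespace tokenization and positional word matching is a simplifying assumption for illustration, not the NLF's actual evaluation setup.

        # Sketch: word-level accuracy and character error rate against ground truth.
        # Assumes the OCR output and ground truth are aligned page-level strings;
        # real evaluations usually handle hyphenation and line breaks more carefully.

        def levenshtein(a, b):
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                curr = [i]
                for j, cb in enumerate(b, 1):
                    curr.append(min(prev[j] + 1,            # deletion
                                    curr[j - 1] + 1,        # insertion
                                    prev[j - 1] + (ca != cb)))  # substitution
                prev = curr
            return prev[-1]

        def word_accuracy(ocr_text, truth_text):
            ocr_words, truth_words = ocr_text.split(), truth_text.split()
            hits = sum(1 for o, t in zip(ocr_words, truth_words) if o == t)
            return hits / max(len(truth_words), 1)

        def char_error_rate(ocr_text, truth_text):
            return levenshtein(ocr_text, truth_text) / max(len(truth_text), 1)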

    Research and Development Efforts on the Digitized Historical Newspaper and Journal Collection of The National Library of Finland

    The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.8 million pages, mainly in Finnish and Swedish. Of these, about 7.36 million pages are freely available on the website digi.kansalliskirjasto.fi (Digi). The copyright-restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. This paper presents work that has been carried out at the NLF related to the historical newspaper and journal collection. We offer an overall account of research and development related to the data. Peer reviewed.

    Optimization of Image Processing Algorithms for Character Recognition in Cultural Typewritten Documents

    Linked Data is used in various fields as a new way of structuring and connecting data. Cultural heritage institutions have been using linked data to improve archival descriptions and facilitate the discovery of information. Most archival records have digital representations of physical artifacts in the form of scanned images that are non-machine-readable. Optical Character Recognition (OCR) recognizes text in images and translates it into machine-encoded text. This paper evaluates the impact of image processing methods and parameter tuning in OCR applied to typewritten cultural heritage documents. The approach uses a multi-objective problem formulation to minimize the Levenshtein edit distance and maximize the number of words correctly identified, with a non-dominated sorting genetic algorithm (NSGA-II) to tune the methods' parameters. Evaluation results show that parameterization by digital representation typology benefits the performance of image pre-processing algorithms in OCR. Furthermore, our findings suggest that employing image pre-processing algorithms in OCR might be more suitable for typologies where the text recognition task without pre-processing does not produce good results. In particular, Adaptive Thresholding, Bilateral Filter, and Opening are the best-performing algorithms for the theatre plays' covers, letters, and overall dataset, respectively, and should be applied before OCR to improve its performance. (25 pages, 4 figures)
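
    The tuning loop described above can be reproduced in outline with a multi-objective optimizer: the decision variables are the preprocessing parameters, and the two objectives are the Levenshtein edit distance (minimized) and the number of correctly recognized words (maximized, i.e. its negative minimized). The sketch below uses pymoo's NSGA-II with adaptive thresholding as the single preprocessing step; the parameter ranges, population size, number of generations, and use of pytesseract and the python-Levenshtein package are assumptions for illustration rather than the paper's exact setup.

        # Sketch: tune adaptive-threshold parameters for OCR with NSGA-II (pymoo).
        # Objectives: minimize edit distance to ground truth, maximize correct words
        # (expressed as minimizing its negative). Ranges and settings are illustrative.
        import cv2
        import numpy as np
        import pytesseract
        import Levenshtein                      # python-Levenshtein package (assumed installed)
        from pymoo.core.problem import ElementwiseProblem
        from pymoo.algorithms.moo.nsga2 import NSGA2
        from pymoo.optimize import minimize

        class OCRPreprocessingProblem(ElementwiseProblem):
            def __init__(self, gray_image, ground_truth):
                # x[0]: adaptive-threshold block size, x[1]: constant C
                super().__init__(n_var=2, n_obj=2,
                                 xl=np.array([3.0, 1.0]), xu=np.array([51.0, 20.0]))
                self.gray, self.truth = gray_image, ground_truth

            def _evaluate(self, x, out, *args, **kwargs):
                block = int(x[0]) | 1           # block size must be odd and >= 3
                binary = cv2.adaptiveThreshold(self.gray, 255,
                                               cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                                               cv2.THRESH_BINARY, block, float(x[1]))
                text = pytesseract.image_to_string(binary)
                edit = Levenshtein.distance(text, self.truth)
                correct = len(set(text.split()) & set(self.truth.split()))
                out["F"] = [edit, -correct]

        # Usage sketch (paths are placeholders):
        # gray = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)
        # truth = open("page.txt", encoding="utf-8").read()
        # res = minimize(OCRPreprocessingProblem(gray, truth), NSGA2(pop_size=20), ("n_gen", 10))

    The result is a Pareto front of parameter settings trading off the two objectives, from which a setting per document typology could be selected.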

    The reuse of texts in Finnish newspapers and journals, 1771–1920: A digital humanities perspective

    The digital collections of newspapers have given rise to a growing interest in studying them with computational methods. This article contributes to this discussion by presenting a method for detecting text reuse in a large corpus of digitized texts. Empirically, the article is based on the corpus of newspapers and journals from the collection of the National Library of Finland. Often, digitized repositories offer only partial views of what was actually published in printed form. The Finnish collection is unique, however, since it covers all published issues up to the year 1920. This article has a two-fold objective: methodologically, it explores how computational methods can be developed so that text reuse can be effectively identified; empirically, it concentrates on how the circulation of texts developed in Finland from the late eighteenth century to the early twentieth century and what this reveals about the transformation of public discourse in Finland. According to our results, the reuse of texts was an integral part of the press throughout the studied period and was, moreover, part of a wider transnational practice.
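
    Text reuse detection on noisy OCR text is often approached by comparing overlapping character n-grams (shingles) between documents. The sketch below illustrates that general idea with simple shingle overlap; it is a deliberately simplified stand-in and not the BLAST-based method the article itself uses.

        # Sketch: flag candidate text-reuse pairs by character 5-gram (shingle) overlap.
        # Simplified illustration only; the article's detection uses a BLAST-based
        # approach designed to tolerate OCR noise.
        from itertools import combinations

        def shingles(text, n=5):
            text = "".join(text.lower().split())            # crude normalization
            return {text[i:i + n] for i in range(len(text) - n + 1)}

        def jaccard(a, b):
            return len(a & b) / len(a | b) if a | b else 0.0

        def candidate_reuse_pairs(documents, threshold=0.3):
            """documents: dict of id -> text; returns pairs whose shingle sets overlap."""
            sets = {doc_id: shingles(text) for doc_id, text in documents.items()}
            return [(i, j, jaccard(sets[i], sets[j]))
                    for i, j in combinations(sets, 2)
                    if jaccard(sets[i], sets[j]) >= threshold]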

    Tekstien uudelleenkäyttö suomalaisessa sanoma- ja aikakauslehdistössä 1771–1920. Digitaalisten ihmistieteiden näkökulma [The reuse of texts in the Finnish newspaper and periodical press, 1771–1920: a digital humanities perspective]

    This article explores the Finnish newspaper and periodical press between 1771 and 1920 from the perspective of text reuse. Republishing the same text in different contexts is in itself an old and well-known phenomenon, but before the digitization of newspapers and periodicals this feature of the press could not be studied systematically. The primary research material is the digitized OCR corpus of newspapers and periodicals published by the National Library of Finland, from which copying and repetition have been detected with BLAST, a text reuse detection method based on NCBI BLAST and developed in the COMHIS project. For the period 1771–1920, approximately 13.8 million clusters, i.e. longer repeated character strings, were found. The article presents both the BLAST method used for reuse detection and the results of that detection, and discusses what they reveal about the circulation of information in the Finnish press during this period. The study shows that the copying and reuse of texts was a significant part of the Finnish press, and that, as a method, text reuse detection offers a new way to study the movement and routes of information.
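
    Turning pairwise reuse hits into clusters of repeated passages, such as the roughly 13.8 million clusters mentioned above, is essentially a connected-components problem. The sketch below groups pairwise matches with a simple union-find structure; the input format is an assumption for illustration and is unrelated to the project's actual pipeline.

        # Sketch: group pairwise text-reuse matches into clusters with union-find.
        # Input: iterable of (passage_id_a, passage_id_b) pairs flagged as reuse.
        def cluster_matches(pairs):
            parent = {}

            def find(x):
                parent.setdefault(x, x)
                while parent[x] != x:
                    parent[x] = parent[parent[x]]           # path halving
                    x = parent[x]
                return x

            def union(a, b):
                parent[find(a)] = find(b)

            for a, b in pairs:
                union(a, b)

            clusters = {}
            for node in list(parent):
                clusters.setdefault(find(node), []).append(node)
            return list(clusters.values())

        # Example: cluster_matches([("p1", "p2"), ("p2", "p3"), ("p4", "p5")])
        # -> [["p1", "p2", "p3"], ["p4", "p5"]] (ordering may vary)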

    Proceedings of the Research Data And Humanities (RDHUM) 2019 Conference: Data, Methods And Tools

    Analytical bibliography aims to understand the production of books, and systematic methods can be used to build an overall view of publication history. In this paper, we present a state-of-the-art analytical approach to the determination of editions using ESTC metadata. The preliminary results illustrate that metadata cleanup and analysis can provide opportunities for edition determination, which would significantly help projects aiming to do large-scale text mining.
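
    Edition determination from catalogue metadata typically starts by harmonizing noisy fields and grouping records that plausibly describe the same work. The sketch below shows that general idea with pandas; the column names and normalization rules are hypothetical and do not reflect the ESTC schema or the paper's actual method.

        # Sketch: normalize catalogue fields and group records into candidate editions.
        # Column names ('title', 'author', 'publication_year') are hypothetical.
        import re
        import pandas as pd

        def normalize(value):
            value = str(value).lower()
            value = re.sub(r"[^a-z0-9 ]+", " ", value)      # crude punctuation cleanup
            return " ".join(value.split())

        def candidate_editions(records: pd.DataFrame) -> pd.DataFrame:
            df = records.copy()
            df["title_norm"] = df["title"].map(normalize)
            df["author_norm"] = df["author"].map(normalize)
            # Records sharing a normalized title and author are treated as one work;
            # distinct publication years within a group suggest separate editions.
            grouped = (df.groupby(["author_norm", "title_norm"])["publication_year"]
                         .nunique()
                         .reset_index(name="n_candidate_editions"))
            return grouped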