144 research outputs found

    Old Content and Modern Tools : Searching Named Entities in a Finnish OCRed Historical Newspaper Collection 1771–1910

    Get PDF
    Named Entity Recognition (NER), search, classification and tagging of names and name-like informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general, the performance of a NER system is genre- and domain-dependent and also used entity categories vary [Nadeau and Sekine 2007]. The most general set of named entities is usually some version of a tripartite categorization of locations, persons, and organizations. In this paper we report trials and evaluation of NER with data from a digitized Finnish historical newspaper collection (Digi). Experiments, results, and discussion of this research serve development of the web collection of historical Finnish newspapers. Digi collection contains 1,960,921 pages of newspaper material from 1771–1910 in both Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70–75 % [Kettunen and Pääkkönen 2016]. Our principal NE tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We also show results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. Three other tools are also evaluated briefly. This paper reports the first large scale results of NER in a historical Finnish OCRed newspaper collection. Results of this research supplement NER results of other languages with similar noisy data. As the results are also achieved with a small and morphologically rich language, they illuminate the relatively well-researched area of Named Entity Recognition from a new perspective.Peer reviewe

    Tagging Named Entities in 19th Century and Modern Finnish Newspaper Material with a Finnish Semantic Tagger

    Get PDF
    Named Entity Recognition (NER), search, classification and tagging of names and name like informational elements in texts, has become a standard information extraction procedure for textual data during the last two decades. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent. Also used entity categories vary a lot (Nadeau and Sekine, 2007). The most general set of named entities is usually some version of three part categorization of locations, persons and corporations. In this paper we report evaluation results of NER with two different data: digitized Finnish historical newspaper collection Digi and modern Finnish technology news, Digitoday. Historical newspaper collection Digi contains 1,960,921 pages of newspaper material from years 1771–1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 70–75%, and its NER evaluation collection consists of 75 931 words (Kettunen and Pääkkönen, 2016; Kettunen et al., 2016). Digitoday’s annotated collection consists of 240 articles in six different sections of the newspaper. Our new evaluated tool for NER tagging is non-conventional: it is a rule-based semantic tagger of Finnish, the FST (Löfberg et al., 2005), and its results are compared to those of a standard rule-based NE tagger, FiNER. The FST achieves up to 55–61 F-score with locations and F-score of 51–52 with persons with the historical newspaper data, and its performance is comparative to FiNER with locations. With the modern Finnish technology news of Digitoday FiNER achieves F-scores of up to 79 with locations at best. Person names show worst performance; their F-score varies from 33 to 66. The FST performs equally well as FiNER with Digitoday’s location names, but is worse with persons. With corporations, FST is at its worst, while FiNER performs reasonably well. Overall our results show that a general semantic tool like the FST is able to perform in a restricted semantic task of name recognition almost as well as a dedicated NE tagger. As NER is a popular task in information extraction and retrieval, our results show that NE tagging does not need to be only a task of dedicated NE taggers, but it can be performed equally well with more general multipurpose semantic tools.Peer reviewe

    Transfer learning for historical corpora: An assessment on post-OCR correction and named entity recognition

    Get PDF
    Transfer learning in Natural Language Processing, mainly in the form of pre-trained language models, has recently delivered substantial gains across a range of tasks. Scholars and practitioners working with OCRed historical corpora are thus increasingly exploring the use of pre-trained language models. Nevertheless, the specific challenges posed by historical documents, including OCR quality and linguistic change, call for a critical assessment of the use of pre-trained language models in this setting. We consider two shared tasks, ICDAR2019 (post-OCR correction) and CLEF-HIPE-2020 (Named Entity Recognition, NER), and systematically assess using pre-trained language models with data in French, German and English. We find that using pre-trained language models helps with NER but less so with post-OCR correction. Pre-trained language models should therefore be used critically when working with OCRed historical corpora. We release our code base, in order to allow replicating our results and testing other pre-trained representations

    Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910

    Get PDF
    Named entity recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent and also used entity categories vary [1]. The most general set of named entities is usually some version of three partite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data out of a digitized Finnish historical newspaper collection Digi. Digi collection contains 1,960,921 pages of newspaper material from years 1771– 1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We show also results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. Seco’s tools achieve 30.0–60.0 F-score with locations and persons. Performance of FiNER and SeCo’s tools with the data shows that at best about half of named entities can be recognized even in a quite erroneous OCRed textNamed entity recognition (NER), search, classification and tagging of names and name like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records, persons, locations, chemical compounds, protein families, animals etc. In general a NER system’s performance is genre and domain dependent and also used entity categories vary [1]. The most general set of named entities is usually some version of three partite categorization of locations, persons and organizations. In this paper we report first trials and evaluation of NER with data out of a digitized Finnish historical newspaper collection Digi. Digi collection contains 1,960,921 pages of newspaper material from years 1771– 1910 both in Finnish and Swedish. We use only material of Finnish documents in our evaluation. The OCRed newspaper collection has lots of OCR errors; its estimated word level correctness is about 74–75 % [2]. Our principal NER tagger is a rule-based tagger of Finnish, FiNER, provided by the FIN-CLARIN consortium. We show also results of limited category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of the Aalto University. FiNER is able to achieve up to 60.0 F-score with named entities in the evaluation data. Seco’s tools achieve 30.0–60.0 F-score with locations and persons. Performance of FiNER and SeCo’s tools with the data shows that at best about half of named entities can be recognized even in a quite erroneous OCRed text.Peer reviewe

    Generating a training corpus for OCR post-correction using encoder-decoder model

    Get PDF
    International audienceIn this paper we present a novel approach to the automatic correction of OCR-induced orthographic errors in a given text. While current systems depend heavily on large training corpora or exter- nal information, such as domain-specific lexicons or confidence scores from the OCR process, our system only requires a small amount of relatively clean training data from a representative corpus to learn a character-based statistical language model using Bidirectional Long Short- Term Memory Networks (biLSTMs). We demonstrate the versatility and adaptability of our system on different text corpora with varying degrees of textual noise, in- cluding a real-life OCR corpus in the med- ical domain

    Diachronic Evaluation of NER Systems on Old Newspapers

    Get PDF
    In recent years, many cultural institutions have engaged in large-scale newspaper digitization projects and large amounts of historical texts are being acquired (via transcription or OCRization). Beyond document preservation, the next step consists in providing an enhanced access to the content of these digital resources. In this regard, the processing of units which act as referential anchors, namely named entities (NE), is of particular importance. Yet, the application of standard NE tools to historical texts faces several challenges and performances are often not as good as on contemporary documents. This paper investigates the performances of different NE recognition tools applied on old newspapers by conducting a diachronic evaluation over 7 time-series taken from the archives of Swiss newspaper Le Temps
    corecore