976 research outputs found
Rerunning OCR: A Machine Learning Approach to Quality Assessment and Enhancement Prediction
Iterating with new and improved OCR solutions requires deciding which candidates to target for reprocessing. This especially applies when the underlying data collection is of considerable size and rather diverse in terms of fonts, languages, periods of publication and, consequently, OCR quality. This article captures the efforts of the National Library of Luxembourg to support those targeting decisions. They are crucial for guaranteeing low computational overhead and a reduced risk of quality degradation, combined with a more quantifiable OCR improvement. In particular, this work explains the library's methodology for text-block-level quality assessment. Through an extension of this technique, a regression model that takes into account the enhancement potential of a new OCR engine is also presented. Both mark promising approaches, especially for cultural institutions dealing with historical data of lower quality.

Comment: Journal of Data Mining and Digital Humanities; Major revision
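Block-level quality assessment of the kind the abstract describes can, under strong assumptions, be approximated by a lexicon-match rate: score each text block by the share of its tokens found in a reference dictionary, and flag low-scoring blocks as reprocessing candidates. This is only an illustrative sketch, not the library's actual method; the `block_quality` function, the toy lexicon and the thresholding idea are all our own assumptions.

```python
import re

def block_quality(text, lexicon):
    """Estimate OCR quality of a text block as the share of
    alphabetic tokens found in a reference lexicon (0.0 to 1.0)."""
    tokens = [t.lower() for t in re.findall(r"[^\W\d_]+", text)]
    if not tokens:
        return 0.0
    hits = sum(1 for t in tokens if t in lexicon)
    return hits / len(tokens)

# Toy lexicon and examples; a real one would come from a corpus wordlist.
lexicon = {"the", "library", "of", "historical", "newspapers", "quality"}
clean = "The library of historical newspapers"
noisy = "Tbe l1brary of bist0rical newspapcrs"

# Blocks scoring below some threshold become candidates for re-OCR.
```

Such a score is cheap to compute per block, which matters at the collection sizes the abstract mentions; the regression step would then map scores (and engine metadata) to an expected improvement.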
Modern Tools for Old Content - in Search of Named Entities in a Finnish OCRed Historical Newspaper Collection 1771-1910
Named entity recognition (NER), the search, classification and tagging of names and name-like frequent informational elements in texts, has become a standard information extraction procedure for textual data. NER has been applied to many types of texts and different types of entities: newspapers, fiction, historical records; persons, locations, chemical compounds, protein families, animals, etc. In general, a NER system's performance is genre- and domain-dependent, and the entity categories used also vary [1]. The most general set of named entities is usually some version of a tripartite categorization into locations, persons and organizations. In this paper we report first trials and an evaluation of NER on data from Digi, a digitized Finnish historical newspaper collection. The Digi collection contains 1,960,921 pages of newspaper material from the years 1771–1910, in both Finnish and Swedish. We use only the Finnish documents in our evaluation. The OCRed newspaper collection has many OCR errors; its estimated word-level correctness is about 74–75% [2]. Our principal NER tagger is FiNER, a rule-based tagger of Finnish provided by the FIN-CLARIN consortium. We also show results of limited-category semantic tagging with tools of the Semantic Computing Research Group (SeCo) of Aalto University. FiNER achieves up to a 60.0 F-score on named entities in the evaluation data. SeCo's tools achieve 30.0–60.0 F-scores on locations and persons. The performance of FiNER and SeCo's tools on this data shows that, at best, about half of the named entities can be recognized even in quite erroneous OCRed text.
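F-scores like those quoted above are typically computed at the entity level, comparing the sets of gold and predicted annotations. A minimal sketch of that computation follows; the `entity_f1` helper and the toy Finnish examples are illustrative, not taken from the paper.

```python
def entity_f1(gold, predicted):
    """Precision, recall and F1 over sets of (entity_text, entity_type) pairs."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)  # exact matches on text and type
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy example: one entity is missed because OCR corrupted its surface form.
gold = [("Helsinki", "LOC"), ("Aalto", "ORG"), ("Suomi", "LOC"), ("Runeberg", "PER")]
pred = [("Helsinki", "LOC"), ("Aalto", "ORG"), ("Runcberg", "PER")]
```

Exact-match scoring like this is one reason OCR noise hurts NER so directly: a single corrupted character in the entity span costs both a false negative and, often, a false positive.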
Trading Consequences: A Case Study of Combining Text Mining and Visualization to Facilitate Document Exploration
Large-scale digitization efforts and the availability of computational methods, including text mining and information visualization, have enabled new approaches to historical research. However, we lack case studies of how these methods can be applied in practice and what their potential impact may be. Trading Consequences is an interdisciplinary research project between environmental historians, computational linguists and visualization specialists. It combines text mining and information visualization alongside traditional research methods in environmental history to explore commodity trade in the nineteenth century from a global perspective. Along with a unique data corpus, this project developed three visual interfaces to enable the exploration and analysis of four historical document collections, consisting of approximately 200,000 documents and 11 million pages related to commodity trading. In this paper we discuss the potential and limitations of our approach based on feedback from historians we elicited over the course of this project. Informing the design of such tools in the larger context of digital humanities projects, our findings show that visualization-based interfaces are a valuable starting point to large-scale explorations in historical research. Besides providing multiple visual perspectives on the document collection to highlight general patterns, it is important to provide a context in which these patterns occur and offer analytical tools for more in-depth investigations.
Evaluation of Full-Text Data with Open-Source Components (Evaluation von Volltextdaten mit Open-Source-Komponenten)
In the area of full-text recognition, fully-fledged open-source systems are available today. Established open-source tools from the fields of Data Science (DS), Information Retrieval (IR) and Natural Language Processing (NLP) can also be used to evaluate the results. After a brief overview of common evaluation procedures and metrics, the application of these tools in the DFG-funded project „Digitisation of Historical German Newspapers I (Digitalisierung Historischer Deutscher Zeitungen I)" at the University and State Library Saxony-Anhalt (ULB) is reported as an example.
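The standard metrics for evaluating full-text quality against ground truth are the character error rate (CER) and word error rate (WER), both derived from Levenshtein edit distance. A minimal pure-Python sketch follows; the function names and the error-rate normalization by reference length are the common convention, but specific tooling in the project above may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character error rate: edits per reference character."""
    return edit_distance(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    """Word error rate: edits per reference word."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / len(ref)
```

Because the same dynamic program works on strings and on token lists, one implementation serves both metrics; dedicated tools add alignment reports on top of this core.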
Plague Dot Text: Text Mining and Annotation of Outbreak Reports of the Third Plague Pandemic (1894-1952)
The design of models that govern diseases in populations is commonly built on information and data gathered from past outbreaks. However, epidemic outbreaks are never captured in statistical data alone but are communicated through narratives, supported by empirical observations. Outbreak reports discuss correlations between populations, locations and the disease to infer insights into causes, vectors and potential interventions. The problem with these narratives is usually the lack of a consistent structure or strong conventions, which prohibits their formal analysis across larger corpora. Our interdisciplinary research investigates more than 100 reports from the third plague pandemic (1894-1952), evaluating ways of building a corpus to extract and structure this narrative information through text mining and manual annotation. In this paper we discuss the progress of our ongoing exploratory project: how we enhance optical character recognition (OCR) methods to improve text capture, and our approach to structuring the narratives and identifying relevant entities in the reports. The structured corpus is made available via Solr, enabling search and analysis across the whole collection for future research dedicated, for example, to the identification of concepts. We show preliminary visualisations of the characteristics of causation, and of differences with respect to gender, derived from syntactic-category-dependent corpus statistics. Our goal is to develop structured accounts of some of the most significant concepts that were used to understand the epidemiology of the third plague pandemic around the globe. The corpus enables researchers to analyse the reports collectively, allowing for deep insights into the global epidemiological consideration of plague in the early twentieth century.

Comment: Journal of Data Mining & Digital Humanities 202
Assessing the impact of OCR quality on downstream NLP tasks
A growing volume of heritage data is being digitized and made available as text via optical character recognition (OCR). Scholars and libraries are increasingly using OCR-generated text for retrieval and analysis. However, the process of creating text through OCR introduces varying degrees of error to the text. The impact of these errors on natural language processing (NLP) tasks has only been partially studied. We perform a series of extrinsic assessment tasks, sentence segmentation, named entity recognition, dependency parsing, information retrieval, topic modelling and neural language model fine-tuning, using popular, out-of-the-box tools in order to quantify the impact of OCR quality on these tasks. We find a consistent impact of OCR errors on our downstream tasks, with some tasks harmed more irredeemably than others. Based on these results, we offer some preliminary guidelines for working with text produced through OCR.
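One common way to study such effects is to inject synthetic OCR-style character confusions into clean text and watch a downstream proxy metric degrade as the error rate rises. The sketch below is our own illustration of that idea, not the authors' protocol: the confusion table, the `add_ocr_noise` helper and the Jaccard token-overlap proxy are all assumed for the example.

```python
import random

# A small table of visually confusable substitutions typical of OCR output.
CONFUSIONS = {"e": "c", "o": "0", "l": "1", "i": "l", "a": "u", "n": "u"}

def add_ocr_noise(text, error_rate, seed=0):
    """Replace confusable characters with probability error_rate (reproducible)."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch in CONFUSIONS and rng.random() < error_rate:
            out.append(CONFUSIONS[ch])
        else:
            out.append(ch)
    return "".join(out)

def token_overlap(a, b):
    """Jaccard overlap of token sets, a crude stand-in for how a
    bag-of-words retrieval task sees the noisy text."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

clean = "the impact of optical character recognition errors on retrieval"
noisy = add_ocr_noise(clean, 0.3)
```

Sweeping `error_rate` from 0 upward and plotting the proxy metric gives a cheap degradation curve before committing to full pipeline reruns on real OCR output.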