89 research outputs found
Plague Dot Text:Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)
The design of models that govern diseases in population is commonly built on
information and data gathered from past outbreaks. However, epidemic outbreaks
are never captured in statistical data alone but are communicated by
narratives, supported by empirical observations. Outbreak reports discuss
correlations between populations, locations and the disease to infer insights
into causes, vectors and potential interventions. The problem with these
narratives is usually the lack of consistent structure or strong conventions,
which prohibit their formal analysis in larger corpora. Our interdisciplinary
research investigates more than 100 reports from the third plague pandemic
(1894-1952) evaluating ways of building a corpus to extract and structure this
narrative information through text mining and manual annotation. In this paper
we discuss the progress of our ongoing exploratory project, how we enhance
optical character recognition (OCR) methods to improve text capture, our
approach to structure the narratives and identify relevant entities in the
reports. The structured corpus is made available via Solr enabling search and
analysis across the whole collection for future research dedicated, for
example, to the identification of concepts. We show preliminary visualisations
of the characteristics of causation and differences with respect to gender as a
result of syntactic-category-dependent corpus statistics. Our goal is to
develop structured accounts of some of the most significant concepts that were
used to understand the epidemiology of the third plague pandemic around the
globe. The corpus enables researchers to analyse the reports collectively
allowing for deep insights into the global epidemiological consideration of
plague in the early twentieth century.Comment: Journal of Data Mining & Digital Humanities 202
Creating and Using Ground Truth OCR Sample Data for Finnish Historical Newspapers and Journals
The National Library of Finland (NLF) has digitized historical newspapers, journals and ephemera published in Finland since the late 1990s. The present collection consists of about 12.9 million pages mainly in Finnish and Swedish. Out of these about 7.36 million pages are freely available on the web site digi.kansalliskirjasto.fi. The copyright restricted part of the collection can be used at six legal deposit libraries in different parts of Finland. The time period of the open collection is from 1771 to 1929. The years 1920–1929 were opened in January 2018. This paper presents the ground truth Optical Character Recognition data of about 500 000 Finnish words that has been compiled at the NLF for development of a new OCR process for the collection. We discuss compilation of the data and show basic results of the new OCR process in comparison to current OCR using the ground truth data.Peer reviewe
Detecting and Analyzing Text Reuse with BLAST
In this thesis I expand upon my previous work on text reuse detection. I propose a novel method of detecting text reuse by leveraging BLAST (Basic Local Alignment Search Tool), an algorithm originally designed for aligning and comparing biomedical sequences, such as DNA and protein sequences.
I explain the original BLAST algorithm in depth by going through it step-by-step. I also describe two other popular sequence alignment methods. I demonstrate the effectiveness of the BLAST text reuse detection method by comparing it against the previous state-of-the-art and show that the proposed method beats it by a large margin.
I apply the method to a dataset of 3 million documents of scanned Finnish newspapers and journals, which have been turned into text using OCR (Optical Character Recognition) software. I categorize the results from the method into three categories: every day text reuse, long term reuse and viral news. I describe them and provide examples of them as well as propose a new, novel method of calculating a virality score for the clusters
Proceedings
Proceedings of the NODALIDA 2009 workshop
Nordic Perspectives on the CLARIN Infrastructure of Language Resources.
Editors: Rickard Domeij, Kimmo Koskenniemi, Steven Krauwer, Bente Maegaard,
Eiríkur Rögnvaldsson and Koenraad de Smedt.
NEALT Proceedings Series, Vol. 5 (2009), v+45 pp.
© 2009 The editors and contributors.
Published by
Northern European Association for Language
Technology (NEALT)
http://omilia.uio.no/nealt .
Electronically published at
Tartu University Library (Estonia)
http://hdl.handle.net/10062/9207
- …