Plague Dot Text: Text mining and annotation of outbreak reports of the Third Plague Pandemic (1894-1952)
Models that describe how diseases spread through populations are commonly
built on information and data gathered from past outbreaks. However, epidemic
outbreaks are never captured in statistical data alone but are also
communicated through narratives, supported by empirical observations. Outbreak
reports discuss correlations between populations, locations and the disease to
infer insights into causes, vectors and potential interventions. The problem
with these narratives is usually the lack of consistent structure or strong
conventions, which prohibits their formal analysis in larger corpora. Our
interdisciplinary
research investigates more than 100 reports from the third plague pandemic
(1894-1952) evaluating ways of building a corpus to extract and structure this
narrative information through text mining and manual annotation. In this paper
we discuss the progress of our ongoing exploratory project: how we enhance
optical character recognition (OCR) methods to improve text capture, and our
approach to structuring the narratives and identifying relevant entities in
the reports. The structured corpus is made available via Solr, enabling search
and
analysis across the whole collection for future research dedicated, for
example, to the identification of concepts. Drawing on
syntactic-category-dependent corpus statistics, we show preliminary
visualisations of how causation is characterised and of differences with
respect to gender. Our goal is to
develop structured accounts of some of the most significant concepts that were
used to understand the epidemiology of the third plague pandemic around the
globe. The corpus enables researchers to analyse the reports collectively
allowing for deep insights into the global epidemiological consideration of
plague in the early twentieth century.
Comment: Journal of Data Mining & Digital Humanities 202
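The abstract above describes exposing the structured corpus through Solr for collection-wide search. A minimal sketch of how such a corpus might be queried follows; the host, core name (`plague_reports`), and field name (`text`) are assumptions for illustration, since the abstract does not specify the schema:

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint and core name -- the actual deployment
# details of the plague-report corpus are not given in the abstract.
SOLR_BASE = "http://localhost:8983/solr/plague_reports/select"

def build_query(term, rows=10):
    """Build a Solr select URL searching a full-text field of the reports."""
    params = {
        "q": f"text:{term}",  # 'text' as the indexed full-text field is an assumption
        "rows": rows,
        "wt": "json",
    }
    return SOLR_BASE + "?" + urlencode(params)

# A Lucene-syntax query over the whole collection
url = build_query("plague AND vector")
```

The returned URL would then be fetched with any HTTP client; Solr answers with a JSON result list that can be filtered or faceted for downstream concept analysis.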
Historical-Domain Pre-trained Language Model for Historical Extractive Text Summarization
In recent years, pre-trained language models (PLMs) have shown remarkable advancements in the extractive summarization
task across diverse domains. However, there remains a lack of research specifically in the historical domain. In this paper, we propose a
novel method for extractive historical single-document summarization that leverages the potential of a domain-aware historical
bidirectional language model, pre-trained on a large-scale historical corpus. Subsequently, we fine-tune the language model specifically
for the task of extractive historical single-document summarization. One major challenge for this task is the lack of annotated datasets
for historical summarization. To address this issue, we construct a dataset by collecting archived historical documents from the Centre
Virtuel de la Connaissance sur l’Europe (CVCE) group at the University of Luxembourg. Furthermore, to better learn the structural
features of the input documents, we use a sentence position embedding mechanism that enables the model to learn the position information
of sentences. The overall experimental results on our historical dataset collected from the CVCE group show that our method outperforms
recent state-of-the-art methods in terms of ROUGE-1, ROUGE-2, and ROUGE-L F1 scores. To the best of our knowledge, this is the
first work on extractive historical text summarization.
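The sentence position embedding mechanism mentioned above can be illustrated with a small sketch. The paper does not state whether the embeddings are learned or fixed, so this shows one plausible instantiation: fixed sinusoidal embeddings (in the style of the Transformer) applied at the sentence level and added to encoder outputs before extraction scoring:

```python
import numpy as np

def sentence_position_embeddings(num_sentences, dim):
    """Fixed sinusoidal position embeddings, indexed by sentence position
    rather than token position. One possible realisation of the paper's
    sentence position embedding mechanism, not necessarily the exact one."""
    pos = np.arange(num_sentences)[:, None]          # (S, 1) sentence indices
    i = np.arange(dim // 2)[None, :]                 # (1, D/2) frequency indices
    angles = pos / np.power(10000.0, 2 * i / dim)    # (S, D/2)
    pe = np.zeros((num_sentences, dim))
    pe[:, 0::2] = np.sin(angles)                     # even dims: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dims: cosine
    return pe

# Add position information to (hypothetical) sentence representations
# before the extractive scoring layer.
sent_reprs = np.random.randn(12, 128)  # 12 sentences, 128-dim encoder outputs
augmented = sent_reprs + sentence_position_embeddings(12, 128)
```

Adding the embeddings element-wise keeps the representation dimensionality unchanged, so the scoring head needs no modification to become position-aware.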
Embedding Multilingual and Relational Data Using Linear Mappings
This thesis presents our research on the embedding method, a machine learning technique that encodes real-world signals into high-dimensional vectors. Specifically, we focus on a family of algorithms whose backbone is one simple yet elegant type of topological operation, the linear mapping, also known as a linear transformation or vector space homomorphism. Past studies have shown the usefulness of these approaches for modelling complex data, such as lexicons from different languages and networks storing factual relations. However, they also exhibit crucial limitations, including a lack of theoretical justifications, drops in precision in challenging setups, and considerable environmental impact during training, among others.
To bridge these gaps, we first identify the unnoticed link between the success of linear Cross-Lingual Word Embedding (CLWE) mappings and the preservation of the implicit analogy relation, using both theoretical and empirical evidence. Next, we propose a post-hoc L1-norm rotation step, which substantially improves the performance of existing CLWE mappings. Then, beyond solving conventional questions where only modern languages are involved, we extend the application of CLWE mappings to summarising lengthy and opaque historical text. Finally, motivated by the learning procedure of CLWE models, we adopt linear mappings to optimise Knowledge Graph Embeddings (KGEs) iteratively, significantly reducing the carbon footprint required to train the algorithm.
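The linear CLWE mappings discussed above are conventionally fitted with the closed-form orthogonal Procrustes solution, the standard baseline on which refinements such as the thesis's L1-norm rotation step build. A minimal sketch (this is the standard method, not the thesis's proposed post-hoc step):

```python
import numpy as np

def procrustes_mapping(X, Y):
    """Orthogonal map W minimising ||XW - Y||_F, the closed-form solution
    for linear CLWE alignment. X and Y are row-aligned embedding matrices
    for a bilingual seed lexicon (one row per translation pair)."""
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: if the target space really is an orthogonal transform of the
# source space, the mapping is recovered exactly.
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.standard_normal((50, 50)))  # random orthogonal matrix
X = rng.standard_normal((200, 50))                  # 200 "source" embeddings
W = procrustes_mapping(X, X @ R)                    # recovers R
```

Constraining W to be orthogonal preserves dot products and hence nearest-neighbour structure in the mapped space, which is one reason orthogonal mappings dominate linear CLWE work.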