Extracting Temporal Expressions from Unstructured Open Resources
AETAS is an end-to-end system with a service-oriented architecture (SOA) that retrieves plain-text data from web and blog news, and represents and stores it in RDF, with a special focus on its temporal dimension. The system allows users to acquire, browse and query Linked Data obtained from unstructured sources.
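As an illustration of the kind of representation such a system produces, here is a minimal sketch, using the rdflib library, of storing a news item and its normalized temporal expression in RDF and querying it with SPARQL. This is not the AETAS codebase; the http://example.org/news/ namespace and the property names are hypothetical.

```python
# Minimal sketch (not AETAS itself): store a news item with a normalized
# temporal expression in RDF, then query it as Linked Data with SPARQL.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/news/")  # hypothetical vocabulary

g = Graph()
article = URIRef(EX["article/42"])
g.add((article, RDF.type, EX.NewsArticle))
g.add((article, EX.mentionsDate, Literal("2011-05-11", datatype=XSD.date)))
g.add((article, EX.sourceText, Literal("An earthquake struck on 11 May 2011.")))

# Retrieve all articles whose temporal expressions fall in May 2011.
results = g.query("""
    PREFIX ex: <http://example.org/news/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
    SELECT ?article ?date WHERE {
        ?article ex:mentionsDate ?date .
        FILTER (?date >= "2011-05-01"^^xsd:date && ?date <= "2011-05-31"^^xsd:date)
    }
""")
for row in results:
    print(row.article, row.date)
```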
Where are you talking about? Advances and Challenges of Geographic Analysis of Text with Application to Disease Monitoring
The Natural Language Processing task we focus on in this thesis is Geoparsing. Geoparsing is the process of extraction and grounding of toponyms (place names). Consider this sentence: "The victims of the Spanish earthquake off the coast of Malaga were of American and Mexican origin." Four toponyms will be extracted (called Geotagging) and grounded to their geographic coordinates (called Toponym Resolution). However, our research goes further than any previous work by showing how to distinguish the literal place(s) of the event (Spain, Malaga) from other linguistic types/uses such as nationalities (Mexican, American), improving downstream task accuracy.

We consolidate and extend the Standard Evaluation Framework, discuss key research problems, then present concrete solutions to advance each stage of geoparsing. For geotagging, as well as training a SOTA neural Location-NER tagger, we simplify Metonymy Resolution with a novel minimalist feature extraction combined with an LSTM-based classifier, matching SOTA results. For toponym resolution, we deploy the latest deep learning methods to achieve SOTA performance by augmenting neural models with hitherto unused geographic features called Map Vectors. With each research project, we provide high-quality datasets and system prototypes, further building resources in this field.

We then show how these geoparsing advances, coupled with our proposed Intra-Document Analysis, can be used to associate news articles with locations in order to monitor the spread of public health threats. To this end, we evaluate our research contributions with production data from a real-time downstream application to improve geolocation of news events for disease monitoring. The data was made available to us by the Joint Research Centre (JRC), which operates one such system, MediSys, that processes incoming news articles in order to monitor threats to public health and makes these available to a variety of governmental, business and non-profit organisations. We also discuss steps towards an end-to-end, automated news monitoring system and make actionable recommendations for future work.

In summary, the thesis aims are twofold: (1) generate original geoparsing research aimed at advancing each stage of the pipeline by addressing pertinent challenges with concrete solutions and actionable proposals; (2) demonstrate how this research can be applied to news event monitoring to increase the efficacy of existing biosurveillance systems, e.g. the European Commission's MediSys. I was generously funded by the DREAM CDT, which was funded by NERC of UKRI.
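To make the geotagging/resolution split concrete, here is a minimal sketch in Python: spaCy's off-the-shelf NER stands in for the Location-NER tagger, a toy gazetteer stands in for toponym resolution, and NORP entities (nationalities) are deliberately left ungrounded, mirroring the literal-versus-metonymic distinction discussed above. The gazetteer, coordinates, and model choice are illustrative assumptions, not the thesis's system.

```python
# Minimal geoparsing sketch: extract place entities (geotagging), then look
# each one up in a toy gazetteer (toponym resolution).
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

# Toy gazetteer: toponym -> (lat, lon). A real resolver would rank candidate
# locations, e.g. with geographic features such as the thesis's Map Vectors.
GAZETTEER = {"Spain": (40.46, -3.75), "Malaga": (36.72, -4.42)}

def geoparse(text):
    doc = nlp(text)
    results = []
    for ent in doc.ents:
        if ent.label_ in ("GPE", "LOC"):       # literal place: geotag + ground
            results.append((ent.text, GAZETTEER.get(ent.text)))
        elif ent.label_ == "NORP":             # nationality: non-literal use,
            results.append((ent.text, None))   # deliberately not grounded
    return results

print(geoparse("The victims of the Spanish earthquake off the coast of "
               "Malaga were of American and Mexican origin."))
```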
Semantic Representation and Inference for NLP
Semantic representation and inference is essential for Natural Language Processing (NLP). The state of the art for semantic representation and inference is deep learning, and particularly Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and transformer self-attention models. This thesis investigates the use of deep learning for novel semantic representation and inference, and makes contributions in three areas: creating training data, improving semantic representations, and extending inference learning. In terms of creating training data, we contribute the largest publicly available dataset of real-life factual claims for the purpose of automatic claim verification (MultiFC), and we present a novel inference model composed of multi-scale CNNs with different kernel sizes that learn from external sources to infer fact-checking labels. In terms of improving semantic representations, we contribute a novel model that captures non-compositional semantic indicators. By definition, the meaning of a non-compositional phrase cannot be inferred from the individual meanings of its composing words (e.g., hot dog). Motivated by this, we operationalize the compositionality of a phrase contextually by enriching the phrase representation with external word embeddings and knowledge graphs. Finally, in terms of inference learning, we propose a series of novel deep learning architectures that improve inference by using syntactic dependencies, ensembling role-guided attention heads, incorporating gating layers, and concatenating multiple heads in novel and effective ways. This thesis consists of seven publications (five published and two under review).
Comment: PhD thesis, the University of Copenhagen
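The multi-scale CNN inference model mentioned above lends itself to a short illustration. Below is a minimal PyTorch sketch, not the thesis's actual model: parallel convolutions with different kernel sizes over token embeddings, max-pooled over time and concatenated before classification. All hyperparameters (embedding size, filter counts, kernel sizes, label count) are illustrative assumptions.

```python
# Minimal multi-scale CNN text classifier: one Conv1d per kernel size,
# max-over-time pooling, concatenation, then a linear classifier.
import torch
import torch.nn as nn

class MultiScaleCNN(nn.Module):
    def __init__(self, vocab_size, emb_dim=100, n_filters=64,
                 kernel_sizes=(2, 3, 5), n_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.convs = nn.ModuleList(
            nn.Conv1d(emb_dim, n_filters, k, padding=k // 2)
            for k in kernel_sizes
        )
        self.classify = nn.Linear(n_filters * len(kernel_sizes), n_labels)

    def forward(self, token_ids):                  # (batch, seq_len)
        x = self.embed(token_ids).transpose(1, 2)  # (batch, emb_dim, seq_len)
        # One feature vector per kernel size, via max-over-time pooling.
        pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
        return self.classify(torch.cat(pooled, dim=1))  # (batch, n_labels)

logits = MultiScaleCNN(vocab_size=10_000)(torch.randint(0, 10_000, (4, 50)))
print(logits.shape)  # torch.Size([4, 3])
```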
Human-competitive automatic topic indexing
Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor intensive. This thesis shows how to generate them automatically in a way that competes with human performance.
Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated to Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples.
This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data and applies across different domains and languages.
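The two-stage algorithm described above (candidate selection followed by property-based ranking) can be sketched in a few lines. The sketch below is a deliberately simplified stand-in: candidates are plain n-grams rather than controlled-vocabulary terms or Wikipedia article names, and a hand-weighted score replaces the learned model; the stopword list and weights are illustrative assumptions.

```python
# Stage 1: generate candidate topics; stage 2: rank them by their properties.
import re
from collections import Counter

STOPWORDS = {"the", "a", "of", "and", "to", "in", "is", "for"}

def candidates(text, max_len=3):
    """Candidate topics: n-grams that do not start or end with a stopword."""
    tokens = re.findall(r"[a-z]+", text.lower())
    cands = Counter()
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            if gram[0] not in STOPWORDS and gram[-1] not in STOPWORDS:
                cands[" ".join(gram)] += 1
    return cands

def rank(text, top_k=5):
    scored = []
    for phrase, tf in candidates(text).items():
        first_pos = text.lower().find(phrase) / max(len(text), 1)
        # Hand-weighted stand-in for the learned model: frequency, phrase
        # length, and an early first occurrence all raise the score.
        score = tf * len(phrase.split()) * (1.0 - first_pos)
        scored.append((score, phrase))
    return [p for _, p in sorted(scored, reverse=True)[:top_k]]

print(rank("Topic indexing identifies the main topics covered by a document; "
           "topic indexing competes with human performance."))
```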
Biographical information extraction: A language-agnostic methodology for datasets and models
A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

Information extraction (IE) refers to the task of detecting and linking information contained in written texts. While it includes various subtasks, relation extraction (RE) is used to link two entities in a text via a common relation. RE can therefore be used to build linked databases of knowledge across a wide area of topics. Today, the task of RE is treated as a supervised machine learning (ML) task, where a model is trained using a specific architecture and a specific annotated dataset. These specific datasets typically aim to represent common patterns that the model is to learn, albeit at the cost of manual annotation, which can be costly and time-consuming. In addition, due to the nature of the training process, the models can be sensitive to a specific genre or topic, and are generally monolingual. It therefore stands to reason that certain genres and topics have better models, as they are treated with a higher priority due, for instance, to financial interests. This in turn leads to RE models not being available to every area of research, leaving linked databases of knowledge incomplete. For instance, if the birthplace of a person is not correctly extracted, the place and the person cannot be linked correctly, leaving the linked database incomplete.

To address this problem, this thesis explores aspects of RE that could be adapted in ways which require little human effort, therefore making RE models more widely available. The first aspect is the annotated data. During the course of this thesis, Wikipedia and its subsidiaries are used as sources to automatically annotate sentences for RE. The dataset, which is aimed towards digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), information is matched with relatively high precision in order to compile annotated relation pairs for ten relations that are important in the DH domain: birthdate, birthplace, deathdate, deathplace, occupation, parent, educated, child, sibling and other (all other relations). Furthermore, the effectiveness of the dataset is demonstrated by training a state-of-the-art neural model to classify relation pairs; for its evaluation, a manually annotated gold-standard set is used. An investigation of the necessary adaptations to recreate the automatic process in a multilingual setting is also undertaken, looking specifically at English and German, for which similar neural models are trained and evaluated on a gold-standard dataset. While the process is aimed here at training neural models for RE within the domain of digital humanities and history, it may be transferable to other domains.
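The alignment step that produces the annotated relation pairs can be illustrated with a minimal distant-supervision sketch. The toy FACTS record below stands in for structured data from Wikidata or Pantheon, and plain substring matching stands in for the NER-based matching described above; the names and relations are illustrative.

```python
# Distant supervision: a sentence is labeled with a relation when it mentions
# both the article's subject and the object recorded in structured data.
FACTS = {  # illustrative structured records, keyed by article subject
    "Ada Lovelace": {"birthplace": "London", "occupation": "mathematician"},
}

def align(subject, sentences, facts=FACTS):
    """Yield (sentence, subject, object, relation) training pairs."""
    record = facts.get(subject, {})
    for sent in sentences:
        if subject not in sent:
            continue
        for relation, obj in record.items():
            if obj in sent:
                yield (sent, subject, obj, relation)

article = [
    "Ada Lovelace was born in London in 1815.",
    "Ada Lovelace is often described as the first programmer.",
]
for pair in align("Ada Lovelace", article):
    print(pair)
# -> ('Ada Lovelace was born in London in 1815.',
#     'Ada Lovelace', 'London', 'birthplace')
```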