2,272 research outputs found

    Relation extraction for information extraction from free text

    Ph.D. (Doctor of Philosophy)

    Extracting Temporal Expressions from Unstructured Open Resources

    AETAS is an end-to-end system, built on a service-oriented architecture (SOA), that retrieves plain-text data from web and blog news, then represents and stores it in RDF, with a special focus on its temporal dimension. The system allows users to acquire, browse, and query Linked Data obtained from unstructured sources.
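
    A minimal sketch, using rdflib, of how a news item and its normalized temporal expression might be stored as RDF in the spirit described above. The EX namespace, property names, and date value are illustrative assumptions, not the AETAS vocabulary.

    ```python
    # Sketch only: the namespace and properties below are assumptions,
    # not the actual AETAS schema.
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    EX = Namespace("http://example.org/aetas/")

    g = Graph()
    article = EX["article/42"]

    g.add((article, RDF.type, EX.NewsArticle))
    g.add((article, EX.text, Literal("The summit was held last Monday.")))
    # "last Monday" resolved against the publication date (an assumed upstream step)
    g.add((article, EX.mentionsTime, Literal("2012-05-07", datatype=XSD.date)))

    print(g.serialize(format="turtle"))  # browse/query-ready Linked Data
    ```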

    Semantic Representation and Inference for NLP

    Semantic representation and inference are essential for Natural Language Processing (NLP). The state of the art for both is deep learning, particularly Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and Transformer self-attention models. This thesis investigates the use of deep learning for novel semantic representation and inference, and makes contributions in three areas: creating training data, improving semantic representations, and extending inference learning. In terms of creating training data, we contribute the largest publicly available dataset of real-life factual claims for the purpose of automatic claim verification (MultiFC), and we present a novel inference model composed of multi-scale CNNs with different kernel sizes that learn from external sources to infer fact-checking labels. In terms of improving semantic representations, we contribute a novel model that captures non-compositional semantic indicators. By definition, the meaning of a non-compositional phrase cannot be inferred from the individual meanings of its component words (e.g., "hot dog"). Motivated by this, we operationalize the compositionality of a phrase contextually by enriching the phrase representation with external word embeddings and knowledge graphs. Finally, in terms of inference learning, we propose a series of novel deep learning architectures that improve inference by using syntactic dependencies, ensembling role-guided attention heads, incorporating gating layers, and concatenating multiple heads in novel and effective ways. This thesis consists of seven publications (five published and two under review). (PhD thesis, University of Copenhagen.)
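
    The multi-scale CNN idea lends itself to a compact illustration. Below is a minimal sketch, assuming PyTorch: parallel 1-D convolutions with different kernel sizes read the same token embeddings, are max-pooled, and are concatenated before a label classifier. All dimensions, kernel sizes, and the label count are illustrative assumptions, not values from the thesis.

    ```python
    # Sketch of a multi-scale CNN classifier; hyperparameters are assumptions.
    import torch
    import torch.nn as nn

    class MultiScaleCNN(nn.Module):
        def __init__(self, embed_dim=300, n_filters=100, kernel_sizes=(2, 3, 5), n_labels=3):
            super().__init__()
            # One Conv1d per kernel size: each captures n-grams of a different width.
            self.convs = nn.ModuleList(
                nn.Conv1d(embed_dim, n_filters, k, padding=k // 2) for k in kernel_sizes
            )
            self.classifier = nn.Linear(n_filters * len(kernel_sizes), n_labels)

        def forward(self, x):                 # x: (batch, seq_len, embed_dim)
            x = x.transpose(1, 2)             # Conv1d expects (batch, dim, seq_len)
            pooled = [conv(x).relu().max(dim=2).values for conv in self.convs]
            return self.classifier(torch.cat(pooled, dim=1))  # label logits

    logits = MultiScaleCNN()(torch.randn(4, 50, 300))  # 4 claims, 50 tokens each
    ```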

    Human-competitive automatic topic indexing

    Topic indexing is the task of identifying the main topics covered by a document. These are useful for many purposes: as subject headings in libraries, as keywords in academic publications, and as tags on the web. Knowing a document's topics helps people judge its relevance quickly. However, assigning topics manually is labor-intensive. This thesis shows how to generate them automatically in a way that competes with human performance. Three kinds of indexing are investigated: term assignment, a task commonly performed by librarians, who select topics from a controlled vocabulary; tagging, a popular activity of web users, who choose topics freely; and a new method of keyphrase extraction, where topics are equated with Wikipedia article names. A general two-stage algorithm is introduced that first selects candidate topics and then ranks them by significance based on their properties. These properties draw on statistical, semantic, domain-specific and encyclopedic knowledge. They are combined using a machine learning algorithm that models human indexing behavior from examples. This approach is evaluated by comparing automatically generated topics to those assigned by professional indexers and by amateurs. We claim that the algorithm is human-competitive because it chooses topics that are as consistent with those assigned by humans as their topics are with each other. The approach is generalizable, requires little training data, and applies across different domains and languages.
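
    The two-stage algorithm can be sketched compactly. The Python below is a sketch under stated assumptions: it generates candidate topics from a controlled vocabulary, computes a few statistical features per candidate, and ranks candidates with any scikit-learn-style classifier trained on human-indexed examples. The feature set here (TF-IDF, first occurrence, phrase length) is a small illustrative subset of the properties the thesis actually combines.

    ```python
    # Sketch of candidate generation + learned ranking; the features are a
    # simplified stand-in for the thesis's statistical/semantic properties.
    import math

    def candidates(doc, vocabulary):
        """Stage 1: candidate topics are vocabulary terms occurring in the text."""
        text = doc.lower()
        return [term for term in vocabulary if term in text]

    def features(term, doc, doc_freq, n_docs):
        """Per-candidate properties fed to the ranker."""
        text = doc.lower()
        tf_idf = text.count(term) * math.log(n_docs / (1 + doc_freq.get(term, 0)))
        first_pos = text.find(term) / max(len(text), 1)
        return [tf_idf, first_pos, len(term.split())]

    def rank(doc, vocabulary, doc_freq, n_docs, model, k=5):
        """Stage 2: score candidates with a model trained on human indexing."""
        scored = [(model.predict_proba([features(t, doc, doc_freq, n_docs)])[0][1], t)
                  for t in candidates(doc, vocabulary)]
        return [t for _, t in sorted(scored, reverse=True)[:k]]
    ```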

    Biographical information extraction: A language-agnostic methodology for datasets and models

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

    Information extraction (IE) refers to the task of detecting and linking information contained in written texts. While it includes various subtasks, relation extraction (RE) is used to link two entities in a text via a common relation. RE can therefore be used to build linked databases of knowledge across a wide range of topics. Today, the task of RE is treated as a supervised machine learning (ML) task, where a model is trained using a specific architecture and a specific annotated dataset. These datasets typically aim to represent common patterns that the model is to learn, albeit at the cost of manual annotation, which can be costly and time-consuming. In addition, due to the nature of the training process, the models can be sensitive to a specific genre or topic, and are generally monolingual. It therefore stands to reason that certain genres and topics have better models, as they are treated with higher priority, for instance because of financial interests. This in turn leads to RE models not being available to every area of research, leaving linked databases of knowledge incomplete. For instance, if the birthplace of a person is not correctly extracted, the place and the person cannot be linked correctly, leaving the linked database incomplete.

    To address this problem, this thesis explores aspects of RE that could be adapted in ways which require little human effort, thereby making RE models more widely available. The first aspect is the annotated data. During the course of this thesis, Wikipedia and its subsidiaries are used as sources to automatically annotate sentences for RE. The dataset, which is aimed at digital humanities (DH) and historical research, is automatically compiled by aligning sentences from Wikipedia articles with matching structured data from sources including Pantheon and Wikidata. By exploiting the structure of Wikipedia articles and robust named entity recognition (NER), information is matched with relatively high precision in order to compile annotated relation pairs for ten relations that are important in the DH domain: birthdate, birthplace, deathdate, deathplace, occupation, parent, educated, child, sibling, and other (all other relations). Furthermore, the effectiveness of the dataset is demonstrated by training a state-of-the-art neural model to classify relation pairs; a manually annotated gold-standard set is used for evaluation. An investigation of the adaptations necessary to recreate the automatic process in a multilingual setting is also undertaken, looking specifically at English and German, for which similar neural models are trained and evaluated on a gold-standard dataset. While the process here is aimed at training neural models for RE within the domain of digital humanities and history, it may be transferable to other domains.
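
    The alignment step described above amounts to distant supervision, which can be sketched briefly. In the Python below, a sentence is labeled with a relation whenever it mentions both the subject and the object of a known structured fact; sentences mentioning only the subject fall into the catch-all "other" class. The facts dict and sentences are illustrative stand-ins for Wikidata/Pantheon records and article text, and the substring matching stands in for the more robust NER-based matching the thesis uses.

    ```python
    # Sketch of distant-supervision annotation; data and matching are simplified.
    facts = {  # subject -> (relation, object) pairs from structured sources
        "Ada Lovelace": [("birthplace", "London"), ("parent", "Lord Byron")],
    }

    def annotate(subject, sentences):
        pairs = []
        for sent in sentences:
            if subject not in sent:
                continue
            matched = [(rel, obj) for rel, obj in facts.get(subject, []) if obj in sent]
            if matched:
                pairs.extend((sent, subject, rel, obj) for rel, obj in matched)
            else:
                pairs.append((sent, subject, "other", None))  # catch-all relation
        return pairs

    print(annotate("Ada Lovelace", [
        "Ada Lovelace was born in London.",
        "Ada Lovelace wrote the first published algorithm.",
    ]))
    ```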

    Proceedings of the 2nd International Seminar on Linguistics
