
    Total Relation Recall: High-Recall Relation Extraction

    As Knowledge Graphs (KGs) become important in a wide range of applications, including question answering and recommender systems, more and more enterprises have recognized the value of constructing KGs from their own data. Although enterprise data comprises both structured and unstructured data, companies primarily focus on structured data, which is easier to exploit. However, most enterprise data is unstructured, including intranet pages, documents, and emails, where plenty of business insight lives. Companies would therefore like to utilize unstructured data as well, and KGs are an excellent way to collect and organize the information it contains. In this thesis, we introduce a novel task, Total Relation Recall (TRR), which leverages an enterprise's unstructured documents to build KGs using high-recall relation extraction. Given a target relation and its relevant information, TRR aims to extract all instances of that relation from the given documents. We propose a Python-based system to address this task. To evaluate the effectiveness of our system, we conduct experiments on 12 different relations across two news-article corpora. Moreover, we conduct an ablation study to investigate the impact of natural language processing (NLP) features.
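
    The abstract does not detail the extraction pipeline, so the following is only a minimal sketch of what a recall-oriented pass for a single target relation could look like; the "acquired" relation, the seed patterns, and the extract_relation helper are illustrative assumptions, not the system described above.

    # Illustrative sketch only: a recall-oriented extractor for one hypothetical
    # "acquired" relation; the seed patterns below are assumptions for demonstration.
    import re
    from typing import Iterator, Tuple

    ACQUIRE_PATTERNS = [
        re.compile(r"(?P<subj>[A-Z][\w ]+?) (?:acquired|bought|purchased) (?P<obj>[A-Z]\w+)"),
        re.compile(r"(?P<obj>[A-Z][\w ]+?) was acquired by (?P<subj>[A-Z]\w+)"),
    ]

    def extract_relation(doc: str) -> Iterator[Tuple[str, str]]:
        """Yield every (subject, object) candidate that matches a seed pattern.

        For high recall, all candidate matches are emitted; precision filtering
        is left to a later verification step.
        """
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            for pattern in ACQUIRE_PATTERNS:
                for m in pattern.finditer(sentence):
                    yield m.group("subj").strip(), m.group("obj").strip()

    if __name__ == "__main__":
        text = "FooCorp acquired BarSoft in 2021. BazCo was acquired by FooCorp."
        for subj, obj in extract_relation(text):
            print(subj, "-acquired->", obj)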

    Extracting Structured Information from Greek Legislation Data


    Poetry: Identification, Entity Recognition, and Retrieval

    Modern advances in natural language processing (NLP) and information retrieval (IR) make it possible to automatically analyze, categorize, process, and search textual resources. However, generalizing these approaches remains an open problem: models that appear to understand certain types of data must be re-trained on other domains. Models often make assumptions about the length, structure, discourse model, and vocabulary used by a particular corpus. Trained models can become biased toward an original dataset, learning, for example, that all capitalized words are names of people or that short documents are more relevant than longer ones. As a result, small amounts of noise or shifts in style can cause models to fail on unseen data. The key to more robust models is to look at text analytics tasks on more challenging and diverse data. Poetry is an ancient art form that is believed to pre-date writing and remains a key form of textual expression today. Some poetry forms (e.g., haiku and sonnets) have rigid structure but still break our traditional expectations of text; other forms drop punctuation and other rules in favor of expression. Our contributions include a set of novel, challenging datasets that extend traditional tasks: a text classification task for which content features perform poorly, a named entity recognition task that is inherently ambiguous, and a retrieval corpus over the largest public collection of poetry ever released. We begin with poetry identification, the task of finding poetry within existing textual collections, and devise an effective method of extracting poetry based on how it is usually formatted within digitally scanned books, since content models do not generalize well. We then turn to the content of poetry: we construct a dataset of around 6,000 tagged spans that identify the people, places, organizations, and personified concepts within poems. We show that cross-training with existing news-corpus datasets helps modern models learn to recognize entities within poetry. Finally, we return to IR and construct a dataset of queries and documents, inspired by real-world data, that exposes some of the key challenges of searching through poetry. Our work is the first significant effort to use poetry in these three tasks, and our datasets and models will provide strong baselines for new avenues of research in this challenging domain.
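
    As a rough illustration of layout-based poetry identification (the abstract only states that formatting cues generalize better than content models), the sketch below guesses whether a text block is poetry from line length and capitalization alone; the looks_like_poetry helper and its thresholds are assumptions, not the dissertation's method.

    # Illustrative layout heuristic only; the dissertation's actual features and
    # thresholds are not given in this abstract.
    def looks_like_poetry(block: str, max_line_len: int = 50, min_lines: int = 4) -> bool:
        """Guess whether a text block is poetry from its visual layout alone.

        Poetry in scanned books tends to appear as many short lines, each
        starting with a capital letter, rather than long justified prose lines.
        """
        lines = [ln.rstrip() for ln in block.splitlines() if ln.strip()]
        if len(lines) < min_lines:
            return False
        short = sum(1 for ln in lines if len(ln) <= max_line_len)
        capitalized = sum(1 for ln in lines if ln.lstrip()[:1].isupper())
        return short / len(lines) > 0.8 and capitalized / len(lines) > 0.6

    if __name__ == "__main__":
        stanza = (
            "Shall I compare thee to a summer's day?\n"
            "Thou art more lovely and more temperate:\n"
            "Rough winds do shake the darling buds of May,\n"
            "And summer's lease hath all too short a date."
        )
        print(looks_like_poetry(stanza))  # True under these thresholds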

    CLARIN

    The book provides a comprehensive overview of the Common Language Resources and Technology Infrastructure (CLARIN) for the humanities. It covers a broad range of CLARIN language resources and services, its underlying technological infrastructure, the achievements of national consortia, and the challenges that CLARIN will tackle in the future. The book is published 10 years after the establishment of CLARIN as a European Research Infrastructure Consortium.

    Low-Resource Unsupervised NMT: Diagnosing the Problem and Providing a Linguistically Motivated Solution

    Unsupervised Machine Translation has been advancing our ability to translate without parallel data, but state-of-the-art methods assume an abundance of monolingual data. This paper investigates the scenario where monolingual data is limited as well, finding that current unsupervised methods suffer in performance under this stricter setting. We find that the performance loss originates from the poor quality of the pretrained monolingual embeddings, and we propose using linguistic information in the embedding training scheme. To support this, we look at two linguistic features that may help improve alignment quality: dependency information and sub-word information. Using dependency-based embeddings results in a complementary word representation which offers a boost in performance of around 1.5 BLEU points compared to standard WORD2VEC when monolingual data is limited to 1 million sentences per language. We also find that the inclusion of sub-word information is crucial to improving the quality of the embeddings.
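
    To illustrate the sub-word contrast the abstract points to, the sketch below trains a standard WORD2VEC baseline and a character n-gram (sub-word) model with gensim on a toy corpus; the corpus and hyperparameters are assumptions, and the paper's dependency-based embeddings additionally require dependency-parsed contexts that this sketch does not cover.

    # Illustrative contrast between a standard word2vec baseline and a sub-word
    # (character n-gram) model; corpus and hyperparameters are toy assumptions.
    from gensim.models import FastText, Word2Vec

    sentences = [
        ["the", "translator", "reads", "the", "sentence"],
        ["the", "translators", "read", "many", "sentences"],
    ]

    # Standard WORD2VEC baseline: every surface form gets an independent vector,
    # so unseen inflections have no representation at all.
    w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    # FastText adds character n-gram information, so rare or unseen inflections
    # share parameters with related surface forms.
    ft = FastText(sentences, vector_size=50, window=3, min_count=1,
                  min_n=3, max_n=5, epochs=50)

    print("w2v has 'translator':", "translator" in w2v.wv.key_to_index)
    print("ft vector for unseen 'translating':", ft.wv["translating"][:3])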