399 research outputs found

    Knowledge Base Population using Semantic Label Propagation

    Get PDF
    A crucial aspect of a knowledge base population system that extracts new facts from text corpora, is the generation of training data for its relation extractors. In this paper, we present a method that maximizes the effectiveness of newly trained relation extractors at a minimal annotation cost. Manual labeling can be significantly reduced by Distant Supervision, which is a method to construct training data automatically by aligning a large text corpus with an existing knowledge base of known facts. For example, all sentences mentioning both 'Barack Obama' and 'US' may serve as positive training instances for the relation born_in(subject,object). However, distant supervision typically results in a highly noisy training set: many training sentences do not really express the intended relation. We propose to combine distant supervision with minimal manual supervision in a technique called feature labeling, to eliminate noise from the large and noisy initial training set, resulting in a significant increase of precision. We further improve on this approach by introducing the Semantic Label Propagation method, which uses the similarity between low-dimensional representations of candidate training instances, to extend the training set in order to increase recall while maintaining high precision. Our proposed strategy for generating training data is studied and evaluated on an established test collection designed for knowledge base population tasks. The experimental results show that the Semantic Label Propagation strategy leads to substantial performance gains when compared to existing approaches, while requiring an almost negligible manual annotation effort.Comment: Submitted to Knowledge Based Systems, special issue on Knowledge Bases for Natural Language Processin

    Deep learning methods for knowledge base population

    Get PDF
    Knowledge bases store structured information about entities or concepts of the world and can be used in various applications, such as information retrieval or question answering. A major drawback of existing knowledge bases is their incompleteness. In this thesis, we explore deep learning methods for automatically populating them from text, addressing the following tasks: slot filling, uncertainty detection and type-aware relation extraction. Slot filling aims at extracting information about entities from a large text corpus. The Text Analysis Conference yearly provides new evaluation data in the context of an international shared task. We develop a modular system to address this challenge. It was one of the top-ranked systems in the shared task evaluations in 2015. For its slot filler classification module, we propose contextCNN, a convolutional neural network based on context splitting. It improves the performance of the slot filling system by 5.0% micro and 2.9% macro F1. To train our binary and multiclass classification models, we create a dataset using distant supervision and reduce the number of noisy labels with a self-training strategy. For model optimization and evaluation, we automatically extract a labeled benchmark for slot filler classification from the manual shared task assessments from 2012-2014. We show that results on this benchmark are correlated with slot filling pipeline results with a Pearson's correlation coefficient of 0.89 (0.82) on data from 2013 (2014). The combination of patterns, support vector machines and contextCNN achieves the best results on the benchmark with a micro (macro) F1 of 51% (53%) on test. Finally, we analyze the results of the slot filling pipeline and the impact of its components. For knowledge base population, it is essential to assess the factuality of the statements extracted from text. From the sentence "Obama was rumored to be born in Kenya", a system should not conclude that Kenya is the place of birth of Obama. Therefore, we address uncertainty detection in the second part of this thesis. We investigate attention-based models and make a first attempt to systematize the attention design space. Moreover, we propose novel attention variants: External attention, which incorporates an external knowledge source, k-max average attention, which only considers the vectors with the k maximum attention weights, and sequence-preserving attention, which allows to maintain order information. Our convolutional neural network with external k-max average attention sets the new state of the art on a Wikipedia benchmark dataset with an F1 score of 68%. To the best of our knowledge, we are the first to integrate an uncertainty detection component into a slot filling pipeline. It improves precision by 1.4% and micro F1 by 0.4%. In the last part of the thesis, we investigate type-aware relation extraction with neural networks. We compare different models for joint entity and relation classification: pipeline models, jointly trained models and globally normalized models based on structured prediction. First, we show that using entity class prediction scores instead of binary decisions helps relation classification. Second, joint training clearly outperforms pipeline models on a large-scale distantly supervised dataset with fine-grained entity classes. It improves the area under the precision-recall curve from 0.53 to 0.66. Third, we propose a model with a structured prediction output layer, which globally normalizes the score of a triple consisting of the classes of two entities and the relation between them. It improves relation extraction results by 4.4% F1 on a manually labeled benchmark dataset. Our analysis shows that the model learns correct correlations between entity and relation classes. Finally, we are the first to use neural networks for joint entity and relation classification in a slot filling pipeline. The jointly trained model achieves the best micro F1 score with a score of 22% while the neural structured prediction model performs best in terms of macro F1 with a score of 25%

    The role of knowledge in determining identity of long-tail entities

    Get PDF
    The NIL entities do not have an accessible representation, which means that their identity cannot be established through traditional disambiguation. Consequently, they have received little attention in entity linking systems and tasks so far. Given the non-redundancy of knowledge on NIL entities, the lack of frequency priors, their potentially extreme ambiguity, and numerousness, they form an extreme class of long-tail entities and pose a great challenge for state-of-the-art systems. In this paper, we investigate the role of knowledge when establishing the identity of NIL entities mentioned in text. What kind of knowledge can be applied to establish the identity of NILs? Can we potentially link to them at a later point? How to capture implicit knowledge and fill knowledge gaps in communication? We formulate and test hypotheses to provide insights to these questions. Due to the unavailability of instance-level knowledge, we propose to enrich the locally extracted information with profiling models that rely on background knowledge in Wikidata. We describe and implement two profiling machines based on state-of-the-art neural models. We evaluate their intrinsic behavior and their impact on the task of determining identity of NIL entities

    Web knowledge bases

    Get PDF
    Knowledge is key to natural language understanding. References to specific people, places and things in text are crucial to resolving ambiguity and extracting meaning. Knowledge Bases (KBs) codify this information for automated systems — enabling applications such as entity-based search and question answering. This thesis explores the idea that sites on the web may act as a KB, even if that is not their primary intent. Dedicated kbs like Wikipedia are a rich source of entity information, but are built and maintained at an ongoing cost in human effort. As a result, they are generally limited in terms of the breadth and depth of knowledge they index about entities. Web knowledge bases offer a distributed solution to the problem of aggregating entity knowledge. Social networks aggregate content about people, news sites describe events with tags for organizations and locations, and a diverse assortment of web directories aggregate statistics and summaries for long-tail entities notable within niche movie, musical and sporting domains. We aim to develop the potential of these resources for both web-centric entity Information Extraction (IE) and structured KB population. We first investigate the problem of Named Entity Linking (NEL), where systems must resolve ambiguous mentions of entities in text to their corresponding node in a structured KB. We demonstrate that entity disambiguation models derived from inbound web links to Wikipedia are able to complement and in some cases completely replace the role of resources typically derived from the KB. Building on this work, we observe that any page on the web which reliably disambiguates inbound web links may act as an aggregation point for entity knowledge. To uncover these resources, we formalize the task of Web Knowledge Base Discovery (KBD) and develop a system to automatically infer the existence of KB-like endpoints on the web. While extending our framework to multiple KBs increases the breadth of available entity knowledge, we must still consolidate references to the same entity across different web KBs. We investigate this task of Cross-KB Coreference Resolution (KB-Coref) and develop models for efficiently clustering coreferent endpoints across web-scale document collections. Finally, assessing the gap between unstructured web knowledge resources and those of a typical KB, we develop a neural machine translation approach which transforms entity knowledge between unstructured textual mentions and traditional KB structures. The web has great potential as a source of entity knowledge. In this thesis we aim to first discover, distill and finally transform this knowledge into forms which will ultimately be useful in downstream language understanding tasks

    Entity-Oriented Search

    Get PDF
    This open access book covers all facets of entity-oriented search—where “search” can be interpreted in the broadest sense of information access—from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in-depth, the goal being to establish fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, containing numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book. The book is divided into three main parts, sandwiched between introductory and concluding chapters. The first two chapters introduce readers to the basic concepts, provide an overview of entity-oriented search tasks, and present the various types and sources of data that will be used throughout the book. Part I deals with the core task of entity ranking: given a textual query, possibly enriched with additional elements or structural hints, return a ranked list of entities. This core task is examined in a number of different variants, using both structured and unstructured data collections, and numerous query formulations. In turn, Part II is devoted to the role of entities in bridging unstructured and structured data. Part III explores how entities can enable search engines to understand the concepts, meaning, and intent behind the query that the user enters into the search box, and how they can provide rich and focused responses (as opposed to merely a list of documents)—a process known as semantic search. The final chapter concludes the book by discussing the limitations of current approaches, and suggesting directions for future research. Researchers and graduate students are the primary target audience of this book. A general background in information retrieval is sufficient to follow the material, including an understanding of basic probability and statistics concepts as well as a basic knowledge of machine learning concepts and supervised learning algorithms

    A Survey on Open Information Extraction

    Get PDF
    We provide a detailed overview of the various approaches that were proposed to date to solve the task of Open Information Extraction. We present the major challenges that such systems face, show the evolution of the suggested approaches over time and depict the specific issues they address. In addition, we provide a critique of the commonly applied evaluation procedures for assessing the performance of Open IE systems and highlight some directions for future work.Comment: 27th International Conference on Computational Linguistics (COLING 2018
    • …
    corecore