    Effective distant supervision for end-to-end knowledge base population systems

    The growing amounts of textual data require automatic methods for structuring relevant information so that it can be further processed by computers and systematically accessed by humans. The scenario dealt with in this dissertation is known as Knowledge Base Population (KBP), where relational information about entities is retrieved from a large text collection and stored in a database, structured according to a pre-specified schema. Most of the research in this dissertation is placed in the context of the KBP benchmark of the Text Analysis Conference (TAC KBP), which provides a test-bed for examining all steps of a complex end-to-end relation extraction setting. This dissertation establishes a new state of the art for the TAC KBP benchmark by focusing on the following research problems: (1) The KBP task was broken down into a modular pipeline of sub-problems, and the most pressing issues were identified and quantified at each step. (2) The quality of semi-automatically generated training data was increased by developing noise-reduction methods that decrease the influence of false-positive training examples. (3) A focus was laid on fine-grained entity type modelling, entity expansion, and entity matching and tagging, to maintain as much recall as possible at the relational argument level. (4) A new set of effective methods for generating training data, encoding features, and training relational classifiers was developed and compared with previous state-of-the-art methods.
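
    The distant-supervision step the abstract refers to can be illustrated with a minimal sketch: known facts from a knowledge base are aligned with raw sentences, and every sentence mentioning both arguments of a fact is taken as a (noisy) positive training example for that relation. Everything below (the facts, sentences, and helper names) is invented for illustration, not taken from the dissertation; the slot names follow the TAC KBP naming convention.

        # Minimal distant-supervision sketch (illustrative data, hypothetical names).
        # The second sentence shows the kind of false-positive training example
        # that noise-reduction methods aim to filter out.

        KB_FACTS = {
            ("Barack Obama", "Honolulu"): "per:city_of_birth",
            ("Google", "Larry Page"): "org:founded_by",
        }

        def label_sentences(sentences):
            """Yield (sentence, subject, object, relation) training examples."""
            for sent in sentences:
                for (subj, obj), relation in KB_FACTS.items():
                    if subj in sent and obj in sent:  # noisy co-occurrence heuristic
                        yield sent, subj, obj, relation

        sentences = [
            "Barack Obama was born in Honolulu.",
            "Barack Obama visited Honolulu in 2008.",  # a visit, not a birth
        ]
        for example in label_sentences(sentences):
            print(example)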

    Slot Filling

    Slot filling (SF) is the task of automatically extracting facts about particular entities from unstructured text and populating a knowledge base (KB) with these facts. These structured KBs enable applications such as structured web queries and question answering. SF is typically framed as a query-oriented setting of the related task of relation extraction. Throughout this thesis, we reflect on how SF is a task with many distinct problems. We demonstrate that recall is a major limiter of SF system performance. We contribute an analysis of typical SF recall loss, and find that a substantial amount of loss occurs early in the SF pipeline. We confirm that accurate NER and coreference resolution are required for high-recall SF. We measure upper bounds using a naïve graph-based semi-supervised bootstrapping technique, and find that only 39% of results are reachable using a typical feature space. We expect that this graph-based technique will be directly useful for extraction, and this leads us to frame SF as a label propagation task. We focus on a detailed graph representation of the task which reflects the behaviour and assumptions we want to model based on our analysis, including modifying the label propagation process to model multiple types of label interaction. Analysing the graph, we find that a large number of errors occur in very close proximity to training data, and identify that this is a major concern for propagation. While some conflicts are caused by a lack of sufficient disambiguating context (we explore adding contextual features to address this), many of these conflicts are caused by subtle annotation problems. We find that the lack of a standard for how explicit expressions of relations must be in text makes consistent annotation difficult. Using a strict definition of explicitness results in 20% of correct annotations being removed from a standard dataset. We contribute several annotation-driven analyses of this problem, exploring the definition of slots and the effect of the lack of a concrete definition of explicitness: annotation schemas do not specify how explicit expressions of relations need to be, and there is large scope for disagreement between annotators. Additionally, applications may require relatively strict or relaxed evidence for extractions, but this is not considered in annotation tasks. We demonstrate that annotators frequently disagree on instances, depending on differences in annotator world knowledge and on their thresholds for making probabilistic inferences. SF is fundamental to enabling many knowledge-based applications, and this work motivates modelling and evaluating SF to better target these tasks.
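
    A hedged sketch of the label propagation framing mentioned above: candidate slot fills become nodes in a similarity graph, a few seed nodes carry gold labels, and labels diffuse along weighted edges until convergence. The graph weights and the two-class setup are invented for illustration; the thesis's actual graph representation and label-interaction modelling are richer than this.

        import numpy as np

        # Similarity graph over five candidate instances; the weights are
        # illustrative feature-overlap scores, not real data.
        W = np.array([
            [0.0, 0.8, 0.1, 0.0, 0.0],
            [0.8, 0.0, 0.5, 0.1, 0.0],
            [0.1, 0.5, 0.0, 0.7, 0.2],
            [0.0, 0.1, 0.7, 0.0, 0.9],
            [0.0, 0.0, 0.2, 0.9, 0.0],
        ])

        # Soft labels over two classes (correct fill / incorrect fill);
        # nodes 0 and 4 are labeled seeds, the others start uniform.
        Y = np.array([[1.0, 0.0],
                      [0.5, 0.5],
                      [0.5, 0.5],
                      [0.5, 0.5],
                      [0.0, 1.0]])
        seeds = [0, 4]
        seed_labels = Y[seeds].copy()

        D_inv = np.diag(1.0 / W.sum(axis=1))  # row-normalise the graph
        for _ in range(50):
            Y = D_inv @ W @ Y       # propagate labels to neighbours
            Y[seeds] = seed_labels  # clamp the seeds after each step

        print(Y.round(3))  # soft labels for the unlabeled candidates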

    Joint Discourse-aware Concept Disambiguation and Clustering

    This thesis addresses the tasks of concept disambiguation and clustering. Concept disambiguation is the task of linking common nouns and proper names in a text (henceforth called mentions) to their corresponding concepts in a predefined inventory. Concept clustering is the task of clustering mentions so that all mentions in one cluster denote the same concept. In this thesis, we investigate concept disambiguation and clustering from a discourse perspective and propose a discourse-aware approach for joint concept disambiguation and clustering in the framework of Markov logic. The contributions of this thesis are fourfold:

    Joint Concept Disambiguation and Clustering. In previous approaches, concept disambiguation and concept clustering have been considered as two separate tasks (Schütze, 1998; Ji & Grishman, 2011). We analyze the relationship between concept disambiguation and concept clustering and argue that the two tasks can mutually support each other. We propose the, to our knowledge, first joint approach for concept disambiguation and clustering.

    Discourse-Aware Concept Disambiguation. One of the determining factors for concept disambiguation and clustering is the context definition. Most previous approaches use the same context definition for all mentions (Milne & Witten, 2008b; Kulkarni et al., 2009; Ratinov et al., 2011, inter alia). We approach the question of which context is relevant for disambiguating a mention from a discourse perspective and argue that different mentions require different notions of context: the context that is relevant for disambiguating a mention depends on its embedding into discourse. However, how a mention is embedded into discourse depends on its denoted concept. Hence, the identification of the denoted concept and of the relevant context mutually depend on each other. We propose a binwise approach with three different context definitions and model the selection of the context definition and the disambiguation jointly.

    Modeling Interdependencies with Markov Logic. To model the interdependencies between concept disambiguation and concept clustering, as well as those between the context definition and the disambiguation, we use Markov logic (Domingos & Lowd, 2009). Markov logic combines first-order logic with probabilities and allows us to concisely formalize these interdependencies. We investigate how to balance linguistic appropriateness against time efficiency and propose a hybrid approach that combines joint inference with aggregation techniques.

    Concept Disambiguation and Clustering beyond English: Multi- and Cross-linguality. Given the vast amount of text written in different languages, the capability of an approach to cope with languages other than English is essential. We analyze how our approach copes with languages other than English and show that it largely scales across languages, even without retraining.

    Our approach is evaluated on multiple data sets originating from different sources (e.g. news, web) and across multiple languages. As the inventory, we use Wikipedia. We compare our approach to other approaches and show that it achieves state-of-the-art results. Furthermore, we show that joint concept disambiguation and clustering, as well as joint context selection and disambiguation, lead to significant improvements ceteris paribus.
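
    To make the Markov logic idea concrete, here is a toy sketch, not the thesis's actual model: weighted rules score joint assignments of concepts to mentions, and a clustering rule couples the two decisions so that disambiguation and clustering support each other. The mentions, candidate concepts, weights, and context feature are all invented for illustration.

        import itertools

        # Toy Markov-logic-style scoring (all data and weights are invented).
        mentions = ["Java", "the language"]
        candidates = {
            "Java": ["Java_(programming_language)", "Java_(island)"],
            "the language": ["Java_(programming_language)", "English_language"],
        }

        def context_fit(mention, concept):
            # Stand-in for a real context-compatibility feature.
            return 1.0 if "programming" in concept else 0.2

        W_CONTEXT = 1.0  # rule: the concept should fit the mention's context
        W_CLUSTER = 2.0  # rule: coreferent mentions denote the same concept

        def score(assignment):
            s = sum(W_CONTEXT * context_fit(m, c) for m, c in assignment.items())
            if assignment["Java"] == assignment["the language"]:
                s += W_CLUSTER  # joint term coupling disambiguation and clustering
            return s

        best = max(
            (dict(zip(candidates, combo))
             for combo in itertools.product(*candidates.values())),
            key=score,
        )
        print(best, "score:", round(score(best), 2))

    In a real Markov logic network, such weighted first-order rules are grounded over all mentions and probabilistic inference finds the most likely joint assignment; the exhaustive enumeration above only works for toy inputs.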

    Entity-Oriented Search

    This open access book covers all facets of entity-oriented search, where “search” can be interpreted in the broadest sense of information access, from a unified point of view, and provides a coherent and comprehensive overview of the state of the art. It represents the first synthesis of research in this broad and rapidly developing area. Selected topics are discussed in depth, with the goal of establishing fundamental techniques and methods as a basis for future research and development. Additional topics are treated at a survey level only, with numerous pointers to the relevant literature. A roadmap for future research, based on open issues and challenges identified along the way, rounds out the book. The book is divided into three main parts, sandwiched between introductory and concluding chapters. The first two chapters introduce readers to the basic concepts, provide an overview of entity-oriented search tasks, and present the various types and sources of data that are used throughout the book. Part I deals with the core task of entity ranking: given a textual query, possibly enriched with additional elements or structural hints, return a ranked list of entities. This core task is examined in a number of different variants, using both structured and unstructured data collections and numerous query formulations. Part II is devoted to the role of entities in bridging unstructured and structured data. Part III explores how entities can enable search engines to understand the concepts, meaning, and intent behind the query that the user enters into the search box, and how they can provide rich and focused responses (as opposed to merely a list of documents), a process known as semantic search. The final chapter concludes the book by discussing the limitations of current approaches and suggesting directions for future research. Researchers and graduate students are the primary target audience of this book. A general background in information retrieval is sufficient to follow the material, including an understanding of basic probability and statistics concepts as well as basic knowledge of machine learning and supervised learning algorithms.
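
    As a concrete taste of the core entity-ranking task described in Part I, the sketch below scores entities against a keyword query with plain TF-IDF over short textual entity descriptions. The entities, descriptions, and query are invented, and real systems covered by the book use far richer entity representations and ranking models.

        import math
        from collections import Counter

        # Invented toy catalogue: entity name -> short textual description.
        entities = {
            "Ann Dunham": "anthropologist mother of Barack Obama born in Kansas",
            "Barack Obama": "44th president of the United States born in Honolulu",
            "Honolulu": "capital of the state of Hawaii",
        }

        def tf_idf_score(query, description, all_descriptions):
            """Sum of tf * idf over query terms, on whitespace tokens."""
            terms = Counter(description.lower().split())
            n = len(all_descriptions)
            score = 0.0
            for term in query.lower().split():
                df = sum(1 for d in all_descriptions if term in d.lower().split())
                if df:
                    score += terms[term] * math.log(n / df)
            return score

        query = "president born in Honolulu"
        descriptions = list(entities.values())
        ranked = sorted(entities,
                        key=lambda e: tf_idf_score(query, entities[e], descriptions),
                        reverse=True)
        print(ranked)  # 'Barack Obama' should rank first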