39 research outputs found

    OpenTag: Open Attribute Value Extraction from Product Profiles [Deep Learning, Active Learning, Named Entity Recognition]

    Extraction of missing attribute values aims to find values describing an attribute of interest from free-text input. Most prior work on this task makes a closed-world assumption, with the possible set of values known beforehand, or relies on dictionaries of values and hand-crafted features. How can we discover new attribute values that we have never seen before? Can we do this with limited human annotation or supervision? We study this problem in the context of product catalogs, which often have missing values for many attributes of interest. In this work, we leverage product profile information such as titles and descriptions to discover missing values of product attributes. We develop a novel deep tagging model, OpenTag, for this extraction problem with the following contributions: (1) we formalize the problem as a sequence tagging task, and propose a joint model exploiting recurrent neural networks (specifically, bidirectional LSTM) to capture context and semantics, and Conditional Random Fields (CRF) to enforce tagging consistency; (2) we develop a novel attention mechanism to provide interpretable explanations for our model's decisions; (3) we propose a novel sampling strategy exploring active learning to reduce the burden of human annotation. OpenTag does not use any dictionary or hand-crafted features as in prior works. Extensive experiments on real-life datasets in different domains show that OpenTag with our active learning strategy discovers new attribute values from as few as 150 annotated samples (a 3.3x reduction in annotation effort) with a high F-score of 83%, outperforming state-of-the-art models. Comment: Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, London, UK, August 19-23, 201
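    To illustrate what "formalizing the problem as a sequence tagging task" can look like, here is a minimal sketch that turns per-token tags into attribute values. The BIO tag scheme, the function name `extract_values`, and the sample product title are assumptions made for illustration; they are not taken from the abstract, and the tags are written by hand rather than predicted by a BiLSTM-CRF model.

```python
from typing import List, Tuple


def extract_values(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (attribute, value) pairs from BIO-style tags (assumed scheme).

    "B-x" starts a value for attribute x, "I-x" continues it,
    and any other tag (e.g. "O") closes the current value.
    """
    values: List[Tuple[str, str]] = []
    current: List[str] = []
    current_attr = None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new value begins; flush any value still open.
            if current:
                values.append((current_attr, " ".join(current)))
            current = [token]
            current_attr = tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_attr:
            # Continue the value that is currently open.
            current.append(token)
        else:
            # Outside any value (or an inconsistent tag): close the open value.
            if current:
                values.append((current_attr, " ".join(current)))
            current = []
            current_attr = None

    if current:
        values.append((current_attr, " ".join(current)))
    return values


# Hypothetical product title with hand-written tags.
title = ["ranch", "raised", "beef", "dog", "food"]
tags = ["B-flavor", "I-flavor", "I-flavor", "O", "O"]
print(extract_values(title, tags))
```

    Because the value is read off the tagged tokens rather than looked up in a fixed list, a tagger of this kind can in principle surface values it has never seen, which is the open-world setting the abstract describes.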

    Ontology Mediated Information Extraction with MASTRO SYSTEM-T

    In several data-centric application domains, the need arises to extract valuable information from unstructured text documents. The recent paradigm of Ontology Mediated Information Extraction (OMIE) addresses this problem by taking into account the knowledge expressed by a domain ontology, and reasoning over it to improve the quality of extracted data. MASTRO SYSTEM-T is a novel tool for OMIE, developed by Sapienza University and IBM Almaden Research. In this work, we demonstrate its usage for information extraction over real-world financial text documents from the U.S. EDGAR system.

    Debugging Schema Mappings with Routes

    A schema mapping is a high-level declarative specification of the relationship between two schemas; it specifies how data structured under one schema, called the source schema, is to be converted into data structured under a possibly different schema, called the target schema. Schema mappings are fundamental components for both data exchange and data integration. To date, a language for specifying (or programming) schema mappings exists. However, developmental support for programming schema mappings is still lacking. In particular, a tool for debugging schema mappings has not yet been developed. In this paper, we propose to build a debugger for understanding and exploring schema mappings. We present a primary feature of our debugger, called routes, that describes the relationship between source and target data with the schema mapping. We present two algorithms for computing all routes or one route for selected target data. Both algorithms execute in polynomial time in the size of the input. In computing all routes, our algorithm produces a concise representation that factors common steps in the routes. Furthermore, every minimal route for the selected data can, essentially, be found in this representation. Our second algorithm is able to produce one route fast, if there is one, and alternative routes as needed. We demonstrate the feasibility of our route algorithms through a set of experimental results on both synthetic and real datasets.

    DBNotes: A Post-It System for Relational Databases based on Provenance

    We demonstrate DBNotes, a Post-It note system for relational databases where every piece of data may be associated with zero or more notes (or annotations). These annotations are transparently propagated along as data is being transformed