Unsupervised String Transformation Learning for Entity Consolidation
Data integration has been a long-standing challenge in data management with
many applications. A key step in data integration is entity consolidation. It
takes a collection of clusters of duplicate records as input and produces a
single "golden record" for each cluster, which contains the canonical value for
each attribute. Truth discovery and data fusion methods, as well as Master Data
Management (MDM) systems, can be used for entity consolidation. However, to
achieve better results, the variant values (i.e., values that are logically the
same but formatted differently) in the clusters need to be consolidated before
applying these methods.
For this purpose, we propose a data-driven method to standardize the variant
values based on two observations: (1) the variant values usually can be
transformed to the same representation (e.g., "Mary Lee" and "Lee, Mary") and
(2) the same transformation often appears repeatedly across different clusters
(e.g., transpose the first and last name). Our approach first uses an
unsupervised method to generate groups of value pairs that can be transformed
in the same way (i.e., they share a transformation). Then the groups are
presented to a human for verification and the approved ones are used to
standardize the data. On a real-world dataset with 17,497 records, our method
achieved 75% recall and 99.5% precision in standardizing variant values by
asking a human 100 yes/no questions, outperforming a state-of-the-art
data wrangling tool.
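The grouping idea in this abstract (value pairs that share a transformation) can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a much simpler notion of "transformation", namely a reordering of word tokens with literal separators, and the names `signature` and `group_pairs` are hypothetical helpers introduced only for illustration.

```python
import re
from collections import defaultdict
from itertools import permutations


def signature(src, dst):
    """Describe dst as a rearrangement of src's word tokens, or return None."""
    src_tokens = re.findall(r"\w+", src)
    # re.split with a capturing group alternates separators (even indices)
    # and word tokens (odd indices).
    parts = re.split(r"(\w+)", dst)
    steps = []
    for i, part in enumerate(parts):
        if i % 2 == 1:
            if part not in src_tokens:
                return None  # dst uses a token that src does not contain
            steps.append(src_tokens.index(part))  # position of the token in src
        else:
            steps.append(part)  # literal separator text, possibly empty
    return (len(src_tokens), tuple(steps))


def group_pairs(clusters):
    """Group ordered value pairs from each cluster by shared signature."""
    groups = defaultdict(list)
    for cluster in clusters:
        for src, dst in permutations(cluster, 2):
            if src == dst:
                continue
            sig = signature(src, dst)
            if sig is not None:
                groups[sig].append((src, dst))
    return groups


# Toy clusters of duplicate values (illustrative data, not from the paper).
clusters = [
    ["Mary Lee", "Lee, Mary"],
    ["John Smith", "Smith, John"],
    ["IBM", "I.B.M."],
]
groups = group_pairs(clusters)
for sig, pairs in groups.items():
    print(sig, pairs)
```

Under these assumptions, the name-transposition pairs from the first two clusters land in the same group, so a single yes/no question to a human could approve the transformation for both. The "IBM" pair is left ungrouped because this simplified signature cannot express it.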
OpenTag: Open Attribute Value Extraction from Product Profiles [Deep Learning, Active Learning, Named Entity Recognition]
Extraction of missing attribute values aims to find values describing an
attribute of interest in free-text input. Most prior work on this task makes
a closed-world assumption, with the possible set of values known beforehand,
or relies on dictionaries of values and hand-crafted features.
How can we discover new attribute values that we have
never seen before? Can we do this with limited human annotation or supervision?
We study this problem in the context of product catalogs that often have
missing values for many attributes of interest.
In this work, we leverage product profile information such as titles and
descriptions to discover missing values of product attributes. We develop a
novel deep tagging model OpenTag for this extraction problem with the following
contributions: (1) we formalize the problem as a sequence tagging task, and
propose a joint model exploiting recurrent neural networks (specifically,
bidirectional LSTM) to capture context and semantics, and Conditional Random
Fields (CRF) to enforce tagging consistency, (2) we develop a novel attention
mechanism to provide interpretable explanation for our model's decisions, (3)
we propose a novel sampling strategy exploring active learning to reduce the
burden of human annotation. Unlike prior work, OpenTag does not use any
dictionary or hand-crafted features. Extensive experiments on real-life datasets in
different domains show that OpenTag with our active learning strategy discovers
new attribute values from as few as 150 annotated samples (a 3.3x reduction in
annotation effort) with a high F-score of 83%, outperforming
state-of-the-art models.
Comment: Proceedings of the 24th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining, London, UK, August 19-23, 201
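The sequence tagging formulation mentioned in this abstract can be made concrete with a small sketch of how attribute values map to per-token tags and back. It assumes a plain BIO tag scheme, which is a common choice but is not stated in the abstract, and it does not implement the BiLSTM-CRF model itself. The function names and the product title are hypothetical.

```python
def bio_encode(tokens, value_tokens):
    """Tag each token as B (begins a value), I (inside a value), or O (outside)."""
    tags = ["O"] * len(tokens)
    n = len(value_tokens)
    if n == 0:
        return tags
    i = 0
    while i <= len(tokens) - n:
        if tokens[i:i + n] == value_tokens:
            tags[i] = "B"
            tags[i + 1:i + n] = ["I"] * (n - 1)
            i += n  # skip past the match so spans never overlap
        else:
            i += 1
    return tags


def bio_decode(tokens, tags):
    """Recover the attribute values (as strings) from a tag sequence."""
    values, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                values.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                values.append(" ".join(current))
            current = []
    if current:
        values.append(" ".join(current))
    return values


# Hypothetical product title with a multi-word flavor value.
tokens = ["duck", "fillet", "mignon", "dog", "food"]
tags = bio_encode(tokens, ["duck", "fillet", "mignon"])
print(tags)
print(bio_decode(tokens, tags))
```

Because the output is a tag per token rather than a choice from a fixed vocabulary, a tagger trained on sequences like this can, in principle, emit values it never saw during training, which is the open-world property the abstract emphasizes.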
Robust Algorithms for Clustering with Applications to Data Integration
A growing number of data-based applications are used to make decisions that have far-reaching consequences and significant societal impact. Entity resolution, community detection and taxonomy construction are some of the building blocks of these applications, and clustering is the fundamental concept underlying all of them. Therefore, the importance of accurate, robust and scalable clustering methods cannot be overstated. We tackle the various facets of clustering with a multi-pronged approach described below.
1. While identifying clusters that refer to different entities is challenging for automated strategies, it is relatively easy for humans. We study the robustness of clustering methods that leverage supervision through an oracle, i.e., an abstraction of crowdsourcing. Additionally, we focus on scalability to handle web-scale datasets.
2. In community detection applications, a common obstacle in evaluating the quality of clustering techniques is the lack of ground truth data. We propose a generative model that considers dependent edge formation and devise techniques for efficient cluster recovery.
- …