From Data Fusion to Knowledge Fusion
The task of data fusion is to identify the true values of data items
(e.g., the true date of birth for Tom Cruise) among multiple observed
values drawn from different sources (e.g., Web sites) of varying (and unknown)
reliability. A recent survey [LDL+12] has provided a detailed comparison of
various fusion methods on Deep Web data. In this paper, we study the
applicability and limitations of different fusion techniques on a more
challenging problem: knowledge fusion. Knowledge fusion identifies true
subject-predicate-object triples extracted by multiple information extractors
from multiple information sources. These extractors perform the tasks of entity
linkage and schema alignment, thus introducing an additional source of noise
that is quite different from that traditionally considered in the data fusion
literature, which only focuses on factual errors in the original sources. We
adapt state-of-the-art data fusion techniques and apply them to a knowledge
base with 1.6B unique knowledge triples extracted by 12 extractors from over 1B
Web pages, which is three orders of magnitude larger than the data sets used in
previous data fusion papers. We show that data fusion approaches hold great
promise for solving the knowledge fusion problem, and suggest interesting
research directions through a detailed error analysis of the methods.
Comment: VLDB'201
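The core data fusion loop described above, alternating between estimating value truthfulness and source accuracy, can be sketched as follows. This is an illustrative accuracy-weighted voting scheme, not the exact algorithms evaluated in the paper; all names and data are hypothetical:

```python
from collections import defaultdict

def fuse(claims, iterations=10):
    """Iteratively estimate source accuracy and fused values.

    claims: dict mapping data item -> {source: claimed value}.
    Returns a dict mapping each data item to its fused (estimated true) value.
    """
    sources = {s for votes in claims.values() for s in votes}
    accuracy = {s: 0.8 for s in sources}  # initial trust in each source
    fused = {}
    for _ in range(iterations):
        # Step 1: per item, pick the value with the highest total source weight.
        for item, votes in claims.items():
            score = defaultdict(float)
            for src, val in votes.items():
                score[val] += accuracy[src]
            fused[item] = max(score, key=score.get)
        # Step 2: re-estimate each source's accuracy from agreement with fused values.
        for src in sources:
            decided = [(item, val) for item, votes in claims.items()
                       for s, val in votes.items() if s == src]
            if decided:
                correct = sum(fused[item] == val for item, val in decided)
                accuracy[src] = correct / len(decided)
    return fused

# Hypothetical example: three sites disagree on a date of birth.
claims = {
    "Tom Cruise:born": {"siteA": "1962-07-03", "siteB": "1962-07-03",
                        "siteC": "1963-07-03"},
    "Paris:country":   {"siteA": "France", "siteC": "France"},
}
print(fuse(claims))
```

Because source accuracies are re-estimated from agreement with the fused values, a source that disagrees often (siteC above) is gradually down-weighted, which is the basic mechanism shared by many of the surveyed fusion methods.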
MultiImport: Inferring Node Importance in a Knowledge Graph from Multiple Input Signals
Given multiple input signals, how can we infer node importance in a knowledge
graph (KG)? Node importance estimation is a crucial and challenging task that
can benefit a lot of applications including recommendation, search, and query
disambiguation. A key challenge towards this goal is how to effectively use
input from different sources. On the one hand, a KG is a rich source of
information, with multiple types of nodes and edges. On the other hand, there
are external input signals, such as the number of votes or pageviews, which can
directly tell us about the importance of entities in a KG. While several
methods have been developed to tackle this problem, their use of these external
signals has been limited as they are not designed to consider multiple signals
simultaneously. In this paper, we develop an end-to-end model MultiImport,
which infers latent node importance from multiple, potentially overlapping,
input signals. MultiImport is a latent variable model that captures the
relation between node importance and input signals, and effectively learns from
multiple signals with potential conflicts. Also, MultiImport provides an
effective estimator based on attentive graph neural networks. We ran
experiments on real-world KGs to show that MultiImport handles several
challenges involved with inferring node importance from multiple input signals,
and consistently outperforms existing methods, achieving up to 23.7% higher
NDCG@100 than the state-of-the-art method.
Comment: KDD 2020 Research Track. 10 pages
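The multi-signal setting the abstract describes can be illustrated with a toy baseline: min-max normalize each (partial) signal, then take a weighted average over the signals that actually cover each node. This sketch is for intuition only and is not the MultiImport model, which instead learns latent node importance with attentive graph neural networks; all names are invented:

```python
def combine_signals(signals, weights=None):
    """Combine overlapping importance signals into one score per node.

    signals: dict signal_name -> {node: raw value}; each signal may cover
    only a subset of nodes. Each signal is min-max normalized, then a
    weighted average is taken over the signals covering each node.
    """
    weights = weights or {name: 1.0 for name in signals}
    normed = {}
    for name, sig in signals.items():
        lo, hi = min(sig.values()), max(sig.values())
        span = (hi - lo) or 1.0  # avoid division by zero for constant signals
        normed[name] = {n: (v - lo) / span for n, v in sig.items()}
    nodes = {n for sig in signals.values() for n in sig}
    combined = {}
    for n in nodes:
        covering = [name for name in signals if n in signals[name]]
        total_w = sum(weights[name] for name in covering)
        combined[n] = sum(weights[name] * normed[name][n]
                          for name in covering) / total_w
    return combined

# Hypothetical signals with partial, overlapping coverage.
scores = combine_signals({
    "pageviews": {"a": 100, "b": 50},
    "votes":     {"b": 10, "c": 2},
})
print(scores)  # a=1.0, b=0.5, c=0.0
```

Even this crude baseline has to handle the challenges the paper identifies: signals cover different node subsets and can conflict on the nodes they share, which is exactly what MultiImport addresses in a principled, learned way.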
Structured Prediction on Dirty Datasets
Many errors cannot be detected or repaired without taking into account the underlying structure and dependencies in the dataset. One way of modeling the structure of the data is with graphical models. Graphical models combine probability theory and graph theory to address a key objective in designing and fitting probabilistic models: capturing dependencies among relevant random variables. Representing structure helps us understand the side effects of errors and reveals the correct interrelationships between data points. Hence, a principled representation of structure in prediction and cleaning tasks on dirty data is essential for the quality of downstream analytical results. Existing structured prediction research considers limited structures and configurations, with little attention to performance limitations or to how well the problem can be solved in more general settings where the structure is complex and rich.
In this dissertation, I present the following thesis: by leveraging the underlying dependency and structure in machine learning models, we can effectively detect and clean errors via pragmatic structured prediction techniques. To highlight the main contributions: I investigate prediction algorithms and systems on dirty data with more realistic structure and dependencies, to help deploy this type of learning in more pragmatic settings. Specifically, I introduce a few-shot learning framework for error detection that uses structure-based features of the data, such as denial constraint violations and Bayesian networks as co-occurrence features. I then study the problem of recovering the latent ground-truth labeling of a structured instance. Next, I consider the problem of mining integrity constraints from data, specifically using sampling methods to extract approximate denial constraints. Finally, I introduce an ML framework that uses solitary and structured data features to solve the problem of record fusion.
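As a concrete illustration of the structure-based features mentioned above, the sketch below counts, per record, how many pairwise denial-constraint violations it participates in; such counts could serve as features for an error detector. The constraint and all names here are hypothetical examples, not taken from the dissertation:

```python
from itertools import combinations

def dc_violation_counts(rows, violates):
    """For each row, count the other rows it forms a denial-constraint
    violation with. violates(t1, t2) -> True if the ordered pair (t1, t2)
    violates the constraint; both orders are checked.
    """
    counts = [0] * len(rows)
    for (i, t1), (j, t2) in combinations(enumerate(rows), 2):
        if violates(t1, t2) or violates(t2, t1):
            counts[i] += 1
            counts[j] += 1
    return counts

# Hypothetical denial constraint: two records with the same zip code
# must not list different cities.
rows = [
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "New York"},
    {"zip": "10001", "city": "Boston"},   # likely erroneous record
]
same_zip_diff_city = lambda a, b: a["zip"] == b["zip"] and a["city"] != b["city"]
print(dc_violation_counts(rows, same_zip_diff_city))  # [1, 1, 2]
```

The record with the highest violation count (the "Boston" row) is the most suspicious, which is the intuition behind using constraint violations as structure-based features for error detection.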