Reasoning about Record Matching Rules
To accurately match records it is often necessary to utilize the semantics of the data. Functional dependencies (FDs) have proven useful in identifying tuples in a clean relation, based on the semantics of the data. For all the reasons that FDs and their inference are needed, it is also important to develop dependencies and their reasoning techniques for matching tuples from unreliable data sources. This paper investigates dependencies and their reasoning for record matching. (a) We introduce a class of matching dependencies (MDs) for specifying the semantics of data in unreliable relations, defined in terms of similarity metrics and a dynamic semantics. (b) We identify a special case of MDs, referred to as relative candidate keys (RCKs), to determine what attributes to compare and how to compare them when matching records across possibly different relations. (c) We propose a mechanism for inferring MDs, a departure from traditional implication analysis, such that when we cannot match records by comparing attributes that contain errors, we may still find matches by using other, more reliable attributes. (d) We provide an O(n^2)-time algorithm for inferring MDs, and an effective algorithm for deducing a set of RCKs from MDs. (e) We experimentally verify that the algorithms help matching tools efficiently identify keys at compile time for matching, blocking, or windowing, and that the techniques effectively improve both the quality and efficiency of various record matching methods.
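The core idea of a matching dependency can be sketched in a few lines: if two records are pairwise similar on some attributes (under per-attribute similarity metrics), the MD declares that they refer to the same entity. The sketch below is a toy illustration, not the paper's formalism; the attribute names, the 0.8 threshold, and the use of `difflib.SequenceMatcher` as the similarity metric are all assumptions for demonstration.

```python
from difflib import SequenceMatcher

def similar(a, b, threshold=0.8):
    # Stand-in similarity metric; a real MD specifies one per attribute.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

def md_match(t1, t2, lhs):
    """If t1 and t2 agree, up to similarity, on every attribute in lhs,
    the matching dependency says the two records refer to the same entity
    (its dynamic semantics would then identify their key attributes)."""
    return all(similar(str(t1[a]), str(t2[a])) for a in lhs)

t1 = {"name": "Robert Smith",  "phone": "555-0100"}
t2 = {"name": "Robert Smith.", "phone": "555-0100"}
print(md_match(t1, t2, ["name", "phone"]))  # True: names similar, phones equal
```

A relative candidate key, in this caricature, would be a minimal `lhs` list sufficient to warrant a match.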
Graph Homomorphism Revisited for Graph Matching
In a variety of emerging applications one needs to decide whether a graph G matches another G_p, i.e., whether G has a topological structure similar to that of G_p. The traditional notions of graph homomorphism and isomorphism often fall short of capturing the structural similarity in these applications. This paper studies revisions of these notions, providing a full treatment from complexity to algorithms. (1) We propose p-homomorphism (p-hom) and 1-1 p-hom, which extend graph homomorphism and subgraph isomorphism, respectively, by mapping edges from one graph to paths in another, and by measuring the similarity of nodes. (2) We introduce metrics to measure graph similarity, and several optimization problems for p-hom and 1-1 p-hom. (3) We show that the decision problems for p-hom and 1-1 p-hom are NP-complete even for DAGs, and that the optimization problems are approximation-hard. (4) Nevertheless, we provide approximation algorithms with provable guarantees on match quality. We experimentally verify the effectiveness of the revised notions and the efficiency of our algorithms in Web site matching, using real-life and synthetic data.
Capturing Topology in Graph Pattern Matching
Graph pattern matching is often defined in terms of subgraph isomorphism, an
NP-complete problem. To lower its complexity, various extensions of graph
simulation have been considered instead. These extensions allow pattern
matching to be conducted in cubic-time. However, they fall short of capturing
the topology of data graphs, i.e., graphs may have a structure drastically
different from pattern graphs they match, and the matches found are often too
large to understand and analyze. To rectify these problems, this paper proposes
a notion of strong simulation, a revision of graph simulation, for graph
pattern matching. (1) We identify a set of criteria for preserving the topology
of graphs matched. We show that strong simulation preserves the topology of
data graphs and finds a bounded number of matches. (2) We show that strong
simulation retains the same complexity as earlier extensions of simulation, by
providing a cubic-time algorithm for computing strong simulation. (3) We
present the locality property of strong simulation, which allows us to
effectively conduct pattern matching on distributed graphs. (4) We
experimentally verify the effectiveness and efficiency of these algorithms,
using real-life data and synthetic data.
Comment: VLDB201
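Strong simulation builds on plain graph simulation, which can be computed by a simple fixpoint: start with all label-compatible candidates and repeatedly discard data nodes that cannot mimic a pattern node's successors. The sketch below is a naive version of that base notion only (asymptotically worse than the cubic-time algorithms the paper provides, and without strong simulation's topology-preserving restrictions); graph encoding and labels are illustrative assumptions.

```python
def simulation(p_adj, p_label, d_adj, d_label):
    """Fixpoint computation of plain graph simulation: sim[u] holds the
    data-graph nodes that can simulate pattern node u."""
    sim = {u: {v for v in d_label if d_label[v] == p_label[u]}
           for u in p_label}
    changed = True
    while changed:
        changed = False
        for u, u_succs in p_adj.items():
            for u2 in u_succs:
                for v in list(sim[u]):
                    # v must have a successor that simulates u2
                    if not any(w in sim[u2] for w in d_adj.get(v, ())):
                        sim[u].discard(v)
                        changed = True
    return sim

# Pattern A -> B; data node v3 is labeled A but has no B-successor, so it drops out.
p_adj, p_label = {"u1": ["u2"], "u2": []}, {"u1": "A", "u2": "B"}
d_adj = {"v1": ["v2"], "v2": [], "v3": []}
d_label = {"v1": "A", "v2": "B", "v3": "A"}
print(simulation(p_adj, p_label, d_adj, d_label))
```

The abstract's point is that matches found this way can stray far from the pattern's topology; strong simulation adds duality and locality conditions to rein that in.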
Towards Certain Fixes with Editing Rules and Master Data
A variety of integrity constraints have been studied for data cleaning. While these constraints can detect the presence of errors, they fall short of guiding us to correct the errors. Indeed, data repairing based on these constraints may not find certain fixes that are absolutely correct, and worse, may introduce new errors when repairing the data. We propose a method for finding certain fixes, based on master data, a notion of certain regions, and a class of editing rules. A certain region is a set of attributes that are assured correct by the users. Given a certain region and master data, editing rules tell us what attributes to fix and how to update them. We show how the method can be used in data monitoring and enrichment. We develop techniques for reasoning about editing rules, to decide whether they lead to a unique fix and whether they are able to fix all the attributes in a tuple, relative to master data and a certain region. We also provide an algorithm to identify minimal certain regions, such that a certain fix is warranted by editing rules and master data as long as one of the regions is correct. We experimentally verify the effectiveness and scalability of the algorithm.
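The mechanics of a single editing rule can be sketched directly: match the input tuple to master data on the certain-region attributes, then copy the master value into the attribute being fixed. This is a hypothetical toy (the attribute names, the master relation, and the single-rule shape are all assumptions), not the paper's rule language:

```python
def apply_editing_rule(t, master, region, fix_attr):
    """One editing rule applied to tuple t: if t agrees with some master
    tuple on the attributes in `region` (assured correct by the user),
    copy the master value of fix_attr into t."""
    for m in master:
        if all(t[a] == m[a] for a in region):
            return dict(t, **{fix_attr: m[fix_attr]})
    return t  # no master match: leave t unchanged

master = [{"zip": "EH8 9AB", "city": "Edinburgh"},
          {"zip": "NW1 2DB", "city": "London"}]
t = {"zip": "EH8 9AB", "city": "Edinburg"}  # zip lies in the certain region
print(apply_editing_rule(t, master, region=["zip"], fix_attr="city"))
# {'zip': 'EH8 9AB', 'city': 'Edinburgh'}
```

Because the matched attributes are assured correct and master data is trusted, the resulting fix is certain rather than heuristic, which is the contrast with constraint-based repairing drawn above.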
Unsupervised Neural Machine Translation with SMT as Posterior Regularization
Without a real bilingual corpus available, unsupervised Neural Machine Translation (NMT) typically requires pseudo-parallel data generated with the back-translation method for model training. However, due to weak supervision, the pseudo data inevitably contain noise and errors that are accumulated and reinforced in the subsequent training process, leading to poor translation performance. To address this issue, we introduce phrase-based Statistical Machine Translation (SMT) models, which are robust to noisy data, as posterior regularization to guide the training of unsupervised NMT models in the iterative back-translation process. Our method starts from SMT models built with pre-trained language models and word-level translation tables inferred from cross-lingual embeddings. Then SMT and NMT models are optimized jointly and boost each other incrementally in a unified EM framework. In this way, (1) the negative effect caused by errors in the iterative back-translation process can be alleviated in a timely manner by SMT filtering noise from its phrase tables; meanwhile, (2) NMT can compensate for the deficiency in fluency inherent to SMT. Experiments conducted on en-fr and en-de translation tasks show that our method outperforms the strong baseline and achieves new state-of-the-art unsupervised machine translation performance.
Comment: To be presented at AAAI 2019; 9 pages, 4 figures