Efficient Discovery of Ontology Functional Dependencies
Poor data quality has become a pervasive issue due to the increasing
complexity and size of modern datasets. Constraint based data cleaning
techniques rely on integrity constraints as a benchmark to identify and correct
errors. Data values that do not satisfy the given set of constraints are
flagged as dirty, and data updates are made to re-align the data and the
constraints. However, many errors often require user input to resolve due to
domain expertise defining specific terminology and relationships. For example,
in pharmaceuticals, 'Advil' is-a brand name for 'ibuprofen' that can be
captured in a pharmaceutical ontology. While functional dependencies (FDs) have
traditionally been used in existing data cleaning solutions to model syntactic
equivalence, they are not able to model broader relationships (e.g., is-a)
defined by an ontology. In this paper, we take a first step towards extending
the set of data quality constraints used in data cleaning by defining and
discovering Ontology Functional Dependencies (OFDs). We lay out
theoretical and practical foundations for OFDs, including a set of sound and
complete axioms, and a linear inference procedure. We then develop effective
algorithms for discovering OFDs, and a set of optimizations that efficiently
prune the search space. Our experimental evaluation using real data shows the
scalability and accuracy of our algorithms.
Comment: 12 page
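The contrast between a traditional FD and an ontology-aware dependency can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy `ONTOLOGY`, the attribute names `symptom` and `medication`, the sample rows, and the function names are all hypothetical, and the check treats names listed under one concept as interchangeable, which is a simplification and not the paper's discovery algorithm.

```python
from collections import defaultdict

# Hypothetical toy ontology: each concept maps to the names that denote it.
ONTOLOGY = {
    "ibuprofen": {"ibuprofen", "Advil", "Motrin"},
    "acetaminophen": {"acetaminophen", "Tylenol"},
}

def group_by(rows, lhs):
    """Group rows by their values on the left-hand-side attributes."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[a] for a in lhs)].append(row)
    return groups

def holds_fd(rows, lhs, rhs):
    """Classic FD: rows that agree on lhs must have identical rhs values."""
    return all(len({r[rhs] for r in g}) == 1
               for g in group_by(rows, lhs).values())

def holds_ofd(rows, lhs, rhs, ontology):
    """Ontology-aware check: rows that agree on lhs must have rhs values
    that are either identical or all covered by one ontology concept."""
    for g in group_by(rows, lhs).values():
        values = {r[rhs] for r in g}
        if len(values) == 1:
            continue  # syntactic equality is always acceptable
        if not any(values <= names for names in ontology.values()):
            return False
    return True

# Made-up sample relation.
ROWS = [
    {"symptom": "headache", "medication": "Advil"},
    {"symptom": "headache", "medication": "ibuprofen"},
    {"symptom": "fever", "medication": "Tylenol"},
]

fd_ok = holds_fd(ROWS, ["symptom"], "medication")
ofd_ok = holds_ofd(ROWS, ["symptom"], "medication", ONTOLOGY)
mixed = ROWS + [{"symptom": "headache", "medication": "Tylenol"}]
mixed_ok = holds_ofd(mixed, ["symptom"], "medication", ONTOLOGY)
```

On the sample rows the plain FD is violated because 'Advil' and 'ibuprofen' differ as strings, while the ontology-aware check accepts them; adding a 'Tylenol' row for the same symptom is flagged by both.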
Towards effective analysis of big graphs: from scalability to quality
This thesis investigates the central issues underlying graph analysis, namely, scalability
and quality.
We first study the incremental problems for graph queries, which aim to compute
the changes to the old query answer, in response to the updates to the input graph.
The incremental problem is called bounded if its cost is decided by the sizes of the
query and the changes only. No matter how desirable, however, our first results are
negative: for common graph queries such as graph traversal, connectivity, keyword
search and pattern matching, their incremental problems are unbounded. In light of
the negative results, we propose two new characterizations for the effectiveness of
incremental computation, and show that the incremental computations above can still
be effectively conducted, by either reducing the computations on big graphs to small
data, or incrementalizing batch algorithms by minimizing unnecessary recomputation.
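As a rough illustration of what "computing the changes to the old query answer" means, the Python sketch below maintains the set of nodes reachable from a fixed source when an edge is inserted. It is an assumption-laden toy, not an algorithm from the thesis: the function names are hypothetical, only insertions are handled, and no claim about boundedness is intended.

```python
from collections import deque

def reachable_from(graph, source):
    """Batch computation: all nodes reachable from source (BFS)."""
    seen = {source}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

def insert_edge(graph, reached, u, v):
    """Insert edge u -> v, update `reached` in place, and return only
    the change to the old answer (the newly reachable nodes)."""
    graph.setdefault(u, set()).add(v)
    if u not in reached or v in reached:
        return set()  # the old answer is unaffected
    delta = {v}
    queue = deque([v])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            # Explore only nodes that were not reachable before.
            if nxt not in reached and nxt not in delta:
                delta.add(nxt)
                queue.append(nxt)
    reached |= delta
    return delta

# Made-up example graph as adjacency sets.
graph = {"a": {"b"}, "c": {"d"}}
reached = reachable_from(graph, "a")
delta = insert_edge(graph, reached, "b", "c")
no_change = insert_edge(graph, reached, "d", "a")
```

The incremental step never revisits nodes that were already in the old answer, which is the kind of unnecessary recomputation that incrementalization tries to avoid.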
We next study the problems with regard to improving the quality of the graphs.
To uniquely identify entities represented by vertices in a graph, we propose a class of
keys that are recursively defined in terms of graph patterns, and are interpreted with
subgraph isomorphism. As an application, we study the entity matching problem,
which is to find all pairs of entities in a graph that are identified by a given set of
keys. Although the problem is proved to be intractable, and cannot be parallelized in
logarithmic rounds, we provide two parallel scalable algorithms for it.
In addition, to catch numeric inconsistencies in real-life graphs, we extend graph
functional dependencies with linear arithmetic expressions and comparison predicates,
referred to as NGDs. Indeed, NGDs strike a balance between expressivity and complexity,
since if we allow non-linear arithmetic expressions, even of degree at most 2, the
satisfiability and implication problems become undecidable. A localizable incremental
algorithm is developed to detect errors using NGDs, where the cost is determined by
small neighbors of nodes in the updates instead of the entire graph.
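The idea of a numeric rule with a linear expression and a comparison predicate, checked locally around updated nodes, can be sketched as follows. Everything in the sketch is an assumption made for illustration: the `partOf` edge label, the `population` attribute, the rule itself (a part cannot have a larger population than the whole), the sample values, and the function names. It is not the NGD syntax or the detection algorithm of the thesis.

```python
from collections import defaultdict

def build_index(edges):
    """Map each node to the edges incident to it."""
    index = defaultdict(list)
    for edge in edges:
        x, _label, y = edge
        index[x].append(edge)
        index[y].append(edge)
    return index

def violates(nodes, edge):
    """Hypothetical rule: for x -[partOf]-> y,
    require x.population - y.population <= 0."""
    x, label, y = edge
    if label != "partOf":
        return False
    return nodes[x]["population"] - nodes[y]["population"] > 0

def detect_all(nodes, edges):
    """Batch detection: inspect every edge in the graph."""
    return {e for e in edges if violates(nodes, e)}

def detect_local(nodes, index, updated):
    """Local detection: inspect only edges touching updated nodes."""
    return {e for n in updated for e in index.get(n, ()) if violates(nodes, e)}

# Made-up graph with made-up attribute values.
nodes = {
    "district": {"population": 50},
    "city": {"population": 500},
    "country": {"population": 5000},
}
edges = [("district", "partOf", "city"), ("city", "partOf", "country")]
index = build_index(edges)

before = detect_all(nodes, edges)
nodes["district"]["population"] = 900  # an update introduces an error
local = detect_local(nodes, index, {"district"})
```

After the update, `detect_local` looks at the single edge incident to the changed node instead of rescanning the whole edge list, and it finds the same violation that a full pass would.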
Finally, a rule-based method to clean graphs is proposed. We extend graph entity
dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of
ground truth, we fix violations of GEDs in the graph by combining data repairing and
object identification. The method finds certain fixes to errors detected by GEDs, i.e.,
as long as the GEDs and the ground truth are correct, the fixes are assured correct as
their logical consequences. Several fundamental results underlying the method are established,
and an algorithm is developed to implement the method. We also parallelize
the method and guarantee that its running time decreases as processors are added.