3 research outputs found
Towards effective analysis of big graphs: from scalability to quality
This thesis investigates the central issues underlying graph analysis, namely, scalability
and quality.
We first study the incremental problems for graph queries, which aim to compute
the changes to the old query answer, in response to the updates to the input graph.
The incremental problem is called bounded if its cost is decided by the sizes of the
query and the changes only. No matter how desirable, however, our first results are
negative: for common graph queries such as graph traversal, connectivity, keyword
search and pattern matching, their incremental problems are unbounded. In light of
the negative results, we propose two new characterizations for the effectiveness of
incremental computation, and show that the incremental computations above can still
be effectively conducted, by either reducing the computations on big graphs to small
data, or incrementalizing batch algorithms by minimizing unnecessary recomputation.
We next study the problems with regards to improving the quality of the graphs.
To uniquely identify entities represented by vertices in a graph, we propose a class of
keys that are recursively defined in terms of graph patterns, and are interpreted with
subgraph isomorphism. As an application, we study the entity matching problem,
which is to find all pairs of entities in a graph that are identified by a given set of
keys. Although the problem is proved to be intractable, and cannot be parallelized in
logarithmic rounds, we provide two parallel scalable algorithms for it.
In addition, to catch numeric inconsistencies in real-life graphs, we extend graph
functional dependencies with linear arithmetic expressions and comparison predicates,
referred to as NGDs. Indeed, NGDs strike a balance between expressivity and complexity,
since if we allow non-linear arithmetic expressions, even of degree at most 2, the
satisfiability and implication problems become undecidable. A localizable incremental
algorithm is developed to detect errors using NGDs, where the cost is determined by
small neighbors of nodes in the updates instead of the entire graph.
Finally, a rule-based method to clean graphs is proposed. We extend graph entity
dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of
ground truth, we fix violations of GEDs in the graph by combining data repairing and
object identification. The method finds certain fixes to errors detected by GEDs, i.e.,
as long as the GEDs and the ground truth are correct, the fixes are assured correct as
their logical consequences. Several fundamental results underlying the method are established,
and an algorithm is developed to implement the method. We also parallelize
the method and guarantee to reduce its running time with the increase of processors