Towards effective analysis of big graphs: from scalability to quality
This thesis investigates the central issues underlying graph analysis, namely, scalability
and quality.
We first study incremental problems for graph queries, which aim to compute
the changes to the old query answer in response to updates to the input graph.
The incremental problem is called bounded if its cost is determined only by
the sizes of the query and the changes. However desirable, our first results are
negative: for common graph queries such as graph traversal, connectivity, keyword
search and pattern matching, their incremental problems are unbounded. In light of
the negative results, we propose two new characterizations for the effectiveness of
incremental computation, and show that the incremental computations above can still
be effectively conducted, by either reducing the computations on big graphs to small
data, or incrementalizing batch algorithms by minimizing unnecessary recomputation.
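The idea of confining incremental computation to the affected area can be illustrated with a minimal sketch (an illustrative example, not one of the thesis's algorithms): maintaining the set of nodes reachable from a source under edge insertions, where the update cost depends on the newly reachable region rather than the whole graph.

```python
from collections import deque

def bfs_reachable(adj, src):
    """Batch recomputation: all nodes reachable from src."""
    seen = {src}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen

def insert_edge(adj, reach, u, v):
    """Incrementally maintain the reachable set after inserting edge (u, v).

    Only the newly reachable region is traversed, so the cost depends on
    the size of the change, not on the size of the whole graph."""
    adj.setdefault(u, set()).add(v)
    if u in reach and v not in reach:
        reach.add(v)
        frontier = deque([v])
        while frontier:
            x = frontier.popleft()
            for y in adj.get(x, ()):
                if y not in reach:
                    reach.add(y)
                    frontier.append(y)
    return reach
```

An insertion that touches no reachable node costs nothing, while the batch algorithm would still scan the graph.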
We next study problems related to improving the quality of graphs.
To uniquely identify entities represented by vertices in a graph, we propose a class of
keys that are recursively defined in terms of graph patterns, and are interpreted with
subgraph isomorphism. As an application, we study the entity matching problem,
which is to find all pairs of entities in a graph that are identified by a given set of
keys. Although the problem is proved intractable and cannot be parallelized in
logarithmic rounds, we provide two parallel scalable algorithms for it.
In addition, to catch numeric inconsistencies in real-life graphs, we extend graph
functional dependencies with linear arithmetic expressions and comparison predicates,
referred to as NGDs. Indeed, NGDs strike a balance between expressivity and complexity,
since if we allow non-linear arithmetic expressions, even of degree at most 2, the
satisfiability and implication problems become undecidable. A localizable incremental
algorithm is developed to detect errors using NGDs, where the cost is determined by
small neighborhoods of the nodes in the updates rather than by the entire graph.
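As a toy illustration of the kind of numeric constraint such dependencies can express (the rule, edge label, and attribute names below are invented for the example, not taken from the thesis): on a product graph, require that the price of a part never exceeds the price of the whole it belongs to.

```python
def ngd_violations(edges, price):
    """Check a toy numeric dependency with a comparison predicate:
    for every edge (x, 'part_of', y), require price[x] <= price[y].
    Returns the list of violating (x, y) pairs."""
    return [(x, y) for (x, lbl, y) in edges
            if lbl == "part_of" and price[x] > price[y]]
```

A localizable detection algorithm would evaluate such a rule only on the small neighborhoods around updated nodes.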
Finally, a rule-based method to clean graphs is proposed. We extend graph entity
dependencies (GEDs) as data quality rules. Given a graph, a set of GEDs and a block of
ground truth, we fix violations of GEDs in the graph by combining data repairing and
object identification. The method finds certain fixes to errors detected by GEDs, i.e.,
as long as the GEDs and the ground truth are correct, the fixes are guaranteed
correct as their logical consequences. Several fundamental results underlying the method are established,
and an algorithm is developed to implement the method. We also parallelize the
method, with a guarantee that its running time decreases as more processors are used.
DDSL: Efficient Subgraph Listing on Distributed and Dynamic Graphs
Subgraph listing is a fundamental problem in graph theory and has wide
applications in areas like sociology, chemistry, and social networks. Modern
graphs can usually be large-scale as well as highly dynamic, which challenges
the efficiency of existing subgraph listing algorithms. Recent works have shown
the benefits of partitioning and processing big graphs in a distributed system;
however, little work targets subgraph listing on dynamic graphs in a
distributed environment. In this paper, we propose an efficient approach,
called Distributed and Dynamic Subgraph Listing (DDSL), which can incrementally
update the results instead of running from scratch. DDSL follows a general
distributed join framework. In this framework, we use a Neighbor-Preserved
storage for data graphs, which takes bounded extra space and supports dynamic
updating. After that, we propose a comprehensive cost model to estimate the I/O
cost of listing subgraphs. Then based on this cost model, we develop an
algorithm to find the optimal join tree for a given pattern. To handle dynamic
graphs, we propose an efficient left-deep join algorithm to incrementally
update the join results. Extensive experiments are conducted on real-world
datasets. The results show that DDSL outperforms existing methods in dealing
with both static and dynamic graphs in terms of response time.
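A minimal, single-machine sketch of the join-based listing that such frameworks build on (the distributed partitioning, cost model, and incremental maintenance of DDSL are not shown): partial matches are extended one pattern edge at a time, i.e., a left-deep join over the data graph's edge table.

```python
def list_subgraphs(data_edges, pattern_edges):
    """Enumerate injective matches of a directed pattern by a left-deep
    join: at each step, the set of partial bindings is joined with the
    data edge table on one more pattern edge."""
    bindings = [{}]                      # start with the empty binding
    for (a, b) in pattern_edges:
        extended = []
        for m in bindings:
            for (u, v) in data_edges:
                # the data edge must agree with already-bound variables
                if m.get(a, u) != u or m.get(b, v) != v:
                    continue
                nm = dict(m)
                nm[a], nm[b] = u, v
                # keep the mapping injective (subgraph isomorphism)
                if len(set(nm.values())) == len(nm):
                    extended.append(nm)
        bindings = extended
    return bindings
```

The order in which pattern edges are joined is exactly what a cost model and join-tree optimizer would choose; here it is simply the given order.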
A Large-scale Distributed Video Parsing and Evaluation Platform
Visual surveillance systems have become one of the largest data sources of
Big Visual Data in the real world. However, existing systems for video analysis
still struggle with scalability, extensibility, and error-proneness, though
great advances have been achieved in a number of visual recognition tasks and
surveillance applications, e.g., pedestrian/vehicle detection and
people/vehicle counting. Moreover, few algorithms explore the
specific values/characteristics in large-scale surveillance videos. To address
these problems in large-scale video analysis, we develop a scalable video
parsing and evaluation platform through combining some advanced techniques for
Big Data processing, including Spark Streaming, Kafka and Hadoop Distributed
Filesystem (HDFS). A Web User Interface is also designed in the system to
collect users' degrees of satisfaction with the recognition tasks, so as to
evaluate the performance of the whole system. Furthermore, the highly
extensible platform running on the long-term surveillance videos makes it
possible to develop more intelligent incremental algorithms to enhance the
performance of various visual recognition tasks.
Comment: Accepted by Chinese Conference on Intelligent Visual Surveillance 201
Investigative Simulation: Towards Utilizing Graph Pattern Matching for Investigative Search
This paper proposes the use of graph pattern matching for investigative graph
search, which is the process of searching for and prioritizing persons of
interest who may exhibit part or all of a pattern of suspicious behaviors or
connections. While there are a variety of applications, our principal
motivation is to aid law enforcement in the detection of homegrown violent
extremists. We introduce investigative simulation, which consists of several
necessary extensions to the existing dual simulation graph pattern matching
scheme in order to make it appropriate for intelligence analysts and law
enforcement officials. Specifically, we impose a categorical label structure on
nodes consistent with the nature of indicators in investigations, as well as
prune or complete search results to ensure sensibility and usefulness of
partial matches to analysts. Lastly, we introduce a natural top-k ranking
scheme that can help analysts prioritize investigative efforts. We demonstrate
performance of investigative simulation on a real-world large dataset.
Comment: 8 pages, 6 figures. Paper to appear in the Fosint-SI 2016 conference
proceedings in conjunction with the 2016 IEEE/ACM International Conference on
Advances in Social Networks Analysis and Mining ASONAM 201
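Dual simulation itself, before the paper's investigative extensions, can be sketched as a fixpoint refinement: each pattern node keeps only the graph nodes that match its label and have the required successors and predecessors. The labels and node names below are illustrative.

```python
def dual_simulation(p_edges, p_labels, g_edges, g_labels):
    """Dual simulation: map each pattern node to the set of graph nodes
    that carry its label and respect every outgoing and incoming
    pattern edge, refined to a fixpoint."""
    g_succ, g_pred = {}, {}
    for (u, v) in g_edges:
        g_succ.setdefault(u, set()).add(v)
        g_pred.setdefault(v, set()).add(u)
    # initial candidates: graph nodes with a matching label
    sim = {u: {v for v, l in g_labels.items() if l == lab}
           for u, lab in p_labels.items()}
    changed = True
    while changed:
        changed = False
        for (u, w) in p_edges:
            # u's candidates need a successor among w's candidates
            keep_u = {v for v in sim[u] if g_succ.get(v, set()) & sim[w]}
            # w's candidates need a predecessor among u's candidates
            keep_w = {x for x in sim[w] if g_pred.get(x, set()) & sim[u]}
            if keep_u != sim[u]:
                sim[u], changed = keep_u, True
            if keep_w != sim[w]:
                sim[w], changed = keep_w, True
    return sim
```

Unlike subgraph isomorphism, this runs in polynomial time, which is what makes simulation-style matching attractive for large investigative graphs.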
Any-k: Anytime Top-k Tree Pattern Retrieval in Labeled Graphs
Many problems in areas as diverse as recommendation systems, social network
analysis, semantic search, and distributed root cause analysis can be modeled
as pattern search on labeled graphs (also called "heterogeneous information
networks" or HINs). Given a large graph and a query pattern with node and edge
label constraints, a fundamental challenge is to find the top-k matches
according to a ranking function over edge and node weights. For users, it is
difficult to select the value k. We therefore propose the novel notion of an
any-k ranking algorithm: for a given time budget, return as many of the
top-ranked results as possible; then, given additional time, produce the next
lower-ranked results quickly as well. It can be stopped anytime, but may have
to continue until all results are returned. This paper focuses on acyclic
patterns over
arbitrary labeled graphs. We are interested in practical algorithms that
effectively exploit (1) properties of heterogeneous networks, in particular
selective constraints on labels, and (2) the fact that users often explore only a
fraction of the top-ranked results. Our solution, KARPET, carefully integrates
aggressive pruning that leverages the acyclic nature of the query, and
incremental guided search. It enables us to prove strong non-trivial time and
space guarantees, which is generally considered very hard for this type of
graph search problem. Through experimental studies we show that KARPET achieves
running times in the order of milliseconds for tree patterns on large networks
with millions of nodes and edges.
Comment: To appear in WWW 201
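The any-k contract — stop anytime, resume later for the next-best result — can be sketched with a best-first enumeration of weighted paths (a drastic simplification of KARPET, which handles general tree patterns with pruning; the graph here is invented for the example):

```python
import heapq

def ranked_paths(succ, weight, start, goal):
    """Yield simple start-to-goal paths in increasing total weight.
    The generator can be stopped at any time; consuming it further
    produces the next lower-ranked result, until all are returned."""
    heap = [(0, [start])]
    while heap:
        cost, path = heapq.heappop(heap)
        node = path[-1]
        if node == goal:
            # popped costs are non-decreasing, so results come out ranked
            yield cost, path
            continue
        for nxt in succ.get(node, ()):
            if nxt not in path:  # keep paths simple
                heapq.heappush(heap, (cost + weight[node, nxt], path + [nxt]))
```

With non-negative weights, the priority queue guarantees results are emitted in ranked order, so truncating the consumption at any point yields a correct top-k prefix.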