Scalable supergraph search in large graph databases
© 2016 IEEE. Supergraph search is a fundamental problem in graph databases that is widely applied in many application scenarios. Given a graph database and a query-graph, supergraph search retrieves all data-graphs contained in the query-graph from the graph database. Most existing solutions for supergraph search follow the pruning-and-verification framework, which prunes false answers based on features in the pruning phase and performs subgraph isomorphism tests on the remaining graphs in the verification phase. However, they do not scale to large data-graphs and query-graphs due to three drawbacks. First, they rely on a frequent subgraph mining algorithm to select features, which is expensive and cannot generate large features. Second, they require a costly verification phase. Third, they process features in a fixed order without considering their relationship to the query-graph. In this paper, we address the three drawbacks and propose new indexing and query processing algorithms. In indexing, we select features directly from the data-graphs without expensive frequent subgraph mining. The features form a feature-tree that contains features of all sizes, and both the cost sharing and the pruning power of the features are considered. In query processing, we propose a verification-free algorithm, where the order in which features are processed is query-dependent, considering both the cost sharing and the pruning power. We explore two optimization strategies to further improve the algorithm's efficiency. The first strategy applies a lightweight graph compression technique and the second strategy optimizes the inclusion of answers. Finally, we conduct extensive performance studies on two large real datasets to demonstrate the high scalability of our algorithms.
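The conventional pruning-and-verification framework mentioned in this abstract can be illustrated with a minimal Python sketch. It is not the paper's verification-free algorithm: it keeps the verification step, uses a brute-force containment test that is only practical for tiny graphs, and assumes containment means non-induced, label-preserving subgraph isomorphism. The names `make_graph`, `contains`, `build_index`, and `supergraph_search`, as well as the toy graphs, are hypothetical and chosen only for illustration.

```python
from itertools import permutations

def make_graph(labels, edges):
    """labels: {vertex: label}; edges: undirected (u, v) pairs."""
    return labels, {frozenset(e) for e in edges}

def contains(big, small):
    """Brute force: is `small` subgraph-isomorphic (non-induced) to `big`?"""
    big_labels, big_edges = big
    small_labels, small_edges = small
    small_vs = list(small_labels)
    for image in permutations(big_labels, len(small_vs)):
        mapping = dict(zip(small_vs, image))
        labels_ok = all(small_labels[v] == big_labels[mapping[v]] for v in small_vs)
        edges_ok = all(frozenset(mapping[v] for v in e) in big_edges
                       for e in small_edges)
        if labels_ok and edges_ok:
            return True
    return False

def build_index(database, features):
    """Offline step: for each feature, record which data-graphs contain it."""
    return {fid: {gid for gid, g in database.items() if contains(g, f)}
            for fid, f in features.items()}

def supergraph_search(database, features, index, query):
    """Return (candidates after pruning, verified answers)."""
    candidates = set(database)
    for fid, f in features.items():
        # If the query lacks feature f, no data-graph containing f
        # can itself be contained in the query.
        if not contains(query, f):
            candidates -= index[fid]
    answers = {gid for gid in candidates if contains(query, database[gid])}
    return candidates, answers

# Toy data (assumed for illustration): the query is a labelled triangle.
query = make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2), (0, 2)])
database = {
    "g1": make_graph({0: "A", 1: "B"}, [(0, 1)]),
    "g2": make_graph({0: "A", 1: "D"}, [(0, 1)]),
    "g3": make_graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2)]),
}
features = {"f_D": make_graph({0: "D"}, [])}
index = build_index(database, features)
candidates, answers = supergraph_search(database, features, index, query)
print(sorted(candidates), sorted(answers))
```

In this toy run the single feature (a vertex labelled D) is absent from the query, so `g2` is pruned without a containment test against the query, and only `g1` and `g3` reach verification.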
Optimization of Retrieval Algorithms on Large Scale Knowledge Graphs
Knowledge graphs have been shown to play an important role in recent
knowledge mining and discovery, for example in the field of life sciences or
bioinformatics. Although a lot of research has been done in the field of query
optimization, query transformation and of course in storing and retrieving
large scale knowledge graphs, algorithmic optimization is still a
major challenge and a vital factor in using graph databases. Few researchers
have addressed the problem of optimizing algorithms on large scale labeled
property graphs. Here, we present two optimization approaches and compare them
with a naive approach of directly querying the graph database. The aim of our
work is to determine limiting factors of graph databases like Neo4j, and we
describe a novel solution to tackle these challenges. For this, we suggest a
classification schema to distinguish between the complexities of problems on a
graph database. We evaluate our optimization approaches on a test system containing a
knowledge graph derived from biomedical publication data enriched with text mining
data. This dense graph has more than 71M nodes and 850M relationships. The
results are very encouraging and, depending on the problem, we were able to
show a speedup by a factor of between 44 and 3839.
GraphFind: enhancing graph searching by low support data mining techniques
Background: Biomedical and chemical databases are large and rapidly growing in size. Graphs naturally model such kinds of data. To fully exploit the wealth of information in these graph databases, a key role is played by systems that search for all exact or approximate occurrences of a query graph. To deal efficiently with graph searching, advanced methods for indexing, representation and matching of graphs have been proposed. Results: This paper presents GraphFind. The system implements efficient graph searching algorithms together with advanced filtering techniques that allow approximate search. It allows users to select candidate subgraphs rather than entire graphs. It implements an effective data storage based also on low-support data mining. Conclusions: GraphFind is compared with Frowns, GraphGrep and gIndex. Experiments show that GraphFind outperforms the compared systems on a very large collection of small graphs. The proposed low-support mining technique, which applies to any searching system, also allows a significant index space reduction.
An introduction to Graph Data Management
A graph database is a database where the data structures for the schema
and/or instances are modeled as a (labeled)(directed) graph or generalizations
of it, and where querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give a historical overview of their main development, and study the main current
systems that implement them.
Performance and scalability of indexed subgraph query processing methods
Graph data management systems have become very popular
as graphs are the natural data model for many applications.
One of the main problems addressed by these systems is subgraph
query processing; i.e., given a query graph, return all
graphs that contain the query. The naive method for processing
such queries is to perform a subgraph isomorphism
test against each graph in the dataset. This obviously does
not scale, as subgraph isomorphism is NP-complete. Thus,
many indexing methods have been proposed to reduce the
number of candidate graphs that have to undergo the subgraph
isomorphism test. In this paper, we identify a set of
key factors (parameters) that influence the performance of
related methods: namely, the number of nodes per graph,
the graph density, the number of distinct labels, the number
of graphs in the dataset, and the query graph size. We then
conduct comprehensive and systematic experiments that analyze
the sensitivity of the various methods on the values of
the key parameters. Our aims are twofold: first to derive
conclusions about the algorithms’ relative performance, and,
second, to stress-test all algorithms, deriving insights as to
their scalability, and highlight how both performance and
scalability depend on the above factors. We choose six well-established
indexing methods, namely Grapes, CT-Index,
GraphGrepSX, gIndex, Tree+∆, and gCode, as representative
approaches of the overall design space, including the
most recent and best performing methods. We report on
their index construction time and index size, and on query
processing performance in terms of time and false positive
ratio. We employ both real and synthetic datasets. Specifically,
four real datasets of different characteristics are used:
AIDS, PDBS, PCM, and PPI. In addition, we generate a
large number of synthetic graph datasets, empowering us to
systematically study the algorithms’ performance and scalability
versus the aforementioned key parameters.
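The filter-then-verify idea behind indexed subgraph query processing, and the false positive ratio reported in this abstract, can be sketched as follows. This is a minimal illustration under stated assumptions, not any of the six evaluated methods: the filter is a simple vertex-label count check standing in for a real index, the containment test is a small backtracking matcher for non-induced subgraph isomorphism, and the false positive ratio is assumed to be the fraction of candidates that fail verification. All function names and the toy dataset are hypothetical.

```python
from collections import Counter

def graph(labels, edges):
    """labels: {vertex: label}; edges: undirected (u, v) pairs."""
    adj = {v: set() for v in labels}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return labels, adj

def contains(big, small):
    """Backtracking test: is `small` subgraph-isomorphic (non-induced) to `big`?"""
    b_labels, b_adj = big
    s_labels, s_adj = small
    order = list(s_labels)

    def extend(mapping):
        if len(mapping) == len(order):
            return True
        v = order[len(mapping)]
        for w in b_labels:
            if w in mapping.values() or b_labels[w] != s_labels[v]:
                continue
            # Every already-mapped neighbour of v must map to a neighbour of w.
            if all(mapping[u] in b_adj[w] for u in s_adj[v] if u in mapping):
                mapping[v] = w
                if extend(mapping):
                    return True
                del mapping[v]
        return False

    return extend({})

def label_filter(dataset, query):
    """Keep graphs with at least as many vertices of each label as the query."""
    need = Counter(query[0].values())
    return {gid for gid, g in dataset.items()
            if not need - Counter(g[0].values())}

def subgraph_query(dataset, query):
    """Return (answers, candidates, false positive ratio of the filter)."""
    candidates = label_filter(dataset, query)
    answers = {gid for gid in candidates if contains(dataset[gid], query)}
    fp_ratio = ((len(candidates) - len(answers)) / len(candidates)
                if candidates else 0.0)
    return answers, candidates, fp_ratio

# Toy data (assumed for illustration): the query is a single A-B edge.
query = graph({0: "A", 1: "B"}, [(0, 1)])
dataset = {
    "g1": graph({0: "A", 1: "B", 2: "C"}, [(0, 1), (1, 2), (0, 2)]),
    "g2": graph({0: "A", 1: "C", 2: "B"}, [(0, 1), (1, 2)]),
    "g3": graph({0: "A", 1: "A"}, [(0, 1)]),
}
answers, candidates, fp_ratio = subgraph_query(dataset, query)
print(sorted(answers), sorted(candidates), fp_ratio)
```

Here `g3` is discarded by the filter alone, while `g2` has the right labels but no A-B edge, so it survives filtering and is rejected only by the costly verification step; it is the one false positive among the two candidates.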
A Distributed Path Query Engine for Temporal Property Graphs
Property graphs are a common form of linked data, with path queries used to
traverse and explore them for enterprise transactions and mining. Temporal
property graphs are a recent variant where time is a first-class entity to be
queried over, and their properties and structure vary over time. These are seen
in social, telecom, transit and epidemic networks. However, current graph
databases and query engines have limited support for temporal relations among
graph entities, no support for time-varying entities and/or do not scale on
distributed resources. We address this gap by extending a linear path query
model over property graphs to include intuitive temporal predicates and
aggregation operators over temporal graphs. We design a distributed execution
model for these temporal path queries using the interval-centric computing
model, and develop a novel cost model to select an efficient execution plan
from several. We perform detailed experiments of our Granite distributed query
engine using both static and dynamic temporal property graphs as large as 52M
vertices, 218M edges and 325M properties, and a 1600-query workload, derived
from the LDBC benchmark. We often offer sub-second query latencies on a
commodity cluster, which is 149x-1140x faster compared to industry-leading
Neo4J shared-memory graph database and the JanusGraph / Spark distributed graph
query engine. Granite also completes 100% of the queries for all graphs,
compared to only 32-92% workload completion by the baseline systems. Further,
our cost model selects a query plan that is within 10% of the optimal execution
time in 90% of the cases. Despite the irregular nature of graph processing, we
exhibit a weak-scaling efficiency >= 60% on 8 nodes and >= 40% on 16 nodes, for
most query workloads. Comment: An extended version of the paper that appears in IEEE/ACM
International Symposium on Cluster, Cloud and Internet Computing (CCGrid),
202