gMark: Schema-Driven Generation of Graphs and Queries
Massive graph data sets are pervasive in contemporary application domains.
Hence, graph database systems are becoming increasingly important. In the
experimental study of these systems, it is vital that the research community
has shared solutions for the generation of database instances and query
workloads having predictable and controllable properties. In this paper, we
present the design and engineering principles of gMark, a domain- and query
language-independent graph instance and query workload generator. A core
contribution of gMark is its ability to target and control the diversity of
properties of both the generated instances and the generated workloads coupled
to these instances. Further novelties include support for regular path queries,
a fundamental graph query paradigm, and schema-driven selectivity estimation of
queries, a key feature in controlling workload chokepoints. We illustrate the
flexibility and practical usability of gMark by showcasing the framework's
capabilities in generating high quality graphs and workloads, and its ability
to encode user-defined schemas across a variety of application domains.
Comment: Accepted in November 2016. URL:
http://ieeexplore.ieee.org/document/7762945/. in IEEE Transactions on
Knowledge and Data Engineering 201
Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1
We propose an efficient and scalable architecture for processing generalized
graph-pattern queries as they are specified by the current W3C recommendation
of the SPARQL 1.1 "Query Language" component. Specifically, the class of
queries we consider consists of sets of SPARQL triple patterns with labeled
property paths. From a relational perspective, this class resolves to
conjunctive queries of relational joins with additional graph-reachability
predicates. For the scalable, i.e., distributed, processing of this kind of
query over very large RDF collections, we develop a suitable partitioning and
indexing scheme, which allows us to shard the RDF triples over an entire
cluster of compute nodes and to process an incoming SPARQL query over all of
the relevant graph partitions (and thus compute nodes) in parallel. Unlike most
prior works in this field, we specifically aim at the unified optimization and
distributed processing of queries consisting of both relational joins and
graph-reachability predicates. All communication among the compute nodes is
established via a proprietary, asynchronous communication protocol based on the
Message Passing Interface
Sublinear-Time Algorithms for Monomer-Dimer Systems on Bounded Degree Graphs
For a graph $G$, let $Z(G,\lambda)$ be the partition function of the
monomer-dimer system defined by $\sum_k m_k(G)\lambda^k$, where $m_k(G)$ is the
number of matchings of size $k$ in $G$. We consider graphs of bounded degree
and develop a sublinear-time algorithm for estimating $\log Z(G,\lambda)$ at an
arbitrary value $\lambda>0$ within additive error $\epsilon n$ with high
probability. The query complexity of our algorithm does not depend on the size
of $G$ and is polynomial in $1/\epsilon$, and we also provide a lower bound
quadratic in $1/\epsilon$ for this problem. This is the first analysis of a
sublinear-time approximation algorithm for a #P-complete problem. Our
approach is based on the correlation decay of the Gibbs distribution associated
with $Z(G,\lambda)$. We show that our algorithm approximates the probability
for a vertex to be covered by a matching, sampled according to this Gibbs
distribution, in a near-optimal sublinear time. We extend our results to
approximate the average size and the entropy of such a matching within an
additive error with high probability, where again the query complexity is
polynomial in $1/\epsilon$ and the lower bound is quadratic in $1/\epsilon$.
Our algorithms are simple to implement and of practical use when dealing with
massive datasets. Our results extend to other systems where the correlation
decay is known to hold, as for the independent set problem up to the critical
activity
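To make the quantity in the abstract above concrete, here is a minimal brute-force sketch. It assumes the standard monomer-dimer partition function Z(G, λ) = Σ_k m_k(G) λ^k, where m_k(G) counts the matchings with k edges; the function names and the 4-cycle example are illustrative assumptions. This exhaustive enumeration is exponential in the number of edges and is not the sublinear-time estimator the paper describes; it only shows what that estimator approximates.

```python
from itertools import combinations


def matching_counts(edges):
    """Return a list m where m[k] is the number of matchings with k edges.

    Brute force: every subset of edges is tested, so this is only usable
    on very small graphs.
    """
    counts = [0] * (len(edges) + 1)
    for k in range(len(edges) + 1):
        for subset in combinations(edges, k):
            endpoints = [v for edge in subset for v in edge]
            # A matching uses every vertex at most once.
            if len(endpoints) == len(set(endpoints)):
                counts[k] += 1
    # Drop trailing zeros (sizes for which no matching exists).
    while len(counts) > 1 and counts[-1] == 0:
        counts.pop()
    return counts


def partition_function(edges, lam):
    """Evaluate Z(G, lam) = sum_k m_k(G) * lam**k (assumed definition)."""
    return sum(m_k * lam ** k for k, m_k in enumerate(matching_counts(edges)))


# Hypothetical example graph: a cycle on four vertices.
cycle4 = [(0, 1), (1, 2), (2, 3), (3, 0)]
print(matching_counts(cycle4))        # [1, 4, 2]
print(partition_function(cycle4, 2))  # 1 + 4*2 + 2*4 = 17
```

The 4-cycle has one empty matching, four single-edge matchings, and two perfect matchings, which gives the coefficients printed above.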
Answering Regular Path Queries on Workflow Provenance
This paper proposes a novel approach for efficiently evaluating regular path
queries over provenance graphs of workflows that may include recursion. The
approach assumes that an execution g of a workflow G is labeled with
query-agnostic reachability labels using an existing technique. At query time,
given g, G and a regular path query R, the approach decomposes R into a set of
subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is
rewritten so that, using the reachability labels of nodes in g, whether or not
there is a path which matches Ri between two nodes can be decided in constant
time. The results of each safe subquery are then composed, possibly with some
small unsafe remainder, to produce an answer to R. The approach results in an
algorithm that significantly reduces the number of subqueries k over existing
techniques by increasing their size and complexity, and that evaluates each
subquery in time bounded by its input and output size. Experimental results
demonstrate the benefit of this approach
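For readers unfamiliar with regular path queries, the following sketch shows a textbook baseline: a breadth-first search over the product of the graph and a finite automaton for the query. It is not the labeling-based technique of the paper above, and the graph, the automaton encoding, and the name `rpq_pairs` are illustrative assumptions.

```python
from collections import deque


def rpq_pairs(edges, nfa_start, nfa_accept, nfa_delta):
    """Return all node pairs (s, t) joined by a path whose labels the NFA accepts.

    edges: list of (u, label, v) triples.
    nfa_delta: dict mapping (state, label) to a set of successor states.
    """
    adjacency = {}
    nodes = set()
    for u, label, v in edges:
        adjacency.setdefault(u, []).append((label, v))
        nodes.update((u, v))

    answers = set()
    for source in nodes:
        start = (source, nfa_start)
        seen = {start}
        queue = deque([start])
        while queue:
            node, state = queue.popleft()
            if state in nfa_accept:
                answers.add((source, node))
            for label, nxt in adjacency.get(node, []):
                for nxt_state in nfa_delta.get((state, label), ()):
                    pair = (nxt, nxt_state)
                    if pair not in seen:
                        seen.add(pair)
                        queue.append(pair)
    return answers


# Hypothetical provenance-like graph with a loop between b and c.
graph = [
    ("a", "read", "b"),
    ("b", "step", "c"),
    ("c", "step", "b"),
    ("c", "write", "d"),
    ("b", "write", "e"),
]
# NFA for the query  read step* write  (states 0, 1, 2; state 2 accepts).
delta = {(0, "read"): {1}, (1, "step"): {1}, (1, "write"): {2}}
print(sorted(rpq_pairs(graph, 0, {2}, delta)))  # [('a', 'd'), ('a', 'e')]
```

Each source node triggers its own traversal here, which is the cost that precomputed reachability labels aim to avoid.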
Context-Free Path Querying by Matrix Multiplication
Graph data models are widely used in many areas, for example, bioinformatics
and graph databases. In these areas, it is often necessary to process queries over
large graphs. Some of the most common graph queries are navigational queries.
The result of query evaluation is a set of implicit relations between nodes of
the graph, i.e. paths in the graph. A natural way to specify these relations is
by specifying paths using formal grammars over the alphabet of edge labels. An
answer to a context-free path query in this approach is usually a set of
triples (A, m, n) such that there is a path from the node m to the node n,
whose labeling is derived from a non-terminal A of the given context-free
grammar. This type of query is evaluated using the relational query
semantics. Another example of path query semantics is the single-path query
semantics which requires presenting a single path from the node m to the node
n, whose labeling is derived from a non-terminal A for all triples (A, m, n)
evaluated using the relational query semantics. There are a number of algorithms
for query evaluation which use these semantics, but all of them perform poorly
on large graphs. One of the most common techniques for efficient big data
processing is the use of a graphics processing unit (GPU) to perform
computations, but these algorithms do not allow this technique to be used
efficiently. In this paper, we show how the context-free path query evaluation
using these query semantics can be reduced to the calculation of the matrix
transitive closure. Also, we propose an algorithm for context-free path query
evaluation which uses relational query semantics and is based on matrix
operations that make it possible to speed up computations by using a GPU.
Comment: 9 pages, 11 figures, 2 table
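The reduction to a matrix transitive closure described above can be sketched as follows, under stated assumptions: the grammar is given in Chomsky normal form, each matrix cell holds a set of non-terminals, and the closure is computed by a plain CPU fixpoint loop rather than GPU matrix kernels. The function name, the rule encoding, and the example graph are hypothetical, and the loop updates the matrix in place instead of squaring it step by step, which reaches the same fixpoint.

```python
def cfpq_relational(num_nodes, edges, terminal_rules, binary_rules):
    """Return the set of triples (A, m, n) under relational query semantics.

    edges: list of (m, label, n) with nodes numbered 0..num_nodes-1.
    terminal_rules: dict label -> set of non-terminals A with a rule A -> label.
    binary_rules: list of (A, (B, C)) for rules A -> B C.
    """
    # T[i][j] is the set of non-terminals deriving some path label from i to j.
    T = [[set() for _ in range(num_nodes)] for _ in range(num_nodes)]
    for m, label, n in edges:
        T[m][n] |= terminal_rules.get(label, set())

    changed = True
    while changed:  # iterate until no cell gains a non-terminal
        changed = False
        for i in range(num_nodes):
            for k in range(num_nodes):
                if not T[i][k]:
                    continue
                for j in range(num_nodes):
                    if not T[k][j]:
                        continue
                    for head, (left, right) in binary_rules:
                        if (left in T[i][k] and right in T[k][j]
                                and head not in T[i][j]):
                            T[i][j].add(head)
                            changed = True
    return {(a, i, j)
            for i in range(num_nodes)
            for j in range(num_nodes)
            for a in T[i][j]}


# Grammar for a^n b^n (n >= 1) in Chomsky normal form:
#   S -> A B | A S1,  S1 -> S B,  A -> a,  B -> b
terminal_rules = {"a": {"A"}, "b": {"B"}}
binary_rules = [("S", ("A", "B")), ("S", ("A", "S1")), ("S1", ("S", "B"))]
# Hypothetical chain graph: 0 -a-> 1 -a-> 2 -b-> 3 -b-> 4
chain = [(0, "a", 1), (1, "a", 2), (2, "b", 3), (3, "b", 4)]

result = cfpq_relational(5, chain, terminal_rules, binary_rules)
print(sorted((m, n) for a, m, n in result if a == "S"))  # [(0, 4), (1, 3)]
```

The innermost step is the set-valued analogue of one matrix multiplication entry, which is why the computation maps naturally onto matrix operations.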
A Selectivity based approach to Continuous Pattern Detection in Streaming Graphs
Cyber security is one of the most significant technical challenges in current
times. Detecting adversarial activities and preventing theft of intellectual
property and customer data are high priorities for corporations and government
agencies around the world. Cyber defenders need to analyze massive-scale,
high-resolution network flows to identify, categorize, and mitigate attacks
involving networks spanning institutional and national boundaries. Many of the
cyber attacks can be described as subgraph patterns, with prominent examples
being insider infiltrations (path queries), denial of service (parallel paths)
and malicious spreads (tree queries). This motivates us to explore subgraph
matching on streaming graphs in a continuous setting. The novelty of our work
lies in using the subgraph distributional statistics collected from the
streaming graph to determine the query processing strategy. We introduce a
"Lazy Search" algorithm where the search strategy is decided on a
vertex-to-vertex basis depending on the likelihood of a match in the vertex
neighborhood. We also propose a metric named "Relative Selectivity" that is
used to select between different query processing strategies. Our experiments
performed on real online news, network traffic stream and a synthetic social
network benchmark demonstrate 10-100x speedups over selectivity-agnostic
approaches.
Comment: in 18th International Conference on Extending Database Technology
(EDBT) (2015
An introduction to Graph Data Management
A graph database is a database where the data structures for the schema
and/or instances are modeled as a (labeled) (directed) graph or generalizations
of it, and where querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give a historical overview of their main developments, and study the main current
systems that implement them
Efficient Subgraph Matching on Billion Node Graphs
The ability to handle large scale graph data is crucial to an increasing
number of applications. Much work has been dedicated to supporting basic graph
operations such as subgraph matching, reachability, regular expression
matching, etc. In many cases, graph indices are employed to speed up query
processing. Typically, most indices require either super-linear indexing time
or super-linear indexing space. Unfortunately, for very large graphs,
super-linear approaches are almost always infeasible. In this paper, we study
the problem of subgraph matching on billion-node graphs. We present a novel
algorithm that supports efficient subgraph matching for graphs deployed on a
distributed memory store. Instead of relying on super-linear indices, we use
efficient graph exploration and massive parallel computing for query
processing. Our experimental results demonstrate the feasibility of performing
subgraph matching on web-scale graph data.
Comment: VLDB201
Enumerating Subgraph Instances Using Map-Reduce
The theme of this paper is how to find all instances of a given "sample"
graph in a larger "data graph," using a single round of map-reduce. For the
simplest sample graph, the triangle, we improve upon the best known such
algorithm. We then examine the general case, considering both the communication
cost between mappers and reducers and the total computation cost at the
reducers. To minimize communication cost, we exploit the techniques of (Afrati
and Ullman, TKDE 2011) for computing multiway joins (evaluating conjunctive
queries) in a single map-reduce round. Several methods are shown for
translating sample graphs into a union of conjunctive queries with as few
queries as possible. We also address the matter of optimizing computation cost.
Many serial algorithms are shown to be "convertible," in the sense that it is
possible to partition the data graph, explore each partition in a separate
reducer, and have the total computation cost at the reducers be of the same
order as the computation cost of the serial algorithm.
Comment: 37 page
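As an illustration of finding triangles in one map-reduce round, the sketch below simulates a simple bucket-partitioning scheme in plain Python. It is an assumption-laden toy, not necessarily the algorithm of the paper above: nodes are integers hashed by `v % num_buckets`, each reducer is keyed by a sorted triple of buckets, and the name `triangles_one_round` is hypothetical.

```python
from collections import defaultdict


def triangles_one_round(edges, num_buckets):
    """Simulate one map-reduce round that lists every triangle exactly once."""
    def bucket(v):
        return v % num_buckets  # assumes integer node ids

    # Map phase: send each edge to every reducer whose bucket triple
    # contains the buckets of both endpoints.
    reducers = defaultdict(set)
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        edge = (min(u, v), max(u, v))
        for third in range(num_buckets):
            key = tuple(sorted((bucket(u), bucket(v), third)))
            reducers[key].add(edge)

    # Reduce phase: each reducer searches only its local edges.
    found = []
    for key, local_edges in reducers.items():
        higher = defaultdict(set)  # node -> larger neighbours
        for u, v in local_edges:
            higher[u].add(v)
        for u, v in local_edges:
            for w in higher[u] & higher[v]:
                # Emit only at the reducer that owns this bucket triple,
                # so a triangle seen by several reducers is reported once.
                if tuple(sorted((bucket(u), bucket(v), bucket(w)))) == key:
                    found.append((u, v, w))
    return sorted(found)


# Hypothetical data graph: a 4-clique on nodes 0..3 plus a pendant edge.
data_graph = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
print(triangles_one_round(data_graph, 2))
# [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
```

Each edge is replicated to `num_buckets` reducers at most, so raising the bucket count trades more communication for smaller reducers, which is the kind of trade-off the abstract discusses.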