3,554 research outputs found
Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1
We propose an efficient and scalable architecture for processing generalized
graph-pattern queries as they are specified by the current W3C recommendation
of the SPARQL 1.1 "Query Language" component. Specifically, the class of
queries we consider consists of sets of SPARQL triple patterns with labeled
property paths. From a relational perspective, this class resolves to
conjunctive queries of relational joins with additional graph-reachability
predicates. For the scalable, i.e., distributed, processing of this kind of
queries over very large RDF collections, we develop a suitable partitioning and
indexing scheme, which allows us to shard the RDF triples over an entire
cluster of compute nodes and to process an incoming SPARQL query over all of
the relevant graph partitions (and thus compute nodes) in parallel. Unlike most
prior works in this field, we specifically aim at the unified optimization and
distributed processing of queries consisting of both relational joins and
graph-reachability predicates. All communication among the compute nodes is
established via a proprietary, asynchronous communication protocol based on the
Message Passing Interface
Answering Regular Path Queries on Workflow Provenance
This paper proposes a novel approach for efficiently evaluating regular path
queries over provenance graphs of workflows that may include recursion. The
approach assumes that an execution g of a workflow G is labeled with
query-agnostic reachability labels using an existing technique. At query time,
given g, G and a regular path query R, the approach decomposes R into a set of
subqueries R1, ..., Rk that are safe for G. For each safe subquery Ri, G is
rewritten so that, using the reachability labels of nodes in g, whether or not
there is a path which matches Ri between two nodes can be decided in constant
time. The results of each safe subquery are then composed, possibly with some
small unsafe remainder, to produce an answer to R. The approach results in an
algorithm that significantly reduces the number of subqueries k over existing
techniques by increasing their size and complexity, and that evaluates each
subquery in time bounded by its input and output size. Experimental results
demonstrate the benefit of this approach
Reverse k Nearest Neighbor Search over Trajectories
GPS enables mobile devices to continuously provide new opportunities to
improve our daily lives. For example, the data collected in applications
created by Uber or Public Transport Authorities can be used to plan
transportation routes, estimate capacities, and proactively identify low
coverage areas. In this paper, we study a new kind of query-Reverse k Nearest
Neighbor Search over Trajectories (RkNNT), which can be used for route planning
and capacity estimation. Given a set of existing routes DR, a set of passenger
transitions DT, and a query route Q, a RkNNT query returns all transitions that
take Q as one of its k nearest travel routes. To solve the problem, we first
develop an index to handle dynamic trajectory updates, so that the most
up-to-date transition data are available for answering a RkNNT query. Then we
introduce a filter refinement framework for processing RkNNT queries using the
proposed indexes. Next, we show how to use RkNNT to solve the optimal route
planning problem MaxRkNNT (MinRkNNT), which is to search for the optimal route
from a start location to an end location that could attract the maximum (or
minimum) number of passengers based on a pre-defined travel distance threshold.
Experiments on real datasets demonstrate the efficiency and scalability of our
approaches. To the best of our best knowledge, this is the first work to study
the RkNNT problem for route planning.Comment: 12 page
Efficient Race Detection with Futures
This paper addresses the problem of provably efficient and practically good
on-the-fly determinacy race detection in task parallel programs that use
futures. Prior works determinacy race detection have mostly focused on either
task parallel programs that follow a series-parallel dependence structure or
ones with unrestricted use of futures that generate arbitrary dependences. In
this work, we consider a restricted use of futures and show that it can be race
detected more efficiently than general use of futures.
Specifically, we present two algorithms: MultiBags and MultiBags+. MultiBags
targets programs that use futures in a restricted fashion and runs in time
, where is the sequential running time of the
program, is the inverse Ackermann's function, is the total number
of memory accesses, is the dynamic count of places at which parallelism is
created. Since is a very slowly growing function (upper bounded by
for all practical purposes), it can be treated as a close-to-constant overhead.
MultiBags+ an extension of MultiBags that target programs with general use of
futures. It runs in time where , ,
and are defined as before, and is the number of future operations in
the computation. We implemented both algorithms and empirically demonstrate
their efficiency
IO-Top-k: index-access optimized top-k query processing
Top-k query processing is an important building block for ranked retrieval, with applications ranging from text and data integration to distributed aggregation of network logs and sensor data. Top-k queries operate on index lists for a query's elementary conditions and aggregate scores for result candidates. One of the best implementation methods in this setting is the family of threshold algorithms, which aim to terminate the index scans as early as possible based on lower and upper bounds for the final scores of result candidates. This procedure performs sequential disk accesses for sorted index scans, but also has the option of performing random accesses to resolve score uncertainty. This entails scheduling for the two kinds of accesses: 1) the prioritization of different index lists in the sequential accesses, and 2) the decision on when to perform random accesses and for which candidates. The prior literature has studied some of these scheduling issues, but only for each of the two access types in isolation. The current paper takes an integrated view of the scheduling issues and develops novel strategies that outperform prior proposals by a large margin. Our main contributions are new, principled, scheduling methods based on a Knapsack-related optimization for sequential accesses and a cost model for random accesses. The methods can be further boosted by harnessing probabilistic estimators for scores, selectivities, and index list correlations. We also discuss efficient implementation techniques for the underlying data structures. In performance experiments with three different datasets (TREC Terabyte, HTTP server logs, and IMDB), our methods achieved significant performance gains compared to the best previously known methods: a factor of up to 3 in terms of execution costs, and a factor of 5 in terms of absolute run-times of our implementation. Our best techniques are close to a lower bound for the execution cost of the considered class of threshold algorithms
A Density-Based Approach to the Retrieval of Top-K Spatial Textual Clusters
Keyword-based web queries with local intent retrieve web content that is
relevant to supplied keywords and that represent points of interest that are
near the query location. Two broad categories of such queries exist. The first
encompasses queries that retrieve single spatial web objects that each satisfy
the query arguments. Most proposals belong to this category. The second
category, to which this paper's proposal belongs, encompasses queries that
support exploratory user behavior and retrieve sets of objects that represent
regions of space that may be of interest to the user. Specifically, the paper
proposes a new type of query, namely the top-k spatial textual clusters (k-STC)
query that returns the top-k clusters that (i) are located the closest to a
given query location, (ii) contain the most relevant objects with regard to
given query keywords, and (iii) have an object density that exceeds a given
threshold. To compute this query, we propose a basic algorithm that relies on
on-line density-based clustering and exploits an early stop condition. To
improve the response time, we design an advanced approach that includes three
techniques: (i) an object skipping rule, (ii) spatially gridded posting lists,
and (iii) a fast range query algorithm. An empirical study on real data
demonstrates that the paper's proposals offer scalability and are capable of
excellent performance
Qualitative Multi-Objective Reachability for Ordered Branching MDPs
We study qualitative multi-objective reachability problems for Ordered
Branching Markov Decision Processes (OBMDPs), or equivalently context-free
MDPs, building on prior results for single-target reachability on Branching
Markov Decision Processes (BMDPs).
We provide two separate algorithms for "almost-sure" and "limit-sure"
multi-target reachability for OBMDPs. Specifically, given an OBMDP,
, given a starting non-terminal, and given a set of target
non-terminals of size , our first algorithm decides whether the
supremum probability, of generating a tree that contains every target
non-terminal in set , is . Our second algorithm decides whether there is
a strategy for the player to almost-surely (with probability ) generate a
tree that contains every target non-terminal in set .
The two separate algorithms are needed: we show that indeed, in this context,
"almost-sure" "limit-sure" for multi-target reachability, meaning that
there are OBMDPs for which the player may not have any strategy to achieve
probability exactly of reaching all targets in set in the same
generated tree, but may have a sequence of strategies that achieve probability
arbitrarily close to . Both algorithms run in time , where is the total bit encoding length
of the given OBMDP, . Hence they run in polynomial time when
is fixed, and are fixed-parameter tractable with respect to . Moreover, we
show that even the qualitative almost-sure (and limit-sure) multi-target
reachability decision problem is in general NP-hard, when the size of the
set of target non-terminals is not fixed.Comment: 47 page
- …