An introduction to Graph Data Management
A graph database is a database in which the data structures for the schema
and/or instances are modeled as a (labeled, directed) graph, or as generalizations
of it, and in which querying is expressed by graph-oriented operations and type
constructors. In this article we present the basic notions of graph databases,
give a historical overview of their main developments, and survey the main current
systems that implement them.
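As a concrete illustration of the data model this abstract describes, the sketch below encodes a small labeled, directed graph with attributes on nodes and edges; all field names and values are illustrative only, not any particular system's storage format.

```python
# Toy instance of the data model described above: a labeled, directed
# graph with attribute maps on both nodes and edges. Hypothetical layout.
graph = {
    "nodes": {
        "n1": {"label": "Person", "props": {"name": "Ada"}},
        "n2": {"label": "Person", "props": {"name": "Alan"}},
    },
    "edges": [
        {"src": "n1", "dst": "n2", "label": "knows", "props": {"since": 1936}},
    ],
}

# A graph-oriented operation over it: follow "knows" edges out of n1.
neighbors = [e["dst"] for e in graph["edges"]
             if e["src"] == "n1" and e["label"] == "knows"]
```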
K-Reach: Who is in Your Small World
We study the problem of answering k-hop reachability queries in a directed
graph, i.e., whether there exists a directed path of length k from a source
query vertex to a target query vertex in the input graph. The k-hop reachability
problem generalizes classic reachability, which corresponds to the case
k = infinity. Existing indexes for processing classic reachability queries, as
well as for processing shortest-path queries, are either not applicable or not
efficient for processing k-hop reachability queries. We propose an index for
processing k-hop reachability queries that is simple in design and efficient
to construct. Our experimental results on a wide range of real datasets show
that our index is more efficient than the state-of-the-art indexes even for
processing classic reachability queries, for which those indexes are primarily
designed. We also show that our index is efficient in answering k-hop
reachability queries.
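To pin down the query semantics (not the paper's index), a k-hop reachability check can be phrased as a depth-bounded breadth-first search; this is a minimal sketch, and the "at most k" reading is an assumption consistent with k = infinity recovering classic reachability.

```python
def k_hop_reachable(adj, source, target, k):
    """Return True iff target is reachable from source by a directed path
    of at most k edges (classic reachability is k = infinity). adj maps
    each vertex to an iterable of out-neighbors. This plain depth-bounded
    BFS only fixes the semantics; the paper's contribution is an index
    that avoids such per-query traversal."""
    if source == target:
        return True
    frontier, visited = {source}, {source}
    for _ in range(k):
        nxt = set()
        for u in frontier:
            for v in adj.get(u, ()):
                if v == target:
                    return True
                if v not in visited:
                    visited.add(v)
                    nxt.add(v)
        if not nxt:  # nothing new reachable; give up early
            return False
        frontier = nxt
    return False
```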
Towards a Query Optimizer for Text-Centric Tasks
Text is ubiquitous and, not surprisingly, many important applications rely on textual data for
a variety of tasks. As a notable example, information extraction applications derive structured
relations from unstructured text; as another example, focused crawlers explore the web to locate
pages about specific topics. Execution plans for text-centric tasks follow two general paradigms
for processing a text database: either we can scan, or "crawl," the text database or, alternatively,
we can exploit search engine indexes and retrieve the documents of interest via carefully crafted
queries constructed in task-specific ways. The choice between crawl- and query-based execution
plans can have a substantial impact on both execution time and output "completeness" (e.g.,
in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain
intuition. In this article, we present fundamental building blocks to make the choice of execution
plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to
analyze query- and crawl-based plans in terms of both execution time and output completeness.
We adapt results from random-graph theory and statistics to develop a rigorous cost model for
the execution plans. Our cost model reflects the fact that the performance of the plans depends
on fundamental task-specific properties of the underlying text databases. We identify these
properties and present efficient techniques for estimating the associated parameters of the cost
model. We also present two optimization approaches for text-centric tasks that rely on the cost-model
parameters and select efficient execution plans. Overall, our optimization approaches
help build efficient execution plans for a task, resulting in significant efficiency and output
completeness benefits. We complement our results with a large-scale experimental evaluation
for three important text-centric tasks and over multiple real-life data sets.
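The following sketch shows the shape of the cost-based choice the article advocates, under deliberately crude, invented cost formulas; the article's actual model is derived from random-graph theory, and every parameter name here is hypothetical.

```python
import math

def crawl_cost(num_docs, t_proc):
    # Scan-based plan: retrieve and process every database document once.
    return num_docs * t_proc

def query_cost(num_docs, recall_target, docs_per_query, t_query, t_proc):
    # Invented closed form: assume queries return non-overlapping batches
    # of docs_per_query documents until recall_target of the database has
    # been retrieved. The real model estimates task-specific parameters
    # rather than assuming them.
    docs_needed = recall_target * num_docs
    num_queries = math.ceil(docs_needed / docs_per_query)
    return num_queries * t_query + docs_needed * t_proc

def choose_plan(num_docs, recall_target, docs_per_query, t_query, t_proc):
    # Pick whichever execution paradigm the cost model predicts is cheaper.
    c = crawl_cost(num_docs, t_proc)
    q = query_cost(num_docs, recall_target, docs_per_query, t_query, t_proc)
    return ("crawl", c) if c <= q else ("query", q)
```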
Optimization of Regular Path Queries in Graph Databases
Regular path queries offer a powerful navigational mechanism in graph databases. Recently, there has been renewed interest in such queries in the context of the Semantic Web. The extension of SPARQL in version 1.1 with property paths offers a type of regular path query for RDF graph databases. While eminently useful, such queries are difficult to optimize and evaluate efficiently. We design and implement a cost-based optimizer, which we call Waveguide, for SPARQL queries with property paths. Waveguide builds a query plan, which we call a waveplan (WP), that guides the query evaluation. There are numerous choices in the construction of a plan, and a number of optimization methods, so the space of plans for a query can be quite large. Execution costs of plans for the same query can vary by orders of magnitude, with the best plan often offering excellent performance. A WP's cost can be estimated, which opens the way to cost-based optimization. We demonstrate that Waveguide properly subsumes existing techniques and that the new plans it adds are relevant. We analyze the effective plan space enabled by Waveguide and design an efficient enumerator for it. We implement a prototype of a Waveguide cost-based optimizer on top of an open-source relational RDF store. Finally, we perform a comprehensive performance study of the state of the art for the evaluation of SPARQL property paths and demonstrate the significant performance gains that Waveguide offers.
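For orientation, one standard baseline in the plan space that waveplans generalize is breadth-first search over the product of the data graph and the query's automaton; the sketch below assumes a deterministic automaton, and all structure names are illustrative rather than Waveguide's API.

```python
from collections import deque

def rpq_eval(edges, start_nodes, delta, start_state, final_states):
    """Answer a regular path query by BFS over the product of the data
    graph and the query automaton. edges maps a node to (label, successor)
    pairs; delta maps (state, label) to the next automaton state (None if
    undefined). Returns the nodes reachable along a path whose label
    sequence the automaton accepts."""
    seen = {(n, start_state) for n in start_nodes}
    queue = deque(seen)
    reached = set()
    while queue:
        node, state = queue.popleft()
        if state in final_states:
            reached.add(node)
        for label, succ in edges.get(node, ()):
            nxt = delta.get((state, label))
            if nxt is not None and (succ, nxt) not in seen:
                seen.add((succ, nxt))
                queue.append((succ, nxt))
    return reached
```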
RDF Querying
Reactive Web systems, Web services, and Web-based publish/
subscribe systems communicate events as XML messages, and in
many cases require composite event detection: it is not sufficient to react
to single event messages, but events have to be considered in relation to
other events that are received over time.
Emphasizing language design and formal semantics, we describe the
rule-based query language XChangeEQ for detecting composite events.
XChangeEQ is designed to completely cover and integrate the four complementary
querying dimensions: event data, event composition, temporal
relationships, and event accumulation. Semantics are provided as
model and fixpoint theories; while this is an established approach for rule
languages, it has not been applied to event queries before.
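XChangeEQ itself is a declarative rule language with model-theoretic and fixpoint semantics; purely as a loose operational analogue of the event-composition and temporal-relationship dimensions, here is a toy detector for a single "A followed by B within a window" pattern. It is not XChangeEQ's semantics, and every name is hypothetical.

```python
def followed_by(events, first_type, second_type, window):
    """Toy composite-event detector: yield each event of second_type that
    is preceded, within `window` time units, by an event of first_type.
    events is an iterable of (timestamp, type, payload) tuples in
    timestamp order."""
    pending = []  # timestamps of first_type events still inside the window
    for ts, etype, payload in events:
        pending = [t for t in pending if ts - t <= window]
        if etype == first_type:
            pending.append(ts)
        elif etype == second_type and pending:
            yield (pending[0], ts, payload)
```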
- …