368 research outputs found

    K-Reach: Who is in Your Small World

    Full text link
    We study the problem of answering k-hop reachability queries in a directed graph, i.e., whether there exists a directed path of length k, from a source query vertex to a target query vertex in the input graph. The problem of k-hop reachability is a general problem of the classic reachability (where k=infinity). Existing indexes for processing classic reachability queries, as well as for processing shortest path queries, are not applicable or not efficient for processing k-hop reachability queries. We propose an index for processing k-hop reachability queries, which is simple in design and efficient to construct. Our experimental results on a wide range of real datasets show that our index is more efficient than the state-of-the-art indexes even for processing classic reachability queries, for which these indexes are primarily designed. We also show that our index is efficient in answering k-hop reachability queries.Comment: VLDB201

    Towards a Query Optimizer for Text-Centric Tasks

    Get PDF
    Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.Information Systems Working Papers Serie

    Towards a Query Optimizer for Text-Centric Tasks

    Get PDF
    Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the Web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or ā€œcrawl,ā€ the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output ā€œcompletenessā€ (e.g., in terms of recall). Nevertheless, this choice is typically ad hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model.We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.NYU, Stern School of Business, IOMS Department, Center for Digital Economy Researc

    Optimization of Regular Path Queries in Graph Databases

    Get PDF
    Regular path queries offer a powerful navigational mechanism in graph databases. Recently, there has been renewed interest in such queries in the context of the Semantic Web. The extension of SPARQL in version 1.1 with property paths offers a type of regular path query for RDF graph databases. While eminently useful, such queries are difficult to optimize and evaluate efficiently, however. We design and implement a cost-based optimizer we call Waveguide for SPARQL queries with property paths. Waveguide builds a query planwhich we call a waveplan (WP)which guides the query evaluation. There are numerous choices in the con- struction of a plan, and a number of optimization methods, so the space of plans for a query can be quite large. Execution costs of plans for the same query can vary by orders of magnitude with the best plan often offering excellent performance. A WPs costs can be estimated, which opens the way to cost-based optimization. We demonstrate that Waveguide properly subsumes existing techniques and that the new plans it adds are relevant. We analyze the effective plan space which is enabled by Waveguide and design an efficient enumerator for it. We implement a pro- totype of a Waveguide cost-based optimizer on top of an open-source relational RDF store. Finally, we perform a comprehensive performance study of the state of the art for evaluation of SPARQL property paths and demonstrate the significant performance gains that Waveguide offers

    RDF Querying

    Get PDF
    Reactive Web systems, Web services, and Web-based publish/ subscribe systems communicate events as XML messages, and in many cases require composite event detection: it is not sufficient to react to single event messages, but events have to be considered in relation to other events that are received over time. Emphasizing language design and formal semantics, we describe the rule-based query language XChangeEQ for detecting composite events. XChangeEQ is designed to completely cover and integrate the four complementary querying dimensions: event data, event composition, temporal relationships, and event accumulation. Semantics are provided as model and fixpoint theories; while this is an established approach for rule languages, it has not been applied for event queries before

    Towards a Query Optimizer for Text-Centric Tasks

    Get PDF
    Text is ubiquitous and, not surprisingly, many important applications rely on textual data for a variety of tasks. As a notable example, information extraction applications derive structured relations from unstructured text; as another example, focused crawlers explore the web to locate pages about specific topics. Execution plans for text-centric tasks follow two general paradigms for processing a text database: either we can scan, or "crawl," the text database or, alternatively, we can exploit search engine indexes and retrieve the documents of interest via carefully crafted queries constructed in task-specific ways. The choice between crawl- and query-based execution plans can have a substantial impact on both execution time and output "completeness" (e.g., in terms of recall). Nevertheless, this choice is typically ad-hoc and based on heuristics or plain intuition. In this article, we present fundamental building blocks to make the choice of execution plans for text-centric tasks in an informed, cost-based way. Towards this goal, we show how to analyze query- and crawl-based plans in terms of both execution time and output completeness. We adapt results from random-graph theory and statistics to develop a rigorous cost model for the execution plans. Our cost model reflects the fact that the performance of the plans depends on fundamental task-specific properties of the underlying text databases. We identify these properties and present efficient techniques for estimating the associated parameters of the cost model. We also present two optimization approaches for text-centric tasks that rely on the cost-model parameters and select efficient execution plans. Overall, our optimization approaches help build efficient execution plans for a task, resulting in significant efficiency and output completeness benefits. We complement our results with a large-scale experimental evaluation for three important text-centric tasks and over multiple real-life data sets.Information Systems Working Papers Serie
    • ā€¦
    corecore