118,099 research outputs found

    Partout: A Distributed Engine for Efficient RDF Processing

    Full text link
    The increasing interest in Semantic Web technologies has led not only to a rapid growth of semantic data on the Web but also to an increasing number of backend applications with already more than a trillion triples in some cases. Confronted with such huge amounts of data and the future growth, existing state-of-the-art systems for storing RDF and processing SPARQL queries are no longer sufficient. In this paper, we introduce Partout, a distributed engine for efficient RDF processing in a cluster of machines. We propose an effective approach for fragmenting RDF data sets based on a query log, allocating the fragments to nodes in a cluster, and finding the optimal configuration. Partout can efficiently handle updates and its query optimizer produces efficient query execution plans for ad-hoc SPARQL queries. Our experiments show the superiority of our approach to state-of-the-art approaches for partitioning and distributed SPARQL query processing

    Efficient Multi-way Theta-Join Processing Using MapReduce

    Full text link
    Multi-way Theta-join queries are powerful in describing complex relations and therefore widely employed in real practices. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there have been some works using the (key,value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Comparing with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves significant improvement of the join processing efficiency.Comment: VLDB201

    Efficient Subgraph Matching on Billion Node Graphs

    Full text link
    The ability to handle large scale graph data is crucial to an increasing number of applications. Much work has been dedicated to supporting basic graph operations such as subgraph matching, reachability, regular expression matching, etc. In many cases, graph indices are employed to speed up query processing. Typically, most indices require either super-linear indexing time or super-linear indexing space. Unfortunately, for very large graphs, super-linear approaches are almost always infeasible. In this paper, we study the problem of subgraph matching on billion-node graphs. We present a novel algorithm that supports efficient subgraph matching for graphs deployed on a distributed memory store. Instead of relying on super-linear indices, we use efficient graph exploration and massive parallel computing for query processing. Our experimental results demonstrate the feasibility of performing subgraph matching on web-scale graph data.Comment: VLDB201

    A network-aware framework for energy-efficient data acquisition in wireless sensor networks

    Get PDF
    Wireless sensor networks enable users to monitor the physical world at an extremely high fidelity. In order to collect the data generated by these tiny-scale devices, the data management community has proposed the utilization of declarative data-acquisition frameworks. While these frameworks have facilitated the energy-efficient retrieval of data from the physical environment, they were agnostic of the underlying network topology and also did not support advanced query processing semantics. In this paper we present KSpot+, a distributed network-aware framework that optimizes network efficiency by combining three components: (i) the tree balancing module, which balances the workload of each sensor node by constructing efficient network topologies; (ii) the workload balancing module, which minimizes data reception inefficiencies by synchronizing the sensor network activity intervals; and (iii) the query processing module, which supports advanced query processing semantics. In order to validate the efficiency of our approach, we have developed a prototype implementation of KSpot+ in nesC and JAVA. In our experimental evaluation, we thoroughly assess the performance of KSpot+ using real datasets and show that KSpot+ provides significant energy reductions under a variety of conditions, thus significantly prolonging the longevity of a WSN

    A bloom-filter strategy for response time reduction in distributed query processing.

    Get PDF
    In distributed database systems, query optimization is to find strategies attempt to minimize the amount of data transmitted over the network. Optimization algorithms have an important impact on the performance of distributed query processing. Since optimal query processing in distributed database systems has been shown to be NP-Hard [WC96], heuristics are applied to find a cost-effective and efficient (but suboptimal) processing strategy. Many query optimization strategies have been proposed to minimize either the total cost or the response time. The approaches in distributed query processing have mainly focused on the use of joins, semijoins, and filters. In this thesis, we propose a new reduction strategy based on bloom-filters to significantly reduce the response time of a distributed query. This algorithm can process general queries consisting of an arbitrary number of relations and join attributes. The performance of the algorithm with respect to response time is compared against the Initial Feasible Solution (IFS). An amount of experimental results has been used to evaluate the performance of our algorithm. Compared to the IFS, our algorithm provides a significantly improved query solution. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2003 .G36. Source: Masters Abstracts International, Volume: 43-05, page: 1749. Thesis (M.Sc.)--University of Windsor (Canada), 2003

    Distributed Stream Filtering for Database Applications

    Get PDF
    Distributed stream filtering is a mechanism for implementing a new class of real-time applications with distributed processing requirements. These applications require scalable architectures to support the efficient processing and multiplexing of large volumes of continuously generated data. This paper provides an overview of a stream-oriented model for database query processing and presents a supporting implementation. To facilitate distributed stream filtering, we introduce several new query processing operations, including pipelined filtering that efficiently joins and eliminates duplicates from database streams and a new join method, the progressive join, that joins streams of tuples. Finally, recognizing that the stream-oriented model results in performance tradeoffs that differ significantly from those in traditional databases, we present a new query optimization strategy specifically designed for stream-oriented databases

    Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1

    Get PDF
    We propose an efficient and scalable architecture for processing generalized graph-pattern queries as they are specified by the current W3C recommendation of the SPARQL 1.1 "Query Language" component. Specifically, the class of queries we consider consists of sets of SPARQL triple patterns with labeled property paths. From a relational perspective, this class resolves to conjunctive queries of relational joins with additional graph-reachability predicates. For the scalable, i.e., distributed, processing of this kind of queries over very large RDF collections, we develop a suitable partitioning and indexing scheme, which allows us to shard the RDF triples over an entire cluster of compute nodes and to process an incoming SPARQL query over all of the relevant graph partitions (and thus compute nodes) in parallel. Unlike most prior works in this field, we specifically aim at the unified optimization and distributed processing of queries consisting of both relational joins and graph-reachability predicates. All communication among the compute nodes is established via a proprietary, asynchronous communication protocol based on the Message Passing Interface
    • …
    corecore