164,927 research outputs found

    Distributed Join Approaches for W3C-Conform SPARQL Endpoints

    Many SPARQL endpoints are currently freely available and accessible at no cost to users: anyone can submit SPARQL queries to SPARQL endpoints via a standardized protocol, where the queries are processed on the datasets of the SPARQL endpoints and the query results are sent back to the user in a standardized format. As these distributed execution environments for semantic big data (the intersection of semantic data and big data) are freely accessible, the Semantic Web is an ideal playground for big data research. However, when utilizing these distributed execution environments, questions about performance arise. In particular, when several datasets (local ones and those residing in SPARQL endpoints) need to be combined, distributed joins must be computed. In this work we give an overview of the various possibilities for distributed join processing in SPARQL endpoints that follow the SPARQL specification and hence are "W3C conform". We also introduce new distributed join approaches as variants of the Bitvector-Join and as a combination of the Semi- and Bitvector-Join. Finally, we compare all the existing and newly proposed distributed join approaches for W3C-conform SPARQL endpoints in an extensive experimental evaluation.
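    The bit-vector idea mentioned in this abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the use of SHA-256, and the in-process "endpoint" step are all assumptions made for illustration, and the sketch does not show how such a test would be expressed in a W3C-conform SPARQL query. The local side hashes its join values into a bit vector, the remote side returns only rows whose join value hits a set bit, and a final exact join removes any false positives.

```python
import hashlib


def bit_position(value, num_bits):
    """Map a join value to a bit position using a stable hash (assumed: SHA-256)."""
    digest = hashlib.sha256(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_bits


def build_bitvector(join_values, num_bits):
    """Set one bit per local join value; a Python int serves as the bit vector."""
    bits = 0
    for value in join_values:
        bits |= 1 << bit_position(value, num_bits)
    return bits


def remote_filter(remote_rows, key, bits, num_bits):
    """Hypothetical endpoint-side step: keep rows whose join value hits a set bit."""
    return [row for row in remote_rows
            if (bits >> bit_position(row[key], num_bits)) & 1]


def local_join(local_rows, shipped_rows, key):
    """Exact hash join on the shipped rows; this also drops false positives."""
    index = {}
    for row in local_rows:
        index.setdefault(row[key], []).append(row)
    return [{**left, **right}
            for right in shipped_rows
            for left in index.get(right[key], [])]


local = [{"s": "ex:a", "name": "A"}, {"s": "ex:b", "name": "B"}]
remote = [{"s": "ex:a", "age": 1}, {"s": "ex:c", "age": 2}, {"s": "ex:b", "age": 3}]
bits = build_bitvector((row["s"] for row in local), 64)
shipped = remote_filter(remote, "s", bits, 64)
result = local_join(local, shipped, "s")
```

    Only the bit vector travels to the remote side, so the request stays small regardless of how long the join values are. The price is that hash collisions may let some non-matching rows through, which is why the final exact join is still needed.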

    Robust and Skew-resistant Parallel Joins in Shared-Nothing Systems

    The performance of joins in parallel database management systems is critical for data-intensive operations such as querying. Since data skew is common in many applications, poorly engineered join operations result in load imbalance and performance bottlenecks. State-of-the-art methods designed to handle this problem offer significant improvements over naive implementations. However, performance could be further improved by removing the dependency on global skew knowledge and broadcasting. In this paper, we propose PRPQ (partial redistribution & partial query), an efficient and robust join algorithm for processing large-scale joins over distributed systems. We present the detailed implementation and a quantitative evaluation of our method. The experimental results demonstrate that the proposed PRPQ algorithm is robust and scalable under a wide range of skew conditions. Specifically, compared to the state-of-the-art PRPD method, we achieve 16% - 167% performance improvement and 24% - 54% less network communication under different join workloads.

    DHTJoin: Processing Continuous Join Queries Using DHT Networks

    Continuous query processing in data stream management systems (DSMS) has received considerable attention recently. Many applications share the same need for processing data streams in a continuous fashion. For most distributed streaming applications, the centralized processing of continuous queries over distributed data is simply not viable. This paper addresses the problem of computing approximate answers to continuous join queries over distributed data streams. We present a new method, called DHTJoin, which combines hash-based placement of tuples in a Distributed Hash Table (DHT) and dissemination of queries by exploiting the embedded trees in the underlying DHT, thereby incurring little overhead. DHTJoin also deals with join attribute value skew, which may hurt load balancing and result completeness. We provide a performance evaluation of DHTJoin which shows that it can achieve significant performance gains in terms of network traffic.
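    A rough sketch of hash-based tuple placement for a continuous join is given below. It is an illustration under stated assumptions, not DHTJoin itself: the `Node` and `Ring` classes, the modulo placement (a real DHT would use its own key space and routing), and the fixed-size windows are all hypothetical, and query dissemination and skew handling are left out. Tuples from streams `R` and `S` with the same join value are routed to the same node, which joins each arriving tuple against the other stream's recent window.

```python
import hashlib
from collections import deque


def responsible_node(value, num_nodes):
    """Pick the node for a join value (assumed: SHA-1 modulo the node count)."""
    digest = hashlib.sha1(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


class Node:
    """Hypothetical peer holding a bounded window of recent tuples per stream."""

    def __init__(self, window_size):
        self.windows = {"R": deque(maxlen=window_size),
                        "S": deque(maxlen=window_size)}

    def insert(self, stream, tup, key):
        """Join the new tuple against the other stream's window, then store it."""
        other = "S" if stream == "R" else "R"
        matches = [(tup, o) if stream == "R" else (o, tup)
                   for o in self.windows[other] if o[key] == tup[key]]
        self.windows[stream].append(tup)
        return matches


class Ring:
    """Hypothetical stand-in for the DHT: routes each tuple by its join value."""

    def __init__(self, num_nodes, window_size):
        self.nodes = [Node(window_size) for _ in range(num_nodes)]

    def publish(self, stream, tup, key):
        node = self.nodes[responsible_node(tup[key], len(self.nodes))]
        return node.insert(stream, tup, key)
```

    Because each window is bounded, older tuples are evicted and some matches are never produced, which is one simple way a streaming join ends up returning approximate rather than complete answers.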

    Join Processing Using Filtering Techniques in a MapReduce Cluster

    Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2014. 김형주. The join operation is one of the essential operations for data analysis because it is necessary to join large datasets to analyze heterogeneous data collected from different sources. MapReduce is a very useful framework for large-scale data analysis, but it is not suitable for joining multiple datasets. This is because it may produce a large number of redundant intermediate results, irrespective of the size of the joined records. Several existing approaches have been employed to improve the join performance, but they can only be used in specific circumstances or they may require multiple MapReduce jobs. To alleviate this problem, MFR-Join is proposed in this dissertation, which is a general join framework for processing equi-joins with filtering techniques in MapReduce. MFR-Join filters out redundant intermediate records within a single MapReduce job by applying filters in the map phase. To achieve this, the MapReduce framework is modified in two ways. First, map tasks are scheduled according to the processing order of the input datasets. Second, filters are created dynamically with the join keys of the datasets in a distributed manner. Various filtering techniques that support specific desirable operations can be plugged into MFR-Join. Adaptive join processing methods are also proposed for cases where join processing with filters performs worse than join processing without filters. The filters can be applied according to their performance, which is estimated in terms of the false positive rate. Furthermore, two map task scheduling policies are also provided: synchronous and asynchronous scheduling. The concept of filtering techniques is extended to multi-way joins. Methods for filter applications are proposed for the two types of multi-way joins: common attribute joins and distinct attribute joins.
    The experimental results showed that the proposed approach outperformed existing join algorithms and reduced the size of intermediate results when small portions of input datasets were joined.

    Abstract
    Contents
    List of Figures
    List of Tables
    1 Introduction
      1.1 Research Background and Motivation
      1.2 Contributions
        1.2.1 Join Processing with Filtering Techniques in MapReduce
        1.2.2 Adaptive Join Processing with Filtering Techniques in MFR-Join
        1.2.3 Multi-way Join Processing in MFR-Join
      1.3 Dissertation Overview
    2 Preliminaries and Related Work
      2.1 MapReduce
      2.2 Parallel and Distributed Join Algorithms in DBMS
      2.3 Join Algorithms in MapReduce
        2.3.1 Map-side joins
        2.3.2 Reduce-side joins
      2.4 Multi-way Joins in MapReduce
      2.5 Filtering Techniques for Join Processing
    3 MFR-Join: A General Join Framework with Filtering Techniques in MapReduce
      3.1 MFR-Join Framework
        3.1.1 Execution Overview
        3.1.2 Map Task Scheduling
        3.1.3 Filter Construction
        3.1.4 Filtering Techniques Applicable to MFR-Join
        3.1.5 API and Parameters
      3.2 Cost Analysis
        3.2.1 Cost Model
        3.2.2 Effects of the Filters
      3.3 Evaluation
        3.3.1 Experimental Setup
        3.3.2 Experimental Results
    4 Adaptive Join Processing with Filtering Techniques in MFR-Join
      4.1 Adaptive join processing in MFR-Join
        4.1.1 Execution Overview
        4.1.2 Additional Filter Operations for Adaptive Joins
        4.1.3 Early Detection of FPR Threshold Being Exceeded
        4.1.4 Map Task Scheduling Policies
        4.1.5 Additional Parameters for Adaptive Joins
      4.2 Join Cost and FPR Threshold Analysis
        4.2.1 Cost of Adaptive Join
        4.2.2 Effects of FPR Threshold
        4.2.3 Effects of Map Task Scheduling Policy
      4.3 Evaluation
        4.3.1 Experimental Setup
        4.3.2 Experimental Results
    5 Multi-way Join Processing in MFR-Join
      5.1 Applying filters to multi-way joins
        5.1.1 Common Attribute Joins
        5.1.2 Distinct Attribute Joins
        5.1.3 General Multi-way Joins
        5.1.4 Cost Analysis
      5.2 Implementation Details
        5.2.1 Partition Assignment
        5.2.2 MapReduce Functions
      5.3 Evaluation
        5.3.1 Common Attribute Joins
        5.3.2 Distinct attribute joins
    6 Conclusions and Future Work
      6.1 Conclusions
      6.2 Future Work
        6.2.1 Integration with Data Warehouse Systems
        6.2.2 Join-based Applications
        6.2.3 Improving Scalability
    References
    Summary (in Korean)
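    The map-phase filtering described in the MFR-Join abstract can be sketched as follows. This is an illustration under assumptions, not the dissertation's framework: a Bloom filter is assumed as the filtering technique (the abstract only says that various techniques can be plugged in), the names `BloomFilter`, `map_build`, `map_probe`, and `reduce_join` are hypothetical, and the modified map task scheduling is replaced by simply calling the two map functions in order within one process.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter over join keys (assumed: SHA-256 with seed prefixes)."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


def map_build(records, key, bloom):
    """Map over the first dataset: emit every record and register its join key."""
    out = []
    for rec in records:
        bloom.add(rec[key])
        out.append((rec[key], ("L", rec)))
    return out


def map_probe(records, key, bloom):
    """Map over the second dataset: drop records whose key cannot match."""
    return [(rec[key], ("R", rec)) for rec in records if rec[key] in bloom]


def reduce_join(pairs):
    """Group tagged records by join key and emit the per-key cross product."""
    groups = {}
    for k, (tag, rec) in pairs:
        groups.setdefault(k, {"L": [], "R": []})[tag].append(rec)
    return [(left, right)
            for g in groups.values()
            for left in g["L"]
            for right in g["R"]]
```

    Records of the second dataset that fail the filter never reach the shuffle, which is where the reduction in intermediate results comes from. A Bloom filter has no false negatives, so no join result is lost; false positives only cost extra shuffled records, which the reduce step discards.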

    Efficient Parallel and Adaptive Partitioning for Load-balancing in Spatial Join

    Due to developments in topographic techniques, clear satellite imagery, and various means for collecting information, geospatial datasets are growing in volume, complexity, and heterogeneity. For efficient execution of spatial computations and analytics on large spatial datasets, parallel processing is required. To exploit fine-grained parallel processing in large-scale compute clusters, partitioning in a load-balanced way is necessary for skewed datasets. In this work, we focus on the spatial join operation, where the inputs are two layers of geospatial data. Our partitioning method for spatial join uses the Adaptive Partitioning (ADP) technique, which is based on Quadtree partitioning. Unlike existing partitioning techniques, ADP partitions the spatial join workload instead of partitioning the individual datasets separately, to provide better load balancing. Based on our experimental evaluation, ADP partitions spatial data in a more balanced way than Quadtree partitioning and Uniform grid partitioning. ADP uses an output-sensitive duplication avoidance technique which minimizes duplication of geometries that are not part of the spatial join output. In a distributed memory environment, this technique can reduce data communication and storage requirements compared to traditional methods. To improve the performance of ADP, an MPI+Threads based parallelization, ParADP, is presented. With ParADP, a pair of real-world datasets, one with 717 million polylines and another with 10 million polygons, is partitioned into 65,536 grid cells within 7 seconds. ParADP shows both good weak scaling and good strong scaling up to 4,032 CPU cores.

    Query Driven Operator Placement for Complex Event Detection over Data Streams

    We consider the problem of efficiently processing subscription queries over data streams in large-scale interconnected sensor networks. We propose a scalable algorithm for distributed data stream processing, applicable on top of any platform granting access to interconnected sensor networks. We make use of a probabilistic algorithm to check whether subscriptions are subsumed by other subscriptions and thus can be pruned for more efficient processing. Our proposed methods are query driven: they do not replicate data streams, but intelligently place join operators inside the global network of sources. A performance evaluation using real-world sensor data shows the suitability of our approach.

    Distributed Processing of Generalized Graph-Pattern Queries in SPARQL 1.1

    We propose an efficient and scalable architecture for processing generalized graph-pattern queries as they are specified by the current W3C recommendation of the SPARQL 1.1 "Query Language" component. Specifically, the class of queries we consider consists of sets of SPARQL triple patterns with labeled property paths. From a relational perspective, this class resolves to conjunctive queries of relational joins with additional graph-reachability predicates. For the scalable, i.e., distributed, processing of queries of this kind over very large RDF collections, we develop a suitable partitioning and indexing scheme, which allows us to shard the RDF triples over an entire cluster of compute nodes and to process an incoming SPARQL query over all of the relevant graph partitions (and thus compute nodes) in parallel. Unlike most prior works in this field, we specifically aim at the unified optimization and distributed processing of queries consisting of both relational joins and graph-reachability predicates. All communication among the compute nodes is established via a proprietary, asynchronous communication protocol based on the Message Passing Interface.

    Forecasting the cost of processing multi-join queries via hashing for main-memory databases (Extended version)

    Database management systems (DBMSs) carefully optimize complex multi-join queries to avoid expensive disk I/O. As servers today feature tens or hundreds of gigabytes of RAM, a significant fraction of many analytic databases becomes memory-resident. Even after careful tuning for an in-memory environment, a linear disk I/O model such as the one implemented in PostgreSQL may make query response time predictions that are up to 2X slower than the optimal multi-join query plan over memory-resident data. This paper introduces a memory I/O cost model to identify good evaluation strategies for complex query plans with multiple hash-based equi-joins over memory-resident data. The proposed cost model is carefully validated for accuracy using three different systems, including an Amazon EC2 instance, to control for hardware-specific differences. Prior work in parallel query evaluation has advocated right-deep and bushy trees for multi-join queries due to their greater parallelization and pipelining potential. A surprising finding is that the conventional wisdom from shared-nothing disk-based systems does not directly apply to the modern shared-everything memory hierarchy. As corroborated by our model, the performance gap between the optimal left-deep and right-deep query plan can grow to about 10X as the number of joins in the query increases. Comment: 15 pages, 8 figures, extended version of the paper to appear in SoCC'1