45 research outputs found

    DHTJoin: Processing Continuous Join Queries Using DHT Networks

    Get PDF
    International audienceContinuous query processing in data stream management systems (DSMS) has received considerable attention recently. Many applications share the same need for processing data streams in a continuous fashion. For most distributed streaming applications, the centralized processing of continuous queries over distributed data is simply not viable. This paper addresses the problem of computing approximate answers to continuous join queries over distributed data streams. We present a new method, called DHTJoin, which combines hash-based placement of tuples in a Distributed Hash Table (DHT) and dissemination of queries by exploiting the embedded trees in the underlying DHT, thereby incuring little overhead. DHTJoin also deals with join attribute value skew which may hurt load balancing and result completeness. We provide a performance evaluation of DHTJoin which shows that it can achieve significant performance gains in terms of network traffic

    Building Efficient Query Engines in a High-Level Language

    Get PDF
    Abstraction without regret refers to the vision of using high-level programming languages for systems development without experiencing a negative impact on performance. A database system designed according to this vision offers both increased productivity and high performance, instead of sacrificing the former for the latter as is the case with existing, monolithic implementations that are hard to maintain and extend. In this article, we realize this vision in the domain of analytical query processing. We present LegoBase, a query engine written in the high-level language Scala. The key technique to regain efficiency is to apply generative programming: LegoBase performs source-to-source compilation and optimizes the entire query engine by converting the high-level Scala code to specialized, low-level C code. We show how generative programming allows to easily implement a wide spectrum of optimizations, such as introducing data partitioning or switching from a row to a column data layout, which are difficult to achieve with existing low-level query compilers that handle only queries. We demonstrate that sufficiently powerful abstractions are essential for dealing with the complexity of the optimization effort, shielding developers from compiler internals and decoupling individual optimizations from each other. We evaluate our approach with the TPC-H benchmark and show that: (a) With all optimizations enabled, LegoBase significantly outperforms a commercial database and an existing query compiler. (b) Programmers need to provide just a few hundred lines of high-level code for implementing the optimizations, instead of complicated low-level code that is required by existing query compilation approaches. (c) The compilation overhead is low compared to the overall execution time, thus making our approach usable in practice for compiling query engines

    Database partitioning strategies for social network data

    Get PDF
    Thesis (M. Eng.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2012.Cataloged from PDF version of thesis.Includes bibliographical references (p. 64-66).In this thesis, I designed, prototyped and benchmarked two different data partitioning strategies for social network type workloads. The first strategy takes advantage of the heavy-tailed degree distributions of social networks to optimize the latency of vertex neighborhood queries. The second strategy takes advantage of the high temporal locality of workloads to improve latencies for vertex neighborhood intersection queries. Both techniques aim to shorten the tail of the latency distribution, while avoiding decreased write performance or reduced system throughput when compared to the default hash partitioning approach. The strategies presented were evaluated using synthetic workloads of my own design as well as real workloads provided by Twitter, and show promising improvements in latency at some cost in system complexity.by Oscar Ricardo Moll Thomae.M.Eng

    Evaluation of Main Memory : Join Algorithms for Joins with Set Comparison Predicates

    Full text link
    Current data models like the NF2 model and object-oriented models support set-valued attributes. Hence, it becomes possible to have join predicates based on set comparison. This paper introduces and evaluates several main memory algorithms to evaluate efficiently this kind of join. More specifically, we concentrate on the set equality and the subset predicates

    맵리듀스에서의 병렬 조인을 위한 다차원 범위 분할 기법

    Get PDF
    학위논문 (박사)-- 서울대학교 대학원 : 전기·컴퓨터공학부, 2014. 8. 이상구.Joins are fundamental operations for many data analysis tasks, but are not directly supported by the MapReduce framework. This is because 1) the framework is basically designed to process a single input data set, and 2) MapReduce's key-equality based data grouping method makes it difficult to support complex join conditions. As a result, a large number of MapReduce-based join algorithms have been proposed. As in traditional shared-nothing systems, one of the major issues in join algorithms using MapReduce is handling of data skew. We propose a new skew handling method, called Multi-Dimensional Range Partitioning (MDRP), and show that the proposed method outperforms traditional skew handling methods: range-based and randomized methods. Specifically, the proposed method has the following advantages: 1) Compared to the range-based method, it considers the number of output tuples at each machine, which leads better handling of join product skew. 2) Compared with the randomized method, it exploits given join conditions before the actual join begins, so that unnecessary input duplication can be reduced. The MDRP method can be used to support advanced join operations such as theta-joins and multi-way joins. With extensive experiments using real and synthetic data sets, we evaluate the effectiveness of the proposed algorithm.Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . i I. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 2 1.2 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . 5 1.3 Outline . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 II. Backgrounds and RelatedWork . . . . . . . . . . . . . . . . 8 2.1 MapReduce . . . . . . . . . . . . . . . . . . . . . . . . . . 8 2.2 Join Algorithms in MapReduce . . . . . . . . . . . . . . . . 11 2.2.1 Two-Way Join Algorithms . . . . . . . . . . . . . . 11 2.2.2 Multi-Way Join Algorithms . . . . . . . . . . . . . 17 2.3 Data Skew in Join Algorithms . . . . . . . . . . . . . . . . 18 2.4 Skew Handling Approaches in MapReduce . . . . . . . . . 22 2.4.1 Hash-Based Approach . . . . . . . . . . . . . . . . 22 2.4.2 Range-Based Approach . . . . . . . . . . . . . . . 24 2.4.3 Randomized Approach . . . . . . . . . . . . . . . . 26 III. Our Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 29 3.1 Multi-Dimensional Range Partitioning . . . . . . . . . . . . 29 3.1.1 Creation of a Partitioning Matrix . . . . . . . . . . . 29 3.1.2 Identifying and Chopping of Heavy Cells . . . . . . 31 3.1.3 Assigning Cells to Reducers . . . . . . . . . . . . . 33 3.1.4 Join Processing using the Partitioning Matrix . . . . 35 3.2 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . 39 3.3 Complex Join Conditions . . . . . . . . . . . . . . . . . . . 41 3.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 43 3.4.1 Scalar Skew Experiments . . . . . . . . . . . . . . . 44 3.4.2 Zipfs Distribution . . . . . . . . . . . . . . . . . . 49 3.4.3 Non-Equijoin Experiments . . . . . . . . . . . . . . 50 3.4.4 Scalability Experiments . . . . . . . . . . . . . . . 52 3.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . 55 3.5.1 Sampling . . . . . . . . . . . . . . . . . . . . . . . 55 3.5.2 Memory-Awareness . . . . . . . . . . . . . . . . . 58 3.5.3 Handling of Heavy Cells . . . . . . . . . . . . . . . 59 3.5.4 Existing Histograms . . . . . . . . . . . . . . . . . 60 3.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 62 IV. Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64 4.1 Joining Multiple Relations in a MapReduce Job . . . . . . . 65 4.1.1 Example: SPARQL Basic Graph Pattern . . . . . . . 65 4.1.2 Example: Matrix Chain Multiplication . . . . . . . . 67 4.1.3 Single-Key Join and Multiple-Key Join Queries . . . 69 4.2 Skew Handling for Multi-Way Joins . . . . . . . . . . . . . 71 4.2.1 Skew Handling for SK-Join Queries . . . . . . . . . 71 4.2.2 Skew Handling for MK-Join Queires . . . . . . . . 72 4.3 Combinations of SK-Join and MK-Join . . . . . . . . . . . 74 4.3.1 Complex Queries . . . . . . . . . . . . . . . . . . . 74 4.3.2 Iteration-Based Algorithms . . . . . . . . . . . . . . 75 4.3.3 Replication-Based Algorithms . . . . . . . . . . . . 77 4.3.4 Iteration-Based vs. Replication-Based . . . . . . . . 78 4.4 Join-Key Selection Algorithms for Complex Queries . . . . 83 4.4.1 Greedy Key Selection . . . . . . . . . . . . . . . . 84 4.4.2 Multiple Key Selection . . . . . . . . . . . . . . . . 85 4.4.3 Hybrid Key Selection . . . . . . . . . . . . . . . . . 86 4.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . 87 4.5.1 SK-Join Experiments . . . . . . . . . . . . . . . . . 87 4.5.2 MK-Join Experiments . . . . . . . . . . . . . . . . 89 4.5.3 Analysis of TV Watching Logs . . . . . . . . . . . . 90 4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . 92 V. Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 5.1 Algorithms for SPARQL Basic Graph Pattern . . . . . . . . 94 5.1.1 MR-Selection . . . . . . . . . . . . . . . . . . . . . 95 5.1.2 MR-Join . . . . . . . . . . . . . . . . . . . . . . . 98 5.1.3 Performance Evaluation . . . . . . . . . . . . . . . 101 5.1.4 Discussion . . . . . . . . . . . . . . . . . . . . . . 105 5.2 Algorithms for Matrix Chain Multiplication . . . . . . . . . 107 5.2.1 Serial Two-Way Join (S2) . . . . . . . . . . . . . . 109 5.2.2 Parallel M-Way Join (P2, PM) . . . . . . . . . . . . 111 5.2.3 Serial Two-Way vs. Parallel M-Way . . . . . . . . . 115 5.2.4 Performance Evaluation . . . . . . . . . . . . . . . 116 5.2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . 119 5.2.6 Extension: Embedded MapReduce . . . . . . . . . . 119 VI. Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123 Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 초록 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133Docto

    Scaling and Load-Balancing Equi-Joins

    Full text link
    The task of joining two tables is fundamental for querying databases. In this paper, we focus on the equi-join problem, where a pair of records from the two joined tables are part of the join results if equality holds between their values in the join column(s). While this is a tractable problem when the number of records in the joined tables is relatively small, it becomes very challenging as the table sizes increase, especially if hot keys (join column values with a large number of records) exist in both joined tables. This paper, an extended version of [metwally-SIGMOD-2022], proposes Adaptive-Multistage-Join (AM-Join) for scalable and fast equi-joins in distributed shared-nothing architectures. AM-Join utilizes (a) Tree-Join, a proposed novel algorithm that scales well when the joined tables share hot keys, and (b) Broadcast-Join, the known fastest when joining keys that are hot in only one table. Unlike the state-of-the-art algorithms, AM-Join (a) holistically solves the join-skew problem by achieving load balancing throughout the join execution, and (b) supports all outer-join variants without record deduplication or custom table partitioning. For the fastest AM-Join outer-join performance, we propose the Index-Broadcast-Join (IB-Join) family of algorithms for Small-Large joins, where one table fits in memory and the other can be up to orders of magnitude larger. The outer-join variants of IB-Join improves on the state-of-the-art Small-Large outer-join algorithms. The proposed algorithms can be adopted in any shared-nothing architecture. We implemented a MapReduce version using Spark. Our evaluation shows the proposed algorithms execute significantly faster and scale to more skewed and orders-of-magnitude bigger tables when compared to the state-of-the-art algorithms

    Distributed and Deep Vertical Federated Learning with Big Data

    Full text link
    In recent years, data are typically distributed in multiple organizations while the data security is becoming increasingly important. Federated Learning (FL), which enables multiple parties to collaboratively train a model without exchanging the raw data, has attracted more and more attention. Based on the distribution of data, FL can be realized in three scenarios, i.e., horizontal, vertical, and hybrid. In this paper, we propose to combine distributed machine learning techniques with Vertical FL and propose a Distributed Vertical Federated Learning (DVFL) approach. The DVFL approach exploits a fully distributed architecture within each party in order to accelerate the training process. In addition, we exploit Homomorphic Encryption (HE) to protect the data against honest-but-curious participants. We conduct extensive experimentation in a large-scale cluster environment and a cloud environment in order to show the efficiency and scalability of our proposed approach. The experiments demonstrate the good scalability of our approach and the significant efficiency advantage (up to 6.8 times with a single server and 15.1 times with multiple servers in terms of the training time) compared with baseline frameworks.Comment: To appear in CCPE (Concurrency and Computation: Practice and Experience
    corecore