3,649 research outputs found

    Parallel Processing of Large Graphs

    More and more large data collections are being gathered worldwide in various IT systems. Many of them are networked in nature and need to be processed and analysed as graph structures. Because of their size, they very often require a parallel paradigm for efficient computation. Three parallel techniques are compared in the paper: MapReduce, its map-side join extension, and Bulk Synchronous Parallel (BSP). They are implemented for two different graph problems: calculation of single source shortest paths (SSSP) and collective classification of graph nodes by means of relational influence propagation (RIP). The methods and algorithms are applied to several network datasets differing in size and structural profile, originating from three domains: telecommunication, multimedia and microblog. The results reveal that iterative graph processing with the BSP implementation always and significantly outperforms MapReduce, by up to 10 times, especially for algorithms with many iterations and sparse communication. The MapReduce extension based on map-side join is also usually noticeably more efficient than plain MapReduce, although not by as much as BSP. Nevertheless, MapReduce remains a good alternative for enormous networks whose data structures do not fit in local memory.
    Comment: Preprint submitted to Future Generation Computer System
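    As a rough illustration of why BSP suits iterative problems such as SSSP, the following is a minimal single-process sketch of superstep-based shortest paths. It is not the implementation evaluated in the paper; the function name `bsp_sssp`, the adjacency-list layout, and the in-memory `inbox` standing in for message passing are all assumptions made for this example. Every vertex, including those with no outgoing edges, is assumed to appear as a key of `graph`.

```python
import math
from collections import defaultdict


def bsp_sssp(graph, source):
    """Single-source shortest paths in a BSP style (illustrative sketch).

    graph: dict mapping each vertex to a list of (neighbor, weight) pairs.
    Returns (distances, number_of_supersteps).
    """
    dist = {v: math.inf for v in graph}
    dist[source] = 0
    active = {source}          # vertices whose distance changed last superstep
    supersteps = 0

    while active:
        # Computation and communication: each active vertex sends
        # "my distance + edge weight" to every neighbor.
        inbox = defaultdict(list)
        for v in active:
            for neighbor, weight in graph[v]:
                inbox[neighbor].append(dist[v] + weight)

        # Barrier: all messages are delivered before any vertex updates.
        active = set()
        for v, messages in inbox.items():
            best = min(messages)
            if best < dist[v]:
                dist[v] = best
                active.add(v)   # only improved vertices stay active
        supersteps += 1

    return dist, supersteps


if __name__ == "__main__":
    example = {
        "a": [("b", 1), ("c", 4)],
        "b": [("c", 2), ("d", 6)],
        "c": [("d", 1)],
        "d": [],
    }
    print(bsp_sssp(example, "a"))
```

    State stays in memory between supersteps and only changed vertices send messages, which is consistent with the abstract's observation that BSP gains most on algorithms with many iterations and sparse communication. A MapReduce version would typically reread and rewrite the graph on every iteration.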

    A Multi-Dimensional Range Partitioning Technique for Parallel Joins in MapReduce

    Doctoral thesis, Seoul National University Graduate School, Department of Electrical and Computer Engineering, August 2014. 이상구.
    Joins are fundamental operations for many data analysis tasks, but are not directly supported by the MapReduce framework. This is because 1) the framework is basically designed to process a single input data set, and 2) MapReduce's key-equality based data grouping method makes it difficult to support complex join conditions. As a result, a large number of MapReduce-based join algorithms have been proposed. As in traditional shared-nothing systems, one of the major issues in join algorithms using MapReduce is the handling of data skew. We propose a new skew handling method, called Multi-Dimensional Range Partitioning (MDRP), and show that the proposed method outperforms traditional skew handling methods: range-based and randomized methods. Specifically, the proposed method has the following advantages: 1) Compared with the range-based method, it considers the number of output tuples at each machine, which leads to better handling of join product skew. 2) Compared with the randomized method, it exploits given join conditions before the actual join begins, so that unnecessary input duplication can be reduced. The MDRP method can be used to support advanced join operations such as theta-joins and multi-way joins. With extensive experiments using real and synthetic data sets, we evaluate the effectiveness of the proposed algorithm.
    Contents:
    Abstract
    I. Introduction
        1.1 Motivation
        1.2 Contribution
        1.3 Outline
    II. Backgrounds and Related Work
        2.1 MapReduce
        2.2 Join Algorithms in MapReduce: 2.2.1 Two-Way Join Algorithms; 2.2.2 Multi-Way Join Algorithms
        2.3 Data Skew in Join Algorithms
        2.4 Skew Handling Approaches in MapReduce: 2.4.1 Hash-Based Approach; 2.4.2 Range-Based Approach; 2.4.3 Randomized Approach
    III. Our Approach
        3.1 Multi-Dimensional Range Partitioning: 3.1.1 Creation of a Partitioning Matrix; 3.1.2 Identifying and Chopping of Heavy Cells; 3.1.3 Assigning Cells to Reducers; 3.1.4 Join Processing using the Partitioning Matrix
        3.2 Theoretical Analysis
        3.3 Complex Join Conditions
        3.4 Experiments: 3.4.1 Scalar Skew Experiments; 3.4.2 Zipf's Distribution; 3.4.3 Non-Equijoin Experiments; 3.4.4 Scalability Experiments
        3.5 Discussion: 3.5.1 Sampling; 3.5.2 Memory-Awareness; 3.5.3 Handling of Heavy Cells; 3.5.4 Existing Histograms
        3.6 Summary
    IV. Extensions
        4.1 Joining Multiple Relations in a MapReduce Job: 4.1.1 Example: SPARQL Basic Graph Pattern; 4.1.2 Example: Matrix Chain Multiplication; 4.1.3 Single-Key Join and Multiple-Key Join Queries
        4.2 Skew Handling for Multi-Way Joins: 4.2.1 Skew Handling for SK-Join Queries; 4.2.2 Skew Handling for MK-Join Queries
        4.3 Combinations of SK-Join and MK-Join: 4.3.1 Complex Queries; 4.3.2 Iteration-Based Algorithms; 4.3.3 Replication-Based Algorithms; 4.3.4 Iteration-Based vs. Replication-Based
        4.4 Join-Key Selection Algorithms for Complex Queries: 4.4.1 Greedy Key Selection; 4.4.2 Multiple Key Selection; 4.4.3 Hybrid Key Selection
        4.5 Experiments: 4.5.1 SK-Join Experiments; 4.5.2 MK-Join Experiments; 4.5.3 Analysis of TV Watching Logs
        4.6 Summary
    V. Applications
        5.1 Algorithms for SPARQL Basic Graph Pattern: 5.1.1 MR-Selection; 5.1.2 MR-Join; 5.1.3 Performance Evaluation; 5.1.4 Discussion
        5.2 Algorithms for Matrix Chain Multiplication: 5.2.1 Serial Two-Way Join (S2); 5.2.2 Parallel M-Way Join (P2, PM); 5.2.3 Serial Two-Way vs. Parallel M-Way; 5.2.4 Performance Evaluation; 5.2.5 Discussion; 5.2.6 Extension: Embedded MapReduce
    VI. Conclusion
    Bibliography
    Index
    Abstract (in Korean)
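    The join product skew mentioned in the abstract can be made concrete with a small calculation. The sketch below is not MDRP. It only estimates how many output tuples each reducer would produce for an equi-join when keys are assigned to reducers by a simple partitioner. The helper names `join_output_per_reducer` and `modulo_partition` are hypothetical, and integer join keys are assumed.

```python
from collections import Counter


def modulo_partition(key, num_reducers):
    """Hypothetical stand-in for a hash partitioner on integer keys."""
    return key % num_reducers


def join_output_per_reducer(r_keys, s_keys, num_reducers, partition):
    """Estimate equi-join output tuples produced at each reducer.

    A key that occurs m times in R and n times in S yields m * n output
    tuples, all at the single reducer that key is assigned to.
    """
    r_counts, s_counts = Counter(r_keys), Counter(s_keys)
    load = [0] * num_reducers
    for key, r_n in r_counts.items():
        load[partition(key, num_reducers)] += r_n * s_counts[key]
    return load


if __name__ == "__main__":
    # Key 1 is heavily skewed in both inputs; keys 2 to 4 appear once each.
    r_keys = [1] * 100 + [2, 3, 4]
    s_keys = [1] * 50 + [2, 3, 4]
    print(join_output_per_reducer(r_keys, s_keys, 4, modulo_partition))
```

    In this toy input one reducer is responsible for 5,000 of the 5,003 output tuples, even though its share of the input is far less lopsided. A method that balances only input sizes would miss this, which is the motivation the abstract gives for accounting for the number of output tuples at each machine.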

    Garbage collection auto-tuning for Java MapReduce on Multi-Cores

    MapReduce has been widely accepted as a simple programming pattern that can form the basis for efficient, large-scale, distributed data processing. The success of the MapReduce pattern has led to a variety of implementations for different computational scenarios. In this paper we present MRJ, a MapReduce Java framework for multi-core architectures. We evaluate its scalability on a four-core, hyperthreaded Intel Core i7 processor, using a set of standard MapReduce benchmarks. We investigate the significant impact that Java runtime garbage collection has on the performance and scalability of MRJ. We propose the use of memory management auto-tuning techniques based on machine learning. With our auto-tuning approach, we are able to achieve MRJ performance within 10% of optimal on 75% of our benchmark tests.

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up works since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of systems that have been introduced to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.
    Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
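    For readers unfamiliar with the programming model this survey describes, the following is a minimal in-memory sketch of the map, shuffle and reduce phases, using word counting as the job. It is a toy single-process illustration, not the API of Hadoop or any other framework; all function names here are assumptions made for the example.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the user-supplied mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the user-supplied reducer to each key and its list of values."""
    return {key: reducer(key, values) for key, values in groups.items()}


def run_job(records, mapper, reducer):
    return reduce_phase(shuffle(map_phase(records, mapper)), reducer)


# User code: only these two functions are specific to word counting.
def word_count_mapper(line):
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    return sum(values)


if __name__ == "__main__":
    print(run_job(["a b a", "B c"], word_count_mapper, sum_reducer))
```

    The application supplies only the mapper and the reducer. Grouping by key, and in a real cluster also data distribution, scheduling and fault tolerance, belong to the framework, which is the separation the abstract refers to.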

    MapReduce is Good Enough? If All You Have is a Hammer, Throw Away Everything That's Not a Nail!

    Hadoop is currently the large-scale data analysis "hammer" of choice, but there exist classes of algorithms that aren't "nails", in the sense that they are not particularly amenable to the MapReduce programming model. To address this, researchers have proposed MapReduce extensions or alternative programming models in which these algorithms can be elegantly expressed. This essay espouses a very different position: that MapReduce is "good enough", and that instead of trying to invent screwdrivers, we should simply get rid of everything that's not a nail. To be more specific, much discussion in the literature surrounds the fact that iterative algorithms are a poor fit for MapReduce: the simple solution is to find alternative non-iterative algorithms that solve the same problem. This essay captures my personal experiences as an academic researcher as well as a software engineer in a "real-world" production analytics environment. From this combined perspective I reflect on the current state and future of "big data" research.