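Join Processing Using Filtering Techniques on a MapReduce Cluster (맵리듀스 클러스터에서 필터링 기법을 사용한 조인 처리)

Author: 이태휘
Publication venue: Seoul National University Graduate School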
Publication date: 01/02/2014

Doctoral dissertation, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2014. 김형주.

The join operation is essential for data analysis because heterogeneous data collected from different sources must be joined before it can be analyzed. MapReduce is a very useful framework for large-scale data analysis, but it is not well suited to joining multiple datasets, because it may produce a large number of redundant intermediate results irrespective of the size of the joined records. Several existing approaches improve join performance, but they apply only in specific circumstances or require multiple MapReduce jobs. To alleviate this problem, this dissertation proposes MFR-Join, a general join framework for processing equi-joins with filtering techniques in MapReduce. MFR-Join filters out redundant intermediate records within a single MapReduce job by applying filters in the map phase. To achieve this, the MapReduce framework is modified in two ways. First, map tasks are scheduled according to the processing order of the input datasets. Second, filters are created dynamically from the join keys of the datasets in a distributed manner. Various filtering techniques that support specific desirable operations can be plugged into MFR-Join. Adaptive join processing methods are also proposed for cases in which join processing with filters performs worse than join processing without them: filters are applied according to their performance, which is estimated in terms of the false positive rate. Two map task scheduling policies are provided as well: synchronous and asynchronous scheduling. Finally, the concept of filtering techniques is extended to multi-way joins, with methods for applying filters to two types of multi-way joins: common attribute joins and distinct attribute joins.
The experimental results showed that the proposed approach outperformed existing join algorithms and reduced the size of intermediate results when small portions of input datasets were joined.

Abstract
Contents
List of Figures
List of Tables
1 Introduction
  1.1 Research Background and Motivation
  1.2 Contributions
    1.2.1 Join Processing with Filtering Techniques in MapReduce
    1.2.2 Adaptive Join Processing with Filtering Techniques in MFR-Join
    1.2.3 Multi-way Join Processing in MFR-Join
  1.3 Dissertation Overview
2 Preliminaries and Related Work
  2.1 MapReduce
  2.2 Parallel and Distributed Join Algorithms in DBMS
  2.3 Join Algorithms in MapReduce
    2.3.1 Map-side joins
    2.3.2 Reduce-side joins
  2.4 Multi-way Joins in MapReduce
  2.5 Filtering Techniques for Join Processing
3 MFR-Join: A General Join Framework with Filtering Techniques in MapReduce
  3.1 MFR-Join Framework
    3.1.1 Execution Overview
    3.1.2 Map Task Scheduling
    3.1.3 Filter Construction
    3.1.4 Filtering Techniques Applicable to MFR-Join
    3.1.5 API and Parameters
  3.2 Cost Analysis
    3.2.1 Cost Model
    3.2.2 Effects of the Filters
  3.3 Evaluation
    3.3.1 Experimental Setup
    3.3.2 Experimental Results
4 Adaptive Join Processing with Filtering Techniques in MFR-Join
  4.1 Adaptive join processing in MFR-Join
    4.1.1 Execution Overview
    4.1.2 Additional Filter Operations for Adaptive Joins
    4.1.3 Early Detection of FPR Threshold Being Exceeded
    4.1.4 Map Task Scheduling Policies
    4.1.5 Additional Parameters for Adaptive Joins
  4.2 Join Cost and FPR Threshold Analysis
    4.2.1 Cost of Adaptive Join
    4.2.2 Effects of FPR Threshold
    4.2.3 Effects of Map Task Scheduling Policy
  4.3 Evaluation
    4.3.1 Experimental Setup
    4.3.2 Experimental Results
5 Multi-way Join Processing in MFR-Join
  5.1 Applying filters to multi-way joins
    5.1.1 Common Attribute Joins
    5.1.2 Distinct Attribute Joins
    5.1.3 General Multi-way Joins
    5.1.4 Cost Analysis
  5.2 Implementation Details
    5.2.1 Partition Assignment
    5.2.2 MapReduce Functions
  5.3 Evaluation
    5.3.1 Common Attribute Joins
    5.3.2 Distinct attribute joins
6 Conclusions and Future Work
  6.1 Conclusions
  6.2 Future Work
    6.2.1 Integration with Data Warehouse Systems
    6.2.2 Join-based Applications
    6.2.3 Improving Scalability
References
Summary (in Korean)
Docto
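
The map-phase filtering idea in the abstract above can be illustrated with a small in-memory sketch. This is not MFR-Join itself: it assumes a Bloom filter as the plugged-in filtering technique (the abstract only says that various techniques can be used and that performance is estimated by false positive rate), and the names `BloomFilter` and `filtered_join` are hypothetical. The filter is built from the join keys of the first dataset, and records of the second dataset are emitted to the shuffle only if their key passes the filter.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter (an assumed filtering technique, for illustration)."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent"; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_join(left, right):
    """Equi-join two lists of (key, value) pairs, filtering `right` early.

    Returns the sorted join result and the number of `right` records that
    survived the filter (the intermediate records that would be shuffled).
    """
    bloom = BloomFilter()
    shuffled = defaultdict(lambda: ([], []))

    # "Map" over the first dataset: build the filter from its join keys.
    for key, value in left:
        bloom.add(key)
        shuffled[key][0].append(value)

    # "Map" over the second dataset: drop records whose key cannot match.
    emitted = 0
    for key, value in right:
        if bloom.might_contain(key):
            shuffled[key][1].append(value)
            emitted += 1

    # "Reduce": pair up values that share a join key.
    joined = [(k, l, r) for k, (ls, rs) in shuffled.items() for l in ls for r in rs]
    return sorted(joined), emitted
```

False positives only let extra records through to the reduce step, where they find no partner, so the join result stays correct while the number of shuffled records shrinks.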

SNU Open Repository and Archive

Enumerating Subgraph Instances Using Map-Reduce

Authors: Foto N. Afrati, Dimitris Fotakis, Jeffrey D. Ullman
Publication date: 01/01/2012

The theme of this paper is how to find all instances of a given "sample" graph in a larger "data graph," using a single round of map-reduce. For the simplest sample graph, the triangle, we improve upon the best known such algorithm. We then examine the general case, considering both the communication cost between mappers and reducers and the total computation cost at the reducers. To minimize communication cost, we exploit the techniques of (Afrati and Ullman, TKDE 2011) for computing multiway joins (evaluating conjunctive queries) in a single map-reduce round. Several methods are shown for translating sample graphs into a union of conjunctive queries with as few queries as possible. We also address the matter of optimizing computation cost. Many serial algorithms are shown to be "convertible," in the sense that it is possible to partition the data graph, explore each partition in a separate reducer, and have the total computation cost at the reducers be of the same order as the computation cost of the serial algorithm.

Comment: 37 pages
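
To make the single-round setting concrete, here is a simplified sketch of partition-based triangle enumeration. It is not the improved algorithm from the paper: it assumes integer node ids, a modulo bucket function, and hypothetical helper names (`map_edge`, `reduce_triangles`, `enumerate_triangles`). Nodes are hashed into `b` buckets, each edge is sent to every reducer whose bucket triple contains both endpoint buckets, and each reducer reports only the triangles whose bucket triple equals its own key, so every triangle is output exactly once.

```python
from collections import defaultdict


def bucket(node, b):
    # Assumed hash function: integer node ids reduced modulo b.
    return node % b


def map_edge(edge, b):
    """Send an edge to every reducer (sorted bucket triple) that may need it."""
    u, v = edge
    if u == v:
        return  # self-loops cannot be part of a triangle
    keys = {tuple(sorted((bucket(u, b), bucket(v, b), z))) for z in range(b)}
    for key in keys:
        yield key, (min(u, v), max(u, v))


def reduce_triangles(key, edges, b):
    """Find triangles among the edges of one reducer, without duplicates."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    for u, v in edges:  # edges are stored with u < v
        for w in adj[u] & adj[v]:
            if w > v:  # report each triangle once, as u < v < w
                tri = (u, v, w)
                # Only the reducer that "owns" this bucket triple emits it.
                if tuple(sorted(bucket(x, b) for x in tri)) == key:
                    yield tri


def enumerate_triangles(edges, b=3):
    """Simulate one map-reduce round in memory."""
    groups = defaultdict(set)
    for edge in edges:
        for key, value in map_edge(edge, b):
            groups[key].add(value)
    result = []
    for key, group in groups.items():
        result.extend(reduce_triangles(key, group, b))
    return sorted(result)
```

Each edge is replicated to at most `b` reducers in this sketch, which is the communication cost the paper's techniques aim to reduce.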

arXiv.org e-Print Archive

CiteSeerX

Three-Way Joins on MapReduce: An Experimental Study

Authors: Ben Kimmett, Alex Thomo, S. Venkatesh
Publication date: 15/05/2014

We study three-way joins on MapReduce. Joins are very useful in a multitude of applications, from data integration and traversing social networks to mining graphs and automata-based constructions. However, joins are expensive, even for moderate data sets; we need efficient algorithms to perform distributed computation of joins using clusters of many machines. MapReduce has become an increasingly popular distributed computing system and programming paradigm. We consider a state-of-the-art MapReduce multi-way join algorithm by Afrati and Ullman and show when it is appropriate for use on very large data sets. By providing a detailed experimental study, we demonstrate that this algorithm scales much better than what is suggested by the original paper. However, if the join result needs to be summarized or aggregated, as opposed to being only enumerated, then the aggregation step can be integrated into a cascade of two-way joins, which makes the cascade more efficient than the one-round algorithm and thus the preferred solution.

Comment: 6 pages
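
The two strategies compared in this abstract can be sketched in memory for a chain join R(a, b) ⋈ S(b, c) ⋈ T(c, d). The sketch is illustrative only and rests on assumptions not in the abstract: the function names, the reducer grid sizes `kb` and `kc`, and Python's built-in `hash` as the partitioning function. In the one-round version, reducers form a `kb` by `kc` grid keyed on hashed `b` and `c`; each S tuple goes to one reducer, while R tuples are replicated along the `c` dimension and T tuples along the `b` dimension.

```python
from collections import defaultdict


def one_round_three_way_join(R, S, T, kb=2, kc=2):
    """One-round chain join on a kb x kc grid of simulated reducers."""
    reducers = defaultdict(lambda: {"R": [], "S": [], "T": []})

    for a, b in R:  # b is known, c is not: replicate across all c-buckets
        for j in range(kc):
            reducers[(hash(b) % kb, j)]["R"].append((a, b))
    for b, c in S:  # both join attributes known: exactly one reducer
        reducers[(hash(b) % kb, hash(c) % kc)]["S"].append((b, c))
    for c, d in T:  # c is known, b is not: replicate across all b-buckets
        for i in range(kb):
            reducers[(i, hash(c) % kc)]["T"].append((c, d))

    out = []
    for parts in reducers.values():
        r_by_b = defaultdict(list)
        for a, b in parts["R"]:
            r_by_b[b].append(a)
        t_by_c = defaultdict(list)
        for c, d in parts["T"]:
            t_by_c[c].append(d)
        for b, c in parts["S"]:
            for a in r_by_b.get(b, []):
                for d in t_by_c.get(c, []):
                    out.append((a, b, c, d))
    return sorted(out)


def cascade_join(R, S, T):
    """Cascade of two-way joins: (R join S) first, then join with T."""
    s_by_b = defaultdict(list)
    for b, c in S:
        s_by_b[b].append(c)
    rs = [(a, b, c) for a, b in R for c in s_by_b.get(b, [])]

    t_by_c = defaultdict(list)
    for c, d in T:
        t_by_c[c].append(d)
    return sorted((a, b, c, d) for a, b, c in rs for d in t_by_c.get(c, []))
```

Both functions return the same tuples; they differ in how much data is replicated or materialized along the way, which is the trade-off the experimental study examines.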

arXiv.org e-Print Archive

Crossref

The Family of MapReduce and Large Scale Data Processing Systems

Authors: Anna Liu, Ayman G. Fayoumi, Sherif Sakr
Publication date: 12/02/2013

In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architecture and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many follow-up research efforts since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.

Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other authors

arXiv.org e-Print Archive

CiteSeerX