Join Processing with Filtering Techniques on a MapReduce Cluster
Thesis (Ph.D.) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, February 2014. 김형주.
The join operation is one of the essential operations for data analysis because large datasets must be joined to analyze heterogeneous data collected from different sources. MapReduce is a very useful framework for large-scale data analysis, but it is not well suited to joining multiple datasets, because it may produce a large number of redundant intermediate results, irrespective of the size of the joined records. Several existing approaches have been employed to improve join performance, but they can be used only in specific circumstances or may require multiple MapReduce jobs. To alleviate this problem, this dissertation proposes MFR-Join, a general join framework for processing equi-joins with filtering techniques in MapReduce. MFR-Join filters out redundant intermediate records within a single MapReduce job by applying filters in the map phase. To achieve this, the MapReduce framework is modified in two ways. First, map tasks are scheduled according to the processing order of the input datasets. Second, filters are created dynamically from the join keys of the datasets in a distributed manner. Various filtering techniques that support specific desirable operations can be plugged into MFR-Join. Adaptive join processing methods are also proposed for cases where join processing with filters performs worse than join processing without them: the filters are applied according to their performance, which is estimated in terms of the false positive rate. Furthermore, two map task scheduling policies are provided: synchronous and asynchronous scheduling. The concept of filtering techniques is also extended to multi-way joins. Methods for applying filters are proposed for two types of multi-way joins: common attribute joins and distinct attribute joins.
The experimental results showed that the proposed approach outperformed existing join algorithms and reduced the size of intermediate results when small portions of input datasets were joined.
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Research Background and Motivation
1.2 Contributions
1.2.1 Join Processing with Filtering Techniques in MapReduce
1.2.2 Adaptive Join Processing with Filtering Techniques in MFR-Join
1.2.3 Multi-way Join Processing in MFR-Join
1.3 Dissertation Overview
2 Preliminaries and Related Work
2.1 MapReduce
2.2 Parallel and Distributed Join Algorithms in DBMS
2.3 Join Algorithms in MapReduce
2.3.1 Map-side joins
2.3.2 Reduce-side joins
2.4 Multi-way Joins in MapReduce
2.5 Filtering Techniques for Join Processing
3 MFR-Join: A General Join Framework with Filtering Techniques in MapReduce
3.1 MFR-Join Framework
3.1.1 Execution Overview
3.1.2 Map Task Scheduling
3.1.3 Filter Construction
3.1.4 Filtering Techniques Applicable to MFR-Join
3.1.5 API and Parameters
3.2 Cost Analysis
3.2.1 Cost Model
3.2.2 Effects of the Filters
3.3 Evaluation
3.3.1 Experimental Setup
3.3.2 Experimental Results
4 Adaptive Join Processing with Filtering Techniques in MFR-Join
4.1 Adaptive join processing in MFR-Join
4.1.1 Execution Overview
4.1.2 Additional Filter Operations for Adaptive Joins
4.1.3 Early Detection of FPR Threshold Being Exceeded
4.1.4 Map Task Scheduling Policies
4.1.5 Additional Parameters for Adaptive Joins
4.2 Join Cost and FPR Threshold Analysis
4.2.1 Cost of Adaptive Join
4.2.2 Effects of FPR Threshold
4.2.3 Effects of Map Task Scheduling Policy
4.3 Evaluation
4.3.1 Experimental Setup
4.3.2 Experimental Results
5 Multi-way Join Processing in MFR-Join
5.1 Applying filters to multi-way joins
5.1.1 Common Attribute Joins
5.1.2 Distinct Attribute Joins
5.1.3 General Multi-way Joins
5.1.4 Cost Analysis
5.2 Implementation Details
5.2.1 Partition Assignment
5.2.2 MapReduce Functions
5.3 Evaluation
5.3.1 Common Attribute Joins
5.3.2 Distinct attribute joins
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future Work
6.2.1 Integration with Data Warehouse Systems
6.2.2 Join-based Applications
6.2.3 Improving Scalability
References
Summary (in Korean)
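The filtering idea described in the MFR-Join abstract can be illustrated with a minimal single-process sketch. It assumes a Bloom filter as the pluggable filtering technique (the abstract does not name one; the mention of a false positive rate only suggests a probabilistic filter). The names BloomFilter and filtered_equi_join and the filter sizes are hypothetical, and the sketch does not model the modified map task scheduling or distributed filter construction.

```python
import hashlib
from collections import defaultdict


class BloomFilter:
    """Tiny Bloom filter; the sizes are illustrative assumptions, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions by salting a SHA-256 digest.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # May return True for an absent key (false positive), never False
        # for a key that was added.
        return all(self.bits[pos] for pos in self._positions(key))


def filtered_equi_join(left, right):
    """Equi-join two lists of (key, value) records.

    Returns (sorted joined rows, number of right records dropped by the filter).
    """
    bloom = BloomFilter()
    groups = defaultdict(lambda: ([], []))

    # "Map" over the first dataset: collect its records and build the filter
    # from its join keys.
    for key, value in left:
        bloom.add(key)
        groups[key][0].append(value)

    # "Map" over the second dataset: drop records whose key the filter rules
    # out, so they never become intermediate results.
    dropped = 0
    for key, value in right:
        if not bloom.might_contain(key):
            dropped += 1
            continue
        groups[key][1].append(value)

    # "Reduce": per-key cross product. A false positive leaves an empty left
    # list for that key, so it contributes no output rows.
    joined = [(k, l, r) for k, (ls, rs) in groups.items() for l in ls for r in rs]
    return sorted(joined), dropped
```

Calling filtered_equi_join([(1, "a"), (2, "b")], [(2, "x"), (3, "y"), (2, "z")]) joins only the records with key 2. The record with key 3 is normally dropped before the reduce step, although a false positive can let it through without affecting the join result.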
Towards More Data-Aware Application Integration (extended version)
Although most business application data is stored in relational databases,
programming languages and wire formats in integration middleware systems are
not table-centric. To avoid costly format conversions and data shipments and
to speed up computation, the trend is to "push down" the integration operations
closer to the storage representation.
We address the alternative case of defining declarative, table-centric
integration semantics within standard integration systems. For that, we replace
the current operator implementations for the well-known Enterprise Integration
Patterns by equivalent "in-memory" table processing, and show a practical
realization in a conventional integration system for a non-reliable,
"data-intensive" messaging example. The results of the runtime analysis show
that table-centric processing is promising already in standard, "single-record"
message routing and transformations, and can potentially excel in message
throughput for "multi-record" table messages.
Comment: 18 pages, extended version of the contribution to British
International Conference on Databases (BICOD), 2015, Edinburgh, Scotland
An Expressive Language and Efficient Execution System for Software Agents
Software agents can be used to automate many of the tedious, time-consuming
information processing tasks that humans currently have to complete manually.
However, to do so, agent plans must be capable of representing the myriad of
actions and control flows required to perform those tasks. In addition, since
these tasks can require integrating multiple sources of remote information
(typically a slow, I/O-bound process), it is desirable to make execution as
efficient as possible. To address both of these needs, we present a flexible
software agent plan language and a highly parallel execution system that enable
the efficient execution of expressive agent plans. The plan language allows
complex tasks to be more easily expressed by providing a variety of operators
for flexibly processing the data as well as supporting subplans (for
modularity) and recursion (for indeterminate looping). The executor is based on
a streaming dataflow model of execution to maximize the amount of operator and
data parallelism possible at runtime. We have implemented both the language and
executor in a system called THESEUS. Our results from testing THESEUS show that
streaming dataflow execution can yield significant speedups over both
traditional serial (von Neumann) and non-streaming dataflow-style
execution that existing software and robot agent execution systems currently
support. In addition, we show how plans written in the language we present can
represent certain types of subtasks that cannot be accomplished using the
languages supported by network query engines. Finally, we demonstrate that the
increased expressivity of our plan language does not hamper performance;
specifically, we show how data can be integrated from multiple remote sources
just as efficiently using our architecture as is possible with a
state-of-the-art streaming-dataflow network query engine.
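The difference between streaming and non-streaming dataflow execution mentioned in the THESEUS abstract can be sketched with Python generators. This is only an illustration of pipelining under assumed names (fetch, select, run_streaming, run_non_streaming are hypothetical); it is not the THESEUS plan language or executor, and it does not model the operator and data parallelism the abstract describes.

```python
def fetch(source, log):
    """Produce items one at a time, as a slow remote source might."""
    for item in source:
        log.append(f"fetch {item}")
        yield item


def select(items, predicate, log):
    """Filter operator that handles each item as soon as it arrives."""
    for item in items:
        log.append(f"select {item}")
        if predicate(item):
            yield item


def run_streaming(source, predicate):
    # Operators are chained lazily: each item flows through the whole
    # pipeline before the next one is fetched.
    log = []
    out = list(select(fetch(source, log), predicate, log))
    return out, log


def run_non_streaming(source, predicate):
    # The downstream operator starts only after the upstream operator
    # has produced its complete output.
    log = []
    fetched = list(fetch(source, log))
    out = list(select(fetched, predicate, log))
    return out, log
```

Both functions return the same results for the same input; only the order of events in the log differs. In the streaming version each "select" entry directly follows the matching "fetch" entry, which is what lets downstream work overlap with slow I/O in a real parallel executor.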
Ringo: Interactive Graph Analytics on Big-Memory Machines
We present Ringo, a system for analysis of large graphs. Graphs provide a way
to represent and analyze systems of interacting objects (people, proteins,
webpages) with edges between the objects denoting interactions (friendships,
physical interactions, links). Mining graphs provides valuable insights about
individual objects as well as the relationships among them.
In building Ringo, we take advantage of the fact that machines with large
memory and many cores are widely available and also relatively affordable. This
allows us to build an easy-to-use interactive high-performance graph analytics
system. Graphs also need to be built from input data, which often resides in
the form of relational tables. Thus, Ringo provides rich functionality for
manipulating raw input data tables into various kinds of graphs. Furthermore,
Ringo also provides over 200 graph analytics functions that can then be applied
to constructed graphs.
We show that a single big-memory machine provides a very attractive platform
for performing analytics on all but the largest graphs as it offers excellent
performance and ease of use as compared to alternative approaches. With Ringo,
we also demonstrate how to integrate graph analytics with an iterative process
of trial-and-error data exploration and rapid experimentation, common in data
mining workloads.
Comment: 6 pages, 2 figures
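The table-to-graph step described in the Ringo abstract can be sketched in plain Python. This is not Ringo's API: the function names table_to_graph and degrees, and the column names in the usage note, are hypothetical, and the sketch assumes an undirected graph built from two columns of a table held as a list of dictionaries.

```python
from collections import defaultdict


def table_to_graph(rows, src_col, dst_col):
    """Build an undirected adjacency map from two columns of a table.

    rows is a list of dicts, one per table row. Self-loops are skipped
    and duplicate edges collapse because neighbors are stored in sets.
    """
    adjacency = defaultdict(set)
    for row in rows:
        u, v = row[src_col], row[dst_col]
        if u == v:
            continue
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def degrees(adjacency):
    """A simple analytics function: number of neighbors of each node."""
    return {node: len(neighbors) for node, neighbors in adjacency.items()}
```

For example, a friendship table with rows such as {"person": "ann", "friend": "bob"} can be passed as table_to_graph(rows, "person", "friend"), and the resulting adjacency map can then be given to degrees.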
Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources
Apache Calcite is a foundational software framework that provides query
processing, optimization, and query language support to many popular
open-source data processing systems such as Apache Hive, Apache Storm, Apache
Flink, Druid, and MapD. Calcite's architecture consists of a modular and
extensible query optimizer with hundreds of built-in optimization rules, a
query processor capable of processing a variety of query languages, an adapter
architecture designed for extensibility, and support for heterogeneous data
models and stores (relational, semi-structured, streaming, and geospatial).
This flexible, embeddable, and extensible architecture is what makes Calcite an
attractive choice for adoption in big-data frameworks. It is an active project
that continues to introduce support for new types of data sources, query
languages, and approaches to query processing and optimization.
Comment: SIGMOD'1