Efficient Multi-way Theta-Join Processing Using MapReduce
Multi-way Theta-join queries are powerful in describing complex relations and
are therefore widely used in practice. However, existing solutions from
traditional distributed and parallel databases for multi-way Theta-join queries
cannot be easily extended to fit a shared-nothing distributed computing
paradigm, which is proven to be able to support OLAP applications over immense
data volumes. In this work, we study the problem of efficient processing of
multi-way Theta-join queries using MapReduce from a cost-effective perspective.
Although there have been some works using the (key,value) pair-based
programming model to support join operations, efficient processing of multi-way
Theta-join queries has never been fully explored. The substantial challenge
lies in, given a number of processing units (that can run Map or Reduce tasks),
mapping a multi-way Theta-join query to a number of MapReduce jobs and having
them executed in a well scheduled sequence, such that the total processing time
span is minimized. Our solution consists of two main parts: 1) cost metrics for
both a single MapReduce job and a number of MapReduce jobs executed in a certain
order; 2) the efficient execution of a chain-typed Theta-join with only one
MapReduce job. Compared with the query evaluation strategy proposed in [23]
and the widely adopted Pig Latin and Hive SQL solutions, our method achieves
a significant improvement in join processing efficiency.
Comment: VLDB201
DualTable: A Hybrid Storage Model for Update Optimization in Hive
Hive is the most mature and prevalent data warehouse tool providing a SQL-like
interface in the Hadoop ecosystem. It is successfully used in many Internet
companies and shows its value for big data processing in traditional
industries. However, enterprise big data processing systems, such as those in
Smart Grid applications, usually require complex business logic and involve many
data manipulation operations such as updates and deletes. Hive cannot offer
sufficient support for these while preserving high query performance. Hive using
the Hadoop Distributed File System (HDFS) for storage cannot implement data
manipulation efficiently, and Hive on HBase suffers from poor query performance
even though it supports faster data manipulation. There is a project based on
Hive issue Hive-5317 to support update operations, but it has not been finished
in Hive's latest version. Since this ACID-compliant extension adopts the same
data storage format on HDFS, the update performance problem is not solved.
In this paper, we propose a hybrid storage model called DualTable, which
combines the efficient streaming reads of HDFS and the random write capability
of HBase. Hive on DualTable provides better data manipulation support and
preserves query performance at the same time. Experiments on a TPC-H data set
and on a real smart grid data set show that Hive on DualTable is up to 10 times
faster than Hive when executing update and delete operations.
Comment: accepted by industry session of ICDE201
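The abstract above describes combining streaming reads with random writes but does not spell out the mechanism. The following is a minimal sketch of one way such a hybrid could behave, assuming an immutable base plus a keyed delta overlay that is merged at read time. The class `HybridTable`, its methods, and the `(key, value)` row schema are hypothetical illustrations, not the DualTable API.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Row = Tuple[int, str]  # (key, value); hypothetical schema for illustration


class HybridTable:
    """Toy hybrid store: immutable base rows plus a random-write delta overlay."""

    def __init__(self, base_rows: List[Row]) -> None:
        # Stands in for bulk data that is read sequentially (the HDFS role).
        self._base: List[Row] = list(base_rows)
        # Stands in for a random-access store (the HBase role).
        # A value of None marks the key as deleted.
        self._delta: Dict[int, Optional[str]] = {}

    def update(self, key: int, value: str) -> None:
        # Assumption: updates only override existing base rows; inserts are out of scope.
        self._delta[key] = value

    def delete(self, key: int) -> None:
        self._delta[key] = None

    def scan(self) -> Iterator[Row]:
        """Stream the base rows in order, applying the delta on the fly."""
        for key, value in self._base:
            if key in self._delta:
                new_value = self._delta[key]
                if new_value is not None:  # skip rows marked as deleted
                    yield key, new_value
            else:
                yield key, value

    def compact(self) -> None:
        """Fold the delta into the base so later scans need no merging."""
        self._base = list(self.scan())
        self._delta.clear()


table = HybridTable([(1, "a"), (2, "b"), (3, "c")])
table.update(2, "B")
table.delete(3)
result = list(table.scan())
print(result)
```

In this sketch an update or delete touches only the small overlay, while a scan still reads the base sequentially. The cost is the per-row lookup during the scan, which `compact` removes by rewriting the base.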
Efficient String Similarity Joins using MapReduce
Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, 2015. 8. Kyuseok Shim.
String similarity join is a very important and frequently used query in the database field. Recently, the fuzzy token Jaccard similarity was proposed, combining the advantages of token-based and character-based similarity. However, joins using the fuzzy token Jaccard similarity take too long to run, which has made them hard to use on large-scale data. To overcome this, we propose a new distributed parallel algorithm based on the MapReduce framework, together with a new signature for it. We compare its performance experimentally against the existing single-machine algorithm and observe a speedup of up to 7 times when using 20 computers. We also confirm that the running time of the distributed similarity join algorithm decreases effectively as the number of computers increases.
Table of contents
Abstract (Korean)
Contents
Chapter 1 Introduction
Section 1 Research background and content
Chapter 2 Related work
Section 1 Distributed parallel processing
Section 2 String similarity
Section 3 String similarity join
Chapter 3 Distributed similarity join
Section 1 Token frequency counting
Section 2 Signature generation
Section 3 Character-based similarity join
Section 4 Work distribution
Section 5 Verification
Chapter 4 Experiments and results
Section 1 Comparison with the single-machine algorithm
Section 2 Running time and efficiency by number of computers
Chapter 5 Conclusion
References
Abstract (English)
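To make the notion of a string similarity join concrete, here is a minimal single-machine sketch using plain token-based Jaccard similarity. It is not the fuzzy token Jaccard similarity or the MapReduce algorithm from the thesis. The function names, the whitespace tokenizer, and the threshold value are all assumptions chosen for illustration.

```python
from itertools import combinations
from typing import FrozenSet, List, Tuple


def tokenize(text: str) -> FrozenSet[str]:
    # Assumption: lowercase whitespace tokens; real systems may tokenize differently.
    return frozenset(text.lower().split())


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    # |A intersect B| / |A union B|; two empty sets are treated as identical.
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def similarity_self_join(
    strings: List[str], threshold: float
) -> List[Tuple[str, str, float]]:
    """Return all pairs whose token Jaccard similarity is at least `threshold`."""
    token_sets = [tokenize(s) for s in strings]
    pairs = []
    # Brute force over all pairs: quadratic, which is why large inputs
    # motivate signatures and distributed processing.
    for i, j in combinations(range(len(strings)), 2):
        score = jaccard(token_sets[i], token_sets[j])
        if score >= threshold:
            pairs.append((strings[i], strings[j], score))
    return pairs


records = [
    "efficient string similarity join",
    "string similarity join",
    "temporal join",
]
matches = similarity_self_join(records, threshold=0.5)
print(matches)
```

Only the first two records share enough tokens (3 of 4 distinct tokens) to pass the 0.5 threshold. The all-pairs loop is the part that signature-based filtering is meant to avoid.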
Distributed Evaluation of Top-k Temporal Joins
To appear in SIGMOD'16. We study a particular kind of join, coined Ranked Temporal Join (RTJ), featuring predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring, task scheduling, and tweet analysis. RTJ queries are often best interpreted as top-k queries where only the best matches are returned. We show how to exploit the nature of temporal predicates and the properties of their associated scoring semantics to design TKIJ, an efficient query evaluation approach on a distributed Map-Reduce architecture. TKIJ relies on an offline statistics computation that, given a time partitioning into granules, computes the distribution of intervals' endpoints in each granule, and an online computation that generates query-dependent score bounds. Those statistics are used for workload assignment to reducers. This aims at reducing data replication, to limit I/O cost. Additionally, high-scoring results are distributed evenly to enable each reducer to prune unnecessary results. Our extensive experiments on synthetic and real datasets show that TKIJ outperforms state-of-the-art competitors and provides very good performance for n-ary RTJ queries on temporal data.
Scalable and Adaptive Online Joins
Scalable join processing in a parallel shared-nothing environment requires a partitioning policy that evenly distributes the processing load while minimizing the size of state maintained and number of messages communicated. Previous research proposes static partitioning schemes that require statistics beforehand. In an online or streaming environment in which no statistics about the workload are known, traditional static approaches perform poorly. This paper presents a novel parallel online dataflow join operator that supports arbitrary join predicates. The proposed operator continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. The operator is resilient to data skew, maintains high throughput rates, avoids blocking behavior during state repartitioning, takes an eventual consistency approach for maintaining its local state, and behaves strongly consistently as a black-box dataflow operator. We prove that the operator ensures a constant competitive ratio of 3.75 in data distribution optimality and that the cost of processing an input tuple is amortized constant, taking into account adaptivity costs. Our evaluation demonstrates that our operator outperforms the state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time.
Near-Optimal Distributed Band-Joins through Recursive Partitioning
We consider running-time optimization for band-joins in a distributed system,
e.g., the cloud. To balance load across worker machines, input has to be
partitioned, which causes duplication. We explore how to resolve this tension
between maximum load per worker and input duplication for band-joins between
two relations. Previous work suffered from high optimization cost or considered
partitionings that were too restricted (resulting in suboptimal join
performance). Our main insight is that recursive partitioning of the
join-attribute space with the appropriate split scoring measure can achieve
both low optimization cost and low join cost. It is the first approach that is
not only effective for one-dimensional band-joins but also for joins on
multiple attributes. Experiments indicate that our method is able to find
partitionings that are within 10% of the lower bound for both maximum load per
worker and input duplication for a broad range of settings, significantly
improving over previous work.
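The tension between maximum load per worker and input duplication can be illustrated with a small sketch. It assumes a one-dimensional band-join with condition |r - s| <= eps and a fixed set of split points, which is a plain range partitioning and not the recursive partitioning proposed in the paper. The function name `partition_band_join` and the sample values are hypothetical.

```python
import bisect
from typing import List, Tuple


def partition_band_join(
    r_vals: List[int], s_vals: List[int], eps: int, splits: List[int]
) -> Tuple[List[Tuple[int, int]], int, int]:
    """Range-partition S; copy each R value to every partition that may hold a match.

    Returns (join result, maximum load on any partition, number of extra R copies).
    `splits` must be sorted; partition p holds values v with splits[p-1] <= v < splits[p].
    """
    n_parts = len(splits) + 1
    r_parts: List[List[int]] = [[] for _ in range(n_parts)]
    s_parts: List[List[int]] = [[] for _ in range(n_parts)]

    # Each S value lives in exactly one partition, so no result is produced twice.
    for s in s_vals:
        s_parts[bisect.bisect_right(splits, s)].append(s)

    # An R value must reach every partition overlapping [r - eps, r + eps].
    for r in r_vals:
        first = bisect.bisect_right(splits, r - eps)
        last = bisect.bisect_right(splits, r + eps)
        for p in range(first, last + 1):
            r_parts[p].append(r)

    # Local join inside each partition, as a worker would do.
    results = sorted(
        (r, s)
        for rp, sp in zip(r_parts, s_parts)
        for r in rp
        for s in sp
        if abs(r - s) <= eps
    )
    max_load = max(len(rp) + len(sp) for rp, sp in zip(r_parts, s_parts))
    duplication = sum(len(rp) for rp in r_parts) - len(r_vals)
    return results, max_load, duplication


R = [1, 5, 9, 10]
S = [2, 6, 10, 11]
split_result, split_load, split_dup = partition_band_join(R, S, eps=1, splits=[5, 10])
single_result, single_load, single_dup = partition_band_join(R, S, eps=1, splits=[])
print(split_result, split_load, split_dup)
print(single_result, single_load, single_dup)
```

With three partitions the maximum load drops from 8 tuples to 4, but three extra copies of R values are shipped. With a single partition there is no duplication and no load balancing. Both settings return the same join result.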