14 research outputs found

Efficient Multi-way Theta-Join Processing Using MapReduce

Authors: Chen Lei, Wang Min, Zhang Xiaofei
Publication date: 01/01/2012

Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely employed in practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which has been shown to support OLAP applications over immense data volumes. In this work, we study the problem of efficiently processing multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although some prior works use the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge is, given a number of processing units (that can run Map or Reduce tasks), to map a multi-way Theta-join query to a number of MapReduce jobs and execute them in a well-scheduled sequence such that the total processing time span is minimized. Our solution has two main parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves a significant improvement in join processing efficiency. Comment: VLDB201
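For readers unfamiliar with the term, the semantics of a chain-typed Theta-join can be sketched in a few lines of Python. This is only a naive single-machine reference for what such a query returns, not the single-job MapReduce execution the abstract describes; the function name `chain_theta_join` and the sample relations are hypothetical.

```python
from itertools import product
from operator import lt, le


def chain_theta_join(relations, predicates):
    """Naive chain theta-join: relations[i] joins relations[i + 1] under predicates[i].

    Each predicate takes (left_row, right_row) and returns a bool.
    This enumerates the full cross product, so it is for illustration only.
    """
    if len(predicates) != len(relations) - 1:
        raise ValueError("need exactly one predicate per adjacent pair of relations")
    results = []
    for combo in product(*relations):
        # Keep the combination only if every adjacent pair satisfies its predicate.
        if all(pred(combo[i], combo[i + 1]) for i, pred in enumerate(predicates)):
            results.append(combo)
    return results


# Hypothetical single-column relations: R1.a < R2.b and R2.b <= R3.c
r1 = [1, 5]
r2 = [3, 6]
r3 = [4, 7]
out = chain_theta_join([r1, r2, r3], [lt, le])
print(out)  # [(1, 3, 4), (1, 3, 7), (1, 6, 7), (5, 6, 7)]
```

The cross product grows multiplicatively with each relation, which is why the scheduling and cost questions raised in the abstract matter at scale.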

arXiv.org e-Print Archive

University of Memphis Digital Commons

CiteSeerX

Hong Kong University of Science and Technology Institutional Repository

DualTable: A Hybrid Storage Model for Update Optimization in Hive

Authors: Hu Songlin, Huang Shuo, Jacobsen Hans-Arno, Liang Ying, Liu Wantao, Pei Xubin, Rabl Tilmann, Wang Jiye, Xiao Zheng
Publication date: 01/12/2014

Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems, as in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations such as updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently, and Hive on HBase suffers from poor query performance even though it can support faster data manipulation. There is a project based on Hive issue Hive-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID-compliant extension adopts the same data storage format on HDFS, the update performance problem is not solved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS and the random write capability of HBase. Hive on DualTable provides better data manipulation support and preserves query performance at the same time. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations. Comment: accepted by industry session of ICDE201
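The general idea of pairing a sequentially scanned store with a random-write store can be illustrated with a toy in-memory table. This is a sketch under stated assumptions, not DualTable's actual design: the class `HybridTable`, the tombstone marker, and the merge-on-read behaviour are all hypothetical, a plain list stands in for HDFS files, and a dict stands in for HBase.

```python
class HybridTable:
    """Toy table: immutable base rows plus a small random-access delta store."""

    _DELETED = object()  # tombstone marking a deleted key

    def __init__(self, base_rows):
        # base_rows: (key, value) pairs that are scanned in order and never rewritten
        self._base = list(base_rows)
        # delta: key -> new value or tombstone, written at random
        self._delta = {}

    def update(self, key, value):
        """Record a new value without touching the base rows."""
        self._delta[key] = value

    def delete(self, key):
        """Record a delete as a tombstone in the delta store."""
        self._delta[key] = self._DELETED

    def scan(self):
        """Stream base rows in order, overlaying updates and skipping deletes.

        Inserts of brand-new keys are out of scope for this sketch.
        """
        for key, value in self._base:
            current = self._delta.get(key, value)
            if current is not self._DELETED:
                yield key, current


table = HybridTable([(1, "a"), (2, "b"), (3, "c")])
table.update(2, "B")
table.delete(3)
rows = list(table.scan())
print(rows)  # [(1, 'a'), (2, 'B')]
```

In this sketch an update costs one dictionary write rather than a rewrite of the base data, while reads pay a per-row lookup in the delta store.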

arXiv.org e-Print Archive

Crossref

Efficient String Similarity Joins using MapReduce

Author: 이창형
Publication venue: Graduate School, Seoul National University
Publication date: 01/08/2015

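Thesis (Master's) -- Graduate School, Seoul National University: Department of Electrical and Computer Engineering, August 2015. 심규석.

The string similarity join is a very important and frequently used query in the database field. Recently, the fuzzy token Jaccard similarity was proposed, which combines the advantages of token-based and character-based similarity. However, a join using the fuzzy token Jaccard similarity takes too long to run, which has made it difficult to use on large data sets. To overcome this, we propose a new distributed parallel processing algorithm that uses the MapReduce framework, together with a new signature for it. We compared its performance experimentally with the existing single-machine algorithm and confirmed a speedup of up to 7 times when using 20 computers. We also confirmed that the running time of the distributed similarity join algorithm decreases effectively as the number of computers increases.

Contents: Abstract (Korean); Table of contents; Chapter 1. Introduction (1.1 Research background and content); Chapter 2. Related work (2.1 Distributed parallel processing; 2.2 String similarity; 2.3 String similarity join); Chapter 3. Distributed similarity join (3.1 Token frequency counting; 3.2 Signature generation; 3.3 Character-based similarity join; 3.4 Work distribution; 3.5 Verification); Chapter 4. Experiments and results (4.1 Comparison with the single-machine algorithm; 4.2 Running time and efficiency by number of computers); Chapter 5. Conclusion; References; Abstract. Maste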
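The two families of measures that the abstract says are combined can each be sketched with one common representative: token-based Jaccard similarity and character-based edit similarity. This is not the fuzzy token Jaccard measure itself, nor the thesis's algorithm or signature scheme; the function names and the whitespace tokenization are assumptions made for illustration.

```python
def jaccard(a, b):
    """Token-based similarity: shared tokens divided by all distinct tokens."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def edit_distance(s, t):
    """Levenshtein distance computed row by row with dynamic programming."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_similarity(s, t):
    """Character-based similarity: 1 minus edit distance over the longer length."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


print(jaccard("data base join", "data join query"))  # 0.5
print(edit_distance("kitten", "sitting"))            # 3
print(edit_similarity("abcd", "abcf"))               # 0.75
```

Token-based measures ignore small typos inside a token, while character-based measures are sensitive to token reordering, which is the usual motivation for combining the two.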

SNU Open Repository and Archive

Distributed Evaluation of Top-k Temporal Joins

Authors: Amer-Yahia Sihem, Leroy Vincent, Pilourdault Julien
Publication venue: Association for Computing Machinery (ACM)
Publication date: 01/02/2016

To appear in SIGMOD'16. We study a particular kind of join, coined Ranked Temporal Join (RTJ), featuring predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring, task scheduling, and tweet analysis. RTJ queries are often best interpreted as top-k queries where only the best matches are returned. We show how to exploit the nature of temporal predicates and the properties of their associated scoring semantics to design TKIJ, an efficient query evaluation approach on a distributed Map-Reduce architecture. TKIJ relies on an offline statistics computation that, given a time partitioning into granules, computes the distribution of intervals' endpoints in each granule, and an online computation that generates query-dependent score bounds. Those statistics are used for workload assignment to reducers. This aims at reducing data replication, to limit I/O cost. Additionally, high-scoring results are distributed evenly to enable each reducer to prune unnecessary results. Our extensive experiments on synthetic and real datasets show that TKIJ outperforms state-of-the-art competitors and provides very good performance for n-ary RTJ queries on temporal data.

Crossref

Hal - Université Grenoble Alpes

Scalable and Adaptive Online Joins

Authors: Elguindy Abdallah, ElSeidy Mohammed, Koch Christoph, Vitorovic Aleksandar
Publication venue: Hangzhou, China, VLDB
Publication date: 28/10/2013

Scalable join processing in a parallel shared-nothing environment requires a partitioning policy that evenly distributes the processing load while minimizing the size of the state maintained and the number of messages communicated. Previous research proposes static partitioning schemes that require statistics beforehand. In an online or streaming environment in which no statistics about the workload are known, traditional static approaches perform poorly. This paper presents a novel parallel online dataflow join operator that supports arbitrary join predicates. The proposed operator continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. The operator is resilient to data skew, maintains high throughput rates, avoids blocking behavior during state repartitioning, takes an eventual consistency approach for maintaining its local state, and behaves strongly consistently as a black-box dataflow operator. We prove that the operator ensures a constant competitive ratio of 3.75 in data distribution optimality and that the cost of processing an input tuple is amortized constant, taking into account adaptivity costs. Our evaluation demonstrates that our operator outperforms the state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time.

Infoscience - École polytechnique fédérale de Lausanne

Near-Optimal Distributed Band-Joins through Recursive Partitioning

Authors: Gatterbauer Wolfgang, Li Rundong, Riedewald Mirek
Publication venue: Association for Computing Machinery (ACM)
Publication date: 13/04/2020

We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is effective not only for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.
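The tension between partitioning and duplication can be seen in a small one-dimensional sketch. It uses fixed, hand-picked split points rather than the recursive partitioning the abstract proposes, and the function name `partitioned_band_join`, its parameters, and the sample data are hypothetical.

```python
from bisect import bisect_right


def partitioned_band_join(r_vals, s_vals, eps, boundaries):
    """Band-join |r - s| <= eps over 1-D range partitions.

    boundaries: sorted split points; partition i covers [boundaries[i-1], boundaries[i]).
    Each r goes to exactly one partition. Each s is copied to every partition
    whose range intersects [s - eps, s + eps], which is where duplication arises.
    Returns (matching pairs, number of extra copies of s values).
    """
    n_parts = len(boundaries) + 1
    r_parts = [[] for _ in range(n_parts)]
    s_parts = [[] for _ in range(n_parts)]
    for r in r_vals:
        r_parts[bisect_right(boundaries, r)].append(r)
    for s in s_vals:
        lo = bisect_right(boundaries, s - eps)
        hi = bisect_right(boundaries, s + eps)
        for p in range(lo, hi + 1):
            s_parts[p].append(s)
    pairs = []
    # Each partition is joined independently, as a separate worker would do.
    for rs, ss in zip(r_parts, s_parts):
        pairs.extend((r, s) for r in rs for s in ss if abs(r - s) <= eps)
    duplicates = sum(len(ss) for ss in s_parts) - len(s_vals)
    return pairs, duplicates


pairs, duplicates = partitioned_band_join([1, 5, 9], [4, 10], eps=1, boundaries=[5])
print(pairs)       # [(5, 4), (9, 10)]
print(duplicates)  # 1
```

Here the value 4 lies within `eps` of the split point 5, so it is sent to both partitions. Adding split points lowers the load per partition but tends to raise the number of such copies.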

arXiv.org e-Print Archive

Crossref