14 research outputs found

    Efficient Multi-way Theta-Join Processing Using MapReduce

    Multi-way Theta-join queries are powerful in describing complex relations and are therefore widely employed in practice. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which has been shown to support OLAP applications over immense data volumes. In this work, we study the problem of efficiently processing multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although some works have used the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and executing them in a well-scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both a single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Compared with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves a significant improvement in join processing efficiency. Comment: VLDB201
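    As a point of reference for the terminology in this abstract, the following is a minimal single-machine sketch of what a chain-typed Theta-join computes. It is not the paper's MapReduce algorithm or its cost model; the function names `theta_join` and `chain_theta_join` and the sample relations are hypothetical, chosen only for illustration.

```python
from itertools import product
import operator


def theta_join(left, right, predicate):
    """Naive nested-loop theta-join: keep every pair that satisfies the predicate."""
    return [(l, r) for l, r in product(left, right) if predicate(l, r)]


def chain_theta_join(relations, predicates):
    """Chain-typed multi-way theta-join R1, R2, ..., Rn.

    predicates[i] compares a tuple of relations[i] with a tuple of
    relations[i + 1]; each predicate may be any comparison (<, <=, !=, ...).
    """
    results = [(t,) for t in relations[0]]
    for rel, pred in zip(relations[1:], predicates):
        # Extend each partial result with every tuple that matches its last element.
        results = [path + (t,) for path in results for t in rel if pred(path[-1], t)]
    return results


# Hypothetical sample data: three single-column relations joined on r1 < r2 < r3.
r1, r2, r3 = [1, 5], [3, 7], [4, 6]
out = chain_theta_join([r1, r2, r3], [operator.lt, operator.lt])
print(out)
```

    The nested loops make the quadratic cost of arbitrary predicates visible, which is the cost a distributed evaluation has to spread across processing units.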

    DualTable: A Hybrid Storage Model for Update Optimization in Hive

    Hive is the most mature and prevalent data warehouse tool providing a SQL-like interface in the Hadoop ecosystem. It is successfully used in many Internet companies and shows its value for big data processing in traditional industries. However, enterprise big data processing systems, as in Smart Grid applications, usually require complicated business logic and involve many data manipulation operations such as updates and deletes. Hive cannot offer sufficient support for these while preserving high query performance. Hive using the Hadoop Distributed File System (HDFS) for storage cannot implement data manipulation efficiently, and Hive on HBase suffers from poor query performance even though it can support faster data manipulation. There is a project based on Hive issue Hive-5317 to support update operations, but it has not been finished in Hive's latest version. Since this ACID-compliant extension adopts the same data storage format on HDFS, the update performance problem is not solved. In this paper, we propose a hybrid storage model called DualTable, which combines the efficient streaming reads of HDFS and the random write capability of HBase. Hive on DualTable provides better data manipulation support and preserves query performance at the same time. Experiments on a TPC-H data set and on a real smart grid data set show that Hive on DualTable is up to 10 times faster than Hive when executing update and delete operations. Comment: accepted by industry session of ICDE201

    Efficient String Similarity Joins using MapReduce

    Thesis (Master's) -- Seoul National University Graduate School: Department of Electrical and Computer Engineering, August 2015. Kyuseok Shim. String similarity join is a very important and frequently used query in the database field. Recently, Fuzzy token Jaccard similarity, which combines the advantages of token-based and character-based similarity, was proposed. However, joins using Fuzzy token Jaccard similarity take too long to execute, which has made them difficult to use on large data sets. To overcome this, we propose a new distributed parallel processing algorithm that uses the MapReduce framework, together with a new signature for it. We compare its performance experimentally with the existing single-machine algorithm and confirm a performance improvement of up to 7 times when using 20 computers. We also confirm that the running time of the distributed similarity join algorithm decreases effectively as the number of computers increases.
    Contents: Abstract (Korean); Table of Contents; Chapter 1 Introduction (1.1 Background and Scope of the Research); Chapter 2 Related Work (2.1 Distributed Parallel Processing; 2.2 String Similarity; 2.3 String Similarity Join); Chapter 3 Distributed Similarity Join (3.1 Token Frequency Counting; 3.2 Signature Generation; 3.3 Character-based Similarity Join; 3.4 Work Distribution; 3.5 Verification); Chapter 4 Experiments and Results (4.1 Comparison with the Single-Machine Algorithm; 4.2 Running Time and Efficiency by Number of Computers); Chapter 5 Conclusion; References; Abstract. Maste
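    To make the notion of a string similarity join concrete, here is a minimal single-machine sketch using plain token-based Jaccard similarity. It is a simplification: it does not implement the Fuzzy token variant, the signatures, or the MapReduce algorithm of the thesis, and the names `jaccard` and `similarity_join` are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-based Jaccard similarity of two whitespace-tokenized strings."""
    ta, tb = set(a.split()), set(b.split())
    if not ta and not tb:
        return 1.0  # Convention assumed here: two empty strings are identical.
    return len(ta & tb) / len(ta | tb)


def similarity_join(strings, threshold):
    """Self-join: return all unordered pairs whose similarity meets the threshold."""
    return [(s, t) for s, t in combinations(strings, 2) if jaccard(s, t) >= threshold]


# Hypothetical sample data.
pairs = similarity_join(["a b c", "b c d", "x y"], 0.5)
print(pairs)
```

    Comparing every pair is quadratic in the number of strings, which is why practical algorithms first filter candidate pairs and only then verify them.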

    Distributed Evaluation of Top-k Temporal Joins

    To appear in SIGMOD'16. We study a particular kind of join, coined Ranked Temporal Join (RTJ), featuring predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring, task scheduling, and tweet analysis. RTJ queries are often best interpreted as top-k queries where only the best matches are returned. We show how to exploit the nature of temporal predicates and the properties of their associated scoring semantics to design TKIJ, an efficient query evaluation approach on a distributed Map-Reduce architecture. TKIJ relies on an offline statistics computation that, given a time partitioning into granules, computes the distribution of intervals' endpoints in each granule, and an online computation that generates query-dependent score bounds. Those statistics are used for workload assignment to reducers. This aims at reducing data replication, to limit I/O cost. Additionally, high-scoring results are distributed evenly to enable each reducer to prune unnecessary results. Our extensive experiments on synthetic and real datasets show that TKIJ outperforms state-of-the-art competitors and provides very good performance for n-ary RTJ queries on temporal data.

    Scalable and Adaptive Online Joins

    Scalable join processing in a parallel shared-nothing environment requires a partitioning policy that evenly distributes the processing load while minimizing the size of the state maintained and the number of messages communicated. Previous research proposes static partitioning schemes that require statistics beforehand. In an online or streaming environment in which no statistics about the workload are known, traditional static approaches perform poorly. This paper presents a novel parallel online dataflow join operator that supports arbitrary join predicates. The proposed operator continuously adjusts itself to the data dynamics through adaptive dataflow routing and state repartitioning. The operator is resilient to data skew, maintains high throughput rates, avoids blocking behavior during state repartitioning, takes an eventual consistency approach for maintaining its local state, and behaves strongly consistently as a black-box dataflow operator. We prove that the operator ensures a constant competitive ratio of 3.75 in data distribution optimality and that the cost of processing an input tuple is amortized constant, taking into account adaptivity costs. Our evaluation demonstrates that our operator outperforms the state-of-the-art static partitioning schemes in resource utilization, throughput, and execution time.

    Near-Optimal Distributed Band-Joins through Recursive Partitioning

    We consider running-time optimization for band-joins in a distributed system, e.g., the cloud. To balance load across worker machines, input has to be partitioned, which causes duplication. We explore how to resolve this tension between maximum load per worker and input duplication for band-joins between two relations. Previous work suffered from high optimization cost or considered partitionings that were too restricted (resulting in suboptimal join performance). Our main insight is that recursive partitioning of the join-attribute space with the appropriate split scoring measure can achieve both low optimization cost and low join cost. It is the first approach that is effective not only for one-dimensional band-joins but also for joins on multiple attributes. Experiments indicate that our method is able to find partitionings that are within 10% of the lower bound for both maximum load per worker and input duplication for a broad range of settings, significantly improving over previous work.
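    For readers unfamiliar with the term, a one-dimensional band-join pairs values from two relations whose join attributes differ by at most a band width. The sketch below shows that definition on a single machine using sorting and binary search. It does not reproduce the recursive partitioning or split scoring described in the abstract, and the name `band_join` and the parameter `eps` are assumptions made for illustration.

```python
from bisect import bisect_left, bisect_right


def band_join(r, s, eps):
    """Return all pairs (x, y) with x in r, y in s, and |x - y| <= eps."""
    s_sorted = sorted(s)
    out = []
    for x in r:
        # Locate the slice of s whose values fall inside [x - eps, x + eps].
        lo = bisect_left(s_sorted, x - eps)
        hi = bisect_right(s_sorted, x + eps)
        out.extend((x, y) for y in s_sorted[lo:hi])
    return out


# Hypothetical sample data with band width 1.
matches = band_join([1, 10], [0, 2, 5, 11], 1)
print(matches)
```

    In a distributed setting, values near a partition boundary can match values on the other side of it, so they must be sent to more than one worker; this is the input duplication the abstract trades off against load balance.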