93 research outputs found

    Finding Top-k Dominance on Incomplete Big Data Using Map-Reduce Framework

    Full text link
    Incomplete data is a major kind of multi-dimensional dataset in which values are missing at random across its dimensions. Retrieving information from such a dataset becomes very difficult as it grows, and finding its top-k dominant values is particularly challenging. Existing algorithms speed up this process but are mostly efficient only on small incomplete datasets. One algorithm that makes top-k dominance (TKD) queries practical is the Bitmap Index Guided (BIG) algorithm. It strongly improves performance on incomplete data, but it was not designed for, and cannot handle, incomplete big data. Several other algorithms have been proposed to answer TKD queries, such as the Skyband Based and Upper Bound Based algorithms, but their performance is also questionable. These earlier algorithms were among the first attempts to apply TKD queries to incomplete data; however, all of them performed poorly or were not compatible with incomplete data. This thesis proposes the MapReduced Enhanced Bitmap Index Guided Algorithm (MRBIG) to address these issues. MRBIG uses the MapReduce framework to improve the performance of top-k dominance queries on huge incomplete datasets. The framework divides the work among several computing nodes that operate independently and simultaneously to find the result. The method processes TKD queries up to twice as fast as previously presented algorithms.
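The top-k dominance query described in this abstract can be illustrated with a naive quadratic baseline. This is a minimal sketch, not BIG or MRBIG: it assumes the commonly used convention that two incomplete objects are compared only on the dimensions observed in both, that smaller values are better, and that an object's score is the number of objects it dominates. The names `dominates`, `top_k_dominating`, and the sample data are hypothetical.

```python
from __future__ import annotations

from typing import Optional, Sequence

# Assumption: None marks a missing value in a dimension.
Point = Sequence[Optional[float]]


def dominates(a: Point, b: Point) -> bool:
    """True if a dominates b on the dimensions observed in both (smaller is better)."""
    strictly_better = False
    for x, y in zip(a, b):
        if x is None or y is None:
            continue  # skip dimensions missing in either object
        if x > y:
            return False  # a is worse somewhere, so it cannot dominate
        if x < y:
            strictly_better = True
    return strictly_better


def top_k_dominating(points: Sequence[Point], k: int) -> list[tuple[int, int]]:
    """Return (index, score) pairs of the k objects that dominate the most others."""
    scores = [
        sum(dominates(p, q) for j, q in enumerate(points) if j != i)
        for i, p in enumerate(points)
    ]
    # Highest score first; ties broken by the smaller index.
    ranked = sorted(range(len(points)), key=lambda i: (-scores[i], i))
    return [(i, scores[i]) for i in ranked[:k]]


data = [
    (1, None, 3),
    (2, 2, None),
    (None, 1, 4),
    (3, 3, 5),
]
result = top_k_dominating(data, 2)
print(result)
```

Every pair of objects is compared here, which is the cost that index-based and parallel approaches such as those named in the abstract aim to reduce.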

    Distributed Indexing Schemes for k-Dominant Skyline Analytics on Uncertain Edge-IoT Data

    Full text link
    Skyline queries typically search for a Pareto-optimal set in a given data set to solve the corresponding multiobjective optimization problem. As the number of criteria increases, the skyline comes to include an excessive number of data items, which makes the result meaningless. To address this curse of dimensionality, we proposed a k-dominant skyline in which the number of skyline members was reduced by relaxing the restriction on the number of dimensions, considering the uncertainty of data. Specifically, each data item was associated with a probability of appearance, which represented the probability of becoming a member of the k-dominant skyline. As data items appear continuously in data streams, the corresponding k-dominant skyline may vary with time. Therefore, an effective and rapid mechanism for updating the k-dominant skyline becomes crucial. Herein, we proposed two time-efficient schemes, Middle Indexing (MI) and All Indexing (AI), for the k-dominant skyline in distributed edge-computing environments, where irrelevant data items can be effectively excluded from the computation to reduce the processing duration. Furthermore, the proposed schemes were validated with extensive experimental simulations. The experimental results demonstrated that the proposed MI and AI schemes reduced the computation time by approximately 13% and 56%, respectively, compared with the existing method. Comment: 13 pages, 8 figures, 12 tables, to appear in IEEE Transactions on Emerging Topics in Computing

    Efficient subspace skyline query based on user preference using MapReduce

    Get PDF
    Subspace skyline, an important variant of skyline, has been widely applied to multiple-criteria decision making and business planning. With the development of the mobile internet, subspace skyline queries in mobile distributed environments have recently attracted considerable attention. However, efficiently obtaining a meaningful subset of skyline points in any subspace remains a challenging task in the current mobile internet. For a growing number of mobile applications, subspace skyline queries on mobile units are limited by big data and wireless bandwidth. To address this issue, in this paper, we propose a system model that supports subspace skyline queries in a mobile distributed environment. We also present an efficient algorithm for processing the Subspace Skyline Query using MapReduce (SSQ), which obtains a meaningful subset of points from the full set of skyline points in any subspace. The SSQ algorithm divides a subspace skyline query into two processing phases: the preprocess phase and the query phase. The preprocess phase comprises a pruning process and an index-construction process, which are designed to reduce network delay and response time. Additionally, the query phase provides two filtering methods, SQM-filtering and ε-filtering, to filter the skyline points according to user preference and reduce network cost. Extensive experiments on real and synthetic data are conducted, and the experimental results indicate that our algorithm is highly efficient and that the pruning strategy further improves its efficiency.

    CONTINUOUS MULTIQUERIES K-DOMINANT SKYLINE ON ROAD NETWORK

    Get PDF
    The increasing use of mobile devices makes spatial data worthy of consideration. To get the best results, users often look for the best objects in a collection. One algorithm that can be used is the skyline query, which looks for all objects that are not dominated by other objects across all of their attributes. However, when the data has many attributes, the query outputs so many objects that it is less useful to the user. k-dominant skyline queries can be a solution to reduce the output. Among the challenges are the use of skyline queries with spatial data and the many user preferences involved in finding the best object. This study proposes IKSR, a k-dominant skyline query algorithm that works in a road network environment and can process many queries that share the same subspace in a single pass. The algorithm combines queries that operate on the same subspace and set of objects with different k values by computing from the smallest to the largest k. Optimization occurs because some data for larger k are precomputed while calculating the result for the smallest k, so the Voronoi cell computation is not repeated. Testing is done by comparing against the naïve algorithm without precomputation. The IKSR algorithm is two to three times faster than the naïve algorithm.
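The k-dominance relation behind these queries can be sketched in a few lines. This is a naive illustration under stated assumptions, not IKSR: it ignores road-network distances and Voronoi cells, assumes smaller attribute values are better, and uses the usual definition that p k-dominates q when p is no worse than q in at least k dimensions and strictly better in at least one of them. The function names and sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def k_dominates(p: Point, q: Point, k: int) -> bool:
    """True if p is no worse than q in at least k dimensions and strictly better in one."""
    no_worse = sum(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse >= k and strictly_better


def k_dominant_skyline(points: Sequence[Point], k: int) -> List[Point]:
    """Return the points that no other point k-dominates (pairwise check)."""
    return [
        p
        for i, p in enumerate(points)
        if not any(k_dominates(q, p, k) for j, q in enumerate(points) if j != i)
    ]


points = [(1, 5, 5), (5, 1, 5), (5, 5, 1), (2, 2, 2), (6, 6, 6)]

# With k equal to the number of dimensions this is the ordinary skyline.
full_skyline = k_dominant_skyline(points, 3)
# Relaxing to k = 2 shrinks the result.
relaxed_skyline = k_dominant_skyline(points, 2)
print(full_skyline)
print(relaxed_skyline)
```

Because k-dominance for a larger k implies k-dominance for any smaller k, the result for a smaller k is always a subset of the result for a larger k, which is consistent with sharing work across queries that differ only in k.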

    ๋น…๋ฐ์ดํ„ฐ์˜ ํšจ์œจ์ ์ธ ์Šค์นด์ด๋ผ์ธ ์งˆ์˜ ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ณ‘๋ ฌ์ฒ˜๋ฆฌ ์•Œ๊ณ ๋ฆฌ์ฆ˜

    Get PDF
    Ph.D. thesis, Seoul National University Graduate School, Department of Electrical and Computer Engineering, February 2017. Kyuseok Shim.

    The skyline operator and its variants, such as the dynamic skyline, reverse skyline, and probabilistic skyline operators, have attracted considerable attention recently due to their broad applications. However, computing a skyline is challenging today since we have to deal with big data. For data-intensive applications, the MapReduce framework has been widely used recently. In this dissertation, we propose efficient parallel algorithms for processing skyline, dynamic skyline, reverse skyline, and probabilistic skyline queries using MapReduce. For the skyline, dynamic skyline, and reverse skyline queries, we first build quadtree-based histograms to prune out non-skyline points. We next partition the data based on the regions divided by the histograms and compute candidate skyline points for each partition using MapReduce. Finally, in every partition, we check whether each skyline candidate point is actually a skyline point, again using MapReduce. For the probabilistic skyline query, we first introduce three filtering techniques to prune out points that are not probabilistic skyline points. Then, we build a quadtree-based histogram and split the data into partitions according to the regions divided by the quadtree. We finally compute the probabilistic skyline points for each partition using MapReduce. We also develop workload-balancing methods that make the estimated execution times of all available machines similar. We conducted experiments to compare our algorithms with state-of-the-art MapReduce algorithms and confirmed the effectiveness as well as the scalability of our proposed skyline algorithms.

    Contents:
    1 Introduction
      1.1 Motivation
      1.2 Contributions of This Dissertation
      1.3 Dissertation Overview
    2 Related Work
      2.1 Skyline Queries
      2.2 Reverse Skyline Queries
      2.3 Probabilistic Skyline Queries
    3 Background
      3.1 Skyline and Its Variants
      3.2 MapReduce Framework
    4 Parallel Skyline Query Processing
      4.1 SKY-MR: Our Skyline Computation Algorithm
        4.1.1 SKY-QTREE: The Sky-Quadtree Building Algorithm
        4.1.2 L-SKY-MR: The Local Skyline Computation Algorithm
        4.1.3 G-SKY-MR: The Global Skyline Computation Algorithm
      4.2 Experiment
        4.2.1 Performance Results for Skylines
        4.2.2 Performance Results in Other Environments
    5 Parallel Reverse Skyline Query Processing
      5.1 RSKY-MR: Our Reverse Skyline Computation Algorithm
        5.1.1 RSKY-QTREE: The Rsky-Quadtree Building Algorithm
        5.1.2 Computations of Reverse Skylines using Rsky-Quadtrees
        5.1.3 L-RSKY-MR: The Local Reverse Skyline Computation Algorithm
        5.1.4 G-RSKY-MR: The Global Reverse Skyline Computation Algorithm
      5.2 Experiment
        5.2.1 Performance Results for Reverse Skylines
    6 Parallel Probabilistic Skyline Query Processing
      6.1 Early Pruning Techniques
        6.1.1 Upper-bound Filtering
        6.1.2 Zero-probability Filtering
        6.1.3 Dominance-Power Filtering
      6.2 Utilization of a PS-QTREE for Pruning
        6.2.1 Generating a PS-QTREE
        6.2.2 Exploiting a PS-QTREE for Filtering
        6.2.3 Partitioning Objects by a PS-QTREE
      6.3 PS-QPF-MR: Our Algorithm with Quadtree Partitioning and Filtering
        6.3.1 Optimizations of PS-QPF-MR
        6.3.2 Sample Size and Split Threshold of a PSQtree
      6.4 PS-BRF-MR: Our Algorithm with Random Partitioning and Filtering
      6.5 Experiments
        6.5.1 Performance Results for Probabilistic Skylines
    7 Conclusion
    Bibliography
    Abstract (In Korean)
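The local-then-global structure described in this abstract (compute candidate skyline points per partition, then verify the candidates) can be sketched in a single process. This is only an illustration under loud assumptions: it replaces the dissertation's quadtree-based histogram partitioning with simple round-robin partitioning, it does not use MapReduce, and it assumes smaller values are better. The names `skyline`, `two_phase_skyline`, and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """True if p is no worse than q everywhere and strictly better somewhere."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Naive skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def two_phase_skyline(points: Sequence[Point], num_partitions: int) -> List[Point]:
    """Compute local skylines per partition, then the skyline of the merged candidates."""
    # Assumption: round-robin partitioning stands in for histogram-based partitioning.
    partitions = [points[i::num_partitions] for i in range(num_partitions)]
    # Phase 1: each partition could be handled by an independent worker.
    local_skylines = [skyline(part) for part in partitions]
    # Phase 2: only the surviving candidates are compared globally.
    candidates = [p for local in local_skylines for p in local]
    return skyline(candidates)


data = [(1, 9), (2, 7), (3, 8), (4, 4), (5, 6), (6, 2), (7, 3), (8, 1), (9, 5)]
result = two_phase_skyline(data, 3)
print(sorted(result))
```

The two-phase result matches the direct skyline because dominance is transitive: any point removed locally is dominated by a local skyline point that either survives globally or is itself dominated by a global skyline point.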

    Distributed Evaluation of Top-k Temporal Joins

    No full text
    To appear in SIGMOD'16. We study a particular kind of join, coined Ranked Temporal Join (RTJ), featuring predicates that compare time intervals and a scoring function associated with each predicate to quantify how well it is satisfied. RTJ queries are prevalent in a variety of applications such as network traffic monitoring, task scheduling, and tweet analysis. RTJ queries are often best interpreted as top-k queries where only the best matches are returned. We show how to exploit the nature of temporal predicates and the properties of their associated scoring semantics to design TKIJ, an efficient query evaluation approach on a distributed Map-Reduce architecture. TKIJ relies on an offline statistics computation that, given a time partitioning into granules, computes the distribution of intervals' endpoints in each granule, and an online computation that generates query-dependent score bounds. Those statistics are used for workload assignment to reducers, which aims at reducing data replication to limit I/O cost. Additionally, high-scoring results are distributed evenly to enable each reducer to prune unnecessary results. Our extensive experiments on synthetic and real datasets show that TKIJ outperforms state-of-the-art competitors and provides very good performance for n-ary RTJ queries on temporal data.